Homework Assignment 3
CS 333 Natural Language Processing
Fall, 2011
Due:  October 16


Write three versions of a part-of-speech tagger:
  1. A simple tagger that chooses the most likely tag for each word of its input.
  2. An HMM tagger that chooses the best tag using a greedy, left-to-right approach.
  3. An HMM tagger using the Viterbi algorithm.
Your taggers should implement the Tagger interface provided in hw3.tar.gz.  The tar file also contains three pretagged files for training and testing:  treebank.tagged, treebank.tagged.large, and treebank.test.
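
As a rough illustration of the first tagger, here is a minimal sketch in Python. It does not use the Tagger interface from hw3.tar.gz (which is not reproduced here); the function names, the toy training data, and the fallback to the overall most frequent tag for unseen words are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def train_most_frequent(tagged_sentences):
    """Count tags per word and return (best tag per word, fallback tag)."""
    word_tag_counts = defaultdict(Counter)
    tag_counts = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            word_tag_counts[word][tag] += 1
            tag_counts[tag] += 1
    # Most frequent tag for every word seen in training.
    best_tag = {w: c.most_common(1)[0][0] for w, c in word_tag_counts.items()}
    # Assumed fallback for unseen words: the most frequent tag overall.
    default_tag = tag_counts.most_common(1)[0][0]
    return best_tag, default_tag

def tag_most_frequent(words, best_tag, default_tag):
    """Tag each word independently of its neighbors."""
    return [(w, best_tag.get(w, default_tag)) for w in words]

# Toy training data, invented for this example.
training = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("duck", "NN"), ("can", "MD"), ("swim", "VB")],
    [("a", "DT"), ("can", "NN"), ("of", "IN"), ("soup", "NN")],
    [("you", "PRP"), ("can", "MD"), ("go", "VB")],
]
best_tag, default_tag = train_most_frequent(training)
result = tag_most_frequent(["the", "cat", "can", "swim"], best_tag, default_tag)
print(result)
# [('the', 'DT'), ('cat', 'NN'), ('can', 'MD'), ('swim', 'VB')]
```

Because this tagger ignores context, "can" always receives its most frequent training tag. The two HMM taggers are meant to improve on exactly this kind of case.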

Train each of your taggers with the treebank.tagged file.  Then test each tagger with the treebank.test file.  (Note that this file is fully tagged, so you will need to strip off the tags before using it as input to your tagger.  It has already been tokenized and split into sentences, so it is not necessary to use your tokenizer or sentence splitter.)
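
Stripping the tags might look something like the following sketch. It assumes each line holds one sentence of whitespace-separated word/TAG tokens; that format is an assumption, so check the actual files before relying on it. The function names are hypothetical.

```python
def parse_tagged_line(line, sep="/"):
    """Split one line of word/TAG tokens into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # Split on the LAST separator so words containing "/" stay intact.
        word, tag = token.rsplit(sep, 1)
        pairs.append((word, tag))
    return pairs

def strip_tags(tagged_sentence):
    """Keep only the words, which become the tagger's input."""
    return [word for word, _ in tagged_sentence]

# Example line in the assumed format, invented for this example.
line = "The/DT price/NN rose/VBD 1/2/CD point/NN ./."
gold = parse_tagged_line(line)
words = strip_tags(gold)
print(words)
# ['The', 'price', 'rose', '1/2', 'point', '.']
```

Keeping the parsed (word, tag) pairs around is convenient, since the same gold tags are needed later for scoring.
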
  1. Compare the results of your taggers with the original tagging of the test file.  What is the accuracy of each of your taggers on the test data?  (Give a percentage of correct tags.)  How do they compare with each other?

  2. Use a confusion matrix to identify the five most frequent types of mistagging.

  3. Train your taggers on the larger training file treebank.tagged.large.  Then test them on the same test data file as before.  What is the accuracy now?
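
One possible way to compute the accuracy and the confusion counts asked for above is sketched below. It assumes the gold and predicted tags have been flattened into two parallel lists; the function names and the sample tags are invented for the example.

```python
from collections import Counter

def evaluate(gold_tags, predicted_tags):
    """Return (accuracy, confusion) for two equal-length tag sequences."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("tag sequences must have the same length")
    # confusion[(gold, predicted)] = number of tokens with that pairing.
    confusion = Counter(zip(gold_tags, predicted_tags))
    correct = sum(n for (g, p), n in confusion.items() if g == p)
    accuracy = correct / len(gold_tags) if gold_tags else 0.0
    return accuracy, confusion

def top_errors(confusion, k=5):
    """Return the k most frequent (gold, predicted) pairs that disagree."""
    errors = Counter({pair: n for pair, n in confusion.items()
                      if pair[0] != pair[1]})
    return errors.most_common(k)

# Invented sample tags, not real tagger output.
gold = ["DT", "NN", "VBD", "IN", "DT", "NN", "VBD", "JJ"]
pred = ["DT", "NN", "VBN", "IN", "DT", "JJ", "VBN", "JJ"]
accuracy, confusion = evaluate(gold, pred)
print(f"accuracy: {accuracy:.1%}")
# accuracy: 62.5%
print(top_errors(confusion))
# [(('VBD', 'VBN'), 2), (('NN', 'JJ'), 1)]
```

Storing the matrix as counts keyed by (gold, predicted) pairs keeps it sparse, which helps when the tag set is large and most pairs never occur.
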
Finally, write a program that will take a file of raw data and run it through a simple NLP pipeline:  tokenize, sentence split, and tag.  Apply it to a sample from one of the Gutenberg texts.  Give a qualitative assessment of how well it seems to do.
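
The overall shape of such a pipeline could be sketched as follows. The regex tokenizer, the punctuation-based sentence splitter, and the placeholder tagger are all simplified stand-ins invented for this example; in your program they would be replaced by your own tokenizer, sentence splitter, and trained tagger.

```python
import re

# Words (optionally with an internal apostrophe) or single punctuation marks.
TOKEN_RE = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

def tokenize(text):
    """Very rough tokenizer: split raw text into words and punctuation."""
    return TOKEN_RE.findall(text)

def split_sentences(tokens):
    """Very rough splitter: end a sentence after '.', '!' or '?'."""
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in {".", "!", "?"}:
            sentences.append(current)
            current = []
    if current:  # keep trailing tokens that lack final punctuation
        sentences.append(current)
    return sentences

def run_pipeline(text, tag_sentence):
    """Tokenize, sentence split, then tag each sentence."""
    return [tag_sentence(s) for s in split_sentences(tokenize(text))]

def placeholder_tagger(words):
    """Stand-in tagger that labels everything NN; swap in a real one."""
    return [(w, "NN") for w in words]

text = "Call me Ishmael. Some years ago, I went to sea."
sentences = split_sentences(tokenize(text))
tagged = run_pipeline(text, placeholder_tagger)
for sentence in tagged:
    print(sentence)
```

Passing the tagger in as a function makes it easy to run the same raw text through all three taggers and compare their output side by side.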