Homework Assignment 3
CS 333 Natural Language Processing
Fall, 2011
Due: October 16
Write three versions of a part-of-speech tagger:
- A simple tagger that chooses the most likely tag for
each word of its input.
- An HMM tagger that chooses the best tag using a greedy,
left-to-right approach.
- An HMM tagger using the Viterbi algorithm.
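As a point of reference for the third version, here is a minimal sketch of Viterbi decoding over a bigram HMM. It is not the required design: the function name viterbi, the dictionary-based model layout, and the probability floor used in place of real smoothing are all assumptions made for illustration, and the toy probabilities are invented rather than estimated from the treebank files. Your own version should implement the Tagger interface and use counts from the training data.

```python
import math

def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-12):
    """Return the most probable tag sequence for `words` under a bigram HMM.

    Assumed (hypothetical) model layout:
      start_p[t]      probability that a sentence starts with tag t
      trans_p[p][t]   probability of tag t following tag p
      emit_p[t][w]    probability of tag t emitting word w
    Missing entries are floored to avoid log(0); this is a crude stand-in
    for proper smoothing.
    """
    if not words:
        return []

    def lp(p):
        return math.log(max(p, floor))

    # best[i][t]: log-probability of the best path ending in tag t at position i
    best = [{t: lp(start_p.get(t, 0)) + lp(emit_p[t].get(words[0], 0))
             for t in tags}]
    back = [{}]  # back[i][t]: previous tag on that best path

    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev, score = max(
                ((p, best[i - 1][p] + lp(trans_p[p].get(t, 0))) for p in tags),
                key=lambda pair: pair[1],
            )
            best[i][t] = score + lp(emit_p[t].get(words[i], 0))
            back[i][t] = prev

    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    path.reverse()
    return path

# Toy model with made-up probabilities, for illustration only.
TAGS = ["DT", "NN", "VB"]
START = {"DT": 0.8, "NN": 0.1, "VB": 0.1}
TRANS = {
    "DT": {"DT": 0.05, "NN": 0.9, "VB": 0.05},
    "NN": {"DT": 0.1, "NN": 0.2, "VB": 0.7},
    "VB": {"DT": 0.6, "NN": 0.3, "VB": 0.1},
}
EMIT = {
    "DT": {"the": 1.0},
    "NN": {"dog": 0.5, "can": 0.4, "barks": 0.1},
    "VB": {"barks": 0.6, "can": 0.4},
}

toy_path = viterbi(["the", "dog", "barks"], TAGS, START, TRANS, EMIT)
```

The greedy tagger differs only in that it commits to the single best tag at each position instead of keeping a score for every tag and tracing back at the end.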
Your taggers should implement the Tagger interface provided in hw3.tar.gz. The tar file also contains
three pretagged files for training and testing:
treebank.tagged, treebank.tagged.large, and treebank.test.
Train each of your taggers with the treebank.tagged file. Then
test each tagger with the treebank.test file. (Note that this
file is fully tagged, so you will need to strip off the tags before
using it as input to your tagger. It has already been
tokenized and split into sentences, so it is not necessary to use
your tokenizer or sentence splitter.)
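Stripping the tags might look something like the sketch below. It assumes that each line of the file is one sentence and that each token is written as word/TAG; check the actual files in hw3.tar.gz before relying on this, since the format is not specified here. The function name split_tagged_line is hypothetical.

```python
def split_tagged_line(line):
    """Split one tagged sentence into parallel lists of words and tags.

    Assumes whitespace-separated tokens of the form word/TAG. The split is
    made at the LAST slash so that words containing a slash (e.g. 1/2/CD)
    keep their full text. Tokens without a slash get a tag of None.
    """
    words, tags = [], []
    for token in line.split():
        word, sep, tag = token.rpartition("/")
        if not sep:  # no slash found: treat the whole token as an untagged word
            word, tag = token, None
        words.append(word)
        tags.append(tag)
    return words, tags

example_words, example_tags = split_tagged_line("The/DT dog/NN barks/VBZ ./.")
```

Keeping the stripped tags alongside the words gives you the gold-standard tags needed for the comparison in the next step.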
- Compare the results of your taggers with the original tagging
of the test file. What is the accuracy of each of your
taggers on the test data? (Give a percentage of correct
tags.) How do they compare with each other?
- Use a confusion matrix to identify the five most frequent
types of mistagging.
- Run your taggers on the larger training file
treebank.tagged.large. Run them on the same test data file
as before. What is the accuracy now?
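One way to compute the accuracy and the confusion counts is sketched below. The function name evaluate and the flat-list input format are assumptions, and the tags in the example are made up to show the output shape, not real results.

```python
from collections import Counter

def evaluate(gold, predicted):
    """Compare two equal-length tag lists.

    Returns (accuracy as a percentage, Counter of (gold_tag, predicted_tag)
    pairs for the mistagged tokens only).
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted tag lists differ in length")
    correct = sum(g == p for g, p in zip(gold, predicted))
    confusion = Counter((g, p) for g, p in zip(gold, predicted) if g != p)
    accuracy = 100.0 * correct / len(gold) if gold else 0.0
    return accuracy, confusion

# Made-up tags, for illustration only.
gold_tags = ["DT", "NN", "VBZ", "NN", "NN"]
pred_tags = ["DT", "NN", "NN", "VB", "VB"]
accuracy, confusion = evaluate(gold_tags, pred_tags)

# The five most frequent (gold, predicted) confusions, most common first.
top_errors = confusion.most_common(5)
```

Each entry of top_errors pairs a (gold tag, predicted tag) combination with the number of times it occurred, which is the off-diagonal part of the confusion matrix.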
Finally, write a program that will take a file of raw data and run
it through a simple NLP pipeline: tokenize, sentence split,
and tag. Apply it to a sample from one of the Gutenberg
texts. Give a qualitative assessment of how well it seems to
do.
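The overall shape of such a pipeline can be sketched as follows. The regular-expression tokenizer and the punctuation-based sentence splitter are deliberately crude stand-ins for the ones you have already written, and tag_sentence is a hypothetical callable that takes a list of words and returns a list of tags; substitute your own components.

```python
import re

def tokenize(text):
    """Split raw text into word tokens and single punctuation tokens."""
    # A word may contain internal apostrophes or hyphens (e.g. It's, well-known).
    return re.findall(r"\w+(?:['-]\w+)*|[^\w\s]", text)

def split_sentences(tokens):
    """Group tokens into sentences, ending a sentence at '.', '!' or '?'."""
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in {".", "!", "?"}:
            sentences.append(current)
            current = []
    if current:  # trailing tokens without final punctuation
        sentences.append(current)
    return sentences

def run_pipeline(text, tag_sentence):
    """Tokenize, sentence split, and tag; returns a list of (word, tag) lists."""
    return [list(zip(sentence, tag_sentence(sentence)))
            for sentence in split_sentences(tokenize(text))]

# Placeholder tagger that labels every token "X"; replace with a real tagger.
sample = run_pipeline("Call me Ishmael. It's cold!",
                      lambda sentence: ["X"] * len(sentence))
```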