Homework Assignment 4
CS 333 Natural Language Processing
Fall, 2011
Due:  November 22


You may work with a partner on this assignment.

The assignment is to implement a probabilistic CKY parser, as outlined in figure 14.3 in Jurafsky and Martin.  Note that this version of the algorithm uses a three-dimensional array of probabilities to represent the parsing table.  I am providing a PCFG in Chomsky normal form (almost) in three files:  lexicon.dat (lexical rules in the form preterminal --> word), unary.dat (unary rules in the form nonterminal --> nonterminal), and binary.dat (binary rules in the form nonterminal --> nonterminal nonterminal).  These files are contained in hw4.tar.gz.

The parser should contain two parse methods:  one whose input is a list of words (i.e., Strings) and one whose input is a list of TaggedWords.  In both cases, the parse method should return the most probable parse tree for the given input.

The second version of the parser assumes that an input sentence has already been tagged by a part-of-speech tagger.  In this case, the lexical rules don't need to be used at all.

Each node of the parse tree should contain a symbol (i.e., a nonterminal or word), the probability of that tree, and a set of children.  I recommend writing a Tree class with a toString method that can be used to display the Tree in Scheme notation.

Note that the grammar I've provided is not fully in Chomsky normal form, in that it contains rules in the form nonterminal --> nonterminal.  This makes it easier to reconstruct the original (non-Chomsky) parse tree, but it requires a modification of the parsing algorithm.  After filling in a cell in the parsing table using either the lexical rules or binary rules, it is necessary to apply unary rules to the nonterminals in the cell, as long as they result in higher probabilities for some of the nonterminals.  Using the notation of figure 14.3,

for each unary rule A --> B in the grammar {
    if(table[i,j,A] < table[i,j,B] * P(A --> B)){
        table[i,j,A] = table[i,j,B] * P(A --> B);
        back[i,j,A] = B;
    }
}

You'll need to look up the lexical rules according to their right-hand sides, so I recommend creating a hash maps mapping Strings to the lexical Rules.  The jar file for the assignment contains class definitions for PCFGRule, LexRule, UnaryRule, and BinaryRule.  You can start by loading the grammar files, creating rule objects, and storing them in hash maps or lists.

When a tree is displayed, it should be displayed without the artificial nodes introduced by the conversion to Chomsky normal form.  This can be handled with a method on the Tree class, or within the toString method of the Tree class.

I'm supplying my version of the Tree class in this jar file:  tree.jar