Homework Assignment 4
CS 333 Natural Language Processing
Fall, 2011
Due: November 22
You may work with a partner on this assignment.
The assignment is to implement a probabilistic CKY parser, as
outlined in figure 14.3 in Jurafsky and Martin. Note that this
version of the algorithm uses a three-dimensional array of
probabilities to represent the parsing table. I am providing a
PCFG in Chomsky normal form (almost) in three files:
lexicon.dat (lexical rules in the form preterminal --> word), unary.dat (unary rules
in the form nonterminal -->
nonterminal), and binary.dat (binary rules in the form nonterminal --> nonterminal
nonterminal). These files are contained in hw4.tar.gz.
The parser should contain two parse methods: one whose input
is a list of words (i.e., Strings) and one whose input is a list of
TaggedWords. In both cases, the parse method should return the
most probable parse tree for the given input.
The second version of the parser assumes that an input sentence has
already been tagged by a part-of-speech tagger. In this case,
the lexical rules don't need to be used at all.
Each node of the parse tree should contain a symbol (i.e., a
nonterminal or word), the probability of that tree, and a set of
children. I recommend writing a Tree class with a toString
method that can be used to display the Tree in Scheme notation.
Note that the grammar I've provided is not fully in Chomsky normal
form, in that it contains rules in the form nonterminal --> nonterminal.
This makes it easier to reconstruct the original (non-Chomsky) parse
tree, but it requires a modification of the parsing algorithm.
After filling in a cell in the parsing table using either the
lexical rules or binary rules, it is necessary to apply unary rules
to the nonterminals in the cell, as long as they result in higher
probabilities for some of the nonterminals. Using the notation
of figure 14.3,
for each unary rule A --> B in the grammar {
if(table[i,j,A] < table[i,j,B] * P(A -->
B)){
table[i,j,A] = table[i,j,B] *
P(A --> B);
back[i,j,A] = B;
}
}
You'll need to look up the lexical rules according to their
right-hand sides, so I recommend creating a hash maps mapping
Strings to the lexical Rules. The jar file for the assignment
contains class definitions for PCFGRule, LexRule, UnaryRule, and
BinaryRule. You can start by loading the grammar files,
creating rule objects, and storing them in hash maps or lists.
When a tree is displayed, it should be displayed without the
artificial nodes introduced by the conversion to Chomsky normal
form. This can be handled with a method on the Tree class, or
within the toString method of the Tree class.
I'm supplying my version of the Tree class in this jar file: tree.jar