Lab 9 - Tries

Due: Sunday, April 29, 2007 by midnight

In today's lab, you'll be implementing a portion of the Set interface using the Trie data structure. You can refer to your class notes or chapter 12.3 in your textbook for more discussion on them.

The Set interface is a collection that contains no duplicate elements. That is, for 2 nodes e1 and e2, you will never have e1.equals(e2) return true. (Similarly, e1.compareTo(e2) will never be 0.)

Creating the Trie class

Instance variables

Define the data members of the Trie as follows:

    boolean isWord;
     Trie[] children;

Optionally, you may want to store information on the size and height of this (sub-)trie in each node. This will require additional bookkeeping during add() operations, but can simplify other methods.

    int height;
    int size;

Constructor

You should create a no-argument constructor for a Trie which creates and initializes an empty Trie. The result is a single Trie node with the isWord flag set to false and the array of children all set to null.

The size of the children array is determined by the number of characters in the alphabet -- in this case 26. You should create a static final variable to hold this constant value, eliminating the need to hard-code "26" elsewhere in the program.

Methods

Create the following public methods in Trie.java:

boolean contains(String string);
Return true if the Trie contains the given string, false otherwise.
boolean add(String string);
Insert string into the trie, but only if it is not already present.
Return true if and only if the trie is modified by this operation. (i.e., adding a duplicate returns false)
int size();
Return the count of the number of strings contained in this Trie.
boolean isEmpty();
Return true if the Trie contains no strings, false otherwise.
int height();
Return the height of the Trie. This is equivalent to asking the length of the longest string contained within it.
Object[] toArray();
Create and return an array of Strings of all the Strings contained within this Trie in lexicographic order (i.e., alphabetic order).
Any of the Collection classes supplies a toArray() method like this. A simple way to implement this would be to construct a LinkedList or an ArrayList, add all the strings to it, and then use their toArray() method.
Trie subtrie(String prefix);
Return the subtrie of this Trie containing all words that begin with the given prefix. If there are no words in the Trie that begin with the prefix, you should return null.
The returned Trie will contain the remainders of the strings beginning with the prefix, but not the prefix itself.

Many of these methods will be implemented using recursive tree traversal techniques.

Notes on Strings and characters

  1. You need to be able to decompose a String into its individual characters. You can use String's charAt(int) method for this.
  2. Another useful method will be the substring(int) method allowing you to extract the tail of a String.
  3. When traversing the Trie, it will be useful for you to be able to transform a char into an int that goes from 0 to 25 (the indices of children). So a => 0, b => 0, ..., z => 0. Fortunately, this can be done straightforwardly using character mathematics.

    Assuming you transform all of your strings to lower case, you can simply subtract out the value of 'a' from the character under consideration to get a number you can use as the index.

  4. Consider what will happen if a user inputs a string that contains non-alphabetic characters (e.g., spaces, numbers, punctuation). What will happen when you try to do an add()? Most likely, an ArrayIndexOutOfBoundsException. Your program needs to handle this exception either in the Trie class itself or (preferred) in the application programs using your class.

Testing the Trie class

To test your Trie class, write the following application programs. I've got two test data files for you to use: small.txt and lexicon.txt.

TrieTest1.java

This program takes a series of words specified as command-line arguments. For each, it inserts the word into a Trie and then outputs:

  1. The size of the Trie
  2. The height of the Trie
  3. A list of all words in the Trie, in alphabetic order

TrieTest2.java

This program takes a file name as its first command-line argument. The file contains one word per line, just like the above examples. You should then just use a Scanner to read in the data into a Trie.

Then the program takes the remaining command-line arguments and searches for them in the Trie one at a time. It displays the words that are found one per line.

Don't forget that some methods will require you to handle checked exceptions.

TrieTest3.java

This program takes 2 command-line arguments: a file name and a prefix string. It starts by loading all the words from the file into a Trie as in the previous program. It then searches the Trie for all words which start with the given prefix string. It should then display:

TrieTest4.java (optional)

It might be interesting to see how many nodes are used to construct a given Trie. Create a program that examines this by:

  1. Creating an additional method in the Trie class called nodeCount() that returns an int of the number of nodes used.
  2. Have your main() for this program read a list of words froma a file specified on the command line and stores them in the Trie.
  3. It then prints out the number of words in the file and the number of nodes in the Trie.

Test it out on the lexicon.txt file.