CSCI 151 - Lab 6 - Spring 2008 Computing Word Frequencies in Web Pages using AVL trees

11:59.59pm, Sunday, April 6


You may work with one partner on this assignment.

This assignment is the first part of a series of related assignments about the world wide web. Our ultimate goal is to build a search engine for a limited portion of the web, with some of the features of common search engines such as Yahoo!, Lycos, or Google.

The first step is to be able to analyze web pages based on their content. This will eventually allow us to rate how relevant a particular page is for a given user request. The content of a page is obviously correlated with the words it contains. For this assignment you will use a binary search tree to help you calculate and store the frequencies of words found in a given web page. Then you will print out the most frequent words found.

Getting Started

Starting point code is in lab6.jar.

Begin by experimenting with the HTMLScanner class and the associated test class TestScanner. HTMLScanner reads tokens one by one from a file, a web page (given its URL) or a string. TestScanner contains a main method designed to test the HTMLScanner.

The constructor for the HTMLScanner class requires two parameters: a String parameter representing a file name a URL, or itself, and an int parameter indicating the type of the first parameter. It tokenizes its input with its "hasNext()" and "next()" methods that you are familiar with from the Iterator interface. For our purposes a token will be a single non-alphabetic and non-numeric character, such as punctuation, or a string of contiguous alphabetic and numeric characters up until white space is encountered or the end of file.

Test Scanner has one command-line argument, a string representing a URL. Try it out on a few URLs you are familiar with, such as "http://www.cs.oberlin.edu/" and "http://www.google.com/".

Part 1 - Finish Implementing MyTreeMap

First you'll be completing an implementation of TreeMap called MyTreeMap. I've implemented most of the class, but there are a few things you still need to finish.

private V get(K searchKey)
Return the current mapping of the given key. If no mapping exists, return null.
I've already included the actual public method get(Object key) which takes care of the casting.
public V put(K key, V value)
Insert a (key, value) mapping into the map. If a mapping for this key already exists, the value should replace the new value in the map.
The return value of put is the previous value for the key if there was one, or null if there was not.
Here is a sequence of operations assuming recursive implementation:
  1. Add node to correct location in tree as a leaf (if needed). Remember that in spots where leaves can go test for isEmpty() will be true. To add yourself at that location as a new leaf you should:
    1. Set the value of key
    2. Set the value of value
    3. Fix the size
    4. Set both left and right to be new empty MyTreeMap (i.e., new MyTreeMap<K,V>())
  2. Update the key value if already there
  3. Call restructure(this) if the tree is unbalanced
  4. Call this.setHeight()
  5. Recalculate your size by adding 1 to the sum of the size of your children
private void restructure(MyTreeMap<K, V> node)
This is already mostly implemented, you just need to assign the appropriate values to the variables listed as described. Everything else should already be implemented.

For testing, write a main method that does the following:

  1. Read words from a URL using a HTMLScanner
  2. Insert each word along with the length of the word into a MyTreeMap
  3. Display the contents of the final tree in alphabetical order using the toString method

Part 2 - Write a word frequency application

You will then create a class WordFrequencyTree that uses a MyTreeMap as a backing storage and has a main method that can be used from the command line. Your program will take two or three command line arguments:

 %java  WordFrequencyTree input_file_name  min_freq_num  [ignore_file_name]

The first is an input file that your program will process using the HTMLScanner class. The optional third argument is an ignore list file containing words that should be ignored in the input file. If there is an ignore list command line argument, then your program should process the ignore list first and create a data structure for storing the list of tokens to ignore. Next, your program should process the input file. For all tokens in the input file that are not on the ignore list, you will create a WordFrequency object and add it to a binary search tree sorted alphabetically. you should add the word as the Key and a counter as the Value into a MyTreeMap.

Only unique words should be added to the tree. When a duplicate word is encountered, the frequency count of its WordFrequency object counter stored as a Value should be incremented.

After processing the input file. Your program should print out all words in the WordFrequencyTree in alphabetical order that occur at least min_freq_num times.

Required public methods

public int getWordCount()
Return the total number of words seen, not the count of the different numbers of words.
public int getMinFreq()
Return the minimum frequency for reporting words. This is the value you set via the command line.
public boolean contains(String s)
Return true if the string is in the file, false otherwise. I expect this to be a super-short method.
public int getCount(String s)
Returns the count for the given word. If the word isn't in your tree, you should return 0.
public double getFreq(String s)
Returns the frequency of the given word in the tree. This is just the count over the total word count. Be careful of integer division!
public Iterator<String> words()
Just return the MyTreeMap keys() inorder traversal.
public String toString()
Just return the MyTreeMap toString() value.

In addition it should have at least two constructors: a constructor that takes an input file argument and a constructor that takes an input file and an ignore file argument. These constructors will process the file(s) and create the initial word frequency tree as specified above. You also need a way to specify the minimum frequency, so you probably should just pass that in as an additional String argument.

If the file passed on the command line begins with either "http://" or "file:" then it should be processed by the HTMLScanner as an URL instead of as a file. See String.startsWith() for a possibly helpful method.

The following are examples of what your command line should look like, and what your program should do for different calls

# program will keep a count (using Key,Value pairs) for each token
# in the index.html input file that is not on the html_ignore_list
# and will print out in alphabetical order all words that occur 
# at least 3 times in the tree 
#
% java WordFrequencyTree index.html 3 html_ignore_list 


# program will keep a count (using Key,Value pairs) for each token
# in the index.html input file and will print out in alphabetical 
# order all words that occur at least 4 times in the tree
#
% java WordFrequencyTree index.html 4


# program will print a usage message to stdout and exit
# (ex)   usage: WordFrequencyTree  inputfile  min_freq_num 
#           or: WordFrequencyTree  inputfile  min_freq_num  ignorelist 
#
% java WordFrequencyTree index.html

Part of this assignment includes developing a good ignorelist file for html files. As you test your program for different .html input files, develop a list of words that should be ignored from html files, and use this list as the optional third command line argument to your program. You will submit your html_ignore_list file as part of your homework solution.

Programming Notes

Your WordFrequencyTree main program should handle all exceptions, not dump stack traces back to the user.

Integer is immutable, that is, you cannot change it's value once it has been set. I'd recommend making a WordCounter class that has an int for storage and an increment() method that lets you increase its value. Then you can just get() and increment().

You might want to create a method to do the opening of an HTMLScanner.

I'd suggest converting all words to lower case before inserting them into the various trees.

Here's a suggested way of processing the various files:

  1. Open the ignorefile and load those counts into a MyTreeMap
  2. Open the main URL/file and load the frequency counts into a MyTreeMap, skip over anything in the ignore tree
  3. Prune the first tree by creating another tree, iterating through all the nodes in the tree, and only adding in nodes that have a final word count above your minimum threshold. This is also a great time to do the overall word count too.

Do not keep old HTMLScanner or Scanner objects around. This will cause you much grief in the next assignment.

Sample run

This is without my ignorefile, so most of it is HTML garbage.

% java WordFrequencyTree http://www.cs.oberlin.edu/~kuperman/csci151/ 100  

Words of frequency 100 or more
--------------------------------
   157 ( 1.2) !
   848 ( 6.4) "
   702 ( 5.3) -
   994 ( 7.6) .
  2286 (17.4) /
   151 ( 1.1) 0
   169 ( 1.3) 1
   205 ( 1.6) 5
   233 ( 1.8) :
   102 ( 0.8) ;
  2022 (15.4) <
   438 ( 3.3) =
  2022 (15.4) >
   423 ( 3.2) a
   149 ( 1.1) api
   129 ( 1.0) b
   117 ( 0.9) com
   155 ( 1.2) docs
   205 ( 1.6) href
   171 ( 1.3) html
   134 ( 1.0) http
   284 ( 2.2) java
   154 ( 1.2) li
   115 ( 0.9) sun
   572 ( 4.4) td
   211 ( 1.6) tr

Use the handin program to submit a directory containing the following:

  1. All .java files necessary for compiling your code (including any of the classes that I gave you that you use in your solution).
  2. An ignorefile, containing tokens that should be ignored from an html input file.
  3. A README file with:

If you work with a partner, please only one of you submit your joint solution using handin.


Last Modified: April 01, 2008 - Benjamin A. KupermanVI Powered