CSCI 151 - Lab 6 - Spring 2009 Computing Word Frequencies in Web Pages using trees

11:59.59pm, Sunday, April 5


You may work with one partner on this assignment.

This assignment is the first part of a series of related assignments about the World Wide Web. Our ultimate goal is to build a search engine for a limited portion of the web, with some of the features of common search engines such as Yahoo!, AltaVista, or Google.

The first step is to be able to analyze web pages based on their content. This will eventually allow us to rate how relevant a particular page is for a given user request. The content of a page is obviously correlated with the words it contains. For this assignment you will use a binary search tree to help you calculate and store the frequencies of words found in a given web page. Then you will print out the most frequent words found.

Getting Started

Starting point code is in lab6.jar.

Begin by experimenting with the HTMLScanner class and the associated test class TestScanner. HTMLScanner reads tokens one by one from a file, a web page (given its URL), or a string. TestScanner contains a main method designed to test the HTMLScanner.

The constructor for the HTMLScanner class requires two parameters: a String parameter representing a file name, a URL, or itself, and an int parameter indicating the type of the first parameter. It tokenizes its input with its "hasNext()" and "next()" methods that you are familiar with from the Iterator interface. For our purposes a token will be a single non-alphabetic and non-numeric character, such as punctuation, or a string of contiguous alphabetic and numeric characters up until white space is encountered or the end of file.

Test Scanner has one command-line argument, a string representing a URL. Try it out on a few URLs you are familiar with, such as "http://www.cs.oberlin.edu/" and "http://www.google.com/".

Part 1 - Finish Implementing MyTreeMap

First you'll be completing an implementation of TreeMap called MyTreeMap. I've implemented most of the class, but there are a few things you still need to finish. Recall that Map has two type parameters: K, representing keys, and V, representing values.

private V get(K searchKey)
Return the current mapping of the given key. If no mapping exists, return null.
public V put(K key, V value)
Insert a (key, value) mapping into the map. If a mapping for this key already exists, the value passed in should replace the previous value in the map.
The return value of put is the previous value for the key if there was one, or null if there was not.

For testing, write a program that does the following:

  1. Read words from a URL using a HTMLScanner
  2. Insert each word (as the key) along with the length of the word (as the value) into a MyTreeMap
  3. Display the contents of the final tree in alphabetical order using the toString method

Part 2 - Write a word frequency application

Create a class WordFrequencyTree that uses a MyTreeMap as a backing storage and has a main method that can be used from the command line. (You may choose to extend MyTreeMap for this class if you prefer.) Your program will take two or three command line arguments:

% java WordFrequencyTree input_file_name  min_freq_num [ignore_file_name] 

The first is an input file that your program will process using the HTMLScanner class. The optional third argument is an ignore list file containing words that should be ignored in the input file. If there is an ignore list command line argument, then your program should process the ignore list first and create a data structure for storing the list of tokens to ignore. Next, your program should process the input file. For all tokens in the input file that are not on the ignore list, you should add the word as the Key and a counter as the Value into a MyTreeMap.

Only unique words should be added to the tree. When a duplicate word is encountered, the counter stored as a Value should be incremented.

After processing the input file. Your program should print out all words in the WordFrequencyTree in alphabetical order that occur at least min_freq_num times.

Required public methods

public boolean contains(String s)
Return true if the string is in the file, false otherwise. I expect this to be a super-short method.
public int getCount(String s)
Returns the count for the given word. If the word isn't in your tree, you should return 0.
public String toString()
Just return the MyTreeMap toString() value.

In addition it should have at least two constructors. These constructors will process the file(s) and create the initial word frequency tree as specified above.

public WordFrequencyTree(String infile, String min_freq)
public WordFrequencyTree(String infile, String min_freq, String ignorefile)
Process infile and store the frequency counts
Uses min_freq to either prune the finished tree or to filter requests to getCount()

If the file passed on the command line begins with either "http://" or "file:" then it should be processed by the HTMLScanner as an URL instead of as a file. See String.startsWith() for a possibly helpful method.

The following are examples of what your command line should look like, and what your program should do for different calls

# program will keep a count (using Key,Value pairs) for each token
# in the index.html input file that is not on the html_ignore_list
# and will print out in alphabetical order all words that occur 
# at least 3 times in the tree 
#
% java WordFrequencyTree index.html 3 html_ignore_list 


# program will keep a count (using Key,Value pairs) for each token
# in the index.html input file and will print out in alphabetical 
# order all words that occur at least 4 times in the tree
#
% java WordFrequencyTree index.html 4


# program will print a usage message to stdout and exit
# (ex)   usage: WordFrequencyTree  inputfile  min_freq_num 
#           or: WordFrequencyTree  inputfile  min_freq_num  ignorelist 
#
% java WordFrequencyTree index.html

Part of this assignment includes developing a good ignorelist file for html files. As you test your program for different .html input files, develop a list of words that should be ignored from html files, and use this list as the optional third command line argument to your program. You will submit your html_ignore_list file as part of your homework solution.

Programming Notes

Your WordFrequencyTree main program should handle all exceptions, not dump stack traces back to the user.

Integer is immutable, that is, you cannot change it's value once it has been set. I'd recommend making a WordCounter class that has an int for storage and an increment() method that lets you increase its value. Then you can just get() and increment().

You might want to create a method to do the opening of an HTMLScanner. You know, check what the string starts with, does the opening, returns a reference.

I'd suggest converting all words to lower case before inserting them into the various trees.

Here's a suggested way of processing the various files:

  1. Open the ignorefile and load those counts into a MyTreeMap
  2. Open the main URL/file and load the frequency counts into a MyTreeMap, skip over anything in the ignore tree
  3. Prune the first tree by creating another tree, iterating through all the nodes in the tree, and only adding in nodes that have a final word count above your minimum threshold. This is also a great time to do the overall word count too.

Do not keep old HTMLScanner or Scanner objects around. This will cause you much grief in the next assignment. You should not save these values in class variables -- just use them with local variables so that they don't remain open.

A better metric than word count is word frequency. Think about how you could change your WordFrequencyTree to support a "getFrequency(String word)" method.

Part 3 - Keeping the tree balanced

Although we know that the insert and search methods for binary search tree are O(log n), we also know that in the worst case, they are O(n). We can avoid the worst case performance by periodically rebalancing the tree. This requires two steps:

Write a public method called "rebalance" in the MyTreeMap class. It should do the following:

When should the rebalance method be called? Here are four possible strategies to choose from:

Sample run

This is without my ignorefile, so most of it is HTML garbage. (Not yet updated for 2009)

% java WordFrequencyTree http://www.cs.oberlin.edu/~kuperman/csci151/ 100  

Words of frequency 100 or more
--------------------------------
   157 ( 1.2) !
   848 ( 6.4) "
   702 ( 5.3) -
   994 ( 7.6) .
  2286 (17.4) /
   151 ( 1.1) 0
   169 ( 1.3) 1
   205 ( 1.6) 5
   233 ( 1.8) :
   102 ( 0.8) ;
  2022 (15.4) <
   438 ( 3.3) =
  2022 (15.4) >
   423 ( 3.2) a
   149 ( 1.1) api
   129 ( 1.0) b
   117 ( 0.9) com
   155 ( 1.2) docs
   205 ( 1.6) href
   171 ( 1.3) html
   134 ( 1.0) http
   284 ( 2.2) java
   154 ( 1.2) li
   115 ( 0.9) sun
   572 ( 4.4) td
   211 ( 1.6) tr

Use the handin program to submit a directory containing the following:

  1. All .java files necessary for compiling your code (including any of the classes that I gave you that you use in your solution).
  2. An ignorefile, containing tokens that should be ignored from an html input file.
  3. A README file with:

If you work with a partner, please only one of you submit your joint solution using handin.


Last Modified: April 01, 2009 - Benjamin A. KupermanVI Powered