11:59.59pm, Sunday, April 5
You may work with one partner on this assignment.
This assignment is the first part of a series of related assignments about the World Wide Web. Our ultimate goal is to build a search engine for a limited portion of the web, with some of the features of common search engines such as Yahoo!, AltaVista, or Google.
The first step is to be able to analyze web pages based on their content. This will eventually allow us to rate how relevant a particular page is for a given user request. The content of a page is obviously correlated with the words it contains. For this assignment you will use a binary search tree to help you calculate and store the frequencies of words found in a given web page. Then you will print out the most frequent words found.
Starting point code is in lab6.jar.
Begin by experimenting with the HTMLScanner class and the associated test class TestScanner. HTMLScanner reads tokens one by one from a file, a web page (given its URL), or a string. TestScanner contains a main method designed to test the HTMLScanner.
The constructor for the HTMLScanner class requires two parameters: a String parameter representing a file name, a URL, or itself, and an int parameter indicating the type of the first parameter. It tokenizes its input with its "hasNext()" and "next()" methods that you are familiar with from the Iterator interface. For our purposes a token will be a single non-alphabetic and non-numeric character, such as punctuation, or a string of contiguous alphabetic and numeric characters up until white space is encountered or the end of file.
Test Scanner has one command-line argument, a string representing a URL. Try it out on a few URLs you are familiar with, such as "http://www.cs.oberlin.edu/" and "http://www.google.com/".
First you'll be completing an implementation of TreeMap called MyTreeMap. I've implemented most of the class, but there are a few things you still need to finish. Recall that Map has two type parameters: K, representing keys, and V, representing values.
null.
(key, value) mapping into the map. If a mapping
for this key already exists, the value passed in should replace the
previous value in the map.
For testing, write a program that does the following:
Create a class WordFrequencyTree that uses a MyTreeMap as a backing storage and has a main method that can be used from the command line. (You may choose to extend MyTreeMap for this class if you prefer.) Your program will take two or three command line arguments:
% java WordFrequencyTree input_file_name min_freq_num [ignore_file_name]
The first is an input file that your program will process using the HTMLScanner class. The optional third argument is an ignore list file containing words that should be ignored in the input file. If there is an ignore list command line argument, then your program should process the ignore list first and create a data structure for storing the list of tokens to ignore. Next, your program should process the input file. For all tokens in the input file that are not on the ignore list, you should add the word as the Key and a counter as the Value into a MyTreeMap.
Only unique words should be added to the tree. When a duplicate word is encountered, the counter stored as a Value should be incremented.
After processing the input file. Your program should print out all words
in the WordFrequencyTree in alphabetical order that occur at
least min_freq_num times.
In addition it should have at least two constructors. These constructors will process the file(s) and create the initial word frequency tree as specified above.
If the file passed on the command line begins with either "http://" or "file:" then it should be processed by the HTMLScanner as an URL instead of as a file. See String.startsWith() for a possibly helpful method.
The following are examples of what your command line should look like, and what your program should do for different calls
# program will keep a count (using Key,Value pairs) for each token # in the index.html input file that is not on the html_ignore_list # and will print out in alphabetical order all words that occur # at least 3 times in the tree # % java WordFrequencyTree index.html 3 html_ignore_list # program will keep a count (using Key,Value pairs) for each token # in the index.html input file and will print out in alphabetical # order all words that occur at least 4 times in the tree # % java WordFrequencyTree index.html 4 # program will print a usage message to stdout and exit # (ex) usage: WordFrequencyTree inputfile min_freq_num # or: WordFrequencyTree inputfile min_freq_num ignorelist # % java WordFrequencyTree index.html
Part of this assignment includes developing a good ignorelist file for html files. As you test your program for different .html input files, develop a list of words that should be ignored from html files, and use this list as the optional third command line argument to your program. You will submit your html_ignore_list file as part of your homework solution.
Your WordFrequencyTree main program should handle all exceptions, not dump stack traces back to the user.
Integer is immutable, that is, you cannot change it's value once it has
been set. I'd recommend making a WordCounter class that has an int for
storage and an increment() method that lets you increase its
value. Then you can just get() and increment().
You might want to create a method to do the opening of an HTMLScanner. You know, check what the string starts with, does the opening, returns a reference.
I'd suggest converting all words to lower case before inserting them into the various trees.
Here's a suggested way of processing the various files:
Do not keep old HTMLScanner or Scanner objects around. This will cause you much grief in the next assignment. You should not save these values in class variables -- just use them with local variables so that they don't remain open.
A better metric than word count is word frequency. Think about how you could change your WordFrequencyTree to support a "getFrequency(String word)" method.
Although we know that the insert and search methods for binary search tree are O(log n), we also know that in the worst case, they are O(n). We can avoid the worst case performance by periodically rebalancing the tree. This requires two steps:
Write a public method called "rebalance" in the MyTreeMap class. It should do the following:
When should the rebalance method be called? Here are four possible strategies to choose from:
This is without my ignorefile, so most of it is HTML garbage. (Not yet updated for 2009)
% java WordFrequencyTree http://www.cs.oberlin.edu/~kuperman/csci151/ 100
Words of frequency 100 or more
--------------------------------
157 ( 1.2) !
848 ( 6.4) "
702 ( 5.3) -
994 ( 7.6) .
2286 (17.4) /
151 ( 1.1) 0
169 ( 1.3) 1
205 ( 1.6) 5
233 ( 1.8) :
102 ( 0.8) ;
2022 (15.4) <
438 ( 3.3) =
2022 (15.4) >
423 ( 3.2) a
149 ( 1.1) api
129 ( 1.0) b
117 ( 0.9) com
155 ( 1.2) docs
205 ( 1.6) href
171 ( 1.3) html
134 ( 1.0) http
284 ( 2.2) java
154 ( 1.2) li
115 ( 0.9) sun
572 ( 4.4) td
211 ( 1.6) tr
Use the handin program to submit a directory containing the following:
If you work with a partner, please only one of you submit your joint solution using handin.