11:59.59pm, Sunday, April 6
You may work with one partner on this assignment.
This assignment is the first part of a series of related assignments about the world wide web. Our ultimate goal is to build a search engine for a limited portion of the web, with some of the features of common search engines such as Yahoo!, Lycos, or Google.
The first step is to be able to analyze web pages based on their content. This will eventually allow us to rate how relevant a particular page is for a given user request. The content of a page is obviously correlated with the words it contains. For this assignment you will use a binary search tree to help you calculate and store the frequencies of words found in a given web page. Then you will print out the most frequent words found.
Starting point code is in lab6.jar.
Begin by experimenting with the HTMLScanner class and the associated test class TestScanner. HTMLScanner reads tokens one by one from a file, a web page (given its URL) or a string. TestScanner contains a main method designed to test the HTMLScanner.
The constructor for the HTMLScanner class requires two parameters: a String parameter representing a file name a URL, or itself, and an int parameter indicating the type of the first parameter. It tokenizes its input with its "hasNext()" and "next()" methods that you are familiar with from the Iterator interface. For our purposes a token will be a single non-alphabetic and non-numeric character, such as punctuation, or a string of contiguous alphabetic and numeric characters up until white space is encountered or the end of file.
Test Scanner has one command-line argument, a string representing a URL. Try it out on a few URLs you are familiar with, such as "http://www.cs.oberlin.edu/" and "http://www.google.com/".
First you'll be completing an implementation of TreeMap called MyTreeMap. I've implemented most of the class, but there are a few things you still need to finish.
null.
get(Object key) which takes care of the casting.
(key, value) mapping into the map. If a mapping for this
key already exists, the value should replace the new value in the map.
For testing, write a main method that does the following:
You will then create a class WordFrequencyTree that uses a MyTreeMap as a backing storage and has a main method that can be used from the command line. Your program will take two or three command line arguments:
%java WordFrequencyTree input_file_name min_freq_num [ignore_file_name]
The first is an input file that your program will process using the
HTMLScanner class. The optional third argument is an ignore list file
containing words that should be ignored in the input file. If there is an
ignore list command line argument, then your program should process the ignore
list first and create a data structure for storing the list of tokens to
ignore. Next, your program should process the input file. For all tokens in
the input file that are not on the ignore list,
you will create a WordFrequency object and add it to a binary search
tree sorted alphabetically.
you should add the word as the Key and a counter as the Value into a MyTreeMap.
Only unique words should be added to the tree. When a duplicate word is
encountered, the frequency count of its WordFrequency object
counter stored as a Value should be incremented.
After processing the input file. Your program should print out all words
in the WordFrequencyTree in alphabetical order that occur at
least min_freq_num times.
In addition it should have at least two constructors: a constructor that takes an input file argument and a constructor that takes an input file and an ignore file argument. These constructors will process the file(s) and create the initial word frequency tree as specified above. You also need a way to specify the minimum frequency, so you probably should just pass that in as an additional String argument.
If the file passed on the command line begins with either "http://" or "file:" then it should be processed by the HTMLScanner as an URL instead of as a file. See String.startsWith() for a possibly helpful method.
The following are examples of what your command line should look like, and what your program should do for different calls
# program will keep a count (using Key,Value pairs) for each token # in the index.html input file that is not on the html_ignore_list # and will print out in alphabetical order all words that occur # at least 3 times in the tree # % java WordFrequencyTree index.html 3 html_ignore_list # program will keep a count (using Key,Value pairs) for each token # in the index.html input file and will print out in alphabetical # order all words that occur at least 4 times in the tree # % java WordFrequencyTree index.html 4 # program will print a usage message to stdout and exit # (ex) usage: WordFrequencyTree inputfile min_freq_num # or: WordFrequencyTree inputfile min_freq_num ignorelist # % java WordFrequencyTree index.html
Part of this assignment includes developing a good ignorelist file for html files. As you test your program for different .html input files, develop a list of words that should be ignored from html files, and use this list as the optional third command line argument to your program. You will submit your html_ignore_list file as part of your homework solution.
Your WordFrequencyTree main program should handle all exceptions, not dump stack traces back to the user.
Integer is immutable, that is, you cannot change it's value once it has
been set. I'd recommend making a WordCounter class that has an int for
storage and an increment() method that lets you increase its
value. Then you can just get() and increment().
You might want to create a method to do the opening of an HTMLScanner.
I'd suggest converting all words to lower case before inserting them into the various trees.
Here's a suggested way of processing the various files:
Do not keep old HTMLScanner or Scanner objects around. This will cause you much grief in the next assignment.
This is without my ignorefile, so most of it is HTML garbage.
% java WordFrequencyTree http://www.cs.oberlin.edu/~kuperman/csci151/ 100
Words of frequency 100 or more
--------------------------------
157 ( 1.2) !
848 ( 6.4) "
702 ( 5.3) -
994 ( 7.6) .
2286 (17.4) /
151 ( 1.1) 0
169 ( 1.3) 1
205 ( 1.6) 5
233 ( 1.8) :
102 ( 0.8) ;
2022 (15.4) <
438 ( 3.3) =
2022 (15.4) >
423 ( 3.2) a
149 ( 1.1) api
129 ( 1.0) b
117 ( 0.9) com
155 ( 1.2) docs
205 ( 1.6) href
171 ( 1.3) html
134 ( 1.0) http
284 ( 2.2) java
154 ( 1.2) li
115 ( 0.9) sun
572 ( 4.4) td
211 ( 1.6) tr
Use the handin program to submit a directory containing the following:
If you work with a partner, please only one of you submit your joint solution using handin.