Due by 6 pm, Sunday, April 2
This assignment is the first part of a series of related assignments about the World Wide Web. Our ultimate goal is to build a search engine for a limited portion of the web, with some of the features of common search engines such as Yahoo!, Bing, or Google.
In this first step, you will use an AVL tree to create an index of all the words contained on a webpage. You will then be able to query the index to find out how frequently a word appears on a page and in what locations. Presumably, there is some correlation between the words on a page and that page's content, and a page that contains a query word frequently is probably a better match for that query than pages that don't.
The goals of this lab are for you to:
As usual, you may work with a partner on this assignment.
You probably use a search engine at least 10 times a day to dig through the vast amount of information on the web. Some folks don't even know how to get to their favorite web sites without using a search engine! So what motivates today's search companies to spend so much time, effort, and money to give you good results to your search queries? Some of the first search engines were written by people doing nifty things and sharing them out of the goodness of their hearts (and some still are), but mostly it is done for profit. Along with your search results, the search engine also displays a bunch of advertisements related to your query. For every ad you click, the company gets some money. Sometimes, it gets paid just to display a URL at the top of the search results! If you get high-quality, relevant search results, you are more likely to keep using a particular search engine, which increases its chances of profiting from you.
How do the search engines produce their list of relevant URLs? Well, they don't share all the specifics, but we know a number of basic ideas that most of them use. In this lab, we'll be trying out 2 techniques that are used by most search engines by building an index of the words on a page. For each word we encounter on a web page, we'll keep track of what order we encountered it in (0th, 1st, 2nd, 3rd, etc.) and keep a list of all the locations for each word.
The first query technique is related to the frequency of a query word on a given page. If you've got a page that has the word "monkey" occurring repeatedly, then it is quite likely to be about monkeys. And a page where 10% of the words are "monkey" is probably more relevant to a query on "monkey" than a page where only 0.5% of the words are.
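As a rough illustration of the frequency idea, the sketch below counts how often a query word occurs in a list of words and divides by the total. The class name FrequencyDemo and the method frequency are hypothetical names chosen for this example; they are not part of the lab's starter code.

```java
import java.util.Arrays;
import java.util.List;

class FrequencyDemo {
    // Fraction of the words on a page that are equal to the query word.
    static double frequency(List<String> words, String query) {
        if (words.isEmpty()) {
            return 0.0; // avoid dividing by zero on an empty page
        }
        int count = 0;
        for (String w : words) {
            if (w.equals(query)) {
                count++;
            }
        }
        return (double) count / words.size();
    }

    public static void main(String[] args) {
        List<String> page = Arrays.asList("the", "monkey", "saw", "a", "monkey");
        System.out.println(frequency(page, "monkey")); // 2 of the 5 words match
    }
}
```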
The second technique will be to use the set of indexes to find phrases on a page. If someone is looking for "robot ninjas", then you would need to go through all of the locations where "robot" appears and check whether "ninjas" appears at the location one greater. Some search engines also support the ability to search for words NEAR each other, and you could do this too using the index, but it isn't required.
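For a two-word phrase, the check described above might look like the following sketch. The class PhraseDemo, the method adjacent, and the sample positions are all made up for illustration, and a full phrase search would have to extend the same idea to more than two words.

```java
import java.util.Arrays;
import java.util.List;

class PhraseDemo {
    // True if some position p of the first word has p + 1
    // among the positions of the second word.
    static boolean adjacent(List<Integer> first, List<Integer> second) {
        for (int p : first) {
            if (second.contains(p + 1)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical positions for a page that reads
        // "robot ninjas fight robot pirates" (words numbered from 0).
        List<Integer> robot = Arrays.asList(0, 3);
        List<Integer> ninjas = Arrays.asList(1);
        List<Integer> pirates = Arrays.asList(4);
        System.out.println(adjacent(robot, ninjas));   // prints true: positions 0 and 1
        System.out.println(adjacent(ninjas, pirates)); // prints false: 2 is not a "pirates" position
    }
}
```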
Starting point code is in lab6.zip. We are using an external library Jsoup to parse HTML for us (it simplifies things greatly). You will need to add jsoup-1.8.3.jar to the build path in Eclipse. Go to the Project menu and select Properties. One of the properties is the Build Path. Select this, and click on the Libraries tab. One of the options is Add External JARs. Click on this; it should find the Jsoup jar file and let you add it to the Build Path. After you do this the errors in the HTMLScanner.java file should no longer be present.
If you are working from the command line, you can compile and run things using the -classpath parameter.
% javac -classpath jsoup-1.8.3.jar:. HTMLScanner.java TestScanner.java
% java -classpath jsoup-1.8.3.jar:. TestScanner http://www.cs.oberlin.edu/
Begin by experimenting with the HTMLScanner class and the associated test class TestScanner. HTMLScanner reads tokens one by one from a file, a web page (given its URL) or a string. TestScanner contains a main method designed to test the HTMLScanner.
HTMLScanner is designed to work similarly to the normal Scanner. You give the constructor a String representing the URL or file you want to read in. It then reads the file and lets you use hasNext() and next() to access the words on a page. I've also included features to iterate through the links on a page (hasNextLink() and nextLink()). The Jsoup HTML parser supports other features (keywords, title, etc.) which you can read about in Jsoup's API docs if you want.
HTMLScanner currently only returns contiguous blocks of alpha-numeric characters -- so "sing-song95" on a page will return "sing" and then "song95".
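To get a feel for this behavior without loading a page, here is a small stand-alone sketch that splits a string into runs of letters and digits. It is not the code inside HTMLScanner; the class TokenDemo is hypothetical, and it assumes "alpha-numeric" means ASCII letters and digits.

```java
import java.util.ArrayList;
import java.util.List;

class TokenDemo {
    // Split text into maximal runs of ASCII letters and digits.
    static List<String> tokens(String text) {
        List<String> result = new ArrayList<>();
        for (String piece : text.split("[^A-Za-z0-9]+")) {
            if (!piece.isEmpty()) { // split can produce a leading empty string
                result.add(piece);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokens("sing-song95")); // prints [sing, song95]
    }
}
```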
TestScanner has one command-line argument, a string representing a URL. Try it out on a few URLs you are familiar with, such as "http://www.cs.oberlin.edu/" and "http://www.google.com/", or a filename such as "testscannerfile".
First you'll be completing an implementation of TreeMap called MyTreeMap. Most of the implementation is already provided for you, but there are a few things you still need to finish. MyTreeMap is just an AVL tree in disguise, where the nodes in the tree contain "key-value" pairs. You place items in the tree ordered by their key, and this key has an associated value tagging along with it. Now, the key can be any reference type, and so can the value. Therefore, our MyTreeMap class will be parameterized by not one but two generic types: K for the key, and V for the value.
The methods you have to implement are listed below. You should peruse the class to see how it is implemented; it is not the same as the binary tree lab (Lab 5). In particular, a TreeMap contains a (key,value) pair, and a reference to its left and right subtrees (which are also TreeMaps). An empty TreeMap is one for which its left and right subtrees are null; a leaf TreeMap is one for which its left and right subtrees are non-null, but are themselves empty TreeMaps. You can explore this further by looking at the provided constructors.
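The difference between an empty tree and a leaf can be sketched with a stripped-down class. SentinelTree and its fields are hypothetical and much simpler than the provided MyTreeMap (no values, no balancing), so treat this only as a picture of the representation, under the assumption that empty trees keep both subtree references null.

```java
class SentinelTree {
    String key;          // unused while the tree is empty
    SentinelTree left;   // null while the tree is empty
    SentinelTree right;  // null while the tree is empty

    // An empty tree has no subtrees at all.
    boolean isEmpty() {
        return left == null && right == null;
    }

    // A leaf has two subtrees, and both of them are empty trees.
    boolean isLeaf() {
        return !isEmpty() && left.isEmpty() && right.isEmpty();
    }

    // Turn an empty tree into a leaf holding the given key.
    void makeLeaf(String k) {
        key = k;
        left = new SentinelTree();
        right = new SentinelTree();
    }

    public static void main(String[] args) {
        SentinelTree t = new SentinelTree();
        System.out.println(t.isEmpty()); // prints true
        t.makeLeaf("monkey");
        System.out.println(t.isLeaf());  // prints true
    }
}
```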
Note that the generic type K of the key implements the Comparable interface, and therefore, you can (and should!) use the compareTo method to determine whether two keys are equal, or to determine their order.
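As a reminder of how compareTo results are usually interpreted, the sketch below decides which way a search would go at a node. The class CompareDemo and the method direction are hypothetical, and the bound K extends Comparable<K> is an assumption about how the key type is declared.

```java
class CompareDemo {
    // Describes where key belongs relative to nodeKey in a search tree.
    static <K extends Comparable<K>> String direction(K key, K nodeKey) {
        int c = key.compareTo(nodeKey);
        if (c < 0) {
            return "left";   // key is smaller than the node's key
        } else if (c > 0) {
            return "right";  // key is larger than the node's key
        } else {
            return "here";   // keys are equal according to compareTo
        }
    }

    public static void main(String[] args) {
        System.out.println(direction("apple", "monkey")); // prints left
        System.out.println(direction(7, 7));              // prints here
    }
}
```

Note that the comparison uses compareTo rather than ==, since == on reference types only checks whether two references point to the same object.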
get(Object key) which takes care of the casting.
get(Object key) method is required for any TreeMap implementation.)
(key, value) mapping into the map, ordered by its key.
isEmpty() that you ended in during your search, and then make it into a leaf. In more detail, you should:
For testing, you should create a class called MyTreeMapTest.java
that thoroughly tests the new methods that you implemented. Be sure to try
examples that will require calls to each of the various configurations of
restructure(). I strongly suggest drawing out your
examples by hand rather than just making them up in your head.
Don't forget to test the case where you overwrite a value for an existing key.
Be thorough with your tests because you want this tree to be working before you proceed!
Next you'll be implementing your own version of TreeSet<T>
using a MyTreeMap<T, Boolean> as the backing storage.
We won't be implementing a
remove( ) method, but if we did you could "remove" an item from the set by setting its Boolean value to false.
Recall that in a Set, you only keep one copy of any item added. With a working MyTreeMap, implementing a MyTreeSet is pretty straightforward. Here are the methods you need to implement:
item to the set if it isn't already in there.
You should now create a file called MyTreeSetTest.java that contains JUnit tests for this class. As most of the methods you created are likely just a small wrapper around existing MyTreeMap methods, you will hopefully not run into too many issues while testing.
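The idea of a set backed by a map can be sketched as follows. To keep the example self-contained, java.util.TreeMap stands in for MyTreeMap, and the class MapBackedSet and its method names are assumptions made for this illustration rather than the required interface.

```java
import java.util.TreeMap;

class MapBackedSet<T extends Comparable<T>> {
    // java.util.TreeMap stands in for the lab's MyTreeMap here.
    private final TreeMap<T, Boolean> map = new TreeMap<>();

    // An item counts as present only if it is mapped to true,
    // which would also work with a "remove by storing false" scheme.
    boolean contains(T item) {
        return Boolean.TRUE.equals(map.get(item));
    }

    // Returns true only if the item was not already in the set.
    boolean add(T item) {
        if (contains(item)) {
            return false;
        }
        map.put(item, Boolean.TRUE);
        return true;
    }

    public static void main(String[] args) {
        MapBackedSet<String> links = new MapBackedSet<>();
        System.out.println(links.add("http://www.google.com/")); // prints true
        System.out.println(links.add("http://www.google.com/")); // prints false
    }
}
```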
Now that you have a working MyTreeMap and MyTreeSet, you will use them to implement a data structure that will contain the index representation of a web page. You will use a MyTreeMap<String, LinkedList<Integer>> to keep track of the indexes of each word on a page, and a MyTreeSet<String> to keep track of the links contained in the page. You should also keep track of the URL used to build the index and the total number of words on the page.
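The shape of the word index can be sketched like this. Again java.util.TreeMap stands in for MyTreeMap, the words come from an array instead of an HTMLScanner, and the class IndexDemo and the method buildIndex are hypothetical names.

```java
import java.util.LinkedList;
import java.util.Map;
import java.util.TreeMap;

class IndexDemo {
    // Maps each word to the list of positions (0, 1, 2, ...) where it occurs.
    static Map<String, LinkedList<Integer>> buildIndex(String[] words) {
        Map<String, LinkedList<Integer>> index = new TreeMap<>();
        for (int position = 0; position < words.length; position++) {
            LinkedList<Integer> locations = index.get(words[position]);
            if (locations == null) {
                // First time this word is seen: start a new list for it.
                locations = new LinkedList<>();
                index.put(words[position], locations);
            }
            locations.add(position);
        }
        return index;
    }

    public static void main(String[] args) {
        String[] words = {"moo", "cow", "moo"};
        System.out.println(buildIndex(words)); // prints {cow=[1], moo=[0, 2]}
    }
}
```

Because positions are appended in the order the words are read, each list ends up sorted in increasing order.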
You will need the following public methods:
baseUrl. Keep a running counter of the number of words you run into when stepping through the page using
true if the word
s appeared as text anywhere on the page.
s appeared on the page.
s appeared on the page (i.e., the value from MyTreeMap).
s does not appear on the page, return an empty list, not null.
Once you have those methods working, you should implement the ability to look for phrases. To do this, take a string and break it up along whitespace boundaries into individual words. Then check whether the words appear on the page in the same sequence as in the string.
My suggestion is to either use s.split("\\s+") to turn the input
into an array of Strings or a Scanner to step through s.
The String method split( ) takes a regular expression and uses it
to split the string into an array of substrings. The regular expression
"\\s+" matches any sequence of one or more whitespace
characters. (You might be tempted to use
s.split(" ") but this
runs into trouble if there are several whitespace characters in a row, which
is not uncommon.) Either way, you need to find the individual words of s.
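The difference between the two split patterns is easy to see on a small input. In this sketch the class SplitDemo and the helper words are hypothetical; the call to trim() is an extra precaution added here, since leading whitespace would otherwise produce an empty first element.

```java
import java.util.Arrays;

class SplitDemo {
    // Break a phrase into words, ignoring leading and trailing whitespace.
    static String[] words(String s) {
        return s.trim().split("\\s+");
    }

    public static void main(String[] args) {
        String s = "robot   ninjas"; // three spaces between the words
        System.out.println(Arrays.toString(words(s)));     // prints [robot, ninjas]
        System.out.println(Arrays.toString(s.split(" "))); // prints [robot, , , ninjas]
    }
}
```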
For each word there, create a parallel structure of Lists using
getLocations (I had an array). Loop through the values for the
first word and see if the next has a value 1 greater, the next 2
greater, etc. You only have a phrase match if every one has an
s is in the web page.
s appears on the page
s appears on the page divided by the total number of words on the page.
s on the page.
As you might expect by now, you will need to create a WebPageIndexTest.java file that thoroughly tests your WebPageIndex objects. (A good habit to get into is to create this file early on and add tests as you add in individual features.)
The main method should take an argument from the command line and build a WebPageIndex from it. Your main method should handle all exceptions and not display stack traces to the user. You should then display a list of all the words on the page, their frequencies, and their locations. Follow this up with a list of all the links that were on the page.
Here are some sample outputs from my program:
% java -classpath jsoup-1.8.3.jar:. WebPageIndex testscannerfile
Frequency and index of words in testscannerfile
happening 0.083333
hi 0.083333
if 0.083333
important 0.083333
is 0.250000 [3, 5, 10]
it 0.166667 [6, 9]
tagged 0.083333
there 0.083333
what 0.083333
Links:
http://www.google.com/
% java -classpath jsoup-1.8.3.jar:. WebPageIndex http://www.cs.oberlin.edu/~bob/cs151/Labs/Lab6/sample.html
Frequency and index of words in http://www.cs.oberlin.edu/~bob/cs151/Labs/Lab6/sample.html
6 0.076923 [4, 11, 15]
a 0.051282 [6, 23]
be 0.025641
book 0.025641
children 0.025641
cow 0.102564 [27, 30, 33, 36]
for 0.051282 [2, 9]
from 0.025641
i 0.025641
if 0.025641
just 0.025641
lab 0.051282 [3, 10]
me 0.025641
milk 0.025641
moo 0.128205 [28, 31, 34, 37, 38]
on 0.025641
page 0.051282 [8, 21]
popular 0.025641
s 0.025641
sample 0.025641
short 0.025641
text 0.025641
that 0.025641
will 0.025641
wonder 0.025641
you 0.025641
Links:
http://www.cs.oberlin.edu/~kuperman/csci151/lab06/index.html
http://www.goodreads.com/book/show/926239.Cow_Moo_Me
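Each entry in the sample output appears to be a word, its frequency printed with six decimal places, and its list of locations. One way to produce a line in that shape is sketched below; the class FormatDemo and the method line are hypothetical, and the total of 12 words is inferred from the first sample (where a single occurrence gives 0.083333).

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class FormatDemo {
    // Formats one line: word, frequency with six decimals, then locations.
    static String line(String word, List<Integer> locations, int totalWords) {
        double frequency = (double) locations.size() / totalWords;
        // Locale.ROOT keeps the decimal point a '.' regardless of system settings.
        return String.format(Locale.ROOT, "%s %f %s", word, frequency, locations);
    }

    public static void main(String[] args) {
        System.out.println(line("is", Arrays.asList(3, 5, 10), 12)); // prints is 0.250000 [3, 5, 10]
    }
}
```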
Use the handin program to submit a directory containing the following:
If you work with a partner, only one of you should submit your joint solution using handin.