In this lab, you will use AVL trees as a first approximation to web searching. In particular, for a given url, you will construct an AVL tree of the words on the page, and their locations. Presumably, if there is some correlation between word frequency and content, this will allow us to determine whether a given url is a good match for a single-word search query (in particular, it is a good match if the frequency of the query is high). We will then augment this search technique to allow for phrase queries, that is, queries that consist of multiple words.
This lab is the first part of a series of related labs about the World Wide Web. Our ultimate goal is to build a search engine for a limited portion of the web, with some of the features of common search engines such as Google, Yahoo!, or Bing.
The purpose of this lab is to:
As usual, you may work with one partner on this assignment, if you choose.
You probably use a search engine such as Google or Yahoo! at least 10 times on any given day to navigate through all the information on the web. When you type in a search query such as "robot ponies" (as one does), a good search engine will show you what it believes to be the most relevant websites for your robot pony information quest. What motivates a search engine company to provide quality "hits"? Do they do it out of the goodness of their itty bitty computer scientist hearts? Maybe a little. But mostly, they do it for profit: when the search engine lists its most relevant websites, it also displays advertisements relevant to people interested in robot ponies. For every "ad-click," the search engine receives revenue (i.e. moolah). Therefore, it behooves the search engine to actually list websites and advertisements that are most relevant to your search query: the better the selection, the more likely you are to click on the related banner ads, and the more likely you are to use their service for a future search, say on mechanical puppies.
So the million dollar question is: how do the search engines produce their list of relevant urls? They aren't likely to divulge such proprietary (and money-making) secrets to us. But that won't stop us from exploring some of the possibilities...
In this lab, we will make one simplistic attempt at determining the relevance of a given url for a specific topic. Our first observation: given a single search-query word, we will say that the website with the highest frequency of that word is the most relevant, where the frequency of a word on a page is the number of times it appears on the page divided by the total number of words on the page. The intuition here is that there is often a correlation between the actual words on a website and its content. For example, a website with 20% of its words being "monkey" is more likely to be a good source on monkeys than a website with only 1% of its words being "monkey". (Of course, there are exceptions. This page has already mentioned "monkey" far more times than necessary. Monkey monkey monkey monkey.)
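To make the frequency definition concrete, here is a minimal sketch in plain Java (the word list and class name are just illustrative, not part of the lab's required API):

```java
import java.util.Arrays;
import java.util.List;

public class FrequencyDemo {
    // Frequency of a word = occurrences on the page / total words on the page.
    static double frequency(List<String> pageWords, String query) {
        int count = 0;
        for (String w : pageWords) {
            if (w.equals(query)) count++;
        }
        return (double) count / pageWords.size();
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("monkey", "see", "monkey", "do");
        System.out.println(frequency(words, "monkey")); // 2 of 4 words -> 0.5
    }
}
```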
If we are looking for a multi-word phrase, however, we will need to look for all words of the phrase, in order, on the page. We will build an index of the words on a page, keeping track of where on the page we encountered each word (a list of locations, in fact, since a word can appear more than once.) Now if someone is looking for a page about "robot ponies," then we want to find pages that not only have a high-ish frequency of both the words "robot" and "ponies," but also have them appearing close together.
Therefore, our task is as follows: given a website's url (or, a plain ol' file), we will need to count up all the words, how many times they appear and in what locations, and then use this information as a first step towards world domination.
You should first obtain the starting point code in lab06.jar. Unjar the file into a work directory, then create a new project from that directory.
We are using an external library, Jsoup, to parse HTML for us (it simplifies things greatly). You will need to add jsoup-1.6.1.jar to the build path in Eclipse (right-click the file in the package window, then go to Build Path -> Add to Build Path). If you are working from the command line, you can compile and run things using the -classpath parameter.
% javac -classpath jsoup-1.6.1.jar:. HTMLScanner.java TestScanner.java
% java -classpath jsoup-1.6.1.jar:. TestScanner http://www.cs.oberlin.edu/
Begin by experimenting with the HTMLScanner class and the associated test class TestScanner. HTMLScanner reads tokens one by one from a file, a web page (given its URL) or a string. TestScanner contains a main method designed to test the HTMLScanner.
HTMLScanner is designed to work similarly to the normal Scanner. You give the constructor a String representing the URL or file you want to read; it then reads the page and lets you use hasNext() and next() to access the words on the page. (Look Ma, it's an Iterator!) I've also included features to iterate through the links on a page (hasNextLink() and nextLink()). The Jsoup HTML parser supports other features (keywords, title, etc.), which you can read about in Jsoup's API docs if you want.
HTMLScanner currently only returns contiguous blocks of alpha-numeric characters --- so "sing-song95" on a page will return "sing" and then "song95".
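That contiguous-alphanumeric tokenization rule can be illustrated with a regular expression (this sketch only demonstrates the rule; it is not HTMLScanner's actual implementation):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenDemo {
    // Extract maximal runs of alphanumeric characters, the way HTMLScanner
    // breaks page text into words.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile("[A-Za-z0-9]+").matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("sing-song95")); // [sing, song95]
    }
}
```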
TestScanner takes one command-line argument, a string representing a URL. Try it out on a few URLs you are familiar with, such as "http://www.cs.oberlin.edu/" and "http://www.google.com/". (By the way, a URL must start with either http:// or file:, or you will suffer the consequences of a MalformedURLException.)
First you'll be completing an implementation of TreeMap called MyTreeMap. Most of the implementation is already provided for you, but there are a few things you still need to finish. MyTreeMap is just an AVL tree in disguise, where the nodes in the tree contain "key-value" pairs. You place items in the tree ordered by their key, and this key has an associated value tagging along with it. (As tempting as it is to order the items by their value, DO NOT!) Now, the key can be any reference type, and so can the value. Therefore, our MyTreeMap class will be parameterized by not one but two generic types: K for the key, and V for the value. In fact, since this is a binary search tree and we are ordering our items by the key, the key must not only be a reference type, but a Comparable reference type, to boot.
The methods you have to implement are listed below. You should peruse the class to see how it is implemented; it is not the same as the binary tree lab. In particular, a TreeMap contains a (key,value) pair, and a reference to its left and right subtrees (which are also TreeMaps). An empty TreeMap is one for which its left and right subtrees are null; a leaf TreeMap is one for which its left and right subtrees are non-null, but are themselves empty TreeMaps. (We talked about a few different ways of representing empty trees in class. This is one such way.) You can explore this further by looking at the provided constructors.
Note that the generic type K of the key is a Comparable, and therefore, you can (and should!) use the compareTo method to determine whether two keys are equal, or to determine their order.
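As a quick reminder of the compareTo contract (negative means "orders before", zero means "equal", positive means "orders after"):

```java
public class CompareDemo {
    public static void main(String[] args) {
        String a = "apple", b = "banana";
        System.out.println(a.compareTo(b) < 0);        // true: "apple" orders before "banana"
        System.out.println(a.compareTo("apple") == 0); // true: equal keys
        System.out.println(b.compareTo(a) > 0);        // true: "banana" orders after "apple"
    }
}
```

In MyTreeMap, a negative result means the key you are looking for belongs in the left subtree, and a positive result means it belongs in the right subtree.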
- get(K key): return the value associated with the given key. (We provide get(Object key), which takes care of the casting; a get(Object key) method is required for any TreeMap implementation.)
- put(K key, V value): insert the (key, value) mapping into the map, ordered by its key.
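Your get and put should behave like the standard java.util.TreeMap, which makes it a handy reference oracle while you test (a sketch of the expected semantics, not of MyTreeMap's internals):

```java
import java.util.TreeMap;

public class MapSemanticsDemo {
    public static void main(String[] args) {
        // java.util.TreeMap shows the behavior MyTreeMap should match.
        TreeMap<String, Integer> map = new TreeMap<>();
        map.put("cow", 1);
        map.put("moo", 2);
        System.out.println(map.get("cow"));  // 1
        System.out.println(map.get("pony")); // null: key absent
        map.put("cow", 7);                   // overwrite an existing key's value
        System.out.println(map.get("cow"));  // 7
        System.out.println(map.size());      // 2: still only two keys
    }
}
```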
For testing, you should write jUnit tests to test the get and put methods you just wrote. Try examples that will require restructuring the tree; see the prelab for some examples that you can do by hand. Don't forget to test when you overwrite a value for an already-existing key! To help, we wrote some code in the MyTreeMap's main method; you can copy and paste some of it to get you started in your jUnit tests.
Be thorough with your tests, because you want that tree to work before you proceed!
Next you'll be implementing your own version of TreeSet using a MyTreeMap as the backing storage.
Recall that in a Set, you only keep one copy of any item added. With a working MyTreeMap, implementing a MyTreeSet is pretty straightforward. Here are the methods you need to implement:
- add(item): add item to the set if it isn't already in there.
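The map-backed idea can be sketched in a few lines: store each set element as a key in the map with a throwaway value. (This sketch uses java.util.TreeMap and Strings for illustration; your version would use your generic MyTreeMap.)

```java
import java.util.TreeMap;

public class SetSketch {
    // A set backed by a map: elements are keys; values are ignored.
    private final TreeMap<String, Boolean> backing = new TreeMap<>();

    void add(String item) {
        backing.put(item, Boolean.TRUE); // re-adding an item just overwrites; no duplicate
    }

    boolean contains(String item) {
        return backing.containsKey(item);
    }

    int size() {
        return backing.size();
    }

    public static void main(String[] args) {
        SetSketch s = new SetSketch();
        s.add("robot");
        s.add("pony");
        s.add("robot"); // duplicate: size stays 2
        System.out.println(s.size());           // 2
        System.out.println(s.contains("pony")); // true
    }
}
```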
You should test that your MyTreeSet works with jUnit tests. It depends on your MyTreeMap working correctly, so be sure that is done first.
Now that you have a working MyTreeMap and MyTreeSet, you will use them to implement a data structure that contains the index representation of a web page. You will use a MyTreeMap to keep track of the locations of each word on a page, and a MyTreeSet to keep track of the links contained in the page. You should also keep track of the URL used to build the index and the total number of words on the page.
You will need the following public methods:
- A constructor that builds the index from a given baseUrl. Keep a running counter of the number of words you run into when stepping through the page.
- A method that returns true if the word s appeared as text anywhere on the page.
- A method that returns the number of times the word s appeared on the page.
- A method that returns the list of locations at which s appeared on the page (i.e., the value from MyTreeMap). If s does not appear on the page, return an empty list, not null.
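The core indexing loop can be sketched as follows, using java.util.TreeMap and a hard-coded word list in place of your MyTreeMap and HTMLScanner (names and the 0-based positions here are illustrative assumptions):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

public class IndexSketch {
    // Map each word to the list of positions where it occurs, using a
    // running counter of words seen so far (0-based in this sketch).
    static TreeMap<String, List<Integer>> buildIndex(List<String> words) {
        TreeMap<String, List<Integer>> index = new TreeMap<>();
        int position = 0;
        for (String w : words) {
            List<Integer> locs = index.get(w);
            if (locs == null) {           // first sighting: start a new location list
                locs = new ArrayList<>(); // (the same get-then-put dance works
                index.put(w, locs);       //  with your MyTreeMap)
            }
            locs.add(position);
            position++;
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("moo", "said", "the", "cow", "moo");
        TreeMap<String, List<Integer>> index = buildIndex(words);
        System.out.println(index.get("moo")); // [0, 4]
        System.out.println(words.size());     // 5 words total on the "page"
    }
}
```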
Once you have those methods working (see the jUnit testing section below), you should go on to implement the ability to look for phrases. To do this, take the query string and break it up along whitespace boundaries into individual words, then check whether those words appear on the page in sequence.
My suggestion is to either use s.split("\\s+") to turn the input into an array of Strings, or use a Scanner to step through s. For each word, look up its list of locations with getLocations (I used an array of Lists). Then loop through the locations of the first word and check whether the second word has a location 1 greater, the third 2 greater, and so on. You have a phrase match only if every word has an appropriate location.
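The consecutive-location check can be sketched like this, with hard-coded location lists standing in for the results of getLocations (the class and method names are illustrative, not the lab's required API):

```java
import java.util.Arrays;
import java.util.List;

public class PhraseSketch {
    // locations.get(i) holds the positions of the i-th word of the phrase.
    // The phrase matches starting at p if word i appears at p + i for every i.
    static boolean phraseAppears(List<List<Integer>> locations) {
        if (locations.isEmpty() || locations.get(0).isEmpty()) return false;
        for (int start : locations.get(0)) {
            boolean match = true;
            for (int i = 1; i < locations.size(); i++) {
                if (!locations.get(i).contains(start + i)) {
                    match = false;
                    break;
                }
            }
            if (match) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // "robot" at positions 4 and 9, "ponies" at 10:
        // "robot ponies" matches starting at position 9.
        List<List<Integer>> locs = Arrays.asList(
                Arrays.asList(4, 9),
                Arrays.asList(10));
        System.out.println(phraseAppears(locs)); // true
    }
}
```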
- A method that returns whether the phrase s is in the web page.
- A method that returns the number of times the phrase s appears on the page.
- A method that returns the frequency of the phrase s: the number of times s appears on the page divided by the total number of words on the page.
- A method that returns the locations at which the phrase s appears on the page.
You should write jUnit tests to thoroughly check all of your methods. For each test method, create two WebPageIndexes: one out of the provided file testscannerfile, and one from the url http://www.cs.oberlin.edu/~asharp/cs151/labs/lab06/lab06-sample.html.
Some helpful facts about testscannerfile are as follows:
Frequency and index of words in testscannerfile:

happening   0.083333
hi          0.083333
if          0.083333
important   0.083333
is          0.250000   [3, 5, 10]
it          0.166667   [6, 9]
tagged      0.083333
there       0.083333
what        0.083333

Links: http://www.google.com/
Some helpful facts about lab06-sample.html are:
6           0.076923   [4, 11, 15]
a           0.051282   [6, 23]
be          0.025641
book        0.025641
children    0.025641
cow         0.102564   [27, 30, 33, 36]
for         0.051282   [2, 9]
from        0.025641
i           0.025641
if          0.025641
just        0.025641
lab         0.051282   [3, 10]
me          0.025641
milk        0.025641
moo         0.128205   [28, 31, 34, 37, 38]
on          0.025641
page        0.051282   [8, 21]
popular     0.025641
s           0.025641
sample      0.025641
short       0.025641
text        0.025641
that        0.025641
will        0.025641
wonder      0.025641
you         0.025641

Links:
Use the handin program to submit a directory containing the following:
If you work with a partner, please only one of you submit your joint solution using handin.