Lab 6

Web Searchin' via Word Frequencies
Due by 8pm, Sunday 4 Nov 2012

In this lab, you will use AVL trees as a first approximation to web searching. In particular, for a given URL, you will construct an AVL tree of the words on the page, along with their locations. Presumably, if there is some correlation between word frequency and content, this will allow us to determine whether a given URL is a good match for a single-word search query (in particular, it is a good match if the frequency of the query word is high). We will then augment this search technique to allow for phrase queries, that is, queries that consist of multiple words.

This lab is the first part of a series of related labs about the World Wide Web. Our ultimate goal is to build a search engine for a limited portion of the web, with some of the features of common search engines such as Google, Yahoo!, or Bing.

The purpose of this lab is to:

    • Implement and test the remaining pieces of an AVL tree that stores (key, value) pairs (MyTreeMap)
    • Build a set data structure (MyTreeSet) on top of that map
    • Use both to index the words, their locations, and the links on a web page (WebPageIndex), as a first step towards a search engine

As usual, you may work with one partner on this assignment, if you choose.


Motivation

You probably use a search engine such as Google or Yahoo! at least 10 times on any given day to navigate through all the information on the web. When you type in a search query such as "robot ponies" (as one does), a good search engine will show you what it believes to be the most relevant websites for your robot pony information quest. What motivates a search engine company to provide quality "hits"? Do they do it out of the goodness of their itty bitty computer scientist hearts? Maybe a little. But mostly, they do it for profit: when the search engine lists its most relevant websites, it also displays advertisements relevant to people interested in robot ponies. For every "ad-click," the search engine receives revenue (i.e. moolah). Therefore, it behooves the search engine to actually list websites and advertisements that are most relevant to your search query: the better the selection, the more likely you are to click on the related banner ads, and the more likely you are to use their service for a future search, say on mechanical puppies.

So the million-dollar question is: how do the search engines produce their lists of relevant URLs? They aren't likely to divulge such proprietary (and money-making) secrets to us. But that won't stop us from exploring some of the possibilities...

In this lab, we will make one simplistic attempt at determining the relevance of a given URL for a specific topic. Our first observation: given a search query word, we will say that the website with the highest frequency of that word is the most relevant, where the frequency of a word on a page is the number of times it appears on the page divided by the total number of words on the page. The intuition here is that there is often a correlation between the actual words on a website and its content. For example, a website with 20% of its words being "monkey" is more likely to be a good source on monkeys than a website with only 1% of its words being "monkey". (Of course, there are exceptions. This page has already mentioned "monkey" far more times than necessary. Monkey monkey monkey monkey.)

If we are looking for a multi-word phrase, however, we will need to look for all words of the phrase, in order, on the page. We will build an index of the words on a page, keeping track of where on the page we encountered each word (a list of locations, in fact, since a word can appear more than once). Now if someone is looking for a page about "robot ponies," then we want to find pages that not only have a high-ish frequency of both the words "robot" and "ponies," but also have them appearing close together.

Therefore, our task is as follows: given a website's url (or, a plain ol' file), we will need to count up all the words, how many times they appear and in what locations, and then use this information as a first step towards world domination.

Part 0 - Getting Started

You should first obtain the starting point code in lab06.jar. Unjar the file into a work directory, then create a new project from that directory.

We are using an external library, Jsoup, to parse HTML for us (it simplifies things greatly). You will need to add jsoup-1.6.1.jar to the build path in Eclipse (right-click the file in the package window, then choose Build Path -> Add to Build Path). If you are working from the command line, you can compile and run things using the -classpath parameter:

% javac -classpath jsoup-1.6.1.jar:. HTMLScanner.java TestScanner.java

% java -classpath jsoup-1.6.1.jar:. TestScanner http://www.cs.oberlin.edu/

Begin by experimenting with the HTMLScanner class and the associated test class TestScanner. HTMLScanner reads tokens one by one from a file, a web page (given its URL), or a string. TestScanner contains a main method designed to test the HTMLScanner.

HTMLScanner is designed to work similarly to the normal Scanner. You give the constructor a String representing the URL or file you want to read in; it then reads the page and lets you use hasNext() and next() to access the words on the page. (Look Ma, it's an Iterator!) I've also included features to iterate through the links on a page (hasNextLink() and nextLink()). The Jsoup HTML parser supports other features (keywords, title, etc.), which you can read about in Jsoup's API docs if you want.

HTMLScanner currently only returns contiguous blocks of alpha-numeric characters --- so "sing-song95" on a page will return "sing" and then "song95".

TestScanner takes one command-line argument, a string representing a URL. Try it out on a few URLs you are familiar with, such as "http://www.cs.oberlin.edu/" and "http://www.google.com/". (By the way, a URL must start with either http:// or file:, or you will suffer the consequences of a MalformedURLException.)
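To see the pieces in one place, here's a rough sketch of a test program in the spirit of TestScanner. The class name ScannerDemo is made up, and the exact constructor and exception behavior may differ from the provided HTMLScanner, so treat this as illustration rather than the provided code.

    // Sketch only: a made-up ScannerDemo class exercising HTMLScanner.
    public class ScannerDemo {
        public static void main(String[] args) throws Exception {
            // args[0] is a URL (http:// or file:), as with TestScanner;
            // throws Exception in case the constructor throws a checked one
            HTMLScanner scanner = new HTMLScanner(args[0]);
            while (scanner.hasNext()) {
                System.out.println(scanner.next());      // each word token
            }
            while (scanner.hasNextLink()) {
                System.out.println(scanner.nextLink());  // each link on the page
            }
        }
    }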


Part 1 - AVL Trees

First you'll be completing an implementation of TreeMap called MyTreeMap. Most of the implementation is already provided for you, but there are a few things you still need to finish. MyTreeMap is just an AVL tree in disguise, where the nodes in the tree contain "key-value" pairs. You place items in the tree ordered by their key, and this key has an associated value tagging along with it. (As tempting as it is to order the items by their value, DO NOT!) Now, the key can be any reference type, and so can the value. Therefore, our MyTreeMap class will be parameterized by not one but two generic types: K for the key, and V for the value. In fact, since this is a binary search tree and we are ordering our items by the key, the key must not only be a reference type, but a Comparable reference type, to boot.

The methods you have to implement are listed below. You should peruse the class to see how it is implemented; it is not the same as the binary tree lab. In particular, a TreeMap contains a (key,value) pair, and a reference to its left and right subtrees (which are also TreeMaps). An empty TreeMap is one for which its left and right subtrees are null; a leaf TreeMap is one for which its left and right subtrees are non-null, but are themselves empty TreeMaps. (We talked about a few different ways of representing empty trees in class. This is one such way.) You can explore this further by looking at the provided constructors.

Note that the generic type K of the key is a Comparable, and therefore, you can (and should!) use the compareTo method to determine whether two keys are equal, or to determine their order.

private V get(K searchKey)
Return the current mapping of the given key, that is, the value associated with the provided searchKey.
If no mapping exists, return null.
I've already included the actual public method get(Object key), which takes care of the casting.
(The get(Object key) method is required for any TreeMap implementation.)
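For reference, here is one way the recursive helper might look. The field names key, value, left, and right are assumptions based on the description above; check the provided class for the actual names.

    // Sketch of the private recursive get; field names are assumptions.
    private V get(K searchKey) {
        if (left == null) {                  // empty tree: no mapping exists
            return null;
        }
        int cmp = searchKey.compareTo(key);
        if (cmp == 0) {
            return value;                    // found the key
        } else if (cmp < 0) {
            return left.get(searchKey);      // search the left subtree
        } else {
            return right.get(searchKey);     // search the right subtree
        }
    }

Note how the empty-tree representation (null children) gives us the base case.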

public V put(K key, V value)
Insert a (key, value) mapping into the map, ordered by its key.
If a mapping for this key already exists, the new value should replace the old value in the map.
The return value of put is the previous value for the key if there was one, or null if there was not.
Here is a sequence of operations assuming a recursive implementation:
  1. If the key does not yet exist in the tree, add a node to the correct location in the tree as a leaf, that is, add the (key,value) pair to an empty TreeMap (and make it a leaf). In more detail, you should:
    1. Set the key field to key
    2. Set the value field to value
    3. Set the size of the tree appropriately...to 1
    4. Set both left and right to be new empty MyTreeMaps (i.e., new MyTreeMap<K,V>())
  2. If the key already exists, update its value.
  3. Call restructure(this) if the tree is unbalanced.
  4. Call this.setHeight() to update the tree's height.
  5. Recalculate your size by adding 1 to the sum of the size of your children.
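Putting those steps together, a sketch of put might look like the following. The field and helper names (key, value, left, right, size, setHeight) are assumptions based on the description above, and I'm assuming an empty MyTreeMap has size 0.

    // Sketch of put; field and helper names are assumptions.
    public V put(K key, V value) {
        V old = null;
        if (left == null) {                    // empty tree: insert as a leaf
            this.key = key;
            this.value = value;
            left = new MyTreeMap<K, V>();
            right = new MyTreeMap<K, V>();
        } else {
            int cmp = key.compareTo(this.key);
            if (cmp == 0) {                    // existing key: replace value
                old = this.value;
                this.value = value;
            } else if (cmp < 0) {
                old = left.put(key, value);
            } else {
                old = right.put(key, value);
            }
        }
        restructure(this);                     // rebalance if unbalanced
        setHeight();                           // update this tree's height
        size = 1 + left.size + right.size;     // empty subtrees have size 0
        return old;                            // previous value, or null
    }
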
private void restructure(MyTreeMap<K, V> node)
Rebalances the MyTreeMap rooted at node, if it is unbalanced.
The actual rotation is already implemented; that is, once it knows which subtrees need to be rotated, it will do it.
What you need to do is tell it which subtrees need to be rotated.
You will do this by setting certain variables appropriately, as described in the comments.
The first case is done for you (when the left child is the tallest, and its left child is the tallest).
Everything else (the rotation mechanics) should already be implemented; you just need to handle the remaining three cases.
Please have scratch paper with you on which to draw what is happening. Trying to figure it out all in your head is just asking for trouble.
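If it helps you organize your drawing, the case analysis has roughly the following shape. The height() accessor is a hypothetical name; use whatever the provided class actually offers.

    // Rough shape of the four cases; height() is a hypothetical accessor.
    if (node.left.height() > node.right.height()) {
        if (node.left.left.height() >= node.left.right.height()) {
            // Case 1: left child tallest, its left child tallest (done for you)
        } else {
            // Case 2: left child tallest, its right child tallest
        }
    } else {
        if (node.right.right.height() >= node.right.left.height()) {
            // Case 3: right child tallest, its right child tallest
        } else {
            // Case 4: right child tallest, its left child tallest
        }
    }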

JUnit Testing

For testing, you should write JUnit tests for the get and put methods you just wrote. Try examples that will require restructuring the tree; see the prelab for some examples that you can do by hand. Don't forget to test overwriting the value of an already-existing key! To help, we wrote some code in MyTreeMap's main method; you can copy and paste some of it to get your JUnit tests started.

Be thorough with your tests, because you want that tree to work before you proceed!
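For instance, a first pair of tests might look like this (JUnit 4 style; adjust the names to match your classes, and verify against your prelab examples that three sorted-order inserts really do force a rotation):

    import static org.junit.Assert.*;
    import org.junit.Test;

    public class MyTreeMapTest {
        @Test
        public void putReturnsOldValueOnOverwrite() {
            MyTreeMap<String, Integer> map = new MyTreeMap<String, Integer>();
            assertNull(map.put("monkey", 1));                        // new key
            assertEquals(Integer.valueOf(1), map.put("monkey", 2));  // overwrite
            assertEquals(Integer.valueOf(2), map.get("monkey"));
        }

        @Test
        public void getWorksAfterRebalancing() {
            MyTreeMap<String, Integer> map = new MyTreeMap<String, Integer>();
            map.put("a", 1);
            map.put("b", 2);
            map.put("c", 3);                  // sorted inserts force a rotation
            assertEquals(Integer.valueOf(2), map.get("b"));
            assertNull(map.get("zebra"));     // absent key returns null
        }
    }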

Part 2 - MyTreeSet

Next you'll be implementing your own version of TreeSet using a MyTreeMap as the backing storage.

Recall that in a Set, you only keep one copy of any item added. With a working MyTreeMap, implementing a MyTreeSet is pretty straightforward. Here are the methods you need to implement:

public MyTreeSet()
Create an empty Set by creating an empty MyTreeMap.
public boolean add(T item)
Add the item to the set if it isn't already there.
You should return true if the set changed, false otherwise.
public Iterator<T> iterator()
Just return the inorder iterator keys("in") from MyTreeMap.
public int size()
Return the size of the MyTreeMap.
public void clear()
Clear the MyTreeMap.
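Put together, the whole class is only a few lines. Here is a sketch; it assumes MyTreeMap exposes the size(), clear(), and keys(String) operations described above, and uses Boolean.TRUE as a throwaway value, since only the keys matter for a set.

    import java.util.Iterator;

    // Sketch of MyTreeSet backed by a MyTreeMap; method names on the map
    // are assumptions based on the descriptions above.
    public class MyTreeSet<T extends Comparable<T>> {
        private MyTreeMap<T, Boolean> map;

        public MyTreeSet() {
            map = new MyTreeMap<T, Boolean>();   // empty backing map
        }

        public boolean add(T item) {
            // put returns the previous value: null means the item was new
            return map.put(item, Boolean.TRUE) == null;
        }

        public Iterator<T> iterator() {
            return map.keys("in");               // inorder iterator over keys
        }

        public int size() {
            return map.size();
        }

        public void clear() {
            map.clear();
        }
    }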

JUnit Testing

You should test that your MyTreeSet works with JUnit tests. It depends on your MyTreeMap working correctly, so be sure that is done first.


Part 3 - WebPageIndex

Now that you have a working MyTreeMap and MyTreeSet, you will use them to implement a data structure that contains the index representation of a web page. You will use a MyTreeMap to keep track of the locations of each word on a page, and a MyTreeSet to keep track of the links contained in the page. You should also keep track of the URL used to build the index and the total number of words on the page.

You will need the following public methods:

public WebPageIndex(String baseUrl)
Create an HTMLScanner from baseUrl, and keep a running counter of the number of words you run into when stepping through the page.
When you first encounter a word (i.e., it isn't in your MyTreeMap already), you should create a new LinkedList<Integer> containing the current index. If you've already seen the word, you should just add the current index onto the end of the existing list of locations. (A sketch of this loop appears after the method list below.)
Then you should step through the links using nextLink() and add them each into your MyTreeSet.
Hint: converting all words to lower case using String.toLowerCase() is highly recommended.
public String getUrl()
Return the URL used to create this index.
public int getWordCount()
Return the count of the total number of words on the page (not just unique words).
public boolean contains(String s)
Return true if the word s appeared as text anywhere on the page.
public int getCount(String s)
Return the number of times the word s appeared on the page.
public double getFrequency(String s)
Return the frequency with which the word s appears on the page (i.e., the count for that word divided by the total number of words).
Be careful of integer division!
public List<Integer> getLocations(String s)
Return the List representing the locations where the word s appeared on the page (i.e., the value from MyTreeMap).
If s does not appear on the page, return an empty list, not null.
public Iterator<String> words()
Return an iterator over all the words on the page in alphabetical order.
Hint: your MyTreeMap already has something that will create this.
public String toString()
Just return the MyTreeMap's toString() value.
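Here is the promised sketch of the constructor's indexing loop, plus getFrequency's cast. The field names url, wordMap, links, and wordCount are hypothetical, and the throws clause is a guess; match whatever the provided HTMLScanner actually declares.

    // Sketch of the indexing loop; field names are hypothetical.
    public WebPageIndex(String baseUrl) throws java.io.IOException {
        url = baseUrl;
        wordMap = new MyTreeMap<String, LinkedList<Integer>>();
        links = new MyTreeSet<String>();
        HTMLScanner scanner = new HTMLScanner(baseUrl);
        int index = 0;
        while (scanner.hasNext()) {
            String word = scanner.next().toLowerCase();   // normalize case
            LinkedList<Integer> locations = wordMap.get(word);
            if (locations == null) {                      // first occurrence
                locations = new LinkedList<Integer>();
                wordMap.put(word, locations);
            }
            locations.add(index);                         // record this position
            index++;
        }
        wordCount = index;                                // total words on page
        while (scanner.hasNextLink()) {
            links.add(scanner.nextLink());                // collect the links
        }
    }

    public double getFrequency(String s) {
        // cast before dividing to avoid integer division
        return (double) getCount(s) / getWordCount();
    }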

Once you have those methods working (see the JUnit testing section below), you should go on and implement the ability to look for phrases. To do this, take the query string and break it up along whitespace boundaries into individual words, then check whether those words appear on the page in sequence.

My suggestion is to either use s.split("\\s+") to turn the input into an array of Strings, or use a Scanner to step through s. For each word, build a parallel structure of Lists using getLocations (I had an array of Lists). Loop through the locations of the first word and see if the second word has a location 1 greater, the third 2 greater, etc. You only have a phrase match if every word has an appropriate location. (A sketch appears after the method list below.)

public boolean containsPhrase(String s)
Return true if the phrase s is in the web page.
public int getPhraseCount(String s)
Return the number of times the phrase s appears on the page.
public double getPhraseFrequency(String s)
Return the number of times the phrase s appears on the page divided by the total number of words on the page.
(Note: I'm open to suggestions on how to improve this metric.)
public List<Integer> getPhraseLocations(String s)
Return a List marking the starting point of each instance of the phrase s on the page.
If the phrase does not occur, you should return an empty List, not null.
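And here is the promised sketch of the phrase search itself, using the parallel-lists idea from above; containsPhrase and getPhraseCount then follow easily from it. It assumes getLocations returns an empty list (never null) for absent words, as specified earlier.

    // Sketch of getPhraseLocations; uses java.util.List and LinkedList.
    public List<Integer> getPhraseLocations(String s) {
        String[] words = s.trim().toLowerCase().split("\\s+");
        List<Integer> result = new LinkedList<Integer>();
        for (Integer start : getLocations(words[0])) {
            boolean match = true;
            for (int i = 1; i < words.length && match; i++) {
                // word i must appear exactly i positions after the start
                if (!getLocations(words[i]).contains(start + i)) {
                    match = false;
                }
            }
            if (match) {
                result.add(start);            // the phrase begins here
            }
        }
        return result;
    }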

JUnit Testing

You should write JUnit tests to thoroughly check all of your methods. For each test method, create two WebPageIndex objects: one from the provided file testscannerfile, and one from the URL http://www.cs.oberlin.edu/~asharp/cs151/labs/lab06/lab06-sample.html.

Some helpful facts about testscannerfile are as follows:

Frequency and locations of words in testscannerfile

word            frequency   locations
happening       0.083333    [4]
hi              0.083333    [0]
if              0.083333    [8]
important       0.083333    [7]
is              0.250000    [3, 5, 10]
it              0.166667    [6, 9]
tagged          0.083333    [11]
there           0.083333    [1]
what            0.083333    [2]

Links:
http://www.google.com/

Some helpful facts about lab06-sample.html are as follows:

Frequency and locations of words in lab06-sample.html

word            frequency   locations
6               0.076923    [4, 11, 15]
a               0.051282    [6, 23]
be              0.025641    [17]
book            0.025641    [26]
children        0.025641    [24]
cow             0.102564    [27, 30, 33, 36]
for             0.051282    [2, 9]
from            0.025641    [22]
i               0.025641    [12]
if              0.025641    [14]
just            0.025641    [5]
lab             0.051282    [3, 10]
me              0.025641    [29]
milk            0.025641    [35]
moo             0.128205    [28, 31, 34, 37, 38]
on              0.025641    [19]
page            0.051282    [8, 21]
popular         0.025641    [18]
s               0.025641    [25]
sample          0.025641    [0]
short           0.025641    [7]
text            0.025641    [1]
that            0.025641    [20]
will            0.025641    [16]
wonder          0.025641    [13]
you             0.025641    [32]

Links: (none)

Handin

Use the handin program to submit a directory containing the following:

  1. All .java files necessary for compiling your code (including any of the classes that I gave you that you use in your solution).
  2. A README file with:
    • Your name (and your partner's name if you had one)
    • A description of any parts of the project that are not working

If you work with a partner, please have only one of you submit your joint solution using handin.