Prelab 6

Web Searchin' via Word Frequencies
Bonus Prelab: Optionally due by 10am, Wednesday 31 Oct 2012

In this (bonus) prelab, you will familiarize yourself with some of the design and implementation issues of the upcoming lab 6. Please write or type up your solutions, and hand in a paper copy before class on Wednesday. You are not required to do this prelab; if you choose to do so, your score will count as bonus points in your final grade calculation.

This lab is the first part of a series of related labs about the World Wide Web. Our ultimate goal is to build a search engine for a limited portion of the web, with some of the features of common search engines such as Google, Yahoo!, or Bing.

In this first step, you will use an AVL tree to create an index of all the words contained on a webpage. You will then be able to query the index to find out how frequently a word appears on a page and in what locations.

You may work with a partner on this lab (although the prelab must be done individually).


Motivation

You probably use a search engine such as Google or Yahoo! at least 10 times on any given day to navigate through all the information on the web. When you type in a search query such as "robot ponies" (as one does), a good search engine will show you what it believes to be the most relevant websites for your robot pony information quest. What motivates a search engine company to provide quality "hits"? Do they do it out of the goodness of their itty bitty computer scientist hearts? Maybe a little. But mostly, they do it for profit: when the search engine lists its most relevant websites, it also displays advertisements relevant to people interested in robot ponies. For every "ad-click," the search engine receives revenue (i.e. moolah). Therefore, it behooves the search engine to actually list websites and advertisements that are most relevant to your search query: the better the selection, the more likely you are to click on the related banner ads, and the more likely you are to use their service for a future search, say on mechanical puppies.

So the million dollar question is: how do the search engines produce their list of relevant urls? They aren't likely to divulge such proprietary (and money-making) secrets to us. But that won't stop us from exploring some of the possibilities...

In the upcoming lab, we will make two simplistic attempts at determining the relevance of a given url for a specific topic. In particular, given a search query word, we say that the website with the highest frequency of the word is the most relevant, where the frequency of a word on a page is the number of times it appears on the page divided by the total number of words on the page. The intuition here is that there is often a correlation between the actual words on a website and its content. For example, a website with 20% of its words being "monkey" is more likely to be a good source on monkeys than a website with only 1% of its words being "monkey". (Of course, there are exceptions. This page has already mentioned "monkey" far more times than necessary. Monkey monkey monkey monkey.)
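To make the frequency definition concrete, here is a minimal sketch of the arithmetic: occurrences of the word divided by the total number of words. The class name WordFrequencyDemo, the method name frequency, and the choice to ignore case are assumptions made for illustration; they are not part of the lab's starter code.

```java
import java.util.Arrays;
import java.util.List;

class WordFrequencyDemo {
    /** Returns (occurrences of word) / (total words), or 0.0 for an empty page. */
    static double frequency(List<String> words, String word) {
        if (words.isEmpty()) {
            return 0.0; // avoid dividing by zero on an empty page
        }
        int count = 0;
        for (String w : words) {
            if (w.equalsIgnoreCase(word)) { // assumption: case does not matter
                count++;
            }
        }
        return (double) count / words.size();
    }

    public static void main(String[] args) {
        List<String> page = Arrays.asList("the", "monkey", "ate", "a", "banana");
        System.out.println(frequency(page, "monkey")); // 1 of 5 words: prints 0.2
    }
}
```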

Therefore, our task is as follows: given a website's url (or, a plain ol' file), we will calculate the word frequency of the words contained therein. This is our first step towards world (web) domination.


Part 1 - AVL Trees

First up is an implementation of TreeMap called MyTreeMap. MyTreeMap is just an AVL tree in disguise, where the nodes in the tree contain "key-value" pairs. You place items in the tree ordered by their key, and this key has an associated value tagging along with it. For example, the following is a MyTreeMap containing (key,value) pairs. Note that the items are ordered by their keys (the animal names), and not by their values (the numbers).


Now, the key can be any reference type that is Comparable, and the value is a reference type that needn't be Comparable. Therefore, our MyTreeMap class will be parameterized by not one but two generic types: K for the key, and V for the value. The class header will therefore be

    public class MyTreeMap<K,V> extends AbstractMap<K,V>

(In actual fact, we want our type K to be Comparable, and therefore our class header will be

    public class MyTreeMap<K extends Comparable<? super K>,V> extends AbstractMap<K,V> 

but let's not worry about that too much right now.)

  1. Draw the AVL tree that results from the insertion (as a leaf-node, as discussed in class), in the following order, of (aardvark,10), (pot bellied pig,3), and (yak,6) into the tree above. Remember, the tree is ordered by the animals in alphabetical order, not the integer values.







  2. Draw the AVL tree that results from the insertion of (flamingo,5) into the original tree (the one without aardvark, pot bellied pig, and yak).
    • First draw the tree as it would be if we were not concerned about balance (that is, if this were just a normal binary search tree).
    • Next, label the root of the smallest unbalanced subtree with the letter Z.
    • Next, label the root of Z's tallest child with the letter Y.
    • Next, label the root of Y's tallest child with the letter X.
    • Out of X, Y, and Z, what is the smallest key value? Label this as a.
    • Out of X, Y, and Z, what is the middle key value? Label this as b.
    • Out of X, Y, and Z, what is the largest key value? Label this as c.
    • Finally, draw the rebalanced AVL tree. If you've done things correctly, it should have the key b as its root, with left and right subtrees a and c, respectively.
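The labelling steps above can be sketched in code. This is only an illustration under stated assumptions: the Node class, the helper names, and the use of null for a missing child are all hypothetical and differ from the lab's MyTreeMap representation. The tie-breaking in tallerChild is only adequate right after an insertion, when no ties occur.

```java
class TrinodeDemo {
    /** Hypothetical node type: a null child marks a missing subtree. */
    static class Node {
        final String key;
        Node left, right;
        Node(String key) { this.key = key; }
    }

    static int height(Node n) {
        return n == null ? 0 : 1 + Math.max(height(n.left), height(n.right));
    }

    static Node tallerChild(Node n) {
        return height(n.left) >= height(n.right) ? n.left : n.right;
    }

    /** Restructures the unbalanced subtree rooted at z and returns its new root b. */
    static Node restructure(Node z) {
        Node y = tallerChild(z);
        Node x = tallerChild(y);
        Node a, b, c, t1, t2; // t1 becomes a's right subtree, t2 becomes c's left subtree
        if (y == z.left && x == y.left) {   // keys: x < y < z
            a = x; b = y; c = z; t1 = x.right; t2 = y.right;
        } else if (y == z.left) {           // keys: y < x < z
            a = y; b = x; c = z; t1 = x.left; t2 = x.right;
        } else if (x == y.right) {          // keys: z < y < x
            a = z; b = y; c = x; t1 = y.left; t2 = x.left;
        } else {                            // keys: z < x < y
            a = z; b = x; c = y; t1 = x.left; t2 = x.right;
        }
        // a keeps its left subtree and c keeps its right subtree.
        a.right = t1;
        c.left = t2;
        b.left = a;
        b.right = c;
        return b;
    }

    public static void main(String[] args) {
        Node z = new Node("emu");
        z.left = new Node("cat");
        z.left.right = new Node("dog");
        Node root = restructure(z);
        System.out.println(root.left.key + " " + root.key + " " + root.right.key); // cat dog emu
    }
}
```

In a full tree, the caller would still need to attach the returned node b where Z used to hang from its parent.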








  3. Repeat the above labellings for the insertion of (moo-cow,4) into the original tree. That is,
    • First draw the tree as it would be if we were not concerned about balance (that is, if this were just a normal binary search tree).
    • Next, label the root of the smallest unbalanced subtree with the letter Z.
    • Next, label the root of Z's tallest child with the letter Y.
    • Next, label the root of Y's tallest child with the letter X.
    • Out of X, Y, and Z, what is the smallest key value? Label this as a.
    • Out of X, Y, and Z, what is the middle key value? Label this as b.
    • Out of X, Y, and Z, what is the largest key value? Label this as c.
    • Finally, draw the rebalanced AVL tree. If you've done things correctly, it should have the key b as its root, with left and right subtrees a and c, respectively.








In the lab itself, parts of the MyTreeMap implementation will be given to you, and you will fill in the remainder. Unlike the binary tree lab, we will have a single non-abstract MyTreeMap class. Each MyTreeMap contains a key and value pair, and a reference to its left and right subtrees (which are also MyTreeMaps). An empty MyTreeMap is one for which its left and right subtrees are null; a leaf MyTreeMap is one for which its left and right subtrees are non-null, but are themselves empty MyTreeMaps.

You will be required to complete the implementation of the following three methods.

private V get(K searchKey)
Return the current mapping of the given key, that is, the value associated with the provided searchKey.
If no mapping exists, return null.

public V put(K key, V value)
Insert a (key, value) mapping into the map, ordered by its key.
If a mapping for this key already exists, the new value should replace the old value in the map.
The return value of put is the previous value for the key if there was one, or null if there was not.

private void restructure(MyTreeMap<K, V> node)
Rebalances the MyTreeMap rooted at node, if it is unbalanced.
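As a rough illustration of the empty-tree representation and the get/put contracts described above, here is a sketch of an unbalanced version. The class SketchTreeMap and its helper isEmptyTree are hypothetical names, the class does not extend AbstractMap, and it never rebalances, so it is not the lab's MyTreeMap and the starter code may be organized quite differently.

```java
class SketchTreeMap<K extends Comparable<? super K>, V> {
    private K key;
    private V value;
    private SketchTreeMap<K, V> left;  // null only while this tree is empty
    private SketchTreeMap<K, V> right; // null only while this tree is empty

    /** An empty tree is one whose left and right subtrees are both null. */
    boolean isEmptyTree() {
        return left == null && right == null;
    }

    /** Returns the value mapped to searchKey, or null if there is no mapping. */
    V get(K searchKey) {
        if (isEmptyTree()) {
            return null;
        }
        int cmp = searchKey.compareTo(key);
        if (cmp == 0) {
            return value;
        }
        return (cmp < 0 ? left : right).get(searchKey);
    }

    /** Inserts or replaces a mapping; returns the previous value, or null. */
    V put(K newKey, V newValue) {
        if (isEmptyTree()) {
            // Turn this empty tree into a leaf with two empty subtrees.
            key = newKey;
            value = newValue;
            left = new SketchTreeMap<>();
            right = new SketchTreeMap<>();
            return null;
        }
        int cmp = newKey.compareTo(key);
        if (cmp == 0) {
            V old = value;
            value = newValue;
            return old;
        }
        // A real AVL implementation would rebalance on the way back up.
        return (cmp < 0 ? left : right).put(newKey, newValue);
    }
}
```

Because every non-empty node always has two non-null subtrees, the recursive calls never need a null check; the empty tree itself acts as the base case.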

Part 2 - WebPageIndex

In this lab, we'll be trying out a technique that is used by most search engines by building an index of the words on a page. For each word we encounter on a web page, we'll keep track of what order we encountered it in (0th, 1st, 2nd, 3rd, etc.) and keep a list of all the locations for each word.

Certainly having a list of the locations can also tell us how many times a certain query word appears. This gives us one crude measure by which to evaluate a page's relevance to a given query word.
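Here is a minimal sketch of what such an index might look like, using java.util.TreeMap as a stand-in for the MyTreeMap you will write. The class name WordIndexDemo, the method buildIndex, and the tokenizing rule (lower-case everything and split on anything that is not a letter) are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class WordIndexDemo {
    /** Maps each word to the list of positions (0th, 1st, 2nd, ...) where it occurs. */
    static Map<String, List<Integer>> buildIndex(String text) {
        Map<String, List<Integer>> index = new TreeMap<>();
        int position = 0;
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            if (token.isEmpty()) {
                continue; // split can produce a leading empty string
            }
            index.computeIfAbsent(token, k -> new ArrayList<>()).add(position);
            position++;
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> index = buildIndex("The cat saw the other cat.");
        List<Integer> locations = index.get("cat");
        System.out.println(locations);        // [1, 5]
        System.out.println(locations.size()); // the count is just the list's length: 2
    }
}
```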

Of course, if the query is more than one word, we need to be careful. If someone is looking for "robot ponies", then you would need to know where the words "robot" and "ponies" appear on the page, and determine whether they appear sequentially in the correct order. Some search engines also support the ability to search for words NEAR each other.
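One way to picture the "sequentially in the correct order" check for a two-word query: the second word must appear at a position exactly one past some position of the first word. The sketch below assumes you already have each word's location list; the class name PhraseDemo, the method appearsInOrder, and the sample locations are made up for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class PhraseDemo {
    /** True if some location p of the first word has the second word at p + 1. */
    static boolean appearsInOrder(List<Integer> first, List<Integer> second) {
        Set<Integer> secondLocations = new HashSet<>(second);
        for (int p : first) {
            if (secondLocations.contains(p + 1)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical locations: "robot" at 4 and 9, "ponies" at 2 and 10.
        List<Integer> robot = Arrays.asList(4, 9);
        List<Integer> ponies = Arrays.asList(2, 10);
        System.out.println(appearsInOrder(robot, ponies)); // true (9 is followed by 10)
    }
}
```

A NEAR-style search could presumably relax the "exactly one past" test to "within some distance", though the details would depend on how NEAR is defined.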

  1. Given the following paragraph of text, go through and index the order of all the words (start from 0):
    How much wood could a woodchuck chuck, if a woodchuck could chuck wood?
    As much wood as a woodchuck would, if a woodchuck could chuck wood.
  2. Fill in the following table of words with a list of their indexed locations from the previous question.

    Word         Count   Frequency   Locations
    a
    as
    chuck
    could        3       0.115       3, 10, 23
    how          1       0.038       0
    if
    much
    wood
    woodchuck
    would
  3. Explain how you could use this information to find out if a given phrase (e.g., "chuck wood") exists within a body of text.



  4. Assuming that this information is to be stored in some type of balanced binary search tree, what is the running time for creating this table? Explain your reasoning.



  5. Assuming that this table was already constructed and you could use a binary search to locate a row, what would the running time be to look up a word's frequency? What would it be to determine whether a phrase exists?




Part 4 - MyTreeSet

In this section you'll be implementing your own version of TreeSet using a MyTreeMap as the backing storage.

Recall that a Set is a collection of unique items; that is, it stores items of one type K and keeps at most one copy of any item added. With a working MyTreeMap<K,V>, implementing a MyTreeSet<K> is pretty straightforward.
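To give a feel for why a map makes a convenient backing store, here is one possible sketch that uses java.util.TreeMap in place of MyTreeMap. The class name MapBackedSet and the PRESENT placeholder value are assumptions for illustration; the lab's MyTreeSet may well be structured differently.

```java
import java.util.Map;
import java.util.TreeMap;

class MapBackedSet<K extends Comparable<? super K>> {
    // Every key maps to the same dummy value; only the keys matter.
    private static final Object PRESENT = new Object();
    private final Map<K, Object> map = new TreeMap<>();

    /** Adds item; returns true only if it was not already in the set. */
    boolean add(K item) {
        // put returns the previous value, which is null for a brand-new key.
        return map.put(item, PRESENT) == null;
    }

    /** Returns true if item has been added. */
    boolean contains(K item) {
        return map.get(item) != null;
    }

    public static void main(String[] args) {
        MapBackedSet<String> animals = new MapBackedSet<>();
        System.out.println(animals.add("yak")); // true
        System.out.println(animals.add("yak")); // false
    }
}
```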

  1. Assuming you have a working MyTreeMap, briefly explain how you could use a Map (which has get(key) and put(key,value) methods) to implement a Set whose add(item) method returns a boolean: true if the item wasn't already in the Set, and false otherwise.