CSCI 151 - Web Page Index
Indexing the web using AVL trees

Due 10am, Monday, October 31st, 2016

In this prelab, you will familiarize yourself with some of the design and implementation issues in the upcoming lab 6. Please write or type up your solutions, and hand in a paper copy before 10am on Monday. Remember, late prelabs receive zero credit!

This assignment is the first part of a series of related assignments about the World Wide Web. Our ultimate goal is to build a search engine for a limited portion of the web, with some of the features of common search engines such as Yahoo!, Bing, or Google.

In this first step, you will use an AVL tree to create an index of all the words contained on a webpage. You will then be able to query the index to find out how frequently a word appears on a page and in what locations. Presumably, there is some correlation between the words on a page and that page's content, so a page on which a word appears frequently is probably a better match for that word than a page on which it appears rarely or not at all.

As usual, you may work with a partner on this assignment.

Motivation

You probably use a search engine at least 10 times a day to dig through the vast amount of information on the web. Some folks don't even know how to get to their favorite web sites without using a search engine! Back in the days before search engines, folks had to rely upon browser bookmarks to keep track of their favorite sites, or even buy a book listing web sites. (I guess some folks still do!)

So what motivates today's search companies to spend so much time, effort, and money to give you good results for your search queries? Sure, some of the first search engines were built by computer science folks doing nifty things and sharing them out of the goodness of their hearts (and some still are), but these days it is mostly done for profit. Along with your search results, the search engine also displays a bunch of advertisements related to your query. For every ad you click, the search engine gets some money. Sometimes, it gets paid just to display a URL at the top of the search results! If you get high-quality, relevant search results, you are more likely to keep using a particular search engine, which increases its chances of profiting from you.

So the million dollar question is: How do the search engines produce their list of relevant URLs? Well, they don't share all the specifics, but we know a number of basic ideas that most of them use.

Part 1 - Building indexes

In this lab, we'll be trying out two techniques that are used by most search engines, both of which rely on an index of the words on a page. For each word we encounter on a web page, we'll keep track of the order in which we encountered it (0th, 1st, 2nd, 3rd, etc.) and keep a list of all the locations for each word.
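As a rough illustration of the kind of index described above, here is a minimal sketch that uses Java's built-in java.util.TreeMap rather than the MyTreeMap you will complete in the lab. The class name WordIndexSketch, the method buildIndex, the sample sentence, and the rule that a "word" is a run of letters are all assumptions made for this example, not part of the lab's starter code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical class; not part of the lab's starter code.
class WordIndexSketch {
    // Maps each lowercased word to the list of 0-based positions where it occurs.
    static Map<String, List<Integer>> buildIndex(String text) {
        Map<String, List<Integer>> index = new TreeMap<>();
        int position = 0;
        // Assumption: a word is a maximal run of letters a-z; anything else is a separator.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (token.isEmpty()) {
                continue; // split() can produce a leading empty string
            }
            index.computeIfAbsent(token, k -> new ArrayList<>()).add(position);
            position++;
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> index = buildIndex("The cat saw the other cat.");
        System.out.println(index); // {cat=[1, 5], other=[4], saw=[2], the=[0, 3]}
    }
}
```

Because positions are appended in the order the words are read, each list of locations ends up sorted in increasing order.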

The first query technique is related to the frequency of a query word on a given page. If you've got a page that has the word "monkey" occurring repeatedly, then it is quite likely to be about monkeys. And a page on which 10% of the words are "monkey" is probably more relevant to a query on "monkey" than a page on which only 0.5% of the words are.
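To make the frequency idea concrete, here is a small sketch that treats frequency as the number of occurrences of a word divided by the total number of words on the page. The class name FrequencySketch and the page sizes used in main are made up for illustration.

```java
// Hypothetical class; the word counts below are invented for the example.
class FrequencySketch {
    // Fraction of the page's words that are the query word.
    static double frequency(int count, int totalWords) {
        if (totalWords <= 0) {
            throw new IllegalArgumentException("page must contain at least one word");
        }
        return (double) count / totalWords;
    }

    public static void main(String[] args) {
        double pageA = frequency(20, 200); // "monkey" is 10% of the words
        double pageB = frequency(1, 200);  // "monkey" is 0.5% of the words
        System.out.println(pageA > pageB); // true: page A looks more relevant
    }
}
```

With this definition, a word that occurs 3 times on a 26-word page has a frequency of about 0.115.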

The second technique will be to use the set of indexes to find phrases on a page. If someone is looking for "robot ninjas", then you would need to know where the words "robot" and "ninjas" appear on the page. Some search engines also support the ability to search for words NEAR each other.
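One possible way to support a NEAR-style query is sketched below. It assumes each word's locations are stored as a list sorted in increasing order, and it walks the two lists in step looking for a pair of positions that are close enough. The class name NearSketch, the method near, the maxDistance parameter, and the sample positions are all hypothetical; real search engines may define NEAR differently.

```java
import java.util.List;

// Hypothetical class; a sketch of one way to answer a NEAR query.
class NearSketch {
    // Returns true if some position in `first` is within maxDistance of some
    // position in `second`. Both lists must be sorted in increasing order.
    static boolean near(List<Integer> first, List<Integer> second, int maxDistance) {
        int i = 0;
        int j = 0;
        while (i < first.size() && j < second.size()) {
            int a = first.get(i);
            int b = second.get(j);
            if (Math.abs(a - b) <= maxDistance) {
                return true;
            }
            // Advance whichever list currently holds the smaller position.
            if (a < b) {
                i++;
            } else {
                j++;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Integer> robot = List.of(4, 17);  // made-up locations of "robot"
        List<Integer> ninjas = List.of(9, 30); // made-up locations of "ninjas"
        System.out.println(near(robot, ninjas, 5)); // true  (4 and 9 are 5 apart)
        System.out.println(near(robot, ninjas, 3)); // false (closest pair is 5 apart)
    }
}
```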

Given the following paragraph of text, go through and index the order of all the words (start from 0):
How much wood could a woodchuck chuck, if a woodchuck could chuck wood?
As much wood as a woodchuck would, if a woodchuck could chuck wood.

Fill in the following table of words with a list of their indexed locations from the previous question.

Word	Count	Frequency	Locations
a
as
chuck
could	3	0.115	3, 10, 23
how	1	0.038	0
if
much
wood
woodchuck
would

Explain how you could use the information in this table to find out if a given phrase (e.g., "chuck wood") exists within a body of text.

Part 2 - AVL Trees

First you'll be completing an implementation of TreeMap called MyTreeMap. Most of the implementation is already provided for you, but there are a few things you still need to finish. MyTreeMap is just an AVL tree in disguise, where the nodes in the tree contain "key-value" pairs. You place items in the tree ordered by their key, and this key has an associated value tagging along with it. Now, the key can be any reference type, and so can the value. Therefore, our MyTreeMap class will be parameterized by not one but two generic types: K for the key, and V for the value.
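To show what "parameterized by two generic types" looks like from the caller's side, here is a short sketch using Java's built-in java.util.TreeMap<K, V>. Note that java.util.TreeMap is backed by a red-black tree rather than an AVL tree, and MyTreeMap may not offer exactly the same methods; the class name TreeMapUsageSketch and the sample words are invented for this example.

```java
import java.util.TreeMap;

// Hypothetical class; uses java.util.TreeMap as a stand-in for MyTreeMap.
class TreeMapUsageSketch {
    // K is String (the key) and V is Integer (the value).
    static TreeMap<String, Integer> sampleCounts() {
        TreeMap<String, Integer> counts = new TreeMap<>();
        counts.put("robot", 2);
        counts.put("ninja", 5);
        counts.put("monkey", 1);
        counts.put("robot", 3); // same key again: the old value 2 is replaced
        return counts;
    }

    public static void main(String[] args) {
        TreeMap<String, Integer> counts = sampleCounts();
        System.out.println(counts);               // {monkey=1, ninja=5, robot=3}
        System.out.println(counts.get("robot"));  // 3
        System.out.println(counts.get("pirate")); // null (key not present)
    }
}
```

The entries print in key order because the map keeps its keys sorted, which is the same ordering property the AVL tree maintains.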

Give a recursive algorithm for doing a get(key) operation on a recursive AVL tree. Assume that each node has fields left and right for its subtrees, and key and value for its key-value pair. Be aware that all keys must be Comparable and therefore have the compareTo() method. Return null if the key isn't in the tree, or the associated value if it is.
Assume that we are at an unbalanced node, which we refer to as node (we called it Z in lecture). Specifically, its left child (Y) is taller than its right child, and Y has a taller right child (X) than left child, so a double rotation is needed to fix it. Assume the same node fields as above, plus a pre-calculated height field. (Section 11.2 of the text may be helpful.)
```
          Z     <== This is "node"
        /   \
       Y     t3
      / \
     t0  X
        / \
      t1   t2
```
Give the two "if" statements whose conditions would both be true for this specific height condition (i.e., determine node's taller child and then that child's taller child).
Assume we are restructuring the previously described unbalanced tree (Z, Y, X) into the following rebalanced tree:
```
          b
        /   \
       a     c
      / \   / \
     t0 t1 t2 t3
```
Give values for a, b, and c in terms of node (node.left.right, node.right, etc.). Remember: node is Z.
What are the values for t0, t1, t2, and t3 then?
Hint: t0 = node.left.left;

Part 3 - MyTreeSet

In this section you'll be implementing your own version of TreeSet using a MyTreeMap as the backing storage.

Recall that in a Set, you only keep one copy of any item added. With a working MyTreeMap, implementing a MyTreeSet is pretty straightforward.
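The "one copy of any item" behavior can be seen with Java's built-in java.util.TreeSet, whose add method returns false when the item is already present. This sketch only demonstrates the expected behavior of a Set; it does not show how MyTreeSet should be implemented. The class name SetBehaviorSketch and the helper addEach are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical class; demonstrates Set behavior with java.util.TreeSet.
class SetBehaviorSketch {
    // Adds each item in turn and records what add() returned for it.
    static List<Boolean> addEach(Set<String> set, String... items) {
        List<Boolean> results = new ArrayList<>();
        for (String item : items) {
            results.add(set.add(item));
        }
        return results;
    }

    public static void main(String[] args) {
        Set<String> words = new TreeSet<>();
        System.out.println(addEach(words, "wood", "chuck", "wood")); // [true, true, false]
        System.out.println(words);                                   // [chuck, wood]
    }
}
```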

Assuming you have a working MyTreeMap, briefly explain how you could use a Map (which has get(key) and put(key,value) methods) to implement a Set that has a public method boolean add(item) that returns true if the item wasn't already in the Set.

Last Modified: October 15, 2015 - Benjamin A. Kuperman