In this (bonus) prelab, you will familiarize yourself with some of the design and implementation issues of the upcoming lab 6. Please write or type up your solutions, and hand in a paper copy before class on Wednesday. You are not required to do this prelab; if you choose to do so, your score will count as bonus points in your final grade calculation.
This lab is the first part of a series of related labs about the World Wide Web. Our ultimate goal is to build a search engine for a limited portion of the web, with some of the features of common search engines such as Google, Yahoo!, or Bing.
In this first step, you will use an AVL tree to create an index of all the words contained on a webpage. You will then be able to query the index to find out how frequently a word appears on a page and in what locations.
You may work with a partner on this lab (although the prelab must be done individually).
You probably use a search engine such as Google or Yahoo! at least 10 times on any given day to navigate through all the information on the web. When you type in a search query such as "robot ponies" (as one does), a good search engine will show you what it believes to be the most relevant websites for your robot pony information quest. What motivates a search engine company to provide quality "hits"? Do they do it out of the goodness of their itty bitty computer scientist hearts? Maybe a little. But mostly, they do it for profit: when the search engine lists its most relevant websites, it also displays advertisements relevant to people interested in robot ponies. For every "ad-click," the search engine receives revenue (i.e. moolah). Therefore, it behooves the search engine to actually list websites and advertisements that are most relevant to your search query: the better the selection, the more likely you are to click on the related banner ads, and the more likely you are to use their service for a future search, say on mechanical puppies.
So the million dollar question is: how do the search engines produce their list of relevant urls? They aren't likely to divulge such proprietary (and money-making) secrets to us. But that won't stop us from exploring some of the possibilities...
In the upcoming lab, we will make two simplistic attempts at determining the relevance of a given url for a specific topic. In particular, given a search query word, we say that the website with the highest frequency of the word is the most relevant, where the frequency of a word on a page is the number of times it appears on the page divided by the total number of words on the page. The intuition here is that there is often a correlation between the actual words on a website and its content. For example, a website with 20% of its words being "monkey" is more likely to be a good source on monkeys than a website with only 1% of its words being "monkey". (Of course, there are exceptions. This page has already mentioned "monkey" far more times than necessary. Monkey monkey monkey monkey.)
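To make the frequency definition above concrete, here is a minimal sketch. The class name `WordFrequency` and the method `frequency` are made up for illustration and are not part of the lab code; the sketch also assumes the page has already been split into a list of words and that matching ignores case.

```java
import java.util.Arrays;
import java.util.List;

class WordFrequency {
    // Frequency = (occurrences of query) / (total number of words on the page).
    // Hypothetical helper; assumes case-insensitive matching.
    static double frequency(List<String> words, String query) {
        if (words.isEmpty()) {
            return 0.0; // avoid dividing by zero on an empty page
        }
        long count = words.stream()
                          .filter(w -> w.equalsIgnoreCase(query))
                          .count();
        return (double) count / words.size();
    }

    public static void main(String[] args) {
        List<String> page = Arrays.asList("the", "monkey", "ate", "a", "banana");
        // "monkey" is 1 of the 5 words on this tiny page.
        System.out.println(frequency(page, "monkey"));
    }
}
```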
Therefore, our task is as follows: given a website's url (or, a plain ol' file), we will calculate the word frequency of the words contained therein. This is our first step towards world (web) domination.
First up is an implementation of TreeMap called MyTreeMap. MyTreeMap is just an AVL tree in disguise, where the nodes in the tree contain "key-value" pairs. You place items in the tree ordered by their key, and this key has an associated value tagging along with it. For example, the following is a MyTreeMap containing (key,value) pairs. Note that the items are ordered by their keys (the animal names), and not by their values (the numbers).
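If you want to see the "ordered by key, not by value" behavior in action, the sketch below uses Java's built-in `java.util.TreeMap` as a stand-in for MyTreeMap. The animal names and numbers are invented for this example, and `KeyOrderDemo` is a hypothetical class name.

```java
import java.util.Map;
import java.util.TreeMap;

class KeyOrderDemo {
    // Builds a small map; insertion order is deliberately scrambled.
    static TreeMap<String, Integer> sampleMap() {
        TreeMap<String, Integer> map = new TreeMap<>();
        map.put("zebra", 1);
        map.put("aardvark", 42);
        map.put("monkey", 7);
        return map;
    }

    public static void main(String[] args) {
        // Iteration follows the keys' natural (alphabetical) order,
        // regardless of the values or the order of insertion.
        for (Map.Entry<String, Integer> entry : sampleMap().entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}
```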
Now, the key can be any reference type that is Comparable, and the value can be any reference type; it needn't be Comparable. Therefore, our MyTreeMap class will be parameterized by not one but two generic types: K for the key, and V for the value. The class header will therefore be
public class MyTreeMap<K,V> extends AbstractMap<K,V>
(In actual fact, we want our type K to be Comparable, and therefore our class header will be
public class MyTreeMap<K extends Comparable<? super K>,V> extends AbstractMap<K,V>
but let's not worry about that too much right now.)
In the lab itself, parts of the MyTreeMap implementation will be given to you, and you will fill in the remainder. Unlike the binary tree lab, we will have a single non-abstract MyTreeMap class. Each MyTreeMap contains a key and value pair, and references to its left and right subtrees (which are also MyTreeMaps). An empty MyTreeMap is one whose left and right subtrees are both null; a leaf MyTreeMap is one whose left and right subtrees are non-null, but are themselves empty MyTreeMaps.
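The empty-versus-leaf distinction can be sketched as follows. This is only an illustration under assumptions: the class `SketchTreeMap`, its field names, and the `isLeaf` helper are hypothetical, it does not extend AbstractMap, and it does no AVL balancing. The code you are given in lab may be organized differently.

```java
class SketchTreeMap<K extends Comparable<? super K>, V> {
    K key;
    V value;
    SketchTreeMap<K, V> left;   // null only when this tree is empty
    SketchTreeMap<K, V> right;  // null only when this tree is empty

    // An empty tree: no pair stored, both subtrees null.
    SketchTreeMap() {
    }

    // A leaf: holds one pair, and both subtrees are empty trees (not null).
    SketchTreeMap(K key, V value) {
        this.key = key;
        this.value = value;
        this.left = new SketchTreeMap<>();
        this.right = new SketchTreeMap<>();
    }

    boolean isEmpty() {
        return left == null && right == null;
    }

    boolean isLeaf() {
        return !isEmpty() && left.isEmpty() && right.isEmpty();
    }
}
```

One consequence of this design is that methods can be called on a subtree without first checking for null, since every non-empty tree has two real (possibly empty) subtrees.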
You will be required to complete the implementation of the following three methods.
(key, value) mapping into the map, ordered by its key.
In this lab, we'll be trying out a technique that is used by most search engines by building an index of the words on a page. For each word we encounter on a web page, we'll keep track of what order we encountered it in (0th, 1st, 2nd, 3rd, etc.) and keep a list of all the locations for each word.
Certainly, having a list of the locations also tells us how many times a query word appears: it is simply the length of the list. This gives us one crude measure by which to evaluate a page's relevance to a given query word.
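Here is one way the index described above might look in code. It is a sketch under assumptions: it uses `java.util.TreeMap` as a stand-in for MyTreeMap, the class `WordIndex` and method `buildIndex` are hypothetical names, and it treats any run of non-letters as a word separator and ignores case, which may not match the tokenizing rules used in the lab.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class WordIndex {
    // Maps each word to the list of positions (0th, 1st, 2nd, ...) where it occurs.
    static Map<String, List<Integer>> buildIndex(String text) {
        Map<String, List<Integer>> index = new TreeMap<>();
        int position = 0;
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            if (token.isEmpty()) {
                continue; // split can produce an empty first token
            }
            index.computeIfAbsent(token, k -> new ArrayList<>()).add(position);
            position++;
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> index = buildIndex("The cat saw the other cat.");
        System.out.println(index.get("cat"));        // prints [1, 5]
        System.out.println(index.get("cat").size()); // prints 2, the occurrence count
    }
}
```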
Of course, if the query is more than one word, we need to be careful. If someone is looking for "robot ponies", then you would need to know where the words "robot" and "ponies" appear on the page, and determine whether they appear sequentially in the correct order. Some search engines also support the ability to search for words NEAR each other.
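Given the location lists for two words, both kinds of check can be sketched as below. The class `PhraseSearch` and its method names are invented for this example, and the definition of NEAR used here (positions differing by at most some maximum distance) is an assumption, not how any particular search engine defines it.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class PhraseSearch {
    // True if some occurrence of the first word is immediately
    // followed by an occurrence of the second word.
    static boolean appearsAdjacent(List<Integer> first, List<Integer> second) {
        Set<Integer> secondPositions = new HashSet<>(second);
        for (int p : first) {
            if (secondPositions.contains(p + 1)) {
                return true;
            }
        }
        return false;
    }

    // True if some pair of occurrences is within maxDistance words, in either order.
    static boolean appearsNear(List<Integer> first, List<Integer> second, int maxDistance) {
        for (int p : first) {
            for (int q : second) {
                if (Math.abs(p - q) <= maxDistance) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

For instance, if "robot" appears at positions 4 and 9 and "ponies" appears at positions 2 and 10, then `appearsAdjacent` returns true because of the pair (9, 10).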
How much wood could a woodchuck chuck, if a woodchuck could chuck wood?
As much wood as a woodchuck would, if a woodchuck could chuck wood.
Fill in the following table of words with a list of their indexed locations from the previous question.
| Word | Count | Frequency | Locations |
|---|---|---|---|
| could | 3 | 0.115 | 3, 10, 23 |
In this section you'll be implementing your own version of TreeSet using a MyTreeMap as the backing storage.
Recall that a Set is a collection of unique items; that is, it stores items of one type K, and keeps at most one copy of any item added. With a working MyTreeMap<K,V>, implementing a MyTreeSet<K> is pretty straightforward.
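The idea of a set backed by a map can be sketched as follows: store each item as a key, and use a throwaway placeholder as its value. This sketch substitutes `java.util.TreeMap` for MyTreeMap, and the class `SketchTreeSet`, the `PRESENT` placeholder, and the choice of methods are assumptions for illustration; the MyTreeSet you write in lab may be structured differently.

```java
import java.util.TreeMap;

class SketchTreeSet<K extends Comparable<? super K>> {
    // Dummy value shared by every key; only the keys matter.
    private static final Object PRESENT = new Object();

    // Stand-in for the MyTreeMap backing storage.
    private final TreeMap<K, Object> map = new TreeMap<>();

    // Returns true if the item was not already in the set.
    boolean add(K item) {
        return map.put(item, PRESENT) == null;
    }

    boolean contains(K item) {
        return map.containsKey(item);
    }

    int size() {
        return map.size();
    }
}
```

Because a map holds at most one entry per key, adding the same item twice leaves the set unchanged, which is exactly the uniqueness a Set requires.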