CSCI 151 - Lab 8 - Spring 2009 Caching and a Web Interface

11:59.59pm, Sunday, April 19


You may work with one partner on this assignment.

Part 1 is to implement a hashtable using the chaining method. Part 2 is to use the hashtable to improve the performance of the application by keeping track of duplicate URLs and by caching the results of search queries. Part 3 of the assignment is to build a GUI front-end to your search engine that can also perform the fetching and displaying of URLs similar to a web browser.


Part 1: Implement MyHashMap

Create a MyHashMap class that implements a hashtable utilizing chaining. The table consists of an array of LinkedLists (or ArrayLists).

The methods are a subset of the java.util.Map interface. Note that you aren't expected to implement the interface, just use it for reference if needed. You should use equals() to determine if keys are equal. Don't assume that keys implement Comparable. Attempts to insert null values or keys should generate a NullPointerException.

Constructors

MyHashMap(int capacity, float load_factor)
Create a hashtable with capacity buckets and a maximum load factor of load_factor.
MyHashMap()
Create a hashtable of size 11 with a load factor of 0.75.

Methods from java.util.Map

int size()
return the number of items in the hashtable
boolean isEmpty()
true if size==0, false otherwise
void clear()
empties out the hashtable
V put(K key, V value)
associates the specified value with the given key. It returns the previous value associated with the key, or null if there was no mapping.
V get(K key)
returns the value associated with the key. If no value exists, it returns null.
V remove(K key)
deletes the mapping (key,value) from the hashtable. It returns the previous value or null if there was no such value.
boolean containsKey(K key)
returns true if the specified key is in the table
boolean containsValue(V value)
returns true if the specified value is in the table

Methods not in java.util.Map

Iterator<K> keys()
creates an iterator of all the keys in the hashtable.
Iterator<V> values()
creates an iterator of all the values in the hashtable.

Your hashtable must be able to dynamically resize itself after the load factor threshold has been reached (remember that it needs to rehash all the existing entries). Write a private resize() method to do this and call it whenever the maximum load factor is exceeded. The resize method should always at least double the array size and maintain an odd size. It is best to resize to a prime number. Remember, to determine if a number is prime, you need to test that it has no factors less than or equal to its square root. Here is a list of primes each at least twice the value of the previous if you just want to do a table lookup:

11 
23 
47 
97 
197 
397 
797 
1597 
3203 
6421 
12853 
25717 
51437 
102877 
205759 
411527 
823117 
1646237 
3292489 
6584983 
13169977 
26339969 
52679969 
105359939 
210719881 
421439783 
842879579 
1685759167 
3371518343 

Programming notes:

You can build your hashtable on top of either an array or ArrayList. I think that an array would be much simpler. When resizing, remember that not every item in a bucket-chain has the same key.

To compute a hash function, first apply the hashCode() method and then apply the "% tableSize" operation -- be sure the result is non-negative.

You may find it very useful to create an Entry class to use to store both key and value, and then just store Entry data items in the table. You will probably want to override the hashCode() and equals() methods such that they apply only to the key.


Part 2: Caching

Use your hashtable data structure for two purposes:

  1. Detecting duplicate entries in the url_list.

    If the url_list contains a duplicate entry, there is no need to build its word frequency table, etc. To accomplish this, just create an empty hashtable and add in every URL string as a key with something (other than null) for a value. If the result of the put() operation is not null, you know that you have already seen the string before.

  2. Caching query results.

    Use a hashtable to store the results of each query. If the same query is entered again, it can be handled by a table lookup rather than performing the query again. In your table, use a query string or word list as the key and store as the value at least the search results, but possibly all the information on the state of things to display in the GUI.

When a duplicate URL is found while reading url_list, you should print a message on the console similar to

Duplicate URL (line 20): http://www.cs.oberlin.edu/

When processing a query, the query result display area should include a line indicating whether or not the query results were found in the cache. For example,

Results for your query "book":
------------------------------------------------
Query: book
    Query results NOT found in cache

Relevant pages:
(priority =    4) http://www.cs.oberlin.edu/~kuperman/csci151/
(priority =    4) http://www.cs.oberlin.edu/~kuperman/csci317/
(priority =    1) http://www.cs.oberlin.edu/

Search for: book

Results for your query "book":
------------------------------------------------
Query: book
    Query results found in cache

Relevant pages:
(priority =    4) http://www.cs.oberlin.edu/~kuperman/csci151/
(priority =    4) http://www.cs.oberlin.edu/~kuperman/csci317/
(priority =    1) http://www.cs.oberlin.edu/

Note that the first time the query is entered, the results are not in the cache, but the second time the results are. Unless you have modified the search query system to take word order into account, you should be able to find previous results based on any ordering of search terms. There are many possible ways to do this, but the simplest include:

  1. Sort the words as part of the search query.
  2. Ensure that the hashCode() of the search terms does not depend on word order.

Part 3: GUI

Build a GUI front-end for your search engine. It should have at least the following components:

  1. A query text box -- so the user can type in a query
  2. A search button
  3. A URL text box -- where the user can type in a URL
  4. A fetch URL button
  5. A display area for the search results
  6. A display area for the fetched URL

However, you may also improve upon the design as you see fit.

The WebBrowser's main method should require a url_list and ignore_list as command line arguments, and optionally take a count parameter to ignore words of limited frequency. The web browser acts as a front end to the code you wrote in the WordFrequencyTree and ProcessQueries labs.

The GUI should operate as follows:

Programming Notes

There is a simple web browser in lab8.jar that you can look at if you want. Or you can refer back to the code I provided during the lectures on GUIs: gui-demo.tgz

Appendix B of your text has more information on GUIs too.

Since the main method of the application is now in the WebBrowser class, you may need to modify the code you wrote for the previous lab. Specifically, you need a ProcessQueries object that can respond to requests from the web browser.

The main method in the web browser can instantiate the ProcessQueries object and give it the information it needs to get started. The ProcessQueries object should have (at least) a method which will allow the browser to obtain the results of a query.

Remember that it is much faster to generate intermediate results using a StringBuffer and then calling a toString on that StringBuffer than it is to simply concatenate a long sequence of Strings with the plus operator.

Remember that the JEditorPane class interprets HTML. You might want to modify some text based methods to embed HTML tags to make things look spiffier.


What to Hand In

Use handin to submit the following files:

  1. All .java files necessary for compiling your code (including those from previous labs)
  2. A README file with:
    1. Your name (and your partner's name if you had one)
    2. The name of the class containing your GUI program and how to invoke it.
    3. Any known problems or interesting design decisions that you made

If you work with a partner, just submit one solution per team.


Last Modified: April 12, 2009 - Benjamin A. KupermanVI Powered