11:59.59pm, Sunday, April 19
You may work with one partner on this assignment.
Part 1 is to implement a hashtable using the chaining method. Part 2 is to use the hashtable to improve the performance of the application by keeping track of duplicate URLs and by caching the results of search queries. Part 3 of the assignment is to build a GUI front-end to your search engine that can also perform the fetching and displaying of URLs similar to a web browser.
Create a MyHashMap class that implements a hashtable utilizing chaining. The table consists of an array of LinkedLists (or ArrayLists).
The methods are a subset of the
java.util.Map
interface. Note that you aren't expected to implement the interface, just
use it for reference if needed.
You should use equals() to determine if keys are equal.
Don't assume that keys implement Comparable.
Attempts to insert null values or keys should generate a
NullPointerException.
Your hashtable must be able to dynamically resize itself after the load factor threshold has been reached (remember that it needs to rehash all the existing entries). Write a private resize() method to do this and call it whenever the maximum load factor is exceeded. The resize method should always at least double the array size and maintain an odd size. It is best to resize to a prime number. Remember, to determine if a number is prime, you need to test that it has no factors less than or equal to its square root. Here is a list of primes each at least twice the value of the previous if you just want to do a table lookup:
11 23 47 97 197 397 797 1597 3203 6421 12853 25717 51437 102877 205759 411527 823117 1646237 3292489 6584983 13169977 26339969 52679969 105359939 210719881 421439783 842879579 1685759167 3371518343
You can build your hashtable on top of either an array or ArrayList. I think that an array would be much simpler. When resizing, remember that not every item in a bucket-chain has the same key.
To compute a hash function, first apply the hashCode() method and then apply the "% tableSize" operation -- be sure the result is non-negative.
You may find it very useful to create an Entry class to use to store both key and value, and then just store Entry data items in the table. You will probably want to override the hashCode() and equals() methods such that they apply only to the key.
Use your hashtable data structure for two purposes:
Detecting duplicate entries in the url_list.
If the url_list contains a duplicate entry, there is no need to build its word frequency table, etc. To accomplish this, just create an empty hashtable and add in every URL string as a key with something (other than null) for a value. If the result of the put() operation is not null, you know that you have already seen the string before.
Caching query results.
Use a hashtable to store the results of each query. If the same query is entered again, it can be handled by a table lookup rather than performing the query again. In your table, use a query string or word list as the key and store as the value at least the search results, but possibly all the information on the state of things to display in the GUI.
When a duplicate URL is found while reading url_list, you should print a message on the console similar to
Duplicate URL (line 20): http://www.cs.oberlin.edu/
When processing a query, the query result display area should include a line indicating whether or not the query results were found in the cache. For example,
Results for your query "book":
------------------------------------------------
Query: book
Query results NOT found in cache
Relevant pages:
(priority = 4) http://www.cs.oberlin.edu/~kuperman/csci151/
(priority = 4) http://www.cs.oberlin.edu/~kuperman/csci317/
(priority = 1) http://www.cs.oberlin.edu/
Search for: book
Results for your query "book":
------------------------------------------------
Query: book
Query results found in cache
Relevant pages:
(priority = 4) http://www.cs.oberlin.edu/~kuperman/csci151/
(priority = 4) http://www.cs.oberlin.edu/~kuperman/csci317/
(priority = 1) http://www.cs.oberlin.edu/
Note that the first time the query is entered, the results are not in the cache, but the second time the results are. Unless you have modified the search query system to take word order into account, you should be able to find previous results based on any ordering of search terms. There are many possible ways to do this, but the simplest include:
Build a GUI front-end for your search engine. It should have at least the following components:
However, you may also improve upon the design as you see fit.
The WebBrowser's main method should require a url_list and ignore_list as command line arguments, and optionally take a count parameter to ignore words of limited frequency. The web browser acts as a front end to the code you wrote in the WordFrequencyTree and ProcessQueries labs.
The GUI should operate as follows:
There is a simple web browser in lab8.jar that you can look at if you want. Or you can refer back to the code I provided during the lectures on GUIs: gui-demo.tgz
Appendix B of your text has more information on GUIs too.
Since the main method of the application is now in the WebBrowser class, you may need to modify the code you wrote for the previous lab. Specifically, you need a ProcessQueries object that can respond to requests from the web browser.
The main method in the web browser can instantiate the ProcessQueries object and give it the information it needs to get started. The ProcessQueries object should have (at least) a method which will allow the browser to obtain the results of a query.
Remember that it is much faster to generate intermediate results using a StringBuffer and then calling a toString on that StringBuffer than it is to simply concatenate a long sequence of Strings with the plus operator.
Remember that the JEditorPane class interprets HTML. You might want to modify some text based methods to embed HTML tags to make things look spiffier.
Use handin to submit the following files:
If you work with a partner, just submit one solution per team.