11:59.59pm, Sunday, April 13
You may work with one partner on this assignment.
For this program you will implement part of a web search engine that orders web pages based on how well they match a search query. A query consists of a list of word to search for. The best match is the web page with the highest word frequency counts for the words in the query string. Your main class for this assignment should be called ProcessQueries and will be called as follows:
java ProcessQueries urlListFile ignoreFile [count]
A URL can be written in many different formats. See the documentation for java.net.URL for details of those supported. You will probably want to consider
An example urlListFile might contain:
http://occs.cs.oberlin.edu/faculty http://www.cs.oberlin.edu/~rms/ http://cs.oberlin.edu/~jdonalds http://www.cs.oberlin.edu/~asharp/ http://www.cs.oberlin.edu/~kuperman/
The ignoreFile should contain a list of words that you would like to ignore as you count word frequencies in html files (just as you did in the last lab).
In order to process queries from a user, you'll need to create a new class that combines a URL string with a WordFrequencyTree representing that web page's content. Call this class URLContent. Your program should create a list of URLContent objects, one for each URL that appears in the urlListFile. Construct a WordFrequencyTree for each URL in the list.
Once you have processed all the URLs in the list (you should gracefully handle invalid URLs), your program will enter a loop as shown below, which prompts the user to enter a search query (or -1 to quit), and then lists all URLs that match the query in order of the best match first and the worst match last. Include each result URL's priority in parenthesis with each result. URLs of web pages that do not contain any of the words in the query should not appear in the result list.
% java ProcessQueries urls-catalog ignorefile 10 Processing large number of URLS Processed 500 done. Enter a query on one line or -1 to quit Search for: csci Relevant pages: (priority = 63) http://www.oberlin.edu/catalog01/college/compsci.html (priority = 33) http://www.oberlin.edu/catalog02/college/compsci.html (priority = 25) http://www.oberlin.edu/catalog/college/compsci.html (priority = 20) http://www.oberlin.edu/catalog04/college/compsci.html (priority = 16) http://www.oberlin.edu/catalog03/college/compsci.html (priority = 2) http://www.oberlin.edu/catalog/college/quantitative.html (priority = 1) http://www.oberlin.edu/catalog04/college/neuro.html (priority = 1) http://www.oberlin.edu/catalog04/college/quantitative.html (priority = 1) http://www.oberlin.edu/catalog01/con/conreq/musiceducation.html (priority = 1) http://www.oberlin.edu/catalog01/college/neuro.html Search for: computer science Relevant pages: (priority = 129) http://www.oberlin.edu/catalog02/college/compsci.html (priority = 115) http://www.oberlin.edu/catalog01/college/compsci.html (priority = 115) http://www.oberlin.edu/catalog03/college/compsci.html (priority = 108) http://www.oberlin.edu/catalog/college/compsci.html (priority = 104) http://www.oberlin.edu/catalog04/college/compsci.html (priority = 20) http://www.oberlin.edu/catalog04/college/quantitative.html (priority = 20) http://www.oberlin.edu/catalog/college/quantitative.html (priority = 19) http://www.oberlin.edu/catalog03/college/quantitative.html (priority = 18) http://www.oberlin.edu/catalog02/college/quantitative.html (priority = 18) http://www.oberlin.edu/catalog01/college/quantitative.html Search for: kuperman Relevant pages: Search for: underwater basket weaving Relevant pages: (priority = 1) http://www.oberlin.edu/catalog04/college/chem.html (priority = 1) http://www.oberlin.edu/catalog01/college/athlet.html (priority = 1) http://www.oberlin.edu/catalog03/college/chem.html (priority = 1) http://www.oberlin.edu/catalog/college/fyearsem.html (priority = 1) http://www.oberlin.edu/catalog03/college/athlet.html (priority = 1) http://www.oberlin.edu/catalog/college/chem.html (priority = 1) http://www.oberlin.edu/catalog02/college/chem.html (priority = 1) http://www.oberlin.edu/catalog04/college/fyearsem.html (priority = 1) http://www.oberlin.edu/catalog02/college/athlet.html (priority = 1) http://www.oberlin.edu/catalog04/college/athlet.html Search for: -1
To find the results of the query in order, construct a priority queue of URLContent objects, one per web page. The priority value should be computed by adding the frequencies of the words in the query. The priority queue can be used to print out the matching URLs in order.
To compare URLContent objects, use a Comparator. Comparator<E> is a java interface which contains the method
public int compare(E item1, E item2);
The compare method returns a negative integer if item1<item2, a
positive integer if item1>item2, and zero if
item1==item2.
The lab jar file contains a sample Comparator called StringComparator.java. You need to write your own Comparator to compare URLContent objects, based on the current query.
What you need from the previous Lab:
What you are given in the lab7.jar file:
docs folder has the javadoc for this.What you need to write:
Start by implementing the MyPriorityQueue class. Test that it works before moving on to the next part. (A good test program would create a MyPriorityQueue of the Strings on the command line and then, one by one, remove them from the queue and print them. You can use my StringComparator to perform the comparisons.)
Write the URLContent class. It should hold a URL string and the WordFrequencyTree for that URL.
Next, implement the part of your main method that processes the urlListFile. For each URL read in, create the appropriate HTMLScanner to process it. Then build a table of word frequencies for that file. Put all of the URLContent objects in a list.
Create a comparator that can compare two URLContent objects based on a list of Strings input by the user.
Next, implement that part of the main method that reads in a search query, uses that to construct a comparator, and then builds a heap of URLContent objects using that comparator.
The main method then prints out the matching URLs in order from best to worst match until there are no matches or the user specified limit is reached. Continue reading and processing queries in a loop until you reach end of file on System.in or some designated terminator string (e.g., "-1").
Your program should handle multiple word queries, and return the best matches based on all words in the query. For example, the query "computer science department" should search each URL's WordFrequencyTree for all three words to determine the URL's priority.
You might want to have your constructor for your URLComparator take a String argument. (You should check for the terminator string before creating the comparator.) The HTMLScanner class can also parse a String into tokens, and if you use it to break up the user query, you'll have the same set of tokens as you put into the WordFrequencyTree.
Don't "drop" the results as you are pulling them out of the heap. Just stick them in a list of some sort as you remove them and add them back in afterwards.
When you change a comparator in the heap, you need to reheapify things. You could just create a new ArrayList and add in all the old items. A better choice would be to reheapify using the linear time algorithm discussed in class.
You might run into a message that indicates that you have too many files open. Every HTMLScanner you create (or Scanner to read a file) uses one of a limited number of file descriptor slots. You should get rid of the reference to either the Scanner or the HTMLScanner when you are done using it. If you keep them in your WordFrequencyTree, you will run out of descriptors to use.
You might also run out of memory on a large url-list file if you run Java in the default manner (it only allocates 64MB of RAM). You can increase the amount of memory given to the Java Virtual Machine with an option to the java command. For example,
java -Xmx1g ProcessQueries urls_all ignorefile 10
where the -Xmx indicates that you want it to use more memory, and the 1g indicates to use up to 1 gigabyte.
When you are dealing with a large number of urls, you might want to print out a status message every N items just to let yourself know that things are still progressing. You can be extra fancy if you use "\r" in a System.out.print() statement. Try out the following:
public static void main(String[] args) throws InterruptedException {
for (int i=0; i<10; i++) {
Thread.sleep(500); // causes the exception, just slows things down
System.out.print("\rCounting up to " + i );
}
System.out.println();
}
Use the handin program to submit a directory containing
If you work with a partner, please only one of you submit your joint solution using handin.
Here are a few suggestions as to how you might improve your search engine.
If you try any of these, or come up with another technique, write something in your README file to describe what you did, how difficult it was, and how well (if at all) it improved your search results.