Lab 7

Processing Web Search Queries
Due by 8pm, Sunday 11 Nov 2012

In the previous lab, you created a WebPageIndex class that represents the data from a single document (either a local file or URL). In this lab, you will create a collection of those indexes, and then determine which of the collection best matches what a user is searching for.

The purpose of this lab is to:

You may work with a partner on this assignment.

Part 0 - Project Description

In this lab you will implement part of a web search engine that, like Google or Bing, orders web pages based on how well they match a search query. A query consists of a list of words and phrases to search for. The best match is the web page with the highest word frequency counts for the words in the query string. Your main class for this assignment should be called ProcessQueries and will be called as follows:

java ProcessQueries urlListFile [count]

A URL can be written in many different formats. See the documentation for java.net.URL for details of those supported. You will probably want to consider

An example urlListFile such as urls-profs might contain:

http://occs.cs.oberlin.edu/faculty
http://www.cs.oberlin.edu/~asharp/
http://www.cs.oberlin.edu/~bob/
http://www.cs.oberlin.edu/~ctaylor/
http://www.cs.oberlin.edu/~kuperman/
http://www.cs.oberlin.edu/~scrain/
http://www.cs.oberlin.edu/~wexler/

You already have a class WebPageIndex from the previous lab that represents the index of words on a given webpage (what words are on the page, and in what frequency). It also stores the URL that the index was constructed from in order to display it to the user. Your program will go through the urlListFile and attempt to create a WebPageIndex for each of the items listed there.

Once you have processed all the URLs in the list (you should gracefully handle invalid URLs), your program will enter a loop as shown below, which prompts the user to enter a search query (or -1 to quit), and then lists all URLs that match the query, from best match to worst. Include each result URL's priority in parentheses with each result. URLs of web pages that do not contain any of the words in the query should not appear in the result list. In effect, you are performing a Google-like search of the given query on a restricted subset of pages (the ones in your url list file). Wow!

% java -Xmx4g -classpath csci151lab6.jar:. ProcessQueries urls-oberlinreview.org 10
Fetched: 1185     Errors: 0	 out of 1185
Enter a query on one line or -1 to quit

Search for: computer science
Relevant pages:
(priority = 11) http://www.oberlinreview.org/article/editorial-athletes-vs-mathletes/
(priority = 10) http://www.oberlinreview.org/article/cuff-john-harwood/
(priority = 9) http://www.oberlinreview.org/article/visiting-speaker-inspects-myspace-friendships/
(priority = 8) http://www.oberlinreview.org/article/editorial-barbie-free-response-professor-mehta/
(priority = 8) http://www.oberlinreview.org/article/barbie-free-response-professor-mehta/
(priority = 8) http://www.oberlinreview.org/article/editorial-board-editorial-board-has-sex/comments/
(priority = 5) http://www.oberlinreview.org/article/editorial-fearless-not-re-internet/
(priority = 4) http://www.oberlinreview.org/article/student-project-creates-search-engine-alternative-/
(priority = 4) http://www.oberlinreview.org/article/review-vs-internet/
(priority = 4) http://www.oberlinreview.org/article/luminary-jaron-lanier-unites-digital-media-music-d/

Search for: "computer science"
Relevant pages:
(priority = 3) http://www.oberlinreview.org/article/visiting-speaker-inspects-myspace-friendships/
(priority = 2) http://www.oberlinreview.org/article/student-project-creates-search-engine-alternative-/
(priority = 1) http://www.oberlinreview.org/article/editorial-fearless-not-re-internet/
(priority = 1) http://www.oberlinreview.org/article/oberlin-project-envisions-new-future-city-and-coll/
(priority = 1) http://www.oberlinreview.org/article/review-vs-internet/
(priority = 1) http://www.oberlinreview.org/author/17/

Search for: -1

Thank you and good-bye!

% java -Xmx4g -classpath jsoup-1.6.1.jar:. ProcessQueries urls-cs 6 
Fetched: 733	 Errors: 5   out of 738
Enter a query on one line or -1 to quit

Search for: csci
Relevant pages:
(priority = 18) http://www.cs.oberlin.edu/~asharp/courses.html
(priority = 18) https://occs.cs.oberlin.edu/classes/
(priority = 18) http://occs.cs.oberlin.edu/classes/
(priority = 6) http://occs.cs.oberlin.edu/page/2/
(priority = 6) https://occs.cs.oberlin.edu/page/2/
(priority = 5) https://occs.cs.oberlin.edu/category/information/

Search for: "principles of computer science"
Relevant pages:
(priority = 4) https://occs.cs.oberlin.edu/classes/
(priority = 4) http://occs.cs.oberlin.edu/classes/
(priority = 1) http://www.cs.oberlin.edu/~wexler/cs150/index.html
(priority = 1) http://www.cs.oberlin.edu/~wexler/cs150
(priority = 1) http://www.cs.oberlin.edu/~asharp/cs151/
(priority = 1) http://www.cs.oberlin.edu/~asharp/cs151/2010sp/index.html

Search for: -1

Thank you and good-bye!

% java -Xmx4g -classpath jsoup-1.6.1.jar:. ProcessQueries urls-catalog-1213 5 
Fetched: 338	 Errors: 0   out of 338
Enter a query on one line or -1 to quit

Search for: "computer science"
Relevant pages:
(priority = 9) http://catalog.oberlin.edu/content.php?catoid=30&navoid=628
(priority = 3) http://catalog.oberlin.edu/content.php?catoid=30&navoid=637
(priority = 1) http://catalog.oberlin.edu/content.php?catoid=30&navoid=692
(priority = 1) http://catalog.oberlin.edu/content.php?catoid=30&navoid=630
(priority = 0) http://catalog.oberlin.edu/preview_course_nopop.php?catoid=30&coid=67394

Search for: extremely difficult class
Relevant pages:
(priority = 15) http://catalog.oberlin.edu/content.php?catoid=30&navoid=643
(priority = 13) http://catalog.oberlin.edu/content.php?catoid=30&navoid=633
(priority = 9) http://catalog.oberlin.edu/content.php?catoid=30&navoid=628
(priority = 8) http://catalog.oberlin.edu/content.php?catoid=30&navoid=692
(priority = 6) http://catalog.oberlin.edu/preview_course_nopop.php?catoid=30&coid=61616

Search for: -1


Important Files for this Lab

What you need from the previous lab:

There is also a jar file containing working versions of these classes that you may elect to use instead.

What you are given in the lab07.jar file:

What you will write in this lab:

Part 1 - MyPriorityQueue

Write a heap-based implementation of a PriorityQueue, extending AbstractQueue. It should contain an array or ArrayList to hold the data items in the priority queue and a Comparator to compare the relevance of two web pages to a given query. In addition to the interface methods, it needs to contain at least one constructor, presumably one that takes a Comparator as input. Remember that this is supposed to be a min-heap, so the smallest value is at the top.

There is a skeleton MyPriorityQueue java file provided for you in the jar file, if you want to use it. It is not much more than what eclipse would provide you, so it is up to you whether you start from scratch or from this file.

Don't forget to test each method you write, ideally with JUnit tests as you write it, before moving on to the next part.

You will need the following public and private methods:

public int size()
Return the number of items in the priority queue.

public void clear()
Efficiently empty your heap such that garbage collection can take place. Feel free to use methods in your nested data structures.

public T peek()
Return the highest priority (smallest value) item in the priority queue, without removing it.

public T poll()
Remove and return the highest priority (smallest value) item in the priority queue.
You will need to call your private percolateDown() method on the root after rearranging things.

public boolean offer(T item)
Add item in the correct place in the priority queue.
Return true if the item was correctly added, and false otherwise.

public Iterator<T> iterator()
Return a new Iterator over the items in the priority queue.
This iterator should be implemented as an anonymous class (see
prelab 04 for an example), and
can return the items in any order (including their order in the array.)

public void setComparator( Comparator<T> cmp )
Sets the class's comparator to cmp.
This changes the relationship between items in your heap, so
you need to reheapify things. Use the linear-time heapify method
we discussed in class.

private void percolateDown( int hole )
Percolate the item at position hole down through the heap.
Be careful to handle the case of a node with only one child.

private void percolateUp( int hole )
Percolate the item at position hole up through the heap.

private int parent(int x)
Return the index of the parent of the node at index x.

private int leftChild(int x)
Return the index of the left child of the node at index x.

private int rightChild(int x)
Return the index of the right child of the node at index x.
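
The index helpers and the two percolate methods above can be sketched on a zero-based ArrayList as follows. This is only an illustration under stated assumptions, not the provided skeleton: the class name HeapSketch is made up, it does not extend AbstractQueue, and it leaves out clear(), iterator(), and setComparator(). If you store your root at index 1 instead of 0, the index arithmetic changes.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration class; not the MyPriorityQueue skeleton from the jar.
class HeapSketch<T> {
    private final List<T> heap = new ArrayList<>();
    private final Comparator<T> cmp;

    HeapSketch(Comparator<T> cmp) { this.cmp = cmp; }

    // Index arithmetic for a zero-based array layout (root at index 0).
    private int parent(int x)     { return (x - 1) / 2; }
    private int leftChild(int x)  { return 2 * x + 1; }
    private int rightChild(int x) { return 2 * x + 2; }

    int size() { return heap.size(); }

    T peek() { return heap.isEmpty() ? null : heap.get(0); }

    boolean offer(T item) {
        heap.add(item);                 // place the item at the next free leaf
        percolateUp(heap.size() - 1);   // then restore heap order
        return true;
    }

    T poll() {
        if (heap.isEmpty()) return null;
        T top = heap.get(0);
        T last = heap.remove(heap.size() - 1);
        if (!heap.isEmpty()) {
            heap.set(0, last);          // move the last leaf to the root
            percolateDown(0);
        }
        return top;
    }

    private void percolateUp(int hole) {
        T item = heap.get(hole);
        while (hole > 0 && cmp.compare(item, heap.get(parent(hole))) < 0) {
            heap.set(hole, heap.get(parent(hole)));   // slide the parent down
            hole = parent(hole);
        }
        heap.set(hole, item);
    }

    private void percolateDown(int hole) {
        T item = heap.get(hole);
        while (leftChild(hole) < heap.size()) {
            int child = leftChild(hole);
            int right = rightChild(hole);
            // Only consider the right child if it exists (single-child case).
            if (right < heap.size()
                    && cmp.compare(heap.get(right), heap.get(child)) < 0) {
                child = right;
            }
            if (cmp.compare(heap.get(child), item) >= 0) break;
            heap.set(hole, heap.get(child));          // slide the smaller child up
            hole = child;
        }
        heap.set(hole, item);
    }
}
```

Both percolate methods move a "hole" rather than swapping at every level, which saves some assignments; swapping at each step works just as well.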

JUnit Tests

Don't forget to write JUnit tests as you go along, to test your priority queue. You can check the behaviour of MyPriorityQueue against that of Java's PriorityQueue. Since your priority queue requires a Comparator in order to construct it, you may want to use the provided StringComparator, and make priority queues out of Strings.


Part 2 - URLComparator

In this section of the lab you will write a Comparator to compare WebPageIndex objects, based on the current query. Comparator<E> is a java interface which contains the method

public int compare(E item1, E item2);
return a negative number if item1<item2
return a positive integer if item1>item2
return zero if item1==item2
The lab jar file contains a sample Comparator called StringComparator.java, which I am providing to illustrate how to write a Comparator. Take a look at it before you tackle the URLComparator.

Once you understand StringComparator, write your own comparator, URLComparator, that compares two WebPageIndex objects based on their relevance to a given query (indicated as a parameter to the constructor).

You should include a method (possibly a public one...) that allows you to compute a score for a given WebPageIndex object using the current query. To remind you, the score you are using is the sum of the word counts of each word in the query.
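
Here is a minimal sketch of the scoring idea. Because the WebPageIndex API is not shown in this handout, a page is modelled here as a plain word-to-count Map; that stand-in and the class name ScoreComparatorSketch are assumptions for illustration only. Your URLComparator would ask the WebPageIndex for its counts instead.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in: a page is a word -> count map, because the
// WebPageIndex API is not shown in this handout.
class ScoreComparatorSketch implements Comparator<Map<String, Integer>> {
    private final List<String> queryWords;

    ScoreComparatorSketch(List<String> queryWords) {
        this.queryWords = queryWords;
    }

    // Score = sum of the counts of each query word on the page.
    int score(Map<String, Integer> page) {
        int total = 0;
        for (String word : queryWords) {
            Integer count = page.get(word);
            if (count != null) total += count;   // absent words contribute 0
        }
        return total;
    }

    // A higher score compares as "smaller", so the best match ends up at
    // the top of a min-heap. This ordering is one possible design choice.
    @Override
    public int compare(Map<String, Integer> a, Map<String, Integer> b) {
        return Integer.compare(score(b), score(a));
    }
}
```

Note the reversed arguments in compare(): with a min-heap, the page you want to see first has to be the one the comparator considers smallest.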

Test your Comparator with JUnit tests before proceeding!

Note: If your URLComparator class is not "recognizing" WebPageIndex, it is probably because you declared your URLComparator class incorrectly. You should use

class URLComparator implements Comparator<WebPageIndex>
not
class URLComparator<WebPageIndex> implements Comparator<WebPageIndex>
Do you see why?


Part 3 - ProcessQueries

This class contains the main method of the application. The program has two basic parts. The first part is to build a list of WebPageIndex objects from the URLs listed in the urlListFile. The second part is to enter a loop to process a series of user queries. Create a class ProcessQueries whose main method will have this functionality.

First, implement the part of your program that processes the urlListFile. For each URL read in, use that URL to construct a WebPageIndex. Put all of the WebPageIndex objects in a list. You should do this as a method and not just have all the code in main.
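
One way to handle bad entries gracefully is to let java.net.URL reject them and keep going. The sketch below only validates the strings; it does not build WebPageIndex objects, and the names UrlListSketch and parseUrls are made up for illustration. It also assumes every entry is a URL, so if your WebPageIndex accepts local file names you would treat those separately.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; in the lab you would construct a WebPageIndex
// for each valid entry instead of just collecting the URL.
class UrlListSketch {
    // Returns the entries java.net.URL accepts; the rest are reported and skipped.
    static List<URL> parseUrls(List<String> lines) {
        List<URL> valid = new ArrayList<>();
        int errors = 0;
        for (String line : lines) {
            String text = line.trim();
            if (text.isEmpty()) continue;           // ignore blank lines
            try {
                valid.add(new URL(text));
            } catch (MalformedURLException e) {
                errors++;                           // count it, but keep going
                System.err.println("Skipping invalid URL: " + text);
            }
        }
        System.out.println("Parsed: " + valid.size() + "  Errors: " + errors);
        return valid;
    }
}
```

A URL that parses can still fail when you fetch it, so the code that builds each WebPageIndex needs its own error handling as well.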

Second, you need to process the user queries. To find the results of the query in best-to-worst order, construct a priority queue of WebPageIndex objects, one per web page in the urlListFile. The priority value should be computed by adding the counts of the words and phrases in the query. The priority queue can be used to print out the matching URLs in order.

For every subsequent search query, use the new query to construct a comparator, and then reheap your collection of WebPageIndex objects using that comparator.

You should also support searching for phrases contained within double quotes. String objects have a number of methods like startsWith(), endsWith(), and substring() that I found to be useful when constructing a phrase. You can wait and add this in at the end once you have everything else working if you'd like.
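
As a rough sketch of how startsWith(), endsWith(), and substring() can be combined, the following splits a query line into single words and quoted phrases. The class and method names are hypothetical, and the treatment of an unterminated quote (keep whatever was collected) is an assumption, not a requirement of the lab.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for turning a query line into search terms.
class QueryParserSketch {
    // Words inside double quotes are joined into one phrase term.
    static List<String> parse(String query) {
        List<String> terms = new ArrayList<>();
        StringBuilder phrase = null;                 // non-null while inside quotes
        for (String token : query.trim().split("\\s+")) {
            if (token.isEmpty()) continue;
            if (phrase == null && token.startsWith("\"")) {
                phrase = new StringBuilder();        // a phrase starts here
                token = token.substring(1);          // drop the opening quote
            }
            if (phrase == null) {
                terms.add(token);                    // plain word outside quotes
                continue;
            }
            boolean closes = token.endsWith("\"");
            if (closes) token = token.substring(0, token.length() - 1);
            if (!token.isEmpty()) {
                if (phrase.length() > 0) phrase.append(' ');
                phrase.append(token);
            }
            if (closes) {
                if (phrase.length() > 0) terms.add(phrase.toString());
                phrase = null;                       // back to plain words
            }
        }
        // Unterminated quote: keep what was collected as a phrase.
        if (phrase != null && phrase.length() > 0) terms.add(phrase.toString());
        return terms;
    }
}
```

For example, the line "principles of computer science" (with the quotes) yields one four-word term, while the same words without quotes yield four separate terms.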

The program then prints out the matching URLs in order from best to worst match until there are no matches or the user specified limit is reached. Continue reading and processing queries in a loop until you reach end of file on System.in or some designated terminator string (e.g., "-1").

Notes:

Your program should handle multiple word queries, and return the best matches based on all words in the query. For example, the query "computer science department" should search each URL's WebPageIndex for all three words to determine the URL's priority.

Don't "drop" the results as you are pulling them out of the heap. Just stick them in a list of some sort as you remove them and add them back in afterwards.
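
The remove-then-restore idea might look like the sketch below. It uses java.util.PriorityQueue as a stand-in for your MyPriorityQueue, and a made-up Scored class in place of WebPageIndex, so treat the names as assumptions. It also assumes the queue's comparator puts the highest score first.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

class TopResultsSketch {
    // Hypothetical stand-in for a page together with its score for the query.
    static class Scored {
        final String url;
        final int score;
        Scored(String url, int score) { this.url = url; this.score = score; }
    }

    // Removes up to limit matches in best-first order, then restores the heap.
    static List<String> topMatches(PriorityQueue<Scored> heap, int limit) {
        List<Scored> removed = new ArrayList<>();
        List<String> lines = new ArrayList<>();
        while (!heap.isEmpty() && lines.size() < limit) {
            Scored best = heap.poll();
            removed.add(best);               // remember it so it is not lost
            if (best.score == 0) break;      // everything left matches nothing
            lines.add("(priority = " + best.score + ") " + best.url);
        }
        heap.addAll(removed);                // put everything back for the next query
        return lines;
    }
}
```

If you reheapify with a new comparator for every query anyway, putting the items back can be as simple as appending them to the underlying list before the heapify step.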

When you change a comparator in the heap, you need to reheapify things. You could just create a new ArrayList and add in all the old items. A better choice would be to reheapify using the linear time algorithm discussed in class.
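
For reference, the standard bottom-up heap construction (presumably the linear-time method meant here) percolates down every node that has a child, starting from the last such node and working back to the root. The sketch below assumes a zero-based list; HeapifySketch and its methods are illustrative names, not part of the lab's jar.

```java
import java.util.Comparator;
import java.util.List;

class HeapifySketch {
    // Rearranges items into min-heap order under cmp using bottom-up construction.
    static <T> void heapify(List<T> items, Comparator<? super T> cmp) {
        // Leaves are already heaps, so start at the last node with a child.
        for (int i = items.size() / 2 - 1; i >= 0; i--) {
            percolateDown(items, i, cmp);
        }
    }

    private static <T> void percolateDown(List<T> items, int hole,
                                          Comparator<? super T> cmp) {
        T item = items.get(hole);
        int n = items.size();
        while (2 * hole + 1 < n) {
            int child = 2 * hole + 1;                 // left child
            if (child + 1 < n
                    && cmp.compare(items.get(child + 1), items.get(child)) < 0) {
                child++;                              // right child is smaller
            }
            if (cmp.compare(items.get(child), item) >= 0) break;
            items.set(hole, items.get(child));
            hole = child;
        }
        items.set(hole, item);
    }

    // Checks the heap property; handy in a unit test.
    static <T> boolean isHeap(List<T> items, Comparator<? super T> cmp) {
        for (int i = 1; i < items.size(); i++) {
            if (cmp.compare(items.get(i), items.get((i - 1) / 2)) < 0) return false;
        }
        return true;
    }
}
```

Inside setComparator() you would store the new comparator first and then run the same loop over your own array.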

Your output should include each page's score / priority. We deliberately haven't said how to do this, and there are a couple of different approaches that will work. We hope you can find a way to solve this on your own, but if you are stuck on how to get those priorities, you can ask and we'll point you in a right direction.

You might run into a message that indicates that you have too many files open. Every HTMLScanner you create (or Scanner to read a file) uses one of a limited number of file descriptor slots. You should get rid of the reference to either the Scanner or the HTMLScanner when you are done using it. If you keep them in your WebPageIndex class, you will run out of descriptors to use.
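
For a plain java.util.Scanner, a try-with-resources block (Java 7 and later) is a tidy way to make sure the scanner is closed as soon as you are done with it. The sketch below is illustrative; the class name is made up, and whether HTMLScanner offers a close() method is not stated in this handout, so this applies only to Scanner.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class ScannerCloseSketch {
    // Reads all lines, closing the Scanner (and its source) before returning.
    static List<String> readLines(Readable source) {
        List<String> lines = new ArrayList<>();
        try (Scanner in = new Scanner(source)) {
            while (in.hasNextLine()) {
                lines.add(in.nextLine());
            }
        }   // in.close() runs here, even if an exception was thrown
        return lines;
    }
}
```

To read a url list file you could pass in a java.io.FileReader for that file. The key point is that the scanner lives only inside the method and is never stored in a field.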

You might also run out of memory on a large url-list file if you run Java in the default manner (it only allocates 64MB of RAM). You can increase the amount of memory given to the Java Virtual Machine with an option to the java command. For example,

    java -Xmx1g ProcessQueries urls-all 10

where the -Xmx indicates that you want it to use more memory, and the 1g indicates to use up to 1 gigabyte.

When you are dealing with a large number of urls, you might want to print out a status message every N items just to let yourself know that things are still progressing. You can be extra fancy if you use "\r" in a System.out.print() statement. Try out the following which you can modify for your own purposes (be sure to take out the sleep):

    public static void main(String[] args) throws InterruptedException {
        for (int i=0; i<10; i++) {
            Thread.sleep(500);
            System.out.print("\rCounting up to " + i );
        }
        System.out.println();
    }

Parallel loading

Once you have your program working correctly, you may have noticed that it takes a while for it to load all of the URLs at the start. There are a couple of ways to address this. One is to cache the results from your fetching -- you could do that by writing out your WebPageIndex objects to disk. However, another way would be to fetch multiple pages simultaneously.

The class WebPageLoader is designed to do just that. If you give it a list of URLs and a number, it will fetch that number of pages simultaneously. This class is still experimental and I found you might get more errors when using this than you do from just fetching things sequentially.

Once your ProcessQueries program is working correctly, you are welcome to try using this class to see if it speeds things up. You should comment out your existing method that creates all of the WebPageIndex objects (you did do that as a method, didn't you?) and add in a call to a new method that uses WebPageLoader to do the fetching. We may need to be able to test your program with the sequential fetching, so leave all of that code in, just call a different method.

You'll want to be careful to not set it up to make too many parallel requests. I found that 5-10 works well, but something like 20 often resulted in many more failed page loads than before.
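
To see why the number of simultaneous requests is a tunable knob, here is a sketch of the general technique using java.util.concurrent. This is not the WebPageLoader API (which is not described here); BoundedFetchSketch and runAll are hypothetical names, and each task stands for "fetch one page and build its index".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class BoundedFetchSketch {
    // Runs the tasks with at most `parallel` in flight; failed tasks are skipped.
    static <T> List<T> runAll(List<Callable<T>> tasks, int parallel) {
        ExecutorService pool = Executors.newFixedThreadPool(parallel);
        List<T> results = new ArrayList<>();
        try {
            List<Future<T>> futures = new ArrayList<>();
            for (Callable<T> task : tasks) {
                futures.add(pool.submit(task));
            }
            for (Future<T> f : futures) {
                try {
                    results.add(f.get());            // wait for this task to finish
                } catch (ExecutionException e) {
                    // One failed page load should not stop the others.
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        } finally {
            pool.shutdown();                         // let the worker threads exit
        }
        return results;
    }
}
```

The fixed pool size plays the role of the "number of pages to fetch simultaneously", and results come back in the same order as the task list.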

Handin

Use the handin program to submit a directory containing

If you work with a partner, only one of you should submit your joint solution using handin.

Improving your search engine

Here are a few suggestions as to how you might improve your search engine.

  1. Ignore case: Convert all terms that are inserted in your WebPageIndex to be lower case. This will allow "Book", "BOOK", and "book" to be considered as the same word for web queries.
  2. AND operator: Require that ALL terms specified in the search query are actually present on the page.
  3. NEAR operator: Allow some way of finding things that are not necessarily phrases, but are within X items of each other.
  4. - operator: Allow the user to specify terms that should *not* be included in the results.
  5. TF-IDF: Term Frequency - Inverse Document Frequency. The idea is that you weight the score for a particular term relative to the number of words in the document as well as the number of documents in which the term appears.
  6. Base things on the frequency of the terms, not just the count of them.
  7. Use serialization to cache WordFrequencyTrees to avoid network costs. (Note, probably shouldn't do this with files written to your home directory on a lab machine.)
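
For suggestion 5, one common TF-IDF variant multiplies the term's relative frequency in the document by the log of (total documents / documents containing the term). The sketch below uses that variant; there are several others, and the class, method, and parameter names are made up for illustration.

```java
class TfIdfSketch {
    // tf  = termCount / wordsInDoc
    // idf = ln(totalDocs / docsWithTerm)
    static double tfIdf(int termCount, int wordsInDoc,
                        int totalDocs, int docsWithTerm) {
        if (termCount == 0 || wordsInDoc == 0 || docsWithTerm == 0) {
            return 0.0;                              // term absent: no contribution
        }
        double tf = (double) termCount / wordsInDoc;
        double idf = Math.log((double) totalDocs / docsWithTerm);
        return tf * idf;
    }
}
```

With this weighting, a term that appears in every document gets an idf of 0 and so contributes nothing, which is exactly how very common words get discounted.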

If you try any of these, or come up with another technique, write something in your README file to describe what you did, how difficult it was, and how well (if at all) it improved your search results.


Last Modified: Oct. 30, 2012 - Thanks to Benjamin A. Kuperman for this lab