Prelab 7

Processing Web Search Queries
Due by 10am, Monday 5 Nov 2012

In this prelab, you will get some practice with structural induction and familiarize yourself with some of the design and implementation issues of the upcoming lab 7. Please write or type up your solutions, and hand in a paper copy before class on Monday. Remember, late prelabs receive zero credit.


Part 0 - Structural Induction

  1. (and 2. and 3.)  Print the following worksheet and hand it in with the rest of your prelab.

Problem Description

In the previous lab, you created a WebPageIndex class that represents the data from a single document (either local file or URL). In this lab, you will be creating a collection of those indexes and then determining which page best matches what a user is searching for.

Part 1 - MyPriorityQueue

Before you get to the algorithms themselves, you will implement a heap-based priority queue needed for some of the algorithms. Your MyPriorityQueue<T> will extend AbstractQueue<T>, and it will contain either an array or ArrayList to hold the data in the priority queue.

To begin, you should probably look over the JavaDoc for a PriorityQueue<T> and for java.util.AbstractQueue<T> which it extends.

Methods

Inside our binary heap, there are two private methods we will need to implement. We discussed both of these in class, but you will need to implement them yourself.

  1. Give the definition for the method percolateUp(int x) that takes the value currently located at index x and moves it up to the correct location.

  2. Give the definition for the method percolateDown(int x) that takes the value currently located at index x and moves it down to the correct location. Be sure to consider how you will detect and handle both leaf nodes and internal nodes that only have one child.
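
As a starting point for thinking about leaves and one-child nodes, here is a minimal sketch of the index arithmetic for an array-backed binary heap. It assumes the root is stored at index 0 (if your class notes put the root at index 1, the formulas shift accordingly). The class and method names are made up for illustration and are not part of the lab.

```java
// Illustrative only: HeapIndexSketch and its method names are assumptions,
// not part of MyPriorityQueue. Assumes the root lives at index 0.
class HeapIndexSketch {
    // Index of the parent of node i (only meaningful for i > 0).
    static int parent(int i) { return (i - 1) / 2; }

    // Indices where the children of node i would be stored.
    static int leftChild(int i) { return 2 * i + 1; }
    static int rightChild(int i) { return 2 * i + 2; }

    // A node is a leaf when even its left child falls outside the heap.
    static boolean isLeaf(int i, int size) { return leftChild(i) >= size; }

    // A node has exactly one child when the left child exists but the right does not.
    static boolean hasOnlyLeftChild(int i, int size) {
        return leftChild(i) < size && rightChild(i) >= size;
    }

    public static void main(String[] args) {
        int size = 6; // a heap occupying indices 0..5
        for (int i = 0; i < size; i++) {
            System.out.println(i + ": leaf=" + isLeaf(i, size)
                    + ", onlyLeftChild=" + hasOnlyLeftChild(i, size));
        }
    }
}
```

With six elements, index 2 is the only internal node with a single child, and indices 3 through 5 are leaves. Your percolateDown will need to treat those cases differently from a node with two children.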


Part 2 - URLComparator

In order to make these heaps work, you will need to create Comparators of various sorts. Begin by looking over the JavaDoc for java.util.Comparator<T>. Pay special attention to the compare() method you are required to implement.

  1. Give the Java code for a comparator class StringComparator that compares two Strings, but does not care about the case of the strings themselves.

        import java.util.Comparator;

        public class StringComparator implements Comparator<String> {
            public int compare( String o1, String o2 ) {
                // fill me in -- see Comparator API for what I should return
                // Hint: you can make this a one-liner if you piggyback from
                // one of the String class' methods...
            }
        }
        
  2. Give the Java code for a comparator class PointComparator that compares two Point objects in terms of their distance from an additional third reference point (passed in as part of the constructor). Recall that the distance between two points is the square root of the sum of the squared differences of their X and Y coordinates. You may want to use a method to do the distance calculation for you.

    d = sqrt( (x1 - x2)^2 + (y1 - y2)^2 )

        import java.util.Comparator;

        public class PointComparator implements Comparator<Point> {
    	private Point referencePoint; // The point to which all points compare 
    
    	public PointComparator( Point referencePoint ) {
    	   this.referencePoint = referencePoint; 
    	}
    	public int compare( Point p1, Point p2 ) {
    	    // fill me in 
    
    
    	}
    	public double distance( Point p1, Point p2 ) {
    	    return Math.sqrt( (p1.x - p2.x)*(p1.x - p2.x) + (p1.y - p2.y)*(p1.y - p2.y) );
    	}
        }
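
To check your understanding of the distance formula and of what compare() should return, here is a small worked sketch. It uses plain double coordinates rather than Point objects, and the class and method names are hypothetical. It is one possible way to order two distances, not the required solution.

```java
// Illustrative only: DistanceSketch is a made-up name, and raw coordinates
// stand in for Point objects.
class DistanceSketch {
    // d = sqrt( (x1 - x2)^2 + (y1 - y2)^2 )
    static double distance(double x1, double y1, double x2, double y2) {
        double dx = x1 - x2;
        double dy = y1 - y2;
        return Math.sqrt(dx * dx + dy * dy);
    }

    // Negative if the first distance is smaller, zero if equal, positive otherwise,
    // which is the sign convention Comparator.compare() expects.
    static int compareDistances(double d1, double d2) {
        return Double.compare(d1, d2);
    }

    public static void main(String[] args) {
        double near = distance(0, 0, 3, 4);   // a 3-4-5 right triangle
        double far = distance(0, 0, 6, 8);    // the same triangle, doubled
        System.out.println(near + " vs " + far + " -> " + compareDistances(near, far));
    }
}
```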
        


Scoring webpages (so that we can compare them)

In order to tell which web page is better for a given query, we will need a way to "score" each page against that query.

  1. Explain how you compute the "score" of a particular web page given a String that represents a user query of one or more words under the following conditions (use pseudocode):

    1. Based on just the sum of the word counts. (Given a web page and a query of (possibly multiple) words, describe how to compute this web page's score based on the sum of the word counts of the words in the query.)

          First, parse the query into its constituent words.
          For each word W in the query
             Use the WebPageIndex to...

    2. Based on just the sum of the word counts, but requiring every word to be present

    3. Based on the frequencies of the words
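
To make the first two scoring rules concrete (plain sum of counts, and sum of counts requiring every word to be present), here is a rough sketch. Since the WebPageIndex methods from Lab 6 are not listed here, a plain Map from word to count stands in for the index. That map, the class name, and the method names are all assumptions for illustration, and returning 0 for a page that is missing a word is just one possible choice.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a Map<String, Integer> stands in for a WebPageIndex,
// and ScoreSketch is a hypothetical name.
class ScoreSketch {
    // Split a query into lowercase words on runs of whitespace.
    static String[] words(String query) {
        return query.trim().toLowerCase().split("\\s+");
    }

    // Rule A: add up the counts of every query word (missing words count as 0).
    static int sumOfCounts(Map<String, Integer> counts, String query) {
        int score = 0;
        for (String w : words(query)) {
            score += counts.getOrDefault(w, 0);
        }
        return score;
    }

    // Rule B: same sum, but a page missing any query word scores 0 (an assumed convention).
    static int sumRequiringAll(Map<String, Integer> counts, String query) {
        int score = 0;
        for (String w : words(query)) {
            int c = counts.getOrDefault(w, 0);
            if (c == 0) {
                return 0;
            }
            score += c;
        }
        return score;
    }

    // A tiny made-up page: "pancakes" appears 3 times, "bacon" twice.
    static Map<String, Integer> sampleCounts() {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("pancakes", 3);
        counts.put("bacon", 2);
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> page = sampleCounts();
        System.out.println(sumOfCounts(page, "pancakes bacon syrup"));
        System.out.println(sumRequiringAll(page, "pancakes bacon syrup"));
    }
}
```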

Once you have a way to score web pages, you can use this score to help compare two web pages.


  1. Describe how you could use the previous score calculations within a Comparator to have the best scoring page be at the top of the heap (and therefore at the front of your PriorityQueue).
        public class URLComparator implements Comparator<WebPageIndex> {
    	private String query;
    	public URLComparator( String query ) {
    	    this.query = query;
    	}
    	public int compare( WebPageIndex w1, WebPageIndex w2 ) {
    	    // fill me in. Use the instance variable query in conjunction with w1 and w2
    
    	}
        }
    	

We'll be using these comparators in our PriorityQueue which will be based on a binary heap. Recall that in class we discussed that these are "min-heaps" -- heaps where the minimum value is at the root.
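
Because the heap keeps whatever the comparator considers "smallest" at the root, the direction of your comparison decides what ends up at the front. The sketch below illustrates this with java.util.PriorityQueue standing in for your MyPriorityQueue and plain Integer scores standing in for web pages. The class and method names are hypothetical.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Illustrative only: java.util.PriorityQueue stands in for MyPriorityQueue,
// and Integer scores stand in for WebPageIndex objects.
class HeapOrderSketch {
    // Reverses the usual order so the largest score compares as "smallest".
    static Comparator<Integer> highestFirst() {
        return (a, b) -> Integer.compare(b, a);
    }

    // Load the scores into a min-heap ordered by the given comparator
    // and report which one sits at the front.
    static Integer front(Comparator<Integer> order, Integer... scores) {
        PriorityQueue<Integer> pq = new PriorityQueue<>(order);
        for (Integer s : scores) {
            pq.add(s);
        }
        return pq.peek();
    }

    public static void main(String[] args) {
        // Natural order: the lowest score is at the front.
        System.out.println(front(Comparator.<Integer>naturalOrder(), 3, 9, 5));
        // Reversed order: the highest score is at the front.
        System.out.println(front(highestFirst(), 3, 9, 5));
    }
}
```

The same idea carries over to a comparator on web pages: the page you want at the front must be the one your compare() method treats as smallest.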

In the application portion of this lab, you will be reading in and creating a number of WebPageIndex objects (from Lab 6), storing them in your heap, and then processing user search queries on those objects.

Advanced queries

Our WebPageIndex objects allow us to also search for phrases in our web pages.

  1. Explain how you would process a user query to identify phrases that are set off by double quotes, and then score the various pages. For example,

    pancakes "maple syrup" bacon

    is looking for pages that contain the words "pancakes" and "bacon" as well as the phrase "maple syrup".

  2. Another common feature of search engines is the use of a minus sign at the start of a word or phrase to indicate you only want results without that word or phrase. Explain how you could also add this feature into your scoring methods.
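
One possible way to break such a query into terms is sketched below. It uses a regular expression to pull out single words and double-quoted phrases, each optionally preceded by a minus sign, and sorts them into "wanted" and "excluded" lists. The class name, field names, and the decision to lowercase every term are assumptions for illustration. How the two lists feed into your score (for example, giving a page that contains an excluded term the lowest possible score) is left for you to decide.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: QueryParseSketch and its fields are hypothetical names.
class QueryParseSketch {
    final List<String> wanted = new ArrayList<>();    // words and phrases to look for
    final List<String> excluded = new ArrayList<>();  // terms that were prefixed with '-'

    // Optional minus sign, then either a "quoted phrase" or a run of non-space characters.
    private static final Pattern TERM =
            Pattern.compile("(-?)(?:\"([^\"]*)\"|(\\S+))");

    static QueryParseSketch parse(String query) {
        QueryParseSketch result = new QueryParseSketch();
        Matcher m = TERM.matcher(query);
        while (m.find()) {
            boolean negated = !m.group(1).isEmpty();
            // Group 2 holds a quoted phrase (without the quotes); group 3 holds a bare word.
            String text = (m.group(2) != null) ? m.group(2) : m.group(3);
            if (text.isEmpty()) {
                continue; // ignore an empty pair of quotes
            }
            (negated ? result.excluded : result.wanted).add(text.toLowerCase());
        }
        return result;
    }

    public static void main(String[] args) {
        QueryParseSketch q = parse("pancakes \"maple syrup\" -bacon");
        System.out.println("wanted:   " + q.wanted);
        System.out.println("excluded: " + q.excluded);
    }
}
```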