The purpose of this lab is to:

* Implement a sorting algorithm.
* Practice with Sets and Dictionaries.
* Work on reading from files and string manipulation.
* Do fun stuff with words.

# Dictionaries

As we've seen in class, a dictionary, like a list, is one of the built-in data structures supported by Python for representing collections of data. The entries in a dictionary are key-value pairs. A dictionary key can be any immutable type (typically a number or a string), while a dictionary value can be anything at all. We think of a dictionary as something that maps keys to values. In a traditional dictionary, the keys are words, and the values are their definitions. In a Python dictionary, your key-value pairs could be words and definitions, or usernames and passwords, or phone numbers and names.

Unlike a list, a dictionary is not indexed by position, so there is no element 0, element 1, etc. Instead, values in a dictionary are indexed by their keys. (Since Python 3.7, a dictionary remembers the order in which its keys were inserted, and printing it lists the entries in that order; in older versions the order could look random. Either way, you look entries up by key, not by position.) One advantage of using a dictionary is that testing membership or looking up a value is very fast. To evaluate the boolean expression a in myList, Python may need to look through the whole list, so this gets slower and slower as the list grows. Evaluating a in myDictionary, however, is very fast, even for very large dictionaries. Plus, sometimes it's just easier to have values indexed by arbitrary keys.
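To make the speed difference concrete, here is a small sketch that times the same membership test on a list and on a dictionary. The data and the names numbers_list, numbers_dict and target are made up for illustration, and the exact timings will vary from machine to machine.

```python
import timeit

# Hypothetical sample data: 100,000 integers stored both ways.
numbers_list = list(range(100_000))
numbers_dict = {n: True for n in numbers_list}

target = 99_999  # the last element, so the list is scanned to the end

# Both expressions give the same answer...
in_list = target in numbers_list
in_dict = target in numbers_dict

# ...but they take different amounts of time to evaluate.
list_time = timeit.timeit(lambda: target in numbers_list, number=100)
dict_time = timeit.timeit(lambda: target in numbers_dict, number=100)
print("list:", list_time, "seconds")
print("dict:", dict_time, "seconds")
```

On most machines the dictionary time should come out much smaller, and the gap should widen as the collections grow.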

Here are some examples of syntax involving dictionaries.


    score = {}                         # set score to be an empty dictionary
    score = {"adam": 17, "roberto": 4} # set score to be a dictionary with two key-value pairs
    score["roberto"]                   # 4
    score["bob"]                       # error (key not found)
    "bob" in score                     # False
    score["bob"] =  42                 # adds key "bob" with value 42
    score["roberto"] =  18             # updates value for key "roberto" to 18
    "bob" in score                     # True
    del score["adam"]                  # remove key "adam" and its value
    list(score.keys())                 # returns a list of the keys ["roberto", "bob"]
    len(score)                         # 2 (number of keys in score)
    for k in score:                    # iterates over the keys in score
        print(k, score[k])             # prints each key and its value
                  
# Distilling Text
**distill.py: 18 points**

Consider the following text: 

     Question:
     Whether nobler mind suffer
     Slings arrows outrageous fortune
     Take arms against sea troubles
     By opposing them.
                
This is a condensed version of the first few lines of Hamlet's famous "To be, or not to be" soliloquy. The original unedited text is: 

     To be, or not to be -- that is the question:
     Whether 'tis nobler in the mind to suffer
     The slings and arrows of outrageous fortune
     Or to take arms against a sea of troubles
     And by opposing end them.
                
The former was produced by finding the 30 most commonly used words in the speech (only part of which is shown above) and removing them. Your first challenge is to write a program called distill.py that prompts the user for the name of a text file and a number n, and prints the contents of that text file with the n most common words removed.

# Program Outline

The details of implementing a solution are up to you, but here is a suggested outline of how to approach the problem. As usual, think about the 6 steps of program development, and test each piece as you go.

* Read in a text document. Here are two for you to try: [hamlet.txt](http://www.cs.oberlin.edu/~ctaylor/classes/150S20/Labs/Lab08/hamlet.txt), [lincoln.txt](http://www.cs.oberlin.edu/~ctaylor/classes/150S20/Labs/Lab08/lincoln.txt). 

* Create a dictionary wordcount to keep track of the word counts for the given text file. That is, the keys in your dictionary should be strings (the words) and the value for a given key should indicate how often that word appeared. 

* Once you've built the dictionary, implement your own version of insertion sort (or selection sort or bubble sort) to build a list called sortcount of word-count pairs (as tuples), sorted by word frequency, with the most common words first. For example, the first portion of the sortcount for hamlet.txt looks like [('the', 20), ('to', 15), ('of', 15), ('and', 12), ('that', 7),.... 

* Create a list commonwords by dropping all but the highest n elements of sortcount and only copying the word part of each pair (we don't need the counts any more). Continuing with the previous example, commonwords would become ['the', 'to', 'of', 'and', 'that',...]. 

* Reopen the original text document, and print each word so long as it doesn't appear in commonwords.
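As a rough illustration of the middle steps of this outline, here is one possible sketch of counting words and sorting the word-count pairs with insertion sort. The function names count_words and insertion_sort_by_count and the sample sentence are invented for this example; your own program may be organized quite differently.

```python
def count_words(words):
    """Return a dictionary mapping each word to the number of times it appears."""
    wordcount = {}
    for word in words:
        if word in wordcount:
            wordcount[word] += 1
        else:
            wordcount[word] = 1
    return wordcount


def insertion_sort_by_count(pairs):
    """Sort a list of (word, count) tuples in place, largest count first."""
    for i in range(1, len(pairs)):
        current = pairs[i]
        j = i - 1
        # Shift pairs with a smaller count one slot to the right.
        while j >= 0 and pairs[j][1] < current[1]:
            pairs[j + 1] = pairs[j]
            j -= 1
        pairs[j + 1] = current
    return pairs


# Made-up sample words standing in for the cleaned contents of a file.
words = "the cat and the dog and the bird".split()
wordcount = count_words(words)
sortcount = insertion_sort_by_count(list(wordcount.items()))

n = 2
commonwords = [word for word, count in sortcount[:n]]
print(sortcount)    # [('the', 3), ('and', 2), ('cat', 1), ('dog', 1), ('bird', 1)]
print(commonwords)  # ['the', 'and']
```

Because the comparison uses a strict less-than, words with equal counts keep the order in which they first appeared. A different tie-breaking rule is equally valid and would give a slightly different sortcount.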

**Suggestions and Tips**

* Recall that to read in a text file as a single string, you can use



```
    textfile = open(filename, "r")
    textstring = textfile.read()
    textfile.close()
```


                  
* The string function split() returns a list of all the "words" in a string, where "words" are substrings that don't contain whitespace. 

* Since you want to identify all the strings that look like "for", "For", "for,", etc. as the same string, it may be helpful to write a function called cleanstring that takes in a string, and returns a version of that string with lowercase letters and all punctuation removed. Testing if a character is punctuation isn't too hard: for example, to test whether ch is a comma, period, colon, semi-colon, single quote, double quote, or newline, you can write: 
```
    if ch in ".,:;'\"\n":
```

*  Of course, there may be other punctuation you want to check for as well (hint: string.punctuation in the string module might be very helpful!). 

* To make all alphabetic characters lowercase, use the string function lower().

* When you're deciding whether to print a word or not, you'll want to look at the "cleaned" version, but you may want to print the original version. 

* Try to preserve newlines, punctuation and capitalization in your output. The exact details of how to handle these cases are left up to you (for example, you might want to capitalize the first character of each new line, or drop words that don't contain any letters, such as "--"). 

* Keep in mind that your output may differ slightly due to ties in word counts.
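
Putting a few of these tips together, here is a minimal sketch of a cleanstring function and of filtering a single line of text. The helper distill_line and the short commonwords list are assumptions made up for this example, not a required design.

```python
import string


def cleanstring(word):
    """Return word in lowercase with punctuation characters removed."""
    cleaned = ""
    for ch in word.lower():
        if ch not in string.punctuation:
            cleaned += ch
    return cleaned


def distill_line(line, commonwords):
    """Return line without the words whose cleaned form is in commonwords."""
    kept = []
    for word in line.split():
        if cleanstring(word) not in commonwords:
            kept.append(word)  # keep the original spelling and punctuation
    return " ".join(kept)


# Hypothetical list of common words, just for this example.
commonwords = ["to", "be", "or", "not", "that", "is", "the"]
print(distill_line("To be, or not to be -- that is the question:", commonwords))
# prints: -- question:
```

Notice that "--" survives here because its cleaned version is the empty string, which is not in commonwords. Whether to drop such words, and how to handle capitalization at the start of a line, is one of the details left up to you.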
