CSCI 150: Lab 8

Wordplay
Due: 10PM on Tuesday, April 17

The purpose of this lab is to:

  • Implement a sorting algorithm.
  • Practice with Sets and Dictionaries.
  • Work on reading from files and string manipulation.
  • Do fun stuff with words.

Getting Started

You may work in pairs on this lab. As usual, if you do choose to work with a partner, only one person per pair should submit code, and the names of both partners must be included in the header of each file.

Dictionaries

As we've seen in class, a dictionary, like a list, is one of the built-in data structures supported by Python for representing collections of data. The entries in a dictionary are key-value pairs. A dictionary key can be any immutable type (typically a number or a string), while a dictionary value can be anything at all. We think of a dictionary as something that maps keys to values. In a traditional dictionary, the keys are words, and the values are their definitions. In a Python dictionary, your key-value pairs could be words and definitions, or usernames and passwords, or phone numbers and names.

Unlike a list, dictionaries are not ordered -- they are just collections of key-value pairs. So there is no element 0, element 1, etc. Instead, values in a dictionary are indexed by their keys. If you print a dictionary, the elements will be listed in a seemingly random order. One advantage of using a dictionary is that testing membership or looking up a value in a dictionary is very fast. To evaluate the boolean expression a in myList, Python needs to look through the whole list, so this gets slower and slower as the list grows. Evaluating a in myDictionary, however, is very fast, even for very large dictionaries. Plus, sometimes it's just easier to have values indexed by arbitrary keys.

Here are some examples of syntax involving dictionaries.


    score = {}                         # set score to be an empty dictionary
    score = {"adam": 17, "roberto": 4}# set score to be a dictionary with two key-value pairs
    score["roberto"]                   # 4
    score["bob"]                       # error (key not found)
    "bob" in score                     # False
    score["bob"] =  42                 # adds key "bob" with value 42
    score["roberto"] =  18             # updates value for key "roberto" to 18
    "bob" in score                     # True
    del score["adam"]                  # remove key "adam" and its value
    list(score.keys())                 # returns a list of the keys ["roberto", "bob"]
    len(score)                         # 2 (number of keys in score)
    for k in score:                    # iterates over the keys in score
                  

Distilling Text

distill.py: 18 points, partner allowed.

Consider the following text:

     Question:
     Whether nobler mind suffer
     Slings arrows outrageous fortune
     Take arms against sea troubles
     By opposing them.
                
I suspect some of you recognize this as a condensed version of the first few lines of Hamlet's famous "To be, or not to be" soliloquy. The original unedited text is:

     To be, or not to be -- that is the question:
     Whether 'tis nobler in the mind to suffer
     The slings and arrows of outrageous fortune
     Or to take arms against a sea of troubles
     And by opposing end them.
                

The former was produced by finding the 30 most commonly used words in the speech (only some of which is shown above) and removing them. Your first challenge is to write a program called distill.py that prompts the user for the name of a text file and a number n, and prints the contents of that text file with the n most common words removed.

Program Outline


The details of implementing a solution are up to you, but here is a suggested outline of how to approach the problem. As usual, think about the 6 steps of program development, and test each piece as you go.

  1. Read in a text document. Here are two for you to try: hamlet.txt, lincoln.txt.

  2. Create a dictionary wordcount to keep track of the word counts for the given text file. That is, the keys in your dictionary should be strings (the words) and the value for a given key should indicate how often that word appeared.

  3. Once you've built the dictionary, implement your own version of insertion sort (or selection sort or bubble sort) to build a list called sortcount of word-count pairs (as tuples), sorted by word frequency, with the most common words first. For example, the first portion of the sortcount for hamlet.txt looks like [('the', 20), ('to', 15), ('of', 15), ('and', 12), ('that', 7),....

  4. Create a list commonwords by dropping all but the highest n elements of sortcount and only copying the word part of each pair (we don't need the counts any more). Continuing with the previous example, commonwords would become ['the', 'to', 'of', 'and', 'that',...].

  5. Reopen the original text document, and print each word so long as it doesn't appear in commonwords.

Suggestions and Tips


  • Recall that to read in a text file as a single string, you can use
    
        textfile = open(filename, "r"")
        textstring = textfile.read()
                      
  • The string function split() returns a list of all the "words" in a string, where "words" are substrings that don't contain whitespace.

  • Since you want to identify all the strings that look like "for", "For", "for,", etc. as the same string, it may be helpful to write a function called cleanstring that takes in a string, and returns a version of that string with lowercase letters and all punctuation removed. Testing if a character is punctuation isn't too hard: you can write
    
        if ch in ".,:;'\"\n"
                      
    for example, to test whether ch is a comma, period, colon, semi-colon, single quote, double quote, or newline. Of course, there may be other punctuation you want to check for as well (hint: string.punctuation in the string module might be very helpful!). To make all alphabetic characters lowercase, use the string function lower().

  • When you're deciding whether to print a word or not, you'll want to look at the "cleaned" version, but you may want to print the original version.

  • Try to preserve newlines, punctuation and capitalization in your output. The exact details of how to handle these cases is left up to you (for example, you might want to capitalize the first character of each new line, or drop words that don't contain any letters, such as "--").

  • Keep in mind that your output may differ slightly due to ties in word counts.

Sets

A set is another built-in data structures supported by Python for the mathematical notion of a set, i.e. a collection of elements. Unlike a dictionary, the elements in a set don't have values associated with them. You could simulate a set using a dictionary, but adding a key for each element, and setting that key's value to something arbitrary, like 0, or an empty string, or none. That said, if you don't have data associated with each element, and simply whant to keep track of a set of items, using a set is the way to go.

Like dictionarys (and unlike lists), sets are not ordered, but testing membership and addinging or removing elements is very fast. Sets do not store duplicate elements: adding an element to a set that already contains that element has no effect.

Here are some examples of syntax involving sets.


    team = set()                       # makes a set with 0 elements
    team = {"adam", "roberto""}        # makes a set with 2 elements
    len(team)                          # 2 
    team.add("bob")                    # adds "bob" to team
    team.remove("adam")                # removes "adam" from the team
    for p in team :                    # iterates through elements of team
    "bob" in team                      # True
    "jackie" in team                   # False
    "sam" not in team                  # True
                  

Anagrams

anagrams.py: 20 points, partner allowed.

An anagram is just a rearrangement of the letters in the word to form another word or words. For example, if you rearrange the letters in

    oberlin student 
                
you can get

    let none disturb 
                
or
   
    intends trouble
                
and many many more. But I'm fairly confident that those are probably the best ones.

For this part of the assignment, you'll be writing a program called anagrams.py to generate your own anagrams. To decide which anagrams are at least plausibly interesting, your program will have to decide which strings are legitimate words. Your program should prompt the user for a file containing a word list, and a word for anagrammating. It's vaguely concievable that that is a word.

Once we get the basic functionality down, we'll add a few other features. For example, we'd like to allow the user to specify that they are only interested in using words of length 4 or greater, or anagrams containing at most 3 words.

Program Outline


Broadly speaking, the steps you'll follow will include the following.

  1. Read in a text document containing a word list. Here are two: words1.txt, words2.txt. The first is very small, just for testing purposes. The second contains about 4000 common words. Even for relatively short strings, we'll need to use some optimizations if we want to generate anagrams using that word list.

  2. Build a set words containing each word from the text file. Since we have a lot of words, by using a set rather than a list will save us a lot of time when testing membership (which is basically all we'll be using it for).

  3. Create a function called contains(s, word) which returns two values. The first value should be a boolean indicating whether the string s contains the letters necessary to spell word. If the answer is True, the second value should be what remains of word after the letters in s have been removed. If the answer is False, the second value returned should just be an empty string. For example,

    
      	contains("zombiepig", "bozo")        # returns False, ""
      	contains("zombiepig", "biz")         # returns True, "omepig"
                    	
  4. Create a recursive function called grams(s, words, sofar) that takes in a string s, a set of words words, and a list of words sofar. This function finds all anagrams of s using elements found in words. Each of these anagrams is printed, along with the words in sofar, on its own line.

    You might be wondering why we're passing around the variable sofar. Indeed, when we want to find the anagrams of a string given by the user, we'll pass in an empty list. However, that list will be critical for making use of recursion. Let's look at an example to see why. Suppose we want to find anagrams of

       
          robopirate 
                    	
    We'll look through our wordlist for words that are contained in this string. The string "cat" doesn't appear in "robopirate", but "air" does. So one thing our function call will do is begin looking through the remainder of "robopirate" with "air" removed, looking for further anagrams. That is, it'll continue to look for strings contained in

        	robopte
                	    
    Our list includes "bro", which is contained in "robopte", so another recursive call will be made on the remains, namely "opte". Our wordlist contains "poet", leaving us with an empty string. At this point we've used up all the letters in the string, so we have an anagram, namely

          air bro poet 
                      
    Unfortunately, if we want to print our anagram, we're in trouble, since we haven't kept a record of the previous words we found. (Could we have just printed words as we found them?) That's where sofar comes in. This list will track the words we've found so far in this particular branch of the recursion. That is,

    
          grams("robopirate", words, [])
                      
    will call (among other things)

    
      	  grams("robopte", words, ["air"]) 
                    	
    which in turn calls

    
      	  grams("opte", words, ["air", "bro"]) 
                    	
    which in turn calls

    
      	  grams("", words, ["air", "bro", "poet"]) 
                    	
    which can now print the complete anagram.

    With this in mind, we're ready to describe the overall structure of this function. We loop over every word w in our wordlist. For each word w that's found in our string s, we make a recursive call on the remainder of s, and with a new list, equal to the current list with w added on. (Why does it need to be a new list? I.e., why would sofar.append(w) be the wrong thing to do? Then every recursive call would add to the same list, and we'd be saving words that we won't eventually use in case they don't work out.) If we make a call where s is the empty string, we can just print the contents of sofar.

Suggestions and Tips


  • When trying to determine whether one string s contains a word w, the string replace function is very useful. In particular,

    
      	  s.replace(ch,'',1) 
                    	
    creates a new string that is identical to s except the first occurence (parameter 3) of ch (parameter 1) is replaced by the empty string (parameter 2).

  • When printing a list of strings, the string join function is pretty handy:

    
        	" ".join(L) 
                    	
    returns a new string containing every element of the list L, glued together with a space. Of course, you could join the elements with any string, but a space makes the most sense here.

Test output


If you print all strings from words1.txt that are contained in "robopirate", you should get the following:

      or bro rat bat air ape poet poor ripe taboo orbit
                
Note that this doesn't contain "rabbit" ("robotpirate" only has one "b"). If you run your program using words1.txt for your word list on the string "robopirate", you should get

     or ape orbit
     or orbit ape
     bro air poet
     bro poet air
     air bro poet
     air poet bro
     ape or orbit
     ape orbit or
     poet bro air
     poet air bro
     orbit or ape
     orbit ape or
                
It doesn't matter if your output is in another order.

Improvements


Once that's working, add in the following optimization and improvements:

  • Things slow down a lot when we have longer word lists. One nice optimization makes use of the following observation: if the user's string s doesn't contain a particular word w, then no remainder of s will contain w either. So instead of iterating through the set of all words at each step of the recursion, we need only iterate through the "plausible" words.

    Add a "preprocessing" step: Instead of adding every string found in the word file to the set words, only add those that are contained in the user's input string. Fortunately you already have created a function that can help you out here!

  • Let the user specify a minimum length of words allowable in their anagrams.

  • Let the user specify a maximum number of words allowable in their anagrams.

  • Have all parameters (the string, the word file, min length, and max words) input via command line arguments. In particular, we'd like to be able to type in the Terminal:

              python3 anagrams.py robopirate words1.txt 3 4
                      
    to annagramate "robopirate" using words from words1.txt, require at least 3 characters per words, and allow at most 4 words per anagram, or

              python3 anagrams.py robopirate words2.txt 2 10
                      
    to annagramate "robopirate" using words from words2.txt, require at least 2 characters per words, and allow at most 10 words per anagram.

    How can your program make use of the arguments you add after the program name? Easy -- if you add import sys at the beginning of your program, you'll get access to the variable sys.argv, which is a list of the arguments passed to Python. The first of these is always the name of the program itself. But if you were to run

              python3 anagrams.py robopirate words2.txt 2 10
                      
    and that program included the statement print(sys.args), we'd get as output

              ['anagrams.py', 'robopirate', 'words2.txt', '2', '10']
                      
    Given this (and possibly judicious use of the int function), you should be able to get all the input you need from command-line arguments.

    Remember to use exceptions to handle invalid user input.

Wrap Up

README: 2 points, individual.

As with every lab, your last job prior to submission is to complete a brief write-up in a README file. If you haven't already done so, please create a new README file in your lab08 folder.

In this file, write a sentence or two about what you learned in this lab. Also give an estimate of the amount of time you spent on the lab. If you have further thoughts about the lab (e.g. parts that were confusing, helpful, annoying, fun, or challenging), please let us know.

Handin

If you followed the Honor Code in this assignment, insert a paragraph attesting to the fact within your README file.

I affirm that I have adhered to the Honor Code in this assignment.

You now just need to electronically handin all your files. As a reminder

 
   % cd             # changes to your home directory
   % cd cs150       # goes to your cs150 folder
   % handin         # starts the handin program
                    # class is 150
                    # assignment is 8
                    # file/directory is lab08
   % lshand         # should show that you've handed in something
                

You can also specify the options to handin from the command line

 
   % cd ~/cs150     # goes to your cs150 folder
   % handin -c 150 -a 8 lab08
                

File Checklist


You should have submitted the following files:
   distill.py
   anagrams.py
   words1.txt
   words2.txt
   README