CSCI 150: Lab 11

Concordances
Due: 10PM on Tuesday, May 2

The purpose of this lab is to:

Get more practice with lists and dictionaries
Create a practical program for text analysis

Getting Started

You will want to download the following test files for you program:

Prufrock.txt (T.S. Eliot's The Love Song of J. Alfred Prufrock)
Jabberwocky.txt (The Lewis Carrol poem)
Test.txt (a file for testing your line numbering)

Concordances.

In this lab you will create a concordance. What is a concordance? It is an index to the words of a text or of a body of texts. For example, if you are writing an essay about Shakespeare's view of kingship, you might want to look at the instances in his plays where the word "king" is used. There are a lot of these instances. You can find them all by looking at a concordance to Shakespeare -- look up the word "king" and you will get references by Play, Act, Scene and Line Number, to every use of this word in every one of Shakespeare's plays.. The Oberlin College library has concordances to Shakespeare and Donne and Chaucer and Dante and Vergil and Plato and even to Joyce's Finnegan's Wake. It has several concordances to the Bible, and the Qur'an and the Guanzi. In fact, the library has more than 150 books whose title starts "A concordance to ..."

One of the issues that the creator of a concordance faces is how to refer to a specific use of a word. We are going to take the easy way out and just use line numbers. This is great for making a concordance to a single poem, and less practical for a novel. Here is one small portion of the output of our concordance for The Love Song of J. Alfred Prufrock by T.S. Eliot:

      etherized 3
      evening 2 17 77
      evenings 50
      eyes 55 56

So the word "etherized" appears on line 3, "evening" appears 3 times, on lines 2, 17 and 77, and so forth. In this lab you will write a program that asks the user for the name of a text file, and then prints a concordance of the text in that file.

Data Structures: Lists and Dictionaries

The interesting parts of this lab are the structures we use to create the concordance. We need to store line numbers, possibly one and possibly many, for each word in the text. This is a problem of association -- we want to associate line numbers with words. Dictionaries are the structures to use for this. Dictionaries are designed to efficiently associate one datum with another. In dictionary terminology, keys are the things we use to look up values. The keys act like indexes. For our situation the words of the text will be our keys; the line numbers on which a word is found will be the value associated with that word. The line numbers themselves should be sequential -- we want to store them in increasing order. Lists are good for this, and are easy to use. Altogether, our concordance will be stored as a dictionary, where the keys are words (strings) and the values are lists of line numbers.

We have talked about both dictionaries and lists in class and in Lab 5 and Lab 8. Here are reminders of how these structures work in Python:

Dictionaries


    Ages = {}                       # sets Ages to be an empty dictionary
    
    Ages["Hermione"]                # returns the value associated with "Hermione", 
                                    # presumbly her age.  Throws an error if 
                                    # "Hermione" is not a key of Ages.
                                    
    Ages["Hermione"] = 18           # Makes "Hermione" a key and associates 18 with it.
    
    del Ages["Hermione"]            # removes key "Hermione" and the value associated with it.
    
    Ages.keys()                     # returns a "view" of the keys of Ages. You can 
                                    # treat this like a list of the keys.
                                    
    len(Ages)                       # returns the number of keys in Ages
    
    for person in Ages:             # Iterates over the keys in Ages.

Lists


    Numbers = []                    # sets Numbers to be an empty list
    
    Numbers[3]                      # returns the fourth entry of Numbers. 
                                    # Throws an error if Numbers does not have
                                    # 4 entries.
                                    
    Numbers.append(18)              # adds entry 18 onto the end of Numbers
    
    del Numbers[3]                  # removes the fourth entry of Numbers, shifting
                                    # later entries down.
                
    for x in Numbers:               # Iterates over the list Numbers.

Finally, in this lab we will make repeated use of several methods of the String class:

String Methods

In these examples assume s is a string.


    s.strip()                       # returns a string like s only with leading and
                                    # trailing white space delelted
                                    
    "bob".strip()                   # "bob"
    
    s.strip(p)                      # Here p is a string of punctuation characters to be
                                    # deleted.  This returns a string like s, only with
                                    # all of the letters of p, in any order, deleted
                                    # from the front and back of s.
                                    
    "{bob!*!!".strip( "(!*" )       # "bob"
    
    "isn't".strip( "!'?" )          # "isn't"
    
    s.split( )                      # returns a list of the "words" in s, using 
                                    # white space as the separator between words
                                    
    "The time is now!".split( )     # ["The", "time", "is", "now!"]
    
    s.split( delim )                # returns a list of the "words" in s, using the 
                                    # string delim as the separator between words.
                                    
    "8/10/2015".split( "/" )        # ["8", "10", "2015"]
    
    s.lower()                       # returns a copy of s with all letters converted to lower-case
                                    
    s.upper()                       # returns a copy of s with all letters converted to upper-case
                                    
    "aBC(De)fG".lower()             # "abc(de)fg"

Your Program

concordance.py: 20 points, partner.

Your program should ask the user for the name of one file. That file should be in the same folder as your program. Most text files have names that end in ".txt", so be sure to type this as part of the file name when you are running the program. Your program should open this file and read it one line at a time (a for-loop does this nicely), counting the line numbers (only count the non--blank lines; the first non-blank line should be numbered 1). Each word in the line should be stripped of punctuation marks, converted to lower-case, and added to your concordance with its line number. After the entire file is processed, you should print all of the words that are keys of your concordance, in alphabetical order, along with the list of line numbers for each word. Finally, at the end you should print the number of lines in the file and the number of unique words found.

For example, consider the file Test.txt:

    one!!
          
    Two Two
    !!!! --
    four four four four
      
    five five Five!   'five five

Here is the output we want from this file:

    five 5 5 5 5 5
    four 4 4 4 4
    one 1
    two 2 2
    I found 5 lines containing 4 unique words.

The word "one" appears once on the first line of the file; "two" appears twice on the line numbered 2 (we ignored the blank line between 1 and 2). There is a line 3, but the "words" on it consist only of punctuation characters so they are never added to the concordance.

There are 3 isses to consider with this program:

How to read the file line-by-line, counting the line numbers (see Lab 5)
How to get the individual words from a line, strip off their punctuation, and convert them to lower-case (see Lab 8)
How to add the words and their line numbers to the concordance.

Handling the dictionary

As we have said, the Concordance is a dictionary. At the start of your program you will create an empty dictionary for your concordance:


    Concordance = { }

Each time you come across a word you need to know if it is already a key in your concordance:


    if word in Concordance.keys():

What is different between this Lab and Lab 8 is that we won't have a single count as the value for each key in the dictionary (see distill.py in Lab 8), but instead a list of values. So whenever you add a word as a new key into the dictionary, you will use a list as its value (where the list only contains the current line number L):


    Concordence[word] = [L]      # make a list containing L and add it to the Concordance under word

And if the word already exists as a key in the dictionary, then you append the current line number to the end of the list for that word:


    Concordence[word].append(L)  # appends L to the end of the list saved under word in Concordance

After processing the entire text file you need to print all of the words in alphabetical order, followed by their line numbers. You can't directly sort a key structure; you need to first convert it to a list and then sort it. You should print the list of line numbers in a nice way so that it looks like the examples below.

The design of your program is up to you, but you should certainly divide the work to be done among several functions. One way to do this would be to use the following functions:

RemovePunctuation(s):: This function returns a new string that has the letters of s translated to lower-case, with all of the punctuation removed.
AddWord(word, lineNumber, C):: This handles the work of adding the fact that the given word was found on the given lineNumber to the dictionary C
PrintEntry(word, C):: This handles one word of the output. C is the concordance, so C[word] is the list of line numbers on which word occurs.
main( ):: This gets the file name, opens the file,and has a loop to read the file one line at a time, then splitting the line into words. RemovePunctuation( ) prepares the word for adding to the concordance; AddWord( ) actually does the addition. Finally, a loop over the keys of the concordance calls PrintEntry( ) on each word to handle the output

Testing your work

Here are several files that will help you test out your program:

Prufrock.txt (T.S. Eliot's The Love Song of J. Alfred Prufrock)
Jabberwocky.txt (The Lewis Carrol poem)
Test.txt (a file for testing your line numbering)

File Test.txt should give you the following output:

    five 5 5 5 5 5
    four 4 4 4 4
    one 1
    two 2 2
    I found 5 lines containing 4 unique words.

If you get something else there is either a problem with your line numbering or the way you are stripping punctuation. The other files are mainly useful for checking punctuation; there are many different punctuation characters used in these files and you should remove all of them. Look carefully at your output. If you see what appears to be a blank word followed by line numbers, it probably comes in the following way. The split( ) method separates a string into words by using white space as as delimiter, so some "words" might be just sequences of punctuation characters, such as "!!!". When you strip off the punctuation you are left with the empty string. Before you add a word and its line numbere to the concordance you should check if the word is the empty string; if it is, just don't add it.

If you want to play with your concordance, here are a few additional files you might work with:

Beowulf.txt (translated to modern English by Hall)
DavidCopperfield.txt (all 626 pages of the Dickens novel)
Inferno.txt (the first third of Dante's Divine Comedy, translated by Norton)
KingLear.txt (the Shakespear play)
Republic.txt (Plato's "Republic")

Thanks to Project Gutenberg, there are many, many more text files available on the Web.

Wrap Up

README: 2 points, individual.

As with every lab, your last job prior to submission is to complete a brief write-up in a README file. If you haven't already done so, please create a new README file in your lab11 folder. Please make sure to list who you worked with in this file, if you worked with a partner.

In this file, write a sentence or two about what you learned in this lab. Also give an estimate of the amount of time you spent on the lab. If you have further thoughts about the lab (e.g. parts that were confusing, helpful, annoying, fun, or challenging), please let us know.

Handin

If you followed the Honor Code in this assignment, insert a paragraph attesting to the fact within your README file.

I affirm that I have adhered to the Honor Code in this assignment.

You now just need to electronically handin all your files. Everyone needs to handin a README file, and if you worked with a partner, only one person in your group needs to submit your files. As a reminder

 
       % cd             # changes to your home directory
       % cd cs150       # goes to your cs150 folder
       % handin         # starts the handin program
                        # class is 150
                        # assignment is 11
                        # file/directory is lab11
       % lshand         # should show that you've handed in something

You can also specify the options to handin from the command line

 
       % cd ~/cs150     # goes to your cs150 folder
       % handin -c 150 -a 11 lab11

File Checklist

You should have submitted the following files:

      concordance.py
      README (with the Honor Pledge)