CSCI 150: Lab 5

Strings and Lists
Due: 10PM on Tuesday March 19

The purpose of this lab is to:

  • Practice using strings, lists, and files
  • Play a game!
  • Explore two connections between computer science and biology
  • Encounter machine learning

Before you begin, please create a folder called lab05 inside your cs150 folder. This is where you should put all files made for this lab.

Part 1 - Mind Mastery

game.py: 14 points, individual.

Mastermind is a neat (although oftentimes frustrating) puzzle game. It works a something like this: There are two players. One player is the codemaker (your program), the other is the codebreaker (the user). The codemaker chooses a sequence of four colored pegs, out of a possible six colors (red, blue, green, yellow, orange, and purple). They may repeat colors and place them in any order they wish. This sequence is hidden from the codebreaker. The codebreaker has 10 chances to guess the sequence. The codebreaker places colored pegs down to indicate each of their guesses. After each guess, the codemaker is required to reveal certain information about how close the guess was to the actual hidden sequence.

Describe the Problem:

In this part of the lab, you will create a program to play Mastermind, where computer is playing the codemaker, and the human user is the codebreaker. Thus your program needs to generate a secret code, and repeatedly prompt the user for guesses. For each guess, your program needs to give appropriate feedback (more detail below). The game ends when either the user guesses correctly (wins) or uses up 10 guesses (loses).

Understand the Problem:

The trickiest part of this game is determining how to provide feedback on the codebreaker's guesses. In particular, next to each guess that the codebreaker makes, the codemaker places up to four clue pegs. Each clue peg is either black or white. Each black peg indicates a correct color in a correct spot. Each white peg indicates a correct color in an incorrect spot. No indication is given as to which clue corresponds to which guess.

For example, suppose that the code is RYGY (red yellow green yellow). Then the guess GRGY (green red green yellow) would cause the codemaker to put down 2 black pegs (since guesses 3 and 4 were correct) and 1 white peg (since the red guess was correct, but out of place). Note that no peg was given for guess 1 even though there was a green in the code; this is because that green had already been "counted" (a black peg had been given for that one).

As another example, again using RYGY as our code, the guess YBBB would generate 1 white peg and 0 black; yellow appears twice in the code, but the guess only contains one yellow peg. Likewise, for the guess BRRR, only 1 white peg is given; there is an R in the code, but only one. Below is a table with guesses and the correponding number of black and white pegs given for that guess (still assuming the code is RYGY).

guess black pegs white pegs
YYYY 2 0
YRYR 0 3
BBPO 0 0
PGYR 0 3
YYYG 1 2
RYGY 4 0
Check here for an online graphical version of the game (where their red pegs are our black pegs).

A sample run of our text-based program may look like this:

Sample output

  %python3 game.py
    
  I have a 4 letter code, made from 6 colours.
  The colours are R, G, B, Y, P, or O.

	Your guess: GGGG
    Not quite. You get 0 black pegs, 0 white pegs.

        Your guess: YYYY
    Not quite. You get 1 black pegs, 0 white pegs.

        Your guess: YOYO
    Not quite. You get 0 black pegs, 2 white pegs.

        Your guess: PPYO
    Not quite. You get 1 black pegs, 2 white pegs.

        Your guess: POYB
    Not quite. You get 1 black pegs, 3 white pegs.

        Your guess: PBOY
    You win! So clever.
                          

Design an Algorithm

Once you understand how the game works, you should design a pseudocode plan of attack. The general steps are:

  • Randomly choose the codemaker's code.
  • Repeatedly prompt the user for their guess
    • If their guess is correct, end the game with a congratulatory message.
    • Otherwise, give clues (pegs) that correspond to their guess.
Some of these steps are straight-forward, but certainly it would be worth your while to write down an approach to randomly generating the code, and giving the clue. You did some of this on your prelab; our recommendation for the later algorithm is as follows:

  • First assign the black pegs. Do this by iterating through both strings one character at a time, assigning a black peg if the characters in the same position match. If they do match, change that character/peg in your guess to an 'x' and in the code strings to a 'z', say, so that you know you have used both of these pegs for a clue (and you won't use them again when assigning white pegs and accidentaly double-count). (You can find some useful string methods here in order to do this.)
  • Next assign the white pegs. Do this by considering the first peg in the guess string and trying to find the matching character in the code string (note that if the first character is an 'x' then there definitely won't be a match in the code string). If you find a match, again change the guess character to 'x', the matching code character to 'y' and continue with the second peg in the guess string, then the third, and so on. Because of your changes to 'x' and 'y', you won't assign both a black and white peg for one guess or clue peg.

Implement a Design

Now that you have some of the kinks worked out in theory, it is time to write your program game.py.

You may assume the user always provides a guess with the available colors, and always in uppercase.

Make and use an integer constant NUM_TURNS that represents the number of allowable turns (say, 10).

generateCode()

To generate the code, write a method generateCode() that generates the codemaker's code (and returns it as a String to the caller). That is, this method should randomly generate 4 colored pegs, selected from R, B, G, Y, O, and P, and return it as a 4-letter string. You'll want to use the random methods as discussed in lab03 in order to randomly generate a color for each peg. In particular, you'll generate an integer between 0 and 5 inclusive, and use if-statements to map each result to one of the 6 colors (if the random number is a 0, add a "R" to the code; if the random number is a 1, add a "B" to the code; etc.). Test your generateCode() method thoroughly before continuing. No, seriously, test it before continuing.

clue(code, guess)

Next, write a method clue(code, guess) that prints out the white and black clue pegs according to the given guess and code, and returns True if code equals guess, and False otherwise. Translate the pseudocode above to help you out.

Note that you can "change" the i-th character in a string s to an 'x' as follows:


  s = s[0:i] + "x" + s[i+1:len(s)]
                        
Also note you can omit the len(s) from the above expression. That is, if you write s[i:], Python interprets that as the substring of s from position i to the end. Similarly, s[:i] denotes the substring of s from the beginning up to (but not including) i.

Test the Program

It is hard to test your program when you are given a random code that you don't know. Therefore, you should print out a hint message at the beginning of the program with the actual code, so that the graders know what the correct answer is when evaluating the number of black and white pegs your program provides.

Part 2 - Guess that Flower!

predict.py: 12 points, partner allowed.

Files Needed: (download each to your lab05 folder)

In many areas of life, we can make accurate predictions about something we do not know by using information we have already seen or learned. In computer science, we call this process supervised machine learning, where a computer learns how to make predictions using only previously seen examples (without explicit programming using conditionals or rules for how to make predictions).

One of the first classic examples (studied by Ronald Fisher in 1936) of using machine learning to make predictions is in guessing the species of a flower given only measurements about the length and width of its petals and sepals. In this lab, we will complete a program that performs machine learning using an algorithm called k-nearest neighbor.

k-Nearest Neighbor

The k-Nearest Neighbor algorithm works something like this. Say you have a data point consisting of multiple measurements (e.g., flower size, ingredient amounts, pixel colors) that you want to make a prediction about (e.g., flower species, recipe type, what type of animal is in the picture). We can compare that data point against others for which we know their type or category (e.g., iris versicolor, brownies, cat photo) and then use the type or category of the closest matching data point(s) as our prediction.

Describe the Problem:

Complete the already started program predict.py that makes predictions about the species of flowers based on their measurements using comparisons to other flowers for which we know their species.

Understand the Problem:

We can represent each flower as a list of four measurements: sepal length, sepal width, petal length, and petal width.

Thus, one flower might be represented as a list:

  flower = [6.7, 2.5, 5.8, 1.8]
                        
and a list of 5 flowers might be represented as a two-dimensional list:

  fiveFlowersList = [[6.7, 2.5, 5.8, 1.8], [7.4, 2.8, 6.1, 1.9], [6.8, 3, 5.5, 2.1], 
                     [6.2, 3.4, 5.4, 2.3], [5.6, 3, 4.5, 1.5]]
                        
Note: we will read in all the flower measurements from file, so you do not need to make any assignments like the above (they merely illustrate what the data looks like). The files you will need in your directory are: iris_measurements.csv (the measurements of flowers for which we know their species), iris_species.csv (the species of the known flowers), iris_measurements_unknown.csv (the measurements of flowers for which we want to guess their species), and iris_species_unknown.csv (the actual species we want to guess -- this is only used for testing and should not be used in the code you write). Please download each of these files to the same directory as your program.

The program predict.py already provides a main function for you that reads in all of the flowers from the CSV files mentioned above. Specifically, it creates three lists for you:
  1. knownFlowers is a list of measurements of flowers for which we know their species
  2. knownSpecies is a list of species for each of the flowers in knownFlowers (knownSpecies[0] is the species of knownFlowers[0], etc.)
  3. unknownFlowers is a list of measurements of flowers for which we want to predict their species
Your assignment is to use those three lists to create a new list called predictions that contains the predicted species for each flower in unknowFlowers. To make those predictions, it might help to create two functions that perform the following:
  1. Calculating the distance between any two flowers
  2. Finding the closest flower from knownFlowers to a single flower from unknownFlowers

Design an Algorithm:

To calculate the distance between any two flowers based on their measurements, we an reuse the Euclidian distance formula we used in Lab 03 for the Monte Carlo pi approximation program. If two data points (call them x and y) have four measurements, their Euclidian distance is:

If we want to find the closest flower from a list knownFlowers to a given flower newFlower, we can simply calculate the distance between newFlower and knownFlowers[0], knownFlowers[1], etc. Then the closest one in knownFlowers is the one with the smallest distance to newFlower.

Implement a Design

You will implement your solution by adding code to the predict.py program. You will also need to download the predict_helpers.py module into the same directory (which includes some predefined functions that you will need, e.g., for reading data from file and for calculating the accuracy of your predictions).

Test the Program:

Run your program. If correctly implemented, you should achieve very high accuracy (not quite 100%, but close!). Of note -- such high accuracy is actually pretty remarkable. The entire data set contains equal numbers of the three species of iris flowers. So random guessing (assuming the program learns nothing) would only achieve 33% accuracy! The fact that we achieve much better results demonstrates the potential power of machine learning, especially as applied to other disciplines (such as biology).

Maintain:

Make sure your code is "readable": use short but meaningful variable names and make sure to add your own comments explaining what you are doing.

Handin

Be sure to hand in what you have finished so far.

Part 3 - Looking for a Match

match.py: 20 points, partner allowed.

Files Needed: (download each to your lab05 folder)

As you may know, proteins are chains of molecules called amino acids. There are 20 amino acids, each of which is typically represented by a single letter, and any protein can be specified by its sequence of amino acids. This sequence determines the properties of the protein, including its 3D structure.

Left: A general amino acid (structure of R determines the particular amino acid).
Right: 3D structure of a protein. Image source: wikipedia.org.

When a new protein is found, one way in which we might attempt to guess the functionality of that protein would be to see if it contains certain markers common to a known class of proteins. For example (and an entirely bogus example at that), suppose we discover a new protein, that we've named Duane, with the following amino acid sequence:

STTECQLKDNRAWTSLFIHTGHTECA
                

We may also suspect that Duane might belong to one of two possible classes of proteins: Spiffs and Blorts. As you well know, most Spiffs contain the pattern TECQRKMN or at least something close to it. That is, most of the sequences in the class of Spiff proteins have the subsequence TECQRKMN with only a few of the letters changed. Blorts, meanwhile, have the pattern ALFHHTTGT, or something very similar.

In this case, we can deduce that Duane is most likely a Spiff: Duane contains the pattern TECQLKDN which only has 2 mismatches from TECQRKMN (the errors are marked with a ^ below).

TECQLKDN
TECQRKMN
    ^ ^ 
                

The closest pattern to the Blort sequence is

SLFIHTGHT
ALFHHTTGT
^  ^  ^^ 
                

which has 4 mismatches.

Describe the Problem:

Input: A file that contains a string s representing a protein sequence, along with some number of strings, each representing a marker sequence.

Goal: For each marker sequence, find its best match in the protein sequence and report its location and the number of errors in the match.

Understand the Problem:

The file test.txt is in the format you should expect for your input (and is the file you should use to test your program). In particular, the first line will always contain the protein sequence. Following the protein sequence will be some number of pattern sequences. For each of these sequences, you should report the location of the best match, and the number of errors at that location.

For example, the contents of test.txt are as follows:

STTECQLKDNRAWTSLFIHTGHTECA
TECQRKMN
ALFHHTTGT
TTECQ
HT
ZZZ
TTZZZRAWT
                        
For this file your program should have something like the following output:

Example Output

Sequence 1 has 2 errors at position 2.
Sequence 2 has 4 errors at position 14.
Sequence 3 has 0 errors at position 1.
Sequence 4 has 0 errors at position 18.
Sequence 5 has 3 errors at position 0.
Sequence 6 has 5 errors at position 5.
                          

Design an Algorithm:

Make sure you come up with a plan of attack (on paper) before you begin coding.

Implement a Design:

Unlike previous assignments in which data was entered by the user or hard-coded into the program, here your data will come from a file. As such, you'll need a few tools for handling files.

Reading from a File

To work with an external file, you'll use the open function:


  <variable> = open(<filename>, <mode>)
                          
This function opens the file with the given name and loads it into the specified variable. The <mode> can be either "r" or "w", depending on whether you intend to read from the file or write to the file (you can only do one type of operation at a time). In this case we're just reading from the file, so you'll want something like this:


  inputFile = open("test.txt","r")
                          
You can now use functions associated with a file object, including:
  • inputFile.read() Returns the rest of the file as a single string.
  • inputFile.readline() Returns the next line of the file as a string.
  • inputFile.readlines() Returns a list of all remaining lines.
Note that the file object keeps track of what you've read so far. So if you call the readline() function twice, the first call will return the first line of the file while the second call will return the second. If you want to start at the beginning again, <file>.seek(0) resets the implicit cursor to the beginning of the file.

Keep in mind that the lines returned by all these functions include the newline character at the end of each line. So if you call the print function on one of these lines, you'll print two newlines (creating a blank line). If you want to avoid this, you can either tell print not to add a newline at the end (end='') or you can just work with all but the last character in the line (myLine[:-1]).

You can also use a for loop to iterate through all the remaining lines in the file. For example,


  for line in inputFile :
    print(line[0],len(line))
                         	
will print the first character and the length of each line in the file. Note that if some lines had already been read before this for-loop, those lines wouldn't be interated through.
Your program should allow the user to pick the name of the file to be read. You can make sure your program works by creating other txt files and testing it on them. Be sure your program works for an arbitrary number of sequences. I.e. don't have a loop that is coded to grab 6 lines and then stop; there may be more of fewer. Using the for-loop to iterate through the file is helpful here.

Test the Program:

Try your program on a variety of sequences and inputs. Make sure the program still works when the protein sequence or some pattern sequence is empty.

Maintain:

Make sure your code is "readable": use short but meaningful variable names, use constants where appropriate, use functions where you can, and comment any code that does anything substantial (for example, it would be a good idea to put a comment before your for loops explaining the purpose of the for loops, before each function explaining the function's parameters and purpose, etc.) It is not necessary to comment assignment statements and simple things like that.

You should also handle exceptions so as to make your program robust to runtime errors (e.g. the user if enters a file name that doesn't exist).

Handin

Be sure to hand in what you have finished so far.

Part 4 - Wrap Up

As with every lab, your last job prior to submission is to complete a brief write-up by filling out a Google Form.

Handin

You now just need to electronically handin all your files. As a reminder

 
  % cd             # changes to your home directory
  % cd cs150       # goes to your cs150 folder
  % handin         # starts the handin program
                  # class is 150
                  # assignment is 5
                  # file/directory is lab05
  % lshand         # should show that you've handed in something
                

You can also specify the options to handin from the command line

 
  % cd ~/cs150     # goes to your cs150 folder
  % handin -c 150 -a 5 lab05
                

File Checklist

You should have submitted the following files:

  game.py
  predict.py
  match.py
  predict_helpers.py (for ease of grading)
  iris_measurements.csv (for ease of grading)
  iris_species.csv (for ease of grading)
  iris_measurements_unknown.csv (for ease of grading)
  iris_species_unknown.csv (for ease of grading)
  test.txt (for ease of grading)
                

A. Eck, T. Wexler, A. Sharp.