Protein Matching

match.py: 15 points


Background

One of the many important collaborations between computer science and biology is the study of protein sequences (in an area of research called bioinformatics). As you may know, proteins are chains of molecules called amino acids that make up important building blocks for organic life. Human cells can make 20 different amino acids, each of which is typically represented by a single letter code. We can therefore define any protein as a specific sequence of amino acids. This sequence determines the properties of the protein, including its 3D structure.

3D model of the insulin protein.

Identifying Proteins

When a new protein is found, we can guess its functionality by comparing it to proteins we have already studied. Specifically, we can check if the protein contains certain markers common to a known class of proteins. For example (and an entirely bogus example at that), suppose we discover a new protein, that we’ve named Duane, with the following amino acid sequence:

STTECQLKDNRAWTSLFIHTGHTECA

We may also suspect that Duane might belong to one of two possible classes of proteins: Spiffs and Blorts. Most Spiffs contain the pattern TECQRKMN or at least something close to it. That is, most of the amino acid sequences in the class of Spiff proteins have the subsequence TECQRKMN with only a few of the letters changed. Blorts, meanwhile, contain the pattern ALFHHTTGT, or something very similar.

Now let’s try to compare the marker for Spiffs and the marker for Blorts against the amino acid sequence for Duane. We can see that Duane contains the pattern TECQLKDN which only has 2 mismatches from the Spiffs marker TECQRKMN (the errors are marked with a ^ below).

STTECQLKDNRAWTSLFIHTGHTECA
  TECQRKMN
      ^ ^ 

On the other hand, the closest pattern to the Blort marker has 4 mismatches:

STTECQLKDNRAWTSLFIHTGHTECA
              ALFHHTTGT
              ^  ^  ^^

In this case, we can deduce that Duane is most likely a Spiff! Pretty spiffy, eh?!

Your Program

In this part of the lab, we will be writing a program called match.py that takes a given protein’s amino acid sequence and determines how closely the protein matches the markers for different classes of proteins.

In particular, the match.py program has two global variables already defined:

PROTEIN = “STTECQLKDNRAWTSLFIHTGHTECA”

which is the protein sequence from the example above, and

MARKERS = [“TECQRKMN”, “ALFHHTTGT”, “TTECQ”, “HT”, “ZZZ”, “TTZZZRAWT”]

which is a list of strings representing the markers of six classes of proteins (including the two from the example above).

Pay close attention to the global variable names above. Notice that they are both capitalized. It will be our convention to use capitalized variable names for global variables. This will help us keep track of them in our code and avoid issues with scope.

Alignments

To determine whether part of PROTEIN’s amino acid sequence closely matches each of the strings in MARKERS, we need to think about all of the ways we can align a marker with the protein sequence. For instance, using the marker TECQRKMN as an example, we could align it with the start of the PROTEIN (so the starting position is 0):

STTECQLKDNRAWTSLFIHTGHTECA
TECQRKMN
^^^^^^^^

which has errors in all 8 characters of the alignment. We could also shift the marker by one character (i.e., starting position = 1) to get a different alignment:

STTECQLKDNRAWTSLFIHTGHTECA
 TECQRKMN
  ^^^^^^^

which has errors in 7 characters of the alignment (only the first character was correct). We could again shift the marker by two characters (i.e., starting position = 2) to get another alignment:

STTECQLKDNRAWTSLFIHTGHTECA
  TECQRKMN
      ^ ^ 

which has errors in only two characters. And the final alignment is when the marker is shifted all the way to the right (i.e., starting position = 18):

STTECQLKDNRAWTSLFIHTGHTECA
                  TECQRKMN
                  ^^^^^^^^

which also has errors in all 8 characters of the alignment.

Finding Closest Matches

Your program should compare each marker in the MARKERS list to all of the possible alignments in PROTEIN where the marker could start. For each marker, the program should print both (1) the starting position in PROTEIN of the alignment that provided the closest match, and (2) how many errors were found in that particular alignment. The output of your program should look something like this:

Marker TECQRKMN has 2 errors at starting position 2.
Marker ALFHHTTGT has 4 errors at starting position 14.
Marker TTECQ has 0 errors at starting position 1.
Marker HT has 0 errors at starting position 18.
Marker ZZZ has 3 errors at starting position 0.
Marker TTZZZRAWT has 5 errors at starting position 5.

ReadMe

To organize your solution to this program, you might think about what loops are needed. First, since we are comparing multiple markers to a single PROTEIN, we need a loop over each marker in the MARKERS list. Second, since we are comparing all of the possible alignments of the current marker with PROTEIN, we will need a second loop over all the starting positions where we could start the marker inside of PROTEIN. Finally, since we want to count the number of errors for each alignment, we will need a third loop that compares all of the characters in the current alignment.

This problem can be solved by creating three nested loops, or you could think about splitting it up into functions containing only one or two loops each. Both solutions are equally valid – pick the one that best matches how you think about problem solving. As you work on the problem, you might find that using functions to represent the different for loops makes the solution easier to understand and implement.

Finally, make sure you are including and using a main() function for this program too!