CSCI 151 - Lab 7 Generating Text With a Markov Model

Due by 6PM on Sunday, May 1

Here is a zipped folder of starter code for this lab: Lab7.zip.

A system that evolves over time is said to have the Markov Property if its future state is determined by its current state without regard to its history. Such systems are sometimes called "memoryless". This property is popular with modelers because it means that fewer things need to be taken into account in a model.

In this lab we will implement a Markov model for generating text. The model includes an integer parameter K, which is typically between 2 and 10 or so. At each step we look at the K most recent characters we have generated; that is our current "state". We use this state to choose the next character. We drop the oldest of our K characters, add the new one, and that gives us a new state that generates another new character, and the process continues.

How do we generate the next character from a state? We train our model on a text sample. For every state in the training text, we look at every instance where that state occurred and what character it was followed by in each instance. If a state occurred 10 times in the training sample and was followed 2 times by the letter 'e', 3 times by 'y' and 5 times by 's', then when we reach this state while generating text we will use 'e' as the next character 20% of the time, 'y' 30% of the time, and 's' 50% of the time.

That is the entire model. It is quite simple, but if the training text is good and K is well chosen it can generate text that appears meaningful at first glance. Of course, there is no understanding built into the model; we are only generating character sequences.

Here, for example, is a prose poem generated by a model trained on all 154 sonnets by William Shakespeare:

O, what a torment wouldst use the slow offence
Of my dull bearer when from highmost pitch, with weary car,
Like feeble age, he reeleth from thee going he went wilful-slow,
Towards the pebbled shore,
So do our minutes kill.
Yet fear her, O thou mine, I thine,
Even as when first your eye I eyed,
Such seems your beauteous day,
And make Time's spoils despised,
Whilst it hath my duty strongly knit,
To thee I send this written embassage,
To witness duty, not to give the lie to my true sight,
And Time that gave doth now his gift confounded to decay,

Overview

We will use 3 classes to implement this model:

Each object of class State represents a string of K characters. In addition to the string itself the class has a TreeMap<Character, Integer> called suffixes that stores every character this string is followed by in the training text and how often that character was the followup. Class State has a method generate( ) that uses this information to randomly generate a character to follow this state. For example, if we use for training the string "aabaababc" with K=2, there are 4 states, representing the strings "aa", "ab", "ba", and "bc". For "aa" the only entry of suffixes is the pair ('b', 2) because 'b' is the only character that follows "aa" and it does so twice. For the state "ab" suffixes has 2 pairs ('a', 2) and ('c', 1). Note that state "bc" has no followup characters.
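The counts in the "aabaababc" example can be checked with a few lines of Java. This is only an illustrative sketch: the class SuffixCountDemo and the method countSuffixes are hypothetical names that are not part of the starter code, and the sketch scans a String rather than building State objects.

```java
import java.util.TreeMap;

// Hypothetical helper for illustration; not part of the Lab 7 starter code.
class SuffixCountDemo {
    // Counts which characters follow each occurrence of `state` in `text`.
    static TreeMap<Character, Integer> countSuffixes(String text, String state) {
        TreeMap<Character, Integer> suffixes = new TreeMap<>();
        int k = state.length();
        // Stop early enough that position i + k is always a valid index.
        for (int i = 0; i + k < text.length(); i++) {
            if (text.startsWith(state, i)) {
                char next = text.charAt(i + k);
                suffixes.put(next, suffixes.getOrDefault(next, 0) + 1);
            }
        }
        return suffixes;
    }

    public static void main(String[] args) {
        String text = "aabaababc";
        System.out.println("aa -> " + countSuffixes(text, "aa")); // {b=2}
        System.out.println("ab -> " + countSuffixes(text, "ab")); // {a=2, c=1}
        System.out.println("bc -> " + countSuffixes(text, "bc")); // {} (nothing follows "bc")
    }
}
```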

Class MarkovModel has a HashMap<String, State> that lets us find the State information for any K-character string. MarkovModel also has a method void train(String fName) that populates the model's HashMap from a text file and a method generateText( ) that generates a string of characters of any length from the model. So we first train the model, then we use it to generate text.

Class TextGenerator is our application program; it only has a main( ) method. This main( ) method constructs a new MarkovModel object, trains it with a text file, calls the model's generateText( ) method, and prints what it returns. This class is complete and you shouldn't need to change it.

Part 1 -- Some elements of Java that we will use.

We will use two kinds of map in this lab: TreeMap<Character, Integer> suffixes and HashMap<String, State> model. The major methods of TreeMap and HashMap are the same:

get( ) takes as argument a key and returns the value associated with it. For example, in class State suffixes.get('b') is the number of times the state's string was followed by the letter 'b'. In class MarkovModel model.get("ab") is the state corresponding to string "ab".
put( ) takes as arguments a key and a value and inserts them into the map. A key can only be in the map once, so if the key was previously bound to a different value, that association is updated to the new value.
containsKey( ). This returns true if its argument is already a key of the map.
keySet( ). This returns a structure holding all of the keys of the map. You can iterate through the keys with a for-loop:

      for (Character c: suffixes.keySet() ) {
 
                    ....
  
      }
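As a small illustration of these four methods working together, here is a sketch that counts how often each character occurs in a string. The class MapMethodsDemo and the method countChars are hypothetical names chosen for this example; they are not part of the starter code.

```java
import java.util.TreeMap;

// Hypothetical demo class; not part of the Lab 7 starter code.
class MapMethodsDemo {
    // Counts how often each character occurs in `text`.
    static TreeMap<Character, Integer> countChars(String text) {
        TreeMap<Character, Integer> counts = new TreeMap<>();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (counts.containsKey(c)) {
                // put() on an existing key replaces the old value.
                counts.put(c, counts.get(c) + 1);
            } else {
                counts.put(c, 1);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        TreeMap<Character, Integer> counts = countChars("banana");
        // A TreeMap hands back its keys in sorted order: a, b, n.
        for (Character c : counts.keySet()) {
            System.out.println(c + " " + counts.get(c));
        }
    }
}
```

A HashMap supports the same calls, but the order in which keySet() visits its keys is not specified.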

We need random numbers for this lab. Java makes that easy:

Random is the class of random number generators. The only method of this class that we need is nextInt(int N), which returns a random integer between 0 and N-1. Thus,

     Random rand = new Random();
       ...
     rand.nextInt(2);

will randomly give either 0 or 1.
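To see the range of nextInt( ) in action, the following sketch tallies simulated die rolls. The class RandomDemo and the method tally are hypothetical names used only for this illustration.

```java
import java.util.Random;

// Hypothetical demo class; not part of the Lab 7 starter code.
class RandomDemo {
    // Simulates `rolls` rolls of a six-sided die and tallies the results.
    static int[] tally(Random rand, int rolls) {
        int[] counts = new int[6];
        for (int i = 0; i < rolls; i++) {
            // nextInt(6) is always one of 0, 1, 2, 3, 4, 5, so it is a valid index.
            counts[rand.nextInt(6)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] counts = tally(new Random(), 6000);
        for (int face = 0; face < 6; face++) {
            System.out.println((face + 1) + ": " + counts[face]);
        }
    }
}
```

With many rolls each face should come up roughly equally often, though the exact counts change from run to run.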

We have been using Scanners to read files. In this lab we need to deal with individual characters rather than words or lines, so Scanners are not very convenient.

For this lab we will use class FileReader. The FileReader constructor takes a file name as its argument. FileReader has a method int read( ) that reads and returns the next character in the file. If it is at the end of the file it returns -1. If it is not at the end of the file you need to cast the int it returns into a char in order to use it as a character rather than an int. The FileReader constructor throws a FileNotFoundException and read( ) throws an IOException, so you need to use these in a try-catch block with two catches.

Here, for example, is a block of code that opens the file "foo.txt" and reads it into String s:

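    // Needs java.io.FileReader, java.io.FileNotFoundException and java.io.IOException.
    try {
          FileReader R = new FileReader("foo.txt");
          String s = "";
          boolean done = false;
          while (!done) {
               int c = R.read();        // next character, or -1 at end of file
               if (c == -1)
                   done = true;
               else
                   s += (char) c;       // cast the int to a char before appending
          }
          R.close();
    } catch (FileNotFoundException e) {
          System.out.println("Bad file name");
    } catch (IOException e) {
          System.out.println("IOException");
    }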

Part 2 -- Building the Model

Start with the State class. This has an instance variable str to hold the State's string, an integer variable counter for how frequently it occurs, and a TreeMap<Character, Integer> suffixes to hold its follow-up characters. The State constructor should take one argument: the String str, and initialize these variables.

Class State has an add( ) method:

void add(char c) adds to the suffixes TreeMap the fact that we found another instance of c as the followup character to this State: If c is already a key of suffixes get the value it is associated with, add 1 to this value, and put the (c, value) pair back in the map. If c is not a key then put (c, 1) into the map. In either case you should add 1 to counter.

For testing it is convenient to be able to print the information in class State. We have given the class this method:

     public String toString() {
           String s = String.format("%d %s:", counter, str);
           for (Character ch : suffixes.keySet() )
                 s += String.format(" (%c %d) ", ch, suffixes.get(ch));
           return s;
     }

The State class also has a generate( ) method, which we will discuss below.

Now turn to the MarkovModel class. This has two instance variables: the modeling parameter K and the HashMap<String, State> model variable that holds all of the states from the training text. The constructor is given integer K and a String fileName. It starts by initializing the two instance variables, and then calls its train() method with the file name as an argument.

void train(String fileName) is the longest method of this lab. It starts by opening a FileReader on the named file. Remember that this needs to be inside a try-catch block. It reads the first K characters of the file into a string s. We won't worry about the artificial situation where the file doesn't have K characters; just use a for-loop to read that many characters and add them onto s. Now go into a loop that ends when the FileReader object gives us a -1, the signal that it has reached the end of the file. At each step we use our FileReader to read one character c. If c is -1 we exit the loop; for any other character c we add to the model the fact that s is followed by c. As usual, if s is a key for the model we get its State and call that State's add( (char) c ) method. If s is not a key we make a new State for it, call the new State's add( (char) c ) method, and put it in the model. Then we add (char) c to the end of s, let s become s.substring(1) to drop its first character, and go around the loop again.
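The shape of this loop can be sketched as follows. This is a simplified illustration under several assumptions, not the method you are asked to write: the class TrainSketch is a hypothetical name, the sketch reads from any java.io.Reader (so a StringReader can stand in for the file), and it stores a plain TreeMap of counts per state instead of a State object.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.TreeMap;

// Hypothetical sketch; the lab's real train() uses FileReader and State objects.
class TrainSketch {
    // Builds a map: K-character state -> (following character -> count).
    static HashMap<String, TreeMap<Character, Integer>> train(Reader in, int k) {
        HashMap<String, TreeMap<Character, Integer>> model = new HashMap<>();
        try {
            // Read the first k characters to form the initial state
            // (assumes the input has at least k characters).
            String s = "";
            for (int i = 0; i < k; i++) {
                s += (char) in.read();
            }
            int c;
            while ((c = in.read()) != -1) {
                TreeMap<Character, Integer> suffixes = model.get(s);
                if (suffixes == null) {
                    suffixes = new TreeMap<>();
                    model.put(s, suffixes);
                }
                char next = (char) c;
                suffixes.put(next, suffixes.getOrDefault(next, 0) + 1);
                // Slide the window: drop the oldest character, append the new one.
                s = s.substring(1) + next;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return model;
    }

    public static void main(String[] args) {
        System.out.println(train(new StringReader("aabaababc"), 2));
    }
}
```

Note that the final K characters of the input are never stored as a key unless the same string also appears earlier, because nothing follows them.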

It is time to do some testing. The MarkovModel class has a method printModel( ) which prints all of the states of a trained model. The starter code includes a simple program that opens the file "SampleTextFiles/markovTest.txt", uses it to train a model, and then prints the model. The markovTest.txt file consists of 1 line containing the following string: "agggcagcgggcg". As you can see, it has 5 different 2-character states: "ag", "gg", "gc", "ca", and "cg". The output should be this:

      5 distinct states:   
        4 gg: (c 2)  (g 2) 
        1 cg: (g 1)    
        2 ag: (c 1)  (g 1) 
        3 gc: (a 1)  (g 2) 
        1 ca: (g 1)

Part 3 -- Generate Text

Now go back to the State class. This has the stub of a method generate( ). We want generate( ) to randomly choose among the suffixes for the state in a way that reflects how often they occur. Here is a way to do this: Class State has an instance variable counter that indicates how often it occurred in the training text. We also know how often each suffix followed that state. So generate a random number R between 0 and counter-1:

R = rand.nextInt(counter)

does this. Walk through the suffixes and subtract the count of each from R. Return the suffix that makes R go negative.

For example, suppose counter is 5 and we have 3 suffix letters: 'c' is used once, 'f' twice, and 'e' twice. Our variable R will have a value between 0 and 4.

If R is 0, we subtract 1 for 'c' and the result is negative, so we return 'c'.
If R is 1 we subtract 1 for 'c' and get 0, then we subtract 2 for 'f' and get -2, so we return 'f'.
If R is 2 we subtract 1 for 'c' and get 1, then we subtract 2 for 'f' and get -1, so again we return 'f'.
If R is 3 we subtract 1 for 'c' and get 2, then we subtract 2 for 'f' and get 0, then we subtract 2 for 'e' and the result is negative, so we return 'e'.
Similarly, if R is 4 we return 'e'. Out of the 5 equally likely values of R, one causes us to return 'c', two return 'f', and two return 'e'. Our generate( ) method returns the suffixes randomly, but with the same frequency as they had in the training text.
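The subtraction walk can be sketched in isolation like this. The class WeightedPickDemo and the method pick are hypothetical names, and R is passed in as a parameter so that every value can be tried by hand. One difference from the walkthrough above: a TreeMap visits its keys in sorted order, so this sketch considers 'c', then 'e', then 'f', and R values 0 through 4 give c, e, e, f, f. Each letter is still chosen in proportion to its count.

```java
import java.util.TreeMap;

// Hypothetical demo class; not part of the Lab 7 starter code.
class WeightedPickDemo {
    // Returns the suffix selected by r, where 0 <= r < (sum of all counts).
    static char pick(TreeMap<Character, Integer> suffixes, int r) {
        for (Character ch : suffixes.keySet()) {
            r -= suffixes.get(ch);
            if (r < 0) {
                return ch; // this suffix made r go negative
            }
        }
        throw new IllegalArgumentException("r must be smaller than the total count");
    }

    public static void main(String[] args) {
        TreeMap<Character, Integer> suffixes = new TreeMap<>();
        suffixes.put('c', 1);
        suffixes.put('f', 2);
        suffixes.put('e', 2);
        for (int r = 0; r < 5; r++) {
            System.out.println("R = " + r + " -> " + pick(suffixes, r));
        }
    }
}
```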

The last method you need to implement is generateText() for the MarkovModel class

     public String generateText( int M, String start)

Here M is the number of characters the model should generate; start is an initial string of length K. Initialize two string variables to the value of start: String text will consist of all of the text you have generated and will be returned at the end as the result of this method. String s will always be the last K characters of text. At each step we get the State associated with s and call its generate( ) method to get a letter c. Add c onto both text and s, and drop the first letter of s (i.e. s=s.substring(1) ) to maintain its length at K. This continues until the length of text is M; at that point we return text.

One issue might arise: if the last K letters of the training text form a substring that doesn't appear anywhere else, then this substring will not be one of the states in the model (because it is not followed by anything). You need to check at each step of generateText() that s is one of the keys of the model. If it is not, one easy fix is to replace s by start and continue generating characters.
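The generation loop, including the fallback to start, might look roughly like the sketch below. It is deliberately simplified and rests on assumptions that differ from the lab: the class GenerateTextSketch is a hypothetical name, and instead of State objects the model maps each state to a String that lists its followup characters once per occurrence, so picking a random position in that String stands in for State's generate( ). The sketch also assumes start is itself a key of the model.

```java
import java.util.HashMap;
import java.util.Random;

// Hypothetical sketch; the lab's real generateText() works with State objects.
class GenerateTextSketch {
    // model: state -> all characters that followed it, repeated once per occurrence.
    static String generateText(HashMap<String, String> model, int m,
                               String start, Random rand) {
        String text = start; // everything generated so far
        String s = start;    // always the most recent K characters
        while (text.length() < m) {
            if (!model.containsKey(s)) {
                s = start;   // dead end: restart from the initial state
            }
            String followers = model.get(s);
            char c = followers.charAt(rand.nextInt(followers.length()));
            text += c;
            s = s.substring(1) + c; // keep s at length K
        }
        return text;
    }

    public static void main(String[] args) {
        HashMap<String, String> model = new HashMap<>();
        model.put("ab", "c");
        model.put("bc", "d");
        // "cd" is not a key, so the loop falls back to "ab" each time it gets there.
        System.out.println(generateText(model, 6, "ab", new Random())); // abcdcd
    }
}
```

In this tiny model every state has exactly one followup character, so the output is the same on every run; a model trained on real text will vary.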

The TextGenerator program makes use of the State and MarkovModel classes you have built. This program, which you should not need to modify, takes three command-line arguments in the following order: K, M, fileName. It constructs a MarkovModel and uses the named file to train the model, then reads the first K characters of the file into string start and calls the model's generateText( ) method. Finally, it prints the string the generateText( ) method returns.

Note that line breaks are just characters like any other character to this program. You might find that it generates some very long lines. If this happens copy the output into a word processor like MS Word to make it easier to read.

Handin

Make sure you have included your name (and your partner's name if you worked with a partner) at the top of the State.java and MarkovModel.java files. The project you hand in should include those two files and the TextGenerator.java program.

Include in your submission a file named README. The contents of the README file should include the following:

Your name and your partner's name if you worked with someone
A statement of the Honor Pledge
Any known problems with your classes or program

As usual, make a zipped copy of your project folder (which should be Lab7<your last name>) and hand it in on Blackboard as Lab7.