CSCI 151 - Prelab 8: Million Monkeys at Typewriters

Due 9am, Monday, 11 April 2016

In this prelab, you will familiarize yourself with some of the design and implementation issues of the upcoming lab. Please write or type up your solutions, and hand in a paper copy before class on Monday.

Motivation

In this lab you will design and implement an order k Markov model from a piece of input text. Sound scary? It isn't. Basically, we'll use these Markov model things to read in a sample text, then generate a new random text based off the sample. For example, the sample text may be a compilation of work by Dr. Seuss, which contains content such as

    Would you like them here or there?
    I would not like them here or there.
    I would not like them anywhere. 
    I do not like green eggs and ham.
    I do not like them Sam-I-am. 
    Would you like them on a house? Would you like them with a mouse?

Our Markov model will read in all of Dr. Seuss' fine work, then generate random text in Seuss' style, such as

    That Sam-I-am! That makes a story that needs a comb? 
    No time for more, I'm almost home. I swung 'round the smoke-smuggered stars.
    Now all that cart! And THEN! Who was back in the dark. 
    Not one little house Leaving a thing he took every presents! 
    The whole aweful lot, nothing at all, built a radio-phone. 
    I put in a house. I do not like them with a goat. 
    I will not eat them off. 
    'Where willing to meet, rumbling like them in a box. I do not like them all! 
    No more tone, have any fun? Tell me. What will show you. 
    You do not like them, Sam-I-am! And you taking our Christmas a lot.

As you can see, our random text certainly resembles the original in spirit, although it may not make a whole lot of sense. (Then again, one could argue about whether Dr. Seuss' original work makes sense...)

Markov Models

For the lab, you will use a Markov model for the somewhat silly purpose of generating stylized pseudo-random text; however, Markov models have plenty of "real" applications in speech recognition, handwriting recognition, information retrieval, and data compression. (In fact, there is a whole course on such models in the math department, called graphical models.)

Our Markov model is going to generate one character of our output at a time. To determine the next character, we need to look at the sample text and see which characters are likely to occur at this point. To do that, we take the last few characters we generated and try to find that substring in our sample text. Hopefully we'll find it a bunch of times, and from these occurrences we figure out what character should occur next.

For example, suppose we have already generated the text "I do not like them, ", and we want to determine the next character. Then, we may look in the sample text for all occurrences of the substring "ke them, ", and we may find that the substring occurs 10 times: 7 times followed by "Sam-I-am", 2 times followed by "on a boat", and once followed by "on a house". Then, with 7/10 probability, the next character is an S, and with 3/10 probability it is an o.

Now if you think about it, the further back we look in the text, the more our generated text will resemble the original. However, looking further back requires a lot more work and space, and produces less interesting text. So there are trade-offs to consider. The Markov model formalizes this notion as follows.

An order 0 Markov model looks in the sample text for the previous 0 characters of our generated text. That is, given an input text, you compute the Markov model of order 0 by counting up the number of occurrences of each letter in the input and use these as the frequencies. For example, if the input text is "agggcagcgggcg", then the order 0 Markov model predicts that each character is a with probability 2/13, c with probability 3/13, and g with probability 8/13. This has the effect of predicting that each character in the alphabet occurs with fixed probability, independent of previous characters.
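
To make the order 0 case concrete, here is a minimal sketch that counts the characters in the example string. The class name OrderZeroDemo and the method countChars are made up for illustration, and it uses java.util.TreeMap only so the letters print in alphabetical order; it is not the structure the lab asks you to build.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical demo class, not part of the lab's required code.
class OrderZeroDemo {
    // Count how many times each character occurs in the text.
    static Map<Character, Integer> countChars(String text) {
        Map<Character, Integer> counts = new TreeMap<>();
        for (char c : text.toCharArray()) {
            counts.merge(c, 1, Integer::sum); // start at 1, otherwise add 1
        }
        return counts;
    }

    public static void main(String[] args) {
        String text = "agggcagcgggcg";
        // Each count divided by the text length is that character's probability.
        for (Map.Entry<Character, Integer> e : countChars(text).entrySet()) {
            System.out.println(e.getKey() + " with probability "
                    + e.getValue() + "/" + text.length());
        }
    }
}
```

Running it on "agggcagcgggcg" prints the three probabilities from the example: 2/13 for a, 3/13 for c, and 8/13 for g.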

Characters in English text are not independent, however. An order k Markov model looks back at the previous k characters in the generated text, and bases its prediction on that substring of length k. That is, given an input text, you compute a Markov model of order k by counting up the number of occurrences of each letter that follows each sequence of k letters. For example, if the text has 100 occurrences of th, with 50 occurrences of the, 25 occurrences of thi, 20 occurrences of tha, and 5 occurrences of tho, the order 2 Markov model predicts that the next character following th is e with probability 1/2, i with probability 1/4, a with probability 1/5, and o with probability 1/20.
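
The arithmetic in the th example can be sketched as follows: each probability is just the suffix's count divided by the total number of times the key occurs. The class SuffixProbabilities and its probability method are hypothetical names used only for this illustration, and the counts are typed in by hand rather than read from a text.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical demo class, not part of the lab's required code.
class SuffixProbabilities {
    // Probability of suffix c = (count of c) / (sum of all suffix counts).
    static double probability(Map<Character, Integer> suffixCounts, char c) {
        int total = 0;
        for (int n : suffixCounts.values()) {
            total += n;
        }
        return suffixCounts.getOrDefault(c, 0) / (double) total;
    }

    public static void main(String[] args) {
        // Counts from the example: 100 occurrences of "th" in total.
        Map<Character, Integer> afterTh = new LinkedHashMap<>();
        afterTh.put('e', 50);
        afterTh.put('i', 25);
        afterTh.put('a', 20);
        afterTh.put('o', 5);
        for (char c : afterTh.keySet()) {
            System.out.println(c + ": " + probability(afterTh, c));
        }
    }
}
```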

So this is how we generate text. The details will become clear later. Right now, let's get on with the show.

Part 1 - Hash Map

As usual, you will start off your lab by implementing a data structure. This week you will implement your own hash map with chaining in a class called MyHashMap. You will build your hash table on top of an array of LinkedLists, one linked list per "bucket".

You will implement a subset of the java.util.Map interface, but you won't actually implement the interface.
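
If "chaining" is unfamiliar, the idea can be sketched roughly as below. This is a simplified illustration under several assumptions, not the MyHashMap you will write: the class name ChainedBuckets is made up, keys are fixed to String and values to int, there is no resizing, and only put and get are shown.

```java
import java.util.LinkedList;

// Hypothetical, simplified illustration of hashing with chaining.
class ChainedBuckets {
    // One key/value pair stored in a bucket's chain.
    static class Entry {
        final String key;
        int value;

        Entry(String key, int value) {
            this.key = key;
            this.value = value;
        }
    }

    private final LinkedList<Entry>[] buckets;

    @SuppressWarnings("unchecked")
    ChainedBuckets(int capacity) {
        // Java cannot create a generic array directly, hence the cast.
        buckets = (LinkedList<Entry>[]) new LinkedList[capacity];
        for (int i = 0; i < capacity; i++) {
            buckets[i] = new LinkedList<>();
        }
    }

    // Map a key to a bucket; floorMod keeps negative hash codes in range.
    int indexFor(String key) {
        return Math.floorMod(key.hashCode(), buckets.length);
    }

    // Insert or overwrite; returns the previous value, or null if none.
    Integer put(String key, int value) {
        LinkedList<Entry> chain = buckets[indexFor(key)];
        for (Entry e : chain) {
            if (e.key.equals(key)) {
                int old = e.value;
                e.value = value;
                return old;
            }
        }
        chain.add(new Entry(key, value));
        return null;
    }

    // Look the key up by scanning only its own bucket's chain.
    Integer get(String key) {
        for (Entry e : buckets[indexFor(key)]) {
            if (e.key.equals(key)) {
                return e.value;
            }
        }
        return null;
    }
}
```

Two keys that land in the same bucket simply share that bucket's linked list, which is why both put and get compare keys with equals while walking the chain.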

Part 2 - Basic Markov Model

Markov Class

In the lab you will create a class Markov to represent a specific k-character substring that appears in the input text. Ultimately, it will have a random() method that returns a random "next" character according to the Markov model (that is, according to what characters follow the substring in the input text). For now, it just stores the substring it represents and an integer that counts the number of times the substring appears. You will have a constructor, a method to increment the frequency count, and the usual toString method for output.

    public Markov(String substring)
    public void add()
    public String toString()
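
One possible starting shape for this first version of the class is sketched below. It assumes the count starts at 0 (so that every occurrence, including the first, is recorded by a call to add) and that toString prints the count followed by the substring; the field names are hypothetical, and your own design may differ.

```java
// A minimal sketch of the first version of Markov, under the assumptions above.
class Markov {
    private final String substring; // the k-character key this object represents
    private int count;              // how many times the key has been seen

    public Markov(String substring) {
        this.substring = substring;
        this.count = 0;
    }

    // Record one more occurrence of the substring.
    public void add() {
        count++;
    }

    @Override
    public String toString() {
        return count + " " + substring;
    }
}
```

For example, after new Markov("ag") receives two calls to add(), its toString() returns "2 ag".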

Frequency Counts

In the lab you will implement a program FrequencyCounter that reads the order parameter k of the Markov model from the command line and a text string from standard input, and inserts each k-character substring (key) from the text into a hash table. For example, if k is 2 and the input string is "agggcagcgggcg", then your program should create a Markov object for each of the 5 distinct keys (substrings of length 2), and call the add method 12 times total: ag gg gg gc ca ag gc cg gg gg gc cg. This should maintain an integer count of the number of occurrences of each key. You will use your hash table's methods to print out the number of distinct keys and the number of times each key appears in the text. For the example above, your program should output:

    5 distinct keys
    2 ag
    1 ca
    2 cg
    3 gc
    4 gg

What would your program output on the string "flippyfloppies" and k=2?

Suppose your program has a hash map hm to store the substrings of length k as keys, along with their counts. Write pseudocode that reads through a string input, adding the substrings of length k to the hash map, with the appropriate counts. You will probably want to make use of the substring() method of class String.

Part 3 - Language Generation

To generate random text, given a k-character key, your Markov object for this key must know all of the different letters that follow the key in the input text. This operation is the crux of the matter, as you will need it to generate random characters in accordance with the Markov model. You will need to modify your Markov class so that in addition to frequency counts (how many times the k-character key appears), it records the breakdown by next letter (of the times the key appears, how many times is it followed by an a? by a b? etc.). You will create your own tree map to keep track of the suffix characters along with their frequencies. You will also need to include the following method in your Markov class to increment the count of a suffix character.

    public void add(char c)

Then, you will implement a program TextGeneration based off FrequencyCounter that inserts keys into the hash table (if necessary), and calls add(char c) to add the appropriate suffix characters to the Markov model. It should produce the following output on the example input.

    5 distinct keys
    2 ag: 1 c 1 g
    1 ca: 1 g
    1 cg: 1 g
    3 gc: 1 a 2 g
    4 gg: 2 c 2 g

What would your program output on the string "flippyfloppies" and k=2?

Suppose your Markov class stores the suffix characters and their counts in a TreeMap<Character,Integer> called suffixes. Write pseudocode for the add(char c) method. (The method should increment the count of the suffix "c", or add it if it's not already in the tree map.)

Now you will add a method random() to Markov that returns a pseudo-random character according to the language model.

You will use random() as follows: you will modify TextGeneration so that it takes as command-line input an integer k, an integer M, and a filename file, and prints out M characters according to the order k Markov model based on file. You should start by printing the first k characters of the original text. Then, repeatedly generate successive pseudo-random characters. Using the example above, if the Markov object m represents the substring "gg", then m.random() should return c or g, each with probability 1/2. After you generate a character, move over one character position, always using the last k characters generated to determine the probabilities for the next. For example, if your program chooses c in the example above, then the next Markov object would represent the substring "gc", and according to the Markov model, the next character should be a with probability 1/3 and g with probability 2/3. Continue the process until you have output M characters.
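
One common way to pick a character in proportion to its count is to draw a random integer below the total and walk through the counts until it is used up. The sketch below illustrates that idea for the "gc" example; the class SuffixSampler and the method pick are hypothetical names, it uses java.util.Random and java.util.TreeMap rather than your own classes, and it is only one of several reasonable approaches to random().

```java
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

// Hypothetical demo class, not part of the lab's required code.
class SuffixSampler {
    // Return a suffix character with probability count/total.
    // Assumes the map is non-empty and every count is positive.
    static char pick(Map<Character, Integer> suffixes, Random rng) {
        int total = 0;
        for (int n : suffixes.values()) {
            total += n;
        }
        int r = rng.nextInt(total); // uniform in [0, total)
        for (Map.Entry<Character, Integer> e : suffixes.entrySet()) {
            r -= e.getValue();      // each suffix "owns" count-many values of r
            if (r < 0) {
                return e.getKey();
            }
        }
        throw new IllegalStateException("suffix counts must be positive");
    }

    public static void main(String[] args) {
        // Suffix counts for the key "gc" in the example: 1 a, 2 g.
        Map<Character, Integer> gc = new TreeMap<>();
        gc.put('a', 1);
        gc.put('g', 2);
        Random rng = new Random();
        int trials = 30000;
        int timesA = 0;
        for (int i = 0; i < trials; i++) {
            if (pick(gc, rng) == 'a') {
                timesA++;
            }
        }
        // Should come out close to 1/3, varying a little from run to run.
        System.out.println("fraction of a: " + (double) timesA / trials);
    }
}
```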

Given your TextGeneration output above for the string "flippyfloppies" and k=2, what are the probabilities of the given suffixes, for the given strings of length 2?
- substring: fl
  suffix i probability:
  suffix o probability:
- substring: pp
  suffix y probability:
  suffix i probability:
- substring: ip
  suffix p probability:
What is one possible outcome of the TextGeneration program when run on the input string "flippyfloppies" with k=2 and M=14?
Suppose the input string is still "flippyfloppies" and M=14, but that k=3. What is one possible outcome of the TextGeneration program? What has happened and why?

Last Modified: April 09, 2015 - Benjamin A. Kuperman - original by Alexa Sharp