Lab 8

Text Generation
Due by 8pm, Sunday 18 Nov 2012

In this lab, you will use hash tables to generate pseudo-random text.

The purpose of this lab is to:

As usual, you may work with one partner on this assignment, if you choose.


Motivation

In this lab you will design and implement an order k Markov model from a piece of input text. Sound scary? It isn't. Basically, we'll use these Markov model things to read in a sample text, then generate a new random text based off the sample. For example, the sample text may be a compilation of work by Dr. Seuss, which contains content such as

    Would you like them here or there?
    I would not like them here or there.
    I would not like them anywhere. 
    I do not like green eggs and ham.
    I do not like them Sam-I-am. 
    Would you like them on a house? Would you like them with a mouse?

Our Markov model will read in all of Dr. Seuss' fine work, then will generate random text in Seuss' style, such as

    That Sam-I-am! That makes a story that needs a comb? 
    No time for more, I'm almost home. I swung 'round the smoke-smuggered stars.
    Now all that cart! And THEN! Who was back in the dark. 
    Not one little house Leaving a thing he took every presents! 
    The whole aweful lot, nothing at all, built a radio-phone. 
    I put in a house. I do not like them with a goat. 
    I will not eat them off. 
    'Where willing to meet, rumbling like them in a box. I do not like them all! 
    No more tone, have any fun? Tell me. What will show you. 
    You do not like them, Sam-I-am! And you taking our Christmas a lot.

As you can see, our random text certainly resembles the original in spirit, although it may not make a whole lot of sense.

Markov Models

For this lab, you will be using a Markov model for the somewhat silly purpose of generating stylized pseudo-random text; however, Markov models have plenty of "real" applications in speech recognition, handwriting recognition, information retrieval, and data compression. (In fact, there is a whole course on such models in the math department, called graphical models.)

Our Markov model is going to generate one character of our output at a time. To determine the next character, we look at the sample text to see which characters are likely to occur at this point. To do that, we take the last few characters we generated and look for every occurrence of that substring in the sample text. Hopefully we'll find it a bunch of times, and from these occurrences we work out how likely each character is to come next.

For example, suppose we have already generated the text "I do not like them, ", and we want to determine the next character. Then, we may look in the sample text for all occurrences of the substring "ke them, ", and we may find that the substring occurs 10 times: 7 times it is followed by "Sam-I-am", 2 times it is followed by "on a boat", and once it is followed by "on a house". Then, with 7/10 probability, the next character is an S, and with 3/10 probability it is an o.

Now if you think about it, the farther back we look in the text, the more our generated text will resemble the original. However, looking farther back requires a lot more work and space, and produces less interesting text. So there are trade-offs to consider. The Markov model formalizes this notion as follows.

An order 0 Markov model looks in the sample text for the previous 0 characters of our generated text. That is, given an input text, you compute the Markov model of order 0 by counting up the number of occurrences of each letter in the input and use these as the frequencies. For example, if the input text is "agggcagcgggcg", then the order 0 Markov model predicts that each character is a with probability 2/13, c with probability 3/13, and g with probability 8/13. This has the effect of predicting that each character in the alphabet occurs with fixed probability, independent of previous characters.

Characters in English text are not independent, however. An order k Markov model looks back at the previous k characters in the generated text, and bases its prediction on that substring of length k. That is, given an input text, you compute a Markov model of order k by counting up the number of occurrences of each letter that follows each sequence of k letters. For example, if the text has 100 occurrences of th, with 50 occurrences of the, 25 occurrences of thi, 20 occurrences of tha, and 5 occurrences of tho, the order 2 Markov model predicts that the next character following th is e with probability 1/2, i with probability 1/4, a with probability 1/5, and o with probability 1/20.
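The counting described in the last two paragraphs can be sketched in a few lines of Java. This is only an illustration of the idea, not the structure required by the lab: it uses java.util.TreeMap instead of your own hash map, and the class name OrderKCounts and the method count are made up for this example.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only: uses java.util maps, not the MyHashMap built in this lab.
class OrderKCounts {
    // For each k-character context in text, count how often each character follows it.
    static Map<String, Map<Character, Integer>> count(String text, int k) {
        Map<String, Map<Character, Integer>> model = new TreeMap<>();
        // Stop one short of the end: the final context has no following character.
        for (int i = 0; i + k < text.length(); i++) {
            String context = text.substring(i, i + k);
            char next = text.charAt(i + k);
            model.computeIfAbsent(context, key -> new TreeMap<>())
                 .merge(next, 1, Integer::sum);
        }
        return model;
    }

    public static void main(String[] args) {
        // Order 0: the only context is the empty string.
        System.out.println(count("agggcagcgggcg", 0)); // {={a=2, c=3, g=8}}
        // Order 2: one entry per distinct 2-character context.
        System.out.println(count("agggcagcgggcg", 2));
    }
}
```

Dividing each count by the total for its context gives the probabilities quoted above, for example 2/13, 3/13 and 8/13 for the order 0 model.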

So this is how we generate text. The details will become clear later. Right now, let's get on with the show.


Part 1 - Hash Map

First you'll implement your own hash map with separate chaining in a class called MyHashMap<K,V>. You will build your hash table on top of an array of LinkedLists, one linked list per "bucket".

The methods you will implement are a subset of the java.util.Map interface, but you won't actually implement the interface. You may not assume that keys implement Comparable, but, like all objects, they have an equals method.

Data Members

You're going to need an array to store the buckets of your hash map. Because you're using separate chaining, each of these buckets will be a linked list of elements, in fact, they'll be a linked list of (key,value) pairs (since each element is really one such pair).

In order to store both the key and value of an element in a single linked list, you will need to create a MyEntry<K,V> class that represents a key-value pair. In this way, each bucket can be represented by a linked list of MyEntries.

So, step one is to create a MyEntry<K,V> class that has an instance variable key of type K, an instance variable value of type V, and overridden hashCode() and equals() methods that apply only to the key. (You are overriding the Object class' hashCode() and equals() methods.) You may use the key's hashCode and equals methods directly. MyEntry can be declared inside the MyHashMap class; if you go this route and nest it in MyHashMap, you should drop the generics from MyEntry (the angle brackets stuff), as you would've done for a nested Node class on your Linked List lab.
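As a rough sketch of what such an entry class might look like when written as a standalone generic class (the constructor and the exact field layout are assumptions, and a nested version would drop the type parameters as described above):

```java
// Sketch of a standalone entry class; equality and hashing look only at the key.
class MyEntry<K, V> {
    K key;
    V value;

    MyEntry(K key, V value) {
        this.key = key;
        this.value = value;
    }

    // Hash only the key, so an entry hashes the same way its key does.
    @Override
    public int hashCode() {
        return key.hashCode();
    }

    // Two entries are equal when their keys are equal; the values are ignored.
    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof MyEntry)) return false;
        return key.equals(((MyEntry<?, ?>) other).key);
    }
}
```

One consequence of comparing keys only is that a LinkedList of entries can be searched by key: a probe entry with the right key and any value is "equal" to the stored entry.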

Now that you have the MyEntry class, you can create the following class members in your MyHashMap class:

LinkedList<MyEntry>[] table;
set of buckets used in your hashtable
int size;
current number of items in the table (not # buckets, use table.length for that)
float maxLoadFactor;
maximum permitted load factor for the table (load factor is number of items (i.e. size) divided by the number of buckets (i.e. table.length))

You may also want constants for the default hashtable size (say, 11) and a default maximum load factor (say, 0.75).

Constructors

MyHashMap(int capacity, float loadFactor)
Create a hashtable with capacity buckets and a maximum load factor of loadFactor.
First you need to initialize the array of linked lists:
   table = (LinkedList<MyEntry<K,V>> []) new LinkedList[capacity];
Then, you need to go through each table entry and initialize each linked list.
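
The two initialization steps above might look like the following sketch. It is written as a generic static helper so that it stands alone; the names BucketTableDemo and newTable are made up for this example, and in your constructor the same two steps would assign to the table field directly.

```java
import java.util.LinkedList;

// Hypothetical helper showing the two initialization steps outside of MyHashMap.
class BucketTableDemo {
    // Java cannot create a generic array directly, hence the raw array and the unchecked cast.
    @SuppressWarnings("unchecked")
    static <E> LinkedList<E>[] newTable(int capacity) {
        LinkedList<E>[] table = (LinkedList<E>[]) new LinkedList[capacity];
        // A freshly allocated array holds only nulls, so every bucket needs its own list.
        for (int i = 0; i < table.length; i++) {
            table[i] = new LinkedList<>();
        }
        return table;
    }
}
```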

MyHashMap()
Create a hashtable of size 11 with a load factor of 0.75.
(You can do this in one line by calling the previous constructor.)

Public Methods from java.util.Map

int size()
Return the number of items in the hashtable

boolean isEmpty()
Return true if size==0, false otherwise

void clear()
Empties out the hashtable
You can (and should!) make use of the LinkedList.clear() method.

String toString()
Return a String representation of your hash table (anything useful to you will be fine).

Before continuing, you should set up jUnit tests for your MyHashMap class, if you haven't already. You can compare the behaviour of your MyHashMap to that of Java's HashMap. Test the above methods as best you can, considering you haven't implemented a put method yet!


V put(K key, V value)
Associate the specified value with the given key.
Return the previous value associated with the key, or null if there was no mapping.
To compute the hash function, first apply the key's hashCode() method and then apply the % tableSize operation.
If the load factor threshold has been reached, call the private resize() method.
Attempts to insert null values or keys should generate a NullPointerException.
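
One detail worth keeping in mind when computing the bucket index: hashCode() may return a negative number, and Java's % operator then gives a negative result, which is not a valid array index. The sketch below shows one possible way to handle this using Math.floorMod (available since Java 8); the class and method names are made up for this example.

```java
// Hypothetical helper: maps a key to a bucket index in the range [0, tableSize).
class HashIndexDemo {
    static int bucketIndex(Object key, int tableSize) {
        if (key == null) {
            throw new NullPointerException("null keys are not allowed");
        }
        // floorMod never returns a negative value for a positive tableSize,
        // unlike key.hashCode() % tableSize.
        return Math.floorMod(key.hashCode(), tableSize);
    }

    public static void main(String[] args) {
        // An Integer's hash code is its own value, so this key hashes to -7.
        System.out.println(-7 % 11);                              // -7
        System.out.println(bucketIndex(Integer.valueOf(-7), 11)); // 4
    }
}
```

On older Java versions, ((h % n) + n) % n gives the same result.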

V get(K key)
Return the value associated with the key. If no value exists, return null.
This operation should not examine every index in the table, but rather, should only examine the linked list indicated by the hashcode.

V remove(K key)
Delete the mapping (key,value) from the hashtable. Return the previous value or null if there was no such value.

boolean containsKey(K key)
Return true if key is already in the table.
You can (and should) make use of the LinkedList.contains method.
And you should only have to examine a single bucket in order to get your answer.

boolean containsValue(V value)
Return true if value is already in the table.
This may require inspection of all buckets.

Public Methods not in java.util.Map

Iterator<K> keys()
Create an iterator of all the keys in the hashtable.
You may want to use the LinkedList's iterator for this...
That is, your iterator will keep track of (a) the current bucket you're iterating over, and (b) that bucket's iterator.
Once you iterate through one bucket's iterator, you move on to the next one.
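
A sketch of the bucket-by-bucket idea, written as a standalone iterator over an array of lists. The class name BucketIterator is made up, and it yields whatever the buckets contain; a keys() iterator would wrap something like this and return each entry's key.

```java
import java.util.Iterator;
import java.util.LinkedList;
import java.util.NoSuchElementException;

// Hypothetical helper: walks every element of every bucket, one bucket at a time.
class BucketIterator<E> implements Iterator<E> {
    private final LinkedList<E>[] table;
    private int bucket = 0;        // (a) index of the bucket currently being iterated
    private Iterator<E> current;   // (b) that bucket's own iterator, or null when done

    BucketIterator(LinkedList<E>[] table) {
        this.table = table;
        this.current = table.length > 0 ? table[0].iterator() : null;
    }

    @Override
    public boolean hasNext() {
        // Skip over exhausted or empty buckets until one has an element left.
        while (current != null && !current.hasNext()) {
            bucket++;
            current = bucket < table.length ? table[bucket].iterator() : null;
        }
        return current != null;
    }

    @Override
    public E next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return current.next();
    }
}
```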

Iterator<V> values()
Create an iterator of all the values in the hashtable.
This can be very similar to the keys iterator.

Private Methods

resize()
Dynamically resize the array.
Make a new array of at least double the size, and then rehash the items into the new array. (Items that shared a bucket in the old array do not necessarily have the same key, so they may hash to different buckets in the new array!)
This should be called whenever the maximum load factor is exceeded.
Always at least double the array size and maintain an odd size. It is best to resize to a prime number. Remember, to determine if a number is prime, you need to test that it has no factors between 2 and its square root (inclusive). Here is a list of primes, each at least twice the value of the previous, if you just want to do a table lookup:
11 
23 
47 
97 
197 
397 
797 
1597 
3203 
6421 
12853 
25717 
51437 
102877 
205759 
411527 
823117 
1646237 
3292489 
6584983 
13169977 
26339969 
52679969 
105359939 
210719881 
421439783 
842879579 
1685759167 
(If you choose this method, you should write a private helper method that takes an integer parameter i and returns the first prime number in this list that is greater than i.)
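
If you would rather compute the next size than look it up, the square-root test mentioned above can be sketched as follows. This is an alternative to the table lookup, not a replacement for it, and the names Primes, isPrime and nextPrime are made up for this example.

```java
// Hypothetical helper: trial-division primality test and "next prime" search.
class Primes {
    // n is prime if it is at least 2 and no d in [2, sqrt(n)] divides it.
    static boolean isPrime(int n) {
        if (n < 2) return false;
        // d is a long so that d * d cannot overflow for large n.
        for (long d = 2; d * d <= n; d++) {
            if (n % d == 0) return false;
        }
        return true;
    }

    // Smallest prime strictly greater than i. Assumes i < Integer.MAX_VALUE.
    static int nextPrime(int i) {
        int candidate = i + 1;
        while (!isPrime(candidate)) {
            candidate++;
        }
        return candidate;
    }

    public static void main(String[] args) {
        // Growing an 11-bucket table: first prime above 2 * 11.
        System.out.println(nextPrime(2 * 11)); // 23
    }
}
```

Calling nextPrime(2 * table.length) reproduces the start of the list above: 11, 23, 47, 97, 197.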

jUnit Testing

Be sure to test your hash table methods with jUnit tests before continuing. One good test would create a hash table of (String,Integer) pairs, and add the first 100 elements (""+i,i), printing out your hash table as you go along. Remove the elements afterwards, one-by-one. Make sure you resize when you are supposed to.


Part 2 - Basic Markov Model

Markov Class

Create a class Markov to represent a k-character substring. Ultimately, it will have a random method that returns a random character according to the Markov model. For now, just make it store the substring and an integer that counts the number of times the substring appears. You will need a constructor, a method to increment the frequency count, and the usual toString method for output.

public Markov(String substring)
Construct a new Markov object representing the string substring
public void add()
Increment the frequency counter for this substring.
public String toString()

Frequency Counts

Implement a program FrequencyCounter that reads the order parameter k of the Markov model from the command line and a text string from standard input (i.e. System.in), and inserts each k-character substring of the text into a hash table as a key (the value associated with each key is a Markov object representing that substring). For example, if k is 2 and the input string is "agggcagcgggcg", then your program should create Markov objects for each of the 5 distinct keys, and call the Markov class' add method 12 times total: ag gg gg gc ca ag gc cg gg gg gc cg. Maintain an integer count of the number of occurrences of each key in your Markov object. Use your hash table's methods to print out the number of distinct keys and the number of times each key appears in the text. For the example above, your program should output:

    5 distinct keys
    2 ag
    1 ca
    2 cg
    3 gc
    4 gg

Part 3 - Language Generation

To generate random text, given a k character key, your Markov objects must know all of the letters that follow the k character key. This operation is at the crux of the matter, as you will need it to generate random characters in accordance with the Markov model. Modify your Markov class so that in addition to frequency counts, it records the breakdown depending on the next letter. Create an instance variable of type TreeMap<Character,Integer> to keep track of the list of suffix characters along with their frequencies (remember, you made your own MyTreeMap in lab 6). Modify the toString method so that it prints out the list of suffixes, along with the substring and frequency count. Include the following method to insert a suffix character.

    public void add(char c)

You may also want to add other constructors or methods, as you see fit.

Implement a program SuffixCounter based off FrequencyCounter that inserts keys into the hash table (if necessary), and calls add(char c) to add the appropriate suffix characters to the Markov model. It should produce the following output on the example input (you do not have to format your output exactly the same, but it should contain the same information in a reasonable layout.)

    5 distinct keys
    2 ag: 1 c 1 g
    1 ca: 1 g
    1 cg: 1 g
    3 gc: 1 a 2 g
    4 gg: 2 c 2 g

You'll probably need to read up on the TreeMap operations, and if you choose to use its entrySet method, you will need to look at the Set and Map.Entry classes.

Note that since the last cg substring doesn't have a "next" character, we don't include it in the model.

Now add a method random to Markov that returns a pseudo-random character according to the language model. Be sure to get the probabilities right, as we will be checking this. (And, it may take some thought to figure out how to translate the probabilities into characters.)
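
One common way to turn counts into a random choice is to pick a random integer below the total count and walk through the suffixes, subtracting each count until the number drops below zero. The sketch below shows that idea on a bare TreeMap; the class WeightedChoice and the method pick are made up for this example, and your random method would work on the Markov object's own suffix map instead.

```java
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

// Hypothetical helper: chooses a character with probability proportional to its count.
class WeightedChoice {
    // Assumes counts is non-empty and every count is positive.
    static char pick(TreeMap<Character, Integer> counts, Random rng) {
        int total = 0;
        for (int c : counts.values()) {
            total += c;
        }
        int r = rng.nextInt(total); // uniform in [0, total)
        for (Map.Entry<Character, Integer> e : counts.entrySet()) {
            r -= e.getValue();
            if (r < 0) {
                return e.getKey(); // r fell inside this character's slice
            }
        }
        throw new IllegalStateException("unreachable when all counts are positive");
    }
}
```

With counts a=1 and g=2, the values r=0 selects a and r=1 or r=2 selects g, giving the 1/3 and 2/3 probabilities expected for the "gc" key.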

Now, create a class TextGeneration that takes as command line input an integer k, an integer M, and a filename file, and prints out M characters according to the order k Markov model based on file. You should start by printing the first k characters of the original text. Then, repeatedly generate successive pseudo-random characters.

Using the example above, if the Markov object m represents the substring "gg", then m.random() should return c or g, each with probability 1/2. After you generate a character, move over one character position, always using the last k characters generated to determine the probabilities for the next. For example, if your program chooses c in the example above, then the next Markov object would represent the substring "gc", and according to the Markov model, the next character should be a with probability 1/3 and g with probability 2/3. Continue the process until you have output M characters. If the language model contains fewer than 100 k-tuples (prefixes), then print the language model (the keys, their suffixes and counts) before you output M randomly generated characters.

Note: If you are using a Scanner to read the files line-by-line, you will need to append a newline character at the end of the input line, otherwise you won't have any in your output. Also, you should carry the last k characters from the previous line to the start of the next line. Finally, print a newline at the end of your text generation to clean up the appearance when the command prompt returns.
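
One possible way to handle both points at once is to read the whole input into a single string, restoring the newline that nextLine() strips. Substrings then span line boundaries naturally, with no explicit carry-over. This is a sketch under the assumption that the file fits comfortably in memory; the names ReadText and readAll are made up for this example.

```java
import java.util.Scanner;

// Hypothetical helper: reads all lines from a Scanner into one string.
class ReadText {
    static String readAll(Scanner in) {
        StringBuilder text = new StringBuilder();
        while (in.hasNextLine()) {
            // nextLine() drops the line terminator, so put a newline back.
            text.append(in.nextLine()).append('\n');
        }
        return text.toString();
    }
}
```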

Note: If your final sequence of k characters does not appear anywhere else in your text, you may encounter a situation where a lookup in the table returns no Markov object. For example, "ies" only appears at the end of "flippyfloppies". In this circumstance, you can just reset back to the original start string.
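
The generation loop, including the reset described in this note, might be sketched as follows. To keep the example self-contained, the model is passed in as a function from a k-character context to a follower character (or null when the table has no Markov object for that context); the names GenerationLoop, generate and next are made up. Two choices here are assumptions rather than requirements: the first k characters count toward M, and a reset changes the context without printing the start string again.

```java
import java.util.function.Function;

// Hypothetical sketch of the generation loop, independent of how the model is stored.
class GenerationLoop {
    // start: the first k characters of the source text.
    // next:  returns a random follower for a context, or null if the context is unknown.
    static String generate(String start, int m, Function<String, Character> next) {
        StringBuilder out = new StringBuilder(start);
        String context = start;
        while (out.length() < m) {
            Character c = next.apply(context);
            if (c == null) {
                // Dead end. Give up if even the start string has no follower.
                if (context.equals(start)) break;
                context = start;
                continue;
            }
            out.append(c);
            // Slide the window: drop the oldest character, keep the newest k.
            context = (context + c).substring(1);
        }
        return out.toString();
    }
}
```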


Testing

You should test out your text generation with very simple inputs first, such as with a file containing "flippyfloppies", and small values of k and M.

Once you get that working, you should try it on some of the files provided below. You will find that the random text with low-order models starts to sound more and more like the original text as you increase the order, as illustrated in the examples below. As you can see, there are limitless opportunities for amusement here. Try your model on some of your own text, or find something interesting on the net.

Here are a few sample texts of interest: Dr. Seuss, Shakespeare, 1 million digits of Pi, Buffy the Vampire Slayer (Season 1), 2012 Presidential Debate, and Doctor Who (Series 1).

Example input: As You Like It, excerpts [link to full text]

	[Enter DUKE SENIOR, AMIENS, and two or three Lords,
	like foresters]

DUKE SENIOR	Now, my co-mates and brothers in exile,
	Hath not old custom made this life more sweet
	Than that of painted pomp? Are not these woods
	More free from peril than the envious court?
	Here feel we but the penalty of Adam,
	The seasons' difference, as the icy fang
	And churlish chiding of the winter's wind,
	Which, when it bites and blows upon my body,
	Even till I shrink with cold, I smile and say
	'This is no flattery: these are counsellors
	That feelingly persuade me what I am.'
	Sweet are the uses of adversity,
	Which, like the toad, ugly and venomous,
	Wears yet a precious jewel in his head;
	And this our life exempt from public haunt
	Finds tongues in trees, books in the running brooks,
	Sermons in stones and good in every thing.
	I would not change it.

AMIENS	Happy is your grace,
	That can translate the stubbornness of fortune
	Into so quiet and so sweet a style.

DUKE SENIOR	Come, shall we go and kill us venison?
	And yet it irks me the poor dappled fools,
	Being native burghers of this desert city,
	Should in their own confines with forked heads
	Have their round haunches gored.

Example output: random Shakespeare, using order 6 model, excerpts [link to full text]

DUKE SENIOR	Now, my co-mates and thus bolden'd, man, how now, monsieur Jaques,
	Unclaim'd of his absence, as the holly!
	Though in the slightest for the fashion of his absence, as the only wear.

TOUCHSTONE	I care not for meed!
	This I must woo yours: your request than your father: the time,
	That ever love I broke
	my sword upon some kind of men
	Then, heigh-ho! sing, heigh-ho! sing, heigh-ho! sing, heigh-ho! unto the needless stream;
	'Poor deer,' quoth he,
	'Call me not so keen,
	Because thou the creeping hours of the sun,
	As man's feasts and women merely players:
	Thus we may rest ourselves and neglect the cottage, pasture?

	[Exit]

	[Enter DUKE FREDERICK	Can in his time in my heartily,
	And have me go with your fortune
	In all this fruit
	Till than bear
	the arm's end: I will through
	Cleanse the uses of the way to look you.
	Know you not, master,
	Sighing like upon a stone another down his bravery is not so with his effigies with my food:
	To speak my mind, and inquisition
	And unregarded age in corners throat,
	He will come hither:
	He dies that hath engender'd:
	And you to
	the bed untreasured of the brutish sting it.

Example output: random Buffy the Vampire Slayer, using order 12 model, excerpts [link to full text]

In every generation there is a Chosen One. She alone will stand against 
the vampires, the demons and the forces of darkness. She is the Slayer.

The Bronze at night. Cut inside. The camera follows her out.

Cordelia:  Well, just one dance.

They dance close.

Owen:  It's weird.

Buffy:  I know.

A vampire brings the demons, which ends 
the world.

Willow:  Angel stopped by? Wow. Was there... Well, I mean, was it having 
to do with kissing?

Buffy:  Mom! Mom, can you hear me? / Can you see me? / What's inside of me? /
Oh, I just wanted to start over. Be like everybody else. Have some friends,
y'know, maybe three isn't company anymore.

Buffy:  Why are you following me? I just had this feeling 
she'd do just about enough!

Buffy shoots Xander a look.

Snyder:  I don't know. Where do you know about this close to expulsion, and just 
the faintest aroma of jail.

Giles:  (to Buffy) Well, he is young.

Buffy:  It shouldn't be. (starts back to their original form, which is, uh, uh,
slightly l


Handin

Use handin to submit the following files:

  1. All .java files necessary for compiling your code (including those from previous labs)
  2. One or two of the most amusing language-modeling examples that you come up with.
  3. A README file with:
    1. Your name (and your partner's name if you had one)
    2. Any known problems or interesting design decisions that you made

If you work with a partner, just submit one solution per team.