CSCI 241 - Homework 6:
Huffman's Algorithm

Due by 11:59.59pm Monday, 17 Nov 2008

Introduction

For this assignment, you will be creating two programs (encode and decode) that will be performing the calculations needed for simple file compression. (For small files, it'll actually make things bigger, but this is to make your life easier.)

Things to note


Updates


Program behavior

Your encode program should read a text file specified on the command line and write a Huffman encoded version of that file to the specified output file. Similarly, the decode program will read a file generated by encode and write a decoded version of that file to a specified output file.

If no output file is specified, write to stdout.

% ./encode book.txt   book.huf      # encodes book.txt and writes it to
                                    # book.huf

% ./encode book.txt > book.huf      # encodes book.txt and writes it to
                                    # stdout (redirected to book.huf)

% ./decode book.huf   book.txt.2    # decodes the file

% diff -q book.txt book.txt.2       # check to see if files are the same
                                    # should print nothing if they are

Getting started

You will likely need to divide your code into three different parts, and therefore should be stored in 3 different files:

Now, it is possible to have only a single program that can do both encoding and decoding based on the filename, but to handle that, you'd need to check the value of argv[0] and determine which function to perform. It's probably easier to just make 2 separate programs.

Recall that you can make an object file by using the "-c" flag when compiling. Then you can link the various object files together to make actual programs.

Program Design

Encoding

In order to encode a file, you will first need to construct a Huffman tree based on the frequency of letters in the file. Your first step should be to read the file from start to end and calculate either absolute or relative frequency of all the characters encountered. You should include the frequency of EOF (-1) which should always be 1.

You will then need to create a sorted list of nodes based on ascending frequency. To do this, I recommend that you use an insertion sort on a linked list. Insert new nodes starting with the index value of -1 and going up to index value of 255. Insert before items of equal value. Skip nodes with a frequency count of 0.

Then, you will need to convert this sorted list into a Huffman tree. While there are more than 2 nodes in your list, you should create a new node, attach the head item in the list as the left child, the second item in the list as the right child, update the frequency count for this new node, and insert it into your linked list. Be sure you've removed the two nodes that are now children before re-inserting.

Now traverse the tree keeping track of the string needed to reach that node based on using a character '0' for a left branch and '1' for a right branch. When you reach a leaf node, you will know what string is to be used to represent that character.

Now re-read the input file from the beginning and for each letter encountered, print the bit string that corresponds to that character. Be sure to output the EOF string too and stop after you do so.

File Format

The files to be encoded can be treated as simple 8-bit character files (but use CHAR_BIT instead of 8). What I mean by this is that if you call fgetc() you will get a character until you reach the end, at which point you will get EOF (which has a value of -1). By treating these files as byte-oriented rather than printable ASCII, you should be able to encode both text and binary files.

The Huffman encoded output files will be a bit different. You need to include the frequency counts in order to be able to reconstruct the Huffman tree. Initially, you can just write out the frequency counts in order from -1 (EOF) through 255 as long integers using fwrite(). (You can use fread() to read the data in during decoding.)

After the table, you should output the individual bits that are needed to represent the input file. You'll have to buffer the bits until you get CHAR_BIT of them and then output it. (The most significant bit is the first bit, and then they progress downward.) Pad out the last incomplete character in the file with 0 bits. If you write (CHAR_BIT-1) 0-bits out then it will flush any remaining bits without creating a new character.

Here is the info from one of the sample files encoding the string "a\n":

char   count bitstring
----   ----- ---------
 -1        1 0
 10        1 11
 97        1 10

10110

Decoding

To decode the file, you should first open the file specified on the command line. Read in a line at a time processing each line for a frequency count and/or a string to match against. Once you encounter a blank line, you know you have read the entire table in. Recall that strings returned by fgets() include the newline at the end of the line.

If you need to, you can create the Huffman tree from the frequency counts, or you can use the pre-computed strings from the header as well.

Now, switch from line oriented input, to character oriented. Read in each '0' and '1' and whenever you have matched a Huffman string, you should output that character to the output file. Once you have read the EOF character, you should stop reading and printing, and close both files.

NOTE: You should not print out the EOF character, and you should never reach the actual end of the encoded file.

Sample run -- with internal state

INPUT:
cheese

Frequency Counts:
EOF     1
\n      1
c       1
e       3
h       1
s       1

Linked List (initial):
s(1) -> h(1) -> c(1) -> \n(1) -> EOF(1) -> e(3)

First pass:
c(1) -> \n(1) -> EOF(1) -> (2) -> e(3)
                          /   \
                      s(1)     h(1)    

Second pass:
c(1) -> \n(1) -> EOF(1) -> (2)   -->   (2) -> e(3)
                          /   \       /   \
                       c(1)  \n(1)  s(1)   h(1)    

Third pass:
EOF(1) --->  (2)  --->   (2) -> e(3)
            /   \       /   \
         c(1)  \n(1)  s(1)   h(1)    


Fourth pass:
   (2)  ------>  (3)  ---> e(3)
  /   \         /   \
s(1)   h(1) EOF(1)  (2)
                   /   \
                 c(1) \n(1)

Fifth pass:
e(3) -----------> (5)
               /       \
           (2)           (3)
          /   \         /   \
        s(1)   h(1) EOF(1)  (2)
                           /   \
                         c(1) \n(1)

Sixth (and final) pass:
         (8)
    /           \
e(3)              (5)
               /       \
           (2)           (3)
          /   \         /   \
        s(1)   h(1) EOF(1)  (2)
                           /   \
                         c(1) \n(1)

OUTPUT: (including padding)
char   count bitstring
----   ----- ---------
 -1        1 110
 10        1 1111
 99        1 1110
101        3 0
104        1 101
115        1 100

11101010 01000111 11100000

Design Ideas

You'll need to be dynamically creating nodes, so malloc() and free() are your friends. Be sure to free() all the allocated data once you are done with it, and close() all files you opened. Valgrind should report that there were no memory leaks.

You might want to create a node struct that can be used in both a linked list and a tree simultaneously. So, you'll want to have both "left" and "right" pointers as well as a "next" pointer.

You can create an array that has a valid position at index -1 by dynamically allocating an array and then setting a pointer to the address of the first item in the array. If you then use the pointer in an array context, you can go from index -1 to N-2.


There is also sample binaries for you to play with in ~kuperman/pub/cs241/hw06/


Extra Credit

Recording the frequency information at the start of the file takes up a substantial amount of space (32 or 64 bits for each of the 257 values). For extra credit, you can implement a more compact scheme to encode the information.


handin

README

Create a file called README that contains

  1. Your name and partner's name (if any)
  2. A description of the programs
  3. A listing of the files with a short one line description of the contents
  4. Any known bugs or incomplete functions
  5. Any interesting design decisions you'd like to share
  6. Describe any unresolved warnings that are generated by valgrind and what you believe them to be caused by.
  7. If you did the extra credit, describe the technique you used and the approximate space savings.

Now you should make clean to get rid of your executables and handin your folder containing your source files, Makefile, and README.

    % cd ~/cs241
    % handin -c 241 -a 6 hw6

    % lshand

Grading

Here is what I am looking for in this assignment:


VI PoweredLast Modified: November 09, 2008 - Benjamin A. Kuperman