Programming Project 4

Due : Oct. 27               CSCI 271                    Fall 2008

Problem:

Phase one of Huffman encoding.  We are going to write a very efficient program to do data compression and decompression using the Huffman encoding algorithm.  This project will be the first phase of the project.  We have discussed the Huffman algorithm, so we will not go through that in too much detail.

Our project output:

All we will output is a table that has an ASCII code (integer) for a character and the bit string that will represent that character.  For example:

42 01
58 101
66 110
25 111

You will just output one line per entry (no need for you to do a table or array) with the ASCII code and a sequence of 1's and 0's (can each be held as a character or integer for this phase).

Project Methodology

Here is the basic algorithm in its major steps:

  1. Read in the characters from the file counting the frequency of each character.
  2. Create the Huffman encoding tree based on the frequencies.
  3. Traverse the tree printing out the bit codes for the edges and the character in the leaf nodes.

Each of these phases is somewhat involved, so we will look at each of them.

Frequency

We need to produce a frequency table for each character.  We could keep a 2 column table; one column for the character and one column for the frequency.  This way, we do not have entries for characters that never appear in the table.  However, this would entail searching the first column each time.  After thinking about it, we realize that there are only 256 possible characters.  On most modern machines, an array of size 256 is a very small array and trying to save space by making it smaller is really a waste of time.

The process we can utilize is to read in a character; find the ASCII code for that character; use that integer as the subscript and add one to that entry in the frequency table.  (Don't forget to zero out all the entries first).

Create the Huffman Tree

The first step in creating the tree is to sort the frequency table.  We do not need to include the items with 0 frequency.  To see how to hold the results of our sorted table, let's look ahead to how we are going to use it.  Recall, from the Huffman algorithm that we took the 2 least frequent items, placed them under the same parent and placed the sum of their frequencies into the parent node.  This frequency was then placed back into the sorted frequencies in order to repeat the process.  So, once we have the sorted frequencies, we will be doing lots of deletions and additions to the sorted frequencies.

This calls for a list, where insertions and deletions are easy.  We can create the initial sorted list by going through our table of 256 frequencies and when the frequency is not 0 insert that ASCII value and frequency into our linked list in order.  This way, we can find the 2 least frequent items quickly.

But what should a node in our linked list look like?  Looking ahead again, we see the values are going to be used to build a binary tree.  Once we have combined two nodes, we are inserting a newly created binary tree node into our linked list.  So, rather than having the node in our linked list contain an ASCII code and a frequency, we should just have the data part of our linked list node contain a pointer to a binary tree node.

Now, when we create the linked list, we first create a new binary tree node, place the ASCII character and the frequency into the binary tree node then place the pointer to this node in the data field of the linked list node.  This will make our algorithm flow much smoother.

Printing the Tree

Once we have the tree created, how do we print the values of the edges?  We need to do a tree traversal that goes through the entire tree.  As we go down the tree, we need to keep track of how we got to where we are.  When we back up the tree, we need to remove that part of the path we are keeping track of.  The best way to do this is to use a stack.  As we go left, we push a 0 onto the stack.  As we go right, we push a 1 onto the stack.  When we return we pop the top value off the stack.   When we hit a leaf, we print the values in the stack (in reverse order).  Here is the recursive algorithm: