Tags C++ Programming

Animals & Plants Arts & Entertainment Auto Beauty & Health Books and Literature Business Electronics Engineering & Technology Food & Drink History Hobbies Jobs & Education Law & Government Math People & Society Science Social Studies Sports Travel & Places

Subjects>Science>Engineering

How does one implement the Huffman tree algorithm in c plus plus?

Updated: 10/26/2022

Wiki User

∙ 10y ago

Best Answer

As a student in 1951, David A. Huffman (1925-1999) was given the option of sitting a final exam or submitting a term paper. He chose the latter. The term paper outlined the problem: find the most efficient binary code. In other words, given any amount of data, find the most efficient method of encoding that data in the least number of bits. When we encode data using standard 8-bit symbols, every symbol is obviously the same length (8 bits). If we alter the encoding so that we use shorter codes for symbols that appear most often, and longer codes for those that appear less often, we can encode the data more efficiently. For instance, if the most frequent symbol could be encoded with just 1 bit (say, 0), then we save 7 bits for every occurrence of that symbol. However, no other symbol can use an encoding that begins with 0, they must all be longer than 1 bit and must begin with a 1. Moreover, no symbol can start with the same sequence of bits assigned to any other symbol.

Huffman quickly ascertained that he was looking for a prefix coding method. While efficient encoding methods certainly existed at the time (Shannon-Fano being one), none could be guaranteed to provide the most efficient coding. Indeed, Huffman had all but given up and resigned himself to the final exam when the answer finally dawned on him. If he used a frequency-sorted binary tree, he could encode any symbol in the tree by tracing the path back to the root. The symbols formed the leaves of the tree and if he followed the left path to the leaf's parent, he prefixed a 0 bit, otherwise he prefixed a 1 bit. Repeating this process all the way to the root, every symbol could be assigned a unique prefix code. Moreover, when reading the encoded data back, he simply started at the root and followed the left or right paths according to the bits in the prefix. When he reached a leaf node, he'd found his symbol, and started at the root to find the next.

Huffman wasn't the first to discover this of course, but what had eluded all the mathematicians and programmers of the day was simply how to go about building the tree in the first place. Normally we build trees from the top down, from the root. Huffman's revelation was in building the tree from the bottom up, starting with the leaves. With this simple change, not only did he complete his term paper with honours, he exceeded even his professor's expectations. His method proved to be the most efficient encoding and Huffman's name was entered into the computing history books.

Huffman started by creating a frequency table of symbols which he then sorted in descending order of frequency. He then copied the table and, while the copy contained more than one node, he extracted (removed) the last two nodes and constructed a parent node, with the last node attached to its left node, and its predecessor to its right. The parent node's frequency was the sum of its left and right node frequencies. The parent node was then inserted back into the copy, maintaining the descending order of frequency. When only one node remained in the copy, the tree was complete. The one remaining node becomes the root of the Huffman tree.

You then encode the tree. Initially, all nodes have no prefix code but, starting from the root, you copy its prefix (an empty string) to its left child and append a "0", and then recursively encode the left child (thus its left child will be prefix "00"). Do the same for the right child, but append a "1" instead. Thus the child to the right of the root's left child will be prefix "01". When you reach a leaf, its prefix code determines the precise route you took from the root to the leaf (where "0" means go left, and "1" means go right). Since the tree was a copy of the frequency table, each leaf in the original frequency table will also be encoded.

You are now ready to begin encoding the file. You begin by outputing the original file size (so you know when to stop reading when decoding the encoded file). You then output the binary representation of the tree and its symbols. That is, starting from the root node, output a 0 bit if it is a parent node, or a 1 bit followed by the 8-bit unencoded symbol if it is a leaf. Repeat the process for the node's left and right children if it has any (it must have both or neither, it cannot have one but not the other). Finally, read the original file one symbol at a time, locate it in the frequency table, and output its prefix code. When writing individual bits, you need to use a bitwriter that buffers up to 8 bits at a time and then write a complete byte. When you write the final bit, pad any unused bits with zeroes to make a full byte.

To decode the encoded file, use a bitreader to read back 1 bit at a time. Decoding is simply a matter of reading the file size, then reconstructing the tree from its binary representation, and then reading one bit a time to navigate the tree from the root. when you reach a leaf, output the symbol, decrement the file size and return to the root. Repeat until the file size is zero, at which point the original file is restored.

The following implementation is not complete by any means. For instance, if you were to encode a file containing the text "The quick brown fox jumps over the lazy dog.", you will find that this 44 byte file compresses to a 66 byte file. This seems unimpressive, but remember that the encoded file also contains the file size and the Huffman tree as well as the encoded data. The encoded data is actually smaller than the original file, but because the file is so small to begin with, the overhead of the additional information produces a larger file. Huffman encoding works best on larger files (the bigger the better). An alternative approach is to use a static table, thus eliminating the need to transmit the table along with the encoded data. However this only works when the frequency table is highly predictable -- it won't be suitable for all file types. Another approach is to create several static tables, and use the one that produces the smallest file. Again, this is less than optimal, but if you have 255 different tables, then you only need to transmit 8 additional bits to determine which table to use for decoding (where table 0 implies no encoding, for those files that cannot be reduced with any of the tables). But for larger files, the overhead of the frequency table is more than outweighed by the reduction in the encoded data. For instance, a file that contains nothing but 0 bits can be reduced to a file a little over 1/8th the size of the original file. Of course, such a file could easily be compressed to a fraction of that simply by utilising run length encoding, but you can also combine both methods, using RLE to reduce the repeated bit sequences, and then Huffman encoding to reduce the RLE file. These implementations (amongst many others, many of which are patented) are left for the reader to discover.

In the meantime, be aware that the following implementation provides only the minimum functionality to both encode and decode using Huffman's algorithm, and makes use of standard output for the DESTINATION file (you may wish to make the destination file a command-line option, and test the source and destination are actually different files). Note also that besides some simple error-checks (some of which are unnecessary with prudent assertions and verifications), overall robustness has been sacrificed for the sake of simplicity.

// Huffman.exe

#include<iostream>

#include<list>

// The bitreader class facilitates the reading

// of binary files, 1 bit at a time.

class bitreader

{

public:

// The constructor accepts a file pointer to a file that is

// already opened for binary reading. The m_size and m_buffer

// members are initially zero.

bitreader( FILE* file ): m_file( file ), m_buffer( 0 ), m_size( 0 ){}

~bitreader(){};

int readbit();

public:

FILE* m_file;

unsigned char m_buffer;

int m_size;

};

// Returns a single bit (0 or 1) from the file.

// Returns -1 if an error occurs.

int bitreader::readbit()

{

// Are there any bits in the buffer?

if( !m_size )

{

// Read 8 bits into the buffer.

size_t read = fread( &m_buffer, 1, 1, m_file );

if( read == 1 )

m_size = 8;

else

// Fatal error!

return( -1 );

}

// If we get this far, there is at least 1 bit in the buffer.

// Extract the MSB from the buffer, shift the remaining bits

// left and decrement the size.

int bit = m_buffer&0x80?1:0;

m_buffer<<=1;

--m_size;

// Return the extracted bit.

return( bit );

}

// The bitwriter class facilitates the writing

// of binary files, 1 bit at a time.

class bitwriter

{

public:

// The constructor simply initialises the members.

bitwriter(): m_buffer( 0 ), m_unused_bits( 8 ){}

~bitwriter();

void writebit( unsigned char bit = 1 );

public:

char m_buffer; // The bit buffer.

int m_unused_bits; // Unused bits.

};

// The destructor pads any unwritten bits to make a full byte.

bitwriter::~bitwriter()

{

while( m_unused_bits != 8 )

writebit( 0 );

}

// Writes a single bit to the buffer. When the buffer is full,

// outputs the buffer to standard output and resets the buffer.

void bitwriter::writebit( unsigned char bit /* = 1 */ )

{

// Bit must be 0 or 1.

if( bit > 1 )

return;

// Decrement the buffer length.

--m_unused_bits;

if( bit )

{

// Shift the bit left if necessary.

if( m_unused_bits )

bit <<= m_unused_bits;

// Combine bit with current buffer.

m_buffer |= bit;

}

// Do we have a complete byte?

if( !m_unused_bits )

{

// Write the byte and empty the buffer.

std::cout << m_buffer;

m_buffer = 0;

m_unused_bits = 8;

}

// A node object acts as both a leaf and a parent. Only the

// leafs store any symbols. Parents store both a left and

// right node. During encoding, leafs store the frequency

// of their symbols while parents store the sum of the

// frequencies of their two children. During decoding,

// the frequency is not used.

class node

{

// Used when sorting the Huffman table.

friend bool compare(const node* x, const node* y);

public:

node( unsigned char data );

node( node& left, node& right );

~node( void );

int encode( bitwriter& writer );

int decode( bitreader& reader );

char get_symbol( void ) const { return( m_symbol ); }

unsigned int get_freq( void ) const { return( m_freq ); }

const std::string& get_code( void ) const { return( m_code ); }

node* get_left( void ) const { return( m_left ); }

node* get_right( void ) const { return( m_right ); }

void inc_freq( void ){ ++m_freq; }

private:

unsigned char m_symbol;

unsigned int m_freq;

std::string m_code;

node* m_left;

node* m_right;

};

// Constructs a leaf.

node::node( unsigned char symbol )

: m_symbol( symbol )

, m_freq( 1 )

, m_code( "" )

, m_left( NULL )

, m_right( NULL ) {}

// Constructs a parent.

node::node( node& left, node& right )

: m_symbol( 0 )

, m_freq( left.m_freq + right.m_freq )

, m_code( "" )

, m_left( &left )

, m_right( &right ){}

// Recursive destructor.

node::~node( void )

{

delete( m_left );

delete( m_right );

}

// Recursively decodes (rebuilds) the tree from the given bit reader.

int node::decode( bitreader& reader )

{

// Read the next bit.

int result = reader.readbit();

switch( result )

{

case( -1 ): // Fatal error!

return( result );

case( 0 ): // This node is a parent.

// Instantiate the left node.

if( m_left = new node(0) )

{

// Decode the left node (recursively).

result = m_left->decode( reader );

// Ensure success (zero).

if( !result )

// Instantiate the right node.

if( m_right = new node(0) )

// Decode the right node (recursively).

result = m_right->decode( reader );

}

// Return the result.

return( result );

case( 1 ): // This node is a leaf.

// Read the next 8 bits and store the resultant symbol.

for( int i=0; i<8; ++i )

{

m_symbol<<=1;

result = reader.readbit();

if( result == -1 )

return( result );

m_symbol |= result;

}

// Success!

return( 0 );

}

// Encodes the Huffman tree. If this node is a parent,

// write a 0 bit, otherwise write a 1 bit followed by

// the 8-bit unencoded symbol.

int node::encode(bitwriter& writer )

{

int error=-1;

// Is this node a leaf?

if( !m_left && !m_right )

{

// Write a 1 bit.

writer.writebit(1);

// Store the symbol locally.

unsigned char symbol=m_symbol;

// Write each bit from the MSB to the LSB.

for(int i=0; i<8; ++i, symbol<<=1)

writer.writebit(symbol&0x80?1:0);

// Success!

error=0;

}

// Or is this node a parent?

else if( m_left && m_right )

{

// Write a zero bit.

writer.writebit(0);

// Copy this node's code to the left node

// and append a 0.

m_left->m_code = m_code;

m_left->m_code += "0";

// Recurse for the left node.

error = m_left->encode( writer );

if( !error )

{

// Copy this node's code to the right node

// and append a 1.

m_right->m_code = m_code;

m_right->m_code += "1";

// Recurse for the right node.

error = m_right->encode( writer );

}

return( error );

}

// Compare method used to sort the Huffman table

// in descending order of frequency.

bool compare( const node* x, const node* y )

{

return( x->m_freq>y->m_freq );

}

// Decodes the given file name and redirects the

// decoded data to standard output.

int decode( const char* filename )

{

// Open the input file for binary reading.

FILE* input = fopen( filename,"rb" );

if( !input )

{

std::cerr<<"File access error: "<<filename<<std::endl;

return( -1 );

}

// Instantiate a bit reader object.

bitreader reader( input );

int error = -1, bit=0;

// Determine the length of the decoded data.

unsigned int len=0;

for( int i=0; i<sizeof(unsigned int)*8; ++i )

{

len <<= 1;

bit = reader.readbit();

if( bit == -1 )

return( error );

len |= bit;

}

// Instantiate the Huffman tree root.

node root( 0 );

// Decode (rebuild) the tree.

root.decode( reader );

// Begin reading the encoded data until all symbols

// have been read.

while( len-- )

{

// Point to the root of the Huffman tree.

node* pnode = &root;

// Repeat while the current node is a parent node.

while( pnode->get_left() && pnode->get_right() )

{

// Read the next bit.

bit = reader.readbit();

if( bit == error )

// Fatal error!

return( error );

// Navigate right if the bit is set, otherwise navigate left.

pnode = bit?pnode->get_right():pnode->get_left();

}

// We've reached a leaf, output the data.

std::cout << pnode->get_symbol();

}

// Success!

fclose( input );

return(0);

}

// Encodes the given file name and redirects the encoded

// data to standard output.

int encode( const char* filename )

{

// Open the file for binary reading.

FILE* input=fopen( filename,"rb" );

if( !input )

{

std::cerr<<"File access error: "<<filename<<std::endl;

return(-1);

}

// Instantiate the Huffman frequency table.

std::list<node*> table;

// Some variables.

unsigned int len=0;

char symbol=0;

// Read one symbol at a time.

while( fread( &symbol, 1, 1, input ) == 1 )

{

// Increment the file length.

++len;

// Locate the symbol in the Huffman frequency table.

std::list<node*>::iterator it=table.begin();

while( it!=table.end() )

{

// Reference the node.

node& rnode = *(*it);

// Check for equality.

if( rnode.get_symbol() == symbol )

{

// Symbol found, increment its frequency.

rnode.inc_freq();

break;

}

++it;

}

// Was the symbol found?

if( it==table.end() )

{

// No -- create a new node for the symbol

// and add it to the table.

node* pnode = new node( symbol );

table.push_back( pnode );

}

if( ferror( input ))

{

fclose( input );

std::cerr << "File read error: " << filename << std::endl;

return( -1 );

}

// Sort the Huffman frequency table in descending order of frequency.

table.sort( compare );

// Instantiate the Huffman tree (copy the Huffman frequency table).

std::list<node*> tree( table );

// Repeat while the Huffman tree has more than one node.

while( tree.size() > 1 )

{

// Extract the last two nodes.

std::list<node*>::reverse_iterator rit = tree.rbegin();

node* left = *rit++;

node* right = *rit;

tree.pop_back();

// Create a parent node for the two extracted nodes.

node* parent = new node( *left, *right );

// Add the parent node and resort the tree.

tree.push_back( parent );

tree.sort( compare );

}

// Instantiate a bit writer.

bitwriter writer;

// Write the file length (32 bits).

for(int i=0; i<sizeof( unsigned int )*8; ++i, len<<=1)

writer.writebit( len&0x80000000?1:0 );

// Reference the root node of the Huffman tree.

node& root = *( *tree.begin() );

// Recursively encode the Huffman tree (also

// encodes the Huffman frequency table, since

// both contain pointers to the same nodes).

root.encode(writer);

// Read symbols from the start of the input file.

fseek( input, 0, SEEK_SET );

while( fread( &symbol, 1, 1, input ) == 1 )

{

// Locate the symbol in the Huffman frequency table.

for(std::list<node*>::iterator it=table.begin(); it!=table.end(); ++it )

{

// Reference the current node.

node& rnode = *(*it);

if( rnode.get_symbol() == symbol )

{

// Write the symbol's code.

for(unsigned int i=0; i<rnode.get_code().size(); ++i)

writer.writebit( rnode.get_code().at(i)=='1'?1:0 );

break;

}

if( ferror( input ))

{

fclose( input );

std::cerr << "File read error: " << filename << std::endl;

return( -1 );

}

// Finished with the input file.

fclose(input);

// Success!

return(0);

}

// The main function handles the command line.

int main(int argc, char* argv[])

{

if(argc==3)

{

std::string option( argv[2] );

if( !option.compare("-d") !option.compare("-D") )

if( decode(argv[1] ))

std::cerr<<"The input file cannot be decoded.\n"<<std::endl;

else

return(0);

else if( !option.compare("-e") !option.compare("-E") )

if( encode(argv[1] ))

std::cerr<<"The input file could not be encoded.\n"<<std::endl;

else

return(0);

}

std::cerr<<"Usage:\n\n"

<<argv[0]<<" SOURCE -switch > DEST\n"

<<"\nWhere:\n\n"

<<"\tSOURCE\t- source file name\n"

<<"\tDEST\t- destination file name\n\n"

<<"Switch:\n\n"

<<"\td\t- decode\n"

<<"\te\t- encode\n"<<std::endl;

return(-1);

}

Wiki User

∙ 10y ago

This answer is: