Best Answer

The function of a data compression utility is to reduce the file size and thus maximise storage capacity. The compression has to be done in such a way that the original data can be restored by a corresponding decompression utility.

The simplest way to compress data is to use a variable-length encoding rather than a fixed-length encoding. Normally, data is stored using a sequence of symbols with fixed-width codes (typically 8-bit bytes). However, some symbols appear more frequently than others, so by using fewer bits to represent the most frequent symbols and more bits for the less frequent ones, data can be stored much more efficiently.

To achieve this, no code can be a prefix of any other code. That is, if the space character (which is the most common symbol in text data) is represented by the 1-bit code 0, then no other code can begin with a 0 -- they must all begin with a 1. The letter 'e' is also quite common, so we might give it the 2-bit code 10, which means all other codes must begin with 11. The problem with this naive scheme is that each new symbol's code must be one bit longer than the last. If our data contains all 128 unique symbols of the ASCII character set, the least-frequent symbol would consume 127 bits! Although we may still be able to reduce the overall size of the data, this is far from the most efficient method of doing so. We could achieve very similar savings just by using 7-bit fixed-width codes instead (which is sufficient to encode the entire ASCII character set).
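The prefix-free property is what makes variable-length codes decodable without separators. A minimal sketch in Python (the code table here is a made-up example, not part of the original answer):

```python
# A hypothetical prefix-free code table: no code word is a prefix of another.
codes = {" ": "0", "e": "10", "t": "110", "a": "111"}

def encode(text):
    return "".join(codes[ch] for ch in text)

def decode(bits):
    # Because the code is prefix-free, greedy matching is unambiguous:
    # the first code word that matches is always the right one.
    inverse = {v: k for k, v in codes.items()}
    out, cur = [], ""
    for b in bits:
        cur += b
        if cur in inverse:
            out.append(inverse[cur])
            cur = ""
    return "".join(out)

message = "a tea"
assert decode(encode(message)) == message
```

Note how the frequent space character costs only one bit, while the rarer 'a' costs three.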

Prior to 1951 there was no algorithm capable of producing the most efficient prefix encoding. We knew that binary trees offered the solution, where leaf nodes represented the symbols we wished to encode and the path from the root node of the tree determined the prefix (appending a 0 bit when we branched left and a 1 when we branched right). Thus every symbol had a unique path and therefore a unique prefix. However, building such a tree was a complex process and there was no guarantee the end result would produce the most efficient encoding without a good deal of trial and error.

David A. Huffman came up with the solution in 1951. Instead of building the tree from the root, he built it from the bottom up. First, he created a frequency table of all the unique symbols in the data, sorted in descending order of frequency. He then extracted the bottom entry (the least frequent symbol) and attached it to the left of a new parent node, then took the next least frequent symbol and attached it to the right. The parent's frequency became the sum of its two children, and it was re-inserted into the table in sorted order. In other words, he removed two entries and inserted one, reducing the table by one entry overall. He repeated this process until only one entry remained; that entry became the root of the binary tree. As simple as it sounds, this algorithm produces the most efficient prefix encoding. To find a symbol's code, we branch left (appending a 0) or right (appending a 1) from the root until we reach that symbol's leaf -- which, since every symbol is a leaf, gives each symbol a unique prefix code. We can then rescan the original data and output the contiguous sequence of codes to create the payload.
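The bottom-up construction above can be sketched in Python. Here a min-heap stands in for the sorted frequency table (my choice for convenience, not part of the original answer), and a tree node is either a bare symbol (leaf) or a `(left, right)` tuple (parent):

```python
import heapq
from collections import Counter

def build_tree(data):
    # Count frequencies, then repeatedly merge the two least-frequent
    # entries into a parent whose frequency is their sum; the survivor
    # is the root. Assumes at least two distinct symbols in the data.
    freq = Counter(data)
    # (frequency, tie-breaker, node) -- the unique tie-breaker stops
    # the heap from ever trying to compare two node objects directly.
    heap = [(n, i, sym) for i, (sym, n) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, count, (left, right)))
        count += 1
    return heap[0][2]

def make_codes(node, prefix="", table=None):
    # Walk the finished tree: branching left appends a 0, right a 1;
    # every leaf yields that symbol's prefix code.
    if table is None:
        table = {}
    if isinstance(node, tuple):                 # parent node
        make_codes(node[0], prefix + "0", table)
        make_codes(node[1], prefix + "1", table)
    else:                                       # leaf: a symbol
        table[node] = prefix
    return table
```

For input `"aaabbc"`, the most frequent symbol 'a' ends up with a 1-bit code and 'b' and 'c' with 2-bit codes.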

In order to decompress Huffman-coded data, we must also be able to recreate the original tree that was used to compress it. Since there are only two types of node (parent or leaf), every parent has exactly two children, and every leaf holds a symbol, we simply need to traverse the tree in depth-first order from the root. This is a recursive operation where we examine the left child followed by the right child. If the left child is a leaf, we output a 1 bit followed by its 8-bit symbol. But if it is a parent, we output a 0 bit and recurse into it. When that recursion finishes, we return to the current node and examine its right child in the same way: a 1 followed by its symbol for a leaf, or a 0 and another recursion for a parent. When we finally return to the root, the entire tree has been encoded.
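The traversal can be sketched as follows, again with a parent as a `(left, right)` tuple and a leaf as a bare symbol. This variant emits a bit for every node including the root (a common and equivalent formulation); the hand-built tree is a made-up example:

```python
def encode_tree(node, bits):
    # Depth-first, pre-order: a parent emits a 0 bit and then its two
    # children; a leaf emits a 1 bit followed by its 8-bit symbol.
    if isinstance(node, tuple):
        bits.append("0")
        encode_tree(node[0], bits)
        encode_tree(node[1], bits)
    else:
        bits.append("1" + format(ord(node), "08b"))

# A tiny hand-made tree: space on the left, 'e' and 't' under the right branch.
tree = (" ", ("e", "t"))
bits = []
encode_tree(tree, bits)
# Three leaves cost 9 bits each (1 + 8), plus 1 bit per parent: 29 bits total.
```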

The only other piece of information we need is the number of symbols in the payload (equal to the size of the original data in bytes). Once we have all three we can build the compressed data, outputting the size, followed by the encoded tree, followed by the payload. If the final size is not a multiple of 8 bits, we simply pad the remaining bits with zeroes.
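The final packing step (zero-padding the bit stream to a byte boundary) might look like this -- a minimal sketch, assuming the size, tree, and payload have already been concatenated into one bit string:

```python
def to_bytes(bits):
    # Pad the bit string with zeroes up to a multiple of 8 bits,
    # then pack each group of 8 into one byte.
    bits += "0" * ((-len(bits)) % 8)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
```

The padding is harmless on decompression because the stored symbol count tells the decoder exactly when to stop reading.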

To decompress, we read back the size and store it. We then create a parent node for the root of the tree and begin reading the encoded tree one bit at a time. If the bit is a 1, we create a leaf node on the left and read the next 8 bits to determine its symbol; otherwise we create a parent node on the left and recurse into it. When that recursion completes, we return and process the right child in the same way. Once we arrive back at the root, we'll have rebuilt the tree.
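Rebuilding the tree is the mirror image of the encoding traversal. A minimal sketch, matching the pre-order format used above (a 1 bit introduces a leaf and its 8-bit symbol, a 0 bit a parent):

```python
def decode_tree(bits, pos=0):
    # Read the pre-order encoding back into nested tuples: returns the
    # reconstructed node plus the position just past its encoding.
    if bits[pos] == "1":
        symbol = chr(int(bits[pos + 1:pos + 9], 2))
        return symbol, pos + 9
    left, pos = decode_tree(bits, pos + 1)     # skip the parent's 0 bit
    right, pos = decode_tree(bits, pos)
    return (left, right), pos
```

Because every parent is guaranteed to have exactly two children, no length or terminator is needed -- the recursion knows when each subtree ends.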

Then we can process the payload, starting at the root. If we read a 0, we traverse left, otherwise we traverse right. If the node we traverse to is a leaf, we output the symbol, otherwise we read the next bit and traverse left or right until we do reach a symbol. Once we output a symbol, we increment a counter and start again from the root. When the counter is equal to the size we read at the start, we can stop processing; any remaining bits are just padding.
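The payload walk described above can be sketched directly, using the same tuple-based tree as the earlier examples:

```python
def decode_payload(tree, bits, count):
    # Walk from the root: a 0 bit goes left, a 1 bit goes right. On
    # reaching a leaf, emit its symbol and restart from the root,
    # stopping after `count` symbols; trailing bits are padding.
    out, node, i = [], tree, 0
    while len(out) < count:
        node = node[0] if bits[i] == "0" else node[1]
        i += 1
        if not isinstance(node, tuple):        # reached a leaf
            out.append(node)
            node = tree
    return "".join(out)

# With the example tree (" ", ("e", "t")), the codes are ' '=0, 'e'=10, 't'=11,
# so the bit string "10011" holds the three symbols of "e t".
```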

The algorithm is elegant and, even with the added weight of the encoded tree, can produce significant savings in space for any moderately sized file. For example, the Project Gutenberg Etext version of Moby Dick comes in at 1,256,165 bytes, but is only 720,043 bytes when compressed with Huffman coding (57% of the original). Huffman coding is also found inside more complex algorithms that compress data even further. For instance, RAR (Roshal Archive, by Eugene Roshal) can compress the same file down to just 443,371 bytes (35%)!

Wiki User, 9y ago
More answers
The function is to compress data. This means to store data in a way that uses less space - though the trade-off is that the data needs to be "uncompressed" again, to actually use it.


Q: What is the function of a data-compression utility?