Share on Facebook Share on Twitter Email
Answers.com

Data compression

 
Sci-Tech Dictionary: data compression
 
(′dad·ə kəm′presh·ən)

(computer science) The technique of reducing the number of binary digits required to represent data.


Search unanswered questions...
Enter a word or phrase...
All Community Q&A Reference topics
 
Sci-Tech Encyclopedia: Data compression
Top

The process of transforming information from one representation to another, smaller representation from which the original, or a close approximation to it, can be recovered. The compression and decompression processes are often referred to as encoding and decoding. Data compression has important applications in the areas of data storage and data transmission. Besides compression savings, other parameters of concern include encoding and decoding speeds and workspace requirements, the ability to access and decode partial files, and error generation and propagation.

The data compression process is said to be lossless if the recovered data are assured to be identical to the source; otherwise the compression process is said to be lossy. Lossless compression techniques are requisite for applications involving textual data. Other applications, such as those involving voice and image data, may be sufficiently flexible to allow controlled degradation in the data.

Data compression techniques are characterized by the use of an appropriate data model, which selects the elements of the source on which to focus; data coding, which maps source elements to output elements; and data structures, which enable efficient implementation.

Information theory dictates that, for efficiency, fewer bits be used for common events than for rare events. Compression techniques are based on using an appropriate model for the source data in which defined elements are not all equally likely. The encoder and the decoder must agree on an identical model. See also Information theory.

A static model is one in which the choice of elements and their assumed distribution is invariant. For example, the letter “e” might always be assumed to be the most likely character to occur. A static model can be predetermined with resulting unpredictable compression effect, or it can be built by the encoder by previewing the entire source data and determining element frequencies. The benefits of using a static model include the ability to decode without necessarily starting at the beginning of the compressed data.

An alternative dynamic or adaptive model assumes an initial choice of elements and distribution and, based on the beginning part of the source stream that has been processed prior to the datum presently under consideration, progressively modifies the model so that the encoding is optimal for data distributed similarly to recent observations. Some techniques may weight recently encountered data more heavily. Dynamic algorithms have the benefit of being able to adapt to changes in the ensemble characteristics. Most important, however, is the fact that the source is considered serially and output is produced directly without the necessity of previewing the entire source.

In a simple statistical model, frequencies of values (characters, strings, or pixels) determine the mapping. In the more general context model, the mapping is determined by the occurrence of elements, each consisting of a value which has other particular adjacent values. For example, in English text, although generally “u” is only moderately likely to appear as the “next” character, if the immediately preceding character is a “q” then “u” would be overwhelmingly likely to appear next.

The use of a model determines the intended sequence of values. An additional mapping via one coding technique or a combination of coding techniques is used to determine the actual output. Several data coding techniques are in common use.

Digitized audio and video signals

The information content of speech, music, and television signals can be preserved by periodically sampling at a rate equal to twice the highest frequency to be preserved. This is referred to as Nyquist sampling. However, speech, music, and television signals are highly redundant, and use of simple Nyquist sampling to code them is inefficient. Reduction of redundancy and application of more efficient sampling results in compression of the information rate needed to represent the signal without serious impairment to the quality of the remade source signal at a receiver. For speech signals, redundancy evident in pitch periodicity and in the format (energy-peaks) structure of the signal's spectrum along with aural masking of quantizing noise is used to compress the information rate. In music, which has much wider bandwidth than speech and far less redundancy, time-domain masking and frequency-domain masking are principally used to achieve compression. For television, redundancy evident in the horizontal and vertical correlation of the pixels of individual frames and in the frame-to-frame correlation of a moving picture, combined with visual masking that obscures quantizing noise resulting from the coding at low numbers of bits per sample, is used to achieve compression. See also Speech.

Compression techniques may be classified into two types: waveform coders and parametric coders. Waveform coders replicate a facsimile of a source-signal waveform at the receiver with a level of distortion that is judged acceptable. Parametric coders use a synthesizer at the receiver that is controlled by signal parameters extracted at the transmitter to remake the signal. The latter may achieve greater compression because of the information content added by the synthesizer model at the receiver.

Waveform compression methods include adaptive differential pulse-code modulation (ADPCM) for speech and music signals, audio masking for music, and differential encoding and sub-Nyquist sampling of television signals. Parametric encoders include vocoders for speech signals and encoders using orthogonal transform techniques for television.


 
Computer Desktop Encyclopedia: data compression
Top

Encoding data to take up less storage space and less bandwidth for transmission. Digital data are compressed by finding repeatable patterns of binary 0s and 1s. The more patterns can be found, the more the data can be compressed. Text can typically be compressed to approximately 40% of its original size, and graphics files from 20% to 90%. Some files compress very little. It depends entirely on the type of file and compression algorithm used. See archive and archive formats.

Compression Sparked a Revolution

Data compression is an important part of the digital world. For example, MPEG compression allowed a two-hour movie to fit on a DVD disc. Compression enables voice and video calls over the Internet. MP3 compression sparked a revolution by enabling people to download a song from the Internet 10 times faster than the original CD format. See codec.

Dictionary and Statistical Methods

Two major ways to compress data are the dictionary and statistical methods. The widely used dictionary method creates a list of repeatable phrases. For example, GIF images and ZIP and JAR archives are compressed with this method (see LZW). The statistical method converts characters into variable length strings of bits based on frequency of use (see Huffman coding).

Lossless Vs. Lossy

When text and financial data are compressed, they must be decompressed back to a perfect original, bit for bit. This is known as "lossless compression." However, audio and video can be compressed down to as little as 5% of its original size using "lossy compression." Some of the data are actually lost, but the loss is not noticeable to the human ear and eye. See lossless compression, codec examples and capacity optimization.

Download Computer Desktop Encyclopedia to your iPhone/iTouch

 
Marketing Dictionary: data compression
Top

In a computer system, storage of data in a format that occupies less space than the original form. In order to effect data compression, special software packages are required to compress the data and to reopen it to its original size. Because data can be compressed to a fraction of its original size, data compression is particularly useful in communications, allowing for faster transmission of information.

data entry process of entering data into a computer system. Prior to entry, data may be converted to codes that facilitate storage and retrieval of information. Data entry can be accomplished in several ways, including key entry, tape entry and scan entry . Tape entry is the most accurate and most efficient type of data entry, requiring the least amount of clerical involvement, followed by scan entry, then key entry.

 
Business Dictionary: Data Compression
Top

A technology that reduces the size of a computer file. Especially important for files used on Web pages: graphics and sound files are compressed so they can be downloaded faster. Compression methods are described as lossless (no data is lost) or lossy (some data is sacrificed to achieve greater compression).

 
Geography Dictionary: data compression
Top

In Geographic Information Systems, the encoding of data in order to reduce its overall volume.

 
Britannica Concise Encyclopedia: data compression
Top

Process of reducing the amount of data needed for storage or transmission of a given piece of information (text, graphics, video, sound, etc.), typically by use of encoding techniques. Data compression is characterized as either lossy or lossless depending on whether some data is discarded or not, respectively. Lossless compression scans the data for repetitive sequences or regions and replaces them with a single "token." For example, every occurrence of the word the or region with the colour red might be converted to $. ZIP and GIF are the most common lossless formats for text and graphics, respectively. Lossy compression is frequently used for photographs, video, and sound files where the loss of some detail is generally unnoticeable. JPEG and MPEG (see MP3) are the most common lossy formats.

For more information on data compression, visit Britannica.com.

 
Wikipedia: Data compression
Top

In computer science and information theory, data compression or source coding is the process of encoding information using fewer bits (or other information-bearing units) than an unencoded representation would use through use of specific encoding schemes.

As with any communication, compressed data communication only works when both the sender and receiver of the information understand the encoding scheme. For example, this text makes sense only if the receiver understands that it is intended to be interpreted as characters representing the English language. Similarly, compressed data can only be understood if the decoding method is known by the receiver.

Compression is useful because it helps reduce the consumption of expensive resources, such as hard disk space or transmission bandwidth. On the downside, compressed data must be decompressed to be used, and this extra processing may be detrimental to some applications. For instance, a compression scheme for video may require expensive hardware for the video to be decompressed fast enough to be viewed as it's being decompressed (the option of decompressing the video in full before watching it may be inconvenient, and requires storage space for the decompressed video). The design of data compression schemes therefore involves trade-offs among various factors, including the degree of compression, the amount of distortion introduced (if using a lossy compression scheme), and the computational resources required to compress and uncompress the data.

Contents

Lossless versus lossy compression

Lossless compression algorithms usually exploit statistical redundancy in such a way as to represent the sender's data more concisely without error. Lossless compression is possible because most real-world data has statistical redundancy. For example, in English text, the letter 'e' is much more common than the letter 'z', and the probability that the letter 'q' will be followed by the letter 'z' is very small.

Another kind of compression, called lossy data compression or perceptual coding, is possible if some loss of fidelity is acceptable. Generally, a lossy data compression will be guided by research on how people perceive the data in question. For example, the human eye is more sensitive to subtle variations in luminance than it is to variations in color. JPEG image compression works in part by "rounding off" some of this less-important information. Lossy data compression provides a way to obtain the best fidelity for a given amount of compression. In some cases, transparent (unnoticeable) compression is desired; in other cases, fidelity is sacrificed to reduce the amount of data as much as possible.

Lossless compression schemes are reversible so that the original data can be reconstructed, while lossy schemes accept some loss of data in order to achieve higher compression.

However, lossless data compression algorithms will always fail to compress some files; indeed, any compression algorithm will necessarily fail to compress any data containing no discernible patterns. Attempts to compress data that has been compressed already will therefore usually (text files usually can be compressed more after being compressed, due to fewer symbols), result in an expansion, as will attempts to compress all but the most trivially encrypted data.

In practice, lossy data compression will also come to a point where compressing again does not work, although an extremely lossy algorithm, like for example always removing the last byte of a file, will always compress a file up to the point where it is empty.

An example of lossless vs. lossy compression is the following string:

25.888888888

This string can be compressed as:

25.[9]8

Interpreted as, "twenty five point 9 eights", the original string is perfectly recreated, just written in a smaller form. In a lossy system, using

26

instead, the original data is lost, at the benefit of a smaller file size.

Applications

The above is a very simple example of run-length encoding, wherein large runs of consecutive identical data values are replaced by a simple code with the data value and length of the run. This is an example of lossless data compression. It is often used to optimize disk space on office computers, or better use the connection bandwidth in a computer network. For symbolic data such as spreadsheets, text, executable programs, etc., losslessness is essential because changing even a single bit cannot be tolerated (except in some limited cases).

For visual and audio data, some loss of quality can be tolerated without losing the essential nature of the data. By taking advantage of the limitations of the human sensory system, a great deal of space can be saved while producing an output which is nearly indistinguishable from the original. These lossy data compression methods typically offer a three-way tradeoff between compression speed, compressed data size and quality loss.

Lossy image compression is used in digital cameras, to increase storage capacities with minimal degradation of picture quality. Similarly, DVDs use the lossy MPEG-2 codec for video compression.

In lossy audio compression, methods of psychoacoustics are used to remove non-audible (or less audible) components of the signal. Compression of human speech is often performed with even more specialized techniques, so that "speech compression" or "voice coding" is sometimes distinguished as a separate discipline from "audio compression". Different audio and speech compression standards are listed under audio codecs. Voice compression is used in Internet telephony for example, while audio compression is used for CD ripping and is decoded by audio players.

Theory

The theoretical background of compression is provided by information theory (which is closely related to algorithmic information theory) and by rate-distortion theory. These fields of study were essentially created by Claude Shannon, who published fundamental papers on the topic in the late 1940s and early 1950s. Cryptography and coding theory are also closely related. The idea of data compression is deeply connected with statistical inference.

Many lossless data compression systems can be viewed in terms of a four-stage model. Lossy data compression systems typically include even more stages, including, for example, prediction, frequency transformation, and quantization.

The Lempel-Ziv (LZ) compression methods are among the most popular algorithms for lossless storage. DEFLATE is a variation on LZ which is optimized for decompression speed and compression ratio, therefore compression can be slow. DEFLATE is used in PKZIP, gzip and PNG. LZW (Lempel-Ziv-Welch) is used in GIF images. Also noteworthy are the LZR (LZ-Renau) methods, which serve as the basis of the Zip method. LZ methods utilize a table-based compression model where table entries are substituted for repeated strings of data. For most LZ methods, this table is generated dynamically from earlier data in the input. The table itself is often Huffman encoded (e.g. SHRI, LZX). A current LZ-based coding scheme that performs well is LZX, used in Microsoft's CAB format.

The very best compressors use probabilistic models, in which predictions are coupled to an algorithm called arithmetic coding. Arithmetic coding, invented by Jorma Rissanen, and turned into a practical method by Witten, Neal, and Cleary, achieves superior compression to the better-known Huffman algorithm, and lends itself especially well to adaptive data compression tasks where the predictions are strongly context-dependent. Arithmetic coding is used in the bilevel image-compression standard JBIG, and the document-compression standard DjVu. The text entry system, Dasher, is an inverse-arithmetic-coder.

There is a close connection between machine learning and compression: a system that predicts the posterior probabilities of a sequence given its entire history can be used for optimal data compression (by using arithmetic coding on the output distribution), while an optimal compressor can be used for prediction (by finding the symbol that compresses best, given the previous history). This equivalence has been used as justification for data compression as a benchmark for "general intelligence" [1].

See also

Data compression topics

Compression algorithms

Lossless data compression

Lossy data compression

Example implementations

  • DEFLATE (a combination of LZ77 and Huffman coding) – used by ZIP, gzip and PNG files
  • LZMA used by 7-Zip
  • LZO (very fast LZ variation, speed oriented)
  • NOSSO (high compression, used for Adobe Reader[2])
  • LZX (an LZ77 family compression algorithm)
  • Unix compress utility (the .Z file format), and GIF use LZW
  • Unix pack utility (the .z file format) used Huffman coding
  • bzip2 (a combination of the Burrows-Wheeler transform and Huffman coding)
  • PAQ (very high compression based on context mixing, but extremely slow; competing in the top of the highest compression competitions)
  • JPEG (image compression using a discrete cosine transform, then quantization, then Huffman coding)
  • MPEG (audio and video compression standards family in wide use, using DCT and motion-compensated prediction for video)
    • MP3 (a part of the MPEG-1 standard for sound and music compression, using subbanding and MDCT, perceptual modeling, quantization, and Huffman coding)
    • AAC (part of the MPEG-2 and MPEG-4 audio coding specifications, using MDCT, perceptual modeling, quantization, and Huffman coding)
  • Vorbis (DCT based AAC-alike audio codec, designed with a focus on avoiding patent encumbrance)
  • JPEG 2000 (image compression using wavelets, then quantization, then entropy coding)
  • TTA (codec) (uses linear predictive coding for lossless audio compression)
  • FLAC (linear predictive coding for lossless audio compression)

Corpora

Data collections, commonly used for comparing compression algorithms.

References

External links



 
Best of the Web: Data compression
Top

Some good "Data compression" pages on the web:


How?
computer.howstuffworks.com
 
 
 

 

Copyrights:

Sci-Tech Dictionary. McGraw-Hill Dictionary of Scientific and Technical Terms. Copyright © 2003, 1994, 1989, 1984, 1978, 1976, 1974 by McGraw-Hill Companies, Inc. All rights reserved.  Read more
Sci-Tech Encyclopedia. McGraw-Hill Encyclopedia of Science and Technology. Copyright © 2005 by The McGraw-Hill Companies, Inc. All rights reserved.  Read more
Computer Desktop Encyclopedia. THIS COPYRIGHTED DEFINITION IS FOR PERSONAL USE ONLY.
All other reproduction is strictly prohibited without permission from the publisher.
© 1981-2009 Computer Language Company Inc.  All rights reserved.  Read more
Marketing Dictionary. Dictionary of Marketing Terms. Copyright © 2000 by Barron's Educational Series, Inc. All rights reserved.  Read more
Business Dictionary. Dictionary of Business Terms. Copyright © 2000 by Barron's Educational Series, Inc. All rights reserved.  Read more
Geography Dictionary. A Dictionary of Geography. Copyright © Susan Mayhew 1992, 1997, 2004. All rights reserved.  Read more
Britannica Concise Encyclopedia. Britannica Concise Encyclopedia. © 2006 Encyclopædia Britannica, Inc. All rights reserved.  Read more
Wikipedia. This article is licensed under the GNU Free Documentation License. It uses material from the Wikipedia article "Data compression" Read more