data compression
(computer science) The technique of reducing the number of binary digits required to represent data.
|
Results for data compression
|
On this page:
|
(computer science) The technique of reducing the number of binary digits required to represent data.
The process of transforming information from one representation to another, smaller representation from which the original, or a close approximation to it, can be recovered. The compression and decompression processes are often referred to as encoding and decoding. Data compression has important applications in the areas of data storage and data transmission. Besides compression savings, other parameters of concern include encoding and decoding speeds and workspace requirements, the ability to access and decode partial files, and error generation and propagation.
The data compression process is said to be lossless if the recovered data are assured to be identical to the source; otherwise the compression process is said to be lossy. Lossless compression techniques are requisite for applications involving textual data. Other applications, such as those involving voice and image data, may be sufficiently flexible to allow controlled degradation in the data.
Data compression techniques are characterized by the use of an appropriate data model, which selects the elements of the source on which to focus; data coding, which maps source elements to output elements; and data structures, which enable efficient implementation.
Information theory dictates that, for efficiency, fewer bits be used for common events than for rare events. Compression techniques are based on using an appropriate model for the source data in which defined elements are not all equally likely. The encoder and the decoder must agree on an identical model. See also Information theory.
A static model is one in which the choice of elements and their assumed distribution is invariant. For example, the letter “e” might always be assumed to be the most likely character to occur. A static model can be predetermined with resulting unpredictable compression effect, or it can be built by the encoder by previewing the entire source data and determining element frequencies. The benefits of using a static model include the ability to decode without necessarily starting at the beginning of the compressed data.
An alternative dynamic or adaptive model assumes an initial choice of elements and distribution and, based on the beginning part of the source stream that has been processed prior to the datum presently under consideration, progressively modifies the model so that the encoding is optimal for data distributed similarly to recent observations. Some techniques may weight recently encountered data more heavily. Dynamic algorithms have the benefit of being able to adapt to changes in the ensemble characteristics. Most important, however, is the fact that the source is considered serially and output is produced directly without the necessity of previewing the entire source.
In a simple statistical model, frequencies of values (characters, strings, or pixels) determine the mapping. In the more general context model, the mapping is determined by the occurrence of elements, each consisting of a value which has other particular adjacent values. For example, in English text, although generally “u” is only moderately likely to appear as the “next” character, if the immediately preceding character is a “q” then “u” would be overwhelmingly likely to appear next.
The use of a model determines the intended sequence of values. An additional mapping via one coding technique or a combination of coding techniques is used to determine the actual output. Several data coding techniques are in common use.
Digitized audio and video signals
The information content of speech, music, and television signals can be preserved by periodically sampling at a rate equal to twice the highest frequency to be preserved. This is referred to as Nyquist sampling. However, speech, music, and television signals are highly redundant, and use of simple Nyquist sampling to code them is inefficient. Reduction of redundancy and application of more efficient sampling results in compression of the information rate needed to represent the signal without serious impairment to the quality of the remade source signal at a receiver. For speech signals, redundancy evident in pitch periodicity and in the format (energy-peaks) structure of the signal's spectrum along with aural masking of quantizing noise is used to compress the information rate. In music, which has much wider bandwidth than speech and far less redundancy, time-domain masking and frequency-domain masking are principally used to achieve compression. For television, redundancy evident in the horizontal and vertical correlation of the pixels of individual frames and in the frame-to-frame correlation of a moving picture, combined with visual masking that obscures quantizing noise resulting from the coding at low numbers of bits per sample, is used to achieve compression. See also Speech.
Compression techniques may be classified into two types: waveform coders and parametric coders. Waveform coders replicate a facsimile of a source-signal waveform at the receiver with a level of distortion that is judged acceptable. Parametric coders use a synthesizer at the receiver that is controlled by signal parameters extracted at the transmitter to remake the signal. The latter may achieve greater compression because of the information content added by the synthesizer model at the receiver.
Waveform compression methods include adaptive differential pulse-code modulation (ADPCM) for speech and music signals, audio masking for music, and differential encoding and sub-Nyquist sampling of television signals. Parametric encoders include vocoders for speech signals and encoders using orthogonal transform techniques for television.
In a computer system, storage of data in a format that occupies less space than the original form. In order to effect data compression, special software packages are required to compress the data and to reopen it to its original size. Because data can be compressed to a fraction of its original size, data compression is particularly useful in communications, allowing for faster transmission of information.
data entry process of entering data into a computer system. Prior to entry, data may be converted to codes that facilitate storage and retrieval of information. Data entry can be accomplished in several ways, including key entry, tape entry and scan entry . Tape entry is the most accurate and most efficient type of data entry, requiring the least amount of clerical involvement, followed by scan entry, then key entry.
A technology that reduces the size of a computer file. Especially important for files used on Web pages: graphics and sound files are compressed so they can be downloaded faster. Compression methods are described as lossless (no data is lost) or lossy (some data is sacrificed to achieve greater compression).
In Geographic Information Systems, the encoding of data in order to reduce its overall volume.
For more information on data compression, visit Britannica.com.
In computer science and information theory, data compression or source coding is the process of encoding information using fewer bits (or other information-bearing units) than an un-encoded representation would use through use of specific encoding schemes. For example, this article could be encoded with fewer bits if one were to accept the convention that the word "compression" be encoded as "comp." One popular instance of compression with which many computer users are familiar is the ZIP file format, which, as well as providing compression, acts as an archiver, storing many files in a single output file.
As with any communication, compressed data communication only works when both the sender and receiver of the information understand the encoding scheme. For example, this text makes sense only if the receiver understands that it is intended to be interpreted as characters representing the English language. Similarly, compressed data can only be understood if the decoding method is known by the receiver.
Compression is useful because it helps reduce the consumption of expensive resources, such as hard disk space or transmission bandwidth. On the downside, compressed data must be decompressed to be viewed (or heard), and this extra processing may be detrimental to some applications. For instance, a compression scheme for video may require expensive hardware for the video to be decompressed fast enough to be viewed as it's being decompressed (the option of decompressing the video in full before watching it may be inconvenient, and requires storage space for the decompressed video). The design of data compression schemes therefore involves trade-offs among various factors, including the degree of compression, the amount of distortion introduced (if using a lossy compression scheme), and the computational resources required to compress and uncompress the data.
Lossless compression algorithms usually exploit statistical redundancy in such a way as to represent the sender's data more concisely, but nevertheless perfectly. Lossless compression is possible because most real-world data has statistical redundancy. For example, in English text, the letter 'e' is much more common than the letter 'z', and the probability that the letter 'q' will be followed by the letter 'z' is very small.
Another kind of compression, called lossy data compression, is possible if some loss of fidelity is acceptable. For example, a person viewing a picture or television video scene might not notice if some of its finest details are removed or not represented perfectly (i.e. may not even notice compression artifacts). Similarly, two clips of audio may be perceived as the same to a listener even though one is missing details found in the other. Lossy data compression algorithms introduce relatively minor differences and represent the picture, video, or audio using fewer bits.
Lossless compression schemes are reversible so that the original data can be reconstructed, while lossy schemes accept some loss of data in order to achieve higher compression.
However, lossless data compression algorithms will always fail to compress some files; indeed, any compression algorithm will necessarily fail to compress any data containing no discernible patterns. Attempts to compress data that has been compressed already will therefore usually result in an expansion, as will attempts to compress encrypted data.
In practice, lossy data compression will also come to a point where compressing again does not work, although an extremely lossy algorithm, which for example always removes the last byte of a file, will always compress a file up to the point where it is empty.
An example of lossless vs. lossy compression is the following string:
This string can be compressed as:
Interpreted as, "5 eights, 7 threes", the original string is perfectly recreated, just written in a smaller form. In a lossy system, using
instead, the original data is lost, at the benefit of a smaller filesize.
The above is a very simple example of run-length encoding, wherein large runs of consecutive identical data values are replaced by a simple code with the data value and length of the run. This is an example of lossless data compression. It is often used to optimize disk space on office computers, or better use the connection bandwidth in a computer network. For symbolic data such as spreadsheets, text, executable programs, etc., losslessness is essential because changing even a single bit cannot be tolerated (except in some limited cases).
For visual and audio data, some loss of quality can be tolerated without losing the essential nature of the data. By taking advantage of the limitations of the human sensory system, a great deal of space can be saved while producing an output which is nearly indistinguishable from the original. These lossy data compression methods typically offer a three-way tradeoff between compression speed, compressed data size and quality loss.
Lossy image compression is used in digital cameras, greatly increasing their storage capacities while hardly degrading picture quality at all. Similarly, DVDs use the lossy MPEG-2 codec for video compression.
In lossy audio compression, methods of psychoacoustics are used to remove non-audible (or less audible) components of the signal. Compression of human speech is often performed with even more specialized techniques, so that "speech compression" or "voice coding" is sometimes distinguished as a separate discipline than "audio compression". Different audio and speech compression standards are listed under audio codecs. Voice compression is used in Internet telephony for example, while audio compression is used for CD ripping and is decoded by audio players.
The theoretical background of compression is provided by information theory (which is closely related to algorithmic information theory) and by rate-distortion theory. These fields of study were essentially created by Claude Shannon, who published fundamental papers on the topic in the late 1940s and early 1950s. Doyle and Carlson (2000) wrote that data compression "has one of the simplest and most elegant design theories in all of engineering". Cryptography and coding theory are also closely related. The idea of data compression is deeply connected with statistical inference.
Many lossless data compression systems can be viewed in terms of a four-stage model. Lossy data compression systems typically include even more stages, including, for example, prediction, frequency transformation, and quantization.
The Lempel-Ziv (LZ) compression methods are among the most popular algorithms for lossless storage. DEFLATE is a variation on LZ which is optimized for decompression speed and compression ratio, although compression can be slow. DEFLATE is used in PKZIP, gzip and PNG. LZW (Lempel-Ziv-Welch) is used in GIF images. Also noteworthy are the LZR (LZ-Renau) methods, which serve as the basis of the Zip method. LZ methods utilize a table-based compression model where table entries are substituted for repeated strings of data. For most LZ methods, this table is generated dynamically from earlier data in the input. The table itself is often Huffman encoded (e.g. SHRI, LZX). A current LZ-based coding scheme that performs well is LZX, used in Microsoft's CAB format.
The very best compressors use probabilistic models whose predictions are coupled to an algorithm called arithmetic coding. Arithmetic coding, invented by Jorma Rissanen, and turned into a practical method by Witten, Neal, and Cleary, achieves superior compression to the better-known Huffman algorithm, and lends itself especially well to adaptive data compression tasks where the predictions are strongly context-dependent. Arithmetic coding is used in the bilevel image-compression standard JBIG, and the document-compression standard DjVu. The text entry system, Dasher, is an inverse-arithmetic-coder.
Matt Mahoney, one of the 3 founders of the Hutter Prize, claims that "Compression is Equivalent to General Intelligence" [1].
Independent comparison of different methods of data compression (Results & Softwares, in French. Airelle, 2007). Numbers in parenthesis are the rank of the method of compression for the category of file specified above.
| Comparison of different methods of data compression | |||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Files | *.avi | *.dll | *.doc | *.exe | *.gif | *.htm | *.jpg | *.mp3 | *.mpg | *.txt | *.wav | *.zip | Notation | TOTAL | |
| Number of files |
16 | 26 | 138 | 24 | 246 | 79 | 44 | 29 | 8 | 36 | 8 | 1 | 19 | 674 | |
| Initial size |
5,261,152 | 5,254,220 | 5,254,656 | 5,254,056 | 5,246,209 | 5,261,187 | 5,246,116 | 5,250,432 | 5,257,720 | 5,257,876 | 5,253,436 | 5,256,024 | 5,262,680 | 68,315,764 | |
| 7z | 4,524,067 (2) | 1,543,179 (3) | 147,690 (3) | 3,910,541 (3) | 4 620 354 (1) | 341,996 (4) | 4,770,061 (4) | 5,053,813 (2) | 4,879,067 (5) | 4,258,863 (3) | 1,270,884 (3) | 3,670,225 (5) | 5,226,742 (14) | 16/20 | 44,217,482 |
| arj | 4,696,659 (9) | 2,160,530 (15) | 1,018,050 (17) | 4,130,505 (11) | 4,702,449 (12) | 898,370 (17) | 4,803,740 (11) | 5,108,093 (17) | 4,910,699 (16) | 4,606,736 (15) | 1,875,329 (16) | 4,450,535 (12) | 5,223,905 (13) | 6.1/20 | 48,585,600 |
| 4,703,291 (12) | 2,156,986 (12) | 1,010,284 (15) | 4,128,594 (9) | 4,693,021 (9) | 889,650 (15) | 4,806,914 (13) | 5,105,811 (13) | 4,904,209 (11) | 4,601,545 (13) | 1,848,972 (13) | 4,451,648 (15) | 5,201,639 (4) | 7.5/20 | 48,502,564 | |
| bz2 | 4,720,926 (18) | 2,095,832 (7) | 573,721 (5) | 4,273,885 (18) | 4,896,084 (18) | 645,243 (5) | 4,743,918 (2) | 5,069,593 (4) | 4,888,293 (7) | 4,444,829 (5) | 1,531,448 (6) | 3,771,508 (7) | 5,238,677 (16) | 11.7/20 | 46,893,957 |
| bza | 4,639,340 (6) | 2,166,940 (17) | 987,806 (11) | 4,231,254 (17) | 4,878,327 (17) | 783,188 (8) | 4,787,973 (7) | 5,076,189 (5) | 4,873,810 (2) | 4,618,970 (17) | 1,516,326 (5) | 3,770,938 (6) | 5,227,572 (15) | 9.8/20 | 47,558,633 |
| cab | 4,701,113 (11) | 2,148,386 (10) | 893,796 (7) | 4,127,044 (8) | 4,678,810 (5) | 842,129 (10) | 4,798,500 (8) | 5,099,787 (8) | 4,900,314 (10) | 4,584,969 (8) | 1,846,233 (12) | 4,451,857 (18) | 5,201,717 (5) | 10.8/20 | 48,274,655 |
| gza | 4,703,371 (13) | 2,157,116 (13) | 1,001,990 (13) | 4,126,436 (7) | 4,693,136 (10) | 874,444 (12) | 4,803,739 (10) | 5,105,765 (12) | 4,904,249 (12) | 4,597,720 (11) | 1,840,188 (11) | 4,451,638 (14) | 5,201,436 (3) | 9.2/20 | 48,461,228 |
| j | 4,678,506 (8) | 1,914,777 (5) | 703,722 (6) | 4,057,445 (5) | 4,681,437 (6) | 691,916 (6) | 4,805,059 (12) | 5,092,070 (7) | 4,898,847 (8) | 4,326,394 (4) | 1,629,228 (8) | 3,594,954 (4) | 5,215,150 (12) | 13/20 | 46,289,505 |
| jar | 4,704,088 (14) | 2,158,273 (14) | 1,017,205 (16) | 4,129,816 (10) | 4,705,456 (13) | 893,622 (16) | 4,809,136 (16) | 5,107,254 (15) | 4,904,615 (13) | 4,603,367 (14) | 1,849,394 (14) | 4,451,718 (16) | 5,202,611 (8) | 6.2/20 | 48,536,555 |
| lha | 4,711,090 (16) | 2,215,476 (18) | 1,020,194 (18) | 4,204,071 (15) | 4,830,501 (15) | 913,845 (18) | 4,918,792 (19) | 5,206,933 (19) | 5,066,716 (19) | 4,802,049 (19) | 1,895,771 (17) | 4,447,253 (10) | 5,263,136 (18) | 6.7/20 | 49,495,827 |
| lzh | 4,711,090 (16) | 2,215,476 (18) | 1,066,340 (19) | 4,143,461 (14) | 4,819,157 (14) | 971,166 (19) | 4,816,349 (18) | 5,107,584 (16) | 4,924,974 (18) | 4,635,416 (18) | 1,945,961 (19) | 4,449,756 (11) | 5,212,837 (11) | 5.3/20 | 49,019,567 |
| pkz | 4,899,083 (20) | 2,354,373 (20) | 1,173,097 (20) | 4,401,289 (20) | 5,120,590 (19) | 1,018,250 (20) | 5,162,114 (20) | 5,253,006 (20) | 5,203,747 (20) | 5,076,577 (20) | 2,084,290 (20) | 5,027,854 (20) | 5,264,213 (19) | 0.2/20 | 52,038,483 |
| rar | 4,634,009 (5) | 1,693,150 (4) | 173,313 (4) | 3,948,241 (4) | 4,639,881 (4) | 318,269 (3) | 4,780,095 (6) | 5,081 085 (6) | 4,887,973 (6) | 4,258,775 (2) | 1,318,381 (4) | 2,657,731 (3) | 5,202,579 (7) | 15.5/20 | 43,593,482 |
| rk | 4,589,894 (3) | 1,474,339 (2) | 132,629 (1) | 3,866,814 (1) | 4,628,017 (3) | 257,588 (1) | 4,434,701 (1) | 5,017,545 (1) | 4,787,286 (1) | 4,498,992 (6) | 1,168,720 (1) | 1,659,771 (1) | 5,183,337 (1) | 18.2/20 | 41,699,633 |
| rs | 4,625,725 (4) | 2,137,145 (9) | 937,954 (10) | 4,221,864 (16) | 4,850,493 (16) | 768,711 (7) | 4,776,635 (5) | 5,066,886 (3) | 4,878,852 (3) | 4,612,537 (16) | 1,560,879 (7) | 3,804,335 (8) | 5,240,116 (17) | 10.7/20 | 47,482,132 |
| sqx | 4,662,560 (7) | 2,078,866 (6) | 991,992 (12) | 4,105,933 (6) | 4,699,518 (11) | 878,469 (14) | 4,808,697 (15) | 5,102,452 (10) | 4,908,341 (14) | 4,590,245 (10) | 1,836,245 (9) | 4,415,575 (9) | 5,208,275 (10) | 9.8/20 | 48,287,168 |
| tgz | 4,707,481 (15) | 2,165,409 (16) | 907,006 (8) | 4,133,949 (12) | 4,684,949 (7) | 861,638 (11) | 4,807,701 (14) | 5,105,913 (14) | 4,909,789 (15) | 4,588,822 (9) | 1,853,650 (15) | 4,451,792 (17) | 5,202,392 (6) | 7.8/20 | 48,380,491 |
| uha | 4,498,275 (1) | 1,474,005 (1) | 136,880 (2) | 3,879,360 (2) | 4,625,014 (2) | 284,363 (2) | 4,760,572 (3) | 5,104,837 (11) | 4,879,047 (4) | 4,237,400 (1) | 1,233,812 (2) | 2,435,124 (2) | 5,187,408 (2) | 17.3/20 | 44,736,097 |
| yz1 | 4,814,935 (19) | 2,128,899 (8) | 924,706 (9) | 4,279,162 (19) | 4,686,669 (8) | 804,198 (9) | 4,810,966 (17) | 5,124,596 (18) | 4,922,886 (17) | 4,568,274 (7) | 1,901,300 (18) | 4,561,179 (19) | 5,207,874 (9) | 6.4/20 | 48,735,644 |
| zip | 4,701,064 (10) | 2,155,923 (11) | 1,009,814 (14) | 4,135,619 (13) | 5,270,565 (20) | 877,679 (13) | 4,799,508 (9) | 5,101,205 (9) | 4,898,961 (9) | 4,599,883 (12) | 1,839,080 (10) | 4,450,719 (13) | 5,264,564 (20) | 7.5/20 | 49,104,584 |
| Intermediate compressed size |
4,701,089 | 2,152,155 | 962,880 | 4,130,160 | 4,696,327 | 851,884 | 4,803,740 | 5,103,645 | 4,902,262 | 4,593,983 | 1,839,634 | 4,448,505 | 5,210,556 | 48,519,559 | |
| Intermediate compression rate |
10.6 % | 59.0 % | 81.7 % | 21.4 % | 10.5 % | 83.8 % | 8.4 % | 2.8 % | 6.8 % | 12.6 % | 65.0 % | 15.4 % | 1.0 % | 29.0 % | |
P.Table(P.T.): PAQ8 (kgb archiver is Windows GUI of old PAQ7) is much better than this all, but for the copyright of the table it can't be copied.
Globally, the three best methods tested are rk, rar and 7z. WinRK and WinRar are commercial software, 7-zip is free, open source (LGPL licence) and can be used with Linux.
Data collections, commonly used for comparing compression algorithms.
This entry is from Wikipedia, the leading user-contributed encyclopedia. It may not have been reviewed by professional editors (see full disclaimer)
Some good "data compression" pages on the web:
How? computer.howstuffworks.com |
Join the WikiAnswers Q&A community. Post a question or answer questions about "data compression" at WikiAnswers.
Copyrights:
![]() | Sci-Tech Dictionary. McGraw-Hill Dictionary of Scientific and Technical Terms. Copyright © 2003, 1994, 1989, 1984, 1978, 1976, 1974 by McGraw-Hill Companies, Inc. All rights reserved. Read more | |
![]() | Sci-Tech Encyclopedia. McGraw-Hill Encyclopedia of Science and Technology. Copyright © 2005 by The McGraw-Hill Companies, Inc. All rights reserved. Read more | |
![]() | Marketing Dictionary. Dictionary of Marketing Terms. Copyright © 2000 by Barron's Educational Series, Inc. All rights reserved. Read more | |
![]() | Business Dictionary. Dictionary of Business Terms. Copyright © 2000 by Barron's Educational Series, Inc. All rights reserved. Read more | |
![]() | Geography Dictionary. A Dictionary of Geography. Copyright © Susan Mayhew 1992, 1997, 2004. All rights reserved. Read more | |
![]() | Britannica Concise Encyclopedia. Britannica Concise Encyclopedia. © 2006 Encyclopædia Britannica, Inc. All rights reserved. Read more | |
![]() | Wikipedia. This article is licensed under the GNU Free Documentation License. It uses material from the Wikipedia article "Data compression". Read more |
Mentioned In: