Versions tested: | 1, 2, 3, 4, 5, 6, 7, 8p, 9a |
Optimal parameters: | default |
Links: | http://www.cs.fit.ed... |
Authors: | Matt Mahoney |
Algorithms: | CM, X86 |
Notable peformances: | - |
PAQ uses context mixing (CM). The following is a quote from paq8 source code which describes various contexts which are modeled:
paq8 source code
- Order n. The last n bytes, up to about 16. For general purpose data. Most of the compression occurs here for orders up to about 6. An order 0 context includes only the 0-7 bits of the partially coded byte and the number of these bits (255 possible values).
- Sparse. Usually 1 or 2 of the last 8 bytes preceding the byte containing the predicted bit, e.g (2), (3),..., (8), (1,3), (1,4), (1,5), (1,6), (2,3), (2,4), (3,6), (4,8). The ordinary order 1 and 2 context, (1) or (1,2) are included above. Useful for binary data.
- Text. Contexts consists of whole words (a-z, converted to lower case and skipping other values). Contexts may be sparse, e.g (0,2) meaning the current (partially coded) word and the second word preceding the current one. Useful contexts are (0), (0,1), (0,1,2), (0,2), (0,3), (0,4). The preceding byte may or may not be included as context in the current word.
- Formatted text. The column number (determined by the position of the last linefeed) is combined with other contexts: the charater to the left and the character above it.
- Fixed record length. The record length is determined by searching for byte sequences with a uniform stride length. Once this is found, then the record length is combined with the context of the bytes immediately preceding it and the corresponding byte locations in the previous one or two records (as with formatted text).
- Context gap. The distance to the previous occurrence of the order 1 or order 2 context is combined with other low order (1-2) contexts.
- FAX. For 2-level bitmapped images. Contexts are the surrounding pixels already seen. Image width is assumed to be 1728 bits (as in calgary/pic).
- Image. For uncompressed 24-bit color BMP and TIFF images. Contexts are the high order bits of the surrounding pixels and linear combinations of those pixels, including other color planes. The image width is detected from the file header. When an image is detected, other models are turned off to improve speed.
- JPEG. Files are further compressed by partially uncompressing back to the DCT coefficients to provide context for the next Huffman code. Only baseline DCT-Huffman coded files are modeled. (This ia about 90% of images, the others are usually progresssive coded). JPEG images embedded in other files (quite common) are detected by headers. The baseline JPEG coding process is:
The most useful contexts are the current partially coded Huffman code (including S following bits) combined with the coefficient position (0-63), color (0-2), and last few RS codes.
- Convert to grayscale and 2 chroma colorspace.
- Sometimes downsample the chroma images 2:1 or 4:1 in X and/or Y.
- Divide each of the 3 images into 8x8 blocks.
- Convert using 2-D discrete cosine transform (DCT) to 64 12-bit signed coefficients.
- Quantize the coefficients by integer division (lossy).
- Split the image into horizontal slices coded independently, separated by restart codes.
- Scan each block starting with the DC(0,0) coefficient in zigzag order to the (7,7) coefficient, interleaving the 3 color components in order to scan the whole image left to right starting at the top.
- Subtract the previous DC component from the current in each color.
- Code the coefficients using RS codes, where R is a run of R zeros (0-15) and S indicates 0-11 bits of a signed value to follow. (There is a special RS code (EOB) to indicate the rest of the 64 coefficients are 0).
- Huffman code the RS symbol, followed by S literal bits.
- Match. When a context match of 400 bytes or longer is detected, the next bit of the match is predicted and other models are turned off to improve speed.
- Exe. When a x86 file (.exe, .obj, .dll) is detected, sparse contexts with gaps of 1-12 selecting only the prefix, opcode, and the bits of the modR/M byte that are relevant to parsing are selected. This model is turned off otherwise.
(Have info that should be added here? E-mail.)
The programs are described as 'archivers' even though they do not store directory structure.
Ver | Rating | CPR | DPR | S.E. | R.E. | Ratio | C. kB/s | D. kB/s | ||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
<none> | ||||||||||||
1 | 57 | 54 | 61 | 28 | 10 | 3.271 | 195 | 196 | ||||
2 | 57 | 0% | -1.9% | 0% | -11% | + | 3.324 | +1.6% | 166 | -15% | 167 | -15% |
3 | 91 | +60% | +62% | +59% | +40% | +91% | 3.380 | +1.7% | 227 | +37% | 228 | +37% |
4 | 83 | -8.8% | -9.3% | -9.3% | -43% | +81% | 3.642 | +7.8% | 107 | -53% | 107 | -53% |
5 | 48 | -42% | -42% | -42% | -50% | -34% | 3.703 | +1.7% | 53 | -50% | 53 | -50% |
9a | 261 | +444% | +453% | +437% | +550% | +372% | 3.632 | -1.9% | 346 | +553% | 338 | +538% |
-5 | ||||||||||||
6 | 43 | 41 | 46 | 7 | 32 | 3.850 | 35 | 35 | ||||
7 | 56 | +30% | +32% | +26% | +100% | 4.072 | +5.8% | 30 | -14% | 28 | -20% | |
8p | verify error (app1) |