[Solved]: Why is compression ratio using bzip2 for a sequence of “a”s so jumpy?

Problem Detail: 

library(ggplot2)

compress <- function(str) {
  # compression ratio: compressed size in bytes / original length in characters
  length(memCompress(paste(rep("a", str), collapse = ""), type = "bzip2")) /
    nchar(paste(rep("a", str), collapse = ""))
}

cr <- data.frame(i = 1:10000, r = sapply(1:10000, compress))

ggplot(cr[cr$i >= 5000 & cr$i <= 10000, ], aes(x = i, y = r)) + geom_line()

[Plot: bzip2 compression ratio r against the number of a's i, for the range 5000 to 10000.]

The compression ratio starts out at 37 for a single "a" and hits break-even (compression ratio = 1) at 39 "a"s. The curve starts out fairly smooth, then suddenly gets jagged at 98 "a"s, and from there on at increasingly short intervals. The local lows and the smooth stretches also look quite erratic and random. Can somebody explain to me why bzip2 shows this behaviour in this example?
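A quick numeric check of these observations, reusing the compress function defined above (the exact values can vary slightly between R and bzip2 versions):

compress(1)                     # 37: compressed size of a single "a", in bytes
ratios <- sapply(1:100, compress)
min(which(ratios <= 1))         # 39: first length at which compression breaks even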

Asked By : Raffael

Answered By : Gilles

Consider a simple compression algorithm that represents a run of a's by storing $(\text{header}, \mathtt{a}, n)$, i.e. some fixed header, the string a, and the number of repetitions $n$. This is a run-length encoding. The length of the compressed text would then be close to $a + \lg n$ bits for some constant $a$, and the corresponding compression ratio would be $\dfrac{a + \lg n}{n}$. This is roughly the shape of the curve as seen from a distance, if you neglect the local ups and downs. Asymptotically, the compression ratio is $\Theta(\lg(n)^p/n)$ with $p \ge 1$ (I haven't worked it out, but I suspect there are other factors at play which make the size of the output grow a little faster than logarithmically in the length of the input string).

$\lg n$ bits isn't an integer number of bits, let alone bytes: at a minimum, the encoding of $n$ has to be rounded up to a whole number of bytes. This explains a first threshold effect: for this simple compression algorithm, you would observe that the length of the output is $a$ bytes (reusing $a$ for the constant, now counted in bytes) while $n$ is sufficiently small, then $a+1$, then $a+2$, and so on. The compression ratio therefore isn't a smooth curve but jumps from $\dfrac{a}{n}$ to $\dfrac{a+1}{n}$, then to $\dfrac{a+2}{n}$, etc. On the graph this effect is barely visible by itself, mostly in how the curve realigns after each jump (there are other effects at play which make the jumps differ).
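To make that staircase concrete, here is a minimal R sketch of this idealized model (the header size hdr = 37 is an arbitrary illustrative constant, not a value taken from bzip2):

# Idealized run-length model from above: a fixed header of hdr bytes plus
# as many whole bytes as are needed to store the repeat count n.
# hdr = 37 is a made-up constant chosen for illustration only.
model_len   <- function(n, hdr = 37) hdr + ceiling(ceiling(log2(n + 1)) / 8)
model_ratio <- function(n, hdr = 37) model_len(n, hdr) / n

model_len(255)                  # 38: the count still fits in one byte
model_len(256)                  # 39: one more byte is needed
round(model_ratio(254:257), 4)  # the ratio jumps up at n = 256, then resumes its decline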

Since the compression ratio stays so close to the simple inverse of the length, these differences are hard to see on the graph, so here is the data for small lengths in my implementation (this may depend on the version of the bzip2 library, as there are multiple ways to compress some inputs). The first column indicates the number of a's, the second column is the length in bytes of the compressed output.

1–3       37
4–99      39
100–115   37
116–258   39
259–354   45
355       43
356       40
357–370   41
371–498   43
499–513   41
514–609   45
610       43
611       41
613–625   42
626–753   44
754–764   42
765       40
766–767   41
768       42
769–864   45
…
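If you want to reproduce a table like this yourself, something along these lines should work (the exact byte counts may differ between R and bzip2 versions, as noted above; strrep is just a shorter way to build the same string as in the question's code):

# Compressed size in bytes of a run of n "a" characters.
bz_len <- function(n) length(memCompress(strrep("a", n), type = "bzip2"))
sizes  <- sapply(1:300, bz_len)

# Group consecutive values of n that give the same compressed size,
# mirroring the ranges in the table above.
runs <- rle(sizes)
data.frame(from  = cumsum(runs$lengths) - runs$lengths + 1,
           to    = cumsum(runs$lengths),
           bytes = runs$values)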

Bzip2 is far more complex than a simple run-length encoding. It works in a series of steps, and the first step is a run-length encoding step, but one with a fixed size limit. The first step works as follows: if a byte is repeated at least 4 times, then the bytes after the 4th are replaced by a single byte giving the repeat count of the erased bytes. For example, aaaaaaa is transformed to aaaa\d{3} (where \d{3} denotes the character with byte value 3); aaaa is transformed to aaaa\d{0}, and so on. Since there are only 256 distinct byte values, only runs in which the byte is repeated up to 259 times can be encoded this way; if there are more repetitions, a fresh sequence starts. Furthermore, the reference implementation stops at a repeat count of 252, which encodes a run of 256 bytes.

This step explains the first thresholds: $\mathtt{a}^n$ compresses to 37 bytes for $1 \le n \le 3$ and to up to 39 bytes for $4 \le n \le 258$. Introducing the repeat count adds 2 bytes to the output of the subsequent steps, and that count byte is able to represent all the repeat counts that occur up to $n = 258$. After 258 bytes there is a second RLE sequence; there is a third one at 514 bytes, a fourth at 769 bytes, and so on. Extra RLE sequences after the second one cost little or nothing, because the sequence aaaa\d{252} (assuming \d{252} is the repeat count; I haven't checked) is itself repeated and therefore compressed by the subsequent steps.

Why are the thresholds at 258, 514, 769, (none at 1024), 1279 rather than at exact multiples of 256? I suspect it's because the Burrows-Wheeler transform followed by a move-to-front step can transform something like aaaa\d{252}aa (the output of step 1 for $n = 258$) by moving the a's together so that they can be encoded together, but I haven't checked.

The reduction in length at $n = 100$ is interesting. For the input $\mathtt{a}^{101}$, the first RLE step produces aaaa\d{97}, but the byte value of a is 97, so this is aaaaa, which has better potential for compression by the subsequent steps. I suspect that the shrinking effect, not only at $n = 101$ but also at the surrounding values, is due to the delta encoding step, which makes it easier to store bytes with nearby values. If you change a to A (byte value 65), you'll notice that the shorter output occurs for $68 \le n \le 83$.
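As a rough illustration of just that first step, here is a toy R sketch of the byte-level RLE described above (a simplified model that only handles a single run of one repeated character, not the reference implementation):

# Toy model of bzip2's first RLE step for a run of one repeated character:
# keep at most 4 literal copies, then emit one byte whose value is the number
# of additional (erased) repetitions, capped at 252; longer runs start a
# fresh sequence. This mirrors the simplified description above only.
rle1_run <- function(ch, n) {
  out <- raw(0)
  while (n > 0) {
    lit <- min(n, 4)
    out <- c(out, rep(charToRaw(ch), lit))
    n   <- n - lit
    if (lit == 4) {               # a repeat-count byte follows 4 literals
      cnt <- min(n, 252)
      out <- c(out, as.raw(cnt))
      n   <- n - cnt
    }
  }
  out
}

rle1_run("a", 7)      # 61 61 61 61 03: "aaaa" plus a count byte of 3
rle1_run("a", 101)    # 61 61 61 61 61: the count byte 97 is itself "a"
rle1_run("a", 258)    # 61 61 61 61 fc 61 61: aaaa\d{252} followed by "aa"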
My analysis of this example is far from exhaustive. To understand the other effects, you would have to study the other steps of the transformation: I mostly stopped after step 1 of 9. I hope this gives you an idea of why compression ratios get a little choppy and don't vary monotonically. If you really want to figure out every detail, I recommend taking an existing implementation and observing it in a debugger.

For the most part, such minute variations aren't the main focus when designing a compression algorithm: in many common scenarios, such as general-purpose or media compression, a difference of a few bytes is irrelevant. Compression tries to squeeze out every bit at the local level, and chains its transformations so as to gain often, lose rarely, and then not by much. There are nonetheless situations, such as special-purpose protocols designed for low-bandwidth communication, where every bit matters.

Another situation where the exact output length matters is when the compressed text is encrypted: if an adversary can submit part of the text that gets compressed and encrypted, variations in the length of the ciphertext can reveal information about the rest of the compressed-and-encrypted text to the adversary; this was recently publicized in the CRIME exploit against HTTPS.

Best Answer from Computer Science Stack Exchange

Question Source : http://cs.stackexchange.com/questions/14233
