library(ggplot2)

# Compression ratio of bzip2 on a string of n copies of "a":
# compressed size in bytes divided by the input length.
compress <- function(n) {
  s <- paste(rep("a", n), collapse = "")
  length(memCompress(s, type = "bzip2")) / nchar(s)
}

cr <- data.frame(i = 1:10000, r = sapply(1:10000, compress))
ggplot(cr[cr$i >= 5000 & cr$i <= 10000, ], aes(x = i, y = r)) + geom_line()
The compression ratio starts out at 37 for a single “a” and reaches break-even (compression ratio = 1) at 39 “a”s. The curve starts out fairly smooth, then suddenly becomes jagged at 98 “a”s and from there on at increasingly short intervals. The local lows and the smooth sections also look quite erratic and random. Can somebody explain why bzip2 shows this behaviour in this example?
Asked By : Raffael
Answered By : Gilles
n          compressed size (bytes)
1–3        37
4–99       39
100–115    37
116–258    39
259–354    45
355        43
356        40
357–370    41
371–498    43
499–513    41
514–609    45
610        43
611        41
613–625    42
626–753    44
754–764    42
765        40
766–767    41
768        42
769–864    45
…
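(A table like this can be reproduced directly in R, for example with a small sketch along the following lines, which collapses consecutive input lengths with the same compressed size into ranges; strrep needs R >= 3.3, and exact byte counts may vary slightly between bzip2 builds.)

# Compressed size in bytes of "a" repeated n times, for n = 1..1000.
sizes <- sapply(1:1000, function(n) {
  length(memCompress(strrep("a", n), type = "bzip2"))
})

# Collapse consecutive n with the same compressed size into ranges.
runs <- rle(sizes)
to   <- cumsum(runs$lengths)
from <- to - runs$lengths + 1
data.frame(from = from, to = to, bytes = runs$values)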
Bzip2 is far more complex than a simple run-length encoding. It works in a series of steps, and the first step is a run-length encoding step, but with a fixed size limit. The first step works as follows: if a byte is repeated at least 4 times, then the repetitions after the 4th are replaced by a single byte holding the count of the erased bytes. For example, aaaaaaa is transformed to aaaa[3] (writing [k] for the byte with value k), aaaa is transformed to aaaa[0], and so on. Since there are only 256 distinct byte values, only sequences where the byte is repeated up to 259 times can be encoded this way; if there are more, a fresh sequence starts. Furthermore, the reference implementation stops at a repeat count of 252, which encodes a string of 256 bytes.

This step explains the first threshold and the threshold at 258: $\mathtt{a}^n$ compresses to 37 bytes for $1 \le n \le 3$ and to at most 39 bytes for $4 \le n \le 258$. The introduction of the repeat count results in 2 more bytes in the output of the subsequent steps, and a single repeat-count byte covers all repeat counts up to 258 input bytes. After 258 bytes, a second RLE sequence starts; there is a third one at 514 bytes, a fourth at 769 bytes, and so on. Extra RLE sequences beyond the second one cost little, if anything, because the sequence aaaa[252] (assuming that [252] is the repeat count, I haven't checked) is itself repeated and therefore compressed by the subsequent steps.

Why are the thresholds at 258, 514, 769, (none at 1024), 1279 rather than at exact multiples of 256? I suspect it's because the Burrows–Wheeler transform followed by a move-to-front transform can take something like aaaa[252]aa (the output of step 1 for $n = 258$) and move the a's together so that they can be encoded together, but I haven't checked.

The reduction in length at $n = 100$ is interesting. For the input $\mathtt{a}^{101}$, the first RLE step produces aaaa[97]; but the byte value of a is 97, so this is just aaaaa, which has better potential for compression by the subsequent steps. I suspect that the shrinking effect, not only for $n = 101$ but also for surrounding values, is due to the delta encoding step, which makes it easier to store bytes with nearby values. If you change a to A (byte value 65), you'll notice that the shorter output occurs for $68 \le n \le 83$.

My analysis of this example is far from exhaustive. To understand the other effects, you would have to study the other steps of the transformation: I mostly stopped after step 1 of 9. I hope this gives you an idea of why compression ratios get a little choppy and don't vary monotonically. If you really want to figure out every detail, I recommend taking an existing implementation and observing it with a debugger.

For the most part, such minute variations aren't the main focus when designing a compression algorithm: in many common scenarios, such as general-purpose or media compression, a difference of a few bytes is irrelevant. Compression tries to squeeze out every bit at the local level, and it chains transformations so as to gain often, lose rarely, and then not by much. There are nonetheless situations, such as special-purpose protocols designed for low-bandwidth communication, where every bit matters.
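To make step 1 concrete, here is a rough R sketch of the run-length rule described above (my own simplification, not bzip2's reference code): a byte repeated at least 4 times is kept 4 times and followed by one byte holding the number of removed repeats, with the run capped at 256 input bytes (repeat count at most 252).

# Sketch of the step-1 RLE rule described above (simplified, not reference code).
rle_step1 <- function(x) {
  bytes <- as.integer(charToRaw(x))
  out <- integer(0)
  i <- 1
  while (i <= length(bytes)) {
    run <- 1
    while (i + run <= length(bytes) && bytes[i + run] == bytes[i] && run < 256) {
      run <- run + 1
    }
    if (run >= 4) {
      out <- c(out, rep(bytes[i], 4), run - 4)  # 4 literal bytes + repeat count
    } else {
      out <- c(out, rep(bytes[i], run))         # short runs stay literal
    }
    i <- i + run
  }
  as.raw(out)
}

rle_step1(strrep("a", 7))    # 61 61 61 61 03        ("aaaa" + byte 3)
rle_step1(strrep("a", 101))  # 61 61 61 61 61        (repeat count 97 is "a" itself)
rle_step1(strrep("a", 258))  # 61 61 61 61 fc 61 61  ("aaaa" + byte 252 + "aa")

The last call produces the aaaa[252]aa string discussed above for $n = 258$.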
Besides low-bandwidth protocols, another situation where the exact output length matters is when the compressed text is encrypted: if an adversary can submit part of the text that is compressed and encrypted together with a secret, variations in the length of the ciphertext can reveal information about the parts of the plaintext the adversary does not control; this was recently publicized in the CRIME exploit against HTTPS.
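As a toy illustration of that length side channel (not the actual CRIME attack, which targets DEFLATE-compressed TLS/SPDY traffic, hence "gzip" rather than "bzip2" below; all strings here are made up), compare the compressed sizes when an attacker-chosen guess does or does not repeat a secret that is compressed along with it:

# The secret cookie value and the attacker-chosen guesses are invented.
request <- "GET /?q=hello HTTP/1.1\r\nCookie: sessionid=7f3a9c\r\n"
right   <- "Cookie: sessionid=7f3a9c"   # guess matches the secret cookie
wrong   <- "Cookie: sessionid=000000"   # guess does not match

length(memCompress(paste0(request, right), type = "gzip"))
length(memCompress(paste0(request, wrong), type = "gzip"))
# The first length is typically a few bytes smaller, leaking that the guess
# matched, even though the attacker only observes ciphertext lengths.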
Best Answer from StackOverflow
Question Source : http://cs.stackexchange.com/questions/14233