
[Solved]: Compression of Random Data is Impossible?

Problem Detail: 

A few days ago this appeared on HN: http://www.patrickcraig.co.uk/other/compression.htm. It refers to a challenge from 2001, in which someone offered a \$5000 prize for any reduction in the size of randomly generated data (the entry fee was \$100, and the contestant could choose the file size).

I understand the result which shows it is impossible to compress all files of size >= N to a smaller size (http://en.wikipedia.org/wiki/Kolmogorov_complexity#Compression).

But in the challenge the idea is to compress only a single randomly generated file. A simple approach is to find $l$, the longest recurring substring in the file, and substitute it with $s$, the shortest non-occurring substring, which would give a saving of $l-s$ (since the decompressor only needs one $(s,l)$ pair).
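A minimal sketch of this substitution scheme (the function names and the naive quadratic search are my own; a real attempt on gigabyte files would need suffix structures):

```python
import itertools

def shortest_absent(data: bytes) -> bytes:
    """Return a shortest byte string that does not occur in data."""
    for length in itertools.count(1):
        present = {bytes(data[i:i + length])
                   for i in range(len(data) - length + 1)}
        for candidate in itertools.product(range(256), repeat=length):
            b = bytes(candidate)
            if b not in present:
                return b

def longest_repeated(data: bytes) -> bytes:
    """Return a longest substring occurring at least twice (naive O(n^2) pairs)."""
    best = b""
    n = len(data)
    for i in range(n):
        for j in range(i + 1, n):
            k = 0
            while j + k < n and data[i + k] == data[j + k]:
                k += 1
            if k > len(best):
                best = data[i:i + k]
    return best

def compress(data: bytes):
    """Replace every copy of the longest repeat with the shortest absent string."""
    l = longest_repeated(data)
    s = shortest_absent(data)
    if len(l) <= len(s):
        return None  # no saving possible with this scheme
    return s, l, data.replace(l, s)

def decompress(s: bytes, l: bytes, packed: bytes) -> bytes:
    # Safe because s never occurred in the original data. (For longer s a
    # careful implementation must also rule out s appearing across the
    # boundaries of a substitution.)
    return packed.replace(s, l)

data = b"abcdefHELLOWORLDqrstuvHELLOWORLDwxyz"
s, l, packed = compress(data)
assert decompress(s, l, packed) == data
```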

To estimate how much data is needed before the shortest non-occurring substring reaches a given length, as Usul pointed out, use the approximation for the coupon collector's problem: $$E(T) = n\ln n + \gamma n + \tfrac{1}{2}$$

Here $n$ is the number of possible substrings and $E(T)$ is the expected amount of data needed before all of them have occurred. Using byte-aligned substrings for simplicity, that is respectively:

275 megabytes for 3-byte strings ($n=256^3$)

91 gigabytes for 4-byte strings ($n=256^4$)
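Plugging into the coupon-collector estimate reproduces those figures (a quick numerical check; the constants are the standard asymptotic ones):

```python
import math

def expected_bytes(n: int) -> float:
    """Expected number of draws before all n possible substrings have
    appeared: n*ln(n) + gamma*n + 1/2 (coupon collector approximation)."""
    gamma = 0.5772156649  # Euler-Mascheroni constant
    return n * math.log(n) + gamma * n + 0.5

print(expected_bytes(256**3) / 1024**2)  # ~275 MB for 3-byte strings
print(expected_bytes(256**4) / 1024**3)  # ~91 GB for 4-byte strings
```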

The probability of a recurring substring appearing in the random data is given by the birthday problem over the size of the data, $p = 1-e^{-n^2/(2d)}$. For example, at 91 gigabytes one can expect to find a recurring 9-byte substring: with $d=256^9$ and $n = 91\cdot 1024^3$, $p \approx 0.636$.
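The birthday-bound figure checks out numerically (values taken from the question):

```python
import math

def p_repeat(n: float, d: float) -> float:
    """Birthday-problem estimate: probability that some substring repeats
    among n positions drawn from d equally likely values."""
    return 1 - math.exp(-n * n / (2 * d))

n = 91 * 1024**3   # ~91 GiB of data, i.e. roughly that many positions
d = 256**9         # number of distinct 9-byte strings
print(round(p_repeat(n, d), 3))  # ~0.636
```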

With $5000{:}100$ odds it should be possible to optimize a little more. Of course the sed delimiters cost 3 bytes, so if the size of the command counts, this method alone isn't enough to win - but that is without doing anything particularly clever.

Am I missing something? (It would seem that the majority of random data should not, in fact, be incompressible.) On what basis does someone offer such a challenge and presume it is entirely unwinnable for the player?


Edit (updated the calculations above as well): I also think it's possible to save exactly $l-s$ bytes without paying for a delimiter. Because substring $s$ is known not to have occurred before the substitution, the pair can be written contiguously as $sl$, and either the length of $s$ is known ahead of time (by estimating the shortest non-occurring substring) or the byte following $s$ in the file fails to collide with the first byte of $l$ with probability at least $255/256$ (this actually improves when there is a choice of which non-occurring $s$ to substitute for which recurring $l$).

Asked By : user3467349

Answered By : Wandering Logic

Here's the Kolmogorov complexity argument that @YuvalFilmus mentions in his answer.

Your input here is a sed script of some size plus an input file of some size. Say the total size of the sed script plus the file is $n$ bits.

There can be only $2^n$ inputs of size $n$ bits. Even if I let you have a sed script of less than $n$ bits (and don't charge you for telling me how long the input is) there can only be $2^{n+1}$ possible inputs. But there are $2^{n+8}$ possible strings of length $n+8$ bits, so you can produce at most $1/128$th of the possible $(n+8)$-bit strings. That's less than 0.01, so paying 50:1 (\$5000 against a \$100 entry) my expected value for the bet would be positive - about 0.60 of the stake, or roughly \$60 per entry. (And that's only giving you 1 byte to play with.)

In short, random strings that are compressible at all must be extremely rare.

Why your scheme is broken

Your probability calculations look fine to me. And the idea of a birthday attack is so appealingly clever, it took me quite a while to see the real problem.

Your basic scheme doesn't compress.

Suppose you find a 4-byte substring that doesn't occur in the original string, and an 8-byte substring that occurs twice in the original string.

Then you are going to produce a "compressed" string of length $n - 8 - 8 + 4 + 4 = n - 8$ bytes, plus a sed replacement script that must be at least $8 + 4 = 12$ bytes long. Add that together and we get $n+4$: the output is larger than the input.

If the longer string has length $l$ and occurs $r$ times, and the shorter string has length $s$, then in general your scheme reduces the string length by

$$ (l-s) r - (l+s).$$ So $l=9$, $r=3$, $s=4$ "works" (saves $(9-4)\cdot 3 - 13 = 2$ bytes), but the probability of finding a 9-byte substring that repeats 3 times is infinitesimal.
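The net-savings formula is easy to explore for small cases (the function name is mine):

```python
def savings(l: int, s: int, r: int) -> int:
    """Net bytes saved by replacing r copies of an l-byte string with an
    s-byte string, paying l + s bytes to store the replacement pair."""
    return (l - s) * r - (l + s)

print(savings(8, 4, 2))  # the scenario from the question: a net loss
print(savings(9, 4, 3))  # a (vanishingly unlikely) profitable case
```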


Question Source : http://cs.stackexchange.com/questions/40239
