I just started reading a book called Introduction to Data Compression, by Guy E. Blelloch. On page one, he states:
The truth is that if any one message is shortened by an algorithm, then some other message needs to be lengthened. You can verify this in practice by running GZIP on a GIF file. It is, in fact, possible to go further and show that for a set of input messages of fixed length, if one message is compressed, then the average length of the compressed messages over all possible inputs is always going to be longer than the original input messages.
Consider, for example, the 8 possible 3 bit messages. If one is compressed to two bits, it is not hard to convince yourself that two messages will have to expand to 4 bits, giving an average of 3 1/8 bits.
Really? I find it very hard to convince myself of that. In fact, here's a counterexample. Consider the algorithm which accepts as input any 3-bit string and maps it to the following outputs:
000 -> 0
001 -> 1
010 -> 00
011 -> 01
100 -> 10
101 -> 11
110 -> 000
111 -> 001
So there you are - no input is mapped to a longer output. There are certainly no "two messages" that have expanded to 4 bits.
So what exactly is the author talking about? I suspect either there's some implicit caveat that just isn't obvious to me, or he's using language that's far too sweeping.
Disclaimer: I realize that if my algorithm is applied iteratively, you do indeed lose data. Try applying it twice to the input 110: 110 -> 000 -> 0, and now you don't know which of 110 and 000 was the original input. However, if you apply it only once, it seems lossless to me. Is that related to what the author's talking about?
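Both observations are easy to check mechanically. Here is a short Python sketch (the dictionary name C is just a convenient choice) confirming that a single application never lengthens its input and keeps all eight outputs distinct, while iterating it collapses 110 and 000 onto the same output 0:

# The mapping from the question, stored as a dict.
C = {"000": "0", "001": "1", "010": "00", "011": "01",
     "100": "10", "101": "11", "110": "000", "111": "001"}

# One application: nothing gets longer, and distinct inputs keep distinct outputs.
assert all(len(C[x]) <= len(x) for x in C)   # no input is mapped to a longer output
assert len(set(C.values())) == len(C)        # injective on the eight 3-bit strings

# Applied twice, however, information is lost:
print(C[C["110"]])   # 110 -> 000 -> 0
print(C["000"])      # 000 -> 0 ... so an output of 0 no longer identifies the input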
Asked By : Jack M
Answered By : Andrej Bauer
What you are missing is that you need to consider all strings of length 3 or less. That is: if a compression scheme for strings of length 3 or less compresses one of the 3-bit strings to a 2-bit string, then some string of length 3 or less will have to expand to 3 bits or more.
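One way to see this for the scheme proposed in the question is to list which codewords remain free once the eight 3-bit strings have been assigned. The following Python sketch (with the question's mapping hard-coded as a dictionary C) shows that the six unassigned inputs all have length 1 or 2, while every remaining codeword of length at most 3 is exactly 3 bits long, so extending the scheme to all strings of length 3 or less forces those inputs to expand:

from itertools import product

# All non-empty binary strings of length at most 3: 2 + 4 + 8 = 14 of them.
all_strings = ["".join(bits) for k in (1, 2, 3) for bits in product("01", repeat=k)]

# The question's mapping of the eight 3-bit strings.
C = {"000": "0", "001": "1", "010": "00", "011": "01",
     "100": "10", "101": "11", "110": "000", "111": "001"}

unassigned_inputs = [s for s in all_strings if s not in C]          # the 1- and 2-bit strings
free_codewords = [s for s in all_strings if s not in C.values()]    # what is left to map them to

print(unassigned_inputs)   # ['0', '1', '00', '01', '10', '11']
print(free_codewords)      # ['010', '011', '100', '101', '110', '111'] -- all 3 bits long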
A lossless compression scheme is a function $C$ from finite bit strings to finite bit strings which is injective, i.e., if $C(x) = C(y)$ then $x = y$; in other words, $C(x)$ uniquely determines $x$.
Consider an arbitrary compression scheme $C$ and let $S$ be a set of binary strings. We can express how well $C$ works on $S$ by computing the ratio $$\text{CompressionRatio}(C,S) = \frac{\sum_{x \in S} \mathrm{length}(C(x))}{\sum_{x \in S} \mathrm{length}(x)}.$$ A small compression ratio is good. For example, if it is $1/2$ that means we can on average compress strings in $S$ by 50% using $C$.
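For concreteness, the ratio is easy to compute when a scheme is given as a Python dictionary (the function below, compression_ratio, is only an illustrative sketch). Applied to the question's mapping over the eight 3-bit strings it gives $16/24 \approx 0.67$, which looks like a win only because this particular $S$ leaves out the shorter strings:

def compression_ratio(C, S):
    """Total compressed length divided by total original length over the set S."""
    return sum(len(C[x]) for x in S) / sum(len(x) for x in S)

# The question's scheme, defined only on the eight 3-bit strings.
C = {"000": "0", "001": "1", "010": "00", "011": "01",
     "100": "10", "101": "11", "110": "000", "111": "001"}

print(compression_ratio(C, list(C)))   # 16/24 = 0.666..., but S here omits the shorter strings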
If we try to compress all strings of length at most $n$ then we are in trouble:
Theorem: Let $S$ be the set of all strings of length at most $n$ and $C$ any compression scheme. Then $\text{CompressionRatio}(C,S) \geq 1$.
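The theorem follows from a counting argument: an injective $C$ must send the $2^{n+1}-1$ strings of length at most $n$ to that many distinct codewords, and the shortest $2^{n+1}-1$ binary strings are exactly the strings of length at most $n$ themselves, so the total output length cannot fall below the total input length. The Python sketch below (function names are illustrative) computes that lower bound for small $n$:

def total_input_length(n):
    """Sum of lengths of all binary strings of length at most n (including the empty one)."""
    return sum(k * 2 ** k for k in range(n + 1))

def min_total_output_length(m):
    """Smallest possible total length of m distinct binary codewords:
    take the 1 string of length 0, the 2 of length 1, the 4 of length 2, ... greedily."""
    total, length = 0, 0
    while m > 0:
        take = min(m, 2 ** length)
        total += take * length
        m -= take
        length += 1
    return total

for n in range(1, 11):
    inputs = total_input_length(n)
    best = min_total_output_length(2 ** (n + 1) - 1)   # one codeword per string of length <= n
    assert best >= inputs                              # hence CompressionRatio(C, S) >= 1
    print(f"n={n:2d}: best achievable ratio = {best}/{inputs} = {best / inputs:.3f}")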
So, the best compression scheme in the world is the identity function! Well, only if we want to compress random strings of bits. The bit strings which occur in practice are far from random and exhibit a lot of regularity. This is why it makes sense to compress data despite the above theorem.
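The difference is easy to see with Python's standard zlib module, which implements the same DEFLATE method that GZIP uses; the snippet below is only a rough illustration, and the exact sizes will vary from run to run:

import os
import zlib

regular = b"abc" * 100_000      # 300,000 highly repetitive bytes
noise = os.urandom(300_000)     # 300,000 (pseudo-)random bytes with no structure to exploit

print(len(zlib.compress(regular)))  # far smaller than 300,000
print(len(zlib.compress(noise)))    # typically a bit larger than 300,000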
Best Answer from Stack Exchange
Question Source : http://cs.stackexchange.com/questions/7531