I'll phrase my question using an intuitive and rather extreme example:
Is the expected compression ratio (using zip compression) of a children's book higher than that of a novel written for adults?
I read somewhere that specifically the compression ratio for zip compression can be considered an indicator for the information (as interpreted by a human being) contained in a text. Can't find this article anymore though.
I am not sure how to attack this question. Of course no compression algorithm can grasp the meaning of verbal content. So what would a zip compression ratio reflect when applied to a text? Is it just symbol patterns - like word repetitions will lead to higher ratio - so basically it would just reflect the vocabulary?
Update:
Another way to put my question would be whether there is a correlation which goes beyond repetition of words / restricted vocabulary.
Tangentially related:
Relation of Word Order and Compression Ratio and Degree of Structure
Asked By : Raffael
Answered By : Wandering Logic
Shannon's noisless coding theorem is the formal statement that the size of an optimally compressed data stream is equivalent to the amount of information in that data stream.
But you are also right that the definition of "amount of information" depends on your probability model for the data stream. If you have a correct table of letter probabilities before you start the compression then Huffman coding gets within a constant factor of optimal, Arithmetic coding gets even closer to optimal. But if there are correlations between neighboring symbols (which there are in real human produced text) you can do better by choosing codes for pairs of letters. And you can do even better if you look at triples, and so on. Additionally, you typically don't have a very good probabilistic model of your data stream when you start, so you need to adaptively construct the probability table and then assign variable length symbols depending on the probabilities as you learn more.
The kinds of compression used in zip/gzip, compress, and 7-zip are all variants of Lempel-Ziv coding. Lempel-Ziv coding is adaptive, builds probability tables over varying length chunks of letters, and is optimal in the sense that given an infinite random stream that is ergodic (the probabilities are stable over time) it will, in the limit, adaptively find a table that takes into account correlations over arbitrary distances, and then use that table to produce a stream that approaches optimally coded.
Of course, neither children's books nor novels for adults are infinitely long ergodic random processes (although some of them may seem like they are) so the conditions of the theorems don't hold.
Here's an interesting interview with Jacob Ziv where he talks a little about the universality of Lempel-Ziv coding: http://www.ieeeghn.org/wiki/index.php/Oral-History:Jacob_Ziv#Lempel-Ziv_algorithm.2C_Viterbi_algorithm.2C_and_Ziv-Zakai_bound.
Best Answer from StackOverflow
Question Source : http://cs.stackexchange.com/questions/14150
0 comments:
Post a Comment
Let us know your responses and feedback