
Data compression using prime numbers

Problem Detail: 

I have recently stumbled upon the following interesting article, which claims to compress random data sets efficiently, always by more than 50%, regardless of the type and format of the data.

Basically, it uses prime numbers to construct a unique representation of 4-byte data chunks, which is supposedly easy to decompress because every number has a unique factorization into primes. To associate these chunks with primes, it uses a dictionary.

My question is:

  • Is this really feasible, as the authors suggest? According to the paper, their results are very efficient and they always compress data to a smaller size. Won't the dictionary size be enormous?
  • Couldn't this be used to iteratively re-compress the compressed data with the same algorithm? It is obvious, and has been demonstrated, that such techniques (where the compressed output is re-compressed as many times as possible, dramatically reducing the file size) are impossible: there can be no bijection between the set of all random data and the compressed data. So why does this feel like it would be possible?
  • Even if the technique is not perfect yet, it can obviously be optimized and greatly improved. Why is this not more widely known/studied? If these claims and experimental results really are true, couldn't this revolutionize computing?
Asked By : Pickle

Answered By : Tom van der Zanden

always compress random data sets by more than 50%

That's impossible. You can't compress random data; you need some structure to take advantage of. Compression must be reversible, so you cannot possibly compress everything by 50%, because there are far fewer strings of length $n/2$ than there are of length $n$.
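To make the counting argument concrete (this is standard pigeonhole reasoning, not something taken from the paper): for even $n$,

$$\#\{\text{binary strings of length } n\} = 2^n, \qquad \#\{\text{binary strings of length at most } n/2\} = \sum_{k=0}^{n/2} 2^k = 2^{n/2+1} - 1.$$

Since $2^{n/2+1} - 1 < 2^n$ whenever $n \ge 2$, any encoding that shortens every length-$n$ input to at most half its length must map two distinct inputs to the same output, and therefore cannot be reversible.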

There are some major issues with the paper:

  • They use 10 test files without any indication of their content. Is the data really random? How were they generated?

  • They claim to achieve compression ratios of at least 50%, while their test data shows they achieve at most 50%.

This algorithm defines a lossless strategy which makes use of the prime numbers present in the decimal number system

  • What? Prime numbers are prime numbers regardless of the base.

  • Issue #1 with decompression: prime factorization is a hard problem. How do they do it efficiently?

  • Issue #2 with decompression (this is the kicker): they multiply the prime numbers together, but in doing so they lose all information about the order, since $2\cdot 5 = 10 = 5\cdot 2$. I don't think it is possible to decompress at all using their technique; the small sketch after this list illustrates the collision.
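To illustrate the ordering problem, here is a minimal sketch (using a made-up three-entry chunk-to-prime dictionary, not the paper's actual scheme): multiplying the primes assigned to the chunks destroys the order of the chunks, so two different inputs produce the same encoded value.

    # Minimal sketch with a hypothetical chunk-to-prime dictionary (not the paper's code).
    # Multiplying primes loses ordering information: two different chunk sequences
    # map to the same product, so unique decompression is impossible.
    from math import prod

    chunk_to_prime = {b"\x00": 2, b"\x01": 3, b"\x02": 5}  # assumed toy mapping

    def encode(chunks):
        # Multiply the primes assigned to the chunks, as the paper proposes.
        return prod(chunk_to_prime[c] for c in chunks)

    print(encode([b"\x00", b"\x02"]))  # 2 * 5 = 10
    print(encode([b"\x02", b"\x00"]))  # 5 * 2 = 10 -- same output, order is lost

Both calls print 10, so a decoder that sees 10 cannot tell which of the two inputs was encoded.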

I don't think this paper is very good.


Question Source : http://cs.stackexchange.com/questions/44594
