Given normally distributed integers with a mean of 0 and a standard deviation $\sigma$ around 1000, how do I compress those numbers (almost) perfectly? Given the entropy of the Gaussian distribution, it should be possible to store any value $x$ using $$\frac{1}{2} \log_2(2\pi\sigma^2)+\frac{x^2}{2\sigma^2}\log_2 \mathrm{e} \;\text{ bits.}$$
Arithmetic coding should provide perfect compression, and in principle it's not too hard: I can calculate the interval boundaries from the cumulative distribution function of the Gaussian. In practice, I hit considerable difficulties, because with floating-point operations I cannot reproduce the results exactly, and I have no idea how to do this without floating point. Exact reproduction is necessary because the decompressing code must come up with exactly the same interval boundaries as the compressing code.
So my question is: How do I compute the interval boundaries? Or is there any other way to achieve (near) perfect compression of such data?
Strictly speaking, the normal distribution is defined only for continuous variables. So what I mean here are integers $x$ with the probability mass function $$P(x)=\frac{1}{\sqrt{2\pi\sigma^2}}\mathrm{e}^{-\frac{x^2}{2\sigma^2}}.$$
This distribution does not sum up exactly to 1. However, for $\sigma>100$ the difference from 1 is less than $10^{-1000}$ and hence it's more precise than any practical calculation would be.
Asked By : pentadecagon
Answered By : Pseudonym
In what follows, I'm going to assume that $x \ge 0$, and then put the sign back in later.
First, we'll simplify the notation a little. The length of the code for $x$ should be:
$$L_m(x) = a + b \left(\frac{x}{\sigma}\right)^2 + \log_2 \sigma$$
where:
$$a = \frac{\log_2 2\pi}{2} \approx 1.325$$ $$b = \frac{1}{2 \ln 2} \approx 0.721$$
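As a quick sanity check, the constants and the code length can be evaluated numerically and compared against $-\log_2 P(x)$ computed directly. This is only a minimal sketch; the names `code_length` and `neg_log2_pmf` are illustrative and not part of the original answer.

```python
import math

A = 0.5 * math.log2(2 * math.pi)  # a = log2(2*pi) / 2, about 1.325
B = 1.0 / (2.0 * math.log(2))     # b = 1 / (2 ln 2), about 0.721


def code_length(x, sigma):
    """Ideal code length in bits: a + b*(x/sigma)^2 + log2(sigma)."""
    return A + B * (x / sigma) ** 2 + math.log2(sigma)


def neg_log2_pmf(x, sigma):
    """-log2 P(x), computed directly from the Gaussian formula."""
    norm = math.sqrt(2.0 * math.pi * sigma * sigma)
    p = math.exp(-x * x / (2.0 * sigma * sigma)) / norm
    return -math.log2(p)


if __name__ == "__main__":
    # Both columns should agree up to floating-point rounding.
    for x in (0, 1000, 3000):
        print(x, code_length(x, 1000.0), neg_log2_pmf(x, 1000.0))
```

The two functions are algebraically the same quantity, so any difference in the printed columns is rounding error only.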
This suggests that you should divide $x$ by $\sigma$, encode the remainder in $\log_2 \sigma$ bits, then encode the quotient $q$ in $a + bq^2$ bits (say, using arithmetic coding). Instead of storing $20002$ codes using Yuval's method, you now only need $20$ or so.
In a sense, what you're doing here is converting the normal distribution to a standard normal distribution. Moreover, if you move $\sigma$ to the nearest power of 2, the division can be done using bit shifts and masks. $\sigma=1024$ is close enough to $1000$, so that part of the operation is extremely cheap.
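The shift-and-mask division might look like the following sketch, assuming $\sigma = 2^{10} = 1024$. The names `split` and `join` are hypothetical helpers, and the sign is pulled out first so that the shift only ever sees a non-negative value.

```python
SHIFT = 10                # assumption: sigma = 2**10 = 1024
MASK = (1 << SHIFT) - 1   # low 10 bits hold the remainder


def split(x):
    """Return (sign, quotient, remainder) of |x| divided by 2**SHIFT."""
    sign = 1 if x < 0 else 0
    m = abs(x)
    return sign, m >> SHIFT, m & MASK


def join(sign, q, r):
    """Inverse of split."""
    m = (q << SHIFT) | r
    return -m if sign else m


if __name__ == "__main__":
    print(split(-2500))          # 2500 = 2 * 1024 + 452
    print(join(*split(-2500)))
```

Only integer operations are involved, so the encoder and the decoder agree bit for bit on this step.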
However, there is a catch. The code length $a + bq^2$ grows quadratically in $q$: it is already about 60 bits at $q = 9$ and about 73 bits at $q = 10$. On a 64-bit machine, this is pushing the practical limit of an efficient arithmetic coder. So you should probably consider partitioning the space for more efficient encoding.
One possibility is to use binary arithmetic coding, with one binary symbol per $\sigma$. The probability that $x$ is within $\sigma$ of the mean is about $0.6826895$. The probability that $x$ is within $2\sigma$, given that it is not within $\sigma$, is about $0.85660651$, and so on. I calculate the code length for the $8\sigma$ symbol to be around 27.3 bits; trying to calculate the length for $9\sigma$ runs into numerical instability with the simple calculation that I did.
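The conditional probabilities can be computed from the Gaussian tail. The sketch below uses `math.erfc` for the tail, which may behave better far from the mean than subtracting CDF values that are close to 1. It treats $x$ as continuous, as the figures above do, and the helper names (`tail`, `p_one`, `p_zero`) are made up for illustration.

```python
import math


def tail(k):
    """P(|Z| > k) for a standard normal Z, computed with erfc."""
    return math.erfc(k / math.sqrt(2.0))


def p_one(k):
    """P(|Z| > k+1 given |Z| > k): probability of storing a 1 at step k."""
    return tail(k + 1) / tail(k)


def p_zero(k):
    """P(|Z| <= k+1 given |Z| > k): the p_0 of the k-th binary symbol."""
    return 1.0 - p_one(k)


if __name__ == "__main__":
    # Columns: step k, p_0, bits for a 0 symbol, bits for a 1 symbol.
    for k in range(8):
        print(k, p_zero(k), -math.log2(p_zero(k)), -math.log2(p_one(k)))
```

Since only a handful of symbols are needed, one option is to run such a calculation once and hard-code the resulting constants (possibly quantized to integers) in both the encoder and the decoder, so that neither side depends on floating-point results at run time.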
No matter what you try, representing numbers that are more than about $8\sigma$ from the mean is going to be difficult to do well, but given that those numbers are extremely rare, perhaps you should just fall back to a universal code (e.g. one of the Elias codes) for that case.
So in summary, here's the representation that I'd try.
- Divide $\left| x \right|$ by $\sigma$. Call the quotient $q$ and the remainder $r$.
- Store the sign of $x$ as a single bit.
- Arithmetic encode $q$ using a sequence of binary variables:
  - Use a single binary symbol with $p_0 = 0.6826894921370859$. Store $0$ if $q=0$, or $1$ if $q>0$.
  - If you stored $1$, use another binary symbol with $p_0 = 0.8566065013011934$. Store $0$ if $q=1$, or $1$ if $q>1$.
  - ...and so on. Stop when you get to $q \ge 8$.
- If $q<8$, binary encode $r$ in $\log_2 \sigma$ bits.
- If $q\ge 8$, use a universal binary code of choice (Elias gamma is a popular choice) to store $\left| x \right| - 8\sigma$.
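The steps above can be sketched as a function that turns $x$ into the fields an encoder would emit. This is an illustration under several assumptions: $\sigma = 1024$, the binary symbols for $q$ are returned as a plain list rather than fed to an actual arithmetic coder, and the escape value is shifted by one because Elias gamma only codes integers $\ge 1$. All names are hypothetical.

```python
SHIFT = 10                # assumption: sigma = 2**10 = 1024
MASK = (1 << SHIFT) - 1
Q_MAX = 8                 # escape threshold from the steps above


def elias_gamma_encode(n):
    """Elias gamma code of n >= 1 as a string of '0'/'1' characters."""
    if n < 1:
        raise ValueError("Elias gamma needs n >= 1")
    b = bin(n)[2:]
    return "0" * (len(b) - 1) + b


def elias_gamma_decode(bits):
    """Decode one Elias gamma codeword from the start of a bit string."""
    zeros = 0
    while bits[zeros] == "0":
        zeros += 1
    return int(bits[zeros:2 * zeros + 1], 2)


def to_fields(x):
    """Return (sign, flags, r, escape); r or escape is None when unused."""
    sign = 1 if x < 0 else 0
    m = abs(x)
    q = m >> SHIFT
    if q < Q_MAX:
        # q ones followed by a terminating zero, then the raw remainder.
        return sign, [1] * q + [0], m & MASK, None
    # Eight ones signal q >= 8; store |x| - 8*sigma (+1 for Elias gamma).
    return sign, [1] * Q_MAX, None, elias_gamma_encode(m - (Q_MAX << SHIFT) + 1)


def from_fields(sign, flags, r, escape):
    """Inverse of to_fields."""
    if escape is None:
        m = ((len(flags) - 1) << SHIFT) | r
    else:
        m = (Q_MAX << SHIFT) + elias_gamma_decode(escape) - 1
    return -m if sign else m


if __name__ == "__main__":
    for x in (0, -2500, 8191, 8192, -9000):
        print(x, to_fields(x))
```

In a real coder, each entry of `flags` would be coded with the $p_0$ of its step, while the sign bit, the remainder, and the escape code would be written as raw bits.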
Now having said all this, a common approach used in many compression standards, which works fairly well when $\sigma$ is small, is to use exp-Golomb coding. It's not theoretically perfect, but seems to work well in practice.
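For comparison, here is a minimal sketch of order-$k$ exp-Golomb coding on bit strings. Two choices here are assumptions rather than part of the answer: signed values are mapped to non-negative ones with a zigzag mapping, and $k = 10$ in the demo is just a guess matched to $\sigma \approx 1024$.

```python
def to_unsigned(x):
    """Zigzag map: 0, -1, 1, -2, 2, ... -> 0, 1, 2, 3, 4, ..."""
    return 2 * x if x >= 0 else -2 * x - 1


def from_unsigned(u):
    """Inverse of to_unsigned."""
    return u >> 1 if u % 2 == 0 else -((u + 1) >> 1)


def exp_golomb_encode(n, k=0):
    """Order-k exp-Golomb code of n >= 0 as a string of '0'/'1'."""
    if n < 0:
        raise ValueError("n must be non-negative")
    b = bin(n + (1 << k))[2:]
    # Leading zeros tell the decoder how many bits follow.
    return "0" * (len(b) - k - 1) + b


def exp_golomb_decode(bits, k=0):
    """Decode one order-k exp-Golomb codeword from the start of bits."""
    zeros = 0
    while bits[zeros] == "0":
        zeros += 1
    width = zeros + k + 1
    return int(bits[zeros:zeros + width], 2) - (1 << k)


if __name__ == "__main__":
    for x in (0, -1, 700, -2500):
        code = exp_golomb_encode(to_unsigned(x), k=10)
        print(x, code, from_unsigned(exp_golomb_decode(code, k=10)))
```

The code length here grows with $\log_2 \left| x \right|$ rather than with $x^2$, which is one way to see why it is not a perfect fit for a Gaussian.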
Question Source : http://cs.stackexchange.com/questions/20156