I have a sequence of $n$ integers in a small range $[0, k)$ and all the integers have the same frequency $f$ (so the size of the sequence is $n = f * k$). What I'm trying to do now is to compress this sequence while providing random access (what is the $i$-th integer). I'm more interested in achieving high compression at the expense of higher random access times.
I haven't tried with Huffman coding since it assigns codes based on frequencies (and all my frequencies are the same). Perhaps I'm missing some simple encoding for this particular case.
Any help or pointers would be appreciated.
Thanks in advance.
Asked By : jplot
Answered By : Yuval Filmus
You can use the so-called "binomial encoding", described in one of my answers on math.stackexchange. You have to consider, though, whether it will be worthwhile. The naive encoding takes $n \log_2 k$ bits, and the best encoding saves about $O(\log n)$ bits (the best encoding uses $\log_2 \frac{n!}{f!^k}$ bits). Plug in your actual numbers to see what kind of saving you can expect.
Edit: I can't find my supposed post on math.stackexchange, so here is a summary of binomial coding (or rather multinomial coding), as it applies in this case. For a vector $x$ of length $k$, define $$ w(x) = \frac{(\sum_i x_i)!}{\prod_i x_i!}. $$ Note that for $x \neq 0$, $$ w(x) = \sum_{i\colon x_i > 0} w(x-e_i), $$ where $e_i$ is the $i$th basis vector. For example, if $k = 2$ and $x_1,x_2>0$ then $$ w(x_1,x_2) = w(x_1-1,x_2) + w(x_1,x_2-1). $$ This is just Pascal's identity $$ \binom{x_1+x_2}{x_1} = \binom{x_1+x_2}{x_1-1} + \binom{x_1+x_2}{x_1}. $$
For every vector $x$ and every $i$ such that $x_i > 0$, define $$ w(x,i) = \sum_{j < i\colon x_j > 0} w(x-e_j). $$ Given a sequence $\sigma$ of letters in $\{1,\ldots,k\}$, let $H(\sigma)$ be its histogram, a vector of length $k$. For a non-empty word $\sigma \neq \epsilon$, let $\sigma_1$ be the first symbol, and $\sigma_{>1}$ be the rest, so that $\sigma = \sigma_1 \sigma_{>1}$. We are now ready to define the encoding: $$ C(\sigma) = \begin{cases} w(H(\sigma),\sigma_1) + C(\sigma_{>1}) & \sigma \neq \epsilon \\ 0 & \sigma = \epsilon. \end{cases} $$ $C(\sigma)$ is an integer between $0$ and $\frac{n!}{f!^k}-1$.
Here is an example, with $k=f=2$. Let us encode $\sigma = 0101$: $$ \begin{align*} C(0101) &= w((2,2),0) + w((1,2),1) + w((1,1),0) + w((0,1),1) \\ &= 0 + w(0,2) + 0 + 0 = 1. \end{align*} $$ You can check that the inverse images of $0,1,2,3,4,5$ are $$ 0011,0101,0110,1001,1010,1100. $$ These are just all the solution in increasing lexicography order.
How do we decode? By reversing the encoding procedure. We know the initial histogram $H_1$. Given an input $c_1$, we find the unique symbol $\sigma_1$ such that $w(H_1,\sigma_1) \leq c_1 < w(H_1,\sigma_1+1)$. We then put $H_2 = H_1 - e_{\sigma_1}$, $c_2 = c_1 - w(H_1,\sigma_1)$, and continue. To test your understanding, try to decode $0101$ back from its encoding $1$.
Best Answer from StackOverflow
Question Source : http://cs.stackexchange.com/questions/4639
0 comments:
Post a Comment
Let us know your responses and feedback