
[Solved]: Compression of sequence with Direct Access

Problem Detail: 

I have a sequence of $n$ integers from a small range $[0, k)$, in which every integer appears with the same frequency $f$ (so the length of the sequence is $n = f \cdot k$). I want to compress this sequence while still supporting random access (i.e., answering "what is the $i$-th integer?"). I'm more interested in achieving high compression, even at the expense of higher random access times.

I haven't tried Huffman coding, since it assigns codes based on frequencies, and all my frequencies are equal. Perhaps I'm missing some simple encoding for this particular case.

Any help or pointers would be appreciated.

Thanks in advance.

Asked By : jplot

Answered By : Yuval Filmus

You can use the so-called "binomial encoding", described in one of my answers on math.stackexchange. You have to consider, though, whether it will be worthwhile. The naive encoding takes $n \log_2 k$ bits, while the best encoding uses $\log_2 \frac{n!}{f!^k}$ bits, a saving of only about $O(\log n)$ bits for constant $k$. Plug in your actual numbers to see what kind of saving you can expect.
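To get a feel for the numbers, here is a minimal Python sketch comparing the two bit counts. The parameters `k` and `f` below are hypothetical placeholders, not taken from the question:

```python
from math import factorial, log2

def naive_bits(n, k):
    # fixed-rate encoding: log2(k) bits per symbol
    return n * log2(k)

def optimal_bits(n, k, f):
    # information content of a sequence with a flat histogram:
    # log2 of the number of such sequences, n! / (f!)^k
    return log2(factorial(n) // factorial(f) ** k)

k, f = 16, 1000  # hypothetical parameters
n = k * f
print(naive_bits(n, k), optimal_bits(n, k, f))
```

For these parameters the saving turns out to be a negligible fraction of the naive 64000 bits, which illustrates the point above.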

Edit: I can't find my supposed post on math.stackexchange, so here is a summary of binomial coding (or rather multinomial coding), as it applies in this case. For a vector $x$ of length $k$, define $$ w(x) = \frac{(\sum_i x_i)!}{\prod_i x_i!}. $$ Note that for $x \neq 0$, $$ w(x) = \sum_{i\colon x_i > 0} w(x-e_i), $$ where $e_i$ is the $i$th basis vector. For example, if $k = 2$ and $x_1,x_2>0$ then $$ w(x_1,x_2) = w(x_1-1,x_2) + w(x_1,x_2-1). $$ This is just Pascal's identity $$ \binom{x_1+x_2}{x_1} = \binom{x_1+x_2-1}{x_1-1} + \binom{x_1+x_2-1}{x_1}. $$
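In code, $w(x)$ is just a multinomial coefficient. A minimal Python sketch (the function name `w` mirrors the notation above):

```python
from math import factorial

def w(x):
    """Multinomial coefficient (sum_i x_i)! / prod_i x_i!."""
    den = 1
    for xi in x:
        den *= factorial(xi)
    return factorial(sum(x)) // den

# the recurrence w(x) = sum over i with x_i > 0 of w(x - e_i),
# checked on x = (2, 2):
assert w((2, 2)) == w((1, 2)) + w((2, 1))  # 6 == 3 + 3
```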

For every vector $x$ and every $i$ such that $x_i > 0$, define $$ w(x,i) = \sum_{j < i\colon x_j > 0} w(x-e_j). $$ Given a sequence $\sigma$ of letters in $\{1,\ldots,k\}$, let $H(\sigma)$ be its histogram, a vector of length $k$. For a non-empty word $\sigma \neq \epsilon$, let $\sigma_1$ be the first symbol, and $\sigma_{>1}$ be the rest, so that $\sigma = \sigma_1 \sigma_{>1}$. We are now ready to define the encoding: $$ C(\sigma) = \begin{cases} w(H(\sigma),\sigma_1) + C(\sigma_{>1}) & \sigma \neq \epsilon \\ 0 & \sigma = \epsilon. \end{cases} $$ $C(\sigma)$ is an integer between $0$ and $\frac{n!}{f!^k}-1$.
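The definitions above translate directly into an encoder. A Python sketch, using symbols $0,\ldots,k-1$ instead of $1,\ldots,k$ (to match the example below); `w` is the multinomial weight from the preceding formulas:

```python
from math import factorial

def w(x):
    """Multinomial coefficient (sum_i x_i)! / prod_i x_i!."""
    den = 1
    for xi in x:
        den *= factorial(xi)
    return factorial(sum(x)) // den

def encode(sigma, k):
    """Rank of sigma among all sequences sharing its histogram,
    in lexicographic order: C(sigma) from the definition above."""
    hist = [0] * k
    for s in sigma:
        hist[s] += 1
    code = 0
    for s in sigma:
        # w(hist, s): total weight of the remaining histograms
        # reachable by starting with a smaller available symbol
        for j in range(s):
            if hist[j] > 0:
                hist[j] -= 1
                code += w(hist)
                hist[j] += 1
        hist[s] -= 1  # consume the symbol
    return code

print(encode([0, 1, 0, 1], 2))  # 1, matching the worked example
```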

Here is an example, with $k=f=2$. Let us encode $\sigma = 0101$: $$ \begin{align*} C(0101) &= w((2,2),0) + w((1,2),1) + w((1,1),0) + w((0,1),1) \\ &= 0 + w((0,2)) + 0 + 0 = 1. \end{align*} $$ You can check that the inverse images of $0,1,2,3,4,5$ are $$ 0011,0101,0110,1001,1010,1100. $$ These are just all the sequences with histogram $(2,2)$, in increasing lexicographic order.

How do we decode? By reversing the encoding procedure. We know the initial histogram $H_1$. Given an input $c_1$, we find the unique symbol $\sigma_1$ with positive count in $H_1$ such that $w(H_1,\sigma_1) \leq c_1 < w(H_1,\sigma_1+1)$. We then put $H_2 = H_1 - e_{\sigma_1}$ and $c_2 = c_1 - w(H_1,\sigma_1)$, and continue. To test your understanding, try to decode $0101$ back from its encoding $1$.
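A matching Python sketch of the decoder; as before, symbols run over $0,\ldots,k-1$. Note that random access to the $i$-th symbol in this scheme still means decoding the first $i+1$ symbols, which is exactly the high-compression/slow-access trade-off asked about:

```python
from math import factorial

def w(x):
    """Multinomial coefficient (sum_i x_i)! / prod_i x_i!."""
    den = 1
    for xi in x:
        den *= factorial(xi)
    return factorial(sum(x)) // den

def decode(code, hist):
    """Recover the sequence from its rank and its initial histogram."""
    hist = list(hist)  # don't mutate the caller's histogram
    out = []
    for _ in range(sum(hist)):
        for s in range(len(hist)):
            if hist[s] == 0:
                continue
            hist[s] -= 1
            weight = w(hist)  # sequences that continue after symbol s
            if code < weight:
                out.append(s)  # s is the next symbol
                break
            code -= weight     # skip past all sequences starting with s
            hist[s] += 1
    return out

print(decode(1, [2, 2]))  # [0, 1, 0, 1], inverting the example
```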


Question Source : http://cs.stackexchange.com/questions/4639
