I have real data I am using for a simulated card game. I am only interested in the ranks of the cards, not the suits. However it is a standard $52$ card deck, so there are only $4$ of each rank in the deck. The deck is shuffled well for each hand, and then I output the entire deck to a file. So there are only $13$ possible symbols in the output file: $2,3,4,5,6,7,8,9,T,J,Q,K,A$ ($T$ = ten rank). Of course we can bitpack these using $4$ bits per symbol, but then we are wasting $3$ of the $16$ possible encodings. We can do better if we group $4$ symbols at a time and then compress them, because $13^4 = 28,561$, which fits rather "snugly" into $15$ bits instead of $16$. The theoretical bitpacking limit is $\log(13)/\log(2) = 3.70044$ bits per symbol for data in which the $13$ symbols are independent and equally likely. However we cannot have $52$ kings, for example, in this deck. We MUST have exactly $4$ of each rank in each deck, so the entropy drops by about half a bit per symbol, to roughly $3.2$ bits.
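To make the $15$-bit grouping concrete, here is a minimal Python sketch (illustrative only, not the encoder used for the results below) that packs $4$ rank symbols into a single base-$13$ value below $2^{15}$:

```python
# Illustrative sketch (not the poster's actual encoder): pack groups of 4 rank
# symbols into 15 bits by treating them as a base-13 number (13^4 = 28,561 <= 2^15).
RANKS = "23456789TJQKA"
RANK_INDEX = {r: i for i, r in enumerate(RANKS)}

def pack4(symbols):
    """Pack exactly 4 rank characters into one 15-bit integer."""
    assert len(symbols) == 4
    value = 0
    for s in symbols:
        value = value * 13 + RANK_INDEX[s]   # base-13 accumulation
    return value                             # 0 .. 28560, fits in 15 bits

def unpack4(value):
    """Recover the 4 rank characters from a 15-bit integer."""
    out = []
    for _ in range(4):
        value, digit = divmod(value, 13)
        out.append(RANKS[digit])
    return "".join(reversed(out))

print(pack4("AK72"), unpack4(pack4("AK72")))   # round-trip check
```

That comes to $15/4 = 3.75$ bits per symbol, already a little better than the plain $4$-bit pack.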
Ok, so here is what I am thinking. This data is not totally random: we know there are $4$ of each rank in each block of $52$ cards (call it a shuffled deck), so we can make several assumptions and optimizations. One is that we do not have to encode the very last card, because we will know what it must be. Another saving arises when the deck ends on a single rank; for example, if the last $3$ cards in the deck are $777$, we wouldn't have to encode them, because the decoder, which has been counting cards up to that point, would see that all the other ranks have been filled and would infer that the $3$ "missing" cards are all $7$s.
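As an illustration of that card-counting shortcut, here is a small Python sketch (a hypothetical helper, not from the question or any answer) of how a decoder could recognize that the tail of a deck is forced and fill it in without reading any more bits:

```python
# Hypothetical decoder helper: count ranks as they are decoded, and once every
# rank but one has all 4 copies accounted for, the remaining cards are forced.
from collections import Counter

RANKS = "23456789TJQKA"

def forced_tail(decoded_prefix):
    """Return the string of forced remaining cards if exactly one rank is
    still incomplete, otherwise None."""
    counts = Counter(decoded_prefix)
    incomplete = [r for r in RANKS if counts[r] < 4]
    if len(incomplete) == 1:
        r = incomplete[0]
        return r * (4 - counts[r])   # e.g. "777" if three 7s are still missing
    return None

# Example: 49 cards seen, only the 7s are short, so the last 3 cards are forced.
prefix = [r for r in RANKS if r != "7"] * 4 + ["7"]
print(forced_tail(prefix))   # -> 777
```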
So my question to this site is, what other optimizations are possible to get an even smaller output file on this type of data, and if we use them, can we ever beat the theoretical (simple) bitpacking entropy of $3.70044$ bits per symbol, or even approach the ultimate entropy limit of about $3.2$ bits per symbol on average? If so, how?
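For reference, the $3.2$ figure comes from counting the distinct rank sequences a legal deck can have: there are $52!/(4!)^{13}$ of them, so the entropy is $\log(52!/(4!)^{13})/\log(2) \approx 165.98$ bits per deck, or about $3.19$ bits per card. A quick check in Python:

```python
# Entropy of a rank-only shuffled deck: log2 of the multinomial 52!/(4!)^13.
from math import factorial, log2

sequences = factorial(52) // factorial(4) ** 13   # number of distinct rank orderings
bits_per_deck = log2(sequences)
print(bits_per_deck, bits_per_deck / 52)          # ~165.98 bits/deck, ~3.19 bits/card
```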
When I use a ZIP type program (WinZip for example), I only see about $2:1$ compression, which tells me it is just doing a "lazy" bitpack to $4$ bits. If I "pre-compress" the data with my own bitpacking first, the zip program seems to like that better: running the bitpacked file through it gets me a little over $2:1$ overall. So I am thinking, why not do all the compression myself, since I have more knowledge of the data than the zip program does? I am wondering if I can beat the entropy "limit" of $\log(13)/\log(2) = 3.70044$ bits per symbol. I suspect I can with the few "tricks" I mentioned and a few more I can probably discover. The output file of course does not have to be "human readable". As long as the encoding is lossless it is valid.
Here is a link to $3$ million human readable shuffled decks ($1$ per line). Anyone can "practice" on a small subset of these lines and then let it rip on the entire file. I will keep updating my best (smallest) filesize based on this data.
https://drive.google.com/file/d/0BweDAVsuCEM1amhsNmFITnEwd2s/view
By the way, in case you are interested in what type of card game this data is used for, here is the link to my active question (with $300$ point bounty). I am being told it is a hard problem to solve (exactly) since it would require a huge amount of data storage space. Several simulations agree with the approximate probabilities though. No purely mathematical solutions have been provided (yet). It's too hard, I guess.
I have a good algorithm that is showing $168$ bits to encode the first deck in my sample data. This data was generated randomly using the Fisher-Yates shuffle algorithm. It is real random data, so my newly created algorithm seems to be working VERY well, which makes me happy.
Regarding the compression "challenge", I am currently at about $160$ bits per deck. I think I can go down to maybe $158$. Yes, I tried it and got $158.43$ bits per deck. I think I am getting close to the limit of my algorithm: I succeeded in dropping below $166$ bits per deck, but I failed to reach $156$ bits, which would be $3$ bits per card. Still, it was a fun exercise. Perhaps in the future I will think of something to reduce each deck by another $2.43$ bits or more on average.
Asked By : David James
Answered By : Dan Bryant
Another thing to consider: if you only care about compressing a full set of several million decks and you also don't care about what order they are in, you can gain additional encoding flexibility by discarding the information about the ordering of the set of decks. This would be the case, for instance, if you need to load the set to enumerate all the decks and process them, but don't care what order they are processed in.
You start by encoding each deck individually, as other answers have described. Then, sort those encoded values and store the series of differences between consecutive sorted values (where the first difference is taken from an encoded deck value of $0$). Given a large number of decks, the differences will tend to be much smaller than the full encoding range, so you can use some form of varint encoding to handle occasional large differences while still storing the smaller differences efficiently. The appropriate varint scheme depends on how many decks are in the set, since that determines the average difference size.
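A minimal sketch of that idea, assuming each deck has already been encoded to an integer and using a LEB128-style varint (just one possible scheme; the answer leaves the exact choice open):

```python
# Sketch of the sort-then-delta idea; the LEB128-style varint here is an
# assumption, not something specified in the answer.
def encode_varint(n):
    """Unsigned LEB128: 7 payload bits per byte, high bit = continuation."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def encode_deck_set(deck_codes):
    """deck_codes: iterable of integer encodings of individual decks."""
    data = bytearray()
    previous = 0
    for code in sorted(deck_codes):              # ordering of decks is discarded here
        data += encode_varint(code - previous)   # store the gap, not the full code
        previous = code
    return bytes(data)

# Example with small made-up "deck codes":
print(encode_deck_set([900, 15, 3000, 120]).hex())
```

Decoding reverses the process: read varints, accumulate the running sum, and decode each recovered integer back into a deck.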
I unfortunately don't know the mathematics of how much this would help your compression, but thought this idea might be useful to consider.
Best Answer from Stack Exchange
Question Source : http://cs.stackexchange.com/questions/62697