
Algorithm for compressing binary data that can efficiently be partially decompressed

Problem Detail: 

So I have been given the following problem and since I have no experience at all in this field, I would really appreciate some advice/guidance.

These are the requirements:

So three times per second, a certain number of 32-bit values (fixed per session) comes in. These values have to be stored in a file with a timestamp per data entry, and at any time it must be possible, given a timestamp and an interval, to read out the values stored in that file.

This is all relatively simple to do, but it produces a 5 GB file every 24 hours. If this file is zipped, it shrinks to about 500 MB, so the data is compressible, but then you can't instantly retrieve data points because you would have to unzip the entire file.

So I was thinking that this could be solved like this:

  • Buffer the first 10MB of sensor data.
  • Then sort all the values by how often the same value occurs. (It is sensor data, so quite a few of the channels will mostly contain stable values.)
  • Run an algorithm that detects recurring patterns, replaces those patterns with shorter values and defines a table with the recurring patterns and their replacements.
  • Then compress the buffer, save it to disk, and compress every new series of sensor data with the pre-defined table.

If this were implemented properly, it should be possible to decompress an individual data point, and since the file is sorted by timestamp, finding the right one should be trivial with a binary search.
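For what it's worth, one off-the-shelf way to get the "compress every new series against a pre-defined table" behaviour is zlib's preset-dictionary feature. The sketch below is not from the question; build_dictionary, compress_entry and decompress_entry are made-up names, and the dictionary would come from the buffered sensor data described above.

```python
import zlib

# Rough sketch of the "pre-defined table" idea using zlib's preset-dictionary
# support (Python 3.3+). The dictionary is just a sample of earlier sensor
# data; deflate only looks at its last 32 KiB, so a huge buffer does not help.

def build_dictionary(buffered_data: bytes) -> bytes:
    return buffered_data[-32 * 1024:]          # keep the most recent 32 KiB

def compress_entry(entry: bytes, dictionary: bytes) -> bytes:
    comp = zlib.compressobj(level=6, zdict=dictionary)
    return comp.compress(entry) + comp.flush()

def decompress_entry(blob: bytes, dictionary: bytes) -> bytes:
    decomp = zlib.decompressobj(zdict=dictionary)
    return decomp.decompress(blob) + decomp.flush()
```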

So I guess the questions I have are the following:

  1. Is this at all possible? I do not expect compression results near Zip; 2x-5x would be enough.
  2. What is a good (and, if possible, semi-easy to implement) algorithm for finding patterns in binary data? I have done some research, but I mostly ran into algorithms that only apply to text data.

Any form of help is greatly appreciated!

Asked By : larzz11

Answered By : larzz11

Sorry to those of you who posted such detailed answers; I ended up solving this in quite a simple way.

I ended up just compressing the data points individually, and because the entries are quite large (60-600 KiB), compression rates of up to 12x are possible using zlib.
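A minimal sketch of that per-entry approach (pack_entry and unpack_entry are made-up names, not the poster's code): each entry is run through zlib on its own, so any single data point can be decompressed later without touching its neighbours.

```python
import zlib

# Sketch only: each data point is compressed independently, so a single entry
# can be read back without decompressing anything else. Level 1 favours speed
# (roughly the "MAX_SPEED" setting discussed below).
def pack_entry(raw_entry: bytes, level: int = 1) -> bytes:
    return zlib.compress(raw_entry, level)

def unpack_entry(blob: bytes) -> bytes:
    return zlib.decompress(blob)
```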

When a new file is created, I reserve the first part of the file to serve as a "table of contents". The file starts with a file header that contains the first and last timestamps found in the file and the current number of data points written to it. Following that is a long list of equally sized "data pointers", each containing a timestamp and an offset into the file. Because of all the reserved space for the headers, this initially gives a 4 MiB file even when there is no data in it, but that quickly becomes negligible compared to the space saved by the compression. After the table of contents, each data point is stored in compressed form, one after another.
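The answer doesn't spell out the field sizes, so the sketch below is only an assumed layout: 64-bit timestamps, 64-bit file offsets, and a pointer table sized for 3 entries per second over 24 hours (which comes out to roughly the 4 MiB mentioned above). HEADER_FMT, append_entry and the rest are invented names.

```python
import struct

# Assumed on-disk layout (field widths are guesses, not from the original post):
#   header : first_ts (u64) | last_ts (u64) | n_entries (u32)
#   TOC    : MAX_ENTRIES fixed-size data pointers, each ts (u64) | file offset (u64)
#   data   : zlib-compressed entries written back to back
HEADER_FMT   = "<QQI"
POINTER_FMT  = "<QQ"
HEADER_SIZE  = struct.calcsize(HEADER_FMT)
POINTER_SIZE = struct.calcsize(POINTER_FMT)
MAX_ENTRIES  = 259_200                                   # e.g. 3 entries/s for 24 h
DATA_START   = HEADER_SIZE + MAX_ENTRIES * POINTER_SIZE  # the ~4 MiB reserved region

def append_entry(f, header: dict, timestamp: int, blob: bytes) -> None:
    """Append one compressed entry and record it in the reserved table of contents."""
    f.seek(0, 2)
    offset = max(f.tell(), DATA_START)   # first entry starts right after the TOC
    f.seek(offset)
    f.write(blob)

    # Write the data pointer for this entry into the table of contents.
    n = header["n_entries"]
    f.seek(HEADER_SIZE + n * POINTER_SIZE)
    f.write(struct.pack(POINTER_FMT, timestamp, offset))

    # Keep the file header up to date.
    header["n_entries"] = n + 1
    header["first_ts"] = header["first_ts"] or timestamp
    header["last_ts"] = timestamp
    f.seek(0)
    f.write(struct.pack(HEADER_FMT, header["first_ts"],
                        header["last_ts"], header["n_entries"]))
```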

This makes it possible to do a binary search for the right data, and even in a file of several GB the right entry can be found in a matter of milliseconds. I initially set zlib to MAX_COMPRESSION, but this took far more time than MAX_SPEED and only gave a minor improvement in compression rate. And at maximum load (600 KiB entries at 20 Hz) it was barely keeping up when using MAX_COMPRESSION.
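A matching lookup sketch, reusing the assumed layout from the previous snippet: the data pointers are stored in timestamp order, so a binary search over them (here via Python's bisect) finds the right pointer, and only that one entry is decompressed.

```python
import struct
import zlib
from bisect import bisect_left

def read_entry(f, target_ts: int) -> bytes:
    """Find the first entry at or after target_ts and decompress only that one."""
    f.seek(0)
    first_ts, last_ts, n_entries = struct.unpack(HEADER_FMT, f.read(HEADER_SIZE))

    # Read only the populated part of the table of contents (a few MiB at most).
    toc = f.read(n_entries * POINTER_SIZE)
    pointers = [struct.unpack_from(POINTER_FMT, toc, i * POINTER_SIZE)
                for i in range(n_entries)]

    # Pointers are written in timestamp order, so a binary search suffices.
    timestamps = [ts for ts, _ in pointers]
    i = bisect_left(timestamps, target_ts)
    if i == n_entries:
        raise KeyError("timestamp is past the last entry in this file")

    # The next pointer's offset (or EOF) bounds the compressed blob.
    start = pointers[i][1]
    end = pointers[i + 1][1] if i + 1 < n_entries else None
    f.seek(start)
    blob = f.read(end - start) if end is not None else f.read()
    return zlib.decompress(blob)
```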

Best Answer from Stack Exchange

Question Source : http://cs.stackexchange.com/questions/54722
