
[Solved]: Looking for a set implementation with small memory footprint

Problem Detail: 

I am looking for an implementation of the set data type. That is, we need to

  • maintain a dynamic subset $S$ (of size $n$) from the universe $U = \{0, 1, 2, 3, \dots, u - 1\}$ of size $u$ with
  • operations insert(x) (add an element x to $S$) and find(x) (checks whether element x is a member of $S$).

I don't care about other operations. For orientation: in the applications I'm working with, we have $u \approx 10^{10}$.

I know of implementations that provide both operations in $O(1)$ time, so I worry mostly about the size of the data structure. I expect billions of entries but want to avoid swapping as much as possible.

I am willing to sacrifice runtime if necessary. Amortised runtime of $O(\log n)$ is what I can admit; expected runtimes or runtimes in $\omega(\log n)$ are not admissible.

One idea I have is that if $S$ can be represented as a union of ranges $[x_{\min}, x_{\max}]$, then we could save on storage at the price of some performance decrease. Other data patterns are also possible, such as [0, 2, 4, 6].
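For illustration, the range idea could look something like the following sketch (my own toy code, not from any library): the set is kept as a sorted list of disjoint inclusive ranges, so space is proportional to the number of runs rather than the number of elements.

```python
import bisect

class RangeSet:
    """Toy set stored as sorted, disjoint, merged inclusive ranges
    [starts[i], ends[i]]. Space grows with the number of runs."""

    def __init__(self):
        self.starts = []  # sorted range start points
        self.ends = []    # ends[i] pairs with starts[i]

    def find(self, x):
        # Find the last range starting at or before x; check it covers x.
        i = bisect.bisect_right(self.starts, x) - 1
        return i >= 0 and x <= self.ends[i]

    def insert(self, x):
        if self.find(x):
            return
        i = bisect.bisect_right(self.starts, x)
        # Does x extend the range on the left and/or the range on the right?
        left_touches = i > 0 and self.ends[i - 1] == x - 1
        right_touches = i < len(self.starts) and self.starts[i] == x + 1
        if left_touches and right_touches:
            # x bridges two ranges: merge them into one.
            self.ends[i - 1] = self.ends[i]
            del self.starts[i]
            del self.ends[i]
        elif left_touches:
            self.ends[i - 1] = x
        elif right_touches:
            self.starts[i] = x
        else:
            self.starts.insert(i, x)
            self.ends.insert(i, x)
```

Note that this only helps when the data really does cluster into runs; for an adversarial or random $S$, it degenerates to one range per element.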

Could you please point me to data structures which can do something like that?

Asked By : HEKTO

Answered By : Pseudonym

Joe's answer is extremely good, and gives you all the important keywords.

You should be aware that succinct data structure research is still in an early stage, and many of the results are largely theoretical. Many of the proposed data structures are quite complex to implement, but most of the complexity is due to the fact that you need to maintain asymptotic complexity both over the universe size and the number of elements stored. If either one of these is relatively constant, then a lot of the complexity goes away.

If the collection is semi-static (that is, inserts are rare, or at least low-volume), then it's certainly worth considering an easy-to-implement static data structure (Sadakane's sdarray is a fine choice) in conjunction with an update cache. Basically, you record updates in a traditional data structure (e.g. B-tree, trie, hash table), and periodically bulk-update the "main" data structure. This is a very popular technique in information retrieval, since inverted indexes have many advantages for searching but are hard to update in-place. If this is the case, please let me know in a comment and I'll amend this answer to give you some pointers.
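As a rough sketch of that update-cache pattern (all names and the flush threshold here are made up; a real system would use something like sdarray for the main structure, not a Python list):

```python
import bisect

class CachedSet:
    """Hypothetical two-tier set: a static sorted array (stand-in for a
    succinct structure such as sdarray) plus a small hash-set cache of
    recent inserts that is periodically bulk-merged into the main tier."""

    def __init__(self, flush_threshold=1024):
        self.main = []          # sorted; immutable between rebuilds
        self.cache = set()      # recent, not-yet-merged inserts
        self.flush_threshold = flush_threshold

    def find(self, x):
        if x in self.cache:
            return True
        i = bisect.bisect_left(self.main, x)
        return i < len(self.main) and self.main[i] == x

    def insert(self, x):
        if not self.find(x):
            self.cache.add(x)
            if len(self.cache) >= self.flush_threshold:
                self._rebuild()

    def _rebuild(self):
        # Bulk update: rebuild the "static" tier from scratch, which is
        # cheap per element even when in-place updates would be expensive.
        self.main = sorted(set(self.main) | self.cache)
        self.cache.clear()
```

The point is that the expensive-to-update structure only ever sees batched rebuilds, while the cache stays small enough to live comfortably in RAM.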

If inserts are more frequent, then I suggest succinct hashing. The basic idea is straightforward enough to explain here, so I will do so.

So the basic information-theoretic result is that if you're storing $n$ elements from a universe of $u$ items, and there is no other information (e.g. no correlation between the elements), then you need $\log {u \choose n} + O(1)$ bits to store it. (All logarithms are base-2 unless otherwise specified.) You need this many bits. There is no way around it.

Now some terminology:

  • If you have a data structure which can store the data and support your operations in $\log {u \choose n} + O(1)$ bits of space, we call this an implicit data structure.
  • If you have a data structure which can store the data and support your operations in $\log {u \choose n} + O(\log {u \choose n}) = (1 + O(1)) \log {u \choose n}$ bits of space, we call this a compact data structure. Note that in practice this means that the relative overhead (relative to the theoretical minimum) is within a constant. It could be 5% overhead, or 10% overhead, or 10 times overhead.
  • If you have a data structure which can store the data and support your operations in $\log {u \choose n} + o(\log {u \choose n}) = (1 + o(1)) \log {u \choose n}$ bits of space, we call this a succinct data structure.

The difference between succinct and compact is the difference between little-oh and big-oh. Ignoring the absolute-value thing for a moment...

  • $g(n) = O(f(n))$ means that there exists a constant $c$ and a number $n_0$ such that for all $n > n_0$, $g(n) < c \cdot f(n)$.
  • $g(n) = o(f(n))$ means that for all constants $c$ there exists a number $n_0$ such that for all $n > n_0$, $g(n) < c \cdot f(n)$.

Informally, big-oh and little-oh are both "within a constant factor", but with big-oh the constant is chosen for you (by the algorithm designer, the CPU manufacturer, the laws of physics or whatever), but with little-oh you pick the constant yourself and it can be as small as you like. To put it another way, with succinct data structures, the relative overhead gets arbitrarily small as the size of the problem increases.

Of course, the size of the problem may have to get huge to realise the relative overhead that you want, but you can't have everything.

OK, with that under our belts, let's put some numbers on the problem. Let's suppose that keys are $n$-bit integers (so the universe size is $2^n$; note that $n$ now denotes the key width, not the number of stored elements), and we want to store $2^m$ of these integers. Let's suppose that we can magically arrange an idealised hash table with full occupancy and no wastage, so that we need exactly $2^m$ hash slots.

A lookup operation would hash the $n$-bit key, mask off $m$ bits to find the hash slot, and then check whether the value in the table matches the key. So far, so good.

Such a hash table uses $n 2^m$ bits. Can we do better than this?

Suppose that the hash function $h$ is invertible. Then we don't have to store the whole key in each hash slot. The location of the hash slot gives you $m$ bits of the hash value, so if you only stored the $n-m$ remaining bits, you can reconstruct the key from those two pieces of information (the hash slot location and the value stored there). So you would only need $(n - m) 2^m$ bits of storage.
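A minimal sketch of this "quotienting" trick (my own toy code; the invertible hash here is just multiplication by an odd constant modulo $2^n$, which is a bijection, and collisions are ignored entirely):

```python
N_BITS, M_BITS = 16, 8            # toy sizes: universe 2^16, 2^8 slots
SLOT_MASK = (1 << M_BITS) - 1

def h(x):
    # Stand-in invertible hash: multiplication by an odd constant is a
    # bijection on n-bit values. A real design would mix much better.
    return (x * 40503) % (1 << N_BITS)

def h_inv(y):
    # Invert via the modular inverse of the odd constant mod 2^n.
    return (y * pow(40503, -1, 1 << N_BITS)) % (1 << N_BITS)

# Each slot stores only the n - m "quotient" bits, not the whole key.
table = [None] * (1 << M_BITS)

def insert(x):
    y = h(x)
    slot, quotient = y & SLOT_MASK, y >> M_BITS
    table[slot] = quotient        # this sketch ignores collisions

def find(x):
    y = h(x)
    slot, quotient = y & SLOT_MASK, y >> M_BITS
    return table[slot] == quotient

def recover(slot, quotient):
    # The slot index supplies the missing m bits, so the full key is
    # reconstructible from what's stored plus where it's stored.
    return h_inv((quotient << M_BITS) | slot)
```

Each slot holds $n - m = 8$ bits instead of the full 16-bit key, and `recover` demonstrates that no information was lost.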

If $2^m$ is small compared with $2^n$, Stirling's approximation and a little arithmetic (proof is an exercise!) reveals that:

$$(n - m) 2^m = \log {2^n \choose 2^m} + o\left(\log {2^n \choose 2^m}\right)$$

So this data structure is succinct.
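A quick numerical check (not a proof) of how close the idealised quotient table sits to the information-theoretic bound, computing $\log_2 {2^n \choose 2^m}$ exactly as a floating-point sum:

```python
import math

def log2_binom(u, k):
    # log2 of C(u, k), computed term by term: sum_i log2((u-k+i)/i).
    return sum(math.log2(u - k + i) - math.log2(i) for i in range(1, k + 1))

m = 16                                    # store 2^16 keys throughout
ratios = []
for n in (24, 32, 48):                    # growing universe size 2^n
    quotient_bits = (n - m) * 2 ** m      # idealised quotient table
    bound = log2_binom(2 ** n, 2 ** m)    # information-theoretic minimum
    ratios.append(bound / quotient_bits)
    print(n, round(ratios[-1], 3))
```

Notice that the ratio is slightly above 1, i.e. the idealised table appears to undercut the lower bound: that is the "magically arranged" full-occupancy assumption at work (a real table pays roughly an extra $\log_2 e$ bits per element to get near full occupancy). That excess is $o\left(\log {2^n \choose 2^m}\right)$, which is exactly why the ratio shrinks towards 1 as $n$ grows, and exactly the succinctness claim.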

However, there are two catches.

The first catch is constructing "good" invertible hash functions. Fortunately, this is much easier than it looks; cryptographers make invertible functions all the time, only they call them "cyphers". You could, for example, base a hash function on a Feistel network, which is a straightforward way to construct an invertible hash function from non-invertible hash functions.
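Here is a toy Feistel network on 16-bit values (my own sketch; the round function uses SHA-256 purely as a convenient non-invertible mixer, and four rounds is an arbitrary choice):

```python
import hashlib

HALF = 8                      # split 16-bit values into two 8-bit halves
MASK = (1 << HALF) - 1

def round_fn(half, key):
    # Any non-invertible mixing function will do as the round function.
    return hashlib.sha256(bytes([half, key])).digest()[0]

def feistel(x, rounds=4, decrypt=False):
    left, right = x >> HALF, x & MASK
    keys = range(rounds)
    if decrypt:
        # To invert: swap halves, run the rounds with the key schedule
        # reversed, then swap back.
        keys = reversed(keys)
        left, right = right, left
    for k in keys:
        left, right = right, left ^ round_fn(right, k)
    if decrypt:
        left, right = right, left
    return (left << HALF) | right
```

The inverse works even though `round_fn` itself is not invertible, because each round only XORs the round function's output into one half, and XOR undoes itself.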

The second catch is that real hash tables are not ideal, thanks to the Birthday paradox. So you'd want to use a more sophisticated type of hash table which gets you closer to full occupancy with no spilling. Cuckoo hashing is perfect for this, as it lets you get arbitrarily close to ideal in theory, and quite close in practice.

Cuckoo hashing does require multiple hash functions, and it requires that values in hash slots be tagged with which hash function was used. So if you use four hash functions, for example, you need to store an additional two bits in each hash slot. This is still succinct as $m$ grows, so it's not a problem in practice, and still beats storing whole keys.
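A toy cuckoo table showing the tag idea (my own sketch with two deliberately weak hash functions and full keys stored for clarity; in the quotienting scheme, the tag is what tells you which hash function to invert):

```python
NUM_SLOTS = 16
NUM_HASHES = 2

def hashes(x):
    # Two toy hash functions; real code needs independent, well-mixed ones.
    return (x % NUM_SLOTS, (x * 40503 // NUM_SLOTS) % NUM_SLOTS)

# Each slot holds (key, tag) where tag records which hash placed the key.
table = [None] * NUM_SLOTS

def find(x):
    for slot in hashes(x):
        entry = table[slot]
        if entry is not None and entry[0] == x:
            return True
    return False

def insert(x, max_kicks=32):
    if find(x):
        return True
    tag = 0
    for _ in range(max_kicks):
        slot = hashes(x)[tag]
        if table[slot] is None:
            table[slot] = (x, tag)
            return True
        # Evict the occupant; its stored tag tells us which hash placed
        # it, so it must retry with its other hash function next.
        (x, tag), table[slot] = table[slot], (x, tag)
        tag = (tag + 1) % NUM_HASHES
    return False  # table too full; real code would rehash or grow
```

With four hash functions, the tag takes two bits per slot instead of one, which is the overhead mentioned above.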

Oh, you might also want to look at van Emde Boas trees.

MORE THOUGHTS

If $n$ is somewhere around $\frac{u}{2}$, then $\log {u \choose n }$ is approximately $u$, so (once again) assuming that there is no further correlation between the values, you basically can't do any better than a bit vector. You will note that the hashing solution above does effectively degenerate to that case (you end up storing one bit per hash slot), but it's cheaper just to use the key as the address rather than using a hash function.
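The degenerate bit-vector case is trivial to write down (a minimal sketch; $u/8$ bytes total, one bit per universe element, addressed directly by the key):

```python
class BitSet:
    """Dense bitmap set: one bit per universe element."""

    def __init__(self, universe_size):
        self.bits = bytearray((universe_size + 7) // 8)

    def insert(self, x):
        self.bits[x >> 3] |= 1 << (x & 7)

    def find(self, x):
        return bool(self.bits[x >> 3] & (1 << (x & 7)))
```

For the asker's $u \approx 10^{10}$, this is about 1.25 GB regardless of $n$, which is why it only makes sense when $S$ fills a sizeable fraction of the universe.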

If $n$ is very close to $u$, all of the succinct data structures literature advises you to invert the sense of the dictionary: store the values that don't occur in the set. However, now you effectively have to support the delete operation, and to maintain succinct behaviour you also need to be able to shrink the data structure as more elements get "added". Expanding a hash table is a well-understood operation, but contracting it is not.


Question Source : http://cs.stackexchange.com/questions/20070
