World's most popular travel blog for travel bloggers.

[Solved]: Building static hash table with particular collisions

, , No Comments
Problem Detail: 

Is there efficient algorithm to encode keys in hash function with provided collisions?

By efficient I mean with low-ish runtime of lookup operation (taking constants into account) and realistic time of finding such function.
Keys are floating point values - I can compare them for equality as they are raw, unprocessed. Equality holds, and there are no round-offs.
If there is better chance for algorithm on integers - I am still interested.
I did not included specific scheme for building perfect hash - it might be the wrong way or different scheme supports such change.

Let $h(x)$ be hash function (the one to be found)
$X$ be set of sets of float values that $\forall x \in X_i, h(x) = c_i$
$\forall X_i, |X_i| \in [2, 20]$
Let $Y$ be set of float values that there are no collisions (it does not collide with any $X_i$)

For example, I have set $X = \{\{1, 5.5, 24\}, \{7, 20\}\}$ and the set $Y = \{4, 25, 31.2, ...\}$

$h(1) = h(5.5) = h(24) = c_0$
$h(7) = h(20) = c_1$
$h(4) = c_2$
$h(25) = c_3$
$h(31.2) = c_4$
$c_i = c_j \iff i = j$

$X$ set containts colliding tuples, $Y$ contains values that collide with nothing.
$Y$ is about 85-95% of input, $X$ occupies the rest.
If for example $X$ has values in tuples of multiplicities {3, 5, 2, 2, 20, 14, 3}, this are 49 values, but 7 distinct buckets and Y has 451 elements, these are 451 distinct buckets.
In the hashtable there will be at least 458 buckets (this is perfect, and preffered), but might be a bit bigger up to 1%.
All possible inputs are given, $h(x)$ will never execute for value that was not in the input.
Order of the values from input does not matter, $h(x)$ does not need to preserve order, does not need to be monotone.

TLDR; Below are previous attempts, some background etc.


Background

In the initial phase I gather data from the device, this phase ends when for given settings I have collected all possible outputs.
This data is collected as floating point numbers. I calculate outputs based on what I get and create mapping of all values to calculated results (this are not continous functions, not piecewise, distinct values with collisions, which are giving the same results).
The callibration phase gives me value to add to calculated results from initial phase.
In the main part I gather data from device, this time with feedback where I need recalculated values in real time.
There are constraints on memory, query time, and I have 10^7 elements, which need to be calculated and then augmented with some value.
The data seems quite random to me (and to DieHard statistical tests also).


Solution attempts

So I thougth about hash function, calculate everything in advance, and make queries easy.
The problem is the normal hashtable does not fit very well because of slowdowns on collisions and when it is perfect it is too big, but collisions if happen on data that gives the same results - fits quite well.
There are a lots of perfect hashing techniques, or methods of collision resolution, but I could not find any materials about picking collisions on my own (this is twice perfect solution, not only I get rid of collisions but also data redundancy vanishes).
So I took CMPH library, and added rejections to collisions to reject if collision is not on the list. The solution is working (on smaller inputs, on real-like I got only two results so far, on relaxed collisions, due to search time).
The probability that I will find perfect function randomly or semirandomly in 10^7 elements is too low to consider.
So I thought about some modified solution, like taking several hash functions, adjusting them, finding similarities - everything is doable on very small sets, then time needed to execute drops like avalanche.
There are addidional test I did, like with Pearson hash, the idea is very cool, but the data size is bigger than 10^5, and values are floats, so it complicates a bit.


Workplace settings

The size of the set is about 10^7 elements, hash table with function $h(x)$ is flat array, that does not support collision resolution (from my perspective, the values from sets $X_i$ transform into the same output, so are mapped into one element).


Easier, discarded solutions

  • The most obvious attempt is to create minimal perfect static hash function from all elements. This solution is discarded because of memory constraints - it does not fit into memory in one piece, redundancy of data.
  • Each set $X_i$ gets own perfect hash function, and upon query I try $Y$ set as the most probable, and then every set $X_i$ via own hash function until I find the value. (I thought that Yuval Filmus answer is very similar to that approach), which might look a bit like Bloom filter, but the values queried via $h(x)$ are always in the table, and maintaining about 10^6 values more hashes is too slow.

Disaster solutions

  • Trees - Many sorts of trees - too slow, way too slow.
  • Mixture of hash and tree, Bloomtree(??). I never hoped that it would work but checked it anyway. Of course too slow.
Asked By : Evil

Answered By : Yuval Filmus

The easiest way is to construct a static hash table $T$ containing all the collisions, in the following form: for each set of keys $S$ which are supposed to map to the same value, single out some $x \in S$, and put all other $y \in S$ in the table with an entry stating "$x$".

Now take a good hash function $h$, and construct a new one as follows:

  • On input $x$, check whether $x \in T$.
  • If $x \notin T$, return $h(x)$.
  • If $x \in T$, return $h(T[x])$.

If $h$ were a uniformly random function, then this would result in a uniformly random function conditioned on your collisions.

There are some optimizations possible here:

  • The hash function used to access $T$ can be weak.
  • The table $T$ might be too large to fit in the cache, so accessing $T$ might be slow. We can add a Bloom filter that does fit in cache to ameliorate this.
Best Answer from StackOverflow

Question Source : http://cs.stackexchange.com/questions/53094

0 comments:

Post a Comment

Let us know your responses and feedback