World's most popular travel blog for travel bloggers.

# [Solved]: Hash-Table in Practice

, ,
Problem Detail:

I have a set of $n$ values,$v_i$ and want to insert them into a hash-table, $HT$, in a way that each bucket (or hash-table cell) has at most $d$ values. I set $k=\frac{n }{d}$, where $k$ is the number of buckets. Then, I use SHA hash function, $H$, to find the index a bucket in the hash-table and insert the values into the bucket:

$HT [H(v) \bmod k]=v_i$

The problem is that the indices do not look uniformly at random, for instance 3 different values may map into a cell whereas an index may not appear at all. And this can lead overflow.

Notice that I need a deterministic way as two different parties encode their values using hash table. Therefore, if the parties use the identical parameters for hash table, the same values should be in the same bucket (of different hash tables). Obviously different party may have different set values.

Question: What is the best way of preventing overflow in the above scenario?

#### Answered By : Yuval Filmus

SHA1 or SHA256, whichever you use, is for any practical purpose a random function. What you are observing is that random allocation is not as good as deterministic allocation. If you knew all the values in advance then you could indeed arrange that each cell would get exactly the same number of hits. Unfortunately, when you throw $n$ balls into $k$ bins, the number of balls landing into each bin isn't $n/k$ but rather has roughly Poisson distribution with mean $n/k$ (assuming $n/k$ is "small"). Thus you expect there to be a few bins which do get quite a few balls more than their share, and vice versa. A good hash table should be able to handle that – there are several approaches that you are probably aware of.

A similar situation happens when you throw $n$ balls into two bins. Rather than each bin getting exactly $n/2$ balls, the number of balls in each bin has roughly normal distribution with mean $n/2$ and standard deviation $\sqrt{n}/2$. Thus you expect one bin to contain roughly $\sqrt{n}$ more balls than the other.

Another similar situation concerns throwing $n$ balls into $n$ bins. Rather than an even allocation, you expect the maximum cell to contain roughly $\log n/\log\log n$ balls. However, if any given ball is given two options, and you put it in the lighter bin, then the maximum cell contains only roughly $\log\log n$ balls in expectation. This is one way of coping with your problem.