
For what kind of data are hash table operations O(1)?

Problem Detail: 

From the answers to (When) is hash table lookup O(1)?, I gather that hash tables can have $O(1)$ worst-case behavior, at least amortized, when the data satisfies certain statistical conditions, and that there are techniques to help make these conditions broad.

However, from a programmer's perspective, I don't know in advance what my data will be: it often comes from some external source. And I rarely have all the data at once: often insertions and deletions happen at a rate that's not far below the rate of lookups, so preprocessing the data to fine-tune the hash function is out.

So, taking a step back: given some knowledge about the data source, how can I determine whether a hash table has a chance of having $O(1)$ operations, and possibly which techniques to use on my hash function?

Asked By : Gilles

Answered By : David Cary

There are several techniques that guarantee that lookups will always require O(1) operations, even in the worst case.

How can I determine whether a hash table has a chance of having O(1) operations, and possibly which techniques to use on my hash function?

The worst case happens when some malicious attacker (Mallory) deliberately gives you data that Mallory has specifically selected to make the system run slowly.

Once you have picked some particular hash function, it's probably over-optimistic to assume Mallory will never find out which hash function you picked. Once Mallory discovers your hash function, if you allow Mallory to give you lots of data to be inserted into your hash table using that hash function, then you are doomed: Mallory can rapidly generate billions of candidate data items offline, hash them with your hash function to find which items collide, and then feed you millions of those pre-selected colliding items, leading to lookups that run much slower than O(1).
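As a purely hypothetical illustration (the function and key names below are mine, not from the answer), here is a minimal sketch of that attack against a fixed, publicly known hash function: generate candidate keys offline, keep only the ones that land in a chosen bucket, and insert exactly those.

```python
# Hypothetical sketch: Mallory attacks a table whose hash function is fixed
# and publicly known -- here a toy polynomial string hash reduced modulo a
# known table size.

def toy_hash(key: str, table_size: int = 1024) -> int:
    """A fixed, publicly known hash function -- the kind Mallory can exploit."""
    h = 0
    for ch in key:
        h = (h * 31 + ord(ch)) % table_size
    return h

def find_colliding_keys(wanted: int, target_bucket: int = 0) -> list:
    """Brute-force candidate keys offline, keeping only those hitting one bucket."""
    colliding, i = [], 0
    while len(colliding) < wanted:
        key = "key%d" % i
        if toy_hash(key) == target_bucket:
            colliding.append(key)
        i += 1
    return colliding

# Feeding these keys to a chained hash table that uses toy_hash() degrades
# lookups in that bucket from O(1) to O(n).
attack_keys = find_colliding_keys(wanted=1000)
```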

All the techniques that guarantee "O(1) lookups even in the worst case" avoid this problem by doing a little extra work on each insertion to guarantee that, in the future, every possible lookup can succeed in O(1) time. In particular, we assume (worst case) that Mallory will sooner or later discover which hash function we are using; but he only gets a chance to insert a few data items before we pick a different hash function -- tabulation hashing or some other universal hashing -- one that we specially select such that all the data we have so far can be looked up in 2 or 3 probes -- i.e., O(1).

Because we randomly select this function, we can be fairly sure that Mallory won't know what function we picked for a while. Even if Mallory immediately gives us data that, even with this new hash function, collides with previous data, we can then pick yet another fresh hash function such that, after rehashing, all the previous data he and everyone else has fed us can now be looked up in 2 or 3 probes in the worst case -- i.e., O(1) lookups in the worst case.
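To make "pick a different hash function" concrete, here is a minimal sketch of simple tabulation hashing. The details (32-bit integer keys, a power-of-two table size, Python's secrets module) are my own assumptions, not part of the original answer; the point is only that reseeding the random byte tables yields a fresh function Mallory cannot predict.

```python
import secrets

# Minimal sketch of simple tabulation hashing for 32-bit keys (assumptions:
# power-of-two table size, 4 key bytes). Constructing a new instance is how
# we "pick a different hash function" that an attacker cannot predict.

class TabulationHash:
    def __init__(self, table_bits: int = 16):
        self.mask = (1 << table_bits) - 1
        # One table of 256 random 32-bit entries per byte of the key.
        self.tables = [[secrets.randbits(32) for _ in range(256)]
                       for _ in range(4)]

    def __call__(self, key: int) -> int:
        h = 0
        for byte_index in range(4):
            byte = (key >> (8 * byte_index)) & 0xFF
            h ^= self.tables[byte_index][byte]   # XOR one table entry per key byte
        return h & self.mask

# Switching functions after Mallory-induced collisions is just constructing a
# new instance and rehashing the stored keys with it.
h_old = TabulationHash()
h_new = TabulationHash()
print(h_old(123456), h_new(123456))   # independent values for the same key
```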

It's fairly easy to randomly select a new hash function and rehash the entire table often enough to guarantee that each lookup is always O(1). The catch is that these techniques, when inserting the Nth item into a hash table that already contains N-1 items, can occasionally require O(N) time for that insert. However, it is possible to design the system such that, even when Mallory deliberately gives you new data that, under the new hash function, collides with previous data, the system can accept lots of items from Mallory and others before it needs to do a full O(N) rebuild. Hash table techniques that pick-a-new-function-and-rehash in order to guarantee O(1) lookups, even in the worst case, include:

  • cuckoo hashing guarantees that each key lookup succeeds with at most 2 hash calculations and 2 table lookups (see the sketch after this list).
  • hopscotch hashing guarantees that each key lookup succeeds after inspecting at most a small number H (perhaps H = 32) of consecutive entries in the table.
  • dynamic perfect hashing -- the 1994 paper by Dietzfelbinger et al. is the first one I've read that pointed out that, even though it rehashes "frequently" in order to guarantee that each key lookup always succeeds with 2 hash calculations and 2 lookups, it's possible to do a full rehash so rarely that, even though each full rehash uses O(n) time, the expected amortized cost of insertions and deletions is O(1).
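As referenced in the cuckoo hashing item above, here is a minimal two-table cuckoo hashing sketch under my own assumptions (integer keys, toy multiplicative hash functions, a fixed table size); a real implementation would use a stronger hash family and grow the tables. Lookups probe at most two slots, and an insertion whose eviction chain runs too long triggers the pick-new-functions-and-rehash step described above.

```python
import random

# Minimal two-table cuckoo hashing sketch (assumptions: integer keys, toy
# multiplicative hashes, fixed size). Not a production implementation.

class CuckooTable:
    def __init__(self, size: int = 64):
        self.size = size
        self.table1 = [None] * size
        self.table2 = [None] * size
        self._reseed()

    def _reseed(self):
        # "Pick a new hash function" = draw fresh random odd multipliers.
        self.a1 = random.randrange(1, 2 ** 31, 2)
        self.a2 = random.randrange(1, 2 ** 31, 2)

    def _h1(self, key: int) -> int:
        return (self.a1 * key) % self.size

    def _h2(self, key: int) -> int:
        return (self.a2 * key) % self.size

    def lookup(self, key: int) -> bool:
        # At most 2 hash calculations and 2 table lookups: worst-case O(1).
        return self.table1[self._h1(key)] == key or self.table2[self._h2(key)] == key

    def insert(self, key: int) -> None:
        if self.lookup(key):
            return
        for _ in range(self.size):                     # bounded eviction chain
            i = self._h1(key)
            key, self.table1[i] = self.table1[i], key  # place key, evict occupant
            if key is None:
                return
            j = self._h2(key)
            key, self.table2[j] = self.table2[j], key
            if key is None:
                return
        # Eviction chain too long: pick new hash functions and rebuild. This
        # costs O(n) for this one insert, but happens rarely enough that
        # inserts stay O(1) amortized.
        pending = [k for k in self.table1 + self.table2 if k is not None] + [key]
        self.table1 = [None] * self.size
        self.table2 = [None] * self.size
        self._reseed()
        for k in pending:
            self.insert(k)
```

All of the worst-case cost lives in insert(); lookup() never probes more than two slots, which is exactly the guarantee the list item describes.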

See also: Data Structures/Hash Tables

Best Answer from Computer Science Stack Exchange

Question Source : http://cs.stackexchange.com/questions/477
