
[Solved]: Optimizing hash table for common subset of keys?

Problem Detail: 

Are there variants of hash tables that take advantage of the likely distribution of key values? For example, if I'm using utf-8 encoded unicode strings as identifiers in a compiler but expect the majority of them to fit into ASCII, can I optimize for that use case?

Asked By : Shea Levy

Answered By : Gilles

Yes, it's sometimes possible to optimize hash tables based on expected key values (i.e. make certain classes of keys faster at the expense of other classes). But I don't see how that could work in your case.

There are two ways to potentially make hash tables faster: use a hash function with fewer collisions, or use a hash function that's faster to calculate. But these pull in opposite directions: you can make the hash function faster or better at spreading keys, but usually not both.

Hash functions are designed to spread out likely key values in the first place. So if you have "typical" input, there's nothing to be gained here.

For example, if you're making exact string comparisons, then the fact that the strings are encoded in UTF-8 doesn't affect the way the hashes are calculated. The hash function looks at all the bytes of the string anyway, and has a low probability of collisions except for specially-crafted data. A hash function specialized to pure-ASCII strings won't be faster to calculate (operations are on bytes or words anyway, not on individual bits) nor have fewer collisions (repeated bit patterns like "all upper bits are 0" are something typical hash functions are designed to cope with).
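To make this concrete, here is a minimal sketch of a typical byte-at-a-time hash (FNV-1a, 64-bit variant, used here purely for illustration, not something the answer prescribes). It consumes the UTF-8 bytes uniformly, so a pure-ASCII string is just a special case of the general input and there is nothing to specialize:

```python
def fnv1a(data: bytes) -> int:
    """FNV-1a: a simple byte-at-a-time hash (64-bit variant)."""
    h = 0xcbf29ce484222325                    # FNV-1a 64-bit offset basis
    for byte in data:
        h ^= byte                             # mix in one byte
        h = (h * 0x100000001b3) % (1 << 64)   # multiply by FNV prime, mod 2^64
    return h

# The hash processes bytes uniformly; ASCII input is just a
# special case of UTF-8 input, so there is nothing to specialize.
h_ascii = fnv1a("ascii_name".encode("utf-8"))
h_accented = fnv1a("naïve_name".encode("utf-8"))
```

Note that the per-byte work is a xor and a multiply on machine words regardless of which bytes appear, which is why restricting the input alphabet buys nothing.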

An example where you can gain performance for typical cases is when the key for the hash table is a large data structure, such that there's a gain to be made by hashing only part of the data structure. Say the key consists of both a name and some large extra data, and most keys have different names, but occasionally there are distinct keys with the same name. Then you might use a two-level table: first hash on the name, and then, only in buckets where there is more than one entry, calculate the hash of the extra data.
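The two-level scheme above can be sketched as follows (the class name and the `name`/`extra` split of the key are hypothetical; a linear scan stands in for the second-level hash, which a real implementation would switch to once a bucket grows):

```python
class TwoLevelTable:
    """Sketch of a two-level table: the first level is keyed by the
    cheap 'name' part of the key; the expensive 'extra' part is only
    examined inside buckets where names actually collide."""

    def __init__(self):
        self.by_name = {}  # name -> list of (extra, value) pairs

    def insert(self, name, extra, value):
        bucket = self.by_name.setdefault(name, [])
        for i, (e, _) in enumerate(bucket):
            if e == extra:
                bucket[i] = (extra, value)   # replace existing entry
                return
        bucket.append((extra, value))

    def lookup(self, name, extra):
        bucket = self.by_name.get(name, [])
        # Common case: the name alone is distinguishing, but we must
        # still confirm the extra data matches before returning.
        for e, v in bucket:
            if e == extra:
                return v
        return None
```

In the common case (unique name) only the name is hashed; the extra data is touched once, for the final equality check.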

In the case of a string, the closest thing would be to hash only a prefix. But that typically gives bad results in practice; for example, in a compiler, it's very common to have large sets of identifiers with a common prefix.
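A quick illustration of why prefix hashing fails in a compiler setting (the `prefix_hash` helper is hypothetical):

```python
def prefix_hash(s: str, k: int = 8) -> int:
    """Hash only the first k characters — a hypothetical shortcut."""
    return hash(s[:k])

# Identifiers with a shared prefix, as is common in compilers:
idents = ["my_module_init", "my_module_free",
          "my_module_read", "my_module_write"]

# All four identifiers collapse to a single prefix-hash value,
# while a full-string hash keeps them distinct.
prefix_hashes = {prefix_hash(s) for s in idents}
full_hashes = {hash(s) for s in idents}
```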

Another case where a simpler, partial hash function can be useful is when equality between keys is hard to test. This may be the case with Unicode strings if you need to treat different representations of the same text as equal, e.g. treat combining-character sequences as equivalent to their precomposed forms. In this case, you may want to avoid doing the work of normalizing each string. However, that wouldn't help for the pure-ASCII case, since normalizing a pure-ASCII string is trivial. It might help with typical "tame" uses of Unicode (sticking to characters that are common in one language).
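A sketch of that equality problem, using Python's `unicodedata` module: a combining sequence and the precomposed character are distinct strings, so some normalization step (a no-op for pure ASCII) has to happen before hashing or comparing them as keys:

```python
import unicodedata

def normalized_key(s: str) -> str:
    """Normalize to NFC so 'e' + COMBINING ACUTE ACCENT compares
    equal to the precomposed 'é'. Pure-ASCII strings are unchanged."""
    return unicodedata.normalize("NFC", s)

decomposed = "e\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT
precomposed = "\u00e9"   # 'é' as a single precomposed code point
```

The two spellings are unequal as raw strings but equal after normalization, which is exactly the work a normalization-aware table must either do up front or defer.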

A hash table is not necessarily the best structure to represent a mapping from keys to values. Search trees or tries have many advantages. Balanced search trees have guaranteed $O(\lg(n))$ behavior, whereas hash tables are harder to protect against deliberately-crafted key sets (not so much an issue in compilers, but it can be an issue when interpreting code from an untrusted source, e.g. a web browser running JavaScript code from some random website). Hash tables are pretty much impossible to share intelligently — you have to copy the whole table. With trees, on the other hand, sharing comes naturally. Sharing is very often useful in compilers, between scopes that differ only in a few identifiers.
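The structural sharing that trees enable can be sketched with a persistent (immutable) binary search tree: `insert` returns a new root while reusing every untouched subtree. (This sketch is unbalanced for brevity; a real implementation would use a balanced variant.)

```python
class Node:
    """Immutable BST node."""
    __slots__ = ("key", "value", "left", "right")

    def __init__(self, key, value, left=None, right=None):
        self.key, self.value = key, value
        self.left, self.right = left, right

def insert(node, key, value):
    """Return a new tree with (key, value); shares untouched subtrees."""
    if node is None:
        return Node(key, value)
    if key < node.key:
        return Node(node.key, node.value,
                    insert(node.left, key, value), node.right)
    if key > node.key:
        return Node(node.key, node.value,
                    node.left, insert(node.right, key, value))
    return Node(key, value, node.left, node.right)  # replace value

def lookup(node, key):
    while node is not None:
        if key < node.key:
            node = node.left
        elif key > node.key:
            node = node.right
        else:
            return node.value
    return None

# Two scopes that differ in one identifier share most of the tree:
outer = None
for k in ["b", "a", "d", "c"]:
    outer = insert(outer, k, k.upper())
inner = insert(outer, "e", "E")
# 'outer' is unchanged, and 'inner' reuses outer's entire left subtree.
```

Copying a hash table for the inner scope would duplicate every entry; here only the nodes on one root-to-leaf path are allocated.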

Best Answer from StackOverflow




