I have many vectors in my database. They are in high dimensions such as:
- $v_1$ : $\langle 23, 23, 1, 33, 103, 219, \dots \rangle$
- $v_2$ : $\langle 92, 83, 1, 33, 239, 192, \dots \rangle$
- ...
I will use Hamming distance to calculate their difference: The difference between $v_1$ and $v_2$ is $4$ because elements $3$ and $4$ are the same and others are difference.
Now, I want to use Locality Sensitive Hashing (LSH) to put those vectors into different bins.
What kind of hash function can I use for this case?
I have read some article about universal hash function, but I am not sure can I use it and how to ensure that the probability for the similar vectors going to the same bin is higher than those non-similar one.
Here is the way that I think how should I use the universal hash function for my task.
I will first divide those high dimensions vectors into sub-vectors: $$x : 23, 23 \; | \; 1, 33 \; | \; 103, 219 \; | \; \dots$$
sub1-x : 23 23
sub2-x : 1 33
sub3-x : 103 219
The following function will be used for each sub-vector: $$sum_{i=0}^{r} a_{i}x_{i} \mod m$$
Basically this is a dot product, a = {a_1, a_2, ... a_i}, x = {23, 23, 1, 33, 103, 219 ...}, m is a prime.
- Different combination of {a} will form a different hash table, one hash table is used for one sub-vector.
- I can now hash the data into bins, but the question is
Is this an LSH method? I don't know that two similar vectors will go into the same bin with a high probability.
Asked By : sflee
Answered By : D.W.
Yes. This is one plausible method, if you want to find pairs of vectors that are very close to each other. Two vectors that are close in Hamming distance are likely to end up in the same bin at some point.
Suppose two vectors $v,w$ agree in fraction $p$ of their coordinates (where $0\le p \le 1$). Then, heuristically, if we look at a pair of coordinates, the probability that $v,w$ agree on those coordinates is $p^2$; and if we choose those two coordinates for hashing, then $v,w$ will end up in the same bin. If you hash each vector $n$ times, choosing a random pair of coordinates each time, then with probability $1-(1-p^2)^n$, there will be at least one time when they show up in the same bin.
For instance, if two vectors agree on half of their coordinates, and if you hash them 10 times, then 94% of the time they will end up in the same bin for at least one of those 10 hashes.
To know how well it will work on your particular application, you may need to try it (or at least work out the parameters of how close vectors will be, when they are close; how many dimensions they have; how many times you will hash them; etc.).
Best Answer from StackOverflow
Question Source : http://cs.stackexchange.com/questions/13334
0 comments:
Post a Comment
Let us know your responses and feedback