World's most popular travel blog for travel bloggers.

Embedding high dimensional vectors into low dimensional space preserving similarity

, , No Comments
Problem Detail: 

I have a collection of high dimensional vectors such as $\vec{a}_{i} \in \mathbb{R}^{n}$ where $n$ is 3000. What I want to do is to embed these vectors into a space such as $\vec{b}_{i} \in [0, 255]^{n}$ where $n$ in this case is now 32. I want to embed it (using any hash trick or something like that) in a way that will preserve similarity. For example, if I have the following two vectors that are very similar to each other:

$\vec{a}_{1} = (1.2, 0.1, 0.4, 12.8, \ldots)$

$\vec{a}_{2} = (1.1, 0.1, 0.4, 12.5, \ldots)$

I want these two vectors to end in the same "bucket" of this other space where the vectors belongs to $[0, 255]^{32}$. Note that I don't want to use it for search, I want similar vectors to be in the same "bucket" in this other low dimensional space. I was trying to use LSH but I wasn't able to figure out a way to solve this.

Asked By : Tarantula

Answered By : D.W.

Locality-sensitive hashing is one reasonable approach for this. I suggest reading standard resources on locality-sensitive hashing (LSH). In your case, a locality-sensitive hash is a hash function that maps $h:\mathbb{R}^n \to S$ where if $x,y \in \mathbb{R}^n$ are close enough, then we'll have $h(x)=h(y)$ with high probability.

To get started, tead https://en.wikipedia.org/wiki/Locality-sensitive_hashing and http://cstheory.stackexchange.com/q/10053/5038.

Do understand that LSH is fundamentally probabilistic: if $x,y$ are similar, they will have a high probability of being mapped to the same bucket, but this cannot be guaranteed.

Alternatively, you might be interested in metric space embeddings. An embedding $E:\mathbb{R}^n \to \mathbb{R}^d$ has the property that if $x,y$ are close (i.e., $\|x-y\|_2$ is not too large), then $E(x),E(y)$ will be close (i.e., $\|E(x)-E(y)\|_2$ will be not too large). This could then provide a solution to your scheme, if you are satisfied that similar inputs map to buckets that are "nearby" but not necessarily identical.

For more on metric space embeddings, you can start with http://cs.stackexchange.com/a/27923/755, http://cstheory.stackexchange.com/a/6818/5038, http://cstheory.stackexchange.com/q/21487/5038, http://cstheory.stackexchange.com/questions/tagged/embeddings.

Best Answer from StackOverflow

Question Source : http://cs.stackexchange.com/questions/60916

3200 people like this

 Download Related Notes/Documents

0 comments:

Post a Comment

Let us know your responses and feedback