
[Solved]: Find all pairs of strings in a set with Levenshtein distance < d

Problem Detail: 

I have a set of $n = $ 100 million strings of length $l = 20$, and for each string in the set, I would like to find all the other strings in the set with Levenshtein distance $\le d = 4$ from that string. The Levenshtein distance (also called the edit distance) between two strings is the minimum number of insertions, deletions and/or replacements required to convert one string into the other.
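For concreteness, here is a minimal reference implementation of the edit distance as defined above, as a plain dynamic-programming sketch (quadratic in the string length; the transducer mentioned below is a different, more elaborate technique):

    def levenshtein(a: str, b: str) -> int:
        """Minimum number of insertions, deletions and replacements
        needed to turn string a into string b (classic row-by-row DP)."""
        prev = list(range(len(b) + 1))  # distances from the empty prefix of a
        for i, ca in enumerate(a, start=1):
            curr = [i]
            for j, cb in enumerate(b, start=1):
                curr.append(min(
                    prev[j] + 1,               # delete ca
                    curr[j - 1] + 1,           # insert cb
                    prev[j - 1] + (ca != cb),  # replace, or free match
                ))
            prev = curr
        return prev[-1]

    assert levenshtein("kitten", "sitting") == 3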

This should be possible in $O((d + 1)^{2d + 1} \cdot l \cdot n)$ time with a Levenshtein transducer, which takes a single query string at a time and finds all the matches in a set with Levenshtein distance $\le d$. However, using the implementation at that link, it appears to take more like $O(n \log n)$ time than $O(n)$, and uses more than 200 GB of memory.

Is there an alternate $O(n)$ approach that might be faster in practice? The Levenshtein transducer is more general than it needs to be for this application, since it finds matches for each string independently and doesn't exploit the fact that you're comparing every string against every other string.

Asked By : 1''

Answered By : D.W.

There's a "trick" you can use that might potentially speed up your algorithm a little: shingling. No guarantees that it'll necessarily help in your particular case, though.

Lemma. If the edit distance between two words $w,x$ is $\le 4$, and the two words both have length 20, then there exists some 4-gram that is common to both words (i.e., some $u$ of length 4 that is a substring of both $w$ and $x$). (Proof sketch: partition $w$ into five disjoint blocks of length 4; each of the $\le 4$ edits touches at most one block, so at least one block is untouched and therefore appears verbatim in $x$.)

This lemma then suggests the algorithmic trick. Build a table with $4^4 = 256$ buckets, one bucket for each possible 4-gram (assuming a 4-letter alphabet). Given a word $w$, extract all of its 4-grams (there are $20 - 4 + 1 = 17$ of them) and insert $w$ into each of the corresponding 17 buckets. Do this for each word.
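A minimal sketch of this bucketing step (written for an arbitrary alphabet; the $4^4$ count above just assumes the alphabet has 4 letters):

    from collections import defaultdict

    def build_buckets(words, q=4):
        """Map each q-gram to the set of words that contain it."""
        buckets = defaultdict(set)
        for w in words:
            for i in range(len(w) - q + 1):  # 17 q-grams per length-20 word
                buckets[w[i:i + q]].add(w)
        return buckets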

Now, if there is a pair of words at edit distance $\le 4$, then there must be some bucket that contains both words. So, for each bucket, we search the set of words within that bucket for a pair of words at edit distance $\le 4$. Use whatever algorithm you like for that task. I would suggest trying BK-trees or metric trees.
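For illustration, here is a minimal BK-tree sketch (using the levenshtein function from above); within a bucket, one can query each word against the tree before adding it, so that every close pair is reported exactly once:

    class BKTree:
        """Burkhard-Keller tree: a metric tree keyed by edit distance."""
        def __init__(self):
            self.root = None  # each node is [word, {distance: child}]

        def add(self, word):
            if self.root is None:
                self.root = [word, {}]
                return
            node = self.root
            while True:
                dist = levenshtein(word, node[0])
                if dist not in node[1]:
                    node[1][dist] = [word, {}]
                    return
                node = node[1][dist]

        def query(self, word, radius):
            """All stored words within `radius` of `word`; the triangle
            inequality lets us prune whole subtrees."""
            found, stack = [], [self.root] if self.root else []
            while stack:
                node = stack.pop()
                dist = levenshtein(word, node[0])
                if dist <= radius:
                    found.append(node[0])
                for d, child in node[1].items():
                    if dist - radius <= d <= dist + radius:
                        stack.append(child)
            return found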

Will this trick help? The only way to know for sure is to try it. But here's the intuition. We'll have 256 buckets, and each word gets mapped into 17 of them. Thus, heuristically, we expect each bucket to contain about $n'=17n/256$ words. Suppose we have an algorithm that can find all pairs of words at edit distance $\le 4$ within a set of words in $T(n)$ time. Applying it directly to the original set takes $T(n)$ time; after applying this trick, we instead run it once per bucket, for about $256 \cdot T(17n/256)$ time in total. If $256 \cdot T(17n/256)$ is smaller than $T(n)$, this trick yields an improvement in running time. My prediction is that it'll yield improvements for $\Theta(n^2)$-runtime algorithms, but not for $O(n \lg n)$ or $O(n)$ time algorithms.

This trick can be combined with any algorithm for looking for a pair of words with edit distance $\le 4$ in a set of $n$ words.


The "trick" can be improved further. Define a "slate" of the word $w$ to be a pair $(i,u)$ where $i$ is a position ($1 \le i \le 20$) and $s$ is the 4-gram that occurs at position $i$ within $w$. From each word you get 17 slates. Now we get a refined lemma:

Lemma. If the edit distance between two words $w,x$ of length 20 is $\le 4$, then there exists a slate $(i,u)$ of $w$ and a slate $(j,v)$ of $x$ such that $u=v$ and $|i - j| \le 2$. (Since both words have the same length, the $\le 4$ edits must contain equally many insertions and deletions, hence at most two of each, so the starting position of a surviving block shifts by at most 2.)

(It might be possible to improve the latter condition to $|i - j| \le 1$, but I haven't tried to check.)

We'll do the same thing as before, but identifying each bucket with a slate. We get $20 \times 256$ different buckets. Now, for each word $w$, extract its 17 slates and store the word in the corresponding 17 buckets. Let $B_{i,u}$ denote the set of words in the bucket for the slate $(i,u)$. Finally, for each slate $(i,u)$, we check whether there exists a word $w \in B_{i-2,u} \cup B_{i-1,u} \cup B_{i,u} \cup B_{i+1,u} \cup B_{i+2,u}$ that is within edit distance $\le 4$ of some word in $B_{i,u}$.
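A sketch of the whole refined scheme (again using levenshtein from above; for brevity it compares candidates by brute force, where in practice one might put a BK-tree in each bucket as suggested earlier):

    from collections import defaultdict

    def close_pairs_via_slates(words, q=4, d=4, window=2):
        """Report all pairs at edit distance <= d, comparing only words
        that share a q-gram at starting positions within `window`."""
        buckets = defaultdict(list)  # the B_{i,u} from the text
        for w in words:
            for i in range(len(w) - q + 1):
                buckets[(i, w[i:i + q])].append(w)

        pairs = set()
        for (i, u), bucket in buckets.items():
            # Candidates: B_{i-2,u}, ..., B_{i+2,u}, including B_{i,u} itself
            nearby = [x for j in range(i - window, i + window + 1)
                        for x in buckets.get((j, u), [])]
            for w in bucket:
                for x in nearby:
                    if w < x and levenshtein(w, x) <= d:
                        pairs.add((w, x))
        return pairs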

Heuristically, we expect each bucket to have $n'=17n/(20 \times 256) \approx 0.0033 n$ words in it. We expect the running time for each bucket to be something like $5 T(n')$, and we need to do that $20 \times 256$ times, for a total running time of $25600 T(0.0033n)$.
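As a rough sanity check of that estimate under the uniform-bucket heuristic, take $T(n) = cn^2$ (e.g., a brute-force pairwise scan). Then

$$25600 \cdot T(0.0033n) = 25600 \cdot c \cdot (0.0033n)^2 \approx 0.28\,cn^2,$$

i.e., heuristically around a 3-4x speedup over running the quadratic algorithm on the full set.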

Is this better than the prior trick? The only way to know for sure is to try it. My prediction is that this'll be better for $\Theta(n^2)$-runtime algorithms, but worse for $O(n \lg n)$ or $O(n)$ time algorithms.


Finally, I don't think you have the running time of your scheme right. I believe the Levenshtein transducer automata approach will have running time $\Omega(n^2)$ in your setting, as several folks have outlined in the comment thread under your question.

Best Answer from StackExchange

Question Source : http://cs.stackexchange.com/questions/53299
