World's most popular travel blog for travel bloggers.

Efficiently calculating minimum edit distance of a smaller string at each position in a larger one

, , No Comments
Problem Detail: 

Given two strings, $r$ and $s$, where $n = |r|$, $m = |s|$ and $m \ll n$, find the minimum edit distance between $s$ for each beginning position in $r$ efficiently.

That is, for each suffix of $r$ beginning at position $k$, $r_k$, find the Levenshtein distance of $r_k$ and $s$ for each $k \in [0, |r|-1]$. In other words, I would like an array of scores, $A$, such that each position, $A[k]$, corresponds to the score of $r_k$ and $s$.

The obvious solution is to use the standard dynamic programming solution for each $r_k$ against $s$ considered separately, but this has the abysmal running time of $O(n m^2)$ (or $O(n d^2)$, where $d$ is the maximum edit distance). It seems like you should be able to re-use the information that you've computed for $r_0$ against $s$ for the comparison with $s$ and $r_1$.

I've thought of constructing a prefix tree and then trying to do dynamic programming algorithm on $s$ against the trie, but this still has worst case $O(n d^2)$ (where $d$ is the maximum edit distance) as the trie is only optimized for efficient lookup.

Ideally I would like something that has worst case running time of $O(n d)$ though I would settle for good average case running time. Does anyone have any suggestions? Is $O(n d^2)$ the best you can do, in general?

Here are some links that might be relevant though I can't see how they would apply to the above problem as most of them are optimized for lookup only:

I've also heard some talk about using some type of distance metric to optimize search (such as a BK-tree?) but I know little about this area and how it applies to this problem.

Asked By : user834

Answered By : Raphael

What you are interested in are semi-global and/or local alignments. The usual way to compute those is to adapt the dynamic programing algorithm for the Levenshtein distance:

  • Initialise the first row/column with $0$ (instead of $i$/$j$) if free deletions/insertions are allowed at the beginning.
  • Select the minimum value from the last row/column as result if free deletions/insertions are allowed at the end.

Different combinations are possible. In your case, assume that $r$ is the horizontal word, you initialise the first row with $0$ (the first column with $i$, no change here) and the result is the minimum of the last row. The result is then the smallest Levenshtein distance between $s$ and any substring of $r$, computed in time and space $O(nm)$.

Now, in order to get scores for all starting positions, note that the computation of semi-global alignments with allowed deletions/insertions at the end conveniently provides the wanted array in the last row/column -- well, almost: that gives you the best matches against prefixes up to position $k$. So in order to get best matches against suffixes from position $k$, reverse both $r$ and $s$.

Best Answer from StackOverflow

Question Source : http://cs.stackexchange.com/questions/2519

0 comments:

Post a Comment

Let us know your responses and feedback