Given two strings, $r$ and $s$, where $n = |r|$, $m = |s|$ and $m \ll n$, find the minimum edit distance between $s$ for each beginning position in $r$ efficiently.
That is, for each suffix of $r$ beginning at position $k$, $r_k$, find the Levenshtein distance of $r_k$ and $s$ for each $k \in [0, |r|-1]$. In other words, I would like an array of scores, $A$, such that each position, $A[k]$, corresponds to the score of $r_k$ and $s$.
The obvious solution is to use the standard dynamic programming solution for each $r_k$ against $s$ considered separately, but this has the abysmal running time of $O(n m^2)$ (or $O(n d^2)$, where $d$ is the maximum edit distance). It seems like you should be able to re-use the information that you've computed for $r_0$ against $s$ for the comparison with $s$ and $r_1$.
I've thought of constructing a prefix tree and then trying to do dynamic programming algorithm on $s$ against the trie, but this still has worst case $O(n d^2)$ (where $d$ is the maximum edit distance) as the trie is only optimized for efficient lookup.
Ideally I would like something that has worst case running time of $O(n d)$ though I would settle for good average case running time. Does anyone have any suggestions? Is $O(n d^2)$ the best you can do, in general?
Here are some links that might be relevant though I can't see how they would apply to the above problem as most of them are optimized for lookup only:
- Fast and Easy Levensthein distance using a Trie
- SO: Most efficient way to calculate Levenshtein distance
- SO: Levenshtein Distance Algoirthm better than $O(n m)$
- An extension of Ukkonen's enhanced dynamic programming ASM algorithm
- Damn Cool Algorithms: Levenshtein Automata
I've also heard some talk about using some type of distance metric to optimize search (such as a BK-tree?) but I know little about this area and how it applies to this problem.
Asked By : user834
Answered By : Raphael
What you are interested in are semi-global and/or local alignments. The usual way to compute those is to adapt the dynamic programing algorithm for the Levenshtein distance:
- Initialise the first row/column with $0$ (instead of $i$/$j$) if free deletions/insertions are allowed at the beginning.
- Select the minimum value from the last row/column as result if free deletions/insertions are allowed at the end.
Different combinations are possible. In your case, assume that $r$ is the horizontal word, you initialise the first row with $0$ (the first column with $i$, no change here) and the result is the minimum of the last row. The result is then the smallest Levenshtein distance between $s$ and any substring of $r$, computed in time and space $O(nm)$.
Now, in order to get scores for all starting positions, note that the computation of semi-global alignments with allowed deletions/insertions at the end conveniently provides the wanted array in the last row/column -- well, almost: that gives you the best matches against prefixes up to position $k$. So in order to get best matches against suffixes from position $k$, reverse both $r$ and $s$.
Best Answer from StackOverflow
Question Source : http://cs.stackexchange.com/questions/2519
0 comments:
Post a Comment
Let us know your responses and feedback