I am looking for a fast k-mismatch string matching algorithm. Given a pattern string P of length m, and a text string T of length n, I need a fast (linear time) algorithm to find all positions where P matches a substring of T with at most k mismatches. This is different from the k-differences problem (edit distance). A mismatch implies the substring and the pattern have a different letter in at most k positions. I really only require k=1 (at most 1 mismatch), so a fast algorithm for the specific case of k=1 will also suffice. The alphabet size is 26 (case-insensitive english text), so space requirement should not grow too fast with the size of the alphabet (eg., the FAAST algorithm, I believe, takes space exponential in the size of the alphabet, and so is suitable only for protein and gene sequences).
A dynamic programming based approach will tend to be O(mn) in the worst case, which will be too slow. I believe there are modifications of the Boyer-Moore algorithm for this, but I am not able to get my hands on such papers. I do not have subscription to access academic journals or publications, so any references will have to be in the public domain.
I would greatly appreciate any pointers, or references to freely available documents, or the algorithm itself for this problem.
Asked By : Paresh
Answered By : Paresh
Suffix arrays can be used for this problem. They contain the starting positions of each suffix of the string sorted in lexicographic order. Even though they can be constructed naively in $O(n\log n)$ complexity, there are methods to construct them in $\Theta(n)$ complexity. See for example this and this. Let us call this suffix array SA.
Once the suffix array has been constructed, we need to construct a Longest Common Prefix (LCP) array for the suffix array. The LCP array stores the length of the longest common prefix between two consecutive prefixes in the suffix array (lexicographic consecutive suffixes). Thus, LCP[i] contains the length of the longest common prefix between SA[i] and SA[i+1]. This array can also be constructed in linear time: see here, here and here for some good references.
Now, to compute the length of the longest prefix common to any two suffixes in the suffix tree (instead of consecutive suffixes), we need to use some RMQ data structure. It has been shown in the references above (and can be seen easily if the array is visualized as a suffix tree), that the length of the longest common prefix between two suffixes having positions $u$ and $v$ ($u < v$) in the suffix array, can be obtained as $min_{u<=k<=v-1}{LCP[k]}$. A good RMQ can pre-process the $LCP$ array in $O(n)$ or $O(n\log n)$ time and respond to queries of the form $LCP[u, v]$ in $O(1)$ time. See here for a succint RMQ algorithm, and here for a good tutorial on RMQ's, and the relationship (and reductions) between LCA and RMQs. This has another nice alternative approach.
With this information, we construct the suffix array and associated arrays (as described above) for the concatenation of the two strings with a delimiter in between (such as T#P, where '#' does not occur in either string). Then, we can perform k mismatch string matching using the "kangaroo" method. This and this explain the kangaroo method in the context of suffix trees, but can be directly applied to suffix arrays too. For every index $i$ of the text $T$, find the $LCP$ of the suffix of $T$ starting at $i$ and the suffix of $P$ starting at 0. This gives the location after which the first mismatch occurs when matching $P$ with $T[i]$. Let this length be $l_0$. Skip the mismatching character in both $T$ and $P$ and try to match the remaining strings. That is, again find the $LCP$ of $T[i + l_0 + 1]$ and $P[l_0 + 1]$. Repeat this till you obtain $k$ mismatches, or either string finishes. Each $LCP$ is $O(1)$. There are $O(k)$ $LCP$'s for each index $i$ of $T$, giving this a total complexity of $O(nk)$.
I used an easier to implement RMQ giving a total complexity of $O(nk + (n+m)\log(n+m))$, or $O(nk + n\log n)$ if $m = O(n)$, but it can be done in $O(nk)$ too as described above. There may be other direct methods for this problem, but this is a powerful and generic approach that can be applied to a lot of similar problems.
Best Answer from StackOverflow
Question Source : http://cs.stackexchange.com/questions/4797
0 comments:
Post a Comment
Let us know your responses and feedback