How is it possible to print all patterns, of length k, contained in a string using FM-index and burrows-wheeler transform?
PROBLEM DETAIL: I think it's better to leave more details regarding my problem. The input i have are:
- the text encoded with Burrows Wheeler Transform
- the data structure of the FM-Index.
In particularly for the second element listed i have:
- C(c) => is a table that, for each character c in the alphabet, contains the number of occurrences of lexically smaller characters in the text
- Occ(c, k) => is the number of occurrences of character c in the prefix of the Burrow wheelers transform 1..k
The only strategy that I found at the moment is to reconstruct the text from the burrow wheelers transform and, during that operation, for every char print the patter that start from it. I leave a pseudo code that gives you an idea of what i was trying to do:
--------------------------------------------- DATA --------------------------------------------- bwt[n] => burrow wheeler transform of the text C, Occ => FM-index data structure K => lenght of the patterns to be extracted (so extract all patterns of length k from text T) --------------------------------------------- Algorithm --------------------------------------------- n = |bwt|; //text lenght T[0,n]; //array used to save the original text T[--n] = bwt[0]; //first char of the original text currentIndex = 0; for i = 0 to |btw(T)| - 1 do //get original lastFirst = c(bwt[currentIndex]) + Occ(bwt[currentIndex], currentIndex); T[--n] = bwt[lastFirst]; currentIndex = lastFirst; //print pattern only if we have reconstructed at least k char if((|btw(T)| - n) >= k) from original text print T[n, n+k]; //print last k char reconstructed endif; endfor;
Are there better technique/strategy to solve the problem? Do exists a strategy that doesn't need to reconstruct the original text?
Asked By : Fabrizio Duroni
Answered By : jnalanko
If I understand correctly, you want to print every substring of length $k$ from the original text. To do this you will need to invert the Burrows-Wheeler transform completely at some point, which is indeed what you do in your pseudocode. One improvement for your method would be to just store the $k$ latest characters instead of storing the whole string.
Your method is already as good as it gets, as it is equivalent to doing one pass over the string and directly printing the substrings, which is the minimum you have to do in any case.
--
Edit: so we want to print each substring only once. Now the problem becomes more interesting. An easy solution is to store the substrings in a hash map, for example, and thus avoid printing duplicates. If you can afford storing the original string and the inverse suffix array, then there is a sophisticated solution available which is good if $k$ is small.
A basic operation on the FM index is a left extension of a string. If we know the range of suffixes in the suffix array that start with the string $\alpha$, we can compute in constant time the range of $w\alpha$ for any character $w$. Here's how. Let $(i,j)$ be the range of the string $\alpha$. The range of $w \alpha$ is:
$$(C[w] + occ(w, i - 1) + 1,\;\; C[w] + occ(w,j)) \;.$$
We can start with the range of an empty string, which is $(1, n)$, and do a depth-first search of left-extensions down to a depth of $k$. At depth $k$ we will print the substring corresponding to the current range. For this we need to have the original string and the inverse suffix array $ISA$, which is defined by $ISA[SA[i]] = i$, where $SA$ is the suffix array. The substring corresponding to the range $(i,j)$ is the first $k$ characters of the suffix $ISA[i]$.
This method is very efficient in practice if $k$ is small, as it often is in bioinformatics applications. For a more thorough explanation on the backwards search I recommend e.g. this blog post: http://alexbowe.com/fm-index/
Best Answer from StackOverflow
Question Source : http://cs.stackexchange.com/questions/33935
0 comments:
Post a Comment
Let us know your responses and feedback