World's most popular travel blog for travel bloggers.

[Solved]: Using FFT to solve pattern matching problem with don't cares

, , No Comments
Problem Detail: 

I've studied about $FFT$ and it's use in string matching, but there are several issues that bother me. First I'll describe the problem and the solution I've seen in class:
Suppose we have alphabet $\Sigma = (a, b, \phi)$ where $\phi$ is "don't care", meaning it matches with every letter, including itself. We have a text $T = t_{_0}t_{_1}t_{_2}...t_{_n}$, and a pattern $P = p_{_0}p_{_1}p_{_2}...p_{_m}$, assuming $n$ is much bigger than $m$. We say that we have match in index $i$ if for every $0 \leq k \leq m$:
$$P[k]=\phi \space or\space T[i+k]=\phi \space or\space P[k]=T[i+k]$$ Therefore we introduced new "multiplication" function over $\Sigma$:
$$ f(a,b) = \begin{cases} 0 & \text{if a = b;}\\ 1 & \text{else;}\\ \end{cases} $$ The idea was to translate every appearance of $a$ and $\phi$ to $0$, and $b$ to $1$, reverse $P \text{ (P$^R = p_{_m}p_{_m-1}p_{_m-2}...p_{_0})$}$, treat both strings as polynomial coefficient vectors and multiply them according to the regularity of $f$. Then in their product vector, $C$, indices that are 0's will tell us there's a match starting in that index in text. In order to do that efficiently we used $FFT$, however, $FFT$ works with standard multiplication, so in order to bypass that obstacle we calculated $T^c \cdot P^R +T \cdot (P^R)^c$, where $c$ superscript means to "not" the string, for example: $(100010)^c = 011101$, and $+$ means normal addition of vectors. Now I have 2 questions bugging my mind:
1) How that thing we did in "not"-ing the strings helps solving the problem about the multiplication. I mean I've tried several examples and it works, but why it works?
2) If I wanted to expand this method to bigger alphabet, how could I do it? There's something very binary-oriented in the core of "not"-ing the string, relying on the fact that I have only one pair of letters that create discrepancy.

Asked By : enemypower4

Answered By : Yuval Filmus

The idea behind the NOT is that $T^c P^R + TP^{Rc}$ is measuring disagreement rather than agreement. Disagreement can come in two ways: text says $0$, pattern says $1$ ($T^c P^R$); and text says $1$, pattern says $0$ ($TP^{Rc}$). That's why you need two summands.

In more detail, $T^cP^R$ measures the number of indices at which the text was $0$ and the pattern was $1$; $TP^{Rc}$ measures the number of indices at which the text was $1$ and the pattern was $0$; and their sum measures the number of indices at which the text and the pattern didn't match.

When the alphabet has size $r$, you can use a similar approach with $r$ summands (an even simpler solution uses $r(r-1)$ summands). I'll let you figure out the details. (No hints will be given, and solutions won't be verified.) Extra points if you manage to do this with the (possibly) optimal $\lceil \log_2 r \rceil$ summands.

Best Answer from StackOverflow

Question Source :

3.2K people like this

 Download Related Notes/Documents


Post a Comment

Let us know your responses and feedback