World's most popular travel blog for travel bloggers.

deterministic finite automaton for "does contain substring $w$" is of linear size

, , No Comments
Problem Detail: 

A (non-deterministic) finite state automaton for "all strings that contain substring $w$" is very simple. Just make the path for $w$ and add looped transitions for all letters at the first initial state and the last final state.

When asked to make a deterministic automaton in general we might expect that the automaton can be of exponential size (see the warning of Yuval in the case where $w=101$: Write a regex to match string that does NOT contain a certain pattern).

However, Knuth-Morris-Pratt tells us that a pattern match automaton for $w$ is of linear size, and can be constructed in linear time. That means that the deterministic automaton for "contains substring $w$" is of linear size too (failure links can be replaced by transitions, on for each mismatching letter). Luke has used KMP to provide an automaton for the example "not ABCDABD" in his answer Automaton for substring matching.

My question is the following: will the standard determinization of a "substring $w$" automaton always lead to a linear sized automaton? Here I mean the practical version of the construction, where unreachable states are not constructed (so not the full $2^Q$ subset construction). Is the answer direct from the construction, or do we need KMP?

A little observation. In the construction there might be many states that contain the original final state. As they all will accept and also all transitions for ever will lead to accepting states (once we have seen $w$) these states can be grouped into one single state. I do not know whether that is a necessary trick to avoid exponential explosion, see http://cs.stackexchange.com/a/11842/4287

Asked By : Hendrik Jan
Answered By : Yuval Filmus

Consider the following NFA: the states are $q_0,\ldots,q_{|w|}$, there are transitions $q_{i-1} \to q_i$ on $w_i$, there are self-loops on $q_0,q_{|w|}$ (for all alphabet symbols), the initial state is $q_0$, and the unique final state is $q_{|w|}$.

Assuming $w \neq \epsilon$, the determinization generates $2|w|$ states:

  • For each location $0 \leq i < |w|$, a state consisting of $q_0$, $q_i$, and $q_j$ for $j < i$ such that $w_1\ldots w_j$ is a suffix of $w_1\ldots w_i$.

  • Same as before, with $q_{|w|}$ added.

When $w = \epsilon$, the NFA is already deterministic, and contains only one state.

Best Answer from StackOverflow

Question Source : http://cs.stackexchange.com/questions/66535

3200 people like this

 Download Related Notes/Documents

0 comments:

Post a Comment

Let us know your responses and feedback