I created a simple regular expression lexer and parser to take a regular expression and generate its parse tree. Creating a non-deterministic finite state automaton from this parse tree is relatively simple for basic regular expressions. However I can't seem to wrap my head around how to simulate backreferences, lookaheads, and lookbehinds.
From what I read in the purple dragon book I understood that to simulate a lookahead $r/s$ where the regular expression $r$ is matched if and only if the match is followed by a match of the regular expression $s$, you create a non-deterministic finite state automaton in which $/$ is replaced by $\varepsilon$. Is it possible to create a deterministic finite state automaton that does the same?
What about simulating negative lookaheads and lookbehinds? I would really appreciate it if you would link me to a resource which describes how to do this in detail.
Asked By : Aadit M Shah
Answered By : Raphael
First of all, backreferences can not be simulated by finite automata as they allow you to describe non-regular languages. For example, ([ab]^*)\1
matches $\{ww \mid w \in \{a,b\}^*\}$, which is not even context-free.
Look-ahead and look-behind are nothing special in the world of finite automata as we only match whole inputs here. Therefore, the special semantic of "just check but don't consume" is meaningless; you just concatenate and/or intersect checking and consuming expressions and use the resulting automata. The idea is to check the look-ahead or look-behind expressions while you "consume" the input and store the result in a state.
When implementing regexps, you want to run the input through an automaton and get back start and end indices of matches. That is a very different task, so there is not really a construction for finite automata. You build your automaton as if the look-ahead or look-behind expression were consuming, and change your index storing resp. reporting accordingly.
Take, for instance, look-behinds. We can mimic the regexp semantics by executing the checking regexp concurrently to the implicitly consuming "match-all" regexp. only from states where the look-behind expression's automaton is in a final state can the automaton of the guarded expression be entered. For example, the regexp /(?=c)[ab]+/
(assuming $\{a,b,c\}$ is the full alphabet) -- note that it translates to the regular expression $\{a,b,c\}^*c\{a,b\}^+\{a,b,c\}^*$ -- could be matched by
[source]
and you would have to
- store the current index as $i$ whenever you enter $q_2$ (initially or from $q_2$) and
- report a (maximum) match from $i$ to the current index ($-1$) whenever you hit (leave) $q_2$.
Note how the left part of the automaton is the parallel automaton of the canonical automata for [abc]*
and c
(iterated), respectively.
Look-aheads can be dealt with similarly; you have to remember the index $i$ when you enter the "main" automaton, the index $j$ when you leave the main automaton and enter the look-ahead automaton and report a match from $i$ to $j$ only when you hit the look-ahead automaton's final state.
Note that non-determinism is inherent to this: main and look-ahead/-behind automaton might overlap, so you have to store all transitions between them in order to report the matching ones later, or backtrack.
Best Answer from StackOverflow
Question Source : http://cs.stackexchange.com/questions/2557
0 comments:
Post a Comment
Let us know your responses and feedback