World's most popular travel blog for travel bloggers.

How to simulate backreferences, lookaheads, and lookbehinds in finite state automata?

, , No Comments
Problem Detail: 

I created a simple regular expression lexer and parser to take a regular expression and generate its parse tree. Creating a non-deterministic finite state automaton from this parse tree is relatively simple for basic regular expressions. However I can't seem to wrap my head around how to simulate backreferences, lookaheads, and lookbehinds.

From what I read in the purple dragon book I understood that to simulate a lookahead $r/s$ where the regular expression $r$ is matched if and only if the match is followed by a match of the regular expression $s$, you create a non-deterministic finite state automaton in which $/$ is replaced by $\varepsilon$. Is it possible to create a deterministic finite state automaton that does the same?

What about simulating negative lookaheads and lookbehinds? I would really appreciate it if you would link me to a resource which describes how to do this in detail.

Asked By : Aadit M Shah

Answered By : Raphael

First of all, backreferences can not be simulated by finite automata as they allow you to describe non-regular languages. For example, ([ab]^*)\1 matches $\{ww \mid w \in \{a,b\}^*\}$, which is not even context-free.

Look-ahead and look-behind are nothing special in the world of finite automata as we only match whole inputs here. Therefore, the special semantic of "just check but don't consume" is meaningless; you just concatenate and/or intersect checking and consuming expressions and use the resulting automata. The idea is to check the look-ahead or look-behind expressions while you "consume" the input and store the result in a state.

When implementing regexps, you want to run the input through an automaton and get back start and end indices of matches. That is a very different task, so there is not really a construction for finite automata. You build your automaton as if the look-ahead or look-behind expression were consuming, and change your index storing resp. reporting accordingly.

Take, for instance, look-behinds. We can mimic the regexp semantics by executing the checking regexp concurrently to the implicitly consuming "match-all" regexp. only from states where the look-behind expression's automaton is in a final state can the automaton of the guarded expression be entered. For example, the regexp /(?=c)[ab]+/ (assuming $\{a,b,c\}$ is the full alphabet) -- note that it translates to the regular expression $\{a,b,c\}^*c\{a,b\}^+\{a,b,c\}^*$ -- could be matched by

enter image description here
[source]

and you would have to

  • store the current index as $i$ whenever you enter $q_2$ (initially or from $q_2$) and
  • report a (maximum) match from $i$ to the current index ($-1$) whenever you hit (leave) $q_2$.

Note how the left part of the automaton is the parallel automaton of the canonical automata for [abc]* and c (iterated), respectively.

Look-aheads can be dealt with similarly; you have to remember the index $i$ when you enter the "main" automaton, the index $j$ when you leave the main automaton and enter the look-ahead automaton and report a match from $i$ to $j$ only when you hit the look-ahead automaton's final state.

Note that non-determinism is inherent to this: main and look-ahead/-behind automaton might overlap, so you have to store all transitions between them in order to report the matching ones later, or backtrack.

Best Answer from StackOverflow

Question Source : http://cs.stackexchange.com/questions/2557

0 comments:

Post a Comment

Let us know your responses and feedback