Is there a good general paper about the interpretation or compilation of REGEXP in programming languages for pattern matching, with or without variables? I am not looking for a quick explanation about the construction of DFAs, but for a real paper on how it is actually done in programming languages implementation, and what is considered simple or difficult. I expect differences between languages may have an inpact. A formal paper on how REGEXP implementation should be done is useful too :-)
Asked By : babou
Answered By : Wandering Logic
I believe most interpreted regular expression matchers start with Thompson's construction algorithm to turn the regular expression into a non-deterministic finite automata. The article that first described these is: Ken Thompson, "Programming Techniques: Regular expression search algorithm", Communications of the ACM, 11(6):419-422, June 1968. But that paper is a little hard to read, since he was compiling to machine code.
My favorite tutorial on regular expression implementation is this series of blog posts by Russ Cox, the author of the RE2 regular expression library. He gives lots of historical discussion. He argues that the most efficient approach to simulating the NFA is to convert to DFA on the fly with caching of just the DFA states that you actually reach. (In contrast to, for example, the implementation of regular expressions in Perl, which use backtracking.) There are cases (e.g., when you get an extended regular expressions with backreferences) where you need to use backtracking, but Cox suggests that you should only use backtracking when you need to.
The other place you might look is Henry Spencer's regular expression library. According to that web site this was described in the book: Dale Schumacher (ed), Software Solutions In C, Academic Press, 1994.
Best Answer from StackOverflow
Question Source : http://cs.stackexchange.com/questions/23089
0 comments:
Post a Comment
Let us know your responses and feedback