
[Solved]: Parsing arbitrary context-free grammars, mostly short snippets

Problem Detail: 

I want to parse user-defined domain specific languages. These languages are typically close to mathematical notations (I am not parsing a natural language). Users define their DSL in a BNF notation, like this:

expr ::= LiteralInteger
       | ( expr )
       | expr + expr
       | expr * expr

Input like 1 + ( 2 * 3 ) must be accepted, while input like 1 + must be rejected as incorrect, and input like 1 + 2 * 3 must be rejected as ambiguous.

A central difficulty here is coping with ambiguous grammars in a user-friendly way. Restricting the grammar to be unambiguous is not an option: the language is deliberately designed so that writers can omit parentheses when they are not needed to avoid ambiguity. As long as an expression is unambiguous, I need to parse it; if it is ambiguous, I need to reject it.

My parser must work on any context-free grammar, even ambiguous ones, and must accept all unambiguous input. I need the parse tree for all accepted input. For invalid or ambiguous input, I ideally want good error messages, but to start with I'll take what I can get.

I will typically invoke the parser on relatively short inputs, with the occasional longer input. So the asymptotically faster algorithm may not be the best choice. I would like to optimize for a distribution of around 80% inputs less than 20 symbols long, 19% between 20 and 50 symbols, and 1% rare longer inputs. Speed for invalid inputs is not a major concern. Furthermore, I expect a modification of the DSL around every 1000 to 100000 inputs; I can spend a couple of seconds preprocessing my grammar, not a couple of minutes.

What parsing algorithm(s) should I investigate, given my typical input sizes? Should error reporting be a factor in my selection, or should I concentrate on parsing unambiguous inputs and possibly run a completely separate, slower parser to provide error feedback?

(In the project where I needed this, a while back, I used CYK; it wasn't too hard to implement and worked adequately for my input sizes, but it didn't produce very nice errors.)
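For illustration, here is the kind of CYK-based ambiguity check the question alludes to: a minimal sketch in Python, assuming the grammar has been hand-converted to Chomsky normal form. The CNF nonterminals below (X, P1, M1, LP, RP, PLUS, TIMES) are my own encoding, not part of the question's grammar. Counting derivations, rather than just recognizing, is what lets the parser distinguish "unambiguous" from "ambiguous":

```python
from collections import defaultdict

def cyk_count(tokens, term_rules, bin_rules, start):
    """Count distinct parse trees for `tokens` under a CNF grammar.
    term_rules: {terminal: [nonterminals]}; bin_rules: {(B, C): [nonterminals]}.
    Returns 0 (reject), 1 (unique parse), or more (ambiguous)."""
    n = len(tokens)
    if n == 0:
        return 0
    # table[(i, j)][A] = number of ways nonterminal A derives tokens[i:j]
    table = defaultdict(lambda: defaultdict(int))
    for i, tok in enumerate(tokens):
        for a in term_rules.get(tok, []):
            table[(i, i + 1)][a] += 1
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):  # split point
                for b, count_b in table[(i, k)].items():
                    for c, count_c in table[(k, j)].items():
                        for a in bin_rules.get((b, c), []):
                            table[(i, j)][a] += count_b * count_c
    return table[(0, n)][start]

# Hand-made CNF encoding (an assumption, not from the question) of:
#   expr ::= LiteralInteger | ( expr ) | expr + expr | expr * expr
term_rules = {'int': ['E'], '(': ['LP'], ')': ['RP'],
              '+': ['PLUS'], '*': ['TIMES']}
bin_rules = {('LP', 'X'): ['E'], ('E', 'RP'): ['X'],
             ('E', 'P1'): ['E'], ('PLUS', 'E'): ['P1'],
             ('E', 'M1'): ['E'], ('TIMES', 'E'): ['M1']}

print(cyk_count('int + ( int * int )'.split(), term_rules, bin_rules, 'E'))  # 1
print(cyk_count('int + int * int'.split(), term_rules, bin_rules, 'E'))      # 2
print(cyk_count('int +'.split(), term_rules, bin_rules, 'E'))                # 0
```

The three calls reproduce the question's three behaviors: unique parse, ambiguous, and invalid. The cubic table also explains why CYK is fine at 20-50 symbols but gives poor errors: by the time it fails, blame is spread over the whole table.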

Asked By : Gilles

Answered By : Alex ten Brink

Probably the ideal algorithm for your needs is generalized LL parsing, or GLL. This is a very new algorithm (the paper was published in 2010). In a way, it is the Earley algorithm augmented with a graph-structured stack (GSS) and LL(1) lookahead.

The algorithm is quite similar to plain old LL(1), except that it doesn't reject grammars that are not LL(1): it just tries out all possible LL(1) parses at once. It maintains a graph-structured stack for every point in the parse, so when it encounters a parse state that has already been dealt with, it simply merges the corresponding stack vertices. This makes it work even on left-recursive grammars, unlike LL. For the exact details of its inner workings, read the paper (it's quite readable, though the label soup requires some perseverance).
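GLL itself takes the paper to describe, but since it is positioned here as "Earley plus a GSS", a minimal Earley recognizer gives a feel for the chart-based machinery involved. This is a sketch under stated assumptions, not GLL: recognition only, no epsilon-productions, and no parse-tree recovery. Note that it handles the question's left-recursive grammar directly, with no CNF conversion:

```python
def earley_recognize(tokens, grammar, start):
    """Minimal Earley recognizer (no epsilon-productions, recognition only).
    `grammar` maps each nonterminal to a list of right-hand sides (tuples of
    symbols); a symbol is a nonterminal iff it is a key of `grammar`.
    An item (head, rhs, dot, origin) means: a parse of head -> rhs began at
    input position `origin` and has matched rhs[:dot] so far."""
    chart = [set() for _ in range(len(tokens) + 1)]
    chart[0] = {(start, rhs, 0, 0) for rhs in grammar[start]}
    for i in range(len(tokens) + 1):
        agenda = list(chart[i])
        while agenda:
            head, rhs, dot, origin = agenda.pop()
            if dot < len(rhs):
                sym = rhs[dot]
                if sym in grammar:                          # predict
                    for prod in grammar[sym]:
                        item = (sym, prod, 0, i)
                        if item not in chart[i]:
                            chart[i].add(item); agenda.append(item)
                elif i < len(tokens) and tokens[i] == sym:  # scan
                    chart[i + 1].add((head, rhs, dot + 1, origin))
            else:                                           # complete
                for h2, r2, d2, o2 in list(chart[origin]):
                    if d2 < len(r2) and r2[d2] == head:
                        item = (h2, r2, d2 + 1, o2)
                        if item not in chart[i]:
                            chart[i].add(item); agenda.append(item)
    return any(h == start and d == len(r) and o == 0
               for h, r, d, o in chart[-1])

# The question's grammar, left recursion and all ('int' = LiteralInteger):
grammar = {'E': [('int',), ('(', 'E', ')'),
                 ('E', '+', 'E'), ('E', '*', 'E')]}
print(earley_recognize('int + ( int * int )'.split(), grammar, 'E'))  # True
print(earley_recognize('int +'.split(), grammar, 'E'))                # False
```

Roughly speaking, the chart sets here play the role of GLL's descriptor sets, and the deduplication of items is the flat analogue of merging GSS vertices; GLL adds the stack graph and LL(1) lookahead to prune the agenda.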

The algorithm has a number of clear advantages relevant to your needs over the other general parsing algorithms (that I know of). Firstly, implementation is very easy: I think only Earley is easier to implement. Secondly, performance is quite good: in fact, it becomes just as fast as LL(1) on grammars that are LL(1). Thirdly, recovering the parse is quite easy, and checking whether there is more than one possible parse is as well.

The main advantage GLL has is that it is based on LL(1) and is therefore very easy to understand and debug: when implementing it, when designing grammars, and when parsing inputs. Furthermore, it also makes error handling easier: you know exactly where possible parses stranded and how they might have continued. You can easily report the possible parses at the point of the error and, say, the last 3 points where parses stranded. You might instead opt to try to recover from the error, marking the production that the furthest-reaching parse was working on as 'complete' and seeing whether parsing can continue after that (say someone forgot a parenthesis). You could even do that for, say, the 5 parses that got the furthest.
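As a sketch of the "furthest stranded parse" idea: given a chart of live items per input position, reporting the last position that still had live items, together with what those items expected next, takes only a few lines. The item representation below is an assumption modeled on Earley-style items, not GLL's actual data structure, and the toy chart is hand-built for illustration:

```python
def furthest_failure(chart, tokens):
    """Report where parsing stranded. `chart` holds one set of live items per
    input position; an item is (head, rhs, dot, origin), an assumed
    Earley-style representation (not GLL's actual data structure)."""
    last = max(i for i, items in enumerate(chart) if items)
    expected = sorted({str(rhs[dot]) for _, rhs, dot, _ in chart[last]
                       if dot < len(rhs)})
    return (f"parse stranded after {last} token(s) "
            f"({' '.join(tokens[:last])!r}); "
            f"expected one of: {', '.join(expected)}")

# Hypothetical chart for the invalid input "int +" under expr ::= expr + expr:
toy_chart = [
    {('E', ('E', '+', 'E'), 0, 0)},  # position 0: nothing consumed yet
    {('E', ('E', '+', 'E'), 1, 0)},  # matched the left operand
    {('E', ('E', '+', 'E'), 2, 0)},  # matched '+': expecting an E, but input ended
]
print(furthest_failure(toy_chart, ['int', '+']))
```

This prints that the parse stranded after 'int +' while expecting an E, which is exactly the "here is where it got stuck and what could have come next" style of message described above.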

The only downside to the algorithm is that it's new, which means there are no well-established implementations readily available. This may not be a problem for you: I've implemented the algorithm myself, and it was quite easy to do.


Question Source : http://cs.stackexchange.com/questions/234
