World's most popular travel blog for travel bloggers.

# [Solved]: Streaming parsing algorithms

, ,
Problem Detail:

A parser is a procedure which decides if an input belongs to a certain language and produces a witness in the form of a parse tree.

Let a streaming parser be a parser which for any prefix $u$ can produce a partial parse tree corresponding to the productions that must be expanded to eventually accept any completion $uv$ of the prefix, for some completion $v$.

For example, if parsing the CFG $$\begin{array}{lcl} S &\to& (a S+ b) \end{array}$$

then upon reading the prefix $aaa$, the streaming parser should report that any parse tree must consist of three top-level expansions of $S$.

Edit: I have two motivations for streaming parsing. The first is to provide a best-case constant bound on memory usage in the case where productions can always be resolved by a constant amount of lookahead. The second is to allow the consumer to lazily tear down the parse tree as the result is produced, e.g. by scheduling the execution of semantic actions associated with productions.

I am also interested in streaming parsing based on other foundations than context-free grammars, such as top-down parsing (parsing expression grammars, parser combinators, etc.).

Is there any existing work on such parsing algorithms?

###### Asked By : Ulrik Rasmussen

I believe that the LR(k) property corresponds to your notion of a grammar which can be parsed in a streaming fashion. It's probably useful to (re)read Donald Knuth's 1965 paper On the Translation of Languages from Left to Right. In that paper, Knuth defines the LR(k) property in terms of the construction of the parse tree; a grammar is LR(k) iff "any handle is always uniquely determined by the string to its left and the $k$ terminal characters to its right." (p. 610) He then proves that every language recognizable by a deterministic push-down automaton is LR(1).

In effect, the stack in an LR(k) parser is essentially the lower-left corner of the final parse tree, which is one way to understand the phrase "the tree starts with". An LR(k) parser's stack will reflect the parser of all but the last $k$ symbols read, so unless $k$ is 0, which would considerably limit the possible languages, the prefix read will be slightly longer than the corresponding parse stack. But that doesn't seem to be excluded by the form of your question.

Unfortunately, it is not decidable whether a particular language is LR(k). It is easy enough to figure out if a particular CFG is LR(k) for some given $k$, but I don't believe there is any way to find such a $k$ for a given CFG other than trying all possible values in sequence.

At the end of the paper, Knuth proposes a possible generalization ($LR(k,t)$) in which the $t$th last handle of every parse tree fragment is uniquely-determined (so that $LR(k)$ is $LR(k,1)$). This might still fall into your idea of possible streaming parsers, although it no longer is able to stream in a strictly bounded context. He gives an example of a language which is not $LR(k,1)$ for any $k$ but is $LR(0,2)$; however, he does not propose any algorithm for constructing an $LR(k,t)$ parser for $t\ne 1$.

Given an arbitrary CFG, you can use Tomita's Generalized LR (GLR) parsing algorithm, which, if implemented correctly, will produce all possible parses for a given string in $O(n^3)$ time and space. The GLR parser maintains a data structure from which all viable parse stacks can be derived; in particular, it is easy to find the longest prefix which corresponds to an unambiguous parse stack. However, there is no guarantee that all the parse stacks at a particular point in the parse (before the end) are actually viable; only that all viable stacks are included. So it is possible that there is a longer unambiguous prefix. A Tomita parser would correctly find the longest prefix for Knuth's $LR(0,2)$ example, but I don't know if that result could be generalized.