World's most popular travel blog for travel bloggers.

Why is using a lexer/parser on binary data so wrong?

, , No Comments
Problem Detail: 

I often work with lexer/parsers, as opposed to a parser combinator and see people who never took a class in parsing, ask about parsing binary data. Typically the data is not only binary but also context sensitive. This basically leads to having only one type of token, a token for byte.

Can someone explain why parsing binary data with a lexer/parser is so wrong with enough clarity for a CS student who hasn't taken a parsing class, but with a footing on theory?

Asked By : Guy Coder

Answered By : AProgrammer

In principle, there is nothing wrong.

In practice,

  • most non-textual data formats I know are not context-free and are therefore not suitable for common parser generators. The most common reason is that they have length fields giving the number of times a production has to be present.

    Obviously, having a non context-free language has never prevented the use of parser generators: we parse a superset of the language and then use semantic rules to reduce it to what we want. That approach could be used for non-textual formats if the result would be deterministic. The problem is to find something else than counts to synchronize on as most binary formats allow arbitrary data to be embedded; length fields tell you how much it is.

    You can then start playing tricks like having a manually writen lexer able to handle that with feedback from the parser (lex/yacc handling of C use that kind of tricks to handle typedef, for instance). But then we come to the second point.

  • most non-textual data formats are quite simple (even if they are not context-free). When the counts mentioned above are ignored, the languages are regular, LL1 at worst, and are thus well suited for manual parsing techniques. And handling counts is easy for manual parsing techniques like recursive descent.

Best Answer from StackOverflow

Question Source : http://cs.stackexchange.com/questions/899

0 comments:

Post a Comment

Let us know your responses and feedback