[Solved]: What is the point of delimiters and whitespace handling

Problem Detail: 

I see that the language specification defines reserved words, delimiters and whitespace in its lexical section. Are delimiters just reserved identifiers made of punctuation, or do they have an additional function related to skipping whitespace?

Namely, the spec says:

Each lexical element is either a delimiter, an identifier (which may be a reserved word), an abstract literal, a character literal, a string literal, a bit string literal or comment. In some cases an explicit separator is required to separate adjacent lexical elements (namely when, without separation, interpretation as a single lexical element is possible). A separator is either a space character (SPACE or NBSP), a format effector, or the end of a line.

I also see a definition

relative_pathname ::= { ^ . } partial_pathname 

where . is a delimiter but ^ is not. I do not understand why there is a difference. Moreover, ^ is a special character that can only appear as part of a "string literal", a 'char literal' or an /extended identifier/, and I don't understand how to deliver this character to the path parser.

Anyway, I wonder what to do with pieces of text jammed together, like 11'c' or "this is string literal"with_some_identifier. My current lexer produces a string_literal followed by an identifier. However, I feel that others don't do that. What is the common practice for lexing and whitespace skipping: when is whitespace mandatory and when is it optional? How do you specify that to the parser/lexer? I ask because I do not see parsers/lexers specifying much whitespace or many separators in their production rules. Although this must be ubiquitous, in practice I do not see it at all. In JavaCC, for instance, you just specify whitespace characters in SKIP and it does the rest itself. What is the convention? I see that parser combinators support lexical.reserved words and lexical.delimiters. What is their purpose?

I guess that I can supplement every definition of an identifier, delimiter and literal with an optional whitespace prefix. Now, is it right that only identifiers and literals have to be separated by delimiters or whitespace? How do I ensure that?

Asked By : Valentin Tihomirov

Answered By : rici

Now, is it right that only identifiers and literals have to be separated by delimiters or whitespace? How do I ensure that?

If by "right" you mean it is the case in every programming language, then no, it is not right, and probably no non-trivial lexical statement would be either.

In many languages, integer literals do not have to be separated from a following identifier; in other languages, they do. In most languages, <= is different from < =, so identifiers and numbers are not the only classes of tokens which require explicit separation. (If you don't consider < and = to be "delimiters", then you cannot say that identifiers need to be separated by whitespace or delimiters, either, since a<b is normally valid.)

Maximal munch is a very common lexical algorithm. It treats the input as a sequence of lexemes, where each lexeme starts immediately after the previous one and continues as far as possible (so that if two lexemes are possible at a given point in the source code, the longer one is chosen regardless of whether the rest of the input would be valid or not). Not all lexemes are tokens; the set of lexemes includes ignorable whitespace, for example, which is recognized by the tokeniser but does not otherwise participate in syntactic analysis.
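To make the rule concrete, here is a minimal Python sketch of a maximal-munch tokeniser. The token classes and their regular expressions are invented for illustration and do not come from any particular language specification. Whitespace is matched as an ordinary lexeme and then filtered out before the tokens reach a parser.

```python
import re

# Toy token classes (assumed for illustration, not taken from any real language).
# List order only breaks ties between matches of equal length.
TOKEN_SPEC = [
    ("WHITESPACE", r"[ \t\r\n]+"),
    ("NUMBER", r"[0-9]+"),
    ("IDENT", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("STRING", r'"[^"\n]*"'),
    ("OP", r"<=|<|=|\+"),  # longer alternatives first, since Python alternation is ordered
]
COMPILED = [(name, re.compile(pattern)) for name, pattern in TOKEN_SPEC]


def lexemes(text):
    """Yield (kind, text) pairs, always taking the longest match at each position."""
    pos = 0
    while pos < len(text):
        best_kind, best_end = None, pos
        for kind, regex in COMPILED:
            match = regex.match(text, pos)
            if match and match.end() > best_end:
                best_kind, best_end = kind, match.end()
        if best_kind is None:
            raise ValueError(f"no lexeme starts at offset {pos}")
        yield best_kind, text[pos:best_end]
        pos = best_end


def tokens(text):
    """Drop ignorable whitespace lexemes; everything else is a token."""
    return [(kind, value) for kind, value in lexemes(text) if kind != "WHITESPACE"]


print(tokens("a<=b"))         # one "<=" operator
print(tokens("a< =b"))        # "<" and "=" as two operators
print(tokens('"text"name'))   # a string literal directly followed by an identifier
```

With these toy rules, a string literal jammed against an identifier comes out as two tokens, which is the behaviour described in the question. Whether that sequence is then accepted is up to the parser, not the tokeniser.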

However, most languages have idiosyncratic exceptions. Consider the handling of <:: in C++ or 1..2 in D; both of these require an explicit exception to the maximal-munch rule. A different issue is the handling of regular-expression literals in ECMAScript (and other languages), in which / and /= could be tokenised as a simple operator token, or they could be the first part of a regular expression literal. Which option is chosen depends on the immediate syntactic context. Yet another deviation from the above model is found in languages like Python and Haskell where layout (whitespace used as indentation) is sometimes syntactic.
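As a rough illustration of the 1..2 case, the Python sketch below compares a purely longest-match scanner with one whose floating-point pattern refuses to swallow a dot that is followed by another dot. The patterns are toy rules assumed for this example; they are not D's actual lexical grammar.

```python
import re

# Toy rules (an assumption for illustration). FLOAT accepts a trailing dot, as in "1.".
NAIVE = [
    ("FLOAT", r"[0-9]+\.[0-9]*"),
    ("INT", r"[0-9]+"),
    ("DOTDOT", r"\.\."),
    ("DOT", r"\."),
]

# Same rules, except FLOAT must not consume a dot that starts a ".." token.
PATCHED = [("FLOAT", r"[0-9]+\.(?!\.)[0-9]*")] + NAIVE[1:]


def scan(text, spec):
    """Plain longest-match scan over the given (kind, pattern) rules."""
    pos, out = 0, []
    while pos < len(text):
        best_kind, best_end = None, pos
        for kind, pattern in spec:
            match = re.compile(pattern).match(text, pos)
            if match and match.end() > best_end:
                best_kind, best_end = kind, match.end()
        if best_kind is None:
            raise ValueError(f"no lexeme starts at offset {pos}")
        out.append((best_kind, text[pos:best_end]))
        pos = best_end
    return out


print(scan("1..2", NAIVE))    # the float "1." is munched first, leaving ".2" behind
print(scan("1..2", PATCHED))  # integer, range operator, integer
```

The negative lookahead is one way to encode such an exception inside the lexical rules themselves. It leaves ordinary literals such as 1.5 unaffected.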

To the extent that a language specification uses the maximal munch algorithm, whitespace is mandatory precisely in those contexts where maximal-munch would produce a different tokenisation. In such languages, whitespace is always discarded (after tokenisation) and there is no need to ensure it is supplied.
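One way to check that claim mechanically is to tokenise two adjacent lexemes with and without a space between them and compare the results. This is a sketch under assumed toy token classes; needs_separator is a hypothetical helper name, not part of any library.

```python
import re

# Toy token classes, assumed for illustration only.
SPEC = [
    ("WS", re.compile(r"\s+")),
    ("NUMBER", re.compile(r"[0-9]+")),
    ("IDENT", re.compile(r"[A-Za-z_][A-Za-z0-9_]*")),
    ("OP", re.compile(r"<=|<|=")),
]


def tokenize(text):
    """Maximal-munch tokenisation; whitespace is recognised, then discarded."""
    pos, out = 0, []
    while pos < len(text):
        best_kind, best_end = None, pos
        for kind, regex in SPEC:
            match = regex.match(text, pos)
            if match and match.end() > best_end:
                best_kind, best_end = kind, match.end()
        if best_kind is None:
            raise ValueError(f"no lexeme starts at offset {pos}")
        if best_kind != "WS":
            out.append((best_kind, text[pos:best_end]))
        pos = best_end
    return out


def needs_separator(left, right):
    """True when gluing the two lexemes together changes the token sequence."""
    return tokenize(left + right) != tokenize(left + " " + right)


print(needs_separator("<", "="))  # True: "<=" would become a single operator
print(needs_separator("x", "1"))  # True: "x1" would become a single identifier
print(needs_separator("1", "x"))  # False under these toy rules
print(needs_separator("a", "<"))  # False
```

Whether a number followed directly by an identifier is acceptable varies by language; the toy rules here allow it.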

There is no global institution which regulates computer language designers. Each language community develops according to its own philosophy, customs, eccentricities and equivocations, and there is no higher truth to which one can appeal.


Particular languages may be based on a different tokenisation model, so you cannot make general language-independent statements based on the behaviour of a single language. In the comment thread, it has been suggested that this question actually applies to VHDL and that VHDL does not conform to the maximal munch model. The text cited in the OP apparently comes from section 13.2 of the VHDL Reference Manual. However, it seems clear (even from the text quoted in the OP) that the design is not much different from maximal munch. I repeat, with emphasis added:

The text of each design unit is a sequence of separate lexical elements. Each lexical element is either a delimiter, an identifier (which may be a reserved word), an abstract literal, a character literal, a string literal, a bit string literal, or a comment.

In some cases an explicit separator is required to separate adjacent lexical elements (namely when, without separation, interpretation as a single lexical element is possible). A separator is either a space character (SPACE or NBSP), a format effector, or the end of a line. A space character (SPACE or NBSP) is a separator except within a comment, a string literal, or a space character literal.

So that makes it clear that the only case in which separation is obligatory is where the concatenated tokens could be tokenised as a (longer) combined token, which is essentially the maximal-munch rule.

Question Source : http://cs.stackexchange.com/questions/51487
