Grammars for programming languages

Mikhail Barash
13 min read · Oct 3, 2018


When the syntax of programming languages is communicated, context-free grammars are a lingua franca. They define the structure of syntax, but cannot express static semantics. This post gives an overview of other successful grammatical models (attribute, two-level, parsing expression, conjunctive, and Boolean grammars) in relation to defining programming languages.

Lambdas in front of CWI in Amsterdam. Photo by the author.

Attempts to specify Algol

After Chomsky introduced phrase structure grammars in the 1950s, this model quickly became a standard for defining the syntax of programming languages. Grammar rules of the form

A → α

convey the idea of defining a construct A as a sequence of strings represented by α. Indeed, consider a rule of the form

VariableDecl  ->  "var" ident ";"

This rule states that whenever the keyword var is followed by an identifier, which in turn is followed by a semicolon, this sequence of strings represents a variable declaration.

Though surprisingly simple, this mechanism allows defining the syntax of modern programming languages (in the form of BNF notation, which is equivalent to context-free grammars) and is still used as a lingua franca when describing syntax.

Soon after context-free grammars had been introduced, it became apparent that their expressive power is limited. In 1962, Floyd showed that Algol is not a context-free language and thus cannot be defined by a context-free grammar. Floyd considered the following Algol program, in which the only identifier is the symbol x repeated n times.
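Schematically, the program looks as follows, with x...x standing for the identifier x repeated n times:

begin real x...x;
x...x := x...x
end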

For this program to be valid, the number n should be the same in all three of its occurrences. Under a certain encoding, such programs can be represented as strings of the form a...a b...b c...c, where the symbols a, b and c occur the same number of times (for example, aabbcc is fine, but aabbbccc is not, because b and c occur three times, whereas a occurs only twice). These strings form a well-known non-context-free language.

Nevertheless, context-free grammars are capable of defining the structure of such programs: indeed, as required by Algol’s syntax:

  • statements are separated by semicolons,
  • there is an identifier on the left-hand side of the assignment operator,
  • the sequence of statements is framed by the correctly spelled keywords begin and end, and so on.

Below is an example of a program that satisfies all these conditions and would be considered correct by a context-free grammar, despite containing undeclared identifiers.

begin
  real x;
  xx := xxx;
end

Context-free grammars define the structural syntax of languages, but cannot express any facts about their static semantics (the so-called context conditions). It is impossible to specify in a context-free grammar that every identifier should be declared before use, or that the number of formal parameters and actual arguments in a function call should agree.

Context-sensitive grammars (initially proposed by Chomsky to describe the syntax of natural languages) are powerful enough to define these conditions, but they are equivalent to a special class of Turing machines (linear bounded automata), which means that no efficient parsing algorithm exists for them.

A LEGO implementation of a Turing machine. Image credit: https://filmfreeway.com/legoturingmachine

Subsequent research was done in two main directions:

  • to embed “checking actions” into rules of a context-free grammar,
  • to come up with an entirely new grammatical model.

The first approach led to attribute grammars and syntax-directed translation. In such a grammar, every rule is associated with semantic actions, which may, for example, use symbol tables to look up whether a certain variable has been previously declared.

VariableDeclaration ::=
"var" ident {symbolTable.safeAdd($1);} ";"

In this example, the method safeAdd could check whether an identifier with the same name has already been declared (the name of the “current” identifier is referred to by $1) and, if so, raise an exception. Otherwise, the new identifier is added to the symbol table. Such essentially ad hoc techniques are still used in industrial-strength compiler compilers.
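A minimal sketch of what such an action could do behind the scenes, written here in Python (the class and its method are illustrative assumptions mirroring the snippet above, not part of any particular parser generator):

class SymbolTable:
    def __init__(self):
        self.declared = set()

    def safe_add(self, name):
        # Reject a redeclaration; otherwise record the new identifier.
        if name in self.declared:
            raise ValueError(f"identifier '{name}' is already declared")
        self.declared.add(name)

table = SymbolTable()
table.safe_add("x")      # first declaration: fine
# table.safe_add("x")    # uncommenting this raises: redeclaration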

Example of semantic actions (code within curly braces) in lex and yacc. Image credit: https://github.com/faustinoaq/vscode-lex-flex-yacc-bison

The other direction of research was to develop a reasonable (that is, efficiently parsable) model that would be able to specify the desired context conditions. Of the many such attempts, two-level grammars by van Wijngaarden received some attention: they were used to formally specify the syntax and static semantics of Algol 68. Unfortunately, two-level grammars soon turned out to be Turing-complete, which rules out any practical parsing algorithm for them.

Fragment of van Wijngaarden’s grammar for loop clauses of Algol 68.

Logic and order for static semantics

Parsing expression grammars by Ford introduce an ordering on the rules of a context-free grammar: the choices are tried in order, and the first one to succeed is used to parse the input string. This is useful for disambiguating constructs like if-then-else statements.

IfStatement :
"if" Expr "then" Statements "else" Statements /
"if" Expr "then" Statements

This rule matches a conditional statement in such a way that the optional else clause always binds to the innermost if. Without prioritized choice, the rule would lead to the dangling-else ambiguity.
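To see how prioritized choice plays out in practice, here is a minimal recursive-descent sketch in Python (the combinators, and the simplification of Expr and Statements to the single tokens E and S, are assumptions made for illustration):

def literal(token):
    def parse(s, i):
        while i < len(s) and s[i] == " ":   # skip whitespace between tokens
            i += 1
        return i + len(token) if s.startswith(token, i) else None
    return parse

def seq(*parts):
    def parse(s, i):
        for part in parts:
            i = part(s, i)
            if i is None:
                return None
        return i
    return parse

def ordered_choice(*alternatives):
    def parse(s, i):
        for alt in alternatives:
            j = alt(s, i)
            if j is not None:               # commit to the first success
                return j
        return None
    return parse

def statement(s, i):
    # Statements simplified to either a nested IfStatement or the token S
    return ordered_choice(if_statement, literal("S"))(s, i)

def if_statement(s, i):
    with_else = seq(literal("if"), literal("E"), literal("then"),
                    statement, literal("else"), statement)
    without_else = seq(literal("if"), literal("E"), literal("then"), statement)
    # The alternative with else is tried first, so an else always
    # attaches to the innermost if.
    return ordered_choice(with_else, without_else)(s, i)

src = "if E then if E then S else S"
assert if_statement(src, 0) == len(src)     # the else binds to the inner if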

Parsing expression grammars also provide predicates for Boolean conjunction (&) and negation (!). The following rules define nested comments in Pascal (example taken from this book).

Comment       :  "(*" CommentedText* "*)"
CommentedText :  Comment / ! "*)" .

A comment consists of an opening token (*, some commented text, and a closing token *). This construct is non-trivial to recognize because the commented text may contain symbols * and ), but not the sequence *). Moreover, comments may again contain comments: indeed,

(* a := b; (* nested comment *) *)

is a correct statement.

This construct can be recognized by the given rules: the idea is that CommentedText is either a complete comment or any single character (expressed as . in parsing expression grammars), provided that the remaining input does not start with the string *) (this is expressed by the negation predicate ! "*)").
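A direct Python rendering of these two rules may make the mechanics clearer (a sketch, assuming that matching works character by character and returns the new input position, or None on failure):

def comment(s, i):
    # Comment <- "(*" CommentedText* "*)"
    if not s.startswith("(*", i):
        return None
    i += 2
    while True:
        j = commented_text(s, i)
        if j is None:
            break
        i = j
    return i + 2 if s.startswith("*)", i) else None

def commented_text(s, i):
    # CommentedText <- Comment / !"*)" .
    j = comment(s, i)                        # a complete nested comment, or...
    if j is not None:
        return j
    if i < len(s) and not s.startswith("*)", i):
        return i + 1                         # ...any character not starting "*)"
    return None

src = "(* a := b; (* nested comment *) *)"
assert comment(src, 0) == len(src)

Similarly to negation, conjunction in parsing expression grammars is used as a mechanism of lookahead.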

BulletList :  &BulletSymbol List
List       :  Item* !BulletSymbol

These rules define the structure of bullet lists in Markdown. The rule for BulletList first verifies that the first element of the list starts with a bullet symbol, and only then tries to recognize the string according to the rule for List. Elements of the list cannot contain the bullet symbol (this is expressed by !BulletSymbol in the second rule).

A systematic study of Boolean operations in grammars led to conjunctive and Boolean grammars, introduced by Okhotin. In such grammars, a rule

S → A & B

defines all strings produced both by A and B, and a rule

S → A & ¬B

defines all strings produced by A but not produced by B.

ValidIdentifier :  ident & ¬ Keyword
Keyword         :  "var" | "if" | ...

These rules specify, in the most natural way, that an identifier cannot coincide with a keyword.

Conjunctive grammars can define sophisticated non-context-free languages, for example, the copy language with a central marker, wcw. This language abstracts the condition of having each identifier declared before use, as will be shown below.
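For reference, a conjunctive grammar for the copy language { wcw : w ∈ {a, b}* }, along the lines of Okhotin’s original construction, looks as follows (reproduced here as a sketch):

S → C & D
C → XCX | c
D → aA & aD | bB & bD | cE
A → XAX | cEa
B → XBX | cEb
E → XE | ε
X → a | b

Here C checks that the two parts around the central marker c have equal length, while D (with the help of A, B and E) checks that every symbol of the left part reappears at the corresponding position of the right part.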

The main parsing algorithms for context-free grammars, including the cubic-time general parsing algorithm, Earley’s algorithm, recursive descent, and Generalized LR, have been extended to the case of conjunctive and Boolean grammars; they have the same time complexity and are implemented in a prototype parser generator.

Fragment of a Boolean grammar for a simple programming language. Source: http://users.utu.fi/aleokh/papers/conj_bool_programming.pdf

A Boolean grammar was constructed to specify the syntax and static semantics (including scoping rules) of a programming language. This was apparently the first such specification by an efficiently parsable grammatical model. Because the conjunction and negation operators work on entire strings, rather than merely being a lookahead mechanism, the resulting grammar is quite knotty.

A new life for cross-references and scoping?

The Boolean predicates & and ! in parsing expression grammars are, in fact, positive and negative lookahead predicates, respectively. In the rules for bullet lists in Markdown, the predicate &BulletSymbol succeeds if the next symbol of the input is a bullet, and the predicate !BulletSymbol succeeds if the next symbol is not a bullet.

Neither predicate consumes any input: they are only used to check the lookahead symbols in the input, and those lookahead symbols can be regarded as the right context of a string.

Drawing upon both parsing expression grammars and Boolean grammars, grammars with contexts provide a built-in mechanism to specify what the left and right contexts of a substring should be.

ValidIdentifier :  ident & << it was declared before
ValidIdentifier :  ident & >> it will be declared later

These two informal rules state that whenever an identifier is used in a program, its declaration should appear either to its left (<<) or to its right (>>).

Consider the following fragment of a program in an assumed C-like language.

int f() {
    int ms, sec, min;
    ...
    return 60 * min;
}

Let’s write it once again, this time horizontally.
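int f() { int ms, sec, min; ... return 60 * min; }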

To ensure that the identifier min used in the return expression is declared, one can verify that its left context contains a function header (int f() {), the keyword int, other identifiers (ms, sec), a comma, and the declaration of the identifier min, followed by any other constructs ( ; ... return 60 * ), all the way up to the use of min itself.

This can be expressed in a grammar with contexts almost verbatim.

ValidIdentifier  :  
ident & <<== Functions FuncHeader "int" Identifiers CopiedString

This rule finds the substring between two positions in the input: before the declaration of an identifier and after its use. To include the use of the identifier in this substring, a so-called extended left context <<== is used (that is, the extended context of an identifier is its left context concatenated with that very identifier). After the desired substring has been found by the rule, it remains to check whether it belongs to the copy language wcw (and the copy language can be defined by a conjunctive grammar, so everything works).
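The intent of the rule can be imitated in plain Python: take the extended left context of an occurrence and check that it contains a declaration of the same identifier. The following is a deliberately naive, regex-based approximation (the function name and the pattern are assumptions for illustration; the grammar itself does this declaratively through CopiedString):

import re

def declared_before_use(program, use_pos, ident):
    # Extended left context <<==: everything up to and including the use itself.
    ctx = program[: use_pos + len(ident)]
    # The grammar would check that ctx, suitably abstracted, belongs to the
    # copy language wcw; here we approximate this with a regular expression:
    # a declaration of ident somewhere earlier, then anything, then ident.
    pattern = re.compile(
        r"\bint\b[^;]*\b" + re.escape(ident) + r"\b.*\b" + re.escape(ident) + r"$")
    return bool(pattern.search(ctx))

src = "int f() { int ms, sec, min; return 60 * min; }"
assert declared_before_use(src, src.rindex("min"), "min")

bad = "int f() { int ms; return 60 * min; }"
assert not declared_before_use(bad, bad.rindex("min"), "min")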

The standard restriction that forbids redeclaration of identifiers can be now expressed by the following rules:

IntegerDeclaration :  "int" InvalidIdentifier ";"
InvalidIdentifier  :  ident & ¬ ValidIdentifier

It also becomes possible to distinguish between types of identifiers: the rule for a valid identifier breaks up into several rules, one for each type in the language.

ValidIntegerIdentifier  :  
ident & <<== Functions FuncHeader "int" Identifiers CopiedString

The only difference between these rules is in the keyword that should occur in the left context of an identifier use.

ValidBooleanIdentifier  :
ident & <<== Functions FuncHeader "bool" Identifiers CopiedString

Because identifiers are now distinguished according to their type, it makes sense to embed type checking directly into a grammar with contexts.

Assignment  :  ValidIntegerIdentifier "=" IntegerExpression |
ValidBooleanIdentifier "=" BooleanExpression

These rules state that a variable of a certain type can only be assigned an expression of the same type.

The syntax and static semantics of a small typed programming language have been defined by a grammar with contexts. That grammar specifies standard context conditions, such as:

  • the conditional expression of a while loop is of type bool;
  • the number and types of formal parameters and actual arguments in function calls agree;
  • the type of the expression in a return statement is the same as the return type of the function, and so on.

By a set of rather sophisticated rules, the grammar also specifies scopes of visibility: identifiers are visible in the block where they are defined and in all its inner blocks.

Grammars with right contexts can be used in a way similar to parsing expression grammars — as a lookahead mechanism.

BulletList  :  ==>> BulletSymbol anything & List

The extended right context ==>> BulletSymbol anything corresponds to the positive lookahead predicate &BulletSymbol in a parsing expression grammar given earlier.

Several parsing algorithms, including the cubic-time general parsing algorithm, (linear-time) recursive descent, and Generalized LR, have been successfully extended to grammars with contexts. These algorithms have the same time complexity as their prototypes and have been implemented in a parser generator.

Cross-references in a language workbench

Grammars with contexts can define arbitrary cross-references within a string. They have been used to specify an abstract programming language where identifiers can be declared before or after their use (an example motivated by function prototypes in C).

The ability to define cross-references is what makes grammars with contexts applicable to software languages. On the practical side, this ability is also distinctive of parser-based language workbenches, such as Eclipse Xtext.

The definition of a language in Xtext starts with writing a grammar in a metalanguage that is similar to that of ANTLR and supports cross-references.

Variable   :  'var' name=ID ';' ;
Assignment :  left=[Variable] '=' right=Expression ';' ;

The first rule defines what a variable declaration is: it is the keyword var followed by an identifier and a semicolon. The identifier is “stored” in the feature name associated with this rule.

Remark

Xtext creates a model from a grammar using the Eclipse Modeling Framework. In a simplified setting, rules become classes (instances of EClass), and features of rules become fields of those classes. The model is then populated during parsing, essentially resulting in an AST that can be further analyzed or transformed by the user.

In the second rule, [Variable] specifies a reference to an existing Variable, in particular, to its feature name (in the default setting, references are resolved against the feature called name).

Remark

The reference is made to an instance of EClass Variable; if such an instance does not exist, an error is reported. If the instance exists, the reference is associated with the value of its field name.

This is conceptually very similar to grammars with contexts, where a rule for ValidVariable would be used to specify what [Variable] expresses in Xtext.

There is, however, an important difference between the two approaches. To perform static checking, Xtext relies on high-level programming language code, which is either generated automatically by Xtext or written later by a user. (Remark: there are special DSLs for defining typing and scoping rules for languages created in Xtext.)

Xtext + Xtend: checking whether the types of left-hand side and right-hand side parts of an assignment expression match.

In contrast, grammars with contexts are a purely syntactic formalism: after a language is defined by a grammar, no additional code is required to further define its static semantics; it is all there directly in the grammar. This clearly has its advantages: the language is defined solely by a very formal mechanism, and the correctness of this definition can be proven formally.

The definition of a typed programming language by a grammar with contexts is heavily based on the copy language wcw, which is used to check cross-references. This language is defined in a non-trivial manner and would require substantial modification to accommodate possible changes to how identifiers are defined in a language. However, for small sublanguages of larger programming languages (for example, a sublanguage of arithmetical expressions), it might make sense to use pure grammatical models to achieve a higher degree of certainty about the correctness of the definition.

Can the formalism of grammars with contexts be made more approachable? Answering this question might involve implementing an Xtext-like language workbench based on these grammars (all the necessary parsing algorithms exist, have been proven to have decent time complexity, and are implemented in a prototype parser generator) and, most probably, introducing a simplified “front-end” formalism that would then be translated into a grammar with contexts.

For example, in such a formalism, concepts of a language can be annotated with tags (“suffix numbers”) to define cross-references:

NewIdent      :  ident & !DeclaredIdent
DeclaredIdent :  ident.1 & << 'var' ident.1

In the rule for DeclaredIdent, both the ordinary (ident.1) and contextual (<< ... ident.1) occurrences of ident have the same tag, expressing that these identifiers should be identical. That is, this rule expresses that “a declared identifier is any identifier that was declared before”.

A very informal explanation of what the copy language wcw means: “thing-1, arbitrary-junk, thing-1”.
Fragments from the book by Cleaveland and Uzgalis, “Grammars for programming languages”.

Devising an adequate and convenient formalism on top of grammars with contexts is a challenging task. If this task is solved, then implementing a language workbench that supports these grammars would be a matter of engineering.

Conclusions, informally

  1. Context-free grammars are great for defining the structural syntax of languages. They cannot define that “every variable should be declared before its use”.
  2. In real life, one would use attribute grammars to fix this. In even more real life, one uses parser generators (also called “compiler compilers”) that allow specifying semantic actions. And in practice, one should use language workbenches. Reason: apart from generating a parser for your language, they also generate an IDE! (You aren’t going to use a language without an IDE, are you?)
  3. Many attempts have been made to extend context-free grammars so that context conditions could be incorporated into grammars. A lot of these new models fail(ed). Why? They try to crack a nut with a sledgehammer (read: they are Turing-complete and cannot be parsed at all). However, some of them, including parsing expression grammars, conjunctive and Boolean grammars, and grammars with contexts, do not exhibit such behavior (read: they can be parsed, and in most cases, efficiently).
  4. It seems to make sense to continue playing with Chomsky’s original idea of rules applicable only in certain contexts, to see whether these brain exercises can disrupt several decades of “stagnation” in the field of parsing.

UPD. Here is a Twitter thread on limitations of context-free grammars for programming languages.

Bonus point: Can languages be defined without grammars at all? Yes! Have a look at projectional editing and at JetBrains MPS in particular.

Acknowledgements

I am grateful to Alexander Okhotin for discussions and his comments about this post.

What do you think of grammars? What is your experience with parser generators and parser-based language workbenches? I am looking forward to hearing your feedback.
