Context-Free Grammars and Parsing


Source: http://eli-project.sourceforge.net/elionline4.4/syntax_1.html

A context-free grammar is a formal system that describes a language by specifying how any legal text can be derived from a distinguished symbol called the axiom, or sentence symbol. It consists of a set of productions, each of which states that a given symbol can be replaced by a given sequence of symbols. To derive a legal text, the grammar is used as data for the following algorithm:

  1. Let text be a single occurrence of the axiom.

  2. If no production states that a symbol currently in text can be replaced by some sequence of symbols, then stop.

  3. Rewrite text by replacing one of its symbols with a sequence according to some production.

  4. Go to step (2).

When this algorithm terminates, text is a legal text in the language. The phrase structure of that text is the hierarchy of sequences used in its derivation.

Given a context-free grammar that satisfies certain conditions, Eli can generate a parsing routine to determine the derivation (and hence the phrase structure) of any legal text. This routine will also automatically detect and report any errors in the text, and repair them to produce a correct phrase structure (which may not be that intended by the person who wrote the erroneous text).

How to describe a context-free grammar

Each production of a context-free grammar consists of a symbol to be replaced and the sequence that replaces it. This can be represented in a type-`con' file by giving the symbol to be replaced, followed by a colon, followed by the sequence that replaces it, followed by a period:

Assignment: Variable ':=' Expression.
StatementList: .
Statement:
   'if' Expression 'then' Statement
   'else' Statement.

The first production asserts that the symbol Assignment can be replaced by the sequence consisting of the three symbols Variable, ':=', and Expression. Any occurrence of the symbol StatementList can be replaced by an empty sequence according to the second production. In the third production, you see that new lines can be used as separators in the description of a production. This notation is often more commonly referred to as Backus Naur Form, or just BNF.

Symbols that are to be replaced are called nonterminals, and are always represented by identifiers. (An identifier is a sequence of letters and digits, the first of which is a letter.) Every nonterminal must appear before a colon in at least one production. The axiom is a nonterminal that appears before the colon in exactly one production, and does not appear between the colon and the period in any production. There must be exactly one nonterminal satisfying the conditions for the axiom.

Symbols that cannot be replaced are called terminals, and may be represented by either identifiers or literals. (A literal is a sequence of characters bounded by apostrophes ('). An apostrophe appearing within a literal is represented by two successive apostrophes.) No terminal may appear before a colon in any production. Terminals represent character strings that are recognized by the lexical analyzer (see Lexical Analysis).

Using extended BNF to describe more complex rules

Extended BNF allows the use of certain operators on the right hand side of a production. These operators are designed to be short-hands to simplify the grammar description. Rules with extended BNF operators can be translated into rules which use only the strict BNF constructs described so far. While the use of extended BNF constructs is supported for the concrete syntax description in Eli, only strict BNF constructs are allowed in the abstract syntax. When it comes time to deduce the correspondence between the concrete and abstract syntax, Maptool operates on the abstract syntax and a version of the concrete syntax in which all rules containing extended BNF constructs have been translated into equivalent strict BNF rules.

The remainder of this section is devoted to describing how each of the extended BNF constructs are translated to their strict BNF equivalents. Note that most of the EBNF constructs require the introduction of generated symbols for their strict BNF translation. Users are strongly discouraged from using these constructs in instances where attribution is required for those contexts, because changes in the grammar will change the names of the generated symbols used.

The most appropriate use of EBNF constructs that introduce generated symbols is when matching the LIDO LISTOF construct, since the LISTOF construct makes no assumptions about the phrase structure of the list. For a description of the LISTOF construct, see Productions of LIDO - Reference Manual.

When a grammar contains many productions specifying replacement of the same nonterminal, a slash, denoting alternation can be used to avoid re-writing the symbol being replaced:

Statement:
   Variable ':=' Expression /
   'if' Expression 'then' Statement 'else' Statement /
   'while' Expression 'do' Statement .

This alternation specifies three productions. The nonterminal to be replaced is Statement in each case. Possible replacement sequences are separated by slashes (/). The strict BNF translation for the above example is:

Statement: Variable ':=' Expression .
Statement: 'if' Expression 'then' Statement 'else' Statement . 
Statement: 'while' Expression 'do' Statement . 

Alternation does not introduce any generated symbols and has a very straight-forward translation. As a result, it is the most heavily used of the EBNF constructs.

Square brackets are used to denote that the set of symbols enclosed by the brackets are optional. In the following example, Constants and Variables are optional, but Body is not:

Program: [Constants] [Variables] Body .

The strict BNF translation of this construct is to generate a rule for each possible permutation of the right hand side. In the case of the above example, the following four rules would result:

Program: Body .
Program: Variables Body .
Program: Constants Body .
Program: Constants Variables Body .

While the translation doesn't introduce any generated symbols, indiscriminate use of this construct may lead to less readable specifications.

An asterisk (or star) is used to denote zero or more occurrences of the phrase to which it is applied. In the following example, Program consists of zero or more occurrences of Variable followed by Body:

Program: Variable* Body .

The strict BNF translation of this construct requires the introduction of a generated symbol. Generated symbols begin with the letter G and are followed by a unique number. Generated symbols are chosen to not conflict with existing symbols in the concrete syntax. No check is performed to ensure that the generated symbols do not conflict with symbols in the abstract syntax, so users should avoid using symbols of this form in their abstract syntax. The translation for the above example is as follows:

Program: G1 Body .
G1: G1 Variable .
G1: .

A plus is used to denote one or more occurrences of the phrase to which it is applied. In the following example, Program consists of one or more occurrences of Variable followed by Body:

Program: Variable+ Body .

The strict BNF translation of this construct is similar to the translation of the asterisk (see Using extended BNF to describe more complex rules). The translation for the above example is as follows:

Program: G1 Body .
G1: G1 Variable .
G1: Variable .

A double slash is used to denote one or more occurrences of a phrase separated by a symbol. In the following example, Input is a sequence of one or more Declaration's separated by a comma:

Input: Declaration // ',' .

The strict BNF translation for the above example is as follows:

Input: G1 .
G1: G2 .
G1: G1 ',' G2 .
G2: Declaration .

Note that all of the EBNF constructs, except the single slash (for alternation) have higher precedence than the separator construct.

Parentheses are used to group EBNF constructs. This is used primarily to apply other EBNF operators to more than a single symbol. For example:

Program: (Definition Use)+ .

In this example, we want to apply the Plus operator to the concatenation of a Definition and a Use. The result denotes one or more occurrences of Definition's followed by Use's. The strict BNF translation for the above is:

Program: G2 .
G1: Definition Use .
G2: G1 .
G2: G2 G1 .

This is identical to the translation for the Plus operator operating on a single symbol, except that another generated symbol is created to represent the parenthetical phrase.

Note that a common error is to introduce parentheses where they are not needed. This will result in the introduction of unexpected generated symbols.

Using structure to convey meaning

A production is a construct with two components: the symbol to be replaced and the sequence that replaces it. We defined the meaning of the production in terms of those components, saying that whenever the symbol was found in text, it could be replaced by the sequence. This is the general approach that we use in defining the meaning of constructs in any language. For example, we say that an assignment is a statement with two components, a variable and an expression. The meaning of the assignment is to replace the value of the variable with the value resulting from evaluating the expression.

The context-free grammar for a language specifies a "component" relationship. Each production says that the components of the phrase represented by the symbol to be replaced are the elements of the sequence that replaces it. To be useful, the context-free grammar for a language should embody exactly the relationship that we use in defining the meanings of the constructs of that language.

Operator precedence

Consider the following expressions:

A + B * C
(A + B) * C

In the first expression, the operands of the addition are the variable A and the product of the variables B and C. The reason is that in normal mathematical notation, multiplication takes precedence over addition. Parentheses have been used in the second expression to indicate that the operands of the multiplication are the sum of variables A and B, and the variable C.

The general method for embodying this concept of operator precedence in a context-free grammar for expressions is to associate a distinct nonterminal with each precedence level, and one with operands that do not contain "visible" operators. For our expressions, this requires three nonterminals:

Sum
An expression whose operator is +

Term
An expression whose operator is *

Primary
An expression not containing "visible" operators

The productions that embody the concept of operator precedence would then be:

Sum: Sum '+' Term / Term.
Term: Term '*' Primary / Primary.
Primary: '(' Sum ')' / Identifier.

Operator associativity

Consider the following expressions:

A - B - C
A ** B ** C
A < B < C

Which operator has variable B as an operand in each case?

This question can be answered by stating an association for each operator: If - is "left-associative", then the first expression is interpreted as though it had been written (A-B)-C. Saying that ** is "right-associative" means that the second expression is interpreted as though it had been written A**(B**C). The language designer may wish to disallow the third expression by saying that < is "non-associative".

Association rules are embodied in a context-free grammar by selecting appropriate nonterminals to describe the operands of an operator. For each operator, two nonterminals must be known: the nonterminal describing expressions that may contain that operator, and the nonterminal describing expressions that do not contain that operator but may be operands of that operator. Usually these nonterminals have been established to describe operator precedence. Here is a typical set of nonterminals used to describe expressions:

Relation
An expression whose operator is < or >

Sum
An expression whose operator is + or -

Term
An expression whose operator is * or /

Factor
An expression whose operator is **

Primary
An expression not containing "visible" operators

The association rules discussed above would therefore be expressed by the following productions (these are not the only productions in the grammar):

Sum: Sum '-' Term.
Factor: Primary '**' Factor.
Relation: Sum '<' Sum.

The first production says that the left operand of - can contain other - operators, while the right operand cannot (unless the subexpression containing them is surrounded by parentheses). Similarly, the right operand of ** can contain other ** operators but the left operand cannot. The third rule says that neither operand of < can contain other < operators.

Scope rules for declarations

Identifiers are normally given meaning by declarations. The meaning given to an identifier by a particular declaration holds over some portion of the program, called the scope of that declaration. A context-free grammar for a language should define a phrase structure that is consistent with the scope rules of that language.

For example, the declaration of a procedure P within the body of procedure Q gives meaning to the identifier P, and its scope might be the body of the procedure Q. If P has parameters, the scope of their declarations (which are components of the procedure declaration) is the body of procedure Q.

Now consider the following productions describing a procedure declaration:

procedure_declaration: 'procedure' procedure_heading procedure_body.
procedure_heading:
   ProcIdDef formal_parameter_part ';' specification_part.

Notice that the phrase structure induced by these productions is inconsistent with the postulated scope rules. The declaration of P (ProcIdDef) is in the same phrase (procedure_heading) as the declarations of the formal parameters. This defect can be remedied by a slight change in the productions:

procedure_declaration: 'procedure' ProcIdDef ProcRange.
ProcRange:
   formal_parameter_part ';' value_part specification_part procedure_body.

Here the formal parameters and the body have both been made components of a single phrase (ProcRange), which defines the scope of the formal parameter declarations. The declaration of P lies outside of this phrase, thus allowing its scope to be differentiated from that of the formal parameters.