Recall that the input to the parser is a sequence of tokens (received interactively, via calls to the scanner). The parser:
A regular expression can describe the form of simple arithmetic expressions, for example:

    digit+ (("+" | "-" | "*" | "/") digit+)*

but provides no information about the precedence and associativity of the operators.
So to specify the syntax of a programming language, we use a different formalism, called context-free grammars.
We can write a context-free grammar (CFG) for the language of (very simple) arithmetic expressions involving only subtraction and division. In English:
    exp --> INTLITERAL
    exp --> exp MINUS exp
    exp --> exp DIVIDE exp
    exp --> LPAREN exp RPAREN

And here is how to understand the grammar:
A more compact way to write this grammar is:
    exp --> INTLITERAL | exp MINUS exp | exp DIVIDE exp | LPAREN exp RPAREN

Intuitively, the vertical bar means "or", but do not be fooled into thinking that the right-hand sides of grammar rules can contain regular expression operators! This use of the vertical bar is just shorthand for writing multiple rules with the same left-hand-side nonterminal.
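A CFG is a 4-tuple (N, Sigma, P, S) where N is a set of nonterminals, Sigma is a set of terminals (the tokens), P is a set of productions (the grammar rules), and S is the start nonterminal (a member of N).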
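As a rough illustration, the 4-tuple can be written down as a Python value. This is only a sketch under the usual reading of the tuple (N is the set of nonterminals, Sigma the set of terminals, P the productions, S the start nonterminal); the names `CFG`, `EXP_GRAMMAR`, and `is_well_formed` are hypothetical and chosen for this example.

```python
from typing import NamedTuple


class CFG(NamedTuple):
    """A context-free grammar as a 4-tuple (N, Sigma, P, S)."""
    nonterminals: frozenset  # N
    terminals: frozenset     # Sigma
    productions: tuple       # P: (lhs, rhs) pairs, rhs is a tuple of symbols
    start: str               # S


# The subtraction/division expression grammar from the text.
EXP_GRAMMAR = CFG(
    nonterminals=frozenset({"exp"}),
    terminals=frozenset({"INTLITERAL", "MINUS", "DIVIDE", "LPAREN", "RPAREN"}),
    productions=(
        ("exp", ("INTLITERAL",)),
        ("exp", ("exp", "MINUS", "exp")),
        ("exp", ("exp", "DIVIDE", "exp")),
        ("exp", ("LPAREN", "exp", "RPAREN")),
    ),
    start="exp",
)


def is_well_formed(g: CFG) -> bool:
    """Check basic consistency: S is in N, N and Sigma are disjoint,
    and every production uses only known symbols."""
    if g.start not in g.nonterminals:
        return False
    if g.nonterminals & g.terminals:
        return False
    symbols = g.nonterminals | g.terminals
    return all(
        lhs in g.nonterminals and all(sym in symbols for sym in rhs)
        for lhs, rhs in g.productions
    )
```

Writing each alternative as its own `(lhs, rhs)` pair mirrors the point made above: the vertical bar is shorthand for several separate rules with the same left-hand side.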
The language of boolean expressions can be defined in English as follows:
    bexp --> TRUE
    bexp --> FALSE
    bexp --> bexp OR bexp
    bexp --> bexp AND bexp
    bexp --> NOT bexp
    bexp --> LPAREN bexp RPAREN

Here is a CFG for a language of very simple assignment statements (only statements that assign a boolean value to an identifier):

    stmt --> ID ASSIGN bexp SEMICOLON

We can "combine" the two grammars given above, and add two more rules to get a grammar that defines the language of (very simple) if statements. In words, an if statement is:

    stmt --> IF LPAREN bexp RPAREN stmt
    stmt --> IF LPAREN bexp RPAREN stmt ELSE stmt
    stmt --> ID ASSIGN bexp SEMICOLON
    bexp --> TRUE
    bexp --> FALSE
    bexp --> bexp OR bexp
    bexp --> bexp AND bexp
    bexp --> NOT bexp
    bexp --> LPAREN bexp RPAREN
Write a context-free grammar for the language of very simple while loops (in which the loop body only contains one statement) by adding a new production with nonterminal stmt on the left-hand side.
The language defined by a context-free grammar is the set of strings (sequences of terminals) that can be derived from the start nonterminal. What does it mean to derive something?
Below is an example derivation, using the 4 productions for the grammar of arithmetic expressions given above. In this derivation, we use the actual lexemes instead of the token names (e.g., we use the symbol "-" instead of MINUS).
Here is another way to state the definition of what it means for a CFG to define a language:
    L(G) = { w | S ==>+ w }

    where
      S is the start nonterminal of G
      w is a sequence of terminals or epsilon
There are several kinds of derivations that are important. A derivation is a leftmost derivation if it is always the leftmost nonterminal that is chosen to be replaced. It is a rightmost derivation if it is always the rightmost one.
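A leftmost derivation can be replayed mechanically: at each step, find the leftmost nonterminal in the current sequence of symbols and replace it with the right-hand side of a chosen production. The sketch below does this for the four-production expression grammar, using token names rather than lexemes. The function name `leftmost_derivation` and the idea of passing productions by index are assumptions made for this example.

```python
# The four productions of the expression grammar, numbered 0..3.
PRODUCTIONS = [
    ("exp", ["INTLITERAL"]),               # 0
    ("exp", ["exp", "MINUS", "exp"]),      # 1
    ("exp", ["exp", "DIVIDE", "exp"]),     # 2
    ("exp", ["LPAREN", "exp", "RPAREN"]),  # 3
]
NONTERMINALS = {"exp"}


def leftmost_derivation(start, choices):
    """Apply the chosen productions in order, each one to the leftmost
    nonterminal, and return every intermediate sequence of symbols."""
    form = [start]
    steps = [list(form)]
    for index in choices:
        lhs, rhs = PRODUCTIONS[index]
        pos = next((i for i, s in enumerate(form) if s in NONTERMINALS), None)
        if pos is None:
            raise ValueError("no nonterminal left to replace")
        if form[pos] != lhs:
            raise ValueError(f"production {index} does not apply to {form[pos]}")
        form = form[:pos] + rhs + form[pos + 1:]
        steps.append(list(form))
    return steps


# A leftmost derivation of the token sequence for 1 - 4 / 2.
steps = leftmost_derivation("exp", [1, 0, 2, 0, 0])
for form in steps:
    print(" ".join(form))
```

A rightmost derivation would differ only in which nonterminal is picked at each step (the last one in the sequence instead of the first).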
Another way to derive things using a context-free grammar is to construct a parse tree (also called a derivation tree) as follows:
Using the example expression grammar, here's a parse tree that derives 1 - 4 / 2:
              exp
             / | \
            /  |  \
           /   |   \
         exp   -   exp
          |        /|\
          |       / | \
          1    exp  /  exp
                 |     |
                 4     2
Question 1: Give a derivation for the string: 2 / ( 3 - 4 ). Is your derivation leftmost, rightmost, or neither?
Question 2: Give a parse tree for the same string.
The string 1 - 4 / 2 has two parse trees using the example expression grammar. One was given above; here's the other one:
              exp
             / | \
            /  |  \
           /   |   \
         exp   /   exp
         /|\        |
        / | \       |
       /  |  \      |
     exp  -  exp    2
      |       |
      1       4
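The claim that 1 - 4 / 2 has exactly two parse trees can be checked by counting. The sketch below counts the parse trees of a token sequence by trying every way of splitting it among the symbols on a right-hand side, with memoization. It assumes the grammar has no epsilon productions and no cycles of single-symbol rules (both hold for the expression grammar); the names `AMBIGUOUS` and `count_parse_trees` are hypothetical.

```python
from functools import lru_cache

# The one-nonterminal expression grammar from the text.
AMBIGUOUS = {
    "exp": [
        ("INTLITERAL",),
        ("exp", "MINUS", "exp"),
        ("exp", "DIVIDE", "exp"),
        ("LPAREN", "exp", "RPAREN"),
    ],
}


def count_parse_trees(grammar, start, tokens):
    """Count the parse trees that derive `tokens` from `start`.
    Assumes no epsilon productions and no unit-production cycles."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count_symbol(symbol, i, j):
        # Number of trees rooted at `symbol` that derive tokens[i:j].
        if symbol not in grammar:  # a terminal matches exactly one token
            return 1 if j == i + 1 and tokens[i] == symbol else 0
        return sum(count_sequence(rhs, i, j) for rhs in grammar[symbol])

    @lru_cache(maxsize=None)
    def count_sequence(rhs, i, j):
        # Number of ways the symbols of `rhs`, in order, derive tokens[i:j].
        if len(rhs) == 1:
            return count_symbol(rhs[0], i, j)
        total = 0
        # The first symbol takes tokens[i:k]; every remaining symbol
        # needs at least one token, which bounds k from above.
        for k in range(i + 1, j - len(rhs) + 2):
            left = count_symbol(rhs[0], i, k)
            if left:
                total += left * count_sequence(rhs[1:], k, j)
        return total

    return count_symbol(start, 0, len(tokens))


one_minus_four_over_two = ["INTLITERAL", "MINUS", "INTLITERAL", "DIVIDE", "INTLITERAL"]
print(count_parse_trees(AMBIGUOUS, "exp", one_minus_four_over_two))
```

A count greater than one for any string is enough to show that the grammar is ambiguous. A count of one for a particular string says nothing about other strings.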
If for grammar G and string S there is:
In general, ambiguous grammars cause problems:
Since every programming language includes expressions, it is useful to know how to write a grammar for an expression language so that the grammar correctly reflects the precedences and associativities of the operators.
To write a grammar whose parse trees express precedence correctly, use a different nonterminal for each precedence level. Start by writing a rule for the operator(s) with the lowest precedence ("-" in our case), then write a rule for the operator(s) with the next lowest precedence, etc:
    exp    --> exp MINUS exp | term
    term   --> term DIVIDE term | factor
    factor --> INTLITERAL | LPAREN exp RPAREN

Now let's try using these new rules to build parse trees for 1 - 4 / 2. First, a parse tree that correctly reflects the fact that division has higher precedence than subtraction:
                 exp
                 /|\
                / | \
               /  |  \
            exp   -   exp
             |         |
            term      term
             |        /|\
             |       / | \
          factor  term / term
             |     |       |
             1   factor  factor
                   |       |
                   4       2

Now we'll try to construct a parse tree that shows the wrong precedence:
                         exp
                          |
                        term
                         /|\
                        / | \
                       /  |  \
                   term   /   term
                     |         |
                  here we      |
             want "-" but we factor
             cannot derive it  |
              without parens   2
This grammar captures operator precedence, but it is still ambiguous! Parse trees using this grammar may not correctly express the fact that both subtraction and division are left associative; e.g., the expression: 5-3-2 is equivalent to: ((5-3)-2) and not to: (5-(3-2)).
Draw two parse trees for the expression 5-3-2 using the grammar given above; one that correctly groups 5-3, and one that incorrectly groups 3-2.
To write a grammar that correctly expresses operator associativity:
    exp    --> exp MINUS term | term
    term   --> term DIVIDE factor | factor
    factor --> INTLITERAL | LPAREN exp RPAREN

And here's the (one and only) parse tree that can be built for 5 - 3 - 2 using this grammar:
                   exp
                  / | \
                 /  |  \
                /   |   \
              exp   -   term
              /|\        |
             / | \       |
            /  |  \      |
          exp  -  term factor
           |       |     |
          term   factor  2
           |       |
         factor    3
           |
           5
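To see the left-associative grouping concretely, here is a small hand-written parser for this grammar, working on lexemes rather than token names. It is a sketch, not a prescribed method: a plain recursive function for a left-recursive rule such as exp --> exp MINUS term would call itself forever, so the code swaps in the usual workaround of a loop that keeps extending the tree to the left. It also builds a compact operator tree, `(op, left, right)`, rather than the full parse tree drawn above. All function names are hypothetical.

```python
def parse_exp(tokens, pos):
    """exp --> exp MINUS term | term, handled as: term (MINUS term)*"""
    tree, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "-":
        right, pos = parse_term(tokens, pos + 1)
        tree = ("-", tree, right)  # the tree so far becomes the LEFT operand
    return tree, pos


def parse_term(tokens, pos):
    """term --> term DIVIDE factor | factor, handled as: factor (DIVIDE factor)*"""
    tree, pos = parse_factor(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "/":
        right, pos = parse_factor(tokens, pos + 1)
        tree = ("/", tree, right)
    return tree, pos


def parse_factor(tokens, pos):
    """factor --> INTLITERAL | LPAREN exp RPAREN"""
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    tok = tokens[pos]
    if tok == "(":
        tree, pos = parse_exp(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise SyntaxError("expected ')'")
        return tree, pos + 1
    if tok.isdigit():
        return int(tok), pos + 1
    raise SyntaxError(f"unexpected token {tok!r}")


def parse(tokens):
    """Parse a whole list of lexemes and return the operator tree."""
    tree, pos = parse_exp(tokens, 0)
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return tree


print(parse("5 - 3 - 2".split()))
print(parse("1 - 4 / 2".split()))
```

In the result for 5 - 3 - 2 the inner subtraction is the left operand of the outer one, matching the grouping ((5-3)-2). In the result for 1 - 4 / 2 the division sits below the subtraction, matching its higher precedence.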
Now let's consider a more complete expression grammar, for arithmetic expressions with addition, multiplication, and exponentiation, as well as subtraction and division. We'll use the token POW for the exponentiation operator, and we'll use "**" as the corresponding lexeme; e.g., "two to the third power" would be written: 2 ** 3, and the corresponding sequence of tokens would be: INTLITERAL POW INTLITERAL. Here's an ambiguous context-free grammar for this language:
    exp --> exp PLUS exp | exp MINUS exp | exp TIMES exp | exp DIVIDE exp
          | exp POW exp | LPAREN exp RPAREN | INTLITERAL

First, we'll modify the grammar so that parse trees correctly reflect the fact that addition and subtraction have the same, lowest precedence; multiplication and division have the same, middle precedence; and exponentiation has the highest precedence:
    exp      --> exp PLUS exp | exp MINUS exp | term
    term     --> term TIMES term | term DIVIDE term | factor
    factor   --> factor POW factor | exponent
    exponent --> INTLITERAL | LPAREN exp RPAREN

This grammar is still ambiguous; it does not yet reflect the associativities of the operators. So next we'll modify the grammar so that parse trees correctly reflect the fact that all of the operators except exponentiation are left associative (and exponentiation is right associative; e.g., 2**3**4 is equivalent to: 2**(3**4)):
    exp      --> exp PLUS term | exp MINUS term | term
    term     --> term TIMES factor | term DIVIDE factor | factor
    factor   --> exponent POW factor | exponent
    exponent --> INTLITERAL | LPAREN exp RPAREN

Finally, we'll modify the grammar by adding a unary operator, unary minus, which has the highest precedence of all (e.g., -3**4 is equivalent to: (-3)**4, not to -(3**4)). Note that the notion of associativity does not apply to unary operators, since associativity only comes into play in an expression of the form: x op y op z.
    exp      --> exp PLUS term | exp MINUS term | term
    term     --> term TIMES factor | term DIVIDE factor | factor
    factor   --> exponent POW factor | exponent
    exponent --> MINUS exponent | final
    final    --> INTLITERAL | LPAREN exp RPAREN
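One way to check that this final grammar gives the intended groupings is to write an evaluator with one method per nonterminal. The sketch below makes several assumptions that are not part of the grammar itself: the left-recursive rules are handled with loops, "/" is evaluated with Python's true division, and the tokenizer (a single regular expression) silently skips characters it does not recognize. The class and function names are hypothetical.

```python
import re


def tokenize(text):
    # "**" is listed before the single-character operators so that POW
    # is not split into two TIMES lexemes.
    return re.findall(r"\*\*|\d+|[-+*/()]", text)


class Evaluator:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def exp(self):
        # exp --> exp PLUS term | exp MINUS term | term  (left associative)
        value = self.term()
        while self.peek() in ("+", "-"):
            op = self.advance()
            right = self.term()
            value = value + right if op == "+" else value - right
        return value

    def term(self):
        # term --> term TIMES factor | term DIVIDE factor | factor
        value = self.factor()
        while self.peek() in ("*", "/"):
            op = self.advance()
            right = self.factor()
            value = value * right if op == "*" else value / right
        return value

    def factor(self):
        # factor --> exponent POW factor | exponent  (right associative)
        base = self.exponent()
        if self.peek() == "**":
            self.advance()
            return base ** self.factor()
        return base

    def exponent(self):
        # exponent --> MINUS exponent | final
        if self.peek() == "-":
            self.advance()
            return -self.exponent()
        return self.final()

    def final(self):
        # final --> INTLITERAL | LPAREN exp RPAREN
        tok = self.advance()
        if tok == "(":
            value = self.exp()
            if self.advance() != ")":
                raise SyntaxError("expected ')'")
            return value
        if tok is not None and tok.isdigit():
            return int(tok)
        raise SyntaxError(f"unexpected token {tok!r}")


def evaluate(text):
    ev = Evaluator(tokenize(text))
    value = ev.exp()
    if ev.peek() is not None:
        raise SyntaxError(f"unexpected token {ev.peek()!r}")
    return value


print(evaluate("2 ** 3 ** 2"))  # grouped as 2 ** (3 ** 2)
print(evaluate("-3 ** 4"))      # grouped as (-3) ** 4 under this grammar
```

Note that Python's own expression grammar makes a different choice for unary minus: there, -3**4 means -(3**4). The evaluator follows the grammar in the text, not Python's rules.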
Question 1: Write a grammar for the language of boolean expressions, with two possible operands: true false, and three possible operators: and or not. First write an ambiguous grammar using only one nonterminal. Then add nonterminals so that or has lowest precedence, then and, then not. Finally, change the grammar to reflect the fact that both and and or are left associative.
Question 2: Draw a parse tree for the expression: true and not true.
Another kind of grammar that you will often need to write is a grammar that defines a list of something. There are several common forms:
One or more X's, with no separator or terminator:

    1. xList --> X | xList xList
    2. xList --> X | xList X
    3. xList --> X | X xList

One or more X's, separated by commas:

    1. xList --> X | xList COMMA xList
    2. xList --> X | xList COMMA X
    3. xList --> X | X COMMA xList

One or more X's, each terminated by a semicolon:

    1. xList --> X SEMICOLON | xList xList
    2. xList --> X SEMICOLON | xList X SEMICOLON
    3. xList --> X SEMICOLON | X SEMICOLON xList

Zero or more X's, with no separator or terminator:

    1. xList --> epsilon | X | xList xList
    2. xList --> epsilon | X | xList X
    3. xList --> epsilon | X | X xList

Zero or more X's, each terminated by a semicolon:

    1. xList --> epsilon | X SEMICOLON | xList xList
    2. xList --> epsilon | X SEMICOLON | xList X SEMICOLON
    3. xList --> epsilon | X SEMICOLON | X SEMICOLON xList

Zero or more X's, separated by commas:

    xList         --> epsilon | nonEmptyXList
    nonEmptyXList --> X | X COMMA nonEmptyXList
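The last grammar, with its separate nonEmptyXList nonterminal, translates almost directly into a recognizer: one function per nonterminal, one branch per alternative. The sketch below assumes the input is a list of token names; the function names are hypothetical.

```python
def is_x_list(tokens):
    """xList --> epsilon | nonEmptyXList"""
    if not tokens:  # the epsilon alternative
        return True
    return is_non_empty_x_list(tokens, 0)


def is_non_empty_x_list(tokens, pos):
    """nonEmptyXList --> X | X COMMA nonEmptyXList"""
    if pos >= len(tokens) or tokens[pos] != "X":
        return False
    if pos + 1 == len(tokens):  # the single-X alternative
        return True
    # Otherwise a COMMA must follow, and then another non-empty list.
    return tokens[pos + 1] == "COMMA" and is_non_empty_x_list(tokens, pos + 2)


print(is_x_list([]))
print(is_x_list(["X", "COMMA", "X"]))
print(is_x_list(["X", "COMMA"]))
```

The extra nonterminal is what rules out a trailing comma: once a COMMA has been seen, the empty alternative is no longer available.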
To write a grammar for a whole programming language, break down the problem into pieces. For example, think about a Java program: a program consists of one or more classes:
    program   --> classlist
    classlist --> class | class classlist

A class is the word "class", optionally preceded by the word "public", followed by an identifier, followed by an open curly brace, followed by the class body, followed by a closing curly brace:
    class --> PUBLIC CLASS ID LCURLY classbody RCURLY
            | CLASS ID LCURLY classbody RCURLY

A class body is a list of zero or more field and/or method definitions:
    classbody --> epsilon | deflist
    deflist   --> def | def deflist

and so on.
To understand how a parser works, we start by understanding context-free grammars, which are used to define the language recognized by the parser. Important terminology includes:
Two common kinds of grammars are grammars for expression languages and grammars for lists. It is important to know how to write a grammar for an expression language that expresses operator precedence and associativity. It is also important to know how to write grammars for both non-empty and possibly empty lists, and for lists both with and without separators and terminators.