Recall that
the input to the parser is a sequence of tokens (received interactively,
via calls to the scanner).
The parser:
So to specify the syntax of a programming language, we use a different
formalism, called context-free grammars.
We can write a context-free grammar (CFG) for the language of (very simple)
arithmetic expressions involving only subtraction and division.
In English:
A more compact way to write this grammar is:
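A CFG is a 4-tuple (N, Sigma, P, S) where:
N is a set of nonterminals,
Sigma is a set of terminals (the tokens),
P is a set of productions (grammar rules), and
S, a member of N, is the start nonterminal.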
The language of boolean expressions can be defined in English as follows:
Write a context-free grammar for the language of very simple while loops
(in which the loop body only contains one statement) by adding a new production
with nonterminal stmt on the left-hand side.
The language defined by a context-free grammar is the set of strings
(sequences of terminals) that can be derived from the start
nonterminal.
What does it mean to derive something?
Below is an example derivation, using the 4 productions for the grammar
of arithmetic expressions given above.
In this derivation, we use the actual lexemes instead of the token names
(e.g., we use the symbol "-" instead of MINUS).
Here is another way to state the definition of what it means for
a CFG to define a language:
There are several kinds of derivations that are important.
A derivation is a leftmost derivation if it is
always the leftmost nonterminal that is chosen to
be replaced.
It is a rightmost derivation if it is always the rightmost one.
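The difference between the two kinds of derivation can be sketched in a few lines of Python. The representation below is an assumption made for illustration only: a tree node is a tuple whose first element is the nonterminal and whose remaining elements are its children, and a terminal is a plain string. The function name derivation and its rightmost flag are hypothetical.

```python
def derivation(tree, rightmost=False):
    """Return the sequence of forms produced by expanding one nonterminal at a time.

    A node is (nonterminal, child, ...); a terminal is a plain string.
    """
    form = [tree]

    def render():
        return " ".join(n if isinstance(n, str) else n[0] for n in form)

    steps = [render()]
    while True:
        # Positions in the current form that still hold a nonterminal.
        pending = [i for i, n in enumerate(form) if not isinstance(n, str)]
        if not pending:
            return steps
        i = pending[-1] if rightmost else pending[0]
        form[i:i + 1] = list(form[i][1:])   # replace the nonterminal by its children
        steps.append(render())


# Structure for 1 - 4 / 2 in which the division is grouped under the subtraction.
tree = ("exp",
        ("exp", "1"),
        "-",
        ("exp", ("exp", "4"), "/", ("exp", "2")))

print(" ==> ".join(derivation(tree)))
print(" ==> ".join(derivation(tree, rightmost=True)))
```

Both calls end with the same string, 1 - 4 / 2; only the order in which the nonterminals are replaced differs.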
Another way to derive things using a context-free grammar is to
construct a parse tree (also called a derivation tree) as follows:
Using the example expression grammar, here's a parse tree that
derives 1 - 4 / 2:
Question 1:
Give a derivation for the string:
2 / ( 3 - 4 )
Is your derivation leftmost, rightmost, or neither?
Question 2:
Give a parse tree for the same string.
The string 1 - 4 / 2 has two parse trees using the
example expression grammar.
One was given above; here's the other one:
If for grammar G and string S there is:
In general, ambiguous grammars cause problems:
Since every programming language includes expressions, it is useful to
know how to write a grammar for an expression language so that the grammar
correctly reflects the precedences and associativities of the operators.
To write a grammar whose parse trees express precedence correctly,
use a different nonterminal for each precedence level.
Start by writing a rule for the operator(s) with the lowest precedence
("-" in our case), then write a rule for the operator(s) with the next
lowest precedence, etc:
This grammar captures operator precedence, but it is still ambiguous!
Parse trees using this grammar may not correctly express the fact that
both subtraction and division are left associative; e.g., the
expression:
5-3-2 is equivalent to: ((5-3)-2) and not to:
(5-(3-2)).
Draw two parse trees for the expression 5-3-2 using the
grammar given above;
one that correctly groups 5-3, and one that incorrectly
groups 3-2.
To write a grammar that correctly expresses operator associativity:
Now let's consider a more complete expression grammar, for arithmetic
expressions with addition, multiplication, and exponentiation, as well
as subtraction and division.
We'll use the token POW for the exponentiation operator, and we'll use
"**" as the corresponding lexeme; e.g., "two to the third power" would
be written: 2 ** 3, and the corresponding sequence of tokens
would be: INTLITERAL POW INTLITERAL.
Here's an ambiguous context-free grammar for this language:
Question 1:
Write a grammar for the language of boolean expressions, with two
possible operands, true and false, and three possible
operators: and, or, and not.
First write an ambiguous grammar using only one nonterminal.
Then add nonterminals so that or has lowest precedence, then
and, then not.
Finally, change the grammar to reflect the fact that both
and and or are left associative.
Question 2:
Draw a parse tree for the expression: true and not true.
Another kind of grammar that you will often need to write is a grammar
that defines a list of something.
There are several common forms:
Overview
The output depends on whether the input is a syntactically legal program;
if so, then the output is some representation of the program:
We know that we can use regular expressions to define languages
(for example, the languages of the tokens to be recognized by the scanner).
Can we use them to define the language to be recognized by the parser?
Unfortunately, the answer is no.
Regular expressions are not powerful enough to define many aspects
of a programming language's syntax.
For example, a regular expression cannot be used to specify that the
parentheses in an expression must be balanced, or that every
"else" statement has a corresponding "if".
Furthermore, a regular expression doesn't say anything about
underlying structure.
For example, the following regular expression defines integer arithmetic
involving addition, subtraction, multiplication, and division:
digit+ (("+" | "-" | "*" | "/") digit+)*
but provides no information about the precedence and associativity of
the operators.
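As a small illustration, here is that regular expression transcribed into Python's re syntax. The sketch assumes the input contains no whitespace, and the names ARITH and is_arith are hypothetical. A match only says that the string is in the language; it says nothing about how the operators group, and a parenthesized input is simply rejected.

```python
import re

# digit+ (("+" | "-" | "*" | "/") digit+)*  written for Python's re module.
ARITH = re.compile(r"[0-9]+(?:[-+*/][0-9]+)*")


def is_arith(text):
    """True if the whole string matches the regular expression."""
    return ARITH.fullmatch(text) is not None


print(is_arith("1-4/2"))    # accepted, but with no hint that 4/2 groups first
print(is_arith("(1-4)/2"))  # rejected: the pattern has no notion of parentheses
```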
Example: Simple Arithmetic Expressions
Here is the corresponding CFG:
exp --> INTLITERAL
exp --> exp MINUS exp
exp --> exp DIVIDE exp
exp --> LPAREN exp RPAREN
And here is how to understand the grammar:
exp --> INTLITERAL | exp MINUS exp | exp DIVIDE exp | LPAREN exp RPAREN
Intuitively, the vertical bar means "or", but do not be fooled
into thinking that the right-hand sides of grammar rules can contain
regular expression operators!
This use of the vertical bar is just shorthand for writing multiple
rules with the same left-hand-side nonterminal.
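To make the shorthand concrete, the following sketch expands a rule written with vertical bars into the individual productions it stands for. The function name expand and the choice to represent a production as a (left-hand side, right-hand side) pair are assumptions for illustration.

```python
def expand(rule):
    """Split 'lhs --> alt1 | alt2 | ...' into one (lhs, rhs) pair per alternative."""
    lhs, rhs = rule.split("-->")
    return [(lhs.strip(), tuple(alt.split())) for alt in rhs.split("|")]


compact = "exp --> INTLITERAL | exp MINUS exp | exp DIVIDE exp | LPAREN exp RPAREN"
for lhs, rhs in expand(compact):
    print(lhs, "-->", " ".join(rhs))
```

Running it prints the same four rules as the long form of the grammar, one per line.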
Formal Definition
Example: Boolean Expressions, Assignment Statements, and If Statements
Here is the corresponding CFG:
bexp --> TRUE
bexp --> FALSE
bexp --> bexp OR bexp
bexp --> bexp AND bexp
bexp --> NOT bexp
bexp --> LPAREN bexp RPAREN
Here is a CFG for a language of very simple assignment statements
(only statements that assign a boolean value to an identifier):
stmt --> ID ASSIGN bexp SEMICOLON
We can "combine" the two grammars given above, and add two more rules
to get a grammar that defines the language of (very simple) if statements.
In words, an if statement is:
And here's the grammar:
stmt --> IF LPAREN bexp RPAREN stmt
stmt --> IF LPAREN bexp RPAREN stmt ELSE stmt
stmt --> ID ASSIGN bexp SEMICOLON
bexp --> TRUE
bexp --> FALSE
bexp --> bexp OR bexp
bexp --> bexp AND bexp
bexp --> NOT bexp
bexp --> LPAREN bexp RPAREN
The Language Defined by a CFG
Thus we arrive either at epsilon or at a string of terminals.
That is how we derive a string in the language defined by a CFG.
until the current sequence contains no nonterminals.
exp ==> exp - exp ==> 1 - exp ==> 1 - exp / exp ==>
1 - exp / 2 ==> 1 - 4 / 2
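The derivation above can be replayed mechanically. In this sketch each step names the position of the nonterminal to replace and the right-hand side to replace it with; like the derivation, it uses lexemes, so any string of digits is accepted as an instance of INTLITERAL. The helper names is_production and step are hypothetical.

```python
def is_production(nonterminal, rhs):
    """Check that nonterminal --> rhs is one of the four expression rules."""
    if nonterminal != "exp":
        return False
    if len(rhs) == 1 and rhs[0].isdigit():      # a lexeme standing in for INTLITERAL
        return True
    return rhs in (["exp", "-", "exp"], ["exp", "/", "exp"], ["(", "exp", ")"])


def step(form, index, rhs):
    """Perform one derivation step: replace form[index] by rhs."""
    if not is_production(form[index], rhs):
        raise ValueError(f"no rule {form[index]} --> {' '.join(rhs)}")
    return form[:index] + rhs + form[index + 1:]


form = ["exp"]
print(" ".join(form))
for index, rhs in [(0, ["exp", "-", "exp"]),
                   (0, ["1"]),
                   (2, ["exp", "/", "exp"]),
                   (4, ["2"]),
                   (2, ["4"])]:
    form = step(form, index, rhs)
    print("==>", " ".join(form))
```

The loop stops once the current sequence contains no nonterminals, which is exactly when a string of the language has been derived.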
And here is some useful notation:
So, given the above example, we could write:
exp ==>+ 1 - exp / exp.
L(G) = { w | S ==>+ w }, where
  S is the start nonterminal of G, and
  w is a sequence of terminals or epsilon
Leftmost and Rightmost Derivations
Parse Trees
The derived string is formed by reading the leaf nodes from left to right.
until there are no more leaf nonterminals left.
          exp
         / | \
        /  |  \
       /   |   \
     exp   -   exp
      |        /|\
      |       / | \
      1    exp  /  exp
            |       |
            4       2
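A parse tree like this one is easy to hold in a program. The sketch below assumes a simple encoding chosen for illustration: an interior node is a tuple (nonterminal, child, ...), and a leaf is a string. The function leaves is a hypothetical helper that reads the leaf nodes from left to right to recover the derived string.

```python
# The parse tree for 1 - 4 / 2, with the division nested under the subtraction.
tree = ("exp",
        ("exp", "1"),
        "-",
        ("exp", ("exp", "4"), "/", ("exp", "2")))


def leaves(node):
    """Collect the leaf nodes of a parse tree from left to right."""
    if isinstance(node, str):
        return [node]
    result = []
    for child in node[1:]:      # node[0] is the nonterminal labeling this node
        result.extend(leaves(child))
    return result


print(" ".join(leaves(tree)))
```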
Ambiguous Grammars
            exp
           / | \
          /  |  \
         /   |   \
       exp   /   exp
       /|\        |
      / | \       |
     /  |  \      |
   exp  -  exp    2
    |       |
    1       4
then G is called an ambiguous grammar.
(Note: the three conditions given above are equivalent;
if one is true then all three are true.)
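For one particular string, ambiguity can be checked by counting parse trees. The sketch below does this for the one-nonterminal expression grammar (exp --> INTLITERAL | exp MINUS exp | exp DIVIDE exp | LPAREN exp RPAREN), working on lexemes rather than token names. The function name count_trees is hypothetical, and the approach (try every operator position as the root, with memoization) is just one straightforward way to do the count.

```python
from functools import lru_cache


def count_trees(tokens):
    """Count the parse trees that derive the token sequence from exp."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(i, j):
        # Number of parse trees deriving tokens[i:j] from exp.
        total = 0
        if j - i == 1 and tokens[i].isdigit():
            total += 1                                  # exp --> INTLITERAL
        if j - i >= 3 and tokens[i] == "(" and tokens[j - 1] == ")":
            total += count(i + 1, j - 1)                # exp --> LPAREN exp RPAREN
        for k in range(i + 1, j - 1):
            if tokens[k] in ("-", "/"):                 # exp --> exp op exp, op at k
                total += count(i, k) * count(k + 1, j)
        return total

    return count(0, len(tokens))


print(count_trees("1 - 4 / 2".split()))
print(count_trees("2 / ( 3 - 4 )".split()))
```

A result greater than one for any string shows that the grammar is ambiguous. A result of one for a single string shows nothing about the grammar as a whole.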
Expression Grammars
Precedence
exp --> exp MINUS exp | term
term --> term DIVIDE term | factor
factor --> INTLITERAL | LPAREN exp RPAREN
Now let's try using these new rules to build parse trees for
1 - 4 / 2.
First, a parse tree that correctly reflects the fact that division
has higher precedence than subtraction:
            exp
            /|\
           / | \
          /  |  \
       exp   -   exp
        |         |
      term       term
        |        /|\
        |       / | \
     factor term  / term
        |     |       |
        1  factor  factor
              |       |
              4       2
Now we'll try to construct a parse tree that shows the wrong
precedence:
                 exp
                  |
                 term
                 /|\
                / | \
               /  |  \
           term   /   term
             |          |
         here we        |
     want "-" but we  factor
    cannot derive it    |
     without parens     2
Associativity
To write expression grammars that correctly reflect the associativity
of the operators, you need to understand recursion in grammars.
The grammar given above for arithmetic expressions is both left and
right recursive in nonterminals exp and term (can
you write the derivation steps that show this?).
Here's the correct grammar:
exp --> exp MINUS term | term
term --> term DIVIDE factor | factor
factor --> INTLITERAL | LPAREN exp RPAREN
And here's the (one and only) parse tree that can be built
for 5 - 3 - 2 using this grammar:
            exp
           / | \
          /  |  \
         /   |   \
       exp   -   term
       /|\        |
      / | \       |
     /  |  \      |
   exp  -  term  factor
    |       |      |
  term   factor    2
    |       |
 factor     3
    |
    5
exp --> exp PLUS exp | exp MINUS exp | exp TIMES exp | exp DIVIDE exp |
exp POW exp | LPAREN exp RPAREN | INTLITERAL
First, we'll modify the grammar so that parse trees correctly reflect
the fact that addition and subtraction have the same, lowest
precedence; multiplication and division have the same, middle
precedence; and exponentiation has the highest precedence:
exp --> exp PLUS exp | exp MINUS exp | term
term --> term TIMES term | term DIVIDE term | factor
factor --> factor POW factor | exponent
exponent --> INTLITERAL | LPAREN exp RPAREN
This grammar is still ambiguous; it
does not yet reflect the associativities of the operators.
So next we'll modify the grammar so that parse trees correctly reflect the
fact that all of the operators except exponentiation are left associative (and
exponentiation is right associative; e.g., 2**3**4 is equivalent
to: 2**(3**4)):
exp --> exp PLUS term | exp MINUS term | term
term --> term TIMES factor | term DIVIDE factor | factor
factor --> exponent POW factor | exponent
exponent --> INTLITERAL | LPAREN exp RPAREN
Finally, we'll modify the grammar by adding a unary operator,
unary minus,
which has the highest precedence of all (e.g., -3**4 is
equivalent to: (-3)**4, not to -(3**4)).
Note that the notion of associativity does not apply to unary operators,
since associativity only comes into play in an expression of the form:
x op y op z.
exp --> exp PLUS term | exp MINUS term | term
term --> term TIMES factor | term DIVIDE factor | factor
factor --> exponent POW factor | exponent
exponent --> MINUS exponent | final
final --> INTLITERAL | LPAREN exp RPAREN
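To see the precedences and associativities of this final grammar at work, here is a small evaluator with one method per nonterminal. It is a sketch under several assumptions: the class and function names are hypothetical, the tokenizer is a single regular expression that silently skips characters it does not recognize, division is integer division, and the left-recursive rules for exp and term are written as loops. The right-recursive rule for factor is written as a direct recursive call, which is what makes ** group to the right.

```python
import re


def tokenize(text):
    # Assumed lexemes: integer literals, "**", and one-character operators and parens.
    return re.findall(r"\d+|\*\*|[-+*/()]", text)


class Evaluator:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        token = self.peek()
        self.pos += 1
        return token

    def exp(self):        # exp --> exp PLUS term | exp MINUS term | term
        value = self.term()
        while self.peek() in ("+", "-"):
            if self.advance() == "+":
                value += self.term()
            else:
                value -= self.term()
        return value

    def term(self):       # term --> term TIMES factor | term DIVIDE factor | factor
        value = self.factor()
        while self.peek() in ("*", "/"):
            if self.advance() == "*":
                value *= self.factor()
            else:
                value //= self.factor()
        return value

    def factor(self):     # factor --> exponent POW factor | exponent
        base = self.exponent()
        if self.peek() == "**":
            self.advance()
            return base ** self.factor()    # recursion on the right: right associative
        return base

    def exponent(self):   # exponent --> MINUS exponent | final
        if self.peek() == "-":
            self.advance()
            return -self.exponent()
        return self.final()

    def final(self):      # final --> INTLITERAL | LPAREN exp RPAREN
        token = self.advance()
        if token == "(":
            value = self.exp()
            if self.advance() != ")":
                raise SyntaxError("expected ')'")
            return value
        if token is not None and token.isdigit():
            return int(token)
        raise SyntaxError(f"unexpected token {token!r}")


def evaluate_expression(text):
    evaluator = Evaluator(text)
    value = evaluator.exp()
    if evaluator.peek() is not None:
        raise SyntaxError(f"unexpected token {evaluator.peek()!r}")
    return value


print(evaluate_expression("2 ** 3 ** 2"))   # 512, i.e. 2 ** (3 ** 2)
print(evaluate_expression("-3 ** 2"))       # 9, i.e. (-3) ** 2 under this grammar
print(evaluate_expression("10 - 4 - 3"))    # 3, i.e. (10 - 4) - 3
```

Note that many real languages give unary minus lower precedence than exponentiation; the second result follows the grammar given here, not those languages.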
List Grammars
1. xList --> X | xList xList
2. xList --> X | xList X
3. xList --> X | X xList
1. xList --> X | xList COMMA xList
2. xList --> X | xList COMMA X
3. xList --> X | X COMMA xList
1. xList --> X SEMICOLON | xList xList
2. xList --> X SEMICOLON | xList X SEMICOLON
3. xList --> X SEMICOLON | X SEMICOLON xList
1. xList --> epsilon | X | xList xList
2. xList --> epsilon | X | xList X
3. xList --> epsilon | X | X xList
1. xList --> epsilon | X SEMICOLON | xList xList
2. xList --> epsilon | X SEMICOLON | xList X SEMICOLON
3. xList --> epsilon | X SEMICOLON | X SEMICOLON xList
Either an empty list, or a non-empty list of x's separated
by commas.
We already know how to write a grammar for a non-empty list of
x's separated by commas, so now it's easy to write the grammar:
xList --> epsilon | nonEmptyXList
nonEmptyXList --> X | X COMMA nonEmptyXList
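These two rules translate almost directly into code. The sketch below is a recognizer over a sequence of token names that returns how many X's the list contains; the function names and the decision to return a count are assumptions for illustration. Because nonEmptyXList is right recursive, the function for it can simply call itself on the rest of the input.

```python
def parse_x_list(tokens):
    """xList --> epsilon | nonEmptyXList. Returns the number of X's in the list."""
    if not tokens:
        return 0                                    # xList --> epsilon
    return parse_non_empty_x_list(tokens, 0)


def parse_non_empty_x_list(tokens, pos):
    """nonEmptyXList --> X | X COMMA nonEmptyXList."""
    if pos >= len(tokens) or tokens[pos] != "X":
        raise SyntaxError(f"expected X at position {pos}")
    if pos + 1 == len(tokens):
        return 1                                    # nonEmptyXList --> X
    if tokens[pos + 1] != "COMMA":
        raise SyntaxError(f"expected COMMA at position {pos + 1}")
    return 1 + parse_non_empty_x_list(tokens, pos + 2)


print(parse_x_list([]))
print(parse_x_list(["X", "COMMA", "X", "COMMA", "X"]))
```

A trailing comma, as in X COMMA, is rejected, since every COMMA must be followed by another nonEmptyXList.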