Recall that the parser must produce output (e.g., an abstract-syntax
tree) for the next phase of the compiler.
This involves doing a syntax-directed translation -- translating
from a sequence of tokens to some other form, based on the underlying
syntax.
A syntax-directed translation is defined by augmenting the CFG:
a translation rule is defined for each production.
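A translation rule defines the translation of the left-hand-side nonterminal
as a function of constants, the translations of the right-hand-side
nonterminals, and the values of the right-hand-side tokens (e.g., the integer
value associated with an INTLITERAL token).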
Below is the definition of a syntax-directed translation that
translates an arithmetic expression to its integer value.
When a nonterminal occurs more than once in a grammar rule, the
corresponding translation rule uses subscripts to identify a
particular instance of that nonterminal.
For example, the rule exp -> exp + term has two exp nonterminals;
exp1 means the left-hand-side exp, and
exp2 means the right-hand-side exp.
Also, the notation xxx.value is used to mean the value associated
with token xxx.
Consider the following CFG, which defines expressions that use
the three operators: +, &&, ==.
Let's define a syntax-directed translation that type checks these
expressions; i.e., for type-correct expressions, the translation
will be the type of the expression (either INT or BOOL), and
for expressions that involve type errors, the translation will be
the special value ERROR.
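We'll use the following type rules: both operands of + must be INT,
both operands of && must be BOOL, and both operands of == must have
the same type, which must not be ERROR.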
The following grammar defines the language of base-2 numbers:
So far, our example syntax-directed translations have produced simple
values (an int or a type) as the translation of an input.
In practice, however, we want the parser to build an abstract-syntax tree
as the translation of an input program.
But that is not really so different from what we've seen so far;
we just need to use tree-building operations in the translation
rules instead of, e.g., arithmetic operations.
First, let's consider how an abstract-syntax tree (AST) differs
from a parse tree.
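An AST can be thought of as a condensed form of the parse tree:
operators appear at internal nodes instead of at leaves, chains of single
productions are collapsed, lists are flattened, and syntactic details
(e.g., parentheses and semicolons) are omitted.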
Below is an example of the parse tree and the AST for the expression
3 * (4 + 2) (using the usual arithmetic-expression grammar that
reflects the precedences and associativities of the operators).
Note that the parentheses are not needed in the AST because the structure of
the AST defines how the subexpressions are grouped.
For constructs other than expressions, the compiler writer has some choices
when defining the AST -- but remember that lists (e.g., lists of declarations,
lists of statements, lists of parameters) should be flattened, that operators
(e.g., "assign", "while", "if") go at internal nodes, not at leaves, and
that syntactic details are omitted.
For example:
To define a syntax-directed translation so that the translation of an input
is the corresponding AST, we first need operations that create AST nodes.
Let's use Java code, and assume that we have the following class hierarchy:
Illustrate the syntax-directed translation defined above by drawing the
parse tree for the expression 2 + 3 * 4, and annotating the
parse tree with its translation (i.e., each nonterminal in the tree will have
a pointer to the AST node that is the root of the subtree of the AST that
is the nonterminal's translation).
Now we consider how to implement a syntax-directed translation using
a predictive parser.
It is not obvious how to do this, since the predictive parser works by
building the parse tree top-down, while the syntax-directed translation
needs to be computed bottom-up.
Of course, we could design the parser to actually build the parse tree
(top-down), then use the translation rules to build the translation
(bottom-up).
However, that would be wasteful: the whole parse tree would have to be
built and stored, and then traversed a second time.
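Instead, we avoid explicitly building the parse tree by giving the parser
a second stack, called the semantic stack, which holds the translations of
nonterminals.
Values are pushed onto and popped off the semantic stack by actions that
are added to the grammar rules.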
For example, consider the following syntax-directed translation for the
language of balanced parentheses and square brackets.
The translation of a string in the language is the number of parenthesis
pairs in the string.
Note that since action #3 just pushes exactly what is popped, that action
is redundant, and it is not necessary to have any action associated with
the third grammar rule.
Here's a picture that illustrates what happens when the input "([])" is
parsed (assuming that we have removed action #3):
In the example above, there is no grammar rule with more than one
nonterminal on the right-hand side.
If there were, the translation action for that rule would have to
do one pop for each right-hand-side nonterminal.
For example, suppose we are using a grammar that includes the rule:
methodBody -> { varDecls stmts }, and that the
syntax-directed translation is counting the number of declarations
and statements in each method body (so the translation of varDecls
is the number of derived declarations, the translation of stmts
is the number of derived statements, and the translation of
methodBody is the number of derived declarations and statements).
Another issue that has not been illustrated yet arises when a
left-hand-side nonterminal's translation depends on the value of a
right-hand-side terminal.
In that case, it is important to put the action number before
that terminal symbol when incorporating actions into grammar rules.
This is because a terminal symbol's value is available during the parse
only when it is the "current token".
For example, if the translation of an arithmetic expression is the
value of the expression:
For the following grammar, give (a) translation rules, (b) translation
actions, and (c) a CFG with actions so that the translation of an
input expression is the value of the expression.
Do not worry about the fact that the grammar is not LL(1).
Recall that a non-LL(1) grammar must be transformed to an equivalent
LL(1) grammar if it is to be parsed using a predictive parser.
Recall also that the transformed grammar usually
does not reflect the underlying structure the way the original grammar did.
For example, when left recursion is removed from the grammar for
arithmetic expressions, we get grammar rules like this:
For example:
Transform the grammar rules with actions that you wrote for the
"Test Yourself #3" exercise to LL(1) form.
Trace the actions of the predictive parser on the input 2 + 3 * 4.
Motivation and Definition
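To translate an input string, build its parse tree, then use the
translation rules to compute the translation of each nonterminal in the
tree, working bottom-up.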
The translation of the string is the translation of the parse tree's root
nonterminal.
Example 1
CFG                        Translation rules
===                        =================
exp    -> exp + term       exp1.trans = exp2.trans + term.trans
exp    -> term             exp.trans = term.trans
term   -> term * factor    term1.trans = term2.trans * factor.trans
term   -> factor           term.trans = factor.trans
factor -> INTLITERAL       factor.trans = INTLITERAL.value
factor -> ( exp )          factor.trans = exp.trans
Input
=====
2 * (4 + 5)
Annotated Parse Tree
====================
                exp (18)
                 |
                term (18)
                /|\
               / | \
              /  *  \
             /      factor (9)
            /         /|\
      (2) term       ( | )
            |          |
      (2) factor      exp (9)
            |         /|\
            2        / | \
                    /  |  \
              (4) exp  +  term (5)
                   |       |
            (4) factor   factor (5)
                   |       |
                   4       5
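The bottom-up computation shown in the annotated tree can be sketched in Java.
This is only an illustration under assumptions that are not part of the notes:
the names ValueTranslation, Node, lit, leaf, node, trans, and example are all
hypothetical, and the parse tree for 2 * (4 + 5) is built by hand rather than
by a parser.

```java
import java.util.List;

// Hypothetical sketch: a hand-built parse tree and a bottom-up evaluation of
// the Example 1 translation rules. None of these names come from the notes.
final class ValueTranslation {
    // A parse-tree node: a nonterminal with children, or a token leaf.
    static final class Node {
        final String symbol;   // "exp", "term", "factor", "+", "*", "(", ")", "INTLITERAL"
        final int value;       // meaningful only for INTLITERAL leaves
        final List<Node> kids;
        Node(String symbol, int value, List<Node> kids) {
            this.symbol = symbol;
            this.value = value;
            this.kids = kids;
        }
    }

    static Node lit(int v) { return new Node("INTLITERAL", v, List.of()); }
    static Node leaf(String symbol) { return new Node(symbol, 0, List.of()); }
    static Node node(String symbol, Node... kids) { return new Node(symbol, 0, List.of(kids)); }

    // Translation of a nonterminal node, computed from its children.
    static int trans(Node n) {
        List<Node> k = n.kids;
        if (k.size() == 1) {
            Node only = k.get(0);
            // factor -> INTLITERAL uses the token's value;
            // exp -> term and term -> factor just copy the child's translation.
            return only.symbol.equals("INTLITERAL") ? only.value : trans(only);
        }
        String mid = k.get(1).symbol;
        if (mid.equals("+")) return trans(k.get(0)) + trans(k.get(2)); // exp -> exp + term
        if (mid.equals("*")) return trans(k.get(0)) * trans(k.get(2)); // term -> term * factor
        return trans(k.get(1));                                        // factor -> ( exp )
    }

    // Parse tree for 2 * (4 + 5).
    static Node example() {
        Node four = node("exp", node("term", node("factor", lit(4))));
        Node five = node("term", node("factor", lit(5)));
        Node sum = node("exp", four, leaf("+"), five);
        Node paren = node("factor", leaf("("), sum, leaf(")"));
        Node two = node("term", node("factor", lit(2)));
        return node("exp", node("term", two, leaf("*"), paren));
    }

    public static void main(String[] args) {
        System.out.println(trans(example()));
    }
}
```

Because trans recurses into the children before combining their results, each
nonterminal's translation is computed only after the translations below it are
available, which is the bottom-up order the annotated tree illustrates.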
Example 2
Here is the CFG and the translation rules:
CFG                       Translation rules
===                       =================
exp  -> exp + term        if ((exp2.trans == INT) and (term.trans == INT))
                          then exp1.trans = INT
                          else exp1.trans = ERROR
exp  -> exp && term       if ((exp2.trans == BOOL) and (term.trans == BOOL))
                          then exp1.trans = BOOL
                          else exp1.trans = ERROR
exp  -> exp == term       if ((exp2.trans == term.trans) and (exp2.trans != ERROR))
                          then exp1.trans = BOOL
                          else exp1.trans = ERROR
exp  -> term              exp.trans = term.trans
term -> true              term.trans = BOOL
term -> false             term.trans = BOOL
term -> INTLITERAL        term.trans = INT
term -> ( exp )           term.trans = exp.trans
Input
=====
( 2 + 2 ) == 4
Annotated Parse Tree
====================
                 exp (BOOL)
                 /|\
                / | \
        (INT) exp == term (INT)
               |      |
               |      4
        (INT) term
              /|\
             / | \
            /  |  \
           (  exp  )
             (INT)
              /|\
             / | \
     (INT) exp + term (INT)
            |     |
     (INT) term   2
            |
            2
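The three operator rules of this type-checking translation can be sketched as
small Java functions. The names TypeCheckSketch, Type, plus, and, equalsOp, and
example are hypothetical (they are not from the notes), and the rules are
applied by hand in the order the annotated tree computes them instead of being
driven by a parser.

```java
// Hypothetical sketch of the type rules for +, && and ==.
final class TypeCheckSketch {
    enum Type { INT, BOOL, ERROR }

    // exp -> exp + term : both operands must be INT
    static Type plus(Type exp2, Type term) {
        return (exp2 == Type.INT && term == Type.INT) ? Type.INT : Type.ERROR;
    }

    // exp -> exp && term : both operands must be BOOL
    static Type and(Type exp2, Type term) {
        return (exp2 == Type.BOOL && term == Type.BOOL) ? Type.BOOL : Type.ERROR;
    }

    // exp -> exp == term : operands must have the same non-ERROR type
    static Type equalsOp(Type exp2, Type term) {
        return (exp2 == term && exp2 != Type.ERROR) ? Type.BOOL : Type.ERROR;
    }

    // ( 2 + 2 ) == 4, evaluated bottom-up as in the annotated parse tree.
    static Type example() {
        Type inner = plus(Type.INT, Type.INT);   // 2 + 2
        return equalsOp(inner, Type.INT);        // ( ... ) == 4
    }

    public static void main(String[] args) {
        System.out.println(example());
    }
}
```

Note that ERROR propagates: once a subexpression translates to ERROR, every
rule above yields ERROR for the enclosing expression as well.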
B -> 0
  -> 1
  -> B 0
  -> B 1
Define a syntax-directed translation so that the translation of a binary
number is its base 10 value.
Illustrate your translation scheme by drawing the parse tree for
1001 and annotating each nonterminal in the tree with
its translation.
Building an Abstract-Syntax Tree
The AST vs the Parse Tree
In general, the AST is a better structure for later stages of the compiler
because it omits details having to do with the source language, and just
contains information about the essential structure of the program.
  Parse Tree               Abstract Syntax Tree
  ==========               ====================
       exp                         *
        |                         / \
      term                       3   +
       /|\                          / \
   term * factor                   4   2
    |      /|\
    |     / | \
  factor ( exp )
    |      /|\
    3   exp + term
         |     |
        term factor
         |     |
       factor  2
         |
         4
 Input                           Parse Tree
 =====                           ==========
 {                         _____ methodBody _________
   x = 0;               /      /           \         \
   while (x<10) {     {   declList      stmtList      }
      x = x+1;               |            /    \
   }                      epsilon   stmtList     stmt___
   y = x*2;                          /   \       / | \  \
 }                           stmtList    stmt  ID  = exp ;
                              /    |        \ (y)   / | \
                      stmtList   stmt        ... exp  *  term
                         |      / | | \           |       |
                      epsilon ID  = exp ;        term   factor
                             (x)     |            |       |
                                INTLITERAL      factor   INT
                                    (0)           |      (2)
                                                 ID
                                                 (x)
 AST
 ===
             methodBody
             /        \
     declList          stmtList
                       /   |   \
                 assign  while  assign
                  /  \    ...    /  \
                ID    INT      ID    *
               (x)    (0)     (y)   / \
                                  ID   INT
                                 (x)   (2)
Note that in the AST there is just one stmtList node, with a list of three
children (the list of statements has been "flattened").
Also, the "operators" for the statements (assign and
while) have been "moved up" to internal nodes (instead of
appearing as tokens at the leaves).
And finally, syntactic details (curly braces and semi-colons)
have been omitted.
Translation Rules That Build an AST
class ExpNode { }

class IntLitNode extends ExpNode {
    public IntLitNode( int val ) { ... }
}

class PlusNode extends ExpNode {
    public PlusNode( ExpNode e1, ExpNode e2 ) { ... }
}

class TimesNode extends ExpNode {
    public TimesNode( ExpNode e1, ExpNode e2 ) { ... }
}
Now we can define a syntax-directed translation for simple arithmetic expressions,
so that the translation of an expression is its AST:
CFG                        Translation rules
===                        =================
exp    -> exp + term       exp1.trans = new PlusNode(exp2.trans, term.trans)
exp    -> term             exp.trans = term.trans
term   -> term * factor    term1.trans = new TimesNode(term2.trans, factor.trans)
term   -> factor           term.trans = factor.trans
factor -> INTLITERAL       factor.trans = new IntLitNode(INTLITERAL.value)
factor -> ( exp )          factor.trans = exp.trans
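To see these rules at work, here is a minimal Java sketch that applies them by
hand to the expression 3 * (4 + 2). The node classes follow the hierarchy
assumed in these notes, but the constructor bodies, the unparse method, and the
wrapper class AstSketch are assumptions added only for illustration.

```java
// Hypothetical sketch: the node classes with assumed bodies, plus an
// unparse() method (not in the notes) so the resulting tree can be inspected.
final class AstSketch {
    static abstract class ExpNode {
        abstract String unparse();
    }

    static final class IntLitNode extends ExpNode {
        private final int val;
        IntLitNode(int val) { this.val = val; }
        String unparse() { return Integer.toString(val); }
    }

    static final class PlusNode extends ExpNode {
        private final ExpNode e1, e2;
        PlusNode(ExpNode e1, ExpNode e2) { this.e1 = e1; this.e2 = e2; }
        String unparse() { return "(" + e1.unparse() + " + " + e2.unparse() + ")"; }
    }

    static final class TimesNode extends ExpNode {
        private final ExpNode e1, e2;
        TimesNode(ExpNode e1, ExpNode e2) { this.e1 = e1; this.e2 = e2; }
        String unparse() { return "(" + e1.unparse() + " * " + e2.unparse() + ")"; }
    }

    // Applies the translation rules bottom-up for 3 * (4 + 2).
    static ExpNode example() {
        // factor -> INTLITERAL, once per literal
        ExpNode three = new IntLitNode(3);
        ExpNode four = new IntLitNode(4);
        ExpNode two = new IntLitNode(2);
        // exp -> exp + term; factor -> ( exp ) then passes the same node up unchanged
        ExpNode sum = new PlusNode(four, two);
        // term -> term * factor
        return new TimesNode(three, sum);
    }

    public static void main(String[] args) {
        System.out.println(example().unparse());
    }
}
```

The rules exp -> term, term -> factor, and factor -> ( exp ) create no nodes
of their own, which is why chains of single productions and the parentheses
leave no trace in the AST; the grouping survives only as tree structure.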
Syntax-Directed Translation and Predictive Parsing
So what actually happens is that the action for a grammar rule
X -> Y1 Y2 ... Yn
is pushed onto the (normal) stack when the derivation step
X -> Y1 Y2 ... Yn is made,
but the action is not actually performed until complete derivations for all
of the Y's have been carried out.
Example: Counting Parentheses
CFG                  Translation Rules
===                  =================
exp -> epsilon       exp.trans = 0
    -> ( exp )       exp1.trans = exp2.trans + 1
    -> [ exp ]       exp1.trans = exp2.trans
The first step is to replace the translation rules with translation
actions.
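Each action must pop the translations of all right-hand-side nonterminals
from the semantic stack, then compute and push the translation of the
left-hand-side nonterminal.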
Here are the translation actions:
CFG                  Translation Actions
===                  ===================
exp -> epsilon       push(0);
    -> ( exp )       exp2Trans = pop(); push( exp2Trans + 1 );
    -> [ exp ]       exp2Trans = pop(); push( exp2Trans );
Next, each action is represented by a unique action number, and those
action numbers become part of the grammar rules:
CFG with Actions
================
exp -> epsilon #1
    -> ( exp ) #2
    -> [ exp ] #3

#1: push(0);
#2: exp2Trans = pop(); push( exp2Trans + 1 );
#3: exp2Trans = pop(); push( exp2Trans );
input so far   stack                semantic stack   action
------------   -----                --------------   ------
(              exp EOF                               pop, push "( exp ) #2"
(              ( exp ) #2 EOF                        pop, scan
([             exp ) #2 EOF                          pop, push "[ exp ]"
([             [ exp ] ) #2 EOF                      pop, scan
([]            exp ] ) #2 EOF                        pop, push epsilon #1
([]            #1 ] ) #2 EOF                         pop, do action
([]            ] ) #2 EOF           0                pop, scan
([])           ) #2 EOF             0                pop, scan
([]) EOF       #2 EOF               0                pop, do action
([]) EOF       EOF                  1                pop, scan
([]) EOF       empty stack: input accepted!
               translation of input = 1
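The trace above can be reproduced with a small table-driven sketch in Java.
Everything here is an assumption made for illustration: the class name
ParenCounter, the methods translate and pushAll, the use of '$' as a stand-in
for EOF, and the hard-coded parse table. Action #3 has been dropped, as in the
trace.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of a predictive parser with a semantic stack for
//   exp -> epsilon #1 | ( exp ) #2 | [ exp ]
final class ParenCounter {
    // Pushes a right-hand side so that its leftmost symbol ends up on top.
    static void pushAll(Deque<String> stack, String... symbols) {
        for (int i = symbols.length - 1; i >= 0; i--) stack.push(symbols[i]);
    }

    // Returns the number of parenthesis pairs; throws on a syntax error.
    static int translate(String input) {
        String in = input + "$";                    // '$' stands in for EOF
        Deque<String> stack = new ArrayDeque<>();   // the normal parse stack
        Deque<Integer> sem = new ArrayDeque<>();    // the semantic stack
        pushAll(stack, "exp", "$");
        int pos = 0;
        while (!stack.isEmpty()) {
            String top = stack.pop();
            char cur = in.charAt(pos);              // the current token
            switch (top) {
                case "exp":                         // choose a production from the current token
                    if (cur == '(') pushAll(stack, "(", "exp", ")", "#2");
                    else if (cur == '[') pushAll(stack, "[", "exp", "]");
                    else if (cur == ')' || cur == ']' || cur == '$') pushAll(stack, "#1");
                    else throw new IllegalArgumentException("unexpected token: " + cur);
                    break;
                case "#1":                          // exp -> epsilon
                    sem.push(0);
                    break;
                case "#2":                          // exp -> ( exp )
                    sem.push(sem.pop() + 1);
                    break;
                default:                            // a terminal: it must match the current token
                    if (top.charAt(0) != cur) {
                        throw new IllegalArgumentException("expected " + top + " but saw " + cur);
                    }
                    pos++;                          // scan
            }
        }
        if (pos != in.length()) throw new IllegalArgumentException("trailing input");
        return sem.pop();                           // translation of the root nonterminal
    }

    public static void main(String[] args) {
        System.out.println(translate("([])"));
    }
}
```

Action numbers sit on the parse stack like any other symbol, so an action runs
only when it reaches the top, that is, after everything to its left in the
right-hand side has been fully derived.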
CFG Rule:              methodBody -> { varDecls stmts }
Translation Rule:      methodBody.trans = varDecls.trans + stmts.trans
Translation Action:    stmtsTrans = pop(); declsTrans = pop();
                       push( stmtsTrans + declsTrans );
CFG rule with Action:  methodBody -> { varDecls stmts } #1
                       #1: stmtsTrans = pop();
                           declsTrans = pop();
                           push( stmtsTrans + declsTrans );
Note that the right-hand-side nonterminals' translations are popped from
the semantic stack right-to-left.
That is because the predictive parser does a leftmost derivation,
so the varDecls nonterminal gets "expanded" first;
i.e., its parse tree is created before the parse tree for the
stmts nonterminal.
This means that the actions that create the translation of the
varDecls nonterminal are performed first, and thus its
translation is pushed onto the semantic stack first.
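The right-to-left pop order can be checked with a tiny Java sketch of action
#1. The class name PopOrderSketch, the method action1, and the sample counts
2 and 3 are assumptions made for illustration; the method returns the two
popped values only so that the order can be observed.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of action #1 for: methodBody -> { varDecls stmts } #1
final class PopOrderSketch {
    // Pops right-to-left: stmts was pushed last, so it is popped first.
    // Returns {declsTrans, stmtsTrans} purely so the pop order can be inspected.
    static int[] action1(Deque<Integer> sem) {
        int stmtsTrans = sem.pop();
        int declsTrans = sem.pop();
        sem.push(stmtsTrans + declsTrans);
        return new int[] { declsTrans, stmtsTrans };
    }

    public static void main(String[] args) {
        Deque<Integer> sem = new ArrayDeque<>();
        sem.push(2);   // translation of varDecls: expanded first, so pushed first
        sem.push(3);   // translation of stmts: pushed second, so on top
        int[] popped = action1(sem);
        System.out.println("decls=" + popped[0] + " stmts=" + popped[1]
                + " methodBody=" + sem.peek());
    }
}
```

For a sum the order makes no difference to the result, but for a
non-commutative combination (building an AST node, say) popping in the wrong
order would swap the children.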
CFG Rule:              factor -> INTLITERAL
Translation Rule:      factor.trans = INTLITERAL.value
Translation Action:    push( INTLITERAL.value )
CFG rule with Action:  factor -> #1 INTLITERAL    // action BEFORE terminal
                       #1: push( currToken.value )
exp    -> exp + term
       -> exp - term
       -> term
term   -> term * factor
       -> term / factor
       -> factor
factor -> INTLITERAL
       -> ( exp )
Handling Non-LL(1) Grammars
exp  -> term exp'
exp' -> epsilon
     -> + term exp'
It is not at all clear how to define
a syntax-directed translation for rules like these.
The solution is to define the syntax-directed translation
using the original grammar (define translation rules,
convert them to actions that push and pop using the semantic stack, and
then incorporate the action numbers into the grammar rules).
Then convert the grammar to be LL(1), treating the action numbers
just like grammar symbols!
Non-LL(1) Grammar Rules With Actions
====================================
exp  -> exp + term #1
     -> term
term -> term * factor #2
     -> factor

#1: TTrans = pop(); ETrans = pop(); push( ETrans + TTrans );
#2: FTrans = pop(); TTrans = pop(); push( TTrans * FTrans );

After Removing Immediate Left Recursion
=======================================
exp   -> term exp'
exp'  -> + term #1 exp'
      -> epsilon
term  -> factor term'
term' -> * factor #2 term'
      -> epsilon
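As a rough check that the transformed rules still compute the right value,
here is a table-driven Java sketch of a predictive parser for them. It is an
illustration under several assumptions: the factor rules
(factor -> #3 INTLITERAL and factor -> ( exp ), with #3 pushing the current
token's value) are added to complete the grammar, the tokenizer is a
simplification, and the names ExprParserSketch, tokenize, isInt, pushAll, and
translate are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch: predictive parsing of the transformed grammar, with
// action numbers treated like grammar symbols on the parse stack.
final class ExprParserSketch {
    // Splits the input into integer literals, "+", "*", "(", ")", then "EOF".
    static List<String> tokenize(String s) {
        List<String> toks = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
            } else if (Character.isDigit(c)) {
                int j = i;
                while (j < s.length() && Character.isDigit(s.charAt(j))) j++;
                toks.add(s.substring(i, j));
                i = j;
            } else if ("+*()".indexOf(c) >= 0) {
                toks.add(String.valueOf(c));
                i++;
            } else {
                throw new IllegalArgumentException("unexpected character: " + c);
            }
        }
        toks.add("EOF");
        return toks;
    }

    static boolean isInt(String tok) { return Character.isDigit(tok.charAt(0)); }

    // Pushes a right-hand side so that its leftmost symbol ends up on top.
    static void pushAll(Deque<String> stack, String... syms) {
        for (int i = syms.length - 1; i >= 0; i--) stack.push(syms[i]);
    }

    static int translate(String input) {
        List<String> toks = tokenize(input);
        Deque<String> stack = new ArrayDeque<>();   // the normal parse stack
        Deque<Integer> sem = new ArrayDeque<>();    // the semantic stack
        pushAll(stack, "exp", "EOF");
        int pos = 0;
        while (!stack.isEmpty()) {
            String top = stack.pop();
            String cur = toks.get(pos);             // the current token
            switch (top) {
                case "exp":   pushAll(stack, "term", "exp'"); break;
                case "exp'":  if (cur.equals("+")) pushAll(stack, "+", "term", "#1", "exp'"); break;
                case "term":  pushAll(stack, "factor", "term'"); break;
                case "term'": if (cur.equals("*")) pushAll(stack, "*", "factor", "#2", "term'"); break;
                case "factor":
                    if (isInt(cur)) pushAll(stack, "#3", "INTLITERAL"); // action BEFORE terminal
                    else pushAll(stack, "(", "exp", ")");
                    break;
                case "#1": { int t = sem.pop(), e = sem.pop(); sem.push(e + t); break; }
                case "#2": { int f = sem.pop(), t = sem.pop(); sem.push(t * f); break; }
                case "#3": sem.push(Integer.parseInt(cur)); break;     // value of the current token
                case "INTLITERAL":
                    if (!isInt(cur)) throw new IllegalArgumentException("expected INTLITERAL but saw " + cur);
                    pos++;
                    break;
                default:                            // any other terminal must match the current token
                    if (!top.equals(cur)) throw new IllegalArgumentException("expected " + top + " but saw " + cur);
                    pos++;
            }
        }
        return sem.pop();
    }

    public static void main(String[] args) {
        System.out.println(translate("2 + 3 * 4"));
    }
}
```

Because #1 sits between term and exp' in the rule exp' -> + term #1 exp', the
addition is performed as soon as the right operand has been parsed and before
any further + is processed, so the left associativity of the original grammar
is preserved even though the transformed rules are right-recursive.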