Syntax-Directed Translation

Motivation and Definition
Example 1: Value of an Arithmetic Expression
Example 2: Type of an Expression
Test Yourself #1
Building an Abstract-Syntax Tree
Syntax-Directed Translation and Predictive Parsing
Summary

Motivation and Definition

Recall that the parser must produce output (e.g., an abstract-syntax tree) for the next phase of the compiler. This involves doing a syntax-directed translation -- translating from a sequence of tokens to some other form, based on the underlying syntax.

A syntax-directed translation is defined by augmenting the CFG: a translation rule is defined for each production. A translation rule defines the translation of the left-hand side nonterminal as a function of:

constants
the right-hand-side nonterminals' translations
the right-hand-side tokens' values (e.g., the integer value associated with an INTLITERAL token, or the String value associated with an ID token)

To translate an input string:

Build the parse tree.
Use the translation rules to compute the translation of each nonterminal in the tree, working bottom-up (a nonterminal's translation may depend on the translations of the symbols on its right-hand side, so those must be computed first).

The translation of the string is the translation of the parse tree's root nonterminal.

Example 1: Value of an Arithmetic Expression

Below is the definition of a syntax-directed translation that translates an arithmetic expression to its integer value. When a nonterminal occurs more than once in a grammar rule, the corresponding translation rule uses subscripts to identify a particular instance of that nonterminal. For example, the rule exp -> exp + term has two exp nonterminals; exp₁ means the left-hand-side exp, and exp₂ means the right-hand-side exp. Also, the notation xxx.value is used to mean the value associated with token xxx.

        CFG                          Translation rules
        ===                          =================
        exp    -> exp + term         exp₁.trans = exp₂.trans + term.trans
        exp    -> term               exp.trans  = term.trans
        term   -> term * factor      term₁.trans = term₂.trans * factor.trans
        term   -> factor             term.trans  = factor.trans
        factor -> INTLITERAL         factor.trans  = INTLITERAL.value
        factor -> ( exp )            factor.trans  = exp.trans


        Input
        =====
        2 * (4 + 5)


        Annotated Parse Tree
        ====================

                 exp (18)
                  |
                 term (18)
                 /|\
                / | \
               /  *  \
              /     factor (9)
             /       /|\
      (2) term      ( | )
            |         |
      (2) factor     exp(9)
            |        /|\
            2       / | \
                   /  |  \
             (4) exp  +  term (5)
                  |       |
            (4) term    factor (5)
                  |       |
          (4) factor      5
                  |
                  4
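
The bottom-up computation in this example can be sketched in Java. This is only an illustration under assumptions of our own: the ParseNode class, its fields, and the ValueDemo helper are hypothetical names that do not come from the grammar above, and the parse tree is built by hand rather than by a parser. Each call to trans() applies the translation rule for the production at that node, after first computing the translations of the node's children.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical parse-tree node: a grammar symbol plus its children.
class ParseNode {
    final String symbol;          // e.g. "exp", "term", "factor", "+", "INTLITERAL"
    final List<ParseNode> kids;   // empty for tokens
    final int value;              // meaningful only for INTLITERAL leaves

    ParseNode(String symbol, ParseNode... kids) {
        this.symbol = symbol;
        this.kids = Arrays.asList(kids);
        this.value = 0;
    }

    ParseNode(int intLitValue) {
        this.symbol = "INTLITERAL";
        this.kids = Collections.emptyList();
        this.value = intLitValue;
    }

    // Translation of a nonterminal node (call this only on nonterminals).
    int trans() {
        if (kids.size() == 1) {
            ParseNode k = kids.get(0);
            // factor -> INTLITERAL uses the token's value;
            // exp -> term and term -> factor copy the child's translation.
            return k.symbol.equals("INTLITERAL") ? k.value : k.trans();
        }
        String middle = kids.get(1).symbol;
        if (middle.equals("+")) return kids.get(0).trans() + kids.get(2).trans();
        if (middle.equals("*")) return kids.get(0).trans() * kids.get(2).trans();
        return kids.get(1).trans();   // factor -> ( exp )
    }
}

class ValueDemo {
    static ParseNode lit(int v) { return new ParseNode("factor", new ParseNode(v)); }
    static ParseNode tok(String s) { return new ParseNode(s); }

    // Hand-built parse tree for the input 2 * (4 + 5).
    static ParseNode example() {
        ParseNode sum = new ParseNode("exp",
            new ParseNode("exp", new ParseNode("term", lit(4))),
            tok("+"),
            new ParseNode("term", lit(5)));
        ParseNode paren = new ParseNode("factor", tok("("), sum, tok(")"));
        ParseNode product = new ParseNode("term",
            new ParseNode("term", lit(2)), tok("*"), paren);
        return new ParseNode("exp", product);
    }
}
```

Calling ValueDemo.example().trans() walks the tree in the same bottom-up order as the annotations in the parse tree and yields the root's translation, 18.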

Example 2: Type of an Expression

Consider the following CFG, which defines expressions that use the three operators: +, &&, ==. Let's define a syntax-directed translation that type checks these expressions; i.e., for type-correct expressions, the translation will be the type of the expression (either INT or BOOL), and for expressions that involve type errors, the translation will be the special value ERROR. We'll use the following type rules:

Both operands of the + operator must be of type INT.
Both operands of the && operator must be of type BOOL.
Both operands of the == operator must have the same (non-ERROR) type.

Here is the CFG and the translation rules:

        CFG                     Translation rules
        ===                     =================
        exp -> exp + term        if ((exp₂.trans == INT) and (term.trans == INT))
                                 then exp₁.trans = INT
                                 else exp₁.trans = ERROR

        exp -> exp && term       if ((exp₂.trans == BOOL) and (term.trans == BOOL))
                                 then exp₁.trans = BOOL
                                 else exp₁.trans = ERROR

        exp -> exp == term       if ((exp₂.trans == term.trans) and (exp₂.trans != ERROR))
                                 then exp₁.trans = BOOL
                                 else exp₁.trans = ERROR

        exp -> term              exp.trans = term.trans

        term -> true             term.trans = BOOL
        term -> false            term.trans = BOOL
        term -> INTLITERAL       term.trans = INT
        term -> ( exp )          term.trans = exp.trans


        Input
        =====
        ( 2 + 2 ) == 4


        Annotated Parse Tree
        ====================
                 exp (BOOL)
                 /|\
                / | \
        (INT) exp == term (INT)
               |      |
               |      4
        (INT) term
              /|\
             / | \
            /  |  \
           (  exp   )
             (INT)
              /|\
             / | \
    (INT) exp  +  term (INT)
           |       |
     (INT) term    2
           |
           2
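
The three operator rules translate almost directly into code. Below is a minimal Java sketch; the names ExpType and TypeRules are our own (hypothetical), and each method takes the translations of the two right-hand-side nonterminals and returns the translation of the left-hand side.

```java
// The possible translations of an expression in this example.
enum ExpType { INT, BOOL, ERROR }

class TypeRules {
    // exp -> exp + term
    static ExpType plus(ExpType exp2, ExpType term) {
        return (exp2 == ExpType.INT && term == ExpType.INT) ? ExpType.INT : ExpType.ERROR;
    }

    // exp -> exp && term
    static ExpType and(ExpType exp2, ExpType term) {
        return (exp2 == ExpType.BOOL && term == ExpType.BOOL) ? ExpType.BOOL : ExpType.ERROR;
    }

    // exp -> exp == term
    static ExpType equalsOp(ExpType exp2, ExpType term) {
        return (exp2 == term && exp2 != ExpType.ERROR) ? ExpType.BOOL : ExpType.ERROR;
    }
}
```

For the input ( 2 + 2 ) == 4, the inner sum is plus(INT, INT), which is INT, and the whole expression is equalsOp(INT, INT), which is BOOL. Note that once ERROR appears it propagates upward: every rule returns ERROR when either operand is ERROR.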

TEST YOURSELF #1

The following grammar defines the language of base-2 numbers:

B -> 0
  -> 1
  -> B 0
  -> B 1

Define a syntax-directed translation so that the translation of a binary number is its base 10 value. Illustrate your translation scheme by drawing the parse tree for 1001 and annotating each nonterminal in the tree with its translation.

Building an Abstract-Syntax Tree

So far, our example syntax-directed translations have produced simple values (an int or a type) as the translation of an input. In practice, however, we want the parser to build an abstract-syntax tree as the translation of an input program. But that is not really so different from what we've seen so far; we just need to use tree-building operations in the translation rules instead of, e.g., arithmetic operations.

The AST vs the Parse Tree

First, let's consider how an abstract-syntax tree (AST) differs from a parse tree. An AST can be thought of as a condensed form of the parse tree:

Operators appear at internal nodes instead of at leaves.
"Chains" of single productions are collapsed.
Lists are "flattened".
Syntactic details (e.g., parentheses, commas, semi-colons) are omitted.

In general, the AST is a better structure for later stages of the compiler because it omits details having to do with the source language, and just contains information about the essential structure of the program.

Below is an example of the parse tree and the AST for the expression 3 * (4 + 2) (using the usual arithmetic-expression grammar that reflects the precedences and associativities of the operators). Note that the parentheses are not needed in the AST because the structure of the AST defines how the subexpressions are grouped.

Parse Tree               Abstract Syntax Tree
==========               ====================

       exp                          *           
        |                          / \          
      term                        3   +
       /|\                           / \        
   term * factor                    4   2
    |    /|\
    |   / | \
factor ( exp )
  |      /|\
  3   exp + term
       |      |
     term   factor
       |      |
     factor   2
       |
       4

For constructs other than expressions, the compiler writer has some choices when defining the AST -- but remember that lists (e.g., lists of declarations, lists of statements, lists of parameters) should be flattened, that operators (e.g., "assign", "while", "if") go at internal nodes, not at leaves, and that syntactic details are omitted.

For example:

Input                                  Parse Tree
=====                                  ==========

{                                ____ methodBody _________
   x = 0;                       /       /            \    \
   while (x<10) {              {  declList        stmtList }
      x = x+1;                       |           /        \
   }                              epsilon   stmtList       stmt___
   y = x*2;                                /      \       /  | \  \
}                                     stmtList   stmt    ID  = exp ;
                                     /      |        \   (y)  / | \
                             stmtList      stmt    ...     exp  * term
                               |          / | | \            |     |
                            epsilon     ID  = exp ;         term  factor
                                        (x)    |             |      |
                                              INTLITERAL   factor  INT
                                                 (0)         |     (2)
                                                             ID
                                                             (x)


                        AST
                        ===

                        methodBody
                       /          \
               declList            stmtList
                                  /   |    \
                            assign  while   assign
                            /     \  ...    /     \
                          ID     INT      ID       *
                          (x)    (0)      (y)     / \
                                                 ID  INT
                                                 (x) (2)

Note that in the AST there is just one stmtList node, with a list of three children (the list of statements has been "flattened"). Also, the "operators" for the statements (assign and while) have been "moved up" to internal nodes (instead of appearing as tokens at the leaves). And finally, syntactic details (curly braces and semi-colons) have been omitted.

Translation Rules That Build an AST

To define a syntax-directed translation so that the translation of an input is the corresponding AST, we first need operations that create AST nodes. Let's use Java code, and assume that we have the following class hierarchy:

class ExpNode { }

class IntLitNode extends ExpNode {
    public IntLitNode(int val) {...}
}

class PlusNode extends ExpNode {
    public PlusNode( ExpNode e1, ExpNode e2 ) { ... }
}

class TimesNode extends ExpNode {
    public TimesNode( ExpNode e1, ExpNode e2 ) { ... }
}

Now we can define a syntax-directed translation for simple arithmetic expressions, so that the translation of an expression is its AST:

CFG                      Translation rules
===                      =================
exp    -> exp + term     exp₁.trans = new PlusNode(exp₂.trans, term.trans)
exp    -> term           exp.trans  = term.trans
term   -> term * factor  term₁.trans = new TimesNode(term₂.trans, factor.trans)
term   -> factor         term.trans  = factor.trans
factor -> INTLITERAL     factor.trans  = new IntLitNode(INTLITERAL.value)
factor -> ( exp )        factor.trans  = exp.trans
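
To make the rules concrete, here is one possible way to fill in the node classes and apply the rules by hand. The fields and the show() method are assumptions added for illustration (the constructor bodies are elided in the class hierarchy above), and AstDemo is a hypothetical driver that stands in for the parser.

```java
abstract class ExpNode {
    // Hypothetical helper: renders the subtree in prefix form.
    abstract String show();
}

class IntLitNode extends ExpNode {
    private final int val;
    public IntLitNode(int val) { this.val = val; }
    String show() { return Integer.toString(val); }
}

class PlusNode extends ExpNode {
    private final ExpNode e1, e2;
    public PlusNode(ExpNode e1, ExpNode e2) { this.e1 = e1; this.e2 = e2; }
    String show() { return "(+ " + e1.show() + " " + e2.show() + ")"; }
}

class TimesNode extends ExpNode {
    private final ExpNode e1, e2;
    public TimesNode(ExpNode e1, ExpNode e2) { this.e1 = e1; this.e2 = e2; }
    String show() { return "(* " + e1.show() + " " + e2.show() + ")"; }
}

class AstDemo {
    // Applies the translation rules bottom-up for the input 3 * (4 + 2).
    static ExpNode example() {
        ExpNode three = new IntLitNode(3);       // factor -> INTLITERAL
        ExpNode four = new IntLitNode(4);        // factor -> INTLITERAL
        ExpNode two = new IntLitNode(2);         // factor -> INTLITERAL
        ExpNode sum = new PlusNode(four, two);   // exp -> exp + term
        // exp -> term, term -> factor, and factor -> ( exp ) pass the
        // translation through unchanged, so no new nodes are created for them.
        return new TimesNode(three, sum);        // term -> term * factor
    }
}
```

The result of AstDemo.example().show() is "(* 3 (+ 4 2))", which has the same shape as the AST drawn earlier for 3 * (4 + 2): the parentheses and the chains of single productions leave no trace in the tree.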

TEST YOURSELF #2

Illustrate the syntax-directed translation defined above by drawing the parse tree for the expression 2 + 3 * 4, and annotating the parse tree with its translation (i.e., each nonterminal in the tree will have a pointer to the AST node that is the root of the subtree of the AST that is the nonterminal's translation).

Syntax-Directed Translation and Predictive Parsing

Now we consider how to implement a syntax-directed translation using a predictive parser. It is not obvious how to do this, since the predictive parser works by building the parse tree top-down, while the syntax-directed translation needs to be computed bottom-up. Of course, we could design the parser to actually build the parse tree (top-down), then use the translation rules to build the translation (bottom-up). However, that would not be very efficient.

Instead, we avoid explicitly building the parse tree by giving the parser a second stack called the semantic stack:

The semantic stack holds nonterminals' translations; when the parse is finished, it will hold just one value: the translation of the root nonterminal (which is the translation of the whole input).
Values are pushed onto the semantic stack (and popped off) by adding actions to the grammar rules. The action for one rule must:
- Pop the translations of all right-hand-side nonterminals.
- Compute and push the translation of the left-hand-side nonterminal.
The actions themselves are represented by action numbers, which become part of the right-hand sides of the grammar rules. They are pushed onto the (normal) stack along with the terminal and nonterminal symbols. When an action number is the top-of-stack symbol, it is popped and the action is carried out.

So what actually happens is that the action for a grammar rule X -> Y₁ Y₂ ... Y_n is pushed onto the (normal) stack when the derivation step X -> Y₁ Y₂ ... Y_n is made, but the action is not actually performed until complete derivations for all of the Y's have been carried out.

Example: Counting Parentheses

For example, consider the following syntax-directed translation for the language of balanced parentheses and square brackets. The translation of a string in the language is the number of parenthesis pairs in the string.

CFG                          Translation Rules
===                          =================
exp -> epsilon               exp.trans = 0
    -> ( exp )               exp₁.trans = exp₂.trans + 1
    -> [ exp ]               exp₁.trans = exp₂.trans

The first step is to replace the translation rules with translation actions. Each action must:

Pop all right-hand-side nonterminals' translations from the semantic stack.
Compute and push the left-hand-side nonterminal's translation.

Here are the translation actions:

CFG                          Translation Actions
===                          ===================
exp -> epsilon               push(0);
    -> ( exp )               exp2Trans = pop(); push( exp2Trans + 1 );
    -> [ exp ]               exp2Trans = pop(); push( exp2Trans );

Next, each action is represented by a unique action number, and those action numbers become part of the grammar rules:

CFG with Actions
================
exp -> epsilon #1
    -> ( exp ) #2
    -> [ exp ] #3

#1: push(0);
#2: exp2Trans = pop(); push( exp2Trans + 1 );
#3: exp2Trans = pop(); push( exp2Trans );

Note that since action #3 just pushes exactly what is popped, that action is redundant, and it is not necessary to have any action associated with the third grammar rule. Here's a picture that illustrates what happens when the input "([])" is parsed (assuming that we have removed action #3):

   input so far   stack            semantic stack  action
   ------------   -----            --------------  ------
      (           exp EOF                          pop, push "( exp ) #2"
      (           (exp) #2 EOF                     pop, scan
      ([          exp) #2 EOF                      pop, push "[ exp ]"
      ([          [exp] ) #2 EOF                   pop, scan
      ([]         exp] ) #2 EOF                    pop, push epsilon #1
      ([]         #1 ] ) #2 EOF                    pop, do action
      ([]         ] ) #2 EOF         0             pop, scan
      ([])        ) #2 EOF           0             pop, scan
      ([]) EOF    #2 EOF             0             pop, do action
      ([]) EOF    EOF                1             pop, scan
      ([]) EOF                                     empty stack: input accepted!
                                                   translation of input = 1
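
The trace above can be reproduced with a small table-driven parser. The following Java sketch is one way to do it, under some simplifying assumptions of our own: the input is a string of single-character tokens, '$' stands for EOF, and the class name ParenCounter and its helpers are hypothetical. The parse table is encoded directly in the switch statement, and action #3 has been removed, as in the trace.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class ParenCounter {
    static final char EOF = '$';   // assumed end-of-input marker

    static int translate(String input) {
        String in = input + EOF;
        int pos = 0;                                  // index of the current token
        Deque<String> stack = new ArrayDeque<>();     // the (normal) parse stack
        Deque<Integer> sem = new ArrayDeque<>();      // the semantic stack
        push(stack, "exp", String.valueOf(EOF));
        while (!stack.isEmpty()) {
            String top = stack.pop();
            char cur = in.charAt(pos);
            switch (top) {
                case "exp":
                    if (cur == '(') push(stack, "(", "exp", ")", "#2");
                    else if (cur == '[') push(stack, "[", "exp", "]");
                    else push(stack, "#1");           // exp -> epsilon #1
                    break;
                case "#1":
                    sem.push(0);
                    break;
                case "#2":
                    sem.push(sem.pop() + 1);
                    break;
                default:                              // terminal: must match, then scan
                    if (top.charAt(0) != cur) {
                        throw new IllegalArgumentException("syntax error at position " + pos);
                    }
                    pos++;
            }
        }
        if (pos != in.length()) {
            throw new IllegalArgumentException("unexpected input after position " + pos);
        }
        return sem.pop();                             // translation of the root nonterminal
    }

    // Pushes a right-hand side so that its first symbol ends up on top.
    private static void push(Deque<String> stack, String... rhs) {
        for (int i = rhs.length - 1; i >= 0; i--) stack.push(rhs[i]);
    }
}
```

For example, ParenCounter.translate("([])") goes through the same steps as the trace and returns 1. Action numbers sit on the parse stack like any other symbol; they differ only in what happens when they reach the top.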

In the example above, there is no grammar rule with more than one nonterminal on the right-hand side. If there were, the translation action for that rule would have to do one pop for each right-hand-side nonterminal. For example, suppose we are using a grammar that includes the rule: methodBody -> { varDecls stmts }, and that the syntax-directed translation is counting the number of declarations and statements in each method body (so the translation of varDecls is the number of derived declarations, the translation of stmts is the number of derived statements, and the translation of methodBody is the number of derived declarations and statements).

CFG Rule:              methodBody -> { varDecls stmts }
Translation Rule:      methodBody.trans = varDecls.trans + stmts.trans
Translation Action:    stmtsTrans = pop(); declsTrans = pop();
                       push( stmtsTrans + declsTrans );
CFG Rule with Action:  methodBody -> { varDecls stmts } #1
                       #1: stmtsTrans = pop();
                           declsTrans = pop();
                           push( stmtsTrans + declsTrans );

Note that the right-hand-side nonterminals' translations are popped from the semantic stack right-to-left. That is because the predictive parser does a leftmost derivation, so the varDecls nonterminal gets "expanded" first; i.e., its parse tree is created before the parse tree for the stmts nonterminal. This means that the actions that create the translation of the varDecls nonterminal are performed first, and thus its translation is pushed onto the semantic stack first.
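
To see the pop order concretely, here is a minimal Java sketch of action #1 for the methodBody rule. The class and method names are made up for illustration, and the two pushes in demo stand in for whatever actions the varDecls and stmts subtrees would have performed during the parse.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class MethodBodyAction {
    // Action #1 for: methodBody -> { varDecls stmts } #1
    static void action1(Deque<Integer> sem) {
        int stmtsTrans = sem.pop();   // pushed last, so popped first
        int declsTrans = sem.pop();   // pushed first, so popped second
        sem.push(stmtsTrans + declsTrans);
    }

    // Simulates the state of the semantic stack just before and after action #1.
    static int demo(int numDecls, int numStmts) {
        Deque<Integer> sem = new ArrayDeque<>();
        sem.push(numDecls);   // varDecls is expanded first, so its translation arrives first
        sem.push(numStmts);   // stmts is expanded second
        action1(sem);
        return sem.pop();     // translation of methodBody
    }
}
```

Because this particular translation is a sum, swapping the two pops would give the same answer; for a translation where order matters (e.g., building a tree node with a declaration child and a statement child), popping in the wrong order would silently swap the children.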

Another issue that has not been illustrated yet arises when a left-hand-side nonterminal's translation depends on the value of a right-hand-side terminal. In that case, it is important to put the action number before that terminal symbol when incorporating actions into grammar rules. This is because a terminal symbol's value is available during the parse only when it is the "current token". For example, if the translation of an arithmetic expression is the value of the expression:

CFG Rule:              factor -> INTLITERAL
Translation Rule:      factor.trans = INTLITERAL.value
Translation Action:    push( INTLITERAL.value )
CFG rule with Action:  factor -> #1 INTLITERAL       // action BEFORE terminal
                       #1: push( currToken.value )
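
The effect of action placement can be checked with a tiny simulation. The Java sketch below is hypothetical (the ActionPlacement class is not part of any parser described here): it steps through the symbols of one right-hand side, scanning a token for each terminal, and reports which token is the current token at the moment action #1 is performed.

```java
class ActionPlacement {
    // Returns the kind of the current token when action #1 is performed.
    // rhs is one right-hand side containing terminals and the action number "#1";
    // tokenKinds is the remaining input, ending with "EOF".
    static String currTokenAtAction(String[] rhs, String[] tokenKinds) {
        int pos = 0;
        for (String sym : rhs) {
            if (sym.equals("#1")) {
                return tokenKinds[pos];
            }
            if (!sym.equals(tokenKinds[pos])) {
                throw new IllegalArgumentException("syntax error");
            }
            pos++;   // scanning a terminal advances to the next token
        }
        throw new IllegalArgumentException("no action in rule");
    }
}
```

With the rule written as factor -> #1 INTLITERAL, the current token at the action is the INTLITERAL itself, so its value can be pushed. With the action placed after the terminal, the INTLITERAL has already been scanned and the current token is whatever follows it (EOF in a one-token input), so the value is no longer available.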

TEST YOURSELF #3

For the following grammar, give (a) translation rules, (b) translation actions, and (c) a CFG with actions so that the translation of an input expression is the value of the expression. Do not worry about the fact that the grammar is not LL(1).

exp    -> exp + term
       -> exp - term
       -> term
term   -> term * factor
       -> term / factor
       -> factor
factor -> INTLITERAL
       -> ( exp )

Handling Non-LL(1) Grammars

Recall that a non-LL(1) grammar must be transformed to an equivalent LL(1) grammar if it is to be parsed using a predictive parser. Recall also that the transformed grammar usually does not reflect the underlying structure the way the original grammar did. For example, when left recursion is removed from the grammar for arithmetic expressions, we get grammar rules like this:

exp  -> term exp'
exp' -> epsilon
     -> + term exp'

It is not at all clear how to define a syntax-directed translation for rules like these. The solution is to define the syntax-directed translation using the original grammar (define translation rules, convert them to actions that push and pop using the semantic stack, and then incorporate the action numbers into the grammar rules). Then convert the grammar to be LL(1), treating the action numbers just like grammar symbols!

For example:

Non-LL(1) Grammar Rules With Actions
====================================
exp  -> exp + term #1
     -> term
term -> term * factor #2
     -> factor

#1: TTrans = pop(); ETrans = pop(); push( ETrans + TTrans );
#2: FTrans = pop(); TTrans = pop(); push( TTrans * FTrans );

After Removing Immediate Left Recursion
=======================================
exp  -> term exp'
exp' -> + term #1 exp'
     -> epsilon

term  -> factor term'
term' -> * factor #2 term'
      -> epsilon
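
Here is a sketch of how a predictive parser might run the transformed grammar with its actions. It rests on several assumptions of our own: the factor rules are taken to be factor -> #3 INTLITERAL and factor -> ( exp ), with #3 pushing the current token's value (the action number 3 is our choice); tokens are plain strings; and the class name Ll1Evaluator and its helpers are hypothetical. The parse table is written out as a switch statement.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class Ll1Evaluator {
    static int eval(String src) {
        List<String> toks = tokenize(src);
        int pos = 0;                                  // index of the current token
        Deque<String> stack = new ArrayDeque<>();     // parse stack
        Deque<Integer> sem = new ArrayDeque<>();      // semantic stack
        push(stack, "exp", "EOF");
        while (!stack.isEmpty()) {
            String top = stack.pop();
            String cur = toks.get(pos);
            boolean isInt = Character.isDigit(cur.charAt(0));
            switch (top) {
                case "exp":   push(stack, "term", "exp'"); break;
                case "exp'":  if (cur.equals("+")) push(stack, "+", "term", "#1", "exp'"); break;
                case "term":  push(stack, "factor", "term'"); break;
                case "term'": if (cur.equals("*")) push(stack, "*", "factor", "#2", "term'"); break;
                case "factor":
                    if (isInt) push(stack, "#3", "INTLITERAL");   // action BEFORE terminal
                    else push(stack, "(", "exp", ")");
                    break;
                case "#1": { int t = sem.pop(), e = sem.pop(); sem.push(e + t); break; }
                case "#2": { int f = sem.pop(), t = sem.pop(); sem.push(t * f); break; }
                case "#3": sem.push(Integer.parseInt(cur)); break;
                default:                              // terminal: must match, then scan
                    boolean ok = top.equals("INTLITERAL") ? isInt : top.equals(cur);
                    if (!ok) throw new IllegalArgumentException("syntax error at token " + pos);
                    pos++;
            }
        }
        return sem.pop();
    }

    // Splits the input into integer literals, "+", "*", "(", ")", and a final "EOF".
    static List<String> tokenize(String src) {
        List<String> toks = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
            } else if (Character.isDigit(c)) {
                int j = i;
                while (j < src.length() && Character.isDigit(src.charAt(j))) j++;
                toks.add(src.substring(i, j));
                i = j;
            } else if ("+*()".indexOf(c) >= 0) {
                toks.add(String.valueOf(c));
                i++;
            } else {
                throw new IllegalArgumentException("unexpected character: " + c);
            }
        }
        toks.add("EOF");
        return toks;
    }

    // Pushes a right-hand side so that its first symbol ends up on top.
    private static void push(Deque<String> stack, String... rhs) {
        for (int i = rhs.length - 1; i >= 0; i--) stack.push(rhs[i]);
    }
}
```

For instance, Ll1Evaluator.eval("2 * (4 + 5)") returns 18. Although the exp' and term' rules no longer mirror the structure of the expression, the actions are still performed in the order that the original left-recursive grammar would have performed them, which is why the semantic stack ends up holding the right value.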

TEST YOURSELF #4

Transform the grammar rules with actions that you wrote for the "Test Yourself #3" exercise to LL(1) form. Trace the actions of the predictive parser on the input 2 + 3 * 4.

Summary

A syntax-directed translation is used to define the translation of a sequence of tokens to some other value, based on a CFG for the input. A syntax-directed translation is defined by associating a translation rule with each grammar rule. A translation rule defines the translation of the left-hand-side nonterminal as a function of the right-hand-side nonterminals' translations, and the values of the right-hand-side terminals. To compute the translation of a string, build the parse tree, and use the translation rules to compute the translation of each nonterminal in the tree, bottom-up; the translation of the string is the translation of the root nonterminal.

There is no restriction on the type of a translation; it can be a simple type like an integer, or a complex type like an abstract-syntax tree.

To implement a syntax-directed translation using a predictive parser, the translation rules are converted to actions that manipulate the parser's semantic stack. Each action must pop all right-hand-side nonterminals' translations from the semantic stack, then compute and push the left-hand-side nonterminal's translation. Next, the actions are incorporated (as action numbers) into the grammar rules. Finally, the grammar is converted to LL(1) form (treating the action numbers just like terminal or nonterminal symbols).
