There are algorithms that can be used to parse the language defined by an
arbitrary CFG. However, in the worst case, the algorithms take
O(n^3) time, where n is the number of tokens. That is too
slow!
Fortunately, there are classes of grammars for which O(n) parsers can
be built (and given a grammar, we can quickly test whether it is in
such a class).
Two such classes are:
LL(1) grammars are parsed by top-down parsers. They construct the derivation
tree starting with the start nonterminal and working down. One kind of
parser for LL(1) grammars is the predictive parser.
The idea is as follows:
Here's a very simple example, using a grammar that defines the language
of balanced parentheses or square brackets, and running the parser on
the input "( [ ] ) EOF".
Note that in the examples in this set of notes we will use actual characters
(such as: (, ), [, and ]) instead
of the token names (LPAREN, RPAREN, etc).
Also note that in the picture, the top of stack is to the left.
Draw a picture like the one given above to illustrate what the parser
for the grammar:
We need to answer two important questions:
It turns out that there is really just one answer: if we build the
parse table and no element of the table contains more than one
grammar rule right-hand side, then the grammar is LL(1).
Before saying how to build the table we will consider two properties
that preclude a context-free grammar from being LL(1):
left-recursive grammars and grammars that are not left
factored.
We will also consider some transformations that can be applied to
such grammars to make them LL(1).
First, we will introduce one new definition:
In general, it is not a problem for a grammar to be recursive.
However, if a grammar is left recursive, it is not LL(1).
Fortunately, we can change a grammar to remove immediate left recursion
without changing the language of the grammar.
Here is how to do the transformation:
To illustrate why the new grammar is equivalent to the original one,
consider the parse trees that can be built using the original grammar:
Example: Consider the grammar for arithmetic expressions involving only
subtraction:
Using the transformation defined above, we remove the immediate left
recursion, producing the following new grammar:
Unfortunately, there is a major disadvantage of the new grammar, too.
Consider the parse trees for the string 2 - 3 - 4 for
the old and the new grammars:
Note that the rule for removing immediate left recursion given above
only handled a somewhat restricted case, where there was only one
left-recursive production.
Here's a more general rule for removing immediate left recursion:
Note also that there are rules for removing non-immediate
left recursion; for example, you can read about how to do that in the compiler
textbook by Aho, Sethi & Ullman, on page 177.
However, we will not discuss that issue here.
A second property that precludes a grammar from being LL(1) is if
it is not left factored, i.e., if a nonterminal has two productions
whose right-hand sides have a common prefix.
For example, the following grammar is not left factored:
This problem is solved by left-factoring, as follows:
Note that this transformation (like the one for removing immediate left
recursion) has the disadvantage of making the grammar much harder to
understand.
However, it is necessary if you need an LL(1) grammar.
Here's an example that demonstrates both left-factoring and immediate
left-recursion removal:
Using the same grammar: exp -> ( exp ) | exp exp | ( ),
do left factoring first, then remove immediate left recursion.
Recall: A predictive parser can only be built for an LL(1) grammar.
A grammar is not LL(1) if it is:
To build the table, we must compute FIRST and
FOLLOW sets for the grammar.
Ultimately, we want to define FIRST sets for the right-hand sides
of each of the grammar's productions.
To do that, we define FIRST sets for arbitrary sequences of terminals
and/or nonterminals, or epsilon (since that's what can be on the right-hand
side of a grammar production).
The idea is that for sequence alpha, FIRST(alpha) is the set of
terminals that begin the
strings derivable from alpha, and also, if alpha can derive epsilon,
then epsilon is in FIRST(alpha).
Using derivation notation:
For example, consider computing FIRST sets for each of the nonterminals
in the following grammar:
Once we have computed FIRST(X) for each terminal and nonterminal X,
we can compute FIRST(alpha) for every
production's right-hand-side alpha.
In general, alpha will be of the form:
Why do we care about the FIRST(alpha) sets?
During parsing, suppose the top-of-stack symbol is nonterminal A, that
there are two productions:
FOLLOW sets are only defined for single nonterminals.
The definition is as follows:
Using notation:
It is worth noting that:
Here's an example of FOLLOW sets (and the FIRST sets we need to
compute them). In this example, nonterminals are upper-case letters, and
terminals are lower-case letters.
Now let's consider why we care about FOLLOW sets:
Here are five grammar productions for (simplified) method headers:
Question 1:
Compute the FIRST and FOLLOW sets for the three nonterminals,
and the FIRST sets for each production right-hand side.
Question 2:
Draw a picture to illustrate what the predictive parser will do, given
the input sequence of
tokens: "VOID ID LPAREN RPAREN EOF".
Include an explanation of how the FIRST and FOLLOW sets
are used when there is a nonterminal on the top-of-stack that has
more than one production.
Recall that the form of the parse table is:
To build the table, we fill in the rows one at a time for each
nonterminal X as follows:
The grammar is not LL(1) iff there is more
than one entry for any cell in the table.
Let's try building a parse table for the following grammar:
Here's how we filled in this much of the table:
Finish filling in the parse table given above.
Overview
LL(1)
^^ ^
|| |___ one token of look-ahead
||_____ do a leftmost derivation
|______ scan the input left-to-right
LALR(1)
^ ^^ ^
| || |__ one token of look-ahead
| ||____ do a rightmost derivation in reverse
| |_____ scan the input left-to-right
|_______ LA means "look-ahead"; this has nothing to do with the
number of tokens the parser can look at before it chooses
what to do -- it is a technical term that only means
something when you study how LR parsers work...
LALR(1) grammars are:
So we will learn about LL(1) grammars (remember, if a grammar is LL(1) then
it is guaranteed to be LALR(1), too, so when using Java Cup, if your
grammar is not LALR(1), you can always make it LL(1) and it will work).
LL(1) Grammars and Predictive Parsers
Here's how the predictive parser works:
does on the input: "[[]]".
grammar: S -> epsilon | ( S ) | [ S ]
parse table:
           (         )         [         ]        EOF
    +--------------------------------------------------+
  S |  ( S )  | epsilon |  [ S ]  | epsilon | epsilon  |
    +--------------------------------------------------+

input seen so far     stack      action
-----------------     -----      ------
(                     S EOF      pop, push "(S)"
(                     (S) EOF    pop, scan (top-of-stack term matches curr token)
([                    S) EOF     pop, push "[S]"
([                    [S]) EOF   pop, scan (top-of-stack term matches curr token)
([]                   S]) EOF    pop, push epsilon (no push)
([]                   ]) EOF     pop, scan (top-of-stack term matches curr token)
([])                  ) EOF      pop, scan (top-of-stack term matches curr token)
([]) EOF              EOF        pop, scan (top-of-stack term matches curr token)
([]) EOF                         empty stack: input accepted!
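The loop traced above can be sketched as a small table-driven parser. The Python below is illustrative only: the grammar representation, the `TABLE` layout, and the name `parse` are my own, not from these notes.

```python
# A table-driven predictive parser for the grammar
#   S -> epsilon | ( S ) | [ S ]
# The representation (TABLE, parse) is illustrative, not from these notes.

EOF = "EOF"

# Parse table: TABLE[nonterminal][current token] = right-hand side to push.
# An epsilon entry is the empty list (pop, push nothing).
TABLE = {
    "S": {
        "(": ["(", "S", ")"],
        "[": ["[", "S", "]"],
        ")": [],
        "]": [],
        EOF: [],
    }
}

def parse(text):
    """Return True iff text is a balanced string of ()s and []s."""
    tokens = list(text) + [EOF]
    pos = 0
    stack = ["S", EOF]                  # top of stack is to the left
    while stack:
        top = stack.pop(0)
        curr = tokens[pos]
        if top in TABLE:                # nonterminal: consult the table
            rhs = TABLE[top].get(curr)
            if rhs is None:             # empty table entry: syntax error
                return False
            stack = rhs + stack         # pop, push right-hand side
        elif top == curr:               # terminal matches current token:
            pos += 1                    # pop, scan
        else:
            return False                # terminal mismatch: syntax error
    return pos == len(tokens)           # empty stack + all input consumed
```

For example, parse("([])") accepts while parse("([)]") rejects, mirroring the trace above.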
Remember, it is not always possible to build a predictive parser given a CFG;
only if the CFG is LL(1). For example, the following grammar is not
LL(1) (but it is LL(2)):
S -> ( S ) | [ S ] | ( ) | [ ]
If we try to parse an input that starts with a left paren, we are in trouble!
We don't know whether to choose the first production: S -> ( S ),
or the third one: S -> ( ) .
If the next token is a right paren, we want to push "()".
If the next token is a left paren, we want to push "(S)".
So here we need two tokens of look-ahead.
S -> epsilon | ( S ) | [ S ]
Grammar Transformations
A nonterminal X is useless if either:
Here are some examples of useless nonterminals:
1. (for case 1):
S -> A B
A -> + | - | epsilon
B -> digit | B digit
C -> . B
C is useless
2. (for case 2):
S -> X | Y
X -> ( )
Y -> ( Y Y )
Y just derives more and more nonterminals.
So it is useless.
From now on, "context-free grammar" means a grammar without useless nonterminals.
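Both kinds of uselessness can be detected mechanically. Here is a sketch in Python; the grammar representation (a dict mapping each nonterminal to a list of right-hand sides, each a list of symbols) and the function name are my own. Case 2 corresponds to nonterminals that can never derive a string of terminals; case 1's C is a nonterminal unreachable from the start symbol.

```python
# Detecting useless nonterminals.  Representation and names are
# illustrative, not from these notes.

def useless_nonterminals(productions, start):
    nonterms = set(productions)

    # Case 2: find the "productive" nonterminals -- those that can
    # derive some string of terminals -- by iterating to a fixed point.
    productive = set()
    changed = True
    while changed:
        changed = False
        for A, rhss in productions.items():
            if A in productive:
                continue
            for rhs in rhss:
                if all(s not in nonterms or s in productive for s in rhs):
                    productive.add(A)
                    changed = True
                    break

    # Case 1: find the nonterminals reachable from the start symbol.
    reachable, work = {start}, [start]
    while work:
        for rhs in productions.get(work.pop(), []):
            for s in rhs:
                if s in nonterms and s not in reachable:
                    reachable.add(s)
                    work.append(s)

    # A nonterminal is useless unless it is both productive and reachable.
    return nonterms - (productive & reachable)
```

On example 2 above this reports {"Y"}, and on example 1 it reports {"C"}.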
Left Recursion
Given two productions of the form: A -> A alpha | beta
where:
Using this rule, we create a new grammar from a grammar with immediate
left recursion.
The new grammar is equivalent to the original one; i.e.,
the two grammars derive exactly the same sets of strings, but the new
one is not immediately left recursive (and so has a chance of
being LL(1)).
Replace those two productions with the following three productions:
A -> beta A'
where A' is a new nonterminal.
A' -> alpha A' | epsilon
   A         A               A        etc.
   |        / \             / \
  beta     A   alpha       A   alpha
           |              / \
          beta           A   alpha
                         |
                        beta
Note that the derived strings are:
That is, they are of the form "beta, followed by zero or more alphas".
The new grammar derives the same set of strings, but the parse trees
have a different shape (the single "beta" is derived right away, and
then the zero or more alphas):
   A              A                  A         etc.
  / \            / \                / \
beta  A'       beta  A'           beta  A'
      |             / \                / \
   epsilon      alpha  A'          alpha  A'
                       |                / \
                    epsilon         alpha  A'
                                          |
                                       epsilon
exp -> exp - factor | factor
factor -> INTLITERAL | ( exp )
Notice that the first rule (exp -> exp - factor) has
immediate left recursion, so this grammar is not LL(1).
(For example, if the first token is INTLITERAL, you don't know
whether to choose the production exp -> exp - factor, or
exp -> factor.
If the next token is MINUS, then you should choose
exp -> exp - factor, but if the next token is EOF, then you
should choose exp -> factor.)
exp -> factor exp'
exp' -> - factor exp' | epsilon
factor -> INTLITERAL | ( exp )
Let's consider what the predictive parser built using this grammar
does when the input starts with an integer:
So with the new grammar, the parser is able to tell (using only one token
look-ahead) what action to perform.
   Parse tree using              Parse tree using
   the original grammar:         the new grammar:

        exp                            exp
        /|\                           /   \
       / | \                         /     \
     exp -  factor               factor    exp'
     /|\      |                    |      / | \
    / | \     4                    2     /  |  \
  exp -  factor                         -  factor  exp'
   |      |                                 |     / | \
 factor   3                                 3    /  |  \
   |                                            -  factor  exp'
   2                                                 |      |
                                                     4   epsilon
The original parse tree shows the underlying structure of the expression;
in particular it groups 2 - 3 in one subtree to reflect the fact
that subtraction is left associative.
The parse tree for the new grammar is a mess!
Its subtrees don't correspond to the sub-expressions of 2 - 3 - 4
at all!
Fortunately, we can design a predictive parser to create an abstract-syntax
tree that does correctly reflect the structure of the parsed
code even though the grammar's parse trees do not.
A -> A alpha1 | A alpha2 | .. | A alpham | beta1 | .. | betan
A -> beta1 A' | beta2 A' | .. | betan A'
A' -> alpha1 A' | .. | alpham A' | epsilon
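The general rule above is mechanical enough to code directly. Here is a sketch in Python; the representation and names are mine: a right-hand side is a list of symbols, epsilon is the empty list, and A' is spelled A + "'".

```python
# The general rule for removing immediate left recursion, as code.
# Representation and names are illustrative, not from these notes.

def remove_immediate_left_recursion(A, rhss):
    """Replace A's productions; returns a dict of new productions."""
    alphas = [rhs[1:] for rhs in rhss if rhs[:1] == [A]]   # A -> A alpha_i
    betas = [rhs for rhs in rhss if rhs[:1] != [A]]        # A -> beta_j
    if not alphas:
        return {A: rhss}       # no immediate left recursion: unchanged
    A2 = A + "'"
    return {
        A: [beta + [A2] for beta in betas],        # A  -> beta_j A'
        A2: [a + [A2] for a in alphas] + [[]],     # A' -> alpha_i A' | epsilon
    }
```

For the subtraction grammar, remove_immediate_left_recursion("exp", [["exp", "-", "factor"], ["factor"]]) yields exp -> factor exp' and exp' -> - factor exp' | epsilon, as in the example above.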
Left Factoring
exp -> ( exp ) | ( )
In this example, the common prefix is "(".
For example, consider a pair of productions of the form:
A -> alpha beta1 | alpha beta2
where alpha is a sequence of terminals and/or nonterminals, and
beta1 and beta2 are sequences of terminals and/or nonterminals that
do not have a common prefix (and one of the betas could be epsilon).
We replace those two productions with:
A -> alpha A'
A' -> beta1 | beta2
where A' is a new nonterminal.
Applying this rule to the productions exp -> ( exp ) | ( )
(whose common prefix is "("), they are replaced by:
exp -> ( exp'
exp' -> exp ) | )
Here's the more general algorithm for left factoring (when there may be
more than two productions with a common prefix):
For each nonterminal A, find the longest non-empty prefix alpha common
to two or more of its productions' right-hand sides, and replace:
A -> alpha beta1 | .. | alpha betam | y1 | .. | yn
with:
A -> alpha A' | y1 | .. | yn
A' -> beta1 | .. | betam
Repeat this process until no nonterminal has two productions with a
common prefix.
exp -> ( exp ) | exp exp | ( )
exp -> ( exp ) exp' | ( ) exp'
exp' -> exp exp' | epsilon
exp -> ( exp''
exp'' -> exp ) exp' | ) exp'
exp' -> exp exp' | epsilon
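One left-factoring step like the ones above can be sketched in Python. The representation and the name left_factor_step are my own; repeating the step on each nonterminal until it returns None implements the "repeat until no common prefix" rule.

```python
# One step of left factoring.  Representation and names are
# illustrative, not from these notes.

def left_factor_step(A, rhss):
    """Factor out a longest common prefix shared by two or more of A's
    right-hand sides; return the new productions, or None if A is
    already left factored."""
    best = None
    for rhs in rhss:
        if not rhs:
            continue
        group = [r for r in rhss if r[:1] == rhs[:1]]
        if len(group) >= 2:
            k = 1                      # extend the shared prefix greedily
            while all(len(r) > k and r[k] == group[0][k] for r in group):
                k += 1
            best = (rhs[:k], group)
            break
    if best is None:
        return None
    alpha, group = best
    A2 = A + "'"
    rest = [r for r in rhss if r not in group]
    return {
        A: [alpha + [A2]] + rest,              # A  -> alpha A' | y1 | .. | yn
        A2: [r[len(alpha):] for r in group],   # A' -> beta1 | .. | betam
    }
```

For example, left_factor_step("exp", [["(", "exp", ")"], ["(", ")"]]) yields exp -> ( exp' and exp' -> exp ) | ).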
FIRST and FOLLOW Sets
However, grammars that are not left recursive and are
left factored may still not be LL(1).
As mentioned earlier, to see if a grammar is LL(1), we try building
the parse table for the predictive parser.
If any element in the table contains more than one grammar rule
right-hand side, then the grammar is not LL(1).
FIRST Sets
FIRST(alpha) = { t | (t is a terminal and alpha ==>* t beta) or
(t = epsilon and alpha ==>* epsilon) }
To define FIRST(alpha) for arbitrary alpha, we start by defining FIRST(X),
for a single symbol X (a terminal, a nonterminal, or epsilon):
X -> Y1 Y2 Y3 ... Yk
where each Yi is a single terminal or nonterminal (or
there is just one Y, and it is epsilon).
For each such production, we perform the following actions:
exp -> term exp'
exp' -> - term exp' | epsilon
term -> factor term'
term' -> / factor term' | epsilon
factor -> INTLITERAL | ( exp )
Here are the FIRST sets (starting with nonterminal factor and working up,
since we need to know FIRST(factor) to compute FIRST(term), and we need to
know FIRST(term) to compute FIRST(exp)):
FIRST(factor) = { INTLITERAL, ( }
FIRST(term') = { /, epsilon }
FIRST(term) = { INTLITERAL, ( }  // Note that FIRST(term) includes FIRST(factor);
                                 // since FIRST(factor) does not include epsilon,
                                 // that's all that is in FIRST(term).
FIRST(exp') = { -, epsilon }
FIRST(exp) = { INTLITERAL, ( }   // Note that FIRST(exp) includes FIRST(term);
                                 // since FIRST(term) does not include epsilon,
                                 // that's all that is in FIRST(exp).
X1 X2 ... Xn
where each X is a single terminal or nonterminal, or there is just one
X and it is epsilon.
The rules for computing FIRST(alpha) are essentially the same as the rules for
computing the first set of a nonterminal:
For the example grammar above, here are the FIRST sets for each production
right-hand side:
FIRST( term exp' ) = { INTLITERAL, ( }
FIRST( - term exp' ) = { - }
FIRST( epsilon ) = { epsilon }
FIRST( factor term' ) = { INTLITERAL, ( }
FIRST( / factor term' ) = { / }
FIRST( epsilon ) = { epsilon }
FIRST( INTLITERAL ) = { INTLITERAL }
FIRST( ( exp ) ) = { ( }
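These FIRST sets can be computed by iterating the rules above to a fixed point. A sketch in Python follows; the grammar representation and names are mine: right-hand sides are lists of symbols, epsilon is the empty right-hand side, and the string "epsilon" stands for epsilon inside the sets.

```python
# Computing FIRST sets by iterating to a fixed point.  Representation
# and names are illustrative, not from these notes.

EPS = "epsilon"

def first_sets(productions):
    nonterms = set(productions)
    first = {A: set() for A in nonterms}

    def first_of_seq(alpha):
        """FIRST of a sequence X1 X2 ... Xn (the empty sequence is epsilon)."""
        out = set()
        for X in alpha:
            fx = first[X] if X in nonterms else {X}  # FIRST(terminal) = {terminal}
            out |= fx - {EPS}
            if EPS not in fx:
                return out          # Xi cannot derive epsilon: stop here
        out.add(EPS)                # every Xi can derive epsilon
        return out

    changed = True
    while changed:                  # repeat until nothing new is added
        changed = False
        for A, rhss in productions.items():
            for rhs in rhss:
                f = first_of_seq(rhs)
                if not f <= first[A]:
                    first[A] |= f
                    changed = True
    return first
```

Running this on the five-production expression grammar above reproduces the sets just listed, e.g. FIRST(exp') = { -, epsilon }.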
A -> alpha
A -> beta
and that the current token is a.
If FIRST(alpha) includes a, choose the first production: pop,
push alpha.
If FIRST(beta) includes a, choose the second production :
pop, push beta.
We haven't yet given the rules for using FIRST and FOLLOW sets
to determine whether a grammar is LL(1);
however, you might be able to guess based on this discussion, that if
a is in both FIRST(alpha) and FIRST(beta), the grammar is
not LL(1).
FOLLOW Sets
For a nonterminal A, FOLLOW(A) is the set of terminals
that can appear immediately to the right of A in some
partial derivation; i.e., terminal t is in FOLLOW(A) if:
S ==>+ ... A t... where t is a terminal
Furthermore, if A can be the rightmost symbol in a derivation,
then EOF is in FOLLOW(A).
It is worth noting that epsilon is never in a FOLLOW set.
FOLLOW(A) = {t | (t is a terminal and S ==>+ alpha A t beta)
or (t is EOF and S ==>* alpha A)}
Here are the conditions under which symbols a, c, and EOF are in the
FOLLOW set of nonterminal A (each case pictured as a derivation tree
rooted at S):
Fig. 1: terminal a appears immediately to the right of A in the
tree's frontier, so a is in FOLLOW(A).
Fig. 2: the symbols between A and terminal c in the frontier all
derive epsilon, so c is in FOLLOW(A).
Fig. 3: A is the rightmost symbol in the derivation, so EOF is in
FOLLOW(A).
How to compute FOLLOW(A) for each nonterminal A:
Fig. 4
        S
       /|\
        X            Whatever follows X is also going to follow A
       / \
   alpha  A
S -> B c | D B
B -> a b | c S
D -> d | epsilon
X       FIRST(X)          FOLLOW(X)
-----------------------------------
D       { d, epsilon }    { a, c }
B       { a, c }          { c, EOF }
S       { a, c, d }       { EOF, c }     Note: FOLLOW of S always includes EOF
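The FOLLOW rules can also be run to a fixed point. A sketch in Python follows; the representation and names are mine, and the FIRST sets are assumed to be computed already and passed in as a dict of sets (with "epsilon" marking epsilon).

```python
# Computing FOLLOW sets by iterating to a fixed point.  Representation
# and names are illustrative, not from these notes.

EPS, EOF = "epsilon", "EOF"

def follow_sets(productions, start, first):
    nonterms = set(productions)

    def first_of_seq(alpha):
        """FIRST of the sequence of symbols to the right of an occurrence of A."""
        out = set()
        for X in alpha:
            fx = first[X] if X in nonterms else {X}
            out |= fx - {EPS}
            if EPS not in fx:
                return out
        out.add(EPS)
        return out

    follow = {A: set() for A in nonterms}
    follow[start].add(EOF)          # FOLLOW of the start symbol includes EOF
    changed = True
    while changed:
        changed = False
        for X, rhss in productions.items():
            for rhs in rhss:
                for i, A in enumerate(rhs):
                    if A not in nonterms:
                        continue
                    tail = first_of_seq(rhs[i + 1:])
                    add = tail - {EPS}
                    if EPS in tail:         # whatever follows X follows A too
                        add |= follow[X]
                    if not add <= follow[A]:
                        follow[A] |= add
                        changed = True
    return follow
```

On the grammar S -> B c | D B, B -> a b | c S, D -> d | epsilon, this reproduces the FOLLOW column of the table above.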
1. methodHeader -> VOID ID LPAREN paramList RPAREN
2. paramList -> epsilon
3. paramList -> nonEmptyParamList
4. nonEmptyParamList -> ID ID
5. nonEmptyParamList -> ID ID COMMA nonEmptyParamList
How to Build Parse Tables (and to tell if a grammar is LL(1))
           a     b     c ...   (the current token)
        +-----------------+
      X |     |     |     |
        |-----|-----|-----|
      Y |     |     |     |    inside: the production right-hand side to push
        |-----|-----|-----|
      Z |     |     |     |
        +-----------------+
      ^
      |
      (the nonterminal at the top-of-stack)
Table entry[X,a] is either empty (if having X on top of stack and having
a as the current token means a syntax error) or contains
the right-hand side of
a production whose left-hand-side nonterminal is X -- that right-hand
side is what should be pushed.
for each production X -> alpha:
for each terminal t in First(alpha):
put alpha in Table[X,t]
if epsilon is in First(alpha) then:
for each terminal t in Follow(X):
put alpha in Table[X,t]
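The table-building loop above translates almost line-for-line into Python. In this sketch (names are mine; FIRST and FOLLOW are assumed precomputed and passed in), each cell holds a list of right-hand sides, so a cell with more than one entry, and hence a non-LL(1) grammar, is easy to detect.

```python
# Building the parse table and checking the LL(1) condition.
# Representation and names are illustrative, not from these notes.

EPS = "epsilon"

def build_table(productions, first, follow):
    nonterms = set(productions)

    def first_of_seq(alpha):          # FIRST of a whole right-hand side
        out = set()
        for X in alpha:
            fx = first[X] if X in nonterms else {X}
            out |= fx - {EPS}
            if EPS not in fx:
                return out
        out.add(EPS)
        return out

    table = {}
    for X, rhss in productions.items():
        for alpha in rhss:            # for each production X -> alpha
            f = first_of_seq(alpha)
            for t in f - {EPS}:       # t in FIRST(alpha): put alpha in [X,t]
                table.setdefault((X, t), []).append(alpha)
            if EPS in f:              # epsilon in FIRST(alpha):
                for t in follow[X]:   # also fill the FOLLOW(X) columns
                    table.setdefault((X, t), []).append(alpha)
    return table

def is_ll1(table):
    """The grammar is LL(1) iff no cell has more than one entry."""
    return all(len(entries) == 1 for entries in table.values())
```

Running this on the example grammar below reproduces the conflicts in table[S,a] and table[S,c].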
S -> B c | D B
B -> a b | c S
D -> d | epsilon
First we calculate the FIRST and FOLLOW sets:
X        FIRST(X)          FOLLOW(X)
------------------------------------
D        { d, epsilon }    { a, c }
B        { a, c }          { c, EOF }
S        { a, c, d }       { EOF, c }

Bc       { a, c }
DB       { d, a, c }
ab       { a }
cS       { c }
d        { d }
epsilon  { epsilon }
Then we use those sets to start filling in the parse table:
          a        b       c        d      EOF
    +------------------------------------------+
  S |   B c   |       |  B c   |  D B  |       |
    |   D B   |       |  D B   |       |       |
    |---------|-------|--------|-------|-------|
  B |         |       |        |       |       |
    |         |       |        |       |       |
    |---------|-------|--------|-------|-------|
  D | epsilon |       | epsilon|       |       |
    |         |       |        |       |       |
    +------------------------------------------+
Not all entries have been filled in,
but already we can see that this grammar is not LL(1) since there are
two entries in table[S,a] and in table[S,c].