LL(1) Grammars and Predictive Parsers
LL(1) grammars are parsed by top-down parsers. They construct the derivation tree starting with the start nonterminal and working down. One kind of parser for LL(1) grammars is the predictive parser. The idea is as follows:
Here's a very simple example, using a grammar that defines the language of balanced parentheses or square brackets, and running the parser on the input "( [ ] ) EOF". Note that in the examples on this page we will use sometimes name terminal symbols using single characters (such as: (, ), [, and ]) instead of the token names (lparen, rparen, etc). Also note that in the picture, the top of stack is to the left.
Grammar: | |||
$S$ | $\longrightarrow$ | $\varepsilon$ | ( $S$ ) | [ $S$ ] |
Parse Table: | ( | ) | [ | ] | EOF | |
$S$ | ( $S$ ) | $\varepsilon$ | [ $S$ ] | $\varepsilon$ | $\varepsilon$ |
Input seen so far | stack | Action | ||
( | $S$ EOF | pop, push ( $S$ ) | ||
( | ( $S$ ) EOF | pop, scan | ||
([ | $S$ ) EOF | pop, push [ $S$ ] | ||
([ | [ $S$ ] ) EOF | pop, scan | ||
([] | $S$ ] ) EOF | pop, push nothing | ||
([] | ] ) EOF | pop, scan | ||
([]) | ) EOF | pop, scan | ||
([]) EOF | EOF | pop, scan | ||
([]) EOF | empty stack: input accepted |
Remember, it is not always possible to build a predictive parser given a CFG; only if the CFG is LL(1). For example, the following grammar is not LL(1) (but it is LL(2)):
$S$ | $\longrightarrow$ | ( $S$ ) | [ $S$ ] | ( ) | [ ] |
Draw a picture like the one given above to illustrate what the parser for the grammar:
S -> epsilon | ( S ) | [ S ]
We need to answer two important questions:
It turns out that there is really just one answer: if we build the parse table and no element of the table contains more than one grammar rule right-hand side, then the grammar is LL(1).
Before saying how to build the table we will consider two properties that preclude a context-free grammar from being LL(1): left-recursive grammars and grammars that are not left factored. We will also consider some transformations that can be applied to such grammars to make them LL(1).
First, we will introduce one new definition:
For case 1:
$S$ | $\longrightarrow$ | $A$ $B$ |
$A$ | $\longrightarrow$ | + | - | $\varepsilon$ |
$B$ | $\longrightarrow$ | digit | $B$ digit |
$C$ | $\longrightarrow$ | . $B$ |
For case 2:
$S$ | $\longrightarrow$ | $X$ | $Y$ |
$X$ | $\longrightarrow$ | ( ) |
$Y$ | $\longrightarrow$ | ( $Y$ $Y$ ) |
From now on "context-free grammar" means a grammar without useless nonterms.
In general, it is not a problem for a grammar to be recursive. However, if a grammar is left recursive, it is not LL(1). Fortunately, we can change a grammar to remove immediate left recursion without changing the language of the grammar. Here is how to do the transformation:
$A$ | $\longrightarrow$ | $A$ $\alpha$ |
$|$ | $\beta$ |
$A$ | $\longrightarrow$ | $\beta$ A` |
A` | $\longrightarrow$ | $\alpha$ A` | $\varepsilon$ |
Using this rule, we create a new grammar from a grammar with immediate left recursion. The new grammar is equivalent to the original one; i.e., the two grammars derive exactly the same sets of strings, but the new one is not immediately left recursive (and so has a chance of being LL(1)).
To illustrate why the new grammar is equivalent to the original one, consider the parse trees that can be built using the original grammar:
etc.Note that the derived strings are:
That is, they are of the form "beta, followed by zero or more alphas". The new grammar derives the same set of strings, but the parse trees have a different shape (the single "beta" is derived right away, and then the zero or more alphas):
etc.Consider, for instance, the grammar for arithmetic expressions involving only subtraction:
Exp | $\longrightarrow$ | Exp minus Factor | Factor |
Factor | $\longrightarrow$ | intliteral | ( Exp ) |
Notice that the first rule ( Exp$\longrightarrow$Exp minus Factor) has immediate left recursion, so this grammar is not LL(1). (For example, if the first token is intliteral, you don't know whether to choose the production Exp$\longrightarrow$Exp minus Factor, or Exp$\longrightarrow$Factor. If the next token is minus, then you should choose Exp$\longrightarrow$Exp minus Factor, but if the next token is EOF, then you should choose Exp$\longrightarrow$Factor.
Using the transformation defined above, we remove the immediate left recursion, producing the following new grammar:
Exp | $\longrightarrow$ | Factor Exp` |
Exp` | $\longrightarrow$ | minus Factor Exp` | |
Factor | $\longrightarrow$ | intliteral | ( Exp ) |
Let's consider what the predictive parser built using this grammar does when the input starts with an integer:
Unfortunately, there is a major disadvantage of the new grammar, too. Consider the parse trees for the string 2 - 3 - 4 for the old and the new grammars:
Before eliminating Left Recursion:
After eliminating Left Recursion:
The original parse tree shows the underlying structure of the expression; in particular it groups 2 - 3 in one subtree to reflect the fact that subtraction is left associative. The parse tree for the new grammar is a mess! Its subtrees don't correspond to the sub-expressions of 2 - 3 - 4 at all! Fortunately, we can design a predictive parser to create an abstract-syntax tree that does correctly reflect the structure of the parsed code even though the grammar's parse trees do not.
Note that the rule for removing immediate left recursion given above only handled a somewhat restricted case, where there was only one left-recursive production. Here's a more general rule for removing immediate left recursion:
Note also that there are rules for removing non-immediate left recursion; for example, you can read about how to do that in the compiler textbook by Aho, Sethi & Ullman, on page 177. However, we will not discuss that issue here.
A second property that precludes a grammar from being LL(1) is if it is not left factored, i.e., if a nonterminal has two productions whose right-hand sides have a common prefix. For example, the following grammar is not left factored:
This problem is solved by left-factoring, as follows:
For example, consider the following productions: $Exp$ $\longrightarrow$ ( $Exp$ ) | ( )
Using the rule defined above, they are replaced by:
Here's the more general algorithm for left factoring (when there may be more than two productions with a common prefix):
$A$ | $\longrightarrow$ | $\alpha$ $\beta_1$ | $\ldots$ | $\alpha$ $\beta_m$ | y_{1} | .. | y_{n} |
$A$ | $\longrightarrow$ | $\alpha$ A` | $y_1$ | $\ldots$ | $y_n$ |
A` | $\longrightarrow$ | $\beta_1$ | $\ldots$ | $\beta_m$ |
Repeat this process until no nonterminal has two productions with a common prefix.
Note that this transformation (like the one for removing immediate left recursion) has the disadvantage of making the grammar much harder to understand. However, it is necessary if you need an LL(1) grammar.
Here's an example that demonstrates both left-factoring and immediate left-recursion removal:
$Exp$ | $\longrightarrow$ | ( $Exp$ ) | $Exp$ $Exp$ | ( ) |
$Exp$ | $\longrightarrow$ | ( $Exp$ ) $Exp`$ | ( ) $Exp`$ |
$Exp`$ | $\longrightarrow$ | $Exp$ $Exp`$ | $\varepsilon$ |
$Exp$ | $\longrightarrow$ | ( $Exp``$ |
$Exp``$ | $\longrightarrow$ | $Exp$ ) $Exp`$ | ) $Exp`$ |
$Exp`$ | $\longrightarrow$ | $Exp$ $Exp`$ | $\varepsilon$ |
Using the same grammar: exp -> ( exp ) | exp exp | ( ), do left factoring first, then remove immediate left recursion.
Recall: A predictive parser can only be built for an LL(1) grammar. A grammar is not LL(1) if it is:
To build the table, we must must compute FIRST and FOLLOW sets for the grammar.
Ultimately, we want to define FIRST sets for the right-hand sides of each of the grammar's productions. To do that, we define FIRST sets for arbitrary sequences of terminals and/or nonterminals, or $\varepsilon$ (since that's what can be on the right-hand side of a grammar production). The idea is that for sequence $\alpha$, FIRST($\alpha$) is the set of terminals that begin the strings derivable from $\alpha$, and also, if $\alpha$ can derive $\varepsilon$, then $\varepsilon$ is in FIRST($\alpha$). Using derivation notation: $$\mathrm{FIRST}(\alpha) = \left\{ \; t \; \middle| \; \left( t \; \in \Sigma \wedge \alpha \stackrel{*}\Longrightarrow t \beta \right) \vee \left( t = \varepsilon \wedge \alpha \stackrel{*}\Longrightarrow \varepsilon \right) \right\} $$ To define FIRST($\alpha$) for arbitrary $\alpha$, we start by defining FIRST(X), for a single symbol X (a terminal, a nonterminal, or $\varepsilon$):
For example, consider computing FIRST sets for each of the nonterminals in the following grammar:
Exp | $\longrightarrow$ | Term Exp` |
Exp` | $\longrightarrow$ | minus Term Exp` | $\varepsilon$ |
Term | $\longrightarrow$ | Factor Term' |
Term' | $\longrightarrow$ | divide Factor Term' | $\varepsilon$ |
Factor | $\longrightarrow$ | intlit | ( Exp ) |
Here are the FIRST sets (starting with nonterminal factor and working up, since we need to know FIRST(factor) to compute FIRST(term), and we need to know FIRST(term) to compute FIRST(exp)):
Once we have computed FIRST(X) for each terminal and nonterminal X, we can compute FIRST($\alpha$) for every production's right-hand-side $\alpha$. In general, $\alpha$ will be of the form:
For the example grammar above, here are the FIRST sets for each production right-hand side:
Why do we care about the FIRST($\alpha$) sets? During parsing, suppose that there are two productions:
Production 1: | $A$ | $\longrightarrow$ | $\alpha$ |
Production 2: | $A$ | $\longrightarrow$ | $\beta$ |
FOLLOW sets are only defined for single nonterminals. The definition is as follows:
Using notation: $$\mathrm{FOLLOW}(A) = \left\{ \; \mathrm{t} \; \middle| \; \left( \mathrm{t} \; \in \Sigma \wedge S \stackrel{+}\Longrightarrow \alpha \; A \; \mathrm{t} \; \beta \right) \vee \left( \mathrm{t} = \mathrm{EOF} \wedge S \stackrel{*}\Longrightarrow \alpha A \right) \right\} $$
Here are some pictures illustrating the conditions under which symbols a, c, and EOF are in the FOLLOW set of nonterminal A:
Figure 1
Figure 2
Figure 3
Figure 4
It is worth noting that:
Here's an example of FOLLOW sets (and the FIRST sets we need to compute them). In this example, nonterminals are upper-case letters, and terminals are lower-case letters.
$S$ | $\longrightarrow$ | $B$ c | $D$ $B$ |
$B$ | $\longrightarrow$ | a b | c $S$ |
$D$ | $\longrightarrow$ | d | $\varepsilon$ |
$\alpha$ | FIRST($\alpha$) | FOLLOW($\alpha$) | |||
$D$ | {d, $\varepsilon$} | {a, c} | |||
$B$ | {a, c} | {c, EOF} | |||
$S$ | {a, c, d} | {EOF, c} | Note that FOLLOW($S$) always includes EOF |
Now let's consider why we care about FOLLOW sets:
Here are five grammar productions for (simplified) method headers:
1. methodHeader -> VOID ID LPAREN paramList RPAREN 2. paramList -> epsilon 3. paramList -> nonEmptyParamList 4. nonEmptyParamList -> ID ID 5. nonEmptyParamList -> ID ID COMMA nonEmptyParamListQuestion 1: Compute the FIRST and FOLLOW sets for the three nonterminals, and the FIRST sets for each production right-hand side.
Question 2: Draw a picture to illustrate what the predictive parser will do, given the input sequence of tokens: "VOID ID LPAREN RPAREN EOF". Include an explanation of how the FIRST and FOLLOW sets are used when there is a nonterminal on the top-of-stack that has more than one production.
Recall that the form of the parse table is:
Table entry[$X$,a] is either empty (if having $X$ on top of stack and having a as the current token means a syntax error) or contains the right-hand side of a production whose left-hand-side nonterminal is $X$ -- that right-hand side is what should be pushed.
To build the table, we fill in the rows one at a time for each nonterminal $X$ as follows:
The grammar is not LL(1) iff there is more than one entry for any cell in the table.
$S$ | $\longrightarrow$ | $B$ c | $D$ $B$ |
$B$ | $\longrightarrow$ | a b | c $S$ |
$D$ | $\longrightarrow$ | d | $\varepsilon$ |
$\alpha$ | FIRST($\alpha$) | FOLLOW($\alpha$) | |||
$D$ | {d, $\varepsilon$} | {a, c} | |||
$B$ | {a, c} | {c, EOF} | |||
$S$ | {a, c, d} | {EOF, c} | |||
$B$ c | {a, c} | - | |||
$D$ $B$ | {d, a, c} | - | |||
a b | {a} | - | |||
c $S$ | {c} | - | |||
d | {d} | - | |||
$\varepsilon$ | {$\varepsilon$} | - |
Here's how we filled in this much of the table:
Finish filling in the parse table given above.
How to Code a Predictive Parser
Now, suppose we actually want to code a predictive parser for a grammar that is LL(1). The simplest idea is to use a table-driven parser with an explicit stack. Here's pseudo-code for a table-driven predictive parser:
Stack.push(EOF); Stack.push(start-nonterminal); currToken = scan(); while (! Stack.empty()) { topOfStack = Stack.pop(); if (isNonTerm(topOfStack)) { // top of stack symbol is a nonterminal p = table[topOfStack, currToken]; if (p is empty) report syntax error and quit else { // p is a production's right-hand side push p, one symbol at a time, from right to left } } else { // top of stack symbol is a terminal if (topOfStack == currToken) currToken = scan(); else report syntax error and quit } }
Here is a CFG (with rule numbers):
S -> epsilon (1) | X Y Z (2) X -> epsilon (3) | X S (4) Y -> epsilon (5) | a Y b (6) Z -> c Z (7) | d (8)
Question 1(a): Compute the First and Follow Sets for each nonterminal.
Question 1(b): Draw the Parse Table.