There are algorithms that can be used to parse the language defined by an
arbitrary CFG. However, in the worst case, the algorithms take
O(n^3) time, where n is the number of tokens. That is too
slow!
Fortunately, there are classes of grammars for which O(n) parsers can
be built (and given a grammar, we can quickly test whether it is in
such a class).
Two such classes are:
LL(1) grammars are parsed by top-down parsers. They construct the derivation
tree starting with the start nonterminal and working down. One kind of
parser for LL(1) grammars is the predictive parser.
The idea is as follows:
Here's a very simple example, using a grammar that defines the language
of balanced parentheses or square brackets, and running the parser on
the input "( [ ] ) EOF".
Note that in the examples in this set of notes we will use actual characters
(such as: (, ), [, and ]) instead
of the token names (LPAREN, RPAREN, etc).
Also note that in the picture, the top of stack is to the left.
Draw a picture like the one given above to illustrate what the parser
for the grammar:
We need to answer two important questions:
It turns out that there is really just one answer: if we build the
parse table and no element of the table contains more than one
grammar rule right-hand side, then the grammar is LL(1).
Before saying how to build the table we will consider two properties
that preclude a context-free grammar from being LL(1):
left-recursive grammars and grammars that are not left
factored.
We will also consider some transformations that can be applied to
such grammars to make them LL(1).
First, we will introduce one new definition:
In general, it is not a problem for a grammar to be recursive.
However, if a grammar is left recursive, it is not LL(1).
Fortunately, we can change a grammar to remove immediate left recursion
without changing the language of the grammar.
Here is how to do the transformation:
To illustrate why the new grammar is equivalent to the original one,
consider the parse trees that can be built using the original grammar:
Example: Consider the grammar for arithmetic expressions involving only
subtraction:
Using the transformation defined above, we remove the immediate left
recursion, producing the following new grammar:
Unfortunately, there is a major disadvantage of the new grammar, too.
Consider the parse trees for the string 2 - 3 - 4 for
the old and the new grammars:
Note that the rule for removing immediate left recursion given above
only handled a somewhat restricted case, where there was only one
left-recursive production.
Here's a more general rule for removing immediate left recursion:
Note also that there are rules for removing non-immediate
left recursion; for example, you can read about how to do that in the compiler
textbook by Aho, Sethi & Ullman, on page 177.
However, we will not discuss that issue here.
A second property that precludes a grammar from being LL(1) is if
it is not left factored, i.e., if a nonterminal has two productions
whose right-hand sides have a common prefix.
For example, the following grammar is not left factored:
This problem is solved by left-factoring, as follows:
Note that this transformation (like the one for removing immediate left
recursion) has the disadvantage of making the grammar much harder to
understand.
However, it is necessary if you need an LL(1) grammar.
Here's an example that demonstrates both left-factoring and immediate
left-recursion removal:
Using the same grammar: exp -> ( exp ) | exp exp | ( ),
do left factoring first, then remove immediate left recursion.
Recall: A predictive parser can only be built for an LL(1) grammar.
A grammar is not LL(1) if it is:
To build the table, we must compute FIRST and
FOLLOW sets for the grammar.
Ultimately, we want to define FIRST sets for the right-hand sides
of each of the grammar's productions.
To do that, we define FIRST sets for arbitrary sequences of terminals
and/or nonterminals, or epsilon (since that's what can be on the right-hand
side of a grammar production).
The idea is that for sequence alpha, FIRST(alpha) is the set of
terminals that begin the
strings derivable from alpha, and also, if alpha can derive epsilon,
then epsilon is in FIRST(alpha).
Using derivation notation:
For example, consider computing FIRST sets for each of the nonterminals
in the following grammar:
Once we have computed FIRST(X) for each terminal and nonterminal X,
we can compute FIRST(alpha) for every
production's right-hand-side alpha.
In general, alpha will be of the form:
Why do we care about the FIRST(alpha) sets?
During parsing, suppose the top-of-stack symbol is nonterminal A, that
there are two productions:
FOLLOW sets are only defined for single nonterminals.
The definition is as follows:
Using notation:
It is worth noting that:
Here's an example of FOLLOW sets (and the FIRST sets we need to
compute them). In this example, nonterminals are upper-case letters, and
terminals are lower-case letters.
Now let's consider why we care about FOLLOW sets:
Here are five grammar productions for (simplified) method headers:
Question 1:
Compute the FIRST and FOLLOW sets for the three nonterminals,
and the FIRST sets for each production right-hand side.
Question 2:
Draw a picture to illustrate what the predictive parser will do, given
the input sequence of
tokens: "VOID ID LPAREN RPAREN EOF".
Include an explanation of how the FIRST and FOLLOW sets
are used when there is a nonterminal on the top-of-stack that has
more than one production.
Recall that the form of the parse table is:
To build the table, we fill in the rows one at a time for each
nonterminal X as follows:
The grammar is not LL(1) iff there is more
than one entry for any cell in the table.
Let's try building a parse table for the following grammar:
Here's how we filled in this much of the table:
Finish filling in the parse table given above.
Overview
LL(1)
^^ ^
|| |___ one token of look-ahead
||_____ do a leftmost derivation
|______ scan the input left-to-right
LALR(1)
^ ^^ ^
| || |__ one token of look-ahead
| ||____ do a rightmost derivation in reverse
| |_____ scan the input left-to-right
|_______ LA means "look-ahead"; this has nothing to do with the
number of tokens the parser can look at before it chooses
what to do -- it is a technical term that only means
something when you study how LR parsers work...
LALR(1) grammars are:
So we will learn about LL(1) grammars (remember, if a grammar is LL(1) then
it is guaranteed to be LALR(1), too, so when using Java Cup, if your
grammar is not LALR(1), you can always make it LL(1) and it will work).
LL(1) Grammars and Predictive Parsers
Here's how the predictive parser works:
does on the input: "[[]]".
grammar: S -> epsilon | ( S ) | [ S ]
parse table:
           (         )         [         ]        EOF
    +--------------------------------------------------+
  S |  ( S )  | epsilon |  [ S ]  | epsilon | epsilon  |
    +--------------------------------------------------+

input seen so far     stack      action
-----------------     -----      ------
(                     S EOF      pop, push "(S)"
(                     (S) EOF    pop, scan (top-of-stack term matches curr token)
([                    S) EOF     pop, push "[S]"
([                    [S]) EOF   pop, scan (top-of-stack term matches curr token)
([]                   S]) EOF    pop, push epsilon (no push)
([]                   ]) EOF     pop, scan (top-of-stack term matches curr token)
([])                  ) EOF      pop, scan (top-of-stack term matches curr token)
([]) EOF              EOF        pop, scan (top-of-stack term matches curr token)
([]) EOF                         empty stack: input accepted!
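The loop traced above can be sketched as a small table-driven parser. The Python below is illustrative only: the grammar representation, the `TABLE` layout, and the name `parse` are my own, not from these notes.

```python
# A table-driven predictive parser for the grammar
#   S -> epsilon | ( S ) | [ S ]
# The representation (TABLE, parse) is illustrative, not from these notes.

EOF = "EOF"

# Parse table: TABLE[nonterminal][current token] = right-hand side to push.
# An epsilon entry is the empty list (pop, push nothing).
TABLE = {
    "S": {
        "(": ["(", "S", ")"],
        "[": ["[", "S", "]"],
        ")": [],
        "]": [],
        EOF: [],
    }
}

def parse(text):
    """Return True iff text is a balanced string of ()s and []s."""
    tokens = list(text) + [EOF]
    pos = 0
    stack = ["S", EOF]                  # top of stack is to the left
    while stack:
        top = stack.pop(0)
        curr = tokens[pos]
        if top in TABLE:                # nonterminal: consult the table
            rhs = TABLE[top].get(curr)
            if rhs is None:             # empty table entry: syntax error
                return False
            stack = rhs + stack         # pop, push right-hand side
        elif top == curr:               # terminal matches current token:
            pos += 1                    # pop, scan
        else:
            return False                # terminal mismatch: syntax error
    return pos == len(tokens)           # empty stack + all input consumed
```

For example, parse("([])") accepts while parse("([)]") rejects, mirroring the trace above.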
Remember, it is not always possible to build a predictive parser given a CFG;
only if the CFG is LL(1). For example, the following grammar is not
LL(1) (but it is LL(2)):
S -> ( S ) | [ S ] | ( ) | [ ]
If we try to parse an input that starts with a left paren, we are in trouble!
We don't know whether to choose the first production: S -> ( S ),
or the third one: S -> ( ) .
If the next token is a right paren, we want to push "()".
If the next token is a left paren, we want to push "(S)".
So here we need two tokens of look-ahead.
S -> epsilon | ( S ) | [ S ]
Grammar Transformations
A nonterminal X is useless if either:
Here are some examples of useless nonterminals:
1. (for case 1):
S -> A B
A -> + | - | epsilon
B -> digit | B digit
C -> . B
C is useless
2. (for case 2):
S -> X | Y
X -> ( )
Y -> ( Y Y )
Y just derives more and more nonterminals.
So it is useless.
From now on, "context-free grammar" means a grammar without useless nonterminals.
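Both kinds of uselessness can be detected mechanically. Here is a sketch in Python; the grammar representation (a dict mapping each nonterminal to a list of right-hand sides, each a list of symbols) and the function name are my own. Case 2 corresponds to nonterminals that can never derive a string of terminals; case 1's C is a nonterminal unreachable from the start symbol.

```python
# Detecting useless nonterminals.  Representation and names are
# illustrative, not from these notes.

def useless_nonterminals(productions, start):
    nonterms = set(productions)

    # Case 2: find the "productive" nonterminals -- those that can
    # derive some string of terminals -- by iterating to a fixed point.
    productive = set()
    changed = True
    while changed:
        changed = False
        for A, rhss in productions.items():
            if A in productive:
                continue
            for rhs in rhss:
                if all(s not in nonterms or s in productive for s in rhs):
                    productive.add(A)
                    changed = True
                    break

    # Case 1: find the nonterminals reachable from the start symbol.
    reachable, work = {start}, [start]
    while work:
        for rhs in productions.get(work.pop(), []):
            for s in rhs:
                if s in nonterms and s not in reachable:
                    reachable.add(s)
                    work.append(s)

    # A nonterminal is useless unless it is both productive and reachable.
    return nonterms - (productive & reachable)
```

On example 2 above this reports {"Y"}, and on example 1 it reports {"C"}.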
Left Recursion
Given two productions of the form: A -> A alpha | beta
where:
Using this rule, we create a new grammar from a grammar with immediate
left recursion.
The new grammar is equivalent to the original one; i.e.,
the two grammars derive exactly the same sets of strings, but the new
one is not immediately left recursive (and so has a chance of
being LL(1)).
Replace those two productions with the following three productions:
A -> beta A'
where A' is a new nonterminal.
A' -> alpha A' | epsilon
   A         A               A        etc.
   |        / \             / \
  beta     A   alpha       A   alpha
           |              / \
          beta           A   alpha
                         |
                        beta
Note that the derived strings are:
That is, they are of the form "beta, followed by zero or more alphas".
The new grammar derives the same set of strings, but the parse trees
have a different shape (the single "beta" is derived right away, and
then the zero or more alphas):
   A              A                  A         etc.
  / \            / \                / \
beta  A'       beta  A'           beta  A'
      |             / \                / \
   epsilon      alpha  A'          alpha  A'
                       |                / \
                    epsilon         alpha  A'
                                          |
                                       epsilon
exp -> exp - factor | factor
factor -> INTLITERAL | ( exp )
Notice that the first rule (exp -> exp - factor) has
immediate left recursion, so this grammar is not LL(1).
(For example, if the first token is INTLITERAL, you don't know
whether to choose the production exp -> exp - factor, or
exp -> factor.
If the next token is MINUS, then you should choose
exp -> exp - factor, but if the next token is EOF, then you
should choose exp -> factor.)
exp -> factor exp'
exp' -> - factor exp' | epsilon
factor -> INTLITERAL | ( exp )
Let's consider what the predictive parser built using this grammar
does when the input starts with an integer:
So with the new grammar, the parser is able to tell (using only one token
look-ahead) what action to perform.
   Parse tree using              Parse tree using
   the original grammar:         the new grammar:

        exp                            exp
        /|\                           /   \
       / | \                         /     \
     exp -  factor               factor    exp'
     /|\      |                    |      / | \
    / | \     4                    2     /  |  \
  exp -  factor                         -  factor  exp'
   |      |                                 |     / | \
 factor   3                                 3    /  |  \
   |                                            -  factor  exp'
   2                                                 |      |
                                                     4   epsilon
The original parse tree shows the underlying structure of the expression;
in particular it groups 2 - 3 in one subtree to reflect the fact
that subtraction is left associative.
The parse tree for the new grammar is a mess!
Its subtrees don't correspond to the sub-expressions of 2 - 3 - 4
at all!
Fortunately, we can design a predictive parser to create an abstract-syntax
tree that does correctly reflect the structure of the parsed
code even though the grammar's parse trees do not.
A -> A alpha1 | A alpha2 | .. | A alpham | beta1 | .. | betan
A -> beta1 A' | beta2 A' | .. | betan A'
A' -> alpha1 A' | .. | alpham A' | epsilon
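The general rule above is mechanical enough to code directly. Here is a sketch in Python; the representation and names are mine: a right-hand side is a list of symbols, epsilon is the empty list, and A' is spelled A + "'".

```python
# The general rule for removing immediate left recursion, as code.
# Representation and names are illustrative, not from these notes.

def remove_immediate_left_recursion(A, rhss):
    """Replace A's productions; returns a dict of new productions."""
    alphas = [rhs[1:] for rhs in rhss if rhs[:1] == [A]]   # A -> A alpha_i
    betas = [rhs for rhs in rhss if rhs[:1] != [A]]        # A -> beta_j
    if not alphas:
        return {A: rhss}       # no immediate left recursion: unchanged
    A2 = A + "'"
    return {
        A: [beta + [A2] for beta in betas],        # A  -> beta_j A'
        A2: [a + [A2] for a in alphas] + [[]],     # A' -> alpha_i A' | epsilon
    }
```

For the subtraction grammar, remove_immediate_left_recursion("exp", [["exp", "-", "factor"], ["factor"]]) yields exp -> factor exp' and exp' -> - factor exp' | epsilon, as in the example above.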
Left Factoring
exp -> ( exp ) | ( )
In this example, the common prefix is "(".
For example, consider a pair of productions of the form:
A -> alpha beta1 | alpha beta2
where alpha is a sequence of terminals and/or nonterminals, and
beta1 and beta2 are sequences of terminals and/or nonterminals that
do not have a common prefix (and one of the betas could be epsilon).
We replace those two productions with:
A -> alpha A'
A' -> beta1 | beta2
where A' is a new nonterminal.
Applying this rule to the productions exp -> ( exp ) | ( )
(whose common prefix is "("), they are replaced by:
exp -> ( exp'
exp' -> exp ) | )
Here's the more general algorithm for left factoring (when there may be
more than two productions with a common prefix):
For each nonterminal A, find the longest non-empty prefix alpha common
to two or more of its productions' right-hand sides, and replace:
A -> alpha beta1 | .. | alpha betam | y1 | .. | yn
with:
A -> alpha A' | y1 | .. | yn
A' -> beta1 | .. | betam
Repeat this process until no nonterminal has two productions with a
common prefix.
exp -> ( exp ) | exp exp | ( )
exp -> ( exp ) exp' | ( ) exp'
exp' -> exp exp' | epsilon
exp -> ( exp''
exp'' -> exp ) exp' | ) exp'
exp' -> exp exp' | epsilon
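One left-factoring step like the ones above can be sketched in Python. The representation and the name left_factor_step are my own; repeating the step on each nonterminal until it returns None implements the "repeat until no common prefix" rule.

```python
# One step of left factoring.  Representation and names are
# illustrative, not from these notes.

def left_factor_step(A, rhss):
    """Factor out a longest common prefix shared by two or more of A's
    right-hand sides; return the new productions, or None if A is
    already left factored."""
    best = None
    for rhs in rhss:
        if not rhs:
            continue
        group = [r for r in rhss if r[:1] == rhs[:1]]
        if len(group) >= 2:
            k = 1                      # extend the shared prefix greedily
            while all(len(r) > k and r[k] == group[0][k] for r in group):
                k += 1
            best = (rhs[:k], group)
            break
    if best is None:
        return None
    alpha, group = best
    A2 = A + "'"
    rest = [r for r in rhss if r not in group]
    return {
        A: [alpha + [A2]] + rest,              # A  -> alpha A' | y1 | .. | yn
        A2: [r[len(alpha):] for r in group],   # A' -> beta1 | .. | betam
    }
```

For example, left_factor_step("exp", [["(", "exp", ")"], ["(", ")"]]) yields exp -> ( exp' and exp' -> exp ) | ).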
FIRST and FOLLOW Sets
However, grammars that are not left recursive and are
left factored may still not be LL(1).
As mentioned earlier, to see if a grammar is LL(1), we try building
the parse table for the predictive parser.
If any element in the table contains more than one grammar rule
right-hand side, then the grammar is not LL(1).
FIRST Sets
FIRST(alpha) = { t | (t is a terminal and alpha ==>* t beta) or
(t = epsilon and alpha ==>* epsilon) }
To define FIRST(alpha) for arbitrary alpha, we start by defining FIRST(X),
for a single symbol X (a terminal, a nonterminal, or epsilon):
X -> Y1 Y2 Y3 ... Yk
where each Yi is a single terminal or nonterminal (or
there is just one Y, and it is epsilon).
For each such production, we perform the following actions:
exp -> term exp'
exp' -> - term exp' | epsilon
term -> factor term'
term' -> / factor term' | epsilon
factor -> INTLITERAL | ( exp )
Here are the FIRST sets (starting with nonterminal factor and working up,
since we need to know FIRST(factor) to compute FIRST(term), and we need to
know FIRST(term) to compute FIRST(exp)):
FIRST(factor) = { INTLITERAL, ( }
FIRST(term') = { /, epsilon }
FIRST(term) = { INTLITERAL, ( }  // Note that FIRST(term) includes FIRST(factor);
                                 // since FIRST(factor) does not include epsilon,
                                 // that's all that is in FIRST(term).
FIRST(exp') = { -, epsilon }
FIRST(exp) = { INTLITERAL, ( }   // Note that FIRST(exp) includes FIRST(term);
                                 // since FIRST(term) does not include epsilon,
                                 // that's all that is in FIRST(exp).
X1 X2 ... Xn
where each X is a single terminal or nonterminal, or there is just one
X and it is epsilon.
The rules for computing FIRST(alpha) are essentially the same as the rules for
computing the first set of a nonterminal:
For the example grammar above, here are the FIRST sets for each production
right-hand side:
FIRST( term exp' ) = { INTLITERAL, ( }
FIRST( - term exp' ) = { - }
FIRST( epsilon ) = { epsilon }
FIRST( factor term' ) = { INTLITERAL, ( }
FIRST( / factor term' ) = { / }
FIRST( epsilon ) = { epsilon }
FIRST( INTLITERAL ) = { INTLITERAL }
FIRST( ( exp ) ) = { ( }
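These FIRST sets can be computed by iterating the rules above to a fixed point. A sketch in Python follows; the grammar representation and names are mine: right-hand sides are lists of symbols, epsilon is the empty right-hand side, and the string "epsilon" stands for epsilon inside the sets.

```python
# Computing FIRST sets by iterating to a fixed point.  Representation
# and names are illustrative, not from these notes.

EPS = "epsilon"

def first_sets(productions):
    nonterms = set(productions)
    first = {A: set() for A in nonterms}

    def first_of_seq(alpha):
        """FIRST of a sequence X1 X2 ... Xn (the empty sequence is epsilon)."""
        out = set()
        for X in alpha:
            fx = first[X] if X in nonterms else {X}  # FIRST(terminal) = {terminal}
            out |= fx - {EPS}
            if EPS not in fx:
                return out          # Xi cannot derive epsilon: stop here
        out.add(EPS)                # every Xi can derive epsilon
        return out

    changed = True
    while changed:                  # repeat until nothing new is added
        changed = False
        for A, rhss in productions.items():
            for rhs in rhss:
                f = first_of_seq(rhs)
                if not f <= first[A]:
                    first[A] |= f
                    changed = True
    return first
```

Running this on the five-production expression grammar above reproduces the sets just listed, e.g. FIRST(exp') = { -, epsilon }.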
A -> alpha
A -> beta
and that the current token is a.
If FIRST(alpha) includes a, choose the first production: pop,
push alpha.
If FIRST(beta) includes a, choose the second production :
pop, push beta.
We haven't yet given the rules for using FIRST and FOLLOW sets
to determine whether a grammar is LL(1);
however, you might be able to guess based on this discussion, that if
a is in both FIRST(alpha) and FIRST(beta), the grammar is
not LL(1).
FOLLOW Sets
For a nonterminal A, FOLLOW(A) is the set of terminals
that can appear immediately to the right of A in some
partial derivation; i.e., terminal t is in FOLLOW(A) if:
S ==>+ ... A t... where t is a terminal
Furthermore, if A can be the rightmost symbol in a derivation,
then EOF is in FOLLOW(A).
It is worth noting that epsilon is never in a FOLLOW set.
FOLLOW(A) = {t | (t is a terminal and S ==>+ alpha A t beta)
or (t is EOF and S ==>* alpha A)}
Here are the conditions under which symbols a, c, and EOF are in the
FOLLOW set of nonterminal A (each case pictured as a derivation tree
rooted at S):
Fig. 1: terminal a appears immediately to the right of A in the
tree's frontier, so a is in FOLLOW(A).
Fig. 2: the symbols between A and terminal c in the frontier all
derive epsilon, so c is in FOLLOW(A).
Fig. 3: A is the rightmost symbol in the derivation, so EOF is in
FOLLOW(A).
How to compute FOLLOW(A) for each nonterminal A:
Fig. 4
        S
       /|\
        X            Whatever follows X is also going to follow A
       / \
   alpha  A
S -> B c | D B
B -> a b | c S
D -> d | epsilon
X       FIRST(X)          FOLLOW(X)
-----------------------------------
D       { d, epsilon }    { a, c }
B       { a, c }          { c, EOF }
S       { a, c, d }       { EOF, c }     Note: FOLLOW of S always includes EOF
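The FOLLOW rules can also be run to a fixed point. A sketch in Python follows; the representation and names are mine, and the FIRST sets are assumed to be computed already and passed in as a dict of sets (with "epsilon" marking epsilon).

```python
# Computing FOLLOW sets by iterating to a fixed point.  Representation
# and names are illustrative, not from these notes.

EPS, EOF = "epsilon", "EOF"

def follow_sets(productions, start, first):
    nonterms = set(productions)

    def first_of_seq(alpha):
        """FIRST of the sequence of symbols to the right of an occurrence of A."""
        out = set()
        for X in alpha:
            fx = first[X] if X in nonterms else {X}
            out |= fx - {EPS}
            if EPS not in fx:
                return out
        out.add(EPS)
        return out

    follow = {A: set() for A in nonterms}
    follow[start].add(EOF)          # FOLLOW of the start symbol includes EOF
    changed = True
    while changed:
        changed = False
        for X, rhss in productions.items():
            for rhs in rhss:
                for i, A in enumerate(rhs):
                    if A not in nonterms:
                        continue
                    tail = first_of_seq(rhs[i + 1:])
                    add = tail - {EPS}
                    if EPS in tail:         # whatever follows X follows A too
                        add |= follow[X]
                    if not add <= follow[A]:
                        follow[A] |= add
                        changed = True
    return follow
```

On the grammar S -> B c | D B, B -> a b | c S, D -> d | epsilon, this reproduces the FOLLOW column of the table above.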
1. methodHeader -> VOID ID LPAREN paramList RPAREN
2. paramList -> epsilon
3. paramList -> nonEmptyParamList
4. nonEmptyParamList -> ID ID
5. nonEmptyParamList -> ID ID COMMA nonEmptyParamList
How to Build Parse Tables (and to tell if a grammar is LL(1))
           a     b     c ...   (the current token)
        +-----------------+
      X |     |     |     |
        |-----|-----|-----|
      Y |     |     |     |    inside: the production right-hand side to push
        |-----|-----|-----|
      Z |     |     |     |
        +-----------------+
      ^
      |
      (the nonterminal at the top-of-stack)
Table entry[X,a] is either empty (if having X on top of stack and having
a as the current token means a syntax error) or contains
the right-hand side of
a production whose left-hand-side nonterminal is X -- that right-hand
side is what should be pushed.
for each production X -> alpha:
for each terminal t in First(alpha):
put alpha in Table[X,t]
if epsilon is in First(alpha) then:
for each terminal t in Follow(X):
put alpha in Table[X,t]
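The table-building loop above translates almost line-for-line into Python. In this sketch (names are mine; FIRST and FOLLOW are assumed precomputed and passed in), each cell holds a list of right-hand sides, so a cell with more than one entry, and hence a non-LL(1) grammar, is easy to detect.

```python
# Building the parse table and checking the LL(1) condition.
# Representation and names are illustrative, not from these notes.

EPS = "epsilon"

def build_table(productions, first, follow):
    nonterms = set(productions)

    def first_of_seq(alpha):          # FIRST of a whole right-hand side
        out = set()
        for X in alpha:
            fx = first[X] if X in nonterms else {X}
            out |= fx - {EPS}
            if EPS not in fx:
                return out
        out.add(EPS)
        return out

    table = {}
    for X, rhss in productions.items():
        for alpha in rhss:            # for each production X -> alpha
            f = first_of_seq(alpha)
            for t in f - {EPS}:       # t in FIRST(alpha): put alpha in [X,t]
                table.setdefault((X, t), []).append(alpha)
            if EPS in f:              # epsilon in FIRST(alpha):
                for t in follow[X]:   # also fill the FOLLOW(X) columns
                    table.setdefault((X, t), []).append(alpha)
    return table

def is_ll1(table):
    """The grammar is LL(1) iff no cell has more than one entry."""
    return all(len(entries) == 1 for entries in table.values())
```

Running this on the example grammar below reproduces the conflicts in table[S,a] and table[S,c].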
S -> B c | D B
B -> a b | c S
D -> d | epsilon
First we calculate the FIRST and FOLLOW sets:
X        FIRST(X)          FOLLOW(X)
------------------------------------
D        { d, epsilon }    { a, c }
B        { a, c }          { c, EOF }
S        { a, c, d }       { EOF, c }

Bc       { a, c }
DB       { d, a, c }
ab       { a }
cS       { c }
d        { d }
epsilon  { epsilon }
Then we use those sets to start filling in the parse table:
          a        b       c        d      EOF
    +------------------------------------------+
  S |   B c   |       |  B c   |  D B  |       |
    |   D B   |       |  D B   |       |       |
    |---------|-------|--------|-------|-------|
  B |         |       |        |       |       |
    |         |       |        |       |       |
    |---------|-------|--------|-------|-------|
  D | epsilon |       | epsilon|       |       |
    |         |       |        |       |       |
    +------------------------------------------+
Not all entries have been filled in,
but already we can see that this grammar is not LL(1) since there are
two entries in table[S,a] and in table[S,c].