
## LR Parsing Overview

There are several different kinds of bottom-up parsing. We will discuss an approach called LR parsing, which includes SLR, LALR, and LR parsers. LR means that the input is scanned left-to-right, and that a rightmost derivation, in reverse, is constructed. SLR means "simple" LR, and LALR means "look-ahead" LR.

Every SLR(1) grammar is also LALR(1), and every LALR(1) grammar is also LR(1), so SLR is the most limited of the three, and LR is the most general. In practice, it is fairly easy to write an LALR(1) grammar for most programming languages (i.e., the full "power" of an LR(1) parser isn't usually needed). A disadvantage of full LR(1) parsers is that their tables can be very large. Therefore, parser generators like Yacc and Java Cup produce LALR(1) parsers.

Let's start by considering the advantages and disadvantages of the LR parsing family:

• Almost all programming languages have LR grammars.
• LR parsers take time and space linear in the size of the input (with a constant factor determined by the grammar).
• LR is strictly more powerful than LL (for example, every LL(1) grammar is also LR(1), but not vice versa; nearly all LL(1) grammars that arise in practice are also LALR(1), although contrived exceptions exist).
• LR grammars are more "natural" than LL grammars (e.g., the grammars for expression languages get mangled when we remove the left recursion to make them LL(1), but that isn't necessary for an LALR(1) or an LR(1) grammar).
• Although an LR grammar is usually easier to understand than the corresponding LL grammar, the parser itself is harder to understand and to write (thus, LR parsers are built using parser generators, rather than being written by hand).
• When we use a parser generator, if the grammar that we provide is not LALR(1), it can be difficult to figure out how to fix that.
• Error repair may be more difficult using LR parsing than using LL.
• Table sizes may be larger (about a factor of 2) than those used for LL parsing.

Recall that top-down parsers use a stack. The contents of the stack represent a prediction of what the rest of the input should look like. The symbols on the stack, from top to bottom, should "match" the remaining input, from first to last token. For example, earlier, we looked at the example grammar

 Grammar: $S$ $\longrightarrow$ $\varepsilon$ | ( $S$ ) | [ $S$ ]

and parsed the input string ([]). At one point during the parse, after the first parenthesis has been consumed, the stack contains

    [ S ] ) EOF
(with the top-of-stack at the left). This is a prediction that the remaining input will start with a '[', followed by zero or more tokens matching an S, followed by the tokens ']' and ')', in that order, followed by end-of-file.

Bottom-up parsers also use a stack, but in this case, the stack represents a summary of the input already seen, rather than a prediction about input yet to be seen. For now, we will pretend that the stack symbols are terminals and nonterminals (as they are for predictive parsers). This isn't quite true, but it makes our introduction to bottom-up parsing easier to understand.

A bottom-up parser is also called a "shift-reduce" parser because it performs two kinds of operations: shift operations and reduce operations. A shift operation simply shifts the next input token from the input to the top of the stack. A reduce operation is only possible when the top N symbols on the stack match the right-hand side of a production in the grammar. A reduce operation pops those symbols off the stack and pushes the nonterminal on the left-hand side of the production.

One way to think about LR parsing is that the parse tree for a given input is built, starting at the leaves and working up towards the root. More precisely, a reverse rightmost derivation is constructed.

Recall that a derivation (using a given grammar) is performed as follows:

1. start with the start symbol (i.e., the current string is the start symbol)
2. repeat:
• choose a nonterminal X in the current string
• choose a grammar rule X → alpha
• replace X in the current string with alpha
until there are no more nonterminals in the current string

A rightmost derivation is one in which the rightmost nonterminal is always the one chosen.

Rightmost Derivation

CFG

 $E$ $\longrightarrow$ $E$ + $T$ | $T$
 $T$ $\longrightarrow$ $T$ * $F$ | $F$
 $F$ $\longrightarrow$ id | ( $E$ )

Rightmost derivation

$E$
1. $\Longrightarrow$ $E$ + $T$
2. $\Longrightarrow$ $E$ + $T$ * $F$
3. $\Longrightarrow$ $E$ + $T$ * id
4. $\Longrightarrow$ $E$ + $F$ * id
5. $\Longrightarrow$ $E$ + id * id
6. $\Longrightarrow$ $T$ + id * id
7. $\Longrightarrow$ $F$ + id * id
8. $\Longrightarrow$ id + id * id
In this example, the nonterminal that is chosen at each step is in red, and each derivation step is numbered. The corresponding bottom-up parse is shown below by showing the parse tree with its edges numbered to show the order in which the tree was built (e.g., the first step was to add the nonterminal F as the parent of the leftmost parse-tree leaf "id", and the last step was to combine the three subtrees representing "id", "+", and "id * id" as the children of a new root node E).

Note that both the rightmost derivation and the bottom-up parse have 8 steps. Step 1 of the derivation corresponds to step 8 of the parse; step 2 of the derivation corresponds to step 7 of the parse; etc. Each step of building the parse tree (adding a new nonterminal as the parent of some existing subtrees) is called a reduction (that's where the "reduce" part of "shift-reduce" parsing comes from).
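
The derivation above can be replayed mechanically. The following Python sketch is only an illustration: the function name `rightmost_derivation` and the list encoding of productions are our own choices, not part of the notes. It rewrites the rightmost nonterminal at every step and records each intermediate string.

```python
# Nonterminals of the expression grammar used above.
NONTERMINALS = {"E", "T", "F"}

# The production applied at each of the 8 derivation steps,
# written as (left-hand side, right-hand side).
STEPS = [
    ("E", ["E", "+", "T"]),
    ("T", ["T", "*", "F"]),
    ("F", ["id"]),
    ("T", ["F"]),
    ("F", ["id"]),
    ("E", ["T"]),
    ("T", ["F"]),
    ("F", ["id"]),
]

def rightmost_derivation(start, steps):
    """Apply each production to the rightmost nonterminal; return every string produced."""
    current = [start]
    forms = [" ".join(current)]
    for lhs, rhs in steps:
        # Position of the rightmost nonterminal in the current string.
        pos = max(i for i, sym in enumerate(current) if sym in NONTERMINALS)
        if current[pos] != lhs:
            raise ValueError(f"rightmost nonterminal is {current[pos]}, not {lhs}")
        current[pos:pos + 1] = rhs
        forms.append(" ".join(current))
    return forms

forms = rightmost_derivation("E", STEPS)
for number, form in enumerate(forms):
    print(number, form)
```

Reading the printed list from the bottom up gives the order in which a bottom-up parser performs its reductions.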

### Basic LR Parsing Algorithm

All LR parsers use the same basic algorithm:
Based on:
• the top-of-stack symbol and
• the current input symbol (token) and
• the entry in one of the parse tables (indexed by the top-of-stack and current-input symbols)
the parser performs one of the following actions:
1. shift: push the current input symbol onto the stack, and go on to the next input symbol (i.e., call the scanner again)
2. reduce: a grammar rule's right-hand side is on the top of the stack! pop it off and push the grammar rule's left-hand-side nonterminal
3. accept: accept the input (parsing has finished successfully)
4. reject: the input is not syntactically correct

The difference between SLR, LALR, and LR parsers is in the tables that they use. Those tables use different techniques to determine when to do a reduce step, and, if there is more than one grammar rule with the same right-hand side, which left-hand-side nonterminal to push.

## Example

Here are the steps the parser would perform using the grammar of arithmetic expressions with + and * given above, if the input is
id + id * id
| Stack | Input | Action |
|---|---|---|
| | id + id * id | shift(id) |
| id | + id * id | reduce by $F$ $\longrightarrow$ id |
| $F$ | + id * id | reduce by $T$ $\longrightarrow$ $F$ |
| $T$ | + id * id | reduce by $E$ $\longrightarrow$ $T$ |
| $E$ | + id * id | shift(+) |
| $E$ + | id * id | shift(id) |
| $E$ + id | * id | reduce by $F$ $\longrightarrow$ id |
| $E$ + $F$ | * id | reduce by $T$ $\longrightarrow$ $F$ |
| $E$ + $T$ | * id | shift(*) |
| $E$ + $T$ * | id | shift(id) |
| $E$ + $T$ * id | | reduce by $F$ $\longrightarrow$ id |
| $E$ + $T$ * $F$ | | reduce by $T$ $\longrightarrow$ $T$ * $F$ |
| $E$ + $T$ | | reduce by $E$ $\longrightarrow$ $E$ + $T$ |
| $E$ | | accept |

(NOTE: the top of stack is to the right; the reverse rightmost derivation is formed by concatenating the stack with the remaining input at each reduction step)

Note that at one point (right after the second reduction by $T$ $\longrightarrow$ $F$, when the remaining input is "* id") the top of the stack contained E + T, which is the right-hand side of the first grammar rule. However, the parser does not do a reduce step at that point; instead it does a shift, which is the right thing to do since (as you can see if you look back at the parse tree given above), that particular "E + T" is not grouped together (because the grammar reflects the fact that multiplication has higher precedence than addition).
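
The note about the reverse rightmost derivation can be checked with a small Python sketch. It uses the simplified view in which the stack holds grammar symbols, and it replays the shift and reduce actions of this example rather than deciding them from a table. The name `replay` and the tuple encoding of actions are assumptions made for this illustration.

```python
# The actions of the example, in order. A reduce carries the
# production's left-hand side and right-hand side.
ACTIONS = [
    ("shift",),
    ("reduce", "F", ("id",)),
    ("reduce", "T", ("F",)),
    ("reduce", "E", ("T",)),
    ("shift",),
    ("shift",),
    ("reduce", "F", ("id",)),
    ("reduce", "T", ("F",)),
    ("shift",),
    ("shift",),
    ("reduce", "F", ("id",)),
    ("reduce", "T", ("T", "*", "F")),
    ("reduce", "E", ("E", "+", "T")),
]

def replay(tokens, actions):
    """Run the given actions; after each reduce, record stack + remaining input."""
    stack, rest, forms = [], list(tokens), []
    for act in actions:
        if act[0] == "shift":
            stack.append(rest.pop(0))      # move the next token onto the stack
        else:
            _, lhs, rhs = act
            if tuple(stack[-len(rhs):]) != rhs:
                raise ValueError("right-hand side is not on top of the stack")
            del stack[-len(rhs):]          # pop the right-hand side
            stack.append(lhs)              # push the left-hand side
            forms.append(" ".join(stack + rest))
    return forms

forms = replay(["id", "+", "id", "*", "id"], ACTIONS)
for form in reversed(forms):
    print(form)
```

Printed in reverse, the recorded strings run from E down to F + id * id, which are the strings of the rightmost derivation shown earlier (all but the final sentence id + id * id).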

Parse Tables

As mentioned above, the symbols pushed onto the parser's stack are not actually terminals and nonterminals. Instead, they are states of a finite-state machine that represents the parsing process (more on this soon).

All LR Parsers use two tables: the action table and the goto table. The action table is indexed by the top-of-stack symbol and the current token, and it tells which of the four actions to perform: shift, reduce, accept, or reject. The goto table is used during a reduce action as explained below.

Above we said that a shift action means to push the current token onto the stack. In fact, we actually push a state symbol onto the stack. Each "shift" action in the action table includes the state to be pushed.

Above, we also said that when we reduce using the grammar rule A → alpha, we pop alpha off of the stack and then push A. In fact, if alpha contains N symbols, we pop N states off of the stack. We then use the goto table to know what to push: the goto table is indexed by state symbol t and nonterminal A, where t is the state symbol that is on top of the stack after popping N times.

Here's pseudo code for the parser's main loop:

    push initial state s0
    a = scan()
    do forever
        t = top-of-stack (state) symbol
        switch action[t, a] {
            case shift s:
                push(s)
                a = scan()
            case reduce by A → alpha:
                for i = 1 to length(alpha) do pop() end
                t = top-of-stack symbol
                push(goto[t, A])
            case accept:
                return( SUCCESS )
            case error:
                call the error handler
                return( FAILURE )
        }
    end do
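
As a rough, runnable counterpart to the pseudo code, here is a minimal Python sketch of the same loop. The tables are hand-derived for the small parameter-list grammar used in the SLR section below (plist → ( idlist ), idlist → ID | idlist ID). The state numbers, the dictionary encoding of the tables, and the name `lr_parse` are assumptions made for this illustration.

```python
# Action table: (state, token) -> entry. A missing entry means "error".
# A reduce entry stores the left-hand side and the length of the right-hand side.
ACTION = {
    (0, "("): ("shift", 2),
    (1, "$"): ("accept",),
    (2, "ID"): ("shift", 4),
    (3, ")"): ("shift", 5),
    (3, "ID"): ("shift", 6),
    (4, ")"): ("reduce", "idlist", 1),    # idlist -> ID
    (4, "ID"): ("reduce", "idlist", 1),
    (5, "$"): ("reduce", "plist", 3),     # plist -> ( idlist )
    (6, ")"): ("reduce", "idlist", 2),    # idlist -> idlist ID
    (6, "ID"): ("reduce", "idlist", 2),
}
# Goto table: (state, nonterminal) -> state.
GOTO = {(0, "plist"): 1, (2, "idlist"): 3}

def lr_parse(tokens, action=ACTION, goto=GOTO):
    """Return True if the token list is accepted, False on a syntax error."""
    stack = [0]                                   # push initial state
    stream = iter(list(tokens) + ["$"])           # "$" marks end of input
    a = next(stream)
    while True:
        t = stack[-1]
        entry = action.get((t, a))
        if entry is None:                         # error entry
            return False
        if entry[0] == "shift":
            stack.append(entry[1])
            a = next(stream)
        elif entry[0] == "reduce":
            _, lhs, length = entry
            del stack[len(stack) - length:]       # pop one state per right-hand-side symbol
            stack.append(goto[(stack[-1], lhs)])
        else:                                     # accept
            return True

print(lr_parse(["(", "ID", "ID", ")"]))
print(lr_parse(["(", ")"]))
```

Note that the stack holds only state numbers; the grammar symbols themselves never need to be stored.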


Remember, all LR parsers use this same basic algorithm. As mentioned above, for all LR parsers, the states that are pushed onto the stack represent the states in an underlying finite-state machine. Each state represents "where we might be" in a parse; each "place we might be" is represented (within the state) by an item. What's different for the different LR parsers is:

• the definition of an item
• the number of states in the underlying finite-state machine
• the amount of information in a state

SLR Parsing

SLR means simple LR; it is the weakest member of the LR family (i.e., every SLR grammar is also LALR and LR, but not vice versa). To understand SLR parsing we'll use a new example grammar (a very simple grammar for parameter lists):
 $PList$ $\longrightarrow$ ( $IDList$ )
 $IDList$ $\longrightarrow$ id | $IDList$ id

Building the Action and Goto Tables for an SLR Parser

Definition of an SLR item:

An item (for a given grammar) is a production with a dot somewhere on the right-hand side.
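Example: the production PList $\longrightarrow$ lparens IDList rparens has four items, one for each position of the dot:
• PList $\longrightarrow$ . lparens IDList rparens
• PList $\longrightarrow$ lparens . IDList rparens
• PList $\longrightarrow$ lparens IDList . rparens
• PList $\longrightarrow$ lparens IDList rparens .

NOTE: for the production X $\longrightarrow$ $\varepsilon$, there is just one item: X $\longrightarrow$ . $\varepsilon$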

The item "PList $\longrightarrow$ . lparens IDList rparens" can be thought of as meaning "we may be parsing a PList, but so far we haven't seen anything".

The item "PList $\longrightarrow$ lparens . IDList rparens" means "we may be parsing a PList, and so far we've seen a lparens".

The item "PList $\longrightarrow$ lparens IDList . rparens" means "we may be parsing a PList, and so far we've seen a lparens and parsed an IDList".

We need 2 operations on sets of items: Closure and Goto

Closure

To compute Closure($I$), where $I$ is a set of items:

1. put $I$ itself into Closure($I$)
2.  while there exists an item in Closure($I$) of the form X $\longrightarrow$ $\alpha$ . B $\beta$ such that there is a production B $\longrightarrow$ $\gamma$, and B $\longrightarrow$ . $\gamma$ is not in Closure($I$) do add B $\longrightarrow$ . $\gamma$ to Closure($I$)

The idea is that the item "X $\longrightarrow$ $\alpha$ . B $\beta$" means "we may be trying to parse an X, and so far we've parsed all of $\alpha$, so the next thing we'll parse may be a B". And the item "B $\longrightarrow$ . $\gamma$" also means that the next thing we'll parse may be a B (in particular, a B that derives $\gamma$), but we haven't seen any part of it yet.

Example 1: Closure({ PList $\longrightarrow$ . lparens IDList rparens })

We'll begin by putting the initial item into the Closure (Step 1 above). So far, our set is: { PList $\longrightarrow$ . lparens IDList rparens}

Now we do step 2: for each item in the set, we look for productions of the form B $\longrightarrow$ $\gamma$, where B is the symbol immediately to the right of the dot and the item B $\longrightarrow$ . $\gamma$ is not yet in the set. There's only one item that we can check, and the symbol to the immediate right of the dot is lparens, which is a terminal symbol. There are no productions with a terminal symbol on the left-hand side, so there's nothing else to check.

With Step 2 exhausted, we can return the set we've built up: Closure({ PList $\longrightarrow$ . lparens IDList rparens }) = { PList $\longrightarrow$ . lparens IDList rparens }

Example 2: Closure({ PList $\longrightarrow$ lparens . IDList rparens })

As with the previous example, we put the initial item into the Closure. So far, our set is { PList $\longrightarrow$ lparens . IDList rparens }

For step 2, we begin by selecting the only item in our working set, PList $\longrightarrow$ lparens . IDList rparens. We now look for productions with a left-hand side of IDList, since that's the symbol to the immediate right of the dot. One production of this form is "IDList $\longrightarrow$ id". Since the item IDList $\longrightarrow$ . id is not in the Closure yet, we add it. Our set so far is { PList $\longrightarrow$ lparens . IDList rparens , IDList $\longrightarrow$ . id}

We know that the item that we just added, IDList $\longrightarrow$ . id will not yield any more items, because the symbol immediately to the right of the dot is a terminal. However, we still haven't captured every production with IDList on the left-hand side, which we need to check because of our initial item. The grammar also has the production IDList $\longrightarrow$ IDList id, so we add the item "IDList $\longrightarrow$ . IDList id" to the closure. At this point, our working set is { PList $\longrightarrow$ lparens . IDList rparens , IDList $\longrightarrow$ . id, IDList $\longrightarrow$ . IDList id}

The new item that we added has IDList immediately to the right of the dot. Fortunately, we've already exhausted every production of the grammar with IDList on the left-hand side. Thus, we can pronounce our working set complete:

Closure({ PList $\longrightarrow$ lparens . IDList rparens }) = { PList $\longrightarrow$ lparens . IDList rparens , IDList $\longrightarrow$ . id, IDList $\longrightarrow$ . IDList id}
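
The two closure examples can be reproduced with a short Python sketch. The encoding of an item as a tuple `(lhs, rhs, dot_position)` and the helper names `closure` and `show` are choices made for this illustration only.

```python
# The parameter-list grammar, using the terminal names from the examples.
GRAMMAR = [
    ("PList", ("lparens", "IDList", "rparens")),
    ("IDList", ("id",)),
    ("IDList", ("IDList", "id")),
]

def closure(items, grammar=GRAMMAR):
    """Return Closure(items); each item is a tuple (lhs, rhs, dot_position)."""
    result = set(items)                 # step 1: start with the items themselves
    worklist = list(items)
    while worklist:                     # step 2: repeat until nothing new is added
        _, rhs, dot = worklist.pop()
        if dot == len(rhs):
            continue                    # dot is at the far right: nothing follows it
        symbol = rhs[dot]               # the symbol immediately right of the dot
        for lhs, gamma in grammar:
            new_item = (lhs, gamma, 0)
            if lhs == symbol and new_item not in result:
                result.add(new_item)
                worklist.append(new_item)
    return frozenset(result)

def show(item):
    """Format an item with the dot in place, e.g. 'IDList -> . id'."""
    lhs, rhs, dot = item
    return f"{lhs} -> " + " ".join(rhs[:dot] + (".",) + rhs[dot:])

plist_rhs = ("lparens", "IDList", "rparens")
example1 = closure({("PList", plist_rhs, 0)})
example2 = closure({("PList", plist_rhs, 1)})
for item in sorted(example2):
    print(show(item))
```

If the symbol after the dot is a terminal, no production has it as a left-hand side, so the loop adds nothing; that is why Example 1 returns its input unchanged.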

Goto

Now that we have defined the Closure of a set of items, we can use it to define the Goto operation. The basic idea is that $I$ tells us where we might be in the parse, and Goto($I$, X) tells us where we might be after parsing an X. Here is the definition:

Let $I$ be a set of items, and X be a grammar symbol (i.e. a single terminal or nonterminal).
Goto($I$, X) = the Closure of the set of items
A $\longrightarrow$ $\alpha$ X . $\beta$
such that
A $\longrightarrow$ $\alpha$ . X $\beta$
is in $I$
Example 1: Goto($I_1$, $X_1$)
where
$I_1$ = { PList $\longrightarrow$ . lparens IDList rparens }
$X_1$ = lparens

Let us begin by defining an intermediate set:

$\mathcal{W}$ = the set of items of the form A $\longrightarrow$ $\alpha$ X . $\beta$ such that an item of the form A $\longrightarrow$ $\alpha$ . X $\beta$ is in $I$.

We can now build $\mathcal{W}$ by taking each item from $I$ (of which there is only one) and advancing the dot to the right.
Thus, $\mathcal{W}$ = { PList $\longrightarrow$ lparens . IDList rparens}

With $\mathcal{W}$ in hand, we are ready to perform the Goto operation by computing Closure($\mathcal{W}$) = Closure( { PList $\longrightarrow$ lparens . IDList rparens} ) . We already computed this closure above as { PList $\longrightarrow$ lparens . IDList rparens , IDList $\longrightarrow$ . id, IDList $\longrightarrow$ . IDList id}, so we are done:

Goto($I_1$, $X_1$) = { PList $\longrightarrow$ lparens . IDList rparens , IDList $\longrightarrow$ . id, IDList $\longrightarrow$ . IDList id}

Example 2: Goto($I_2$, $X_2$)
where

$I_2$ = Goto($I_1$, $X_1$)
$X_2$ = IDList

The inner Goto operation is the result of Example 1, so we can substitute that result directly. Expanded, the problem statement is:

Goto( { PList $\longrightarrow$ lparens . IDList rparens , IDList $\longrightarrow$ . id, IDList $\longrightarrow$ . IDList id} , IDList)
Again, we start by computing $\mathcal{W}$:
| Item in $I$ of the form A $\longrightarrow$ $\alpha$ . X $\beta$ | Item of the form A $\longrightarrow$ $\alpha$ X . $\beta$ |
|---|---|
| PList $\longrightarrow$ lparens . IDList rparens | PList $\longrightarrow$ lparens IDList . rparens |
| IDList $\longrightarrow$ . IDList id | IDList $\longrightarrow$ IDList . id |

Thus, $\mathcal{W}$ = { PList $\longrightarrow$ lparens IDList . rparens, IDList $\longrightarrow$ IDList . id }

We can now take the closure of $\mathcal{W}$ to complete the operation. This turns out to be trivial: in no item of $\mathcal{W}$ is the dot followed by a nonterminal, so the closure adds no items. Thus, Goto($I_2$, $X_2$) = Closure($\mathcal{W}$) = $\mathcal{W}$ =
{ PList $\longrightarrow$ lparens IDList . rparens, IDList $\longrightarrow$ IDList . id }
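
Both Goto examples can be checked with a minimal Python sketch. As before, items are encoded as tuples `(lhs, rhs, dot_position)`; this encoding and the function names are assumptions of the sketch, not notation from the notes.

```python
# The parameter-list grammar, using the terminal names from the examples.
GRAMMAR = [
    ("PList", ("lparens", "IDList", "rparens")),
    ("IDList", ("id",)),
    ("IDList", ("IDList", "id")),
]

def closure(items, grammar=GRAMMAR):
    """Return Closure(items) for items of the form (lhs, rhs, dot_position)."""
    result, worklist = set(items), list(items)
    while worklist:
        _, rhs, dot = worklist.pop()
        if dot == len(rhs):
            continue
        for lhs, gamma in grammar:
            if lhs == rhs[dot] and (lhs, gamma, 0) not in result:
                result.add((lhs, gamma, 0))
                worklist.append((lhs, gamma, 0))
    return frozenset(result)

def goto(items, x, grammar=GRAMMAR):
    """Return Goto(items, x): advance the dot over x, then take the closure."""
    # The intermediate set W: every item with the dot just before x,
    # rewritten with the dot just after x.
    w = {(lhs, rhs, dot + 1)
         for lhs, rhs, dot in items
         if dot < len(rhs) and rhs[dot] == x}
    return closure(w, grammar)

i1 = frozenset({("PList", ("lparens", "IDList", "rparens"), 0)})
i2 = goto(i1, "lparens")        # Example 1
result = goto(i2, "IDList")     # Example 2
print(len(i2), len(result))
```

If no item in the set has x right after the dot, the intermediate set is empty and so is the result, which corresponds to "no transition on x".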

Our ultimate goal is to create the finite-state machine (FSM) that underlies the parser; each of its states is a set of items. To build the FSM:

1. Augment the grammar
(a) add new start nonterminal S'
(b) add new production S' → S (where S is the old start nonterminal)
2. I0 = closure({ S' → . S })
3. for each grammar symbol X such that there is an item in I0 containing ".X" do
add a transition on X from state I0 to state Goto(I0,X)
4. repeat step (3) for each new state until there are no more new states
Note: In step (3), when we say "state Goto(...)" we mean the state that contains exactly that set of items; if there is not already such a state, then create a new one.

Example


grammar
S' → plist
plist → ( idlist )
idlist → ID
idlist → idlist ID


Corresponding SLR FSM
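
The states and transitions of this machine can be enumerated with a short Python sketch of steps 1 to 4. The tuple encoding of items and the helper names (`closure`, `goto`, `build_fsm`) are illustrative choices, and the state numbers simply reflect discovery order, so they may differ from a hand-drawn numbering.

```python
# Augmented grammar from the example; the first production is S' -> plist.
GRAMMAR = [
    ("S'", ("plist",)),
    ("plist", ("(", "idlist", ")")),
    ("idlist", ("ID",)),
    ("idlist", ("idlist", "ID")),
]

def closure(items):
    result, worklist = set(items), list(items)
    while worklist:
        _, rhs, dot = worklist.pop()
        if dot == len(rhs):
            continue
        for lhs, gamma in GRAMMAR:
            if lhs == rhs[dot] and (lhs, gamma, 0) not in result:
                result.add((lhs, gamma, 0))
                worklist.append((lhs, gamma, 0))
    return frozenset(result)

def goto(items, x):
    return closure({(l, r, d + 1) for l, r, d in items if d < len(r) and r[d] == x})

def build_fsm():
    """Return (states, transitions); transitions maps (state, symbol) -> state."""
    start_lhs, start_rhs = GRAMMAR[0]
    states = [closure({(start_lhs, start_rhs, 0)})]       # step 2: I0
    transitions = {}
    i = 0
    while i < len(states):                                # step 4: process new states
        # Step 3: every symbol that appears right after a dot in state i.
        symbols = sorted({r[d] for _, r, d in states[i] if d < len(r)})
        for x in symbols:
            target = goto(states[i], x)
            if target not in states:                      # create the state if needed
                states.append(target)
            transitions[(i, x)] = states.index(target)
        i += 1
    return states, transitions

states, transitions = build_fsm()
for number, state in enumerate(states):
    for lhs, rhs, dot in sorted(state):
        print(number, lhs, "->", " ".join(rhs[:dot] + (".",) + rhs[dot:]))
for (source, symbol), target in sorted(transitions.items()):
    print(f"{source} --{symbol}--> {target}")
```

For this grammar the sketch finds seven states and six transitions.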

Given the FSM, here's how to build Action and Goto tables:

Action Table:
1. if state i includes item
A → alpha . a beta
where a is a terminal, and the transition out of state i on a is to state j,
then set Action[i,a] = shift j
2. if state i includes item
A → alpha .
where A is not the new start symbol S'
then for every a in FOLLOW(A), set Action[i,a] = reduce by A → alpha
3. if state i includes item
S' → S .
then set Action[i,$] = accept
4. set all other entries of Action table to error

Goto Table:
for every nonterminal X
if there is a transition from state i to state j on X
then set Goto[i, X] = j

Example

FOLLOW(idlist) = { ), ID }
FOLLOW(plist) = { $ }
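
The table-building rules can be tried out on the example grammar with the following Python sketch. It takes the FOLLOW sets listed above as given rather than computing them. The state numbers come from the order in which the sketch discovers states, and all function names and encodings are assumptions made for illustration.

```python
GRAMMAR = [
    ("S'", ("plist",)),
    ("plist", ("(", "idlist", ")")),
    ("idlist", ("ID",)),
    ("idlist", ("idlist", "ID")),
]
NONTERMINALS = {lhs for lhs, _ in GRAMMAR}
FOLLOW = {"idlist": {")", "ID"}, "plist": {"$"}}   # as given in the text

def closure(items):
    result, worklist = set(items), list(items)
    while worklist:
        _, rhs, dot = worklist.pop()
        if dot == len(rhs):
            continue
        for lhs, gamma in GRAMMAR:
            if lhs == rhs[dot] and (lhs, gamma, 0) not in result:
                result.add((lhs, gamma, 0))
                worklist.append((lhs, gamma, 0))
    return frozenset(result)

def goto(items, x):
    return closure({(l, r, d + 1) for l, r, d in items if d < len(r) and r[d] == x})

def build_fsm():
    states, transitions, i = [closure({(*GRAMMAR[0], 0)})], {}, 0
    while i < len(states):
        for x in sorted({r[d] for _, r, d in states[i] if d < len(r)}):
            target = goto(states[i], x)
            if target not in states:
                states.append(target)
            transitions[(i, x)] = states.index(target)
        i += 1
    return states, transitions

def build_tables():
    states, transitions = build_fsm()
    action, goto_table = {}, {}

    def set_action(i, a, entry):
        # Two different entries in one cell would be a conflict.
        if action.setdefault((i, a), entry) != entry:
            raise ValueError(f"conflict in Action[{i}, {a}]")

    for (i, x), j in transitions.items():
        if x in NONTERMINALS:
            goto_table[(i, x)] = j                          # Goto table
        else:
            set_action(i, x, ("shift", j))                  # rule 1
    for i, state in enumerate(states):
        for lhs, rhs, dot in state:
            if dot < len(rhs):
                continue                                    # dot not at the end
            if lhs == "S'":
                set_action(i, "$", ("accept",))             # rule 3
            else:
                for a in FOLLOW[lhs]:
                    set_action(i, a, ("reduce", lhs, rhs))  # rule 2
    return action, goto_table      # rule 4: a missing entry means error

ACTION, GOTO = build_tables()
for key in sorted(ACTION):
    print(key, ACTION[key])
print(GOTO)
```

Because no cell receives two different entries, `build_tables` finishes without raising, which is consistent with this grammar being SLR(1).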

Conflicts

Not every grammar is SLR(1). If a grammar is not SLR(1), there will be a conflict in the SLR Action table. There is a conflict in the table if there is a table entry with more than one action in it. There are two possible kinds of conflicts:

1. shift/reduce conflicts (a shift and a reduce action in the same table entry)
2. reduce/reduce conflicts (two reduce actions in the same table entry)

A shift/reduce conflict means that it is not possible to determine, based only on the top-of-stack state symbol and the current token, whether to shift or to reduce. This kind of conflict arises when one state contains two items of the following form:

1. X $\longrightarrow$ $\alpha$ . a $\beta$
2. Y $\longrightarrow$ $\gamma$ .
and a is in FOLLOW(Y).

A reduce/reduce conflict means that it is not possible to determine, based only on the top-of-stack state symbol and the current token, whether to reduce by one grammar rule or by another grammar rule. This kind of conflict arises when one state contains two items of the form

1. X $\longrightarrow$ $\alpha$ .
2. Y $\longrightarrow$ $\beta$ .
and there is some symbol a that is in both FOLLOW(X) and FOLLOW(Y).

A non-SLR(1) grammar

This grammar causes a shift/reduce conflict:

grammar

S' $\longrightarrow$ S
S $\longrightarrow$ A $\times$ B | B
A $\longrightarrow$ a | + B
B $\longrightarrow$ A

relevant part of the FSM

Another non-SLR(1) grammar

This grammar causes a reduce/reduce conflict:

grammar

S' $\longrightarrow$ S
S $\longrightarrow$ a A d | b B d | a B e | b A e
A $\longrightarrow$ c
B $\longrightarrow$ c
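
Both conflicts can be found mechanically. The Python sketch below builds the item sets for each of the two grammars, computes FOLLOW sets, and applies the two conflict conditions described under Conflicts. It assumes grammars without $\varepsilon$-productions (true for both examples), writes the terminal $\times$ as `"x"`, and ignores conflicts involving the accept action; the function names are our own.

```python
def closure(items, grammar):
    result, worklist = set(items), list(items)
    while worklist:
        _, rhs, dot = worklist.pop()
        if dot == len(rhs):
            continue
        for lhs, gamma in grammar:
            if lhs == rhs[dot] and (lhs, gamma, 0) not in result:
                result.add((lhs, gamma, 0))
                worklist.append((lhs, gamma, 0))
    return frozenset(result)

def goto(items, x, grammar):
    moved = {(l, r, d + 1) for l, r, d in items if d < len(r) and r[d] == x}
    return closure(moved, grammar)

def build_states(grammar):
    states, i = [closure({(*grammar[0], 0)}, grammar)], 0
    while i < len(states):
        for x in sorted({r[d] for _, r, d in states[i] if d < len(r)}):
            target = goto(states[i], x, grammar)
            if target not in states:
                states.append(target)
        i += 1
    return states

def follow_sets(grammar):
    """FOLLOW sets, assuming the grammar has no epsilon-productions."""
    nts = {lhs for lhs, _ in grammar}
    first = {n: set() for n in nts}
    follow = {n: set() for n in nts}
    follow[grammar[0][0]].add("$")
    first_of = lambda sym: first[sym] if sym in nts else {sym}
    size = lambda: sum(len(first[n]) + len(follow[n]) for n in nts)
    changed = True
    while changed:                      # iterate until no set grows
        before = size()
        for lhs, rhs in grammar:
            first[lhs] |= first_of(rhs[0])
            for i, sym in enumerate(rhs):
                if sym in nts:
                    follow[sym] |= first_of(rhs[i + 1]) if i + 1 < len(rhs) else follow[lhs]
        changed = size() != before
    return follow

def slr_conflicts(grammar):
    """Return a set of (kind, terminal) pairs describing SLR(1) conflicts."""
    follow, found = follow_sets(grammar), set()
    for state in build_states(grammar):
        reduces = sorted((l, r) for l, r, d in state if d == len(r) and l != grammar[0][0])
        shifts = {r[d] for _, r, d in state if d < len(r) and r[d] not in follow}
        for k, (lhs, _) in enumerate(reduces):
            found |= {("shift/reduce", a) for a in follow[lhs] & shifts}
            for other, _ in reduces[k + 1:]:
                found |= {("reduce/reduce", a) for a in follow[lhs] & follow[other]}
    return found

G1 = [("S'", ("S",)), ("S", ("A", "x", "B")), ("S", ("B",)),
      ("A", ("a",)), ("A", ("+", "B")), ("B", ("A",))]
G2 = [("S'", ("S",)), ("S", ("a", "A", "d")), ("S", ("b", "B", "d")),
      ("S", ("a", "B", "e")), ("S", ("b", "A", "e")), ("A", ("c",)), ("B", ("c",))]
print(slr_conflicts(G1))
print(slr_conflicts(G2))
```

For the first grammar, the conflict arises in the state containing S $\longrightarrow$ A . $\times$ B and B $\longrightarrow$ A . because $\times$ is in FOLLOW(B). For the second, it arises in the state containing A $\longrightarrow$ c . and B $\longrightarrow$ c . because FOLLOW(A) and FOLLOW(B) are both { d, e }.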