Contents

Overview

There are algorithms that can be used to parse the language defined by an arbitrary CFG. However, in the worst case, the algorithms take O(n3) time, where n is the number of tokens. That is too slow!

Fortunately, there are classes of grammars for which O(n) parsers can be built (and given a grammar, we can quickly test whether it is in such a class). Two such classes are:

LL(1) Scan input left-to-right Do a leftmost derivation One token of look-ahead and
LALR(1) do a rightmost derivation One token of look-ahead scan the input left-to-right "Look-ahead". This has nothing to do with the number of tokens the parser can look at before it chooses what to do -- it is a technical term that only means something when you study how LR parsers work LALR(1) grammars are: We will learn about three parsing techniques:
  1. A general, bottom-up parsing algorithm called the CYK algorithm.
  2. A technique for parsing LL(1) grammars top-down, called predictive parsing.
  3. A technique for bottom-up parsing that works for LALR(1) grammars, but since that is rather complicated, we will study a version of the algorithm that only works for a subset of the LALR(1) grammars called SLR(1) grammars.

Overview

Given a CFG and a string (a sequence of tokens), the goal of parsing is to determine whether the string is in the language of the CFG. The answer is yes iff the string can be derived from the CFG's start nonterminal. Equivalently, the answer is yes iff we can build a parse tree for the string, with the CFG's start nonterminal at the root.

There are a number of algorithms that can answer this question for any CFG, even ambiguous CFGs. Their worst-case running time is O(N3), where N is the length of the input (the number of tokens in the input string). Some algorithms have better running times for non-ambiguous CFGs. For example, there is an algorithm called Earley's algorithm that has worst-case runtime O(N2) for non-ambiguous grammars. However, many of these algorithms are rather complicated, so we will look at one that is not so complicated, but always has O(N3) runtime. This algorithm is called the CYK algorithm, for its inventors: Cocke, Younger, and Kasami.

The CYK Algorithm

The CYK algorithm can be used to parse any CFG (i.e., to determine, for any input, whether that input is in the language of a given CFG). It essentially builds all of the possible parse trees for all (consecutive) fragments of the input, bottom-up. It accepts the input iff it is able to build a complete parse tree: one with the CFG's start nonterminal at the root, and the entire input at the leaves. Chomsky Normal Form

The CYK algorithm requires that the CFG be in CNF: Chomsky normal form. This means that the CFG rules must all be in one of the following forms:

  1. $X$ $\longrightarrow$ t
  2. $X$ $\longrightarrow$ $A$ $B$
where t is a terminal symbol, and $A$ and $B$ are nonterminal symbols. Also, grammar rules of the form

$X$ $\longrightarrow$ $\varepsilon$

are not allowed, unless the left nonterminal is the start nonterminal. If there is a rule

$S$ $\longrightarrow$ $\varepsilon$

(where $S$ is the start nonterminal), then $S$ cannot occur on the right-hand side of any rule.

Conversion to CNF

Fortunately, it is easy to convert any CFG to CNF. The conversion involves the following steps:

  1. Eliminate $\varepsilon$ rules
  2. Eliminate unit rules (rules with exactly one nonterminal on the right)
  3. Fix remaining rules so that all rules have either a single terminal or exactly two nonterminals on the right.

Example: Conversion to CNF

We'll start with a very simple grammar for function calls (the function name, followed by a list of zero or more comma-separated arguments, where each argument can only be an identifier).
$F$ $\longrightarrow$ id ( $A$ )
$A$ $\longrightarrow$ $\varepsilon$
$A$ $\longrightarrow$ $N$
 
$N$ $\longrightarrow$ id
$N$ $\longrightarrow$ id , $N$

Step 1 (eliminate $\varepsilon$ rules):

This CFG has one rule with epsilon on the right-hand side (the second rule). That rule has $A$ on the left-hand side. We'll remove the $\varepsilon$-rule by copying the rule that has $A$ on the right (the first rule in the CFG above), and replacing that $A$ with nothing ($\varepsilon$):
$F$ $\longrightarrow$ id ( )

Now we have this CFG:

$F$ $\longrightarrow$ id ( $A$ )
$F$ $\longrightarrow$ id ( )
$A$ $\longrightarrow$ $N$
 
$N$ $\longrightarrow$ id
$N$ $\longrightarrow$ id , $N$

Note that if $A$ occurred more than once on the right of some rule, we would have to make as many copies as needed to get all combinations of $A$ being there and being replaced by $\varepsilon$. For example, if we had this rule:
$X$ $\longrightarrow$ $A$ x $A$ y $A$

we'd have to add all of the following new rules, as well as keeping the original rule:

$X$ $\longrightarrow$ $A$ x $A$ y
$X$ $\longrightarrow$ $A$ x y $A$
$X$ $\longrightarrow$ $A$ x y
$X$ $\longrightarrow$ x $A$ y $A$
$X$ $\longrightarrow$ x $A$ y
$X$ $\longrightarrow$ x y $A$
$X$ $\longrightarrow$ x y

Furthermore, if we have a rule

$S$ $\longrightarrow$ $\varepsilon$

where S is the start nonterminal, and there is a grammar rule with S on the right, then we need to do the following:

Step 2 (eliminate unit rules):

Our current CFG looks like this:
$F$ $\longrightarrow$ id ( $A$ )
$F$ $\longrightarrow$ id ( )
$A$ $\longrightarrow$ $N$
 
$N$ $\longrightarrow$ id
$N$ $\longrightarrow$ id , $N$

It includes one unit rule, a rule with exactly 1 nonterminal (and no terminals) on the right:

$A$ $\longrightarrow$ $N$

Because there are no other rules with $A$ on the left, we can remove the unit rule by replacing every occurrence of $A$ on the right side of some rule with $N$. (If there were other rules with $A$ on the left, we would instead make copies of the rules with $A$ on the right, and replace $A$ with $N$ in the copies). Here's our CFG after removing the unit rule:
$F$ $\longrightarrow$ id ( $N$ )
$F$ $\longrightarrow$ id ( )
$N$ $\longrightarrow$ id
$N$ $\longrightarrow$ id , $N$

Step 3 (fix remaining rules):

The last step in conversion to CNF is to fix the remaining rules so that each one has either a single terminal, or exactly two nonterminals on the right.

Handle rules with terminals on the right

First, we find the rules that have a terminal plus some other symbols on the right:
$F$ $\longrightarrow$ id ( $N$ )
$F$ $\longrightarrow$ id ( )
$N$ $\longrightarrow$ id , $N$

For each such terminal t, we add a new rule
$X$ $\longrightarrow$ t
Then we replace t with $X$ in the original rules. Here's the result (including the one rule that was already in the correct form):
$F$ $\longrightarrow$ $I$ $L$ $N$ $R$
$F$ $\longrightarrow$ $I$ $L$ $R$
$N$ $\longrightarrow$ id
$N$ $\longrightarrow$ $I$ $C$ $N$
 
$I$ $\longrightarrow$ id
$L$ $\longrightarrow$ (
$R$ $\longrightarrow$ )
$C$ $\longrightarrow$ ,

Handle rules with more than 2 nonterminals on the right

Now, for every right-hand side that has more than two nonterminals, we replace all but the first nonterminal on the right with a new nonterminal, and we add a new rule for the new nonterminal. For example, we handle the rule

$F$ $\longrightarrow$ $I$ $L$ $N$ $R$

as follows:

$F$ $\longrightarrow$ $I$ $W$
$W$ $\longrightarrow$ $L$ $N$ $R$

Then we handle the new rule (with W on the left) as follows:

$W$ $\longrightarrow$ $L$ $X$
$X$ $\longrightarrow$ $N$ $R$

If we take care of the other two "problem" rules:

$F$ $\longrightarrow$ $I$ $L$ $R$
$N$ $\longrightarrow$ $I$ $C$ $N$

We finish with the following CFG (in CNF):

$F$ $\longrightarrow$ $I$ $W$
$F$ $\longrightarrow$ $I$ $Y$
$W$ $\longrightarrow$ $L$ $X$
$X$ $\longrightarrow$ $N$ $R$
$Y$ $\longrightarrow$ $L$ $R$
 
$N$ $\longrightarrow$ id
$N$ $\longrightarrow$ $I$ $Z$
$Z$ $\longrightarrow$ $C$ $N$
 
$I$ $\longrightarrow$ id
$L$ $\longrightarrow$ (
$R$ $\longrightarrow$ )
$C$ $\longrightarrow$ ,

TEST YOURSELF #1

The CFG below is for a grammar of very simple statements. The terminal symbols are id, =, (, ), ++, and read. Convert this CFG to CNF.
$S$ $\longrightarrow$ id = id
$S$ $\longrightarrow$ id ( )
$S$ $\longrightarrow$ id ++
$S$ $\longrightarrow$ read ( id )
$S$ $\longrightarrow$ $S$ $S$

solution


Parsing an Input

Now that we have converted our CFG to Chomsky Normal form, we can use it to try to parse an input. Let's use the input f(x,y) as an example. The actual sequence of tokens for that input would be: id ( id , id )

To parse an input of length N, we use an N-by-N grid. First, we fill in the squares along the diagonal. The kth square contains the nonterminal(s) $X$ such that there is a CFG rule

$X$ $\longrightarrow$ t

and t is the kth token. Here's the 6-by-6 grid for our example, with the diagonal filled in (and with the rows and columns numbered):

1 2 3 4 5 6 1 2 3 4 5 6 I,N L I,N C I,N R

Now we fill in each successive diagonal above the one that's already filled in. When filling in a square, we consider what's already in the squares to the left and below it. For example, to fill in the square in position 2,1, we look at the contents of the square to its left ($I$,$N$) and below it ($L$). If there were a grammar rule with "$I$ $L$" or with "$N$ $L$" on the right-hand side, we would put the left-side nonterminal into the square in position 2,1. However, for our example, there are no such grammar rules, so square 2,1 stays empty. The first new square that we can fill in is in position 5,4. That square has $C$ to its left, and $I$, $N$ below it. There is a grammar rule $Z$ $\longrightarrow$ $C$ $N$, so we put a $Z$ in square 5,4. If there had also been a grammar rule $H$ $\longrightarrow$ $C$ $I$, we would also have put an $H$ in square 5,4. Note that we also fill in square 5,5 due to the rule $X$ $\longrightarrow$ $N$ $R$. Here is the grid so far:

1 2 3 4 5 6 1 2 3 4 5 6 I,N L I,N C I,N R X Z

The squares in the next diagonal each have two squares to their left and two squares below. We look at pairs of squares: first the square all the way to the left and the square immediately below, then the square immediately to the left and the one two squares down. If either square in the current pair is empty, we go on to the next pair. Each time we look at a pair of non-empty squares, we are looking for a nonterminal $A$ in the square to the left and a nonterminal $B$ in the square that is below, such that there is a grammar rule with $A$ $B$ on the right. If there is such a grammar rule, we fill in the current square with the left-side nonterminal. For example, when trying to fill in the square in position 3,5, we see $I$, $N$ in position 3,3, and $Z$ in position 4,5. There is a grammar rule $N$ $\longrightarrow$ $I$ $Z$, so we put an $N$ is position 3,5. Here is the result of filling in the whole grid:

1 2 3 4 5 6 1 2 3 4 5 6 I,N L I,N C I,N R X Z N X W F

To understand this process, remember that the CYK algorithm works by building up parse trees for longer and longer fragments of the input. Each box in the upper-right part of the grid represents the root of one or more parse trees. The nonterminal(s) in a square is/are the root nonterminal(s) of those trees. The children of the root are the nonterminals in the pairs of squares, one to the left, and one below, that caused the nonterminal to be filled in. The five nonterminals in the upper-right part of our grid represent these five parse trees:

Z C N , id
Z C N , id N I Z id image/svg+xml
X N R I Z ) id C N , id
image/svg+xml L W X N R ( I Z ) id C N , id
F I W id id L X N R ( I Z ) id C N , id I id Z , C N id

TEST YOURSELF #2

Use the statement CFG that was converted to CNF for the "Test Yourself #1" question to parse the input x ++ y = z y ++ (i.e., build and fill in the grid).

solution


Accepting an Input

The CYK discovers that an input is in the language of a CFG iff the CFG's start nonterminal appears in the upper-right corner of the grid. For our example, F is the start nonterminal, and it does appear in square 1,6, so the example input is in the language of the CFG. The rightmost parse tree shown above is the (unique) parse tree for the example input.


TEST YOURSELF #3

Is the input x ++ y = z y ++ in the language of the simple statement grammar? (Justify your answer.)

solution