There are algorithms that can be used to parse the language defined by an
arbitrary CFG. However, in the worst case, the algorithms take
O(n^{3}) time, where n is the number of tokens. That is
*too slow*!

Fortunately, there are classes of grammars for which O(n) parsers can be built (and, given a grammar, we can quickly test whether it is in such a class). Two such classes are LL(1) grammars and LALR(1) grammars.

LALR(1) grammars are:

- The class of grammars accepted by the parser generators yacc, bison, and Java Cup.
- Parsed by bottom-up parsers: the parser constructs the derivation tree from the bottom up, going from terminals to nonterminals, attaching several nonterminals together under a new nonterminal, and so on, ending with just the start nonterminal.
- Harder to understand than LL(1) parsers (i.e., it is hard to understand how the LALR(1) parser works, and what makes a grammar LALR(1)).

We will look at:

- A general, bottom-up parsing algorithm called the *CYK algorithm*.
- A technique for parsing LL(1) grammars top-down, called *predictive parsing*.
- A technique for bottom-up parsing that works for LALR(1) grammars. Since that technique is rather complicated, we will study a version of the algorithm that only works for a subset of the LALR(1) grammars, called *SLR(1)* grammars.

Given a CFG and a string (a sequence of tokens), the goal of parsing
is to determine whether the string is in the language of the CFG.
The answer is yes iff the string can be *derived* from the CFG's
start nonterminal.
Equivalently, the answer is yes iff we can build a parse tree for
the string, with the CFG's start nonterminal at the root.

There are a number of algorithms that can answer this question
for *any* CFG, even ambiguous CFGs.
Their worst-case running time is O(N^{3}), where N is
the length of the input (the number of tokens in the input string).
Some algorithms have better running times for non-ambiguous CFGs.
For example, there is an algorithm called Earley's algorithm that
has worst-case runtime O(N^{2}) for non-ambiguous grammars.
However, many of these algorithms are rather complicated, so we
will look at one that is not so complicated, but always has
O(N^{3}) runtime.
This algorithm is called the *CYK* algorithm, for its
inventors: Cocke, Younger, and Kasami.

The CYK algorithm can be used to parse any CFG (i.e., to determine, for any input, whether that input is in the language of a given CFG). It essentially builds all of the possible parse trees for all (consecutive) fragments of the input, bottom-up. It accepts the input iff it is able to build a complete parse tree: one with the CFG's start nonterminal at the root and the entire input at the leaves.

Chomsky Normal Form

The CYK algorithm requires that the CFG be in CNF: *Chomsky normal form*.
This means that the CFG rules must all be in one of the following forms:

- $X$ $\longrightarrow$ **t** (where **t** is a single terminal)
- $X$ $\longrightarrow$ $A$ $B$ (where $A$ and $B$ are nonterminals)

Rules of the form

$X$ $\longrightarrow$ $\varepsilon$

are not allowed, unless the left nonterminal is the start nonterminal. If there is a rule

$S$ $\longrightarrow$ $\varepsilon$

(where $S$ is the start nonterminal), then $S$ cannot occur on the right-hand side of any rule.

Conversion to CNF
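The restrictions above are mechanical enough to check with a short script. The following sketch uses a home-grown encoding that is not part of these notes: a grammar is a list of (left, right) pairs, each right side is a list of symbols, nonterminals are uppercase strings, and $\varepsilon$ is the empty list.

```python
def is_cnf(rules, start):
    """Check that every rule is X -> t, X -> A B, or start -> epsilon,
    and that if start -> epsilon exists, start never appears on a
    right-hand side.  (Assumed encoding: nonterminals are uppercase.)"""
    def is_nonterminal(sym):
        return sym.isupper()

    has_start_epsilon = any(lhs == start and rhs == [] for lhs, rhs in rules)
    for lhs, rhs in rules:
        if rhs == []:                         # epsilon rule: only for the start
            if lhs != start:
                return False
        elif len(rhs) == 1:                   # must be a single terminal
            if is_nonterminal(rhs[0]):
                return False
        elif len(rhs) == 2:                   # must be two nonterminals
            if not all(is_nonterminal(s) for s in rhs):
                return False
        else:                                 # right side too long
            return False
        if has_start_epsilon and start in rhs:
            return False
    return True
```

Under this encoding, the CNF grammar derived at the end of this section passes the check, while the original function-call grammar (with its four-symbol rule) fails it.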

Fortunately, it is easy to convert any CFG to CNF. The conversion involves the following steps:

- Eliminate $\varepsilon$ rules
- Eliminate unit rules (rules with exactly one nonterminal on the right)
- Fix remaining rules so that all rules have either a single terminal or exactly two nonterminals on the right.

We'll start with a very simple grammar for function calls (the function name, followed by a list of zero or more comma-separated arguments, where each argument can only be an identifier).

$F$ | $\longrightarrow$ | id ( $A$ ) |

$A$ | $\longrightarrow$ | $\varepsilon$ |

$A$ | $\longrightarrow$ | $N$ |

$N$ | $\longrightarrow$ | id |

$N$ | $\longrightarrow$ | id , $N$ |

This CFG has one rule with epsilon on the right-hand side (the second rule). That rule has $A$ on the left-hand side. We'll remove the $\varepsilon$-rule by copying the rule that has $A$ on the right (the first rule in the CFG above), and replacing that $A$ with nothing ($\varepsilon$):

$F$ | $\longrightarrow$ | id ( ) |

Now we have this CFG:

$F$ | $\longrightarrow$ | id ( $A$ ) |

$F$ | $\longrightarrow$ | id ( ) |

$A$ | $\longrightarrow$ | $N$ |

$N$ | $\longrightarrow$ | id |

$N$ | $\longrightarrow$ | id , $N$ |

Note that if $A$ occurred more than once on the right of some rule, we would have to make as many copies as needed to get all combinations of $A$ being there and being replaced by $\varepsilon$. For example, if we had this rule:

$X$ | $\longrightarrow$ | $A$ x $A$ y $A$ |

we'd have to add all of the following new rules, as well as keeping the original rule:

$X$ | $\longrightarrow$ | $A$ x $A$ y |

$X$ | $\longrightarrow$ | $A$ x y $A$ |

$X$ | $\longrightarrow$ | $A$ x y |

$X$ | $\longrightarrow$ | x $A$ y $A$ |

$X$ | $\longrightarrow$ | x $A$ y |

$X$ | $\longrightarrow$ | x y $A$ |

$X$ | $\longrightarrow$ | x y |
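The "all combinations" copying can be sketched with `itertools.product`: for each occurrence of a nullable nonterminal, we choose either to keep it or to drop it. The encoding (right sides as symbol lists, a set of nullable nonterminals) is an assumption of this sketch, not part of the notes.

```python
from itertools import product

def drop_epsilon(rhs, nullable):
    """Return every right-hand side obtained by keeping or deleting each
    occurrence of a nullable nonterminal (the original rhs included);
    the all-empty combination is skipped."""
    # For each symbol: keep it, or (if nullable) also allow dropping it.
    choices = [([sym], []) if sym in nullable else ([sym],) for sym in rhs]
    results = []
    for combo in product(*choices):
        new_rhs = [s for part in combo for s in part]
        if new_rhs and new_rhs not in results:   # skip empty and duplicates
            results.append(new_rhs)
    return results
```

For the rule $X \longrightarrow A$ x $A$ y $A$ this yields all eight variants listed above; for $F \longrightarrow$ id ( $A$ ) it yields the original rule plus id ( ).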

Furthermore, if we have a rule

$S$ | $\longrightarrow$ | $\varepsilon$ |

where $S$ is the start nonterminal, and
there is a grammar rule with $S$ on the
right, then we need to do the following:

- Add a new start nonterminal $S'$.
- Add new rules $S'$ $\longrightarrow$ $S$ and $S'$ $\longrightarrow$ $\varepsilon$.
- Remove the rule $S$ $\longrightarrow$ $\varepsilon$.

Our current CFG looks like this:

$F$ | $\longrightarrow$ | id ( $A$ ) |

$F$ | $\longrightarrow$ | id ( ) |

$A$ | $\longrightarrow$ | $N$ |

$N$ | $\longrightarrow$ | id |

$N$ | $\longrightarrow$ | id , $N$ |

It includes one *unit rule*: a rule with exactly one nonterminal
(and no terminals) on the right:

$A$ | $\longrightarrow$ | $N$ |

Because there are no other rules with $A$ on the left, we can remove the unit rule by replacing every occurrence of $A$ on the right side of some rule with $N$. (If there were other rules with $A$ on the left, we would instead make copies of the rules with $A$ on the right, and replace $A$ with $N$ in the copies). Here's our CFG after removing the unit rule:

$F$ | $\longrightarrow$ | id ( $N$ ) |

$F$ | $\longrightarrow$ | id ( ) |

$N$ | $\longrightarrow$ | id |

$N$ | $\longrightarrow$ | id , $N$ |
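For the simple case handled above (the nonterminal $A$ has no other defining rules), the substitution might look like this sketch; the encoding of rules as (left, right-list) pairs is an assumption, not part of the notes.

```python
def remove_unit_rule(rules, a, n):
    """Remove the unit rule a -> n by deleting it and replacing every
    occurrence of a on a right-hand side with n.  Assumes a has no
    other rules (otherwise copies would be made instead)."""
    new_rules = []
    for lhs, rhs in rules:
        if lhs == a and rhs == [n]:
            continue                      # drop the unit rule itself
        new_rules.append((lhs, [n if sym == a else sym for sym in rhs]))
    return new_rules
```

Applied to the function-call grammar, this turns $F \longrightarrow$ id ( $A$ ) into $F \longrightarrow$ id ( $N$ ) and deletes $A \longrightarrow N$, exactly as shown above.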

The last step in the conversion to CNF is to fix the remaining rules so that each one has either a single terminal or exactly two nonterminals on the right. Here are the current rules that violate that requirement:

$F$ | $\longrightarrow$ | id ( $N$ ) |

$F$ | $\longrightarrow$ | id ( ) |

$N$ | $\longrightarrow$ | id , $N$ |

For each terminal **t** that occurs in one of these rules, we add a new rule

$X$ | $\longrightarrow$ | t |

(where $X$ is a new nonterminal), and we replace each such **t** in the problem rules with its new nonterminal. Our grammar becomes:

$F$ | $\longrightarrow$ | $I$ $L$ $N$ $R$ |

$F$ | $\longrightarrow$ | $I$ $L$ $R$ |

$N$ | $\longrightarrow$ | id |

$N$ | $\longrightarrow$ | $I$ $C$ $N$ |

$I$ | $\longrightarrow$ | id |

$L$ | $\longrightarrow$ | ( |

$R$ | $\longrightarrow$ | ) |

$C$ | $\longrightarrow$ | , |
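This terminal-lifting step can be sketched as follows; the fresh-name supply and the uppercase-means-nonterminal convention are assumptions of the sketch, not part of the notes.

```python
def lift_terminals(rules, fresh_names):
    """In every rule whose right side has more than one symbol, replace
    each terminal t with a nonterminal X (adding the rule X -> t),
    reusing one new nonterminal per distinct terminal."""
    mapping = {}                          # terminal -> its new nonterminal
    new_rules = []
    extra = []                            # the added X -> t rules
    for lhs, rhs in rules:
        if len(rhs) <= 1:
            new_rules.append((lhs, rhs))  # already fine: leave it alone
            continue
        new_rhs = []
        for sym in rhs:
            if sym.isupper():             # nonterminal: keep as-is
                new_rhs.append(sym)
            else:                         # terminal: lift to a nonterminal
                if sym not in mapping:
                    mapping[sym] = next(fresh_names)
                    extra.append((mapping[sym], [sym]))
                new_rhs.append(mapping[sym])
        new_rules.append((lhs, new_rhs))
    return new_rules + extra
```

Feeding it the grammar from the previous step (with the names $I$, $L$, $R$, $C$ supplied in order) reproduces the rules shown above.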

Now, for every right-hand side that has more than two nonterminals, we
replace all but the first nonterminal on the right with a *new*
nonterminal, and we add a new rule for the new nonterminal.
For example, we handle the rule

$F$ | $\longrightarrow$ | $I$ $L$ $N$ $R$ |

as follows:

$F$ | $\longrightarrow$ | $I$ $W$ |

$W$ | $\longrightarrow$ | $L$ $N$ $R$ |

Then we handle the new rule (with $W$ on the left)
as follows:

$W$ | $\longrightarrow$ | $L$ $X$ |

$X$ | $\longrightarrow$ | $N$ $R$ |

If we take care of the other two "problem" rules in the same way:

$F$ | $\longrightarrow$ | $I$ $L$ $R$ |

$N$ | $\longrightarrow$ | $I$ $C$ $N$ |

we finish with the following CFG (in CNF):

$F$ | $\longrightarrow$ | $I$ $W$ |

$F$ | $\longrightarrow$ | $I$ $Y$ |

$W$ | $\longrightarrow$ | $L$ $X$ |

$X$ | $\longrightarrow$ | $N$ $R$ |

$Y$ | $\longrightarrow$ | $L$ $R$ |

$N$ | $\longrightarrow$ | id |

$N$ | $\longrightarrow$ | $I$ $Z$ |

$Z$ | $\longrightarrow$ | $C$ $N$ |

$I$ | $\longrightarrow$ | id |

$L$ | $\longrightarrow$ | ( |

$R$ | $\longrightarrow$ | ) |

$C$ | $\longrightarrow$ | , |
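The "split off all but the first symbol" trick generalizes to a small loop. In this sketch the generated names (`_X1`, `_X2`, ...) are placeholders of my own; the notes used $W$, $X$, $Y$, and $Z$ instead.

```python
def binarize(rules):
    """Split every rule with more than two symbols on the right into a
    chain of two-symbol rules, introducing fresh nonterminals."""
    counter = [0]
    def fresh():
        counter[0] += 1
        return f"_X{counter[0]}"

    result = []
    work = list(rules)
    while work:
        lhs, rhs = work.pop(0)
        if len(rhs) <= 2:
            result.append((lhs, rhs))
        else:
            new = fresh()
            result.append((lhs, [rhs[0], new]))   # X -> first New
            work.append((new, rhs[1:]))           # New -> rest (may split again)
    return result
```

Applied to $F \longrightarrow I\ L\ N\ R$, it produces the same three-rule chain as the $W$/$X$ construction above.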

Test Yourself #1

The CFG below is a grammar for very simple statements. The terminal symbols are
**id**, **=**, **(**, **)**, **++**, and **read**. Convert this CFG to CNF.

$S$ | $\longrightarrow$ | id = id |

$S$ | $\longrightarrow$ | id ( ) |

$S$ | $\longrightarrow$ | id ++ |

$S$ | $\longrightarrow$ | read ( id ) |

$S$ | $\longrightarrow$ | $S$ $S$ |

Parsing an Input

Now that we have converted our CFG to Chomsky Normal form, we can
use it to try to parse an input.
Let's use the input `f(x,y)` as an example.
The actual sequence of tokens for that input would be:
**id ( id , id )**

To parse an input of length *N*, we use an *N*-by-*N* grid.
First, we fill in the squares along the diagonal.
The *k*^{th} square contains the nonterminal(s) $X$
such that there is a CFG rule

$X$ $\longrightarrow$ **t**

and **t** is the *k*^{th} token.
Here's the 6-by-6 grid for our example, with the diagonal
filled in (and with the rows and columns numbered):

Now we fill in each successive diagonal above the one that's already filled in. When filling in a square, we consider what's already in the squares to its left and below it. For example, to fill in the square in position 1,2 (row 1, column 2), we look at the contents of the square to its left ($I$, $N$) and the square below it ($L$). If there were a grammar rule with "$I$ $L$" or with "$N$ $L$" on the right-hand side, we would put the left-side nonterminal into square 1,2. However, for our example, there are no such grammar rules, so square 1,2 stays empty.

The first new square that we can fill in is in position 4,5. That square has $C$ to its left, and $I$, $N$ below it. There is a grammar rule $Z$ $\longrightarrow$ $C$ $N$, so we put a $Z$ in square 4,5. If there had also been a grammar rule $H$ $\longrightarrow$ $C$ $I$, we would also have put an $H$ in square 4,5. Note that we also fill in square 5,6 due to the rule $X$ $\longrightarrow$ $N$ $R$. Here is the grid so far:

The squares in the next diagonal each have two squares to their left and two squares below. We look at pairs of squares: first the square all the way to the left and the square immediately below, then the square immediately to the left and the one two squares down. If either square in the current pair is empty, we go on to the next pair. Each time we look at a pair of non-empty squares, we are looking for a nonterminal $A$ in the square to the left and a nonterminal $B$ in the square that is below, such that there is a grammar rule with $A$ $B$ on the right. If there is such a grammar rule, we fill in the current square with the left-side nonterminal. For example, when trying to fill in the square in position 3,5, we see $I$, $N$ in position 3,3, and $Z$ in position 4,5. There is a grammar rule $N$ $\longrightarrow$ $I$ $Z$, so we put an $N$ in position 3,5. Here is the result of filling in the whole grid:

To understand this process, remember that the CYK algorithm works by building up parse trees for longer and longer fragments of the input. Each box in the upper-right part of the grid represents the root of one or more parse trees. The nonterminal(s) in a square is/are the root nonterminal(s) of those trees. The children of the root are the nonterminals in the pairs of squares, one to the left, and one below, that caused the nonterminal to be filled in. The five nonterminals in the upper-right part of our grid represent these five parse trees:

Test Yourself #2

Use the statement CFG that was converted to CNF
for the "Test Yourself #1" question to parse the
input `x ++ y = z y ++` (i.e.,
build and fill in the grid).

Accepting an Input

The CYK algorithm discovers that an input *is* in the language of
a CFG iff the CFG's start nonterminal appears in the upper-right
corner of the grid.
For our example, $F$ is the start nonterminal, and it
does appear in square 1,6, so the example input is in the
language of the CFG.
The rightmost parse tree shown above is the (unique) parse
tree for the example input.
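The grid-filling and acceptance test fit in a short sketch. The encoding here (rules as (left, right-list) pairs, the grid as a 1-indexed table of sets) is my own, not part of the notes:

```python
def cyk(tokens, rules, start_sym):
    """CYK over a CNF grammar: grid[i][j] holds every nonterminal that
    derives tokens i..j (1-indexed, inclusive).  The input is accepted
    iff the start nonterminal is in grid[1][n] (the upper-right square)."""
    n = len(tokens)
    grid = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    # Diagonal: square k,k gets every X with a rule X -> t, t the k-th token.
    for k, tok in enumerate(tokens, 1):
        for lhs, rhs in rules:
            if rhs == [tok]:
                grid[k][k].add(lhs)
    # Longer spans: try every split of tokens i..j into i..k and k+1..j.
    for span in range(2, n + 1):
        for i in range(1, n - span + 2):
            j = i + span - 1
            for k in range(i, j):
                for lhs, rhs in rules:
                    if (len(rhs) == 2 and rhs[0] in grid[i][k]
                            and rhs[1] in grid[k + 1][j]):
                        grid[i][j].add(lhs)
    return start_sym in grid[1][n], grid

# The CNF grammar derived in these notes.
RULES = [("F", ["I", "W"]), ("F", ["I", "Y"]), ("W", ["L", "X"]),
         ("X", ["N", "R"]), ("Y", ["L", "R"]), ("N", ["id"]),
         ("N", ["I", "Z"]), ("Z", ["C", "N"]), ("I", ["id"]),
         ("L", ["("]), ("R", [")"]), ("C", [","])]
```

Running `cyk(["id", "(", "id", ",", "id", ")"], RULES, "F")` accepts the example input, with `"F"` appearing in `grid[1][6]` just as in the grid above.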

Test Yourself #3

Is the input `x ++ y = z y ++`
in the language of the simple statement grammar?
(Justify your answer.)