CS 536
Lecture Notes: February 26, 1998
Main Topics: Grammar Analysis and Parsing



Abstract Syntax Trees (AST) and Java Cup

Example 1
original production:
	stmt -> ident = exp ;
in cup form:
	stmt ::= ident asg exp semi
in cup form with actions:
	stmt ::= ident:i asg exp:e semi
	  {: RESULT = new asgNode(i, e, i.linenum, i.colnum);
	   :}
AST Tree produced:
		-----------
      stmt = 	| asgNode |
		-----------
		 /	\
        	/	 \
     	      ident	 exp

Example 2
original production:
	stmts -> stmt stmts
in cup form with actions:
	stmts ::= stmt:s1 stmts:s2
	  {: RESULT = new stmtsNode(s1, s2, s1.linenum, s1.colnum);
	   :}
AST Tree produced:

		-------------
       stmts =  | stmtsNode |
		-------------
	 	 /	\
		/	 \
	      stmt	stmts

The generated parser combines the subtrees built by each production's action to form the complete AST for the input program. Later, type checking and code generation can be done on a node-by-node basis. Combining the above two examples gives the following AST:

		-------------
       		| stmtsNode |
		-------------
	 	 /	\
		/	 \
	  -----------	... (not defined in this example)	
	  | asgNode |
	  -----------
	    /	\
	   /	 \
	 ident	 exp

For project 3, we will build an unparser utility that walks the complete AST and prints the original tokens in the same form in which they were entered. This not only facilitates grading but is also a good way to check the structure of the tree for errors when debugging. Each node will have a member function "void Unparse(int indent)" to print out the information it contains.

For example, the Unparse routine for identNode will be easy to implement. It will print out the serial number for the token, using Registration.toString(). A more complex example is for the asgNode. The code for this is as follows:

	
	void Unparse(int indent){
		genIndent(indent);
		target.Unparse(0);
		System.out.print(" = ");
		source.Unparse(0);
		System.out.println(";");
	}
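
The same pattern extends to the other node types. The following is only a minimal sketch, not the project's code: the class names (UnparseSketch, IdentNode, AsgNode, StmtsNode) are hypothetical, the line and column fields are left out, the identifier is stored as a plain string instead of going through Registration.toString(), and output goes to a StringBuilder rather than System.out so the result is easy to inspect.

```java
// Hypothetical, simplified AST classes for illustration only.
class UnparseSketch {
    interface Node {
        void unparse(StringBuilder out, int indent);
    }

    // Append 'indent' spaces (stands in for genIndent in the notes).
    static void genIndent(StringBuilder out, int indent) {
        for (int i = 0; i < indent; i++) out.append(' ');
    }

    static final class IdentNode implements Node {
        final String name;   // assumption: the name is stored directly
        IdentNode(String name) { this.name = name; }
        public void unparse(StringBuilder out, int indent) {
            genIndent(out, indent);
            out.append(name);
        }
    }

    static final class AsgNode implements Node {
        final Node target, source;
        AsgNode(Node target, Node source) { this.target = target; this.source = source; }
        public void unparse(StringBuilder out, int indent) {
            genIndent(out, indent);       // indent once per statement
            target.unparse(out, 0);       // children print inline
            out.append(" = ");
            source.unparse(out, 0);
            out.append(";\n");
        }
    }

    static final class StmtsNode implements Node {
        final Node first;
        final Node rest;     // assumption: null marks the end of the list
        StmtsNode(Node first, Node rest) { this.first = first; this.rest = rest; }
        public void unparse(StringBuilder out, int indent) {
            first.unparse(out, indent);
            if (rest != null) rest.unparse(out, indent);
        }
    }

    // Builds the tree for "a = b; c = d;" and unparses it with indent 2.
    static String demo() {
        Node prog = new StmtsNode(
            new AsgNode(new IdentNode("a"), new IdentNode("b")),
            new StmtsNode(new AsgNode(new IdentNode("c"), new IdentNode("d")), null));
        StringBuilder out = new StringBuilder();
        prog.unparse(out, 2);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(demo());
    }
}
```

Each statement is indented by its parent, while the pieces inside a statement are unparsed with indent 0 so they stay on one line.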


GRAMMAR ANALYSIS

S -> A B
   | x
B -> b
A -> a A
C -> d
As discussed last lecture, this grammar has two structural problems:
  1. C is not reachable from the start symbol (unreachable non-terminal)
  2. A can never derive a string of terminals, so any derivation that reaches it never finishes (non-terminating non-terminal)
Marking algorithms (such as the three below) can be used to detect these and other grammar errors. Note that in these algorithms, if a non-terminal is marked, all instances of it should be marked. For these notes, marked symbols will be in parentheses.

~~~~~~~~

ALGORITHM FOR FINDING USELESS NONTERMINALS (ones that are non-terminating)
  1. Mark all terminal symbols
  2. while (there are more symbols that can be marked){
    if (every symbol on the right side of some production is marked){ mark that production's left side }
    }
The above grammar becomes:
(S) -> A (B)
     | (x)
(B) -> (b)
 A  -> (a) A
(C) -> (d)

Unmarked non-terminal A is "useless" and should generate an error.
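
The marking loop above can be sketched in Java. This is only an illustration under stated assumptions: the class name TerminatingCheck, the Production helper, and the use of plain strings for symbols are all made up for this example.

```java
import java.util.*;

// Hypothetical sketch of the "useless non-terminal" marking algorithm.
class TerminatingCheck {
    // One production: a left side and its right-side symbols.
    static final class Production {
        final String lhs;
        final List<String> rhs;
        Production(String lhs, String... rhs) {
            this.lhs = lhs;
            this.rhs = Arrays.asList(rhs);
        }
    }

    // Returns the non-terminals that can derive a string of terminals.
    static Set<String> terminating(List<Production> grammar, Set<String> terminals) {
        Set<String> marked = new HashSet<>(terminals);   // step 1: mark all terminals
        boolean changed = true;
        while (changed) {                                // step 2: repeat until nothing new is marked
            changed = false;
            for (Production p : grammar) {
                if (!marked.contains(p.lhs) && marked.containsAll(p.rhs)) {
                    marked.add(p.lhs);                   // whole right side marked, so mark left side
                    changed = true;
                }
            }
        }
        marked.removeAll(terminals);                     // report non-terminals only
        return marked;
    }

    // The grammar from these notes.
    static List<Production> exampleGrammar() {
        return Arrays.asList(
            new Production("S", "A", "B"),
            new Production("S", "x"),
            new Production("B", "b"),
            new Production("A", "a", "A"),
            new Production("C", "d"));
    }

    public static void main(String[] args) {
        Set<String> terminals = new HashSet<>(Arrays.asList("x", "b", "a", "d"));
        Set<String> ok = terminating(exampleGrammar(), terminals);
        System.out.println(new TreeSet<>(ok));           // prints [B, C, S]
    }
}
```

Any non-terminal missing from the returned set (here, A) is the one to report as useless.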

~~~~~~~~

ALGORITHM FOR REACHABILITY
  1. Mark start symbol
  2. while (there are more symbols that can be marked){
    if (left side is marked){ mark all non-terminals on right side }
    }

The above grammar becomes:
(S) -> (A) (B)
     |  (x)
(B) -> (b)
(A) -> (a) (A)
 C  -> d

C -> d is unreachable.
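
A possible Java sketch of the reachability check follows. It is not the course's implementation: the class name ReachabilityCheck and the map-based grammar representation are assumptions, and it uses a worklist rather than rescanning every production on each pass. It tracks only non-terminals, since those are what get reported.

```java
import java.util.*;

// Hypothetical sketch of the reachability marking algorithm.
class ReachabilityCheck {
    // grammar maps each non-terminal to the list of its right sides.
    static Set<String> reachable(Map<String, List<List<String>>> grammar, String start) {
        Set<String> marked = new HashSet<>();
        Deque<String> work = new ArrayDeque<>();
        marked.add(start);                               // step 1: mark the start symbol
        work.push(start);
        while (!work.isEmpty()) {                        // step 2: propagate marks
            String lhs = work.pop();
            for (List<String> rhs : grammar.getOrDefault(lhs, Collections.emptyList())) {
                for (String sym : rhs) {
                    // Only non-terminals (keys of the map) are tracked.
                    if (grammar.containsKey(sym) && marked.add(sym)) {
                        work.push(sym);
                    }
                }
            }
        }
        return marked;
    }

    // The grammar from these notes.
    static Map<String, List<List<String>>> exampleGrammar() {
        Map<String, List<List<String>>> g = new LinkedHashMap<>();
        g.put("S", Arrays.asList(Arrays.asList("A", "B"), Arrays.asList("x")));
        g.put("B", Arrays.asList(Arrays.asList("b")));
        g.put("A", Arrays.asList(Arrays.asList("a", "A")));
        g.put("C", Arrays.asList(Arrays.asList("d")));
        return g;
    }

    public static void main(String[] args) {
        Map<String, List<List<String>>> g = exampleGrammar();
        Set<String> unreachable = new TreeSet<>(g.keySet());
        unreachable.removeAll(reachable(g, "S"));
        System.out.println(unreachable);                 // prints [C]
    }
}
```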

~~~~~~~~

PARSING NON-TERMINALS WHICH CONTAIN LAMBDA
A -> B C
B -> lambda
C -> lambda
In this example, A goes to lambda indirectly
A marking algorithm can be used to generate a list of all non-terminals in a grammar that can go to lambda.
  1. If (A->lambda) is a production, mark A.
  2. while (there are more symbols that can be marked){
    if (B-> C D E ... X is a production and the entire right side is marked){ mark B }
    }

An example of this:
S -> A (B) (C)
A -> a
(B) -> (C)  (D)
(D) -> d | lambda
(C) -> c | lambda

So B, C, D can generate lambda.
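
The lambda-marking algorithm can be sketched the same way. Again this is only an illustration: the class name NullableCheck and the array-based production format are assumptions. A lambda production is written as a row with a left side and no right-side symbols, so step 1 falls out of the general loop (an empty right side is trivially "entirely marked").

```java
import java.util.*;

// Hypothetical sketch of the "can go to lambda" marking algorithm.
class NullableCheck {
    // Each row is {lhs, sym1, sym2, ...}; a row with only a left side is lhs -> lambda.
    static Set<String> nullable(String[][] productions) {
        Set<String> marked = new HashSet<>();
        boolean changed = true;
        while (changed) {
            changed = false;
            for (String[] p : productions) {
                List<String> rhs = Arrays.asList(p).subList(1, p.length);
                // Terminals are never marked, so a right side containing one never qualifies.
                if (!marked.contains(p[0]) && marked.containsAll(rhs)) {
                    marked.add(p[0]);
                    changed = true;
                }
            }
        }
        return marked;
    }

    // The example grammar from these notes.
    static String[][] exampleGrammar() {
        return new String[][] {
            {"S", "A", "B", "C"},
            {"A", "a"},
            {"B", "C", "D"},
            {"D", "d"},
            {"D"},          // D -> lambda
            {"C", "c"},
            {"C"}           // C -> lambda
        };
    }

    public static void main(String[] args) {
        System.out.println(new TreeSet<>(nullable(exampleGrammar())));   // prints [B, C, D]
    }
}
```

S stays unmarked because A can only derive "a", so the right side A B C is never entirely marked.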

~~~~~~~~

In a context-free grammar, a non-terminal can be replaced by the right side of any of its productions regardless of the symbols around it. The following derivation (using the grammar just above) illustrates this:
Derivation
Step		Expression
0		S => A B C
1		  => a B C
2		  => a C D C
3		  => a D C
4		  => a C
5		  => a

Some notation and examples:

=>	one derivation step		(ex.	see above)
=>+	one or more derivation steps	(ex.	S =>+ a)
=>*	zero or more derivation steps	(ex.	S =>* S)
(compare the last two to the use of + and * in regular expressions)


PARSING

Parsing means: given a token sequence, figure out its parse tree (derivation steps).
There are two approaches to parsing: top down and bottom up. Example: given the following grammar, parse ID + ID using both methods.
E -> E + T
   | T
T -> T * ID
   |  ID
Top Down Parse
  • Start with start symbol (E)
  • Try expanding each non-terminal until you get the correct expression

start here->	  E
       	       /  |   \
      	      E   +    T
      	      |	       |
      	      T	       ID
              |
      	      ID	

Bottom Up Parse
  • Start with the tokens (ID + ID) and search all productions for a right side that matches
  • Try replacing each matched right side with its left-side non-terminal until you reach the start symbol (E)
		     E	
		/    |    \
		E    |	  |
		|    |	  |
		T    |	  T
		|    |    |
start here ->	ID   +   ID	

Top Down parsing is simpler but slower because you have to search each production for each non-terminal in the expression. Later, we will learn more precise techniques for finding these parse trees. In general, top-down parsing uses approximately i^3 steps to parse an expression where i is the number of tokens.

Expression    Number of Tries to Get Correct Parse    i
-------------------------------------------------------
(a]           7                                       2
((a]]         17                                      4
(((a]]]       37                                      6
So for an average-size program of 1000 tokens, it would take about 1000^3 = 10^9 steps to parse, which translates to hundreds of seconds on a machine of today's standards.


mabaxter@students.wisc.edu