Classnotes for CS536-1 	Feb 17, 1998

by Wai-Kwong So (wai-kwon@cs.wisc.edu)

----------------------------------------------------------------------------

How does a scanner generator operate?
1. Translate the regular expressions into a non-deterministic FA (NFA)
2. Transform the NFA into a deterministic FA (DFA)

We will use the subset construction.

First, we number the NFA states for ease of reference.
For example,

			  /---\
		       	 c|   |
	            b     | <--
	    /--->(2)---->(3)-----\
      lambda|             |      |lambda
	    |       	  |      |        d
       --->(1)	   |-->---|^     |--->(6)--->(7)
	    |	  c|      |a     |
	    |  b   |  c   |      | 
	    \---->(4)--->(5)-----/
	      
To build a corresponding DFA, we label DFA states with sets of NFA states.
The idea is that if the NFA, after reading a string S, could be in state
i, state j, or state k, then the corresponding DFA, after reading S, will
be in a state labelled with {i, j, k}.

In effect the DFA "tracks" all the different states the NFA might reach.

for example,

         b	      c		   a
 {1, 2} ---> {3,4,6} ---> {3,5,6} ---> {3,6}
  b c a					-   Final state

Each set in braces labels a single DFA state.

We define the lambda-successors of a set S of NFA states, denoted A(S), as
A(S) = S U {all states reachable from members of S reading only lambda}

In the previous example, A({1})={1,2}
			 A({3,5})={3,5,6}
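
The definition of A(S) can be sketched as a small worklist loop. This is
only an illustration: the lambda edges below (1 to 2, 3 to 6, 5 to 6) are an
assumed reading of the diagram above, chosen because they reproduce the two
values of A just given, and the names LAMBDA_EDGES and lambda_closure are
hypothetical.

```python
# Assumed lambda edges of the example NFA (a reading of the diagram, not a
# definitive transcription).
LAMBDA_EDGES = {1: {2}, 3: {6}, 5: {6}}


def lambda_closure(states, lambda_edges=LAMBDA_EDGES):
    """Return A(S): S plus every state reachable through lambda moves only."""
    closure = set(states)
    worklist = list(states)
    while worklist:
        state = worklist.pop()
        for nxt in lambda_edges.get(state, ()):
            if nxt not in closure:      # first time we reach this state
                closure.add(nxt)
                worklist.append(nxt)
    return frozenset(closure)


print(sorted(lambda_closure({1})))     # [1, 2]
print(sorted(lambda_closure({3, 5})))  # [3, 5, 6]
```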

The start state of the DFA we are building will be
START_D = A({START_N})
where START_N is the start state of the NFA and START_D that of the DFA.

For S, a set of NFA states, the successor set under character a is:
s(S,a) = {v | t in S, t --a--> u, v in A({u})}
where s is the successor function and a is a single character.

That is, for each state t in S, we read "a" and go to a successor state u.
We then include u's lambda-successors.
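
A minimal sketch of the successor function follows. The edge tables are an
assumed reading of the example NFA (no transition on a is modelled), and
CHAR_EDGES, LAMBDA_EDGES, closure and successors are hypothetical names used
only for illustration.

```python
# Assumed encoding of the example NFA: (state, char) -> set of next states.
CHAR_EDGES = {(1, "b"): {4}, (2, "b"): {3}, (3, "c"): {3},
              (4, "c"): {5}, (6, "d"): {7}}
LAMBDA_EDGES = {1: {2}, 3: {6}, 5: {6}}


def closure(states):
    """A(S): S plus all states reachable by lambda moves only."""
    result, worklist = set(states), list(states)
    while worklist:
        for nxt in LAMBDA_EDGES.get(worklist.pop(), ()):
            if nxt not in result:
                result.add(nxt)
                worklist.append(nxt)
    return frozenset(result)


def successors(state_set, char):
    """s(S, a): read char from each t in S, then add lambda-successors."""
    reached = set()
    for t in state_set:
        for u in CHAR_EDGES.get((t, char), ()):
            reached |= closure({u})
    return frozenset(reached)


print(sorted(successors({1, 2}, "b")))     # [3, 4, 6]
print(sorted(successors({3, 4, 6}, "c")))  # [3, 5, 6]
```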

A final state in the new DFA is ANY state labelled with a set that includes
AT LEAST one final state of the NFA.

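To build a DFA from an NFA:
1. Create the START state of the DFA = A({START_N}),
   where START_N is the start state of the NFA
2. Repeat
	For each DFA state S and each character c,
	add s(S,c) to the DFA, with a transition from S to s(S,c) under c
   Until no new states or transitions can be added

3. Mark the final states of the DFA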

Example (from above)                           c
					    |----|
	 b	      c		   c        |    |
-->(1,2)--->((3,4,6))--->((3,5,6))--->((3,6))<----   
		|	     |           |
                |            | d         |
                |    d       v       d   |
                ---------->((7))<---------
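
The three steps, and the example DFA just drawn, can be reproduced with a
short sketch. Everything about the NFA encoding is an assumption: the edges
are one reading of the diagram (with no transition on a, matching the DFA
above, which uses only b, c and d), and taking 6 and 7 as the NFA's final
states is inferred from which DFA states are drawn as final. All names are
hypothetical.

```python
from collections import deque

# Assumed encoding of the example NFA.
CHAR_EDGES = {(1, "b"): {4}, (2, "b"): {3}, (3, "c"): {3},
              (4, "c"): {5}, (6, "d"): {7}}
LAMBDA_EDGES = {1: {2}, 3: {6}, 5: {6}}
NFA_START = 1
NFA_FINALS = {6, 7}   # assumption inferred from the DFA drawing
ALPHABET = "bcd"


def closure(states):
    """A(S): S plus all states reachable by lambda moves only."""
    result, worklist = set(states), list(states)
    while worklist:
        for nxt in LAMBDA_EDGES.get(worklist.pop(), ()):
            if nxt not in result:
                result.add(nxt)
                worklist.append(nxt)
    return frozenset(result)


def successors(state_set, char):
    """s(S, c): read char from every state in S, then take lambda-successors."""
    reached = set()
    for t in state_set:
        for u in CHAR_EDGES.get((t, char), ()):
            reached |= closure({u})
    return frozenset(reached)


def subset_construction():
    start = closure({NFA_START})                        # step 1
    dfa_states, transitions = {start}, {}
    queue = deque([start])
    while queue:                                        # step 2
        current = queue.popleft()
        for char in ALPHABET:
            target = successors(current, char)
            if not target:
                continue                                # no move on char
            transitions[(current, char)] = target
            if target not in dfa_states:
                dfa_states.add(target)
                queue.append(target)
    finals = {s for s in dfa_states if s & NFA_FINALS}  # step 3
    return start, dfa_states, transitions, finals


start, dfa_states, transitions, finals = subset_construction()
print(sorted(sorted(s) for s in dfa_states))
```

Transitions that lead to the empty set are simply left undefined here, which
is why the drawn DFA has no arrows for them.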

We can prove the original NFA and the new DFA are equivalent.

That is, the NFA accepts C1,...,CL iff the DFA does.

Use induction on the length L of C1,...,CL.
Induction hypothesis:
	After reading L characters, the DFA is in the state labelled
{i,j,k} iff the states the NFA can reach reading those L characters are
exactly i, j, and k.

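Note also the potential size blowup: an NFA with M states can give a DFA
with as many as 2^M states, since each DFA state is a set of NFA states.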

As another example, (aa|aaaaa)+

NFA:                     lambda
      ______________________________________________
     | lambda   a      a      a      a      a       |
     | --->(2)--->(3)--->(4)--->(5)--->(6)--->((7))--
     v/
--->(1)
     ^\ lambda  a      a
     | --->(8)--->(9)--->((10))--
     |___________________________|
                lambda


Do subset construction

	    a	     a		       a	  a
--->(1,2,8)--->(3,9)--->((1,2,4,8,10))--->(3,5,9)--->((1,2,4,6,8,10))-----
			 				    	        a|
		    	    a			  a	    	         |
 ---((1,2,3,4,5,7,8,9,10))<---((1,2,3,4,6,8,9,10))<---((1,2,3,5,7,8,9))<--
 |           
 | a 			    a				
 --->((1,2,3,4,5,6,8,9,10))--->((1,2,3,4,5,6,7,8,9,10))<---
					  |		  |a
					  |_______________|
					  

Renumbering states we have

       a      a        a      a        a        a        a        a
-->(1)--->(2)--->((3))--->(4)--->((5))--->((6))--->((7))--->((8))--->((9))
								       |
								      a|
								       |
								       v
							             ((10))<---
								       |      |a
								       |______|
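
As a sanity check on the renumbered DFA, the sketch below simulates it on
strings of a's and compares the result with Python's re module applied to
the same regular expression. The table DELTA is just the chain drawn above
(state k goes to k+1 on a, and state 10 loops); dfa_accepts is a
hypothetical helper name.

```python
import re

# Renumbered DFA: k --a--> k+1 for k = 1..9, and 10 --a--> 10.
DELTA = {(k, "a"): min(k + 1, 10) for k in range(1, 11)}
START, FINALS = 1, {3, 5, 6, 7, 8, 9, 10}


def dfa_accepts(text):
    """Run the DFA over text; a missing transition means rejection."""
    state = START
    for ch in text:
        state = DELTA.get((state, ch))
        if state is None:
            return False
    return state in FINALS


pattern = re.compile(r"(aa|aaaaa)+")
for n in range(30):
    s = "a" * n
    # The DFA and the regular expression should agree on every length.
    assert dfa_accepts(s) == bool(pattern.fullmatch(s))

print([n for n in range(12) if dfa_accepts("a" * n)])
# [2, 4, 5, 6, 7, 8, 9, 10, 11]
```

Only lengths 0, 1 and 3 are rejected, since every length from 4 upward can
be written as a sum of 2's and 5's.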
								
This is a correct DFA, but it has more states than are really necessary.

We can OPTIMIZE the DFA, reducing the number of states needed, without
affecting the DFA's correctness.

The idea is to group together and merge states that are EQUIVALENT in
their operation.

Two states are equivalent if starting in one allows EXACTLY the same
set of strings to be accepted as starting in the other does.

for example, in
                 a
	    |-------->(())
            |
	-->()
            |    b
            |-------->(())

Both final states allow only lambda (the empty string) to be accepted.
They are equivalent and may be merged together.

We then get
                 a
	    |---------|
            |         | 
	-->()         |-->(())
            |    b    |
            |-------- |

We will define a GREEDY state-merging algorithm:
Optimize DFA:
1. Group the states of the DFA into 2 disjoint groups:
	G1 = all final states
	G2 = all non-final states
2. REPEAT
	Let G = any state group {s1,s2,...,sn}
	Let c = any character
	Let G'= {t1,t2,...,tn} = group of successors to states in G
	            c
	under c (si--->ti)

	IF G' is not entirely contained in a single existing group
	THEN
	     Split G into new groups such that si and sj remain in
	the same group if and only if ti and tj are in the same group.
	(States with no successor under c stay together, apart from
	the states that have one.)

   UNTIL no changes

3. Groups now correspond to states of an optimized DFA:
	The Group with original START state is new START state.
	Groups of final states are new final states.
	Transitions are as in original DFA.
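
A rough sketch of steps 1 and 2 follows. It is a variant, not a literal
transcription: instead of picking one group and one character at a time, it
splits every group on all characters in each pass, which should reach the
same final grouping. A missing transition is treated as going to None, so
states without a successor are kept apart from states that have one. The
function name and the state names S, X, Y (for the small a/b example above)
are hypothetical.

```python
def minimize_groups(states, alphabet, delta, finals):
    """Return the final grouping of DFA states as a set of frozensets."""
    finals = frozenset(finals)
    # Step 1: final states versus non-final states (empty groups dropped).
    groups = [g for g in (finals, frozenset(states) - finals) if g]
    changed = True
    while changed:                       # Step 2: repeat until no changes
        changed = False
        group_of = {s: i for i, g in enumerate(groups) for s in g}
        new_groups = []
        for g in groups:
            buckets = {}
            for s in g:
                # For each character, which group does the successor lie in?
                sig = tuple(group_of.get(delta.get((s, c))) for c in alphabet)
                buckets.setdefault(sig, set()).add(s)
            if len(buckets) > 1:         # successors disagree, so split g
                changed = True
            new_groups.extend(frozenset(b) for b in buckets.values())
        groups = new_groups
    return set(groups)


# The small example above: a start state with a and b moves to two finals.
delta = {("S", "a"): "X", ("S", "b"): "Y"}
result = minimize_groups({"S", "X", "Y"}, "ab", delta, {"X", "Y"})
print(sorted(sorted(g) for g in result))   # [['S'], ['X', 'Y']]
```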

Example 1:
                a        b
	     |----->(2)---->((3))
             |	
	--->(1)
	     |  c        d       b
	     |----->(4)---->(5)---->((6))

Initially, G1 = [1,2,4,5]
	   G2 = [3,6]

Now 1 has a successor under a; 2, 4, and 5 don't, so split
	   G3 = [1]	G4 = [2,4,5]

Now 4 has a successor under d; 2 and 5 don't, so split
	   G5 = [4]	G6 = [2,5]

Overall, we have [1],[4],[2,5],[3,6]

No further splits are needed.

We have
                 a          b
	     |------->(2,5)----->((3,6))
             |	        |
	--->(1)         |d
	     |  c       |       
	     |---->(4)-->


Example 2:

        a      a        a      a        a        a        a        a
--->(1)--->(2)--->((3))--->(4)--->((5))--->((6))--->((7))--->((8))--->
      a          a
((9))--->((10))<--|
           |      |
           |______|

Initially we have G1 = [3,5,6,7,8,9,10]
		  G2 = [1,2,4]

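Look at successors under a:
3 goes to 4, which is in G2; 5 through 10 all go to states in G1.
Split 3 from G1: [3] [5,6,7,8,9,10]
[1,2,4] splits into [1] and [2,4] (1 goes to 2; 2 and 4 go to final states)
[2,4] then splits into [2] and [4] (2 goes to [3]; 4 goes to [5,...,10])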

We end up with

        a      a        a      a            a
--->(1)--->(2)--->((3))--->(4)--->((5-10))<--|
           			      |      |
           			      |______|
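
One way to convince ourselves that the merged DFA still accepts the same
strings is to walk both machines in lockstep and look for a reachable pair
of states where exactly one is final. The sketch below does that; the
function name equivalent, the tuple layout (delta, start, finals), and the
choice to call the merged state 5 are all assumptions made for illustration.

```python
from collections import deque


def equivalent(dfa1, dfa2, alphabet):
    """Each dfa is (delta, start, finals); None acts as a dead state."""
    (delta1, start1, finals1), (delta2, start2, finals2) = dfa1, dfa2
    seen = {(start1, start2)}
    queue = deque(seen)
    while queue:
        p, q = queue.popleft()
        if (p in finals1) != (q in finals2):
            return False          # one machine accepts here, the other not
        for c in alphabet:
            pair = (delta1.get((p, c)), delta2.get((q, c)))
            if pair not in seen:
                seen.add(pair)
                queue.append(pair)
    return True


# Ten-state DFA from before, and the five-state result (5 stands for 5-10).
TEN_STATE = ({(k, "a"): min(k + 1, 10) for k in range(1, 11)},
             1, {3, 5, 6, 7, 8, 9, 10})
FIVE_STATE = ({(1, "a"): 2, (2, "a"): 3, (3, "a"): 4,
               (4, "a"): 5, (5, "a"): 5}, 1, {3, 5})

print(equivalent(TEN_STATE, FIVE_STATE, "a"))   # True
```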