Classnotes for CS536-1 Feb 17, 1998 by Wai-Kwong So (wai-kwon@cs.wisc.edu)
----------------------------------------------------------------------------

How does a scanner generator operate?

  1. Translate regular expressions into a non-deterministic finite
     automaton (NFA).
  2. Transform the NFA into a deterministic FA (DFA).

We will use the subset construction. First, we number the NFA states for
ease of reference. For example,

                    /---\
                   c|   |
                b   | <--
      /--->(2)---->(3)-----\
lambda|             |      |lambda
      |             |      |        d
 --->(1)     |-->---|^     |--->(6)--->(7)
      |     c|      |a     |
      |  b   |  c   |      |
      \---->(4)--->(5)-----/

To build a corresponding DFA, we label DFA states with sets of NFA states.
The idea is that if the NFA, after reading a string S, could be in state i,
state j, or state k, then the corresponding DFA, after reading S, will be in
a state labelled with {i, j, k}. In effect the DFA "tracks" all the
different states the NFA might reach. For example,

          b            c            a
  {1, 2} ---> {3,4,6} ---> {3,5,6} ---> {3,6}

Each brace represents a state.

We will define the lambda-successors of a set S of NFA states, denoted A(S),
as

    A(S) = S U {all states reachable from members of S reading only lambda}

In the previous example,

    A({1})   = {1,2}
    A({3,5}) = {3,5,6}

The start state of the DFA we are building will be

    START_D = A({START_N})

where START_N is the start state of the NFA and START_D is the start state
of the DFA.

For S, a set of NFA states, the successor set under character a is:

    s(S,a) = {v | t in S, t --a--> u, v in A({u})}

where s is the successor function and a is a single character. That is, for
each state t in S, we read "a" and go to a successor state u. We then
include u's lambda-successors.

A final state in the new DFA is ANY state labelled with a set that includes
AT LEAST one final state of the NFA.

To build a DFA from an NFA:

  1. Create START state of DFA = A({START_N})
  2. For each DFA state S, and each character c,
     add s(S,c) to the DFA,
     until no new states or transitions can be added
  3.
     Mark final states of DFA

Example (from above)

                                              c
                                            |----|
         b            c            c        |    |
-->(1,2)--->((3,4,6))--->((3,5,6))--->((3,6))<----
                |            |           |
                |            | d         |
                | d          v         d |
                ---------->((7))<---------

We can prove the original NFA and the new DFA are equivalent. That is, the
NFA accepts C1,...,CL iff the DFA does. Use induction on the length of
C1,...,CL.

Induction hypothesis: after reading L characters, the DFA is in a state
labelled with {i,j,k} iff the NFA can reach state i or state j or state k
reading those L characters.

Note also the potential blowup in size: an NFA with M states can produce a
DFA with up to 2^M states.

As another example, (aa|aaaaa)+

NFA:

                          lambda
   ________________________________________________________
  |    lambda     a      a      a      a      a            |
  |    ------>(2)--->(3)--->(4)--->(5)--->(6)--->((7))-----|
  v   /                                                    |
--->(1)                                                    |
      \lambda     a      a                                 |
       ------>(8)--->(9)--->((10))-------------------------|

Do subset construction

            a        a                 a          a
--->(1,2,8)--->(3,9)--->((1,2,4,8,10))--->(3,5,9)--->((1,2,4,6,8,10))-----
                           a                       a                      |a
 ---((1,2,3,4,5,7,8,9,10))<---((1,2,3,4,6,8,9,10))<---((1,2,3,5,7,8,9))<--
 |a                         a
 --->((1,2,3,4,5,6,8,9,10))--->((1,2,3,4,5,6,7,8,9,10))<---
                                           |               |a
                                           |_______________|

Renumbering states we have

       a      a        a      a        a        a        a        a
-->(1)--->(2)--->((3))--->(4)--->((5))--->((6))--->((7))--->((8))--->((9))
                                                                       |
                                                                      a|
                                                                       |
                                                                       v
                                                                     ((10))<---
                                                                       |      |a
                                                                       |______|

This is a correct DFA, but it has more states than are really necessary.
We can OPTIMIZE the DFA, reducing the number of states needed, without
affecting the DFA's correctness.

The idea is to group together and merge states that are EQUIVALENT in their
operation. Two states are equivalent if starting in one allows EXACTLY the
same set of strings to be accepted as starting in the other does.

For example, in

           a
       |-------->(())
       |
  -->()
       |   b
       |-------->(())

both final states allow only lambda to be accepted. They are equivalent and
may be merged together. We then get

           a
       |---------|
       |         |
  -->()          |-->(())
       |   b     |
       |---------|

We will define a GREEDY state merging by:

Optimize DFA:

  1. Group states of DFA into 2 disjoint groups:
        G1 = all final states
        G2 = all non-final states
  2.
     Repeat
        Let G  = any state group {s1,s2,...,sn}
        Let c  = any character
        Let G' = {t1,t2,...,tn} = group of successors to states in G
                 under c (si --c--> ti)
        IF G' is not entirely contained in a single existing group
        THEN Split G into new groups such that si and sj remain in
             the same group if and only if ti and tj are in the same
             group.
     UNTIL no changes
  3. Groups now correspond to states of an optimized DFA:
     The group with the original START state is the new START state.
     Groups of final states are the new final states.
     Transitions are as in the original DFA.

Example 1:

          a        b
       |----->(2)---->((3))
       |
--->(1)
       |  c        d       b
       |----->(4)---->(5)---->((6))

Initially,
    G1 = [3,6]
    G2 = [1,2,4,5]

Now 1 has a successor under a; 2, 4 and 5 don't, so split G2 into
    G3 = [1]
    G4 = [2,4,5]

Now 4 has a successor under d; 2 and 5 don't, so split G4 into
    G5 = [4]
    G6 = [2,5]

Overall, we have [1],[4],[2,5],[3,6]. No further splits are needed. We have

           a           b
       |------->(2,5)----->((3,6))
       |          |
--->(1)           |d
       |  c       |
       |---->(4)-->

Example 2:

        a      a        a      a        a        a        a        a
--->(1)--->(2)--->((3))--->(4)--->((5))--->((6))--->((7))--->((8))--->

              a         a
        ((9))--->((10))<--|
                   |      |
                   |______|

Initially we have
    G1 = [3,5,6,7,8,9,10]
    G2 = [1,2,4]

Look at successors under a:

    Split 3 from G1: [3] [5,6,7,8,9,10]
        (3 goes to 4, which is in G2; the others stay inside G1)
    [1,2,4] splits into [1] and [2,4]
        (1 goes to 2, which is in G2; 2 and 4 go to final states)
    [2,4] then splits into [2] and [4]
        (2 goes to [3]; 4 goes to [5,6,7,8,9,10])

We end up with

        a      a        a      a           a
--->(1)--->(2)--->((3))--->(4)--->((5-10))<--|
                                      |      |
                                      |______|
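
The subset construction and the greedy state merging described in these
notes can be sketched in Python. This is only a sketch under assumptions
that are not in the notes: the names closure, subset_construction, group_of
and minimize are made up for illustration; an NFA is a dictionary from
(state, character) to a set of states, with None standing for lambda; the
(aa|aaaaa)+ NFA is given lambda moves 1->2, 1->8, 7->1 and 10->1, which is
what the DFA state sets listed for that example imply; and minimize splits
every group for a character in one pass rather than one group at a time.
A missing transition counts as its own kind of successor, as in Example 1.

```python
LAMBDA = None  # assumed label for lambda (empty-string) moves


def closure(nfa, states):
    """A(S): S plus every state reachable by lambda moves only."""
    result, work = set(states), list(states)
    while work:
        t = work.pop()
        for u in nfa.get((t, LAMBDA), ()):
            if u not in result:
                result.add(u)
                work.append(u)
    return frozenset(result)


def subset_construction(nfa, start, finals, alphabet):
    """Return (transitions, start state, final states, all states) of the DFA."""
    dfa_start = closure(nfa, {start})
    dfa, seen, work = {}, {dfa_start}, [dfa_start]
    while work:
        S = work.pop()
        for c in alphabet:
            moved = {u for t in S for u in nfa.get((t, c), ())}
            if not moved:
                continue  # no successor under c: leave the transition undefined
            T = closure(nfa, moved)  # this is s(S, c)
            dfa[(S, c)] = T
            if T not in seen:
                seen.add(T)
                work.append(T)
    dfa_finals = {S for S in seen if S & finals}
    return dfa, dfa_start, dfa_finals, seen


def group_of(groups, t):
    """Index of the group holding t; None when t is None (no successor)."""
    return next((i for i, g in enumerate(groups) if t in g), None)


def minimize(states, dfa, finals, alphabet):
    """Split groups until all members of a group agree on successor groups."""
    groups = [g for g in (set(finals), set(states) - set(finals)) if g]
    changed = True
    while changed:
        changed = False
        for c in alphabet:
            new_groups = []
            for g in groups:
                parts = {}  # successor group index -> members of g
                for s in g:
                    key = group_of(groups, dfa.get((s, c)))
                    parts.setdefault(key, set()).add(s)
                new_groups.extend(parts.values())
            if len(new_groups) != len(groups):
                groups, changed = new_groups, True
    return groups


# (aa|aaaaa)+ with the assumed lambda moves 1->2, 1->8, 7->1, 10->1
nfa = {(1, LAMBDA): {2, 8}, (7, LAMBDA): {1}, (10, LAMBDA): {1},
       (8, "a"): {9}, (9, "a"): {10}}
for s in range(2, 7):
    nfa[(s, "a")] = {s + 1}

dfa, dfa_start, dfa_finals, dfa_states = subset_construction(nfa, 1, {7, 10}, "a")
groups = minimize(dfa_states, dfa, dfa_finals, "a")
print(len(dfa_states), len(groups))  # 10 5

# Example 1 from the notes, written directly as a DFA
ex1 = {(1, "a"): 2, (2, "b"): 3, (1, "c"): 4, (4, "d"): 5, (5, "b"): 6}
ex1_groups = minimize(range(1, 7), ex1, {3, 6}, "abcd")
print(sorted(sorted(g) for g in ex1_groups))  # [[1], [2, 5], [3, 6], [4]]
```

On these inputs the sketch gives the 10 DFA states and 5 merged groups of
Example 2, and the groups [1], [4], [2,5], [3,6] of Example 1.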