Recall that the job of the scanner is to translate the sequence of
characters that is the input to the compiler to a corresponding sequence
of tokens.
In particular, each time the scanner is called it should find the
longest sequence of characters in the input, starting with
the current character, that corresponds to a token, and should return
that token.
It is possible to write a scanner from scratch, but a more efficient and
less error-prone approach is to use a scanner generator like
lex or flex (which produce C code), or JLex (which produces Java code).
The input to a scanner generator includes one regular expression
for each token (and for each construct that must be recognized and ignored,
such as whitespace and comments).
Therefore, to use a scanner generator you need to understand regular
expressions.
To understand how the scanner generator produces code that correctly
recognizes the strings defined by the regular expressions, you need to
understand finite-state machines (FSMs).
A finite-state machine is similar to a compiler in that:
Here's an example of a finite-state machine that recognizes Pascal identifiers
(sequences of one or more letters or digits, starting with a letter):
An input string is accepted by a FSM if:
The following strings are in the language of the FSM shown above:
x, tmp2, XyZzy, position27.
The following strings are not in the language of the FSM shown above:
123, a?, 13apples.
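The membership claims above can be checked mechanically. What follows is a minimal sketch, not part of the original notes: it hard-codes the two states of the Pascal-identifier FSM (S and A) as string values, and the function name accepts_pascal_identifier is made up for illustration. Restricting "letter" and "digit" to ASCII is also an assumption.

```python
def accepts_pascal_identifier(text: str) -> bool:
    """Run the two-state identifier FSM (states S and A) on text."""
    state = "S"  # S is the start state; A is the only final state
    for ch in text:
        # Assumption of this sketch: only ASCII letters and digits count.
        is_letter = ch.isascii() and ch.isalpha()
        is_digit = ch.isascii() and ch.isdigit()
        if state == "S" and is_letter:
            state = "A"  # first character must be a letter
        elif state == "A" and (is_letter or is_digit):
            state = "A"  # self-loop on letters and digits
        else:
            return False  # no edge for this character: the machine is stuck
    return state == "A"  # accept only if we stopped in the final state


for s in ["x", "tmp2", "XyZzy", "position27", "123", "a?", "13apples"]:
    print(s, accepts_pascal_identifier(s))
```

The first four strings are reported as accepted and the last three as rejected, which agrees with the lists above.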
Question 1:
Write a finite-state machine that accepts Java identifiers (one or more
letters, digits, or underscores, starting with a letter or an underscore).
Question 2:
Write a finite-state machine that accepts only Java identifiers that do
not end with an underscore.
The following is a finite-state machine that accepts integer literals
with an optional + or - sign:
A FSM is a 5-tuple: (Q, Σ, δ, q, F).
There are two kinds of FSM:
Example
Here is a non-deterministic finite-state machine that recognizes the
same language as the second example deterministic FSM above
(the language of integer literals with an optional sign):
Sometimes, non-deterministic machines are simpler
than deterministic ones, though not in this example.
A string is accepted by a non-deterministic finite-state machine
if there exists a sequence of moves starting in
the start state, ending in a final state, that consumes the entire string.
For example, here's what happens when the above machine is run on the
input "+75":
It is worth noting that there is a theorem that says:
The most straightforward way to program a (deterministic)
finite-state machine is to use a table-driven approach.
This approach uses a table with one row for each state in the machine,
and one column for each possible character.
Table[j][k] tells which state to go to from state j on character k.
(An empty entry corresponds to the machine getting stuck, which means
that the input should be rejected.)
Here's the table for the (deterministic) "integer literal"
FSM given above:
The table-driven program for a FSM works as follows:
Regular expressions provide a compact way to define a language that
can be accepted by a finite-state machine.
Regular expressions are used in the input to a scanner generator to
define each token, and to define things like whitespace and comments
that do not correspond to tokens, but must be recognized and ignored.
As an example, recall that a Pascal identifier consists of a letter,
followed by zero or more letters or digits.
The regular expression for the language of Pascal identifiers is:
Often, the "followed by" dot is omitted, and just writing two things
next to each other means that one follows the other.
For example:
In fact, the operands of a regular expression should be single characters
or the special character epsilon, meaning the empty string
(just as the labels on the edges of a FSM should be single characters
or epsilon).
In the above example, "letter" is used as a shorthand for:
To understand a regular expression, it is necessary to know the precedences
of the three operators.
They can be understood by analogy with the arithmetic operators
for addition, multiplication, and exponentiation:
So, for example, the regular expression:
Describe (in English) the language defined by each of the following
regular expressions:
An integer literal with an optional sign can be defined in English as:
Every regular expression defines a language: the set of strings that
match the expression.
We will not give a formal definition here; instead, we'll give some
examples:
There is a theorem that says that for every regular expression, there
is a finite-state machine that defines the same language, and vice versa.
This is relevant to scanning because it is usually easy to define the
tokens of a language using regular expressions, and then those regular
expressions can be converted to finite-state machines (which can actually
be programmed).
For example, let's consider a very simple language: the language of
assignment statements in which the left-hand side is a Pascal identifier
(a letter followed by zero or more letters or digits),
and the right-hand side is one of the following:
These regular expressions can be converted into the following finite-state
machines:
Given a FSM for each token, how do we write a scanner?
Recall that the goal of a scanner is to find the longest
prefix of the current input that corresponds to a token.
This has two consequences:
Furthermore, remember that regular expressions are used both to define tokens
and to define things that must be recognized and skipped (like whitespace
and comments).
In the first case a value (the current token) must be returned when the
regular expression is matched, but in the second case the scanner should
simply start up again trying to match another regular expression.
With all this in mind, to create a scanner from a set of FSMs,
we must:
For example, the FSM that recognizes Pascal identifiers must be
modified as follows:
And here is the combined FSM for the five tokens (with the actions
noted below):
We can convert this FSM to code using the table-driven technique
described above, with a few small modifications:
Augment the "combined" finite-state machine to:
Overview
Finite-State Machines
In both cases, the input (the program or the string) is a sequence of characters.
Example: Pascal Identifiers
                         --------
                         |      |
                         v      | letter,digit
    -----   letter    =======   |
    | S | ----------> || A || ---
    -----             =======
In this picture:
A FSM is applied to an input (a sequence of characters).
It either accepts or rejects that input.
Here's how the FSM works:
The language defined by a FSM is the set of strings accepted
by the FSM.
Example: Integer Literals
                               +-----+
                               |     | digit
                               v     |
              digit         ======= -+
      +-------------------> || B ||
      |                     =======
      |                        ^
      |                        | digit
      |         +              |
    -----  --------------->  -----
    | S |                    | A |
    -----  --------------->  -----
                -
Formal Definition
              +     -     digit
          _________________________
          |
       S  |   A     A     B
          |
       A  |               B
          |
       B  |               B
The entry in row R and column C tells you what state to go to when
you are currently in state R and the current input character is C.
For example, if you are in the start state (state S) and the current
character is a plus-sign, then you should go to state A.
Deterministic and Non-Deterministic FSMs
                               +-----+
                               |     | digit
                               v     |
                            ======= -+
                            || B ||
                            =======
                               ^
                               | digit
                +              |
    -----  --------------->  -----
    | S |                    | A |
    -----  --------------->  -----
      |         -              ^
      |                        |
      +------------------------+
               epsilon
    After scanning    Can be in these states
    --------------    ----------------------
    (nothing)         S       A
    +                 A       -stuck-
    +7                B       -stuck-
    +75               B       -stuck-
There is one sequence of moves that consumes the entire input
and ends in a final state (state B), so this input is accepted
by this machine.
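One way to carry out this kind of run in code is to track the whole set of states the machine could be in after each character. The sketch below is illustrative only: the edge dictionary NFA_EDGES encodes the non-deterministic integer-literal machine drawn above, the empty string is used as the label for epsilon edges (a convention chosen here), and all names are hypothetical.

```python
EPSILON = ""  # label used in this sketch for epsilon edges

# (state, label) -> set of next states, for the machine with states S, A, B
NFA_EDGES = {
    ("S", "+"): {"A"},
    ("S", "-"): {"A"},
    ("S", EPSILON): {"A"},
    ("A", "digit"): {"B"},
    ("B", "digit"): {"B"},
}
NFA_START = "S"
NFA_FINAL = {"B"}


def label_of(ch):
    """Map an input character to the edge label it can match."""
    return "digit" if ch in "0123456789" else ch


def epsilon_closure(states):
    """Add every state reachable from `states` using only epsilon edges."""
    closure = set(states)
    worklist = list(states)
    while worklist:
        state = worklist.pop()
        for nxt in NFA_EDGES.get((state, EPSILON), set()):
            if nxt not in closure:
                closure.add(nxt)
                worklist.append(nxt)
    return closure


def nfa_accepts(text):
    """Accept if some sequence of moves consumes text and ends in a final state."""
    current = epsilon_closure({NFA_START})
    for ch in text:
        moved = set()
        for state in current:
            moved |= NFA_EDGES.get((state, label_of(ch)), set())
        current = epsilon_closure(moved)  # empty set: every branch is stuck
    return bool(current & NFA_FINAL)


print(nfa_accepts("+75"), nfa_accepts("+"))
```

The set named current plays the role of the "Can be in these states" column: a branch that gets stuck simply drops out of the set.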
For every non-deterministic finite-state machine M,
there exists a deterministic machine M' such that M and
M' accept the same language.
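These notes do not say how M' is built. One standard technique is the subset construction, in which each state of M' is a set of states of M. The following is a rough sketch of that technique applied to the non-deterministic integer-literal machine above; the names (NFA_EDGES, subset_construction, and so on) and the use of the empty string for epsilon are assumptions of the sketch.

```python
EPSILON = ""  # label used in this sketch for epsilon edges
LABELS = ["+", "-", "digit"]  # input alphabet, with digits grouped together

NFA_EDGES = {
    ("S", "+"): {"A"},
    ("S", "-"): {"A"},
    ("S", EPSILON): {"A"},
    ("A", "digit"): {"B"},
    ("B", "digit"): {"B"},
}


def closure(states):
    """Add every state reachable using only epsilon edges."""
    result = set(states)
    worklist = list(states)
    while worklist:
        state = worklist.pop()
        for nxt in NFA_EDGES.get((state, EPSILON), set()):
            if nxt not in result:
                result.add(nxt)
                worklist.append(nxt)
    return result


def subset_construction(start, finals):
    """Build a deterministic table whose states are sets of NFA states."""
    start_set = frozenset(closure({start}))
    dfa_table = {}
    worklist = [start_set]
    while worklist:
        current = worklist.pop()
        if current in dfa_table:
            continue  # this set of states was already expanded
        dfa_table[current] = {}
        for label in LABELS:
            moved = set()
            for state in current:
                moved |= NFA_EDGES.get((state, label), set())
            target = frozenset(closure(moved))
            if target:  # an empty set means "stuck", so leave the entry empty
                dfa_table[current][label] = target
                worklist.append(target)
    dfa_finals = {s for s in dfa_table if s & finals}
    return start_set, dfa_table, dfa_finals


start_set, dfa_table, dfa_finals = subset_construction("S", {"B"})
for state, row in dfa_table.items():
    print(sorted(state), {k: sorted(v) for k, v in row.items()})
```

For this machine the construction yields three deterministic states, {S, A}, {A} and {B}, with the same transitions as the deterministic integer-literal machine shown earlier.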
How to Implement a FSM
                        character
                    +     -     digit
                  +---------------------
                S |  A     A     B
                  |
        state   A |              B
                  |
                B |              B
until the machine gets stuck (the table entry is empty) or the
entire input is read. If the machine gets stuck, reject the input.
Otherwise, if the current state is a final state,
accept the input; otherwise, reject it.
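The table-driven approach can be sketched in a few lines. This is an illustration rather than a definitive implementation: the table is stored as a dictionary of dictionaries (a missing key plays the role of an empty entry), all digits share one column as in the table above, and the names TABLE, column_for and dfa_accepts are invented for the example.

```python
# One row per state; a missing column means the entry is empty (stuck).
TABLE = {
    "S": {"+": "A", "-": "A", "digit": "B"},
    "A": {"digit": "B"},
    "B": {"digit": "B"},
}
START = "S"
FINAL = {"B"}


def column_for(ch):
    """Pick the table column for a character (all digits share one column)."""
    return "digit" if ch.isascii() and ch.isdigit() else ch


def dfa_accepts(text):
    state = START
    for ch in text:
        state = TABLE[state].get(column_for(ch))
        if state is None:
            return False  # empty table entry: the machine is stuck, so reject
    # The entire input was read; accept only in a final state.
    return state in FINAL


for s in ["+75", "-0", "42", "+", "7+", ""]:
    print(repr(s), dfa_accepts(s))
```

Only the table changes from one machine to another; the loop in dfa_accepts stays the same, which is what makes this approach attractive for generated scanners.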
Regular Expressions
letter . (letter | digit)*
The following table explains the symbols used in this regular expression:
    |       means "or"
    .       means "followed by"
    *       means zero or more instances of
    ( )     are used for grouping
letter (letter | digit)*
a | b | c | ... | z | A | ... | Z
and similarly for "digit".
Also, we sometimes put the characters in quotes (this is necessary
if you want to use a vertical bar, a dot, or a star character).
    Regular Expression Operator    Analogous Arithmetic Operator    Precedence
    ---------------------------    -----------------------------    ------------------
    |                              plus                             lowest precedence
    .                              times                            middle
    *                              exponentiation                   highest precedence
letter.letter | digit*
does not define the same language as the expression given above.
Since the dot operator has higher precedence than the | operator (and the
* operator has the highest precedence of all),
this expression is the same as:
(letter.letter) | (digit*)
and it means "either two letters, or zero or more digits".
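The effect of precedence can be observed with Python's re module, which uses the same precedence for alternation, concatenation and star (concatenation is written by juxtaposition there). In this sketch the character classes [A-Za-z] and [0-9] stand in for "letter" and "digit"; that choice and the variable names are assumptions made for the example.

```python
import re

LETTER = "[A-Za-z]"  # stand-in for "letter"
DIGIT = "[0-9]"      # stand-in for "digit"

# letter.(letter | digit)*
identifier = re.compile(f"{LETTER}({LETTER}|{DIGIT})*")
# letter.letter | digit*   (no parentheses, so | applies last)
ungrouped = re.compile(f"{LETTER}{LETTER}|{DIGIT}*")

for s in ["ab", "tmp2", "123", "x", ""]:
    print(repr(s),
          bool(identifier.fullmatch(s)),
          bool(ungrouped.fullmatch(s)))
```

Both expressions match "ab", but only the first matches "tmp2" and "x", and only the second matches "123" and the empty string.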
Example: Integer Literals
(nothing or + or -) followed by one or more digits
The corresponding regular expression is:
(+|-|epsilon).(digit.digit*)
Note that the regular expression for "one or more digits" is:
digit.digit*
i.e., "one digit followed by zero or more digits".
Since "one or more" is a common pattern, another operator, +, has
been defined to mean "one or more".
For example,
digit+
means "one or more digits", so another way to define integer literals with
optional sign is:
(+|-|epsilon).digit+
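As a quick check that the two definitions agree, both can be written for Python's re module. Two details of this sketch are assumptions about that notation rather than part of the notes: the plus sign must be escaped when it is meant as a literal character, and epsilon is written as an empty alternative.

```python
import re

# (+|-|epsilon).(digit.digit*)
with_star = re.compile(r"(\+|-|)[0-9][0-9]*")
# (+|-|epsilon).digit+
with_plus = re.compile(r"(\+|-|)[0-9]+")

samples = ["+75", "-0", "42", "", "+", "7+", "--1", "3.5"]
for s in samples:
    a = bool(with_star.fullmatch(s))
    b = bool(with_plus.fullmatch(s))
    print(repr(s), a, b)
```

On every sample the two patterns give the same answer, as expected if digit+ is just shorthand for digit.digit*.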
The Language Defined by a Regular Expression
    Regular Expression    Corresponding Set of Strings
    ------------------    ----------------------------
    epsilon               {""}
    a                     {"a"}
    a.b.c                 {"abc"}
    a | b | c             {"a", "b", "c"}
    (a | b | c)*          {"", "a", "b", "c", "aa", "ab", ..., "bccabb", ...}
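For small alphabets, the set of strings in such a table can be listed by brute force: generate every string up to some length and keep the ones that match. The helper below, language_up_to, is a hypothetical name for this sketch. It uses Python's re syntax, where "followed by" is written by juxtaposition and epsilon is the empty pattern.

```python
import re
from itertools import product


def language_up_to(pattern, alphabet, max_len):
    """List the strings over `alphabet`, up to max_len long, that match pattern."""
    regex = re.compile(pattern)
    found = []
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if regex.fullmatch(s):
                found.append(s)
    return found


print(language_up_to("", "abc", 3))          # epsilon
print(language_up_to("a", "abc", 3))         # a
print(language_up_to("abc", "abc", 3))       # a.b.c
print(language_up_to("a|b|c", "abc", 3))     # a | b | c
print(language_up_to("(a|b|c)*", "abc", 2))  # (a | b | c)*, cut off at length 2
```

The last language is infinite, so the listing has to stop at some length; with a limit of 2 it contains the empty string, the 3 one-character strings and the 9 two-character strings.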
Using Regular Expressions and Finite-State Machines to Define a Scanner
This language has five tokens, which can be defined by the following
five regular expressions:
    Token     Regular Expression
    -----     ------------------
    ASSIGN    "="
    ID        letter (letter | digit)*
    PLUS      "+"
    TIMES     "*"
    EQUALS    "=="
              -----     "="      =======
    ASSIGN:   | S | ----------> ||     ||
              -----              =======

                                    ---------
                                    |       |
                                    v       | letter,digit
              -----    letter    =======    |
    ID:       | S | ----------> ||     || ---
              -----              =======

              -----     "+"      =======
    PLUS:     | S | ----------> ||     ||
              -----              =======

              -----     "*"      =======
    TIMES:    | S | ----------> ||     ||
              -----              =======

              -----   "="     -----   "="     =======
    EQUALS:   | S | -------> |     | -------> ||     ||
              -----           -----           =======
                letter, digit
                    +---+
                    |   |
                    v   |                       ======================
    .---.  letter  .---.                        || action: put back ||
    | S |--------->|   |----------------------->|| 1 char;          ||
    `---'          `---'    any char except     || return ID        ||
                            letter or digit     ======================
                         ========
                         || F2 ||
                         ========         letter, digit
                             ^                +--+
                         "+" |                |  |
                             |                v  |
    ========     "*"       .---.            .---.                     ========
    || F4 ||<--------------| S |----------->| A |-------------------->|| F3 ||
    ========               `---'   letter   `---'   any char except   ========
                             |                      letter or digit
                         "=" |
                             v
                           .---.    "="    ========
                           | B |---------->|| F5 ||
                           `---'           ========
                             |
                    any char |
                    except = |
                             v
                         ========
                         || F1 ||
                         ========
ACTIONS
-------
F1: put back 1 char; return ASSIGN
F2: return PLUS
F3: put back 1 char; return ID
F4: return TIMES
F5: return EQUALS
Here's the table for the above "combined" FSM:
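         | +                | *                | =                | letter           | digit            | EOF
    -----+------------------+------------------+------------------+------------------+------------------+------------------
      S  | F2,              | F4,              | B                | A                |                  |
         | return PLUS      | return TIMES     |                  |                  |                  |
    -----+------------------+------------------+------------------+------------------+------------------+------------------
      A  | F3,              | F3,              | F3,              | A                | A                | F3,
         | put back 1 char; | put back 1 char; | put back 1 char; |                  |                  | put back 1 char;
         | return ID        | return ID        | return ID        |                  |                  | return ID
    -----+------------------+------------------+------------------+------------------+------------------+------------------
      B  | F1,              | F1,              | F5,              | F1,              | F1,              | F1,
         | put back 1 char; | put back 1 char; | return EQUALS    | put back 1 char; | put back 1 char; | put back 1 char;
         | return ASSIGN    | return ASSIGN    |                  | return ASSIGN    | return ASSIGN    | return ASSIGN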
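To make the roles of the table and the "put back" actions concrete, here is a small sketch of a scanner driven by the combined FSM. It is one possible coding under stated assumptions, not the scanner a generator would emit: final-state entries are stored as (characters to put back, token name) pairs, rows A and B get a default entry for "any other character", blanks between tokens are skipped, and every name (TABLE, DEFAULT, scan, and so on) is invented for the example.

```python
EOF = "EOF"  # marker for end of input; never equal to a single character

# Entries are either the name of the next state or a final action,
# written as (number of characters to put back, token to return).
TABLE = {
    "S": {"+": (0, "PLUS"), "*": (0, "TIMES"), "=": "B", "letter": "A"},
    "A": {"letter": "A", "digit": "A"},
    "B": {"=": (0, "EQUALS")},
}
# "Any other character" entries: end the token and put that character back.
DEFAULT = {"A": (1, "ID"), "B": (1, "ASSIGN")}


def column_for(ch):
    if ch == EOF:
        return EOF
    if ch.isascii() and ch.isalpha():
        return "letter"
    if ch.isascii() and ch.isdigit():
        return "digit"
    return ch


def scan(text):
    """Return (token, lexeme) pairs, always taking the longest match."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos] == " ":  # assumption: blanks separate tokens and are ignored
            pos += 1
            continue
        start, state = pos, "S"
        while True:
            ch = text[pos] if pos < len(text) else EOF
            entry = TABLE[state].get(column_for(ch), DEFAULT.get(state))
            pos += 1  # consume ch (EOF is counted too, so put-back is uniform)
            if entry is None:
                raise ValueError(f"unexpected {ch!r} at position {pos - 1}")
            if isinstance(entry, str):
                state = entry  # ordinary transition: keep reading
                continue
            put_back, token = entry
            pos -= put_back  # un-read the lookahead character, if any
            tokens.append((token, text[start:pos]))
            break
    return tokens


print(scan("x1 = y + z2"))
print(scan("a==b*c"))
```

Because state B waits to see whether a second "=" follows, the input "a==b*c" produces a single EQUALS token rather than two ASSIGN tokens, which is the longest-match behavior described above.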