JLex is a scanner generator that produces Java code.
Here's a picture illustrating how to create and run a program using JLex:
The picture above assumes that a class named Main has been defined
that contains the main program of interest.
That program will declare an object of type Yylex, and will include
calls to the Yylex constructor and its next_token method.
A JLex specification has three parts, separated by double percent signs:
The state-list part of a rule will be discussed later.
Ignoring state-lists for now, the form of a regular expression rule is:
When the scanner's next_token method is called, it repeats:
If an input character is not matched in any pattern, the scanner throws
an exception (so it is important to make sure that there can be no such
unmatched characters, since it is not good to have a scanner that can
"crash" on bad input).
The regular expressions are similar to the ones discussed in the
scanner notes.
Here's how they are used to match the input:
Note that whitespace only matches itself if it is inside quotes or in
a character class;
otherwise, it ends the current pattern.
So the two rules:
Question 1:
The character class [a-zA-Z] matches any letter.
Write a character class that matches any letter or any digit.
Question 2:
Write a pattern that matches any Pascal identifier (a sequence of
one or more letters and/or digits, starting with a letter).
Question 3:
Write a pattern that matches any Java identifier (a sequence of
one or more letters and/or digits and/or underscores, starting with
a letter or underscore).
Question 4:
Write a pattern that matches any Java identifier that does not
end with an underscore.
Recall that the second part of a JLex specification contains directives.
This can include specifying the value that should be returned on end-of-file,
specifying that line counting should be turned on, and specifying that
the scanner will be used with the Java parser generator java cup.
(See the
JLex Reference Manual for more information about directives.)
The directives part also includes macro definitions.
The form of a macro definition is:
Once a macro has been defined, it can be used in a regular expression
(either to define another macro, or in the "Regular Expression Rules"
part of the JLex specification.
To use a macro, just use its name inside curly braces.
For example, given the above macro definitions, the following pattern
could be used to match Pascal identifiers:
Define a macro named NOTSPECIAL that matches any character except a
newline, double quote, or backslash.
You can include comments in the first and second parts of your JLex
specification, but not in the third part (because JLex will think
they are part of a pattern).
JLex comments are like Java comments: they start with two slashes,
and continue to the end of the line.
Recall that each regular expression rule (a pattern and the action
to be performed when the pattern is matched) can optionally include
a list of states at the beginning of the pattern.
For example:
Each time the scanner is called, it is in some state.
Initially, it is in a special state called YYINITIAL.
It will stay in that state unless it matches a pattern whose
corresponding action includes code that causes
it to change to another state.
For example, given the rule:
If a rule has no list of states, then it will be matched
in any state;
however, if it has a s\list of states, then it will be matched only
when the scanner is in one of those states.
So for example, the rule for "abc" given above will only
be matched after the rule for "xyz".
Every state other than YYINITIAL must be declared in the
JLex directives part of the JLex specification.
Here's an example declaration:
Suppose that for floating-point numbers you want your scanner to return
two values: the value before the decimal point, and the value after the
decimal point.
Here's an example of using a JLex state to do that (using some
pseudo-code):
A quoted string consists of three parts:
Use JLex states to write a specification to recognize quoted strings,
and to return the number of escaped quotes in each such string.
If you turn line counting on (by including %line in the
"directives" part of the specification),
you can use the variable yyline in the actions that
you write for the regular expressions.
The value of yyline will be the current line number in the
input file, counting from zero (so to use that number in error messages
printed by your scanner, you will need to add one to yyline).
You can also use the method yytext() in your actions.
This method returns a String -- the sequence of characters that
was just matched.
Here is a small (complete) JLex specification:
Note that the actions in this example are not what you would
really put in a JLex specification for a scanner.
Instead of printing, the first four actions should return the
appropriate tokens.
Overview
+-----------+
JLex specification ---> | JLex.Main | ---> Java source code
(xxx.jlex) +-----------+ (xxx.jlex.java)
+------------------+
xxx.jlex.java ---> | jikes (or javac) | ---> Yylex.class
+------------------+
+--------+
Yylex.class-------> | java | ---> output of Main
Main.class--------> | |
+--------+
The input to JLex is a specification that includes a set of regular
expressions and associated actions.
The output of JLex is a Java source file that defines a class named
Yylex.
Yylex includes a constructor that is called with one argument:
the input stream (an InputStream or a Reader).
It also includes a method called next_token, which returns the
next token in the input.
Format of a JLex Specification
We will discuss the regular expression rules part first.
Regular Expressions Rules
regular-expression { action }
^ ^
| |
the pattern to java code to be executed
be matched when the pattern is matched
until an action causes the next_token method to return.
If there are several patterns that match the same (longest)
sequence of characters, then the first such pattern is considered
to be matched (so the order of the patterns can be important).
abc
==
while
are three patterns that match exactly those sequences of characters
(note that writing one character after another means "followed by"
as usual).
"abc"
"=="
"while"
And the following pattern:
"a|b"
matches the three-character sequence: a then |
then b, rather than matching a single a or a
single b.
|
means "or"
*
means zero or more instances of
+
means one or more instances of
?
means zero or one instance of
( )
are used for grouping
So to match a backslash character, put it in quotes.
(See the
JLex Reference Manual for a complete list of the special characters
escaped by a backslash.)
\n newline
\t tab
\" double quote
- means a range of characters; e.g., a-z
means "a to z".
^ is only a special character if it is the
first character in the square brackets;
it means not any of the following
characters. So for example, [^abc]
matches any character other than an a, or a
b, or a c.
\ is used as an escape character with n,
b, etc as usual;
it can also be used to escape the characters that
are special inside a character class (e.g.,
[a\-z] matches an a or
a - or a z, and [\\]
matches a backslash.
" can be used around characters that are special inside
a character class to make them match themselves (e.g.,
["\"] matches a backslash, and
["-"] matches a hyphen.
[a bc] {}
a|" "|b|c {}
are equivalent;
each matches an a, or a space, or a b, or a c,
while the rule:
a bc {}
causes an error when you try to process your specification.
JLex directives
name = regular-expression
where name is any valid Java identifier, and
regular-expression is any regular expression as defined above.
Here are some examples:
DIGIT= [0-9]
LETTER= [a-zA-Z]
WHITESPACE= [ \t\n]
{LETTER}({LETTER}|{DIGIT})*
Comments
States
<STATE1, STATE2>"abc" { }
is a rule that starts with a list of two states (named STATE1
and STATE2).
"xyz" { yybegin(STATE1); }
if the input contains "xyz", then the call to yybegin
will be executed, and the scanner will enter the STATE1 state.
%state STATE1
%%
DIGIT= [0-9]
DOT= "."
%state DOTSTATE
%%
<YYINITIAL>{DIGIT}+{DOT} { yybegin(DOTSTATE);
-- save the value so far --
}
<DOTSTATE>{DIGIT}+ { yybegin(YYINITIAL);
-- return the saved value and the new one --
}
The text can contain any characters except a newline or a single
double-quote character.
It can contain an "escaped" quote, which is
two double-quote characters in a row.
yyline and yytext
A Small Example
%%
DIGIT= [0-9]
LETTER= [a-zA-Z]
WHITESPACE= [ \t\n] // space, tab, newline
// The next 3 lines are included so that we can use the generated scanner
// with java CUP (the Java parser generator)
%implements java_cup.runtime.Scanner
%function next_token
%type java_cup.runtime.Symbol
// Turn on line counting
%line
%%
{LETTER}({LETTER}|{DIGIT}*) {System.out.println(yyline+1 + ": ID " + yytext());}
{DIGIT}+ {System.out.println(yyline+1 + ": INT");}
"=" {System.out.println(yyline+1 + ": ASSIGN");}
"==" {System.out.println(yyline+1 + ": EQUALS");}
{WHITESPACE}* { }
. {System.out.println(yyline+1 + ": bad char");}
Symbol | Meaning in Regular Expressions | Meaning in Character Classes |
( | Matches with ) to group sub-expressions. | Represents itself. |
) | Matches with ( to group sub-expressions. | Represents itself. |
[ | Begins a character class. | Represents itself. |
] | Is illegal. | Ends a character class. |
{ | Matches with } to delimit a macro name. | Matches with } to delimit a macro name. |
} | Matches with { to delimit a macro name. | Represents itself or matches with { to delimit a macro name. |
" | Matches with " to delimit strings (only \ is special within strings). | Matches with " to delimit a string of characters that belong to the character class. Only \" is special within the string. |
\ | Escapes special characters (n, t, etc). Also used to specify a character by its octal, hexadecimal, or unicode value. | Escapes characters that are special inside a character class. |
. | Matches any one character except newline. | Represents itself. |
| | Alternation (or) operator. | Represents itself. |
* | Kleene closure operator (zero or more matches). | Represents itself. |
+ | Positive closure operator (one or more matches). | Represents itself. |
? | Optional choice operator (zero or one matches). | Represents itself. |
^ | Matches only at beginning of a line. | When it is the first character in the character class, complements the remaining characters in the class. |
$ | Matches only at end of a line. | Represents itself. |
- | Represents itself. | Range of characters operator. |