## Overview

JLex is a scanner generator that produces Java code. Here's a picture illustrating how to create and run a program using JLex:

The input to JLex is a specification that includes a set of regular expressions and associated actions. The output of JLex is a Java source file that defines a class named Yylex. Yylex includes a constructor that is called with one argument: the input stream (an InputStream or a Reader). It also includes a method called next_token, which returns the next token in the input.

The picture above assumes that a class named P2 has been defined that contains the core program of interest. That program will declare an object of type Yylex, and will include calls to the Yylex constructor and its next_token method.
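
The calling pattern described above can be sketched as follows. The generated Yylex class exists only after JLex has been run, so this sketch substitutes a hand-written stand-in (the hypothetical `StandInYylex`) that imitates just its shape: a one-argument constructor and a next_token method. The class name `P2Sketch`, the helper `scanAll`, and the use of String tokens with null at end of input are all assumptions made for illustration; a real generated scanner returns whatever token type its specification defines.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class P2Sketch {
    // Hypothetical stand-in for the JLex-generated Yylex class. It only mimics
    // the shape described in the text: a constructor taking the input stream
    // and a next_token method.
    static class StandInYylex {
        private final Reader in;

        StandInYylex(Reader in) {
            this.in = in;
        }

        // Returns the next whitespace-delimited word, or null at end of input.
        String next_token() throws IOException {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = in.read()) != -1) {
                if (Character.isWhitespace(c)) {
                    if (sb.length() > 0) {
                        break;              // whitespace ends the current word
                    }
                } else {
                    sb.append((char) c);
                }
            }
            return sb.length() > 0 ? sb.toString() : null;
        }
    }

    // The driver pattern: construct the scanner once, then call next_token
    // repeatedly until the input is exhausted.
    static List<String> scanAll(Reader input) {
        StandInYylex scanner = new StandInYylex(input);
        List<String> tokens = new ArrayList<>();
        try {
            for (String t = scanner.next_token(); t != null; t = scanner.next_token()) {
                tokens.add(t);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(scanAll(new StringReader("x = y + 1")));  // [x, =, y, +, 1]
    }
}
```

In an actual program, `StandInYylex` would be replaced by the Yylex class that JLex generates, and the loop would stop on whatever end-of-input token the specification's actions return.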

## Format of a JLex Specification

A JLex specification has three parts, separated by double percent signs:

1. User code: this part of the specification will not be discussed here.
2. JLex directives: these configure the generated scanner and include, for example, macro definitions.
3. Regular expression rules: These rules specify how to divide up the input into tokens. Each rule includes an optional state list, a regular expression, and an associated action.

We will discuss the regular expression rules part first.

## Regular Expression Rules

The state-list part of a rule is discussed below. Ignoring state-lists for now, a regular expression rule consists of a regular expression followed by its associated action, which is Java code enclosed in curly braces.

When the scanner's next_token method is called, it repeats:

1. Find the longest sequence of characters in the input (starting with the current character) that matches a regular-expression pattern.
2. Perform the associated action.

until an action causes the next_token method to return. If there are several patterns that match the same (longest) sequence of characters, then the first such pattern is considered to be matched (so the order of the regular-expression rules can be important).

If an input character is not matched by any pattern, the scanner throws an exception. A scanner that can "crash" on bad input is undesirable, so it is important to make sure that every possible input character is matched by some pattern.
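
The matching loop and the failure case can be imitated in plain Java. The sketch below is not the code JLex generates (JLex builds a finite automaton); it uses `java.util.regex` and tries every prefix length, which is slow but makes the "longest match, earliest rule on ties" policy explicit. The names `LongestMatchSketch`, `Rule`, `RULES`, and `scan` are hypothetical, a rule with a null name stands for an action that does not return, and the sketch collects all tokens into a list instead of returning them one at a time.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LongestMatchSketch {
    // A rule pairs a pattern with a token name. A null name models an action
    // that does not return (for example, skipping whitespace).
    static final class Rule {
        final String name;
        final Pattern pattern;

        Rule(String name, String regex) {
            this.name = name;
            this.pattern = Pattern.compile(regex);
        }
    }

    // Length of the longest prefix of input starting at pos that the rule
    // matches in full, or 0 if there is none.
    static int longestPrefix(Rule rule, String input, int pos) {
        Matcher m = rule.pattern.matcher(input);
        for (int end = input.length(); end > pos; end--) {
            m.region(pos, end);
            if (m.matches()) {
                return end - pos;
            }
        }
        return 0;
    }

    static List<String> scan(List<Rule> rules, String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            Rule best = null;
            int bestLen = 0;
            for (Rule r : rules) {
                int len = longestPrefix(r, input, pos);
                if (len > bestLen) {        // strict '>' keeps the earlier rule on ties
                    best = r;
                    bestLen = len;
                }
            }
            if (best == null) {
                // Mirrors the failure case: no pattern matches the current character.
                throw new IllegalStateException("no rule matches at position " + pos);
            }
            if (best.name != null) {
                tokens.add(best.name + ":" + input.substring(pos, pos + bestLen));
            }
            pos += bestLen;
        }
        return tokens;
    }

    // Rule order matters: WHILE precedes ID, so "while" is a keyword.
    static final List<Rule> RULES = Arrays.asList(
        new Rule("WHILE", "while"),
        new Rule("ID", "[a-z]+"),
        new Rule("EQ", "=="),
        new Rule("ASSIGN", "="),
        new Rule(null, " +"));

    public static void main(String[] args) {
        // Prints [WHILE:while, ID:whilex, EQ:==, ASSIGN:=]
        System.out.println(scan(RULES, "while whilex == ="));
    }
}
```

Both WHILE and ID match the five characters of `while`, and the earlier rule wins; for `whilex` the ID rule matches six characters and so beats WHILE. An input such as `a#` makes `scan` throw, because no rule in `RULES` matches `#`.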

The regular expressions are similar to the ones discussed in the scanner notes. Here's how they are used to match the input:

• Most characters match themselves. For example, the following are three patterns that match exactly those sequences of characters (note that writing one character after another means "followed by", as usual):
  • abc
  • ==
  • while

• Characters (even special characters, except backslash) enclosed in double quotes match themselves. For example, the following patterns are equivalent to the three given above:
  • "abc"
  • "=="
  • "while"

  And the following pattern:

       "a|b"

  matches the three-character sequence a, then |, then b, rather than matching a single a or a single b.

• The following characters have the usual special meanings as regular expression operators:
  • | means "or"
  • * means zero or more instances of
  • + means one or more instances of
  • ? means zero or one instance of
  • ( ) are used for grouping

• The dot character matches any character except the newline character. It is usually used in the last rule in the specification, to match all "bad" characters (and the associated action issues an error message).

• The backslash is a special escape character:
  • \n newline
  • \t tab
  • \" double quote

  To match a backslash character, put two backslashes in a character class (see below). See the JLex Reference Manual for a complete list of the special characters escaped by a backslash.
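
The matching rules in the list above can be tried out with `java.util.regex`, whose operators behave the same way for these simple cases. This is an analogy, not JLex itself: Java has no double-quote syntax inside patterns, so `Pattern.quote` stands in for JLex's quoting, and the `UNIX_LINES` flag is used so that the dot excludes only the newline character. The class name `PatternSemanticsSketch` and the helper `matches` are hypothetical.

```java
import java.util.regex.Pattern;

public class PatternSemanticsSketch {
    // True if the whole input is matched by the java.util.regex pattern.
    // UNIX_LINES makes '\n' the only line terminator, so '.' excludes only newline.
    static boolean matches(String regex, String input) {
        return Pattern.compile(regex, Pattern.UNIX_LINES).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Unquoted, '|' is the "or" operator.
        System.out.println(matches("a|b", "a"));                   // true
        // Quoted, the same three characters match only themselves.
        System.out.println(matches(Pattern.quote("a|b"), "a|b"));  // true
        System.out.println(matches(Pattern.quote("a|b"), "a"));    // false

        // Grouping with *, + and ?.
        System.out.println(matches("(ab)*", ""));                  // true
        System.out.println(matches("(ab)+", ""));                  // false
        System.out.println(matches("(ab)+", "abab"));              // true
        System.out.println(matches("ab?", "a"));                   // true

        // The dot matches any single character except newline.
        System.out.println(matches(".", "#"));                     // true
        System.out.println(matches(".", "\n"));                    // false

        // Backslash escapes; the Java string "\\t" is the two-character regex \t.
        System.out.println(matches("\\t", "\t"));                  // true
        // A backslash inside a character class, written [\\] as a regex.
        System.out.println(matches("[\\\\]", "\\"));               // true
    }
}
```

Note that the extra backslashes in the last two calls come from Java string-literal escaping, which sits on top of the regular-expression escaping; a JLex specification is not a Java string and does not need that second layer.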