CS 536 Program 2: The Scanner
Due date: Thursday, October 14 (by midnight)
Not accepted after midnight on Sunday, October 17
Overview
For this assignment you will use JLex to write a scanner
for a language called C--, a small subset of the C++ language.
Features of C-- that are relevant to this assignment
are described below.
You will also write a main program and an input file to test your scanner.
You will be graded both on the correctness of your scanner
and on how thoroughly your input file tests the scanner.
Requirements
Getting Started
Skeleton files on which you should build are in:
~cs536-1/public/prog2
The files are:
- c.jlex:
An example JLex specification.
- sym.java:
Token definitions (this file will eventually be
generated by the parser generator).
- Errors.java:
The Errors class will be used to print error and warning messages.
- P2.java:
Contains the main program that tests the scanner.
- Makefile: A Makefile that
uses JLex to create a scanner, and also makes P2.class.
- test.C:
Input for the current version of P2.java.
JLex
Use the on-line
JLex reference manual, and/or the on-line
JLex notes for information about writing a JLex specification.
To run JLex you'll need to modify your CLASSPATH environment variable.
Taking into account the change needed to use jikes, your .cshrc.local
file should include the line:
setenv CLASSPATH ".:/s/java/jre/lib/rt.jar:/p/course/cs536-horwitz/public/JAVA"
Don't forget to type:
source ~/.cshrc
after changing .cshrc.local.
The C-- Language
This section defines the lexical level of the C-- language.
At this level, we have the following language issues:
Tokens
The tokens of the C-- language are defined as follows:
- Any of the following reserved words:
int bool void true false if
else while return cin cout
- Any identifier (a sequence of one or more letters and/or digits,
and/or underscores, starting with a letter or underscore, excluding
reserved words).
- Any integer literal (a sequence of one or more digits).
- Any string literal (a sequence of zero or more "string" characters
surrounded by double quotes; a "string" character is either an
escaped character -- a backslash followed by an n, a
t, a single quote, a double quote, or another backslash --
or a single character other than newline or double quote or backslash).
Examples of legal string literals:
""
"&!#"
"use \n to denote a newline character"
"include a quote like this \" and a backslash like this \\"
Examples of things that are not legal string literals:
"unterminated
"also unterminated \"
"backslash followed by space: \ is not allowed"
"bad escaped character: \a AND not terminated
- Any of the following one- or two-character symbols:
{ } ( ) [ ] , = ;
+ - * / ! && || == !=
< > <= >= << >>
Token "names" (i.e., values to be returned by the scanner)
are defined in the file
sym.java.
For example, the name for the token to be returned when an integer
literal is recognized is INTLITERAL.
Note that code telling JLex to return the EOF token on end-of-file
has already been included in the file c.jlex -- you don't have to
include a specification for that token. Note also that the READ token is for the
2-character symbol >> and the WRITE token is for the 2-character symbol <<.
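The definitions of identifiers, integer literals, and string literals above map fairly directly onto regular expressions. The sketch below checks them against the examples using java.util.regex rather than JLex notation; the class name TokenPatterns and its members are hypothetical and are not part of the skeleton files. Reserved words also match the identifier pattern, so a real scanner has to distinguish them separately.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not part of the skeleton files): expresses the C--
// lexical definitions with java.util.regex instead of JLex.
class TokenPatterns {
    // Identifier: letters, digits, underscores; starts with a letter or underscore.
    // Note: reserved words also match this pattern and must be handled separately.
    static final Pattern ID = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    // Integer literal: one or more digits.
    static final Pattern INT = Pattern.compile("[0-9]+");

    // String literal: a double quote, zero or more "string" characters, a double quote.
    // A "string" character is an escape (\n, \t, \', \", \\) or any single
    // character other than newline, double quote, or backslash.
    static final Pattern STRING =
        Pattern.compile("\"(\\\\[nt'\"\\\\]|[^\\n\"\\\\])*\"");

    static boolean isLegalString(String s) {
        return STRING.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "\"\"",
            "\"&!#\"",
            "\"unterminated",
            "\"backslash followed by space: \\ is not allowed\""
        };
        for (String s : samples) {
            System.out.println(s + " -> " + isLegalString(s));
        }
    }
}
```

In a JLex specification the same three patterns would be written in JLex's own regular-expression syntax, each with an action that builds a Symbol.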
Comments
Text starting with a double slash (//) or a sharp sign (#) up to the end
of the line is a comment (except of course if those characters are inside
a string literal).
For example:
// this is a comment
# and so is this
The scanner should recognize and ignore comments (but there is no
COMMENT token).
Whitespace
Spaces, tabs, and newline characters are whitespace.
Whitespace separates tokens, but should otherwise be ignored (except inside
a string literal).
Illegal Characters
Any character that is not whitespace and is not part of a token or
comment is illegal.
Length Limits
You may not assume any limits on the lengths of identifiers, string literals,
integer literals, comments, etc.
What the Scanner Should Do
The main job of the scanner is to identify and return the next token.
The value to be returned includes:
- The token "name" (e.g., INTLITERAL).
Token names are defined in the file
sym.java.
- The line number in the input file on which the token starts.
- The number of the character on that line at which the token starts.
- For identifiers, integer literals, and string literals: the actual
value (a String, an int, or a String,
respectively).
For a string literal, the value should include the double quotes
that surround the string,
as well as any backslashes used inside the string as part of an
"escaped" character.
Your scanner will return this information by creating a new Symbol
object in the action associated with each regular expression that defines
a token (the Symbol type is defined in java_cup.runtime).
A Symbol includes a field
of type int for the token name, and a field of type
Object (named value), which will be used for the line
and character numbers, as well as the token value (for identifiers and
literals).
See c.jlex for examples of
how to call the Symbol constructor.
In your compiler, the value field of a Symbol will actually
be of type TokenVal;
that type is defined in
c.jlex.
Every TokenVal includes a linenum field, and a
charnum field.
Subtypes of TokenVal with more fields will be used for
the values associated with identifier, integer literal, and string literal
tokens.
One subtype, IntLitTokenVal, is defined in
c.jlex.
You will need to add definitions for the subtypes to be used for
identifier and string literal tokens.
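As a rough picture of the class hierarchy described above, the following sketch shows a base TokenVal with three subtypes. The real TokenVal and IntLitTokenVal are defined in c.jlex, which is not reproduced here, so the constructor signatures and the names IdTokenVal, StrLitTokenVal, intVal, idVal, and strVal are assumptions made for illustration.

```java
// Small demo class placed first so the file can be run directly.
class TokenValDemo {
    public static void main(String[] args) {
        TokenVal v = new IdTokenVal(3, 7, "count");
        System.out.println(v.linenum + ":" + v.charnum + " " + ((IdTokenVal) v).idVal);
    }
}

// Base class: every token value records where the token starts.
class TokenVal {
    int linenum;
    int charnum;

    TokenVal(int linenum, int charnum) {
        this.linenum = linenum;
        this.charnum = charnum;
    }
}

// Value for integer literal tokens (field name is an assumption).
class IntLitTokenVal extends TokenVal {
    int intVal;

    IntLitTokenVal(int linenum, int charnum, int intVal) {
        super(linenum, charnum);
        this.intVal = intVal;
    }
}

// Hypothetical subtype for identifier tokens.
class IdTokenVal extends TokenVal {
    String idVal;

    IdTokenVal(int linenum, int charnum, String idVal) {
        super(linenum, charnum);
        this.idVal = idVal;
    }
}

// Hypothetical subtype for string literal tokens; the stored text keeps the
// surrounding double quotes and any backslashes, as required above.
class StrLitTokenVal extends TokenVal {
    String strVal;

    StrLitTokenVal(int linenum, int charnum, String strVal) {
        super(linenum, charnum);
        this.strVal = strVal;
    }
}
```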
Line counting is done by the scanner generated by
JLex (the variable yyline holds the current line number, counting
from 0), but you will have to include
code to keep track of the current character number on that line.
The code in c.jlex does this
for the tokens that it defines, and you should be able to figure out
how to do the same thing for the new tokens that you add.
The JLex scanner also provides a method yytext that returns
the actual text that matches a regular expression.
You will find it useful to use this method in the actions you write
in your JLex specification.
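The character-number bookkeeping can be pictured with a small stand-alone simulation. The CharCounter class below is hypothetical; it assumes character numbers start at 1 on each line, and its advance method takes the matched text that yytext would supply inside a JLex action. The mechanism actually used in c.jlex may differ.

```java
// Hypothetical stand-in for the bookkeeping a JLex action would perform.
class CharCounter {
    // Assumption: the first character on a line is number 1.
    int charnum = 1;

    // Takes the text just matched (what yytext() would return), returns the
    // character number where that text starts, and advances past it.
    int advance(String matched) {
        int start = charnum;
        charnum += matched.length();
        return start;
    }

    // Called when a newline is matched: the next token starts a fresh line.
    void newline() {
        charnum = 1;
    }

    public static void main(String[] args) {
        CharCounter c = new CharCounter();
        // Simulate scanning the line "int x" piece by piece.
        System.out.println("int starts at " + c.advance("int"));
        c.advance(" "); // whitespace is ignored but still occupies a column
        System.out.println("x starts at " + c.advance("x"));
    }
}
```

The important detail is that every rule, including the ones for whitespace and comments, has to advance the counter, otherwise later tokens on the line are reported at the wrong position.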
Note that, for the integer literal token, you will need to convert
a String (the value scanned) to an int (the
value to be returned).
You should use code like the following:
double d = Double.parseDouble(yytext()); // convert String to double
// INSERT CODE HERE TO CHECK FOR BAD VALUE -- SEE ERRORS AND WARNINGS BELOW
int k = Integer.parseInt(yytext()); // convert String to int
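One way to see why the value is converted to a double first: a digit string larger than Integer.MAX_VALUE cannot be parsed as an int at all, but it can always be parsed as a double and compared. The sketch below is an illustration only; the class and method names are hypothetical, it takes the digit string as a parameter instead of calling yytext, and it leaves out the warning message because the signature of the Errors methods is not shown here.

```java
// Hypothetical helper showing the double-then-int conversion on plain strings.
class IntLiteralValue {
    // Returns the int value of a string of digits, or Integer.MAX_VALUE
    // when the literal is too large to fit in an int.
    static int convert(String digits) {
        double d = Double.parseDouble(digits); // never overflows for digit strings
        if (d > Integer.MAX_VALUE) {
            // A real scanner would issue its warning here before returning.
            return Integer.MAX_VALUE;
        }
        return Integer.parseInt(digits); // safe: value is known to fit
    }

    public static void main(String[] args) {
        System.out.println(convert("42"));
        System.out.println(convert("99999999999999999999"));
    }
}
```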
Errors and Warnings
The scanner should handle the following errors as indicated:
Illegal characters.
Issue the error message:
ignoring illegal character: ch (where ch is the
illegal character) and ignore the character.
Unterminated string literals
A string literal is considered to be unterminated if there is a
newline or end-of-file before the closing quote.
Issue the error message: ignoring unterminated string literal
and ignore the unterminated string literal (start looking for the
next token after the newline).
Bad string literals
A string literal is "bad" if it includes a bad "escaped" character;
i.e., a backslash followed by something other than an n, a
t, a single quote, a double quote, or another backslash.
Issue the error message:
ignoring string literal with bad escaped character
and ignore the string literal (start looking for the
next token after the closing quote).
If the string literal has a bad escaped character and is
unterminated, issue the error message
ignoring unterminated string literal with bad escaped character,
and ignore the bad string literal (start looking for the next token
after the newline).
Note that a string literal that has a newline immediately after a
backslash should be treated as having a bad escaped character and
being unterminated. For example, given the two input lines:
"very bad string \
abc
the scanner should report an unterminated string literal with a bad
escaped character on line 1, and an identifier on line 2.
Bad integer literals (integer literals larger than
Integer.MAX_VALUE).
Issue the warning message:
integer literal too large; using max value
and return Integer.MAX_VALUE as the value for that token.
For unterminated string literals, bad string literals, and bad integer
literals, the line and column numbers used in the error message
should correspond to the position of the first character
in the string/integer literal.
Use the fatal and warn methods of the Errors class
to print error and warning messages.
Be sure to use exactly the wording given above for each message
so that the output of your scanner will match the output that we expect
when we test your code.
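The four string-literal cases (legal, unterminated, bad escape, and both) can be told apart with a single left-to-right pass. The sketch below is not a JLex specification, where each case would normally get its own rule; it is a hypothetical stand-alone classifier that takes the text from the opening quote up to the end of the line (with no newline included) and returns the message wording given above, or null for a legal literal. Treating a backslash at the very end of the text as a bad escape is an assumption that also covers end-of-file.

```java
// Hypothetical classifier for candidate string literals on a single line.
class StringLiteralCheck {
    static final String UNTERMINATED = "ignoring unterminated string literal";
    static final String BAD = "ignoring string literal with bad escaped character";
    static final String BAD_UNTERMINATED =
        "ignoring unterminated string literal with bad escaped character";

    // text starts at the opening double quote and contains no newline.
    // Returns null if the literal is legal, otherwise the error message.
    static String classify(String text) {
        boolean badEscape = false;
        int i = 1; // skip the opening quote
        while (i < text.length()) {
            char c = text.charAt(i);
            if (c == '"') {
                return badEscape ? BAD : null; // closing quote found
            }
            if (c == '\\') {
                // Legal escapes: \n, \t, \', \", \\. A backslash with nothing
                // after it (newline or end-of-file follows) is a bad escape.
                boolean atEnd = i + 1 >= text.length();
                if (atEnd || "nt'\"\\".indexOf(text.charAt(i + 1)) < 0) {
                    badEscape = true;
                }
                i += 2; // skip the backslash and the escaped character
            } else {
                i++;
            }
        }
        return badEscape ? BAD_UNTERMINATED : UNTERMINATED; // ran off the line
    }

    public static void main(String[] args) {
        System.out.println(classify("\"unterminated"));
        System.out.println(classify("\"bad escaped character: \\a AND not terminated"));
    }
}
```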
The Main Program
In addition to specifying a scanner, you should extend the main
program in P2.java.
The main program expects one command-line argument: the name of the file
to be scanned.
That file is opened for reading;
then the program loops, calling the scanner's next_token method
until the special end-of-file token is returned.
The tokens returned by the scanner should be printed (to System.out),
one per line, preceded by the token's line and character numbers:
<linenum>:<charnum> <token>
where <token> means the token name as defined in
sym.java.
For ID, STRINGLITERAL,
and INTLITERAL tokens,
the main program should also print the value returned (on
the same line as the token, enclosed in parentheses):
<linenum>:<charnum> <token> (<value>)
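The output format above can be sketched as a small formatting helper. The class TokenLine and its format method are hypothetical; in P2.java the line number, character number, and value would come from the TokenVal stored in each Symbol, and the token name would be derived from the constants in sym.java.

```java
// Hypothetical helper that builds one line of P2 output.
class TokenLine {
    // value is null for tokens that carry no value (everything except
    // ID, STRINGLITERAL, and INTLITERAL).
    static String format(int linenum, int charnum, String tokenName, Object value) {
        String line = linenum + ":" + charnum + " " + tokenName;
        return value == null ? line : line + " (" + value + ")";
    }

    public static void main(String[] args) {
        System.out.println(format(1, 1, "ID", "count"));
        System.out.println(format(1, 7, "READ", null));
        System.out.println(format(2, 3, "INTLITERAL", 42));
    }
}
```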
Testing
Part of your task will be to figure out a strategy for testing
your implementation.
As mentioned in the Overview, part of your
grade will be determined by how thoroughly the test file that you turn
in tests your scanner.
You are to write one test file named test.C.
Use comments in the file to explain what aspects of the scanner are
being tested.
To test your scanner on a file that has text on the last line but no
final newline, you can either use emacs to create the file (first make
sure that your .emacs file does not include
(setq require-final-newline t), and do not give the file
any extension),
or you can create the file by writing a Java program that uses
System.out.print, and redirecting the output to a file. An
example file with no final newline is in:
~cs536-1/public/prog2/eof.txt.
You can tell that there is no newline by typing: cat eof.txt
You should see your command-line prompt at the end of the
last line of the output instead of at the beginning of the following
line.
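A minimal sketch of the Java approach mentioned above: the program prints its text with System.out.print, so nothing is written after the last character. The class name and the sample contents are made up for illustration.

```java
// Hypothetical generator for a test input whose last line has no newline.
class NoFinalNewline {
    // The last line ("x") is deliberately not followed by "\n".
    static String contents() {
        return "int x;\nx";
    }

    public static void main(String[] args) {
        // print, not println: println would append a final newline.
        System.out.print(contents());
    }
}
```

Redirecting the program's standard output to a file then gives an input whose final token is followed directly by end-of-file.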
Working in Pairs
Graduate and special students must work alone on this assignment.
Undergraduates may work alone or in pairs.
Please send mail to hasti@cs.wisc.edu
saying whether you want to work with a partner.
If you want to work with a particular person, send that
person's name (note: both partners must request each other);
otherwise, mention that you need to be matched with a partner.
Below is some advice on how to work in pairs.
This assignment involves three main tasks:
- Writing the scanner specification (c.jlex).
- Writing the main program (P2.java).
- Writing input files to test your scanner.
Since both partners are responsible for learning how to use JLex, it is
important that you share responsibility for the first task (writing the
scanner specification).
It is also a good idea to work together on testing (that way you will be
less likely to overlook some things that should be tested).
I suggest that you proceed as follows:
- Divide up the tokens into two parts, one part for each person.
- Each person should extend their own copy of c.jlex
by adding rules for their half of the tokens, and should extend
their own copy of the main program to handle those same tokens.
- Write test inputs for your own tokens, and perhaps for the other
person's tokens, too.
- After each person makes sure that their scanner and main program
work on their own tokens, combine the two (it should be pretty easy
to cut and paste one person's JLex rules into the other person's
c.jlex, and similarly for the main program).
- Talk about what needs to be tested, and decide together what your
final version of test.C should include.
- Do not try to implement all of your half of the tokens
at once. Instead, implement just a few to start with to make
sure that you both know what you're doing, and that you're able
to combine your work easily.
The most challenging JLex rules are for the
STRINGLITERAL token (for which you will need several rules: for a
correct string literal, for an unterminated string literal, for
a string literal that contains a bad escaped character, and for a
string literal that contains a bad escaped character and
is unterminated).
Be sure to divide these up so that each person gets to work on some of them.
It is very important to set deadlines and to stick to them.
I suggest that you choose one person to be the "project leader" (plan
to switch off on future assignments).
The project leader should propose a division of tokens, as well as deadlines
for completing phases of the program, and should be responsible for
keeping the most recent version of the combined code (be sure to keep
back-up versions, too, perhaps in another directory or using RCS).
To share your code, you can either use e-mail, or the project leader
can create a directory for the combined code (not the directory
in which that person develops the code).
I suggest that you create a new top-level directory (i.e., at the same
level as your public and private directories), named
something like cs536-P2.
To set the permissions of the directory for the combined code to allow your
partner to write into it, change to that directory and type:
fs setacl . <login> write
using your partner's CS login in place of <login>.
You should also prevent any other access by typing:
fs setacl . system:anyuser none
in the new directory that you create (not in your top-level
directory).
To see what the permissions are in your current directory, type:
fs listacl
Announcements
Includes: Additions, Revisions, and FAQs
(Frequently Asked Questions).
Please check here frequently.
9/30/2004: Program released.
Handin
What to turn in
See the assignments page for information about how to submit your code.
The late policy is also found on the assignments page.
Electronically submit all of the files that are needed to
create and run your main program as well as your Makefile and
your test.C.
Do not copy any ".class" files, and do not create any subdirectories
in your handin directory.
If you are working with a partner, only one of you should hand in files.
Include a comment at the top of P2.java with the names of both
partners.
General information on program grading criteria can be found on the Grading Criteria
for Programs page.