CS 536 Program 2: The Scanner

Due date: Thursday, October 14 (by midnight)
Not accepted after midnight on Sunday, October 17


Overview

For this assignment you will use JLex to write a scanner for a language called C--, a small subset of the C++ language. Features of C-- that are relevant to this assignment are described below. You will also write a main program and an input file to test your scanner. You will be graded both on the correctness of your scanner and on how thoroughly your input file tests the scanner.


Requirements

Getting Started

Skeleton files on which you should build are in: ~cs536-1/public/prog2

The files are:

JLex

Use the on-line JLex reference manual, and/or the on-line JLex notes for information about writing a JLex specification.

To run JLex you'll need to modify your CLASSPATH environment variable. Taking into account the change needed to use jikes, your .cshrc.local file should include the line:

    setenv CLASSPATH ".:/s/java/jre/lib/rt.jar:/p/course/cs536-horwitz/public/JAVA"

Don't forget to type: source ~/.cshrc after changing .cshrc.local.

The C-- Language

This section defines the lexical level of the C-- language. At this level, we have the following language issues:

Tokens

The tokens of the C-- language are defined as follows:

Token "names" (i.e., values to be returned by the scanner) are defined in the file sym.java. For example, the name for the token to be returned when an integer literal is recognized is INTLITERAL.

Note that code telling JLex to return the EOF token on end-of-file has already been included in the file c.jlex -- you don't have to include a specification for that token. Note also that the READ token is for the 2-character symbol >> and the WRITE token is for the 2-character symbol <<.

Comments

Text starting with a double slash (//) or a sharp sign (#) up to the end of the line is a comment (except of course if those characters are inside a string literal). For example:
    // this is a comment
    # and so is this
The scanner should recognize and ignore comments (but there is no COMMENT token).
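
As a rough illustration of the comment rule, the sketch below models it with java.util.regex instead of JLex, whose pattern syntax differs in some details. The class CommentDemo and its method are hypothetical and are not part of the skeleton files. The sketch does not cover comment characters inside a string literal; in a real scanner that case is handled by the string literal rules.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the skeleton: models the C-- comment rule
// with java.util.regex (JLex uses its own, slightly different, syntax).
class CommentDemo {
    // "//" or "#", followed by anything up to (not including) the end of the line.
    private static final Pattern COMMENT = Pattern.compile("(//|#)[^\\n]*");

    // True if the given text, taken as a whole, is a single comment.
    static boolean isComment(String text) {
        return COMMENT.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(isComment("// this is a comment"));
        System.out.println(isComment("# and so is this"));
        System.out.println(isComment("x // code followed by a comment"));
    }
}
```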

Whitespace

Spaces, tabs, and newline characters are whitespace. Whitespace separates tokens, but should otherwise be ignored (except inside a string literal).

Illegal Characters

Any character that is not whitespace and is not part of a token or comment is illegal.

Length Limits

You may not assume any limits on the lengths of identifiers, string literals, integer literals, comments, etc.

What the Scanner Should Do

The main job of the scanner is to identify and return the next token. The value to be returned includes:

Your scanner will return this information by creating a new Symbol object in the action associated with each regular expression that defines a token (the Symbol type is defined in java_cup.runtime). A Symbol includes a field of type int for the token name, and a field of type Object (named value), which will be used for the line and character numbers, as well as the token value (for identifiers and literals). See c.jlex for examples of how to call the Symbol constructor.

In your compiler, the value field of a Symbol will actually be of type TokenVal; that type is defined in c.jlex. Every TokenVal includes a linenum field, and a charnum field. Subtypes of TokenVal with more fields will be used for the values associated with identifier, integer literal, and string literal tokens. One subtype, IntLitTokenVal, is defined in c.jlex. You will need to add definitions for the subtypes to be used for identifier and string literal tokens.
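
The subtype idea can be sketched as follows. Everything here is an assumption made for illustration: the TokenVal class is a stand-in for the one in c.jlex (which may be declared differently), and the names IdTokenVal, StrLitTokenVal, idVal, and strVal are hypothetical.

```java
// Stand-in for the TokenVal defined in c.jlex; the real class may differ.
class TokenVal {
    int linenum;
    int charnum;

    TokenVal(int linenum, int charnum) {
        this.linenum = linenum;
        this.charnum = charnum;
    }
}

// Hypothetical subtype for identifier tokens.
class IdTokenVal extends TokenVal {
    String idVal;  // the identifier's text, e.g. the result of yytext()

    IdTokenVal(int linenum, int charnum, String idVal) {
        super(linenum, charnum);
        this.idVal = idVal;
    }
}

// Hypothetical subtype for string literal tokens.
class StrLitTokenVal extends TokenVal {
    String strVal;  // the literal's text as it appeared in the input

    StrLitTokenVal(int linenum, int charnum, String strVal) {
        super(linenum, charnum);
        this.strVal = strVal;
    }
}
```

Because each subtype extends TokenVal, the main program can read linenum and charnum from any token value without knowing which kind of token it has.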

Line counting is done by the scanner generated by JLex (the variable yyline holds the current line number, counting from 0), but you will have to include code to keep track of the current character number on that line. The code in c.jlex does this for the tokens that it defines, and you should be able to figure out how to do the same thing for the new tokens that you add.
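
One possible way to think about character counting is sketched below. This is not the code in c.jlex, which has its own mechanism; the class CharCounter, its methods, and the choice of counting from 1 are all assumptions made for illustration.

```java
// Hypothetical sketch of character-number tracking; c.jlex has its own code
// for this, which may look different.
class CharCounter {
    private int next = 1;  // character number of the next unread character (assumed to count from 1)

    // Call for each match that stays on one line (a token or a run of
    // spaces/tabs): returns where the match starts and advances past it.
    int consume(String matched) {
        int start = next;
        next += matched.length();
        return start;
    }

    // Call when a newline is matched: the next token starts a new line.
    void newline() {
        next = 1;
    }
}
```

In a JLex action, the argument to consume would typically be the matched text returned by yytext().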

The JLex scanner also provides a method yytext that returns the actual text that matches a regular expression. You will find it useful to use this method in the actions you write in your JLex specification.

Note that, for the integer literal token, you will need to convert a String (the value scanned) to an int (the value to be returned). You should use code like the following:

  double d = Double.parseDouble(yytext());  // convert String to double
  // INSERT CODE HERE TO CHECK FOR BAD VALUE -- SEE ERRORS AND WARNINGS BELOW
  int k = Integer.parseInt(yytext());       // convert to int (only safe once the value is known to fit)
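
The check that the fragment above leaves open could look roughly like the sketch below. The class IntLitDemo and the method toIntClamped are hypothetical. The warning itself is shown only as a comment, because the exact signature of the warn method in the Errors class is not given here.

```java
// Hypothetical helper, not part of the skeleton: one way to do the range
// check before converting the scanned digits to an int.
class IntLitDemo {
    // Converts a string of decimal digits to an int, returning
    // Integer.MAX_VALUE when the value is too large to fit.
    static int toIntClamped(String digits) {
        // A double can represent (approximately) even very long digit strings,
        // so this conversion does not fail on large values.
        double d = Double.parseDouble(digits);
        if (d > Integer.MAX_VALUE) {
            // The real scanner would issue the "integer literal too large"
            // warning here, using the position of the literal's first character.
            return Integer.MAX_VALUE;
        }
        return Integer.parseInt(digits);  // safe: the value fits in an int
    }

    public static void main(String[] args) {
        System.out.println(toIntClamped("42"));
        System.out.println(toIntClamped("99999999999999999999"));
    }
}
```

The conversion to int happens only after the comparison, because Integer.parseInt throws a NumberFormatException for a value that does not fit.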

Errors and Warnings

The scanner should handle the following errors as indicated:

Illegal characters.

Issue the error message: ignoring illegal character: ch (where ch is the illegal character) and ignore the character.

Unterminated string literals

A string literal is considered to be unterminated if there is a newline or end-of-file before the closing quote. Issue the error message: ignoring unterminated string literal and ignore the unterminated string literal (start looking for the next token after the newline).

Bad string literals

A string literal is "bad" if it includes a bad "escaped" character; i.e., a backslash followed by something other than an n, a t, a single quote, a double quote, or another backslash. Issue the error message: ignoring string literal with bad escaped character and ignore the string literal (start looking for the next token after the closing quote). If the string literal has a bad escaped character and is unterminated, issue the error message ignoring unterminated string literal with bad escaped character, and ignore the bad string literal (start looking for the next token after the newline). Note that a string literal that has a newline immediately after a backslash should be treated as having a bad escaped character and being unterminated. For example, given:
the scanner should report an unterminated string literal with a bad escaped character on line 1, and an identifier on line 2.
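
The four string literal cases can be told apart as in the sketch below. It is written in plain Java, not as JLex rules, and the class StringLitDemo, the enum Kind, and the method classify are hypothetical. The input is assumed to be the text from the opening double quote up to and including the closing quote, or up to (but not including) the newline or end-of-file if there is no closing quote.

```java
// Hypothetical helper, not part of the skeleton: classifies the text of a
// string literal candidate according to the rules described above.
class StringLitDemo {
    enum Kind { OK, UNTERMINATED, BAD_ESCAPE, BAD_ESCAPE_UNTERMINATED }

    // text starts with the opening quote and ends at the closing quote, or at
    // the end of the line if the literal is never closed (assumption).
    static Kind classify(String text) {
        boolean badEscape = false;
        int i = 1;  // skip the opening quote
        while (i < text.length()) {
            char c = text.charAt(i);
            if (c == '"') {
                // Closing quote found: the literal is terminated.
                return badEscape ? Kind.BAD_ESCAPE : Kind.OK;
            }
            if (c == '\\') {
                if (i + 1 >= text.length()) {
                    // Backslash is the last character on the line: treated as
                    // a bad escape in an unterminated literal.
                    badEscape = true;
                    break;
                }
                // Legal escapes: \n, \t, \', \", and \\.
                if ("nt'\"\\".indexOf(text.charAt(i + 1)) < 0) {
                    badEscape = true;
                }
                i += 2;  // skip the backslash and the escaped character
            } else {
                i++;
            }
        }
        // Ran out of text without seeing a closing quote.
        return badEscape ? Kind.BAD_ESCAPE_UNTERMINATED : Kind.UNTERMINATED;
    }
}
```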

Bad integer literals (integer literals larger than Integer.MAX_VALUE).

Issue the warning message: integer literal too large; using max value and return Integer.MAX_VALUE as the value for that token.

For unterminated string literals, bad string literals, and bad integer literals, the line and column numbers used in the error message should correspond to the position of the first character in the string/integer literal.

Use the fatal and warn methods of the Errors class to print error and warning messages. Be sure to use exactly the wording given above for each message so that the output of your scanner will match the output that we expect when we test your code.

The Main Program

In addition to specifying a scanner, you should extend the main program in P2.java. The main program expects one command-line argument: the name of the file to be scanned. That file is opened for reading; then the program loops, calling the scanner's next_token method until the special end-of-file token is returned. The tokens returned by the scanner should be printed (to System.out), one per line, preceded by the token's line and character numbers:
  <linenum>:<charnum> <token>
where <token> means the token name as defined in sym.java. For ID, STRINGLITERAL, and INTLITERAL tokens, the main program should also print the value returned (on the same line as the token, enclosed in parentheses):
  <linenum>:<charnum> <token> (<value>)
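
The two output formats can be produced with simple string concatenation, as in the sketch below. The class TokenLineDemo and its methods are hypothetical; in P2.java the line number, character number, and value would come from the Symbol returned by the scanner.

```java
// Hypothetical helper, not part of the skeleton: builds one output line
// in each of the two formats described above.
class TokenLineDemo {
    // Format for tokens without a value: <linenum>:<charnum> <token>
    static String format(int linenum, int charnum, String tokenName) {
        return linenum + ":" + charnum + " " + tokenName;
    }

    // Format for ID, STRINGLITERAL, and INTLITERAL tokens:
    // <linenum>:<charnum> <token> (<value>)
    static String format(int linenum, int charnum, String tokenName, Object value) {
        return format(linenum, charnum, tokenName) + " (" + value + ")";
    }

    public static void main(String[] args) {
        System.out.println(format(1, 1, "READ"));
        System.out.println(format(3, 7, "INTLITERAL", 42));
    }
}
```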

Testing

Part of your task will be to figure out a strategy for testing your implementation. As mentioned in the Overview, part of your grade will be determined by how thoroughly the test file that you turn in tests your scanner.

You are to write one test file named test.C. Use comments in the file to explain what aspects of the scanner are being tested.

To test your scanner on a file that has text on the last line but no final newline, you can either use emacs to create the file (first make sure that your .emacs file does not include (setq require-final-newline t), and do not give the file any extension), or you can create the file by writing a Java program that uses System.out.print, and redirecting the output to a file. An example file with no final newline is in: ~cs536-1/public/prog2/eof.txt. You can tell that there is no newline by typing:
    cat eof.txt
You should see your command-line prompt at the end of the last line of the output instead of at the beginning of the following line.
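
A Java program of the kind mentioned above, one that uses System.out.print to produce a file with no final newline, might look like the following sketch. The class name NoFinalNewline and the sample text are assumptions; substitute whatever input you want to test.

```java
// Hypothetical helper program: prints text whose last line has no newline.
class NoFinalNewline {
    // The text to write; the last line is deliberately not followed by "\n".
    static String contents() {
        return "// a comment on the first line\n" + "lastid";
    }

    public static void main(String[] args) {
        // print (not println) so that no newline is added at the end.
        System.out.print(contents());
    }
}
```

After compiling, redirecting the output (for example, java NoFinalNewline > myfile, where myfile is a name of your choosing) creates the test file.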

Working in Pairs

Graduate and special students must work alone on this assignment. Undergraduates may work alone or in pairs. Please send mail to hasti@cs.wisc.edu saying whether you want to work with a partner. If you want to work with a particular person, send that person's name (note: both partners must request each other); otherwise, mention that you need to be matched with a partner.

Below is some advice on how to work in pairs.

This assignment involves three main tasks:

  1. Writing the scanner specification (c.jlex).
  2. Writing the main program (P2.java).
  3. Writing input files to test your scanner.

Since both partners are responsible for learning how to use JLex, it is important that you share responsibility for task (1). It is also a good idea to work together on testing (that way you will be less likely to overlook some things that should be tested).

I suggest that you proceed as follows:

The most challenging JLex rules are for the STRINGLITERAL token (for which you will need several rules: for a correct string literal, for an unterminated string literal, for a string literal that contains a bad escaped character, and for a string literal that contains a bad escaped character and is unterminated). Be sure to divide these up so that each person gets to work on some of them.

It is very important to set deadlines and to stick to them. I suggest that you choose one person to be the "project leader" (plan to switch off on future assignments). The project leader should propose a division of tokens, as well as deadlines for completing phases of the program, and should be responsible for keeping the most recent version of the combined code (be sure to keep back-up versions, too, perhaps in another directory or using RCS).

To share your code, you can either use e-mail, or the project leader can create a directory for the combined code (not the directory in which that person develops the code). I suggest that you create a new top-level directory (i.e., at the same level as your public and private directories), named something like cs536-P2. To set the permissions of the directory for the combined code to allow your partner to write into it, change to that directory and type:

    fs setacl . <login> write
using your partner's CS login in place of <login>. You should also prevent any other access by typing:
    fs setacl . system:anyuser none
in the new directory that you create (not in your top-level directory). To see what the permissions are in your current directory, type:
    fs listacl

Announcements

Includes: Additions, Revisions, and FAQs (Frequently Asked Questions).
Please check here frequently.

9/30/2004 Program released.


Handin

What to turn in

See the assignments page for information about how to submit your code. The late policy is also found on the assignments page.

Electronically submit all of the files that are needed to create and run your main program as well as your Makefile and your test.C. Do not copy any ".class" files, and do not create any subdirectories in your handin directory.

If you are working with a partner only one of you should hand in files. Include a comment at the top of P2.java with the names of both partners.

General information on program grading criteria can be found on the Grading Criteria for Programs page.