CS536 Lecture notes for February 12th, 1999

Written by: Deb Deppeler
email: deppeler

Reading Assignment

Sections 3.5 & 3.6

Capture the token's name, line number, and column number, and then return the token or handle it.
Example: Given a token pattern and some Java code, do the following:
"+" { return new Symbol(sym.PLUS, new CSXToken(Pos.linenum, Pos.colnum));}
JLex has some internal tracking of line and column number and we may use these methods, but we will have to modify the column number returned.
Must also match and handle input that does not form a token, such as whitespace.
Example: Match blank and do nothing.
(" ")+ { /* skip */ }
If name is not on "sym" list, it is NOT a valid token.
Some tokens, such as identifiers, need more processing. Identifiers have the potential to be misread as reserved words.
Example: The regular expression:
[A-Za-z]([A-Za-z0-9])*

can accept ALL identifiers, including reserved words. We must then handle reserved words in the semantic processing: either check the token against a reserved-word table, or match specific reserved words in the scanner.
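
The table-lookup option can be sketched roughly as follows. This is only an illustration: the class name ReservedWords, the token codes, and the entry for "while" are hypothetical stand-ins for whatever the real "sym" list defines, and the lookup assumes reserved words are case-insensitive.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: classifies a lexeme matched by the identifier pattern.
class ReservedWords {
    // Illustrative token codes; the real values would come from "sym".
    static final int IDENTIFIER = 0;
    static final int RW_IF = 1;
    static final int RW_WHILE = 2;

    private static final Map<String, Integer> TABLE = new HashMap<>();
    static {
        TABLE.put("if", RW_IF);
        TABLE.put("while", RW_WHILE);
    }

    // Returns the reserved-word code, or IDENTIFIER if the lexeme is not reserved.
    // Lower-casing first makes the lookup case-insensitive (an assumption).
    static int classify(String lexeme) {
        Integer code = TABLE.get(lexeme.toLowerCase(Locale.ROOT));
        return code == null ? IDENTIFIER : code;
    }
}
```

The action for the identifier rule would call classify(yytext()) and build the Symbol from the result. This keeps the automaton small, at the cost of one table lookup per identifier.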

Should put reserved word expressions before other expressions.
[Ii][Ff] { return new Symbol(sym.RW_IF, new CSXToken(Pos.linenum, Pos.colnum));}
The scanner generator will try to make the longest match. This is what we want.
Example:
iff is a valid identifier, while
if is a reserved word
However, this method can make the underlying finite automaton significantly more complex. The extra complexity is real, even though the scanner generator, not the programmer, usually does the work of building the automaton.

Must handle ~123 as -123
Must handle overflow errors. One way is to convert the value to a double and compare it to MAXINT. If it is in range, return it. Otherwise, display an error.
To convert a String to a double, use:
Double.valueOf(String).doubleValue()
To convert String to char:
String.charAt(index)
To convert Character to char:
Character.charValue()
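
The overflow check and the ~ convention might be combined as in the sketch below. The class and method names are hypothetical, and throwing an exception is just one assumed way to signal the error; a real scanner would more likely print a message and keep going.

```java
// Hypothetical helper for converting the text of an integer literal.
class IntLiteral {
    // Converts literal text such as "123" or "~123" to an int.
    // A leading '~' is treated as a minus sign.
    // Throws NumberFormatException if the value does not fit in an int.
    static int toInt(String text) {
        boolean negative = text.charAt(0) == '~';
        String digits = negative ? text.substring(1) : text;

        // Go through a double so that very long digit strings do not
        // overflow during the conversion itself.
        double value = Double.valueOf(digits).doubleValue();
        if (negative) {
            value = -value;
        }
        if (value > Integer.MAX_VALUE || value < Integer.MIN_VALUE) {
            throw new NumberFormatException("integer literal out of range: " + text);
        }
        return (int) value;
    }
}
```

Every int is exactly representable as a double, so the comparison against the int limits is exact for values near the boundary.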
Must do more to recognize and process multiple char tokens.
'a', '\n', '\t', '\"', '\\', ' ', "ABC", "AB\n"
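
One possible way to turn the matched text of such a literal into the characters it denotes is sketched here. The class and method names are hypothetical, and the set of escapes handled (\n, \t, and "backslash followed by any other character means that character") is an assumption based on the examples listed.

```java
// Hypothetical helper for character and string literals.
class Escapes {
    // Converts the body of a literal (quotes already removed) into
    // the characters it stands for.
    static String unescape(String body) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < body.length(); i++) {
            char c = body.charAt(i);
            if (c == '\\' && i + 1 < body.length()) {
                char next = body.charAt(++i);   // consume the escaped character
                switch (next) {
                    case 'n': out.append('\n'); break;
                    case 't': out.append('\t'); break;
                    default:  out.append(next); break;  // \" and \\ map to themselves
                }
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Strips the surrounding quote characters from the matched text,
    // then processes the escapes.
    static String literalValue(String matched) {
        return unescape(matched.substring(1, matched.length() - 1));
    }
}
```

For a character literal, the value is literalValue(yytext()).charAt(0); for a string literal, it is the whole returned String.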

Recognize each literal as a single pattern, not as a series of separate characters, and handle it appropriately.

A=B@C; [A-Za-z]([A-Za-z0-9])*
1. Scan a "B". So far, no problem.
2. Scan a "@". This is read as though it must be the start of the next token.
  But it is not a valid token, and it causes a runtime error.
To avoid this, we must include a pattern that matches any illegal character. This rule should come last, after all valid token rules. Example:
. { /* Error Handling */ System.out.println(yytext());}
With this rule, the following input produces three errors, one for each illegal character. This is fine.
A=B@@@C
How about these cases?
1. (.)+ { /* error handling */ }
  
  This would cause the greedy "longest match" algorithm to accept all succeeding characters as part of the invalid token also. This is not likely to be a correct interpretation.
2. 10.0E-qq
  
  The scanner sees the initial prefix (10.0E-) and thinks floating point, but the entire token is not valid. Thus, it takes 10.0 as the longest valid complete token, even though 'E-' could still be the start of a valid exponent.
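
The "back up to the last complete token" behavior in case 2 can be illustrated with a small hand-written matcher. This is only a sketch of the idea, not how JLex is implemented: the class name is hypothetical, and the floating-point syntax used (digits, ".", digits, then an optional "E" with optional sign and digits) is an assumption.

```java
// Hypothetical illustration of longest match with backup.
class LongestMatch {
    // Returns the length of the longest prefix of s that is a complete
    // floating-point token, or 0 if no prefix matches.
    static int matchFloat(String s) {
        int i = skipDigits(s, 0);
        if (i == 0 || i >= s.length() || s.charAt(i) != '.') {
            return 0;                           // need digits followed by "."
        }
        int afterFraction = skipDigits(s, i + 1);
        if (afterFraction == i + 1) {
            return 0;                           // need digits after "."
        }
        int lastAccept = afterFraction;         // e.g. "10.0" is a complete token

        int j = afterFraction;
        if (j < s.length() && s.charAt(j) == 'E') {
            j++;
            if (j < s.length() && (s.charAt(j) == '+' || s.charAt(j) == '-')) {
                j++;
            }
            int afterExponent = skipDigits(s, j);
            if (afterExponent > j) {
                lastAccept = afterExponent;     // the exponent is complete
            }
            // Otherwise the exponent is incomplete: back up to lastAccept.
        }
        return lastAccept;
    }

    // Returns the index of the first non-digit at or after position from.
    private static int skipDigits(String s, int from) {
        int i = from;
        while (i < s.length() && s.charAt(i) >= '0' && s.charAt(i) <= '9') {
            i++;
        }
        return i;
    }
}
```

For the input 10.0E-qq, matchFloat returns 4: the matcher reads ahead through "E-", finds no digits, and falls back to the end of "10.0". Scanning then resumes at "E".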
Error handling continued on Monday (2/15/99)