CS536 Lecture notes for February 12th, 1999
Written by: Deb Deppeler
email: deppeler
Reading Assignment
Sections 3.5 & 3.6
Practical Considerations & Error Handling in scanning
- Review
- Reserved Words
- Integer Literals
- Error Handling
- Review
- Capture the name of token, line & column number and then return it or handle it.
Example: A token pattern followed by the Java code to run when it matches:
"+" { return new Symbol(sym.PLUS, new CSXToken(Pos.linenum, Pos.colnum));}
- JLex has some internal tracking of line and column number and we may use these methods, but we will have to modify the column number returned.
- Must also match and handle input that is not a token, such as whitespace.
Example: Match blank and do nothing.
(" ")+ { /* skip */ }
- If name is not on "sym" list, it is NOT a valid token.
- Some tokens, such as identifiers, need more processing. Identifiers have the potential to be misread as reserved words.
Example: The regular expression:
[A-Za-z]([A-Za-z0-9])*
can accept ALL identifiers. Then, we must handle reserved words within the semantic processing.
Can check token against reserved words table or can match specific reserved words in scanning.
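The table-lookup approach can be sketched as below. This is only a minimal illustration: the class name ReservedWords, the method classify, and the integer token codes are hypothetical placeholders (a real scanner would return the constants from the generated "sym" class), and the lookup is made case-insensitive on the assumption that reserved words may be written in either case.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: decide whether a matched identifier is really a reserved word.
class ReservedWords {
    // Placeholder token codes (assumptions); a real scanner would use sym.* constants.
    static final int IDENTIFIER = 0;
    static final int RW_IF = 1;
    static final int RW_WHILE = 2;

    private static final Map<String, Integer> TABLE = new HashMap<>();
    static {
        TABLE.put("if", RW_IF);
        TABLE.put("while", RW_WHILE);
    }

    // Return the reserved-word code for the lexeme, or IDENTIFIER if it is not reserved.
    static int classify(String lexeme) {
        Integer code = TABLE.get(lexeme.toLowerCase(Locale.ROOT));
        return code != null ? code : IDENTIFIER;
    }

    public static void main(String[] args) {
        System.out.println(classify("IF"));   // prints 1 (RW_IF)
        System.out.println(classify("iff"));  // prints 0 (IDENTIFIER)
    }
}
```

The action for the identifier pattern would call something like classify(yytext()) and build the returned Symbol from the result, so the automaton itself needs only the single identifier rule.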
- Reserved Words
- Should put reserved word expressions before other expressions.
[Ii][Ff] { return new Symbol(sym.RW_IF, new CSXToken(Pos.linenum, Pos.colnum));}
- The scanner generator will try to make the longest match. This is what we want.
Example:
iff is a valid identifier, while
if is a reserved word
- However, this method can make the underlying finite automaton significantly more complex. The extra complexity is real, even though the scanner generator, not the programmer, builds the automaton.
- Integer Literals
- Must handle ~123 as -123
- Must handle overflow errors.
Can do this by converting the value to a double and comparing it to MAXINT.
If OK, then return. Else, display error.
- To convert a String to a double, use:
Double.valueOf(String).doubleValue()
- To convert String to char:
String.charAt(index)
- To convert Character to char:
Character.charValue()
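The "~123" and overflow handling above can be sketched as follows. This is an illustration under assumptions, not the required project code: the class IntLiteral and the method toInt are hypothetical names, a null return stands in for "display error", and MAXINT is taken to be Java's Integer.MAX_VALUE (with Integer.MIN_VALUE as the negative limit).

```java
// Hypothetical helper: turn an integer-literal lexeme into an int, detecting overflow.
class IntLiteral {
    // Accepts lexemes such as "123" or "~123"; returns null if the value does not fit in an int.
    static Integer toInt(String lexeme) {
        boolean negative = lexeme.charAt(0) == '~';
        String digits = negative ? lexeme.substring(1) : lexeme;

        // A double holds every 32-bit integer exactly, so the range check is safe.
        double value = Double.valueOf(digits).doubleValue();
        if (negative) {
            value = -value;
        }
        if (value > Integer.MAX_VALUE || value < Integer.MIN_VALUE) {
            return null; // overflow: the caller reports the error
        }
        return (int) value;
    }

    public static void main(String[] args) {
        System.out.println(toInt("~123"));        // prints -123
        System.out.println(toInt("2147483647"));  // prints 2147483647
        System.out.println(toInt("2147483648"));  // prints null
    }
}
```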
- Must do more to recognize and process multiple-character tokens, such as character and string literals containing escapes:
'a', '\n', '\t', '\"', '\\', ' ', "ABC", "AB\n"
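One way to process such literals after they have been matched is sketched below. The class Literals and its methods are hypothetical names, and the set of escapes handled (\n, \t, and any other escaped character standing for itself) is an assumption based on the examples listed above.

```java
// Hypothetical helper: compute the value of a character or string literal lexeme.
class Literals {
    // Translate the body of a literal (quotes already removed) into its actual characters.
    static String unescape(String body) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < body.length(); i++) {
            char c = body.charAt(i);
            if (c == '\\' && i + 1 < body.length()) {
                char next = body.charAt(++i); // consume the escaped character
                switch (next) {
                    case 'n': out.append('\n'); break;
                    case 't': out.append('\t'); break;
                    default:  out.append(next); break; // \" \' \\ stand for themselves
                }
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Value of a string lexeme such as "AB\n" (including its surrounding quotes).
    static String stringValue(String lexeme) {
        return unescape(lexeme.substring(1, lexeme.length() - 1));
    }

    // Value of a character lexeme such as 'a' or '\n' (including its surrounding quotes).
    static char charValue(String lexeme) {
        return stringValue(lexeme).charAt(0);
    }
}
```

In a scanner action, the lexeme passed in would typically come from yytext().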
- Error Handling
- Recognize patterns, not a series, and handle appropriately.
Input:   A=B@C;
Pattern: [A-Za-z]([A-Za-z0-9])*
- scan a "B", so far no problem
- scan a "@". The scanner treats it as the start of the next token.
But no token pattern matches it, which causes a runtime error.
- To avoid this, must include a pattern to match any illegal characters. This rule should come last, after all valid token rules. Example:
. { /* Error Handling */ System.out.println(yytext());}
- This example produces three errors, one for each "@". This is fine.
A=B@@@C
- How about?
- (.)+ { /* error handling */ }
This would cause the greedy "longest match" algorithm to accept all succeeding characters as part of the invalid token also.
This is not likely to be a correct interpretation.
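The difference between the two error rules can be seen in a small simulation. This is not JLex itself but a hypothetical longest-match loop built on java.util.regex, with an identifier rule, an "=" rule, and an error rule placed last; the class and method names are invented for illustration. Ties in match length go to the earlier rule, as in a scanner generator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical simulation of longest-match scanning with a catch-all error rule.
class ErrorRuleDemo {
    static final String[] NAMES = {"ID", "ASSIGN", "ERROR"};

    // Scan the input; errorRegex is the pattern used for the final (error) rule.
    static List<String> scanWith(String errorRegex, String input) {
        String[] regexes = {"[A-Za-z][A-Za-z0-9]*", "=", errorRegex};
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            int bestLen = 0;
            String bestName = null;
            for (int r = 0; r < regexes.length; r++) {
                Matcher m = Pattern.compile(regexes[r]).matcher(input);
                m.region(pos, input.length());
                // For these simple patterns the greedy regex match is also the longest match.
                // Strictly-greater comparison means an earlier rule wins a tie.
                if (m.lookingAt() && m.end() - pos > bestLen) {
                    bestLen = m.end() - pos;
                    bestName = NAMES[r];
                }
            }
            if (bestLen == 0) {
                throw new IllegalStateException("no rule matches at position " + pos);
            }
            tokens.add(bestName + ":" + input.substring(pos, pos + bestLen));
            pos += bestLen;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [ID:A, ASSIGN:=, ID:B, ERROR:@, ERROR:@, ERROR:@, ID:C]
        System.out.println(scanWith(".", "A=B@@@C"));
        // prints [ERROR:A=B@@@C]
        System.out.println(scanWith(".+", "A=B@@@C"));
    }
}
```

In this sketch the ".+" rule wins at the very first character, because it always yields the longest match on the line, so even the valid tokens are swallowed into the error.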
- 10.0E-qq
The scanner sees the initial prefix (10.0E-) and thinks floating point, but the entire token is not valid.
Thus, it takes 10.0 as the longest valid complete token, even though 'E-' could still be the start of a valid exponent.
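The "back up to the longest complete token" behavior can be illustrated with java.util.regex. The floating-point pattern below is an assumption made for this sketch (digits, a point, digits, and an optional exponent), not the pattern from the course project, and the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration: the longest valid float at the start of the input.
class FloatPrefixDemo {
    // Assumed float pattern: digits '.' digits, optionally followed by E, a sign, and digits.
    static final Pattern FLOAT = Pattern.compile("[0-9]+\\.[0-9]+(E[+-]?[0-9]+)?");

    // Return the longest prefix of the input that is a complete float, or null if none.
    static String longestFloatPrefix(String input) {
        Matcher m = FLOAT.matcher(input);
        return m.lookingAt() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(longestFloatPrefix("10.0E-5;"));  // prints 10.0E-5
        System.out.println(longestFloatPrefix("10.0E-qq"));  // prints 10.0
    }
}
```

After returning 10.0, scanning resumes at "E-qq", so the remaining characters are matched against the token rules on their own.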
- Error handling continued on Monday (2/15/99)