What is a compiler?
A compiler operates in phases;
each phase translates the source program from one representation to
another. Different compilers may include different phases, and/or may
order them somewhat differently.
A typical organization is shown below.
The scanner is called by the parser;
here's how it works:
Example
Here are some Java lexemes and the corresponding tokens:
Note that multiple lexemes can correspond to the same token (e.g., there are
many identifiers).
Given the source code:
The parser:
Example
The semantic analyzer checks for (more) "static semantic" errors,
e.g., type errors.
It may also annotate and/or change the abstract syntax tree (e.g., it
might annotate each node that represents an expression with its type).
Example:
The intermediate code generator translates from abstract-syntax tree
to intermediate code.
One possibility is 3-address code (code in which each instruction
involves at most 3 operands).
Below is an example of 3-address code for the abstract-syntax tree
shown above.
Note that in this example, the first three instructions each have
exactly three operands (the location where the result of the
operation is stored, and the two operands of the operation);
the fourth instruction has just two operands ("position" and "temp3").
Introduction
Here's a simple pictorial view:
    source program --> COMPILER --> object program
                          |
                          --> error messages
A compiler is itself a program, written in some host
language.
(In cs536, students will implement a compiler for a simple source language
using Java as the host language.)
                source code
          (sequence of characters)
                    ||
                    ||
                    \/
          ----------------------
          |  lexical analyzer  |
          |     (scanner)      |
          ----------------------
                    ||
                    ||  sequence of tokens
                    \/
          ----------------------
          |  syntax analyzer   |
          |      (parser)      |
          ----------------------
                    ||
                    ||  abstract-syntax tree
                    \/
          ----------------------
          | semantic analyzer  |
          ----------------------
                    ||
                    ||  augmented, annotated abstract-syntax tree
                    \/
          ----------------------
          | intermediate code  |
          |     generator      |                      /\
          ----------------------                      ||
                    ||                             FRONT END
                    ||  intermediate code   ----------------------------------
                    \/                             BACK END
          ----------------------                      ||
          |     optimizer      |                      \/
          ----------------------
                    ||
                    ||  optimized intermediate code
                    \/
          ----------------------
          |        code        |
          |     generator      |
          ----------------------
                    ||
                    ||
                    \/
   object program (might be assembly code or machine code)
Below, we look at each phase of the compiler.
The Scanner
The definitions of what is a lexeme, token, or bad character all depend on
the source language.
    lexeme:                ;            =        index   tmp     37        102
    corresponding token:   SEMI-COLON   ASSIGN   IDENT   IDENT   INT-LIT   INT-LIT
position = initial + rate * 60 ;
a Java scanner would return the following sequence of tokens:
IDENT ASSIGN IDENT PLUS IDENT TIMES INT-LIT SEMI-COLON
Erroneous characters for Java source include # and control-a.
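
The grouping of characters into tokens described above can be sketched in Java as follows. This is only an illustration, not the scanner used in the course: the class name MiniScanner, the method tokenize, and the token name ERROR for bad characters are all hypothetical, and the sketch recognizes only identifiers, integer literals, and the four symbols = + * ; (no keywords, comments, or other Java lexemes).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, illustration-only scanner for a tiny subset of Java lexemes.
class MiniScanner {
    // Returns the sequence of token names for the given source text.
    static List<String> tokenize(String src) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;                                  // whitespace separates lexemes
            } else if (Character.isLetter(c) || c == '_') {
                // identifier: a letter or underscore, then letters, digits, underscores
                while (i < src.length()
                        && (Character.isLetterOrDigit(src.charAt(i)) || src.charAt(i) == '_')) {
                    i++;
                }
                tokens.add("IDENT");
            } else if (Character.isDigit(c)) {
                // integer literal: one or more digits
                while (i < src.length() && Character.isDigit(src.charAt(i))) {
                    i++;
                }
                tokens.add("INT-LIT");
            } else {
                switch (c) {
                    case '=': tokens.add("ASSIGN"); break;
                    case '+': tokens.add("PLUS"); break;
                    case '*': tokens.add("TIMES"); break;
                    case ';': tokens.add("SEMI-COLON"); break;
                    default:  tokens.add("ERROR");    // bad character, e.g. '#'
                }
                i++;
            }
        }
        return tokens;
    }
}
```

Under these assumptions, MiniScanner.tokenize("position = initial + rate * 60 ;") yields the eight tokens listed above, and a # in the input shows up as ERROR. A real scanner would normally report the bad character and its position rather than returning a placeholder token.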
The Parser
position = * 5 ;
corresponds to the sequence of tokens:
IDENT ASSIGN TIMES INT-LIT SEMI-COLON
All are legal tokens, but that sequence of tokens is erroneous.
    source code:           position = initial + rate * 60 ;

    abstract-syntax tree:         =
                                /   \
                         position     +
                                    /   \
                             initial      *
                                        /   \
                                    rate     60
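
One way to see how a parser both rejects an erroneous token sequence and builds the tree is a small recursive-descent sketch. Everything here is an assumption made for illustration: the class MiniParser, the grammar (stmt -> IDENT = expr ; with expr -> term { + term } and term -> factor { * factor }), the requirement that lexemes be separated by whitespace, and the choice to print the tree in parenthesized prefix form. It is not the parsing technique prescribed by these notes.

```java
// Hypothetical recursive-descent parser for statements like "x = a + b * 60 ;".
// Lexemes must be separated by whitespace; the tree is returned in prefix form.
class MiniParser {
    private final String[] lex;   // the lexemes, in order
    private int pos = 0;          // index of the next unread lexeme

    private MiniParser(String src) {
        lex = src.trim().split("\\s+");
    }

    // Parses one assignment statement; throws IllegalArgumentException on a syntax error.
    static String parse(String src) {
        MiniParser p = new MiniParser(src);
        String tree = p.stmt();
        if (p.pos != p.lex.length) {
            throw new IllegalArgumentException("syntax error: extra input after ';'");
        }
        return tree;
    }

    private String stmt() {                 // stmt -> IDENT = expr ;
        String target = ident();
        expect("=");
        String value = expr();
        expect(";");
        return "(= " + target + " " + value + ")";
    }

    private String expr() {                 // expr -> term { + term }
        String left = term();
        while (peek().equals("+")) {
            pos++;
            left = "(+ " + left + " " + term() + ")";
        }
        return left;
    }

    private String term() {                 // term -> factor { * factor }
        String left = factor();
        while (peek().equals("*")) {
            pos++;
            left = "(* " + left + " " + factor() + ")";
        }
        return left;
    }

    private String factor() {               // factor -> IDENT | INT-LIT
        String t = peek();
        if (t.matches("[A-Za-z_]\\w*|\\d+")) {
            pos++;
            return t;
        }
        throw new IllegalArgumentException("syntax error at '" + t + "'");
    }

    private String ident() {
        String t = peek();
        if (t.matches("[A-Za-z_]\\w*")) {
            pos++;
            return t;
        }
        throw new IllegalArgumentException("expected identifier but found '" + t + "'");
    }

    private void expect(String s) {
        if (!peek().equals(s)) {
            throw new IllegalArgumentException("expected '" + s + "' but found '" + peek() + "'");
        }
        pos++;
    }

    private String peek() {
        return pos < lex.length ? lex[pos] : "<eof>";
    }
}
```

With this sketch, MiniParser.parse("position = initial + rate * 60 ;") returns (= position (+ initial (* rate 60))), which has the same shape as the tree above: because term is parsed below expr, the multiplication ends up deeper in the tree than the addition. The input "position = * 5 ;" is rejected inside factor, since * cannot start an expression.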
Notes:
The Semantic Analyzer
Abstract syntax tree before semantic analysis
                 =
               /   \
              /     \
      position       +
                   /   \
                  /     \
           initial       *
                       /   \
                      /     \
                  rate       60
Abstract syntax tree after semantic analysis
                 = (float)
               /   \
              /     \
      position       + (float)
      (float)      /   \
                  /     \
           initial       * (float)
           (float)     /   \
                      /     \
                  rate       intToFloat (float)
                                 |
                                 |
                              60 (int)
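
The annotation step shown in the two trees can be sketched as a bottom-up walk that assigns a type to every node and wraps an int operand in an intToFloat node when the other operand is float. This is a minimal sketch under stated assumptions: the class TypeAnnotator, its Node representation, and the symbol table passed as a Map are hypothetical, only the types int and float exist, and position, initial, and rate are assumed to be declared float.

```java
import java.util.Map;

// Hypothetical type annotator for a tiny expression language with int and float.
class TypeAnnotator {
    static class Node {
        final String label;   // operator, identifier, or integer literal
        Node left, right;     // both null for a leaf; right is null for intToFloat
        String type;          // filled in by check()

        Node(String label, Node left, Node right) {
            this.label = label;
            this.left = left;
            this.right = right;
        }

        @Override
        public String toString() {
            if (left == null) {
                return label + ":" + type;
            }
            return "(" + label + ":" + type + " " + left
                    + (right != null ? " " + right : "") + ")";
        }
    }

    // Wraps an int-typed subtree in a conversion node of type float.
    private static Node convert(Node child) {
        Node conv = new Node("intToFloat", child, null);
        conv.type = "float";
        return conv;
    }

    // Annotates the subtree rooted at n and returns its type.
    static String check(Node n, Map<String, String> symbols) {
        if (n.left == null) {                       // leaf: literal or identifier
            n.type = n.label.matches("\\d+") ? "int" : symbols.get(n.label);
            if (n.type == null) {
                throw new IllegalStateException("undeclared identifier: " + n.label);
            }
            return n.type;
        }
        String lt = check(n.left, symbols);
        String rt = check(n.right, symbols);
        if (lt.equals("float") && rt.equals("int")) {
            n.right = convert(n.right);             // widen the right operand
        } else if (lt.equals("int") && rt.equals("float")) {
            if (n.label.equals("=")) {
                throw new IllegalStateException("type error: cannot assign float to int");
            }
            n.left = convert(n.left);               // widen the left operand
        }
        n.type = (lt.equals("float") || rt.equals("float")) ? "float" : "int";
        return n.type;
    }
}
```

Applied to the tree for position = initial + rate * 60 with all three identifiers declared float, check annotates the =, +, and * nodes as float and replaces the literal 60 with an intToFloat node whose child is 60 (int), matching the tree after semantic analysis. Assigning a float value to an int variable is reported as a type error, which is the Java rule; other source languages may choose differently.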
The Intermediate Code Generator
temp1 = inttofloat(60)
temp2 = rate * temp1
temp3 = initial + temp2
position = temp3
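
The translation from the annotated tree to the four instructions above can be sketched as a post-order walk that emits one instruction per interior node and returns the name holding that node's value. This is a sketch, not the course's code generator: the class ThreeAddressGen, its Node type, and the temp1, temp2, ... naming scheme are assumptions chosen to reproduce the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical generator of 3-address code from a small abstract-syntax tree.
class ThreeAddressGen {
    static class Node {
        final String label;       // operator, identifier, or literal
        final Node left, right;   // both null for a leaf; right is null for intToFloat

        Node(String label, Node left, Node right) {
            this.label = label;
            this.left = left;
            this.right = right;
        }
    }

    private final List<String> code = new ArrayList<>();
    private int temps = 0;

    private String newTemp() {
        return "temp" + (++temps);
    }

    // Emits code for the subtree rooted at n; returns the name that holds its value.
    private String gen(Node n) {
        if (n.left == null) {
            return n.label;                           // leaf: use the name or literal directly
        }
        if (n.label.equals("=")) {
            String value = gen(n.right);
            code.add(n.left.label + " = " + value);   // two operands: target and source
            return n.left.label;
        }
        if (n.label.equals("intToFloat")) {
            String arg = gen(n.left);
            String t = newTemp();
            code.add(t + " = inttofloat(" + arg + ")");
            return t;
        }
        String l = gen(n.left);                       // binary operator: left, then right
        String r = gen(n.right);
        String t = newTemp();
        code.add(t + " = " + l + " " + n.label + " " + r);
        return t;
    }

    // Returns the instruction sequence for the whole tree.
    static List<String> generate(Node root) {
        ThreeAddressGen g = new ThreeAddressGen();
        g.gen(root);
        return g.code;
    }
}
```

Because operands are generated before the instruction that uses them, the conversion of 60 is emitted first and the assignment to position last, giving the same four instructions as the example. Each interior node other than the assignment gets a fresh temporary; a later optimization phase could, for instance, remove temp3 by storing the sum directly into position.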