Code can be generated by a syntax-directed translation while parsing or by traversing the abstract syntax tree after the parse (i.e., by writing a codeGen method for the appropriate kinds of AST nodes). We will assume the latter approach, and will discuss code generation for a subset of the C language. In particular, we will discuss generating MIPS assembly code suitable for input to the Spim interpreter. Some information on Spim is provided in the next section; the following sections discuss code generation for:
Documentation on Spim is available on-line:
To run the (plain) Spim interpreter, type:
To run Spim with an X-windows interface (on a Linux machine), type just: xspim. This will cause a window to open. Click on the load button in that window, then type in the name of your assembly-code file, then press return. If there are syntax errors, you will see error messages. If there are no errors, you can run your program by clicking on run, then (in the small window that will be opened) on OK. If your program generates any output, a new window will be opened to display that output.
Spim uses the following special registers (there are others; you can see the on-line Spim documentation for those, but you should not need them for the class project):
Register | Purpose | |
\$sp | stack pointer | |
\$fp | frame pointer | |
\$ra | return address | |
\$v0 | used for system calls and to return int values from function calls, including the syscall that reads an int | |
\$f0 | used to return double values from function calls, including the syscall that reads a double | |
\$a0 | used for output of int and string values | |
\$f12 | used for output of double values | |
\$t0 - \$t7 | temporaries for ints | |
\$f0 - $f30 | registers for doubles (used in pairs; i.e., use \$f0 for the pair \$f0, \$f1) |
Auxiliary Constants and Methods
To simplify the task of code generation, it is convenient to have a set of constants (final static fields) that define the string representations of the registers that will be used in the generated code and the values used to represent true and false, as well as a set of methods for actually writing the generated code to a file. We will assume that we have the following register constants: SP, FP, RA, V0, A0, T0, T1, F0, F12, as well as the constants TRUE and FALSE (and that TRUE is represented as 1, and false as 0). We will also assume that we have the following methods:
Method | Purpose | |
generate | write the given op code and arguments, nicely formatted, to the output file | |
generateIndexed | the arguments are: an op code, a register R1, another register R2, and an offset; generate code of the form: op R1, offset(R2) | |
genPush(String reg, int bytes) |
generate code to push the value in the given register onto the stack; parameter bytes is 4 for an int and 8 for a double |
|
genPop(String reg, int bytes) | generate code to pop the top-of-stack value into the given register | |
nextLabel | return a string to be used as a label (more on this later) | |
genLabel | given a label L, generate: L: |
For each global variable v, generate:
.data .align 2 # align on a word boundary _v: .space Nwhere N is the size of the variable in bytes. (Scalar integer variables require 4 bytes; double variables require 8 bytes.) This code tells the assembler to set aside N bytes in the static data area, in a location labeled with the name _v.
Example: Given this source code:
int x; double y;you should generate this code:
.data .align 2 _x: .space 4 .data .align 2 _y: .space 8It is not actually necessary to generate .data if the previous generated code was also for a global variable declaration; however, since function declarations can be intermixed with global variable declarations (and cause code to be generated in the text area, not the static data area), this may not be the case; it is safe (and easier) just to generate those directives for every global variable.
For every function you will generate code for:
For the main function, generate:
.text .globl main main:
For all other functions, generate:
.text _<functionName>:using the actual name in place of <functionName>. This tells the assembler to store the following instructions in the text area, labeled with the given name.
After generating this "preamble" code, you will generate code for (1) function entry, (2) function body, and (3) function exit.
We assume that when a function starts executing, the stack looks like this:
# (1) Push the return addr sw $ra, 0($sp) subu $sp, $sp, 4 # (2) Push the control link sw $fp, 0($sp) subu $sp, $sp, 4 # (3) set the FP # Note: our convention for $sp is that it points to the first unused word of the stack. # The reason for adding 8 is that the unused word and the control link each take 4 bytes addu $fp, $sp, 8 # (4) Push space for the locals subu $sp, $sp, <size of locals in bytes>Note: <size of params> and <size of locals> will need to be available to the code generator. The symbol-table entry for the function name will have information about the parameters (because that will have been used for type checking). For example, it might have a list of the symbol-table entries for the parameters. You could also store the total size of the parameters in the function name's symbol-table entry, or you could write a method that takes the list of parameters as its argument and computes the total size. It is not so easy to compute the total size of the local variables at code-generation time; it is probably a better idea to do that during name analysis. The name-analysis phase will be computing the offsets for the parameters and local variables anyway; it should not be difficult to extend that code to also compute the total size of the locals (and to store that information in the function name's symbol-table entry).
Note: we are talking about the codeGen method for the FnBodyNode, whose subtree will look like this:
There is no need to generate any code for the declarations. So to generate code for the function body, just call the codeGen method of the StmtListNode, which will in turn call the codeGen method of each statement in the list. What those methods will do is discussed below in the section on Code Generation for Statements.
Just before a function returns, the stack looks like this:
Here is the code that needs to be generated:
lw $ra, 0($fp) # load return address move $t0, $fp # FP holds the address to which we need to restore SP lw $fp, -4($fp) # restore FP move $sp, $t0 # restore SP jr $ra # returnNote that there are two things that cause a function to return:
What about a return statement that returns a value? As discussed below, the codeGen method for the returned expression will generate code to evaluate that expression, leaving the value on the stack. The MIPS convention is to use register V0 to return a int value from a function and to use register F0 to return a double value. So the codeGen method for the return statement should generate code to pop the value from the stack into the appropriate register (before generating the "return" code or the jump to the return code discussed above).
Code Generation for Statements
You will write a different codeGen method for each kind of StmtNode. You are strongly advised to write this method for the WriteIntStmtNode, WriteDblStmtNode, and WriteStrStmtNode first. Then you can test code generation for the other kinds of statements and the expressions by writing a program that computes and prints a value. It will be much easier to find errors in your code this way (by looking at the output produced when a program is run) than by looking at the assembly code you generate.
To generate code for a write statement whose expression is of type int you must:
Below is the code you would write for the codeGen method of the WriteIntStmtNode.
// step (1) myExp.codeGen(); // step (2) genPop(A0, 4); // step (3) generate("li", V0, 1); // step (4) generate("syscall");The code for the WriteStrStmtNode and the WriteDblStmtNode is similar, except for the following:
If-Then Statement The AST for an if-then statement looks like:
------------ | IfStmtNode | ------------ / | \ --------- -------------- ------------- | ExpNode | | DeclListNode | | StmtListNode| --------- -------------- -------------There are two different approaches to generating code for statements that involve conditions (e.g., for if statements and while loops):
Note that the code generated for an if-then statement will need to include a label. Each label in the generated code must have a unique name (although we will refer to labels in these notes using names like "FalseLabel" as above). As discussed above, we will assume that there is a method called nextLabel that returns (as a String) a new label every time it is called, and we will assume that there is a method called genLabel that prints the given label to the assembly-code file.
Question 1: What is the form of the code generated by an IfElseStmtNode's codeGen method?
Question 2: What is the actual code that needs to be written for the IfStmtNode's codeGen method?
Question 3: What is the form of the code generated by a WhileStmtNode's codeGen method?
The AST for a return statement is either:
---------------- | ReturnStmtNode | ----------------or:
---------------- | ReturnStmtNode | ---------------- | ----------- | ExpNode | -----------As discussed above, if a value is being returned, the ReturnStmtNode's codeGen method should call its ExpNode's codeGen method (to generate code to evaluate the returned expression, leaving the value on the stack), then should generate code to pop that value into register V0 or register F0 (depending on its type).
To generate the code that actually does the return, use one of the following approaches:
The AST for a read statement is one of the following:
------------------- ------------------- | ReadIntStmtNode | | ReadDblStmtNode | ------------------- ------------------- | | ---------- ---------- | IdNode | | IdNode | ---------- ----------
To read an integer value into register V0, you must generate this code:
li $v0, 5 syscallThe code loads the special value 5 into register V0, then does a syscall. The fact that V0 contains the value 5 tells the syscall to read an integer value from standard input, storing the value back into register V0. So we must start by generating the above code, then we must generate code to store the value from V0 to the address of the IdNode.
To read a double value, load the value 7 into V0, do a syscall, and expect the input value to be in register F0 (not V0).
Digression: Code Generation for IdNodes
Before considering other kinds of statements, let's think about the role identifiers will play in code generation. Names show up in the following contexts:
We use "codeGen" for the second case (fetching the value and pushing it onto the stack) since that is what the codeGen methods of all ExpNodes must do.
lw $t0 _g | // load the value of int global g into T0 |
lw $t0 0($fp) | // load the value of the int local stored at offset 0 into T0 |
l.d $f0 _d | // load the value of dbl global d into F0 |
l.d $f0 -4($fp) | // load the value of the dbl local stored at offset -4 into F0 |
Note that this means there must be a way to tell whether an IdNode represents a global or a local variable. There are several possible ways to accomplish this:
Here is the code you need to generate to load the address of a global variable g into register T0 (this works whether g is int or double, since an address always takes 4 bytes):
---------------- | AssignStmtNode | ---------------- | ------------ | AssignNode | ------------ / \ -------- --------- | IdNode | | ExpNode | -------- ---------The AssignStmtNode's codeGen method can call the AssignNode's method to do the assignment, but be careful: an AssignNode is a subclass of ExpNode, so like all nodes that represent expressions, it must leave the value of the expression on the stack. Therefore, the AssignStmtNode must generate code to pop (and ignore) that value.
Code Generation for Expressions
The codeGen method for the subclasses of ExpNode must all generate code that evaluates the expression and leaves the value on top of the stack. We have already talked about how to do this for IdNodes; in the subsections below we discuss code generation for other kinds of expressions.
The codeGen method for an assignment expression must generate code to:
The codeGen methods for IntLitNodes must simply generate code to push the literal value onto the stack. The generated code will look like this:
li $t0, <value> # load value into T0 sw $t0, ($sp) # push onto stack subu $sp, $sp, 4
The code for a DblLitNode is similar, except that the value must be loaded into a double register using the appropriate opcode, and the "push" code is different for a double value. For example:
li.d $f0, <value> # load value into F0 s.d $f0, -4($sp) # push onto stack subu $sp, $sp, 8
For a StringLitNode, the string literal itself must be stored in the static data area, and its address must be pushed. The code to store a string literal in the static data area looks like this:
.data <label>: .asciiz <string value>Note:
The code you need to generate to push the address of a string literal onto the stack looks like this:
.text la $t0, <label> # load addr into $t0 sw $t0, ($sp) # push onto stack subu $sp, $sp, 4
The AST for a function call looks like:
-------------- | CallExpNode | -------------- / \ -------- ------------- | IdNode | | ExpListNode | -------- -------------We need to generate code to:
Note that there is also a call statement:
-------------- | CallStmtNode | -------------- | -------------- | CallExpNode | -------------- / \ -------- ------------- | IdNode | | ExpListNode | -------- -------------In this case, the called function may not actually return a value (i.e., may have return type void). It doesn't hurt to have the CallExpNode's codeGen method push the value in V0 (or F0) after the call (it will just be pushing some random garbage), but it is important for the CallStmtNode's codeGen method to pop that value.
In general, the codeGen methods for the non short-circuited operators (PlusNode, MinusNode, ..., NotNode, LessNode, ..., EqualsNode, etc.) must all do the same basic sequence of tasks:
To minimize the amount of code you must write, you might want to define subclasses of ExpNode that have very similar code-generation methods. For example, you could define a BinaryExpNode class whose codeGen method does all of the steps above, calling an opCode method (that would be implemented in each BinaryExpNode subclass) to get the appropriate opcode for use in step 3.
Note that the comparison operators (described in Section 2.6 of the Spim Reference Manual) produce one when the result of the comparison is true. However, the not operator does a bitwise logical negation, so it will not work correctly if you use 0 and 1 to represent true and false. Thus, it is better to use the seq operator instead of the not operator.
Example: Recall that the AST for an addition looks like this:
---------- | PlusNode | ---------- / \ --------- --------- | ExpNode | | ExpNode | --------- ---------Here is the codeGen method for the PlusNode, assuming that both operands are ints.
public void codeGen() { // step 1: evaluate both operands myExp1.codeGen(); myExp2.codeGen(); // step 2: pop values in T0 and T1 genPop(T1, 4); genPop(T0, 4); // step 3: do the addition (T0 = T0 + T1) generate("add", T0, T0, T1); // step 4: push result genPush(T0, 4) }
To illustrate how code is generated for an expression involving several operators, consider generating code for the expression: b + c * d. Here is the AST for the expression:
PlusNode / \ IdNode TimesNode b / \ IdNode IdNode c dBelow is the sequence of calls that would be made at compile time to generate code for this expression, and a description of what the generated code does.
Sequence of calls What the generated code does ----------------- ---------------------------- +--- PlusNode.codeGen() | IdNode.codeGen() ---------> push b's value | +- TimesNode.codeGen() | | IdNode.codeGen() ---------> push c's value | | IdNode.codeGen() ---------> push d's value | | | +-------------------------------> pop d's value into T1 | pop c's value into T0 | T0 = T0 * T1 | push T0's value | +---------------------------------> pop result of * into T1 pop b's value into T0 T0 = T0 + T1 push T0's value
The short-circuited operators are represented by AndNodes and OrNodes. "Short-circuiting" means that the right operand is evaluated only if necessary. For example, for the expression (j != 0) && (k/j > epsilon), the sub-expression (k/j > epsilon) is evaluated only if variable j is not zero. Therefore, the code generated for an AndNode must work as follows:
evaluate the left operand if the value is true then evaluate the right operand; that value is the value of the whole expression else don't bother to evaluate the right operand the value of the whole expression is falseSimilarly, for an OrNode:
evaluate the left operand if the value is false then evaluate the right operand; that value is the value of the whole expression else don't bother to evaluate the right operand the value of the whole expression is trueThis means that the code generated for the logical operators will need to involve some jumps depending on the values of some expressions.
Expand the outlines given above for the code generated for AndNodes and OrNodes, giving a lower-level picture of the generated code. Use the outline of the code generated for an if-then statement as a model of what to write.
As mentioned above in the section on If-Then Statements, there are actually two different approaches to generating code for statements with conditions (like if statements and while loops):
First, let's reconsider code generation for an if-then statement; this time we'll use the control-flow method instead of the numeric method for evaluating the condition part of the statement. Under this new assumption, the code generated by the IfStmtNode's codeGen method will have the following form:
-- code to evaluate the condition, jumping to TrueLab if it is true, and to DoneLab if it is false -- TrueLab: -- code for the statement list -- DoneLab:The actual code written for the IfStmtNode's codeGen method will be:
public void codeGen() { String trueLab = nextLabel(); String doneLab = nextLabel(); myExp.genJumpCode(trueLab, doneLab); genLabel(trueLab); myStmtList.codeGen(); genLabel(doneLab); }
To implement the control-flow approach, in addition to changing the codeGen method for the statements that have conditions, we must also write a new genJumpCode method for each AST node that could represent a condition. Below, we look at two representative cases, for IdNode and LessNode. We give the code generated for the (old) numeric approach, and for the (new) control-flow approach so that we can see which code is better (in terms of the number of instructions). (The code given below assumes that all operands are int, and it is not quite assembly code -- for example, for clarity, we use "push" instead of the actual two instructions that implement a push operation.)
IdNode: numeric control-flow ------- ------------ lw $t0, <var's addr> lw $t0, <var's addr> push $t0 beq $t0, FALSE, falseLab b trueLabNote that in both approaches to generating code for an IdNode, 3 instructions are generated (because "push" is actually 2 instructions). However, while all 3 of the numeric-approach instructions will always execute, the last instruction generated by the control-flow approach will execute only if the value of the Id is true.
LessNode: numeric control-flow ------- ------------ -- code to evaluate both operands -- ditto -- pop values into T1, T0 -- ditto slt $t2, $t0, $t1 blt $t0, $t1, trueLab push $t2 b falseLabThe operands of a LessNode are integer expressions, not boolean expressions, so both approaches start by generating (the same) code to evaluate the operands and to pop their values into registers T0 and T1. After that, however, the numeric approach will generate 3 instructions (to set $t2 to TRUE or FALSE as appropriate, then to push that value onto the stack -- remember that "push" is really two instructions), while the control-flow code will generate only 2 instructions. Furthermore, as was the case for the IdNode, all three of the numeric-approach instructions will always execute, while the last instruction generated by the control-flow approach will only execute if the comparison evaluates to FALSE.
Now let's consider how to write the new genJumpCode method for the short-circuited operators (AndNodes and OrNodes).
Recall that the AST for an && expression looks like this:
--------- | AndNode | --------- / \ --------- --------- | ExpNode | | ExpNode | --------- ---------Here's how the genJumpCode method of the AndNode works:
The AndNode's genJumpCode method would be:
public void genJumpCode(String trueLab, String falseLab) { String newLab = nextLabel(); myExp1.genJumpCode(newLab, falseLab); genLabel(newLab); myExp2;genJumpCode(trueLab, falseLab); }Example: Consider the code that would be generated for the statement: if (a && b>0) { ... }, represented by the AST:
------------ | IfStmtNode | ------------ / | \ --------- -------------- ------------- | AndNode | | DeclListNode | | StmtListNode| --------- -------------- ------------- / \ -------- ---------- | IdNode | | LessNode | -------- ---------- / \ ... ...The IfStmtNode's codeGen method would create two labels, TrueLab and DoneLab, and would call the AndNode's genJumpCode method, passing those labels as the arguments. The AndNode's genJumpCode method would create one new label (NewLab) and then would call the genJumpCode method of its left child (the IdNode), passing NewLab and DoneLab as the arguments. It would then generate the NewLab label, and then would call its right child's genJumpCode method, passing TrueLab and DoneLab. The code generated for the whole condition would look like this:
Generated Code Generated By -------------- ------------ -- code to load the value of a into T0 IdNode jump to DoneLab if T0 == FALSE IdNode jump to NewLab IdNode NewLab: AndNode -- code to push the value of b LessNode's child -- code to push the iteral 0 LessNode's child pop into T1 LessNode pop into T0 LessNode jump to TrueLab if T0 < T1 LessNode jump to DoneLab LessNode(Of course, the actual label would be something like L3, not NewLab.) After calling the AndNode's genJumpCode method, the IfStmtNode's codeGen method would call genLabel to print TrueLab, then would call its StmtList child's codeGen method to generate code for the list of statements. Finally, it would call genLabel to print DoneLab. So the code generated for this if statement would be like this:
-- code to load the value of a into T0 jump to DoneLab if T0 == FALSE jump to NewLab NewLab: -- code to push the value of b -- code to push the iteral 0 pop into T1 pop into T0 jump to TrueLab if T0 < T1 jump to DoneLab TrueLab: -- code for the list of statements DoneLab:
Question 1: What is the form of the code generated by an OrNode's genJumpCode method?
Question 2: What is the form of the code generated by a NotNode's genJumpCode method?
How does the code generated for an AndNode using the control-flow method compare to the code generated using the numeric method? Here are outlines of the code generated in each case:
Numeric Code Control-Flow Code ------------ ----------------- -- code to evaluate left -- code to evaluate left -- operand, leaving the -- operand, including jumps -- value on the stack -- to NewLab and FalseLab pop into T0 goto TrueLab if T0 == TRUE newLab: push FALSE goto DoneLab TrueLab: -- code to evaluate right -- code to evaluate right -- operand, leaving the -- operand, including -- value on the stack -- jumps to TrueLab and -- FalseLab DoneLab:Note that the numeric code includes 6 instructions (shown in bold) in addition to the ones generated by the codeGen methods of the two children, while in the control-flow case, no instructions are generated by the AndNode itself (just a label); the instructions are all generated by the genJumpCode methods of its two children. Those children could represent names, boolean literals, comparisons or logical expressions. We have already seen that in the first three cases, better code is generated using the control-flow approach. Now we see that for logical expressions (at least for AndNodes) fewer instructions are generated using the control-flow approach than using the numeric approach.
Compare the code generated by the two approaches for an OrNode and for a NotNode.
Finally, let's compare the code generated by the numeric and control-flow approaches for an if-then statement. Here are outlines of the two different versions of the code that would be generated:
Numeric Code Control-Flow Code ------------ ----------------- -- code for condition, leaving -- code for condition, -- value on the stack -- including jumps to -- trueLab and falseLab pop into T0 goto falseLab if T0 == FALSE trueLab: -- code for "then" stmts -- code for "then" stmts goto doneLab goto doneLab falseLab: falseLab: -- code for "else" stmts -- code for "else" stmts doneLab: doneLab:Note that much of the code is the same for the two methods; the code generated to evaluate the condition will be different, and the numeric method has three extra instructions: a pop, followed by a conditional goto (those instructions are shown in bold font). So as long as the code generated for the condition is no worse for the control-flow method than for the numeric method, the control-flow method is better (both in terms of the number of instructions generated, and in terms of the number of instructions that will be executed).
But we have already looked at the code generated for all the different kinds of conditions, for each of the two approaches, and in fact the control-flow method was never worse, and was sometimes better. Thus, the control-flow method is the winner, in terms of generating less and more efficient code!