CS 701, Assignment #1
An Introduction to Sparc Code Optimization

Due: Tuesday, September 27, 2005

Not accepted after: Tuesday, October 4, 2005

The purpose of this assignment is to familiarize you with the Sparc instruction set, the Sparc assembler syntax, and a variety of important compiler optimizations. The SPARC Architecture Manual details the Sparc architecture and instruction set. You can use this manual as necessary to resolve detailed questions about the SPARC ISA. The workstations we'll be using in this class are designated n01.cs.wisc.edu through n16.cs.wisc.edu; they are UltraSPARC-IIe processors running Solaris.

One easy way to become familiar with the Sparc instruction set is to look at the code generated by the standard departmental C/C++ compiler (gcc). If we execute

gcc -S prog.c
gcc will create prog.s, an assembler file containing the code it generated for prog.c. If you are working on a workstation that does not use a Sparc processor, you will need to log onto one that does (otherwise, you'll see code for an entirely different architecture!). For example, given a file test.c that contains:
int incr(int i){
   return i+1;
}
int main(){
  int a;
  return incr(a);
}
gcc generates
	.file	"prog1.c"
	.section	".text"
	.align 4
	.global incr
	.type	incr, #function
	.proc	04
incr:
	!#PROLOGUE# 0
	save	%sp, -112, %sp
	!#PROLOGUE# 1
	st	%i0, [%fp+68]
	ld	[%fp+68], %g1
	add	%g1, 1, %g1
	mov	%g1, %i0
	ret
	restore
	.size	incr, .-incr
	.align 4
	.global main
	.type	main, #function
	.proc	04
main:
	!#PROLOGUE# 0
	save	%sp, -120, %sp
	!#PROLOGUE# 1
	ld	[%fp-20], %o0
	call	incr, 0
	 nop
	mov	%o0, %g1
	mov	%g1, %i0
	ret
	restore
	.size	main, .-main
	.ident	"GCC: (GNU) 3.4.4"
Opcodes that begin with a "." are assembler directives. If we look at the label incr we see the body of the incr function. In the save instruction a new register window (for the function's local and out registers) is created and the function's frame is pushed onto the run-time stack (by subtracting the frame size, 112 bytes, from the caller's stack pointer (%sp) to create the callee's stack pointer). (The two references to %sp in the save instruction are to different registers!) The caller's stack pointer becomes the callee's frame pointer (%fp).

The function parameter, passed in register %i0, is saved into its memory location (%fp+68), and then is immediately reloaded into %g1. One is added to i (in %g1), then %g1 is copied to %i0. (Register %i0 is the standard function return register.) In the Sparc, branches, calls and returns are delayed. That is, the instruction following a branch, call or return is executed before any transfer of control occurs. This takes some getting used to when you read Sparc assembler; you need to mentally transpose the instructions to get the actual execution order. Thus when we see a ret, restore pair, the restore is done before the return actually occurs. A restore restores the caller's register window (by restoring the caller's stack pointer and frame pointer). The function return value, placed in the callee's %i0 register, is now in the caller's %o0 register (because in the register window mechanism, a caller's output registers (%o0 to %o7) overlap a callee's input registers (%i0 to %i7)). Finally, the ret instruction branches to the return point, which is the address %i7+8. (Register %i7 contains the address of the calling instruction, set by the call. But why is the return point 8 bytes past the call instruction?)

Function main, compiled at label main, is similar in structure to incr. It pushes its frame using a save instruction. It loads local a from its frame location, %fp-20 (the brackets indicate that a memory location is being addressed). Local a is correctly loaded into register %o0, the register assigned to the first scalar parameter. The call instruction transfers control to label incr; the address of the call instruction is stored in register %o7. The nop after the call indicates that the compiler found nothing useful to do prior to the call. Upon return, incr's return value, in %o0, is clumsily moved to %g1 and then %g1 is moved to register %i0, the return value register. Finally, the restore/ret pair is executed to pop main's frame and return to the caller.

Gcc, in unoptimized mode, generates rather simple-minded code. It does far better when optimization is requested. Flags -O1, -O2, and -O3 request progressively more extensive optimization. Often -O1 is enough to produce a dramatic improvement in code quality. Let's see what is produced by -O1 for our above example:

	.file	"prog1.c"
	.section	".text"
	.align 4
	.global incr
	.type	incr, #function
	.proc	04
incr:
	!#PROLOGUE# 0
	!#PROLOGUE# 1
	retl
	add	%o0, 1, %o0
	.size	incr, .-incr
	.align 4
	.global main
	.type	main, #function
	.proc	04
main:
	!#PROLOGUE# 0
	save	%sp, -112, %sp
	!#PROLOGUE# 1
	call	incr, 0
	 nop
	ret
	restore %g0, %o0, %o0
	.size	main, .-main
	.ident	"GCC: (GNU) 3.4.4"

The generated code has been improved considerably. The function incr is recognized as a leaf procedure (a procedure that makes no calls). A leaf procedure needs no frame. Rather, incr simply accesses its parameter from the parameter register %o0. It increments i, and returns it in %o0. The add is placed in the delay slot of retl, which returns using %o7 (since no new register window was created for the call).

The main function has also been improved. Local variable a has been assigned to register %o0, the parameter register. Function incr is called as before, returning its value in %o0. A restore/ret pair is executed, but with a clever twist. The restore adds %g0, which always contains 0, to %o0. Why? Well, the %o0 that is the target of the trivial addition is in the caller's window, effectively moving the return value from the callee's register window to the correct register in the caller's window.

Instruction Count Optimizations

Compilers implement a wide variety of optimizations. Most of these optimizations employ a common strategy--reduce the total number of dynamic instructions needed to implement a computation. This makes a great deal of sense--the fewer the instructions needed, the faster a program should be. As we will see later, this isn't the whole story, but it is certainly a reasonable first step. In the following list, we identify and describe a number of such instruction count optimizations. More information on these (and many other optimizations) may be found in a survey paper on compiler transformations.

You will need to create a small C program (only a few lines if possible) that can be improved using the given optimization. Run gcc without optimization on your program, demonstrating an unoptimized translation. Then run gcc using -O1 (or -O2 or -O3). Show the optimized code and explain what has been changed in the code as a result of the given optimization. In a few cases you may need to change (by hand) the source program to "force" gcc to implement the optimization. Alternatively, you may edit (very carefully) the Sparc instructions generated. You can feed a ".s" file back into gcc to get it to assemble the given instructions (and link in the needed libraries). In cases where you hand optimize the source or generated code, you should explain why your optimizations are valid, and why they improve the translation.
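For example, after hand-editing prog.s you can assemble and link it with

gcc prog.s -o prog

(the file name here is just a placeholder for whatever your program is actually called).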

As an example, if you were asked to illustrate the "leaf procedure" optimization, in which a function (or procedure) that makes no calls is translated so that it pushes no frame or register window, you could use the above example. In the original translation, incr pushes a frame and register window, but in the translation produced by the O1 option, the parameter is accessed directly from the register passed by the caller.

  1. Redundant Expression Elimination (Common Subexpression Elimination)
    Using this optimization, an address or value that has been previously computed is reused rather than being unnecessarily recomputed. (See the sketch following this list.)

  2. Partially Redundant Expression Elimination
    This is a variant of Redundant Expression Elimination. It may be that an address or value has been computed previously along some, but not all, execution paths. If computation of the address or value is added to those paths on which it is not available, a later computation becomes fully redundant and can be eliminated. The goal here is to compute the address or value only once on all execution paths where it is needed.

  3. Constant Propagation
    A variable may be known to contain a particular constant value at a given point (because previous initializations or assignments all use that constant value). References to the variable may be replaced with the constant value the variable is known to contain.

  4. Copy Propagation
    After the assignment of one variable to another, a reference to one variable may be replaced with the other variable (until one or the other of the variables is reassigned). (Copy propagation often "sets up" dead code elimination.)

  5. Constant Folding
    An expression involving constant (literal) values may be evaluated and simplified to a constant value. (Constant propagation enhances the usefulness of constant folding.)

  6. Dead Code Elimination
    Expressions or statements whose values or effects are unused may be eliminated.

  7. Loop Invariant Code Motion
    An expression that is invariant in a loop may be moved to the loop's header, evaluated once, and reused within the loop. (See the sketch following this list.)

  8. Scalarization (Scalar Replacement)
    A fixed field of a structure or a fixed element of an array that is repeatedly read and/or written may be copied to a local variable, accessed, and later copied back. This allows the local variable (and, in effect, the field or array component) to be allocated to a register. (Array elements and fields of structures aren't directly assigned to registers. Why?)

  9. Local Register Allocation
    Within a basic block (a straight line sequence of code) variables and constants that are repeatedly accessed may be allocated to registers.

  10. Global Register Allocation
    Within a subprogram, frequently accessed variables and constants are allocated to registers.

  11. Interprocedural Register Allocation
    Variables and constants accessed by more than one subprogram may be allocated to registers (rather than loaded, accessed and stored within each subprogram).

  12. Register Targeting
    According to architectural or operating system conventions, certain values must be placed in designated registers (parameter values, function return values, return addresses, stack and frame pointers, etc.). Rather than compute a value into an arbitrary register and then copy it to a designated register, the value can be computed directly into the register in which it will eventually be needed.

  13. Interprocedural Code Motion
    Instructions executed within a subprogram may be moved across a call to the subprogram's caller.

  14. Call Inlining
    At the site of a call, the body of the called subprogram may be inserted, with actual parameters used to initialize formal parameters.

  15. Code Hoisting and Sinking
    Sometimes the same code sequence appears in two or more alternative execution paths (e.g., arms of a conditional). It may be possible to hoist the common code sequence to a shared ancestor or sink it to a shared successor. This reduces code size, but does not reduce the total number of instructions executed.

  16. Loop Unrolling
    Loops are normally implemented by repeatedly executing the loop body using a conditional branch. For small loop bodies, the cost of testing loop termination (and incrementing a loop index) may be a significant fraction of the overall loop execution time. Moreover, values computed within one iteration can't readily be reused within another iteration. Loop unrolling duplicates a loop body N times (where N is chosen by the user or the compiler). After unrolling, only 1/N as many loop iterations occur. Further, code within the N duplicated loop bodies may be optimized, using common subexpression elimination, scalarization, etc.
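
To make two of these concrete, here is a small fragment (the function and variable names are made up for illustration) that contains both a redundant expression (optimization 1) and a loop-invariant expression (optimization 7). Exactly what gcc does with it depends on the compiler version and optimization level, so treat it only as a starting point for your own experiments:

int sum_scaled(int *v, int n, int scale) {
   int i, total = 0;
   for (i = 0; i < n; i++) {
      /* scale*4 is loop invariant: an optimizing compiler can compute it
         once, before the loop, rather than on every iteration */
      total = total + v[i] * (scale * 4);
   }
   /* the two uses of v[n-1] involve the same (redundant) address
      computation; an optimizing compiler computes that address once */
   return total + v[n-1] + v[n-1] * scale;
}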

Beyond Instruction Counts

Reducing total instruction counts is an important goal, but it isn't by any means the whole story in optimization. Other aspects of the target machine must be considered and exploited to improve overall performance.

Code Scheduling and Software Pipelining

Most modern processors are pipelined. That is, processors strive to make all instructions appear to have the same execution time, issuing and retiring an instruction in one cycle. This effect is often counterintuitive. A complex operation like addition takes no more time than a simpler operation like a bitwise or. Often multiplication is no slower than addition (even in floating point!).

For better or worse, the unit-time execution model is a close, but not entirely accurate, model of instruction execution. Not all processors handle all instructions in unit time. Sometimes integer or floating point multiplication is slower than integer or floating point addition. Division is almost always slower than multiplication or addition. Loads take an unpredictable amount of time, depending on whether the desired value is in a cache or main memory.

You can use the Unix command

time prog args
to measure (to approximately 1/60 second resolution) the time needed to execute prog args. You can time a given program several times, averaging the reported execution time, to obtain a more accurate timing.

Compile (at the -O1 level) and time the following program:

float a[101],b[101],c[101],d[101];

main() {
   int i,j;
   for(j=0;j<101;j++){
	a[j]=b[j]=c[j]=d[j]=1.0;
   }
   for (i=1;i<=1000000;i++){
      for(j=0;j<100;j++){
	a[j]=a[j]+a[j+1];
	b[j]=b[j]-b[j+1];
	c[j]=c[j]*c[j+1];
	d[j]=d[j]/d[j+1];
      }
   }
}
Now change the division in the assignment to d[j] into a multiplication and recompile and retime the program. Even though exactly the same number of operations are performed, execution time changes drastically. Why?

Well, the UltraSPARC-IIe and the UltraSPARC-IIi have essentially the same microarchitecture, and the UltraSPARC-IIi User Manual (p 10) tells us that floating point divides aren't pipelined. A single precision division takes twelve cycles rather than one to produce an answer.

A common approach to handling instructions that have long execution times (often called long latency instructions) is to schedule them. That is, a long latency operation, like a floating point division, is started well before its result is needed so that other instructions can "hide" the instruction's long latency.

To see the value of instruction scheduling, rewrite the above program so that the division of d[j]/d[j+1] is computed at the top of the loop body, and is assigned to d[j] at the bottom of the loop body. Retime the modified code. How much of the latency of the division is now hidden?
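
One way to write the restructured inner loop (a sketch only; t is an extra float variable introduced for the illustration) is:

      for(j=0;j<100;j++){
	float t = d[j]/d[j+1];	/* start the long-latency divide first */
	a[j]=a[j]+a[j+1];
	b[j]=b[j]-b[j+1];
	c[j]=c[j]*c[j+1];
	d[j]=t;			/* store the quotient last */
      }

The reordering is valid because the a, b, and c updates neither read nor write the d array, so they can execute while the divide is still completing.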

Sometimes scheduling a long latency operation is more difficult. Consider

float a[101],b[101],c[101],d[101];

main() {
   int i,j;
   for(i=0;i<101;i++){
	a[i]=b[i]=c[i]=d[i]=1.0;
   }
   for (i=1;i<=1000000;i++){
      for(j=0;j<100;j++){
	a[j+1]=a[j]/a[j+1];
	b[j+1]=b[j]+a[j+1];
	c[j+1]=c[j]-b[j+1];
	d[j+1]=d[j]*c[j+1];
      }
   }
}
Here the division is already at the top of the loop body, and subsequent computations can't be moved above it because they all depend upon the division. Loop unrolling could help, allowing a division in one iteration to move into an earlier iteration. A form of symbolic loop unrolling called software pipelining can also be used.

In software pipelining, a long latency computation needed in iteration i is started in an earlier iteration. This allows the execution time of one or more loop iterations to "hide" the computation of the long latency value. For example, in the above program, we can compute the value of a[j]/a[j+1], needed in iteration j, in iteration j-1. This allows the division to overlap the other floating point and indexing operations in the loop. To start things off, the initial value of a[0]/a[1] needs to be computed before the inner loop begins. Moreover, parts of the final iteration of the loop body may need to be done (as a form of cleanup) after the loop exits.
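
As a sketch (one of several ways to write it, with t an extra float variable), the software-pipelined fragment that replaces the inner j loop might look like this, with a prologue that computes the first quotient and an epilogue that finishes the final iteration:

      float t = a[0]/a[1];	/* prologue: the quotient needed by iteration 0 */
      for(j=0;j<99;j++){
	a[j+1]=t;
	t=a[j+1]/a[j+2];	/* start the divide needed by the next iteration */
	b[j+1]=b[j]+a[j+1];
	c[j+1]=c[j]-b[j+1];
	d[j+1]=d[j]*c[j+1];
      }
      a[100]=t;			/* epilogue: finish the final iteration */
      b[100]=b[99]+a[100];
      c[100]=c[99]-b[100];
      d[100]=d[99]*c[100];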

Once this restructuring is done, we have a software pipelined loop that hides all (or most) of the latency of the floating point division. To see the value of this optimization, compile and time the above program (at the -O1 level). Change the division to a multiplication and retime the program, to estimate how fast the program would be if floating point division were pipelined.

Now restructure the inner loop using software pipelining so that the crucial floating point division is initiated one iteration in advance of when it will be needed. Time your software-pipelined version. How close to the "ideal" execution time is it?

Cache Effects

Modern processors use caches to reduce the long delay it takes to fetch instructions and data from main memory. In fact, processors typically have two levels of caching (and three levels are sometimes seen). A small primary (level 1) cache is 8-64K bytes in size, with distinct level 1 instruction and data caches. A second level cache of 256K-4M bytes (shared between instructions and data) supplies misses in level 1 caches, while main memory (up to gigabytes in size) supplies misses in the level 2 cache.

Loading a register from the level 1 cache is fast, typically requiring only 1 or 2 cycles. A miss into the level 2 cache requires 10 or more cycles. A miss into main memory takes 100 or more cycles. Load latencies (delays) make careful cache management essential to high performance computing.

The Data Cache

We want to keep active data in the level 1 cache. However, it is common to access data structures far greater in size than the capacity of the level 1 cache. Hence cache locality is an important issue. Caches are divided into blocks, typically 32 or 64 bytes in size. This means that if data at address A is in the cache, data at address A+1 (and A+4) very likely is too. This phenomenon is called spatial locality. Moreover, if data is in a cache at time t it likely will be in the cache at time t+1, t+2, etc. This is temporal locality--if data must be accessed several times, it is good to keep those accesses close together in the instruction sequence.

Write a small C program that accesses the elements of a large matrix in row-major order (A[i][j], then A[i][j+1], etc). Time its execution (make the array large or visit the same sequence of elements many times so that the execution time is at least several seconds). Now visit the same data the same number of times, but in column-major order (A[i][j], A[i+1][j], etc.). Measure the execution time now. Is there a significant difference? If so, why?
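
A minimal sketch of the row-major traversal (the array size and repetition count are placeholders; choose them so each run takes several seconds):

double A[1000][1000];

main() {
   int i, j, rep;
   double sum = 0.0;
   for (rep = 0; rep < 20; rep++)
      for (i = 0; i < 1000; i++)	/* row-major: A[i][j], then A[i][j+1], ... */
	 for (j = 0; j < 1000; j++)
	    sum += A[i][j];
   /* For the column-major version, interchange the i and j loops so the
      inner loop steps down a column: A[i][j], then A[i+1][j], ... */
   return sum > 0.0;
}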

When we try to fit a large data set into a much smaller cache, we may well have capacity misses--data that once was in the cache has been pushed out by a large volume of more recently accessed data. We can also see another phenomenon, conflict misses. Two or more data items may clash in the cache because they happen to be mapped to the same cache location. (A memory word is typically mapped to a cache location by doing a div or mod operation using a power of 2.) That is, even though there is room to spare in the cache, the "wrong" sequence of data addresses may cause needed cache data to be purged. Write a simple C program that illustrates this phenomenon. That is, in the program a small set of elements in an array are accessed repeatedly, and cause repeated cache misses (and a larger execution time). An almost identical program that touches the same number of data elements, which happen not to clash, has far fewer cache misses and executes much faster.
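
Here is the flavor of such a program (a sketch only; the stride is an assumption tied to the size of a direct-mapped level 1 data cache, and you may need to adjust it and the iteration count for the machine you use). Eight floats spaced exactly one cache-size apart all map to the same cache location and repeatedly evict one another, while the same eight floats spaced slightly differently do not:

#define STRIDE 4096	/* 4096 floats = 16K bytes; an assumed L1 data cache size */
float data[9*STRIDE];

main() {
   int i, k;
   float sum = 0.0;
   for (k = 0; k < 10000000; k++)
      for (i = 0; i < 8; i++)
	 sum += data[i*STRIDE];	/* eight elements that all map to the same cache location */
   /* Changing STRIDE to 4096+16 (or simply accessing data[i]) touches the same
      number of elements per pass, but the accesses no longer clash. */
   return sum > 0.0;
}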

Loop Interchange and Loop Tiling

Optimizers often focus their attention on nested loops. This makes sense--even small improvements in the body of a nested loop are magnified by the millions (or billions) of times the loop body may be iterated. In some cases, the order in which nested loops are iterated may be irrelevant to program correctness. Consider the following program which counts the number of values in the y array that are smaller than a particular value in the x array:

int cnt[1000];
double x[1000], y[1000000];
main () {
   int i,j;
   for (i=0;i<1000;i++)
      cnt[i]=0;
   for(i=0;i<1000;i++) {
      for(j=0;j<1000000;j++) 
	 cnt[i] += (y[j] < x[i]);
   }
}
If we interchange the i and j loops, we still get the same answer, since each element in x must be compared with each element in y.
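
The interchanged loop nest looks like this (a sketch; only the nesting order changes):

   for(j=0;j<1000000;j++) {
      for(i=0;i<1000;i++)
	 cnt[i] += (y[j] < x[i]);
   }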

However, if we compile and time the two versions, there is a very significant difference! Do this, compiling each at the -O2 level. Why is one version so much faster than the other? The optimization that interchanges the execution order of loops to speed execution is called loop interchange.

Now consider a very similar version of the above program:

int cnt[10000];
double x[10000], y[100000];
main () {
   int i,j;
   for (i=0;i<10000;i++)
      cnt[i]=0;
   for(i=0;i<10000;i++) {
      for(j=0;j<100000;j++) 
	 cnt[i] += (y[j] < x[i]);
      }
}
The only difference is that now the x and cnt arrays have been increased by a factor of 10, while the y array has been decreased by a factor of 10. The total number of comparisons that are done is exactly the same. Compile (at the -O2 level) this program as it is given and with the i and j loops interchanged. Does loop interchange help much here? Why are both versions of this program significantly slower than the faster version of the original program (that had x and cnt arrays of size 1000)?

Loop tiling is an optimization that speeds the execution of loops that process arrays too big to fit in the primary cache. An array is logically broken into pieces called tiles. Each tile is small enough to fit conveniently in the primary cache. A tile is loaded into the cache (by accessing it) and all the loop operations that refer to that tile are performed together, enhancing cache temporal locality. For example, in the above example (with x and cnt arrays of size 10000), we can perform all the comparisons involving a tile of the x array, then all the comparisons involving the next x tile, etc. Since we process the x array in "chunks," we need far fewer passes through the y array. Alternatively, we can tile the y array, processing it in "chunks." Create, compile and time both versions (x array tiled versus y array tiled). Which is faster? Why? Is the better of the two tiled versions faster than the original untiled programs that processed an x array of size 1000 and a y array of size 1000000? Why?
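
As a starting point, a version that tiles the x array might look like the following sketch (the tile size is an assumption; pick it so that a tile of x, along with the corresponding piece of cnt, fits comfortably in the primary cache):

   /* i0 is an extra int loop index; 1000 is an assumed tile size (8K bytes of x) */
   for(i0=0;i0<10000;i0+=1000) {
      for(j=0;j<100000;j++) {
	 for(i=i0;i<i0+1000;i++)	/* all comparisons for this tile of x */
	    cnt[i] += (y[j] < x[i]);
      }
   }

With this structure each pass over y serves an entire tile of x, so the program makes only 10000/TILE passes over y instead of 10000.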

The Instruction Cache

Several of the optimizations we explored earlier (inlining, loop unrolling, partial redundancy elimination, etc.) can increase the size of a program--sometimes significantly. If we increase program size too much, we'll start to see increased instruction cache misses, and a very significant slowdown.

Write a simple C program with a loop that iterates several million times. Measure its execution time. Unroll the loop by a factor of 2, then 4, then 8, etc., measuring execution times. Do your unrolling at the source level, using a simple program or script to produce the expanded source. Produce a graph comparing your unrolling factor with the execution time observed.
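
For example (a sketch with made-up names; it assumes n is a multiple of the unrolling factor), a loop such as

   for (i = 0; i < n; i++)
      sum += v[i];

unrolled by a factor of 4 becomes

   for (i = 0; i < n; i += 4) {
      sum += v[i];
      sum += v[i+1];
      sum += v[i+2];
      sum += v[i+3];
   }

Since each copy of a small body adds only a few instructions, fairly large unrolling factors are needed before the instruction cache overflows, so generating the copies with a script rather than by hand is well worth the effort.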

Note when the execution time shoots up. Try intermediate unrollings until you find the approximate size at which instruction cache misses start to be a problem. Given this unrolling factor and the code generated per copy of the loop body, estimate the instruction cache size of the Sparc processor you are using.

What to Hand In

Late Policy

The project is due in class on Tuesday, September 27. It may be handed in during class on Thursday, September 29, with a late penalty of 10% (i.e., the maximum possible grade becomes 90). The project may also be handed in during class on Tuesday, October 4, with a late penalty of 20% (the maximum possible grade becomes 80). This assignment will not be accepted after Tuesday, October 4.


Fri Aug 26 16:39:06 CDT 2005