One easy way to become familiar with the Sparc instruction set is to look at the code generated by the standard departmental C/C++ compiler (gcc). If we execute
    gcc -S prog.c

gcc will create prog.s, an assembler file containing the code it generated for prog.c. If you are working on a workstation that does not use a Sparc processor, you will need to log onto a workstation that does contain a Sparc (otherwise, you'll see code for an entirely different architecture!).
For example, given the following program (prog1.c):

    int incr(int i)
    {
        return i + 1;
    }

    int main()
    {
        int a;
        return incr(a);
    }

gcc generates
.file "prog1.c" .section ".text" .align 4 .global incr .type incr, #function .proc 04 incr: !#PROLOGUE# 0 save %sp, -112, %sp !#PROLOGUE# 1 st %i0, [%fp+68] ld [%fp+68], %g1 add %g1, 1, %g1 mov %g1, %i0 ret restore .size incr, .-incr .align 4 .global main .type main, #function .proc 04 main: !#PROLOGUE# 0 save %sp, -120, %sp !#PROLOGUE# 1 ld [%fp-20], %o0 call incr, 0 nop mov %o0, %g1 mov %g1, %i0 ret restore .size main, .-main .ident "GCC: (GNU) 3.4.4"Op codes that begin with a "." are assembler directives. If we look at the label incr we see the body of the incr function. In the save instruction a new register window (for the function's local and out registers) is created and the function's frame is pushed on the run-time stack (by subtracting the frame size, 112 bytes, from the caller's stack pointer (%sp) to create the callee's stack pointer). (The two references to %sp in the save instruction are to different registers!). The caller's stack pointer becomes the callee's frame pointer (%fp).
The function parameter, passed in register %i0, is saved into its memory location (%fp+68) and then immediately reloaded into %g1. One is added to i (in %g1), then %g1 is copied to %i0. (Register %i0 is the standard function return register.) In the Sparc, branches, calls and returns are delayed. That is, the instruction following a branch, call or return is executed before any transfer of control occurs. This takes some getting used to when you read Sparc assembler; you need to mentally transpose the instructions to get the actual execution order. Thus when we see a ret, restore pair, the restore is done before the return actually occurs. A restore restores the caller's register window (by restoring the caller's stack pointer and frame pointer). The function return value, placed in the callee's %i0 register, is now in the caller's %o0 register (because in the register window mechanism, a caller's output registers (%o0 to %o7) overlap a callee's input registers (%i0 to %i7)). Finally, the ret instruction branches to the return point, which is the address %i7+8. (Register %i7 contains the address of the calling instruction, set by a call. But why is the return point 8 bytes past the call instruction?)
Function main, compiled at label main, is similar in structure to incr. It pushes its frame using a save instruction. It loads local a from its frame location, %fp-20 (the brackets indicate that a memory location is being addressed). Local a is correctly loaded into register %o0, the register assigned to the first scalar parameter. The call instruction transfers control to label incr; the address of the call instruction is stored in register %o7. The nop after the call indicates that the compiler found nothing useful to place in the call's delay slot. Upon return, incr's return value, in %o0, is clumsily moved to %g1 and then %g1 is moved to register %i0, the return value register. Finally, the restore/ret pair is executed to pop main's frame and return to the caller.
Gcc, in unoptimized mode, generates rather simple-minded code. It does far better when optimization is requested. Flags -O1, -O2, and -O3 request progressively more extensive optimization. Often -O1 is enough to produce a dramatic improvement in code quality. Let's see what is produced by -O1 for our above example:
.file "prog1.c" .section ".text" .align 4 .global incr .type incr, #function .proc 04 incr: !#PROLOGUE# 0 !#PROLOGUE# 1 retl add %o0, 1, %o0 .size incr, .-incr .align 4 .global main .type main, #function .proc 04 main: !#PROLOGUE# 0 save %sp, -112, %sp !#PROLOGUE# 1 call incr, 0 nop ret restore %g0, %o0, %o0 .size main, .-main .ident "GCC: (GNU) 3.4.4"
The generated code has been improved considerably. The function incr is recognized as a leaf procedure (a procedure that makes no calls). A leaf procedure needs no frame. Rather, incr simply accesses its parameter from the parameter register %o0. It increments i, and returns it in %o0. The add is placed in the delay slot of retl, which returns using %o7 (since no new register window was created for the call).
The main function has also been improved. Local variable a has been assigned to register %o0, the parameter register. Function incr is called as before, returning its value in %o0. A restore/ret pair is executed, but with a clever twist. The restore adds %g0, which always contains 0, to %o0. Why? Well, the %o0 that is the target of the trivial addition is in the caller's window, effectively moving the return value from the callee's register window to the correct register in the caller's window.
Compilers implement a wide variety of optimizations. Most of these optimizations employ a common strategy--reduce the total number of dynamic instructions needed to implement a computation. This makes a great deal of sense--the fewer the instructions needed, the faster a program should be. As we will see later, this isn't the whole story, but it is certainly a reasonable first step. In the following list, we identify and describe a number of such instruction count optimizations. More information on these (and many other optimizations) may be found in a survey paper on compiler transformations.
You will need to create a small C program (only a few lines if possible) that can be improved using the given optimization. Run gcc without optimization on your program, demonstrating an unoptimized translation. Then run gcc using -O1 (or -O2 or -O3). Show the optimized code and explain what has been changed in the code as a result of the given optimization. In a few cases you may need to change (by hand) the source program to "force" gcc to implement the optimization. Alternatively, you may edit (very carefully) the Sparc instructions generated. You can feed a ".s" file back into gcc to get it to assemble the given instructions (and link in the needed libraries). In cases where you hand optimize the source or generated code, you should explain why your optimizations are valid, and why they improve the translation.
As an example, if you were asked to illustrate the "leaf procedure" optimization, in which a function (or procedure) that makes no calls is translated so that it pushes no frame or register window, you could use the above example. In the original translation, incr pushes a frame and register window, but in the translation produced by the -O1 option, the parameter is accessed directly from the register passed by the caller.
Reducing total instruction counts is an important goal, but it isn't by any means the whole story in optimization. Other aspects of the target machine must be considered and exploited to improve overall performance.
Most modern processors are pipelined. That is, processors strive to make all instructions appear to have the same execution time, issuing and retiring an instruction in one cycle. This can be counterintuitive: a complex operation like addition takes no more time than a simpler operation like a bitwise OR. Often multiplication is no slower than addition (even in floating point!).
For better or worse, the unit-time execution model is a close, but not entirely accurate, model of instruction execution. Not all processors handle all instructions in unit time. Sometimes integer or floating point multiplication is slower than integer or floating point addition. Division is almost always slower than multiplication or addition. Loads take an unpredictable amount of time, depending on whether the desired value is in a cache or main memory.
You can use the Unix command
    time prog args

to measure (to approximately 1/60 second resolution) the time needed to execute prog args. You can time a given program several times, averaging the reported execution times, to obtain a more accurate timing.
Compile (at the -O1 level) and time the following program:
    float a[101], b[101], c[101], d[101];

    main()
    {
        int i, j;
        for (j = 0; j < 101; j++) {
            a[j] = b[j] = c[j] = d[j] = 1.0;
        }
        for (i = 1; i <= 1000000; i++) {
            for (j = 0; j < 100; j++) {
                a[j] = a[j] + a[j+1];
                b[j] = b[j] - b[j+1];
                c[j] = c[j] * c[j+1];
                d[j] = d[j] / d[j+1];
            }
        }
    }

Now change the division in the assignment to d[j] into a multiplication and recompile and retime the program. Even though exactly the same number of operations are performed, execution time changes drastically. Why?
Well, the UltraSPARC IIe and the UltraSPARC IIi have essentially the same microarchitecture, and the UltraSPARC IIi User Manual (p. 10) tells us that floating point divides aren't pipelined: a single precision division takes twelve cycles, rather than one, to produce an answer.
A common approach to handling instructions that have long execution times (often called long latency instructions) is to schedule them. That is, a long latency operation, like a floating point division, is started well before its result is needed so that other instructions can "hide" the instruction's long latency.
To see the value of instruction scheduling, rewrite the above program so that the division of d[j]/d[j+1] is computed at the top of the loop body, and is assigned to d[j] at the bottom of the loop body. Retime the modified code. How much of the latency of the division is now hidden?
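One possible shape for that rewrite is sketched below; only the inner loop of the program above changes, and the temporary t is a new local introduced for the illustration:

    for (j = 0; j < 100; j++) {
        float t = d[j] / d[j+1];    /* start the long-latency divide first */
        a[j] = a[j] + a[j+1];       /* independent work that can hide the divide's latency */
        b[j] = b[j] - b[j+1];
        c[j] = c[j] * c[j+1];
        d[j] = t;                   /* consume the quotient at the bottom of the body */
    }

The a, b, and c assignments neither read nor write d, so moving the division above them cannot change the result.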
Sometimes scheduling a long latency operation is more difficult. Consider
    float a[101], b[101], c[101], d[101];

    main()
    {
        int i, j;
        for (i = 0; i < 101; i++) {
            a[i] = b[i] = c[i] = d[i] = 1.0;
        }
        for (i = 1; i <= 1000000; i++) {
            for (j = 0; j < 100; j++) {
                a[j+1] = a[j] / a[j+1];
                b[j+1] = b[j] + a[j+1];
                c[j+1] = c[j] - b[j+1];
                d[j+1] = d[j] * c[j+1];
            }
        }
    }

Here the division is already at the top of the loop body, and subsequent computations can't be moved above it because they all depend upon the division. Loop unrolling could help, allowing a division in one iteration to move into an earlier iteration. A form of symbolic loop unrolling called software pipelining can also be used.
In software pipelining, a long latency computation needed in iteration i is started in an earlier iteration. This allows the execution time of one or more loop iterations to "hide" the computation of the long latency value. For example, in the above program, we can compute the value of a[j]/a[j+1], needed in iteration j, in iteration j-1. This allows the division to overlap the other floating point and indexing operations in the loop. To start things off, the initial value of a[0]/a[1] needs to be computed before the inner loop begins. Moreover, parts of the final iteration of the loop body may need to be done (as a form of cleanup) after the loop exits.
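As a concrete (but not unique) illustration, the inner loop of the program above might be software pipelined along the following lines; q is a new temporary introduced for the sketch:

    for (i = 1; i <= 1000000; i++) {
        float q = a[0] / a[1];          /* prologue: start the division iteration 0 will need */
        for (j = 0; j < 99; j++) {
            a[j+1] = q;                 /* use the division started one iteration earlier */
            q = a[j+1] / a[j+2];        /* start the division the next iteration will need */
            b[j+1] = b[j] + a[j+1];     /* this work overlaps the divide now in progress */
            c[j+1] = c[j] - b[j+1];
            d[j+1] = d[j] * c[j+1];
        }
        a[100] = q;                     /* epilogue: finish the final iteration (j == 99) */
        b[100] = b[99] + a[100];
        c[100] = c[99] - b[100];
        d[100] = d[99] * c[100];
    }

The division started in iteration j is not used until the store at the top of iteration j+1, so the additions, the subtraction, the multiplication, and the loop overhead in between can proceed while the (unpipelined) divider is busy.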
Once this restructuring is done, we have a software pipelined loop that hides all (or most) of the latency of the floating point division. To see the value of this optimization, compile and time the above program (at the -O1 level). Change the division to a multiplication and retime the program, to estimate how fast the program would be if floating point division were pipelined.
Now restructure the inner loop using software pipelining so that the crucial floating point division is initiated one iteration in advance of when it will be needed. Time your software-pipelined version. How close to the "ideal" execution time is it?
Modern processors use caches to reduce the long delay it takes to fetch instructions and data from main memory. In fact, processors typically have two levels of caching (and three levels are sometimes seen). A small primary (level 1) cache is 8-64K bytes in size, with distinct level 1 instruction and data caches. A second level cache of 256K-4M bytes (shared between instructions and data) supplies misses in level 1 caches, while main memory (up to gigabytes in size) supplies misses in the level 2 cache.
Loading a register from the level 1 cache is fast, typically requiring only 1 or 2 cycles. A miss into the level 2 cache requires 10 or more cycles. A miss into main memory takes 100 or more cycles. Load latencies (delays) make careful cache management essential to high performance computing.
We want to keep active data in the level 1 cache. However, it is common to access data structures far greater in size than the capacity of the level 1 cache. Hence cache locality is an important issue. Caches are divided into blocks, typically 32 or 64 bytes in size. This means that if data at address A is in the cache, data at address A+1 (and A+4) very likely is too. This phenomenon is called spatial locality. Moreover, if data is in a cache at time t, it likely will be in the cache at time t+1, t+2, etc. This is temporal locality--if data must be accessed several times, it is good to keep those accesses close together in the instruction sequence.
Write a small C program that accesses the elements of a large matrix in row-major order (A[i][j], then A[i][j+1], etc.). Time its execution (make the array large, or visit the same sequence of elements many times, so that the execution time is at least several seconds). Now visit the same data the same number of times, but in column-major order (A[i][j], A[i+1][j], etc.), and measure the execution time again. Is there a significant difference? If so, why?
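A minimal sketch of such a program follows; the array size N and the number of passes are assumptions you should tune so the run takes at least several seconds on your machine:

    #define N 1000

    double A[N][N];                         /* 8 megabytes: far larger than the level 1 cache */

    main()
    {
        int i, j, pass;
        double sum = 0.0;
        for (pass = 0; pass < 50; pass++)
            for (i = 0; i < N; i++)
                for (j = 0; j < N; j++)
                    sum += A[i][j];         /* row-major order: consecutive memory addresses */
        return (int) sum;                   /* keep sum live so the loops aren't optimized away */
    }

For the column-major version, change the body to sum += A[j][i]; the same elements are visited the same number of times, but consecutive accesses are now N*8 bytes apart.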
When we try to fit a large data set into a much smaller cache, we may well have capacity misses--data that once was in the cache has been pushed out by a large volume of more recently accessed data. We can also see another phenomenon, conflict misses. Two or more data items may clash in the cache because they happen to be mapped to the same cache location. (A memory word is typically mapped to a cache location by doing a div or mod operation using a power of 2.) That is, even though there is room to spare in the cache, the "wrong" sequence of data addresses may cause needed cache data to be purged. Write a simple C program that illustrates this phenomenon. That is, in the program a small set of elements in an array is accessed repeatedly, causing repeated cache misses (and a longer execution time). An almost identical program that touches the same number of data elements, which happen not to clash, has far fewer cache misses and executes much faster.
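The following sketch is one way to provoke conflict misses; the stride is an assumption tied to the level 1 data cache of the particular Sparc you are using (the UltraSPARC II family has a 16K byte, direct-mapped level 1 data cache), so you may need to adjust it:

    #define STRIDE 4096                     /* 4096 floats = 16K bytes, an assumed cache size */

    float v[8 * STRIDE];

    main()
    {
        int i, k;
        float s = 0.0;
        for (i = 0; i < 2000000; i++)
            for (k = 0; k < 8; k++)
                s += v[k * STRIDE];         /* eight elements that map to the same cache location */
        return (int) s;                     /* keep s live so the loops aren't optimized away */
    }

Changing STRIDE to a small value such as 16 touches the same number of elements, but they no longer clash, and the program should run noticeably faster.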
Optimizers often focus their attention on nested loops. This makes sense--even small improvements in the body of a nested loop are magnified by the millions (or billions) of times the loop body may be iterated. In some cases, the order in which nested loops are iterated may be irrelevant to program correctness. Consider the following program which counts the number of values in the y array that are smaller than a particular value in the x array:
    int cnt[1000];
    double x[1000], y[1000000];

    main()
    {
        int i, j;
        for (i = 0; i < 1000; i++)
            cnt[i] = 0;
        for (i = 0; i < 1000; i++) {
            for (j = 0; j < 1000000; j++)
                cnt[i] += (y[j] < x[i]);
        }
    }

If we interchange the i and j loops, we still get the same answer, since each element in x must be compared with each element in y.
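Interchanging the loops simply swaps the two loop headers; the body is unchanged:

    for (j = 0; j < 1000000; j++)
        for (i = 0; i < 1000; i++)
            cnt[i] += (y[j] < x[i]);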
However, if we compile and time the two versions, there is a very significant difference! Do this, compiling each at the -O2 level. Why is one version so much faster than the other? The optimization that interchanges the execution order of loops to speed execution is called loop interchange.
Now consider a very similar version of the above program:
    int cnt[10000];
    double x[10000], y[100000];

    main()
    {
        int i, j;
        for (i = 0; i < 10000; i++)
            cnt[i] = 0;
        for (i = 0; i < 10000; i++) {
            for (j = 0; j < 100000; j++)
                cnt[i] += (y[j] < x[i]);
        }
    }

The only difference is that now the x and cnt arrays have been increased by a factor of 10, while the y array has been decreased by a factor of 10. The total number of comparisons that are done is exactly the same. Compile (at the -O2 level) this program as it is given and with the i and j loops interchanged. Does loop interchange help much here? Why are both versions of this program significantly slower than the faster version of the original program (that had x and cnt arrays of size 1000)?
Loop tiling is an optimization that speeds the execution of loops that process arrays too big to fit in the primary cache. An array is logically broken into pieces called tiles. Each tile is small enough to fit conveniently in the primary cache. A tile is loaded into the cache (by accessing it) and all the loop operations that refer to that tile are performed together, enhancing cache temporal locality. For example, in the above example (with x and cnt arrays of size 10000), we can perform all the comparisons involving a tile of the x array, then all the comparisons involving the next x tile, etc. Since we process the x array in "chunks," we need far fewer passes through the y array. Alternatively, we can tile the y array, processing it in "chunks." Create, compile and time both versions (x array tiled versus y array tiled). Which is faster? Why? Is the better of the two tiled versions faster than the original untiled program (that processed an x array of size 1000 and a y array of size 1000000)? Why?
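As a sketch of the x-tiled variant (the tile size is an assumption; pick it so a tile of x, together with the matching piece of cnt, fits comfortably in the level 1 cache), the doubly nested loop in the program above becomes a triply nested one:

    #define TILE 1000                       /* tile size: an assumption; tune it to your cache */

    /* replaces the i/j loop nest in the program above; t is a new int */
    for (t = 0; t < 10000; t += TILE)       /* one tile of x (and cnt) at a time */
        for (j = 0; j < 100000; j++)        /* one pass over y per tile of x */
            for (i = t; i < t + TILE; i++)
                cnt[i] += (y[j] < x[i]);

The y-tiled variant is symmetric: the outer loop steps a tile through y, and each y tile is compared against every element of x before the next tile is loaded.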
Several of the optimizations we explored earlier (inlining, loop unrolling, partial redundancy elimination, etc.) can increase the size of a program--sometimes significantly. If we increase program size too much, we'll start to see increased instruction cache misses, and a very significant slowdown.
Write a simple C program with a loop that iterates several million times. Measure its execution time. Unroll the loop by a factor of 2, then 4, then 8, etc., measuring execution times. Do your unrolling at the source level, using a simple program or script to produce the expanded source. Produce a graph comparing your unrolling factor with the execution time observed.
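As a point of reference, source-level unrolling by a factor of 4 might look like the sketch below; the arrays, N, and sum are placeholders (any loop body with enough code in it will do), and N is assumed to be a multiple of 4:

    /* original loop */
    for (i = 0; i < N; i++)
        sum += a[i] * b[i];

    /* the same loop unrolled by a factor of 4 */
    for (i = 0; i < N; i += 4) {
        sum += a[i]   * b[i];
        sum += a[i+1] * b[i+1];
        sum += a[i+2] * b[i+2];
        sum += a[i+3] * b[i+3];
    }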
Note when the execution time shoots up. Try intermediate unrollings until you find the approximate size at which instruction cache misses start to be a problem. Given this unrolling factor and the code generated per copy of the loop body, estimate the instruction cache size of the Sparc processor you are using.
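As a rough rule of thumb (an estimate, not an exact formula): Sparc instructions are 4 bytes each, so if each copy of the loop body assembles to roughly k instructions and the slowdown first appears at an unrolling factor of u, the unrolled loop occupies on the order of 4*u*k bytes, which approximates the capacity of the level 1 instruction cache.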
The project is due in class on Tuesday, September 27. It may be handed in during class on Thursday, September 29, with a late penalty of 10% (i.e., the maximum possible grade becomes 90). The project may also be handed in during class on Tuesday, October 4, with a late penalty of 20% (the maximum possible grade becomes 80). This assignment will not be accepted after Tuesday, October 4.