THE SPARC ARCHITECTURE

The CPU with Windows - Registers

The SPARC architecture definition includes the IU (Integer Unit), which is the CPU proper, the FPU (Floating Point Unit), and an optional CP (CoProcessor). A memory management unit and cache are further options. (Tanenbaum)

An important concept of the SPARC architecture, register windowing, is borrowed from the Berkeley RISC chips and recalls the workspace-pointer scheme of the TMS 9900 (see figure 1). A running program has access to 32 32-bit processor registers: eight global registers plus 24 registers that belong to the current register window. The first 8 registers in the window are called the in registers (i0-i7). When a function is called, these registers may contain its arguments. The next 8 are the local registers, scratch registers that can be used for anything while the function executes. The last 8 registers are the out registers, which the function uses to pass arguments to functions that it calls. (Glas)

When one function calls another, the callee can choose to execute a SAVE instruction. This instruction decrements an internal counter, the current window pointer (CWP), shifting the register window downward. The caller's out registers then become the callee's in registers, and the callee gets a new set of local and out registers for its own use. Only the pointer changes; the registers and the return address do not need to be stored on a stack. The CALL instruction automatically saves its own address in o7 (output register 7), which becomes input register 7 if the CWP is decremented. The callee can therefore reach the return address whether or not it has decremented the CWP.
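
The overlap between caller and callee windows can be sketched in a few lines of Python. This is only an illustrative model, not a description of any real chip: the class name `WindowedRegisters`, the method names, and the physical layout (16 registers per window, with a window's in registers aliased to the out registers of the window above it) are assumptions chosen to match the description above. Overflow checking is deliberately left out.

```python
class WindowedRegisters:
    """Toy model of SPARC register windows (hypothetical layout, for illustration)."""

    def __init__(self, nwindows=7):
        self.nwindows = nwindows
        self.globals = [0] * 8                    # g0-g7, shared by all windows
        self.windowed = [0] * (16 * nwindows)     # 8 outs + 8 locals per window
        self.cwp = 0                              # current window pointer

    def _index(self, kind, n):
        # Outs and locals live in the current window's own 16 slots.
        if kind == "o":
            return 16 * self.cwp + n
        if kind == "l":
            return 16 * self.cwp + 8 + n
        # Ins are the outs of the window above (the caller's window).
        if kind == "i":
            return 16 * ((self.cwp + 1) % self.nwindows) + n
        raise ValueError(f"unknown register kind: {kind!r}")

    def read(self, name):
        kind, n = name[0], int(name[1:])
        if kind == "g":
            return 0 if n == 0 else self.globals[n]   # g0 always reads as zero
        return self.windowed[self._index(kind, n)]

    def write(self, name, value):
        kind, n = name[0], int(name[1:])
        value &= 0xFFFFFFFF                           # registers are 32 bits wide
        if kind == "g":
            if n != 0:                                # writes to g0 are discarded
                self.globals[n] = value
            return
        self.windowed[self._index(kind, n)] = value

    def save(self):
        # SAVE decrements the CWP; only the pointer moves, no data is copied.
        self.cwp = (self.cwp - 1) % self.nwindows

    def restore(self):
        # RESTORE increments the CWP, returning to the caller's window.
        self.cwp = (self.cwp + 1) % self.nwindows


regs = WindowedRegisters()
regs.write("o0", 42)        # caller places an argument in o0
regs.write("o7", 0x1000)    # CALL leaves its own address in o7
regs.save()                 # callee shifts the window
print(regs.read("i0"), hex(regs.read("i7")))
```

After `save()`, the callee sees the argument in i0 and the return address in i7 without any register having been copied.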

Register windows are also used to save the processor context when traps or interrupts occur. SPARC operating systems always ensure that an unused register window is available below the current one. If a trap occurs, the CWP is decremented and the new window saves the processor context. (Glas)

The chip that was implemented by Sun had seven overlapping windows, which brought the total number of registers to (7*16) + 7 = 119 (without counting g0). If six levels are not enough because of recursive or deeply nested function calls, the program attempts to decrement the CWP to the last unused window and discovers that the window has been marked invalid in a register called the window invalid mask register. This causes a trap, and the processor has an opportunity to "spill" registers in order to make more room: it writes some of their contents out to memory.
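
The arithmetic and the overflow condition can be checked with a short Python sketch. The function names below are hypothetical, and the way the invalid window is chosen (the one directly above the starting window) is an assumption made for illustration; a real operating system manages the window invalid mask itself.

```python
class WindowOverflow(Exception):
    """Raised where the hardware would take a window overflow trap."""


def total_registers(nwindows, count_g0=False):
    # 16 registers (8 locals + 8 outs) per window, plus the globals.
    return nwindows * 16 + (8 if count_g0 else 7)


def save(cwp, wim, nwindows):
    """Decrement the CWP, trapping if the new window is marked invalid in wim."""
    new_cwp = (cwp - 1) % nwindows
    if wim & (1 << new_cwp):
        raise WindowOverflow(new_cwp)
    return new_cwp


def depth_before_overflow(nwindows=7, start_cwp=0):
    """Count how many call levels fit before a SAVE hits the invalid window."""
    wim = 1 << ((start_cwp + 1) % nwindows)   # assumption: one window kept invalid
    cwp, depth = start_cwp, 1                 # the starting function uses one window
    while True:
        try:
            cwp = save(cwp, wim, nwindows)
        except WindowOverflow:
            return depth
        depth += 1


print(total_registers(7))        # 119
print(depth_before_overflow(7))  # 6
```

With seven windows and one held in reserve, the model reproduces the six usable levels mentioned above.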

A long series of subroutine returns can cause a window underflow, which causes the processor to call a trap handler that fills registers from memory. All the spilling and filling is usually hidden from the executing user program. Spilling and filling registers is an essential part of Unix multitasking on SPARC. (Case, Glas)

A Simple Instruction Set

The SPARC instruction set also resembles that of the Berkeley RISC chip in having 3 possible instruction formats (shown in figure 2). It has 74 possible instructions, including the floating point operations. SPARC supports signed and unsigned integer data types: bytes, 16-bit half words, 32-bit words, and 64-bit double words. There is also a tagged word format in which the 2 least significant bits serve as flags to indicate the type of object. Floating point numbers can be 32 (single), 64 (double), or 128 (quad) bits long; they conform to the IEEE 754 standard. The quad format, which was first used in version 8 of SPARC, uses a 112-bit mantissa for applications requiring very high floating point precision. The floating point unit has 32 32-bit non-windowed registers, which must be saved on a per-context basis.
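
As a rough illustration of the tagged word format, the following Python sketch splits a 32-bit word into its 2-bit tag and checks the tags before adding. The helper names `tag_of` and `tagged_add` are hypothetical, and the convention that a tag of 00 marks an ordinary integer is an assumption for this example; the overflow rule is modelled on the behaviour of SPARC's tagged add instructions.

```python
MASK32 = 0xFFFFFFFF


def tag_of(word):
    """Return the 2 least significant bits, which carry the type tag."""
    return word & 0b11


def to_signed(word):
    """Interpret a 32-bit pattern as a signed two's-complement integer."""
    word &= MASK32
    return word - (1 << 32) if word & 0x80000000 else word


def tagged_add(a, b):
    """Add two tagged words; report overflow if a tag is nonzero or the sum overflows."""
    result = (a + b) & MASK32
    tag_overflow = tag_of(a) != 0 or tag_of(b) != 0
    total = to_signed(a) + to_signed(b)
    arith_overflow = not (-(1 << 31) <= total < (1 << 31))
    return result, tag_overflow or arith_overflow


# Integers 5 and 7 stored with tag 00 in the low bits (assumed convention).
five, seven = 5 << 2, 7 << 2
result, overflow = tagged_add(five, seven)
print(result >> 2, overflow)   # 12 False
```

A language runtime could use the overflow indication to fall back to slower code when an operand turns out not to be a small integer.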

SPARC is "big-endian": it stores multiple-byte objects in memory with the most significant byte at the lowest address. This matches Sun's "big-endian" protocols, which are remote procedure call (RPC), external data representation (XDR), and network file system (NFS). (The UltraSPARC - Technology)
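
Byte order is easy to see with Python's standard `struct` module. The sketch below stores one 32-bit word into a small byte array in big-endian order and, for contrast, shows the little-endian layout; the function name `store_word_big_endian` is only an illustrative choice.

```python
import struct


def store_word_big_endian(memory, address, value):
    """Write a 32-bit word so that the most significant byte lands at `address`."""
    memory[address:address + 4] = struct.pack(">I", value)


memory = bytearray(8)
store_word_big_endian(memory, 0, 0x12345678)

print(memory[:4].hex())                      # 12345678
print(struct.pack("<I", 0x12345678).hex())   # 78563412 (little-endian, for contrast)
```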

Figure 1: SPARC Instruction Set

SPARC's Delayed Branches

SPARC handles branching in an interesting way. A delayed branch means that the instruction following a branch instruction is executed while the processor prepares to transfer control to the destination. SPARC also implements an annul bit, which allows the processor to annul the effects of the delay instruction following a conditional branch if that branch isn't taken.

On processors that use delayed branches but cannot annul the delay instruction, the compiler must fill the delay slot with an instruction that is safe to execute whether or not the branch is taken. If, however, the delay instruction can be annulled, the obvious candidate is the instruction that would otherwise reside at the destination of the branch. Thus SPARC compilers are more likely than those for other RISCs to fill the delay slot with a useful instruction. (Project: SEMADAM)
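
The effect of the annul bit on a conditional branch can be summarised in a tiny Python model. The function name and the instruction labels are hypothetical; the model only records which instructions take effect, following the rule described above.

```python
def executed_instructions(taken, annul):
    """Return the instructions that take effect around a conditional branch.

    taken: whether the branch condition holds
    annul: whether the branch's annul bit is set
    """
    executed = ["branch"]
    # The delay instruction runs unless the annul bit is set
    # and the conditional branch is not taken.
    if taken or not annul:
        executed.append("delay")
    executed.append("target" if taken else "fall-through")
    return executed


for taken in (True, False):
    for annul in (False, True):
        print(taken, annul, executed_instructions(taken, annul))
```

Only the combination "not taken, annul bit set" skips the delay instruction, which is why the compiler can safely move the first instruction of the branch target into the delay slot.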

SPARC Floating Point

Here we will examine the UltraSPARC-I, since it is one of the newer implementations of the SPARC architecture and thus gives a more up-to-date idea of how floating point works in the 1990s.

The UltraSPARC-I floating-point unit is a pipelined floating-point processor that conforms to SPARC-V9 architecture specifications. Its IEEE-compliant design consists of five separate functional units to support floating-point and multimedia operations. The separation of execution units allows UltraSPARC-I to issue and execute two floating-point instructions per cycle. Source operands and results are stored in a 32-entry register file. Most floating-point instructions have a throughput of one cycle and a latency of three cycles, and are fully pipelined. The FPU operates on single-precision (32-bit) and double-precision (64-bit) numbers, normalized or denormalized, in hardware, and on quad-precision (128-bit) operands in software.

The floating-point unit (FPU) is tightly coupled to the integer pipeline and is capable of seamlessly executing a floating-point memory event and a floating-point operation. The Integer Execution Unit (IEU) and the FPU have a dedicated control interface, which includes the dispatch of operations fetched by the Prefetch and Dispatch Unit (PDU) to the FPU. Once instructions are in the queue, the PDU is responsible for distributing them to the FPU. The IU controls the D-cache portion of the operation, while the FPU decides how to manipulate the data. The integer unit and FPU cooperatively detect floating-point data dependencies. The interface also includes IU and FPU handshaking for floating-point exceptions. The FPU performs all floating-point operations and implements a 3-entry floating-point instruction queue to reduce the impact of bottlenecks at the IU and improve overall performance.

Performance

Statistical analysis shows that, on average, 94% of FPU instructions will complete within the typical cycle count. Table 3-2 identifies expected UltraSPARC-I FPU performance.(The UltraSPARC - Technology)


Operation                        Throughput   Latency
                                 (Cycles)     (Cycles)

Add (Single Precision)                1           3
Add (Double Precision)                1           3
Multiply (Single Precision)           1           3
Multiply (Double Precision)           1           3
Divide (Single Precision)            12          12
Divide (Double Precision)            22          22
Square Root (Single Precision)       12          12
Square Root (Double Precision)       22          22

Table 3-2: UltraSPARC-I FPU execution times assuming normal floating-point operands and results. (The UltraSPARC - Technology)
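
To see what the throughput and latency figures in Table 3-2 imply, the following Python sketch estimates the cycles needed for a run of independent operations of one kind. It rests on a simplified assumption that is not stated in the source: the first result appears after the latency, and each further result follows one throughput interval later, with no stalls or data dependencies. The dictionary keys and the function name are hypothetical.

```python
# (throughput, latency) in cycles, copied from Table 3-2.
FPU_TIMING = {
    "add_single": (1, 3),
    "add_double": (1, 3),
    "mul_single": (1, 3),
    "mul_double": (1, 3),
    "div_single": (12, 12),
    "div_double": (22, 22),
    "sqrt_single": (12, 12),
    "sqrt_double": (22, 22),
}


def cycles_for_independent_ops(op, count):
    """Estimate cycles for `count` back-to-back independent operations of one kind."""
    if count <= 0:
        return 0
    throughput, latency = FPU_TIMING[op]
    return latency + (count - 1) * throughput


print(cycles_for_independent_ops("add_double", 10))  # 12
print(cycles_for_independent_ops("div_double", 2))   # 44
```

Under this model ten pipelined double-precision adds finish in 12 cycles, while two double-precision divides need 44, which shows how much the non-pipelined operations cost.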

Memory Management Unit

Superscalar performance can only be maintained if the IEU can be supplied with the appropriate instructions and data, a job performed by the memory hierarchy. The UltraSPARC-I Memory Management Unit (MMU) provides the functionality of a reference MMU and an IOMMU, handling all memory operations as well as arbitration between data stores and memory.

The MMU implements virtual memory and translates the virtual addresses of each running process to physical addresses in memory. With virtual memory, applications are written assuming that a full 64-bit address space is available. This abstraction requires partitioning the logical (virtual) address space into pages, which are mapped into physical (real) memory. The operating system in turn translates a 64-bit address into the 44-bit address space supported by the processor. The MMU translates a 44-bit virtual address to a 41-bit physical address through the use of a Translation Lookaside Buffer (TLB).
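
The translation step can be sketched as a lookup from virtual page number to physical page number. This Python model is a simplification under stated assumptions: the page size of 8 KB, the use of a plain dictionary as the TLB, and the names `translate` and `TLBMiss` are all chosen for illustration, and the virtual address is treated as a plain unsigned 44-bit number.

```python
PAGE_BITS = 13   # assumption: 8 KB pages
VA_BITS = 44     # virtual address width given in the text
PA_BITS = 41     # physical address width given in the text


class TLBMiss(Exception):
    """Raised where the hardware would trap so software can reload the TLB."""


def translate(tlb, vaddr):
    """Translate a 44-bit virtual address using a {virtual page: physical page} map."""
    if not 0 <= vaddr < (1 << VA_BITS):
        raise ValueError("virtual address does not fit in 44 bits")
    vpn = vaddr >> PAGE_BITS                    # virtual page number
    offset = vaddr & ((1 << PAGE_BITS) - 1)     # offset is unchanged by translation
    if vpn not in tlb:
        raise TLBMiss(hex(vaddr))
    paddr = (tlb[vpn] << PAGE_BITS) | offset
    if paddr >= (1 << PA_BITS):
        raise ValueError("physical address does not fit in 41 bits")
    return paddr


tlb = {0x5: 0x3}                                # one mapping: page 5 -> frame 3
print(hex(translate(tlb, (0x5 << PAGE_BITS) | 0x123)))   # 0x6123
```

Only the page number is looked up; the offset within the page passes through untouched.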

The MMU also provides memory protection so that a process can be prohibited from reading or writing the address space of another, guaranteeing memory integrity between processes. Access protection is also supported to ensure that no process gains unauthorized access to memory. For example, a process will not be allowed to modify areas that are marked as read-only or reserved for supervisory software.
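
The two checks in that example, read-only pages and supervisor-only pages, can be sketched as follows. The `PageEntry` fields, the `check_access` function and the `ProtectionFault` exception are hypothetical names for illustration and do not reflect the actual UltraSPARC-I page table entry format.

```python
from dataclasses import dataclass


class ProtectionFault(Exception):
    """Raised where the MMU would signal an access violation."""


@dataclass(frozen=True)
class PageEntry:
    writable: bool          # False marks the page read-only
    supervisor_only: bool   # True reserves the page for supervisory software


def check_access(entry, is_write, is_supervisor):
    """Raise ProtectionFault if the access violates the page's permissions."""
    if entry.supervisor_only and not is_supervisor:
        raise ProtectionFault("user access to a supervisor page")
    if is_write and not entry.writable:
        raise ProtectionFault("write to a read-only page")


code_page = PageEntry(writable=False, supervisor_only=False)
check_access(code_page, is_write=False, is_supervisor=False)   # a read is allowed
try:
    check_access(code_page, is_write=True, is_supervisor=False)
except ProtectionFault as fault:
    print("fault:", fault)
```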

Finally, the MMU performs the arbitration function between I/O, D-cache, I-cache, and TLB references to memory. In essence, the MMU implements the function of a "traffic cop", controlling and prioritizing access to main memory. At any given time, contention for memory access may arise between an I/O access involving the bus and internal accesses requested by I-cache, D-cache, and TLB references. (The UltraSPARC - Technology, Tanenbaum, UltraSPARC I)

(The UltraSPARC - Technology)

Send comments or suggestions to: mutioke@earlham.edu
Last Revision: November 14, 1997