(3.3.5) Niagara: 32-way SPARC

Poonacha Kongetira, Kathirgamar Aingaran, Kunle Olukotun, "Niagara: A 32-Way Multithreaded Sparc Processor," IEEE Micro, vol. 25, no. 2, pp. 21-29, Mar./Apr. 2005. IEEE Xplore

Commercial server applications (databases and Web services) have abundant thread-level parallelism (TLP).

the chip dissipates about 60 W of power.

Data-center racks are limited by the power-supply envelope: both supplying power and dissipating the generated heat.

32-way threaded > the OS layer abstracts away the hardware sharing (the chip appears as 32 logical processors).

performance metric: sustained throughput of client requests.

good performance per watt. 

commercial server applications: low ILP, because of large working sets, poor locality of reference on memory accesses, and data-dependent branches that are very difficult to predict.

Niagara: simple cores aggregated on a single die, sharing an on-chip cache and high-bandwidth off-chip memory.

Pipeline: 4 threads form a thread group > each group shares a processing pipeline, the "Sparc pipe". 8 such groups = 32 threads.
A crossbar interconnect provides the communication link between the Sparc pipes, the L2 cache, and the shared resources (it is also the point of memory ordering).

each thread: a unique set of registers, plus instruction and store buffers.
shared: L1 caches, TLBs, execution units, most pipeline registers. (the 64-entry fully associative ITLB is on the critical path.)




Register windows
     each thread has 8 register windows (4 such register files serve the 4 threads in a group).
     the set of registers visible to a thread is the working set, kept in fast register-file cells.
     the complete set is the architectural set, kept in a compact SRAM structure.
          a copy happens when the window is changed; the thread is stopped from issuing while the transfer completes. cool!
     
Memory 
     simple cache coherence with a write-through L1 policy.
     ~10% L1 miss rate.
     L1: allocate on load, no-allocate on store.
     stores do not update the local caches until they have updated the L2 cache.
          L2 uses a copy-back policy, writing back dirty evicts and dropping clean evicts.
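A toy model of these policies (entirely my own simplification: class names, the tiny FIFO-evicting L2, and dict-backed "memory" are assumptions, not the paper's design):

```python
# L1: write-through, allocate-on-load / no-allocate-on-store.
# L2: copy-back — write dirty evicts to memory, drop clean evicts.

class L2:
    def __init__(self, memory, capacity=2):
        self.memory = memory
        self.lines = {}          # addr -> (value, dirty)
        self.capacity = capacity

    def _evict_if_full(self):
        if len(self.lines) >= self.capacity:
            victim, (value, dirty) = next(iter(self.lines.items()))
            del self.lines[victim]
            if dirty:                         # copy-back: write back dirty,
                self.memory[victim] = value   # drop clean evicts silently

    def load(self, addr):
        if addr not in self.lines:
            self._evict_if_full()
            self.lines[addr] = (self.memory.get(addr, 0), False)
        return self.lines[addr][0]

    def store(self, addr, value):
        if addr not in self.lines:
            self._evict_if_full()
        self.lines[addr] = (value, True)      # dirty in L2, not yet in memory

class L1:
    def __init__(self, l2):
        self.lines = {}  # addr -> value
        self.l2 = l2

    def load(self, addr):
        if addr not in self.lines:            # miss: allocate on load
            self.lines[addr] = self.l2.load(addr)
        return self.lines[addr]

    def store(self, addr, value):
        self.l2.store(addr, value)            # write-through: L2 updated first
        if addr in self.lines:                # update L1 only on a hit;
            self.lines[addr] = value          # no-allocate on a store miss

memory = {0x10: 1}
l2 = L2(memory)
l1 = L1(l2)

print(l1.load(0x10))     # 1: L1 miss allocates the line from L2
l1.store(0x20, 5)        # store miss: goes to L2, NOT allocated in L1
print(0x20 in l1.lines)  # False
print(memory.get(0x20))  # None: L2 holds it dirty until eviction
```

No-allocate on store keeps one thread's stores from evicting the read working sets of the other three threads sharing the L1; write-through keeps coherence simple because L2 always has the latest data.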
------

Hardware multithreading primer
     Single-issue, single-thread machine (the baseline).
     Coarse-grained multithreading / switch-on-event multithreading > a thread has full use of the CPU until a long-latency event such as a DRAM miss occurs, then the hardware switches threads.
     Fine-grained multithreading / interleaved multithreading > threads are switched on a cycle boundary; also called time-sliced or vertical multithreading.

     Simultaneous multithreading (SMT) > a superscalar machine issues instructions from multiple threads in the same cycle.
     Chip multiprocessing (CMP) > multiple cores on one die; each processor executes one thread.
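To make the coarse- vs fine-grained distinction concrete, a toy single-issue scheduler sketch (entirely my own model; the op encoding and miss latency are arbitrary, not from the paper):

```python
# Each thread is a list of ops: 'c' = 1-cycle compute,
# 'm' = memory op that stalls its thread for MISS_LATENCY extra cycles.
MISS_LATENCY = 3

def run(threads, fine_grained):
    """Return total cycles for a single-issue pipe to drain all threads."""
    n = len(threads)
    pc = [0] * n             # next op index per thread
    ready_at = [0] * n       # cycle when each thread may issue again
    cycle, current = 0, 0
    while any(pc[t] < len(threads[t]) for t in range(n)):
        # threads that still have work and are not stalled on a miss
        candidates = [t for t in range(n)
                      if pc[t] < len(threads[t]) and ready_at[t] <= cycle]
        if not candidates:
            cycle += 1       # everyone stalled: pipeline idles
            continue
        if fine_grained:
            # interleaved: rotate to the next ready thread every cycle
            current = min(candidates, key=lambda t: (t - current - 1) % n)
        elif current not in candidates:
            current = candidates[0]  # coarse: switch only on stall/finish
        op = threads[current][pc[current]]
        pc[current] += 1
        if op == 'm':
            ready_at[current] = cycle + 1 + MISS_LATENCY
        cycle += 1
    return cycle

# 4 threads (one Niagara-style thread group), each hitting one miss
work = [['c', 'm', 'c']] * 4
print(run(work, fine_grained=False), run(work, fine_grained=True))
```

With enough ready threads, the fine-grained schedule hides each miss behind the other threads' compute and finishes in fewer total cycles than switch-on-event, which is the bet Niagara makes with four threads per pipe.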