(3.3.5) Niagara: 32-way SPARC

Poonacha Kongetira, Kathirgamar Aingaran, Kunle Olukotun, "Niagara: A 32-Way Multithreaded Sparc Processor," IEEE Micro, vol. 25, no. 2, pp. 21-29, Mar./Apr. 2005. IEEE Xplore

Commercial server applications (databases and Web services) have abundant thread-level parallelism (TLP).

the chip dissipates about 60 W of power.

Data-center racks are limited by the power-supply envelope: both supplying power and dissipating the generated heat.

32-way threaded > the OS layer abstracts away the hardware sharing (the chip appears as 32 logical processors).

performance metric: sustained throughput of client requests.

good performance per watt. 

commercial server applications: low ILP, because of large working sets, poor locality of reference on memory accesses, and data-dependent branches that are very difficult to predict.

Niagara: simple cores aggregated on a single die, sharing an on-chip cache and high-bandwidth off-chip memory.

Pipeline: 4 threads form a thread group > each group shares a processing pipeline, the "Sparc pipe". 8 such groups = 32 threads.
A crossbar interconnect provides the communication link between the Sparc pipes, the L2 cache, and the shared resources (it is also the point of memory ordering).

each thread: a unique set of registers, plus instruction and store buffers.
shared: L1 caches, TLBs, execution units, most pipeline registers. (the 64-entry fully associative ITLB is on the critical path.)




Register windows
     each thread has 8 register windows (4 such register files serve the 4 threads in a group).
     the set of registers visible to a thread is the working set, kept in fast register-file cells.
     the complete set is the architectural set, kept in a compact SRAM structure.
          a copy happens when the window is changed; the thread is stopped from issuing while the transfer completes. cool!
     
Memory 
     simple cache coherence with a write-through L1 policy.
     ~10% L1 miss rate.
     L1: allocate on load, no-allocate on store.
     stores do not update the local caches until they have updated the L2 cache.
          L2 uses a copy-back policy, writing back dirty evicts and dropping clean evicts.
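A toy model of these policies (entirely my own simplification: class names, the tiny FIFO-evicting L2, and dict-backed "memory" are assumptions, not the paper's design):

```python
# L1: write-through, allocate-on-load / no-allocate-on-store.
# L2: copy-back — write dirty evicts to memory, drop clean evicts.

class L2:
    def __init__(self, memory, capacity=2):
        self.memory = memory
        self.lines = {}          # addr -> (value, dirty)
        self.capacity = capacity

    def _evict_if_full(self):
        if len(self.lines) >= self.capacity:
            victim, (value, dirty) = next(iter(self.lines.items()))
            del self.lines[victim]
            if dirty:                         # copy-back: write back dirty,
                self.memory[victim] = value   # drop clean evicts silently

    def load(self, addr):
        if addr not in self.lines:
            self._evict_if_full()
            self.lines[addr] = (self.memory.get(addr, 0), False)
        return self.lines[addr][0]

    def store(self, addr, value):
        if addr not in self.lines:
            self._evict_if_full()
        self.lines[addr] = (value, True)      # dirty in L2, not yet in memory

class L1:
    def __init__(self, l2):
        self.lines = {}  # addr -> value
        self.l2 = l2

    def load(self, addr):
        if addr not in self.lines:            # miss: allocate on load
            self.lines[addr] = self.l2.load(addr)
        return self.lines[addr]

    def store(self, addr, value):
        self.l2.store(addr, value)            # write-through: L2 updated first
        if addr in self.lines:                # update L1 only on a hit;
            self.lines[addr] = value          # no-allocate on a store miss

memory = {0x10: 1}
l2 = L2(memory)
l1 = L1(l2)

print(l1.load(0x10))     # 1: L1 miss allocates the line from L2
l1.store(0x20, 5)        # store miss: goes to L2, NOT allocated in L1
print(0x20 in l1.lines)  # False
print(memory.get(0x20))  # None: L2 holds it dirty until eviction
```

No-allocate on store keeps one thread's stores from evicting the read working sets of the other three threads sharing the L1; write-through keeps coherence simple because L2 always has the latest data.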
------

Hardware multithreading primer
     Single-issue, single-thread machine (the baseline).
     Coarse-grained multithreading / switch-on-event multithreading > a thread has full use of the CPU until a long-latency event such as a DRAM miss occurs, then the hardware switches threads.
     Fine-grained multithreading / interleaved multithreading > threads are switched on a cycle boundary; also called time-sliced or vertical multithreading.

     Simultaneous multithreading (SMT) > a superscalar machine issues instructions from multiple threads in the same cycle.
     Chip multiprocessing (CMP) > multiple cores on one die; each processor executes one thread.
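To make the coarse- vs fine-grained distinction concrete, a toy single-issue scheduler sketch (entirely my own model; the op encoding and miss latency are arbitrary, not from the paper):

```python
# Each thread is a list of ops: 'c' = 1-cycle compute,
# 'm' = memory op that stalls its thread for MISS_LATENCY extra cycles.
MISS_LATENCY = 3

def run(threads, fine_grained):
    """Return total cycles for a single-issue pipe to drain all threads."""
    n = len(threads)
    pc = [0] * n             # next op index per thread
    ready_at = [0] * n       # cycle when each thread may issue again
    cycle, current = 0, 0
    while any(pc[t] < len(threads[t]) for t in range(n)):
        # threads that still have work and are not stalled on a miss
        candidates = [t for t in range(n)
                      if pc[t] < len(threads[t]) and ready_at[t] <= cycle]
        if not candidates:
            cycle += 1       # everyone stalled: pipeline idles
            continue
        if fine_grained:
            # interleaved: rotate to the next ready thread every cycle
            current = min(candidates, key=lambda t: (t - current - 1) % n)
        elif current not in candidates:
            current = candidates[0]  # coarse: switch only on stall/finish
        op = threads[current][pc[current]]
        pc[current] += 1
        if op == 'm':
            ready_at[current] = cycle + 1 + MISS_LATENCY
        cycle += 1
    return cycle

# 4 threads (one Niagara-style thread group), each hitting one miss
work = [['c', 'm', 'c']] * 4
print(run(work, fine_grained=False), run(work, fine_grained=True))
```

With enough ready threads, the fine-grained schedule hides each miss behind the other threads' compute and finishes in fewer total cycles than switch-on-event, which is the bet Niagara makes with four threads per pipe.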