Poonacha Kongetira, Kathirgamar Aingaran, Kunle Olukotun, "Niagara: A 32-Way Multithreaded Sparc Processor," IEEE Micro, vol. 25, no. 2, pp. 21-29, Mar./Apr. 2005.
Commercial server applications (databases and Web services) have abundant thread-level parallelism (TLP).
Niagara : about 60 W of power.
Data center racks are limited by their power supply envelope : both supplying power and dissipating the generated heat.
32 hardware threads > an OS layer abstracts away the hardware sharing.
performance metric : sustained throughput of client requests.
good performance per watt.
commercial server applications : low ILP, because of large working sets and poor locality of reference on memory accesses. data-dependent branches are very difficult to predict.
Niagara : many simple cores aggregated on a single die, with a shared on-chip cache and high-bandwidth off-chip memory.
Pipeline : 4 threads form a thread group > the group shares one processing pipeline > Sparc pipe. 8 such groups = 32 threads.
crossbar interconnect provides communication link between Sparc pipes, L2 cache and shared resources. (also the point of memory ordering).
each thread : unique set of registers, instr and store buffers.
shared : L1 caches, TLBs, exec units, most pipeline registers. (critical path is set by the 64-entry fully associative ITLB access).
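The grouping above (4 threads per Sparc pipe, 8 pipes, some state private to each thread and some shared by the group) can be sketched as a small data model. This is only an illustration: the class names, field names, and the consecutive thread numbering are assumptions, not something the paper specifies.

```python
from dataclasses import dataclass, field

THREADS_PER_PIPE = 4   # from the notes: 4 threads form a thread group
NUM_PIPES = 8          # from the notes: 8 thread groups


@dataclass
class HwThread:
    """State private to one hardware thread (field names are illustrative)."""
    thread_id: int
    registers: list = field(default_factory=list)
    instr_buffer: list = field(default_factory=list)
    store_buffer: list = field(default_factory=list)


@dataclass
class SparcPipe:
    """One thread group; the caches and ITLB here are shared by its threads."""
    pipe_id: int
    threads: list
    l1_icache: dict = field(default_factory=dict)
    l1_dcache: dict = field(default_factory=dict)
    itlb: dict = field(default_factory=dict)


def build_chip():
    # Assumed numbering: consecutive thread ids fill one pipe before the next.
    pipes = []
    for p in range(NUM_PIPES):
        threads = [HwThread(p * THREADS_PER_PIPE + s)
                   for s in range(THREADS_PER_PIPE)]
        pipes.append(SparcPipe(p, threads))
    return pipes


def locate(thread_id):
    """Return (pipe index, slot within the pipe) for a hardware thread id."""
    return divmod(thread_id, THREADS_PER_PIPE)
```

Under the assumed numbering, locate(13) gives pipe 3, slot 1, and the chip holds 8 x 4 = 32 threads in total.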
Register windows
each thread has 8 register windows. (4 such register files, one per thread)
working set : the set of registers visible to a thread (fast register file cells)
architectural set : the complete set of registers (compact SRAM cells)
a window change copies data between the two sets; thread issue is stopped while the copy happens. cool!
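A rough sketch of the working set / architectural set split, for one thread. The register count per window is an assumption, and the overlap between adjacent SPARC windows is deliberately ignored to keep the copy step visible; the class and method names are made up for this note.

```python
NUM_WINDOWS = 8        # from the notes: 8 register windows per thread
REGS_PER_WINDOW = 16   # assumption for this sketch; window overlap is ignored


class ThreadRegisterFile:
    def __init__(self):
        # Architectural set: every window of this thread (dense SRAM in hardware).
        self.architectural = [[0] * REGS_PER_WINDOW for _ in range(NUM_WINDOWS)]
        self.current_window = 0
        # Working set: only the window the thread can currently see.
        self.working = list(self.architectural[0])
        self.active = True

    def read(self, reg):
        return self.working[reg]

    def write(self, reg, value):
        self.working[reg] = value

    def change_window(self, new_window):
        self.active = False  # thread issue is stopped during the transfer
        # Save the old window, then load the new one into the working set.
        self.architectural[self.current_window] = list(self.working)
        self.current_window = new_window % NUM_WINDOWS
        self.working = list(self.architectural[self.current_window])
        self.active = True   # thread resumes once the copy is done
```

Normal reads and writes only touch the small working set; the larger architectural set is touched only on a window change.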
Memory
L1 caches : simple cache coherence with a write-through policy.
L1 miss rates around 10 %.
allocate on load, no-allocate on store.
stores do not update the local caches until they have updated the L2 cache.
L2 cache : copy-back policy, writing back dirty evicts and dropping clean evicts.
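The policies listed above can be sketched with a toy two-level cache. This is a behavioural illustration only: the tiny capacity, LRU replacement, one-word lines, and the unbounded first-level cache are all assumptions, and the class names are made up.

```python
from collections import OrderedDict


class Memory:
    def __init__(self):
        self.data = {}
        self.writebacks = 0

    def read(self, addr):
        return self.data.get(addr, 0)

    def write(self, addr, value):
        self.data[addr] = value
        self.writebacks += 1


class L2Cache:
    """Copy-back cache; LRU replacement and capacity are assumptions."""

    def __init__(self, memory, capacity=4):
        self.memory = memory
        self.capacity = capacity
        self.lines = OrderedDict()  # addr -> (value, dirty)

    def _install(self, addr, value, dirty):
        if addr not in self.lines and len(self.lines) >= self.capacity:
            victim, (v_value, v_dirty) = self.lines.popitem(last=False)
            if v_dirty:
                self.memory.write(victim, v_value)  # write back dirty evict
            # a clean evict is simply dropped
        self.lines[addr] = (value, dirty)
        self.lines.move_to_end(addr)

    def load(self, addr):
        if addr in self.lines:
            self.lines.move_to_end(addr)
            return self.lines[addr][0]
        value = self.memory.read(addr)
        self._install(addr, value, dirty=False)
        return value

    def store(self, addr, value):
        self._install(addr, value, dirty=True)


class L1Cache:
    """Write-through, allocate on load, no-allocate on store."""

    def __init__(self, l2):
        self.l2 = l2
        self.lines = {}  # unbounded here, for brevity

    def load(self, addr):
        if addr not in self.lines:
            self.lines[addr] = self.l2.load(addr)  # allocate on load miss
        return self.lines[addr]

    def store(self, addr, value):
        self.l2.store(addr, value)      # the L2 cache is updated first
        if addr in self.lines:
            self.lines[addr] = value    # then an existing local copy
        # store miss: no line is allocated locally
```

A store to an address that is not cached locally leaves the first-level cache untouched, and memory is written only when a dirty line is evicted from the second level.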
------
Hardware multithreading primer
Single issue single thread machine
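Coarse-grained multithreading / switch-on-event multithreading > a thread has full use of the CPU until a long-latency event such as a DRAM miss occurs.
Fine-grained multithreading / interleaved multithreading > thread selection changes on a cycle boundary.
both of the above are time-sliced or vertically multithreaded. simultaneous multithreading (SMT) instead issues instructions from several threads in the same cycle.
Chip multiprocessing > several processors on one chip, each processor executes one thread.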
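To make the coarse-grained and fine-grained policies from the primer concrete, here is a toy single-issue scheduler that records which thread issues in each cycle. The op names, the fixed stall length, and the round-robin selection are assumptions for illustration; the sketch also ignores the pipeline-flush cost a real coarse-grained switch would pay.

```python
MISS_LATENCY = 3  # hypothetical stall, in cycles, after a long-latency op


def simulate(programs, policy):
    """Return the thread id issued in each cycle (None = idle cycle).

    programs: one list of ops per thread, each op "alu" or "miss".
    policy:   "coarse" (switch on event) or "fine" (switch every cycle).
    """
    n = len(programs)
    pc = [0] * n        # next op index per thread
    ready_at = [0] * n  # first cycle at which each thread may issue again
    # "last" is the thread that issued most recently.
    last = n - 1 if policy == "fine" else 0
    trace = []
    cycle = 0
    while any(pc[t] < len(programs[t]) for t in range(n)):
        # Fine: priority starts at the thread after the last issuer.
        # Coarse: the last issuer keeps the pipeline while it can run.
        start = last + 1 if policy == "fine" else last
        order = [(start + i) % n for i in range(n)]
        chosen = None
        for t in order:
            if pc[t] < len(programs[t]) and ready_at[t] <= cycle:
                chosen = t
                break
        if chosen is not None:
            if programs[chosen][pc[chosen]] == "miss":
                ready_at[chosen] = cycle + 1 + MISS_LATENCY
            pc[chosen] += 1
            last = chosen
        trace.append(chosen)
        cycle += 1
    return trace
```

Calling simulate with the same programs under "coarse" and "fine" shows the difference in the trace: long runs of one thread id broken only by a miss, versus thread ids alternating cycle by cycle.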