--------------------------------------------------------------------
CS 757 Parallel Computer Architecture
Spring 2012 Section 1
Instructor Mark D. Hill
--------------------------------------------------------------------

Outline
* 757 Lecture
* Niagara (did not cover much as was in 752)

--------------------

Poonacha Kongetira, Kathirgamar Aingaran, Kunle Olukotun,
Niagara: A 32-Way Multithreaded Sparc Processor,
IEEE Micro, March-April 2005, pp. 21-29.

Why Niagara?
* Pushes TLP further than others
* But processors may be too simple
* Disclaimer: I consult for Sun, but not on Niagara

Draw on board Figure 2. Niagara Block Diagram
* 8 * 4 = 32 hardware threads
* DDR2 DRAM up to 128 GBytes
* L2 is 3MB, 12-way, 64-byte blocks, interleaved across four banks
  on cache blocks
  (Thus, each bank is a 3/4 MB 12-way cache with 1K blocks in each way.)
* Crossbar with two-entry queue for each source-destination pair
* Only 60 watts!!!!!

Aside on (Simultaneous) Multithreading (a.k.a. Intel Hyperthreading) (SKIP)
* Early computers stalled CPU for I/O
* Multitasking allows CPU to run other jobs while I/O pending
* (L2) Cache misses now very slow
* Tolerate with fancy ILP, but then stall
* Multithreading allows CPU pipeline to run other jobs with L2 miss pending
* Basic idea
  + replicate some things: program counter & registers
  + share some things: ALUs and caches
  + other micro-architectural resources more complex
* Net result
  + Few additional logical "processors" for (almost) free
  + Helps a lot if you have threads waiting on memory
  + Hurts a little if you don't have threads
  + A wash if multiple threads rarely wait on memory (benchmarks?)

Niagara Pipeline (one of eight)
* 552 Pipe: Fetch, Decode, Execute, Memory Access, Writeback
* Nia Pipe: Fetch, *Thread Select*, Decode, Execute, Memory Access, Writeback
* Too much for 838:
  + Draw on board Figure 3. Niagara Pipeline Block Diagram
  + Go over stages
* Register File supports 8 register windows for each of 4 threads: 5.7KB!
  + Each thread sees the current 32 in fast SRAM
  + while the other registers are in compact SRAM
* Includes 8KB L1 I-cache and 16KB L1 D-cache (small, but get easy stuff)

Memory System
* L1 caches are write-through (with no write allocate)
* L1 caches maintain just Valid/Invalid
* L2 cache banks keep copy of L1 state
  + Load misses reveal "way" of victim
  + Stores await invalidations of all other sharers
  + Implements Total Store Order (TSO)
* L2 is standard write-back cache w.r.t. memory
* Niagara systems allow only one Niagara chip

Assessment
* Pushes TLP further than others
* But processors may be too simple
* Lower clock frequency than chips with long pipes
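
Aside: checking the L2 numbers from the block diagram notes
The arithmetic behind "each bank is a 3/4 MB 12-way cache with 1K blocks in each way" can be sketched in a few lines of Python. The sizes come from the notes above. The address split in decompose() is only an assumption for illustration: the notes say the banks are interleaved on cache blocks, but not which address bits select the bank, so the low-order block-address bits are used here. The function and constant names are hypothetical.

```python
# Sketch under stated assumptions; not taken from the Niagara paper.
L2_BYTES = 3 * 1024 * 1024   # 3MB total L2 (from the notes)
WAYS = 12                    # 12-way set associative (from the notes)
BLOCK_BYTES = 64             # 64-byte blocks (from the notes)
BANKS = 4                    # four banks (from the notes)


def bank_geometry():
    """Return (bytes per bank, bytes per way, blocks per way) for one bank."""
    bank_bytes = L2_BYTES // BANKS
    way_bytes = bank_bytes // WAYS
    blocks_per_way = way_bytes // BLOCK_BYTES
    return bank_bytes, way_bytes, blocks_per_way


def decompose(addr):
    """Split a physical address into (bank, set index, block offset).

    ASSUMPTION: the bank is chosen by the low-order bits of the block
    address, so consecutive blocks land in consecutive banks.
    """
    _, _, sets_per_bank = bank_geometry()  # one block per way per set
    offset = addr % BLOCK_BYTES
    block = addr // BLOCK_BYTES
    bank = block % BANKS
    set_index = (block // BANKS) % sets_per_bank
    return bank, set_index, offset


if __name__ == "__main__":
    print(bank_geometry())    # (786432, 65536, 1024)
    print(decompose(64))      # (1, 0, 0): next block, next bank
    print(decompose(256))     # (0, 1, 0): wraps to bank 0, next set
```

So each bank holds 768 KB, each way 64 KB, and 64 KB / 64 B = 1K blocks per way, matching the parenthetical in the notes.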