--------------------------------------------------------------------
CS 758 Programming Multicore Processors
Fall 2012 Section 1
Instructor Mark D. Hill
--------------------------------------------------------------------

------------
TITLE
------------

OUTLINE
* Cache Coherence
* Niagara (did not cover much as was in 752)
* Memory Consistency -- NOT DONE

------------------------------
System Model & Coherence

System model -- Figure 2.1
  cores w/ private caches w/ controllers
  icn
  LLC w/ memory w/ controller
  Offchip DRAM

Not:
* Not multisocket
* Not hierarchical caches

EZ: (OMIT)
* Private L2
* Banked LLC
* Multiple DRAM channels

Incoherence example 2.2

Coherence Invariants

1. Single-Writer, Multiple-Read (SWMR) Invariant.
   For any memory location A, at any given (logical) time, there
   exists only a single core that may write to A (and can also read
   it) or some number of cores that may only read A.

2. Data-Value Invariant. (OMIT)
   The value of the memory location at the start of an epoch is the
   same as the value of the memory location at the end of its last
   read–write epoch.

Show Figure 2.3 timeline
  read-write at 1
  read-only at 2, 3
  read-write at 2
  ..

Maintain invariants
* Use blocks (64B)
* FSM at caches & LLC
* Communication with messages/bus

SWMR in Practice. For block B at time t, either
  1 Modified (M) block -- write/read in one cache
  0-n Shared (S) blocks -- read-only in multiple caches
Blocks can also be Invalid (I) as well as E or O: MOESI states

How?
* Snooping -- totally-ordered broadcast
* Directory -- use indirection

DISCLAIMER: This is a high-level view

NIAGARA
--------------------

Poonacha Kongetira, Kathirgamar Aingaran, Kunle Olukotun,
Niagara: A 32-Way Multithreaded Sparc Processor,
IEEE Micro, March-April 2005, pp. 21-29.

(TOO MUCH HERE)

Why Niagara?
* Pushes TLP further than others
* But processors may be too simple

Draw on board Figure 2.
Niagara Block Diagram
* 8 * 4 = 32 hardware threads
* DDR2 DRAM up to 128 GBytes
* L2 is 3MB 12-way with 64-byte blocks, interleaved across four banks
  by cache block
  (Thus, each bank is a 3/4 MB 12-way cache with 1K blocks in each way.)
* Crossbar with two-entry queue for each source-destination pair
* Only 60 watts!!!!!

Aside on (Simultaneous) Multithreading (a.k.a. Intel Hyperthreading) (SKIP)
* Early computers stalled CPU for I/O
* Multitasking allows CPU to run other jobs while I/O pending
* (L2) Cache misses now very slow
* Tolerate with fancy ILP, but then stall
* Multithreading allows CPU pipeline to run other jobs with L2 miss pending
* Basic idea
  + replicate some things: program counter & registers
  + share some things: ALUs and caches
  + other micro-architectural resources more complex
* Net result
  + Few additional logical "processors" for (almost) free
  + Helps a lot if you have threads waiting on memory
  + Hurts a little if you don't have threads
  + A wash if multiple threads rarely wait on memory (benchmarks?)

Niagara Pipeline (one of eight)
* 552 Pipe: Fetch, Decode, Execute, Memory Access, Writeback
* Nia Pipe: Fetch, *Thread Select*, Decode, Execute, Memory Access, Writeback
* Too much for 758:
  + Draw on board Figure 3. Niagara Pipeline Block Diagram
  + Go over stages
* Register File supports 8 register windows for each of 4 threads: 5.7KB!
  + Each thread sees the current 32 in fast SRAM
  + while the other registers are in compact SRAM
* Includes 16KB L1 I-cache and 8KB L1 D-cache (small, but get easy stuff)

Memory System
* L1 caches are write-through (with no write allocate)
* L1 caches maintain just Valid/Invalid
* L2 cache banks keep copy of L1 state
  + Load misses reveal "way" of victim
  + Stores await invalidations of all other sharers
  + Implements Total Store Order (TSO)
* L2 is a standard write-back cache w.r.t.
  memory
* Niagara systems allow only one Niagara chip

Assessment
* Pushes TLP further than others
* But processors may be too simple
* Lower clock frequency than chips with long pipes

Memory Consistency -- NOT DONE
-------------------

Coherence's Goal:
* Make caches invisible, as in uniprocessors
* Once invisible, what does memory do? (can't refer to caches)

Example
/* initial A = B = flag = 0 */
P1              P2
A = 1;          while (flag == 0); /* spin */
B = 1;          print A;
flag = 1;       print B;

Intuition says printed A = B = 1 (OMIT)
Coherence doesn't say anything, why? (OMIT)
Consider coalescing write buffer

/* initial A = B = 0 */
P1              P2
A = 1;          B = 1;
r1 = B;         r2 = A;

Three outcomes, but not the fourth (r1 == r2 == 0). (OMIT)

How to screw up?
* write buffer
* ooo loads
* dir protocol that doesn't wait for acks

Define Memory Consistency
* What SW can expect of HW
* What HW must provide to SW

Want 4Ps: (DID NOT COVER)
* Programmability: EZ to program
* Performance: facilitates good performance (or low cost)
* Portable
* Precision

Sequential Consistency

A multiprocessor is sequentially consistent if the result of any
execution is the same as if the operations of all the processors were
executed in some sequential order, and the operations of each
individual processor appear in this sequence in the order specified
by its program.

Show two processors' program orders (po) and global memory order (mo)
Show "railroad switch" picture
Same as multi-threaded uniprocessor

Total Store Ordering (TSO) -- x86 and SPARC
------------------------------------------
/* initial A = B = 0 */
P1              P2
A = 1;          B = 1;
r1 = B;         r2 = A;

Allows r1 == r2 == 0? (Why? write buffers)

Relaxed (Weak) Ordering -- ARM, IBM Power
-----------------------------------------
/* initially all 0 */
P1              P2
A = 1;          while (flag == 0); /* spin */
B = 1;          r1 = A;
flag = 1;       r2 = B;

But many (most) orders are NOT necessary! E.g., "A = 1" and "B = 1"
Why not just enforce necessary orders?
Relaxed models
* Unordered by default
* Use FENCEs to get necessary order

/* initially all 0 */
P1              P2
A = 1;          while (flag == 0); /* spin */
                FENCE
B = 1;          r1 = A;
FENCE
flag = 1;       r2 = B;
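
Sketch (not from lecture): the FENCE example above can be tried in C++11
atomics. Assumptions: FENCE is mapped to std::atomic_thread_fence, with a
release fence in P1 and an acquire fence in P2. That is enough for this
flag hand-off. A real hardware FENCE is usually a full fence, closer to
std::memory_order_seq_cst. The names run_message_passing and
check_message_passing are made up for illustration. All other accesses
are relaxed, so "A = 1" and "B = 1" stay unordered with respect to each
other, as in the notes.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <utility>

// Hypothetical helper: runs the P1/P2 example once and returns (r1, r2)
// as observed by P2. All variables start at 0.
std::pair<int, int> run_message_passing() {
    std::atomic<int> A{0}, B{0}, flag{0};
    int r1 = -1, r2 = -1;

    std::thread p1([&] {
        A.store(1, std::memory_order_relaxed);                 // A = 1;
        B.store(1, std::memory_order_relaxed);                 // B = 1;
        std::atomic_thread_fence(std::memory_order_release);   // FENCE
        flag.store(1, std::memory_order_relaxed);              // flag = 1;
    });

    std::thread p2([&] {
        while (flag.load(std::memory_order_relaxed) == 0) { }  // spin
        std::atomic_thread_fence(std::memory_order_acquire);   // FENCE
        r1 = A.load(std::memory_order_relaxed);                // r1 = A;
        r2 = B.load(std::memory_order_relaxed);                // r2 = B;
    });

    p1.join();
    p2.join();
    return {r1, r2};
}

// With both fences in place, P2 must observe both stores.
void check_message_passing() {
    const std::pair<int, int> r = run_message_passing();
    assert(r.first == 1 && r.second == 1);
}
```

Note: without the two fences, the C++ memory model (like the relaxed
hardware models above) would allow r1 or r2 to be 0. Build with thread
support, e.g. -pthread on GCC/Clang.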