--------------------------------------------------------------------
CS 758 Programming Multicore Processors 
Fall 2012 Section 1
Instructor Mark D. Hill
--------------------------------------------------------------------

------------
TITLE
------------

OUTLINE
* Cache Coherence
* Niagara (did not cover much as was in 752)
* Memory Consistency  -- NOT DONE

------------------------------
System Model & Coherence

System model -- Figure 2.1

cores w/ private caches w/ controllers
icn
LLC w/ memory w/ controller
Offchip DRAM

Not:

* Not multisocket
* Not hierarchical caches

EZ: (OMIT)
* Private L2
* Banked LLC
* Multiple DRAM channels

Incoherence example 2.2


Coherence Invariants

1. Single-Writer, Multiple-Reader (SWMR) Invariant. For any memory location A,
at any given (logical) time, there exists only a single core that may write
to A (and can also read it) or some number of cores that may only read A.
2. Data-Value Invariant. (OMIT) The value of the memory location at the start
of an epoch is the same as the value of the memory location at the end of its
last read-write epoch.


Show Figure 2.3 timeline
  read-write at 1
  read-only at 2, 3
  read-write at 2
..
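
A minimal sketch of checking both invariants over an epoch timeline like the one above. The record layout, the `check_epochs` helper, and the data values are assumptions made for illustration; only the core numbers (1, then 2 and 3, then 2) come from the timeline.

```python
# Hypothetical epoch records for one memory location; the values are made up.
# kind is "RW" (one core may read and write) or "RO" (cores may only read).
EPOCHS = [
    {"kind": "RW", "cores": {1}, "start": 0, "end": 5},
    {"kind": "RO", "cores": {2, 3}, "start": 5, "end": 5},
    {"kind": "RW", "cores": {2}, "start": 5, "end": 7},
]

def check_epochs(epochs, initial_value=0):
    """Return a list of invariant violations (empty if none)."""
    problems = []
    last_written = initial_value  # value at the end of the last RW epoch
    for i, ep in enumerate(epochs):
        if ep["kind"] == "RW" and len(ep["cores"]) != 1:
            problems.append(f"epoch {i}: SWMR violated, {len(ep['cores'])} writers")
        if ep["kind"] == "RO" and ep["end"] != ep["start"]:
            problems.append(f"epoch {i}: value changed in a read-only epoch")
        if ep["start"] != last_written:
            problems.append(f"epoch {i}: data-value invariant violated")
        if ep["kind"] == "RW":
            last_written = ep["end"]
    return problems

print(check_epochs(EPOCHS))  # prints []
```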

Maintain invariants
* Use (64B) blocks
* FSM at caches & LLC
* Communicate with messages or a bus

SWMR in Practice. For block B at time t either
1 Modified (M) block -- write/read in one cache
0-n Shared (S) blocks -- read-only in multiple caches
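
The either/or above can be written as a predicate over the per-cache states of one block. This is only a sketch: the `swmr_holds` name and the list-of-strings encoding are assumptions, and it considers just M, S, and I.

```python
def swmr_holds(states):
    """states: per-cache state of one block, each "M", "S", or "I"."""
    modified = states.count("M")
    shared = states.count("S")
    # Either one writer and no other readers, or no writer and any readers.
    return (modified == 1 and shared == 0) or modified == 0

print(swmr_holds(["M", "I", "I"]))  # True
print(swmr_holds(["S", "S", "I"]))  # True
print(swmr_holds(["M", "S", "I"]))  # False
```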

Blocks can also be Invalid (I), Exclusive (E), or Owned (O): the MOESI states

How?
* Snooping -- totally-ordered broadcast
* Directory -- use indirection

DISCLAIMER: This is high-level view


NIAGARA
--------------------

Poonacha Kongetira, Kathirgamar Aingaran, Kunle Olukotun,
Niagara: A 32-Way Multithreaded Sparc Processor,
IEEE Micro, March-April 2005, pp. 21-29.

(TOO MUCH HERE)


Why Niagara?

* Pushes TLP further than others
* But processors may be too simple


Draw on board Figure 2. Niagara Block Diagram

* 8 cores * 4 threads/core = 32 hardware threads

* DDR2 DRAM up to 128 GBytes

* L2 is 3MB, 12-way, with 64-byte blocks, interleaved across four banks at
  cache-block granularity
  (Thus, each bank is a 3/4 MB 12-way cache with 1K blocks in each way.)
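
The bank arithmetic can be checked directly. The sizes come from the bullet above; the `bank_of` mapping is an assumption (consecutive blocks rotating across banks), not something taken from the paper.

```python
KB = 1024
MB = 1024 * KB

L2_BYTES = 3 * MB
BANKS = 4
WAYS = 12
BLOCK_BYTES = 64

bank_bytes = L2_BYTES // BANKS                # 768 KB = 3/4 MB
way_bytes = bank_bytes // WAYS                # 64 KB
blocks_per_way = way_bytes // BLOCK_BYTES     # 1024 = 1K (sets per bank)

def bank_of(addr):
    # Assumption: consecutive blocks rotate across banks, so the two
    # address bits just above the 6-bit block offset pick the bank.
    return (addr // BLOCK_BYTES) % BANKS

print(bank_bytes // KB, way_bytes // KB, blocks_per_way)  # 768 64 1024
print([bank_of(a) for a in (0, 64, 128, 192, 256)])       # [0, 1, 2, 3, 0]
```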

* Crossbar with two-entry queue for each source-destination pair

* Only 60 watts!!!!!


Aside on (Simultaneous) Multithreading (a.k.a. Intel Hyperthreading) (SKIP)

* Early computers stalled CPU for I/O
* Multitasking allows CPU to run other jobs while I/O pending

* (L2) Cache misses now very slow
* Tolerate with fancy ILP, but then stall
* Multithreading allows CPU pipeline to run other jobs with L2 miss pending

* Basic idea

  + replicate some things: program counter & registers
  + share some things: ALUs and caches
  + other micro-architectural resources more complex

* Net result
  + Few additional logical "processors" for (almost) free
  + Helps a lot if you have threads waiting on memory
  + Hurts a little if you don't have threads
  + A wash if multiple threads rarely wait on memory (benchmarks?)
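
A back-of-envelope model of the "net result" bullets, offered as a sketch and not as anything from the paper. It assumes each thread alternates a fixed run of useful cycles with a fixed-length miss, and that switching threads is free. The `utilization` function and all numbers are hypothetical, and the "hurts a little" case (resource sharing overhead) is not modeled.

```python
def utilization(threads, run_cycles, miss_cycles):
    """Fraction of cycles the pipeline does useful work."""
    return min(1.0, threads * run_cycles / (run_cycles + miss_cycles))

# Hypothetical numbers: 25 cycles of work between misses, 75-cycle miss.
for n in (1, 2, 4):
    print(n, utilization(n, 25, 75))

# Threads that rarely wait: one thread already keeps the pipe nearly full.
print(utilization(1, 990, 10), utilization(4, 990, 10))  # 0.99 1.0
```

With these numbers one thread keeps the pipeline busy a quarter of the time and four threads saturate it, while in the rarely-waiting case extra threads add almost nothing.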


Niagara Pipeline (one of eight)

* 552 Pipe: Fetch, Decode, Execute, Memory Access, Writeback

* Nia Pipe: Fetch, *Thread Select*, Decode, Execute, Memory Access, Writeback

* Too much for 758:

  + Draw on board Figure 3. Niagara Pipeline Block Diagram
  + Go over stages

* Register File supports 8 register windows for each of 4 threads: 5.7KB!
  + Each thread sees the current 32 in fast SRAM
  + while the other registers are in compact SRAM

* Includes 8KB L1 I-cache and 16KB L1 D-cache (small, but get easy stuff)

Memory System

* L1 caches are write-through (with no write allocate)

* L1 caches maintain just Valid/Invalid

* L2 cache banks keep copy of L1 state 
  + Load misses reveal "way" of victim
  + Stores await invalidations of all other sharers
  + Implements Total Store Order (TSO)
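
A toy sketch of the idea that an L2 bank shadows L1 state and that stores invalidate other sharers. The `L2Bank` class and its method names are hypothetical, and it leaves out victim-way tracking, timing, and the acknowledgment messages themselves.

```python
class L2Bank:
    """Toy L2 bank that shadows which L1s hold each block (a directory)."""

    def __init__(self):
        self.data = {}      # block address -> value held in L2
        self.sharers = {}   # block address -> set of core ids with a Valid L1 copy

    def load_miss(self, core, block):
        # L1 load miss: return the data and record the new sharer.
        self.sharers.setdefault(block, set()).add(core)
        return self.data.get(block, 0)

    def store(self, core, block, value):
        # Write-through store: update L2, invalidate every *other* L1 copy.
        # No write allocate: the storing core is not added as a sharer.
        holders = self.sharers.get(block, set())
        invalidated = holders - {core}
        self.sharers[block] = holders & {core}
        self.data[block] = value
        return invalidated   # the store completes once these L1s have acked

bank = L2Bank()
bank.load_miss(0, 0x40)
bank.load_miss(1, 0x40)
print(bank.store(0, 0x40, 7))   # {1}
print(bank.load_miss(1, 0x40))  # 7
```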

* L2 is standard write-back cache w.r.t. memory

* Niagara systems allow only one Niagara chip


Assessment

* Pushes TLP further than others
* But processors may be too simple
* Lower clock frequency than chips with long pipes


Memory Consistency  -- NOT DONE
-------------------

Coherence's Goal:
* Make caches invisible, as in uniprocessors
* Once invisible, what does memory do?  (can't refer to caches)


Example

/* initial A = B = flag = 0 */
		  P1		    P2
		A = 1;		while (flag == 0); /* spin */
		B = 1; 		print A;
		flag = 1; 	print B;

Intuition says printed A = B = 1
(OMIT) Coherence doesn't say anything, why?
(OMIT) Consider coalescing write buffer


               /* initial A = B = 0 */
              P1                        P2
          A = 1;                  B = 1;
          r1 = B;                 r2 = A;

Three outcomes for (r1, r2) are possible, but not the fourth (r1 == r2 == 0).
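
The outcome count can be checked by brute force: enumerate every interleaving of the two processors that keeps each one's program order, and run each against a single shared memory. This is a sketch; the tuple encoding of operations and the helper names are assumptions.

```python
# Each op is (kind, location, value-or-register).
P1 = [("st", "A", 1), ("ld", "B", "r1")]
P2 = [("st", "B", 1), ("ld", "A", "r2")]

def interleavings(p1, p2):
    """Yield every merge of p1 and p2 that keeps each program order."""
    if not p1 or not p2:
        yield list(p1) + list(p2)
        return
    for rest in interleavings(p1[1:], p2):
        yield [p1[0]] + rest
    for rest in interleavings(p1, p2[1:]):
        yield [p2[0]] + rest

def run(order):
    """Execute one interleaving against a single memory; return (r1, r2)."""
    mem = {"A": 0, "B": 0}
    regs = {}
    for kind, loc, arg in order:
        if kind == "st":
            mem[loc] = arg
        else:
            regs[arg] = mem[loc]
    return regs["r1"], regs["r2"]

outcomes = sorted({run(o) for o in interleavings(P1, P2)})
print(outcomes)  # [(0, 1), (1, 0), (1, 1)]
```

No interleaving yields (0, 0), because whichever load runs last must follow both stores.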

(OMIT) How can hardware screw this up?
* write buffer
* out-of-order (OoO) loads
* directory protocol that doesn't wait for acks


Define Memory Consistency
* What SW can expect of HW
* What HW must provide to SW

Want 4Ps: (DID NOT COVER)

* Programmability: EZ to program
* Performance: facilitates good performance (or low cost)
* Portability
* Precision
								 

Sequential Consistency

A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.

Show two processors po and global mo

Show "railroad switch" picture

Same as multi-threaded uniprocessor


Total Store Ordering (TSO) -- x86 and SPARC
------------------------------------------

               /* initial A = B = 0 */
              P1                        P2
          A = 1;                  B = 1;
          r1 = B;                 r2 = A;

Allows r1 == r2 == 0?
(Why? write buffers)
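
A sketch of why write buffers admit r1 == r2 == 0. Each processor gets a FIFO write buffer: stores enter the buffer, loads check their own buffer before memory, and buffered stores drain to memory at arbitrary later points. The `tso_outcomes` helper and the operation encoding are assumptions, and this is a simplified operational model, not a formal TSO definition.

```python
def tso_outcomes(programs, init):
    """Explore every execution with one FIFO write buffer per processor."""
    results = set()

    def explore(pcs, bufs, mem, regs):
        done = True
        for p, prog in enumerate(programs):
            if pcs[p] < len(prog):            # execute next instruction of p
                done = False
                kind, loc, arg = prog[pcs[p]]
                pcs2 = pcs[:p] + [pcs[p] + 1] + pcs[p + 1:]
                if kind == "st":              # store goes into p's write buffer
                    bufs2 = bufs[:p] + [bufs[p] + [(loc, arg)]] + bufs[p + 1:]
                    explore(pcs2, bufs2, mem, regs)
                else:                         # load: own buffer first, then memory
                    hits = [v for a, v in bufs[p] if a == loc]
                    val = hits[-1] if hits else mem[loc]
                    explore(pcs2, bufs, mem, {**regs, arg: val})
            if bufs[p]:                       # drain p's oldest buffered store
                done = False
                loc, val = bufs[p][0]
                bufs2 = bufs[:p] + [bufs[p][1:]] + bufs[p + 1:]
                explore(pcs, bufs2, {**mem, loc: val}, regs)
        if done:                              # all code run, all buffers empty
            results.add((regs["r1"], regs["r2"]))

    explore([0] * len(programs), [[] for _ in programs], dict(init), {})
    return sorted(results)

P1 = [("st", "A", 1), ("ld", "B", "r1")]
P2 = [("st", "B", 1), ("ld", "A", "r2")]
print(tso_outcomes([P1, P2], {"A": 0, "B": 0}))
# [(0, 0), (0, 1), (1, 0), (1, 1)]
```

The (0, 0) outcome arises when both stores are still sitting in their write buffers while both loads read memory.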


Relaxed (Weak) Ordering -- ARM, IBM Power
-----------------------------------------


/* initially all 0 */
		  P1			    P2
		A = 1;			while (flag == 0); /* spin */
		B = 1; 			r1 = A;
		flag = 1; 		r2 = B;


But many (most) orders are NOT necessary! E.g., "A = 1" and "B = 1"


Why not just enforce necessary orders? 

Relaxed models
* Unordered by default
* Use FENCEs to get necessary order

/* initially all 0 */
		  P1			    P2
		A = 1;			while (flag == 0); /* spin */
		                        FENCE
		B = 1; 			r1 = A;
		FENCE
		flag = 1; 		r2 = B;
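
A crude sketch of what the FENCEs buy. It assumes a thread's operations may be reordered freely unless a FENCE separates them, then interleaves the reordered threads over one memory. The spin loop is abstracted as a single load of flag into a hypothetical register `f`, and only executions where that load sees 1 are kept. Real ARM and Power models are more subtle; all names here are illustrative.

```python
from itertools import permutations

FENCE = ("fence", None, None)

def legal_orders(prog):
    """Reorderings of one thread's ops that never move an op across a FENCE."""
    segments, cur = [], []
    for op in prog:                 # split into fence-delimited segments
        if op == FENCE:
            segments.append(cur)
            cur = []
        else:
            cur.append(op)
    segments.append(cur)
    orders = [[]]
    for seg in segments:            # permute freely within each segment
        orders = [o + list(p) for o in orders for p in permutations(seg)]
    return orders

def merges(a, b):
    """Every interleaving of a and b that keeps each list's own order."""
    if not a or not b:
        yield a + b
        return
    for rest in merges(a[1:], b):
        yield [a[0]] + rest
    for rest in merges(a, b[1:]):
        yield [b[0]] + rest

def outcomes(p1, p2):
    seen = set()
    for o1 in legal_orders(p1):
        for o2 in legal_orders(p2):
            for order in merges(o1, o2):
                mem = {"A": 0, "B": 0, "flag": 0}
                regs = {}
                for kind, loc, arg in order:
                    if kind == "st":
                        mem[loc] = arg
                    else:
                        regs[arg] = mem[loc]
                if regs["f"] == 1:  # spin loop exited: flag was seen set
                    seen.add((regs["r1"], regs["r2"]))
    return sorted(seen)

stores = [("st", "A", 1), ("st", "B", 1), ("st", "flag", 1)]
loads = [("ld", "flag", "f"), ("ld", "A", "r1"), ("ld", "B", "r2")]
print(outcomes(stores, loads))
# [(0, 0), (0, 1), (1, 0), (1, 1)]

fenced_p1 = [("st", "A", 1), ("st", "B", 1), FENCE, ("st", "flag", 1)]
fenced_p2 = [("ld", "flag", "f"), FENCE, ("ld", "A", "r1"), ("ld", "B", "r2")]
print(outcomes(fenced_p1, fenced_p2))
# [(1, 1)]
```

Note that "A = 1" and "B = 1" remain free to reorder in the fenced version; only the orders that matter are enforced.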