--------------------------------------------------------------------
CS 757 Parallel Computer Architecture
Spring 2012 Section 1
Instructor Mark D. Hill
--------------------------------------------------------------------

------------
Coherence 2
------------

Outline (two lectures)
 Exclusive (E) and Owned (O) States (now or last time)
 Non-Atomic Bus
 Qualitative Sharing Patterns
 (Starfire)
 Hammond: TLS 
 (Hill: Use Simple Mem Models)


Review System Model & Coherence Invariants
------------------------------

System model -- Figure 2.1

cores w/ private caches w/ controllers
icn
LLC w/ memory w/ controller
Offchip DRAM


Coherence Invariants
------------------------------

1. Single-Writer, Multiple-Read (SWMR) Invariant. For any memory location A, at
any
given (logical) time, there exists only a single core that may write to A (and
can also read it)
or some number of cores that may only read A.
2. Data-Value Invariant. The value of the memory location at the start of an
epoch is the same
as the value of the memory location at the end of its last readwrite
epoch.


Show Figure 2.3 timeline
  read-write at 1
  read-only at 2, 3
  read-write at 2
..

Maintiain invariants
* Use (64B blocks)
* FSM at caches & LLC
* communication with message/bus

Goal:
* Make cache is invisible as in uniprocessors
* Once invisible, what does memory do?  (can't refer to caches)


Specifying Coherence Protocols
------------------------------

FSMs communicating via messages.  Use Table:

Row for state and transient states
Columns for events -- core request and incoming messages

VI:

I-->V Own-Get_DataResp
V-->I Own--Put or Other-Get

Go over Table 6.2
Transient state IV[D]

Note: 

P1's cache has a virtual FSM per block
Pi's cache has same FSM (but may in in different state)
Memory (LLC) also have virtual FSM per block (difference from cache FSM)  --
See Table 6.3

MOESI
-----
		Validity	Dirtiness	Exclusivity	Owned
Modified	   X		    X		    X	          X
(Owned) 	   X		     		    	          X
(Exclusive)	   X		     		    X
Shared  	   X		     		     
Invalid 	 


Stable states stored in cache, e.g., ceiling(log2(5)) = 3 bits
Transient states in MSHRs

Common Transactions: GetS GetM Upgrade (PutS) (PutE) PutO PutM

Common Requests: load, store, RMW, i-fetch, RO-prefetch, RW-prefetch, replace

Protocol Taxonomy (simplified)

* Snooping: totally-ordered broadcast (Chapter 7)

* Directory: point-to-point message with level of indirection (Chapter 8)


Write Invalidate vs. Write Update
* assumed write invalidate & this is more common
* write update -- hard to implement memory consistency models and too much traffic


=====================
Snooping (Chapter 7)
=====================

Simple -- 
--------------------
* Atomic Requests (request ordered same cycle it is issued)
* Atomic Transactions (no other request to SAME block until transaction done)
Show
* $ FSM Figure 7.1
* mem FSM Figure 7.1
* system

Go over FSMs in Figures 7.5 and 7.6
shaded -- not possible
blank -- no action
Mem: IorS[D] -- memory waiting for writeback as part of cores doing M to S transistion
store in M is GetM/SM[D] -- sends data reduntanly -- could have Upgrade


Figure 7.8-7.9  -- Non-atomic request (e.g., queue to get on bus), Atomic Transactions
Store in I, send GetM ==> IM[AD], see own GetM (ordering point), could "do" store
==> IM[D], gets data ==> M, finishes store


Consider "window of vulnerability"
Store in S, send GetM/Upgrade ==> SM[AD], see OTHER GetM so invalidate ==> IM[AD], own GetM
==> IM[D], gets data ==> M, finishes store
Makes upgrade transacton trickier


Normal Writeback
writeback in M, send PutM ==> MI[A], see own PutM, send data ==> I

Writeback racing other GetM
writeback in M, send PutM ==> MI[A], see other GetM, send data ==> II[A], see own PutM ==> I


Exclusive (E) State
--------------------
* Idea: on GetS, if no other sharers, goto E instead of S
* If subsequent store, silently go to M
* If other Gets, silently go to S
* If replace in E, treat like S (silent replacement)

Important to (mostly) private data
* Otherwise read miss (GetS) then write miss (GetM)
* With E read mis (GetS) and then silent upgrade to M

How implement "if no other sharers"
* Add state to LLC if there is (was) at least one sharer
* Before LLCs, often had logical "wired" OR of sharers -- shared line
[Before LLCS, memory often found out whether there was an M block with OR as well -- owned line]

Look at FSMs 7.4 and 7.5


Owned (O) State
--------------------
Advantages
Otherwise in M 7 see other GetS, must send data to BOTH requestor and LLC
O state eliminates extra data message and LLC updates
Historialy, O also allowed subsequent GetS to be source from cache -- which
could be faster than memory (old days) but probably not LLC
See FSM in Figure 7.6 and 7.7


Non-Atomic Bus -- pipelined or out of order
--------------------
See Figures 7.11
NASTY RACES!
Store in I, send GetM ==> IM[AD], see OTHER GetM (do nothing -- this core "before" you
==> IM[AD], own GetM ==> IM[D], OTHER GetM (must promise to forward data you
don't have) ==> IM[D]I then gets data FOOTNOTE send data to other ==> I

FOOTNOTE:  (
* If you send data w/o doing instruction, could livelock
* If always "hold" data until you can do an instruction could deadlock
* Do one instruction if and only if it is oldest


Qualitative Sharing Patterns [Weber & Gupta, ASPLOS3]
-----------------------------

* Read-Only

* Migratory Objects
  - Maniputalated by one processor at a time
  - Often protected by a lock
  - Usually a write causes only a single invalidation

* Synchronization Objects
  - Often more processors imply more invalidations

* Mostly Read
  - More processors imply more invalidations, but writes are rare

* Frequently Read/Written
  - More processors imply more invalidations


--------------------
  
Starfire: Extending the SMP Envelope

Alan Charlesworth, Sun Microsystems

IEEE Micro, Jan/Feb 1998, pp. 39-49.


Notes by Mark D. Hill, 19 Feb 2001


Starfire == Sun Ultra Enterprise 10000

Apt title:-- SMP but no physical bus


Same coherence protocols has E6000, etc.: write-invalidate MOESI

Up to 64 processors with 4-processors, i/o, and memory per board

4 address buses, interleaved on low-order block bits and implemented w/ ASICs

Data network is 144b (16+2 byte) 16x16 crossbar

Many active parts on centerplane (Figure 5)

130 ASICs in interconnect!


Read miss takes 38 cycles (468 ns) (vs. 18 clocks / 216 ns for E6000)

FOLLOW MISS OF TABLE 4 THROUGH FIGURE 4.


Domains very important
     multiple logical machines in one physical machine
     (like mainframe partitions)
     In global in interconnect, for each of 16 boards a 16b mask with
     which boards are in the same partition (& snoops should continue)
     lots of hardware redundancy

     why?
	software migration
	separating test SMP
	consolidating several SMPs
	hard partitioning for cost-accounting, etc.

OMIT {

The Sun Fireplan Interconnect
Alan Charlesworth
IEEE Micro
Jan-Feb 2002
pp. 36-45

Level	Comment
-----	-------
0	1-2 CPUs board w/ two USIII processors/memory (or two PCIs) (or up)
1	8 CPUs (workgroup servers) (or up)
2	24 CPUs (or up)
3	106 CPUs via directory protocol between snoop domains

}

------------------------

Lance Hammond, Ben Hubbert, Michael Siu, Manohar Prabhu,
Mike Chen, and Kunle Olukotun,
The Stanford Hydra CMP,
IEEE Micro, March-April 2000, pp. 71-84.


With wire delay, more transisitors, growing complexity,
and limited ILP, let do a CMP instead?

Why not before?

* Not enough transistors -- Moore's Law provides
* Not enough parallel applications -- so thread-level specualtion (TLS)

Base CMP
--------
4-cores
write-through L1 caches
shared write-back L2 cache
read bus -- cache block wide
write buses -- word wide
simple write-invalidate protocol
Buses use multiple segments with repeaters -- logical but not physical buses
(Probably not sequentially consistent -- processor consistent?)

Thread-Level Speculation (TLS)
-----------------------------
Grew out of Wisconsin Multiscalar Project

Think Core i executes loop iteration k, 4+k, ...
Also subroutines?  Fine-grain loops a problem

Can do w/o TLS but then worse-case synchronization must maintian
memory RAW, WAW, and WAR hazards.

Assume no dependencies and have hardware detect
Must:

1. Forward data
2. Handle RAW
3. Discard speculative state
4. Handle WAW
5. Handle WAR

Show the above via example (with time vertical)

Time	Iteration i	Iteration i+1
 |
 V	(1) st A

			(1) ld A

			(2) ld B

	(2) st B

			(3) discard

			(4) st C

	(4) st C

			(5) st D

	(5) ld D

Hydra adds:

* Bits on each L1 cache lines to track speculative reads/writes
* Writeback buffer to defer L2 updates until commit (buffers at
L2 cache)

1. Forward data: L1 invalidated; read L2
2. Handle RAW: see write to line with speculative read marked
3. Discard speculative state: zap L1 cache and writeback buffers
4. Handle WAW: update L2 in commit order
5. Handle WAR: writes by "later" thread not seen; actually pre-invalidate
   remembers for when processor is re-assigned (e.g., from k to 4+k)

Analysis

Performance okay
New results for Java JVM better
Not accepted commercially -- many CMPs but no announced TLS support
Is communication too slow to pseudo-TLP?


Review Question
* Marc: TLS is Memory, but not Registers
* Asim: TLS successful?  Shijin & Aditya: TLD used?  Brian: Perform?
* Guolang: five memory sys requirement; two way to divide -- loops, subroutines
* David: question on writes
* Daniel: MIPS coprocessor -- not important
* Andy:  Speedup good?


NEEDs TO MOVE

@ARTICLE{hill:simple,
    AUTHOR = "Mark D. Hill",
    TITLE = "Multiprocessors Should Support Simple Memory Consistency Models",
    JOURNAL = COMPUTER,
    YEAR = 1998,
    VOLUME = 31,
    NUMBER = "8",
    PAGES = "28-34",
    MONTH = "August",
    WHERE = "bound"}


My current thinking
    [Pai, et al., An Evaluation of Memory Consistency Models ... ILP Processors"
    ASPLOS96] (in reader)
    [Hill, Multiprocessors Should Support Simple Memory Consistency Models,
    Computer, Aug 98] (my web page)
    [Gniady, Falsafi, & Vijaykumer, Is SC + ILP = RC?, ISCA99] (from Purdue)

Most microprocessr will do most of the following
    Coherence caches
    Non-binding prefetching
    Multithreading
    Speculation

The combination of these can make the RC/PC/SC performance differences smaller
    Go through example

    Results:  RC/PC/SC can do all use the same techniques,
    but:

	SC may commit later and exhaust implementation resources
	SC can have more mis-speculations

Quantitatively
    Using Sarita's ASPLOS numbers, 
    SCimpl to PCimpl reduce execution time 10%
    SCimpl to RCimpl reduce execution time 16%

But
    These are scientific programs.  Database?  OSs?

    We have only begun to fight
	large caches, active windows, etc.
	new techniques, see "SC + ILP = RC?"

Thus

    Have hardware implement SC
    Speculative execution closes performance gap
    Get this complexity off SW/HW interface
	so middleware authors can concentrate on their other jobs
	HW designers get a simple, formal correctness criterion.

Reviews
* Asum: common? x86, TSO, powerPC (alpha, RMO), (SC for SGI/HP)
* David: 20% large?
* Brian: Multicore and simpler cores?
* Daniel: Gap today?  multithreaded writebuffers
* Shijin: Why does spec exec reduce SC/RC gap?
* Marc: Advocates RC
* Andy: W-to-R order models?
* Tony & Guoliang: 10-year predictions