--------------------------------------------------------------------
CS 757 Parallel Computer Architecture
Spring 2012 Section 1
Instructor Mark D. Hill
--------------------------------------------------------------------
------------
Relaxed Consistency
------------
Outline
* Motivation
* Example Relaxed Model (XC)
* Implementing XC
* SC for DRF
* Other Relaxed Concepts
* High-Level Languages
* (R10000)
* (Hill 1998)
Relaxed Memory Models Motivation
/* initially all 0 */
P1              P2
A = 1;          while (flag == 0); /* spin */
B = 1;          r1 = A;
flag = 1;       r2 = B;
* Recall SC has
Each processor generates a total order of its reads and writes
L-->L, L-->S, S-->S, S-->L
That are interleaved into a global total order
* TSO removes S-->L
But many (most) orders are NOT necessary!
Why not just enforce necessary orders?
* coalescing write buffer
* simpler core speculation (e.g., less power)
* weaken coherence -- e.g., use block before getting all directory acks
Concepts also important for HLL
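The question "which orders are necessary?" can be checked by brute force for the flag example above. The sketch below is only an illustration, not a model of any real machine: it enumerates every interleaving of the six operations, keeps those that respect a chosen set of ordering constraints, and collects the (r1, r2) results for runs in which P2's final read of flag sees 1. The operation numbering, the names outcomes, kSC, kNeeded and kTooWeak, and the treatment of the spin loop as a single final read are all assumptions made for this example.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <set>
#include <utility>
#include <vector>

// Hypothetical encoding of the example:
// P1: 0 = "A = 1", 1 = "B = 1", 2 = "flag = 1"
// P2: 3 = final read of flag, 4 = "r1 = A", 5 = "r2 = B"
using Order = std::pair<int, int>;   // first must precede second in memory order
using Result = std::pair<int, int>;  // (r1, r2)

// Returns every (r1, r2) reachable when P2's read of flag observes 1.
std::set<Result> outcomes(const std::vector<Order>& required) {
    std::array<int, 6> perm{0, 1, 2, 3, 4, 5};
    std::set<Result> seen;
    do {
        std::array<int, 6> pos{};
        for (int i = 0; i < 6; ++i) pos[perm[i]] = i;
        bool legal = std::all_of(required.begin(), required.end(), [&](const Order& o) {
            assert(0 <= o.first && o.first < 6 && 0 <= o.second && o.second < 6);
            return pos[o.first] < pos[o.second];
        });
        if (!legal) continue;
        int A = 0, B = 0, flag = 0, saw_flag = 0, r1 = 0, r2 = 0;
        for (int op : perm) {
            switch (op) {
                case 0: A = 1; break;
                case 1: B = 1; break;
                case 2: flag = 1; break;
                case 3: saw_flag = flag; break;
                case 4: r1 = A; break;
                case 5: r2 = B; break;
            }
        }
        if (saw_flag == 1) seen.insert(Result(r1, r2));  // spin loop exited
    } while (std::next_permutation(perm.begin(), perm.end()));
    return seen;
}

// SC: full program order on each processor.
const std::vector<Order> kSC = {{0, 1}, {1, 2}, {3, 4}, {4, 5}};
// Only the orders the hand-off relies on: data stores before flag store,
// flag load before data loads.
const std::vector<Order> kNeeded = {{0, 2}, {1, 2}, {3, 4}, {3, 5}};
// Same, but "A = 1" is allowed to drift past "flag = 1".
const std::vector<Order> kTooWeak = {{1, 2}, {3, 4}, {3, 5}};
```

Under these assumptions, outcomes(kSC) and outcomes(kNeeded) both contain only (1, 1), while outcomes(kTooWeak) also contains (0, 1). The orders A-->B on P1 and r1-->r2 on P2 can be dropped without changing the result; the orders around flag cannot.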
Example Relaxed Model (XC)
Key idea is FENCE: a FENCE on Pi does not allow Pi's references to be reordered across it:
L-->F
S-->F
F-->F
F-->L
F-->S
/* initially all 0 */
P1              P2
A = 1;          while (flag == 0); /* spin */
B = 1;          FENCE
FENCE           r1 = A;
flag = 1;       r2 = B;
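A rough C++ analogue of the fenced example, as a sketch only: relaxed atomic accesses stand in for XC's unordered loads and stores, and std::atomic_thread_fence(std::memory_order_seq_cst) stands in for FENCE. This mapping is an assumption for illustration; XC is a hardware model and C++ fences are defined differently. The function name run_fenced_handoff is hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <utility>

// Runs P1 and P2 once and returns (r1, r2).
std::pair<int, int> run_fenced_handoff() {
    std::atomic<int> A{0}, B{0}, flag{0};
    int r1 = -1, r2 = -1;

    std::thread p1([&] {
        A.store(1, std::memory_order_relaxed);
        B.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);  // FENCE
        flag.store(1, std::memory_order_relaxed);
    });
    std::thread p2([&] {
        while (flag.load(std::memory_order_relaxed) == 0) {}  // spin
        std::atomic_thread_fence(std::memory_order_seq_cst);  // FENCE
        r1 = A.load(std::memory_order_relaxed);
        r2 = B.load(std::memory_order_relaxed);
    });

    p1.join();
    p2.join();
    assert(r1 == 1 && r2 == 1);  // guaranteed by the two fences
    return {r1, r2};
}
```

The stores to A and B are left unordered with respect to each other, as are the two loads on P2; only the orders across each fence are enforced.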
Also, XC orders accesses to the SAME address A, like TSO:
L(A)-->L(A)
L(A)-->S(A)
S(A)-->S(A)
Ensures:
A = 1;          while (flag == 0); /* spin */
A = 2;          FENCE
FENCE           r1 = A; /* r1==2, not 1 */
flag = 1;
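In summary, XC from the book:
                      Operation 2
Op 1        Load   Store   RMW   FENCE   F+RMW+F
Load         A      A       A     X       X
Store        B      A       A     X       X
RMW          A      A       A     X       X
FENCE        X      X       X     X       X
F+RMW+F      X      X       X     X       X
X == Op1 is always ordered before Op2
A == ordered only if Op1 and Op2 are to the SAME address
B == not ordered, but a load to the same address must get the store's value (bypassing)
Causality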
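The XC ordering table can be written down as a small lookup, which makes it easy to query individual pairs. This is a sketch under one reading of the table, assumed here: X means the order is always enforced, A means it is enforced only for accesses to the same address, and B (Store then Load) means no order is enforced but a same-address load must see the store's value. The names Op, Rule, xc_rule and xc_must_order are hypothetical.

```cpp
#include <cassert>

enum class Op { Load, Store, Rmw, Fence, FencedRmw };  // FencedRmw == F+RMW+F
enum class Rule { Always, SameAddr, Bypass };          // X, A, B in the table

// Table entry for Op1 = first, Op2 = second (program order).
Rule xc_rule(Op first, Op second) {
    auto is_fence_like = [](Op o) { return o == Op::Fence || o == Op::FencedRmw; };
    if (is_fence_like(first) || is_fence_like(second)) return Rule::Always;
    if (first == Op::Store && second == Op::Load) return Rule::Bypass;
    return Rule::SameAddr;
}

// True if memory order must preserve program order for this pair.
bool xc_must_order(Op first, Op second, bool same_address) {
    switch (xc_rule(first, second)) {
        case Rule::Always:   return true;
        case Rule::SameAddr: return same_address;
        case Rule::Bypass:   return false;  // value is forwarded, order is not enforced
    }
    assert(false && "unreachable");
    return false;
}
```

For example, xc_must_order(Op::Store, Op::Store, false) is false, which is what permits a coalescing write buffer, while any pair involving Op::Fence returns true.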
Want 4Ps:
* Programmability: EZ to program -- DRF yes; TO close
* Performance: could be better than TSO
* Portable -- DRF yes; otherwise hard
* Precision -- formal definition obtuse
Show Venn diagrams -- Figure 4.5
* executions
* implementations
* SC, TSO, then PPC and ARM
HLL -- Figure 5.6
C++ Java
(a) --- ----
c/r compil/runtime binary
(b) ---------------------------
HW system
We have been talking about (b)
Most programmers care about (a) and down
Java Memory Model - Manson, Pugh, & Adve
Show picture of HLL memory model
Just as a HW implementation can reorder, so can the compiler
Example 1
if (ready==0) {}
r1 = data ;; reorder?
Example 2
data = r2 ;; reorder
ready = 1
if (ready==0) {}
r1 = data
Register allocation, constant propagation, hoisting from loop, ...
Idea:
(1) Make programmers identify synchronization.
(2) Tell programmers to make their programs data-race-free (DRF)
(3) If DRF,
(a) Programmer can reason with SC
(b) Compiler can use many optimizations between synchronization
operations
(c) Depending on the HW memory consistency model, compiler can insert fences
All of the above due to Adve & Hill, ISCA 1990.
The new C++ memory model does this as well.
(4) What if a program is not DRF, by error or malicious intent?
Java's security model requires that we say what can happen. Want
(i) simplicity
(ii) most compiler optimizations from 3(b) allowed.
Got (ii) but not (i).
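Steps (1) to (3) can be illustrated in C++, where declaring a variable std::atomic is how the programmer identifies synchronization. The sketch below is a minimal example, not a statement of how any particular compiler behaves: data is an ordinary variable, ready is labeled as synchronization, and the program is data-race-free because every access to data is ordered by the accesses to ready. The function name run_drf_handoff is hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Producer publishes `value` through `data`; consumer returns what it read.
int run_drf_handoff(int value) {
    int data = 0;               // ordinary data: compiler may optimize freely
    std::atomic<int> ready{0};  // (1) identified as synchronization
    int r1 = -1;

    std::thread producer([&] {
        data = value;
        ready.store(1);         // default ordering is sequentially consistent
    });
    std::thread consumer([&] {
        while (ready.load() == 0) {}  // cannot be hoisted out of the loop
        r1 = data;                    // (2) no data race: ordered after the store
    });

    producer.join();
    consumer.join();
    assert(r1 == value);        // (3a) the SC reasoning holds
    return r1;
}
```

If ready were a plain int instead, the program would have a data race and the reorderings in Examples 1 and 2 would be allowed.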
MIPS R10000
Speculative Cores
Address Queue
Coherent memory system
1. Core puts speculative loads and stores in address queue (AQ) in program order
2. Speculative loads obtain their value from the last store to the same address in the AQ, or else from the coherent memory system
3. Incoming invalidate requests that seek to zap a block that is a target of a pending load in AQ
cause a mis-speculation to ensure incorrect load values are discarded
Although the address queue executes load and store instructions out of
their original order, it maintains sequential memory consistency. The
external interface could violate this consistency, however, by
invalidating a cache line after it was used to load a register, but before
that load instruction graduates. In this case, the queue creates a soft
exception on the load instruction. This exception flushes the pipeline
and aborts that load and all later instructions, so the processor does
not use the stale data.
--Yeager p35, left column, new paragraph 5
4. Speculative stores write value into address queue
5. Instruction commit in program order
* Loads just removed from address queue
* Stores obtain M copy and then write into cache
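Steps 1 to 5 can be sketched as a toy address queue. This is a simplified illustration, not the R10000 design: it tracks single addresses rather than cache blocks, squashes any pending load to an invalidated address even if its value was forwarded from the queue, and ignores obtaining the M copy at commit. The names AqEntry and AddressQueue and all method names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <map>

struct AqEntry { bool is_load; int addr; int value; };

class AddressQueue {
public:
    explicit AddressQueue(std::map<int, int>& memory) : mem_(memory) {}

    // Steps 1-2: enqueue a speculative load; its value comes from the youngest
    // earlier store to the same address in the queue, else from memory.
    int load(int addr) {
        int value = mem_[addr];
        for (auto it = q_.rbegin(); it != q_.rend(); ++it) {
            if (!it->is_load && it->addr == addr) { value = it->value; break; }
        }
        q_.push_back({true, addr, value});
        return value;
    }

    // Step 4: a speculative store writes its value into the queue only.
    void store(int addr, int value) { q_.push_back({false, addr, value}); }

    // Step 3: an invalidation that hits a pending load squashes that load and
    // everything younger. Returns the number of entries squashed.
    std::size_t invalidate(int addr) {
        for (std::size_t i = 0; i < q_.size(); ++i) {
            if (q_[i].is_load && q_[i].addr == addr) {
                std::size_t squashed = q_.size() - i;
                q_.erase(q_.begin() + static_cast<std::ptrdiff_t>(i), q_.end());
                return squashed;
            }
        }
        return 0;
    }

    // Step 5: commit the oldest entry in program order.
    void commit() {
        assert(!q_.empty());
        AqEntry e = q_.front();
        q_.pop_front();
        if (!e.is_load) mem_[e.addr] = e.value;  // loads are simply removed
    }

    std::size_t size() const { return q_.size(); }

private:
    std::deque<AqEntry> q_;  // program order, oldest at the front
    std::map<int, int>& mem_;
};
```

A squashed load would be re-executed by the core and pick up the new value; that replay is left out of the sketch.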
--------------------------------NOT USED------------------------------
@ARTICLE{hill:simple,
AUTHOR = "Mark D. Hill",
TITLE = "Multiprocessors Should Support Simple Memory Consistency Models",
JOURNAL = COMPUTER,
YEAR = 1998,
VOLUME = 31,
NUMBER = "8",
PAGES = "28-34",
MONTH = "August",
WHERE = "bound"}
Most microprocessors will do most of the following
Coherent caches
Non-binding prefetching
Multithreading
Speculation
The combination of these can make the RC/PC/SC performance differences smaller
Go through example
Results: RC/PC/SC can all use the same techniques,
but:
SC may commit later and exhaust implementation resources
SC can have more mis-speculations
Quantitatively
Using Sarita's ASPLOS 1996 numbers,
SCimpl to PCimpl reduces execution time 10%
SCimpl to RCimpl reduces execution time 16%
Thus
Have hardware implement SC
Speculative execution closes performance gap
Get this complexity off SW/HW interface
so middleware authors can concentrate on their other jobs
HW designers get a simple, formal correctness criterion.
But
Whither simple cores?
Whither power?