# CS 758: Advanced Topics in Computer Architecture

Lecture #8: The GPU Memory System (Part 2)

Professor Matthew D. Sinclair

The backup slides were developed by Tim Rogers at Purdue University.

Slides enhanced by Matt Sinclair

#### In-Class Activity

- With a partner, work on the 3 problems for 5 minutes
  - Problem 1: GPU MCM
  - Problem 2: GPU MCM with scopes
  - Problem 3: bugs with writing synchronization code on GPUs
- Then we'll come together as a group and discuss

### CPU Coherence: MESI



- Write miss: Get ownership, invalidate all sharers
- Read miss: Update sharers list
- Synchronization points are cheap
- BUT poor fit for GPUs: [Singh13, Hechtman14]
  - Directory overhead, transient states, excessive traffic, indirection
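
The two miss cases above can be sketched as a toy directory entry for a single block. This is only an illustration under simplifying assumptions: there are no transient states and no E state, and the names `Directory`, `read_miss`, and `write_miss` are hypothetical rather than taken from any real protocol implementation.

```cpp
#include <cassert>
#include <set>

// Toy directory entry for ONE cache block (hypothetical names, no transient states).
struct Directory {
    std::set<int> sharers;   // cores holding a read-only copy
    int owner = -1;          // core holding the writable copy, or -1 if none
    int invalidations = 0;   // invalidation messages sent so far

    // Read miss: a previous owner is downgraded to sharer, then the
    // requester is added to the sharers list.
    void read_miss(int core) {
        assert(core >= 0);
        if (owner != -1) {
            sharers.insert(owner);
            owner = -1;
        }
        sharers.insert(core);
    }

    // Write miss: invalidate every other copy, then grant ownership.
    void write_miss(int core) {
        assert(core >= 0);
        for (int s : sharers)
            if (s != core) ++invalidations;
        if (owner != -1 && owner != core) ++invalidations;
        sharers.clear();
        owner = core;
    }
};
```

Even this stripped-down version shows where the per-block directory state and the invalidation traffic come from.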

#### **Complex coherence, simple consistency**

### Atomics Background

- Default: Data-race-free-0 (DRF0) [ISCA '90]
  - Identify all races as synchronization accesses (C++: atomics)

// each thread, for i = 0:n

ADD R4, A[i], R1 synch (atomic)

ADD R5, B[i], R1 synch (atomic)

- All atomics order data accesses
- Atomics order other atomics

...

 $\Rightarrow$ Ensures SC semantics if no data races
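
In C++ terms, DRF0 corresponds to marking the racy access as a `std::atomic` with the default sequentially consistent ordering. The sketch below is a CPU-side illustration, not GPU code; the function name `message_passing_drf0` is hypothetical.

```cpp
#include <atomic>
#include <thread>

// Returns the value the consumer reads from `data` after it sees the flag.
int message_passing_drf0() {
    int data = 0;                    // ordinary data access
    std::atomic<bool> ready{false};  // the racy access, marked as synchronization
    int observed = -1;

    std::thread producer([&] {
        data = 42;
        ready.store(true);           // default seq_cst: orders the earlier data write
    });
    std::thread consumer([&] {
        while (!ready.load()) { }    // seq_cst load, spins until the flag is set
        observed = data;             // ordered after the flag read, so it sees 42
    });
    producer.join();
    consumer.join();
    return observed;
}
```

Because the only conflicting concurrent accesses are to the atomic `ready`, the program is data-race-free and behaves as if it were sequentially consistent.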

### Atomics Background (Cont.)

- Default: Data-race-free-0 (DRF0) [ISCA '90]
  - All atomics order data accesses
  - Atomics order other atomics
  - $\Rightarrow$ Ensures SC semantics if no data races
- Data-race-free-1 (DRF1): unpaired atomics [TPDS '93]
  - + Unpaired atomics do not order data accesses
  - Atomics order other atomics
  - $\Rightarrow$ Ensures SC semantics if no data races
- Relaxed atomics [PLDI '08]
  - + Do not order data or other atomics

 $\Rightarrow$ But can violate SC, and there is no formal specification
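
A common benign use of relaxed atomics is an event counter. The sketch below uses C++ `std::memory_order_relaxed` as a CPU-side stand-in; the function name `count_events` is hypothetical. The increments commute, so the final total is the same whatever order they take effect in.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Each worker bumps a shared counter with relaxed atomics.
long count_events(int num_threads, int events_per_thread) {
    std::atomic<long> counter{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < events_per_thread; ++i)
                counter.fetch_add(1, std::memory_order_relaxed);  // orders nothing else
        });
    }
    for (auto& w : workers) w.join();  // join synchronizes, so the final read is well defined
    return counter.load(std::memory_order_relaxed);
}
```

The same relaxed operations would not be safe for publishing other data, since they order neither data accesses nor other atomics.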

### Traditional GPU Coherence



- Each thread accesses independent data (no races)
- No data reuse or data sharing
- Coarse-grained synchronization
- Optimized for streaming, data-parallel applications

### GPU Memory Consistency Model

- Active area of research
- Tightly tied in with coherence protocol
- Provides very weak guarantees
  - Respect program order within a single thread
  - Easy to design hardware
  - Programmers add *fences* to provide extra guarantees
    - A fence guarantees that all previous accesses are visible before proceeding
    - ... usually
- Most GPUs use a *scoped* memory consistency model
  - `__threadfence_block()`: block-local synchronization (usually at L1)
  - `__threadfence()`: GPU-global synchronization (usually at L2)
  - `__threadfence_system()`: CPU-GPU global synchronization (flush GPU)
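
As a rough CPU-side analogy for how fences add ordering on top of weak accesses, the sketch below pairs relaxed flag accesses with C++ release and acquire fences. It is not GPU code: C++ fences take no scope argument, and the function name `message_passing_with_fences` is hypothetical.

```cpp
#include <atomic>
#include <thread>

// Returns the value the consumer reads from `data` after it sees the flag.
int message_passing_with_fences() {
    int data = 0;
    std::atomic<int> flag{0};
    int observed = -1;

    std::thread producer([&] {
        data = 42;
        std::atomic_thread_fence(std::memory_order_release);  // earlier writes ordered before the flag store
        flag.store(1, std::memory_order_relaxed);
    });
    std::thread consumer([&] {
        while (flag.load(std::memory_order_relaxed) == 0) { }
        std::atomic_thread_fence(std::memory_order_acquire);  // later reads ordered after the flag load
        observed = data;
    });
    producer.join();
    consumer.join();
    return observed;
}
```

On a GPU with a scoped model, the programmer would additionally have to pick the narrowest fence scope that still covers both communicating threads.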

### GPU Coherence with DRF



- With data-race-free (DRF) memory model
  - No data races; synchs must be explicitly distinguished
  - Synchronization accesses (atomics) go to last level cache (LLC)
  - Synchronization points are expensive, preclude reuse

#### Simple but inefficient coherence, simple consistency

### GPU Coherence with HRF



- New memory model: Heterogeneous-race-free (HRF) [ASPLOS '14]
  - Adds scoped synchronization
  - No overhead for locally scoped synchronizations
- But higher programming complexity

More efficient coherence, complex consistency
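
The cost difference between local and global scopes can be illustrated with a toy model of per-compute-unit L1s over a shared L2. Everything here is a simplifying assumption: the names `ToyGpu`, `Scope`, `release`, and `acquire` are hypothetical, there is no timing, and a local scope is assumed to mean that both synchronizing threads share one L1.

```cpp
#include <cassert>
#include <map>
#include <vector>

enum class Scope { Local, Global };

struct ToyGpu {
    std::map<int, int> l2;                   // shared last-level cache
    std::vector<std::map<int, int>> l1;      // per-CU cached copies
    std::vector<std::map<int, int>> dirty;   // per-CU writes not yet in L2
    int writethroughs = 0;
    int invalidated = 0;

    explicit ToyGpu(int cus) : l1(cus), dirty(cus) { assert(cus > 0); }

    void write(int cu, int addr, int v) { l1[cu][addr] = v; dirty[cu][addr] = v; }

    int read(int cu, int addr) {
        auto it = l1[cu].find(addr);
        if (it != l1[cu].end()) return it->second;  // L1 hit (may be stale)
        int v = l2[addr];                           // miss: fetch from L2 (0 if never written)
        l1[cu][addr] = v;
        return v;
    }

    // Release: a global scope flushes dirty data to L2, a local scope does nothing.
    void release(int cu, Scope s) {
        if (s == Scope::Local) return;
        for (const auto& kv : dirty[cu]) { l2[kv.first] = kv.second; ++writethroughs; }
        dirty[cu].clear();
    }

    // Acquire: a global scope drops all clean L1 copies, a local scope does nothing.
    void acquire(int cu, Scope s) {
        if (s == Scope::Local) return;
        invalidated += static_cast<int>(l1[cu].size() - dirty[cu].size());
        l1[cu] = dirty[cu];  // only not-yet-flushed writes survive
    }
};
```

A locally scoped release costs nothing in this model, but a thread on another compute unit that synchronizes afterwards will not see the write, which is the programming-complexity side of the trade-off.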

### GPU Coherence with HRF

- With heterogeneous-race-free (HRF) memory model [ASPLOS '14]
  - No data races; synchs and their scopes must be explicitly distinguished
  - At all global synch points
    - Flush all dirty data: unnecessary writethroughs
    - Invalidate all data: can't reuse data across synch points
  - Global synchronization accesses must go to last level cache (LLC)
  - No overhead for locally scoped synchs
- But higher programming complexity

### DeNovo Coherence with DRF



- Reuse dirty data across synch points: more data reuse
- Synchronization accesses can be performed at L1: synch reuse

#### 3% area overhead vs. GPU Coherence + HRF

Efficient coherence, simple consistency (except relaxed atomics)

### Additional Topics

- Extending Coherence across accelerators
  - CCIX, ACE, Spandex, CXL
  - Challenge: Different accelerators have different coherence requirements
    - E.g., different data widths
  - Especially important as number of accelerators increases
- Multi-GPU/Multi-Chiplet Coherence
  - Challenge: NUMA accesses (à la SMPs), page migration, partitioning
  - Potentially can extend existing mechanisms
- DRFrlx [Sinclair ISCA '17]
  - Extend MCM to provide sane semantics for relaxed atomics
- Timestamps
  - Avoid traditional coherence overheads, but some bookkeeping/delay instead

### Additional Topics (Cont.)

- Coherence Granularity
  - HSC [Power MICRO '13] large granularity for traditional GPGPU apps
  - hUVM [Koukos TACO '16] page granularity coherence
    - Exploits the nature of traditional GPGPU apps
  - Some relationship with subsequent work on ACE, CCIX, Spandex, etc.
- Coherent Scratchpads
  - Best of both worlds!
- Remote Scopes [Orr ASPLOS '15]
  - Dynamically vary scope granularity to reduce overhead
- hLRC [Alsop MICRO '17]

#### Conclusion

- GPU coherence and consistency are hot, recent topics
  - Lots of ongoing research in the area
- Goal: avoid replicating 20-30 years of CPU MCM research
- Idea: evolve from what CPUs already have now
  - Design Goal: keep GPU coherence protocols simple
  - GPU apps don't need more complex coherence protocols
  - This has implications for the MCM
- State-of-the-art (products): GPU + HRF
  - CPU-GPU coherence in real products assumes this for the GPU component

### Backup

#### Coherence & Consistency Qualitative Analysis

| Coherence + Consistency | Reuse Owned Data | Reuse Valid Data | Do Synchs at L1 |
|-------------------------|------------------|------------------|-----------------|
| GPU + DRF (GD)          | X                | X                | X               |
| GPU + HRF (GH)          | local            | local            | local           |
| DeNovo + DRF (DD)       | $\checkmark$     | X                | $\checkmark$    |
| DeNovo-RO + DRF (DD+RO) | $\checkmark$     | read-only        | $\checkmark$    |
| DeNovo + HRF (DH)       | $\checkmark$     | local            | $\checkmark$    |

#### Incorporating Other Use Cases Into DRFrlx

- Use cases: Work Queues, Flags, Ref Counters
- How to incorporate SC violations in approximable applications?
- How to incorporate overlapped atomics that do not order?
- How to incorporate violations of SC that do not affect final result?
- How to incorporate relaxed atomics that do not order data?

| Category     | Semantics                         |
|--------------|-----------------------------------|
| Unpaired     | SC                                |
| Non-Ordering |                                   |
| Commutative  | Final result always SC            |
| Speculative  |                                   |
| Quantum      | SC-centric: non-SC parts isolated |

#### **Review: Cache Coherence Problem**



- Processors see different values for u after event 3
- With write-back caches, the value written back to memory depends on which cache writes back its value first
- Unacceptable situation for programmers

## **Coherence Invariants**

#### 1. Single-Writer, Multiple-Reader (SWMR) Invariant



2. Data-Value Invariant. The value of the memory location at the start of an epoch is the same as the value of the memory location at the end of its last read-write epoch.

### **Coherence States**

- How to design system satisfying invariants?
- Track the "state" of memory block copies and ensure state changes satisfy the invariants.
- Typical states: "modified", "shared", "invalid".
- Mechanism for updating block state called a coherence protocol.
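
As a small illustration of checking an invariant against tracked states, the function below tests the single-writer, multiple-reader condition for one block at one instant, using the three typical states named above. The names `State` and `satisfies_swmr` are hypothetical.

```cpp
#include <vector>

enum class State { Invalid, Shared, Modified };

// True if the per-cache states of ONE block satisfy SWMR at this instant:
// either one writer and no readers, or any number of readers and no writer.
bool satisfies_swmr(const std::vector<State>& copies) {
    int writers = 0;
    int readers = 0;
    for (State s : copies) {
        if (s == State::Modified) ++writers;
        else if (s == State::Shared) ++readers;
    }
    return writers == 0 || (writers == 1 && readers == 0);
}
```

A coherence protocol is then a set of state transitions that never takes a block from a configuration where this check passes to one where it fails.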

### Intra-GPU Coherence




#### **GPU Coherence Challenges**

- Challenge 1: Coherence traffic



#### GPU Coherence Challenges

- Challenge 2: Tracking in-flight requests
  - Significant % of L2



#### GPU Coherence Challenges

- Challenge 3: Complexity

#### Non-coherent L1



#### Non-coherent L2

Transition table for the non-coherent L2 controller: 5 states (NP, SS, ISS, IM, IMA) by 6 events (L1 GETS, WB Data, L2 Atomic, L2 Replacement, L2 Replacement clean, Mem Data).

#### MESI L2 States

Transition table for the MESI L2 controller: 17 states (NP, SS, M, MT, and 13 transient states) by 16 events (L1 GET INSTR, L1 GETS, L1 GETX, L1 UPGRADE, L1 PUTX, L1 PUTX old, L2 Replacement, L2 Replacement clean, Mem Data, WB Data, WB Data clean, Ack, Ack all, Unblock, Unblock Cancel, Exclusive Unblock).

#### **Coherence Challenges**

- Challenges of introducing coherence messages on a GPU
  - 1. Traffic: transferring messages
  - 2. Storage: tracking messages
  - 3. Complexity: managing races between messages
- GPU cache coherence without coherence messages?
  - Yes, using global time

#### **Temporal Coherence**

Related: Library Cache Coherence
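
The idea of replacing invalidation messages with time-based leases can be sketched as follows. This is a minimal model under stated assumptions: one block, one L1, an integer global clock, and a write policy that simply waits until every outstanding lease has expired. The names `L1Block`, `L2Block`, `tc_read`, and `tc_write` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

struct L2Block { int value = 0; long lease_end = 0; };   // latest lease expiry handed out
struct L1Block { int value = 0; long lease_end = 0; bool present = false; };

// Read at global time `now`: the L1 copy is usable only while its lease is live.
// An expired copy is treated as invalid without any invalidation message.
int tc_read(L1Block& l1, L2Block& l2, long now, long lifetime) {
    assert(lifetime > 0);
    if (l1.present && now < l1.lease_end) return l1.value;  // hit
    l1.value = l2.value;                                    // miss: refetch with a new lease
    l1.lease_end = now + lifetime;
    l1.present = true;
    l2.lease_end = std::max(l2.lease_end, l1.lease_end);
    return l1.value;
}

// Write arriving at the L2 at time `now`: returns the time at which it completes.
// Simplification: the value is stored immediately, so callers should not issue
// L2 reads with a timestamp earlier than the returned completion time.
long tc_write(L2Block& l2, int value, long now) {
    long done = std::max(now, l2.lease_end);  // wait out every outstanding lease
    l2.value = value;
    return done;
}
```

The L2 never sends invalidations or collects acknowledgments here; the only cost is the delay a write may incur while leases run out.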



#### Temporal Coherence Example





#### **Lifetime Predictor**

- One prediction value per L2 bank
- Events local to L2 bank update prediction value







- Reduces traffic by 53% over MESI and 23% over GPU-VI for intra-workgroup applications
- Lower traffic than 16x-sized 32-way directory

#### Performance





- TC-Weak with simple predictor performs 85% better than disabling L1 caches

#### CPU-GPU Coherence?

- Many vendors have introduced chips with both a CPU and a GPU (e.g., AMD Fusion, Intel Core i7, NVIDIA Tegra).
- What are the challenges with maintaining coherence across CPU and GPU?
- One important one: the GPU has a higher cache miss rate than the CPU, which can put pressure on the directory and hurt performance.
- Power et al., Heterogeneous System Coherence for Integrated CPU-GPU Systems, MICRO 2013: Use "region coherence" to reduce number of GPU requests that need to access directory.
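
The effect of tracking permissions per region instead of per block can be illustrated by counting directory lookups for a stream of misses. This is a toy sketch, not the mechanism from the paper: it ignores region evictions and permissions revoked by other requesters, and the function name `directory_lookups` is hypothetical.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Counts directory lookups for a stream of missing block addresses when
// permission is tracked per region of `blocks_per_region` consecutive blocks.
int directory_lookups(const std::vector<long>& miss_blocks, int blocks_per_region) {
    assert(blocks_per_region > 0);
    std::set<long> regions_held;  // regions this requester already has permission for
    int lookups = 0;
    for (long b : miss_blocks) {
        long region = b / blocks_per_region;
        // Only the first miss to a region consults the directory;
        // later misses in the same region bypass it.
        if (regions_held.insert(region).second) ++lookups;
    }
    return lookups;
}
```

For a streaming access pattern over 64 consecutive blocks, block-granularity tracking (`blocks_per_region = 1`) needs 64 lookups, while 16-block regions need 4.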

### Synchronization

- Locks are not encouraged in current GPGPU programming manuals.
- Interaction with SIMT stack can easily cause deadlocks:

```
// Naive spin lock: can deadlock under SIMT execution.
while( atomicCAS(&lock[a[tid]], 0, 1) != 0 )
    ; // deadlock here if a[i] == a[j] for any two threads i, j in the same warp
// critical section goes here
atomicExch(&lock[a[tid]], 0);
```

Correct way to write critical section for GPGPU:

```
bool done = false;
while( !done ) {
    if( atomicCAS(&lock[a[tid]], 0, 1) == 0 ) {
        // critical section goes here
        atomicExch(&lock[a[tid]], 0);  // release the lock
        done = true;                   // leave the loop only after releasing
    }
}
```
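
Why the first loop can deadlock while the second cannot is easier to see in a toy lockstep model of a single warp in which every thread wants the same lock. The model rests on one assumption: the warp keeps executing the loop while any thread is still in it, and threads that have left the loop wait at the reconvergence point. The function names are hypothetical, and real hardware scheduling is more involved.

```cpp
#include <cassert>
#include <vector>

// Each function returns the number of loop iterations until the whole warp
// finishes, or -1 if it is still stuck after max_iters.

// First pattern: spin until the CAS succeeds, release after the loop.
int run_naive_spinlock(int warp_size, int max_iters) {
    assert(warp_size > 0);
    int lock = 0;
    std::vector<bool> left_loop(warp_size, false);
    for (int iter = 1; iter <= max_iters; ++iter) {
        bool any_spinning = false;
        for (int t = 0; t < warp_size; ++t) {
            if (left_loop[t]) continue;                        // parked at the reconvergence point
            if (lock == 0) { lock = 1; left_loop[t] = true; }  // CAS succeeded
            else any_spinning = true;                          // CAS failed, spin again
        }
        // The release sits after the loop, so it needs the whole warp to exit.
        if (!any_spinning) return iter;
    }
    return -1;
}

// Second pattern: the winner runs its critical section and releases the
// lock inside the if body, within the same loop iteration.
int run_fixed_loop(int warp_size, int max_iters) {
    assert(warp_size > 0);
    int lock = 0;
    std::vector<bool> done(warp_size, false);
    int remaining = warp_size;
    for (int iter = 1; iter <= max_iters; ++iter) {
        int winner = -1;
        for (int t = 0; t < warp_size; ++t)   // all unfinished threads try the CAS
            if (!done[t] && lock == 0) { lock = 1; winner = t; }
        if (winner >= 0) {                    // if body: critical section, then release
            lock = 0;
            done[winner] = true;
            --remaining;
        }
        if (remaining == 0) return iter;
    }
    return -1;
}
```

With a 32-thread warp the naive version never finishes in this model, while the fixed loop serializes the threads and completes in 32 iterations.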

Most current GPGPU programs use barriers within thread blocks and/or lock-free data structures.

This leads to the following picture...

- Lifetime of GPU Application Development

