

# A FORMAL ANALYSIS OF THE NVIDIA PTX MEMORY CONSISTENCY MODEL

Dan Lustig, Sameer Sahasrabuddhe, Olivier Giroux, Apr 15, 2019 (ASPLOS 2019)

### THE NVIDIA PTX MEMORY CONSISTENCY MODEL

Memory consistency model: a set of rules defining the values that loads can legally return



Image: https://devblogs.nvidia.com/unified-memory-cuda-beginners/

This paper: a new scoped memory consistency model for NVIDIA GPUs

- enables flexible intra-GPU, GPU-GPU, and CPU-GPU communication
- PTX documentation online, and axiomatic formalization in paper
- C++ compiler mappings tested using Alloy and verified using Coq

#### MOTIVATION

Why create a new memory consistency model for NVIDIA GPUs?

### **GOAL #1: FIX BUGS IN PRIOR GENERATIONS**

Implementation (compiler + hardware) must respect software specs

#### **Exposing Errors Related to Weak Memory in GPU Applications** Alastair F. Donaldson Tyler Sorensen Imperial College London, UK Imperial **GPU Concurrency:** t.sorensen15@imperial.ac.uk alastair.dor Weak Behaviours and Programming Assumptions [PLDI'16] Jade Alglave<sup>1,2</sup> Mark Batty<sup>3</sup> Alastair F. Donaldson<sup>4</sup> Ganesh Gopalakrishnan<sup>5</sup> Abstract have been shown to John Wickerson<sup>4</sup> Jeroen Ketema<sup>4</sup> Daniel Poetzl<sup>6</sup> Tyler Sorensen<sup>1,5</sup> behaviours beyond th We present the systematic design of a testing environment <sup>1</sup> University College London <sup>2</sup> Microsoft Research <sup>3</sup> University of Cambridge leavings are possible that uses stressing and fuzzing to reveal errors in GPU appli-<sup>4</sup> Imperial College London <sup>5</sup> University of Utah <sup>6</sup> University of Oxford even more challengi cations that arise due to weak memory effects. We evaluate [ASPLOS'15] Abstract Yet GPU concurrency is poorly specified. The vendors' documentation and programming guides suffer from signif-Concurrency is pervasive and perplexing, particularly on icant omissions and ambiguities, which force programmers graphics processing units (GPUs). Current specifications of to rely on folklore assumptions when writing software. languages and hardware are inconclusive; thus programmers To distinguish assumptions from ground truth, we quesoften rely on folklore assumptions when writing software. tioned the hardware guarantees and the assumptions made To remedy this state of affairs, we conducted a large em

### GOAL #2: ENABLE AN IMPROVED SIMT MODEL

Avoid starvation/livelock scenarios possible under prior SIMT model



https://devblogs.nvidia.com/inside-volta/

**NVIDIA** 

#### THE NVIDIA MEMORY CONSISTENCY MODEL

A High-Level Overview

Lustig, Sahasrabuddhe, Giroux, "A Formal Analysis of the NVIDIA PTX Memory Consistency Model", ASPLOS 2019

📀 NVIDIA.

#### NVIDIA MEMORY MODEL: GENERAL APPROACH

- 1. Start with the least common denominator of modern CPU weak memory consistency models
  - As weak as possible (better performance, more microarchitectural flexibility), as long as the result is properly programmable

2. Add scopes

## PTX MEMORY MODEL STARTING POINT

#### The "Least Common Denominator" of existing models

- Causality
  - Release/Acquire, CTA Execution Barriers, Fences (transitively)
- Coherence Order
- Sequential Consistency per Location
  - expected single-threaded and same-address behavior
- Atomicity of RMWs
- No Out-of-Thin-Air Executions
  - still-open theoretical problem: how to prevent self-justifying speculation?

#### NVIDIA MEMORY MODEL: GENERAL APPROACH

- 1. Start with the least common denominator of modern CPU weak memory consistency models
  - As weak as possible (better performance, more microarchitectural flexibility), as long as the result is properly programmable

#### 2. Add scopes

# **SCOPES OR NO SCOPES?**

- Early GPU memory models exposed scopes
  - HSA/HRF [ASPLOS'14], OpenCL
- Later papers: HW coherence protocols can track scopes instead of user, delivering simpler model with no performance loss
  - Remote Scope Promotion [ASPLOS'15], DeNovo+DRF [MICRO'15], Relativistic Cache coherence [HPCA'17], ...
- But: that required extra hardware support is a non-trivial cost we can't/won't commit to!





#### SCOPES AND MORAL STRENGTH



- A scope is a set of threads
- Each synchronizing memory instruction specifies a scope, e.g., "1d.acquire.gpu"
- Morally strong aka mutually scope inclusive: a pair of instructions where the scope of each includes the other

### SCOPES AND MORAL STRENGTH



- A scope is a set of threads
- Each synchronizing memory instruction specifies a scope, e.g., "1d.acquire.gpu"
- Morally strong aka mutually scope inclusive: a pair of instructions where the scope of each includes the other

### SCOPES AND MORAL STRENGTH



- A scope is a set of threads
- Each synchronizing memory instruction specifies a scope, e.g., "1d.acquire.gpu"
- Morally strong aka mutually scope inclusive: a pair of instructions where the scope of each includes the other

# PTX MEMORY CONSISTENCY MODEL AXIOMS

#### The "Least Common Denominator", plus scopes

- Causality, built out of <u>morally strong</u> pairs of memory accesses
  - Release/Acquire, CTA Execution Barriers, Fences (transitively)
- Coherence Order, for <u>causally related/morally strong</u> pairs of memory accesses
- Sequential Consistency per Location, for morally strong pairs of memory accesses
  - expected single-threaded and same-address behavior
- Atomicity of RMWs, for morally strong pairs of memory accesses
- No Out-of-Thin-Air Executions
  - still-open theoretical problem: how to prevent self-justifying speculation?

# PTX MEMORY CONSISTENCY MODEL AXIOMS

The "Least Common Denominator", plus scopes

- Causality, built out of morally strong
  - Release/Acquire, CTA Execution Barri
- Coherence Order, for <u>causally related</u>
- Sequential Consistency per Location,
  - expected single-threaded and same-a

Still-open theoretical problem. now to

- Atomicity of RMWs, for morally strong
- No Out-of-Thin-Air Executions Full details in PTX documentation and in paper!

 $pattern_{rel} := ([W^{\geq REL}]; po\_loc^?; [W]) \cup ([F^{REL}]; po; [W])$   $obs := (morally\_strong \cap rf) \cup (obs; rmw; obs)$   $pattern_{acq} := ([R]; po\_loc^?; [R^{\geq ACQ}]) \cup ([R]; po; [F^{ACQ}])$   $sw := (morally\_strong \cap (pattern_{rel}; obs; pattern_{acq}))$   $\cup sync_{barrier} \cup sc$   $cause_{base} := (po^?; sw; po^?)^+$   $cause := cause_{base} \cup (obs; (cause_{base} \cup po\_loc))$ 

Figure 4. PTX Memory Model Relations

Axiom 1 (Coherence). [W]; *cause*; [W]  $\subseteq$  *co* 

Axiom 2 (FenceSC).

irreflexive(sc; cause)

Axiom 3 (Atomicity).

empty(((*morally strong*  $\cap$  *fr*)

(morally strong  $\cap$  co))  $\cap$  rmw)

Axiom 4 (No-Thin-Air). acyclic( $rf \cup dep$ )

Axiom 5 (SC-per-Location). acyclic((*morally\_strong*  $\cap$  (*rf*  $\cup$  *co*  $\cup$  *fr*))  $\cup$  *po\_loc*)

Axiom 6 (Causality). irreflexive(( $rf \cup fr$ ); *cause*)

Figure 7. PTX Memory Model Axioms

#### **VERIFIED COMPILER MAPPINGS**

How do we know the model is compatible with general-purpose languages like CUDA and C++?

#### **DEFINING AND VERIFYING "CORRECTNESS"**



- 1. Execute according to abstract C++ semantics
- 2. Compile to PTX, execute, and interpret as execution of original C++ program

Correctness: #2 must always be consistent with #1

### C++ COMPILER MAPPINGS



Modern best practice:

Provide a fixed compiler mapping from C++ to PTX assembly

#### DEFINING AND VERIFYING "CORRECTNESS"



Build models of both paths, and then:

- 1. Empirically test for counterexamples in the theory
- 2. Rigorously prove the theory correct
- 3. Empirically test on real hardware
- 4. Whatever else!

#### VERIFYING OUR C++ COMPILER MAPPINGS

- We developed an Alloy -> Coq compiler, enabling a unified framework combining:
  - Alloy: Relational Model Finder
  - Coq: Interactive Theorem Prover
- Empirically test AND formally prove that the compiler mappings are sound
  - Same model for both flows!



#### THE NVIDIA PTX MEMORY CONSISTENCY MODEL Summary and Conclusion

- Scoped weak memory model, designed to maximize performance, architectural flexibility, and portability
- Allows GPU threads to freely communicate with any other thread in the system
  - Fixes bugs, livelock/starvation issues in past generations
- Developed a new unified framework to empirically test AND formally prove the correctness of C++ compiler mappings
  - Extends modern best practice for hardware memory consistency models

All materials available online: <u>https://github.com/nvlabs/PTXMemoryModel</u>