



Data Parallelism, Programming Models, SIMT








































# Multicore Multithreaded SIMT

Many SIMT "threads" grouped together into GPU "Core"

SIMT threads in a group $\approx$ SMT threads in a CPU core

Unlike CPU, groups are exposed to programmers
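A minimal sketch of what "exposed to programmers" can mean in practice: a flat work-item id decomposes into a workgroup, a wavefront inside that workgroup, and a lane. The sizes and the `locate` helper are assumptions for illustration (a 1-D index space, 64-wide wavefronts, 256 work-items per workgroup), not values fixed by the slides.

```python
# Hypothetical sizes, chosen for illustration only.
WAVEFRONT_SIZE = 64    # work-items per wavefront (assumed)
WORKGROUP_SIZE = 256   # work-items per workgroup (assumed)


def locate(global_id):
    """Map a flat work-item id to (workgroup, wavefront in workgroup, lane)."""
    workgroup, local_id = divmod(global_id, WORKGROUP_SIZE)
    wavefront, lane = divmod(local_id, WAVEFRONT_SIZE)
    return workgroup, wavefront, lane


print(locate(0))    # (0, 0, 0)
print(locate(70))   # (0, 1, 6)
print(locate(300))  # (1, 0, 44)
```

On a CPU the equivalent grouping (which SMT threads share a core) is normally hidden by the OS scheduler; here the programmer picks the group size.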

### Multiple GPU "Cores"





# GPU Programming Models

OpenCL

### **GPU Programming Models**

- CUDA: <u>C</u>ompute <u>U</u>nified <u>D</u>evice <u>A</u>rchitecture
  - Developed by Nvidia -- proprietary
  - First serious GPGPU language/environment
- OpenCL: <u>Open</u> <u>C</u>omputing <u>L</u>anguage
  - From makers of OpenGL
  - Wide industry support: AMD, Apple, Qualcomm, Nvidia (begrudgingly), etc.
- C++ AMP: <u>C++</u> <u>A</u>ccelerated <u>M</u>assive <u>P</u>arallelism
  - Microsoft
  - Much higher abstraction than CUDA/OpenCL
- OpenACC: Open Accelerator
  - Like OpenMP for GPUs (semi-auto-parallelize serial code)
  - Much higher abstraction than CUDA/OpenCL

### OpenCL

- Early CPU languages were light abstractions of physical hardware
  - E.g., C
- Early GPU languages are light abstractions of physical hardware
  - OpenCL + CUDA

























### Address Coalescing

- Wavefront: Issue 64 memory requests
- Common case: work-items in same wavefront touch same cache block
- Coalescing: Merge many work-items' requests into a single cache block request
- Important for performance: Reduces bandwidth to DRAM
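A rough sketch of the effect, counting how many cache block requests remain after a 64-wide wavefront's accesses are merged. The 64-byte block size, the 4-byte element size, and the `block_requests` helper are assumptions for illustration, not hardware parameters taken from the slides.

```python
CACHE_BLOCK_BYTES = 64  # assumed cache block size
WAVEFRONT_SIZE = 64     # one memory request per work-item


def block_requests(addresses, block_bytes=CACHE_BLOCK_BYTES):
    """Distinct cache blocks touched, i.e. requests left after coalescing."""
    return len({addr // block_bytes for addr in addresses})


# Work-item i reads the i-th 4-byte element: neighbours share cache blocks.
unit_stride = [4 * i for i in range(WAVEFRONT_SIZE)]
# Work-item i reads element 16*i: every access lands in its own block.
wide_stride = [4 * 16 * i for i in range(WAVEFRONT_SIZE)]

print(block_requests(unit_stride))  # 4
print(block_requests(wide_stride))  # 64
```

Under these assumptions the unit-stride pattern needs 16x fewer block requests than the strided one for the same 64 loads.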



### Not Your CPU's Cache

By the numbers: Bulldozer – FX-8170 vs. GCN – Radeon HD 7970

|                                                | CPU (Bulldozer) | GPU (GCN) |
|------------------------------------------------|-----------------|-----------|
| L1 data cache capacity                         | 16 KB           | 16 KB     |
| Active threads (work-items) sharing L1 D cache | 1               | 2560      |
| L1 D cache capacity / thread                   | 16 KB           | 6.4 bytes |
| Last level cache (LLC) capacity                | 8 MB            | 768 KB    |
| Active threads (work-items) sharing LLC        | 8               | 81,920    |
| LLC capacity / thread                          | 1 MB            | 9.6 bytes |
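The per-thread rows follow directly from dividing capacity by the number of sharers. A small check of that arithmetic, assuming binary units (1 KB = 1024 bytes); `bytes_per_thread` is a helper name introduced here.

```python
KB = 1024
MB = 1024 * KB


def bytes_per_thread(capacity_bytes, threads):
    """Cache capacity available to each active thread, in bytes."""
    return capacity_bytes / threads


print(bytes_per_thread(16 * KB, 1) / KB)   # CPU L1:  16.0 (KB)
print(bytes_per_thread(16 * KB, 2560))     # GPU L1:  6.4 (bytes)
print(bytes_per_thread(8 * MB, 8) / MB)    # CPU LLC: 1.0 (MB)
print(bytes_per_thread(768 * KB, 81920))   # GPU LLC: 9.6 (bytes)
```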



### Scratchpad Memory

- GPUs have scratchpads (Local Memory)
  - Separate address space
  - Managed by software:
    - Rename address
    - Manage capacity – manual fill/eviction
  - Allocated to a workgroup
    - i.e., shared by wavefronts in workgroup

### Example System: Radeon HD 7970

- High-end part
- 32 Compute Units:
  - 81,920 Active work-items
  - 32 CUs \* 4 SIMT Units \* 16 ALUs = 2048 Max FP ops/cycle
- 264 GB/s Max memory bandwidth
- 925 MHz engine clock
- 3.79 TFLOPS single precision (accounting trickery: FMA)
- 210W Max Power (Chip)
- \>350W Max Power (card)
- 100W idle power (card)
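The headline numbers for the Radeon HD 7970 can be reproduced from the per-unit counts. A short check, assuming the "accounting trickery" means counting each fused multiply-add as two floating-point operations; the variable names are introduced here.

```python
compute_units = 32
simt_units_per_cu = 4
alus_per_simt_unit = 16
clock_hz = 925e6
flops_per_fma = 2  # assumption: one FMA counted as two FP ops

alus = compute_units * simt_units_per_cu * alus_per_simt_unit
peak_tflops = alus * flops_per_fma * clock_hz / 1e12
work_items_per_cu = 81920 // compute_units

print(alus)                   # 2048
print(round(peak_tflops, 2))  # 3.79
print(work_items_per_cu)      # 2560
```

Without the factor of two the same hardware would be quoted at roughly half that figure, and 2560 work-items per Compute Unit matches the number of threads sharing one L1 data cache.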

### Radeon HD 7990 - Cooking



# A Rose by Any Other Name...


### Terminology Headaches #2-5

|            | Nvidia/CUDA              | AMD/OpenCL         | Derek's CPU Analogy |
|------------|--------------------------|--------------------|---------------------|
|            | CUDA Processor           | Processing Element | Lane                |
|            | CUDA Core                | SIMD Unit          | Pipeline            |
| GPU "Core" | Streaming Multiprocessor | Compute Unit       | Core                |
| GPU        | GPU Device               | GPU Device         | Device              |



# Terminology Headache #10

- GPUs have scratchpads (Local Memory)
  - Separate address space
  - Managed by software:
    - Manage capacity – manual fill/eviction
  - Allocated to a workgroup
    - i.e., shared by wavefronts in workgroup
- Nvidia calls 'Local Memory' 'Shared Memory'.
- AMD sometimes calls it 'Group Memory'.

### Recap

- Data Parallelism: Identical, Independent work over multiple data inputs
  - GPU version: Add streaming access pattern
- Data Parallel Execution Models: MIMD, SIMD, SIMT
- GPU Execution Model: Multicore Multithreaded SIMT
- OpenCL Programming Model: NDRange over workgroup/wavefront
- Modern GPU Microarchitecture: AMD Graphics Core Next (GCN)
  - Compute Unit ("GPU Core"): 4 SIMT Units
  - SIMT Unit ("GPU Pipeline"): 16-wide ALU pipe (16x4 execution)
  - Memory: designed to stream
- GPUs: Great for data parallelism. Bad for everything else.

# Advanced Topics

GPU Limitations, Future of GPGPU









Divergence isn't just a performance problem:

```
global int lock = 0;

void mutex_lock() {
  // acquire lock
  while (test&set(lock, 1) == false) {
    // spin
  }
  return;
}
```
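Why a spin lock can hang a wavefront: in lockstep execution, the work-item that wins the lock cannot run ahead to release it while its peers are still spinning. The toy model below illustrates this under stated assumptions: lanes that leave the loop wait at a reconvergence point just after it, the unlock sits beyond that point, and lanes attempt the test-and-set in a fixed order. It is a simulation written for this note, not GPU code, and `simt_spin_lock` is a hypothetical helper.

```python
def simt_spin_lock(num_lanes, max_iterations=1000):
    """Simulate one wavefront spinning on a shared lock in lockstep.

    Returns how many lanes are still spinning when the budget runs out.
    """
    lock = 0
    spinning = set(range(num_lanes))  # lanes still inside the while loop
    for _ in range(max_iterations):
        if not spinning:
            break  # whole wavefront reconverged and could move on
        for lane in sorted(spinning):
            if lock == 0:               # test-and-set succeeds for one lane
                lock = 1
                spinning.discard(lane)  # winner waits at the reconvergence point
        # The unlock is after the loop, so the waiting winner never reaches
        # it and the lock stays held for every later iteration.
    return len(spinning)


print(simt_spin_lock(1))   # 0
print(simt_spin_lock(64))  # 63
```

With a single lane the lock is acquired and execution continues; with 64 lanes one acquires it and the other 63 never make progress, however large the iteration budget.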






### **Memory Divergence**

- One work-item stalls $\rightarrow$ entire wavefront must stall
  - Cause: Bank conflicts, cache misses
- Data layout & partitioning is important
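A sketch of how data layout drives bank conflicts, assuming a banked memory with 32 banks of 4-byte words (both numbers are assumptions, as is the `serialized_accesses` helper). It reports the largest number of accesses that land on one bank, which is how many turns the whole group waits for.

```python
from collections import Counter

NUM_BANKS = 32  # assumed number of banks
WORD_BYTES = 4  # assumed bank width in bytes


def serialized_accesses(addresses):
    """Worst-case accesses queued on a single bank; 1 means conflict-free.

    Simplification: accesses to the same word count as conflicts too.
    """
    banks = Counter((addr // WORD_BYTES) % NUM_BANKS for addr in addresses)
    return max(banks.values())


row_access = [4 * i for i in range(32)]       # consecutive words, one per bank
col_access = [4 * 32 * i for i in range(32)]  # stride of 32 words, all in bank 0

print(serialized_accesses(row_access))  # 1
print(serialized_accesses(col_access))  # 32
```

Walking a 32-word-wide array by column is the classic bad case here; padding each row by one word is a common way to spread the accesses back across banks.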

# **Divergence Kills Performance**

## Communication and Synchronization

- Work-items can communicate with:
  - Work-items in same wavefront
    - No special sync needed...they are lockstep!
  - Work-items in different wavefront, same workgroup (local)
    - Local barrier
  - Work-items in different wavefront, different workgroup (global)
    - OpenCL 1.x: Nope
    - OpenCL 2.x: Yes, but...
    - CUDA 4.x: Yes, but complicated
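The local-barrier case can be sketched with CPU threads standing in for the work-items of one workgroup. This is only an analogy: `threading.Barrier` plays the role of the local barrier, a Python list plays the role of local memory, and the workgroup size of 8 is an assumption to keep the example small.

```python
import threading

WORKGROUP_SIZE = 8                # assumed, kept small for the sketch
local_mem = [0] * WORKGROUP_SIZE  # stands in for workgroup-local memory
result = [0] * WORKGROUP_SIZE
barrier = threading.Barrier(WORKGROUP_SIZE)


def work_item(local_id):
    local_mem[local_id] = local_id * 10      # phase 1: publish a value
    barrier.wait()                           # all writes finish before any read
    neighbour = (local_id + 1) % WORKGROUP_SIZE
    result[local_id] = local_mem[neighbour]  # phase 2: read a neighbour's value


threads = [threading.Thread(target=work_item, args=(i,))
           for i in range(WORKGROUP_SIZE)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(result)  # [10, 20, 30, 40, 50, 60, 70, 0]
```

Without the barrier a work-item could read its neighbour's slot before the neighbour has written it.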

### **GPU** Consistency Models

- Very weak guarantee:
  - Program order respected within single work-item
  - All other bets are off
- Safety net:
  - Fence: "make sure all previous accesses are visible before proceeding"
  - Built-in barriers are also fences
- A wrench:
  - GPU fences are scoped: they only apply to a subset of work-items in the system
  - E.g., local barrier

Take-away: Area of active research

- See Hower, et al. Heterogeneous-race-free Memory Models, ASPLOS 2014

# GPU Coherence?

Notice: GPU consistency model does not require coherence

- i.e., Single Writer, Multiple Reader

Marketing claims they are coherent...

- GPU "Coherence":
  - Nvidia: disable private caches
  - AMD: flush/invalidate entire cache at fences


### **GPU** Architecture Research

- Blending with CPU architecture:
  - Dynamic scheduling / dynamic wavefront re-org
  - Work-items have more locality than we think
- Tighter integration with CPU on SOC:
  - Fast kernel launch
    - Exploit fine-grained parallel region: Remember Amdahl's law
  - Common shared memory
- Reliability:
  - Historically: Who notices a bad pixel?
  - Future: GPU compute demands correctness
- Power:
  - Mobile, mobile, mobile!!!
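The Amdahl's law reminder can be made concrete. The sketch below uses the standard formula, speedup = 1 / ((1 - p) + p / s), where p is the fraction of runtime that is accelerated and s is the speedup of that fraction. The fractions and the 100x figure are hypothetical, and kernel launch overhead is not modelled.

```python
def amdahl_speedup(parallel_fraction, parallel_speedup):
    """Overall speedup when only part of the runtime is accelerated."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / parallel_speedup)


print(round(amdahl_speedup(0.90, 100), 2))  # 9.17
print(round(amdahl_speedup(0.99, 100), 2))  # 50.25
```

Moving from 90% to 99% offloaded work lifts the overall speedup from about 9x to about 50x, which is why cheap launches for small parallel regions matter.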

# Computer Economics 101

- GPU Compute is cool + gaining steam, but...
  - Is a 0 billion dollar industry (to quote Mark Hill)
- GPU design priorities: 1. Graphics, 2. Graphics, ..., N-1. Graphics, N. GPU Compute
- Moral of the story:
  - GPU won't become a CPU (nor should it)