--------------------------------------------------------------------
CS 757 Parallel Computer Architecture Spring 2012 Section 1
Instructor Mark D. Hill
--------------------------------------------------------------------

GPUs and CUDA

See excerpts from Illinois ECE 498AL
GOT NEW SLIDES but are they better?
http://www.cs.wisc.edu/~markhill/cs757/Spring2008/for.wisc.edu.only/illinois07_gpu_cuda.ppt

See Blackboard Pix
------------------

Put MIMD/SM on left & Data Parallel on right

Single thread in host memory
Launch Kernel/Grid with global memory and many threads
Return to single thread in host memory
(kernel-launch sketch at end)

Break the grid's threads into multiple Blocks (<= 1K threads each)
Break a block into multiple Warps (32 threads each)

At a handwave level, SPMD acts like MIMD at Block granularity, BUT
* HW only promises to schedule one block at a time
  + a thread in block i cannot wait for a thread in block i+j
* Threads within a Warp run as SIMT
  + MIMD in appearance but pseudo-lock-step progress
  + Branch divergence -- masking (divergence sketch at end)

Memories
+ Currently host and global memory are separate, with explicit copies
  (copy sketch at end)
+ Expected to merge, e.g., APUs

Blocks
+ Share explicit shared/local/group memory (shared-memory sketch at end)
+ Not a cache (yet)

Threads also have private memory (but this is split among the
time-multiplexed threads)

Reviews
* Brian: execution model?
* Daniel: Warp stall?
* Guoliang: $2/core?
* Syed: Texture memory?
* Asim: need new algorithms?
* Eric: Why huge bw and #registers bottlenecks?
* Andrew E: security & virtualization, comm bottlenecks?
* Marc: vs. Cell?

Critique
+ GPUs getting more GP
+ Tremendous peak
- Whither Amdahl?
* Fusion efforts
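
Code sketches
-------------

Kernel-launch sketch. A minimal CUDA example of the flow above: one host
thread launches a grid of blocks of threads, then resumes as a single
host thread once the kernel completes. The kernel name (scale) and the
sizes are illustrative, not from the lecture.

    #include <cuda_runtime.h>

    // Hypothetical example: host thread -> kernel/grid -> host thread.
    __global__ void scale(float *d_a, float f, int n) {
        // Each of the grid's many threads handles one element.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d_a[i] *= f;
    }

    int main() {
        const int N = 1 << 20;
        float *d_a;
        cudaMalloc(&d_a, N * sizeof(float));   // allocate in global memory
        int threadsPerBlock = 256;             // well under the <= 1K limit
        int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;
        scale<<<blocks, threadsPerBlock>>>(d_a, 2.0f, N);  // launch kernel/grid
        cudaDeviceSynchronize();               // back to the single host thread
        cudaFree(d_a);
        return 0;
    }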
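
Divergence sketch. Within a 32-thread warp, SIMT hardware handles a
data-dependent branch by masking: lanes on one side execute while the
others are disabled, then the roles swap, so the two paths run serially.
The kernel name (diverge) is illustrative.

    __global__ void diverge(int *d_out) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        // Even and odd lanes of the same warp take different paths, so the
        // warp executes both paths back to back with lanes masked off.
        if (threadIdx.x % 2 == 0)
            d_out[i] = 2 * i;   // odd lanes masked while this runs
        else
            d_out[i] = 3 * i;   // even lanes masked while this runs
    }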
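
Copy sketch. Because host and global memory are separate address spaces
today, data moves via explicit cudaMemcpy calls in both directions; the
buffer names here are illustrative.

    #include <cuda_runtime.h>
    #include <vector>

    void roundtrip() {
        const int N = 1024;
        std::vector<float> h_a(N, 1.0f);       // host memory
        float *d_a;
        cudaMalloc(&d_a, N * sizeof(float));   // device global memory
        cudaMemcpy(d_a, h_a.data(), N * sizeof(float),
                   cudaMemcpyHostToDevice);    // explicit copy to device
        // ... launch kernels that read/write d_a ...
        cudaMemcpy(h_a.data(), d_a, N * sizeof(float),
                   cudaMemcpyDeviceToHost);    // explicit copy back to host
        cudaFree(d_a);
    }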
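
Shared-memory sketch. Each block has an explicit software-managed
scratchpad (__shared__) visible only to that block's threads; it is not a
hardware cache. This per-block sum is a sketch that assumes 256 threads
per block; the names are illustrative.

    __global__ void block_sum(const float *d_in, float *d_out) {
        __shared__ float s[256];      // per-block scratchpad, not a cache
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        s[threadIdx.x] = d_in[i];     // stage from global memory
        __syncthreads();              // block-wide barrier
        // Tree reduction entirely inside the block's shared memory.
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (threadIdx.x < stride)
                s[threadIdx.x] += s[threadIdx.x + stride];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            d_out[blockIdx.x] = s[0];   // one partial sum per block
    }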