--------------------------------------------------------------------
CS 757 Parallel Computer Architecture Spring 2012 Section 1
Instructor Mark D. Hill
--------------------------------------------------------------------

GPUs and CUDA

See excerpts from Illinois ECE 498AL
GOT NEW SLIDES but are they better?
http://www.cs.wisc.edu/~markhill/cs757/Spring2008/for.wisc.edu.only/illinois07_gpu_cuda.ppt

See Blackboard Pix
------------------

Put MIMD/SM on left & Data Parallel on right

Single thread in host memory
Launch Kernel/Grid with global memory and many threads
Return to single thread in host memory
(kernel-launch sketch at end)

Break the grid's threads into multiple Blocks (<= 1K threads each)
Break a block into multiple Warps (32 threads each)

At a handwave level, SPMD acts like MIMD at Block granularity, BUT
* HW only promises to schedule one block at a time
  + a thread in block i cannot wait for a thread in block i+j
* Threads within a Warp run as SIMT
  + MIMD in appearance but pseudo-lock-step progress
  + Branch divergence -- masking (divergence sketch at end)

Memories
+ Currently host and global memory are separate, with explicit copies
  (copy sketch at end)
+ Expected to merge, e.g., APUs

Blocks
+ Share explicit shared/local/group memory (shared-memory sketch at end)
+ Not a cache (yet)

Threads also have private memory (but this is split among the
time-multiplexed threads)

Reviews
* Brian: execution model?
* Daniel: Warp stall?
* Guoliang: $2/core?
* Syed: Texture memory?
* Asim: need new algorithms?
* Eric: Why huge bw and #registers bottlenecks?
* Andrew E: security & virtualization, comm bottlenecks?
* Marc: vs. Cell?

Critique
+ GPUs getting more GP
+ Tremendous peak
- Whither Amdahl?
* Fusion efforts
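
Code sketches
-------------

Kernel-launch sketch. A minimal CUDA example of the flow above: one host
thread launches a grid of blocks of threads, then resumes as a single
host thread once the kernel completes. The kernel name (scale) and the
sizes are illustrative, not from the lecture.

    #include <cuda_runtime.h>

    // Hypothetical example: host thread -> kernel/grid -> host thread.
    __global__ void scale(float *d_a, float f, int n) {
        // Each of the grid's many threads handles one element.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d_a[i] *= f;
    }

    int main() {
        const int N = 1 << 20;
        float *d_a;
        cudaMalloc(&d_a, N * sizeof(float));   // allocate in global memory
        int threadsPerBlock = 256;             // well under the <= 1K limit
        int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;
        scale<<<blocks, threadsPerBlock>>>(d_a, 2.0f, N);  // launch kernel/grid
        cudaDeviceSynchronize();               // back to the single host thread
        cudaFree(d_a);
        return 0;
    }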
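
Divergence sketch. Within a 32-thread warp, SIMT hardware handles a
data-dependent branch by masking: lanes on one side execute while the
others are disabled, then the roles swap, so the two paths run serially.
The kernel name (diverge) is illustrative.

    __global__ void diverge(int *d_out) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        // Even and odd lanes of the same warp take different paths, so the
        // warp executes both paths back to back with lanes masked off.
        if (threadIdx.x % 2 == 0)
            d_out[i] = 2 * i;   // odd lanes masked while this runs
        else
            d_out[i] = 3 * i;   // even lanes masked while this runs
    }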
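
Copy sketch. Because host and global memory are separate address spaces
today, data moves via explicit cudaMemcpy calls in both directions; the
buffer names here are illustrative.

    #include <cuda_runtime.h>
    #include <vector>

    void roundtrip() {
        const int N = 1024;
        std::vector<float> h_a(N, 1.0f);       // host memory
        float *d_a;
        cudaMalloc(&d_a, N * sizeof(float));   // device global memory
        cudaMemcpy(d_a, h_a.data(), N * sizeof(float),
                   cudaMemcpyHostToDevice);    // explicit copy to device
        // ... launch kernels that read/write d_a ...
        cudaMemcpy(h_a.data(), d_a, N * sizeof(float),
                   cudaMemcpyDeviceToHost);    // explicit copy back to host
        cudaFree(d_a);
    }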
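
Shared-memory sketch. Each block has an explicit software-managed
scratchpad (__shared__) visible only to that block's threads; it is not a
hardware cache. This per-block sum is a sketch that assumes 256 threads
per block; the names are illustrative.

    __global__ void block_sum(const float *d_in, float *d_out) {
        __shared__ float s[256];      // per-block scratchpad, not a cache
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        s[threadIdx.x] = d_in[i];     // stage from global memory
        __syncthreads();              // block-wide barrier
        // Tree reduction entirely inside the block's shared memory.
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (threadIdx.x < stride)
                s[threadIdx.x] += s[threadIdx.x + stride];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            d_out[blockIdx.x] = s[0];   // one partial sum per block
    }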