Erik Lindholm, John Nickolls, Stuart Oberman, and John Montrym, "NVIDIA Tesla: A Unified Graphics and Computing Architecture," IEEE Micro, vol. 28, no. 2, pp. 39-55, March-April 2008.
Compute Unified Device Architecture (CUDA) parallel programming model and development tools
Tesla = unified graphics and computing architecture
Old GeForce:
vertex transform and lighting processor
fixed-function integer pixel-fragment pipeline
(programmed via OpenGL / DirectX)
Later: programmable vertex shaders, then a floating-point fragment pipeline
Vertex processors
> operate on vertices of primitives such as points, lines, and triangles
> transform coordinates to screen space
> compute lighting and texture parameters
> output to the setup unit and rasterizer (which converts primitives into pixel fragments)
> low-latency, high-precision math operations
> typically fully programmable
Pixel fragment processors
> shade the interior of primitives, interpolating per-vertex parameters
> high-latency, lower-precision texture filtering
> scenes have far more pixels than vertices => more pixel processors than vertex processors
Tesla > Vertex + Pixel into one unified processor architecture.
scalable processor array
memory > DRAM + fixed-function raster operation processors (ROPs) that perform color and depth frame-buffer operations directly in memory
Command processing
> interface to the CPU > responds to commands from the CPU, fetches data from system memory, checks command consistency, and performs context switching
> input assembler > collects geometric primitives and fetches associated vertex input attribute data
> executes vertex, geometry, and pixel shaders as well as compute programs
barrier synchronization (sketch below)
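A minimal CUDA sketch of a block-wide barrier (the kernel name and tile size are my own, not from the paper): threads stage a tile in shared memory, and __syncthreads() guarantees every write is visible before any thread reads a peer's element.

```cuda
// Reverse each 64-element tile of `in` into `out` using shared memory.
__global__ void reverseTile(const float *in, float *out)
{
    __shared__ float s[64];        // tile staged in per-block shared memory
    int i = blockIdx.x * 64 + threadIdx.x;

    s[threadIdx.x] = in[i];        // each thread writes one element
    __syncthreads();               // barrier: all writes now visible block-wide

    out[i] = s[63 - threadIdx.x];  // safe to read another thread's element
}
// Launch with 64 threads per block, e.g. reverseTile<<<n / 64, 64>>>(d_in, d_out);
```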
SIMT (single-instruction, multiple-thread)
unit creates, manages, schedules, and executes threads in groups of 32 parallel threads called warps.
each thread executes independently; individual threads can be inactive due to independent branching or predication.
full potential > if all 32 threads of a warp take the same execution path (divergence sketch below).
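A sketch of SIMT branch divergence (kernel and variable names are illustrative): threads of a warp that take different paths are executed serially with the other threads masked off, while a warp-uniform branch costs nothing extra.

```cuda
__global__ void divergent(float *x)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Threads of one warp split between the two paths: the hardware runs
    // each path serially, with the threads on the other path inactive.
    if (i % 2 == 0)
        x[i] = x[i] * 2.0f;
    else
        x[i] = x[i] + 1.0f;

    // Warp-uniform branch: all 32 threads of a warp agree (i / 32 is
    // constant within a warp), so there is no divergence penalty.
    if ((i / 32) % 2 == 0)
        x[i] -= 1.0f;
}
```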
SIMD vs. SIMT
SIMT: applies one instruction to multiple independent threads in parallel, not just to multiple data lanes > controls the execution and branching behavior of each thread
SIMD: one instruction controls a vector of multiple data lanes
warp: 32 threads of the same type (vertex/geometry/pixel/compute)
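Given the 32-thread warp size, a thread's warp and lane within its block follow from simple arithmetic on its thread index; a tiny sketch (names are mine):

```cuda
__global__ void warpInfo(int *warpOf, int *laneOf)
{
    int tid  = threadIdx.x;             // linear thread index within the block
    int warp = tid / 32;                // which warp of the block this thread is in
    int lane = tid % 32;                // position within that warp, 0..31
    int i = blockIdx.x * blockDim.x + tid;
    warpOf[i] = warp;
    laneOf[i] = lane;
}
```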
Implementing zero-overhead scheduling > a scoreboard qualifies each warp for issue each cycle; the instruction scheduler prioritizes all ready warps and selects one for issue, maintaining fairness.
intermediate languages use virtual registers; the optimizer analyzes data dependencies, allocates real registers, eliminates dead code, folds instructions together, and optimizes SIMT branch divergence and convergence points.
Memory access
> local/shared (SM)/global
> coalescing: separate memory requests from a warp's threads are merged into fewer, wider transactions (sketch below)
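A sketch contrasting the two access patterns (kernel names are mine; assumes the launch covers n): when a warp's 32 threads read 32 consecutive words, the hardware merges them into a few wide transactions; a large stride defeats the merging.

```cuda
// Coalesced: thread i reads word i, so each warp touches one contiguous
// span of memory -> a few wide transactions per warp.
__global__ void copyCoalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: neighboring threads read words `stride` apart, so one warp's
// loads scatter across many memory lines and cannot be coalesced.
__global__ void copyStrided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n) out[i] = in[i * stride];
}
```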
Texture unit > input: texture coordinates; output: filtered samples, typically a four-component (RGBA) color (sketch below).
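A hedged host-plus-kernel sketch of that texture path using the CUDA runtime's texture-object API (all names and the helper are mine; error checking omitted): normalized (u, v) coordinates go in, a bilinearly filtered float4 RGBA sample comes out.

```cuda
#include <cuda_runtime.h>

__global__ void sampleKernel(cudaTextureObject_t tex, float4 *out, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    // Input: (u, v) texture coordinates; output: filtered RGBA sample.
    float u = (x + 0.5f) / w, v = (y + 0.5f) / h;
    out[y * w + x] = tex2D<float4>(tex, u, v);
}

// Hypothetical helper: upload RGBA pixels and build a texture object.
cudaTextureObject_t makeTexture(const float4 *hostPixels, int w, int h)
{
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float4>();
    cudaArray_t arr;
    cudaMallocArray(&arr, &desc, w, h);
    cudaMemcpy2DToArray(arr, 0, 0, hostPixels, w * sizeof(float4),
                        w * sizeof(float4), h, cudaMemcpyHostToDevice);

    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeArray;
    res.res.array.array = arr;

    cudaTextureDesc td = {};
    td.filterMode       = cudaFilterModeLinear;  // bilinear filtering in the unit
    td.normalizedCoords = 1;                     // coordinates in [0, 1)
    td.addressMode[0] = td.addressMode[1] = cudaAddressModeClamp;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &td, nullptr);
    return tex;
}
```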
Rasterization > viewport/clip/setup/raster/zcull block (turns primitives into pixel fragments).
GPU ^
-------------------------------------------
Parallel Computing Architecture
Throughput applications
extensive data parallelism
modest task parallelism
intensive floating-point arithmetic
latency tolerance (performance = total amount of work completed, not individual thread latency)
streaming data flow (relatively low data reuse)
modest inter-thread sync
Difference from earlier GPUs: Tesla requires that parallel threads synchronize, communicate, share data, and cooperate > thread block = cooperative thread array (CTA)
thread : computes result elements selected by its TID
CTA : computes result blocks selected by CTA ID
grid computes result blocks, and sequential grids compute sequentially dependent application steps (sketch below).
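A minimal CUDA sketch of this decomposition (kernel names are illustrative): TID selects a thread's element, CTA ID selects the block's tile, the grid covers the whole result, and a dependent step runs as a second, sequential grid.

```cuda
__global__ void scale(const float *in, float *out, float a, int n)
{
    // CTA ID (blockIdx) selects the block of results; TID (threadIdx)
    // selects this thread's element within that block.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a * in[i];
}

__global__ void shiftBy(float *x, float b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += b;
}

// Host: grids launched in sequence compute sequentially dependent steps.
//   int blocks = (n + 255) / 256;
//   scale<<<blocks, 256>>>(d_in, d_out, 2.0f, n);  // step 1: whole grid
//   shiftBy<<<blocks, 256>>>(d_out, 1.0f, n);      // step 2 depends on step 1
```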
relaxed memory order: preserves the order of reads and writes to the same memory address from the same issuing thread, but not across different addresses (fence sketch below).
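Because only same-address ordering is guaranteed, publishing data through a flag at a different address needs an explicit fence; a hedged single-kernel sketch using __threadfence() (names are mine, and it assumes both blocks are resident simultaneously):

```cuda
__device__ volatile int   ready   = 0;
__device__ volatile float payload = 0.0f;

__global__ void publish(float v, float *out)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        payload = v;      // write the data at one address...
        __threadfence();  // ...fence: relaxed order alone would not make this
                          // write visible before the flag write below
        ready = 1;        // flag lives at a different address
    }
    if (blockIdx.x == 1 && threadIdx.x == 0) {
        while (ready == 0) {}  // spin until the flag is set
        *out = payload;        // producer's fence makes payload current here
    }
}
// Launch: publish<<<2, 1>>>(3.14f, d_out);
```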
Scaling is required across the GPU product range.
Scalability: achieved by varying the number of SMs, TPCs, ROPs, caches, and memory partitions.
future development: scheduling and load balancing, enhanced scalability for derivative products, reduced synchronization and communication overhead, new graphics features, increased memory bandwidth, and improved power efficiency.