Erik Lindholm, John Nickolls, Stuart Oberman, and John Montrym, "NVIDIA Tesla: A Unified Graphics and Computing Architecture," IEEE Micro, vol. 28, no. 2, pp. 39-55, March-April 2008.
Compute Unified Device Architecture (CUDA) parallel programming model and development tools
Tesla = unified graphics and computing architecture
Old GeForce:
vertex transform and lighting processor
fixed-function integer pixel-fragment pipeline
(programmed via OpenGL / DirectX)
Later: programmable vertex shaders, then a floating-point fragment pipeline
Vertex processors
> operate on vertices of primitives such as points, lines, and triangles
> transform coordinates to screen space
> compute lighting and texture parameters
> output to the setup unit and rasterizer (which converts primitives into pixel fragments)
> low-latency, high-precision math operations
> typically fully programmable
Pixel fragment processors
> shade the interior of primitives, interpolating per-vertex parameters
> high-latency, lower-precision texture filtering
> scenes have far more pixels than vertices => more pixel processors than vertex processors
Tesla > Vertex + Pixel into one unified processor architecture.
scalable processor array
memory > DRAM + fixed-function raster operation processors (ROPs) that perform color and depth frame-buffer operations directly in memory
Command processing
> interface to the CPU > responds to commands from the CPU, fetches data from system memory, checks command consistency, and performs context switching
> input assembler > collects geometric primitives and fetches associated vertex input attribute data
> executes vertex, geometry, and pixel shaders as well as compute programs
barrier synchronization (sketch below)
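A minimal CUDA sketch of a block-wide barrier (the kernel name and tile size are my own, not from the paper): threads stage a tile in shared memory, and __syncthreads() guarantees every write is visible before any thread reads a peer's element.

```cuda
// Reverse each 64-element tile of `in` into `out` using shared memory.
__global__ void reverseTile(const float *in, float *out)
{
    __shared__ float s[64];        // tile staged in per-block shared memory
    int i = blockIdx.x * 64 + threadIdx.x;

    s[threadIdx.x] = in[i];        // each thread writes one element
    __syncthreads();               // barrier: all writes now visible block-wide

    out[i] = s[63 - threadIdx.x];  // safe to read another thread's element
}
// Launch with 64 threads per block, e.g. reverseTile<<<n / 64, 64>>>(d_in, d_out);
```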
SIMT (single-instruction, multiple-thread)
unit creates, manages, schedules, and executes threads in groups of 32 parallel threads called warps.
each thread executes independently; individual threads can be inactive due to independent branching or predication.
full potential > if all 32 threads of a warp take the same execution path (divergence sketch below).
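A sketch of SIMT branch divergence (kernel and variable names are illustrative): threads of a warp that take different paths are executed serially with the other threads masked off, while a warp-uniform branch costs nothing extra.

```cuda
__global__ void divergent(float *x)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Threads of one warp split between the two paths: the hardware runs
    // each path serially, with the threads on the other path inactive.
    if (i % 2 == 0)
        x[i] = x[i] * 2.0f;
    else
        x[i] = x[i] + 1.0f;

    // Warp-uniform branch: all 32 threads of a warp agree (i / 32 is
    // constant within a warp), so there is no divergence penalty.
    if ((i / 32) % 2 == 0)
        x[i] -= 1.0f;
}
```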
SIMD vs. SIMT
SIMT: applies one instruction to multiple independent threads in parallel, not just to multiple data lanes > controls the execution and branching behavior of each thread
SIMD: one instruction controls a vector of multiple data lanes
warp: 32 threads of the same type (vertex/geometry/pixel/compute)
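Given the 32-thread warp size, a thread's warp and lane within its block follow from simple arithmetic on its thread index; a tiny sketch (names are mine):

```cuda
__global__ void warpInfo(int *warpOf, int *laneOf)
{
    int tid  = threadIdx.x;             // linear thread index within the block
    int warp = tid / 32;                // which warp of the block this thread is in
    int lane = tid % 32;                // position within that warp, 0..31
    int i = blockIdx.x * blockDim.x + tid;
    warpOf[i] = warp;
    laneOf[i] = lane;
}
```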
Implementing zero-overhead scheduling > a scoreboard qualifies each warp for issue each cycle; the instruction scheduler prioritizes all ready warps and selects one for issue, maintaining fairness.
intermediate languages use virtual registers; the optimizer analyzes data dependencies, allocates real registers, eliminates dead code, folds instructions together, and optimizes SIMT branch divergence and convergence points.
Memory access
> local/shared (SM)/global
> coalescing: separate memory requests from a warp's threads are merged into fewer, wider transactions (sketch below)
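A sketch contrasting the two access patterns (kernel names are mine; assumes the launch covers n): when a warp's 32 threads read 32 consecutive words, the hardware merges them into a few wide transactions; a large stride defeats the merging.

```cuda
// Coalesced: thread i reads word i, so each warp touches one contiguous
// span of memory -> a few wide transactions per warp.
__global__ void copyCoalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: neighboring threads read words `stride` apart, so one warp's
// loads scatter across many memory lines and cannot be coalesced.
__global__ void copyStrided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n) out[i] = in[i * stride];
}
```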
Texture unit > input: texture coordinates; output: filtered samples, typically a four-component (RGBA) color (sketch below).
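A hedged host-plus-kernel sketch of that texture path using the CUDA runtime's texture-object API (all names and the helper are mine; error checking omitted): normalized (u, v) coordinates go in, a bilinearly filtered float4 RGBA sample comes out.

```cuda
#include <cuda_runtime.h>

__global__ void sampleKernel(cudaTextureObject_t tex, float4 *out, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    // Input: (u, v) texture coordinates; output: filtered RGBA sample.
    float u = (x + 0.5f) / w, v = (y + 0.5f) / h;
    out[y * w + x] = tex2D<float4>(tex, u, v);
}

// Hypothetical helper: upload RGBA pixels and build a texture object.
cudaTextureObject_t makeTexture(const float4 *hostPixels, int w, int h)
{
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float4>();
    cudaArray_t arr;
    cudaMallocArray(&arr, &desc, w, h);
    cudaMemcpy2DToArray(arr, 0, 0, hostPixels, w * sizeof(float4),
                        w * sizeof(float4), h, cudaMemcpyHostToDevice);

    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeArray;
    res.res.array.array = arr;

    cudaTextureDesc td = {};
    td.filterMode       = cudaFilterModeLinear;  // bilinear filtering in the unit
    td.normalizedCoords = 1;                     // coordinates in [0, 1)
    td.addressMode[0] = td.addressMode[1] = cudaAddressModeClamp;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &td, nullptr);
    return tex;
}
```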
Rasterization > viewport/clip/setup/raster/zcull block (turns primitives into pixel fragments).
GPU ^
-------------------------------------------
Parallel Computing Architecture
Throughput applications
extensive data parallelism
modest task parallelism
intensive floating-point arithmetic
latency tolerance (performance = total amount of work completed, not individual thread latency)
streaming data flow (relatively low data reuse)
modest inter-thread sync
Difference from earlier GPUs: Tesla requires that parallel threads synchronize, communicate, share data, and cooperate > thread block = cooperative thread array (CTA)
thread : computes result elements selected by its TID
CTA : computes result blocks selected by CTA ID
grid computes result blocks, and sequential grids compute sequentially dependent application steps (sketch below).
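A minimal CUDA sketch of this decomposition (kernel names are illustrative): TID selects a thread's element, CTA ID selects the block's tile, the grid covers the whole result, and a dependent step runs as a second, sequential grid.

```cuda
__global__ void scale(const float *in, float *out, float a, int n)
{
    // CTA ID (blockIdx) selects the block of results; TID (threadIdx)
    // selects this thread's element within that block.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a * in[i];
}

__global__ void shiftBy(float *x, float b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += b;
}

// Host: grids launched in sequence compute sequentially dependent steps.
//   int blocks = (n + 255) / 256;
//   scale<<<blocks, 256>>>(d_in, d_out, 2.0f, n);  // step 1: whole grid
//   shiftBy<<<blocks, 256>>>(d_out, 1.0f, n);      // step 2 depends on step 1
```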
relaxed memory order: preserves the order of reads and writes to the same memory address from the same issuing thread, but not across different addresses (fence sketch below).
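Because only same-address ordering is guaranteed, publishing data through a flag at a different address needs an explicit fence; a hedged single-kernel sketch using __threadfence() (names are mine, and it assumes both blocks are resident simultaneously):

```cuda
__device__ volatile int   ready   = 0;
__device__ volatile float payload = 0.0f;

__global__ void publish(float v, float *out)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        payload = v;      // write the data at one address...
        __threadfence();  // ...fence: relaxed order alone would not make this
                          // write visible before the flag write below
        ready = 1;        // flag lives at a different address
    }
    if (blockIdx.x == 1 && threadIdx.x == 0) {
        while (ready == 0) {}  // spin until the flag is set
        *out = payload;        // producer's fence makes payload current here
    }
}
// Launch: publish<<<2, 1>>>(3.14f, d_out);
```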
Scaling is required across the GPU product range.
Scalability: achieved by varying the number of SMs, TPCs, ROPs, caches, and memory partitions.
future development: scheduling and load balancing, enhanced scalability for derivative products, reduced synchronization and communication overhead, new graphics features, increased memory bandwidth, and improved power efficiency.