(2.5.7) 3D die-stacking

Gabriel H. Loh, Yuan Xie, Bryan Black, Processor Design in Three-Dimensional Die-Stacking Technologies, IEEE Micro, vol. 27(3), pp. 31-48, May-June 2007.

two natural topologies: face-to-face or face-to-back

copper-copper bonding process builds an interdie connection > die-to-die (d2d) or 3D via.

d2d via pitch significantly larger than individual transistor.

> size determines how finely processor blocks and functional units can be partitioned in 3D.

> latency.

> RC delay: d2d via is about 35% of a full stack of vias connecting metal 1 to metal 9.

30% of pins used for power

high power density

face-to-face > requires through-silicon vias (TSVs) for I/O and power. TSVs have low inductance, so no problem.

3D (two layers) => 50% of the footprint of the same design in 2D => 50% of the area for pins => power delivery capacity cut to 50%
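
The footprint > pins > power delivery chain above can be checked with a little arithmetic. A minimal sketch: the 30% power-pin share comes from the notes, while the footprint and pin-density values are made-up assumptions for illustration only.

```python
def power_pins(footprint_mm2, pins_per_mm2, power_fraction=0.30):
    """Number of pins available for power delivery on a given footprint.

    power_fraction = 0.30 is the "30% of pins used for power" figure from
    the notes; pins_per_mm2 is a hypothetical package pin density.
    """
    total_pins = footprint_mm2 * pins_per_mm2
    return total_pins * power_fraction


# Hypothetical numbers: a 200 mm^2 planar die folded onto two 100 mm^2 layers.
PIN_DENSITY = 5.0  # pins per mm^2 (assumed)
pins_2d = power_pins(200.0, PIN_DENSITY)
pins_3d = power_pins(100.0, PIN_DENSITY)

# Halving the footprint halves the pins that can deliver power.
ratio = pins_3d / pins_2d
print(f"power pins 2D: {pins_2d:.0f}, 3D: {pins_3d:.0f}, ratio: {ratio:.2f}")
```

The ratio depends only on the footprints, so the same 50% result holds for any pin density or power-pin share.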

++ because wire loads are low : power demand goes down

thermals

successive layer farther away from the heat sink

power density

sim result: worst-case temp of the 3D config ~ that of 2D processors!
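
Why a layer farther from the heat sink tends to run hotter can be sketched with a 1D series thermal-resistance model. This is a toy model, not the simulation behind the result above: all resistances, powers and the ambient temperature are assumed values, and it ignores lateral heat spreading and the power savings from shorter wires.

```python
def layer_temperatures(layer_powers_w, r_sink_k_per_w, r_interlayer_k_per_w, t_ambient_c):
    """Steady-state temperature per layer; layer 0 is closest to the heat sink.

    All heat leaves through the sink, so the heat crossing each interface
    is the total power of the layers farther away from the sink.
    """
    heat_through = sum(layer_powers_w)
    temp = t_ambient_c + r_sink_k_per_w * heat_through
    temps = [temp]
    for closer_layer_power in layer_powers_w[:-1]:
        # Heat generated by the closer layer does not cross the next interface.
        heat_through -= closer_layer_power
        temp += r_interlayer_k_per_w * heat_through
        temps.append(temp)
    return temps


# Assumed values: two 40 W layers, 0.5 K/W sink, 0.25 K/W between layers, 45 C ambient.
temps = layer_temperatures([40.0, 40.0], 0.5, 0.25, 45.0)
print(temps)  # [85.0, 95.0]
```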

wire reduction 3D placement

Partitioning granularity | Layer 1     | Layer 2
Entire cores             | CPU         | L2
Functional block units   | ROB         | ALU
Logic gate               | Mux [31:16] | Mux [15:0]
Transistor level         | PMOS        | NMOS
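
The logic-gate row can be illustrated with a bit-sliced 32-bit 2:1 mux: bits [31:16] on one layer, bits [15:0] on the other. A minimal behavioral sketch; the function names are hypothetical and the "layers" are just comments, since Python only models the logic, not the placement.

```python
def mux16(sel, a, b):
    """16-bit 2:1 mux: returns b when sel is set, otherwise a."""
    return (b if sel else a) & 0xFFFF


def mux32_two_layers(sel, a, b):
    """32-bit 2:1 mux split into two 16-bit slices, one per layer."""
    upper = mux16(sel, a >> 16, b >> 16)        # layer 1: bits [31:16]
    lower = mux16(sel, a & 0xFFFF, b & 0xFFFF)  # layer 2: bits [15:0]
    return (upper << 16) | lower


def mux32_planar(sel, a, b):
    """Reference: the same mux as one unpartitioned 32-bit block."""
    return (b if sel else a) & 0xFFFFFFFF


a, b = 0xDEADBEEF, 0x12345678
assert mux32_two_layers(0, a, b) == mux32_planar(0, a, b)
assert mux32_two_layers(1, a, b) == mux32_planar(1, a, b)
```

Only the select signal has to reach both layers; the data bits of each slice stay on their own layer.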

3D cache

cores on one layer, cache on another

2x cores + 2x cache on one, 2x core + 2x cache on other

within cache

stacked bit lines

stacked word lines

word lines have more delay than bit lines => stacking word lines provides lower overall latency.

but for power => shorten the bit lines (they are much longer)
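
How the two stacking options shorten different wires can be sketched with simple array geometry. Assumptions: word lines run across the columns of a row, bit lines run down the rows of a column, and the array size and cell dimensions below are hypothetical.

```python
def wire_lengths(rows, cols, cell_w_um, cell_h_um, layers=1, stacked="none"):
    """Return (word_line_um, bit_line_um) for one SRAM array.

    stacked="wordline": columns are split across layers (shorter word lines).
    stacked="bitline":  rows are split across layers (shorter bit lines).
    """
    if stacked == "wordline":
        assert cols % layers == 0
        cols //= layers
    elif stacked == "bitline":
        assert rows % layers == 0
        rows //= layers
    return cols * cell_w_um, rows * cell_h_um


# Hypothetical array: 512 rows x 128 columns, 1.0 um x 0.5 um cells.
planar = wire_lengths(512, 128, 1.0, 0.5)
wl_stacked = wire_lengths(512, 128, 1.0, 0.5, layers=2, stacked="wordline")
bl_stacked = wire_lengths(512, 128, 1.0, 0.5, layers=2, stacked="bitline")

print("planar           (WL, BL):", planar)
print("word-line stacked (WL, BL):", wl_stacked)
print("bit-line stacked  (WL, BL):", bl_stacked)
```

Each option halves one wire and leaves the other unchanged, which is why the choice trades latency against power.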

(1) Eliminating critical wires > latency + power reductions

(2) different partitioning strategies to match communication density of a given d2d via interface

d2d via pitch does not scale at the same rate as feature size.

(3) partitioning > power, performance, area.

Mixed process integration

DRAM

on-stack DC-to-DC converters

decoupling capacitors

Cool places to use

eliminate pipeline wires

higher timing margins : clock skew/jitter is low

better performance/watt ratio

clock frequency improvements > most effective for power

reduce the number of pipeline stages (wires)

NUCA (non-uniform cache architecture): inherent problem is managing data that cores share.

Dynamic NUCA

3D version of L2: 90% reduction in cache migrations
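
One intuition for why a 3D L2 helps NUCA is that banks sit closer together on average. A toy sketch, assuming a grid of banks with unit cost per hop (including the vertical d2d hop); the grid sizes are hypothetical and this does not model migration policy or reproduce the 90% figure.

```python
from itertools import product


def average_hops(dims):
    """Mean Manhattan distance over all ordered bank pairs in a grid."""
    banks = list(product(*(range(d) for d in dims)))
    total = sum(
        sum(abs(x - y) for x, y in zip(p, q))
        for p in banks
        for q in banks
    )
    return total / len(banks) ** 2


# 16 banks laid out as a 4x4 planar grid vs. two stacked layers of 2x4.
hops_2d = average_hops((4, 4))
hops_3d = average_hops((2, 2, 4))
print(hops_2d, hops_3d)  # 2.5 2.25
```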

-- needs 3D place-and-route, floorplanning tools, 3D visualization and layout.

fault on a single layer > complete waste of entire stack

DFT (design for test) > difficult with finely partitioned 3D structures: a single die might hold only 50% of a complete circuit.
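
The cost of a single-layer fault wasting the whole stack can be put in numbers. A minimal sketch, assuming layers fail independently and are bonded without pre-bond testing (which the DFT problem above makes hard); the per-layer yields are made-up values.

```python
from math import prod


def stack_yield(layer_yields):
    """Probability that every layer in the stack is fault-free.

    Assumes independent layer failures and no pre-bond testing, so one
    bad layer scraps the whole stack.
    """
    return prod(layer_yields)


# Hypothetical per-layer yields of 90%.
two_layer = stack_yield([0.9, 0.9])
four_layer = stack_yield([0.9] * 4)
print(f"2 layers: {two_layer:.3f}, 4 layers: {four_layer:.3f}")
```

Under these assumptions yield falls multiplicatively with layer count, which is the motivation for testing dies before bonding.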