(2.5.7) 3D die-stacking

Gabriel H. Loh, Yuan Xie, Bryan Black, Processor Design in Three-Dimensional Die-Stacking Technologies, In IEEE Micro, vol. 27(3), pp. 31-48, May-June, 2007.

two natural stacking topologies : face-to-face or face-to-back

copper-to-copper bonding process builds the interdie connection > called a die-to-die (d2d) via or 3D via.

d2d via pitch is significantly larger than an individual transistor.
     > via size determines which parts of processor blocks and functional units can be split across layers.
     > latency.
     > RC delay of a d2d via : 35% of a full stack of vias connecting metal 1 to metal 9.
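
A rough sketch of why via pitch limits partitioning granularity: the number of d2d vias that fit under a block bounds how many signals can cross layers there. The function names, the square-grid layout, and all numbers (block size, 10 um pitch) are illustrative assumptions, not values from the paper.

```python
import math


def d2d_vias_available(block_w_um: float, block_h_um: float, via_pitch_um: float) -> int:
    """Upper bound on d2d vias under a block, assuming a square grid at the given pitch."""
    cols = math.floor(block_w_um / via_pitch_um)
    rows = math.floor(block_h_um / via_pitch_um)
    return cols * rows


def can_split_across_layers(signals_crossing: int, block_w_um: float,
                            block_h_um: float, via_pitch_um: float) -> bool:
    """True if every signal cut by the partition can get its own d2d via."""
    return signals_crossing <= d2d_vias_available(block_w_um, block_h_um, via_pitch_um)


# Hypothetical 100 um x 50 um block with a hypothetical 10 um via pitch: 10 * 5 = 50 vias.
print(d2d_vias_available(100.0, 50.0, 10.0))
print(can_split_across_layers(32, 100.0, 50.0, 10.0))  # 32 signals fit in 50 vias
print(can_split_across_layers(64, 100.0, 50.0, 10.0))  # 64 signals do not
```

A finer partition (gate or transistor level) cuts more signals per unit area, so it needs a smaller via pitch to pass this check.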

30% of pins used for power

high power density

face-to-face > requires through-silicon vias (TSVs) for I/O and power. TSV inductance is low, so this is not a problem.
     3D => a 2-die stack has 50% of the footprint of the same 2 dies placed in 2D => 50% of the area for pins => 50% of the power-delivery capacity
     ++ because wire loads are lower, power demand goes down
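
Back-of-the-envelope arithmetic for the pin squeeze above, as a sketch only. It assumes pin count scales linearly with footprint and that each power pin delivers the same power in 2D and 3D. The function name, the 1000-pin package, and the 0.8 demand ratio are made-up illustration values. Only the 30% and 50% figures come from the notes.

```python
def power_pin_share_after_stacking(total_pins_2d: int,
                                   power_fraction_2d: float,
                                   footprint_ratio: float,
                                   power_demand_ratio: float = 1.0) -> float:
    """Fraction of the 3D package's pins needed for power.

    footprint_ratio: 3D footprint / 2D footprint (0.5 for a 2-die stack).
    power_demand_ratio: 3D power / 2D power (below 1.0 if shorter wires save power).
    """
    power_pins_needed = total_pins_2d * power_fraction_2d * power_demand_ratio
    pins_available_3d = total_pins_2d * footprint_ratio
    return power_pins_needed / pins_available_3d


# Same power demand, half the pins: the power share doubles from 30% to about 60%.
same_demand = power_pin_share_after_stacking(1000, 0.30, 0.5)

# Hypothetical 20% power saving from lower wire loads eases the squeeze (about 48%).
lower_demand = power_pin_share_after_stacking(1000, 0.30, 0.5, power_demand_ratio=0.8)

print(round(same_demand, 2), round(lower_demand, 2))
```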

thermals
     each successive layer is farther away from the heat sink
     power density goes up
     sim result : worst-case temp of the 3D config ~ that of the 2D processors!
     wire reduction from 3D placement
     
Partitioning granularity      | Layer 1     | Layer 2
     Entire cores             | CPU         | L2
     Functional block units   | ROB         | ALU
     Logic gate               | Mux [31:16] | Mux [15:0]
     Transistor level         | PMOS        | NMOS

3D cache
     cores on one layer, cache on another
     2x cores + 2x cache on one, 2x core + 2x cache on other
     within cache
          stacked bit lines
          stacked word lines
          word lines contribute more delay than bit lines => stacking word lines gives lower overall latency.
          but for power => shorten the bit lines instead (they are much longer)
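
Why folding a long array wire onto two layers helps so much: the delay of a distributed RC wire grows with the square of its length. A minimal sketch, assuming an unloaded distributed RC line (the commonly quoted 0.38 * R * C approximation for the 50% point) and ignoring cell loading and the added d2d via. The per-um resistance and capacitance values are placeholders, not numbers from the paper.

```python
def distributed_rc_delay(r_per_um: float, c_per_um: float, length_um: float) -> float:
    """Approximate 50% delay of a distributed RC wire: 0.38 * (r*L) * (c*L)."""
    return 0.38 * (r_per_um * length_um) * (c_per_um * length_um)


# Placeholder wire parameters (ohm/um and F/um), for illustration only.
R_PER_UM = 0.5
C_PER_UM = 0.2e-15

full_wire = distributed_rc_delay(R_PER_UM, C_PER_UM, 1000.0)
folded_wire = distributed_rc_delay(R_PER_UM, C_PER_UM, 500.0)  # split over 2 layers

# Halving the length cuts the wire's own RC delay to a quarter.
print(folded_wire / full_wire)
```

The same square-law applies to whichever wire is folded, so the choice between stacking word lines and bit lines depends on which one dominates the metric of interest.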

(1) Eliminating critical wires > latency + power reductions
(2) different partitioning strategies to match the communication density that a given d2d via interface can support
     via pitch does not scale at the same rate as feature size.
(3) partitioning choice > affects power, performance, area.

Mixed process integration
     DRAM
     on-stack DC-to-DC converters
     decoupling capacitors

Cool places to use
     eliminate pipeline wires
     higher timing margins : clock skew/jitter is low
     better performance/watt ratio
     clock frequency improvements > most effective for power
     reduce the number of pipeline stages (wires)
     
NUCA (non-uniform cache architecture) : inherent problem : managing data that cores share.
     Dynamic NUCA
     3D version of L2 : 90% reduction in cache migrations
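
A toy illustration of how stacking brings cache banks closer to a core. This only measures distance and does not model migration or reproduce the 90% figure. The grid shapes, the core position, and the unit cost of a vertical hop are all assumptions chosen for the example.

```python
from itertools import product


def avg_hops(dims: tuple, core: tuple, vertical_hop_cost: float = 1.0) -> float:
    """Average Manhattan distance from a core to every bank in an nx * ny * nz grid."""
    nx, ny, nz = dims
    cx, cy, cz = core
    total = 0.0
    for x, y, z in product(range(nx), range(ny), range(nz)):
        total += abs(x - cx) + abs(y - cy) + vertical_hop_cost * abs(z - cz)
    return total / (nx * ny * nz)


# 16 banks laid out flat (4 x 4) vs. the same 16 banks on two layers (4 x 2 x 2).
flat = avg_hops((4, 4, 1), (0, 0, 0))
stacked = avg_hops((4, 2, 2), (0, 0, 0))

print(flat, stacked)  # 3.0 2.5
```

With a cheaper vertical hop (d2d vias are short), the stacked average drops further, e.g. `vertical_hop_cost=0.5` in this toy model.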

- - needs 3D place and route, floorplanning tools, 3D visualization and layout.
     fault on a single layer > entire stack is wasted
     DFT (design for test) > difficult with finely partitioned 3D structures : a single die might hold only 50% of a complete circuit.
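
The yield consequence of "one bad layer wastes the stack" can be sketched as a product of per-die yields. This assumes dies cannot be tested before bonding (the DFT problem above) and that defects on different dies are independent. The function name and the yield numbers are hypothetical.

```python
import math


def stack_yield(die_yields: list, bonding_yield: float = 1.0) -> float:
    """Yield of a stack where any faulty die (or a failed bond) scraps the whole stack."""
    return math.prod(die_yields) * bonding_yield


# Two hypothetical dies at 90% yield each: only about 81% of stacks work.
print(round(stack_yield([0.9, 0.9]), 4))

# Adding layers compounds the loss: four such dies give about 66%.
print(round(stack_yield([0.9] * 4), 4))
```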