(3.4.3) Alpha 21364 Network Architecture

Shubhendu S. Mukherjee, Peter Bannon, Steven Lang, Aaron Spink, David Webb, "The Alpha 21364 Network Architecture," IEEE Micro, vol. 22, no. 1, pp. 26-35, Jan./Feb. 2002.
152 million transistors; 22.4 Gbytes/s of router bandwidth.

The paper describes the router: 1.2 GHz (the same clock speed as the processor)
    aggressive routing algorithms, distributed arbiters, large on-chip buffering, and a pipelined router implementation
    support for directory-based cache coherence (virtual packet classes)
    
Flit: the portion of a packet transported in parallel on a single clock edge.

Packet classes:
    Request
    Forward
    Block response (carries data)
    Nonblock response (e.g., acknowledgments)
    Write I/O
    Read I/O
    Special (e.g., no-op packets and packets carrying buffer-deallocation information between routers)
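One reason to keep the classes apart (see the deadlock notes below) is that each class gets its own buffer space, so a flood of one class cannot take the slots another class needs. The sketch below is a minimal illustration under stated assumptions: `ClassedInputBuffer` and its per-class capacity are hypothetical and do not model the 21364's actual buffer organization; only the seven class names come from the notes.

```python
from collections import deque
from enum import Enum


class PacketClass(Enum):
    """The seven packet classes listed in the notes."""
    REQUEST = "request"
    FORWARD = "forward"
    BLOCK_RESPONSE = "block response"        # carries data
    NONBLOCK_RESPONSE = "nonblock response"  # e.g. acknowledgments
    WRITE_IO = "write I/O"
    READ_IO = "read I/O"
    SPECIAL = "special"                      # e.g. no-op packets


class ClassedInputBuffer:
    """Hypothetical input buffer with a separate FIFO per packet class.

    Because each class has its own space, a flood of one class (e.g. requests)
    cannot occupy the slots that another class (e.g. block responses) needs.
    """

    def __init__(self, capacity_per_class: int) -> None:
        self.capacity = capacity_per_class
        self.queues = {cls: deque() for cls in PacketClass}

    def try_accept(self, cls: PacketClass, packet: str) -> bool:
        """Enqueue the packet if its class still has room; return whether it was accepted."""
        queue = self.queues[cls]
        if len(queue) >= self.capacity:
            return False
        queue.append(packet)
        return True


# Usage: requests fill their own queue, but a block response is still accepted.
buf = ClassedInputBuffer(capacity_per_class=2)
print([buf.try_accept(PacketClass.REQUEST, f"req{i}") for i in range(3)])  # [True, True, False]
print(buf.try_accept(PacketClass.BLOCK_RESPONSE, "data0"))                 # True
```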

2D torus
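A 2D torus is a mesh with wraparound links at the edges, so the shortest way along a dimension may go "backward" through the wrap. A minimal sketch of hop counting, assuming routers are labeled by integer (x, y) coordinates in a kx-by-ky torus; the helper names are hypothetical.

```python
def ring_offset(src: int, dst: int, k: int) -> int:
    """Signed shortest offset from src to dst on a ring of k routers.

    Ties (distance exactly k/2) are broken toward the positive direction.
    """
    forward = (dst - src) % k
    return forward if forward <= k // 2 else forward - k


def torus_hops(src: tuple, dst: tuple, dims: tuple) -> int:
    """Minimal hop count between two routers in a torus with the given dimensions."""
    return sum(abs(ring_offset(s, d, k)) for s, d, k in zip(src, dst, dims))


def neighbors(node: tuple, dims: tuple) -> list:
    """The four neighbors of a router in a 2D torus, using wraparound links at the edges."""
    x, y = node
    kx, ky = dims
    return [((x + 1) % kx, y), ((x - 1) % kx, y), (x, (y + 1) % ky), (x, (y - 1) % ky)]


print(torus_hops((0, 0), (3, 3), (4, 4)))  # 2: one wraparound hop in each dimension
print(neighbors((0, 0), (4, 4)))           # [(1, 0), (3, 0), (0, 1), (0, 3)]
```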

Virtual cut-through routing
    the flits of a packet proceed through multiple routers until some router blocks the header flit
    the blocking router buffers all of the packet's flits until the congestion clears
    the blocking router then schedules the packet toward its destination [the router has buffer space for 316 packets]
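The buffering rule above can be sketched as a steady-state calculation: once the header stalls, flits keep advancing while downstream buffers have room, so they pile up backwards from the blocking router. This is a minimal sketch, assuming per-router buffer sizes in flits (hypothetical values) and ignoring timing; the 316-packet capacity is not modeled. A small-buffer (wormhole-style) case is included only for contrast.

```python
def occupancy_when_blocked(n_flits: int, block_at: int, buffer_flits: int):
    """Where a packet's flits end up once its header stalls at router `block_at`.

    Routers are numbered 0..block_at along the path. Flits keep advancing until
    the buffers downstream are full, so they pile up from the blocking router
    backwards. Returns (per-router flit counts, flits still held at the source).
    """
    occupancy = [0] * (block_at + 1)
    remaining = n_flits
    for router in range(block_at, -1, -1):
        taken = min(buffer_flits, remaining)
        occupancy[router] = taken
        remaining -= taken
    return occupancy, remaining


# Virtual cut-through: a buffer holds a whole packet, so it collapses into one router.
print(occupancy_when_blocked(n_flits=5, block_at=3, buffer_flits=5))  # ([0, 0, 0, 5], 0)
# 2-flit buffers, for contrast: the packet stays strung across several routers.
print(occupancy_when_blocked(n_flits=5, block_at=3, buffer_flits=2))  # ([0, 1, 2, 2], 0)
```

With whole-packet buffers, routers 0 to 2 are left free for other traffic while the blocked packet waits.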

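Adaptive routing
    restricted to the minimum rectangle (source and destination at diagonally opposite corners), i.e., only shortest paths
    at each hop, a packet can either continue in the same direction or turn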
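Staying inside the minimum rectangle means every hop must be "productive", i.e., reduce the distance to the destination. A minimal sketch of minimal adaptive routing on a torus, assuming a hypothetical `is_blocked(node, direction)` congestion oracle and a hypothetical policy of preferring to continue straight; the notes do not say which choice the 21364 prefers.

```python
def ring_offset(src, dst, k):
    """Signed shortest offset on a ring of k routers (ties go positive)."""
    forward = (dst - src) % k
    return forward if forward <= k // 2 else forward - k


def productive_directions(cur, dst, dims):
    """Directions (axis, +1/-1) that move a packet closer to dst, i.e. stay inside
    the minimum rectangle spanned by the current router and the destination."""
    dirs = []
    for axis, (c, d, k) in enumerate(zip(cur, dst, dims)):
        off = ring_offset(c, d, k)
        if off != 0:
            dirs.append((axis, 1 if off > 0 else -1))
    return dirs


def step(node, direction, dims):
    """Move one hop along the given direction, wrapping around the torus."""
    axis, sign = direction
    moved = list(node)
    moved[axis] = (moved[axis] + sign) % dims[axis]
    return tuple(moved)


def adaptive_route(src, dst, dims, is_blocked):
    """Minimal adaptive route: continue straight if that direction is still productive
    and not blocked, otherwise turn onto another productive direction."""
    path, cur, prev = [src], src, None
    while cur != dst:
        options = productive_directions(cur, dst, dims)
        options.sort(key=lambda d: d != prev)  # prefer continuing in the same direction
        free = [d for d in options if not is_blocked(cur, d)]
        # If every productive direction is blocked, a real router would wait or use a
        # deadlock-free escape channel; this sketch simply takes the first option.
        choice = (free or options)[0]
        cur, prev = step(cur, choice, dims), choice
        path.append(cur)
    return path


dims = (4, 4)
print(adaptive_route((0, 0), (2, 1), dims, lambda node, d: False))
# [(0, 0), (1, 0), (2, 0), (2, 1)]
print(adaptive_route((0, 0), (2, 1), dims, lambda node, d: node == (1, 0) and d == (0, 1)))
# [(0, 0), (1, 0), (1, 1), (2, 1)]  (turns because +x is blocked at (1, 0))
```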
Deadlock avoidance
    deadlocks arise from cyclic dependences
    e.g., request packets can fill up the network and prevent block response packets from making progress
    > solution: separate virtual channels per packet class, with priority
    adaptive routing can additionally create intra-dimension and inter-dimension deadlocks
    Jose Duato's theory: as long as packets can always drain via a deadlock-free path, no deadlock will occur
    inter-dimension: VC0 and VC1 (the deadlock-free channels) are non-adaptive and route horizontally first, then vertically, so channels on the primary axis depend on the secondary axis but not vice versa. When a packet changes dimension, it recomputes its virtual channel in the new dimension.
    intra-dimension (Dally's scheme): number the processors in a dimension in increasing order; if the source's number is < the destination's, use VC0, else VC1
         but the load on VC0 and VC1 is not well balanced
    packets in VC0/VC1 can return to the adaptive channels if a subsequent router is not blocked
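The intra-dimension rule can be checked by building a channel dependency graph: a deterministic channel set is deadlock-free when that graph has no cycle, and by Duato's theory an acyclic escape set is what adaptive packets need to be able to drain into. A minimal sketch, assuming a single unidirectional ring (packets only move toward higher indices and wrap from k-1 to 0) and assuming the VC0/VC1 comparison is re-evaluated at every hop using the current router's number (the notes phrase it in terms of the source's number). All function names are hypothetical.

```python
from itertools import product


def vc_for_hop(current: int, dst: int) -> int:
    """Assumed rule: VC0 while the current router's index is below the destination's, else VC1."""
    return 0 if current < dst else 1


def ring_route(src: int, dst: int, k: int, use_vcs: bool = True) -> list:
    """Channels (link, vc) used from src to dst on a unidirectional ring of k routers.

    Link i is the channel from router i to router (i + 1) % k; link k-1 is the
    wraparound. With use_vcs=False every hop uses the same (single) channel.
    """
    hops, node = [], src
    while node != dst:
        hops.append((node, vc_for_hop(node, dst) if use_vcs else 0))
        node = (node + 1) % k
    return hops


def dependency_edges(k: int, use_vcs: bool = True) -> set:
    """Edge a -> b whenever some packet holds channel a while requesting channel b."""
    edges = set()
    for src, dst in product(range(k), repeat=2):
        hops = ring_route(src, dst, k, use_vcs)
        edges.update(zip(hops, hops[1:]))
    return edges


def has_cycle(edges: set) -> bool:
    """Depth-first search for a cycle in a directed graph given as an edge set."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)
    state = {}  # missing = unvisited, 1 = on the current DFS path, 2 = finished

    def visit(node) -> bool:
        state[node] = 1
        for nxt in graph.get(node, []):
            if state.get(nxt) == 1 or (nxt not in state and visit(nxt)):
                return True
        state[node] = 2
        return False

    return any(node not in state and visit(node) for node in list(graph))


print(ring_route(3, 1, k=4))                  # [(3, 1), (0, 0)]: VC1 up to the wrap, then VC0
print(has_cycle(dependency_edges(4, False)))  # True: one channel per link can deadlock
print(has_cycle(dependency_edges(4)))         # False: the two-VC assignment is acyclic
```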

Router Architecture
    input and output ports: local (cache and memory controller), interprocessor (off-chip network), and I/O
    clock forwarding
    
Local arbiters: perform various readiness tests to determine whether a packet can be speculatively dispatched through the router:
         the input holds a valid packet, the target output port is free, the next router has a free buffer, the anti-starvation mechanism permits it, and a read I/O packet does not pass a write I/O packet.
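The readiness tests amount to a conjunction of conditions. A minimal sketch, assuming hypothetical field names (`Candidate`, `busy_outputs`, `downstream_free` as a free-buffer count per output, a boolean anti-starvation hold); it only evaluates the tests and does not model what happens when a speculative dispatch later fails.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A packet at the head of an input buffer (field names are hypothetical)."""
    valid: bool
    output_port: str
    is_read_io: bool = False
    older_write_io_pending: bool = False  # an earlier write I/O packet has not left yet


def ready_to_dispatch(c: Candidate, busy_outputs: set, downstream_free: dict,
                      starvation_block: bool) -> bool:
    """Apply the readiness tests from the notes; all must pass."""
    if not c.valid:
        return False
    if c.output_port in busy_outputs:               # target output port must be free
        return False
    if downstream_free.get(c.output_port, 0) <= 0:  # next router needs a free buffer
        return False
    if starvation_block:                            # anti-starvation mechanism holds it back
        return False
    if c.is_read_io and c.older_write_io_pending:   # read I/O may not pass write I/O
        return False
    return True


free = {"east": 2, "north": 0}
print(ready_to_dispatch(Candidate(True, "east"), set(), free, False))   # True
print(ready_to_dispatch(Candidate(True, "north"), set(), free, False))  # False: no downstream buffer
print(ready_to_dispatch(Candidate(True, "east", is_read_io=True, older_write_io_pending=True),
                        set(), free, False))                            # False: would pass a write I/O
```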

Fairness: the coherence dependence priority mode and the rotary rule.
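A round-robin arbiter is one common way to get fairness among inputs. The sketch below is only an assumed reading of the rotary rule, by analogy with a traffic roundabout: packets already in the network (arriving on network input ports) win over packets injected from local ports, and ties rotate round-robin. `RotaryArbiter` and the port names are hypothetical, and the coherence dependence priority mode is not modeled.

```python
class RotaryArbiter:
    """Round-robin arbiter that, as an assumed reading of the rotary rule, prefers
    packets from network input ports over packets injected from local ports."""

    def __init__(self, inputs: list, network_inputs: set) -> None:
        self.inputs = list(inputs)
        self.network_inputs = set(network_inputs)
        self.next_idx = 0  # round-robin pointer: where the next search starts

    def grant(self, requesting: set):
        n = len(self.inputs)
        order = [self.inputs[(self.next_idx + i) % n] for i in range(n)]
        through = [p for p in order if p in requesting and p in self.network_inputs]
        local = [p for p in order if p in requesting and p not in self.network_inputs]
        pool = through or local  # local packets win only when no through traffic waits
        if not pool:
            return None
        winner = pool[0]
        self.next_idx = (self.inputs.index(winner) + 1) % n  # rotate past the winner
        return winner


arb = RotaryArbiter(["north", "south", "east", "west", "cache", "mc"],
                    {"north", "south", "east", "west"})
print(arb.grant({"cache", "east", "west"}))  # east
print(arb.grant({"cache", "east", "west"}))  # west
print(arb.grant({"cache"}))                  # cache
```

A strict preference like this can starve local injection under heavy through traffic, which is presumably why an anti-starvation mechanism appears among the readiness tests.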