(3.3.2) SGI Origin

James Laudon, Daniel Lenoski. The SGI Origin: A ccNUMA Highly Scalable Server. ISCA 1997: 241-251. ACM DL Link

ccNUMA (cache-coherent non-uniform memory access), highly scalable server

Cache-coherent, globally addressable memory 
Distributed shared memory <> kept coherent by a directory-based protocol

minimize the latency difference between remote and local memory > hardware/software support to ensure most references are local
     also > easy migration of existing SMP software (by keeping the remote-to-local latency ratio low)

Effective page migration and replication
     per-page hardware memory reference counters
     block copy engine (runs at near peak memory speed)
     mechanisms to reduce the cost of TLB updates

interconnect > multi-level fat-hypercube topology.

sync support > at-memory fetch-and-op primitives (MIPS already provides load-linked/store-conditional); semantics sketched below
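
A minimal sketch of what fetch-and-op buys, using the GCC/Clang __atomic builtins as a stand-in for Origin's at-memory fetch-and-op (on the real hardware the add is performed at the memory by the Hub, so a contended counter's cache line never ping-pongs between processors). The ticket lock is only an illustration, not code from the paper.

```c
#include <stdint.h>

/* Ticket lock built on fetch-and-add. The __atomic builtins model the
 * atomic read-modify-write semantics; Origin's uncached fetch-and-op
 * would perform the add at the memory itself. */
typedef struct {
    uint32_t next_ticket;   /* bumped once by each arriving locker */
    uint32_t now_serving;   /* bumped once by each unlocker        */
} ticket_lock_t;

static void ticket_lock(ticket_lock_t *l)
{
    /* Atomically take a ticket; fetch-and-add returns the old value. */
    uint32_t my = __atomic_fetch_add(&l->next_ticket, 1, __ATOMIC_RELAXED);
    /* Wait for our turn: plain loads only, no further RMW traffic. */
    while (__atomic_load_n(&l->now_serving, __ATOMIC_ACQUIRE) != my)
        ;
}

static void ticket_unlock(ticket_lock_t *l)
{
    __atomic_fetch_add(&l->now_serving, 1, __ATOMIC_RELEASE);
}
```

The point of doing the fetch-and-op at memory is the contended-counter case above: each acquire needs a single atomic add, and performing it at memory keeps heavily shared counters (locks, barriers) from bouncing between caches as the machine scales.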

Network 
     6 ports per router, each with a pair of unidirectional links
     wormhole routing
     global arbitration to maximize utilization under load
     4 virtual channels per physical channel
     messages can adaptively switch between 2 of the virtual channels (congestion control)
     per-link CRC checking with retransmission on error (toy sketch below)
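
A toy model of the per-link CRC check and retransmission, just to make the mechanism concrete; the Origin routers do this entirely in hardware, so the CRC-8 polynomial, the lossy-link simulation, and the retry limit below are all illustrative assumptions, not the actual link protocol.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* CRC-8 (polynomial 0x07), bitwise, MSB-first. */
static uint8_t crc8(const uint8_t *data, size_t len)
{
    uint8_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int b = 0; b < 8; b++)
            crc = (crc & 0x80) ? (uint8_t)((crc << 1) ^ 0x07)
                               : (uint8_t)(crc << 1);
    }
    return crc;
}

/* Simulated link: delivers the frame but occasionally flips one bit. */
static void link_transmit(const uint8_t *src, uint8_t *dst, size_t len)
{
    memcpy(dst, src, len);
    if (rand() % 4 == 0)                       /* injected bit error */
        dst[(size_t)rand() % len] ^= (uint8_t)(1u << (rand() % 8));
}

/* Sender appends a CRC trailer; the receiver recomputes it, and the
 * sender retransmits the same frame until the CRC checks out. */
static bool send_packet(const uint8_t *payload, size_t len)
{
    uint8_t frame[64], rx[64];
    memcpy(frame, payload, len);
    frame[len] = crc8(payload, len);

    for (int attempt = 1; attempt <= 8; attempt++) {
        link_transmit(frame, rx, len + 1);
        if (crc8(rx, len) == rx[len]) {
            printf("delivered intact on attempt %d\n", attempt);
            return true;                       /* receiver would ACK */
        }
        /* CRC mismatch: receiver would NAK, so retransmit. */
    }
    return false;                              /* escalate link error */
}

int main(void)
{
    const uint8_t msg[] = "wormhole flit payload";
    return send_packet(msg, sizeof msg) ? 0 : 1;
}
```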

Cache Coherence
     non-blocking
     request forwarding (like DASH); sketch after this list
     silent replacement of clean-exclusive (CEX) lines : no message sent to the directory
     network ordering does not matter!
          protocol detects and resolves all out-of-order deliveries
     adaptive routing can therefore be used to deal with congestion
     deadlock avoidance
          negative acknowledgements (NAKs) to requests (this is what DASH does)
          SGI : backoff message : carries the target of the intervention or the list of sharers to invalidate back to the requestor, which then issues them itself
          > requestor-issued invalidations + intervention backoff > better forward progress on heavily loaded systems
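
A sketch to make the directory idea and request forwarding concrete: a per-block entry at the home plus the home's handling of a read request. This is a simplification under stated assumptions, not the Origin's actual directory format or Hub interface; the real protocol uses busy/transient states and the backoff path described above, which are omitted here.

```c
#include <stdint.h>
#include <stdio.h>

/* Simplified directory entry, one per memory block at its home node. */
enum dir_state { DIR_UNOWNED, DIR_SHARED, DIR_EXCLUSIVE, DIR_POISON };

typedef struct {
    enum dir_state state;
    uint64_t       sharers;  /* bit vector of nodes holding a copy     */
    int            owner;    /* valid only when state == DIR_EXCLUSIVE */
} dir_entry_t;

/* Stub message sends standing in for the Hub/network interface. */
static void send_data_reply(int requestor, uint64_t addr)
{
    printf("home -> node %d: shared data reply for 0x%llx\n",
           requestor, (unsigned long long)addr);
}

static void forward_intervention(int owner, int requestor, uint64_t addr)
{
    printf("home -> node %d: intervention for 0x%llx, reply directly to node %d\n",
           owner, (unsigned long long)addr, requestor);
}

/* Home-node handling of a read (shared) request. Request forwarding:
 * if the block is dirty in another node's cache, the home does not
 * fetch it itself; it forwards an intervention and the owner replies
 * directly to the requestor (3 hops, as in DASH). */
static void handle_read(dir_entry_t *e, int requestor, uint64_t addr)
{
    switch (e->state) {
    case DIR_UNOWNED:
    case DIR_SHARED:
        e->state    = DIR_SHARED;
        e->sharers |= 1ull << requestor;
        send_data_reply(requestor, addr);        /* 2-hop reply */
        break;
    case DIR_EXCLUSIVE:
        forward_intervention(e->owner, requestor, addr);
        e->sharers = (1ull << e->owner) | (1ull << requestor);
        e->state   = DIR_SHARED;  /* optimistic; the real protocol holds
                                     a busy state until the owner's
                                     revision message arrives */
        break;
    case DIR_POISON:
        /* Page under migration: the requestor takes a bus error and the
         * OS fixes its TLB (see directory poisoning below). */
        break;
    }
}

int main(void)
{
    dir_entry_t e = { DIR_EXCLUSIVE, 1ull << 3, 3 };
    handle_read(&e, /*requestor=*/5, 0x1000);    /* read of a dirty remote block */
    return 0;
}
```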

Coherent request buffers > read and write requests tracked for both processors of the node

Page migration > on each request the requestor's count and the home node's count are compared; if the difference exceeds a software-programmable threshold register, an interrupt is generated to the home node and the potential migration is handled by the OS (sketch below).
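
A sketch of the counter comparison, assuming one counter per node per page and a single global threshold register; the names, counter widths, and placement are illustrative (the real counters live with the page's directory state, and the comparison and interrupt happen in hardware).

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_NODES 64

/* Per-page reference counters kept at the page's home (illustrative). */
typedef struct {
    uint32_t ref_count[MAX_NODES];   /* one counter per node */
    int      home_node;
} page_counters_t;

/* Software-programmable migration threshold register (illustrative). */
static uint32_t migration_threshold = 64;

/* Conceptually invoked for each reference reaching the page's home:
 * bump the requestor's counter, compare it with the home's counter,
 * and report when the OS should be interrupted to consider migrating
 * the page toward the requestor. */
static bool count_reference(page_counters_t *p, int requestor)
{
    p->ref_count[requestor]++;
    uint32_t remote = p->ref_count[requestor];
    uint32_t local  = p->ref_count[p->home_node];
    return requestor != p->home_node &&
           remote > local &&
           (remote - local) > migration_threshold;
}

int main(void)
{
    page_counters_t page = { .home_node = 0 };
    for (int i = 0; i < 200; i++) {              /* node 7 hammers the page */
        if (count_reference(&page, 7)) {
            printf("interrupt home: consider migrating page toward node 7\n");
            break;
        }
    }
    return 0;
}
```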

Directory poisoning 
     avoids the TLB update problem (a TLB shootdown is when one processor's action forces the TLBs on other processors to be flushed)
     during the block-copy read phase > the latest copy of each block is written back to memory and its directory entry is put in the Poison state
     a subsequent access by another processor > synchronous bus error
     error handler > recognizes the error as due to page migration and invalidates the stale TLB entry for that page (handler sketch below)
     >> low cost to migrate: TLB entries are updated lazily, only by processors that actually touch the page
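
A sketch of the OS side of poisoning, with toy stand-ins for the kernel helpers (none of these names are real IRIX interfaces): the bus-error handler tells "poisoned because the page is migrating" apart from a genuine hardware error and invalidates only its own TLB entry, which is what makes the lazy, shootdown-free update cheap.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Toy stand-ins for kernel state and helpers (illustrative only). */
static bool page_mid_migration = true;   /* pretend this page is being moved */

static bool page_is_migrating(uint64_t vaddr) { (void)vaddr; return page_mid_migration; }

static void tlb_invalidate_entry(uint64_t vaddr)
{
    printf("invalidate local TLB entry for 0x%llx\n", (unsigned long long)vaddr);
}

static void deliver_fatal_bus_error(uint64_t vaddr)
{
    printf("fatal bus error at 0x%llx\n", (unsigned long long)vaddr);
}

/* Synchronous bus-error handler. A reference to a poisoned (migrating)
 * page only needs this CPU's stale TLB entry dropped; no interrupting
 * of every other processor (i.e. no global TLB shootdown). */
void bus_error_handler(uint64_t fault_vaddr)
{
    if (page_is_migrating(fault_vaddr)) {
        tlb_invalidate_entry(fault_vaddr);
        /* On return the faulting access retries and the TLB refill
         * picks up the page's new physical location. */
        return;
    }
    deliver_fatal_bus_error(fault_vaddr);    /* genuine hardware error */
}

int main(void)
{
    bus_error_handler(0x10008000ull);        /* demo invocation */
    return 0;
}
```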

OTHER architectures
Stanford DASH
     4 processor SMP cluster is one node.
     Origin : 2-processor node, with coherence handled by the directory-based protocol (no intra-node snooping)
     > + DASH gets cache-to-cache sharing within the node
     > -- for intra-node sharing to be significant > more processors per node required > bus becomes very large
     > -- higher remote latency
     > -- remote bandwidth = half of local bandwidth
     SGI Origin : ~2:1 remote-to-local access latency ratio
Convex Exemplar X
     512 proc in 8x4 torus config
     third level cluster cache
     
DSM Systems comparison
     overhead of the SMP nodes sets a floor on how cost-effective such machines can be in small configurations
     is the workload throughput-oriented?
     > SMP nodes have lower communication costs between processors within the same node +++
     > but when parallelism beyond the size of one SMP node is important > the latency + bandwidth overheads will be very large

Conclusions
     highly modular system
     fat-hypercube network: high-bisection-bandwidth, low-latency interconnect
     low latency to both remote memory and local memory
     hw/sw page migration and fast synchronization