James Laudon, Daniel Lenoski. The SGI Origin: A ccNUMA Highly Scalable Server. ISCA 1997: 241-251.
ccNUMA Highly scalable server
Cache-coherent globally addressable memory
Distributed shared memory with a directory-based coherence protocol
minimize the latency difference between remote and local memory > hardware/software support to ensure most references are local
also > easy migration for existing SMP software (by keeping the remote-to-local latency ratio small)
Effective page migration and replication
per-page hardware memory ref counters
block copy engine (at near peak mem speed)
reduce TLB update cost
interconnect > multi-level fat-hypercube topology.
synchronization support > at-memory fetch-and-op primitives (MIPS already provides load-linked/store-conditional); see the sketch below
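A minimal sketch of the contrast in C. The LL/SC side is written with portable C11 atomics (on MIPS this lowers to a load-linked/store-conditional retry loop); the fetch-and-op side assumes a hypothetical uncached "increment" alias serviced at the memory side. FETCHOP_INC_OFFSET and the alias layout are illustrative assumptions, not the real hardware encoding.

```c
#include <stdatomic.h>
#include <stdint.h>

/* (a) What the MIPS ISA gives you: on MIPS this lowers to a load-linked /
 *     store-conditional retry loop, so a hot counter's cache line ping-pongs
 *     between the contending processors. */
static inline uint32_t llsc_fetch_inc(_Atomic uint32_t *counter)
{
    return atomic_fetch_add_explicit(counter, 1u, memory_order_relaxed);
}

/* (b) Origin-style at-memory fetch-and-op (hypothetical mapping): a single
 *     uncached load from a per-variable "fetch-and-increment" alias returns
 *     the old value while the memory-side logic does the increment, so no
 *     cache line is transferred at all. */
#define FETCHOP_INC_OFFSET 0x08u   /* assumed offset of the increment alias */

static inline uint32_t fetchop_fetch_inc(volatile uint32_t *fetchop_var)
{
    volatile uint32_t *inc_alias =
        (volatile uint32_t *)((uintptr_t)fetchop_var + FETCHOP_INC_OFFSET);
    return *inc_alias;   /* one uncached read performs the whole operation */
}
```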
Network
6 ports per router, each a pair of unidirectional links
wormhole routing
global arbitration to maximize utilization under load
4 virtual channels per physical link
messages can adaptively switch between 2 of the virtual channels to route around congestion (see the sketch after this list)
CRC checking and retransmission on error
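To make the adaptive virtual-channel switching concrete, here is a toy sketch of per-hop channel selection based on buffer occupancy. The structure, counters, and policy are assumptions for illustration, not the SPIDER router's actual arbitration logic.

```c
#include <stdint.h>

#define NUM_VCS 4   /* four virtual channels multiplex each physical link */

typedef struct {
    uint8_t free_flits[NUM_VCS];   /* remaining buffer space per virtual channel */
} port_state_t;

/* A message class is allowed to use either of two virtual channels; picking
 * whichever currently has more free buffering lets traffic flow around a
 * congested channel instead of stalling behind it. */
static int pick_virtual_channel(const port_state_t *p, int vc_a, int vc_b)
{
    return (p->free_flits[vc_a] >= p->free_flits[vc_b]) ? vc_a : vc_b;
}
```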
Cache Coherence
non-blocking directory-based protocol
request forwarding (like DASH)
silent replacement of clean-exclusive (CEX) lines: the directory is not notified when a clean-exclusive line is dropped
the protocol does not rely on network ordering!
it detects and resolves all out-of-order deliveries
> this allows adaptive routing to deal with congestion
deadlock avoidance
DASH: sends negative acknowledgements (NAKs) to requests it cannot service
SGI Origin: sends a backoff message instead, carrying either the target of the intervention or the list of sharers to invalidate, so the requester performs the work itself
> backoff interventions + invalidations > better forward progress on heavily loaded systems (see the sketch after this list)
Coherent request buffers > outstanding read and write requests are tracked for both processors on the node
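A minimal sketch of the NAK-versus-backoff contrast, assuming a simplified directory entry (owner field plus sharer bit vector) and made-up message names. It only covers the case where the home does not service the request itself, and is not the real Origin protocol table.

```c
#include <stdint.h>

typedef enum { DIR_UNOWNED, DIR_SHARED, DIR_EXCLUSIVE } dir_state_t;

typedef struct {
    dir_state_t state;
    uint16_t    owner;     /* current exclusive owner, valid in DIR_EXCLUSIVE  */
    uint64_t    sharers;   /* bit vector of sharing nodes, valid in DIR_SHARED */
} dir_entry_t;

typedef enum {
    MSG_NAK,                   /* DASH: requester must retry from scratch */
    MSG_BACKOFF_INTERVENTION,  /* Origin: requester intervenes at 'owner' */
    MSG_BACKOFF_INVALIDATE     /* Origin: requester invalidates 'sharers' */
} msg_t;

typedef struct {
    msg_t    type;
    uint16_t owner;
    uint64_t sharers;
} reply_t;

/* DASH-style: if the home cannot service the request right now, it sends a
 * bare NAK; under heavy load the retries can stall forward progress. */
static reply_t home_refuse_dash(const dir_entry_t *e)
{
    (void)e;
    return (reply_t){ .type = MSG_NAK };
}

/* Origin-style: instead of a NAK, the home hands back the information the
 * requester needs (the owner to intervene at, or the sharers to invalidate),
 * so the requester does the work itself and the transaction still completes. */
static reply_t home_refuse_origin(const dir_entry_t *e)
{
    if (e->state == DIR_EXCLUSIVE)
        return (reply_t){ .type = MSG_BACKOFF_INTERVENTION, .owner = e->owner };

    return (reply_t){ .type = MSG_BACKOFF_INVALIDATE, .sharers = e->sharers };
}
```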
Page migration > the requester's count and the home count are compared; if the difference exceeds a software-programmable threshold register, an interrupt is generated to the home node > the potential migration is then handled by the OS (see the sketch below).
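A sketch of that comparison in C. The counter array, threshold register name, and interrupt hook are assumed names for illustration; in the Origin the comparison happens in hardware on each reference, and only the interrupt and the actual copy are handled by software.

```c
#include <stdint.h>

#define MAX_NODES 64

typedef struct {
    uint32_t ref_count[MAX_NODES];   /* per-page counters, one per requesting node */
} page_counters_t;

/* Software-programmable migration threshold (assumed register name). */
static uint32_t migration_threshold = 64;

/* Hypothetical hook: interrupt the page's home node so the OS can decide
 * whether to migrate (or replicate) the page closer to 'hot_node'. */
void raise_migration_intr(uint32_t page, uint16_t home_node, uint16_t hot_node);

/* Conceptually executed on every memory reference to 'page' from 'req_node'. */
void account_reference(page_counters_t *pc, uint32_t page,
                       uint16_t req_node, uint16_t home_node)
{
    uint32_t req_cnt  = ++pc->ref_count[req_node];
    uint32_t home_cnt = pc->ref_count[home_node];

    /* The remote node is using the page much more heavily than the home node:
     * flag it as a migration candidate. */
    if (req_node != home_node &&
        req_cnt > home_cnt &&
        req_cnt - home_cnt > migration_threshold)
        raise_migration_intr(page, home_node, req_node);
}
```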
Directory poisoning
Avoids the TLB update problem (shootdown: one processor's actions forcing the TLBs to be flushed on other processors is what is called a TLB shootdown)
during the read phase of the migration copy > the latest copy of each block is pulled back to memory, and the directory entry is left in the Poison state.
if another processor then accesses a poisoned block > it takes a synchronous bus error.
the error handler sees that the error is due to page migration and invalidates the stale TLB entry for that page.
>> low cost to migrate: TLB entries are invalidated lazily, only on the processors that actually touch the page (see the sketch below)
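A sketch of the OS side of this path. The fault structure and helper routines are hypothetical glue assumed to exist elsewhere in the kernel; the Poison directory state and the synchronous bus error come from the hardware.

```c
#include <stdint.h>
#include <stdbool.h>

struct fault_info {
    uintptr_t vaddr;      /* faulting virtual address                    */
    bool      poisoned;   /* hardware flagged a read of a poisoned block */
};

/* Assumed helpers provided elsewhere in the OS (declarations only). */
bool      page_was_migrated(uintptr_t vaddr);
uintptr_t new_physical_location(uintptr_t vaddr);
void      update_page_table(uintptr_t vaddr, uintptr_t new_pa);
void      tlb_invalidate_entry(uintptr_t vaddr);   /* local TLB only */
void      fatal_bus_error(const struct fault_info *f);

/* Synchronous bus error handler: instead of a global TLB shootdown when the
 * page is migrated, each processor repairs its own stale TLB entry the first
 * time it actually touches the migrated page. */
void bus_error_handler(struct fault_info *f)
{
    if (f->poisoned && page_was_migrated(f->vaddr)) {
        update_page_table(f->vaddr, new_physical_location(f->vaddr));
        tlb_invalidate_entry(f->vaddr);   /* next refill sees the new mapping */
        return;                           /* restart the faulting access      */
    }
    fatal_bus_error(f);                   /* genuine bus error, not migration */
}
```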
OTHER architectures
Stanford DASH
a 4-processor SMP cluster forms one node.
Origin, by contrast: 2-processor node, with coherence between the two processors handled by the directory-based protocol (no snoopy bus within the node).
> + of the SMP cluster: cache-to-cache sharing within the node.
> -- for intranode sharing to be significant, more processors per node are required > the bus becomes very large.
> -- higher remote latency.
> -- remote bandwidth = half the local bandwidth.
SGI Origin: 2:1 remote-to-local access latency ratio.
Convex Exemplar X
512 proc in 8x4 torus config
third level cluster cache
DSM Systems comparison
the overhead of the SMP node sets a minimum on cost, so such machines are less effective in small configurations
is the workload throughput-oriented or parallel?
> +++ the SMP node has lower communication costs between processors within the same node
> but if parallelism beyond the size of the SMP node is important > the latency + bandwidth overheads become very large.
Conclusions
highly modular system
fat-hypercube network: high bisection bandwidth, low-latency interconnect.
low latency to both remote memory and local memory
hw/sw page migration and fast synchronization