James Laudon, Daniel Lenoski. The SGI Origin: A ccNUMA Highly Scalable Server. ISCA 1997: 241-251.
ccNUMA Highly scalable server
Cache-coherent globally addressable memory
Distributed shared memory with a directory-based coherence protocol
minimize the latency difference between remote and local memory > hardware/software support to ensure most references are local
also > easy migration for existing SMP software (by keeping the remote-to-local latency ratio small)
Effective page migration and replication
per-page hardware memory ref counters
block copy engine (at near peak mem speed)
reduce TLB update cost
interconnect > multi-level fat-hypercube topology.
synchronization support > at-memory fetch-and-op primitives (MIPS already provides load-linked/store-conditional); see the sketch below
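A minimal sketch of the contrast in C. The LL/SC side is written with portable C11 atomics (on MIPS this lowers to a load-linked/store-conditional retry loop); the fetch-and-op side assumes a hypothetical uncached "increment" alias serviced at the memory side. FETCHOP_INC_OFFSET and the alias layout are illustrative assumptions, not the real hardware encoding.

```c
#include <stdatomic.h>
#include <stdint.h>

/* (a) What the MIPS ISA gives you: on MIPS this lowers to a load-linked /
 *     store-conditional retry loop, so a hot counter's cache line ping-pongs
 *     between the contending processors. */
static inline uint32_t llsc_fetch_inc(_Atomic uint32_t *counter)
{
    return atomic_fetch_add_explicit(counter, 1u, memory_order_relaxed);
}

/* (b) Origin-style at-memory fetch-and-op (hypothetical mapping): a single
 *     uncached load from a per-variable "fetch-and-increment" alias returns
 *     the old value while the memory-side logic does the increment, so no
 *     cache line is transferred at all. */
#define FETCHOP_INC_OFFSET 0x08u   /* assumed offset of the increment alias */

static inline uint32_t fetchop_fetch_inc(volatile uint32_t *fetchop_var)
{
    volatile uint32_t *inc_alias =
        (volatile uint32_t *)((uintptr_t)fetchop_var + FETCHOP_INC_OFFSET);
    return *inc_alias;   /* one uncached read performs the whole operation */
}
```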
Network
6 ports per router, each a pair of unidirectional links
wormhole routing
global arbitration to maximize utilization under load
4 virtual channels per physical link
messages can adaptively switch between 2 of the virtual channels to route around congestion (see the sketch after this list)
CRC checking and retransmission on error
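To make the adaptive virtual-channel switching concrete, here is a toy sketch of per-hop channel selection based on buffer occupancy. The structure, counters, and policy are assumptions for illustration, not the SPIDER router's actual arbitration logic.

```c
#include <stdint.h>

#define NUM_VCS 4   /* four virtual channels multiplex each physical link */

typedef struct {
    uint8_t free_flits[NUM_VCS];   /* remaining buffer space per virtual channel */
} port_state_t;

/* A message class is allowed to use either of two virtual channels; picking
 * whichever currently has more free buffering lets traffic flow around a
 * congested channel instead of stalling behind it. */
static int pick_virtual_channel(const port_state_t *p, int vc_a, int vc_b)
{
    return (p->free_flits[vc_a] >= p->free_flits[vc_b]) ? vc_a : vc_b;
}
```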
Cache Coherence
non-blocking directory-based protocol
request forwarding (like DASH)
silent replacement of clean-exclusive (CEX) lines: the directory is not notified when a clean-exclusive line is dropped
the protocol does not rely on network ordering!
it detects and resolves all out-of-order deliveries
> this allows adaptive routing to deal with congestion
deadlock avoidance
DASH: sends negative acknowledgements (NAKs) to requests it cannot service
SGI Origin: sends a backoff message instead, carrying either the target of the intervention or the list of sharers to invalidate, so the requester performs the work itself
> backoff interventions + invalidations > better forward progress on heavily loaded systems (see the sketch after this list)
Coherent request buffers > outstanding read and write requests are tracked for both processors on the node
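A minimal sketch of the NAK-versus-backoff contrast, assuming a simplified directory entry (owner field plus sharer bit vector) and made-up message names. It only covers the case where the home does not service the request itself, and is not the real Origin protocol table.

```c
#include <stdint.h>

typedef enum { DIR_UNOWNED, DIR_SHARED, DIR_EXCLUSIVE } dir_state_t;

typedef struct {
    dir_state_t state;
    uint16_t    owner;     /* current exclusive owner, valid in DIR_EXCLUSIVE  */
    uint64_t    sharers;   /* bit vector of sharing nodes, valid in DIR_SHARED */
} dir_entry_t;

typedef enum {
    MSG_NAK,                   /* DASH: requester must retry from scratch */
    MSG_BACKOFF_INTERVENTION,  /* Origin: requester intervenes at 'owner' */
    MSG_BACKOFF_INVALIDATE     /* Origin: requester invalidates 'sharers' */
} msg_t;

typedef struct {
    msg_t    type;
    uint16_t owner;
    uint64_t sharers;
} reply_t;

/* DASH-style: if the home cannot service the request right now, it sends a
 * bare NAK; under heavy load the retries can stall forward progress. */
static reply_t home_refuse_dash(const dir_entry_t *e)
{
    (void)e;
    return (reply_t){ .type = MSG_NAK };
}

/* Origin-style: instead of a NAK, the home hands back the information the
 * requester needs (the owner to intervene at, or the sharers to invalidate),
 * so the requester does the work itself and the transaction still completes. */
static reply_t home_refuse_origin(const dir_entry_t *e)
{
    if (e->state == DIR_EXCLUSIVE)
        return (reply_t){ .type = MSG_BACKOFF_INTERVENTION, .owner = e->owner };

    return (reply_t){ .type = MSG_BACKOFF_INVALIDATE, .sharers = e->sharers };
}
```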
Page migration > the requester's count and the home count are compared; if the difference exceeds a software-programmable threshold register, an interrupt is generated to the home node > the potential migration is then handled by the OS (see the sketch below).
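A sketch of that comparison in C. The counter array, threshold register name, and interrupt hook are assumed names for illustration; in the Origin the comparison happens in hardware on each reference, and only the interrupt and the actual copy are handled by software.

```c
#include <stdint.h>

#define MAX_NODES 64

typedef struct {
    uint32_t ref_count[MAX_NODES];   /* per-page counters, one per requesting node */
} page_counters_t;

/* Software-programmable migration threshold (assumed register name). */
static uint32_t migration_threshold = 64;

/* Hypothetical hook: interrupt the page's home node so the OS can decide
 * whether to migrate (or replicate) the page closer to 'hot_node'. */
void raise_migration_intr(uint32_t page, uint16_t home_node, uint16_t hot_node);

/* Conceptually executed on every memory reference to 'page' from 'req_node'. */
void account_reference(page_counters_t *pc, uint32_t page,
                       uint16_t req_node, uint16_t home_node)
{
    uint32_t req_cnt  = ++pc->ref_count[req_node];
    uint32_t home_cnt = pc->ref_count[home_node];

    /* The remote node is using the page much more heavily than the home node:
     * flag it as a migration candidate. */
    if (req_node != home_node &&
        req_cnt > home_cnt &&
        req_cnt - home_cnt > migration_threshold)
        raise_migration_intr(page, home_node, req_node);
}
```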
Directory poisoning
Avoids the TLB update problem (shootdown: one processor's actions forcing the TLBs to be flushed on other processors is what is called a TLB shootdown)
during the read phase of the migration copy > the latest copy of each block is pulled back to memory, and the directory entry is left in the Poison state.
if another processor then accesses a poisoned block > it takes a synchronous bus error.
the error handler sees that the error is due to page migration and invalidates the stale TLB entry for that page.
>> low cost to migrate: TLB entries are invalidated lazily, only on the processors that actually touch the page (see the sketch below)
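A sketch of the OS side of this path. The fault structure and helper routines are hypothetical glue assumed to exist elsewhere in the kernel; the Poison directory state and the synchronous bus error come from the hardware.

```c
#include <stdint.h>
#include <stdbool.h>

struct fault_info {
    uintptr_t vaddr;      /* faulting virtual address                    */
    bool      poisoned;   /* hardware flagged a read of a poisoned block */
};

/* Assumed helpers provided elsewhere in the OS (declarations only). */
bool      page_was_migrated(uintptr_t vaddr);
uintptr_t new_physical_location(uintptr_t vaddr);
void      update_page_table(uintptr_t vaddr, uintptr_t new_pa);
void      tlb_invalidate_entry(uintptr_t vaddr);   /* local TLB only */
void      fatal_bus_error(const struct fault_info *f);

/* Synchronous bus error handler: instead of a global TLB shootdown when the
 * page is migrated, each processor repairs its own stale TLB entry the first
 * time it actually touches the migrated page. */
void bus_error_handler(struct fault_info *f)
{
    if (f->poisoned && page_was_migrated(f->vaddr)) {
        update_page_table(f->vaddr, new_physical_location(f->vaddr));
        tlb_invalidate_entry(f->vaddr);   /* next refill sees the new mapping */
        return;                           /* restart the faulting access      */
    }
    fatal_bus_error(f);                   /* genuine bus error, not migration */
}
```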
OTHER architectures
Stanford DASH
a 4-processor SMP cluster forms one node.
Origin, by contrast: 2-processor node, with coherence between the two processors handled by the directory-based protocol (no snoopy bus within the node).
> + of the SMP cluster: cache-to-cache sharing within the node.
> -- for intranode sharing to be significant, more processors per node are required > the bus becomes very large.
> -- higher remote latency.
> -- remote bandwidth = half the local bandwidth.
SGI Origin: 2:1 remote-to-local access latency ratio.
Convex Exemplar X
512 proc in 8x4 torus config
third level cluster cache
DSM Systems comparison
the overhead of the SMP node sets a minimum on cost, so such machines are less effective in small configurations
is the workload throughput-oriented or parallel?
> +++ the SMP node has lower communication costs between processors within the same node
> but if parallelism beyond the size of the SMP node is important > the latency + bandwidth overheads become very large.
Conclusions
highly modular system
fat-hypercube network: high bisection bandwidth, low-latency interconnect.
low latency to both remote memory and local memory
hw/sw page migration and fast synchronization