(3.3.3) WildFire : Scalable SMPs

Erik Hagersten and Michael Koster, WildFire: A Scalable Path for SMPs, Proc. 5th IEEE Symposium on High-Performance Computer Architecture, January 1999, 172-181.

Sun Microsystems

Distributed shared-memory (DSM) prototype

Origin is a DSM optimized architecture > reduces remote memory latency

cc-NUMA machines have greater potential for scalability, but they are less optimal for access patterns caused by real communication (producer-consumer and migratory data).
OS optimizations target capacity misses and conflict misses.
Scheduling is difficult.

SMP : does not care where code/data is placed; a suspended process can be rescheduled on any other processor.

problem : it is not possible to build an SMP with a huge number of CPUs spanning several physical boxes.

DSM protocol : keeps only a handful of nodes coherent (complexity/latency is reduced by the small number of nodes). A large node has several memory banks > higher bandwidth (interleaving). Better node locality. Node is aware of load balancer.

WildFire : ccNUMA built from unusually large nodes.

When possible, local memory is used.
Only when a process has more threads than processors are multiple nodes used.
> CMR : Coherent Memory Replication (a version of S-COMA, Simple Cache-Only Memory Architecture).
   Allocates local "shadow" physical pages.
   The Solaris OS uses integrated hardware counters to determine which pages to switch from ccNUMA to CMR,
   responding to memory access patterns (adaptive algorithm).
   -- Memory-resident pages and "large" physical pages cannot be replicated. > can be explicitly replicated.
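The adaptive choice above can be sketched as follows. This is only an illustration, assuming one remote-miss counter per page and a fixed cutoff; the names `select_pages_for_cmr`, `miss_counts`, `threshold` and `non_replicable` are hypothetical and do not come from the paper or from Solaris.

```python
def select_pages_for_cmr(miss_counts, threshold, non_replicable=frozenset()):
    """Return the pages whose remote-miss count reaches the threshold.

    miss_counts    : dict of page number -> remote misses observed
                     (a stand-in for the hardware counters)
    threshold      : hypothetical cutoff, not a value from the paper
    non_replicable : pages that must stay in ccNUMA mode
                     (e.g. memory-resident or "large" pages)
    """
    return sorted(
        page
        for page, misses in miss_counts.items()
        if misses >= threshold and page not in non_replicable
    )


# Made-up counter values, for illustration only.
counts = {0x10: 120, 0x11: 3, 0x12: 64, 0x13: 500}
chosen = select_pages_for_cmr(counts, threshold=64, non_replicable={0x13})
print(chosen)
```

Page 0x13 has the most misses but is skipped because it is marked non-replicable; a real policy would probably also weigh the cost of allocating shadow pages.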

interface implementation
   NIAC : Network Interface Address Controller; NIDC : Network Interface Data Controller.
   4 NIDC chips are controlled by 1 NIAC chip (high bandwidth is required on the coherent interface).
   NIAC : bus interface and global coherence layer.

   WFI > 2 translations : local physical CMR address to global physical address and back.
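A rough sketch of the two translations, assuming page-granularity mapping held in two lookup tables. The class `ShadowPageMap`, its method names and the 8 KB page size are assumptions made for illustration; the actual WFI hardware structures are not described in these notes.

```python
PAGE_SIZE = 8192  # assumed page size, for illustration only


class ShadowPageMap:
    """Two-way map between local shadow pages and global pages."""

    def __init__(self):
        self.local_to_global = {}
        self.global_to_local = {}

    def map_page(self, local_page, global_page):
        # Record the pairing in both directions.
        self.local_to_global[local_page] = global_page
        self.global_to_local[global_page] = local_page

    def to_global(self, local_addr):
        # Local shadow address -> global physical address.
        page, offset = divmod(local_addr, PAGE_SIZE)
        return self.local_to_global[page] * PAGE_SIZE + offset

    def to_local(self, global_addr):
        # Global physical address -> local shadow address.
        page, offset = divmod(global_addr, PAGE_SIZE)
        return self.global_to_local[page] * PAGE_SIZE + offset


m = ShadowPageMap()
m.map_page(local_page=5, global_page=900)
g = m.to_global(5 * PAGE_SIZE + 40)
```

The page offset is unchanged in both directions; only the page number is swapped.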

deterministic directory : state of cache and state of directory are always in agreement. 
   > blocking directory (only one outstanding transaction per cache line)
   > three-phase writebacks
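The blocking behaviour can be sketched as below, assuming a software model in which later requests for a busy line simply wait in a FIFO queue. `BlockingDirectory` and its methods are hypothetical names, and the three-phase writeback is not modelled.

```python
from collections import defaultdict, deque


class BlockingDirectory:
    """Allows at most one outstanding transaction per cache line."""

    def __init__(self):
        self.busy = set()                  # lines with a transaction in flight
        self.waiting = defaultdict(deque)  # line -> queued requesters

    def request(self, line, requester):
        """Start a transaction if the line is idle, else queue it.

        Returns True if the transaction started immediately.
        """
        if line in self.busy:
            self.waiting[line].append(requester)
            return False
        self.busy.add(line)
        return True

    def complete(self, line):
        """Finish the outstanding transaction on a line.

        Returns the next queued requester (the line stays busy for it),
        or None if nobody was waiting and the line becomes idle.
        """
        queue = self.waiting[line]
        if queue:
            return queue.popleft()
        self.busy.discard(line)
        return None


d = BlockingDirectory()
first = d.request(0x40, "A")   # starts immediately
second = d.request(0x40, "B")  # same line is busy, so B waits
other = d.request(0x80, "C")   # a different line is unaffected
```

Serializing transactions per line is what lets the cache state and the directory state stay in agreement, at the cost of making later requesters wait.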

Hierarchical affinity scheduling > the OS tries to schedule a process first on the processor it last ran on, then on some processor on the same node.
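The preference order can be sketched as below. The function `pick_processor` and its arguments are hypothetical, and the final fallback to any idle processor on another node is an assumption; the notes only state the first two steps.

```python
def pick_processor(last_cpu, idle_cpus, node_of):
    """Pick a CPU: the last one used, then the same node, then any idle CPU.

    last_cpu  : CPU the process last ran on
    idle_cpus : iterable of currently idle CPU ids
    node_of   : dict of CPU id -> node id
    Returns None if no CPU is idle.
    """
    idle = sorted(idle_cpus)
    if last_cpu in idle:
        return last_cpu              # step 1: same processor
    home = node_of[last_cpu]
    for cpu in idle:
        if node_of[cpu] == home:
            return cpu               # step 2: same node
    # Assumed fallback: any idle CPU on another node.
    return idle[0] if idle else None


# Two nodes with two CPUs each (made-up topology).
node_of = {0: 0, 1: 0, 2: 1, 3: 1}
```

With this topology, a process that last ran on CPU 1 goes to CPU 1 if it is idle, otherwise to CPU 0, and only leaves node 0 when both are busy.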

>>>>>>>>
SGI Origin : 2 R10000 connected by Hub, Mem + Directory (the 2 processors share the local cache, fast access to remote cache) page migration. 
Wildfire : large number of SMP nodes connected to distributed directory network
NUMA-Q node : I don't know.

>>>>>>>>

created locality : doing something other than just responding to create 

Coherent Memory Replication and Hierarchical Affinity Scheduling control are implemented as daemon processes (OS).