(3.3.1) T3E

Steven L. Scott, Synchronization and Communication in the T3E Multiprocessor, Proc. 7th International Conference on Architectural Support for Programming Languages and Operating Systems, October 1996, pp. 26-36.

Parallelism 
     Explicit : MPI
     Implicit : Shared memory (better suited for irregular/dynamic parallelism)

Low sync overhead > scales to more processors, finer granularity of work split.

Commodity microprocessors are inefficient for multiprocessing
     memory access is cache-line based > strided/vector access performs poorly
     Address space limited (TLB)
     non-cached memory access is helpful (e.g. MPI traffic to other processors)
     latency reduction rather than latency toleration
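
The cache-line point above can be made concrete with a rough estimate of how much of each fetched line a strided sweep actually uses. This is only a back-of-the-envelope sketch: the 8-byte word and 64-byte line are assumed values, not figures taken from the paper, and `useful_fraction` is a hypothetical helper name.

```python
def useful_fraction(stride_words, word_bytes=8, line_bytes=64):
    """Fraction of fetched bytes that a long strided sweep actually uses.

    Assumed model: every touched cache line is fetched in full, and the
    sweep reads one word every `stride_words` words.
    """
    words_per_line = line_bytes // word_bytes
    # Once the stride reaches a full line, each access pulls in a whole
    # line for a single word, so the fraction bottoms out at 1/words_per_line.
    return 1 / min(stride_words, words_per_line)


for stride in (1, 2, 4, 8, 16):
    print(stride, useful_fraction(stride))
```

Under these assumptions a stride of 8 words or more wastes 7/8 of the memory bandwidth, which is why line-based access hurts vector-style codes.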

T3D <<<
     Shared address space
     bidirectional 3D torus interconnect
     barrier network : 4 wire wide, degree four spanning tree
     remote memory access
          Block transfer engine : bulk, async data transfer between proc memories
          prefetch queue
          DTB annex : extends addressing to memory outside the local processor
     
T3E overview
     PE : DEC 21164 + shell(control + router + local memory)
     self-hosted : Unicos/mk 
     I/O : GigaRing channel
     E-registers : external, memory-mapped registers in the shell; remote communication and synchronization go through them
     virtual eureka/barrier network

Global virtual address
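     any bits of the address can be masked out to produce the virtual PE number
     Get and put operations move data between global memory and E-registers (get: memory > E-register, put: E-register > memory)
Atomic memory operations (swap, fetch&inc, fetch&add, compare&swap, masked_swap)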
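
A minimal sketch of the masking step, assuming it works like a bit-extract: address bits under 1s in the mask are gathered into the virtual PE number, and the remaining bits are compacted into the local offset. The function name, the 64-bit width and the example masks are illustrative assumptions, not taken from the paper.

```python
def centrifuge(index, mask, width=64):
    """Split `index` into (virtual_pe, offset) according to `mask`.

    Bits of `index` where `mask` has a 1 are packed (low to high) into the
    PE number; all other bits are packed into the offset.
    """
    pe = offset = 0
    pe_pos = off_pos = 0
    for bit in range(width):
        b = (index >> bit) & 1
        if (mask >> bit) & 1:
            pe |= b << pe_pos
            pe_pos += 1
        else:
            offset |= b << off_pos
            off_pos += 1
    return pe, offset


# PE bits taken from the middle of the address (bits 2 and 3).
print(centrifuge(0b101101, 0b001100))
# PE bits taken from the top of the address (bits 4 and 5).
print(centrifuge(0b101101, 0b110000))
```

Moving the mask changes which array dimension is spread across PEs without changing how the program computes the index.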
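
A minimal sketch of what each of these operations returns and leaves in memory, modelled on a single 64-bit word. The `Word` class and function names are hypothetical, the lock only stands in for hardware atomicity, and the masked swap polarity (1 bits in the mask select bits from the new value) is an assumption.

```python
import threading

MASK64 = (1 << 64) - 1  # model 64-bit memory words


class Word:
    """One memory word; the lock stands in for hardware atomicity."""

    def __init__(self, value=0):
        self.value = value & MASK64
        self.lock = threading.Lock()


def swap(w, new):
    with w.lock:
        old, w.value = w.value, new & MASK64
        return old


def fetch_and_add(w, addend):
    with w.lock:
        old = w.value
        w.value = (old + addend) & MASK64
        return old


def fetch_and_inc(w):
    return fetch_and_add(w, 1)


def compare_and_swap(w, expected, new):
    with w.lock:
        old = w.value
        if old == expected:
            w.value = new & MASK64
        return old  # caller compares old with expected to see if it won


def masked_swap(w, new, mask):
    with w.lock:
        old = w.value
        # Assumed polarity: replace only the bits selected by mask.
        w.value = ((old & ~mask) | (new & mask)) & MASK64
        return old
```

Every operation returns the old value, so a caller can build locks, counters and work queues on top of them.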
Messaging
     SEND command
     message queue control word with tail, limit and threshold fields; interrupt raised when the tail reaches the threshold
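
A minimal sketch of how tail, limit and threshold could interact on a SEND. The class, the exact comparisons (reject when tail has reached limit, interrupt when tail equals threshold) and the interrupt counter are assumptions made for illustration, not the hardware definition.

```python
from dataclasses import dataclass, field


@dataclass
class MessageQueue:
    limit: int        # maximum number of queued messages
    threshold: int    # tail value at which the receiver is interrupted
    tail: int = 0     # next free slot
    slots: list = field(default_factory=list)
    interrupts: int = 0  # stand-in for raising an interrupt

    def send(self, msg):
        """Model of SEND: returns False if the message is rejected."""
        if self.tail >= self.limit:
            return False          # queue full, sender must retry
        self.slots.append(msg)    # deposit message at the tail
        self.tail += 1
        if self.tail == self.threshold:
            self.interrupts += 1  # wake the receiver before the queue fills
        return True


q = MessageQueue(limit=3, threshold=2)
for m in ("a", "b", "c", "d"):
    print(m, q.send(m), q.tail, q.interrupts)
```

Setting the threshold below the limit gives the receiver time to drain the queue before senders start getting rejected.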
Eureka/Barrier synchronization