Steven L. Scott, Synchronization and Communication in the T3E Multiprocessor, Proc. 7th International Conference on Architectural Support for Programming Languages and Operating Systems, October 1996, pp. 26-36.
Parallelism
Explicit : MPI
Implicit : Shared memory (better suited for irregular/dynamic parallelism)
Low sync overhead > scales to lots of processors, finer granularity of work split.
Commodity microprocessors are inefficient in multiprocessors
memory access is cache-line based > strided/vector access performs poorly.
Address space limited (TLB)
non-cached memory access is helpful (e.g. message passing to other processors)
latency reduction rather than latency toleration
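The cache-line point above can be made concrete with a small calculation. This is a minimal sketch, assuming a 64-byte line and 8-byte words (illustrative sizes, not taken from the paper); the function name `line_utilization` is hypothetical.

```python
def line_utilization(stride_words, line_bytes=64, word_bytes=8):
    """Long-run fraction of each fetched cache line used by a strided sweep.

    Sizes are assumptions for illustration, not figures from the paper.
    """
    if stride_words < 1:
        raise ValueError("stride must be at least one word")
    words_per_line = line_bytes // word_bytes
    if stride_words >= words_per_line:
        # Every access lands on a new line and uses a single word of it.
        return 1 / words_per_line
    # Smaller strides use roughly one word out of every `stride_words`.
    return 1 / stride_words


for stride in (1, 2, 8, 64):
    print(stride, line_utilization(stride))
```

Under these assumptions, any stride of eight words or more wastes seven eighths of the memory bandwidth, which is the motivation for non-cached, word-granularity remote access.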
T3D
Shared address space
bidirectional 3D torus wiring
barrier network : 4 wire wide, degree four spanning tree
remote memory access
Block transfer engine : bulk, async data transfer between proc memories
prefetch queue
DTB annex : extend the memory space outside the processor
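Two of the T3D interconnect notes above lend themselves to quick arithmetic: the hop count on a bidirectional torus, and the number of levels a degree-four tree needs to span a given number of processors. The sketch below is illustrative only; the function names are hypothetical and the real hardware layout may differ.

```python
def torus_hops(a, b, dims):
    """Minimum hop count between nodes a and b on a bidirectional torus."""
    hops = 0
    for x, y, n in zip(a, b, dims):
        d = abs(x - y) % n
        hops += min(d, n - d)  # the wraparound link may be the shorter way
    return hops


def tree_depth(num_leaves, degree=4):
    """Levels needed for a `degree`-ary tree to reach `num_leaves` endpoints."""
    depth, reach = 0, 1
    while reach < num_leaves:
        reach *= degree
        depth += 1
    return depth


print(torus_hops((0, 0, 0), (7, 0, 0), (8, 8, 8)))
print(tree_depth(1024))
```

The wraparound links halve the worst-case distance per dimension, and a degree-four tree keeps the barrier depth logarithmic (base 4) in the processor count.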
T3E overview
PE : DEC 21164 + shell(control + router + local memory)
self-hosted : Unicos/mk
I/O : GigaRing channel
E registers : shared area
virtual eureka/barrier network
Global virtual address
any bits of the address can be selected by a mask to form the virtual PE number
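The masking idea above can be sketched as a bit-extraction routine: bits selected by the mask are packed together to form the virtual PE number, and the remaining bits are packed to form the local offset. The function name `split_by_mask` and the 16-bit default width are assumptions for illustration; these notes do not describe how the hardware implements the step.

```python
def split_by_mask(index, mask, width=16):
    """Split `index` into (pe, offset) according to `mask`.

    Bits where the mask is 1 are packed, low to high, into the PE number;
    all other bits are packed into the offset. Hypothetical helper.
    """
    pe = offset = 0
    pe_pos = off_pos = 0
    for bit in range(width):
        value = (index >> bit) & 1
        if (mask >> bit) & 1:
            pe |= value << pe_pos
            pe_pos += 1
        else:
            offset |= value << off_pos
            off_pos += 1
    return pe, offset


# High bits, low bits, or interleaved bits can all select the PE.
print(split_by_mask(0b10110110, 0b11000000, width=8))
print(split_by_mask(0b10110110, 0b00000011, width=8))
print(split_by_mask(0b1110, 0b0101, width=4))
```

Choosing different masks changes how a single index space is distributed: a high-bit mask gives block distribution, a low-bit mask gives cyclic distribution.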
Get and put operations move data between global memory and E-registers (get: memory to E-register, put: E-register to memory)
Atomic memory operations (swap, fetch&inc, fetch&add, compare&swap, masked_swap)
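The semantics of the atomic operations listed above can be modelled in software. This is a minimal sketch that uses a lock to stand in for hardware atomicity; the class `Word` and its method names are hypothetical, and the mask polarity of `masked_swap` (mask bits set to 1 are the bits replaced) is an assumption.

```python
import threading


class Word:
    """Software model of one memory word supporting atomic operations."""

    def __init__(self, value=0):
        self.value = value
        self._lock = threading.Lock()  # stands in for hardware atomicity

    def swap(self, new):
        with self._lock:
            old, self.value = self.value, new
            return old

    def fetch_and_add(self, n):
        with self._lock:
            old = self.value
            self.value = old + n
            return old

    def fetch_and_inc(self):
        return self.fetch_and_add(1)

    def compare_and_swap(self, expected, new):
        # Stores `new` only if the word equals `expected`; returns old value.
        with self._lock:
            old = self.value
            if old == expected:
                self.value = new
            return old

    def masked_swap(self, new, mask):
        # Assumption: only bits set in `mask` are replaced by bits of `new`.
        with self._lock:
            old = self.value
            self.value = (old & ~mask) | (new & mask)
            return old


counter = Word(0)
ticket = counter.fetch_and_inc()  # each caller gets a unique ticket number
print(ticket, counter.value)
```

Every operation returns the previous value, which is what lets fetch&inc hand out unique tickets and compare&swap detect a lost race.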
Messaging
SEND command
message queue control word with tail, limit and threshold fields; interrupt raised when the tail reaches the threshold
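The interplay of tail, limit and threshold noted above can be sketched as a small queue model. The details here are assumptions for illustration: a SEND is rejected once the tail reaches the limit, and a single interrupt is counted when the tail first equals the threshold. The class `MessageQueue` is hypothetical and does not reproduce the hardware's exact behaviour.

```python
class MessageQueue:
    """Toy model of a receive queue governed by tail, limit and threshold."""

    def __init__(self, limit, threshold):
        self.slots = [None] * limit
        self.tail = 0               # next free slot
        self.limit = limit          # queue capacity
        self.threshold = threshold  # tail value that triggers an interrupt
        self.interrupts = 0

    def send(self, message):
        """Append a message; return False if the queue is full."""
        if self.tail >= self.limit:
            return False            # assumed: sender sees a rejection
        self.slots[self.tail] = message
        self.tail += 1
        if self.tail == self.threshold:
            self.interrupts += 1    # assumed: notify the receiver to drain
        return True


q = MessageQueue(limit=4, threshold=3)
print([q.send(i) for i in range(5)], q.interrupts)
```

Setting the threshold below the limit gives the receiver time to drain the queue before senders start being rejected.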
Eureka/Barrier synchronization