(3.3.1) T3E

Steven L. Scott, Synchronization and Communication in the T3E Multiprocessor, Proc. 7th International Conference on Architectural Support for Programming Languages and Operating Systems, October 1996, pp. 26-36.

Parallelism

Explicit : MPI

Implicit : Shared memory (better suited for irregular/dynamic parallelism)

Low sync overhead > scales to more processors, finer granularity of work split.

Commodity microprocessors are inefficient for multiprocessing

memory access is cache-line based > strided/vector access performs poorly.

Address space limited (TLB)

helpful to have non-cached memory access (e.g. MPI to other processors)

designed for latency reduction rather than latency toleration
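A rough way to quantify the stride penalty noted above: if every access pulls in a whole cache line, a sweep with stride s uses only about 1/s of the words it fetches, bottoming out at one word per line. The sketch below assumes 64-byte lines and 8-byte words; both figures are illustrative assumptions, not taken from the paper, and `line_utilization` is a hypothetical helper name.

```python
def line_utilization(stride_words, line_bytes=64, word_bytes=8):
    """Approximate fraction of fetched words that a strided sweep actually uses.

    Assumed (illustrative) parameters: 64-byte cache lines, 8-byte words.
    """
    if stride_words < 1:
        raise ValueError("stride must be at least 1 word")
    words_per_line = line_bytes // word_bytes
    # For strides shorter than a line, every line is touched and about
    # 1/stride of its words are used. For longer strides, each access
    # lands in its own line and uses exactly one word of it.
    return 1.0 / min(stride_words, words_per_line)


if __name__ == "__main__":
    for stride in (1, 2, 4, 8, 16):
        print(stride, line_utilization(stride))
```

Under these assumptions, unit stride uses every fetched word, while any stride of 8 words or more wastes 7/8 of the memory bandwidth.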

T3D (predecessor)

Shared address space

bidirectional 3D torus network

barrier network : 4 wires wide, degree-four spanning tree

remote memory access

Block transfer engine : bulk, asynchronous data transfer between processor memories

prefetch queue

DTB annex : extends the processor's address space so it can reach remote memory
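Two of the T3D items above lend themselves to a quick calculation: the hop count between nodes of a bidirectional 3D torus, and the number of levels in a degree-four tree spanning a given number of PEs. This is a minimal sketch; the function names are hypothetical, and reading "degree four" as "up to four children per node" is an assumption.

```python
def torus_hops(src, dst, dims):
    """Minimum hop count between two nodes of a bidirectional torus."""
    hops = 0
    for s, d, n in zip(src, dst, dims):
        forward = (d - s) % n
        # Links run both ways, so take the shorter direction around the ring.
        hops += min(forward, n - forward)
    return hops


def tree_depth(num_pes, degree=4):
    """Levels of fan-in needed for a tree of the given degree to cover num_pes leaves."""
    depth, reach = 0, 1
    while reach < num_pes:
        reach *= degree
        depth += 1
    return depth


if __name__ == "__main__":
    print(torus_hops((0, 0, 0), (7, 0, 0), (8, 8, 8)))  # wraparound link
    print(tree_depth(1024))
```

The wraparound links are what keep the worst-case distance at half the ring length per dimension, and the wide fan-in is what keeps the barrier tree shallow.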

T3E overview

PE : DEC Alpha 21164 + shell (control + router + local memory)

self-hosted : Unicos/mk

I/O : GigaRing channel

E-registers : external, memory-mapped registers in the shell; all remote communication and synchronization goes through them

virtual eureka/barrier network

Global virtual address

any subset of the address bits can be masked out to produce the virtual PE number
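The masking step can be pictured as splitting one address into two fields: bits selected by the mask are packed together to form the virtual PE number, and the remaining bits are packed to form the offset within that PE. The sketch below is an illustration under that assumption; `centrifuge`, its parameters, and the 16-bit default width are hypothetical and not the hardware's actual interface.

```python
def centrifuge(index, mask, width=16):
    """Split `index` into (virtual_pe, offset) according to `mask`.

    Bits of `index` where `mask` is 1 are packed, low to high, into the
    PE number; the other bits are packed into the offset.
    """
    pe = offset = 0
    pe_pos = off_pos = 0
    for bit in range(width):
        value = (index >> bit) & 1
        if (mask >> bit) & 1:
            pe |= value << pe_pos
            pe_pos += 1
        else:
            offset |= value << off_pos
            off_pos += 1
    return pe, offset


if __name__ == "__main__":
    # High four bits select the PE: a block distribution.
    print(centrifuge(0b10110110, 0b11110000, width=8))
    # Alternating bits select the PE: a finer interleaving.
    print(centrifuge(0b10110110, 0b01010101, width=8))
```

Because the mask is arbitrary, the same mechanism covers block, cyclic, and mixed data distributions without changing the addresses the program computes.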

Get : read global memory into an E-register; Put : write an E-register's contents to global memory

Atomic memory operations (swap, fetch&inc, fetch&add, compare&swap, masked_swap)
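The atomic operations listed above all follow the same pattern: read the old value, compute a new one, write it back, and return the old value, with no other access in between. Below is a minimal software model of that pattern using a lock. The class and method names are hypothetical; the masked swap semantics (replace only the bits selected by the mask) are an assumption; and Python integers do not wrap at 64 bits as a hardware word would.

```python
import threading


class AtomicWord:
    """Software model of one memory word with read-modify-write operations."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def _update(self, compute_new):
        # The lock stands in for the hardware's guarantee of atomicity.
        with self._lock:
            old = self._value
            self._value = compute_new(old)
            return old

    def load(self):
        with self._lock:
            return self._value

    def swap(self, new):
        return self._update(lambda old: new)

    def fetch_inc(self):
        return self._update(lambda old: old + 1)

    def fetch_add(self, amount):
        return self._update(lambda old: old + amount)

    def compare_swap(self, expected, new):
        # Stores `new` only if the word currently holds `expected`.
        return self._update(lambda old: new if old == expected else old)

    def masked_swap(self, new, mask):
        # Assumed semantics: only bits set in `mask` are replaced.
        return self._update(lambda old: (old & ~mask) | (new & mask))


if __name__ == "__main__":
    ticket = AtomicWord(0)
    print(ticket.fetch_inc(), ticket.fetch_inc(), ticket.load())
```

Fetch&inc is the usual building block for handing out unique tickets or array slots to many PEs, while compare&swap supports lock-free updates of shared structures.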

Messaging

SEND command

message queue control word : tail, limit, and threshold (reaching the threshold raises an interrupt)
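The roles of tail, limit, and threshold can be sketched with a small model of a receive queue. The assumptions here are loud ones: a SEND is taken to be rejected once the tail has reached the limit, and an interrupt is taken to fire when the tail advances to the threshold. The class, its fields, and the `send` method are hypothetical names for illustration, not the hardware interface.

```python
class MessageQueue:
    """Toy model of a message queue governed by tail, limit, and threshold."""

    def __init__(self, limit, threshold):
        self.slots = []          # delivered messages, in arrival order
        self.tail = 0            # index of the next free slot
        self.limit = limit       # first slot index that may not be written
        self.threshold = threshold
        self.interrupts = 0      # how many times the threshold was hit

    def send(self, message):
        """Try to enqueue a message; return False if the queue is full."""
        if self.tail >= self.limit:
            return False         # assumed behaviour: message is rejected
        self.slots.append(message)
        self.tail += 1
        if self.tail == self.threshold:
            self.interrupts += 1  # assumed behaviour: notify the receiver
        return True


if __name__ == "__main__":
    queue = MessageQueue(limit=3, threshold=2)
    print([queue.send(m) for m in "abcd"], queue.interrupts)
```

Setting the threshold below the limit gives the receiver time to drain or extend the queue before senders start being turned away.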

Eureka/Barrier synchronization