CS838 Discussion Notes 9/16

I.       The Tera Computer System

A.   Background

1.     Burton Smith is an incredible salesman when selling non-recurring engineering (NRE), but not so good at selling and delivering product: only one Tera was sold, although “sold” may not be the right word.

2.     The HEP and Horizon computers were predecessors of the Tera.

3.     Tera Computer Company actually bought Cray, and some features of the Tera Computer made it into some Cray models.

B.   Network Architecture

1.     256 processors, each capable of 128 instruction streams, are connected in a 3-D torus along with memory, I/O, and communication nodes.
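
To make the torus topology concrete, here is a minimal sketch of hop distance with wraparound links. The function name torus_hops and the 8 x 8 x 8 dimensions are assumptions for illustration only; the notes do not give the actual network dimensions.

```python
def torus_hops(a, b, dims):
    """Minimum hop count between nodes a and b on a torus.

    a, b  -- coordinate tuples, one entry per dimension
    dims  -- size of each dimension (hypothetical values in this sketch)
    """
    hops = 0
    for x, y, n in zip(a, b, dims):
        d = abs(x - y) % n
        hops += min(d, n - d)  # the wraparound link may be the shorter path
    return hops


# Hypothetical 8 x 8 x 8 torus, chosen only for illustration.
DIMS = (8, 8, 8)
edge_to_edge = torus_hops((0, 0, 0), (7, 0, 0), DIMS)  # neighbors via wraparound
worst_case = torus_hops((0, 0, 0), (4, 4, 4), DIMS)    # farthest node from the origin
```

In this sketch the worst-case distance grows with the side length of the torus, not with the node count, which is why the latency discussion below is framed in terms of the cube root of the network size.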

2.     Confusion reigned about the mental leap from the finite speed of light to the claim that the volume of the network is proportional to latency^3. We decided the argument was probably valid in a grossly simplified way.
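
One way to read the argument, as a rough sketch: assume signals travel at bounded speed c, and that each node occupies a fixed physical volume. The symbols r (reach), V (reachable volume), p (number of reachable nodes), and t (one-way latency) are introduced here for illustration and do not come from the notes.

```latex
\[
  r \le c\,t
  \quad\Longrightarrow\quad
  V(t) \propto r^{3} \propto t^{3}
  \quad\Longrightarrow\quad
  p \propto t^{3},
  \qquad\text{equivalently}\qquad
  t \propto p^{1/3}.
\]
```

Under these assumptions, a network of p nodes cannot have latency growing more slowly than the cube root of p, which is the "grossly simplified" sense in which the claim holds.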

3.     It is questionable whether the system had sufficient memory. It was, however, a large amount of memory at the time of the system’s design, so it was most likely enough given the timeframe.

4.     Data memory addresses are randomized to eliminate memory bank hotspots caused by pathological strided accesses. This randomization must be fast, since it applies to every data memory reference, and was most likely implemented with a glob of XOR gates.
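
The XOR idea can be sketched as follows. This is not Tera's actual randomization function, which the notes do not specify; the XOR-fold scheme, the 64-bank configuration, and the names plain_bank, xor_fold_bank, and banks_touched are all assumptions chosen to show why a few XOR gates can break up a bad stride.

```python
def plain_bank(addr, bank_bits=6):
    """Baseline mapping: the bank index is just the low address bits."""
    return addr & ((1 << bank_bits) - 1)


def xor_fold_bank(addr, bank_bits=6):
    """Assumed scheme: XOR together successive bank_bits-wide address fields."""
    mask = (1 << bank_bits) - 1
    bank = 0
    while addr:
        bank ^= addr & mask
        addr >>= bank_bits
    return bank


def banks_touched(bank_fn, stride, count=64):
    """Number of distinct banks hit by `count` accesses at a fixed stride."""
    return len({bank_fn(i * stride) for i in range(count)})


# A stride equal to the number of banks is pathological for the plain mapping.
plain_stride_64 = banks_touched(plain_bank, 64)
folded_stride_64 = banks_touched(xor_fold_bank, 64)
# The simple fold has its own bad stride, so it only moves the hotspot.
folded_stride_65 = banks_touched(xor_fold_bank, 65)
```

With 64 banks, a stride of 64 words lands every access on one bank under the plain mapping but spreads across all 64 banks under the fold. A stride of 65 collapses the fold back to a single bank, so a real design would presumably want a stronger mix of address bits than this sketch uses.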

C.   Processor Architecture

1.     128 instruction streams per processor may have been going a little overboard. On the other hand, it may have worked for the codes they were interested in. We asked what a good trade-off point is between more processors and more threads per processor; this remains an open question.

2.     To fit 128 instruction streams on a chip, control logic was cut extensively by using a static pipeline. To let a stream issue sequential instructions back to back, each instruction specified how far ahead its first dependent instruction was, up to 7 instructions away.
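
A toy model of how such a dependence hint could drive issue within a single stream, without any dynamic dependence checking. The encoding is an assumption: lookahead[j] is taken to be the number of following instructions that do not depend on instruction j (0 to 7), and every instruction is given the same fixed latency. The function name issue_cycles is hypothetical.

```python
def issue_cycles(lookahead, latency):
    """Return the cycle at which each instruction of one stream issues.

    lookahead[j] -- assumed encoding: how many instructions after j are
                    independent of j (0..7)
    latency      -- cycles until an issued instruction completes (uniform here)
    """
    issue = []
    for i in range(len(lookahead)):
        # In-order issue, at most one instruction per cycle.
        cycle = issue[-1] + 1 if issue else 0
        for j in range(i):
            if j + lookahead[j] < i:
                # i lies beyond j's independent window, so wait for j to finish.
                cycle = max(cycle, issue[j] + latency)
        issue.append(cycle)
    return issue


serialized = issue_cycles([0, 0, 0], latency=4)  # every instruction waits on the last
overlapped = issue_cycles([7, 7, 7], latency=4)  # no waits within the window
mixed = issue_cycles([1, 0, 0], latency=4)
```

The hardware only needs to compare a small counter per stream, which fits the notes' point that control logic was cut by pushing dependence analysis to the compiler.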

3.     8 branch target registers allowed prefetching of instructions beyond branches and fast traps.

4.     One instruction consists of up to 3 ops. Combined with items 1 through 3 above, this LIW (long instruction word) format allowed a huge number of instructions to flow uninhibited through a potentially long pipeline.

5.     To allow switching between 128 streams at cycle granularity, the register file had to be replicated for each stream. This huge register file most likely needed to be split to allow for 6 read ports, 3 write ports, and reasonable access times.

6.     Stream creation is regulated using three counters: slim (stream limit), scur (streams currently in use), and sres (streams reserved). The trapping mechanism is used to start a new stream.
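
A rough sketch of how three such counters could regulate stream creation. The notes do not spell out the hardware rules, so the semantics here are an assumed reading: slim is a limit on streams, scur counts streams in use, sres counts streams reserved, and the invariant scur <= sres <= slim is maintained. The class StreamCounters and its method names are hypothetical.

```python
class StreamCounters:
    """Toy model of stream accounting (assumed semantics, not the real hardware)."""

    def __init__(self, slim):
        self.slim = slim  # stream limit
        self.scur = 0     # streams currently in use
        self.sres = 0     # streams reserved, whether created yet or not

    def reserve(self, n):
        """Try to reserve n more streams; grant all of them or none."""
        if self.sres + n > self.slim:
            return False
        self.sres += n
        return True

    def create(self):
        """Start a stream against an existing reservation."""
        if self.scur >= self.sres:
            # Modeled as an error here; real hardware would presumably trap.
            raise RuntimeError("create without a reservation")
        self.scur += 1

    def quit(self):
        """Retire a stream and give back its reservation."""
        if self.scur == 0:
            raise RuntimeError("no stream to quit")
        self.scur -= 1
        self.sres -= 1


counters = StreamCounters(slim=4)
granted = counters.reserve(3)  # fits under the limit
denied = counters.reserve(2)   # would exceed slim, so nothing is reserved
counters.create()
counters.create()
```

Reserving before creating means a program learns up front how much parallelism it will get, so a create never fails halfway through spawning a group of streams.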