--------------------------------------------------------------------
CS 757 Parallel Computer Architecture
Spring 2012, Section 1
Instructor: Mark D. Hill
--------------------------------------------------------------------

Outline: Supercomputing
* Motivation
* Cray T3E
* Top 500
* Two Views of the Future: Intel & Nvidia
* (Roadrunner)
* Prospects

-----------------------------
Motivation

Supercomputing -- capability computing
* Flops
* Interconnect
* Programmability?

Early: most computers
1975-85: vector machines w/ 1 or a few processors
1995-2005: incorporate killer microprocessors -- MPPs
* T3E -- shared memory w/o coherence
* Clusters w/ special ICNs
2005-present
* add special vector units, IBM Cell
* now GPUs

(ADD LATER Going forward?
* Exascale has a serious power problem
* Before: 1000x faster for 1x power
* Now: 1000x faster for 100x power -- NO!
* Back to custom
* How relevant?)

-----------------------------
%T Synchronization and Communication in the T3E Multiprocessor
%A Steven L. Scott
%J Proc. 7th International Conference on Architectural Support for Programming Languages and Operating Systems
%K ASPLOS VII
%C Cambridge
%D October 1996
%P 26-36
%Y topic: MULTIPROCESSORS/synchronization, ARCHITECTURE/real computers

T3E uses the Alpha 21164, but:
* only two outstanding cache misses
* 33-bit PA = 8 GB, but want 128 GB
* poor TLB reach
* poor locality
* coherence protocol in the way

T3D (w/ 21064) was too ad hoc
* DTB Annex
* kitchen sink: loads/stores, prefetch queue, block transfer engine (BLT)

T3E
* up to 2K processors
* 3D torus
* no board-level cache

E-registers
* 512 user + 128 system
* w/ EMPTY/FULL bits

Get ereg#, (mem)
* sw data, (addr)  where data == precursor-to-global-address
                         addr = command || ereg#
* ld gpr#, addr'   (where addr' = command || ereg#)
* st gpr#, addr'   (where addr' = command || ereg#)
Put ereg#, (mem)
Also vector get/put

Addresses
* segmented addressing
* controlled interleaving
* see Figure 4

Atomic memory operations
* fetch&inc, fetch&add, compare&swap, masked_swap
* store operands in E-regs
* invoke with store to I/O space

Messages
* 64-bit message queue control word (MQCW):
    signal(1) || tail(21) || limit(21) || threshold(21)
* all messages 64 B
* Send e-reg-block, MQCW-address
  - at sender, write data into 8 E-regs
  - invoke send at MQCW
  - store at MQCW + tail*64 (if tail < limit)
  - if tail > threshold, set signal & generate interrupt (a warning)

Barrier (AND) / Eureka (OR)
* 32 units
* mask for processors per unit

(C sketches of get/put, the address centrifuge, an AMO, message arrival, and
barrier/eureka semantics follow these notes, after the review questions.)

Reviews (OLD COMMENT: Polina has new ones)

Workloads
* Marc: Why best performance of shared memory? Latency reduction/tolerance? Address translation?
* Daniel, Eric: Why is a board-level cache (L2/L3) bad? Increase BW?
* Brian: Successful?
* Syed: Why are single-word loads more efficient?

Other
* David: Where are E-regs? Fuzzy barriers
* Andrew N: E-regs elsewhere? CMPs?
* Shijin: Get/put?
* Guoliang: Eureka and address translation?
* Andrew E: "miss speculation"???
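Sketch 1: a minimal C sketch (not Cray's actual interface) of how a user-level
library might wrap the memory-mapped E-register get/put described above.
E_REG_BASE and the command encodings (CMD_DATA/CMD_GET/CMD_PUT) are invented
placeholders; only the shape of the mechanism -- a store whose address is
command || ereg# and whose data is the address precursor, followed by a load or
store of the E-register's contents -- follows the notes.

#include <stdint.h>

/* Assumed memory-mapped E-register command space; real encodings differ. */
#define E_REG_BASE  0x4000000000ULL
#define CMD_DATA    0x0000ULL   /* read/write an E-register's contents        */
#define CMD_GET     0x1000ULL   /* launch a remote read into an E-register    */
#define CMD_PUT     0x2000ULL   /* launch a remote write from an E-register   */

static inline volatile uint64_t *ereg_addr(uint64_t cmd, unsigned ereg)
{
    /* addr = command || ereg#, as in the notes (ereg# scaled to a word address) */
    return (volatile uint64_t *)(uintptr_t)(E_REG_BASE | cmd | ((uint64_t)ereg << 3));
}

/* Remote read: store the "precursor-to-global-address" to the GET command
 * address; hardware fetches the remote word into E-register 'ereg' and marks
 * it FULL.  A later load of the E-register returns the data (stalling if the
 * register is still EMPTY). */
static inline uint64_t t3e_get(unsigned ereg, uint64_t addr_precursor)
{
    *ereg_addr(CMD_GET, ereg) = addr_precursor;   /* sw data, (cmd||ereg#) */
    return *ereg_addr(CMD_DATA, ereg);            /* ld gpr#, addr'        */
}

/* Remote write: deposit the value into E-register 'ereg', then store the
 * address precursor to the PUT command address to launch it. */
static inline void t3e_put(unsigned ereg, uint64_t addr_precursor, uint64_t value)
{
    *ereg_addr(CMD_DATA, ereg) = value;           /* st gpr#, addr'        */
    *ereg_addr(CMD_PUT, ereg)  = addr_precursor;  /* sw data, (cmd||ereg#) */
}

The point of the split interface: many gets to different E-registers can be
outstanding at once, sidestepping the 21164's two-outstanding-miss limit.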
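Sketch 2: the controlled-interleaving "centrifuge" idea behind the segmented
addressing (Figure 4 of the paper): index bits selected by a mask are packed
together to form the PE number, and the remaining bits form the within-PE
offset. Field widths and exact mask semantics here are assumptions for
illustration, not the T3E's actual encoding.

#include <stdint.h>

typedef struct { uint64_t pe; uint64_t offset; } global_loc_t;

static global_loc_t centrifuge(uint64_t index, uint64_t mask)
{
    global_loc_t loc = {0, 0};
    unsigned pe_bit = 0, off_bit = 0;
    for (unsigned b = 0; b < 64; b++) {
        uint64_t bit = (index >> b) & 1;
        if ((mask >> b) & 1)
            loc.pe     |= bit << pe_bit++;   /* masked bits -> PE number    */
        else
            loc.offset |= bit << off_bit++;  /* remaining bits -> offset    */
    }
    return loc;
}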
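Sketch 3: one atomic memory operation (fetch&add) in the same style, reusing
ereg_addr() and CMD_DATA from Sketch 1; CMD_AMO_FADD is an invented encoding.
The pattern follows the notes: operand into an E-register, a store to the I/O
(command) space launches the AMO, and the old memory value returns in the
E-register.

#define CMD_AMO_FADD 0x3000ULL  /* assumed fetch&add command encoding */

static inline uint64_t t3e_fetch_add(unsigned ereg, uint64_t addr_precursor,
                                     uint64_t addend)
{
    *ereg_addr(CMD_DATA, ereg)     = addend;          /* operand into E-reg          */
    *ereg_addr(CMD_AMO_FADD, ereg) = addr_precursor;  /* store to I/O space launches */
    return *ereg_addr(CMD_DATA, ereg);                /* old memory value, once FULL */
}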
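Sketch 4: a software model of the message queue control word and of the
arrival-side behavior described in the notes. Assumptions: the exact bit
positions of the fields, and a queue_base pointer standing in for the memory
addressed by MQCW + tail*64.

#include <stdint.h>
#include <string.h>

/* signal(1) || tail(21) || limit(21) || threshold(21) */
typedef struct {
    uint64_t signal    : 1;
    uint64_t tail      : 21;
    uint64_t limit     : 21;
    uint64_t threshold : 21;
} mqcw_t;

/* If there is room (tail < limit), the 64-byte message lands at queue slot
 * 'tail' and tail advances; once tail passes threshold, the signal bit is set
 * and an interrupt warns the receiving process (a warning, not a hard stop). */
static int mq_arrive(mqcw_t *mqcw, unsigned char *queue_base,
                     const unsigned char msg[64], void (*warn_interrupt)(void))
{
    if (mqcw->tail >= mqcw->limit)
        return 0;                                    /* queue full: message rejected */
    memcpy(queue_base + (size_t)mqcw->tail * 64, msg, 64);
    mqcw->tail++;
    if (mqcw->tail > mqcw->threshold && !mqcw->signal) {
        mqcw->signal = 1;
        warn_interrupt();
    }
    return 1;
}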
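Sketch 5: the logical semantics of a barrier/eureka unit only, modeled for up
to 64 participants with a membership mask (the real units scale to the full
machine); not a model of the hardware implementation.

#include <stdint.h>

/* Barrier = AND: complete only when every PE in the mask has arrived. */
static int barrier_done(uint64_t arrived_bits, uint64_t member_mask)
{
    return (arrived_bits & member_mask) == member_mask;
}

/* Eureka = OR: fires as soon as any PE in the mask has signaled. */
static int eureka_fired(uint64_t signaled_bits, uint64_t member_mask)
{
    return (signaled_bits & member_mask) != 0;
}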
-----------------------------
Top 500 Computer Sites
http://www.top500.org/
Nov 2011

1. K computer, SPARC64 VIIIfx 2.0 GHz, Tofu interconnect
   Site: RIKEN Advanced Institute for Computational Science (AICS)
   Manufacturer: Fujitsu
   Cores: 705,024
   Power: 12,659.89 kW
   Memory: 1,410,048 GB
   Interconnect: Custom
   Operating System: Linux

http://en.wikipedia.org/wiki/K_computer
The K computer -- named for the Japanese word "kei" (京), meaning 10 quadrillion[1] -- is a supercomputer produced by Fujitsu, currently installed at the RIKEN Advanced Institute for Computational Science campus in Kobe, Japan.[2][3][4] The K computer is based on a distributed-memory architecture with over 80,000 compute nodes:[5] 88,128 2.0 GHz 8-core SPARC64 VIIIfx processors packed in 864 cabinets, for a total of 705,024 cores, manufactured by Fujitsu with 45 nm CMOS technology.[10] Each cabinet contains 96 computing nodes, in addition to 6 I/O nodes. Each computing node contains a single processor and 16 GB of memory. The computer's water cooling system is designed to minimize failure rate and power consumption.[11]

Network: The K computer uses a proprietary six-dimensional torus interconnect called Tofu, and a Tofu-optimized Message Passing Interface based on the open-source Open MPI library.[11][12][13] Users can create application programs adapted to a one-, two-, or three-dimensional torus network.[14]

2. Tianhe-1A - NUDT YH MPP, Xeon X5670 6C 2.93 GHz, NVIDIA 2050
   Site: National Supercomputing Center in Tianjin
   Manufacturer: NUDT
   Cores: 186,368
   Power: 4,040.00 kW
   Memory: 229,376 GB
   Interconnect: Proprietary
   Operating System: Linux

http://en.wikipedia.org/wiki/Tianhe-I
Tianhe-I, Tianhe-1, or TH-1 (天河一号, pinyin: Tiānhé yīhào; in English "Milky Way", literally "Sky River"). In October 2010, Tianhe-1A, an upgraded supercomputer, was unveiled at HPC 2010 China.[15] It is now equipped with 14,336 Xeon X5670 processors and 7,168 Nvidia Tesla M2050 general-purpose GPUs. 2,048 FeiTeng 1000 SPARC-based processors are also installed in the system, but their computing power was not counted in the machine's official Linpack statistics as of October 2010.[16] Tianhe-1A has a theoretical peak performance of 4.701 petaflops.[17] NVIDIA suggests that it would have taken "50,000 CPUs and twice as much floor space to deliver the same performance using CPUs alone." The current heterogeneous system consumes 4.04 megawatts, compared to over 12 megawatts had it been built only with CPUs.[18]

The Tianhe-1A system is composed of 112 computer cabinets, 12 storage cabinets, 6 communications cabinets, and 8 I/O cabinets. Each computer cabinet is composed of four frames, with each frame containing eight blades, plus a 16-port switching board.
Each blade is composed of two computer nodes, with each computer node containing two Xeon X5670 6-core processors and one Nvidia M2050 GPU processor.[19] The system has 3,584 total blades containing 7,168 GPUs and 14,336 CPUs, managed by the SLURM job scheduler.[20] The total disk storage of the system is 2 petabytes, implemented as a Lustre clustered file system,[1] and the total memory size of the system is 262 terabytes.[16]

Another significant reason for the increased performance of the upgraded Tianhe-1A system is the Chinese-designed NUDT proprietary high-speed interconnect called Arch, which runs at 160 Gbps, twice the bandwidth of InfiniBand.[16]

3. Jaguar - Cray XT5-HE, Opteron 6-core 2.6 GHz
   Site: DOE/SC/Oak Ridge National Laboratory
   System URL: http://www.nccs.gov/computing-resources/jaguar/
   Manufacturer: Cray Inc.
   Cores: 224,162
   Power: 6,950.00 kW
   Memory:
   Interconnect: Proprietary
   Operating System: Linux

http://www-test1.nccs.gov/wp-content/uploads/2007/08/x1e_architecture.pdf
http://en.wikipedia.org/wiki/Jaguar_(supercomputer)
Jaguar is a petascale supercomputer built by Cray at Oak Ridge National Laboratory (ORNL) in Oak Ridge, Tennessee. The massively parallel Jaguar has a peak performance of just over 1,750 teraflops (1.75 petaflops). It has 224,256 x86-based AMD Opteron processor cores[2] and operates with a version of Linux called the Cray Linux Environment.[3] Jaguar is a Cray XT5 system, a development from the Cray XT4 supercomputer.

Jaguar's XT5 partition contains 18,688 compute nodes in addition to dedicated login/service nodes. Each XT5 compute node contains dual hex-core AMD Opteron 2435 (Istanbul) processors and 16 GB of memory. Jaguar's XT4 partition contains 7,832 compute nodes in addition to dedicated login/service nodes. Each XT4 compute node contains a quad-core AMD Opteron 1354 (Budapest) processor and 8 GB of memory. Total combined memory amounts to over 360 terabytes (TB).[9]

-------------------
Micro 2011 Keynotes
Add future to history
http://pages.cs.wisc.edu/~markhill/restricted/micro11_sodani_keynote.pdf
http://pages.cs.wisc.edu/~markhill/restricted/micro11_keckler_keynote.pdf

-------------------
(Roadrunner)

-------------------