--------------------------------------------------------------------
CS 757 Parallel Computer Architecture
Spring 2012, Section 1
Instructor: Mark D. Hill
--------------------------------------------------------------------

Outline: Supercomputing
* Motivation
* Cray T3E
* Top 500
* Two Views of the Future: Intel & Nvidia
* (Roadrunner)
* Prospects

-----------------------------
Motivation

Supercomputing -- capability computing
* Flops
* Interconnect
* Programmability?

Early: most computers
1975-85: vector machines w/ 1 or a few processors
1995-2005: incorporate killer microprocessors -- MPPs
* T3E -- shared memory w/o coherence
* Clusters w/ special ICNs
2005-present
* add special vector units, IBM Cell
* now GPUs

(ADD LATER Going forward?
* Exascale has a serious power problem
* Before: 1000x faster for 1x power
* Now: 1000x faster for 100x power -- NO!
* Back to custom
* How relevant?)

-----------------------------
%T Synchronization and Communication in the T3E Multiprocessor
%A Steven L. Scott
%J Proc. 7th International Conference on Architectural Support for Programming Languages and Operating Systems
%K ASPLOS VII
%C Cambridge
%D October 1996
%P 26-36
%Y topic: MULTIPROCESSORS/synchronization, ARCHITECTURE/real computers

T3E uses the Alpha 21164, but:
* only two outstanding cache misses
* 33-bit PA = 8 GB, but want 128 GB
* poor TLB reach
* poor locality
* coherence protocol in the way

T3D (w/ 21064) was too ad hoc
* DTB Annex
* kitchen sink: loads/stores, prefetch queue, block transfer engine (BLT)

T3E
* up to 2K processors
* 3D torus
* no board-level cache

E-registers
* 512 user + 128 system
* w/ EMPTY/FULL bits

Get ereg#, (mem)
* sw data, (addr)  where data == precursor-to-global-address
                         addr = command || ereg#
* ld gpr#, addr'   (where addr' = command || ereg#)
* st gpr#, addr'   (where addr' = command || ereg#)
Put ereg#, (mem)
Also vector get/put

Addresses
* segmented addressing
* controlled interleaving
* see Figure 4

Atomic memory operations
* fetch&inc, fetch&add, compare&swap, masked_swap
* store operands in E-regs
* invoke with store to I/O space

Messages
* 64-bit message queue control word (MQCW):
    signal(1) || tail(21) || limit(21) || threshold(21)
* all messages 64 B
* Send e-reg-block, MQCW-address
  - at sender, write data into 8 E-regs
  - invoke send at MQCW
  - store at MQCW + tail*64 (if tail < limit)
  - if tail > threshold, set signal & generate interrupt (a warning)

Barrier (AND) / Eureka (OR)
* 32 units
* mask for processors per unit

(C sketches of get/put, the address centrifuge, an AMO, message arrival, and
barrier/eureka semantics follow these notes, after the review questions.)

Reviews (OLD COMMENT: Polina has new ones)

Workloads
* Marc: Why best performance of shared memory? Latency reduction/tolerance? Address translation?
* Daniel, Eric: Why is a board-level cache (L2/L3) bad? Increase BW?
* Brian: Successful?
* Syed: Why are single-word loads more efficient?

Other
* David: Where are E-regs? Fuzzy barriers
* Andrew N: E-regs elsewhere? CMPs?
* Shijin: Get/put?
* Guoliang: Eureka and address translation?
* Andrew E: "miss speculation"???
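Sketch 1: a minimal C sketch (not Cray's actual interface) of how a user-level
library might wrap the memory-mapped E-register get/put described above.
E_REG_BASE and the command encodings (CMD_DATA/CMD_GET/CMD_PUT) are invented
placeholders; only the shape of the mechanism -- a store whose address is
command || ereg# and whose data is the address precursor, followed by a load or
store of the E-register's contents -- follows the notes.

#include <stdint.h>

/* Assumed memory-mapped E-register command space; real encodings differ. */
#define E_REG_BASE  0x4000000000ULL
#define CMD_DATA    0x0000ULL   /* read/write an E-register's contents        */
#define CMD_GET     0x1000ULL   /* launch a remote read into an E-register    */
#define CMD_PUT     0x2000ULL   /* launch a remote write from an E-register   */

static inline volatile uint64_t *ereg_addr(uint64_t cmd, unsigned ereg)
{
    /* addr = command || ereg#, as in the notes (ereg# scaled to a word address) */
    return (volatile uint64_t *)(uintptr_t)(E_REG_BASE | cmd | ((uint64_t)ereg << 3));
}

/* Remote read: store the "precursor-to-global-address" to the GET command
 * address; hardware fetches the remote word into E-register 'ereg' and marks
 * it FULL.  A later load of the E-register returns the data (stalling if the
 * register is still EMPTY). */
static inline uint64_t t3e_get(unsigned ereg, uint64_t addr_precursor)
{
    *ereg_addr(CMD_GET, ereg) = addr_precursor;   /* sw data, (cmd||ereg#) */
    return *ereg_addr(CMD_DATA, ereg);            /* ld gpr#, addr'        */
}

/* Remote write: deposit the value into E-register 'ereg', then store the
 * address precursor to the PUT command address to launch it. */
static inline void t3e_put(unsigned ereg, uint64_t addr_precursor, uint64_t value)
{
    *ereg_addr(CMD_DATA, ereg) = value;           /* st gpr#, addr'        */
    *ereg_addr(CMD_PUT, ereg)  = addr_precursor;  /* sw data, (cmd||ereg#) */
}

The point of the split interface: many gets to different E-registers can be
outstanding at once, sidestepping the 21164's two-outstanding-miss limit.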
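Sketch 2: the controlled-interleaving "centrifuge" idea behind the segmented
addressing (Figure 4 of the paper): index bits selected by a mask are packed
together to form the PE number, and the remaining bits form the within-PE
offset. Field widths and exact mask semantics here are assumptions for
illustration, not the T3E's actual encoding.

#include <stdint.h>

typedef struct { uint64_t pe; uint64_t offset; } global_loc_t;

static global_loc_t centrifuge(uint64_t index, uint64_t mask)
{
    global_loc_t loc = {0, 0};
    unsigned pe_bit = 0, off_bit = 0;
    for (unsigned b = 0; b < 64; b++) {
        uint64_t bit = (index >> b) & 1;
        if ((mask >> b) & 1)
            loc.pe     |= bit << pe_bit++;   /* masked bits -> PE number    */
        else
            loc.offset |= bit << off_bit++;  /* remaining bits -> offset    */
    }
    return loc;
}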
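Sketch 3: one atomic memory operation (fetch&add) in the same style, reusing
ereg_addr() and CMD_DATA from Sketch 1; CMD_AMO_FADD is an invented encoding.
The pattern follows the notes: operand into an E-register, a store to the I/O
(command) space launches the AMO, and the old memory value returns in the
E-register.

#define CMD_AMO_FADD 0x3000ULL  /* assumed fetch&add command encoding */

static inline uint64_t t3e_fetch_add(unsigned ereg, uint64_t addr_precursor,
                                     uint64_t addend)
{
    *ereg_addr(CMD_DATA, ereg)     = addend;          /* operand into E-reg          */
    *ereg_addr(CMD_AMO_FADD, ereg) = addr_precursor;  /* store to I/O space launches */
    return *ereg_addr(CMD_DATA, ereg);                /* old memory value, once FULL */
}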
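Sketch 4: a software model of the message queue control word and of the
arrival-side behavior described in the notes. Assumptions: the exact bit
positions of the fields, and a queue_base pointer standing in for the memory
addressed by MQCW + tail*64.

#include <stdint.h>
#include <string.h>

/* signal(1) || tail(21) || limit(21) || threshold(21) */
typedef struct {
    uint64_t signal    : 1;
    uint64_t tail      : 21;
    uint64_t limit     : 21;
    uint64_t threshold : 21;
} mqcw_t;

/* If there is room (tail < limit), the 64-byte message lands at queue slot
 * 'tail' and tail advances; once tail passes threshold, the signal bit is set
 * and an interrupt warns the receiving process (a warning, not a hard stop). */
static int mq_arrive(mqcw_t *mqcw, unsigned char *queue_base,
                     const unsigned char msg[64], void (*warn_interrupt)(void))
{
    if (mqcw->tail >= mqcw->limit)
        return 0;                                    /* queue full: message rejected */
    memcpy(queue_base + (size_t)mqcw->tail * 64, msg, 64);
    mqcw->tail++;
    if (mqcw->tail > mqcw->threshold && !mqcw->signal) {
        mqcw->signal = 1;
        warn_interrupt();
    }
    return 1;
}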
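Sketch 5: the logical semantics of a barrier/eureka unit only, modeled for up
to 64 participants with a membership mask (the real units scale to the full
machine); not a model of the hardware implementation.

#include <stdint.h>

/* Barrier = AND: complete only when every PE in the mask has arrived. */
static int barrier_done(uint64_t arrived_bits, uint64_t member_mask)
{
    return (arrived_bits & member_mask) == member_mask;
}

/* Eureka = OR: fires as soon as any PE in the mask has signaled. */
static int eureka_fired(uint64_t signaled_bits, uint64_t member_mask)
{
    return (signaled_bits & member_mask) != 0;
}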
-----------------------------
Top 500 Computer Sites
http://www.top500.org/
Nov 2011

1. K computer, SPARC64 VIIIfx 2.0 GHz, Tofu interconnect
   Site: RIKEN Advanced Institute for Computational Science (AICS)
   Manufacturer: Fujitsu
   Cores: 705,024
   Power: 12,659.89 kW
   Memory: 1,410,048 GB
   Interconnect: Custom
   Operating System: Linux

http://en.wikipedia.org/wiki/K_computer
The K computer -- named for the Japanese word "kei" (京), meaning 10 quadrillion[1] -- is a supercomputer produced by Fujitsu, currently installed at the RIKEN Advanced Institute for Computational Science campus in Kobe, Japan.[2][3][4] The K computer is based on a distributed-memory architecture with over 80,000 compute nodes:[5] 88,128 2.0 GHz 8-core SPARC64 VIIIfx processors packed in 864 cabinets, for a total of 705,024 cores, manufactured by Fujitsu with 45 nm CMOS technology.[10] Each cabinet contains 96 computing nodes, in addition to 6 I/O nodes. Each computing node contains a single processor and 16 GB of memory. The computer's water cooling system is designed to minimize failure rate and power consumption.[11]

Network: The K computer uses a proprietary six-dimensional torus interconnect called Tofu, and a Tofu-optimized Message Passing Interface based on the open-source Open MPI library.[11][12][13] Users can create application programs adapted to a one-, two-, or three-dimensional torus network.[14]

2. Tianhe-1A - NUDT YH MPP, Xeon X5670 6C 2.93 GHz, NVIDIA 2050
   Site: National Supercomputing Center in Tianjin
   Manufacturer: NUDT
   Cores: 186,368
   Power: 4,040.00 kW
   Memory: 229,376 GB
   Interconnect: Proprietary
   Operating System: Linux

http://en.wikipedia.org/wiki/Tianhe-I
Tianhe-I, Tianhe-1, or TH-1 (天河一号, pinyin: Tiānhé yīhào; in English "Milky Way", literally "Sky River"). In October 2010, Tianhe-1A, an upgraded supercomputer, was unveiled at HPC 2010 China.[15] It is now equipped with 14,336 Xeon X5670 processors and 7,168 Nvidia Tesla M2050 general-purpose GPUs. 2,048 FeiTeng 1000 SPARC-based processors are also installed in the system, but their computing power was not counted in the machine's official Linpack statistics as of October 2010.[16] Tianhe-1A has a theoretical peak performance of 4.701 petaflops.[17] NVIDIA suggests that it would have taken "50,000 CPUs and twice as much floor space to deliver the same performance using CPUs alone." The current heterogeneous system consumes 4.04 megawatts, compared to over 12 megawatts had it been built only with CPUs.[18]

The Tianhe-1A system is composed of 112 computer cabinets, 12 storage cabinets, 6 communications cabinets, and 8 I/O cabinets. Each computer cabinet is composed of four frames, with each frame containing eight blades, plus a 16-port switching board.
Each blade is composed of two computer nodes, with each computer node containing two Xeon X5670 6-core processors and one Nvidia M2050 GPU processor.[19] The system has 3,584 total blades containing 7,168 GPUs and 14,336 CPUs, managed by the SLURM job scheduler.[20] The total disk storage of the system is 2 petabytes, implemented as a Lustre clustered file system,[1] and the total memory size of the system is 262 terabytes.[16]

Another significant reason for the increased performance of the upgraded Tianhe-1A system is the Chinese-designed NUDT proprietary high-speed interconnect called Arch, which runs at 160 Gbps, twice the bandwidth of InfiniBand.[16]

3. Jaguar - Cray XT5-HE, Opteron 6-core 2.6 GHz
   Site: DOE/SC/Oak Ridge National Laboratory
   System URL: http://www.nccs.gov/computing-resources/jaguar/
   Manufacturer: Cray Inc.
   Cores: 224,162
   Power: 6,950.00 kW
   Memory:
   Interconnect: Proprietary
   Operating System: Linux

http://www-test1.nccs.gov/wp-content/uploads/2007/08/x1e_architecture.pdf
http://en.wikipedia.org/wiki/Jaguar_(supercomputer)
Jaguar is a petascale supercomputer built by Cray at Oak Ridge National Laboratory (ORNL) in Oak Ridge, Tennessee. The massively parallel Jaguar has a peak performance of just over 1,750 teraflops (1.75 petaflops). It has 224,256 x86-based AMD Opteron processor cores[2] and operates with a version of Linux called the Cray Linux Environment.[3] Jaguar is a Cray XT5 system, a development from the Cray XT4 supercomputer.

Jaguar's XT5 partition contains 18,688 compute nodes in addition to dedicated login/service nodes. Each XT5 compute node contains dual hex-core AMD Opteron 2435 (Istanbul) processors and 16 GB of memory. Jaguar's XT4 partition contains 7,832 compute nodes in addition to dedicated login/service nodes. Each XT4 compute node contains a quad-core AMD Opteron 1354 (Budapest) processor and 8 GB of memory. Total combined memory amounts to over 360 terabytes (TB).[9]

-------------------
Micro 2011 Keynotes
Add future to history
http://pages.cs.wisc.edu/~markhill/restricted/micro11_sodani_keynote.pdf
http://pages.cs.wisc.edu/~markhill/restricted/micro11_keckler_keynote.pdf

-------------------
(Roadrunner)

-------------------