US/Venezuela Workshop on High Performance Computing 2000
Seminario EEUU/Venezuela de Computación de Alto Rendimiento 2000

Jeffrey S. Brown
Los Alamos National Laboratory

Towards Tera-scale Performance in the ASCI Program

Abstract

The mission of the United States Department of Energy (DOE) Accelerated Strategic Computing Initiative (ASCI) program is to develop a computer simulation capability that allows the DOE to certify the safety and reliability of the nuclear stockpile in the absence of physical testing. Achieving the necessary level of confidence in these simulations requires very high-resolution, high-fidelity multi-physics applications, which are being developed to run on tera-scale compute platforms. The ASCI program currently has platforms deployed at the research laboratories that deliver a peak performance of three teraflops, with plans to deploy up to 100 teraflops of capability in a few years. ASCI simulations must exploit the full potential of these powerful machines to meet program objectives, a task made quite difficult by the complexity of both the compute platforms and the numerical methods being developed.

This talk will provide an overview of the ASCI program, with an emphasis on the unique ASCI software development and run-time environment designed for developing and tuning very large-scale parallel programs. I will survey the software development environments on the three ASCI platforms, including work being done through the Ultra-scale tools initiative and through research collaborations with U.S. universities, and will focus on performance measurement and analysis on the Blue Mountain system at Los Alamos National Laboratory (LANL).

The LANL Blue Mountain system is a cluster of forty-eight 128-processor Origin 2000 (O2K) computers manufactured by Silicon Graphics Incorporated (SGI), interconnected with HiPPI in a 3D torus topology. Achieving high performance on such a complex system is a unique challenge that spans single-process dynamics such as cache reuse, on-box dynamic memory performance over the internal O2K network (NUMA effects), and across-box message-passing dynamics over HiPPI. LANL is developing a performance tool capability to help code developers understand machine resource usage and map an application onto the machine in a way that avoids resource contention and bottlenecks. This talk will explore these issues and present current results and performance tool capabilities.
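
As a rough illustration of why the mapping of an application onto the machine matters, the sketch below counts how many nearest-neighbour exchanges in a one-dimensional chain of processes would have to leave a box and travel over HiPPI under two placements. The box and processor counts come from the description above; the two placement functions, the 1D chain pattern, and all names are hypothetical assumptions made for illustration only. They are not the LANL performance tools or any actual Blue Mountain scheduling policy.

```python
# Machine shape taken from the abstract: 48 boxes of 128 processors each.
BOXES = 48
PROCS_PER_BOX = 128
TOTAL_PROCS = BOXES * PROCS_PER_BOX  # 6144 processors in total


def block_placement(rank):
    """Hypothetical placement: consecutive ranks fill one box before the next."""
    return rank // PROCS_PER_BOX


def cyclic_placement(rank):
    """Hypothetical placement: consecutive ranks are dealt round-robin over boxes."""
    return rank % BOXES


def cross_box_messages(placement, nranks=TOTAL_PROCS):
    """Count neighbour pairs (r, r + 1) of a 1D chain that sit in different boxes.

    Each such pair stands for an exchange that would cross the inter-box
    network rather than stay on the internal network of a single box.
    """
    return sum(
        1 for r in range(nranks - 1) if placement(r) != placement(r + 1)
    )


if __name__ == "__main__":
    print("total processors:", TOTAL_PROCS)
    print("cross-box pairs, block placement: ", cross_box_messages(block_placement))
    print("cross-box pairs, cyclic placement:", cross_box_messages(cyclic_placement))
```

Under these assumptions the block placement keeps all but 47 of the 6143 neighbour pairs inside a box, while the cyclic placement sends every pair across boxes. Real applications have richer communication patterns, and on-box NUMA effects are ignored here entirely.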


Last modified: Fri Nov 5 09:39:17 CST 1999 by bart