Project Sponsor
NSF Partnerships for Advanced Computational Infrastructure (PACI) Program,
National Computational Science Alliance (NCSA)
Project Overview
This project investigates and develops new high-performance job schedulers for
the NSF TeraGrid and other leading-edge systems in the
NCSA advanced computational infrastructure. At an NCSA Executive Committee
meeting in August 1996, Applications Technologies (AT) team leaders listed
high-performance job scheduling as one of the top five Alliance infrastructure
needs. The multi-year goal of this project is to develop job schedulers
that deliver optimal performance for the objectives and requirements
of the Alliance user community. These objectives include fast turnaround
for ordinary jobs, the ability to dedicate machine resources
to large breakthrough "hero" simulations (several per month), and the ability
to reserve time for Grid applications that need to run on multiple autonomous
systems.
The scope for this project includes designing new scheduling policies for the
NCSA Origin 2000, the NCSA Linux cluster, and the new $53M TeraGrid.
These NCSA leading-edge systems pose a significant
challenge for job scheduling policies. For example, the NCSA Origin 2000 (O2K) system
receives 8,000 to 12,000 batch jobs per month, and the jobs submitted to the
space-shared nodes collectively
use 90-100% of the available processor cycles and an average of 70-90% of the
available memory, depending on the month.
The key features of our approach are:
- designing new policies based on applicable theoretical scheduling results,
- evaluating the new policies using a highly accurate simulator that is validated
against measured system performance, and
- integrating the new policies into the existing schedulers (by NCSA staff).
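As a rough illustration of simulation-based policy evaluation, the following Python sketch replays a list of jobs on a space-shared machine under strict first-come-first-served (FCFS) order and, optionally, under a simple backfill rule in which a waiting job may start out of order if it does not delay the job at the head of the queue. It is not the project's validated simulator or its production policy: the names `Job`, `simulate`, and `summarize` are hypothetical, runtimes are assumed to be known exactly in advance, and memory is ignored.

```python
import heapq
import math
from dataclasses import dataclass


@dataclass
class Job:
    arrival: float   # submission time, in hours
    procs: int       # processors requested
    runtime: float   # runtime in hours (assumed equal to the requested runtime)


def simulate(jobs, total_procs, backfill=False):
    """Return each job's waiting time, in the same order as `jobs`."""
    if any(j.procs > total_procs for j in jobs):
        raise ValueError("a job requests more processors than the machine has")
    pending = sorted(range(len(jobs)), key=lambda i: jobs[i].arrival)
    queue = []       # indices of waiting jobs, in arrival order
    running = []     # min-heap of (finish_time, procs)
    free = total_procs
    waits = [0.0] * len(jobs)
    nxt = 0

    def start(i, now):
        nonlocal free
        free -= jobs[i].procs
        waits[i] = now - jobs[i].arrival
        heapq.heappush(running, (now + jobs[i].runtime, jobs[i].procs))

    while nxt < len(pending) or queue:
        # Advance the clock to the next event: an arrival or a completion.
        next_arrival = jobs[pending[nxt]].arrival if nxt < len(pending) else math.inf
        next_finish = running[0][0] if running else math.inf
        now = min(next_arrival, next_finish)
        while running and running[0][0] <= now:
            free += heapq.heappop(running)[1]
        while nxt < len(pending) and jobs[pending[nxt]].arrival <= now:
            queue.append(pending[nxt])
            nxt += 1
        # FCFS: start jobs from the head of the queue while they fit.
        while queue and jobs[queue[0]].procs <= free:
            start(queue.pop(0), now)
        if backfill and queue:
            head = jobs[queue[0]]
            # Shadow time: earliest time the blocked head job is sure to fit.
            avail, shadow = free, now
            for finish, procs in sorted(running):
                avail += procs
                shadow = finish
                if avail >= head.procs:
                    break
            extra = avail - head.procs   # processors the head will not need
            for i in list(queue[1:]):
                job = jobs[i]
                if job.procs > free:
                    continue
                if now + job.runtime <= shadow:
                    pass                 # finishes before the head can start
                elif job.procs <= extra:
                    extra -= job.procs   # runs on processors the head leaves idle
                else:
                    continue
                queue.remove(i)
                start(i, now)
    return waits


def summarize(waits):
    """Return (mean wait, 95th-percentile wait) using the nearest-rank method."""
    ordered = sorted(waits)
    p95 = ordered[max(0, math.ceil(0.95 * len(ordered)) - 1)]
    return sum(ordered) / len(ordered), p95
```

For example, with `Job(0, 3, 4)`, `Job(0, 4, 1)`, and `Job(0, 1, 2)` on a 4-processor machine, FCFS makes the third job wait 5 hours behind the wide second job, while the backfill rule starts it immediately without delaying the second job.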
Our results to date include:
- A new Priority Backfill scheduler for the space-shared systems that went into
production on the NCSA O2K space-shared hosts in July 2000. This
scheduler reduced the average job
waiting time from 6.5 hours to 2.4 hours and the 95th percentile waiting time
from 30 hours to 6 hours. The measured improvements, which agreed
with our predictions, occurred during a period in which
the workload submitted to the system remained similar from month to month.
- An O2K workload characterization that includes (a) the
resources requested by each job (i.e., processors, memory, and runtime), (b) the
actual resource usage (i.e., actual runtime, and processor and
memory usage), and (c) a unique approach to capturing the observed correlations
among these job parameters. To our knowledge, this is the first complete
statistical characterization of a large production workload. It can be used to
generate synthetic workloads
that reproduce both the observed characteristics and the observed correlations
among them. The workload characterization
defines the workloads for which the policy evaluation results are valid
and provides insights for designing improved scheduling policies.
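To make the idea of a synthetic workload with correlated job parameters concrete, here is a minimal Python sketch. It is not the project's characterization method, which is not described here. It uses a standard technique instead: it draws two correlated standard normal variables and maps them to log-normal runtime and memory values. The function names and all distribution parameters are placeholders chosen for illustration, not measured O2K values.

```python
import math
import random


def synthetic_jobs(n, rho, seed=0):
    """Draw n (runtime_hours, memory_gb) pairs whose logarithms have correlation rho."""
    if not -1.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [-1, 1]")
    rng = random.Random(seed)
    jobs = []
    for _ in range(n):
        # Two standard normals with correlation rho.
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        # Placeholder log-normal marginals (illustrative values only).
        runtime_hours = math.exp(0.0 + 1.2 * z1)
        memory_gb = math.exp(-1.0 + 0.8 * z2)
        jobs.append((runtime_hours, memory_gb))
    return jobs


def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)
```

Computing `pearson` on the logarithms of a large generated sample should give a value close to the requested `rho`, which is a quick check that the synthetic jobs carry the intended correlation as well as the intended marginal distributions.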
Ongoing work includes significant further improvements in system turnaround time
for the Origin 2000, as well as development of high-performance scheduling
policies for the NCSA Linux cluster and the TeraGrid, including policies
that take into account the diverse job requirements for storage, processing, and
communication network resources.