HPJS Project Home Page

HPJS: High Performance Job Scheduling

Directed by Mary K. Vernon

Project Sponsor

NSF Partnerships for Advanced Computational Infrastructure (PACI) Program,
National Computational Science Alliance (NCSA)

Project Overview

This project is investigating and developing new high performance job schedulers for the NSF TeraGrid and other leading edge systems in the NCSA advanced computational infrastructure. At an NCSA Executive Committee meeting in August 1996, Applications Technologies (AT) team leaders listed high performance job scheduling as one of the top five Alliance infrastructure needs. The multi-year goal of this project is to develop job schedulers that deliver optimal performance for the objectives and requirements of the Alliance user community. These objectives include fast turnaround for ordinary jobs, the ability to dedicate machine resources to large breakthrough "hero" simulations (several per month), and the ability to reserve time for Grid applications that need to run on multiple autonomous systems.

The scope of this project includes designing new scheduling policies for the NCSA Origin 2000, the NCSA Linux cluster, and the new $53M TeraGrid. These NCSA leading edge systems represent a significant challenge for job scheduling policies. For example, the NCSA Origin 2000 (O2K) system has 8,000 to 12,000 batch job arrivals per month, and the jobs submitted to the space-shared nodes collectively utilize 90-100% of the available processor cycles and an average of 70-90% of the available memory, depending on the month. The key features of our approach are:

designing new policies based on applicable theoretical scheduling results,
evaluating the new policies using a highly accurate simulator that is validated against measured system performance, and
integrating the new policies into the existing schedulers (work carried out by NCSA staff).

Our results to date include:

A new Priority Backfill scheduler for the space-shared systems that was put into production operation for the NCSA O2K space-shared hosts in July 2000. This scheduler reduced the average job waiting time from 6.5 hours to 2.4 hours, and reduced the 95th percentile waiting time from 30 hours to 6 hours. The measured performance improvements, which agreed with our predicted improvements, occurred during a period in which the workload submitted to the system remained similar from month to month.
An O2K workload characterization that includes (a) the resources requested by each job (i.e., processors, memory, and runtime), (b) the actual resource usage (i.e., actual runtime, and processor and memory usage), and (c) a unique approach to capturing the observed correlations among these job parameters. To our knowledge, this is the first complete statistical characterization of a large production workload. It can be used to generate synthetic workloads that have the observed characteristics as well as the observed correlations among the characteristics. The workload characterization defines the workloads for which the policy evaluation results are valid, and provides insights for designing improved scheduling policies in the future.
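
The waiting-time figures quoted above for the Priority Backfill scheduler are an average and a 95th percentile. As an illustration only, the following Python sketch shows one way such metrics could be computed from a job log. The log format (pairs of submit and start times, in hours), the function names, the sample data, and the nearest-rank percentile definition are all assumptions made for this example; they are not taken from the project's simulator or from the NCSA accounting data.

```python
import math

def waiting_times_hours(jobs):
    """Return the waiting time of each job.

    `jobs` is a hypothetical log: a list of (submit_time, start_time)
    pairs, both expressed in hours.
    """
    return [start - submit for submit, start in jobs]

def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)

def percentile_nearest_rank(values, p):
    """Nearest-rank percentile (an assumed definition; others exist).

    Returns the smallest observed value such that at least p percent
    of the observations are less than or equal to it.
    """
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]

# Made-up sample log, for illustration only.
jobs = [(0.0, 0.5), (1.0, 3.0), (2.0, 2.0), (2.5, 8.5), (4.0, 5.0)]
waits = waiting_times_hours(jobs)

avg_wait = mean(waits)
p95_wait = percentile_nearest_rank(waits, 95)
print(f"average wait: {avg_wait:.2f} h, 95th percentile wait: {p95_wait:.2f} h")
```

Reporting a high percentile alongside the average matters because a policy can lower the average wait while leaving a few jobs waiting far longer than the rest; the percentile captures that tail.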

Ongoing work includes significant further improvements in system turnaround time for the Origin 2000, as well as the development of high performance scheduling policies for the NCSA Linux cluster and the TeraGrid, including policies that take into account the diverse job requirements for storage, processing, and communication network resources.

vernon@cs.wisc.edu