Data-Driven Batch Scheduling

John Bent

Department of Computer Sciences,
University of Wisconsin-Madison

johnbent@cs.wisc.edu

Abstract:

In this thesis, we present a data-driven batch scheduling system. Current CPU-centric batch schedulers ignore the data needs of workloads and execute jobs by linking them transparently and directly to the data they need. When jobs are scheduled on remote computational resources, this elegant solution of direct data access can incur an order-of-magnitude performance penalty for data-intensive workloads.

To concretely motivate this problem, we provide here a detailed analysis of six current data-intensive, scientific, batch workloads. From this analysis, we derive quantitative bounds on expected scalability and demonstrate the infeasibility of scheduling these workloads using current CPU-centric systems that lack data-awareness.

Adding data-awareness to CPU-centric batch schedulers allows careful coordination of data and CPU allocation, thereby reducing the performance cost of remote execution. To achieve this coordinated schedule, however, batch schedulers need cooperation from storage systems: control over low-level storage decisions must be transferred from the storage system to the batch scheduler. Wielding explicit storage control, the batch scheduler can then carefully coordinate storage and CPU allocations using a variety of data-driven scheduling policies.

We offer one such system: a modified batch scheduling system that understands the nature of batch workloads as revealed by our new measurement study, that leverages the explicit storage control provided by our new distributed file system, and that can use our new analytical predictive models to select one of the five distinct data-driven scheduling policies that we have created.
