In this thesis, we present a data-driven batch scheduling system. Current
CPU-centric batch schedulers ignore the data needs of their workloads,
executing jobs by linking them transparently and directly to their needed
data. When jobs are scheduled on remote computational resources, this elegant
solution of direct data access can incur an order-of-magnitude performance
penalty for data-intensive workloads.
To concretely motivate this problem, we provide here a detailed analysis of six current data-intensive scientific batch workloads. From this analysis, we derive quantitative bounds on expected scalability and demonstrate the infeasibility of scheduling these workloads using current CPU-centric systems that lack data-awareness.
Adding data-awareness to CPU-centric batch schedulers allows careful coordination of both data and CPU allocation, thereby reducing the performance cost of remote execution. To achieve this coordinated schedule, however, batch schedulers need cooperation from storage systems: low-level storage decisions must be transferred from the control of the storage system to that of the batch scheduler. Wielding explicit storage control, the batch scheduler can then carefully coordinate storage and CPU allocations using a variety of data-driven scheduling policies.
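To illustrate the idea of such coordination, the following is a minimal sketch of one possible data-driven placement policy: a scheduler that prefers to place a job on a node already holding its input data, and otherwise records the transfer it must stage before execution. All names and data structures here are hypothetical illustrations, not the system's actual interface.

```python
# Hypothetical sketch of a data-aware placement policy. The scheduler,
# holding explicit control over storage, knows which node caches which
# dataset and uses that knowledge when assigning jobs.

def schedule(jobs, node_contents):
    """Assign each (job, dataset) pair to a node, preferring nodes that
    already cache the job's input; return the assignments and the list
    of data transfers the scheduler must stage for non-local placements."""
    assignments = {}   # job -> node
    transfers = []     # (dataset, node) pairs requiring a staged transfer
    free = set(node_contents)
    for job, dataset in jobs:
        # Prefer a free node that already holds the input data.
        local = [n for n in free if dataset in node_contents[n]]
        node = local[0] if local else sorted(free)[0]
        if dataset not in node_contents[node]:
            transfers.append((dataset, node))
        assignments[job] = node
        free.discard(node)
    return assignments, transfers
```

For example, with two nodes where only `n1` caches dataset `d1`, the job reading `d1` is placed on `n1` with no transfer, while the job reading `d2` is placed on `n2` and a staged transfer of `d2` is recorded. A CPU-centric scheduler, blind to `node_contents`, would place jobs arbitrarily and pay the remote-access penalty on every non-local read.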
We offer one such system: a modified batch scheduler that understands the nature of batch workloads as revealed by our new measurement study, leverages the explicit storage control provided by our new distributed file system, and uses our new analytical predictive models to select among the five distinct data-driven scheduling policies that we have created.