Performance Availabilty for Networks of Workstations

Doctoral Dissertation

Remzi H. Arpaci-Dusseau,

Software systems for large-scale distributed and parallel machines are difficult to build. When run in dynamic, production environments, not only must such systems perform correctly, but they must also operate with high performance. Much of the previous work in distributed computing has addressed the design of large-scale systems that function correctly, in spite of correctness faults of individual components. However, there has been little development of techniques to tolerate performance faults -- unexpected performance fluctuations from the components that comprise the system. Due to this shortcoming, many systems are overly sensitive to performance variations, in that global performance is high if and only if all system components perform exactly as expected.

In this dissertation, we address this deficiency by formalizing the concept of performance availability. Our hypothesis is that modern software systems must provide mechanisms to enable performance availability. Without such mechanisms, global system performance is likely to be unpredictable and substantially less than ideal. By furnishing application writers with the proper tools to cope with common performance faults, robust system performance can be achieved.

To test our hypothesis, we present the design and implementation of River. River provides a generic data-flow environment as a substrate for the construction of performance-robust applications. Two novel, distributed algorithms form the heart of performance availability in River: a distributed queue allows producers to flexibly move data to variable rate consumers, thus avoiding consumer-side performance faults, and graduated declustering carefully allocates producer bandwidth across consumers, similarly avoiding producer-side performance faults. In tandem, with no centralized components or global information, these two constructs can be used to implement performance-robust applications.

We demonstrate the utility and efficiency of the River environment and its core primitives through a series of simulation and implementation experiments. First, we rigorously explore the performance properties of both the distributed queue and graduated declustering, establishing that they perform as desired under a broad range of performance faults. Then, we apply the mechanisms to the construction of several data-intensive query-processing primitives, transforming them into programs that are robust to disk performance faults.

Full paper: Compressed Postscript