TCPP Ph.D. Forum 2009

Scalable Distributed Process Group Control and Inspection via the File System

Research Overview

My Ph.D. thesis, entitled "Scalable Distributed Process Group Control and Inspection via the File System", addresses the need for a common, scalable infrastructure for operating on large groups of distributed files. It does so with a new idiom, group file operations, that provides a simple, intuitive interface. The cornerstone of the idiom is a new gopen operation that creates a group file handle for use with existing file operations (e.g., read and write). The key benefit of the idiom is that it eliminates explicit iteration when the same operation is applied to a group of files. A file-based idiom also promotes conciseness and portability for group operations: file operations are well understood and intuitive, and support for them in programming languages and operating systems is ubiquitous. File operations are also data-format agnostic and work on files containing binary data, text, or both.
Despite these benefits, introducing group semantics for existing file operations while maintaining intuitive behavior presents several unique challenges. In particular, I address the interface and scalability issues associated with group status and data operands. My approach is to define intuitive group semantics for existing file operation interfaces and to make extensions only when necessary.
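To make the idiom concrete, here is a minimal single-machine sketch in Python. It is only an illustration of the idea of a group file handle. The `GroupFile` class, the Python `gopen` signature, and the choice to return one result per member are all assumptions made for this example, and none of them are taken from the thesis or from TBON-FS.

```python
import os
import tempfile


class GroupFile:
    """Hypothetical group file handle: one object standing in for many open files."""

    def __init__(self, paths, mode="r"):
        self._files = [open(p, mode) for p in paths]

    def read(self, size=-1):
        # One call reads from every member; results come back in member order.
        return [f.read(size) for f in self._files]

    def write(self, data):
        # The same data is written to every member; the per-member
        # return values (characters written) act as a group status.
        return [f.write(data) for f in self._files]

    def close(self):
        for f in self._files:
            f.close()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False


def gopen(paths, mode="r"):
    """Illustrative stand-in for a gopen-style call (assumed signature)."""
    return GroupFile(list(paths), mode)


def demo():
    # Create three scratch files, write to all of them with one call,
    # then read them all back with one call.
    with tempfile.TemporaryDirectory() as d:
        paths = [os.path.join(d, f"node{i}.txt") for i in range(3)]
        with gopen(paths, "w") as g:
            statuses = g.write("hello")
        with gopen(paths) as g:
            contents = g.read()
        return statuses, contents
```

The caller never writes a loop over the files. The loop still exists inside the handle, which is why this sketch says nothing about scalability: here every member is a local file visited in sequence.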
Unfortunately, the group file operation idiom alone does not provide scalability when operating on large groups of distributed files. The mechanisms underlying group file operations must be scalable. To this end, I have designed TBON-FS, a new distributed file system that provides scalable group file operations by leveraging tree-based overlay networks (TBONs) for scalable distribution of group file operation requests and aggregation of group status and data results. Aggregation is a key technique for scalable analysis of the vast amount of data produced by tools and middleware at large scales.
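The aggregation step can be sketched as a level-by-level reduction over a tree. The Python function below is a single-process simulation written for illustration only. The name `tree_reduce`, its parameters, and the fixed fan-out are assumptions of this example, and it does not use or reproduce the TBON-FS or MRNet interfaces.

```python
def tree_reduce(values, fanout, combine):
    """Aggregate leaf values level by level, as a tree with the given fan-out would.

    Returns (result, rounds), where rounds is the number of aggregation levels.
    """
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    if not values:
        raise ValueError("need at least one leaf value")
    level = list(values)
    rounds = 0
    while len(level) > 1:
        # Each simulated parent combines the partial results
        # of at most `fanout` children.
        next_level = []
        for i in range(0, len(level), fanout):
            group = level[i:i + fanout]
            acc = group[0]
            for v in group[1:]:
                acc = combine(acc, v)
            next_level.append(acc)
        level = next_level
        rounds += 1
    return level[0], rounds
```

For example, `tree_reduce(list(range(1, 1001)), 16, lambda a, b: a + b)` sums 1000 leaf values in three levels. No simulated node ever combines more than 16 inputs, and the number of levels grows logarithmically with the number of leaves, which is the property that makes tree-based aggregation attractive at scale.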
The goal of my research is to advance the state of the art in the development of tools and middleware for large-scale distributed systems, with a specific focus on HPC systems. Scalable tools and middleware are crucial for the efficient use of HPC resources, as they provide the means for running user applications and diagnosing problems. Only by addressing the scarcity of tools and middleware that can function effectively on the largest current and upcoming systems can we hope to make the most efficient use of HPC systems. By providing an intuitive, general idiom and infrastructure for scalable group operations on distributed files, I hope to encourage the rapid development of new scalable tools and to allow existing software to adopt a scalable solution easily, thereby reducing unnecessary duplication of effort.
More Information

The full text of the TCPP Ph.D. Forum paper submission can be found here. The paper includes an overview of completed work and my plans for furthering the research. A PDF version of the poster is also available.
A paper describing the group file operation idiom and its applications for tools and middleware is available here.
