Research Overview
My Ph.D. thesis, entitled "Scalable Distributed Process Group Control and Inspection via the File
System", addresses the need for a common, scalable infrastructure for operating on large groups of
distributed files using a new idiom, group file operations, that provides a simple, intuitive
interface. The cornerstone of the idiom is a new gopen operation that creates a group file handle
for use with existing file operations (e.g., read and write). The key benefit of
the group file operation idiom is eliminating explicit iteration when applying the same operation
to a group of files. A file-based idiom also promotes conciseness and portability for group
operations. File operations are well understood and intuitive, and support in programming languages
and operating systems is ubiquitous. File operations are also data-format agnostic and work on
files containing binary data, text, or both. Despite these benefits, introducing group semantics
for existing file operations while maintaining intuitive behavior presents several unique
challenges. In particular, I address the interface and scalability issues associated with group
status and data operands. My approach is to define intuitive group semantics for existing file
operation interfaces and only make extensions when necessary.
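To illustrate the idiom, the following is a minimal local sketch in Python, not the interface defined in the thesis. The GroupFile class, the Python signature of gopen, and the choice to return one result per member file are all assumptions made for illustration; the real group semantics for status and data operands are the subject of the thesis itself.

```python
import os
import tempfile


class GroupFile:
    """Hypothetical stand-in for a group file handle (illustration only)."""

    def __init__(self, paths, mode="r"):
        # Open every member file once; the caller sees a single handle.
        self._files = [open(path, mode) for path in paths]

    def read(self, size=-1):
        # One call reads from every member; returns one result per member.
        return [f.read(size) for f in self._files]

    def write(self, data):
        # One call writes the same data to every member; returns per-member counts.
        return [f.write(data) for f in self._files]

    def close(self):
        for f in self._files:
            f.close()


def gopen(paths, mode="r"):
    """Assumed Python analogue of gopen: build a group handle from member paths."""
    return GroupFile(paths, mode)


def demo():
    """Write to and read from three files without an explicit per-file loop."""
    with tempfile.TemporaryDirectory() as directory:
        paths = [os.path.join(directory, f"node{i}.txt") for i in range(3)]

        group = gopen(paths, "w")
        counts = group.write("ready\n")
        group.close()

        group = gopen(paths)
        data = group.read()
        group.close()

    return counts, data
```

The caller issues one write and one read against the group handle; the per-file iteration is hidden inside the handle, which is the point of the idiom. This sketch iterates sequentially on one machine and says nothing about how the operations would be made scalable.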
Unfortunately, the group file operation idiom alone does not provide scalability when
operating on large groups of distributed files. The mechanisms underlying group file operations
must be scalable. To this end, I have designed TBON-FS, a new distributed file system that
provides scalable group file operations by leveraging tree-based overlay networks (TBONs) for
scalable distribution of group file operation requests and aggregation of group status and
data results. Aggregation is a key technique for scalable analysis of the vast amount of
data produced by tools and middleware at large scales.
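The aggregation side of a tree-based overlay can be sketched as a level-by-level reduction. This is a single-process simulation under stated assumptions, not TBON-FS code: the function name tbon_reduce, the fanout parameter, and the combine callback are hypothetical, and no real network communication takes place.

```python
def tbon_reduce(leaf_values, combine, fanout=2):
    """Simulate tree-based aggregation of per-leaf results.

    leaf_values: results produced at the leaves (e.g., one per remote file).
    combine:     function that merges a list of child results into one result.
    fanout:      assumed maximum number of children per internal tree node.

    Returns (aggregated_result, number_of_tree_levels_traversed).
    """
    level = list(leaf_values)
    if not level:
        raise ValueError("at least one leaf value is required")
    if fanout < 2:
        raise ValueError("fanout must be at least 2")

    depth = 0
    while len(level) > 1:
        # Each parent combines the results of up to `fanout` children.
        level = [
            combine(level[i:i + fanout])
            for i in range(0, len(level), fanout)
        ]
        depth += 1
    return level[0], depth


# Example: sum byte counts from 8 leaves through a binary tree.
total, levels = tbon_reduce([1, 2, 3, 4, 5, 6, 7, 8], sum, fanout=2)
```

In this simulation each internal node handles at most fanout inputs, and the number of levels grows logarithmically with the number of leaves, whereas a single front-end collecting results directly would handle every leaf's result itself.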
The goal of my research is to advance the state of the art in the development of tools
and middleware for large-scale distributed systems, with a specific focus on HPC systems.
Scalable tools and middleware are crucial for efficient use of HPC resources, as they provide
the means for running user applications and diagnosing problems. Only by addressing the scarcity
of tools and middleware that can effectively function on the largest of current and upcoming
systems can we hope to maximize the efficient use of HPC systems. By providing an intuitive,
general idiom and infrastructure for scalable group operation on distributed files, I hope to
encourage the rapid development of new scalable tools, as well as allow existing software to
easily adopt a scalable solution, thereby ending unnecessary duplication of effort.
More Information
The full text of the TCPP Ph.D. Forum paper submission can be found
here. The paper includes
an overview of completed work and my plans for furthering the research. A PDF version of the
poster is also available.
A paper describing the group file operation idiom and its applications
for tools and middleware is available
here.