Reducing the Checkpointing Burden of Condor:  Analysis and Implementation

John Bent, Gregory R. Bronner


Long-running processes are computationally greedy.  In a distributed environment, migrating these processes allows effective load balancing.  Checkpointing techniques capture the state of a process so that it can be restarted on a different node.  Unfortunately, checkpointing adds considerable overhead to the system, both in the bandwidth required to take periodic checkpoints and in the latency of the checkpoint events.

Condor is a high-throughput distributed scheduler developed by researchers at the University of Wisconsin.  The specific implementation of checkpointing used by Condor adds even more constraints.  Checkpoints in Condor are stored on a centralized checkpoint server.  The network bandwidth used daily to send checkpoints to the server routinely exceeds fifty gigabytes.  In addition, Condor is an extremely opportunistic scheduler, probing all networked machines and scheduling jobs whenever available resources are found.  The sociological implications when users find Condor jobs running on their machines can be serious.  For this reason, Condor jobs must leave a machine quickly when its user returns, so Condor is often unable to finish a checkpoint in time and job progress can be lost.

In this paper, we explore optimized checkpointing techniques -- methods by which only the modified portions of a job's image must be written to disk.  We start with an examination of work done in this area by James Plank and his group of researchers at the University of Tennessee.  We then conduct a series of measurements on user jobs submitted to the UW Condor pool to see how applicable the Plank techniques are in an actual production environment.  From these measurements we conclude that incremental checkpointing would remove much of the checkpointing bandwidth burden.

Incremental checkpointing writes only those pages of the process image that have been modified since the previous checkpoint.  We add incremental checkpointing to the Condor code and then measure the degree to which it further reduces the checkpoint burden by speeding up the checkpoint event.
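
The idea of writing only modified pages can be illustrated with a small sketch.  This is not the Condor implementation or the Plank group's code; it is a simplified simulation that assumes a 4096-byte page, treats the process image as a byte string, and detects modified pages by comparing per-page digests against those of the previous checkpoint.  Real checkpointing systems more commonly detect modified pages through virtual-memory page protection rather than hashing.  The names page_digests, incremental_checkpoint, and restore are hypothetical.

```python
import hashlib
from typing import Dict, List, Tuple

PAGE_SIZE = 4096  # assumed page size in bytes (illustrative)


def page_digests(image: bytes, page_size: int = PAGE_SIZE) -> List[bytes]:
    """Return one SHA-256 digest per page of the process image."""
    return [
        hashlib.sha256(image[off:off + page_size]).digest()
        for off in range(0, len(image), page_size)
    ]


def incremental_checkpoint(
    image: bytes, prev_digests: List[bytes], page_size: int = PAGE_SIZE
) -> Tuple[Dict[int, bytes], List[bytes]]:
    """Return (dirty_pages, new_digests) for the current image.

    dirty_pages maps a page index to that page's bytes, for every page
    whose digest differs from the previous checkpoint or that did not
    exist at the previous checkpoint.  Only these pages need writing.
    """
    new_digests = page_digests(image, page_size)
    dirty: Dict[int, bytes] = {}
    for index, digest in enumerate(new_digests):
        if index >= len(prev_digests) or prev_digests[index] != digest:
            start = index * page_size
            dirty[index] = image[start:start + page_size]
    return dirty, new_digests


def restore(
    base: bytes, increments: List[Dict[int, bytes]], page_size: int = PAGE_SIZE
) -> bytes:
    """Rebuild an image from a full checkpoint plus ordered increments.

    Assumes the image never shrinks between checkpoints.
    """
    image = bytearray(base)
    for dirty in increments:
        for index in sorted(dirty):
            page = dirty[index]
            start = index * page_size
            image[start:start + len(page)] = page
    return bytes(image)
```

In this sketch, a job that touches two pages of a sixteen-page image between checkpoints writes two pages rather than sixteen, and restore applied to the full checkpoint plus that increment reproduces the current image.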



Full paper: PostScript
Slides: PowerPoint
Source code: tarball