This research was conducted by Guoliang Jin, Aditya Thakur, Ben Liblit, and Shan Lu. The paper appeared in the 25th ACM SIGSPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA 2010).
Fixing concurrency bugs (or crugs) is critical in modern software systems. Static analyses to find crugs such as data races and atomicity violations scale poorly, while dynamic approaches incur high run-time overheads. Crugs manifest only under specific execution interleavings that may not arise during in-house testing, thereby demanding a lightweight program monitoring technique that can be used post-deployment.
We present Cooperative Crug Isolation (CCI), a low-overhead instrumentation framework to diagnose production-run failures caused by concurrency bugs. CCI tracks specific thread interleavings at run-time, and uses statistical models to identify strong failure predictors among these. We offer a varied suite of predicates that represent different trade-offs between complexity and fault isolation capability. We also develop variant random sampling strategies that suit different types of predicates and help keep the run-time overhead low. Experiments with 9 real-world bugs in 6 non-trivial C applications show that these schemes span a wide spectrum of performance and diagnosis capabilities, each suitable for different usage scenarios.
The full paper is available as a single PDF document. A suggested BibTeX citation record is also available.