With the widespread deployment of multi-core hardware, writing concurrent programs
has become inescapable. This has made fixing concurrency bugs (or crugs) critical
in modern software systems. Static analysis techniques for finding crugs such as data
races and atomicity violations do not scale well, while dynamic approaches incur
high run-time overheads. Crugs are especially challenging because they manifest only
under specific execution interleavings that may not arise during in-house testing.
Thus there is a pressing need for a low-overhead program monitoring technique
that can be used post-deployment.
We present Cooperative Crug Isolation (CCI), a low-overhead instrumentation technique
to isolate the root causes of crugs. CCI inserts instrumentation that records occurrences
of specific thread interleavings at run-time by tracking whether successive accesses
to a memory location were by the same thread or by distinct threads. The overhead of
this instrumentation is kept low by using a novel cross-thread random sampling
strategy. We have implemented CCI on top of the Cooperative Bug Isolation framework.
CCI correctly diagnoses bugs in several nontrivial concurrent applications while
incurring only 2-7% run-time overhead.
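
The interleaving record described above can be illustrated with a minimal sketch. This is not the CCI implementation: it replays a hand-written access trace instead of instrumenting a real program, it omits the cross-thread sampling entirely, and the names `record_interleavings`, `site`, and `location` are hypothetical, chosen only for this example. It shows how one might count, for each access site, whether the previous access to the same memory location came from the same thread ("local") or from a different thread ("remote").

```python
from collections import Counter


def record_interleavings(trace):
    """Count local vs. remote predecessors for each access site.

    trace: iterable of (thread_id, site, location) tuples in execution order.
    All three fields are illustrative assumptions, not CCI's actual data model.
    """
    last_accessor = {}  # location -> id of the thread that accessed it last
    counts = Counter()  # (site, "local" | "remote") -> number of observations
    for thread_id, site, location in trace:
        previous = last_accessor.get(location)
        if previous is not None:
            # Same thread as the previous access, or a different one?
            kind = "local" if previous == thread_id else "remote"
            counts[(site, kind)] += 1
        last_accessor[location] = thread_id
    return counts


# A made-up trace: thread 2 writes between thread 1's check and its write.
trace = [
    (1, "check", "balance"),
    (2, "write", "balance"),
    (1, "write", "balance"),
    (1, "read", "balance"),
]
counts = record_interleavings(trace)
```

In this toy trace both writes are preceded by an access from a different thread, so the "remote" count at the `write` site is the kind of observation that could be correlated with failures. A real monitor would also have to keep the last-accessor update itself thread-safe and cheap, which this sketch does not address.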