CrystalBall: predicting and preventing inconsistencies in deployed distributed systems
Maysam Yabandeh, Nikola Knezevic, Dejan Kostic, and Viktor Kuncak
EPFL, Switzerland
One-line Summary
CrystalBall uses run-time model checking to predict future events and steer execution away from inconsistent states in distributed systems.
Overview/Main Points
- Background: model checking
- first proposed for hardware (HW)
- Input (non-determinism)
- current system states: node states, in-flight msgs
- system model: local actions and msg handlers
- exploring the search space defined by the inputs and the model is expensive (factorial order); it is reduced by
- defining properties
- partial-order reduction
- keep previous visited state info with signatures in a hash table
- Bounded model checking with an accepted coverage rate
- BFS
- DFS for a particular input
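The background above (bounded search, BFS, a hash table of visited-state signatures) can be sketched roughly as follows. This is a toy illustration, not CrystalBall's checker: the names `signature`, `bounded_bfs`, `successors`, and `violates`, and the two-counter model, are assumptions made up for the example.

```python
from collections import deque
import hashlib


def signature(state):
    """Compact fingerprint of a state, stored instead of the full state."""
    return hashlib.sha1(repr(state).encode()).hexdigest()


def bounded_bfs(initial, successors, violates, max_depth):
    """Explore states breadth-first up to max_depth transitions.

    Returns the first path found that reaches a violating state, else None.
    """
    visited = {signature(initial)}          # hash table of seen signatures
    queue = deque([(initial, [initial])])
    while queue:
        state, path = queue.popleft()
        if violates(state):
            return path
        if len(path) - 1 >= max_depth:      # depth bound reached, do not expand
            continue
        for nxt in successors(state):
            sig = signature(nxt)
            if sig not in visited:          # skip states already explored
                visited.add(sig)
                queue.append((nxt, path + [nxt]))
    return None


# Hypothetical toy model: two counters; each event increments one of them.
def successors(state):
    a, b = state
    return [(a + 1, b), (a, b + 1)]


# Hypothetical property: the state (2, 1) is considered inconsistent.
def violates(state):
    return state == (2, 1)


path = bounded_bfs((0, 0), successors, violates, max_depth=5)
shallow = bounded_bfs((0, 0), successors, violates, max_depth=2)
print(path)
print(shallow)
```

With a depth bound of 5 the search reaches the violating state; with a bound of 2 it does not, which illustrates the coverage trade-off of bounded checking.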
- CrystalBall
- Goal
- asynchronous local-state checkpointing (regular and forced checkpoints)
- do not violate msg happen-before ordering
- the checkpointing protocol
- each node independently takes checkpoints tagged with a logical timestamp.
- take a checkpoint right before handling a msg whose timestamp is larger than the local timestamp.
- when asked for a checkpoint, force one if the local timestamp is less than that of the checkpoint request msg.
- Other things about checkpointing
- consistent neighborhood snapshot
- developers specify properties to checkpoint
- avoid exploring the same state again by caching the search history in a hash table
- predict later events per node through run-time model checking
- prevent the predicted failures by
- dropping messages
- leveraging the non-determinism in distributed systems
- Execution steering
- event filters: temporarily block certain state-machine transitions by dropping messages
- non-disruptiveness: do not change the original program semantics.
- Limitations of snapshot-based approach
- Inconsistency may have already happened before taking snapshots.
- The model checker may search execution paths that no longer exist because the snapshot is stale (an "old snapshot").
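The three checkpointing rules listed under "the checkpointing protocol" can be sketched as below. This is a minimal single-process illustration under stated assumptions, not the authors' implementation: the `Node` class, its method names, the message format, and the choice to return `None` when no checkpoint with the requested timestamp exists are all hypothetical.

```python
import copy


class Node:
    """Hypothetical node that tags checkpoints with a logical timestamp."""

    def __init__(self, name):
        self.name = name
        self.state = {"received": []}
        self.clock = 0           # local logical checkpoint timestamp
        self.checkpoints = {}    # timestamp -> saved copy of the local state

    def take_checkpoint(self, timestamp):
        self.clock = timestamp
        self.checkpoints[timestamp] = copy.deepcopy(self.state)

    def local_checkpoint(self):
        """Rule 1: a regular checkpoint, taken independently of other nodes."""
        self.take_checkpoint(self.clock + 1)

    def send(self, payload):
        # Every outgoing message carries the sender's current timestamp.
        return {"ts": self.clock, "payload": payload}

    def handle(self, msg):
        """Rule 2: checkpoint right before handling a msg from the 'future'."""
        if msg["ts"] > self.clock:
            self.take_checkpoint(msg["ts"])
        self.state["received"].append(msg["payload"])

    def request_checkpoint(self, ts):
        """Rule 3: force a checkpoint if the local timestamp is behind."""
        if self.clock < ts:
            self.take_checkpoint(ts)
        # Simplification (assumption): None if no checkpoint carries this ts.
        return self.checkpoints.get(ts)


a, b = Node("a"), Node("b")
a.local_checkpoint()             # a: clock 1
b.handle(a.send("hello"))        # b checkpoints at ts 1 before applying msg
reply = b.request_checkpoint(2)  # b is behind ts 2, so it forces a checkpoint
print(b.checkpoints)
```

Because `b` checkpoints before applying the message, its checkpoint 1 does not include a message sent after `a`'s checkpoint 1, so the pair of checkpoints respects the happens-before ordering.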
Relevance
Flaws