Distributed Snapshots: Determining Global States of Distributed Systems

K. Mani Chandy @ Texas-Austin, Leslie Lamport @ Stanford

ACM Trans. on Computer System 1985

1. Introduction

Problem: How to capture the global state of distributed system where only local states are available

Local state: state of each process, state about msgs each process has sent or received

Challenge: how to get the global snapshot from bunch of local snapshots taken not at the same time: maybe undecidable!

What if the problem is reduced only to stable properties?

Stable property y of system D = predicate y on state S of D s.t.

if y(S) is true, then y(S') is always true for every S' which is reachable from S

examples: termination of a computation, deadlock of the system, tokens disappeared, etc.

Deadlock detection & termination detection problems are special case of stable property detection problem!

Also many distributed algorithms are structured as a seq. of phases, s.t.

Each phase = transient part + stable part

transient part: actual work of the phase is done

stable part: system waits indefinitely, doing nothing, until certain stable property becomes true

Main attention of the paper: detection of the termination of a phase

2. Model of a Distributed System

A distributed system = directed graph G(V, E)

V = set of processes, E = set of channels (connections) between processes in V

Channel

infinite buffer capacity

reliable and in-order delivery

delay is indefinite but finite

state of the channel = the set of msgs injected to the channel but not received yet

Process = state transition machine(St, Ev, a)

St = the set of state

a = the initial state

Ev = the set of event

An event is an atomic action and may change the state of the process and possibly the state of channel incident to the process

Event = (p, s, s', M, c)

p = the process in which the event occurs

s = the state of p before the event

s' = the state of p after the event

c(optional) = a channel incident to p

M(optional) = a msg sent to or received from c

State of distributed system = ( {process_state}, {channel_state}, init_state )

init_state = each component process is in its init state & channels are empty

Valid event = (p, s, s', M, c): an event which can occur on the current system state

p is in s; if c is incident to (not from) p, then M is in the head of c

Valid seq. of event: obvious

next_state of the system = next (system_state, event): obvious

A computation = a sequence of states of the system, starting from init_state

3. The Algorithm

3.1 Motivation for the Steps of the Algorithm

Local state recoding

Process state: recorded by each process

Channel state: processes connected by the channel should cooperate

The examples in this section show two anomalies of system states which are composed from local states taken at different times

Anomalies in these examples occur because:

channel's state includes msgs sent after sender's state is recorded

channel's state is taken before sender's state: msgs sent in between the time when channel's state is taken and when sender's is taken are missed

The condition for consistent system record

The state of channel should be the sequence of messages which are

sent before sender's state is recorded

not received when the receiver's state is recorded

Idea

sender records its state -> sends 'mark' msg ->

receiver records its state ->

receiver records channel state until it receives the 'mark' msg

3.2 Global-State-Detection Algorithm Outline

Marker-Sending Rule:

for each outbound channel c

send 'marker' after recoding its state and before sending any further message

Marker-Receiving Rule

if I have not recorded my state {

record my state with current status;

record channel state as 'empty'

} else {

record channel state as the seq. of msgs received after I recorded my state and before I got the 'mark' msg

}

In short, each process records its own state and states of incoming channels!

3.3 Termination of the Global State Recoding Algorithm

	Receiving a marker make the receiver record its state!
	Periodic recording by all processes guarantees the termination
	If the graph is strongly connected, single process spontaneously recording its state guarantees the termination of the algorithm
	How to collect those local states to combine the global state is described in another paper

4. Properties of the Recorded Global State

Notation: S₁ ->(e₁)-> S₂ = global state S₁ goes to S₂ by e₁. Hence S₁, ... S_n is equivalent to e₁, ..., e_n.

Recording local states (including propagating marker) takes time -> System keeps changing its real state while the recording is going on

=> Recorded global state can be different from the state at time of record reading. Even worse,

if system goes S₁, S₂, ..., S_i, ..., S_j, ..., S_n, and the recording starts at S_i and terminates at S_j,

the recorded state S* can be different from any of S_i, ..., S_j

However, it is proved that:

S* is reachable from S_i, and

S_j is reachable from S*

Theorem: There exists a computation e'₁, e'₂, ..., e'_n, s.t.

e₁, ..., e_i-1 = e'₁, ..., e'_i-1 & e_j, ..., e_n = e'_j, ..., e'_n

e'_i, ..., e'_j-1 is a permutation of e_i, ..., e_j-1

for i >= x, or x >= j, S_x = S'_x

there exists k, s.t., i <= k <= j, S* = S'_k

5. Stability Detection

Input: stable property y, i.e., a predicate

Output: Boolean variable definite s.t.,

(y(initial state of the system) -> definite) && (definite -> y(final state of the system))

Note:

y (init state) definite y (init state) -> definite

F T/F T

T T T

T F F

definite y (final state) definite -> y (final state)

F T/F T

T T T

T F F

Hence, definite = false means that y(initial state) = false, and definite = true means that y(final state) = true

Algorithm is obvious from the definition of stability

record a global state S*;

definite = y(S*);

definite	y (final state)	definite -> y (final state)
F	T/F	T
T	T	T
T	F	F