**DISTRIBUTED SNAPSHOT** ======================== This Note is taken heavily from Andrea's: http://pages.cs.wisc.edu/~dusseau/Classes/CS739-S06/Writeups/time.ppt Assumption: graph needs to be *Strongly connected*, to guarantee that the algorithm terminates. SUMMARY: - stable property - this paper, snapshot global state S, and calculate y(S) # Goal, complication, definition Goal - Want to record global state of distributed system (i.e., state of each process, state of each communication channel) - Useful so can observe system properties + Computation terminated? + System deadlocked? + Number of tokens? Complication: - Distributed system has no shared state nor shared clock - Cannot record global state simultaneously everywhere Distributed snapshot: Record local state at different times and combine into meaningful picture Obtain cut in logical time, remain consistent by preserving logical ordering (if not ordering in physical time) # System Model - Distributed system: Finite set of processes and channels; described by graph - Processes + Set of states, initial state, set of events - Channels + FIFO, error-free, infinite buffers, arbitrary but finite delay Why FIFO? infinite buffer? finite delay? > infinite buffer: get rid the case that process tries to put in a full buffer + State (of channel?): Sequence of messages sent but not yet received # Distributed Snapshot Algorithm - Goal: Record local state (each process plus adjoining channels) that produce "meaningful" global state - Idea: After local snapshot, send a marker along channels Receiver records messages in channel before marker - Initial: some process decides to initiate snapshot (perform periodically) # Maker Rules - Marker-sending rule for p: + Send marker along each channel (after recording state of p) before sending more messages - Marker-receiving rule for q on channel c: if q has not recorded state yet: > record state of q > record state of c as empty if q has recorded state already: > record state of c as the msg sequence that arrived since it recorded its state - Termination + When state recorded of all processes and all channels + Must have algorithm to collect and assemble information too (example: banking example in andrea notes) # Properties of Recorded Global State - Recorded global state, S*, may not have occurred - If it bothers you that S* doesn’t actually exist... + Given a permutation of the actual sequence of events + S* is reachable from Sinit + Sfinal is reachable from S* + Stable properties will hold in S* as well - Question: How to permute sequence of events? - Goal: Want snapshot to correspond to single logical cut + Slide events so snapshots taken at same "logical time" + Some events across processes will switch order with others > Specifically, post-recorded events and prerecorded events > prerecorded events occurred before state of p was recorded > Can't tell ordering of concurrent pre and post events # Conclusions - Distributed snapshots: Allow one to reconstruct valid picture of system from snapshots taken at different points in time + Record individual process states plus channels - Snapshots useful for determining if stable properties hold or not - Recorded state S* may not correspond to “reality”, but if stable properties hold in beginning and end states, then hold in S* too