CS 739 Reviews - Fall 2014: Distributed snapshots: determining global states of distributed systems

problem:
- they want a method to capture a snapshot of the global state of the distributed system.
- they are not going to use a global clock, or any other synchronization schemes.
- the problem is that each process can only capture its own state, and they can't synchronize the captures
- they need an algorithm to capture a consistent snapshot of the whole system.

solution:
- they describe the system as processes and the channels between them. an event is a change in the state of the system. they represent it with a 5-tuple (the process, its state before and after the event, and the message and altered channel, if any)
- in their model, each process saves its own state and the state of incoming channels.
- they mentioned that these states can be collected using other methods and they didn't describe that.
- they provide an algorithm to save the states and prove that the captured states are consistent. (the process sends a marker on its outgoing channels upon saving its state. other processes save their state upon receiving the marker)
- they show that these states can be used for detecting the stable properties.
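The 5-tuple event model described above can be sketched as a small data structure. This is only an illustration under stated assumptions: the names `Event` and `apply`, the string-valued states, and the extra `is_send` flag (the paper infers direction from the channel itself) are all hypothetical.

```python
from typing import NamedTuple, Optional

class Event(NamedTuple):
    process: str             # process p at which the event occurs
    state_before: str        # state s of p immediately before the event
    state_after: str         # state s' of p immediately after the event
    channel: Optional[str]   # channel c altered by the event, or None
    message: Optional[str]   # message M sent or received on c, or None
    is_send: bool = True     # assumption: explicit flag, not part of the paper's tuple

def apply(event, proc_states, channels):
    """Return the next global state (process states, channel contents)."""
    if proc_states[event.process] != event.state_before:
        raise ValueError("process is not in the required state")
    new_states = dict(proc_states)
    new_states[event.process] = event.state_after
    new_channels = {c: list(msgs) for c, msgs in channels.items()}
    if event.channel is not None:
        queue = new_channels[event.channel]
        if event.is_send:
            queue.append(event.message)          # message joins the tail (FIFO)
        else:
            if not queue or queue[0] != event.message:
                raise ValueError("message is not at the head of the channel")
            queue.pop(0)                         # message leaves the head
    return new_states, new_channels

# Usage: p sends a token to q over channel c, then q receives it.
states, chans = {"p": "s0", "q": "s0"}, {"c": []}
states1, chans1 = apply(Event("p", "s0", "s1", "c", "token", True), states, chans)
states2, chans2 = apply(Event("q", "s0", "s1", "c", "token", False), states1, chans1)
```

A global state here is just the pair of dictionaries, which matches the idea that a global state is the set of process states plus channel states.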

Posted by: Alireza Fotuhi | October 2, 2014 08:03 AM

Summary,
Given a distributed system that is modeled in terms of a set of processes and events on such processes that take them through a set of states, this paper describes a set of algorithms that can be used to determine a global state for the distributed system.

Problem,
In this model of the distributed system, the processes communicate by sending and receiving messages and can record these messages along with their own state.
Given that each process in the system records its own view of the system state, the authors develop algorithms that can be used to decide on a consistent view of the entire distributed system (which is the global state of the system). This is explained with an analogy of combining the snapshots of a natural scene from different photographers to produce a composite image that represents the scene.

Contributions,
Since the global state of the system is recorded in terms of the messages and states of the processes along with the states of the communication channels, it is easy to record an inconsistent view of the entire system. For example, in a system that conserves a single token, it is easy to record a global state that shows more than one token or no token at all. The algorithms designed in this paper avoid such inconsistencies by having the processes send a marker on the communication channel when they record their state. By clearly defining the rules that processes follow when sending and receiving such markers, the algorithms avoid the inconsistencies in the recorded global state mentioned earlier.
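The single-token inconsistency can be reproduced with a toy timeline. This is a minimal sketch, assuming a hypothetical system in which process p sends its only token to q over channel c; the names `timeline`, `naive_snapshot`, and `token_count` are illustrative only.

```python
# Three successive global states: the number of tokens at p, on c, and at q.
timeline = [
    {"p": 1, "c": 0, "q": 0},   # time 0: token at p
    {"p": 0, "c": 1, "q": 0},   # time 1: token in flight on c
    {"p": 0, "c": 0, "q": 1},   # time 2: token at q
]

def naive_snapshot(times):
    """Record each component at its own, uncoordinated instant."""
    return {part: timeline[t][part] for part, t in times.items()}

def token_count(snapshot):
    return sum(snapshot.values())

# p recorded before it sent the token, c recorded after: the token is counted twice.
two_tokens = naive_snapshot({"p": 0, "c": 1, "q": 1})
# c recorded before the send, p recorded after: the token is missed entirely.
no_tokens = naive_snapshot({"p": 1, "c": 0, "q": 0})
```

Both recordings combine component states that each really occurred, yet neither total equals one, which is the kind of inconsistency the marker rules are meant to rule out.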

Flaws,
In this paper, the authors assume that communication channels between the processes never lose messages or deliver messages out of order. But in reality there is a possibility of message loss or out-of-order delivery, and the algorithms do not handle this case. Further, it is possible that the markers, needed for the correct termination of the algorithms, could themselves be lost, and the algorithms don't handle that case either.

And further, all the individual states of the processes in the system should be collected to form the global state. The authors describe a simple flooding mechanism for achieving this, but again they don't consider failure scenarios.

Relevancy,
Though stateless server design is becoming popular for designing large distributed systems, there are scenarios where maintaining a consistent global state becomes necessary. Examples of such scenarios are deadlock detection, checkpointing etc and these algorithms along with mechanisms to deal with system failures can be used to design such systems.

Posted by: Sathiya Kumaran | October 2, 2014 07:48 AM

Summary:
This paper presents an algorithm for a single process in a distributed system to determine a global state of the system during a computation.

Problem:
Processes cannot all record their local states at precisely the same instant, yet they must use that information to decide the global state. An algorithm is needed so that processes record only their own states and the states of communication channels, and the set of process and channel states recorded forms a global system state.

Contribution:
(1) Formally model the distributed system. A distributed system consists of a finite set of processes and channels. A global state of a distributed system is a set of component process and channel states. The computation of the system is defined to be the event sequence in the state machine.
(2) Design an algorithm to capture global state snapshots. In the global-state recording algorithm, each process records its own state, and the two processes that a channel is incident on cooperate in recording the channel state. The paper also proves that the actual global state is reachable from the snapshot, even though the snapshot may not be an actual global state.
(3) Extend the algorithm to stability detection.

Applicability:
The paper first makes several assumptions: channels have infinite buffers, are error-free, and deliver messages in the order sent. These are not possible in real life, so I don't think there is much direct applicability for this algorithm. However, this paper gives a very clear model of the distributed system and a simple way to analyze its behaviour. It lays a solid foundation for other research, and variants of this algorithm may be useful.

Posted by: Jing Fan | October 2, 2014 07:43 AM

Summary
The paper aims to determine the global states of a distributed system. Based on a distributed system model (consisting of a finite set of processes and channels), the authors propose a simple algorithm to determine the global state of the system.

Problem Description
The global state of a distributed system consists of two aspects: 1) the local states of the individual processes, and 2) the states of the communication channels. The reason we need global states is that some actions need to be executed at particular global states, e.g. on termination or system deadlock. Since a distributed system is deployed in a decentralized fashion, to collect information one may need to traverse all the nodes (each node may have its own local state). Furthermore, synchronizing all the nodes using a single global clock is extremely hard, so we cannot determine the global state based on local clocks. Finally, since the network’s behavior is random, the network delay may vary over time. All of these difficulties make this problem hard to solve.

Contribution
1. The first contribution comes from the model they built to describe the distributed system through the concepts of processes and channels. Although the model makes some assumptions, e.g., infinite buffers and error-free, in-order delivery, it helps the authors focus on the main problem they want to solve (global state determination). Also, this nice model can be either directly applied or modified, so other researchers can benefit from it.
2. The second contribution is that they identified the requirement for consistent global states based on their model, and they further proposed a simple global-state-detection algorithm to solve the problem. Although later research may use different models, it still needs to take this condition into consideration.
3. They investigated the properties of the global state recorded by the proposed algorithm. They proved that the recorded global state S* is reachable from Si, and that the termination state is reachable from S*.
4. A small example is shown: they use the proposed algorithm to solve the stability-detection problem. This shows their algorithm can be applied to a real scenario.

Discussion
My concern is whether snapshots can give the real-time actual global state, since all the recorded process and channel states must be collected and assembled before forming the recorded global state. Still, snapshots can be used to detect stable properties.

For the research community, this paper has a nice mathematical model of a distributed system, which can be applied or modified in later research. For real distributed systems, this algorithm can be easily implemented for system monitoring or deadlock detection purposes.

Posted by: Shu Wang | October 2, 2014 07:42 AM

SUMMARY: The paper presents an algorithm for determining global state of a distributed system using the underlying system for transport without affecting the computation itself.

PROBLEM: Capturing the state of all nodes and channels simultaneously is not possible because there is no global clock, and processes can only record their own state. If the state of a channel is recorded followed by the state of a process receiving from that channel, it is possible a message could be duplicated (and depending on ordering, messages could be lost as well). As such, an algorithm is needed that can work over a period of logical time to correctly record the state of all processes while maintaining the consistency of the communications between them. Furthermore, it must do so using no out-of-band communications, and therefore must overlay on top of existing processes and channels without disrupting the underlying computation.

CONTRIBUTIONS: Before this paper, some research was taking an informal approach and coming up with incorrect algorithms. This paper formalizes the exact definitions of process, channel, local, and global states and their relationships in order to make properties of the algorithm provable (and then proves them). Although they describe an example model of a distributed system, it's important to note the algorithms presented don't require any specific model to work. The authors nicely address the problem of non-determinism by showing that the global state determined is a possible state, reachable from the initial state, and leading to the same terminal state. Finally, the paper solves not just a specific problem such as "deadlock detection" or "termination" but presents an algorithm for detecting any type of externally defined stable property.

APPLICABILITY: Like the previous paper, this one deals with the concept of time and the challenges it poses. There is no such thing as "instantaneous communication" so time problems will always be relevant. Beyond the problem of stability detection, determining global state has a major impact on the idea of checkpointing and migration. Even today we are working on challenges of checkpointing, and while you can now checkpoint an entire VM, we often deploy clusters of computers to solve large problems, and the issues of communications that are "in-flight" between cluster nodes is exactly the same problem as described here. However, since over the broader Internet we have even less control of the channels themselves, additional work needs to be done to address the unreliable (and untrusted) nature of the channels.

Posted by: Zach Miller | October 2, 2014 07:36 AM

Summary:
- The paper lays the foundation for recording events and message passing using (p, s, s’, M, c).
- The paper develops timelines by reordering all prerecording events to precede all post recording events.
- The paper develops stability detection by checking the reply of an externally defined function on the global state.

Problem:
- The paper wants to piece many separate events to draw a meaningful conclusion and check stable properties of a distributed system.
- There is no shared clock, so processes will have to communicate with each other to obtain the order of events.
- It is difficult to piece together snapshots because there are many different states with non-determinism. E.g. piecing many snapshots together (Fig. 8) could result in something that never actually happened (Fig. 7)

Contributions:
- The paper develops stability detection, such as termination and deadlock detection, from distributed snapshots. This contribution opens the door for more advances in distributed systems, such as starting the next computation or changing the system to a low power state when termination of the computation has been detected.
- The paper presents proofs that guarantee that stable property will hold if definite is true. This helps with the implementation because there is less of a need to check for correctness.

Applicability:
- The technical contribution of this paper is significant under the assumption that there is no common clock; otherwise piecing snapshots together would be trivial. However, a common physical clock is required for the strong clock condition, and we do know the requirements to implement a common clock [Lamport 1978]. Therefore, depending on the cost of actually implementing a common clock, the applicability of the technical contribution of this paper may be limited.
- Having stability detection is crucial when resources are shared, as in Hadoop multi-node clusters. The client and the application will want to be able to track the progress of large MapReduce jobs submitted to the cluster. Without stability detection, it will be difficult to detect deadlocks, and a buggy process could hog the resources until the nodes are manually restarted.

Posted by: Kai Zhao | October 2, 2014 07:24 AM

Summary
Chandy and Lamport provided a formal characterization of distributed computation, and in that simplified model they devised a global snapshot algorithm that can capture a "meaningful" global state and proved its correctness.

Problem
The main goal is to capture a global state of a distributed system with the following constraints --

No central authority
No access to synchronized global (physical) clock
each process records only the messages that reach it
the recording procedure should not affect the computation of the system

Contribution

Simplified modelling of Distributed system and definition of process state, global state. They also defined the relationship between these and point of computation formally.
Introduction of a simple marker algorithm and a proof that it can capture a "meaningful" global state.
They extended the concept of logical clock to process snapshot capture with minimal overhead
Definition of event as quintuple (process, local state before the event, local state after the event, message, channel through which the message is sent) is a smart simple way of capturing event state. This greatly simplifies their further analysis of correctness of the algorithm(s).
They viewed a DS as a large distributed state machine and tried to build the state of the whole machine from each individual node's view of the system.

Limitation
I am slightly concerned about the assumptions they made about the system, and whether a real system with enough nodes can meaningfully be modeled like that -- no packet loss, no failure, only two channels, etc. How the algorithm reacts when any of these assumptions breaks should be discussed in a practicality-oriented paper. Also, the global snapshot is synchronous in a logical sense and can vary greatly in actual (absolute) time; in some DS use cases this might not be acceptable.

Applicability
Apart from the concern about the criticality of the assumptions, I think their solution is remarkable in terms of simplicity and robustness, and it is widely used in deadlock detection and global state capture.

Posted by: Rahul Chatterjee | October 2, 2014 07:18 AM

Summary
The authors in this paper present a distributed algorithm for determining the global state of the system.

Problem
In a distributed system, where sending/receiving messages is the only way of communication, finding the global state of the system is difficult due to:
1) Arbitrary communication delays
2) The processes can only record their local state
3) The processes cannot record their state at same time due to absence of a global clock

Due to these limitations, finding the global state at an instant is difficult. The paper relaxes the criterion of a global state at an instant to that of a consistent global state and proposes an algorithm to provide the same.

Contributions
1) The authors provide a formal definition of distributed computation.
2) The authors propose a distributed algorithm to provide a "meaningful" global state. According to the algorithm, a process p1 records its local state and sends a marker on each of its outgoing channels. Another process records its own state when it first receives a marker, and records as the state of a channel the messages that arrive on it after the process recorded its state and before the marker arrives on that channel.
3) The authors provide a proof that the proposed snapshot algorithm yields a consistent global state of a distributed system.
4) They also show how to use the proposed snapshot algorithm to detect stable properties.

Limitations:
1) The algorithm assumes too much. In my opinion, the assumptions greatly reduce the applicability of the algorithm.
2) The state given by the algorithm can be considered as good guess rather than "actual" state.

Relevance:
The paper gives a formal definition of distributed computation. It provides an algorithm for producing a "best guess" snapshot of the system without centralized control. Although the assumptions and the best-guess character of the snapshot limit the applicability of the algorithm, it is one of the first papers to try to solve the problem of taking a snapshot of a distributed system without centralized control.

Posted by: Sreeja Thummala | October 2, 2014 04:14 AM

Summary:
This paper presents an algorithm for recording the "meaningful" global state of a distributed system without interrupting the normal computation. This global state recording algorithm can be used to detect "stable properties" such as deadlock detection, token disappearance, etc.

Description:
In a distributed system, it is hard to record a global snapshot of the system because each process works independently. A centralized process can be used, but that would block the computation on all the processes to gather the states from the whole system. The algorithm presented in this paper uses markers, which control the state capture at every process. A sending process sends a marker after it has recorded its state and before sending any more messages. When a process receives a marker, it has to record its state (if it has not already done so) and the state of the incident channel as the sequence of in-flight messages. When the algorithm terminates, each process knows the state of every other process and the in-flight messages in all the channels. However, this recorded state may not occur in the sequence of states the system went through. But still this is a "meaningful" global state of the system.
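The marker-sending and marker-receiving rules can be simulated in a few lines. This is a minimal sketch, not the authors' formulation: it assumes two processes, reliable FIFO channels modeled as deques, and a token-passing application; the class `Process` and the helper `total_tokens` are hypothetical names.

```python
from collections import deque

MARKER = "MARKER"

class Process:
    def __init__(self, name, tokens, incoming, outgoing):
        self.name = name
        self.tokens = tokens          # application state: tokens currently held
        self.incoming = incoming      # {channel name: deque}, FIFO
        self.outgoing = outgoing      # {channel name: deque}, FIFO
        self.recorded_state = None    # local snapshot, once taken
        self.channel_state = {}       # recorded in-flight messages per incoming channel
        self.recording = set()        # incoming channels still being recorded

    def send_token(self, channel):
        self.tokens -= 1
        self.outgoing[channel].append("token")

    def record(self):
        """Sending rule: record own state, then send a marker on every outgoing channel."""
        self.recorded_state = self.tokens
        self.recording = set(self.incoming)
        self.channel_state = {c: [] for c in self.incoming}
        for queue in self.outgoing.values():
            queue.append(MARKER)

    def receive(self, channel):
        msg = self.incoming[channel].popleft()
        if msg == MARKER:
            if self.recorded_state is None:
                self.record()                 # first marker: this channel is recorded as empty
            self.recording.discard(channel)   # stop recording this channel
        else:
            if self.recorded_state is not None and channel in self.recording:
                self.channel_state[channel].append(msg)  # in flight at snapshot time
            self.tokens += 1

def total_tokens(processes):
    held = sum(p.recorded_state for p in processes)
    in_flight = sum(len(m) for p in processes for m in p.channel_state.values())
    return held + in_flight

pq, qp = deque(), deque()
p = Process("p", 1, {"qp": qp}, {"pq": pq})
q = Process("q", 0, {"pq": pq}, {"qp": qp})
p.send_token("pq")   # token leaves p
p.record()           # p initiates the snapshot; marker follows the token on pq
q.receive("pq")      # q gets the token
q.send_token("qp")   # q passes it back before seeing the marker
q.receive("pq")      # q gets the marker, records, and sends its own marker on qp
p.receive("qp")      # token arrives after p recorded: counted as channel state
p.receive("qp")      # marker closes the recording of qp
```

In this run both processes record zero tokens and the one token is attributed to channel qp, so the recorded global state still conserves the single token.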

Contributions:
1. A good contribution of the paper is properly defining the model of the distributed system in terms of processes, channels, events and global state, and the relationships between them. It provides a very simple example to explain the problem and the proposed solution. The proposed solution also uses very simple sending and receiving rules.
2. The proposed algorithm does not interrupt the normal execution of the computation, i.e. it doesn't block execution to record the global snapshot. It is a kind of epidemic algorithm, where an infected process sends markers to neighboring processes.
3. The author provides very good reasoning about the correctness of the algorithm. At the time this paper was written, the other algorithms which existed were either incorrect or impractical.

Applicability:
As the paper mentions, the algorithm can be used to detect "stable properties". It can also be used for checkpointing. Different processes can cooperate using the algorithm to agree upon a checkpoint of the whole system, which can be used to recover the system in case of a crash.

Posted by: Avinaash Gupta | October 2, 2014 02:19 AM

Summary:
The paper proposes a simple algorithm for determining the global state of the system while the process is performing the computations in the distributed environment. It has proofs and illustrations for support. It also provides insights on solving the stability detection problem that can be used in cases such as the deadlock detection.

Problem:
• A process running on a system can only log information about itself (both Tx and Rx) but nothing else. Hence understanding the global state at any given instant is a challenge.
• The above issue is escalated since the process is required not to hamper its ongoing computations since it would affect its performance.

Contributions:
• The work in this paper resonates with parts from the previous paper by Lamport. The main idea is to achieve understanding of the global state at any instant in spite of the lack of accurately synchronized clocks.
• The key idea in the paper is to observe and record the sequence of events in a system and then use this as input to the model developed by the authors to determine the actual global state of the system. Before this, people did not really think that they could attain information about the global state of the system without synchronized clocks.
• The idea used in the state-detection algorithm is to enforce the marker-sending and marker-receiver rules to record the states. This however, can lead to some unexpected state because of the variable latencies in the channel.
• The authors prove that such an unexpected state (S*) can actually be reached through a computation sequence that is a permutation of the events that occurred, and the theorem shows how to obtain this sequence. So if the system’s end points were S’ and S”, then S* is reachable from S’ and S” can be reached from S*.
• The stability detection algorithm is an extension to the work and the authors show how it can be used to solve some practical problems such as deadlock detection.
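The permutation argument above can be pictured as a stable partition of the observed events. This is a rough sketch under stated assumptions: each event is a hypothetical `(process, label, is_prerecording)` tuple, where the flag says whether the event happened before that process recorded its state. The paper's proof justifies each swap of adjacent events; the code only shows the resulting order.

```python
def reorder(events):
    """Place all prerecording events before all postrecording events.

    Relative order inside each group is preserved (a stable partition).
    """
    pre = [e for e in events if e[2]]
    post = [e for e in events if not e[2]]
    return pre + post

# p records first and then sends M; q takes one more step before the marker arrives.
seq = [
    ("p", "send M", False),          # after p recorded: postrecording
    ("q", "internal step", True),    # before q recorded: prerecording
    ("q", "receive M", False),       # after q recorded: postrecording
]
permuted = reorder(seq)
```

In the permuted sequence, the state reached right after the last prerecording event corresponds to the recorded state S*.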

Limitation:
I think this is a very nice and simple methodology and is well presented with illustrations. The only place of concern would be its assumptions about the system robustness (a system is very likely to be prone to errors) and also lack of more examples of huge multi-node systems.

Applicability:
The work in this paper is definitely commendable and, although I am not sure, it seems many debugging tools for distributed systems, such as the HPCToolkit, use this idea (maybe in a more evolved form) in their implementation, since getting the global state information continues to be non-trivial.

Posted by: Chetan Patil | October 2, 2014 02:06 AM

Summary
The paper provides a formal representation of a distributed system as a set of processes connected by communication channels, and looks to identify a mechanism to construct a global distributed snapshot of such a system that is meaningful for inferring global system state and tracking what are described as ‘stable properties’ of the system. The authors also describe how such a snapshot can be used to solve problems such as deadlock detection, detection of computation termination, and checkpointing.

Problem
Processes are capable of recording their internal state and communicating it over channels that connect to other processes in the system. However, to attain an exact global snapshot, all the processes would need to capture their internal state nearly simultaneously. The authors compare the problem to stitching together several photographs, each taken independently, to form a meaningful panorama. Despite not being the same as an exact global snapshot, this stitched image can still provide key input in identifying whether stable properties continue to hold. Constructing this snapshot is the problem being addressed.

Contributions

The marker algorithm to create a distributed snapshot. Processes record their internal state and send out markers over the communication channels that emerge from them. A process that receives a marker and hasn’t yet recorded its state would have seen all messages sent on that channel before the marker and internalized them, and will hence simply record its own state, marking the channel as empty. A process that has already created a local recording attributes all messages received between its recording and the next marker to the incident channel. This provides a consistent set of process and channel states, which can be stitched together to arrive at a global state.

The argument that the recorded global state, despite not necessarily belonging to a real observable global state, can be reduced to one of the global termination states, and is hence consistent enough to identify stable properties of the system holding is a contribution, because it’s not necessarily intuitive unless explained.

Applicability
This being a theory paper makes some strong assumptions about the operating conditions of the system. Despite these assumptions, the authors make a strong case for its use in identifying the state of stable properties in the system, and its application in deadlock detection and check pointing.

Posted by: Vijay Kumar | October 2, 2014 01:54 AM

Global state detection is one of the problems in distributed systems. Finding a correct global state of a system can be very helpful in situations such as finding the status of a computation, finding whether the system is deadlocked, and finding a stable state. This paper tries to solve the problem of recording the global state without affecting the system. The paper assumes that there are communication channels between processes and that each channel is error-free, that is, messages are received in order, and has an infinite buffer. The paper also explains the deterministic and non-deterministic models of a sample distributed system and the stable states in them.

Then the paper, with examples, shows inconsistent states that can be recorded, such as two tokens or no tokens in a single-token system. They also derive inequalities for the model, and from these they arrive at an algorithm for recording the global state. The algorithm introduces a marker, which is sent without disturbing the ongoing computation but marks the beginning of the snapshot. A process that receives a marker records its own state and the messages on its channel. To continue the snapshot, it then sends markers on to the other nodes. The algorithm terminates once no marker remains in the incoming queue of any node, provided every process takes finite time to handle messages. They apply the recording algorithm to an example model to record a snapshot. They prove that, even though the snapshot may not match any state the system actually passed through, it matches a state in a permutation of the computation, so if a stable property holds for the snapshot, it holds for the system too.

Contributions:
1. This paper proves that an algorithm could be developed to record the global state of the system even in the absence of synchronized clocks.
2. With the help of markers and the in-order (FIFO) channels, they make sure that a consistent state of the system is recorded.
3. They have also come up with a stable property proof, that the global state can be used to verify the stability property of the system.

Limitations:
1. The paper doesn't talk about the packet failures in the network.
2. It also assumes infinite length queue at the nodes.

Posted by: Dinesh Rathinasamy Thangavel | October 2, 2014 01:45 AM

Summary:
Leslie Lamport and K. Mani Chandy provide an algorithm to determine the global state of a distributed system that is currently 'in motion' running computations, and without the assumption of a common clock. One process initiates the request, and each process can only record its own state. With the cooperation of all the other processes, stable properties such as “system deadlock”, “computation complete”, and “all tokens in a ring have disappeared” can be identified. A proof is also provided to demonstrate the correctness of their algorithm.

Problem:
Several earlier algorithms for determining global states such as deadlocks were reportedly inaccurate and impractical. Part of this was due to a lack of understanding regarding the various relationships between local process states, global system states, and the various points in distributed computations. Capturing the state of a system that is effectively in motion without disrupting the various processes is a complex task and requires a composite picture to be created that effectively represents the amalgam of many different individual process states while still resulting in a 'meaningful' global representation. We need to form a picture of the global state based on local processes recording their own states without the use of a 'common clock'. To correctly identify stable properties we need to detect that 1-a phase has ended, and 2-a new phase was initiated.

Contributions:

1-The distributed system model:
Extreme clarity regarding all aspects of the theoretical model is provided which casts new light on the problem. The distributed system is likened to a strongly connected graph, which behaves like a non-deterministic finite state machine where various vertices are represented as processes that can interact.

2-The Global State Detection Algorithm:
One process must initiate by sending a marker, and as each other process receives the marker, they update their own respective states, until finally, the global state is returned to the initiating process. This allows the system to identify a stable property of the global state, although this happens without disturbing the underlying computations, and this is all done without the assumption of a physical clock. This is a bit more problematic in an event such as a deadlock, and towards the end a solution is provided for the stability detection problem.

3-A proof that the Global State Detection Algorithm is correct:
Instead of trying to provide an implementation, the authors focus on proving that their algorithm is correct, and that it terminates. In order to provide a provable solution, it was assumed that buffers are infinite and message delay is arbitrary, but finite.

Applicability:
The coverage of the stability detection problem seemed too brief; and I was a little disappointed that they didn't provide a deeper explanation. Overall, the authors provided an improved theoretical model to describe the problem more accurately (vs their predecessors), as well as an algorithm which they proved was mathematically correct. Their global state detection algorithm was groundbreaking back in 1985 when this was written, and I would guess that many aspects of their solutions are currently implemented in modern systems.

Posted by: Jason Feriante | October 2, 2014 01:37 AM

The authors discuss an algorithm that is used to record the global state of a distributed system on the fly from the local states of the processes and the communication channels, in the absence of any shared memory or a global clock. They also illustrate the use of the detected global state to determine whether a stable property holds.

Contributions :

1. They propose a simple algorithm using marker sending and receiving rules to ensure that the global state detected is consistent and it doesn’t affect the underlying computation.
2. The idea behind the marker mechanism is quite clear and straightforward. Every process sends a marker along the FIFO channel after recording its state and before sending any further message. On the other hand, if the receiving process has not recorded its snapshot, it includes the messages preceding the marker in its own snapshot; otherwise it includes the messages preceding the marker in the snapshot of the channel.
3. They also guarantee that this algorithm terminates: the initial process propagates the marker over its channels to other processes, and this continues, so all processes reachable from the initial process eventually receive a marker and record their state in finite time. The local snapshots are combined to form the global snapshot.
4. Since the global state detected need not be one of the transition states, they also formally proved that this state is reachable from the initial state and also that the final global state is reachable from this recorded global state.
5. They extended the usage of the algorithm to determine whether or not a stable property (e.g. deadlock or termination of the computation) holds. It is based on the reasoning that, if a stable property holds before the snapshot algorithm starts, it will hold in the recorded global state too.
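
The termination argument in point 3 can be illustrated with a small simulation. This is only a sketch under stated assumptions: the adjacency-list encoding of channels, the name `flood_markers`, and the single FIFO queue standing in for arbitrary but finite message delays are illustrative choices, not something taken from the paper. It shows that every process reachable from the initiator records its state, and that one marker is sent on each outgoing channel of a process that records.

```python
from collections import deque


def flood_markers(channels, initiator):
    """Simulate marker propagation; return (recording order, markers sent).

    channels maps each process to the processes it has outgoing channels to
    (a hypothetical encoding chosen for this sketch).
    """
    recorded = []         # processes, in the order they record their state
    in_flight = deque()   # markers in transit, as (sender, receiver) pairs
    markers_sent = 0

    def record_and_send(p):
        # Marker sending rule: record state, then send a marker on every
        # outgoing channel before any further ordinary message.
        nonlocal markers_sent
        recorded.append(p)
        for q in channels[p]:
            in_flight.append((p, q))
            markers_sent += 1

    record_and_send(initiator)
    while in_flight:
        _sender, receiver = in_flight.popleft()
        # Marker receiving rule (state part only): the first marker a
        # process sees makes it record its own state.
        if receiver not in recorded:
            record_and_send(receiver)
    return recorded, markers_sent


ring = {"p": ["q"], "q": ["r"], "r": ["p"]}
order, sent = flood_markers(ring, "p")
print(order, sent)
```

If the graph is not strongly connected, a process that the initiator cannot reach never receives a marker and never records its state, which is why the connectivity assumption matters for termination.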

Discussion :
A couple of idealized assumptions have been made about the model of the distributed system, such as that the channels are error-free and have infinite buffers. In a practical scenario, this is not always the case. If a process does not receive the marker, the algorithm would not work as expected. The algorithm assumes that there are no failures in the system.

Relevance :
This is a basic algorithm and probably the baseline for many other snapshot algorithms. The concept of global state snapshots is widely used in distributed systems, mainly for stable property detection, checkpointing, state monitoring, etc.

Posted by: Krishna Gayatri Kuchimanchi | October 2, 2014 12:58 AM

Summary:

In the paper the authors provide an algorithm, backed by a theoretical proof, by which a process in a distributed system can determine a global state of the system during a computation, i.e. take a meaningful snapshot of the distributed system.

Problem:

Determining the global state of a distributed system is inherently difficult: each process can record only its own state and the messages it sends or receives, and the processes cannot all record their local states at precisely the same instant unless they have access to a common 'clock'. In a distributed system, processes usually share neither a common clock nor memory. The important problem, therefore, is to devise an algorithm by which processes record their own states and the states of the communication channels so that the recorded process and channel states together form a consistent global system state.

Contribution:

A clear and formal definition of relationships among local process states, global system states and points in a distributed computation that were previously not well understood or defined.
The introduction of the "marker" in the global-state-detection algorithm was a simple yet effective idea to iron out inconsistencies.
The global-state-detection algorithm is superimposed on the underlying computation; i.e. it runs concurrently with the computation without altering it.
Theoretical proof for the correctness of the algorithm is provided.

Applicability:

This is one of the most notable papers that laid the foundation of the modern theory of distributed systems, as it solves one of the fundamental problems: determining consistent global states in a distributed system. Some of the basic assumptions in the paper can be approximated by using a more reliable communication protocol such as TCP/IP, and the algorithm can also be adapted so that multiple snapshots occur simultaneously. Hence this is a very relevant and applicable paper for today's distributed systems.

Posted by: Saikat R. Gomes | October 2, 2014 12:19 AM

Summary: This paper first gives a rigorous definition of the global state of a distributed system. Based on that definition, the paper gives an algorithm by which a process can determine the global state of the distributed system. The algorithm takes a snapshot: it only observes the state of the processes without altering it. Furthermore, a proof and examples of the algorithm's correctness are also given.

Problem: Determining the global state of a distributed system is very important for detecting stable properties, e.g. "terminated computation" or "deadlock". The problem is also very challenging because (1) a rigorous definition of global state in a distributed system is lacking; (2) the algorithm can only snapshot the local state of each process without altering the process (e.g. stopping it), so we cannot get the states of all processes at the same instant; and (3) there is no shared clock, so no absolute order of events can be observed.

Contributions:
1. The authors define a process as a state machine over local states and events. Based on that, the global state of the distributed system is defined as the aggregation of the local state of each process and the message queue in each channel between processes.

2. Based on the definition, the authors propose an algorithm to determine the global state. The algorithm uses marker messages. First, a process records its local state and then sends a marker on all its outgoing channels. The marker triggers each process that receives it to do the same thing (record its state, record the messages it received, send the marker). They prove that this algorithm eventually terminates with a recorded global state (when the processes are strongly connected).

3. Based on the properties of the recorded global state, they further use the algorithm to determine whether a stable property holds (and prove the correctness).

Application: This is a theory paper, valuable for pioneering a rigorous definition of local and global state in a distributed system and for giving a formally proved algorithm to determine the global state. As a theoretical paper, it inevitably oversimplifies the problem: in practice, a process may be too complicated to represent with a limited number of states, connections are unreliable and have transmission errors, and messages may not arrive in the order they were sent. Nonetheless, this very early paper captures a simple principle of distributed systems, which makes the algorithm applicable.

Posted by: Shike Mei | October 2, 2014 12:17 AM

Summary:
The authors present one method for tackling the problem of capturing useful global state in distributed systems without the ability to synchronize simultaneous snapshots from all separate nodes ("processes").

Problem:
It can be useful to have a global view of the state of a distributed system to reason about the current state, diagnose problems from past state, or checkpoint the system. However, attaining such a view is not simple, as coordinating the recording of state at all processes at the same time is not likely to succeed.

Solution:
The authors, starting with a simple model of a system (always-up, state machine processes, FIFO directed channels between processes), present a technique wherein the recording at each process can occur at different times and still produce a meaningful global state, S*, when merged.

While S* may not represent the actual global state at any single time during computation by the system, the authors proved that if the snapshot began at system state S', and ended at S", then S* is reachable from S' and S" is reachable from S*.
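
The reachability claim can be written a little more formally. The following is a sketch only: the indices \iota and \phi (the global states in which the algorithm starts and finishes, so that S' = S_\iota and S'' = S_\phi in the notation above) and the tilde notation for the reordered computation are labels chosen here for illustration and may not match the paper's exact statement.

```latex
\begin{aligned}
&\text{Let } seq = (e_i)_{i \ge 0} \text{ be the actual computation, with } S' = S_\iota \text{ and } S'' = S_\phi. \\
&\text{Then there is a computation } \widetilde{seq} \text{ that agrees with } seq \text{ outside } e_\iota, \dots, e_{\phi-1} \\
&\text{and permutes only those events, such that } \\
&\qquad \widetilde{S}_\iota = S', \qquad \widetilde{S}_k = S^{*} \ \text{for some } \iota \le k \le \phi, \qquad \widetilde{S}_\phi = S''.
\end{aligned}
```

In words: S* might never have occurred in the real run, but it does occur in a run that performs the same events in a different, equally legal order.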

Contributions:
1) While not an instantaneous snapshot of a global state reached by the system, the proof of the characteristics of S* shows it can be used for some interesting tasks, such as the stable property test described in the paper.
2) It seems simple while reading it today, but the ability to produce a feasible global state consistent with actual execution without affecting computation in a distributed, non-synchronized environment is not trivial.

Applicability:
Certainly if a global view of a system that is guaranteed to maintain system invariants is desired today, this technique is still applicable, and would produce better (w.r.t. invariants and reachability) results than some central entity querying all nodes periodically and maintaining a sort-of state of the system. For the latter technique, you may see out-of-sync states between nodes which don't always make sense.

With failures, however, it is hard to see the difference. How could an algorithm like the one presented provide any guarantees of the usefulness of S* if a process is unavailable to detail its state approaching/at its point of failure?

Posted by: Brandon Davis | October 1, 2014 11:51 PM

Summary:
The paper is about the state of a distributed system. It explains the relationships between local process states, global system states, and points in a distributed computation. After defining a model of the distributed system, the authors summarize the information required to define the state of the distributed system, and propose an algorithm to take a snapshot of that state on the fly without affecting the normal operation of the distributed system.

Problem:
As the name suggests, a distributed system’s components are distributed, and so is the state information about the system. Intuitively, it is harder to collect all the state information spread over the components at one instant than to get the state information of a single machine. Moreover, it is not feasible to plan a future point in time at which all the involved components take a snapshot independently at the “same” time, with a node then aggregating the snapshots, because the nodes do not have exactly synchronized clocks. In addition, some state information might be in transit over a communication channel; it cannot be moved from one node to another in zero time (at least in classical physics), so the best place to collect it is the node involved in the communication, and the collection should be done asynchronously. In summary, it's challenging to save and collect the information from all the distributed system's components and to make the aggregated information reasonable and useful.

Contributions:
This paper provides a very clear definition of the system state, and discusses what kind of aggregated state information is useful.
This paper also clearly defines the roles of the node in taking snapshot. The messages transmitted over the channels are saved locally in the node.
The algorithm can make sure the snapshot is consistent. And we don't need to pause the system in order to get the snapshot.
The algorithm is very simple and adds only one new message to the original system. The logic added to each node consists of the "marker receiving rule" and the "marker sending rule", which are fairly easy to implement. From the marker receiver's view, a marker message triggers the receiver process to save its local process state and to start recording the channel state for all incoming channels except the one on which the marker arrived. Recording for each of these incoming channels stops, and its state is saved, when a marker is received on it. From the marker sender's view, the marker marks the boundary of the recorded state for this sender.
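
The receiver-side bookkeeping described above can be sketched as a small class. This is a minimal illustration, not the paper's notation: the class name `SnapshotRecorder`, its method names, and the string channel identifiers are all assumptions made for the example, and sending markers on outgoing channels is left to the caller.

```python
class SnapshotRecorder:
    """Snapshot bookkeeping for one process (illustrative names)."""

    def __init__(self, incoming_channels):
        self.incoming = list(incoming_channels)
        self.recorded = False         # has the local state been saved yet?
        self.local_state = None
        self.channel_state = {}       # channel -> messages recorded for it
        self.still_recording = set()  # channels whose marker has not arrived

    def record_local_state(self, state):
        # Save the process state and start recording every incoming channel.
        # A real process would also send a marker on each outgoing channel.
        self.recorded = True
        self.local_state = state
        self.channel_state = {c: [] for c in self.incoming}
        self.still_recording = set(self.incoming)

    def on_message(self, channel, message):
        # An ordinary message counts toward a channel's state only while
        # that channel is still being recorded.
        if channel in self.still_recording:
            self.channel_state[channel].append(message)

    def on_marker(self, channel, current_state):
        # First marker: record the local state; this channel stays empty.
        if not self.recorded:
            self.record_local_state(current_state)
        # In every case, a marker ends the recording of its own channel.
        self.still_recording.discard(channel)

    @property
    def done(self):
        return self.recorded and not self.still_recording


r = SnapshotRecorder(["c1", "c2"])
r.on_message("c1", "m0")     # before the snapshot: not part of any record
r.on_marker("c1", {"x": 1})  # first marker triggers the local snapshot
r.on_message("c2", "m1")     # in flight on c2 while the snapshot runs
r.on_message("c1", "m2")     # after c1's marker: not recorded
r.on_marker("c2", {"x": 2})  # closes c2; the local state is not re-recorded
print(r.local_state, r.channel_state, r.done)
```

Because channels are FIFO, everything that arrives on a channel before its marker was sent before the sender's snapshot, which is what makes the recorded channel contents line up with the sender's recorded state.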

Discussion:
The model and algorithm defined in this paper rest on many assumptions; fortunately, many of them are easy to satisfy, such as FIFO and reliable communication channels, atomic state saving, etc. I think the algorithm can be used directly with minor extensions in real systems, e.g. for distributed system debugging and monitoring.
However, I have one concern about the latency of the algorithm, specifically the latency of getting the snapshot. Imagine we are debugging a crash in the distributed system. A node finds the problem and sends out a marker message to start the snapshot process. It might take a long time to get the snapshot, which might not be "fresh" enough for debugging purposes.
Another thought, not specific to this paper: the "clock" we have been discussing is based on the traditional understanding of time; it would be interesting to think about what would happen in the context of the theory of relativity or quantum communication.

Posted by: Peng Liu | October 1, 2014 11:42 PM

Summary

This paper presents an algorithm by which the global state of a distributed system can be captured. The global state includes the states of the individual processes and channels. The authors present an algorithm to do this and show how their procedure captures a "meaningful" global state. They state and prove properties about the state captured by their procedure. They also show how stability properties can be detected on the global snapshot.

Problem

Global state detection is useful in many practical problem settings, such as deadlock detection and computation termination detection. But the problem is hard to solve because, ideally, the states of all the processes and channels would have to be captured at the same instant, and there is no single shared clock that the processes can use to do this. If the processes capture their local states individually, they should do it in such a fashion that stitching them together produces a meaningful global state. Another problem is how to stitch the states of the individual processes and channels together to produce the final picture.

Contributions

1. I believe this work was the first to realize that many practical distributed system problems like deadlock detection boil down to capturing the global state and telling whether a stable property holds or not in that state.
2. The algorithm presented to capture the global snapshot. Briefly, after a process captures its own state, it sends a marker along all its outgoing channels before sending any further actual messages. A process receiving a marker checks whether it has already taken a snapshot. If not, it records its local state; otherwise, it takes all the actual messages it received along that channel after it took its local snapshot and stores them as the channel's state.
3. The state captured by this procedure can be a state that never occurred in the system. Though this seems odd and makes one doubt the correctness of the algorithm, the authors provide a thorough proof of the properties of this state and show that such a state is still useful to capture.

Discussion

1. It is not clear if the algorithm can terminate early in stability detection. For example, if we are trying to find out "cessation of computation" and a process records a local snapshot and it knows that it has not completed the computation, I believe that the algorithm does not end there. Probably this may not apply for all stability properties. For example, we can't do this early termination for deadlock detection.
2. I feel that the model of the distributed system that the authors assume is too ideal. Nonetheless, even in this setting the problem is still hard.

Relevance

I believe creating system snapshots (while allowing workloads to run) is still an active research problem. I believe many snapshot-ing features in current services and products (databases, virtual machines, storage boxes) use ideas presented in this paper. But I think taking snapshots with centralized control is much easier to implement and debug.

Posted by: Ramnatthan Alagappan | October 1, 2014 11:36 PM

Summary:
This paper presents algorithms by which a process in a distributed system can determine a global state of the system during a computation.

Problem:
Detecting the global state is very important in distributed systems, and many practical problems build on it, such as detecting the stability of the whole system.
One problem that makes this detection difficult is that there is no global or shared clock in the system. Grabbing the states of different components at instants that are not precisely the same can make the global view of the system inconsistent.
Another problem is that the detection algorithm must run alongside the underlying computation without altering it.

Contributions:
1. Modeling the distributed system and defining the relationships among local process states, global states, and points in the computation. The authors introduce a 5-tuple to define an event of a process and a next function to define the transition between global states.
2. Global-state-detection algorithm. An additional marker message is introduced into the system. When the state of the system is needed, a marker message is sent, which makes each process record its own local state no later than when it first receives a marker.
3. The global system state is formed from the local states of the underlying components, i.e. the processes and message channels.
4. Proof of their theorem, using new definitions of prerecording and postrecording events.
5. A stability detection algorithm based on the global state detection algorithm.
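
The 5-tuple event and the next function from item 1 can be sketched in code. The concrete encoding below is an assumption made for illustration (a global state as a pair of dictionaries, channels named by `(sender, receiver)` tuples, and the names `Event`, `can_occur` and `next_state`); only the shape of the tuple and the transition behaviour follow the description of the model.

```python
from typing import NamedTuple, Optional, Tuple


class Event(NamedTuple):
    """An event as a 5-tuple: process, state before, state after, and the
    message and channel involved (both None for a purely local event)."""
    process: str
    before: str
    after: str
    message: Optional[str] = None
    channel: Optional[Tuple[str, str]] = None  # (sender, receiver)


def can_occur(state, e):
    """An event may occur if its process is in the 'before' state and, for
    a receive, the message is at the head of the incoming channel."""
    procs, chans = state
    if procs[e.process] != e.before:
        return False
    if e.channel is not None and e.channel[1] == e.process:
        queue = chans[e.channel]
        return bool(queue) and queue[0] == e.message
    return True


def next_state(state, e):
    """Return the global state that follows `state` after event e."""
    if not can_occur(state, e):
        raise ValueError("event cannot occur in this global state")
    procs, chans = state
    new_procs = {**procs, e.process: e.after}
    new_chans = {c: list(q) for c, q in chans.items()}
    if e.channel is not None:
        if e.channel[0] == e.process:
            new_chans[e.channel].append(e.message)  # send: add at the tail
        else:
            new_chans[e.channel].pop(0)             # receive: remove the head
    return new_procs, new_chans


s0 = ({"p": "has_token", "q": "no_token"}, {("p", "q"): [], ("q", "p"): []})
send = Event("p", "has_token", "no_token", "token", ("p", "q"))
recv = Event("q", "no_token", "has_token", "token", ("p", "q"))
s1 = next_state(s0, send)
s2 = next_state(s1, recv)
print(s1)
print(s2)
```

A computation is then just a sequence of events where each one can occur in the global state produced by the previous ones.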

Applicability:
As the authors state, one application of this algorithm is to help detect the stability of a distributed system, such as whether the computation has terminated. Another obvious application is taking snapshots of the whole system. This feature is very popular in current distributed systems and distributed database systems; it allows users to store a snapshot, try some uncertain operations, and restore from the previous snapshot if the operations fail.

Limitation:
In the authors’ algorithm, correct handling of marker messages is what guarantees a consistent view of the global state. But if the scenario is extended to multiple machines connected by unstable networks, receiving marker messages out of order or losing them might still cause inconsistency among the local states of the components.

Posted by: Lichao Yin | October 1, 2014 11:35 PM

Summary
The paper discusses an algorithm that allows a process to determine the global state of a distributed system, which is then used to solve the problem of stable property detection.

Problems to solve
Assuming processes do not have access to a common clock, devise an algorithm through which each process records its own state and the state of its communication channels. These separate local snapshots, when put together, should form a global system state.
The algorithm to detect global state must not interfere with the state computation of individual processes and must happen concurrently.
Understanding the relationship between local and global states.
Need to detect a stable phase in a process running many events in order to define the end of a phase.

Contributions
Describes a distributed system model consisting of a finite set of processes and communication channels, with the assumption of infinite channel buffers. Using the model, the authors set up an example, namely the single-token conservation problem, to motivate the need for the algorithm.

The algorithm is able to make a meaningful global snapshot without the help of a common clock for the processes. This is done by having each process record its own state and that of its incoming channels. The issue of inconsistency that arises from uncoordinated recording is taken care of using a marker, which acts like a checkpoint.
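
The inconsistency that motivates the marker can be reproduced with the single-token system in a few lines. This is a toy sketch: the `TIMELINE` table and the function `tokens_seen` are made up for the illustration, with p sending its only token to q over a channel c.

```python
# Global states of the single-token system, in the order they occur:
# the token is at p, then in the channel c from p to q, then at q.
TIMELINE = [
    {"p": 1, "c": 0, "q": 0},
    {"p": 0, "c": 1, "q": 0},
    {"p": 0, "c": 0, "q": 1},
]


def tokens_seen(t_p, t_c, t_q):
    """Count tokens when p, c and q are each recorded at their own time."""
    return TIMELINE[t_p]["p"] + TIMELINE[t_c]["c"] + TIMELINE[t_q]["q"]


print(tokens_seen(0, 1, 1))  # p recorded before the send, c after it
print(tokens_seen(1, 0, 0))  # c and q recorded before the send, p after it
print(tokens_seen(1, 1, 1))  # all three recorded in the same global state
```

The first call counts two tokens and the second counts none, although the system always holds exactly one. Uncoordinated recording can therefore produce a state the system could never be in, which is the problem the marker rules are designed to avoid.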

The paper then provides theorems and proofs to confirm the correctness of the algorithm and to show that it computes the global state without affecting the underlying computation.

The algorithm is then extended to solve the stability detection problem, and the correctness of that extension is also discussed. This helps detect stable global conditions such as system deadlock.
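
Stability detection on top of a snapshot can be sketched as "record a global state, then evaluate the property on it". In the sketch below, `take_snapshot` is a hypothetical stand-in for a full run of the snapshot algorithm, the snapshot layout (process states plus channel contents) is an assumption, and `terminated` is just one example of a stable property.

```python
def terminated(snapshot):
    """Example stable property: every process idle, every channel empty."""
    process_states, channel_states = snapshot
    all_idle = all(s == "idle" for s in process_states.values())
    all_empty = all(len(msgs) == 0 for msgs in channel_states.values())
    return all_idle and all_empty


def detect_stable(take_snapshot, y, max_rounds):
    """Take up to max_rounds snapshots and return the first round in which
    the property y holds on the recorded state, or None if it never does."""
    for round_no in range(1, max_rounds + 1):
        if y(take_snapshot()):
            return round_no
    return None


# Canned snapshots standing in for three successive runs of the algorithm.
canned = iter([
    ({"p": "active", "q": "idle"}, {("p", "q"): []}),
    ({"p": "idle", "q": "idle"}, {("p", "q"): ["work"]}),
    ({"p": "idle", "q": "idle"}, {("p", "q"): []}),
])
round_found = detect_stable(lambda: next(canned), terminated, max_rounds=3)
print(round_found)
```

The second canned snapshot shows why channel states matter: both processes look idle, but a message still in flight means the computation has not terminated. Because the property is stable, a positive answer on the recorded state remains valid afterwards.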

Applicability/Discussion
Knowing the global state of a distributed system is important, and the above algorithm provides the means to obtain it under the stated assumptions. Node failure during global state computation wasn’t looked into, even though it could leave an invalid snapshot of a process that failed due to a fault.

Posted by: Shiva Prashant Chada | October 1, 2014 11:35 PM

Summary:
Leslie Lamport has done it again, this time together with co-author K. M. Chandy: they introduce the idea of determining and recording the global state of a distributed system by means of snapshots. Like many of Lamport’s papers, this is a theoretical paper with some briefly given practical algorithms, theorems, and corresponding proofs.

Problems:
The main problem confronted was finding a global state in a distributed way such that it is useful or “meaningful”. This meaningfulness comes in the form of a stability property: once the global state has a stable property, it will continue to have that property at any time in the future. The difficulty comes in detecting such a property of the global state in a distributed system without the use of a shared clock.

Contributions:
The central contribution was the simple algorithm for taking snapshots of the distributed system. Briefly, there are three steps to it, which are very similar to rumor-mongering in gossip protocols:
• A process wishing to take the snapshot records its own state and then sends a special marker message to all the other processes requesting a snapshot.
• Upon receiving a marker message the receiver records its own state and returns it to the origin requester. All subsequent messages sent by this receiving process have the special marker embedded in the message.
• If a process receives a message that is unmarked after receiving the special marked message from the original process, this is new unrecorded state, which must be part of the snapshot. The unmarked message is recorded and sent back to the original requester.
In this way all state before a specified cutoff is recorded in the snapshot. There are several theorems and a proof to demonstrate how and why this works. Some unruly assumptions are made for the algorithm including: it isn’t fault tolerant and expects the network to be perfect. While TCP does give some guarantees to reliability and order, the larger the distributed system, the higher the probability of a physical fault that could render the network or some nodes inoperable. Another flawed assumption is that every process can communicate with any other process in the system. In the event of a network partition, this algorithm will never recover.

Applicability:
As discussed in the paper there are a few direct applications, such as detecting when a distributed system has terminated completely or when it has deadlocked over some resources. Another application given was stability detection like termination detection, which could trigger the distributed system to perform the next phase of computation. Since this is a snapshot of the system, backups of the distributed state machine could also be an application, so as to recover from total failure.

Posted by: Peter Collins | October 1, 2014 11:34 PM

Summary:
The paper describes an algorithm for computing the global state of a distributed system. This is later shown to be useful in detecting many stable scenarios in the system like deadlocks or loss of tokens. This algorithm runs concurrently with the computation and hence the system need not be stalled while snapshotting.

Problem:
The authors are trying to solve the following problem.
In a distributed system, taking a global snapshot is quite challenging because events might be occurring while the snapshot is being taken. On the other hand, we cannot afford to stall the events whenever a snapshot algorithm is being run. Hence, we need a solution which will run concurrently with the computation and will capture a consistent global state for all the processes in the system.
I felt that the photography example was very neat as it exactly explains the problem.

Contributions:
a. Providing a snapshot algorithm which captured the global state without affecting the underlying computation.
b. “Defining” what a distributed system is in a simple, intuitive way, e.g. as a directed graph where the nodes are processes and the edges are the channels of communication, which makes it very easy to visualize.
c. Using the single token, multi token example, the paper proves that the events in a distributed system are very non-deterministic and hence computing global state is not trivial.
d. I particularly liked the concept of using markers to initiate the recording of global state, and their explanation of when this process would terminate was very neat.
e. Stable state detection using the recorded global state.

This paper doesn’t consider situations where a node (process) or multiple nodes can potentially fail. It would also be nice to know what will happen if events can concurrently occur.

Applicability:
This paper is one of the most significant works of Lamport. This approach might be very applicable to systems where capturing snapshots efficiently is important. Systems like HDFS implement this global snapshotting technique but I’m not sure if they use Lamport’s algorithm.

Posted by: Anusha Dasarakothapalli | October 1, 2014 11:32 PM

Summary: In this paper, the authors discuss how to
detect a global state in a distributed system where
each node can only take note of its own state
and the messages sent/received on a channel.
One special property that the authors require (and
take advantage of) for the global state is
that it is "stable", meaning that all other global states
that are reachable from the global state that
we want to detect are also OK to be detected.
One such example is termination of a certain phase.

Problem: The key problem in a distributed system
designed without careful consideration is that
the local states that processes note might not
be consistent, that is, their union might not be
any global state that is reachable from the
initial global state. It is easy to imagine
why this could happen--each process only takes
note of its local state, how can it know to
be consistent with others?
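
One common way to make "consistent" concrete is the consistent-cut condition: a recorded state must not contain the receipt of a message whose sending it does not contain. The check below is a sketch of that standard formulation, not wording from the paper; the encoding of a cut as per-process event counts and the name `is_consistent` are assumptions made for the example.

```python
def is_consistent(cut, messages):
    """Check the consistent-cut condition.

    cut maps each process to how many of its events the recorded state
    includes. messages lists (sender, send_index, receiver, recv_index),
    where the indices are 1-based positions among that process's events.
    """
    for sender, send_idx, receiver, recv_idx in messages:
        received_in_cut = recv_idx <= cut[receiver]
        sent_in_cut = send_idx <= cut[sender]
        # A message received inside the cut but sent outside it would be
        # an effect recorded without its cause.
        if received_in_cut and not sent_in_cut:
            return False
    return True


# One message: p's first event sends it, q's first event receives it.
messages = [("p", 1, "q", 1)]
print(is_consistent({"p": 1, "q": 1}, messages))  # send and receive recorded
print(is_consistent({"p": 1, "q": 0}, messages))  # message still in flight
print(is_consistent({"p": 0, "q": 1}, messages))  # receive without send
```

The middle case is still consistent: the message simply belongs to the recorded state of the channel. Only the last case is ruled out.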

Problem: In my opinion, one key property that the
authors take advantage of is that changes to a node
that might cause a consistency problem can only be
triggered by the event of passing messages.
Therefore, to avoid this, as long as the sender
takes note before sending a message, and the receiver
takes note before receiving the message, it is OK.

The fact that the target global state is stable is
further used to make sure no further synchronization
is needed among nodes--as long as the global state
is reachable from the target global state, we can
detect it and do not need to worry it would change back.

Applicability: The proposed algorithm assumes
that there is no failure among the channels,
and all messages will be sent/received
between the sender and receiver within a finite
amount of time. Therefore, when there are
channel failures, more careful consideration
is necessary. However, a similar idea
(i.e., the way of maintaining consistency)
has been used (I actually do not know their
order of application) in distributed database
protocols like 2-phase locking.

Posted by: Ce Zhang | October 1, 2014 11:17 PM

Summary:
In this paper the authors formalize the notion of the state of a process in a distributed system. They then build upon this formalization by formalizing the global state of the distributed system as the set of each process’s state. The paper then presents an algorithm for each process to determine the global state of a distributed system at a given time. The authors also prove the correctness of the algorithm by demonstrating through proof and example certain properties that must hold.

Problems:
Distributed systems, as their name suggests, consist of separate components and processes operating asynchronously and independently. Therefore, it is very difficult to determine the global state of a distributed system because (i) there is no centralized presence that behaves as the clock and knows the set of states S for the set of processes P in the distributed system and (ii) the system is in continuous computation, and computation cannot (and should not) be stopped every time the global state needs to be determined (akin to the pictures of birds flying example). As a result of (ii), any snapshot of the global state will not perfectly represent some time t.

Contributions:
- The formalization of a process's state s and the extension to formalizing the global state S as the set of these s_i’s at some time t.
- The development of a global state snapshot algorithm that provably allows some process p in the distributed system to determine the global state of the distributed system at some time t without interrupting the computation being done in the distributed system.
- The proof and examples of the correctness of the global state snapshot algorithm.
- Stability detection of the distributed system through the use of the global state snapshot algorithm. It can determine cases of deadlock, termination, etc.

Application to Real Systems:
Although the paper makes certain underlying assumptions to prove the correctness of the algorithm, it does provide a mechanism to create global state snapshots in a decentralized fashion, assuming the distributed system is functioning properly. I assume that there are distributed systems that take advantage of such an algorithm, but differ by many degrees in actual implementation because failure cases can be common at large scale. It seems that a centralized snapshotting algorithm would be far easier to implement and reason about (especially during failure cases).

Posted by: Aaron Cahn | October 1, 2014 11:01 PM

Summary:
The paper describes global state detection algorithm for distributed system. Each process records [or snapshots] its state and state of its channels such that an aggregation provides information about the complete state of the system. The snapshots are taken in a timely manner to ensure correctness of the system.

Problem:
State of a system at any particular time is determined by state of processes (nodes) and the communication channels (links).
Given the distributed nature and absence of accurate global clocks, it is difficult to know the overall state of the system.

The paper presents an algorithm by which every process can determine the global state of the system without affecting the underlying computation.

Contributions:
1. Define a relationship among local process states, global system states and points in distributed computation.
Many earlier algorithms failed to do so.
2. Mechanism to take snapshot (i.e. record state) by various processes without affecting computation.
3. Use snapshots to solve stable property problem.

Discussion/Application:
- Compared to other papers, the authors make a lot of assumptions, such as error-free channels, strongly connected graphs, no network failures, and no influence of external events. It would be interesting to know how the algorithm needs to be modified when these assumptions are lifted.
- As the paper states, knowing about the state detection algorithms can be helpful in deadlock detection, computation termination.

Posted by: Harneet Singh | October 1, 2014 10:57 PM

Summary: The authors motivate and present an algorithm for determining global state in a distributed system. In this paper, they strictly define what a global state is and how a process can record it, along with how the moment at which a process records its state adds some nuance to this determination. They prove that the state recorded by their algorithm is a global state the computation could have passed through, and use this to create an algorithm for determining whether a stable property holds.

Problem: Determining what properties the global state of an entire distributed system has is not trivial. Each process in a distributed system can only know its own state and the messages that it sends or receives. Naively, one would try to determine this global state by having the processes themselves send their local state through the channels to other processes and have the global state arise there. The problem is that this cannot be synchronized precisely enough, since the processes cannot record their state at the same instant and communicate this state across the communication medium in a way that does not affect the underlying computation.

Contribution: The authors concretely define what it means for a system to have a global state and create an algorithm where each process can individually collect their local state and the state of their incident channels and have this aggregate process converge to a global state. They then apply this and devise an algorithm to determine whether the global state is “stable”.

The authors define a simple model of a distributed system. The system is a graph in which processes are nodes connected to other processes by edges, each representing an error-free channel with unbounded capacity. Each process has a state, and each channel has a state as well (the messages in transit on it). The global state is the aggregate of all the process and channel states.

The algorithm that they propose is that whenever a process spontaneously records its state, it sends a marker out on each of its outgoing channels. Any process that receives a marker records its own state if it has not already done so, and records the channel the marker arrived on as empty. If it has already recorded its state, it records as that channel's state the messages it received on the channel after recording and before the marker arrived. Together, these recordings form a snapshot of the global state of the system.
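
The marker rules can be sketched as a small simulation. This is only an illustrative sketch, not the paper's notation: it assumes two processes, p and q, passing a single token over FIFO channels, and the names `Process`, `connect`, `send_token` and `MARKER` are made up for the example.

```python
from collections import deque

MARKER = "MARKER"  # hypothetical sentinel value for marker messages


class Process:
    def __init__(self, name, tokens):
        self.name = name
        self.state = tokens         # local state: number of tokens held
        self.inbox = {}             # sender name -> FIFO channel into this process
        self.peers = {}             # receiver name -> Process (outgoing channels)
        self.recorded_state = None  # local state saved for the snapshot
        self.channel_log = {}       # sender name -> messages recorded for that channel
        self.recording = set()      # incoming channels still being recorded

    def send(self, dest, msg):
        self.peers[dest].inbox[self.name].append(msg)

    def send_token(self, dest):
        # Underlying computation: hand one token to another process.
        self.state -= 1
        self.send(dest, 1)

    def record(self):
        # Marker-sending rule: save the local state, then send a marker
        # on every outgoing channel before sending anything else.
        self.recorded_state = self.state
        self.channel_log = {src: [] for src in self.inbox}
        self.recording = set(self.inbox)
        for dest in self.peers:
            self.send(dest, MARKER)

    def receive(self, src):
        msg = self.inbox[src].popleft()
        if msg == MARKER:
            # Marker-receiving rule: record the local state on the first marker,
            # then stop recording the channel the marker arrived on.
            if self.recorded_state is None:
                self.record()
            self.recording.discard(src)
            return
        if src in self.recording:
            self.channel_log[src].append(msg)  # message was in transit at snapshot time
        self.state += msg                      # underlying computation continues


def connect(a, b):
    # Create a one-way FIFO channel from a to b.
    a.peers[b.name] = b
    b.inbox[a.name] = deque()


p, q = Process("p", 1), Process("q", 0)
connect(p, q)
connect(q, p)

p.send_token("q")  # the token is now in transit on channel p->q
q.record()         # q spontaneously starts the snapshot
p.receive("q")     # p sees the marker, records its state, sends its own marker
q.receive("p")     # q receives the token and logs it as channel state
q.receive("p")     # q receives p's marker and closes channel p->q

snapshot = {
    "p": p.recorded_state,
    "q": q.recorded_state,
    "p->q": q.channel_log["p"],
    "q->p": p.channel_log["q"],
}
```

In this run the snapshot shows neither process holding the token and exactly one token in transit on the channel from p to q, so the token count is preserved even though the two processes recorded their states at different moments.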

Finally, the authors prove that for any computation during which the recording takes place, the events can be reordered so that all prerecording events precede all postrecording events, which shows that the recorded state S* is a reachable global state. Using this, they extend the approach to a simple algorithm for determining whether a stable property holds in a distributed system.

Applicability: The paper is significant for the theoretical framework that it gives to the notion of global state in a distributed system. Like the last paper (by one of these same authors), the framework that is concretely defined here will always be relevant. However, again like the last paper, the algorithm developed might not scale to modern systems. In reality, channels neither have infinite capacity nor are error-free. Given the vast amount of data being moved around in modern distributed systems, it seems dubious that their algorithm can concurrently record the process state and the messages being received on the channels in a way that does not affect the running of the computation (a requirement laid out at the beginning of the paper).

Posted by: David Tran-Lam | October 1, 2014 10:48 PM

Distributed Snapshots: Determining Global States of Distributed Systems

Summary:
In this paper the authors show that determining the global state of a distributed system is a hard problem, since the processes don't share a clock or memory. They propose an algorithm that lets a process determine the global state of the distributed system. The main idea of the algorithm is to take separate snapshots of the processes and the communication channels at a particular instant of time (this is simulated using the markers) and to combine those snapshots into a consistent global state of the distributed system.

Problem:
A process can record only its own state and the messages it sends and receives. The problem arises when a process wants to record the state of other processes and the messages they send and receive in order to get a global system state. This is not easy, since it requires cooperation from the other processes to record their states and share them with the process that requests them. The authors' solution to the problem is a distributed algorithm that can correctly record the state of a distributed system.

The other problem the authors try to address is the stable property detection problem. Stable properties are those that, once they become true, remain true thereafter. For example, "computation has terminated" and "the system is deadlocked" are stable properties. It is difficult to detect stable properties in a distributed system, and the authors come up with a stability detection algorithm to solve this problem.

Contributions:
1. Defining the relationships between the local process states and the global system state was a major contribution, since previous algorithms failed to do so and as a result were shown to be incorrect or impractical.
2. The idea of determining the global state as the sum of its local parts (i.e. the states of the individual processes and the communication channels) at a particular instant of time. Even though the whole system state was difficult to capture at a particular instant because there was no shared time, the algorithm could do it with the help of markers.
3. The model of a distributed system that they came up with was an important contribution, since a well-defined model helped explain the algorithm more precisely. I felt that this provided a common platform and terminology for other researchers, for example representing a distributed system as a directed graph of vertices and edges.
4. Using their algorithm to detect stable properties in a distributed system was an important contribution, since it demonstrates the usefulness and value of the algorithm.
5. I also felt that capturing the global state of the system without affecting the ongoing computation was great, but I really didn't understand how that would be possible, since process p, which requests a global snapshot, doesn't send any more messages after sending the marker until process q records its state. Isn't the computation affected by this?

Relevance:
The algorithms introduced in this paper are fundamental for distributed databases, as they need to detect stable properties at various points in time. The file system snapshot procedure in distributed file systems may also have been influenced by this work.

Posted by: Adalbert Gerald | October 1, 2014 10:00 PM

Summary:
In this paper the authors present a decentralized algorithm for taking a meaningful snapshot of the global state of a distributed system, that is, of all the nodes and all the links. The snapshot is taken while the system continues to operate (changing state and sending messages), and the algorithm should run concurrently without altering the underlying computation.

Problem:
Distributed systems by their nature lack an easy method of central coordination and state measurement. So if you need to observe the global state, one method would be to elect a single leader, stop everything while that leader sends out individual state requests, then resume once it collects and announces the results. This is a highly intrusive process. The requirement is for a decentralized global state detection algorithm which can run in background without affecting the ongoing computation.

Contribution:
- The self-stated contribution of defining the relationship between local process states, global system states, and points in a distributed computation. Recording the state is done asynchronously at each local process with the help of markers.
- An algorithm for collecting a snapshot without interrupting or rescheduling the system's normal operation.
- The method is simple and doesn't require a single source; it can have multiple independent trigger sites.
- The authors show that the recorded global state may be one that never actually occurred, but it is still consistent with the computation in the system and is significant.

Flaw:
- The authors do not account for failures in the system, i.e. nodes failing to respond or network links going down.
- Broadcasting data across all links whenever state is recorded can cause overhead in the system.

Applicability:
The algorithm described for capturing global state is simple and straightforward. It is highly applicable to problems that need to detect stable properties, such as deadlocks in database systems and computation termination. But without accounting for failures in the nodes and links, the algorithm is unlikely to be implemented as is in real-world systems.

Posted by: Bhaskar Pratap | October 1, 2014 09:44 PM

Summary: This paper presents an algorithm that can take a snapshot of a global state on a distributed system.

Problem: In a distributed system, the global state is the state at some instant consisting of all the states of component processes and communication channels. It can be used to determine some stable properties, such as if a computation has terminated, or if the system is deadlocked. However it is also hard to get a valid global state because in a distributed system, it cannot be guaranteed that the snapshot of all components is taken at the same instant. Moreover simply piecing together all local snapshots may not provide a consistent global state.
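
To make the inconsistency concrete, here is a minimal sketch of a naive recording, assuming a toy system in which a single token moves from process p to process q. The function and variable names are hypothetical and chosen only for this illustration.

```python
def naive_snapshot():
    """Record p, q and the channel at unsynchronized moments (no markers)."""
    p, q, channel = 1, 0, []          # one token, initially held by p

    recorded_p = p                    # p records first: it still holds the token

    p -= 1                            # p sends the token...
    channel.append(1)
    q += channel.pop(0)               # ...and q receives it

    recorded_q = q                    # q records later: it now holds the token
    recorded_channel = list(channel)  # the channel is recorded as empty
    return recorded_p, recorded_q, recorded_channel


recorded_p, recorded_q, recorded_channel = naive_snapshot()
total_tokens = recorded_p + recorded_q + sum(recorded_channel)
```

The pieced-together state contains two tokens although the system only ever had one, so it is not a state the system could have been in.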

Contribution:
1. Gave a formal definition of a distributed system, events, and states. In the definition the authors assume that a distributed system is a directed graph where each node is a single machine and the edges are one-way communication channels. Each machine has its own finite state space. It is assumed that the channels have infinite buffers and that the communication delay can be arbitrarily large but finite. The state of a channel is the set of pending messages on that channel. A global state is the collection of all machine states and channel states. The definition is clear and practical.

2. Gave an algorithm that can take a snapshot of the whole distributed system. The algorithm uses a marker message to tell other machines when to record their states. The algorithm is guaranteed to terminate if the topology of the distributed system is strongly connected. The global snapshot taken is guaranteed to be a valid state that the system could have reached between the initiation and the termination of the algorithm.

3. Used the above-mentioned algorithm to solve the stable property problem. A stable property is a property that, once a system has it, the system keeps having. The solution is simple: first take a global snapshot, then test for the property against the snapshot. If the property holds in the snapshot, then it is guaranteed to hold from the termination of the snapshot algorithm onward; otherwise it is guaranteed that the property did not hold when the snapshot algorithm started.
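
The snapshot-then-test idea can be sketched as follows. This is a hedged illustration rather than the paper's formulation: the snapshot is assumed to be a plain dictionary, `take_snapshot` stands in for a full run of the snapshot algorithm, and repeating the test over several rounds is an addition for the example. The property checked is termination (all processes idle and no messages in transit).

```python
def is_terminated(snapshot):
    """Stable property: every process is idle and no message is in transit."""
    return (all(state == "idle" for state in snapshot["processes"].values())
            and all(len(msgs) == 0 for msgs in snapshot["channels"].values()))


def detect_stable(take_snapshot, prop, max_rounds=10):
    """Return the round in which `prop` first holds on a snapshot, or None."""
    for round_no in range(1, max_rounds + 1):
        if prop(take_snapshot()):
            return round_no  # the property holds from here on, because it is stable
    return None


# Canned snapshots standing in for three successive runs of the snapshot algorithm.
canned = iter([
    {"processes": {"p": "active", "q": "idle"}, "channels": {("p", "q"): []}},
    {"processes": {"p": "idle", "q": "idle"}, "channels": {("p", "q"): ["work"]}},
    {"processes": {"p": "idle", "q": "idle"}, "channels": {("p", "q"): []}},
])
detected_round = detect_stable(lambda: next(canned), is_terminated)
```

Note that the second snapshot is not reported as terminated even though both processes are idle, because a message is still recorded in the channel. This is why the channel states matter and not only the process states.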

Applicability: The proposed algorithm is widely applicable and useful for taking snapshots of a database, a file system, or even the process states on a single machine.

Posted by: Menghui Wang | October 1, 2014 09:42 PM

Summary:
In this paper, the authors propose an algorithm that takes a snapshot of a distributed system. The stable property detection problem is solved with the help of this algorithm, which is useful for detecting many stable properties of a distributed system, including deadlock.

Problem:
Existing deadlock detection algorithms that are based on global state capture are not very accurate, as they don’t really take into account the relationship between the states of the processes that constitute the system and the global state of the system in its entirety.

Contributions:
1. In this paper, the authors clearly define a process state, which is used as a building block for the global state, and show how individual process events affect the global state.

2. Devised a model that clearly explains as to when a process would affect a global state and when it would not.

3. Identified that in a distributed system all computations need not be deterministic and how this would affect the snapshotting algorithm.

4. They ensure that the snapshotting algorithm did not actually change the existing computation that was going on in the system. It was running concurrently along with the underlying computation.

5. Usage of markers in the algorithm to hint a process to start recording its state into the global state, thereby contributing to the snapshot. Hence, the composite view of all the processes forms the global snapshot of the system.

6. Extended the snapshotting algorithm to solve the stable property detection which proved to be useful in a variety of distributed systems problems.

Limitations:
1. The paper does not really mention what happens when concurrent events occur and how they would affect the state machines of the processes and the system.

2. Assumptions like unlimited buffer size or error-free channels might not hold in reality, and hence the system could get more complex. For instance, if a marker is lost, a process might never record its state or the channel’s state into the global state.

Applicability/Conclusion:
This was one of the very early papers to explain that it is not possible for multiple processes to record their states all at the same instant. It follows up on Lamport’s previous paper, which showed that synchronizing clocks is not easy in a distributed system. Though the implementation might not be easy, many of the solutions, such as distributed deadlock detection or detecting whether a token was lost, can be really useful in today’s distributed systems. Examples of today’s file systems with snapshotting include HDFS and MooseFS.

Posted by: Manasa Subramanian Ganapathy Subramanian | October 1, 2014 09:40 PM

Summary:

The paper describes a mechanism to record the state of a distributed system and shows how it can be used to solve many problems in distributed systems.

Problem:

Capturing the state of a distributed system is non-trivial because the state of the system changes while the snapshot is being taken. We cannot afford to "freeze" the underlying system for the purpose of taking a snapshot. A mechanism has to be devised to take the snapshot concurrently with the underlying computation. There are also classes of problems like stable property detection that need to be solved.

Contributions:

- Provided a mechanism to take the snapshot without interfering with the underlying computation.
- Existing literature had incorrect algorithms. Relationships between states, processes and inter-process communication were poorly understood. This paper formalizes these concepts to avoid ambiguity and get provable results.
- Shows how the recorded global state can be used to check for stability even though it doesn't necessarily correspond to an actual state in the system.

Limitations:

- Not all distributed processes are simple enough to be modeled as state machines with channels in between them. Using a mechanism like this would constrain the designer's flexibility in implementing the system if the system must be modeled to fit the algorithm.
- The simplifying assumption about messages not being dropped in channels or nodes not going down doesn't seem to be realistic. I'm not sure how easy it is to extend the algorithm to deal with this.

Applications:

- Hadoop FS has a snapshot mechanism for capturing file system state. Not sure if they use the same idea, but it probably influenced it.
- Deadlock detection in distributed system (one example of stable property detection).

Posted by: Satyanarayana Shanmugam | October 1, 2014 09:23 PM

Summary: this paper describes what a distributed system state is, how to detect it, its properties and finally, how it can be used.

Problem: in a distributed system, every process has its own clock and memory, and in turn its own view of the state of the system. For certain problems, computation, for example, it can be useful for all the processes to know the global state -- that is, the state of all the processes together, and not just the local one. This can be difficult to obtain since the processes do not have a shared clock and must rely on channels for communication. Without a global state, the system as a whole may be unable to read certain properties and know what state it is currently in, or when it has transitioned into a new one. This makes detecting the end of a computation or a deadlock a difficult problem.

Contributions:

What is a distributed system? The introduction provides a formal, eloquent description of the essence of a DS -- to my knowledge, this is the first (or the easiest to understand).

A snapshot can be taken without clocks: the channels can be used to propagate a message through the system, taking all processes into account. The state is thus not something that happens in one particular point in physical time, but is more of a composite of many snapshots (bird photographer analogy).

Stability detection, although previously discussed, is an important use of the snapshot. It has practical uses in deadlock detection, among other things.

Applicability: most of the contributions of this paper are theoretical. The authors make a lot of assumptions about the system they are operating on. For example, (L1) and (L2) assume that channels deliver messages and processes can record states in finite time. Although these seem reasonable enough, they may be pretty difficult implementation details, especially for fault tolerant systems.

Posted by: Theo | October 1, 2014 09:03 PM

Summary:
In this paper, the authors present a mechanism to obtain snapshots in a distributed system. Their model consists of processes which communicate with each other using channels. Taking the snapshot of such a system involves determining the global state of the system for which they provide an algorithm and prove that it is correct. They also show how to detect stability.

Problem they are trying to solve:
To have a snapshot at an exact instant in a distributed system, you would need all the systems to share a global clock, which is not practical. So one requires a mechanism that allows individual systems/processes to take local snapshots and then reason formally about the global state using these local snapshots.

Solution they propose:
They start off with a lot of assumptions, such as error-free, order-preserving (FIFO) channels with infinite buffers, to scope the problem. They then describe an algorithm that captures the global state without affecting the computation of the system. In this algorithm, each process captures its local state and, using markers, the state of the channels incident on it; these states are then combined across all processes and channels to get a global snapshot of the system. Typically one process initiates the algorithm by taking a local snapshot and sending a marker on its outgoing channels, which serves as a trigger for other processes to capture their own local states and the states of the channels they receive messages from. They formally prove that the global state thus recorded is correct and is reachable by a sequence of allowed events in the distributed system. The same algorithm is extended to detect stable properties.

Contributions and Applicability:
This work has brought Chandy and Lamport considerable recognition, including the 2014 “Dijkstra Prize in Distributed Computing”, so it has clearly had a big impact in the field of distributed systems. I am not aware of how existing systems currently take global snapshots while running, but this paper provides an elegant way of answering various questions such as “has the computation terminated”, “is the system in a deadlock” etc.

Posted by: Chaithan Prakash | October 1, 2014 07:44 PM

Summary:

The paper presents an algorithm for generating a snapshot of the global state of a distributed system. Each process records its state and the state of the channels incident on it in a timely manner (ensuring correctness) such that the combined recordings form the required global snapshot. They also show the usefulness of such a snapshot for stable property detection, such as computation termination and deadlock detection.

Problem:

Processes only have a local, restricted view of the system and are unaware of other processes' views; as a result, ensuring the correctness of a combined state snapshot built from individual recordings is quite challenging.
Apart from correctness, the snapshot protocol should not affect the underlying computation.

The analogy provided in the paper with photography was really good, as it helped in explaining the practical significance of these problems.

Contributions:

The self-proclaimed contribution of "understanding relationships among local process states, global state & points in distributed computation" is well justified by the use of markers in their algorithm design, which help initiate the recording of process states at the proper time - this ensures the correctness of the global snapshot even though it was recorded asynchronously on individual processes.
The snapshot taken is shown to possibly be a global state that never occurred, and the proof that follows, showing the significance of the recorded global state, was really good.
The definition of pre and post-recorded events and their independence when occurring in two different processes provided a great insight.

Limitation:

Their assumption about channels being perfectly error-free seems a little far-fetched; an extension of the algorithm to handle loss of markers (which could also be caused by a system crash) through detection and retransmission could overcome this issue.

Applicability:

Although the algorithm may be difficult to implement practically, global states could be useful in Database applications for both deadlock detection and checkpoint generation. I believe having a master system, which periodically gets data from other systems to create a global view would be way easier to implement and would serve the same purpose.

Posted by: Meenakshi Syamkumar | October 1, 2014 06:05 PM

Summary: This paper describes the problem of getting a snapshot of the global state of a distributed system – for each node and link. The authors come up with a simple and straightforward solution.
Problem: A distributed system changes over time, and it is impossible to pause all computation simultaneously in order to ask each node and link what they are doing. The question is, given that the system’s state is not constant, how can I get a consistent picture of the global state? In addition, we need to do this without affecting the underlying computation.
Contributions: It seems to me that the importance of this paper is in defining the model and the problem clearly. As stated in the paper, the solution itself is obvious once the constraints and relationships are set down. By sending markers after recording state, each node can put down a version of its own state, and the state of any incoming links. Indeed, this will be done in finite time.
The second contribution is in their understanding that this global state is not necessarily reflected in any of the state configurations the system went through. However, they prove that it is a possible state configuration, and thus is consistent with the system (that is, the system could have started where it did, passed through the global state their algorithm found, and ended up in the same ending state).
Discussion: Relating theory papers to real world systems is a bit more difficult than, say, Dynamo. Luckily this paper gives a direct application (actually, part of their reason they investigated this) – global deadlock detection. Given a function to look at the state of a system and check if it’s in deadlock, one could use this algorithm to determine the state, and check if deadlock has occurred.
Flaws: The flaws here lie in the assumptions. The most basic, as in Lamport’s previous paper on logical time, is that this assumes no outages (of either nodes or network components). Because this looks for a snapshot view of the system, but the algorithm actually runs over some delta time, it is possible that nodes become available or unavailable while the algorithm is in progress. This of course goes against what a “snapshot” is, where there is no change of state during the snapshot. In addition, if an algorithm did deal with outages, it is possible that the recorded global state would not necessarily be consistent. For example, if it was recording the global state of a token ring, and the machine holding the token went out, then it would show the token ring as not having a token. Whether or not this is the case depends on the application and possibly on whatever is causing the machine to not respond.

Posted by: Frank Bertsch | October 1, 2014 02:52 PM
