Kenneth Birman @ Cornell, Andre Schiper @ Switzerland, Pat Stephenson @ Cornell
ACM Trans. on Computer Systems 1991
Why multicast is important? In message passing system--synchronization by messaging, multicast is the most complex form of messaging! |
|||||||
A simple example of how multicast is used in distributed computing
|
Type of groups: Fig 1
|
|||||||||
Expected usage of group
|
Asynchronous atomic multicast which keeps causality order |
|
Support every type of groups in section 2 |
System = fixed set P of processes with disjoint address spaces |
|||||||||
Group g
|
|||||||||
Multicast
|
|||||||||
View of group g
|
|||||||||
Message transport layer: reliable point-to-point & multicast service
|
|||||||||
Execution of a process = a partially ordered sequence of atomic events--all or nothing & not interleaved by another action--such as:
|
|||||||||
System causality relation
|
|||||||||
Liveness
|
Virtual synchronous programming environment: user can program as if the system schedules one distributed event at a time |
|||||||||||||||
Aspects of virtual synchrony system
|
|||||||||||||||
Fault tolerance: fault atomicity + CBCAST delivery order
|
How can we say that send(m) -> send(m')? We need some virtual time |
|||||||
VT(p): vector time of process p
|
|||||||
VT of message
|
|||||||
The comparison of VT: comparison of corresponding counters |
Single process group with fixed membership |
CBCAST order can be said
|
|||||||||||
CBCAST protocol
|
|||||||||||
CBCAST always guarantees that causal order is kept
|
ABCAST does not keep causality order |
|||||||||||||||||||||||
Causal ABCAST: an modified ABCAST which keeps causality |
|||||||||||||||||||||||
Idea
|
|||||||||||||||||||||||
Protocol
|
A CBCAST msg is said to stable, if it gets received by all recipients |
|
Stability number n of a process: VT(msg)[pid] <= n, for all stable msg sent by the process |
|
Stability number may be piggybacked with CBCAST msg or sent by itself |
VT(msg) needs only carry vector timestamp fields that have changed since the last multicast by the sender
|
|||
VT compression could save nothing due to the additional field to record which indices are included in the timestamp |
|||
For bursty and highly localized traffic, VT compression can save a lot |
To achieve atomicity of msg delivery while group membership changes dynamically, several issues must be addressed
|
VT counts the number of msgs sent -> with multiple groups, msg is delivered to some processes -> when a process receives a msg with VT[i] = k, it does NOT know how many msgs from process i will precede the msg -> Need VT for each group: denoted as VT(msg, group) or VT(sender/receiver, group) |
|||||||||||||
Also stability number for each group |
|||||||||||||
CBCAST protocol for multiple group
|
|||||||||||||
=> VT for every group should be included in each msg -> 6.3 Extended VT Compression |
Causal ABCAST which uses CBCAST no more keeps the order of delivery among different groups -> Need global ordering, but little practical and out of scope of this paper |
VT Compression by sending only delta: changed field, even entire VT for a group can be omitted |
|
By changing the CBCAST protocol as to deliver concurrent messages in causality in the order of reception, VT field can be further compressed |
|
Locality will increase the efficiency of compression |
Message carries VT for (virtually) all groups in the system -> Processes hold the VT of groups which do not belongs indefinitely -> Big overhead |
|
Solution: Periodic flush(= setting fields of VT to zero) for every active group & throwing away VT for groups not being refreshed within the timeout |
(1) Virtually synchronous addressing: no process failure or departure from a group is assumed => 5.5 method still works |
|||||
(3) Delivery atomicity and virtual synchrony when failure occur
|
VT carried with msgs is still headache |
|||
If communication structure is known in advance to have a particular pattern, the number of VTs can be reduced drastically |
|||
Communication pattern: graph with vertex = group, edge connecting two group iff two group shares members |
|||
Biconnected component = subgraph with every pair of vertices connected by at least two paths |
|||
k-bounded: max(|biconnected component|) = k |
|||
Theorem: If a system uses the VT protocol to maintain causality, it is sufficient and necessary for a process p to maintain and transmit those VT timestamps corresponding to groups in the biconnected component to which p belongs
|
If process joins and leaves group dynamically, it is almost impossible to get the graph of 6.6 |
|||||||||
Conservative solution
|
|||||||||
Multicast epochs: process leaves a group only due to failure
|
|||||||||
(2) Multicast epochs: process leaves a group for other reasons, remaining active and possibly joining other groups
|
Client/Server model in ISIS
|
|||||
Diffusion model
|
|||||
Hierarchical model: asynch -> O(number of members of subgroup) |
ISIS uses RPC for point-to-point message
|
How to inform the system of application structure, such as acyclic comm pattern? Current ISIS uses conservative method to detect comm pattern. |
|||||
Possible extension: introduction of causality domain
|
How should be membership changes detected and managed? Use CBCAST to inform membership change -> flush |
|
Also need a mechanism to detect a failure of a member. In another paper |
Is causality only in terms of msg being sent or received enough? |
|
VT is too complex and heavy |