Mike's notes on replication: http://pages.cs.wisc.edu/~swift/classes/cs739-sp10/lectures/Lecture%2012%20-%20replication-v3.pdf **FLexible Update Propagation for Weakly Consistent Replication** ================================================================= # Strict consistency vs. eventual consistency - eventual consistency: + after data update, inconsistent still holds for some time (i.e client may see inconsistent), but finally, the system is consistent + use where performance or availability is need for than consistency (see the circle argument of Mike's notes) Take: 3 design decisions in bayou: - pair-wise communication: + arbitrary communication topologies + arbitrary policy choices (when, to whom) - exchange write operations only + low bandwidth network - propagate writes between replicas in an order that is closed with respect to the writes' accept, causal, total order + incremental progress + use of transportable media # Question: What is S.O (don't quite get it ... Leo) # 1. Introduction *Weak consistency* - update any where - but require a reconciliation protocol to propagate updates Advantage: - relax consistency - flexible policy: when , to whom, and what to reconcile - availability, fast response time (performance) *Summary of Bayou anti-entropy protocol* - support arbitrary communication topologies: + mechanism to propagate update between any two replicas, then finally updates will propagate through the system - operation over low-bandwidth network + update operation is propagated, not the full database content - incremental progress: + allow interruption, like transient network disconnection - eventual consistency + each update eventually reach every replica - efficient storage management + logged updates can be discard to reclaim storage - propagation through transportable media - light-weight management of dynamic replica set: + creation and retirement of a replica through communication with only one available replica (not all replicas, like some other protocol) - arbitrary policy choice: + flexible choice: when, to whom, and what ... ==>Take: very flexible, support a variety of application needs, network topology and so forth. # 2 Basic Anti-entropy *Storage model* - each server has: + an ordered log of updates + a database resulting from the in-order execution of these updates - entry in the write log: + can contain write received from application or other servers - When a server *first* receives a write from a client application: + the server assigns an increasing *accept-stamp* to the write > accept-stamp can be time-stamp or generation counter - Hence, each write has 2 fields + accept-stamp + server id: id of the server that first assigned the accept-stamp - Accept-stamp defines a *total order* over all writes accepted by a server - Accept-stamp defines a *partial order* over all writes in the system: + write A and B can have the same accept-time but different server-id ==> need a way to break tie (e.g. based on server id) *Design choice* (3) - Choosing anti-entropy partner: + randomly (flexible, work with various topology) + restricted: less flexible - Exchange write operation rather than database contents + the amount of data propagated depends on update activity rather than the database size ==> good when a database size is large + avoid ambiguity introduced by creation or deletion of replicated objects + write operation can easily stored in log - Enforcing partial accept-order: prefix-property + Server R that holds a write stamped Wi that was initially accepted by X will also hold all writes accepted by X prior to Wi. + Version vectors V represent the set of writes know to a server: ==>R.V(X) is the largest accept-stamp of *any* write known to R that was originally accepted from a client by X. (i.e. R knows all writes accepted by X prior (and including) R.V(X)) *Basic Alg.* /*basic anti-entry executed at S(ender), to update receiving server R(receive)*/ anti-entropy(S,R) { Get R.V from R w = first write in S.write-log WHILE (w) DO { IF R.V(w.server-id) < w.accept-stamp THEN /*w is new for R*/ SendWrite(R, w) w = next write in S.write-log } } Feature of this basic alg. - simple, flexible: + support variety of topology + flexible policy choice: when and to whom to reconcile - S keeps every writes operation (since the beginning) ==> this can be a waste of space - incremental: in case of failure during a sendwrite, later anti-entropy does not need to resend previous writes, because all writes sent prior to the failure is stored and processed by the receiver (i.e. the R.V has been updated) - the delivery order of writes is important (to ensure prefix property) ==> need some transport protocol that guarantees ordered delivery # 3. Effective Write-log management Basic algorithm requires that sender S needs to keep all writes from the very beginning (don't discard anything). Now, we are going to relax this requirement, i.e, S can discard the write log. Question: which write S can discard. In Bayou, only writes that a *stable* can be discarded. *Problem*: the writes may not fully propagate before being discarded ==> may require transferring full database state from one server to another. # *Stable write*: - committed write, no need to be re-executed at server - use *primary protocol* to stabilize write: + one database is assigned as the primary replica + primary server stabilize the position of the write in the log when it first receives it (it is not necessary that primary server receives write from client directly) (when a primary receive a write, of course it will apply the change for its database, but it also need to stabilize the write) + when stabilizing the write, the primary also assigns a CSN to the write (CSN = commit sequence number, which is monotonically increase) + New partial order based on CSN (and the accept-stamp) *preserving the prefix property requirement* - servers reconcile uncommitted writes with primary using basic algorithm - committed writes are propagated before tentative writes *Anti-entropy with support for committed writes* /* S.CSN: highest commit sequence number known to S */ anti-entropy(S,R) { Get R.V and R.CSN from receiving server R #first send all the committed writes that R does not know about IF R.CSN < S.CSN THEN w = first committed write that R does not know about WHILE (w) DO IF w.accept-stamp <= R.V(w.server-id) THEN # R has the write, but does not know it is committed SendCommitNotification(R, w.accept-stamp, w.server-id. w.CSN) ELSE SendWrite(R, w) END w = next committed write in S.write-log END END w = first tentative write #now send all the tentative writes WHILE(w) DO IF R.V(w.server-id) < w.accept-stamp THEN SendWrite(R. w) w = next write in S.write-log END } Note: the notification is pretty light-weight, do not have to send the entire write # write-log truncation - can truncate any prefix of stable part of the log - implication: write-log may not hold enough write to perform incremental reconciliation (e.g. S receives some writes and stabilize and discard them, but these write may not propagate to R, because R has been disconnected for a long time) ==> in this case, a full database transfer need to happen *How S detects that there are writes that S has truncated, but has not yet been propagate to R*: - each server maintains another version vector, S.O, that characterizes the omitted prefix of the server’s write-log; - using OSN: omitted sequence number - if S.OSN > R.CSN, then there exist committed writes that S truncated from its log, but R has not received them (the writes) yet *hence the algorithm* IF S.OSN > R.CSN THEN # need to execute full database transfer Roll back S's database to the state corresponding to S.O sendDatabase(R, S.DB) sendVector(R, S.O) # this will be R's new R.O vector sendOSN(R, S.OSN) # R's new R.OSN will now be R.OSN END rest is the same as above NOTE: the receiver R removes all writes from its write-log that are covered by the new omitted vector (S.O), but it retains all writes that are not covered by S.O, because these write may not be known to the sender S. *Feature* - retransfer full database may require a lot of bandwidth - database transfer is not incremental # Storage and Networking Resource Trade-Off - Server may retains writes in the log to avoid full database transfer - or it can aggressively truncates writes, but occasionally have to do a full database transfer - Hence, if the database size is large and network bandwidth is low, then should go for incremental update (i.e, retains all the write-log) - Again, this trade-off enables flexible policy: + how many writes should be retained # Rolling back the Write-log (not get it ???) - When roll back happens: + at sender, during a full database transfer + at receiver, it has to roll its log back to the position of the earliest write it receives. (this happens at most once per anti-entropy session) (???? I don't get it here) - Optimization: + if a receiver involves into two roll-back sessions, just need to roll back to the insertion point of the earliest write being received + receiver does not need to redo to rolled-back writes until the next read from an application ==> tradeoff between the cost of anti-entropy session, and the latency of the next read from a client # 4. Some protocol extensions *Anti-entropy though Transportable Media* + instead of using the network, S can simple write the updates to a files as well as S.CSN, SV, ... (this allow receiver when reading the file identify if receiver is up to date or not) *Session guarantees* to reduce client-observed inconsistencies when accessing different servers: by causal-accept-order: ==> need to make sure that write A precedes write B if and only if at the time write B is accepted by some server from a client, write A was already known to that server. How to implement this? Use logical clock at each server. This clock is advanced: - when new writes are accepted by the server from clients - when write with higher accept-stamps are received through anti-entropy *Eventual Consistency*: If we enforce a total order based on accept-stamp using server ID to break tie, then we can get eventual consistency # 5. Discussion - discuss about which feature provided by which design choice (Section 5.1) - disadvantage: + size of version vectors and write-log + if update activity is high, then write-log can be larger then db size + even when apply the write-log truncation, the write-log size will increasingly grow if the update rate is much larger than the commit rate - policy choices 1) When to do reconciliation + periodically + manually + system trigger 2) Choosing whom to perform anti-entropy + reachable + based on network connection (tradeoff between bandwidth usage and propagation rate) 3) how aggressively to truncate write-log tradeoff between storage needed between networking resource (since sometime, a full database transfer require) 4) which server to create new replica + depends on: > server ID length > up-to-dateness > connection bandwidth > write-log completeness - security in Bayou: + enforced by digital certificate + server authenticates each other before anti-entropy + writes include certificates to authorize database access