Mike's notes on replication:
http://pages.cs.wisc.edu/~swift/classes/cs739-sp10/lectures/Lecture%2012%20-%20replication-v3.pdf

**FLexible Update Propagation for Weakly Consistent Replication**
=================================================================

# Strict consistency vs. eventual consistency
- eventual consistency:
	+ after data update, inconsistent still holds for some time
	(i.e client may see inconsistent), but finally, the system is consistent
	+ use where performance or availability is need for than consistency
	(see the circle argument of Mike's notes)

Take:
3 design decisions in bayou:
- pair-wise communication:
	+ arbitrary communication topologies
	+ arbitrary policy choices (when, to whom)
- exchange write operations only
	+ low bandwidth network
- propagate writes between replicas in an order that is closed with respect to
the writes' accept, causal, total order
	+ incremental progress
	+ use of transportable media


# Question:
What is S.O (don't quite get it ... Leo)

# 1. Introduction
*Weak consistency*
- update any where
- but require a reconciliation protocol to propagate updates
Advantage:
- relax consistency
- flexible policy: when , to whom, and what to reconcile
- availability, fast response time (performance)

*Summary of Bayou anti-entropy protocol*
- support arbitrary communication topologies:
	+ mechanism to propagate update between any two replicas, then finally
	updates will propagate through the system
- operation over low-bandwidth network
	+ update operation is propagated, not the full database content
- incremental progress:
	+ allow interruption, like transient network disconnection
- eventual consistency
	+ each update eventually reach every replica
- efficient storage management
	+ logged updates can be discard to reclaim storage
- propagation through transportable media
- light-weight management of dynamic replica set:
	+ creation and retirement of a replica through communication with only one
	available replica (not all replicas, like some other protocol)
- arbitrary policy choice:
	+ flexible choice: when, to whom, and what ...

==>Take: very flexible, support a variety of application needs, network topology
and so forth.

# 2 Basic Anti-entropy

*Storage model*
- each server has:
	+ an ordered log of updates
	+ a database resulting from the in-order execution of these updates
- entry in the write log:
	+ can contain write received from application or other servers
- When a server *first* receives a write from a client application:
	+ the server assigns an increasing *accept-stamp* to the write
		> accept-stamp can be time-stamp or generation counter
- Hence, each write has 2 fields
	+ accept-stamp
	+ server id: id of the server that first assigned the accept-stamp
- Accept-stamp defines a *total order* over all writes accepted by a server
- Accept-stamp defines a *partial order* over all writes in the system:
	+ write A and B can have the same accept-time but different server-id
	==> need a way to break tie (e.g. based on server id)

*Design choice* (3)
- Choosing anti-entropy partner:
	+ randomly (flexible, work with various topology)
	+ restricted: less flexible
- Exchange write operation rather than database contents
	+ the amount of data propagated depends on update activity rather than
	the database size
		==> good when a database size is large
	+ avoid ambiguity introduced by creation or deletion of replicated objects
	+ write operation can easily stored in log
- Enforcing partial accept-order: prefix-property
	+ Server R that holds a write stamped Wi that was initially accepted by X
	will also hold all writes accepted by X prior to Wi.
	+ Version vectors V represent the set of writes know to a server:
		==>R.V(X) is the largest accept-stamp of *any* write known to R that was
		originally accepted from a client by X.
		(i.e. R knows all writes accepted by X prior (and including) R.V(X))

*Basic Alg.*

/*basic anti-entry executed at S(ender), to update receiving server R(receive)*/
anti-entropy(S,R) {
	Get R.V from R
	w = first write in S.write-log
	WHILE (w) DO {
		IF R.V(w.server-id) < w.accept-stamp THEN
			/*w is new for R*/
			SendWrite(R, w)
		w = next write in S.write-log
	}
}

Feature of this basic alg.
- simple, flexible:
	+ support variety of topology
	+ flexible policy choice: when and to whom to reconcile
- S keeps every writes operation (since the beginning)
	==> this can be a waste of space
- incremental: 
	in case of failure during a sendwrite, later anti-entropy does not 
	need to resend previous writes, because all writes sent prior to the failure
	is stored and processed by the receiver (i.e. the R.V has been updated)
- the delivery order of writes is important (to ensure prefix property)
	==> need some transport protocol that guarantees ordered delivery

# 3. Effective Write-log management
Basic algorithm requires that sender S needs to keep all writes from the very
beginning (don't discard anything). 
Now, we are going to relax this requirement, i.e, S can discard the write log.

Question: which write S can discard. In Bayou, only writes that a *stable* 
can be discarded. 

*Problem*: the writes may not fully propagate before being discarded
==> may require transferring full database state from one server to another.

# *Stable write*:
- committed write, no need to be re-executed at server
- use *primary protocol* to stabilize write:
	+ one database is assigned as the primary replica
	+ primary server stabilize the position of the write in the log when it 
	first receives it
(it is not necessary that primary server receives write from client directly)
(when a primary receive a write, of course it will apply the change for its
database, but it also need to stabilize the write)
	+ when stabilizing the write, the primary also assigns a CSN to the write
	(CSN = commit sequence number, which is monotonically increase)
	+ New partial order based on CSN (and the accept-stamp)

*preserving the prefix property requirement*
- servers reconcile uncommitted writes with primary using basic algorithm
- committed writes are propagated before tentative writes

*Anti-entropy with support for committed writes*
/* S.CSN: highest commit sequence number known to S */
anti-entropy(S,R) {
	Get R.V and R.CSN from receiving server R 
	#first send all the committed writes that R does not know about 
	IF R.CSN < S.CSN THEN 
		w = first committed write that R does not know about 
		WHILE (w) DO 
			IF w.accept-stamp <= R.V(w.server-id) THEN
			 	# R has the write, but does not know it is committed
				SendCommitNotification(R, w.accept-stamp, w.server-id. w.CSN) 
			ELSE
				SendWrite(R, w) 
			END
			w = next committed write in S.write-log 
		END
	END 
	w = first tentative write 
	#now send all the tentative writes 
	WHILE(w) DO
		IF R.V(w.server-id) < w.accept-stamp THEN 
			SendWrite(R. w)
		w = next write in S.write-log 
	END
}

Note: the notification is pretty light-weight, do not have to send the 
entire write

# write-log truncation
- can truncate any prefix of stable part of the log
- implication: write-log may not hold enough write to perform incremental 
reconciliation
(e.g. S receives some writes and stabilize and discard them, but these write may 
not propagate to R, because R has been disconnected for a long time)
==> in this case, a full database transfer need to happen

*How S detects that there are writes that S has truncated, but has not yet been
propagate to R*:
- each server maintains another version vector, S.O, that characterizes the 
  omitted prefix of the server’s write-log;
- using OSN: omitted sequence number
- if S.OSN > R.CSN, then there exist committed writes that S truncated from its 
log, but R has not received them (the writes) yet

*hence the algorithm*
IF S.OSN > R.CSN THEN
	# need to execute full database transfer
	Roll back S's database to the state corresponding to S.O
	sendDatabase(R, S.DB)
	sendVector(R, S.O) # this will be R's new R.O vector
	sendOSN(R, S.OSN) # R's new R.OSN will now be R.OSN
END
rest is the same as above

NOTE: the receiver R removes all writes from its write-log that are covered
by the new omitted vector (S.O), but it retains all writes that are not covered
by S.O, because these write may not be known to the sender S.

*Feature*
- retransfer full database may require a lot of bandwidth
- database transfer is not incremental

# Storage and Networking Resource Trade-Off
- Server may retains writes in the log to avoid full database transfer
- or it can aggressively truncates writes, but occasionally have to do 
a full database transfer
- Hence, if the database size is large and network bandwidth is low, 
then should go for incremental update (i.e, retains all the write-log)
- Again, this trade-off enables flexible policy:
	+ how many writes should be retained
	
# Rolling back the Write-log (not get it ???)
- When roll back happens:
	+ at sender, during a full database transfer 
	+ at receiver, it has to roll its log back to the position of the earliest
	write it receives. (this happens at most once per anti-entropy session)
	 (???? I don't get it here)

- Optimization:
	+ if a receiver involves into two roll-back sessions, just need to roll
	back to the insertion point of the earliest write being received
	+ receiver does not need to redo to rolled-back writes until the next read
	from an application
	==> tradeoff between the cost of anti-entropy session, and the latency
	of the next read from a client

# 4. Some protocol extensions

*Anti-entropy though Transportable Media*
	+ instead of using the network, S can simple write the updates to a files
	as well as S.CSN, SV, ... (this allow receiver when reading the file
	identify if receiver is up to date or not)

*Session guarantees*
to reduce client-observed inconsistencies when accessing different servers: by 
causal-accept-order:
==> need to make sure that write A precedes write B if and only if at the time
write B is accepted by some server from a client, write A was already known
to that server.

How to implement this? Use logical clock at each server. This clock is advanced:
- when new writes are accepted by the server from clients
- when write with higher accept-stamps are received through anti-entropy

*Eventual Consistency*:
If we enforce a total order based on accept-stamp using server ID to break tie,
then we can get eventual consistency

# 5. Discussion
- discuss about which feature provided by which design choice
(Section 5.1)
- disadvantage:
	+ size of version vectors and write-log
	+ if update activity is high, then write-log can be larger then db size
	+ even when apply the write-log truncation, the write-log size will
	increasingly grow if the update rate is much larger than the commit rate
- policy choices
	1) When to do reconciliation
		+ periodically
		+ manually
		+ system trigger
	2) Choosing whom to perform anti-entropy
		+ reachable
		+ based on network connection
		(tradeoff between bandwidth usage and propagation rate)
	3) how aggressively to truncate write-log
		tradeoff between storage needed between networking resource
		(since sometime, a full database transfer require)
	4) which server to create new replica
		+ depends on:
			> server ID length
			> up-to-dateness
			> connection bandwidth
			> write-log completeness
- security in Bayou:
	+ enforced by digital certificate
	+ server authenticates each other before anti-entropy
	+ writes include certificates to authorize database access