* propagation:
  - use a merge operation to deal with out-of-order update messages:
    > each entry has two sublists: active and deleted
    > clean up the deleted list once a day; throw away items older than 14 days
  - when 2 admins perform conflicting updates at two servers at almost the
    same time, it is hard to predict which will prevail. That is OK
    --> eventual consistency
  - each registration server checks its inbox every 30 seconds

  Example: say we have an entry that is replicated at server S1 and server S2.
  The entry looks like this at both S1 and S2:
      Active: A B C
      Del   : NULL
  Now a client comes and REMOVEs A, then ADDs A, at S1. So the entry at S1
  looks like:
      After REMOVE    Active: B C
                      Del   : A(0)      // 0 here is the timestamp
      After ADD       Active: A(1) B C
                      Del   : NULL
  Each update message propagates to server S2. Assume no MERGE operation, and
  the message simply says REMOVE A or ADD A. If the ADD comes first and the
  REMOVE comes second, then the entry at S2 looks like:
      After ADD       Active: A B C
                      Del   : NULL
      After REMOVE    Active: B C
                      Del   : A
  which is NOT what we expect.

  To SOLVE this, Grapevine uses a merge operation, and the entire new entry
  version is sent as the update message. Consider the sequence of merges at
  S2 if it receives the ADD first, then the REMOVE:
      Update message with ADD:     Active: A(1) B C   Del: NULL
      Current version at S2:       Active: A B C      Del: NULL
      ==> NEW version after merge: Active: A(1) B C   Del: NULL
  Now, receiving the update message with REMOVE:
      Update message:              Active: B C        Del: A(0)
      ==> After merging:           Active: A(1) B C   Del: NULL
  This works, but there are 2 problems:
    1) update messages are costly, since each contains the entire entry
    2) if an update message gets lost --> inconsistency

* What if an update message gets lost?
  - the system can end up with permanent inconsistency
  - then, how to resolve it? On the next update to that entry, the merge
    operation will find the inconsistency.
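The timestamped merge described above can be sketched in a few lines. This is a minimal illustration in my own notation, not Grapevine's actual code: each sublist maps a member name to the timestamp of the operation that last touched it, and on merge the newest timestamped fact about each name wins.

```python
# Minimal sketch of a timestamped active/deleted merge (hypothetical names
# and data layout; an entry is a pair of dicts: active and deleted, each
# mapping member name -> timestamp of the last operation on it).

def merge(local, remote):
    """Merge two (active, deleted) versions of a registry entry."""
    merged_active, merged_deleted = {}, {}
    names = set(local[0]) | set(local[1]) | set(remote[0]) | set(remote[1])
    for name in names:
        # Collect every (timestamp, is_deleted) fact known about this name.
        candidates = []
        for active, deleted in (local, remote):
            if name in active:
                candidates.append((active[name], False))
            if name in deleted:
                candidates.append((deleted[name], True))
        ts, is_deleted = max(candidates)          # newest timestamp wins
        (merged_deleted if is_deleted else merged_active)[name] = ts
    return merged_active, merged_deleted

# The REMOVE A / ADD A example from the notes, with S2 receiving the ADD
# (timestamp 1) before the late REMOVE (timestamp 0):
s2 = ({"A": -1, "B": -1, "C": -1}, {})            # current version at S2
s2 = merge(s2, ({"A": 1, "B": -1, "C": -1}, {}))  # update with ADD A(1)
s2 = merge(s2, ({"B": -1, "C": -1}, {"A": 0}))    # late update with REMOVE A(0)
assert "A" in s2[0] and "A" not in s2[1]          # A stays active, as expected
```

The same rule also repairs the lost-update case: when the next full-entry update arrives, A(0) in the deleted list outranks the stale A(-1) in the active list.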
  ==> Problem: what if that entry never gets updated later? Then it stays
      inconsistent.
  Example: at S1 and S2, we have
      Active: A(-1) B C
      Del   : NULL
  Now a client removes A at S1, but the update message gets lost, so S2
  never sees it:
      At S1: [B C], [A(0)]         // first list is active, second is del
      At S2: [A(-1) B C], [NULL]   // ! inconsistent
  Now say a client adds D to the entry at S1:
      S1: [B C D(1)], [A(0)]
  and now S2 receives the new version:
      update message:  [B C D(1)], [A(0)]
      current at S2 :  [A(-1) B C], [NULL]
  The merge figures out the inconsistency: based on the timestamps, it knows
  that A should have been removed.
  NOTE: the timestamp here is actually the timestamp plus the server that
  initiated the update, so timestamps define a total order (equal timestamps
  are distinguished by server) --> similar to Bayou.
  RULE: last timestamp wins.
  PROBLEM: if 2 admins create 2 entries with the same name at different
  servers, we want the earlier creation to win.
  Solution: don't deal with this. Just assume one admin per registry, so the
  problem is very unlikely to happen.

MY NOTES (2nd revision)
=======================

# 4. EFFECTS OF SCALE

Distributed and Replicated Registry
-----------------------------------
Each registration server stores registration data, containing:
  ~ info about all other message servers and registration servers (addresses)
  ~ names of all registries
  ~ mapping: registry --> registration servers containing a replica of that
    registry
Good:
  - this info is small
  - it does not grow with the number of users in the community
  - the number of registries can increase, but not as fast as the number of
    users

Lookup algorithm
----------------
- simply loop through the list of available servers and choose the nearest one
- Good: simple, and the operation is NOT frequent
- Bad: if the # of servers increases, or the operation becomes FREQUENT

BAD delivery algorithm:
-----------------------
- Distribution lists:
  + the recipient lists of a message
  + recursive: can contain group entries, which hence need to be resolved...
- The algorithm: a message server processes a message with distribution lists:
  1. Resolve the distribution list
     + for each group entry in the distribution list, look it up (at a
       registration server) to find the group members
     + a group member can itself be another group (recursive), hence
       additional lookups may be needed
  2. For EACH entry (individual), assuming no cache:
     + look up its inbox list
     + DIRECTLY contact one of the message servers in that list, and forward
       the message
- WHY BAD?
  1. Entries in the distribution list are processed one by one,
     + hence slow when the distribution list size increases
     + and the expansion cost is high when the nesting level of groups is deep
     SOLUTION: distribute the computation
       ~ expand one level instead of recursing
       ~ sort the expanded list by registry name; create a steering list
         containing all recipients that belong to each registry. At the end,
         we have a mapping: registry --> steering list
       ~ for each registry name, find a message server whose registration
         server contains a replica of that registry, and forward the message
         together with the corresponding steering list to that message server
  2. Directly contacting the inbox site may not be a good idea if the path
     from the sender to the inbox site contains many links, because the
     chance of a successful direct connection decreases as the number of
     links in between increases.
     SOLUTION: instead of a direct connection, multi-hop forward the message
     through intermediate Grapevine servers (not the final destination) along
     the path.

**Experience with Grapevine: The Growth of a Distributed System**
=================================================================

Look at the title: this paper is all about the *lessons* learned as the
system grows, i.e. under high workload (what the authors call "surprises"):
- some of the original assumptions break
- some performance problems appear
... What else? Let's find out. Writes succeed, but can reads?

# First, what is Grapevine anyway?
- a distributed system providing message delivery & registration services
  (mostly)
- implemented in Mesa
- each node: 128KB memory, 5MB hard disk (small, isn't it?)
- each server: message server program + registration server program
- each service is a client of the other:
  + the message server asks the registration server for the inboxes of a
    particular RName
  + the registration server propagates updates by sending messages
- message delivery service:
  + a client sends a message to individuals or distribution lists
  + messages are buffered in inboxes until recipients request them
  + a message can be anything
  + replicated submission path: any message server can accept a message to
    deliver
  + a user has several inboxes ==> replicated delivery path
- registration service:
  + naming entry: {RName, value}
  + 2 types: group, individual
  + groups can be recursive, i.e. a group can contain another group
    (complex, right?)
  + groups are often used as distribution lists
  + partitioned naming scheme:
    > RName = F.R, where R = registry, F = name within R
    > registries correspond to locations, organizations
  + distributed: each registration server contains all RNames for a registry
  + replicated: each registry is replicated at different servers

  Problem: how to propagate updates? An update can happen anywhere, and is
  propagated later using Grapevine messages.
  (We will see a problem here: merge vs. diff.)
  Hence, there can be inconsistency if updates to the same item happen at two
  different servers at almost the same time. How to deal with that? Live
  with it.
  One more problem: since updates are propagated as messages, they can be
  delivered out of order during propagation. To deal with this:
  - a deleted list
  - a merge operation (which turns out to be expensive)
  - Note: the GrapevineUser package makes the distributed nature transparent
    to users
- Clients of Grapevine:
  + users of the mailing system
  + file servers
  + ...

Now, on to the lessons. What problems can we think of when the system grows
and is under high load?
- scalability
- configuration
- transparency of distribution and replication
- adjusting to the load
- reliability

# What are the good and bad points of Grapevine when dealing with scale?

Scale = add more similar servers (not more powerful ones), and we should be
able to solve the problem of high load.

- Good: KISS
  + each registration server keeps all the information it needs to know
    ==> no need to add a level of indirection to spread the info.
    Besides, the amount of disk space for this info is small (15KB):
      ~ names and addresses of all message and registration servers
      ~ names of all registries
      ~ mapping: registry --> (servers containing a replica of that registry)
  + choose the nearest server: scan over 30 servers (simple)
      ~ good if this operation is not too frequent
      ~ bad if frequent, or if the # of servers is large
  + division of the registration database into registries
      > registries tend to increase in number, not in size; hence this is
        good ==> the registration data on one registration server will not
        grow with the size of the system
- Bad:
  + if the # of servers increases, scanning the list of servers may be
    expensive
  + distribution lists: can contain individuals from different registries
      > they grow as a fraction of the total user community
      > problem: processed one by one, hence slow.
        For each RName (assuming no cache), we need to contact a registration
        server to find the corresponding inbox sites, then forward the
        message to an inbox...
      > solution: construct sublists by registry, and forward the message
        plus each sublist to a server in charge of that registry
        ==> distribution of computation: the load is now tied to the size of
        a registry (which does not change much) rather than the total number
        of users in the system
  + what if the number of messages increases? It can overwhelm the system.
    Solution: use some filtering mechanism (not yet implemented)
  + what if the internet scales and the # of hops between servers grows?
    ==> direct connection becomes unsuitable because of unreliable links

# Configuration: a policy question

- When to add a new server?
  + when the load on existing servers gets too large
  + when a network link is slow or unreliable
- separation of message server and registration server:
  + when we have multiple local servers
- placement of primary inboxes in local areas with more than one server:
  + divide the assignment of inboxes evenly among servers
    (to avoid overloading a particular server)
  + should consider the benefit of sharing
  + easy to administer ==> placement along operational lines
- where to place secondary inboxes:
  + alternative 1: the next-nearest server
    ==> can overload this server when the primary one fails
  + alternative 2: place it at the "other end" of an unreliable link
  In practice:
    primary inbox on the server itself
    secondary at the nearest server
    third at the other end of the internet
- registry naming and replication:
  + based on geographical area.
    Problem: think about sending a message to an organization that is split
    across 2 geographical areas ==> the two sites' servers each need the
    other's entire registry.
    Why? Because the registry is the grain of replication.
  + where to put registry replicas:
    > fast response: put a registry near the servers that need it (i.e. the
      message servers containing inboxes for that registry)
    > keep the contents of distribution lists close to the message servers
      they are addressed to
    > availability: what if a link is down? ==> put replicas on both sides
      of the link
    > prevent data loss: what if a disk breaks down?
    > avoid overloading a replica with updates

# Transparency of distribution and replication

- bad 1: delayed update propagation
  + can take minutes, because messages are used to propagate updates
    (recall the message delivery algorithm)
    ==> large inconsistency windows
    > 2 clients accessing different servers can see inconsistency.
      Example 1: create a new user at one registration server, then add that
      user to a distribution list at another registration server.
      Example 2: add a user to a list, then send email to that list; the
      newly added user may not receive the message.
    SOLUTION:
      > retry
      > restrict one admin to a single registration server during one session
  + deleted names:
      ~ we need to remove a deleted name from the distribution lists that
        contain it
      ~ but we don't know which registries those lists belong to.
    Hence:
      ~ notify the owner of a distribution list when a name in it is invalid
      ~ and delete that name from the list at that point,
        i.e. we remove the name from the list at sending/resolving time
- bad 2: automatically re-mailing all messages when an inbox is removed or
  demoted
  + may cause substantial load on the system (if the inbox is large)
  + the alternative forces the user to retrieve messages when his inbox is
    removed (really?)
  + should have done: when an inbox is removed, just mark it, but do not
    really remove it; actually remove it once the user has downloaded all
    his mail. (Note: refer to the lessons learned in those experience papers.)
- bad 3: duplicated messages (not due to a client sending the message twice)
  + why duplicated messages? Distribution lists can overlap.
  + e.g.:
      ~ 2 lists L1 and L2 both contain recipient R
      ~ during message delivery, L1 cannot be expanded, so it is left for
        later (i.e. we simply go ahead and process L2).
        The client first checks the inbox and sees the message. Later, when
        L1 is re-mailed, the message arrives again: a duplicate.
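Each message carries a postmark (a unique stamp), so a message server can suppress a duplicate delivery simply by remembering the postmarks already placed in an inbox. A minimal sketch of that idea (the data layout and function name are mine, not Grapevine's):

```python
# Sketch of postmark-based duplicate suppression (hypothetical structure;
# the paper only says each message carries a postmark that lets a message
# server detect the duplication).

def deliver(inbox, seen_postmarks, postmark, body):
    """Append a message to the inbox unless its postmark was seen before."""
    if postmark in seen_postmarks:
        return False                  # duplicate from an overlapping list
    seen_postmarks.add(postmark)
    inbox.append((postmark, body))
    return True

# Recipient R is on both lists L1 and L2; the same message arrives twice.
inbox, seen = [], set()
assert deliver(inbox, seen, postmark=42, body="hello")      # via L2: kept
assert not deliver(inbox, seen, postmark=42, body="hello")  # via re-mailed L1: dropped
assert len(inbox) == 1
```

Note this only catches duplicates that land on the *same* message server, which matches the limitation discussed in the notes.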
    NOTE: each message has a postmark, so if L1 and L2 both send to the same
    message server, the duplication will be detected
  + alternative: wait until all expansions are completed, so duplicates can
    be found; but this may be expensive and incurs delays
- bad 4: placing a message with a non-local server *immediately* can result
  in *longer* delivery time than placing the message with a local server *a
  little later*. Why? Because the link to the remote server may be
  unreliable. The problem is: Grapevine has no knowledge of the internet
  (congestion, reliability, etc.); it bases decisions only on link count.
- bad 5: taking a distribution list and processing it (list expansion, inbox
  site location) immediately may lead to longer delivery time if the nearest
  replica of the registry is far away
  ==> it is better to forward the message to a server near that replica
- bad 6: hiding details from the user (when something goes wrong)
  (Note: is this mentioned in one of the experience papers?)
  + details: for example, when a user cannot receive his mail, he wants to
    know what went wrong (link broken, server crashed, etc.)

# Adjusting to the load

Wrong predictions about the load lead to performance problems. As the system
expands, distribution list sizes increase, and so do the numbers of messages,
updates, etc. Therefore, things that worked well previously may break.

- Bad 1: propagating an update by sending the entire changed entry in a
  message; the receiver *merges* it with the local entry to produce the new
  version (remember: an entry contains 2 sublists, active and deleted...)
  + as distribution list sizes and update frequency increase, the cost of
    merging increases
  + some servers would get an hour or more behind in delivering messages
    because of the update load, which may lead to further problems, e.g.
    running out of disk space...
  + Solution: make the frequent case fast (it is OK for infrequent cases to
    be costly)
    > What is the frequent case?
      - users add/remove their own names
      - an administrator/list owner adds/deletes a member
    > for add and remove, include *only the change* in the update message
      - reduces the size of the message
      - reduces the update cost
    > still retain merges for other operations. Why? Recall that update
      messages can arrive out of order. What if a user deletes an entry, then
      adds the same entry back, but the addition propagates to a registration
      server before the deletion? Merge handles this case. Also recall that
      each change carries a timestamp, hence merge works.
- Bad 2: file servers need to contact Grapevine to check users' credentials
  + requires both the file server and a Grapevine server to be available in
    order to access a file
  + slower (what if the file server did authentication itself?)
  + may overload Grapevine servers
  Solution:
  + *cache* credentials at the file server
  + after a time-out, authentication is rechecked
  + if no registration server can be contacted, the cached authentication is
    assumed to still be valid
  Implication: certain authentication and access control changes can take a
  long time to become effective. I.e. granting permission to reference a
  file takes effect quickly; revoking permission and changing passwords can
  be delayed, but that is consistent with their security environment.
- Bad 3: problems with *nested groups*
  + groups can contain groups... access control is determined by membership
    in the closure, hence *slow*:
    > need to contact a registration server to determine
      - the type of a name (is it a group or an individual?)
      - if it is a group, its list of members
  + Solution:
    > avoid a lookup by adding a special symbol to a name when it is a group
      name ==> saves one lookup
    > for complex groups, use a parallel *flattened* version for access
      control checks, avoiding the lookups caused by multiple nesting
      levels. This is like a post-processed access control list for a
      complex group; hence no back-and-forth lookups are needed.
      Note: this "flattened" version can be slightly out of date.
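The "flattened" version above amounts to pre-computing the transitive closure of group membership. A minimal sketch, assuming groups are stored as a dict mapping a group RName to its member list (the names and layout are mine; Grapevine keeps such a flattened list alongside the real group and accepts that it may be slightly stale):

```python
# Sketch of pre-computing a "flattened" version of a nested group for
# access-control checks. Any member name not present in `groups` is
# treated as an individual.

def flatten(name, groups, seen=None):
    """Return the set of individuals in the membership closure of `name`."""
    seen = set() if seen is None else seen
    individuals = set()
    for member in groups.get(name, []):
        if member in groups:              # nested group: recurse
            if member not in seen:        # guard against membership cycles
                seen.add(member)
                individuals |= flatten(member, groups, seen)
        else:                             # individual: add directly
            individuals.add(member)
    return individuals

# Two mutually nested lists in a hypothetical registry "pa":
groups = {
    "list-a.pa": ["alice.pa", "list-b.pa"],
    "list-b.pa": ["bob.pa", "list-a.pa"],
}
assert flatten("list-a.pa", groups) == {"alice.pa", "bob.pa"}
```

With the closure computed once, an access-control check becomes a single set lookup instead of a chain of registration-server lookups, at the price of the flattened copy lagging behind updates.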
- Bad 4: an inbox is intended to be a buffer, but can be used as a
  repository (by terminal users), hence server disks can fill up with
  messages
  + normal case: messages in an inbox are read sequentially and deleted
    (after being downloaded to the client)
  + some cases: a user reads messages but does not delete them.
    PROBLEM: as the # of users increases, users without a Grapevine client
    access mail in other ways, and still read messages without deleting them
    from the inbox ==> the DISK can fill up
  + solution: restrict the number of users who may use that service

# Operation of a dispersed system

- 2 types of personnel:
  + operators: need to be on site
  + experts: do not need to be on site
- dealing with failures, management tasks:
  + a corrupted disk makes a server unable to start
    ~ can recover using redundant, low-level structural information
  + at a higher level, all registry data is replicated on other servers,
    allowing the contents of registration database entries to be restored
- NOTE: Grapevine does NOT replicate message bodies. Why? KISS.
  + they said: the cost is too high for the benefit derived
    ~ benefit: a user could read his mail from any of his inboxes, with no
      need to retrieve messages from ALL inboxes at once. Moreover, it could
      tolerate total disk failure at a message server for messages that have
      not yet been retrieved.
    ~ cost: replication comes at a price. What if a server crashes and comes
      back? The message retrieval protocol would need to change: which inbox
      to receive from? What if a server holding a message shuts down before
      the user retrieves the mail, then comes back after the user has
      already retrieved it elsewhere?
      ==> need some mechanism to clean up garbage messages
  + they also said the failure probability is too low to matter
- features:
  + statistics
  + monitoring
  + trace logs (for easier debugging)
  + dead letter facility: an undeliverable message is returned with the
    probable cause

# Reliability

- provided by replication of functionality:
  + replicated message submission path: any message server can accept a
    message
  + replicated message delivery path: a user has at least 2 inboxes
  + replicated registration database
- refuse connections if free disk space drops below 5%
- it is possible for a server to deadlock itself with a full disk, which may
  lead to multiserver deadlock: S1 fails, all messages are forwarded to S2,
  S2 can overload and hence fail too.
- Important lesson: using redundancy to achieve reliability requires the
  system to have spare resources in the normal case.
  + these spare resources allow it to handle peak load
  + e.g.: re-mailing a large inbox may paralyze the system, causing other
    servers to fail.
    SOLUTION: split the secondary inboxes of individuals whose primary
    inboxes are on a certain server among several other servers, so that
    when the primary server fails, the load is spread over more than one
    server. Say u1, u2, u3 all have their primary inboxes on S1 and their
    secondary inboxes on S2. If S1 fails --> all mail goes to S2 and
    overloads it. Better: put the secondary inboxes of u1, u2, and u3 on
    S2, S3, S4...
    In general: think carefully about how load will be distributed when
    various servers become unavailable or when various communication links
    go down, and configure the registration data and inbox site assignments
    so that sensible behavior results.
- on message archiving:
  + archive messages more than 7 days old
  + Problem 1: this is a problem if the file server keeping the archived
    messages for an inbox is unavailable. Thus the availability of the
    message delivery path to an individual with archived messages can depend
    on the availability of a particular archival file server.
    Solution: a larger disk and less frequent archiving would be desirable.
  + Problem 2: the 7-day threshold is fixed, and does not scale with the
    load:
    if a server receives a very large number of messages in a short time,
    the disk fills up quickly
    ==> hence archiving should become more aggressive as load grows
- dealing with potential unavailability: careful management of resources
  + disk space management: reject connections if free space drops below ...
- handling peak load using redundancy:
  + problem: spare resources can be used up without notice, so their
    exhaustion cannot be anticipated promptly
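The disk-space admission policy mentioned above (refuse connections once free space drops below 5%) can be sketched as a simple guard. The function name and threshold parameter are mine; the 5% figure is from the notes:

```python
import shutil

# Sketch of a free-disk-space admission check: refuse new connections once
# free space falls below a threshold (5% in Grapevine's case).

def accept_connection(path="/", min_free_fraction=0.05):
    """Return True iff the disk holding `path` has enough free space."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total >= min_free_fraction

if accept_connection():
    pass  # proceed to accept and buffer the incoming message
```

Rejecting work early like this is what keeps a full disk from turning into the self-deadlock (and cascading multiserver failure) described in the Reliability section.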