* propagation:
  - use a merge operation to deal with out-of-order update messages:
    > each entry has two sublists: active and deleted
    > clean up the deleted list once a day; throw away items older than 14 days
  - when 2 admins perform conflicting updates at two servers at almost the
    same time, it is hard to predict which will prevail. That is OK
    --> eventual consistency
  - each registration server checks its inbox every 30 seconds

  Example: say we have an entry that is replicated at server S1 and server S2.
  The entry looks like this at both S1 and S2:
      Active: A B C
      Del   : NULL
  Now a client comes and REMOVEs A, then ADDs A, at S1. So the entry at S1
  looks like:
      After REMOVE    Active: B C
                      Del   : A(0)      // 0 here is the timestamp
      After ADD       Active: A(1) B C
                      Del   : NULL
  Each update message propagates to server S2. Assume no MERGE operation, and
  the message simply says REMOVE A or ADD A. If the ADD comes first and the
  REMOVE comes second, then the entry at S2 looks like:
      After ADD       Active: A B C
                      Del   : NULL
      After REMOVE    Active: B C
                      Del   : A
  which is NOT what we expect.

  To SOLVE this, Grapevine uses a merge operation, and the entire new entry
  version is sent as the update message. Consider the sequence of merges at
  S2 if it receives the ADD first, then the REMOVE:
      Update message with ADD:     Active: A(1) B C   Del: NULL
      Current version at S2:       Active: A B C      Del: NULL
      ==> NEW version after merge: Active: A(1) B C   Del: NULL
  Now, receiving the update message with REMOVE:
      Update message:              Active: B C        Del: A(0)
      ==> After merging:           Active: A(1) B C   Del: NULL
  This works, but there are 2 problems:
    1) update messages are costly, since each contains the entire entry
    2) if an update message gets lost --> inconsistency

* What if an update message gets lost?
  - the system can end up with permanent inconsistency
  - then, how to resolve it? On the next update to that entry, the merge
    operation will find the inconsistency.
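The timestamped merge described above can be sketched in a few lines. This is a minimal illustration in my own notation, not Grapevine's actual code: each sublist maps a member name to the timestamp of the operation that last touched it, and on merge the newest timestamped fact about each name wins.

```python
# Minimal sketch of a timestamped active/deleted merge (hypothetical names
# and data layout; an entry is a pair of dicts: active and deleted, each
# mapping member name -> timestamp of the last operation on it).

def merge(local, remote):
    """Merge two (active, deleted) versions of a registry entry."""
    merged_active, merged_deleted = {}, {}
    names = set(local[0]) | set(local[1]) | set(remote[0]) | set(remote[1])
    for name in names:
        # Collect every (timestamp, is_deleted) fact known about this name.
        candidates = []
        for active, deleted in (local, remote):
            if name in active:
                candidates.append((active[name], False))
            if name in deleted:
                candidates.append((deleted[name], True))
        ts, is_deleted = max(candidates)          # newest timestamp wins
        (merged_deleted if is_deleted else merged_active)[name] = ts
    return merged_active, merged_deleted

# The REMOVE A / ADD A example from the notes, with S2 receiving the ADD
# (timestamp 1) before the late REMOVE (timestamp 0):
s2 = ({"A": -1, "B": -1, "C": -1}, {})            # current version at S2
s2 = merge(s2, ({"A": 1, "B": -1, "C": -1}, {}))  # update with ADD A(1)
s2 = merge(s2, ({"B": -1, "C": -1}, {"A": 0}))    # late update with REMOVE A(0)
assert "A" in s2[0] and "A" not in s2[1]          # A stays active, as expected
```

The same rule also repairs the lost-update case: when the next full-entry update arrives, A(0) in the deleted list outranks the stale A(-1) in the active list.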
  ==> Problem: what if that entry never gets updated later? Then it stays
      inconsistent.
  Example: at S1 and S2, we have
      Active: A(-1) B C
      Del   : NULL
  Now a client removes A at S1, but the update message gets lost, so S2
  never sees it:
      At S1: [B C], [A(0)]         // first list is active, second is del
      At S2: [A(-1) B C], [NULL]   // ! inconsistent
  Now say a client adds D to the entry at S1:
      S1: [B C D(1)], [A(0)]
  and now S2 receives the new version:
      update message:  [B C D(1)], [A(0)]
      current at S2 :  [A(-1) B C], [NULL]
  The merge figures out the inconsistency: based on the timestamps, it knows
  that A should have been removed.
  NOTE: the timestamp here is actually the timestamp plus the server that
  initiated the update, so timestamps define a total order (equal timestamps
  are distinguished by server) --> similar to Bayou.
  RULE: last timestamp wins.
  PROBLEM: if 2 admins create 2 entries with the same name at different
  servers, we want the earlier creation to win.
  Solution: don't deal with this. Just assume one admin per registry, so the
  problem is very unlikely to happen.

MY NOTES (2nd revision)
=======================

# 4. EFFECTS OF SCALE

Distributed and Replicated Registry
-----------------------------------
Each registration server stores registration data, containing:
  ~ info about all other message servers and registration servers (addresses)
  ~ names of all registries
  ~ mapping: registry --> registration servers containing a replica of that
    registry
Good:
  - this info is small
  - it does not grow with the number of users in the community
  - the number of registries can increase, but not as fast as the number of
    users

Lookup algorithm
----------------
- simply loop through the list of available servers and choose the nearest one
- Good: simple, and the operation is NOT frequent
- Bad: if the # of servers increases, or the operation becomes FREQUENT

BAD delivery algorithm:
-----------------------
- Distribution lists:
  + the recipient lists of a message
  + recursive: can contain group entries, which hence need to be resolved...
- The algorithm: a message server processes a message with distribution lists:
  1. Resolve the distribution list
     + for each group entry in the distribution list, look it up (at a
       registration server) to find the group members
     + a group member can itself be another group (recursive), hence
       additional lookups may be needed
  2. For EACH entry (individual), assuming no cache:
     + look up its inbox list
     + DIRECTLY contact one of the message servers in that list, and forward
       the message
- WHY BAD?
  1. Entries in the distribution list are processed one by one,
     + hence slow when the distribution list size increases
     + and the expansion cost is high when the nesting level of groups is deep
     SOLUTION: distribute the computation
       ~ expand one level instead of recursing
       ~ sort the expanded list by registry name; create a steering list
         containing all recipients that belong to each registry. At the end,
         we have a mapping: registry --> steering list
       ~ for each registry name, find a message server whose registration
         server contains a replica of that registry, and forward the message
         together with the corresponding steering list to that message server
  2. Directly contacting the inbox site may not be a good idea if the path
     from the sender to the inbox site contains many links, because the
     chance of a successful direct connection decreases as the number of
     links in between increases.
     SOLUTION: instead of a direct connection, multi-hop forward the message
     through intermediate Grapevine servers (not the final destination) along
     the path.

**Experience with Grapevine: The Growth of a Distributed System**
=================================================================

Look at the title: this paper is all about the *lessons* learned as the
system grows, i.e. under high workload (what the authors call "surprises"):
- some of the original assumptions break
- some performance problems appear
... What else? Let's find out. Writes succeed, but can reads?

# First, what is Grapevine anyway?
- a distributed system providing message delivery & registration services
  (mostly)
- implemented in Mesa
- each node: 128KB memory, 5MB hard disk (small, isn't it?)
- each server: message server program + registration server program
- each service is a client of the other:
  + the message server asks the registration server for the inboxes of a
    particular RName
  + the registration server propagates updates by sending messages
- message delivery service:
  + a client sends a message to individuals or distribution lists
  + messages are buffered in inboxes until recipients request them
  + a message can be anything
  + replicated submission path: any message server can accept a message to
    deliver
  + a user has several inboxes ==> replicated delivery path
- registration service:
  + naming entry: {RName, value}
  + 2 types: group, individual
  + groups can be recursive, i.e. a group can contain another group
    (complex, right?)
  + groups are often used as distribution lists
  + partitioned naming scheme:
    > RName = F.R, where R = registry, F = name within R
    > registries correspond to locations, organizations
  + distributed: each registration server contains all RNames for a registry
  + replicated: each registry is replicated at different servers

  Problem: how to propagate updates? An update can happen anywhere, and is
  propagated later using Grapevine messages.
  (We will see a problem here: merge vs. diff.)
  Hence, there can be inconsistency if updates to the same item happen at two
  different servers at almost the same time. How to deal with that? Live
  with it.
  One more problem: since updates are propagated as messages, they can be
  delivered out of order during propagation. To deal with this:
  - a deleted list
  - a merge operation (which turns out to be expensive)
  - Note: the GrapevineUser package makes the distributed nature transparent
    to users
- Clients of Grapevine:
  + users of the mailing system
  + file servers
  + ...

Now, on to the lessons. What problems can we think of when the system grows
and is under high load?
- scalability
- configuration
- transparency of distribution and replication
- adjusting to the load
- reliability

# What are the good and bad points of Grapevine when dealing with scale?

Scale = add more similar servers (not more powerful ones), and we should be
able to solve the problem of high load.

- Good: KISS
  + each registration server keeps all the information it needs to know
    ==> no need to add a level of indirection to spread the info.
    Besides, the amount of disk space for this info is small (15KB):
      ~ names and addresses of all message and registration servers
      ~ names of all registries
      ~ mapping: registry --> (servers containing a replica of that registry)
  + choose the nearest server: scan over 30 servers (simple)
      ~ good if this operation is not too frequent
      ~ bad if frequent, or if the # of servers is large
  + division of the registration database into registries
      > registries tend to increase in number, not in size; hence this is
        good ==> the registration data on one registration server will not
        grow with the size of the system
- Bad:
  + if the # of servers increases, scanning the list of servers may be
    expensive
  + distribution lists: can contain individuals from different registries
      > they grow as a fraction of the total user community
      > problem: processed one by one, hence slow.
        For each RName (assuming no cache), we need to contact a registration
        server to find the corresponding inbox sites, then forward the
        message to an inbox...
      > solution: construct sublists by registry, and forward the message
        plus each sublist to a server in charge of that registry
        ==> distribution of computation: the load is now tied to the size of
        a registry (which does not change much) rather than the total number
        of users in the system
  + what if the number of messages increases? It can overwhelm the system.
    Solution: use some filtering mechanism (not yet implemented)
  + what if the internet scales and the # of hops between servers grows?
    ==> direct connection becomes unsuitable because of unreliable links

# Configuration: a policy question

- When to add a new server?
  + when the load on existing servers gets too large
  + when a network link is slow or unreliable
- separation of message server and registration server:
  + when we have multiple local servers
- placement of primary inboxes in local areas with more than one server:
  + divide the assignment of inboxes evenly among servers
    (to avoid overloading a particular server)
  + should consider the benefit of sharing
  + easy to administer ==> placement along operational lines
- where to place secondary inboxes:
  + alternative 1: the next-nearest server
    ==> can overload this server when the primary one fails
  + alternative 2: place it at the "other end" of an unreliable link
  In practice:
    primary inbox on the server itself
    secondary at the nearest server
    third at the other end of the internet
- registry naming and replication:
  + based on geographical area.
    Problem: think about sending a message to an organization that is split
    across 2 geographical areas ==> the two sites' servers each need the
    other's entire registry.
    Why? Because the registry is the grain of replication.
  + where to put registry replicas:
    > fast response: put a registry near the servers that need it (i.e. the
      message servers containing inboxes for that registry)
    > keep the contents of distribution lists close to the message servers
      they are addressed to
    > availability: what if a link is down? ==> put replicas on both sides
      of the link
    > prevent data loss: what if a disk breaks down?
    > avoid overloading a replica with updates

# Transparency of distribution and replication

- bad 1: delayed update propagation
  + can take minutes, because messages are used to propagate updates
    (recall the message delivery algorithm)
    ==> large inconsistency windows
    > 2 clients accessing different servers can see inconsistency.
      Example 1: create a new user at one registration server, then add that
      user to a distribution list at another registration server.
      Example 2: add a user to a list, then send email to that list; the
      newly added user may not receive the message.
    SOLUTION:
      > retry
      > restrict one admin to a single registration server during one session
  + deleted names:
      ~ we need to remove a deleted name from the distribution lists that
        contain it
      ~ but we don't know which registries those lists belong to.
    Hence:
      ~ notify the owner of a distribution list when a name in it is invalid
      ~ and delete that name from the list at that point,
        i.e. we remove the name from the list at sending/resolving time
- bad 2: automatically re-mailing all messages when an inbox is removed or
  demoted
  + may cause substantial load on the system (if the inbox is large)
  + the alternative forces the user to retrieve messages when his inbox is
    removed (really?)
  + should have done: when an inbox is removed, just mark it, but do not
    really remove it; actually remove it once the user has downloaded all
    his mail. (Note: refer to the lessons learned in those experience papers.)
- bad 3: duplicated messages (not due to a client sending the message twice)
  + why duplicated messages? Distribution lists can overlap.
  + e.g.:
      ~ 2 lists L1 and L2 both contain recipient R
      ~ during message delivery, L1 cannot be expanded, so it is left for
        later (i.e. we simply go ahead and process L2).
        The client first checks the inbox and sees the message. Later, when
        L1 is re-mailed, the message arrives again: a duplicate.
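Each message carries a postmark (a unique stamp), so a message server can suppress a duplicate delivery simply by remembering the postmarks already placed in an inbox. A minimal sketch of that idea (the data layout and function name are mine, not Grapevine's):

```python
# Sketch of postmark-based duplicate suppression (hypothetical structure;
# the paper only says each message carries a postmark that lets a message
# server detect the duplication).

def deliver(inbox, seen_postmarks, postmark, body):
    """Append a message to the inbox unless its postmark was seen before."""
    if postmark in seen_postmarks:
        return False                  # duplicate from an overlapping list
    seen_postmarks.add(postmark)
    inbox.append((postmark, body))
    return True

# Recipient R is on both lists L1 and L2; the same message arrives twice.
inbox, seen = [], set()
assert deliver(inbox, seen, postmark=42, body="hello")      # via L2: kept
assert not deliver(inbox, seen, postmark=42, body="hello")  # via re-mailed L1: dropped
assert len(inbox) == 1
```

Note this only catches duplicates that land on the *same* message server, which matches the limitation discussed in the notes.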
    NOTE: each message has a postmark, so if L1 and L2 both send to the same
    message server, the duplication will be detected
  + alternative: wait until all expansions are completed, so duplicates can
    be found; but this may be expensive and incurs delays
- bad 4: placing a message with a non-local server *immediately* can result
  in *longer* delivery time than placing the message with a local server *a
  little later*. Why? Because the link to the remote server may be
  unreliable. The problem is: Grapevine has no knowledge of the internet
  (congestion, reliability, etc.); it bases decisions only on link count.
- bad 5: taking a distribution list and processing it (list expansion, inbox
  site location) immediately may lead to longer delivery time if the nearest
  replica of the registry is far away
  ==> it is better to forward the message to a server near that replica
- bad 6: hiding details from the user (when something goes wrong)
  (Note: is this mentioned in one of the experience papers?)
  + details: for example, when a user cannot receive his mail, he wants to
    know what went wrong (link broken, server crashed, etc.)

# Adjusting to the load

Wrong predictions about the load lead to performance problems. As the system
expands, distribution list sizes increase, and so do the numbers of messages,
updates, etc. Therefore, things that worked well previously may break.

- Bad 1: propagating an update by sending the entire changed entry in a
  message; the receiver *merges* it with the local entry to produce the new
  version (remember: an entry contains 2 sublists, active and deleted...)
  + as distribution list sizes and update frequency increase, the cost of
    merging increases
  + some servers would get an hour or more behind in delivering messages
    because of the update load, which may lead to further problems, e.g.
    running out of disk space...
  + Solution: make the frequent case fast (it is OK for infrequent cases to
    be costly)
    > What is the frequent case?
      - users add/remove their own names
      - an administrator/list owner adds/deletes a member
    > for add and remove, include *only the change* in the update message
      - reduces the size of the message
      - reduces the update cost
    > still retain merges for other operations. Why? Recall that update
      messages can arrive out of order. What if a user deletes an entry, then
      adds the same entry back, but the addition propagates to a registration
      server before the deletion? Merge handles this case. Also recall that
      each change carries a timestamp, hence merge works.
- Bad 2: file servers need to contact Grapevine to check users' credentials
  + requires both the file server and a Grapevine server to be available in
    order to access a file
  + slower (what if the file server did authentication itself?)
  + may overload Grapevine servers
  Solution:
  + *cache* credentials at the file server
  + after a time-out, authentication is rechecked
  + if no registration server can be contacted, the cached authentication is
    assumed to still be valid
  Implication: certain authentication and access control changes can take a
  long time to become effective. I.e. granting permission to reference a
  file takes effect quickly; revoking permission and changing passwords can
  be delayed, but that is consistent with their security environment.
- Bad 3: problems with *nested groups*
  + groups can contain groups... access control is determined by membership
    in the closure, hence *slow*:
    > need to contact a registration server to determine
      - the type of a name (is it a group or an individual?)
      - if it is a group, its list of members
  + Solution:
    > avoid a lookup by adding a special symbol to a name when it is a group
      name ==> saves one lookup
    > for complex groups, use a parallel *flattened* version for access
      control checks, avoiding the lookups caused by multiple nesting
      levels. This is like a post-processed access control list for a
      complex group; hence no back-and-forth lookups are needed.
      Note: this "flattened" version can be slightly out of date.
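The "flattened" version above amounts to pre-computing the transitive closure of group membership. A minimal sketch, assuming groups are stored as a dict mapping a group RName to its member list (the names and layout are mine; Grapevine keeps such a flattened list alongside the real group and accepts that it may be slightly stale):

```python
# Sketch of pre-computing a "flattened" version of a nested group for
# access-control checks. Any member name not present in `groups` is
# treated as an individual.

def flatten(name, groups, seen=None):
    """Return the set of individuals in the membership closure of `name`."""
    seen = set() if seen is None else seen
    individuals = set()
    for member in groups.get(name, []):
        if member in groups:              # nested group: recurse
            if member not in seen:        # guard against membership cycles
                seen.add(member)
                individuals |= flatten(member, groups, seen)
        else:                             # individual: add directly
            individuals.add(member)
    return individuals

# Two mutually nested lists in a hypothetical registry "pa":
groups = {
    "list-a.pa": ["alice.pa", "list-b.pa"],
    "list-b.pa": ["bob.pa", "list-a.pa"],
}
assert flatten("list-a.pa", groups) == {"alice.pa", "bob.pa"}
```

With the closure computed once, an access-control check becomes a single set lookup instead of a chain of registration-server lookups, at the price of the flattened copy lagging behind updates.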
- Bad 4: an inbox is intended to be a buffer, but can be used as a
  repository (by terminal users), hence server disks can fill up with
  messages
  + normal case: messages in an inbox are read sequentially and deleted
    (after being downloaded to the client)
  + some cases: a user reads messages but does not delete them.
    PROBLEM: as the # of users increases, users without a Grapevine client
    access mail in other ways, and still read messages without deleting them
    from the inbox ==> the DISK can fill up
  + solution: restrict the number of users who may use that service

# Operation of a dispersed system

- 2 types of personnel:
  + operators: need to be on site
  + experts: do not need to be on site
- dealing with failures, management tasks:
  + a corrupted disk makes a server unable to start
    ~ can recover using redundant, low-level structural information
  + at a higher level, all registry data is replicated on other servers,
    allowing the contents of registration database entries to be restored
- NOTE: Grapevine does NOT replicate message bodies. Why? KISS.
  + they said: the cost is too high for the benefit derived
    ~ benefit: a user could read his mail from any of his inboxes, with no
      need to retrieve messages from ALL inboxes at once. Moreover, it could
      tolerate total disk failure at a message server for messages that have
      not yet been retrieved.
    ~ cost: replication comes at a price. What if a server crashes and comes
      back? The message retrieval protocol would need to change: which inbox
      to receive from? What if a server holding a message shuts down before
      the user retrieves the mail, then comes back after the user has
      already retrieved it elsewhere?
      ==> need some mechanism to clean up garbage messages
  + they also said the failure probability is too low to matter
- features:
  + statistics
  + monitoring
  + trace logs (for easier debugging)
  + dead letter facility: an undeliverable message is returned with the
    probable cause

# Reliability

- provided by replication of functionality:
  + replicated message submission path: any message server can accept a
    message
  + replicated message delivery path: a user has at least 2 inboxes
  + replicated registration database
- refuse connections if free disk space drops below 5%
- it is possible for a server to deadlock itself with a full disk, which may
  lead to multiserver deadlock: S1 fails, all messages are forwarded to S2,
  S2 can overload and hence fail too.
- Important lesson: using redundancy to achieve reliability requires the
  system to have spare resources in the normal case.
  + these spare resources allow it to handle peak load
  + e.g.: re-mailing a large inbox may paralyze the system, causing other
    servers to fail.
    SOLUTION: split the secondary inboxes of individuals whose primary
    inboxes are on a certain server among several other servers, so that
    when the primary server fails, the load is spread over more than one
    server. Say u1, u2, u3 all have their primary inboxes on S1 and their
    secondary inboxes on S2. If S1 fails --> all mail goes to S2 and
    overloads it. Better: put the secondary inboxes of u1, u2, and u3 on
    S2, S3, S4...
    In general: think carefully about how load will be distributed when
    various servers become unavailable or when various communication links
    go down, and configure the registration data and inbox site assignments
    so that sensible behavior results.
- on message archiving:
  + archive messages more than 7 days old
  + Problem 1: this is a problem if the file server keeping the archived
    messages for an inbox is unavailable. Thus the availability of the
    message delivery path to an individual with archived messages can depend
    on the availability of a particular archival file server.
    Solution: a larger disk and less frequent archiving would be desirable.
  + Problem 2: the 7-day threshold is fixed, and does not scale with the
    load:
    if a server receives a very large number of messages in a short time,
    the disk fills up quickly
    ==> hence archiving should become more aggressive as load grows
- dealing with potential unavailability: careful management of resources
  + disk space management: reject connections if free space drops below ...
- handling peak load using redundancy:
  + problem: spare resources can be used up without notice, so their
    exhaustion cannot be anticipated promptly
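The disk-space admission policy mentioned above (refuse connections once free space drops below 5%) can be sketched as a simple guard. The function name and threshold parameter are mine; the 5% figure is from the notes:

```python
import shutil

# Sketch of a free-disk-space admission check: refuse new connections once
# free space falls below a threshold (5% in Grapevine's case).

def accept_connection(path="/", min_free_fraction=0.05):
    """Return True iff the disk holding `path` has enough free space."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total >= min_free_fraction

if accept_connection():
    pass  # proceed to accept and buffer the incoming message
```

Rejecting work early like this is what keeps a full disk from turning into the self-deadlock (and cascading multiserver failure) described in the Reliability section.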