**My notes, 2nd revision**
==========================

ASSUMPTIONS:
- Dynamo targets apps that tolerate weaker consistency
- non-hostile environment
- commodity hardware
- simple read/write operations on data items uniquely identified by a key

Vs. Bayou (flexible consistency)
- similar: allows reads/writes to continue even during network partitions
- different conflict resolution mechanism

Vs. BigTable
- only key/value access
- focus on high availability, where updates are not rejected even in the wake of network partitions and server failures

Vs. relational databases
- weaker consistency
- no schema needed
- higher availability

**DYNAMO**
==========

# CAP Theorem
Of Consistency, Availability, and Partition tolerance, you can get at most two of the three.
AP: Dynamo, Grapevine, DNS...

# Dynamo design choices
- Goal: always-on key-value storage system
  + read: return whatever can be found (and may let apps deal with inconsistency)
  + write: always succeeds (even in case of failures or network partitions)
- AP, and eventual consistency. Why?
  + performance
  + availability
- When to resolve inconsistency?
  + always-writable data store (i.e., a write is never rejected)
  + push conflict resolution to reads
    (what if we wanted to resolve conflicts on write? we would need to choose a total order among writes.
     What about Chord? it uses a *stabilization* procedure)
- Who resolves inconsistency?
  + data store: can apply a simple policy like last-write-wins
    ~ may be limited
  + app: may have a better understanding of how to resolve
    e.g., on read, just merge all versions (shopping cart)
    ~ more flexible, best suited to the client's experience
- Why not a relational DB?
  + centralized
  + expensive
  + worse fault tolerance
  + apps at Amazon are simple, hence no need for a relational schema
- Some other design principles:
  + incremental scalability
  + symmetry: no node takes a special role --> simplifies provisioning and maintenance
  + decentralization: more scalable, more available
  + heterogeneity: work distribution proportional to each server's capacity

# Dynamo: bird's-eye view
- always writable
- security is not a concern (single administrative domain)
- flat namespace (key-value), no relational schema
- ultimate goal is fast response (for both reads and writes)
  (hence, it does not route requests through multiple hosts like Chord
   ==> each node must be able to route a request directly to the responsible node)

# Dynamo: Details

1) Simple Interface
-------------------
- put(key, context, object)
  + context includes the version of the object (we will see how it is used later)
- get(key)

2) Partitioning
---------------
- by consistent hashing
  + to distribute load across multiple storage hosts
  + each node is assigned a number (position on the ring)
  + hash the data item's key to yield its position on the ring, then walk the ring clockwise to find the first node with a position larger than the item's position
  + each node is responsible for the region of the ring between itself and its predecessor
- challenges with basic consistent hashing:
  + non-uniform data and load distribution, because each node gets a random position
  + the algorithm is oblivious to heterogeneity in node performance
- Dynamo uses a variant of consistent hashing with virtual nodes: each physical node is assigned multiple positions on the ring
- Why?
  + node join/leave only affects the immediate neighbors
  + load balancing:
    ~ if a physical node becomes unavailable, the load it handled is evenly dispersed across the remaining available nodes
  + deals with heterogeneity: the number of virtual nodes depends on the physical node's capacity
  + NOTE: we see something similar to Grapevine's load distribution here.
    ~ the goal is to spread the load of the crashed machine
    ~ if not done carefully, e.g., if M1 holds all inboxes for users 1, 2, and 3, and all of their secondary inboxes live on M2, then when M1 crashes M2 quickly becomes overloaded
    ~ hence, it is better to put the secondary inboxes of users 1, 2, and 3 on different machines, i.e., spread the load on failure
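A minimal sketch of the ring described above (Python; the class name `ConsistentHashRing`, MD5 as the ring hash, and `tokens_per_node` are illustrative assumptions, not Dynamo's actual implementation). It shows consistent hashing with virtual nodes, plus the preference-list construction used in section 3) below, where virtual nodes that map to an already-chosen physical node are skipped.

```python
import hashlib
from bisect import bisect_left

class ConsistentHashRing:
    """Consistent hashing with virtual nodes (illustrative sketch)."""

    def __init__(self, tokens_per_node=8):
        self.tokens_per_node = tokens_per_node  # more tokens => smoother load spreading
        self.ring = []                          # sorted list of (position, physical node)

    def _hash(self, value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def add_node(self, node: str, capacity: int = 1):
        # a more capable node gets proportionally more virtual nodes (heterogeneity)
        for i in range(self.tokens_per_node * capacity):
            self.ring.append((self._hash(f"{node}#vnode{i}"), node))
        self.ring.sort()

    def remove_node(self, node: str):
        self.ring = [(pos, n) for pos, n in self.ring if n != node]

    def coordinator(self, key: str) -> str:
        # walk clockwise: first virtual node at a position >= hash(key)
        idx = bisect_left(self.ring, (self._hash(key),)) % len(self.ring)
        return self.ring[idx][1]

    def preference_list(self, key: str, n: int = 3) -> list:
        # first n *distinct physical* nodes clockwise from the key's position
        start = bisect_left(self.ring, (self._hash(key),)) % len(self.ring)
        nodes = []
        for i in range(len(self.ring)):
            node = self.ring[(start + i) % len(self.ring)][1]
            if node not in nodes:
                nodes.append(node)
            if len(nodes) == n:
                break
        return nodes

ring = ConsistentHashRing()
for name in ["A", "B", "C", "D"]:
    ring.add_node(name)
print(ring.coordinator("cart:alice"))        # physical node owning this key
print(ring.preference_list("cart:alice"))    # 3 distinct physical replicas
```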
3) Replication
--------------
- for high availability and durability
- a data item is stored at its coordinator node and replicated at the next N-1 clockwise successors (N replicas in total)
- *preference list*: the list of nodes responsible for a particular key
  + with virtual nodes, the first N successor positions for a key may be owned by fewer than N distinct physical nodes
  + hence, when building this list, skip positions to make sure the list contains only distinct physical nodes
- every node can determine which nodes should be in this list for a given key k (how: section 4.8 of the paper)

4) Data Versioning
------------------
- for high availability of writes
  + i.e., a write can return before the update propagates to all replicas
  + hence, eventual consistency
  + but a subsequent get() may return data that is not the latest
- treat the result of each modification as a new, immutable version of the data
- allow multiple versions of an object in the system, resolve later
  + most of the time, resolution is simple: the new version subsumes the previous ones
  + in some cases (failures + concurrent updates): conflicting versions
    --> the client performs semantic reconciliation: collapse multiple branches of data evolution back into one
    E.g., "merging" different versions of a customer's shopping cart
    Implication: "add to cart" is never lost, but a deleted item can resurface
- use vector clocks (a list of (node, counter) pairs) to capture causality between versions
  + gives a partial order
  + if two versions are concurrent, reconcile
- resolving inconsistency
  + system level: if V1 <= V2, then V2 is newer and V1 can be discarded
  + otherwise, V1 and V2 are parallel branches and must be resolved at the app level (e.g., merge) on read
  + on read, all such versions are presented to the client for semantic reconciliation
- on write, the client must provide the context specifying which version of the object it is updating
- Good example: Figure 3, page 211 of the paper
- problem: the vector clock can grow too large
  + when: network partitions, multiple server failures
  + Dynamo's solution: clock truncation
    ~ store a timestamp for each pair, indicating the last time that node updated the data item
    ~ when truncation is needed, drop the oldest pair
    ==> just a quick fix; it may lose causal relationships and make reconciliation harder,
        but they claim it is not a problem in practice
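A minimal sketch of the vector-clock rules above (Python; the dict representation, the `dominates`/`prune` names, and the example values are assumptions for illustration, loosely following the scenario of the paper's Figure 3): a version that causally descends from another subsumes it, while incomparable versions are kept as siblings for the client to reconcile.

```python
# A vector clock represented as a dict: node id -> counter.
VectorClock = dict

def dominates(a: VectorClock, b: VectorClock) -> bool:
    """True if `a` has seen at least every update that `b` has seen (a subsumes b)."""
    return all(a.get(node, 0) >= counter for node, counter in b.items())

def concurrent(a: VectorClock, b: VectorClock) -> bool:
    """Neither clock dominates the other: conflicting siblings."""
    return not dominates(a, b) and not dominates(b, a)

def prune(versions: list) -> list:
    """Syntactic reconciliation: drop versions strictly dominated by another;
    whatever remains is the set of concurrent siblings handed to the app."""
    return [v for v in versions
            if not any(dominates(o, v) and o != v for o in versions)]

# Roughly the scenario of the paper's Figure 3:
d2 = {"Sx": 2}              # two writes handled by node Sx
d3 = {"Sx": 2, "Sy": 1}     # a later write handled by Sy, descends from d2
d4 = {"Sx": 2, "Sz": 1}     # a concurrent write handled by Sz, also descends from d2
print(prune([d2, d3, d4]))  # d2 is subsumed; d3 and d4 remain as siblings
print(concurrent(d3, d4))   # True -> app-level merge needed (e.g., union of cart items)
```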
5) Execution of put and get
---------------------------
- two ways for a client to select a node:
  + directly contact a storage node
    ~ a Dynamo library links with the client code
    ~ lower latency because it skips a potential forwarding step
    ~ but the client needs to refresh its membership information periodically
      • default: every 10 seconds
      • so it may see stale membership for up to 10 seconds
  + go through a load balancer, which in turn picks a node
    ~ the client does not have to link with Dynamo code
    ~ but latency may be higher
- any storage node can receive get and put operations for any key
  ~ if that node is not in the key's preference list, it forwards the request to a node that is
- a *sloppy* quorum protocol maintains consistency among replicas
  + read/write operations are performed on the first N *healthy* nodes from the preference list
  + R: number of nodes that must participate in a successful read operation
  + W: number of nodes that must participate in a successful write operation
  + condition: R + W > N
- What happens on put?
  + upon receiving a put() request, the coordinator:
    ~ generates the vector clock for the new version
    ~ writes the new version locally
    ~ sends the new version (along with the vector clock) to the N highest-ranked reachable nodes
    ~ the write succeeds if at least W-1 of them return success (the coordinator's local write counts toward W)
- What happens on get?
  + the coordinator receives the get() request
  + the coordinator requests all existing versions of the data for that key from the N highest-ranked reachable nodes in the key's preference list
  + then waits for R responses before returning the result to the client
  + if there are multiple versions of the data, it returns all versions it considers causally unrelated
  + the client code reconciles divergent versions, and the reconciled version is written back
  NOTE: inconsistency is resolved on read.

6) Handling Temporary Failure
-----------------------------
- use hinted handoff: remember the temporarily unavailable intended destination,
  so temporary partitions or crashes can be tolerated
- later, when the real destination comes back, the hint is used to copy the replica to that destination
- E.g.:
  + consider a ring containing A -> B -> C -> D -> E -> F -> A
  + N = 3
  + normally, a write on a key K coordinated by A is written to A, B, C
  + if A is temporarily down, the update on K is stored at B, C, D instead
  + because D is not A (the intended destination), a hint naming A is stored in D's metadata along with the replica
  + later, when D detects that A is alive again, it sends the replica of K back to A
  Notes:
  + there must be some way to detect the unavailability of a node (gossip / timeouts, see section 8)
  + what if D also fails? E takes over, and so on
- Why use hinted handoff?
  + recall the goal of Dynamo: high availability, always on
  + hence reads and writes do not fail due to temporary node or network failures
  + with W set to 1, Dynamo rejects a write only if all nodes in the system are unavailable
- How about handling datacenter failures?
  + solution: place replicas across data centers,
    ~ by constructing the preference list so that its storage nodes are spread across multiple data centers
- What is bad about hinted handoff?
  + not so good in case of permanent failures
  + in the example above, what if D becomes unavailable before it can return the hinted replica to A?
  + solution: anti-entropy (see section 7 below)
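A minimal sketch (Python; the callback names `is_healthy`/`send_write` and the store layout are invented for illustration and are not Dynamo's interfaces) of a coordinator-side sloppy-quorum write with hinted handoff, combining sections 5 and 6: the write goes to the first N healthy nodes found walking the preference list, a hint naming the intended node is attached whenever a stand-in is used, and the write is reported successful once W acknowledgements arrive.

```python
def sloppy_quorum_put(key, value, clock, preference_list, is_healthy, send_write,
                      n=3, w=2):
    """Hypothetical coordinator write path (a sketch, not Dynamo's actual code).

    preference_list  : nodes ranked for this key; the top N are the "home" replicas
    is_healthy(node) : failure-detector guess (e.g., based on recent timeouts)
    send_write(node, key, value, clock, hint) -> bool : returns True on ack
    """
    targets = []                           # (node, hint) pairs that will receive the write
    stand_ins = iter(preference_list[n:])  # nodes further along the ring, used as backups

    for home in preference_list[:n]:
        if is_healthy(home):
            targets.append((home, None))   # normal replica, no hint needed
        else:
            # sloppy quorum: skip the unreachable home node and use the next
            # healthy node, recording a hint so the replica can be handed back
            for backup in stand_ins:
                if is_healthy(backup):
                    targets.append((backup, home))
                    break

    acks = sum(1 for node, hint in targets
               if send_write(node, key, value, clock, hint))
    return acks >= w                       # W acks => the write is reported successful


def hand_off_hints(hinted_store, is_healthy, send_write):
    """Run periodically on a stand-in node: return hinted replicas to their
    intended home nodes once those nodes are reachable again."""
    for (key, home), (value, clock) in list(hinted_store.items()):
        if is_healthy(home) and send_write(home, key, value, clock, None):
            del hinted_store[(key, home)]  # the home node has the replica; drop the hint
```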
7) Handling Permanent Failure
-----------------------------
- hinted handoff is not effective in case of permanent failures
- solution: anti-entropy (replica synchronization) using Merkle trees
  + Merkle tree:
    ~ a hash tree whose leaves are hashes of the values of individual keys
    ~ parent nodes are hashes of their children
    ~ advantage: each branch can be checked independently, without requiring a node to download the entire tree or the entire data set
  + avoids comparing data that has not changed
  + helps detect inconsistencies faster
  + minimizes the amount of data transferred
  + each node maintains a separate Merkle tree for each key range it hosts
  + protocol:
    1) first exchange and compare the roots of the trees
    2) if they match, no need to go further
    3) otherwise, descend into the children and repeat from 2)
    This scheme finds the keys that differ with little data transfer.
    Problem: key ranges change when a node joins/leaves --> the trees must be recalculated
    Solution: refine the partitioning scheme (section 6.2 in the paper; see the strategies below)
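A minimal sketch (Python; SHA-256, the binary tree shape, and the assumption that both replicas hold the same key set are illustrative simplifications) of the Merkle-tree comparison in section 7: each replica builds a hash tree over a key range, the roots are compared first, and only subtrees whose hashes differ are descended into, so unchanged data is never examined or transferred.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_merkle(items):
    """items: sorted list of (key, value) pairs for one key range.
    Returns a leaf (hash, key) or an internal node (hash, left, right)."""
    if len(items) == 1:
        key, value = items[0]
        return (h(key.encode() + value.encode()), key)
    mid = len(items) // 2
    left, right = build_merkle(items[:mid]), build_merkle(items[mid:])
    return (h(left[0] + right[0]), left, right)

def diff(a, b):
    """Keys whose values differ between two replicas' trees over the same key range
    (assumes both replicas hold the same keys; real anti-entropy must also handle
    keys that are missing on one side)."""
    if a[0] == b[0]:
        return []                        # identical subtree: stop descending
    if len(a) == 2 and len(b) == 2:
        return [a[1]]                    # leaves with the same key but different values
    return diff(a[1], b[1]) + diff(a[2], b[2])

replica1 = build_merkle([("k1", "v1"), ("k2", "v2"), ("k3", "v3"), ("k4", "v4")])
replica2 = build_merkle([("k1", "v1"), ("k2", "XX"), ("k3", "v3"), ("k4", "v4")])
print(diff(replica1, replica2))          # ['k2'] -> only this key needs to be synced
```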
8) Membership and Failure Detection
-----------------------------------
- explicit node join and leave:
  + an admin makes the request explicitly
  + the history of joins/leaves is stored persistently and exchanged later
- a gossip-based protocol is used to learn about other nodes' key ranges and membership
  + at every interval, contact a random node and exchange membership information
  + not only membership info: partitioning and placement info is also exchanged
    ~ hence, a node can route a request directly to the responsible node
- external discovery:
  + what is a logical partition?
    ~ an admin contacts node A and joins A to the ring,
    ~ then contacts node B and joins B to the ring
    ~ nodes A and B would each consider itself a member of the ring,
    ~ yet neither would be immediately aware of the other
  + use seeds to prevent logical partitions (seeds are nodes known to all nodes, discovered via an external mechanism)
- failure detection: timeout-based
  + if B does not respond to A's message, then to A, B has failed (even if B responds to C's messages)
    ~ if clients send many requests, A detects B's failure quickly
    ~ if there are no client requests, there is no need to detect the failure at all
  ==> hence, detection happens at client-request time

9) Adding/Removing Storage Nodes
--------------------------------
When a new node X is added to the ring, some nodes no longer have to store some of their keys, and those nodes transfer those keys to X.

# EXPERIENCE and LESSONS LEARNED (may be interesting)

Dynamo is used with different configurations:
- business-logic-specific reconciliation:
  + the client application performs its own reconciliation logic
  + e.g., the shopping cart service (just merge everything)
- timestamp-based reconciliation:
  + "last write wins": the object with the largest physical timestamp is chosen as the correct version
  + e.g., maintaining a customer's session
- high-performance read engine:
  + some services are read-mostly (and require high read performance), with rare writes
  + R = 1 and W = N
  + e.g., services that maintain the product catalog and promotional items

Client applications can tune the values of N, R, and W to achieve their desired levels of performance, availability, and durability.
- For example, the value of N determines the durability of each object.
- If W is set to 1, the system will never reject a write request as long as at least one node in the system can successfully process it.
- However, low values of W and R increase the risk of inconsistency: write requests are deemed successful and returned to clients even if they have not been processed by a majority of the replicas. This also introduces a vulnerability window for durability, since a write can be reported successful while it has been persisted at only a small number of nodes.
- Traditional wisdom holds that durability and availability go hand in hand, but that is not necessarily true here. For instance, the vulnerability window for durability can be decreased by increasing W, but this increases the probability of rejecting requests (thereby decreasing availability), because more storage hosts need to be alive to process a write.

Balancing Performance and Durability
------------------------------------
- performance is tricky, since the latency of a read or write operation is limited by the slowest of the R or W replicas involved
- trading durability for performance:
  + each storage node maintains an object buffer in main memory
  + a write request is stored in the buffer and periodically written to disk by a writer thread
  + reads check the buffer first
- to reduce the durability risk:
  + the coordinator chooses one of the N replicas to perform a "durable write" (to disk, not just the buffer)
  + since the coordinator waits for only W responses, the single slower durable write does not hurt write latency

Ensuring Uniform Load Distribution (paper section 6.2)
------------------------------------------------------
- Strategy 1: T random tokens per node, partition by token value
  + this is the basic scheme described in 2) Partitioning above
  + problems:
    ~ bootstrapping problem:
      > a new node joins the system and "steals" its key ranges from other nodes
      > these other nodes need to scan their local stores to hand those keys over
      > scans are expensive and resource-intensive, so they run in the background
      > hence, bootstrapping can take a long time
    ~ Merkle tree recalculation problem:
      > the key ranges of many nodes change when a node joins/leaves
      > hence, the Merkle trees for these nodes must be recalculated
      > expensive, too
    ~ snapshotting the entire key space is hard, because key ranges are random
  + fundamental issues:
    ~ the schemes for data partitioning and data placement are intertwined
    ~ it is not possible to add nodes without affecting the partitioning (while sometimes that is desirable)
    ==> need independent schemes for partitioning and placement
- Strategy 2: T random tokens per node and equal-sized partitions
  + features:
    ~ the hash space is divided into Q equally sized partitions/ranges
    ~ each node is assigned T random tokens
    ~ Q is set such that Q >> N and Q >> S * T, where S is the number of nodes in the system
    ~ tokens are only used to build the function that maps values in the hash space to the ordered list of nodes
      • not to decide the partitioning (as in Strategy 1)
    ~ a partition is placed on the first N unique nodes encountered while walking the consistent-hashing ring clockwise from the end of the partition
  + advantages:
    ~ decouples partitioning and partition placement
    ~ enables changing the placement scheme at runtime
- Strategy 3: Q/S tokens per node, equal-sized partitions
  + similar to Strategy 2
  + but Q/S tokens per node (not random)
  + when a node leaves the system, its tokens are randomly distributed to the remaining nodes such that these properties are preserved
  + when a node joins the system, it "steals" tokens from other nodes in a way that preserves these properties

Strategy 3 seems to be the best choice because:
- faster bootstrapping/recovery:
  + since partition ranges are fixed, they can be stored in separate files
  + a partition can be relocated as a unit by simply transferring the file
  + this avoids the random accesses needed to locate specific items
  + hence it simplifies the process of bootstrapping and recovery
- ease of archival:
  + archival is simpler because the partition files can be archived separately
  + by contrast, in Strategy 1,
    ~ the tokens are chosen randomly, and
    ~ archiving the data stored in Dynamo requires retrieving the keys from individual nodes separately, which is usually inefficient and slow
- disadvantage:
  + changing the node membership requires coordination in order to preserve the properties required of the assignment
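A minimal sketch (Python; Q = 8, MD5, and the round-robin placement are illustrative stand-ins, the latter replacing the "first N unique nodes clockwise from the end of the partition" rule) of the equal-sized-partition idea behind Strategies 2 and 3: keys map to one of Q fixed partitions independently of membership, and a separate placement table maps partitions to nodes, so membership changes move whole partitions without changing the partitioning itself.

```python
import hashlib

Q = 8            # number of fixed, equal-sized partitions (the paper picks Q >> S * T)
SPACE = 2**128   # size of the MD5 hash space, used here for illustration

def key_to_partition(key: str) -> int:
    # Partitioning is fixed: a key always lands in the same one of Q ranges,
    # regardless of which nodes are currently in the system.
    pos = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return pos * Q // SPACE

def place_partitions(nodes, n=3):
    # Placement is a separate, changeable mapping: partition -> N distinct nodes.
    # Round-robin here stands in for "the first N unique nodes encountered while
    # walking the ring clockwise from the end of the partition".
    return {p: [nodes[(p + i) % len(nodes)] for i in range(min(n, len(nodes)))]
            for p in range(Q)}

placement = place_partitions(["A", "B", "C", "D"])
p = key_to_partition("cart:alice")
print(p, placement[p])     # the key's fixed partition and its current replica set

# When node E joins, only the placement table changes; key_to_partition is untouched,
# so whole partitions can be transferred as files instead of scanning local stores.
placement = place_partitions(["A", "B", "C", "D", "E"])
print(p, placement[p])
```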
Divergent Versions: When and How Many
-------------------------------------
- when do divergent versions arise?
  + failures: node failure, data center failure, network partition
  + a large number of concurrent writers to a single data item, handled by multiple nodes
- for usability and efficiency: keep the number of divergent versions small
- resolution:
  + by vector clocks when possible
  + otherwise, by the client application

Client-driven or Server-driven Coordination
-------------------------------------------
- two ways (same trade-off as in 5) Execution of put and get):
  + client-driven: directly contact a storage node
    ~ a Dynamo library links with the client code
    ~ lower latency because it skips a potential forwarding step
    ~ but the client needs to refresh its membership information periodically
      • default: every 10 seconds
      • so it may see stale membership for up to 10 seconds
  + server-driven: go through a load balancer, which in turn picks a node
    ~ the client does not have to link with Dynamo code
    ~ but latency may be higher

Balancing Background vs. Foreground Tasks
-----------------------------------------
- foreground: get/put
- background:
  + data handoff (due to hinting or nodes leaving/joining)
  + replica synchronization (due to permanent failures)
- challenge: background activity must not affect foreground tasks
- solution: monitor foreground operations and admit background tasks only when appropriate

NOTE: exchanging the full membership table (routing table) works for hundreds of nodes, but may not scale to thousands of nodes.
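A minimal sketch (Python; the per-node table layout and the max-timestamp merge rule are assumptions for illustration) of the gossip-style membership exchange from section 8 and the closing note: each round, every node contacts one random peer and the pair reconcile their membership tables, so the view converges quickly, but every node ends up carrying and repeatedly exchanging the full routing table, which is fine at hundreds of nodes yet questionable at thousands.

```python
import random

def gossip_round(tables):
    """One gossip round. `tables[node]` maps member -> last-seen logical timestamp
    (in Dynamo the exchanged state also includes partitioning/placement info, so
    any node can route a request directly to the responsible node)."""
    for node in list(tables):
        peer = random.choice([p for p in tables if p != node])
        # reconcile: keep the newer entry for every member either side knows about
        merged = {m: max(tables[node].get(m, 0), tables[peer].get(m, 0))
                  for m in tables[node].keys() | tables[peer].keys()}
        tables[node], tables[peer] = dict(merged), dict(merged)

# three nodes, each initially knowing only about itself
tables = {"A": {"A": 1}, "B": {"B": 1}, "C": {"C": 1}}
for _ in range(3):
    gossip_round(tables)
print(tables)   # after a few rounds, every node holds the full membership view
```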