
Dynamo: Amazon's Highly Available Key-Value Store. Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swami Sivasubramanian, Peter Vosshall and Werner Vogels, Proceedings of the 21st ACM Symposium on Operating Systems Principles, Stevenson, WA, October 2007

Review due Thursday, 9/25


“Dynamo” covers many of the distributed aspects of the key-value store that Amazon built. It combines many important distributed-systems techniques, resulting in a highly configurable and sound system.

Dynamo was built with the needs of the organization in mind. This meant that outages had to be kept to a minimum, since any outage could result in massive monetary loss. In addition, Amazon was constantly growing, so the system had to be scalable. Finally, Dynamo was built for services that needed a configurable data store, one that can be tuned for availability, durability, and performance tradeoffs depending on the use case.

There are many important contributions in Dynamo. One of the first is a discussion of the design requirements and considerations of the system. This is an important section because it describes a real-world application’s generalized requirements. They want low latency at the 99.9th percentile or beyond; systems that only guarantee averages would not suffice for such strict requirements. They also focus on the idea of an “always writeable” data store.

The design principles they propose are important. They are incremental scalability, symmetry, decentralization, and handling heterogeneity. These principles form the basis for the data store. These design principles can be applied to distributed systems as a whole. Often, it is not only simpler, but more fault tolerant for the system to not rely on a few nodes in particular. If all nodes have equal roles and communicate with each other in an efficient manner, the system state will be easier to manage.

The actual architecture is provided in detail. I will not enumerate each individual point, but will discuss the only one I had not heard of before: sloppy quorum and hinted handoff. A sloppy quorum is a variation of the traditional quorum in which the vote is taken only among healthy nodes. This provides availability in the face of temporary failures and partitions. If a node is down and a write is directed toward it, the write is instead redirected to the next healthy node outside the original quorum. That node stores the write on behalf of the missing node and periodically tries to hand the update back to it. This provides durability in the face of server outages.
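The sloppy-quorum/hinted-handoff flow can be sketched roughly as follows (a toy Python model of my own; the class and function names, and the single-failure simplification, are mine, not Dynamo's):

```python
# Sloppy quorum + hinted handoff, toy model. With N = 3, a write goes to the
# first three healthy nodes on the ring; a stand-in node keeps a "hint" naming
# the node it replaced and later hands the data back.

N = 3

class Node:
    def __init__(self, name):
        self.name = name
        self.alive = True
        self.store = {}    # key -> value
        self.hints = []    # (intended_owner, key, value) held for a dead node

def sloppy_write(key, value, ring):
    """Write to the first N healthy nodes walking the ring in order.

    Simplified: assumes at most one intended replica is down."""
    intended = ring[:N]
    dead = [n for n in intended if not n.alive]
    written = 0
    for node in ring:
        if written == N:
            break
        if not node.alive:
            continue
        node.store[key] = value
        if node not in intended and dead:
            node.hints.append((dead[0], key, value))  # remember the real owner
        written += 1
    return written

def handoff(node):
    """Retry delivering hinted writes back to their intended owners."""
    still_pending = []
    for owner, key, value in node.hints:
        if owner.alive:
            owner.store[key] = value   # owner recovered: hand the write back
        else:
            still_pending.append((owner, key, value))
    node.hints = still_pending

# Usage: B fails, D stands in, then B recovers and the hint is delivered.
a, b, c, d = (Node(x) for x in "ABCD")
b.alive = False
sloppy_write("cart:42", ["book"], [a, b, c, d])
b.alive = True
handoff(d)
print(b.store["cart:42"])   # the recovered node now has the write
```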

Finally, I want to discuss the configurability of their system. Using quorums for reads and writes means that R, W, and N are all configurable. In addition, any reconciliation can be done by the business logic. This greatly expands the number of use cases for the system as a whole, which is especially important since different applications have different requirements for their data, and these differences will only diverge further in the future.
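As a tiny illustration of those knobs: the (N, R, W) names come from the paper, but the helper below and its classification are mine. R + W > N is the setting where every read quorum overlaps every write quorum:

```python
def quorum_properties(n, r, w):
    """Classify an (N, R, W) setting; assumes 1 <= r, w <= n."""
    return {
        "read_your_writes_overlap": r + w > n,  # read quorum meets every write quorum
        "fast_writes": w == 1,                  # cheapest writes, weakest durability
        "fast_reads": r == 1,                   # cheapest reads, possibly stale
    }

print(quorum_properties(3, 2, 2))  # the common (3, 2, 2) setting: overlap holds
```

Lowering W buys write availability and latency at the cost of durability; lowering R does the same for reads at the cost of staleness.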

I already discussed some of the importance of the requirements and principles earlier. Also important is the fact that they used a number of famous techniques in their system: consistent hashing, vector clocks, Merkle trees, anti-entropy, and gossip protocols. This is a tremendous endorsement of the distributed systems community’s ideas, and it demonstrates that many of those ideas can be implemented and in fact lead to usable systems that meet the needs of a company. There is simply no need for a distributed systems engineer to work from scratch.

Just a few concerns. First, the general problem (if it is a problem) of eventual consistency. More important is that the system does not guarantee “read-your-writes” consistency, which can be difficult for the application designer.


Dynamo is a key-value store that emphasizes high availability and reliability as a trade-off against consistency.


Provide a robust, highly available key-value service that is capable of handling Amazon's traffic with TP 99.9 guarantees.

- Focuses on the always-writeable use case.
- Focuses on TP 99.9 instead of average latency.
- Allows clients to resolve conflicts instead of making assumptions about their usage patterns.
- Read repair for opportunistic updates of stale data.
- Sloppy quorum to work around failed nodes in the preference list.
- Hinted handoff mechanism to deal with the migration of keys.
- Uses a medley of popular distributed-systems techniques.
- Partitioning strategies on top of consistent hashing to improve load distribution.
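The read-repair bullet can be sketched in a few lines of Python (a toy model of my own; `Replica`, `quorum_read`, and the integer version numbers are illustrative stand-ins, not Dynamo's API):

```python
class Replica:
    def __init__(self):
        self.store, self.version = {}, {}

def quorum_read(key, replicas, r=2):
    """Read from R replicas, return the newest value, repair stale copies."""
    answers = [(rep.version.get(key, 0), rep) for rep in replicas[:r]]
    newest = max(v for v, _ in answers)
    value = next(rep.store[key] for v, rep in answers if v == newest)
    for v, rep in answers:
        if v < newest:                     # opportunistic update of stale data
            rep.store[key], rep.version[key] = value, newest
    return value

a, b = Replica(), Replica()
a.store["k"], a.version["k"] = "new", 2
b.store["k"], b.version["k"] = "old", 1
print(quorum_read("k", [a, b]))   # "new"; b is repaired as a side effect
```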

- Assumes no security issues.
- Existing applications will need to incur complexity of resolving conflicts.
- Designed for a couple hundred nodes; scaling to tens of thousands of nodes is non-trivial.


It is used by several services where always being able to write is most important, and the companies that use it are advertised at https://aws.amazon.com/dynamodb/

This paper presents details of Dynamo, Amazon’s highly available key-value store. At the cost of eventual consistency, Dynamo provides durability, availability, and scalability while maintaining SLAs. The paper also presents great trade-offs to provide a system that can serve millions of reads and writes with minimal glitches.

Relational databases are fast but provide strict consistency, giving up availability. Amazon focused on availability and optimized Dynamo for a write-heavy workload with standard distributed-systems requirements: fault tolerance, scalability, and eventual consistency. They put together existing techniques to support the largest e-commerce website in the world. Dynamo also has to address network partitions and heterogeneity efficiently at this large scale.


  • The use of vector clocks to resolve conflicts among various versions of the same object is smart. This gives them high availability for writes even in the case of network partitions.
  • Hinted handoff and anti-entropy using Merkle trees deal with temporary and long-lasting failures, respectively.
  • The sloppy quorum technique provides write availability at the cost of read complexity. This is finely tuned to the needs of e-commerce websites.
  • The event-driven coordination component, built in the style of SEDA, provides efficient message pipelining.
  • The simplicity of the interface makes optimization by clients easier.

Amazon uses this system to run the largest e-commerce service in the world. They were able to provide performance that is acceptable at the 99.9th percentile, a clear fit for any e-commerce website.

- They want to implement an efficient key-value store that is "always" available for writes. They have a large number of clients and requests, and failures and delays can hurt the user experience.
- Databases are usually slower, and they try to increase consistency at the expense of availability.
- They want higher performance: instead of guaranteeing latency for the mean or standard deviation, they want to bound the latency of the 99.9th percentile.

- They use consistent hashing for partitioning the data.
- For replicating the data, they store each item on N successive positions on the consistent-hashing ring.
- They use vector clocks for handling versioning.
- They use a quorum-like system for reads and writes (needing R nodes to read and W nodes to write).
- They use hinted handoff to handle node failure: if a node that was the target of a write is unavailable, they write the update to another node, but the new node knows which node was the intended target and keeps trying to deliver the update to it.
- They use anti-entropy for synchronizing the nodes, using Merkle trees to make it more efficient.
- They use a gossip-based method to inform other nodes about nodes being added or removed.
- They rely on the application for handling conflicts.
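The first two points (consistent hashing with N-way replication) can be sketched like this (my own simplification; the vnode count and the hash choice are arbitrary, not from the paper):

```python
import hashlib
from bisect import bisect_right

def h(s):
    """Map a string to a point on the ring."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes=8, n_replicas=3):
        self.n = n_replicas
        # each physical node owns several virtual positions ("tokens")
        self.tokens = sorted((h(f"{node}#{i}"), node)
                             for node in nodes for i in range(vnodes))

    def preference_list(self, key):
        """First n_replicas distinct physical nodes clockwise from hash(key)."""
        start = bisect_right(self.tokens, (h(key), chr(0x10FFFF)))
        owners = []
        for i in range(len(self.tokens)):
            node = self.tokens[(start + i) % len(self.tokens)][1]
            if node not in owners:
                owners.append(node)
            if len(owners) == self.n:
                break
        return owners

ring = Ring(["A", "B", "C", "D"])
print(ring.preference_list("cart:42"))   # three distinct nodes, in ring order
```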

They showed the performance of the system under real load, and they optimized the system where needed:
- They added a buffer that keeps records in memory and used a "writer thread" to write them to disk over time (to decrease the cost of I/O).
- They changed the partitioning scheme to balance the load (they explain the original strategy and two improved strategies).
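The buffered-write optimization can be sketched as a queue drained by a background thread (the queue/thread structure below is my own toy model, not Amazon's code):

```python
import queue
import threading

buf = queue.Queue()     # in-memory write buffer
durable = {}            # stands in for the on-disk store

def writer():
    """Background writer thread: drains the buffer and pays the I/O cost."""
    while True:
        item = buf.get()
        if item is None:
            break
        key, value = item
        durable[key] = value
        buf.task_done()

threading.Thread(target=writer, daemon=True).start()

def put(key, value):
    """Returns as soon as the write is buffered in memory (low latency, but
    weaker durability: a crash loses whatever is still in the buffer)."""
    buf.put((key, value))

put("cart:42", ["book"])
buf.join()              # demo-only barrier so we can observe the flush
print(durable["cart:42"])
```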

- The paper discusses the design considerations for implementing a highly available distributed data store to support performance, reliability, efficiency, and scalability.
- Dynamo is designed to be 1) always writable, 2) run among trusted nodes, and 3) latency sensitive.

- Partitioning is solved with consistent hashing.
- High availability for writes is resolved with vector clocks.
- Temporary failures are resolved with quorums and hinted handoff.
- Permanent failures are resolved with anti-entropy.
- Membership and failure detection is resolved by gossiping.
- Other optimizations include client-driven versus server-driven coordination and how to schedule background tasks so they do not interfere with foreground tasks.

- Virtual nodes are used for load balancing in a system with unreliable nodes. This seems like a weak technical contribution because it sounds very similar to the idea of virtual names presented in the web caching paper [Karger 1999].
- Data versioning is done so that everybody eventually sees the same information and all updates are reconciled.
- Client-driven coordination lets the client library fetch membership state from the servers and route requests directly, handling them even faster.
- Other contributions of this paper (using quorums and gossiping) are the knob adjustment of ideas presented in previous papers. The knob settings are useful if any other company wants to replicate Dynamo’s performance of a highly available key-value store.

- The paper considers both versioning schemes, such as merging writes and last-write-wins. Although the paper goes through a detailed explanation of how the client can manage version evolution of an object over time [section 4.4], it says Dynamo uses last-write-wins [section 6] without saying why their version merge would not work well. In that sense, the paper itself is not implementing its own ideas.
- The paper presents different solutions for handling temporary failures (hinted handoff) versus permanent failures (anti-entropy). Since permanent failures take much longer to handle, with Merkle trees and exchanging tree roots, and since it is difficult to tell whether a failure is temporary, the paper does not mention how Dynamo decides when to execute anti-entropy.
- The paper could use a fourth strategy for ensuring uniform load distribution: Q/S tokens per node with partitioning by token value. Partitioning by token value is easier to implement because it relaxes the constraint of equally-sized partitions, while Q/S tokens per node keeps the load on each node uniform.

SUMMARY: The paper describes Dynamo, a key/value store used by Amazon which
allows end users to tune various parameters but which in general prefers
availability over consistency.

PROBLEM: Amazon wants to make billions of dollars without spending it on
hardware or software development. Normal solutions revolving around databases
are too heavy-handed and don't scale, so they fall back to something much
simpler: a highly available key/value store.

DISCUSSION: Dynamo is built for a trusted environment: no security needed, no
hackers. Because latency is very important, primary-key access is all that
Dynamo provides. All internal schemas are relatively "flat". Dynamo focuses
on availability of WRITE operations, compared to standard database replication
algorithms. Writes will not fail due to underlying failures or concurrent
writes; writes must always succeed. This is partly due to "customer
experience". Incremental scalability: nodes can be added one at a time to
relieve load (or, presumably, to improve availability). I feel "99.9th %tile"
is an "academic marketing" number: it was chosen arbitrarily and not computed.
There's no feedback or cost/benefit analysis in deciding that number.

DynamoDB is a highly available Amazon database service for applications that place somewhat less focus on maintaining consistency. The motivation for this kind of distributed key-value store is that traditional databases are slow and over-generalized, and the available replication technologies lean more toward consistency than availability.

1. Incremental scalability is implemented using consistent hashing, which handles scaling server nodes up and down.
2. Partitioning of data is also handled using consistent hashing.
3. Dynamo gives high availability for writes, i.e., it always allows writes to happen, and conflict resolution is carried out during reads. Reconciliation of different versions is carried out using vector clocks, but conflict resolution is left to the client.
4. Temporary failures of nodes are handled using a sloppy quorum, where the first N healthy nodes are used for the quorum instead of the first N nodes. They also have hinted handoff, where metadata about the range of keys for the redirected requests is stored and the data is moved back to the actual machine after it comes back up.
5. To synchronize a machine when it has recovered from failure, anti-entropy is used, comparing the Merkle trees of the nodes.
6. Membership and failure detection of the nodes are implemented using a gossip-based protocol.
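Point 5's Merkle-tree comparison can be sketched as follows (the tree construction below is mine; the paper keeps one tree per key range and exchanges roots first):

```python
import hashlib

def h(data):
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    """Hash the leaves, then combine pairwise up to a single root.

    Assumes at least one leaf."""
    level = [h(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])   # duplicate the last hash to pad odd levels
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

replica_a = [b"k1=v1", b"k2=v2", b"k3=v3"]
replica_b = [b"k1=v1", b"k2=STALE", b"k3=v3"]
# Equal roots mean the key range is in sync and no data is shipped;
# unequal roots mean the replicas descend the trees to find the stale keys.
print(merkle_root(replica_a) == merkle_root(replica_b))   # False
```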

1. Since Amazon’s datacenters are so big and it must deliver a distributed system at that scale to a large application base, Amazon designed for availability at the 99.9th percentile of serviced requests while remaining recoverable and highly fault-tolerant. It also promises eventual consistency via a quorum technique.

This paper presents details of Amazon’s highly available key-value store, Dynamo. A primary goal for Amazon is to use 99.9th-percentile latency in SLAs. They enable the applications that use Dynamo to customize their own Dynamo instance by choosing among the various trade-offs. They design the system to be eventually consistent, highly available, scalable, and durable while also accounting for the heterogeneity of the machines.
An RDBMS provides strong consistency for storage but is not highly available. For Amazon, high availability, especially for writes, is a primary concern. Also of prime importance is the latency at which requests get served. These goals are from the user’s perspective. From a management perspective, the system needs to be reliable, durable, scalable, and able to use heterogeneity efficiently. Addressing all these goals in a large-scale distributed system that spans multiple datacenters is anything but trivial.
1. They note that their workloads consist of accessing objects that can be fetched by a primary key only and do not require all the functionality provided by an RDBMS. Hence they opt for a distributed key-value store.
2. They want writes to be highly available and hence use sloppy quorums and hinted handoffs. They use vector clocks to resolve conflicts during reads and, if unable to resolve them completely, return all values to the client and let it figure things out.
3. They use consistent hashing to distribute keys to nodes. They use the concept of virtual nodes, mapping a single physical node to multiple points on the circle for a more uniform distribution of load. A key has a coordinator node, the first immediate clockwise node on the ring, and is stored on the N immediate clockwise nodes in the ring.
4. They make use of heterogeneity by mapping more copies of the more powerful servers onto the ring so they get more keys.
5. An important contribution, different from most other systems, is resolving conflicts when reading rather than writing. Also, they allow R and W to be configured to be less than N.
6. Any node in the ring can be the coordinator for a read, whereas only the nodes in a key’s preference list can coordinate writes.
7. They employ anti-entropy using Merkle trees for synchronization, which reduces the load created by the process.
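Point 4's capacity-aware placement can be sketched by giving a powerful server more ring positions (the token counts below are made up for illustration, not from the paper):

```python
import hashlib

def tokens_for(node, count):
    """Positions a node occupies on the ring; more capacity -> more tokens."""
    return [int(hashlib.md5(f"{node}#{i}".encode()).hexdigest(), 16)
            for i in range(count)]

capacities = {"big": 16, "small": 4}      # made-up capacity-proportional counts
ring = sorted((t, n) for n, c in capacities.items() for t in tokens_for(n, c))
share = sum(1 for _, n in ring if n == "big") / len(ring)
print(share)   # 0.8: "big" owns four times as many tokens as "small"
```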
Overall, many of these techniques are still widely used. For example, consistent hashing is used by many web and storage services to distribute requests; vector clocks are also widely used. The importance they give to 99.9th-percentile latency, and not just the average, is a good takeaway. Dynamo runs in Amazon’s datacenters, so it is definitely still relevant. And one of the main themes of the paper, sacrificing strong consistency for high availability, is something we see more and more in other distributed systems these days, e.g., OpenStack Swift.

This paper is presented by Amazon; they share the experience of building a single highly available system, Dynamo, out of different techniques. To achieve this goal, it sacrifices some consistency under certain failure situations.

Problem Description
Unlike previous works, which focus on consistency, Amazon found reliability to be the most important requirement for an e-commerce platform. In particular, even a small outage can have significant financial consequences and impact customer trust. Therefore, they aim for high availability rather than consistency in Dynamo.

1. Although Dynamo prioritizes availability over consistency, which may lead to update conflicts in some failure situations, all updates will still reach all replicas eventually. Their solution is to push the complexity of conflict resolution to reads rather than writes, ensuring that writes are never rejected. They also let the application resolve conflicts rather than the data store, because the application knows best what suits the client’s experience.
2. For the partitioning algorithm, Dynamo uses consistent hashing to distribute the load across multiple storage hosts. To ensure uniform data and load distribution, they also adopt the concept of “virtual nodes” to achieve a fine-tuned partitioning scheme.
3. To deal with data versioning, Dynamo uses vector clocks to capture the differences between versions of the same object.
4. To maintain consistency among all replicas, Dynamo uses a consistency protocol similar to quorum systems. However, to remain available during server failures and network partitions, Dynamo uses a “sloppy quorum”: all read and write operations are performed on the first N healthy nodes from the preference list.
5. Dynamo also implements an anti-entropy protocol to keep the replicas synchronized, and uses Merkle trees to minimize the cost of detecting inconsistencies and the amount of data transmitted.
6. To avoid a centralized registry for storing membership and node-liveness information, Dynamo adopts a gossip-based membership protocol.
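Point 6's gossip-based membership can be sketched as follows (the version-number merge rule and the deterministic ring-successor schedule are my own simplifications; real gossip picks random peers):

```python
def gossip_pass(views):
    """One pass where each node reconciles its membership map with its successor.

    A view maps node name -> (version, status); the higher version wins."""
    names = sorted(views)
    for i, name in enumerate(names):
        peer = names[(i + 1) % len(names)]
        merged = {n: max(views[name].get(n, (0, "")), views[peer].get(n, (0, "")))
                  for n in set(views[name]) | set(views[peer])}
        views[name], views[peer] = dict(merged), dict(merged)

views = {n: {m: (1, "up") for m in "ABC"} for n in "ABC"}
views["A"]["D"] = (2, "joined")          # only A has seen node D join so far
gossip_pass(views)
print(all("D" in v for v in views.values()))   # True: one pass spreads it here
```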

Unlike other systems that mainly focus on consistency or load balance, Dynamo considers availability more, in order to achieve a better client experience. Though it sacrifices some consistency under certain failure scenarios, eventual consistency is still achieved. Amazon’s experience is applicable to other e-commerce companies, such as Newegg.

This paper describes Dynamo, a highly available key-value store designed by Amazon that is reliable and scalable while simultaneously maintaining service-level agreements with various applications.

- Conflict resolution is achieved by using vector clocks to maintain various versions of the same object.

- Incremental scalability was achieved by implementing consistent hashing. Keys are partitioned by mapping them to a set of virtual nodes, which are then mapped to physical nodes. This technique not only helps with load balancing with minimal data movement but is also fault tolerant. The number of virtual nodes can also be varied across physical nodes to account for varying capacity constraints.

- Replication is done to achieve high availability. Nodes hold a preference list of other nodes that are responsible for holding a particular key. Membership updates are made visible globally using a gossip-style protocol, making the system eventually consistent while maintaining a view of the current membership among nodes. Data inconsistencies are detected by comparing Merkle trees (one per key range), which minimizes the amount of data that needs to be transferred for synchronization.

- Consistency is achieved by a sloppy quorum over the healthy nodes of a preference list. A hinted handoff mechanism is used to accept writes at a secondary node on behalf of a temporarily failed node, thereby achieving high availability.

- Performance, such as low latency (at the 99.9th percentile), is guaranteed for some applications by writing data into object buffers in main memory, which is later written to disk by writer threads.

- The coordination function of nodes uses an event-driven framework similar to SEDA for efficient message pipelining.

Dynamo is definitely a relevant system since it is fully functional and deployed at Amazon. It is a great example of tweaking the properties of a distributed system based on various application requirements.

Summary: Dynamo is a distributed, reliable key-value store. It uses
various schemes we've learned so far in the class, but not all of them for
their intended use.

- Scalable: Amazon has so many machines that hardware failure is inevitable;
hardware failure is treated like a normal system trait, so it has to be
addressed with scalability (the ability to seamlessly add new nodes). How is
this done without disrupting performance/availability?
- Latency requirements: this system is run on commodity hardware, but has
stringent SLAs. A page can have 150 dependencies, which means that every key
request latency has to be minimized.
- Eventually consistent: with multiple nodes accepting put requests, how do you
  bring the system to an eventually consistent state, and how long should this
  take?
- CAP: what to do in the face of a partition? How to achieve availability?
- Conflict resolution: with data stored on multiple nodes, what happens when
multiple versions are updated concurrently?

- "always writable": their approach to availability is that it should always be
  possible to write data. This is because they think it's better for the user
  experience that data not be lost, instead of having higher latency for a get
  request. Thus, they minimize the write time, even in the face of a conflict.
  When a read is done, that is where conflict resolution takes place.
- Scalability is inherent in the design. Every node is treated equally, and the
system is decentralized. This makes it easier to add new nodes to the system,
and to not have any bottlenecks because of few nodes performing certain
tasks. There is no specialization, except for responsibility of storing a
partition of the keys. I think that this is possible because of the nature of
the system: 'just' storing data. Is it possible to have homogeneous,
unspecialized nodes for other applications?
- "Zero hop DHT" - to minimize latency, for every key requested, every node can
lookup the node that stores that key. This is an easy way to have an expected
time to respond, by bounding the number of hops to be performed. Would it be
possible to experiment with a N hop DHT for different latency requirements?
- They use other techniques discussed in class, but for specific parts: hashing
for partitioning keys; anti-entropy for failure recovery; gossip (epidemic)
for detecting failure and membership; Quorum for writing in case of failure
- They expose an extremely simple interface, just like our first project: get()
  and put(). In addition, they allow the client to specify a context when
  writing, which makes it possible to know which version the client thinks it's
  writing to.
- Certain nodes are responsible for coordinating the replication of certain
keys. This decentralizes the task of replication, without burdening all the
nodes with this responsibility. Because there's a coordinator node, someone
is responsible for the keys, so they shouldn't get lost.
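The versioning context mentioned above can be sketched with a minimal vector clock (the (node, counter) pairs follow the paper's Section 4.4; the helper names are mine):

```python
def bump(clock, node):
    """Return a copy of the clock with this coordinator's counter incremented."""
    new = dict(clock)
    new[node] = new.get(node, 0) + 1
    return new

def descends(a, b):
    """True if clock a is a causal descendant of (or equal to) clock b."""
    return all(a.get(node, 0) >= count for node, count in b.items())

v1 = bump({}, "Sx")          # first write, coordinated by node Sx
v2 = bump(v1, "Sx")          # second write through Sx: supersedes v1
v3 = bump(v1, "Sy")          # concurrent write through Sy
print(descends(v2, v1))                      # True: v1 can be discarded
print(descends(v2, v3), descends(v3, v2))    # False False: conflict, client reconciles
```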

Applicability: they expose a simple interface can be used by anybody,
for many applications. Instead of having a different type of store (e.g.
relational) they provide a very basic one that can be the building block for
(or interfaced to) many applications. By having stringent SLAs, there can be an
expected response time, which makes this very appealing to other customers
(other than internal Amazon services).
Version control is used to resolve conflicts. It uses contexts to know which
version is being written to by the client. In a distributed context, with an
application writing to different nodes, this is necessary to ensure that the
data is eventually consistent.
Clients can decide whether to request through a load balancer, or use a client
library to choose a node directly. The first option is simple, and the second
one allows for a lot of flexibility.

This paper describes a highly available, fast key-value storage architecture.

Amazon DynamoDB aims to be a highly available, scalable, and reliable storage system that guarantees SLAs.

1) Dynamo sacrifices consistency for the sake of availability.
2) One of the main desired features of DynamoDB is high availability for writes. To achieve this, reconciliation and consensus are mainly done during reads; Dynamo uses vector clocks with reconciliation during reads.
3) To handle temporary failures, Dynamo adopts sloppy quorums and hinted handoff.
4) To handle permanent failures, Merkle-tree-based anti-entropy methods are used. The Merkle trees also reduce the amount of data to be transferred during synchronization.
5) Keys are partitioned on a ring of logical nodes. Amazon implements consistent hashing to achieve incremental scalability. Several logical nodes can be stored on one physical node, and N copies of the data are stored on the successive logical nodes in the ring for fault tolerance.
6) They use a gossip-based protocol to detect churn.

(1) Only incremental scalability.
(2) No security measures; it is conveniently assumed to run in a secured environment.
(3) The space requirement for vector timestamps may be prohibitive, and the solution of truncating vector timestamps is not clearly analyzed.
(4) Merkle trees are stored in memory as of now; this will be a problem if the key-partition size is large.
(5) Read operations are made more complex and costly than writes. It is hard to say how clever this is, as normal applications do more reads than writes.
(6) Applications using Dynamo need to act as the catch-all for conflict resolution.

DynamoDB is at present one of the leading providers of key-value storage services.

The paper discusses the implementation details and the design decisions taken for Amazon’s key value storage system named Dynamo.

Problems trying to solve
• Due to Amazon’s architecture being decentralized, loosely coupled, and service-oriented with a large number of services, the need to improve availability arises. Performance, reliability, efficiency, and scalability are further requirements.
• Conflict resolution is another issue tackled; the questions of when and by whom conflicts are resolved are answered.
• Incremental scaling is a key design requirement that needs to be fulfilled.
• The paper also handles failures, both transient and permanent.
• Partitioning techniques are used in order to achieve load balancing.
• Membership and failure detection are addressed.

• The Dynamo system trades consistency for more availability.
• Dynamo aims to be an always-writable data store, so reads have to resolve conflicts.
• It implements incremental scalability, i.e., the ability to dynamically partition data based on the number of nodes at a given time. This is done by extending consistent hashing with virtual nodes.
• Symmetry is preserved with the use of a gossip-based membership protocol and failure detection.
• Decentralization leads to a more scalable and available system.
• Heterogeneity is handled using virtual nodes.
• A “sloppy quorum” removes the availability problems posed by a strict quorum.
• To counter permanent failures, inconsistencies need to be detected and fixed. This is done with Merkle trees, which also reduce network traffic.
• Analysis of partitioning methods on a ring provides better load balancing.

Dynamo is a scalable and available write-optimized key-value store that is currently used by Amazon. The paper achieves its stated goals of improved availability, scalability, and performance.

Authors at Amazon discuss Dynamo, a distributed key-value store with a primary goal of high availability in the presence of constant failures.

- Need a system with high availability, particularly in the case of writes. Customers want to be able to update carts/orders etc.
- Low latency as always, but in this case a targeted goal of low latency at the 99.9th percentile.
- Eventual consistency.
- Handle constant failure of individual components, entire data centers, and/or partitions.
- Scale incrementally.

+ The last two goals are aided by the use of consistent hashing with virtual nodes.
+ Tunable quorum is used (can tune R, the number of nodes that must participate in a successful read; W, the number that must acknowledge a write; and N, the number of nodes an item is replicated to during consistent hashing). This can help with latency and availability (lower R or W) or consistency (raise R or W so that R + W > N), depending on a service's needs.
+ Vector clocks are utilized to allow decentralized writes in general and continued operation in a partitioned network during failure. Divergent version histories (which are rare) are reconciled at read time, by the system when possible or otherwise by the application.
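The tunable quorum above can be made concrete with a tiny check (the configurations shown are illustrative, not prescribed by the paper):

```python
# With N replicas, a read quorum R and write quorum W are guaranteed to
# overlap whenever R + W > N, so some read replica has seen the latest write.
def quorum_overlaps(n, r, w):
    return r + w > n

# Illustrative Dynamo-style (N, R, W) configurations:
configs = {
    "balanced":    (3, 2, 2),  # overlapping quorums, moderate latency
    "fast_reads":  (3, 1, 3),  # reads touch one replica; writes wait for all
    "write_heavy": (3, 1, 1),  # lowest latency, weakest guarantee
}
for name, (n, r, w) in configs.items():
    print(name, "read-sees-latest-write:", quorum_overlaps(n, r, w))
```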

They seemed to meet all of their stated goals with Dynamo. Using consistent hashing, quorums, and allowing conflicts which may be resolved during read operations allowed extremely high write availability. It would be nice to see more performance numbers or comparisons.

The paper introduces Dynamo, a highly available key-value store designed and implemented at Amazon. It was primarily designed to provide high availability in large scale distributed systems by compromising with weak consistency. The paper summarizes the main design requirements that led to Dynamo and how Dynamo has been implemented for use in different Amazon services and the resulting performance analysis.

Amazon is one of the largest e-commerce platforms and demands high reliability in its large-scale distributed system. The entire system must be designed to handle failure like any other event. Many of Amazon’s services are write-sensitive and hence demand an always-writeable store. Many latency-sensitive services also require very low latency. There is always a trade-off among availability, consistency, cost-effectiveness, and performance, and Dynamo aims to achieve the other three traits (even in case of failures) by sacrificing strong consistency.

• The key idea of Dynamo is to provide a highly available system by allowing the system to be eventually consistent. This kind of service finds use in many of the Amazon services which can tolerate some slack in consistency.
• Just like Facebook prefers faster read transactions (as against writes), Amazon services aim for the “always-writeable” property (faster writes) to give a better customer experience. Hence conflict resolution is taken care of during reads to keep writes fast.
• It has a simple key-value store, which is especially suitable since the objects in Amazon services are typically small (less than 1 MB).
• Dynamo is designed to be used in an environment where all the nodes can be trusted.
• It provides very low latency for latency-sensitive applications; for example, 99.9% of transactions can be performed in less than 300 ms. Each node has sufficient information to route a request directly to the right node, thus avoiding hops and making it faster.
• It is a novel idea how Dynamo uses modified consistent hashing for partitioning. It supports heterogeneity among servers: a powerful server is given more than one virtual node mapped onto the circle, and the corresponding load is assigned to that server.
• Data versioning is performed by passing the context information during the transactions and this helps in reconciliation of the objects eventually.
• The consistency protocol is similar to a quorum, with configurable parameters (N, R, W); it relaxes quorum membership and uses hinted handoff to handle temporary node failures. These values are tuned depending on the service.
• It uses anti-entropy for replica synchronization, and using Merkle trees helps in faster detection of inconsistencies.
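The hinted handoff mentioned above can be sketched as follows (a hypothetical toy model of the mechanism, not Dynamo's code; names and the `write`/`replay_hints` helpers are made up):

```python
# Sloppy-quorum write with hinted handoff: if a preferred replica is down,
# the write goes to the next healthy node, carrying a "hint" naming the
# intended owner; the hint is replayed once the owner recovers.
class Node:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.store = {}   # key -> value
        self.hints = []   # (intended_owner, key, value)

def write(key, value, preference_list, n=3):
    # Write to the first n healthy nodes in the preference list.
    healthy = [nd for nd in preference_list if nd.up][:n]
    skipped = [nd for nd in preference_list[:n] if not nd.up]
    for node in healthy:
        node.store[key] = value
    # Stand-ins (healthy nodes outside the top-n positions) hold hints
    # naming the down owners they are covering for.
    stand_ins = [nd for nd in healthy if nd not in preference_list[:n]]
    for stand_in, owner in zip(stand_ins, skipped):
        stand_in.hints.append((owner, key, value))
    return len(healthy)

def replay_hints(node):
    # On detecting recovery, deliver hinted writes to their intended owners.
    for owner, key, value in node.hints:
        if owner.up:
            owner.store[key] = value
    node.hints = [hnt for hnt in node.hints if not hnt[0].up]

a, b, c, d = Node("A"), Node("B"), Node("C"), Node("D")
b.up = False
write("k", "v1", [a, b, c, d])   # A, C, D take the write; D holds a hint for B
b.up = True
replay_hints(d)                  # B catches up from the hinted replica
```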

We can actually experience this implementation when we shop on Amazon and even some other e-commerce and travel websites. The selection and addition of the products is fast but sometimes (rarely) the products are revised during the checkout. So I think implementations like Dynamo have a great use and are currently being used in many of the commercial services.

Summary :

The paper talks about the design and implementation of Dynamo, a distributed key-value store used on the Amazon platform. The authors discuss on how requirements of scalability, high availability, reliability, fault tolerance and consistency are addressed in Dynamo.

Contributions :

1. The design places more emphasis on high availability at the cost of weaker consistency, as the applications demand it. Stringent service level agreements guarantee high performance and low request-serving latency.
2. They achieve good scalability by partitioning the data effectively using the technique of consistent hashing. They employ a strategy to use equal sized partitions such that the load is uniformly balanced across all the nodes.
3. The system guarantees eventual consistency as most of the application scenarios can tolerate inconsistencies. It ensures high reliability by replicating on a collection of nodes called the preference list.
4. Uses object versioning along with vector clocks to perform conflict resolution on updation. A good idea was to resolve conflicts between replicas at read time rather than write time because the data store needs to be highly writable at all times.
5. Hinted handoff is an interesting idea to keep the system always on in spite of temporary failures by resorting to a sloppy-quorum approach.
6. Anti-entropy with Merkle trees is used as the mechanism for synchronizing replicas after permanent failures; Merkle trees require less data transfer between nodes to resolve conflicts.
7. Gossip based protocols are used to propagate membership changes and therefore do not incur much overhead like anti-entropy. Thus, it avoids any centralized storage of membership information.
8. At read time, when it retrieves any stale versions from the nodes, it uses a clever technique called read-repair wherein it tries to update them to the latest versions at that time and thus avoids overhead of anti-entropy.
9. For writes, it tries to choose the node that served the fastest previous read, which leads to a higher probability of “read-your-writes”.
10. Maintains the writes in a write buffer for high performance but balances durability by performing a durable write to one of the replicas.
11. They have also explored the option of improving the latency by moving the co-ordination component to the client in case of which the extra overhead of routing requests by a load balancer is avoided.
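The vector-clock versioning in point 4 can be sketched with a minimal model (the `Sx`/`Sy` server names follow the paper's example; the code itself is an illustration, not Dynamo's implementation):

```python
# Minimal vector-clock sketch: a clock maps node -> counter. One version
# dominates another if it is at least as new on every node; otherwise the
# two versions are concurrent and must be reconciled by the application.
def increment(clock, node):
    clock = dict(clock)
    clock[node] = clock.get(node, 0) + 1
    return clock

def descends(a, b):
    """True if the version with clock `a` supersedes the one with clock `b`."""
    return all(a.get(node, 0) >= count for node, count in b.items())

v1 = increment({}, "Sx")            # write handled by server Sx
v2 = increment(v1, "Sx")            # second write through Sx
v3 = increment(v1, "Sy")            # concurrent write through Sy
print(descends(v2, v1))             # v2 supersedes v1
print(descends(v2, v3), descends(v3, v2))  # neither dominates: concurrent
```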

Relevance :

The design decisions are quite fine tuned with respect to the application requirements. Dynamo is a highly decentralized system and is used for a huge number of services at Amazon. It has been able to successfully provide high levels of availability across all its applications with guaranteed performance.

The paper talks about Dynamo, a highly available and scalable key-value storage system serving Amazon's e-commerce platform. The paper discusses the ensemble of techniques implemented by this distributed system and how it guarantees various Service Level Agreements (SLAs).

For a better customer experience, the e-commerce system at Amazon must never be down, hence should always be available. The system cannot afford to lose updates by customers (e.g., adding items to a cart), hence must be "always writeable". The system should be decentralized to avoid a single point of failure, and must scale as the number of customers increases.

The main contribution of the system is that it shows how various distributed-systems concepts can be implemented together to achieve a highly available and eventually consistent key-value store with a focus on reliability, scalability, and performance.
1. Simple primary key based data model. Store information on large number of distributed, low cost nodes.
2. Sacrifice strong consistency for high availability. The system aims for 'eventual consistency'.
3. Replication is done for reliability and availability. All conflict resolutions are done at read time, keeping write logic simple.
4. Incrementally scalable through consistent hashing, and supports heterogeneous components (add virtual nodes based on capacity).
5. Decentralized in nature: No single coordinator, every node can act as a coordinator for the data it handles.
6. Sloppy quorum and hinted handoff to handle temporary failures.
7. Anti-entropy (using Merkle trees) to recover from permanent failures. It's an interesting way to detect inconsistencies by just using hashes, avoiding significant network traffic to transfer tree-related information.
8. Vector clocks (lists of [node, counter] pairs) to capture different versions of an object.
9. Flexibility for services to tune the read and write quorums for a desired level of consistency, performance, and availability.

A large number of web portals (social networks or e-commerce platforms) use highly available key-value stores. The memcached deployment at Facebook is optimized for reads, while Dynamo targets writes. Other key-value stores include LinkedIn's Voldemort, which derives inspiration from Amazon's Dynamo.


The paper presents the design and implementation of Dynamo, Amazon's key-value store. Dynamo is a distributed, replicated data store mainly aimed at providing very high availability (an "always writeable" guarantee), reduced latency, and high tolerance of network partitions (as long as a few machines are operating, services keep running). In accordance with the CAP theorem, the system trades off strong consistency, providing eventual consistency instead.


Following are the shortcomings with using any existing RDBMS methods for storing the data:

  • The keys and values are small and simple, hence no requirement for complex querying.
  • No operations span multiple data entries.
  • Scaling and data partitioning are difficult to achieve with relational databases.

Dynamo solves these problems by using a zero-hop distributed hash table solution.


  • One of the key attractions of their design is allowing individual services to configure their own requirements for availability, consistency and durability, as these requirements vary a lot across different applications.
  • Requirements and measures being done on 99.9th percentile rather than average showed how significant it is for their design to meet majority of the user's needs.
  • Data partitioning and replication is achieved by using multiple virtual nodes placement on to a consistent hash ring, thereby accounting for sufficient load balancing and handling of heterogeneity.
  • As they resort to eventual consistency, conflict resolution should be handled. They handle resolution during read operations as they are targeting a "write intensive" data store, also use data versioning for this purpose.
  • Vector clocks are used for syntactic conflict resolution; semantic conflict resolution is handled by the client application.
  • The concept of using sloppy quorum(along with hinted handoff) was impressive as a regular quorum does not ensure availability guarantees.
  • To handle permanent failures, Dynamo uses Merkle tree based anti-entropy, which is well-suited to avoid heavy comparisons to identify out-of-sync state.
  • Mapping of physical nodes to virtual nodes(membership) is handled by using a gossip-based protocol, which is highly efficient in spreading the updates quicker.
  • The choice of co-ordinator node (the node that handles the required operation) is critical for ensuring low latency, and their choice of picking the write co-ordinator based on the previous least-latency read is very clever.
  • In their attempt to improve the load balancing done by the consistent hashing ring, they make a significant change: separating data partitioning from placement.


Dynamo's implementation is currently in deployment for many Amazon online services. Dynamo's design presents many interesting features which can be used individually in many applications that need replication, decentralization and high availability. Their design is also targeted to provide high customer-satisfaction for majority of the users which is highly critical for any online service. With the trend moving towards online purchases for all possible day-to-day requirements, design considerations used in this architecture can be used to easily build further enhanced systems.


Dynamo is a highly available distributed key-value store built for the Amazon platform. To achieve high performance, availability and scalability, Dynamo compromises consistency under certain failure scenarios. In this paper the authors describe its design and implementation.


One of the biggest challenges faced by Amazon is reliability at massive scale, given that it runs one of the largest e-commerce operations in the world. Even small outages have significant financial consequences and impact customer trust. Hence the system needs to provide decentralized data storage with high performance, availability and scalability, even during severe failures. Eventual consistency was good enough, so strict, immediate consistency was not a requirement. Redundancy and duplication of data were also inherent requirements for Dynamo.


  • For a better end-user experience, they measured performance at the 99.9th percentile instead of simply looking at the mean or median.

  • Married and implemented many of the already existing techniques and methodologies such as consistent hashing, quorum (sloppy), object versioning, anti-entropy propagation algorithm, gossip protocols, vector clocks; to achieve synergy in an active production environment.

  • High customizability to suit the specific needs of the client application, with parameters governing the replication count and the number of participating nodes required to carry out read or write requests.

  • Eventual consistency and reconciliation of conflicts is achieved with data versioning and vector clocks.

  • Understanding the environment and the requirements led to the adoption of a simple key-value store over an RDBMS, which carries additional overheads detrimental to the requirements of high availability and performance.
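The point above about measuring the 99.9th percentile rather than the mean can be illustrated with a toy computation (the latency numbers are made up):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest value covering fraction p of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered)))
    return ordered[rank - 1]

# 1000 requests: mostly fast, with a small tail of slow outliers.
latencies_ms = [10] * 995 + [900] * 5
mean = sum(latencies_ms) / len(latencies_ms)
p999 = percentile(latencies_ms, 0.999)
print(f"mean={mean:.2f} ms, p99.9={p999} ms")  # the tail tells the real story
```

A healthy-looking mean completely hides the slow tail, which is exactly why Amazon's SLAs are stated at the 99.9th percentile.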


This paper is very relevant not only because it is used by Amazon successfully but also because it shows the effectiveness of a distributed key-value store from the perspective of availability and performance. Today DynamoDB is a fast, fully managed NoSQL key-value database service that makes it simple and cost-effective even for small businesses to store and retrieve any amount of data. Apart from supporting the Amazon platform and e-commerce, it is a great fit for gaming, ad tech and mobile. With the advent of flash storage, even faster performance is achieved.

The paper introduces Dynamo, a distributed key-value store by Amazon. The goal of the system is to provide high availability by sacrificing consistency under certain failure scenarios thus providing a system which is "always-on". The system is eventually consistent.

Failure of components is the norm at Amazon, where millions of commodity components are involved in providing services to thousands of customers. Even in the face of failures, they wanted a system available at all times for customers to make transactions, which are mostly writes (e.g., adding an item to the shopping cart). Therefore the problem they were addressing was to develop a distributed store that is always writeable. They also noticed that most of their services used an RDBMS only through the primary key.

1. The important contribution of the paper is to combine different distributed-systems techniques and ideas, like consistent hashing, quorum protocols, and anti-entropy, to build a highly available key-value store that works as expected.
2. Pushing the task of conflict resolution to the application which reads the data was a very good idea as the application can decide to resolve the conflict the way it seems most suitable.
3. The idea of storing the context of an object (which carries its version information) when the object is stored using put() was very useful in conflict resolution.
4. The introduction of "preference lists" to determine the set of nodes that store a particular key helped in replication.
5. The use of "vector clocks" to capture the causality between the different versions of the same object helped in conflict resolution and reconciliation.
6. The idea of using a "sloppy quorum" (performing reads and writes on the first N healthy nodes in the preference list) instead of a strict quorum was very important, as it guarantees the availability of the system under server failures and network partitions.
7. Using "hinted handoff" to manage temporary failures made sure that the reads and writes are always possible in the system.
8. The use of "Merkle trees" to minimize the amount of data transferred and to increase the rate of detection of inconsistencies between replicas helped to improve the anti-entropy process to be faster.
9. Treating the failures as a local issue and not as a global issue made them to come up with decentralized failure detection protocols.
10. The ability of tuning the system parameters N, R, W according to the kind of service needed was also a good feature of dynamo.

The concepts discussed in this paper have wide spread applicability in real life as the system is the backbone of the world's largest online store, Amazon. The designers have understood the problem they were trying to solve and have developed a highly available, incrementally scalable and eventually consistent system which many of us use for our online shopping needs.

This paper presents the design and implementation of Dynamo - a distributed key value store. Dynamo is used by several internal Amazon services for storing application state. Dynamo compromises consistency for very high availability.

Failures are the norm rather than the exception in a huge system. Most of the component services at Amazon need high availability. The authors describe a design that can be highly available (relaxing consistency) in the face of failures. The system is fully implemented and functional in production. The services also have strict performance demands.

This paper uses lots of ideas that were introduced by previous research works. For example, consistent hashing, anti-entropy, gossip-based protocols etc. Still, putting all these ideas together and designing a system that works very well in practice is quite impressive. Some of the techniques used and introduced are:
1. Showed that eventually consistent data is good enough to run a very large e-commerce site.
2. Syntactic conflict resolution using vector clocks whenever possible. The idea of identifying causally related updates was nice. Also, giving the application a chance to do semantic conflict resolution using some business logic.
3. Providing handles to tune consistency, availability and durability guarantees for applications. Parameters like R, W and N can be tuned to get different guarantees from Dynamo.
4. Using Merkle trees for efficient anti-entropy synchronization. This tree can cut down traffic very well.
5. Gossip based protocols to spread membership changes.
6. Achieving 'read-your-writes' consistency by choosing the node which replied fastest to the previous read as the co-ordinator seems neat.
7. Hinted hand-off to take care of temporary failures.
8. Client driven co-ordination to reduce latency for operations.

The ideas discussed in the paper are highly relevant. These ideas form the infrastructure required to run the World's largest e-commerce business. Some ideas like configurable C & A trade-offs, hinted hand-off etc are really good and I believe are being used by other similar implementations.

Amazon's Dynamo is a highly available key-value store which provides 99.9th-percentile reliability and latency guarantees for hundreds of services and millions of users (i.e. at massive scale) with decentralized techniques such as gossip and consistent hashing, which effectively result in a single high-performance (and scalable) system.

Amazon has tens of millions of customers served by a huge, geographically dispersed fleet of servers. How could Amazon build a system that meets 99.9th-percentile SLA requirements and provides a reliable store that is always writeable? Clearly, the traditional, strongly consistent ACID RDBMS data model was incompatible with the high availability and low latency that Amazon required.

Amazon decided to use a highly available key value store with an eventually consistent paradigm to construct their distributed storage system. Data integrity & security took a backseat to performance since the environment is completely internal to Amazon (and is assumed to be non-hostile).

1-Amazon Dynamo demonstrated that a multi-billion-dollar e-commerce organization could succeed with an eventually consistent model.
2-Conflict resolution during read instead of write seems original. Get() returns multiple versions of an object and the vector-clock helps you determine which version is the correct one.
3-Put() returning immediately, before all replicas are written, also improves performance.
4-The sloppy quorum allows the user to determine factors like data durability so that the right balance of performance and reliability can be met for various services with different requirements.
5-The hinted handoff helps the system remain “always writeable”
6-The Merkle hash tree saves bandwidth and helps speed up data comparisons during reconciliation of replicas
7-Very efficient gossip protocols are used for propagating membership changes and for detecting failures
8-Consistent hashing, virtual nodes and “strategy 3” combine to allow a flexible, scalable model where partitioning and placement can be handled independently.

I believe the top contribution is also very applicable: traditional RDBMS databases dominated for a long time and it was good for a company (Amazon) to demonstrate that an eventually consistent approach was completely viable for a major ecommerce institution. My only concern is that many companies seem to possibly misunderstand this and quite a few “drank the NoSQL Koolaid”. Amazon made very pragmatic engineering decisions to strike a delicate balance between reconciling conflicting data while keeping the customer happy (by always allowing writes).

Dynamo is Amazon’s highly available key-value store used to persist state for their e-commerce platform while being scalable. This is done by giving up consistency for availability in the event of network partitions; however, the system is still eventually consistent, as there is a way to reconcile conflicting updates after the partition is resolved. Other techniques heavily used are consistent hashing, object versioning using vector clocks, managing consistency using a “sloppy quorum” and Merkle trees, and using a gossip protocol for failure detection and membership.
Amazon needed a highly-available simple store for their business logic that was decentralized, fault tolerant, durable, and performant with guaranteed Service Level Agreements (SLAs) for their consuming services that are measured in the 99.9 percentile. Another key property is the system had to be highly-scalable in that storage nodes could be added and removed due to failure, maintenance, or need with no down time.
Amazon combined techniques used in previous papers we’ve read to create a tunable system useful for various workloads. Several optimizations and extensions to these techniques were made to increase their practicality for production use.
• Using consistent hashing they created a preference list of a variable number of nodes that formed a cluster which could handle ranges of keys.
• These clusters usually consisted of 3 nodes with 2 readers and 2 writers, forming a “sloppy quorum” that used hinted handoff, anti-entropy, and Merkle trees to maintain consistent replicas.
• Using a gossip based protocol for failure detection and membership discovery, but centralizing it a bit to avoid partitions by having “seed” nodes that were a source of truth for membership and eventually reconciled against by every other node.
• Using vector clocks to manage conflicts from partitions, but if unsolvable passing the conflict off to the consumer. If there were too many pairs of vector clocks used, timestamps were needed to remove the least recently used from the list of pairs due to memory. Although this truncation is unlikely to occur often, resorting to timestamps after using vector clocks seems like a hack. What if there was clock skew between servers and the LRU vector was actually the most up to date?
• Coordinating requests on the servers at first such that any server could coordinate a read and any server in the preference list could coordinate a write. Consuming clients could save a network hop by coordinating themselves using up to date membership information. To me having the server coordinate and incur an additional network hop seemed wholly unnecessary.
Besides the obvious application as Amazon’s internal key-value store for its business applications, this paper led to the creation of a very similar open-source database, Cassandra, created at Facebook for their Inbox Search. Later at Amazon this system was monetized as part of Amazon Web Services (AWS) to become DynamoDB.

The paper talks about the distributed key-value store of Amazon, Dynamo, which was designed with high availability and scalability goals. The key-value store, which provides eventual consistency, is used by a variety of Amazon’s applications.

The main goal of the engineers was to develop a highly available system for writes, mainly due to the nature of the workloads Amazon’s services deal with. The distributed store also needs to be highly decentralized in order to avoid a single point of failure. Finally, network failures should not stop the system from serving customer requests.

1. Employed variety of key algorithms including consistent hashing, anti entropy, gossip propagation, quorum for reads and writes.
2. Applied a fixed-partition strategy to consistent hashing in order to minimize the bookkeeping information at every node and at the same time achieve uniform load distribution.
3. Allows application to decide on the consistency levels required by configuring the read-write quorum parameters.
4. Using vector clocks in order to reconcile updates.
5. Refining consistent hashing to use more/less virtual names for heterogenous resource base.
6. Resolving conflicts during read (as opposed to resolving during a write done by many systems). This proves to be efficient as writes need to always succeed for their system.
7. The metrics with which the system was measured at 99.9 th percentile of the distribution. This would make more sense to their working set, since they value each and every customer.
8. A greedy approach as to optimize at each level of service dependency by proving SLAs.

The system is designed for a specific set of workloads which tolerate eventual consistency and require high write availability. In a nutshell, the design decisions meet Amazon's goals of gaining more customers.

This paper summarizes Amazon’s highly available, scalable and distributed key value store. This was developed to deal with the strict availability and performance needs imposed by the SLAs, thereby sacrificing strong consistency and making the system eventually consistent.

In a huge system like Amazon, with a lot of nodes, single-node failures and network partitions are not uncommon. In those situations it is very important that customers who use Amazon are still able to connect, requiring the system to be available at all times. Reliability is also a key factor, because even the smallest outage has significant financial consequences and impacts customer trust. So Amazon came up with a solution that prioritizes availability while also ensuring high performance and correctness.

The system uses a lot of key concepts/ideas in solving their problems at every stage.
a. Since databases that ensure ACID properties tend to have poor availability, Dynamo chooses a key-value store, as values are retrieved using only a single primary key. Storing key-value pairs requires much less space and is also highly scalable and easy to maintain.
b. Many techniques used are purely targeted at controlling performance at the 99.9th percentile.
c. For partitioning data across nodes, the system uses consistent hashing. The advantage of this technique is that it provides incremental scalability and uniform load balancing. To account for heterogeneity of nodes, there are multiple virtual tokens for a node on the circle it is mapped to.
d. The system provides flexibility to the clients to tune features like durability, consistency and availability where they can choose replication factors or the number of read and write quorums.
e. Providing highly available writes by using vector clocks, with reconciliation of conflicts during reads either on the server side or the client side depending on the application.
f. They use anti-entropy with Merkle trees to compare and exchange updates, because it reduces the amount of data that needs to be transferred while checking for inconsistencies.
g. A gossip protocol is used to identify when a new node is added to the system, and the same is done for failure detection. This eliminates the need for a centralized master to announce when a node is added or removed.
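The gossip-based membership in point (g) can be sketched with a toy simulation (versioned membership entries; all details here are illustrative, not Dynamo's actual protocol):

```python
import random

# Each node periodically picks a random peer and the two merge membership
# views; versioned entries let newer information win, so a change spreads
# to all nodes within a few rounds without any central master.
def gossip_round(views, rng):
    nodes = list(views)
    for node in nodes:
        peer = rng.choice([n for n in nodes if n != node])
        # Merge in both directions: higher version wins.
        for member, version in views[node].items():
            if views[peer].get(member, -1) < version:
                views[peer][member] = version
        for member, version in views[peer].items():
            if views[node].get(member, -1) < version:
                views[node][member] = version

# Node A learns that D joined (version 1); everyone else has no entry yet.
views = {n: {"A": 0, "B": 0, "C": 0} for n in "ABC"}
views["A"]["D"] = 1
rng = random.Random(42)
rounds = 0
while not all("D" in view for view in views.values()):
    gossip_round(views, rng)
    rounds += 1
print("membership change converged after", rounds, "round(s)")
```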

This system is currently used by many services at Amazon and is highly applicable to current systems where availability is a primary concern. Techniques such as consistent hashing are being used in systems like Facebook’s memcache. The problems listed in the paper are highly relevant to distributed-systems engineers, especially how to resolve write conflicts in the case of a network partition. Moreover, the idea of using Merkle trees for anti-entropy is very efficient and is used by systems like Cassandra.

This paper presents the design and implementation of one of Amazon’s highly available and scalable distributed key-value data store as well as some design discussions.

Many services on the Amazon platform only need primary-key access to a data store. In this case, using a relational database is inefficient and would limit scalability and availability, so a simple key-value store is a good choice.
The database should always be writeable. For Amazon, one of the most famous e-commerce operations, customer experience is the first consideration. Rejecting users’ operations is not tolerable, such as forbidding users from adding goods to their shopping carts. So the data store should be writeable even when the network is temporarily unavailable.
A reasonable Service Level Agreement (SLA) metric should be proposed to guarantee all users’ experiences, not just the majority's.

A variant of consistent hashing is used to have more uniform data and load distribution and heterogeneity in performance. Specifically, the concept of virtual nodes is used in the consistent algorithm. Each nodes are able to adjust the number of virtual nodes it can maintain according to its capacity.
Data are allowed to have different versions due to the concurrent writes or node failures via the mechanism of vector clocks. Reconciliation work can be done by any servers by checking the vector clocks. If the data is still unreconciled, send versions to upper handle logics. Different applications can have different reconciliation policies depends on their own requirements.
Temporary node failure: If one desired node is not reachable at some point of the time, the data would be sent to another node. This node would keep the data and also attempt to check if the failed node is recovered. If it is, return the data to its desired node.
Permanent node failure: The anti-entropy is adopted. To reduce the comparison overhead, the node would maintain a Merkle tree on its managed key range.
Client-driven coordination: request coordination is moved from servers to clients, eliminating the overhead of the load balancer and an extra network hop.
The 99.9th percentile of the latency distribution is used to measure SLAs at Amazon.

Most technologies used in Amazon's Dynamo are already well known, which shows that these theories are effective in practice. As with Facebook's memcache, the design is practice-driven: the underlying ideas may be quite simple, but the designers adapted them to accommodate real requirements; for example, virtual nodes make consistent hashing more failure-tolerant.

In this paper the authors describe Dynamo, Amazon's highly available and scalable distributed key-value store. To achieve high availability it sacrifices consistency under certain failure conditions and relies on client applications to reconcile conflicts. To support a heterogeneous set of applications, Dynamo provides a high degree of customization: client applications can select their desired levels of performance, availability, and durability by tuning N, R, and W, and can handle conflicts at the application level.

Many of Amazon's applications do not require the full complexity of an ACID database. They can do without transactions, structured data, and perfect consistency, but they have high reliability and availability requirements in order to provide a seamless user experience. Therefore availability and guaranteed execution time are their highest priorities.

-Integration of multiple techniques such as consistent hashing, replication, Merkle trees, anti-entropy, sloppy quorum, and object versioning in a production system.

-Acknowledging that perfect consistency and availability cannot both be achieved, and giving users the ability to configure N, R, W and the reconciliation mechanism according to application requirements.

-A partition-aware client library to route requests to the coordinator directly.

-Using hinted handoff to guarantee system availability.

-And their method for measuring performance; rather than using mean or median performance, they look at the latency of the 99.9th percentile.
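That last point is easy to demonstrate: a mean can look healthy while a small fraction of users suffer badly. A tiny sketch of the nearest-rank percentile estimator (my own illustration, not code from the paper):

```python
import math

def percentile(latencies_ms, p):
    """Nearest-rank percentile, a simple estimator for SLA dashboards."""
    ranked = sorted(latencies_ms)
    rank = math.ceil(p / 100 * len(ranked))
    return ranked[rank - 1]
```

With 997 requests at 10 ms and three at 500-700 ms, the mean is about 12 ms while the 99.9th percentile is 700 ms, which is exactly why Dynamo's SLAs target the tail.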

As described in the paper, I think Dynamo cannot handle a single application that needs the system tuned for heterogeneous workload types. Such an application may need to run several instances of Dynamo, each tuned differently for one workload type.

Dynamo has been successfully deployed and supports Amazon's core online services. It is a very good example of an incrementally scalable distributed system that provides high availability and durability. The paper discusses a number of interesting techniques used in its design that can be used by system designers while building similar systems.

Summary: This paper introduces Amazon's key-value storage system, Dynamo. Dynamo is a distributed data store that achieves high availability and efficiency at the cost of weaker consistency guarantees. It features a write path that keeps succeeding even when hardware failures happen.

Problem: They want to build a data storage system that meets the needs of a very large-scale e-commerce website.
1. High availability. Given the nature of e-commerce, even if a small portion of users is affected for a short period of time, a lot of money is lost. Therefore availability is their top priority. The problem is to build a system that remains available even when hardware fails or a whole data center is destroyed by a tornado.

2. Low latency. Latency directly affects a user's experience, so they set a very stringent requirement for the latency of their storage system: they want 99.9% of requests to be processed within hundreds of milliseconds.

3. Scalability. The system should scale well when we add more nodes.

4. Durability. It is never okay to lose users' data, even when hardware fails. Therefore data must be replicated on different nodes.

Contribution: Amazon solved these problems by building a system that sacrifices some degree of consistency. Their main contributions are:
1. A consistency model that meets the needs of high availability and low latency. Dynamo accepts multiple conflicting versions of the value associated with the same key. When a user does a lookup, all the conflicting versions are returned, and it is the user's responsibility to resolve the conflict. This design greatly simplifies the consistency-handling part of the storage system, and it is usually reasonable because the application knows better than the system how to resolve its own conflicts.

2. Use of both consistent hashing and replication to build a load-balanced and durable system. Dynamo uses consistent hashing to locate the node that stores a datum. Unlike traditional consistent hashing, it also employs a quorum-like algorithm to store the data on multiple nodes following the coordinator on the ring. The combination of the two mechanisms results in a system that is both load-balanced and durable.

3. Improvements to traditional consistent hashing. They propose three strategies to ensure data is uniformly distributed. In two of them, in contrast to traditional consistent hashing, the key space is first divided evenly into partitions, and each node is then assigned some partitions. In this way, the schemes for data partitioning and data placement are separated, which makes it easier to add nodes and rebalance load.
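The decoupling can be sketched in a couple of functions (my own simplification of the equal-sized-partition strategy; names and the round-robin placement are illustrative):

```python
def assign_partitions(num_partitions: int, nodes: list):
    """Deal Q equal key-space partitions round-robin to nodes: the
    partitioning is fixed, only the placement changes as nodes join."""
    return {p: nodes[p % len(nodes)] for p in range(num_partitions)}

def partition_for_key(key_hash: int, num_partitions: int, ring_size: int) -> int:
    # With equal-sized partitions, locating a key is simple arithmetic,
    # so membership changes move whole partitions rather than forcing
    # nodes to rescan arbitrary key ranges.
    return key_hash * num_partitions // ring_size
```

Adding a node then means reassigning some partition numbers, not recomputing which keys live where.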

Applicability: This work is applicable to most scenarios that do not have a tight requirement for consistency.

This paper presents the design of Amazon's Dynamo, a highly available key-value store that lets users tune various trade-offs according to their requirements.

According to the CAP theorem, a replicated system cannot be both highly available and strongly consistent in the presence of a network partition. Dynamo relaxes the consistency requirement (it provides eventual consistency) to achieve the high availability needed to fulfill its SLAs. Consistent hashing is used to partition and replicate data among nodes. The responsibility for figuring out a consistent view of an object is delegated to the applications, which are given context information (vector clocks) to do so. Every read (write) requires a configurable read (write) quorum of nodes; for example, to make the system highly available for writes, clients can use a write quorum of 1. Dynamo also uses anti-entropy to resolve inconsistencies between nodes.
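The (N, R, W) knobs can be captured in a tiny configuration sketch. The key property is that R + W > N makes every read quorum intersect every write quorum, so a read sees the latest acknowledged write; note this guarantee assumes strict quorums, which Dynamo's sloppy quorum deliberately weakens (the class and names below are mine, not Dynamo's API):

```python
from dataclasses import dataclass

@dataclass
class QuorumConfig:
    """Per-application (N, R, W) knobs as the reviews describe them."""
    n: int  # number of replicas for each key
    r: int  # replicas that must answer a read
    w: int  # replicas that must acknowledge a write

    def overlapping(self) -> bool:
        # R + W > N: read and write sets must share at least one node,
        # so (with strict quorums) a read sees the latest acked write.
        return self.r + self.w > self.n

common = QuorumConfig(n=3, r=2, w=2)       # the common setting the paper reports
fast_writes = QuorumConfig(n=3, r=3, w=1)  # maximally available writes
```

Lowering W speeds up and "availabilizes" writes at the cost of pushing more reconciliation work onto reads.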

1. The main contribution of this paper is combining many different techniques (consistent hashing, vector clocks, anti-entropy, and a gossip-based protocol for membership changes) to provide high availability and fulfill SLAs.
2. While Facebook's memcache system was designed for a read-intensive workload, Dynamo is designed to be highly available for writes. This forces them to use vector clocks to keep each version of an object. Later, during a read, clients get all the conflicting versions and make their own decision. Applications are developed with this in mind, and the technique suits their workloads quite well.
3. They use hinted handoff to deal with temporary failures. Merkle trees are used to resolve inconsistencies in different key ranges caused by permanent failures; using Merkle trees reduces the traffic between nodes.
4. I liked the idea of using gossip based protocol to detect membership changes and failure detection.
5. They made a change in their implementation to allow writes at any node instead of only a coordinator node. Since a write usually follows a read, they choose as the write coordinator the node that responded quickest to the earlier read. This also increases their chances of "read your writes" consistency.

Applicability: These techniques are implemented and used by Amazon, one of the biggest e-commerce websites, to serve millions of requests. That alone shows the applicability of this paper. The system provides many configurable parameters (R, W, N) and gives clients the ability to resolve conflicts. Through these, it can be configured for different types of workloads, and different clients can get different trade-offs between consistency and availability.

Summary: This paper presents the design and implementation of Dynamo, a highly available key-value storage of Amazon. It uses eventual consistency for high availability.

Amazon's services run on tens of thousands of servers in different data centers, and the platform needs high performance, reliability, efficiency, and scalability. For some services, key-value store operations are enough; in those cases a traditional database would be inefficient and would limit scale and availability.

(1) Instead of rejecting writes, Dynamo pushes the complexity of conflict resolution to reads.
(2) The application performs the process of conflict resolution.
(3) Dynamo relies on consistent hashing to distribute across multiple storage hosts. It uses virtual nodes to achieve better load balancing and heterogeneity.
(4) Dynamo replicates data on multiple hosts. A preference list will be responsible for storing a particular key.
(5) It uses eventual consistency. Dynamo uses vector clocks to track different versions of an object. Upon a read request, all unreconciled branches are returned together with their version information. The consistency protocol is similar to a quorum system.
(6) It uses a "sloppy quorum": all reads and writes are performed on the first N healthy nodes from the preference list. Hinted handoff deals with temporary node or network failures.
(7) It uses Merkle trees for anti-entropy.
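The sloppy quorum and hinted handoff of point (6) can be sketched as a routing function: walk the preference list, skip dead nodes, and record a hint on each substitute so it can return the data later (this is my own simplification, not Dynamo's actual code):

```python
def route_write(preference_list, alive, n):
    """Sloppy-quorum sketch: write to the first n *healthy* nodes on the
    preference list. A substitute node stores a hint naming the down node
    it covers for, so it can hand the data back once that node recovers."""
    intended = preference_list[:n]
    targets, hints = [], {}
    for node in preference_list:
        if len(targets) == n:
            break
        if node not in alive:
            continue
        targets.append(node)
        if node not in intended:
            # This node is standing in for a down member of the top n.
            covered = set(hints.values())
            down = next(m for m in intended
                        if m not in alive and m not in covered)
            hints[node] = down
    return targets, hints
```

For example, with preference list A, B, C, D, E and B down, the write lands on A, C, D, and D keeps a hint that the replica really belongs to B.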

It is a huge undertaking to build a system that can scale to tens of thousands of nodes. Dynamo is a successful project that serves Amazon, and it also illustrates trade-offs: for example, eventual consistency in exchange for high availability, and a latency requirement measured at the 99.9th percentile of the distribution. Though Dynamo is just an internal platform at Amazon, I think security can still be a problem; at the least, employees may have different privileges to access the data. Dynamo could have a pluggable security component, perhaps something similar to Apache Sentry.

The paper presents the distributed key-value store called Dynamo, including the core components of its design and the nuances of its implementation. The system is focused on high availability and the characteristic of being “always writeable” as it’s used to drive a number of customer facing applications. This approach is taken because not meeting such demands would mean forfeiture on SLAs and a cost penalty for Amazon.

-Provide a system that was highly available and “always writeable”
-Provide guarantees at the 99.9th percentile instead of the mean and median that traditional systems employ.
-Provide a system that is incrementally scalable and easy to maintain as it scales
-Provide a system that takes into account the heterogeneity of the underlying infrastructure it's running on top of.

-Consistent hashing with virtual nodes (although this is far from novel to their system)
-Graceful handling of failures with hinted handoffs and anti-entropy
-Data versioning, which allows eventual consistency and asynchronous updates in the system.
-Always writeable semantics through the use of vector clocks
-Giving users of the system knobs to tune certain features of the system to meet their application's specific needs

Application to Real Systems:
Dynamo is heavily utilized within Amazon which is a company that operates at massive scale. It therefore has proven itself as an applicable real world system that solves a very important problem for Amazon and other applications that run on top of this service.

Summary: This paper introduces Amazon's storage system (called Dynamo) which is highly available and with hundreds of thousands of servers geographically distributed.

Problem: Amazon's storage system must support customer-facing e-commerce services, e.g. best-seller lists, shopping carts, and so on. Customer demand requires:
1. Rejecting service is bad for the user experience, so the system should be highly available for simple requests. Most customer requests are not complex queries but simple key-value lookups (given a key, get a value), so relational databases, which specialize in complex queries, are not used; instead, the system stores simple key-value pairs. Moreover, strong consistency decreases availability, so this system uses weaker consistency (replicas propagate asynchronously and data is always readable) to guarantee availability.

2. The service should be efficient. Each service request decomposes into hundreds of requests to the storage system, so there are service level agreements (SLAs) on the expected request distribution and latency. Because the latency distribution is heavy-tailed, bounding worst-case latency is unrealistic; a more practical approach is to bound latency for the vast majority of requests. In this paper, their practice is to bound latency at the 99.9th percentile of the request distribution.

3. The storage system has hundreds of thousands of heterogeneous servers. To guarantee availability, these servers should be symmetric and decentralized, which also makes the system simpler and more scalable.

Contributions: They integrate a collection of systems and methods from previous papers into a production system for Amazon. Their engineering lessons are:
1. They use consistent hashing for incremental scalability. To further improve availability, they use the virtual-node concept and map each physical node (server) to multiple virtual nodes on the ring.

2. They provide eventual consistency, which allows updates to propagate to all replicas asynchronously. They use multi-version data (each put creates a new version) to support this. For reconciliation, Dynamo uses vector clocks to capture the causality among versions, and reconciliation returns all the versions at the leaves.

3. They use quorum systems to maintain consistency among replicas. To deal with transient server failures, they use a sloppy quorum: all reads and writes are performed on the first N *healthy* servers. Hinted handoff is also used for availability. To deal with permanent failures, they use anti-entropy.

4. Application-specific tuning of N, R, and W in the quorum trades off availability, consistency, performance, and durability. They usually choose (N, R, W) = (3, 2, 2).

Applications: Similar to Facebook's memcache, Dynamo is the system behind Amazon. It is application-driven and combines many existing techniques with a huge amount of engineering experience.

Summary: The paper presents the design and implementation of Dynamo, Amazon's highly available key-value store. Its focus on 99.9th-percentile performance makes this system unique in its goal of optimizing customer experience. Users of the system also have useful "knobs" they can turn to tailor it to their workloads.

Problem: The CAP theorem states that there will always be a trade-off between consistency and availability when tolerating network partitions. Depending on one's goals, one has to make different design choices for a distributed system. Amazon wanted to build a system that optimized for the customer's experience of their services, which means the system must have high availability. However, one does not want to forgo consistency completely, and striking this delicate balance for users of the system is hard: users want a system that is highly available, yes, but also one that is fairly consistent. Amazon uses interesting techniques to address this trade-off.

Contribution: This paper's contributions lie not in laying out novel techniques for a reliable data store, but in combining existing techniques into a highly available system. It uses consistent hashing with virtual nodes for replica placement and load balancing, a modified "hinted handoff" quorum for consistency, vector clocks for versioning, etc.

Another interesting contribution was the small modification to their consistent hashing algorithm, made because the standard approach wasn't giving them the performance they desired. Their first implementation coupled the data partitioning and data placement schemes. Therefore, whenever they added a new node, the partitioning of the data changed, forcing nodes on the ring to perform a costly scan of their keyspace. To preserve foreground performance, this scanning had to run as a low-priority task, and new nodes sometimes took a whole day to reach proper utilization. Their solution was to partition the data space into fixed, equal-sized partitions and make each virtual node responsible for the same number of partitions.

Another nice feature of the system is that it gives services some say in what trade-offs they want. It offers clients the ability to determine how conflicts are resolved: either the client application reconciles divergent versions, or it defers reconciliation to the server (which does simple "last write wins" logic). Clients can also tune the kind of performance they get out of the system by specifying the quorum variables.

Applicability: Like the Facebook memcache paper, this system is obviously applicable, as it is used by a major tech company that services an enormous number of requests. Like most production systems, practicality is a major consideration in its design and implementation, something that is sometimes overlooked in other research work. One of the aspects of Dynamo that makes it applicable beyond Amazon's own services is that it lets users tune the system to perform well under specific workloads: an administrator of a Dynamo instance controls the quorum replication parameters (N, W, R), and specific tunings of those values let the system deliver high performance for reads or for writes.

Dynamo is Amazon’s highly available key value store solution that was created to meet Amazon’s strict SLAs for availability and performance, while providing users of Dynamo the ability to tune the store based on their needs. Based on an eventual consistency model, Dynamo attempts to make write acceptance a high priority, even under scenarios of partition, with reconciliation being left to applications that use the store unless it can syntactically correct a divergence. By allowing applications to specify values of R, W and N, as well as being able to define their own reconciliation methods, Dynamo provides a large amount of flexibility without sacrificing performance/correctness.

Engineering Solutions

  • Providing highly available writes by using vector-clock-based context to perform reconciliation during reads is an interesting idea (the 'when').

  • Data versioning to enable applications to determine how to reconcile inconsistencies caused by partitions (the 'who').

  • Providing flexibility to applications to choose how to tune Dynamo’s availability, durability and consistency, through choices for replication factor, and the write and read quorums.

  • Handling transient failures (through sloppy quorums and hinted handoff), and permanent failures (through anti-entropy) provides availability despite faults.

  • An interesting approach to increase the odds of 'read your writes' consistency: the write coordinator is chosen as the node that responded quickest to the last read. In a sense, this also provides some form of load balancing.

  • The decoupling of partitioning and placement to prevent imbalances created under low system load.

This paper is interesting in its focus on achieving performance acceptable at the 99.9th percentile, with a willingness to trade stricter consistency for very high availability. This is clearly a good fit for Amazon, and works well for their various applications, where they don't want a customer's intent to buy something to fail to become a purchase because of technical difficulties. The fact that it's tunable, allowing applications to customize the store for different demands of consistency and durability, makes this an interesting use case.
