
Scaling Memcache at Facebook

Scaling Memcache at Facebook. Rajesh Nishtala, Hans Fugal, Steven Grimm, Marc Kwiatkowski, Herman Lee, Harry C. Li, Ryan McElroy, Mike Paleczny, Daniel Peek, Paul Saab, David Stafford, Tony Tung, and Venkateshwaran Venkataramani. NSDI 2013

Review due Tuesday, September 16.

Comments

Summary: In this paper, the authors discuss their approach to building a distributed key-value store using memcached, a single-node key-value store, as the building block. The authors characterize the workload at Facebook and discuss the tradeoffs of building such a distributed key-value store under this real-world workload.

Problem: To build such a distributed key-value store, there are multiple goals the authors try to achieve:

- The main goal is to reduce latency under the given workload.
- The second goal is to reduce the load on the servers and to balance that load.
- Another goal is to support the need to maintain different regional clusters.
- Another goal is to decrease maintenance overhead, e.g., cold-start time and upgrade overhead.

To achieve these goals, there are two key technical challenges:

- Load Reduction and Balancing
- Consistency Maintenance

Contributions: In this paper, the authors describe a large set of techniques; in my mind, most of them are classic techniques developed by the distributed systems community over decades. However, this does not diminish the contribution, which, in my mind, is a study of these techniques under a real workload.

The target workload that the authors are interested in is one where reads are an order of magnitude more frequent than writes, and where it is acceptable to read a slightly stale version. Therefore, classic techniques, including batching and leases, are validated under this workload.

One key design philosophy is that the caching layer and the persistence layer should be separated. In this case, the caching layer is memcache, and the persistence layer is MySQL. By making this separation, it is easier to engineer the components separately. This design decision is interesting to me.
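A minimal sketch of the demand-filled, look-aside pattern that this separation enables, assuming hypothetical "cache" and "db" client objects (stand-ins for memcache and MySQL; only get/set/delete and query/execute calls are assumed):

    def read(key):
        value = cache.get(key)                      # 1. try the cache first
        if value is None:                           # 2. on a miss, go to the database
            value = db.query("SELECT ... WHERE k = %s", key)
            cache.set(key, value)                   # 3. demand-fill the cache
        return value

    def write(key, value):
        db.execute("UPDATE ... WHERE k = %s", key)  # 1. update the persistent store first
        cache.delete(key)                           # 2. invalidate; deletes are idempotent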

Applicability: This work should be quite applicable under similar workloads, which should be expected in many Web-based applications.

problem:
- the problem they are trying to address is scaling their key-value store.
- as a result of the increase in the number of users, they need a system that can handle the massive number of requests they receive.
- they want to create a distributed system that is as efficient and consistent as possible.
- they are using the open-source "memcached" and they want to scale it out.

solution:
- they explain their solution in three stages:
- one cluster with multiple memcached servers and a separate storage unit.
- multiple clusters (that can have overlapping memcached servers).
- regional clusters (with one region being the master storage and the others being slaves; writes happen on the master and the slaves are then updated)

they made several changes to increase performance:
- using UDP for get requests (to reduce latency; see the sketch after this list)
- creating several pools for different types of requests
- creating Gutter pools to handle failures
- using mcrouter to decrease connection overhead and traffic (instead of having every web server talk to every memcached server directly)
- using a sliding window to increase throughput while preventing servers from being overloaded
- cold cluster warm-up (use old "warm" clusters while the new cluster fills its cache)
- using lease tokens and remote markers to handle consistency problems
- they also made some changes to the default memcached to increase its performance (making it multi-threaded and adding fine-grained locks)
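To make the first change concrete, here is a hedged, illustrative sketch of how a client can treat a lost UDP reply as a cache miss rather than an error; it deliberately omits memcached's real UDP frame header and response parsing, and "server" is just a (host, port) tuple:

    import socket

    def udp_get(key, server, timeout=0.05):
        # best-effort read over UDP; loss or a timeout simply degrades to a miss
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.settimeout(timeout)
        try:
            sock.sendto(("get %s\r\n" % key).encode(), server)
            reply, _ = sock.recvfrom(64 * 1024)
            return reply                  # raw reply; parsing is omitted in this sketch
        except socket.timeout:
            return None                   # a dropped packet is counted as a cache miss
        finally:
            sock.close()

Sets and deletes stay on TCP so the client gets a positive confirmation.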

in some cases, they chose to lose consistency to gain lower latency (using stale information instead of fetching fresh information), while in other cases they chose to increase latency to increase consistency (when an update happens in the slave clusters).

Summary:
- The paper presents and evaluates an in-memory caching solution built on memcached.
- Data replication, connection coalescing (mcrouter), window sizing, leases, and Gutter pools were investigated and implemented to reduce latency and distribute load.
- Invalidation daemons (mcsqueal), regional pools, Cold Cluster Warmup, new data center locations, and shared data channels were used for replication and consistency.
- Automatic hash table expansion, global locks, a UDP port for every thread, and slab classes were used to optimize performance.
Problem:
- Maintaining performance, efficiency, fault-tolerance, and consistency in the very large scale distributed system used by Facebook.
Contributions:
- The paper scaled a memcached-based architecture, which separated the cache from persistent storage. The implication of this separation is that the cache and the persistent storage can be scaled independently.
- Connection coalescing (via mcrouter) avoids keeping an open TCP connection between every web thread and every memcached server, which would otherwise impose a high memory demand. Open TCP connections are still used for delete operations, issued in parallel, because TCP provides confirmation. Mcrouter therefore supports a high degree of parallelism to achieve high throughput.
- Leases allow clients to write data back into the cache upon cache misses. Leases prevent stale sets by invalidating the lease token if memcached has received a delete request during the lease. Leases also prevent thundering herds by limiting how often a token is handed out, which gives memcache enough time to obtain validated data (see the sketch after this list).
- The paper considers many challenges of very-large-scale in-memory caching. Also, the paper covers many more ideas that worked (e.g. Cold Cluster Warmup to reduce the miss rates of new clusters) and ideas that never even left the prototype/design stage (ideas that are prohibitively slow or expensive). This paper is essentially the Grapevine paper of caching for today’s generation of distributed systems.
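A hedged sketch of the lease idea described above, with illustrative data structures (not memcached's internals): on a miss the server hands out a 64-bit token, a later delete invalidates outstanding tokens so a slow client cannot install stale data, and token issuance is rate-limited (roughly once every 10 seconds per key in the paper) to damp thundering herds:

    import os, time

    class LeaseServer:
        def __init__(self, issue_interval=10.0):
            self.data = {}                 # key -> value
            self.leases = {}               # key -> (token, time issued)
            self.issue_interval = issue_interval

        def get(self, key):
            if key in self.data:
                return ("hit", self.data[key])
            token, issued = self.leases.get(key, (None, 0.0))
            if token is None or time.time() - issued >= self.issue_interval:
                token = os.urandom(8).hex()              # fresh 64-bit lease token
                self.leases[key] = (token, time.time())
                return ("miss+lease", token)             # this caller may fill the cache
            return ("miss", None)                        # others wait briefly and retry

        def set_with_lease(self, key, value, token):
            current = self.leases.get(key, (None, 0.0))[0]
            if token != current:
                return False                             # lease invalidated: drop the stale set
            self.data[key] = value
            del self.leases[key]
            return True

        def delete(self, key):
            self.data.pop(key, None)
            self.leases.pop(key, None)                   # invalidate any outstanding lease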
Applicability:
- Paper is very applicable as it is currently being used successfully by Facebook.
- Many of the ideas presented in the paper are slight variations on the contributions of the Grapevine paper. For example, connection coalescing via mcrouter in this paper is a variation of adding more distribution layers in the Grapevine paper. Therefore, this approach should work well in other distributed systems.
- Although the paper does provide reasons for each decision and implementation, the paper did not evaluate half of them. For example, the paper did not evaluate the performance improvement of automatic hash table expansion nor the adaptive slab allocator. Therefore, whether the same mechanisms should be implemented in other systems is questionable.

Summary:
This paper discusses Facebook’s use of memcache as the foundation for a caching infrastructure via a distributed hash map. Specifically, the authors discuss the issues faced at different levels of scale and the algorithms and techniques they used to overcome each of these scaling bottlenecks.

Problems:
-The need for a caching infrastructure that scales to the traffic and request patterns of millions of users on a daily basis. (scale)
-The need to elegantly handle workload and traffic patterns that exhibit very high read rates. (latency)
-The handling of server failure and server reboot that is transparent to the client. (fault tolerance)
-Reducing load on the overall infrastructure; specifically avoiding stale sets and the thundering herd problem. (load)
-Efficiently broadcasting changes (updates/deletion) of a key-value pair throughout clusters and regions. (consistency)

Contributions:
-The discussion of the bottlenecks and design issues that emerge at different levels of scale. Facebook has essentially provided a mini road map of their evolution of the memcached infrastructure.
-Overall enhancements that can be made to the open-source memcached software so that it is better suited to large-scale systems.
-Characterization of the workloads inherent at the different levels of scale they discuss throughout the paper.
-Reducing latency and load via clever use of UDP, TCP, and batched requests.
-Pushing complexity into the client to avoid the need for intra-cache (server to server) communication.
-The mcrouter software which was publicly released.

Application to Real Systems:
The discussed implementation of memcache is currently used at Facebook. Therefore, in that regard, it can be seen as highly applicable to a real-world system, because Facebook operates at a massive scale on a day-to-day basis. Although the implementation is highly tailored to the load and traffic Facebook experiences, in general the discussion of resolving bottlenecks at different levels of scale is useful knowledge.

Summary
This paper reports on building a caching system using an in-memory key-value store, memcached, as its base.

Problem
The authors start with a single-server memcached instance and build on top of it to realize a highly performant, scalable, distributed key-value store. This store serves as the cache for the backend databases. It is heavily optimized for workloads that have high fan-out and predominantly reads rather than writes. The system aims to support ~1B reads per second. The authors describe various techniques and show the effectiveness of these techniques.

Contributions
This paper presents a number of techniques. Some of the most interesting ones are:

1. Use of 'delete' operation which is idempotent instead of 'set' to invalidate cache entries.
2. Use of connectionless UDP for reads, which has lower overhead in terms of memory and latency.
3. TCP connection coalescing using the mcrouter component.
4. Use of flow-control algorithms in the memcache clients to avoid incast congestion in the case of all-to-all communication between the web servers and the memcache servers.
5. Use of leases to avoid stale sets, and solving the thundering herds problem by having memcached arbitrate access to the database on a miss from a web server.
6. The idea of pools to separate high- and low-churn keys was neat.
7. The mcsqueal pipeline, which drives invalidations from the storage layer, seems novel. Using mcrouters to batch the invalidation calls is also nice (a sketch follows this list).
8. Taking advantage of MySQL replication features for the backend replication seems good.
9. The use of 'remote markers' to tackle non-master writes is neat.
10. Standalone server improvements like micro-locks, using UDP for gets, and a modified slab allocator all help increase the efficacy of a single memcached node.
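A hedged sketch of an mcsqueal-style pipeline as I understand it from the paper: a daemon tails the database commit log, extracts the keys touched by committed transactions, and batches the resulting deletes before forwarding them (through mcrouter in the real system). The tail_commit_log and send_batch helpers are hypothetical stand-ins:

    import time

    def invalidation_daemon(tail_commit_log, send_batch, batch_size=64, flush_secs=0.05):
        batch, last_flush = [], time.time()
        for touched_keys in tail_commit_log():      # yields the keys modified by each commit
            batch.extend("delete " + key for key in touched_keys)
            if len(batch) >= batch_size or time.time() - last_flush >= flush_secs:
                send_batch(batch)                    # one packed message instead of many small ones
                batch, last_flush = [], time.time()
        if batch:
            send_batch(batch)                        # flush whatever is left at the end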

Discussion
1. Though the paper presents a number of useful techniques, I somehow felt that there was no one directed design goal - but this is understandable. Given the diverse range of applications like messaging, gaming, etc. that this caching system has to support, it is really hard to accommodate all these application needs with a single directed goal.
2. It would have been good if there were some discussion on failing master regions (I suppose they automatically elect the new master and continue).

Relevance
The ideas and techniques discussed in this paper are used to run the biggest social network. This single fact suggests that their design choices have been very favorable. The techniques are applicable not only to Facebook but also to any company that wants to build a large-scale distributed system.

Summary
The present paper describes improvements made to "memcached", an open-source in-memory hash table, to develop a scalable distributed key-value store which provides low-latency access to a shared storage pool. The authors present design choices at various levels of the distributed system (server level, cluster level, region level, system level).

Problem
Social networking sites attract the attention of billions of users. They need to provide real-time communication and blazingly fast access and updates. In addition, the underlying platform should scale gracefully as the number of users keeps increasing. To achieve this speed and scalability, the authors make use of the fact that reads are much more frequent than writes and that it is sometimes acceptable to have stale data. The paper presents their design choices in building a highly scalable distributed caching system, which aims to provide such responsiveness and scalability on top of a simple in-memory hash table (memcached) while dealing with issues such as inconsistency and communication latency at that scale.

Contributions
They decoupled caching and storage. This means that they can scale cache and storage separately according to their needs.
They reduce the number of round trips by batching requests and sending them in parallel, based on a DAG of data dependencies (see the sketch after this list).
They build mcrouter to coalesce the TCP connections of a web server, as open TCP connections are highly resource-consuming.
Incast congestion is solved by a TCP-sliding-window-like mechanism on the outstanding requests of a web server.
They issue lease tokens to eliminate reading stale data caused by race conditions. In addition, by having a hold-off time and issuing lease tokens at most once every 10 seconds per key, they eliminate thundering herds.
They partition the available memcached servers into pools and assign working sets that would interfere with each other's hit rates to different pools.
They try to avoid cascading failures by redirecting requests from a non-responsive server to the Gutter pool.
They provide a simple way of invalidating caches by running a daemon (mcsqueal) on the database servers.
They try to decrease over-replication by sharing some servers (a regional pool) across all the clusters in a region.
They try to bring a new cluster up to speed as soon as possible with cold cluster warm-up, by having the cold (new) cluster fetch data from a warm (running) cluster instead of the storage layer.
They provide a simple master-slave model for consistency across regions.
To reduce the possibility of a non-master region reading stale data, they implement the remote-marker mechanism.
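A hedged sketch of the batching idea mentioned above: keys whose dependencies are already satisfied are fetched together in one round trip. Here "deps" maps a key to the keys it depends on and "multi_get" is a hypothetical batched memcache call; this illustrates the idea, not Facebook's actual code:

    def fetch_all(keys, deps, multi_get):
        fetched = {}
        remaining = set(keys)
        while remaining:
            # everything whose dependencies are satisfied goes into one batch
            ready = [k for k in remaining if all(d in fetched for d in deps.get(k, []))]
            if not ready:
                raise ValueError("cyclic or unsatisfiable dependencies")
            fetched.update(multi_get(ready))         # one parallel round trip per level of the DAG
            remaining -= set(ready)
        return fetched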

Limitation
1. There is a master-slave model between different regions. This might lead to problems during a network partition.
2. There is very high dependence on manual tweaking and manual configuration in the system.
3. Corner cases are not addressed; they are dismissed as rare and of no significant operational importance.

Applicability
This is a fairly new paper. The memcache system is still in operation at Facebook.


Summary:
The paper discusses how Facebook used “memcached”, the in-memory caching solution, to develop a distributed key-value store designed to support an extraordinary load of over a billion requests per second. The authors take the openly available memcached and modify it to fit the needs of Facebook; they specifically try to leverage the fact that their system is dominated by reads rather than writes, and the fact that they can tolerate some inconsistencies.

Problems:
Although a key-value store can be envisioned to be simple, a huge scale such as that of Facebook brings complexities, such as an enormous number of requests per second and latency issues, especially with both users and servers located across the globe. Vanilla solutions do not take advantage of cases where reads are predominantly more frequent than writes. In some cases inconsistencies can be tolerated, and this can lead to added performance benefits which are otherwise not harnessed completely.

Contributions:
• Distribution of items across the memcached servers through consistent hashing provides reduced latency.
• I think it is a very novel idea to use TCP for the set and delete requests and UDP for the get requests, to reduce latency and overhead. Although UDP is a connectionless service, it is fine to lose a minor fraction of the packets, since a lost get reply can simply be treated as a cache miss.
• They use the mechanism of “leases”, wherein a token is used to validate that a set from the client corresponds to the most recent miss for the requested key. This avoids the problem of stale sets; a modification that hands out a token only once every 10 seconds per key also resolves the thundering herd problem.
• They use a small fraction of the memcached servers in a cluster (less than 1%) as Gutter servers, which are primarily used as a fallback in case some servers fail, thus providing a good way of handling failure (see the sketch after this list).
• They take advantage of geographically spreading out the servers to benefit from lower latency and other economic benefits as applicable. The main issue in this case, keeping local replicas consistent with the master source, is alleviated by setting a remote marker in the region for a set or delete operation; subsequent accesses to that key in the region then know that the local value might be stale.
• The paper also proposes several optimizations, such as making the server multi-threaded and using a separate UDP port for each thread to reduce contention.
• Other improvements include the implementation of an adaptive slab allocator that helps in matching the workload.
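A hedged sketch of the Gutter fallback mentioned above: if the primary memcached server for a key does not respond, the client retries against a small dedicated Gutter pool, which holds entries with a short expiration instead of relying on invalidations. The client objects and the last_error check are hypothetical stand-ins:

    GUTTER_TTL = 10   # seconds; a short expiry bounds how stale Gutter data can get

    def get_with_gutter(key, primary, gutter, db):
        value = primary.get(key)
        if value is not None:
            return value
        if primary.last_error():                     # server unreachable, not just a miss
            value = gutter.get(key)
            if value is None:
                value = db.query(key)
                gutter.set(key, value, expire=GUTTER_TTL)
            return value
        value = db.query(key)                        # ordinary miss on a healthy server
        primary.set(key, value)                      # demand-fill as usual
        return value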

Applicability:
This is a great paper that exposes the most recent implementation details of the largest social networking site. The ideas described here can find huge applicability in many of today’s other services. Although features such as UDP cannot be used where there can be no compromise on data consistency, they have great utility for services such as Netflix and YouTube, where small inconsistencies can be tolerated. Moreover, this is one of the largest systems in practice, which proves the scalability and reliability of the design.

SUMMARY:
The paper presents a framework for serving Facebook requests using the open-source memcache combined with several custom changes to allow scaling at the cluster and geographic levels, allowing billions of world-wide requests per second. The design includes simple, isolated components, where optimizations can be made, and purposely prefers availability over consistency.

PROBLEM:
How do you best serve a system where consumption is rampant compared to the creation of original content, while still allowing updates and preserving an interactive feel?

CONTRIBUTIONS:
1) Components (Front-end/Memcache/Database) are all isolated, allowing them to be individually swapped for a different/better approach.
2) Scale has been discussed on three levels: individual machine, local cluster, and world-wide geographically distinct replicated clusters.
3) UDP vs TCP for places where "best-effort" is acceptable vs where confirmation of the message delivery is necessary for consistency.
4) Simple solutions to complex problems that avoid race conditions and provide high availability by allowing slightly stale data, where the staleness is mitigated using techniques such as leases and remote markers
5) Reducing load on the back-end by division into pools of content for different amounts of "churn", as well as a Gutter for requests that cannot be filled.

DISCUSSION:
My favorite part of this design is that the memcache software knows nothing about the database schema, business logic, or anything at all about the structure of Facebook. It could be replaced by something better (and they did improve memcached). Likewise, the DB back-end could be anything, as exemplified by the idea of Cold Cluster Warmup, where misses are temporarily redirected to an already "warm" cluster instead of the back-end.

Mcrouter (with TCP-style sliding windows) and mcsqueal were necessary to avoid incast congestion, which implies a hierarchy of these servers may eventually be necessary to continue scaling. The bundling and delegation of work to other processes is important to offload work from the front-end itself.

Summary
In this paper the authors describe how they use memcache to build a distributed cache system that can support billions of requests per second. They build their distributed cache by forming servers into clusters, gathering clusters into regions, and placing regions all over the world. They also put a lot of effort into reducing latency, reducing load, and maintaining consistency.

Problem
Facebook's servers need to handle billions of requests per second. Moreover, each request would use data from different sources. This would never be possible if we accessed the database for each request, but caching can greatly improve performance. They started with memcached, a key-value hash table running on a single machine, but a single machine is not sufficient to support Facebook. Building a distributed cache system faces many challenges: how to organize the various servers? How to minimize latency? How to make the system scale? How to handle write requests? How to maintain data consistency?

Contributions
The authors made a lot of contributions building the memcached system.
* Organization of the cache system. At different scales the challenges and goals are also different. At cluster level the main focus is on reducing latency and load. At region level the main focus is failure tolerance. Across regions the challenge is data consistency.
* Reduce latency by building a dependency graph to maximize concurrency.
* Reduce latency by employing UDP on read requests.
* Reduce load by introducing leases and allowing a small portion of stale values.
* Minimize communication costs across clusters by replicating clusters.
* Using regional pools to improve memory efficiency.
* Improve hit rates of new clusters by warming them up first.
* Numerous improvements to single server memcache.

Applicability
This work is clearly applicable because Facebook is using it. Apparently it greatly reduced the load of the database considering the huge number of read requests Facebook receives every second.

Summary
From the Facebook engineers’ perspective, this paper introduces how they constructed a distributed key-value store for the largest social network in the world, based on memcached.

Problem Description
The Facebook engineers observed two important facts about the current social network. The first is that users read much more content than they create. That means fetching data is much more frequent than storing it, and the system can benefit greatly from caching. The second is that the data for a requested piece of content may be stored in many different sources. Therefore, they need an efficient and flexible caching strategy.

Contribution
1. To reduce the latency of fetching cached data, they construct a DAG of data dependencies, and the web server uses this information to request the maximum number of items at the same time. Also, for client-server communication they rely on UDP for get requests to reduce overhead, but for delete and set requests they still use TCP to ensure reliability.
2. Because open TCP connections consume a lot of memory and other resources, they use mcrouter to coalesce connections and improve the efficiency of the servers.
3. Similar to TCP congestion control, the memcache clients implement a flow-control mechanism to avoid incast congestion. The design also reduces latency.
4. To reduce load, they introduce a mechanism called leases. This helps memcached determine whether a set should be stored or dropped. Also, by limiting the rate at which lease tokens are issued, they mitigate the thundering herd problem.
5. By partitioning a cluster’s memcached servers into pools, they can treat different classes of items (for example, low-churn and high-churn keys) with different strategies.
6. To prevent cascading failures, they propose the Gutter mechanism, which insulates the backend services from failures by absorbing load that would otherwise fall on the backing store.
7. When a cluster is added or brought back online, the Cold Cluster Warmup system allows clients with an empty cache to retrieve data from a “warm cluster”, which brings the cold cluster back to full capacity quickly (see the sketch after this list).
8. To reduce the probability that a user fetches stale data, they employ a remote-marker mechanism.
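A hedged sketch of the warmup path described in item 7, with hypothetical client objects: a client of the cold (empty) cluster tries its own cluster, then the designated warm cluster, and only then the database, filling the cold cluster on the way out. (The paper also adds a short hold-off on deletes to avoid races, omitted here.)

    def warmup_get(key, cold, warm, db):
        value = cold.get(key)
        if value is not None:
            return value
        value = warm.get(key)           # borrow the warm cluster's cache instead of hitting the DB
        if value is None:
            value = db.query(key)       # last resort: the persistent store
        cold.set(key, value)            # fill the cold cluster so its hit rate climbs quickly
        return value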

Applicability
These design tricks are already implemented in the current Facebook system, and they are also applicable to other large distributed systems, like Twitter, YouTube and so on. All of these systems may face the same problems: fetching data dominating most of the workload, and requested content stored in different sources. This paper provides very practical solutions for this kind of distributed system.

Summary:
Given Facebook's scale of operations and the user requests handled, this paper describes the caching solution built by scaling the open-source single-node memcached to improve the response time of user requests as well as to reduce the load on the backend database servers. Since the Facebook workload consists of far more reads than writes, and since data has to be aggregated from different sources, this scaled caching solution is indispensable and is used both as a database query cache and as a general-purpose cache. The design of the system treats reading stale data as a tunable parameter and prefers stale data to overloading the storage servers. Throughout the paper, the authors discuss three scales of deployment for the scaled-out memcache: a single cluster, multiple frontend clusters forming a region, and globally spread regions.

Contributions:
The system designers did a commendable job of identifying the workload characteristics and tailoring the architecture to efficiently handle the workload. Many of the solutions to the challenges of operating at this scale, e.g. the Gutter pool used to handle server failures, depend on workload characteristics, one of which is the acceptability of stale data.

Many optimizations like using UDP for get requests and flow control mechanisms to reduce incast congestion can be adopted by other similar systems.

The system designers have also developed unique mechanisms like leases to tackle problems like stale sets and thundering herds.

Problems:
The authors discuss how embedding invalidation requests in DB updates helps prevent inconsistencies across regions, but they don't discuss how problems like thundering herds are solved in a multi-region system.

Relevance:
With large-scale internet applications becoming the norm, the lessons learned from this memcache scaling work are highly relevant to designers of similar systems. From this work we can see that having stateless components helps in changing and improving the system as applications grow in scale as well as complexity. Further, we can also see the benefits of scaling the system by breaking it into globally spread regions rather than one large cluster, since the former helps reduce the latency of user requests.

Summary:
Authors discuss the use and modification of open-source memcached within Facebook's distributed key-value store.

Problems:
Facebook, as a specific service, has specific goals for their key-value store.

  • Serve billions of requests per second.
  • Handle high fan-out (one long-latency value can delay an entire page load).
  • Low latency to support near real-time communication.
  • Workload is dominated by reads.

Solutions:

    -Latency-
  • Use consistent hashing across memcached servers in single cluster.
  • Batch requests (based on DAG of data dependencies) to reduce round-trips.
  • UDP for 'get', TCP for state-changing operations 'set' and 'delete'. Common case, UDP succeeds. If packet is lost, treat as a cache miss (rare, so OK).
    -Load-
  • Each memcached server provides a lease to a client to later set a key when a cache miss occurs. This lease can be invalidated if an intervening delete occurs (preventing a stale set), and lease frequency is limited to avoid a thundering herd of updates resulting from identical misses.
  • Replicate when the request rate for a (set of) key(s) is much higher than one machine can manage. Clients choose the target replica based on their own IP.
    -Fault Tolerance-
  • Gutter, a small set of caches (1% of the cluster), receives diverted requests if the initial server is down. It never receives data invalidations, just short expirations. This trades stale data for less backend load during the failure period.
    -Scaling Further-
  • Multiple frontend clusters per region (single storage cluster). Provides replication (has pros and cons), but also smaller failure domains per region.
  • Cold warmup: New/recovering cluster with poor hit rates can be configured to request data from other frontends on miss until hit rate improves.
  • Remote markers are created when a write originates in a non-master region. If the marker still exists on a subsequent request, the request is directed to the master region.
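A hedged sketch of the remote-marker flow in the last bullet, with all objects as hypothetical stand-ins: a write from a non-master region sets a marker in the regional cache, writes to the master database, and deletes the local copy; a later read that finds the marker pays the cross-region latency instead of risking a stale local read:

    def remote_write(key, value, regional_cache, master_db):
        marker = "marker:" + key
        regional_cache.set(marker, 1)            # 1. flag that local data may be stale
        master_db.write(key, value)              # 2. perform the write in the master region
        regional_cache.delete(key)               # 3. invalidate the local cached copy

    def regional_read(key, regional_cache, master_region, local_replica_db):
        if regional_cache.get("marker:" + key) is not None:
            return master_region.read(key)       # marker present: go to the master region
        value = regional_cache.get(key)
        if value is None:
            value = local_replica_db.read(key)   # local replica may lag the master slightly
            regional_cache.set(key, value)
        return value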

Contributions:
Many techniques are well-applied, but not new. For example, consistent hashing, replication to handle particularly high load on keys, etc. From my perspective, the most interesting contributions were (1) Gutter (sort of hot-swapping technique) which not only provides additional machines in case of failure but also uses the machines differently to shield the storage cluster from load. (2) Cold Cluster Warmup, again a solution where behavior is changed temporarily to avoid an easily predictable high-latency and high storage load situation. (3) Remote markers. These may be useful in any other slave-master application where updates may originate from a slave replica. The higher latency of retrieving the freshest data from the master replica may be worth it if it avoids users feeling like their update did not succeed in the system due to seeing stale data.

Applicability:
Clearly, this is applicable today as Facebook uses it today and actually did achieve its goals of handling billions of requests per second. Practically, many other organizations will never need this much scale. However, the fault tolerance techniques for keeping storage requirements low and staying in the cache as much as possible (warmup, gutter, etc.) are useful at any scale.

Summary:
This paper presents the various changes Facebook made to open-source memcached to make it scalable and robust enough to serve a billion requests per second. Memcache is used as a caching layer for various kinds of backend storage (HDFS, MySQL databases, services, etc.) and provides a consistent way to cache items. The main goals (the problems they are trying to solve) of the memcache changes were (keeping in mind Facebook's read-intensive, large fan-out workload):

1. Reducing latency (improving response time).
2. Reducing load on backend servers when a cache miss occurs or when a memcached server goes down.
3. Resolving inconsistency quickly while still maintaining performance.
4. Making software upgrades easier by moving most of the logic into mcrouter (the memcache client), thus keeping memcached simple.

Contribution:
A lot of different techniques are used to achieve the above goals:
1. I really liked the idea of separating the cache layer from the storage servers. This way the same caching layer can be used for different types of services. However, as the authors mention, this causes interference between the different types, which reduces cache hits. They resolve this by dividing memcache into different pools according to access pattern.
2. There is no inter-server communication. The main logic of routing, error handling, batching, etc. is implemented in the memcache client (mcrouter). This way cache requests can be optimized by mcrouter before being sent to the server.
3. Memcache requests are parallelized using a DAG that represents the dependencies between data items. This way there is no need to wait for one piece of data before requesting others.
4. UDP is used for get requests and TCP for invalidations. Also, a sliding-window mechanism is used when making get requests to reduce incast congestion.
5. Leases are used to mitigate staleness and the thundering herd problem. A lease is given to a client for a key when it experiences a cache miss. This way the memcached server can control concurrent sets to a specific key.
6. Replication within pools to reduce latency. Replication is preferred over dividing the keys into more pools when the entire data set fits in one memcached server and the request rate is much higher than a single server can handle (see the sketch after this list).
7. Using Gutter servers to handle small numbers of failures is a very nice technique compared to rehashing keys, because it reduces the chance of cascading failures.
8. The mcsqueal daemon is used by the storage cluster to invalidate the different frontend clusters. This is better than having web servers invalidate them directly, because mcsqueal can batch invalidation requests and handle failures.
9. Cold cluster warmup is used to reduce load on backend servers when clusters are added or removed.
10. Remote markers are used to reduce the probability of reading stale data: the memcached server marks a key in the replicated region to indicate that the local replica might be stale, and web servers then direct requests for that key to the master region.
11. A lot of performance optimizations were also made to the single memcached server, such as (i) an adaptive slab allocator to better adapt to changing workloads and (ii) a hybrid caching scheme that evicts short-lived keys proactively.
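As a small illustration of the replication choice in item 6 (hypothetical names, not Facebook's code): every replica holds the full key set, and a client picks its replica deterministically, for example from its own address, so load spreads across replicas while each client keeps talking to the same one:

    import zlib

    def pick_replica(client_ip, replicas):
        # a stable hash of the client's IP chooses which full replica this client uses
        index = zlib.crc32(client_ip.encode()) % len(replicas)
        return replicas[index]

    replicas = ["10.0.0.1:11211", "10.0.0.2:11211", "10.0.0.3:11211"]
    server = pick_replica("192.168.1.42", replicas)   # the same client always gets the same replica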

Applicability:
These concepts are being used in the world's largest social networking site, which serves more than a billion users. This itself shows how applicable the work is. Even if the techniques are specifically tailored to Facebook's workload, the concepts are very much applicable to other websites. The paper also provides design guidelines for distributed systems, e.g. separating the caching layer and the backend layer for abstraction, scalability and fault tolerance.

Summary

In this paper the authors present the challenges posed by supporting the world’s largest social networking site, Facebook, and describe how they leveraged memcached as a building block to construct and scale a distributed key-value store. We are presented with very specific implementation details of the very specific domain in which Facebook's distributed architecture operates, so that it can effectively handle billions of requests per second and hold trillions of items to deliver a rich experience for over a billion users around the world.

Problem

Since hundreds of millions of users use the Facebook network every day, the computational, network and I/O demands imposed cannot be effectively satisfied by traditional web architectures. To meet these demands Facebook has very specific infrastructure and design requirements.

Infrastructure requirements:

- Near real-time communication.
- Aggregate content on-the-fly from multiple sources.
- Very popular shared content must be accessible and updatable.
- Scale to process millions of user requests per second.

Design requirements:

- Support an extremely heavy read load, with over a billion reads per second.
- Reduce the impact of the high read rates on backend services.
- A geographically distributed system that caters to the global population.
- Support a constantly evolving product: the system must be flexible enough to support a variety of use cases and also support rapid deployment of new features.

Contributions

The authors claim the following as main contributions:

- Description of the evolution of the Facebook memcache architecture.
- Enhancements to the existing memcached that improve performance and increase memory efficiency.
- Enhancements and mechanisms that improve the ability to operate the system at scale.
- Characterization of production workloads.

Some of the other noticeable contributions:

- The idea of leasing to address the problems of stale sets and thundering herds.
- Using memcache pools to separate low- and high-churn data.
- The efficient use of “Gutter” server pools to deal with temporary failures and further insulate the backend servers from failures.
- The clever use of Cold Cluster Warmup to insulate the backend from the poor hit rates of a new cache cluster, by allowing clients in the cold cluster to retrieve data from a warm cluster rather than directly from persistent storage, taking advantage of the data replication across the frontend clusters.
- The use of remote markers to maintain consistency.
- Most of the code was released as open source; in fact mcrouter was open-sourced to the public today (9/15/14).
- Pushing complexity into the client whenever possible.
- Actual results from this implementation on the Facebook network to corroborate its effectiveness.

Applicability

The challenges and solutions introduced in this paper are very relevant today, when the number of users on social networks (or the internet in general) keeps growing. Even though the paper very specifically talks about the Facebook domain, the same challenges are faced by any large geographically distributed system dealing with high network traffic, which can hence benefit from the contributions of this paper. Also, the fact that it was implemented with significant results reinforces its practical applicability.


Summary:
The paper talks about the evolution of Facebook's implementation of memcached into a distributed key-value store that processes billions of requests per second. The paper describes various methods and optimizations that make the system more reliable and fault-tolerant and that reduce latency and load on the backend servers.

Problem:
The paper targets various challenges that a distributed system faces at different deployment scales such as latency, load and failures in a cluster, data replication across different regions and consistency issues.

Contributions:
1. Memcache as demand-filled look-aside cache: On cache miss, Web server retrieves from backend and populates the key-value pair in the cache.
2. Facebook chooses to delete cached data instead of updating it in cache as deletes are idempotent, memcache is not the authoritative source of data and is allowed to evict data.
3. Reducing Latency:
- Parallel requests and batching: a DAG of data dependencies is used to maximize the number of key-value pairs that can be fetched concurrently.
- Optimize client-server communication:
-- Memcache servers don't communicate with each other.
-- All the complexity is pushed down to a stateless client made up of two components: a library that can be embedded into applications and a standalone proxy interface called mcrouter.
-- Uses UDP for get requests (UDP is connectionless and reduces overhead by bypassing mcrouter) and TCP for set and delete requests (for reliability reasons).
- A sliding-window mechanism is used to handle incast congestion.
4. Reducing Load
- A lease, a 64-bit token, is bound to a key to tackle stale sets. Thundering herds are regulated by limiting the rate of issuing lease tokens.
- Memcache servers are divided into pools based on key access patterns. Also, replication is done within pools to improve efficiency and latency.
5. Handling failures:
- Gutters, a small set of machines (1% of the cluster) reserved for use when a few servers fail in a cluster. This approach is better than rehashing requests among the remaining memcached servers, as that could lead to overloading.
- Web servers and memcached servers are split into multiple frontend clusters. This provides the advantage of smaller failure domains and a tractable network configuration.
6. System aims for Eventual consistency with an emphasis on performance and availability. Regions divided into master and replica regions. Remote markers have been used to minimize reading the stale data.
7. Performance optimizations such as automatic expansion of hash tables, specific UDP port for different threads and multithreading of server have been done.
8. Single server improvements through adaptive slab allocator and transient item cache have been added.

Applicability:
The system has scaled for workloads dominated by reads (far more frequent than writes) and is currently part of Facebook's infrastructure. The ideas proposed in the paper are very relevant to large distributed systems like Twitter and Google Plus, which have to serve millions of requests per second.

Memcached is a simple in-memory caching system which has been used as a building block to build a large distributed system, called memcache, at Facebook. This paper touches nearly all the aspects of a distributed system, such as scaling, recovery, availability and high-throughput design parameters. Memcache is used as a demand-filled look-aside cache, in which a client request first looks into memcache to see if it is a hit; otherwise it queries the backend and sets the entry in the cache again. It is also used as a generic cache, where developers can store complex computations like timeline aggregations or birthday events for a specific time. They introduce a new component in the system called mcrouter, which talks to a group of memcached instances. On the whole, this is a distributed system which supports scalability, availability and failover. Some of the issues handled in the paper are:

1. Growing working set: To handle the larger working set they propose sharding, where the working set is partitioned across more than one memcached instance and mcrouter directs each request to the right instance using consistent hashing (see the sketch after this list).
2. High read rates: To support high read rates, sharding does not help much, since every server still receives a packet; with replicas, mcrouter can forward all the requests for a key to one server while other requests are processed concurrently by another memcached instance.
3. Heterogeneous workloads: Since LRU can evict expensively computed keys from the cache, such keys can be stored in a separate pool from the simple ones. Thus, with the architecture of pools, replicas and shards, mcrouter directs request packets to the appropriate memcached instance.
4. Failures: If the network or a server fails, mcrouter redirects the requests to a replica in the case of replicated pools, or to a neighboring cluster in the case of sharded pools. For transient failures, when the memcached instance boots up again the state changes are replayed to this server; this is actually done by a daemon called mcsqueal on the storage server.
5. Inter-cluster consistency: To keep the cached data consistent across all the instances, all state changes are broadcast to the replicas by mcrouter.
6. Empty cache servers: To avoid new servers starting empty, requests are initially directed to the cold pool; on a miss, the value is fetched from the warm pool and added to the cold pool. This is called cold cluster warmup.
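A hedged sketch of the consistent hashing mentioned in item 1 (illustrative only): each server is hashed to many points on a ring and a key maps to the first server point clockwise from its own hash, so adding or removing a server only remaps the keys adjacent to its points:

    import bisect, hashlib

    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    class ConsistentHashRing:
        def __init__(self, servers, points_per_server=100):
            self.ring = sorted(
                (_hash("%s#%d" % (srv, i)), srv)
                for srv in servers
                for i in range(points_per_server)
            )
            self.points = [h for h, _ in self.ring]

        def server_for(self, key):
            idx = bisect.bisect(self.points, _hash(key)) % len(self.ring)
            return self.ring[idx][1]

    ring = ConsistentHashRing(["mc1:11211", "mc2:11211", "mc3:11211"])
    print(ring.server_for("user:42:profile"))    # deterministic shard choice for this key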

They also talk about how to prevent stale sets, which occur due to reordering of update requests, using leases. With timed leases this helps with thundering herds as well. To maintain consistency, updates go to the master region (via mcrouter), from which the changes are propagated to the other replicas. If a non-master region updates state in the master region, it can subsequently read stale data from its own cache; this is alleviated by using remote markers.

Thus, overall, memcache at Facebook covers almost all the aspects of a distributed system with simplicity.

Summary: This paper describes how Facebook leverages memcached, a well-known and simple memory caching solution, to build the world's largest distributed social network infrastructure, dealing with billions of requests per second. They introduce a series of systems evolving with the increase in demand: a single cluster, multiple clusters in one region, and finally geographically distributed clusters.

Problem: In Facebook, the demand of billions of users has the following characteristics:
(1) millions of read requests per second with real-time feedback; (2) all requests need the aggregation of information from many heterogeneous sources;
(3) the amount of content updated is an order of magnitude smaller than the amount read.
These characteristics imply special requirements for the infrastructure behind the social network: (1) it should have low latency and very high throughput for read requests; (2) it should be able to answer requests from heterogeneous sources, e.g. MySQL, HDFS and backend servers; (3) it should be able to share the most popular and recently updated content, but the highest degree of consistency is not necessary; (4) as the number of users increases, the clusters of servers become geographically distributed and the replicas should be kept consistent.

Contributions: They use memcached as a building block for the world's largest distributed system for social networks.
(1) They show the evolution of Facebook's architecture with an increasing number of servers. The most valuable thing is the two principles behind this evolution: first, any change must address a user-facing or operational issue, and second, the trade-off between performance and the probability of reading transiently stale data (eventual consistency).
(2) They show how to improve memcached's performance and increase its memory efficiency.
(3) They introduce mechanisms to operate the system at scale. First, within each cluster the web servers and memcached servers are all-to-all connected, which suffers from incast congestion and single-server bottleneck problems. They solve this by designing a memcache client in each web server, using parallel requests and batching to decrease the number of requests, and using flow control to trade off waiting time against incast congestion. Second, they use a 64-bit token called a lease to deal with the stale-set and thundering-herd problems. Third, they use a small, dedicated set of machines to stand in for failed servers. Fourth, for multiple clusters within a region (sharing the same database), they replicate data among clusters for more independent failure domains (as well as tractable network flows and a reduction of incast congestion). Fifth, when using geographically distributed regions of clusters, a high degree of consistency among regions would sacrifice performance, so they choose eventual consistency.

Applications: This paper is like a simple guidebook for building a huge distributed system for a typical web service. It contains a lot of experience: e.g. UDP for getting values and TCP for sets and invalidations; the window-size control trade-off; the trade-off between degree of consistency and performance. This experience will really help in building big web-service systems in the future.

Summary:
This paper describes how Facebook leverages memcached as a building block to construct and scale a distributed key-value store that supports the world's largest social network.

Problem:
Facebook's workload is dominated by fetching data, so caching can have significant advantages. Read operations involve various data sources and backend services, thus a flexible caching strategy able to store data from disparate sources is needed. Memcached provides a single-machine in-memory hash table, and Facebook needs to use it to build a large-scale distributed key-value store.

Contributions:
(1) Scale memcached to clusters. Facebook distributes content across the memcached servers through consistent hashing. They use a DAG to represent the dependencies between data and maximize the number of items that can be fetched concurrently. Stateless clients handle the communication. UDP is used for get requests and TCP is used for set and delete operations. Clients also use a sliding window to control the number of outstanding requests and avoid incast congestion.
(2) Reduce load. They use leases to address stale sets and thundering herds. A client gets a lease to set data back into the cache when it experiences a cache miss, and the verification fails if there has been a delete request for the item in the meantime. Memcached servers are separated into pools for different kinds of workloads. Replication can also improve latency and efficiency.
(3) Fault tolerance. Gutter pools can take over for failed servers.
(4) Servers are split into regions. This allows for smaller failure domains and a tractable network configuration. Data replication is used.
(5) They maintain consistency across regions. The consistency guarantees are relaxed as a trade-off for performance.

Applicability:
The paper is mainly about an industrial application, so there are a lot of tradeoffs. For example, UDP is not reliable all the time and packets can be lost; sometimes users will get out-of-date data; the data does not keep strict consistency. However, the system successfully solves Facebook's problems and provides a scalable distributed key-value store service. It also gives us a picture of how industry projects work. Every company will have specific demands. There is no best system, only the most suitable system.

Summary:
This paper summarizes the experience of deploying memcached and building a distributed key-value store with it to support the world’s largest social network. The challenges, and the corresponding solutions, in scaling the cache system from one server to distributed server clusters at Facebook scale are discussed. Several application-specific optimizations/simplifications, related to Facebook's application model, are discussed in the paper. The eventual key-value store (cache) can hold trillions of items and handle billions of requests per second.

Problem:
In Facebook’s application model, the web server may send hundreds of requests for data items per web page. Such a large number of requests would generate too high a workload for the database servers, considering that every user has a different web page and the number of users is huge. So it is necessary to deploy a cache system to offload the pressure on the database servers. In addition, the requests are not uniformly distributed between reads and writes; most operations are reads. That is the opportunity for the cache system: the system optimization can be biased toward performance, availability and efficiency, because Facebook’s application is not like those in financial companies, and inconsistency over a short interval is acceptable.

Contributions:
As the authors point out, this paper describes the evolution of Facebook’s cache system, which is valuable because no other system of similar scale has been implemented and evaluated before. The paper describes the different challenges of deploying the system at different scales, and gives very detailed background information about the tradeoffs.

Discussion:
At first glance, memcache is trying to implement a "CDN" for Facebook. However, unlike a CDN, the individual items in Facebook’s database are rarely accessed extremely frequently. That means it does not make sense to over-replicate items within the same region, and the total (aggregate) size of the cache should be large enough. Another point is that a CDN has the goal of moving data close to the client proactively, while memcache does the job passively. An item is moved to the cache server close to the client after the first read access, and it is deleted if any update to the item happens. I think building a prediction model of the data access pattern and prefetching some data to the cache servers might help the overall system performance.
The way failures are handled in the memcache system is different from the normal consistent-hashing caching solution. Instead of rehashing the keys of the failed cache server onto other cache servers, a small set of dedicated machines takes over the responsibility of the failed cache servers.
In handling cross-region consistency, the memcache system provides best-effort eventual consistency but places an emphasis on performance and availability. That is fine for Facebook’s application, but it might not work for other applications.
This paper also introduces some detailed information about the management and measurement facilities for memcache, which is pretty interesting.
The authors claim that it is a good strategy to move complexity to the client side, but I think that should not be a general rule. Sometimes it makes sense to implement some complexity in the server and share it among all the clients. We need to consider the trade-offs case by case.

Summary:
Facebook's engineers discuss their caching architecture, which utilizes a custom version of memcached to support billions of requests per second. They first introduce their cache replacement and invalidation algorithm for a single memcache server before moving on to multiple caches in a cluster, multiple clusters in a region, and finally caching across multiple regions. They finish the paper with a description of their optimizations to the memcache software and key metrics for measuring their workload in production.
Problems:
Facebook has an order of magnitude more reads than writes, so this design focuses on solving performance, fault-tolerance, and consistency issues for a geographically distributed cache, leaving data persistence to another abstraction. Chiefly, this architecture needs to support near real-time communication and content that can be aggregated, accessed, and updated by multiple sources at their scale.
Contributions:
There were many contributions this paper made at each level; here are a few:
• A simple algorithm for a single demand-filled look-aside cache with idempotent deletes (see the sketch after this list).
• Using consistent hashing so that any web server can access any cache server, making the whole cluster appear as one giant monolithic in-memory key-value store.
• Introducing mcrouter (which they just open-sourced yesterday) to proxy and batch memcache requests to avoid both ingress and egress congestion.
• Using UDP for memcache request reads to eliminate unnecessary TCP state memory and network congestion, while using TCP to do memcache sets and deletes.
• Utilize a sliding window similar to TCP congestion control to optimize memcache proxy requests.
• Introducing leases as a means to address the consistency issue of stale sets and a thundering herd trying to update the cache. One gripe is the ten-second lease token used to set the cache, which seems like it would block many requests if the token holder failed and could cause a cascading failure.
• Introducing a secondary cache called Gutter which can pick up the load of failed machines and prevent further cascading failures that would otherwise bottleneck the databases.
• Tailing the database journal log with a program called mcsqueal to authoritatively invalidate caches through a tree of mcrouters back to the correct cluster.
• Cold Cluster Warmup, which could turn up a new cache cluster by copying an already live cluster’s memory drastically speeding up the time to peak performance. I question how this would work with consistent hashing, as a majority of the cached values would be rehashed to neighboring caches.
• A host of optimizations for memcache itself which included: concurrency with a single global lock, an adaptive class based memory allocator, and a short lived items list for invalidation beyond LRU.
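To make the first bullet concrete, here is a minimal Python sketch of the demand-filled look-aside pattern with delete-on-update; the cache and db objects are assumed stand-ins for a memcache client and a storage client, not code from the paper.

    class LookAsideCache:
        """Demand-filled look-aside pattern: the web server, not the cache,
        owns the fill and invalidation logic."""

        def __init__(self, cache, db):
            self.cache, self.db = cache, db   # assumed client objects

        def read(self, key):
            value = self.cache.get(key)
            if value is None:                 # miss: fill the cache on demand
                value = self.db.query(key)
                self.cache.set(key, value)
            return value

        def write(self, key, value):
            self.db.update(key, value)
            # Delete rather than set: deletes are idempotent, so a replayed or
            # reordered invalidation cannot leave the cache permanently wrong.
            self.cache.delete(key)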
Applicability:
Overall, this paper shows how to design and implement a caching architecture in a systematic way by introducing layers of abstraction at each workload size that offer optimizations and guarantees. It is highly applicable because it describes a currently used system that could be modified depending on scale and need.

Summary: Facebook's workload is heavily read-dominated, with data fetched from a variety of sources, so it requires a flexible caching solution. In this paper the authors describe how they leveraged the open-source "memcached", a single-node in-memory hash table implementation, to build "memcache", a large-scale in-memory distributed key-value store, to improve the read performance of the services behind the Facebook website and, in turn, reduce the load on its back-end persistent storage systems.

Problem:
A single Facebook page load fetches data from tens of back-end stores, and with millions of page requests per second this leads to increased latency in fetching data, heavy load on the back-end stores, and network congestion due to all-to-all communication between clients and servers. Facebook requires a solution that provides low latency in fetching data and tolerates exposing slightly stale data in order to insulate the back-end stores from heavy read load. And at such scale, they need to resolve the problems of consistency, failure handling, replication, and load balancing common to large-scale distributed systems.

Contributions:
1) Separation of caching layer from persistence layer: With this approach both the layers can scale independently and also can be adjusted for performance independently.
2) Pushing complexity into the client to keep cache servers simple: with this approach they have been able to add features like batching of requests to reduce network congestion, and the lack of complexity in the servers makes them easier to scale.
3) Coalescing of connections: a large number of connections puts load on memory and increases network congestion. By adding "mcrouter" to coalesce connections, they have reduced network, CPU, and memory usage.
4) Flow control for memcache responses: to prevent web servers from being overwhelmed by responses, the client uses flow control based on a sliding-window protocol to limit how many requests are outstanding at once (see the sketch after this list).
5) The idea of using leases to overcome the issue of concurrent updates, and also to solve the thundering-herd problem by regulating the rate at which tokens are issued.
6) Using different cache pools to prevent interference caused by different types of workloads sharing the same infrastructure.
7) Using a small set of idle servers ("Gutter") to absorb load from failed servers. In the common approach there are no redundant servers; instead, the live servers share the extra load from the failed server, which risks the extra load bringing down another server.
8) Use of remote markers to minimize reading of stale data.
9) Improvements in memcached: a) automatic expansion of the hash table, b) use of a global lock to enable multi-threading, c) each thread using its own UDP port for sending back replies.
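As promised above, a small Python sketch of the client-side sliding-window flow control; the initial and maximum window sizes are invented for illustration, and the growth/backoff policy is only an approximation of what the paper describes.

    class SlidingWindow:
        """Cap the number of outstanding memcache requests from one web server;
        grow the window on success, shrink it when a request times out."""

        def __init__(self, initial=2, maximum=300):
            self.window = initial
            self.maximum = maximum
            self.outstanding = 0

        def can_send(self) -> bool:
            return self.outstanding < self.window

        def on_send(self):
            self.outstanding += 1

        def on_response(self, timed_out: bool):
            self.outstanding -= 1
            if timed_out:
                self.window = max(1, self.window // 2)              # back off
            else:
                self.window = min(self.maximum, self.window + 1)    # grow slowly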

Flaws:
I do not find many flaws in the system described, other than the burden of security being pushed onto the client. With memcache being such a large distributed system shared between multiple services, the lack of a security layer on the server may let an incorrectly configured service read and update data belonging to other services.

Applicability:
Responsiveness is one of the critical factors for any website to be useful, be it a social network, a shopping portal, or a news portal. Therefore the ideas in this paper are very applicable to any existing or new website that has a read-dominated workload and can tolerate stale data. They can follow the specifications in this paper to improve the performance of their website.

Memcached is an in-memory cache solution. This paper describes how Facebook built a distributed key-value store by enhancing the open-source implementation of memcached. The authors explain how they reduce the database load by using the memcached infrastructure and achieve a highly scalable system. Besides that, they also highlight how they achieved reliability, fault tolerance, reduced query latency, and load balancing with this architecture.

Major contributions :

1. Memcached is mainly a key-value store with get, set, and delete operations. Choosing delete operations instead of updates, since deletes are idempotent, is a useful idea; it is resilient to communication issues.
2. Consistent hashing has been used to distribute the items across memcache servers. This requires an all-web-servers-to-all-memcache-servers communication pattern, which is likely to cause incast congestion; this is dealt with by using a sliding-window mechanism to control the number of requests sent at a time.
3. Latency is reduced by employing the following techniques :
- Using a DAG to infer data dependencies and batch requests whose data can be fetched concurrently (sketched after this list).
- Using stateless UDP for gets and reliable TCP for sets and deletes, along with connection coalescing to reduce the number of open connections, significantly improves response time.
4. Load reduction on the database is achieved by :
- Using lease tokens, and limiting the rate at which these tokens are released, allows the particular client holding the token to write while stalling the remaining client requests for a short time, so that they read from the cache instead of going to the database.
- Fetching stale data in cases where it is acceptable.
- Pooling the memcache servers according to the frequency of access and the overhead of cache misses (based on the workload).
- Replicating data within a pool when a set of keys are fetched very frequently.
5. There exist Gutter machines that take the responsibility of failed servers. Re-direction of requests to Gutter helps in reducing the backend load.
6. Splitting into regions (each consisting of multiple front-end clusters and a storage cluster), rather than simply scaling web servers and memcache servers, is a good attempt to attain smaller failure domains.
7. Handles invalidations within a region by using the mcsqueal pipeline, which reads commit logs from the database, batches invalidations, and broadcasts them to memcache servers via the mcrouters. This avoids using web servers for invalidations, which would be less reliable.
8. Achieves eventual consistency between the master and replica region databases. A remote marker mechanism is used to indicate staleness in a local replica database. This ensures that the query is redirected to the master region and updated data is read appropriately.
9. The methodologies used for memory management of a single memcache server such as the adaptive memory allocation and the hybrid mechanism of combining LRU and faster eviction of short lived keys for the transient item cache are interesting ideas.
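The DAG-based batching mentioned in point 3 can be sketched as follows; deps and the multiget() callback are hypothetical names, and the real system builds this graph from the page's data dependencies rather than taking it as input.

    from collections import defaultdict, deque

    def batched_fetch(deps, multiget):
        """deps maps each key to the set of keys it depends on; all keys whose
        dependencies are satisfied are fetched together in one batched round."""
        indeg = {k: len(parents) for k, parents in deps.items()}
        children = defaultdict(list)
        for k, parents in deps.items():
            for p in parents:
                children[p].append(k)
        ready = deque(k for k, d in indeg.items() if d == 0)
        results = {}
        while ready:
            level = list(ready)                  # currently ready keys are independent
            ready.clear()
            results.update(multiget(level))      # one batched request per round
            for k in level:
                for child in children[k]:
                    indeg[child] -= 1
                    if indeg[child] == 0:
                        ready.append(child)
        return results

    # Example: 'feed' and 'ads' both depend on 'user', so they are fetched
    # together in the second round:
    # batched_fetch({"user": set(), "feed": {"user"}, "ads": {"user"}}, my_multiget)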

Relevance :

The architecture is highly tuned for the workload of a social network like Facebook. As the workload does not always demand latest data to be available, usage of stale data is acceptable and the system makes good use of this. The solution is very much applicable to today’s distributed systems which need to process such a large scale of data like that of Facebook and Twitter and be able to serve billions of user requests concurrently.

Summary:
This paper shows how Facebook scaled its memcached-based architecture to meet the demand of billions of requests per second and trillions of items to deliver.

Problems:
1. Facebook has a very read-heavy workload with wide fan-out, so read performance is very important.
2. memcached needs to be scaled to multiple servers.
3. Data comes from various kinds of sources, and each source has a different access pattern.
4. Provide a consistent user experience around the world.
5. Under the all-to-all communication pattern, how to reduce latency and the number of connections to and from servers.

Contributions:
The design principle is to move the complexity of the system design into stateless clients, which keeps memcached itself simple. Memcache is also configured as a general-purpose distributed key-value store. Many ideas are introduced here, and I will only highlight some of them:
To reduce the latency:
1. Construct a directed acyclic graph (DAG) representing the dependencies between data. The web server then uses this to find the maximum parallelism among all the data that needs to be fetched.
2. Servers do not communicate with each other. Clients use consistent hashing to pick the right server to communicate with.
3. UDP is chosen for get requests in order to reduce latency and overhead. For set and delete operations, TCP is still used for reliability.
4. mcrouter helps coalesce connections to servers to improve server efficiency.
5. A sliding-window mechanism is used to ensure that a large number of responses from servers does not overwhelm the clients.
To reduce the load:
1. Leases are used to handle two problems: stale sets and thundering herds. The thundering herd is mitigated by having the server return a token for a given key at most once every 10 seconds. Stale sets are handled with load-link/store-conditional-like semantics: a new value is stored only if no invalidation has occurred in the meantime (see the sketch after this list).
2. In some cases, slightly out-of-date (stale) data is allowed to be returned to minimize the application's wait time.
3. The servers in one cluster are partitioned into separate pools, and different pools serve different access patterns. Some heavily requested keys are replicated within a pool to further improve latency and efficiency.
4. mcsqueal is adopted to detect the SQL statements that modify authoritative state and to batch the resulting deletes into fewer packets.
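A rough Python sketch of the lease idea from point 1 above, as I understand it; the 10-second interval matches the paper, but the token format and data structures are simplified assumptions.

    import time, uuid

    class LeasingCache:
        """On a miss, hand out a lease token (rate-limited per key); a delete
        invalidates outstanding tokens so a late set cannot install stale data."""

        LEASE_INTERVAL = 10.0          # seconds between tokens for the same key

        def __init__(self):
            self.data = {}
            self.leases = {}           # key -> (token, issued_at)

        def get(self, key):
            if key in self.data:
                return self.data[key], None
            token, issued = self.leases.get(key, (None, 0.0))
            if token is None or time.time() - issued >= self.LEASE_INTERVAL:
                token = uuid.uuid4().hex
                self.leases[key] = (token, time.time())
                return None, token     # this caller should fill the cache
            return None, None          # others back off instead of hitting the DB

        def set_with_lease(self, key, value, token):
            current = self.leases.get(key)
            if current and current[0] == token:
                self.data[key] = value
                del self.leases[key]
                return True
            return False               # lease was invalidated: stale set refused

        def delete(self, key):
            self.data.pop(key, None)
            self.leases.pop(key, None) # invalidate any outstanding lease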
Failures-tolerant:
1. A small set of machines, named Gutter are used to take over the responsibilities of a few failed servers. It would limit the load on backend services at the cost of using slightly stale data.
Cold cluster warmup
This technique allows clients in the cold cluster to retrieve data from a warm one instead of from the back-end servers, taking advantage of the data replication across front-end clusters.
Single machine improvements
A lot of work has been done to improve single-machine performance, such as fine-grained locking and an adaptive slab allocator.

Applicability:
Obviously, the various mechanisms introduced in this paper are practical not only for Facebook but also for other websites, especially social websites. Some thoughts in this paper are also good lessons for later distributed-system design, for example that the design should be driven by real data: most of the data used to tune the system comes from real usage.

Summary:
The paper discusses the factors considered and the decisions made in the design and implementation of a distributed key-value store, memcache, by extending a single node key-value store.

Problem:
The problems the authors were trying to solve were the following:
1. To handle billions of requests per second from the users.
2. To aggregate data from multiple sources to display a single page.
3. To provide near real-time communication between users.
4. To handle very popular shared content among users.

Contributions:
The paper has made many valuable contributions. Some of them are as follows:
1. The decision to optimize the distributed system based on the fact that users consume more data than they produce helped them to design a system tailored to their needs.
2. The usage of UDP over TCP helped to reduce the network latency and overhead as the connection between the memcache clients and the server needn't be reliable.
3. The lease mechanism that solves the problems of stale sets (due to reordering of concurrent updates to memcache) and the thundering herd (repeated invalidation of a cache entry due to heavy read-write activity) was an important contribution.
4. Placing low-churn and high-churn keys in different pools helped avoid interference between them.
5. Replicating keys on multiple servers lets a client request go to any replica, which halves the load per server compared with partitioning the keys across servers.
6. The use of Gutter servers to take over the responsibility of failed servers, instead of rehashing the keys among the remaining memcached servers, helped reduce the load on the back-end servers and avoid cascading failures.
7. The idea of "Cold Cluster Warmup", retrieving data from a "warm" cluster instead of from persistent storage when a cache starts afresh, was great (see the sketch after this list)!
8. The decision to guarantee only eventual consistency, with the help of remote markers that identify whether a non-master database holds stale data.
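A minimal sketch of the Cold Cluster Warmup read path from point 7; the client objects are assumed, add() is the memcached-style operation that fails if the key already exists, and the paper's two-second delete hold-off is omitted for brevity.

    def warm_get(key, cold_cache, warm_cache, db):
        """Clients of the cold cluster fill misses from an already-warm cluster
        instead of going straight to the database."""
        value = cold_cache.get(key)
        if value is not None:
            return value
        value = warm_cache.get(key)        # cross-cluster read, normally not allowed
        if value is None:
            value = db.query(key)          # true miss: fall back to storage
        cold_cache.add(key, value)         # add fails if the key already exists,
        return value                       # so a racing update is not clobbered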

Relevance:
The concepts discussed in this paper are very much helpful in building a distributed key-value system and a practical application can be seen at Facebook. These ideas can definitely be used to build any large-scale distributed system which handles billions of requests per second. I would be happy if these techniques are used to improve one of the largest online train ticket booking websites in India, namely IRCTC.

Summary :
This paper from Facebook presents how the engineers leveraged memcached to construct a highly scalable, distributed key-value store to serve billions of requests per second.

Problem :
Facebook, as we all know, is the largest social network platform, with a billion users and trillions of items. So they need a system that is highly scalable, serves requests with low response times, limits how stale the returned data can be (maintaining consistency across servers), and is fault tolerant.

Contributions :
1. The paper talks about many contributions which help solve issues at every level of the architecture.
2. The system tries to reduce the huge load on the databases and other services by having a lot of memcached servers, which all requests are ideally expected to hit first.
3. I like their idea of using UDP for get requests thereby trading reliability with performance.
4. The idea of using mcrouter to avoid network congestion is pretty neat. This router coalesces TCP connections, batches requests, and unpacks them before sending them to the servers, thereby reducing overhead and request latency.
5. They use something called “lease” which handles stale sets and thundering herds.
6. The system trades performance in some cases where stale data is acceptable. I guess in large systems, this is always a decision to make and many people go in favor of performance.
7. They have their data partitioned into different "pools" which suit different purposes. For example, some data might be accessed frequently but have inexpensive cache misses, whereas other data might be computationally intensive and have expensive cache misses.
8. The paper points out a clear distinction between when data needs to be replicated and when it needs to be partitioned which is pretty intuitive.
9. They have a set of idle machines, which they call “Gutter” which might be used when a set of servers go down or there is heavy load on all the machines.
10. The paper defines a "region" as the storage cluster along with the memcached and web servers. The system has an invalidation daemon on every database, called mcsqueal, which extracts deletes from the SQL commit log and broadcasts them to the memcached deployment in every front-end cluster of that region. This is to maintain consistency.
11. I like their idea of Cold Cluster Warmup where when a new cluster is brought online, it doesn’t have to incur high miss rates. Instead, it retrieves data from another cluster which has caches with normal hit rates, rather than going to the databases.
12. The system maintains inter-region consistency by introducing a flag called a "remote marker", which indicates that data in the local replica database is potentially stale and redirects the query to the master database (see the sketch after this list).
13. They have an adaptive slab allocator which is capable of re-balancing memory management assignments as the workload changes, and each thread also has its own UDP port to reduce contention.
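A sketch of the remote-marker flow from point 12, in Python; the regional_pool, local_cache, and database objects are assumed stand-ins, and embedding the marker deletion in the SQL is shown here as a keyword argument purely for illustration.

    def write_from_replica_region(key, sql, regional_pool, local_cache, master_db):
        """Write issued by a web server in a non-master region."""
        marker = "remote:" + key
        regional_pool.set(marker, 1)                  # 1) mark the key as possibly stale
        master_db.execute(sql, delete_marker=marker)  # 2) write to master; the marker is
                                                      #    removed once replication catches up
        local_cache.delete(key)                       # 3) invalidate the local copy

    def read(key, regional_pool, local_cache, replica_db, master_db):
        value = local_cache.get(key)
        if value is not None:
            return value
        if regional_pool.get("remote:" + key):        # marker present: replica may lag
            return master_db.query(key)               # pay extra latency, read the master
        return replica_db.query(key)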

Applicability :
This is very applicable as Facebook is using it and is performing well! It is really nice that Facebook contributed to making Memcache distributed as this is currently also being used by Twitter and such. I think that if a system has a lot of data and a lot of requests coming through but not very strict consistency requirements, this is a very good mechanism to ensure low response time and high scalability.

Summary:
This paper describes the distributed key-value store of the social networking service Facebook. It talks about how Facebook leveraged the in-memory hash table "memcached" to build a scalable, distributed key-value store offering a uniform view and consistent performance to its clients.

Problem:
The problem they are trying to solve is to come up with a system which looks to clients as if it were a single cache with limitless memory, offers high throughput and low latency, and is fault tolerant.

Contribution:
1. Leveraging an existing simple cache solution like memcached as a building block, as opposed to developing something complex. Memcached has a simple interface, and hence it makes information management simpler.
2. Building the key component mcrouter, which sits between the web servers and memcache servers and performs functions such as:
2.1 Connection coalescing: mcrouter coalesces multiple TCP connections from the clients and sends the data to the memcache servers over one connection. This takes load off the memcache server in terms of the number of connections it has to handle (see the sketch after this list).
2.2 Transparency: the interface exposed by mcrouter is exactly the same as that of memcached, making the presence of mcrouter completely transparent to the client.
2.3 Packet decompression: during cache invalidations, mcrouter is responsible for unpacking the batched message and forwarding the relevant invalidate messages to the corresponding memcache servers.
3. Achieving higher read rates through data replication.
4. Dealing with heterogeneous workloads by categorizing the memcache pools. This quality of service ensures that data which incurs a high cost when fetched from the database is kept in a larger pool than data which would incur a lower cost; the frequency of access is also considered.
5. Cold servers are warmed up by fetching data from warm servers instead of from the back end, allowing the cold servers to populate their caches quickly.
6. Preferring UDP for gets, since it is faster, and adding congestion control to reduce incast congestion.
7. Using the lease mechanism to prevent the problems of thundering herds and stale sets.
8. mcsqueal ensures that invalidations are propagated consistently across all memcache and database clusters.
9. Software upgrades do not affect the cache contents, as the data is stored in a System V shared memory region.
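A tiny Python sketch of the coalescing/batching idea from 2.1; server_for() and send_multiget() are assumed helpers (a consistent-hash lookup and a single batched request per server), not mcrouter's real interface.

    from collections import defaultdict

    def route_and_batch(keys, server_for, send_multiget):
        """Many per-key requests from web threads become one batched request
        per destination cache server, over one shared connection."""
        by_server = defaultdict(list)
        for key in keys:
            by_server[server_for(key)].append(key)
        results = {}
        for server, batch in by_server.items():
            results.update(send_multiget(server, batch))   # one round trip per server
        return results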

Limitations :
1. Usage of a single remote marker per key while performing updates: concurrent updates could be a problem, since a single marker may end up representing two in-flight updates.
2. A stale value from the cache is sometimes returned in order to prevent the back end from getting overloaded.

Applicability:
This implementation is used by Facebook today. Other users apart from Facebook include Youtube, Wikipedia, Reddit, Twitter etc.

SUMMARY
The authors have adapted memcached to serve a billion requests per second! They used different techniques, including making memcached a distributed system, batching requests, etc., and improved its performance at the cost of 'little' inconsistencies (stale data).

PROBLEMS
They are trying to build a system that can handle more than a billion requests per second. The requests are read-heavy and have very high fan-out. They have to consider CPU and network bottlenecks as well as the issues of consistency, speed, and fault tolerance.

CONTRIBUTION
They built a distributed caching system out of memcached that serves the purpose! The whole paper is full of wonderful ideas (problems and solutions) that helped them realize the goal, and these ideas can be used in any distributed system of this scale, e.g.

  • "Warm up": Cold caches are warmed up by putting it in between client and a worm cache. This saves a lot of costly database querries as well as worms up very fast
  • "leasing": to deal with inconsistency in concurrent write requests (thundring heard or stale sets), they developed notion of "leasing" the writing permission against a key to one of the requests.
  • the requests are batched but 'mcrouter' to decrease network overhead.
  • Gutter Cache: Addendum to the existing fault tolerant distributed caching technique they prefer to have gutter caches to handle requests incase of failure of one or mroe servers. It stop cascading effect of misses in other caches ddue increased load otherwise.
  • Divide cache data based on high churn (can evicted) and low churn (do not evict).
  • Demand based caching: The logic of cache modification is put in client to make the design simple.
  • Releasing their distributed memcached as opensource.
  • Optize the system to server hot objects very fast - this will make expected service rating better.
  • Simple is better than complex!

APPLICABILITY
Facebook is using it; that is the biggest proof of applicability and the magic of this system. Though the version developed at Facebook is highly tuned towards Facebook's needs, it can be adapted to other scenarios as well. Given Facebook's needs, they have to serve many small requests in a short time instead of a few large requests, so they prefer to replicate the data instead of dividing it when handling hotspots.

Moreover, they have not talked about what the bottlenecks would be in scaling this system to handle 10 or 100 billion requests per second.

Summary

The paper is a study of the memcache implementation at Facebook to perform caching that supports processing of billions of requests per second, without mowing down the storage clusters in the backend with high read request volumes. The system builds on memcached, to create a distributed, highly performant key-value store.

Problem Domain
Workloads at Facebook are typically characterized by an order of magnitude more reads than writes, with reads often involving high fan-out and a large number of data fetches to satisfy a single request. A lot of the engineering decisions in this paper are made with this nature of the workload in mind. Also factored in is the fact that many applications that use memcache are tolerant of some staleness in the fetched data. Another thing to keep in mind is that this caching architecture deals with data that has differing costs of generation/computation, making caching more important in some instances than in others.

Contributions


  • An architecture that supports the existence of both partitioning and replication of data with uniform access to the underlying memcached servers through the use of a mcrouter.

  • Division of workloads into caching pools that prolong cache presence based on how computationally expensive it is to refetch/recompute the cached data, and dynamic routing to such caching pools based on the key space.

  • Building in tolerances to failure by use of spill over servers that can take up load from failed/overloaded servers. Also allowing for transient failures by the choice of using deletes (and the fact that they are idempotent), with replay of invalidation messages on server recovery.

  • The invalidation pipeline used to ensure consistency across regions is a pretty nice way of ensuring consistency without consuming a ton of bandwidth (see the sketch after this list).

  • The cold cluster warmup is a sweet engineering trick that keeps cluster additions/resets from taking forever to warm up again. But I don't particularly understand how a cluster is weaned off of cold cluster warmup mode once it has a good active working set of its own.

  • Enabling more agile updates to the software must be really important to a company like Facebook which is known for its pace of development, and the choice of System V memory retention across updates is a pretty neat way of solving a real engineering problem which would have otherwise made system upgrades much slower, and possibly much less frequent.
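As a rough illustration of that invalidation pipeline, here is a Python sketch of a daemon that tails committed SQL, pulls out embedded memcache keys, and batches the deletes before handing them to the routers; the comment format, batch size, and send_deletes() method are all assumptions, not the actual mcsqueal code.

    import re

    DELETE_TAG = re.compile(r"/\* mc:delete=(\S+) \*/")   # assumed key annotation in the SQL

    def tail_commit_log(statements, cluster_routers, batch_size=64):
        """statements: iterator over committed SQL; cluster_routers: one router
        handle per front-end cluster that unpacks and fans out the deletes."""
        batch = []
        for stmt in statements:
            m = DELETE_TAG.search(stmt)
            if not m:
                continue                        # statement touches no cached key
            batch.append(m.group(1))
            if len(batch) >= batch_size:
                for router in cluster_routers:
                    router.send_deletes(batch)  # few large packets instead of many small ones
                batch = []
        if batch:
            for router in cluster_routers:
                router.send_deletes(batch)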


Applicability
As can be seen in the related work section, networks such as Twitter or Facebook that have large streams of data, that are predominantly read heavy, and which require a large number of aggregations from multiple sources would be able to make use of this memcache architecture to scale up to high throughput request processing ability with low latencies. By servicing most requests from the cache, not only do they gain performance, but they are also able to shield their backends from facing load which the traffic would otherwise possibly take down.

Summary: The paper describes how Facebook has used, modified, and deployed memcached in a distributed system that is able to support over a billion requests per second. It presents interesting engineering insights about how such a system is deployed and evolved to support the growing performance needs of a huge social network.

Problem

How does one ensure good performance for hundreds of millions of users? Obviously the answer is, again, caching. However, interesting problems start to reveal themselves when such a caching system is scaled up, and that requires a concrete understanding of the goals you are trying to optimize for. For instance, what decisions need to be made with regard to consistency? Whatever approach you take towards consistency, it will have trade-offs with respect to your initial goals; in this case, the trade-off is between better performance and strong consistency. Facebook has chosen performance over a perfect consistency model. This paper goes into their design decisions as they built their distributed memcache.

Contributions

The main contribution of this paper is the description of their memcache distributed system that is employed at facebook. It provided interesting insight into the unique process of deploying and evolving a system over time.

One of the great aspects of this paper is how the authors illustrate the problems that occur as you move up in scale. That is, while consistency, performance, fault tolerance, etc. are general matters of concern, some of these aspects become more of a problem at various levels of the hierarchy. For example, consistency becomes a major problem when you have a region of machines (multiple clusters), because you have to be able to push invalidations to all the memcache servers. Doing this naively results in a large amount of traffic being sent over the network, causing congestion. They addressed this by adding a layer of abstraction, the mcrouter, which allowed them to reduce the packet rate and made configuration easier for them.

Applicability

Obviously this system is very applicable and relevant because it is something that is actually in use at a company that has to service billions of requests per second. That fact alone and the fact that the system actually works is enough to pay special attention to this. However, this is not a system that you can just pick up and use anywhere. As was said repeatedly in the paper, this system only works for specific workload profiles with specific goals for those workload profiles. If you have a system that is write heavy, then this is not the caching scheme you ought to adopt.

One of the reasons the authors gave for geographically dispersed regions of servers was to provide fault tolerance, so that Facebook will always be up no matter what calamities hit any particular server. However, it seems that their actual design does not allow for this in the case of actually writing data. The system is currently set up so that clusters operating on replica back-ends do not write to those replicas; they send the writes to the master region. Therefore, if a calamity hit the master region, the system would no longer be able to write. The site would likely just become read-only, probably a trade-off that the engineers willingly made when designing the system.

Summary:

The authors describe how memcached can be used to implement a distributed key-value store that is robust and handles a very high workload.

Problem:

Facebook has trillions of keys that must be served at billions of requests per second. The existing open-source implementation of memcached cannot deliver such a high throughput.

Contributions:

- McRouter to make the system scalable.
- Choosing UDP over TCP for specific scenarios to optimize for a read dominant workload.
- Congestion control mechanism for Incast Congestion.
- Architecture for organization of cache servers, storage layers, and web servers across geographic boundaries.
- Lease-based mechanism for dealing with stale sets and thundering herds.
- Creating specialized memcache pools for handling heterogeneous workloads.
- Connection coalescing mechanism for reducing network load and database load.
- Batching mechanism for cache requests to reduce packets.
- Cold cluster warmup mechanism for bringing a cluster online faster.
- Consistency mechanism across memcaches using McSqueal.
- Adaptive Slab Allocator.
- Retaining cache data across software upgrades.
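To illustrate the UDP-for-reads choice listed above: a toy Python sketch where a dropped or late reply is simply counted as a miss, while deletes stay on TCP. This assumes a plain-text-protocol server on the given port; real memcached additionally prefixes UDP packets with a small frame header, which is omitted here.

    import socket

    def udp_get(host, port, key, timeout=0.05):
        """Reads over UDP: no connection state, and a timeout is treated as a
        cache miss (fall through to the database) rather than retried."""
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
            s.settimeout(timeout)
            s.sendto(f"get {key}\r\n".encode(), (host, port))
            try:
                reply, _ = s.recvfrom(4096)
            except socket.timeout:
                return None                    # counted as a miss, not an error
        lines = reply.split(b"\r\n")
        return lines[1] if lines and lines[0].startswith(b"VALUE") else None

    def tcp_delete(host, port, key):
        """State-changing operations stay on TCP so they are acknowledged reliably."""
        with socket.create_connection((host, port)) as s:
            s.sendall(f"delete {key}\r\n".encode())
            return s.recv(1024).startswith(b"DELETED")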

Limitations:

- Optimized for reads at the cost of writes.
- The remote marker mechanism is not robust under concurrent writes.
- Backends are shielded from load and response times are improved at the cost of stale data.
- All writes operate on the master cluster. Single "point" of failure for writes.

Application:
- It is used as part of FB's infrastructure.
- Some of the changes made have gone to the open source community, and memcache is widely used by many companies.
- The design serves as an excellent example for building distributed systems optimized for certain characteristics.

Summary
The paper talks about how the authors leveraged and made changes to the open-source version of memcached to improve the performance of Facebook's distributed key-value store.

Problem trying to solve
• Reduction of latency on a request for data. The response time on a cache (memcache) hit or miss is a critical factor and needs to be alleviated.
• As web servers rely on a high degree of parallelism and over-subscription to achieve high throughput it becomes expensive in the case of open TCP connections between web threads and memcache. Resource utilization has to be reduced.
• Incast Congestion is another issue that has been dealt with.
• Load Reduction to take care of issues such as stale sets, thundering herds, improving hit rates and improve the latency and efficiency of memcached servers.
• Handling failures such as loss of hosts due to network or server failure and loss of servers within a cluster due to an outage.
• Impact of multiple front end clusters sharing the same storage cluster.
• Challenges when trying to broaden the geographic placements of data centers.

Contributions of the paper
• Improvements to memcached architecture to improve performance and increase memory efficiency.
• Latency reduction is achieved by focusing on the memcache client. This client holds a map to every available server and has a range of functions available to alleviate latency.
• To improve resource utilization, coalescing of the multitude of open TCP connections was proposed, reducing network, CPU, and memory resource usage.
• In order to reduce incast congestion the sliding window technique is used. This is done in the client as it uses a window to control the number of outstanding requests.
• Improved Scalability
• Reduction in Load is done using three techniques
o Leases are used which emulate the working of load-link/store conditional to prevent stale sets.
o Thundering herds are avoided by regulating the rate at which tokens are returned.
o Memcache pools are used to improve hit rates. Pools were created from memcached servers, differentiated based on the workload's popularity and miss cost.
o Replication within pools was proposed to improve the latency and efficiency of the memcached servers.
• Failures such as network and server failures were handled by the use of gutter server pool which also limits the load on backend servers.
• Division of web and memcache servers into multiple frontend clusters to avoid incast congestion due to increased traffic.

Applicability to real systems.
• The distributed key-value store implemented scales well on the workloads presented by the paper. The paper being very recent makes this system highly applicable at present.

This paper explains the various methods implemented by Facebook to scale memcached and create a distributed key-value store, allowing them to process millions of user requests per second and minimize latency.

Various engineering decisions made by Facebook -
- Delete operations were preferred to sets since they are idempotent, and the in-memory cache is filled only on demand.
- A directed acyclic graph is constructed representing the data dependencies, and this can be used to fetch data in parallel.
- Memcached servers do not communicate with each other; all complexity is integrated into stateless clients, reducing network traffic within a cluster.
- UDP is used to communicate with memcached servers for get requests to reduce latency for read-intensive workloads. The UDP implementation treats dropped packets as cache misses. TCP is used as the reliable mechanism for set and delete operations.
- Connection coalescing is performed by the proxy mcrouter to save CPU and memory resources. A sliding-window mechanism is used between servers and clients to prevent overloading of servers (flow control).
- Leases are given to clients by a server to make sure stale sets never happen, and the rate at which leases are given out can be regulated, thereby mitigating the thundering herd problem. At times, returning stale data is acceptable, since Facebook believes this snapshot of a web page is fairly accurate and it drastically reduces latency.
- Memcache servers are divided into pools to ensure minimal negative interference between workloads. Some pools are replicated to distribute load.
- Fault tolerance: when a client receives no response from a server, it assumes failure and directs requests towards 'Gutter' servers.
- Clusters that are recently brought online try to improve their hit rates and reduce load on the database servers by retrieving data from memcache clusters that already have good hit rates.
- Remote markers are used within non-master regions to ensure consistency. The inter-region communication channel is shared with the delete stream to reduce bandwidth requirements.

Overall, the paper shows that by applying existing techniques such as consistent hashing, sliding windows, connection coalescing, and fine-grained locking to a system, high performance and efficiency can be achieved, and such systems can scale easily. This paper explains the details of the Facebook back end, so it is definitely applicable to present-day distributed systems, but it also shows that system design is highly dependent on workloads.

This is a paper describing Facebook’s modifications to and usage of memcached to make it efficient for their use case. They include many different methods for decreasing latency at different scales - inside the server, in a cluster, and within a distributed, globalized network.

The problem is simple enough - figuring out how to scale a key-value store in a distributed nature. Of course, there are additional problems, including the massive I/O demands of users, scalability, and the domination of reading data over writing in their workloads.

The contributions are varied. The most basic are improvements that took place inside the server. These include automatically expanding the hash table, using multiple threads with a global lock, and giving each thread its own UDP port. They also worked on the cache's eviction policy, delaying the eviction checks for most items (lazy eviction) while quickly evicting items that are only useful for short bursts of activity.
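A Python sketch of that hybrid eviction idea: short-TTL keys go into a circular buffer of per-second buckets and are evicted proactively, while everything else is checked lazily on access. The 60-second threshold and the data structures are assumptions for illustration, not memcached's internals.

    import time
    from collections import OrderedDict

    class TransientItemCache:
        SHORT_TTL = 60                         # threshold for "short-lived" keys (assumption)

        def __init__(self):
            self.lru = OrderedDict()           # key -> (value, expires_at)
            self.buckets = [set() for _ in range(self.SHORT_TTL + 1)]
            self.head = 0
            self.last_tick = int(time.time())

        def set(self, key, value, ttl):
            self.lru[key] = (value, time.time() + ttl)
            self.lru.move_to_end(key)
            if ttl <= self.SHORT_TTL:          # short-lived: schedule proactive eviction
                slot = (self.head + int(ttl)) % len(self.buckets)
                self.buckets[slot].add(key)

        def get(self, key):
            self._tick()
            item = self.lru.get(key)
            if item is None:
                return None
            value, expires = item
            if expires < time.time():          # lazy eviction for long-lived keys
                del self.lru[key]
                return None
            self.lru.move_to_end(key)
            return value

        def _tick(self):
            now = int(time.time())
            while self.last_tick < now:        # advance one bucket per elapsed second
                self.last_tick += 1
                self.head = (self.head + 1) % len(self.buckets)
                for key in self.buckets[self.head]:
                    self.lru.pop(key, None)    # proactive eviction of short-lived keys
                self.buckets[self.head].clear()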

Inside each cluster, they worked on decreasing the number of requests between nodes by batching requests, using UDP where possible, and limiting outgoing requests. One of the interesting ideas they implemented here was a sliding window for the number of outstanding requests; this window increases in size until a request is not returned. In addition to network issues, they worked to fix two common problems in distributed systems: stale data and "thundering herd" behavior. To do this, they implemented leases, which keep track of data that has been deleted between the cache miss and the retrieval from the data store, and force requests to wait when many of them experience the same cache miss.

For any distributed system, failures become a concern. Facebook’s memcached is no different. They actually have a simple mechanism for dealing with failed devices - backup machines. These machines can be set with any keys on machines that are not responding.

The largest scale they have is regional. That is, for different regions, they have different clusters. This means that adding new machines does not scale for all data, just for data that exists in the regional cluster. Inside of a cluster they replicate data, since having multiple copies of the same data means hot keys will be available in multiple caches, which would not be possible if they just further distributed the key-value store.

Between regions they must maintain consistency, which is important for any distributed system. They favor performance a bit over strict consistency, allowing some data to be slightly stale; they acknowledge that this is only possible for their use case. In any distributed system, the consistency-performance tradeoff must be made depending on the application. On Facebook, it is not vital that the information be perfectly up-to-date, but in financial matters it may be.

This is an extremely interesting read. The various work they did at all levels of the system made for an optimized and highly tuned cache. In addition, I want to highlight the focus on ease of use for their programmers. In addition to the other tradeoffs they made (like using stateless servers), a key-value store is simple for the application developers to use. Its simplicity allows for extreme fine tuning and development behind the scenes to make it as fast as possible.

In terms of flaws, the only one I might see is the reliance on large numbers of machines (that is, it's expensive). As an example, take the scheme for dealing with failure: they keep 1% of their machines idle just to take load from failed servers. I doubt that 1% of their machines have failed at any given time, which makes this seem like a large overhead. Of course, I acknowledge that they must deal with the peak number of failures, but there are other methods to mitigate the effects as well.

SUMMARY: In this paper, the authors motivate the need for high-performance distributed caches for large-scale, read-heavy, high fan-out workloads and discuss the design and implementation challenges in building out a distributed key-value store starting from a single-threaded, single-server key-value store, memcached.

CONTRIBUTIONS: The paper introduces many ideas and design decisions to solve challenges arising at various levels of the system scale-out. We highlight a few below:
* Using UDP instead of TCP to minimize service latency and memory footprint
* Introducing a routing layer, mcrouter
-- batches get requests and unpacks delete-invalidations, reducing network congestion
-- acts as the communication proxy between web servers, memcache and databases
* Client-side TCP-like adaptive, sliding window mechanism to control request rate
* Lease mechanism
-- avoids stale data in cache due to write-delete reordering during concurrent writes
-- avoids thundering herds problem, where high write-rate for a hot object causes high cache invalidation rate and load on backing storage servers
* Separating out cache pools for low- and high-churn objects, because of their different resource requirements and potential for negative interference. Similar to database buffer pools using different eviction policies for different types of pages.
* Argument for why replication is preferred over key-space partitioning while scaling out their servers within a pool, given that their service times are insensitive to request size. This is in contrast to previous work we read about.
* Using Gutter servers for temporary backup during fault recovery, rather than remapping keys using consistent hashing, in order to avoid cascading failures due to non-uniform request frequency
* Propagating invalidations bottom-up (from the database servers to the caches) rather than from the web servers, in order to avoid stale data and race conditions, reduce network congestion and ease failure recovery
* Reducing replication of less popular, smaller objects for efficient memory and network utilization
* Warming up a cold cluster by placing it in front of a warm cluster, while largely avoiding cache consistency issues (using hold-off times)
* Using remote markers to provide reasonable consistency guarantees when all writes (even from non-master region web servers) go through the master region
* Slabbing memory allocation to improve memory reusability and reduce allocation costs, and adapting the relative allocations to simulate global LRU
* Separating out short-lived objects into a transient item cache for early eviction
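The adaptive slab allocation point above can be sketched roughly as follows; the page size, size classes, and the needy/donor policy are heavy simplifications of the paper's scheme, and the names are invented.

    import time

    PAGE_BYTES = 1 << 20                        # assume 1 MiB slab pages

    class SlabClass:
        def __init__(self, item_size):
            self.item_size = item_size
            self.pages = 0
            self.items = []                     # (last_used, key), oldest first

        def capacity(self):
            return self.pages * (PAGE_BYTES // self.item_size)

    class AdaptiveAllocator:
        """Memory is pre-divided into size classes; a class that runs out of room
        can steal a page from the class whose least recently used item is oldest,
        approximating a single global LRU across classes."""

        def __init__(self, sizes, total_pages):
            self.classes = [SlabClass(s) for s in sorted(sizes)]
            self.free_pages = total_pages

        def class_for(self, nbytes):
            # assumes nbytes fits in the largest configured class
            return next(c for c in self.classes if c.item_size >= nbytes)

        def store(self, key, nbytes):
            c = self.class_for(nbytes)
            if len(c.items) >= c.capacity():
                if self.free_pages > 0:
                    self.free_pages -= 1        # take a free page while any remain
                    c.pages += 1
                else:
                    self._rebalance(c)          # otherwise try to steal one
            if c.capacity() == 0:
                return False                    # no memory could be found for this class
            if len(c.items) >= c.capacity():
                c.items.pop(0)                  # evict this class's own LRU item
            c.items.append((time.time(), key))
            return True

        def _rebalance(self, needy):
            donors = [c for c in self.classes
                      if c is not needy and c.pages > 0 and c.items]
            if not donors:
                return
            donor = max(donors, key=lambda c: time.time() - c.items[0][0])
            donor.pages -= 1                    # reassign one page to the needy class
            keep = donor.capacity()
            donor.items = donor.items[-keep:] if keep else []
            needy.pages += 1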

APPLICABILITY: The ideas highlighted in this work are broadly applicable in the design of large-scale distributed systems. This includes both the technical contributions above, as well as general principles (such as the importance of modularity, statelessness, optimization of the common case, treating heterogenous workloads differently, etc).

Summary:

The paper presents the distributed (across different geographical regions) key-value store architecture currently deployed by Facebook. They have enhanced "memcached", an in-memory hash table implementation, to scale to multiple clusters of servers in order to meet the high request-rate demand on their servers.

Problem:

Infrastructure of any high-demand social networking site has to always address the following problems:

  • Scale to very high demand (in concurrent requests).
  • Provide near real-time updates for the users.
  • Satisfy the demand for highly popular content; also deal with sudden spikes due to significant events.
  • Aggregate content from various servers.
  • Satisfy heavier-than-usual demand during holidays.

In general it should also be consistent, fault tolerant and perform efficiently.

Contributions:

In order to meet the above demands, the architecture proposed in the paper separates the caching layer from the backend storage:

  • The infrastructure contains: regions of web server clusters (which run the memcache client), memcache server clusters, and database clusters.
  • There are multiple geographically separate regions across which stored data is replicated.
  • Mcrouter component in client is used to route requests to servers.
  • McSqueal component is used to invalidate data in the memcache servers.

Following are the some main design contributions and takeaways from the paper:

  • The fact that FB leveraged and built their architecture on an open-source software was cool.
  • Ensuring better performance of memcache server by pushing functionality to memcache client; increasing processing speed by making client stateless and by doing batch processing of requests.
  • Usage of UDP for get requests was an excellent idea as it reduces the connection establishment overhead; also ACKs are not required for them.
  • Dealing with incast congestion by leveraging sliding window protocol concepts was clever as the credibility of TCP is well established to handle congestion issues.
  • Concurrent writes and the thundering herd problem were dealt with by using 64-bit leases generated by the server, which the client can then use to update data if the lease is still valid.
  • Pooling memcache servers aided in limiting data replication, and using a large pool for keys that are accessed infrequently but are expensive to recompute enabled even those requests to be served from the cache.
  • The concept of gutter usage provided high availability of clusters by backing up during failures.
  • In the master-replica region architecture, usage of remote marker to ensure user receives updated data(trading latency) was neat and well suited for FB status updates(the only exception where stale data was unacceptable).
  • Eviction of short-lived keys using Transient Item Cache would aid in effective memory management.

Applicability:

The components presented in the paper are highly applicable not only to FB but also to any high-demand social website which does not have strict consistency requirements with respect to staleness of data. Being built on open-source software, memcache was very effective as it could be scaled to an architecture as big as FB's; the paper is also quite recent and is definitely highly applicable to present-day social networking sites.

Summary: The paper explains how Facebook scaled their distributed memcache system through various phases from a few servers, to clusters, and eventually to geographically distributed clusters to help serve up 1 billion+ requests per second.

Problem: Facebook has a load that heavily favors reads vs writes, so their system had to be especially scalable for reads. Facebook cares most about low latency, and fast page load times, and they are willing to give up a little cache freshness in order to get better performance. With all-to-all communication between clients and servers, the system can't scale very well, too much bandwidth is consumed, and latency is high.

Contributions:
1-System design: although many aspects of the system were already well known, they put things together in a way no one had done before. There is no server-to-server communication, and stateless client operations are favored. They split cached data into low- and high-churn pools to prevent high-value, low-churn data from being evicted in favor of low-value, high-churn data. They used key-space partitioning, and controlled replication within pools, to limit the load that might go to a single server. They used UDP to speed up gets, and TCP for updates and deletes (to avoid repeating requests). Regional pools allowed a reduction in replication by letting multiple front-end clusters share the same set of memcached servers. Separate regional clusters also added geographic locality.

2A-Distributed Cache: Memcache: having a fast key-value store with get, update, and delete operations was critical, and splitting the persistent store from the cache allowed each to scale separately. They turned a simple key/value store into a distributed system by using consistent hashing as well as other strategies. They also released an open-source version of memcache. mcrouter helps buffer actions and intra-cluster communication, while mcsqueal helps broadcast cache invalidations when deletions are performed.

2B-Memcache (single system): the more recent versions of memcached were improved to handle 600k to 1.7m get requests per second (I believe old versions could only handle 50k requests per second).

3-Reduce load: they created a new mechanism, "leases", which helps prevent thundering herd issues by releasing at most one token per key every 10 seconds. Stale cache issues are prevented as well by a mechanism similar to load-link/store-conditional (LL/SC), which stores a new value only if no updates have occurred since the lease was issued.

4-Fault Tolerance: The gutter servers are extra servers that sort of act as a spare tire for the system, since a few servers could go down & the gutter servers would then come online & serve up requests while backups could be brought into place. It prevented escalation of failures that might instead bring the entire system down.

5-Cold cluster warmup: a novel idea that allows you to warm up memcache servers by first attempting to retrieve a fresh value from another memcache server. So while inter-server communication is bad, there is in fact a context in which it is good. This would of course reduce load on backend database servers and quickly warm up a cluster cache (within a few hours instead of a few days).

6-Consistency: “remote markers” prevent race conditions most of the time by trading additional latency for decreased probability of reading stale cache data.

Applicability: The paper is straightforward & practical and their approach is too. Anyone who is willing to trade slightly stale data for extremely high system performance could follow their specifications and use the open-source version of memcache to create a high performance distributed cache. They didn't really tell us what problems they are currently facing but no doubt problems must exist.

Summary: Facebook takes memcached and modifies it to work in a
distributed system. Their goal is to find a tradeoff between performance and
consistency, which they accomplish by building a modular system with batched
messaging and heuristics for deciding how to send requests (or when not to).

Problem: they are generating dynamic pages that depend on values stored
in a key-value store. Retrieving the data from a backend DB would be
prohibitively expensive, and using a cache would raise consistency concerns. If
one frontend updates a value in the cache, how would this be replicated to
other caches? Another problem regarding fault tolerance is what to do when a
cache fails or does not respond to requests.

Contributions:
- Instead of building logic in the cache, build it in the client: if there is a
cache miss, the client is responsible for retrieving the data from the
backend and updating the cache.
- 'Gutter': when a cache fails, instead of using other caches, which could
cause a cascading failure, use a dedicated 'Gutter' cache, which may be a
little inconsistent, but at least will maintain availability.
- To swap in a new cache, instead of querying the backend storage, migrate from
a cache that is already hot.
- Batching! Network congestion can be greatly reduced by batching messages.
- 'Pool' together data that has similar requirements of consistency/latency.
- Cache eviction: use a hybrid scheme that uses lazy eviction in most cases and
  proactively deletes based on LRU otherwise.
- Use independent failure domains.
- Simplicity is vital!

Applicability:
It *is* applicable, because Facebook uses it! They show how you can take an
existing cache and, without touching its codebase (too much), scale it to a
distributed system. In addition:
- Modularity allows components to scale at different rates.
- Treat errors as the normal case, and thus have procedures for dealing
with them that do not affect other parts of the system.
- The notion of pools can be used to have different consistency
guarantees for data with different requirements.

Summary:
The paper describes the architecture of a distributed key value store for supporting a large social networking website using memcached.
The main considerations were scale and performance and they have a read dominant workload which adds that much more incentive to caching. The end goal is to have most of the requests be handled by the in memory cache rather than the slower backend.
Through many clever choices such as pushing the complexity toward stateless clients, decoupling caching from persistent storage systems etc., the authors are able to achieve low latency, high throughput, fault tolerance and good consistency.
What problems are they trying to solve?
Many of the problems the authors are trying to solve arise due to the sheer scale of the system. The memcache is expected to hold trillions of items and handle billions of requests per second. Moreover, their workload has a wide fan-out. Each page request can create a dependency graph spanning tens of levels and everything needs to be handled in as little time as possible. Such a scale can create bottlenecks at various points in the system.
Contributions:
There is an amalgamation of various techniques in this paper, and I will try to summarize a few of them.
1. To reduce complexity and improve operational efficiency, they push all complexity to stateless clients, so there is no inter-server communication between the caches. Considering the scale of the system, this is a very important design decision that worked well for them.
2. To update the caches, they prefer ‘deletes’ for their idempotency rather than ‘sets’.
3. They make certain trade-offs like serving slightly stale data (by temporarily holding onto deleted items) when the preference is to reduce the load on the backend.
4. The all to all communication pattern leads to incast congestion which they solve through parallel requests, batching and using a proxy called mcrouter for routing requests efficiently.
5. From a networking point of view, they use UDP for gets and TCP for sets and deletes. Since gets are more frequent, using UDP makes sense as it is connectionless and has less overhead. TCP is used for the other two since they involve state changes.
6. Clients also use a dynamically resizing sliding window (similar to one used by TCP) to control the number of outstanding requests. This is a great way of efficiently utilizing the available resources at any point in time.
7. They solve problems of concurrent updates and thundering herds by using a leasing mechanism.
8. Another cool idea was to pool the servers based on churn with different configurations for each pool specific to the kinds of workloads they are serving.
9. They do replication within clusters and also across geo-regions to handle fault-tolerance and for better performance.
10. They also use gutter servers to handle failures. However, if I understand correctly, web servers first hit the failed cache, then the gutter, and only then go to the back end. This could be alleviated if there were a way to notify the web servers of a failed cache so they could try the gutter directly.
11. For data invalidation they amend SQL statements that modify state to include corresponding memcache keys and use a daemon called mcsqueal (along with mcrouter) to co-ordinate the invalidation at the caches. Techniques such as batching are used here to avoid high traffic.
12. They are happy with eventual consistency across geo-regions. They classify the regions into master and replica regions and use ‘remote markers’(details omitted for brevity) to deal with consistency.
13. Finally, I thought the evaluations were really good and justified all their design decisions adequately.

Applicability:
This paper is contemporary and is an aggregation of various techniques, each of which are valuable for building any large scale web service.
