
The Google File System

Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google File System. In 19th ACM Symposium on Operating Systems Principles, Lake George, NY, October 2003.

Reviews due Tuesday, 11/18.


Google File System


The authors describe the design and performance of the Google
distributed file system, which is designed specifically to handle large data
sets and to run over commodity hardware, which is cheaper but has higher
failure rates than server-class hardware.

Description of the problem being solved

GFS was the solution to the problem of building a better file system,
one that addresses the following observations (which a traditional file
system does not handle):

a) the need to efficiently handle large data sets that don't fit on a
single commodity hard disk

b) the need for a cheap solution built from commodity hardware

c) the need to optimize for the common write access pattern of appending
data to existing files

Contributions of the paper

a) Clusters with a single master that actively handles mutations keep
the whole solution simple. Otherwise, multi-way replication and
consistency solutions would need to be in place. Amazon's S3, for example,
follows an eventual consistency model as its solution to the problem of
multiple masters.

b) Replication of chunks improves bandwidth for read access patterns
in a distributed environment.

c) Instead of a directory structure, a lookup table with prefix
compression is used, which leads to faster name-to-chunkserver lookups.

d) A copy-on-write mechanism is used, based on the observation of
frequent append access patterns. Data transfers happen directly with the
chunkservers after the metadata operations. Overall, all the optimizations
needed for file access inside Google's distributed data processing
have been made.

e) Lazy garbage collection improves the latency of individual chunk
accesses, sacrificing some overall throughput per unit time.

f) Block checksumming is used for consistency of chunks, and the
master's operation log is used for recovery.
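A minimal sketch of the block checksumming in (f), assuming CRC32 over 64 KB blocks as in the paper (the in-memory storage layout here is invented for illustration):

```python
import zlib

BLOCK_SIZE = 64 * 1024  # GFS checksums each 64 KB block of a chunk

def compute_checksums(chunk: bytes) -> list[int]:
    """CRC32 per 64 KB block, kept alongside the chunk data."""
    return [zlib.crc32(chunk[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk), BLOCK_SIZE)]

def verify(chunk: bytes, checksums: list[int]) -> bool:
    """Recompute on read; a mismatch means a block is corrupt and the
    chunk should be re-replicated from another replica."""
    return compute_checksums(chunk) == checksums
```

Flipping a single byte in any block makes `verify` return False, which is what lets a chunkserver detect corruption before handing data to a client.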

Flaws in the paper

a) The single master with multiple hot replicas will face problems
when metadata update pressure builds up, as in cases of heavy small-file
updates. (Note, however, that metadata read pressure can be alleviated with
read-only shadow masters, and when the single master fails, a replica
master can assume mastership almost instantly.)

Techniques used to achieve performance

a) Lazy cleanup: files are cleaned up lazily to improve response time.

b) Batching is applied in log writing, chunkserver requests, etc.

c) Distributing the load: chunkservers handle the data access
requests, not the single master.

d) Copy-on-write snapshots.
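The copy-on-write snapshot idea can be sketched as follows (per-chunk reference counts are an assumption; the paper describes the mechanism only at a high level):

```python
import itertools

_chunk_ids = itertools.count(1000)  # fresh chunk handles for lazy copies

class CowFile:
    """Snapshot duplicates only chunk references; a chunk is copied
    lazily, on the first write after the snapshot."""
    def __init__(self, chunks, refcount=None):
        self.chunks = list(chunks)  # list of chunk handles
        self.refcount = refcount if refcount is not None else {c: 1 for c in chunks}

    def snapshot(self):
        """Copy metadata only; bump each chunk's reference count."""
        for c in self.chunks:
            self.refcount[c] += 1
        return CowFile(self.chunks, self.refcount)

    def write(self, index):
        """Return the chunk to write; copy it first if it is shared."""
        c = self.chunks[index]
        if self.refcount[c] > 1:       # shared with a snapshot: copy
            self.refcount[c] -= 1
            c = next(_chunk_ids)
            self.refcount[c] = 1
            self.chunks[index] = c
        return c
```

The snapshot itself costs only a metadata copy; the expensive chunk duplication is deferred until a write actually touches a shared chunk.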

Tradeoff made

a) Simplicity of design vs. performance: a single master that handles
all metadata updates is a simple approach, but has the flaw that one master
must handle the entire metadata update load.

Another part of OS where this technique could be applied

Most of the techniques in this paper are the right set of already-used
techniques, picked for Google's requirements. Lazy cleanup and copy-on-write
have been used in other distributed file systems. The replication models and
load distribution techniques used in the paper have been used in databases
and other distributed systems in the past.

This paper describes the Google File System (GFS), a distributed file system that is scalable and provides high fault tolerance and throughput.

GFS was designed to meet Google’s internal data processing demands. Specifically, there are a large number of components, and they fail often. Files are usually very large, and a multiple-producer/single-consumer model is the norm.

A cluster in GFS contains a master server and many chunkservers. The master periodically communicates with the chunkservers through HeartBeat messages. Files are broken into large chunks which are replicated on multiple chunkservers. The master stores in memory the file and chunk namespaces and mappings, as well as the locations of chunks. The master makes global policy decisions as to where chunks are stored, preferably on different servers and different racks; however, a chunkserver has the final say as to which chunks are stored on it. Upon startup, the master builds up its memory structures by asking the chunkservers which chunks they contain, and by building up the file system state with a checkpoint/log system similar to what we have previously seen. A new operation, the record append, is introduced to support a multiple-producer model. Record appends are guaranteed to be atomic and to write at least once; however, they may write more than once, and this may vary from replica to replica (replicas are not guaranteed to be identical), and the application (consumer) is expected to be able to deal with duplicate records.
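Because record append is at-least-once, the consumer-side duplicate handling mentioned above can be sketched like this (producer-assigned record ids are an assumption, though the paper suggests putting unique identifiers in records):

```python
def read_records(records):
    """Yield each logical record once, skipping duplicates.

    Each record is assumed to carry a producer-assigned unique id,
    so that duplicates left by retried record appends can be filtered."""
    seen = set()
    for rec_id, payload in records:
        if rec_id in seen:
            continue  # duplicate from a retried record append
        seen.add(rec_id)
        yield payload
```

This is exactly the burden GFS shifts onto applications: the file may contain duplicate (or padded) regions, and the reader is expected to cope.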

My sense is that they were happy to get something working that works fairly well for internal processing of large amounts of data. In this light, one might ask if some of the design decisions (and tradeoffs discussed below) are optimal for their workloads or simply sufficient to meet their needs.

One technique used is to make things as simple as possible. This entails a tradeoff between the implementation of features and the number of guarantees that are made versus simplicity and time. Specifically, record appends write atomically at least once whereas traditional file systems may not append atomically but we might expect them to append exactly once. The POSIX layer is not implemented, and a number of features such as file permissions and quotas have only been implemented over time, as deemed necessary or useful. As well, while there are potential benefits to not having a master and distributing its functions, the functions implemented by the master were simply much easier to implement on a master, rather than distributed.

As well, given the expected workloads, a tradeoff was made to favor high bandwidth over low latency. While the system is expected to handle small files and small (random) reads and writes, performance is optimized for large files and large (sequential) reads and writes.


The paper describes the design of the Google File System, which aims to provide high aggregate performance and fault tolerance despite running on cheap commodity hardware. It has a single-master, multiple-chunkserver model, with many design decisions made to minimize the load on the master.

Problem attempted

The file system has been designed after critical observation of the workload at Google. The objective in building this system is to address the requirements revealed by that study, namely: the importance of sustained bandwidth over latency; optimizing for workloads with large reads and large sequential writes (appends); and providing atomicity for multiple clients appending to the same file.


1) Like the LFS paper, the design decisions have been made after a study of the workloads. The most important conclusions used to make design decisions are: a modest number of huge (multi-GB) files, workloads that most often append to files and rarely overwrite them, and applications that place a high premium on bulk data processing rates rather than low latency.

2) A GFS cluster has a single master and multiple chunkservers. A file is divided into chunks of 64 MB, and each chunk is identified by a globally unique 64-bit chunk handle and stored on chunkservers. All metadata is maintained by the master.

3) Many design decisions have been made to keep the single master from becoming a bottleneck of the system. The clients contact the master only for metadata; reading and writing file data go through the chunkservers. Besides, prefetching metadata further reduces client-master interactions. Similarly, the lease mechanism has been designed to reduce the overhead on the master. For example, lease extension requests are piggybacked on HeartBeat messages.

4) Operation logs are used to maintain a persistent record of metadata changes. As this information is critical, it is replicated to many remote machines. The master can recover its file system state by simply replaying its operation log.

5) GFS supports atomic record append operation for multiple clients to concurrently operate on a single file with atomicity. This is an important feature absent in AFS.
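The recovery-by-replay described in point 4 can be sketched minimally (the operation vocabulary here is an invented simplification; the paper does not specify the log format):

```python
def replay(log):
    """Rebuild the file -> chunk-handle mapping by replaying an
    operation log from the beginning (or from the last checkpoint)."""
    namespace = {}
    for op, path, *args in log:
        if op == "create":
            namespace[path] = []             # new file, no chunks yet
        elif op == "add_chunk":
            namespace[path].append(args[0])  # record a new chunk handle
        elif op == "delete":
            del namespace[path]
    return namespace
```

Checkpoints bound the length of the log that must be replayed: recovery starts from the checkpointed state and replays only the log tail written since.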


1) The paper mentions the problem of hotspots, wherein small files that fit in a single chunk are accessed heavily. This problem could occur with large files too, where a specific portion of the file, extending to a few tens of KBs, could be very important and heavily used. This too is an artifact of the large chunk size, wherein the entire important region would often reside in the same chunk.

2) Clients that cache chunk locations could potentially read from a stale replica. The paper dismisses this as a rare case and puts the onus on clients to depend on appends rather than overwrites; such corner cases must still be handled.


1) Having a large chunk size reduces the amount of metadata to be stored, but at the same time it wastes space due to internal fragmentation. The paper chooses a large chunk size because it provides crucial benefits, like reducing client-master interaction and hence network overhead.

2) With metadata in the master's main memory, all master operations are fast. But this could potentially limit the size of the whole cluster to what fits in the master's main memory. The paper nonetheless chose to store metadata in the master's main memory, reasoning that adding an extra piece of memory is a better solution.
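The paper's figure of under 64 bytes of metadata per 64 MB chunk makes this reasoning easy to check with back-of-the-envelope arithmetic:

```python
CHUNK_SIZE = 64 * 2**20   # 64 MB chunks
META_PER_CHUNK = 64       # paper: less than 64 bytes of metadata per chunk

def master_memory_needed(total_data_bytes: int) -> int:
    """Upper bound on master memory for chunk metadata."""
    chunks = -(-total_data_bytes // CHUNK_SIZE)  # ceiling division
    return chunks * META_PER_CHUNK

# One petabyte of file data needs at most about 1 GiB of master memory:
print(master_memory_needed(2**50) // 2**20, "MiB")  # prints "1024 MiB"
```

At a metadata-to-data ratio of one to a million, buying the master more RAM really is cheaper than redesigning around out-of-core metadata.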

Techniques used

1) Batching - The master batches several operation log records before flushing them so that the impact of flushing is reduced.

2) Separating data flow and control flow to improve the performance of pushing replicas

3) High throughput with relaxed latency requirements - Transfer of large files over a network is another place where the same principle is used.
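The batching technique in point 1 can be sketched as follows (the size-triggered flush threshold is an assumption; GFS batches log flushes, but the trigger policy is not fully specified):

```python
class BatchedLog:
    """Accumulate records and flush them in one write, trading a
    little latency for far fewer disk (or network) operations."""
    def __init__(self, sink, batch_size=32):
        self.sink = sink            # callable that persists a list of records
        self.batch_size = batch_size
        self.buffer = []

    def append(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.sink(self.buffer)  # one I/O covers many records
            self.buffer = []
```

Seven appends with a batch size of three cost only two flushes (plus one final explicit flush), instead of seven separate writes.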

The Google File System exemplifies the sort of thing that arises when one is willing to compromise both the system and its clients for a mutual benefit. Classic problems such as support of small files and effective buffer size prediction are neatly avoided because the environment tends not to engender them. The authors were further willing to sacrifice the POSIX file interface, and did so to their benefit. One constraint, however, was selected above all others to be unavoidable and significant: the authors expected hardware to fail, and it was expected to fail often. Even this constraint, however, known as a difficult problem to address, turned to their advantage as they were able to support clusters of inexpensive commodity hardware, both easily acquired and easy to lose.

Chunking, and more specifically chunking at a coarse granularity, was perhaps the most traditionally-rooted decision made in the system. Having larger chunks (and ergo larger block sizes, in some estimation) works much in favor of large files and long sequential accesses, which it so happened were features of Google's software environment. Large chunk sizes also meant that file chunks were easily addressable across multiple machines and reduced the cost of check-summing to a trivially small figure. A record-append operation was the most unconventional change detailed in the paper, and per the results appeared to be an optimization not for raw speed improvements, but rather for correctness without an appreciable performance penalty.

Google File System's special-purpose design is both its greatest asset and yet also clearly limits its appropriateness to a general computing environment. Desktop file systems, for instance, are certain not to benefit (in the common case) from greater availability of enormous files distributed across multiple machines. Similarly, most servers interact with many small files in a random-ish pattern; a 64MB chunk size is far too large to be efficient for these systems. Furthermore, while having a single "master" node is indeed an acceptable design decision for large, mainly-static files, it would be quickly overwhelmed in a system with high churn over small files. Ghost files, the hidden remnants of deleted files, would pile up and occupy too much space; control messages dealing with the creation of new files might equal or overwhelm the traffic caused by writing itself; internal fragmentation of the disks would obviously be a problem because of the chunk size. The authors suggest that, now that GFS is in place, applications are designed to take advantage of its good properties and avoid the negative ones, which is fine so long as they control the entire stack. For most technology companies, this is an impossibility.

Every system author makes trade-offs; the authors of the Google File System made these trade-offs aggressively to their advantage. Unfortunately, the only other people in the world who could benefit from these decisions were other people at Google, or perhaps their direct competitors (and not for long, it appears). This research does indicate, however, that custom filesystems work well for the environment for which they are designed (and vice-versa). With this in mind, this research might have been more widely impactful if the design were done in a more parameterized fashion. The chunk size, for instance, appears an arbitrary decision ("let's make it something big") from this paper, where I imagine one could study the effect of much larger chunks on the performance of the cluster. The implementation clearly benefits from a lack of generality, so I refuse to suggest that a general-purpose filesystem is in order from the same group. I only wish for a clearer separation of which measurable decisions were made arbitrarily rather than by experiment.

Something one could try with GFS-like systems would be a re-implementation of a log-structured file system with GFS as the underlying layer. The garbage-collection idea that GFS implements would be a particularly direct benefit to the "cleaner" in an LFS, perhaps separating this concern from the logic of the filesystem itself. Another idea might be to build a database designed to use GFS as its underlying storage mechanism; the name of the "record-append" operation suggests that the two may be significantly consonant in their storage goals. In particular, I'd imagine there are certain database applications that would benefit a great deal from quick access to contiguous records, and similarly quick storage of many new contiguous records at a time (especially if they never change). The large chunks could be thought of as a cache for new records, flushed in a record-append operation when finally full. The approach to the design of GFS, in general, should be one that database implementors enjoy; each side would be designing to the same center point, rather than one being forced to interact with the other's paradigm.

The paper describes the Google Filesystem, a distributed filesystem optimized for large data volumes and for commodity hardware of flaky reliability.

Problems trying to solve
From what I gather, the authors realize that their particular workload has a common case of huge files with minimal random writes and more of an append-to-the-file model. On the other hand, the authors want to run their distributed filesystem over inexpensive off-the-shelf hardware which may not be so reliable. GFS is designed primarily to address these two requirements.

- Minimal state is stored in the master, and state information is gathered in a more real-time fashion, allowing faults to be handled faster.
- Large chunk sizes are used, which fit the design goal of huge files well.
- Communication between clients and the master is kept minimal; the master just directs clients to the appropriate chunkservers. This improves scalability.
- Chunk replication and chunk checksumming improve reliability and fault tolerance.
- While the operation log is maintained on the master, it is checkpointed to improve performance and minimize replay overhead.

- The master is a single point of failure in the design. While replication of the master is available, the model doesn't look like a hot-standby model, so some downtime seems to exist.
- The consistency model looks flaky to me. A lot seems to be left to the applications to figure out.
- The filesystem will perform rather badly for small files, and if there's a huge volume of small files, metadata pressure can build up on the master.
- The directory structuring of filenames seems rather artificial given the flat namespace.

- A lazy strategy is adopted in many places to improve performance.
- Batching is employed in the operation log, chunk-to-chunkserver mapping, garbage collection, etc., to improve performance.
- A copy-on-write model is employed for snapshots.
- But the main technique is chunking and the distribution of chunks across the cluster.

- A major tradeoff is in selecting the chunk size. Larger chunk sizes reduce metadata volume and improve master performance, whereas smaller chunk sizes would have supported small files better.

Alternate uses
The same technique of chunking and distributing chunks, with a master server keeping track of chunk availability, is used in BitTorrent. There too, checksums are used for verifying integrity, although the checksums are stored in the torrent file.

I was disappointed reading the paper since there seemed to be so much hype around GFS but I didn't see much of anything new.


This paper describes the Google File system, a distributed file system built for the kind of workloads seen at Google. The authors describe the mechanisms and design decisions behind a scalable, fault tolerant file system which runs on inexpensive hardware.

What were they trying to solve:
Traditional file systems are not suitable for the scale at which Google generates and processes data: multi-gigabyte files are common. Google also utilizes inexpensive commodity storage, which makes component failures all the more common. Google's data update patterns are specific, and most updates append data to the end of a file. Traditional file systems do not guarantee consistency in the face of multiple concurrent updates, whereas using locks to achieve consistency hampers scalability by becoming a concurrency bottleneck.

Files are stored in Clusters. Each cluster has a master node, multiple chunkservers and multiple clients.
Files are divided into 64 MB chunks, each with a globally unique 64-bit handle and a version number. Each chunk is stored as a file in the underlying file system and is replicated on multiple chunkservers.
All metadata operations happen at master, data transfer happens directly between client and chunkserver.
Directory structure doesn't exist. The master maps pathnames to chunks through a prefix compressed lookup table. All metadata is in memory.
POSIX compliance is not provided. An atomic append operation is provided, which ensures consistency even with concurrent updates.
Control and Data flow is decoupled for maximum network bandwidth utilization.
Copy-on-write mechanism is used to take a snapshot of files without any overhead.
The master maintains an operation log to recover from crashes.
The master's state is also replicated to a remote site.
Lazy garbage collection: deleted files don't immediately get freed, they are renamed and hidden and cleaned up at a later time.
Block checksumming is used to ensure data consistency within a chunk.
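The lazy garbage collection described above (rename-and-hide, then reclaim later) can be sketched as follows (the hidden-name convention and integer timestamps are assumptions; the three-day expiry matches the paper's default):

```python
class LazyNamespace:
    """Sketch of lazy deletion: delete() just renames the file to a
    hidden name stamped with the deletion time; collect() reclaims
    hidden files older than the expiry interval during a later scan."""
    def __init__(self, expiry=3 * 24 * 3600):  # paper's default: ~3 days
        self.files = {}                        # name -> contents
        self.expiry = expiry

    def delete(self, name, now):
        # Deletion is just a metadata rename; no data is freed yet.
        self.files[f".deleted.{int(now)}.{name}"] = self.files.pop(name)

    def collect(self, now):
        # Background scan: reclaim hidden files past the expiry window.
        for name in list(self.files):
            if name.startswith(".deleted."):
                stamp = int(name.split(".")[2])
                if now - stamp > self.expiry:
                    del self.files[name]       # space reclaimed here
```

The deferral keeps deletion cheap on the critical path and, as a side effect, gives users a window to undelete, at the cost of space staying occupied in the meantime.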

As in any centralized system, the central server becomes the bottleneck of the system. Since only the master knows all the metadata maps, any downtime of the master leads to downtime of the whole cluster.
The consistency guarantees given by the system are not bulletproof: it is up to the client application to interpret the file contents while reading and ensure logical consistency.
Lazy reclamation of deleted file space might stop new file creation because of quota restrictions.
One general flaw is that this filesystem is very specific to Google's data access patterns; it is unclear if any of these techniques can be used in general-purpose file systems.

Techniques used to improve performance
Batching: most operations, like writing the log, chunkserver requests, and garbage collection, are not done synchronously but are batched to increase disk bandwidth.
Peer-to-Peer transfer model: Master is used only to establish a connection, actual data transfer doesn't involve master.
Copy-on-write snapshots and lazy garbage collection: examples of deferring operations to a later time or when there is a need.
Redundancy vs. fault tolerance: increased redundancy wastes disk space but provides higher fault tolerance.
Space utilization vs. management overhead: smaller chunk sizes waste less space, but increase the metadata that must be managed by the master.
Another part of the OS where the technique could be applied:
Any log structured file system, with the same append-at-end characteristic can adopt some of the features of GFS.

Summary: GFS is a distributed FS highly customized for Google's computing needs and clusters composed of thousands of commodity disks. GFS uses a simple master-server architecture based on replication and auto-recovery for reliability, and designed for high aggregate throughput.

Problem: Google clusters are based on large amounts of commodity hardware where "failures are the norm rather than the exception". In their particular computing environment the files are huge, latency is less important than throughput, the workloads are skewed towards appends, and hundreds of clients are serviced concurrently.

Contributions: An innovative FS highly customized for their needs, resilient to failures, and optimized for large files. Many things stand out, but the ones that seem really new are:
- A new, relaxed consistency model for file changes, "mutations".
- Atomic appends
- No data caching
- Files divided into fixed-size chunks (identified by unique 64-bit chunk handles), replicated on chunkservers.
- The concept of a "lease" given to a primary chunkserver, and the expiration term that comes in handy in case of chunkserver failure.
- One master server per cluster obtains part of its metadata by polling chunkservers (in HeartBeat messages), rather than storing it locally.

- Reduced interactions with the master: only for metadata. The data comes from chunkservers.
- Moving the task of serialization from the master to the primary lease holder.
- Caching of metadata on the client side, with an expiration time.
- Large chunks reduce metadata communication overhead. Small relative size of metadata.
- Efficient data structures: checkpoints implemented as compact B-trees that can be directly mapped into memory. The use of checkpoints also allows a small operation log.
- Batching: in the communication of chunk location; in the flushing of the operation log
- Piggybacking of lease communication on top of regular HeartBeat messages.
- Pipelining: in data flow and snapshots done in parallel with new writes.
- Use higher replication factor for read hot spots.
- The master performs maintenance tasks in the background, some only when time is available.
- Lazy evaluation for garbage collection and lock allocation.
- Load (re)balancing.
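The client-side metadata caching with expiration mentioned above can be sketched like this (the TTL value and eviction-on-read policy are assumptions; the paper only says cached locations time out):

```python
import time

class ChunkLocationCache:
    """Client-side cache of chunk -> chunkserver locations with a TTL,
    so most reads skip the master entirely."""
    def __init__(self, ttl=60.0):          # TTL value is an assumption
        self.ttl = ttl
        self.entries = {}                   # handle -> (locations, expires_at)

    def get(self, handle, now=None):
        now = time.time() if now is None else now
        hit = self.entries.get(handle)
        if hit and hit[1] > now:
            return hit[0]
        return None                         # miss or expired: ask the master

    def put(self, handle, locations, now=None):
        now = time.time() if now is None else now
        self.entries[handle] = (locations, now + self.ttl)
```

The expiry bounds how long a client can keep reading from a replica set the master no longer considers current, which connects to the stale-replica caveat other reviews raise.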

Tradeoffs: Consistency is a task shared between the FS and its clients. Keeping the master metadata in memory requires adding extra RAM as the FS grows. The client has to be aware of chunks. Relaxed consistency semantics are traded against concurrent updates. A deleted file has to be deleted twice to force it to be garbage collected.

Other parts where this can be applied: Not many. In massive map reduce operations for other web services.

Weaknesses: I think that the main drawback is its uniqueness. GFS is extremely good for Google's needs, but how many Googles are out there? Also, the relaxed semantics may be OK for a search engine, but unclear if it could work for a transaction system (workload issues aside), and clearly not in a trading system where latency is important.
There is a large gap between the size of a real system and the one in the microbenchmark. It would be interesting if they repeated the same experiments on a real cluster, if only to see how useful such microbenchmarks can be for performance prediction in such large-scale systems.

In this paper the authors present GFS, a distributed file system for large distributed data-intensive applications. GFS provides fault tolerance while running on inexpensive commodity hardware and it delivers high aggregate performance to a large number of clients. The authors first describe its design and its basic features and then give experimental results based on micro-benchmarks and real workloads.

GFS shares many of the same goals as previous distributed file systems, like performance and scalability. However, it is designed to work well on Google’s workloads and environment. For example, its basic goal is to manage large multi-GB files in an environment where hardware failures are common. Other workload characteristics that affected its design are the most frequent operations (large sequential writes and large reads).

A GFS cluster consists of a single master and multiple chunkservers and is accessed by multiple clients. The master maintains all file system metadata. Files are divided into fixed-size chunks. Chunkservers store chunks on local disks. One contribution of the system is that clients never read and write data through the master. They only ask the master which chunkservers they should contact and then cache this information for some time. In this way, better performance is achieved as client requests are spread across several chunkservers. Another contribution of the system is its recovery mechanism. Operation logs and checkpoints are used to restore consistency when a failure occurs. Metadata is replicated to other places (not only the master) so that the system can recover if the master fails. Garbage collection is done lazily, and chunk version numbers are used to identify up-to-date replicas. In this way, better performance is achieved without letting clients read out-of-date data. Another contribution of the system is that the flow of data is decoupled from the flow of control. This helps use the network more efficiently. Finally, GFS provides an atomic append operation, so that clients don’t need additional synchronization when appending to a file.

One flaw of the system is that extra checks must be done to ensure that the data is consistent. Each chunkserver uses checksumming to detect corruption of stored data because the consistency model doesn’t guarantee identical replicas. These extra checks may reduce the performance of the system. Another flaw of the system is that a chunkserver can become overloaded if a chunk is accessed by many clients. Finally, it would be interesting to see a comparison between GFS and other distributed file systems for Google’s workload.

The main techniques used are chunk replication in order to achieve fault tolerance and reliability, operation logs and checkpoints in order to restore consistency, lazy deletion and optimization for the common cases (large writes and large reads) among others. There is a tradeoff between space and consistency. In order to ensure data integrity many replicas of the files and metadata are stored. Another tradeoff is that performance is decreased as each write must be “transmitted” to all replicas.

The paper designs and implements the Google File System, a scalable distributed file system for large distributed data-intensive applications. It is designed to provide efficient, reliable access to data using large clusters of commodity hardware. The design has been driven by observations of application workloads and the technological environment, both current and anticipated. It provides fault tolerance while running on inexpensive hardware and delivers high aggregate performance to a large number of clients.
The goal the paper was trying to deal with:
The goal of this paper is to design and implement the Google File System which can reliably and economically store large amounts of data and provide high throughput access to applications under some specific assumptions.
1. Several important factors of the usage pattern are that bandwidth matters more than latency and that files are large. This means all those designs for minimizing control messages and cache-coherence messages can go away, simply because data traffic dominates the network due to the large file sizes, and applications don't care about several round trips of messages when opening a file. This simplifies the design a lot and enables the designers to lean the design more toward availability and manageability.
2. The "record append" semantics are good for a lot of applications. This is both useful and efficiently implementable. Therefore, it's better than enforcing UNIX file semantics over the network, which has been shown to be slow at best, if possible at all.
3. Snapshots efficiently make an exact copy of a directory structure or file, and record append allows multiple clients to append data to a file while ensuring atomicity.
4. Single master node serves metadata and provides atomic file system operations. Replicas and write-ahead logging provide reliability.
(1) The paper is rather like a case study. Although useful, most techniques are not general enough, and few design principles useful to others are discussed.
(2) There is only one master server; the design does not allow multiple masters. This appears to be a flaw limiting the system's scalability and reliability, since its maximum size and uptime are limited by the master server's capability and uptime.
(3) The performance of the Google File System on small reads and writes is not good.
The techniques used to achieve performance:
Google files are broken up into chunks, each with a unique handle assigned by the master server at creation time. A chunk is represented by a set of Linux files on a chunkserver. These chunks are replicated three times for redundancy. The master, after creating a chunk, keeps track of its location, copies, etc., and handles garbage collection. This metadata is stored in memory, not on disk. In addition, data to be written is transferred linearly from replica to replica to maximize throughput, and the master performs continuous garbage collection, rebalancing, and re-replication. Other techniques, including atomic record appends, snapshots, garbage collection, namespace management and locking, and master replication, are also presented in the paper to improve the performance of the file system.
(1) The Google File System is designed and optimized to run on Google's computing clusters, the nodes of which consist of cheap "commodity" computers, which means precautions must be taken against the high failure rate of individual nodes and the consequent data loss. To prevent data loss, they must implement replication algorithms and checkpoint systems.
(2) Other design decisions select for high data throughputs, even when it comes at the cost of latency.
(3) Writes have to flow through multiple racks.
Another part of the OS where the technique could be applied:
The technique of snapshots used in this paper can also be used in other parts of an operating system. For example, some Unix systems have snapshot-capable logical volume managers. These implement copy-on-write on entire block devices by copying changed blocks, just before they are to be overwritten, to other storage, thus preserving a self-consistent past image of the block device. And in virtualization, some system emulators host a guest operating system in a virtual machine; some (including VMware and Virtual PC) can perform whole-system snapshots by dumping the entire machine state to a backing file and redirecting future guest writes to a second file, which then acts as a copy-on-write table.

The Problems

Google's GFS is a distributed filesystem designed to support applications that process large volumes of data in a cluster of unreliable, commodity hardware. The design is throughput-centric, fault-tolerant, and has a simple, non-standard API: a direct result of GFS needing to satisfy different needs in relatively new datacenter environments.

In particular, the authors highlight increased likelihood of many different types of component failure, large average file size (often > 1 GB), and commonplace M to N relationships between data producers and data consumers.

The Solutions and Associated Trade-offs

To handle component failure, chunks of files are replicated on different data storage machines ("chunkservers") within the filesystem. Chunkservers inform a master server of what chunks of data they are storing, allowing the master server to track their location and replicate them as necessary when hardware dies. GFS can capture consistent snapshots of these chunks, using a copy-on-write approach.

Each chunk is checksummed locally on write; the sums are compared against checksums recomputed during later reads of the same chunk. Bad chunks are discarded, and the master simply re-replicates the chunk elsewhere. The CPU cost of checksumming is considered negligible compared to the I/O cost.
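The scheme can be sketched as follows: each 64 KB block of a chunk gets a checksum on write, and a read recomputes and compares before returning data. (CRC32 here is a stand-in; the paper does not specify the exact checksum function.)

```python
import zlib

BLOCK = 64 * 1024  # GFS checksums chunks in 64 KB blocks

def checksum_blocks(chunk: bytes):
    """Compute one CRC32 per 64 KB block of the chunk."""
    return [zlib.crc32(chunk[i:i + BLOCK]) for i in range(0, len(chunk), BLOCK)]

def verify(chunk: bytes, sums):
    """Return True iff every block still matches its stored checksum."""
    return checksum_blocks(chunk) == sums

data = b"x" * (2 * BLOCK)
sums = checksum_blocks(data)
assert verify(data, sums)
corrupted = b"y" + data[1:]
assert not verify(corrupted, sums)   # a single flipped byte is detected
```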

Updates to a chunk must be pushed to all the copies of that chunk. This is accomplished using a linear pipeline of data transfer traffic between chunkservers. Each chunkserver is responsible for picking the next server to send data to in a (presumably greedy) way that reduces network IO.
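That greedy next-hop selection can be sketched as: starting from the source, each hop forwards to the closest replica that has not yet received the data. The `distance` function is a hypothetical stand-in for network proximity, which the paper estimates from IP addresses.

```python
def pipeline_order(source, replicas, distance):
    """Greedy forwarding chain: each hop picks the closest remaining replica.

    `distance(a, b)` is an assumed proximity measure between two machines.
    """
    chain, current, remaining = [], source, set(replicas)
    while remaining:
        nxt = min(remaining, key=lambda r: distance(current, r))
        chain.append(nxt)
        remaining.remove(nxt)
        current = nxt
    return chain

# Toy topology: distance = absolute difference of server "addresses".
dist = lambda a, b: abs(a - b)
assert pipeline_order(0, [9, 2, 5], dist) == [2, 5, 9]
```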

Chunks weigh in at 64 MB, considerably larger than the block sizes of most of today's filesystems. The large chunk size was picked to minimize the burden on the master server, which as a result uses less memory to store the state of all the chunks and sends less chunk-state information to clients. In addition, increasing the chunk size amortizes the message-passing cost associated with transferring data in general.
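The memory savings are easy to estimate: the paper states the master keeps less than 64 bytes of metadata per 64 MB chunk, so a petabyte of data needs on the order of a gigabyte of master memory (a back-of-envelope calculation, not a figure from the paper):

```python
CHUNK = 64 * 2**20      # 64 MB chunk size
META_PER_CHUNK = 64     # < 64 bytes of master metadata per chunk (per the paper)

petabyte = 2**50
chunks = petabyte // CHUNK
meta_bytes = chunks * META_PER_CHUNK

assert chunks == 2**24        # ~16.8 million chunks per petabyte
assert meta_bytes == 2**30    # at most ~1 GB of master memory
```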

Often, one or more data producers need to interleave their output into one or more data consumers. A file stored by GFS can be wielded in this manner, acting like a queue that producers push products into while consumers pop them out. The push operation is accomplished using a special append that guarantees one or more copies of the new data will be added to the file specified.
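Because record append guarantees at-least-once rather than exactly-once delivery, a consumer typically filters out retry-induced duplicates itself, for example by tagging each record with a unique id. A hypothetical consumer-side sketch (not GFS client-library code):

```python
def dedup(records):
    """Yield each record's payload once, skipping duplicates by record id."""
    seen = set()
    for rec_id, payload in records:
        if rec_id in seen:
            continue          # duplicate left behind by an append retry; drop it
        seen.add(rec_id)
        yield payload

# Record id 2 was appended twice because a replica failed mid-append.
log = [(1, "a"), (2, "b"), (2, "b"), (3, "c")]
assert list(dedup(log)) == ["a", "b", "c"]
```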


The paper doesn't quantify many negative effects of the design decisions enumerated:
* How much unnecessary replication of data objects written to a file chunk occurs on average? What conditions is this likely to occur under?
* How fragmented does the filesystem tend to get for a particular workload?
* How expensive are these chunk copy placement constraints (i.e. "spread copies across racks") in terms of network IO? What are some reasonable placement policies that perform load balancing while promoting physical storage diversity amid chunk copies?

I don't buy a centralized master server increasing the system's reliability; instead, it seems like a major barrier to scaling. Keeping the chunk lookup table and filepath-to-chunklist map resident in memory is a good idea, but it requires tracking the contents of the entire filesystem within the confines of a single machine. It might be better to shard this master metadata across several machines instead of buying more RAM. Sharding would not prevent the use of shadow masters for each shard, and it would increase the system's flexibility: more room to store chunk state and metadata, smaller chunk sizes become practical, and the handling of heartbeat signals can be split up further.

In this paper, the Google File System is presented. GFS is a distributed filesystem developed specifically for Google's workloads. The design goals were the ability to support hundreds of machines and high throughput to user applications.

One interesting observation in the paper is the failure model: they expect that at any given moment some nodes will be down. This is a significant departure from traditional distributed filesystem designs, which assume that failures are rare and that the set of participating computers does not change frequently. In my opinion, the second biggest contribution of this paper is that it demonstrates what can be gained by designing a domain-specific filesystem that breaks the POSIX API and widely accepted consistency guarantees.

One flaw of the design is the decision to have a single master, which limits the availability of the system. Although the authors argue that a takeover can happen within seconds, I believe the most important implication is that a failed master might lose some operations if they have not yet been recorded in the log. Relying on a quorum among multiple masters seems a straightforward extension and could provide better performance. Finally, the biggest flaw of the paper is its limited applicability: as a graduate student, I learn nothing new about designing a generic filesystem. Furthermore, the absence of a comparison with another filesystem on the same workload, or of a workload description sufficient for repeatability, gives little evidence of what is actually won by breaking the POSIX API and the consistency guarantees.

The design of the filesystem makes a lot of problems associated with scalability much easier to solve. On the other hand, it requires all applications to be rewritten to use the new interface and to make their own isolation and consistency checks.

This paper presents the techniques used in the implementation of GFS at a very high level (of course, it does not have to be public :)). GFS provides fault tolerance and high aggregate performance (using the New Jersey approach of doing things). The authors completely give up the POSIX interface and keep the implementation simple enough to just satisfy their anticipated workload. Today, it successfully serves the world's largest data repository, which keeps multiple archives of the whole Internet.

The major problem targeted by the authors is not just scalability but also fault tolerance and high aggregate performance. The authors target GFS at an anticipated workload of huge files that are mostly appended to or read sequentially.

GFS revisits the traditional FS assumptions in the light of its anticipated workloads and proposes some radical changes for a distributed FS. GFS makes two major contributions with these changes.

First is the domain of fault tolerance. GFS proposes constant monitoring (HeartBeat messages) by a master server for each cluster and its chunkservers. Further, it uses checksums for error detection in the face of failed components (disks) and driver issues, and automatic recovery: shadow masters plus replication of master state and of chunks across different chunkservers.

Second is the domain of performance. Based on observations such as the file size distribution and access patterns, GFS uses a large chunk size of 64 MB and decouples data and control transfer between the chunkservers and the master. For handling concurrent access, an append-at-least-once mechanism is proposed, along with chunk leases and explicitly serialized mutation orders.

GFS uses the "closest" chunkserver for replication, but the term "closest" is never explained: is it based on network latency, does it factor in the network load on that chunkserver, or does it include system latency (such as disk latency)? And how does one chunkserver figure out the closest server on the fly?
GFS uses fine-grained locking over filenames for handling concurrent access. However, such fine-grained locking comes with added management overhead for the master server and requires locks to be acquired in a consistent total order. Chubby locks (OSDI '06) later proposed a coarse-grained locking alternative for GFS to elect primary servers.
The authors lament performance deterioration (in read/write bandwidth) due to the network stack and their pipelining approach. I believe this is also attributable to other factors, such as the 64 MB chunk size: it requires excessive packet fragmentation and reassembly over Ethernet with an MTU of 1500 bytes (unless jumbo frames are used). Further, GFS clients use the canonical name of the master for each query, requiring a DNS lookup per request. This incurs a considerable added delay per request, and the number of DNS lookups grows manifold with the number of clients.

Techniques used:
Overall a great paper, with many techniques working in tandem. The major ones are lazy space allocation for chunks and lazy deletion, which help reduce internal fragmentation; replication of state and chunks, which provides fault tolerance; and optimizing for the common case of huge files with minimal random writes.

Chunk replicas are spread over machines across racks. This trades increased fault tolerance against added write traffic, which has to flow through multiple racks. Similarly, lazy deletion trades delayed space reclamation, and replication trades 3x more disk usage, for the reliability they provide.

Alternative uses:
The techniques used in GFS have been known for a long time and have been used in many other places: for instance, replication in RAID, a centralized master server for control traffic (and location independence) as in AFS, and lazy deletion and garbage collection as used in programming languages and in many other filesystems such as LFS.

In this paper, the authors present a scalable distributed file system that was used by Google for its storage needs. GFS was not started as a research project (like AFS) but was meant to satisfy Google's requirement for a distributed file system that can scale to TB's of data with thousands of unreliable disks.

Google has a huge number (in the billions) of files for indexing all the web pages, and these files tend to be very large (multiple GBs) as well. Their workloads consist mostly of streaming reads of, and appends to, such large files. They needed a distributed filesystem that could scale easily and was optimized for this workload. And since Google uses commodity hardware, it needs to handle failure gracefully without affecting usability.

- For simplicity, they go with a master-slave model instead of a completely distributed system. The master maintains all the metadata and mappings in memory, and the chunkservers store chunks (64 MB pieces of a file) as files in a normal Linux filesystem (with 64 KB blocks). As a result, the master has very little state to store (thanks to the large chunks), which can then be kept in memory. Chunks provide advantages similar to RAID striping plus mirroring across multiple disks.
- Since only one server (the master) has a high-level view of the entire system, it can intelligently allocate chunks to the chunkservers based on load, and even rebalance them when required.
- The most common operation, streaming reads of chunks, happens without any communication with the master, which keeps the master from becoming a bottleneck.
- Splitting the data and control paths for writes lets the data take the fastest route through the chunkservers instead of being bottlenecked by the outgoing bandwidth of the master server.
- Since their major concerns were fault tolerance and correctness on commodity hardware, features such as chunk replication, operation logs, master state replication, and block checksumming ensure that failures have no impact on performance or disk consistency. (Consistency lost to the at-least-once semantics is left to the application.)

- The master can become a single point of failure. Also, a restarting master must replay the last few events (those after the snapshot) and wait for heartbeats to learn the state of each chunkserver. During this time, all requests to the master go unanswered or return inconsistent results.
- Most consistency checks are pushed to the application (not even the client library), so the application must maintain ids/checksums to ensure the records are consistent.
- Small files can increase the pressure on the metadata that must be stored on the master. One (unintentionally) rogue process can drastically increase the number of chunks, the metadata volume, and fragmentation (in the 64 KB Linux blocks).

- Chunk replication(redundancy) is used to achieve reliability/fault tolerance.
- Batching is used in the operation log, chunk location requests by clients, garbage collection.
- Copy on write is utilised to prevent unnecessarily copying chunks that don't change while creating a snapshot.
- This file system can be used by a distributed DB that uses similar hardware and with slight modifications to make it use appends(log structured) instead of in-place writes.

The Google File System is a proprietary distributed filesystem developed by Google to provide efficient, reliable access to data using large clusters of commodity hardware. GFS is optimized for Google's core data storage and usage needs, which generate enormous amounts of data that must be retained.

Problem trying to solve: GFS shares many design goals with previous distributed filesystems, such as performance, scalability, reliability, and availability. However, Google's workloads and technological environment, both current and anticipated, led to new design choices for better performance. The system is built from inexpensive commodity hardware that tends to fail and requires constant monitoring. It stores a large number of big files and should be optimized for them rather than for smaller files. Reads/writes need to be batched together, and the system must provide efficient semantics for multiple clients appending to the same file.
• Files are divided into chunks, each identified by a globally unique 64-bit chunk handle assigned by the master at chunk creation. By default there are three replicas of each chunk. The 64 MB chunk size reduces how often clients must interact with the master and reduces network load. The reduction is significant for Google's workload because their applications mostly read and write large files sequentially.
• Neither the client nor the chunkserver caches file data. Caching huge files yields little benefit, and this policy eliminates cache-coherence issues.
• The master holds all the metadata; the client asks the master which chunkserver to contact, but further reads/writes require no involvement from the master.
• The master logs file/chunk mutations to an operation log on its local disk and replicates the log on shadow master machines. This log allows the master's state to be updated reliably. The master checkpoints its state to keep the log small and shorten startup time after a failure.
• The master doesn't keep a persistent record of chunk replica locations; this removes the need to keep the master and chunkservers in sync, which works best in practice.
• GFS relies on appending writes instead of overwrites; combined with frequent checkpoints, snapshots, and replicas, the chance of data loss is very low, and failures result in data unavailability, not data corruption.
• The GFS master places new replicas on chunkservers with below-average disk utilization, and also rebalances replicas periodically based on disk utilization and load. Storage capacity is reclaimed slowly: rather than deleting eagerly, the master lets old chunks hang around for a few days and reclaims storage in batches.
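The operation-log and checkpoint mechanism can be sketched as: on restart, the master restores the latest checkpoint and then replays only the log records written after it. A hypothetical sketch with invented record shapes:

```python
def recover(checkpoint, log):
    """Rebuild master state: start from a checkpoint, replay newer log records.

    Each log record is (seq, op, path, value); the checkpoint carries the
    sequence number of the last record it reflects. All shapes are assumptions.
    """
    state, last_seq = dict(checkpoint["state"]), checkpoint["seq"]
    for seq, op, path, value in log:
        if seq <= last_seq:
            continue                 # already reflected in the checkpoint
        if op == "set":
            state[path] = value
        elif op == "delete":
            state.pop(path, None)
    return state

cp = {"seq": 2, "state": {"/a": 1, "/b": 2}}
log = [(1, "set", "/a", 1), (2, "set", "/b", 2),
       (3, "set", "/c", 3), (4, "delete", "/a", None)]
assert recover(cp, log) == {"/b": 2, "/c": 3}
```

The more recent the checkpoint, the shorter the replay, which is why checkpointing keeps restart time small.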

As a general-purpose commercial product, it would suffer some serious shortcomings. Performance on small reads and writes, which it wasn't designed for, isn't good enough for general data-center workloads. And since appending is key to GFS write performance in a multi-writer environment, GFS might give up much of its performance advantage even in large serial writes.
They probably should have included an experiment studying the impact of killing the master.

Performance: Not having client or chunkserver caches simplifies the client and the overall system by eliminating cache-coherence issues. Since metadata is stored in memory, master operations are fast. Periodically checkpointing the master state keeps the operation log, and hence system startup after a failure, small.

Tradeoff: Data is made consistent across all chunkservers at the cost of storing versioning information at each replica and at the master. Large chunk sizes trade small-file reads for large-file reads. Data integrity is provided by the extra checksums stored for each chunk. Simplicity is traded for duplication in appended writes: if a record append fails at any replica, the client retries the operation, which may result in duplicates of the same record.

Another area: The same model can be applied to a local file system; e.g., in RAID we could have replicas of a master along with mirrored disks. Snapshots/checkpointing can be applied to any application server.


This paper introduces the Google File System (GFS), a distributed FS designed around observations of the specific workloads of Google applications. It is intended as a good solution for large, distributed, data-intensive applications, and it needs to be fault tolerant and able to run on inexpensive computers.

This FS was designed with the particularities of Google applications in mind. Some of the important points to solve are: 1) faults are common, so the FS needs to be fault tolerant; 2) files are very big, so I/O operations need some modifications; 3) files are usually only read, random writes are rare, and the common write is a large append to the end of the file, an operation that needs to be optimized; 4) atomic operations are needed for writes to a shared file.

A cluster is composed of a single master and multiple chunkservers. Files are divided into chunks (64MB) and identified by the chunk handle (64 bits). Clients communicate with the master for metadata about the files, but the data passing is through the chunkservers. This approach reduces the load of the master. The master communicates with the chunkservers for state update information through a batch process.

Metadata (the file and chunk namespaces, the file-to-chunk map, and replica locations) is stored in the master's main memory, reducing latency, although the system's size is then bounded by the master's memory size. In practice this is usually not a limitation.

The master is responsible for all namespace operations. Each node in the namespace has a read-write lock, and the appropriate locks must be acquired for each operation. This achieves serialization without making the master a bottleneck, since operations on different parts of the namespace can proceed concurrently. Modifications to replicated chunks are performed using leases, which maintain a consistent order across all the replicas.

The garbage collector works in the background as part of the state-update exchanges between the master and the chunkservers, which allows it to run in times of low activity. Some time must pass between a file's deletion and its chunks being freed, to guard against accidental deletion of files.
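Lazy deletion can be sketched as renaming a file to a hidden, timestamped name and only reclaiming its chunks after a grace period. The paper's default interval is three days; the naming scheme below is illustrative, not GFS's actual one.

```python
GRACE = 3 * 24 * 3600   # reclaim only after three days (paper's default)

def delete(namespace, path, now):
    """Logically delete a file: rename it to a hidden, timestamped name."""
    namespace[f".deleted.{now}.{path}"] = namespace.pop(path)

def collect(namespace, now, grace=GRACE):
    """Physically reclaim hidden files whose grace period has expired."""
    for name in list(namespace):
        if name.startswith(".deleted."):
            ts = int(name.split(".")[2])
            if now - ts > grace:
                del namespace[name]

ns = {"/tmp/x": "chunks"}
delete(ns, "/tmp/x", now=0)
collect(ns, now=1)            # too soon: the file is still recoverable
assert any(n.startswith(".deleted.") for n in ns)
collect(ns, now=GRACE + 1)    # grace period passed: space is reclaimed
assert ns == {}
```

Until the grace period expires, the hidden file can simply be renamed back, which is what makes accidental deletion recoverable.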

For the performance evaluation only one or two clusters are used, with 16 chunkservers and 16 clients, which is far from a realistic scale. The numbers do not account for forwarded writes or rebalancing of data, and there is no study of how much space is wasted by replication or by the 64 MB chunk size.

It is not very clear what information the master has about each of its chunkservers. Does it have the number of free chunks, or just the mean disk utilization? Does it know each server's capacity? Where are permissions stored: in the master or in the server?

This system is tied to the existing Google applications; how would it perform on applications with totally different characteristics? Could it be modified to perform well for different applications? How would it behave for applications developed in the future?

Performance is obtained by having a master that performs the file-to-chunk mapping and chunkservers where all read-write operations are performed, and by keeping all the metadata in the master's main memory. Allocating new files on the least-utilized chunkservers and rebalancing also improve performance, as does running the garbage collector in the background, where it works slowly and does not interfere with other operations.
The big chunk size (64 MB) is a tradeoff between internal fragmentation and performance. Replication is a tradeoff between space, write performance, and the complexity of managing replicas on one side, and data integrity and read availability on the other.
Other techniques that can be used in different parts of an OS are: atomicity in accessing any shared structure, checkpoints for debugging and consistency, checksumming for integrity, and hierarchy as a means of dividing tasks and protection.

This paper describes the Google File System (GFS), a distributed filesystem developed for the specific application load at Google.

Existing distributed filesystems are developed for general-purpose use, so they perform roughly equally well with small or large files, random or full reads, and random writes or appends. The workload at Google displays specific properties: large files, large streaming reads, and mutations that are mostly record appends. Existing filesystems are not tuned for such a workload, so GFS was developed from scratch with this workload in mind.

Work Summary/Contribution:
1. GFS has a one-master, multiple-chunkserver architecture. The master keeps the metadata of files/directories, and the chunkservers hold the file data. Since only one master is active at a time, the interaction between master and client is highly optimized: the master keeps all its metadata structures in memory so client requests can be processed fast, and a client contacts the master only to get chunk locations, after which the master has no further involvement unless an error occurs.
2. The operation log helps the master node recover quickly from a crash; combined with checkpoints, it keeps startup time very short.
3. The master serializes mutations to the same file from multiple clients using leases. A lease is given to one of the chunkservers holding the chunk (called the primary), and the primary then orders all mutations.
4. Garbage collection is done lazily, which improves performance.
5. Data flow exploits full-duplex links. Because a client waits for its write to reach all copies of the file, sending the data from the client to every chunkserver directly would perform badly. Instead, a GFS client sends the data to the closest chunkserver, which starts forwarding to the next chunkserver while still receiving. Thanks to this chaining of the data flow, performance does not degrade drastically as the number of replicas increases.
6. GFS provides control over replication: different replication levels can be defined for different parts of the file namespace. For higher availability, the master is also replicated.
7. For data integrity, checksums are kept at the chunkservers, enabling a chunkserver to verify data without comparing it against the other copies.
8. The paper gives statistics of real workloads on production clusters, and the data makes it clear that GFS handles the workload efficiently.
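The lease mechanism from point 3 can be sketched as: the master grants a time-limited lease to one replica, and that primary assigns serial numbers that every replica applies in the same order. A toy model with hypothetical names:

```python
class Primary:
    """Toy lease-holding primary that serializes mutations for its chunk."""
    def __init__(self, lease_expiry):
        self.lease_expiry = lease_expiry   # time at which the lease lapses
        self.next_serial = 0

    def order(self, mutation, now):
        if now >= self.lease_expiry:
            raise RuntimeError("lease expired; client must re-ask the master")
        serial = self.next_serial
        self.next_serial += 1
        # Every replica applies mutations in this serial order.
        return (serial, mutation)

p = Primary(lease_expiry=60)
assert p.order("write A", now=0) == (0, "write A")
assert p.order("write B", now=1) == (1, "write B")
```

The time limit matters: if the primary becomes unreachable, the master can safely grant a fresh lease to another replica once the old lease has expired.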

1. If there is a hotspot (the same file/chunk accessed by many clients), a chunkserver can become overloaded.

GFS is developed for specific applications, so many tradeoffs were made for the common cases:
1. The chunk size is 64 MB, so the creation of large files is optimized. But this can create a hotspot problem if a small file (less than 64 MB) is accessed by many clients: even if they access different parts of the file, they end up hitting the same chunk, overloading that chunkserver.
2. Chunk replicas are placed in multiple racks. This increases availability in case one rack goes down, but it hurts write performance, since the write data must be sent to multiple racks.

Another part of OS where technique can be applied:
1. The technique of serializing updates from multiple clients is used in many places. For example, in Google Docs multiple users can update a document simultaneously, and I believe the updates are serialized internally.
2. Techniques like replica placement, load balancing, and lazy garbage collection can be used in other distributed systems.

The Google File System is a distributed file system that uses large numbers of commodity machines to build a fault-tolerant storage system for large files. It targets a workload of large reads and large sequential appending writes with high sustained bandwidth.

Description of Problem
GFS was designed specifically for a workload of large files, large reads, and large sequential writes. In addition, inexpensive components with high failure rates would be used, while high reliability, high bandwidth, and large storage capacity in a production environment remained important. Previous work targets aspects of these design requirements but does not fulfill all of them.

- A large-scale processing file system on commodity hardware. Utilizing commodity hardware reuses older machines and saves the cost of expensive servers.
- File-system simplification using a single master file server with shadow masters for redundancy. This server is responsible for mapping chunk locations and for lazy rebalancing, monitoring, and garbage collection.
- A file system whose features are designed around hardware failure.

With the data passing mechanisms and close proximity requirements of GFS, it does not seem possible to expand outside of a small local area. If a site failure were to occur, none of the data would still be available, unless there was a way to expand redundancy to multiple sites very distant from each other.

Techniques include batching non-critical operations until heartbeat messages, lazy replication, garbage collection and checksums for fault tolerance, designing the file system specifically for the target application, and storing master metadata in memory for fast lookups. One trade-off this system makes is fast sequential writes and reads in exchange for slow small random reads and writes; another is fault recovery in exchange for the space lost to redundancy. To get high throughput, latency (response time) is sacrificed. A good example of high throughput being preferred over latency is the GPU: it performs highly computational operations such as 3D rendering or shading, yet the output only needs to be shown every 16 ms to achieve a 60 Hz rate.

Summary: The authors of this paper present the Google File System, a large distributed file system built upon cheap commodity hardware that is still able to provide fault tolerance, high aggregate throughput, and support for a large number of clients. To achieve this they present a new centralized architecture consisting of a single master and many chunkservers.
Problem to Solve: Unlike other file systems, the Google system needed to support large multi-GB files and cope with prevalent hardware failures. By studying the primary applications the system would support, they noticed that reads occurred mainly sequentially and that random writes were rare, while writes appending to existing files were not.
Contributions: First, they introduce 64 MB chunks as the storage block size, tailoring the file system to quickly locate and read the large files stored in the Google File System. Secondly, they introduce a new centralized architecture that maintains a single master with multiple chunkservers; it handles multiple clients and allows chunks to be replicated so that requests can be divided to reduce latency. They also introduce HeartBeat messages as a way to track state throughout the system. They store checksum information about each chunk to prevent replication of corrupt data; although this is common, they propose a way of incrementally updating the checksum as sequential writes occur to a block, which reduces the overhead of keeping checksums while also allowing a quick check of whether blocks are up to date. They also use atomic appends to allow multiple clients to write to a given block at one time: clients write as they wish without waiting for locks to be released, and the location where the information was written is automatically returned to the client. They use checkpoints/snapshots to create copies of data without making every process wait for the operation to complete. Finally, they implement lazy garbage collection, which reclaims blocks based on a hidden name with a marked timestamp; this frees system resources for processes when necessary and performs collection during low-utilization periods.

Flaws: One flaw of the paper is that the frequency of the master going down, and its impact on overall performance, is never really explained. They explain their process well, but I'm left with questions like: how often does the monitor check that the master is up? How long does the system wait for a master restart before declaring it dead and switching to the shadow?
What tradeoff is made: One tradeoff is that the larger block size allows quicker reading of large sequential files, at the cost of some fragmentation with smaller files. They describe an algorithm to limit this fragmentation, but it will undoubtedly occur in certain situations. Next, they build all of their infrastructure on cheap hardware, which lowers operating costs but brings frequent failures; to prevent data loss they then have to implement the replication algorithms and checkpoint system.
Other places where a technique could be applied: One place where atomicity is becoming increasingly useful is documents stored online, where the system allows interactive editing by multiple users at once. The idea of lazy garbage collection could also be applied to any part of the system that uses a lot of resources at one point but has very low utilization at other times.

Ghemawat et al. describe the design and performance of GFS, the Google file system. GFS is designed to be a highly reliable distributed file system which can handle the large data-intensive applications common to Google.

Attempting to “Organize the World’s Information,” Google has no shortage of data. Indeed, most of the functions Google performs operate on much larger amounts of data and in patterns different from most other applications. Additionally, Google makes use of vast amounts of commodity hardware, so failure of individual components is expected and must be planned for. Such a different workload and environment have characteristics that are best solved by a new approach to the file system.

• Realizing the different requirements of the workload and optimizing for performing reads and writes in large chunks, which are representative of the most common task.
• Focusing on the reliability of the system even in the face of component failures: data is check-summed and replicated on multiple, independent pieces of hardware. Since the replication is done in three areas, failure of any two components would not disable the system.
• Introducing self-healing tendencies to further improve overall system reliability. When it is found that data becomes corrupted or hardware with that data fails, the system detects the problem and replicates data from a good copy to another location, thereby restoring system reliability without the necessity of human intervention. Such a system could conceivably continue to function normally even with considerable failures remaining unrepaired.
• Employing a “master” to, among other things, help ensure consistency and to simplify the system. This master helps to ensure sufficient replication of data, maintains indexes of the data’s location, rebalances, and finds space for new data. This helps to simplify the design at the expense of having a single point of failure. Luckily, the master itself does have redundancy and steps were taken to ensure that a failure of the master is quickly recovered from.

• The chunkservers run the file system as a user-level server process. While the details are not completely clear, this is likely less efficient than implementing the file system directly in the kernel. Justification of this design decision would have set a stronger foundation for the paper.

GFS uses the conventional file system technique of using large chunks of data to improve read and write performance and sacrifices the performance of small files; essentially, GFS allows latency to suffer in order to obtain better throughput. However, this sacrifice is not too problematic because the workload which runs GFS is constrained to primarily large files.
