
The Google File System

The Google File System. Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. In 19th ACM Symposium on Operating Systems Principles, Lake George, NY, October, 2003.

Comments

1.Summary
This paper describes the Google File System, which is widely deployed within Google as a storage platform for the generation and processing of the high volumes of data used by Google services as well as research. It stores hundreds of terabytes of data across thousands of disks on over a thousand commodity machines and is concurrently accessed by hundreds of clients.

2.Problem
Distributed file systems typically aim for scalability, reliability and availability. However, at the time of this publication, conventional systems posed limitations and challenges that led to the GFS design:
* Component failures are the norm rather than the exception.
* Files are huge by traditional standards; multi-GB files are the norm.
* Most files are mutated by appending new data rather than overwriting existing data. Random writes are very rare.
* High sustained bandwidth is more important than low latency.

3.Contributions
The main contribution of the paper is a file system that can scale to a previously unseen level while running on commodity hardware, with built-in monitoring, fault tolerance and automatic recovery. The paper presents the design and an evaluation of the GFS implementation, with details on how the problems mentioned above are tackled.
Some of the highlights of the architecture are:
* single master: maintains the metadata (namespace, access control information, file-to-chunk mappings)
* many chunkservers: store chunks of files as ordinary files in the local file system (see the read-path sketch after this list)
* chunks are replicated across chunkservers for reliability
* crash recovery: if a chunkserver crashes during a mutation, the client retries; if the master crashes, it replays the operation log and waits for chunkservers to check in to rebuild the chunk-location mapping
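As a rough illustration of how these pieces fit together, here is a minimal Python sketch of the client read path: the byte offset is turned into a chunk index, the master is asked only for metadata, and the data itself comes from a chunkserver. The master.lookup, pick_closest and chunkserver.read helpers are hypothetical, not APIs from the paper.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # GFS uses fixed 64 MB chunks

def gfs_read(client, filename, offset, length):
    # Translate (filename, byte offset) into a chunk index within the file.
    chunk_index = offset // CHUNK_SIZE
    # The master is contacted only for metadata: chunk handle + replica locations.
    handle, replicas = client.master.lookup(filename, chunk_index)
    # Data flows directly from the closest chunkserver replica, bypassing the master.
    chunkserver = client.pick_closest(replicas)
    return chunkserver.read(handle, offset % CHUNK_SIZE, length)
```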

4.Evaluation
The authors provide a few micro-benchmarks to illustrate the bottlenecks inherent in the GFS architecture and implementation. The read rate reaches 75-80% of the theoretical limit (set by the network bandwidth) and the write rate reaches about half of the limit. Further, several clients can simultaneously append to a file at a reasonably high rate (bottlenecked only by network congestion). This demonstrates the high sustained bandwidth the authors aim to achieve. The authors also present data on the overhead at the master, the distribution of workload at the chunkservers, and the recovery time after a crash, showing the scalability, fault tolerance and automatic recovery aspects of the system.
Overall, the authors measure and report the important aspects of the design. It would have been useful to compare against another system (the system in use at Google before GFS) and report how GFS is better or worse along different dimensions (read/write rate versus number of machines, disk size, etc.).

5.Confusion
What happens when the master fails? It seems like a single point of failure. Also, how is consistency maintained across replicas (e.g., what if replicas have different lengths)?

Summary: This paper describes the design and implementation of the Google File System (GFS), a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance, the ability to run on inexpensive commodity hardware, and high aggregate performance. The ways in which Google's application workloads and technological environment deviate from earlier file system assumptions drove many design decisions in GFS. GFS successfully fulfilled Google's storage needs, providing hundreds of terabytes of storage across many disks and machines, concurrently accessed by many clients, for generating and processing data. The paper also reports empirical data, evaluating GFS using micro-benchmarks and real-world measurements.

Problem: Google generates and processes large amounts of data (file sizes in the range of gigabytes are common). a) Traditional file systems were not suitable for storage at this scale, since consistency wasn't guaranteed with multiple concurrent updates and using locks for this would hinder scalability. b) The use of inexpensive commodity hardware for storage makes the system vulnerable to frequent component failures; GFS incorporates constant monitoring, error detection, fault tolerance and automatic recovery to address them. c) Across a variety of data (large repositories, data streams, archival data), files are mutated mostly by appending new data rather than overwriting it; given this access pattern, GFS aims to optimize append performance and guarantee atomicity. d) GFS increases flexibility by co-designing the applications and the file system API.

Contributions:
a) A simplified, relaxed design for mutations (file changes) built around a single master, which makes global placement decisions, creates new chunks and replicas, coordinates system-wide activities, balances load across chunkservers, reclaims unused storage, holds access control information, and stores chunk metadata. Notably, GFS keeps the master from becoming a bottleneck by keeping all metadata in memory, involving the master only in metadata operations, and having clients cache metadata (not data, and with a timeout), reducing master-client interaction.
b) Files are divided into fixed-size chunks and replicated on chunkservers. Each chunk has three replicas by default, which improves read performance and reliability, though replicas can fall out of sync and replication incurs space overhead. Actual data reads and writes are served by the chunkservers.
c) Writes are pipelined, which fully utilizes each machine's bandwidth, avoids network bottlenecks and high-latency links, and minimizes latency: data is pipelined over TCP connections, and each chunkserver immediately forwards the data it receives.
d) Consistency model: a file region can be consistent (all clients see the same data regardless of which replica they read from) and defined (consistent, with the write's updates intact). Multiple concurrent writes can leave a region consistent but undefined; GFS works around this with the record append operation.
e) Large chunks are allocated lazily to reduce internal fragmentation. However, hot spots can arise when a small file consists of a few chunks and many clients access it.
f) In addition to keeping master metadata in memory, metadata changes are recorded in an operation log, which enables recovery by loading a checkpoint and replaying the log after that point. Checkpoints are stored in a compact B-tree-like form and help keep the log size in check.
g) A lease is granted by the master to a primary replica, which ensures a single mutation order: operations follow the lease grant order first and then the serial numbers assigned by the primary within a lease.
h) Lazy deletion: when a file is deleted, the deletion is logged immediately, the file is renamed to a hidden name with a timestamp, and physical space is lazily reclaimed by piggybacking deletion messages on HeartBeat messages (sketched below). In my opinion this is a good design, since it runs in the background, is simple, and helps in cases of accidental deletion.
i) Data integrity is ensured by maintaining and verifying checksums.
j) Copy-on-write is used for snapshots.
A shortcoming of this design is that the master remains a single point of failure and is limited by its memory capacity.
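A minimal sketch of the lazy deletion in (h), assuming a hypothetical namespace object; the grace period and the hidden-name scheme are illustrative, not the paper's exact values.

```python
import time

def delete_file(namespace, path):
    namespace.log("delete " + path)                       # deletion is logged immediately
    hidden = path + ".deleted." + str(int(time.time()))   # hidden name carries a timestamp
    namespace.rename(path, hidden)                        # no physical space reclaimed yet

def background_scan(namespace, grace_seconds=3 * 24 * 3600):
    # Runs during the master's regular namespace scans; chunkservers learn which
    # chunks are orphaned via their HeartBeat exchanges and delete them locally.
    now = time.time()
    for name in namespace.list_hidden():
        stamp = int(name.rsplit(".", 1)[1])
        if now - stamp > grace_seconds:                   # older than the grace period
            namespace.remove(name)                        # drop metadata; space reclaimed lazily
```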

Evaluation: The authors evaluate GFS using micro-benchmarks and real-world workloads. The micro-benchmark cluster had one master, two master replicas, 16 chunkservers and 16 clients. Aggregate read bandwidth was shown to scale well as the number of clients increased, with only a small drop in per-client performance. A similar trend was observed for aggregate writes, but per-client performance dropped more markedly owing to greater contention from multiple updates to replicas. GFS was then evaluated on real-world research-and-development and production data-processing workloads. Memory consumption by metadata at the master and chunkservers was minimal, about 50 MB to 100 MB, owing to the highly optimized in-memory data structures. GFS also achieved a desirable recovery time, on the order of 24 minutes to restore 600 GB of data replicated across chunkservers after a chunkserver failure. It is a wise design choice to omit data caching at the client and chunkserver, since the underlying Linux buffer cache already handles data caching. Memory usage is also optimized by using compact data structures, such as the B-tree-like checkpoints. Finally, the authors analyse the real-world workloads and their characteristics to show how GFS's design conforms to the requirements. However, it would have been more useful if this empirical data had been presented in comparison with measurements on other prevalent file systems instead of as absolute measurements, as that would have helped show how GFS compares to other file systems.

Confusion: 1) When is it necessary for the master to access a file via its hidden name once it has been deleted? 2) Why and how is it sufficient to check only the checksums of the first and last blocks to ensure data integrity during an overwrite operation? 3) How do lazy space allocation and padding in the case of record append coexist (don't the two mechanisms contradict each other)?

Summary
The paper describes the Google File System (GFS), a distributed file system that uses inexpensive commodity hardware to store data reliably over the network in a fault-tolerant way and to access it efficiently, delivering high throughput for large-scale distributed data-intensive applications, which generally interact with large files through append operations.

Problem
A study of application workloads and the technological environment at Google revealed that component failures of inexpensive commodity hardware were common, which required the system to provide constant monitoring, error detection, fault tolerance and automatic recovery. The study also revealed that file sizes were generally large (multi-GB files were common) and that appending was the most common way to mutate these files. These observations necessitated a revision of design assumptions regarding I/O operations and block sizes; specifically, the common case of append needed to be optimized. As the study also showed that a majority of files were read sequentially, caching data blocks at the client was not going to be useful. The system was also required to implement well-defined semantics for concurrent appends to the same file efficiently and to favor high sustained bandwidth over individual operation latency. None of the existing distributed file system designs achieved all of these goals.

Contribution
The authors at Google designed and developed their own distributed file system that handles the above issues. The GFS design has influenced the design of the Hadoop Distributed File System, which is also commonly used with the MapReduce framework. While not POSIX compliant, GFS organizes files hierarchically and provides a standard interface for file access, along with special operations such as snapshot and record append. The key architecture of GFS consists of a single master node, multiple chunkservers and multiple clients. Files in GFS are stored in chunks of 64 MB. A chunkserver stores each chunk as a Linux file, identified by a chunk handle, and keeps a checksum for every contiguous 64 KB region in a chunk to protect against data corruption. The master node stores the metadata: the file and chunk namespaces, the operation log and the locations of chunks. All master metadata except the mapping from chunk handles to chunk locations is stored persistently at the master and its replicas. The polling mechanism between the master and the chunkservers, called the HeartBeat, is used for updating chunk locations at the master and also for controlling chunk placement and monitoring chunkserver status. The snapshot operation creates a copy of a file or directory tree using a copy-on-write mechanism, while the record append operation provides append-at-least-once semantics. To minimize intervention from the centralized master, GFS uses leases to maintain a consistent write order and forwards data from the client to chunkservers in a way that exploits the network topology. This design separates the control and data flow. GFS replicates chunks and uses shadow masters to ensure high availability of file data. It also uses a lazy garbage collection mechanism for cleaning up deleted files and reclaiming space. Stale chunks are identified using version numbers. Extensive diagnostic logging is performed at all nodes to isolate problems, debug, and analyse performance.
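The 64 KB checksum regions mentioned above can be sketched as follows. This is not the paper's code: CRC-32 stands in for GFS's 32-bit checksums, and the error handling is simplified.

```python
import zlib

BLOCK = 64 * 1024  # each 64 KB region of a chunk carries its own checksum

def verify_read(data, stored_checksums):
    # The chunkserver verifies every block overlapping a read before returning data.
    for i in range(0, len(data), BLOCK):
        if zlib.crc32(data[i:i + BLOCK]) != stored_checksums[i // BLOCK]:
            # In GFS the chunkserver would return an error, report the mismatch to the
            # master, and the corrupt replica would be re-created from another replica.
            raise IOError("checksum mismatch at block %d" % (i // BLOCK))
    return data
```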

Evaluation
The authors present two sets of evaluations. The first involved measuring the rate of read, write and record append operations on a small cluster set up specifically for GFS evaluation. The read/write/append rates were measured while varying the number of clients issuing requests. While the read rate indicated healthy use of the network bandwidth, the write rate was only about half of the theoretical limit. This was because the network stack did not interact well with the pipelining scheme GFS uses to push data to chunk replicas. The record append rate was found to be limited by the bandwidth of the individual chunkservers. The second set of evaluations presented usage statistics for a smaller R&D cluster and a larger production-level cluster used at Google. In one such comparison, various aspects of GFS were measured, such as storage, metadata overhead, read and write rates, the rate of operations sent to the master, and recovery time. Another involved profiling the chunkserver workload by operation count and bytes transferred, comparing the number of appends versus writes, and breaking down the requests received at the master. These measurements validated the assumptions made by the GFS designers as well as the design objective of high performance.

While the above evaluation covered the significant aspects of GFS, I believe the following tests could have been added. First, there was no comparative analysis against other distributed file systems such as NFS or AFS. Second, the key design decision of using a 64 MB chunk size was not evaluated; a performance analysis with different chunk sizes (both smaller and larger than 64 MB) would have provided more insight into this decision.

Questions/Confusion
1. What are the semantics for handling chunk replica inconsistencies arising from duplicates created by the record append operation?

1. Summary
Google File System (GFS) was introduced to cater specifically to Google's data-intensive workloads, providing fault tolerance on inexpensive commodity hardware. The authors discuss the file system interface extensions designed to support large clusters with terabytes of storage across multiple disks on many machines. Optimizations such as constant monitoring, replication of large chunks, and fast, automatic recovery are discussed for concurrent appends and sequential reads of large files, while ensuring that the centralized master is not a bottleneck. The design was also well tested against real-world problems, proving its feasibility.

2. Problem
Keeping in mind that components fail all the time, the authors tackled the problem of guaranteeing high bandwidth and crash recovery that never loses data, rather than providing low-latency data access across the network. For most workloads, concurrent appends to multi-GB files were the most common load, and supporting multiple sequential readers was vital. Atomicity with minimal overhead was essential for multiple producers appending data while a consumer reads it simultaneously. Existing file system solutions were not optimized for such workloads.

3. Contribution
Keeping multiple replicas across multiple machines is the key design element for minimizing data loss. With a 64 MB chunk size, the client translates the filename and byte offset into a chunk index and asks the master for the chunk handle and the locations of the replicas. The master keeps track of chunks through periodic HeartBeat messages to all the chunkservers (CS) and only polls for chunk locations at startup or when a CS joins. The operation log is the only persistent record of metadata maintained by GFS; a checkpoint is stored in a compact B-tree-like form that can be directly mapped into memory and speeds up namespace lookup. At least three replicas are maintained, one of which is the primary, holding a lease with a timeout. Shadow masters provide read-only access and are used when the master is unavailable. Metadata state changes at the master are logged, flushed to disk and only then applied; a new checkpoint can be created without delaying incoming mutations. The relaxed consistency model relies on maintaining version numbers for chunks, so data may be unavailable but is never silently corrupt. Garbage collection removes stale chunks identified via version numbers, and on deletion of a file the orphaned chunks are identified at the chunkservers via HeartBeat messages. Reclamation of resources upon deletion of a file is carried out lazily by the master.
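A hedged sketch of the checkpoint-plus-log recovery described above; load_checkpoint, read_log_after and the state object are hypothetical helpers, not APIs from the paper.

```python
def recover_master(checkpoint_path, log_path):
    # Load the latest compact, memory-mappable checkpoint of the namespace.
    state = load_checkpoint(checkpoint_path)
    # Replay only the operation-log records written after that checkpoint.
    for record in read_log_after(log_path, state.last_applied):
        state.apply(record)
    # Chunk locations are deliberately not in the log: the master rebuilds the
    # chunk-to-chunkserver mapping by asking chunkservers to check in at startup.
    return state
```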

4. Evaluation
The micro-benchmarks evaluate Read / Write / Append throughput via bandwidth utilization. Recovery time is also computed along with analysis and profiling of real Google workloads.
In my opinion, separating file system control (at the master) from data transfer (between chunkservers and clients) is a good approach to get the most out of a distributed file system. Keeping only the three major metadata types and no file data at the master is a good use of memory, and it means that losing the master does not lose file data. Chunk replica locations are not persisted at the master since the chunkservers can modify them locally. File data is never cached, and a snapshot copy is made on the same chunkserver, which greatly reduces latency and network traffic. It would have been good to evaluate these optimizations step by step to better understand the performance gain from each.

5. Question
How are the inconsistent regions created during record appends handled?

1. Summary
This paper describes the Google File System, a scalable distributed file system tailored to large data-intensive workloads running on commodity machines. While it shares goals such as fault tolerance and scalability with previous distributed file systems, its design is mainly driven by the workload characteristics and environment, with a focus on high aggregate throughput rather than low latency. It provides an API that differs from POSIX, with operations such as atomic record append and snapshot.

2. Problem
There was a need to design a new distributed file system to meet Google’s growing data processing needs typically involving huge distributed workloads operating on a large number of multi-GB files. Some characteristics of their workload and operating environment include inexpensive commodity machines that fail often, files of huge size typically in GB, workloads with large streaming reads and small and rare random reads, workloads with large sequential appends to file, concurrent clients, etc.

3. Contributions
The main contribution of the paper is the overall design of a distributed file system that supports huge data-intensive workloads operating on huge amounts of data. i) While many thought a single-master design could not be scalable or fault tolerant, the GFS design proves them wrong. A single master keeps the design simple; although it might seem like a bottleneck, the GFS master serves only metadata requests, and the actual data requests go to the chunkservers. ii) Splitting a file into fixed 64 MB chunks and making the chunk the unit of replication is another notable design choice. Large chunks reduce the size of metadata at the master, and reads and writes on the same chunk require only one initial request to the master. iii) As opposed to POSIX semantics, which do not provide guarantees for concurrent operations within a file, GFS gives a clear definition of file state in the presence of concurrent writes and record appends. Since concurrent appends are common among the workloads at Google, GFS provides a new atomic record append operation that guarantees the data is written at least once, at the same offset at all replicas. iv) Other notable design choices include pipelined writes between replicas that fully utilize each machine's bandwidth, checksums to protect against data corruption, lazy deletion that runs in the background and can also deal with cases such as accidental deletion, a flat namespace with concurrent mutations within a directory, and no data caching at the client.

4. Evaluation
The evaluation presents microbenchmarks and real-world workloads from Google clusters. They evaluate the aggregate throughput of operations such as read, write and record append and compare it against the theoretical limit. They also show the master load, which supports the claim that a single master is not a bottleneck in their design, and they evaluate recovery times during chunkserver failures. Finally, they present the real-world workload characteristics of two GFS clusters. The evaluation does not show the downtime that a master failure can cause.

1. Summary
The paper presents the design and implementation of the Google File System, a scalable distributed file system for large distributed data-intensive applications, which provides fault tolerance while running on inexpensive commodity hardware and delivers high aggregate performance (throughput) to a large number of clients. The design was driven by an analysis of the application workloads at Google, and traditional design choices were therefore reexamined to find more suitable design points.

2. Problem
The application workloads and technological environment at Google differ from those assumed by existing compute and storage file systems. Component failures are the norm rather than the exception, so the system needs constant monitoring, error detection, fault tolerance and automatic recovery. Files are huge by traditional standards, with multi-GB files being the common case. Another aspect of the environment is that most files are mutated by appending new data rather than overwriting existing data, which makes random writes practically non-existent; reads are mostly large and sequential. The authors also address the problem of atomic appends, so that multiple clients can append concurrently without extra synchronization.

3. Contributions
A GFS cluster contains a single master and multiple chunkservers and is accessed by multiple clients. Files are divided into fixed-size chunks, which are replicated for reliability. Chunkservers store chunks, identified by globally unique chunk handles, on local disks as Linux files. The master maintains all the file system metadata (namespace, access control information, mapping from files to chunks, and current chunk locations) and controls system-wide activities such as chunk lease management, garbage collection and chunk migration. Clients interact with the master only for metadata; all data-bearing communication goes directly to the chunkservers. Having a single master simplifies the design and allows chunk placement and replication decisions to use global knowledge. The chunk size is chosen to be quite large, which reduces client-master interaction, reduces network overhead by allowing persistent TCP connections, and shrinks the metadata stored at the master (so it can be kept in memory even for very large clusters). Mutations are logged persistently in an operation log at the master to ensure reliability and recoverability after a master crash. Another useful design point is the separation of data flow and control flow to use the network efficiently. A snapshot operation is supported to make a copy of a file or directory tree almost instantly. The master implements a chunk-level replica placement policy to maximize data reliability and availability. Garbage collection is performed lazily at the chunk and file levels to amortize its cost. The design of the master and chunkservers ensures fast recovery of their state through master replication and chunk replication.
4. Evaluation
To illustrate the performance, the authors ran micro-benchmarks to analyze read, write and append performance. They compared GFS performance with the theoretical maximum for these operations, calculated from the cluster configuration. For real-world clusters, the authors point out that the metadata stored at the master is very small and easily fits in the master's memory. They also present results showing that the master is not the bottleneck in the system, with a load of around 300-500 operations per second at the master; thus they argue that scaling will not be a problem with this design. Results for quick recovery are also presented based on replication priorities (roughly 2-20 minutes for about 600 GB on different clusters). The read rates are very high, at about 70% of the theoretical value, while the writes are somewhat slower than the authors expected. It would have been interesting to compare these experiments against existing file and storage systems for these applications.

5. Confusion
Could you explain about the state of the files after the atomic append operation? How is inconsistent file state repaired? Can replicas have slightly different contents at some stage?

Summary:
The authors in this paper discuss the design and implementation techniques used in GFS, a scalable distributed file system for large data-intensive applications. GFS provides fault tolerance and high aggregate performance. In the design, the authors have given up the POSIX interface entirely and kept the system simple enough to work optimally on their anticipated workload. This file system is successfully deployed within Google as a storage platform for processing large datasets.


Problem:
The major problem being solved in the paper is to provide fault tolerance and high aggregate performance along with scalability in a distributed file system. They target GFS at workloads with huge files that are mostly appended to and read sequentially.


Contribution:
The paper revisits the traditional file system assumptions in light of Google's anticipated workload and proposes some radical changes for distributed file systems.
First is fault tolerance. GFS uses HeartBeat messages so that the master constantly monitors the cluster and its chunkservers. It uses checksums to detect data corruption caused by failed components and driver issues, and it provides automatic recovery via shadow masters, replication of master state, and chunk replication across chunkservers.
Second is performance: based on observations of the file size distribution and access patterns, GFS uses a large chunk size and decouples data transfer from control transfer between the chunkservers and the master. For concurrent access it supports at-least-once append semantics along with chunk leases and an explicit serialized mutation order.
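The HeartBeat-based monitoring mentioned above might look roughly like this; the method names and the piggybacked "orphaned chunk" reply are illustrative assumptions, not the paper's interface.

```python
import time

def heartbeat_loop(chunkserver, master, interval_seconds=10):
    while True:
        # Each chunkserver periodically reports which chunks it holds.
        report = {"server": chunkserver.id, "chunks": chunkserver.list_chunk_handles()}
        reply = master.heartbeat(report)
        # The master piggybacks instructions on the reply, e.g. chunks to delete.
        for handle in reply.get("orphaned_chunks", []):
            chunkserver.delete_chunk(handle)
        time.sleep(interval_seconds)
```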


Evaluation:
The authors evaluate the performance of GFS with a few micro-benchmark experiments for reads, writes and record appends. They find that read efficiency drops as the number of readers increases, since multiple readers may be reading from the same chunkserver. Writes were slower than expected, but this does not significantly reduce the aggregate write bandwidth delivered by the system. They also examined a real-world cluster used within Google, where read rates were much higher than write rates.
Overall it is a great paper with many techniques working together, and since the Google File System is tailored to Google's workloads, I think the evaluation is appropriate.


Confusion:
I would like to discuss the tradeoffs incurred in the design, and how file security, as provided by the POSIX interface, is implemented here.

1.Summary:
This paper is about the design and implementation of Google File System(GFS), a scalable distributed file system for large distributed data-intensive applications. The main features of the file system are fault tolerance and high aggregate performance.

2.Problem:
Traditional distributed file systems lack the following for data-intensive applications:
1) Inexpensive commodity components accessed by client machines are prone to failures and do not recover easily; as a result, fault tolerance must be integral to the system.
2) Multi-GB files are common and must be optimized for, rather than small files.
3) Appends are far more common than random writes and must be made efficient.
4) Co-designing the file system API and the applications makes the system flexible.
GFS is built on the above assumptions.

3.Contributions:
GFS employs a variety of simple techniques as part of the system.
Following are some of the key contributions:
1) A GFS cluster has a single master and multiple chunkservers storing fixed-size chunks, each with a globally unique 64-bit chunk handle assigned by the master at creation. The master maintains all the metadata, including the file namespace, mappings from files to chunks, chunk locations, access permissions, etc. Clients get chunk information from the master and contact the chunkservers for reads and writes.
2) Error detection uses HeartBeat messages exchanged with the chunkservers to collect state and give instructions. Fault tolerance is provided by replicating chunks across chunkservers and by having 'shadow' masters that mirror the master's metadata and log.
3) A large chunk size of 64 MB optimizes for workloads that read and write large files sequentially and reduces network overhead by allowing persistent TCP connections.
4) Concepts such as snapshot, to create a copy of a file or directory tree almost instantaneously, and atomic record append, with 'atomically at least once' semantics, suit the case where multiple clients append to the same file concurrently; duplicates and inconsistencies are handled by the readers through checksums and record identifiers (see the retry sketch after this list).
5) Lazy garbage collection at the file and chunk levels keeps things simple. On file deletion, instead of reclaiming physical resources immediately, the file is renamed to a hidden name and its chunk entries are later removed from the metadata. Orphaned chunks are then found across the chunkservers and reclaimed.
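A small sketch of the at-least-once record append retry mentioned in point 4; client.try_append is a hypothetical helper, and real readers must be prepared to skip duplicates and padding.

```python
def record_append(client, filename, record, max_retries=5):
    # GFS appends the record atomically at an offset of its own choosing; if the
    # append fails at any replica the client retries, so a record may appear more than once.
    for _ in range(max_retries):
        ok, offset = client.try_append(filename, record)
        if ok:
            return offset          # the same offset holds the record at every replica
    raise IOError("record append failed after retries")
```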

4.Evaluations:
The authors base the evaluation on micro-benchmarks and real-world clusters. In the micro-benchmarks, the performance of reads, writes and record appends for N clients is measured and compared against the theoretical limit. The observed read rate is about 80% of the ideal limit and drops as the number of clients increases; aggregate write rates are about half of the limit, and record append rates are lower due to network congestion. Further, two real-world clusters are examined, one with many smaller reads and the other with large sequential reads. Here, they evaluate the load on the master and find that it is not a bottleneck, with the master able to handle thousands of file accesses per second due to efficient searches. Chunkserver failure handling is evaluated by killing a chunkserver holding 15,000 chunks and 600 GB of data, all of which were restored in 23.2 minutes.
The authors have measured the system well and provided performance metrics in the form of tables and graphs, but have not really drawn any conclusion from them or provided their thoughts as to why the results are so. The efficiency of GFS could have been shown better by comparing against the same workloads on a different distributed file system.

5.Confusion:
How does lazy space allocation solve internal fragmentation problem?

Summary
This paper describes the design and implementation of the Google File System (GFS), which is a scalable distributed file system designed by Google to especially support their large distributed data-intensive applications.

Problem
Large-scale data processing needs at Google led the authors to revisit some of the design goals of a file system. Specifically, the requirement was to create a fault-tolerant and scalable file system that could run on commodity hardware and provide hundreds of clients concurrent access to terabytes of data stored across a thousand machines. Thus, driven by the unique requirements of their application workloads, the authors proposed GFS.

Contributions
According to me, the following are the novel contributions of this paper:
(1) A novel file system architecture that relies on a separation of control between a master and a set of chunkservers for file system management. The master is only responsible for managing the file system metadata, while the actual data is hosted by the chunkservers. Master involvement in common operations is minimized by a large chunk size and by chunk leases, which delegate authority to primary replicas for data mutations (a write-flow sketch follows this list). This design allows for greater parallelism and the ability to support hundreds of concurrent clients.
(2) The choice of a single master for centralized management, rather than relying on distributed algorithms for consistency and management. This is contrary to the approach taken by many other distributed file systems such as xFS and Frangipani, and leads to a simpler design and increased reliability and flexibility.
(3) Implementation of an atomic append operation (record append) to serialize concurrent appends by multiple clients. This obviates the need for clients to implement complicated and expensive synchronization for typical producer-consumer scenarios.
(4) Replication of chunks at different chunk servers based on rack awareness and network distance allows for better performance and better network bandwidth utilization.
(5) Speedy recovery made possible by minimizing the state that the master and the chunkservers have to maintain. Operation log and checkpointing are used to provide these guarantees.
(6) A novel online repair mechanism that can regularly and transparently repair damage and compensate for lost replicas as soon as possible.
(7) Use of checksums to detect data corruption of chunks in chunkservers, which is often too common in a large distributed file system.
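A rough sketch of the lease-delegated write flow from point (1); the object methods are hypothetical stand-ins for the RPCs described in the paper.

```python
def gfs_write(client, master, chunk_handle, data):
    # The master grants (or reports) a lease; the lease holder is the primary replica.
    primary, secondaries = master.lease_holder(chunk_handle)
    # Data flow: push the bytes to all replicas, pipelined along the closest chain.
    client.push_data([primary] + secondaries, data)
    # Control flow: the primary assigns a serial number, applies the mutation locally,
    # then tells the secondaries to apply it in the same serial order.
    serial = primary.assign_serial_and_apply(chunk_handle, data)
    for s in secondaries:
        s.apply(chunk_handle, data, serial)
    return serial
```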

Evaluation
The authors present their evaluation of GFS using micro-benchmarks as well as real-world workloads. A small GFS cluster consisting of one master, two master replicas, 16 chunkservers and 16 clients was used to micro-benchmark performance. In the micro-benchmarks, the authors compare the performance of read, write and record append operations against the theoretically possible limit established by the network bandwidth. The read rate achieved 75% to 80% of the theoretical limit, while the write rate was only about half of it. They attribute the slow writes to their network stack, which does not interact very well with the pipelining scheme used to push data to chunk replicas.
The authors also examine the performance of GFS on two kinds of real-world clusters: one used for research and development and the other used for production at Google. For both types of cluster, they demonstrate that GFS sustains high read, write and record append throughput even in the presence of chunkserver and disk failures.

Given that GFS is highly tailored to large-scale data-intensive applications run within Google itself, I think it is justified that the authors did their evaluation primarily on the workloads seen by their own clusters. However, I feel the authors could have elaborated more and justified some of their design choices through their evaluations. For example, what is an appropriate chunk size? Is 64 MB good enough? How does performance vary with different chunk sizes? What is a good replication factor? How does increasing the replication factor affect network congestion and the performance of the overall file system? It would be interesting to see answers to these questions.

Confusion
How is access control and permissions implemented in GFS?

1. Summary
In this paper, the authors describe the design and implementation of the Google File System - a scalable distributed file system for large distributed data-intensive applications. This system meets the rapidly growing demands of Google’s data processing and storage needs.
2. Problem
The authors reexamine the traditional choices in designing distributed file systems in order to meet their needs. These include treating component failures as the norm rather than the exception, the ubiquity of large files, and the fact that most files are mutated by appending new data rather than overwriting existing data. They have also co-designed the applications and the file system API to benefit overall system flexibility.
3. Contribution
The authors lay out several design and implementation decisions taken in order to provide a fault-tolerant distributed system. A GFS cluster consists of a single master and multiple chunkservers accessed by multiple clients. It provides create, delete, open, close, read, write, snapshot and record append operations as its file interface. Files are divided into fixed-size (64 MB) chunks identified by a 64-bit chunk handle assigned by the master during chunk creation, and the chunks are replicated across at least three servers. The master maintains all file system metadata, which includes the file and chunk namespaces, access control information, the mapping from files to chunks, and chunk locations. It also controls system-wide activities such as chunk lease management, garbage collection of orphaned chunks and chunk migration between chunkservers. Having a single master allows it to make sophisticated chunk placement and replication decisions using global knowledge. Large chunks reduce the clients' need to interact with the master, as operations on a chunk require only one initial request for location information; they also reduce the size of the stored metadata and enable a persistent TCP connection between the client and the chunkserver. The master obtains all chunk locations at startup by polling the chunkservers and keeps itself up to date thereafter with regular HeartBeat messages. The operation log stores critical metadata changes, preventing metadata loss. GFS has a relaxed consistency model which guarantees that file namespace mutations are atomic, that concurrent successful mutations leave a region consistent but undefined, and that a serial successful mutation leaves the region defined, containing the data written by the last mutation. Interaction with the master is minimised to avoid it becoming a bottleneck: the master grants a lease to a primary replica, which establishes a mutation order for the secondaries. Data flow and control flow are decoupled to use the network efficiently. Atomic record append lets the client specify only the data, and GFS guarantees that the data will be appended to the file at least once. The snapshot operation gives the system a consistent state to roll back to in case of failures. The master manages the namespace through locking, allowing concurrent mutations in the same directory (a small locking sketch follows below). The mappings maintained make garbage collection simple, with orphaned chunks identified through HeartBeat messages and stale replicas identified by chunk version numbers. The authors also cover fault tolerance and diagnosis. I feel the authors have done a great job of identifying the key design challenges and developing a system that addresses them.
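The namespace-locking rule mentioned above (read locks on every ancestor, a write lock on the leaf) can be shown with a tiny self-contained sketch; the helper name is mine, not the paper's.

```python
def locks_for_mutation(path):
    # For a mutation on /d1/d2/leaf: read-lock /d1 and /d1/d2, write-lock /d1/d2/leaf.
    # Two clients creating different files in the same directory therefore only share
    # read locks on the directory and do not block each other.
    parts = path.strip("/").split("/")
    read_locks = ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]
    write_lock = "/" + "/".join(parts)
    return read_locks, write_lock

# locks_for_mutation("/home/user/file")
#   -> (['/home', '/home/user'], '/home/user/file')
```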
4. Evaluation
The system has been carefully and convincingly evaluated. GFS performance was analysed through micro-benchmarks and real-world clusters. In the micro-benchmarks, the read and write rates increase with the number of clients, while the record append operation shows variation due to congestion, which is not an issue in practice. The authors justify the read/write performance through the real-world clusters, whose workloads are broken down and analysed; the master is shown not to be a bottleneck. They also provide an analysis of recovery time.
5. Confusion

Summary:
This paper describes the design of a new file system, the Google File System, that is optimized for Google's common-case workloads and is distributed and fault tolerant on commodity hardware; it does not strictly adhere to POSIX standards. The paper also evaluates the design on Google workloads to demonstrate its effectiveness.

Problem:
Existing POSIX file systems were not suitable for the common-case workloads at Google. Google chose to build on commodity hardware, where failure is the norm, and conventional file systems are not built for the scalability, fault tolerance, error detection and fast recovery needed in such large clusters. The files used were large, and performing reads and writes at small block granularity incurred too much overhead, so larger chunk sizes were needed. Concurrent writes and appends were a common workload, and again existing file systems are not tuned for such specific workloads. Google also intended to co-design applications with the file system, so all of these design requirements were incorporated into GFS for high performance and reliability.

Contributions:
The contributions of the paper are the policies and mechanisms that incorporate the design principles mentioned in the problem section of this review. Fault tolerance is provided by constant monitoring and recovery on component failures. GFS consists of a single active master and multiple chunkservers, and clients interact with both. Chunkservers hold the actual data in fixed-size 64 MB chunks, while the master is responsible for coordination and metadata management. Operation logging and checkpoints are used for faster recovery of the master. Leases are used when clients interact with chunkservers and ensure a correct ordering of writes across all replicas. Data and control transfer are decoupled to improve utilization of network bandwidth and load balancing, and data transfer is pipelined for better network utilization. GFS also provides additional operations such as snapshot, which makes local copies using copy-on-write. Replica placement puts chunks on different racks for better fault tolerance (a placement sketch follows below), and the replication factor of chunks can be adjusted based on request patterns. Lazy garbage collection, which merely logs deletes and reclaims space during regular scans, improves performance. Clients contact the master only for metadata; all data traffic goes directly to the chunkservers to prevent the master becoming a bottleneck. No data caching is performed, avoiding cache coherence overhead, and a cache would be of little use for large sequential reads, which are the common workload. Metadata is cached and stored in memory for fast lookups. The possible file-region states after a mutation are consistent, defined, undefined and inconsistent. Chunkservers use checksums for error detection to ensure data integrity.
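The rack-aware replica placement mentioned above could look roughly like this; the attributes (rack, disk_utilization) and the policy details are simplifications of the paper's placement criteria.

```python
import random

def place_replicas(chunkservers, num_replicas=3):
    # Group candidate chunkservers by rack so replicas end up on different racks.
    by_rack = {}
    for cs in chunkservers:
        by_rack.setdefault(cs.rack, []).append(cs)
    racks = list(by_rack)
    random.shuffle(racks)
    chosen = []
    for rack in racks:
        if len(chosen) == num_replicas:
            break
        # Prefer the least-utilized server on this rack.
        chosen.append(min(by_rack[rack], key=lambda cs: cs.disk_utilization))
    return chosen  # may return fewer than num_replicas if there are too few racks
```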

Evaluation:
The evaluation uses micro-benchmarks to expose the bottlenecks in the GFS design and compares read, write and append rates against theoretical limits; it also reports measurements from real-world clusters. It verifies the design choices made, showing that the master is not a bottleneck and that the workload is append-intensive. Each machine's network bandwidth is well utilized. As readers increase, a reduction in aggregate performance was observed when reading simultaneously from the same chunkserver; aggregate write performance also drops as the number of writers increases. Two different GFS clusters were used for the evaluation. Chunkservers (one or two at a time) were killed on purpose to evaluate recovery, and recovery times were noted in both cases. Master load was minimal. Missing evaluations: a range of other non-standard workloads (e.g. very many small files), comparison against other distributed file systems, and the benefit of file system buffer caching for these workloads.

Issues:
Why not use multiple chunk sizes depending on the workload? Wouldn’t the relaxed consistency model create high overhead in clients?

Summary:
GFS demonstrates how to support large scale processing workloads on commodity hardware. It is designed to tolerate frequent component failures and optimized for huge files that are mostly appended. They also extend and relax the standard file system interface to improve the overall system.

Problem:
This paper describes a filesystem designed to be specifically efficient for Google's applications and workloads. This filesystem was designed keeping in mind the following observations:
i) High failure rates because of the use of inexpensive commodity components
ii) Workloads use a modest number of huge files
iii) Files are write-once and append-mostly, often written concurrently
iv) Large streaming reads, with high sustained throughput favored over low latency

Contribution:
The contribution is the new file system designed to optimize for the features listed above. Google's file system does not use the traditional per-directory data structure that lists all the files in that directory.
i) Files are stored in chunks, and the chunk size was chosen after careful analysis to reduce interaction with the master while not creating hot spots when many clients access small files in the same chunk. Chunks are also replicated across different chunkservers to improve reliability.
ii) Simple centralized management, but the master is not consulted on every I/O. Clients cache which chunkservers they should contact and use that information for subsequent operations. This minimizes master involvement; there are also shadow masters that are used during failures.
iii) Google also identified that for its workloads data caching is not important since they have large streaming reads and large data sets.
iv) The whole file system has been designed with the same interfaces (read, write). Two new interfaces have been added: snapshot and record append. These interfaces have been designed specifically to expedite appends to a file and creating snapshots of directories, both of which are prevalent in their workload.
v) Simpler and more reliable garbage collection: it is done asynchronously; the deletion is logged and the file is renamed to a hidden name. This also keeps deleted data around for some time, which helps in cases of accidental deletion.
vi) Decoupled data flow and control flow to avoid network bottlenecks and provide high aggregate throughput to concurrent readers and writers (sketched below).
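A hedged sketch of the decoupled, pipelined data push from point vi); the fragment size and the forwarding helpers are assumptions, and real GFS overlaps forwarding with receiving over TCP rather than completing each hop in turn.

```python
def push_data(client, data, replica_chain, fragment_size=64 * 1024):
    def forward(fragment, chain):
        # Each replica buffers the fragment and immediately forwards it to the next
        # replica in the chain, so every machine's outbound bandwidth is used.
        if not chain:
            return
        chain[0].buffer_fragment(fragment)
        forward(fragment, chain[1:])

    # The client itself only talks to the closest replica, the head of the chain.
    for offset in range(0, len(data), fragment_size):
        forward(data[offset:offset + fragment_size], replica_chain)
```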

Evaluation:
The performance of this file system is evaluated using a GFS cluster consisting of one master, two master replicas, 16 chunkservers, and 16 clients. All machines are configured with dual 1.4 GHz P3 processors, 2 GB of memory, two 80 GB 5400 rpm disks, and a 100 Mbps full-duplex Ethernet connection to an HP 2524 switch. All 19 GFS server machines are connected to one switch and the 16 client machines to the other; the two switches are connected through a 1 Gbps link.
The paper evaluates performance in terms of read, write and append rates and compares them to the theoretical limits imposed by the network. This is done for varying numbers of clients to test the system under varying levels of pressure. One common observation is that as the number of clients increases, efficiency drops, which is expected for a networked service. An explanation of how 16 clients translate to a realistic sharing scenario for a cluster with 16 chunkservers is not given; if a Google cluster has 1000 chunkservers, are there really just 1000 clients?
The paper then takes two clusters, A and B, one used for research at Google and the other for production data processing. It measures read rates, write rates and master operations for the two workloads and shows that the master is not a bottleneck for these operations. None of the results are compared against a traditional file system; what happens if Google workloads are run on a traditional file system, and how do the numbers relate?
I think the master could become a bottleneck as the number of chunkservers grows, so an evaluation of future scalability (especially of the master) would have been valuable, even though the results show it is presently not a bottleneck thanks to the measures taken in the design.

Confusion:
Is there no data caching in DRAM (beyond the hardware SRAM caches)? Is there no caching at all?

1. Summary
This paper describes a distributed file system with a master node that maintains metadata and chunkserver nodes that hold replicas of the stored data. It is designed to be fault-tolerant enough to run well on commodity hardware.

2. Problem
Google found that it used distributed file systems in ways for which no file system had been designed. First, Google used clusters of inexpensive commodity hardware to store their data. This virtually guaranteed that, before the end of a file system operation, some hard drive, computer, or network would crash or permanently fail. Second, Google stored many large files. These were typically written in long, sequential appends to the end of the file and read in long, sequential reads, although small random reads and writes did sometimes occur.

3. Contributions
The paper contributes a file system design and implementation that uses a master node to manage metadata and multiple chunkservers to hold the file data itself. Files are split into large blocks called chunks, which in the paper are set at 64 MB. Each chunk is replicated across multiple chunkservers, and the master records which chunkservers each chunk is located on. The master also holds other metadata, such as namespaces and protection information, as well as an operation log, which records the mutations made to the metadata. All of this metadata is held in memory. While the GFS described in this paper does not provide a POSIX interface, it offers the usual operations to create, delete, open, close, read, and write files, as well as two new operations: record append and snapshot.

For a read, a client asks the master for the location of a given offset of a file. Using the client's IP address, the master can point the client to a nearby chunkserver holding that data. The client then reads the data from that chunkserver, which verifies each 64 KB block against its pre-computed checksums before returning it, in order to detect corruption. For a mutation to a chunk, either a write or a record append, the master grants a lease to one of the replicas, which becomes the primary replica. The master then tells the client all the replica locations for the chunk, to which the client pushes the data, in any order. The primary replica then decides the order in which the mutations should be applied and instructs the other replicas to follow it.

The two new operations, record append and snapshot, are implemented as follows. Record append takes data as an operand but no offset, since the data is appended at an offset of GFS's choosing. It will be appended at least once: the client retries until all replicas have an identical region of data at the same chunk offset. This allows fast concurrent appends. Snapshot is implemented in a copy-on-write fashion: when the master receives the operation, it revokes outstanding leases on the chunks involved and duplicates the metadata; when a new mutation is later initiated on one of those chunks, the master has a new copy of the chunk created first (sketched below).
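A simplified sketch of the copy-on-write snapshot path described above; the master-side bookkeeping (reference counts, clone_chunk) uses made-up names and omits the logging and locking details.

```python
def snapshot(master, src_path, dst_path):
    master.revoke_leases(src_path)                 # force future writes back through the master
    master.log("snapshot %s -> %s" % (src_path, dst_path))
    # Only metadata is duplicated: both paths reference the same chunk handles,
    # and each shared chunk's reference count is bumped.
    for handle in master.chunks_of(src_path):
        master.refcount[handle] += 1
    master.copy_file_metadata(src_path, dst_path)

def chunk_for_write(master, path, chunk_index):
    handle = master.chunk_at(path, chunk_index)
    if master.refcount[handle] > 1:                # chunk is shared with a snapshot
        new_handle = master.clone_chunk(handle)    # chunkservers copy the chunk locally
        master.refcount[handle] -= 1
        master.set_chunk(path, chunk_index, new_handle)
        handle = new_handle
    return handle                                  # now safe to grant a lease and mutate
```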

The replication of data across chunkservers contributes greatly to the fault tolerance of the system. In addition, the master's operation log and checkpoints are replicated, and shadow masters provide read-only access when the master is down. The checksumming described above also contributes to fault tolerance.

4. Evaluation
The paper evaluates GFS on both microbenchmarks and real-world clusters. The microbenchmarks were run on a cluster with one master, two master replicas, 16 chunkservers and 16 clients. Reads achieve 75%-80% of the theoretical limit, while writes, by contrast, run at about half of it.

For the real-world clusters, the read rates were again much higher than the write rates. The master node was never found to be a bottleneck and restoration of a single chunkserver with 600 GB of data took 23 minutes.

In general, this evaluation seems appropriate. However, as the paper itself points out, they are evaluating the Google File System on Google's workloads, which is a large threat to the validity of the experiments. Nevertheless, this shows how well the file system behaves on (exactly) what it was designed to do.

5. Confusion
How is data protection (permissions) handled with this file system? Is it in this file system, or in a layer above it?

1. Summary
This paper describes the Google File System (GFS), a scalable distributed file system for large distributed data-intensive applications. Though GFS shares many goals with previous distributed file systems, the interesting aspect is that it was designed with Google's application workloads and technological environment in mind. The authors describe the various aspects of GFS and justify their choices. Lastly, measurements using micro-benchmarks as well as real-world use demonstrate the feasibility of GFS.
2. Problem
As stated earlier, the authors wanted a file system designed around Google's workload. First, they wanted to build a file system that would scale on commodity hardware, where failure is no longer an exception; thus there was a need for constant monitoring and other mechanisms to ensure fault tolerance. Second, they wanted to optimize for Google's workload characteristics: large files, append-heavy mutations, streaming reads with few random reads, and an emphasis on throughput rather than latency. Existing solutions were too generic and not tailored to such workloads.
3. Contribution
According to me, this paper was a very interesting read, and the authors present the various features of GFS systematically. GFS consists of a single master (which holds the metadata in memory) and multiple chunkservers (which store the actual data as chunks). I liked the fact that the authors took into consideration that a single master could be a bottleneck and mitigated the issue to an extent by minimizing the master's involvement in read and write operations. In addition to normal file system operations, GFS supports snapshots and record appends. Another aspect I found interesting was the use of heartbeats so that the master is aware of the chunkservers and monitors their status; this lets the master avoid persisting chunk locations, a sensible decision since chunk locations keep changing. Apart from chunk locations, the other metadata is kept persistent using an operation log. GFS follows a relaxed consistency model whose basic idea is to make applications responsible for accommodating the consistency semantics, which I feel made the design of the server simpler (as opposed to the server enforcing strict consistency itself). The decoupling of data flow and control flow is a key feature, as it ensures efficient network usage. Also, the mutation order is maintained using leases, which is simple yet efficient (the master alone isn't responsible for the order). Though distributed garbage collection can be challenging, GFS simplifies the problem by adopting a lazy reclamation policy. There are various other interesting aspects of GFS, such as replication of chunks to increase availability, rebalancing of chunks to ensure proper disk utilization, and the use of "shadow" masters to reduce the impact of a master failure.
4. Evaluation
The authors evaluate the file system by measuring the performance of a few micro-benchmarks and by presenting numbers from real clusters at Google. Through the micro-benchmarks, the authors show that aggregate read/write throughput generally increases with the number of clients. Though the throughput achieved by record appends is considerably lower, the authors attribute this to network bandwidth limitations at the chunkservers. Through the real-world clusters and the workloads running on them (research-oriented and production-oriented), the authors justify their initial assumptions about workload characteristics. They also give details on recovery time, master load and metadata storage overhead, which help one gauge the feasibility of GFS. Though convincing, I feel the evaluation could have been done better: comparing GFS with existing distributed file systems would have been ideal, and evaluating GFS on workloads other than local Google workloads would have been more convincing.
5. Confusion
When would one want to read files via the hidden name? Is the chunk size a unit of allocation on the disk? How exactly does lazy space allocation work (doesn't padding in the case of record appends violate it)?

1. Summary
This paper describes the Google File System, a distributed file system designed for and used by many data-intensive applications at Google. Some of the design decisions are made specifically to fit Google's workloads and hardware environment.
2. Problem
In contrast to the many existing discussions and papers on distributed file systems, the Google File System makes some design choices based on observations of the commodity hardware and typical workload patterns at Google: at Google's scale, hardware failure of some machine at any given time is expected rather than exceptional; files are usually huge, reaching multiple gigabytes in size; and the data access pattern involves mostly large sequential and small random reads, along with sequential (appending) writes.
3. Contributions
The design of the Google File System to meet these needs includes a single master server and multiple chunk servers: the master holds all the file metadata, while files are split into fixed-size chunks and scattered across chunk servers. Interaction between the master and clients is minimized: the master only answers requests for the locations of the chunk servers that hold a given chunk of a file. The chunk size is 64 MB, much larger than a physical disk block, which further reduces the need to query the master for metadata and enables more caching on the client's side. Each chunk is replicated on multiple chunk servers, improving fault tolerance of the data. Concurrent writes are supported via the atomic record append operation, in which appending the specified data to a file is atomic and the chosen offset is returned to the client.
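To make the 64 MB chunk indexing concrete, here is a minimal, hypothetical sketch of a client-side read (single-chunk reads only; master.lookup and chunkserver.read_chunk are assumed interfaces, not GFS's real API):

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks, as in the paper

class Client:
    def __init__(self, master):
        self.master = master
        self.cache = {}  # (filename, chunk index) -> (chunk handle, replica list)

    def read(self, filename, offset, length):
        # Translate the byte offset into a chunk index.
        chunk_index = offset // CHUNK_SIZE
        key = (filename, chunk_index)
        if key not in self.cache:
            # One metadata round trip; later reads of this chunk skip the master.
            self.cache[key] = self.master.lookup(filename, chunk_index)
        handle, replicas = self.cache[key]
        chunkserver = replicas[0]  # could instead pick the closest replica
        # Assumes the read does not cross a chunk boundary.
        return chunkserver.read_chunk(handle, offset % CHUNK_SIZE, length)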
4. Evaluation
The performance of GFS is measured on a cluster with one master, two master replicas, 16 chunk servers, and 16 clients. The tests include concurrent reads and writes to new files, as well as record appends. In both the read and write cases the performance is compared to the network bandwidth limit. The performance is also measured on real workloads, with a comprehensive list of aspects measured, including operations per second at the master and recovery time. The evaluation shows how the system scales with data size and number of clients; as with some other work from Google, the uniqueness of the workload means there are few other systems to compare with, so performance is mostly measured against hardware limits.
5. Confusion

1. Summary
The paper introduces the Google File System (GFS), a new distributed file system tailored to Google's own workloads, which are characterized by data-intensive applications with mostly sequential read and append operations. These characteristics are used to selectively optimize the file system interface in a manner different from the standard POSIX file system interface.
2. Problem
The problem here is that Google's workloads need a large distributed file system tailored to their particular characteristics. Due to the large scale and the use of commodity machines, failures are the norm, and hence the solution needs to be fault tolerant. The large file sizes also motivate different design choices, including block (chunk) sizes. A concurrent append workload necessitates that atomicity be built into the file system rather than being managed by clients. Complete control over both the applications and the file system offers the flexibility to tune the system as required.
3. Contribution
The paper designs a distributed system where the master and chunk servers are separate. This allows a clear separation of concern between critical and non-critical data. For example, the master metadata is valued more than the user data stored at the chunk servers, as loss of the former would lead to loss of the entire file system state. The master maintains all metadata, including file-to-chunk mappings, and is responsible for all decisions such as distributing chunks between servers, handing out leases to the primary chunk server, as well as rebalancing, cloning and garbage-collecting chunks. The chunk servers maintain little metadata other than the actual data and cooperate among themselves to distribute write data quickly by forwarding it along a chain of servers. The client, however, needs to be a little smart: it must understand the chunk servers and work around the relaxed consistency model imposed by GFS.
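A minimal sketch, under assumed interfaces (buffer_data and forward_to are hypothetical), of how write data might be staged along a chain of chunkservers so that each machine's outbound bandwidth is used only once:

def push_data_along_chain(data, replicas, distance_to_client):
    """Stage `data` on every replica by forwarding it along a chain ordered
    by network distance from the client; returns the chain order used."""
    chain = sorted(replicas, key=distance_to_client)
    for i, server in enumerate(chain):
        server.buffer_data(data)                    # staged, not yet applied
        if i + 1 < len(chain):
            server.forward_to(chain[i + 1], data)   # pipelined in the real system
    return chain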
4. Evaluation
The paper presents a thorough evaluation of the performance of this distributed system, first using microbenchmarks to characterize read and write performance. The paper then utilizes the fact that GFS is actually deployed, using data from production clusters to show read and write rates, the amount of metadata created, the load on the master, and the time taken to recover after a chunk server failure. The paper also analyses the workload to justify certain design choices. The evaluation, however, does not show comparisons with a similar distributed file system. I realize this may not be easy to perform, as the clients are deeply coupled with GFS. However, something must have been used in this scenario prior to the development of GFS, and a summary of the performance improvements gained after moving to GFS would have given other users more confidence to tune/deploy something similar for their workloads.
5. Confusion
I am confused by whether inconsistent/duplicate data may vary between chunk servers and how this is handled by different clients talking to different chunk servers

Summary
This paper presents a new scalable distributed file system introduced by Google, called the Google File System. The file system was designed for a certain class of data-intensive applications and aims at providing high aggregate performance to a large number of clients with fault tolerance over commodity hardware. The paper further lists measurements for both micro-benchmarks and real-world use cases of GFS.

Problem
Google explored its storage requirements and data processing needs and identified key aspects that do not fit the use of traditional distributed file systems. With commodity hardware, component failures occur far more frequently than traditionally assumed. They also need to handle fast-growing, huge data sets. And, looking at the write access patterns for such data, there are far more requests to append data to a file than to overwrite existing data. The new file system is designed to fit these cases.

Contributions
a. A GFS cluster comprises a single master and multiple chunkservers containing the actual data. A single master keeps the design simple. Further, a client only contacts the master to get the locations of a file's chunks, caches that information, and then talks to the chunkservers directly, reducing the chances of the master becoming a bottleneck.
b. The master is responsible for chunk allocation/replication policies, balancing load among the chunkservers. File metadata is stored at the master with prefix compression of file names, reducing the metadata size (a toy sketch of such compression follows this list).
c. Replicas of data are maintained among chunkservers, ensuring recovery if one of the chunkservers fails.
d. Consistency is maintained by having each chunkserver follow the same ordering of writes and by keeping a chunk version number that is updated when a new lease is granted. This approach also helps garbage collection at the chunkservers remove stale data. In addition, checksumming is used to detect data corruption.
e. Each update to the metadata at the master is logged, and checkpointing is used to bound the log size. Such logging and checkpointing speeds up recovery after crashes.
f. The record append and snapshot operations are provided to allow multiple concurrent writers and to copy a given file or directory tree, respectively.
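As promised above, a toy illustration (not GFS code) of how prefix compression of sorted path names can shrink namespace metadata; only the shared-prefix length and the differing suffix are stored per entry:

import os

def compress(paths):
    """Front-code a list of paths: store (shared prefix length, suffix)."""
    entries, prev = [], ""
    for p in sorted(paths):
        shared = len(os.path.commonprefix([prev, p]))
        entries.append((shared, p[shared:]))
        prev = p
    return entries

def decompress(entries):
    names, prev = [], ""
    for shared, suffix in entries:
        prev = prev[:shared] + suffix
        names.append(prev)
    return names

# e.g. compress(["/logs/2003/10/a", "/logs/2003/10/b"]) stores the long common
# prefix only once; decompress() recovers the original names.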

Evaluation
The evaluation seemed a bit confusing to me. The paper reports the read/write/append rates for applications on GFS but does not really give an idea of how GFS improved performance for these applications over traditional file systems. GFS is tested with microbenchmarks and real data-intensive applications at Google. The microbenchmarks were run on a platform with one master, two master replicas, 16 chunkservers and 16 clients. For reads, aggregate throughput rises with the number of clients, but per-client efficiency drops after a certain point. The write and record append rates were found to be lower than reads, and the authors attribute the extra time to the network stack (involved with both data and control flow). For the real applications, bigger clusters with 342 and 227 chunkservers were used. Again, read rates were higher than write rates, with reads almost saturating the network bandwidth. The master showed good throughput and did not pose a bottleneck. The restoration times for chunkservers were reasonable and proportional to the priority with which chunks were cloned. They do not, however, discuss how applications were affected by file system overheads such as finding the right chunk and keeping track of each chunk's locations.

Confusion
Is there any overhead for applications in tracking the chunks for each of their files and keeping track of their locations?

1. Summary
This paper presents the Google File System, designed for supporting large-scale data-processing workloads on commodity hardware. It was motivated by the needs of Google's in-house data processing, and hence was co-designed with the applications. Scalability, fault tolerance, fast recovery, automated management and high aggregate throughput are key design motivators, optimized for large files, where read and append operations dominate.

2. Problem
Optimize for common case of Google's data processing – large files, reads and concurrent appends, component failures.
Large file handling – difficult in traditional FS.
Scalability – ease of adding capacity, monitoring and maintaining was another need.
Commodity hardware - cheap, but unreliable - failure is common and expected.
Network - Minimize bandwidth usage, needs load balancing

3. Contributions
1. Co-design FS and application.
2. Not just a file system – Autonomic Computing - Automatic monitoring and administration.
3. Additional file commands -
  a) record append (atomic, append-at-least-once semantics)
  b) snapshot (copy-on-write on same chunkserver - local copy faster, after leases expire/are revoked)
4. Big file handling – 64 MB chunks, with unique 64 bit handles
5. Master Metadata -
file and chunk namespaces – persistent log on disk, checkpoints.
file->chunk mapping – persistent log on disk, checkpoints.
chunk replica locations – master polls at startup, or when chunkserver joins
6. Fault Tolerance - Fast Recovery, Replication
  Master
  Metadata State changes - logged, flushed to disk on all replicas, finally applied.
  On master process failure - restarted, almost immediately up (operation log).
  Machine / disk failed - replica starts process.
  Shadow masters - support read-only access, whether master is down or up.
  Writes pause for less than a min while new master talks to chunkservers.
  (comparable to most RAID recovery)
  Clients only know the master via a DNS alias, can change master transparently.
  Chunk Server
  Replicas - default 3, configurable. Primary (time-out lease) and secondaries.
  Replicas on Different machines on different racks
  (side benefit - reads - bandwidth of multiple racks)
7. Messages
  Heartbeats
    Chunkserver -> Master - metadata updates - periodic
    Master -> chunkserver instructions
    Lease extensions piggy-backed
  Handshakes between master and chunk servers
    Failed chunkservers – identified, replication done as necessary.
    Data corruption - 32 bit checksums per 64KB block.
      checked by chunk server on read
    Stale chunks – identified using version number
8. Relaxed consistency
  Version numbers for chunks
  Mutations applied to all replicas in same order, managed by primary replica.
  Worst case, data unavailable, not corrupt
9. Garbage Collection
  On deletion of a file, only logged with timestamp, lazy resource reclamation after 3 days on master
  On removal from master, orphaned chunks are identified to chunkservers on heartbeat messages.
  Stale chunks - version number identifies, regular garbage collection removes.
10. Data flow
Pushed along a linear chain, instead of trees etc, to use every machine's full outbound bandwidth.
As large reads and writes are common, GFS is optimized for sustained bandwidth rather than low latency or IOPS.
11. Master -
  Global Lookup table mapping file path to metadata, not per directory
  Each node has a R/W lock - lazy allocation, acquisition is ordered by level and lexicographically.
12. Decouple data flow from FS control flow on mutations (append / write)
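For item 12, a minimal hypothetical sketch of the control path once data has already been pushed to all replicas: the primary picks a serial order, applies it locally, and forwards the same order to the secondaries (class and method names are assumptions, not GFS code).

class PrimaryReplica:
    """Holds the lease for a chunk and serializes all mutations to it."""
    def __init__(self, secondaries):
        self.secondaries = secondaries
        self.next_serial = 0

    def apply_locally(self, chunk_handle, staged_data_id, serial):
        pass  # write the already-staged data into the local chunk file

    def apply_mutation(self, chunk_handle, staged_data_id):
        serial = self.next_serial          # the serial number defines the order
        self.next_serial += 1
        self.apply_locally(chunk_handle, staged_data_id, serial)
        failed = [s for s in self.secondaries
                  if not s.apply(chunk_handle, staged_data_id, serial)]
        # Any failure leaves that region inconsistent on some replicas; the
        # client is told to retry, hence the at-least-once semantics.
        return len(failed) == 0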

4. Evaluation
Measurement
Micro-benchmarks evaluated Read / Write / Append throughput via bandwidth utilization.
Real Google clusters - workloads profiled, analysed. In addition to I/O ops, recovery time was also computed. Not detailing the numbers, since comparison isn't the point; appropriateness for their workload is.
My Opinion
1. Comparison with existing distributed file systems isn't done, but perhaps forgivable as the design motivations and environment are completely different.
2. Debatable Design decisions - Driven by workload characteristics, after analysis - good.
3. Smart memory usage - avoid caching file data, store checksums / metadata.
4. Neither the client nor the chunk server caches file data (chunkservers get the Linux buffer cache for free since chunks are stored as local files)
5. Single Server - Bottleneck / Single Point of Failure ? - not really
  In-memory – 3 major metadata types, no file data - fast, so not bottleneck
  Replication, persistence – fast recovery, so not SPOF.
6. Chunk size - lazy allocation to avoid internal frag
7. Chunk replica info not persisted - master just holds, doesn’t own/change this data
8. Different replication, reclamation policies for different parts of the namespace
9. Copy-on-write on same chunkserver - nice touch

5. Confusion

Summary
The paper presents the Google File System (GFS), a scalable, fault-tolerant distributed file system custom-designed to handle Google's data-intensive workloads. GFS provides high aggregate throughput for a large number of readers and append-only writers (i.e., no overwrites) in a fault-tolerant manner, while running on inexpensive commodity hardware. It is specially optimized for large files, sequential reads/writes, and high sustained throughput instead of low latency. GFS uses a single master with minimal involvement in regular operations and a relaxed consistency model to simplify concurrent operations by many clients on the same files.
The problem
Based on observations of Google's application workloads and technological environment, the authors identify assumptions that are radically different from those of traditional distributed file systems. Firstly, component failures are the norm rather than the exception. Secondly, files here are huge (multi-GB) in size. Thirdly, most files are mutated by appending new data rather than overwriting existing data. Finally, co-designing the applications and the file system API increases flexibility. This departure from the characteristics assumed by earlier file systems motivated the authors to come up with the Google File System.
Contributions
1. A GFS cluster consists of multiple nodes. These nodes are divided into two types: one master node and a large number of chunkservers. Each file is divided into fixed-size chunks, which the chunkservers store. Each chunk is assigned a unique 64-bit handle by the master node at the time of creation, and logical mappings of files to constituent chunks are maintained. Each chunk is replicated several times throughout the network, with the minimum being three, and more for files that are in high demand or need more redundancy. The master state is also replicated for reliability, and this way high availability is ensured.
2. GFS adopts a relaxed consistency model with simple guarantees about defined and consistent regions, and GFS applications can easily accommodate this with a few already existing techniques.
3. GFS uses lazy garbage collection instead of eager deletion. This form of lazy storage reclamation comes with several advantages: it provides reliability in the face of component failure, it allows merging storage reclamation with regular background activities (amortizing the cost), and it acts as a safety net against accidental deletions.
4. The master stores the metadata (not the chunks themselves) in memory. The metadata per file is quite small, hence this in-memory approach is not a major limitation.
5.GFS introduces a concept of leasing for maintaining consistent mutation order across replicas.
6.Unlike most other file systems, GFS is not implemented in the kernel of an operating system, but is instead provided as a userspace library.
7.GFS accommodates easy snapshots without hampering ongoing mutations.
8. The atomic append operation called record append allows concurrent writes to the same region without a complicated distributed lock manager (a sketch of its semantics appears at the end of this section).
9.To optimize network traffic, data flow and control are decoupled.
GFS was a very important paper and paved the way for HDFS.
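As referenced in point 8, a minimal hypothetical sketch of the primary's record-append decision (the chunk object and its methods are assumptions): if the record does not fit in the current chunk, the chunk is padded and the client retries on a new chunk, so records land at least once but may be duplicated.

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB

def record_append(chunk, record):
    """Executed by the primary; the same offset is used on every replica."""
    if chunk.length + len(record) > CHUNK_SIZE:
        chunk.pad_to(CHUNK_SIZE)   # padding becomes an inconsistent region
        return None                # tells the client to retry on the next chunk
    offset = chunk.length
    chunk.write_at(offset, record)
    return offset                  # returned to the client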
Evaluation
The evaluation section is divided into two parts: first it exposes the bottlenecks inherent in the GFS architecture via micro-benchmarks, and second it reports performance statistics of real-world clusters. The authors provide adequate explanations for all the observations, such as why aggregate write rates fall short of the theoretical limit. All the design choices made in the paper are supported by the results – the workload is append-intensive, the master is not the bottleneck, etc. The workload breakdown provided is also very detailed and helps in better understanding. However, no comparison is done with existing distributed file systems to show where exactly GFS gains in performance on Google's own application workloads. It would also have been good to present performance statistics for workloads that benefit from client caching and are not of the read-once, write-once form.
Confusion
What is meant by explicitly deleting a file which is already deleted?
Also how is it feasible for a chunkserver to have a chunk with a version number higher than that recorded in the master?

Summary
This paper describes GFS, a distributed file system built over commodity Linux systems and disk drives. It is optimized for workloads at Google, provides fault tolerance, and extends the file system API with an atomic record append operation.
Problem:
Workloads at Google use large files (1 GB and larger) and tend to stream reads/writes to the disk. Applications also tend to perform sequential appends to large files, and concurrent appends are a common case. To maintain atomicity, such updates would normally require complex synchronization. The engineers at Google recognized an opportunity to build a distributed file system to cater to these needs and to build a reliable storage subsystem out of commodity disk drives rather than using expensive proprietary network storage.
Contribution:
GFS leverages the common case of large files and divides every file into fixed-size chunks. The system consists of a master server and a collection of chunkservers. Each chunk of a file is replicated over three or more chunkservers. The chunks themselves are stored as files on the native Linux file system on each chunkserver. Some of the main ideas of the design include:
1- A single master keeps the system simple. To avoid making the master a bottleneck, all read/write interactions for the chunk are handled directly with the chunkserver. A client caches the chunk-id and chunkserver list obtained from the master for subsequent requests for the chunk.
2- The master stores metadata for the file namespace, a mapping of files (and offsets) to chunks, and a list of chunkservers presently storing each chunk. Metadata is maintained in memory and kept persistent using a combination of compact, fast checkpoints and an operation log. Multiple replicas of the metadata are kept in sync, providing redundancy in case the master server crashes.
3- The master assigns one of a chunk's replicas as the primary chunkserver using leases. The primary chunkserver acts as an ordering point for updating the multiple replicas of the chunk consistently. The request path is decoupled from the data path, which uses a bandwidth-optimized chain to carry the data to all replicas before an update.
4- GFS adds a record append operation for concurrent appends to a file. This guarantees the update is atomically applied at least once. GFS relaxes the usual file system consistency requirements, allowing it to add padding when a record would cross a chunk boundary. If an update to one of the replicas fails, the operation is retried at a later offset. Applications must deal with duplicates and invalid records.
5- The master uses a combination of replica placement policies to achieve uniform disk utilization and to balance load across chunkservers as well as across server racks. GFS uses lazy garbage collection for deleted files and stale chunks (detected using a version number).
6- GFS uses several fault tolerance measures. The master uses heartbeats to poll the health of chunkservers and re-replicates chunks if necessary. The master state is replicated in shadow servers, which also provide a slightly lagged, read-only view of the file system metadata. To add resilience against data corruption, the chunkservers maintain checksums at a finer granularity, which are verified during reads of a block (see the sketch after this list).
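A minimal sketch of point 6's checksumming, with zlib.crc32 standing in for whatever 32-bit checksum GFS actually used (the block layout and function names here are assumptions):

import zlib

BLOCK = 64 * 1024   # one 32-bit checksum per 64 KB block

def verify_and_read(chunk_data, checksums, offset, length):
    """Verify every block the read touches before returning any bytes."""
    first = offset // BLOCK
    last = (offset + length - 1) // BLOCK
    for b in range(first, last + 1):
        block = chunk_data[b * BLOCK:(b + 1) * BLOCK]
        if zlib.crc32(block) & 0xFFFFFFFF != checksums[b]:
            raise IOError("checksum mismatch in block %d; report to master" % b)
    return chunk_data[offset:offset + length]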
Evaluation
The authors use micro-benchmarks to evaluate the potential read/write/append bandwidth in an experimental system. Record appends in particular give lower write bandwidth, as they are limited by the network bandwidth of the chunkservers storing the last chunk. They use a calculated theoretical limit as a known best case for comparison. They also profile real-world systems running GFS to give an idea of the size of metadata in the system and the sustained read/write performance. They analyze the recovery time in the case of one and two chunkserver crashes, showing that full replication counts were restored in under 30 minutes, and 2x replication in just 2 minutes. It would be interesting to see the overheads due to checksumming in the chunkservers, and also an analysis of how atomic record appends improve performance versus distributed locking.
Confusion
How does GFS provide protection if an application fraudulently tries to access a chunk that does not belong to an opened file?

1. Summary
This paper proposes a new file system design called the Google File System (GFS), which provides high aggregate performance and is fault tolerant. The authors first talk about the workloads at Google and the assumptions they made about them. They then explain each feature/aspect of this file system in detail. In the evaluation they show that their assumptions were right and how the system performs on Google's workloads.

2. Problem
The main motivation for this new design was Google's workload. Component failure has become the norm, and the file system (FS) must be fault tolerant, monitored, and capable of fast recovery. They also deal with large amounts of data, and most writes append new data to the end of a file instead of overwriting existing data. Multiple clients concurrently append to the same file, and atomicity is required.

3. Contributions
The architecture of the FS consists of a GFS master and many GFS chunkservers. The master stores three types of metadata (namespace, file-to-chunk mapping, and locations of chunks) and is responsible for providing the client with chunkserver information using that metadata. One of the things I liked about the system is that the master polls the chunkservers (using heartbeats) for chunk location information instead of keeping a persistent record. Each chunk has one primary replica and (typically) two secondary ones, and the mutation ordering is decided by the primary. They have separated the data and control flow so that the data flow is done in a network-aware manner (the idea of separating data from control is interesting and is also applied at other levels in other systems, e.g., SDN separates the data plane and control plane). As mentioned in the motivation, most writes are appends, so they have a separate record append operation, and the primary is responsible for ensuring it is consistent and defined. Other features I found in GFS were allowing concurrent mutations in the same directory, re-replicating and rebalancing chunks, and using chunk versions to avoid stale data. Another interesting note is that many GFS features are also seen in the Hadoop file system (HDFS). Along with chunk replicas, it also uses shadow master replicas, which get updates from the master and chunkservers and are used when the master goes down (read-only until elected as the new master). They provide integrity using checksums and do garbage collection in the background. They decided to temporarily rename deleted files (before garbage collecting them) instead of removing them immediately, and they use heartbeat messages to tell chunkservers about garbage collection decisions. The chunk and master replicas, together with the rebalancing and heartbeat features, provide high availability and fault tolerance. They use snapshots and a combination of logging and checkpointing to aid recovery.
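A minimal, hypothetical sketch of the lazy-deletion idea described above (the flat namespace dict and the hidden-name format are assumptions; GFS's real naming scheme is not shown here): deletion is just a rename, and a background scan reclaims anything hidden for more than about three days.

import time

GRACE_SECONDS = 3 * 24 * 3600   # ~3-day grace period, as in the paper

def delete(namespace, path):
    """Metadata-only rename to a hidden, timestamped name."""
    directory, name = path.rsplit("/", 1)
    hidden = "%s/.deleted-%d-%s" % (directory, int(time.time()), name)
    namespace[hidden] = namespace.pop(path)

def garbage_collect(namespace, now=None):
    """Background scan: drop hidden entries older than the grace period."""
    now = now or time.time()
    for name in list(namespace):
        if "/.deleted-" in name:
            stamp = int(name.split("/.deleted-")[1].split("-", 1)[0])
            if now - stamp > GRACE_SECONDS:
                del namespace[name]   # chunks become orphaned, erased later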

4. Evaluation
The evaluation section mainly justifies the assumptions made at the start of the paper. They evaluate different operations on different clusters and real-world workloads. They mainly focus on showing what operations were performed and which dominated the workload (most of the operations were writes, and here GFS performed well for concurrent operations). They also report the amount of time it took to perform various operations. I felt they could have compared GFS to the previously existing file system used at Google or to other popular file systems of the time. They were able to justify all their assumptions, but although they claim that these kinds of workloads can also exist in other places, they do not discuss where, or compare how GFS performs on similar but non-Google workloads.

5. Confusion
What do they mean by lazy evaluation? How are they doing that when space is assigned in chunks? Doesn't padding waste memory? The paper claims that accessing hidden files (deleted) is useful, when is it actually getting accessed without restoring the file (without renaming file back to its original name)? Under data integrity they say that to make sure checksum is not hiding corruption they check the first and last record of the region, why are they checking only the first and last record?

1. Summary
This paper describes the distributed file system designed by Google. The file system was optimized for workloads used at Google, keeping in mind the constraints they face every day. These workloads are mainly characterized by large sequential reads and concurrent appends to a file, while running on commodity hardware where failure is the norm. It is designed to provide high aggregate performance and to be fault tolerant. The file system is co-designed with applications to extract the maximum performance out of the system.


2. Problem
The state-of-the-art file systems were not designed for the types of challenges faced at Google. They needed a system that could run on many inexpensive commodity components that often fail. Their system stores a modest number of large files rather than a large number of small files. The workloads consist of large streaming reads or small random reads. The workloads also have large sequential writes that append to a file. These appends are concurrent in nature, and the system needs to guarantee some consistent semantics in such scenarios; atomicity with minimal overhead is desirable. Finally, they care about a system that supports high sustained bandwidth rather than low latency. Given these non-traditional requirements, they designed a new file system optimized for this scenario.

3. Contribution
The major contribution of this paper is providing an insight into the workloads at Google and the kinds of optimizations required to run such workloads given a high failure rate of hardware. The Google file system consists of a master and multiple chunkservers. Each file is divided into multiple chunks, which are stored across different chunkservers. Each chunk has a 64-bit handle associated with it and is replicated across multiple chunkservers. The site of replication is chosen based on the utilization of a chunkserver, its location, and the number of recent creations on it. The master stores minimal metadata about each file. The major structures are: the file and chunk namespaces, the mappings from files to chunks, and the locations of each chunk's replicas. The metadata is stored in the master's memory; the first two types are also kept persistent by logging mutations to an operation log. The master also has replicas, and changes to metadata are first pushed to all replicas before returning to the caller. The master maintains a "heartbeat" connection with the chunkservers, which keeps it up to date with the status and locations of chunks. The GFS interface has the usual read, write, etc. calls, along with calls to support snapshots and record appends. GFS has a relaxed consistency model that is simple and efficient to implement. File namespace mutations are atomic, and the master's operation log defines a global total order for them. A file region can be consistent or inconsistent, and defined or undefined; successful record append operations are defined but may be interspersed with inconsistent data. To maintain a consistent mutation (write/append) order across replicas, they have the notion of a lease: the chunkserver holding the lease (the primary) determines the order of all changes to a particular chunk and communicates that order to the replicas. To optimize network bandwidth utilization, the data flow is pipelined. The snapshot operation uses copy-on-write semantics to quickly make a copy of a file or directory tree. GFS uses read and write locks on namespace paths for concurrency control; to avoid deadlocks, all locks are acquired in a consistent global order. Garbage collection is done lazily, which keeps the master from blocking on storage reclamation - an important optimization given the high rate of chunkserver failures - and also provides a safety net against accidental deletion. The chunkservers also store checksums to detect corruption.
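A minimal sketch of the namespace locking described above, simplified to plain (non-reader/writer) locks and hypothetical helper names; the key point is the fixed acquisition order that prevents deadlock:

import threading
from collections import defaultdict

locks = defaultdict(threading.RLock)   # path -> lock (read/write locks simplified to one)

def ancestors(path):
    parts = path.strip("/").split("/")
    return ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]

def with_namespace_locks(path, operation):
    """Lock every ancestor plus the leaf in a deterministic global order."""
    needed = sorted(ancestors(path) + [path])
    acquired = []
    try:
        for p in needed:
            locks[p].acquire()
            acquired.append(p)
        return operation()
    finally:
        for p in reversed(acquired):
            locks[p].release()

# e.g. with_namespace_locks("/home/user/foo", lambda: create_file("/home/user/foo"))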

4. Evaluation
The authors run two sets of tests to measure the performance of their FS - microbenchmarks and real-world workloads. It would have been interesting to see a comparison of the micro-benchmarks when run on an existing state-of-the-art distributed file system; however, the authors do not present that data. For the microbenchmarks, the aggregate read/write rate increases as the number of clients increases. There is a small drop in efficiency after a certain point, mainly because multiple clients hit the same chunkservers. The append throughput is generally bottlenecked by the network bandwidth of the chunkservers that store the last chunk of the file. It would have been interesting to see some study in these microbenchmarks indicating when the master becomes a performance bottleneck; this would have given a better idea about the scalability of the system. The two real-world workloads - one focussed on R&D and the other on production data processing - validate their assumptions about the characteristics of the workloads. They also provide insight into the recovery time of the system, the cost of in-memory data structures for metadata, the rate of operations sent to the master, and the total storage used.

5. Questions
1. What does a client do with the offset returned after an append call?
2. Why does the primary have to wait for all the chunks to receive data before starting with writing the mutations?
3. How are the duplicated data regions due to append handled during sequential file read?

1. Summary
This paper presents the design and implementation of Google File System (GFS), a scalable distributed file system for large distributed data-intensive applications. Based on observations from Google’s application workloads and technological environment, GFS revisits the traditional design choices and is a departure from other file system approaches.
2. Problem
The existing distributed file systems were not tuned to the characteristics of Google’s workloads and compute environment. First, Google aimed to use off-the-shelf cheap systems in large numbers which basically meant that failure of a node is not an exception, but a fairly common occurrence. As such the file system needed to focus on monitoring, error detection, fault tolerance and fast recovery. Second, huge (multi-GB) files represent the common case, instead of small multi-KB files. As such the system parameters need to be revisited. Third, Google’s applications had the common characteristic of appending new data at the end of file, while overwriting data at specific locations was not as crucial. As such the file system needed to be optimized for appends. The existing file systems did not have these design goals/considerations and thus, GFS was designed.
3. Contributions
The major contribution of the work starts with analyzing the workloads Google is interested in and then coming up with a new set of considerations that affect the file system design. Some of them were: most files are huge, so the system should not be optimized for small files; large streaming and small random reads are important; file appends, rather than random writes, should be optimized for; and concurrent atomic file appends with minimal or no synchronization overhead should be supported. GFS did not implement a POSIX interface, but rather focussed on the requirements of the company - in addition to normal file operations, it supports snapshot and record append operations. To keep the control flow simple, GFS implements a single master per cluster that controls a number of chunkservers. It represents a file as a collection of 64MB chunks to efficiently support large file operations. It offers a different set of guarantees about file regions being consistent and defined; Google applications are aware of this consistency model and work with GFS to ensure the correctness of their implementation. By decoupling the data transfer between clients and chunkservers from the control flow between clients and master, GFS enjoys the simplicity of control enforced by the single master without the risk of making the master the bottleneck. The concept of leasing a chunk further offloads the master. The master also uses checkpointing and an operation log to maintain persistence without hurting instantaneous system performance. Replication is a key feature of GFS to boost system reliability, and a single point of control (the master) makes it possible to optimize replica placement and replication levels. GFS further provides mechanisms to ensure data integrity and stale replica detection. It uses lazy garbage collection to keep the system lightweight. It also ensures that the master holds minimal system state, in memory and persistent, to keep it lightly loaded and able to scale to a large number of servers/clients. File and chunk namespaces are structured to support multiple concurrent updates and efficient snapshot operations.
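A quick back-of-the-envelope check of the "minimal in-memory state" claim, using the paper's figure of under 64 bytes of metadata per 64 MB chunk (the 1 PB of file data is an assumed example, and it presumes mostly full chunks):

bytes_per_chunk_metadata = 64          # upper bound reported in the paper
chunk_size = 64 * 1024 ** 2            # 64 MB
file_data = 1024 ** 5                  # 1 PB of stored file data (assumed)
chunks = file_data // chunk_size
print(chunks * bytes_per_chunk_metadata / 1024 ** 3)  # -> 1.0 GB of master RAM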
4. Evaluation
In the first part of the evaluation, the authors use microbenchmarks to study a GFS cluster with 1 master, 2 master replicas, 16 chunkservers and 16 clients. They show how the aggregate read bandwidth scales well with the number of clients, with minimal decrease in per-client performance. The write performance shows a similar trend, except for relatively more degradation as clients are added, due to greater contention when updating multiple replicas per write. In the second part, the authors use two workloads - one focussed on research and development, and the other on production data processing. They show that the master and chunkservers maintain minimal in-memory metadata (50MB to 100MB). The master is lightly loaded with 200 to 500 ops/sec, which it can easily support thanks to optimized in-memory data structures, and thus does not become the performance bottleneck. Further, the recovery times observed are in accordance with expectations and Google's requirements - for example, they could restore 600GB of replicated data (lost due to a chunkserver failure) in less than 24 minutes. Lastly, they analyze the characteristics of each of the two real-world workloads to explain how they conform to the assumptions initially stated and thus benefit from the various optimizations of GFS. These experiments provide good insights into the efficacy of GFS mechanisms. However, it would be more convincing if the scalability studies shown in part one were done on bigger clusters. Further, some kind of comparison against the performance of other file systems on Google's workloads would put the GFS performance numbers in the right perspective.
5. Confusion
When record append operation returns the offset info, what is it used for? How are duplicated data regions due to append retries handled during sequential file read?
What are semantics of a directory in GFS?

Summary
The paper explains the design and implementation of the Google File System, a distributed file system for data-intensive applications. The file system is designed to handle data consisting of large files by splitting them into chunks, to use network bandwidth efficiently, and to maintain a single global master with reduced metadata. Fault tolerance is improved with shadow masters and replication of data on separate chunk servers.
Problem
As the data processing needs of Google increased, they analyzed their application loads and current environment and found the following problems, which are not addressed by general distributed file systems:
1] Failures of systems in a distributed environment are normal, hence constant monitoring, fault tolerance, and automatic recovery should be part of the system. The file system is now stored on inexpensive commodity systems.
2] Files are huge, hence design assumptions and parameters such as I/O operation and block sizes have to be revised.
3] Most operations on files are appends rather than overwrites of existing data, hence the focus should be on append performance.
4] Co-designing the applications and the file system API benefits the overall system by increasing flexibility.
Contributions
1] Efficient utilization of network bandwidth by minimizing interaction with the master. Chunk locations are gathered from chunkservers at startup, and the master then exchanges periodic HeartBeat messages with them to give instructions and check that they are alive; the master also handles replica placement and load balancing.
2] Fault tolerance and faster recovery by using operation logs and shadow masters.
3] Chunk replicas are placed on different chunk servers (and racks), giving faster recovery of chunks, increased fault tolerance, and better utilization of network bandwidth.
4] The master bottleneck is reduced by minimizing the metadata stored so that it fits in memory. Clients interact directly with chunkservers, picking the one nearest in the network topology.
5] The master stores the filename-to-chunk mapping in a flat lookup table, so no directory structure has to be traversed.
6] Lazy space allocation avoids waste from internal fragmentation. Garbage collection is lazy: deletion is just logged and space is reclaimed later during regular scans, which simplifies garbage collection in a distributed system, a hard problem.
Evaluation
All the evaluations are specific to workloads identified within Google's systems and are very detailed, with clear explanations.
Good points:
The graphs clearly show the network limit and what was achieved, and also explain where they lag behind (Figure 3). There is a good evaluation of real-world workloads that exist at Google, and a detailed explanation of the workload in Table 2. The reduction in metadata that was achieved is clearly shown. The read and write rates achieved are shown in Table 3, as is the reduction in interaction with the master. A clear breakdown of the workload is given in Section 6.3, and Tables 4 and 5 explain the distribution of data and operations.

Missing evaluation:
1] The evaluation of recovery doesn't seem sufficient to me. Instead of failing 1 or 2 chunkservers (out of just 16) and checking the recovery time, the experiment should have asked: if there are thousands of chunkservers and multiple of them are failing, how would recovery perform, and how would the load-balancing mechanism perform?
2] In the evaluation there are only 16 clients and 16 chunkservers. Since the mapping is from filename to chunks, if thousands of files are present it is not clear how the master will perform, as the metadata will grow and response time might change if it no longer fits in memory.
3] How does the lazy garbage collection mechanism perform when the system has reached its maximum capacity and there is a heavy read/write workload on the network? Will the system be able to reclaim space efficiently?
4] There is no comparison with other existing distributed file systems such as AFS.
Confusion
I am still not clear how the system recovers from an inconsistent state.

1. Summary
The authors propose a mechanism to build a scalable distributed file system on top of a cluster of cheap machines for data-intensive applications, and specially tune the design to the actual applications running at Google. It consists of a single master maintaining file system metadata and multiple chunkservers accessed by multiple clients.
2. Problem
A distributed file system should deal with application and OS bugs, human errors, and failures of disks, memory and networks. Realistic design assumptions should be made, sacrificing the management of small files for the prominent large-file workload. Performance optimization and atomicity need to be guaranteed by focusing on appends rather than random writes. Co-designing the file system with applications, for example by relaxing the consistency model, can increase application flexibility.
3. Contributions
It provides fault tolerance by constant monitoring, and by detection and recovery of component failures. It delivers high aggregate performance to a large number of clients by batching and sorting small reads and maintaining high sustained bandwidth. It has snapshot and record append operations to handle multi-way merge results and concurrent appends to producer-consumer queues. Cache coherence issues are eliminated because neither chunkservers nor clients cache file data locally. Having a single master makes it possible to make sophisticated chunk placement and replication decisions using global knowledge. Space wastage is avoided by lazy space allocation. The operation log allows the master state to be updated simply and reliably, without inconsistencies on a crash. The metadata size is reduced by employing prefix compression of file names, and the checkpoint is stored in a B-tree-like form directly mapped into memory, which further speeds up recovery and improves availability. Not storing chunk locations persistently eliminates the need to keep the master and chunkservers in sync in the event of failures. Consistency is guaranteed by ensuring the same write order on all replicas and by using version numbers to detect staleness. Checksumming and regular handshake protocols are employed to detect chunkserver failures and data corruption.
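A minimal, hypothetical sketch of stale-replica detection with version numbers (class and method names are assumptions): the master bumps a chunk's version whenever it grants a new lease, so any replica that missed mutations reports an older version and is scheduled for garbage collection rather than served to clients.

class ChunkMaster:
    def __init__(self):
        self.version = {}          # chunk handle -> latest version number
        self.stale = set()         # (server id, chunk handle) pairs to reclaim

    def grant_lease(self, handle, live_replicas):
        """Bump the version on every currently reachable replica."""
        self.version[handle] = self.version.get(handle, 0) + 1
        for r in live_replicas:
            r.set_version(handle, self.version[handle])

    def on_chunk_report(self, server_id, handle, reported_version):
        latest = self.version.get(handle, 0)
        if reported_version < latest:
            self.stale.add((server_id, handle))   # never handed to clients
        elif reported_version > latest:
            # The master is behind (e.g. it restarted after granting a lease);
            # adopt the higher version as the latest.
            self.version[handle] = reported_version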
4. Evaluation
It does not target low latency, since the applications do not impose stringent response-time requirements. It does not guarantee that all replicas are byte-wise identical after record append operations. Decoupling data flow from control flow leads to a performance increase by leveraging the network topology: each machine's network bandwidth is fully utilized. Applications receive errors instead of corrupt data. Clients and chunkservers experience only a minor hiccup during master recovery while timing out on their outstanding requests. When the master is down, file metadata such as directory contents or access control information read from shadow masters could be stale. An efficiency drop was observed as the number of readers increased, due to simultaneous reads from the same chunkserver. The pipelining scheme interacts poorly with the test network stack, resulting in reduced write rates due to delays in data propagation between replicas. Due to the small metadata size, recovery is fast. The master supports efficient file accesses via binary searches of the namespace in its data structures. Data loss during chunkserver failure is prevented by fast cloning during chunk restoration. Load balancing and fault tolerance are helped by maintaining a location-independent namespace. Garbage collection uniformly cleans up any unused replicas and imposes little overhead, as it is done in the background in batches.
5. Confusion
Can we go through how deadlock between GFS data structures is prevented?

1. Summary
This paper is about the Google File System(GFS) which is a distributed file system designed for a network of inexpensive commodity hardware systems with high fault tolerance. It is a distributed system with a single master and multiple chunk servers.
2. Problem
GFS aims to address the following main design issues faced by current distributed file systems: the inexpensive storage components in the distributed network have a high failure rate; the file system consists mostly of large multi-GB files as opposed to smaller files; and file operations are mostly appends rather than overwrites.
3. Contributions
The GFS system consists of a single master and multiple chunk servers. A file is divided into multiple chunks of fixed size, and each chunk is replicated across multiple chunk servers. All metadata is maintained by the single master. Clients interact with the master to find the chunk servers for a file and then perform data operations directly with them, so the master does not become a bottleneck. The chunk-to-chunkserver mapping is not maintained persistently; the master queries the servers to find out this mapping at start time, which prevents consistency issues. An operation log maintains a persistent record of metadata changes, and changes are made visible only after they are replicated and persisted in the log.

A lease is used to ensure correct ordering while applying mutations to all chunk replicas. The lease is granted to one server, which is considered primary; this server picks the order in which mutations are applied to all replicas. Extension of the lease is piggybacked on heartbeat messages. Data flow is decoupled from the above control flow, i.e., data is pushed serially to chunk servers in a pipeline. Record appends provide atomic, at-least-once append operations to a file. Copy-on-write techniques are used to implement snapshots. The physical space of a deleted file is not immediately reclaimed: the deletion is logged, the filename is changed to a hidden name, and a timestamp is maintained. Hidden files are removed if they are older than 3 days, and this interaction is piggybacked on heartbeat messages.
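A minimal, hypothetical sketch of snapshot via copy-on-write (the reference-count bookkeeping and clone_chunk callback are assumptions): the snapshot itself only duplicates metadata and bumps reference counts; a chunk is copied lazily, the first time a shared chunk is written.

class SnapshotMaster:
    def __init__(self):
        self.file_chunks = {}   # filename -> list of chunk handles
        self.refcount = {}      # chunk handle -> number of files sharing it

    def snapshot(self, src, dst):
        # (lease revocation on src's chunks would happen before this point)
        self.file_chunks[dst] = list(self.file_chunks[src])
        for h in self.file_chunks[dst]:
            self.refcount[h] = self.refcount.get(h, 1) + 1

    def before_write(self, filename, index, clone_chunk):
        """Break sharing on first write: clone the chunk locally on its servers."""
        h = self.file_chunks[filename][index]
        if self.refcount.get(h, 1) > 1:
            new_h = clone_chunk(h)
            self.refcount[h] -= 1
            self.refcount[new_h] = 1
            self.file_chunks[filename][index] = new_h
        return self.file_chunks[filename][index]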

4. Evaluation
Using microbenchmarks on a cluster with one master, two master replicas, 16 chunk servers and 16 clients, it is shown that read efficiency drops as the number of clients increases (because multiple clients read from the same chunk server). Similarly, the write rate is significantly less than the theoretical limit (because of the test network stack), though this does not hinder the aggregate write bandwidth of the system. Append performance is limited by the network bandwidth of the chunk servers holding the last chunk; this is acceptable, as a client can move on and write to a new file while the chunkserver is busy with the append. Two clusters (one used for R&D, the other for data processing) in the Google environment are used to analyze the performance of the system in a real-world setting. It is shown that the master is not a bottleneck and can handle 200-500 operations per second. Further, a single or double failed chunkserver can be restored within a reasonable time. The evaluation is well justified and nuanced, because it explains how the observed performance is sufficient for a production environment even when it is less than the theoretical maximum. The evaluation could have touched upon the performance impact if the master fails and needs to recover the metadata state from its replicas.
5. Confusion
What exactly happens when the master fails or gets corrupted and its state needs to be recovered? Has that happened at Google so far, and how would that impact overall performance?

1. Summary
Google File System caters to storage for large distributed data-intensive applications, taking care of fault-tolerance, scalability, reliability, and availability. GFS builds upon real-world technical and application requirements to provide a flexible and scalable storage platform in the form of a client-server model to manage files and provide fault-tolerance.
2. Problem
Analyzing the real-world environment and application workload, it is clear that component failures are a norm, files are large, appends are most common, there would be concurrent writes, and that higher bandwidth is most desired rather than latency. So, the file system needs to address these by making the internal architecture transparent. Fault tolerance, replication, large file placement and management should be handled by the storage layer, and not the users.
3. Contributions
GFS was designed to handle application workloads at Google data centers, hence the architecture is heavily based on the specific behavior stated in the assumptions. They derive a client-server model, with the server side consisting of a single master containing metadata and multiple chunkservers containing data, and clients issuing requests on files that are divided into fixed-size chunks; managing replicas of arbitrary size would be complex, and internal fragmentation due to the large fixed chunk size is not a concern.
Thus, the major takeaway here is that a chunk is the unit of replication, and replicating a chunk across multiple chunkservers achieves high availability and fault tolerance. With such a cluster architecture [master + chunkservers] there is a decoupling of control and data transfers, which makes request operations faster and keeps the network load balanced, as opposed to having a single server serving both metadata and data. Data transfer is pipelined to attain maximum bandwidth, both per machine and across the network. GFS is also famous for the record append operation: owing to the workload behavior, they optimize for appends rather than random writes. The GFS client contacts the master for chunk locations, and the primary replica chooses and returns the offset and appends the data to each replica at least once (the consistency semantics), without the need for any distributed locking even with multiple concurrent appends. Fault tolerance is achieved through chunk replication, with a minimum replication factor of 3, and reliability through replicating the master state to shadow masters. The techniques employed for replica placement, load balancing, and lazy garbage collection are sound for such a distributed system. A wonderfully written paper; the design and principles are convincing, and the assumptions go unchallenged considering it was at Google.
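A minimal sketch of a placement heuristic in the spirit of the paper (the server attributes and the exact ordering are assumptions): prefer chunkservers with low disk utilization and few recent creations, and never put two replicas on the same rack.

def place_replicas(servers, n=3):
    """Pick up to n servers, preferring low utilization and few recent
    creations, with at most one replica per rack (sketch only)."""
    candidates = sorted(servers,
                        key=lambda s: (s.disk_utilization, s.recent_creations))
    chosen, racks = [], set()
    for s in candidates:
        if s.rack in racks:
            continue                  # spread replicas across racks
        chosen.append(s)
        racks.add(s.rack)
        if len(chosen) == n:
            break
    return chosen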
4. Evaluation
GFS is first evaluated to find its bottlenecks, and then its performance on real workloads at Google clusters is presented. Reads get high throughput up to the point where chunkservers become the bottleneck as clients increase. Writes are affected more as the number of clients increases, due to network congestion from writing to multiple replicas, while appends are bottlenecked by the network capacity of the chunkserver holding the last chunk. On real-world clusters, the measurements support their assumptions and design principles, and they test fault tolerance and recovery. This section falls a little short because they did not compare GFS with another distributed file system/architecture. If comparing against other systems was impractical, they could have compared the different optimizations in achieving each design goal.
5. Comments/Confusion
Where do they employ lazy space allocation, because they eagerly allocate a 64MB chunk anyway. Why would they check data integrity before overwrites, and if that was justified, why 1st block along with the last block?

1. Summary
Based on an understanding of the majority of its workloads and access patterns, Google came up with its own distributed file system, the Google File System, which heavily optimizes for the common case at Google.
2. Problem
Previous distributed file systems came with many design assumptions that were not acceptable in Google's environment. Contrary to those assumptions, component failures are so common that they are regularly expected, files are much bigger, and most files are updated by appending rather than overwriting existing data.
3. Contributions
As a result, GFS was introduced not only to continue achieving previous distributed file systems' goals, such as scalability, availability, and reliability, but also to be heavily optimized for the workloads at Google. These workloads consist of large files that are written mostly by appends and read either in large streams or in small random accesses.
The GFS architecture consists of a single master, which holds all the file metadata, and multiple chunkservers, which persist chunks of files. When a client needs to access a file, it first asks the master which chunkservers hold the specific file chunk, and it then communicates with the listed chunkserver(s) to access the data. To keep things fast, the master keeps metadata in memory, and to keep things reliable and available, the master is shadowed by additional machines.
GFS provides a relaxed consistency model, which guarantees that metadata changes are atomic using an operation log, and includes an atomic record append so that multiple clients can append to the same file without extra synchronization.
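A minimal, hypothetical sketch of an operation log (JSON records and these method names are assumptions, not GFS's format): every metadata mutation is appended and flushed before it is applied in memory, and replay after a restart rebuilds the namespace; checkpoints would bound how much log needs replaying.

import json, os

class OperationLog:
    def __init__(self, path):
        self.f = open(path, "a+")

    def append(self, record):
        """Append and flush the mutation before applying it in memory."""
        self.f.write(json.dumps(record) + "\n")
        self.f.flush()
        os.fsync(self.f.fileno())   # durable locally; GFS also replicates it remotely

    def replay(self, apply_fn):
        """Rebuild in-memory state after a restart (from the last checkpoint on)."""
        self.f.seek(0)
        for line in self.f:
            apply_fn(json.loads(line))

# Usage: log.append({"op": "create", "path": "/a/b"}) before mutating the
# in-memory namespace; on restart, load the last checkpoint, then replay().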
Another contribution of GFS is that it heavily uses heartbeat messages and piggybacks other messages on them between the master and chunkservers. It also tries to take full advantage of network bandwidth, as well as load balancing, by reading different parts of a file from separate chunkservers in parallel.
It also provides snapshotting and data integrity using checksums.
4. Evaluation
The authors do two sets of tests: micro-benchmarks and real-world workloads. Though the micro-benchmark results are useful for knowing the general performance of the file system, GFS is not compared against previous distributed file systems such as AFS to show how much faster it is. This would have been useful for the micro-benchmarks, though it might not have been possible for the real-world workload results, which were taken over a long period of real use of the file system. In addition, the applications are tuned for GFS rather than for a general FS that could be swapped in to replace GFS. Overall, the results look good, and the authors show that GFS tries to utilize the network bandwidth as much as possible to prevent possible network bottlenecks.
5. Confusion
Can we go over the consistency model and how it is okay for data to be in different offsets in each replica?

1. Summary
This paper describes GFS, a scalable distributed file system (DFS) designed from the ground-up to provide fault tolerance and high aggregate performance. This system's design was based on Google's application workloads and technological environment. The paper clearly delineates the design choices, presents the interface extensions and reports measurements from micro-benchmarks and real world utilization.

2. Problem
Prevalent distributed file-systems at the time were designed to handle small files and were optimised for low latency and performant random read/write operations. Google's append-once-read-many workload, technological environment and scalability requirements led them to re-examine file system design choices and build a system using commodity hardware based on the following assumptions:
- Component failures are the norm
- Large multi-GB files are the common case
- Workloads are typically large streaming reads/writes
- High sustained bandwidth over low latency

I personally think that this problem was interesting as the file-system was designed to optimize large file management, large streaming reads and append operations.

3. Contributions
The design of a scalable, distributed, parallel, fault-tolerant file system built from commodity hardware for specialized workloads was highly innovative. The system is architected around a single master with multiple chunkservers serving a large number of clients. Though not POSIX compliant, the file system interface supports the usual operations along with two new operations, snapshot and record append. The use of a large chunk size (64MB) proves to be advantageous and helps reduce client-master interactions, network overhead, and the master's metadata. The system maintains all of its metadata (the file and chunk namespaces, the file-to-chunk mapping, and chunk replica locations) in memory, which is drastically different from existing systems, and uses prefix compression to reduce the memory footprint of filenames. The use of an operation log further improves recovery and availability. GFS has a relaxed consistency model and explicitly defines the notions of consistent and defined states; the authors show that this model works when the applications and the file system are designed cooperatively. The system uses chunk version numbers to detect stale replicas. GFS separates control flow from data flow, which I believe was highly innovative at the time and helps significantly with chunk leasing and serializing mutation order. GFS uses locks to efficiently manage namespace regions, has effective replica placement policies to ensure better availability, and takes significant steps toward sound chunk creation, re-replication, and rebalancing. Though distributed garbage collection is a challenging problem, GFS uses lazy (non-eager) deletion (the file is renamed to a hidden file), which is simple and reliable. It also moves storage reclamation into the background, which amortizes its cost and provides a safety net against accidental, irreversible deletes.
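An illustrative sketch (not Google's code) of the three kinds of master metadata listed above, all held in memory; the first two are also persisted via the operation log, while replica locations are re-learned from chunkservers at startup.

    # Sketch of the master's in-memory metadata; field names are assumptions.
    class MasterMetadata:
        def __init__(self):
            self.namespace = set()        # full pathnames, e.g. "/logs/2003/web"
            self.file_chunks = {}         # pathname -> ordered list of chunk handles
            self.chunk_locations = {}     # chunk handle -> set of chunkserver ids
            self.chunk_version = {}       # chunk handle -> version (stale detection)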
As the master is a single point of failure, the system maintains "shadow" masters to provide read-only access when the primary master is down. External monitoring systems can detect master failures and quickly spin up a new, up-to-date master using the operation log and its replicas.
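A rough sketch, under assumed names, of the recovery path just described: load the latest checkpoint, replay the newer operation-log records, then rebuild chunk locations by asking chunkservers what they hold (locations are not persisted).

    # Hypothetical recovery flow for a replacement master.
    def recover_master(checkpoint, log_records, chunkservers):
        state = checkpoint.load()                 # compact snapshot of namespace + mappings
        for record in log_records:                # only records newer than the checkpoint
            state.apply(record)                   # replay namespace / file->chunk mutations
        for cs in chunkservers:                   # chunk locations are not logged:
            for handle in cs.report_chunks():     # they are re-learned by polling
                state.chunk_locations.setdefault(handle, set()).add(cs.id)
        return state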
Overall, the authors seem to have targeted "simplicity" to build elegant solutions to complex problems, from avoiding hot spots on chunkservers to metadata and operation log management.

4. Evaluation
The authors use micro-benchmarks and real-world clusters to justify the design choices and illustrate the bottlenecks within the system. The micro-benchmarks for read, write, and record append are compared against the theoretical limit. Per-client read efficiency drops as the number of readers increases, which could be attributed to multiple readers hitting the same chunkserver. Per-client write efficiency drops as the number of writers increases, which could be due to increased collision rates. Record append performance also drops with an increasing number of clients, which is attributed to congestion and variance in network transfer rates.
The authors showcase results from two GFS clusters used at Google. They first evaluate read/write performance and provide statistics consistent with the workload characteristics (more reads than writes, network throughput of 750 MB/s). They next provide statistics showing that the load on the master is within manageable limits (around 500 operations/second). The authors also designed experiments in which they killed chunkservers to evaluate recovery time: a single failed chunkserver (15,000 chunks holding 600GB of data) was restored in 23.2 minutes, while in the second, more drastic double-failure case, which requires higher-priority cloning, the singly-replicated chunks were restored to at least 2x replication within 2 minutes. The authors also provide a workload breakdown, with operations and bytes transferred broken down by operation size, to support their claims.
Overall, I found the evaluation unsatisfying, as the results are presented as a large set of numbers (metrics) without deep explanations or justifications. The authors also do not compare their system against any existing distributed file system to show how much better or worse it could be.

5. Confusion
How does lazy space allocation work?
Why would anyone use the hidden name to access a deleted file?
When would the hidden (garbage-collected) file name be used in a read through the master?
Why should we only read and verify the first and last blocks of the range for a write operation over an existing range?

Summary
The paper describes the Google File System (GFS), a scalable distributed file system designed specifically for distributed data-intensive applications at Google. It runs on commodity hardware, provides fault tolerance, and delivers high aggregate performance to a large number of clients.

Problem
Google's applications, compute, and storage infrastructure differ significantly from other companies' designs and needs. Google's applications mostly append new data rather than overwrite existing data. Moreover, the files being operated on are on the order of gigabytes. The compute and storage infrastructure comprises thousands of commodity machines where failure is the norm rather than the exception. Traditional file systems are unable to cater to Google's high data-processing needs and provide fault tolerance without sacrificing performance. Thus, Google wanted to re-examine traditional file system assumptions in light of current and anticipated workloads and infrastructure, where the applications and the file system are co-designed to suit each other's needs.

Contribution
GFS consists of a single master node and multiple chunkservers, and is accessed by multiple clients. The master maintains all file system metadata - the namespace, ACLs, the mapping from files to chunks, and the current locations of chunks. All the metadata is stored in memory for fast access, and the file mapping and namespace are also stored on disk in the form of an operation log. Replaying the operation log is used for recovery whenever a failure happens. The master also controls chunk lease management, garbage collection of orphaned chunks, and chunk migration between chunkservers. Chunkservers store files in the form of fixed-size chunks and replicate them efficiently (rack-aware) to ensure reliability and availability. Chunks are stored on local disks as Linux files, and read/write data is specified by a chunk handle and a byte range. The primary chunkserver, which is granted the chunk lease, defines the order between mutations, and all secondary chunkservers follow that same order. Clients interact with the master only for metadata operations, while all data-bearing communication goes directly to chunkservers. Clients never cache file data but do cache metadata for a limited time.
File namespace mutations are handled by the master, which guarantees their atomicity and correctness. The state of a file region after a mutation depends on the type of the mutation, whether it succeeded or failed, and whether there are concurrent mutations; the region can be consistent, defined, undefined, or inconsistent. Data mutations consist of writes - data written at an application-specified offset - and record appends - data appended atomically at least once even in the presence of concurrent mutations, but at an offset chosen by GFS. To ensure fault tolerance, the master and chunkservers can restore their state quickly, and the master's state is replicated for reliability. For data integrity, chunkservers use checksums to detect corruption.

Evaluation
To illustrate performance, the authors create a small cluster configuration and run micro-benchmarks to measure read, write, and record append performance. Per-client read and write performance decreases with an increasing number of clients as multiple clients start reading from or writing to the same chunkserver simultaneously. Record append performance is impacted by congestion and variance in network transfer rates as the number of clients increases. For real-world clusters, the authors point out that the metadata size is very small compared to the data size, so the master's memory does not limit the system's capacity. Moreover, the load on the master is about 200-500 operations per second, with 64.3% of requests for FindLocation, 26.1% for Open, and 7.8% for FindLeaseHolder. This confirms that scaling is possible by adding more chunkservers without significantly affecting the master's in-memory footprint or load. The authors also show how quickly the cluster recovers when one chunkserver is killed and how the cluster clones chunks at the highest priority when two chunkservers are killed, thereby enabling the system to tolerate another chunkserver failure without data loss. Overall, the results show that read rates reach almost 75-80% of the theoretical limit, supporting their goal of making reads fast since the workloads are mostly read-heavy. The workload also contains large, sequential writes that append to files, which the system's aggregate write bandwidth handles even though writes are slower than expected. I do not expect any security-oriented benchmarking, as GFS is used internally at Google where everything is trusted within the enterprise.

Confusion
I am not sure how GFS recovers when a file region is left inconsistent by a failed mutation. Also, it seems that applications need to do a lot of heavy lifting validating and identifying their own records. Wouldn't this complicate applications and divert attention from application logic to handling file system semantics?

1. Summary
The Google File System is a distributed file system specially tuned for the application usage pattern at Google, where device failures are common, a relatively small number of huge files are stored, and most I/O consists of concurrent appends and sequential reads.

2. Problem
Traditional distributed file systems treat device failure as an exception, but this no longer holds when a huge number of devices is involved; fault tolerance and recovery need special care. The fact that most I/O is sequential and most files are huge is not fully exploited to optimize performance, and traditional locking schemes are not optimal either.

3. Contributions
The master controls all the metadata while chunkservers store fixed-size chunks. A client can talk to a chunkserver directly, and chunkservers can talk to each other. A temporary primary replica is assigned by the master to coordinate all replicas when a write arrives. This centralized architecture simplifies management while retaining the performance and reliability gains of a distributed system.
A special atomic record append operation is introduced, and the consistency guarantee is relaxed to achieve a simpler design and better performance. Replicas are not bytewise identical: they provide the same valid data but may contain different garbage. The detection of padding, corrupted data, and duplicates can be done in application libraries, where checksums and unique IDs are already widely used.
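A hedged sketch of the application-library convention alluded to here: each appended record carries its own length, checksum, and unique id, so a reader can skip padding or corruption and drop duplicates left by retried appends. The framing format below is invented purely for illustration.

    # Illustrative self-describing record framing for at-least-once appends.
    import struct, zlib

    def frame(record_id, payload):
        body = struct.pack(">Q", record_id) + payload
        return struct.pack(">II", len(body), zlib.crc32(body)) + body

    def parse(buf):
        seen, pos, records = set(), 0, []
        while pos + 8 <= len(buf):
            length, crc = struct.unpack_from(">II", buf, pos)
            body = bytes(buf[pos + 8 : pos + 8 + length])
            pos += 8 + length
            if len(body) < length or len(body) < 8 or zlib.crc32(body) != crc:
                continue                    # padding/corruption; a real reader would resync
            rid = struct.unpack_from(">Q", body)[0]
            if rid not in seen:             # duplicate from a retried append: drop it
                seen.add(rid)
                records.append(body[8:])
        return records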
Deleted, corrupted, and otherwise useless data are all handled uniformly by the garbage collector. Collection is also part of the master's regular background activity, just like the HeartBeat messages, which keeps its cost low.
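A sketch, with hypothetical names, of this lazy scheme as the paper describes it: delete only renames the file to a hidden, timestamped name, and a background scan reclaims it after a grace period (the paper mentions a default of three days), after which chunks no longer referenced by any file become garbage. Here meta is assumed to expose file_chunks (path -> chunk handles) and chunk_locations (handle -> servers) dictionaries.

    # Illustrative lazy deletion and background reclamation.
    import time
    GRACE_PERIOD = 3 * 24 * 3600  # three days, per the paper's default

    def delete(meta, path):
        hidden = f"{path}.deleted.{int(time.time())}"
        meta.file_chunks[hidden] = meta.file_chunks.pop(path)   # rename, keep chunks

    def gc_scan(meta, now):
        for name in list(meta.file_chunks):
            if ".deleted." in name and now - int(name.rsplit(".", 1)[1]) > GRACE_PERIOD:
                del meta.file_chunks[name]        # drop metadata; chunks become orphaned
        live = {h for chunks in meta.file_chunks.values() for h in chunks}
        for handle in list(meta.chunk_locations):
            if handle not in live:
                del meta.chunk_locations[handle]  # chunkservers free these via heartbeats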
To achieve fast recovery, the design does not distinguish graceful termination from a crash at all.

4. Evaluation
Results from both micro-benchmarks and practical measurements are given. In the micro-benchmarks, the authors compare the measured I/O rates with the network limit and give some simple explanations of the gaps. Although write performance is not as good as expected, they claim the causes would not matter in practical use.
For real-world statistics, clusters used for research and for production are both included. Tasks on the research cluster are relatively smaller and shorter than those on the production cluster. Metadata cost, I/O rates, master load, recovery time, and a workload breakdown are shared. The results validate the assumptions made in designing the system and show that the resulting system met its goals.
Some designs are not covered in the experiments, including garbage collection, rebalancing and master replication.

5. Confusion
Is there any unnecessary overhead as GFS is built on top of a Linux file system?

1. Summary: The paper presents the Google File System, which is designed to efficiently handle the common case of large reads and appends to large files, a pattern observed in Google's workloads. The authors back their claims by providing relevant workload statistics and the performance of their implementation on these workloads.
2. Problem: Contemporary distributed file systems (DFS) were agnostic of the workloads that would use them; in a way they were too generic. Since applications and file systems were not co-designed, the file system had to handle all corner cases and suffered from unnecessary complexity and delays. Failures and recovery were still not considered the "norm". In fact, at the scale of thousands of disks and petabytes of data, the need is to handle and optimize for such failures. The requirement to stay compatible with previous file systems also limited radical assumptions and developments!
3. Contribution: With all this, the authors analyzed their workloads and realized that the common case had changed. Thus, they decided to build a new DFS optimized for their needs: 1. handle large files; 2. optimize for the common case of large reads and data appends; 3. handle faults and recovery so that the system can run on commodity hardware. They further realized that throughput matters more than latency for their applications, and made several important design decisions which are their major contributions:
a. In a way, they made the FS "application aware", which is also evident from the large chunk size they chose. Since this was a new approach, they also did not unnecessarily worry about backwards compatibility with existing systems.
b. They realized a simple design is important and use one master to make bookkeeping decisions, but they did not want the master to become a bottleneck; the master therefore stores only metadata, and the data itself is spread across many machines.
c. They decoupled data flow from control flow to ensure consistency in concurrent appends, without being limited by network bandwidth. They also use caching, timeouts, and B-tree like structures for efficient lookups and other optimizations.
d. They solved the reliability problem with replication. They identify failures by implementing "heartbeat" messages to find stale data and chunkservers that are down. They piggyback garbage-collection information on these messages and perform collection lazily so that real work is not affected.
e. They use log records and checkpointing for faster recovery.
This paper also inspired HDFS, the Hadoop Distributed File System, an open-source file system widely used today to implement MapReduce operations!
4. Evaluation: The authors base their design on their workload and present its breakdown to justify their claims. They also present the throughput of their system for reads, writes, and appends of 1 GB of data, and show that the system scales with the number of clients. The throughput for writes and appends is lower than for reads because of the delay of transferring the same data to all replicas. Although they show that the number of operations at the master is around 500 ops/s, which it can handle, I would have liked to see throughput results for the master as well as the chunkservers when the master starts or restarts after a crash; since all chunkservers have to communicate with the master initially, the traffic on the network would be really high. Also, they did not compare their results to any existing DFS, which would have said more about the effectiveness of their specific optimizations. And even though they say their FS is tailored for their applications, performance results on other workloads would have thrown more light on the drawbacks of the current FS.
5. Confusion: Exactly what kinds of workloads perform mostly appends? How do applications handle inconsistent regions left behind by record appends?

1. Summary
The authors have come up with a new scalable file system (FS) whose design is based on present workloads, which do more appends than overwrites, use commodity hardware, and involve large files. It is also fault tolerant, provides fast automatic recovery, and delivers high performance to clients. The FS was co-designed with applications that could benefit from such a system.

2. Problem
Existing FSs were designed based on assumptions (made when workload patterns were different) that were no longer relevant. Also, most FSs tried to find a solution for the general use case, where specific application requirements were not given any consideration.

3. Contribution
The authors came up with an FS that uses commodity components and is designed for large files, breaking them into chunks of 64MB, each replicated three times. Reads come in two forms - sequential and small random reads (which applications often batch and sort in advance). The design consists of a master (with read-only shadow masters) and chunkservers. The master stores metadata in memory, backed up on disk; chunk locations are not persisted, and the master polls the chunkservers to learn them. Chunkservers store the actual data in 64MB chunks. Clients interact with the master only for metadata, and the rest of the interaction happens directly with chunkservers, which prevents the master from becoming the bottleneck of the design. Operation logs are used together with checkpointing to persist the record and allow fast failover recovery. To provide consistency, the master grants a chunkserver a lease (60 seconds), and that chunkserver acts as the primary that orders writes. Data is copied in a network-optimized manner, flowing first to the nearest chunkserver and pipelined onward; writes are committed only when the primary chunkserver signals the secondary chunkservers. Other operations such as snapshot are done by duplicating the metadata on the master, followed by the chunkservers making local copies of the chunk data. For replica placement, the design prefers availability by placing replicas on different racks (even though this may cause write traffic to flow across racks). When creating and re-replicating chunks, the master also considers factors such as placing replicas on below-average-utilized disks and prioritizing chunks with only one remaining copy over those with two. Garbage collection (run in the background, with lazy cleaning) is used for deleted files, which makes things a lot simpler. Stale replicas are handled using version numbers, ensuring clients get the latest copy of the data. Other integrity mechanisms, such as checksums, provide an extra layer of correctness without which reads would be susceptible to silent corruption.
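A rough sketch of the re-replication priorities and rack-aware placement summarized above; the chunkserver objects with rack and disk_utilization attributes are hypothetical, and this is only an illustration of the policy, not the paper's code.

    # Prefer chunks furthest below their replication goal; place new replicas on
    # lightly loaded chunkservers, ideally on a rack that lacks a replica.
    def pick_chunk_to_clone(chunks, goal=3):
        # chunks: dict of chunk handle -> set of live chunkserver objects
        return min(chunks, key=lambda h: len(chunks[h]) - goal)   # 1 copy before 2

    def pick_destination(chunkservers, existing_replicas):
        used_racks = {cs.rack for cs in existing_replicas}
        candidates = sorted(chunkservers, key=lambda cs: cs.disk_utilization)
        for cs in candidates:
            if cs in existing_replicas:
                continue
            if cs.rack not in used_racks:
                return cs                   # prefer a new rack for availability
        # otherwise fall back to any lightly loaded server without a replica
        return next(cs for cs in candidates if cs not in existing_replicas)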

4. Evaluation
The evaluation in the paper compares read and write rates against theoretical network limits, which serves as a good estimate of performance relative to the achievable maximum. However, the test system was not close to the scale of the real-world clusters. The authors also give a sense of the memory consumed and its breakdown as metadata at the master and chunkservers, which supports their claim that metadata can be kept in memory. The recovery-time experiments provide baselines for the time needed to recover from a chunkserver failure and to re-replicate chunks left with a single copy. The last set of experiments shows the type and number of calls made to the master. Overall, the evaluation is comprehensive, but since they did not compare with any existing solutions, it is difficult to put their results in context. They could also have provided some justification for choosing 64MB as the chunk size.

5. Confusion
I am still not clear how the file system interacts with intermediate files - pipes, mmap'd areas? Does it only make copies of regular files, or of the temporary areas as well?

1. Summary GFS provides a failure-tolerant distributed file system optimized for read-heavy and sequential-write-heavy concurrent workloads over multi-gigabyte files.

2. Problem Google's big-data workloads require data storage at the petabyte scale, requiring hundreds of machines backed by inexpensive commodity hardware to serve data. At this scale, storage clusters are guaranteed to be experiencing one or more hardware failures at any given time, so replication is mandatory. Additionally, these workloads entail concurrent access to multi-gigabyte files by large numbers of clients, and data must be persisted correctly and atomically to replicated files. Because of the high degree of replication and the number of concurrent clients, it needs to be possible to correctly propagate metadata changes and to read and write large extents of data without inducing network congestion or letting any single network node become a performance bottleneck.

3. Solution
By identifying key characteristics of their workload, the authors tune GFS to its specific properties: access to multi-gigabyte files, where the majority of writes are sequential appends and readers are resilient against certain forms of inconsistency. The file system is thus tuned to support at-least-once appends and arbitrary reads with high throughput. Each file is divided into fixed-size chunks, which are replicated to some predefined degree across nodes. A 'Master' server acts as the global maintainer of metadata, mapping filenames and their offsets to chunks and maintaining a directory of chunk locations via in-memory data structures. The communication protocol is designed to minimize the need for communication with the master. On a write, for example, the master grants a lease to a primary chunkserver and informs the client of the primary's identity along with the locations of the replicas. To optimize network traffic, data flow and control flow are decoupled: the client is free to push the write data to the closest replica (from which it is pipelined to the others) while sending the request to the primary. The primary selects a canonical serialization of all pending writes, which it pushes to the replicas. Stale chunks are detected via versioning, and garbage is lazily collected. If the master stops receiving heartbeats from one or more chunkservers, new replicas are created to bring all chunks with missing replicas back up to a predefined count.
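A hedged sketch of this decoupled flow; names such as get_lease_info and push_data are illustrative, not the real GFS RPCs, and in GFS the primary forwards the control request to the secondaries itself, which the loop below only approximates.

    # Data flow (bulk bytes along a replica chain) vs. control flow (small
    # request to the primary, which serializes the mutation).
    def write(client, master, filename, chunk_index, data):
        handle, primary, secondaries = master.get_lease_info(filename, chunk_index)
        # 1. Data flow: push to the closest replica; each replica streams the
        #    bytes onward to the next one in the chain (pipelining).
        chain = sorted([primary] + secondaries, key=lambda r: r.distance_from(client))
        chain[0].push_data(handle, data, forward_chain=chain[1:])
        # 2. Control flow: the primary assigns a serial number, applies the
        #    mutation, and the same order is applied on every secondary.
        serial = primary.assign_serial_and_apply(handle, data)
        acks = [s.apply(handle, serial) for s in secondaries]
        return all(acks)   # on any failure the client retries the whole operation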

4. Evaluation
The authors test GFS at a smaller scale using synthetic benchmarks and with real-world measurements. The synthetic benchmarks show read, write, and append performance approaching or exceeding 50% of the theoretical limit imposed by the network characteristics. In the live workloads, read rates on a stressed cluster approach the maximum possible in the network, and GFS is capable of supporting write rates in the 100 MB/s range. Results also show that the master is not a bottleneck in live workloads. I appreciated the authors' detailed characterization of the production workloads, particularly the breakdown of read/write sizes, as well as specific application characteristics. Being able to operate at Google's scale is a rarity in research, so seeing an open and thorough discussion of live data is refreshing.

5. Confusion I'd like to hear a bit more about the consistency model - I'm still not quite sure what 'abnormal' file states are permitted by GFS and tolerated by applications.

1. Summary
The paper presents the design of GFS (the Google File System), which provides a scalable and reliable distributed file system for the datasets that Google processes.

2. Problems
Traditional file system designs are not suited to the datasets handled by Google. For example, typical files are around 1GB, so a file system supporting block sizes in KB does not yield any performance benefit. Since these distributed systems are built on top of inexpensive components that are highly likely to fail, the file system design must assume their failure and have built-in recovery mechanisms. Also, pushing synchronization mechanisms into application software reduces flexibility. Thus a traditional file system may not scale to Google's needs, since the design points are radically different.

3. Contribution
GFS improves performance by adapting the design points to the data handled by Google. The following are its main contributions:
(i) GFS architecture: The architecture consists of a master, which records all the metadata, and chunkservers, which store the data. Any client that wants to perform a file operation asks the master which chunkservers store the file's chunks and then communicates with those chunkservers for the data. Chunks are replicated across multiple chunkservers (3 by default) to provide reliability. Chunkservers and the master exchange information such as file deletions and data corruption through periodic HeartBeat messages.
(ii) Data-specific design decisions: Many design decisions are based on the datasets handled by Google. For example, the chunk size is fixed at 64MB, which reflects the file sizes handled by Google's servers. GFS supports operations like record append because random writes are rare and most data is written once and read many times later. Since the clusters are built from inexpensive hardware, error checking and the consistency model are built into GFS.
(iii) Master: The master is the central point of contact for all metadata. It also performs various necessary operations such as deciding on replica placement, creation and re-replication of chunks, garbage collection, stale replica detection, and fault tolerance and diagnosis. For fast lookups with a small memory footprint, the master stores absolute file paths with prefix compression (a generic sketch of such compression follows after this list). The master applies metadata updates atomically and provides consistency through its logging mechanism. It supports a copy-on-write style mechanism for fast snapshot operations.
(iv) Chunkserver: Data is maintained on these servers and replicated across several of them for reliability. They provide basic data integrity using checksums, which restricts the spread of bad data. The primary chunkserver is responsible for ordering chunk writes among the replicas.
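Regarding the prefix compression mentioned in (iii): the paper only states that the master prefix-compresses pathnames, so the sketch below shows one generic way to do it for a sorted list of full paths (purely illustrative, not GFS's actual encoding): each name is stored as the number of leading characters it shares with the previous name plus the differing suffix.

    # Generic prefix compression for sorted pathnames.
    import os

    def compress(paths):
        out, prev = [], ""
        for p in sorted(paths):
            shared = len(os.path.commonprefix([prev, p]))
            out.append((shared, p[shared:]))   # (shared-prefix length, suffix)
            prev = p
        return out

    def decompress(entries):
        paths, prev = [], ""
        for shared, suffix in entries:
            prev = prev[:shared] + suffix
            paths.append(prev)
        return paths

    # compress(["/home/a/log1", "/home/a/log2"]) -> [(0, "/home/a/log1"), (12, "2")]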

4. Evaluation
The evaluation presented in the paper is two-fold. First it characterizes file operations like read, write, and append through micro-benchmarks, showing how they scale with the number of clients and how close they come to the theoretical limit. The second phase takes actual clusters running GFS and presents insights on memory overheads (for metadata), traffic handled by the master and chunkservers, and recovery time for single and double chunkserver failures. It also discusses the workload characteristics to back the design philosophy. However, the evaluation skips a scaling analysis of the master's overheads on real workloads. Overall the paper presents a complete evaluation characterizing both micro and macro behaviours.

5. Confusions
What is prefix compression?
What are the exact overheads of network connection establishment, and how have they evolved since this paper was published?

1. Summary
The Google File System is a distributed, scalable file system built specifically for large distributed data-intensive applications. The paper introduces the unique design decisions made to accommodate its common case, and then it presents a series of tests done on a mini-version of a GFS implementation. An analysis of a real-life cluster is also conducted.

2. Problem
GFS is built for a specific need: namely, large files. Workloads are usually large sequential reads or writes, or alternatively small random reads. Files are expected to be very large (several GB is not unusual), and component failures within the system should be accepted as common. GFS was developed as a master-chunkserver system to address these design requirements.

3. Contributions
In general, GFS is built as a "cluster" system centered on a single master coordinator that delegates as much work as possible to the chunkservers in order to minimize bottlenecks. A general write workflow looks as follows: the master relays chunkserver information, including which replica is the primary and which are secondaries (and the client caches this); the primary then handles the specific write request, with the client retrying a few times if necessary.
The single-master design allows for a simpler overall structure, but caching must be done on the client side to compensate. Each cluster has a single master and multiple chunkservers, and is accessed by multiple clients at a time. Files are divided into chunks, and the master is responsible for maintaining metadata about the individual chunks of each file. It is also responsible for placement decisions, creation of new chunks, and coordination of system-wide activity. The master is kept fast by storing metadata in memory, making master operations very quick. There are three types of metadata: the file and chunk namespaces, the file-to-chunk mapping, and chunk replica locations. Operation logs keep the first two persistent, while the third is collected from each chunkserver at startup and on certain key events. This policy allows the master to access metadata quickly, implement chunk garbage collection, and replay events whenever crashes happen.
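A small sketch, under assumed names, of how replica locations could be gathered from chunkserver reports (at startup and in periodic HeartBeat messages) rather than persisted; last_seen and chunk_locations are hypothetical dictionaries, and the 60-second timeout is only an example value.

    # Chunk locations rebuilt from reports instead of being logged.
    def handle_heartbeat(master_meta, chunkserver_id, reported_handles, now):
        master_meta.last_seen[chunkserver_id] = now
        for handle in reported_handles:
            master_meta.chunk_locations.setdefault(handle, set()).add(chunkserver_id)

    def expire_dead_chunkservers(master_meta, now, timeout=60):
        for cs_id, seen in list(master_meta.last_seen.items()):
            if now - seen > timeout:
                for replicas in master_meta.chunk_locations.values():
                    replicas.discard(cs_id)      # these chunks now need re-replication
                del master_meta.last_seen[cs_id]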
GFS has a relaxed consistency model that provides several guarantees:
1) File namespace mutations are atomic.
2) Concurrent successful mutations leave a data region consistent but undefined (i.e., all clients see the same data, but it may not reflect what any single mutation wrote).
3) After a serial sequence of successful mutations, the mutated file region is guaranteed to be defined and to contain the data written by the last mutation.
For applications, this means files should be mutated by appending rather than overwriting, which makes applications more resilient in the face of failure. Regular checkpoints are also used to allow incremental restart, which in turn increases resiliency.

4. Evaluation
Evaluation was done with micro-benchmarks on a small GFS cluster. Read, write, and record append performance were evaluated against a "theoretical limit", although I'm not too sure how these limits were obtained or how other systems would perform against them. Two real-world clusters were also analyzed, this time with hundreds of chunkservers. It's hard to give a per-client read/write rate in a real-life setting where clients may drop in and out, but it appears that the overall rates are on par with or faster than the test simulation. As the real-world GFS numbers were produced from actual client traffic, they present a realistic view of typical day-to-day use of the system, although the authors warn that the workload is tailored specifically for Google applications and thus might not generalize to other areas.

5. Confusion
I'm still a little confused about how appending works. If replicas aren't bytewise identical, how does the master know which chunk is the real one?

Summary:
The paper describes the design and implementation of the Google File System, a distributed file system that satisfies the scalability, reliability, and availability requirements of traditional distributed file systems while also being optimized mainly for the large sequential reads and write-append workloads that are specific to Google.

Problem:
Traditional distributed file systems are not designed for a particular kind of workload and are thus generic, so their performance is modest. They also don't assume significant component failures that could potentially lead to loss of data. Google's workloads were characterized by large sequential reads of multi-GB files and appends to such files from multiple clients concurrently. Google also had a large number of commodity components in its clusters, and the probability of failure was quite high. GFS was developed under these assumptions.

Contributions:
1) GFS has a single master server (which is replicated) that provides clients with metadata about the location of data. The metadata maintained by the master, although persistent (except for chunk locations), is small enough to be stored in memory, providing fast lookups. To avoid metadata loss, copies of the metadata are stored as an operation log and periodic checkpoints. GFS chunkservers serve the data directly to the client, and data is almost always replicated across chunkservers. This separation of control and data flow removes the bottleneck at the highly contended master. Files are divided into chunks, the basic unit of storage on the chunkservers.
2) GFS has a relaxed consistency model when it comes to concurrent writers, but it does guarantee atomicity of record appends from different clients to the same chunk, and it guarantees that a file region is consistent across replicas after a successful mutation. This is enforced using leases: a designated chunkserver holding the chunk (the primary) determines the order in which mutations from different clients are applied and instructs the other replicas to perform the same operations in the same order. Record append is semantically an "append at least once": although any particular replica may contain duplicate records, a successful operation guarantees the record is written at the same offset in all replicas. Applications read and write such files through library code that takes care of undefined regions and duplicates. GFS also provides a snapshot operation with copy-on-write semantics on the cloned namespace.
3) The GFS master is responsible for creating, re-replicating, and rebalancing chunks across chunkservers. It also performs lazy garbage collection of deleted files and detects stale replicas on chunkservers using chunk version numbers. Most of these mechanisms have user-level policies/knobs for greater control from userspace. Chunkservers also maintain persistent checksums for each chunk to identify corrupt data.

Evaluation:
The authors demonstrate the scalability of GFS, albeit at a small scale, by measuring the throughput of large sequential reads, large writes to different files, and concurrent record appends. However, a comparison of GFS with other popular distributed file systems would have thrown more light on the optimizations made in GFS. The authors also evaluate the recovery time after losing a number of chunks, and they report the nature of the workloads that actually run on the production and R&D clusters. Although this more or less confirms their initial assumptions, there is not enough throughput data to really judge the scalability of GFS. Also, at large scale one might expect the network to be flooded with TCP flow-control messages; the authors could have evaluated the effective throughput of the production clusters.

Confusion:
What are the overheads of TCP connection setup and teardown?

1. summary
Google implemented the Google File System, which reduces cost by using large numbers of inexpensive disks and tolerates server crashes through several policies: replication of data over three machines, a new file system architecture, and a relaxed consistency model. To support these policies, GFS uses explicit communication sequences among the master, clients, and chunkservers to read data, to write or append data, and to replicate data.

2. Problem
Because a distributed system is built from many machines, component failures are frequent, arising from application bugs, operating system bugs, human errors, hardware failures, and power outages. Most files are huge, contrary to the earlier file system assumption that most files are small. Most files are mutated by appending new data rather than by random overwrites. Co-designing the applications and the file system is beneficial because it relaxes the requirements the file system must meet for correct operation without sacrificing performance.

3. Contributions
The main contribution of GFS is that it provides a fault-tolerant distributed system that keeps working even when several servers crash, at the cost of replicating data over three chunkservers by default and backing up the master. Replication is reliable but costly; to cope with the cost, GFS uses cheap commodity disks instead of expensive enterprise disks.
The detailed operation is as follows. A client obtains a chunk handle and chunkserver locations by sending a file name and chunk index to the master, and then starts the transfer with one of the closest chunkservers. To reduce traffic between clients and the master, GFS has three components: the master, storing the namespace and chunk locations; clients; and chunkservers, storing the data chunks. To avoid heavy traffic on any single machine, GFS replicates data, which requires data structures such as the namespace and location maps; this data is kept in the master's memory, logged to disk, and replicated on backup masters to prevent loss. Write operations take place under a permission, called a lease, granted by the master. Data is pushed along a chain of chunkservers, which reduces the traffic the client would otherwise need to send to replicate the same data to all replicas. To keep chunks spread evenly over chunkservers, a new chunk is placed on a chunkserver with below-average disk utilization when a new replica is requested.
GFS uses a large chunk size because the master's metadata would need a lot of memory if chunks were small. In addition, large chunks work well for sequential reads and writes. Metadata updates need synchronization to prevent corruption from concurrent client requests.
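Some back-of-envelope arithmetic for this point: the paper states the master keeps under 64 bytes of metadata per 64 MB chunk; the total data size below is just an assumed example.

    # Why large chunks keep the master's metadata small (illustrative numbers).
    data_bytes     = 1 * 2**50               # assume 1 PB of stored file data
    chunk_size     = 64 * 2**20               # 64 MB chunks
    meta_per_chunk = 64                       # bytes of master metadata per chunk (upper bound)

    chunks        = data_bytes // chunk_size  # ~16.8 million chunks
    master_memory = chunks * meta_per_chunk   # ~1 GiB of metadata
    print(chunks, master_memory / 2**30)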

4. Evaluation
GFS is evaluated with a simple cluster: one master, two master replicas, 16 chunkservers, and 16 clients. The read and write measurements show that network utilization for data transfer increases as the number of client requests increases, while the record append rate is limited by the network bandwidth of the chunkservers storing the last chunk of the file, largely independently of the number of clients.
According to the real-world workload evaluation, the chunkservers' metadata is relatively large because of checksums, while the master's metadata is small. Since the metadata needed for recovery is small, the recovery time is also small: recovering 600GB of data takes 23.2 minutes, which is quite fast for that amount of data.
Table 5 shows that the percentage of write operations over 1MB is 3.3% and 28.0% in the two clusters. Is that a big portion?
5. Confusion
What is the actual meaning of appending here? What is the difference between appending and overwriting if the requested operation is an overwrite?

1. Summary
This paper describes the design and implementation of Google's distributed file system. The design decisions were driven by Google's own application and workload requirements, which resulted in certain radical differences compared to existing systems. But this file system, together with the custom applications designed to run on it, has successfully met Google's storage needs.

2. Problem
Contemporary distributed file systems were not a perfect fit for Google's applications and workloads. Those file systems treated component failures as exceptional events, but Google's clusters were on the order of a thousand machines containing thousands of disks. A system failure was a highly probable event in such an environment, and with Google planning to expand its clusters further, failure would be the norm. The files on which Google's applications operated were on the order of gigabytes and were mostly mutated by data appends, so they had to revisit the design decisions involving I/O operations and block sizes and optimize for the large file sizes.

3. Contributions
The design of GFS shows how a file system can work with a relaxed consistency model if the applications are co-designed with such a model in mind. Applications running on GFS modify files by appending, and the consistency model of GFS allows duplicate records in chunks, so the onus of detecting duplicates is on the applications. The GFS architecture has a single master, with data stored on the chunkservers. Clients only exchange metadata with the master, and data is exchanged directly with the chunkservers; this design prevents the master from being a performance bottleneck. The single master could be a single point of failure, but the authors overcome this issue by having shadow masters which serve read-only metadata. GFS stores data in 64MB chunks; each file is represented by chunks distributed on different machines and replicated by default, and data integrity of these large chunks is ensured by maintaining a checksum for each 64KB block, which also speeds up checksum calculation for appended data when chunks are partially full. The system also has well-thought-out policies for chunk placement, replication, and rebalancing which spread chunks across machines and racks to keep disk utilization even, prevent any chunkserver from becoming a bottleneck, and prevent a rack failure from losing data. The snapshot technique lets the master create quick copies of large amounts of data, and the garbage collection mechanism, which lazily removes files, makes the system more reliable.
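A hedged sketch of per-64KB block checksums and why appends stay cheap: only the trailing, partially filled block's checksum has to be redone, plus new checksums for whatever full blocks the append creates (the paper describes an incremental update of that last checksum; fully recomputing the one block here is a simplification). All names are illustrative.

    # Per-block checksums with cheap appends.
    import zlib
    BLOCK = 64 * 1024

    class Chunk:
        def __init__(self):
            self.data = bytearray()
            self.checksums = []                        # one crc32 per 64 KB block

        def append(self, payload):
            self.data += payload
            first_dirty = len(self.checksums) - 1 if self.checksums else 0
            del self.checksums[first_dirty:]           # only the last block is affected
            for start in range(first_dirty * BLOCK, len(self.data), BLOCK):
                self.checksums.append(zlib.crc32(self.data[start:start + BLOCK]))

        def verify(self, block_index):                 # checked on reads before returning data
            start = block_index * BLOCK
            return zlib.crc32(self.data[start:start + BLOCK]) == self.checksums[block_index]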

4. Evaluation
The authors present the evaluation in two stages. First they show the performance of reads, writes, and appends in micro-benchmarks, and then they present an evaluation of real-world clusters using GFS. The micro-benchmarks show that per-client read rates are closest to the theoretical maximum when the number of clients is small; this efficiency drop as clients increase is attributed to the higher chance that clients simultaneously read from the same chunkserver. The rate of record appends drops as more clients simultaneously append data to chunks of the same file, which is attributed to network congestion and the network topology. The paper also presents an evaluation of two real-world clusters; both have hundreds of chunkservers serving around 700k files with data on the order of tens of TBs. The metadata on the masters of such clusters is less than 100MB, so it is clear that it can easily be maintained in memory, and the authors also note that the relatively small resident size of the master makes recovery fast if the master crashes. The authors had claimed that their systems read much more data than they write, and the evaluation bears this out; since GFS is optimized for reads, the evaluation shows that in large clusters read rates reach about 77% of the theoretical maximum, which is great. GFS was designed to prevent the master from being a bottleneck, and the evaluation shows that even in huge clusters the master handles only around 500 ops/sec, which mostly involve no I/O, so the master can easily keep up given its efficient in-memory representation of the namespace. The paper also presents an analysis of recovery time and a workload breakdown, which reinforce the assumptions and design decisions made by the authors.

5. Confusion
a. Discuss consistent and defined regions in class, and how a client can identify defined regions.
b. I did not quite understand how checksumming is optimized for appends.
