The Google File System
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung
Google
One-line Summary
GFS is a single-master, many-chunkserver architecture designed with fault tolerance in mind, and its design is shaped by Google's workload.
Overview/Main Points
- Design Assumptions/Rationale
- Any component could fail
- Used internally by Google
- A modest number of large files rather than many small files
- Files are divided into fixed-size 64 MB chunks
- Cheap commodity disks, whose write bandwidth is roughly 2x that of the 100 Mb Ethernet links
- No client-side caching of file data, because working sets are too large to cache usefully
- Append-heavy writes; sequential accesses
- GFS architecture
- a single master
- maintains all file system metadata in memory (see the metadata sketch after this architecture list)
- namespace
- a lookup table (with prefix compression) mapping full pathnames to metadata, including 64-bit chunk handles; each path has an associated read-write lock
- kept persistent by logging to an operation log on local disk, replicated on remote machines
- the mapping from files to chunks, also kept persistent and replicated
- the current locations of chunk replicas: polled at master startup and whenever a chunkserver joins the cluster (not persisted)
- access control info
- control system-wide activities
- chunk lease management: designate one replica as the primary, which orders update propagation
- locking for serialization; locks are acquired in a consistent total order (first by level in the namespace tree, then lexicographically within the same level)
- chunk placement and migration between chunkservers
- chunk creation, re-replication, and rebalancing
- garbage collection of orphaned and stale chunks; deleted files are first renamed to hidden names and reclaimed lazily
- maintain the membership of chunkservers by HeartBeat messages
- Chunkservers: store chunks as local files and replicate each chunk across chunkservers (three replicas by default)
- Clients: cache metadata, but not file data
- Shadow masters: lag slightly behind the primary master and serve read-only requests when it is down
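To make the master's role concrete, here is a minimal sketch (in Python, not Google's C++ code) of the in-memory metadata described above. The names MasterState, FileMeta, and ChunkMeta are my own, and a plain threading.Lock stands in for the per-path read-write lock; the sketch shows the pathname lookup table, the persistent file-to-chunk mapping, and the volatile chunk-location map rebuilt from chunkserver reports.

```python
from __future__ import annotations

import threading
from dataclasses import dataclass, field


@dataclass
class ChunkMeta:
    handle: int                                      # 64-bit chunk handle
    version: int = 1                                 # bumped when a new lease is granted
    locations: list = field(default_factory=list)    # chunkserver addresses (volatile)
    primary: str | None = None                       # current lease holder, if any


@dataclass
class FileMeta:
    chunk_handles: list = field(default_factory=list)             # ordered chunks of the file
    lock: threading.Lock = field(default_factory=threading.Lock)  # stand-in for a read-write lock


class MasterState:
    def __init__(self):
        self.namespace: dict[str, FileMeta] = {}   # full pathname -> file metadata (persistent)
        self.chunks: dict[int, ChunkMeta] = {}     # chunk handle -> chunk metadata
        self.next_handle = 0

    def create(self, path: str) -> None:
        # Real GFS takes read locks on every ancestor path and a write lock on the
        # leaf, acquired in a fixed order: by namespace-tree level, then lexicographically.
        self.namespace[path] = FileMeta()

    def add_chunk(self, path: str) -> ChunkMeta:
        # This mutation is appended to the operation log (and replicated) before it
        # is applied; the chunk locations below are never logged.
        self.next_handle += 1
        chunk = ChunkMeta(handle=self.next_handle)
        self.chunks[chunk.handle] = chunk
        self.namespace[path].chunk_handles.append(chunk.handle)
        return chunk

    def report_locations(self, server: str, handles: list[int]) -> None:
        # Called at master startup and on HeartBeat messages: chunkservers report
        # which chunks they hold, so the location map is rebuilt rather than persisted.
        for h in handles:
            if h in self.chunks and server not in self.chunks[h].locations:
                self.chunks[h].locations.append(server)
```

Keeping the location map volatile avoids having to keep the master and chunkservers in sync on disk; the chunkserver has the final word on which chunks it actually stores.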
- GFS operations
- read a file at a given byte offset (see the sketch after these steps)
- The client asks the master with the filename and a chunk index, computed from the byte offset and the fixed chunk size
- The master replies with the chunk handle (chunk id), the set of replica locations, and the chunk version number
- The client contacts the closest chunkserver with the chunk handle and byte range
- The chunkserver returns the data if its replica matches the version number (i.e., it is not stale)
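A minimal sketch of the read path just described; master_rpc and chunkserver_rpc are hypothetical stubs standing in for the real RPC layer, and metadata_cache is the client-side metadata cache.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # fixed 64 MB chunk size

def read(master_rpc, chunkserver_rpc, metadata_cache, filename, offset, length):
    """Read `length` bytes of `filename` starting at byte `offset` (single-chunk case)."""
    # 1. Translate the byte offset into a chunk index within the file.
    chunk_index = offset // CHUNK_SIZE

    # 2. Ask the master -- or the client-side metadata cache -- for the chunk
    #    handle, replica locations, and version number.
    key = (filename, chunk_index)
    if key not in metadata_cache:
        metadata_cache[key] = master_rpc("GetChunk", filename, chunk_index)
    handle, locations, version = metadata_cache[key]

    # 3. Send the handle, version, and byte range to the closest chunkserver;
    #    it returns the data only if its replica's version matches.
    closest = locations[0]  # placeholder for a topology-aware choice
    return chunkserver_rpc(closest, "Read", handle, version, offset % CHUNK_SIZE, length)

# A read that straddles a chunk boundary would be split into multiple such requests.
```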
- write (see the sketch after these steps)
- Client contacts the master for the chunk handle and chunkservers.
- Master replies with the primary replica (the one holding the lease) and the secondary replicas.
- Client caches this metadata and pushes the data (with checksums) to all replicas. For performance, GFS decouples the data flow from the control flow: each server forwards the data to the "closest" machine in the network topology that has not yet received it, pipelining over TCP connections, so 1 MB can ideally be distributed in about 80 ms.
- After all replicas ack receipt of the data, the client sends the write request to the primary, which assigns a serial order to concurrent writes and applies them to its own local state.
- The primary forwards the write request to all secondaries and replies to the client once they all ack. If any errors occur, the client retries the failed mutation a few times before restarting the write from the beginning.
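A sketch of the write path, again with hypothetical stubs (master_rpc, server_rpc; not a real GFS client API). The 80 ms figure above comes from the ideal transfer time B/T + R*L: at T = 100 Mbps (12.5 MB/s), pushing B = 1 MB down a pipeline of R replicas takes roughly 80 ms plus a small per-hop latency term.

```python
def write(master_rpc, server_rpc, filename, chunk_index, offset_in_chunk, data):
    # 1. Ask the master which replica holds the lease (primary) and who the
    #    secondaries are, along with the chunk handle; cacheable on the client.
    handle, primary, secondaries = master_rpc("GetLeaseInfo", filename, chunk_index)

    # 2. Push the data (plus checksums) along a chain of chunkservers, each one
    #    forwarding to the "closest" machine that has not yet received it.
    #    The data flow is decoupled from the control flow.
    chain = sorted([primary] + secondaries)   # stand-in for topology-aware ordering
    data_id = server_rpc(chain[0], "PushData", data, forward_to=chain[1:])

    # 3. Once all replicas have buffered the data, send the write to the primary,
    #    which assigns a serial order, applies it locally, and forwards the
    #    ordered request to every secondary; it acks the client when they all ack.
    for _ in range(3):                         # retry the mutation a few times
        if server_rpc(primary, "Write", handle, offset_in_chunk, data_id,
                      secondaries=secondaries):
            return True
    return False   # caller falls back to restarting the write from the beginning
```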
- snapshot (see the copy-on-write sketch after these steps)
- Like AFS, GFS uses standard copy-on-write techniques to make a copy of a file or a directory tree (the "source")
- After receiving the snapshot request, the master revokes any outstanding leases on the affected chunks, so any subsequent write must first go back to the master
- The master acquires read-write locks on the source, logs the operation to disk, and duplicates the metadata for the source; the newly created snapshot files point to the same chunks as the source
- When a client later asks to write a snapshotted file whose data lives in chunk C, the master picks a new chunk handle C' and asks each chunkserver holding a current replica of C to create the new chunk C' locally (avoiding a network copy)
- The rest proceeds like a normal write: grant a lease on C' to one replica as primary and reply to the client
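A sketch of the master-side snapshot and copy-on-write logic, reusing the MasterState/FileMeta/ChunkMeta structures from the earlier metadata sketch and adding a hypothetical ref_counts map; this illustrates the mechanism, not the actual implementation.

```python
def snapshot(master, src_path, dst_path):
    master.ref_counts = getattr(master, "ref_counts", {})   # assumed extra field on MasterState
    # 1. Revoke outstanding leases so any later write must come back to the master.
    for h in master.namespace[src_path].chunk_handles:
        master.chunks[h].primary = None
    # 2. Log the operation, then duplicate only the metadata: the new file points
    #    at exactly the same chunks, whose reference counts now exceed one.
    handles = list(master.namespace[src_path].chunk_handles)
    master.namespace[dst_path] = FileMeta(chunk_handles=handles)
    for h in handles:
        master.ref_counts[h] = master.ref_counts.get(h, 1) + 1


def handle_for_write(master, handle):
    # Copy-on-write: if the chunk is shared, each chunkserver holding a replica of
    # C clones it locally into a fresh chunk C' (no network copy); the master then
    # grants a lease on C' and the write proceeds as usual.
    ref_counts = getattr(master, "ref_counts", {})
    if ref_counts.get(handle, 1) > 1:
        master.next_handle += 1
        old = master.chunks[handle]
        master.chunks[master.next_handle] = ChunkMeta(
            handle=master.next_handle, locations=list(old.locations))
        ref_counts[handle] -= 1
        return master.next_handle   # the client is redirected to C'
    return handle
```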
- Consistency model
- data view
- consistent: all clients always see the same data
- defined: consistent, and clients see what the mutation wrote in its entirety
| | Write | Record Append |
| --- | --- | --- |
| Serial success | defined | defined, interspersed with inconsistent |
| Concurrent successes | consistent but undefined | defined, interspersed with inconsistent |
| Failure | inconsistent | inconsistent |
- a write (normal write or append) that straddles a chunk boundary is split into multiple operations, so concurrent writers can leave the region consistent but undefined
- padding: a record append that would cross a chunk boundary causes the primary to pad the rest of the current chunk, and the client retries on the next chunk (sketched below)
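A sketch of the primary's record-append decision that produces the padding mentioned above (names are hypothetical). The paper limits a record append to at most one quarter of the chunk size so that padding stays at a worst-case tolerable level.

```python
CHUNK_SIZE = 64 * 1024 * 1024
MAX_APPEND = CHUNK_SIZE // 4   # record appends are limited to 1/4 of a chunk

def primary_record_append(chunk_used_bytes, record_len):
    """Return what the primary does with an append of `record_len` bytes."""
    assert record_len <= MAX_APPEND
    if chunk_used_bytes + record_len > CHUNK_SIZE:
        # Pad the rest of the chunk (secondaries do the same) and make the
        # client retry on the next chunk; this padding is the "inconsistent"
        # region interspersed between defined records.
        return ("pad_and_retry", CHUNK_SIZE - chunk_used_bytes)
    # Otherwise the primary picks the offset, applies the append locally, and
    # forwards the chosen offset to the secondaries.
    return ("append_at", chunk_used_bytes)
```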
- Normal file system metadata
- inodes, bitmaps, superblock (in ext2, inode 2 is the root directory, 0 means no inode, and 1 is reserved for bad blocks), ...
- stored on disk
- data block size is usually 4 KB; choosing the size trades off (see the arithmetic sketch after this list):
- internal fragmentation (smaller is better)
- amount of metadata (larger is better)
- sequential read/write throughput (larger is better)
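Back-of-the-envelope arithmetic (my own, not from the paper) for the metadata trade-off: assuming roughly 64 bytes of metadata per unit, as the paper reports per 64 MB chunk, a terabyte of data costs about 1 MiB of metadata with 64 MB chunks versus about 16 GiB with 4 KB blocks.

```python
META_BYTES = 64           # ~64 bytes of metadata per chunk/block (paper's figure per chunk)
DATA = 1 << 40            # 1 TiB of file data

for unit, size in [("4 KB block", 4 * 1024), ("64 MB chunk", 64 * 1024 * 1024)]:
    count = DATA // size
    print(f"{unit}: {count:,} units, ~{count * META_BYTES / (1 << 20):.1f} MiB of metadata")

# 4 KB block: 268,435,456 units, ~16384.0 MiB of metadata
# 64 MB chunk: 16,384 units, ~1.0 MiB of metadata
```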
Relevance
Flaws