The Google File System
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung
Google
One-line Summary
GFS is a single-master, many-chunkserver architecture designed with fault tolerance in mind, and its design is shaped by Google's workload.
Overview/Main Points
- Design Assumptions/Rationale
- Any component could fail
- Used internally by Google
- A modest number of large files rather than many small files
- Files are divided into fixed-size 64 MB chunks
- Cheap commodity disks, whose write bandwidth is roughly 2x that of the 100 Mb Ethernet links
- No client-side caching of file data, because working sets are too large to cache usefully
- Append-heavy writes; sequential accesses
- GFS architecture
- a single master
- maintains all file system metadata in memory (see the metadata sketch after this architecture list)
- namespace
- a lookup table (with prefix compression) mapping full pathnames to metadata, including 64-bit chunk handles; each path has an associated read-write lock
- kept persistent by logging to an operation log on local disk, replicated on remote machines
- the mapping from files to chunks, also kept persistent and replicated
- the current locations of chunk replicas: polled at master startup and whenever a chunkserver joins the cluster (not persisted)
- access control info
- control system-wide activities
- chunk lease management: designate one replica as the primary, which orders update propagation
- locking for serialization; locks are acquired in a consistent total order (first by level in the namespace tree, then lexicographically within the same level)
- chunk placement and migration between chunkservers
- chunk creation, re-replication, and rebalancing
- garbage collection of orphaned and stale chunks; deleted files are first renamed to hidden names and reclaimed lazily
- maintain the membership of chunkservers by HeartBeat messages
- Chunkservers: store chunks as local files and replicate each chunk across chunkservers (three replicas by default)
- Clients: cache metadata, but not file data
- Shadow masters: lag slightly behind the primary master and serve read-only requests when it is down
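To make the master's role concrete, here is a minimal sketch (in Python, not Google's C++ code) of the in-memory metadata described above. The names MasterState, FileMeta, and ChunkMeta are my own, and a plain threading.Lock stands in for the per-path read-write lock; the sketch shows the pathname lookup table, the persistent file-to-chunk mapping, and the volatile chunk-location map rebuilt from chunkserver reports.

```python
from __future__ import annotations

import threading
from dataclasses import dataclass, field


@dataclass
class ChunkMeta:
    handle: int                                      # 64-bit chunk handle
    version: int = 1                                 # bumped when a new lease is granted
    locations: list = field(default_factory=list)    # chunkserver addresses (volatile)
    primary: str | None = None                       # current lease holder, if any


@dataclass
class FileMeta:
    chunk_handles: list = field(default_factory=list)             # ordered chunks of the file
    lock: threading.Lock = field(default_factory=threading.Lock)  # stand-in for a read-write lock


class MasterState:
    def __init__(self):
        self.namespace: dict[str, FileMeta] = {}   # full pathname -> file metadata (persistent)
        self.chunks: dict[int, ChunkMeta] = {}     # chunk handle -> chunk metadata
        self.next_handle = 0

    def create(self, path: str) -> None:
        # Real GFS takes read locks on every ancestor path and a write lock on the
        # leaf, acquired in a fixed order: by namespace-tree level, then lexicographically.
        self.namespace[path] = FileMeta()

    def add_chunk(self, path: str) -> ChunkMeta:
        # This mutation is appended to the operation log (and replicated) before it
        # is applied; the chunk locations below are never logged.
        self.next_handle += 1
        chunk = ChunkMeta(handle=self.next_handle)
        self.chunks[chunk.handle] = chunk
        self.namespace[path].chunk_handles.append(chunk.handle)
        return chunk

    def report_locations(self, server: str, handles: list[int]) -> None:
        # Called at master startup and on HeartBeat messages: chunkservers report
        # which chunks they hold, so the location map is rebuilt rather than persisted.
        for h in handles:
            if h in self.chunks and server not in self.chunks[h].locations:
                self.chunks[h].locations.append(server)
```

Keeping the location map volatile avoids having to keep the master and chunkservers in sync on disk; the chunkserver has the final word on which chunks it actually stores.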
- GFS operations
- read a file at a given byte offset (see the sketch after these steps)
- The client asks the master with the filename and a chunk index, computed from the byte offset and the fixed chunk size
- The master replies with the chunk handle (chunk id), the set of replica locations, and the chunk version number
- The client contacts the closest chunkserver with the chunk handle and byte range
- The chunkserver returns the data if its replica matches the version number (i.e., it is not stale)
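A minimal sketch of the read path just described; master_rpc and chunkserver_rpc are hypothetical stubs standing in for the real RPC layer, and metadata_cache is the client-side metadata cache.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # fixed 64 MB chunk size

def read(master_rpc, chunkserver_rpc, metadata_cache, filename, offset, length):
    """Read `length` bytes of `filename` starting at byte `offset` (single-chunk case)."""
    # 1. Translate the byte offset into a chunk index within the file.
    chunk_index = offset // CHUNK_SIZE

    # 2. Ask the master -- or the client-side metadata cache -- for the chunk
    #    handle, replica locations, and version number.
    key = (filename, chunk_index)
    if key not in metadata_cache:
        metadata_cache[key] = master_rpc("GetChunk", filename, chunk_index)
    handle, locations, version = metadata_cache[key]

    # 3. Send the handle, version, and byte range to the closest chunkserver;
    #    it returns the data only if its replica's version matches.
    closest = locations[0]  # placeholder for a topology-aware choice
    return chunkserver_rpc(closest, "Read", handle, version, offset % CHUNK_SIZE, length)

# A read that straddles a chunk boundary would be split into multiple such requests.
```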
- write (see the sketch after these steps)
- Client contacts the master for the chunk handle and chunkservers.
- Master replies with the primary replica (the one holding the lease) and the secondary replicas.
- Client caches this metadata and pushes the data (with checksums) to all replicas. For performance, GFS decouples the data flow from the control flow: each server forwards the data to the "closest" machine in the network topology that has not yet received it, pipelining over TCP connections, so 1 MB can ideally be distributed in about 80 ms.
- After all replicas ack receipt of the data, the client sends the write request to the primary, which assigns a serial order to concurrent writes and applies them to its own local state.
- The primary forwards the write request to all secondaries and replies to the client once they all ack. If any errors occur, the client retries the failed mutation a few times before restarting the write from the beginning.
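A sketch of the write path, again with hypothetical stubs (master_rpc, server_rpc; not a real GFS client API). The 80 ms figure above comes from the ideal transfer time B/T + R*L: at T = 100 Mbps (12.5 MB/s), pushing B = 1 MB down a pipeline of R replicas takes roughly 80 ms plus a small per-hop latency term.

```python
def write(master_rpc, server_rpc, filename, chunk_index, offset_in_chunk, data):
    # 1. Ask the master which replica holds the lease (primary) and who the
    #    secondaries are, along with the chunk handle; cacheable on the client.
    handle, primary, secondaries = master_rpc("GetLeaseInfo", filename, chunk_index)

    # 2. Push the data (plus checksums) along a chain of chunkservers, each one
    #    forwarding to the "closest" machine that has not yet received it.
    #    The data flow is decoupled from the control flow.
    chain = sorted([primary] + secondaries)   # stand-in for topology-aware ordering
    data_id = server_rpc(chain[0], "PushData", data, forward_to=chain[1:])

    # 3. Once all replicas have buffered the data, send the write to the primary,
    #    which assigns a serial order, applies it locally, and forwards the
    #    ordered request to every secondary; it acks the client when they all ack.
    for _ in range(3):                         # retry the mutation a few times
        if server_rpc(primary, "Write", handle, offset_in_chunk, data_id,
                      secondaries=secondaries):
            return True
    return False   # caller falls back to restarting the write from the beginning
```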
- snapshot (see the copy-on-write sketch after these steps)
- Like AFS, GFS uses standard copy-on-write techniques to make a copy of a file or a directory tree (the "source")
- After receiving the snapshot request, the master revokes any outstanding leases on the affected chunks, so any subsequent write must first go back to the master
- The master acquires read-write locks on the source, logs the operation to disk, and duplicates the metadata for the source; the newly created snapshot files point to the same chunks as the source
- When a client later asks to write a snapshotted file whose data lives in chunk C, the master picks a new chunk handle C' and asks each chunkserver holding a current replica of C to create the new chunk C' locally (avoiding a network copy)
- The rest proceeds like a normal write: grant a lease on C' to one replica as primary and reply to the client
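A sketch of the master-side snapshot and copy-on-write logic, reusing the MasterState/FileMeta/ChunkMeta structures from the earlier metadata sketch and adding a hypothetical ref_counts map; this illustrates the mechanism, not the actual implementation.

```python
def snapshot(master, src_path, dst_path):
    master.ref_counts = getattr(master, "ref_counts", {})   # assumed extra field on MasterState
    # 1. Revoke outstanding leases so any later write must come back to the master.
    for h in master.namespace[src_path].chunk_handles:
        master.chunks[h].primary = None
    # 2. Log the operation, then duplicate only the metadata: the new file points
    #    at exactly the same chunks, whose reference counts now exceed one.
    handles = list(master.namespace[src_path].chunk_handles)
    master.namespace[dst_path] = FileMeta(chunk_handles=handles)
    for h in handles:
        master.ref_counts[h] = master.ref_counts.get(h, 1) + 1


def handle_for_write(master, handle):
    # Copy-on-write: if the chunk is shared, each chunkserver holding a replica of
    # C clones it locally into a fresh chunk C' (no network copy); the master then
    # grants a lease on C' and the write proceeds as usual.
    ref_counts = getattr(master, "ref_counts", {})
    if ref_counts.get(handle, 1) > 1:
        master.next_handle += 1
        old = master.chunks[handle]
        master.chunks[master.next_handle] = ChunkMeta(
            handle=master.next_handle, locations=list(old.locations))
        ref_counts[handle] -= 1
        return master.next_handle   # the client is redirected to C'
    return handle
```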
- Consistency model
- data view
- consistent: all clients always see the same data
- defined: consistent, and clients see what the mutation wrote in its entirety
| | Write | Record Append |
| --- | --- | --- |
| Serial success | defined | defined, interspersed with inconsistent |
| Concurrent successes | consistent but undefined | defined, interspersed with inconsistent |
| Failure | inconsistent | inconsistent |
- a write (normal write or append) that straddles a chunk boundary is split into multiple operations, so concurrent writers can leave the region consistent but undefined
- padding: a record append that would cross a chunk boundary causes the primary to pad the rest of the current chunk, and the client retries on the next chunk (sketched below)
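A sketch of the primary's record-append decision that produces the padding mentioned above (names are hypothetical). The paper limits a record append to at most one quarter of the chunk size so that padding stays at a worst-case tolerable level.

```python
CHUNK_SIZE = 64 * 1024 * 1024
MAX_APPEND = CHUNK_SIZE // 4   # record appends are limited to 1/4 of a chunk

def primary_record_append(chunk_used_bytes, record_len):
    """Return what the primary does with an append of `record_len` bytes."""
    assert record_len <= MAX_APPEND
    if chunk_used_bytes + record_len > CHUNK_SIZE:
        # Pad the rest of the chunk (secondaries do the same) and make the
        # client retry on the next chunk; this padding is the "inconsistent"
        # region interspersed between defined records.
        return ("pad_and_retry", CHUNK_SIZE - chunk_used_bytes)
    # Otherwise the primary picks the offset, applies the append locally, and
    # forwards the chosen offset to the secondaries.
    return ("append_at", chunk_used_bytes)
```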
- Normal file system metadata
- inodes, bitmaps, superblock (in ext2, inode 2 is the root directory, 0 means no inode, and 1 is reserved for bad blocks), ...
- stored on disk
- data block size is usually 4 KB; choosing the size trades off (see the arithmetic sketch after this list):
- internal fragmentation (smaller is better)
- amount of metadata (larger is better)
- sequential read/write throughput (larger is better)
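Back-of-the-envelope arithmetic (my own, not from the paper) for the metadata trade-off: assuming roughly 64 bytes of metadata per unit, as the paper reports per 64 MB chunk, a terabyte of data costs about 1 MiB of metadata with 64 MB chunks versus about 16 GiB with 4 KB blocks.

```python
META_BYTES = 64           # ~64 bytes of metadata per chunk/block (paper's figure per chunk)
DATA = 1 << 40            # 1 TiB of file data

for unit, size in [("4 KB block", 4 * 1024), ("64 MB chunk", 64 * 1024 * 1024)]:
    count = DATA // size
    print(f"{unit}: {count:,} units, ~{count * META_BYTES / (1 << 20):.1f} MiB of metadata")

# 4 KB block: 268,435,456 units, ~16384.0 MiB of metadata
# 64 MB chunk: 16,384 units, ~1.0 MiB of metadata
```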
Relevance
Flaws