**REMZI's note**
================
http://pages.cs.wisc.edu/~remzi/Classes/537/Fall2008/Notes/gfs.pdf

* Chunk size is 64MB; why is fragmentation not a problem?
  - Each chunk is stored as a Linux file, so the local file system deals with fragmentation, and it often does well.

* The theme of the GFS paper is simplification. Can you give some examples?
  - Single master? Why?
    + simplifies the design
    + enables the master to make global decisions like
      ~ chunk placement
      ~ load balancing
  - Chunkservers simply store chunks identified by chunk ID

* The single master seems to be a bottleneck. Why is it not a problem in GFS?
  - The master keeps everything in memory, hence it is fast
  - The master is involved only in metadata operations; during reads/writes, clients mostly interact directly with chunkservers
  - Clients also cache some metadata, which further reduces master-client interaction

* What is good or bad about replication in GFS?
  - By default, each chunk has 3 replicas
  - Good:
    + improves read performance: a client can read from any replica
    + reliability: if one chunkserver goes down, clients can still read from the others
  - Bad:
    + replicas can be out of sync
      ~ sometimes that is OK for the application
    + space overhead

* Why pipelined writes?
  - Fully utilize each machine's bandwidth
    + each machine's full outbound bandwidth is used to transfer the data as fast as possible, rather than being divided among multiple recipients
  - Avoid network bottlenecks and high-latency links
    + each machine forwards the data to the "closest" machine in the network topology that has not yet received it
  - Minimize latency
    + the data transfer is pipelined over TCP connections
    + once a chunkserver receives some data, it starts forwarding right away

* What is the consistency model in GFS?
  - A file region can be:
    + consistent: all clients see the same data regardless of which replica they read from
    + defined: consistent, and clients see each write's update in its entirety

* Give an example where a file region can be consistent but undefined.
  - When concurrent writers issue a "big write" (i.e.
a write larger than the chunk size), the GFS client code breaks it into multiple write operations. These operations may be interleaved with, and overwritten by, the operations of other clients. The shared file region may therefore end up containing fragments from different clients, although the replicas will be identical because the individual operations complete successfully in the same order on all replicas.
  - Example: say two clients concurrently write to a shared file, each with a large write that crosses a chunk boundary
    + Client 1 writes: 1 - 2 - 3
    + Client 2 writes: A - B - C
    + The final result could be: 1 - B - C
    + this is consistent: all replicas have the same content
    + but undefined: some writes were lost

* What is the key property of record_append?
  - It is atomic and applied at least once: if a record_append fails, the client can simply retry the operation. Note that a successful retry may leave duplicate records behind, so applications must tolerate duplicates (e.g. by tagging records with unique IDs).

* Why is there no caching of file data at the client?
  - Because most programs only care about the output
  - The output (e.g. a search result) is produced by the chunkservers, which also act as worker machines
  - In other words, the client typically only says "please run this job (a search query)"

**THE GOOGLE FILE SYSTEM**
==========================

# NOTE: there is a paper in CACM about GFS
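The metadata/data split discussed in the Q&A above can be sketched in a few lines: the client contacts the single master only to map (filename, chunk index) to chunk locations, caches the result, and then reads data directly from a chunkserver. This is a minimal sketch; all class and method names (`Master`, `Chunkserver`, `lookup`) are illustrative assumptions, not the real GFS API.

```python
# Sketch of the GFS read path: master serves metadata, chunkservers serve data.
# Names are hypothetical; a read crossing a chunk boundary (which would need
# multiple lookups) is omitted for brevity.

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks, as in GFS


class Chunkserver:
    """Stores chunks (keyed by chunk handle) as plain byte strings."""
    def __init__(self):
        self.chunks = {}  # chunk handle -> bytes


class Master:
    """Holds only metadata: (filename, chunk index) -> (handle, replica ids)."""
    def __init__(self):
        self.chunk_table = {}

    def lookup(self, filename, chunk_index):
        return self.chunk_table[(filename, chunk_index)]


class Client:
    def __init__(self, master, chunkservers):
        self.master = master
        self.chunkservers = chunkservers
        self.cache = {}  # cached metadata: avoids repeated master round trips

    def read(self, filename, offset, length):
        chunk_index = offset // CHUNK_SIZE       # fixed-size chunks
        key = (filename, chunk_index)
        if key not in self.cache:                # contact master only on a miss
            self.cache[key] = self.master.lookup(filename, chunk_index)
        handle, replicas = self.cache[key]
        server = self.chunkservers[replicas[0]]  # any replica would do
        start = offset % CHUNK_SIZE
        return server.chunks[handle][start:start + length]
```

With 64 MB chunks and client-side caching, a client streaming a large file needs roughly one master round trip per 64 MB of data, which is one reason the single master rarely becomes a bottleneck.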
# 0. Takeaway

- focus on scale, fault tolerance, and high aggregate performance
- assumptions about the workload and environment:
  + failure is the norm ==> hence fault tolerance and automatic recovery
  + large files (multi-GB file sizes)
  + workload is mainly:
    > large streaming reads and small random reads
    > large sequential appends (hence, no random writes that overwrite data)
  + high sustained bandwidth is more important than low latency
- Architecture
  + 1 master: maintains metadata (namespace, ACLs, file-to-chunk mapping)
  + many chunkservers: store chunks of files as normal files of the local fs
    (note: a file is divided into large chunks, each with a unique global ID)
  + chunks are replicated across the system
- Single master:
  + makes chunk placement and replication decisions based on global knowledge
  + handles only metadata operations
  + actual reads/writes are served by chunkservers (hence the client contacts the master for metadata, and chunkservers for data)
- Large chunks:
  + 64 MB
  + lazy space allocation avoids internal fragmentation
  + reduce the need to contact the master: reads and writes on the same chunk require only one initial request to the master to find the chunk's location
  + reduce network overhead (the client can maintain a persistent TCP connection to the chunkserver)
  + reduce the size of the metadata at the master (this allows keeping all metadata in memory)
  Problem: "hot spots" with small files
    a small file consists of a few chunks; if many clients access that file,
    the chunkservers storing those chunks become hot spots
- Metadata at the master
  + kept in memory
  + also recorded in an operation log
    > the master logs to this before returning success to the client
    > ops can be batched to amortize the overhead
    > easy recovery:
      - load a checkpoint
      - and replay the log
  + the master does not persistently record which chunkserver holds which chunk (rather, when a chunkserver checks in with the master, it sends a chunk report)
    ==> no need to keep the master and chunkservers in sync
- Consistency model: (more here)
  + defined, undefined, and consistent
  Now: because a chunk is replicated, how do we update the chunk and keep all replicas
consistent?
  ==> LEASE: used to maintain a consistent mutation order across replicas
    - a lease is granted to one chunkserver holding a replica (the primary)
    - this chunkserver defines the order of all mutations
    - a lease can be renewed when it expires
    - leases help reduce the overhead at the master
- How about crash recovery?
  + during a mutation, if one of the servers crashes, the client retries
  + if the master crashes, it replays the log and waits for chunkservers to check in, in order to rebuild the mapping (file, chunk ID --> chunkserver)
- Master operations:
  + no real directories; everything is just a string
    hence there is a mapping from each full pathname to its metadata
  + locks are used to serialize concurrent operations
  + replica placement: rack-aware, to anticipate rack failure
    (i.e. put each replica in a different rack, avoiding the risk of losing all replicas if one rack fails)
    ==> improves reliability, availability, and network bandwidth utilization
- When are replicas created?
  + at file creation time
  + when the number of replicas of a chunk falls below a threshold (re-replication)
  + during rebalancing:
    > for disk space
    > for load balancing
- When a file is deleted:
  > the master logs the deletion immediately
  > but simply renames the file to a hidden name carrying a deletion timestamp
    ==> this file will be discarded later (after 3 days)
  > the physical storage is reclaimed lazily;
    the deletion commands are piggybacked on heartbeat messages
  Question: why do it this way?
  + simple
  + can be done in the background (hence the cost is amortized)
  + can handle accidental deletion: the file can still be recovered during the grace period
- How to provide HA?
  + fast recovery; what happens when
    > the master crashes
    > a chunkserver crashes
  + simple failure handling:
    > the client simply retries
  + replication:
    > chunk replication: multiple readers can contact different chunkservers for the same chunks
    > master replication
- What about data integrity?
  + checksums (each chunkserver checksums its chunks to detect corruption)
- What problems are left?
  1) single master failure
  2) the master is limited by its memory capacity
  3) consistency among replicas (e.g. what if replicas have different lengths? this is not discussed in the paper)
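The lazy deletion scheme above (log, rename to a hidden name with a timestamp, reclaim later in a background scan) can be sketched as follows. The `Namespace` class and hidden-name format are assumptions for illustration only; the 3-day grace period follows the notes, and `now` is passed explicitly to keep the sketch deterministic.

```python
# Sketch of GFS-style lazy deletion: delete = rename only (fast, undoable);
# physical reclamation happens later during a background garbage-collection
# scan. In GFS the actual chunk-deletion commands are piggybacked on the
# heartbeat replies to chunkservers.

GRACE_PERIOD = 3 * 24 * 3600  # 3 days, as in the notes


class Namespace:
    def __init__(self):
        self.files = {}  # pathname -> file data

    def delete(self, path, now):
        # Rename to a hidden name carrying the deletion timestamp;
        # storage is NOT freed here, so the file can still be undeleted.
        hidden = f".deleted/{path}@{now}"
        self.files[hidden] = self.files.pop(path)
        return hidden

    def undelete(self, hidden, path):
        # Recovery from accidental deletion: just rename back.
        self.files[path] = self.files.pop(hidden)

    def gc_scan(self, now):
        # Background scan (cost amortized): reclaim hidden files whose
        # grace period has expired.
        for hidden in list(self.files):  # copy keys: we mutate the dict
            if hidden.startswith(".deleted/"):
                ts = int(hidden.rsplit("@", 1)[1])
                if now - ts > GRACE_PERIOD:
                    del self.files[hidden]
```

Separating the logical delete (cheap rename) from the physical reclaim is what makes deletion both fast and forgiving: the expensive work is batched into scans and heartbeats, and a mistaken delete is reversible until the scan reclaims it.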