FLAT DATACENTER STORAGE
CS838, November 19, 2012

DESIGN OF FDS
- Data layout
  - Blob: logical unit of data
    - GUID: unique 128-bit identifier for a blob
  - Tract: fixed-size piece of a blob (8 MB)
  - Tractserver: process associated with each drive
    - Services read and write requests from clients
    - Uses the raw disk interface
    - E.g., tractservers 1,2,3,4,5
  - Tract Locator Table (TLT): list of all tractservers, repeated m times in random order
    - Cached at clients
    - Version number assigned to each row
    - Recreated when a tractserver is added or removed
    - Size of the TLT is proportional to the # of tractservers
    - E.g., 1,2,3,4,5,3,2,4,1,5
  - Tractserver where a tract is stored is determined by (see the Python sketch after this outline):
    (Hash(GUID) + Tract_#) mod TLT_length
    - Randomizes each blob's start to maximize parallelism
    - Ensures large blobs are uniformly distributed across tractservers
    - E.g., (3 + 0) mod 10 = 3, (3 + 1) mod 10 = 4, etc.
  - Metadata is stored like every other tract (Tract # = -1)
  - Blob is initialized to size 0
    - Need to extend a blob before writing past its end; space is allocated lazily
  - Client library: hides the details of tractservers and the FDS network protocol
    - Calls are non-blocking and invoke a callback function when an operation completes
    - Frameworks using FDS should assign work at fine granularity during task execution
- Replication
  - n-way replication lists n tractservers in each row of the TLT
  - Writes are sent to every tractserver in the selected TLT row
  - Reads are sent to a single random tractserver in the selected TLT row
  - E.g., a 3-way replicated TLT with 10 rows, one column per replica:
    Replica 1: 1,2,3,4,5,3,2,4,1,5
    Replica 2: 5,3,2,1,4,4,1,5,2,3
    Replica 3: 2,5,4,3,1,2,5,1,3,4
- Failure
  - Detected by heartbeat messages
  - All rows in the TLT containing a failed tractserver are invalidated, and random tractservers are chosen to fill the empty spaces (sketched below)
  - Version # of each row is incremented to invalidate copies currently cached at clients
  - Metadata server hands out the new TLT to tractservers and waits for all to ACK before giving the new TLT to clients
  - A tractserver starts copying from the other replicas when the new TLT is received
  - Group tractservers into failure domains (e.g., racks); no row of the TLT should have two or more tractservers from the same failure domain
- Network
  - Full-bisection bandwidth between nodes
  - Network bandwidth on each node equals its disk bandwidth
  - Fan-in during reads can cause incast problems
    - Collisions mostly occur at the receiver because of full bisection bandwidth
    - Uses RTS/CTS flow scheduling
    - The extra RTT required is not problematic due to deep read-ahead
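To make the tract locator and the replicated read/write path concrete, here is a minimal Python sketch. It is illustrative only: TLTRow, Tractserver, and the write/read calls are hypothetical stand-ins (the real FDS client library is not Python), and SHA-1 is used merely as a placeholder for whatever hash FDS applies to the 128-bit GUID.

    import hashlib
    import random

    TRACT_SIZE = 8 * 1024 * 1024   # tracts are fixed-size 8 MB pieces of a blob
    METADATA_TRACT = -1            # blob metadata is stored like any other tract, at tract # -1

    class Tractserver:
        # Stub standing in for the per-disk tractserver process.
        def __init__(self, name, failure_domain):
            self.name = name
            self.failure_domain = failure_domain   # e.g., the rack the drive lives in
        def write(self, guid, tract_number, data):
            print("%s: write tract %d (%d bytes)" % (self.name, tract_number, len(data)))
        def read(self, guid, tract_number):
            print("%s: read tract %d" % (self.name, tract_number))
            return b""

    class TLTRow:
        # One row of the Tract Locator Table: n replicas plus a version number.
        def __init__(self, replicas, version=0):
            self.replicas = replicas
            self.version = version

    def locate(guid, tract_number, tlt):
        # Row index = (Hash(GUID) + Tract_#) mod TLT_length.
        h = int.from_bytes(hashlib.sha1(guid).digest()[:8], "big")
        return (h + tract_number) % len(tlt)

    def write_tract(guid, tract_number, data, tlt):
        # Writes go to every tractserver listed in the selected row.
        row = tlt[locate(guid, tract_number, tlt)]
        for ts in row.replicas:
            ts.write(guid, tract_number, data)

    def read_tract(guid, tract_number, tlt):
        # Reads go to one randomly chosen replica in the selected row.
        row = tlt[locate(guid, tract_number, tlt)]
        return random.choice(row.replicas).read(guid, tract_number)

    # Example: five tractservers, one replica per row, mirroring the
    # 1,2,3,4,5,3,2,4,1,5 TLT example in the notes above.
    servers = [Tractserver("ts%d" % i, "rack%d" % i) for i in range(1, 6)]
    tlt = [TLTRow([servers[i - 1]]) for i in [1, 2, 3, 4, 5, 3, 2, 4, 1, 5]]
    write_tract(b"\x00" * 16, 0, b"hello", tlt)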
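The failure-handling bookkeeping (replace the failed tractserver in every row it appears in, pick a random replacement from a failure domain not already used by that row, and bump the row version so stale client caches are rejected) can be sketched in the same style. This reuses the hypothetical Tractserver and TLTRow stubs from the sketch above; it is only an assumption about what the metadata server's logic might look like, not the paper's actual code.

    def handle_failure(tlt, failed, all_servers):
        # Replace the failed tractserver in every TLT row that names it.
        for row in tlt:
            if failed not in row.replicas:
                continue
            # Candidates: servers whose failure domain (e.g., rack) is not
            # already represented by the surviving replicas in this row.
            used_domains = {r.failure_domain for r in row.replicas if r is not failed}
            candidates = [s for s in all_servers
                          if s is not failed and s.failure_domain not in used_domains]
            if not candidates:
                continue  # no eligible replacement; left unhandled in this sketch
            replacement = random.choice(candidates)
            row.replicas[row.replicas.index(failed)] = replacement
            # Incrementing the version invalidates copies of this row cached at
            # clients; the metadata server hands the new TLT to all tractservers,
            # waits for ACKs, and only then gives it to clients. The new replica
            # then copies tract data from the surviving replicas.
            row.version += 1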
STUDENT QUESTIONS/FEEDBACK

Yanpei: There is one thing I don't understand. The authors said "in an n-disk cluster where one disk fails, roughly 1/nth of the replicated data will be found on all n of the other disks. All remaining disks send the under-replicated data to each other in parallel, restoring the cluster to full replication very quickly". I don't understand what this means. Do they mean something like "any n-1 nodes can be used to reconstruct the entire n nodes"? That sounds more like reconstruction using an erasure code to me -- not sure if this is what they mean.

Jim: FDS seems to be essentially just another key-value store.

Jim: I just don't think FDS is very impressive. The gains exhibited in the Applications section of the paper seem to come largely from the network, not the design of FDS.

Leo: FDS basically throws away advances in distributed storage technology like consistent hashing, Paxos, etc., and chooses simple options where needed. Consistent hashing allows new node additions or node failures to affect only the neighbors, while FDS strives to involve all nodes for fast recovery / fast rebalancing. Also, FDS might not be good at handling partitioned networks: FDS requires that only one metadata server exist at any point in time and needs operator involvement to avoid cluster corruption.

Leo: I liked the paper overall, but I wonder whether it might have positioned itself not as throwing away data locality but as a hybrid model that simply does not aggressively optimize for data locality.

Leo: Also, I wonder whether data points on SSDs and faster disks could have been added. Such faster disks might widen the throughput difference between local and remote disks.

Josh: I found FDS to be an interesting application of full bisection bandwidth. Treating all disks the same simplifies the system and the layout of data. It also allows for quick failure recovery.

Ram: FDS fits between GFS/Hadoop and DHTs, i.e., it supports one-hop access to data and fast reaction to failure like GFS/Hadoop, and high scalability with no central bottlenecks like DHTs.

Srinivas: Can the same model be applied to satisfy latency-sensitive demands? Since data is striped and hash-partitioned already, I think it should be possible.

Xishuo: The question I have is that FDS requires full bisection bandwidth. Can we relax this constraint and still use some of the ideas presented in the paper?

Robert: I think this is an interesting paper built on top of heterogeneous storage nodes. Still, it seems quite expensive to build such a Clos network.

SOFTWARE DEFINED STORAGE
* Is this a good base?
* What else do we need?
  - Considering journaling filesystems as an example