Petal: Distributed virtual disks

E. K. Lee and C. A. Thekkath. Petal: Distributed virtual disks. In Proc. 7th Int. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 84–92, October 1996.

Review due Thursday 10/23

Comments

Summary:
The authors present the design of a network-attached, distributed, block-level storage system meant to replace locally attached disk storage. Petal seems to be the earliest conceptualization of a SAN. It provides users with the abstraction of highly available virtual disks, realized on physical disks managed by a collection of servers. This architecture also provides easy scaling, management, and replication, plus a data placement scheme (chained declustering) that balances load under failures.
Problem:
A good storage system should be highly available, easily manageable, provide good performance and failure tolerance, and scale well. Local disks fall short: a single component failure can take down the entire system, they require constant monitoring, and they do not scale beyond the limited resources of a single server. Given these limits, the challenge is to build a storage architecture that satisfies all of the above requirements.
Solution from the paper/Contribution:
Petal consists of a pool of network-connected servers that manage a pool of physical disks. Users are given a virtual-disk abstraction, which different applications can use as block-level storage. A virtual location is mapped to a physical location using a 3-level translation mechanism involving the Virtual Disk Directory (virtual disk -> GMap ID), the GMap (GMap ID + offset -> physical server), and the PMap (server + offset -> physical disk + offset). All servers have a globally consistent copy of the VDir and GMap, and typically a single server can perform all three translations to service a request, provided the client initially contacted the right server, which is usually the case. Clients store a hint about which server to contact for a request; if that server cannot service the request, it returns an error and the client retries with another server and updates its hint. Petal supports snapshots; for that, each translation is stored with an epoch number indicating the version. Reconfiguration of the cluster is done incrementally. Load balancing across servers is done by 'fencing' off some blocks of data and relocating them incrementally. Data is placed according to a chained-declustering scheme where each block is replicated on a neighboring server, in a staggered, diagonal-looking arrangement. This placement gives better load balancing after a single server failure, because that server's data is spread over multiple servers, unlike plain mirroring. However, it is not as good if more than one server fails. One copy is the primary (all writes go there first) and the other the secondary.
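To make the three-level lookup concrete, here is a minimal Python sketch; the names and data structures (VDIR, GMAPS, PMAPS, the modulo placement rule) are my own illustration of the idea, not Petal's actual code:

    # Illustrative 3-level translation, roughly in the spirit of Petal.
    # All names and structures here are assumptions for illustration.

    VDIR = {7: "gmap-epoch3"}          # virtual disk ID -> GMap ID
    GMAPS = {
        # GMap ID -> ring of servers backing this virtual disk
        "gmap-epoch3": ("srv0", "srv1", "srv2", "srv3"),
    }
    PMAPS = {
        # per-server local map: (vdisk ID, virtual block) -> (disk, offset)
        "srv2": {(7, 42): ("disk1", 9000)},
    }

    def translate(vdisk_id, vblock):
        gmap_id = VDIR[vdisk_id]                       # 1. VDir lookup
        ring = GMAPS[gmap_id]
        server = ring[vblock % len(ring)]              # 2. GMap: offset -> server
        disk, off = PMAPS[server][(vdisk_id, vblock)]  # 3. PMap (local to server)
        return server, disk, off

    print(translate(7, 42))   # -> ('srv2', 'disk1', 9000)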
Key contributions I find are:
Treating the network as the primary system-level interconnect, physically decoupling the storage system from its hosts.
A virtual-disk abstraction that is transparent to users and lets different applications run on top without modification.
The chained-declustering data placement scheme, which balances load well in failure scenarios.
The translation mechanism, which is quite efficient.
Key takeaway:
It's a great idea to separate out the storage system so we can provide the properties mentioned above. True, in the paper the performance of Petal is slightly worse than local disks, but this is no longer a problem now that we have high-throughput, low-latency interconnects like InfiniBand, and flash arrays replacing disks that can support hundreds of thousands of IOPS (Pure Storage, Coho Data, etc.).

summary:
- they describe a distributed system that provides virtual disks to clients.

problem:
- they want to design a distributed system that provides disk space to clients.
- they want it to be fault tolerant.
- they want to be able to uniformly balance the load.
- they want fast and efficient backup.
- they want to minimize the need for manual management.

solution:
- they have modules to check the state of the servers (who is dead and who is alive)
- each server has information about the state of the system (they used Paxos to keep it consistent)
- they have a virtual disk directory and a global map that map a virtual disk to a server, and they have a physical map that maps a virtual offset to a physical offset.
- they keep the VDirs and GMaps on all servers and the PMaps on the corresponding machines. this way each server can decide if it is the appropriate machine to answer the client's request, and if it is not, which machine is.
- they provide snapshots for backup
- they have an algorithm for reconfiguring the virtual disks and moving them to new machines.
- because they update their maps before moving the virtual disks, at some point the records would be inconsistent and they would have to redirect requests to the old location. this can increase the overhead and traffic.
- to decrease the overhead, they do this gradually and fence off the small parts of the maps they are moving. this way other parts of the map are either old or new and don't need redirection.
- they used chained declustering. instead of keeping a machine's replica entirely on one other machine, they split each machine's replica across its two neighbors (see the sketch after this list).
- this way if a machine fails, the extra load doesn't land on one machine but is split between the other two, and those two can in turn move some of their load to other machines.
- using this, they can place half of the machines at a different geographical site so that each site holds all the data, and they can handle partitioning this way.
- they handle some of the load balancing in clients (but they mention this works with few clients; as the number of clients increases it wouldn't work)
- they evaluated their implementation and showed that the overhead of their server software is negligible compared to the network's overhead.
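a rough sketch of the chained-declustering placement mentioned above (my own toy model, with a simple modulo rule standing in for petal's real placement):

    # Toy chained-declustering placement: block b's primary copy lives on
    # server b mod N and its secondary on the next server in the ring, so
    # consecutive blocks' replicas form a staggered, diagonal pattern.
    # Purely illustrative; not Petal's actual placement code.

    N = 4  # number of servers

    def place(block):
        primary = block % N
        secondary = (primary + 1) % N
        return primary, secondary

    for b in range(8):
        p, s = place(b)
        print(f"block {b}: primary=srv{p} secondary=srv{s}")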

learned:
- the fencing mechanism was interesting to me. they lock only the small part of the data they are currently moving, so the state of the rest of the data is unambiguous: it is either old or new.

confusing:
- I didn't understand how the snapshot works.

Summary:
This paper introduces the Petal distributed storage system, which is globally accessible, always available, high-performance, and large-capacity. Petal provides clients with large abstract containers called virtual disks on top of the storage system.

Problems:
Managing large storage systems is expensive and complicated. The system should be fault tolerant and able to reduce fragmentation and remove hot spots. Besides, it should balance load and capacity, and scale. Clients should be able to reconfigure the storage system to expand performance and capacity when new servers/disks are added.

Contributions:
(1) Provide virtual disks as containers. This cleanly separates a client's view of storage from the physical resources and adds the flexibility of sharing physical resources among clients.
(2) Use copy-on-write to quickly create snapshots. The snapshots can serve as backups, useful for recovery and archiving.
(3) Allow incremental reconfiguration so that new disks and servers can be added. The redistribution of data is accomplished via a new global map.
(4) Use chained declustering to achieve fault tolerance. It can bypass failed nodes. It uses two replicas, a primary and a secondary, with a read-one/write-all policy; consistency is ensured between the two replicas.

Confusing:
How exactly is copy-on-write implemented? For example, if one block is 4MB and I make a small modification of about 1KB, will the whole block be copied? If so, that does not seem very efficient.

Learned:
Using one primary and several replicas is a very common strategy for fault tolerance. But keeping the replicas consistent makes writes relatively expensive, so I think this strategy is better suited to read-dominated workloads.

Summary:
The authors describe the architecture and performance of a distributed storage system, Petal: a set of servers that collectively use their physical disks to provide the abstraction of arbitrarily sized, highly available virtual disks that can tolerate a single server failure.

Problem:
The ideal goal for a global storage system is to be highly available and fault tolerant (and to recover from faults), with unlimited performance and full resource utilization. This is clearly a myth to achieve. The main problem is to find sweet spots of compromise in the design.

Some of these challenges are deciding the amount of redundancy, load balancing, resource utilization and management overhead, and deciding how well the system should handle network, disk, or server failures.

Contributions:

  • Creating Petal and showing its performance to be "close" to that of the real thing (locally attached disks).
  • Petal handles heterogeneous file systems, since it provides a block-level interface.
  • Fast, efficient snapshots of virtual disks by exploiting copy-on-write is a smart idea.

Learning:
The design details and concerns involved in building a distributed virtual disk are a great takeaway from this paper.
Question:
In chained declustering, what exactly is meant by adjacent servers, and why is the adjacency constraint needed?
How are they confident that the system scales linearly with the number of servers? Also, they see an almost equal number of failed requests; I am not sure why the failure count is so high.

Review 11: Petal Distributed Virtual Disks
Problem:
- The paper wants to build a fast, scalable, fault tolerant, load balanced, shared storage system with little management.

Summary:
- Petal creates a shared storage system by mapping a virtual disk identifier and offset, through a global map, to the physical server, disk, and offset.
- Petal uses data replication for fault tolerance.
- Petal performs about as fast as locally attached disks.

Contributions:
- The paper tells us how to map from {virtual-disk-identifier, offset} to {server-identifier, disk-identifier, disk-offset}, so that we can use pointers to access data in a shared storage system. Therefore, we only need to change the mapping when redistributing the data across more servers, which makes this scheme scalable.
- The paper presents multiple fault tolerance schemes. It first presents crash-consistent snapshots in case a disk goes down. It later presents data distribution schemes such as even-odd chained declustering to tolerate site failures. Combining both of these ideas makes Petal quite fault tolerant.
- Load balancing of the replicated data is accomplished dynamically (sketched below). If a server goes down, then 1) other servers will pick up the load of the failed server, and 2) still other servers will pick up load from the servers that absorbed the failed server's load. Although this scheme works for load balancing, latency increases because requests must resort to the secondary copy. I think load balancing could be done better with random distribution, so that no server is overloaded when its neighbors fail.
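Here is a toy model of that load spreading; the greedy pick-the-less-loaded-replica policy is my own stand-in for Petal's actual dynamic scheme, just to show how a failed server's reads spread over the survivors rather than doubling the load of a single mirror:

    # Each block is readable from its primary or its chained-declustered
    # secondary. With server 2 failed, greedily routing each read to the
    # less-loaded live replica keeps the survivors roughly balanced.
    # Illustrative only; details are assumptions, not the paper's algorithm.

    from collections import Counter

    N = 4
    FAILED = {2}

    def replicas(block):
        p = block % N
        return p, (p + 1) % N        # primary, secondary

    load = Counter()
    for b in range(10000):
        live = [s for s in replicas(b) if s not in FAILED]
        target = min(live, key=lambda s: load[s])   # less-loaded copy wins
        load[target] += 1

    print(dict(load))   # extra load is shared, roughly 10000/3 per survivor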

Confusing:
- Although I understand what each module does [Figure 3], I don't understand how the modules are used. What do the arrows mean? Since there are no arrows pointing to the recovery module, does Petal always run the recovery module until another module is required?

Learned:
- A shared storage system was managed manually and expensively prior to the publication of this paper. People used to manually move, partition, and replicate files and directories across multiple disks.

Summary
The authors describe Petal, a scalable, fault-tolerant distributed system that gives clients the illusion of local disks.

Problem
The fault tolerance and scalability of distributed systems are very attractive, and storage systems can benefit from them. Distributed storage solutions that are competitive with locally attached disks while providing the benefits of distributed systems are needed. The authors try to achieve this.

Contributions
They provide scalability and performance by decoupling data that needs to be global (and thus consistent) from data that can stay local.

Fault tolerance is achieved by replicating the critical information (the virtual disk directory and global map) needed to map data from the virtual view to a server. Replication is achieved with the Paxos algorithm.

Each server locally stores the information that maps data to physical offsets. This lets physical disks be maintained locally.

Addition and removal of servers is handled automatically by dynamically changing the global map, redistributing virtual disks, and reaching consensus on the new information using Paxos.
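A toy sketch of the majority requirement behind that consensus step; this is deliberately not Paxos (no proposal numbers, no recovery protocol), only an illustration of why progress needs a majority of reachable servers:

    # Toy stand-in for Paxos-replicated global state: an update commits only
    # if a majority of servers acknowledge it. Names are assumed.

    class Server:
        def __init__(self, name, alive=True):
            self.name, self.alive, self.gmap = name, alive, None

        def ack(self, proposal):
            return self.alive            # a dead server never acknowledges

        def commit(self, proposal):
            if self.alive:
                self.gmap = proposal

    def update_global_map(servers, new_gmap):
        acks = sum(1 for s in servers if s.ack(new_gmap))
        if acks <= len(servers) // 2:    # no majority: refuse the update
            return False
        for s in servers:
            s.commit(new_gmap)           # majority reached: install everywhere
        return True

    cluster = [Server("srv0"), Server("srv1", alive=False), Server("srv2")]
    print(update_global_map(cluster, {"vdisk7": "gmap-epoch4"}))  # True (2 of 3)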

To provide further configurable fault tolerance, they use a chained-declustering scheme where data on one server is replicated on a neighboring server.

Learning
Decoupling data that needs to be global from data that can stay local, along with minimizing the data that needs to be global, gives better control of distributed systems.

Confusion
It is not clear how experiments involving only a few nodes justify the claim that this is a highly scalable solution.

Replicating to a neighbour means replication may not survive a power failure or network partition. I am not sure how replicating to a neighbour, and to only one, gives the much-needed resilience.

Flaws
It doesn't take the physical characteristics of storage disks into consideration.

Summary
In this paper the authors describe the architecture of Petal, a distributed storage system in which a set of servers abstracts a pool of physical disks to provide block storage to clients. Clients can aggregate the blocks into arbitrarily sized containers called virtual disks and, using the Petal device driver, access them over the network as transparently as they would a directly attached physical disk.

Problem
In this paper the authors aim to solve the problem of building a globally accessible storage system with high availability and very good performance, with little management overhead. Petal also aims to provide the ability to increase the storage capacity of the system by adding more servers and disks without affecting clients that are already accessing the system.

Contributions
1. The authors build the system to expose storage in terms of blocks instead of directly building a file-system interface; this makes it possible to run any file system or database application that requires block storage directly on top of Petal.
2. The ability to create snapshots of data very quickly and efficiently using a copy-on-write technique, which is useful for creating cheap backups and for data archival.
3. The ability to dynamically add/remove servers or disks without affecting the operation of the system.
4. The system is also designed to recover from failures and to load balance client requests among surviving nodes efficiently.

Learned
The technique of moving only small portions of the virtual disk during reconfiguration, without quiescing client I/O requests.

Confused
There are two copies of every data block, a primary and a secondary. How are the read/write semantics for these two types of blocks handled during virtual disk reconfiguration? Are the two treated differently or the same?

Summary:

Petal is a distributed storage system designed to provide a virtual-disk abstraction layer, separated from the actual physical resources, that recovers from any single failure, scales, balances load, offers ease, flexibility, and transparency of configuration, and provides low-level services in a distributed environment.

Problem:

There is a trade-off among storage system properties such as fault tolerance, high availability, load balancing, global access, incremental scalability, and easy management, especially because of network partitioning and component failures. Petal tries to provide a highly available, efficient solution that gives multiple clients global access, tolerates component failures, balances load among the servers managing the physical disks, and at the same time provides highly configurable administrative options.

Contributions:

  • Uses a layered abstraction approach to separate file objects and naming from block placement and allocation.
  • Decoupling clients' view of storage from the real physical disks makes sharing of physical resources more flexible, efficient, and scalable.
  • An easy reconfiguration mechanism is provided; dynamic membership is easy.
  • Logical objects are represented using a large, sparse address space.

Learned:

I found the abstraction of the actual physical layer behind a reliable block layer very interesting, since it hides the lower layer's complexities from the upper layer.

Confusion:

The failure-recovery model for a permanently failed node is a little confusing to me.

Summary:
The paper introduces Petal, a distributed storage system that provides a virtual disk abstraction separate from the physical resource. This block-level storage system is easy to manage, globally accessible, highly available and scalable. Heterogeneous clients and applications are supported.

Problem:
Large storage systems are difficult to manage. A single component failure can halt the system, and resuming operation requires significant time and effort. Periodic monitoring is needed to reduce fragmentation and eliminate hot spots, and manual effort is required to partition and replicate files/directories.

Contributions:
1. Block-level interface: simple design, easy to scale, provides reliable storage, and supports heterogeneity.
2. Transparency achieved through a set of virtual disks. A virtual-to-physical translation module maps the virtual addresses supplied by clients to physical addresses.
3. Chained declustering to replicate blocks over multiple servers. Servers are organized into a loop, and blocks stored at one server are replicated at the next server in the loop.
4. Incremental addition of resources to increase capacity.
5. Backup and recovery of clients' data through snapshots of virtual disks, based on a copy-on-write mechanism.

Learning:
The main learning is how a lower-level service can tackle various distribution problems quite well. Petal uses several levels of indirection to achieve tolerance of single points of failure, transparent reconfiguration, and fast, efficient support for backup and recovery.

Confusion/Questions:
How is load balancing actually done? The tests used only four clients, and Petal performed nearly the same as local disks. It would be interesting to know how the system would perform under heavy concurrent loads.
Also, if two adjacent servers fail, data will be lost. Why was further replication not considered?

Summary:
This paper introduces the implementation of a distributed storage system composed of a collection of servers that cooperatively manage a pool of physical disks. The interface provided by the storage system is virtual disks, an abstraction layer similar to the interface provided by physical disks.

Problem:
In a storage system:
1. Components (disks, servers, or the network) might fail
2. Nodes need to be geographically distributed to tolerate site failures
3. It needs to scale up with more servers and disks
4. It needs to balance load and capacity across the servers in the system
5. It needs to support fast and efficient backup and recovery

Contributions:
This paper proposes a distributed storage system that manages a pool of physical disks and provides an interface similar to physical disks to the layers above. Internally, Petal maintains global state information with a Paxos-based algorithm implementing distributed, replicated state machines. The algorithm ensures correctness in the face of arbitrary combinations of server and communication failures and recoveries, and guarantees progress as long as a majority of servers can communicate with each other. The addition of a disk to a server is handled locally by that server, but adding a server updates the global state information and also redistributes data across the servers. Petal uses chained-declustered data access and recovery modules to replicate each virtual disk's data across multiple disks on multiple servers; compared to a simple mirrored redundancy scheme, it achieves better load balance. By placing all even-numbered servers at one site and all odd-numbered servers at another, it can tolerate site failures (a toy sketch follows). Using copy-on-write techniques, Petal can quickly create an exact copy of a virtual disk at a specified point in time, which upper layers can use to implement backup. A nice feature of Petal is that its interface resembles traditional disks, making it easy to integrate with normal file systems and database systems to build distributed versions of them.
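A small sketch of the even/odd site placement argument; the layout is my own illustration of the idea:

    # With an even number of servers in the chain, server i and its chained
    # neighbour i+1 (mod N) always have different parities, so placing
    # even-numbered servers at one site and odd-numbered at another puts
    # every block's two replicas at different sites. Illustrative only.

    N = 8  # must be even for the parity argument to hold

    def site(server):
        return "site-A" if server % 2 == 0 else "site-B"

    for block in range(4):
        p = block % N
        s = (p + 1) % N
        print(f"block {block}: primary srv{p}@{site(p)}, "
              f"secondary srv{s}@{site(s)}")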

Discussion:
My concern about this paper is that the abstraction layer distributes and manages data at such a fine granularity. That makes optimization of the lower layer more difficult, and we can't apply any specialized optimizations above the block-access interface for future physical disk implementations. It also limits the room to deploy intelligence in each server, because Petal restricts what servers can do in the system.
Since data is distributed and managed at such a fine granularity, the scalability of the system is also limited: the overhead of maintaining information about these blocks grows as the system scales.
What I have learned from this paper is that making a system's interface compatible with existing systems is sometimes very useful in practice.

Problem and Summary:

This paper describes the design and implementation of Petal, a distributed block storage system atop which file systems and databases can be deployed. Clients see Petal as a highly available block-level storage system that provides an abstraction called virtual disks. There are different modules, such as the global state module, the liveness module, and the translator module, which work together.

Contributions:

1. Virtual-to-physical translation using three levels of indirection: each Petal server maintains three data structures, namely the VDir, GMap, and PMap. The VDir and GMap are globally replicated, but the PMap is specific to a single storage server. Using these data structures, client-supplied virtual disk locations can be transparently mapped to a physical server, disk, and offset within it. If a disk is added to a server, only the local PMap has to change.

2. Snapshots - Using COW techniques and a slightly different mapping from VDir to GMap, Petal gives clients the ability to take snapshots (see the sketch after this list).

3. Incremental Reconfig - when a new server/disk is added to the cluster, a client might want to reconfigure a virtual disk. In that scenario, Petal creates a new GMap, points the VDir entries at this new GMap, and redistributes data to the servers according to the new mapping. Petal relocates the data in small portions called fenced regions (the portion currently being moved from the old to the new mapping). During relocation, requests to the new and old portions can still proceed. Once a fenced portion is relocated, it is marked as new.
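The sketch promised above: a minimal copy-on-write snapshot model with epoch numbers. The class and its structure are my own illustration, not Petal's implementation:

    # A snapshot freezes the current epoch; the next write to a block stores
    # a new version under the new epoch instead of overwriting the old one.

    class VirtualDisk:
        def __init__(self):
            self.epoch = 1
            self.blocks = {}           # (block, epoch) -> data

        def write(self, block, data):
            self.blocks[(block, self.epoch)] = data

        def read(self, block, epoch=None):
            e = self.epoch if epoch is None else epoch
            while e > 0:               # fall back to the newest older version
                if (block, e) in self.blocks:
                    return self.blocks[(block, e)]
                e -= 1
            return None

        def snapshot(self):
            snap = self.epoch          # the frozen, read-only version
            self.epoch += 1            # future writes go to a new epoch
            return snap

    vd = VirtualDisk()
    vd.write(0, "v1")
    snap = vd.snapshot()
    vd.write(0, "v2")                  # copy-on-write: snapshot is untouched
    print(vd.read(0), vd.read(0, epoch=snap))   # -> v2 v1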

Learning:

Indirections in the map data structures almost always have some neat properties.

Confusion:

For writes, Petal replicates synchronously to the secondary server. This can increase write latency, since disks are already slow. Also, it is not clear to me whether ATM networks play an important role in this system. How would it work on IP networks?

Summary:
This paper proposes Petal, a pioneering distributed storage system that operates at block-level granularity and is fault tolerant, easily scalable, and transparent to clients. Its measured performance is similar to locally attached disks.

Problem:
Implement a distributed storage system with the following features:
1. Availability: it should appear to run normally to clients despite failures or the addition/removal of machines in the system.
2. Global accessibility: the physical disks may be distributed across geographically different places, and all should be accessible.
3. High performance: it should not be much slower than reads/writes on a local disk system.
4. Transparency: to the client, the system looks just like one big disk.

Contributions:
1. This paper proposes a block-level distributed storage system, instead of the previous file/directory-level systems. This makes the system easily scalable and maintainable, and supports heterogeneous clients.

2. Chained declustering introduces redundancy (tolerating single-server failures) and spreads load in a cascading manner onto neighboring servers (better performance).

3. Snapshots are taken periodically for potential later recovery (or other uses). The copy-on-write technique speeds up snapshotting.

4. Incremental reconfiguration is used to preserve transparency, availability, and performance when a new machine is added.

5. The classification into global data structures (virtual disk directory and global map) and local data structures (physical map) means only the global part needs to be replicated, decreasing the amount of replication.

One thing I learnt:
The chained-declustering strategy is beautiful and achieves both fault tolerance and load balancing.

One thing I found confusing:
The chained-declustering strategy also has flaws: it suffers when geographically nearby machines fail together, and failures of closely neighboring machines may be positively correlated. If both a machine and its replica fail often, how can Petal deal with it?

Summary:

Petal is a distributed, replicated, fault-tolerant storage system. It tries to combine the following features: high performance, resilience to any single component failure, geographic distribution, transparent scaling on demand, and efficient backup and recovery. Petal offers a "virtual disk" interface backed by a number of servers, each with a number of disks. The virtual-disk interface gives the implementers the flexibility to provide features such as incremental scalability and snapshots transparently to the user. Petal provides a low-level block-granularity interface rather than a file-system interface. Clients view the Petal system as a set of virtual disks and access it through an RPC interface. The state of the system is stored on the servers, and only hints are stored on clients. The liveness module ensures the liveness of server nodes. Information about the storage system is maintained and replicated across all servers; the global state manager keeps it consistent using the Paxos algorithm. Since every administrative decision goes through Paxos, they claim the system is fault tolerant.

Client-supplied virtual identifiers are translated to physical identifiers using two global data structures and one per-server PMap structure. Backup is done using snapshots, which record the GMap translation with an epoch number; this is very fast, taking less than a second. Incremental reconfiguration is possible in the system. They also propose a new data placement scheme for this distributed setting.

Contributions:

1. A distributed low-level storage system that gives clients a virtual-disk interface at block-level granularity.
2. Fault tolerance achieved using the Paxos algorithm for the global manager state and global data structures.
3. "Lightning speed" snapshots for backup.
4. Incremental disk addition as well as server addition; a naive algorithm copies the data to the new configuration while still servicing read and write requests.
5. A second, refined algorithm that relocates data region by region.
6. A new type of data placement, chained declustering, which improves load balancing under failures though it is slightly less reliable than simple mirroring.
7. Primary and secondary server roles that help propagate writes consistently.

Confusing:

Where does the liveness module reside?

Learning:

How a simple RAID idea can be extended to a distributed storage system.

Summary:
The authors present Petal, a distributed storage system that provides virtual disks usable by a wide range of clients thanks to its level of abstraction (the block layer).

Problem/Goal:
- Want an easy-to-manage distributed storage system. Simple addition/removal of nodes/physical storage devices.
- Tolerate at least one failure, and tolerate certain patterns of multiple failures (wherein no data is lost completely) with graceful performance degradation.
- Provide simple opaque interface of block-layer device allowing many clients to take advantage of the system’s features with little/no modification.
- Provide performance comparable to other large storage systems.

Contributions:
- Virtual disks which span nodes. This allows for (automatic, once configured) tolerance to site failures by keeping redundant copies of blocks within the same virtual disk at more than one site.
- Incremental reconfiguration of a virtual disk, changing its redundancy or placement scheme. Particularly the idea of fencing regions of the disk and making the change in smaller pieces to avoid hurting performance while reconfiguring (see the sketch after this list).
- Chained declustering, an interesting redundancy scheme which provides redundancy similar to RAID-1 (though slightly less reliable, as discussed in the paper) and allows some dynamic load balancing across "neighboring nodes".
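The sketch promised above: a toy model of fenced reconfiguration, with assumed maps and a modulo placement standing in for the real ones:

    # The address space splits into a region already moved ("new"), a region
    # untouched ("old"), and a small fenced window being relocated right now.
    # Only fenced blocks need per-block bookkeeping; everything else is
    # unambiguously old or new. Purely illustrative.

    OLD_MAP = {b: f"old-srv{b % 3}" for b in range(100)}
    NEW_MAP = {b: f"new-srv{b % 5}" for b in range(100)}

    FENCE = range(40, 48)     # the window currently being relocated
    copied = {40, 41, 42}     # fenced blocks already moved to the new map

    def locate(block):
        if block < FENCE.start:
            return NEW_MAP[block]      # "new" region: already relocated
        if block in FENCE:             # fenced: check per-block state
            return NEW_MAP[block] if block in copied else OLD_MAP[block]
        return OLD_MAP[block]          # "old" region: not yet touched

    print(locate(10), locate(41), locate(45), locate(90))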

Learned:
I thought the use of fencing was very smart and seems like it could be useful in other areas where online modification of the system is safe to do incrementally and would be harmful to do all at once.

Confused:
I was somewhat confused by the mapping discussion. I wish a better example had been given, as is seen in virtual memory papers, which also use multi-tiered maps.

Summary:
Petal is a system that provides unified block-level storage access and is highly available, with good performance and low maintenance overhead. It uses indirection to create an abstraction that achieves these goals. On top of this, it incorporates the Paxos algorithm and simple replication strategies to provide an easily reconfigurable system that is both recoverable and scalable.

Problem:
Realizing a large storage space (even a virtual one) is difficult, especially when it is distributed across a collection of servers. It would otherwise require the client to maintain information about the underlying physical resources, which is hard to manage. A client would rather have a single global view of storage, just like a disk.

Contributions:
• The key goal achieved in Petal is a unified view of a collection of virtual disks. This lets the client access a large address space (2^64 bytes) without worrying about the underlying physical resources.
• The translation from virtual to physical addresses is analogous to that on a regular computer. The virtual disk ID maps to a server, and that server performs the translation to the offset. The optimization is to use the same server for translation and disk operations.
• The use of the Paxos algorithm for global operations makes the system fault tolerant. Reconfiguration is especially convenient with the global state management module, which maintains the system's membership status and allows easy addition/removal of disks by changing the global mappings in an immutable form.
• An interesting approach is the method of adding a new node so that it incrementally boosts system performance, rather than stalling the system for a long time while old data is copied. The technique of using "old, new, fenced" divisions further reduces the overhead.
• The chained-declustering technique provides dynamic load balancing and good recovery (comparable to mirroring). The only catch is the use of locks (during writes) in order to avoid deadlocks.

Confusing:
• In chained declustering they propose placing the two neighbors at separate sites to be resilient to a site crash; in this case won't there be excessive delay, especially because they use locks to achieve consistency?
• In the case of a read-after-write, can the read bypass the write lock (if the secondary hasn't completed yet) and read from the primary server (if it is ready)?

Learning:
• The most appealing feature of Petal is the reconfigurability of the system. The way it handles addition (or removal) of disks is really nice, since it doesn't cause any peak imbalance.

Summary:
Petal is a chained-declustered, block-level distributed storage system. The system exposes the notion of a virtual disk to clients, which can build file systems on top of Petal. Petal aims to be globally accessible and highly available, to match the performance of a local disk, and to handle scalability and failure transparently.

Problems:
The goal of Petal is to provide a highly available distributed block-level storage system. In this regard the authors solve problems relating to (i) transparently handling membership changes (i.e., nodes and disks entering/leaving the physical structure underlying the virtual disk Petal exposes), (ii) gracefully handling failures and degrading performance accordingly, (iii) efficiently handling incoming requests through load balancing, and (iv) snapshotting the global state without interrupting system progress.

Contributions:
- The notion of a virtual disk, which transparently hides the underlying physical disks that the client-facing virtual disk is built on top of
- Implementation of Paxos to handle the maintenance of global state
- The algorithm to perform virtual-to-physical disk address mapping
- Incremental reconfiguration which transparently handles the addition of nodes and disks to the Petal system (allows the system to scale in a manner hidden from clients)
- The data access algorithm which also doubles as part of Petal’s load balancing scheme among its cluster of servers.

Learned:
I was impressed to learn that a distributed block level storage system could be built in an efficient manner that allowed it to provide similar performance to a local disk on a computer. This creates a large set of possibilities with which to exercise such a system (such as building file systems as they suggest).

Confused:
It was not clear in the paper (at least to me) how the system actually achieves the writing of replicated data on the different servers. The explanation is very brief and feels rather like pseudocode. I have to imagine the details are more nuanced, as the issue of consistency is discussed in detail in more or less every paper we have read thus far.

Summary
The paper describes the design details, implementation, and performance of Petal, a distributed storage system focusing on availability, capacity, and performance.

Problems to solve
A single large storage system can become a bottleneck in terms of cost and complexity. Maintaining the system can become a problem, and if it goes down, all work halts until it is brought back up. The goal is to implement a storage system that is easy to manage, distributed, and highly available.

Contributions
Separation of the storage system into block-level storage and a file system. This allows greater flexibility and easier design and implementation. This feature also supports heterogeneous clients.

Ability to back up and recover data with the help of snapshots, done by maintaining version numbers called epoch numbers.

Ability to change redundancy policies or mappings due to changes in the number of disks and servers. This is achieved incrementally while allowing normal processing of client requests.

Use of chained declustering to better tolerate failures of servers placed at different sites. It also allows better load distribution among the available servers.

Confusing about the paper
I didn't understand the global map data structure. It seems that it holds the mappings for every server, which would make updates really expensive.
It is also mentioned that the server that performs the GMap translation to determine the responsible server also performs the task of finding the physical disk and offset. So does the client hold some kind of mapping (a hint) for which server to send requests to?

Learnings
I learned the advantage of a block-level interface for improving the heterogeneity of clients.

Summary: This paper presents the design of a scalable, fault-tolerant block-level storage system that provides an interface and performance similar to that of local disks. This abstraction allows the construction of distributed file systems, databases etc on top of the Petal storage layer.

Problem: As we hit the limit of scaling-up the storage available at a single node, we are forced to use distributed designs. But we would like to maintain the abstraction of a local disk array, providing the same interface (device driver). Such an abstraction can be a powerful component in the design of a DFS, database or other applications, as long as performance and fault-tolerance guarantees can be made.

Contributions:
* The abstraction of a virtual container allows hiding of the distributed aspects of the system as well as heterogeneity in the physical components.
* Petal's design incorporates fault-tolerance through chained declustering, which ensures load-balancing even under node failures. Further, if "neighboring" nodes in the declustering are at geographically separated sites, the system is capable of handling site failures and network partitions.
* Snapshots are implemented using copy-on-write, so that a snapshot can be taken in a short period of time. However, this requires quiescing the system and is not a backup.
* Petal allows runtime incremental reconfiguration as well as addition and deletion of disks on-the-fly. The algorithm for relocating blocks (for load balancing) relies on multiple layers of abstraction hiding the mapping from virtual disks to the "global map" and physical disks.
* The extent of globally shared data needed to be maintained is largely limited to immutable or rarely-changing data structures, replicated using Paxos.
* The primary/secondary (master/slave) replication with read-one/write-all allows for strong consistency, a necessity when hiding the distributed aspects of the system (a minimal sketch follows).
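The minimal sketch mentioned above: a toy read-one/write-all pair. Names and structure are my own assumptions, not the paper's code:

    # A write is acknowledged only after every replica is updated, while a
    # read may be served by any single replica. Petal coordinates the
    # synchronous write through the primary; this toy omits failure handling.

    class Replica:
        def __init__(self):
            self.store = {}

        def write(self, block, data):
            self.store[block] = data

        def read(self, block):
            return self.store.get(block)

    primary, secondary = Replica(), Replica()

    def write(block, data):
        primary.write(block, data)     # write-all: update both copies
        secondary.write(block, data)   # before acknowledging the client

    def read(block, replica=None):
        return (replica or primary).read(block)   # read-one: any copy works

    write(7, "hello")
    print(read(7), read(7, replica=secondary))    # -> hello hello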

Confusing:
* Fault-handling in the system depends on the secondary being able to identify a failed primary and taking over the write responsibilities. But what happens in case of a network partition - can there be multiple primaries for the same data element now? Does this break the strong consistency guarantee?
* Abstraction usually trades off with performance. The design of many systems such as databases relies on physical disk optimizations - how well do such systems perform when built on top of Petal?

Learned:
"All problems in computer science can be solved by another level of indirection" - Lampson. This quote attributed to Lampson is typified by Petal's approach to distributed system design. Petal abstracts away the distributed aspects and physical heterogeneity behind a simple, well-known block-based interface. Snapshots, online reconfiguration, and backups all depend on the use of an additional layer of indirection.

Summary :
The Petal system aims to provide a distributed block storage system that clients can use transparently to build databases or file systems on top of. Other goals the authors try to achieve are live scalability, high availability, and good performance.

Problem :
The paper targets the problem of creating a globally accessible, always available store that provides good performance and requires relatively little management. The solution is to effectively distribute the block interface across many storage nodes instead of a single node.

Contributions :
1. Transparency for the clients that use Petal, via a data access module that can be customized based on the storage model. This involves transparently converting a supplied virtual address to the location of the physical node that holds the data.
2. Support for snapshotting and versioning of snapshots that the clients can later use for different purposes.
3. Support for incremental reconfiguration which provides the ability to scale effectively on a live system without having to take it down when new storage nodes are added.
4. A very neat idea was to achieve all the above in a manner that does not incur significant overhead over doing local operations.

What I found confusing :
Since applications are paused while a snapshot is taken, and they say a snapshot takes only about a second, I don't understand how this is possible. Even with copy-on-write, it should take some time to mark the blocks as copy-on-write in the persistent store.

What I learned :
I learned that it is possible to implement a lower-level interface, such as a block interface, as a distributed system. And not just that: it can be done with high performance and consistency!

Petal is a distributed storage system that makes use of multiple techniques to improve the ease of use, availability, fault tolerance, and performance of a storage system.

Contributions :

1. Transparency is highly emphasized by employing the concept of virtual disks. The client views the entire storage system as a set of virtual disks, and the virtual-to-physical translation component resolves the physical address from the virtual address provided by the client.
2. The system uses two global data structures, the virtual disk directory and the global map, to resolve the target server ID. The physical map is locally resident on that server and is used to derive the physical disk address and offset. A good decision was to classify these structures as global and local such that only the global data structures are replicated.
3. The Petal client interface is quite simple and is implemented as a kernel device driver on the client end.
4. It has a mechanism to support backup and recovery, using a copy-on-write technique to create read-only snapshots of the virtual disks.
5. The system supports incremental reconfiguration using a refined algorithm that relocates only small portions of a disk at a time, thereby ensuring good availability and transparency of the reconfiguration to the Petal client.
6. Chained declustering is used for data placement. This makes the system tolerant of single component failures and helps with dynamic load balancing. They also geographically distribute the disks so as to tolerate site failures.

Learning

I have learned about the importance of virtualization in this system, which can be used to make changes to the physical storage system transparently to the client. The concept of transparent reconfiguration is simple yet greatly improves the availability of the system.

Confusion

1. How is the data distributed among the disks? Blocks that are accessed together (locality) could be placed together so that disk fetch latency is also reduced.
2. Every write request requires synchronous writes to the primary and secondary disks. If the primary or secondary replica is slow to respond, write latency is quite high. The system was tested with a minimal configuration, but is this latency acceptable in practical scenarios?
3. In the case of multiple component failures, in which both the primary and the secondary disks go down, the data becomes unavailable. Is that acceptable in a practical system?

Summary:
This paper presents a highly available, globally accessible, and scalable storage system that provides a block-level interface to clients. Clients view the system as a collection of virtual disks (containers), an abstraction over physical servers (each with some number of disks) that collectively provide storage. The overall system is divided into modules that handle different responsibilities and use each other.

Contribution:
1. The major contribution of this paper is using three levels of indirection (VDir, GMap, PMap) to translate client-supplied virtual addresses to physical addresses on a physical disk. Together, these three data structures provide a host of desirable features:
a) Adding a physical disk to a server only requires local changes on that server (changes to the PMap).
b) Configuration changes for a virtual disk can be made by creating a new version of the GMap and then atomically updating the VDir to point to the new GMap for that virtual disk. And because the old version of the GMap is kept, reconfiguration can be done incrementally (using the old GMap where required). To reduce the load, the address space is divided into new, old, and fenced regions.
c) Immutable GMaps are also used to create snapshots of a virtual disk with a copy-on-write technique. Every snapshot has a different GMap (identified by an epoch number); the latest epoch number denotes the current virtual disk.
d) This separation of translation data structures keeps the bulk of the mapping information local. Only a small portion (GMap, VDir) is maintained on all servers using a Paxos-based consensus algorithm.

2. Most distributed storage systems provide an interface at the level of individual objects (files, directories), but Petal provides a block-level interface, which makes it simple to model, implement, and tune.

Learned: I learned that a distributed system can be designed at different granularity levels. In Petal, the granularity is the block. There are different challenges at different granularities; for example, at the block level, if some server fails, we may be able to access only part of a larger entity (a file).

Confused: Using Paxos to maintain consistent global state can introduce delays when servers frequently join and leave the system. Also, the paper mentions tolerating power outages and natural disasters by distributing block replicas across geographic sites. In this scenario, a network partition can cause the translation data structures to diverge. How are they reconciled after recovery?

Summary: The paper presents Petal, a distributed disk architecture. It allows clients to create and write to virtual disks at the block level. The authors created a system that is easily scalable, fault tolerant, and able to recover in a way that is transparent to clients. Performance is similar to that of locally attached disks.

Problem: It is hard to manage and get good performance out of a large-scale storage system. Current systems are not very resilient to failure, and ensuring good performance requires manual tuning of different components (manual balancing of data and replication). This is obviously not a good solution for either users or managers of such a system. Therefore, the authors' goal was to make much of this manual work transparent while offering clients a better service in terms of fault tolerance and load balancing.

Contribution: The key contribution is the design and implementation of a distributed virtual disk system. It allows clients to interact with it on the block level and therefore it doesn't take much client modification in order to transfer to using this system.

One of the first good design points of Petal is that all important state is stored on the servers. This matters for fault tolerance: if any important state lived on the client, a client failure could lose data.

All servers know about all other servers, and Paxos is used to manage membership and agree on global state. Petal also lets clients take snapshots, using a copy-on-write scheme to do so.

Servers are added to the system easily: the global map is changed and pushed to all servers, and the new mapping redistributes data across servers. The tricky part is making this redistribution concurrent with client access to the data. They solve this by doing the redistribution incrementally in "fenced" regions, where offsets may be in either the old or the new global map.

Recovery is achieved by replication of data across servers. However, the two copies are "always stored on neighboring servers". This seems to go against the design principle of tolerating power outages and natural disasters, since neighboring servers will probably be physically near each other. Writing to the system follows a simple write-to-primary model; if the primary is down, they can write to the secondary, and they have mechanisms for updating the primary's stale data when it comes back.

Confusion: How exactly is data redistributed and replicated across servers? The paper gives the high level overview in that it flatly says that data is redistributed and replicated for fault tolerance, but it doesn't go into the actual mechanism for how this is done. For instance, are they using consistent hashing?

Learned: I learned about the Petal distributed storage system. Most interesting is that it is an alternative to RAID in terms of a block-level interface for clients, with the unique ability to create multiple virtual disks. I still need to be convinced that it is as fault tolerant as RAID. Petal can handle one component failure, but can it handle multiple?

Summary
The Petal system is an early form of a Storage Area Network that provides a flexible, fault-tolerant, distributed block-device interface on which file systems can be built. The authors show that it is possible to develop such a system without too much of a performance penalty while still achieving nice features, such as versioned snapshots and flexible, live expansion of the storage servers in use.

Problem
The authors recognize the need for a storage system that resembles a mythical unicorn: one that is globally accessible, always available, offers unlimited performance with capacity for a large number of clients, and needs no management. The proposed system tries to provide reasonable solutions to most of these problems.

Contributions

  • The concept of a virtual disk, and the provision of a hierarchical address-translation mechanism from virtual to physical addresses, transparent to clients accessing the store. (Though this is slightly reminiscent of the HP AutoRAID system.)

  • Software-based RAID management, and the ability to select between levels at disk creation time.

  • The ability to take snapshots of the virtual disks and version them.

  • Live reconfiguration of data by partitioning the content of the virtual disk as old, fenced or new, and using this ability to also scale out as more storage nodes are added.

What’s not clear

  • When new writes are directed to the newly added node, and a background copy from the existing storage node is happening due to reconfiguration, how does the system guard against the background copy overwriting the newly written value?

  • Also, how are clients trusted to do the load balancing? Couldn't malicious clients target a copy and take it down with a DoS attack?

A learning
The solution to almost half the problems in systems is to add a level of indirection to it :).

Summary:
In this paper the authors present the design and implementation of Petal, a block-level storage system consisting of a set of servers and physical disks connected over a network, which presents every Petal client with the view of a single, highly available virtual disk that tolerates disk, server, and network failures. This abstraction also makes it possible to provide nice properties such as dynamic reconfiguration, load balancing, and replication.

Problem:
The problem this paper tries to solve is designing a highly available, block-level distributed system that is fault tolerant and incrementally expandable, while performing at least as well as a single local disk system. Prior file and disk systems achieved only a subset of the following goals: a globally accessible storage system that tolerates any single component failure, balances load, can be incrementally deployed and geographically distributed, and is easy to administer. Petal tries to achieve all of these goals in practice.

Contributions:
-By separating the storage system into a block-level storage system and a file system, scalability and maintainability are improved.
-The complete state information is maintained on the servers, which keeps clients thin and makes failure handling and recovery easier.
-By using chained declustering, it provides high availability of data and better performance.
-The incremental virtual disk reconfiguration scheme improves performance over time without hurting current performance.
-Petal uses separate modules, each performing its own function, which is better in terms of maintenance.

Flaw:
The replication scheme used by Petal may not be robust in the face of component failures due to its deterministic placement: if two consecutive servers fail at the same time, certain blocks become unavailable. To make it more reliable, more sophisticated mechanisms could be used.

Learned:
-Having a simple block-level store with a virtual disk abstraction is a lot simpler and more flexible, and can be used to build a complex file system.
-Having a good modular design makes it easy to add new features; in Petal, new access schemes can be added very readily.

Confusions/Questions:
-With a heavy read/write load involving many create-delete operations, how does Petal handle disk fragmentation?
-One design goal of Petal is to keep as much state on the servers as possible, with clients maintaining only a few mapping hints. But other papers (like Dynamo) argue that to build a scalable system, clients should do as much work as possible instead of the servers. In this respect, is Petal really scalable?

Summary:
This paper describes the design, implementation, and performance of Petal, a distributed storage system. As a result, Petal provides its clients with the view of a collection of virtual disks.

Problem:
Manage large storage systems is an expensive and complicated process.
1. One node fails would cause the systems to take a long time to resume.
2. The capacity and performance of individual components in the system must be periodically monitored and balanced to reduce fragmentation and eliminate hot spots.
3. Specific clients or libraries are needed in order to use a typical distributed storage system implementation.

Contributions:
1. Uses chained-declustering to achieve the goals of balancing load and tolerating a single site failure. As with RAID-1, Petal keeps working if one server of a replica pair goes down.
2. Reconfigures transparently to grow in performance and capacity as new servers and disks are added.
3. Different clients can view the same disks, which makes it easy to use; no changes are needed in the upper-level file system.
4. Automatically and periodically takes snapshots of the whole system so that it can be restored later. Copy-on-write is adopted to speed up the process of creating snapshots.
5. A refined algorithm breaks the whole virtual disk's address space into three regions, old, new, and fenced. This reduces traffic by limiting the more expensive handling to the fenced region (see the sketch just below).
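
To make item 5 concrete, here is a minimal sketch of how the three-region routing might look; the names (route_request, region_of, copied) are invented, and this is my reading of the algorithm rather than the authors' code:

def route_request(voffset, region_of, old_map, new_map, copied):
    """Translate a virtual offset while a reconfiguration is in progress."""
    region = region_of(voffset)
    if region == "old":
        return old_map(voffset)    # not yet scheduled for relocation
    if region == "new":
        return new_map(voffset)    # this region has already been relocated
    # "fenced": relocation in progress. Use the new placement only for blocks
    # already moved (or freshly written there); otherwise fall back to the old
    # placement. A background copier must re-check 'copied' before writing,
    # so it never clobbers a newer client write.
    return new_map(voffset) if voffset in copied else old_map(voffset)

Keeping the fenced region small and non-contiguous is what bounds the performance hit while the copy runs.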

Confusing:
1. Like the new security issues raised by cloud computing services, might sharing disks also introduce some kinds of security issues?
2. Is the translation from virtual disks to physical disks really good enough to support different upper-level file systems? File systems like FFS and LFS have specific optimizations based on the physical characteristics of magnetic disks; would these kinds of optimizations affect the performance of Petal?

Learned:
Virtualization is a really popular way to increase the utilization of resources and provide exactly the same service to different users.

Summary: Petal provides a block-level abstraction that is backed by a distributed system. It has mechanisms for data replication and maintaining availability in spite of node failures.

Problem: at the time this paper was written, there were many discrete systems for managing data, and it was not easy to scale them in storage size or in the number of clients accessing them. It was difficult to introduce redundancy to maintain consistency while keeping a system that is available and not difficult to administer.
Contributions:


  • Instead of storing data at the file level, provide a block-level abstraction that heterogeneous clients can share

  • Redundancy: data is striped and replicated across contiguous (neighboring) servers rather than arbitrary ones, so that in case of a failure no single node absorbs all of the increased traffic

  • Distributed naming scheme: the nodes that store a block keep an internal data structure for looking up its physical location, which reduces the information that has to be sent around

Confusing: Snapshots: how exactly are they taken? Is it just a matter of assigning an epoch number to each block, so that instead of overwriting a block in place, a new block with the same identifier but a newer epoch is created with the new data? Then the only storage overhead of a snapshot is keeping the old versions of blocks that have since been written to? (A toy sketch of this reading follows.)
Blocks: is it possible for two contiguous blocks to be stored on separate servers? This would hurt performance!
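
Here is a tiny copy-on-write store along the lines the question sketches; the class, its dict-based storage, and the method names are all mine rather than the paper's, but it shows why taking a snapshot itself is nearly free:

class CowStore:
    def __init__(self):
        self.epoch = 0
        self.blocks = {}                # (block_id, epoch) -> data

    def snapshot(self):
        self.epoch += 1                 # freeze everything written so far
        return self.epoch - 1           # id of the read-only snapshot

    def write(self, block_id, data):
        self.blocks[(block_id, self.epoch)] = data   # old epochs never change

    def read(self, block_id, at_epoch=None):
        e = self.epoch if at_epoch is None else at_epoch
        while e >= 0:                   # newest version at or before e wins
            if (block_id, e) in self.blocks:
                return self.blocks[(block_id, e)]
            e -= 1
        raise KeyError(block_id)

store = CowStore()
store.write(7, b"v1")
snap = store.snapshot()
store.write(7, b"v2")
assert store.read(7) == b"v2" and store.read(7, snap) == b"v1"

Storage grows only for blocks written after the snapshot, which is exactly the overhead the question guesses at.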

Learned: in the context of transparency, I found it interesting that they chose to provide a block-level storage service instead of a file-level one. This makes it easy to implement almost any file system on top, which means it could have real-world applications. However, those file systems assume they are accessing a local disk and are not optimized for a networked environment. I am surprised that the file systems did so well in the benchmarks!

In the evaluation, they mention reading different amounts of data. What if they had played around with block sizes?

Summary:
This paper describes the design and implementation of Petal, a storage system distributed across physical servers that cooperatively manage a pool of disks. In the storage stack, Petal can be thought of as a layer that sits between the disks and the file systems. To its clients, Petal appears as a set of virtual disks that support the operations a typical disk would support (read, write, snapshots, etc.). Petal contains multiple modules that perform virtualization, fault tolerance, and replication functions.

Problem:
The growing size of storage systems raises various concerns, such as capacity management, performance, failure handling, and load balancing.
Building an easily manageable storage system that caters to a large number of clients, and that ensures the request load is evenly distributed across a dynamically changing set of servers and disks, is therefore necessary.

Contribution:
1. The block-level interface aids in supporting heterogeneous clients.

2. Virtualizing the disks aids in sharing the available storage flexibly among multiple clients. Though virtualization is not a new concept, the authors applied the idea at the level of disks. Through virtual disk reconfiguration, clients can also configure the type of replication they require for a virtual disk and the number of servers it should span.

3. Various operations, like snapshotting and adding/deleting servers, are made fault tolerant by applying Paxos (the part-time parliament algorithm). This ensures correctness even in the presence of server failures.

4. Provides a snapshotting feature that helps clients back up data. The snapshots are copy-on-write and readable.

5. Incremental reconfiguration handles load distribution across a changing set of nodes and disks.

6. Chained-declustering ensures redundancy and spreads load in a cascading manner across neighboring servers. By placing even-numbered servers at one site and odd-numbered servers at another, site failures can be tolerated (see the placement sketch below).
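
A small sketch of the placement rule as I understand it (the function names and the test setup are mine); it shows both the neighbor replication and why an even/odd site split survives a whole-site failure:

def replicas(block, n_servers):
    primary = block % n_servers
    secondary = (primary + 1) % n_servers   # the neighbor holds the second copy
    return primary, secondary

def readable(block, n_servers, failed):
    # a block is lost only when both of its adjacent replicas are down
    return any(s not in failed for s in replicas(block, n_servers))

n = 4
assert all(readable(b, n, failed={2}) for b in range(100))     # any single failure is fine
assert all(readable(b, n, failed={0, 2}) for b in range(100))  # losing the whole "even" site is fine
assert not readable(2, n, failed={2, 3})                       # two adjacent failures can lose data

The last line is also the weakness other reviews point out: only adjacent pairs of failures are dangerous, but they are dangerous.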

One thing I found confusing:
I am confused about whether the GMap is a single collection of entries with one entry per virtual disk, or whether each server keeps a collection of GMaps, one GMap per virtual disk.

One thing I learnt:
Minimizing the information kept in global state reduces the cost of replication.

Summary:

The paper presents the design and implementation of Petal, a distributed storage system that provides client transparency through the use of virtual disks, good load balancing, automatic reconfiguration during failures, and an efficient backup mechanism for the data. A distributed file system can be hosted efficiently on top of its architecture. The work also compares the performance of a Petal-based file system implementation to local file system implementations.

Problem:

  • Managing a large distributed storage system is complicated and involves providing high availability, reliability, and consistency.
  • Monitoring and recovering from failures automatically, such that load gets evenly redistributed.
  • Shielding clients from the complicated distributed operations.

Contributions:

  • Providing block-level access to data instead of file-level access is appealing.
  • Client transparency is achieved by hiding the virtual-to-physical mapping from the client.
  • Use of the Paxos algorithm in the global state manager component provides fault tolerance during storage server or network failures (see the toy quorum sketch after this list).
  • Use of the copy-on-write technique, with epoch numbers to track versions, is very effective for taking snapshots for backup and recovery.
  • Choosing non-contiguous blocks of data for the fencing operation is a clever way of ensuring better data availability.
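
For the Paxos point, here is a minimal sketch of the quorum idea only; everything here (the class, the accept/apply split) is invented, and real Paxos additionally handles proposal numbering, competing proposers, and recovery:

class Server:
    def __init__(self, up=True):
        self.up = up
        self.vdir = {}                      # replicated global state

    def accept(self, update):
        return self.up                      # a down server cannot vote

    def apply(self, update):
        if self.up:
            key, value = update
            self.vdir[key] = value

def update_global_state(servers, update):
    votes = sum(1 for s in servers if s.accept(update))
    if votes <= len(servers) // 2:          # no majority: refuse the update
        return False
    for s in servers:
        s.apply(update)                     # everyone learns the chosen value
    return True

cluster = [Server(), Server(), Server(up=False)]
assert update_global_state(cluster, ("vdisk1", "gmap7"))   # 2 of 3 suffices

Even in this toy, the point survives: global state changes commit only when a majority of servers agrees, so a minority of failures cannot fork the vDir/GMap state.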

Unclear concept:

In the virtual disk reconfiguration algorithm, they state that they change the VDir entries to point to the new global map before they copy the data. Isn't the VDir mapping the only way to reach the old GMap? If they replace it before the copy, how can they still access the old GMap correctly? The means to solve this is presumably simple (some backup data structure), but I wasn't clear how they do it in their system.

Learning:

Dynamic load balancing, or load redistribution during failures, using the concept of chained-declustering was something I learned from this system design.

Summary: Petal is an incrementally expandable distributed storage system that stores data at the block level (so it is suited to sit at a level below a distributed file system). Petal uses a set of software modules, virtualization, and replication to provide a highly available, fault-tolerant, redundant virtual disk that transparently presents a consistent view to the client (while hiding the underlying network and hardware implementation).

Problem: As storage systems become larger and more distributed, they grow increasingly difficult to manage and share. The solution is to separate the client's view from the underlying physical resources, which allows the resources to be shared more freely among many clients.

Contributions:

1-The design of Petal, a highly available block-level storage system
Software modules:
-Liveness module: maintains a majority consensus that the system is alive
-Global state manager: consistently maintains state information using the Paxos algorithm
-Recovery, data access, and virtual-to-physical translation modules: these work together to service client read/write requests
Virtual-to-physical translation: the VDir, GMap, and PMap are the three most important data structures; the scheme is somewhat similar to page tables in a standard OS and is also what supports snapshots.
Virtual disk reconfiguration: they provide two algorithms, a simple one and a more refined one. Requests to the old and new regions go to the old and new maps respectively, while requests to the fenced region use the original algorithm. By progressively shifting data in small, non-contiguous fenced regions, they avoid a hard performance hit while transferring data.
Chained-declustered data access and recovery modules: these provide highly available access to data by automatically bypassing failed components. One copy of a data block is the primary and the other the secondary. A read request succeeds unless both copies of the data have been destroyed; a write uses an algorithm that keeps the primary and secondary copies consistent (a hedged sketch follows this contribution).
Simple RPC interface: 24 different calls are available for creating and deleting virtual disks, creating snapshots, reconfiguring a virtual disk, etc.
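
A rough sketch of the chained-declustered access path described above; the structure is invented (the paper's actual write protocol also has to handle failures mid-write), but it shows why reads survive a single failure and how writes keep the copies consistent:

class Replica:
    def __init__(self):
        self.up = True
        self.data = {}

def read(primary, secondary, block):
    for copy in (primary, secondary):    # either live copy can serve a read,
        if copy.up:                      # which lets neighbors absorb a failed
            return copy.data.get(block)  # server's read load
    raise IOError("both replicas unavailable")

def write(primary, secondary, block, value):
    live = [c for c in (primary, secondary) if c.up]
    if not live:
        raise IOError("both replicas unavailable")
    for copy in live:                    # update every surviving copy; the real
        copy.data[block] = value         # system also marks a missed copy stale
                                         # so it can be resynced after recovery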

2-Performance Tests:
-The tests show Petal's performance is lower than, but surprisingly close to, the performance of a local physical disk

What I Found Confusing: I wasn't completely clear on why chained declustering is considered reliable, given that if a server fails and one of its neighbors also fails, data can become unavailable.

What I Learned: Virtualization provides an excellent abstraction that can make a distributed storage system more scalable, more fault tolerant, and easier to manage, all at the same time. The performance was surprisingly close to that of a local disk. Also, every system should always use Paxos (just kidding, but it does seem to be the only game in town…). More seriously, Paxos is clearly usable in many different contexts.

Summary:
Petal is a distributed storage system that transparently hides the management of physical disks and servers from the applications using it by implementing virtual disks. It aims to be highly available, consistent, performant, and fault tolerant, with recovery that is transparent to the user.

Problems:
In addition to virtualizing the disks by transparently mapping virtual disk reads and writes to physical disks, Petal solves several other problems:
• Fault tolerance for single-component failures at several levels, including disks, servers, and the network
• Fault tolerance for data center outages by being geographically distributed
• Adding and removing server members seamlessly
• Roughly linear performance growth when new capacity is added, without additional administration overhead
• Load balancing of accesses to the physical disks, so access times to virtual disks don't spike and break the transparency
• Easy and quick backup, and recovery from snapshots

Contributions:
Petal has several contributions differentiating it from other distributed storage systems of its time. Instead of file-level granularity, as in NFS and AFS, Petal offers block-level granularity for its virtual disks. This allows further work (like Frangipani) to build upon the abstraction and create file-level storage systems. The major contribution of this paper is the idea of disk virtualization, which makes it possible to allocate physical resources scalably and to add and remove nodes easily and transparently beneath existing virtual disks.

Unclear:
It was a little unclear why the system has three separate maps for the overall mapping of a virtual block to a physical disk block. It seems Petal could have forgone one of the mappings and used just a global map and a physical map to increase performance. Also, it seems odd that they used striping and then resorted to chained-declustering replication; I feel erasure coding would have been a significantly better choice and would have allowed more flexibility when adding and removing members, as well as for load balancing.

Learned:
Virtualization is all the rage these days; having worked on Amazon's AWS S3, I can see how the research on this system contributed to theirs. I learned why virtualizing at the disk layer is a wise choice, as it provides transparency regardless of physical failures or physical capacity increases.

Petal: Distributed Virtual Disks

Summary:
Petal is a distributed system that provides a block-level interface to its clients, which are file systems that access Petal as virtual disks. The system is globally accessible (virtual disks can be in different geographic locations), always available (addition or deletion of disks or servers doesn't affect performance for the clients), highly performant (reads and writes are comparable to local disks), provides huge storage capacities for clients by pooling multiple disks, and needs no manual intervention. The client views the system as a set of large virtual disks. Petal provides these features using several techniques: virtual-to-physical translation, incremental reconfiguration, and its data access and recovery protocols.

Problem:
The authors are trying to ease the management of large storage systems, where components can be added or removed frequently, the failure of some components shouldn't halt the system, and issues like disk fragmentation, overloading, and hot spots should be handled gracefully.

Contributions:
1. The main contribution of the paper is combining the ideas of distribution and virtual containers to arrive at virtual disks, which provide a block-level distributed storage system.
2. Splitting the data structures that map a virtual disk address to a physical disk address into three layers, the virtual directory, the global map, and the physical map, makes it easier to synchronize the global data structures using Paxos, since most of the detail is pushed into the physical map (a toy version of this lookup appears after this list).
3. The use of epoch numbers to take snapshots of a virtual disk is an important contribution, since the contents of a virtual disk are spread across multiple physical disks.
4. The chained-declustering technique provides many advantages over plain mirroring, as it balances load more evenly across the remaining servers.
5. Breaking up the virtual disk's address space into three regions, old, new, and fenced, lets the reconfiguration process proceed incrementally when adding a new server or disk.
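
A toy version of the three-level lookup in point 2; all the names and the lazy allocation via setdefault are mine, but it shows how small the globally replicated pieces can stay:

vdir = {"vdisk0": "gmap0"}                            # global, tiny, replicated
gmaps = {"gmap0": lambda off: "server%d" % (off % 4)} # global, tiny, replicated
pmaps = {"server%d" % i: {} for i in range(4)}        # per-server, large, local

def translate(vdisk, voffset):
    server = gmaps[vdir[vdisk]](voffset)              # steps 1 and 2: vdir, then gmap
    disk, poff = pmaps[server].setdefault(voffset, ("disk0", voffset))
    return server, disk, poff                         # step 3: that server's own pmap

assert translate("vdisk0", 5)[0] == "server1"

Only vdir and gmaps need consensus; the bulky per-server pmaps never leave their server, which is exactly why this split keeps replication cheap.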

One thing I learnt:
I learnt that reconfiguration in such a distributed system can be done so gracefully that the client has no idea it is happening. This was very interesting to me, as I had previously thought the client could observe such reconfigurations because the system would halt for at least some time.

One thing I found confusing:
I'm still not clear about the Global Map data structure. I'm not sure whether there is a single global map shared among all the servers or multiple global maps, one for each virtual disk.

Overall, I think the authors have developed an excellent distributed disk system that provides a block-level interface on which distributed file systems can be built.

Summary:

Petal describes how to virtualize disks over a distributed set of physical disks in a way that accommodates heterogeneous environments, is easy to administer, does reasonable load balancing, scales with servers and disks, and provides a certain degree of fault tolerance.

Problem:

Managing a large storage system is an expensive process; single component failures can frequently bring the system to a halt. The capacity and performance of every node need to be monitored, fragmentation occurs, and hot spots develop. Manual administration requires moving, partitioning, or replicating files and directories.

Contributions:

- Chained-declustering mechanism that, on failover, spreads load in a cascading manner to neighboring servers.
- Virtualizes disks over a network, as opposed to the traditional approach of implementing a distributed file system.
- Provides correctness guarantees over arbitrary combinations of server failures and works correctly despite network reordering of messages.
- Provides a copy-on-write snapshot mechanism for fast snapshots; we see this mechanism in modern file systems like btrfs and in HDFS.
- Allows clients to specify a redundancy scheme for each virtual disk.

One thing I found confusing:

I'm confused about the structure of the physical map. Is it a function that maps a virtual offset to a physical offset, or is it a table? If it's a table, isn't it enormous? (A rough size estimate follows.)
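
A back-of-the-envelope estimate, assuming the map is a table in coarse translation units (the 64 KB unit and 8-byte entry size below are my assumptions, in the spirit of the page-table analogy):

UNIT = 64 * 1024          # assumed translation granularity
ENTRY_BYTES = 8           # assumed size of one map entry
disk_bytes = 100 * 10**9  # say, 100 GB of physical disk behind one server

entries = disk_bytes // UNIT
print(entries, "entries, about", entries * ENTRY_BYTES // 2**20, "MiB of map")
# -> 1525878 entries, about 11 MiB: sizable but cacheable, like an OS page table

So even as a plain table it stays in the tens of megabytes per server, not something enormous.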

One thing I learned from paper:

I learned that the chained-declustering mechanism is an elegant way to distribute load.
