I felt that the paper in its current form should not be accepted into SOSP. The paper did a good job of summarizing the functionality of Dryad and highlighting its good attributes. It also described at length the kinds of scenarios where they felt Dryad would be applicable and how it could be modified, via an abstraction layer and the like, to ease the learning curve. However, where it was most seriously lacking was the section evaluating Dryad's performance. Though they ran two tests, a SQL query over a modest 10 nodes and a data-mining job over a cluster of 1,800 nodes, they only gave a scaling analysis for the SQL task (which, considering it was only run on 10 nodes, did not present a very reliable result). There was no real way to assess Dryad's performance on the data-mining task.
That all being said, however, the paper was an interesting read and presented a way to better handle the execution of data-parallel tasks. Most of the techniques they employed, though not highly original in their own right, did seem to give somewhat promising results when used in conjunction as they did.
This paper is recommended for inclusion in SOSP. While others have attempted to simplify parallel processing through the development of frameworks (MapReduce, for example), Dryad builds a much more general system while preserving simplicity, providing various generic components with reasonable performance to ease parallel programming. Additionally, Dryad employs various ideas regarding reliability that other systems pioneered.
The authors have tested Dryad both in a small local environment and on a large cluster of 1,800 workstations, lending credence to the claim that it does indeed scale. The need for more control over the computational layout was argued for and is easy to agree with, given that different applications run in different ways. The authors supported this argument when they demonstrated improved performance for their test application simply by modifying the run-time graph.
One aspect in which this paper could be improved is a comparison of performance against other frameworks. Dryad can be set up to run exactly the same programs as frameworks such as MapReduce, and its performance should be comparable in such cases for the system to be a feasible alternative.
As its design goal, Dryad aims to provide a general-purpose execution environment for distributed, data-parallel applications, with a focus on throughput. Like MapReduce, Dryad hides many tricky concurrent-programming problems, e.g., threads and fine-grained control, from the program developer. More importantly, it abstracts away issues such as resource allocation, scheduling, and transient or permanent failures of computations in the system. In short, it makes it much more convenient for programs to fully exploit the power of clusters. Unlike existing work such as MapReduce, Dryad provides more flexibility without placing much constraint on the application model. For example, MapReduce needs the application to fit the Map+Reduce style, whereas Dryad executes jobs as arbitrary DAGs. In Dryad, the developer has control over the communication graph, in addition to the subroutines in the vertices, by specifying an arbitrary directed acyclic graph. With this flexibility, Dryad expects developers to be able to produce more efficient programs.
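To make that flexibility concrete, here is a minimal sketch of what specifying such a graph might look like. The names (Graph, addStage, connectPointwise, connectAllToAll) are hypothetical illustrations of the idea, not the actual Dryad graph-description API:

```cpp
// Hypothetical sketch of describing a job as a DAG of vertices and channels.
// Names (Graph, addStage, connectPointwise, connectAllToAll) are illustrative,
// not the actual Dryad graph-description API.
#include <cstdio>
#include <string>
#include <utility>
#include <vector>

struct Graph {
    std::vector<std::string> vertices;           // vertex id -> label (the "subroutine")
    std::vector<std::pair<int, int>> edges;      // channel: (producer, consumer)

    // Create n replicas of a vertex program; return their vertex ids.
    std::vector<int> addStage(const std::string& program, int n) {
        std::vector<int> ids;
        for (int i = 0; i < n; ++i) {
            ids.push_back((int)vertices.size());
            vertices.push_back(program + "[" + std::to_string(i) + "]");
        }
        return ids;
    }
    // One channel from the i-th producer to the i-th consumer (pointwise composition).
    void connectPointwise(const std::vector<int>& from, const std::vector<int>& to) {
        for (size_t i = 0; i < from.size() && i < to.size(); ++i)
            edges.push_back({from[i], to[i]});
    }
    // A channel from every producer to every consumer (complete bipartite composition).
    void connectAllToAll(const std::vector<int>& from, const std::vector<int>& to) {
        for (int f : from)
            for (int t : to) edges.push_back({f, t});
    }
};

int main() {
    Graph g;
    auto read  = g.addStage("ReadPartition", 4);  // 4 input partitions
    auto sort  = g.addStage("SortRecords",   4);  // sort each partition locally
    auto merge = g.addStage("MergeSorted",   1);  // a single merging vertex
    g.connectPointwise(read, sort);               // partition i feeds sorter i
    g.connectAllToAll(sort, merge);               // all sorters feed the merger
    for (auto& e : g.edges)
        std::printf("%s -> %s\n", g.vertices[e.first].c_str(), g.vertices[e.second].c_str());
    return 0;
}
```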
My concern about Dryad is that it is currently closely bound to C++. The developer needs to be experienced in C++ to use Dryad. Also, since Dryad executes programs as arbitrary graphs, it may take some effort to tune performance. How easy is that tuning? Some experiments on that would be great.
I would strongly accept this paper.
This paper presents a much more flexible model than MapReduce, using a job manager and a graph-building language to compose vertices of computation and edges of communication channels between the vertices. The paper addresses in detail the concerns that may arise from such a system, i.e., performance, scalability, fault tolerance, and programmability.
The weakness of this paper is that it follows the approach of giving the developer the power to manipulate the data dependencies of the computation. This can not only be error-prone but can also cause performance bottlenecks if the developer does not handle these concerns correctly. It is similar to MPI, where the programmer's ability severely constrains the performance obtained from the system. The job manager also appears to be a single point of failure in the system; there is no mention of this in the paper.
Contributions:
- The key contribution, as it comes across from the paper, is the representation/specification of the computation as a DAG.
- Dryad builds a general purpose distributed computation engine, which supports multiple data transfer mechanisms.
- The paper claims scalability from multi-core machines to clusters of thousands of computers.
- The project also designs a simple graph description language for enabling ease of programming.
- Legacy executable support is an interesting feature, although it doesn't figure in the core idea of the paper.
Weaknesses:
- The idea of using a DAG to parallelize computations and spread them across a cluster has been in existence for quite some time. Condor has an implementation of the same idea, but while they mention Condor in the paper, they completely gloss over this.
- While the paper assumes DAGs to be the perfect flexible approach for parallelizing computations and builds up a lot on top of that, there is little analysis done to show that DAGs are in fact the way to go. I would like to see a discussion on how well existing sequential applications can be translated into a parallel model using DAGs.
- Although they talk about scalability across thousands of computers and start out with massive figures for data sets in section 2.1, what they ultimately processed seems to be rather small. (I only skimmed section 6, so I could be wrong here.)
- The fault-tolerance approach taken seems less capable than the approach used in Google's MapReduce. From reading the paper, I get the impression that it is eminently possible for entire computations to fail due to bottleneck blocks, because they attempt to re-execute not only the failed node but also its dependencies, potentially causing a flood of re-executions (see the sketch after this list).
- To me, the paper is rather poorly written and poorly motivated for the top conference in systems.
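To illustrate the re-execution concern from the fault-tolerance point above, here is a rough, hypothetical sketch (not code from the paper) of why a single failure can flood backwards when inputs arrived over volatile channels such as pipes or shared memory:

```cpp
// Hypothetical illustration of the cascading re-execution worry: if a failed
// vertex read its inputs from volatile channels (pipes / shared memory), its
// producers must be re-run as well, and so on transitively.
#include <cstdio>
#include <set>
#include <vector>

struct Vertex {
    std::vector<int> producers;      // upstream vertex ids
    bool inputsPersisted;            // true if its inputs were written to durable files
};

void scheduleRerun(int v, const std::vector<Vertex>& g, std::set<int>& toRerun) {
    if (!toRerun.insert(v).second) return;          // already scheduled
    if (g[v].inputsPersisted) return;               // durable inputs: stop here
    for (int p : g[v].producers)                    // volatile inputs: rerun producers too
        scheduleRerun(p, g, toRerun);
}

int main() {
    // A small chain 0 -> 1 -> 2 -> 3 where only vertex 1 has durable inputs.
    std::vector<Vertex> g = {
        {{},  true},   // 0: source, reads from durable storage
        {{0}, true},   // 1: inputs persisted to disk
        {{1}, false},  // 2: fed by a pipe from 1
        {{2}, false},  // 3: fed by a pipe from 2
    };
    std::set<int> toRerun;
    scheduleRerun(3, g, toRerun);                   // vertex 3 fails
    for (int v : toRerun) std::printf("re-execute vertex %d\n", v);
    // Prints vertices 1, 2, 3: the failure floods back until durable inputs are found.
    return 0;
}
```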
Overall, I would only accept this paper if I were really badly starved for innovative papers at OSDI. The reason: the weaknesses seem to outweigh the contributions.
I think this paper should not be accepted. The system presented in the paper does not really make it easier for developers to write efficient parallel and distributed applications, at least not as easy as the MapReduce system does.
Contributions: It implements a general-purpose data-parallel execution engine at a lower level. It is more flexible, as the programmer can control the dataflow graph. It can be used in more fields, since one can add layers on top of it.
Weaknesses: It is not that easy to use this system; one needs to provide at least a dataflow graph and the vertex programs. A Dryad application may be difficult to debug, as the underlying system is somewhat complex. What's more, it is not open source.
I would probably not accept this paper to SOSP. First, although they tackled the problem of more complex, layered distributed tasks, in most cases this problem can also be solved by using multiple levels of MapReduce. Second, they didn't provide a good way to do system load balancing. Last, the system adds more complexity and is more error-prone, especially at large scale. I would doubt the scalability of their system and its performance with multiple jobs.
Contributions: They introduced a way for developers of distributed systems to have more flexibility in designing their tasks and more control over the data communication of their tasks. They addressed the problem of distributed tasks more complex than a simple one-step map-reduce, which is quite useful in applications like distributed databases. They did not fix their channel methods, which allows more flexibility for different systems and tasks (as opposed to MapReduce's fixed approach of writing to disk); on the other hand, for channels using shared memory or pipes, failures will propagate to all the related upstream nodes. Their experiments on real large-scale clusters are also valuable.
Weaknesses: They didn't address the situation where multiple jobs are asking for system resources; their simple local greedy algorithm to assign nodes won't work well or balance load in such situations. They added complexity to the programming environment, which can cause pain for developers. More importantly, this could introduce a lot of errors, especially in large-scale systems, which are very difficult to debug. And there is the problem of propagating failures in error handling, as mentioned above.
The Dryad architecture is completely generic: it doesn't assume anything about the application or impose restrictions on it. A key difference between Dryad and MapReduce is that Dryad supports several stages of operations and thus represents a superset of MapReduce. It also supports multiple sources of inputs and outputs. The other major strength of the system is the run-time refinement of the graphs: the runtime engine can detect certain network configurations and may introduce extra layers of indirection that take advantage of data locality, in an effort to minimize the network bandwidth consumed during aggregation operations.
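As a rough illustration of that refinement idea (a hypothetical sketch, not the paper's actual algorithm), an aggregation stage can be rewritten at run time so that producers on the same rack feed a per-rack aggregator before a final combiner, keeping most of the traffic within a rack:

```cpp
// Hypothetical sketch of run-time graph refinement: insert one intermediate
// aggregator per rack between many producers and a final combiner, so that
// most aggregation traffic stays within a rack. Not the paper's algorithm.
#include <cstdio>
#include <map>
#include <string>
#include <vector>

struct Edge { std::string from, to; };

int main() {
    // Producers discovered at run time, tagged with the rack they run on.
    std::vector<std::pair<std::string, std::string>> producers = {
        {"map0", "rackA"}, {"map1", "rackA"}, {"map2", "rackB"}, {"map3", "rackB"},
    };
    std::vector<Edge> edges;
    std::map<std::string, std::string> rackAgg;      // rack -> inserted aggregator vertex

    for (auto& p : producers) {
        auto it = rackAgg.find(p.second);
        if (it == rackAgg.end())                     // first producer seen on this rack:
            it = rackAgg.insert({p.second, "agg_" + p.second}).first;  // add an aggregator
        edges.push_back({p.first, it->second});      // producer -> local rack aggregator
    }
    for (auto& ra : rackAgg)
        edges.push_back({ra.second, "final_combiner"}); // one cross-rack edge per rack

    for (auto& e : edges) std::printf("%s -> %s\n", e.from.c_str(), e.to.c_str());
    return 0;
}
```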
The evaluation of the system is very weak: the system is compared against a single-node SQL Server database. SQL Server isn't the best-performing database on the market, and a comparison against it doesn't really tell us anything. The other benchmark, which simulates a map-reduce type of workload, is also insufficient. The overheads of the system are not brought out in these experiments; would it not have been possible to compare the execution times and network bandwidth consumption against a Hadoop implementation of the same problem in the same environment? The handling of stragglers and heterogeneous environments is also very vague and incomplete. The paper claims that Dryad could scale to a large setup; the experiments don't show this either. The requirement that users learn a new language to construct graphs could be a potential barrier to adoption.
SOSP? The work represents an incremental improvement over what MapReduce offers and doesn't introduce anything conceptually new. The effectiveness of the new architecture isn't evaluated in a systematic manner. This paper isn't of the quality one has come to expect at SOSP.
One of the major contributions of this paper is its scalability: it scales from a multi-core machine to a cluster of thousands of computers in a generic manner. Unlike MapReduce, which focuses on simplicity, Dryad is much more inclined towards generality of the framework. Hence, this is both a contribution and a weakness; from a systems perspective, I think it is rather a weakness. The programmability of Dryad is definitely a powerful contribution. Abstracting the underlying gory details in the form of a graph description language empowers the developer with explicit graph construction and refinement, to fully take advantage of the rich features of the Dryad execution engine.
The paper describes a new framework to develop and execute data-parallel computations that can fit broad use cases. The authors claim that the solution can scale from multiple processors in a single machine to multiple machines in a distributed system with good communication bandwidth. The Dryad framework allows developers to define a distributed application as a directed acyclic graph, where the vertices are computations that run on machines and the directed edges are communication channels; the direction of an edge follows the direction of data flow between vertices. Developers can therefore highly customize the flow of data and the number of compute instances using the graph description language that Dryad provides. Once the application is developed, the Dryad framework takes care of scheduling, execution, tolerating communication and machine failures, and so on. Also, the description graph can be changed dynamically based on the execution environment.
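For a sense of what "taking care of scheduling" might involve, below is a minimal, hypothetical sketch of a greedy dispatch loop; it illustrates the general idea only, not the scheduler described in the paper:

```cpp
// Minimal hypothetical sketch of a greedy dispatch loop over a vertex DAG:
// a vertex runs as soon as all of its inputs have finished and a machine is
// free, preferring the machine that already holds its input data. This is an
// illustration of the general idea, not the scheduler described in the paper.
#include <cstdio>
#include <string>
#include <vector>

struct Vertex {
    std::vector<int> inputs;       // upstream vertex ids
    std::string preferredMachine;  // where its input data happens to live
    bool done;
};

int main() {
    std::vector<std::string> freeMachines = {"m0", "m1", "m2"};
    std::vector<Vertex> dag = {
        {{}, "m0", false},      // vertex 0: a source
        {{}, "m1", false},      // vertex 1: another source
        {{0, 1}, "m0", false},  // vertex 2: consumes both sources
    };

    bool progress = true;
    while (progress) {
        progress = false;
        for (size_t v = 0; v < dag.size(); ++v) {
            if (dag[v].done || freeMachines.empty()) continue;
            bool ready = true;
            for (int in : dag[v].inputs) ready = ready && dag[in].done;
            if (!ready) continue;
            // Greedy placement: take the preferred machine if it is free, else any free one.
            size_t pick = 0;
            for (size_t m = 0; m < freeMachines.size(); ++m)
                if (freeMachines[m] == dag[v].preferredMachine) pick = m;
            std::string chosen = freeMachines[pick];
            freeMachines.erase(freeMachines.begin() + pick);
            std::printf("run vertex %zu on %s\n", v, chosen.c_str());
            dag[v].done = true;             // toy model: the vertex finishes immediately,
            freeMachines.push_back(chosen); // so its machine goes back into the pool
            progress = true;
        }
    }
    return 0;
}
```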
Vote: Not accepted
Weaknesses:
a) The paper lacks clear reasoning about why Dryad was designed the way it is. Rather, the paper focuses on describing what Dryad is, and it does not even do that completely.
b) Paper organization: a lot of material could be removed to make space for sound design reasoning, and the example given could have been a simpler, more generic one instead of a complex SQL query.
c) Weak evaluation: better evaluations are needed that provide glimpses into Dryad's performance in both absolute and relative (to other distributed programming frameworks) terms.
d) More example applications in sufficient detail that stress the wide applicability of the framework are needed.
Strengths:
a) They have attempted to come up with a distributed computing framework that is more generic in terms of applicability to problems.
b) They have concentrated on making the framework allow customizable data flow across nodes, which is very suitable for data-intensive applications.
Personally, I think it could be accepted if their work and objectives were more focused. The whole system appears to be an anthology of various prior systems merged into an integrated solution, whereas their most significant contributions are the design and implementation of the communication graph, the language defined to build such graphs, and the run-time refinement, which increase the distributed and parallel elements, although there is a trade-off in programming simplicity relative to other work such as Google's MapReduce. Another weak point of the paper is the evaluation section. The evaluation consisted of a comparison between Dryad and a non-distributed system, plus a test on data mining. It would have been more worthwhile to show a comparative evaluation between Dryad and similar systems, to quantify their contributions.
From my point of view, this paper could be accepted. Its contributions lie in the following:
1. It proposes a sound model for data-parallel programming. Though a good deal of research has already been conducted in this area, this work approaches it from a different perspective.
2. The Dryad infrastructure provides concurrent execution, reliability, scheduling, and resource management. The system has decent performance, and a number of applications have already been built on top of it.
This paper presents a new distributed execution engine, Dryad. Developers have to provide a communication graph explicitly to Dryad, where vertices are sequential programs that run on physical machines and edges represent communication between vertices. Dryad then automatically schedules vertices to run simultaneously on available machines in the cluster. It hides all sorts of issues from the developer, e.g., load balancing, task scheduling, and failure handling. It also applies runtime refinement to improve performance.
In my opinion, the contribution of this paper is just that it shows a more general distributed execution framework that can handle more problems than others (e.g., MapReduce). Developers can control the flow of execution freely in the form of a communication graph, and the system can further optimize the graph at runtime to gain more performance.
However, there are some problems in the design and the experiments. First, to a large extent, the success (I mean achieving high performance) of a job running on Dryad depends on a carefully designed communication graph. This is implied by the evaluation in the paper: the graph for the SQL query already had several optimizations, and the graph used in the data-mining job needed to be rearranged to achieve good scalability and performance. This may be a barrier to applying Dryad widely, because developers have to spend more time learning about and thinking through communication graphs. Second, the run-time graph refinement does not seem to be truly done at run time. The second experiment is a good example: from the paragraph below Figure 9 and the first point at the end of that page, I get the feeling that it was the authors, not Dryad, who tried different refinements to get Figure 10. Lastly, the experimental results are not convincing, because all of the graphs used were highly optimized by hand. Also, in the second experiment they only show numbers with no comparison. It would be better if they could show a comparison between Dryad and MapReduce.