Suhail Shergill

Dryad is designed to be a general-purpose distributed execution engine for coarse-grain data-parallel applications. In simpler terms, Dryad is a framework for carrying out data-parallel tasks on large clusters of nodes in a way that scales well. The Dryad execution engine tackles the problems of building a large distributed, concurrent application: scheduling across resources, optimizing the level of concurrency within a computer, recovering from communication or computer failures, and delivering data to where it is needed. Dryad accomplishes this by running the data-parallel computational pieces concurrently on a cluster of nodes and optimizing the scheduling.

I felt that the paper in its current form should not be accepted into SOSP. The paper did a good job of summarizing the functionality of Dryad and highlighting its good attributes. It also described at length the kinds of scenarios where the authors felt Dryad would be applicable, and how an abstraction layer on top of it could ease the learning curve. However, it was most seriously lacking in the section that evaluated Dryad's performance. Though they ran two tests, a SQL query over a modest 10 nodes and a data-mining job over a cluster of 1800 nodes, they gave scaling results only for the SQL task (which, considering it was run on only 10 nodes, is not a very reliable result). There was no real way to assess the performance of Dryad on the data-mining task.

That said, the paper was an interesting read and presented a better way to run data-parallel tasks in parallel. Most of the techniques the authors employed, though not highly original in their own right, did seem to give somewhat promising results when used in conjunction.

Mark Sieklucki

Dryad presents a framework aimed at making it easier and more effective to write parallel applications. It models computation as an acyclic graph of computational vertices connected by communication edges. This graph representation is programmed in conjunction with the application (and can typically be reused for a variety of applications), and it dictates the way in which the program is distributed and executed on the system.

This paper is recommended for inclusion in SOSP. While others have attempted to simplify parallel processing through development of frameworks (MapReduce, for example), Dryad develops a much more general system while preserving simplicity by creating various generic components with reasonable performance to ease parallel programming. Additionally, Dryad employs various ideas regarding reliability which other systems pioneered.

The authors have tested Dryad both in a small local environment and on a large cluster of 1800 workstations, lending credence to the claim that it does indeed scale. The need for more control over the computational layout was argued for and is easy to agree with, given that different applications run in different ways. The authors supported this argument when they demonstrated improved performance for their test application simply by modifying the run-time graph.

One aspect in which this paper can be improved is by comparing performance against other frameworks. Dryad can be set up to run exactly the same programs as frameworks such as MapReduce. Its performance should be comparable in such cases for the system to be feasible.

Chong Sun

I do not have any concrete feeling for what the standard is to get a paper into SOSP, but I think Dryad is a good paper in the sense that it does a lot of work addressing many existing and upcoming needs in cluster (or multi-core) computation.

As its design goal, Dryad aims to provide a general-purpose execution environment for distributed, data-parallel applications with a focus on throughput. Like MapReduce, Dryad hides many tricky concurrent programming problems, e.g., threads and fine-grain control, from the program developer. More importantly, it abstracts away issues such as resource allocation, scheduling, and transient or permanent failures of computation in the system. In short, it makes it much more convenient for programmers to fully exploit the power of clusters. Unlike MapReduce, Dryad provides more flexibility without placing many constraints on the application model. For example, MapReduce needs the application to fit into the Map+Reduce style, whereas Dryad executes jobs as an arbitrary DAG. In Dryad, the developer controls the communication graph as well as the subroutines in the vertices by specifying an arbitrary directed acyclic graph. With this flexibility, Dryad expects that developers can produce more efficient programs.

My concern about Dryad is that it is currently closely bound to C++. The developer needs to be experienced in C++ to use Dryad. Dryad executes programs as an arbitrary graph, so it may take some effort to tune the performance. How easy is the tuning? Some experiments on that would be great.

Asim Kadav

Dryad is a distributed execution engine for running data-parallel programs built from sequential programs. Execution takes the form of a DAG in which the applications form computational vertices and the edges are communication channels. A similar system, MapReduce, requires input/output in the form of map/sort/reduce operations; here, however, the DAG design gives additional flexibility, as vertices can have multiple inputs and outputs of different types. The paper describes with examples how the edges and vertices are formed and how a vertex program is written. The paper also emphasizes programmability and describes a graph description language for utilizing the Dryad execution engine.

I would strongly accept this paper.

This paper presents a much more flexible model than MapReduce, using a job manager and a graph-building language for composing vertices of computation and edges of communication channels between the vertices. The paper addresses in detail the concerns that may arise from such a system, i.e., performance, scalability, fault tolerance and programmability.

The weakness of this paper is that it follows the approach of giving the developer the power to manipulate the data dependencies of the computation. This can not only be error prone but can also cause performance bottlenecks if the developer does not handle these concerns correctly. It is similar to MPI, where the performance obtained from the system is severely limited by the programmer's ability. The job manager also appears to be a single point of failure in the system. There is no mention of this in the paper.

Varghese Mathew

Summary: This paper describes Dryad, a parallel programming approach developed at MSR. The key idea is to structure computations as directed acyclic graphs (DAGs), where each edge represents a data/output dependency and each vertex represents a computation. The DAG is specified by the application developer. Dryad then runs, in parallel, as many of the vertices whose dependencies are met as it can on the computers available in the cluster, and provides mechanisms to stream or buffer the output to the next dependent vertices. The system also supports partitioning the input data set into manageable sizes and running multiple iterations of a vertex to cover the entire input set.
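The scheduling rule summarized above (run every vertex whose dependencies are met) can be sketched in a few lines. This is only an illustration using Python's standard `graphlib` module, not Dryad's actual scheduler; the function `run_in_waves` and the vertex names are hypothetical.

```python
from graphlib import TopologicalSorter


def run_in_waves(dependencies):
    """Group vertices into waves. Every vertex in a wave has all of its
    dependencies met, so the vertices of one wave could run in parallel.

    `dependencies` maps a vertex name to the set of vertices it reads from.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # all vertices runnable right now
        waves.append(ready)
        sorter.done(*ready)  # pretend the whole wave finished successfully
    return waves


# Hypothetical job: two readers feed two mappers, which feed one merge vertex.
job = {
    "map1": {"read1"},
    "map2": {"read2"},
    "merge": {"map1", "map2"},
}
waves = run_in_waves(job)
print(waves)
```

A real engine would dispatch each ready vertex to a free machine as soon as it becomes runnable rather than waiting for a whole wave to finish; the waves here only make the available parallelism visible.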

Contributions:

- The key contribution as comes across from the paper is the representation/specification of the computation as a DAG.

- Dryad builds a general purpose distributed computation engine, which supports multiple data transfer mechanisms.

- The paper claims scalability from multi-core machines to clusters spanning thousands of computers.

- The project also designs a simple graph description language for enabling ease of programming.

- Legacy executable support is an interesting feature, although it doesn't figure in the core idea of the paper.

Weaknesses:

- The idea of using a DAG to parallelize computations and spread them across a cluster has been in existence for quite some time. Condor has an implementation of the same idea, but while the authors mention Condor in the paper, they completely gloss over this.

- While the paper assumes DAGs to be the perfect flexible approach for parallelizing computations and builds up a lot on top of that, there is little analysis done to show that DAGs are in fact the way to go. I would like to see a discussion on how well existing sequential applications can be translated into a parallel model using DAGs.

- Although they talk about scalability across thousands of computers, and start out with massive figures for data sets in section 2.1, what they ultimately processed seems to be rather small. (I glossed over section 6, so I could be wrong here).

- The fault tolerance approach seems less capable than the one used in Google's MapReduce. From reading the paper, I get the impression that it is eminently possible for entire computations to fail due to bottleneck blocks, because the system attempts to re-execute not only the failed node but also its dependencies (potentially causing a flood of re-executions).

- To me, the paper is rather poorly written and poorly motivated for making it to the top conference in systems.

Overall, I would accept this paper only if I were really badly starved for innovative papers at OSDI. The reason is that the weaknesses seem to outweigh the contributions.

Guoliang Jin

The Dryad system implements a general-purpose data-parallel execution engine. The purpose of this system is to make it easier for developers to write efficient parallel and distributed applications. A Dryad application is basically a dataflow graph composed of computational vertices and communication channels. The actual computation takes place in the vertices, and the communication channels are implemented, as appropriate, as files, TCP pipes, or shared-memory FIFOs. After describing the basic structure, the paper presents how to create the dataflow graph, how to write a vertex program (though not really in detail), and some optimizations used in the system. The system is tested with a SQL query and a data-mining job to show that it works well, and the data-mining test also demonstrates the optimizations. Finally, as claimed by the paper, Dryad can be used as a base system on which to build higher-level systems; some examples are given.

I think this paper should not be accepted, as the system presented does not really make it easier for developers to write efficient parallel and distributed applications, at least not as easy as the MapReduce system does.

Contributions: Implements a general-purpose data-parallel execution engine at a lower level. More flexible, as the programmer can control the dataflow graph. Can be used in more fields, as one can add layers on top of it.

Weaknesses: It is not that easy to use this system. One needs at least a dataflow graph and a vertex program. A Dryad application may be difficult to debug, as the underlying system is somewhat complex. What's more, it's not open source.

Yiying Zhang

Summary: Dryad provides an environment that gives developers control over the communication and data flow in their distributed tasks and allows more complex tasks than simple MapReduce. Communication graphs represent the data-communication relationships: vertices represent partitions of subtasks and edges represent data communication. Three types of channels are allowed: files, TCP pipes, and shared memory. After users have generated their communication graph (by creating new vertices, adding edges, and merging graphs) and the program for each sequential step, the graph is mapped to physical nodes using a local greedy method with possible refinement (which adds intermediate layers to make use of data locality). The system also supports fault tolerance using callbacks and re-execution, and possibly speculative scheduling for slow tasks. The authors evaluated their system using a SQL query workload and a data-mining workload, and show near-linear speed-up with the number of nodes as well as the effectiveness of their graph refinement method.
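The three graph-building steps mentioned in the summary (creating vertices, adding edges, merging graphs) can be illustrated with a small sketch. The `Graph` class and its methods below are hypothetical and are not Dryad's C++ graph description language; the three channel kinds are the ones named in the review. The sketch does not check that the result is acyclic.

```python
from dataclasses import dataclass, field

# Channel kinds named in the review; the string labels are assumptions.
CHANNEL_KINDS = {"file", "tcp", "shared_memory"}


@dataclass
class Graph:
    vertices: set = field(default_factory=set)
    edges: dict = field(default_factory=dict)  # (src, dst) -> channel kind

    def add_vertex(self, name):
        self.vertices.add(name)
        return self  # returning self allows chained construction

    def add_edge(self, src, dst, channel="file"):
        if channel not in CHANNEL_KINDS:
            raise ValueError(f"unknown channel kind: {channel}")
        if src not in self.vertices or dst not in self.vertices:
            raise KeyError("both endpoints must be added as vertices first")
        self.edges[(src, dst)] = channel
        return self

    def merge(self, other):
        """Union of two graphs; vertices with the same name are shared."""
        merged = Graph(self.vertices | other.vertices, dict(self.edges))
        merged.edges.update(other.edges)
        return merged


# Two small stages that share the "filter" vertex, merged into one job graph.
scan = (Graph().add_vertex("read").add_vertex("filter")
        .add_edge("read", "filter", "shared_memory"))
agg = (Graph().add_vertex("filter").add_vertex("sum")
       .add_edge("filter", "sum", "tcp"))
job = scan.merge(agg)
print(sorted(job.vertices), job.edges)
```

Keeping the channel kind on the edge rather than on the vertex mirrors the point made in the review that the transport is a property of the communication, chosen per edge.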

I would probably not accept this paper to SOSP. First, although the authors tackled the problem of more complex layered distributed tasks, in most cases this problem can also be solved by using multiple levels of MapReduce. Second, they did not provide a good way to balance system load. Last, the system adds more complexity and is more error prone, especially at large scale. I doubt the scalability of their system and its performance with multiple jobs.

Contributions: The authors introduced a way for developers of distributed systems to have more flexibility in designing their tasks and more control over the data communication of their tasks. They addressed the problem of distributed tasks more complex than simple one-step MapReduce, which is quite useful in applications like distributed databases. They did not fix the channel mechanism, which allows more flexibility for different systems and tasks (as opposed to MapReduce's fixed approach of writing to disk). On the other hand, for channels using shared memory or pipes, failures will propagate to all the previous related nodes. Their experiments on real large-scale clusters are also valuable.

Weaknesses: They did not address the situation in which multiple jobs compete for system resources. Their simple local greedy algorithm for assigning nodes will not work well or balance load in such situations. They added complexity to the programming environment, which can be painful for developers. More importantly, this could introduce many errors, especially in large-scale systems, and these are very difficult to debug. Finally, there is the problem of failures propagating during error handling, as mentioned above.
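The failure-propagation concern can be made concrete with a small sketch. It assumes, as the review states, that data sent over pipes or shared memory is not persisted, so a failed vertex forces its upstream producers on such channels to run again, while a file channel stops the propagation. The function `vertices_to_rerun` and the vertex names are hypothetical, and the sketch models only upstream propagation, not Dryad's actual recovery protocol.

```python
def vertices_to_rerun(failed, edges):
    """Return the set of vertices that must be re-executed.

    `edges` maps (src, dst) -> channel kind. Starting from the failed
    vertex, walk upstream across transient channels only, because their
    data was never written to disk and has to be produced again.
    """
    transient = {"tcp", "shared_memory"}
    rerun = {failed}
    stack = [failed]
    while stack:
        current = stack.pop()
        for (src, dst), kind in edges.items():
            if dst == current and kind in transient and src not in rerun:
                rerun.add(src)
                stack.append(src)
    return rerun


# Hypothetical pipeline: only the first hop is persisted to a file.
edges = {
    ("read", "parse"): "file",
    ("parse", "filter"): "shared_memory",
    ("filter", "sum"): "tcp",
}
rerun_after_sum_fails = vertices_to_rerun("sum", edges)
print(sorted(rerun_after_sum_fails))
```

In this example a failure of the last vertex drags two upstream vertices back into execution, whereas a failure of `parse` would re-run only `parse`, because its input is a file.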

Sriram Subramanian

Dryad is a distributed execution engine that benefits applications that have data parallelism. The application developer constructs a logical graph that describes the flow of data through the system, and this gives the runtime the information necessary to exploit the data parallelism. The developer also provides the code that has to be run in each node (vertex) of the graph. Each vertex can communicate with other vertices by means of files, IPC or the network (and this can be specified by the developer or chosen by the runtime depending on the mapping of logical nodes to physical nodes). As with other systems like Map-Reduce, the user does not have to worry about scheduling, concurrency, or component failure.

The Dryad architecture is completely generic: it doesn't assume anything about the application or impose restrictions on it. A key difference between Dryad and Map-Reduce is that Dryad supports several stages of operations and thus represents a superset of MapReduce. It also supports multiple sources of inputs and outputs. The other major strength of the system is the run-time refinement of the graphs: the runtime engine can detect certain network configurations and may introduce extra layers of indirection that take advantage of data locality in an effort to minimize the network bandwidth consumed during aggregation operations.
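The locality-aware refinement described above can be sketched as a graph rewrite: instead of every input sending directly to the final aggregator, inputs on the same rack first feed a partial aggregator, and only that partial result travels on. The function `refine_aggregation`, the rack map and the vertex names are hypothetical; this is an illustration of the idea, not the engine's refinement code.

```python
from collections import defaultdict


def refine_aggregation(inputs, rack_of, sink="aggregate"):
    """Rewrite 'every input -> sink' into
    'input -> per-rack partial aggregator -> sink'."""
    by_rack = defaultdict(list)
    for vertex in inputs:
        by_rack[rack_of[vertex]].append(vertex)

    edges = []
    for rack in sorted(by_rack):
        partial = f"partial@{rack}"  # one extra vertex inserted per rack
        edges.extend((vertex, partial) for vertex in by_rack[rack])
        edges.append((partial, sink))
    return edges


# Hypothetical placement of four input vertices on two racks.
rack_of = {"in0": "rackA", "in1": "rackA", "in2": "rackB", "in3": "rackB"}
edges = refine_aggregation(["in0", "in1", "in2", "in3"], rack_of)

streams_into_sink_before = len(rack_of)  # each input sent straight to the sink
streams_into_sink_after = sum(1 for _, dst in edges if dst == "aggregate")
print(streams_into_sink_before, streams_into_sink_after)
```

The saving only materializes if the aggregation is associative, so that partial results can be combined later, and if each partial aggregator is actually scheduled on the rack it serves.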

The evaluation of the system is very weak: the system is compared against a single-node SQL Server database. SQL Server isn't the best-performing database on the market, and a comparison against it doesn't really tell us anything. The other benchmark, which simulates a map-reduce type of workload, is also insufficient. The overheads of the system are not brought out in these experiments. Would it not have been possible to compare the execution times and network bandwidth consumption with a Hadoop implementation of the same problem in the same environment? The handling of stragglers and heterogeneous environments is also very vague and incomplete. The paper claims that Dryad could scale to a large setup, but the experiments don't show this either. The requirement that users learn a new language to construct graphs could be a potential barrier to adoption.

SOSP? The work represents an incremental improvement over what Map Reduce offers and doesn't introduce anything conceptually new or novel. The effectiveness of the new architecture isn't evaluated in a systematic manner. This paper isn't of the quality that one expects to see in SOSP.

Mohit Saxena

Dryad: Microsoft's answer to Google's MapReduce. Dryad is a framework for executing coarse-grained data-parallel applications. The framework takes the shape of a directed acyclic graph in which each vertex executes a sequential program and the edges provide the communication channels between these tasks. Dryad borrows ideas from many earlier works, including Condor, MapReduce and parallel databases, and presents a solution to a problem already solved earlier in a much more elegant way (MapReduce). Hence, in my opinion, this work qualifies for EuroSys but not for SOSP.

One of the major contributions of this paper is its scalability. It scales from a multi-core machine to a cluster of thousands of computers in a generic manner. Unlike MapReduce, which focuses on simplicity, Dryad leans strongly towards generality of the framework. Hence, this is both a contribution and a weakness. From a systems perspective, I think it is rather a weakness. The programmability of Dryad is definitely a powerful contribution. Abstracting the underlying gory details in the form of a graph description language empowers the developer with explicit graph construction and refinement to take full advantage of the rich features of the Dryad execution engine.

Leo

Summary:

The paper describes a new framework for developing and executing data-parallel computations that can fit a broad range of use cases. The authors claim that the solution can scale from multiple processors in a single machine to multiple machines in a distributed system with good communication bandwidth. The Dryad framework allows developers to define a distributed application as a directed acyclic graph in which the vertices form the compute machines where the application code will be run and the directed edges are communication channels. The direction of an edge follows the direction of data flow between vertices. Developers can therefore highly customize the flow of data and the number of compute instances using the graph description language that Dryad provides. Once the application is developed, the Dryad framework takes care of scheduling, execution, tolerating communication and machine failures, etc. Also, the description graph can be changed dynamically based on the execution environment.

Vote: Not accepted

Weaknesses:

a) The paper lacks clear reasoning about why Dryad was designed the way it is. Instead, the paper focuses on describing what Dryad is, and does not even describe it completely.

b) Paper organization: A lot of material could be removed to make space for sound design reasoning, and a simpler, more generic example would have served better than a complex SQL query.

c) Weak evaluation: Better evaluations are needed that provide glimpses into Dryad's performance in both absolute and relative (to other distributed programming frameworks) terms.

d) More example applications in sufficient detail that stress the wide applicability of the framework are needed.

Strengths:

a) They have attempted to come up with a distributed computing framework that is more generic in terms of applicability to problems.

b) They have concentrated on making the framework allow customizable data flow across nodes, which is very suitable for data-intensive applications.

Jose Angel Perez Rico

In this paper, the Dryad system is introduced. It is a general-purpose distributed execution engine for coarse-grain data-parallel applications, as presented by the authors. The system is intended to run on anything from a multi-processor machine to a cluster of thousands of machines over a LAN. One machine is chosen as the job manager, where all work is coordinated. It holds the application-specific code that constructs the job's communication graph, and it is responsible for scheduling. It depends principally on two elements: the name server, which exposes the location of each machine in the topology, and a daemon, running on each machine, that acts on behalf of the job manager and monitors state. Most of the paper is devoted to the communication graph, where the structure of a job is defined. It defines how the data should be partitioned, and where and how it should be distributed.

Personally, I think it could be accepted if the work and its objectives were more focused. The whole system appears to be an anthology of various prior systems merged into an integrated solution. Its most significant contributions are the design and implementation of the communication graph, the language defined to use it, and the run-time refinement, which increase the distributed and parallel elements, although there is a trade-off in programming simplicity relative to other works such as Google's MapReduce. Another weak point of the paper is the evaluation section. The evaluation was a comparison between Dryad and a non-distributed system, plus a test on data mining. It would have been more worthwhile to show a comparative evaluation between Dryad and some similar systems, to quantify the contributions.

Zhenxiao Luo

This paper talks about the Dryad project, which is a programming model for writing parallel and distributed programs to scale from a small cluster to a large data-center. A Dryad programmer writes several sequential programs and connects them using one-way channels. The computation is structured as a directed graph: programs are graph vertices, while the channels are graph edges. A Dryad job is a graph generator which can synthesize any directed acyclic graph. These graphs can even change during execution, in response to important events in the computation.

From my point of view, this paper could be accepted. Its contributions lie in the following:

1. Proposed a sound model for data-parallel programming. Although much research has already been conducted in this area, this work approaches it from a different perspective.

2. The Dryad infrastructure provides concurrent execution, reliability, scheduling, and resource management. The system has decent performance, and a number of applications have already been built on top of it.

Yupu Zhang

Weak Reject

This paper presents a new distributed execution engine, Dryad. Developers have to provide a communication graph explicitly to Dryad, where vertices are sequential programs that run on physical machines and edges represent communications between vertices. Then, Dryad automatically schedules vertices to run simultaneously on available machines in the cluster. It hides all sorts of issues from the developer, e.g. load balancing, task scheduling and failure handling. Also, it applies runtime refinement to improve the performance.

In my opinion, the contribution of this paper is just that it shows a more general distributed execution framework that can handle more problems than others (e.g. MapReduce). Developers can control the flow of execution freely in the form of a communication graph. The system can further optimize the graph at runtime to gain more performance.

However, there are some problems in the design and the experiments. First, to a large extent, the success (I mean achieving high performance) of a job running on Dryad depends upon a carefully designed communication graph. This is implied by the evaluation in the paper: the graph for the SQL query already had several optimizations, and the graph used in data mining needed to be rearranged to achieve good scalability and performance. This may be a barrier to applying Dryad widely, because developers have to spend more time learning and thinking about communication graphs. Second, the run-time graph refinement does not seem to happen truly at run time. The second experiment is a good example. From reading the paragraph below Figure 9 and the first point at the end of that page, I feel it was the authors, not Dryad, that tried different refinements to get Figure 10. Lastly, the experimental results are not convincing, because all of the graphs adopted were highly optimized by humans. Also, in the second experiment, they show only numbers but no comparison. It would be better if they could show a comparison between Dryad and MapReduce.