DryadLINQ: a system for general-purpose distributed data-parallel computing using a high-level language
Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep Kumar Gunda, and Jon Currey
Microsoft Research Silicon Valley and Reykjavík University, Iceland
One-line Summary
DryadLINQ is a programming model that runs sequential LINQ programs on the Dryad run-time system to achieve parallelism transparently.
Overview/Main Points
- Overview
- LINQ (Language INtegrated Query) constructs
- a high-level, strongly typed programming language
- Language constructs: select, join, groupby, orderby, where ...
- like writing a sequential program
- imperative + declarative
- deferred execution
- Execution Plan Graph: bridging language and run-time system together
- Dryad run-time system
- job manager
- instantiate a job's EPG
- schedule processes on cluster machines
- failure handling by re-execution of failed or slow processes
- job monitoring and statistics collection
- EPG transformation based on user-supplied policy
- Task scheduler: manages the cluster for job scheduling and execution
- scheduling
- data partition
- fault-tolerance
- failed workers: re-execute
- slow workers: re-execute
- failed job manager: re-start?
- static optimization
- pipelining: multiple operators executed in a single process
- removing redundancy: removal of unnecessary hash- or range-partitions
- eager aggregation: down-stream aggregations moved in front of partitioning operators where possible
- I/O reduction: use TCP pipes and in-memory FIFO channels instead of writing temporary data to disk; compress data before partitioning
- dynamic optimization
- partial aggregation
- change the number of instances / partitions
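The eager and partial aggregation items above share one idea: when the aggregate is associative, each partition can be reduced locally before any data crosses a partition boundary. A minimal Python sketch of that idea follows. It is not DryadLINQ code; the function names and the three-partition word list are hypothetical and exist only for illustration.

```python
from collections import Counter

def partial_counts(partition):
    """Aggregate locally inside one partition, before any data is shuffled."""
    return Counter(partition)

def merge_counts(partials):
    """Combine per-partition results; valid because addition is associative."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts per key
    return total

# Hypothetical input: words already split across three partitions.
partitions = [
    ["a", "b", "a"],
    ["b", "c"],
    ["a", "c", "c"],
]

partials = [partial_counts(p) for p in partitions]
word_counts = merge_counts(partials)

# Records that would cross the partition boundary without and with
# early aggregation.
records_without = sum(len(p) for p in partitions)
records_with = sum(len(c) for c in partials)
```

The saving grows with the number of repeated keys per partition; with all-distinct keys the two record counts are equal and the early step buys nothing.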
- DryadLINQ
- data model: distributed, partitioned implementation of LINQ collections
- stored in a distributed file system, NTFS files, or SQL tables
- three ways to partition: hash, range, and round-robin
- partitioning metadata is kept as part of the collection object
- Annotations: associative, homomorphic, ...
- Data re-partitioning
- HashPartition<T, K>
- RangePartition<T, K>
- Execution Plan Graph (EPG)
- DAG, not a tree
- Vertices: operators
- Edges: data dependencies
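To make the "DAG, not a tree" point concrete, here is a small Python sketch of an execution plan graph. It is an illustration under assumptions, not the DryadLINQ representation: the operator names and the dictionary encoding are hypothetical, and the ordering comes from Python's standard graphlib module rather than from Dryad's job manager.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each vertex (operator) maps to the set of vertices whose output it consumes.
# "scan" feeds two consumers, so the plan is a DAG rather than a tree.
plan = {
    "scan": set(),
    "where": {"scan"},
    "groupby": {"scan"},
    "join": {"where", "groupby"},
}

def run_order(graph):
    """Return an order in which every vertex follows all of its inputs."""
    return list(TopologicalSorter(graph).static_order())

order = run_order(plan)

# Vertices that read the output of "scan"; more than one means shared output.
consumers_of_scan = [v for v, deps in plan.items() if "scan" in deps]
```

Any order that respects the edges is a valid schedule, so "where" and "groupby" may run in either order or in parallel once "scan" has finished.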
Relevance
Flaws