DryadLINQ, Distributed Computing Made Easy
Dryad is an infrastructure that runs sequential programs in parallel on a cluster of computers or on a data center. The parallel computation is organized as a directed acyclic graph where the programs are the graph’s vertices while the edges are the communication channels between the programs as shown in the figure below:
The graph simply shows the relationship between programs, where the input comes from and where output is directed to. The graph must be acyclic to avoid program scheduling deadlocks. The Job Manager (JM) holds the graph and schedules the execution of a program when its input channels are ready and when there is a computer available. JM gets an available computer from the Name Server (NS) and schedules the program for running through a daemon (D). The communication channels between programs (vertices) are represented by files, shared memory, or TCP pipes. The graph can be changed dynamically during runtime and is fault tolerant. The entire graph can be run on a single system for debugging purposes. Dryad is already used in production by Microsoft for their AdCenter.
DryadLINQ is “a compiler which translates LINQ programs to distributed computations which can be run on a PC cluster”. This process is achieved through the following:
- C# and LINQ data objects become distributed partitioned files.
- LINQ queries become distributed Dryad jobs.
- C# methods become code running on the vertices of a Dryad job.
DryadLINQ has the following features:
- Declarative programming: computations are expressed in a high-level language similar to SQL
- Automatic parallelization: from sequential declarative code the DryadLINQ compiler generates highly parallel query plans spanning large computer clusters. For exploiting multi-core parallelism on each machine DryadLINQ relies on the PLINQ parallelization framework.
- Integration with Visual Studio: programmers in DryadLINQ take advantage of the comprehensive VS set of tools: Intellisense, code refactoring, integrated debugging, build, source code management.
- Integration with .Net: all .Net libraries, including Visual Basic, and dynamic languages are available.
- Type safety: distributed computations are statically type-checked.
- Automatic serialization: data transport mechanisms automatically handle all .Net object types.
- Job graph optimizations
- static: a rich set of term-rewriting query optimization rules is applied to the query plan, optimizing locality and improving performance.
- dynamic: run-time query plan optimizations automatically adapt the plan taking into account the statistics of the data set processed.
Useful links: Paper: Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks, Presentation: Dryad, Paper: DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language, DryadLINQ program examples (PDF).
Yoni Goldberg Oct 30, 2014
Dmytro Svarytsevych Oct 30, 2014