BT
x Your opinion matters! Please fill in the InfoQ Survey about your reading habits!

DryadLINQ, Distributed Computing Made Easy

by Abel Avram on May 07, 2009 |

Dryad and DryadLINQ are two Microsoft Research projects which facilitate large data processing on computer clusters or data centers for the C# developer.

Dryad is an infrastructure that runs sequential programs in parallel on a cluster of computers or on a data center. The parallel computation is organized as a directed acyclic graph where the programs are the graph’s vertices while the edges are the communication channels between the programs as shown in the figure below:

dryad

The graph simply shows the relationship between programs, where the input comes from and where output is directed to. The graph must be acyclic to avoid program scheduling deadlocks. The Job Manager (JM) holds the graph and schedules the execution of a program when its input channels are ready and when there is a computer available. JM gets an available computer from the Name Server (NS) and schedules the program for running through a daemon (D). The communication channels between programs (vertices) are represented by files, shared memory, or TCP pipes. The graph can be changed dynamically during runtime and is fault tolerant. The entire graph can be run on a single system for debugging purposes. Dryad is already used in production by Microsoft for their AdCenter.

DryadLINQ is “a compiler which translates LINQ programs to distributed computations which can be run on a PC cluster”. This process is achieved through the following:

  • C# and LINQ data objects become distributed partitioned files.
  • LINQ queries become distributed Dryad jobs.
  • C# methods become code running on the vertices of a Dryad job.

DryadLINQ has the following features:

  • Declarative programming: computations are expressed in a high-level language similar to SQL
  • Automatic parallelization: from sequential declarative code the DryadLINQ compiler generates highly parallel query plans spanning large computer clusters. For exploiting multi-core parallelism on each machine DryadLINQ relies on the PLINQ parallelization framework.
  • Integration with Visual Studio: programmers in DryadLINQ take advantage of the comprehensive VS set of tools: Intellisense, code refactoring, integrated debugging, build, source code management.
  • Integration with .Net: all .Net libraries, including Visual Basic, and dynamic languages are available.
  • Type safety: distributed computations are statically type-checked.
  • Automatic serialization: data transport mechanisms automatically handle all .Net object types.
  • Job graph optimizations
    • static: a rich set of term-rewriting query optimization rules is applied to the query plan, optimizing locality and improving performance.
    • dynamic: run-time query plan optimizations automatically adapt the plan taking into account the statistics of the data set processed.

Useful links: Paper: Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks, Presentation: Dryad, Paper: DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language, DryadLINQ program examples (PDF).

Hello stranger!

You need to Register an InfoQ account or or login to post comments. But there's so much more behind being registered.

Get the most out of the InfoQ experience.

Tell us what you think

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread
Community comments

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread

Discuss

Educational Content

General Feedback
Bugs
Advertising
Editorial
InfoQ.com and all content copyright © 2006-2014 C4Media Inc. InfoQ.com hosted at Contegix, the best ISP we've ever worked with.
Privacy policy
BT