Monday, September 17, 2012

DryadLINQ


! Motivation
: Microsoft developed the Dryad platform (an execution engine) as a general-purpose runtime for data-parallel applications. Although Dryad has several powerful advantages for distributed data-parallel computing, developers must understand how to instruct Dryad directly, and this makes Dryad programs hard to write. Developers therefore want a simpler programming model at a higher level of abstraction. This is why DryadLINQ emerged. DryadLINQ combines two technologies: Dryad, described above, and LINQ (Language INtegrated Query), which lets developers write and debug their applications in a SQL-like query language, relying on the entire .NET library for programming with datasets.

! Main Idea
: DryadLINQ is a system and a set of language extensions that enable a new programming model for writing large-scale data-parallel applications running on large PC clusters. To make developers’ lives easier, DryadLINQ automatically translates data processing expressed in LINQ into an optimized execution plan for the Dryad system. Developers therefore do not need to care about the details of the Dryad platform (how to parallelize the data flow, how the job manager builds and runs the plan, and so on). DryadLINQ optimizes the job graph with both static and dynamic optimizations, which focus on minimizing disk and network I/O.

Because DryadLINQ builds on LINQ, a set of .NET constructs, it inherits benefits from both LINQ and .NET:
- DryadLINQ programs may be written in any .NET language, such as C#, VB, or F#.
- Objects in DryadLINQ datasets can be of any .NET type; this makes it easy to compute with data such as image patches, vectors, and matrices.
- Programs can be written as imperative or declarative operations on datasets within a traditional high-level programming language, using an expressive data model of strongly typed .NET objects.
- By leveraging other LINQ providers such as PLINQ, DryadLINQ can parallelize sequential code on each machine to exploit multiple cores.
- LINQ’s strong static typing is extremely valuable when programming large-scale computations. It is much easier to debug compilation errors in Visual Studio than run-time errors in the cluster.
- Shared objects can be referenced and read freely and will be automatically serialized and distributed where necessary.
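
As a rough illustration of the idea of turning one logical query into a distributed plan, here is a minimal sketch in Python. It is not DryadLINQ or the .NET API: the function names (`partial_count`, `merge_counts`, `run_query`) are hypothetical, and the "cluster" is simulated by a list of in-memory partitions. It only shows how a word-count query might be split into a per-partition stage and a merge stage, so that only small partial results would need to cross the network.

```python
from collections import Counter
from typing import Iterable, List


def partial_count(partition: Iterable[str]) -> Counter:
    """Stage 1: count words within a single partition (one per machine)."""
    counts: Counter = Counter()
    for line in partition:
        counts.update(line.split())
    return counts


def merge_counts(partials: Iterable[Counter]) -> Counter:
    """Stage 2: combine the small per-partition results into one answer."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)
    return total


def run_query(partitions: List[List[str]]) -> Counter:
    # A real engine would schedule stage 1 on many machines; here the
    # partitions are processed one after another to stay self-contained.
    return merge_counts(partial_count(p) for p in partitions)


# Hypothetical input: three partitions of text lines.
partitions = [["a b a"], ["b c"], ["a"]]
result = run_query(partitions)
```

The caller only ever invokes `run_query`; how the work is divided among partitions is hidden, which is the kind of simplicity the review attributes to DryadLINQ.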

! Weakness
- All functions called in DryadLINQ expressions must be side-effect free, but DryadLINQ does not check or enforce the absence of side effects, so the responsibility falls on the programmer.
- Checking the correctness of a program is not difficult, but performance debugging remains a challenge. Moreover, the paper gives little information about debugging beyond user comments and opinions.
- Although the Apply operator can be helpful for beginners, it can reduce the system’s ability to make high-level program transformations.
- DryadLINQ is very inefficient for algorithms that are naturally expressed using random access.
- Due to the restrictions of Dryad and LINQ, DryadLINQ is not suitable for some kinds of workloads, such as tasks requiring low latency.
- Dynamically generated assembly code must be shipped to the cluster computers, which may increase network overhead.
- C++ is usually faster than C# (managed code).
- The experimental evaluation was conducted only on a medium-sized private cluster of 240 computers. I think this is not enough to evaluate DryadLINQ.
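
To make the side-effect point above concrete, here is a small Python simulation. It is an assumption-laden sketch, not DryadLINQ behavior: `Numberer`, `run_local`, and `run_partitioned` are hypothetical names, and "shipping a function to a worker" is imitated with `copy.deepcopy`. It shows why a function that mutates state can give different answers once the data is partitioned, while a pure function does not.

```python
import copy
from typing import Any, Callable, List, Tuple


class Numberer:
    """Side-effecting callable: mutates its own counter on every call."""

    def __init__(self) -> None:
        self.next_id = 0

    def __call__(self, record: str) -> Tuple[int, str]:
        tagged = (self.next_id, record)
        self.next_id += 1
        return tagged


def to_upper(record: str) -> str:
    """Side-effect free: the output depends only on the input."""
    return record.upper()


def run_local(fn: Callable[[str], Any], records: List[str]) -> List[Any]:
    """Apply fn to every record on a single machine."""
    return [fn(r) for r in records]


def run_partitioned(fn: Callable[[str], Any],
                    partitions: List[List[str]]) -> List[Any]:
    """Apply fn per partition, giving each simulated worker its own copy."""
    out: List[Any] = []
    for part in partitions:
        worker_fn = copy.deepcopy(fn)  # stands in for serializing fn to a worker
        out.extend(worker_fn(r) for r in part)
    return out


records = ["a", "b", "c", "d"]
partitions = [["a", "b"], ["c", "d"]]

pure_local = run_local(to_upper, records)
pure_dist = run_partitioned(to_upper, partitions)
impure_local = run_local(Numberer(), records)
impure_dist = run_partitioned(Numberer(), partitions)
```

In this simulation the pure function yields the same list either way, while the stateful one restarts its counter on each simulated worker, so the ids are no longer unique. Nothing in the pipeline reports the difference, which is the risk when the absence of side effects is not checked.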

1 comment:

  1. Please convert this page to English if you can.
    Good descriptions of pros and cons -- however, you can pick a smaller number and go deeper. For example, C++ vs. C# is off-topic I think.
