! Motivation
:
Microsoft developed Dryad, a general-purpose execution engine for data-parallel applications.
Although Dryad offers powerful support for distributed data-parallel computing,
developers must understand how to instruct it directly, which makes Dryad programs
hard to write. Developers therefore want simpler programming at a higher level of
abstraction, and this is why DryadLINQ emerged. DryadLINQ combines two technologies:
Dryad, described above, and LINQ (Language INtegrated Query), which lets developers
write and debug their applications in a SQL-like query language, relying on the
entire .NET library for programming with datasets.
! Main Idea
:
DryadLINQ is a system and a set of language extensions that enable a new
programming model for writing large-scale data-parallel applications running on
large PC clusters. To make developers' lives easier, DryadLINQ automatically
translates the data processing expressed in LINQ into an optimized execution plan
for Dryad. Developers therefore do not need to care about the details of the Dryad
platform (how to parallelize the data flow, how the job manager builds and runs
the plan, and so on). DryadLINQ optimizes the job graph with both static and
dynamic optimizations, which focus on minimizing disk and network I/O.
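To make the idea of translating one query into a distributed plan concrete, here is a minimal sketch. It is written in Python rather than C#, and it is not DryadLINQ's API: the names `local_count`, `merge_counts`, and `run_query` are hypothetical, and the two-stage plan is only my assumption about what a simple "filter, group, count" query could turn into. Each partition is summarized locally, and only the small summaries are merged, which is the kind of rewrite that reduces network I/O.

```python
from collections import Counter
from functools import reduce


def local_count(partition, predicate, key):
    """Stage 1 (hypothetical): runs independently on each partition."""
    return Counter(key(x) for x in partition if predicate(x))


def merge_counts(partials):
    """Stage 2 (hypothetical): combines the small per-partition summaries."""
    return reduce(lambda a, b: a + b, partials, Counter())


def run_query(partitions, predicate, key):
    # In a real cluster each call to local_count would run on a different
    # machine; here the partitions are simply processed in a loop.
    partials = [local_count(p, predicate, key) for p in partitions]
    return merge_counts(partials)


# Two partitions standing in for data stored on two machines.
partitions = [
    ["dryad", "linq", "dryad", "a"],
    ["linq", "plinq", "dryad"],
]

# "Count words longer than one character, grouped by the word itself."
result = run_query(partitions, predicate=lambda w: len(w) > 1, key=lambda w: w)
print(dict(result))
```

The caller only states the predicate and the grouping key. How the work is split into stages is decided by `run_query`, which plays the role that the query translator plays in the description above.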
Since DryadLINQ builds on LINQ, a set of .NET constructs, it inherits the benefits
of both LINQ and .NET:
- DryadLINQ programs may be written in any .NET language, such as C#, VB, or F#.
- Objects in DryadLINQ datasets can be of any .NET type; this makes it easy to
compute with data such as image patches, vectors, and matrices.
- Programs can be written as imperative or declarative operations on datasets
within a traditional high-level programming language, using an expressive data
model of strongly typed .NET objects.
- By leveraging other LINQ providers such as PLINQ, DryadLINQ can parallelize
sequential code to take advantage of multi-core machines.
- LINQ's strong static typing is extremely valuable when programming large-scale
computations: it is much easier to debug compilation errors in Visual Studio than
run-time errors in the cluster.
- Shared objects can be referenced and read freely and will be automatically
serialized and distributed where necessary.
! Weaknesses
- All functions called in DryadLINQ expressions must be side-effect free, but
DryadLINQ does not check or enforce this, so the burden falls on the programmer.
- Checking the correctness of a program is not difficult, but performance
debugging remains a challenge. Moreover, the paper gives little information about
debugging beyond users' comments and opinions.
- Although the Apply operator can be helpful for beginners, it can reduce the
system's ability to make high-level program transformations.
- DryadLINQ is very inefficient for algorithms that are naturally expressed using
random accesses.
- Due to the restrictions of Dryad and LINQ, DryadLINQ is not suitable for some
kinds of workloads, such as tasks requiring low latency.
- Dynamically generated assembly code must be shipped to the cluster computers,
which may increase network overhead.
- C++ is usually faster than C# (managed code).
- The experimental evaluation is conducted only on a medium-sized private cluster
of 240 computers. I think this is not enough to evaluate DryadLINQ.
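The first weakness, the unchecked side-effect requirement, is easier to see with a small example. The sketch below is in Python, not C#, and uses a hypothetical runner, `run_partition`, that I made up for illustration. It assumes only that an engine may execute the same partition more than once, for example when retrying after a failure. A pure function gives the same answer either way, while a function that updates shared state silently records the wrong total.

```python
seen = 0  # shared mutable state, updated as a side effect


def impure_square(x):
    """Returns x squared, but also counts calls in a global variable."""
    global seen
    seen += 1
    return x * x


def pure_square(x):
    """Returns x squared and touches nothing else."""
    return x * x


def run_partition(fn, partition, attempts):
    """Hypothetical runner: re-executes the partition `attempts` times,
    as an engine might after a failure, and keeps the last result."""
    out = []
    for _ in range(attempts):
        out = [fn(x) for x in partition]
    return out


data = [1, 2, 3]

pure_once = run_partition(pure_square, data, attempts=1)
pure_retry = run_partition(pure_square, data, attempts=2)
impure_out = run_partition(impure_square, data, attempts=2)

print(pure_once == pure_retry)  # the pure function is unaffected by retries
print(seen)                     # the counter saw every attempt, not every item
```

Nothing in this sketch raises an error: the returned list is still correct and only the hidden counter is wrong. This is why an unchecked requirement of this kind can be hard to debug on a cluster.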
Good descriptions of pros and cons; however, you could pick a smaller number of points and go deeper. For example, C++ vs. C# is off-topic, I think.