Wednesday, September 26, 2012

No One (Cluster) Size Fits All: Automatic Cluster Sizing for Data-intensive Analytics


Motivation
The MapReduce execution engine is a popular choice for big data analytics. However, it is difficult for non-expert users to tune the system for good performance, both because of the nature of Hadoop and because there is often no middleman (administrator) to do the tuning. To address this, Starfish, a self-tuning system for big data analytics built on Hadoop, was introduced. Starfish tries to solve several tuning problems, such as job-level MapReduce configuration, cluster sizing, workflow optimization, workload management, and data layout tuning.

The Elastisizer was developed as a subsystem of Starfish to solve one of these tuning problems: the cluster sizing problem. The Elastisizer's goal is to automatically determine the best cluster and Hadoop configurations to process a given workload, subject to user-specified goals on completion time and monetary cost.

Figure 1. Components in the Starfish architecture

System Workflow
Once a user expresses the workload (jobs), the space of cluster resources, the space of job configurations, and the performance requirements as a query, the Elastisizer answers the cluster sizing problem for her.

First of all, a job can be represented as j = &lt;p, d1, r1, c1&gt;, where p is the program, d1 the input data, r1 the cluster resources, and c1 the configuration parameter settings. Generating a profile from a job is the first stage in solving the cluster sizing problem.
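As a rough illustration (field names here are made up, not Starfish's actual API), the paper's job abstraction j = &lt;p, d, r, c&gt; could be sketched as:

```python
from dataclasses import dataclass

# Hypothetical sketch of the job abstraction j = <p, d, r, c>.
# Field names and example values are illustrative only.
@dataclass(frozen=True)
class Job:
    program: str     # p: the MapReduce program
    data: dict       # d: properties of the input data
    resources: dict  # r: cluster resources
    config: dict     # c: Hadoop configuration parameter settings

j1 = Job(program="wordcount.jar",
         data={"size_gb": 50},
         resources={"node_type": "m1.large", "nodes": 10},
         config={"mapred.reduce.tasks": 40})
```

A virtual job j' = &lt;p, d2, r2, c2&gt; is then just another such tuple with hypothetical data, resources, and configuration.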

A profile of a job is a vector of fields (cost, cost statistics, dataflow, and dataflow statistics) that together form a concise summary of the job's execution at the level of tasks and phases. It can be generated either by collecting information with BTrace while the job runs, or by estimation, without running the job, using a careful mix of white-box and black-box models. The latter is called a 'virtual profile' and is produced by the What-if Engine.

The Elastisizer uses the What-if Engine in Starfish and two Enumeration and Optimization Engines (EOEs), responsible for cluster resources and job configuration parameters respectively. The What-if Engine builds a virtual profile from possibly hypothetical inputs such as data, cluster resources, and configuration settings. If a more complex query specifies search spaces over configurations and cluster resources, the Elastisizer invokes the EOEs to efficiently enumerate and search through the high-dimensional space of configuration parameter settings and find the best-suited virtual profile. The What-if Engine then uses a Task Scheduler Simulator, along with the virtual profile, to simulate the scheduling and execution of the job. The virtual job can be represented as j' = &lt;p, d2, r2, c2&gt;. The output is a description of the complete (hypothetical) job execution in the cluster.
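The overall search loop can be sketched as follows. This is a toy stand-in, not Starfish's actual engine: the cost model and the deadline constraint are invented for illustration.

```python
import itertools

def what_if_cost(profile, resources):
    # Toy stand-in for the What-if Engine: total work is split evenly
    # across nodes, plus a small per-node startup overhead (0.1 h).
    runtime_h = profile["work_h"] / resources["nodes"]
    cost = (runtime_h + 0.1) * resources["nodes"] * resources["price_per_node_h"]
    return runtime_h, cost

def best_plan(profile, resource_space, config_space, deadline_h):
    # EOE-style loop: enumerate (resources, configuration) candidates,
    # ask the what-if model about each, keep the cheapest feasible one.
    feasible = []
    for res, cfg in itertools.product(resource_space, config_space):
        runtime_h, cost = what_if_cost(profile, res)
        if runtime_h <= deadline_h:          # performance requirement
            feasible.append((cost, res, cfg))
    return min(feasible, key=lambda t: t[0])

profile = {"work_h": 100.0}
resources = [{"nodes": n, "price_per_node_h": 0.4} for n in (5, 10, 20)]
configs = [{"mapred.reduce.tasks": r} for r in (20, 40)]
cost, res, cfg = best_plan(profile, resources, configs, deadline_h=12.0)
# 5 nodes misses the deadline; 10 nodes beats 20 on cost.
```

The real system replaces the toy cost model with simulation over the virtual profile, but the shape of the search is the same: enumerate, predict with what-if analysis, pick the best plan under the user's goals.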
  
How to estimate the fields in a virtual profile
-      Cost statistics
To estimate the cost statistics field, the system uses a relative black-box model. When the cluster resources differ between two runs of the job, the profiles for those two jobs can be generated by direct measurement. Training samples can also be made by generating a separate task profile for each task run in each of these two jobs. Once the training samples are generated, many supervised learning techniques are available for building the black-box model; since cost statistics are real-valued, the M5 tree model is chosen for estimating the cost statistics field.
-      Dataflow statistics
The What-if Engine does not have detailed statistical information about the input data of a hypothetical job. Instead, it makes a dataflow proportionality assumption: the logical dataflow sizes through the job's phases are proportional to the input data size. This means the dataflow statistics fields in the virtual profile will be the same as those in the profile of the job given as input. The assumption can be overridden by providing the dataflow statistics fields directly.
-      Dataflow and cost
The dataflow fields can be calculated using a detailed set of mathematical models (white-box models), given the dataflow statistics and the configuration parameter settings of the hypothetical job. The cost fields are then calculated with a second set of mathematical models that combine the estimated cost statistics and dataflow fields.
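Two of these estimation steps can be sketched with made-up numbers (the per-phase fields and cost coefficients below are illustrative, not the paper's actual models): first scale the measured dataflow by the ratio of hypothetical to measured input size (the proportionality assumption), then combine dataflow with per-byte cost statistics in a simple white-box fashion.

```python
def scale_dataflow(measured, measured_input_bytes, new_input_bytes):
    # Dataflow proportionality assumption: every per-phase dataflow
    # size scales linearly with the input data size.
    ratio = new_input_bytes / measured_input_bytes
    return {phase: size * ratio for phase, size in measured.items()}

def estimate_cost(dataflow, cost_per_byte):
    # Simple white-box combination: bytes moved in each phase times
    # the (black-box-estimated) cost per byte for that phase.
    return sum(dataflow[phase] * cost_per_byte[phase] for phase in dataflow)

measured = {"map_output_bytes": 2e9, "shuffle_bytes": 1e9}
virtual = scale_dataflow(measured, measured_input_bytes=4e9,
                         new_input_bytes=8e9)   # doubles every field
cost_stats = {"map_output_bytes": 2e-9, "shuffle_bytes": 5e-9}
predicted = estimate_cost(virtual, cost_stats)
```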

Pros
-      No need to modify Hadoop.
-      No overhead when profiling is turned off.
-      Supports unmodified MapReduce programs in Java, Python, Ruby, and C++.
-      Easy to use (the system gives us an intuitive GUI).
   User Interface (from Starfish Homepage)

Cons and Questions
-      Does not account for network topology. (But this limitation can be removed.)
-      There is no information on how long it takes to answer a query. Is it fast enough?
-      The training samples must have good coverage of the prediction space, and the time to generate the complete set of training samples must be small. How does the user choose the samples easily?
-      Users need to understand the system in detail, such as the profile fields, configuration parameters, etc.
-      Users may need to create new models for their own big data. Is that easy?

Check this out
-     We may understand the paper more easily if we understand Starfish first. http://www.cs.duke.edu/starfish/

Monday, September 17, 2012

DryadLINQ


! Motivation
: Microsoft developed the Dryad platform, a general-purpose execution engine for data-parallel applications. Although Dryad has several powerful advantages for distributed data-parallel computing, developers still have to understand how to instruct Dryad, which makes Dryad programs hard to write. Developers want the simplicity of programming at a higher level of abstraction, and this is why DryadLINQ emerged. DryadLINQ combines two technologies: Dryad, described above, and LINQ (Language INtegrated Query), which enables developers to write and debug their applications in a SQL-like query language while relying on the entire .NET library for programming with datasets.

! Main Idea
: DryadLINQ is a system and a set of language extensions that enable a new programming model for writing large-scale data-parallel applications running on large PC clusters. To make developers' lives easier, DryadLINQ automatically translates the data processing expressed in LINQ into an optimized execution plan for the Dryad system, so developers do not need to care about the details of the Dryad platform (how to parallelize the data flow, how the plan is executed by the job manager, etc.). DryadLINQ optimizes the job graph with both static and dynamic optimizations, which focus on minimizing disk and network I/O.
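As a rough analogy only (DryadLINQ queries are written in .NET languages, not Python), the kind of declarative dataset query involved, and the staged plan a compiler like DryadLINQ derives from it, might look like this:

```python
from collections import Counter

def word_count(lines):
    # Declarative pipeline; in LINQ terms roughly
    # SelectMany -> GroupBy -> Count.
    words = (w for line in lines for w in line.split())
    return Counter(words)

# Instead of running this locally, a compiler like DryadLINQ would turn
# such a query into a Dryad job graph, e.g.:
#   stage 1: split lines into words on each input partition
#   stage 2: hash-partition by word across the cluster
#   stage 3: count each group
counts = word_count(["a b a", "b c"])
```

The point is that the developer writes only the declarative query; the translation into a distributed, staged job graph is automatic.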

Since DryadLINQ builds on LINQ, a set of .NET constructs, it inherits benefits from LINQ and .NET:
- DryadLINQ programs may be written in any .NET language, such as C#, VB, or F#.
- Objects in DryadLINQ datasets can be of any .NET type; this makes it easy to compute with data such as image patches, vectors, and matrices.
- Programs can be written as imperative or declarative operations on datasets within a traditional high-level programming language, using an expressive data model of strongly typed .NET objects.
- By leveraging other LINQ providers such as PLINQ, DryadLINQ can also parallelize the sequential code on each machine to exploit multiple cores.
- LINQ's strong static typing is extremely valuable when programming large-scale computations. It is much easier to debug compilation errors in Visual Studio than run-time errors in the cluster.
- Shared objects can be referenced and read freely and will be automatically serialized and distributed where necessary.

! Weakness
- Since DryadLINQ does not check or enforce the absence of side effects, all functions called in DryadLINQ expressions must be side-effect free.
- Checking the correctness of a program is not difficult, but performance debugging remains hard. Moreover, the paper gives little information about debugging beyond user comments and opinions.
- Though the Apply operator can be helpful for beginners, it can reduce the system's ability to make high-level program transformations.
- DryadLINQ is very inefficient for algorithms that are naturally expressed using random accesses.
- Due to restrictions of Dryad and LINQ, DryadLINQ is not suitable for some kinds of workloads, such as tasks requiring low latency.
- Dynamically generated assembly code must be shipped to cluster machines, which may increase network overhead.
- C++ is usually faster than C# (managed code).
- The experimental evaluation was conducted only on a medium-sized private cluster of 240 computers. I think this is not enough to evaluate DryadLINQ.