Motivation
The MapReduce execution engine is a popular
choice for big data analytics. However, it is difficult for non-expert users to
tune the system for good performance because of the nature of Hadoop and the
absence of a middleman (administrator). To address this, Starfish, a self-tuning
system for big data analytics built on Hadoop, was introduced. Starfish tries to
solve several tuning problems: job-level MapReduce configuration, cluster
sizing, workflow optimization, workload management, and data layout tuning.
The Elastisizer was developed as a subsystem
of Starfish to solve one of these tuning problems: the cluster sizing problem.
The Elastisizer's goal is to automatically determine the best cluster and Hadoop
configurations to process a given workload, subject to user-specified goals on
completion time and monetary cost.
1. Components in the Starfish architecture
System Workflow
Once a user expresses the workload
(jobs), the search space for cluster resources, the search space for job
configurations, and the performance requirements as a query, she can get an
answer to the cluster sizing problem from the Elastisizer.
First of all, a job can be represented
as j = <p, d1, r1, c1>, where p is the program, d the input data, r the
cluster resources, and c the configuration parameter settings. Generating a
profile from a job is the first stage in solving the cluster sizing problem.
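In code form, a job might be sketched like this (the names and values here are illustrative only, not the actual Starfish representation):

    from collections import namedtuple

    # j = <p, d, r, c>: program, input data, cluster resources, configuration.
    Job = namedtuple("Job", ["program", "data", "resources", "config"])

    j1 = Job(program="wordcount.jar",
             data={"input_gb": 10},
             resources={"nodes": 10, "instance_type": "m1.large"},
             config={"io.sort.mb": 100, "mapred.reduce.tasks": 20})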
A profile of a job
is a vector of fields (cost, cost statistics, dataflow, and dataflow
statistics) that together form a concise summary of the job's execution at the
level of tasks and phases. It can be generated either by collecting information
with BTrace while the job runs, or by estimation using a careful mix of
white-box and black-box models without running the job. The latter is called a
'virtual profile' and is produced by the What-if Engine.
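As a rough sketch, such a profile could be modeled as a record with four field groups (the field names are illustrative, not the actual Starfish schema):

    from dataclasses import dataclass, field

    @dataclass
    class JobProfile:
        # Dataflow: amounts of data moving through each phase.
        dataflow: dict = field(default_factory=dict)        # e.g. {"map_output_mb": 3072}
        # Dataflow statistics: data-dependent ratios.
        dataflow_stats: dict = field(default_factory=dict)  # e.g. {"map_selectivity": 0.3}
        # Cost: time spent in each task phase, in seconds.
        cost: dict = field(default_factory=dict)            # e.g. {"map_phase_secs": 42.0}
        # Cost statistics: resource-dependent rates.
        cost_stats: dict = field(default_factory=dict)      # e.g. {"read_secs_per_mb": 0.02}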
The Elastisizer uses
the What-if Engine in Starfish plus two Enumeration and Optimization Engines
(EOEs), one responsible for cluster resources and one for job configuration
parameters. The What-if Engine builds a virtual profile from possibly
hypothetical inputs such as data, cluster resources, and configuration
settings. If a more complex query specifies search spaces over configurations
and cluster resources, the Elastisizer invokes the EOEs to efficiently
enumerate and search through the high-dimensional space of configuration
parameter settings and find the best-suited virtual profile. The What-if Engine
then uses a Task Scheduler Simulator, along with the virtual profile, to
simulate the scheduling and execution of the job. The virtual job can be
represented as j' = <p, d2, r2, c2>. The output is a description of the
complete (hypothetical) job execution in the cluster.
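A rough sketch of how the pieces might fit together; what_if_cost below is a toy stand-in for the What-if Engine, and the exhaustive loop stands in for the EOEs' much smarter search:

    import itertools

    def what_if_cost(input_mb, resources, config):
        # Toy analytical model: assume runtime is dominated by scanning
        # the input in parallel across all map slots.
        nodes, dollars_per_node_hour = resources
        slots = nodes * config["slots_per_node"]
        hours = input_mb / (slots * config["mb_per_slot_hour"])
        return hours, hours * nodes * dollars_per_node_hour

    def best_choice(input_mb, resource_space, config_space, deadline_hours):
        # Enumerate <r2, c2> pairs; keep the cheapest that meets the deadline.
        best = None
        for resources, config in itertools.product(resource_space, config_space):
            hours, dollars = what_if_cost(input_mb, resources, config)
            if hours <= deadline_hours and (best is None or dollars < best[0]):
                best = (dollars, resources, config)
        return best

    print(best_choice(51_200,
                      resource_space=[(5, 0.34), (10, 0.34), (20, 0.68)],
                      config_space=[{"slots_per_node": 2, "mb_per_slot_hour": 6_000},
                                    {"slots_per_node": 4, "mb_per_slot_hour": 5_000}],
                      deadline_hours=1.0))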
How to estimate fields in a virtual profile
- Cost Statistics
To estimate the cost
statistics fields, the system uses a relative black-box model that maps cost
statistics observed on one cluster resource type to those on another. When the
source and target cluster resources differ, the job is run on both resource
types and the profiles of those two runs are generated by direct measurement.
Training samples are then made by generating a separate task profile for each
task run in each of these two jobs. Once the training samples are generated,
many supervised learning techniques are available for building the black-box
model; since cost statistics are real-valued, the M5 tree model is chosen.
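A minimal sketch of that training step with scikit-learn, using an ordinary regression tree as a stand-in (scikit-learn has no M5 model, and the feature layout here is an assumption):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    # Each row: cost statistics from a task profile measured on cluster
    # resource type res1; each target: the matching statistic on res2.
    X_res1 = np.array([[4.1, 0.8], [3.9, 0.9], [7.2, 1.5], [6.8, 1.4]])
    y_res2 = np.array([5.0, 4.8, 9.1, 8.7])

    # M5 fits a tree with linear models at its leaves; a plain regression
    # tree approximates the same real-valued mapping.
    model = DecisionTreeRegressor(max_depth=2).fit(X_res1, y_res2)

    # Predict a cost statistic on the hypothetical cluster from a profile
    # measured on the source cluster.
    print(model.predict([[4.0, 0.85]]))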
- Dataflow Statistics
The What-if Engine does not have detailed statistical information
about the input data of the hypothetical job. Instead, it makes a dataflow
proportionality assumption: the logical dataflow sizes through the job's
phases are proportional to the input data size. This means the dataflow
statistics fields in the virtual profile will be the same as those in the
profile of the job given as input. The assumption can be overridden by
providing the dataflow statistics fields directly.
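A small numeric example of the proportionality assumption (the numbers are made up):

    # Measured profile: 10 GB of input produced 3 GB of map output.
    map_selectivity = 3 / 10          # dataflow statistic, a ratio: 0.3

    # Hypothetical job with 50 GB input: the ratio is copied into the
    # virtual profile, and the absolute dataflow is recomputed from it.
    predicted_map_output_gb = map_selectivity * 50   # 15 GB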
- Dataflow and Cost
The dataflow fields
can be calculated using a detailed set of mathematical (white-box) models,
given the dataflow statistics and the configuration parameter settings of the
hypothetical job. The cost fields are then calculated with a second set of
mathematical models that combine the estimated cost statistics with the
dataflow fields.
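A toy calculation in the same spirit (the formulas and constants are simplified assumptions, not the actual Starfish models):

    # Dataflow: derive phase sizes from dataflow statistics + configuration.
    input_mb = 51_200                 # hypothetical input (d2)
    num_map_tasks = input_mb // 128   # assuming 128 MB input splits -> 400 tasks
    map_selectivity = 0.3             # dataflow statistic from the profile
    map_output_mb = input_mb * map_selectivity        # a dataflow field

    # Cost: combine the dataflow fields with estimated cost statistics.
    read_secs_per_mb = 0.02           # cost statistic (estimated above)
    cpu_secs_per_mb = 0.05            # cost statistic (estimated above)
    per_task_mb = input_mb / num_map_tasks
    map_task_secs = per_task_mb * (read_secs_per_mb + cpu_secs_per_mb)  # a cost field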
Pros
- No need to modify Hadoop.
- No overhead when profiling is turned off.
- Supports unmodified MapReduce programs written in Java,
Python, Ruby, and C++.
- Easy to use (the system provides an intuitive
GUI).
User Interface (from Starfish Homepage)
Cons and Questions
- Does not support network topology. (But this
limitation could be removed.)
- There is no
information on how long it takes to compute the answer to a query. Is it fast
enough?
- The
training samples must have good coverage of the prediction space, and the time
to generate the complete set of training samples must be small. How can the
user choose such samples easily?
- The user needs to understand the system in detail (fields, configuration parameters, etc.).
- Users may have to create new
models for their own big data. Is that easy?
Check this out