Thursday, July 25, 2013

A First Look at Inter-Data Center Traffic Characteristics via Yahoo! Datasets

Authors 
Yingying Chen, Sourabh Jain, Vijay Kumar Adhikari, Zhi-Li Zhang (UMN) 
Kuai XU (ASU)

! Keep changing. 

- Motivation
Nowadays many IT companies have built their own data centers on wide distributed area to provide services (main, news, messenger, game...) and give better experience to their customers(clients) by making the data close to them. For this, the data should keep being moved and replicated through multiple data centers depending on the clients' needs (low latency, availability and so on). This paper tries to find the characteristics of traffics between Data Centers (inter-data center-D2D) so that data center can be designed and managed with lower optical cost. There are few researches for the inner-data (traffic within a single data center), but a little for the traffic characteristic of inter-data center.

-  How it works
They collected a NetFlow dataset from the Yahoo! data centers located in Dallas (DAX), Washington DC (DCP), Palo Alto (PAO), Hong Kong (HK), and the United Kingdom (UK). Each record includes 1) a time stamp, 2) source and destination IP addresses and transport-layer port numbers, 3) source and destination interfaces on the router, 4) the IP protocol, and 5) the number of bytes and packets exchanged. They built their own novel heuristics to identify Yahoo! IP addresses and infer their locations from the anonymized NetFlow dataset.
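As a rough sketch (my own modeling, not the authors' code or the exact dataset schema), a record with the fields listed above could look like this in Python:

```python
# Minimal sketch of one NetFlow record with the fields described above.
# Field names and the example values are illustrative only.
from dataclasses import dataclass

@dataclass
class FlowRecord:
    timestamp: float      # time stamp of the flow
    src_ip: str           # source IP (anonymized in the dataset)
    dst_ip: str           # destination IP (anonymized in the dataset)
    src_port: int         # transport-layer source port
    dst_port: int         # transport-layer destination port
    in_interface: int     # source interface on the border router
    out_interface: int    # destination interface on the border router
    ip_protocol: int      # e.g. 6 for TCP, 17 for UDP
    num_bytes: int        # number of bytes in the flow
    num_packets: int      # number of packets exchanged

record = FlowRecord(1279324800.0, "10.1.0.5", "66.94.1.9", 51234, 80, 3, 7, 6, 48213, 52)
```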

- Traffic Characteristics
D2C traffic : the traffic exchanged between Yahoo! servers and clients.
D2D traffic : the traffic exchanged between Yahoo! servers at different locations.
transit D2C and D2D traffic : D2C or D2D traffic that the border router at a given location carries on behalf of other locations.
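As a toy sketch of how flows could be bucketed into these categories, given which anonymized prefixes belong to Yahoo! servers at each location (the prefixes and the rules below are placeholders, not the paper's actual inference heuristics):

```python
# Toy classification of a flow seen at one location's border router into
# D2C / D2D / transit categories. Prefix sets and rules are illustrative only.
LOCAL_SERVERS = {"10.1."}               # hypothetical Yahoo! servers at this location
REMOTE_SERVERS = {"10.2.", "10.3."}     # hypothetical Yahoo! servers at other locations

def is_server(ip, prefixes):
    return any(ip.startswith(p) for p in prefixes)

def classify(src_ip, dst_ip):
    src_local, dst_local = is_server(src_ip, LOCAL_SERVERS), is_server(dst_ip, LOCAL_SERVERS)
    src_remote, dst_remote = is_server(src_ip, REMOTE_SERVERS), is_server(dst_ip, REMOTE_SERVERS)
    if (src_local and dst_remote) or (src_remote and dst_local):
        return "D2D"                 # local servers talking to servers elsewhere
    if src_local or dst_local:
        return "D2C"                 # local servers talking to clients
    if src_remote and dst_remote:
        return "transit D2D"         # carried here on behalf of other locations
    if src_remote or dst_remote:
        return "transit D2C"
    return "other"

print(classify("10.1.0.5", "66.94.1.9"))   # D2C
print(classify("10.2.0.5", "10.3.0.9"))    # transit D2D
```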

DAX -> roughly 50% D2C traffic, thanks to replication and efficient placement;
              20% D2D traffic, which should be reduced;
              25% transit D2C traffic, which should also be reduced;
              and very little transit D2D traffic.

D2C -> Some services are strongly correlated (email, messenger, and so on), so strongly correlated services should be hosted in the same data center to reduce the D2D traffic between them.

D2D ->
1) D2D traffic triggered by background batch jobs for maintenance (replication, placement, and so on) dominates the aggregate D2D traffic. It shows no specific time-of-day trend (small variance).
2) D2D traffic triggered by D2C traffic increases or decreases depending on the time of day, i.e., it follows the pattern of client usage (large variance).

- Pros
. Uses real data collected from Yahoo!'s data centers, so the findings are credible.
. A first look at inter-data center traffic, which gives us some baseline knowledge of it.

- Cons
. As the authors note, the results are specific to Yahoo!'s data centers, so companies that provide different types of services (e.g., Amazon) may see different traffic patterns.

- Note for me
I am researching widely distributed file (storage) systems, especially across data centers (D2D). I read this paper to understand which traffic characteristics I should consider when building a file system that spans data centers.

Monday, November 19, 2012

c-Through: Part-time Optics in Data Centers

- Motivation
Nowadays applications commonly handle data-intensive workloads such as those generated by Hadoop and Dryad. Problems can arise when an application uses large amounts of data distributed across different server racks in a data center, i.e., the traditional hierarchical tree-structured Ethernet topology can become a bottleneck. There are solutions to this problem, such as fat trees, DCell, and BCube, but they require a large number of links and switches and complex structured wiring, and expanding the network after construction is challenging. So this paper proposes a hybrid prototype architecture, c-Through, which mixes the advantages of an optical circuit-switched network with the traditional hierarchy of a packet-switched network. The goal of the paper is to address the fundamental feasibility and applicability questions of a hybrid data center network.

- System Architecture
Optical circuit switching provides much higher bandwidth, but a circuit can only connect one pair of endpoints at a time and reconfiguring the pairing takes time. The paper gives experimental evidence, e.g., "only a few ToRs are hot and most of their traffic goes to a few other ToRs," to explain why optical circuits are still worth considering for this problem. It proposes a hybrid packet- and circuit-switched data center network that combines the fast switching of traditional packet switching with the high bandwidth of an optical circuit-switched network. So in this system, the traditional packet network remains, and the optical network connects pairs of racks instead of connecting all of the servers.

- How it works
To operate both the optical and the electrical network, each end host (server) runs a monitoring program in its kernel that estimates traffic demand simply by enlarging the output buffer limits of its sockets and observing how they fill. The optical manager receives this traffic information from every server. Given the resulting cross-rack traffic matrix, the optical manager determines how to connect the server racks with optical paths in order to maximize the amount of traffic offloaded to the optical network.
Each server makes the multiplexing decision itself using two virtual interfaces, VLAN-s and VLAN-c, that map to the electrical and optical networks. Every time the optical network is reconfigured, the servers are informed and the de-multiplexer on each server tags packets with the appropriate VLAN ID.
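The rack-pairing decision is essentially a maximum-weight matching over the cross-rack traffic matrix; here is a minimal sketch of that idea (my own illustration using networkx, not c-Through's code):

```python
# Minimal sketch of the optical manager's rack-pairing step: pick disjoint
# rack pairs that maximize the traffic offloaded to the optical circuits.
import networkx as nx

def pick_optical_circuits(traffic):
    """traffic: dict mapping (rack_a, rack_b) -> estimated demand in bytes."""
    g = nx.Graph()
    for (a, b), demand in traffic.items():
        if a != b:
            prev = g.get_edge_data(a, b, {"weight": 0})["weight"]
            g.add_edge(a, b, weight=prev + demand)   # fold both directions into one edge
    # Each rack has one optical port, so circuits form a matching; maximize total weight.
    return nx.max_weight_matching(g)

demands = {("rack1", "rack2"): 9e9, ("rack2", "rack3"): 4e9, ("rack3", "rack4"): 7e9}
print(pick_optical_circuits(demands))   # e.g. {('rack1', 'rack2'), ('rack3', 'rack4')}
```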

- Related work
There are two similar works: Flyways, which de-congests data center networks, and Helios. Helios explores a similar hybrid electrical/optical data center architecture and is very close to c-Through. The main difference between the two systems is where the traffic estimation and demultiplexing features are implemented: c-Through implements them in the end hosts (servers), while Helios does so in the switches. The supercomputing community has also tried to use circuit-switched optics, but with a substantially different goal: supercomputers use node-to-node circuit switching, whereas c-Through uses rack-to-rack circuits. The paper argues that its approach is more appropriate for a commodity data center because c-Through deliberately amortizes the potentially high cost of optical ports.

- Advantages
1. The existing networking system does not need to change. Moreover, no applications need to be modified, because the system modifies the kernel instead, although this can also be a disadvantage, as I note in the criticism below.
2. It uses both electrical and optical switching at the same time, based on an analysis of the traffic flowing between racks.
3. Several experiments show good performance.
4. As the authors mention, even though the current system design is not necessarily the best way to reach the goal, it demonstrates fundamental feasibility and opens up valuable research topics.

- Criticism
1. Is it safe for a server to use kernel memory for buffering? Can c-Through be applied to any kind of server system, i.e., can the kernel always be modified? Since the paper says the system is appropriate for commodity data centers, we should consider these questions.
2. If there are very many servers in a data center, do the optical managers become a problem, and is the design scalable?
3. The paper gives results from simulated/emulated experiments instead of a real system. This can make us hesitate to adopt it, even though the paper says the system works well with several data-intensive applications such as Hadoop and MPI.
4. I am not sure it is worthwhile to use hybrid packet and circuit switching all the time, since many current data centers seem to handle today's data-intensive applications reasonably well without it. We need to weigh the new complexity of the hybrid structure against the traditional complexity of many links and switches, i.e., which design handles the problems more efficiently?

- Conclusions
I think such a hybrid design is a compromise for this transitional period of the technology, and a leading network technology will emerge from experimental designs like this in the near future.
The paper takes a new approach to problems that arise under specific conditions, namely the current network architecture and data-intensive applications. Optical and electrical networks each have their own pros and cons: optics is good for bandwidth and electrical switching is good for low latency. By combining the strengths of both, the paper introduces a new type of hybrid architecture. It is not easy to conclude right now that this hybrid architecture is the best way forward or better than previous architectures. However, one thing is clear: the new architecture and approach give us a very valuable research topic and a different way of thinking about the problem.

Wednesday, November 14, 2012

Hive - A Petabyte Scale Data Warehouse Using Hadoop


- Motivation
Facebook has to handle a huge amount of data every day for applications that may need large computing power, such as analyzing petabytes of data. So they decided to use Hadoop, a software library that allows distributed processing of large data sets across clusters of computers using simple programming models, to handle their data in a scalable and reliable way. However, Hadoop does not provide an explicit structured data processing framework. So the basic reason they built Hive, with its SQL-like language HiveQL, is to make Hadoop easy to use for those who are not familiar with it, by generating map/reduce jobs from HiveQL.

-  System Architecture

Hive runs on top of Hadoop. By providing interfaces to communicate with Hadoop, Hive can hide the complicated pipelining of multiple map-reduce jobs from programmers to make their lives easier. With the SQL-like language, programmers can write both simple and complicated queries without spending huge effort on analyzing or optimizing the map-reduce jobs themselves.
- Metastore – stores the system catalog and metadata about tables, columns, partitions, etc.
- Driver – manages the lifecycle of a HiveQL statement and maintains a session handle.
- Query Compiler – compiles HiveQL into a directed acyclic graph of map/reduce tasks.
- Execution Engine – interacts with Hadoop and executes the tasks produced by the compiler.
- HiveServer – provides a Thrift interface and a JDBC/ODBC server.
- Clients – the command line interface (CLI) and the web UI.
- Extensible interfaces – include the SerDe and ObjectInspector interfaces.

- How it works
The basic idea of Hive is to provide users with a SQL-like language, HiveQL. A program written in HiveQL is entered through the CLI or the web UI and sent to the query compiler, which compiles it into map-reduce jobs; the execution engine then runs those jobs on Hadoop.
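As a rough illustration (my own example, not one from the paper), the HiveQL below expresses a word-count-style GROUP BY, and the Python functions sketch the map and reduce tasks such a query conceptually compiles into:

```python
# A hypothetical HiveQL query (table and column names are made up):
HIVEQL = """
SELECT word, COUNT(1) AS cnt
FROM words
GROUP BY word
"""

# Rough sketch of the map/reduce tasks the query compiler would conceptually
# produce for this GROUP BY aggregation (not Hive's actual generated plan).
from collections import defaultdict

def map_task(rows):
    for row in rows:
        yield row["word"], 1            # emit (group-by key, partial count)

def reduce_task(key, values):
    return key, sum(values)             # aggregate the partial counts per key

def run(rows):
    groups = defaultdict(list)
    for k, v in map_task(rows):         # the shuffle phase groups values by key
        groups[k].append(v)
    return [reduce_task(k, vs) for k, vs in groups.items()]

print(run([{"word": "hive"}, {"word": "hadoop"}, {"word": "hive"}]))
# [('hive', 2), ('hadoop', 1)]
```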

- Related work
They mention Scope, an SQL-like language on top of Microsoft's proprietary Cosmos map/reduce system, and Pig, which allows users to write declarative scripts to process data. The main difference from them is that Hive provides a system catalog, the Metastore, which is used for data exploration, query optimization, and query compilation. Even though Hive seems strongly influenced by these related systems, the authors do not describe its relationship to them in much detail.

- Advantages
1. The first SQL-like system on top of Hadoop.
2. Supports not only primitive data types but also custom data types through SerDe.
3. Actively worked on by Facebook, so the system will likely keep improving.
4. As an open-source system, it can be improved by the community.
5. By supporting SQL syntax, it can integrate with existing commercial BI tools.
6. I think this system could be reused by other systems by replacing the Driver component.

- Criticism
1. They do not give an exact comparison with competing systems, with analysis or benchmarks; they only give their own result that the system works about 20% better than another system. How can we estimate and trust its performance, and how long does a whole map/reduce job take? We need such information to decide whether to use this system or not.
2. Can any of the components cause a performance issue? How long does each component (the compiler, the Thrift server, etc.) take, and has Facebook faced any problems with this system?
3. The optimizer is rule-based only and does not support cost-based optimization. Moreover, programmers have to provide query hints to use MAPJOIN on small tables and to get the two-stage map-reduce plan for GROUP BY aggregates when the group-by columns have highly skewed data.
4. Some operations are not supported, such as INSERT INTO, UPDATE, and DELETE.
5. This paper may not have been written for an academic purpose. It seems to focus more on giving examples of HiveQL queries and on introducing Hive to a general audience.

- Conclusions
Actually, I think there is not much new in this system; the parser, graph generator, and optimizer ideas already exist. Hive gives us an easy way to use Hadoop without huge effort, but for the best performance we may still need to learn to write and run our own map/reduce jobs. However, by giving users SQL-like query expressions, the system provides many benefits to programmers and non-programmers who want to use Hadoop but are not familiar with it. Although the paper is not well written and we cannot get detailed information about Hive from it, the system seems likely to improve a lot. The authors are interested in and working on optimizing Hive and subsuming more SQL syntax, and it is now an open-source project. So if problems such as the limited optimizer and limited SQL expressiveness are solved, the system should become more popular and widely accepted.

Wednesday, September 26, 2012

No One (Cluster) Size Fits All: Automatic Cluster Sizing for Data-intensive Analytics


Motivation
The MapReduce execution engine is a popular choice for big data analytics. However, it is difficult for non-expert users to tune the system for good performance, because of the nature of Hadoop and the absence of a middle man (administrator). So Starfish, a self-tuning system for big data analytics built on Hadoop, was introduced. Starfish tries to solve several tuning problems: job-level MapReduce configuration, cluster sizing, workflow optimization, workload management, and data layout tuning.

The Elastisizer was developed as a subsystem of Starfish to solve one of these tuning problems, the cluster sizing problem. Its goal is to automatically determine the best cluster and Hadoop configurations for processing a given workload, subject to user-specified goals on completion time and monetary cost.

[Figure 1: Components in the Starfish architecture]

System Workflow
Once a user expresses the workload (jobs), the search space for cluster resources, the search space for job configurations, and the performance requirements as a query, the Elastisizer gives her the answer to the cluster sizing problem.

First of all, a job is represented as j = <p, d1, r1, c1>, where p is the program, d the input data, r the cluster resources, and c the configuration parameter settings. Generating a profile from a job is the first stage in solving the cluster sizing problem.

A profile of a job is a vector of fields (cost, cost statistics, dataflow, and dataflow statistics) that together form a concise summary of the job's execution at the level of tasks and phases. It can be generated by collecting information with BTrace while the job runs, or estimated without running the job using a careful mix of white-box and black-box models. The latter is called a 'virtual profile' and is produced by the What-if Engine.

The Elastisizer uses the What-if Engine in Starfish and two Enumeration and Optimization Engines (EOEs), responsible for cluster resources and job configuration parameters respectively. The What-if Engine builds a virtual profile from possibly hypothetical inputs such as data, cluster resources, and configuration settings. If a more complex query specifies search spaces over configurations and cluster resources, the Elastisizer invokes the EOEs to efficiently enumerate and search the high-dimensional space of configuration parameter settings and find the most suitable virtual profile. The What-if Engine then uses a Task Scheduler Simulator, along with the virtual profile, to simulate the scheduling and execution of the job. The hypothetical job can be represented as j' = <p, d2, r2, c2>. The output is a description of the complete (hypothetical) execution of the job on the cluster.
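A minimal Python sketch of the j = <p, d, r, c> abstraction and the shape of a what-if call (all names and the crude runtime model are my own, not the Starfish/Elastisizer API):

```python
# Toy sketch of the j = <p, d, r, c> abstraction and a what-if style estimate.
from dataclasses import dataclass

@dataclass
class Job:
    program: str          # p: the MapReduce program
    data_size_gb: float   # d: stand-in for the input data description
    cluster: dict         # r: cluster resources, e.g. {"nodes": 10, "type": "m1.large"}
    config: dict          # c: Hadoop configuration parameter settings

def what_if(measured_profile, hypothetical_job):
    """Estimate the runtime of a hypothetical job from a measured profile.

    A real What-if Engine builds a full virtual profile and simulates task
    scheduling; here we only scale the measured runtime by data size and
    node count to show the shape of the interface.
    """
    scale_data = hypothetical_job.data_size_gb / measured_profile["data_size_gb"]
    scale_nodes = measured_profile["nodes"] / hypothetical_job.cluster["nodes"]
    return measured_profile["runtime_s"] * scale_data * scale_nodes

measured = {"data_size_gb": 100.0, "nodes": 10, "runtime_s": 3600.0}
candidate = Job("sort.jar", 200.0, {"nodes": 20, "type": "m1.large"}, {"io.sort.mb": 200})
print(what_if(measured, candidate))   # 3600.0: twice the data on twice the nodes
```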
  
How to estimate fields in virtual profile
- Cost statistics
To estimate the cost statistics fields, the system uses a relative black-box model. If the cluster resources for the hypothetical job differ from those of the profiled job, profiles for jobs run on the two resource types are generated by direct measurement. Training samples are then made by generating a separate task profile for each task run in each of these two jobs. Once the training samples are generated, many supervised learning techniques are available for building the black-box model; since cost statistics are real-valued, the M5 tree model is chosen.
- Dataflow statistics
The What-if Engine does not have detailed statistical information about the input data of the hypothetical job. Instead, it makes a dataflow proportionality assumption: the logical dataflow sizes through the job's phases are proportional to the input data size. This means the dataflow statistics fields in the virtual profile will be the same as those in the profile of the job given as input. The assumption can be overridden by providing the dataflow statistics fields explicitly.
- Dataflow and cost
The dataflow fields are calculated with a detailed set of mathematical (white-box) models, given the dataflow statistics and the configuration parameter settings of the hypothetical job. The cost fields are calculated with a second set of mathematical models that combine the estimated cost statistics and dataflow fields.
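A tiny illustration of the dataflow proportionality assumption described above (my own sketch, not Starfish code): dataflow sizes through each phase are scaled by the ratio of the hypothetical input size to the measured input size.

```python
# Toy illustration of the dataflow proportionality assumption: logical dataflow
# through each phase is assumed proportional to the input data size.
def scale_dataflow(measured_dataflow, measured_input_bytes, hypothetical_input_bytes):
    """measured_dataflow: dict of phase name -> bytes observed in the profiled job."""
    ratio = hypothetical_input_bytes / measured_input_bytes
    return {phase: size * ratio for phase, size in measured_dataflow.items()}

profiled = {"map_output": 40e9, "shuffle": 40e9, "reduce_output": 5e9}
print(scale_dataflow(profiled, measured_input_bytes=100e9, hypothetical_input_bytes=250e9))
# map_output and shuffle become 100e9, reduce_output becomes 12.5e9
```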

Pros
- No need to modify Hadoop.
- No overhead when profiling is turned off.
- Supports unmodified MapReduce programs in Java, Python, Ruby, and C++.
- Easy to use (the system gives us an intuitive GUI).
[Screenshot: user interface, from the Starfish homepage]

Cons and Questions
- Does not take network topology into account (but this limitation could be removed).
- There is no information about how long it takes to get the result of a query. Is it fast enough?
- The training samples must have good coverage of the prediction space, and the time to generate the complete set of training samples must be small. How does the user choose the samples easily?
- The user needs to understand the system in detail (fields, configuration parameters, etc.).
- Users may need to create new models for their own big data. Is that easy?

Check this out
- We may understand this paper more easily if we understand Starfish first: http://www.cs.duke.edu/starfish/

Monday, September 17, 2012

DryadLINQ


! Motivation
: Microsoft developed the Dryad platform, a general-purpose execution engine for data-parallel applications. Although Dryad has several powerful advantages for distributed data-parallel computing, developers have to understand how to instruct Dryad, and this makes Dryad programs hard to write. Developers want the simplicity of programming at a higher level of abstraction, and this is why DryadLINQ emerged. DryadLINQ combines two technologies: Dryad, described above, and LINQ (Language INtegrated Query), which enables developers to write and debug their applications in a SQL-like query language, relying on the entire .NET library and programming with datasets.

! Main Idea
: DryadLINQ is a system and a set of language extensions that enable a new programming model for writing large-scale data-parallel applications that run on large PC clusters. To make developers' lives easier, DryadLINQ automatically translates the data-processing parts of a LINQ program into an optimized execution plan for the Dryad system, so developers do not need to care about the details of the Dryad platform (how to parallelize the data flow, how the plan is carried out by the job manager, and so on). DryadLINQ optimizes the job graph with both static and dynamic optimizations, which focus on minimizing disk and network I/O.
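DryadLINQ itself is a .NET technology, so purely as a rough analogy in Python (none of these names come from the real API), the point is that the programmer writes a declarative, LINQ-style pipeline over a dataset and the system, not the programmer, decides how to turn it into a distributed execution plan:

```python
# Rough Python analogy of a lazily-built, LINQ-style declarative pipeline.
# DryadLINQ is a .NET system; this is only an illustration of the idea.
from dataclasses import dataclass
from typing import Callable, Iterable, List

@dataclass
class Query:
    ops: List[Callable[[Iterable], Iterable]]   # the pipeline, built lazily

    def where(self, pred):
        return Query(self.ops + [lambda xs: (x for x in xs if pred(x))])

    def select(self, fn):
        return Query(self.ops + [lambda xs: (fn(x) for x in xs)])

    def run(self, data):
        # In DryadLINQ this is the point where the query would be compiled
        # into a Dryad job graph and shipped to the cluster; here we just
        # evaluate the pipeline locally.
        for op in self.ops:
            data = op(data)
        return list(data)

q = Query([]).where(lambda n: n % 2 == 0).select(lambda n: n * n)
print(q.run(range(10)))   # [0, 4, 16, 36, 64]
```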

Since DryadLINQ builds on LINQ, a set of .NET constructs, it inherits benefits from LINQ and .NET:
- DryadLINQ programs may be written in any .NET language, such as C#, VB, or F#.
- Objects in DryadLINQ datasets can be of any .NET type; this makes it easy to compute with data such as image patches, vectors, and matrices.
- Programs can be written as imperative or declarative operations on datasets within a traditional high-level programming language, using an expressive data model of strongly typed .NET objects.
- By leveraging other LINQ providers such as PLINQ, sequential code can be parallelized to exploit multiple cores.
- LINQ's strong static typing is extremely valuable when programming large-scale computations: it is much easier to debug compilation errors in Visual Studio than run-time errors in the cluster.
- Shared objects can be referenced and read freely and will be automatically serialized and distributed where necessary.

! Weakness
- Since DryadLINQ does not check or enforce the absence of side effects, all functions called in DryadLINQ expressions must be side-effect free.
- Checking the correctness of a program is not difficult, but performance debugging still awaits us; moreover, the paper gives little information about debugging beyond users' comments and opinions.
- Although the Apply operator can be helpful for beginners, it can reduce the system's ability to make high-level program transformations.
- DryadLINQ is very inefficient for algorithms that are naturally expressed using random accesses.
- Due to the restrictions of Dryad and LINQ, DryadLINQ is not suitable for some kinds of workloads, such as tasks requiring low latency.
- Dynamically generated assembly code has to be shipped to the cluster computers, which may increase network overhead.
- C++ is usually faster than C# (managed code).
- The experimental evaluation was conducted on only a medium-sized private cluster of 240 computers, which I think is not enough to evaluate DryadLINQ.