- Motivation
Facebook has to handle a huge amount of data every day for its applications, and some tasks, such as analyzing petabytes of data, need large computing power. So they decided to use Hadoop, a software library that allows distributed processing of large data sets across clusters of computers using simple programming models, to handle this data in a scalable and reliable way. However, Hadoop does not provide an explicit framework for processing structured data. So the basic reason they built Hive with its SQL-like language, HiveQL, is to make Hadoop easy to use for people who are not familiar with it, by generating map/reduce jobs from HiveQL.
- System Architecture
Hive runs on top of Hadoop. By providing interfaces to communicate with Hadoop, Hive hides the complicated pipelining of multiple map-reduce jobs from programmers and makes their lives easier. With the SQL-like language, programmers can write both simple and complicated queries without spending much effort on analyzing or optimizing the map-reduce jobs. The main components are:
• Metastore – stores the system catalog and metadata about tables, columns, partitions, etc.
• Driver – manages the lifecycle of a HiveQL statement. It also maintains a session handle.
• Query Compiler – compiles HiveQL into a directed acyclic graph of map/reduce tasks.
• Execution Engine – interacts with Hadoop and executes the tasks produced by the compiler.
• HiveServer – provides a Thrift interface and a JDBC/ODBC server.
• CLI and web UI – the command line interface and the web UI.
• Extensible Interfaces – include the SerDe and ObjectInspector interfaces.
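The Query Compiler and Execution Engine roles in the list above can be illustrated with a toy sketch. It assumes a hypothetical plan with made-up task names (they are not taken from the paper) and only shows the general idea of running a directed acyclic graph of tasks in dependency order; it is not how Hive itself is implemented.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical plan: each task maps to the set of tasks it depends on.
# The task names are illustrative only.
plan = {
    "scan_a": set(),
    "scan_b": set(),
    "join": {"scan_a", "scan_b"},
    "group_by": {"join"},
    "write_output": {"group_by"},
}

def run_plan(plan):
    """Return the task names in an order that respects every dependency."""
    executed = []
    for task in TopologicalSorter(plan).static_order():
        # A real execution engine would submit a Hadoop job here.
        executed.append(task)
    return executed

order = run_plan(plan)
print(order)
```

Each task appears only after everything it depends on, which is the property an execution engine needs from the compiler's graph.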
- How it works
The basic idea of Hive is to provide users with a SQL-like language, HiveQL. A program written in HiveQL is entered through the CLI or the web UI and sent to the query compiler, which compiles it into map-reduce jobs. The execution engine then runs these jobs on Hadoop.
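To make the compilation step more concrete, here is a minimal single-process sketch of how a query such as `SELECT userid, COUNT(*) FROM status_updates GROUP BY userid` could be expressed as one map phase and one reduce phase. The table name, the rows, and the function names are assumptions made for illustration, and real Hive plans are considerably more involved.

```python
from collections import defaultdict

# Hypothetical rows of a table status_updates(userid, status).
rows = [(1, "a"), (2, "b"), (1, "c"), (3, "d"), (1, "e")]

def map_phase(rows):
    """Emit one (key, 1) pair per input row, keyed by the GROUP BY column."""
    for userid, _status in rows:
        yield userid, 1

def shuffle(pairs):
    """Group the emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the values of each key to get COUNT(*) per group."""
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(map_phase(rows)))
print(counts)
```

The point of HiveQL is that the user writes only the one-line query and never has to write the three functions by hand.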
- Related work
The authors mention Scope, an SQL-like language on top of Microsoft’s proprietary Cosmos map/reduce, and Pig, which allows users to write declarative scripts to process data. The main difference is that Hive provides a system catalog, the Metastore, which is used for data exploration, query optimization, and query compilation. Even though Hive seems to be heavily influenced by these related systems, the authors do not describe how it relates to them.
- Advantages
1. It is the first SQL-like system built on Hadoop.
2. It supports not only primitive data types but also customizable data types through SerDe.
3. Facebook is actively working on it, so the system is likely to keep improving.
4. It is open source, so anyone can improve it.
5. By supporting SQL syntax, the system can integrate with existing commercial BI tools.
6. I think this system could be used on top of other systems by replacing the Driver component.
- Criticism
1. They do not give an exact comparison with competing systems, backed by analysis or benchmarks. They only give their own result that the system works 20% better than other systems. How can we estimate and trust its performance, and how long does a single map/reduce task take as a whole? We need such information to decide whether to use this system.
2. Can any of the components cause a performance problem? How long does each component, such as the compiler or the Thrift server, take to operate? Has Facebook faced any problems with this system?
3. The optimizer is only rule-based and does not support cost-based optimization. Moreover, programmers have to provide query hints to use “MAPJOIN” on small tables and to use 2-stage map-reduce for “GROUP BY” aggregates where the group-by columns have highly skewed data.
4. Some operations, such as INSERT INTO, UPDATE and DELETE, are not supported.
5. This paper may not have been written for an academic purpose. It seems to focus on giving examples of HiveQL queries and is more likely meant to introduce Hive to a general audience.
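For readers unfamiliar with the “MAPJOIN” hint mentioned in criticism 3, the underlying idea is a map-side join: the small table is loaded into an in-memory hash table, and the large table is streamed past it, so no reduce phase is needed for the join. The sketch below shows that idea only; the tables, values, and function name are hypothetical, and it is not Hive's implementation.

```python
def map_side_join(large_rows, small_rows):
    """Join two lists of (key, value) rows by key without a reduce phase."""
    # Build an in-memory lookup from the small table.
    lookup = {}
    for key, value in small_rows:
        lookup.setdefault(key, []).append(value)
    # Stream the large table and probe the lookup for each row (inner join).
    for key, value in large_rows:
        for small_value in lookup.get(key, []):
            yield key, value, small_value

# Hypothetical data: a small dimension table and a larger fact table.
small = [(1, "US"), (2, "KR")]
large = [(1, "click"), (2, "view"), (1, "view"), (3, "click")]

joined = list(map_side_join(large, small))
print(joined)
```

This only pays off when the small table really fits in memory, which is why a rule-based optimizer without statistics needs the programmer to supply the hint.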
- Conclusions
Actually, I think there is nothing new in this system, such as in its parser, graph generator or optimizer ideas. Although Hive gives us an easy way to use Hadoop without much effort, we may still need to learn to write and run our own map/reduce jobs for the best performance. However, by giving users SQL-like query expressions, the system provides many benefits to programmers and non-programmers who want to use Hadoop but are not familiar with it. Although this paper is not well written and we cannot get detailed information about Hive from it, the Hive system seems likely to improve a lot. The authors are interested in and working on optimizing Hive and subsuming more SQL syntax. Moreover, it is now an open-source system. So if problems such as the limited optimizer and limited SQL expressions are solved, the system is likely to become more popular and widely accepted.
Excellent evaluation. I agree with you - it is not a great or insightful paper.