research-article

Major Technical Advancements in Apache Hive

Authors:
Yin Huai

The Ohio State University, Columbus, OH, USA

Ashutosh Chauhan

Hortonworks Inc., Palo Alto, CA, USA

Alan Gates

Hortonworks Inc., Palo Alto, CA, USA

Gunther Hagleitner

Hortonworks Inc., Palo Alto, CA, USA

Eric N. Hanson

Microsoft, Redmond, WA, USA

Owen O'Malley

Hortonworks Inc., Palo Alto, CA, USA

Jitendra Pandey

Hortonworks Inc., Palo Alto, CA, USA

Yuan Yuan

The Ohio State University, Columbus, OH, USA

Rubao Lee

The Ohio State University, Columbus, OH, USA

Xiaodong Zhang

The Ohio State University, Columbus, OH, USA

SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, June 2014, Pages 1235–1246
https://doi.org/10.1145/2588555.2595630

Published: 18 June 2014

ABSTRACT

Apache Hive is a widely used data warehouse system for Apache Hadoop, and has been adopted by many organizations for various big data analytics applications. Working closely with many users and organizations, we have identified several shortcomings of Hive in its file formats, query planning, and query execution, which are key factors determining the performance of Hive. To ensure that Hive continues to meet the demands of processing increasingly high volumes of data in a scalable and efficient way, we have set two goals related to storage and runtime performance in our efforts to advance Hive. First, we aim to maximize the effective storage capacity and to accelerate data accesses to the data warehouse by updating the existing file formats. Second, we aim to significantly improve cluster resource utilization and runtime performance of Hive by developing a highly optimized query planner and a highly efficient query execution engine. In this paper, we present a community-based effort on technical advancements in Hive. Our performance evaluation shows that these advancements provide significant improvements in storage efficiency and query execution performance. This paper also shows how academic research lays a foundation for Hive to improve its daily operations.

Index Terms

Major Technical Advancements in Apache Hive
1. Information systems
  1. Data management systems
    1. Database management system engines

Published in
SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data
June 2014
1645 pages
ISBN:9781450323765
DOI:10.1145/2588555
General Chairs: Curtis Dyreson (Utah State University, USA), Feifei Li (University of Utah, USA)
Program Chair: M. Tamer Özsu (University of Waterloo, Canada)
Copyright © 2014 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
data warehouse
databases
hadoop
hive
mapreduce
Qualifiers
- research-article
Acceptance Rates
SIGMOD '14 Paper Acceptance Rate: 107 of 421 submissions, 25%
Overall Acceptance Rate: 785 of 4,003 submissions, 20%