DOI: 10.1145/2588555.2595630
research-article

Major Technical Advancements in Apache Hive

Published: 18 June 2014

ABSTRACT

Apache Hive is a widely used data warehouse system for Apache Hadoop, and has been adopted by many organizations for various big data analytics applications. Working closely with many users and organizations, we have identified several shortcomings of Hive in its file formats, query planning, and query execution, which are key factors determining the performance of Hive. To ensure that Hive continues to meet the requirements of processing increasingly high volumes of data in a scalable and efficient way, we have set two goals related to storage and runtime performance in our efforts to advance Hive. First, we aim to maximize the effective storage capacity and to accelerate data accesses to the data warehouse by updating the existing file formats. Second, we aim to significantly improve cluster resource utilization and runtime performance of Hive by developing a highly optimized query planner and a highly efficient query execution engine. In this paper, we present a community-based effort on technical advancements in Hive. Our performance evaluation shows that these advancements provide significant improvements in storage efficiency and query execution performance. This paper also shows how academic research lays a foundation for Hive to improve its daily operations.

Published in

SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data
June 2014
1645 pages
ISBN: 9781450323765
DOI: 10.1145/2588555

Copyright © 2014 ACM

Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

SIGMOD '14 paper acceptance rate: 107 of 421 submissions, 25%
Overall acceptance rate: 785 of 4,003 submissions, 20%
