skip to main content
10.1145/2452376.2452388acmotherconferencesArticle/Chapter ViewAbstractPublication PagesedbtConference Proceedingsconference-collections
research-article

Eagle-eyed elephant: split-oriented indexing in Hadoop

Published:18 March 2013Publication History

ABSTRACT

An increasingly important analytics scenario for Hadoop involves multiple (often ad hoc) grouping and aggregation queries with selection predicates over a slowly changing dataset. These queries are typically expressed via high-level query languages such as Jaql, Pig, and Hive, and are used either directly for business-intelligence applications or to prepare the data for statistical model building and machine learning. In such scenarios it has been increasingly recognized that, as in classical databases, techniques for avoiding access to irrelevant data can dramatically improve query performance. Prior work on Hadoop, however, has simply ported classical techniques to the MapReduce setting, focusing on record-level indexing and key-based partition elimination. Unfortunately, record-level indexing only slightly improves overall query performance, because it does not minimize the number of mapper "waves", which is determined by the number of processed splits. Moreover, key-based partitioning requires data reorganization, which is usually impractical in Hadoop settings. We therefore need to re-envision how data access mechanisms are defined and implemented. To this end, we introduce the Eagle-Eyed Elephant (E3) framework for boosting the efficiency of query processing in Hadoop by avoiding accesses of data splits that are irrelevant to the query at hand. Using novel techniques involving inverted indexes over splits, domain segmentation, materialized views, and adaptive caching, E3 avoids accessing irrelevant splits even in the face of evolving workloads and data. Our experiments show that E3 can achieve up to 20x cost savings with small to moderate storage overheads.

References

  1. A. Abouzeid et al. HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads. PVLDB, 2(1):922--933, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. F. N. Afrati and J. D. Ullman. Optimizing joins in a map-reduce environment. In EDBT, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. K. S. Beyer, V. Ercegovac, R. Gemulla, A. Balmin, M. Y. Eltabakh, C.-C. Kanne, F. Özcan, and E. J. Shekita. Jaql: A scripting language for large scale semistructured data analysis. PVLDB, 4(12):1272--1283, 2011. http://code.google.com/p/jaql.Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. S. Blanas, J. M. Patel, V. Ercegovac, J. Rao, E. J. Shekita, and Y. Tian. A comparison of join algorithms for log processing in MapReduce. In SIGMOD, pages 975--986, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. G. Cormode, M. Garofalakis, P. J. Haas, and C. Jermaine. Synopses for massive data: Samples, histograms, wavelets, sketches. Foundations and Trends in Databases, 4(1-3):1--294, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. G. B. Dantzig. Discrete-variable extremum problems. Oper. Res., 5(2):266--288, 1957.Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. In OSDI, pages 137--150, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. J. Dittrich, J.-A. Quiané-Ruiz, A. Jindal, Y. Kargin, V. Setty, and J. Schad. Hadoop++: Making a yellow elephant run like a cheetah (without it even noticing). PVLDB, 3(1):518--529, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. J. Dittrich, J.-A. Quiané-Ruiz, S. Richter, S. Schuh, A. Jindal, and J. Schad. Only aggressive elephants are fast elephants. PVLDB, 5(11):1591--1602, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. M. Y. Eltabakh, Y. Tian, F. Özcan, R. Gemulla, A. Krettek, and J. McPherson. CoHadoop: Flexible data placement and its exploitation in Hadoop. PVLDB, 4(9):575--585, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, 1979. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. S. W. Golomb. Run-length encodings. IEEE Trans. Inform. Theor., 10(3):399--401, 1966. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. The Apache Hadoop Project. http://hadoop.apache.org/core/, 2009.Google ScholarGoogle Scholar
  14. D. Hilley. Cloud computing: A taxonomy of platform and infrastructure-level offerings. Technical Report GIT-CERCS-09-1, Georgia Institute of Technology, 2009.Google ScholarGoogle Scholar
  15. O. Ibarra and C. Kim. Fast approximation algorithms for the knapsack and sum of subset problems. Journal of the ACM, 22:463--468, 1975. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. D. Jiang, B. C. Ooi, L. Shi, and S. Wu. The performance of mapreduce: An in-depth study. PVLDB, 3(1):472--483, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. J. Lin, D. Ryaboy, and K. Weil. Full-text indexing for optimizing selection operations in large-scale data analytics. In MapReduce, pages 59--66, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Y. Lin, D. Agrawal, C. Chen, B. C. Ooi, and S. Wu. Llama: leveraging columnar storage for scalable join processing in the MapReduce framework. In SIGMOD, pages 961--972, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. G. Moerkette. Small materialized aggregates: A light weight index structure for data warehousing. In VLDB, pages 476--487, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. M. Nicola, I. Kogan, and B. Schiefer. An XML transaction processing benchmark. In SIGMOD, pages 937--948, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. A. Okcan and M. Riedewald. Processing theta-joins using MapReduce. In SIGMOD, pages 949--960, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig latin: a not-so-foreign language for data processing. In SIGMOD, pages 1099--1110, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. M. T. Ozsu and P. Valduriez. Principles of Distributed Database Systems. Springer, 2011. Third edition. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. T. Palpanas, P. Larson, and J. Goldstein. Cache management policies for semantic caching. Technical Report CSRG-439, Dept. of Computer Science, University of Toronto, 2001.Google ScholarGoogle Scholar
  25. S. Podlipnig and L. Pöszörmenyi. A survey of web cache replacement strategies. ACM Comput. Surv., 35(4):374--398, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. M. Reddy and G. P. Fletcher. Exp1: a comparison between a simple adaptive caching agent using document life histories and existing cache techniques. Computer Networks and ISDN Systems, 30(22-23):2149--2153, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. M. Stonebraker et al. MapReduce and parallel DBMSs: friends or foes? Commun. ACM, 53(1):64--71, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Z. Svitkina and L. Fleischer. Submodular approximation: Sampling-based algorithms and lower bounds. SIAM J. Comput., 40(6):1715--1737, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. The Apache Software Foundation. HDFS architecture guide. http://hadoop.apache.org/hdfs/docs/current/hdfs_design.html.Google ScholarGoogle Scholar
  30. A. Thusoo et al. Hive - a warehousing solution over a Map-Reduce framework. PVLDB, 2(2):1626--1629, 2009. https://cwiki.apache.org/Hive/tutorial.html. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. TPC-H specification 2.8.0. http://www.tpc.org/tpch.Google ScholarGoogle Scholar
  32. R. Vernica, M. J. Carey, and C. Li. Efficient parallel set-similarity joins using MapReduce. In SIGMOD, pages 495--506, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Eagle-eyed elephant: split-oriented indexing in Hadoop

          Recommendations

          Comments

          Login options

          Check if you have access through your login credentials or your institution to get full access on this article.

          Sign in
          • Published in

            cover image ACM Other conferences
            EDBT '13: Proceedings of the 16th International Conference on Extending Database Technology
            March 2013
            793 pages
            ISBN:9781450315975
            DOI:10.1145/2452376

            Copyright © 2013 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 18 March 2013

            Permissions

            Request permissions about this article.

            Request Permissions

            Check for updates

            Qualifiers

            • research-article

            Acceptance Rates

            Overall Acceptance Rate7of10submissions,70%

          PDF Format

          View or Download as a PDF file.

          PDF

          eReader

          View online with eReader.

          eReader