ABSTRACT
An increasingly important analytics scenario for Hadoop involves multiple (often ad hoc) grouping and aggregation queries with selection predicates over a slowly changing dataset. These queries are typically expressed via high-level query languages such as Jaql, Pig, and Hive, and are used either directly for business-intelligence applications or to prepare the data for statistical model building and machine learning. In such scenarios it has been increasingly recognized that, as in classical databases, techniques for avoiding access to irrelevant data can dramatically improve query performance. Prior work on Hadoop, however, has simply ported classical techniques to the MapReduce setting, focusing on record-level indexing and key-based partition elimination. Unfortunately, record-level indexing only slightly improves overall query performance, because it does not minimize the number of mapper "waves", which is determined by the number of processed splits. Moreover, key-based partitioning requires data reorganization, which is usually impractical in Hadoop settings. We therefore need to re-envision how data access mechanisms are defined and implemented. To this end, we introduce the Eagle-Eyed Elephant (E3) framework for boosting the efficiency of query processing in Hadoop by avoiding accesses of data splits that are irrelevant to the query at hand. Using novel techniques involving inverted indexes over splits, domain segmentation, materialized views, and adaptive caching, E3 avoids accessing irrelevant splits even in the face of evolving workloads and data. Our experiments show that E3 can achieve up to 20x cost savings with small to moderate storage overheads.
- A. Abouzeid et al. HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads. PVLDB, 2(1):922--933, 2009. Google ScholarDigital Library
- F. N. Afrati and J. D. Ullman. Optimizing joins in a map-reduce environment. In EDBT, 2010. Google ScholarDigital Library
- K. S. Beyer, V. Ercegovac, R. Gemulla, A. Balmin, M. Y. Eltabakh, C.-C. Kanne, F. Özcan, and E. J. Shekita. Jaql: A scripting language for large scale semistructured data analysis. PVLDB, 4(12):1272--1283, 2011. http://code.google.com/p/jaql.Google ScholarDigital Library
- S. Blanas, J. M. Patel, V. Ercegovac, J. Rao, E. J. Shekita, and Y. Tian. A comparison of join algorithms for log processing in MapReduce. In SIGMOD, pages 975--986, 2010. Google ScholarDigital Library
- G. Cormode, M. Garofalakis, P. J. Haas, and C. Jermaine. Synopses for massive data: Samples, histograms, wavelets, sketches. Foundations and Trends in Databases, 4(1-3):1--294, 2012. Google ScholarDigital Library
- G. B. Dantzig. Discrete-variable extremum problems. Oper. Res., 5(2):266--288, 1957.Google ScholarDigital Library
- J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. In OSDI, pages 137--150, 2004. Google ScholarDigital Library
- J. Dittrich, J.-A. Quiané-Ruiz, A. Jindal, Y. Kargin, V. Setty, and J. Schad. Hadoop++: Making a yellow elephant run like a cheetah (without it even noticing). PVLDB, 3(1):518--529, 2010. Google ScholarDigital Library
- J. Dittrich, J.-A. Quiané-Ruiz, S. Richter, S. Schuh, A. Jindal, and J. Schad. Only aggressive elephants are fast elephants. PVLDB, 5(11):1591--1602, 2012. Google ScholarDigital Library
- M. Y. Eltabakh, Y. Tian, F. Özcan, R. Gemulla, A. Krettek, and J. McPherson. CoHadoop: Flexible data placement and its exploitation in Hadoop. PVLDB, 4(9):575--585, 2011. Google ScholarDigital Library
- M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, 1979. Google ScholarDigital Library
- S. W. Golomb. Run-length encodings. IEEE Trans. Inform. Theor., 10(3):399--401, 1966. Google ScholarDigital Library
- The Apache Hadoop Project. http://hadoop.apache.org/core/, 2009.Google Scholar
- D. Hilley. Cloud computing: A taxonomy of platform and infrastructure-level offerings. Technical Report GIT-CERCS-09-1, Georgia Institute of Technology, 2009.Google Scholar
- O. Ibarra and C. Kim. Fast approximation algorithms for the knapsack and sum of subset problems. Journal of the ACM, 22:463--468, 1975. Google ScholarDigital Library
- D. Jiang, B. C. Ooi, L. Shi, and S. Wu. The performance of mapreduce: An in-depth study. PVLDB, 3(1):472--483, 2010. Google ScholarDigital Library
- J. Lin, D. Ryaboy, and K. Weil. Full-text indexing for optimizing selection operations in large-scale data analytics. In MapReduce, pages 59--66, 2011. Google ScholarDigital Library
- Y. Lin, D. Agrawal, C. Chen, B. C. Ooi, and S. Wu. Llama: leveraging columnar storage for scalable join processing in the MapReduce framework. In SIGMOD, pages 961--972, 2011. Google ScholarDigital Library
- G. Moerkette. Small materialized aggregates: A light weight index structure for data warehousing. In VLDB, pages 476--487, 1998. Google ScholarDigital Library
- M. Nicola, I. Kogan, and B. Schiefer. An XML transaction processing benchmark. In SIGMOD, pages 937--948, 2007. Google ScholarDigital Library
- A. Okcan and M. Riedewald. Processing theta-joins using MapReduce. In SIGMOD, pages 949--960, 2011. Google ScholarDigital Library
- C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig latin: a not-so-foreign language for data processing. In SIGMOD, pages 1099--1110, 2008. Google ScholarDigital Library
- M. T. Ozsu and P. Valduriez. Principles of Distributed Database Systems. Springer, 2011. Third edition. Google ScholarDigital Library
- T. Palpanas, P. Larson, and J. Goldstein. Cache management policies for semantic caching. Technical Report CSRG-439, Dept. of Computer Science, University of Toronto, 2001.Google Scholar
- S. Podlipnig and L. Pöszörmenyi. A survey of web cache replacement strategies. ACM Comput. Surv., 35(4):374--398, 2003. Google ScholarDigital Library
- M. Reddy and G. P. Fletcher. Exp1: a comparison between a simple adaptive caching agent using document life histories and existing cache techniques. Computer Networks and ISDN Systems, 30(22-23):2149--2153, 1998. Google ScholarDigital Library
- M. Stonebraker et al. MapReduce and parallel DBMSs: friends or foes? Commun. ACM, 53(1):64--71, 2010. Google ScholarDigital Library
- Z. Svitkina and L. Fleischer. Submodular approximation: Sampling-based algorithms and lower bounds. SIAM J. Comput., 40(6):1715--1737, 2011. Google ScholarDigital Library
- The Apache Software Foundation. HDFS architecture guide. http://hadoop.apache.org/hdfs/docs/current/hdfs_design.html.Google Scholar
- A. Thusoo et al. Hive - a warehousing solution over a Map-Reduce framework. PVLDB, 2(2):1626--1629, 2009. https://cwiki.apache.org/Hive/tutorial.html. Google ScholarDigital Library
- TPC-H specification 2.8.0. http://www.tpc.org/tpch.Google Scholar
- R. Vernica, M. J. Carey, and C. Li. Efficient parallel set-similarity joins using MapReduce. In SIGMOD, pages 495--506, 2010. Google ScholarDigital Library
Index Terms
- Eagle-eyed elephant: split-oriented indexing in Hadoop
Recommendations
Riding the elephant: managing ensembles with hadoop
MTAGS '11: Proceedings of the 2011 ACM international workshop on Many task computing on grids and supercomputersMany important scientific applications do not fit the traditional model of a monolithic simulation running on thousands of nodes. Scientific workflows -- such as the Materials Genome project, Energy Frontiers Research Center for Gas Separations Relevant ...
Elephant, Do Not Forget Everything! Efficient Processing of Growing Datasets
CLOUD '13: Proceedings of the 2013 IEEE Sixth International Conference on Cloud ComputingMapReduce has become quite popular to analyse very large datasets. Nevertheless, users typically have to run their MapReduce jobs over the whole dataset every time the dataset is appended by new records. Some researchers have proposed to reuse the ...
Comments