research-article

Eagle-eyed elephant: split-oriented indexing in Hadoop

Authors:
Mohamed Y. Eltabakh

Worcester Polytechnic Institute, Worcester, MA

Worcester Polytechnic Institute, Worcester, MA
View Profile

,
Fatma Özcan

IBM Almaden Research Center, San Jose, CA

IBM Almaden Research Center, San Jose, CA
View Profile

,
Yannis Sismanis

IBM Almaden Research Center, San Jose, CA

IBM Almaden Research Center, San Jose, CA
View Profile

,
Peter J. Haas

IBM Almaden Research Center, San Jose, CA

IBM Almaden Research Center, San Jose, CA
View Profile

,
Hamid Pirahesh

IBM Almaden Research Center, San Jose, CA

IBM Almaden Research Center, San Jose, CA
View Profile

,
Jan Vondrak

IBM Almaden Research Center, San Jose, CA

IBM Almaden Research Center, San Jose, CA
View Profile

EDBT '13: Proceedings of the 16th International Conference on Extending Database TechnologyMarch 2013Pages 89–100https://doi.org/10.1145/2452376.2452388

Published:18 March 2013Publication History

EDBT '13: Proceedings of the 16th International Conference on Extending Database Technology

Pages 89–100

ABSTRACT

An increasingly important analytics scenario for Hadoop involves multiple (often ad hoc) grouping and aggregation queries with selection predicates over a slowly changing dataset. These queries are typically expressed via high-level query languages such as Jaql, Pig, and Hive, and are used either directly for business-intelligence applications or to prepare the data for statistical model building and machine learning. In such scenarios it has been increasingly recognized that, as in classical databases, techniques for avoiding access to irrelevant data can dramatically improve query performance. Prior work on Hadoop, however, has simply ported classical techniques to the MapReduce setting, focusing on record-level indexing and key-based partition elimination. Unfortunately, record-level indexing only slightly improves overall query performance, because it does not minimize the number of mapper "waves", which is determined by the number of processed splits. Moreover, key-based partitioning requires data reorganization, which is usually impractical in Hadoop settings. We therefore need to re-envision how data access mechanisms are defined and implemented. To this end, we introduce the Eagle-Eyed Elephant (E3) framework for boosting the efficiency of query processing in Hadoop by avoiding accesses of data splits that are irrelevant to the query at hand. Using novel techniques involving inverted indexes over splits, domain segmentation, materialized views, and adaptive caching, E3 avoids accessing irrelevant splits even in the face of evolving workloads and data. Our experiments show that E3 can achieve up to 20x cost savings with small to moderate storage overheads.

References

A. Abouzeid et al. HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads. PVLDB, 2(1):922--933, 2009. Google ScholarDigital Library
F. N. Afrati and J. D. Ullman. Optimizing joins in a map-reduce environment. In EDBT, 2010. Google ScholarDigital Library
K. S. Beyer, V. Ercegovac, R. Gemulla, A. Balmin, M. Y. Eltabakh, C.-C. Kanne, F. Özcan, and E. J. Shekita. Jaql: A scripting language for large scale semistructured data analysis. PVLDB, 4(12):1272--1283, 2011. http://code.google.com/p/jaql.Google ScholarDigital Library
S. Blanas, J. M. Patel, V. Ercegovac, J. Rao, E. J. Shekita, and Y. Tian. A comparison of join algorithms for log processing in MapReduce. In SIGMOD, pages 975--986, 2010. Google ScholarDigital Library
G. Cormode, M. Garofalakis, P. J. Haas, and C. Jermaine. Synopses for massive data: Samples, histograms, wavelets, sketches. Foundations and Trends in Databases, 4(1-3):1--294, 2012. Google ScholarDigital Library
G. B. Dantzig. Discrete-variable extremum problems. Oper. Res., 5(2):266--288, 1957.Google ScholarDigital Library
J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. In OSDI, pages 137--150, 2004. Google ScholarDigital Library
J. Dittrich, J.-A. Quiané-Ruiz, A. Jindal, Y. Kargin, V. Setty, and J. Schad. Hadoop++: Making a yellow elephant run like a cheetah (without it even noticing). PVLDB, 3(1):518--529, 2010. Google ScholarDigital Library
J. Dittrich, J.-A. Quiané-Ruiz, S. Richter, S. Schuh, A. Jindal, and J. Schad. Only aggressive elephants are fast elephants. PVLDB, 5(11):1591--1602, 2012. Google ScholarDigital Library
M. Y. Eltabakh, Y. Tian, F. Özcan, R. Gemulla, A. Krettek, and J. McPherson. CoHadoop: Flexible data placement and its exploitation in Hadoop. PVLDB, 4(9):575--585, 2011. Google ScholarDigital Library
M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, 1979. Google ScholarDigital Library
S. W. Golomb. Run-length encodings. IEEE Trans. Inform. Theor., 10(3):399--401, 1966. Google ScholarDigital Library
The Apache Hadoop Project. http://hadoop.apache.org/core/, 2009.Google Scholar
D. Hilley. Cloud computing: A taxonomy of platform and infrastructure-level offerings. Technical Report GIT-CERCS-09-1, Georgia Institute of Technology, 2009.Google Scholar
O. Ibarra and C. Kim. Fast approximation algorithms for the knapsack and sum of subset problems. Journal of the ACM, 22:463--468, 1975. Google ScholarDigital Library
D. Jiang, B. C. Ooi, L. Shi, and S. Wu. The performance of mapreduce: An in-depth study. PVLDB, 3(1):472--483, 2010. Google ScholarDigital Library
J. Lin, D. Ryaboy, and K. Weil. Full-text indexing for optimizing selection operations in large-scale data analytics. In MapReduce, pages 59--66, 2011. Google ScholarDigital Library
Y. Lin, D. Agrawal, C. Chen, B. C. Ooi, and S. Wu. Llama: leveraging columnar storage for scalable join processing in the MapReduce framework. In SIGMOD, pages 961--972, 2011. Google ScholarDigital Library
G. Moerkette. Small materialized aggregates: A light weight index structure for data warehousing. In VLDB, pages 476--487, 1998. Google ScholarDigital Library
M. Nicola, I. Kogan, and B. Schiefer. An XML transaction processing benchmark. In SIGMOD, pages 937--948, 2007. Google ScholarDigital Library
A. Okcan and M. Riedewald. Processing theta-joins using MapReduce. In SIGMOD, pages 949--960, 2011. Google ScholarDigital Library
C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig latin: a not-so-foreign language for data processing. In SIGMOD, pages 1099--1110, 2008. Google ScholarDigital Library
M. T. Ozsu and P. Valduriez. Principles of Distributed Database Systems. Springer, 2011. Third edition. Google ScholarDigital Library
T. Palpanas, P. Larson, and J. Goldstein. Cache management policies for semantic caching. Technical Report CSRG-439, Dept. of Computer Science, University of Toronto, 2001.Google Scholar
S. Podlipnig and L. Pöszörmenyi. A survey of web cache replacement strategies. ACM Comput. Surv., 35(4):374--398, 2003. Google ScholarDigital Library
M. Reddy and G. P. Fletcher. Exp1: a comparison between a simple adaptive caching agent using document life histories and existing cache techniques. Computer Networks and ISDN Systems, 30(22-23):2149--2153, 1998. Google ScholarDigital Library
M. Stonebraker et al. MapReduce and parallel DBMSs: friends or foes? Commun. ACM, 53(1):64--71, 2010. Google ScholarDigital Library
Z. Svitkina and L. Fleischer. Submodular approximation: Sampling-based algorithms and lower bounds. SIAM J. Comput., 40(6):1715--1737, 2011. Google ScholarDigital Library
The Apache Software Foundation. HDFS architecture guide. http://hadoop.apache.org/hdfs/docs/current/hdfs_design.html.Google Scholar
A. Thusoo et al. Hive - a warehousing solution over a Map-Reduce framework. PVLDB, 2(2):1626--1629, 2009. https://cwiki.apache.org/Hive/tutorial.html. Google ScholarDigital Library
TPC-H specification 2.8.0. http://www.tpc.org/tpch.Google Scholar
R. Vernica, M. J. Carey, and C. Li. Efficient parallel set-similarity joins using MapReduce. In SIGMOD, pages 495--506, 2010. Google ScholarDigital Library

Index Terms

Eagle-eyed elephant: split-oriented indexing in Hadoop

Recommendations

Riding the elephant: managing ensembles with hadoop
MTAGS '11: Proceedings of the 2011 ACM international workshop on Many task computing on grids and supercomputers

Many important scientific applications do not fit the traditional model of a monolithic simulation running on thousands of nodes. Scientific workflows -- such as the Materials Genome project, Energy Frontiers Research Center for Gas Separations Relevant ...
Read More
Elephant, Do Not Forget Everything! Efficient Processing of Growing Datasets
CLOUD '13: Proceedings of the 2013 IEEE Sixth International Conference on Cloud Computing

MapReduce has become quite popular to analyse very large datasets. Nevertheless, users typically have to run their MapReduce jobs over the whole dataset every time the dataset is appended by new records. Some researchers have proposed to reuse the ...
Read More
Big Data Analytics with R and Hadoop
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
EDBT '13: Proceedings of the 16th International Conference on Extending Database Technology
March 2013
793 pages
ISBN:9781450315975
DOI:10.1145/2452376
General Chair:
Giovanna Guerrini
Università di Genova, Italy
,
Program Chair:
Norman W. Paton
University of Manchester, UK
Copyright © 2013 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 18 March 2013
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate7of10submissions,70%
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 30
  Total Citations
  View Citations
- 423
  Total Downloads
- Downloads (Last 12 months)4
- Downloads (Last 6 weeks)0
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Eagle-eyed elephant: split-oriented indexing in Hadoop

EDBT '13: Proceedings of the 16th International Conference on Extending Database Technology

ABSTRACT

References

Cited By

Index Terms

Recommendations

Riding the elephant: managing ensembles with hadoop

Elephant, Do Not Forget Everything! Efficient Processing of Growing Datasets

Big Data Analytics with R and Hadoop