skip to main content
research-article

Sharing across Multiple MapReduce Jobs

Published: 26 May 2014 Publication History

Abstract

Large-scale data analysis lies in the core of modern enterprises and scientific research. With the emergence of cloud computing, the use of an analytical query processing infrastructure can be directly associated with monetary cost. MapReduce has been a popular framework in the context of cloud computing, designed to serve long-running queries (jobs) which can be processed in batch mode. Taking into account that different jobs often perform similar work, there are many opportunities for sharing. In principle, sharing similar work reduces the overall amount of work, which can lead to reducing monetary charges for utilizing the processing infrastructure. In this article we present a sharing framework tailored to MapReduce, namely, <tt>MRShare</tt>.
Our framework, <tt>MRShare</tt>, transforms a batch of queries into a new batch that will be executed more efficiently, by merging jobs into groups and evaluating each group as a single query. Based on our cost model for MapReduce, we define an optimization problem and we provide a solution that derives the optimal grouping of queries. Given the query grouping, we merge jobs appropriately and submit them to MapReduce for processing. A key property of <tt>MRShare</tt> is that it is independent of the MapReduce implementation. Experiments with our prototype, built on top of Hadoop, demonstrate the overall effectiveness of our approach.
<tt>MRShare</tt> is primarily designed for handling I/O-intensive queries. However, with the development of high-level languages operating on top of MapReduce, user queries executed in this model become more complex and CPU intensive. Commonly, executed queries can be modeled as evaluating pipelines of CPU-expensive filters over the input stream. Examples of such filters include, but are not limited to, index probes, or certain types of joins. In this article we adapt some of the standard techniques for filter ordering used in relational and stream databases, propose their extensions, and implement them through <tt>MRAdaptiveFilter</tt>, an extension of <tt>MRShare</tt> for expensive filter ordering tailored to MapReduce, which allows one to handle both single- and batch-query execution modes. We present an experimental evaluation that demonstrates additional benefits of <tt>MRAdaptiveFilter</tt>, when executing CPU-intensive queries in <tt>MRShare</tt>.

References

[1]
Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel J. Abadi, Avi Silberschatz, and Alexander Rasin. 2009. HadoopDB: An architectural hybrid of MapReduce and DBMS technologies for analytical workloads. Proc. VLDB Endow. 2, 1, 922--933.
[2]
Foto Afrati and Jeffrey D. Ullman. 2010. Optimizing joins in a MapReduce environment. In Proceedings of the 13th International Conference on Extending Database Technology (EDBT'10). 99--110.
[3]
Parag Agrawal, Daniel Kifer, and Christopher Olston. 2008. Scheduling shared scans of large data files. Proc. VLDB Endow. 1, 1, 958--969.
[4]
Amazon. 2006. Amazon elastic compute cloud. http://aws.amazon.com/ec2/.
[5]
Ron Avnur and Joseph M. Hellerstein. 2000. Eddies: Continuously adaptive query processing. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'00). 261--272.
[6]
Shivnath Babu, Rajeev Motwani, Kamesh Munagala, Itaru Nishizawa, and Jennifer Widom. 2004. Adaptive ordering of pipelined stream filters. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'04). 407--418.
[7]
Spyros Blanas, Jignesh M. Patel, Vuk Ercegovac, Jun Rao, Eugene J. Shekita, and Yuanyuan Tian. 2010. A comparison of join algorithms for log processing in MapReduce. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'10). 975--986.
[8]
Blogscope. 2005. BlogScope. http://www.blogscope.net/.
[9]
George Candea, Neoklis Polyzotis, and Radek Vingralek. 2009. A scalable, predictable join operator for highly concurrent data warehouses. Proc. VLDB Endow. 2, 1, 277--288.
[10]
Ronnie Chaiken, Bob Jenkins, Per-Ake Larson, Bill Ramsey, Darren Shakib, Simon Weaver, and Jingren Zhou. 2008. SCOPE: Easy and efficient parallel processing of massive data sets. Proc. VLDB Endow 1, 2, 1265--1276.
[11]
Sirish Chandrasekaran, Owen Cooper, Amol Deshpande, Michael J. Franklin, Joseph M. Hellerstein, Wei Hong, Sailesh Krishnamurthy, Samuel Madden, Vijayshankar Raman, Frederick Reiss, and Mehul A. Shah. 2003. TelegraphCQ: Continuous dataflow processing for an uncertain world. In Proceedings of the 1st Biennial Conference on Innovative Data Systems Research (CIDR'03).
[12]
Surajit Chaudhuri and Kyuseok Shim. 1993. Query optimization in the presence of foreign functions. In Proceedings of the 19th International Conference on Very Large Data Bases (VLDB'93). 529--542.
[13]
Surajit Chaudhuri and Kyuseok Shim. 1996. Optimization of queries with user-defined predicates. In Proceedings of the 19th International Conference on Very Large Data Bases (VLDB'96). 87--98.
[14]
Surajit Chaudhuri and Kyuseok Shim. 1999. Optimization of queries with user-defined predicates. ACM Trans. Database Syst. 24, 2, 177--228.
[15]
Fa-Chung Fred Chen and Margaret H. Dunham. 1998. Common subexpression processing in multiple-query processing. IEEE Trans. Knowl. Data Engin. 10, 3, 493--499.
[16]
Jianjun Chen, David J. Dewitt, Feng Tian, and Yuan Wang. 2000. NiagaraCQ: A scalable continuous query system for internet databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'00). 379--390.
[17]
Cheng-Tao Chu, Sang Kyun Kim, Yi-An Lin, Yuan Yuan Yu, Gary Bradski, Andrew Y. Ng, and Kunie Olukotun. 2006. MapReduce for machine learning on multicore. In Proceedings of the Conference on Neural Information Processing Systems (NIPS'06).
[18]
Jeffrey Cohen, Brian Dolan, Mark Dunlap, Joseph M. Hellerstein, and Caleb Welton. 2009. MAD skills: New analysis practices for big data. Proc. VLDB Endow. 2, 2, 1481--1492.
[19]
Jeffrey Dean and Sanjay Ghemawat. 2004. MapReduce: Simplified data processing on large clusters. In Proceedings of the 6th Conference on Symposium on Operating Systems Design and Implementation (OSDI'04). 107--113.
[20]
Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: Simplified data processing on large clusters. Comm. ACM 51, 1, 107--113.
[21]
Jens Dittrich, Jorge Quiane, Alekh Jindal, Yagiz Kargin, Vinay Setty, and Jorg Schad. 2010. Hadoop&plus;&plus;: Making a yellow elephant run like a cheetah (without it even noticing). Proc. VLDB Endow. 3, 1.
[22]
Sheldon Finkelstein. 1982. Common expression analysis in database applications. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'82). 235--245.
[23]
Eric Friedman, Peter Pawlowski, and John Cieslewicz. 2009. Sql/MapReduce: A practical approach to self-describing, polymorphic, and parallelizable user-defined functions. Proc. VLDB Endow. 2, 2.
[24]
Alan F. Gates, Olga Natkovich, Shubham Chopra, Pradeep Kamath, Shravan M. Narayanamurthy, Christopher Olston, Benjamin Reed, Santhosh Srinivasan, and Utkarsh Srivastava. 2009. Building a high-level dataflow system on top of MapReduce: The pig experience. Proc. VLDB Endow. 2, 2, 1414--1425.
[25]
Hadoop. 2007. Hadoop project. http://hadoop.apache.org/.
[26]
Stavros Harizopoulos, Vladislav Shkapenyuk, and Anastassia Ailamaki. 2005. QPipe: A simultaneously pipelined relational query engine. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'00). 383--394.
[27]
Joseph M. Hellerstein. 1994. Practical predicate placement. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'94). 325--335.
[28]
Joseph M. Hellerstein and Michael Stonebraker. 1993. Predicate migration: Optimizing queries with expensive predicates. ACM SIGMOD Rec. 22, 2, 267--276.
[29]
Herodotos Herodotou, Harold Lim, Gang Luo, Nedyalko Borisov, Liang Dong, Fatma Bilgen Cetin, and Shivnath Babu. 2011. Starfish: A self-tuning system for big data analytics. In Proceedings of the Biennial Conference on Innovative Data Systems Research (CIDR'11).
[30]
Eaman Jahani, Michael J. Cafarella, and Christopher Re. 2011. Automatic optimization for MapReduce programs. Proc. VLDB Endow. 4, 6.
[31]
Dawei Jiang, Beng Chin Ooi, Lei Shi, and Sai Wu. 2010. The performance of MapReduce: An in-depth study. Proc. VLDB Endow. 3, 1.
[32]
Ryan Johnson, Stavros Harizopoulos, Nikos Hardavellas, Kivanc Sabirli, Ippokratis Pandis, Anastassia Ailamaki, Naju G. Mancheril, and Babak Falsafi. 2007. To share or not to share&quest; In Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB'07). 351--362.
[33]
Boduo Li, Edward Mazur, Yanlei Diao, Andrew Mcgregor, and Prashant J. Shenoy. 2012. SCALLA: A platform for scalable one-pass analytics using MapReduce. ACM Trans. Database Syst. 37, 4, 27.
[34]
Zhen Liu, Srinivasan Parthasarathy, Anand Ranganathan, and Hao Yang. 2008a. A generic flow algorithm for shared filter ordering problems. In Proceedings of the 27th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS'08). ACM Press, New York, 79--88.
[35]
Zhen Liu, Srinivasan Parthasarathy, Anand Ranganathan, and Hao Yang. 2008b. Near-optimal algorithms for shared filter evaluation in data stream systems. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'08). ACM Press, New York, 133--146.
[36]
Samuel Madden, Mehul A. Shah, Joseph M. Hellerstein, and Vijayshankar Raman. 2002. Continuously adaptive continuous queries over streams. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'02). 49--60.
[37]
Kamesh Munagala, Utkarsh Srivastava, and Jennifer Widom. 2007. Optimization of continuous queries with shared expensive filters. In Proceedings of the 26th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS'07). 215--224.
[38]
Tomasz Nykiel, Michalis Potamias, Chaitanya Mishra, George Kollios, and Nick Koudas. 2010. MRShare: Sharing across multiple queries in MapReduce. Proc. VLDB Endow. 3, 1, 494--505.
[39]
Christopher Olston, Benjamin Reed, Adam Silberstein, and Utkarsh Srivastava. 2008a. Automatic optimization of parallel dataflow programs. In Proceedings of the Annual Technical Conference on Annual Technical Conference (ATC'08). 267--273.
[40]
Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins. 2008b. Pig latin: A not-so-foreign language for data processing. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'08). 1099--1110.
[41]
Biswanath Panda, Joshua S. Herbach, Sugato Basu, and Roberto J. Bayardo. 2009. PLANET: Massively parallel learning of tree ensembles with mapreduce. Proc. VLDB Endow. 2, 2.
[42]
Jooseok Park and Arie Segev. 1988. Using common subexpressions to optimize multiple queries. In Proceedings of the 4th International Conference on Data Engineering (ICDE'88). 311--319.
[43]
Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. Dewitt, Samuel Madden, and Michael Stonebraker. 2009. A comparison of approaches to large-scale data analysis. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'09).
[44]
Rob Pike, Sean Dorward, Robert Griesemer, and Sean Quinlan. 2005. Interpreting the data: Parallel analysis with sawzall. Sci. Program. 13, 4, 277--298.
[45]
Lin Qiao, Vijayshankar Raman, Frederick Reiss, Peter J. Haas, and Guy M. Lohman. 2008. Main-memory scan sharing for multi-core CPUS. Proc. VLDB Endow. 1, 1, 610--621.
[46]
Daniel J. Rosenkrantz and Harry B. Hunt III. 1980. Processing conjunctive predicates and queries. In Proceedings of the 6th International Conference on Very Large Data Bases (VLDB'80). 64--72.
[47]
Prasan Roy, Sridhar Seshadri, S. Sudarshan, and Siddhesh Bhobe. 2000. Efficient and extensible algorithms for multi query optimization. ACM SIGMOD Rec. 29, 2, 249--260.
[48]
Timos Sellis. 1988. Multiple-query optimization. ACM Trans. Database Syst. 13, 1, 23--52.
[49]
Kyuseok Shim, Timos Sellis, and Dana Nau. 1994. Improvements on a heuristic algorithm for multiple-query optimization. Data Knowl. Engin. 12, 2, 197--222.
[50]
Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff, and Raghotham Murthy. 2009. Hive -- A warehousing solution over a MapReduce framework. Proc. VLDB Endow. 2, 2, 1626--1629.
[51]
Carl A. Waldspurger and William E. Weihl. 1994. Lottery scheduling: Flexible proportional-share resource management. In Proceedings of the 1st USENIX Conference on Operating Systems Design and Implementation (OSDI'94). 1--11.
[52]
Xiaodan Wang, Christopher Olston, Anish Das Sarma, and Randal Burns. 2011. CoScan: Cooperative scan sharing in the cloud. In Proceedings of the 2nd ACM Symposium on Cloud Computing (SOCC'11). 11:1--11:12.
[53]
Joel Wolf, Andrey Balmin, Deepak Rajan, Kirsten Hildrum, Rohit Khandekar, Sujay Parekh, Kun-Lung Wu, and Rares Vernica. 2012. CIRCUMFLEX: A scheduling optimizer for MapReduce workloads with shared scans. SIGOPS Oper. Syst. Rev. 46, 1, 26--32.
[54]
Sai Wu, Feng Li, Sharad Mehrotra, and Beng Chin Ooi. 2011. Query optimization for massively parallel data processing. In Proceedings of the 2nd ACM Symposium on Cloud Computing (SOCC'11). 12:1--12:13.
[55]
Hung-Chih Yang, Ali Dasdan, Ruey-Lung Hsiao, and D. Stott Parker. 2007. MapReduce-merge: Simplified relational data processing on large clusters. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'07). 1029--1040.
[56]
Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Ulfar Erlingsson, Pradeep Kumar Gunda, and Jon Currey. 2008. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation (OSDI'08). 1--14.
[57]
Jingren Zhou, Per-Ake Larson, Johann-Christoph Freytag, and Wolfgang Lehner. 2007. Efficient exploitation of similar subexpressions for query processing. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'07). 533--544.
[58]
Marcin Zukowski, Sandor Heman, Niels Nes, and Peter Boncz. 2007. Cooperative scans: Dynamic bandwidth sharing in a DBMS. In Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB'07). 723--734.

Cited By

View all
  • (2020)Exploiting Sharing Join Opportunities in Big Data Multiquery Optimization with FlinkComplexity10.1155/2020/66171492020Online publication date: 1-Jan-2020
  • (2018)The optimization for recurring queries in big data analysis system with MapReduceFuture Generation Computer Systems10.1016/j.future.2017.09.06387(549-556)Online publication date: Oct-2018
  • (2017)EclipseMR: Distributed and Parallel Task Processing with Consistent Hashing2017 IEEE International Conference on Cluster Computing (CLUSTER)10.1109/CLUSTER.2017.12(322-332)Online publication date: Sep-2017
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Transactions on Database Systems
ACM Transactions on Database Systems  Volume 39, Issue 2
May 2014
336 pages
ISSN:0362-5915
EISSN:1557-4644
DOI:10.1145/2627748
Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 26 May 2014
Accepted: 01 December 2013
Revised: 01 September 2013
Received: 01 August 2012
Published in TODS Volume 39, Issue 2

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. MapReduce
  2. Sharing MapReduce jobs
  3. query processing
  4. systems

Qualifiers

  • Research-article
  • Research
  • Refereed

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)5
  • Downloads (Last 6 weeks)0
Reflects downloads up to 08 Mar 2025

Other Metrics

Citations

Cited By

View all
  • (2020)Exploiting Sharing Join Opportunities in Big Data Multiquery Optimization with FlinkComplexity10.1155/2020/66171492020Online publication date: 1-Jan-2020
  • (2018)The optimization for recurring queries in big data analysis system with MapReduceFuture Generation Computer Systems10.1016/j.future.2017.09.06387(549-556)Online publication date: Oct-2018
  • (2017)EclipseMR: Distributed and Parallel Task Processing with Consistent Hashing2017 IEEE International Conference on Cluster Computing (CLUSTER)10.1109/CLUSTER.2017.12(322-332)Online publication date: Sep-2017
  • (2017)Generalization of Large-Scale Data Processing in One MapReduce Job for Coarse-Grained ParallelismInternational Journal of Parallel Programming10.1007/s10766-016-0444-345:4(797-826)Online publication date: 1-Aug-2017
  • (2016)MEMoMRConcurrency and Computation: Practice & Experience10.1002/cpe.370228:14(3814-3829)Online publication date: 25-Sep-2016
  • (2015)Lottery scheduler for the Linux kernelDYNA10.15446/dyna.v82n189.4306882:189(216-225)Online publication date: 22-Feb-2015
  • (2015)Cache-oblivious scheduling of shared workloads2015 IEEE 31st International Conference on Data Engineering10.1109/ICDE.2015.7113339(855-866)Online publication date: Apr-2015

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media