ABSTRACT
Implementations of map-reduce are being used to perform many operations on very large data. We examine strategies for joining several relations in the map-reduce environment. Our new approach begins by identifying the "map-key," the set of attributes that identify the Reduce process to which a Map process must send a particular tuple. Each attribute of the map-key gets a "share," which is the number of buckets into which its values are hashed, to form a component of the identifier of a Reduce process. Relations have their tuples replicated in limited fashion, the degree of replication depending on the shares for those map-key attributes that are missing from their schema. We study the problem of optimizing the shares, given a fixed number of Reduce processes. An algorithm for detecting and fixing problems where an attribute is "mistakenly" included in the map-key is given. Then, we consider two important special cases: chain joins and star joins. In each case we are able to determine the map-key and determine the shares that yield the least replication. While the method we propose is not always superior to the conventional way of using map-reduce to implement joins, there are some important cases involving large-scale data where our method wins, including: (1) analytic queries in which a very large fact table is joined with smaller dimension tables, and (2) queries involving paths through graphs with high out-degree, such as the Web or a social network.
- F. N. Afrati and J. D. Ullman. Optimizing joins in a map-reduce environment. Technical report, Stanford, http://ilpubs.stanford.edu:8090/952/, 2009.Google Scholar
- Apache. Hadoop. http://hadoop.apache.org/, 2006.Google Scholar
- R. Avnur and J. M. Hellerstein. Eddies: Continuously adaptive query processing. In SIGMOD Conference, pages 261--272, 2000. Google ScholarDigital Library
- S. Babu, K. Munagala, J. Widom, and R. Motwani. Adaptive caching for continuous queries. In ICDE, pages 118--129, 2005. Google ScholarDigital Library
- S. Babu and J. Widom. Streamon: an adaptive engine for stream query processing. In SIGMOD Conference, pages 931--932, New York, NY, USA, 2004. ACM. Google ScholarDigital Library
- S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. Computer Networks, 30(1--7):107--117, 1998. Google ScholarDigital Library
- F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: A distributed storage system for structured data. ACM Trans. Comput. Syst., 26(2), 2008. Google ScholarDigital Library
- H. chih Yang, A. Dasdan, R.-L. Hsiao, and D. S. Parker. Map-reduce-merge: simplified relational data processing on large clusters. In SIGMOD Conference, pages 1029--1040, New York, NY, USA, 2007. ACM. Google ScholarDigital Library
- B. F. Cooper, R. Ramakrishnan, U. Srivastava, A. Silberstein, P. Bohannon, H.-A. Jacobsen, N. Puz, D. Weaver, and R. Yerneni. Pnuts: Yahoo!'s hosted data serving platform. PVLDB, 1(2):1277--1288, 2008. Google ScholarDigital Library
- J. Dean and S. Ghemawat. Mapreduce: simplified data processing on large clusters. Commun. ACM, 51(1):107--113, 2008. Google ScholarDigital Library
- A. Deshpande and L. Hellerstein. Flow algorithms for parallel query optimization. In ICDE, pages 754--763, 2008. Google ScholarDigital Library
- D. J. DeWitt, J. F. Naughton, D. A. Schneider, and S. Seshadri. Practical skew handling in parallel joins. In VLDB, pages 27--40, 1992. Google ScholarDigital Library
- D. J. DeWitt, E. Paulson, E. Robinson, J. F. Naughton, J. Royalty, S. Shankar, and A. Krioukov. Clustera: an integrated computation and data management system. PVLDB, 1(1):28--41, 2008. Google ScholarDigital Library
- S. Ghemawat, H. Gobioff, and S.-T. Leung. The google file system. In 19th ACM Symposium on Operating Systems Principles, 2003. Google ScholarDigital Library
- Z. G. Ives, D. Florescu, M. Friedman, A. Y. Levy, and D. S. Weld. An adaptive query execution system for data integration. In SIGMOD Conference, pages 299--310, 1999. Google ScholarDigital Library
- H. Jacobsson. Tree-based techniques for query evaluation. Ph.D. thesis, Dept. of CS, Stanford Univ., Stanford CA USA, STAN-CS-93-1492, 1993. Google ScholarDigital Library
- J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46:668--677, 1999. Google ScholarDigital Library
- S. Madden, M. A. Shah, J. M. Hellerstein, and V. Raman. Continuously adaptive continuous queries over streams. In SIGMOD Conference, pages 49--60, 2002. Google ScholarDigital Library
- C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig latin: a not-so-foreign language for data processing. In SIGMOD Conference, pages 1099--1110, 2008. Google ScholarDigital Library
- K. A. Ross and J. Cieslewicz. Optimal splitters for database partitioning with size bounds. In ICDT, pages 98--110, New York, NY, USA, 2009. ACM. Google ScholarDigital Library
- U. Srivastava, K. Munagala, J. Widom, and R. Motwani. Query optimization over web services. In VLDB, pages 355--366, 2006. Google ScholarDigital Library
- K.-L. Tan and H. Lu. A note on the strategy space of multiway join query optimization problem in parallel systems. SIGMOD Rec., 20(4):81--82, 1991. Google ScholarDigital Library
- S. D. Viglas, J. F. Naughton, and J. Burger. Maximizing the output rate of multi-way join queries over streaming information sources. In VLDB, pages 285--296, 2003. Google ScholarDigital Library
Recommendations
Optimizing Multiway Joins in a Map-Reduce Environment
Implementations of map-reduce are being used to perform many operations on very large data. We examine strategies for joining several relations in the map-reduce environment. Our new approach begins by identifying the “map-key,” the set of attributes ...
On optimizing relational self-joins
EDBT '12: Proceedings of the 15th International Conference on Extending Database TechnologySelf-join, which joins a relation with itself, is a prevalent operation in relational database systems. Despite its wide applicability, there has been little attention devoted to improving its performance. In this paper, we present SCALE (<u>S</u>ort ...
Processing multi-way spatial joins on map-reduce
EDBT '13: Proceedings of the 16th International Conference on Extending Database TechnologyIn this paper we investigate the problem of processing multi-way spatial joins on map-reduce platform. We look at two common spatial predicates - overlap and range. We address these two classes of join queries, discuss the challenges and outline novel ...
Comments