skip to main content
10.1145/1739041.1739056acmotherconferencesArticle/Chapter ViewAbstractPublication PagesedbtConference Proceedingsconference-collections
research-article

Optimizing joins in a map-reduce environment

Published:22 March 2010Publication History

ABSTRACT

Implementations of map-reduce are being used to perform many operations on very large data. We examine strategies for joining several relations in the map-reduce environment. Our new approach begins by identifying the "map-key," the set of attributes that identify the Reduce process to which a Map process must send a particular tuple. Each attribute of the map-key gets a "share," which is the number of buckets into which its values are hashed, to form a component of the identifier of a Reduce process. Relations have their tuples replicated in limited fashion, the degree of replication depending on the shares for those map-key attributes that are missing from their schema. We study the problem of optimizing the shares, given a fixed number of Reduce processes. An algorithm for detecting and fixing problems where an attribute is "mistakenly" included in the map-key is given. Then, we consider two important special cases: chain joins and star joins. In each case we are able to determine the map-key and determine the shares that yield the least replication. While the method we propose is not always superior to the conventional way of using map-reduce to implement joins, there are some important cases involving large-scale data where our method wins, including: (1) analytic queries in which a very large fact table is joined with smaller dimension tables, and (2) queries involving paths through graphs with high out-degree, such as the Web or a social network.

References

  1. F. N. Afrati and J. D. Ullman. Optimizing joins in a map-reduce environment. Technical report, Stanford, http://ilpubs.stanford.edu:8090/952/, 2009.Google ScholarGoogle Scholar
  2. Apache. Hadoop. http://hadoop.apache.org/, 2006.Google ScholarGoogle Scholar
  3. R. Avnur and J. M. Hellerstein. Eddies: Continuously adaptive query processing. In SIGMOD Conference, pages 261--272, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. S. Babu, K. Munagala, J. Widom, and R. Motwani. Adaptive caching for continuous queries. In ICDE, pages 118--129, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. S. Babu and J. Widom. Streamon: an adaptive engine for stream query processing. In SIGMOD Conference, pages 931--932, New York, NY, USA, 2004. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. Computer Networks, 30(1--7):107--117, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: A distributed storage system for structured data. ACM Trans. Comput. Syst., 26(2), 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. H. chih Yang, A. Dasdan, R.-L. Hsiao, and D. S. Parker. Map-reduce-merge: simplified relational data processing on large clusters. In SIGMOD Conference, pages 1029--1040, New York, NY, USA, 2007. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. B. F. Cooper, R. Ramakrishnan, U. Srivastava, A. Silberstein, P. Bohannon, H.-A. Jacobsen, N. Puz, D. Weaver, and R. Yerneni. Pnuts: Yahoo!'s hosted data serving platform. PVLDB, 1(2):1277--1288, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. J. Dean and S. Ghemawat. Mapreduce: simplified data processing on large clusters. Commun. ACM, 51(1):107--113, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. A. Deshpande and L. Hellerstein. Flow algorithms for parallel query optimization. In ICDE, pages 754--763, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. D. J. DeWitt, J. F. Naughton, D. A. Schneider, and S. Seshadri. Practical skew handling in parallel joins. In VLDB, pages 27--40, 1992. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. D. J. DeWitt, E. Paulson, E. Robinson, J. F. Naughton, J. Royalty, S. Shankar, and A. Krioukov. Clustera: an integrated computation and data management system. PVLDB, 1(1):28--41, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. S. Ghemawat, H. Gobioff, and S.-T. Leung. The google file system. In 19th ACM Symposium on Operating Systems Principles, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Z. G. Ives, D. Florescu, M. Friedman, A. Y. Levy, and D. S. Weld. An adaptive query execution system for data integration. In SIGMOD Conference, pages 299--310, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. H. Jacobsson. Tree-based techniques for query evaluation. Ph.D. thesis, Dept. of CS, Stanford Univ., Stanford CA USA, STAN-CS-93-1492, 1993. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46:668--677, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. S. Madden, M. A. Shah, J. M. Hellerstein, and V. Raman. Continuously adaptive continuous queries over streams. In SIGMOD Conference, pages 49--60, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig latin: a not-so-foreign language for data processing. In SIGMOD Conference, pages 1099--1110, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. K. A. Ross and J. Cieslewicz. Optimal splitters for database partitioning with size bounds. In ICDT, pages 98--110, New York, NY, USA, 2009. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. U. Srivastava, K. Munagala, J. Widom, and R. Motwani. Query optimization over web services. In VLDB, pages 355--366, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. K.-L. Tan and H. Lu. A note on the strategy space of multiway join query optimization problem in parallel systems. SIGMOD Rec., 20(4):81--82, 1991. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. S. D. Viglas, J. F. Naughton, and J. Burger. Maximizing the output rate of multi-way join queries over streaming information sources. In VLDB, pages 285--296, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library

Recommendations

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Sign in
  • Published in

    cover image ACM Other conferences
    EDBT '10: Proceedings of the 13th International Conference on Extending Database Technology
    March 2010
    741 pages
    ISBN:9781605589459
    DOI:10.1145/1739041

    Copyright © 2010 ACM

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    • Published: 22 March 2010

    Permissions

    Request permissions about this article.

    Request Permissions

    Check for updates

    Qualifiers

    • research-article

    Acceptance Rates

    Overall Acceptance Rate7of10submissions,70%

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader