skip to main content
10.1145/3035918.3035921acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article
Open Access

Two-Level Sampling for Join Size Estimation

Authors Info & Claims
Published:09 May 2017Publication History

ABSTRACT

Join size estimation is a critical step in query optimization, and has been extensively studied in the literature. Among the many techniques, sampling based approaches are particularly appealing, due to their ability to handle arbitrary selection predicates. In this paper, we propose a new sampling algorithm for join size estimation, called two-level sampling, which combines the advantages of three previous sampling methods while making further improvements. Both analytical and empirical comparisons show that the new algorithm outperforms all the previous algorithms on a variety of joins, including primary key-foreign key joins, many-to-many joins, and multi-table joins. The new sampling algorithm is also very easy to implement, requiring just one pass over the data. It only relies on some basic statistical information about the data, such as the ℓk-norms and the heavy hitters.

References

  1. S. Acharya, P. B. Gibbons, V. Poosala, and S. Ramaswamy. Join synopses for approximate query answering. In ACM SIGMOD Record, volume 28, pages 275--286. ACM, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. S. Agarwal, B. Mozafari, A. Panda, H. Milner, S. Madden, and I. Stoica. Blinkdb: queries with bounded errors and bounded response times on very large data. In Proceedings of the 8th ACM European Conference on Computer Systems, pages 29--42. ACM, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. N. Alon, P. B. Gibbons, Y. Matias, and M. Szegedy. Tracking join and self-join sizes in limited storage. In Proceedings of the eighteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, PODS '99, pages 10--20, New York, NY, USA, 1999. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. K. Chakrabarti, M. Garofalakis, R. Rastogi, and K. Shim. Approximate query processing using wavelets. The VLDB Journal, 10(2--3):199--223, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. S. Chu, M. Balazinska, and D. Suciu. From theory to practice: Efficient join query evaluation in a parallel database system. In Proc. ACM SIGMOD International Conference on Management of Data, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. G. Cormode and M. Garofalakis. Sketching streams through the net: Distributed approximate query tracking. In Proc. International Conference on Very Large Data Bases, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. G. Cormode and M. Hadjieleftheriou. Finding frequent items in data streams. In Proc. International Conference on Very Large Data Bases, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. G. Cormode and S. Muthukrishnan. An improved data stream summary: The count-min sketch and its applications. Journal of Algorithms, 55(1):58--75, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. A. Dobra, M. Garofalakis, J. Gehrke, and R. Rastogi. Processing complex aggregate queries over data streams. In Proceedings of the 2002 ACM SIGMOD international conference on Management of data, pages 61--72. ACM, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. C. Estan and J. F. Naughton. End-biased samples for join cardinality estimation. In Data Engineering, 2006. ICDE'06. Proceedings of the 22nd International Conference on, pages 20--20. IEEE, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. S. Ganguly, P. B. Gibbons, Y. Matias, and A. Silberschatz. Bifocal sampling for skew-resistant join size estimation. In ACM SIGMOD Record, volume 25, pages 271--281. ACM, 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. P. J. Haas. Large-sample and deterministic confidence intervals for online aggregation. In Proc. Ninth Intl. Conf. Scientific and Statistical Database Management, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. S. Kandula, A. Shanbhag, A. Vitorovic, M. Olma, R. Grandl, S. Chaudhuri, and B. Ding. Quickr: Lazily approximating complex ad-hoc queries in big data clusters. In Proc. ACM SIGMOD International Conference on Management of Data, 2016. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. H. Kwak, C. Lee, H. Park, and S. Moon. What is twitter, a social network or a news media? In WWW, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. F. Li, B. Wu, K. Yi, and Z. Zhao. Wander join: Online aggregation via random walks. In Proc. ACM SIGMOD International Conference on Management of Data, 2016. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. R. J. Lipton and J. F. Naughton. Query size estimation by adaptive sampling. In Proc. ACM Symposium on Principles of Database Systems, 1990. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. A. Metwally, D. Agrawal, and A. Abbadi. An integrated efficient solution for computing frequent and top-k elements in data streams. ACM Transactions on Database Systems, 31(3):1095--1133, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. F. Rusu and A. Dobra. Pseudo-random number generation for sketch-based estimations. ACM Transactions on Database Systems, 32(2), Article 11, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. F. Rusu and A. Dobra. Sketches for size of join estimation. ACM Transactions on Database Systems, 33:1--46, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. M. Thorup and Y. Zhang. Tabulation based 4-universal hashing with applications to second moment estimation.Google ScholarGoogle Scholar
  21. D. Vengerov, A. C. Menck, M. Zait, and S. P. Chakkappen. Join size estimation subject to filter conditions. Proceedings of the VLDB Endowment, 8(12):1530--1541, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. J. S. Vitter. Random sampling with a reservoir. ACM Transactions on Mathematical Software (TOMS), 11(1):37--57, 1985. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Two-Level Sampling for Join Size Estimation

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in
    • Published in

      cover image ACM Conferences
      SIGMOD '17: Proceedings of the 2017 ACM International Conference on Management of Data
      May 2017
      1810 pages
      ISBN:9781450341974
      DOI:10.1145/3035918

      Copyright © 2017 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 9 May 2017

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Author Tags

      Qualifiers

      • research-article

      Acceptance Rates

      Overall Acceptance Rate785of4,003submissions,20%

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader