Two-Level Sampling for Join Size Estimation

Authors:
Yu Chen

Hong Kong University of Science and Technology, Hong Kong, China

Hong Kong University of Science and Technology, Hong Kong, China
View Profile

,
Ke Yi

Hong Kong University of Science and Technology, Hong Kong, China

Hong Kong University of Science and Technology, Hong Kong, China
View Profile

SIGMOD '17: Proceedings of the 2017 ACM International Conference on Management of DataMay 2017Pages 759–774https://doi.org/10.1145/3035918.3035921

Published:09 May 2017Publication History

SIGMOD '17: Proceedings of the 2017 ACM International Conference on Management of Data

Pages 759–774

ABSTRACT

Join size estimation is a critical step in query optimization, and has been extensively studied in the literature. Among the many techniques, sampling based approaches are particularly appealing, due to their ability to handle arbitrary selection predicates. In this paper, we propose a new sampling algorithm for join size estimation, called two-level sampling, which combines the advantages of three previous sampling methods while making further improvements. Both analytical and empirical comparisons show that the new algorithm outperforms all the previous algorithms on a variety of joins, including primary key-foreign key joins, many-to-many joins, and multi-table joins. The new sampling algorithm is also very easy to implement, requiring just one pass over the data. It only relies on some basic statistical information about the data, such as the ℓ_k-norms and the heavy hitters.

References

S. Acharya, P. B. Gibbons, V. Poosala, and S. Ramaswamy. Join synopses for approximate query answering. In ACM SIGMOD Record, volume 28, pages 275--286. ACM, 1999. Google ScholarDigital Library
S. Agarwal, B. Mozafari, A. Panda, H. Milner, S. Madden, and I. Stoica. Blinkdb: queries with bounded errors and bounded response times on very large data. In Proceedings of the 8th ACM European Conference on Computer Systems, pages 29--42. ACM, 2013. Google ScholarDigital Library
N. Alon, P. B. Gibbons, Y. Matias, and M. Szegedy. Tracking join and self-join sizes in limited storage. In Proceedings of the eighteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, PODS '99, pages 10--20, New York, NY, USA, 1999. ACM. Google ScholarDigital Library
K. Chakrabarti, M. Garofalakis, R. Rastogi, and K. Shim. Approximate query processing using wavelets. The VLDB Journal, 10(2--3):199--223, 2001. Google ScholarDigital Library
S. Chu, M. Balazinska, and D. Suciu. From theory to practice: Efficient join query evaluation in a parallel database system. In Proc. ACM SIGMOD International Conference on Management of Data, 2015. Google ScholarDigital Library
G. Cormode and M. Garofalakis. Sketching streams through the net: Distributed approximate query tracking. In Proc. International Conference on Very Large Data Bases, 2005. Google ScholarDigital Library
G. Cormode and M. Hadjieleftheriou. Finding frequent items in data streams. In Proc. International Conference on Very Large Data Bases, 2008. Google ScholarDigital Library
G. Cormode and S. Muthukrishnan. An improved data stream summary: The count-min sketch and its applications. Journal of Algorithms, 55(1):58--75, 2005. Google ScholarDigital Library
A. Dobra, M. Garofalakis, J. Gehrke, and R. Rastogi. Processing complex aggregate queries over data streams. In Proceedings of the 2002 ACM SIGMOD international conference on Management of data, pages 61--72. ACM, 2002. Google ScholarDigital Library
C. Estan and J. F. Naughton. End-biased samples for join cardinality estimation. In Data Engineering, 2006. ICDE'06. Proceedings of the 22nd International Conference on, pages 20--20. IEEE, 2006. Google ScholarDigital Library
S. Ganguly, P. B. Gibbons, Y. Matias, and A. Silberschatz. Bifocal sampling for skew-resistant join size estimation. In ACM SIGMOD Record, volume 25, pages 271--281. ACM, 1996. Google ScholarDigital Library
P. J. Haas. Large-sample and deterministic confidence intervals for online aggregation. In Proc. Ninth Intl. Conf. Scientific and Statistical Database Management, 1997. Google ScholarDigital Library
S. Kandula, A. Shanbhag, A. Vitorovic, M. Olma, R. Grandl, S. Chaudhuri, and B. Ding. Quickr: Lazily approximating complex ad-hoc queries in big data clusters. In Proc. ACM SIGMOD International Conference on Management of Data, 2016. Google ScholarDigital Library
H. Kwak, C. Lee, H. Park, and S. Moon. What is twitter, a social network or a news media? In WWW, 2010. Google ScholarDigital Library
F. Li, B. Wu, K. Yi, and Z. Zhao. Wander join: Online aggregation via random walks. In Proc. ACM SIGMOD International Conference on Management of Data, 2016. Google ScholarDigital Library
R. J. Lipton and J. F. Naughton. Query size estimation by adaptive sampling. In Proc. ACM Symposium on Principles of Database Systems, 1990. Google ScholarDigital Library
A. Metwally, D. Agrawal, and A. Abbadi. An integrated efficient solution for computing frequent and top-k elements in data streams. ACM Transactions on Database Systems, 31(3):1095--1133, 2006. Google ScholarDigital Library
F. Rusu and A. Dobra. Pseudo-random number generation for sketch-based estimations. ACM Transactions on Database Systems, 32(2), Article 11, 2007. Google ScholarDigital Library
F. Rusu and A. Dobra. Sketches for size of join estimation. ACM Transactions on Database Systems, 33:1--46, 2008. Google ScholarDigital Library
M. Thorup and Y. Zhang. Tabulation based 4-universal hashing with applications to second moment estimation.Google Scholar
D. Vengerov, A. C. Menck, M. Zait, and S. P. Chakkappen. Join size estimation subject to filter conditions. Proceedings of the VLDB Endowment, 8(12):1530--1541, 2015. Google ScholarDigital Library
J. S. Vitter. Random sampling with a reservoir. ACM Transactions on Mathematical Software (TOMS), 11(1):37--57, 1985. Google ScholarDigital Library

Index Terms

Two-Level Sampling for Join Size Estimation
1. Information systems
  1. Information systems applications
    1. Data mining

Recommendations

Wander Join: Online Aggregation via Random Walks
SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data

Joins are expensive, and online aggregation over joins was proposed to mitigate the cost, which offers users a nice and flexible tradeoff between query efficiency and accuracy in a continuous, online fashion. However, the state-of-the-art approach, in ...
Read More
Query Size Estimation for Joins Using Systematic Sampling

We propose a new approach to the estimation of query result sizes for join queries. The technique, which we have called “systematic sampling—SYSSMP”, is a novel variant of the sampling-based approach. A key novelty of the systematic sampling is that it ...
Read More
Bifocal sampling for skew-resistant join size estimation
SIGMOD '96: Proceedings of the 1996 ACM SIGMOD international conference on Management of data

This paper introduces bifocal sampling, a new technique for estimating the size of an equi-join of two relations. Bifocal sampling classifies tuples in each relation into two groups, sparse and dense, based on the number of tuples with the same join ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
SIGMOD '17: Proceedings of the 2017 ACM International Conference on Management of Data
May 2017
1810 pages
ISBN:9781450341974
DOI:10.1145/3035918
General Chairs:
Rada Chirkova
North Carolina State University, USA
,
Jun Yang
Duke University, USA
,
Program Chair:
Dan Suciu
University of Washington, USA
Copyright © 2017 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 9 May 2017
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
joins
sampling
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate785of4,003submissions,20%
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 31
  Total Citations
  View Citations
- 1,574
  Total Downloads
- Downloads (Last 12 months)212
- Downloads (Last 6 weeks)25
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Two-Level Sampling for Join Size Estimation

SIGMOD '17: Proceedings of the 2017 ACM International Conference on Management of Data

ABSTRACT

References

Cited By

Index Terms

Recommendations

Wander Join: Online Aggregation via Random Walks

Query Size Estimation for Joins Using Systematic Sampling

Bifocal sampling for skew-resistant join size estimation