ABSTRACT
Join size estimation is a critical step in query optimization, and has been extensively studied in the literature. Among the many techniques, sampling based approaches are particularly appealing, due to their ability to handle arbitrary selection predicates. In this paper, we propose a new sampling algorithm for join size estimation, called two-level sampling, which combines the advantages of three previous sampling methods while making further improvements. Both analytical and empirical comparisons show that the new algorithm outperforms all the previous algorithms on a variety of joins, including primary key-foreign key joins, many-to-many joins, and multi-table joins. The new sampling algorithm is also very easy to implement, requiring just one pass over the data. It only relies on some basic statistical information about the data, such as the ℓk-norms and the heavy hitters.
- S. Acharya, P. B. Gibbons, V. Poosala, and S. Ramaswamy. Join synopses for approximate query answering. In ACM SIGMOD Record, volume 28, pages 275--286. ACM, 1999. Google ScholarDigital Library
- S. Agarwal, B. Mozafari, A. Panda, H. Milner, S. Madden, and I. Stoica. Blinkdb: queries with bounded errors and bounded response times on very large data. In Proceedings of the 8th ACM European Conference on Computer Systems, pages 29--42. ACM, 2013. Google ScholarDigital Library
- N. Alon, P. B. Gibbons, Y. Matias, and M. Szegedy. Tracking join and self-join sizes in limited storage. In Proceedings of the eighteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, PODS '99, pages 10--20, New York, NY, USA, 1999. ACM. Google ScholarDigital Library
- K. Chakrabarti, M. Garofalakis, R. Rastogi, and K. Shim. Approximate query processing using wavelets. The VLDB Journal, 10(2--3):199--223, 2001. Google ScholarDigital Library
- S. Chu, M. Balazinska, and D. Suciu. From theory to practice: Efficient join query evaluation in a parallel database system. In Proc. ACM SIGMOD International Conference on Management of Data, 2015. Google ScholarDigital Library
- G. Cormode and M. Garofalakis. Sketching streams through the net: Distributed approximate query tracking. In Proc. International Conference on Very Large Data Bases, 2005. Google ScholarDigital Library
- G. Cormode and M. Hadjieleftheriou. Finding frequent items in data streams. In Proc. International Conference on Very Large Data Bases, 2008. Google ScholarDigital Library
- G. Cormode and S. Muthukrishnan. An improved data stream summary: The count-min sketch and its applications. Journal of Algorithms, 55(1):58--75, 2005. Google ScholarDigital Library
- A. Dobra, M. Garofalakis, J. Gehrke, and R. Rastogi. Processing complex aggregate queries over data streams. In Proceedings of the 2002 ACM SIGMOD international conference on Management of data, pages 61--72. ACM, 2002. Google ScholarDigital Library
- C. Estan and J. F. Naughton. End-biased samples for join cardinality estimation. In Data Engineering, 2006. ICDE'06. Proceedings of the 22nd International Conference on, pages 20--20. IEEE, 2006. Google ScholarDigital Library
- S. Ganguly, P. B. Gibbons, Y. Matias, and A. Silberschatz. Bifocal sampling for skew-resistant join size estimation. In ACM SIGMOD Record, volume 25, pages 271--281. ACM, 1996. Google ScholarDigital Library
- P. J. Haas. Large-sample and deterministic confidence intervals for online aggregation. In Proc. Ninth Intl. Conf. Scientific and Statistical Database Management, 1997. Google ScholarDigital Library
- S. Kandula, A. Shanbhag, A. Vitorovic, M. Olma, R. Grandl, S. Chaudhuri, and B. Ding. Quickr: Lazily approximating complex ad-hoc queries in big data clusters. In Proc. ACM SIGMOD International Conference on Management of Data, 2016. Google ScholarDigital Library
- H. Kwak, C. Lee, H. Park, and S. Moon. What is twitter, a social network or a news media? In WWW, 2010. Google ScholarDigital Library
- F. Li, B. Wu, K. Yi, and Z. Zhao. Wander join: Online aggregation via random walks. In Proc. ACM SIGMOD International Conference on Management of Data, 2016. Google ScholarDigital Library
- R. J. Lipton and J. F. Naughton. Query size estimation by adaptive sampling. In Proc. ACM Symposium on Principles of Database Systems, 1990. Google ScholarDigital Library
- A. Metwally, D. Agrawal, and A. Abbadi. An integrated efficient solution for computing frequent and top-k elements in data streams. ACM Transactions on Database Systems, 31(3):1095--1133, 2006. Google ScholarDigital Library
- F. Rusu and A. Dobra. Pseudo-random number generation for sketch-based estimations. ACM Transactions on Database Systems, 32(2), Article 11, 2007. Google ScholarDigital Library
- F. Rusu and A. Dobra. Sketches for size of join estimation. ACM Transactions on Database Systems, 33:1--46, 2008. Google ScholarDigital Library
- M. Thorup and Y. Zhang. Tabulation based 4-universal hashing with applications to second moment estimation.Google Scholar
- D. Vengerov, A. C. Menck, M. Zait, and S. P. Chakkappen. Join size estimation subject to filter conditions. Proceedings of the VLDB Endowment, 8(12):1530--1541, 2015. Google ScholarDigital Library
- J. S. Vitter. Random sampling with a reservoir. ACM Transactions on Mathematical Software (TOMS), 11(1):37--57, 1985. Google ScholarDigital Library
Index Terms
- Two-Level Sampling for Join Size Estimation
Recommendations
Wander Join: Online Aggregation via Random Walks
SIGMOD '16: Proceedings of the 2016 International Conference on Management of DataJoins are expensive, and online aggregation over joins was proposed to mitigate the cost, which offers users a nice and flexible tradeoff between query efficiency and accuracy in a continuous, online fashion. However, the state-of-the-art approach, in ...
Query Size Estimation for Joins Using Systematic Sampling
We propose a new approach to the estimation of query result sizes for join queries. The technique, which we have called “systematic sampling—SYSSMP”, is a novel variant of the sampling-based approach. A key novelty of the systematic sampling is that it ...
Bifocal sampling for skew-resistant join size estimation
SIGMOD '96: Proceedings of the 1996 ACM SIGMOD international conference on Management of dataThis paper introduces bifocal sampling, a new technique for estimating the size of an equi-join of two relations. Bifocal sampling classifies tuples in each relation into two groups, sparse and dense, based on the number of tuples with the same join ...
Comments