Abstract
Many commercial database systems maintain histograms to summarize the contents of relations and permit efficient estimation of query result sizes and access plan costs. Although several types of histograms have been proposed in the past, there has never been a systematic study of all histogram aspects, the available choices for each aspect, and the impact of such choices on histogram effectiveness. In this paper, we provide a taxonomy of histograms that captures all previously proposed histogram types and indicates many new possibilities. We introduce novel choices for several of the taxonomy dimensions, and derive new histogram types by combining choices in effective ways. We also show how sampling techniques can be used to reduce the cost of histogram construction. Finally, we present results from an empirical study of the proposed histogram types used in selectivity estimation of range predicates and identify the histogram types that have the best overall performance.
- CR94 Cr M Chen and N. Roussopoulos. Adaptive selectivity estimation using query feedback. Proc. ofACM SIG- MOD Conf, pages 16 I-I 72, May 1994. Google ScholarDigital Library
- dB78 C. de Boor. A practical guide to splines. Springer- Verlag, New York, 1978.Google ScholarCross Ref
- dB95 C. de Boor. Private communication, I995.Google Scholar
- GS91 A.P. Gurajada andJ. Srivastava. Equidepth partitioning of a data set based on finding its medians. Proc. Applied Computing Symposium, pages 92-101, 1991.Google ScholarCross Ref
- HNSS95 P. J. Haas, J. E Naughton, S. Seshadri, and L. Stokes. Sampling-based estimation of the number of distinct values of an attribute. Proc. of the 21st Int. Conf on Very Large Databases, pages 311-322, 1995. Google ScholarDigital Library
- Hoe63 W. Hoeffding. Probability inequalities for sums of bounded random variables. J. Amer Statist. Assoco, 58:13-30, 1963.Google ScholarCross Ref
- HS95 P.J. Haas and A. N. Swami. Sampling-based selectivity estimation for joins using augmented frequent value statistics. Proc. of IEEE Conf. on Data Engineering, pages 522-531, 1995. Google ScholarDigital Library
- IC91 Y. Ioannidis and S. Christodoulakis. On the propagation of errors in the size of join results. Proc. of A CM SIGMOD Conf, pages 268-277, 1991. Google ScholarDigital Library
- IC93 Y. Ioannidis and S. Christodoulakis. Optimal histograms for limiting worst-case error propagation in the size of join results. ACM TODS, 1993. Google ScholarDigital Library
- IK90 Y. Ioannidis and Y. Kang. Randomized algorithms for optimizing large join queries. In Proc. of the 1990 A CM-SIGMOD Conference on the Management of Data, pages 312-321, Atlantic City, NJ, May 1990. Google ScholarDigital Library
- Ioa93 Y. loannidis. Universality of serial histograms. Proc. of the 19th Int. Conf. on Vet3, Large Databases, pages 256-267, December 1993. Google ScholarDigital Library
- IP95 Y. loannidis and V. Poosala. Balancing histogram optimality and practicality for query result size estimation. Proc. of A CM SIGMOD Conf, pages 233-244, May 1995. Google ScholarDigital Library
- JC85 R. Jain and I. Chlamtac. The p2 algorithm for dynamic calculation of quantiles and histograms without sorting observations. Communications of the A CM, pages 1076-1085. Oct 1985. Google ScholarDigital Library
- Kol41 A.N. Kolmogorov. Confidence limits for an unknown distribution function. Ann. Math. Statist., 12:461--463, 1941.Google ScholarCross Ref
- Koo80 R.P. Kooi. The optimization of queries in relational databases. PhD thesis, Case Western Reserver University, Sept 1980. Google ScholarDigital Library
- LNS90 R.J. Lipton, J. E Naughton, and D. A. Schneider. Practical selectivity estimation through adaptive sampling. Proc. of ACM SIGMOD Conf, pages 1-11, May 1990. Google ScholarDigital Library
- Mas51 E J. Massey. The Kolmogorov-Smirnov test for goodness-of-fit. J. Amer. Statist. Assoc., 46:68-78, 1951.Google ScholarCross Ref
- MCS88 M.V. Mannino, E Chu, and T. Sager. Statistical profile estimation in database systems. ACM Computing Surveys, 20(3): 192-221, Sept 1988. Google ScholarDigital Library
- MD88 M. Muralikrishna and David J Dewltt. Equi-depth histograms for estimating selectivity factors for multidimensional queries. Proc. of ACM SIGMOD Con f~ pages 28-36, 1988. Google ScholarDigital Library
- PSC84 G. Piatetsky-Shapiro and C. ConneI1. Accurate estimation of the number of tuples satisfying a condition, Proc. of ACM SIGMOD Conf, pages 256-276, 1984. Google ScholarDigital Library
- Raa87 K.E.E. Raatikainen. Simultaneous estimation of several percentiles. Simulation. 49:159-164, October 1987. Google ScholarDigital Library
- SAC+79 P.G. Selinger, M.M. Astrahan, D.D. Chamberlin, R A. Lorie, and T.T. Price. Access path selection ~n a relational database management system. Proc. of ACM SIGMOD Conf, pages 23-34, 1979. Google ScholarDigital Library
- SG88 A. Swami and A. Gupta. Optimization of large join queries. In Proc. of the 1988 ACM-SIGMOD Conference on the Management of Data, pages 8-17, Chicago, IL, June 1988. Google ScholarDigital Library
- SLRD93 W. Sun, Y. Ling, N. Rishe, and Y. Deng. An instant and accurate size estimation method for joins and selections in a retrieval-intensive environment. Proc. of ACM SIGMOD Conf, pages 79-88, 1993. Google ScholarDigital Library
- Vit85 J.S. Vitter. Random sampling with a reservoir A CM Trans. Math. Software, 11:37-57, 1985. Google ScholarDigital Library
- Zip49 G.K. Zipf. Human behavtour and the principle of teast effort. Addison-Wesley, Reading, MA, 1949.Google Scholar
Index Terms
- Improved histograms for selectivity estimation of range predicates
Recommendations
Selectivity estimation for range predicates using lightweight models
Query optimizers depend on selectivity estimates of query predicates to produce a good execution plan. When a query contains multiple predicates, today's optimizers use a variety of assumptions, such as independence between predicates, to estimate ...
Improved histograms for selectivity estimation of range predicates
SIGMOD '96: Proceedings of the 1996 ACM SIGMOD international conference on Management of dataMany commercial database systems maintain histograms to summarize the contents of relations and permit efficient estimation of query result sizes and access plan costs. Although several types of histograms have been proposed in the past, there has never ...
Compressed histograms with arbitrary bucket layouts for selectivity estimation
Selectivity estimation is an important step of query optimization in a database management system, and multi-dimensional histogram techniques have proved promising for selectivity estimation. Recent multi-dimensional histogram techniques such as GenHist ...
Comments