skip to main content
article

Approximation and streaming algorithms for histogram construction problems

Published: 01 March 2006 Publication History

Abstract

Histograms and related synopsis structures are popular techniques for approximating data distributions. These have been successful in query optimization and a variety of applications, including approximate querying, similarity searching, and data mining, to name a few. Histograms were a few of the earliest synopsis structures proposed and continue to be used widely. The histogram construction problem is to construct the best histogram restricted to a space bound that reflects the data distribution most accurately under a given error measure.The histograms are used as quick and easy estimates. Thus, a slight loss of accuracy, compared to the optimal histogram under the given error measure, can be offset by fast histogram construction algorithms. A natural question arises in this context: Can we find a fast near optimal approximation algorithm for the histogram construction problem? In this article, we give the first linear time (1+ϵ)-factor approximation algorithms (for any ϵ > 0) for a large number of histogram construction problems including the use of piecewise small degree polynomials to approximate data, workloads, etc. Several of our algorithms extend to data streams.Using synthetic and real-life data sets, we demonstrate that in many scenarios the approximate histograms are almost identical to optimal histograms in quality and are significantly faster to construct.

References

[1]
Acharya, S., Gibbons, P., Poosala, V., and Ramaswamy, S. 1999. The aqua approximate query answering system. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 574--576.
[2]
Alon, N., Matias, Y., and Szegedy, M. 1999. The space complexity of approximating the frequency moments. J. Comput. Syst. Sci. 58, 1, 137--147.
[3]
Bertolotto, M. and Egenhofer, M. J. 1999. Progressive vector transmission. In Proceedings of the 7th ACM Symposium on Advances in Geographical Information Systems. ACM, New York, 152--157.
[4]
Borodin, A. and El-Yaniv, R. 1998. Online Computation and Competitive Analysis. Cambridge University Press, Cambridge, MA.
[5]
Datar, M., Gionis, A., Indyk, P., and Motwani, R. 2002. Maintaining stream statistics over sliding windows. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA 2002). ACM, New York, 635--644.
[6]
Donjerkovic, D., Ioannidis, Y. E., and Ramakrishnan, R. 1999. Dynamic histograms: Capturing evolving data sets. CS-TR 99-1396, University of Wisconsin, Madison, WI. Mar.
[7]
Feigenbaum, J., Kannan, S., Strauss, M., and Viswanathan, M. 2002. An approximate l1-difference algorithm for massive data streams. SIAM J. Comput. 32, 1, 131--151.
[8]
Garofalakis, M. N. and Gibbons, P. B. 2002. Wavelet synopses with error guarantees. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 476--487.
[9]
Garofalakis, M. N. and Gibbons, P. B. 2004. Probabilistic wavelet synopses. ACM Trans. Datab. Syst. 29, 43--90.
[10]
Gibbons, P. and Matias, Y. 1999. Synopsis data structures for massive data sets. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA 1999). ACM, New York, 909--910.
[11]
Gilbert, A. C., Guha, S., Indyk, P., Kotidis, Y., Muthukrishnan, S., and Strauss, M. 2002. Fast, small-space algorithms for approximate histogram maintenance. In Proceedings of the ACM Symposium on Theory of Computing. ACM, New York, 389--398.
[12]
Gilbert, A. C., Kotidis, Y., Muthukrishnan, S., and Strauss, M. 2001. Optimal and approximate computation of summary statistics for range aggregates. In Proceedings of the ACM Symposium on Principles of Database Systems. ACM, New York.
[13]
Greenwald, M. and Khanna, S. 2001. Space-efficient online computation of quantile summaries. In Proceedings of the ACM SIGMOD Conference. ACM, New York.
[14]
Guha, S. 2005. Space efficiency in synopsis construction problems. In Proceedings of the VLDB Conference. 409--420.
[15]
Guha, S. and Harb, B. 2006. Approximation algorithms for wavelet transform coding of data streams. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA 2006). ACM, New York.
[16]
Guha, S., Indyk, P., Muthukrishnan, S., and Strauss, M. 2002a. Histogramming data streams with fast per-item processing. In Proceedings of the ICALP. 681--692.
[17]
Guha, S., Koudas, N., and Srivastava, D. 2002b. Fast algorithms for hierarchical range histogram construction. In Proceedings of the ACM Symposium on Principles of Database Systems. ACM, New York, 180--187.
[18]
Guha, S. and Koudas, N. 2002. Approximating a data stream for querying and estimation: Algorithms and performance evaluation. In Proceedings of the IEEE 18th International Conference on Data Engineering (ICDE'02). IEEE Computer. Society Press, Los Alamitos, CA, 567--576.
[19]
Guha, S., Koudas, N., and Shim, K. 2001. Data streams and histograms. In Proceedings of the ACM Symposium on Theory of Computing. ACM, New York, 471--475.
[20]
Guha, S., Shim, K., and Woo, J. 2004. REHIST: Relative error histogram construction algorithms. In Proceedings of the VLDB Conference, 300--311.
[21]
Hochbaum, D. 1996. Approximation Algorithms for NP Hard Problems. Brooks/Cole Pubishing Company.
[22]
Indyk, P. 2000. Stable distributions, pseudorandom generators, embeddings and data stream computation. In Proceedings of the IEEE Symposium on Foundations of Computer Science. IEEE Computer Society Press, Los Alamitos, CA, 189--197.
[23]
Ioannidis, Y. and Poosala, V. 1995. Balancing histogram optimality and practicality for query result size estimation. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 233--244.
[24]
Ioannidis, Y. E. 1993. Universality of serial histograms. In Proceedings of the VLDB Conference, 256--267.
[25]
Ioannidis, Y. E. 2003. The history of histograms (abridged). In Proceedings of the VLDB Conference, 19--30.
[26]
Jagadish, H. V., Koudas, N., Muthukrishnan, S., Poosala, V., Sevcik, K. C., and Suel, T. 1998. Optimal histograms with quality guarantees. In Proceedings of the VLDB Conference, 275--286.
[27]
Keogh, E., Chakrabati, K., Mehrotra, S., and Pazzani, M. 2002. Locally adaptive dimensionality reduction for indexing large time series databases. ACM Trans. Datab. Syst. 27, 2, 188--228.
[28]
Kooi, R. 1980. The optimization of queries in relational databases. Case Western Reserve University.
[29]
Koudas, N., Muthukrishnan, S., and Srivastava, D. 2000. Optimal histograms for hierarchical range queries. In Proceedings of the ACM Principles of Database Systems. ACM, New York, 196--204.
[30]
Manku, G. S., Rajagopalan, S., and Lindsay, B. 1998. Approximate medians and other quantiles in one pass and with limited memory. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 426--435.
[31]
Matias, Y., Vitter, J. S., and Wang, M. 1998. Wavelet-based histograms for selectivity estimation. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 448--459.
[32]
Muralikrishna, M. and DeWitt, D. J. 1988. Equi-depth histograms for estimating selectivity factors for multidimensional queries. In Proceedings of the ACM SIGMOD, Conference. ACM, New York, 28--36.
[33]
Muthukrishnan, S. and Strauss, M. 2004. Approximate histogram and wavelet summaries of streaming data. DIMACS TR 52.
[34]
Poosala, V., Ioannidis, Y., Haas, P., and Shekita, E. 1996. Improved histograms for selectivity estimation of range predicates. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 294--305.
[35]
Selinger, P. G., Astrahan, M. M., Chamberlin, D. D., Lorie, R. A., and Price, T. G. 1979. Access path selection in a relational database management system. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 23--34.
[36]
Thaper, N., Guha, S., Indyk, P., and Koudas, N. 2002. Dynamic multidimensional histograms. In Proceedings of the SIGMOD Conference. ACM, New York, 428--439.
[37]
Vazirani, V. 2001. Approximation Algorithms. Springer-Verlag, New York.

Cited By

View all
  • (2025)Finding Coherent Node Groups in Directed GraphsMachine Learning and Principles and Practice of Knowledge Discovery in Databases10.1007/978-3-031-74633-8_36(486-497)Online publication date: 1-Jan-2025
  • (2024)Enabling Adaptive Sampling for Intra-Window Join: Simultaneously Optimizing Quantity and QualityProceedings of the ACM on Management of Data10.1145/36771342:4(1-31)Online publication date: 30-Sep-2024
  • (2024)Testing Closeness of Multivariate Distributions via Ramsey TheoryProceedings of the 56th Annual ACM Symposium on Theory of Computing10.1145/3618260.3649657(340-347)Online publication date: 10-Jun-2024
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Transactions on Database Systems
ACM Transactions on Database Systems  Volume 31, Issue 1
March 2006
438 pages
ISSN:0362-5915
EISSN:1557-4644
DOI:10.1145/1132863
Issue’s Table of Contents

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 March 2006
Published in TODS Volume 31, Issue 1

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Data Streams
  2. approximation algorithm
  3. histograms

Qualifiers

  • Article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)29
  • Downloads (Last 6 weeks)4
Reflects downloads up to 07 Jan 2025

Other Metrics

Citations

Cited By

View all
  • (2025)Finding Coherent Node Groups in Directed GraphsMachine Learning and Principles and Practice of Knowledge Discovery in Databases10.1007/978-3-031-74633-8_36(486-497)Online publication date: 1-Jan-2025
  • (2024)Enabling Adaptive Sampling for Intra-Window Join: Simultaneously Optimizing Quantity and QualityProceedings of the ACM on Management of Data10.1145/36771342:4(1-31)Online publication date: 30-Sep-2024
  • (2024)Testing Closeness of Multivariate Distributions via Ramsey TheoryProceedings of the 56th Annual ACM Symposium on Theory of Computing10.1145/3618260.3649657(340-347)Online publication date: 10-Jun-2024
  • (2024)Learning complex predicates for cardinality estimation using recursive neural networksInformation Systems10.1016/j.is.2024.102402124(102402)Online publication date: Sep-2024
  • (2024)Recurrent segmentation meets block models in temporal networksMachine Language10.1007/s10994-023-06507-6113:8(5623-5653)Online publication date: 1-Aug-2024
  • (2024)Flexible grouping of linear segments for highly accurate lossy compression of time series dataThe VLDB Journal — The International Journal on Very Large Data Bases10.1007/s00778-024-00862-z33:5(1569-1589)Online publication date: 1-Sep-2024
  • (2023)Volume Exploration Using Multidimensional Bhattacharyya FlowIEEE Transactions on Visualization and Computer Graphics10.1109/TVCG.2021.312791829:3(1651-1663)Online publication date: 1-Mar-2023
  • (2023)JanusAQP: Efficient Partition Tree Maintenance for Dynamic Approximate Query Processing2023 IEEE 39th International Conference on Data Engineering (ICDE)10.1109/ICDE55515.2023.00050(572-584)Online publication date: Apr-2023
  • (2023)Synthesize high-dimensional longitudinal electronic health records via hierarchical autoregressive language modelNature Communications10.1038/s41467-023-41093-014:1Online publication date: 31-Aug-2023
  • (2023)Column-coherent matrix decompositionData Mining and Knowledge Discovery10.1007/s10618-023-00954-437:6(2564-2588)Online publication date: 11-Aug-2023
  • Show More Cited By

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media