skip to main content
10.1145/1142473.1142496acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
Article

Fast range-summable random variables for efficient aggregate estimation

Published: 27 June 2006 Publication History

Abstract

Exact computation for aggregate queries usually requires large amounts of memory - constrained in data-streaming - or communication - constrained in distributed computation - and large processing times. In this situation, approximation techniques with provable guarantees, like sketches, are the only viable solution. The performance of sketches crucially depends on the ability to efficiently generate particular pseudo-random numbers. In this paper we investigate both theoretically and empirically the problem of generating k-wise independent pseudo-random numbers and, in particular, that of generating 3 and 4-wise independent pseudo-random numbers that are fast range-summable (i.e., they can be summed up in sub-linear time). Our specific contributions are: (a) we provide an empirical comparison of the various pseudo-random number generating schemes, (b) we study both theoretically and empirically the fast range-summation practicality for the 3 and 4-wise independent generating schemes and we provide efficient implementations for the 3-wise independent schemes, (c) we show convincing theoretical and empirical evidence that the extended Hamming scheme performs as well as any 4-wise independent scheme for estimating the size of join using AMS-sketches, even though it is only 3-wise independent. We use this generating scheme to produce estimators that significantly out-perform the state-of-the-art solutions for two problems - size of spatial joins and selectivity estimation.

References

[1]
P. Aduri and S. Tirthapura. Range efficient computation of F0 over massive data streams. In ICDE 2005, pages 32--43.
[2]
N. Alon, L. Babai, and A. Itai. A fast and simple randomized parallel algorithm for the maximal independent set problem. J. Algorithms, 7(4):567--583, 1986.
[3]
N. Alon, P. B. Gibbons, Y. Matias, and M. Szegedy. Tracking join and self-join sizes in limited storage. J. Comput. Syst. Sci., 64(3):719--747, 2002.
[4]
N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. In STOC 1996, pages 20--29.
[5]
Z. Bar-Yosseff, R. Kumar, and D. Sivakumar. Reductions in streaming algorithms, with an application to counting triangles in graphs. In SODA 2002, pages 623--632.
[6]
A. R. Calderbank, A. Gilbert, K. Levchenko, S. Muthukrishnan, and M. Strauss. Improved range-summable random variable construction algorithms. In SODA 2005, pages 840--849.
[7]
A. Das, J. Gehrke, and M. Riedewald. Approximation techniques for spatial data. In SIGMOD 2004, pages 695--706.
[8]
A. Dobra, M. Garofalakis, J. Gehrke, and R. Rastogi. Processing complex aggregate queries over data streams. In SIGMOD 2002, pages 61--72.
[9]
A. Ehrenfeucht and M. Karpinski. The computational complexity of (xor-and) counting problems. Technical Report TR-90-033, ICSI, 1990.
[10]
J. Feigenbaum, S. Kannan, M. Strauss, and M. Viswanathan. An approximate L1-difference algorithm for massive data streams. SIAM J. Comput., 32(1):131--151, 2002.
[11]
S. Ganguly, M. Garofalakis, and R. Rastogi. Processing set expressions over continuous update streams. In SIGMOD 2003, pages 265--276.
[12]
A. C. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. J. Strauss. One-pass wavelet decompositions of data streams. IEEE TKDE, 15(3):541--554, 2003.
[13]
A. C. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. J. Strauss. Domain-driven data synopses for dynamic quantiles. IEEE TKDE, 17(7):927--938, 2005.
[14]
A. S. Hedayat, N. J. A. Sloane, and J. Stufken. Orthogonal Arrays: Theory and Applications. Springer-Verlag, 1999.
[15]
H. J. Karloff and Y. Mansour. On construction of k-wise independent random variables. In STOC 1994, pages 564--573.
[16]
M. Karpinski and M. Luby. Approximating the number of solutions of a GF{2} polynomial. Technical Report CS-8544, U. of Bonn, 1990.
[17]
D. Kempe, A. Dobra, and J. Gehrke. Gossip-based computation of aggregate information. In FOCS 2003, pages 482--491.
[18]
K. Levchenko. http://www.cse.ucsd.edu/~klevchen/II-2005.pdf.
[19]
M. Luby and A. Wigderson. Pairwise independence and derandomization. Technical Report UCB/CSD-95-880, EECS UC Berkeley, 1995.
[20]
MassDAL. http://www.cs.rutgers.edu/~muthu/massdal.html.
[21]
F. Rusu and A. Dobra. Fast range-summable random variables for efficient aggregate estimation. Manuscript, 2005.
[22]
N. Thaper, S. Guha, P. Indyk, and N. Koudas. Dynamic multidimensional histograms. In SIGMOD 2002, pages 428--439.

Cited By

View all
  • (2022)Video Monitoring QueriesIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2020.304860634:10(5023-5036)Online publication date: 1-Oct-2022
  • (2020)Video Monitoring Queries2020 IEEE 36th International Conference on Data Engineering (ICDE)10.1109/ICDE48307.2020.00115(1285-1296)Online publication date: Apr-2020
  • (2012)Rectangle-efficient aggregation in spatial data streamsProceedings of the 31st ACM SIGMOD-SIGACT-SIGAI symposium on Principles of Database Systems10.1145/2213556.2213595(283-294)Online publication date: 21-May-2012
  • Show More Cited By

Index Terms

  1. Fast range-summable random variables for efficient aggregate estimation

      Recommendations

      Comments

      Information & Contributors

      Information

      Published In

      cover image ACM Conferences
      SIGMOD '06: Proceedings of the 2006 ACM SIGMOD international conference on Management of data
      June 2006
      830 pages
      ISBN:1595934340
      DOI:10.1145/1142473
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Sponsors

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 27 June 2006

      Permissions

      Request permissions for this article.

      Check for updates

      Author Tags

      1. approximate query processing
      2. data-streaming
      3. sketches

      Qualifiers

      • Article

      Conference

      SIGMOD/PODS06
      Sponsor:

      Acceptance Rates

      Overall Acceptance Rate 785 of 4,003 submissions, 20%

      Contributors

      Other Metrics

      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • Downloads (Last 12 months)4
      • Downloads (Last 6 weeks)1
      Reflects downloads up to 07 Jan 2025

      Other Metrics

      Citations

      Cited By

      View all
      • (2022)Video Monitoring QueriesIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2020.304860634:10(5023-5036)Online publication date: 1-Oct-2022
      • (2020)Video Monitoring Queries2020 IEEE 36th International Conference on Data Engineering (ICDE)10.1109/ICDE48307.2020.00115(1285-1296)Online publication date: Apr-2020
      • (2012)Rectangle-efficient aggregation in spatial data streamsProceedings of the 31st ACM SIGMOD-SIGACT-SIGAI symposium on Principles of Database Systems10.1145/2213556.2213595(283-294)Online publication date: 21-May-2012
      • (2009)Coordinated weighted sampling for estimating aggregates over multiple weight assignmentsProceedings of the VLDB Endowment10.14778/1687627.16877012:1(646-657)Online publication date: 1-Aug-2009
      • (2009)Sketch-Based Summarization of Ordered XML StreamsProceedings of the 2009 IEEE International Conference on Data Engineering10.1109/ICDE.2009.107(541-552)Online publication date: 29-Mar-2009
      • (2007)Statistical analysis of sketch estimatorsProceedings of the 2007 ACM SIGMOD international conference on Management of data10.1145/1247480.1247503(187-198)Online publication date: 11-Jun-2007

      View Options

      Login options

      View options

      PDF

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader

      Media

      Figures

      Other

      Tables

      Share

      Share

      Share this Publication link

      Share on social media