skip to main content
research-article

Approximate continuous querying over distributed streams

Published: 24 June 2008 Publication History

Abstract

While traditional database systems optimize for performance on one-shot query processing, emerging large-scale monitoring applications require continuous tracking of complex data-analysis queries over collections of physically distributed streams. Thus, effective solutions have to be simultaneously space/time efficient (at each remote monitor site), communication efficient (across the underlying communication network), and provide continuous, guaranteed-quality approximate query answers. In this paper, we propose novel algorithmic solutions for the problem of continuously tracking a broad class of complex aggregate queries in such a distributed-streams setting. Our tracking schemes maintain approximate query answers with provable error guarantees, while simultaneously optimizing the storage space and processing time at each remote site, and the communication cost across the network. In a nutshell, our algorithms rely on tracking general-purpose randomized sketch summaries of local streams at remote sites along with concise prediction models of local site behavior in order to produce highly communication- and space/time-efficient solutions. The end result is a powerful approximate query tracking framework that readily incorporates several complex analysis queries (including distributed join and multi-join aggregates, and approximate wavelet representations), thus giving the first known low-overhead tracking solution for such queries in the distributed-streams model. Experiments with real data validate our approach, revealing significant savings over naive solutions as well as our analytical worst-case guarantees.

References

[1]
Alon, N., Gibbons, P. B., Matias, Y., and Szegedy, M. 1999. Tracking join and self-join sizes in limited storage. In Proceedings of the 18th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems. Philadeplphia, PA.
[2]
Alon, N., Matias, Y., and Szegedy, M. 1996. The space complexity of approximating the frequency moments. In Proceedings of the 28th Annual ACM Symposium on the Theory of Computing. Philadelphia, PA, 20--29.
[3]
Babcock, B. and Olston, C. 2003. Distributed top-K monitoring. In Proceedings of the ACM SIGMOD International Conference on Management of Data. San Diego, CA.
[4]
Box, G. and Jenkins, G. 1970. Time Series Analaysis: Forecasting and Control. Holden-Day.
[5]
Charikar, M., Chen, K., and Farach-Colton, M. 2002. Finding frequent items in data streams. In Proceedings of the International Colloquium on Automata, Languages, and Programming. Malaga, Spain.
[6]
Chu, D., Deshpande, A., Hellerstein, J. M., and Hong, W. 2006. Approximate data collection in sensor networks using probabilistic models. In Proceedings of the 22nd International Conference on Data Engineering. Atlanta, GA.
[7]
Cormode, G. and Garofalakis, M. 2005. Sketching streams through the net: Distributed approximate query tracking. In Proceedings of the 31st International Conference on Very Large Data Bases. Trondheim, Norway.
[8]
Cormode, G., Garofalakis, M., Muthukrishnan, S., and Rastogi, R. 2005. Holistic aggregates in a networked world: Distributed tracking of approximate quantiles. In Proceedings of the ACM SIGMOD International Conference on Management of Data. Baltimore, MD.
[9]
Cormode, G. and Muthukrishnan, S. 2003. What's hot and what's not: Tracking most frequent items dynamically. In Proceedings of the 22nd ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems. San Diego, CA, 296--306.
[10]
Cormode, G. and Muthukrishnan, S. 2004. An improved data stream summary: The count-min sketch and its applications. Latin American Informatics. 29--38.
[11]
Cranor, C., Johnson, T., Spatscheck, O., and Shkapenyuk, V. 2003. Gigascope: A stream database for network applications. In Proceedings of the ACM SIGMOD International Conference on Management of Data. San Diego, CA.
[12]
Das, A., Ganguly, S., Garofalakis, M., and Rastogi, R. 2004. Distributed set-expression cardinality estimation. In Proceedings of the 30th International Conference on Very Large Data Bases. Toronto, Canada.
[13]
Datar, M., Gionis, A., Indyk, P., and Motwani, R. 2002. Maintaining stream statistics over sliding windows. In Proceedings of the 13th Annual ACM-SIAM Symposium on Discrete Algorithms. San Francisco, CA, 635--644.
[14]
Deshpande, A., Guestrin, C., Madden, S. R., Hellerstein, J. M., and Hong, W. 2004. Model-driven data acquisition in sensor networks. In Proceedings of the 30th International Conference on Very Large Data Bases. Toronto, Canada.
[15]
Dobra, A., Garofalakis, M., Gehrke, J., and Rastogi, R. 2002. Processing complex aggregate queries over data streams. In Proceedings of the ACM SIGMOD International Conference on Management of Data. Madison, WI, 61--72.
[16]
Ganguly, S., Garofalakis, M., and Rastogi, R. 2003. Processing set expressions over continuous update streams. In Proceedings of the ACM SIGMOD International Conference on Management of Data. San Diego, CA.
[17]
Ganguly, S., Garofalakis, M., and Rastogi, R. 2004. Processing data-stream join aggregates using skimmed sketches. In Proceedings of the 9th International Conference on Extending Database Technology (EDBT'04). Heraklion-Crete, Greece.
[18]
Gibbons, P. B. 2001. Distinct sampling for highly accurate answers to distinct values queries and event reports. In Proceedings of the 27th International Conference on Very Large Data Bases. Roma, Italy.
[19]
Gilbert, A. C., Kotidis, Y., Muthukrishnan, S., and Strauss, M. J. 2001. Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. In Proceedings of the 27th International Conference on Very Large Data Bases. Roma, Italy.
[20]
Greenwald, M. B. and Khanna, S. 2001. Space-efficient online computation of quantile summaries. In Proceedings of the ACM SIGMOD International Conference on Management of Data. Santa Barbara, CA.
[21]
Greenwald, M. B. and Khanna, S. 2004. Power-conserving computation of order-statistics over sensor networks. In Proceedings of the 23rd ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems. Paris, France.
[22]
Hardy, G., Littlewood, J., and Pólya, G. 1988. Inequalities, 2nd Ed. Cambridge University Press.
[23]
Huang, L., Garofalakis, M., Joseph, A. D., and Taft, N. 2007a. Communication-efficient tracking of distributed cumulative triggers. In Proceedings of the 27th IEEE International Conference on Distributed Computing Systems (ICDCS'07). Toronto, Canada.
[24]
Huang, L., Nguyen, X., Garofalakis, M., Hellerstein, J. M., Jordan, M. I., Joseph, A. D., and Taft, N. 2007b. Communication-efficient online detection of network-wide anomalies. In Proceedings of IEEE INFOCOM. Anchorage, AK.
[25]
Jain, A., Chang, E. Y., and Wang, Y.-F. 2004. Adaptive stream resource management using Kalman filters. In Proceedings of the ACM SIGMOD International Conference on Management of Data. Paris, France.
[26]
Keralapura, R., Cormode, G., and Ramamirtham, J. 2006. Communication-efficient distributed monitoring of thresholded counts. In Proceedings of the ACM SIGMOD International Conference on Management of Data. Chicago, IL, 289--300.
[27]
Madden, S. R., Franklin, M. J., Hellerstein, J. M., and Hong, W. 2003. The design of an acquisitional query processor for sensor networks. In Proceedings of the ACM SIGMOD International Conference on Management of Data. San Diego, CA.
[28]
Manjhi, A., Shkapenyuk, V., Dhamdhere, K., and Olston, C. 2005. Finding (recently) frequent items in distributed data streams. In Proceedings of the 21st International Conference on Data Engineering. Tokyo, Japan.
[29]
Manku, G. S. and Motwani, R. 2002. Approximate frequency counts over data streams. In Proceedings of the 28th International Conference on Very Large Data Bases. Hong Kong, China, 346--357.
[30]
Motwani, R. and Raghavan, P. 1995. Randomized Algorithms. Cambridge University Press.
[31]
Olston, C., Jiang, J., and Widom, J. 2003. Adaptive filters for continuous queries over distributed data streams. In Proceedings of the ACM SIGMOD International Conference on Management of Data. San Diego, CA.
[32]
Sharfman, I., Schuster, A., and Keren, D. 2006. A geometric approach to monitoring threshold functions over distributed data streams. In Proceedings of the ACM SIGMOD International Conference on Management of Data. Chicago, IL, 301--312.
[33]
Thaper, N., Guha, S., Indyk, P., and Koudas, N. 2002. Dynamic multidimensional histograms. In Proceedings of the ACM SIGMOD International Conference on Management of Data. Madison, WI, 428--439.
[34]
Tulone, D. and Madden, S. 2006. Paq: Time series forecasting for approximate query answering in sensor networks. In the European Conference on Wireless Sensor Networks (EWSN).

Cited By

View all
  • (2023)And synopses for all: A synopses data engine for extreme scale analytics-as-a-serviceInformation Systems10.1016/j.is.2023.102221116(102221)Online publication date: Jun-2023
  • (2021)PriStream: Privacy-preserving distributed stream monitoring of thresholded PERCENTILE statisticsIEEE INFOCOM 2016 - The 35th Annual IEEE International Conference on Computer Communications10.1109/INFOCOM.2016.7524461(1-9)Online publication date: 10-Mar-2021
  • (2020)Elementary Students’ Understanding of CS TermsACM Transactions on Computing Education10.1145/338636420:3(1-19)Online publication date: 16-Jun-2020
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Transactions on Database Systems
ACM Transactions on Database Systems  Volume 33, Issue 2
June 2008
309 pages
ISSN:0362-5915
EISSN:1557-4644
DOI:10.1145/1366102
Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 24 June 2008
Accepted: 01 January 2008
Revised: 01 August 2007
Received: 01 January 2007
Published in TODS Volume 33, Issue 2

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Continuous distributed monitoring
  2. approximate query processing
  3. data stream algorithms
  4. data synopses

Qualifiers

  • Research-article
  • Research
  • Refereed

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)15
  • Downloads (Last 6 weeks)0
Reflects downloads up to 07 Mar 2025

Other Metrics

Citations

Cited By

View all
  • (2023)And synopses for all: A synopses data engine for extreme scale analytics-as-a-serviceInformation Systems10.1016/j.is.2023.102221116(102221)Online publication date: Jun-2023
  • (2021)PriStream: Privacy-preserving distributed stream monitoring of thresholded PERCENTILE statisticsIEEE INFOCOM 2016 - The 35th Annual IEEE International Conference on Computer Communications10.1109/INFOCOM.2016.7524461(1-9)Online publication date: 10-Mar-2021
  • (2020)Elementary Students’ Understanding of CS TermsACM Transactions on Computing Education10.1145/338636420:3(1-19)Online publication date: 16-Jun-2020
  • (2020)Logic-Based Approach to Incremental Monitoring and Optimization on Strongly Distributed Data StreamsFoundations of Information and Knowledge Systems10.1007/978-3-030-39951-1_15(242-262)Online publication date: 17-Feb-2020
  • (2019)An Instruction Set Architecture for Machine LearningACM Transactions on Computer Systems10.1145/333146936:3(1-35)Online publication date: 13-Aug-2019
  • (2019)High-Availability at Massive Scale: Building Google’s Data Infrastructure for AdsReal-Time Business Intelligence and Analytics10.1007/978-3-030-24124-7_5(63-81)Online publication date: 11-Oct-2019
  • (2018)Lightweight Monitoring of Distributed StreamsACM Transactions on Database Systems10.1145/322611343:2(1-37)Online publication date: 31-Jul-2018
  • (2018)Finding Persistent Items in Distributed DatasetsIEEE INFOCOM 2018 - IEEE Conference on Computer Communications10.1109/INFOCOM.2018.8486425(1403-1411)Online publication date: Apr-2018
  • (2018)Scalable approximate query tracking over highly distributed data streams with tunable accuracy guaranteesInformation Systems10.1016/j.is.2018.05.00176(59-87)Online publication date: Jul-2018
  • (2018)Monitoring distributed fragmented skylinesDistributed and Parallel Databases10.1007/s10619-018-7223-736:4(675-715)Online publication date: 1-Dec-2018
  • Show More Cited By

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media