skip to main content
10.1145/1376616.1376657acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article

Sampling time-based sliding windows in bounded space

Published:09 June 2008Publication History

ABSTRACT

Random sampling is an appealing approach to build synopses of large data streams because random samples can be used for a broad spectrum of analytical tasks. Users are often interested in analyzing only the most recent fraction of the data stream in order to avoid outdated results. In this paper, we focus on sampling schemes that sample from a sliding window over a recent time interval; such windows are a popular and highly comprehensible method to model recency. In this setting, the main challenge is to guarantee an upper bound on the space consumption of the sample while using the allotted space efficiently at the same time. The difficulty arises from the fact that the number of items in the window is unknown in advance and may vary significantly over time, so that the sampling fraction has to be adjusted dynamically. We consider uniform sampling schemes, which produce each sample of the same size with equal probability, and stratified sampling schemes, in which the window is divided into smaller strata and a uniform sample is maintained per stratum. For uniform sampling, we prove that it is impossible to guarantee a minimum sample size in bounded space. We then introduce a novel sampling scheme called bounded priority sampling (BPS), which requires only bounded space. We derive a lower bound on the expected sample size and show that BPS quickly adapts to changing data rates. For stratified sampling, we propose a merge-based stratification scheme (MBS), which maintains strata of approximately equal size. Compared to naive stratification, MBS has the advantage that the sample is evenly distributed across the window, so that no part of the window is over- or underrepresented. We conclude the paper with a feasibility study of our algorithms on large real-world datasets.

References

  1. Charu C. Aggarwal. On biased reservoir sampling in the presence of stream evolution. In Proc. VLDB, pages 607--618, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Brian Babcock, Mayur Datar, and Rajeev Motwani. Sampling from a moving window over streaming data. In Proc. SODA, pages 633--634, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Kevin Beyer, Peter J. Haas, Berthold Reinwald, Yannis Sismanis, and Rainer Gemulla. On synopses for distinct-value estimation under multiset operations. In Proc. SIGMOD, pages 199--210, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Paul G. Brown and Peter J. Haas. Techniques for warehousing of sample data. In Proc. ICDE, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Moses Charikar, Surajit Chaudhuri, Rajeev Motwani, and Vivek Narasayya. Towards estimation error guarantees for distinct values. In Proc. PODS, pages 268--279, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Mayur Datar, Aristides Gionis, Piotr Indyk, and Rajeev Motwani. Maintaining stream statistics over sliding windows. SIAM J. Comput., 31(6):1794--1813, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Rainer Gemulla, Wolfgang Lehner, and Peter J. Haas. Maintaining bounded-size sample synopses of evolving datasets. The VLDB Journal, 17(2):173--201, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Phillip B. Gibbons, Yossi Matias, and Viswanath Poosala. Fast incremental maintenance of approximate histograms. In Proc. VLDB, pages 466--475, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Peter J. Haas. Data stream sampling: Basic techniques and results. In Data Stream Management: Processing High Speed Data Streams. Springer, 2006.Google ScholarGoogle Scholar
  10. H. V. Jagadish, Nick Koudas, S. Muthukrishnan, Viswanath Poosala, Kenneth C. Sevcik, and Torsten Suel. Optimal histograms with quality guarantees. In Proc. VLDB, pages 275--286, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Suman Nath, Phillip B. Gibbons, Srinivasan Seshan, and Zachary R. Anderson. Synopsis diffusion for robust aggregation in sensor networks. In Proc. SenSys, pages 250--262, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Carl-Erik Särndal, Bengt Swensson, and Jan Wretman. Model Assisted Survey Sampling. Springer Series in Statistics. Springer, 1991.Google ScholarGoogle Scholar
  13. Raimund Seidel and Cecilia R. Aragon. Randomized search trees. Algorithmica, 16(4/5):464--497, 1996.Google ScholarGoogle ScholarCross RefCross Ref
  14. Jeffrey Scott Vitter. Faster methods for random sampling. Commun. ACM, 27(7):703--718, 1984. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Jeffrey Scott Vitter. Random sampling with a reservoir. ACM TOMS, 11(1):37--57, 1985. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Sampling time-based sliding windows in bounded space

          Recommendations

          Comments

          Login options

          Check if you have access through your login credentials or your institution to get full access on this article.

          Sign in
          • Published in

            cover image ACM Conferences
            SIGMOD '08: Proceedings of the 2008 ACM SIGMOD international conference on Management of data
            June 2008
            1396 pages
            ISBN:9781605581026
            DOI:10.1145/1376616

            Copyright © 2008 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 9 June 2008

            Permissions

            Request permissions about this article.

            Request Permissions

            Check for updates

            Qualifiers

            • research-article

            Acceptance Rates

            Overall Acceptance Rate785of4,003submissions,20%

          PDF Format

          View or Download as a PDF file.

          PDF

          eReader

          View online with eReader.

          eReader