ABSTRACT
Random sampling is an appealing approach to build synopses of large data streams because random samples can be used for a broad spectrum of analytical tasks. Users are often interested in analyzing only the most recent fraction of the data stream in order to avoid outdated results. In this paper, we focus on sampling schemes that sample from a sliding window over a recent time interval; such windows are a popular and highly comprehensible method to model recency. In this setting, the main challenge is to guarantee an upper bound on the space consumption of the sample while using the allotted space efficiently at the same time. The difficulty arises from the fact that the number of items in the window is unknown in advance and may vary significantly over time, so that the sampling fraction has to be adjusted dynamically. We consider uniform sampling schemes, which produce each sample of the same size with equal probability, and stratified sampling schemes, in which the window is divided into smaller strata and a uniform sample is maintained per stratum. For uniform sampling, we prove that it is impossible to guarantee a minimum sample size in bounded space. We then introduce a novel sampling scheme called bounded priority sampling (BPS), which requires only bounded space. We derive a lower bound on the expected sample size and show that BPS quickly adapts to changing data rates. For stratified sampling, we propose a merge-based stratification scheme (MBS), which maintains strata of approximately equal size. Compared to naive stratification, MBS has the advantage that the sample is evenly distributed across the window, so that no part of the window is over- or underrepresented. We conclude the paper with a feasibility study of our algorithms on large real-world datasets.
- Charu C. Aggarwal. On biased reservoir sampling in the presence of stream evolution. In Proc. VLDB, pages 607--618, 2006. Google ScholarDigital Library
- Brian Babcock, Mayur Datar, and Rajeev Motwani. Sampling from a moving window over streaming data. In Proc. SODA, pages 633--634, 2002. Google ScholarDigital Library
- Kevin Beyer, Peter J. Haas, Berthold Reinwald, Yannis Sismanis, and Rainer Gemulla. On synopses for distinct-value estimation under multiset operations. In Proc. SIGMOD, pages 199--210, 2007. Google ScholarDigital Library
- Paul G. Brown and Peter J. Haas. Techniques for warehousing of sample data. In Proc. ICDE, 2006. Google ScholarDigital Library
- Moses Charikar, Surajit Chaudhuri, Rajeev Motwani, and Vivek Narasayya. Towards estimation error guarantees for distinct values. In Proc. PODS, pages 268--279, 2000. Google ScholarDigital Library
- Mayur Datar, Aristides Gionis, Piotr Indyk, and Rajeev Motwani. Maintaining stream statistics over sliding windows. SIAM J. Comput., 31(6):1794--1813, 2002. Google ScholarDigital Library
- Rainer Gemulla, Wolfgang Lehner, and Peter J. Haas. Maintaining bounded-size sample synopses of evolving datasets. The VLDB Journal, 17(2):173--201, 2007. Google ScholarDigital Library
- Phillip B. Gibbons, Yossi Matias, and Viswanath Poosala. Fast incremental maintenance of approximate histograms. In Proc. VLDB, pages 466--475, 1997. Google ScholarDigital Library
- Peter J. Haas. Data stream sampling: Basic techniques and results. In Data Stream Management: Processing High Speed Data Streams. Springer, 2006.Google Scholar
- H. V. Jagadish, Nick Koudas, S. Muthukrishnan, Viswanath Poosala, Kenneth C. Sevcik, and Torsten Suel. Optimal histograms with quality guarantees. In Proc. VLDB, pages 275--286, 1998. Google ScholarDigital Library
- Suman Nath, Phillip B. Gibbons, Srinivasan Seshan, and Zachary R. Anderson. Synopsis diffusion for robust aggregation in sensor networks. In Proc. SenSys, pages 250--262, 2004. Google ScholarDigital Library
- Carl-Erik Särndal, Bengt Swensson, and Jan Wretman. Model Assisted Survey Sampling. Springer Series in Statistics. Springer, 1991.Google Scholar
- Raimund Seidel and Cecilia R. Aragon. Randomized search trees. Algorithmica, 16(4/5):464--497, 1996.Google ScholarCross Ref
- Jeffrey Scott Vitter. Faster methods for random sampling. Commun. ACM, 27(7):703--718, 1984. Google ScholarDigital Library
- Jeffrey Scott Vitter. Random sampling with a reservoir. ACM TOMS, 11(1):37--57, 1985. Google ScholarDigital Library
Index Terms
- Sampling time-based sliding windows in bounded space
Recommendations
Optimal sampling from sliding windows
PODS '09: Proceedings of the twenty-eighth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systemsA sliding windows model is an important case of the streaming model, where only the most "recent" elements remain active and the rest are discarded in a stream. The sliding windows model is important for many applications (see, e.g., Babcock, Babu, ...
Fast balanced sampling for highly stratified population
Balanced sampling is a very efficient sampling design when the variable of interest is correlated to the auxiliary variables on which the sample is balanced. A procedure to select balanced samples in a stratified population has previously been proposed. ...
Stratified random sampling from streaming and stored data
AbstractStratified random sampling (SRS) is a widely used sampling technique for approximate query processing. We consider SRS on continuously arriving data streams and statically stored data sets. We present a tight lower bound showing that any streaming ...
Comments