ABSTRACT
Clustering of streaming sensor data aims at providing online summaries of the observed stream. This task is mostly done under limited processing and storage resources. This makes the sensed stream speed (data per time) a sensitive restriction when designing stream clustering algorithms. Additionally, the varying speed of the stream is a natural characteristic of sensor data, e.g. changing the sampling rate upon detecting an event or for a certain time. In such cases, most clustering algorithms have to heavily restrict their model size such that they can handle the minimal time allowance. Recently the first anytime stream clustering algorithm has been proposed that flexibly uses all available time and dynamically adapts its model size. However, the method was not designed to precisely cluster sensor data which are usually noisy and extremely evolving. In this paper we detail the LiarTree algorithm that provides precise stream summaries and effectively handles noise, drift and novelty. We prove that the runtime of the LiarTree is logarithmic in the size of the maintained model opposed to a linear time complexity often observed in previous approaches. We demonstrate in an extensive experimental evaluation using synthetic and real sensor datasets that the LiarTree outperforms competing approaches in terms of the quality of the resulting summaries and exposes only a logarithmic time complexity.
- Physiological Sensor Dataset in PDMC (ICML 2004 workshop) http://www.cs.utexas.edu/~sherstov/pdmc/.Google Scholar
- C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu. A framework for clustering evolving data streams. In VLDB, pages 81--92, 2003. Google ScholarDigital Library
- B. Arai, G. Das, D. Gunopulos, and N. Koudas. Anytime measures for top-k algorithms. In VLDB, pages 914--925, 2007. Google ScholarDigital Library
- N. L. Bowers, H. U. Gerber, J. C. Hickman, D. A. Jones, and C. J. Nesbitt. Actuarial Mathematics. Society of Actuaries, Itasca, IL, 1997.Google Scholar
- F. Cao, M. Ester, W. Qian, and A. Zhou. Density-based clustering over an evolving data stream with noise. In SDM, 2006.Google ScholarCross Ref
- Y. Chen and L. Tu. Density-based clustering for real-time stream data. In KDD, pages 133--142, 2007. Google ScholarDigital Library
- D. DeCoste. Anytime interval-valued outputs for kernel machines: Fast support vector machine classification via distance geometry. In ICML, 2002. Google ScholarDigital Library
- M. Hassani, E. Müller, and T. Seidl. EDISKCO: energy efficient distributed in-sensor-network k-center clustering with outliers. In Proc. Sensor KDD 2009, pages 39--48, 2009. Google ScholarDigital Library
- S. Hettich and S. Bay. The UCI KDD archive http://kdd.ics.uci.edu, 1999.Google Scholar
- A. Jain, Z. Zhang, and E. Y. Chang. Adaptive non-linear clustering in data streams. In CIKM, pages 122--131, 2006. Google ScholarDigital Library
- P. Kranen, I. Assent, C. Baldauf, and T. Seidl. Self-adaptive anytime stream clustering. In IEEE ICDM, pages 249--258, 2009. Google ScholarDigital Library
- P. Kranen, I. Assent, C. Baldauf, and T. Seidl. The clustree: Indexing micro-clusters for anytime stream mining. In KAIS Journal, 2010.Google Scholar
- P. Kranen, S. Günnemann, S. Fries, and T. Seidl. MC-tree: Improving bayesian anytime classification. In 22nd SSDBM, Springer LNCS, 2010. Google ScholarDigital Library
- P. Kranen, F. Reidl, F. S. Villaamil, and T. Seidl. Hierarchical clustering for real-time stream data with noise. In SSDBM (to appear), 2011. Google ScholarDigital Library
- P. Kranen and T. Seidl. Harnessing the strengths of anytime algorithms for constant data streams. DMKD Journal (19)2, ECML PKDD Special Issue, 2009. Google ScholarDigital Library
- G. Lin and L. Chen. A grid and fractal dimension-based data stream clustering algorithm. In ISISE, volume 1, pages 66--70, 2008. Google ScholarDigital Library
- T. Seidl, I. Assent, P. Kranen, R. Krieger, and J. Herrmann. Indexing density models for incremental learning and anytime classification on data streams. In EDBT/ICDT, 2009. Google ScholarDigital Library
- J. Shieh and E. Keogh. Polishing the right apple: Anytime classification also benefits data streams with constant arrival times. In Proc. of ICDM, 2010. Google ScholarDigital Library
- Z. F. Siddiqui and M. Spiliopoulou. Combining multiple interrelated streams for incremental clustering. In SSDBM, pages 535--552, 2009. Google ScholarDigital Library
- W. N. Street and Y. Kim. A streaming ensemble algorithm (sea) for large-scale classification. In Proc. of the 7th ACM KDD, pages 377--382, 2001. Google ScholarDigital Library
- K. Ueno, X. Xi, E. J. Keogh, and D.-J. Lee. Anytime classification using the nearest neighbor algorithm with applications to stream mining. In ICDM, 2006. Google ScholarDigital Library
- H. Wang, W. Fan, P. S. Yu, and J. Han. Mining concept-drifting data streams using ensemble classifiers. In Proc. of the 9th ACM KDD, pages 226--235, 2003. Google ScholarDigital Library
- Y. Yang, G. I. Webb, K. B. Korb, and K. M. Ting. Classifying under computational resource constraints: anytime classification using probabilistic estimators. Machine Learning, 69(1), 2007. Google ScholarDigital Library
- L. Ye, X. Wang, E. J. Keogh, and A. Mafra-Neto. Autocannibalistic and anyspace indexing algorithms with application to sensor data mining. In SDM, pages 85--96, 2009.Google ScholarCross Ref
- T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: an efficient data clustering method for very large databases. In SIGMOD, 1996. Google ScholarDigital Library
Index Terms
- Precise anytime clustering of noisy sensor data with logarithmic complexity
Recommendations
Data stream clustering: A survey
Data stream mining is an active research area that has recently emerged to discover knowledge from large amounts of continuously generated data. In this context, several data stream clustering algorithms have been proposed to perform unsupervised ...
Self-Adaptive Anytime Stream Clustering
ICDM '09: Proceedings of the 2009 Ninth IEEE International Conference on Data MiningClustering streaming data requires algorithms which are capable of updating clustering results for the incoming data. As data is constantly arriving, time for processing is limited. Clustering has to be performed in a single pass over the incoming data ...
Subspace anytime stream clustering
SSDBM '14: Proceedings of the 26th International Conference on Scientific and Statistical Database ManagementClustering of high dimensional streaming data is an emerging field of research. A real life data stream imposes many challenges on the clustering task, as an endless amount of data arrives constantly. A lot of research has been done in the full space ...
Comments