ABSTRACT
Data stream research has grown rapidly over the last decade. Two major features distinguish data stream from batch learning: stream data are generated on the fly, possibly in a fast and variable rate; and the underlying data distribution can be non-stationary, leading to a phenomenon known as concept drift. Therefore, most of the research on data stream classification focuses on proposing efficient models that can adapt to concept drifts and maintain a stable performance over time. However, specifically for the classification task, the majority of such methods rely on the instantaneous availability of true labels for all already classified instances. This is a strong assumption that is rarely fulfilled in practical applications. Hence there is a clear need for efficient methods that can detect concept drifts in an unsupervised way. One possibility is the well-known Kolmogorov-Smirnov test, a statistical hypothesis test that checks whether two samples differ. This work has two main contributions. The first one is the Incremental Kolmogorov-Smirnov algorithm that allows performing the Kolmogorov-Smirnov hypothesis test instantly using two samples that change over time, where the change is an insertion and/or removal of an observation. Our algorithm employs a randomized tree and is able to perform the insertion and removal operations in O(log N) with high probability and calculate the Kolmogorov-Smirnov test in O(1), where N is the number of sample observations. This is a significant speed-up compared to the O(N log N) cost of the non-incremental implementation. The second contribution is the use of the Incremental Kolmogorov-Smirnov test to detect concept drifts without true labels. Classification algorithms adapted to use the test rely on a limited portion of those labels just to update the classification model after a concept drift is detected.
- R. Al-Otaibi, R. B. Prudêncio, M. Kull, and P. Flach. Versatile decision trees for learning over multiple contexts. In ECML/PKDD, pages 184--199. 2015.Google ScholarDigital Library
- M. Amir and D. Toshniwal. Instance-based classification of streaming data using emerging patterns. In ICT, pages 228--236, 2010.Google Scholar
- C. R. Aragon and R. G. Seidel. Randomized search trees. In FOCS, pages 540--545. IEEE, 1989. Google ScholarDigital Library
- G. E. A. P. A. Batista, E. J. Keogh, A. M. Neto, and E. Rowton. Sensors and software to allow computational entomology, an emerging application of data mining. In KDD, pages 761--764, 2011. Google ScholarDigital Library
- A. Bifet and R. Gavalda. Learning from time-changing data with adaptive windowing. In SDM, pages 443--448, 2007.Google ScholarCross Ref
- A. Bifet and R. Gavaldà. Adaptive learning from evolving data streams. In IDA, pages 249--260. 2009. Google ScholarDigital Library
- A. Bifet, G. Holmes, and B. Pfahringer. Leveraging bagging for evolving data streams. In ECML/PKDD, pages 135--150. 2010. Google ScholarDigital Library
- A. Dal Pozzolo, G. Boracchi, O. Caelen, C. Alippi, and G. Bontempi. Credit card fraud detection and concept-drift adaptation with delayed supervised information. In IJCNN, 2015.Google ScholarCross Ref
- V. M. De Souza, D. F. Silva, G. E. Batista, et al. Classification of data streams applied to insect recognition: Initial results. In BRACIS, pages 76--81. IEEE, 2013. Google ScholarDigital Library
- L. Devroye. A note on the height of binary search trees. JACM, 33(3):489--498, 1986. Google ScholarDigital Library
- D. dos Reis. Fast unsupervised online drift detection using incremental kolmogorov-smirnov test, online supplementary material. https://github.com/denismr/incremental-ks, 2016.Google Scholar
- K. B. Dyer, R. Capo, and R. Polikar. Compose: A semisupervised learning framework for initially labeled nonstationary streaming data. TNNLS, 2013.Google Scholar
- H. Fanaee-T and J. a. Gama. Event labeling combining ensemble detectors and background knowledge. Progress in Artificial Intelligence, pages 113--127, 2013.Google Scholar
- J. Gama, P. Medas, G. Castillo, and P. Rodrigues. Learning with drift detection. In SBIA, pages 286--295. 2004.Google ScholarCross Ref
- S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering data streams. In FOCS, pages 359--366, 2000. Google ScholarDigital Library
- N. Hammami and M. Bedda. Improved tree model for arabic speech recognition. In ICCSIT, volume 5, pages 521--526, 2010.Google ScholarCross Ref
- A. Haque, L. Khan, and M. Baron. Semi supervised adaptive framework for classifying evolving data stream. In PAKDD, pages 383--394. 2015.Google ScholarCross Ref
- D. J. Hill and B. S. Minsker. Anomaly detection in streaming environmental sensor data: A data-driven modeling approach. Environmental Modelling & Software, 25(9):1014--1022, 2010. Google ScholarDigital Library
- B. Kaluža, V. Mirchevska, E. Dovgan, M. Luštrek, and M. Gams. An agent-based approach to care in independent living. In AmI, pages 177--186. 2010. Google ScholarDigital Library
- L. I. Kuncheva et al. Nearest neighbour classifiers for streaming data with delayed labelling. In ICDM, pages 869--874. IEEE, 2008. Google ScholarDigital Library
- R. H. Lopes. Kolmogorov-smirnov test. In International Encyclopedia of Statistical Science, pages 718--720. Springer, 2011.Google ScholarCross Ref
- M. M. Masud, J. Gao, L. Khan, J. Han, and B. Thuraisingham. Classification and novel class detection in concept-drifting data streams under time constraints. TKDE, 23(6):859--874, 2011. Google ScholarDigital Library
- M. M. Masud, C. Woolam, J. Gao, L. Khan, J. Han, K. W. Hamlen, and N. C. Oza. Facing the reality of data stream classification: coping with scarcity of labeled data. KAIS, 33(1):213--244, 2012.Google ScholarDigital Library
- N. C. Oza. Online bagging and boosting. In SMC, volume 3, pages 2340--2345. IEEE, 2005.Google ScholarCross Ref
- R. Seidel and C. R. Aragon. Randomized search trees. Algorithmica, 16(4--5):464--497, 1996.Google Scholar
- V. M. Souza, D. F. Silva, J. a. Gama, and G. E. Batista. Data stream classification guided by clustering on nonstationary environments and extreme verification latency. In SDM, pages 873--881. SIAM, 2015.Google Scholar
- X. Wu, P. Li, and X. Hu. Learning from concept drifting data streams with unlabeled data. Neurocomputing, 92:145--155, 2012. Google ScholarDigital Library
- I. Zliobaite. Change with delayed labeling: when is it detectable? In ICDMW, pages 843--850. IEEE, 2010. Google ScholarDigital Library
Index Terms
- Fast Unsupervised Online Drift Detection Using Incremental Kolmogorov-Smirnov Test
Recommendations
Bayesian Nonparametric Unsupervised Concept Drift Detection for Data Stream Mining
Regular PapersOnline data stream mining is of great significance in practice because of its ubiquity in many real-world scenarios, especially in the big data era. Traditional data mining algorithms cannot be directly applied to data streams due to (1) the possible ...
Unsupervised Concept Drift Detection with a Discriminative Classifier
CIKM '19: Proceedings of the 28th ACM International Conference on Information and Knowledge ManagementIn data stream mining, one of the biggest challenges is to develop algorithms that deal with the changing data. As data evolve over time, static models become outdated. This phenomenon is called concept drift, and it is investigated extensively in the ...
Wilcoxon Rank Sum Test Drift Detector
Online learning regards extracting information from large quantities of data (streams) usually affected by changes in the distribution (concept drift). Drift detectors are software that estimate the positions of these changes to substitute the base ...
Comments