research-article

Fast Unsupervised Online Drift Detection Using Incremental Kolmogorov-Smirnov Test

Authors:
Denis Moreira dos Reis

Universidade de São Paulo, São Carlos, Brazil

Universidade de São Paulo, São Carlos, Brazil
View Profile

,
Peter Flach

University of Bristol, Bristol, United Kingdom

University of Bristol, Bristol, United Kingdom
View Profile

,
Stan Matwin

Polish Academy of Sciences, Halifax, Canada

Polish Academy of Sciences, Halifax, Canada
View Profile

,
Gustavo Batista

ICMC-USP (Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo), São Carlos, Brazil

ICMC-USP (Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo), São Carlos, Brazil
View Profile

KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data MiningAugust 2016Pages 1545–1554https://doi.org/10.1145/2939672.2939836

Published:13 August 2016Publication History

KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Pages 1545–1554

ABSTRACT

Data stream research has grown rapidly over the last decade. Two major features distinguish data stream from batch learning: stream data are generated on the fly, possibly in a fast and variable rate; and the underlying data distribution can be non-stationary, leading to a phenomenon known as concept drift. Therefore, most of the research on data stream classification focuses on proposing efficient models that can adapt to concept drifts and maintain a stable performance over time. However, specifically for the classification task, the majority of such methods rely on the instantaneous availability of true labels for all already classified instances. This is a strong assumption that is rarely fulfilled in practical applications. Hence there is a clear need for efficient methods that can detect concept drifts in an unsupervised way. One possibility is the well-known Kolmogorov-Smirnov test, a statistical hypothesis test that checks whether two samples differ. This work has two main contributions. The first one is the Incremental Kolmogorov-Smirnov algorithm that allows performing the Kolmogorov-Smirnov hypothesis test instantly using two samples that change over time, where the change is an insertion and/or removal of an observation. Our algorithm employs a randomized tree and is able to perform the insertion and removal operations in O(log N) with high probability and calculate the Kolmogorov-Smirnov test in O(1), where N is the number of sample observations. This is a significant speed-up compared to the O(N log N) cost of the non-incremental implementation. The second contribution is the use of the Incremental Kolmogorov-Smirnov test to detect concept drifts without true labels. Classification algorithms adapted to use the test rely on a limited portion of those labels just to update the classification model after a concept drift is detected.

References

R. Al-Otaibi, R. B. Prudêncio, M. Kull, and P. Flach. Versatile decision trees for learning over multiple contexts. In ECML/PKDD, pages 184--199. 2015.Google ScholarDigital Library
M. Amir and D. Toshniwal. Instance-based classification of streaming data using emerging patterns. In ICT, pages 228--236, 2010.Google Scholar
C. R. Aragon and R. G. Seidel. Randomized search trees. In FOCS, pages 540--545. IEEE, 1989. Google ScholarDigital Library
G. E. A. P. A. Batista, E. J. Keogh, A. M. Neto, and E. Rowton. Sensors and software to allow computational entomology, an emerging application of data mining. In KDD, pages 761--764, 2011. Google ScholarDigital Library
A. Bifet and R. Gavalda. Learning from time-changing data with adaptive windowing. In SDM, pages 443--448, 2007.Google ScholarCross Ref
A. Bifet and R. Gavaldà. Adaptive learning from evolving data streams. In IDA, pages 249--260. 2009. Google ScholarDigital Library
A. Bifet, G. Holmes, and B. Pfahringer. Leveraging bagging for evolving data streams. In ECML/PKDD, pages 135--150. 2010. Google ScholarDigital Library
A. Dal Pozzolo, G. Boracchi, O. Caelen, C. Alippi, and G. Bontempi. Credit card fraud detection and concept-drift adaptation with delayed supervised information. In IJCNN, 2015.Google ScholarCross Ref
V. M. De Souza, D. F. Silva, G. E. Batista, et al. Classification of data streams applied to insect recognition: Initial results. In BRACIS, pages 76--81. IEEE, 2013. Google ScholarDigital Library
L. Devroye. A note on the height of binary search trees. JACM, 33(3):489--498, 1986. Google ScholarDigital Library
D. dos Reis. Fast unsupervised online drift detection using incremental kolmogorov-smirnov test, online supplementary material. https://github.com/denismr/incremental-ks, 2016.Google Scholar
K. B. Dyer, R. Capo, and R. Polikar. Compose: A semisupervised learning framework for initially labeled nonstationary streaming data. TNNLS, 2013.Google Scholar
H. Fanaee-T and J. a. Gama. Event labeling combining ensemble detectors and background knowledge. Progress in Artificial Intelligence, pages 113--127, 2013.Google Scholar
J. Gama, P. Medas, G. Castillo, and P. Rodrigues. Learning with drift detection. In SBIA, pages 286--295. 2004.Google ScholarCross Ref
S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering data streams. In FOCS, pages 359--366, 2000. Google ScholarDigital Library
N. Hammami and M. Bedda. Improved tree model for arabic speech recognition. In ICCSIT, volume 5, pages 521--526, 2010.Google ScholarCross Ref
A. Haque, L. Khan, and M. Baron. Semi supervised adaptive framework for classifying evolving data stream. In PAKDD, pages 383--394. 2015.Google ScholarCross Ref
D. J. Hill and B. S. Minsker. Anomaly detection in streaming environmental sensor data: A data-driven modeling approach. Environmental Modelling & Software, 25(9):1014--1022, 2010. Google ScholarDigital Library
B. Kaluža, V. Mirchevska, E. Dovgan, M. Luštrek, and M. Gams. An agent-based approach to care in independent living. In AmI, pages 177--186. 2010. Google ScholarDigital Library
L. I. Kuncheva et al. Nearest neighbour classifiers for streaming data with delayed labelling. In ICDM, pages 869--874. IEEE, 2008. Google ScholarDigital Library
R. H. Lopes. Kolmogorov-smirnov test. In International Encyclopedia of Statistical Science, pages 718--720. Springer, 2011.Google ScholarCross Ref
M. M. Masud, J. Gao, L. Khan, J. Han, and B. Thuraisingham. Classification and novel class detection in concept-drifting data streams under time constraints. TKDE, 23(6):859--874, 2011. Google ScholarDigital Library
M. M. Masud, C. Woolam, J. Gao, L. Khan, J. Han, K. W. Hamlen, and N. C. Oza. Facing the reality of data stream classification: coping with scarcity of labeled data. KAIS, 33(1):213--244, 2012.Google ScholarDigital Library
N. C. Oza. Online bagging and boosting. In SMC, volume 3, pages 2340--2345. IEEE, 2005.Google ScholarCross Ref
R. Seidel and C. R. Aragon. Randomized search trees. Algorithmica, 16(4--5):464--497, 1996.Google Scholar
V. M. Souza, D. F. Silva, J. a. Gama, and G. E. Batista. Data stream classification guided by clustering on nonstationary environments and extreme verification latency. In SDM, pages 873--881. SIAM, 2015.Google Scholar
X. Wu, P. Li, and X. Hu. Learning from concept drifting data streams with unlabeled data. Neurocomputing, 92:145--155, 2012. Google ScholarDigital Library
I. Zliobaite. Change with delayed labeling: when is it detectable? In ICDMW, pages 843--850. IEEE, 2010. Google ScholarDigital Library

Index Terms

Fast Unsupervised Online Drift Detection Using Incremental Kolmogorov-Smirnov Test

Recommendations

Bayesian Nonparametric Unsupervised Concept Drift Detection for Data Stream Mining
Regular Papers

Online data stream mining is of great significance in practice because of its ubiquity in many real-world scenarios, especially in the big data era. Traditional data mining algorithms cannot be directly applied to data streams due to (1) the possible ...
Read More
Unsupervised Concept Drift Detection with a Discriminative Classifier
CIKM '19: Proceedings of the 28th ACM International Conference on Information and Knowledge Management

In data stream mining, one of the biggest challenges is to develop algorithms that deal with the changing data. As data evolve over time, static models become outdated. This phenomenon is called concept drift, and it is investigated extensively in the ...
Read More
Wilcoxon Rank Sum Test Drift Detector

Online learning regards extracting information from large quantities of data (streams) usually affected by changes in the distribution (concept drift). Drift detectors are software that estimate the positions of these changes to substitute the base ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
August 2016
2176 pages
ISBN:9781450342322
DOI:10.1145/2939672
General Chairs:
Balaji Krishnapuram
IBM
,
Mohak Shah
Bosch
,
Program Chairs:
Alex Smola
Amazon
,
Charu Aggarwal
IBM
,
Dou Shen
Baidu
,
Rajeev Rastogi
Amazon
Copyright © 2016 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 13 August 2016
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
cartesian tree
concept drift
data stream
kolmogorov-smirnov
lazy propagation
treap
Qualifiers
- research-article
Conference

Acceptance Rates
KDD '16 Paper Acceptance Rate66of1,115submissions,6%Overall Acceptance Rate1,133of8,635submissions,13%
More
Upcoming Conference
KDD '24

Sponsor:

sigkdd

sigkdd

KDD '24: The 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

August 25 - 29, 2024

Barcelona , Spain
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 78
  Total Citations
  View Citations
- 840
  Total Downloads
- Downloads (Last 12 months)164
- Downloads (Last 6 weeks)24
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Fast Unsupervised Online Drift Detection Using Incremental Kolmogorov-Smirnov Test

KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

ABSTRACT

References

Cited By

Index Terms

Recommendations

Bayesian Nonparametric Unsupervised Concept Drift Detection for Data Stream Mining

Unsupervised Concept Drift Detection with a Discriminative Classifier

Wilcoxon Rank Sum Test Drift Detector