skip to main content
10.1145/2939672.2939836acmconferencesArticle/Chapter ViewAbstractPublication PageskddConference Proceedingsconference-collections
research-article

Fast Unsupervised Online Drift Detection Using Incremental Kolmogorov-Smirnov Test

Authors Info & Claims
Published:13 August 2016Publication History

ABSTRACT

Data stream research has grown rapidly over the last decade. Two major features distinguish data stream from batch learning: stream data are generated on the fly, possibly in a fast and variable rate; and the underlying data distribution can be non-stationary, leading to a phenomenon known as concept drift. Therefore, most of the research on data stream classification focuses on proposing efficient models that can adapt to concept drifts and maintain a stable performance over time. However, specifically for the classification task, the majority of such methods rely on the instantaneous availability of true labels for all already classified instances. This is a strong assumption that is rarely fulfilled in practical applications. Hence there is a clear need for efficient methods that can detect concept drifts in an unsupervised way. One possibility is the well-known Kolmogorov-Smirnov test, a statistical hypothesis test that checks whether two samples differ. This work has two main contributions. The first one is the Incremental Kolmogorov-Smirnov algorithm that allows performing the Kolmogorov-Smirnov hypothesis test instantly using two samples that change over time, where the change is an insertion and/or removal of an observation. Our algorithm employs a randomized tree and is able to perform the insertion and removal operations in O(log N) with high probability and calculate the Kolmogorov-Smirnov test in O(1), where N is the number of sample observations. This is a significant speed-up compared to the O(N log N) cost of the non-incremental implementation. The second contribution is the use of the Incremental Kolmogorov-Smirnov test to detect concept drifts without true labels. Classification algorithms adapted to use the test rely on a limited portion of those labels just to update the classification model after a concept drift is detected.

References

  1. R. Al-Otaibi, R. B. Prudêncio, M. Kull, and P. Flach. Versatile decision trees for learning over multiple contexts. In ECML/PKDD, pages 184--199. 2015.Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. M. Amir and D. Toshniwal. Instance-based classification of streaming data using emerging patterns. In ICT, pages 228--236, 2010.Google ScholarGoogle Scholar
  3. C. R. Aragon and R. G. Seidel. Randomized search trees. In FOCS, pages 540--545. IEEE, 1989. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. G. E. A. P. A. Batista, E. J. Keogh, A. M. Neto, and E. Rowton. Sensors and software to allow computational entomology, an emerging application of data mining. In KDD, pages 761--764, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. A. Bifet and R. Gavalda. Learning from time-changing data with adaptive windowing. In SDM, pages 443--448, 2007.Google ScholarGoogle ScholarCross RefCross Ref
  6. A. Bifet and R. Gavaldà. Adaptive learning from evolving data streams. In IDA, pages 249--260. 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. A. Bifet, G. Holmes, and B. Pfahringer. Leveraging bagging for evolving data streams. In ECML/PKDD, pages 135--150. 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. A. Dal Pozzolo, G. Boracchi, O. Caelen, C. Alippi, and G. Bontempi. Credit card fraud detection and concept-drift adaptation with delayed supervised information. In IJCNN, 2015.Google ScholarGoogle ScholarCross RefCross Ref
  9. V. M. De Souza, D. F. Silva, G. E. Batista, et al. Classification of data streams applied to insect recognition: Initial results. In BRACIS, pages 76--81. IEEE, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. L. Devroye. A note on the height of binary search trees. JACM, 33(3):489--498, 1986. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. D. dos Reis. Fast unsupervised online drift detection using incremental kolmogorov-smirnov test, online supplementary material. https://github.com/denismr/incremental-ks, 2016.Google ScholarGoogle Scholar
  12. K. B. Dyer, R. Capo, and R. Polikar. Compose: A semisupervised learning framework for initially labeled nonstationary streaming data. TNNLS, 2013.Google ScholarGoogle Scholar
  13. H. Fanaee-T and J. a. Gama. Event labeling combining ensemble detectors and background knowledge. Progress in Artificial Intelligence, pages 113--127, 2013.Google ScholarGoogle Scholar
  14. J. Gama, P. Medas, G. Castillo, and P. Rodrigues. Learning with drift detection. In SBIA, pages 286--295. 2004.Google ScholarGoogle ScholarCross RefCross Ref
  15. S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering data streams. In FOCS, pages 359--366, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. N. Hammami and M. Bedda. Improved tree model for arabic speech recognition. In ICCSIT, volume 5, pages 521--526, 2010.Google ScholarGoogle ScholarCross RefCross Ref
  17. A. Haque, L. Khan, and M. Baron. Semi supervised adaptive framework for classifying evolving data stream. In PAKDD, pages 383--394. 2015.Google ScholarGoogle ScholarCross RefCross Ref
  18. D. J. Hill and B. S. Minsker. Anomaly detection in streaming environmental sensor data: A data-driven modeling approach. Environmental Modelling & Software, 25(9):1014--1022, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. B. Kaluža, V. Mirchevska, E. Dovgan, M. Luštrek, and M. Gams. An agent-based approach to care in independent living. In AmI, pages 177--186. 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. L. I. Kuncheva et al. Nearest neighbour classifiers for streaming data with delayed labelling. In ICDM, pages 869--874. IEEE, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. R. H. Lopes. Kolmogorov-smirnov test. In International Encyclopedia of Statistical Science, pages 718--720. Springer, 2011.Google ScholarGoogle ScholarCross RefCross Ref
  22. M. M. Masud, J. Gao, L. Khan, J. Han, and B. Thuraisingham. Classification and novel class detection in concept-drifting data streams under time constraints. TKDE, 23(6):859--874, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. M. M. Masud, C. Woolam, J. Gao, L. Khan, J. Han, K. W. Hamlen, and N. C. Oza. Facing the reality of data stream classification: coping with scarcity of labeled data. KAIS, 33(1):213--244, 2012.Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. N. C. Oza. Online bagging and boosting. In SMC, volume 3, pages 2340--2345. IEEE, 2005.Google ScholarGoogle ScholarCross RefCross Ref
  25. R. Seidel and C. R. Aragon. Randomized search trees. Algorithmica, 16(4--5):464--497, 1996.Google ScholarGoogle Scholar
  26. V. M. Souza, D. F. Silva, J. a. Gama, and G. E. Batista. Data stream classification guided by clustering on nonstationary environments and extreme verification latency. In SDM, pages 873--881. SIAM, 2015.Google ScholarGoogle Scholar
  27. X. Wu, P. Li, and X. Hu. Learning from concept drifting data streams with unlabeled data. Neurocomputing, 92:145--155, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. I. Zliobaite. Change with delayed labeling: when is it detectable? In ICDMW, pages 843--850. IEEE, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Fast Unsupervised Online Drift Detection Using Incremental Kolmogorov-Smirnov Test

        Recommendations

        Comments

        Login options

        Check if you have access through your login credentials or your institution to get full access on this article.

        Sign in
        • Published in

          cover image ACM Conferences
          KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
          August 2016
          2176 pages
          ISBN:9781450342322
          DOI:10.1145/2939672

          Copyright © 2016 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 13 August 2016

          Permissions

          Request permissions about this article.

          Request Permissions

          Check for updates

          Qualifiers

          • research-article

          Acceptance Rates

          KDD '16 Paper Acceptance Rate66of1,115submissions,6%Overall Acceptance Rate1,133of8,635submissions,13%

          Upcoming Conference

        PDF Format

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader