ABSTRACT
Current research on data stream classification mainly focuses on supervised learning, in which a fully labeled data stream is needed for training. However, fully labeled data streams are expensive to obtain, which make the supervised learning approach difficult to be applied to real-life applications. In this paper, we model applications, such as credit fraud detection and intrusion detection, as a one-class data stream classification problem. The cost of fully labeling the data stream is reduced as users only need to provide some positive samples together with the unlabeled samples to the learner. Based on VFDT and POSC4.5, we propose our OcVFDT (One-class Very Fast Decision Tree) algorithm. Experimental study on both synthetic and real-life datasets shows that the OcVFDT has excellent classification performance. Even 80% of the samples in data stream are unlabeled, the classification performance of OcVFDT is still very close to that of VFDT, which is trained on fully labeled stream.
- B. Calvo, P. Larranaga, and J. A. Lozano. Learning Bayesian classifiers from positive and unlabeled examples. Pattern Recognition Letters, 28:2375--2384, 2007. Google ScholarDigital Library
- F. Denis, R. Gilleron, and F. Letouzey. Learning from positive and unlabeled examples. Theoretical Computer Science, pages 70--83, 2005. Google ScholarDigital Library
- T. Dietterich. Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms. Neural Computation, 10(7):1895--1923, 1998. Google ScholarDigital Library
- P. Domingos and G. Hulten. Mining high-speed data streams. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. (SIGKDD'00), pages 71--80. ACM New York, NY, USA, 2000. Google ScholarDigital Library
- C. Elkan and K. Noto. Learning classifiers from only positive and unlabeled data. In Proceedings of the Fourteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. (SIGKDD'08). Google ScholarDigital Library
- U. M. Fayyad and K. B. Irani. On the handling of continuous-valued attributes in decision tree generation. Machine Learning, 8:87--102, 1992. Google ScholarCross Ref
- G. Fung, J. Yu, H. Lu, and P. Yu. Text classification without negative examples revisit. IEEE Transactions on Knowledge and Data Engineering, 18(1):6--20, 2006. Google ScholarDigital Library
- J. Gama, P. Medas, and P. Rodrigues. Learning Decision Trees from Dynamic Data Streams. Journal of Universal Computer Science, 11(8):1353--1366, 2005.Google Scholar
- J. Gama, R. Rocha, and P. Medas. Accurate decision trees for mining high-speed data streams. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. (SIGKDD'03), pages 523--528. ACM Press New York, NY, USA, 2003. Google ScholarDigital Library
- G. Hulten, P. Domingos, and L. Spencer. Mining massive data streams. In The Journal of Machine Learning Research, 2005.Google Scholar
- G. Hulten, L. Spencer, and P. Domingos. Mining time-changing data streams. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. (SIGKDD'01), pages 97--106. ACM New York, NY, USA, 2001. Google ScholarDigital Library
- R. Jin and G. Agrawal. Efficient decision tree construction on streaming data. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. (SIGKDD'03), pages 571--576. ACM New York, NY, USA, 2003. Google ScholarDigital Library
- W. Lee and B. Liu. Learning with Positive and Unlabeled Examples Using Weighted Logistic Regression. In Proceedings of Twentieth International Conference on Machine Learning. (ICML'03), volume 20, page 448, 2003.Google Scholar
- D. Lewis, Y. Yang, T. Rose, and F. Li. RCV1: A New Benchmark Collection for Text Categorization Research. The Journal of Machine Learning Research, 5:361--397, 2004. Google ScholarDigital Library
- B. Liu, Y. Dai, X. Li, W. Lee, and P. Yu. Building text classifiers using positive and unlabeled examples. In Proceedings of the Third IEEE International Conference on Data Mining. (ICDM'03), pages 179--186, 2003. Google ScholarDigital Library
- J. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993. Google ScholarDigital Library
- B. Schölkopf, J. Platt, J. Shawe-Taylor, A. Smola, and R. Williamson. Estimating the Support of a High-Dimensional Distribution. Neural Computation, 13(7):1443--1471, 2001. Google ScholarDigital Library
- W. Street and Y. Kim. A streaming ensemble algorithm (SEA) for large-scale classification. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. (SIGKDD'01), pages 377--382. ACM New York, NY, USA, 2001. Google ScholarDigital Library
- H. Wang, W. Fan, P. Yu, and J. Han. Mining concept-drifting data streams using ensemble classifiers. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. (SIGKDD'03), pages 226--235. ACM New York, NY, USA, 2003. Google ScholarDigital Library
- H. Yu. Single-Class Classification with Mapping Convergence. Machine Learning, 61(1):49--69, 2005. Google ScholarDigital Library
- H. Yu, J. Han, and K. Chang. PEBL: web page classification without negative examples. IEEE Transactions on Knowledge and Data Engineering, 16(1):70--81, 2004. Google ScholarDigital Library
- Y. Zhang and X. Jin. An automatic construction and organization strategy for ensemble learning on data streams. ACM SIGMOD Record, 35(3):28--33, 2006. Google ScholarDigital Library
- Y. Zhang, X. Li, and M. Orlowska. One-Class Classification of Text Streams with Concept Drift. In Proceedings of the Third IEEE International Conference on Data Mining Workshops. (ICDMW'08), pages 116--125, 2008. Google ScholarDigital Library
- X. Zhu, X. Wu, and Y. Yang. Dynamic Classifier Selection for Effective Mining from Noisy Data Streams. In Proceedings of the Fourth IEEE International Conference on Data Mining. (ICDM'04), pages 305--312, 2004. Google ScholarDigital Library
Index Terms
- OcVFDT: one-class very fast decision tree for one-class classification of data streams
Recommendations
Extremely Fast Decision Tree
KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data MiningWe introduce a novel incremental decision tree learning algorithm, Hoeffding Anytime Tree, that is statistically more efficient than the current state-of-the-art, Hoeffding Tree. We demonstrate that an implementation of Hoeffding Anytime Tree---"...
Mining time-changing data streams
KDD '01: Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data miningMost statistical and machine-learning algorithms assume that the data is a random sample drawn from a stationary distribution. Unfortunately, most of the large databases available for mining today violate this assumption. They were gathered over months ...
Moderated VFDT in stream mining using adaptive tie threshold and incremental pruning
DaWaK'11: Proceedings of the 13th international conference on Data warehousing and knowledge discoveryVery Fast Decision Tree (VFDT) is one of the most popular decision tree algorithms in data stream mining. The tree building process is based on the principle of the Hoeffding bound to decide on splitting nodes with sufficient data statistics at the ...
Comments