research-article

OcVFDT: one-class very fast decision tree for one-class classification of data streams

Authors:
Chen Li

Northwest A&F University, Yangling, Shaanxi Province, P.R. China

Northwest A&F University, Yangling, Shaanxi Province, P.R. China
View Profile

,
Yang Zhang

Northwest A&F University, Yangling, Shaanxi Province, P.R. China

Northwest A&F University, Yangling, Shaanxi Province, P.R. China
View Profile

,
Xue Li

The University of Queensland, Brisbane, Queensland, Australia

The University of Queensland, Brisbane, Queensland, Australia
View Profile

SensorKDD '09: Proceedings of the Third International Workshop on Knowledge Discovery from Sensor DataJune 2009Pages 79–86https://doi.org/10.1145/1601966.1601981

Published:28 June 2009Publication History

SensorKDD '09: Proceedings of the Third International Workshop on Knowledge Discovery from Sensor Data

Pages 79–86

ABSTRACT

Current research on data stream classification mainly focuses on supervised learning, in which a fully labeled data stream is needed for training. However, fully labeled data streams are expensive to obtain, which make the supervised learning approach difficult to be applied to real-life applications. In this paper, we model applications, such as credit fraud detection and intrusion detection, as a one-class data stream classification problem. The cost of fully labeling the data stream is reduced as users only need to provide some positive samples together with the unlabeled samples to the learner. Based on VFDT and POSC4.5, we propose our OcVFDT (One-class Very Fast Decision Tree) algorithm. Experimental study on both synthetic and real-life datasets shows that the OcVFDT has excellent classification performance. Even 80% of the samples in data stream are unlabeled, the classification performance of OcVFDT is still very close to that of VFDT, which is trained on fully labeled stream.

References

B. Calvo, P. Larranaga, and J. A. Lozano. Learning Bayesian classifiers from positive and unlabeled examples. Pattern Recognition Letters, 28:2375--2384, 2007. Google ScholarDigital Library
F. Denis, R. Gilleron, and F. Letouzey. Learning from positive and unlabeled examples. Theoretical Computer Science, pages 70--83, 2005. Google ScholarDigital Library
T. Dietterich. Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms. Neural Computation, 10(7):1895--1923, 1998. Google ScholarDigital Library
P. Domingos and G. Hulten. Mining high-speed data streams. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. (SIGKDD'00), pages 71--80. ACM New York, NY, USA, 2000. Google ScholarDigital Library
C. Elkan and K. Noto. Learning classifiers from only positive and unlabeled data. In Proceedings of the Fourteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. (SIGKDD'08). Google ScholarDigital Library
U. M. Fayyad and K. B. Irani. On the handling of continuous-valued attributes in decision tree generation. Machine Learning, 8:87--102, 1992. Google ScholarCross Ref
G. Fung, J. Yu, H. Lu, and P. Yu. Text classification without negative examples revisit. IEEE Transactions on Knowledge and Data Engineering, 18(1):6--20, 2006. Google ScholarDigital Library
J. Gama, P. Medas, and P. Rodrigues. Learning Decision Trees from Dynamic Data Streams. Journal of Universal Computer Science, 11(8):1353--1366, 2005.Google Scholar
J. Gama, R. Rocha, and P. Medas. Accurate decision trees for mining high-speed data streams. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. (SIGKDD'03), pages 523--528. ACM Press New York, NY, USA, 2003. Google ScholarDigital Library
G. Hulten, P. Domingos, and L. Spencer. Mining massive data streams. In The Journal of Machine Learning Research, 2005.Google Scholar
G. Hulten, L. Spencer, and P. Domingos. Mining time-changing data streams. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. (SIGKDD'01), pages 97--106. ACM New York, NY, USA, 2001. Google ScholarDigital Library
R. Jin and G. Agrawal. Efficient decision tree construction on streaming data. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. (SIGKDD'03), pages 571--576. ACM New York, NY, USA, 2003. Google ScholarDigital Library
W. Lee and B. Liu. Learning with Positive and Unlabeled Examples Using Weighted Logistic Regression. In Proceedings of Twentieth International Conference on Machine Learning. (ICML'03), volume 20, page 448, 2003.Google Scholar
D. Lewis, Y. Yang, T. Rose, and F. Li. RCV1: A New Benchmark Collection for Text Categorization Research. The Journal of Machine Learning Research, 5:361--397, 2004. Google ScholarDigital Library
B. Liu, Y. Dai, X. Li, W. Lee, and P. Yu. Building text classifiers using positive and unlabeled examples. In Proceedings of the Third IEEE International Conference on Data Mining. (ICDM'03), pages 179--186, 2003. Google ScholarDigital Library
J. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993. Google ScholarDigital Library
B. Schölkopf, J. Platt, J. Shawe-Taylor, A. Smola, and R. Williamson. Estimating the Support of a High-Dimensional Distribution. Neural Computation, 13(7):1443--1471, 2001. Google ScholarDigital Library
W. Street and Y. Kim. A streaming ensemble algorithm (SEA) for large-scale classification. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. (SIGKDD'01), pages 377--382. ACM New York, NY, USA, 2001. Google ScholarDigital Library
H. Wang, W. Fan, P. Yu, and J. Han. Mining concept-drifting data streams using ensemble classifiers. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. (SIGKDD'03), pages 226--235. ACM New York, NY, USA, 2003. Google ScholarDigital Library
H. Yu. Single-Class Classification with Mapping Convergence. Machine Learning, 61(1):49--69, 2005. Google ScholarDigital Library
H. Yu, J. Han, and K. Chang. PEBL: web page classification without negative examples. IEEE Transactions on Knowledge and Data Engineering, 16(1):70--81, 2004. Google ScholarDigital Library
Y. Zhang and X. Jin. An automatic construction and organization strategy for ensemble learning on data streams. ACM SIGMOD Record, 35(3):28--33, 2006. Google ScholarDigital Library
Y. Zhang, X. Li, and M. Orlowska. One-Class Classification of Text Streams with Concept Drift. In Proceedings of the Third IEEE International Conference on Data Mining Workshops. (ICDMW'08), pages 116--125, 2008. Google ScholarDigital Library
X. Zhu, X. Wu, and Y. Yang. Dynamic Classifier Selection for Effective Mining from Noisy Data Streams. In Proceedings of the Fourth IEEE International Conference on Data Mining. (ICDM'04), pages 305--312, 2004. Google ScholarDigital Library

Index Terms

OcVFDT: one-class very fast decision tree for one-class classification of data streams
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Supervised learning
        Supervised learning by classification
    2. Machine learning approaches
      1. Classification and regression trees
2. Information systems
  1. Information systems applications
    1. Data mining

Recommendations

Extremely Fast Decision Tree
KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining

We introduce a novel incremental decision tree learning algorithm, Hoeffding Anytime Tree, that is statistically more efficient than the current state-of-the-art, Hoeffding Tree. We demonstrate that an implementation of Hoeffding Anytime Tree---"...
Read More
Mining time-changing data streams
KDD '01: Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining

Most statistical and machine-learning algorithms assume that the data is a random sample drawn from a stationary distribution. Unfortunately, most of the large databases available for mining today violate this assumption. They were gathered over months ...
Read More
Moderated VFDT in stream mining using adaptive tie threshold and incremental pruning
DaWaK'11: Proceedings of the 13th international conference on Data warehousing and knowledge discovery

Very Fast Decision Tree (VFDT) is one of the most popular decision tree algorithms in data stream mining. The tree building process is based on the principle of the Hoeffding bound to decide on splitting nodes with sufficient data statistics at the ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
SensorKDD '09: Proceedings of the Third International Workshop on Knowledge Discovery from Sensor Data
June 2009
150 pages
ISBN:9781605586687
DOI:10.1145/1601966
Conference Chairs:
Olufemi A. Omitaomu
Oak Ridge National Laboratory, TN
,
Auroop R. Ganguly
Oak Ridge National Laboratory, TN
,
Joao Gama
University of Porto, Portugal
,
Ranga Raju Vatsavai
Oak Ridge National Laboratory, TN
,
Nitesh V. Chawla
University of Notre Dame, IN
,
Mohamed Medhat Gaber
Monash University, Australia
Copyright © 2009 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 28 June 2009
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
Hoeffding bound
decision trees
incremental learning
one-class data streams
Qualifiers
- research-article
Conference
Upcoming Conference
KDD '24

Sponsor:

sigkdd

sigkdd

The 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

August 25 - 29, 2024

Barcelona , Spain
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 40
  Total Citations
  View Citations
- 904
  Total Downloads
- Downloads (Last 12 months)5
- Downloads (Last 6 weeks)2
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

OcVFDT: one-class very fast decision tree for one-class classification of data streams

SensorKDD '09: Proceedings of the Third International Workshop on Knowledge Discovery from Sensor Data

ABSTRACT

References

Cited By

Index Terms

Recommendations

Extremely Fast Decision Tree

Mining time-changing data streams

Moderated VFDT in stream mining using adaptive tie threshold and incremental pruning