Learning on the border: active learning in imbalanced data classification
Source
Conference on Information and Knowledge Management
Proceedings of the sixteenth ACM Conference on Information and Knowledge Management
Lisbon, Portugal
SESSION: Classification and clustering I (KM)
Pages 127-136
Year of Publication: 2007
ISBN: 978-1-59593-803-9
Authors
Seyda Ertekin, The Pennsylvania State University, University Park, PA
Jian Huang, The Pennsylvania State University, University Park, PA
Leon Bottou, NEC Laboratories America, Princeton, NJ
Lee Giles, The Pennsylvania State University, University Park, PA
ABSTRACT
This paper is concerned with the class imbalance problem, which is known to hinder the learning performance of classification algorithms. The problem occurs when there are significantly fewer observations of the target concept than of the other classes. Various real-world classification tasks, such as medical diagnosis, text categorization, and fraud detection, suffer from this phenomenon. Standard machine learning algorithms yield better prediction performance with balanced datasets. In this paper, we demonstrate that active learning is capable of solving the class imbalance problem by providing the learner with more balanced classes. We also propose an efficient way of selecting informative instances from a smaller pool of samples for active learning, which does not necessitate a search through the entire dataset. The proposed method yields an efficient querying system and allows active learning to be applied to very large datasets. Our experimental results show that, with an early stopping criterion, active learning achieves a fast solution with competitive prediction performance in imbalanced data classification.
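The idea of querying from a smaller pool rather than the whole dataset can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `select_from_small_pool`, the `decision_value` callback, and the pool size of 50 are all assumptions made for illustration. The sketch assumes the current model exposes a signed score per example whose magnitude is small when the model is uncertain.

```python
import random


def select_from_small_pool(unlabeled, decision_value, pool_size=50, rng=None):
    """Pick the most uncertain example from a random subset of the unlabeled set.

    unlabeled      : list of candidate examples (any type)
    decision_value : hypothetical callable returning a signed score for an
                     example; a score near 0 means the model is unsure
    pool_size      : number of random candidates to inspect (illustrative value)
    """
    if rng is None:
        rng = random.Random()
    # Inspect only a small random pool instead of every unlabeled example.
    pool = rng.sample(unlabeled, min(pool_size, len(unlabeled)))
    # The query is the pooled candidate closest to the decision boundary.
    return min(pool, key=lambda x: abs(decision_value(x)))


# Toy usage: 1-D points and a fixed boundary at x = 0.3 (both made up).
points = [i / 100 for i in range(-500, 501)]
query = select_from_small_pool(points, lambda x: x - 0.3, pool_size=20,
                               rng=random.Random(0))
```

Each query costs `pool_size` score evaluations regardless of how large the unlabeled set is, which is what makes this style of selection attractive for very large datasets. The trade-off is that the selected example is only the most uncertain one within the sampled pool, not necessarily within the whole dataset.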