Mining with rarity: a unifying framework
Source: ACM SIGKDD Explorations Newsletter
Volume 6, Issue 1 (June 2004)
Special issue on learning from imbalanced datasets
Pages: 7 - 19  
Year of Publication: 2004
ISSN:1931-0145
Author
Gary M. Weiss  AT&T Laboratories, Piscataway, NJ
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1007730.1007734

ABSTRACT

Rare objects are often of great interest and great value. Until recently, however, rarity has not received much attention in the context of data mining. Now, as increasingly complex real-world problems are addressed, rarity, and the related problem of imbalanced data, are taking center stage. This article discusses the role that rare classes and rare cases play in data mining. The problems that can result from these two forms of rarity are described in detail, as are methods for addressing these problems. These descriptions utilize examples from existing research, so that this article provides a good survey of the literature on rarity in data mining. This article also demonstrates that rare classes and rare cases are very similar phenomena: both forms of rarity are shown to cause similar problems during data mining and to benefit from the same remediation methods.


