ABSTRACT
Rare objects are often of great interest and great value. Until recently, however, rarity has not received much attention in the context of data mining. Now, as increasingly complex real-world problems are addressed, rarity, and the related problem of imbalanced data, are taking center stage. This article discusses the role that rare classes and rare cases play in data mining. The problems that can result from these two forms of rarity are described in detail, as are methods for addressing these problems. These descriptions draw on examples from existing research, so the article also serves as a survey of the literature on rarity in data mining. The article further demonstrates that rare classes and rare cases are very similar phenomena: both forms of rarity are shown to cause similar problems during data mining and to benefit from the same remediation methods.