ABSTRACT
Outlier detection has recently become an important problem in many industrial and financial applications. This paper proposes a novel feature bagging approach for detecting outliers in very large, high-dimensional, and noisy databases. The approach combines results from multiple outlier detection algorithms, each applied to a different set of features. Every outlier detection algorithm uses a small subset of features randomly selected from the original feature set. As a result, each outlier detector identifies different outliers and assigns to every data record an outlier score that corresponds to its probability of being an outlier. The outlier scores computed by the individual outlier detection algorithms are then combined to find better-quality outliers. Experiments on several synthetic and real-life data sets show that the proposed methods for combining outputs from multiple outlier detection algorithms provide non-trivial improvements over the base algorithm.
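The idea described in the abstract can be illustrated with a minimal sketch. Several details here are assumptions and are not taken from the abstract: the base detector is a plain k-th-nearest-neighbour distance score used as a stand-in, the per-round scores are combined by simple summation (the abstract does not say which combining functions the paper uses), and the names `knn_distance_scores`, `feature_bagging_scores`, `n_rounds`, `subset_size`, and `k` are hypothetical.

```python
import math
import random


def knn_distance_scores(rows, features, k):
    """Score each row by its distance to its k-th nearest neighbour,
    measured only on the given feature indices (a simple stand-in detector)."""
    scores = []
    for i, a in enumerate(rows):
        dists = sorted(
            math.sqrt(sum((a[f] - b[f]) ** 2 for f in features))
            for j, b in enumerate(rows)
            if j != i
        )
        scores.append(dists[k - 1])
    return scores


def feature_bagging_scores(rows, n_rounds=10, subset_size=2, k=2, seed=0):
    """Sum the outlier scores of detectors run on random feature subsets.

    Summation is an assumed combination rule chosen for illustration."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    combined = [0.0] * len(rows)
    for _ in range(n_rounds):
        # Each round sees only a randomly chosen subset of the features.
        features = rng.sample(range(n_features), subset_size)
        round_scores = knn_distance_scores(rows, features, k)
        combined = [c + s for c, s in zip(combined, round_scores)]
    return combined


# Toy data: a tight cluster plus one record that is far away in every feature.
data = [
    [1.0, 1.1, 0.9, 1.0],
    [1.1, 0.9, 1.0, 1.1],
    [0.9, 1.0, 1.1, 0.9],
    [1.0, 1.0, 1.0, 1.0],
    [8.0, 9.0, 7.5, 8.5],
]
scores = feature_bagging_scores(data)
most_outlying = max(range(len(data)), key=lambda i: scores[i])
```

In this toy data set the last record is distant from the cluster in every feature, so it receives the largest combined score whichever subsets are drawn. On real data, different subsets would typically highlight different records, which is what makes the combination step useful.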
CITED BY 6
Bo Sheng, Qun Li, Weizhen Mao, Wen Jin, Outlier detection in sensor networks, Proceedings of the 8th ACM international symposium on Mobile ad hoc networking and computing, September 09-14, 2007, Montreal, Quebec, Canada
Raymond K. Pon, Alfonso F. Cardenas, David Buttler, Terence Critchlow, Tracking multiple topics for finding interesting articles, Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, August 12-15, 2007, San Jose, California, USA