ABSTRACT
Local Outlier Factor (LOF) method that labels all points with their respective LOF scores to indicate their status is known to be very effective for identifying outliers in datasets with a skewed distribution. Since outliers by definition are the absolute minority in a dataset, the concept of Top-N local outlier was proposed to discover the n points with the largest LOF scores. The detection of the Top-N local outliers is prohibitively expensive, since it requires huge number of high complexity k-nearest neighbor (kNN) searches. In this work, we present the first scalable Top-N local outlier detection approach called TOLF. The key innovation of TOLF is a multi-granularity pruning strategy that quickly prunes most points from the set of potential outlier candidates without computing their exact LOF scores or even without conducting any kNN search for them. Our customized density-aware indexing structure not only effectively supports the pruning strategy, but also accelerates the $k$NN search. Our extensive experimental evaluation on OpenStreetMap, SDSS, and TIGER datasets demonstrates the effectiveness of TOLF - up to 35 times faster than the state-of-the-art methods.
- 2016. SDSS 13 Release. http://www.sdss.org/dr13/. (2016).Google Scholar
- 2017. Top-n LOF Code. https://github.com/yizhouyan/TopNLOFKDD. (2017).Google Scholar
- Charu C. Aggarwal. 2013. Outlier Analysis. Springer. Google ScholarCross Ref
- Stephen D. Bay and Mark Schwabacher. 2003. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In KDD . 29--38. Google ScholarDigital Library
- Kanishka Bhaduri, Bryan L. Matthews, and Chris Giannella. 2011. Algorithms for Speeding up Distance-based Outlier Detection. In KDD. 859--867. Google ScholarDigital Library
- Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, and Jürg Sander. 2000. LOF: Identifying Density-based Local Outliers. In SIGMOD (SIGMOD '00). ACM, New York, NY, USA, 93--104. Google ScholarDigital Library
- Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. 2009. Introduction to Algorithms (3. ed.). MIT Press.Google ScholarDigital Library
- Ahmed Eldawy and Mohamed F Mokbel. 2015. Spatialhadoop: A mapreduce framework for spatial data. In ICDE. IEEE, 1352--1363.Google Scholar
- Kyle S. Dawson et.al. 2016. The SDSS-IV Extended Baryon Oscillation Spectro- scopic Survey: Overview and Early Data. The Astronomical Journal 151, 2 (2016), 44.Google ScholarCross Ref
- Antonin Guttman. 1984. R-trees: A Dynamic Index Structure for Spatial Searching. In INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA. ACM, 47--57. Google ScholarDigital Library
- Mordechai Haklay and Patrick Weber. 2008. Openstreetmap: User-generated street maps. IEEE Pervasive Computing 7, 4 (2008), 12--18. Google ScholarDigital Library
- Douglas M. Hawkins. 1980. Identification of Outliers. Springer. 1--188 pages. Google ScholarCross Ref
- Wen Jin, Anthony K. H. Tung, and Jiawei Han. 2001. Mining Top-n Local Outliers in Large Databases. In KDD (KDD '01). New York, NY, USA, 293--298. Google ScholarDigital Library
- Edwin M. Knorr and Raymond T. Ng. 1998. Algorithms for Mining Distance-Based Outliers in Large Datasets. In VLDB. 392--403.Google ScholarDigital Library
- Edwin M Knox and Raymond T Ng. 1998. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the International Conference on Very Large Data Bases. 392--403.Google Scholar
- Aleksandar Lazarevic, Levent Ertöz, Vipin Kumar, Aysel Ozgur, and Jaideep Srivastava. 2003. A Comparative Study of Anomaly Detection Schemes in Network Intrusion Detection.. In SDM. SIAM, 25--36. Google ScholarCross Ref
- Wei Lu, Yanyan Shen, Su Chen, and Beng Chin Ooi. 2012. Efficient processing of k nearest neighbor joins using mapreduce. Proceedings of the VLDB Endowment 5, 10 (2012), 1016--1027. Google ScholarDigital Library
- Gustavo Henrique Orair, Carlos H. C. Teixeira, Ye Wang, Wagner Meira Jr., and Srinivasan Parthasarathy. 2010. Distance-Based Outlier Detection: Consolidation and Renewed Bearing. PVLDB 3, 2 (2010), 1469--1480. Google ScholarDigital Library
- Spiros Papadimitriou, Hiroyuki Kitagawa, Phillip B. Gibbons, and Christos Falout- sos. 2003. LOCI: Fast Outlier Detection Using the Local Correlation Integral. In ICDE. 315--326.Google Scholar
- Sridhar Ramaswamy, Rajeev Rastogi, and Kyuseok Shim. 2000. Efficient Algorithms for Mining Outliers from Large Data Sets. In SIGMOD. 427--438. Google ScholarDigital Library
- Hanan Samet. 1984. The Quadtree and Related Hierarchical Data Structures. ACM Comput. Surv. 16, 2 (June 1984), 187--260. Google ScholarDigital Library
- Chi Zhang, Feifei Li, and Jeffrey Jestes. 2012. Efficient parallel kNN joins for large data in MapReduce. In EDBT. 38--49. Google ScholarDigital Library
- Tian Zhang, Raghu Ramakrishnan, and Miron Livny. 1996. BIRCH: an efficient data clustering method for very large databases. In ACM SIGMOD Record, Vol. 25. ACM, 103--114. Google ScholarDigital Library
Index Terms
- Scalable Top-n Local Outlier Detection
Recommendations
Outlier detection using isolation forest and local outlier factor
RACS '19: Proceedings of the Conference on Research in Adaptive and Convergent SystemsOutlier detection, also named as anomaly detection, is one of the hot issues in the field of data mining. As well-known outlier detection algorithms, Isolation Forest(iForest) and Local Outlier Factor(LOF) have been widely used. However, iForest is only ...
A Novel Noise Clustering Based on Local Outlier Factor
Integrated Uncertainty in Knowledge Modelling and Decision MakingAbstractReducing the impact of outliers is an essential issue in machine learning, including clustering. There are two main approaches to reducing the impact of outliers: one is to build robust models, and the other is to remove outliers through ...
Improving Detection Efficiency: Optimizing Block Size in the Local Outlier Factor (LOF) Algorithm
Rough SetsAbstractDetecting outliers in data is essential in various fields, such as finance, healthcare, and many other domains with anomalies. Among well-known outlier detection algorithms, Local Outlier Factor (LOF) is widely used for identifying unusual data ...
Comments