research-article

Public Access

Scalable Top-n Local Outlier Detection

Authors:
Yizhou Yan

Worcester Polytechnic Institute, Worcester, MA, USA

Worcester Polytechnic Institute, Worcester, MA, USA
View Profile

,
Lei Cao

MIT, Cambridge, MA, USA

MIT, Cambridge, MA, USA
View Profile

,
Elke A. Rundensteiner

Worcester Polytechnic Institute, Worcester, MA, USA

Worcester Polytechnic Institute, Worcester, MA, USA
View Profile

KDD '17: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data MiningAugust 2017Pages 1235–1244https://doi.org/10.1145/3097983.3098191

Published:13 August 2017Publication History

KDD '17: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Pages 1235–1244

ABSTRACT

Local Outlier Factor (LOF) method that labels all points with their respective LOF scores to indicate their status is known to be very effective for identifying outliers in datasets with a skewed distribution. Since outliers by definition are the absolute minority in a dataset, the concept of Top-N local outlier was proposed to discover the n points with the largest LOF scores. The detection of the Top-N local outliers is prohibitively expensive, since it requires huge number of high complexity k-nearest neighbor (kNN) searches. In this work, we present the first scalable Top-N local outlier detection approach called TOLF. The key innovation of TOLF is a multi-granularity pruning strategy that quickly prunes most points from the set of potential outlier candidates without computing their exact LOF scores or even without conducting any kNN search for them. Our customized density-aware indexing structure not only effectively supports the pruning strategy, but also accelerates the $k$NN search. Our extensive experimental evaluation on OpenStreetMap, SDSS, and TIGER datasets demonstrates the effectiveness of TOLF - up to 35 times faster than the state-of-the-art methods.

References

2016. SDSS 13 Release. http://www.sdss.org/dr13/. (2016).Google Scholar
2017. Top-n LOF Code. https://github.com/yizhouyan/TopNLOFKDD. (2017).Google Scholar
Charu C. Aggarwal. 2013. Outlier Analysis. Springer. Google ScholarCross Ref
Stephen D. Bay and Mark Schwabacher. 2003. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In KDD . 29--38. Google ScholarDigital Library
Kanishka Bhaduri, Bryan L. Matthews, and Chris Giannella. 2011. Algorithms for Speeding up Distance-based Outlier Detection. In KDD. 859--867. Google ScholarDigital Library
Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, and Jürg Sander. 2000. LOF: Identifying Density-based Local Outliers. In SIGMOD (SIGMOD '00). ACM, New York, NY, USA, 93--104. Google ScholarDigital Library
Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. 2009. Introduction to Algorithms (3. ed.). MIT Press.Google ScholarDigital Library
Ahmed Eldawy and Mohamed F Mokbel. 2015. Spatialhadoop: A mapreduce framework for spatial data. In ICDE. IEEE, 1352--1363.Google Scholar
Kyle S. Dawson et.al. 2016. The SDSS-IV Extended Baryon Oscillation Spectro- scopic Survey: Overview and Early Data. The Astronomical Journal 151, 2 (2016), 44.Google ScholarCross Ref
Antonin Guttman. 1984. R-trees: A Dynamic Index Structure for Spatial Searching. In INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA. ACM, 47--57. Google ScholarDigital Library
Mordechai Haklay and Patrick Weber. 2008. Openstreetmap: User-generated street maps. IEEE Pervasive Computing 7, 4 (2008), 12--18. Google ScholarDigital Library
Douglas M. Hawkins. 1980. Identification of Outliers. Springer. 1--188 pages. Google ScholarCross Ref
Wen Jin, Anthony K. H. Tung, and Jiawei Han. 2001. Mining Top-n Local Outliers in Large Databases. In KDD (KDD '01). New York, NY, USA, 293--298. Google ScholarDigital Library
Edwin M. Knorr and Raymond T. Ng. 1998. Algorithms for Mining Distance-Based Outliers in Large Datasets. In VLDB. 392--403.Google ScholarDigital Library
Edwin M Knox and Raymond T Ng. 1998. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the International Conference on Very Large Data Bases. 392--403.Google Scholar
Aleksandar Lazarevic, Levent Ertöz, Vipin Kumar, Aysel Ozgur, and Jaideep Srivastava. 2003. A Comparative Study of Anomaly Detection Schemes in Network Intrusion Detection.. In SDM. SIAM, 25--36. Google ScholarCross Ref
Wei Lu, Yanyan Shen, Su Chen, and Beng Chin Ooi. 2012. Efficient processing of k nearest neighbor joins using mapreduce. Proceedings of the VLDB Endowment 5, 10 (2012), 1016--1027. Google ScholarDigital Library
Gustavo Henrique Orair, Carlos H. C. Teixeira, Ye Wang, Wagner Meira Jr., and Srinivasan Parthasarathy. 2010. Distance-Based Outlier Detection: Consolidation and Renewed Bearing. PVLDB 3, 2 (2010), 1469--1480. Google ScholarDigital Library
Spiros Papadimitriou, Hiroyuki Kitagawa, Phillip B. Gibbons, and Christos Falout- sos. 2003. LOCI: Fast Outlier Detection Using the Local Correlation Integral. In ICDE. 315--326.Google Scholar
Sridhar Ramaswamy, Rajeev Rastogi, and Kyuseok Shim. 2000. Efficient Algorithms for Mining Outliers from Large Data Sets. In SIGMOD. 427--438. Google ScholarDigital Library
Hanan Samet. 1984. The Quadtree and Related Hierarchical Data Structures. ACM Comput. Surv. 16, 2 (June 1984), 187--260. Google ScholarDigital Library
Chi Zhang, Feifei Li, and Jeffrey Jestes. 2012. Efficient parallel kNN joins for large data in MapReduce. In EDBT. 38--49. Google ScholarDigital Library
Tian Zhang, Raghu Ramakrishnan, and Miron Livny. 1996. BIRCH: an efficient data clustering method for very large databases. In ACM SIGMOD Record, Vol. 25. ACM, 103--114. Google ScholarDigital Library

Index Terms

Scalable Top-n Local Outlier Detection
1. Information systems
  1. Information systems applications
    1. Data mining

Recommendations

Outlier detection using isolation forest and local outlier factor
RACS '19: Proceedings of the Conference on Research in Adaptive and Convergent Systems

Outlier detection, also named as anomaly detection, is one of the hot issues in the field of data mining. As well-known outlier detection algorithms, Isolation Forest(iForest) and Local Outlier Factor(LOF) have been widely used. However, iForest is only ...
Read More
A Novel Noise Clustering Based on Local Outlier Factor
Integrated Uncertainty in Knowledge Modelling and Decision Making
Abstract
Reducing the impact of outliers is an essential issue in machine learning, including clustering. There are two main approaches to reducing the impact of outliers: one is to build robust models, and the other is to remove outliers through ...
Read More
Improving Detection Efficiency: Optimizing Block Size in the Local Outlier Factor (LOF) Algorithm
Rough Sets
Abstract
Detecting outliers in data is essential in various fields, such as finance, healthcare, and many other domains with anomalies. Among well-known outlier detection algorithms, Local Outlier Factor (LOF) is widely used for identifying unusual data ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
KDD '17: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
August 2017
2240 pages
ISBN:9781450348874
DOI:10.1145/3097983
General Chairs:
Stan Matwin
Dalhousie University
,
Shipeng Yu
LinkedIn
,
Faisal Farooq
IBM
Copyright © 2017 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 13 August 2017
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
local outlier factor
pruning strategy
top-n
Qualifiers
- research-article
Conference

Acceptance Rates
KDD '17 Paper Acceptance Rate64of748submissions,9%Overall Acceptance Rate1,133of8,635submissions,13%
More
Upcoming Conference
KDD '24

Sponsor:

sigkdd

sigkdd

The 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

August 25 - 29, 2024

Barcelona , Spain
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 19
  Total Citations
  View Citations
- 1,108
  Total Downloads
- Downloads (Last 12 months)71
- Downloads (Last 6 weeks)6
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Scalable Top-n Local Outlier Detection

KDD '17: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

ABSTRACT

References

Cited By

Index Terms

Recommendations

Outlier detection using isolation forest and local outlier factor

A Novel Noise Clustering Based on Local Outlier Factor

Improving Detection Efficiency: Optimizing Block Size in the Local Outlier Factor (LOF) Algorithm

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Upcoming Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Scalable Top-n Local Outlier Detection

KDD '17: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

ABSTRACT

References

Cited By

Index Terms

Recommendations

Outlier detection using isolation forest and local outlier factor

A Novel Noise Clustering Based on Local Outlier Factor

Improving Detection Efficiency: Optimizing Block Size in the Local Outlier Factor (LOF) Algorithm

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Upcoming Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media