skip to main content
10.1145/3097983.3098191acmconferencesArticle/Chapter ViewAbstractPublication PageskddConference Proceedingsconference-collections
research-article
Public Access

Scalable Top-n Local Outlier Detection

Published:13 August 2017Publication History

ABSTRACT

Local Outlier Factor (LOF) method that labels all points with their respective LOF scores to indicate their status is known to be very effective for identifying outliers in datasets with a skewed distribution. Since outliers by definition are the absolute minority in a dataset, the concept of Top-N local outlier was proposed to discover the n points with the largest LOF scores. The detection of the Top-N local outliers is prohibitively expensive, since it requires huge number of high complexity k-nearest neighbor (kNN) searches. In this work, we present the first scalable Top-N local outlier detection approach called TOLF. The key innovation of TOLF is a multi-granularity pruning strategy that quickly prunes most points from the set of potential outlier candidates without computing their exact LOF scores or even without conducting any kNN search for them. Our customized density-aware indexing structure not only effectively supports the pruning strategy, but also accelerates the $k$NN search. Our extensive experimental evaluation on OpenStreetMap, SDSS, and TIGER datasets demonstrates the effectiveness of TOLF - up to 35 times faster than the state-of-the-art methods.

References

  1. 2016. SDSS 13 Release. http://www.sdss.org/dr13/. (2016).Google ScholarGoogle Scholar
  2. 2017. Top-n LOF Code. https://github.com/yizhouyan/TopNLOFKDD. (2017).Google ScholarGoogle Scholar
  3. Charu C. Aggarwal. 2013. Outlier Analysis. Springer. Google ScholarGoogle ScholarCross RefCross Ref
  4. Stephen D. Bay and Mark Schwabacher. 2003. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In KDD . 29--38. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Kanishka Bhaduri, Bryan L. Matthews, and Chris Giannella. 2011. Algorithms for Speeding up Distance-based Outlier Detection. In KDD. 859--867. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, and Jürg Sander. 2000. LOF: Identifying Density-based Local Outliers. In SIGMOD (SIGMOD '00). ACM, New York, NY, USA, 93--104. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. 2009. Introduction to Algorithms (3. ed.). MIT Press.Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Ahmed Eldawy and Mohamed F Mokbel. 2015. Spatialhadoop: A mapreduce framework for spatial data. In ICDE. IEEE, 1352--1363.Google ScholarGoogle Scholar
  9. Kyle S. Dawson et.al. 2016. The SDSS-IV Extended Baryon Oscillation Spectro- scopic Survey: Overview and Early Data. The Astronomical Journal 151, 2 (2016), 44.Google ScholarGoogle ScholarCross RefCross Ref
  10. Antonin Guttman. 1984. R-trees: A Dynamic Index Structure for Spatial Searching. In INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA. ACM, 47--57. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Mordechai Haklay and Patrick Weber. 2008. Openstreetmap: User-generated street maps. IEEE Pervasive Computing 7, 4 (2008), 12--18. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Douglas M. Hawkins. 1980. Identification of Outliers. Springer. 1--188 pages. Google ScholarGoogle ScholarCross RefCross Ref
  13. Wen Jin, Anthony K. H. Tung, and Jiawei Han. 2001. Mining Top-n Local Outliers in Large Databases. In KDD (KDD '01). New York, NY, USA, 293--298. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Edwin M. Knorr and Raymond T. Ng. 1998. Algorithms for Mining Distance-Based Outliers in Large Datasets. In VLDB. 392--403.Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Edwin M Knox and Raymond T Ng. 1998. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the International Conference on Very Large Data Bases. 392--403.Google ScholarGoogle Scholar
  16. Aleksandar Lazarevic, Levent Ertöz, Vipin Kumar, Aysel Ozgur, and Jaideep Srivastava. 2003. A Comparative Study of Anomaly Detection Schemes in Network Intrusion Detection.. In SDM. SIAM, 25--36. Google ScholarGoogle ScholarCross RefCross Ref
  17. Wei Lu, Yanyan Shen, Su Chen, and Beng Chin Ooi. 2012. Efficient processing of k nearest neighbor joins using mapreduce. Proceedings of the VLDB Endowment 5, 10 (2012), 1016--1027. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Gustavo Henrique Orair, Carlos H. C. Teixeira, Ye Wang, Wagner Meira Jr., and Srinivasan Parthasarathy. 2010. Distance-Based Outlier Detection: Consolidation and Renewed Bearing. PVLDB 3, 2 (2010), 1469--1480. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Spiros Papadimitriou, Hiroyuki Kitagawa, Phillip B. Gibbons, and Christos Falout- sos. 2003. LOCI: Fast Outlier Detection Using the Local Correlation Integral. In ICDE. 315--326.Google ScholarGoogle Scholar
  20. Sridhar Ramaswamy, Rajeev Rastogi, and Kyuseok Shim. 2000. Efficient Algorithms for Mining Outliers from Large Data Sets. In SIGMOD. 427--438. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Hanan Samet. 1984. The Quadtree and Related Hierarchical Data Structures. ACM Comput. Surv. 16, 2 (June 1984), 187--260. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Chi Zhang, Feifei Li, and Jeffrey Jestes. 2012. Efficient parallel kNN joins for large data in MapReduce. In EDBT. 38--49. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Tian Zhang, Raghu Ramakrishnan, and Miron Livny. 1996. BIRCH: an efficient data clustering method for very large databases. In ACM SIGMOD Record, Vol. 25. ACM, 103--114. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Scalable Top-n Local Outlier Detection

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in
    • Published in

      cover image ACM Conferences
      KDD '17: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
      August 2017
      2240 pages
      ISBN:9781450348874
      DOI:10.1145/3097983

      Copyright © 2017 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 13 August 2017

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • research-article

      Acceptance Rates

      KDD '17 Paper Acceptance Rate64of748submissions,9%Overall Acceptance Rate1,133of8,635submissions,13%

      Upcoming Conference

      KDD '24

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader