skip to main content
10.1145/1321440.1321550acmconferencesArticle/Chapter ViewAbstractPublication PagescikmConference Proceedingsconference-collections
research-article

Very efficient mining of distance-based outliers

Published: 06 November 2007 Publication History

Abstract

In this work a novel algorithm, named DOLPHIN, for detecting distance-based outliers is presented.
The proposed algorithm performs only two sequential scans of the dataset. It needs to store into main memory a portion of the dataset, to efficiently search for neighbors and early prune inliers. The strategy pursued by the algorithm allows to keep this portion very small. Both theoretical justification and empirical evidence that the size of the stored data amounts only to a few percent of the dataset are provided.
Another important feature of DOLPHIN is that the memory-resident data are indexed by using a suitable proximity search approach. This allows to search for nearest neighbors looking only at a small subset of the main memory stored data.
Temporal and spatial cost analysis show that the novel algorithm achieves both near linear CPU and I/O cost. DOLPHIN has been compared with state of the art methods, showing that it outperforms existing ones.

References

[1]
C. C. Aggarwal and P. S. Yu. Outlier detection for high dimensional data. In Proc. Int. Conference on Managment of Data (SIGMOD'01), 2001.
[2]
F. Angiulli, S. Basta, and C. Pizzuti. Distance-based detection and prediction of outliers. IEEE Transaction on Knowledge and Data Engineering, 18(2):145--160, February 2006.
[3]
F. Angiulli and C. Pizzuti. Fast outlier detection in large high-dimensional data sets. In Proc. Int. Conf. on Principles of Data Mining and Knowledge Discovery (PKDD'02), pages 15--26, 2002.
[4]
F. Angiulli and C. Pizzuti. Outlier mining in large high-dimensional data sets. IEEE Transaction on Knowledge and Data Engineering, 17(2):203--215, February 2005.
[5]
A. Arning, C. Aggarwal, and P. Raghavan. A linear method for deviation detection in large databases. In Proc. Int. Conf. on Knowledge Discovery and Data Mining (KDD'96), pages 164--169, 1996.
[6]
V. Barnett and T. Lewis. Outliers in Statistical Data. John Wiley & Sons, 1994.
[7]
S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proc. Int. Conf. on Knowledge Discovery and Data Mining (KDD'03), 2003.
[8]
N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger. The r*-tree: An efficient and robust access method for points and rectangles. In Proc. of the SIGMOD Conference, pages 322--331, 1990.
[9]
J. L. Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18:509--517, 1975.
[10]
S. Berchtold, D. A. Keim, and H.-P. Kriegel. The x-tree: An index structure for high-dimensional data. In Proc. of the Conf. on VLDB, pages 28--39, 1996.
[11]
C. Böhm, S. Berchtold, and D. A. Keim. Searching in high-dimensional spaces: Index structures for improving the performance of multimedia databases. ACM Computing Surveys, 33(3):322--373, 2001.
[12]
M. M. Breunig, H. Kriegel, R. T. Ng, and J. Sander. Lof: Identifying density-based local outliers. In Proc. Int. Conf. on Managment of Data (SIGMOD'00), 2000.
[13]
E. Chávez, G. Navarro, R. A. Baeza-Yates, and J. L. Marroquín. Searching in metric spaces. ACM Comput. Surv., 33(3):273--321, 2001.
[14]
Defense Advanced Research Projects Agency DARPA. Intrusion detection evaluation. In http://www.ll.mit.edu/IST/ideval/index.html.
[15]
E. Eskin, A. Arnold, M. Prerau, L. Portnoy, and S. Stolfo. A geometric framework for unsupervised anomaly detection: Detecting intrusions in unlabeled data. In Applications of Data Mining in Computer Security, Kluwer, 2002.
[16]
A. Ghoting, S. Parthasarathy, and M. E. Otey. Fast mining of distance-based outliers in high-dimensional datasets. In Proc. of the SIAM International Conference on Data Mining (SDM'06), Bethesda, MD, USA, 2006.
[17]
W. Jin, A. K. H. Tung, and J. Han. Mining top-n local outliers in large databases. In Proc. ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'01), 2001.
[18]
E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets. In Proc. Int. Conf. on Very Large Databases (VLDB'98), pages 392--403, 1998.
[19]
E. Knorr and R. Ng. Finding intensional knowledge of distance-based outliers. In Proc. Int. Conf. on Very Large Databases (VLDB'99), pages 211--222, 1999.
[20]
E. Knorr, R. Ng, and V. Tucakov. Distance-based outlier: algorithms and applications. VLDB Journal, 8(3-4):237--253, 2000.
[21]
A. Lazarevic, L. Ertöz, V. Kumar, A. Ozgur, and J. Srivastava. A comparative study of anomaly detection schemes in network intrusion detection. In Proc. of the SIAM Int. Conf. on Data Mining, 2003.
[22]
S. Papadimitriou, H. Kitagawa, P. B. Gibbons, and C. Faloutsos. Loci: Fast outlier detection using the local correlation integral. In Proc. Int. Conf. on Data Enginnering (ICDE), pages 315--326, 2003.
[23]
S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proc. Int. Conf. on Managment of Data (SIGMOD'00), pages 427--438, 2000.
[24]
Y. Tao, X. Xiao, and S. Zhou. Mining distance-based outliers from large databases in any metric space. In Proc. of the Int. Conf. on Knowledge Discovery and Data Mining (KDD'06), pages 394--403, Philadelphia, PA, USA, 1999.

Cited By

View all
  • (2024)A Survey of Graph-Based Deep Learning for Anomaly Detection in Distributed SystemsIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2023.328289836:1(1-20)Online publication date: 1-Jan-2024
  • (2023)Pre-Cutoff Value Calculation Method for Accelerating Metric Space Outlier DetectionInternational Journal of Grid and High Performance Computing10.4018/IJGHPC.33412516:1(1-17)Online publication date: 28-Nov-2023
  • (2023)Modelling a stacked dense network model for outlier prediction over medical-based heart prediction dataJournal of High Speed Networks10.3233/JHS-22207929:4(279-294)Online publication date: 14-Nov-2023
  • Show More Cited By

Index Terms

  1. Very efficient mining of distance-based outliers

    Recommendations

    Comments

    Information & Contributors

    Information

    Published In

    cover image ACM Conferences
    CIKM '07: Proceedings of the sixteenth ACM conference on Conference on information and knowledge management
    November 2007
    1048 pages
    ISBN:9781595938039
    DOI:10.1145/1321440
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Sponsors

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 06 November 2007

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. anomaly detection
    2. data mining
    3. distance-based outliers

    Qualifiers

    • Research-article

    Conference

    CIKM07

    Acceptance Rates

    Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

    Upcoming Conference

    CIKM '25

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)4
    • Downloads (Last 6 weeks)1
    Reflects downloads up to 20 Feb 2025

    Other Metrics

    Citations

    Cited By

    View all
    • (2024)A Survey of Graph-Based Deep Learning for Anomaly Detection in Distributed SystemsIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2023.328289836:1(1-20)Online publication date: 1-Jan-2024
    • (2023)Pre-Cutoff Value Calculation Method for Accelerating Metric Space Outlier DetectionInternational Journal of Grid and High Performance Computing10.4018/IJGHPC.33412516:1(1-17)Online publication date: 28-Nov-2023
    • (2023)Modelling a stacked dense network model for outlier prediction over medical-based heart prediction dataJournal of High Speed Networks10.3233/JHS-22207929:4(279-294)Online publication date: 14-Nov-2023
    • (2023)Taxonomy of outlier detection methods for power system measurementsEnergy Conversion and Economics10.1049/enc2.120824:2(73-88)Online publication date: 20-Apr-2023
    • (2022)Design of a deep network model for outlier predictionInternational Journal of System Assurance Engineering and Management10.1007/s13198-022-01731-0Online publication date: 15-Jul-2022
    • (2022)An Extensive Survey on Outlier Prediction Using Mining and Learning ApproachesEvolutionary Computing and Mobile Sustainable Networks10.1007/978-981-16-9605-3_40(593-610)Online publication date: 22-Mar-2022
    • (2020)A comprehensive survey of anomaly detection techniques for high dimensional big dataJournal of Big Data10.1186/s40537-020-00320-x7:1Online publication date: 2-Jul-2020
    • (2019)Progress in Outlier Detection Techniques: A SurveyIEEE Access10.1109/ACCESS.2019.29327697(107964-108000)Online publication date: 2019
    • (2017)A Distributed Algorithm for the Cluster‐Based Outlier Detection Using Unsupervised Extreme Learning MachinesMathematical Problems in Engineering10.1155/2017/26495352017:1Online publication date: 9-Apr-2017
    • (2016)An efficient algorithm for distributed density-based outlier detection on big dataNeurocomputing10.1016/j.neucom.2015.05.135181:C(19-28)Online publication date: 12-Mar-2016
    • Show More Cited By

    View Options

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Figures

    Tables

    Media

    Share

    Share

    Share this Publication link

    Share on social media