skip to main content
research-article

DOLPHIN: An efficient algorithm for mining distance-based outliers in very large datasets

Published: 23 March 2009 Publication History

Abstract

In this work a novel distance-based outlier detection algorithm, named DOLPHIN, working on disk-resident datasets and whose I/O cost corresponds to the cost of sequentially reading the input dataset file twice, is presented.
It is both theoretically and empirically shown that the main memory usage of DOLPHIN amounts to a small fraction of the dataset and that DOLPHIN has linear time performance with respect to the dataset size. DOLPHIN gains efficiency by naturally merging together in a unified schema three strategies, namely the selection policy of objects to be maintained in main memory, usage of pruning rules, and similarity search techniques. Importantly, similarity search is accomplished by the algorithm without the need of preliminarily indexing the whole dataset, as other methods do.
The algorithm is simple to implement and it can be used with any type of data, belonging to either metric or nonmetric spaces. Moreover, a modification to the basic method allows DOLPHIN to deal with the scenario in which the available buffer of main memory is smaller than its standard requirements. DOLPHIN has been compared with state-of-the-art distance-based outlier detection algorithms, showing that it is much more efficient.

References

[1]
Aggarwal, C. C. and Yu, P. 2001. Outlier detection for high dimensional data. In Proceedings of the International Conference on Managment of Data (SIGMOD'01).
[2]
Angiulli, F. and Fassetti, F. 2007. Very efficient mining of distance-based outliers. In Proceedings of the International Conference on Information and Management (CIKM), 791--800.
[3]
Angiulli, F. and Pizzuti, C. 2002. Fast outlier detection in large high-dimensional data sets. In Proceedings of the International Conference on Principles of Data Mining and Knowledge Discovery (PKDD'02), 15--26.
[4]
Angiulli, F. and Pizzuti, C. 2005. Outlier mining in large high-dimensional data sets. IEEE Trans. Knowl. Data Eng. 2, 17, 203--215.
[5]
Arning, A., Aggarwal, C., and Raghavan, P. 1996. A linear method for deviation detection in large databases. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD'96), 164--169.
[6]
Barnett, V. and Lewis, T. 1994. Outliers in Statistical Data. John Wiley & Sons.
[7]
Bay, S. D. and Schwabacher, M. 2003. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD'03).
[8]
Beckmann, N., Kriegel, H.-P., Schneider, R., and Seeger, B. 1990. The r*-tree: An efficient and robust access method for points and rectangles. In Proceedings of the SIGMOD Conference, 322--331.
[9]
Bentley, J. 1975. Multidimensional binary search trees used for associative searching. Commun. ACM 18, 509--517.
[10]
Berchtold, S., Keim, D., and Kriegel, H.-P. 1996. The x-tree: An index structure for high-dimensional data. In Proceedings of the Conference on Very Large Databases (VLDB), 28--39.
[11]
Böhm, C., Berchtold, S., and Keim, D. 2001. Searching in high-dimensional spaces: Index structures for improving the performance of multimedia databases. ACM Comput. Surv. 33, 3, 322--373.
[12]
Breunig, M. M., Kriegel, H., Ng, R., and Sander, J. 2000. Lof: Identifying density-based local outliers. In Proceedings of the International Conference on Managment of Data (SIGMOD'00).
[13]
Chávez, E., Navarro, G., Baeza-Yates, R., and Marroquín, J. 2001. Searching in metric spaces. ACM Comput. Surv. 33, 3, 273--321.
[14]
Ciaccia, P., Patella, M., and Zezula, P. 1997. M-tree: An efficient access method for similarity search in metric spaces. In Proceedings of the International Conference on Very Large Databases (VLDB), 426--435.
[15]
Davies, L. and Gather, U. 1989. The identification of multiple outliers. Tech. rep. 89/1, Department of Statistics, University of Dortmund.
[16]
Davies, L. and Gather, U. 1993. The identification of multiple outliers. J. Amer. Statist. Assoc. 88, 782--792.
[17]
Eskin, E., Arnold, A., Prerau, M., Portnoy, L., and Stolfo, S. 2002. A geometric framework for unsupervised anomaly detection: Detecting intrusions in unlabeled data. In Applications of Data Mining in Computer Security. Kluwer.
[18]
Ghoting, A., Parthasarathy, S., and Otey, M. 2006. Fast mining of distance-based outliers in high-dimensional datasets. In Proceedings of the SIAM International Conference on Data Mining (SDM'06).
[19]
Hawkins, D. 1980. Identification of Outliers. Monographs on Applied Probability and Statistics. Chapman & Hall.
[20]
Jin, W., Tung, A., and Han, J. 2001. Mining top-n local outliers in large databases. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'01).
[21]
Knorr, E. and Ng, R. 1997. A unified approach for mining outliers. In Proceedings of the IBM Centre for Advanced Studies Conference (CASCON), 219--222.
[22]
Knorr, E. and Ng, R. 1998. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the International Conference on Very Large Databases (VLDB'98), 392--403.
[23]
Knorr, E. and Ng, R. 1999. Finding intensional knowledge of distance-based outliers. In Proceedings of the International Conference on Very Large Databases (VLDB'99), 211--222.
[24]
Knorr, E., Ng, R., and Tucakov, V. 2000. Distance-based outlier: Algorithms and applications. VLDB J. 8, 3-4, 237--253.
[25]
Lazarevic, A., Ertöz, L., Kumar, V., Ozgur, A., and Srivastava, J. 2003. A comparative study of anomaly detection schemes in network intrusion detection. In Proceedings of the SIAM International Conference on Data Mining.
[26]
Papadimitriou, S., Kitagawa, H., Gibbons, P., and Faloutsos, C. 2003. Loci: Fast outlier detection using the local correlation integral. In Proceedings of the International Conference on Data Enginnering (ICDE), 315--326.
[27]
Ramaswamy, S., Rastogi, R., and Shim, K. 2000. Efficient algorithms for mining outliers from large data sets. In Proceedings of the International Conference on Managment of Data (SIGMOD'00), 427--438.
[28]
Rider, P. 1962. The negative binomial distribution and the incomplete beta function. The Amer. Math. Monthly 69, 4, 302--304.
[29]
Ruiz, E. V. 1986. An algorithm for finding nearest neighbours in (approximately) constant average time. Pattern Recogn. Lett. 4, 3, 145--157.
[30]
Samet, H. 2005. Foundations of Multidimensional and Metric Data Structures (The Morgan Kaufmann Series in Computer Graphics and Geometric Modeling). Morgan Kaufmann.
[31]
Schultze, V. and Pawlitschko, J. 2002. The identification of outliers in exponential samples. Statistica Neerlandica 56, 1, 41--57.
[32]
Tao, Y., Xiao, X., and Zhou, S. 2006. Mining distance-based outliers from large databases in any metric space. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD'06), 394--403.
[33]
Uhlmann, J. K. 1991. Satisfying general proximity/similarity queries with metric trees. Inf. Process. Lett. 40, 4, 175--179.
[34]
Watanabe, O. 2000. Simple sampling techniques for discovery science. TIEICE: IEICE Trans. Commun./Electron./Inf. Syst.

Cited By

View all
  • (2024)Layered isolation forest: A multi-level subspace algorithm for improving isolation forestNeurocomputing10.1016/j.neucom.2024.127525581(127525)Online publication date: May-2024
  • (2024)SymNOM-GED: Symmetric neighbor outlier mining in gene expression datasetsJournal of Computational Science10.1016/j.jocs.2024.10236581(102365)Online publication date: Sep-2024
  • (2024)On the Design of Scalable Outlier Detection Methods Using Approximate Nearest Neighbor GraphsSimilarity Search and Applications10.1007/978-3-031-75823-2_14(170-184)Online publication date: 25-Oct-2024
  • Show More Cited By

Index Terms

  1. DOLPHIN: An efficient algorithm for mining distance-based outliers in very large datasets

    Recommendations

    Comments

    Information & Contributors

    Information

    Published In

    cover image ACM Transactions on Knowledge Discovery from Data
    ACM Transactions on Knowledge Discovery from Data  Volume 3, Issue 1
    March 2009
    251 pages
    ISSN:1556-4681
    EISSN:1556-472X
    DOI:10.1145/1497577
    Issue’s Table of Contents
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 23 March 2009
    Accepted: 01 November 2008
    Revised: 01 September 2008
    Received: 01 February 2008
    Published in TKDD Volume 3, Issue 1

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. Data mining
    2. distance-based outliers
    3. outlier detection

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)11
    • Downloads (Last 6 weeks)0
    Reflects downloads up to 20 Feb 2025

    Other Metrics

    Citations

    Cited By

    View all
    • (2024)Layered isolation forest: A multi-level subspace algorithm for improving isolation forestNeurocomputing10.1016/j.neucom.2024.127525581(127525)Online publication date: May-2024
    • (2024)SymNOM-GED: Symmetric neighbor outlier mining in gene expression datasetsJournal of Computational Science10.1016/j.jocs.2024.10236581(102365)Online publication date: Sep-2024
    • (2024)On the Design of Scalable Outlier Detection Methods Using Approximate Nearest Neighbor GraphsSimilarity Search and Applications10.1007/978-3-031-75823-2_14(170-184)Online publication date: 25-Oct-2024
    • (2023)Anomaly detection with correlation lawsData & Knowledge Engineering10.1016/j.datak.2023.102181145:COnline publication date: 5-Jun-2023
    • (2022)Research on the Derated Power Data Identification Method of a Wind Turbine Based on a Multi-Gaussian–Discrete Joint Probability ModelSensors10.3390/s2222889122:22(8891)Online publication date: 17-Nov-2022
    • (2022)Anomaly Detection in Resource Constrained Environments With Streaming DataIEEE Transactions on Emerging Topics in Computational Intelligence10.1109/TETCI.2021.30706606:3(649-659)Online publication date: Jun-2022
    • (2022)$${{\mathrm {Latent}}Out}$$: an unsupervised deep anomaly detection approach exploiting latent space distributionMachine Learning10.1007/s10994-022-06153-4112:11(4323-4349)Online publication date: 24-May-2022
    • (2022)Fast, exact, and parallel-friendly outlier detection algorithms with proximity graph in metric spacesThe VLDB Journal — The International Journal on Very Large Data Bases10.1007/s00778-022-00729-131:4(797-821)Online publication date: 27-Jan-2022
    • (2022)Detecting Anomalies with $${{\textrm{Latent}}Out}$$: Novel Scores, Architectures, and SettingsFoundations of Intelligent Systems10.1007/978-3-031-16564-1_24(251-261)Online publication date: 26-Sep-2022
    • (2022)Meta-feature Extraction Strategies for Active Anomaly DetectionIntelligent Data Engineering and Automated Learning – IDEAL 202110.1007/978-3-030-91608-4_40(406-414)Online publication date: 1-Jan-2022
    • Show More Cited By

    View Options

    Login options

    Full Access

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Figures

    Tables

    Media

    Share

    Share

    Share this Publication link

    Share on social media