research-article

DOLPHIN: An efficient algorithm for mining distance-based outliers in very large datasets

Authors:

Fabrizio Angiulli,

Fabio FassettiAuthors Info & Claims

ACM Transactions on Knowledge Discovery from Data (TKDD), Volume 3, Issue 1

Article No.: 4, Pages 1 - 57

https://doi.org/10.1145/1497577.1497581

Published: 23 March 2009 Publication History

Abstract

In this work a novel distance-based outlier detection algorithm, named DOLPHIN, working on disk-resident datasets and whose I/O cost corresponds to the cost of sequentially reading the input dataset file twice, is presented.

It is both theoretically and empirically shown that the main memory usage of DOLPHIN amounts to a small fraction of the dataset and that DOLPHIN has linear time performance with respect to the dataset size. DOLPHIN gains efficiency by naturally merging together in a unified schema three strategies, namely the selection policy of objects to be maintained in main memory, usage of pruning rules, and similarity search techniques. Importantly, similarity search is accomplished by the algorithm without the need of preliminarily indexing the whole dataset, as other methods do.

The algorithm is simple to implement and it can be used with any type of data, belonging to either metric or nonmetric spaces. Moreover, a modification to the basic method allows DOLPHIN to deal with the scenario in which the available buffer of main memory is smaller than its standard requirements. DOLPHIN has been compared with state-of-the-art distance-based outlier detection algorithms, showing that it is much more efficient.

References

[1]

Aggarwal, C. C. and Yu, P. 2001. Outlier detection for high dimensional data. In Proceedings of the International Conference on Managment of Data (SIGMOD'01).

Digital Library

[2]

Angiulli, F. and Fassetti, F. 2007. Very efficient mining of distance-based outliers. In Proceedings of the International Conference on Information and Management (CIKM), 791--800.

Digital Library

[3]

Angiulli, F. and Pizzuti, C. 2002. Fast outlier detection in large high-dimensional data sets. In Proceedings of the International Conference on Principles of Data Mining and Knowledge Discovery (PKDD'02), 15--26.

Digital Library

[4]

Angiulli, F. and Pizzuti, C. 2005. Outlier mining in large high-dimensional data sets. IEEE Trans. Knowl. Data Eng. 2, 17, 203--215.

Digital Library

[5]

Arning, A., Aggarwal, C., and Raghavan, P. 1996. A linear method for deviation detection in large databases. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD'96), 164--169.

[6]

Barnett, V. and Lewis, T. 1994. Outliers in Statistical Data. John Wiley & Sons.

[7]

Bay, S. D. and Schwabacher, M. 2003. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD'03).

Digital Library

[8]

Beckmann, N., Kriegel, H.-P., Schneider, R., and Seeger, B. 1990. The r&ast;-tree: An efficient and robust access method for points and rectangles. In Proceedings of the SIGMOD Conference, 322--331.

Digital Library

[9]

Bentley, J. 1975. Multidimensional binary search trees used for associative searching. Commun. ACM 18, 509--517.

Digital Library

[10]

Berchtold, S., Keim, D., and Kriegel, H.-P. 1996. The x-tree: An index structure for high-dimensional data. In Proceedings of the Conference on Very Large Databases (VLDB), 28--39.

Digital Library

[11]

Böhm, C., Berchtold, S., and Keim, D. 2001. Searching in high-dimensional spaces: Index structures for improving the performance of multimedia databases. ACM Comput. Surv. 33, 3, 322--373.

Digital Library

[12]

Breunig, M. M., Kriegel, H., Ng, R., and Sander, J. 2000. Lof: Identifying density-based local outliers. In Proceedings of the International Conference on Managment of Data (SIGMOD'00).

Digital Library

[13]

Chávez, E., Navarro, G., Baeza-Yates, R., and Marroquín, J. 2001. Searching in metric spaces. ACM Comput. Surv. 33, 3, 273--321.

Digital Library

[14]

Ciaccia, P., Patella, M., and Zezula, P. 1997. M-tree: An efficient access method for similarity search in metric spaces. In Proceedings of the International Conference on Very Large Databases (VLDB), 426--435.

Digital Library

[15]

Davies, L. and Gather, U. 1989. The identification of multiple outliers. Tech. rep. 89/1, Department of Statistics, University of Dortmund.

[16]

Davies, L. and Gather, U. 1993. The identification of multiple outliers. J. Amer. Statist. Assoc. 88, 782--792.

[17]

Eskin, E., Arnold, A., Prerau, M., Portnoy, L., and Stolfo, S. 2002. A geometric framework for unsupervised anomaly detection: Detecting intrusions in unlabeled data. In Applications of Data Mining in Computer Security. Kluwer.

[18]

Ghoting, A., Parthasarathy, S., and Otey, M. 2006. Fast mining of distance-based outliers in high-dimensional datasets. In Proceedings of the SIAM International Conference on Data Mining (SDM'06).

[19]

Hawkins, D. 1980. Identification of Outliers. Monographs on Applied Probability and Statistics. Chapman & Hall.

[20]

Jin, W., Tung, A., and Han, J. 2001. Mining top-n local outliers in large databases. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'01).

Digital Library

[21]

Knorr, E. and Ng, R. 1997. A unified approach for mining outliers. In Proceedings of the IBM Centre for Advanced Studies Conference (CASCON), 219--222.

Digital Library

[22]

Knorr, E. and Ng, R. 1998. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the International Conference on Very Large Databases (VLDB'98), 392--403.

Digital Library

[23]

Knorr, E. and Ng, R. 1999. Finding intensional knowledge of distance-based outliers. In Proceedings of the International Conference on Very Large Databases (VLDB'99), 211--222.

Digital Library

[24]

Knorr, E., Ng, R., and Tucakov, V. 2000. Distance-based outlier: Algorithms and applications. VLDB J. 8, 3-4, 237--253.

Digital Library

[25]

Lazarevic, A., Ertöz, L., Kumar, V., Ozgur, A., and Srivastava, J. 2003. A comparative study of anomaly detection schemes in network intrusion detection. In Proceedings of the SIAM International Conference on Data Mining.

[26]

Papadimitriou, S., Kitagawa, H., Gibbons, P., and Faloutsos, C. 2003. Loci: Fast outlier detection using the local correlation integral. In Proceedings of the International Conference on Data Enginnering (ICDE), 315--326.

[27]

Ramaswamy, S., Rastogi, R., and Shim, K. 2000. Efficient algorithms for mining outliers from large data sets. In Proceedings of the International Conference on Managment of Data (SIGMOD'00), 427--438.

Digital Library

[28]

Rider, P. 1962. The negative binomial distribution and the incomplete beta function. The Amer. Math. Monthly 69, 4, 302--304.

[29]

Ruiz, E. V. 1986. An algorithm for finding nearest neighbours in (approximately) constant average time. Pattern Recogn. Lett. 4, 3, 145--157.

Digital Library

[30]

Samet, H. 2005. Foundations of Multidimensional and Metric Data Structures (The Morgan Kaufmann Series in Computer Graphics and Geometric Modeling). Morgan Kaufmann.

Digital Library

[31]

Schultze, V. and Pawlitschko, J. 2002. The identification of outliers in exponential samples. Statistica Neerlandica 56, 1, 41--57.

[32]

Tao, Y., Xiao, X., and Zhou, S. 2006. Mining distance-based outliers from large databases in any metric space. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD'06), 394--403.

Digital Library

[33]

Uhlmann, J. K. 1991. Satisfying general proximity/similarity queries with metric trees. Inf. Process. Lett. 40, 4, 175--179.

[34]

Watanabe, O. 2000. Simple sampling techniques for discovery science. TIEICE: IEICE Trans. Commun./Electron./Inf. Syst.

Cited By

Liu TZhou ZYang L(2024)Layered isolation forest: A multi-level subspace algorithm for improving isolation forestNeurocomputing10.1016/j.neucom.2024.127525581(127525)Online publication date: May-2024
https://doi.org/10.1016/j.neucom.2024.127525
Baruah BDutta MBanerjee SBhattacharyya D(2024)SymNOM-GED: Symmetric neighbor outlier mining in gene expression datasetsJournal of Computational Science10.1016/j.jocs.2024.10236581(102365)Online publication date: Sep-2024
https://doi.org/10.1016/j.jocs.2024.102365
Okkels CAumüller MZimek A(2024)On the Design of Scalable Outlier Detection Methods Using Approximate Nearest Neighbor GraphsSimilarity Search and Applications10.1007/978-3-031-75823-2_14(170-184)Online publication date: 25-Oct-2024
https://doi.org/10.1007/978-3-031-75823-2_14
Show More Cited By

Index Terms

DOLPHIN: An efficient algorithm for mining distance-based outliers in very large datasets
1. Information systems
  1. Information retrieval

Recommendations

Very efficient mining of distance-based outliers
CIKM '07: Proceedings of the sixteenth ACM conference on Conference on information and knowledge management

In this work a novel algorithm, named DOLPHIN, for detecting distance-based outliers is presented.

The proposed algorithm performs only two sequential scans of the dataset. It needs to store into main memory a portion of the dataset, to efficiently ...
Capabilities of outlier detection schemes in large datasets, framework and methodologies

Outlier detection is concerned with discovering exceptional behaviors of objects. Its theoretical principle and practical implementation lay a foundation for some important applications such as credit card fraud detection, discovering criminal behaviors ...
Neighborhood outlier detection

KNN (k nearest neighbor) is widely discussed and applied in pattern recognition and data mining, however, as a similar outlier detection method using local information for mining a new outlier, neighborhood outlier detection, few literatures are ...

Comments

Information & Contributors

Information

Published In

cover image ACM Transactions on Knowledge Discovery from Data

ACM Transactions on Knowledge Discovery from Data Volume 3, Issue 1

March 2009

251 pages

ISSN:1556-4681

EISSN:1556-472X

DOI:10.1145/1497577

Issue’s Table of Contents

Copyright © 2009 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 23 March 2009

Accepted: 01 November 2008

Revised: 01 September 2008

Received: 01 February 2008

Published in TKDD Volume 3, Issue 1

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article
Research
Refereed

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

82
Total Citations
View Citations
1,322
Total Downloads

Downloads (Last 12 months)11
Downloads (Last 6 weeks)0

Reflects downloads up to 20 Feb 2025

Other Metrics

View Author Metrics

Citations

Cited By

Liu TZhou ZYang L(2024)Layered isolation forest: A multi-level subspace algorithm for improving isolation forestNeurocomputing10.1016/j.neucom.2024.127525581(127525)Online publication date: May-2024
https://doi.org/10.1016/j.neucom.2024.127525
Baruah BDutta MBanerjee SBhattacharyya D(2024)SymNOM-GED: Symmetric neighbor outlier mining in gene expression datasetsJournal of Computational Science10.1016/j.jocs.2024.10236581(102365)Online publication date: Sep-2024
https://doi.org/10.1016/j.jocs.2024.102365
Okkels CAumüller MZimek A(2024)On the Design of Scalable Outlier Detection Methods Using Approximate Nearest Neighbor GraphsSimilarity Search and Applications10.1007/978-3-031-75823-2_14(170-184)Online publication date: 25-Oct-2024
https://doi.org/10.1007/978-3-031-75823-2_14
Angiulli FFassetti FSerrao C(2023)Anomaly detection with correlation lawsData & Knowledge Engineering10.1016/j.datak.2023.102181145:COnline publication date: 5-Jun-2023
https://dl.acm.org/doi/10.1016/j.datak.2023.102181
Ma YLiu YYang ZYan JTao TInfield D(2022)Research on the Derated Power Data Identification Method of a Wind Turbine Based on a Multi-Gaussian–Discrete Joint Probability ModelSensors10.3390/s2222889122:22(8891)Online publication date: 17-Nov-2022
https://doi.org/10.3390/s22228891
Jain PJain SZaiane OSrivastava A(2022)Anomaly Detection in Resource Constrained Environments With Streaming DataIEEE Transactions on Emerging Topics in Computational Intelligence10.1109/TETCI.2021.30706606:3(649-659)Online publication date: Jun-2022
https://doi.org/10.1109/TETCI.2021.3070660
Angiulli FFassetti FFerragina L(2022)$${{\mathrm {Latent}}Out}$$: an unsupervised deep anomaly detection approach exploiting latent space distributionMachine Learning10.1007/s10994-022-06153-4112:11(4323-4349)Online publication date: 24-May-2022
https://doi.org/10.1007/s10994-022-06153-4
Amagata DOnizuka MHara T(2022)Fast, exact, and parallel-friendly outlier detection algorithms with proximity graph in metric spacesThe VLDB Journal — The International Journal on Very Large Data Bases10.1007/s00778-022-00729-131:4(797-821)Online publication date: 27-Jan-2022
https://dl.acm.org/doi/10.1007/s00778-022-00729-1
Angiulli FFassetti FFerragina L(2022)Detecting Anomalies with $${{\textrm{Latent}}Out}$$: Novel Scores, Architectures, and SettingsFoundations of Intelligent Systems10.1007/978-3-031-16564-1_24(251-261)Online publication date: 26-Sep-2022
https://doi.org/10.1007/978-3-031-16564-1_24
Angiulli FFassetti FFerragina LPapaleo P(2022)Meta-feature Extraction Strategies for Active Anomaly DetectionIntelligent Data Engineering and Automated Learning – IDEAL 202110.1007/978-3-030-91608-4_40(406-414)Online publication date: 1-Jan-2022
https://doi.org/10.1007/978-3-030-91608-4_40
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Figures

Tables

Media

View Issue’s Table of Contents