research-article

Very efficient mining of distance-based outliers

Authors:

Fabrizio Angiulli,

Fabio FassettiAuthors Info & Claims

CIKM '07: Proceedings of the sixteenth ACM conference on Conference on information and knowledge management

Pages 791 - 800

https://doi.org/10.1145/1321440.1321550

Published: 06 November 2007 Publication History

Abstract

In this work a novel algorithm, named DOLPHIN, for detecting distance-based outliers is presented.

The proposed algorithm performs only two sequential scans of the dataset. It needs to store into main memory a portion of the dataset, to efficiently search for neighbors and early prune inliers. The strategy pursued by the algorithm allows to keep this portion very small. Both theoretical justification and empirical evidence that the size of the stored data amounts only to a few percent of the dataset are provided.

Another important feature of DOLPHIN is that the memory-resident data are indexed by using a suitable proximity search approach. This allows to search for nearest neighbors looking only at a small subset of the main memory stored data.

Temporal and spatial cost analysis show that the novel algorithm achieves both near linear CPU and I/O cost. DOLPHIN has been compared with state of the art methods, showing that it outperforms existing ones.

References

[1]

C. C. Aggarwal and P. S. Yu. Outlier detection for high dimensional data. In Proc. Int. Conference on Managment of Data (SIGMOD'01), 2001.

Digital Library

[2]

F. Angiulli, S. Basta, and C. Pizzuti. Distance-based detection and prediction of outliers. IEEE Transaction on Knowledge and Data Engineering, 18(2):145--160, February 2006.

Digital Library

[3]

F. Angiulli and C. Pizzuti. Fast outlier detection in large high-dimensional data sets. In Proc. Int. Conf. on Principles of Data Mining and Knowledge Discovery (PKDD'02), pages 15--26, 2002.

Digital Library

[4]

F. Angiulli and C. Pizzuti. Outlier mining in large high-dimensional data sets. IEEE Transaction on Knowledge and Data Engineering, 17(2):203--215, February 2005.

Digital Library

[5]

A. Arning, C. Aggarwal, and P. Raghavan. A linear method for deviation detection in large databases. In Proc. Int. Conf. on Knowledge Discovery and Data Mining (KDD'96), pages 164--169, 1996.

[6]

V. Barnett and T. Lewis. Outliers in Statistical Data. John Wiley & Sons, 1994.

[7]

S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proc. Int. Conf. on Knowledge Discovery and Data Mining (KDD'03), 2003.

Digital Library

[8]

N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger. The r*-tree: An efficient and robust access method for points and rectangles. In Proc. of the SIGMOD Conference, pages 322--331, 1990.

Digital Library

[9]

J. L. Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18:509--517, 1975.

Digital Library

[10]

S. Berchtold, D. A. Keim, and H.-P. Kriegel. The x-tree: An index structure for high-dimensional data. In Proc. of the Conf. on VLDB, pages 28--39, 1996.

Digital Library

[11]

C. Böhm, S. Berchtold, and D. A. Keim. Searching in high-dimensional spaces: Index structures for improving the performance of multimedia databases. ACM Computing Surveys, 33(3):322--373, 2001.

Digital Library

[12]

M. M. Breunig, H. Kriegel, R. T. Ng, and J. Sander. Lof: Identifying density-based local outliers. In Proc. Int. Conf. on Managment of Data (SIGMOD'00), 2000.

Digital Library

[13]

E. Chávez, G. Navarro, R. A. Baeza-Yates, and J. L. Marroquín. Searching in metric spaces. ACM Comput. Surv., 33(3):273--321, 2001.

Digital Library

[14]

Defense Advanced Research Projects Agency DARPA. Intrusion detection evaluation. In http://www.ll.mit.edu/IST/ideval/index.html.

[15]

E. Eskin, A. Arnold, M. Prerau, L. Portnoy, and S. Stolfo. A geometric framework for unsupervised anomaly detection: Detecting intrusions in unlabeled data. In Applications of Data Mining in Computer Security, Kluwer, 2002.

[16]

A. Ghoting, S. Parthasarathy, and M. E. Otey. Fast mining of distance-based outliers in high-dimensional datasets. In Proc. of the SIAM International Conference on Data Mining (SDM'06), Bethesda, MD, USA, 2006.

[17]

W. Jin, A. K. H. Tung, and J. Han. Mining top-n local outliers in large databases. In Proc. ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'01), 2001.

Digital Library

[18]

E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets. In Proc. Int. Conf. on Very Large Databases (VLDB'98), pages 392--403, 1998.

Digital Library

[19]

E. Knorr and R. Ng. Finding intensional knowledge of distance-based outliers. In Proc. Int. Conf. on Very Large Databases (VLDB'99), pages 211--222, 1999.

Digital Library

[20]

E. Knorr, R. Ng, and V. Tucakov. Distance-based outlier: algorithms and applications. VLDB Journal, 8(3-4):237--253, 2000.

Digital Library

[21]

A. Lazarevic, L. Ertöz, V. Kumar, A. Ozgur, and J. Srivastava. A comparative study of anomaly detection schemes in network intrusion detection. In Proc. of the SIAM Int. Conf. on Data Mining, 2003.

[22]

S. Papadimitriou, H. Kitagawa, P. B. Gibbons, and C. Faloutsos. Loci: Fast outlier detection using the local correlation integral. In Proc. Int. Conf. on Data Enginnering (ICDE), pages 315--326, 2003.

[23]

S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proc. Int. Conf. on Managment of Data (SIGMOD'00), pages 427--438, 2000.

Digital Library

[24]

Y. Tao, X. Xiao, and S. Zhou. Mining distance-based outliers from large databases in any metric space. In Proc. of the Int. Conf. on Knowledge Discovery and Data Mining (KDD'06), pages 394--403, Philadelphia, PA, USA, 1999.

Digital Library

Cited By

Pazho ANoghre GPurkayastha AVempati JMartin OTabkhi H(2024)A Survey of Graph-Based Deep Learning for Anomaly Detection in Distributed SystemsIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2023.328289836:1(1-20)Online publication date: 1-Jan-2024
https://dl.acm.org/doi/10.1109/TKDE.2023.3282898
Xu HLiang ZHuang KHuang GHe Y(2023)Pre-Cutoff Value Calculation Method for Accelerating Metric Space Outlier DetectionInternational Journal of Grid and High Performance Computing10.4018/IJGHPC.33412516:1(1-17)Online publication date: 28-Nov-2023
https://dl.acm.org/doi/10.4018/IJGHPC.334125
Krishna BMahalakshmi VNookala G(2023)Modelling a stacked dense network model for outlier prediction over medical-based heart prediction dataJournal of High Speed Networks10.3233/JHS-22207929:4(279-294)Online publication date: 14-Nov-2023
https://doi.org/10.3233/JHS-222079
Show More Cited By

Index Terms

Very efficient mining of distance-based outliers
1. Information systems
  1. Information systems applications
    1. Data mining

Recommendations

Distance-based outlier queries in data streams: the novel task and algorithms

This work proposes a method for detecting distance-based outliers in data streams under the sliding window model. The novel notion of one-time outlier query is introduced in order to detect anomalies in the current window at arbitrary points-in-time. ...
Distance-based outliers: algorithms and applications

This paper deals with finding outliers (exceptions) in large, multidimensional datasets. The identification of outliers can lead to the discovery of truly unexpected knowledge in areas such as electronic commerce, credit card fraud, and even the ...
Detecting distance-based outliers in streams of data
CIKM '07: Proceedings of the sixteenth ACM conference on Conference on information and knowledge management

In this work a method for detecting distance-based outliers in data streams is presented. We deal with the sliding window model, where outlier queries are performed in order to detect anomalies in the current window. Two algorithms are presented. The ...

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences

CIKM '07: Proceedings of the sixteenth ACM conference on Conference on information and knowledge management

November 2007

1048 pages

ISBN:9781595938039

DOI:10.1145/1321440

Co-chair:
Alberto H. F. Laender,
Conference Chairs:
André O. Falcão
Universidade de Lisboa, Portugal
,
Øystein Haug Olsen,
General Chair:
Mário J. Silva
(Universidade de Lisboa, Portugal)
,
Program Chairs:
Ricardo Baeza-Yates,
Deborah L. McGuinness,
Bjorn Olstad

Copyright © 2007 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 06 November 2007

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Conference

CIKM07

Sponsor:

CIKM07: Conference on Information and Knowledge Management

November 6 - 10, 2007

Lisbon, Portugal

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Upcoming Conference

CIKM '25

Sponsor:
sigir
sigir

The 34th ACM International Conference on Information and Knowledge Management

November 10 - 14, 2025

Seoul , Republic of Korea

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

21
Total Citations
View Citations
662
Total Downloads

Downloads (Last 12 months)4
Downloads (Last 6 weeks)1

Reflects downloads up to 20 Feb 2025

Other Metrics

View Author Metrics

Citations

Cited By

Pazho ANoghre GPurkayastha AVempati JMartin OTabkhi H(2024)A Survey of Graph-Based Deep Learning for Anomaly Detection in Distributed SystemsIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2023.328289836:1(1-20)Online publication date: 1-Jan-2024
https://dl.acm.org/doi/10.1109/TKDE.2023.3282898
Xu HLiang ZHuang KHuang GHe Y(2023)Pre-Cutoff Value Calculation Method for Accelerating Metric Space Outlier DetectionInternational Journal of Grid and High Performance Computing10.4018/IJGHPC.33412516:1(1-17)Online publication date: 28-Nov-2023
https://dl.acm.org/doi/10.4018/IJGHPC.334125
Krishna BMahalakshmi VNookala G(2023)Modelling a stacked dense network model for outlier prediction over medical-based heart prediction dataJournal of High Speed Networks10.3233/JHS-22207929:4(279-294)Online publication date: 14-Nov-2023
https://doi.org/10.3233/JHS-222079
Patel VKapoor ASharma AChakrabarti S(2023)Taxonomy of outlier detection methods for power system measurementsEnergy Conversion and Economics10.1049/enc2.120824:2(73-88)Online publication date: 20-Apr-2023
https://doi.org/10.1049/enc2.12082
Chigurupati RKulasekaran RBabu M(2022)Design of a deep network model for outlier predictionInternational Journal of System Assurance Engineering and Management10.1007/s13198-022-01731-0Online publication date: 15-Jul-2022
https://doi.org/10.1007/s13198-022-01731-0
Chigurupati SRaja KBabu M(2022)An Extensive Survey on Outlier Prediction Using Mining and Learning ApproachesEvolutionary Computing and Mobile Sustainable Networks10.1007/978-981-16-9605-3_40(593-610)Online publication date: 22-Mar-2022
https://doi.org/10.1007/978-981-16-9605-3_40
Thudumu SBranch PJin JSingh J(2020)A comprehensive survey of anomaly detection techniques for high dimensional big dataJournal of Big Data10.1186/s40537-020-00320-x7:1Online publication date: 2-Jul-2020
https://doi.org/10.1186/s40537-020-00320-x
Wang HBah MHammad M(2019)Progress in Outlier Detection Techniques: A SurveyIEEE Access10.1109/ACCESS.2019.29327697(107964-108000)Online publication date: 2019
https://doi.org/10.1109/ACCESS.2019.2932769
Wang XBai MShen DNie TKou YYu G(2017)A Distributed Algorithm for the Cluster‐Based Outlier Detection Using Unsupervised Extreme Learning MachinesMathematical Problems in Engineering10.1155/2017/26495352017:1Online publication date: 9-Apr-2017
https://doi.org/10.1155/2017/2649535
Bai MWang XXin JWang G(2016)An efficient algorithm for distributed density-based outlier detection on big dataNeurocomputing10.1016/j.neucom.2015.05.135181:C(19-28)Online publication date: 12-Mar-2016
https://dl.acm.org/doi/10.1016/j.neucom.2015.05.135
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Figures

Tables

Media

View Table of Conten