ABSTRACT
We present a cloud-based anomaly detection service framework built on a containerized Spark cluster and ancillary user interfaces, all managed by Kubernetes. This technology stack provides a fast, reliable, resilient, and easily scalable service for both batch and streaming data. At the heart of the service is Extended Isolation Forest, an improved version of the Isolation Forest algorithm, which provides robust and efficient anomaly detection. We showcase the design and a typical workflow of our infrastructure, which is ready to deploy on any Kubernetes cluster without extra technical knowledge. Through exposed APIs and simple graphical interfaces, users can load any data and detect anomalies in the loaded set or in newly presented data points, in either batch or streaming mode. In streaming mode, users can subscribe and receive notifications on the desired output. Our aim is to develop and apply these techniques to scientific data. In particular, we are interested in finding anomalous objects within the overwhelming set of images and catalogs produced by current and future astronomical surveys, although the framework can easily be adapted to other fields.
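As an illustration only, the isolation idea behind Extended Isolation Forest can be sketched in a few lines of Python. This is a minimal single-machine sketch, not the Spark implementation used by the service: each tree splits a random subsample with hyperplanes of random orientation, and a point that is isolated after few splits receives a score close to 1. The function names (`fit_forest`, `anomaly_score`), the dictionary-based tree layout, and the defaults of 100 trees and 256 samples per tree are assumptions made for this example.

```python
import math
import random

EULER_GAMMA = 0.5772156649  # used to approximate harmonic numbers


def c_factor(n):
    """Average path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def _side(x, normal, intercept):
    """Signed position of x relative to the hyperplane (x - intercept) . normal = 0."""
    return sum((xi - pi) * ni for xi, pi, ni in zip(x, intercept, normal))


def build_tree(points, depth, max_depth, rng):
    """Recursively split points with randomly oriented hyperplanes."""
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}  # leaf: remember how many points remain
    dim = len(points[0])
    # Random orientation: each component of the normal is drawn from N(0, 1).
    normal = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    # Random intercept drawn uniformly inside the bounding box of the points.
    intercept = [
        rng.uniform(min(p[i] for p in points), max(p[i] for p in points))
        for i in range(dim)
    ]
    left = [p for p in points if _side(p, normal, intercept) <= 0]
    right = [p for p in points if _side(p, normal, intercept) > 0]
    return {
        "normal": normal,
        "intercept": intercept,
        "left": build_tree(left, depth + 1, max_depth, rng),
        "right": build_tree(right, depth + 1, max_depth, rng),
    }


def path_length(x, tree):
    """Number of splits needed to reach a leaf, plus a correction for leaf size."""
    node, depth = tree, 0
    while "size" not in node:
        go_left = _side(x, node["normal"], node["intercept"]) <= 0
        node = node["left"] if go_left else node["right"]
        depth += 1
    return depth + c_factor(node["size"])


def fit_forest(data, n_trees=100, sample_size=256, seed=0):
    """Build n_trees trees, each on a random subsample; returns (forest, subsample size)."""
    if len(data) < 2:
        raise ValueError("need at least two points to build a forest")
    rng = random.Random(seed)
    psi = min(sample_size, len(data))
    max_depth = math.ceil(math.log2(psi))
    forest = [
        build_tree(rng.sample(data, psi), 0, max_depth, rng) for _ in range(n_trees)
    ]
    return forest, psi


def anomaly_score(x, forest, psi):
    """Isolation Forest score 2^(-E[h(x)] / c(psi)); values near 1 suggest an anomaly."""
    mean_depth = sum(path_length(x, tree) for tree in forest) / len(forest)
    return 2.0 ** (-mean_depth / c_factor(psi))
```

Here `data` is expected to be a list of equal-length numeric tuples. After `forest, psi = fit_forest(data)`, calling `anomaly_score(x, forest, psi)` works both for points in the loaded set and for newly presented points, which mirrors the two usage modes described above. Because the trees are built independently of one another, this construction is a natural fit for distribution across cluster workers.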