ABSTRACT
Anomaly detection is of great interest to big data applications, and both supervised and unsupervised learning have been applied to it. However, it remains a challenging problem because (1) for supervised learning, it is difficult to acquire training data for anomaly samples, while (2) for unsupervised learning, the performance may not be satisfactory due to the lack of training data. To address these limitations, we propose a hybrid solution that uses both normal (positive) data and unlabeled data (which may be positive or negative) for semi-supervised anomaly detection. In particular, we introduce a new framework based on Positive and Unlabeled (PU) Learning that uses multiple features to detect anomalies. We extend previous PU learning methods to (1) better address the unbalanced class problem, which is typical of anomaly detection, and (2) handle multiple features for anomaly detection. An iterative algorithm is proposed to learn the anomaly classifier incrementally from the labeled normal data and the unlabeled data. Our proposed method is verified on three benchmark datasets and one synthetic dataset. Experimental results show that our method outperforms existing methods under different class priors and different proportions of given positive classes.
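To make the setting concrete, the following is a minimal sketch of a generic two-step iterative PU loop, in which labeled normal data are the positives and anomalies are the negatives hidden in the unlabeled set. It is not the algorithm proposed in the paper: the nearest-centroid classifier, the function names `centroid` and `iterative_pu`, and the `seed_fraction` and `max_iter` parameters are all assumptions chosen for illustration, and the sketch handles neither multiple feature types nor class-prior estimation.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]


def iterative_pu(positives, unlabeled, seed_fraction=0.1, max_iter=20):
    """Illustrative two-step PU loop (hypothetical, not the paper's method).

    positives: labeled normal samples.
    unlabeled: samples that may be normal or anomalous.
    Returns one boolean per unlabeled sample, True where it is flagged anomalous.
    """
    pos_center = centroid(positives)

    # Step 1: seed "reliable negatives" with the unlabeled points that lie
    # farthest from the centroid of the labeled normal data.
    order = sorted(
        range(len(unlabeled)),
        key=lambda i: math.dist(unlabeled[i], pos_center),
        reverse=True,
    )
    n_seed = max(1, int(seed_fraction * len(unlabeled)))
    is_anomaly = [False] * len(unlabeled)
    for i in order[:n_seed]:
        is_anomaly[i] = True

    # Step 2: alternate between refitting both centroids and relabeling the
    # unlabeled data, stopping once the labels no longer change.
    for _ in range(max_iter):
        negatives = [u for u, a in zip(unlabeled, is_anomaly) if a]
        if not negatives:
            break
        normals = positives + [u for u, a in zip(unlabeled, is_anomaly) if not a]
        neg_center = centroid(negatives)
        pos_center = centroid(normals)
        updated = [
            math.dist(u, neg_center) < math.dist(u, pos_center) for u in unlabeled
        ]
        if updated == is_anomaly:
            break
        is_anomaly = updated
    return is_anomaly
```

Calling `iterative_pu(positives, unlabeled)` with lists of numeric feature vectors returns the anomaly flags for the unlabeled samples. A small `seed_fraction` reflects the assumption that anomalies are rare, which is the class imbalance the abstract refers to.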