ABSTRACT
Detecting anomalous events in real-world video scenes is challenging because the notion of an "anomaly" is itself complex and because scenes contain cluttered backgrounds, objects and motions. Most existing methods use hand-crafted features in local spatial regions to identify anomalies. In this paper, we propose a novel model called Spatio-Temporal AutoEncoder (ST AutoEncoder or STAE), which uses deep neural networks to learn video representations automatically and extracts features from both the spatial and temporal dimensions by performing 3-dimensional convolutions. In addition to the reconstruction loss used in typical autoencoders, we introduce a weight-decreasing prediction loss for generating future frames, which enhances the learning of motion features in videos. Since most anomaly detection datasets are restricted to appearance anomalies or unnatural motion anomalies, we collected a new challenging dataset comprising a set of real-world traffic surveillance videos. Experiments on both the public benchmarks and our traffic dataset show that the proposed method remarkably outperforms state-of-the-art approaches.
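As an illustration of how a reconstruction loss might be combined with a weight-decreasing prediction loss, the following is a minimal NumPy sketch. The abstract does not give the weighting formula, so the linearly decreasing, normalised weights, the mean-squared-error terms, the balancing factor `lam`, and the function names `prediction_weights` and `stae_loss` are all assumptions made for illustration, not the authors' definition.

```python
import numpy as np


def prediction_weights(num_frames):
    """Assumed schedule: linearly decreasing weights that sum to 1.

    The first predicted frame gets the largest weight, the last the smallest.
    """
    raw = np.arange(num_frames, 0, -1, dtype=np.float64)  # T, T-1, ..., 1
    return raw / raw.sum()


def stae_loss(clip, reconstruction, future, prediction, lam=1.0):
    """Reconstruction loss plus weighted prediction loss (illustrative only).

    clip, reconstruction: arrays of shape (T, H, W) for the input clip.
    future, prediction:   arrays of shape (T, H, W) for the frames that follow.
    lam: hypothetical factor balancing the two terms.
    """
    # Plain mean squared error over the whole reconstructed clip.
    rec_loss = np.mean((clip - reconstruction) ** 2)

    # Per-frame mean squared error on the predicted future frames.
    per_frame = np.mean((future - prediction) ** 2, axis=(1, 2))

    # Errors on frames further in the future contribute less.
    weights = prediction_weights(future.shape[0])
    pred_loss = np.sum(weights * per_frame)

    return rec_loss + lam * pred_loss
```

Under this assumed schedule, an error of the same size costs more on the first predicted frame than on later ones, which reflects the intuition that frames further in the future are harder to predict and should constrain training less.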