DOI: 10.1145/3123266.3123451
research-article

Spatio-Temporal AutoEncoder for Video Anomaly Detection

Published: 23 October 2017

ABSTRACT

Detecting anomalous events in real-world video scenes is a challenging problem because of the complexity of what counts as an "anomaly" and the cluttered backgrounds, objects, and motions in the scenes. Most existing methods use hand-crafted features in local spatial regions to identify anomalies. In this paper, we propose a novel model called the Spatio-Temporal AutoEncoder (ST AutoEncoder or STAE), which uses deep neural networks to learn video representations automatically and extracts features from both the spatial and temporal dimensions by performing 3-dimensional convolutions. In addition to the reconstruction loss used in typical existing autoencoders, we introduce a weight-decreasing prediction loss for generating future frames, which enhances the learning of motion features in videos. Since most anomaly detection datasets are restricted to appearance anomalies or unnatural motion anomalies, we collected a new, challenging dataset comprising a set of real-world traffic surveillance videos. Experiments on both public benchmarks and our traffic dataset show that the proposed method outperforms state-of-the-art approaches.
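As a rough illustration of how a reconstruction loss and a weight-decreasing prediction loss might be combined, the following minimal Python sketch computes both terms on flattened frames. The abstract does not give the weighting schedule or the balance between the two terms, so the linear schedule in `decreasing_weights` and the `pred_weight` parameter are assumptions made here for illustration, and all function names are hypothetical rather than taken from the paper.

```python
from typing import List, Sequence

Frame = Sequence[float]  # one video frame flattened to pixel intensities


def mse(a: Frame, b: Frame) -> float:
    """Mean squared error between two flattened frames of equal size."""
    if len(a) != len(b):
        raise ValueError("frames must have the same number of pixels")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def decreasing_weights(num_frames: int) -> List[float]:
    """Linearly decreasing weights that sum to 1.

    ASSUMPTION: the paper's exact schedule is not stated in the abstract;
    this linear form only illustrates the idea that nearer future frames
    count more than distant ones.
    """
    raw = [num_frames - t for t in range(num_frames)]  # T, T-1, ..., 1
    total = sum(raw)
    return [r / total for r in raw]


def reconstruction_loss(inputs: Sequence[Frame],
                        reconstructions: Sequence[Frame]) -> float:
    """Average per-frame error between input frames and their reconstructions."""
    if len(inputs) != len(reconstructions):
        raise ValueError("need one reconstruction per input frame")
    return sum(mse(x, r) for x, r in zip(inputs, reconstructions)) / len(inputs)


def prediction_loss(future: Sequence[Frame],
                    predicted: Sequence[Frame]) -> float:
    """Weighted error on predicted future frames; later frames weigh less."""
    if len(future) != len(predicted):
        raise ValueError("need one prediction per future frame")
    weights = decreasing_weights(len(future))
    return sum(w * mse(f, p) for w, f, p in zip(weights, future, predicted))


def total_loss(inputs: Sequence[Frame],
               reconstructions: Sequence[Frame],
               future: Sequence[Frame],
               predicted: Sequence[Frame],
               pred_weight: float = 1.0) -> float:
    """Reconstruction loss plus a scaled prediction loss.

    ASSUMPTION: pred_weight is a hypothetical balancing factor.
    """
    return (reconstruction_loss(inputs, reconstructions)
            + pred_weight * prediction_loss(future, predicted))
```

With this schedule, the same pixel error costs more on the first predicted frame than on the last one, which reflects the intuition that predictions further into the future are less certain and should be penalized less.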


Published in

MM '17: Proceedings of the 25th ACM international conference on Multimedia
October 2017
2028 pages
ISBN: 9781450349062
DOI: 10.1145/3123266

Copyright © 2017 ACM


Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

MM '17 paper acceptance rate: 189 of 684 submissions, 28%
Overall acceptance rate: 995 of 4,171 submissions, 24%
