research-article

Spatio-Temporal AutoEncoder for Video Anomaly Detection

Authors:
Yiru Zhao (Shanghai Jiao Tong University & Alibaba Group, Shanghai, China)
Bing Deng (Alibaba Group, Hangzhou, China)
Chen Shen (Zhejiang University & Alibaba Group, Hangzhou, China)
Yao Liu (Alibaba Group, Hangzhou, China)
Hongtao Lu (Shanghai Jiao Tong University, Shanghai, China)
Xian-Sheng Hua (Alibaba Group, Hangzhou, China)

MM '17: Proceedings of the 25th ACM international conference on Multimedia, October 2017, Pages 1933–1941. https://doi.org/10.1145/3123266.3123451

Published: 23 October 2017

ABSTRACT

Detecting anomalous events in real-world video scenes is challenging because of the complexity of what constitutes an "anomaly" and because of the cluttered backgrounds, objects, and motions in the scenes. Most existing methods use hand-crafted features in local spatial regions to identify anomalies. In this paper, we propose a novel model called the Spatio-Temporal AutoEncoder (ST AutoEncoder or STAE), which uses deep neural networks to learn video representations automatically and extracts features from both the spatial and temporal dimensions by performing 3-dimensional convolutions. In addition to the reconstruction loss used in typical existing autoencoders, we introduce a weight-decreasing prediction loss for generating future frames, which enhances the learning of motion features in videos. Since most anomaly detection datasets are restricted to appearance anomalies or unnatural motion anomalies, we collected a new, challenging dataset comprising a set of real-world traffic surveillance videos. Experiments on both the public benchmarks and our traffic dataset show that the proposed method remarkably outperforms the state-of-the-art approaches.
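
The abstract does not give the formula for the weight-decreasing prediction loss, so the following is only an illustrative sketch of the general idea: a reconstruction term over the input frames plus a prediction term in which errors on nearer future frames count more than errors on farther ones. The linear weighting scheme, the `pred_weight` balance factor, the flattened-list frame representation, and all function names are assumptions made for this example, not the authors' implementation.

```python
from typing import List, Sequence

# Assumption: a frame is flattened to a sequence of pixel intensities.
Frame = Sequence[float]


def mse(a: Frame, b: Frame) -> float:
    """Mean squared error between two frames of equal size."""
    if len(a) != len(b):
        raise ValueError("frames must have the same number of pixels")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def decreasing_weights(num_frames: int) -> List[float]:
    """Hypothetical scheme: linearly decreasing weights that sum to 1."""
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    raw = [num_frames - t for t in range(num_frames)]
    total = sum(raw)
    return [r / total for r in raw]


def combined_loss(
    inputs: Sequence[Frame],
    reconstructed: Sequence[Frame],
    future: Sequence[Frame],
    predicted: Sequence[Frame],
    pred_weight: float = 1.0,  # hypothetical balance between the two terms
) -> float:
    """Reconstruction loss plus a prediction loss with decreasing weights."""
    # Reconstruction term: average error over the observed input frames.
    recon = sum(mse(x, r) for x, r in zip(inputs, reconstructed)) / len(inputs)
    # Prediction term: earlier future frames receive larger weights.
    weights = decreasing_weights(len(future))
    pred = sum(w * mse(f, p) for w, f, p in zip(weights, future, predicted))
    return recon + pred_weight * pred


if __name__ == "__main__":
    observed = [[0.0, 1.0]]
    target_future = [[1.0, 1.0], [1.0, 1.0]]
    # Same error magnitude, placed on the first or the second future frame.
    near_error = combined_loss(observed, observed, target_future, [[0.0, 0.0], [1.0, 1.0]])
    far_error = combined_loss(observed, observed, target_future, [[1.0, 1.0], [0.0, 0.0]])
    print(near_error, far_error)
```

Under this assumed weighting, the same pixel error is penalised more heavily on the first predicted frame than on the last, which reflects the intuition that frames further in the future are harder to predict and should constrain training less.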

Index Terms

Spatio-Temporal AutoEncoder for Video Anomaly Detection
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
  2. Machine learning
    1. Machine learning approaches

Published in
MM '17: Proceedings of the 25th ACM international conference on Multimedia
October 2017
2028 pages
ISBN:9781450349062
DOI:10.1145/3123266
General Chairs:
Qiong Liu (FXPAL, USA)
Rainer Lienhart (Universität Augsburg, Germany)
Haohong Wang (TCL America, USA)
Program Chairs:
Sheng-Wei "Kuan-Ta" Chen (Academia Sinica, Taiwan)
Susanne Boll (University of Oldenburg, Germany)
Phoebe Chen (La Trobe University, Australia)
Gerald Friedland (Lawrence Livermore National Lab, USA)
Jia Li (Google, USA)
Shuicheng Yan (Qihoo 360, China)
Copyright © 2017 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
3d convolutions
autoencoder
video anomaly detection
Qualifiers
- research-article
Acceptance Rates
MM '17 Paper Acceptance Rate: 189 of 684 submissions, 28%
Overall Acceptance Rate: 995 of 4,171 submissions, 24%