Estimating Head Motion from Egocentric Vision

ABSTRACT
The recent availability of lightweight, wearable cameras makes it possible to collect video data from a "first-person" perspective, capturing the visual world of the wearer in everyday interactive contexts. In this paper, we investigate how to exploit egocentric vision to infer the multimodal behaviors of people wearing head-mounted cameras. More specifically, we estimate head (camera) motion from egocentric video, which can in turn be used to infer non-verbal behaviors such as head turns and nodding in multimodal interactions. We propose several approaches based on Convolutional Neural Networks (CNNs) that combine raw images and optical flow fields to learn to distinguish regions whose optical flow is caused by global ego-motion from regions whose flow is caused by other motion in the scene. Our results suggest that CNNs do not directly learn useful visual features with end-to-end training from raw images alone; instead, a better approach is to first extract optical flow explicitly and then train CNNs to integrate optical flow with visual information.
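The distinction the networks must learn, namely flow caused by global ego-motion versus flow caused by independently moving objects, can be illustrated with a simple non-learned baseline. This is a sketch for intuition only, not the CNN method described above, and the function name and thresholding scheme are our own assumptions: estimate the dominant head motion as a robust statistic of the dense flow field, then flag pixels whose flow deviates strongly from it.

```python
import numpy as np

def estimate_ego_motion(flow):
    """Estimate global (head) motion as the robust median of a dense
    optical flow field, then flag pixels whose flow deviates strongly
    from it as likely independent (non-ego) motion.

    flow: (H, W, 2) array of per-pixel (dx, dy) displacements,
          e.g. produced by any dense optical flow algorithm.
    Returns (global_motion, outlier_mask).
    """
    vecs = flow.reshape(-1, 2)
    # The median is robust to a minority of independently moving pixels.
    global_motion = np.median(vecs, axis=0)
    residual = np.linalg.norm(flow - global_motion, axis=2)
    # Threshold residuals via median absolute deviation (MAD);
    # the small epsilon guards against a degenerate zero MAD.
    mad = np.median(np.abs(residual - np.median(residual)))
    outlier_mask = residual > np.median(residual) + 3.0 * (mad + 1e-6)
    return global_motion, outlier_mask

# Toy example: uniform rightward flow (a simulated head turn)
# plus one independently moving patch.
flow = np.tile([2.0, 0.0], (64, 64, 1))
flow[20:30, 20:30] = [-5.0, 3.0]
motion, mask = estimate_ego_motion(flow)
```

In the toy example, the median recovers the global rightward flow of the simulated head turn, and only the independently moving patch is flagged. The CNN-based approaches studied in the paper learn this separation from data rather than relying on a fixed robust estimator, which matters when ego-motion induces non-uniform flow (e.g. rotation or parallax).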