DOI: 10.1145/3242969.3242982
Research Article · Public Access

Estimating Head Motion from Egocentric Vision

Published: 02 October 2018

ABSTRACT

The recent availability of lightweight, wearable cameras allows for collecting video data from a "first-person" perspective, capturing the visual world of the wearer in everyday interactive contexts. In this paper, we investigate how to exploit egocentric vision to infer multimodal behaviors from people wearing head-mounted cameras. More specifically, we estimate head (camera) motion from egocentric video, which can be further used to infer non-verbal behaviors such as head turns and nodding in multimodal interactions. We propose several approaches based on Convolutional Neural Networks (CNNs) that combine raw images and optical flow fields to learn to distinguish regions with optical flow caused by global ego-motion from those caused by other motion in a scene. Our results suggest that CNNs do not directly learn useful visual features with end-to-end training from raw images alone; instead, a better approach is to first extract optical flow explicitly and then train CNNs to integrate optical flow and visual information.
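As a concrete illustration of the two-stage approach the abstract favors (explicit optical flow first, then a CNN that fuses flow with appearance), the sketch below computes dense optical flow with OpenCV and feeds the flow field together with the RGB frame into a small two-branch network. This is not the authors' implementation: the Farnebäck flow parameters, the layer sizes, and the five head-motion classes are illustrative assumptions.

    # Illustrative sketch only: explicit optical flow + RGB fed to a small
    # two-branch CNN for coarse head-motion classification. Layer sizes and
    # the five motion classes are assumptions, not the paper's architecture.
    import cv2
    import torch
    import torch.nn as nn

    def dense_flow(prev_gray, curr_gray):
        # Farnebäck dense optical flow; returns an (H, W, 2) array of
        # per-pixel horizontal and vertical displacements.
        return cv2.calcOpticalFlowFarneback(
            prev_gray, curr_gray, None,
            pyr_scale=0.5, levels=3, winsize=15,
            iterations=3, poly_n=5, poly_sigma=1.2, flags=0)

    class TwoBranchHeadMotionNet(nn.Module):
        # One convolutional branch for appearance (3-channel RGB) and one for
        # motion (2-channel flow), fused by concatenation before a linear
        # classifier over head-motion classes (e.g. turn left/right, nod up/down, still).
        def __init__(self, num_classes=5):
            super().__init__()
            def branch(in_channels):
                return nn.Sequential(
                    nn.Conv2d(in_channels, 32, kernel_size=5, stride=2, padding=2),
                    nn.BatchNorm2d(32), nn.ReLU(),
                    nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
                    nn.BatchNorm2d(64), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1))
            self.rgb_branch = branch(3)
            self.flow_branch = branch(2)
            self.classifier = nn.Linear(64 + 64, num_classes)

        def forward(self, rgb, flow):
            # rgb: (N, 3, H, W) frames; flow: (N, 2, H, W) flow fields.
            features = torch.cat(
                [self.rgb_branch(rgb).flatten(1),
                 self.flow_branch(flow).flatten(1)], dim=1)
            return self.classifier(features)

A training loop would pair consecutive frames, compute the flow once per pair, and supervise the classifier with labeled head-motion segments. The point of precomputing flow, per the abstract's finding, is that the network receives an explicit motion signal and only has to learn to separate global ego-motion from motion caused by other people and objects in the scene.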


      Published in
      ICMI '18: Proceedings of the 20th ACM International Conference on Multimodal Interaction
      October 2018, 687 pages
      ISBN: 9781450356923
      DOI: 10.1145/3242969

      Copyright © 2018 ACM


      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 2 October 2018


      Acceptance Rates

      ICMI '18 Paper Acceptance Rate: 63 of 149 submissions, 42%. Overall Acceptance Rate: 453 of 1,080 submissions, 42%.
