Abstract
We present a new interactive and online approach to 3D scene understanding. Our system, SemanticPaint, allows users to scan their environment while interactively segmenting the scene simply by reaching out and touching any desired object or surface. The system continuously learns from these segmentations and labels new, unseen parts of the environment. Unlike offline systems, where capture, labeling, and batch learning often take hours or even days to perform, our approach is fully online. This gives users continuous live feedback on the recognition during capture, allowing them to immediately correct errors in the segmentation and/or learning, a feature that has so far been unavailable to batch and offline methods. The resulting models are tailored specifically to the user's environments and object classes of interest, opening up the potential for new applications in augmented reality, interior design, and human/robot navigation. The approach also makes it possible to capture substantial labeled 3D datasets for training large-scale visual recognition systems.
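The label, learn, predict, and correct loop described above can be illustrated with a minimal sketch. This is not the system's actual learner: the `OnlineCentroidClassifier` class, its two-dimensional feature vectors, and the "chair"/"table" labels are all hypothetical stand-ins, chosen only to show how a model updated one example at a time can respond immediately to a user's correction.

```python
from collections import defaultdict


class OnlineCentroidClassifier:
    """Hypothetical stand-in learner: one running mean feature vector per label."""

    def __init__(self):
        self.sums = {}                  # label -> per-dimension feature sums
        self.counts = defaultdict(int)  # label -> number of examples seen

    def update(self, feature, label):
        # Incremental update: no stored dataset and no batch retraining.
        if label not in self.sums:
            self.sums[label] = [0.0] * len(feature)
        self.sums[label] = [s + f for s, f in zip(self.sums[label], feature)]
        self.counts[label] += 1

    def predict(self, feature):
        # Return the label whose mean feature vector is closest, or None
        # if no labels have been provided yet.
        if not self.sums:
            return None

        def squared_distance(label):
            n = self.counts[label]
            return sum((s / n - f) ** 2
                       for s, f in zip(self.sums[label], feature))

        return min(self.sums, key=squared_distance)


clf = OnlineCentroidClassifier()
clf.update([1.0, 0.0], "chair")         # user touches a chair
clf.update([0.0, 1.0], "table")         # user touches a table
before = clf.predict([0.6, 0.4])        # prediction for an unseen surface
clf.update([0.6, 0.4], "table")         # user corrects the prediction
after = clf.predict([0.6, 0.4])         # the correction takes effect at once
print(before, after)
```

In this toy setting a single corrective example shifts the "table" mean enough to change the prediction, which is the kind of immediate feedback that a batch pipeline cannot offer.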
Supplemental Material
Supplemental movie, appendix, image, and software files are available for download for SemanticPaint: Interactive 3D Labeling and Learning at your Fingertips.