ABSTRACT
Many virtual reality applications let multiple users communicate in a multi-talker environment, recreating the classic cocktail-party effect. While there is a vast body of research on the perception and intelligibility of human speech in real-world cocktail-party scenarios, little work has accurately modeled and evaluated the effect in virtual environments. To evaluate the impact of virtual acoustic simulation on the cocktail-party effect, we conducted experiments to establish signal-to-noise ratio (SNR) thresholds for target-word identification performance. Our evaluation used sentences from the coordinate response measure corpus presented in the presence of multi-talker babble, with thresholds established under varying sound-propagation and spatialization conditions. We used a state-of-the-art geometric acoustic system integrated into the Unity game engine to simulate three levels of reverberance (direct sound only; direct sound with early reflections; direct sound with early reflections and late reverberation) and three spatialization modes (mono, stereo, and binaural). Our results show that spatialization has the largest effect on listeners' ability to discern target words in multi-talker virtual environments, whereas reverberance slightly degrades target-word identification.
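As a side note, the SNR conditions described above are typically realized by scaling the babble masker relative to the target speech before mixing. The following is a minimal illustrative sketch of that step (not the authors' implementation); the function name `mix_at_snr` and the white-noise stand-ins for speech signals are assumptions for the example.

```python
import numpy as np

def mix_at_snr(target, babble, snr_db):
    """Scale the babble masker so that the target-to-babble power
    ratio equals the requested SNR (in dB), then mix the signals."""
    p_target = np.mean(target ** 2)
    p_babble = np.mean(babble ** 2)
    # Gain applied to the masker so that p_target / (gain^2 * p_babble)
    # equals 10^(snr_db / 10).
    gain = np.sqrt(p_target / (p_babble * 10 ** (snr_db / 10)))
    return target + gain * babble

# Example with white-noise stand-ins for target speech and babble.
rng = np.random.default_rng(0)
target = rng.standard_normal(16000)
babble = rng.standard_normal(16000)
mix = mix_at_snr(target, babble, -6.0)  # target 6 dB below the masker
```

In an adaptive threshold procedure (e.g., a transformed up-down method), `snr_db` would be raised or lowered trial by trial based on the listener's identification responses.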
Index Terms
- Effects of virtual acoustics on target-word identification performance in multi-talker environments