
Toward adaptive conversational interfaces: Modeling speech convergence with animated personas

Published: 01 September 2004

Abstract

The design of robust interfaces that process conversational speech is a challenging research direction, largely because users' spoken language is so variable. This research explored a new dimension of speaker stylistic variation by examining whether users' speech converges systematically with the text-to-speech (TTS) heard from a software partner. To pursue this question, a study was conducted in which twenty-four 7- to 10-year-old children conversed with animated partners that embodied different TTS voices. An analysis of children's amplitude, durational features, and dialogue response latencies confirmed that they spontaneously adapt several basic acoustic-prosodic features of their speech by 10--50%, with the largest adaptations involving utterance pause structure and amplitude. Children's speech adaptations were relatively rapid, bidirectional, and dynamically readaptable when they were introduced to new partners, and they generalized across different types of users and TTS voices. Adaptations also occurred consistently, with 70--95% of children converging with their partner's TTS, although individual differences in the magnitude of adaptation were evident. In the design of future conversational systems, users' spontaneous convergence could be exploited to guide their speech within system processing bounds, thereby enhancing robustness. Adaptive system processing could yield further significant performance gains. The long-term goal of this research is the development of predictive models of human-computer communication to guide the design of new conversational interfaces.
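To make the convergence figures above concrete, here is a minimal sketch (in Python) of how a percent-convergence score of this kind could be computed for a single acoustic-prosodic feature. It is illustrative only, not the authors' scoring procedure: the function name, signature, and example values are all assumptions.

```python
# Hypothetical sketch: scoring how far a child's speech feature (e.g., mean
# amplitude in dB, utterance pause duration in ms, or dialogue response
# latency in ms) moved toward an animated partner's TTS value.
# Not the paper's scoring procedure; names and values are illustrative.

def convergence_percent(child_baseline: float,
                        child_with_partner: float,
                        partner_tts: float) -> float:
    """Percent of the child-partner feature gap closed during interaction.

    100 means the child fully matched the partner's TTS value, 0 means no
    change, and a negative score means divergence away from the partner.
    """
    gap = partner_tts - child_baseline
    if gap == 0:
        return 0.0  # feature already matched; no gap to close
    return 100.0 * (child_with_partner - child_baseline) / gap


if __name__ == "__main__":
    # Example: a child's mean utterance pause shortens from 500 ms to 380 ms
    # while conversing with a brisker TTS partner that pauses only 200 ms.
    score = convergence_percent(child_baseline=500.0,
                                child_with_partner=380.0,
                                partner_tts=200.0)
    print(f"pause-duration convergence: {score:.0f}%")  # prints 40%
```

Under this hypothetical scoring, the example child has closed 40% of the 300 ms gap that initially separated her baseline pausing from the partner's TTS.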


    Published In

ACM Transactions on Computer-Human Interaction, Volume 11, Issue 3
September 2004
92 pages
ISSN: 1073-0516
EISSN: 1557-7325
DOI: 10.1145/1017494

    Publisher

Association for Computing Machinery, New York, NY, United States

    Publication History

    Published: 01 September 2004
    Published in TOCHI Volume 11, Issue 3

    Author Tags

    1. Adaptive interfaces
    2. amplitude
    3. animated characters
    4. children's educational software
    5. communication accommodation theory
    6. conversational interfaces
    7. dialogue response latency
    8. duration
    9. human-computer adaptation
    10. individual differences
    11. mobile interfaces
    12. social metaphors
    13. speech recognition
    14. text-to-speech

    Qualifiers

    • Article

    Cited By

• (2024) "Automatically adapting system pace towards user pace — Empirical studies." International Journal of Human-Computer Studies 185:C. DOI: 10.1016/j.ijhcs.2024.103228. Online publication date: 1 May 2024.
• (2023) "Investigating syntactic priming cumulative effects in MT-human interaction." Open Research Europe 1, 93. DOI: 10.12688/openreseurope.13902.2. Online publication date: 13 Dec 2023.
• (2023) "“I Won’t Go Speechless”: Design Exploration on a Real-Time Text-To-Speech Speaking Tool for Videoconferencing." Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, 1-20. DOI: 10.1145/3544548.3581215. Online publication date: 19 Apr 2023.
• (2023) "Exploring The Potential of VR Interfaces in Animation: A Comprehensive Review." 2023 International Conference on Advancement in Computation & Computer Technologies (InCACCT), 1-6. DOI: 10.1109/InCACCT57535.2023.10141748. Online publication date: 5 May 2023.
• (2023) "Speech Entrainment in Adolescent Conversations: A Developmental Perspective." Journal of Speech, Language, and Hearing Research, 1-19. DOI: 10.1044/2023_JSLHR-22-00263. Online publication date: 18 Apr 2023.
• (2023) "How children speak with their voice assistant Sila depends on what they think about her." Computers in Human Behavior 143:C. DOI: 10.1016/j.chb.2023.107693. Online publication date: 1 Jun 2023.
• (2022) "Can VUI Turn-Taking Entrain User Behaviours?" Proceedings of the 13th Indian Conference on Human-Computer Interaction, 42-56. DOI: 10.1145/3570211.3570215. Online publication date: 9 Nov 2022.
• (2022) "Embrace your incompetence! Designing appropriate CUI communication through an ecological approach." Proceedings of the 4th Conference on Conversational User Interfaces, 1-5. DOI: 10.1145/3543829.3544531. Online publication date: 26 Jul 2022.
• (2022) "Synchrony facilitates altruistic decision making for non-human avatars." Computers in Human Behavior 128:C. DOI: 10.1016/j.chb.2021.107079. Online publication date: 1 Mar 2022.
• (2021) "Can Google Translate Rewire Your L2 English Processing?" Digital 1, 1, 66-85. DOI: 10.3390/digital1010006. Online publication date: 4 Mar 2021.
