
JALI: an animator-centric viseme model for expressive lip synchronization

Published: 11 July 2016

Abstract

The rich signals we extract from facial expressions impose high expectations on the science and art of facial animation. While the advent of high-resolution performance capture has greatly improved realism, the utility of procedural animation warrants a prominent place in the facial animation workflow. We present a system that, given an input audio soundtrack and speech transcript, automatically generates expressive lip-synchronized facial animation that is amenable to further artistic refinement, and that is comparable with both performance capture and professional animator output. Because of the diversity of ways we produce sound, the mapping from phonemes to visual depictions as visemes is many-valued. We draw from psycholinguistics to capture this variation using two visually distinct anatomical actions: Jaw and Lip, where sound is primarily controlled by jaw articulation and lower-face muscles, respectively. We describe the construction of a transferable template JALI 3D facial rig, built upon the popular facial muscle action unit representation FACS. We show that acoustic properties in a speech signal map naturally to the dynamic degree of jaw and lip in visual speech. We provide an array of compelling animation clips, compare against performance capture and existing procedural animation, and report on a brief user study.
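
To make the jaw/lip decomposition concrete, here is a minimal, self-contained Python sketch of the general idea the abstract describes: separate jaw (JA) and lip (LI) activations driven by acoustic properties of the speech signal. It is an illustration under stated assumptions, not the authors' published mapping; the function name jali_activation_sketch, the choice of RMS loudness for the jaw axis and high-band spectral energy for the lip axis, and all constants are hypothetical.

    # Illustrative sketch only: per-frame acoustic features -> hypothetical
    # JA (jaw) and LI (lip) activations in [0, 1]. Feature choices and
    # constants are assumptions, not the JALI authors' published mapping.
    import numpy as np

    def jali_activation_sketch(samples, sr, frame_ms=25.0, hop_ms=10.0):
        """Return (ja, li) arrays with one activation value per analysis frame."""
        frame = int(sr * frame_ms / 1000)
        hop = int(sr * hop_ms / 1000)
        window = np.hanning(frame)
        ja, li = [], []
        for start in range(0, max(len(samples) - frame, 1), hop):
            win = samples[start:start + frame]
            if len(win) < frame:
                break
            # RMS intensity: louder, more open speech -> larger jaw activation.
            rms = float(np.sqrt(np.mean(win ** 2)))
            # High-band energy ratio (>= 4 kHz): a crude proxy for fricative-like,
            # lip/teeth-articulated sounds -> larger lip activation.
            spectrum = np.abs(np.fft.rfft(win * window))
            freqs = np.fft.rfftfreq(frame, d=1.0 / sr)
            high_ratio = spectrum[freqs >= 4000.0].sum() / (spectrum.sum() + 1e-9)
            ja.append(min(1.0, rms / 0.1))          # 0.1: assumed loudness normalizer
            li.append(min(1.0, 2.0 * high_ratio))   # 2.0: assumed high-band scaling
        return np.array(ja), np.array(li)

    if __name__ == "__main__":
        # Synthetic one-second tone stands in for a real speech recording.
        sr = 16000
        t = np.linspace(0.0, 1.0, sr, endpoint=False)
        audio = 0.05 * np.sin(2.0 * np.pi * 220.0 * t)
        ja, li = jali_activation_sketch(audio, sr)
        print(ja.shape, li.shape, float(ja.mean()), float(li.mean()))

In a full pipeline these per-frame values would be smoothed and keyed onto the jaw and lip controls of a FACS-based rig per phoneme; the thresholds above are only meant to show the two-axis decomposition, not a production mapping.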

Supplementary Material

  • ZIP File (a127-edwards-supp.zip): Supplemental files.
  • MP4 File (a127.mp4)



Published In

ACM Transactions on Graphics, Volume 35, Issue 4
July 2016
1396 pages
ISSN: 0730-0301
EISSN: 1557-7368
DOI: 10.1145/2897824

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 11 July 2016
Published in TOG Volume 35, Issue 4


Author Tags

  1. audio-visual speech
  2. facial animation
  3. lip synchronization
  4. procedural animation
  5. speech synchronization

Qualifiers

  • Research-article

Funding Sources

  • Natural Sciences and Engineering Research Council of Canada
  • Canada Foundation for Innovation
  • Ontario Research Fund

