DOI: 10.1145/1073368.1073388

Transferable videorealistic speech animation

Published: 29 July 2005

Abstract

Image-based videorealistic speech animation achieves significant visual realism, but at the cost of collecting a large 5- to 10-minute video corpus from the specific person to be animated. This requirement hinders its use in broad applications, since a large video corpus for a specific person under a controlled recording setup may not be easily obtained. In this paper, we propose a model transfer and adaptation algorithm that allows a novel person to be animated using only a small video corpus. The algorithm starts with a multidimensional morphable model (MMM) previously trained on a different speaker with a large corpus and transfers it to the novel speaker with a much smaller corpus. It consists of 1) a novel matching-by-synthesis algorithm, which semi-automatically selects new MMM prototype images from the new video corpus, and 2) a novel gradient descent linear regression algorithm, which adapts the MMM phoneme models to the data in the novel video corpus. Encouraging experimental results are presented in which a morphable model trained on a performer with a 10-minute corpus is transferred to a novel person using a 15-second movie clip as the adaptation corpus.
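The adaptation step described in the abstract can be pictured with a toy sketch. The following Python is a hypothetical simplification, not the paper's actual algorithm: it learns a single affine transform (W, b) by gradient descent on the few phoneme models observed in the short adaptation corpus, then applies that transform to every phoneme mean of the source speaker's model, so that phonemes never seen in the adaptation clip are adapted too. All function and variable names are invented for illustration.

```python
import numpy as np

def adapt_phoneme_means(source_means, adapt_means, adapt_obs,
                        lr=0.05, steps=2000):
    """Fit an affine transform (W, b) to the phonemes observed in a
    short adaptation corpus, then apply it to every phoneme mean of
    the source-speaker model (so unseen phonemes are adapted too).

    source_means : (P, D) mean vectors of all P phoneme models
    adapt_means  : (K, D) source means of the K phonemes that appear
                   in the adaptation corpus
    adapt_obs    : (K, D) empirical means of those phonemes computed
                   from the novel speaker's short corpus
    """
    D = source_means.shape[1]
    W = np.eye(D)                 # start from the identity: no change
    b = np.zeros(D)
    n = len(adapt_means)
    for _ in range(steps):
        err = adapt_means @ W.T + b - adapt_obs   # residual, (K, D)
        # squared-error gradients (up to a constant factor)
        W -= lr * (err.T @ adapt_means) / n
        b -= lr * err.mean(axis=0)
    # transform *all* phoneme means, including unobserved ones
    return source_means @ W.T + b
```

On a synthetic example where the novel speaker's means really are an affine function of the source means, the recovered transform extrapolates correctly to phonemes that never appeared in the adaptation clip, which is the essential point of adapting with a 15-second corpus.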




Published In

SCA '05: Proceedings of the 2005 ACM SIGGRAPH/Eurographics Symposium on Computer Animation
July 2005, 366 pages
ISBN: 1595931988
DOI: 10.1145/1073368

Publisher

Association for Computing Machinery, New York, NY, United States



Conference

SCA '05: Symposium on Computer Animation
July 29-31, 2005, Los Angeles, California

Acceptance Rates

Overall acceptance rate: 183 of 487 submissions (38%)


Cited By

  • (2023) Deep Person Generation: A Survey from the Perspective of Face, Pose, and Cloth Synthesis. ACM Computing Surveys 55, 12, 1-37. DOI: 10.1145/3575656
  • (2022) Talking Faces: Audio-to-Video Face Generation. Handbook of Digital Face Manipulation and Detection, 163-188. DOI: 10.1007/978-3-030-87664-7_8
  • (2021) Iterative Text-Based Editing of Talking-Heads Using Neural Retargeting. ACM Transactions on Graphics 40, 3, 1-14. DOI: 10.1145/3449063
  • (2021) The Creation and Detection of Deepfakes. ACM Computing Surveys 54, 1, 1-41. DOI: 10.1145/3425780
  • (2020) Intuitive facial animation editing based on a generative RNN framework. Proc. ACM SIGGRAPH/Eurographics Symposium on Computer Animation, 1-11. DOI: 10.1111/cgf.14117
  • (2020) Talking-Head Generation with Rhythmic Head Motion. Computer Vision – ECCV 2020, 35-51. DOI: 10.1007/978-3-030-58545-7_3
  • (2019) Text-based editing of talking-head video. ACM Transactions on Graphics 38, 4, 1-14. DOI: 10.1145/3306346.3323028
  • (2018) Visual Speech Emotion Conversion using Deep Learning for 3D Talking Head. Proc. Joint Workshop on Affective Social Multimedia Computing and Multi-Modal Affective Computing of Large-Scale Multimedia Data, 7-13. DOI: 10.1145/3267935.3267950
  • (2018) HeadOn. ACM Transactions on Graphics 37, 4, 1-13. DOI: 10.1145/3197517.3201350
  • (2018) Deep video portraits. ACM Transactions on Graphics 37, 4, 1-14. DOI: 10.1145/3197517.3201283
