DOI: 10.1145/1180995.1181005

Automatic speech recognition for webcasts: how good is good enough and what to do when it isn't

Published: 02 November 2006

Abstract

The increased availability of broadband connections has recently led to an increase in the use of Internet broadcasting (webcasting). Most webcasts are archived and accessed numerous times retrospectively. One challenge to skimming and browsing through such archives is the lack of text transcripts of the webcast's audio channel. This paper describes a procedure for prototyping an Automatic Speech Recognition (ASR) system that generates realistic transcripts at any desired Word Error Rate (WER), thus overcoming the drawbacks of both prototype-based and Wizard of Oz simulations. We used such a system in a user study showing that transcripts with WERs below 25% are acceptable for use in webcast archives. As current ASR systems can, under realistic conditions, deliver WERs of only around 45%, we also describe a solution for reducing the WER of such transcripts: engaging users to collaborate, in a "wiki" fashion, on editing the imperfect transcripts obtained through ASR.
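For context, WER, the metric the study varies, is the word-level edit distance between the ASR hypothesis and a reference transcript, divided by the reference length. A minimal sketch of computing it via dynamic programming (not from the paper; the sample sentences are illustrative):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate = (substitutions + deletions + insertions) / reference length,
    computed as a word-level Levenshtein distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# 1 substitution ("sat" -> "sit") + 1 deletion ("the") over 6 reference words
print(wer("the cat sat on the mat", "the cat sit on mat"))  # 0.333...
```

Note that WER can exceed 100% when the hypothesis contains many insertions, which is why very noisy transcripts can be worse than useless for browsing.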




Published In

ICMI '06: Proceedings of the 8th international conference on Multimodal interfaces
November 2006
404 pages
ISBN: 159593541X
DOI: 10.1145/1180995

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. automatic speech recognition
  2. collaboration
  3. webcasts

Qualifiers

  • Article


Acceptance Rates

Overall acceptance rate: 453 of 1,080 submissions (42%)

Cited By
  • (2021) "Can we talk? Design Implications for the Questionnaire-Driven Self-Report of Health and Wellbeing via Conversational Agent," in Proc. of the 3rd Conference on Conversational User Interfaces, pp. 1-11, 27 Jul 2021. DOI: 10.1145/3469595.3469600
  • (2017) "Designing Speech, Acoustic and Multimodal Interactions," in Proc. of the 2017 CHI Conference Extended Abstracts on Human Factors in Computing Systems, pp. 601-608, 6 May 2017. DOI: 10.1145/3027063.3027086
  • (2017) "Extracting audio summaries to support effective spoken document search," Journal of the Association for Information Science and Technology, 68(9), pp. 2101-2115, 1 Sep 2017. DOI: 10.1002/asi.23831
  • (2016) "Designing Speech and Multimodal Interactions for Mobile, Wearable, and Pervasive Applications," in Proc. of the 2016 CHI Conference Extended Abstracts on Human Factors in Computing Systems, pp. 3612-3619, 7 May 2016. DOI: 10.1145/2851581.2856506
  • (2015) "Speech-based Interaction," in Proc. of the 20th International Conference on Intelligent User Interfaces, pp. 437-438, 18 Mar 2015. DOI: 10.1145/2678025.2716263
  • (2015) "An approach for automated video indexing and video search in large lecture video archives," in 2015 International Conference on Pervasive Computing (ICPC), pp. 1-5, Jan 2015. DOI: 10.1109/PERVASIVE.2015.7087169
  • (2015) "Optimized searching of video based on speech and video text content," in 2015 International Conference on Soft-Computing and Networks Security (ICSNS), pp. 1-4, Feb 2015. DOI: 10.1109/ICSNS.2015.7292369
  • (2014) "Designing speech and language interactions," in CHI '14 Extended Abstracts on Human Factors in Computing Systems, pp. 75-78, 26 Apr 2014. DOI: 10.1145/2559206.2559228
  • (2014) "Content Based Lecture Video Retrieval Using Speech and Video Text Information," IEEE Transactions on Learning Technologies, 7(2), pp. 142-154, Apr 2014. DOI: 10.1109/TLT.2014.2307305
  • (2013) "We need to talk," in CHI '13 Extended Abstracts on Human Factors in Computing Systems, pp. 2459-2464, 27 Apr 2013. DOI: 10.1145/2468356.2468803
