research-article

Vowel based Voice Activity Detection with LSTM Recurrent Neural Network

Authors:
Juntae Kim, Jaeseok Kim, Seunghyung Lee, Jinuk Park, Minsoo Hahn

Korea Advanced Institute of Science and Technology (KAIST), Daehak-ro, Yuseong-gu, Daejeon, Republic of Korea

ICSPS 2016: Proceedings of the 8th International Conference on Signal Processing Systems, November 2016, Pages 134–137, https://doi.org/10.1145/3015166.3015207

Published: 21 November 2016

ABSTRACT

Voice activity detection (VAD) determines whether incoming signal segments are speech or noise and is an important technique in almost all speech-related applications. To improve VAD performance in various noise environments, characterizing the speech feature has been the most crucial issue to date. Among the speech features proposed so far, the context information of speech through time and vowel sound characteristics are known to be the current state of the art. Therefore, to exploit both of these merits, we propose a vowel-based VAD using a long short-term memory recurrent neural network (LSTM-RNN). The LSTM-RNN is known to be a powerful model for capturing dynamic context information through time. Moreover, by training the LSTM-RNN on vowel sounds only rather than on whole speech, the network can learn more effectively because of the reduced manifold of speech. According to our experiments, the proposed method shows better accuracy than an LSTM-RNN based VAD, not only in the VAD task but also in a vowel detection task.
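
The difference between training on whole speech and training on vowel sounds only can be illustrated by how frame-level targets might be derived from a phone-level transcription. The following is a minimal sketch under stated assumptions, not the authors' procedure: the function name `frame_labels`, the TIMIT-style phone symbols, the vowel and silence inventories, and the frame hop are all hypothetical choices made for illustration.

```python
# Hypothetical phone inventories (TIMIT-style symbols); the paper's exact
# vowel set is not stated in the abstract, so these are assumptions.
VOWELS = {"iy", "ih", "eh", "ey", "ae", "aa", "aw", "ay", "ah",
          "ao", "oy", "ow", "uh", "uw", "er", "ax", "ix"}
SILENCE = {"h#", "pau", "epi"}


def frame_labels(segments, n_frames, hop):
    """Turn phone segments into per-frame training targets.

    segments: list of (start_sample, end_sample, phone) tuples.
    n_frames: number of frames to label.
    hop:      frame hop in samples (assumed value supplied by the caller).

    Returns (speech, vowel): two lists of 0/1 labels. `speech` marks every
    non-silence frame (a conventional VAD target); `vowel` marks only the
    frames whose center falls inside a vowel segment.
    """
    speech = [0] * n_frames
    vowel = [0] * n_frames
    for i in range(n_frames):
        center = i * hop + hop // 2  # label each frame by its center sample
        for start, end, phone in segments:
            if start <= center < end:
                speech[i] = int(phone not in SILENCE)
                vowel[i] = int(phone in VOWELS)
                break
    return speech, vowel


if __name__ == "__main__":
    # Toy transcription: silence, then "s", then the vowel "iy".
    segs = [(0, 160, "h#"), (160, 480, "s"), (480, 800, "iy")]
    speech, vowel = frame_labels(segs, n_frames=5, hop=160)
    print(speech)  # [0, 1, 1, 1, 1]
    print(vowel)   # [0, 0, 0, 1, 1]
```

In this sketch the vowel targets are a subset of the speech targets, which is one way to read the abstract's remark about a reduced manifold: the network only has to model vowel frames rather than every speech sound.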

Published in

ICSPS 2016: Proceedings of the 8th International Conference on Signal Processing Systems
November 2016
235 pages
ISBN: 9781450347907
DOI: 10.1145/3015166

Copyright © 2016 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Voice activity detection
recurrent neural network
vowel sound
Qualifiers
- research-article
- Research
- Refereed limited
Acceptance Rates
ICSPS 2016 Paper Acceptance Rate: 46 of 83 submissions, 55%
Overall Acceptance Rate: 46 of 83 submissions, 55%