DOI: 10.1145/3015166.3015207

Vowel based Voice Activity Detection with LSTM Recurrent Neural Network


ABSTRACT

Voice activity detection (VAD) determines whether incoming signal segments are speech or noise and is an important technique in almost all speech-related applications. To improve VAD performance in various noise environments, characterizing speech features has been the most crucial issue to date. Among the many proposed speech features, temporal context information and vowel-sound characteristics are known to underpin current state-of-the-art approaches. To exploit both of these merits, we propose a vowel-based VAD built on a long short-term memory recurrent neural network (LSTM-RNN). The LSTM-RNN is a powerful model for capturing dynamic context information over time. Moreover, by training the LSTM-RNN only on vowel sounds rather than on whole speech, it can learn more effectively because the manifold of the target class is reduced. In our experiments, the proposed method shows better accuracy than a conventional LSTM-RNN-based VAD, not only on the VAD task but also on a vowel detection task.
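The core idea, an LSTM that emits per-frame posteriors but is trained only on vowel targets, can be illustrated with a short sketch. The snippet below is a minimal, hypothetical reconstruction in TensorFlow/Keras, not the authors' implementation: the 40-dimensional frame features, single 64-unit LSTM layer, binary vowel-frame labels, and the 0.5 decision threshold are all illustrative assumptions rather than details taken from the paper.

```python
# Minimal sketch of a vowel-based VAD classifier with an LSTM (illustrative only).
# Assumptions not taken from the paper: tf.keras as the framework, 40-dim
# log-mel frame features, one 64-unit LSTM layer, and binary frame labels
# where 1 marks vowel frames (e.g., from phone alignments) and 0 everything else.
import numpy as np
import tensorflow as tf

NUM_FEATS = 40   # per-frame feature dimension (assumed)
SEQ_LEN = 100    # frames per training sequence (assumed)

def build_vowel_vad(num_feats: int = NUM_FEATS) -> tf.keras.Model:
    """LSTM that emits a per-frame vowel/non-vowel posterior."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(None, num_feats)),           # variable-length sequences
        tf.keras.layers.LSTM(64, return_sequences=True),   # temporal context
        tf.keras.layers.Dense(1, activation="sigmoid"),    # frame-level posterior
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

if __name__ == "__main__":
    # Dummy data standing in for frame features and vowel-frame labels.
    x = np.random.randn(8, SEQ_LEN, NUM_FEATS).astype("float32")
    y = np.random.randint(0, 2, size=(8, SEQ_LEN, 1)).astype("float32")

    model = build_vowel_vad()
    model.fit(x, y, epochs=1, batch_size=4)

    # Frames whose posterior exceeds a threshold are treated as speech-like;
    # a smoothing/hangover stage would normally bridge the non-vowel frames.
    posteriors = model.predict(x)
    speech_mask = posteriors > 0.5
```

How vowel posteriors are turned into final speech/non-speech decisions (thresholding, smoothing, hangover) is a separate design choice; the final lines above show one common option, not necessarily the one used in the paper.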


  • Published in

    ICSPS 2016: Proceedings of the 8th International Conference on Signal Processing Systems
    November 2016
    235 pages
    ISBN: 978-1-4503-4790-7
    DOI: 10.1145/3015166

    Copyright © 2016 ACM


    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    • Published: 21 November 2016


    Qualifiers

    • research-article
    • Research
    • Refereed limited

    Acceptance Rates

    ICSPS 2016 paper acceptance rate: 46 of 83 submissions, 55%. Overall acceptance rate: 46 of 83 submissions, 55%.
