ABSTRACT
Voice activity detection (VAD) determines whether incoming signal segments are speech or noise and is an important technique in almost all speech-related applications. To improve VAD performance in various noise environments, characterizing speech features has been the most crucial issue to date. Among the speech features proposed so far, the temporal context information of speech and vowel sound characteristics are known as the current state of the art. Therefore, to exploit both of these merits, we propose a vowel-based VAD using a long short-term memory recurrent neural network (LSTM-RNN). The LSTM-RNN is known to be a powerful model for capturing dynamic context information over time. Moreover, by training the LSTM-RNN on vowel sounds only rather than on whole speech, the LSTM-RNN can learn more effectively because of the reduced manifold of speech. According to our experiments, the proposed method shows better accuracy than LSTM-RNN-based VAD not only in the VAD task but also in a vowel detection task.
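To illustrate the difference between training on whole speech and training on vowel sounds only, the following minimal sketch builds per-frame targets of both kinds from a phone-level segmentation. It is not the authors' implementation: the vowel inventory, the non-speech labels, the 10 ms frame hop, and the names `VOWELS`, `NON_SPEECH`, and `frame_targets` are all assumptions made for illustration, since the abstract does not specify them.

```python
from typing import List, Tuple

# Hypothetical vowel inventory (ARPAbet-style symbols); the vowel set
# actually used in the paper is not given in the abstract.
VOWELS = {"aa", "ae", "ah", "ao", "aw", "ay", "eh", "er",
          "ey", "ih", "iy", "ow", "oy", "uh", "uw"}
# Hypothetical labels treated as non-speech.
NON_SPEECH = {"sil", "pau"}

Segment = Tuple[float, float, str]  # (start_sec, end_sec, phone_label)


def frame_targets(segments: List[Segment], n_frames: int,
                  hop_sec: float = 0.01) -> Tuple[List[int], List[int]]:
    """Return (speech_targets, vowel_targets), one 0/1 label per frame.

    speech_targets marks every frame that is not labeled as non-speech;
    vowel_targets marks only frames that fall inside a vowel segment.
    """
    speech = [0] * n_frames
    vowel = [0] * n_frames
    for start, end, label in segments:
        # Convert segment boundaries in seconds to frame indices.
        first = int(round(start / hop_sec))
        last = min(int(round(end / hop_sec)), n_frames)
        for t in range(first, last):
            if label not in NON_SPEECH:
                speech[t] = 1
            if label in VOWELS:
                vowel[t] = 1
    return speech, vowel


# Toy segmentation: silence, a consonant, a vowel, silence.
segments = [(0.00, 0.02, "sil"), (0.02, 0.04, "s"),
            (0.04, 0.07, "iy"), (0.07, 0.08, "sil")]
speech, vowel = frame_targets(segments, n_frames=8)
print(speech)
print(vowel)
```

In this toy example the vowel targets are a strict subset of the speech targets, which is one way to read the "reduced manifold" argument: a network trained on the vowel targets has to model a narrower class of sounds than one trained on all speech frames.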