ABSTRACT
In recent years, the proliferation of VoIP data has created a number of applications in which it is desirable to perform fast online classification and recognition of massive voice streams. Such applications are typically encountered in real-time intelligence and surveillance. In many cases the data streams are in compressed format, and data processing must often run at rates of gigabits per second. All known techniques for speaker voice analysis require an offline training phase in which the system is trained with known segments of speech. The state-of-the-art method for text-independent speaker recognition, Gaussian Mixture Modeling (GMM), requires an iterative Expectation Maximization procedure for training, which cannot be implemented in real time. In this paper, we discuss the details of an online voice recognition system designed for such streams. For this purpose, we use our micro-clustering algorithms to design concise signatures of the target speakers. A surprising and insightful observation from our experience with this system is that, although it was originally designed only for efficiency, it also proved more accurate than the widely used GMM. This was because of the conciseness of the micro-cluster model, which made it less prone to overtraining. This is evidence that it is often possible to get the best of both worlds and outperform complex models in both efficiency and accuracy.
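To make the idea of a concise micro-cluster signature concrete, here is a minimal sketch in Python. It is not the system described in the paper. It assumes each speech frame has already been converted to a fixed-length numeric feature vector, and it uses the additive cluster summary (count, linear sum, squared sum) familiar from the BIRCH and stream-clustering literature. The names `MicroCluster`, `update_signature`, and `signature_distance`, and the parameters `max_clusters` and `threshold`, are hypothetical and chosen only for illustration.

```python
import math


class MicroCluster:
    """Additive summary of a group of feature vectors (illustrative only)."""

    def __init__(self, point):
        self.n = 1                           # number of absorbed points
        self.ls = list(point)                # per-dimension linear sum
        self.ss = [x * x for x in point]     # per-dimension squared sum

    def centroid(self):
        return [s / self.n for s in self.ls]

    def rms_radius(self):
        # Root-mean-square deviation from the centroid, computed from the sums alone.
        var = sum(ss / self.n - (ls / self.n) ** 2
                  for ss, ls in zip(self.ss, self.ls))
        return math.sqrt(max(var, 0.0))

    def absorb(self, point):
        # Constant-time update: no stored points are revisited.
        self.n += 1
        for i, x in enumerate(point):
            self.ls[i] += x
            self.ss[i] += x * x


def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def update_signature(clusters, point, max_clusters=8, threshold=1.0):
    """One-pass update: absorb into the nearest micro-cluster or open a new one.

    max_clusters and threshold are assumed tuning knobs, not values from the paper.
    """
    if clusters:
        nearest = min(clusters, key=lambda c: distance(c.centroid(), point))
        close_enough = distance(nearest.centroid(), point) <= threshold
        if close_enough or len(clusters) >= max_clusters:
            nearest.absorb(point)
            return clusters
    clusters.append(MicroCluster(point))
    return clusters


def signature_distance(clusters, point):
    """Distance from a feature vector to the closest centroid of a signature."""
    return min(distance(c.centroid(), point) for c in clusters)
```

Each incoming vector is processed once and the signature size is capped at `max_clusters`, which contrasts with the iterative Expectation Maximization training the abstract describes for GMM. Under this sketch, an unknown stream could be scored against each speaker's signature with `signature_distance`.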
- C. C. Aggarwal, J. Han, J. Wang, P. Yu. A Framework for Clustering Evolving Data Streams. VLDB Conference, 2003.
- I. Nabney. Netlab: Algorithms for Pattern Recognition. Advances in Pattern Recognition, Springer-Verlag, Germany, 2001. URL: http://www.ncrg.aston.ac.uk/netlab/down.php
- M. Przybocki, A. Martin. NIST's Assessment of Text Independent Speaker Recognition Performance. http://www.nist.gov/speech/publications/index.html
- D. Reynolds, T. Quatieri, R. Dunn. Speaker Verification using Adapted Gaussian Mixture Models. Digital Signal Processing, Vol. 10, pp. 42-54, 2000.
- T. Zhang, R. Ramakrishnan, M. Livny. BIRCH: An Efficient Data Clustering Method for Very Large Databases. ACM SIGMOD Conference, 1996.
Index Terms
- A framework for classification and segmentation of massive audio data streams