Article

A framework for classification and segmentation of massive audio data streams

Author:
Charu C. Aggarwal

IBM T. J. Watson Research Center, Hawthorne, NY


KDD '07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, August 2007, Pages 1013–1017. https://doi.org/10.1145/1281192.1281302

Published: 12 August 2007


ABSTRACT

In recent years, the proliferation of VoIP data has created a number of applications in which it is desirable to perform quick online classification and recognition of massive voice streams. Such applications are typically encountered in real-time intelligence and surveillance. In many cases, the data streams are in compressed format, and data can arrive at rates of gigabits per second. All known techniques for speaker voice analysis require an offline training phase in which the system is trained with known segments of speech. The state-of-the-art method for text-independent speaker recognition, Gaussian Mixture Modeling (GMM), requires an iterative Expectation-Maximization (EM) procedure for training, which cannot be implemented in real time. In this paper, we discuss the details of an online voice recognition system for such streams. For this purpose, we use our micro-clustering algorithms to design concise signatures of the target speakers. One surprising and insightful observation from our experience with such a system is that, although it was originally designed only for efficiency, it also proved more accurate than the widely used GMM. This was because of the conciseness of the micro-cluster model, which made it less prone to overtraining. This is evidence that it is often possible to get the best of both worlds and outperform complex models in both efficiency and accuracy.
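The abstract does not say how the speaker signatures are built or matched. As a rough illustration only, the Python sketch below assumes each micro-cluster is an additive summary (count, linear sum, squared sum) in the style of BIRCH and CluStream, and that a test segment is assigned to the speaker whose centroids lie closest on average. The class names, the `radius` threshold, the `max_clusters` cap, and the scoring rule are all hypothetical and not taken from the paper.

```python
import math
from dataclasses import dataclass


@dataclass
class MicroCluster:
    """Additive summary of a set of points: count, linear sum, squared sum."""
    n: int
    linear_sum: list
    squared_sum: list

    @classmethod
    def from_point(cls, x):
        return cls(1, list(x), [v * v for v in x])

    def add(self, x):
        # Constant-time update per dimension; no past points are stored.
        self.n += 1
        for i, v in enumerate(x):
            self.linear_sum[i] += v
            self.squared_sum[i] += v * v

    def centroid(self):
        return [s / self.n for s in self.linear_sum]


def distance(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


class SpeakerSignature:
    """A small, bounded set of micro-clusters built in one pass (assumed design)."""

    def __init__(self, max_clusters=4, radius=1.0):
        self.max_clusters = max_clusters  # hypothetical cap on signature size
        self.radius = radius              # hypothetical absorption threshold
        self.clusters = []

    def update(self, x):
        if not self.clusters:
            self.clusters.append(MicroCluster.from_point(x))
            return
        nearest = min(self.clusters, key=lambda c: distance(c.centroid(), x))
        close = distance(nearest.centroid(), x) <= self.radius
        if close or len(self.clusters) >= self.max_clusters:
            nearest.add(x)  # absorb the frame into an existing micro-cluster
        else:
            self.clusters.append(MicroCluster.from_point(x))

    def score(self, frames):
        """Mean distance from each frame to its nearest centroid (lower is closer)."""
        centroids = [c.centroid() for c in self.clusters]
        return sum(min(distance(c, x) for c in centroids) for x in frames) / len(frames)


def classify(signatures, frames):
    """Return the name of the speaker whose signature best matches the frames."""
    return min(signatures, key=lambda name: signatures[name].score(frames))


# Toy two-dimensional "feature frames"; real systems would use acoustic features.
signatures = {"alice": SpeakerSignature(), "bob": SpeakerSignature()}
for x in [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1)]:
    signatures["alice"].update(x)
for x in [(10.0, 10.0), (10.2, 9.9)]:
    signatures["bob"].update(x)

print(classify(signatures, [(0.1, 0.1), (0.0, 0.2)]))
```

Because each update touches only one fixed-size summary, the cost per incoming frame is bounded by the number of micro-clusters, which is the kind of property a one-pass streaming method needs. An EM-trained mixture model, by contrast, needs repeated passes over the training data.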

References

C. C. Aggarwal, J. Han, J. Wang, P. Yu. A Framework for Clustering Evolving Data Streams. VLDB Conference, 2003.
I. Nabney. Netlab: Algorithms for Pattern Recognition. Advances in Pattern Recognition, Springer-Verlag, Germany, 2001. URL: http://www.ncrg.aston.ac.uk/netlab/down.php
M. Przybocki, A. Martin. NIST's Assessment of Text Independent Speaker Recognition Performance. http://www.nist.gov/speech/publications/index.html
D. Reynolds, T. Quatieri, R. Dunn. Speaker Verification using Adapted Gaussian Mixture Models. Digital Signal Processing, Vol. 10, pp. 42-54, 2000.
T. Zhang, R. Ramakrishnan, M. Livny. BIRCH: An Efficient Data Clustering Method for Very Large Databases. ACM SIGMOD Conference, 1996.

Index Terms

1. Information systems
  1. Information systems applications


Published in
KDD '07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2007
1080 pages
ISBN: 9781595936097
DOI: 10.1145/1281192
General Chair: Pavel Berkhin (Yahoo!, USA)
Program Chairs: Rich Caruana (Cornell University, USA), Xindong Wu (University of Vermont, USA)
Copyright © 2007 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
speaker recognition
Qualifiers
- Article
Conference

Acceptance Rates
KDD '07 Paper Acceptance Rate: 111 of 573 submissions, 19%
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%