ABSTRACT
We demonstrate a phonotactic-semantic paradigm for spoken document categorization. In this framework, we define a set of acoustic words, instead of lexical words, to represent acoustic activities in spoken languages. The strategy for acoustic vocabulary selection is studied by comparing different feature selection methods. With an appropriate acoustic vocabulary, a voice tokenizer converts a spoken document into a text-like document of acoustic words. Thus, a spoken document can be represented by a count vector, named a bag-of-sounds vector, which characterizes the spoken document's semantic domain. We study two phonotactic-semantic classifiers, the support vector machine classifier and the latent semantic analysis classifier, and their properties. The phonotactic-semantic framework constitutes a new paradigm in spoken document classification, as demonstrated by its success in the spoken language identification task. It achieves an 18.2% error reduction over state-of-the-art benchmark performance on the 1996 NIST Language Recognition Evaluation database.
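The bag-of-sounds representation described in the abstract can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the three-unit inventory `a`, `b`, `c`, the choice of bigrams as acoustic words, and the function name `bag_of_sounds` are all hypothetical. The sketch only shows how a tokenized utterance might be turned into a count vector over a fixed acoustic vocabulary.

```python
from collections import Counter
from itertools import product


def bag_of_sounds(tokens, vocabulary, n=2):
    """Count n-grams of acoustic tokens over a fixed vocabulary.

    tokens     -- sequence of acoustic unit labels from a (hypothetical) voice tokenizer
    vocabulary -- ordered list of n-gram tuples that define the vector dimensions
    n          -- n-gram order (assumed to be 2 here; the abstract does not fix it)
    """
    # Slide a window of length n over the token sequence.
    ngrams = zip(*(tokens[i:] for i in range(n)))
    counts = Counter(ngrams)
    # One dimension per vocabulary entry; n-grams outside the vocabulary are ignored.
    return [counts.get(term, 0) for term in vocabulary]


# Hypothetical inventory of three acoustic units and all of their bigrams.
units = ["a", "b", "c"]
vocabulary = list(product(units, repeat=2))

# Hypothetical tokenizer output for one spoken document.
tokens = ["a", "b", "a", "b", "c"]
vector = bag_of_sounds(tokens, vocabulary)
```

A vector of this kind could then be passed to a classifier such as a support vector machine or a latent semantic analysis model, the two classifier types the abstract names. The vocabulary selection and any weighting of the counts are left out of this sketch.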