| Modeling word burstiness using the Dirichlet distribution |
| Source
|
ACM International Conference Proceeding Series; Vol. 119
Proceedings of the 22nd international conference on Machine learning
Bonn, Germany
Pages: 545-552
Year of Publication: 2005
ISBN: 1-59593-180-5
|
ABSTRACT
Multinomial distributions are often used to model text documents. However, they do not capture well the phenomenon that words in a document tend to appear in bursts: if a word appears once, it is more likely to appear again. In this paper, we propose the Dirichlet compound multinomial model (DCM) as an alternative to the multinomial. The DCM model has one additional degree of freedom, which allows it to capture burstiness. We show experimentally that the DCM is substantially better than the multinomial at modeling text data, measured by perplexity. We also show using three standard document collections that the DCM leads to better classification than the multinomial model. DCM performance is comparable to that obtained with multiple heuristic changes to the multinomial model.
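As a rough illustration of the contrast drawn in the abstract, the sketch below compares the log-probability of a "bursty" document under a multinomial and under a Dirichlet compound multinomial, using the standard Dirichlet-multinomial (Polya) density. The four-word vocabulary, the count vector, and the parameter values are invented for illustration and are not taken from the paper; the function names are likewise hypothetical.

```python
from math import lgamma, log, exp

def _log_multinomial_coef(counts):
    # log( n! / prod_w x_w! ), computed with log-gamma for numerical stability
    n = sum(counts)
    return lgamma(n + 1) - sum(lgamma(x + 1) for x in counts)

def multinomial_log_prob(counts, theta):
    # log p(x | theta) for word counts x and word probabilities theta
    return _log_multinomial_coef(counts) + sum(
        x * log(t) for x, t in zip(counts, theta) if x > 0
    )

def dcm_log_prob(counts, alpha):
    # log p(x | alpha) for the Dirichlet compound multinomial:
    # coef * Gamma(A) / Gamma(n + A) * prod_w Gamma(x_w + alpha_w) / Gamma(alpha_w),
    # where A = sum(alpha) and n = sum(counts)
    n = sum(counts)
    a_sum = sum(alpha)
    return (
        _log_multinomial_coef(counts)
        + lgamma(a_sum) - lgamma(n + a_sum)
        + sum(lgamma(x + a) - lgamma(a) for x, a in zip(counts, alpha))
    )

# Hypothetical example: one word repeated four times, the other three absent.
bursty_doc = [4, 0, 0, 0]
theta = [0.25, 0.25, 0.25, 0.25]   # assumed uniform word probabilities
alpha = [0.25, 0.25, 0.25, 0.25]   # assumed small alpha, same mean as theta

print("multinomial:", exp(multinomial_log_prob(bursty_doc, theta)))
print("DCM:        ", exp(dcm_log_prob(bursty_doc, alpha)))
```

With the same mean word proportions, the small-alpha DCM assigns the repeated-word document a much higher probability than the multinomial does. Scaling every entry of `alpha` by a large constant moves the DCM back toward the multinomial, which is one way to read the "additional degree of freedom" mentioned in the abstract.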