Modeling word burstiness using the Dirichlet distribution
Source ACM International Conference Proceeding Series; Vol. 119 archive
Proceedings of the 22nd international conference on Machine learning table of contents
Bonn, Germany
Pages: 545 - 552  
Year of Publication: 2005
ISBN:1-59593-180-5
Authors
Rasmus E. Madsen  Technical University of Denmark
David Kauchak  University of California, San Diego, La Jolla, CA
Charles Elkan  University of California, San Diego, La Jolla, CA
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1102351.1102420

ABSTRACT

Multinomial distributions are often used to model text documents. However, they do not capture well the phenomenon that words in a document tend to appear in bursts: if a word appears once, it is more likely to appear again. In this paper, we propose the Dirichlet compound multinomial model (DCM) as an alternative to the multinomial. The DCM model has one additional degree of freedom, which allows it to capture burstiness. We show experimentally that the DCM is substantially better than the multinomial at modeling text data, measured by perplexity. We also show using three standard document collections that the DCM leads to better classification than the multinomial model. DCM performance is comparable to that obtained with multiple heuristic changes to the multinomial model.
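The abstract does not state the DCM likelihood, so the following is only a minimal sketch using the standard textbook form of the Dirichlet compound multinomial (also called the Polya distribution). The function names, the four-word vocabulary, and the parameter values are illustrative assumptions, not taken from the paper. The sketch compares a "bursty" document (one word repeated) with a "spread" document (every word once) under both models. A multinomial over W words has W - 1 free parameters because its probabilities must sum to one; the DCM has W unconstrained positive parameters, and their overall scale controls how strongly repeats are favored.

```python
from math import lgamma, log

def log_multinomial(counts, theta):
    """Log-probability of a word-count vector under a multinomial with word probabilities theta."""
    n = sum(counts)
    log_coef = lgamma(n + 1) - sum(lgamma(x + 1) for x in counts)
    return log_coef + sum(x * log(t) for x, t in zip(counts, theta) if x > 0)

def log_dcm(counts, alpha):
    """Log-probability of a word-count vector under a DCM (Polya) with parameters alpha.

    Standard form: n!/prod(x_w!) * Gamma(A)/Gamma(n + A) * prod Gamma(x_w + a_w)/Gamma(a_w),
    where A = sum of alpha and n = sum of counts.
    """
    n = sum(counts)
    a_total = sum(alpha)
    log_coef = lgamma(n + 1) - sum(lgamma(x + 1) for x in counts)
    log_ratio = sum(lgamma(x + a) - lgamma(a) for x, a in zip(counts, alpha))
    return log_coef + lgamma(a_total) - lgamma(n + a_total) + log_ratio

# Illustrative (assumed) setup: 4-word vocabulary, 4-word documents.
bursty = [4, 0, 0, 0]   # one word repeated four times
spread = [1, 1, 1, 1]   # every word exactly once

theta = [0.25] * 4      # uniform multinomial
alpha = [0.1] * 4       # small alphas make the DCM favor repeated words

mult_bursty = log_multinomial(bursty, theta)
mult_spread = log_multinomial(spread, theta)
dcm_bursty = log_dcm(bursty, alpha)
dcm_spread = log_dcm(spread, alpha)

print("multinomial:", mult_bursty, mult_spread)
print("DCM:        ", dcm_bursty, dcm_spread)
```

With these assumed values the multinomial rates the spread document as more likely, while the DCM rates the bursty one as more likely. Scaling all alphas up by a large factor while keeping their proportions makes the DCM approach the multinomial with the same proportions, which is one way to read the extra degree of freedom.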


