Text document clustering based on frequent word sequences
Source
Conference on Information and Knowledge Management
Proceedings of the 14th ACM international conference on Information and knowledge management
Bremen, Germany
Poster Session
Pages: 293-294
Year of Publication: 2005
ISBN: 1-59593-140-6
ABSTRACT
In this paper, we propose a new text clustering algorithm, named Clustering based on Frequent Word Sequences (CFWS). A word sequence is frequent if it occurs in more than a certain percentage of the documents in the text database. In the past, the vector space model was commonly used for information retrieval, but it treats documents as bags of words, ignoring the sequential pattern of word occurrences in the documents. However, meaning in natural language depends strongly on word sequences, and frequent word sequences can provide compact and valuable information about the text database. The bisecting k-means and FIHC algorithms are evaluated on text clustering performance and compared with the proposed CFWS algorithm. The evaluation shows that CFWS performs much better.
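The definition of a frequent word sequence can be illustrated with a minimal sketch. This is not the CFWS algorithm itself: it assumes, for simplicity, that a word sequence is a contiguous run of words of fixed length, and the function name `frequent_word_sequences`, the parameters `min_support` and `length`, and the sample documents are all hypothetical choices made for this illustration.

```python
from collections import Counter


def frequent_word_sequences(documents, min_support, length=2):
    """Return word sequences found in more than `min_support` of the documents.

    Assumption for this sketch: a word sequence is a contiguous run of
    `length` words, and `min_support` is a fraction between 0 and 1.
    """
    doc_counts = Counter()
    for text in documents:
        words = text.lower().split()
        # Use a set so each sequence is counted at most once per document.
        seen = {
            tuple(words[i:i + length])
            for i in range(len(words) - length + 1)
        }
        doc_counts.update(seen)

    threshold = min_support * len(documents)
    # "Frequent" means strictly more than the given share of documents.
    return {seq: n for seq, n in doc_counts.items() if n > threshold}


# Hypothetical toy database of four documents.
docs = [
    "young boys play basketball",
    "boys play football after school",
    "girls play basketball",
    "young boys like football",
]
result = frequent_word_sequences(docs, min_support=0.4)
```

With these four documents and a 40% threshold, a sequence must appear in at least two documents to qualify. A bag-of-words representation would record only that "boys" and "play" each occur, whereas the sequence ("boys", "play") also preserves their order.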