Text document clustering based on frequent word sequences
Source Conference on Information and Knowledge Management
Proceedings of the 14th ACM international conference on Information and knowledge management
Bremen, Germany
POSTER SESSION: Poster Session
Pages: 293 - 294  
Year of Publication: 2005
ISBN:1-59593-140-6
Authors
Yanjun Li  Wright State University, Dayton, OH
Soon M. Chung  Wright State University, Dayton, OH
Sponsors
ACM: Association for Computing Machinery
SIGIR: ACM Special Interest Group on Information Retrieval
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 21,   Downloads (12 Months): 141,   Citation Count: 1
DOI: http://doi.acm.org/10.1145/1099554.1099633

ABSTRACT

In this paper, we propose a new text clustering algorithm, named Clustering based on Frequent Word Sequences (CFWS). A word sequence is frequent if it occurs in more than a certain percentage of the documents in the text database. In the past, the vector space model was commonly used for information retrieval, but it treats documents as bags of words, ignoring the sequential pattern of word occurrences in the documents. However, the meaning of natural language depends strongly on word sequences, and frequent word sequences can provide compact and valuable information about the text database. The bisecting k-means and FIHC algorithms are evaluated on text clustering performance and compared with the proposed CFWS algorithm. The evaluation shows that CFWS performs much better.
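
The definition of a frequent word sequence can be illustrated with a small sketch. This is not the CFWS algorithm itself; it only counts, under simplifying assumptions, how many documents contain each word sequence and keeps those above a support threshold. The function name `frequent_word_sequences`, the `min_support` and `length` parameters, the whitespace tokenization, the restriction to contiguous sequences of a fixed length, and the sample documents are all assumptions made for illustration.

```python
from collections import Counter


def frequent_word_sequences(documents, min_support, length=2):
    """Return contiguous word sequences of the given length that occur in
    more than `min_support` (a fraction between 0 and 1) of the documents.

    Assumption: words are split on whitespace and lowercased; a real
    system would likely add stop-word removal and stemming.
    """
    doc_freq = Counter()
    for text in documents:
        words = text.lower().split()
        # A set ensures each sequence is counted at most once per document.
        seen = {
            tuple(words[i:i + length])
            for i in range(len(words) - length + 1)
        }
        doc_freq.update(seen)
    threshold = min_support * len(documents)
    return {seq: n for seq, n in doc_freq.items() if n > threshold}


# Made-up documents; the first and third contain the same bag of words
# but in a different order.
docs = [
    "young boys play basketball",
    "boys play basketball after school",
    "basketball boys play young",
    "girls play tennis",
]
result = frequent_word_sequences(docs, min_support=0.5)
print(result)  # {('boys', 'play'): 3}
```

The first and third sample documents are identical under a bag-of-words view, yet only the first contains the sequence "play basketball". That difference in word order is the kind of information a sequence-based representation keeps and a bag-of-words representation discards.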


REFERENCES

 
1. B. C. M. Fung, K. Wang, and M. Ester, "Hierarchical Document Clustering Using Frequent Itemsets," Proc. of SIAM Int'l Conf. on Data Mining, 2003.
2. High Accuracy Retrieval from Documents (HARD) Track of Text Retrieval Conference, 2004.
3. The Lemur Toolkit for Language Modeling and Information Retrieval, http://www-2.cs.cmu.edu/~lemur/.
4. M. Steinbach, G. Karypis, and V. Kumar, "A Comparison of Document Clustering Techniques," KDD-2000 Workshop on Text Mining, 2000.
5. P. Weiner, "Linear Pattern Matching Algorithms," Proc. of the 14th Annual Symp. on Foundation of Computer Science, 1973, pp. 1-11.
6.

