CVS: a Correlation-Verification based Smoothing technique on information retrieval and term clustering
Source: International Conference on Knowledge Discovery and Data Mining
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Edmonton, Alberta, Canada
POSTER SESSION: Poster papers
Pages: 469 - 474  
Year of Publication: 2002
ISBN:1-58113-567-X
Authors
Christina Yip Chung  Verity, Inc., Sunnyvale, CA
Bin Chen  Exelixis, Inc., S. San Francisco, CA
Sponsors
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
SIGMOD: ACM Special Interest Group on Management of Data
AAAI
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/775047.775115

ABSTRACT

As information volume in enterprise systems and in the Web grows rapidly, how to accurately retrieve information is an important research area. Several corpus based smoothing techniques have been proposed to address the data sparsity and synonym problems faced by information retrieval systems. Such smoothing techniques are often unable to discover and utilize the correlations among terms.

We propose CVS, a Correlation-Verification based Smoothing method, that considers co-occurrence information in smoothing. Strongly correlated terms in a document are identified by their co-occurrence frequencies in the document. To avoid missing correlated terms with low co-occurrence frequencies but specific to the theme of the document, the joint distributions of terms in the document are compared with those in the corpus for statistical significance.

A common approach to apply corpus based smoothing techniques to information retrieval is by refining the vector representations of documents. This paper investigates the effects of corpus based smoothing on information retrieval by query expansion using term clusters generated from a term clustering process. The results can also be viewed in light of the effects of smoothing on clustering.

Empirical studies show that our approach outperforms previous corpus based smoothing techniques. It improves retrieval effectiveness by 14.6%. The results demonstrate that corpus based smoothing can be used for query expansion by term clustering.
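The two-step selection described in the abstract (keep term pairs that co-occur often in a document, then also keep rarer pairs whose co-occurrence in the document is statistically unusual relative to the corpus) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the statistical test, the co-occurrence window, or any thresholds. The sketch assumes a Pearson chi-square test on a 2x2 table, treats each sentence-like token list as one window, and assumes `corpus_windows` excludes the document itself. The names `pair_counts`, `chi2_2x2`, `correlated_pairs`, and `min_count` are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Chi-square critical value for 1 degree of freedom at p = 0.05.
CHI2_CRITICAL_05 = 3.841


def pair_counts(windows):
    """Count, for each unordered term pair, the windows that contain both terms."""
    counts = Counter()
    for window in windows:
        for pair in combinations(sorted(set(window)), 2):
            counts[pair] += 1
    return counts


def chi2_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def correlated_pairs(doc_windows, corpus_windows, min_count=3):
    """Select term pairs by frequency, then by an assumed significance check."""
    doc = pair_counts(doc_windows)
    corpus = pair_counts(corpus_windows)
    n_doc, n_corpus = len(doc_windows), len(corpus_windows)
    selected = set()
    for pair, k_doc in doc.items():
        # Step 1: frequent co-occurrence inside the document.
        if k_doc >= min_count:
            selected.add(pair)
            continue
        # Step 2 (assumed test): compare the document's co-occurrence rate
        # with the corpus rate; rows are document vs. corpus, columns are
        # windows with vs. without the pair.
        k_corpus = corpus.get(pair, 0)
        stat = chi2_2x2(k_doc, n_doc - k_doc, k_corpus, n_corpus - k_corpus)
        over_represented = k_doc * n_corpus > k_corpus * n_doc
        if over_represented and stat > CHI2_CRITICAL_05:
            selected.add(pair)
    return selected


# Toy data, invented for illustration only.
doc_windows = [
    ["kinase", "inhibitor"],
    ["inhibitor", "kinase"],
    ["market", "report"],
    ["market", "report"],
]
corpus_windows = [["market", "report"]] * 50 + [["stock", "price"]] * 150

print(correlated_pairs(doc_windows, corpus_windows, min_count=3))
```

In this toy data neither pair reaches `min_count`, so only the significance step can select a pair. With counts this small a chi-square statistic is unreliable, so a real system would likely need a minimum-count guard or a different test.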


