ABSTRACT
As the volume of information in enterprise systems and on the Web grows rapidly, accurate information retrieval has become an important research area. Several corpus-based smoothing techniques have been proposed to address the data sparsity and synonym problems faced by information retrieval systems. Such smoothing techniques are often unable to discover and utilize the correlations among terms.

We propose CVS, a Correlation-Verification based Smoothing method, which takes co-occurrence information into account during smoothing. Strongly correlated terms in a document are identified by their co-occurrence frequencies in the document. To avoid missing correlated terms that have low co-occurrence frequencies but are specific to the theme of the document, the joint distributions of terms in the document are compared with those in the corpus for statistical significance.

A common way to apply corpus-based smoothing techniques to information retrieval is to refine the vector representations of documents. This paper instead investigates the effects of corpus-based smoothing on information retrieval through query expansion, using term clusters generated by a term clustering process. The results can also be viewed in light of the effects of smoothing on clustering.

Empirical studies show that our approach outperforms previous corpus-based smoothing techniques. It improves retrieval effectiveness by 14.6%. The results demonstrate that corpus-based smoothing can be used for query expansion by term clustering.
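The two-stage idea described above (frequent co-occurrence first, then a significance check against the corpus for rare pairs) can be illustrated with a minimal sketch. It is not the CVS implementation: the abstract does not say how co-occurrence is counted or which statistical test is used. The sketch assumes sentence-level co-occurrence, a log-likelihood ratio (G²) test, and a background corpus that is non-empty and does not include the document. All function names, `min_count`, and `g2_threshold` are hypothetical; 3.84 is the chi-square critical value for one degree of freedom at the 0.05 level.

```python
import math
from collections import Counter
from itertools import combinations


def pair_counts(sentences):
    """Count the sentences in which each unordered term pair co-occurs."""
    counts = Counter()
    for sentence in sentences:
        # sorted(set(...)) gives each pair a single canonical ordering
        for pair in combinations(sorted(set(sentence)), 2):
            counts[pair] += 1
    return counts


def _binom_ll(k, n, p):
    """Binomial log-likelihood (constant dropped); zero-count terms are skipped."""
    ll = 0.0
    if k > 0:
        ll += k * math.log(p)
    if n - k > 0:
        ll += (n - k) * math.log(1.0 - p)
    return ll


def g2(k_doc, n_doc, k_bg, n_bg):
    """G^2 statistic: does the pair's co-occurrence rate differ between the
    document (k_doc of n_doc sentences) and the background (k_bg of n_bg)?"""
    p_doc = k_doc / n_doc
    p_bg = k_bg / n_bg
    p_all = (k_doc + k_bg) / (n_doc + n_bg)
    return 2.0 * (
        _binom_ll(k_doc, n_doc, p_doc) + _binom_ll(k_bg, n_bg, p_bg)
        - _binom_ll(k_doc, n_doc, p_all) - _binom_ll(k_bg, n_bg, p_all)
    )


def correlated_pairs(doc_sentences, bg_sentences, min_count=2, g2_threshold=3.84):
    """Label term pairs in a document as 'frequent' or 'significant'.

    Assumed rule (not from the paper): a pair is 'frequent' if it co-occurs
    in at least min_count document sentences; otherwise it is 'significant'
    if it is more common in the document than in the background and its
    G^2 value reaches g2_threshold.
    """
    doc = pair_counts(doc_sentences)
    bg = pair_counts(bg_sentences)
    n_doc, n_bg = len(doc_sentences), len(bg_sentences)
    result = {}
    for pair, k_doc in doc.items():
        if k_doc >= min_count:
            result[pair] = "frequent"
            continue
        k_bg = bg[pair]  # Counter returns 0 for unseen pairs
        more_common_in_doc = k_doc / n_doc > k_bg / n_bg
        if more_common_in_doc and g2(k_doc, n_doc, k_bg, n_bg) >= g2_threshold:
            result[pair] = "significant"
    return result
```

Calling `correlated_pairs(doc_sentences, bg_sentences)` with tokenized sentences returns a dictionary keyed by alphabetically ordered term pairs. The second branch is the point of the sketch: it keeps pairs that are rare in the document but still unusual relative to the corpus.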