PM-based indexing for Chinese text retrieval
Source
International Workshop on Information Retrieval with Asian Languages
Proceedings of the fifth international workshop on Information retrieval with Asian languages
Hong Kong, China
Pages: 55 - 59
Year of Publication: 2000
ISBN:1-58113-300-6
Authors
Du Lin, Institute of Software, Chinese Academy of Sciences, Beijing, P.R. China
Zhang Yibo, Institute of Software, Chinese Academy of Sciences, Beijing, P.R. China
Sun Le, Institute of Software, Chinese Academy of Sciences, Beijing, P.R. China
Sun Yufang, Institute of Software, Chinese Academy of Sciences, Beijing, P.R. China
Han Jie, Institute of Software, Chinese Academy of Sciences, Beijing, P.R. China
ABSTRACT
This paper introduces a novel PM indexing scheme for Chinese text retrieval. Unlike Western languages, Chinese text has no delimiters between words. Indexing is therefore based either on characters or on segmented words. With word-based indexing, out-of-vocabulary words such as proper nouns and domain terminology are usually mis-segmented because segmentation dictionaries have limited vocabulary coverage, which impairs query precision. In this paper, several indexing and ranking methods, including the novel PM-based ranking, were tested to compare their efficiency in dealing with new words in Chinese text retrieval. The experiments show that the query precision of the PM + word method is 10% higher than that of word indexing.
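The out-of-vocabulary problem described in the abstract can be illustrated with a minimal sketch. This is not the paper's PM method, which the abstract does not specify. It only contrasts a dictionary-based segmenter (greedy forward maximum matching, a common baseline chosen here as an assumption) with character-bigram indexing. The function names, the toy vocabulary, and the sample sentence are all hypothetical.

```python
def char_bigrams(text):
    """Overlapping character bigrams, a dictionary-free index unit."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def forward_max_match(text, vocab, max_len=4):
    """Greedy longest-match segmentation against a fixed vocabulary."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when nothing in vocab matches.
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical toy vocabulary; the place name 中关村 is deliberately missing.
vocab = {"我", "在", "工作", "中"}
sentence = "我在中关村工作"

word_terms = forward_max_match(sentence, vocab)
bigram_terms = char_bigrams(sentence)

print(word_terms)    # the unknown place name is broken into single characters
print(bigram_terms)  # bigrams still cover every adjacent pair in the name
```

In this toy case the word index never contains the unknown place name as a single term, so a query for it cannot match on the word level, while the bigram index still holds the overlapping pieces of the name.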