ABSTRACT
XML retrieval faces new challenges when applied to heterogeneous XML documents, where next to nothing about the document structure can be taken for granted. We have developed solutions that address some of these heterogeneity issues. Our fragment selection algorithm selectively divides a heterogeneous document collection into equi-sized fragments with full-text content. Content that is considered too data-oriented is not accepted. The algorithm needs no information about element names. In addition, three techniques for fragment expansion are presented, each of which improves average precision by 13-17% on average. These techniques and algorithms are among the first steps in developing document-type-independent indexing methods for the full text in heterogeneous XML collections.