Preparing heterogeneous XML for full-text search
Source: ACM Transactions on Information Systems (TOIS)
Volume 24, Issue 4 (October 2006)
Pages: 455-474
Year of Publication: 2006
ISSN:1046-8188
Author
Miro Lehtonen  University of Helsinki, Finland
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1185877.1185881

ABSTRACT

XML retrieval faces new challenges when applied to heterogeneous XML documents, where next to nothing about the document structure can be taken for granted. We have developed solutions that address some of these heterogeneity issues. Our fragment selection algorithm selectively divides a heterogeneous document collection into equi-sized fragments with full-text content. Content that is considered too data-oriented is not accepted. The algorithm needs no information about element names. In addition, three techniques for fragment expansion are presented, all of which yield a 13 to 17% average improvement in average precision. These techniques and algorithms are among the first steps in developing document-type-independent indexing methods for the full text in heterogeneous XML collections.
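
The abstract gives no implementation details, so the following Python sketch is only one possible reading of the fragment selection idea, not the article's algorithm. It walks the tree top-down, descends into elements whose text is too large, drops elements whose text is too small, and accepts an element of suitable size only if it looks like full text. The size bounds (min_size, max_size), the ratio of text nodes to element nodes used as the full-text test, the threshold min_ratio, and all function names are assumptions made for illustration. In line with the abstract, the code never inspects element names.

```python
import xml.etree.ElementTree as ET


def text_size(elem):
    """Length of the whitespace-normalised text contained in elem's subtree."""
    return len(" ".join("".join(elem.itertext()).split()))


def text_element_ratio(elem):
    """Non-empty text nodes divided by element nodes in elem's subtree.

    Assumed heuristic: mixed-content prose tends to score >= 1.0, while
    record-like data wrapped in many field elements tends to score lower.
    """
    elements = 0
    texts = 0
    for node in elem.iter():
        elements += 1
        if node.text and node.text.strip():
            texts += 1
        # The tail of the root lies outside the subtree, so skip it.
        if node is not elem and node.tail and node.tail.strip():
            texts += 1
    return texts / elements


def select_fragments(elem, min_size, max_size, min_ratio=1.0):
    """Return subtrees of roughly equal text size that look like full text."""
    size = text_size(elem)
    if size < min_size:
        return []  # too small to be a fragment on its own
    if size <= max_size:
        # Right size: accept only if the content is not too data-oriented.
        return [elem] if text_element_ratio(elem) >= min_ratio else []
    # Too large: look for fragments among the children instead.
    fragments = []
    for child in elem:
        fragments.extend(select_fragments(child, min_size, max_size, min_ratio))
    return fragments


# Hypothetical sample document: two prose paragraphs, one table, one short note.
sample = "".join([
    "<doc><sec>",
    "<p>", "prose " * 15, "<em>key</em>", " more" * 15, "</p>",
    "<p>", "words " * 30, "</p>",
    "</sec>",
    "<t>", "<r><x>12345 67890</x><y>abcde fghij</y></r>" * 5, "</t>",
    "<note>short</note>",
    "</doc>",
])
fragments = select_fragments(ET.fromstring(sample), min_size=100, max_size=250)
print([f.tag for f in fragments])  # prints ['p', 'p']
```

In this sample the table has a suitable size but is rejected because it contains fewer text nodes than elements, and the note is rejected as too small. The tags are printed only to show which subtrees were chosen; the selection itself depends on size and node counts alone.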

