Length normalization in XML retrieval
Source: Annual ACM Conference on Research and Development in Information Retrieval
Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
Sheffield, United Kingdom
SESSION: XML retrieval
Pages: 80 - 87
Year of Publication: 2004
ISBN:1-58113-881-4
Authors
Jaap Kamps  University of Amsterdam, Amsterdam, The Netherlands
Maarten de Rijke  University of Amsterdam, Amsterdam, The Netherlands
Börkur Sigurbjörnsson  University of Amsterdam, Amsterdam, The Netherlands
Sponsors
ACM: Association for Computing Machinery
SIGIR: ACM Special Interest Group on Information Retrieval
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 13,   Downloads (12 Months): 93,   Citation Count: 11
DOI: http://doi.acm.org/10.1145/1008992.1009009

ABSTRACT

XML retrieval departs from standard document retrieval in that each individual XML element, ranging from italicized words or phrases to full-blown articles, is a potentially retrievable unit. The distribution of XML element lengths is unlike what we usually observe in standard document collections, prompting us to revisit the issue of document length normalization. We perform a comparative analysis of arbitrary elements versus relevant elements, and show the importance of length as a parameter for XML retrieval. Within the language modeling framework, we investigate a range of techniques that deal with length either directly or indirectly. We observe a length bias introduced by the amount of smoothing, and show the importance of extreme length priors for XML retrieval. We also show that simply removing shorter elements from the index (by introducing a cut-off value) does not create an appropriate document length normalization. Even after increasing the minimal size of XML elements occurring in the index, the importance of an extreme length bias remains.
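
The abstract does not state the scoring formula, so the following is only an illustrative sketch of how a length prior and an index cut-off could enter a query-likelihood language model. The linear-interpolation smoothing, the prior proportional to length raised to `beta`, and all names (`score_element`, `apply_cutoff`, `lam`, `beta`, `min_len`) are assumptions made for this example, not the authors' implementation.

```python
import math
from collections import Counter


def score_element(query, element, collection, lam=0.5, beta=1.0):
    """Log query likelihood of a non-empty element plus a log length prior.

    Assumed (hypothetical) model:
      P(q | e) = prod_t [lam * P(t | e) + (1 - lam) * P(t | collection)]
      prior(e) proportional to len(e) ** beta   (beta = 0 means no prior)
    """
    elem_tf = Counter(element)
    coll_tf = Counter(collection)
    elem_len = len(element)
    coll_len = len(collection)

    log_score = beta * math.log(elem_len)  # log of the length prior
    for term in query:
        if coll_tf[term] == 0:
            continue  # skip terms unseen in the collection to avoid log(0)
        p_elem = elem_tf[term] / elem_len
        p_coll = coll_tf[term] / coll_len
        log_score += math.log(lam * p_elem + (1 - lam) * p_coll)
    return log_score


def apply_cutoff(elements, min_len):
    """Keep only elements with at least min_len tokens (index cut-off)."""
    return [e for e in elements if len(e) >= min_len]


# Toy data: a one-word element (e.g. an italicized phrase) and a longer one.
short_elem = ["xml"]
long_elem = ["xml", "retrieval", "length", "normalization",
             "xml", "element", "retrieval", "index"]
collection = short_elem + long_elem
query = ["xml"]

no_prior_short = score_element(query, short_elem, collection, beta=0.0)
no_prior_long = score_element(query, long_elem, collection, beta=0.0)
prior_short = score_element(query, short_elem, collection, beta=1.0)
prior_long = score_element(query, long_elem, collection, beta=1.0)
```

In this toy setting the one-word element outranks the longer one when no prior is used, and the order reverses with `beta=1.0`. Raising `beta` further gives a more extreme bias toward long elements, while `apply_cutoff` only removes the shortest elements and leaves the relative scoring of the remaining ones unchanged.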




REVIEW

Reviewer: Xiaoya Tang

The retrievable units in Extensible Markup Language (XML) documents are individual elements, and the length distribution of such elements is different from that of standard documents. Therefore, document length normalization in XML retrieval needs ...
