Abstract
In information retrieval, retrieving relevant passages, as opposed to whole documents, not only directly benefits the end user by filtering out the irrelevant information within a long relevant document, but also improves retrieval accuracy in general. A critical problem in passage retrieval is to extract coherent relevant passages accurately from a document, which we refer to as passage extraction. While much work has been done on passage retrieval, the passage extraction problem has not been seriously studied. Most existing work relies on presegmenting documents into fixed-length passages, which are unlikely to be optimal because the length of a relevant passage is presumably highly sensitive to both the query and the document.

In this article, we present a new method for accurately detecting coherent relevant passages of variable lengths using hidden Markov models (HMMs). The HMM-based method naturally captures the topical boundaries between passages relevant and nonrelevant to the query. Pseudo-feedback mechanisms can be naturally incorporated into such an HMM-based framework to improve parameter estimation. We show that with appropriate parameter estimation, the HMM method outperforms a number of strong baseline methods on two datasets. We further show how the HMM method can be applied on top of any basic passage extraction method to improve passage boundaries.
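To make the idea of HMM-based boundary detection concrete, the following is a minimal sketch and not the model described in the article. It assumes a hypothetical three-state, left-to-right HMM (nonrelevant text before the passage, the relevant passage, nonrelevant text after it), a background language model estimated from the document itself with add-one smoothing, and a relevant-state model that mixes the query with the background. The function name `extract_passage` and the parameters `stay` and `query_weight` are illustrative assumptions, and no pseudo-feedback is used. Viterbi decoding then yields one variable-length span of tokens assigned to the relevant state.

```python
import math
from collections import Counter

def extract_passage(doc_tokens, query_tokens, stay=0.9, query_weight=0.5):
    """Return (start, end) token indices (inclusive) of the decoded relevant
    span, or None for empty input. Illustrative sketch, not the paper's model."""
    if not doc_tokens or not query_tokens:
        return None

    # Assumption: background model = add-one smoothed unigram counts of the document.
    counts = Counter(doc_tokens)
    vocab = set(doc_tokens) | set(query_tokens)
    total = len(doc_tokens) + len(vocab)
    q_counts = Counter(query_tokens)

    def p_bg(w):
        return (counts[w] + 1) / total

    def p_rel(w):
        # Assumption: relevant state emits from a query/background mixture.
        return query_weight * q_counts[w] / len(query_tokens) + (1 - query_weight) * p_bg(w)

    def emit(state, w):
        return p_rel(w) if state == 1 else p_bg(w)

    # States: 0 = before passage, 1 = relevant passage, 2 = after passage.
    # Left-to-right transitions only; every self-loop uses the same `stay` value.
    trans = {(0, 0): stay, (0, 1): 1 - stay,
             (1, 1): stay, (1, 2): 1 - stay,
             (2, 2): stay}

    n = len(doc_tokens)
    neg_inf = float("-inf")
    score = [[neg_inf] * 3 for _ in range(n)]
    back = [[0] * 3 for _ in range(n)]

    # A path may start before the passage or directly inside it.
    for s in (0, 1):
        score[0][s] = math.log(0.5) + math.log(emit(s, doc_tokens[0]))

    # Viterbi recursion in log space.
    for t in range(1, n):
        for (prev, cur), p in trans.items():
            if score[t - 1][prev] == neg_inf:
                continue
            cand = score[t - 1][prev] + math.log(p) + math.log(emit(cur, doc_tokens[t]))
            if cand > score[t][cur]:
                score[t][cur] = cand
                back[t][cur] = prev

    # Assumption: the document contains a passage, so the path must end in state 1 or 2.
    state = max((1, 2), key=lambda s: score[n - 1][s])
    path = [state]
    for t in range(n - 1, 0, -1):
        state = back[t][state]
        path.append(state)
    path.reverse()

    relevant = [i for i, s in enumerate(path) if s == 1]
    return (relevant[0], relevant[-1])
```

Calling `extract_passage(tokens, query)` on a tokenized document returns the boundaries of a single span whose length is chosen by the decoder rather than fixed in advance. In this sketch, `stay` controls how reluctant the decoder is to switch states, and `query_weight` controls how strongly query terms pull tokens into the relevant state; both values here are arbitrary.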