ABSTRACT
Text segmentation in natural language processing typically refers to the process of decomposing a document into its constituent subtopics. Our work centers on the application of text segmentation techniques to information retrieval (IR) tasks. Examples include scoring a document by combining the retrieval scores of its constituent segments, exploiting the proximity of query terms in documents for ad-hoc search, and question answering (QA), where retrieved passages from multiple documents are aggregated and presented to a searcher as a single document. Feedback in ad-hoc IR tasks is shown to benefit from using sentences extracted from the pseudo-relevant documents, instead of terms, for query expansion. Retrieval effectiveness for the patent prior art search task is enhanced by applying text segmentation to the patent queries. Another aspect of our work involves augmenting text segmentation techniques to produce segments that are more readable and contain fewer unresolved anaphora. This is particularly useful for QA and snippet generation tasks, where the objective is both to aggregate relevant and novel information from multiple documents that satisfies the user's information need and to ensure that the automatically generated content presented to the user is easily readable without reference to the original source document.
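As a rough illustration of scoring a document by combining the retrieval scores of its constituent segments, the following Python sketch may help. It is not the method used in this work: the term-overlap scorer, the interpolation of the best and the mean segment score, and the names `segment_score`, `document_score`, and `alpha` are all assumptions made for the example, and the segments are assumed to have been produced already by some text segmentation step.

```python
from collections import Counter


def segment_score(query_terms, segment):
    """Toy scorer (assumption): fraction of segment tokens matching a query term."""
    tokens = segment.lower().split()
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    return sum(counts[t] for t in query_terms) / len(tokens)


def document_score(query, segments, alpha=0.5):
    """Combine segment scores (assumption): interpolate the best and the mean score.

    alpha=1.0 ranks a document by its single best segment;
    alpha=0.0 ranks it by the average over all segments.
    """
    query_terms = set(query.lower().split())
    scores = [segment_score(query_terms, s) for s in segments]
    if not scores:
        return 0.0
    return alpha * max(scores) + (1 - alpha) * sum(scores) / len(scores)


# Hypothetical segments of one document, as a segmenter might return them.
segments = [
    "patent claims describe the invention",
    "prior art search retrieves earlier patent documents",
    "the weather was pleasant",
]
print(document_score("patent search", segments))
```

Weighting the best segment more heavily favors documents with one highly relevant subtopic, while weighting the mean favors documents that are relevant throughout; which choice is preferable would depend on the task.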