ABSTRACT
We consider the problem of segmenting text documents that have a two-part structure, such as a problem part and a solution part. Documents of this genre include incident reports, which typically describe the events relating to a problem, followed by those pertaining to the solution that was tried. Segmenting such documents into their two component parts would render them usable in knowledge reuse frameworks such as Case-Based Reasoning. This segmentation problem presents a hard case for traditional text segmentation because the two segments are lexically inter-related. We develop a two-part segmentation technique that can harness a corpus of similar documents to model the behavior of the two segments and their inter-relatedness, using language models and translation models respectively. In particular, we use separate language models for the problem and solution segment types, whereas the inter-relatedness between segment types is modeled using an IBM Model 1 translation model. We model documents as being generated starting from the problem part, which comprises words sampled from the problem language model, followed by the solution part, whose words are sampled either from the solution language model or from a translation model conditioned on the words already chosen in the problem part. We show, through an extensive set of experiments on real-world data, that our approach outperforms state-of-the-art text segmentation algorithms in segmentation accuracy, and that this improved accuracy translates well to improved usability in Case-Based Reasoning systems. We also analyze the robustness of our technique to varying amounts and types of noise, and empirically illustrate that it is quite noise tolerant and degrades gracefully with increasing amounts of noise.
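The generative story described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the unigram language models, the fixed mixing weight `lam`, the uniform alignment over problem words, the probability floor `FLOOR`, and all function names and toy numbers are assumptions made for illustration, and the abstract does not say how the models are combined or trained. The sketch scores every candidate boundary and keeps the one with the highest log-likelihood.

```python
import math

FLOOR = 1e-6  # assumed probability floor for unseen words (not from the paper)


def split_log_likelihood(tokens, k, p_lm, s_lm, trans, lam=0.5):
    """Log-likelihood of tokens[:k] as the problem part and tokens[k:] as the solution part."""
    problem, solution = tokens[:k], tokens[k:]
    # Problem words are scored by the problem language model alone.
    ll = sum(math.log(p_lm.get(w, FLOOR)) for w in problem)
    for s in solution:
        # Translation term in the style of IBM Model 1: average t(s | p)
        # over the problem words, i.e. a uniform alignment (an assumption here).
        t = sum(trans.get((p, s), 0.0) for p in problem) / len(problem)
        # Each solution word comes from the solution language model or from
        # the translation model; lam is an assumed fixed mixing weight.
        mix = lam * s_lm.get(s, FLOOR) + (1.0 - lam) * t
        ll += math.log(max(mix, FLOOR))
    return ll


def best_split(tokens, p_lm, s_lm, trans, lam=0.5):
    """Return the boundary index k (1 <= k < len(tokens)) with the highest score."""
    candidates = range(1, len(tokens))
    return max(
        candidates,
        key=lambda k: split_log_likelihood(tokens, k, p_lm, s_lm, trans, lam),
    )


# Toy, hand-made models (hypothetical values, for illustration only).
toy_tokens = ["printer", "jammed", "replaced", "roller"]
toy_p_lm = {"printer": 0.5, "jammed": 0.5}
toy_s_lm = {"replaced": 0.5, "roller": 0.5}
toy_trans = {("jammed", "roller"): 0.8, ("printer", "roller"): 0.2}

toy_boundary = best_split(toy_tokens, toy_p_lm, toy_s_lm, toy_trans)
```

In this toy setting the boundary falls between "jammed" and "replaced", because "roller" is explained both by the solution language model and by translation from the problem words. In a real system the language models and the translation table would be estimated from a corpus of similar documents, as the abstract states.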
Index Terms
- Two-part segmentation of text documents