DOI: 10.1145/2396761.2396862
research-article

Two-part segmentation of text documents

Published: 29 October 2012

ABSTRACT

We consider the problem of segmenting text documents that have a two-part structure, such as a problem part and a solution part. Documents of this genre include incident reports, which typically describe the events relating to a problem followed by those pertaining to the solution that was tried. Segmenting such documents into their two component parts would render them usable in knowledge reuse frameworks such as Case-Based Reasoning. This segmentation problem presents a hard case for traditional text segmentation due to the lexical inter-relatedness of the segments. We develop a two-part segmentation technique that can harness a corpus of similar documents to model the behavior of the two segments and their inter-relatedness using language models and translation models, respectively. In particular, we use separate language models for the problem and solution segment types, whereas the inter-relatedness between segment types is modeled using an IBM Model 1 translation model. We model documents as being generated starting from the problem part, which comprises words sampled from the problem language model, followed by the solution part, whose words are sampled either from the solution language model or from a translation model conditioned on the words already chosen in the problem part. We show, through an extensive set of experiments on real-world data, that our approach outperforms state-of-the-art text segmentation algorithms in segmentation accuracy, and that this improved accuracy translates well to improved usability in Case-Based Reasoning systems. We also analyze the robustness of our technique to varying amounts and types of noise and empirically illustrate that it is quite noise tolerant and degrades gracefully with increasing amounts of noise.
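
The generative story in the abstract can be illustrated with a small Python sketch. It is not the paper's implementation: the abstract does not give the estimation procedure, the smoothing, or how the boundary is searched, so the fixed mixing weight `lam`, the probability floor, the exhaustive scan over split points, and all function names and toy probabilities below are assumptions made for illustration. The translation term averages word-to-word probabilities over the problem words, in the style of IBM Model 1.

```python
import math

def split_log_likelihood(words, k, p_lm, s_lm, trans, lam=0.5, floor=1e-6):
    """Log-likelihood of `words` when the first k words form the problem part.

    p_lm, s_lm: unigram probabilities for problem and solution words.
    trans: maps (solution_word, problem_word) to a translation probability.
    lam and floor are illustrative assumptions, not values from the paper.
    """
    problem, solution = words[:k], words[k:]
    # Problem words come from the problem language model.
    ll = sum(math.log(p_lm.get(w, floor)) for w in problem)
    for w in solution:
        # Model-1-style term: average translation probability over problem words.
        t = sum(trans.get((w, p), 0.0) for p in problem) / len(problem) if problem else 0.0
        # Solution words come from a mixture of the solution LM and the translation model.
        ll += math.log(lam * s_lm.get(w, floor) + (1.0 - lam) * t + floor)
    return ll

def best_split(words, p_lm, s_lm, trans, lam=0.5):
    """Return the split index k (1 <= k < len(words)) with the highest likelihood."""
    return max(
        range(1, len(words)),
        key=lambda k: split_log_likelihood(words, k, p_lm, s_lm, trans, lam),
    )

# Toy models with made-up probabilities (hypothetical data).
p_lm = {"printer": 0.4, "jammed": 0.4, "paper": 0.2}
s_lm = {"removed": 0.4, "restarted": 0.4, "paper": 0.2}
trans = {("paper", "jammed"): 0.3}
doc = ["printer", "jammed", "removed", "paper", "restarted"]

k = best_split(doc, p_lm, s_lm, trans)
print("problem part:", doc[:k])
print("solution part:", doc[k:])
```

On this toy input, the word "paper" is plausible under both language models, which is the kind of lexical overlap the abstract describes; the translation entry linking it to "jammed" raises the likelihood of placing it in the solution part.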


    • Published in

      CIKM '12: Proceedings of the 21st ACM international conference on Information and knowledge management
      October 2012
      2840 pages
      ISBN:9781450311564
      DOI:10.1145/2396761

      Copyright © 2012 ACM


      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Acceptance Rates

      Overall acceptance rate: 1,861 of 8,427 submissions, 22%
