research-article

Two-part segmentation of text documents

Authors:
Deepak P., IBM Research - India, Bangalore, India
Karthik Visweswariah, IBM Research - India, Bangalore, India
Nirmalie Wiratunga, Robert Gordon University, Aberdeen, United Kingdom
Sadiq Sani, Robert Gordon University, Aberdeen, United Kingdom

CIKM '12: Proceedings of the 21st ACM international conference on Information and knowledge management, October 2012, Pages 793–802
https://doi.org/10.1145/2396761.2396862

Published: 29 October 2012

ABSTRACT

We consider the problem of segmenting text documents that have a two-part structure, such as a problem part and a solution part. Documents of this genre include incident reports, which typically describe events relating to a problem, followed by those pertaining to the solution that was tried. Segmenting such documents into their two component parts would render them usable in knowledge reuse frameworks such as Case-Based Reasoning. This segmentation problem presents a hard case for traditional text segmentation due to the lexical inter-relatedness of the segments. We develop a two-part segmentation technique that can harness a corpus of similar documents to model the behavior of the two segments and their inter-relatedness using language models and translation models, respectively. In particular, we use separate language models for the problem and solution segment types, whereas the inter-relatedness between segment types is modeled using an IBM Model 1 translation model. We model documents as being generated starting from the problem part, which comprises words sampled from the problem language model, followed by the solution part, whose words are sampled either from the solution language model or from a translation model conditioned on the words already chosen in the problem part. We show, through an extensive set of experiments on real-world data, that our approach outperforms state-of-the-art text segmentation algorithms in the accuracy of segmentation, and that this improved accuracy translates well to improved usability in Case-Based Reasoning systems. We also analyze the robustness of our technique to varying amounts and types of noise, and empirically illustrate that it is quite noise tolerant and degrades gracefully with increasing amounts of noise.
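
The generative story in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the order of the language models, the smoothing, how the solution language model and the translation model are combined, or how the split point is searched. The sketch assumes unigram models with add-one smoothing, a fixed mixing weight `lam`, a simplified IBM Model 1 without a NULL word, and an exhaustive search over split points. All function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def unigram_lm(segments, vocab, alpha=1.0):
    """Add-alpha smoothed unigram model over a fixed vocabulary (assumed)."""
    counts = Counter(w for seg in segments for w in seg)
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def ibm_model1(pairs, iterations=10):
    """Estimate t(solution_word | problem_word) by EM (no NULL word)."""
    sol_vocab = {s for _, sol in pairs for s in sol}
    uniform = 1.0 / len(sol_vocab)
    t = {}  # keyed by (solution_word, problem_word)
    for _ in range(iterations):
        count, total = Counter(), Counter()
        for prob, sol in pairs:
            for s in sol:
                # E-step: spread each solution word over the problem words
                norm = sum(t.get((s, p), uniform) for p in prob)
                for p in prob:
                    frac = t.get((s, p), uniform) / norm
                    count[(s, p)] += frac
                    total[p] += frac
        # M-step: renormalise per problem word
        t = {(s, p): c / total[p] for (s, p), c in count.items()}
    return t

def best_split(words, p_lm, s_lm, t, lam=0.5, floor=1e-9):
    """Index k that maximises the log-likelihood of words[:k] as the
    problem part and words[k:] as the solution part."""
    best_k, best_ll = None, -math.inf
    for k in range(1, len(words)):
        problem, solution = words[:k], words[k:]
        ll = sum(math.log(p_lm.get(w, floor)) for w in problem)
        for s in solution:
            # average translation probability from the problem words
            trans = sum(t.get((s, p), 0.0) for p in problem) / k
            ll += math.log(lam * s_lm.get(s, floor) + (1 - lam) * trans)
        if ll > best_ll:
            best_k, best_ll = k, ll
    return best_k

# Toy training corpus of already segmented (problem, solution) documents.
corpus = [
    ("printer shows paper jam error".split(),
     "removed jammed paper and restarted printer".split()),
    ("laptop screen flickers on startup".split(),
     "updated display driver and rebooted laptop".split()),
    ("server disk full error".split(),
     "deleted old logs and restarted server".split()),
]
vocab = {w for prob, sol in corpus for w in prob + sol}
p_lm = unigram_lm([p for p, _ in corpus], vocab)
s_lm = unigram_lm([s for _, s in corpus], vocab)
t = ibm_model1(corpus)

doc = "printer shows disk error removed old logs and restarted printer".split()
k = best_split(doc, p_lm, s_lm, t)
print(doc[:k], "|", doc[k:])
```

The translation term is what lets a solution word such as a repeated device name still count as evidence for the solution part, which is the lexical inter-relatedness the abstract describes; a purely lexical-cohesion segmenter would treat that repetition as evidence against a boundary.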

Index Terms

Two-part segmentation of text documents
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Language resources

Published in
CIKM '12: Proceedings of the 21st ACM international conference on Information and knowledge management
October 2012
2840 pages
ISBN: 9781450311564
DOI: 10.1145/2396761
General Chair: Xuewen Chen (Wayne State University, USA)
Program Chairs: Guy Lebanon (Georgia Institute of Technology), Haixun Wang (Microsoft Research Asia), Mohammed J. Zaki (Rensselaer Polytechnic Institute)
Copyright © 2012 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
language models
segmentation
text
translation models