Comparative document summarization via discriminative sentence selection

Authors:
Dingding Wang, Florida International University, Miami, FL
Shenghuo Zhu, NEC Laboratories America, Inc., Cupertino, CA
Tao Li, Florida International University, Miami, FL
Yihong Gong, NEC Laboratories America, Inc., Cupertino, CA

ACM Transactions on Knowledge Discovery from Data, Volume 6, Issue 3, Article No. 12, pp. 1–18
https://doi.org/10.1145/2362383.2362386

Published: 29 October 2012

Abstract

Given a collection of document groups, a natural task is to identify the differences among them. Although traditional document summarization techniques can summarize the content of the document groups one by one, there remains a strong need to generate a summary of the differences among the groups. In this article, we study a novel problem: summarizing the differences between document groups. We propose a discriminative sentence selection method that extracts the most discriminative sentences, those that represent the specific characteristics of each document group. Experiments and case studies on real-world data sets demonstrate the effectiveness of the proposed method.
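To make the idea of discriminative sentence selection concrete, here is a minimal sketch in Python. It is not the method from the article, whose details are not given on this page. It assumes a much simpler stand-in score: the average smoothed log-ratio between a word's frequency in the sentence's own group and its frequency in all other groups. The names `tokenize`, `discriminative_sentences`, and `top_k` are hypothetical and exist only for this illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase a sentence and split it into words, dropping edge punctuation."""
    words = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return [w for w in words if w]


def discriminative_sentences(groups, top_k=1):
    """Pick, for each group, the sentences whose words are most specific to it.

    groups: dict mapping a group name to a list of sentences.
    Returns a dict mapping each group name to its top_k sentences.
    Assumption: the score is an average add-one-smoothed log frequency ratio,
    a simplified stand-in rather than the article's selection criterion.
    """
    # Word counts per group, plus the shared vocabulary used for smoothing.
    counts = {g: Counter(w for s in sents for w in tokenize(s))
              for g, sents in groups.items()}
    vocab_size = len(set().union(*counts.values()))
    totals = {g: sum(c.values()) for g, c in counts.items()}

    selected = {}
    for g, sents in groups.items():
        others = [h for h in groups if h != g]
        other_total = sum(totals[h] for h in others)

        def score(sentence):
            words = tokenize(sentence)
            if not words:
                return float("-inf")
            total = 0.0
            for w in words:
                p_in = (counts[g][w] + 1) / (totals[g] + vocab_size)
                p_out = (sum(counts[h][w] for h in others) + 1) / (other_total + vocab_size)
                total += math.log(p_in / p_out)
            # Average over words so long sentences are not favored by length alone.
            return total / len(words)

        selected[g] = sorted(sents, key=score, reverse=True)[:top_k]
    return selected


if __name__ == "__main__":
    example = {
        "a": ["Cats purr softly.", "The weather is nice today."],
        "b": ["Dogs bark loudly.", "The weather is nice today."],
    }
    print(discriminative_sentences(example))
```

In this toy input the sentence shared by both groups scores zero, so each group is represented by the sentence that appears only in it. A real system would also need to handle redundancy among the selected sentences, which this sketch ignores.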

Index Terms

1. Computing methodologies
  1. Machine learning
2. Information systems
  1. Information retrieval

Reviews

Reviewer: David Parry

The main novelty of this interesting and important paper lies in the identification of what is really an under-researched problem: how to summarize the differences between documents. While document similarity and document summarization are well-known problems, this paper shows that summarizing the difference between documents is a different problem from simply showing the difference between document summaries. With summarization, users can discover documents that have different views of, or approaches to, similar problems. For example, given different political manifestos, this approach would be able to compare and contrast two different approaches to lowering crime.

The authors demonstrate an approach that involves looking for the most discriminative sentences that distinguish two documents, as identified by human coders. Their method maximizes the information gain obtained by selecting a particular sentence as a differentiation marker, and is shown to be successful in automatically identifying the same sentences as those identified by the human coders. The algorithm is well described.

This paper is particularly recommended for computer scientists interested in information retrieval problems, and it may be directly applicable to fields such as forensics or software code maintenance, as well as to identifying "interesting" news. The background is very well presented and is suitable for graduate students or practitioners in the field. News organizations may also be interested, as would readers interested in attempting to summarize discussions, such as on a bulletin board, where there is disagreement.

Online Computing Reviews Service

Published in

ACM Transactions on Knowledge Discovery from Data Volume 6, Issue 3
October 2012
126 pages
ISSN:1556-4681
EISSN:1556-472X
DOI:10.1145/2362383

Copyright © 2012 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 29 October 2012
- Revised: 1 April 2012
- Accepted: 1 April 2012
- Received: 1 June 2011
Author Tags
Comparative document summarization
discriminative sentence selection
Qualifiers
- research-article
- Research
- Refereed
Article Metrics
- Total Citations: 22
- Total Downloads: 666
- Downloads (Last 12 months): 2
- Downloads (Last 6 weeks): 0