research-article

Comparative document summarization via discriminative sentence selection

Published: 29 October 2012

Abstract

Given a collection of document groups, a natural task is to identify the differences among them. Although traditional document summarization techniques can summarize the content of the document groups one by one, there is a clear need for a summary of the differences among the groups. In this article, we study a novel problem: summarizing the differences between document groups. We propose a discriminative sentence selection method that extracts the most discriminative sentences, those that represent the specific characteristics of each document group. Experiments and case studies on real-world data sets demonstrate the effectiveness of the proposed method.
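
To make the idea of discriminative sentence selection concrete, the following is a minimal sketch and not the authors' method. It swaps in a simple heuristic: each sentence is scored by the mean smoothed log-odds of its words occurring in its own group rather than in the other groups, and the top-scoring sentences are kept for each group. The function names, the crude tokenizer, the add-one smoothing, and the toy sentences are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase alphabetic tokens; a deliberately crude tokenizer (assumption)."""
    return re.findall(r"[a-z]+", text.lower())


def select_discriminative(groups, k=1):
    """Return the k highest-scoring sentences for each group.

    groups: dict mapping a group name to a list of sentences (strings).
    A sentence's score is the mean add-one-smoothed log-odds of its words
    appearing in its own group versus all other groups. This is a stand-in
    heuristic, not the selection criterion used in the article.
    """
    # Word counts per group and over the whole collection.
    counts = {g: Counter(w for s in sents for w in tokenize(s))
              for g, sents in groups.items()}
    total = Counter()
    for c in counts.values():
        total.update(c)
    vocab_size = len(total)
    grand_total = sum(total.values())

    summary = {}
    for g, sents in groups.items():
        in_total = sum(counts[g].values())
        out_total = grand_total - in_total

        def score(sentence):
            words = tokenize(sentence)
            if not words:
                return float("-inf")
            log_odds = 0.0
            for w in words:
                p_in = (counts[g][w] + 1) / (in_total + vocab_size)
                p_out = (total[w] - counts[g][w] + 1) / (out_total + vocab_size)
                log_odds += math.log(p_in / p_out)
            return log_odds / len(words)

        summary[g] = sorted(sents, key=score, reverse=True)[:k]
    return summary


# Toy document groups (hypothetical data): one sentence is shared by both.
groups = {
    "A": [
        "The party will hire more police officers.",
        "Crime is a serious problem.",
        "More police officers will patrol the streets.",
    ],
    "B": [
        "The party will fund youth programs.",
        "Crime is a serious problem.",
        "Youth programs keep teenagers away from crime.",
    ],
}
summary = select_discriminative(groups, k=1)
```

In this toy input, the sentence shared by both groups scores near zero and is not selected, while the sentences built from group-specific vocabulary rank highest. A per-group generic summarizer would have no reason to penalize the shared sentence, which is the gap the abstract points to.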

      Reviews

      David Parry

      The main novelty of this interesting and important paper lies in identifying an under-researched problem: how to summarize the differences between documents. While document similarity and document summarization are well-known problems, this paper shows that summarizing the differences between documents is not the same problem as showing the differences between document summaries. With such summaries, users can discover documents that take different views of, or different approaches to, similar problems. For example, applied to different political manifestos, this approach could compare and contrast two approaches to lowering crime. The authors demonstrate an approach that looks for the most discriminative sentences that distinguish two documents, as identified by human coders. Their method maximizes the information gain obtained by selecting a particular sentence as a differentiation marker, and it is shown to be successful in automatically identifying the same sentences as those identified by the human coders. The algorithm is well described. This paper is particularly recommended for computer scientists interested in information retrieval problems, and it may be directly applicable to fields such as forensics or software code maintenance, as well as to identifying "interesting" news. The background is very well presented and is suitable for graduate students or practitioners in the field. News organizations may also be interested, as would readers who want to summarize discussions in which there is disagreement, such as those on a bulletin board.

      Online Computing Reviews Service

      • Published in

        ACM Transactions on Knowledge Discovery from Data, Volume 6, Issue 3
        October 2012
        126 pages
        ISSN:1556-4681
        EISSN:1556-472X
        DOI:10.1145/2362383

        Copyright © 2012 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 29 October 2012
        • Revised: 1 April 2012
        • Accepted: 1 April 2012
        • Received: 1 June 2011

        Qualifiers

        • research-article
        • Research
        • Refereed
