DOI: 10.1145/3292500.3330955 (KDD conference proceedings)
research-article
Open Access

Assessing The Factual Accuracy of Generated Text

Published: 25 July 2019

ABSTRACT

We propose a model-based metric to estimate the factual accuracy of generated text that is complementary to typical scoring schemes like ROUGE (Recall-Oriented Understudy for Gisting Evaluation) and BLEU (Bilingual Evaluation Understudy). We introduce and release a new large-scale dataset based on Wikipedia and Wikidata to train relation classifiers and end-to-end fact extraction models. The end-to-end models are shown to be able to extract complete sets of facts from datasets with full pages of text. We then analyse multiple models that estimate factual accuracy on a Wikipedia text summarization task, and show their efficacy compared to ROUGE and other model-free variants by conducting a human evaluation study.
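To make the idea of a fact-based score concrete, the following is a minimal sketch rather than the authors' metric. It assumes that facts have already been extracted as (subject, relation, object) triples, in the style of Wikidata statements, and it scores generated text by the fraction of its facts that also appear in the reference. The names `Fact`, `normalize`, and `fact_accuracy` are hypothetical, and the extraction step itself (the relation classifier or end-to-end model) is not shown.

```python
from typing import Iterable, Tuple

# Assumption: a fact is a (subject, relation, object) triple of strings.
Fact = Tuple[str, str, str]


def normalize(fact: Fact) -> Fact:
    """Lowercase and strip each part so trivial surface differences do not count."""
    subject, relation, obj = (part.strip().lower() for part in fact)
    return (subject, relation, obj)


def fact_accuracy(reference_facts: Iterable[Fact],
                  generated_facts: Iterable[Fact]) -> float:
    """Fraction of generated facts that are also present in the reference.

    Assumption: text with no extracted facts scores 0.0; another convention
    may be more appropriate depending on the task.
    """
    reference = {normalize(f) for f in reference_facts}
    generated = {normalize(f) for f in generated_facts}
    if not generated:
        return 0.0
    return len(reference & generated) / len(generated)


# Illustrative data only, not taken from the paper's dataset.
reference = [
    ("Barack Obama", "place of birth", "Hawaii"),
    ("Barack Obama", "occupation", "politician"),
]
generated = [
    ("Barack Obama", "place of birth", "Kenya"),
    ("barack obama", "occupation", "politician"),
]

score = fact_accuracy(reference, generated)
print(score)
```

In this toy case one of the two generated facts is supported by the reference, so the score is 0.5. The two texts could still share most of their words, which illustrates why a fact-level score can be complementary to n-gram overlap measures such as ROUGE and BLEU.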
