research-article

Open Access

Assessing The Factual Accuracy of Generated Text

Authors:
Ben Goodrich, Google Brain, Mountain View, CA, USA
Vinay Rao, Google Brain, Mountain View, CA, USA
Peter J. Liu, Google Brain, Mountain View, CA, USA
Mohammad Saleh, Google Brain, Mountain View, CA, USA

KDD '19: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, July 2019, Pages 166–175
https://doi.org/10.1145/3292500.3330955

Published: 25 July 2019

ABSTRACT

We propose a model-based metric to estimate the factual accuracy of generated text that is complementary to typical scoring schemes like ROUGE (Recall-Oriented Understudy for Gisting Evaluation) and BLEU (Bilingual Evaluation Understudy). We introduce and release a new large-scale dataset based on Wikipedia and Wikidata to train relation classifiers and end-to-end fact extraction models. The end-to-end models are shown to be able to extract complete sets of facts from datasets with full pages of text. We then analyse multiple models that estimate factual accuracy on a Wikipedia text summarization task, and show their efficacy compared to ROUGE and other model-free variants by conducting a human evaluation study.
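The abstract does not spell out how the metric is computed, so the following is only a minimal sketch of one plausible reading: facts are treated as (subject, relation, object) triples, and a generated text is scored by the fraction of its facts that also appear in the reference. The function names, the triple format, and the hand-written facts (which stand in for the output of a trained fact extraction model) are assumptions for illustration, not the authors' implementation. A unigram ROUGE-1 recall is included to show why such a score can be complementary to n-gram overlap.

```python
from collections import Counter
from typing import Iterable, Set, Tuple

# Hypothetical fact representation: (subject, relation, object).
Fact = Tuple[str, str, str]


def fact_precision(generated: Iterable[Fact], reference: Iterable[Fact]) -> float:
    """Fraction of facts in the generated text that also occur in the reference.

    Assumed definition for illustration; returns 0.0 when no facts were extracted.
    """
    gen: Set[Fact] = set(generated)
    ref: Set[Fact] = set(reference)
    if not gen:
        return 0.0
    return len(gen & ref) / len(gen)


def rouge1_recall(candidate: str, reference: str) -> float:
    """ROUGE-1 recall: clipped unigram overlap divided by reference length."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0
    # Counter intersection keeps the minimum count per token (clipping).
    overlap = sum((cand_counts & ref_counts).values())
    return overlap / total


# Toy example: the two sentences differ by one word, but that word changes the fact.
reference_text = "Barack Obama was born in Hawaii"
generated_text = "Barack Obama was born in Kenya"

# Hand-written triples standing in for a fact extractor's output (assumption).
reference_facts = [("Barack Obama", "place of birth", "Hawaii")]
generated_facts = [("Barack Obama", "place of birth", "Kenya")]

rouge_score = rouge1_recall(generated_text, reference_text)    # 5 of 6 unigrams match
fact_score = fact_precision(generated_facts, reference_facts)  # no fact matches

print(f"ROUGE-1 recall: {rouge_score:.3f}")
print(f"Fact precision: {fact_score:.3f}")
```

In this toy case the unigram overlap is high while the fact-level score is zero, which is the kind of disagreement a fact-based metric is meant to expose.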

Index Terms

Assessing The Factual Accuracy of Generated Text
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Supervised learning
        Structured outputs
    2. Machine learning approaches
      1. Neural networks


Published in
KDD '19: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
July 2019
3305 pages
ISBN: 9781450362016
DOI: 10.1145/3292500
General Chairs: Ankur Teredesai (KenSci), Vipin Kumar (University of Minnesota)
Program Chairs: Ying Li (EV Analysis Corporation), Rómer Rosales (LinkedIn), Evimaria Terzi (Boston University), George Karypis (University of Minnesota)
Copyright © 2019 Owner/Author
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike International 4.0 License.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
deep learning
factual correctness
generative models
metric
transformers
Acceptance Rates
KDD '19 Paper Acceptance Rate: 110 of 1,200 submissions, 9%
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%