Towards Automatic Numerical Cross-Checking: Extracting Formulas from Text

Authors:
Yixuan Cao, Institute of Computing Technology, Chinese Academy of Sciences & University of Chinese Academy of Sciences, Beijing, China
Hongwei Li, Institute of Computing Technology, Chinese Academy of Sciences & University of Chinese Academy of Sciences, Beijing, China
Ping Luo, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Jiaquan Yao, Jinan University, Guangzhou, China

WWW '18: Proceedings of the 2018 World Wide Web Conference, April 2018, Pages 1795–1804
https://doi.org/10.1145/3178876.3186166

Published: 10 April 2018

ABSTRACT

Verbal descriptions of numerical relationships among objective measures are widespread in documents published on the Web, especially in finance. However, because the volume of documents is large and the time for manual cross-checking is limited, these claims may be inconsistent with the original structured data of the related indicators, even after official publication. Such errors can seriously affect investors' assessment of a company and may cause them to undervalue the firm, even when the mistakes are unintentional rather than deliberate. This creates an opportunity for automated Numerical Cross-Checking (NCC) systems. This paper introduces the key component of such a system, the formula extractor, which extracts formulas from verbal descriptions of numerical claims. Specifically, we formulate the task as a directed acyclic graph (DAG) structure prediction problem and propose an iterative relation extraction model to address it. The model applies a bi-directional LSTM followed by a DAG-structured LSTM to extract formulas iteratively, layer by layer. It is trained on a human-labeled dataset of tens of thousands of sentences. The evaluation shows that the model is effective at formula extraction: at the relation level, it achieves 97.78% precision and 98.33% recall; at the sentence level, its predictions are perfect for 92.02% of sentences. Overall, the NCC project has received wide recognition in the Chinese financial community.
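
To make the idea of a formula as a DAG more concrete, the following is a minimal Python sketch of how an extracted formula might be represented and checked against structured data. It is an illustration only, not the paper's model or data format: the `Indicator` and `Op` classes, the `evaluate` and `is_consistent` functions, the tolerance, and the sample claim and figures are all hypothetical. The graph is a DAG rather than a tree because the previous-year indicator is shared by two operator nodes.

```python
import math
import operator
from dataclasses import dataclass
from typing import Dict, Tuple, Union


@dataclass(frozen=True)
class Indicator:
    """Leaf node: a named measure looked up in the structured data."""
    name: str


@dataclass(frozen=True)
class Op:
    """Internal node: an arithmetic operator applied to its operands in order."""
    symbol: str
    operands: Tuple["Node", ...]


Node = Union[Indicator, Op, float]

OPERATORS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}


def evaluate(node: Node, data: Dict[str, float]) -> float:
    """Recursively compute the value of a formula node from the data."""
    if isinstance(node, Indicator):
        return data[node.name]
    if isinstance(node, Op):
        values = [evaluate(child, data) for child in node.operands]
        result = values[0]
        for value in values[1:]:
            result = OPERATORS[node.symbol](result, value)
        return result
    return float(node)  # numeric constant


def is_consistent(formula: Node, claimed: float, data: Dict[str, float],
                  rel_tol: float = 1e-3) -> bool:
    """Compare the claimed number with the value recomputed from the data."""
    return math.isclose(evaluate(formula, data), claimed, rel_tol=rel_tol)


# Hypothetical claim: "Net profit grew 25% over the previous year."
previous = Indicator("net_profit_2016")
current = Indicator("net_profit_2017")
# (current - previous) / previous; `previous` is reused, so this is a DAG.
growth = Op("/", (Op("-", (current, previous)), previous))

reported = {"net_profit_2016": 80.0, "net_profit_2017": 100.0}
print(is_consistent(growth, 0.25, reported))  # True
```

In this sketch the extractor's job would be to produce the `growth` structure from the sentence; the cross-check itself then reduces to evaluating that structure on the reported figures and comparing the result with the claimed number.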

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Information extraction
  2. Machine learning
    1. Learning paradigms
      1. Supervised learning
        Structured outputs
2. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Information extraction

Published in
WWW '18: Proceedings of the 2018 World Wide Web Conference
April 2018, 2000 pages
ISBN: 9781450356398
General Chairs: Pierre-Antoine Champin (Université Claude Bernard Lyon 1, France), Fabien Gandon (Inria, Université Côte d'Azur, CNRS, I3S, France), Lionel Médini (Université Claude Bernard Lyon 1, France)
Program Chairs: Mounia Lalmas (Spotify, UK), Panagiotis G. Ipeirotis (New York University, USA)
Copyright © 2018 ACM
Publisher
International World Wide Web Conferences Steering Committee
Republic and Canton of Geneva, Switzerland
Author Tags
formula extraction
information extraction
iterative relation extraction
numerical cross-checking
relation extraction
Qualifiers
- research-article
Acceptance Rates
WWW '18 Paper Acceptance Rate: 170 of 1,155 submissions, 15%
Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%