Machine Reading of Biomedical Data Dictionaries

Authors:
Naveen Ashish, Hutch Data Commonwealth, Fred Hutchinson Cancer Research Center, Seattle WA
Arihant Patawari, City of Hope, National Medical Center, Duarte, CA

Journal of Data and Information Quality, Volume 9, Issue 4, Article No. 21, pp. 1–20. https://doi.org/10.1145/3177874

Published: 11 May 2018

Abstract

This article describes an approach for the automated reading of biomedical data dictionaries. Automated reading is the process of extracting the details of each data element from a data dictionary in a document format (such as PDF) into a fully structured representation. A structured representation is essential if the data dictionary metadata are to be used in applications such as data integration, and in evaluating the quality of the associated data. We present an approach and an implemented solution for the problem, considering different formats of data dictionaries. We focus in particular on the most challenging format, for which we use a machine-learning classification solution based on conditional random field classifiers. We present an evaluation using several actual data dictionaries, demonstrating the effectiveness of our approach.
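
To make the idea of a "structured representation" concrete, the following is a minimal sketch in Python, not the authors' implementation. It assumes each line of text extracted from a dictionary has already been given a label by a sequence classifier such as a conditional random field; here the labels are assigned by hand. The label names (`NAME`, `DESCRIPTION`, `VALUES`, `OTHER`), the `DataElement` fields, the `assemble_elements` helper, and the sample lines are all hypothetical and are not taken from the article.

```python
from dataclasses import dataclass, field, asdict

# Hypothetical line labels; the abstract does not state the article's label set.
NAME, DESCRIPTION, VALUES, OTHER = "NAME", "DESCRIPTION", "VALUES", "OTHER"


@dataclass
class DataElement:
    """One data element of a dictionary in structured form (assumed fields)."""
    name: str
    description: str = ""
    values: list = field(default_factory=list)


def assemble_elements(labeled_lines):
    """Group (label, text) pairs into DataElement records.

    A NAME line starts a new element. DESCRIPTION and VALUES lines attach to
    the most recent element. OTHER lines (headers, page numbers) are skipped,
    as is anything that appears before the first NAME line.
    """
    elements = []
    for label, text in labeled_lines:
        text = text.strip()
        if label == NAME:
            elements.append(DataElement(name=text))
        elif not elements:
            continue  # no element has been opened yet
        elif label == DESCRIPTION:
            current = elements[-1]
            # Descriptions may wrap across several lines; join them with spaces.
            current.description = (current.description + " " + text).strip()
        elif label == VALUES:
            elements[-1].values.append(text)
    return elements


# Made-up sample input: labels stand in for the output of a trained classifier.
labeled_lines = [
    (OTHER, "Study Data Dictionary, page 3"),
    (NAME, "MMSE_TOTAL"),
    (DESCRIPTION, "Total score on the"),
    (DESCRIPTION, "Mini-Mental State Examination"),
    (VALUES, "0-30"),
    (NAME, "SEX"),
    (VALUES, "1 = Male"),
    (VALUES, "2 = Female"),
]

elements = assemble_elements(labeled_lines)
records = [asdict(e) for e in elements]  # plain dicts, ready for JSON or a database
```

In this sketch the hard part of the problem, deciding which label each line receives, is left to the classifier; the grouping step only shows how labeled lines could be turned into records usable by downstream applications such as data integration.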

Index Terms

Machine Reading of Biomedical Data Dictionaries
1. Information systems
  1. Data management systems
    1. Database design and models
      1. Data model extensions
        Semi-structured data
    2. Information integration
      1. Extraction, transformation and loading

Published in
Journal of Data and Information Quality Volume 9, Issue 4
Challenge Paper, Experience Paper and Research Paper
December 2017
91 pages
ISSN:1936-1955
EISSN:1936-1963
DOI:10.1145/3208074
Editor:
Tiziana Catarci
Sapienza University of Rome, Rome, Italy
Copyright © 2018 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 11 May 2018
- Revised: 1 December 2017
- Accepted: 1 December 2017
- Received: 1 December 2015
Author Tags
Document extraction
machine-learning classification
Qualifiers
- research-article
- Research
- Refereed