research-article

Machine Reading of Biomedical Data Dictionaries

Published: 11 May 2018

Abstract

This article describes an approach for the automated reading of biomedical data dictionaries. Automated reading extracts the details of each data element from a data dictionary in a document format (such as PDF) into a fully structured representation. A structured representation is essential if the data dictionary metadata are to be used in applications such as data integration, or in evaluating the quality of the associated data. We present an approach and an implemented solution that handle different formats of data dictionaries. We focus in particular on the most challenging format, for which we present a machine-learning classification solution based on conditional random field classifiers. We evaluate the approach on several actual data dictionaries and demonstrate its effectiveness.

      • Published in

        Journal of Data and Information Quality  Volume 9, Issue 4
        Challenge Paper, Experience Paper and Research Paper
        December 2017
        91 pages
        ISSN:1936-1955
        EISSN:1936-1963
        DOI:10.1145/3208074

        Copyright © 2018 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 11 May 2018
        • Revised: 1 December 2017
        • Accepted: 1 December 2017
        • Received: 1 December 2015

        Qualifiers

        • research-article
        • Research
        • Refereed