Abstract
This article describes an approach to the automated reading of biomedical data dictionaries. Automated reading is the process of extracting the details of each data element from a data dictionary in a document format (such as PDF) into a fully structured representation. A structured representation is essential if the data dictionary metadata are to be used in applications such as data integration or in evaluating the quality of the associated data. We present an approach and an implemented solution that handle different data dictionary formats. We focus in particular on the most challenging format, for which we present a machine-learning classification solution based on conditional random field classifiers. We present an evaluation on several actual data dictionaries that demonstrates the effectiveness of our approach.