ABSTRACT
We investigate tagging figure and table captions in scientific articles from geology to support visualization of research findings on maps and timelines. Our proposed approach identifies geological time expressions and geographic and geologic locations without requiring large pre-annotated data. Different tagging approaches are tested and evaluated on a corpus of captions extracted from scientific geological articles. Our baseline method uses geologic timescale ontologies and GeoNames as gazetteers for looking up time expressions and location names. The baseline is evaluated on a development set of captions from 20 documents, and the results are analyzed manually to identify the causes of tagging errors. We found that the poor performance of the baseline approach is mainly due to i) lack of coverage in the gazetteers, ii) incorrect tagging of person names as location names, and iii) a simplistic gazetteer lookup restricted to capitalized words. We extended the baseline approach by enlarging the gazetteers, by adding reference identification to prevent person names from being tagged as locations, by filtering trivial matches, and by correcting capitalization through truecasing before lookup. The different configurations of our extended approach were evaluated on a test set of 80 documents, achieving improved precision and recall of more than 90%.
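The extended lookup described above can be illustrated with a minimal sketch. It is not the authors' implementation: the gazetteer contents, the `TRIVIAL_MATCHES` list, the `REFERENCE` pattern, and the function names are all hypothetical, and case-insensitive matching that emits the gazetteer's canonical form stands in for a real truecasing model.

```python
import re

# Toy gazetteers (assumption). The paper uses geologic timescale
# ontologies and GeoNames, which are far larger.
TIME_GAZETTEER = {"Jurassic", "Late Cretaceous", "Holocene"}
LOCATION_GAZETTEER = {"Norway", "Barents Sea", "London", "Map"}

# Hypothetical list of gazetteer entries that are too generic to tag.
TRIVIAL_MATCHES = {"map"}

# Hypothetical citation pattern, e.g. "London et al. (2010)" or "Smith, 2010".
REFERENCE = re.compile(r"[A-Z][A-Za-z-]+(?: et al\.)?,? \(?\d{4}\)?")


def overlaps(span, blocked):
    """Return True if span intersects any blocked (start, end) span."""
    return any(span[0] < end and start < span[1] for start, end in blocked)


def tag_caption(caption):
    """Tag time and location mentions, returned in order of appearance."""
    # Reference identification: spans that look like citations are blocked
    # so that author names are not tagged as locations.
    blocked = [m.span() for m in REFERENCE.finditer(caption)]
    tags = []
    for label, gazetteer in (("TIME", TIME_GAZETTEER), ("LOC", LOCATION_GAZETTEER)):
        for entry in gazetteer:
            if entry.lower() in TRIVIAL_MATCHES:
                continue  # filter trivial matches
            # Case-insensitive lookup; reporting the canonical entry is a
            # simple stand-in for truecasing the caption before lookup.
            pattern = re.compile(r"\b" + re.escape(entry) + r"\b", re.IGNORECASE)
            for m in pattern.finditer(caption):
                if not overlaps(m.span(), blocked):
                    tags.append((m.start(), entry, label))
    return [(entry, label) for _, entry, label in sorted(tags)]


if __name__ == "__main__":
    print(tag_caption("MAP OF JURASSIC OUTCROPS IN NORWAY, after London et al. (2010)"))
```

In this sketch the all-caps caption still yields the Jurassic and Norway tags, while "London" is skipped because it falls inside a citation span and "Map" is skipped as a trivial match.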
Index Terms
- Tagging of temporal expressions and geological features in scientific articles