ABSTRACT
Named entity recognition applied to scanned and OCRed historical documents can contribute to the discoverability of historical information. However, entity recognition from some historical corpora is much more difficult than from natively digital text because of the marked presence of word errors and the absence of page layout information. How difficult is it, and what level of quality can be expected? We apply three typical extraction algorithms to the task of extracting person names from multiple types of noisy OCR documents found in the collection of a major genealogy content provider, and we compare their performance using a number of quality metrics. We also show an improvement in extraction quality using a majority-vote ensemble of the three extractors. We evaluate extraction quality with respect to two references: what a human can manually extract from the OCR output, and what a human can manually extract from the original document images. We illustrate the challenges and opportunities in extracting names from OCRed data and identify directions for further improvement.
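The majority-vote ensemble mentioned above can be illustrated with a minimal sketch. It assumes that each of the three extractors labels the same token sequence and that votes are counted per token. The abstract does not state the voting granularity, the label set, or how ties are handled, so the function name `majority_vote`, the labels `PER` and `O`, the fallback to `O`, and the sample tokens are all hypothetical choices made for illustration.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[str]], fallback: str = "O") -> List[str]:
    """Combine per-token labels from several extractors by strict majority.

    Assumption: every extractor labels the same tokens in the same order.
    A token keeps a label only if more than half of the extractors agree;
    otherwise it receives the fallback label (hypothetical tie handling).
    """
    if not predictions:
        return []
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all extractors must label the same token sequence")

    combined = []
    for labels in zip(*predictions):
        # Most frequent label for this token and how many extractors chose it.
        label, count = Counter(labels).most_common(1)[0]
        combined.append(label if count * 2 > len(labels) else fallback)
    return combined


# Illustrative OCR-like tokens ("Jolm" stands in for a misrecognized "John").
tokens = ["Mr", "Jolm", "Smith", "of", "Provo"]
extractor_a = ["O", "PER", "PER", "O", "O"]
extractor_b = ["O", "O", "PER", "O", "PER"]
extractor_c = ["O", "PER", "PER", "O", "O"]

ensemble = majority_vote([extractor_a, extractor_b, extractor_c])
for token, label in zip(tokens, ensemble):
    print(token, label)
```

In this sketch a single extractor's miss on a noisy token, or a single spurious hit, is outvoted by the other two. That is the intuition behind expecting an ensemble to improve on its members.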
Index Terms
- Extracting person names from diverse and noisy OCR text