DOI: 10.1145/1871840.1871845
research-article

Extracting person names from diverse and noisy OCR text

Published: 26 October 2010

ABSTRACT

Named entity recognition applied to scanned and OCRed historical documents can contribute to the discoverability of historical information. However, entity recognition from some historical corpora is much more difficult than from natively digital text because of the marked presence of word errors and absence of page layout information. How difficult can it be and what level of quality can be expected? We apply three typical extraction algorithms to the task of extracting person names from multiple types of noisy OCR documents found in the collection of a major genealogy content provider and compare their performance using a number of quality metrics. We also show an improvement in extraction quality using a majority-vote ensemble of the three extractors. We evaluate the extraction quality with respect to two references: what a human can manually extract from OCR output and from the original document images. We illustrate the challenges and opportunities at hand for extracting names from OCRed data and identify directions for further improvement.
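As a rough illustration of the majority-vote ensemble mentioned in the abstract, the sketch below combines per-token labels from three extractors. The abstract does not say at what granularity the votes are taken, so token-level voting, the `PER`/`O` label set, the function name `majority_vote`, and the sample tokens are all assumptions made for this example, not the authors' implementation.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[str]]) -> List[str]:
    """Combine per-token labels from several extractors by majority vote."""
    if not predictions:
        return []
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all extractors must label the same tokens")
    combined = []
    for labels in zip(*predictions):
        label, count = Counter(labels).most_common(1)[0]
        # Keep a label only if a strict majority agrees; otherwise
        # fall back to "O" (assumed here to mean "not part of a name").
        combined.append(label if count * 2 > len(labels) else "O")
    return combined


# Hypothetical OCR tokens ("Jobn" stands in for a misrecognized "John").
tokens = ["Mr", "Jobn", "Smith", "of", "Provo"]
extractor_a = ["O", "PER", "PER", "O", "O"]
extractor_b = ["O", "O", "PER", "O", "PER"]
extractor_c = ["O", "PER", "PER", "O", "O"]

voted = majority_vote([extractor_a, extractor_b, extractor_c])
for token, label in zip(tokens, voted):
    print(token, label)
```

With three voters and two labels there is always a strict majority, so the fallback to `O` only matters if more extractors or more label types are added.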

      Reviews

      Joao Luis Garcia Rosa

      The authors of this paper provide a satisfying read about named entity recognition (NER) in noisy optical character recognition (OCR) texts. They deliver on their promise of answering many questions that researchers in this area might have. Packer et al. draw several interesting conclusions about the difficult task of extracting names from noisy scanned documents: "Word order errors can play a bigger role in poor extraction performance than character recognition errors"; "The knowledge-based approaches performed better than the machine learning (ML) approaches"; and "Combining basic extraction methods can produce higher quality NER." Regarding the conclusion about machine learning approaches, ML lovers need not despair. The authors point out two ways to overcome the deficiencies of the ML approaches: either apply a more realistic noise model of OCR errors to the Conference on Computational Natural Language Learning (CoNLL) training data, or use semi-supervised ML techniques to take advantage of the large number of unlabeled documents.

      Online Computing Reviews Service

      • Published in

        AND '10: Proceedings of the fourth workshop on Analytics for noisy unstructured text data
        October 2010
        96 pages
        ISBN: 9781450303767
        DOI: 10.1145/1871840

        Copyright © 2010 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Acceptance Rates

        Overall acceptance rate: 15 of 22 submissions, 68%
