research-article

Extracting person names from diverse and noisy OCR text

Authors:
Thomas L. Packer (Brigham Young University, Provo, UT, USA)
Joshua F. Lutes (Brigham Young University, Provo, UT, USA)
Aaron P. Stewart (Brigham Young University, Provo, UT, USA)
David W. Embley (Brigham Young University, Provo, UT, USA)
Eric K. Ringger (Brigham Young University, Provo, UT, USA)
Kevin D. Seppi (Brigham Young University, Provo, UT, USA)
Lee S. Jensen (Ancestry.com, Inc., Provo, UT, USA)

AND '10: Proceedings of the fourth workshop on Analytics for noisy unstructured text data, October 2010, Pages 19–26, https://doi.org/10.1145/1871840.1871845

Published: 26 October 2010

ABSTRACT

Named entity recognition applied to scanned and OCRed historical documents can contribute to the discoverability of historical information. However, entity recognition from some historical corpora is much more difficult than from natively digital text because of the marked presence of word errors and absence of page layout information. How difficult can it be and what level of quality can be expected? We apply three typical extraction algorithms to the task of extracting person names from multiple types of noisy OCR documents found in the collection of a major genealogy content provider and compare their performance using a number of quality metrics. We also show an improvement in extraction quality using a majority-vote ensemble of the three extractors. We evaluate the extraction quality with respect to two references: what a human can manually extract from OCR output and from the original document images. We illustrate the challenges and opportunities at hand for extracting names from OCRed data and identify directions for further improvement.
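
The majority-vote ensemble mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say at what granularity the votes are taken, so this example assumes token-level voting over a shared tokenization; the label names `NAME` and `O`, the function `majority_vote`, and the sample data are hypothetical and are not taken from the paper.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(label_seqs: Sequence[Sequence[str]]) -> List[str]:
    """Combine per-token labels from several extractors by majority vote.

    Assumption: every extractor labels the same token sequence, one label
    per token ("NAME" for a person-name token, "O" for anything else).
    """
    if not label_seqs:
        return []
    n = len(label_seqs[0])
    if any(len(seq) != n for seq in label_seqs):
        raise ValueError("all extractors must label the same token sequence")

    combined = []
    for votes in zip(*label_seqs):
        label, count = Counter(votes).most_common(1)[0]
        # Keep a label only when more than half of the extractors agree;
        # otherwise fall back to the non-name label "O".
        combined.append(label if count * 2 > len(votes) else "O")
    return combined


# Hypothetical outputs of three extractors on one noisy OCR line.
tokens = ["Mrs", "Mary", "Sm1th", "of", "Provo"]
extractor_a = ["O", "NAME", "NAME", "O", "O"]
extractor_b = ["O", "NAME", "O", "O", "NAME"]
extractor_c = ["NAME", "NAME", "NAME", "O", "O"]

ensemble = majority_vote([extractor_a, extractor_b, extractor_c])
names = [tok for tok, lab in zip(tokens, ensemble) if lab == "NAME"]
```

With three voters and two labels a strict majority always exists, so an error made by one extractor alone (such as labeling "Mrs" or "Provo" as a name) is outvoted by the other two.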

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
2. Information systems
  1. Information retrieval
    1. Document representation

Reviews

Reviewer: Joao Luis Garcia Rosa

The authors of this paper provide a satisfying read about named entity recognition (NER) in noisy optical character recognition (OCR) texts. They deliver on their promise of providing answers to many questions that researchers in this area might have. Packer et al. draw many interesting conclusions about performing the difficult task of extracting names from noisy scanned documents: "Word order errors can play a bigger role in poor extraction performance than character recognition errors"; "The knowledge-based approaches performed better than the machine learning (ML) approaches"; and "Combining basic extraction methods can produce higher quality NER." Regarding the conclusion about machine learning approaches, ML lovers need not despair. The authors point out two ways to overcome their deficiencies: either apply a more realistic noise model of OCR errors to the computational natural language learning (CoNLL) training data or use semi-supervised ML techniques to take advantage of the large number of unlabeled documents.

Published in
AND '10: Proceedings of the fourth workshop on Analytics for noisy unstructured text data
October 2010
96 pages
ISBN:9781450303767
DOI:10.1145/1871840
Program Chairs:
Roberto Basili (University of Rome, Italy)
Daniel Lopresti (Lehigh University, USA)
Christoph Ringlstetter (University of Munich, Germany)
Shourya Roy (Xerox India Innovation Hub, India)
Klaus U. Schulz (University of Munich, Germany)
L. Venkata Subramaniam (IBM Research, India)
Copyright © 2010 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
CRF
MEMM
NER
information extraction
named entity recognition
noisy OCR
scanned document images
Acceptance Rates
Overall Acceptance Rate: 15 of 22 submissions, 68%