
OCR correction of documents generated during Argentina's national reorganization process

Authors:
Paula Estrella, FaMAF, Universidad Nacional de Córdoba, Córdoba, Argentina
Pablo Paliza, FaMAF, Universidad Nacional de Córdoba, Ciudad Universitaria, Córdoba, Argentina

DATeCH '14: Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage, May 2014, Pages 119–123
https://doi.org/10.1145/2595188.2595194

Published: 19 May 2014

ABSTRACT

In this paper we present work on automatically correcting OCRed text from a digital archive set up to preserve documents created during Argentina's 1976-1983 dictatorship, also known as the National Reorganization Process (Proceso de Reorganización Nacional). These documents are unusual in their structure, content and state of preservation, which makes the corpus challenging. We adopted a post-processing approach in which we build a dedicated dictionary and correct the OCRed text based on edit distances and typographical characteristics of the text. On a representative test set we were able to correct about 30% of the OCR errors.
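
As an illustration of the general idea of dictionary-based post-correction with edit distances, the following is a minimal sketch and not the authors' implementation. It assumes a plain Levenshtein distance, a small hypothetical word list, and a hypothetical threshold `max_distance`; the typographical characteristics mentioned in the abstract are not modeled here, and the function names are chosen only for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca -> cb (or match)
            ))
        previous = current
    return previous[-1]


def correct_token(token: str, dictionary: set, max_distance: int = 2) -> str:
    """Return the closest dictionary word within max_distance, else the token."""
    if token in dictionary:
        return token
    best, best_distance = token, max_distance + 1
    # Sorting makes tie-breaking deterministic (first word alphabetically wins).
    for word in sorted(dictionary):
        distance = levenshtein(token, word)
        if distance < best_distance:
            best, best_distance = word, distance
    return best


def correct_text(text: str, dictionary: set, max_distance: int = 2) -> str:
    """Correct a whitespace-tokenized OCR string token by token."""
    return " ".join(correct_token(t, dictionary, max_distance) for t in text.split())


if __name__ == "__main__":
    # Hypothetical dictionary; a real one would be built from the archive itself.
    words = {"proceso", "de", "reorganización", "nacional"}
    print(correct_text("pr0ceso de reorganizacion nacionai", words))
    # prints: proceso de reorganización nacional
```

Tokens with no dictionary word within the threshold are left unchanged, which avoids replacing proper names or codes that the dictionary does not cover. A linear scan over the dictionary is only practical for small word lists.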


Index Terms

1. Applied computing
  1. Document management and text processing
    1. Document capture
      1. Document analysis
      2. Optical character recognition

Published in
DATeCH '14: Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage
May 2014
200 pages
ISBN: 9781450325882
DOI: 10.1145/2595188
Program Chairs:
Apostolos Antonacopoulos, University of Salford
Klaus U. Schulz, Ludwig-Maximilians-Universität München
Copyright © 2014 Owner/Author
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
automatic OCR correction
dictionary building
digital archives
performance evaluation
Qualifiers
- research-article
Acceptance Rates
DATeCH '14 Paper Acceptance Rate: 31 of 49 submissions, 63%
Overall Acceptance Rate: 60 of 86 submissions, 70%