ABSTRACT
In this paper we present work on automatically correcting OCRed text from a digital archive set up to preserve documents created during Argentina's 1976-1983 dictatorship, also known as the National Reorganization Process (Proceso de Reorganización Nacional). These documents are unusual in their structure, content, and state of preservation, which makes the corpus challenging. We adopted a post-processing approach in which we build a dedicated dictionary and correct the OCRed text based on edit distances and typographical characteristics of the text. On a representative test set we were able to correct about 30% of the OCR errors.
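The dictionary-plus-edit-distance idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `levenshtein` and `correct_token`, the `max_distance` threshold, and the tiny sample dictionary are all hypothetical, and the typographical characteristics the abstract mentions are not modeled here.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def correct_token(token: str, dictionary: set, max_distance: int = 2) -> str:
    """Return the closest dictionary word within max_distance, else the token.

    Hypothetical helper: the threshold and tie-breaking (alphabetical order)
    are assumptions made for this sketch, not taken from the paper.
    """
    if token in dictionary:
        return token
    best, best_distance = token, max_distance + 1
    for word in sorted(dictionary):
        distance = levenshtein(token, word)
        if distance < best_distance:
            best, best_distance = word, distance
    return best


# Assumed sample dictionary, for illustration only.
sample_dictionary = {"dictadura", "proceso", "nacional"}
corrected = correct_token("dictadvra", sample_dictionary)
```

In this sketch a token is left unchanged when no dictionary entry lies within the threshold, which avoids replacing unknown words such as proper names with unrelated entries.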
Index Terms
- OCR correction of documents generated during Argentina's national reorganization process