DOI: 10.1145/2595188.2595194
research-article

OCR correction of documents generated during Argentina's national reorganization process

Published: 19 May 2014

ABSTRACT

In this paper we present work on automatically correcting OCRed text from a digital archive set up to preserve documents created during Argentina's 1976-1983 dictatorship, also known as the National Reorganization Process (Proceso de Reorganización Nacional). These documents are unusual in their structure, content and state of preservation, which makes them a challenging corpus. We adopted a post-processing approach in which we build a specific dictionary and correct the OCRed text based on edit distances and typographical characteristics of the text. On a representative test set we were able to correct about 30% of the OCR errors.
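
As an illustration of the general idea of dictionary-based post-correction with edit distances, the following minimal sketch replaces an out-of-dictionary token with its nearest dictionary word. It is not the authors' implementation: the function names, the distance threshold `max_distance`, the alphabetical tie-breaking and the toy lexicon are all assumptions made for this example, and the typographical characteristics mentioned in the abstract are not modelled.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def correct_token(token: str, lexicon: set, max_distance: int = 2) -> str:
    """Return the closest lexicon word within max_distance, else the token.

    Hypothetical helper: the threshold and the alphabetical tie-breaking
    are assumptions, not details taken from the paper.
    """
    if token in lexicon:
        return token
    best, best_distance = token, max_distance + 1
    for word in sorted(lexicon):  # sorted for deterministic tie-breaking
        distance = levenshtein(token, word)
        if distance < best_distance:
            best, best_distance = word, distance
    return best


# Toy lexicon standing in for the specific dictionary described above.
lexicon = {"proceso", "nacional", "reorganización"}
corrected = [correct_token(t, lexicon) for t in ["procesc", "nacicnal", "xyz"]]
```

In this sketch a token with no dictionary word within the threshold is left unchanged, which avoids overcorrecting names and codes that a dictionary cannot be expected to cover.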


Published in

DATeCH '14: Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage
May 2014
200 pages
ISBN: 9781450325882
DOI: 10.1145/2595188

Copyright © 2014 Owner/Author

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery
New York, NY, United States

Acceptance Rates

DATeCH '14 Paper Acceptance Rate: 31 of 49 submissions, 63%
Overall Acceptance Rate: 60 of 86 submissions, 70%
