DOI: 10.1145/3078081.3078098
research-article

Case Study of a highly automated Layout Analysis and OCR of an incunabulum: 'Der Heiligen Leben' (1488)

Published: 01 June 2017

ABSTRACT

This paper provides the first thorough documentation of a high-quality digitization process applied to an early printed book from the incunabulum period (1450-1500). The entire OCR-related workflow, including preprocessing, layout analysis, and text recognition, is illustrated in detail using the example of 'Der Heiligen Leben', printed in Nuremberg in 1488. For each step, the required time expenditure was recorded. The recognition rate was excellent at both the character (97.97%) and word (91.58%) level. Furthermore, a highly automated (LAREX) and a manual (Aletheia) method for layout analysis were compared. By substantially automating the segmentation, the required human effort was reduced considerably, from 39 hours to around eight hours, without any significant drop in OCR accuracy. Realistic estimates of the human effort necessary for full-text extraction from incunabula can be derived from this study. The printed pages of the complete work, together with the OCR result, are available online, ready to be inspected and downloaded.
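
The abstract reports recognition rates at character and word level and a reduction in human effort, but it does not say how these figures were computed. As a rough illustration only, the following Python sketch assumes the common definition of accuracy as one minus the edit distance divided by the length of the ground truth. All function names here are hypothetical and are not taken from the paper or from any tool it mentions.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (characters or words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def accuracy(ref: Sequence, hyp: Sequence) -> float:
    """Assumed definition: 1 - (edit distance / reference length)."""
    if not ref:
        raise ValueError("reference must not be empty")
    return 1.0 - edit_distance(ref, hyp) / len(ref)


def char_accuracy(ground_truth: str, ocr_output: str) -> float:
    """Accuracy over individual characters, including spaces."""
    return accuracy(ground_truth, ocr_output)


def word_accuracy(ground_truth: str, ocr_output: str) -> float:
    """Accuracy over whitespace-separated tokens."""
    return accuracy(ground_truth.split(), ocr_output.split())


def effort_reduction(before_hours: float, after_hours: float) -> float:
    """Fraction of human effort saved, e.g. going from 39 to 8 hours."""
    return 1.0 - after_hours / before_hours


if __name__ == "__main__":
    truth = "der heiligen leben"
    ocr = "der heiligcn leben"  # made-up OCR output with one wrong character
    print(f"character accuracy: {char_accuracy(truth, ocr):.4f}")
    print(f"word accuracy:      {word_accuracy(truth, ocr):.4f}")
    print(f"effort reduction:   {effort_reduction(39, 8):.4f}")
```

In this toy example a single wrong character out of 18 gives a character accuracy of about 94.4% but a word accuracy of only about 66.7%, because one wrong character invalidates the whole word. That is one reason word-level rates are typically lower than character-level rates. Under the same arithmetic, going from 39 hours to around eight hours corresponds to a reduction in human effort of roughly 79%.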


      • Published in

        DATeCH2017: Proceedings of the 2nd International Conference on Digital Access to Textual Cultural Heritage
        June 2017
        179 pages
        ISBN:9781450352659
        DOI:10.1145/3078081

        Copyright © 2017 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Qualifiers

        • research-article
        • Research
        • Refereed limited

        Acceptance Rates

        DATeCH2017 Paper Acceptance Rate: 29 of 37 submissions, 78%
        Overall Acceptance Rate: 60 of 86 submissions, 70%
