Case Study of a highly automated Layout Analysis and OCR of an incunabulum: 'Der Heiligen Leben' (1488)

Authors:
Christian Reul, University of Würzburg, Am Hubland, Würzburg
Marco Dittrich, University Library of Würzburg, Am Hubland, Würzburg
Martin Gruner, University Library of Würzburg, Am Hubland, Würzburg

DATeCH2017: Proceedings of the 2nd International Conference on Digital Access to Textual Cultural Heritage, June 2017, Pages 155–160
https://doi.org/10.1145/3078081.3078098

Published: 01 June 2017

ABSTRACT

This paper provides the first thorough documentation of a high-quality digitization process applied to an early printed book from the incunabulum period (1450-1500). The entire OCR-related workflow, including preprocessing, layout analysis, and text recognition, is illustrated in detail using the example of 'Der Heiligen Leben', printed in Nuremberg in 1488. For each step, the required time expenditure was recorded. The recognition rate was excellent at both the character (97.97%) and word (91.58%) level. Furthermore, a highly automated method (LAREX) and a manual method (Aletheia) for layout analysis were compared. Substantially automating the segmentation reduced the required human effort considerably, from 39 hours to around eight hours, without any significant drop in OCR accuracy. Realistic estimates of the human effort necessary for full-text extraction from incunabula can be derived from this study. The printed pages of the complete work, together with the OCR result, are available online, ready to be inspected and downloaded.
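
To make the two recognition rates in the abstract concrete, the following is a minimal sketch of how character-level and word-level accuracy are commonly computed from an edit distance between a ground-truth transcription and the OCR output. The abstract does not state which evaluation tool or exact formula the authors used, so the function names (`edit_distance`, `accuracy`, `char_accuracy`, `word_accuracy`) and the definition "1 minus errors divided by reference length" are assumptions made for illustration only.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete an item of ref
                            curr[j - 1] + 1,      # insert an item of hyp
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def accuracy(ref, hyp):
    """Assumed definition: 1 - errors / reference length, clamped at 0."""
    if len(ref) == 0:
        return 1.0 if len(hyp) == 0 else 0.0
    return max(0.0, 1.0 - edit_distance(ref, hyp) / len(ref))


def char_accuracy(ref_text, ocr_text):
    """Accuracy over individual characters (whitespace included)."""
    return accuracy(ref_text, ocr_text)


def word_accuracy(ref_text, ocr_text):
    """Accuracy over whitespace-separated tokens."""
    return accuracy(ref_text.split(), ocr_text.split())
```

Under this definition a single wrong character makes its whole word count as an error, which is one reason a word-level rate is typically lower than the character-level rate for the same text.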


Index Terms

1. Applied computing
  1. Document management and text processing
    1. Document capture
      1. Document analysis
      2. Optical character recognition

Published in

DATeCH2017: Proceedings of the 2nd International Conference on Digital Access to Textual Cultural Heritage
June 2017
179 pages
ISBN: 9781450352659
DOI: 10.1145/3078081

Copyright © 2017 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Early Printed Books
Optical Character Recognition
Segmentation
Qualifiers
- research-article
- Research
- Refereed limited
Acceptance Rates
DATeCH2017 paper acceptance rate: 29 of 37 submissions (78%). Overall acceptance rate: 60 of 86 submissions (70%).