ABSTRACT
This paper provides the first thorough documentation of a high-quality digitization process applied to an early printed book from the incunabulum period (1450-1500). The entire OCR-related workflow, including preprocessing, layout analysis, and text recognition, is illustrated in detail using the example of 'Der Heiligen Leben', printed in Nuremberg in 1488. The time required for each step was recorded. Recognition accuracy was excellent at both the character level (97.97%) and the word level (91.58%). Furthermore, a highly automated method for layout analysis (LAREX) was compared with a manual one (Aletheia). Substantially automating the segmentation reduced the required human effort from 39 hours to around eight hours without any significant drop in OCR accuracy. Realistic estimates of the human effort needed for full-text extraction from incunabula can be derived from this study. The printed pages of the complete work, together with the OCR result, are available online for inspection and download.
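The abstract reports accuracy at character and word level but does not say how the figures were computed. The sketch below shows one common definition, assumed here rather than taken from the paper: accuracy is one minus the edit distance between ground truth and OCR output, divided by the ground-truth length. The function names and the sample strings are hypothetical, and the last helper simply restates the reported effort figures (39 hours versus around eight hours) as a relative saving.

```python
def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def accuracy(ground_truth, ocr_output):
    """Assumed definition: 1 - edit distance / ground-truth length."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return 1.0 - levenshtein(ground_truth, ocr_output) / len(ground_truth)


def char_accuracy(ground_truth, ocr_output):
    """Accuracy over individual characters."""
    return accuracy(ground_truth, ocr_output)


def word_accuracy(ground_truth, ocr_output):
    """Accuracy over whitespace-separated tokens."""
    return accuracy(ground_truth.split(), ocr_output.split())


def effort_reduction(before_hours, after_hours):
    """Fraction of human effort saved by the faster method."""
    return 1.0 - after_hours / before_hours


if __name__ == "__main__":
    gt = "der heiligen leben"    # hypothetical ground-truth line
    ocr = "der heiligcn leben"   # hypothetical OCR output with one wrong character
    print(f"character accuracy: {char_accuracy(gt, ocr):.4f}")
    print(f"word accuracy:      {word_accuracy(gt, ocr):.4f}")
    print(f"effort reduction:   {effort_reduction(39, 8):.2%}")
```

A single wrong character invalidates a whole word, which is why word accuracy is typically lower than character accuracy, as in the figures reported above.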
Index Terms
- Case Study of a highly automated Layout Analysis and OCR of an incunabulum: 'Der Heiligen Leben' (1488)