note

Printed Text Image Database for Sindhi OCR

Authors:
Dil Nawaz Hakro, University of Sindh, Jamshoro, Pakistan
Abdullah Zawawi Talib, Universiti Sains Malaysia, Malaysia

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 15, Issue 4, Article No. 21, pp. 1–18. https://doi.org/10.1145/2846093

Published: 16 May 2016

Abstract

Document Image Understanding (DIU) and Electronic Document Management are active fields of research involving image understanding, interpretation, efficient handling, and routing of documents as well as their retrieval. Research on most of the noncursive scripts (Latin) has matured, whereas research on the cursive (connected) scripts is still moving toward perfection. Many researchers are currently working on the cursive scripts (Arabic and other scripts adopting it) around the world so that the difficulties and challenges in document understanding and handling of these scripts can be overcome. Sindhi script has the largest extension of the original Arabic alphabet among languages adopting the Arabic script; it contains 52 characters, compared to 28 characters in the original Arabic alphabet, in order to accommodate more sounds for the language. There are 24 differentiating characters with some possessing four dots. For Sindhi OCR research and development, a database is needed for training and testing of Sindhi text images. We have developed a large database containing over 4 billion words and 15 billion characters in 150 various fonts in four font weights and four styles. The database contents were collected from various sources including websites, books, and theses. A custom-built application was also developed to create a text image from a text document that supports various fonts and sizes. The database considers words, characters, characters with spaces, and lines. The database is freely available as a partial or full database by sending an email to one of the authors.
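The four counting levels named in the abstract (words, characters, characters with spaces, and lines) can be illustrated with a minimal Python sketch. The function name `text_stats`, the whitespace-based word splitting, the choice to skip blank lines, and the sample string are all assumptions made for illustration; the abstract does not describe how the authors computed their statistics.

```python
def text_stats(text: str) -> dict:
    """Count a text at the four levels listed in the abstract.

    Assumption: words are whitespace-separated tokens and blank lines
    are ignored; the authors' actual counting rules are not given.
    Characters are counted as Unicode code points, not glyphs.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    return {
        "lines": len(lines),
        "words": sum(len(line.split()) for line in lines),
        # Every code point on a non-blank line, spaces included.
        "characters_with_spaces": sum(len(line) for line in lines),
        # The same count with all whitespace excluded.
        "characters": sum(
            1 for line in lines for ch in line if not ch.isspace()
        ),
    }


if __name__ == "__main__":
    # Hypothetical two-line Sindhi sample, not taken from the database.
    sample = "سنڌي ٻولي\nسنڌ يونيورسٽي"
    print(text_stats(sample))
```

Counting code points rather than rendered glyphs keeps the sketch independent of any font; a real pipeline for a cursive script would likely also need to decide how joined forms and diacritics are counted.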

Index Terms

1. Applied computing
  1. Document management and text processing
    1. Document capture
      1. Document scanning
      2. Optical character recognition

Published in
ACM Transactions on Asian and Low-Resource Language Information Processing Volume 15, Issue 4
June 2016
173 pages
ISSN:2375-4699
EISSN:2375-4702
DOI:10.1145/2915955
Editor:
Richard Sproat
Google, Inc., USA
Copyright © 2016 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 16 May 2016
- Revised: 1 November 2015
- Accepted: 1 November 2015
- Received: 1 April 2015

Author Tags
Sindhi optical character recognition
Text image database
Qualifiers
- note
- Research
- Refereed
