Abstract
Document Image Understanding (DIU) and Electronic Document Management are active fields of research that cover the understanding and interpretation of document images as well as the efficient handling, routing, and retrieval of documents. Research on most noncursive scripts (e.g., Latin) has matured, whereas research on cursive (connected) scripts is still developing. Many researchers around the world are currently working on cursive scripts (Arabic and the scripts that adopt it) to overcome the difficulties and challenges of document understanding and handling for these scripts. Among the languages that have adopted the Arabic script, Sindhi has the largest extension of the original Arabic alphabet: it contains 52 characters, compared to 28 in the original Arabic alphabet, in order to accommodate more sounds of the language. There are 24 differentiating characters, some of which carry four dots. Sindhi OCR research and development requires a database of Sindhi text images for training and testing. We have developed a large database containing over 4 billion words and 15 billion characters in 150 fonts, with four font weights and four styles. The database contents were collected from various sources including websites, books, and theses. A custom-built application that supports various fonts and sizes was also developed to create text images from text documents. The database covers words, characters, characters with spaces, and lines. The database is freely available, in part or in full, on request by email to one of the authors.
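As a rough illustration of the four text units the abstract lists (lines, words, characters, and characters with spaces), the sketch below splits a text into those units before any rendering to images takes place. It is not the authors' application: the function name `split_units`, the dictionary keys, and the reading of "characters with spaces" as "every code point including whitespace" are assumptions made for this example. It also counts Unicode code points, which may differ from visual glyphs when combining marks are present.

```python
def split_units(text):
    """Split text into the four unit types named in the abstract.

    Hypothetical helper: the unit definitions here are assumptions,
    not the database's documented segmentation rules.
    """
    # Non-empty lines of the input text.
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Whitespace-separated words across all lines.
    words = [w for ln in lines for w in ln.split()]
    # Every code point within each line, spaces included (line breaks excluded).
    chars_with_spaces = [c for ln in lines for c in ln]
    # The same sequence with whitespace removed.
    chars = [c for c in chars_with_spaces if not c.isspace()]
    return {
        "lines": lines,
        "words": words,
        "characters": chars,
        "characters_with_spaces": chars_with_spaces,
    }


if __name__ == "__main__":
    # Short Sindhi sample; any Unicode text works.
    units = split_units("سنڌي ٻولي\nسنڌ")
    for name, items in units.items():
        print(name, len(items))
```

Each unit list could then be passed to a separate rendering step, once per font, weight, and style, to produce the corresponding text images.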
Index Terms
- Printed Text Image Database for Sindhi OCR