
Printed Text Image Database for Sindhi OCR

Published: 16 May 2016

Abstract

Document Image Understanding (DIU) and Electronic Document Management are active fields of research involving image understanding, interpretation, efficient handling, and routing of documents as well as their retrieval. Research on most noncursive scripts (such as Latin) has matured, whereas research on cursive (connected) scripts is still developing. Many researchers around the world are currently working on cursive scripts (Arabic and the scripts that adopt it) to overcome the difficulties and challenges of understanding and handling documents in these scripts. Sindhi script has the largest extension of the original Arabic alphabet among the languages that adopt the Arabic script: it contains 52 characters, compared to 28 in the original Arabic alphabet, to accommodate more of the language's sounds. There are 24 differentiating characters, some of which carry four dots. Sindhi OCR research and development needs a database of Sindhi text images for training and testing. We have developed a large database containing over 4 billion words and 15 billion characters in 150 fonts, four font weights, and four styles. The database contents were collected from various sources including websites, books, and theses. A custom-built application that supports various fonts and sizes was also developed to create text images from text documents. The database considers words, characters, characters with spaces, and lines. The database is freely available, in part or in full, on request by email to one of the authors.
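The abstract lists four units that the database considers (words, characters, characters with spaces, and lines) but does not say how they are counted. The following is a minimal sketch of one plausible way to compute them for a plain-text source, not the authors' method. The function name `corpus_stats`, the sample string, and the counting rules are all assumptions: words are whitespace-separated tokens, characters are Unicode code points, and blank lines are not counted as lines.

```python
def corpus_stats(text: str) -> dict[str, int]:
    """Count lines, words, and characters in a text (hypothetical helper).

    Assumed rules, not taken from the paper:
    - a line is any non-blank line of the input;
    - a word is a whitespace-separated token;
    - a character is one Unicode code point, so a combining mark
      counts separately from the letter it attaches to.
    """
    lines = text.splitlines()
    return {
        "lines": sum(1 for line in lines if line.strip()),
        "words": len(text.split()),
        # Every code point that is not whitespace.
        "characters": sum(1 for ch in text if not ch.isspace()),
        # Every code point except the line breaks themselves.
        "characters_with_spaces": sum(len(line) for line in lines),
    }


# Two short lines of Arabic-script text, chosen only for illustration.
# The letters are written as escapes so the code points are unambiguous.
SAMPLE = "\u0633\u0646\u068c\u064a \u067b\u0648\u0644\u064a\n\u0633\u0646\u068c"

if __name__ == "__main__":
    for unit, count in corpus_stats(SAMPLE).items():
        print(f"{unit}: {count}")
```

Counting code points rather than rendered glyphs keeps the sketch independent of any font. A real pipeline for a cursive script might instead count shaped glyphs or ligatures, which would give different totals.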


Published in: ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 15, Issue 4 (June 2016), 173 pages
ISSN: 2375-4699
EISSN: 2375-4702
DOI: 10.1145/2915955

Copyright © 2016 ACM

Publisher: Association for Computing Machinery, New York, NY, United States

Publication History

• Received: 1 April 2015
• Revised: 1 November 2015
• Accepted: 1 November 2015
• Published: 16 May 2016


Qualifiers: note, Research, Refereed
