CRBLP Bangla OCR: BanglaOCR / Bangla OCR / Bengali OCR : Research and Development

------------------------------------------------------------------------------------------------------------

(!) For the moment, this writing is unorganized. Its goal is to help researchers / students gather enough information before they start working on Bangla / Bengali OCR. Everything in this blog is based on personal experience, and there is no guarantee that the information is the latest.

------------------------------------------------------------------------------------------------------------

I am writing this to keep a decent, up-to-date record of the status of OCR research and development for recognizing Bangla / Bengali script. I will update this note from time to time. At the moment I am not well informed about state-of-the-art research and development, because I have not really been involved with OCR since 2009. However, I have always wanted to come back to OCR research and contribute more, so I have decided to start gathering state-of-the-art information. A good source may be those who are studying Bangla OCR and happen to come across this post. I would like to request them to send me any missing information at the following address: bangla(dot)ocr(at)gmail(dot)com.

Prepare Bangla/ Bengali Training data for Tesseract (Updated)

Recently, I uploaded the files (images, transcripts, box files, etc.) that I used to prepare training data (with Tesseract version 2) and to develop BanglaOCR. However, it appears that Tesseract already has a training file for Bangla / Bengali for version 3.

*** Tesseract (version 3) training files for Bangla / Bengali are available at the following link: https://code.google.com/p/tesseract-ocr/downloads/detail?name=tesseract-ocr-3.02.ben.tar.gz&can=2&q=#makechanges

If you are interested in the Tesseract version 2 training files (prepared by me), follow this link: https://code.google.com/p/tesseract-ocr/downloads/detail?name=tessdata.ban.tar.gz&can=2&q=#makechanges

However, people are still interested in preparing their own training data from scratch! The first requirement is a complete set of images of all possible characters and character combinations. For this purpose, during OCR research at CRBLP we prepared a file that contains the complete set of characters and their combinations. Thanks to the expert at CRBLP for verifying the file. I have uploaded the file to the BanglaOCR project page. Click the link to download it. Or, if you only want to copy certain characters and their combinations, copy them from the link.
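To give a rough idea of what "characters and their combinations" means in practice, here is a minimal Python sketch that enumerates only the simplest case: each basic consonant followed by a common dependent vowel sign. This is not the CRBLP file and not how it was built. The function names and the choice of code points are my own assumptions for illustration, and the sketch does not cover conjuncts, independent vowels, digits or punctuation, all of which a real training set needs.

```python
import unicodedata

# Assumption: the basic consonants are the assigned letters in the
# Unicode Bengali block between U+0995 (KA) and U+09B9 (HA).
def bengali_consonants():
    return [
        chr(cp)
        for cp in range(0x0995, 0x09BA)
        # Unassigned code points in this range report category "Cn".
        if unicodedata.category(chr(cp)) == "Lo"
    ]

# Assumption: a hand-picked list of common dependent vowel signs
# (AA, I, II, U, UU, vocalic R, E, AI, O, AU).
VOWEL_SIGNS = [
    chr(cp)
    for cp in (0x09BE, 0x09BF, 0x09C0, 0x09C1, 0x09C2,
               0x09C3, 0x09C7, 0x09C8, 0x09CB, 0x09CC)
]

def consonant_vowel_combinations():
    # Each combination is a consonant followed by one vowel sign.
    return [c + v for c in bengali_consonants() for v in VOWEL_SIGNS]

if __name__ == "__main__":
    combos = consonant_vowel_combinations()
    print(len(combos), "combinations, for example:", " ".join(combos[:5]))
```

A list like this can only serve as a starting point. The verified file mentioned above remains the reference for the full set.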

Once you have all the character combinations, you need to generate images from them. This is not difficult if you already have a tool available. For my research, I developed a text-to-image generator tool that served the purpose. The tool lets you render text to an image with any font size and type. Therefore, if you plan to prepare training data for Tesseract, you will have the two most important pieces of information in hand: (a) training images and (b) transcripts. You can then follow the requirements of Tesseract (see how to train Tesseract in the link) to prepare your own training data.
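As an illustration of how the images and the transcript come together, the following Python sketch writes box file lines in the format used by Tesseract 3, where each line is "symbol left bottom right top page" and the y axis is measured from the bottom of the image. It is not my text-to-image tool. The helper names, the font name "examplefont" and the glyph positions are hypothetical, and the sketch assumes you already know each glyph's bounding box in ordinary image coordinates (origin at the top left).

```python
def training_base_name(lang, font, exp=0):
    # Tesseract 3 naming convention: [lang].[fontname].exp[num]
    return f"{lang}.{font}.exp{exp}"

def box_line(glyph, left, top, right, bottom, image_height, page=0):
    # Input coordinates use a top-left origin (y grows downward).
    # Box files use a bottom-left origin, so the y values are flipped.
    box_bottom = image_height - bottom
    box_top = image_height - top
    return f"{glyph} {left} {box_bottom} {right} {box_top} {page}"

def box_file_lines(glyph_boxes, image_height, page=0):
    # glyph_boxes: iterable of (glyph, left, top, right, bottom) tuples.
    return [
        box_line(g, l, t, r, b, image_height, page)
        for (g, l, t, r, b) in glyph_boxes
    ]

if __name__ == "__main__":
    # Hypothetical layout: two glyphs on a 100-pixel-high image.
    boxes = [("ক", 10, 20, 40, 60), ("খ", 50, 20, 85, 60)]
    print(training_base_name("ben", "examplefont") + ".box")
    for line in box_file_lines(boxes, image_height=100):
        print(line)
```

In a real run the bounding boxes would come from whatever tool rendered the image, and the resulting box file would sit next to the image with the same base name.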

Perhaps in the near future I will prepare training data for Tesseract.

OCRopus: New version released!

It is great news that OCRopus version 0.7 has been released.

What I want to know:

Is there any complete Bangla / Bengali OCR after BanglaOCR?

Is there any standard dataset for benchmarking Bangla / Bengali OCR research?

Citation:

I have observed that researchers / students who use BanglaOCR for academic purposes cite a different paper (which describes my previous version of the OCR). The correct citations are the following papers:

1. Md. Abul Hasnat, Muttakinur Rahman Chowdhury and Mumit Khan, "An open source Tesseract based Optical Character Recognizer for Bangla script", In the Tenth International Conference on Document Analysis and Recognition (ICDAR'2009), Catalonia, Spain, July 26-29, 2009.
Available at: http://www.cvc.uab.es/icdar2009/papers/3725a671.pdf

2. Md. Abul Hasnat, Muttakinur Rahman Chowdhury and Mumit Khan, "Integrating Bangla script recognition support in Tesseract OCR", Proceedings of the Conference on Language and Technology 2009 (CLT09), Lahore, Pakistan, January 22-24, 2009.
Available at: http://crulp.org/clt09/download/Papers/Paper16.pdf

4 comments:

জয়ন্ত said...: Any update?; August 19, 2013 at 11:09 PM
Md. Abul Hasnat said...: Unfortunately no! I tried in the middle to do some update quickly. However, it seems several things changed in the middle.

Hopefully, there will be a surprise soon. Although its not from me.; September 20, 2013 at 6:05 AM
Unknown said...: I am trying to use your "BanglaOCR" for a book project. Your software work well for low resoulation and small amount of font load. But an entire page it gets wired.
If you are interested, then i can send you some images of my results.
thank you.; February 2, 2014 at 7:54 AM
Aniruddha Bhatacharjee said...: I like to get your product bengali OCR.; October 29, 2015 at 12:45 AM


Tuesday, April 16, 2013
