Sunday, March 31, 2013

Training files released for "Tesseract Based BanglaOCR"

I am happy to announce the release of the training images and transcripts that I used to train the Tesseract engine for BanglaOCR. I apologize for the long, unexpected delay in making these public. Since 2009 I have wanted to extend my work on BanglaOCR. However, that did not happen and seems very unlikely to happen in the near future. Therefore, I have decided to release the training data and share the experiments behind its preparation.

The training data is now available on the BanglaOCR Google Code page. I recommend reading and following the readme.txt file, which explains the data in detail.

Following my previous post, I am writing this to motivate anyone who aims to develop an OCR for the Bangla/Bengali language. The training image files may look very simple, and you may immediately think that you can add many more training files and improve the performance of the OCR. I encourage you to do so. However, I would like to share that I experimented for a long time with many complex combinations of training data and finally observed that "simpler is better". Perhaps you will have a similar experience.

One important issue at present is whether these files can be used with Tesseract version 3 or later. I have not tried them for training Tesseract 3, so I encourage you to do so. I have also provided the box files, which are the most time-consuming part of generating training data. However, as I wrote, these are for Tesseract 2, so you will need to modify them to train the Bangla language for Tesseract 3.
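A minimal sketch of one such modification, assuming the difference is that a Tesseract 2 box line has five fields (character, left, bottom, right, top) while a Tesseract 3 box line has a sixth field for the zero-based page number. The function names below are hypothetical and the script has not been run against the released files, so please check the result against the Tesseract 3 training documentation before relying on it.

```python
from pathlib import Path


def convert_box_line(line: str, page: int = 0) -> str:
    """Append a page number to a 5-field box line (assumed Tesseract 2 style).

    Lines that already have 6 fields are returned unchanged.
    """
    tokens = line.split()
    if len(tokens) == 6:
        return " ".join(tokens)  # already has a page field
    if len(tokens) != 5:
        raise ValueError(f"unexpected box line: {line!r}")
    # The four fields after the character must be integer coordinates.
    for coord in tokens[1:]:
        int(coord)
    return " ".join(tokens + [str(page)])


def convert_box_file(src: str, dst: str, page: int = 0) -> None:
    """Convert every non-empty line of a box file and write the result to dst."""
    text = Path(src).read_text(encoding="utf-8")
    converted = [convert_box_line(l, page) for l in text.splitlines() if l.strip()]
    Path(dst).write_text("\n".join(converted) + "\n", encoding="utf-8")
```

For example, `convert_box_file("old.box", "new.box")` would rewrite a single-page box file, where both file names are placeholders. Tesseract 3 may also expect a different naming scheme for training files, so renaming may be needed in addition to this change.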


Once you have training data for Tesseract 3, you will be able to plug it into BanglaOCR and see the complete OCR output on your test images.

Acknowledgement:

If you use BanglaOCR and its associated files for academic research, please cite the following papers:

a. Md. Abul Hasnat, Muttakinur Rahman Chowdhury and Mumit Khan, "An open source Tesseract based Optical Character Recognizer for Bangla script", In Tenth International Conference on Document Analysis and Recognition (ICDAR'2009), Catalonia, Spain, July 26-29, 2009.

b. Md. Abul Hasnat, Muttakinur Rahman Chowdhury and Mumit Khan, "Integrating Bangla script recognition support in Tesseract OCR", Proceedings of the Conference on Language and Technology 2009 (CLT09), Lahore, Pakistan, January 22-24, 2009.