Wednesday, August 20, 2008

Status of Bangla character segmentation

This post is for those who are interested in the current status of our Bangla character segmentation. Here is one example:

Figure 1: Input image

Figure 2: Segmented color image

Some modifications are still needed.

Monday, August 4, 2008

Why does Tesseract need to be trained on all possible combinations of characters/units to recognize Bangla and Devanagari scripts?


The last step of preprocessing is segmentation. Information about the segmented units is passed to the classifier for recognition. Tesseract is trained very effectively to recognize images of English text. To integrate support for recognizing Indic scripts (Bangla and Devanagari), we would have to modify several algorithms. Right now Tesseract does not allow us to modify its internal algorithms, so we need to prepare the test image in a form that is suitable for Tesseract to process and recognize. Here I am writing my personal observations after finishing my initial targeted experiment on recognizing Bangla script with Tesseract.

Experiments and Observations:

We experimented with the training data to decide how many data units are necessary. For this purpose we considered the following two approaches:
  1. Train dependent modifiers separately from the basic units.

  2. Train dependent modifiers combined with the basic units.

By basic units we mean all the independent vowels, consonants, and compound characters (combined consonants). In a word image the modifiers (vowel and consonant) are placed around the core characters and are often connected to the basic characters in a few characteristic ways. In most cases the images of the basic character and the modifier overlap to some extent. As a result it is impossible to draw a horizontal or vertical boundary line between them. This is completely different from English character segmentation. Our segmentation algorithm can locate the region of the modifiers, but it does so without placing any boundary line, and hence it is impossible to generate separate bounding boxes for the basic characters and the modifiers. An example of characters with modifiers and their breakdown is shown in figure-1.

Figure-1: Characters with modifiers and their break down

Given the performance of our segmenter, in approach 1 we trained Tesseract with the basic units and the modifiers separately, to observe whether it can recognize both the modifier and the basic character when there is some overlap between them. We observed that during testing the segmenter is able to disconnect the modifier and the basic character from each other. If the segmenter succeeds in separating the modifier (for example ৈ and ্য, among others) from the basic character using a vertical column, then Tesseract recognizes them successfully. However, for other modifiers (for example ি and ্র, among others) the segmenter cannot separate them using any vertical column. In such cases Tesseract failed to recognize them because it is unable to identify separate bounding boxes for the modifier and the basic unit.

During recognition of a line image, Tesseract extracts the bounding box of each character image. When the modifier and the basic unit overlap, it identifies a single box that includes both the modifier and the basic character image. Since we have no trained unit that includes both the modifier and the basic unit, Tesseract will definitely fail in such a case. An example of such an image is shown in figure-2.

Figure-2: Example of test case image
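The overlap problem described above can be sketched in a few lines of C++. This is only a simplified model, not Tesseract's or our segmenter's actual code: the ColumnSpan type and both function names are invented for illustration, and each glyph part is reduced to the range of pixel columns it occupies. Two parts can be split by a vertical column only if their column ranges do not intersect; parts that do intersect end up in one merged box.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical type: the pixel columns (inclusive) occupied by one glyph part.
struct ColumnSpan {
    int left;
    int right;
};

// True if a vertical line fits between the two parts,
// i.e. their column ranges do not intersect.
bool separableByVerticalCut(const ColumnSpan& a, const ColumnSpan& b) {
    assert(a.left <= a.right && b.left <= b.right);
    return a.right < b.left || b.right < a.left;
}

// Simplified model of box extraction: parts whose column ranges
// overlap are merged into a single unit.
std::vector<ColumnSpan> mergeOverlapping(std::vector<ColumnSpan> parts) {
    std::sort(parts.begin(), parts.end(),
              [](const ColumnSpan& a, const ColumnSpan& b) {
                  return a.left < b.left;
              });
    std::vector<ColumnSpan> units;
    for (const ColumnSpan& p : parts) {
        if (!units.empty() && p.left <= units.back().right) {
            // Overlaps the previous unit: grow that unit instead of adding one.
            units.back().right = std::max(units.back().right, p.right);
        } else {
            units.push_back(p);
        }
    }
    return units;
}
```

In this model a modifier that sits partly above or below its basic character shares columns with it, so the pair always comes out as one unit, which is the case that needs a combined training sample.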

So, considering the performance of approach 1, it is clear that for Indic scripts with similar properties it is necessary to train all possible combinations of the sensitive modifiers (for example ি and ্র, among others) and the basic characters. The final training data set (approach 2) will therefore consist of the following:

  • All vowels, consonants and numerals

  • Consonants + vowel modifiers

  • Consonants + consonant modifiers

  • Combined consonants (compound character)

  • Compound characters + vowel modifiers

  • Compound characters + consonant modifiers

Following approach 2, the total number of training data units will be around 3200.
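To see why the number of units grows so quickly, here is a minimal sketch of how combined units multiply. The function name combineUnits is made up for this post, and the strings are assumed to be UTF-8; it simply appends every modifier to every base character, so the count is the product of the two list sizes.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Builds one training unit per (base, modifier) pair by appending the
// modifier's code points to the base character. Strings are UTF-8.
std::vector<std::string> combineUnits(const std::vector<std::string>& bases,
                                      const std::vector<std::string>& modifiers) {
    std::vector<std::string> units;
    units.reserve(bases.size() * modifiers.size());
    for (const std::string& base : bases) {
        for (const std::string& mod : modifiers) {
            units.push_back(base + mod);
        }
    }
    // The number of units is always the product of the two list sizes.
    assert(units.size() == bases.size() * modifiers.size());
    return units;
}
```

For a tiny sample of two consonants (ক, খ) and two modifiers (ি, ্র) this gives four units: কি, ক্র, খি, খ্র. The real lists are of course much longer than this sample.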

How to add a method in OCROPUS-0.2

I added a method to OCRopus 0.2.
The method name is: crblpRecognizeLine(....)

I did it in the following steps:

1. Added my method body in
struct NewGroupingLineOCR : IRecognizeLine {..};

2. Added an empty stub body in
struct TesseractRecognizeLine : IRecognizeLine {..};

like this: void crblpRecognizeLine(...){}

3. Added an empty stub body in
struct GLineRec : IRecognizeLine {..};

like this: void crblpRecognizeLine(...){}

4. Added a prototype in file ocrinterfaces.h in
struct IRecognizeLine : IComponent {..};

like this: virtual void crblpRecognizeLine(...)=0;

5. Added a prototype in file ocr.pkg in
struct IRecognizeLine : IComponent {..};

like this: virtual void crblpRecognizeLine(...)=0;

Now compile.
You can now call it from C++ or Lua :-)
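The reason for the empty stubs in steps 2 and 3 is plain C++: once the interface declares a pure virtual method, every concrete struct derived from it must define that method, or it can no longer be instantiated. Below is a stripped-down sketch of that pattern. The structs only borrow the names from the steps above and are not the real OCRopus definitions; the parameter list (a single output string) and the helper recognizeWith are invented for illustration, since the real signature is not shown here.

```cpp
#include <cassert>
#include <string>

// Stand-in for the real base interface.
struct IComponent {
    virtual ~IComponent() {}
};

struct IRecognizeLine : IComponent {
    // Hypothetical signature: the real parameter list is not shown in this post.
    virtual void crblpRecognizeLine(std::string& result) = 0;
};

// The struct that carries the real method body (step 1).
struct NewGroupingLineOCR : IRecognizeLine {
    void crblpRecognizeLine(std::string& result) {
        result = "recognized by NewGroupingLineOCR";
    }
};

// Empty stub (steps 2 and 3): without it this struct would stay abstract
// and any code that creates a TesseractRecognizeLine would stop compiling.
struct TesseractRecognizeLine : IRecognizeLine {
    void crblpRecognizeLine(std::string&) {}
};

// Hypothetical helper: calls the new method through the interface pointer.
void recognizeWith(IRecognizeLine* recognizer, std::string& result) {
    assert(recognizer != nullptr);
    recognizer->crblpRecognizeLine(result);
}
```

Calling recognizeWith on a NewGroupingLineOCR fills the string, while calling it on the stubbed struct compiles and runs but leaves the string untouched.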

Saturday, August 2, 2008

How to install ocropus-0.2 (for newbies)

This guide is based on Ubuntu 8.04 LTS.

Download ocropus-0.2, tesseract (use svn) and openfst.

Open a terminal (Applications -> Accessories -> Terminal).
Make sure you are connected to the internet.
Then enter the following commands:

$ sudo apt-get install libpng12-0
$ sudo apt-get install libpng12-dev
$ sudo apt-get install libtiff4
$ sudo apt-get install libtiff4-dev
$ sudo apt-get install libjpeg62
$ sudo apt-get install libjpeg62-dev
$ sudo apt-get install ftjam
$ sudo apt-get install zlib1g-dev
$ sudo apt-get install libleptonica
$ sudo apt-get install libleptonica-dev
$ sudo apt-get install libedit-dev
$ sudo apt-get install aspell-en
$ sudo apt-get install libaspell-dev

To build and install tesseract, go to the tesseract root dir.
The commands are:

$ ./configure
$ make
$ sudo make install

To build and install openfst, go to the openfst root dir:

$ cd fst
$ make all
$ sudo mkdir -p /usr/local/include/fst/lib
$ sudo cp -v fst/lib/*.h /usr/local/include/fst/lib
$ sudo cp fst/lib/*.a /usr/local/lib

Make sure that all the *.h files copied above have the "read-only" permission for the "others" category.
Make sure
/usr/local/lib/libfst.a has the same permission.

To do this, run these commands:
$ sudo chmod -R 755
$ sudo chmod 755

Go to the ocropus root dir, then run these commands:
$ ./configure --without-SDL
$ sudo jam install