Wednesday, August 20, 2008
Status of Bangla character segmentation
Thursday, August 14, 2008
Monday, August 4, 2008
Why does tesseract need to be trained on all possible combinations of characters/units to recognize Bangla and Devanagari scripts?
Background:
Experiments and Observations:
We experimented with the training data to decide how many data units are necessary. For this experiment we considered the following two approaches:
Train dependent modifiers separately from the basic units.
Train dependent modifiers combined with the basic units.
By basic units we mean all the independent vowels, consonants, and compound characters (combined consonants). In a word image the modifiers (vowel and consonant) are placed around the core characters and are often connected to the basic characters in a few characteristic ways. In most cases the images of the basic character and the modifier overlap to some extent. As a result it is impossible to draw a horizontal or vertical boundary line between them. This is completely different from English character segmentation. Our segmentation algorithm can locate the region of the modifiers, but it does so without placing any boundary, and hence it is impossible to generate separate bounding boxes for the basic characters and the modifiers. An example of characters with modifiers and their breakdown is shown in figure-1.
Figure-1: Characters with modifiers and their break down
Given the performance of our segmenter, in experimental approach 1 we trained tesseract with the basic units and the modifiers separately, to observe whether it can recognize both the modifier and the basic character when there is a certain amount of overlap between them. We observed that during testing the segmenter is capable of disconnecting the modifier and the basic character from each other. If the segmenter succeeds in separating the modifier (example: া, ে, ৈ, ্য) and the basic character with a vertical column, then tesseract recognizes them successfully. However, for the other modifiers (example: ি, ী, ু, ূ, ৃ, ৌ, ঁ and ্র) the segmenter cannot separate them with any vertical column. In such cases tesseract failed to recognize them because it is unable to identify separate bounding boxes for the modifier and the basic unit.
During recognition of a line image, tesseract extracts the bounding box of each character image. When the modifier and the basic unit overlap, it identifies a single box that includes both the modifier and the basic character image. Since we have no trained unit that includes both the modifier and the basic unit, tesseract will definitely fail in such a case. An example of such an image is shown in figure-2.
Figure-2: Example of test case image
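To make the overlap problem concrete, here is a minimal C++ sketch. It is not tesseract's or our segmenter's actual code; the `Box` struct and both function names are hypothetical and chosen only for illustration. The idea is that two glyph boxes can be split by a vertical column only when their horizontal extents do not overlap. Otherwise the only box that can be extracted is the union of the two, which matches nothing if the units were trained separately.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical axis-aligned bounding box in pixel coordinates
// (all four edges are inclusive).
struct Box {
    int left, top, right, bottom;
};

// True if the horizontal extents do not overlap, so a vertical cut
// between the two boxes exists.
bool separableByColumn(const Box &a, const Box &b) {
    return a.right < b.left || b.right < a.left;
}

// Smallest box that contains both inputs. This is what gets extracted
// when the modifier and the basic character cannot be cut apart.
Box unionBox(const Box &a, const Box &b) {
    return Box{std::min(a.left, b.left), std::min(a.top, b.top),
               std::max(a.right, b.right), std::max(a.bottom, b.bottom)};
}
```

With made-up coordinates, a modifier sitting entirely to the right of the basic character (like া) is separable, while one whose stroke reaches over the basic character (like ি) is not, and only the merged box is available for recognition.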
So, considering the performance of approach 1, it is clear that for Indic scripts with similar properties it is necessary to train all possible combinations of the sensitive modifiers (example: ি, ী, ু, ূ, ৃ, ৌ, ঁ and ্র) and basic characters. So, the final training data set (approach-2) will consist of the following:
All vowels, consonants and numerals
Consonants + vowel modifiers
Consonants + consonant modifiers
Combined consonants (compound character)
Compound characters + vowel modifiers
Compound characters + consonant modifiers
Following approach-2, the total number of training data units will be around 3200.
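As a rough illustration of how the combined units in approach-2 can be enumerated, the C++ sketch below pairs each basic unit with each modifier to produce the strings that would each need a training sample. The function name `combineUnits` is an assumption made for this example, and it is not the tool we used to build the training set. It also assumes the source file is saved as UTF-8.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Append every modifier to every basic unit. In Unicode, Bangla dependent
// signs are stored after the base character, even when drawn to its left.
std::vector<std::string> combineUnits(const std::vector<std::string> &bases,
                                      const std::vector<std::string> &modifiers) {
    std::vector<std::string> units;
    units.reserve(bases.size() * modifiers.size());
    for (const std::string &base : bases) {
        for (const std::string &mod : modifiers) {
            units.push_back(base + mod);
        }
    }
    return units;
}
```

For example, calling it with the bases ক and খ and the modifiers ি, ু and ্র gives six combined units. In general n basic units and m modifiers yield n × m combined units, which is why the count grows so quickly once compound characters are included.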
How to add a method in OCROPUS-0.2
The method name is: crblpRecognizeLine(....)
I did it in the following steps:
1. added my method body in the file grouping.cc in
struct NewGroupingLineOCR : IRecognizeLine {..};
2. added a fake body in file tesseract.cc in
struct TesseractRecognizeLine : IRecognizeLine {..};
as this: void crblpRecognizeLine(...){}
3. added a fake body in file glinerec.cc in
struct GLineRec : IRecognizeLine {..};
as this: void crblpRecognizeLine(...){}
4. added a prototype in file ocrinterfaces.h in
struct IRecognizeLine : IComponent {..};
as this: virtual void crblpRecognizeLine(...)=0;
5. added a prototype in file ocr.pkg in
struct IRecognizeLine : IComponent {..};
as this: virtual void crblpRecognizeLine(...)=0;
now compile
you can now call it from cpp or lua :-)
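Why steps 2 and 3 are needed: once a pure virtual method is added to the interface, every struct that implements the interface must define that method, otherwise it becomes abstract and can no longer be instantiated. Below is a simplified, standalone C++ sketch of that pattern. It reuses the struct and method names from the steps above but it is not the real OCRopus code. In particular the `std::string &result` parameter is a hypothetical placeholder, since the actual argument list is not shown in this post.

```cpp
#include <cassert>
#include <string>

// Simplified stand-ins for the OCRopus types; the real declarations differ.
struct IComponent {
    virtual ~IComponent() {}
};

struct IRecognizeLine : IComponent {
    // Step 4: the new pure virtual method (placeholder parameter).
    virtual void crblpRecognizeLine(std::string &result) = 0;
};

struct NewGroupingLineOCR : IRecognizeLine {
    // Step 1: the real body lives here.
    void crblpRecognizeLine(std::string &result) override {
        result = "recognized by NewGroupingLineOCR";
    }
};

struct TesseractRecognizeLine : IRecognizeLine {
    // Step 2: empty stub so this struct stays instantiable.
    void crblpRecognizeLine(std::string &) override {}
};

struct GLineRec : IRecognizeLine {
    // Step 3: empty stub, same reason.
    void crblpRecognizeLine(std::string &) override {}
};
```

Calling the method through an `IRecognizeLine` pointer then dispatches to whichever implementation the object provides, and the stubs simply do nothing.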
Saturday, August 2, 2008
how to install ocropus-0.2 (for newbies)
step-1:
download ocropus-0.2, tesseract (use svn) and openfst.
step-2:
open terminal (Applications->accessories->terminal)
make sure you are connected to internet.
then enter the following commands -
$ sudo apt-get install libpng12
$ sudo apt-get install libpng12-dev
$ sudo apt-get install libtiff4
$ sudo apt-get install libtiff4-dev
$ sudo apt-get install libjpeg62
$ sudo apt-get install libjpeg62-dev
$ sudo apt-get install ftjam
$ sudo apt-get install zlib
$ sudo apt-get install libleptonica
$ sudo apt-get install libleptonica-dev
$ sudo apt-get install libedit-dev
$ sudo apt-get install aspell-en
$ sudo apt-get install libaspell-dev
step-3:
to build and install tesseract go to the tesseract root dir
commands are:
$ ./configure
$ make
$ sudo make install
step-4:
to build and install openfst go to the openfst root dir
$ cd fst
$ make all
$ sudo mkdir -p /usr/local/include/fst/lib
$ sudo cp -v fst/lib/*.h /usr/local/include/fst/lib
$ sudo cp fst/lib/*.a /usr/local/lib
make sure that all the *.h files in /usr/local/include/fst/lib
are readable by the "others" category
make sure /usr/local/lib/libfst.a has the same permission
to do this run these commands:
$ sudo chmod -R 755 /usr/local/include/fst/lib
$ sudo chmod 755 /usr/local/lib/libfst.a
step-5:
go to the ocropus root dir then run these commands
$ ./configure --without-SDL
$ jam
$ sudo jam install
!!DONE!!