Monday, August 4, 2008

Why does Tesseract need to be trained on all possible combinations of characters/units to recognize Bangla and Devanagari scripts?


The last step of preprocessing is segmentation. Information about the segmented units is passed to the classifier for recognition. Tesseract is trained very efficiently to recognize English text images, but to add support for recognizing Indic scripts (Bangla and Devanagari) we would have to modify several of its internal algorithms, and right now Tesseract does not allow us to modify them. So we need to prepare the test image in a form that Tesseract can process and recognize. Here I am writing down my personal observations after finishing my primary targeted experiment with Tesseract on Bangla script.

Experiments and Observations:

We performed our experiment with the training data to decide on the necessary number of data units. For this purpose we considered the following two approaches:
  1. Train dependent modifiers separately from the basic units.

  2. Train dependent modifiers combined with the basic units.

By basic units we mean all the independent vowels, consonants and compound characters (combined consonants). In a word image the modifiers (vowel and consonant) are placed around the core characters and are often connected to the basic characters in a few characteristic ways. In most cases the images of the basic character and the modifier overlap by a certain amount. As a result it is impossible to draw a horizontal or vertical boundary line between them. This scenario is completely different from English character segmentation. Our segmentation algorithm can locate the region of a modifier without placing any boundary, and hence it cannot generate separate bounding boxes for the basic character and the modifier. An example of characters with modifiers and their breakdown is shown in figure-1.

Figure-1: Characters with modifiers and their break down

Based on the performance of our segmenter, in experimental approach 1 we trained Tesseract with the basic units and the modifiers separately, to observe whether it could recognize both the modifier and the basic character when a certain amount of overlap is present between them. We observed during testing that the segmenter is capable of disconnecting the modifier from the basic character. If the segmenter succeeds in separating the modifier (example: , , ৈ ্য) from the basic character along a vertical column, then Tesseract recognizes them successfully. However, for the other modifiers (example: ি, , , , , , and ্র) it is not possible for the segmenter to separate them along any vertical column. In such cases Tesseract fails to recognize them because it cannot identify separate bounding boxes for the modifier and the basic unit.
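The vertical-column test described above can be sketched in a few lines. This is only an illustration of the idea, not the segmenter's actual code, and the box coordinates are made-up values:

```python
# Illustrative sketch: two glyph bounding boxes can be split by a vertical
# column only if their horizontal projections do not overlap.
# Box format: (left, top, right, bottom), with x increasing rightwards.

def vertically_separable(box_a, box_b):
    """Return True if a vertical line can pass between the two boxes."""
    left_a, _, right_a, _ = box_a
    left_b, _, right_b, _ = box_b
    # Separable when one box ends (on the x-axis) before the other begins.
    return right_a <= left_b or right_b <= left_a

# A modifier standing clear to the right of the base character:
print(vertically_separable((10, 0, 40, 50), (42, 5, 55, 45)))  # True

# A modifier such as ি that wraps over the base character overlaps it
# horizontally, so no separating column exists:
print(vertically_separable((10, 0, 40, 50), (5, 0, 25, 20)))   # False
```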

During recognition of a line image, Tesseract extracts the bounding box of each character image. When the modifier and the basic unit overlap, it identifies a single box that includes both the modifier and the basic character image. Since we have no trained unit that includes both the modifier and the basic unit, Tesseract will certainly fail in such cases. An example of such an image is shown in figure-2.

Figure-2: Example of test case image
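The merged box that results from an overlap can be illustrated as the smallest rectangle enclosing both component boxes. Again, this is a sketch of the effect the post describes, not Tesseract's code, and the coordinates are hypothetical:

```python
# Illustrative sketch: when a modifier overlaps its base character
# horizontally, a box extractor ends up with one box covering both,
# which matches no separately trained unit.
# Box format: (left, top, right, bottom).

def merge_boxes(box_a, box_b):
    """Return the smallest box enclosing both input boxes."""
    return (min(box_a[0], box_b[0]),
            min(box_a[1], box_b[1]),
            max(box_a[2], box_b[2]),
            max(box_a[3], box_b[3]))

base = (10, 0, 40, 50)      # hypothetical base-consonant box
modifier = (5, 0, 25, 20)   # hypothetical overlapping modifier box
print(merge_boxes(base, modifier))  # (5, 0, 40, 50)
```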

So, considering the performance of approach 1, it is clear that for Indic scripts with similar properties it is necessary to train all possible combinations of the sensitive modifiers (example: ি, , , , , , and ্র) and basic characters. The final training data set (approach-2) will therefore consist of the following:

  • All vowels, consonants and numerals

  • Consonants + vowel modifiers

  • Consonants + consonant modifiers

  • Combined consonants (compound character)

  • Compound characters + vowel modifiers

  • Compound characters + consonant modifiers

Following approach-2, the total number of training data units will be around 3200.
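A back-of-the-envelope count shows how the list above reaches the low thousands. The inventory sizes below are illustrative assumptions, not the exact figures used in the experiment; with plausible Bangla counts the sum lands near the reported figure:

```python
# Rough count of approach-2 training units.
# All inventory sizes are assumed, illustrative values.
vowels              = 11    # independent vowels
consonants          = 39    # basic consonants
numerals            = 10
compounds           = 200   # compound (conjunct) consonants in common use
vowel_modifiers     = 10    # dependent vowel signs
consonant_modifiers = 2     # e.g. ্য, ্র

basic = vowels + consonants + numerals
consonant_combos = consonants * (vowel_modifiers + consonant_modifiers)
compound_combos = compounds * (vowel_modifiers + consonant_modifiers)

total = basic + compounds + consonant_combos + compound_combos
print(total)  # 3128 with these assumed inventories, close to the ~3200 above
```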

1 comment:

Unknown said...

After studying some of the Indic scripts in depth recently, I have to agree that the best approach, with Tesseract at least, is to train it on all the combinations of consonants and vowels that make up syllable or grapheme units. With some scripts, such as Kannada, this means possibly as many as 20,000 combinations, but in practice the number used in the language is only 3,000 to 4,000.