Experiments and Observations:
We experimented with the training data to decide how many training data units are necessary. For this purpose we considered the following two approaches:
1. Train dependent modifiers separately from the basic units.
2. Train dependent modifiers combined with the basic units.
By basic units we mean all the independent vowels, consonants, and compound characters (combined consonants). In a word image the modifiers (vowel and consonant) are placed around the core characters and are often connected to the basic characters according to a few characteristic patterns. In most cases the images of the basic character and the modifier overlap to some extent. As a result it is impossible to draw a horizontal or vertical boundary line between them. This scenario is completely different from English character segmentation. Our segmentation algorithm can locate the region of a modifier, but it does not place a boundary between the modifier and the basic character, so it cannot generate separate bounding boxes for the basic characters and the modifiers. An example of characters with modifiers and their breakdown is shown in Figure-1.
Figure-1: Characters with modifiers and their breakdown
Given the performance of our segmenter, in experimental approach 1 we trained Tesseract with the basic units and the modifiers separately, to observe whether it can recognize both the modifier and the basic character when there is some overlap between them. We observed that during testing the segmenter is able to disconnect the modifier from the basic character. If the segmenter succeeds in separating the modifier (for example: া, ে, ৈ, ্য) from the basic character with a vertical column, Tesseract recognizes them successfully. However, for other modifiers (for example: ি, ী, ু, ূ, ৃ, ৌ, ঁ and ্র) the segmenter cannot separate them with any vertical column. In such cases Tesseract failed to recognize them because it could not identify separate bounding boxes for the modifier and the basic unit.
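The vertical-column condition described above can be illustrated with a minimal sketch. It is not the segmenter used in our experiments: the function name `can_split_vertically` and the toy pixel sets are hypothetical, and each component is assumed to be given as a set of (x, y) ink pixel coordinates.

```python
def can_split_vertically(base_pixels, modifier_pixels):
    """Return True if a vertical line can be placed between the two components.

    Each argument is a set of (x, y) ink pixel coordinates. A vertical split
    exists only when one component ends (in x) before the other begins.
    """
    base_xs = [x for x, _ in base_pixels]
    mod_xs = [x for x, _ in modifier_pixels]
    return max(base_xs) < min(mod_xs) or max(mod_xs) < min(base_xs)


# Toy data (assumed for illustration): a 5x5 base character block.
base = {(x, y) for x in range(0, 5) for y in range(0, 5)}
# A modifier standing to the right of the base, as with া.
right_mod = {(x, y) for x in range(6, 8) for y in range(0, 5)}
# A modifier hanging below the base, as with ু.
below_mod = {(x, y) for x in range(1, 4) for y in range(5, 7)}

print(can_split_vertically(base, right_mod))  # True
print(can_split_vertically(base, below_mod))  # False
```

In the second case the modifier shares its horizontal extent with the base character, so no vertical column can separate the two.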
During recognition of a line image, Tesseract extracts the bounding box of each character image. When the modifier and the basic unit overlap, it identifies a single box that includes both the modifier and the basic character image. Since we have no trained unit that includes both the modifier and the basic unit, Tesseract fails in such cases. An example of such an image is shown in Figure-2.
Figure-2: Example of test case image
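The effect on bounding boxes can be sketched as follows. This is an illustration of the behaviour described above, not Tesseract's actual box extraction: the function `merge_overlapping_boxes` and the box coordinates are assumptions, with each box given as (left, top, right, bottom) and the right and bottom edges exclusive.

```python
def merge_overlapping_boxes(boxes):
    """Merge boxes whose horizontal extents overlap into one enclosing box."""
    merged = []
    for box in sorted(boxes):  # sorted by left edge first
        if merged and box[0] < merged[-1][2]:
            # Horizontal overlap with the previous box: grow it to cover both.
            left, top, right, bottom = merged[-1]
            merged[-1] = (left, min(top, box[1]), max(right, box[2]), max(bottom, box[3]))
        else:
            merged.append(box)
    return merged


# Hypothetical boxes: a base character, a modifier below it, and a modifier to its right.
boxes = [(0, 0, 5, 5), (1, 5, 4, 7), (6, 0, 8, 5)]
print(merge_overlapping_boxes(boxes))  # [(0, 0, 5, 7), (6, 0, 8, 5)]
```

The base character and the modifier below it end up in one box, which matches no unit if modifiers and basic units were trained separately; the modifier to the right keeps its own box.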
Considering the performance of approach 1, it is clear that for Indic scripts with similar properties it is necessary to train all possible combinations of the sensitive modifiers (for example: ি, ী, ু, ূ, ৃ, ৌ, ঁ and ্র) and basic characters. The final training data set (approach-2) therefore consists of the following:
All vowels, consonants and numerals
Consonants + vowel modifiers
Consonants + consonant modifiers
Combined consonants (compound character)
Compound characters + vowel modifiers
Compound characters + consonant modifiers
Following approach-2, the total number of training data units will be around 3200.
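The composition of such a data set can be illustrated with a small sketch. It is not the procedure used to build the actual training set: the function name `build_units`, the tiny sample inventories, and the choice to combine every consonant or compound character with every sensitive modifier are assumptions made for the example, and the sketch does not attempt to reproduce the count of about 3200 units.

```python
from itertools import product

# Modifiers that the text identifies as not separable by a vertical column.
SENSITIVE_MODIFIERS = ["ি", "ী", "ু", "ূ", "ৃ", "ৌ", "ঁ", "্র"]


def build_units(standalone, combinable, modifiers=SENSITIVE_MODIFIERS):
    """List the training units: all basic units, then base + modifier pairs.

    standalone: units trained only on their own (vowels, numerals).
    combinable: units also trained with each modifier (consonants, compounds).
    """
    units = list(standalone) + list(combinable)
    units += [base + mod for base, mod in product(combinable, modifiers)]
    return units


# Hypothetical sample inventories, far smaller than the real script.
sample_standalone = ["অ", "আ", "০"]
sample_combinable = ["ক", "খ", "ক্ষ"]

units = build_units(sample_standalone, sample_combinable)
print(len(units))  # 30: 3 standalone + 3 combinable + 3 * 8 combinations
```

The unit count grows with the product of the number of combinable characters and the number of modifiers, which is why training combined units yields a much larger set than training modifiers and basic units separately.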