Wednesday, July 30, 2008
Best parameters for bpnet line training for Bangla script
epochs - 200
learningrate - 0.2
testportion - 0
normalize - 1
shuffle - 1
Testing performance of ocropus-tesseract for Bangla Script
1st Step: Decide the training data set
To prepare the training data for Bangla characters I considered the following:
- Basic character units (vowels and consonants), numerals, and symbols
- Consonant + vowel modifier
- Consonant + consonant modifier
- Combined consonants (compound characters)
- Compound character + vowel modifier
- Compound character + consonant modifier
The total number of training units is 3200.
2nd Step: Prepare training data images
Next we prepare box files (*.box), training files (*.tr), clustered files (*.inttemp, *.normproto and *.pffmtable) and the character set file (*.unicharset) following the appropriate instructions.
An example of the error report is as follows:
hasnat@hasnat-desktop:~$ tesseract 13.tif junk nobatch box.train
Tesseract Open Source OCR Engine
APPLY_BOXES: boxfile 2/1/সুঁ ((19,1388),(67,1468)): FAILURE! box overlaps no blobs or blobs in multiple rows
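A failure like this often traces back to a malformed box entry. As a quick first check, boxes with inverted or zero-area coordinates can be flagged with awk; the box-file format here (one entry per line: character, left, bottom, right, top) follows the Tesseract 2.x convention, and the file name and contents below are made up for illustration:

```shell
# Made-up sample box file; line 2 has right < left, so it can never
# overlap a blob correctly.
printf 'ক 19 1388 67 1468\nখ 70 1468 60 1500\n' > sample.box

# Flag any box whose right edge <= left edge or top <= bottom.
awk '($4 <= $2 || $5 <= $3) { print "suspect box, line " NR ": " $0 }' sample.box
```

This does not catch every cause of the APPLY_BOXES failure (a box can be well-formed yet still miss the blobs), but it weeds out the obvious editing mistakes before rerunning the training.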
4th Step: Prepare language specific data files
We followed these four steps to prepare the complete training data.
Testing:
Figure 4: Segmentation result
My feedback:
Next TODO:
I plan to do the following tasks to improve the efficiency:
1. Train each unit with at least 10 samples. (Right now, the number of samples per unit is 1.)
2. Train each unit with variations.
3. Collect training units from the testing documents, which is a lengthy process.
Saturday, July 19, 2008
Training data for ocropus-tesseract
We are testing the recognition performance with the trained data units for Bangla script and continue adding more data units to enhance the recognition accuracy. The Devanagari training data is useful for testing basic character recognition only. It should be helpful to anyone who is just starting to recognize Devanagari script using ocropus-tesseract.
The training data is freely available to download. Anyone can download these from the following links:
Download Training data for Bangla
Download Training data for Devanagari
Sunday, July 13, 2008
How to test Bangla and Devanagari script using OCROPUS(tesseract)
Introduction
This document provides the step-by-step instructions that we followed to test printed documents in Bangla and Devanagari script.
Required files
To test Bangla or Devanagari scripts (lang = ban/dev), we use two files in the ocropus-0.2/ocroscript subdirectory. These files are:
- ocropus-0.2/ocroscript/rec-ltess.lua
- ocropus-0.2/ocroscript/rec-tess.lua
Of these two Lua files, rec-ltess.lua is used mainly to observe the performance of the ocropus layout analysis, and rec-tess.lua is used to observe the performance of character recognition.
Usage of the lua files for testing
rec-ltess.lua
if #arg == 0 then
    arg = { "../../data/pages/alice_1.png" }
end
pages = Pages()
pages:parseSpec(arg[1])
segmenter = make_SegmentPageByRAST()
page_image = bytearray()
page_segmentation = intarray()
line_image = bytearray()
bboxes = rectanglearray()
costs = floatarray()
tesseract_recognizer = make_TesseractRecognizeLine()
tesseract.init("ban")
while pages:nextPage() do
    pages:getBinary(page_image)
    segmenter:segment(page_segmentation,page_image)
    regions = RegionExtractor()
    regions:setPageLines(page_segmentation)
    for i = 1,regions:length()-1 do
        regions:extract(line_image,page_image,i,1)
        fst = make_StandardFst()
        tesseract_recognizer:recognizeLine(fst,line_image)
        result = nustring()
        fst:bestpath(result)
        print(result:utf8())
    end
end
rec-tess.lua
require 'lib.util'
require 'lib.headings'
require 'lib.paragraphs'
if not tesseract then
    print "Compiled without Tesseract support, can't continue."
    os.exit(1)
end
opt,arg = getopt(arg)
if #arg == 0 then
    print "Usage: ocroscript rec-tess [--tesslanguage=...] input.png ... >output.hocr"
    os.exit(1)
end
set_version_string(hardcoded_version_string())
segmenter = make_SegmentPageByRAST()
page_image = bytearray()
page_segmentation = intarray()
function convert_RecognizedPage_to_PageNode(p)
    page = PageNode()
    page.width = p:width()
    page.height = p:height()
    page.description = p:description()
    for i = 0, p:linesCount() - 1 do
        local bbox = p:bbox(i)
        local text = nustring()
        p:text(text, i)
        page:append(LineNode(bbox, text))
    end
    return page
end
document = DocumentNode()
for i = 1, #arg do
    pages = Pages()
    pages:parseSpec(arg[i])
    while pages:nextPage() do
        pages:getBinary(page_image)
        segmenter:segment(page_segmentation,page_image)
        local p = RecognizedPage()
        tesseract_recognize_blockwise(p, page_image, page_segmentation)
        p = convert_RecognizedPage_to_PageNode(p)
        p.description = pages:getFileName()
        local regions = RegionExtractor()
        regions:setPageLines(page_segmentation)
        p.headings = detect_headings(regions, page_image)
        p.paragraphs = detect_paragraphs(regions, page_image)
        document:append(p)
    end
end
document:hocr_output()
Note: At present, in ocropus-0.2, this file has a problem representing the Unicode output for Bangla as well as Devanagari. The reason is that the language specified in this file is overwritten in the file tesseract.cc. So we edit tesseract.cc as follows:
namespace ocropus {
    param_string tesslanguage("tesslanguage", "ban", "Specify the language for Tesseract");
}
Now the problem of Unicode representation is solved.
Recognition issue
Isolated character recognition
Performing isolated character recognition with the Lua files is simple; use the command:
./ocroscript rec-tess.lua test-image.png > out-tess.html
Connected character recognition
To perform connected character recognition, we first have to perform character segmentation capable of representing each character as a separate component. Examples of segmented images are shown in figure-1 (Bangla) and figure-2 (Devanagari). Here we include the segmentation algorithm inside the rec-tess.lua file. The command for connected character recognition is the same as for isolated character recognition.
Figure 1: Example of Bangla word (a) test word (b) segmented units
Figure 2: Example of Devanagari word (a) test word (b) segmented units
=====
Acknowledgement: Souro Chowdhury
How to train Bangla and Devanagari script for tesseract engine
Introduction
This document provides the step-by-step instructions that we followed to prepare training data for Bangla and Devanagari script. It is just a short version of the TrainingTesseract document, which we followed; no detailed explanation of the purpose of each step is given here.
Data files required
To train Bangla or Devanagari scripts (lang = ban/dev), you have to create 8 data files in the tessdata subdirectory. The 8 files are:
1. tessdata/lang.freq-dawg
2. tessdata/lang.word-dawg
3. tessdata/lang.user-words
4. tessdata/lang.inttemp
5. tessdata/lang.normproto
6. tessdata/lang.pffmtable
7. tessdata/lang.unicharset
8. tessdata/lang.DangAmbigs
Step by step procedure
Step – 1: Create training data
Preparing training data depends on the characters or units that you want to recognize. The number of training data units depends on the performance of the segmentation algorithm: if you use minimal segmentation, you have to consider all the combinations formed by the alphabet of your script; however, if your segmentation algorithm is good enough to segment the basic units properly, you can train only the basic and compound units of your script. At the fundamental level we consider training only the basic units. Examples of the training data units are shown in figure-1 (Bangla training data) and figure-2 (Devanagari training data).
Figure 1: Training data units for Bangla script
Figure 2: Training data units for Devanagari script
Step – 2: Make Box file
In this step we prepare a box file: a text file that lists the characters in the training image, in order, one per line, with the coordinates of the bounding box around each character. To create the box file, run Tesseract on each of your training images with the following command:
tesseract trainfile.tif trainfile batch.nochop makebox
This will generate a file named trainfile.txt, which you have to rename to trainfile.box. Then manually edit the box file, replacing the Latin character at the start of each line with the appropriate Unicode Bangla/Devanagari character. If a particular character is broken into two boxes, you have to merge the boxes manually. An example of an edited box file is shown in figure-3. The box file name must match the name of the training tif image file.
Figure 3: Box file for Bangla and Devanagari script
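The renaming and per-line correction described in this step can be sketched in shell; the sample box file, the line number, and the replacement character below are all made-up examples:

```shell
# Made-up stand-in for the makebox output (format: char left bottom right top).
printf 'a 10 20 30 40\nb 50 60 70 80\n' > trainfile.txt

# Rename the makebox output to match the training image name.
mv trainfile.txt trainfile.box

# Replace the Latin character on line 2 with the correct Bangla character,
# keeping the coordinates intact.
awk 'NR == 2 { $1 = "ক" } { print }' trainfile.box > tmp.box && mv tmp.box trainfile.box
```

For merging split boxes there is no shortcut: open the file in a UTF-8 editor, combine the two coordinate sets into one enclosing box, and delete the duplicate line.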
Step – 3: Run Tesseract for Training
For each of your training image and boxfile pairs, run Tesseract in training mode using the following command:
tesseract trainfile.tif junk nobatch box.train
This will generate a file named trainfile.tr which contains the features of each character of the training page.
Step – 4: Clustering
Clustering is necessary to create prototypes. The character shape features can be clustered using the mftraining and cntraining programs. The mftraining program is invoked using the following command:
mftraining trainfile.tr
This will output two data files: inttemp and pffmtable. (A third file called Microfeat is also written by this program, but it is not used.)
The cntraining program is invoked using the following command:
cntraining trainfile.tr
This will output the normproto data file.
In the case of multiple training data files, the following commands will be used:
mftraining trainfile_1.tr trainfile_2.tr ...
cntraining trainfile_1.tr trainfile_2.tr ...
Step – 5: Compute the Character Set
Next you have to generate the unicharset data file using the following command:
unicharset_extractor trainfile.box
This will generate a file named unicharset. Tesseract needs access to the character properties isalpha, isdigit, isupper, and islower. To set these properties we manually edit the unicharset file and change the default value (0) set for each training character. An example of the unicharset file is shown in figure-4.
Figure 4: unicharset file for Bangla and Devanagari script
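The manual property editing can be partly scripted. The sketch below marks Bangla digits as isdigit and everything else as isalpha; the bitmask values (1 = isalpha, 2 = islower, 4 = isupper, 8 = isdigit) follow the TrainingTesseract notes for Tesseract 2.x, so verify them against your version, and the sample file contents are made up:

```shell
# Made-up unicharset sample: first line is the character count, then
# one "char property" pair per line, all defaulting to 0.
printf '3\nক 0\nখ 0\n০ 0\n' > unicharset

# Bangla digits get isdigit (8); every other unit gets isalpha (1).
digits="০১২৩৪৫৬৭৮৯"
awk -v d="$digits" 'NR == 1 { print; next }
    index(d, $1) > 0 { $2 = 8; print; next }
    { $2 = 1; print }' unicharset > unicharset.fixed
```

Characters that also need islower/isupper (not meaningful for Bangla, but possible in mixed-script sets) would still need hand editing afterwards.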
Step – 6: Prepare Dictionary data
Tesseract uses 3 dictionary files for each language. Two of the files are coded as a Directed Acyclic Word Graph (DAWG), and the other is a plain UTF-8 text file. To make the DAWG dictionary files, you first need a wordlist for your language. The wordlist is formatted as a UTF-8 text file with one word per line. Split the wordlist into two sets: the frequent words, and the rest of the words, and then use wordlist2dawg to make the DAWG files:
wordlist2dawg frequent_words_list freq-dawg
wordlist2dawg words_list word-dawg
The third dictionary file is called user-words and is usually empty.
The dictionary files freq-dawg and word-dawg don't have to be given many words if you don't have a wordlist to hand, but accuracy will be lower.
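The wordlist split described above can be sketched in shell. The corpus contents and the cut-off (top 100 distinct words by frequency) below are arbitrary examples, not a recommendation:

```shell
# Made-up raw corpus: one word per line, repeats allowed.
printf 'আমি\nআমি\nতুমি\nসে\nআমি\nতুমি\n' > corpus.txt

# Count occurrences and sort by descending frequency.
sort corpus.txt | uniq -c | sort -rn > counts.txt

# Most frequent 100 distinct words form the frequent list; the rest go
# to the general word list.
head -n 100 counts.txt | awk '{ print $2 }' > frequent_words_list
tail -n +101 counts.txt | awk '{ print $2 }' > words_list

# Then build the DAWG files (requires the Tesseract training tools):
# wordlist2dawg frequent_words_list freq-dawg
# wordlist2dawg words_list word-dawg
```

A real Bangla wordlist should come from a sizable corpus; a handful of words, as in this toy example, gives the dictionaries little to work with.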
Step – 7: Prepare DangAmbigs file
This file represents the intrinsic ambiguity between characters or sets of characters. You have to generate this file from the recognition failure examples in your script. An example of the rules is shown in figure-5 for Bangla script.
Figure 5: Ambiguity between characters in Bangla script
Step – 8: Rename the necessary files
As mentioned at the start of this document, you now have to rename the necessary 8 files according to your language/script. For Bangla we used lang="ban" and for Devanagari we used lang="dev". So the names of the 8 files will be prefixed by lang + '.' (for example: ban.unicharset, dev.unicharset). These 8 files must be copied into the tessdata subdirectory if they were generated anywhere else.
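The renaming step can be sketched as a small shell loop; here the eight files are created empty purely as stand-ins for the real generated ones:

```shell
lang=ban

# Stand-ins for the files produced in the earlier steps (illustration only).
for f in freq-dawg word-dawg user-words inttemp normproto pffmtable unicharset DangAmbigs; do
    touch "$f"
done

# Copy each file into tessdata with the language prefix, e.g. ban.unicharset.
mkdir -p tessdata
for f in freq-dawg word-dawg user-words inttemp normproto pffmtable unicharset DangAmbigs; do
    cp "$f" "tessdata/$lang.$f"
done
```

For Devanagari the same loop applies with lang=dev.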