Monday, December 1, 2008

BanglaOCR V 0.6 | Source Code Released

This is the announcement of the release of the BanglaOCR V 0.6 source code.

Feedback:
Please send feedback on the following email address: bangla.ocr@gmail.com

Test Images:
A set of test images is also available on the same site as the OCR setup package, so please feel free to download and test them.

Download Link:
BanglaOCR V 0.6 (source code)
Test Images

Text to Image Converter V 0.1 (for Windows)

We felt the need for a “Text to Image Converter” while preparing training data. The training images, as well as the font, have to meet several specifications. To fulfill those requirements during training data preparation, I wrote this very simple program. It could be enhanced a lot further, but since it meets my present requirements I am releasing it as V 0.1. The release information is provided below:

Prerequisites:
1. Microsoft .NET Framework Version 2.0 Redistributable Package (x86)

User's Manual:
The manual will be available after installing Text to Image Generator. Go to Start>> All Programs >> TTI >> Text to Image Generator User Guide.

Feedback:
This program is basically for developers who need to prepare artificial training data for OCR. Any feedback about enhancements or problems with this application is highly appreciated. At present you can send feedback to the following email address: bangla.ocr@gmail.com

Download Link:
Text to Image Converter (Setup file)
Text to Image Converter (VS .NET source file)

Sunday, November 30, 2008

Text to Image Converter (for Linux)

For several purposes in our OCR research we needed a converter that can generate an image from any given Bangla/Bengali text. One of the reasons was to prepare artificial training and testing data. So we finally created a very simple text to image converter. The converter is actually a collection of two Python script files.

Prerequisites:
pango-view
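For anyone curious how such a converter can work, here is a minimal Python sketch (not the released scripts) that renders a UTF-8 Bangla text file to an image by calling pango-view. The font name, DPI value and file names are illustrative assumptions.

# text2img.py -- minimal sketch: render a UTF-8 text file to an image with pango-view
# Assumptions: pango-view is installed and the chosen font (e.g. SolaimanLipi) is available.
import subprocess
import sys

def text_to_image(text_file, out_png, font="SolaimanLipi 16", dpi=300):
    # -q suppresses the preview window; --output writes the rendered image to disk
    subprocess.check_call([
        "pango-view", "-q",
        "--font", font,
        "--dpi", str(dpi),
        "--output", out_png,
        text_file,
    ])

if __name__ == "__main__":
    # usage: python text2img.py input.txt output.png
    text_to_image(sys.argv[1], sys.argv[2])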

Feedback:
Please send your feedback to the following email address: bangla.ocr@gmail.com

Download:
TTI for Linux

Interesting test on the recognition performance of BanglaOCR: DPI, scanned vs. computer-generated images

I was planning to test the performance of BanglaOCR on different types of images (scanned and computer-generated) at different DPI.

First of all I generated 4 images at 100, 200, 300 and 400 DPI and tested the performance. The output is as follows:

100 dpi:

Output: আরো তুমি ভো সারাদিনই খেনা এ।ন আমাদের ধেনভে দাংবা

200 dpi:

Output: আরেঢূ তুমি তো সার।দিনই খেল । এখন আমাদের খেলতে দ।ও।

300 dpi:

Output: আরে! তুমি তো সারাদিনই খেল। এখন আমাদের খেলভে দাও।

400 dpi:

Output: আরে! তুমি তো সারাদিনই খেল। এখন আমাদের খেলতে দাও।

In the second part of this test I scanned a line of text (font: SolaimanLipi, size 16) at 100, 200, 300 and 400 DPI and tested the performance. The output is as follows:

100 dpi:

Output: জারে! ড়ুমি ড়ো পীরুাদিনই খেদীা এখন থীনীদের খেঢৗতে দভো

200 dpi:

Output: আরে! তুমি ভো সারাদিনই খেল। এখন অ।ম।দের খেলভে দ।ও।

300 dpi:

Output: আরে! তুমি ভো সারাদিনই খেল। এখন আমাদের খেলভে দাও।

400 dpi:

Output: আরে! তুমি তো সারাদিনই খেল। এখন আমাদের খেলভে দাও।

The above observations make it clear that increasing the DPI improves the output, for both computer-generated and scanned images.
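To put a number on "better" rather than comparing the lines by eye, one could compute a character-level similarity between each OCR output and the ground-truth line. A rough Python sketch using only the standard library; the output file names are hypothetical and the ground-truth string is simply the expected sentence from the test above.

# compare_output.py -- rough character-level similarity between OCR output and ground truth
import difflib

def char_similarity(ground_truth, ocr_output):
    # Ratio in [0, 1]: 1.0 means the strings match exactly
    return difflib.SequenceMatcher(None, ground_truth, ocr_output).ratio()

truth = "আরে! তুমি তো সারাদিনই খেল। এখন আমাদের খেলতে দাও।"  # expected line
for dpi, out_file in [(100, "out_100.txt"), (200, "out_200.txt"),
                      (300, "out_300.txt"), (400, "out_400.txt")]:
    with open(out_file, encoding="utf-8") as f:
        print(dpi, "dpi:", round(char_similarity(truth, f.read().strip()), 3))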

Monday, November 24, 2008

BanglaOCR V 0.5 (for Linux users) | Released

This is the announcement of the release of BanglaOCR V 0.5 (for Linux users). The release information is provided below:

Prerequisites:
1. Tesseract OCR
2. Tidy
3. Java Runtime Environment (v 1.6)
4. Font : SolaimanLipi

User's Manual:
The manual is available with the package (BanglaOCR User Guide.pdf).

Feedback:
Any feedback about the application is highly appreciated. At present you can send feedback to the following email address: bangla.ocr@gmail.com

Test Images:
A set of test images is also available on the same site as the OCR setup package, so please feel free to download and test them. The images were randomly selected for testing the application, not chosen to show its best performance.

Download Link:
BanglaOCR V 0.5 (Linux Users)
Test Images

Wednesday, November 19, 2008

BanglaOCR V 0.6 | Released

This is the announcement of the release of BanglaOCR V 0.6. We fixed a few problems found in V 0.5 and, after solving those, moved on to this release. The release information is provided below:

Prerequisites:
1. Microsoft .NET Framework Version 2.0 Redistributable Package (x86)
2. Microsoft Visual C++ 2005 Redistributable Package (x86)
3. Java Runtime Environment

User's Manual:
The manual will be available after installing BanglaOCR. Go to Start>> All Programs >> BanglaOCR>> BanglaOCR User Guide.

Feedback:
Any feedback about the application is highly appreciated. At present you can send feedback to the following email address: bangla.ocr@gmail.com
I am planning to build a document image database, so if you would like, please send us your document images.

Test Images:
A set of test images is also available on the same site as the OCR setup package, so please feel free to download and test them. The images were randomly selected for testing the application, not chosen to show its best performance.

Download Link:
BanglaOCR V 0.6
Test Images

Saturday, November 15, 2008

BanglaOCR V 0.5 | Internal Release

I would be happier if it were possible to avoid the words "Internal Release" now, but unfortunately, after waiting for 15 days, I feel that I should do this. This release is termed internal because I am experiencing a memory leak in the application and am still struggling to fix it. While I keep working on the memory leak problem, I am releasing this version for internal testing and feedback.

Prerequisites:
1. Microsoft .NET Framework Version 2.0 Redistributable Package (x86)
2. Microsoft Visual C++ 2005 Redistributable Package (x86)

Cautions:
1. Memory leak: You might see error messages about memory allocation. In that case you have to close the application and restart the OCR. This is the problem I mentioned above; I am trying to solve it.

Feedback:
Any feedback about the application is highly appreciated. At present you can send feedback to the following email address: bangla.ocr@gmail.com

Test Images:
A set of test images is also available on the same site as the OCR setup package, so please feel free to download and test them. The images were randomly selected for testing the application, not chosen to show its best performance.

Download Link:
BanglaOCR V 0.5
Test Images

Saturday, November 1, 2008

Bangla tesseract training data v-2.0 has been uploaded

I just uploaded the Bangla training data for the Tesseract engine. To be honest, a lot more work is needed on the training data so that the recognition performance increases. I hope we will be able to improve it, and newer versions of the data will be available soon. If anyone wants to take part in this task (preparing training data) and needs any help, please feel free to contact me. The links to the training data are given below:
http://ocropus-bengali.googlecode.com/files/Bangla%20tesseract%20training%20data%20v-2.0.zip

or

http://mhasnat.googlepages.com/Bangla_training_data_v_2_0.zip

Tuesday, October 28, 2008

uncompressed TIFF image converter | VC++ 2005

For the past two days I had been searching for source code to convert any image format to an uncompressed TIFF image. At last I found it; the link is as follows:

http://msdn.microsoft.com/en-us/library/system.drawing.imaging.encoder.compression.aspx
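As a side note, if someone only needs the conversion itself (not the VC++ code), a quick alternative is Python with the Pillow library. This is just a sketch under the assumption that Pillow is installed; it is unrelated to the VC++ sample linked above.

# to_uncompressed_tiff.py -- sketch: convert any image Pillow can read to an uncompressed TIFF
from PIL import Image
import sys

def convert(src, dst):
    img = Image.open(src)
    # Pillow writes an uncompressed TIFF when no compression is requested;
    # passing compression=None makes that explicit.
    img.save(dst, format="TIFF", compression=None)

if __name__ == "__main__":
    # usage: python to_uncompressed_tiff.py input.png output.tif
    convert(sys.argv[1], sys.argv[2])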

Current Status of Bangla OCR | 29th October, 2008

In the past few days I have been feeling the urgency of writing about the current status of our research and development. It has been quite a long time (more than 40 days) since I posted anything about the status of our Bangla OCR on the blog. In the last post I wrote about segmentation success and showed a color segmented image. After that we tried to combine everything and go for the 2nd release of Bangla OCR, which will actually be a Tesseract-based Bangla OCR. Seeing the actual demand, we plan to develop it on both the Windows and Linux platforms. Thanks go to Joyonto da, Firoj Alam and Murtoza, who encouraged me to think about and move towards both platforms. I would like to report that we have made significant progress in our development work. Shouro is finishing the Ubuntu GUI. I faced several difficulties regarding issues like including tessnet2.dll, handling a buffer overrun problem, adding the post-processor, etc., and spent the past two weeks on these. I was depressed by the tessnet2.dll loading problems. At last Remi Thomas (developer of tessnet2.dll) assured me that the problem (buffer overrun) I was facing was in Tesseract. So I moved my focus back to tesseract.exe, included it in my Windows application and ran it as a hidden (background) process. Now I am successful, because I can see the output. A few things are yet to be included in our application. I hope we can finish it soon and release the 2nd version of Bangla OCR.

Wednesday, August 20, 2008

Status of Bangla character segmentation

This post is especially for those who are concerned about the current status of our Bangla character segmentation. I would like to show one example:


Figure 1: Input image


Figure 2: Segmented color image

Some modifications still need to be done.

Monday, August 4, 2008

Why does Tesseract need to be trained on all possible combinations of characters/units to recognize Bangla and Devanagari scripts?

Background:

The last step of preprocessing is segmentation. Information about the segmented units is passed to the classifier for recognition. Tesseract is trained very efficiently to recognize English text images. To integrate support for recognizing Indic scripts (Bangla and Devanagari) we would have to modify several of its algorithms. Right now Tesseract does not allow us to modify the internal algorithms, so we need to prepare the test image in such a form that it is suitable for Tesseract to process and recognize. Here I am writing my personal observations after finishing my primary targeted experiments with Tesseract for recognizing Bangla script.

Experiments and Observations:


We performed our experiment with the training data to decide on the necessary number of data units. For this experiment we considered two approaches, as follows:
  1. Train dependent modifiers separately from the basic units.

  2. Train dependent modifiers combined with the basic units.

By basic units we mean all the independent vowels, consonants and compound characters (combined consonants). In a word image the modifiers (vowel and consonant) are placed around the core characters and are often connected to the basic characters following certain characteristics. In most cases the images of the basic character and the modifier overlap by a certain amount. As a result it is impossible to draw a horizontal or vertical boundary line between them. This scenario is completely different from English character segmentation. Our segmentation algorithm is capable of locating the region of the modifiers without placing any boundary, and hence it is impossible to generate separate bounding boxes for the basic characters and modifiers. An example of characters with modifiers and their breakdown is shown in figure-1.

Figure-1: Characters with modifiers and their break down

In experimental approach 1 we trained Tesseract with the basic units and the modifiers separately, to observe whether it is capable of recognizing both the modifier and the basic character when a certain amount of overlap exists between them. We observed that during testing the segmenter is capable of disconnecting the modifier and the basic character from each other. Now, if the segmenter succeeds in separating the modifier (example: , , ৈ ্য) and the basic character along a vertical column, then Tesseract recognizes them successfully. However, for other modifiers (example: ি, , , , , , and ্র) it is not possible for the segmenter to separate them along any vertical column. In such cases Tesseract failed to recognize them, because it is unable to identify separate bounding boxes for the modifier and the basic unit.

During recognition of a line image, Tesseract extracts the bounding box of each character image. When the modifier and the basic unit overlap, it identifies a single box which includes both the modifier and the basic character image. As we have no trained unit which includes both the modifier and the basic unit, Tesseract will definitely fail in such a case. An example of such an image is shown in figure-2.

Figure-2: Example of test case image
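The failure case can be stated in terms of bounding boxes: if the modifier's box and the basic character's box overlap horizontally, no single vertical cut can separate them, so a combined unit has to be trained. A small illustrative Python check (the box coordinates are made up):

# overlap_check.py -- illustrates why a vertical cut cannot split overlapping boxes
def separable_by_vertical_cut(box_a, box_b):
    # boxes are (left, top, right, bottom); a vertical cut exists only if the
    # horizontal extents of the two boxes do not overlap
    return box_a[2] <= box_b[0] or box_b[2] <= box_a[0]

basic = (10, 0, 60, 80)      # hypothetical basic character box
modifier = (50, 0, 75, 80)   # hypothetical modifier box overlapping it by 10 px
print(separable_by_vertical_cut(basic, modifier))  # False -> train the combination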

So, considering the performance of approach 1, it is clear that for Indic scripts with similar properties it is necessary to train all possible combinations of the sensitive modifiers (example: ি, , , , , , and ্র) and basic characters. So, the final training data set (approach 2) will consider the following:

  • All vowels, consonants and numerals

  • Consonants + vowel modifiers

  • Consonants + consonant modifiers

  • Combined consonants (compound character)

  • Compound characters + vowel modifiers

  • Compound characters + consonant modifiers

Following approach-2, the total number of training data units will be around 3200.
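As a rough illustration of how the unit count grows, the following Python sketch builds combined training units from small, illustrative character lists; the real inventory of consonants, compounds and modifiers is much larger, which is how the total reaches roughly 3200.

# build_units.py -- toy sketch of enumerating approach-2 training units
consonants = ["ক", "খ", "গ", "ঘ"]            # tiny illustrative subset
vowel_modifiers = ["া", "ি", "ী", "ু", "ে", "ো"]
consonant_modifiers = ["্র", "্য"]             # ra-phala, ya-phala

units = list(consonants)                        # basic units on their own
units += [c + m for c in consonants for m in vowel_modifiers]
units += [c + m for c in consonants for m in consonant_modifiers]
# compound characters and their combinations would be added the same way

with open("training_units.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(units))
print(len(units), "units generated")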

How to add a method in OCROPUS-0.2

I added a method in the file grouping.cc.
The method name is: crblpRecognizeLine(....)

I did it in the following steps:

1. added my method body in the file grouping.cc in
struct NewGroupingLineOCR : IRecognizeLine {..};

2. added a fake body in file tesseract.cc in
struct TesseractRecognizeLine : IRecognizeLine {..};

as this: void crblpRecognizeLine(...){}

3. added a fake body in file glinerec.cc in
struct GLineRec : IRecognizeLine {..};

as this: void crblpRecognizeLine(...){}

4. added a prototype in file ocrinterfaces.h in
struct IRecognizeLine : IComponent {..};

as this: virtual void crblpRecognizeLine(...)=0;

5. added a prototype in file ocr.pkg in
struct IRecognizeLine : IComponent {..};

as this: virtual void crblpRecognizeLine(...)=0;

Now compile.
You can now call it from C++ or Lua :-)

Saturday, August 2, 2008

How to install ocropus-0.2 (for newbies)

This guide is based on Ubuntu 8.04 LTS.

step-1:
Download ocropus-0.2, tesseract (use svn) and openfst.

step-2:
Open a terminal (Applications -> Accessories -> Terminal).
Make sure you are connected to the internet.
Then enter the following commands:

$ sudo apt-get install libpng12
$ sudo apt-get install libpng12-dev
$ sudo apt-get install libtiff4
$ sudo apt-get install libtiff4-dev
$ sudo apt-get install libjpeg62
$ sudo apt-get install libjpeg62-dev
$ sudo apt-get install ftjam
$ sudo apt-get install zlib
$ sudo apt-get install libleptonica
$ sudo apt-get install libleptonica-dev
$ sudo apt-get install libedit-dev
$ sudo apt-get install aspell-en
$ sudo apt-get install libaspell-dev

step-3:
To build and install tesseract, go to the tesseract root dir.
The commands are:

$./configure
$make
$sudo make install

step-4:
To build and install openfst, go to the openfst root dir:

$cd fst
$make all
$sudo mkdir -p /usr/local/include/fst/lib
$sudo cp -v fst/lib/*.h /usr/local/include/fst/lib
$sudo cp fst/lib/*.a /usr/local/lib

Make sure that all the *.h files in /usr/local/include/fst/lib have at least "read-only" permission for the "others" category, and make sure /usr/local/lib/libfst.a has the same permission.

To do this, run these commands:
$sudo chmod -R 755 /usr/local/include/fst/lib
$sudo chmod 755 /usr/local/lib/libfst.a

step-5:
Go to the ocropus root dir, then run these commands:
$./configure --without-SDL
$jam
$sudo jam install

!!DONE!!

Wednesday, July 30, 2008

Best parameters for bpnet line training for Bangla script

nhidden - 500
epochs - 200
learningrate - 0.2
testportion - 0
normalize - 1
shuffle - 1

Testing performance of ocropus-tesseract for Bangla Script

Training

1st Step: Decide the training data set


To prepare the training data for Bangla characters I considered the following:

- Basic characters (vowel + consonant) / units, numerals and symbols
- consonant + vowel modifiers
- consonant + consonant modifiers
- combined consonants (compound character)
- compound character + vowel modifiers
- compound character + consonant modifiers

The total number of training units is 3200.


2nd Step: Prepare training data images

We typed all the combinations and took printouts of the documents (13 pages). Next we scanned the pages and manually preprocessed them, which included skew correction. We chose the most popular font (SutonnyMJ), which has been widely used for a long time to print Bangla documents, as most of the documents we are targeting for OCR are written in that font. The only problem is that the font is non-Unicode, so it cannot be used for transcription. Therefore, we used a Unicode font (SolaimanLipi) to prepare the transcription. An example is shown in figure-1.

Figure-1 : Example of training image files

3rd Step: Prepare training data files using tesseract

Next we prepared box files (*.box), training files (*.tr), clustered files (*.inttemp, *.normproto and *.pffmtable) and the character set file (*.unicharset) following the appropriate instructions.

To create the box files we didn't rely on Tesseract's box file creation process, as we found that it failed many times and also generated inappropriate box information for Bangla characters. So, we used our own box file creation Lua script, which successfully creates box files with the appropriate transcription. Still, we had to handle several character images manually. We might avoid the manual process if we chose mixed documents resembling real document images. Another important point is that we had to eliminate a few units during training because of a certain amount of gap between the core character and its modifier in those character images. An example of such units is given in figure-2.

Figure-2 : Example of image units which failed in tesseract training

An example of the error report is as follows:
hasnat@hasnat-desktop:~$ tesseract 13.tif junk nobatch box.train
Tesseract Open Source OCR Engine
APPLY_BOXES: boxfile 2/1/সুঁ ((19,1388),(67,1468)): FAILURE! box overlaps no blobs or blobs in multiple rows


4th Step: Prepare language specific data files

To prepare the language-specific dictionary data files (freq_dawg and word_dawg) we chose a word list of 180K words and a frequent word list of 30K words. The third file, user_words, is empty. We added only two rules to the DangAmbigs file.

We followed these four steps to prepare the complete training data.

Testing:
As we are using ocropus-tesseract, we have access to the preprocessed image just before segmentation. We applied our own segmentation algorithm to segment the words into characters. Figure-3 shows an example of the test image and figure-4 shows the output of our segmenter. The segmented image is then passed to the Tesseract recognizer to obtain the output text.


Figure-3 : Test Image


Figure-4 : Segmentation result

For the test image we got the following Unicode output: শহরেকন্দ্রিধী জীঁবন গেড় উঠলেও বাংল্লদেশের গ্রামের র্জীবনই প্রকৃত বৈশিষ্টে্যযুঁ অধিকারী.

My feedback:

I am definitely not satisfied with this output. My initial expectation was that, after using dictionary files built from a large word list, we would get more accurate results than with the past approach, where we used a word list of only 70 basic characters.

Next TODO:

I plan to do the following tasks to improve the efficiency:

1. Train each unit with at least 10 samples. (Right now, number of sample per unit = 1)
2. Train each unit with variations.
3. Collect training units from test documents, which is a lengthy process.

Saturday, July 19, 2008

Training data for ocropus-tesseract

We have created training data for ocropus-tesseract. The training data is available for both Bangla and Devanagari. For Bangla we tried to train all the combinations of minimally segmented data units. For Devanagari we trained with the very basic units.

We are testing the recognition performance with the trained data units for Bangla script and continuing to add more data units to enhance the recognition accuracy. The Devanagari training data is useful for testing basic character recognition only. It will be helpful for anyone who is just starting to try to recognize Devanagari script using ocropus-tesseract.


The training data is freely available to download. Anyone can download these from the following links:

Download Training data for Bangla
Download Training data for Devanagari

Sunday, July 13, 2008

How to test Bangla and Devanagari script using OCROPUS(tesseract)

Introduction

This document provides the step-by-step instructions that we followed to test printed documents of Bangla and Devanagari script.


Required files

To test Bangla or Devanagari scripts (lang = ban/dev), we have two files in the ocropus-0.2/ocroscript subdirectory. Those files are:

  • ocropus-0.2/ocroscript/rec-ltess.lua
  • ocropus-0.2/ocroscript/rec-tess.lua

Of these two Lua files, rec-ltess.lua is used mainly to observe the performance of ocropus layout analysis, and rec-tess.lua is used to observe the performance of character recognition.


Usage of the lua files for testing

rec-ltess.lua

if #arg < 1 then
    arg = { "../../data/pages/alice_1.png" }
end

pages = Pages()
pages:parseSpec(arg[1])
segmenter = make_SegmentPageByRAST()
page_image = bytearray()
page_segmentation = intarray()
line_image = bytearray()
bboxes = rectanglearray()
costs = floatarray()
tesseract_recognizer = make_TesseractRecognizeLine()
tesseract.init("ban")

while pages:nextPage() do
    pages:getBinary(page_image)
    segmenter:segment(page_segmentation,page_image)
    regions = RegionExtractor()
    regions:setPageLines(page_segmentation)
    for i = 1,regions:length()-1 do
        regions:extract(line_image,page_image,i,1)
        fst = make_StandardFst()
        tesseract_recognizer:recognizeLine(fst,line_image)
        result = nustring()
        fst:bestpath(result)
        print(result:utf8())
    end
end


rec-tess.lua

require 'lib.util'
require 'lib.headings'
require 'lib.paragraphs'

if not tesseract then
    print "Compiled without Tesseract support, can't continue."
    os.exit(1)
end

opt,arg = getopt(arg)
if #arg == 0 then
    print "Usage: ocroscript rec-tess [--tesslanguage=...] input.png ... >output.hocr"
    os.exit(1)
end

set_version_string(hardcoded_version_string())

segmenter = make_SegmentPageByRAST()
page_image = bytearray()
page_segmentation = intarray()

function convert_RecognizedPage_to_PageNode(p)
    page = PageNode()
    page.width = p:width()
    page.height = p:height()
    page.description = p:description()
    for i = 0, p:linesCount() - 1 do
        local bbox = p:bbox(i)
        local text = nustring()
        p:text(text, i)
        page:append(LineNode(bbox, text))
    end
    return page
end

document = DocumentNode()
for i = 1, #arg do
    pages = Pages()
    pages:parseSpec(arg[i])
    while pages:nextPage() do
        pages:getBinary(page_image)
        segmenter:segment(page_segmentation,page_image)
        local p = RecognizedPage()
        tesseract_recognize_blockwise(p, page_image, page_segmentation)
        p = convert_RecognizedPage_to_PageNode(p)
        p.description = pages:getFileName()
        local regions = RegionExtractor()
        regions:setPageLines(page_segmentation)
        p.headings = detect_headings(regions, page_image)
        p.paragraphs = detect_paragraphs(regions, page_image)
        document:append(p)
    end
end

document:hocr_output()


Note: At present, in ocropus-0.2, this file has a problem representing the Unicode output for Bangla as well as Devanagari. The reason is that the language specified in this file is overwritten in the file tesseract.cc. So, we edited the file tesseract.cc as follows:

namespace ocropus {
param_string tesslanguage("tesslanguage", "ban", "Specify the language for Tesseract");
}

Now the problem of unicode representation is solved.


Recognition issue

Isolated character recognition

Performing isolated character recognition with the Lua files is very simple; use the command:

./ocroscript rec-tess.lua test-image.png > out-tess.html

Connected character recognition

To perform connected character recognition, we first have to perform character segmentation, which represents each character as a separate component. An example of a segmented image is shown in figure-1 (Bangla) and figure-2 (Devanagari). Here we include the segmentation algorithm inside the rec-tess.lua file. The command for connected character recognition is the same as for isolated character recognition.


Figure 1: Example of Bangla word (a) test word (b) segmented units

Figure 2: Example of Devanagari word (a) test word (b) segmented units

=====

Acknowledgement: Souro Chowdhury

How to train Bangla and Devanagari script for tesseract engine

Introduction

This document provides the step-by-step instructions that we followed to prepare training data for Bangla and Devanagari script. It is just a short version of the document TrainingTesseract, which we followed to prepare the training data. No detailed explanation of the purpose of each step is given here.

Data files required

To train Bangla or Devanagari scripts (lang = ban/dev), you have to create 8 data files in the tessdata subdirectory. The 8 files are:

1. tessdata/lang.freq-dawg

2. tessdata/lang.word-dawg

3. tessdata/lang.user-words

4. tessdata/lang.inttemp

5. tessdata/lang.normproto

6. tessdata/lang.pffmtable

7. tessdata/lang.unicharset

8. tessdata/lang.DangAmbigs

Step by step procedure

Step – 1: Create training data

Preparing the training data depends on the characters or units that you want to recognize. The decision about the number of training data units depends on the performance of the segmentation algorithm. If you consider minimal segmentation, then you have to consider all the combinations formed by the alphabet of your script. However, if your segmentation algorithm is good enough to segment the basic units properly, then you can train only the basic and compound units of your script. At the fundamental level we consider training only the basic units. An example of the training data units is shown in figure-1 (Bangla training data) and figure-2 (Devanagari training data).

Figure 1: Training data units for Bangla script

Figure 2: Training data units for Devanagari script

Step – 2: Make Box file

In this step we have to prepare a box file (a text file that lists the characters in the training image, in order, one per line, with the coordinates of the bounding box around each one). To create the box file, run Tesseract on each of your training images using the following command:

tesseract trainfile.tif trainfile batch.nochop makebox

This will generate a file named trainfile.txt that you have to rename to trainfile.box. Then manually edit the box file, replacing the Latin character (the first character of each line) with the appropriate Unicode Bangla/Devanagari character. If any particular character is broken into two boxes, then you have to merge the boxes manually (see the sketch below figure-3). An example of an edited box file is shown in figure-3. The box file name must be the same as the training TIF image file name.

Figure 3: Box file for Bangla and Devanagari script
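Merging is mechanical: the merged box spans the minimum left/bottom and maximum right/top of the two fragments. Here is a small Python sketch for the box-file layout assumed in this document (one character per line: char left bottom right top); the example coordinates are made up.

# merge_boxes.py -- sketch: merge two box-file lines that belong to one character
# Assumed line layout: "<char> <left> <bottom> <right> <top>"
def parse(line):
    ch, l, b, r, t = line.split()
    return ch, int(l), int(b), int(r), int(t)

def merge(line1, line2, merged_char=None):
    c1, l1, b1, r1, t1 = parse(line1)
    c2, l2, b2, r2, t2 = parse(line2)
    ch = merged_char or c1  # keep the character label of the first fragment by default
    return "%s %d %d %d %d" % (ch, min(l1, l2), min(b1, b2), max(r1, r2), max(t1, t2))

# example: two fragments of the same glyph collapsed into one box
print(merge("খ 19 1388 40 1468", "খ 42 1388 67 1468"))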

Step – 3: Run Tesseract for Training

For each of your training image and boxfile pairs, run Tesseract in training mode using the following command:

tesseract trainfile.tif junk nobatch box.train

This will generate a file named trainfile.tr which contains the features of each character of the training page.

Step – 4: Clustering

Clustering is necessary to create prototypes. The character shape features can be clustered using the mftraining and cntraining programs. The mftraining program is invoked using the following command:

mftraining trainfile.tr

This will output two data files: inttemp and pffmtable. (A third file called Microfeat is also written by this program, but it is not used.)

The cntraining program is invoked using the following command:

cntraining trainfile.tr

This will output the normproto data file.

In case of multiple training data the following command will be used:

mftraining trainfile_1.tr trainfile_2.tr ...

cntraining trainfile_1.tr trainfile_2.tr ...

Step – 5: Compute the Character Set

Next you have to generate the unicharset data file using the following command:

unicharset_extractor trainfile.box

This will generate a file named unicharset. Tesseract needs access to the character properties isalpha, isdigit, isupper and islower. To set these properties we have to manually edit the unicharset file and change the default value (0) set for each training character (a small helper script is sketched below figure-4). An example of the unicharset file is shown in figure-4.


Figure 4: unicharset file for Bangla and Devanagari script
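A small script can do this bulk edit instead of touching every line by hand. The sketch below assumes the usual layout of the unicharset file (the first line is an entry count, each following line starts with the character and then its properties value) and assumes the property values 1 for isalpha and 8 for isdigit; verify both assumptions against the TrainingTesseract documentation for your Tesseract version before using it.

# set_unicharset_props.py -- sketch: bulk-set property values in a unicharset file
# Assumed line layout: "<char> <properties> ...", first line is the entry count.
# Assumed property bits: 1 = isalpha, 8 = isdigit (verify for your Tesseract version).
BANGLA_DIGITS = set("০১২৩৪৫৬৭৮৯")

with open("unicharset", encoding="utf-8") as f:
    lines = f.read().splitlines()

out = [lines[0]]  # keep the count line untouched
for line in lines[1:]:
    parts = line.split(" ")
    ch = parts[0]
    parts[1] = "8" if ch in BANGLA_DIGITS else "1"  # digits vs. letters (no case in Bangla)
    out.append(" ".join(parts))

with open("unicharset.edited", "w", encoding="utf-8") as f:
    f.write("\n".join(out) + "\n")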

Step – 6: Prepare Dictionary data

Tesseract uses 3 dictionary files for each language. Two of the files are coded as a Directed Acyclic Word Graph (DAWG), and the other is a plain UTF-8 text file. To make the DAWG dictionary files, you first need a wordlist for your language. The wordlist is formatted as a UTF-8 text file with one word per line. Split the wordlist into two sets: the frequent words, and the rest of the words, and then use wordlist2dawg to make the DAWG files:

wordlist2dawg frequent_words_list freq-dawg

wordlist2dawg words_list word-dawg

The third dictionary file is called user-words and is usually empty.

The dictionary files freq-dawg and word-dawg don't have to be given many words if you don't have a wordlist to hand, but accuracy will be lower.
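Splitting a raw corpus into the two lists can be scripted. Below is a minimal Python sketch; the corpus file name and the top-N cutoff are placeholders you would adjust to your own data.

# split_wordlist.py -- sketch: build frequent_words_list and words_list from a corpus
# Assumptions: corpus.txt is a UTF-8 text corpus; the top-N cutoff is a free parameter.
from collections import Counter

TOP_N = 30000  # e.g. keep the 30K most frequent words as the "frequent" list

with open("corpus.txt", encoding="utf-8") as f:
    counts = Counter(f.read().split())

frequent = [w for w, _ in counts.most_common(TOP_N)]
all_words = sorted(counts)

with open("frequent_words_list", "w", encoding="utf-8") as f:
    f.write("\n".join(frequent) + "\n")
with open("words_list", "w", encoding="utf-8") as f:
    f.write("\n".join(all_words) + "\n")
# then: wordlist2dawg frequent_words_list freq-dawg
#       wordlist2dawg words_list word-dawg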

Step – 7: Prepare DangAmbigs file

This file represents the intrinsic ambiguity between characters or sets of characters. You have to generate this file considering the recognition failure examples in your script. An example of the rules is shown in figure-5 for Bangla script.

Figure 5: Ambiguity between characters in Bangla script

Step – 8: Rename the necessary files

As mentioned at the start of this document, you now have to rename the necessary 8 files according to your language/script. For Bangla we used lang=“ban” and for Devanagari we used lang=“dev”. So, the names of the necessary 8 files will be prefixed by lang+'.' (example: ban.unicharset, dev.unicharset). These 8 files must be copied into the tessdata subdirectory if they were generated anywhere else.
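The renaming and copying can also be scripted. This is a minimal sketch assuming the 8 files sit in the current directory and tessdata is the target subdirectory.

# install_traineddata.py -- sketch: prefix the 8 data files with the language code
# and copy them into the tessdata subdirectory.
import shutil, os

LANG = "ban"  # or "dev" for Devanagari
FILES = ["freq-dawg", "word-dawg", "user-words", "inttemp",
         "normproto", "pffmtable", "unicharset", "DangAmbigs"]

os.makedirs("tessdata", exist_ok=True)
for name in FILES:
    target = os.path.join("tessdata", "%s.%s" % (LANG, name))
    shutil.copy(name, target)
    print("installed", target)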

=====
Acknowledgement: Souro Chowdhury