Wednesday, July 30, 2008

Testing performance of ocropus-tesseract for Bangla Script

Training

1st Step: Decide the training data set


To prepare the training data for Bangla characters I considered the following:

- Basic characters (vowel + consonant) / units, numerals and symbols
- consonant + vowel modifiers
- consonant + consonant modifiers
- combined consonants (compound character)
- compound character + vowel modifiers
- compound character + consonant modifiers

The total number of training units is 3200.
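The categories above can be enumerated mechanically. The following is a minimal sketch in Python, not the procedure we actually used: the three consonants and four vowel signs are an illustrative subset, the function names are hypothetical, and it assumes a compound character is written as consonant + hasanta + consonant. Not every pair produced this way is a valid Bangla conjunct, so a real list has to be filtered by hand.

```python
from itertools import product

# Illustrative subsets only; a real training set lists every unit of the script.
CONSONANTS = ["\u0995", "\u0996", "\u0997"]              # ka, kha, ga
VOWEL_SIGNS = ["\u09be", "\u09bf", "\u09c0", "\u09c1"]   # aa, i, ii, u signs
HASANTA = "\u09cd"                                       # joins two consonants


def consonant_vowel_units(consonants, vowel_signs):
    """Every consonant followed by every vowel modifier."""
    return [c + v for c, v in product(consonants, vowel_signs)]


def compound_units(consonants):
    """Two-consonant compounds: first + hasanta + second (assumption)."""
    return [a + HASANTA + b for a, b in product(consonants, repeat=2)]


def compound_vowel_units(consonants, vowel_signs):
    """Every compound followed by every vowel modifier."""
    return [cc + v for cc, v in product(compound_units(consonants), vowel_signs)]


cv = consonant_vowel_units(CONSONANTS, VOWEL_SIGNS)
cc = compound_units(CONSONANTS)
ccv = compound_vowel_units(CONSONANTS, VOWEL_SIGNS)
print(len(cv), len(cc), len(ccv))
```

Even with this tiny subset the number of units grows quickly, which is why the full set reaches thousands of units.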


2nd Step: Prepare training data images

We typed all the combinations and printed the documents (13 pages). Next we scanned the pages and manually preprocessed them, which included skew correction. We chose SutonnyMJ, the most popular font and one that has long been widely used to print Bangla documents, because most of the documents we are targeting for OCR are set in that font. The only problem is that the font is non-Unicode, so it cannot be used for transcription. We therefore used a Unicode font (Solaimanlipi) to prepare the transcription. An example is shown in figure-1.

Figure-1 : Example of training image files

3rd Step: Prepare training data files using tesseract

Next we prepared the box files (*.box), training files (*.tr), clustered files (*.inttemp, *.normproto and *.pffmtable) and character set file (*.unicharset) by following the appropriate instructions.

To create the box files we did not rely on tesseract's own box file creation process, as we found that it failed many times and also generated inappropriate box information for Bangla characters. So we used our own box file creation Lua script, which successfully creates box files with the appropriate transcription. We still had to handle several character images manually. We might have avoided the manual process if we had chosen mixed text, as it appears in a real document image. Another important point is that we had to eliminate a few units during training because of a gap between the core character and its modifier in those character images. Examples of such units are given in figure-2.

Figure-2 : Example of image units which failed in tesseract training

An example of the error report is as follows:
hasnat@hasnat-desktop:~$ tesseract 13.tif junk nobatch box.train
Tesseract Open Source OCR Engine
APPLY_BOXES: boxfile 2/1/সুঁ ((19,1388),(67,1468)): FAILURE! box overlaps no blobs or blobs in multiple rows


4th Step: Prepare language specific data files

To prepare the language-specific dictionary data files (freq-dawg and word-dawg) we chose a word list of 180K words and a frequent word list of 30K words. The third file, user-words, is empty. We added only two rules to the DangAmbigs file.

We followed these four steps to prepare the complete training data.

Testing:
Because we are using ocropus-tesseract, we can obtain the preprocessed image just before segmentation. We applied our own segmentation algorithm to segment the words into characters. Figure-3 shows an example test image and figure-4 shows the output of our segmenter. The segmented image is then passed to the tesseract recognizer to obtain the output text.


Figure-3 : Test Image


Figure-4 : Segmentation result

For the test image we got the following Unicode output: শহরেকন্দ্রিধী জীঁবন গেড় উঠলেও বাংল্লদেশের গ্রামের র্জীবনই প্রকৃত বৈশিষ্টে্যযুঁ অধিকারী.

My feedback:

I am definitely not satisfied with this output. My initial expectation was that, with dictionary files built from a large word list, we would get more accurate results than with our past approach, in which we used a word list of only 70 basic characters.

Next TODO:

I plan to do the following tasks to improve the efficiency:

1. Train each unit with at least 10 samples. (Right now, the number of samples per unit is 1.)
2. Train each unit with variations.
3. Collect training units from the test documents, which is a lengthy process.

Saturday, July 19, 2008

Training data for ocropus-tesseract

We have created training data for ocropus-tesseract. The training data is available for both Bangla and Devanagari. For Bangla we tried to train all the combinations of minimally segmented data units. For Devanagari we trained only the very basic units.

We are testing the recognition performance with the trained data units for Bangla script and will continue adding more data units to improve recognition accuracy. The Devanagari training data is useful only for testing basic character recognition. It should help guide anyone who is just starting to recognize Devanagari script using ocropus-tesseract.


The training data is freely available for download from the following links:

Download Training data for Bangla
Download Training data for Devanagari

Sunday, July 13, 2008

How to test Bangla and Devanagari script using OCROPUS(tesseract)

Introduction

This document provides the step-by-step instructions that we followed to test printed documents in Bangla and Devanagari script.


Required files

To test Bangla or Devanagari script (lang = ban/dev), we use two files in the ocropus-0.2/ocroscript subdirectory:

  • ocropus-0.2/ocroscript/rec-ltess.lua
  • ocropus-0.2/ocroscript/rec-tess.lua

Of these two Lua files, rec-ltess.lua is used mainly to observe the performance of ocropus layout analysis, and rec-tess.lua is used to observe the performance of character recognition.


Usage of the lua files for testing

rec-ltess.lua

-- Fall back to a sample page when no input image is given.
if #arg < 1 then
    arg = { "../../data/pages/alice_1.png" }
end

pages = Pages()
pages:parseSpec(arg[1])

segmenter = make_SegmentPageByRAST()
page_image = bytearray()
page_segmentation = intarray()
line_image = bytearray()
bboxes = rectanglearray()
costs = floatarray()

-- Line recognizer backed by Tesseract, initialised for Bangla ("ban").
tesseract_recognizer = make_TesseractRecognizeLine()
tesseract.init("ban")

while pages:nextPage() do
    pages:getBinary(page_image)
    -- Layout analysis: label the text lines of the page.
    segmenter:segment(page_segmentation,page_image)
    regions = RegionExtractor()
    regions:setPageLines(page_segmentation)
    -- Recognize each extracted line and print it as UTF-8.
    for i = 1,regions:length()-1 do
        regions:extract(line_image,page_image,i,1)
        fst = make_StandardFst()
        tesseract_recognizer:recognizeLine(fst,line_image)
        result = nustring()
        fst:bestpath(result)
        print(result:utf8())
    end
end


rec-tess.lua

require 'lib.util'
require 'lib.headings'
require 'lib.paragraphs'

-- Stop early if ocroscript was built without Tesseract.
if not tesseract then
    print "Compiled without Tesseract support, can't continue."
    os.exit(1)
end

opt,arg = getopt(arg)

if #arg == 0 then
    print "Usage: ocroscript rec-tess [--tesslanguage=...] input.png ... >output.hocr"
    os.exit(1)
end

set_version_string(hardcoded_version_string())

segmenter = make_SegmentPageByRAST()
page_image = bytearray()
page_segmentation = intarray()

-- Convert a RecognizedPage into a PageNode with one LineNode per text line.
function convert_RecognizedPage_to_PageNode(p)
    page = PageNode()
    page.width = p:width()
    page.height = p:height()
    page.description = p:description()
    for i = 0, p:linesCount() - 1 do
        local bbox = p:bbox(i)
        local text = nustring()
        p:text(text, i)
        page:append(LineNode(bbox, text))
    end
    return page
end

document = DocumentNode()

for i = 1, #arg do
    pages = Pages()
    pages:parseSpec(arg[i])
    while pages:nextPage() do
        pages:getBinary(page_image)
        segmenter:segment(page_segmentation,page_image)
        -- Recognize the whole page block by block with Tesseract.
        local p = RecognizedPage()
        tesseract_recognize_blockwise(p, page_image, page_segmentation)
        p = convert_RecognizedPage_to_PageNode(p)
        p.description = pages:getFileName()
        local regions = RegionExtractor()
        regions:setPageLines(page_segmentation)
        p.headings = detect_headings(regions, page_image)
        p.paragraphs = detect_paragraphs(regions, page_image)
        document:append(p)
    end
end

-- Write the recognized document as hOCR to standard output.
document:hocr_output()


Note: At present in ocropus-0.2 this script fails to produce correct Unicode output for Bangla and Devanagari. The reason is that the language specified in this file is overridden in the file tesseract.cc. So we edited tesseract.cc as follows:

namespace ocropus {
param_string tesslanguage("tesslanguage", "ban", "Specify the language for Tesseract");
}

Now the problem of unicode representation is solved.


Recognition issue

Isolated character recognition

Isolated character recognition with the Lua files is very simple. Use the command:

./ocroscript rec-tess.lua test-image.png > out-tess.html

Connected character recognition

To perform connected character recognition we first have to perform character segmentation, so that each character is represented as a separate component. An example of a segmented image is shown in figure-1 (Bangla) and figure-2 (Devanagari). Here we include the segmentation algorithm inside the rec-tess.lua file. The command for connected character recognition is the same as for isolated character recognition.


Figure 1: Example of Bangla word (a) test word (b) segmented units

Figure 2: Example of Devanagari word (a) test word (b) segmented units

=====

Acknowledgement: Souro Chowdhury

How to train Bangla and Devanagari script for tesseract engine

Introduction

This document provides the step-by-step instructions that we followed to prepare training data for Bangla and Devanagari script. It is a short version of the TrainingTesseract document, which we followed throughout. No detailed explanation of the purpose of each step is given here.

Data files required

To train Bangla or Devanagari scripts (lang = ban/dev), you have to create 8 data files in the tessdata subdirectory. The 8 files are:

1. tessdata/lang.freq-dawg
2. tessdata/lang.word-dawg
3. tessdata/lang.user-words
4. tessdata/lang.inttemp
5. tessdata/lang.normproto
6. tessdata/lang.pffmtable
7. tessdata/lang.unicharset
8. tessdata/lang.DangAmbigs

Step by step procedure

Step – 1: Create training data

Preparing training data depends on the characters or units that you want to recognize. The number of training data units depends on the performance of the segmentation algorithm. If you use minimal segmentation, you have to consider all the combinations formed by the alphabet of your script. However, if your segmentation algorithm is good enough to segment the basic units properly, you can train only the basic and compound units of your script. At the most fundamental level we train only the basic units. An example of the training data units is shown in figure-1 (Bangla training data) and figure-2 (Devanagari training data).

Figure 1: Training data units for Bangla script

Figure 2: Training data units for Devanagari script

Step – 2: Make Box file

In this step we prepare a box file (a text file that lists the characters in the training image, in order, one per line, with the coordinates of the bounding box around each character). To create the box file, run Tesseract on each of your training images using the following command:

tesseract trainfile.tif trainfile batch.nochop makebox

This will generate a file named trainfile.txt, which you have to rename to trainfile.box. Then manually edit the box file, replacing the Latin character at the start of each line with the appropriate Unicode Bangla/Devanagari character. If a character is broken into two boxes, you have to merge the boxes manually. An example of an edited box file is shown in figure-3. The box file name must match the name of the training tif image file.

Figure 3: Box file for Bangla and Devanagari script
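The manual edits described in Step 2 (replacing the guessed symbol and merging a character split across two boxes) can be sketched in Python. This is only an illustration, not our own box file script: it assumes each box line has the form "symbol left bottom right top", and the two input lines and the helper names are hypothetical.

```python
from typing import NamedTuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int


def parse_box_line(line):
    """Parse one 'symbol left bottom right top' line (assumed layout)."""
    symbol, left, bottom, right, top = line.split()[:5]
    return Box(symbol, int(left), int(bottom), int(right), int(top))


def merge_boxes(first, second, symbol):
    """Return the smallest box covering both inputs, labelled with `symbol`."""
    return Box(
        symbol,
        min(first.left, second.left),
        min(first.bottom, second.bottom),
        max(first.right, second.right),
        max(first.top, second.top),
    )


def format_box_line(box):
    """Write a box back in the same one-line layout."""
    return f"{box.symbol} {box.left} {box.bottom} {box.right} {box.top}"


# Hypothetical case: one Bangla unit was split into two Latin guesses.
a = parse_box_line("m 19 1388 50 1468")
b = parse_box_line("i 48 1400 67 1460")
merged = merge_boxes(a, b, "\u09b8\u09c1\u0981")
print(format_box_line(merged))
```

The merged box is the union of the two rectangles, which is what a manual merge in a text editor amounts to.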

Step – 3: Run Tesseract for Training

For each of your training image and boxfile pairs, run Tesseract in training mode using the following command:

tesseract trainfile.tif junk nobatch box.train

This will generate a file named trainfile.tr which contains the features of each character of the training page.

Step – 4: Clustering

Clustering is necessary to create prototypes. The character shape features can be clustered using the mftraining and cntraining programs. The mftraining program is invoked using the following command:

mftraining trainfile.tr

This will output two data files: inttemp and pffmtable. (A third file called Microfeat is also written by this program, but it is not used.)

The cntraining program is invoked using the following command:

cntraining trainfile.tr

This will output the normproto data file.

For multiple training files, pass all of the .tr files to each program:

mftraining trainfile_1.tr trainfile_2.tr ...

cntraining trainfile_1.tr trainfile_2.tr ...
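With many training pages it helps to generate the commands of Steps 3 and 4 instead of typing them. The sketch below only builds the argument lists shown above and does not run anything. The function name is hypothetical, and it assumes the images are in the current directory and that each .tr file takes the base name of its image.

```python
from pathlib import Path


def training_commands(image_names):
    """Build the Step 3 and Step 4 command lines for a set of training images."""
    commands = []
    tr_files = []
    for name in image_names:
        # Step 3: one training-mode run per image/box file pair.
        commands.append(["tesseract", name, "junk", "nobatch", "box.train"])
        tr_files.append(Path(name).stem + ".tr")
    # Step 4: both clustering programs receive every .tr file at once.
    commands.append(["mftraining", *tr_files])
    commands.append(["cntraining", *tr_files])
    return commands


for command in training_commands(["trainfile_1.tif", "trainfile_2.tif"]):
    print(" ".join(command))
```

Each list can be passed to subprocess.run(command, check=True) to execute it once the output looks right.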

Step – 5: Compute the Character Set

Next you have to generate the unicharset data file using the following command:

unicharset_extractor trainfile.box

This will generate a file named unicharset. Tesseract needs to have access to character properties isalpha, isdigit, isupper, islower. To set these properties we have to manually edit the unicharset file and change the default value (0) set for each training character. An example of the unicharset file is shown in figure-4.


Figure 4: unicharset file for Bangla and Devanagari script
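The property value for each line can be computed rather than typed by hand. The sketch below assumes the encoding described in the TrainingTesseract document as we understand it: a hexadecimal bit mask with isalpha = 1, islower = 2, isupper = 4 and isdigit = 8. It also assumes that a multi-character unit counts as alphabetic if any of its code points is a letter, which is our own convention, and the function names are hypothetical.

```python
ISALPHA, ISLOWER, ISUPPER, ISDIGIT = 1, 2, 4, 8


def property_mask(unit):
    """Compute the assumed unicharset property mask for one training unit."""
    mask = 0
    if any(ch.isalpha() for ch in unit):
        mask |= ISALPHA
    if any(ch.islower() for ch in unit):
        mask |= ISLOWER
    if any(ch.isupper() for ch in unit):
        mask |= ISUPPER
    if unit and all(ch.isdigit() for ch in unit):
        mask |= ISDIGIT
    return mask


def unicharset_line(unit):
    """Format a unit and its mask (in hexadecimal) as one line."""
    return f"{unit} {property_mask(unit):x}"


# Bangla letter ka, ka + aa sign, Bangla digit seven, Latin examples.
for unit in ["\u0995", "\u0995\u09be", "\u09ed", "b", "W", ";"]:
    print(unicharset_line(unit))
```

Bangla has no letter case, so its letters only ever get the isalpha bit.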

Step – 6: Prepare Dictionary data

Tesseract uses 3 dictionary files for each language. Two of the files are coded as a Directed Acyclic Word Graph (DAWG), and the other is a plain UTF-8 text file. To make the DAWG dictionary files, you first need a wordlist for your language. The wordlist is formatted as a UTF-8 text file with one word per line. Split the wordlist into two sets: the frequent words, and the rest of the words, and then use wordlist2dawg to make the DAWG files:

wordlist2dawg frequent_words_list freq-dawg

wordlist2dawg words_list word-dawg

The third dictionary file is called user-words and is usually empty.

The dictionary files freq-dawg and word-dawg don't have to be given many words if you don't have a wordlist to hand, but accuracy will be lower.
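If you start from a text corpus rather than ready-made lists, the two word lists for wordlist2dawg can be derived by counting. This is a minimal sketch and not the way our own lists were built: the function names and the tiny token list are hypothetical, and ranking by raw frequency is an assumption.

```python
from collections import Counter


def split_wordlist(tokens, frequent_size):
    """Split the distinct words into the most frequent ones and the rest."""
    ranked = [word for word, _ in Counter(tokens).most_common()]
    return ranked[:frequent_size], ranked[frequent_size:]


def write_wordlist(path, words):
    """Write one word per line as UTF-8, the format wordlist2dawg expects."""
    with open(path, "w", encoding="utf-8") as handle:
        for word in words:
            handle.write(word + "\n")


# Hypothetical corpus tokens (Bangla letters stand in for words).
tokens = ["\u0995", "\u0996", "\u0995", "\u0997", "\u0995", "\u0996"]
frequent, rest = split_wordlist(tokens, frequent_size=1)
print(len(frequent), len(rest))
```

The two lists would then be saved with write_wordlist as frequent_words_list and words_list before running wordlist2dawg.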

Step – 7: Prepare DangAmbigs file

This file represents the intrinsic ambiguity between characters or sets of characters. You have to generate this file from the recognition failures observed for your script. An example of the rules for Bangla script is shown in figure-5.

Figure 5: Ambiguity between characters in Bangla script

Step – 8: Rename the necessary files

As mentioned at the start of this document, you now have to rename the 8 necessary files according to your language/script. For Bangla we used lang="ban" and for Devanagari we used lang="dev". So the names of the 8 files are prefixed with lang+'.' (example: ban.unicharset, dev.unicharset). These 8 files must be copied into the tessdata subdirectory if they were generated anywhere else.
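The rename-and-copy step can be scripted. The following sketch assumes that the eight generated files still carry their unprefixed names and all sit in one directory. The function name and the directory arguments are hypothetical.

```python
import shutil
from pathlib import Path

# The eight data files, without the language prefix.
DATA_FILES = [
    "freq-dawg", "word-dawg", "user-words", "inttemp",
    "normproto", "pffmtable", "unicharset", "DangAmbigs",
]


def install_language_files(src_dir, tessdata_dir, lang):
    """Copy each data file into tessdata as '<lang>.<name>'."""
    src_dir, tessdata_dir = Path(src_dir), Path(tessdata_dir)
    tessdata_dir.mkdir(parents=True, exist_ok=True)
    installed = []
    for name in DATA_FILES:
        target = tessdata_dir / f"{lang}.{name}"
        shutil.copyfile(src_dir / name, target)
        installed.append(target)
    return installed
```

For example, install_language_files(".", "tessdata", "ban") would create tessdata/ban.unicharset and the other seven files, leaving the originals in place.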

=====
Acknowledgement: Souro Chowdhury

Saturday, October 6, 2007

Research and Development of OCR Systems on Bangla Script

Research Papers

SL. Paper Citation [Year]

1. A. K. Roy and B. Chatterjee, "Design of a Nearest Neighbor Classifier for Bengali Character Recognition", J. IETE, vol. 30, 1984. [1984]

2. U. Pal and B. B. Chaudhuri, "OCR In Bangla: An Indo-Bangladeshi Language", Proc. of 12th Int. Conf. on Pattern Recognition, IEEE Computer Society Press, pp. 269-274, 1994. [1994]

3. B. B. Chaudhuri and U. Pal, "Computer Recognition of printed Bangla Script", Int. Journal of Systems Science, vol. 26, pp. 2107-2123, 1995. [1995]

4. B. B. Chaudhuri and U. Pal, "OCR Error Detection and correction of an Inflectional Indian Language Script", Proceedings of ICPR, 1996. [1996]

5. B. B. Chaudhuri and U. Pal, "Automatic Separation of Words in Multi-Lingual Multi-Script Indian Documents", IEEE Trans, 1997. [1997]

6. U. Pal, Ph.D. Thesis, "On The Development of An Optical Character Recognition (OCR) System For Printed Bangla Script", Indian Statistical Institute, 1997. [1997]

7. U. Pal and B. B. Chaudhuri, "Printed Devnagari Script OCR System", Vivek, vol. 10, pp. 12-24, 1997. [1997]

8. B. B. Chaudhuri and U. Pal, "An OCR System To Read Two Indian Language Scripts: Bangla And Devnagari (Hindi)", Proc. Fourth Int. conf. on Document Analysis and Recognition, IEEE Computer Society Press, pp. 1011-1016, 1997. [1997]

9. B. B. Chaudhuri and U. Pal, "Skew Angle Detection Of Digitized Indian Script Documents", IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 19, pp. 182-186, 1997. [1997]

10. B. B. Chaudhuri and U. Pal, "A Complete Printed Bangla OCR System", Pattern Recognition, vol. 31, pp. 531-549, 1998. [1998]

11. B. B. Chaudhuri and U. Pal, "A complete Bangla OCR System", Computer Vision and Pattern Recognition Unit, Indian Statistical Institute, 1998. [1998]

12. U. Pal and B. B. Chaudhuri, "Script Line Separation From Indian Multi-Script Documents", Proceedings of the Fifth International Conference on Document Analysis and Recognition, ICDAR '99, Bangalore, India. [1999]

13. U. Pal and B. B. Chaudhuri, "Automatic Separation of Machine-Printed and Hand-Written Text Lines", Proceedings of the Fifth International Conference on Document Analysis and Recognition, ICDAR '99, Bangalore, India. [1999]

14. U. Pal and B. B. Chaudhuri, "Automatic Identification of English, Chinese, Arabic, Devnagari And Bangla Script Line", In Proc. Sixth Int. Conf. on Document Analysis and Recognition, IEEE Computer Society Press, pp. 790-794, 2001. [2001]

15. U. Pal, M. Mitra and B. B. Chaudhuri, "Multi-Skew Detection of Indian Script Documents", IEEE trans, 2001. [2001]

16. Veena Bansal and R.M.K. Sinha, "A Devanagari OCR and A Brief Overview of OCR Research for Indian Scripts", in Proceedings of STRANS01, held at IIT Kanpur, 2001. [2001]

17. A. Mandal and Prof. B. B. Chaudhuri, "Page Layout Analyzer For Multilingual Indian Documents", will be published in the Proceedings of the Language Engineering Conference 2002 by IEEE CS Press. [2002]

18. Ahmed Asif Chowdhury, Ejaj Ahmed, Shameem Ahmed, Shohrab Hossain and Chowdhury Mofizur Rahman, "Optical Character Recognition of Bangla Characters using neural network: A better approach", 2nd International Conference on Electrical Engineering (ICEE 2002), Khulna, Bangladesh. [2002]

19. Utpal Garain And Bidyut B. Chaudhuri, "Segmentation Of Touching Characters In Printed Devnagari And Bangla Scripts Using Fuzzy Multifactorial Analysis", IEEE Transactions On Systems, Man, And Cybernetics, Part C: Applications And Reviews, Vol. 32, No. 4, November 2002. [2002]

20. Jalal Uddin Mahmud, Mohammed Feroz Raihan and Chowdhury Mofizur Rahman, "A Complete OCR System for Continuous Bangla Characters", IEEE TENCON-2003: Proceedings of the Conference on Convergent Technologies for the Asia Pacific, 2003. [2003]

21. A. M. Shoeb Shatil and Mumit Khan, "Minimally Segmenting High Performance Bangla OCR using Kohonen Network", Proc. of 9th International Conference on Computer and Information Technology (ICCIT 2006), Dhaka, Bangladesh, December 2006. [2006]

22. S. M. Murtoza Habib, Nawsher Ahmed Noor and Mumit Khan, "Skew correction of Bangla script using Radon Transform", Proc. of 9th International Conference on Computer and Information Technology (ICCIT 2006), Dhaka, Bangladesh, December 2006. [2006]

23. Md. Abul Hasnat, S. M. Murtoza Habib, and Mumit Khan, "Segmentation free Bangla OCR using HMM: Training and Recognition", Proc. of 1st International Conference on Digital Communications and Computer Applications (DCCA2007), Irbid, Jordan, 2007. [2007]

Implementations

Name: Page Link

BOCRA [2006]: http://bocra.sourceforge.net/doc/

Apona-pathak [2006]: http://www.apona-bd.com/apona-pathak/bangla-ocr-apona-pathak.html

BanglaOCR [2007]: http://sourceforge.net/project/showfiles.php?group_id=158301&package_id=215908

Sunday, September 9, 2007

:-- Optical Character Recognition --:


Name:: BanglaOCR

Summary::

This project aims to develop an Optical Character Recognizer that can recognize Bangla script. The OCR research and development task is divided into five parts: Preprocessing, Feature Extraction, Training, Recognition and Post-processing. We experimented with several techniques for each part and chose the appropriate methods for our implementation. We used the Hidden Markov Model (HMM) technique for pattern training and classification. The Hidden Markov Model Toolkit (HTK) is used to implement the training and classification tasks.


Details::

BanglaOCR is the Optical Character Recognizer for Bangla Script. It takes scanned images of a printed page or document as input and converts them into editable Unicode text. BanglaOCR allows users to train the data set from any document and observe the recognition performance.

BanglaOCR deals with several independent parts, as listed below:

  • Preprocessing
  • Feature Extraction
  • Training
  • Recognition
  • Post-processing


The Preprocessing task involves image acquisition, binarization, noise elimination, skew correction, line and word separation and character segmentation. In this step we go only as far as minimal segmentation of characters.

The Feature Extraction task involves extracting meaningful characteristics of a minimally segmented character. First we divide the character image into several frames using a fixed frame length. Then we perform a Discrete Cosine Transform (DCT) calculation over each frame. We take the number of frames and the DCT values of each frame as the features for each character.
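The frame-and-DCT idea can be illustrated with a small Python sketch. It is not the BanglaOCR implementation. It assumes that frames are vertical strips of frame_len columns, that each strip is flattened column by column, that an unnormalized one-dimensional DCT-II is applied, and that only the first n_coeffs values are kept. All names and parameters are hypothetical.

```python
import math


def dct2(values):
    """Unnormalized one-dimensional DCT-II of a list of numbers."""
    n = len(values)
    return [
        sum(x * math.cos(math.pi / n * (i + 0.5) * k) for i, x in enumerate(values))
        for k in range(n)
    ]


def frame_features(image, frame_len, n_coeffs):
    """Return (number of frames, DCT coefficients per frame) for a binary image.

    `image` is a list of rows; each frame is a strip of `frame_len` columns.
    """
    width = len(image[0])
    features = []
    for start in range(0, width, frame_len):
        stop = min(start + frame_len, width)
        # Flatten the strip column by column before transforming it.
        strip = [row[col] for col in range(start, stop) for row in image]
        features.append(dct2(strip)[:n_coeffs])
    return len(features), features


# A 2x4 image of foreground pixels gives two frames of four pixels each.
image = [[1, 1, 1, 1],
         [1, 1, 1, 1]]
n_frames, feats = frame_features(image, frame_len=2, n_coeffs=2)
print(n_frames, feats[0][0])
```

For a uniform strip only the first coefficient is non-zero, so varying pixel patterns are what make the later coefficients informative.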

Training is performed on the calculated features of each minimally segmented character image. We create a separate HMM model for each segmented character image. Model creation involves dynamically choosing a prototype HMM and building a model from that prototype and the extracted feature data. Model creation is performed automatically by invoking the HInit tool of the HTK toolkit.

The Recognition process invokes the recognition tool HVite of the HTK toolkit. HVite uses the feature file of the segmented character image, with the features written in a specified format; the word network, which describes the allowable word sequences built from the task grammar; the dictionary, which defines each character or word; the complete list of HMMs; and the .mmf file, which contains the description of each HMM model. All these files are constructed in the format that HTK understands. After the recognition process is completed, the model name is read from the Master Label File (.mlf) and the associated Unicode character for the recognized model is written to the output file.

The final task that BanglaOCR performs is post-processing. This involves a suggestion-based spell checker that is able to identify erroneous words and produce suggestions. It takes the recognizer's output file as input and produces an output file that marks the erroneous words and provides up to a certain number of suggestions for each.
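As an illustration of suggestion-based checking, the sketch below ranks dictionary words by edit distance to a misrecognized word. This is an assumed technique chosen for the example, not necessarily the one used in BanglaOCR, and the function names, limits and tiny dictionary are hypothetical. The distance is computed over Unicode code points, so a stray vowel sign or chandrabindu counts as one edit.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, counted in code points."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def suggest(word, dictionary, max_suggestions=3, max_distance=2):
    """Return close dictionary words, or an empty list if the word is known."""
    if word in dictionary:
        return []
    scored = sorted((edit_distance(word, entry), entry) for entry in dictionary)
    return [entry for dist, entry in scored if dist <= max_distance][:max_suggestions]


# A word with a spurious chandrabindu, and a two-word dictionary.
recognized = "\u099c\u09c0\u0981\u09ac\u09a8"
dictionary = {"\u099c\u09c0\u09ac\u09a8", "\u0997\u09cd\u09b0\u09be\u09ae"}
print(suggest(recognized, dictionary))
```

A word that is already in the dictionary yields no suggestions, which is how the checker separates correct words from erroneous ones.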

The goal of the BanglaOCR project is to develop a market-standard multilingual OCR system that is capable of digitizing a wide range of Bangla document images. This will help to archive documents from all spheres and prevent the damage and loss of valuable documents and books.


Team::

Status::

  • The first version of open source BanglaOCR is released under the GNU General Public License (GPL) version 2 or later.
  • Research work on different parts of BanglaOCR is continuing to enhance the performance and usability.


Research Scope::

  • Research on Feature Extraction of Bengali Characters.
  • Research on proper Segmentation of Bengali Characters from any type of document image.
  • Research on the preprocessing of historical Bangla Document Image.
  • Research on Bangla Handwritten Image.
  • Research on the Training and Recognition using different techniques.
  • Research on Multi Lingual OCR.

Development Scope::

  • Implement the existing versions in different programming languages.

Timeline:: 2007 – 2009.

Character Recognition


Character recognition techniques associate a symbolic identity with the image of a character. Character recognition is commonly referred to as optical character recognition (OCR), as it deals with the recognition of optically processed characters. The modern version of OCR appeared in the middle of the 1940s with the development of digital computers. OCR machines have been commercially available since the middle of the 1950s. Today OCR systems are available both as hardware devices and software packages, and a few thousand systems are sold every week.

In a typical OCR system, input characters are digitized by an optical scanner. Each character is then located and segmented, and the resulting character image is fed into a preprocessor for noise reduction and normalization. Certain characteristics are then extracted from the character for classification. The feature extraction is critical and many different techniques exist, each having its strengths and weaknesses. After classification the identified characters are grouped to reconstruct the original symbol strings, and context may then be applied to detect and correct errors.

Optical character recognition has many different practical applications. The main areas where OCR has been of importance are text entry (office automation), data entry (banking environment) and process automation (mail sorting).

The present state of the art in OCR has moved from primitive schemes for limited character sets to the application of more sophisticated techniques for omnifont and handprint recognition. The main problems in OCR usually lie in the segmentation of degraded symbols which are joined or fragmented. Generally, the accuracy of an OCR system is directly dependent upon the quality of the input document. Three figures are used in ratings of OCR systems: correct classification rate, rejection rate and error rate. The performance should be rated from the system's error rate, as these errors go undetected by the system and must be manually located for correction.
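The three figures are simple proportions of the characters processed. A minimal sketch follows. The function name and the counts are hypothetical, and it assumes every character falls into exactly one of the three outcomes.

```python
def ocr_rates(correct, rejected, errors):
    """Return the three rates as fractions of all characters processed."""
    total = correct + rejected + errors
    if total == 0:
        raise ValueError("no characters were processed")
    return {
        "correct": correct / total,
        "rejection": rejected / total,
        "error": errors / total,
    }


# Hypothetical counts for 100 characters.
rates = ocr_rates(correct=90, rejected=6, errors=4)
print(rates)
```

Rejected characters are flagged by the system and can be reviewed directly, whereas errors have to be found by proofreading, which is why the error rate is the figure to watch.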

In spite of the great number of algorithms that have been developed for character recognition, the problem is not yet solved satisfactorily, especially in cases where there are no strict limitations on the handwriting or quality of print. Up to now, no recognition algorithm can compete with humans in quality. However, as the OCR machine is able to read much faster, it is still attractive.

In the future the area of recognition of constrained print is expected to decrease, since such material may be exchanged electronically or printed in a more computer-readable form, for instance barcodes. Emphasis will then be on the recognition of unconstrained writing, like omnifont and handwriting. This is a challenge which requires improved recognition techniques. The potential for OCR algorithms seems to lie in the combination of different methods and the use of techniques that are able to utilize context to a much larger extent than current methodologies.

The applications for future OCR-systems lie in the recognition of documents where control over the production process is impossible. This may be material where the recipient is cut off from an electronic version and has no control of the production process or older material which at production time could not be generated electronically. This means that future OCR-systems intended for reading printed text must be omnifont.

Another important area for OCR is the recognition of manually produced documents.

Within postal applications for instance, OCR must focus on reading of addresses on mail produced by people without access to computer technology. Already, it is not unusual for companies etc., with access to computer technology to mark mail with barcodes. The relative importance of handwritten text recognition is therefore expected to increase.

[Source]: Line Eikvil, "Optical Character Recognition", available at: “citeseer.ist.psu.edu/142042.html".