Sunday, July 13, 2008

How to test Bangla and Devanagari script using OCRopus (Tesseract)

Introduction

This document provides the step-by-step instructions that we followed to test printed documents in Bangla and Devanagari script.


Required files

To test Bangla or Devanagari script (lang = ban/dev), we use two files in the ocropus-0.2/ocroscript subdirectory:

  • ocropus-0.2/ocroscript/rec-ltess.lua
  • ocropus-0.2/ocroscript/rec-tess.lua

Of these two Lua files, rec-ltess.lua is used mainly to observe the performance of the OCRopus layout analysis, and rec-tess.lua is used to observe the performance of character recognition.
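As a rough sketch, the two scripts could be run on the same page image so that the layout-analysis output and the recognition output can be compared side by side. The helper name run_both, the OCROSCRIPT variable and the output file names are assumptions made for illustration; only the script names and the ./ocroscript launcher come from this post.

```shell
# Hypothetical helper: run both test scripts on one page image.
# OCROSCRIPT defaults to ./ocroscript, the launcher used in this post;
# the output file names below are illustrative assumptions.
OCROSCRIPT="${OCROSCRIPT:-./ocroscript}"

run_both() {
    image="$1"
    base="${image%.*}"    # strip the extension: page.png -> page
    # rec-ltess.lua prints each recognized text line as UTF-8.
    "$OCROSCRIPT" rec-ltess.lua "$image" > "$base-ltess.txt" || return 1
    # rec-tess.lua writes hOCR to standard output.
    "$OCROSCRIPT" rec-tess.lua "$image" > "$base-tess.html" || return 1
}
```

For example, run_both test-image.png would leave test-image-ltess.txt and test-image-tess.html next to the input image.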


Usage of the lua files for testing

rec-ltess.lua

-- Fall back to a sample page when no input is given.
if #arg < 1 then
    arg = { "../../data/pages/alice_1.png" }
end

pages = Pages()
pages:parseSpec(arg[1])

segmenter = make_SegmentPageByRAST()
page_image = bytearray()
page_segmentation = intarray()
line_image = bytearray()
bboxes = rectanglearray()
costs = floatarray()

tesseract_recognizer = make_TesseractRecognizeLine()
-- Select the language: "ban" for Bangla ("dev" for Devanagari).
tesseract.init("ban")

while pages:nextPage() do
    pages:getBinary(page_image)
    segmenter:segment(page_segmentation,page_image)
    regions = RegionExtractor()
    regions:setPageLines(page_segmentation)
    -- Recognize each extracted text line and print it as UTF-8.
    for i = 1,regions:length()-1 do
        regions:extract(line_image,page_image,i,1)
        fst = make_StandardFst()
        tesseract_recognizer:recognizeLine(fst,line_image)
        result = nustring()
        fst:bestpath(result)
        print(result:utf8())
    end
end


rec-tess.lua

require 'lib.util'
require 'lib.headings'
require 'lib.paragraphs'

if not tesseract then
    print "Compiled without Tesseract support, can't continue."
    os.exit(1)
end

opt,arg = getopt(arg)

if #arg == 0 then
    print "Usage: ocroscript rec-tess [--tesslanguage=...] input.png ... >output.hocr"
    os.exit(1)
end

set_version_string(hardcoded_version_string())

segmenter = make_SegmentPageByRAST()
page_image = bytearray()
page_segmentation = intarray()

function convert_RecognizedPage_to_PageNode(p)
    page = PageNode()
    page.width = p:width()
    page.height = p:height()
    page.description = p:description()
    for i = 0, p:linesCount() - 1 do
        local bbox = p:bbox(i)
        local text = nustring()
        p:text(text, i)
        page:append(LineNode(bbox, text))
    end
    return page
end

document = DocumentNode()
for i = 1, #arg do
    pages = Pages()
    pages:parseSpec(arg[i])
    while pages:nextPage() do
        pages:getBinary(page_image)
        segmenter:segment(page_segmentation,page_image)
        local p = RecognizedPage()
        tesseract_recognize_blockwise(p, page_image, page_segmentation)
        p = convert_RecognizedPage_to_PageNode(p)
        p.description = pages:getFileName()
        local regions = RegionExtractor()
        regions:setPageLines(page_segmentation)
        p.headings = detect_headings(regions, page_image)
        p.paragraphs = detect_paragraphs(regions, page_image)
        document:append(p)
    end
end
document:hocr_output()


Note: At present, in ocropus-0.2, rec-tess.lua does not produce correct Unicode output for either Bangla or Devanagari. The reason is that the language specified in this file is overridden by the value set in the file tesseract.cc. We therefore edited tesseract.cc as follows:

namespace ocropus {
param_string tesslanguage("tesslanguage", "ban", "Specify the language for Tesseract");
}

With this change, the problem of Unicode representation is solved.
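The same edit could be scripted, which may help when switching between ban and dev. The following is a minimal sketch: the helper name set_default_tesslanguage is hypothetical, the location of tesseract.cc inside the source tree is not given in this post and has to be supplied by the caller, and the pattern assumes the declaration is formatted exactly like the line shown above.

```shell
# Hypothetical helper: change the default Tesseract language in tesseract.cc.
# $1 = path to tesseract.cc (supplied by the caller), $2 = language code (ban or dev).
set_default_tesslanguage() {
    src="$1"
    lang="$2"
    cp "$src" "$src.orig" || return 1    # keep a backup of the unmodified file
    # Rewrite only the default value, i.e. the second string literal
    # of the tesslanguage declaration; every other line is copied unchanged.
    sed "s/\(param_string tesslanguage(\"tesslanguage\", \"\)[^\"]*\"/\1$lang\"/" \
        "$src.orig" > "$src"
}
```

Since tesseract.cc is a C++ source file, OCRopus has to be rebuilt after the edit for the new default to take effect.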


Recognition issue

Isolated character recognition

Isolated character recognition using the Lua files is straightforward. Run the command:

./ocroscript rec-tess.lua test-image.png > out-tess.html
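Because rec-tess.lua writes hOCR (HTML markup), it can be convenient to look at the recognized text alone when checking the Unicode output. The helper below is a rough sketch with a hypothetical name, hocr_to_text. It simply strips tags line by line, so it assumes that no tag spans several lines, and it does not decode HTML entities.

```shell
# Hypothetical helper: print only the text content of an hOCR file.
hocr_to_text() {
    # Remove everything between '<' and '>', then drop blank lines.
    sed 's/<[^>]*>//g' "$1" | grep -v '^[[:space:]]*$'
}
```

For example, hocr_to_text out-tess.html prints the recognized lines without the surrounding markup.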

Connected character recognition

To perform connected character recognition, we first have to perform character segmentation, so that each character is represented as a separate component. Examples of segmented images are shown in Figure 1 (Bangla) and Figure 2 (Devanagari). We include the segmentation algorithm inside the rec-tess.lua file, so the command for connected character recognition is the same as for isolated character recognition.


Figure 1: Example of Bangla word (a) test word (b) segmented units

Figure 2: Example of Devanagari word (a) test word (b) segmented units

=====

Acknowledgement: Souro Chowdhury

4 comments:

আনন্দবাজার পত্রিকা ইউনিকোড said...

The Bangla OCR has turned out very well. Please try to make the Bangla OCR a little better still. That is, like the English one.

Md. Abul Hasnat said...

Many thanks for your appreciation. We are trying our best to achieve our target of making Bangla OCR more efficient.

ali said...

Hello,
I am also using ocropus.
I am getting an error when i try to run the following

ocroscript rec-tess file.png

I get the following error

no file '/usr/local/share/ocropus/scripts//rec-tess.lua'
no file './rec-tess.so'
no file '/usr/local/lib/lua/5.1/rec-tess.so'
no file '/usr/local/lib/lua/5.1/loadall.so'

Any help will be appreciated.

Thanks a lot