Introduction
This document describes, step by step, how we tested printed documents in the Bangla and Devanagari scripts.
Required files
To test Bangla or Devanagari script (lang = ban/dev), we use two files in the ocropus-0.2/ocroscript subdirectory:
- ocropus-0.2/ocroscript/rec-ltess.lua
- ocropus-0.2/ocroscript/rec-tess.lua
Of these two Lua files, rec-ltess.lua is used mainly to observe the performance of OCRopus layout analysis, and rec-tess.lua is used to observe the performance of character recognition.
Usage of the lua files for testing
rec-ltess.lua
-- Fall back to a sample page when no input image is given.
if #arg < 1 then
    arg = { "../../data/pages/alice_1.png" }
end
pages = Pages()
pages:parseSpec(arg[1])

segmenter = make_SegmentPageByRAST()
page_image = bytearray()
page_segmentation = intarray()
line_image = bytearray()
bboxes = rectanglearray()
costs = floatarray()

tesseract_recognizer = make_TesseractRecognizeLine()
tesseract.init("ban")

while pages:nextPage() do
    pages:getBinary(page_image)
    segmenter:segment(page_segmentation,page_image)
    regions = RegionExtractor()
    regions:setPageLines(page_segmentation)
    for i = 1,regions:length()-1 do
        regions:extract(line_image,page_image,i,1)
        fst = make_StandardFst()
        tesseract_recognizer:recognizeLine(fst,line_image)
        result = nustring()
        fst:bestpath(result)
        print(result:utf8())
    end
end
rec-tess.lua
require 'lib.util'
require 'lib.headings'
require 'lib.paragraphs'

if not tesseract then
    print "Compiled without Tesseract support, can't continue."
    os.exit(1)
end

opt,arg = getopt(arg)
if #arg == 0 then
    print "Usage: ocroscript rec-tess [--tesslanguage=...] input.png ... >output.hocr"
    os.exit(1)
end

set_version_string(hardcoded_version_string())

segmenter = make_SegmentPageByRAST()
page_image = bytearray()
page_segmentation = intarray()

function convert_RecognizedPage_to_PageNode(p)
    page = PageNode()
    page.width = p:width()
    page.height = p:height()
    page.description = p:description()
    for i = 0, p:linesCount() - 1 do
        local bbox = p:bbox(i)
        local text = nustring()
        p:text(text, i)
        page:append(LineNode(bbox, text))
    end
    return page
end

document = DocumentNode()
for i = 1, #arg do
    pages = Pages()
    pages:parseSpec(arg[i])
    while pages:nextPage() do
        pages:getBinary(page_image)
        segmenter:segment(page_segmentation,page_image)
        local p = RecognizedPage()
        tesseract_recognize_blockwise(p, page_image, page_segmentation)
        p = convert_RecognizedPage_to_PageNode(p)
        p.description = pages:getFileName()
        local regions = RegionExtractor()
        regions:setPageLines(page_segmentation)
        p.headings = detect_headings(regions, page_image)
        p.paragraphs = detect_paragraphs(regions, page_image)
        document:append(p)
    end
end
document:hocr_output()
Note: In ocropus-0.2, this script currently fails to produce correct Unicode output for both Bangla and Devanagari. The reason is that the language specified in the script is overridden in tesseract.cc. We therefore edited tesseract.cc as follows:
namespace ocropus {
    param_string tesslanguage("tesslanguage", "ban", "Specify the language for Tesseract");
}
With this change, the Unicode output is represented correctly.
Recognition issue
Isolated character recognition
Isolated character recognition with the Lua files is simple and needs only one command:
./ocroscript rec-tess.lua test-image.png > out-tess.html
Connected character recognition
To perform connected character recognition, we first have to perform character segmentation, so that each character is represented as a separate component. Examples of segmented images are shown in Figure 1 (Bangla) and Figure 2 (Devanagari). We included the segmentation algorithm inside the rec-tess.lua file, so the command for connected character recognition is the same as for isolated character recognition.
Figure 1: Example of Bangla word (a) test word (b) segmented units
Figure 2: Example of Devanagari word (a) test word (b) segmented units
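The post does not spell out the segmentation algorithm, so the following is only a minimal sketch of one common approach for scripts whose characters are joined by a headline. It is not the code that was added to rec-tess.lua. It assumes a binarized word image stored as a plain Lua table of 0/1 rows and a headline exactly one pixel row thick; the helper name segment_word is hypothetical and not part of OCRopus.

```lua
-- Hypothetical helper, NOT part of OCRopus: split a binarized word image into
-- character units by ignoring the headline row and cutting at empty columns.
-- `img` is a table of rows; each row is a table of 0 (background) / 1 (ink).
local function segment_word(img)
  local height, width = #img, #img[1]

  -- 1. Assume the headline is the single row with the most ink.
  local headline, best = 1, -1
  for y = 1, height do
    local count = 0
    for x = 1, width do count = count + img[y][x] end
    if count > best then headline, best = y, count end
  end

  -- 2. Count ink per column, skipping the headline row.
  local column_ink = {}
  for x = 1, width do
    local count = 0
    for y = 1, height do
      if y ~= headline then count = count + img[y][x] end
    end
    column_ink[x] = count
  end

  -- 3. Every maximal run of non-empty columns becomes one unit.
  local units, start = {}, nil
  for x = 1, width + 1 do
    local has_ink = x <= width and column_ink[x] > 0
    if has_ink and not start then
      start = x
    elseif not has_ink and start then
      units[#units + 1] = { first = start, last = x - 1 }
      start = nil
    end
  end
  return headline, units
end

-- Toy 5x7 "word": a full headline in row 1 joining two strokes.
local word = {
  {1,1,1,1,1,1,1},
  {0,1,1,0,0,1,0},
  {0,1,1,0,0,1,0},
  {0,1,0,0,0,1,0},
  {0,0,0,0,0,0,0},
}
local headline, units = segment_word(word)
for i, u in ipairs(units) do
  print(i, u.first, u.last)  -- unit index, first column, last column
end
```

On the toy word, the two strokes come out as separate units (columns 2 to 3 and column 6) once the headline row is ignored. Real scans have thicker headlines and characters that touch below the headline, so a practical segmenter needs more than this projection-based cut.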
=====
Acknowledgement: Souro Chowdhury
4 comments:
The Bangla OCR has turned out very well. Please try to make the Bangla OCR a little better still. I mean, like the English one.
Many thanks for your appreciation. We are doing our best to reach our goal of making Bangla OCR more efficient.
Hello,
I am also using ocropus.
I am getting an error when I try to run the following:
ocroscript rec-tess file.png
I get the following error
no file '/usr/local/share/ocropus/scripts//rec-tess.lua'
no file './rec-tess.so'
no file '/usr/local/lib/lua/5.1/rec-tess.so'
no file '/usr/local/lib/lua/5.1/loadall.so'
Any help will be appreciated.
Thanks a lot