Sunday, September 9, 2007

:-- Optical Character Recognition --:


Name:: BanglaOCR

Summary::

This project aims to develop an Optical Character Recognizer that can recognize Bangla script. The OCR research and development work is divided into five parts: Preprocessing, Feature Extraction, Training, Recognition and Post-processing. We experimented with several techniques for each part and chose the most appropriate methods for our implementation. We use the Hidden Markov Model (HMM) technique for pattern training and classification, and the Hidden Markov Model Toolkit (HTK) to implement the training and classification tasks.


Details::

BanglaOCR is the Optical Character Recognizer for Bangla Script. It takes scanned images of a printed page or document as input and converts them into editable Unicode text. BanglaOCR allows users to train the data set from any document and observe the recognition performance.

BanglaOCR consists of several independent parts, listed below:

  • Preprocessing
  • Feature Extraction
  • Training
  • Recognition
  • Post-processing


The Preprocessing task involves image acquisition, binarization, noise elimination, skew correction, line and word separation, and character segmentation. In this step, segmentation proceeds down to minimally segmented characters.
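Two of these steps, binarization and line separation, can be illustrated with a minimal sketch. It assumes a grayscale page stored as a list of pixel rows, a fixed global threshold, and line separation by scanning for rows without ink; the function names and the threshold are hypothetical, and BanglaOCR's own preprocessing methods may differ.

```python
def binarize(gray, threshold=128):
    """Map a grayscale image (rows of 0-255 values) to 1 = ink, 0 = background.

    A fixed global threshold is an assumption made for this sketch.
    """
    return [[1 if pixel < threshold else 0 for pixel in row] for row in gray]


def split_lines(binary):
    """Return (top, bottom) row ranges of text lines, bottom exclusive.

    A text line is taken to be a maximal run of rows containing ink.
    """
    lines = []
    start = None
    for y, row in enumerate(binary):
        has_ink = any(row)
        if has_ink and start is None:
            start = y                      # first inked row of a new line
        elif not has_ink and start is not None:
            lines.append((start, y))       # blank row closes the current line
            start = None
    if start is not None:                  # line that touches the bottom edge
        lines.append((start, len(binary)))
    return lines
```

The same row scan, applied to columns within one text line, gives a rough word separation. It only works once skew has been corrected, which is why skew correction comes earlier in the list.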

The Feature Extraction task involves extracting meaningful characteristics from a minimally segmented character. First we divide the character image into several frames using a fixed frame length. Then we perform a Discrete Cosine Transform (DCT) calculation over each frame. We take the number of frames and the DCT values calculated for each frame as the features of each character.
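The framing and DCT steps can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: frames are taken as vertical strips of a fixed width, each strip is flattened row by row, and an unnormalized one-dimensional DCT-II is applied with only the first few coefficients kept. The names `frame_width` and `num_coeffs` are hypothetical.

```python
import math


def dct_ii(values, num_coeffs):
    """First num_coeffs coefficients of the unnormalized 1-D DCT-II."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi / n * (i + 0.5) * k) for i, v in enumerate(values))
        for k in range(num_coeffs)
    ]


def extract_features(image, frame_width=2, num_coeffs=4):
    """Split a character image into vertical frames and transform each one.

    image is a list of equal-length pixel rows. The result has one
    coefficient vector per frame, so len(result) is the frame count.
    """
    width = len(image[0])
    features = []
    for start in range(0, width, frame_width):
        # Flatten the strip row by row; the last strip may be narrower.
        frame = [p for row in image for p in row[start:start + frame_width]]
        features.append(dct_ii(frame, num_coeffs))
    return features
```

With a fixed frame width, wider characters yield more frames, so the frame count itself carries information about the character, in line with the feature description above.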

Training is performed on the features calculated for each minimally segmented character image. We create a separate HMM for each segmented character image. Model creation involves dynamically choosing a prototype HMM and then building a model from that prototype and the extracted feature data. This is done automatically by invoking the HInit tool of HTK.
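To make the idea of a prototype HMM concrete, the sketch below writes out a prototype definition in HTK's text format: a left-to-right model with zero means, unit variances, and the `<USER>` parameter kind for user-defined features. How BanglaOCR actually chooses the number of states is not described here, so `num_emitting_states` is simply a parameter, and the function names are hypothetical.

```python
def transition_matrix(num_states):
    """Left-to-right transitions, including non-emitting entry and exit states."""
    rows = []
    for i in range(num_states):
        row = [0.0] * num_states
        if i == 0:
            row[1] = 1.0                 # entry state always moves to state 2
        elif i < num_states - 1:
            row[i] = 0.5                 # stay in the same state
            row[i + 1] = 0.5             # or advance to the next one
        rows.append(row)                 # the exit row stays all zeros
    return rows


def make_prototype(name, vec_size, num_emitting_states):
    """Return the text of a prototype HMM definition (a sketch)."""
    n = num_emitting_states + 2
    zeros = " ".join(["0.0"] * vec_size)
    ones = " ".join(["1.0"] * vec_size)
    lines = [
        f"~o <VecSize> {vec_size} <USER>",
        f'~h "{name}"',
        "<BeginHMM>",
        f"<NumStates> {n}",
    ]
    for state in range(2, n):            # emitting states are numbered 2..n-1
        lines += [f"<State> {state}",
                  f"<Mean> {vec_size}", " " + zeros,
                  f"<Variance> {vec_size}", " " + ones]
    lines.append(f"<TransP> {n}")
    for row in transition_matrix(n):
        lines.append(" " + " ".join(f"{p:.1f}" for p in row))
    lines.append("<EndHMM>")
    return "\n".join(lines) + "\n"
```

A prototype like this only fixes the model topology and the feature vector size. The actual means, variances and transition probabilities are estimated from the training data by HInit.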

The Recognition process invokes HVite, the recognition tool of HTK. HVite takes the following inputs: the feature file of the segmented character image, with the features written in a specified format; the word network, built from the task grammar, that describes the allowable word sequences; the dictionary that defines each character or word; the complete list of HMMs; and the .mmf file that contains the definition of each HMM. All of these files are written in formats that HTK understands. After the recognition process is completed, the model name is read from the Master Label File (.mlf) and the Unicode character associated with the recognized model is written to the output file.
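The last step, reading recognized model names from the label file that HVite writes and mapping them to Unicode, can be sketched as below. The sample file content, the model names `ka` and `kha`, and the mapping table are hypothetical; the sketch assumes each label line holds either a bare model name or a start time, an end time, a model name and a score.

```python
def parse_mlf(text):
    """Return {file name: [model names]} from Master Label File text."""
    results = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "#!MLF!#":
            continue                          # skip blanks and the header
        if line.startswith('"'):
            current = line.strip('"')         # start of one file's labels
            results[current] = []
        elif line == ".":
            current = None                    # end of that file's labels
        elif current is not None:
            fields = line.split()
            # The name follows the start and end times when they are present.
            timed = len(fields) >= 3 and fields[0].isdigit()
            results[current].append(fields[2] if timed else fields[0])
    return results


def to_unicode(model_names, table):
    """Join the Unicode characters associated with recognized model names."""
    return "".join(table[name] for name in model_names)


# Hypothetical mapping from model names to Bangla characters.
MODEL_TO_CHAR = {"ka": "\u0995", "kha": "\u0996"}

SAMPLE_MLF = '''#!MLF!#
"*/word001.rec"
0 1500000 ka -812.5
1500000 3200000 kha -790.1
.
'''
```

For the sample above, `to_unicode(parse_mlf(SAMPLE_MLF)["*/word001.rec"], MODEL_TO_CHAR)` yields the two characters কখ.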

The final task that BanglaOCR performs is post-processing. This involves a suggestion-based spell checker that can identify erroneous words and produce suggestions. It takes the recognizer’s output file as input and produces an output file that marks the erroneous words and provides up to a certain number of suggestions for each.
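A suggestion-based checker of this kind can be sketched with a word list and an edit-distance ranking. Both are assumptions made for illustration: the technique used by the BanglaOCR spell checker is not described here, and the lexicon, `max_suggestions` and `max_distance` are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, counted in code points."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def suggest(word, lexicon, max_suggestions=3, max_distance=2):
    """Return [] for a known word, otherwise the closest lexicon entries."""
    if word in lexicon:
        return []
    scored = sorted((edit_distance(word, w), w) for w in lexicon)
    return [w for d, w in scored if d <= max_distance][:max_suggestions]


def check_text(words, lexicon):
    """Pair each erroneous word with its suggestions."""
    return [(w, suggest(w, lexicon)) for w in words if w not in lexicon]
```

The example uses Latin strings only for readability. Because Python strings are sequences of code points, the same functions apply to Bangla words, although a production checker would likely need to treat conjuncts and vowel signs as units.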

The goal of the BanglaOCR project is to develop a market-standard multilingual OCR system capable of digitizing a wide range of Bangla document images. This will help to archive documents from all spheres and prevent the damage and loss of valuable documents and books.


Team::

Status::

  • The first version of the open source BanglaOCR has been released under the GNU General Public License (GPL) version 2 or later.
  • Research work on different parts of BanglaOCR is continuing to enhance the performance and usability.


Research Scope::

  • Research on Feature Extraction of Bengali Characters.
  • Research on proper Segmentation of Bengali Characters from any type of document image.
  • Research on the preprocessing of historical Bangla Document Image.
  • Research on Bangla Handwritten Images.
  • Research on the Training and Recognition using different techniques.
  • Research on Multi Lingual OCR.

Development Scope::

  • Reimplement the existing versions in different programming languages.

Timeline:: 2007 – 2009.

Character Recognition


Character recognition techniques associate a symbolic identity with the image of a character. Character recognition is commonly referred to as optical character recognition (OCR), as it deals with the recognition of optically processed characters. The modern version of OCR appeared in the middle of the 1940s with the development of digital computers. OCR machines have been commercially available since the middle of the 1950s. Today OCR systems are available both as hardware devices and software packages, and a few thousand systems are sold every week.

In a typical OCR system, input characters are digitized by an optical scanner. Each character is then located and segmented, and the resulting character image is fed into a preprocessor for noise reduction and normalization. Certain characteristics are then extracted from the character for classification. Feature extraction is critical, and many different techniques exist, each having its strengths and weaknesses. After classification the identified characters are grouped to reconstruct the original symbol strings, and context may then be applied to detect and correct errors.

Optical character recognition has many different practical applications. The main areas where OCR has been of importance are text entry (office automation), data entry (banking environment) and process automation (mail sorting).

The present state of the art in OCR has moved from primitive schemes for limited character sets to the application of more sophisticated techniques for omnifont and handprint recognition. The main problems in OCR usually lie in the segmentation of degraded symbols which are joined or fragmented. Generally, the accuracy of an OCR system is directly dependent upon the quality of the input document. Three figures are used in ratings of OCR systems: correct classification rate, rejection rate and error rate. The performance should be rated from the system's error rate, as these errors go undetected by the system and must be manually located for correction.
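The three rating figures can be computed from a labeled test set as in the sketch below. It assumes that the recognizer reports a rejected character as `None`; the function name and the sample data are hypothetical.

```python
def ocr_rates(truth, predicted):
    """Return correct classification, rejection and error rates.

    truth holds the true characters; predicted holds the recognizer's
    output, with None marking a character the system rejected.
    """
    total = len(truth)
    rejected = sum(p is None for p in predicted)
    correct = sum(p == t for t, p in zip(truth, predicted))
    errors = total - correct - rejected   # accepted but wrong
    return {
        "correct": correct / total,
        "rejection": rejected / total,
        "error": errors / total,
    }


rates = ocr_rates(list("abcd"), ["a", None, "x", "d"])
```

Here one character in four is rejected and one in four is misread. Only the rejection is visible to the system, which is why the error rate is the figure that determines how much manual proofreading remains.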

In spite of the great number of algorithms that have been developed for character recognition, the problem is not yet solved satisfactorily, especially in cases where there are no strict limitations on the handwriting or quality of print. Up to now, no recognition algorithm can compete with man in quality. However, as the OCR machine is able to read much faster, it is still attractive.

In the future the area of recognition of constrained print is expected to decrease, as such material may be exchanged electronically or printed in a more computer-readable form, for instance barcodes. Emphasis will then be on the recognition of unconstrained writing, like omnifont and handwriting. This is a challenge which requires improved recognition techniques. The potential for OCR algorithms seems to lie in the combination of different methods and the use of techniques that are able to utilize context to a much larger extent than current methodologies.

The applications for future OCR systems lie in the recognition of documents where control over the production process is impossible. This may be material where the recipient is cut off from an electronic version and has no control of the production process, or older material which at production time could not be generated electronically. This means that future OCR systems intended for reading printed text must be omnifont.

Another important area for OCR is the recognition of manually produced documents.

Within postal applications, for instance, OCR must focus on reading addresses on mail produced by people without access to computer technology. Already, it is not unusual for companies and others with access to computer technology to mark mail with barcodes. The relative importance of handwritten text recognition is therefore expected to increase.

[Source]: Line Eikvil, "Optical Character Recognition", available at: citeseer.ist.psu.edu/142042.html