Tuesday, July 31, 2007

Related Works: Page Layout analysis for Bangla Document Image


A. Ray Chaudhuri, A. K. Mandal and B. B. Chaudhuri in [1] present a page layout analyzer for multilingual Indian documents, more specifically for Brahmi-script-based documents. The headline (matra in Bangla, sirorekha in Hindi) is identified as the main feature of these documents. They apply a bottom-up approach to page layout analysis: text regions are extracted as rectangular blocks containing moderate-size connected components that satisfy homogeneity or compatibility in terms of the headline and other (textual) features particular to the scripts considered. The degree of homogeneity is calculated from a compatibility criterion defined over the selected features. Textual regions, or blocks, are generated from the optimal bounding boxes (BBs) of all connected components.

In their work, the likelihood of a headline in a word and of a vertical line in a word is first estimated from a statistical analysis of a large set of documents. The primary features are extracted from the BBs, viz. corners, the headline (its position and thickness), the vertical line in the middle zone of the UMD, the lengthwise distribution of the UMD, and the average number of object pixels per BB, i.e. the density of the components. From these primary features, some secondary features are calculated: the average pixel density of each component, the average vertical span of M within a block, the average horizontal gap among closest pairs of BBs, the inter-headline distances between vertically adjacent BBs, the inter-block distance (both horizontal and vertical), and the ratio of the number of components between adjacent blocks. The compatibility criterion for any two scalar quantities L1 and L2 is defined as:

Comp (L1, L2) = ½(|1 – L1/L2| + |1 – L2/L1|)
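
For concreteness, here is a minimal Python sketch of this criterion; the function name and the example values are mine, not from the paper:

```python
# A toy check of the compatibility criterion; the function name and the
# example values below are illustrative assumptions, not from the paper.

def compatibility(l1: float, l2: float) -> float:
    """Comp(L1, L2) = 1/2 (|1 - L1/L2| + |1 - L2/L1|); 0 means identical."""
    return 0.5 * (abs(1.0 - l1 / l2) + abs(1.0 - l2 / l1))

# Example: headline thicknesses of 4 px and 5 px give a small value,
# so the two components are considered compatible.
print(compatibility(4, 5))  # 0.225
```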

The entire process is completed in three modules. The task of the first module is block construction using the primary features; the constructed blocks and the remaining BBs are labeled into seven categories (C1 – C7). The second module applies to pairs of blocks where one block (Fin) resides completely or partially inside the other (Fout). The third module performs block merging, which applies only to those blocks not yet classified as text (although they are real text, e.g. title text).

S. Khedekar, V. Ramanaprasad, S. Setlur and V. Govindaraju in [2] present a top-down, projection-profile-based algorithm to separate text blocks from image blocks in a Devanagari document. They use the distinctive feature called Shirorekha (header line) to analyze the pattern. The horizontal profile corresponding to a text block possesses a certain regularity in frequency and orientation and shows spatial cohesion. Their algorithm uses these features to identify text blocks.

The algorithm first generates the horizontal histogram of the entire image. When the horizontal histogram of the document image is plotted, the rows where the Shirorekha is present have the maximum number of black pixels. The patterns (profiles) formed by text blocks are characterized by runs of thick black peaks separated by white gaps. Any graphics or images, by contrast, have a relatively uniform pattern with no prominent peaks. Another key property of a text block is that its histogram possesses a certain frequency and orientation and shows spatial cohesion, i.e. adjacent lines are of similar height. Since text lines appear adjacent to each other, the algorithm looks for adjacent patterns with identical shape and size distribution. If such a profile is found, that portion of the document must contain Devanagari text; any deviation from this characteristic must be an image, with or without surrounding text.
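
As an illustration of this projection-profile analysis, here is a minimal Python sketch, assuming a binarized page stored as a NumPy array with black (ink) pixels equal to 1; the peak threshold is an illustrative assumption, not a value from the paper:

```python
# A minimal sketch of the horizontal projection profile described above.
import numpy as np

def horizontal_profile(page: np.ndarray) -> np.ndarray:
    """Black-pixel count per row; Shirorekha rows appear as sharp peaks."""
    return page.sum(axis=1)

def headline_rows(page: np.ndarray, factor: float = 0.8) -> np.ndarray:
    """Rows whose counts exceed factor * max are header-line candidates.
    The 0.8 factor is an illustrative assumption, not from the paper."""
    profile = horizontal_profile(page)
    return np.flatnonzero(profile >= factor * profile.max())
```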

References:

[1]. A. Ray Chaudhuri, A. K. Mandal, B. B. Chaudhuri, “Page Layout Analyzer for Multilingual Indian Documents”, Proc. of the Language Engineering Conference (LEC’02), 2002.

[2]. S. Khedekar, V. Ramanaprasad, S. Setlur, V. Govindaraju, “Text-image separation in Devanagari documents”, Proc. of the Seventh International Conference on Document Analysis and Recognition, 2003.

Wednesday, July 4, 2007

Planning for the Development of a Database of Document Images for Research on Bangla Optical Character Recognition


First we need to complete a basic study of the existing, established databases of document images used for research. I have observed that the most widely used database for research on document image analysis and recognition is the University of Washington database, whose creation started around 1990. Papers related to this database were published in 1993 at the Second International Conference on Document Analysis and Recognition. So we need to know the details of the techniques associated with the creation of this database. Some related papers are listed in the Required Study Materials section.

Next we have to look at existing databases of Bangla document images and their availability. I have found that a similar task has been done at the “Resource Centre for Indian Language Technology Solutions Bangla” with a limited variety of documents from Bangla novels. I tried to get access to that resource, but unfortunately failed. So we need to build a database with a large variety of documents and distribute it openly.

The following are the basics of the UW databases.

University of Washington English Document Image Database

A series (UW-I, UW-II, UW-III) of document image databases was produced by the Intelligent Systems Laboratory at the University of Washington, Seattle, Washington, USA.

The most widely used database for current research and performance evaluation is the University of Washington III (UW-III) database. It consists of 1600 English document images with bounding boxes of 24177 homogeneous page segments or blocks, which have been manually labeled into different classes depending on their contents, making the data very suitable for evaluating a block classification system. The documents in the UW-III dataset are categorized by degradation type as follows:

1. Direct scans of original English journals

2. Scans of first generation English journal photocopies

3. Scans of second or later generation English journal photocopies

Many of the documents in the dataset are duplicated and sometimes differ only in the degradation applied to them. This type of collection is useful when evaluating a page segmentation algorithm, to see how well the algorithm performs when photocopy degradation is applied to a document.

UW – I

Contains 1147 document page images from English scientific and technical journals, including:

  • Binary images scanned from 1st and other generation photocopies
  • Binary and grayscale images scanned directly from technical journals
  • Synthetic noise-free images generated from LaTeX files
  • Document images from UNLV ISRI database
  • All document images zoned and tagged
  • Software for OCR performance evaluation
  • Software for simulation of photocopy degradation
  • Text ground truth generated by two independent data-entry operators followed by three independent verifications
  • The AT&T Bell Labs degraded character images database

Each document page has associated with it

  • Text ground truth data for each text zone
  • Bounding box information for each zone on the page
  • Coarse level attributes for each document page
  • Finer level attributes (such as font size, alignment etc.) for each zone
  • Qualitative information on the condition of each page

UW – II

Contains the following:

  • 624 English journal document pages
    • 43 Complete articles
    • 63 MEMO pages
  • 477 Japanese journal document pages
  • Illuminator software produced by RAF
  • Corrected data files for the known errors in UW-I

Each document page has associated with it

  • Text ground truth data for each text zone
  • Bounding box information for each zone on the page
  • Coarse level attributes for each document page
  • Finer level attributes (such as font size, alignment etc.) for each zone
  • Qualitative information on the condition of each page

Illuminator is an editor for building document image understanding test and training sets, for easy correction of OCR errors, and for reverse-encoding the essential information and attributes of a document image page. It was developed by RAF Technology, Inc., under the auspices of ARPA.

UW – III

Contains the following:

  • 25 pages of chemical formulae and ground-truth in XFIG
  • 25 pages of mathematical formulae and ground-truth in XFIG and LaTeX
  • 40 pages of line drawings and ground-truth in AUTOCAD and IGES
  • 44 TIFF images of engineering drawings, ground-truthed in AUTOCAD
  • 33 TIFF images of English text generated from LaTeX, with character-level ground-truth
  • 33 printed and scanned pages containing English text, with character-level ground-truth
  • 979 pages from UW-I in DAFS format, corrected for skew, with word bounding boxes for each word on these pages
  • 623 pages from UW-II in DAFS format, corrected for skew, with word bounding boxes for each word on 607 of these pages
  • Software for:

- Generating ground-truth for synthetic documents;

- Synthetically degrading document images;

- Registering ideal document images to real document images;

- Transferring character ground-truth between a pair of registered images;

- Estimating document degradation model parameters;

- Validating degradation models;

- Formatting Arabic, Hindi, and music documents in LaTeX;

- Testing hypotheses about parameters from a multivariate Gaussian population

- TIFF libraries (public domain software from ftp.sgi.com)

Required Study Materials:

Research Papers:

[1]

I.T. Phillips, S. Chen, J. Ha, and R.M. Haralick, English document database design and implementation methodology, Proc. of the Second Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, NV, April 1993, 65-104.

[2]

Guyon, I., Haralick, R. M., Hull, J. J., and Phillips, I. T. (1997). Data sets for OCR and document image understanding research. In Bunke, H. and Wang, P., editors, Handbook of character recognition and document image analysis, pages 779–799. World Scientific, Singapore.

[3]

Phillips, Ihsin T.; Ha, Jaekyu; Chen, Su; Haralick, Robert M., "Implementation methodology and error analysis for the University of Washington English Document Image Database-I", Proc. SPIE Vol. 2103, p. 155-173, 22nd AIPR Workshop: Interdisciplinary Computer Vision: Applications and Changing Needs.

Abstract: Document database preparation is a very time-consuming job and usually requires the involvement of many people. Any database is prone to having errors however carefully it was constructed. To assure the high quality of the document image database, a carefully planned implementation methodology is absolutely needed. In this paper, an implementation methodology that we employed to produce the UW English Document Image Database I is described. The paper also discusses how to estimate the distribution of errors contained in a database based on a double-data entry/double verification procedure.

[4]

Bhattacharya, U.; Chaudhuri, B.B., "Databases for research on recognition of handwritten characters of Indian scripts", Proc. of the Eighth International Conference on Document Analysis and Recognition, 2005, Vol. 2, pp. 789-793.

Abstract: Three image databases of handwritten isolated numerals of three different Indian scripts, namely Devnagari, Bangla and Oriya, are described in this paper. Grayscale images of 22556 Devnagari numerals written by 1049 persons, 12938 Bangla numerals written by 556 persons and 5970 Oriya numerals written by 356 persons form the respective databases. These images were scanned from three different kinds of handwritten documents: postal mail, job application forms, and another set of forms specially designed by the collectors for the purpose. The only restriction imposed on the writers was to write each numeral within a rectangular box. These databases are thus free from two common limitations: they were not developed in a laboratory environment, and they are not non-uniformly distributed over the different classes. Also, for comparison purposes, each database has been properly divided into respective training and test sets.

Web Links:

[1] http://www.isical.ac.in/~ujjwal/download/database.html [Information about the Bangla Character Recognition database.]

[2] http://www.science.uva.nl/research/dlia/datasets/uwash1.html [UW English Document Image Database-I]

[3] http://documents.cfar.umd.edu/resources/database/UWII.html [UW-II English/Japanese Document Image Database]

[4] http://documents.cfar.umd.edu/resources/database/3UWCdRom.html [UW-III English/Technical Document Image Database]

[5] http://www.nist.gov/srd/ [NIST Scientific and Technical Database]

[6] http://www.cfar.umd.edu/~kia/ocr-faq.html#SECTION00022000000000000000 [OCR Frequently Asked Questions.]

[7] http://www.isical.ac.in/~rc_bangla/products.html#ocr [Resource Centre for Indian Language Technology Solutions Bangla]

Feature-based Approach for Page Layout Extraction


This post is essentially a digest of the paper "Document Image Zone Classification: A Simple High-Performance Approach" by Daniel Keysers, Faisal Shafait, and Thomas M. Breuel (full citation at the end of the post).

Introduction

In this paper we address this shortcoming by comparing a large set of commonly used features for block classification and include in the comparison three features that are known to yield good performance in content-based image retrieval (CBIR) and are applicable to binary images (Deselaers et al., 2004). Interestingly, we found that the single feature with the best performance is the Tamura texture histogram, which belongs to this latter class.


Related Work and Contribution

Inglis and Witten (Inglis and Witten, 1995) used seven features based on connected components and run lengths and applied various machine learning techniques to the problem. In the paper of Okun et al. (Okun et al., 1999), the predominant feature type is based on connected-component and run-length statistics. Other features used include the cross-correlation between scan lines, vertical projection profiles, wavelet coefficients, learned masks, and the black pixel distribution. The most commonly used classifier is a neural network.

The widespread use of features based on connected-component and run-length statistics, combined with the simplicity of implementing such features, led us to use these feature types in our experiments as well, comparing them to features used in content-based image retrieval. Our CBIR features are based on the open-source image retrieval system FIRE (Deselaers et al., 2004). We restrict our analysis for zone classification to those features that are promising for the analysis of binary images.

The most recent and detailed overview of the progress in document zone classification and a very accurate system is presented in (Wang et al., 2006). The authors use a decision tree classifier and model contextual dependencies for some zones. In our work we do not model zone context, although it is likely that a context model (which can be integrated in a similar way as presented by Wang et al.) would help the overall classification performance.

We expand on the work presented in (Wang et al., 2006) in the following ways:

• We include a detailed feature comparison, including a comparison with commonly used CBIR features. It turns out that the single best feature is the Tamura texture histogram, which had not previously been used for zone classification.

• We present results both for a simple nearest neighbor classifier and for a very fast linear classifier based on logistic regression and the maximum entropy criterion.

• We introduce a new class of blocks containing speckles, which has not been labeled in the UW-III database. This typical class of noise is important to detect during layout analysis, especially for images of photocopied documents.

• We present results for the part of the UW-III database without using duplicates and achieve a similar error rate of 1.5%.

• We introduce the use of histograms for the measurements of connected components and run lengths and show that this leads to a performance increase.


Feature Extractions

We extract the following features from each block. Features 1–3 are chosen based on their performance in CBIR (Deselaers et al., 2004), feature 4 was expected to help distinguish between the types ‘drawing’ and ‘text’, and features 5–9 were chosen based on their common use in block classification (Okun et al., 1999; Wang et al., 2006). Due to space limitations we refer the interested reader to the references for implementation details.

1. Tamura texture features histogram (TTFH)

2. Relational invariant feature histograms (RIFH)

3. Down-scaled images of size 32×32 (DSI)

4. The fill ratio, i.e. the ratio of the number of black pixels in a horizontally smeared (Wong et al., 1982) image to the area of the image (FR)

5. Run-length histograms of black and white pixels along the horizontal, vertical, main diagonal, and side diagonal directions; each histogram uses eight bins, spaced apart as powers of 2, i.e. counting runs of length ≤ 1, ≤ 3, ≤ 7, ≤ 15, ≤ 31, ≤ 63, ≤ 127 and ≥ 128 (RL{B,W}{X,Y,M,S}H); see the sketch after this list

6. The vector formed by the total number, mean, and variance of the runs of black and white pixels along the horizontal, vertical, main diagonal, and side diagonal directions as used in (Wang et al., 2006) (RL{B,W}{X,Y,M,S}V)

7. Histograms (as in 5) of the widths and heights of connected components (CCXH, CCYH)

8. The joint distribution of the widths and heights of connected components as a 2-dimensional 64-bin histogram (CCXYH)

9. The histogram of the distances between a connected component and its nearest neighbor component (CCNNH)
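
To make feature 5 concrete, here is a minimal sketch that computes the power-of-2 run-length histogram for black runs along the horizontal direction only (RLBXH); the function and constant names are mine, and the paper's implementation details may differ:

```python
# A minimal sketch of feature 5, restricted to horizontal black runs.
import numpy as np

BIN_EDGES = [1, 3, 7, 15, 31, 63, 127]  # runs <= edge; the 8th bin is >= 128

def run_lengths(row: np.ndarray, value: int = 1) -> list:
    """Lengths of the maximal runs of `value` in a 1-D 0/1 array."""
    runs, count = [], 0
    for pixel in row:
        if pixel == value:
            count += 1
        elif count:
            runs.append(count)
            count = 0
    if count:
        runs.append(count)
    return runs

def run_length_histogram(block: np.ndarray) -> np.ndarray:
    """8-bin power-of-2 histogram of horizontal black runs in a block."""
    hist = np.zeros(8, dtype=int)
    for row in block:
        for length in run_lengths(row):
            hist[np.searchsorted(BIN_EDGES, length)] += 1
    return hist
```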

Fig: Examples of document image block types distinguished in our approach

Classifications

To evaluate the various features, we use a simple nearest-neighbor classifier; that is, a test sample is classified into the class of the closest training sample. The distance measures used are the Jensen-Shannon divergence for histograms and the Euclidean distance for all other features (Deselaers et al., 2004). When different feature sets are combined, the overall distance is calculated as the weighted sum of the individual normalized distances, with weights proportional to the inverse of the error rate of each feature.
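
A minimal sketch of these two ingredients, the Jensen-Shannon divergence and the weighted combination of normalized distances, follows; the names are mine, and details such as normalization may differ from the authors' implementation:

```python
# A sketch of the distance machinery used by the nearest-neighbor classifier.
import numpy as np

def jensen_shannon(p: np.ndarray, q: np.ndarray) -> float:
    """JS divergence between two histograms (normalized internally)."""
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):  # Kullback-Leibler divergence, ignoring zero bins of a
        mask = a > 0
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def combined_distance(normalized_dists, error_rates):
    """Overall distance: weights proportional to inverse per-feature error."""
    weights = 1.0 / np.asarray(error_rates, dtype=float)
    weights /= weights.sum()
    return float(np.dot(weights, np.asarray(normalized_dists, dtype=float)))
```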


Source: Daniel Keysers, Faisal Shafait, Thomas M. Breuel, "Document Image Zone Classification: A Simple High-Performance Approach", VISAPP 2007, pages 44-51

Sunday, July 1, 2007

Enhanced Constrained Run-Length Algorithm


Introduction

This Enhanced CRLA is a solution to the limitations of the original CRLA: it can be applied successfully to non-Manhattan page layouts (irregularly shaped graphics embedded in text), and it can also extract text surrounded by a box. Neither case can be handled by the original CRLA.

Preprocess

The proposed new version of the CRLA is performed on a label image, which is derived from the input document image and has the same size as it. To generate the label image, a connected-component labeling algorithm is used to locate the foreground components in the document image and calculate their sizes. The label image is then generated by assigning each of its pixels a label according to the size of the foreground component the pixel belongs to. Three labels are defined in the present system based on two threshold values (thVal1 and thVal2), as follows:

(1) If the height of the foreground component is less than thVal1, its pixels in the label image are assigned the label of “1”.

(2) If the height of the foreground component is between thVal1 and thVal2, its pixels in the label image are assigned the label of “2”.

(3) If the height of the foreground component is larger than thVal2, its pixels in the label image are assigned the label of “3”.

The height is used instead of the width when checking the component size because the text on the document is assumed to be horizontally aligned. As the characters in a text line may be connected in the scanned document image, checking the width could lead to misjudgment of the component size.
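
Here is a minimal sketch of this preprocessing step, using scipy.ndimage for the connected-component labeling; the concrete threshold values are placeholders, since the paper derives thVal1 and thVal2 from document analysis:

```python
# A sketch of the label-image construction described above.
import numpy as np
from scipy import ndimage

def make_label_image(binary_image: np.ndarray,
                     th_val1: int = 30, th_val2: int = 60) -> np.ndarray:
    components, n = ndimage.label(binary_image)
    label_image = np.zeros(binary_image.shape, dtype=np.uint8)
    for i, sl in enumerate(ndimage.find_objects(components), start=1):
        height = sl[0].stop - sl[0].start  # component height, as in the text
        if height < th_val1:
            label = 1       # small components: body-text sized
        elif height <= th_val2:
            label = 2       # medium components: headline sized
        else:
            label = 3       # large components: graphics / halftone images
        label_image[sl][components[sl] == i] = label
    return label_image
```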

After the label image has been generated, the new CRLA is executed on it in the following manner. Consider, for example, the label string below:

110001110002003330000110000000111.

Set a constraint C = 5 on the run length of zeros that are surrounded by the label 1 on both sides: if such a run of consecutive zeros is not longer than C, the zeros are replaced by ones. Under this operation, the preceding string is converted into

111111110002003330000110000000111.

This operation is referred to as selective-CRLA{1}, where “selective” means it is applied only to those zeros surrounded by the specific label(s) given in the braces. The selective CRLA is performed with a two-pass processing scheme on the label image to separate text from graphics and halftone-image components.
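
Here is a minimal sketch of the selective-CRLA smearing on a single row of the label image, reproducing the example above; the function and variable names are mine:

```python
# Selective-CRLA on one row: zero runs no longer than C, bounded on both
# sides by labels in `allowed`, are filled with ones.
import numpy as np

def selective_crla_row(row: np.ndarray, c: int, allowed: set) -> np.ndarray:
    out = row.copy()
    n = len(out)
    i = 0
    while i < n:
        if out[i] == 0:
            j = i
            while j < n and out[j] == 0:
                j += 1
            left_ok = i > 0 and out[i - 1] in allowed
            right_ok = j < n and out[j] in allowed
            if left_ok and right_ok and (j - i) <= c:
                out[i:j] = 1  # smear the short gap
            i = j
        else:
            i += 1
    return out

# Reproduces the example above:
row = np.array([int(ch) for ch in "110001110002003330000110000000111"])
print("".join(map(str, selective_crla_row(row, c=5, allowed={1}))))
# -> 111111110002003330000110000000111
```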

Process

First Pass

The process of the first pass uses the following steps to accomplish page segmentation.

1. Apply selective-CRLA{1} row by row to the label image using constraint Chor-1.

2. Apply selective-CRLA{1} column by column to the label image using a constraint Cver-1.

3. Combine the images yielded by steps 1 and 2 using a logical AND operation.

4. Apply an additional selective-CRLA{1} row by row to the image yielded by step 3 using a constraint Csm-1.

The purposes of steps 1 and 2 are to link the foreground components horizontally and vertically. Step 3 cuts the joined foreground areas according to the spaces between columns and between text lines. Finally, step 4 removes the small gaps that may remain between characters in a text line. The parameters Chor-1, Cver-1, and Csm-1 are chosen according to an analysis of sample documents (see Post 2 for details).
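
As a sketch of how these four steps might be composed, reusing selective_crla_row from the previous snippet; the constraint values below are placeholders, not the parameters used in the paper:

```python
# A sketch of the first pass; requires selective_crla_row defined above.
import numpy as np

def first_pass(label_image: np.ndarray,
               c_hor: int = 100, c_ver: int = 50, c_sm: int = 20) -> np.ndarray:
    # Steps 1 and 2: smear rows and columns of the original label image.
    by_rows = np.array([selective_crla_row(r, c_hor, {1}) for r in label_image])
    by_cols = np.array([selective_crla_row(col, c_ver, {1})
                        for col in label_image.T]).T
    # Step 3: logical AND keeps only pixels set in both smeared images
    # (labels are carried over from the row-smeared image here).
    combined = np.where((by_rows > 0) & (by_cols > 0), by_rows, 0)
    # Step 4: close small gaps left between characters within a text line.
    return np.array([selective_crla_row(r, c_sm, {1}) for r in combined])
```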

After the document image has been divided into homogeneous regions, these regions must be further classified as text or non-text according to their features. A number of statistical measurements have been proposed for distinguishing between text and non-text regions in binary document images. Two of them are adopted in the present system: (i) the mean length of the horizontal black runs and (ii) the white-black transition count per unit width. These two features can be calculated by scanning a region from left to right in a row-by-row manner. As the scanning proceeds, the horizontal black run length is accumulated in a counter, BRL, and the white-black transition count in another counter, TC. After the whole region has been scanned, the two features are computed as follows. Mean length of horizontal black runs:

MBRL = BRL/TC -- (1)

White-black transition count per unit width

MTC = TC/W -- (2)

where W is the width (in pixels) of the region under consideration. If MBRL_min <= MBRL <= MBRL_max and MTC_min <= MTC <= MTC_max, the region is determined to be text. The values of MBRL_min, MBRL_max, MTC_min and MTC_max in the present system are determined from tests performed on several document images.
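
A minimal sketch of these two region features and the text test, assuming a binary region array with black = 1; the threshold ranges are illustrative placeholders, since the paper tunes them on sample documents:

```python
# Region features MBRL (Eq. 1) and MTC (Eq. 2) for text/non-text decision.
import numpy as np

def region_features(region: np.ndarray):
    brl = int(region.sum())                    # total length of black runs
    # White-black transitions: pixel is black and its left neighbor is white;
    # a white column is prepended so runs starting at column 0 are counted.
    padded = np.pad(region, ((0, 0), (1, 0)))
    tc = int(((padded[:, 1:] == 1) & (padded[:, :-1] == 0)).sum())
    width = region.shape[1]
    mbrl = brl / tc if tc else 0.0             # Eq. (1)
    mtc = tc / width                           # Eq. (2)
    return mbrl, mtc

def is_text(region, mbrl_rng=(2.0, 20.0), mtc_rng=(0.2, 3.0)):
    mbrl, mtc = region_features(region)
    return mbrl_rng[0] <= mbrl <= mbrl_rng[1] and mtc_rng[0] <= mtc <= mtc_rng[1]
```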


The text regions extracted by the first pass are removed from the label image and the corresponding areas are filled with zeros, i.e. white.

Second Pass

The algorithm used in the second pass is similar to that used in the first pass, except for some changes in parameter settings. Specifically, the following procedure is applied to the label image output from the first pass.

1. Apply selective-CRLA{1, 2} row by row to the label image using a constraint Chor-2.

2. Apply selective-CRLA{1, 2} column by column to the label image using a constraint Cver-2.

3. Combine the images yielded by steps 1 and 2 using a logical AND operation.

4. Apply an additional selective-CRLA{1, 2} row by row to the image yielded by step 3 using a constraint Csm-2.


Again, the values of the parameters Chor-2, Cver-2, and Csm-2 are chosen according to an analysis of sample documents (see Post 2 for details).


To identify the text regions, the features and rules used in the first pass are employed again here, but with new values of MBRL_min, MBRL_max, MTC_min and MTC_max, likewise determined from tests on several document images.

Discussion

The strength of the selective CRLA lies in its use of global component information (i.e. the component size), which makes the run-length smearing process more precise than in the original CRLA. In the first pass, the selective CRLA aims to extract body text, which usually has a smaller character size, and it prevents erroneous linkage of text and non-text regions. In the second pass, the pixel labels and the parameter settings are relaxed to join the large headline characters and the text lines that have wide inter-character spaces.

Source:

Hung-Ming Sun, 2006, "Enhanced Constrained Run-Length Algorithm for Complex Layout Document Processing", International Journal of Applied Science and Engineering 2006.4, 3: 297-309

Constrained Run-Length Algorithm (CRLA)


The Constrained Run-Length Algorithm (CRLA), also known as the Run-Length Smoothing/Smearing Algorithm (RLSA), is a well-known technique for page segmentation, i.e. for segmenting a document image into homogeneous regions. The algorithm is very efficient for partitioning documents with Manhattan layouts (i.e. where the text/graphics/halftone-image regions are separable by horizontal and vertical line segments, for example two-column text together with rectangularly aligned graphics and halftone images), but it is not suited to complex layout pages, e.g. irregular graphics embedded in a text paragraph. Its main drawback is that it uses only local information during the smearing stage, which may lead to erroneous linkage of text and graphics.

The previous two posts briefly describe the Constrained Run-Length Algorithm (CRLA), or Run-Length Smoothing Algorithm (RLSA), and its related issues.


Source:

[1] Hung-Ming Sun, 2006, "Enhanced Constrained Run-Length Algorithm for Complex Layout Document Processing", International Journal of Applied Science and Engineering 2006.4, 3: 297-309