Sunday, July 1, 2007

Enhanced Constrained Run-Length Algorithm


Introduction

This Enhanced CRLA is a solution to the limitations of CRLA and it can be applied to non-Manhattan (irregularly shaped graphics embedded in text) page layout successfully. It can also extract text surrounded by a box. Both cases cannot be processed by the original CRLA.

Preprocess

The proposed new version of CRLA is performed on a label image, which is derived from the input document image and has the same size as the document image. To generate the label image, a connected component labeling algorithm is used to locate the foreground components in the document image and calculate their size. Then the label image can be generated by assigning its pixels with certain labels according to the size of the foreground component related to the pixel. Three labels are defined in the present system based on two threshold values (thVal1 & thVal2) as follows:

(1) If the height of the foreground component is less than thVal1, its pixels in the label image are assigned the label of “1”.

(2) If the height of the foreground component is between thVal1 and thVal2, its pixels in the label image are assigned the label of “2”.

(3) If the height of the foreground component is larger than thVal28, its pixels in the label image are assigned the label of “3”

The height is used instead of the width when checking the component size because it is assumed that the text on the document is horizontal alignment. As the characters in a text line may be connected in the scanned document image, checking width may lead to misjudgment of the component size.

After the label image is yielded, the new CRLA is executed on it in the following manner. Consider, for example, a label string as below

110001110002003330000110000000111.

By setting a constraint C = 5 to the run-length of zeros that are surrounded by label 1 on both sides, if the length of the consecutive zeros is not longer than C, these zeros are replaced by ones. Based on this operation, the preceding string is converted into

111111110002003330000110000000111.

This operation is referred to as selective-CRLA{1} where “selective” means it is applied only to those zeros surrounded by the specific label(s) given in the braces. The selective CRLA is performed with a two-pass processing scheme on the label image to separate text from graphics and halftone-image components.

Process

First Pass

The process of the first pass uses the following steps to accomplish page segmentation.

1. Apply selective-CRLA{1} row by row to the label image using constraint Chor-1.

2. Apply selective-CRLA{1} column by column to the label image using a constraint Cver-1.

3. Combine the images yielded by steps 1 and 2 using a logical AND operation.

4. Apply an additional selective- CRLA{1} row by row to the image yielded by step 3 using a constraint Csm-1.

The purposes of steps 1 and 2 are to link the foreground components horizontally and vertically. Step 3 is to cut the joined foreground areas according to the space between columns and text lines. Finally, step 4 is to remove the small gaps that may exist between characters in a text line. The parameters Chor-1, Cver-1, and Csm-1 are chosen according to the analysis on documents (Please read Post: 2 to know detail about this).

After dividing the document image into homogeneous regions, these regions must be further classified into text or non-text according to their features. A number of statistical measurements have been proposed for distinguishing between text and non-text regions for binary document images. Two of them are adopted in the present system: (i) the mean length of horizontal black runs and (ii) the white-black transition count per unit width. These two features can be calculated by scanning a region from left to right in a row-by-row manner. As the scanning proceeds, the horizontal black run-length is accumulated by a counter, BRL, and the white-black transition count by another counter, TC. After the whole region is scanned, the two features are computed by mean length of horizontal black runs

MBRL = BRL/TC -- (1)

White-black transition count per unit width

MTC = TC/W -- (2)

where W is the width (in pixel) of the region under consideration. If MBRL_min <= MBRL <= MBRL_max and MTC_min <= MTC <= MTC_max , the region is determined to be text. The parameter values of MBRL_min, MBRL_max, MTC_min and MTC_max in the present system are determined based on tests performed on of several document images.


The text regions extracted by the first pass are removed from the label image and the corresponding areas are filled with zeros, i.e. white.

Second Pass

The algorithm used in the second pass is similar to that used in the first pass, except for some changes in parameter settings. Specifically, the following procedure is applied to the label image output from the first pass.

1. Apply selective-CRLA{1, 2} row by row to the label image using a constraint Chor-2.

2. Apply selective-CRLA{1, 2} column by column to the label image using a constraint Cver-2.

3. Combine the images yielded by steps 1 and 2 using a logical AND operation.

4. Apply an additional selective- CRLA{1, 2} row by row to the image yielded by step 3 using a constraint Csm-2.


Again the value of the parameters Chor-2, Cver-2, and Csm-2 are chosen according to the analysis on documents (Please read Post: 2 to know detail about this).


To identify the text regions, the features and rules used in the first pass is employed again here, but with new values of MBRL_min, MBRL_max, MTC_min and MTC_max determined based on tests performed on of several document images.]

Discussion

The strength of the selective CRLA relies on the utilization of the global component information (i.e. the component size), which conducts the run-length smearing process more precisely than the original CRLA. In the first pass, the selective CRLA aims to extract body text, which usually has a smaller character size, and it can prevent erroneous linkage of text and non-text regions. In the second pass, the pixel labels and the parameter settings are relaxed to join the large headline characters and the text lines that have wide inter-character space.

Source:

Hung-Ming Sun, 2006, "Enhanced Constrained Run-Length Algorithm for Complex Layout Document Processing", International Journal of Applied Science and Engineering 2006.4, 3: 297-309

No comments: