Wednesday, June 27, 2007

Determination of Run-Length Smoothing Values For Document Segmentation


Let's start with a review of the Run Length Smoothing Algorithm (RLSA) technique. The RLSA usually has three main stages. First, the RLSA is applied to the original image horizontally, row by row, using the horizontal smoothing value hsv, and then vertically, column by column, using the vertical smoothing value vsv. The horizontal and vertical smoothing passes produce two bit-maps, which are then combined by a logical AND operation to produce a new smoothed image. This image still contains small gaps that interrupt blocks of text lines, so an additional horizontal smoothing pass is performed using a suitable new smoothing value ahsv.
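To make the pipeline concrete, here is a minimal Python/numpy sketch of the three stages described above. The function names (rlsa_1d, rlsa_segment) are mine, not the paper's; the one-dimensional smoothing rule is the one reviewed in the June 26 post below, and hsv, vsv and ahsv are supplied by the caller.

import numpy as np

def rlsa_1d(bits, c):
    # RLSA rule in one dimension: runs of 0's no longer than c become 1's;
    # 1's are left unchanged.
    out = bits.copy()
    n = len(out)
    i = 0
    while i < n:
        if out[i] == 0:
            j = i
            while j < n and out[j] == 0:
                j += 1
            if j - i <= c:        # short white run: fill it with black
                out[i:j] = 1
            i = j
        else:
            i += 1
    return out

def rlsa_segment(img, hsv, vsv, ahsv):
    # Stage 1: horizontal smoothing (row by row) and vertical smoothing
    # (column by column) of the binary image (1 = black, 0 = white).
    horizontal = np.array([rlsa_1d(row, hsv) for row in img])
    vertical = np.array([rlsa_1d(col, vsv) for col in img.T]).T
    # Stage 2: combine the two bit-maps with a logical AND.
    combined = horizontal & vertical
    # Stage 3: an extra horizontal pass closes the small gaps that still
    # interrupt blocks of text lines.
    return np.array([rlsa_1d(row, ahsv) for row in combined])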

The smoothing values used in run-length-based segmentation methods vary widely with the document structure (both physical and logical layout) and the types of text in the document. More specifically, they depend on the character size and font, which in turn may vary depending on whether the document uses a single font or multiple fonts and character sizes. The smoothing methods use heuristic techniques to determine the values of the smoothing variables.

According to the work of Wong et al. (1982), the smoothing values hsv and vsv depend on "the number of pixels covering the length of long words, whereas ahsv should be set to cover a few character widths". We have found that almost all papers that use and discuss the RLSA technique apply their own empirically chosen smoothing values; none of them proposes an analytical method for calculating them.

Here in this post, an algorithm is described for the automated calculation of proper horizontal and vertical smoothing values. Specifically, the horizontal smoothing value (hsv) and the vertical smoothing value (vsv) are calculated from the mean character length (mcl) and the mean text line distance (mtld) of the document. The values of mcl and mtld are determined using the distributions of the horizontal and vertical run lengths.

The proposed method is based on a pre-estimation of the mean character length and the mean text line distance of the document. To do this, a technique is developed that uses the horizontal and vertical distributions of the white and black runs. The authors consider that a proper value of hsv is equal to twice the mcl, while in the vertical direction the proper value of vsv is equal to the mtld of the document. That is,

hsv = 2 * mcl    (1)

vsv = mtld    (2)

To obtain proper values of mcl and mtld, we examine not only the white runs but also the black runs in the vertical and horizontal directions. Specifically, we first calculate three histograms. These histograms give the distributions of the black runs in the horizontal and vertical directions and of the white runs in the vertical direction.
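As a sketch, the three distributions can be collected by scanning each row and column of a binary numpy image and counting maximal runs; the helper names below are mine, and 1 is taken to mean black, 0 white.

from collections import Counter

def run_lengths(line, value):
    # Lengths of the maximal runs of `value` in a 1-D sequence.
    runs, count = [], 0
    for v in line:
        if v == value:
            count += 1
        elif count:
            runs.append(count)
            count = 0
    if count:
        runs.append(count)
    return runs

def run_histogram(img, value, axis):
    # Histogram (run length -> frequency) of the runs of `value`
    # along rows (axis=1) or columns (axis=0) of a binary image.
    lines = img if axis == 1 else img.T
    hist = Counter()
    for line in lines:
        hist.update(run_lengths(line, value))
    return hist

# The three distributions used by the method:
# hb = run_histogram(img, 1, axis=1)   # horizontal black runs
# vb = run_histogram(img, 1, axis=0)   # vertical black runs
# vw = run_histogram(img, 0, axis=0)   # vertical white runs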

The histogram of horizontal black runs is built mainly from runs inside characters, so its global maximum (gmhbr) carries information about the mean character length. We observe that the mcl belongs to a region defined by multiplying the gmhbr by two suitable coefficients m1 and m2. That is,

mcl ∈ [int(m1 * gmhbr), intu(m2 * gmhbr)]    (3)

where int(x) is the integer part of x, and

intu(x) = x, if x is an integer; int(x) + 1, otherwise    (4)

On the other hand, we know that the mean character height (mch) is close to the mcl. Moreover, the mch corresponds to a local or global maximum in the histogram of vertical black runs, and we accept that this maximum belongs to the region defined by (3).
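Reading (3) and this remark together, the mcl can be estimated as the strongest peak of the vertical black-run histogram inside the region anchored at the gmhbr. A small sketch follows; note that this excerpt does not state the fitted values of m1 and m2, so they are left as parameters here, and intu coincides with the ceiling function for non-negative x.

import math

def intu(x):
    # Eq. (4): x itself if x is an integer, int(x) + 1 otherwise.
    # For x >= 0 this is exactly math.ceil.
    return math.ceil(x)

def estimate_mcl(hb, vb, m1, m2):
    # hb, vb: Counter histograms of horizontal / vertical black runs.
    # gmhbr is the run length at the global maximum of hb.
    gmhbr = max(hb, key=hb.get)
    lo, hi = int(m1 * gmhbr), intu(m2 * gmhbr)
    # Take the mcl as the strongest peak of vb inside the region of Eq. (3).
    region = {k: v for k, v in vb.items() if lo <= k <= hi}
    return max(region, key=region.get)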

Therefore, the proper values of the coefficients m1 and m2 must be determined. To do this, the authors examined a large number of documents and made the following estimates:
  • We can accept that in a readable document the mtld is always greater than int(0.8 * mcl); this is the lower limit for mtld.
  • The upper limit of mtld can be large; it is taken equal to 80. The mtld is then determined as the position of the global maximum of the white-run histogram in the range [int(0.8 * mcl), 80], as sketched below.
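A sketch of that final step, together with the smoothing values of Eqs. (1) and (2); vw is the vertical white-run histogram from earlier, and the function name is mine:

def estimate_mtld(vw, mcl):
    # Global maximum of the white-run histogram in [int(0.8 * mcl), 80].
    lo, hi = int(0.8 * mcl), 80
    region = {k: v for k, v in vw.items() if lo <= k <= hi}
    return max(region, key=region.get)

# mcl = estimate_mcl(hb, vb, m1, m2)
# mtld = estimate_mtld(vw, mcl)
# hsv = 2 * mcl     # Eq. (1)
# vsv = mtld        # Eq. (2)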
Source: N. Papamarkos, J. Tzortzakis and B. Gatos, "Determination of run-length smoothing values for document segmentation," Third IEEE Int. Conf. on Electronics, Circuits, and Systems (ICECS '96), pp. 684-687, vol. 2, 1996.

Tuesday, June 26, 2007

Run Length Smoothing Algorithm (RLSA)


The Run Length Smoothing Algorithm (RLSA) is a method that can be used for block segmentation and text discrimination. The method, developed for the Document Analysis System, consists of two steps. First, a segmentation procedure subdivides the area of a document into regions (blocks), each of which should contain only one type of data (text, graphic, halftone image, etc.). Next, some basic features of these blocks are calculated.

The basic RLSA is applied to a binary sequence in which white pixels are represented by 0’s and black pixels by 1’s. The algorithm transforms a binary sequence x into an output sequence y according to the following rules:

1. 0’s in x are changed to 1’s in y if the number of adjacent 0’s is less than or equal to a predefined limit C.
2. 1’s in x are unchanged in y.

For example, with C = 4 the sequence x is mapped into y as follows:

x : 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 1 0 0 0
y : 1 1 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 1 1 1 1 1
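The two rules are short enough to state directly in Python; running this sketch (the function name rlsa_1d is mine) on the sequence above with C = 4 reproduces y exactly:

from itertools import groupby

def rlsa_1d(bits, c):
    # Runs of 0's with length <= c become 1's; 1's are unchanged.
    out = []
    for value, group in groupby(bits):
        run = list(group)
        if value == 0 and len(run) <= c:
            run = [1] * len(run)
        out.extend(run)
    return out

x = [0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,1,1,0,0,0]
print(rlsa_1d(x, 4))
# -> [1,1,1,1,0,0,0,0,0,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,1,1,1,1,1]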

When applied to pattern arrays, the RLSA has the effect of linking together neighboring black areas that are separated by less than C pixels. With an appropriate choice of C, the linked areas will be regions of a common data type.

The RLSA is applied row-by-row as well as column-by-column to a document, yielding two distinct bit-maps. Because the spacings of document components tend to differ horizontally and vertically, different values of C are used for row and column processing. The two bit-maps are then combined in a logical AND operation. Additional horizontal smoothing using the RLSA produces the final segmentation result.

Figure 1: Block segmentation example of a mixed text/image document. (a) Original digitized document. (b), (c) Results of applying the RLSA in the horizontal and vertical directions. (d) Final result of block segmentation. (e) Results for blocks considered to be text data (class 1).

The blocks shown in Fig. 1(d) must next be located and classified according to content. To provide identification for subsequent processing, a unique label is assigned to each block. Simultaneously with block labeling, the following measurements are taken:
  • Total number of black pixels in a segmented block (BC).
  • Minimum x-y coordinates of a block and its x, y lengths (Xmin, Ymin, delX, delY).
  • Total number of black pixels in original data from the block (DC).
  • Horizontal white-black transitions of original data (TC).
These measurements are stored in a table and are used to calculate the following features:
  1. The height of each block segment: H = delY.
  2. The eccentricity of the rectangle surrounding the block: E = delX/delY.
  3. The ratio of the number of block pixels to the area of the surrounding rectangle: S=BC/(delX * delY). If S is close to one, the block segment is approximately rectangular.
  4. The mean horizontal length of the black runs of the original data from each block: R = DC/TC.
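A small sketch of how these features can be computed from the stored measurements (the parameter names mirror the list above and are otherwise my own):

def block_features(bc, dc, tc, del_x, del_y):
    # bc: black pixels in the segmented block
    # dc: black pixels of the original data in the block
    # tc: horizontal white-black transitions of the original data
    # del_x, del_y: x and y lengths of the bounding rectangle
    H = del_y                   # 1. block height
    E = del_x / del_y           # 2. eccentricity of the bounding rectangle
    S = bc / (del_x * del_y)    # 3. fill ratio (S near 1 => nearly rectangular)
    R = dc / tc                 # 4. mean horizontal black run length
    return H, E, S, R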
The mean value of the block height, Hm, and the mean black pixel run length, Rm, for the text cluster may vary for different types of documents, depending on character size and font. Furthermore, the text cluster’s standard deviations sigma(Hm) and sigma(Rm) may also vary depending on whether a document is in a single font or in multiple fonts and character sizes. To permit self-adjustment of the decision boundaries for text discrimination, estimates are calculated for the mean values Hm and Rm from blocks in a tightly defined text region of the R-H plane. Additional heuristic rules are applied to confirm that each such block is likely to be text before it is included in the cluster. The members of the cluster are then used to estimate new bounds on the features and to detect additional text blocks. Finally, a variable, linear, separable classification scheme assigns the blocks to the following four classes:

Class 1, Text:
R < C_21 * Rm and H < C_22 * Hm.

Class 2, Horizontal solid black lines:
R > C_21 * Rm and H < C_22 * Hm.

Class 3, Graphic and halftone images:
E > 1/C_23 and H > C_22 * Hm.

Class 4, Vertical solid black lines:
E < 1/C_23 and H > C_22 * Hm.

Values have been assigned to the parameters based on several training documents. With C_21=3, C_22=3, and C_23=5, the outlined method has been tested on a number of test documents with satisfactory performance.
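Under the rules as given above, the classifier reduces to a few comparisons. This is a sketch with the published parameter values; H, E and R are the features of the block being classified, while Hm and Rm are the text-cluster means:

C21, C22, C23 = 3, 3, 5

def classify_block(H, E, R, Hm, Rm):
    if R < C21 * Rm and H < C22 * Hm:
        return 1   # text
    if R > C21 * Rm and H < C22 * Hm:
        return 2   # horizontal solid black line
    if E > 1.0 / C23 and H > C22 * Hm:
        return 3   # graphic or halftone image
    if E < 1.0 / C23 and H > C22 * Hm:
        return 4   # vertical solid black line
    return 0       # boundary case left unclassified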

There are certain limitations to the block segmentation and text discrimination method described so far. On some documents, text lines are linked together by the block segmentation algorithm because of small line-to-line spacing, and are thus assigned to class 3. A line of characters exceeding 3 times Hm (with C_22 = 3), such as a title or heading in a document, is assigned to class 3. An isolated line of text printed vertically, e.g., a text description along the vertical axis of a diagram, may be classified as a number of small text blocks, or else as a class 3 block.

Consequently, in order to further classify different data types within class 3, a second discrimination based on shape factors is used. The method uses measurements of border-to-border distance within an arbitrary pattern. These measurements can be used to calculate meaningful features like the “line-shapeness” or “compactness” of objects within binary images.


Source:
K.Y. Wong, R.G. Casey and F.M. Wahl, "Document analysis system," IBM J. Res. Devel., Vol. 26, No. 6, pp. 647-656, 1982.

Additional Reading:
[1] F. M. Wahl, K. Y. Wong, and R. G. Casey, “Block Segmentation and Text Extraction in Mixed Text/Image Documents,” Research Report RJ 3356, IBM Research Laboratory, San Jose, CA, 1981.

[2] F. M. Wahl, “A New Distance Mapping and its Use for Shape Measurement on Binary Patterns,” Research Report RJ 3361, IBM Research Laboratory, San Jose, CA, 1982.