
Wednesday, July 4, 2007

Feature-Based Approach for Page Layout Extraction


This post is essentially a summary of the paper "Document Image Zone Classification: A Simple High-Performance Approach" by Daniel Keysers, Faisal Shafait, and Thomas M. Breuel.

Introduction

In the paper, the authors address the shortcomings of earlier feature comparisons by evaluating a large set of commonly used features for block classification, including three features that are known to yield good performance in content-based image retrieval (CBIR) and are applicable to binary images (Deselaers et al., 2004). Interestingly, the single feature with the best performance turns out to be the Tamura texture histogram, which belongs to this latter class.


Related Work and Contribution

Inglis and Witten (1995) used seven features based on connected components and run lengths and applied various machine learning techniques to the problem. In the paper of Okun et al. (1999), the predominant feature type is likewise based on connected-component and run-length statistics. Other features used include the cross-correlation between scan lines, vertical projection profiles, wavelet coefficients, learned masks, and the black pixel distribution. The most common classifier is a neural network.

The widespread use of features based on connected-component and run-length statistics, combined with the simplicity of implementing such features, led us to use these feature types in our experiments as well, comparing them with features used in content-based image retrieval. Our CBIR features are based on the open source image retrieval system FIRE (Deselaers et al., 2004). We restrict our analysis for zone classification to those features that are promising for the analysis of binary images.

The most recent and detailed overview of the progress in document zone classification and a very accurate system is presented in (Wang et al., 2006). The authors use a decision tree classifier and model contextual dependencies for some zones. In our work we do not model zone context, although it is likely that a context model (which can be integrated in a similar way as presented by Wang et al.) would help the overall classification performance.

We expand on the work presented in (Wang et al., 2006) in the following ways:

• We include a detailed feature comparison, including a comparison with commonly used CBIR features. It turns out that the single best feature is the Tamura texture histogram, which had not previously been used for zone classification.

• We present results both for a simple nearest neighbor classifier and for a very fast linear classifier based on logistic regression and the maximum entropy criterion.

• We introduce a new class of blocks containing speckles, which has not been labeled in the UW-III database. This type of noise is important to detect during layout analysis, especially for images of photocopied documents.

• We present results for the part of the UW-III database without using duplicates and achieve a similar error rate of 1.5%.

• We introduce the use of histograms for the measurements of connected components and run lengths and show that this leads to a performance increase.


Feature Extraction

We extract the following features from each block. Features 1-3 were chosen based on their performance in CBIR (Deselaers et al., 2004), feature 4 was expected to help distinguish between the types ‘drawing’ and ‘text’, and features 5-9 were chosen based on their common use in block classification (Okun et al., 1999; Wang et al., 2006). Due to space limitations, we refer the interested reader to the references for implementation details.

1. Tamura texture features histogram (TTFH)

2. Relational invariant feature histograms (RIFH)

3. Down-scaled images of size 32×32 (DSI)

4. The fill ratio, i.e. the ratio of the number of black pixels in a horizontally smeared (Wong et al., 1982) image to the area of the image (FR)

5. Run-length histograms of black and white pixels along the horizontal, vertical, main-diagonal, and side-diagonal directions; each histogram uses eight bins with edges spaced as powers of 2, i.e. counting runs of length ≤ 1, 3, 7, 15, 31, 63, 127 and ≥ 128 (RL{B,W}{X,Y,M,S}H); a sketch of this binning scheme follows the list

6. The vector formed by the total number, mean, and variance of the runs of black and white pixels along the horizontal, vertical, main diagonal, and side diagonal directions as used in (Wang et al., 2006) (RL{B,W}{X,Y,M,S}V)

7. Histograms (as in 5) of the widths and heights of connected components (CCXH, CCYH)

8. The joint distribution of the widths and heights of connected components as a 2-dimensional 64-bin histogram (CCXYH)

9. The histogram of the distances between a connected component and its nearest neighbor component (CCNNH)
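
As an aside, the power-of-two binning in feature 5 is easy to implement. Below is a minimal Python sketch (the function names and the final normalization are my own choices, not the paper's); the diagonal directions can be handled the same way by iterating over np.diagonal offsets.

    import numpy as np

    def run_lengths(line, value):
        """Lengths of all maximal runs of `value` in a 1-D binary array."""
        padded = np.concatenate(([0], (np.asarray(line) == value).astype(int), [0]))
        edges = np.diff(padded)
        return np.flatnonzero(edges == -1) - np.flatnonzero(edges == 1)

    def run_length_histogram(img, value=1, axis=1):
        """8-bin run-length histogram along rows (axis=1) or columns (axis=0).

        Bin k counts runs with floor(log2(length)) == k, clipped to bin 7,
        i.e. runs of length <= 1, 3, 7, 15, 31, 63, 127 and >= 128.
        """
        lines = img if axis == 1 else img.T
        lengths = np.concatenate([run_lengths(line, value) for line in lines])
        bins = np.minimum(np.floor(np.log2(lengths)).astype(int), 7)
        hist = np.bincount(bins, minlength=8).astype(float)
        return hist / max(hist.sum(), 1.0)  # normalize for block-size invariance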

Fig: Examples of document image block types distinguished in our approach

Classification

To evaluate the various features, we use a simple nearest neighbor classifier; that is, a test sample is assigned to the class of the closest training sample. The distance measures used are the Jensen-Shannon divergence for histograms and the Euclidean distance for all other features (Deselaers et al., 2004). If different feature sets are combined, the overall distance is calculated as the weighted sum of the individual normalized distances, with weights proportional to the inverse of each feature's error rate.
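
As a sketch of this decision rule, assuming each sample is a dict of named, normalized histogram features (the helper names and data layout are mine; the Euclidean distance for non-histogram features would plug in the same way):

    import numpy as np

    def jensen_shannon(p, q, eps=1e-12):
        """Jensen-Shannon divergence between two normalized histograms."""
        p = np.asarray(p, dtype=float) + eps
        q = np.asarray(q, dtype=float) + eps
        m = 0.5 * (p + q)
        return 0.5 * np.sum(p * np.log2(p / m)) + 0.5 * np.sum(q * np.log2(q / m))

    def nearest_neighbor(test, train, labels, weights):
        """1-NN with a weighted sum of per-feature distances.

        `weights[name]` is proportional to the inverse error rate obtained
        with that feature alone, as described above.
        """
        def dist(sample):
            return sum(w * jensen_shannon(test[name], sample[name])
                       for name, w in weights.items())
        return min(zip(train, labels), key=lambda pair: dist(pair[0]))[1]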


Source: Daniel Keysers, Faisal Shafait, Thomas M. Breuel, "Document Image Zone Classification: A Simple High-Performance Approach", VISAPP 2007, pages 44-51.

Wednesday, June 27, 2007

Determination of Run-Length Smoothing Values For Document Segmentation


Let's start with a review of the Run Length Smoothing Algorithm (RLSA) technique. The RLSA usually has three main stages. First, the RLSA is applied to the original image horizontally, row by row, using a horizontal smoothing value (hsv), and then vertically, column by column, with a vertical smoothing value (vsv). After horizontal and vertical smoothing, we have two bit-maps, which are next combined by a logical AND operation to produce a new smoothed image. This image still has small gaps that interrupt blocks of text lines. Therefore, an additional horizontal smoothing operation is performed using a new, suitable smoothing value (ahsv).

The smoothing values used by run-length based segmentation methods vary widely depending on the document structure (both physical and logical layout) and the text types in the document. More specifically, they depend on character size and font, which may in turn vary depending on whether a document uses a single font or multiple fonts. The smoothing methods use heuristic techniques to determine the values of the smoothing variables.

According to the work of Wong et al. (1982), the smoothing values hsv and vsv depend on "the number of pixels covering the length of long words, whereas ahsv should be set to cover a few character widths". We have found that almost all papers in which the RLSA technique is used and discussed apply their own empirically chosen smoothing values; however, they do not propose an analytical method for calculating them.

In this post, an algorithm is described for the automated calculation of proper horizontal and vertical smoothing values. Specifically, the horizontal smoothing value (hsv) and the vertical smoothing value (vsv) are calculated from the mean character length (mcl) and the mean text line distance (mtld) of the document. The values of mcl and mtld are determined using the distributions of the horizontal and vertical run-lengths.

The proposed method is based on a pre-estimation of the mean character length and the mean text line distance of a document. To do this, a technique is developed that uses the horizontal and vertical distributions of white and black runs. The authors consider that a proper value for hsv is twice the mcl, while in the vertical direction the proper value of vsv is the mtld of the document. That is,

hsv = 2 * mcl   (1)

vsv = mtld   (2)

To obtain proper values for mcl and mtld, we examine not only the white runs but also the black runs. Specifically, we first calculate three histograms: the distributions of black runs in the horizontal and vertical directions, and of white runs in the vertical direction.

The global maximum of the histogram of horizontal black runs (gmhbr) is produced mainly by parts of characters; therefore, it carries information about the mean character length. We observe that the mcl belongs to a region defined by multiplying the gmhbr by two suitable coefficients m1 and m2. That is,

mcl ∈ [int(m1 * gmhbr), intu(m2 * gmhbr)]   (3)

where int(x) is the integer part of x, and

intu(x) = x, if x is an integer
        = int(x) + 1, otherwise   (4)

On the other hand, we know that the mean character height (mch) is close to the mcl. Moreover, the mch corresponds to a local or global maximum in the histogram of vertical black runs, and we accept that this maximum belongs to the region defined by (3).

Therefore, proper values for the coefficients m1 and m2 must be determined. To do this, the authors examined a large number of documents and made the following estimates:
  • We can accept that in a readable document the mtld is always greater than int(0.8 * mcl). This is the lower limit for mtld.
  • The upper limit of mtld can be fairly large and is taken to be 80. So the mtld is determined as the position of the global histogram maximum in the range [int(0.8 * mcl), 80].
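
Putting Eqs. (1)-(4) together, the whole estimation can be sketched in a few lines of Python. Two caveats: m1 and m2 below are placeholder values (the paper determines them empirically from many documents), and I assume the mtld search runs over vertical white runs, since the text line distance is a vertical spacing.

    import numpy as np

    def run_lengths(img, value, axis):
        """All run lengths of `value` along rows (axis=1) or columns (axis=0)."""
        lines = img if axis == 1 else img.T
        out = []
        for line in lines:
            padded = np.concatenate(([0], (line == value).astype(int), [0]))
            edges = np.diff(padded)
            out.append(np.flatnonzero(edges == -1) - np.flatnonzero(edges == 1))
        return np.concatenate(out)

    def intu(x):
        """Upper integer part, Eq. (4)."""
        return int(x) if float(x).is_integer() else int(x) + 1

    def smoothing_values(img, m1=0.5, m2=2.0):
        """Estimate (hsv, vsv) for a binary 0/1 page image via Eqs. (1)-(3)."""
        # gmhbr: position of the global maximum of the horizontal black-run histogram.
        gmhbr = np.bincount(run_lengths(img, 1, axis=1)).argmax()
        lo, hi = int(m1 * gmhbr), intu(m2 * gmhbr)
        # mcl ~ mch: strongest vertical black-run peak inside [lo, hi], Eq. (3).
        vblack = np.bincount(run_lengths(img, 1, axis=0), minlength=hi + 1)
        mcl = lo + int(vblack[lo:hi + 1].argmax())
        # mtld: global maximum of the vertical white-run histogram in [int(0.8*mcl), 80].
        vwhite = np.bincount(run_lengths(img, 0, axis=0), minlength=81)
        low = int(0.8 * mcl)
        mtld = low + int(vwhite[low:81].argmax())
        return 2 * mcl, mtld  # hsv = 2 * mcl (1), vsv = mtld (2)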
Source: N. Papamarkos, J. Tzortzakis and B. Gatos, "Determination of run-length smoothing values for document segmentation," Third IEEE International Conference on Electronics, Circuits and Systems (ICECS '96), vol. 2, pp. 684-687, 1996.

Tuesday, June 26, 2007

Run Length Smoothing Algorithm (RLSA)


The Run Length Smoothing Algorithm (RLSA) is a method that can be used for block segmentation and text discrimination. The method, developed for the Document Analysis System, consists of two steps. First, a segmentation procedure subdivides the area of a document into regions (blocks), each of which should contain only one type of data (text, graphic, halftone image, etc.). Next, some basic features of these blocks are calculated.

The basic RLSA is applied to a binary sequence in which white pixels are represented by 0’s and black pixels by 1’s. The algorithm transforms a binary sequence x into an output sequence y according to the following rules:

1. 0’s in x are changed to 1’s in y if the number of adjacent 0’s is less than or equal to a predefined limit C.
2. 1’s in x are unchanged in y.

For example, with C = 4 the sequence x is mapped into y as follows:

x : 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 1 0 0 0
y : 1 1 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 1 1 1 1 1
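
In Python, the two rules can be transcribed directly (a minimal sketch; the function name is mine). Note that, as in the example, a short 0-run at either border of the sequence is filled as well:

    def rlsa_1d(x, c):
        """Fill runs of 0's of length <= c with 1's; leave 1's untouched."""
        y = list(x)
        i = 0
        while i < len(y):
            if y[i] == 1:
                i += 1
                continue
            j = i
            while j < len(y) and y[j] == 0:
                j += 1                     # j marks the end of this 0-run
            if j - i <= c:                 # rule 1: short gap, change 0's to 1's
                y[i:j] = [1] * (j - i)
            i = j                          # rule 2: 1's pass through unchanged
        return y

    x = [0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,1,1,0,0,0]
    assert rlsa_1d(x, 4) == [1,1,1,1,0,0,0,0,0,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,1,1,1,1,1]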

When applied to pattern arrays, the RLSA has the effect of linking together neighboring black areas that are separated by less than C pixels. With an appropriate choice of C, the linked areas will be regions of a common data type.

The RLSA is applied row by row as well as column by column to a document, yielding two distinct bit-maps. Because the spacings of document components tend to differ horizontally and vertically, different values of C are used for row and column processing. The two bit-maps are then combined in a logical AND operation. Additional horizontal smoothing using the RLSA produces the final segmentation result.
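
Reusing rlsa_1d from the sketch above, the whole pipeline fits in a few lines; the three smoothing constants below are only placeholders, since suitable values depend on scan resolution and layout (see the June 27 post above on choosing them automatically):

    import numpy as np

    def rlsa_axis(img, c, axis=1):
        """Row-wise (axis=1) or column-wise (axis=0) RLSA on a binary 0/1 array."""
        a = np.array(img, dtype=np.uint8)
        view = a if axis == 1 else a.T     # the transpose shares a's memory
        for line in view:
            line[:] = rlsa_1d(line, c)
        return a

    def block_segmentation(img, c_h=300, c_v=500, c_ah=30):
        """Horizontal and vertical RLSA, logical AND, extra horizontal smoothing."""
        horizontal = rlsa_axis(img, c_h, axis=1)
        vertical = rlsa_axis(img, c_v, axis=0)
        combined = horizontal & vertical   # AND of the two bit-maps
        return rlsa_axis(combined, c_ah, axis=1)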

Figure 1: (a) Block segmentation example of a mixed text/image document (the original digitized document). (b) and (c) Results of applying the RLSA in the horizontal and vertical directions. (d) Final result of block segmentation. (e) Results for blocks considered to be text data (class 1).

The blocks shown in Fig. 1(d) must next be located and classified according to content. To provide identification for subsequent processing, a unique label is assigned to each block. Simultaneously with block labeling, the following measurements are taken:
  • Total number of black pixels in a segmented block (BC).
  • Minimum x-y coordinates of a block and its x, y lengths (Xmin, Ymin, delX, delY).
  • Total number of black pixels in original data from the block (DC).
  • Horizontal white-black transitions of original data (TC).
These measurements are stored in a table and are used to calculate the following features:
  1. The height of each block segment: H = delY.
  2. The eccentricity of the rectangle surrounding the block: E = delX/delY.
  3. The ratio of the number of block pixels to the area of the surrounding rectangle: S=BC/(delX * delY). If S is close to one, the block segment is approximately rectangular.
  4. The mean horizontal length of the black runs of the original data from each block: Rm=DC/TC.
The mean value of the block height, Hm, and the mean black pixel run length, Rm, of the text cluster may vary for different types of documents, depending on character size and font. Furthermore, the text cluster's standard deviations sigma(Hm) and sigma(Rm) may also vary depending on whether a document is in a single font or in multiple fonts and character sizes. To permit self-adjustment of the decision boundaries for text discrimination, estimates are calculated for the mean values Hm and Rm from blocks in a tightly defined text region of the R-H plane. Additional heuristic rules are applied to confirm that each such block is likely to be text before it is included in the cluster. The members of the cluster are then used to estimate new bounds on the features and detect additional text blocks. Finally, a variable, linear, separable classification scheme assigns the blocks to the following four classes:

Class 1 Text:
R < C_21 Rm and
H < C_22 Hm.

Class 2 Horizontal solid black lines:
R > C_21 Rm and
H < C_22 Hm.

Class 3 Graphic and halftone images:
E > 1/C_23 and
H > C_22 Hm.

Class 4 Vertical solid black lines:
E < 1/C_23 and
H > C_22 Hm.

Values have been assigned to the parameters based on several training documents. With C_21=3, C_22=3, and C_23=5, the outlined method has been tested on a number of test documents with satisfactory performance.
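
A sketch of how the measurements and rules above fit together (the function names are mine; Hm and Rm are the text-cluster estimates described earlier):

    def block_features(bc, dc, tc, del_x, del_y):
        """Compute features 1-4 from the measurements (BC, DC, TC, delX, delY)."""
        H = del_y                  # block height
        E = del_x / del_y          # eccentricity of the bounding rectangle
        S = bc / (del_x * del_y)   # fill ratio; close to 1 means near-rectangular
        R = dc / tc                # mean horizontal black run length
        return H, E, S, R

    def classify_block(H, E, R, Hm, Rm, c21=3.0, c22=3.0, c23=5.0):
        """Assign one of the four classes using the rules above."""
        if H < c22 * Hm:
            return 1 if R < c21 * Rm else 2  # text vs. horizontal solid line
        return 3 if E > 1 / c23 else 4       # graphic/halftone vs. vertical line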

There are certain limitations to the block segmentation and text discrimination method described so far. On some documents, text lines are linked together by the block segmentation algorithm due to small line-to-line spacing, and are thus assigned to class 3. A line of characters exceeding 3 times Hm (with C_22 = 3), such as a title or heading in a document, is also assigned to class 3. An isolated line of text printed vertically, e.g., a text description along the vertical axis of a diagram, may be classified as a number of small text blocks, or else as a class 3 block.

Consequently, in order to further classify different data types within class 3, a second discrimination based on shape factors is used. The method uses measurements of border-to-border distance within an arbitrary pattern. These measurements can be used to calculate meaningful features like the “line-shapeness” or “compactness” of objects within binary images.


Source:
K. Y. Wong, R. G. Casey and F. M. Wahl, "Document analysis system," IBM J. Res. Devel., Vol. 26, No. 6, pp. 647-656, 1982.

Additional Reading:
[1] F. M. Wahl, K. Y. Wong, and R. G. Casey, "Block Segmentation and Text Extraction in Mixed Text/Image Documents," Research Report RJ 3356, IBM Research Laboratory, San Jose, CA, 1981.

[2] F. M. Wahl, “A New Distance Mapping and its Use for Shape Measurement on Binary Patterns,” Research Report RJ 3361, IBM Research Laboratory, San Jose, CA, 1982.