Wednesday, July 4, 2007

Feature based approach for Page Layout Extraction


This post is actually the underscore of the paper "Document Image Zone Classification: A Simple High-Performance Approach" by Thomas M. Bruel

Introduction

In this paper we address this shortcoming by comparing a large set of commonly used features for block classification and include in the comparison three features that are known to yield good performance in content-based image retrieval (CBIR) and are applicable to binary images (Deselaers et al., 2004). Interestingly, we found that the single feature with the best performance is the Tamura texture histogram, which belongs to this latter class.


Related Work and Contribution

Inglis and Witten (Inglis and Witten, 1995) used seven features based on connected components and run lengths, the authors apply various machine learning techniques to the problem. In the paper of Okun et al. (Okun et al., 1999) the predominant feature type is based on connected components and run-length statistics. Other features used include the cross-correlation between scan-lines, vertical projection profiles, wavelet coefficients, learned masks, and the black pixel distribution. The most common classifier used is a neural network.

The widespread use of features based on connected components run-length statistics, combined with the simplicity of implementation of such features, led us to use these feature types in our experiments as well, comparing them to the use of features used in content-based image retrieval. Our CBIR features are based on the open source image retrieval system FIRE (Deselaers et al., 2004). We restrict our analysis for zone classification to those features that are promising for the analysis of binary images.

The most recent and detailed overview of the progress in document zone classification and a very accurate system is presented in (Wang et al., 2006). The authors use a decision tree classifier and model contextual dependencies for some zones. In our work we do not model zone context, although it is likely that a context model (which can be integrated in a similar way as presented by Wang et al.) would help the overall classification performance.

We expand on the work presented in (Wang et al., 2006) in the following ways:

• We include a detailed feature comparison including a comparison with commonly used CBIR features. It turns out that the single best feature is the Tamura texture histogram which was not previously used for zone classification.

• We present results both for a simple nearest neighbor classifier and for a very fast linear classifier based on logistic regression and the maximum entropy criterion.

• We introduce a new class of blocks containing speckles that has not been labeled in the UW-III database. This typical class of noise is important to detect during the layout analysis especially for images of photocopied documents.

• We present results for the part of the UW-III database without using duplicates and achieve a similar error rate of 1.5%.

• We introduce the use of histograms for the measurements of connected components and run lengths and show that this leads to a performance increase.


Feature Extractions

We extract the following features from each block, where features 1-3 are chosen based on their performance in CBIR (Deselaers et al., 2004) feature 4 was expected to help distinguish between the types ‘drawing’ and ‘text’ and features 5-9 were chosen based on their common use in block classification (Okun et al., 1999;Wang et al., 2006). Due to space limitations we refer the interested reader to the references for implementation details.

1. Tamura texture features histogram (TTFH)

2. Relational invariant feature histograms (RIFH)

3. Down-scaled images of size 32×32 (DSI)

4. The fill ratio, i.e. the ratio of the number of black pixels in a horizontally smeared (Wong et al., 1982) image to the area of the image (FR)

5. Run-length histograms of black and white pixels along horizontal, vertical, main diagonal, and side diagonal directions; each histogram uses eight bins, spaced apart as powers of 2, i.e. counting runs of length _ 1,3,7,15,31,63,127 and _ 128 (RL{B,W}{X,Y,M,S}H)

6. The vector formed by the total number, mean, and variance of the runs of black and white pixels along the horizontal, vertical, main diagonal, and side diagonal directions as used in (Wang et al., 2006) (RL{B,W}{X,Y,M,S}V)

7. Histograms (as in 5) of the widths and heights of connected components (CCXH, CCYH)

8. The joint distribution of the widths and heights of connected components as a 2-dimensional 64-bin histogram (CCXYH)

9. The histogram of the distances between a connected component and its nearest neighbor component (CCNNH)

Fig: Examples of document image block types distinguished in our approach

Classifications

To evaluate the various features, we use a simple nearest neighbor classifier, that is, a test sample is classified into the class the closest training sample belongs to. The distance measures used are the Jensen- Shannon divergence for histograms and the Euclidean distance for all other features (Deselaers et al., 2004). If different feature sets are combined, the overall distance is calculated as the weighted sum of the individual normalized distances. The weights are proportional to the inverse of the error rate of a particular feature.


Source: Daniel Keysers, Faisal Shafait, Thomas M. Breuel, "DOCUMENT IMAGE ZONE CLASSIFICATION A Simple High Performance Approach", VISAPP 2007, pages 44-51

No comments: