**hsv**and then vertically, column-by-column with

**vsv**. After horizontal and vertical smoothing, we have two bit-maps, which are next combined by a logical AND operation to produce a new smoothing image. This image has small gaps that interrupt blocks of text lines. Therefore, an additional horizontal smoothing operation is performed by using a new suitable smoothing value

**ahsv**.

The smoothing values used for the Run-length based segmentation methods are widely varies depending on the document structure (both physical and logical layout) and text types in the document. More specifically they depend on character size and font that may also vary depending on whether a document is in a single font or multiple fonts etc. The smoothing methods use heuristic techniques to determine the values of the smoothing variables.

According to the work of Wong et al. (1982), the smoothing values of **hsv** and **vsv **depend on "the number of pixels covering the length of long words, whereas **ahsv** should be set to cover a few character widths". We have found that almost all the papers where the RLSA technique is used and discussed, applied their own diagnostic value as the smoothing value. However, they do not propose an analytical method for the calculation of the smoothing values.

The proposed method is based on pre-estimation of the mean character length and the mean text line distance of a document. To do this, a technique is developed that uses the horizontal and vertical distributions of white and black runs. They consider that a proper value for **hsv** must be equal to twice the **mcl**, while in the vertical direction, the proper value of **vsv** must be taken equal to the **mtld** of the document. That is

hsv = 2 mcl ( 1 )

vsv = mtld (2)

To obtain the proper values of **mcl** and **mtld**, we examine not only the white runs but also the black runs in the vertical and horizontal directions. Specifically, initially we calculate three histograms. These histograms give the distributions of black runs in the horizontal and vertical directions and of white runs in the horizontal direction.

The global maximum of the histogram of horizontal black runs (**gmhbr**) constructed mainly from part of the characters. Therefore, it gives information about the mean character length. We observe that the **mcl** belongs to a region defined by multiply the **gmhbr** by two suitable coefficients **m1** and **m2**. That is

mcl Є [int (m1 * gmhbr), intU (m2 * gmhbr)] --- (3)

where int(x) is the integer part of x, and

intu(x) = x, if x is integer / int(x) + 1, otherwise ----- (4)

On the other hand, we know that the mean character height (**mch**) is close enough to the **mcl**. However, it is obvious that the **mch** corresponds to a local or global maximum in the histogram of vertical black runs, and moreover, we accept that this maximum belongs to the region defined by (3).

- We can accept that in a readable document the mtld is always greater than int(0.8mcl). This is the lower limit for mtld.

- The upper limit of mtld can be large enough and it is taken equal to be 80. So, the mild is detennined as the position of the global histogram maximum in the range [int(0.8mc1),80]

## 1 comment:

Very nice article.

Post a Comment