The smoothing values used for the Run-length based segmentation methods are widely varies depending on the document structure (both physical and logical layout) and text types in the document. More specifically they depend on character size and font that may also vary depending on whether a document is in a single font or multiple fonts etc. The smoothing methods use heuristic techniques to determine the values of the smoothing variables.
According to the work of Wong et al. (1982), the smoothing values of hsv and vsv depend on "the number of pixels covering the length of long words, whereas ahsv should be set to cover a few character widths". We have found that almost all the papers where the RLSA technique is used and discussed, applied their own diagnostic value as the smoothing value. However, they do not propose an analytical method for the calculation of the smoothing values.Here in this post, an algorithm is described for automated calculation of the proper horizontal and vertical smoothing values. Specifically, the horizontal smoothing value (hsv) and the vertical smoothing value (vsv) are calculated according to the mean character length (mcl) and the mean text line distance (mtld) of the document. The values of mcl and mtld are determined using the contributions of the horizontal and vertical runlengths.
The proposed method is based on pre-estimation of the mean character length and the mean text line distance of a document. To do this, a technique is developed that uses the horizontal and vertical distributions of white and black runs. They consider that a proper value for hsv must be equal to twice the mcl, while in the vertical direction, the proper value of vsv must be taken equal to the mtld of the document. That is
hsv = 2 mcl ( 1 )
vsv = mtld (2)
To obtain the proper values of mcl and mtld, we examine not only the white runs but also the black runs in the vertical and horizontal directions. Specifically, initially we calculate three histograms. These histograms give the distributions of black runs in the horizontal and vertical directions and of white runs in the horizontal direction.
The global maximum of the histogram of horizontal black runs (gmhbr) constructed mainly from part of the characters. Therefore, it gives information about the mean character length. We observe that the mcl belongs to a region defined by multiply the gmhbr by two suitable coefficients m1 and m2. That is
mcl Є [int (m1 * gmhbr), intU (m2 * gmhbr)] --- (3)
where int(x) is the integer part of x, and
intu(x) = x, if x is integer / int(x) + 1, otherwise ----- (4)
On the other hand, we know that the mean character height (mch) is close enough to the mcl. However, it is obvious that the mch corresponds to a local or global maximum in the histogram of vertical black runs, and moreover, we accept that this maximum belongs to the region defined by (3).Therefore, the proper values of coefficients m1 and m2 must be determined . To do this they examine a large number of documents. Some estimation are made after examining number of documents.
- We can accept that in a readable document the mtld is always greater than int(0.8mcl). This is the lower limit for mtld.
- The upper limit of mtld can be large enough and it is taken equal to be 80. So, the mild is detennined as the position of the global histogram maximum in the range [int(0.8mc1),80]