Wednesday, July 4, 2007

Planning for the Development of a Database of Document Images for Research on Bangla Optical Character Recognition


First, we need to complete a basic study of the established document image databases used for research. From what I have observed, the most widely used database for research on document image analysis and recognition is the University of Washington database, whose creation started around 1990. Papers related to this database were published in 1993 at the Second International Conference on Document Analysis and Recognition, so we need to understand the techniques used to create it. Some related papers are listed in the Required Study Materials section.

Next, we have to look at the existing databases of Bangla document images and their availability. I have found that a similar task was carried out by the “Resource Centre for Indian Language Technology Solutions Bangla”, with a limited variety of documents taken from Bangla novels. I tried to get access to that resource but failed. So we need to build a database with a wide variety of documents and distribute it openly.

The following are the basics of the UW databases.

University of Washington English Document Image Database

Series (UW-I, UW-II, UW-III) of document image databases produced by the Intelligent Systems Laboratory, at the University of Washington, Seattle, Washington, USA.

Currently, the most widely used database for research and performance evaluation is the University of Washington III (UW-III) database. It consists of 1600 English document images with bounding boxes for 24177 homogeneous page segments, or blocks, which are manually labeled into different classes depending on their contents. This makes the data well suited to evaluating a block classification system. The documents in the UW-III dataset are categorized by degradation type as follows:

1. Direct scans of original English journals

2. Scans of first generation English journal photocopies

3. Scans of second or later generation English journal photocopies

Many of the documents in the dataset are duplicates that sometimes differ only in the degradation applied to them. Such a collection is useful for evaluating how well a page segmentation algorithm performs when photocopy degradation is applied to a document.
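
To make this kind of comparison concrete, here is a minimal sketch of averaging a segmentation score separately for each degradation category. The category labels and the `(category, score)` input format are my own assumptions for illustration; they are not part of the UW-III distribution, and the score can come from whatever metric is being used.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical labels for the three degradation categories listed above.
CATEGORIES = ("direct_scan", "first_gen_photocopy", "later_gen_photocopy")


def mean_score_by_category(results):
    """Average a per-page segmentation score within each degradation category.

    `results` is an iterable of (category, score) pairs. The score itself
    (for example, the fraction of zones segmented correctly) is assumed to
    be produced elsewhere by the chosen evaluation metric.
    """
    grouped = defaultdict(list)
    for category, score in results:
        if category not in CATEGORIES:
            raise ValueError(f"unknown degradation category: {category}")
        grouped[category].append(score)
    # Only categories that actually have pages appear in the result.
    return {category: mean(scores) for category, scores in grouped.items()}
```

Comparing the averages across categories shows how much an algorithm suffers as the photocopy generation increases.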

UW – I

Contains 1147 document page images from English Scientific and Technical Journals having

  • Binary images scanned from 1st and other generation photocopies
  • Binary and grayscale images scanned directly from technical journals
  • Synthetic noise-free images generated from LaTeX files
  • Document images from UNLV ISRI database
  • All document images zoned and tagged
  • Software for OCR performance evaluation
  • Software for simulation of photocopy degradation
  • Text ground truth generated from two independent data-entry operators followed by three independent verifications
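
The double data-entry idea in the last item can be sketched as follows: two operators type the same zone independently, and every span where their transcriptions disagree is flagged for verification. This is only an illustration using Python's standard `difflib` module; the function name is hypothetical, and it is not the procedure or software actually used for UW-I.

```python
import difflib


def find_disagreements(entry_a, entry_b):
    """Return the spans where two independent transcriptions differ.

    Each result is (tag, text_from_a, text_from_b), where tag is one of
    'replace', 'delete', or 'insert' as reported by difflib.
    """
    matcher = difflib.SequenceMatcher(None, entry_a, entry_b, autojunk=False)
    return [
        (tag, entry_a[i1:i2], entry_b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
```

An empty result means the two entries agree character for character. Anything else goes to a verifier, who decides which version matches the page image.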

Each document page has associated with it

  • Text ground truth data for each text zone
  • Bounding box information for each zone on the page
  • Coarse level attributes for each document page
  • Finer level attributes (such as font size, alignment etc.) for each zone
  • Qualitative information on the condition of each page
  • AT & T Bell Labs degraded character images database
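
For our own Bangla database, the per-page information listed above could be held in a structure like the one below. This is a rough sketch: the class names, field names, and zone type labels are my assumptions, not the UW file format.

```python
from dataclasses import dataclass, field


@dataclass
class Zone:
    """One zone on a page: bounding box, type, text, and finer attributes."""
    left: int
    top: int
    right: int
    bottom: int
    zone_type: str  # hypothetical label such as "text" or "figure"
    text: str = ""  # ground-truth text, filled in for text zones only
    attributes: dict = field(default_factory=dict)  # e.g. font size, alignment

    def area(self):
        """Area of the bounding box in pixels (0 for a degenerate box)."""
        return max(0, self.right - self.left) * max(0, self.bottom - self.top)


@dataclass
class Page:
    """One document page with coarse attributes and its zones."""
    page_id: str
    attributes: dict = field(default_factory=dict)  # coarse page-level attributes
    condition: str = ""  # qualitative note on the condition of the page
    zones: list = field(default_factory=list)

    def text_zones(self):
        """Return only the zones that carry text ground truth."""
        return [zone for zone in self.zones if zone.zone_type == "text"]
```

Keeping the text, the bounding box, and the zone attributes together in one record makes it easy to export the same data to whatever file format we finally choose.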

UW – II

Contains the following:

  • 624 English journal document pages
    • 43 Complete articles
    • 63 MEMO pages
  • 477 Japanese journal document pages
  • Illuminator software produced by RAF
  • Corrected data files for the known errors in UW-I

Each document page has associated with it

  • Text ground truth data for each text zone
  • Bounding box information for each zone on the page
  • Coarse level attributes for each document page
  • Finer level attributes (such as font size, alignment etc.) for each zone
  • Qualitative information on the condition of each page

Illuminator is an editor for building document image understanding test and training sets, for easy correction of OCR errors, and for reverse-encoding the essential information and attributes of a document image page. It has been developed by RAF Technology, Inc., under the auspices of ARPA.

UW – III

Contains the following:

  • 25 pages of chemical formulae and ground truth in XFIG

  • 25 pages of mathematical formulae and ground truth in XFIG and LaTeX

  • 40 pages of line drawings and ground truth in AUTOCAD and IGES

  • 44 TIFF images of engineering drawings ground-truthed in AUTOCAD

  • 33 TIFF images of English text generated from LaTeX and their character level ground truth

  • 33 printed and scanned pages containing English text and their character level ground truth

  • 979 pages from UW-I in DAFS format, corrected for skew, and the word bounding boxes for each word on these pages

  • 623 pages from UW-II in DAFS format, corrected for skew, and the word bounding boxes for each word on 607 of these pages

  • Software for:

- Generating ground-truth for synthetic documents;

- Synthetically degrading document images;

- Registering ideal document images to real document images;

- Transferring character ground-truth between a pair of registered images;

- Estimating document degradation model parameters;

- Validating degradation models;

- Formatting Arabic, Hindi, and music documents in LaTeX;

- Testing hypotheses about parameters from a multivariate Gaussian population

- TIFF libraries (public domain software from ftp.sgi.com)
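
To give a feel for what synthetic degradation of a document image means, here is a deliberately simple sketch that flips pixels of a binary image at random. It is NOT the degradation model shipped with UW-III, whose details I have not reproduced here. The function name, the list-of-lists image format, and the uniform flip probability are all assumptions made for illustration.

```python
import random


def flip_pixels(image, flip_probability, seed=None):
    """Return a copy of a binary image with some pixels inverted at random.

    `image` is a list of rows, each row a list of 0/1 pixel values.
    Every pixel is flipped independently with probability
    `flip_probability`; `seed` makes the result repeatable.
    """
    rng = random.Random(seed)
    return [
        [1 - pixel if rng.random() < flip_probability else pixel for pixel in row]
        for row in image
    ]
```

Running a clean page through a function like this at several noise levels gives matched pairs of ideal and degraded images, which is the same kind of pairing the photocopy generations provide.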

Required Study Materials:

Research Papers:

[1] I.T. Phillips, S. Chen, J. Ha, and R.M. Haralick, English document database design and implementation methodology, Proc. of the Second Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, NV, April 1993, 65-104.

[2] Guyon, I., Haralick, R. M., Hull, J. J., and Phillips, I. T. (1997). Data sets for OCR and document image understanding research. In Bunke, H. and Wang, P., editors, Handbook of character recognition and document image analysis, pages 779–799. World Scientific, Singapore.

[3] Phillips, Ihsin T.; Ha, Jaekyu; Chen, Su; Haralick, Robert M., "Implementation methodology and error analysis for the University of Washington English Document Image Database-I", Proc. SPIE Vol. 2103, p. 155-173, 22nd AIPR Workshop: Interdisciplinary Computer Vision: Applications and Changing Needs.

Abstract: Document database preparation is a very time-consuming job and usually requires the involvement of many people. Any database is prone to having errors however carefully it was constructed. To assure the high quality of the document image database, a carefully planned implementation methodology is absolutely needed. In this paper, an implementation methodology that we employed to produce the UW English Document Image Database I is described. The paper also discusses how to estimate the distribution of errors contained in a database based on a double-data entry/double verification procedure.

[4] Bhattacharya, U.; Chaudhuri, B.B., "Databases for research on recognition of handwritten characters of Indian scripts", Proc. Eighth International Conference on Document Analysis and Recognition, 2005, Vol. 2, p. 789-793.

Abstract: Three image databases of handwritten isolated numerals of three different Indian scripts namely Devnagari, Bangla and Oriya are described in this paper. Grayscale images of 22556 Devnagari numerals written by 1049 persons, 12938 Bangla numerals written by 556 persons and 5970 Oriya numerals written by 356 persons form the respective databases. These images were scanned from three different kinds of handwritten documents - postal mails, job application form and another set of forms specially designed by the collectors for the purpose. The only restriction imposed on the writers is to write each numeral within a rectangular box. These databases are free from the limitations that they are neither developed in laboratory environments nor they are non-uniformly distributed over different classes. Also, for comparison purposes, each database has been properly divided into respective training and test sets.

Web Links:

[1] http://www.isical.ac.in/~ujjwal/download/database.html [Information about the Bangla Character Recognition database.]

[2] http://www.science.uva.nl/research/dlia/datasets/uwash1.html [UW English Document Image Database-I]

[3] http://documents.cfar.umd.edu/resources/database/UWII.html [UW-II English/Japanese Document Image Database]

[4] http://documents.cfar.umd.edu/resources/database/3UWCdRom.html [UW-III English/Technical Document Image Database]

[5] http://www.nist.gov/srd/ [NIST Scientific and Technical Database]

[6] http://www.cfar.umd.edu/~kia/ocr-faq.html#SECTION00022000000000000000 [OCR Frequently Asked Questions.]

[7] http://www.isical.ac.in/~rc_bangla/products.html#ocr [Resource Centre for Indian Language Technology Solutions Bangla]
