
Document Analysis

Text Extraction from Interfering Background in Double-sided Handwritten Archival Documents

Team Members: Chew Lim Tan, Ruini Cao, Qian Wang, Peiyi Shen
The National Archives of Singapore keeps a large number of double-sided handwritten archival documents. Over long periods of storage, ink has seeped through the pages of these documents, resulting in interfering images of handwriting coming from the back of the page. This research addresses the problem of extracting the front handwriting from the interfering strokes. An improved Canny edge detection method has been proposed, based on the observation that the foreground strokes are sharper while the interfering strokes are blurred. A wavelet approach is further proposed that maps the front and reverse sides of the same document page onto each other so as to enhance the foreground strokes and smear the interfering strokes. This gives the improved Canny edge detector a stronger ability to distinguish the two kinds of strokes. An automatic mapping of the double-sided pages using an adapted Murtagh's point matching method has also given interesting results.
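
The sharpness observation can be illustrated on a single scanline. The sketch below is not the improved Canny detector described above; it only compares the largest intensity jump between neighboring pixels, and the pixel values and the threshold of 100 are hypothetical.

```python
def max_gradient(profile):
    """Largest absolute intensity difference between neighboring pixels."""
    return max(abs(b - a) for a, b in zip(profile, profile[1:]))


def is_foreground_edge(profile, threshold=100):
    """Treat a scanline as crossing a sharp (front-side) stroke if its
    steepest intensity jump reaches the threshold (assumed value)."""
    return max_gradient(profile) >= threshold


# Hypothetical gray levels across one stroke (255 = white paper).
front_stroke = [250, 250, 40, 40, 250, 250]      # sharp front-side ink
bleed_through = [250, 210, 170, 170, 210, 250]   # blurred ink from the back

print(is_foreground_edge(front_stroke))    # True
print(is_foreground_edge(bleed_through))   # False
```

A real edge detector would work on two-dimensional gradients with smoothing and hysteresis; the point here is only that blurred strokes produce weaker gradients.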

Text Retrieval from Document Images without OCR

Team Members: Chew Lim Tan, Yue Lu, Weihua Huang, Sam Yuan Sung, Zhaohui Yu and Yi Xu
The usual way of doing text retrieval from scanned document images is to first convert the text into a machine-readable form such as ASCII. If the goal is to retrieve a handful of relevant text documents from a huge corpus for human reading, then full character recognition of the entire corpus is wasted effort if there is a cheaper way to retrieve these documents by merely treating them as images. This research explores this possibility. We capture character image features and use their sequential ordering and occurrence frequencies to build a document vector. The document vector is then matched against another document vector to measure content similarity, even though only image features are used and the textual content is never interpreted. Several approaches to extracting image features have been attempted. They include (1) vertical traverse density and horizontal traverse density, (2) the vertical bar method, and (3) pixel matching based on the Hausdorff distance. Each approach has its pros and cons. Some allow language independence, while others withstand noise and font variation. Yet another can even work in a compressed image format. In all these experiments, promising results have been obtained that demonstrate the feasibility of text retrieval without full text understanding. The method has the potential to allow a web crawler to retrieve relevant scanned document images from the web without having to painfully download each and every image just to spot a small number of relevant articles.
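
As an illustration of approach (3), the following is a minimal sketch of the standard Hausdorff distance between the black pixels of two small images. It is not the project's matching procedure; the glyph bitmaps and the helper name `black_pixels` are hypothetical.

```python
from math import hypot


def directed_hausdorff(a, b):
    """Largest distance from a point in `a` to its nearest point in `b`."""
    return max(min(hypot(ax - bx, ay - by) for bx, by in b) for ax, ay in a)


def hausdorff(a, b):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


def black_pixels(rows):
    """Convert rows of '#'/'.' characters into a list of (x, y) coordinates."""
    return [(x, y) for y, row in enumerate(rows)
            for x, ch in enumerate(row) if ch == "#"]


# Hypothetical 3x3 glyph images; the second has one pixel missing.
glyph_a = black_pixels(["#.#", "###", "#.#"])
glyph_b = black_pixels(["#.#", "###", "#.."])

print(hausdorff(glyph_a, glyph_a))  # 0.0
print(hausdorff(glyph_a, glyph_b))  # 1.0
```

A small distance means every black pixel in one image has a nearby black pixel in the other, which is why the measure tolerates minor noise.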

A search tool for document images in PDF files has been developed and is available for download here.

Restoration of Images Scanned From Thick Bound Documents

Team Members: Chew Lim Tan, Zheng Zhang
Perspective distortion always occurs when scanning thick, bound documents. This distortion causes two main kinds of degradation in the scanned grayscale image: (1) shadow along the spine of the book, and (2) warping of the words in the shadow. This research proposes a restoration system to solve these two problems. It first produces a vertical projection profile to detect which side of the image the shadow lies on, and a run-length method is used to find the boundary between the shadow and the clean area. Next, a modified Niblack's method is used to remove the shadow. A connected component analysis is then used to adjust the location and orientation of the warped words in the shadow area.
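
To show why a local threshold helps with shadow, here is a minimal sketch of the standard Niblack rule, in which a pixel is compared with the mean plus k times the standard deviation of its neighborhood. It is not the modified method used in this project; the window size, k = -0.2 and the sample gray levels are assumptions.

```python
from statistics import mean, pstdev


def niblack_binarize(image, window=3, k=-0.2):
    """Mark a pixel as ink (1) when it is darker than
    mean + k * stddev of its local window (standard Niblack rule)."""
    height, width = len(image), len(image[0])
    half = window // 2
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Collect the window, clipped at the image borders.
            neighborhood = [
                image[j][i]
                for j in range(max(0, y - half), min(height, y + half + 1))
                for i in range(max(0, x - half), min(width, x + half + 1))
            ]
            threshold = mean(neighborhood) + k * pstdev(neighborhood)
            row.append(1 if image[y][x] < threshold else 0)
        result.append(row)
    return result


# Hypothetical page: background darkens toward the spine on the right.
# Ink pixels are 90 (clean area) and 20 (inside the shadow).
page = [
    [220, 200, 150, 100, 70],
    [220,  90, 150,  20, 70],
    [220, 200, 150, 100, 70],
]

for row in niblack_binarize(page):
    print(row)
```

In this sample the shadowed background (70) is darker than the ink in the clean area (90), so no single global threshold separates ink from paper, while the local rule marks only the two ink pixels.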

Binarizing Document Images using Coplanar Prefilter

Team Members: Lixin Fan, Liying Fan, Chew Lim Tan
A novel coplanar filter is proposed in this research that exploits the coplanarity of the gray-level distribution of neighboring pixels to pre-filter document images before binarization. Experiments show that the proposed filter exhibits the following desired properties for document image binarization: (1) impulsive noise removal, (2) piecewise smoothing, and (3) sharp edge preservation.

Machine Learning Methods for Document Information Mining

Team Members: Ji He, Ah Hwee Tan, Chew Lim Tan
Document categorization refers to the task of automatically assigning one or more predefined category labels to a free text document. Various supervised machine learning methods have been extensively studied in the English text categorization literature. However, relatively few of them have been benchmarked for Chinese text categorization. In the present study, two Chinese document corpora were constructed, namely the TREC People's Daily news corpus and the Chinese web corpus, for use in a series of comparative experiments on several machine learning methods, namely kNN, SVM and ARAM (Adaptive Resonance Associative Map, which is based on the well-known ART, Adaptive Resonance Theory). Building on the results of the present study, further research issues of ART are currently being examined, such as rule insertion and variance determination for constraints on the number of clusters.
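
Of the three methods compared, kNN is the simplest to sketch. The example below is a generic kNN text classifier using cosine similarity over term counts, not the configuration benchmarked in the study; it assumes documents are already segmented into tokens, and the training data and labels are hypothetical.

```python
from collections import Counter
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * v[term] for term, count in u.items() if term in v)
    norm = (sqrt(sum(c * c for c in u.values()))
            * sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0


def knn_classify(query_tokens, training, k=3):
    """Return the majority label among the k most similar training documents.

    `training` is a list of (token list, label) pairs.
    """
    query = Counter(query_tokens)
    ranked = sorted(training,
                    key=lambda item: cosine(query, Counter(item[0])),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical pre-segmented training documents.
training = [
    (["stock", "market", "bank"], "finance"),
    (["bank", "loan", "rate"], "finance"),
    (["match", "goal", "team"], "sports"),
    (["team", "coach", "season"], "sports"),
]

print(knn_classify(["bank", "rate", "market"], training))  # finance
print(knn_classify(["team", "goal"], training))            # sports
```

For Chinese text the token lists would come from a word segmenter or from character n-grams, a step this sketch leaves out.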

Web Document Structure Analysis

Team Members: Lakshmi Vijjappu, Ah Hwee Tan, Chew Lim Tan
This research aims at analyzing the structural content of web pages in order to extract useful information. This is done by exploiting the latent information given by HTML tags. For each specific extraction task, an object model is created consisting of the salient fields for the extraction and the corresponding extraction rules, which are based on a library of HTML parsing functions. Our system has been tested in two sample domains, namely news article extraction and search engine link extraction.
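
A minimal sketch of tag-driven extraction for the link extraction domain follows, using Python's standard `html.parser` module. The project's own object model and parsing library are not described here, so the `LinkExtractor` class, the `extract_links` helper and the sample markup are all hypothetical.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (href, anchor text) pairs from <a> tags that carry an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Run the extractor over an HTML string and return its links."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical search result markup.
sample = (
    '<ul><li><a href="http://example.com/a">First <b>result</b></a></li>'
    '<li><a href="http://example.com/b">Second</a></li>'
    '<li>No link</li></ul>'
)
print(extract_links(sample))
```

Here the "fields" are the link target and its anchor text, and the "rule" is simply the `<a>` tag; a fuller object model would map several fields to different tag patterns.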

A Clustering-based Approach to Text/Graphics Separation

Team Members: S. He, C.L. Tan, N. Abe
A clustering-based approach to the separation of text from mixed text/graphics documents is proposed. The approach starts from the grouping of connected components. Clustering is employed at three critical stages to improve the efficiency and effectiveness of the grouping, i.e. before the grouping, before orientation estimation, and after orientation estimation. Because of the high accuracy of the estimated orientation, not only the overgrouping cases but also most of the undergrouping cases can be handled successfully.

Document Text Segmentation using Multi-band Disc Model

Team Members: Chew Lim Tan, Bo Yuan
A multi-band disc model for document page segmentation is proposed to segregate text blocks from graphics images. The disc model takes a bottom-up approach that tries to establish a local neighborhood of objects on a page and then traces the propagation of that neighborhood until all objects in text blocks are reached. The model can be applied to text with mixed typefaces and arbitrary outline shapes. It is tolerant of skew or misalignment of the objects in the input images.

A Generic Information Extraction Architecture

Team Members: Chew Lim Tan, Lee Kwang Angela Wee, Loong Cheong Tong
The advent of computing has exacerbated the problem of overwhelming information. To manage the deluge of information, Information Extraction systems can be used to automatically extract relevant information from free-form text for update to databases or for report generation. One of the major challenges in Information Extraction is the representation of domain knowledge in the task, that is, how to represent the meaning of the input text, the knowledge of the field of application, and the knowledge about the target information to be extracted. We have chosen a directed graph structure, a domain ontology and a frame representation respectively. We have further developed a Generic Information Extraction (GIE) architecture that combines these knowledge structures for the processing task. The GIE system is able to extract information from free-form text and to further infer and derive new information. It analyzes the input text into a graph structure and subsequently unifies the graph and the ontology to extract relevant information, driven by the frame structure during a template filling process. The GIE system has been adopted for use in the Message Formatting Expert (MFE) system, a large-scale information extraction system for a specific financial application within a major bank in Singapore.

Chart Recognition in Document Images

Team Members: Chew Lim Tan, Yanping Zhou
A project is proposed to read and interpret charts and graphs in documents. The results will be translated into HTML tabulated data. Scanned documents will be preprocessed to identify possible areas of charts and graphs. Several approaches have been tried. These include low-level processing to extract chart line features, a statistical approach using hidden Markov models (HMMs), and neural network training.

Handwritten Character Recognition

Team Members: Chew Lim Tan, Daming Shi
Approaches to handwritten Chinese character and numeral recognition are proposed. Both the image of a Chinese character and its structural information are used: the Rapid-transformed stroke density features of a Chinese character will be used in the preliminary classification, and an improved Extension Matrix algorithm will be carried out in the final classification to get approximate solutions of the recognition rules as well as to estimate the probability density function of the overlay area amongst positive and negative examples. A supervised learning scheme for the Neocognitron based on genetic algorithms (GAs) will be implemented and applied to handwritten numeral recognition. To ensure that the GA runs successfully, the correlation amongst potential training patterns is used to maintain the diversity of the population.

Detection of Word Groups Based on Irregular Pyramid

Team Members: Chew Lim Tan, Poh Kok Loo
We propose an algorithm based on an irregular pyramid to detect word groups in imaged documents. What distinguishes this approach is that it includes strategic background information in the analysis, which most techniques discard. Both the foreground (i.e. text area) and portions of the background (i.e. white area) are examined. The algorithm is based on the concept of "closeness": text items within a group are closer to each other, in terms of spatial distance, than they are to other text areas. The results are encouraging, with the algorithm able to correctly group words of different sizes, fonts, alignments and orientations.

Language Identification in Multilingual Documents

Team Members: Chew Lim Tan, Peck Yoke Leong, Shoujie He
Most optical character recognition (OCR) systems can recognize at most a few languages. For large archives of document images that contain different languages, there must be some way to automatically categorize these documents before applying the proper OCR to them. This report presents research on the identification of English, Chinese, Malay and Tamil (the four official languages of Singapore) in image documents, as well as an implementation of a document recognition system. The system combines and extends some of the techniques inspired by current research in this area. However, while most other work focuses on English, European, Chinese and Japanese languages, this research concentrates on the four official languages of Singapore. As such, it will be of interest to developers in Singapore and its neighboring countries.

News Tracking from Microfilm Images

Team Members: Chew Lim Tan, Daming Shi, Yi Xu, Zhaohui Yu, Sam Yuan Sung
A project is proposed to allow efficient retrieval of news articles from a huge amount of microfilm images. Presently, a person who needs to locate news articles related to a certain event must know roughly the date of the event in order to quickly find the microfilm and then search it manually for the articles concerned. The project aims to use a vector descriptor of the news article contents to allow comparison for similarity. Each newspaper page from microfilm will be digitized and preprocessed to remove graphics. Subsequently, OCR will be used to obtain ASCII text data, from which an n-gram vector descriptor will be built. Similarity of vectors will be measured using the dot product.
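
The n-gram descriptor and dot product comparison can be sketched as follows. The project does not state the n-gram order or any normalization, so the use of character trigrams, lower-casing and raw counts are assumptions, and the sample headlines are hypothetical.

```python
from collections import Counter


def ngram_vector(text, n=3):
    """Count overlapping character n-grams in lower-cased text
    with runs of whitespace collapsed to single spaces."""
    cleaned = " ".join(text.lower().split())
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))


def dot_product(u, v):
    """Dot product of two sparse n-gram count vectors."""
    return sum(count * v[gram] for gram, count in u.items() if gram in v)


# Hypothetical OCR output for a query and two articles.
query = ngram_vector("flood")
article_a = ngram_vector("Flood warning issued")
article_b = ngram_vector("Budget speech")

print(dot_product(query, article_a))  # 3
print(dot_product(query, article_b))  # 0
```

Character n-grams are often preferred over whole words for OCR text because a single misrecognized character damages only a few n-grams rather than the entire word.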

Text Extraction from Intersecting or Overlapping Graphics Lines

Team Members: Chew Lim Tan, Ruini Cao
One of the main problems in text extraction from document images containing graphics is the touching, intersecting or overlapping of text with graphics lines. Many existing text extraction algorithms simply ignore this problem. The present research tries to address it, using degree information at intersection points together with line continuation, much as human vision does when separating intersecting text segments. Several approaches have been tried, including the use of a rho space and the detachment of text segments. Our methods are used in such applications as Chinese stroke extraction and text/graphics separation in maps.

Text/Graphics Separation Using Agent-Based Pyramid Operations

Team Members: Chew Lim Tan, Bo Yuan, Poh Oon Ng, Weihua Huang, Qiang Wang, Zheng Zhang
A document image analysis system has been developed using multiple agents working on a pyramid structure to separate text from graphics in the image. Text strings appear as different groupings of connected components at different resolutions of the image. As such, the pyramid structure, which is a multi-resolution image representation, provides a natural means of identifying and grouping character strings in the document. The pyramid structure is also amenable to parallel processing, whereby multiple agents in the system can individually and concurrently look for groups of connected components at appropriate levels. An agent may in turn spawn new agents when connected components become disjoint at finer resolution levels. Unlike other existing work, the agent-based pyramid operations do not require expensive feature analysis among different connected components to detect text strings.
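
The way groupings change with resolution can be seen in a small sketch. It builds a regular pyramid by 2x2 reduction and counts 8-connected components at each level; the agent mechanism itself is not modeled, and the reduction rule, the function names and the sample bitmap are assumptions.

```python
from collections import deque


def count_components(grid):
    """Count 8-connected groups of 1-pixels using breadth-first search."""
    height, width = len(grid), len(grid[0])
    seen = set()
    count = 0
    for y in range(height):
        for x in range(width):
            if grid[y][x] == 0 or (y, x) in seen:
                continue
            count += 1
            seen.add((y, x))
            queue = deque([(y, x)])
            while queue:
                cy, cx = queue.popleft()
                for ny in range(max(0, cy - 1), min(height, cy + 2)):
                    for nx in range(max(0, cx - 1), min(width, cx + 2)):
                        if grid[ny][nx] == 1 and (ny, nx) not in seen:
                            seen.add((ny, nx))
                            queue.append((ny, nx))
    return count


def reduce_level(grid):
    """Halve the resolution: a coarse pixel is 1 if any pixel
    in its 2x2 block is 1 (assumed reduction rule)."""
    return [
        [
            1 if any(
                grid[j][i]
                for j in range(y, min(y + 2, len(grid)))
                for i in range(x, min(x + 2, len(grid[0])))
            ) else 0
            for x in range(0, len(grid[0]), 2)
        ]
        for y in range(0, len(grid), 2)
    ]


# Hypothetical text line: two "words" of three narrow "characters" each.
page = [[1 if ch == "#" else 0 for ch in row]
        for row in ["#.#.#.....#.#.#."] * 2]

levels = [page]
for _ in range(2):
    levels.append(reduce_level(levels[-1]))
counts = [count_components(level) for level in levels]
print(counts)  # [6, 2, 1]
```

At full resolution the six characters are separate components, one level up they merge into two words, and at the coarsest level into a single text line.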

Document Page Layout Analysis using Edge Information

Team Members: Chew Lim Tan, Qing Yuan
A method is proposed that uses edge information to extract textual blocks from grayscale document images. It aims at detecting textual regions in noise-infected newspaper images and separating them from graphical regions. The algorithm traces the feature points in different entities and then groups the edge points of textual regions. Through line approximation and layout categorization, it can successfully retrieve directionally placed text blocks. Finally, feature-based connected component merging is introduced to gather homogeneous textual regions together within the scope of their bounding rectangles. By handling line segments instead of individual pixels, we obtain correct page decomposition with efficient computation and reduced memory use. The proposed method has been tested on a large group of newspaper images with multiple page layouts.
 
     