Text Extraction from Interfering
Background in Double-sided Handwritten Archival Documents
|
| Team Members: |
Chew Lim Tan, Ruini
Cao, Qian Wang, Peiyi Shen |
The National Archives
of Singapore keeps a large number of double-sided handwritten
archival documents. Over long periods of storage, ink has seeped
through the pages of these documents, resulting in interfering
images of handwriting coming from the back of the page. This
research addresses this problem of extracting the front handwriting
from the interfering strokes. An improved Canny edge detection
method has been proposed based on the observation that the foreground
strokes are sharper while the interfering strokes are blurred.
A wavelet approach is further proposed by mapping the front
and reverse sides of the same document page so as to enhance
the foreground strokes and smear the interfering strokes. This
is to provide a stronger distinguishing capability of the improved
Canny edge detector. An automatic mapping of the double-sided
pages using an adapted Murtagh's point matching method has also
given interesting results.
|
 |
Text Retrieval from Document Images without
OCR
|
| Team Members: |
Chew Lim Tan, Yue Lu,
Weihua Huang, Sam Yuan Sung, Zhaohui Yu and Yi Xu |
| A usual way of doing
text retrieval from scanned document images is to first convert
the text into a machine readable form such as ASCII. If the
goal is to retrieve a handful of relevant text documents from
a huge corpus for human reading, then doing full character recognition
of the entire corpus will be a wasteful effort if there is a
cheaper way to retrieve these documents by merely treating them
as images. This research explores this possibility. Basically,
we try to capture character image features and use the sequential
ordering and occurrence frequencies to build some kind of a
document vector. The document vector is then used to match with
another document vector to find its content similarity even
though there is no understanding of the textual content at all
but just image features. Several approaches to extracting image
features have been attempted. They include (1) vertical traverse
density and horizontal traverse density, (2) vertical bar method,
and (3) pixel matching based on Hausdorff distance. There are
pros and cons with the various approaches. Some allow language
independence, while others withstand noise and font variation.
Yet another can even work in a compressed image format. In all
these experiments, promising results have been obtained to demonstrate
the feasibility of text retrieval without full text understanding.
The method will have potential in allowing a web crawler to
go to the web to retrieve relevant scanned document images without
having to painfully download each and every image just to spot
a small number of relevant articles.
A search tool for document images in PDF files has been developed and
is available for download.
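As a rough illustration of the first feature type, the sketch below computes traverse densities for a small binary character image. The exact definition used in the project is not given here, so this assumes that a traverse density counts how many separate strokes a scan line crosses; the function names are hypothetical.

```python
def count_stroke_crossings(scan_line):
    """Count runs of ink pixels (1s) met along one scan line."""
    crossings = 0
    previous = 0
    for pixel in scan_line:
        if pixel == 1 and previous == 0:
            crossings += 1  # entering a new stroke
        previous = pixel
    return crossings


def vertical_traverse_density(image):
    """One count per column, scanning each column from top to bottom."""
    columns = zip(*image)
    return [count_stroke_crossings(column) for column in columns]


def horizontal_traverse_density(image):
    """One count per row, scanning each row from left to right."""
    return [count_stroke_crossings(row) for row in image]


# A 5x5 binary image shaped like the letter "H" (1 = ink, 0 = background).
letter_h = [
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
]

vtd = vertical_traverse_density(letter_h)
htd = horizontal_traverse_density(letter_h)
```

Feature vectors like these can be compared between character images without ever recognizing which character is present.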
|
 |
Restoration of Images Scanned From Thick
Bound Documents
|
| Team Members: |
Chew Lim Tan, Zheng
Zhang |
Perspective distortion
always occurs when scanning thick, bound documents. This distortion
causes two main kinds of degradation in the scanned grayscale
image: (1) a shadow along the spine of the book, and (2) warping
of the words in the shadow. This research proposes a restoration
system to solve these two problems. It first produces a vertical
projection profile to detect which side of the image the shadow
lies on, and a run-length method is used to find the boundary
between the shadow and the clean area. Next, a modified Niblack's
method is used to remove the shadow. A connected component analysis
is then used to adjust the location and orientation of the warped
words in the shadow area.
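The project's modification to Niblack's method is not described here, so the sketch below shows only the standard Niblack local threshold, T = m + k * s, where m and s are the mean and standard deviation of the gray levels in a window around each pixel. The window size and the value k = -0.2 are assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def niblack_binarize(gray, window=3, k=-0.2):
    """Standard Niblack thresholding: a pixel is ink (1) if it is darker
    than the local threshold mean + k * std of its neighborhood."""
    half = window // 2
    rows, cols = len(gray), len(gray[0])
    result = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Collect the neighborhood, clipped at the image border.
            neighborhood = [
                gray[rr][cc]
                for rr in range(max(0, r - half), min(rows, r + half + 1))
                for cc in range(max(0, c - half), min(cols, c + half + 1))
            ]
            threshold = mean(neighborhood) + k * pstdev(neighborhood)
            result[r][c] = 1 if gray[r][c] < threshold else 0
    return result


# A dark pixel on a bright background is kept as ink.
patch = [
    [200, 200, 200],
    [200, 50, 200],
    [200, 200, 200],
]
binary = niblack_binarize(patch)
```

Because the threshold follows the local mean, a region darkened by a shadow gets a lower threshold than the clean area, which is why a local method suits this problem better than a single global threshold.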
|
 |
Binarizing Document Images using Coplanar
Prefilter
|
| Team Members: |
Lixin Fan, Liying Fan,
Chew Lim Tan |
A novel coplanar filter
is proposed in this research that exploits the coplanarity of
gray-level distribution of neighboring pixels, to pre-filter
the document images before binarization. Experiments show that
the proposed filter exhibits the following desired properties
for document image binarization: (1) impulsive noise removal,
(2) piecewise smoothing, and (3) sharp edge preservation.
|
 |
Machine Learning Methods for Document Information
Mining
|
| Team Members: |
Ji He, Ah Hwee Tan,
Chew Lim Tan |
Document categorization
refers to the task of automatically assigning one or multiple
predefined category labels to a free text document. Various
supervised machine learning methods have been extensively studied
in the English text categorization literature. However, relatively
few of them have been benchmarked for Chinese text categorization.
In the present study, two Chinese document corpora were constructed,
namely the TREC People's Daily news corpus and the Chinese web
corpus, for use in a series of comparative experiments on several
machine learning methods, namely kNN, SVM and ARAM (Adaptive
Resonance Associative Map, which is based on the well-known ART,
Adaptive Resonance Theory). Building on the results of the present
study, further research issues of ART are currently being examined,
such as rule insertion and variance determination for constraints
on the number of clusters.
|
 |
Web Document Structure Analysis
|
| Team Members: |
Lakshmi Vijjappu, Ah
Hwee Tan, Chew Lim Tan |
This research aims
at analyzing the structural content of web pages in order to
extract useful information. This is done by exploiting the latent
information given by HTML tags. For each specific extraction
task, an object model is created consisting of the salient fields
for the extraction and the corresponding extraction rules based
on a library of HTML parsing functions. Our system has been
tested in two sample domains, namely, news article extraction
and search engine links extraction.
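The idea of pairing salient fields with tag-based extraction rules can be sketched as follows. This is not the project's parsing library: it uses Python's standard html.parser, and the rule format (field name mapped to a tag name) and the class and function names are assumptions made for illustration.

```python
from html.parser import HTMLParser


class FieldExtractor(HTMLParser):
    """Collects the text inside the tags named by the extraction rules."""

    def __init__(self, rules):
        super().__init__()
        self.rules = rules  # field name -> HTML tag name (hypothetical format)
        self.fields = {name: [] for name in rules}
        self._open = []  # stack of (tag, field name) currently being captured

    def handle_starttag(self, tag, attrs):
        for name, rule_tag in self.rules.items():
            if tag == rule_tag:
                self.fields[name].append("")
                self._open.append((tag, name))

    def handle_endtag(self, tag):
        if self._open and self._open[-1][0] == tag:
            self._open.pop()

    def handle_data(self, data):
        # Text belongs to every field whose tag is still open.
        for _, name in self._open:
            self.fields[name][-1] += data


def extract(html, rules):
    extractor = FieldExtractor(rules)
    extractor.feed(html)
    return extractor.fields


# A hypothetical object model for a news article page.
news_rules = {"headline": "h1", "paragraphs": "p"}
page = (
    "<html><head><title>Daily News</title></head><body>"
    "<h1>Ink Study</h1><p>First para.</p>"
    "<p>Second <b>bold</b> para.</p></body></html>"
)
article = extract(page, news_rules)
```

A different extraction task, such as collecting search engine result links, would supply its own rule set rather than new parsing code.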
|
 |
A Clustering-based Approach to Text/Graphics
Separation
|
| Team Members: |
S. He, C.L. Tan, N.
Abe |
A clustering-based
approach to the separation of text from mixed text/graphics
documents is proposed. The approach starts from the grouping
of connected components. Clustering is employed at three critical
stages to improve the efficiency and effectiveness of the grouping,
i.e. prior to the grouping, prior to orientation estimation,
and after the orientation estimation. Because of the
high accuracy of the estimated orientation, not only the overgrouping
cases but also most of the undergrouping cases can be handled successfully.
|
 |
Document Text Segmentation using Multi-band
Disc Model
|
| Team Members: |
Chew Lim Tan, Bo Yuan |
A multi-band disc model
for document page segmentation is proposed to segregate text
blocks from graphics images. The disc model takes a bottom-up
approach that tries to establish the local neighborhood of objects
on a page and then traces the propagation of such neighborhoods
until all objects in text blocks are reached. The model can
be applied to text with mixed typefaces with arbitrary outline
shapes. It is tolerant of skew or misalignment of the objects
in the input images.
|
 |
A Generic Information Extraction Architecture
|
| Team Members: |
Chew Lim Tan, Lee Kwang
Angela Wee, Loong Cheong Tong |
The advent of computing
has exacerbated the problem of overwhelming information. To
manage the deluge of information, Information Extraction systems
can be used to automatically extract relevant information from
free-form text for update to databases or for report generation.
One of the major challenges to Information Extraction is the
representation of domain knowledge in the task, that is, how
to represent the meaning of the input text, the knowledge of
the field of application, and the knowledge about the target
information to be extracted. We have chosen a directed graph
structure, a domain ontology and a frame representation
respectively. We have further developed a Generic Information Extraction
(GIE) architecture that combines these knowledge structures
for the task of processing. The GIE system is able to extract
information from free-form text, and to further infer and derive new
information. It analyzes the input text into a graph structure
and subsequently unifies the graph and the ontology for extraction
of relevant information, driven by the frame structure during
a template filling process. The GIE system has been adopted
for use in the Message Formatting Expert (MFE) system, a large-scale
information extraction system for a specific financial application
within a major bank in Singapore.
|
 |
Chart Recognition in Document Images
|
| Team Members: |
Chew Lim Tan, Yanping
Zhou |
A project is proposed
to read and interpret charts and graphs in documents. The results
will be translated into HTML tabulated data. Scanned documents
will be preprocessed to identify possible areas of charts and
graphs. Several approaches have been tried. These include
low-level processing to extract chart line features, a statistical
approach using HMMs, and neural network training.
|
 |
Handwritten Character Recognition
|
| Team Members: |
Chew Lim Tan, Daming
Shi |
Approaches to handwritten
Chinese character and numeral recognition are proposed. Both
the image of a Chinese character and its structural information
are used, in which the Rapid transformed stroke density features
of a Chinese character will be used in the preliminary classification,
and an improved Extension Matrix algorithm will be carried out
in the final classification to get the approximate solutions of
recognition rules as well as to estimate the probability density
function of the overlay area amongst positive and negative examples.
A GA-based supervised learning of the Neocognitron will be implemented
and applied to handwritten numeral recognition. To ensure that
the GA runs successfully, the correlation amongst potential training
patterns is used to maintain the diversity of the population.
|
 |
Detection of Word Groups Based on Irregular
Pyramid
|
| Team Members: |
Chew Lim Tan, Poh Kok
Loo |
We propose an algorithm
based on an irregular pyramid to detect word groups in imaged
documents. The uniqueness of this approach is its inclusion
of strategic background information in the analysis, which most
techniques discard. Both the foreground (i.e. text area)
and portions of the background (i.e. white area) regions are
examined. The algorithm is based on the concept of "closeness"
where text information within a group is close to each other,
in terms of spatial distance, as compared to other text areas.
The results are encouraging: the algorithm correctly groups
words of different sizes, fonts, alignments and orientations.
|
 |
Language Identification in Multilingual
Documents
|
| Team Members: |
Chew Lim Tan, Peck
Yoke Leong, Shoujie He |
Most optical character
recognition (OCR) systems can recognize at most a few languages.
For large archives of document images that contain different
languages, there must be some way to automatically categorize
these documents before applying the proper OCR on them. This
report presents research on the identification of English,
Chinese, Malay and Tamil (the four official languages of Singapore)
in image documents, as well as an implementation of a document
recognition system. The system developed combines and extends
some of the techniques inspired by current research in this
area. However, while most other work focuses on English, European,
Chinese and Japanese languages, this research concentrates on
the four official languages of Singapore. As such, it will be
of interest to developers in Singapore and its neighboring countries.
|
 |
News Tracking from Microfilm Images
|
| Team Members: |
Chew Lim Tan, Daming
Shi, Yi Xu, Zhaohui Yu, Sam Yuan Sung |
A project is proposed
to allow efficient retrieval of news articles from a huge number
of microfilm images. Presently, a person who needs to locate
news articles related to a certain event will need to know roughly
the date of the event in order to quickly find the microfilm
for manual search of the articles concerned. The project aims
to use some kind of vector descriptor of the news article contents
to allow comparison for similarity. Each newspaper page from
microfilm will be digitized and preprocessed to remove graphics.
Subsequently, OCR will be used to obtain ASCII text data from
which an n-gram vector descriptor will be built. Similarity of
vectors will be measured using the dot product.
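A minimal sketch of the last two steps is given below, assuming character trigrams and a length-normalized dot product. The project only states that an n-gram vector and a dot product are used, so the value n = 3, the normalization, and the function names are illustrative assumptions.

```python
import math
from collections import Counter


def ngram_vector(text, n=3):
    """Count the character n-grams of the text (case and spacing normalized)."""
    cleaned = " ".join(text.lower().split())
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))


def dot_similarity(a, b):
    """Dot product of two n-gram count vectors, divided by their lengths
    (assumed normalization, so that long articles do not dominate)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


query = ngram_vector("flood hits city centre")
related = ngram_vector("city centre flood damage")
unrelated = ngram_vector("stock market rallies")

score_related = dot_similarity(query, related)
score_unrelated = dot_similarity(query, unrelated)
```

Character n-grams have the practical advantage that a few misrecognized characters from OCR on noisy microfilm change only a few vector entries, leaving most of the overlap between related articles intact.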
|
 |
Text Extraction from Intersecting or Overlapping
Graphics Lines
|
| Team Members: |
Chew Lim Tan, Ruini
Cao |
One of the main problems
in text extraction from document images containing graphics
is the touching, intersecting or overlapping of text with graphics
lines. Many existing text extraction algorithms simply ignore
this problem. The present research tries to address this problem,
using degree information at intersection points with line continuation,
as in the case of human vision in separating intersecting text
segments. Several approaches have been tried, including the
use of a rho space and the detachment of text segments. Our
methods are used in such applications as Chinese strokes extraction
and text/graphics separation in maps.
|
 |
Text/Graphics Separation Using Agent-Based
Pyramid Operations
|
| Team Members: |
Chew Lim Tan, Bo Yuan,
Poh Oon Ng, Weihua Huang, Qiang Wang, Zheng Zhang |
A document image analysis
system has been developed using multiple agents working on a
pyramid structure to separate text from graphics in the image.
Text strings appear as different groupings of connected components
at different resolutions of the image. As such, the pyramid
structure, which is a multi-resolution image representation,
provides a natural means of identifying and grouping
character strings in the document. The pyramid structure is
also amenable to parallel processing, whereby multiple agents
in the system can individually and concurrently look for groups
of connected components at appropriate levels. An agent may
in turn spawn new agents when connected components become disjointed
at finer resolution levels. The agent-based pyramid operations
do not require expensive feature analysis among different connected
components to detect text strings as found in other existing
works.
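The observation that text strings show up as different groupings of connected components at different resolutions can be illustrated with a small sketch. It covers only the multi-resolution part, not the agents: each pyramid level is assumed to be built by 2x2 OR-reduction, components are counted with 8-connectivity, and all function names are hypothetical.

```python
def reduce_level(image):
    """Halve the resolution: a coarse pixel is ink if any pixel in its
    2x2 block (clipped at the border) is ink."""
    rows, cols = len(image), len(image[0])
    return [
        [
            1 if any(
                image[rr][cc]
                for rr in range(r, min(r + 2, rows))
                for cc in range(c, min(c + 2, cols))
            ) else 0
            for c in range(0, cols, 2)
        ]
        for r in range(0, rows, 2)
    ]


def count_components(image):
    """Count 8-connected components of ink pixels with a flood fill."""
    rows, cols = len(image), len(image[0])
    seen = set()
    count = 0
    for r in range(rows):
        for c in range(cols):
            if image[r][c] and (r, c) not in seen:
                count += 1
                seen.add((r, c))
                stack = [(r, c)]
                while stack:
                    y, x = stack.pop()
                    for ny in range(max(0, y - 1), min(rows, y + 2)):
                        for nx in range(max(0, x - 1), min(cols, x + 2)):
                            if image[ny][nx] and (ny, nx) not in seen:
                                seen.add((ny, nx))
                                stack.append((ny, nx))
    return count


def pyramid_component_counts(image, levels):
    """Component count at the base level and at each coarser level."""
    counts = [count_components(image)]
    for _ in range(levels):
        image = reduce_level(image)
        counts.append(count_components(image))
    return counts


# Two "words": three characters, a wide gap, then two characters.
row = [1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1]
text_line = [row[:], row[:]]
counts = pyramid_component_counts(text_line, levels=2)
```

In this toy line the five character-level components at full resolution merge into two word-level components one level up, and the count stays at two at the next level, which suggests the level at which the strings are best picked up.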
|
 |
Document Page Layout Analysis using Edge
Information
|
| Team Members: |
Chew Lim Tan, Qing
Yuan |
A method is proposed
to use edge information to extract textual blocks from grayscale
document images. It aims at detecting textual regions
in noisy newspaper images and separating them from graphical
regions. The algorithm traces the feature points in different
entities and then groups those edge points of textual regions.
By line approximation and layout categorization, it can successfully
retrieve directionally placed text blocks. Finally, feature-based
connected component merging is introduced to gather homogeneous
textual regions together within the scope of their bounding rectangles.
We can obtain correct page decomposition with efficient computation
and reduced memory size by handling line segments instead of
individual pixels. The proposed method has been tested on a large
group of newspaper images with multiple page layouts.
|
 |
|