|
The information age is upon us.
We face a myriad of information in many different forms every day. The challenge
is how to mine and extract useful information from this multitude of
sources. The information mining and extraction group led by
Prof. Tan Chew Lim in the Department’s AI group works on various techniques
pertaining to the following information sources.
Document analysis is the task of examining document images in order to
acquire an understanding of their contents. Thanks to the rapid advances in
printing, scanning, photocopying and digital photography in recent years, we
witness today a proliferation of document images in our daily lives. Such
document images are easily understood by humans but pose great challenges to
machines. In the early years, document analysis research aimed at extracting and
recognizing text in document images, leading to the now mature technology of
Optical Character Recognition (OCR). Another early research topic was layout
analysis of document images, which improves the detection of text
regions in the documents. In recent years, document analysis research has moved
into new areas including enhancement or correction of distorted or degraded
document images, such as historical documents and text images captured by
digital cameras and mobile devices such as phones and PDAs. Another recent
research interest is in the understanding of graphics and symbols embedded in
document images. The document analysis group in the School of Computing is
actively involved in some of these recent advances as described below:
1.1 Document Image Binarization
By binarization, we mean separating the foreground text from its
background. Although traditional, well-known binarization techniques perform
reasonably on relatively clean documents, they are unable to deal with
documents that contain complex backgrounds such as smudges,
uneven illumination and color variation. One method that we have developed
finds the local maximum and minimum contrast in the document image, so that the
binarization process is more responsive to local variation in the
background. Another method we have developed is an iterative smoothing procedure
that estimates the document background variation in the form of a polynomial
surface. The estimated background is then subtracted from the document
image intensity values to achieve a cleaning effect. Figure 1 shows the results
of binarization using the above two methods.
Figure 1. (a) Input images; (b) the binary images constructed by
Local Maximum and Minimum; (c) the binary images constructed by Background
Estimate.
Details about the above methods can be found here.
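The contrast-based idea can be illustrated with a minimal Python sketch. It is not the group's implementation: the 3x3 contrast window, the fixed `contrast_thresh`, the `radius` and the `min_edges` count are all assumptions chosen for this toy example. A pixel is marked as text when enough high-contrast pixels lie nearby and its intensity is no brighter than their mean.

```python
def local_contrast(img, eps=1e-6):
    """Contrast (max - min) / (max + min + eps) over each pixel's 3x3 window."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [img[j][i]
                      for j in range(max(0, y - 1), min(h, y + 2))
                      for i in range(max(0, x - 1), min(w, x + 2))]
            hi, lo = max(window), min(window)
            out[y][x] = (hi - lo) / (hi + lo + eps)
    return out


def binarize(img, contrast_thresh=0.3, radius=2, min_edges=6):
    """Return a 0/1 image; 1 marks pixels classified as text."""
    h, w = len(img), len(img[0])
    contrast = local_contrast(img)
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Intensities of nearby high-contrast pixels (likely stroke edges).
            edges = [img[j][i]
                     for j in range(max(0, y - radius), min(h, y + radius + 1))
                     for i in range(max(0, x - radius), min(w, x + radius + 1))
                     if contrast[j][i] >= contrast_thresh]
            if len(edges) >= min_edges and img[y][x] <= sum(edges) / len(edges):
                out[y][x] = 1
    return out


# Toy page: background brightens from left to right (uneven illumination),
# with one dark vertical stroke in column 3.
page = [[40 if x == 3 else 150 + 10 * x for x in range(7)] for _ in range(5)]
binary = binarize(page)
```

On this toy page only the stroke column is marked as text, whereas a single global threshold placed near the darker end of the background would also pick up the dim left margin.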
1.2 Historical Document Image Restoration
We have also developed methods to clean up double-sided handwritten historical
documents. After long periods of storage, ink seeps through the
page and causes a double image in which the written strokes on the back of
the page are seen on the front. This is known as the bleed-through problem. One
method we developed relies on human intervention to help the machine learn to
differentiate between the foreground and background strokes. Another method
automatically matches written strokes on both sides so that the
unwanted background strokes can be identified and removed. Figure 2 shows the
results of the image restoration.
Figure 2. A double-sided handwritten historical document and its restoration
using one of our methods.
1.3 Patent Document Search
Another practical application of our document analysis research is the development of a graphics viewing tool for US patent database users. The layout of documents in the US patent database is not optimal for reading. In a patent document, figures and text descriptions are separated into the drawing and description sections, each of which occupies several consecutive pages. Our graphics viewing tool, shown in Figure 3, connects the captions and labels of figures in the drawings to their relevant text descriptions. With the help of this system, a patent user can conveniently jump from a figure to the relevant text, or vice versa, by clicking the corresponding captions and labels.

Figure 3: A snapshot of the system interface, where label 23 in the drawing is clicked. The left part of the interface is a window displaying the text version of a patent, and the right part is a window displaying the drawings of the patent.
Advances in image acquisition and storage have led to huge image databases which contain a wealth of useful information. Mining and extracting salient information from these images is a daunting task. Image mining aims to address this problem. It deals with the extraction of knowledge, image data relationships, or other patterns not explicitly stored in the images. It is an interdisciplinary effort that draws upon expertise in image processing, information retrieval, data mining, machine learning, databases and artificial intelligence.
2.1 Camera Images
The advancement of camera technology has given people an alternative to traditional scanning for text image acquisition. With a portable camera, textual information can be captured not only from paper materials but also from real-scene objects like signboards. However, this poses a challenge for traditional document image analysis techniques. Because the image plane in a camera is generally not parallel to the document plane, images obtained by a camera often suffer from perspective distortion, which causes OCR and other document image analysis techniques to fail when applied to them directly. We have developed a method that recognizes badly distorted characters or symbols on real-scene signboards and rectifies the perspective distortion simultaneously. Figure 4 shows the rectified results of our method in comparison with two existing methods. This work should facilitate applications like Sign Recognition, Mobile Phone Translator, and Speech Generator for the visually impaired.

Figure 4: Signboards rectified by our method in comparison with two existing methods. Rectified images are scaled for better viewing. (a) Real-scene symbols; (b) by our method; (c) by SIFT; (d) by Shape Context. (The reference template is shown on the far right.)
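To make the rectification step concrete, here is a minimal Python sketch that removes perspective distortion once four corner correspondences are known. It does not reproduce the method described above, which recognizes the symbol and rectifies it at the same time; the corner coordinates in `quad` and `rect` are made up for illustration, and the homography is assumed to be normalized so that its last entry equals 1.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def homography(src, dst):
    """3x3 matrix H (with H[2][2] = 1) mapping four src points onto four dst points."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], h[6:8] + [1.0]]


def warp_point(h, point):
    """Apply homography h to a single (x, y) point."""
    x, y = point
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)


# Hypothetical corners of a signboard seen at an angle, and the upright
# rectangle they should map to.
quad = [(0, 0), (90, 10), (85, 60), (5, 45)]
rect = [(0, 0), (100, 0), (100, 50), (0, 50)]
H = homography(quad, rect)
```

Every pixel of the distorted image can then be moved with `warp_point(H, (x, y))`. In practice the hard part is finding the correspondences, which is what matching against a reference template provides.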
2.2 Medical Images
A new image mining application involving brain CT scan images has recently been studied. The current research aims to investigate techniques for fast retrieval of brain CT scan images based on the image content of the medical anomalies as well as other textual information associated with the medical conditions. Intelligent retrieval systems are developed to improve the current hospital database management system. Machine learning paradigms are explored to enable automatic classification of medical images based on image contents and textual data. In addition, text and image mining techniques are also investigated to train the machine to do automatic interpretation of image contents.
A content-based CT image retrieval system, TBIdoc, has been developed. The system helps to manage and retrieve traumatic brain injury data. With this web-based system, doctors can query previous patient data by uploading brain CT scan images. The system returns a list of previous study cases ranked according to their visual similarity to the query images. Using TBIdoc, doctors can find relevant previous cases quickly and conveniently. It is expected that the system can improve current hospital database management for traumatic brain injury and also contribute to computer-aided education in traumatic brain injury. Figure 5 shows some sample retrieval results in TBIdoc.
Figure 5: Sample retrieval results in TBIdoc.
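Ranking stored cases by visual similarity to a query image can be sketched as follows. The features TBIdoc actually uses are not described here, so this example substitutes a coarse intensity histogram compared by histogram intersection; the function names, the bin count and the tiny 2x2 "scans" are all hypothetical.

```python
def histogram(img, bins=4, max_val=256):
    """Normalized intensity histogram of a grayscale image (list of rows)."""
    counts = [0] * bins
    total = 0
    for row in img:
        for v in row:
            counts[min(v * bins // max_val, bins - 1)] += 1
            total += 1
    return [c / total for c in counts]


def intersection(h1, h2):
    """Histogram intersection: 1.0 for identical histograms, 0.0 for disjoint ones."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def rank_cases(query_img, database):
    """Return case ids ordered from most to least similar to the query image."""
    q = histogram(query_img)
    scored = [(intersection(q, histogram(img)), case_id)
              for case_id, img in database.items()]
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [case_id for _, case_id in scored]


# Made-up stand-ins for stored scans and an uploaded query scan.
database = {
    "case_1": [[15, 30], [210, 250]],
    "case_2": [[100, 120], [130, 140]],
    "case_3": [[5, 90], [100, 240]],
}
query = [[10, 20], [200, 220]]
ranking = rank_cases(query, database)
```

A real system would replace `histogram` with features that describe the anomaly itself and could combine the visual score with the textual information attached to each case.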
3. Video Information
Digital video now plays an important role in entertainment, education, and other multimedia applications. Advances in technology and the decreasing price of devices have led to the increasing use of video in our day-to-day activities. For instance, students prefer to take videos or photos of lecture presentations in the classroom rather than write notes on a pad. With hundreds of thousands of hours of archival video, there is an urgent demand for tools that allow efficient browsing and retrieval of video data. In response to such needs, various video content analysis techniques using one or a combination of the image, audio, and textual information present in video have been proposed to parse, index, and abstract massive amounts of data.
Among these information sources, text present in the video is a reliable clue for three reasons: (1) it is closely related to the current content of the video, (2) it has distinctive visual characteristics, and (3) state-of-the-art optical character recognition (OCR) techniques are far more robust than existing speech analysis and visual object analysis techniques. Therefore, almost all video indexing research work begins with video text detection and recognition. A large number of approaches, such as connected component based, edge based, texture based and compressed domain based methods, have been proposed and have already achieved impressive performance. The majority of text detection approaches proposed in the literature are for image documents. Although most of them can be adapted and extended to video documents, detecting text in video presents unique challenges because of properties of video that hinder text detection and extraction, such as low resolution, low contrast, unknown text color, size, position, orientation and layout, color bleeding and unconstrained backgrounds. At a high level, text in digital video can be divided into two classes: graphics text and scene text.
Graphics/caption text is text that is mechanically added to video frames to supplement the visual and audio content. Scene text, on the other hand, is text that appears within the scene and is captured by the camera. Clearly, the detection and extraction of scene text is a much tougher task due to varying lighting, complex movement and transformation. Hence, developing a method that detects both graphics and scene text without constraints is still challenging and interesting for researchers in this community.
3.1 Graphics or Caption Text Detection
Since graphics or caption text is purposefully added, it is often more structured and more closely related to the subject than scene text. Most related previous work has focused on the detection of graphics text because of its usefulness in several applications. For example, captions in news broadcasts and documentaries usually annotate the where, when and who of the reported events. More importantly, a sequence of frames with caption text is often used to represent highlights in documentaries. Captions are also widely used to show titles, producers, actors, credits, and sometimes the context of the story. Furthermore, text and symbols that are present at specific locations in a video image can be used to identify the TV station and program associated with the video. In summary, graphics/captions in video frames provide highly condensed information about the contents of the video and can be used for video skimming, browsing, and retrieval in large video databases. Although embedded graphics/caption text provides important information about the image, it is not easy to reliably detect and extract text in video. In a single frame, the size of the characters can range from very small to very big. The font of the text can vary. Text can occur on a very cluttered background. In video sequences, the text can be either still or moving in an arbitrary direction. The same text may vary in size from frame to frame because of special effects. The background can also be moving or changing, independent of the text. With these challenges in mind, we have developed a method, called the Laplacian method, for video text detection, which works well for different kinds of text when compared with existing methods in the literature. However, we assume that the text is horizontal, because graphics or caption text in news video generally appears horizontally. A sample result of the method is shown in Figure 6.
(a)Input (Graphics)
(b)Edge-Texture
(c)Edge-Color
(d)Gradient
(e)Color-Cluster
(f)Laplacian
Figure 6: The detected blocks of the four existing methods and the proposed method for a horizontal text image.
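As a rough illustration of why a Laplacian filter is a useful starting point for caption detection, the Python sketch below applies a 3x3 Laplacian mask to a grayscale frame and keeps the rows with a strong response. Only the filtering step is taken from the description above; summing the absolute response per row and the `min_energy` threshold are assumptions made for this toy example and are not the published method.

```python
LAPLACIAN = [[0, 1, 0],
             [1, -4, 1],
             [0, 1, 0]]


def laplacian(img):
    """3x3 Laplacian response; border pixels are left at zero."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y][x] = sum(LAPLACIAN[j][i] * img[y + j - 1][x + i - 1]
                            for j in range(3) for i in range(3))
    return out


def text_rows(img, min_energy):
    """Indices of rows whose summed absolute Laplacian response is at least min_energy."""
    response = laplacian(img)
    return [y for y, row in enumerate(response)
            if sum(abs(v) for v in row) >= min_energy]


# Toy frame: flat background (100) with a two-row band of alternating
# bright and dark pixels standing in for a horizontal caption.
band = [100, 255, 0, 255, 0, 255, 0, 100]
frame = [[100] * 8, [100] * 8, band[:], band[:], [100] * 8, [100] * 8]
rows = text_rows(frame, min_energy=1000)
```

Text produces many sharp positive and negative Laplacian values close together, so the band rows stand out clearly from the flat background and from the weaker response in the rows next to the band.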
3.2 Scene Text Detection (Multi-Oriented Text)
In some domains such as sports, scene text can be used to uniquely identify objects. Most related previous work has focused on the detection of graphics text, because scene text is often difficult to detect and extract due to its virtually unlimited range of poses. Scene text is nevertheless important in applications such as navigation, surveillance, video classification, analysis of sporting events and text tracking. Further, when there is no graphics text, scene text can help in understanding video content, for example to generate keywords for indexing and to identify events and event boundaries. It can also be used to record the broadcast time and date of commercials and other broadcast content, helping people check whether their clients’ commercials have been broadcast at the arranged time on the arranged television channel. We extended the Laplacian method to the Fourier domain, with the help of skeletonization, to detect multi-oriented text including horizontal text. Skeletonization is used to separate the text and non-text segments of the components detected by the Laplacian method. This method works well for text on straight lines but not for text on curved lines. A sample result can be seen in Figure 7. To overcome this drawback of the Laplacian method, we also developed a Bayesian classifier approach with a new boundary growing concept for detecting curved text lines, which also covers horizontal and non-horizontal straight text lines. Sample results can be seen in Figure 8.
(a)Input (Graphics)
(b)Edge-Texture
(c)Edge-Color
(d)Gradient
(e)Color-Cluster
(f)Laplacian+ Skeletonization
Figure 7: The detected blocks of the four existing methods and the proposed method for a non-horizontal text image.
Figure 8: Curved text line detection.
3.3 Text Recognition
Text recognition is generally divided into four steps: detection, localization,
extraction and recognition. The detection step roughly classifies text and
non-text regions. The localization step determines the accurate boundaries of
text strings as we described in the above two sections. The extraction step
filters out background pixels in the text strings, so that only text pixels are
left for recognition. Since the above three steps generate a binary text image,
the recognition step can be done by commercial document OCR software. Although
many methods have been proposed for text recognition in the last decade, most of
them work well for documents captured by a high-resolution camera or scanner but
not for text in video, because current OCR engines expect large fonts,
plain backgrounds and complete shapes of character components. Therefore, recognition of
video text is challenging and interesting for the document research community.
Some methods enhance the text using temporal frames to increase
the contrast so that current document OCR can recognize it, but the problem
is how to generalize the enhancement procedure for different kinds of text in
different situations. We have developed a binarization method which preprocesses
the video image before sending it to the OCR for recognition. Sample recognition
results are shown in Figure 9.
Figure 9(a): Text blocks are extracted from video frame
Figure 9(b): Binarization and Recognition results
Text in ASCII coding contains information that is directly available for the machine to process. However, natural language expressions that are easily understood by humans pose great challenges to the machine in extracting the intended meaning from text. Complex grammatical structures and idiomatic expressions are barriers that the machine must overcome to extract the correct information from text. The following are some of the research topics we pursue to enable machines to understand and extract text information.
4.1 Syntactic Structure Alignment for Statistical Machine Translation (SMT)
Machine translation is one of the earliest applications of computers in AI for understanding and translating human languages. The early machine translation methods were mainly based on the grammatical rules of both the source and target languages. Today, the trend is towards the statistical machine translation (SMT) approach, which uses a large body of existing bilingual texts as training data for the machine to capture the statistical mapping of words and phrases between the two languages. One of the steps required is word alignment between bilingual sentences in the parallel corpus. This is the research problem that interests us.
Most of the current work in SMT obtains Translational Equivalences by first conducting word alignment on the plain parallel corpus and then extracting the Translational Equivalences that are consistent with the word alignment. A decent word alignment is therefore a prerequisite. Such a pipeline approach to obtaining Translational Equivalences is argued to be vulnerable to errors from the initial word alignment stage. Currently, researchers address this problem mainly by focusing on how to improve word alignment. As an alternative, we attempt to conduct syntactic structure alignment directly to obtain the syntactic Translational Equivalences.
4.2 Entity Linking Leveraging Automatically Generated Annotation
Given a knowledge base (KB), a document collection, and a list of queries in the format of [name-string, document-id], the Entity Linking task is to determine, for each query, which knowledge base entity is being referred to, or whether the entity is a new entity which is not present in the reference KB. The proposed approach for Entity Linking automatically generates a large-scale annotated corpus for ambiguous mentions by leveraging their unambiguous synonyms in the document collection.
Traditionally, without any training data available, the solution is to rank the candidates based on similarity using a Vector Space Model. However, it is difficult for the ranking approach to detect a new entity that is not yet covered in the KB, and it is also inconvenient to combine different features. We create a large corpus for entity linking by an automatic method. A binary classifier is trained to filter out KB entities that are not similar to the current mention. We further leverage Wikipedia documents, through a domain adaptation approach, to provide additional information which is not available in our generated corpus, and this further improves performance. The 24.1% accuracy improvement on KBP-09 over the baseline also benefits from the additional name variations spotted using knowledge from Wikipedia.
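The traditional ranking baseline mentioned above can be sketched in a few lines of Python. This illustrates only the Vector Space Model ranking, not the classifier-based approach; the raw term-frequency vectors, the `nil_threshold` cut-off for declaring a new entity, and the toy KB entries `E1` and `E2` are assumptions introduced for the example.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def link(mention_context, candidates, nil_threshold=0.2):
    """Return the id of the most similar KB candidate, or "NIL" if none is similar enough."""
    query = Counter(mention_context.lower().split())
    best_id, best_score = None, 0.0
    for entity_id, description in candidates.items():
        score = cosine(query, Counter(description.lower().split()))
        if score > best_score:
            best_id, best_score = entity_id, score
    return best_id if best_score >= nil_threshold else "NIL"


# Hypothetical KB entries for one ambiguous name string.
kb = {
    "E1": "jordan basketball player chicago bulls guard",
    "E2": "jordan professor machine learning statistics berkeley",
}
linked = link("jordan scored for the bulls in chicago", kb)
unlinked = link("the weather was sunny today", kb)
```

The fixed threshold shows the weakness noted in the text: a single similarity cut-off is a crude way to decide that a mention refers to an entity outside the KB, and there is no natural place to plug in further features.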
4.3 Wavelet Kernel for Textual Data Classification
We propose a wavelet kernel method which keeps all features and uses the relationships between them to improve effectiveness. The new kernel requires the feature set to be ordered such that consecutive features are related either statistically or based on some external knowledge source. The ordered feature set enables the interpretation of the vector representation of an object as a series of equally spaced observations of a hypothetical continuous signal. The new kernel maps the vector representation of objects to the L2 function space, where appropriately chosen compactly supported basis functions utilize the relations between features when calculating the similarity between two objects.
The suggested approach is not entirely new to text representation. In order to be efficient, the mathematical objects of a formal model, like vectors, have to reasonably reproduce language-related phenomena such as the word meaning inherent in index terms. The classical model of text representation, however, represents word meaning only approximately. Adding expansion terms to the vector representation can also improve effectiveness. The choice of expansion terms is based either on distributional similarity or on some lexical resource that establishes relationships between terms. Existing methods regard all expansion terms as equally important. The proposed kernel, however, discounts less important expansion terms according to a semantic similarity distance. This approach improves effectiveness in both text classification and information retrieval.
4.4 Kernel Based Discourse Relation Recognition with Temporal Ordering Information
Syntactic knowledge is important for discourse relation recognition. Yet only heuristically selected flat paths and 2-level production rules have been used to incorporate such information so far. We propose a tree kernel based approach that automatically mines the syntactic information from the parse trees for discourse analysis by applying a kernel function to the tree structures directly. These structural syntactic features, together with other normal flat features, are incorporated into our composite kernel to capture diverse knowledge for simultaneous discourse identification and classification of both explicit and implicit relations.
Experiments show that the tree kernel approach gives statistically significant improvements over flat syntactic path features. We also illustrate that the tree kernel approach covers more structural information than the production rules, which allows the tree kernel to incorporate information from a higher-dimensional space for potentially better discrimination. In addition, we propose to leverage temporal ordering information to constrain the interpretation of discourse relations, which also yields statistically significant improvements in discourse relation recognition on PDTB 2.0 for both explicit and implicit relations.
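For readers unfamiliar with tree kernels, the following Python sketch shows how a kernel can compare two parse trees directly by counting the tree fragments they share. It implements the classic convolution (subset tree) kernel in the style of Collins and Duffy as a stand-in, since the exact kernel and features used in our work are not spelled out here; the nested-tuple tree encoding and the decay parameter `lam` are assumptions of this example.

```python
def production(node):
    """Grammar production at a node: its label plus the labels (or words) of its children."""
    label, children = node[0], node[1:]
    return (label, tuple(c if isinstance(c, str) else (c[0],) for c in children))


def nodes(tree):
    """Yield every internal node; trees are nested tuples and leaves are plain strings."""
    yield tree
    for child in tree[1:]:
        if not isinstance(child, str):
            yield from nodes(child)


def common(n1, n2, lam):
    """Weighted number of common fragments rooted at n1 and n2."""
    if production(n1) != production(n2):
        return 0.0
    score = lam
    for c1, c2 in zip(n1[1:], n2[1:]):
        if not isinstance(c1, str):
            score *= 1.0 + common(c1, c2, lam)
    return score


def tree_kernel(t1, t2, lam=1.0):
    """Sum the common-fragment counts over all pairs of nodes."""
    return sum(common(a, b, lam) for a in nodes(t1) for b in nodes(t2))


# Two tiny trees that differ only in the word under C.
t1 = ("A", ("B", "b"), ("C", "c"))
t2 = ("A", ("B", "b"), ("C", "x"))
k_same = tree_kernel(t1, t1)
k_diff = tree_kernel(t1, t2)
```

With `lam=1.0` the kernel value is simply the number of shared fragments, so the comparison never needs an explicit list of hand-picked paths or production rules. A composite kernel can then add this value to a kernel computed over flat features.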
4.5 Event Anaphora Resolution
Anaphora resolution, the task of resolving a given text expression to its referred expression in prior texts, is important for intelligent text processing and information extraction. Most previous work on anaphora resolution aims at object anaphora in which both an anaphor and its antecedent are mentions of the same concrete real world object. In contrast, event anaphors are mentions that refer back to an abstract entity, i.e. an event expressed by a verb rather than a noun.
Event anaphora resolution is useful for cascaded event template extraction and other natural language processing tasks. In our preliminary study, we investigate general-purpose event pronoun resolution. We first explore various lexical, positional and syntactic features useful for event pronoun resolution. We then explore a tree kernel to model the structural information embedded in the syntactic parse. A composite kernel is used to combine these diverse sources of information. We also look into incorporating negative training instances built from non-anaphoric pronouns. Such instances were not used in previous studies on co-reference or anaphora resolution because they make the training data too unbalanced. Our study shows that, with a random sampling strategy, they are very useful for the final resolution.
4.6 Disambiguation of Ambiguous Sentiment Adjectives
We propose supervised and unsupervised hierarchical dependency-based methods to disambiguate ambiguous sentiment adjectives (ASAs) within context and identify their sentiment polarity as positive or negative. Firstly, we demonstrate that there are determinant words in context which can be effectively used to disambiguate ASAs, rather than only using the phrase containing ASAs. For this purpose, we propose a hierarchical dependency-based method (HDEPM) to extract the determinant words. Then we employ the determinant words extracted by our hierarchical method in supervised and unsupervised manners to disambiguate ASAs.
Our supervised method simply uses the determinant words as the features of a Naive Bayes classifier, while our unsupervised method is based on the automatically detected association strength between the determinant words and the ambiguous adjective. We evaluate our methods on 410 sentences containing a sentiment ambiguous adjective, sampled from the movie and hotel domains. The experimental results show that our hierarchical methods outperform the methods based on phrase-level disambiguation. The supervised learning method achieves an accuracy of 82.20%, while the accuracy of the supervised phrase-level method is 75.37%. The accuracy of the unsupervised learning algorithm ranges from 69.02% to 74.39%, depending on the set of determinant words extracted by HDEPM.
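The supervised setting can be pictured with a small Naive Bayes sketch in Python. It assumes the determinant words have already been extracted for each occurrence of an ambiguous adjective (the HDEPM extraction step itself is not shown), uses add-one smoothing as one common choice, and trains on four invented examples that are not taken from the evaluation data.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """examples: list of (words, label) pairs. Returns the counts needed for prediction."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for words, label in examples:
        label_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return word_counts, label_counts, vocab


def predict(model, words):
    """Most probable label under multinomial Naive Bayes with add-one smoothing."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_logp = None, float("-inf")
    for label in sorted(label_counts):
        logp = math.log(label_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in words:
            if w in vocab:  # words never seen in training are ignored
                logp += math.log((word_counts[label][w] + 1) / denom)
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label


# Invented training pairs: the ambiguous adjective plus one determinant word.
training = [
    (["high", "quality"], "positive"),
    (["high", "price"], "negative"),
    (["long", "queue"], "negative"),
    (["long", "history"], "positive"),
]
model = train(training)
pred_price = predict(model, ["high", "price"])
pred_quality = predict(model, ["high", "quality"])
```

The adjective alone is uninformative here, since "high" appears once under each label; the decision is carried entirely by the determinant word, which is the intuition behind extracting such words from the wider context.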
|