Information Mining and Extraction

The information age is upon us. We face a myriad of information in many different forms every day. The challenge is how to mine and extract useful information from this multitude of sources. The information mining and extraction group, led by Prof. Tan Chew Lim in the Department’s AI group, works on various techniques in the following areas.

Document Information

Image Information

Video Information

Text Processing

1. Document Information

Document analysis is the task of examining document images in order to acquire an understanding of their contents. Thanks to the rapid advances in printing, scanning, photocopying and digital photography in recent years, we witness today a proliferation of document images in our daily lives. Such document images are easily understood by humans but pose great challenges to machines. In the early years, document analysis research aimed at extracting and recognizing text in document images, leading to the now mature technology of Optical Character Recognition (OCR). Another early line of research was layout analysis of document images to enhance the correct detection of text regions. In recent years, document analysis research has moved into new areas, including the enhancement or correction of distorted or degraded document images, such as historical documents and text images captured by digital cameras and mobile devices such as phones and PDAs. Another recent research interest is the understanding of graphics and symbols embedded in document images. The document analysis group in the School of Computing is actively involved in some of these recent advances, as described below:

1.1 Document Image Binarization

Document image binarization is an important preprocessing technique that aims to segregate the foreground text from the document background. Many binarization techniques have been proposed and work well on relatively clean documents. However, they perform poorly on degraded document images, which exhibit smudges and smears, uneven illumination, and intensity variation.

Figure 1.1 Binarization examples by the Local Maximum and Minimum (LMM) method and the Background Estimate (BE) method.
(a) Input images
(b) Binary images by LMM
(c) Binary images by BE

 

We have proposed a few techniques to binarize document images. One method that we have developed finds the local maximum and minimum contrast in the document image, so that the binarization process is more responsive to local variation in the background. Another method we have developed uses an iterative smoothing procedure to estimate the document background variation in the form of a polynomial surface. The estimated background is then subtracted from the document image intensity values to achieve a cleaning effect. Figure 1.1 shows the results of binarization using these two methods.
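To make the background-estimation idea concrete, the sketch below fits a single least-squares polynomial surface to the pixel intensities and thresholds the residual. It is a minimal stand-in for the method described above: the actual technique uses an iterative smoothing procedure, and the threshold rule (parameter k) here is an illustrative choice.

```python
import numpy as np

def fit_background(img, degree=3):
    """Approximate the slowly varying document background with a 2-D
    polynomial surface fitted by least squares (the actual method uses
    an iterative smoothing procedure rather than a single fit)."""
    h, w = img.shape
    yy, xx = np.mgrid[0:h, 0:w]
    x, y = xx.ravel() / w, yy.ravel() / h
    # design matrix with all monomials x^i * y^j where i + j <= degree
    A = np.stack([x**i * y**j
                  for i in range(degree + 1)
                  for j in range(degree + 1 - i)], axis=1)
    coef, *_ = np.linalg.lstsq(A, img.ravel().astype(float), rcond=None)
    return (A @ coef).reshape(h, w)

def binarize_by_background(img, degree=3, k=0.5):
    """Subtract the estimated background and threshold the residual;
    text pixels are darker than the local background."""
    residual = fit_background(img, degree) - img.astype(float)
    t = residual.mean() + k * residual.std()   # illustrative global threshold
    return np.where(residual > t, 0, 255).astype(np.uint8)  # text -> black
```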

Figure 1.2 Binarization examples.
(c) and (d) show the binarization results of the degraded images in (a) and (b).

 

The thresholding of degraded document images remains an unsolved problem due to the high inter- and intra-variation between the text strokes and the document background across different document images. We further developed a novel document image binarization technique that addresses these issues by using adaptive image contrast. The proposed method is simple, robust and involves minimal parameter tuning. Figure 1.2 shows some binarization results of our proposed method on badly degraded document images.
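The following sketch illustrates the adaptive-contrast idea under stated assumptions: it blends the normalized local contrast with the local image gradient, with the blending weight alpha and the window size chosen for illustration rather than taken from the published method.

```python
import numpy as np
from scipy.ndimage import maximum_filter, minimum_filter

def adaptive_contrast(img, win=3, alpha=0.5, eps=1e-6):
    """Combine the normalized local contrast (robust to intensity
    variation) with the local image gradient (robust to low-contrast
    backgrounds); alpha balances the two terms and is a tuning guess."""
    img = img.astype(float)
    mx, mn = maximum_filter(img, win), minimum_filter(img, win)
    contrast = (mx - mn) / (mx + mn + eps)   # normalized local contrast
    gradient = (mx - mn) / 255.0             # scaled local gradient
    return alpha * contrast + (1 - alpha) * gradient
```

High values on the resulting map mark stroke boundaries, which can then seed the final thresholding.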

Besides developing new binarization techniques, we have also viewed document image binarization as a learning problem and proposed learning frameworks to improve the performance of existing binarization techniques.

1.2 Document Image Deblurring

Motion blur is a common distortion in document images, especially those taken with digital cameras and mobile devices. Blur degrades the quality of a document image and can make the text within it unreadable by optical character recognition (OCR) or even by a person.

Figure 1.3 The document image in (a) contains defocus blur of different extents. Its corresponding singular value map is shown in (b), where regions with different blur degrees are highlighted in different colors.

 

To handle this problem, we first developed a simple and effective technique for automatically detecting and classifying blurred image regions. The proposed technique can be used in many different multimedia analysis applications such as image segmentation, depth estimation and information retrieval. Document image regions with different blur degrees can be detected and extracted using different thresholds, as illustrated in Figure 1.3. We then developed a blur correction technique that aims to correct motion blur within document images. Figure 1.4 shows two examples of document images deblurred by our proposed method.
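A minimal sketch of the singular-value idea behind the blur map in Figure 1.3: smooth (blurred) patches concentrate their energy in the first few singular values, while sharp detail contributes to the smaller ones. The patch size and the top-k energy score below are illustrative choices, not the exact measure used in our work.

```python
import numpy as np

def singular_value_blur_map(img, patch=32, k=5):
    """Per-patch blur score from the singular values of the pixel
    block: the share of energy held by the top-k singular values tends
    to be higher for blurred patches, since fine detail feeds the
    smaller singular values (an illustrative score)."""
    h, w = img.shape
    score = np.zeros((h // patch, w // patch))
    for i in range(h // patch):
        for j in range(w // patch):
            block = img[i*patch:(i+1)*patch, j*patch:(j+1)*patch]
            s = np.linalg.svd(block.astype(float), compute_uv=False)
            score[i, j] = s[:k].sum() / (s.sum() + 1e-9)
    return score   # threshold at several levels to grade blur degrees
```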

Figure 1.4 Examples of document image deblurring.
(a) The blurry images
(b) The deblurred results

1.3 Historical Document Image Restoration

We have also developed methods to clean up double-sided handwritten historical documents. After long periods of storage, ink seeps through the document page, causing a double image in which the written strokes on the back of the page are seen on the front. This is known as the bleed-through problem. One method we developed relies on human intervention to help the machine learn to differentiate between the foreground and background strokes. Another method automatically matches the written strokes on both sides so that the unwanted background strokes can be identified and removed. Figure 1.5 shows the results of the image restoration.
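The sketch below shows the simplest form of the stroke-matching idea, assuming the two grayscale sides are already registered once the verso is mirrored; the actual method matches strokes explicitly rather than relying on the fixed ink threshold and darkness heuristic used here.

```python
import cv2
import numpy as np

def remove_bleed_through(recto, verso, ink_thresh=128):
    """Suppress verso strokes that have seeped through to the recto.
    A pixel is treated as bleed-through when both sides carry ink
    there and the (mirrored) verso stroke is the darker of the two."""
    verso_m = cv2.flip(verso, 1)                 # mirror into recto coordinates
    bleed = (recto < ink_thresh) & (verso_m < ink_thresh) & (verso_m < recto)
    cleaned = recto.copy()
    cleaned[bleed] = 255                         # paint bleed-through white
    return cleaned
```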

Figure 1.5 A double-sided handwritten document and its restoration using our method.

 

1.4 Patent Document Search

Another practical application of our document analysis research is the development of a graphics viewing tool for users of the US patent database. The layout of documents in the US patent database is not optimal for reading purposes. In a patent document, figures and their text descriptions are separated into the drawing and description sections, each of which may occupy several consecutive pages. Our graphics viewing tool, shown in Figure 1.6, connects the captions and labels of figures in the drawing to their relevant text descriptions. With the help of this system, a patent user can conveniently jump from a figure to the relevant text, or vice versa, by clicking the corresponding captions and labels.
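A toy sketch of the linking step: it indexes numeric figure labels in the description text so that a click on a label can jump to the sentences mentioning it. Regex-based label spotting is a simplification; the real tool also recognizes the labels inside the drawings.

```python
import re
from collections import defaultdict

def index_labels(description_text):
    """Map numeric figure labels (e.g. '23') to the sentences that
    mention them, mimicking the click-to-jump behaviour of the viewer."""
    index = defaultdict(list)
    for sent in re.split(r'(?<=[.;])\s+', description_text):
        for label in re.findall(r'\b(\d{1,3})\b', sent):
            index[label].append(sent)
    return index

idx = index_labels("The housing 23 encloses the rotor 7. Rotor 7 drives shaft 23.")
print(idx["23"])   # sentences to show when label 23 is clicked
```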

Figure 1.6 A snapshot of the system interface, where label 23 in the drawing is clicked.
The left part of the interface is a window displaying the text version of a patent,
and the right part is a window displaying the drawings of the patent.

 


 

2. Image Information

Advances in image acquisition and storage have led to huge image databases which contain a wealth of useful information. Mining and extracting salient information from these images is a daunting task. Image mining aims to address this problem: it deals with the extraction of knowledge, image data relationships, or other patterns not explicitly stored in the images. It is an interdisciplinary effort that draws upon expertise in image processing, information retrieval, data mining, machine learning, databases and artificial intelligence.

2.1 Camera Images

The advancement of camera technology has given people an alternative to traditional scanning for text image acquisition. With a portable camera, textual information can be captured not only from paper materials but also from real-scene objects like signboards. However, this poses a challenge for traditional document image analysis techniques. Because the image plane in a camera is not parallel to the document plane, images obtained by a camera often suffer from perspective distortion, causing OCR and other document image analysis techniques to fail when applied to them directly. We have developed a method that is able to recognize badly distorted characters or symbols on real-scene signboards and rectify the perspective distortion simultaneously. Figure 2.1 shows the rectified results of our method in comparison with two existing methods. This work should facilitate applications like sign recognition, mobile phone translators, and speech generators for the visually impaired.
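Once the correspondence between a distorted symbol and its frontal reference is known, the rectification itself reduces to a planar homography. The sketch below shows this final warping step with OpenCV; the four corner correspondences and the file names are supplied by hand here, whereas our method recovers the correspondences by matching against the reference template.

```python
import numpy as np
import cv2

# Corner correspondences between the distorted signboard and a frontal
# rectangle (hypothetical values; normally obtained by template matching).
src = np.float32([[41, 30], [310, 12], [325, 240], [28, 215]])
dst = np.float32([[0, 0], [300, 0], [300, 200], [0, 200]])

H = cv2.getPerspectiveTransform(src, dst)       # 3x3 homography
img = cv2.imread("signboard.png")               # hypothetical input image
rectified = cv2.warpPerspective(img, H, (300, 200))
cv2.imwrite("rectified.png", rectified)
```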

Figure 2.1 Signboards rectified by our method in comparison with two existing methods. Rectified images are scaled for better viewing. (a) Real-scene symbols (b) by our method (c) by SIFT (d) by Shape Context. (Reference template is shown on the far right.)

 

2.2 Medical Images

Thanks to the development of imaging techniques, large amounts of medical imaging data and analyses are available in hospitals. Medical image mining research at the School of Computing stems from the group’s original interests in data mining. In the context of medical images, image mining deals with the extraction of knowledge, image data relationships and other patterns not explicitly stored in the images. This information can be used in clinical diagnosis and prognosis. Our current research targets the Traumatic Brain Injury (TBI) CT scan image database.


In the brain CT image project, a content-based CT image retrieval system – TBIDoc has been developed. The system helps to manage and retrieve traumatic brain injury data. With this web-based system, doctors can query previous patient data by uploading brain CT scan images.  Figure 2.2 shows the new interface of the TBIDoc system. Using TBIDoc, doctors can find relevant previous cases in a fast and convenient way.  Figure 2.3 shows some sample retrieval results in TBIDoc.
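Retrieval in a system of this kind ultimately reduces to ranking stored cases by the similarity of their feature vectors to the query's. The sketch below shows cosine-similarity ranking only; the CT-specific features (computed upstream from the images) and the web front end are outside its scope.

```python
import numpy as np

def retrieve(query_vec, case_vecs, top_k=5):
    """Rank stored cases by cosine similarity between their feature
    vectors and the query's; case_vecs is an (n_cases, n_features)
    matrix of features extracted from the CT images upstream."""
    q = query_vec / np.linalg.norm(query_vec)
    M = case_vecs / np.linalg.norm(case_vecs, axis=1, keepdims=True)
    sims = M @ q
    order = np.argsort(-sims)[:top_k]
    return order, sims[order]   # case indices and their similarities
```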

Figure 2.2 TBIDoc Interface

 

It is also important to automatically quantify clinical features in brain CT images for TBI prognosis research. We have proposed a few techniques for extracting features from brain CT scan images. The extracted information will be useful for building an automated prognosis system, which aims to predict the possible outcome of TBI. Associations between the TBI images/extracted features and possible outcomes can be established by machine learning techniques.

Figure 2.3 Sample retrieval results in TBIDoc.


Indexing of brain CT slices orders the slices and aligns each individual slice to its corresponding axial height in the brain, which is an important step in content-based image retrieval and computer-assisted diagnosis. We proposed a fast indexing method using anatomical feature classification without registration. Our method is able to accurately and efficiently index brain CT slices with their corresponding labels. We have also developed a new method for automatically tracing and quantifying the midline shift in brain CT images using midline landmarks. In addition, we proposed a labeling method to automatically classify the types of TBI, making use of the radiology report associated with the medical images.

 


 

3. Video Information

Digital video now plays an important role in entertainment, education, and other multimedia applications. Advances in technology and the decreasing price of devices have led to the increasing use of video in our day-to-day activities. For instance, students often prefer to take videos or photos of lecture presentations in the classroom rather than writing notes. With hundreds of thousands of hours of archival video, there is therefore an urgent demand for tools that allow efficient browsing and retrieval of video data.

In response to such needs, various video content analysis techniques using one or a combination of the image, audio, and textual information present in video have been proposed to parse, index, and abstract massive amounts of data. Among these information sources, text present in the video is a reliable clue for three reasons: (1) it is closely related to the current content of the video, (2) it has distinctive visual characteristics, and (3) state-of-the-art optical character recognition (OCR) techniques are far more robust than existing speech analysis and visual object analysis techniques. Therefore, almost all video indexing research begins with video text detection and recognition. A large number of approaches, such as connected-component-based, edge-based, texture-based and compressed-domain-based methods, have been proposed and have already achieved impressive performance.

A majority of the text detection approaches proposed in the literature are for image documents. Although most of them can be adopted and extended for video documents, detecting text in videos presents unique challenges over that in images, due to many properties of video that hinder text detection and extraction, such as low resolution, low contrast, unknown text color, size, position, orientation and layout, color bleeding, and unconstrained backgrounds. At a high level, text in digital video can be divided into two classes: graphics text and scene text. Graphics/caption text is text that is mechanically added to video frames to supplement the visual and audio content. Scene text, on the other hand, is text that appears within the scene and is captured by the camera. Clearly, the detection and extraction of scene text is a much tougher task due to varying lighting, complex movement and transformations. Hence, developing methods that detect both graphics and scene text without constraints remains challenging and interesting for researchers in this community.

3.1 Graphics or Caption Text Detection

Since graphics or caption text is purposefully added, it is often more structured and more closely related to the subject than scene text. Most related previous work has focused on the detection of graphics text because of its usefulness in several applications. For example, captions in news broadcasts and documentaries usually annotate the where, when and who of the reported events. More importantly, a sequence of frames with caption text is often used to represent highlights in documentaries. Captions are also widely used to depict titles, producers, actors, credits, and sometimes the context of the story. Furthermore, text and symbols present at specific locations in a video image can be used to identify the TV station and program associated with the video. In summary, graphics/captions in video frames provide highly condensed information about the contents of the video and can be used for video skimming, browsing, and retrieval in large video databases.

Figure 3.1 The detected blocks of four existing methods and the proposed method for a horizontal text image.
(a) Input (graphics)
(b) Edge-Texture
(c) Edge-Color
(d) Gradient
(e) Color-Cluster
(f) Laplacian

 

Although embedded graphics/caption text provides important information about the image, reliably detecting and extracting text in video is not an easy problem. In a single frame, the size of the characters can range from very small to very large. Fonts can vary. Text can occur against a very cluttered background. Across a video sequence, the text can be still or moving in an arbitrary direction. The same text may vary in size from frame to frame due to special effects. The background can also move or change, independently of the text. With these challenges in mind, we have developed the Laplacian method for video text detection, which works well for different kinds of text compared to methods in the literature. However, we assume that text runs in the horizontal direction, because graphics or caption text in news video generally appears horizontally. A sample result of the method is shown in Figure 3.1.
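A rough sketch of the intuition behind the Laplacian method: text regions contain dense positive and negative intensity transitions, so local Laplacian energy is high there. The published method operates in the Fourier domain with ideal filters; a plain spatial Laplacian and Otsu thresholding stand in here.

```python
import cv2
import numpy as np

def laplacian_text_map(frame, win=21):
    """Candidate text mask from local Laplacian energy: compute the
    absolute Laplacian response, pool it over a local window, and
    threshold the pooled energy with Otsu's method."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    lap = np.abs(cv2.Laplacian(gray, cv2.CV_64F, ksize=3))
    energy = cv2.boxFilter(lap, -1, (win, win))          # local energy
    energy = cv2.normalize(energy, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    _, mask = cv2.threshold(energy, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return mask   # candidate text blocks; horizontal grouping follows
```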

3.2 Scene Text Detection

A large portion of images are taken in outdoor environments, where text information often appears, e.g., on road signs and billboards. The ability of a machine to read such text is of great importance because it facilitates a wide range of applications, both for individuals (e.g., text translation on mobile phones for tourists) and for large user communities (e.g., enhancing online maps with business names extracted automatically from images of major streets). To date, this remains an unsolved problem due to difficult scenarios such as text with fancy fonts, text captured from a side angle, and unconstrained text content (e.g., a brand name does not have to be a word in the dictionary). Because of these challenges, applying OCR directly to scene images produces unsatisfactory recognition results.

Figure 3.2 The detected blocks of four existing methods and the proposed method for a non-horizontal text image.
(a) Input (graphics)
(b) Edge-Texture
(c) Edge-Color
(d) Gradient
(e) Color-Cluster
(f) Laplacian + Skeletonization

 

A scene text recognition system typically consists of two main steps: detecting the locations of text in a natural scene image, and recognizing the detected text as alphanumeric strings. For the first step, text detection, we have developed a few techniques. We extended the Laplacian method into the Fourier domain with the help of the skeletonization concept to detect multi-oriented text, including horizontal text. We explored skeletonization for separating the text and non-text segments of the components detected by the Laplacian method. This method works well for text on straight lines but not for text on curved lines. A sample result can be seen in Figure 3.2. To overcome this drawback of the Laplacian method, we also developed a Bayesian classifier approach with a new boundary growing concept for detecting curved text lines, which also covers horizontal and non-horizontal text lines. Sample results can be seen in Figure 3.3.
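A minimal illustration of why skeletonization helps separate text from non-text segments: thin, stroke-like components keep a large share of their pixels after skeletonization, while blob-like components collapse to a small skeleton. The single ratio test below only captures this intuition; the actual method analyzes the skeleton segments themselves.

```python
from skimage.morphology import skeletonize

def strokelike_score(component_mask):
    """Ratio of skeleton pixels to component pixels for one connected
    component (boolean/0-1 array): close to the inverse stroke width,
    so high for thin text strokes and low for solid blobs."""
    mask = component_mask > 0
    skel = skeletonize(mask)
    return skel.sum() / max(mask.sum(), 1)

# e.g. keep components whose score exceeds a cutoff (~0.3, illustrative)
```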

Figure 3.3 Curved text line detection.

 

We also utilize the fact that text exhibits a great deal of symmetry and regularity. Different from previous works, which focus solely on the symmetry and regularity within the characters, we explore the same properties in the gaps between consecutive characters. This approach is useful for cases where the characters are hard to read (e.g., due to uneven illumination) but the gaps between them are still clearly visible. Experimentally, our technique is comparable to the state of the art on a public dataset for scene text detection. Some sample text detection results are shown in Figure 3.4.
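As a sketch of the gap regularity cue, the function below scores how uniform the gaps between consecutive character bounding boxes are; a genuine text line has low gap variance. The exact statistic in our work differs; this only captures the intuition.

```python
import numpy as np

def gap_regularity(boxes):
    """Regularity of the gaps between consecutive boxes, each given as
    (x, y, w, h). Returns a score in (0, 1]; values near 1 mean the
    gaps are nearly uniform, as expected along a real text line."""
    boxes = sorted(boxes, key=lambda b: b[0])
    gaps = np.array([b2[0] - (b1[0] + b1[2])
                     for b1, b2 in zip(boxes, boxes[1:])], dtype=float)
    return 1.0 / (1.0 + gaps.std() / (abs(gaps.mean()) + 1e-6))

print(gap_regularity([(0, 0, 10, 20), (14, 0, 10, 20), (28, 0, 9, 20)]))
```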

Figure 3.4 Sample scene text detection results.

 

For the second step, text recognition, most current research is still limited to recognizing texts that are horizontal and frontal to the image plane. However, in practice, scene texts can appear in any orientation, and with perspective distortion. Thus, we have developed a technique for recognizing perspective scene texts of arbitrary orientations. We also took into consideration the fact that it is labor-intensive and time-consuming to collect enough samples of perspective characters to train a classifier. For this reason, we used only frontal training data and extracted features that are robust to rotation and viewpoint change. This allows the extracted features to be used to recognize perspective characters. Experimentally, our technique shows significant improvement over the state-of-the-art on perspective scene texts. Figure 3.5 illustrates sample recognition results of our technique.

Figure 3.5 Sample recognition results of our technique on frontal, perspective and curved scene texts. The string below each image is the result returned by our technique: ‘COPIES’, ‘MURPHY’, ‘LIGHTS’.

 

3.3 Video Text Recognition

Similar to natural scene images, videos often contain useful text information. A typical application of video text recognition is video indexing and retrieval, which is becoming increasingly important with the rapid growth of online video databases. Although many methods have been proposed for text recognition in the last decade, most of them work well for documents captured by high-resolution cameras or scanners but not for text in video. Recognition of video text therefore remains challenging and interesting for the document research community.

Figure 3.6 Text blocks extracted from video frames


Video text shares most of the challenges of scene text, with an additional problem: low resolution. This is because videos are often compressed for fast transmission on the Internet. Figure 3.6 shows some examples of video text blocks. Some methods enhance the text using temporal frames to increase the contrast so that current document OCR can recognize it, but the problem is how to generalize the enhancement procedure across different kinds of text and situations. We have developed a binarization method that preprocesses the video image before sending it to the OCR engine for recognition. Sample recognition results are shown in Figure 3.7.
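A minimal sketch of the preprocess-then-recognize pipeline, with upsampling plus Otsu thresholding standing in for our binarization method, pytesseract as an example OCR engine, and a hypothetical input file name:

```python
import cv2
import pytesseract

block = cv2.imread("text_block.png", cv2.IMREAD_GRAYSCALE)  # hypothetical file
# upsample: video text blocks are typically too small for document OCR
block = cv2.resize(block, None, fx=3, fy=3, interpolation=cv2.INTER_CUBIC)
_, binary = cv2.threshold(block, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
print(pytesseract.image_to_string(binary))
```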

Figure 3.7 Binarization and Recognition results

 

Text recognition is generally divided into four steps: detection, localization, extraction and recognition. The detection step roughly classifies text and non-text regions. The localization step determines the accurate boundaries of text strings as we described in the above two sections. The extraction step filters out background pixels in the text strings, so that only text pixels are left for recognition. Since the above three steps generate a binary text image, the recognition step can be done by commercial document OCR software. The pipeline of a video text recognition system is similar to that of a scene text recognition system. However, the recognition step of the former needs to take into account the temporal information, because video text often appears on the screen for 2-3 seconds, in order to be readable by a human.

For the first step, text detection, we have developed a wide range of techniques for video text detection. One of our contributions to the community is to relax the assumption on text orientation. While most current research assumes that text is horizontal, we have proposed techniques capable of handling multi-oriented as well as arbitrarily-oriented text in video. Experiments show that our techniques are able to detect video text in a variety of orientations. Figure 3.8 shows a video text successfully detected by one of our techniques.

Figure 3.8 Sample detection results of our technique for video text.
In each pair of images, the one on the left shows the input video frame, and the one on the right shows the located text positions.

 

For the second step, text recognition, we have developed several methods to reconstruct the complete shapes of video characters (which often have broken edges due to the problem of low resolution). Furthermore, to exploit the temporal information in video, we have proposed a technique to track linearly moving texts. The tracked text instances are then aligned and integrated into an enhanced and binarized text image. Experimental results show that using character reconstruction and temporal information improves the recognition accuracy significantly. Sample video text recognition results are shown in Figure 3.9.
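A simplified sketch of the temporal integration step, assuming the tracked text instances are already aligned: averaging suppresses background noise that is uncorrelated across frames, after which a standard threshold produces the binary image sent to OCR (Otsu standing in for our binarization).

```python
import cv2
import numpy as np

def integrate_instances(instances):
    """Average several aligned grayscale instances of the same text
    region over time, then binarize the mean image; uncorrelated
    background clutter is attenuated while the stable text survives."""
    stack = np.stack([i.astype(np.float32) for i in instances])
    mean = stack.mean(axis=0).astype(np.uint8)
    _, binary = cv2.threshold(mean, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary
```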

Figure 3.9 Sample video text recognition result.
The left hand side shows a few instances of the same word over time.
The right hand side shows the integrated and binarized text image.
By using temporal information, the final recognized string is correct,
despite the low quality of the individual text instances.
OCR Result: "Talk"

 

In addition, a video may contain more than one language. Hence, we have developed a technique for video script identification. The texts of each language can then be sent to the respective OCR engine. This paves the way for research into multilingual video text recognition in the future, which is especially relevant in the Singapore context (with four languages: English, Chinese, Malay and Tamil).

 


 

4. Text Processing

Text, through its character encoding, contains information directly available for the machine to process. Natural language processing (NLP) is an important AI subfield that studies and develops techniques to enable machines to understand human languages. Natural language expressions easily understood by humans pose great challenges to the machine in extracting the intended meaning from the text. Complex grammatical structures and idiomatic expressions are barriers the machine must overcome to extract the correct information from text.

In recent years, numerous applications on the Internet have been developed which affect and shape people’s lives. Among them, text-based applications such as search engines, blogs, Twitter, Facebook and other social web mining applications are particularly popular. Along with the development of these text-based web applications, natural language processing techniques have become increasingly important, since they can help these applications provide a better user experience with more accurate results. This demand strongly stimulates research in NLP, which has made quick progress in developing data-driven applications. The NLP research domain also benefits from the availability of large amounts of natural language data and improvements in large-scale statistical machine learning methods. The following describes some of our machine learning research in the field of natural language processing.

 

4.1 Syntactic Structure Alignment for Statistical Machine Translation (SMT)

Machine translation is one of the earliest AI applications of computers to understanding and translating human languages. Early machine translation methods were mainly rule-based, relying on the grammatical rules of both the source and target languages. Today, the trend is the statistical machine translation (SMT) approach, which uses a large body of existing bilingual texts as training data for the machine to capture the statistical mapping of words and phrases between the two languages. One of the steps required is word alignment between bilingual sentences in the parallel corpus. This is the research problem that interests us.

Most current work in SMT obtains translational equivalences by initially conducting word alignment on the plain parallel corpus and then extracting the translational equivalences that are consistent with the word alignment. A decent word alignment is therefore required as a prerequisite. Such a pipeline approach to obtaining translational equivalences is arguably vulnerable to errors from the initial word alignment stage. Currently, researchers address this problem mainly by focusing on how to improve word alignment. Alternatively, we attempt to conduct syntactic structure alignment directly to obtain syntactic translational equivalences.
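For concreteness, the sketch below implements the standard consistency check of the pipeline approach: a phrase pair is extracted only if no alignment link reaches from inside the pair to outside it (extensions over unaligned words are omitted). This is the conventional step that our structure-alignment work aims to move beyond.

```python
def extract_phrases(alignment, src_len, max_len=4):
    """Extract phrase pairs consistent with a word alignment.
    `alignment` is a set of (src_idx, tgt_idx) links."""
    phrases = []
    for s1 in range(src_len):
        for s2 in range(s1, min(s1 + max_len, src_len)):
            # target positions linked to the source span [s1, s2]
            ts = [t for (s, t) in alignment if s1 <= s <= s2]
            if not ts:
                continue
            t1, t2 = min(ts), max(ts)
            # consistency: every link into [t1, t2] must come from [s1, s2]
            if all(s1 <= s <= s2 for (s, t) in alignment if t1 <= t <= t2):
                phrases.append(((s1, s2), (t1, t2)))
    return phrases

links = {(0, 0), (1, 2), (2, 1)}       # toy alignment of 3 words each side
print(extract_phrases(links, src_len=3))
```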

 

4.2 Linking Entities to a Knowledge Base

The explosive growth in the amount of textual information brings a need for building a structured Knowledge Base (KB) to organize the knowledge scattered among these unstructured texts. On the other hand, the available KBs such as Wikipedia and Google Knowledge Graph which contain rich knowledge about the world’s entities have been shown to form a valuable component for many NLP tasks. To populate or to utilize the KBs, we need to link the mentions of entities in text to their corresponding entries in the KB, which is called entity linking.

Given a knowledge base (KB), a document collection, and a list of queries in the format [name-string, document-id], the entity linking task is to determine, for each query, which KB entity is being referred to, or whether the entity is a new one not present in the reference KB. Our proposed approach is to automatically generate large-scale corpus annotations for ambiguous mentions by leveraging their unambiguous synonyms in the document collection.
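A minimal sketch of the Vector Space Model baseline discussed in the next subsection: candidates are ranked by TF-IDF cosine similarity to the query document, with a NIL threshold for new entities. The threshold value and the dictionary layout of `kb_entries` are illustrative choices.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def link(query_context, kb_entries, nil_threshold=0.1):
    """Return the id of the most similar KB entry, or 'NIL' when the
    best candidate falls below the threshold (a new entity).
    kb_entries: list of {'id': ..., 'text': ...} dicts (hypothetical)."""
    vec = TfidfVectorizer(stop_words="english")
    texts = [query_context] + [e["text"] for e in kb_entries]
    X = vec.fit_transform(texts)
    sims = cosine_similarity(X[0], X[1:]).ravel()
    best = sims.argmax()
    return kb_entries[best]["id"] if sims[best] >= nil_threshold else "NIL"
```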

4.2.1 Entity Linking Leveraging Automatically Generated Annotation

Traditionally, without any training data available, the solution is to rank the candidates by similarity using a Vector Space Model. However, it is difficult for such a ranking approach to detect a new entity not yet covered in the KB, and it is also inconvenient to combine different features. We instead create a large corpus for entity linking by an automatic method. A binary classifier is trained to filter out KB entities that are not similar to the current mentions. We further leverage Wikipedia documents to provide additional information not available in our generated corpus through a domain adaptation approach, which yields further performance improvement. The 24.1% accuracy improvement over the baseline on KBP-09 also benefits from more name variations spotted using additional knowledge from Wikipedia.

4.2.2 Entity Linking System

Most state-of-the-art entity linking systems use annotated data to learn a classifier or ranker with supervised learning algorithms. Our research initially focuses on automatically labeling a large-scale training corpus for these supervised learning algorithms, where we label ambiguous mentions by leveraging their unambiguous synonyms. We also propose an instance selection strategy to select an informative, representative and diverse subset from the auto-generated data set.

Next, we introduce topic models to entity linking for measuring the context similarity between a mention and KB entries. We propose a Wikipedia-LDA method to model the context as hidden topics instead of treating the context only as literal terms, and we investigate the effectiveness of five subsets of Wikipedia categories for representing the underlying topics. We also propose a lazy learning model for entity linking, which incorporates query-specific information into the learning process by automatically labeling some data for the queried name. Then, instead of training the linker only on the labeled data related to other names, we propose to use the predictive structure shared by the two data sets, one related to the queried name and the other to other names.

Finally, we perform the entity linking task under a more challenging scenario, linking mentions in microblogs to a KB in real time. We propose an unsupervised-supervised learning framework (USLF) based on three bipartite graphs to address the new challenges in microblogs. Our USLF uses a Bayesian method to model three clues for disambiguation: the context information of the query and entities, the popularity of entities, and the clustering result on an additional tweet set. In addition, our USLF embeds a tweet enrichment function based on word similarity, calculated in the k-principal-component space of the word set with the help of auxiliary long texts.

Our entity linking system achieved the second-best performance among 21 teams in the US National Institute of Standards and Technology (NIST) 2011 entity linking benchmark.

 

4.3 Kernel Engineering for Natural Language Processing

In the field of NLP, the most familiar data representation is a sequence, since a document can be considered a sequence of words. Many NLP applications rely on fundamental text analysis techniques developed for sequential data, and these techniques can be built on more complex data representations. Traditionally, statistical machine learning methods adopt a vector representation to express the features embedded in these complex structures. A preprocessing step is required to transform the complex structured data into a corresponding feature vector. This preprocessing often requires an expert to appropriately design and extract the features, which is obviously non-trivial. In addition, this feature engineering process may lead to a loss of information from the input data.

To better perform these structural data driven tasks, kernel methods are introduced which can directly take these structures as the input to kernel-based machine learning algorithms instead of explicitly representing the input data as a feature vector. 

4.3.1 Wavelet Kernel for Textual Data Classification

We propose a wavelet kernel method which keeps all features and uses the relationships between them to improve effectiveness. The new kernel requires the feature set to be ordered such that consecutive features are related either statistically or through some external knowledge source. The ordered feature set enables the interpretation of the vector representation of an object as a series of equally spaced observations of a hypothetical continuous signal. The new kernel maps the vector representation of objects to the L2 function space, where appropriately chosen compactly supported basis functions exploit the relations between features when calculating the similarity between two objects.
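The sketch below illustrates the idea with the simplest compactly supported basis, the Haar wavelet: because the feature set is ordered, coarse-scale coefficients pool neighbouring (related) features, so two documents using related-but-different terms still match at coarse scales. The decay weighting is an illustrative choice, not the published kernel's exact scheme.

```python
import numpy as np

def haar_features(v, decay=0.5):
    """Haar wavelet decomposition of an ordered feature vector (length
    must be a power of two), with finer scales down-weighted."""
    v = np.asarray(v, dtype=float)
    details = []
    while len(v) > 1:
        details.append((v[0::2] - v[1::2]) / np.sqrt(2))
        v = (v[0::2] + v[1::2]) / np.sqrt(2)
    # level 0 holds the finest scale; weight it least so that the
    # coarse coefficients, which pool neighbouring features, dominate
    weighted = [d * decay ** (len(details) - l) for l, d in enumerate(details)]
    return np.concatenate(weighted + [v])

def wavelet_kernel(x, y, decay=0.5):
    """Similarity of two objects in the weighted wavelet domain."""
    return float(haar_features(x, decay) @ haar_features(y, decay))

a = np.array([3, 1, 0, 0, 0, 0, 0, 0])   # toy term counts, ordered features
b = np.array([0, 2, 2, 0, 0, 0, 0, 0])   # related neighbouring terms
print(wavelet_kernel(a, b))               # positive despite disjoint terms
```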

The suggested approach is not entirely new to text representation. In order to be effective, the mathematical objects of a formal model, like vectors, have to reasonably reproduce language-related phenomena such as the word meaning inherent in index terms. The classical model of text representation, however, is only approximate when it comes to representing word meaning. Adding expansion terms to the vector representation can also improve effectiveness; the choice of expansion terms is based either on distributional similarity or on some lexical resource that establishes relationships between terms. Existing methods regard all expansion terms as equally important. The proposed kernel, in contrast, discounts less important expansion terms according to a semantic similarity distance. This approach improves effectiveness in both text classification and information retrieval.

4.3.2 Kernel Based Discourse Relation Recognition with Temporal Ordering Information

Syntactic knowledge is important for discourse relation recognition, yet so far only heuristically selected flat paths and 2-level production rules have been used to incorporate such information. We propose a tree kernel based approach to automatically mine syntactic information from parse trees for discourse analysis, applying the kernel function directly to the tree structures. These structural syntactic features, together with other flat features, are incorporated into our composite kernel to capture diverse knowledge for simultaneous discourse identification and classification for both explicit and implicit relations.
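As background for the composite kernel, here is a compact implementation of the classic convolution tree kernel in the style of Collins and Duffy, which counts the common sub-tree fragments of two parse trees with larger fragments geometrically discounted. It shows the generic machinery only, not our feature design.

```python
from itertools import product

def tree_kernel(t1, t2, lam=0.4):
    """Convolution tree kernel: sums, over all node pairs, the number
    of common sub-tree fragments rooted there, with each fragment
    discounted by lam per production. Trees are (label, [children])
    tuples; leaves have empty child lists."""
    def nodes(t):
        out = [t]
        for c in t[1]:
            out += nodes(c)
        return out

    memo = {}
    def delta(a, b):
        key = (id(a), id(b))
        if key not in memo:
            if a[0] != b[0] or [c[0] for c in a[1]] != [c[0] for c in b[1]]:
                memo[key] = 0.0          # different productions
            elif not a[1]:
                memo[key] = lam          # matching leaves/pre-terminals
            else:
                prod = lam
                for ca, cb in zip(a[1], b[1]):
                    prod *= 1 + delta(ca, cb)
                memo[key] = prod
        return memo[key]

    return sum(delta(a, b) for a, b in product(nodes(t1), nodes(t2)))

t = ("S", [("NP", [("he", [])]), ("VP", [("ran", [])])])
print(tree_kernel(t, t))   # self-similarity of a tiny parse tree
```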

Experiments show that the tree kernel approach gives statistically significant improvements over the flat syntactic path feature. We also illustrate that the tree kernel covers more structural information than production rules, which allows it to incorporate information from a higher-dimensional space for potentially better discrimination. We further propose to leverage temporal ordering information to constrain the interpretation of discourse relations, which also yields statistically significant improvements for discourse relation recognition on PDTB 2.0, for both explicit and implicit relations.

4.3.3 Tree Sequence Based Kernels

Existing tree kernels explore structural features with respect to a single sub-tree representation. The structure of a large single sub-tree may be sparse in the data set, which prevents large structures from being effectively utilized. Sometimes only certain parts of a large sub-tree are beneficial rather than the entire sub-tree, in which case using the entire structure may introduce noise. To address these deficiencies, we investigate the phrase parse tree and design more sophisticated kernels to explore the structural features embedded in phrase trees beyond the single sub-tree representation. Specifically, we propose tree sequence based kernels to explore the structural features in phrase parse trees for various NLP applications. We show that our algorithm can benefit a variety of NLP applications such as Statistical Machine Translation, Question Classification and Entity Relation Extraction.

 

4.4 Event Resolution

4.4.1 Anaphora Resolution

Anaphora resolution, the task of resolving a given text expression to its referred expression in prior texts, is important for intelligent text processing and information extraction. Most previous work on anaphora resolution aims at object anaphora in which both an anaphor and its antecedent are mentions of the same concrete real world object. In contrast, event anaphors are mentions that refer back to an abstract entity, i.e. an event expressed by a verb rather than a noun.

Event anaphora resolution is useful for cascaded event template extraction and other natural language processing studies. In our preliminary study, we investigate general-purpose event pronoun resolution. We first explore various lexical, positional and syntactic features useful for event pronoun resolution. We further explore a tree kernel to model the structural information embedded in the syntactic parse, and a composite kernel is then used to combine the above diverse information. We also look into incorporating negative training instances with non-anaphoric pronouns; such instances were not used in previous studies on coreference or anaphora resolution because they make the training set too unbalanced. Our study shows that, with a random sampling strategy, they are very useful for the final resolution.

4.4.2 Coreference Resolution

Coreference resolution, the task of resolving a given text expression to its referred expression in prior texts, is important for intelligent text processing and information extraction. Most previous work on coreference resolution aims at object coreference, in which both text expressions are mentions of the same concrete real-world object (e.g. person, location or organization), as in the following example:

“I bought a house yesterday. It was in the sub-urban area”.

Here the pronoun “it” refers to the object “house”. In contrast, event coreferences are mentions that refer to an abstract entity such as an action, event or proposition. The following are three examples of event coreference:

“I ran two miles. The run did me good.”
“Fred wanted to go to the movies. But his mother wouldn’t allow it.”
“John’s hitting Fred got everyone in trouble, for it led to a brawl”

In the above, the action “ran”, the intention “to go to the movies”, and the event “John’s hitting Fred” are each referred to by a later mention in the respective sentences. Event coreference resolution is very important for event template extraction and other natural language processing tasks, but it is much more challenging than object coreference resolution.

In our research on event coreference resolution, we adopt a divide-and-conquer approach, separating the complicated task into three steps: event mention detection, mention pair resolution, and event chain formation. At the event mention detection step, we propose a topic-related approach to effectively identify potential event mentions with minimal noise. At the mention pair resolution step, we propose a collective decision process utilizing multiple resolution results and a new training strategy. At the last and most important step, event chain formation, we propose two graph partitioning methods (spectral graph partitioning and random walk partitioning) with deep linguistic constraints and preferences. Our proposed resolution system achieves state-of-the-art performance.
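The sketch below illustrates the event chain formation step as plain spectral clustering over pairwise coreference scores (hypothetical numbers); our actual method adds deep linguistic constraints and preferences on top of the partitioning.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# Hypothetical pairwise coreference scores between five event mentions,
# as produced by the mention pair resolution step.
S = np.array([[1.0, 0.9, 0.1, 0.2, 0.1],
              [0.9, 1.0, 0.2, 0.1, 0.1],
              [0.1, 0.2, 1.0, 0.8, 0.7],
              [0.2, 0.1, 0.8, 1.0, 0.9],
              [0.1, 0.1, 0.7, 0.9, 1.0]])

labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(S)
print(labels)   # mentions sharing a label form one event chain
```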

 

4.5 Disambiguation of Ambiguous Sentiment Adjectives

We propose supervised and unsupervised hierarchical dependency-based methods to disambiguate ambiguous sentiment adjectives (ASAs) within context and identify their sentiment polarity as positive or negative. Firstly, we demonstrate that there are determinant words in context which can be effectively used to disambiguate ASAs, rather than only using the phrase containing ASAs. For this purpose, we propose a hierarchical dependency-based method (HDEPM) to extract the determinant words. Then we employ the determinant words extracted by our hierarchical method in supervised and unsupervised manners to disambiguate ASAs.

Our supervised method simply uses the determinant words as the features of a Naive Bayes classifier, while our unsupervised method works from the automatically detected association strength between the determinant words and the ambiguous adjective. We evaluate our methods on 410 sentences containing a sentiment-ambiguous adjective, sampled from the movie and hotel domains. The experimental results show that our hierarchical methods outperform methods based on phrase-level disambiguation. The supervised learning method achieves an accuracy of 82.20%, while the accuracy of the supervised phrase-level method is 75.37%. The accuracy of the unsupervised learning algorithm ranges from 69.02% to 74.39%, depending on the set of determinant words extracted by HDEPM.
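A toy sketch of the supervised route, with hypothetical determinant-word contexts and labels; in the real system the determinant words come from the hierarchical dependency-based extraction (HDEPM).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical determinant-word contexts for ambiguous adjectives and
# the polarity those contexts induce.
contexts = ["battery life long", "waiting time long",
            "screen big bright", "repair bill big"]
polarity = ["pos", "neg", "pos", "neg"]

clf = make_pipeline(CountVectorizer(), MultinomialNB()).fit(contexts, polarity)
print(clf.predict(["response time long"]))   # toy prediction
```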


4.6 Unsupervised structure induction

Many Natural Language Processing (NLP) tasks involve some kind of structure analysis, such as word alignment for machine translation, syntactic parsing for coreference resolution, and semantic parsing for question answering. Traditional supervised learning methods rely on manually labeled structures for training. The rising amount of text data available on the Internet provides a huge resource for these tasks and can improve their performance. Unfortunately, manual annotation is often expensive and time-consuming for large amounts of rich text. It is therefore of great value to automatically induce structures from un-annotated sentences. In this research, we explore three unsupervised structure induction tasks: transliteration equivalence learning, constituency grammar induction, and dependency grammar induction.

We first investigate transliteration equivalence learning, where transliterated bilingual word pairs are given without internal syllable alignments. We study nonparametric Bayesian learning methods and propose a synchronous adaptor grammar for machine transliteration. Experiments demonstrate that the proposed model significantly outperforms the baseline EM-based model on the transliteration task for four language pairs. Our model provides a general framework for automatically learning syllable equivalents without heuristics or restrictions.

In linguistics, a constituent is a word or a group of words that functions as a single unit. There are many kinds of constituents according to their linguistic function, such as sentences, noun phrases, verb phrases, and prepositional phrases. The hierarchical structure of constituents forms a constituency tree, from which we can extract context-free transformation rules. To facilitate syntactic analysis, constituency treebanks have been created for some languages, such as the Penn English Treebank and the Penn Chinese Treebank. However, manually creating tree structures is expensive and time-consuming, and annotated treebanks are so far available for only a few languages. Constituency grammar induction, which allows us to learn a constituency grammar from plain strings (words or part-of-speech tags), is therefore useful. The induced grammar can be used to construct large treebanks, study language acquisition, improve machine translation, and so on. We propose a local feature-based model with gradient-based optimization for unsupervised constituency grammar induction. Various kinds of linguistic knowledge can be encoded into the model, which shows improvements over baselines.

Constituency grammars perform well for languages with relatively strict word order (e.g. English). However, some free word order languages (e.g. Czech, Turkish) lack a finite verb phrase constituent, making constituency parsing difficult. In contrast, dependency grammars model word-to-word dependency relations, which are more suitable for languages with free word order. Combinatory Categorial Grammar (CCG) is a linguistically expressive lexicalized grammar formalism that is able to capture long-range dependencies. We follow the state-of-the-art induction approach and propose a boundary model and a Bayesian model, which improve on the baseline on Penn Treebank datasets.

 

4.7 From Semantic to Emotional Space in Sense Sentiment Analysis

Sentiment analysis, or opinion mining, aims to enable computers to derive sentiment from human language. Here, we address sense sentiment similarity, one of the fundamental tasks in sentiment analysis: inferring the similarity between word pairs with respect to their senses and their underlying sentiments.

To achieve this aim, we analyze sentiments in Indirect Question Answer Pairs (IQAPs) to infer “yes” or “no” responses, and show how similarity between the adjectives in the question and answer can be a main factor in this inference task. Consider the following IQAPs as examples, with the adjectives playing pivotal roles in inferring different degrees of yes and no:

E1. Q: Do you think that’s a great idea?
    A: I think it’s acceptable.
    Inferred answer: Weak-yes

E2. Q: Was she the best one on that old show?
    A: She was simply funny.
    Inferred answer: Strong-yes

E3. Q: He says he opposes amnesty, but…. Is he right?
    A: He is a bit funny.
    Inferred answer: Weak-no

E4. Q: ….Is that true?
    A: This is extraordinary and preposterous.
    Inferred answer: Strong-no

 

The traditional approach to finding similarity between words is based on semantic similarity, using methods such as Latent Semantic Analysis, Pointwise Mutual Information, and WordNet-based similarity. In our research, we show that another type of similarity measure, based on emotion rather than semantics, captures the underlying sentiment between words more accurately. We propose an effective approach to compute sentiment similarity through a connection between the semantic space and an emotional space: our model maps the senses of words to vectors of basic human emotions. We initially employ a fixed set of twelve basic human emotions to construct the emotion vectors. However, our experiments show that there is little agreement on the number and types of basic emotions. This observation leads to our next model, in which the number and types of basic emotions are not pre-defined; in other words, the emotions are deemed hidden. A probabilistic approach is thus proposed to infer sense sentiment similarity with respect to automatically learned hidden emotions, which we learn from the interaction between reviews, ratings, and words. We show that our emotion similarity based model significantly outperforms two popular semantic similarity measures.
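A minimal sketch of the fixed-emotion-vector model: each word sense is mapped to a vector over basic emotions (a hypothetical four-emotion set here instead of the twelve used initially, with made-up weights), and sentiment similarity is the cosine between the vectors.

```python
import numpy as np

# Hypothetical emotion vectors over (joy, anger, fear, sadness);
# the weights below are illustrative, not learned values.
emotion_vec = {
    "acceptable":   np.array([0.4, 0.1, 0.1, 0.1]),
    "great":        np.array([0.9, 0.0, 0.0, 0.0]),
    "preposterous": np.array([0.0, 0.7, 0.2, 0.3]),
}

def sentiment_similarity(w1, w2):
    """Cosine similarity in emotional space rather than semantic space."""
    a, b = emotion_vec[w1], emotion_vec[w2]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(sentiment_similarity("great", "acceptable"))    # high -> weak-yes
print(sentiment_similarity("great", "preposterous"))  # low  -> strong-no
```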

 
