
CHIME Text Processing Seminar



Text Processing Research


A Term Weighting Scheme for Text Categorization


In text categorization, term weighting methods assign weights to terms to improve classification performance. In this study, we propose a term weighting scheme, tf.rf (term frequency times relevance frequency), and investigate several widely used unsupervised and supervised term weighting methods on two popular corpora (the Reuters Corpus and the 20 Newsgroups Corpus) in combination with the SVM and kNN algorithms. Our controlled experiments show that not all existing supervised term weighting methods are consistently superior to unsupervised ones. In contrast, our proposed tf.rf scheme consistently achieves the best performance and outperforms the other methods substantially and significantly.
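
A minimal sketch of how tf.rf weights might be computed for one document and one target category. This page does not state the formula; the sketch assumes the formulation rf = log2(2 + a / max(1, c)) given in the tf.rf literature, where a and c are the numbers of positive-category and negative-category training documents that contain the term. The function names and the toy documents are illustrative only.

```python
import math
from collections import Counter


def relevance_frequency(a, c):
    """Assumed formulation: rf = log2(2 + a / max(1, c)).

    a: positive-category documents containing the term
    c: negative-category documents containing the term
    """
    return math.log2(2 + a / max(1, c))


def tf_rf_weights(doc_tokens, pos_docs, neg_docs):
    """Return {term: tf * rf} for one document (hypothetical helper).

    doc_tokens: list of tokens in the document to be weighted
    pos_docs, neg_docs: training documents, each a list of tokens
    """
    pos_sets = [set(d) for d in pos_docs]
    neg_sets = [set(d) for d in neg_docs]
    weights = {}
    for term, tf in Counter(doc_tokens).items():
        a = sum(1 for d in pos_sets if term in d)
        c = sum(1 for d in neg_sets if term in d)
        weights[term] = tf * relevance_frequency(a, c)
    return weights


# Toy data, invented for illustration.
pos_docs = [["protein", "binds"], ["protein", "kinase"]]
neg_docs = [["stock", "market"], ["protein", "shake"]]
weights = tf_rf_weights(["protein", "protein", "market"], pos_docs, neg_docs)
```

Under this assumed formula, a term concentrated in the positive category ("protein" here) receives a larger rf factor than a term that occurs only in negative documents ("market"), whose rf falls to the minimum of 1.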

Biomedical Text Classification


We applied our proposed tf.rf term weighting scheme to three biomedical corpora (the OHSUMED Corpus, the BioCreAtivE II Corpus, and the 18 Journal Corpus) and showed that tf.rf outperforms other existing term weighting schemes. The second corpus arose from the BioCreAtivE II challenge workshop in 2007, a community-wide effort to evaluate text mining and information extraction systems. We participated in the sub-task of identifying protein-protein interaction articles in the biology literature. Our categorization result, based on the tf.rf scheme, achieved the top F1 score among 19 international teams.

Beyond the bag-of-words approach, we are now studying alternative ways to represent text based on advanced natural language techniques, such as protein named entities, and on biological domain knowledge, such as trigger keywords. These feature representations are evaluated using an SVM classifier on the BioCreAtivE II benchmark corpus. In general, our work supports the need for more sophisticated natural language processing techniques.

Protein-Protein Interaction Information Extraction


The rapid growth of the biological literature in recent years makes it difficult for biologists to keep up with current research or to find the particular pieces of information they need. Natural language processing (NLP) techniques have therefore been applied in the biomedical domain to tasks such as protein-protein interaction (PPI) extraction. PPI extraction seeks to identify the relationships between genes, proteins, and sequences described in the biomedical literature. In this study, we propose using Maximum Entropy to extract protein-protein interaction information from the biomedical literature. Our method overcomes the limitations of existing co-occurrence-based and rule-based approaches. It incorporates corpus statistics of various lexical, syntactic, and semantic features, such as surrounding words, keywords, and abbreviations. Our approach achieves 93.9% recall and 88.0% precision on the IEPA corpus.

Word Sense Disambiguation


Sense disambiguation is essential for many language applications such as machine translation, information retrieval, and speech processing. Most sense disambiguation methods depend heavily on manually compiled lexical resources. However, these resources often miss domain-specific word senses and omit many new words altogether, which limits the applicability of sense disambiguation in such domains. We have studied various machine learning approaches to word sense disambiguation.

In one of our studies, we propose an unsupervised word sense learning algorithm, which induces the senses of a target word by grouping its occurrences into a "natural" number of clusters based on the similarity of their contexts. In another study, we exploit English-Chinese parallel texts and semi-supervised learning to scale up word sense disambiguation.

In a further study, we rely on the global context by using topic features constructed with a machine learning algorithm known as latent Dirichlet allocation (LDA). The features used include the part of speech of neighboring words, single words in the surrounding context, local collocations, and syntactic patterns. Based on this work, our entry in the SemEval 2007 evaluation was ranked first in the word sense disambiguation lexical sample task.
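
The feature types listed above can be illustrated with a small extractor. This is a sketch under stated assumptions, not the system entered in SemEval 2007: the function name, the feature naming, the window size, and the padding token are all hypothetical, the part-of-speech tags are assumed to be supplied by an external tagger, and the syntactic patterns and LDA topic features are omitted.

```python
def local_context_features(tokens, pos_tags, target_index, window=3):
    """Build a sparse binary feature dict for one target word occurrence.

    tokens, pos_tags: parallel lists for one sentence (tags assumed given)
    target_index: position of the ambiguous word in the sentence
    window: how many neighbors on each side contribute POS features
    """
    def at(seq, i):
        # Hypothetical padding symbol for positions outside the sentence.
        return seq[i] if 0 <= i < len(seq) else "<pad>"

    features = {}

    # Part of speech of neighboring words, keyed by relative offset.
    for offset in range(-window, window + 1):
        if offset != 0:
            tag = at(pos_tags, target_index + offset)
            features[f"pos[{offset:+d}]={tag}"] = 1

    # Single words in the surrounding context (unordered, lowercased).
    for i, tok in enumerate(tokens):
        if i != target_index:
            features[f"word={tok.lower()}"] = 1

    # Local collocations: the immediate left and right neighbors.
    left = at(tokens, target_index - 1).lower()
    right = at(tokens, target_index + 1).lower()
    features[f"colloc[-1]={left}"] = 1
    features[f"colloc[+1]={right}"] = 1
    features[f"colloc[-1,+1]={left}_{right}"] = 1
    return features


# Toy sentence, invented for illustration; the target word is "bank".
feats = local_context_features(
    ["The", "bank", "approved", "the", "loan"],
    ["DT", "NN", "VBD", "DT", "NN"],
    target_index=1,
    window=1,
)
```

A dictionary of this kind could be passed to any classifier that accepts sparse binary features; which classifier was actually used is not stated here.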

Co-reference Resolution


Co-reference resolution is the process of linking multiple expressions that refer to the same entity. This is an important step in understanding text for information extraction. Traditional supervised machine learning approaches adopt a single-candidate representation of the co-reference resolution problem: only the information between an anaphor and one antecedent is used for the co-reference decision. In contrast, our approach adopts a twin-candidate learning model, which represents the competition between antecedent candidates reliably and thus ensures that the most preferred candidate is selected. Various enhancements of the twin-candidate model have been investigated, such as the identification of non-anaphors during resolution and the use of semantic compatibility information from the web.
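
The idea of candidates competing in pairs can be sketched as a round-robin comparison. This is an illustration under assumptions rather than the model itself: the description above does not specify how pairwise decisions are combined, so the win counting and the tie-break toward the later candidate are choices made for this sketch, and `toy_prefer` is a hypothetical stand-in for a learned classifier.

```python
from itertools import combinations


def resolve_antecedent(anaphor, candidates, prefer_first):
    """Pick an antecedent by comparing every pair of candidates.

    prefer_first(anaphor, c1, c2) returns True if c1 is preferred over c2.
    Assumption of this sketch: the candidate with the most pairwise wins
    is selected, and ties go to the candidate listed later.
    """
    if not candidates:
        return None
    wins = [0] * len(candidates)
    for i, j in combinations(range(len(candidates)), 2):
        if prefer_first(anaphor, candidates[i], candidates[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    best = max(range(len(candidates)), key=lambda k: (wins[k], k))
    return candidates[best]


def toy_prefer(anaphor, c1, c2):
    """Hypothetical rule: gender agreement first, then proximity."""
    m1 = c1["gender"] == anaphor["gender"]
    m2 = c2["gender"] == anaphor["gender"]
    if m1 != m2:
        return m1
    return c1["pos"] > c2["pos"]  # larger pos means closer to the anaphor


# Toy mentions, invented for illustration.
anaphor = {"text": "she", "gender": "f"}
candidates = [
    {"text": "Mary", "gender": "f", "pos": 0},
    {"text": "John", "gender": "m", "pos": 1},
    {"text": "the report", "gender": "n", "pos": 2},
]
chosen = resolve_antecedent(anaphor, candidates, toy_prefer)
```

In this toy example "Mary" wins both of her comparisons and is selected even though "the report" is closer to the anaphor, which a single-candidate proximity rule alone would not capture.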

Statistical Machine Translation


Statistical machine translation (SMT) is a machine translation paradigm in which translations are generated on the basis of statistical models whose parameters are derived from the analysis of bilingual text corpora. In our work, we propose a forest-to-forest alignment of bilingual text corpora to model the translation process. A forest is an ordered subtree sequence covering a consecutive tree fragment. Forest rules are more powerful than phrases or tree rules because they can capture all phrases (both syntactic and non-syntactic) with syntactic structure information and allow any tree node insertion, deletion, and substitution over a longer span. This gives our model much more expressive power: not only can it capture non-syntactic phrases and discontinuous phrases with syntactic structure information, but it also supports global structure re-ordering. To the best of our knowledge, this is the first attempt to use a forest-to-forest alignment-based model in SMT.

 
     