CHIME Text Processing Seminar
Text Processing Research
|
|
A Term Weighting Scheme for Text Categorization
|
In text categorization, term weighting methods assign appropriate weights to terms to improve classification
performance. In this study, we propose an effective term weighting scheme, tf.rf (term frequency times
relevance frequency), and investigate several widely used unsupervised and supervised term weighting methods on two
popular corpora (the Reuters Corpus and the 20 Newsgroups Corpus) in combination with SVM and kNN algorithms. Our
controlled experiments show that not all existing supervised term weighting methods are consistently superior to
unsupervised term weighting methods. In contrast, our proposed tf.rf weighting scheme consistently achieves the best
performance and outperforms the other methods substantially and significantly.
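
As a rough illustration of the weighting idea, the sketch below computes tf.rf for one term with respect to one category. The text above only states that tf.rf is term frequency times relevance frequency; the specific form rf = log2(2 + a / max(1, c)), where a and c are the numbers of positive-category and negative-category documents containing the term, is the form commonly given for this scheme and should be read as an assumption here. The function names and the toy documents are hypothetical.

```python
import math

def relevance_frequency(term, positive_docs, negative_docs):
    """Assumed form: rf = log2(2 + a / max(1, c)); not stated in the text above."""
    a = sum(1 for doc in positive_docs if term in doc)  # positive docs containing the term
    c = sum(1 for doc in negative_docs if term in doc)  # negative docs containing the term
    return math.log2(2 + a / max(1, c))

def tf_rf(term, doc, positive_docs, negative_docs):
    """Raw term frequency in `doc` times the category-level relevance frequency."""
    tf = doc.count(term)
    return tf * relevance_frequency(term, positive_docs, negative_docs)

# Toy tokenized documents (hypothetical data).
positive = [["protein", "binds", "protein"], ["protein", "kinase"]]
negative = [["market", "price"], ["kinase", "market"]]

weight = tf_rf("protein", positive[0], positive, negative)
```

Under this assumed form, a term concentrated in the positive category receives a larger rf than one spread evenly across both, while a term seen only in negative documents keeps the floor value of 1.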
|
 |
Biomedical Text Classification
|
We applied our proposed tf.rf term weighting scheme to three biomedical corpora (the OHSUMED Corpus, the BioCreAtivE
II Corpus and the 18 Journal Corpus) and showed that tf.rf outperforms other existing term weighting schemes. The second
corpus arose from the BioCreAtivE II challenge workshop in 2007, a community-wide effort to evaluate text
mining and information extraction systems. We participated in the sub-task of identifying protein-protein interaction
articles in the biology literature. Our categorization result, based on our tf.rf scheme, achieved the top F1 score
among 19 international teams.
Beyond the bag-of-words approach, we are now studying alternative ways to represent text based on advanced
natural language techniques, such as protein named entities, and on biological domain knowledge, such as trigger keywords.
These feature representations are evaluated using an SVM classifier on the BioCreAtivE II benchmark corpus. In general,
our work supports the need for more sophisticated natural language processing techniques.
|
 |
Protein-Protein Interaction Information Extraction
|
The rapid growth of the biological literature in recent years makes it difficult for biologists to keep up with current
research or to find the particular pieces of information they need. Natural language processing (NLP)
techniques have therefore been applied in the biomedical domain to tasks such as protein-protein interaction
(PPI) extraction. PPI extraction seeks to identify relationships between genes, proteins, and sequences in the biomedical
literature. In this study, we propose using Maximum Entropy to extract protein-protein interaction information from
biomedical literature. Our method overcomes the limitations of existing co-occurrence-based and rule-based approaches.
It incorporates corpus statistics of various lexical, syntactic, and semantic features, such as surrounding words,
keywords, and abbreviations. Our approach achieves 93.9% recall and 88.0% precision on the IEPA corpus.
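
To make the classification setup more concrete, here is a minimal sketch of a binary maximum entropy classifier (equivalent to logistic regression) over sparse lexical features for a candidate protein pair. The text above does not specify the exact features or the training procedure, so the feature templates, the plain stochastic gradient ascent, the keyword list and the toy sentences are all illustrative assumptions, not the method evaluated on the IEPA corpus.

```python
import math
from collections import defaultdict

def extract_features(tokens, protein_a, protein_b, keywords):
    """Hypothetical features: words between the two proteins, plus a keyword flag."""
    i, j = sorted((tokens.index(protein_a), tokens.index(protein_b)))
    feats = {"bias": 1.0}
    for word in tokens[i + 1:j]:
        feats["between=" + word] = 1.0
        if word in keywords:
            feats["keyword"] = 1.0
    return feats

def predict(weights, feats):
    """Probability that the pair interacts under the current weights."""
    score = sum(weights[name] * value for name, value in feats.items())
    return 1.0 / (1.0 + math.exp(-score))

def train(examples, epochs=200, learning_rate=0.5):
    """Stochastic gradient ascent on the conditional log-likelihood (assumed trainer)."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for feats, label in examples:
            error = label - predict(weights, feats)
            for name, value in feats.items():
                weights[name] += learning_rate * error * value
    return weights

# Toy labeled sentences (hypothetical data): 1 = interaction, 0 = no interaction.
keywords = {"binds", "interacts", "activates"}
sentences = [
    ("P1 binds P2".split(), 1),
    ("P1 interacts with P2".split(), 1),
    ("P1 and P2 were measured".split(), 0),
    ("P1 or P2 expression".split(), 0),
]
examples = [(extract_features(t, "P1", "P2", keywords), y) for t, y in sentences]
weights = train(examples)

unseen = extract_features("P1 activates P2".split(), "P1", "P2", keywords)
probability = predict(weights, unseen)
```

The unseen sentence shares only the keyword flag with the positive training examples, which shows how a feature-based model can generalize beyond the exact word patterns a rule-based system would need to list.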
|
 |
Word Sense Disambiguation
|
Sense disambiguation is essential for many language applications such as machine translation, information retrieval,
and speech processing. Most sense disambiguation methods depend heavily on manually compiled lexical
resources. However, these resources often miss domain-specific word senses, and many new words are not
included at all, which limits the applicability of sense disambiguation in such domains. We have studied various
machine learning approaches to word sense disambiguation.
In one of our studies, we propose an unsupervised word sense learning algorithm, which induces the senses of a target
word by grouping its occurrences into a "natural" number of clusters based on the similarity of their contexts.
In another study, we exploit English-Chinese parallel texts and semi-supervised learning to scale up word sense
disambiguation.
In a third study, we rely on the global context by using topic features constructed with a machine learning
algorithm known as latent Dirichlet allocation (LDA). The features include the part of speech of neighboring words,
single words in the surrounding context, local collocations, and syntactic patterns. Based on this work, our entry in
the SemEval 2007 evaluation was ranked first in the word sense disambiguation lexical sample task.
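
The idea of inducing senses by grouping occurrences with similar contexts, without fixing the number of clusters in advance, can be sketched as follows. This is not the algorithm from the study: the greedy single-pass clustering, the bag-of-words cosine similarity, the threshold value and the toy contexts are all assumptions chosen to keep the example short.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def induce_senses(contexts, threshold=0.3):
    """Greedy clustering: join the most similar cluster, or start a new one.

    The number of clusters is not given; it emerges from the threshold.
    """
    clusters = []  # each entry: (summed context Counter, list of occurrence indices)
    for index, words in enumerate(contexts):
        vector = Counter(words)
        best, best_similarity = None, threshold
        for cluster in clusters:
            similarity = cosine(vector, cluster[0])
            if similarity >= best_similarity:
                best, best_similarity = cluster, similarity
        if best is None:
            clusters.append((vector, [index]))
        else:
            best[0].update(vector)  # accumulate counts into the cluster summary
            best[1].append(index)
    return [members for _, members in clusters]

# Context words around four occurrences of the target word "bank" (hypothetical data).
contexts = [
    ["money", "deposit", "account"],
    ["river", "water", "shore"],
    ["loan", "money", "account"],
    ["river", "fishing", "shore"],
]
senses = induce_senses(contexts)
```

Here the financial and river contexts end up in separate groups because they share no words, so two senses are induced without the count being specified.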
|
 |
Co-reference Resolution
|
Co-reference resolution is the process of linking multiple expressions that refer to the same entity. This is an
important step in understanding text for information extraction. Traditional supervised machine learning
approaches adopt a single-candidate representation of the co-reference resolution problem; that is, only the
information between an anaphor and one antecedent candidate is used for the co-reference decision. In contrast, our
approach adopts a twin-candidate learning model, which represents the competition between antecedent candidates
reliably and thus ensures that the most preferred candidate is selected. Various enhancements of the twin-candidate
model have been investigated, such as the identification of non-anaphors during resolution and the use of semantic
compatibility information from the web.
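
One way to picture the twin-candidate idea is as a tournament: a pairwise model compares two antecedent candidates at a time for a given anaphor, and the candidate that wins the most comparisons is selected. The sketch below assumes a round-robin scheme and replaces the learned classifier with a hand-written heuristic (`prefer`); both choices, along with the dictionary fields and the toy mentions, are hypothetical and only illustrate the selection step.

```python
from itertools import combinations

def prefer(anaphor, cand_a, cand_b):
    """Hypothetical stand-in for a learned pairwise classifier.

    Prefers the candidate that agrees in number with the anaphor; if both or
    neither agree, prefers the one closer to the anaphor (larger position).
    """
    agree_a = cand_a["number"] == anaphor["number"]
    agree_b = cand_b["number"] == anaphor["number"]
    if agree_a != agree_b:
        return cand_a if agree_a else cand_b
    return cand_a if cand_a["position"] > cand_b["position"] else cand_b

def resolve(anaphor, candidates):
    """Round-robin tournament: return the candidate with the most pairwise wins."""
    wins = {c["text"]: 0 for c in candidates}
    for cand_a, cand_b in combinations(candidates, 2):
        wins[prefer(anaphor, cand_a, cand_b)["text"]] += 1
    return max(candidates, key=lambda c: wins[c["text"]])["text"]

# Toy mentions (hypothetical data).
anaphor = {"text": "they", "number": "plural", "position": 10}
candidates = [
    {"text": "the researchers", "number": "plural", "position": 2},
    {"text": "the corpus", "number": "singular", "position": 5},
    {"text": "a classifier", "number": "singular", "position": 8},
]
antecedent = resolve(anaphor, candidates)
```

Unlike a single-candidate model, which scores each candidate in isolation, every decision here is made between two competing candidates, so the final choice reflects direct comparisons.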
|
 |
Statistical Machine Translation
|
Statistical machine translation (SMT) is a machine translation paradigm in which translations are generated on the basis of statistical models whose parameters are derived from the analysis of bilingual text corpora. In our work, we propose a forest-to-forest alignment of bilingual text corpora to model the translation process. A forest is an ordered sequence of subtrees covering a consecutive tree fragment. Forest rules are more powerful than phrases or tree rules because they can capture all phrases (both syntactic and non-syntactic) together with syntactic structure information, and they allow insertion, deletion and substitution of any tree node over a longer span. This gives our model much more expressive power: not only can it capture non-syntactic phrases and discontinuous phrases with syntactic structure information, but it also supports global structure re-orderings. To the best of our knowledge, this is the first attempt to use a forest-to-forest alignment-based model in SMT.