Wei Lu and Min-Yen Kan
AIRS 2005 (Jeju Island, Korea)
Text Categorization Baseline
•Vector Space Model (VSM) is used in Text Categorization
•Bag of words approach
•Our baseline, using the text categorization bag-of-words approach: 87.47% accuracy
–A high baseline, but it still leaves a gap of about 12.5 percentage points to close
–It is therefore also important to measure the error reduction rate
First, I would like to introduce the baseline system and its performance on this task.
The vector space model is commonly used in conventional text categorization. The basic idea is to tokenize the text into a bag of words, which then serves as the feature vector representing the original text.
In the baseline system, we simply treat the JavaScript code as plain text and tokenize it, using whitespace and punctuation symbols as delimiters. We then pass the tokens to the Weka SMO classifier for classification.
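To make the tokenization step concrete, here is a minimal Python sketch of splitting a JavaScript snippet on whitespace and punctuation and counting the resulting tokens. It is only an illustration, not the code used in the baseline: the function names `tokenize` and `bag_of_words`, the use of Python's `string.punctuation` as the delimiter set, and the sample snippet are all assumptions, and the Weka SMO classification step is not shown.

```python
import re
import string
from collections import Counter

# Assumed delimiter set: any whitespace plus the ASCII punctuation characters.
_DELIMITERS = re.compile(r"[\s" + re.escape(string.punctuation) + r"]+")


def tokenize(source):
    """Split source text on runs of whitespace/punctuation, dropping empty tokens."""
    return [token for token in _DELIMITERS.split(source) if token]


def bag_of_words(source):
    """Map each token to its number of occurrences (word order is discarded)."""
    return Counter(tokenize(source))


# Hypothetical JavaScript fragment, treated as plain text.
snippet = 'function show(id) { document.getElementById(id).style.display = "block"; }'
bag = bag_of_words(snippet)
print(bag)
```

Each distinct token becomes one dimension of the feature vector, and its count (or a weighting derived from it) becomes the value in that dimension.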
As you can see, the baseline accuracy of 87.47% is high compared with many other classification tasks. However, there is still room for improvement. Given such a high baseline, we also measured the error reduction rate in our evaluations to assess the effectiveness of our techniques.
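The error reduction rate can be sketched as follows, assuming the usual definition: the fraction of the baseline's errors that a new system removes. The function name `error_reduction_rate` is hypothetical, and the 90.0% accuracy in the example is an illustrative figure only, not a result from our evaluation.

```python
def error_reduction_rate(baseline_accuracy, new_accuracy):
    """Fraction of the baseline's errors removed by the new system.

    Both arguments are accuracies in percent (0 to 100).
    """
    baseline_error = 100.0 - baseline_accuracy
    if baseline_error <= 0:
        raise ValueError("baseline has no errors left to reduce")
    return (new_accuracy - baseline_accuracy) / baseline_error


# Baseline accuracy from the slide; the new accuracy is a made-up example value.
rate = error_reduction_rate(87.47, 90.0)
print(f"error reduction: {rate:.1%}")
```

With a high baseline, a small gain in absolute accuracy can correspond to a sizeable share of the remaining errors, which is why this measure is reported alongside accuracy.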