11 Oct 2005
CS 5244 - Computational Document Analysis
Feature Selection
• Content-specific features (Foster 90)
  - Key words, special characters
• Style markers
  - Word- or character-based features
    - Length of words, vocabulary richness
  - Function words (Mosteller & Wallace 64)
• Structural features
  - Email: Title or signature, paragraph separators (de Vel et al. 01)
  - Can generalize to HTML tags
  - To think about: artifact of authoring software?
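
The style markers and structural features listed above could be computed roughly as in the sketch below. It is an illustration under stated assumptions, not the method of any cited paper: the function-word list is a small hypothetical sample, vocabulary richness is approximated by the type-token ratio, and the names `style_markers`, `structural_features`, and `TagCounter` are invented here.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical, tiny function-word list for illustration only;
# it is NOT the list used by Mosteller & Wallace.
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "upon", "by")


def style_markers(text):
    """Return simple word-based style features for one document."""
    # Crude tokenizer (an assumption): runs of ASCII letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return {}
    counts = Counter(words)
    features = {
        # Mean number of characters per word.
        "avg_word_length": sum(len(w) for w in words) / len(words),
        # Type-token ratio: distinct words / total words. One crude
        # measure of vocabulary richness; it depends on document length.
        "type_token_ratio": len(counts) / len(words),
    }
    # Relative frequency of each function word.
    for fw in FUNCTION_WORDS:
        features["fw_" + fw] = counts[fw] / len(words)
    return features


class TagCounter(HTMLParser):
    """Counts how often each HTML start tag occurs."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1


def structural_features(html):
    """Return a dict mapping tag name to its number of occurrences."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return dict(parser.tags)
```

Calling `style_markers(document_text)` yields one feature dictionary per document, and `structural_features(page_source)` does the same for tag counts. Tag counts are the kind of feature most likely to reflect the authoring software rather than the author.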