13 Nov 2004
WIDM 04: Lee et al. Co-training Web Block Classification
Lexical Features
•For each block:
–POS tag distribution in text
–Stemmed tokens weighted by TF×IDF
•IDF computed from Stanford’s WebBase crawl
–Number of words
–Alt text of images
–Hyperlink type (e.g., embedded image, text, mailto)
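A minimal sketch of two of the features listed above (word count and TF×IDF-weighted stemmed tokens), under stated assumptions. The names `DOC_FREQ`, `NUM_DOCS`, `crude_stem`, and `lexical_features` are hypothetical, the document frequencies are made-up illustration values rather than figures from a real crawl, and the suffix stripper only stands in for a real stemmer such as Porter's. It is not the authors' implementation.

```python
import math
import re
from collections import Counter

# Hypothetical document frequencies; in the slide, IDF comes from a large web crawl.
DOC_FREQ = {"price": 50, "contact": 200, "the": 990}
NUM_DOCS = 1000  # assumed size of the reference collection


def crude_stem(token):
    """Stand-in for a real stemmer: strip a few common English suffixes."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def lexical_features(block_text):
    """Return the word count and TF x IDF weights for one block's text."""
    tokens = [crude_stem(t) for t in re.findall(r"[a-z]+", block_text.lower())]
    counts = Counter(tokens)
    # Assumption: terms missing from DOC_FREQ are treated as occurring in one document.
    tfidf = {
        term: tf * math.log(NUM_DOCS / DOC_FREQ.get(term, 1))
        for term, tf in counts.items()
    }
    return {"num_words": len(tokens), "tfidf": tfidf}
```

Usage: `lexical_features("Prices and the prices")` counts four words and weights the stem `price` well above the very common `the`. The remaining features (POS tag distribution, image alt text, hyperlink type) would need a tagger and an HTML parser and are left out of this sketch.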