13 Nov 2004
WIDM 04: Lee et al. Co-training Web Block Classification
Lexical and Stylistic Co-training
1. Split the document into blocks using the DOM tree
–Nontrivial (blocks can overlap; visual segments differ from DOM subtrees)
2. Co-train
–Learner 1 – Stylistic learner
•Spatial and structural relationships
•External relationship to other blocks
–Learner 2 – Lexical learner
•POS and link-related features
•Internal classification irrespective of other blocks
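
The two-learner loop above can be sketched roughly as follows. This is a minimal illustration of generic co-training over two feature views, not the classifiers or features used in the paper: `CentroidLearner`, `co_train`, the toy feature vectors, and the labels `"nav"` and `"content"` are all hypothetical stand-ins. Each block is assumed to arrive already split out of the DOM tree, with one stylistic vector and one lexical vector.

```python
class CentroidLearner:
    """Toy nearest-centroid classifier over one feature view (a stand-in learner)."""

    def __init__(self):
        self.centroids = {}

    def fit(self, X, y):
        sums, counts = {}, {}
        for x, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(x))
            for i, v in enumerate(x):
                acc[i] += v
            counts[label] = counts.get(label, 0) + 1
        self.centroids = {l: [v / counts[l] for v in s] for l, s in sums.items()}
        return self

    def predict_with_margin(self, x):
        # Squared distance to each class centroid; the margin is the gap
        # between the nearest and second-nearest centroid (a crude confidence).
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(x, c)), label)
            for label, c in self.centroids.items()
        )
        margin = dists[1][0] - dists[0][0] if len(dists) > 1 else 0.0
        return dists[0][1], margin


def co_train(labeled, unlabeled, rounds=5, per_round=1):
    """labeled: (stylistic_vec, lexical_vec, label) tuples; unlabeled: (stylistic_vec, lexical_vec)."""
    labeled, pool = list(labeled), list(unlabeled)
    learners = [CentroidLearner(), CentroidLearner()]  # view 0: stylistic, view 1: lexical

    def fit_all():
        for view, learner in enumerate(learners):
            learner.fit([ex[view] for ex in labeled], [ex[2] for ex in labeled])

    fit_all()
    for _ in range(rounds):
        if not pool:
            break
        for view, learner in enumerate(learners):
            # Each learner labels the pool blocks it is most confident about,
            # and those blocks join the shared training set for both learners.
            scored = sorted(
                ((learner.predict_with_margin(ex[view]), i) for i, ex in enumerate(pool)),
                key=lambda t: -t[0][1],
            )
            chosen = scored[:per_round]
            for (label, _), i in chosen:
                labeled.append((pool[i][0], pool[i][1], label))
            taken = {i for _, i in chosen}
            pool = [ex for i, ex in enumerate(pool) if i not in taken]
        fit_all()
    return learners


# Hypothetical toy data: stylistic view = [relative depth, x position],
# lexical view = [link ratio, noun ratio].
labeled = [([0.1, 0.0], [0.9, 0.1], "nav"), ([0.8, 0.5], [0.1, 0.7], "content")]
unlabeled = [
    ([0.15, 0.05], [0.85, 0.15]),
    ([0.75, 0.55], [0.05, 0.80]),
    ([0.20, 0.10], [0.80, 0.20]),
    ([0.90, 0.40], [0.20, 0.60]),
]
stylistic, lexical = co_train(labeled, unlabeled)
```

The point of the design is that each learner sees only its own view, so a block that is easy for the stylistic learner can supply a training label for the lexical learner, and vice versa.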