13 Nov 2004
WIDM 04: Lee et al. Co-training Web Block Classification
Approaches to classification
•Earliest systems used hard-coded wrappers
–Content-focused (e.g., largest table cell)
–Didn’t scale
•Now, multi-class classification using mixed features: lexical, structural and spatial
–HTML Path structure
•(Yang et al. 03, Shih and Karger 04)
–Spatial random walk using browser-exposed DOM tree 
•Allows precise layout information (Xin and Lee 04)
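
To make the "largest table cell" wrapper heuristic above concrete, here is a minimal sketch in Python using only the standard library. It is an illustration of the general idea, not the code of any system cited on this slide. The names `LargestCellFinder` and `largest_table_cell` are hypothetical, and the sketch assumes every `<td>` is explicitly closed and measures cell size by text length.

```python
from html.parser import HTMLParser


class LargestCellFinder(HTMLParser):
    """Track the text of each <td> cell and remember the longest one."""

    def __init__(self):
        super().__init__()
        self._stack = []   # one text buffer per currently open <td>
        self.best = ""     # longest cell text seen so far

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._stack.append([])

    def handle_data(self, data):
        # Assumption: text counts only toward the innermost open cell,
        # so an outer layout cell does not absorb its nested tables.
        if self._stack:
            self._stack[-1].append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._stack:
            # Collapse runs of whitespace before comparing lengths.
            text = " ".join("".join(self._stack.pop()).split())
            if len(text) > len(self.best):
                self.best = text


def largest_table_cell(html):
    """Return the text of the table cell with the most text, or ""."""
    finder = LargestCellFinder()
    finder.feed(html)
    finder.close()
    return finder.best
```

A rule like this is tied to one page layout: it returns the wrong block as soon as a navigation or advertising cell holds more text than the main content, which is one way such wrappers fail to scale.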