13 Nov 2004
WIDM 04: Lee et al. Co-training Web Block Classification
Target Classification
•News stories
–Domain-specific fine-grained classes (denoted by *)
–Needs XHTML / CSS support
•Blocks can have multiple classes
–Multi-class forced to single
–Assessor picks most prominent class
•Resulting corpus has skewed distribution
–50 sites from Google News
–Not well-formed: Tidy first
•Main Content
•Site Navigation
•Search
•Supporting content
•Links supporting content
•Image supporting content
•Sub headers
•Site image
•Advertisements*
•Links to related articles*
•Newsletter / alert links*
•Date or Time of article*
•Source Station (country of report)*
•Reporter Name*
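
The class list and the "multi-class forced to single" rule above can be sketched in a few lines of Python. This is only an illustration, not the authors' tooling: the names `BLOCK_CLASSES`, `force_single_class`, and `class_distribution` are hypothetical, and the sketch assumes the assessor supplies each block's labels ordered from most to least prominent.

```python
from collections import Counter

# Class names copied from the slide; True marks the domain-specific (*) classes.
BLOCK_CLASSES = {
    "Main Content": False,
    "Site Navigation": False,
    "Search": False,
    "Supporting content": False,
    "Links supporting content": False,
    "Image supporting content": False,
    "Sub headers": False,
    "Site image": False,
    "Advertisements": True,
    "Links to related articles": True,
    "Newsletter / alert links": True,
    "Date or Time of article": True,
    "Source Station (country of report)": True,
    "Reporter Name": True,
}


def force_single_class(ranked_labels):
    """Hypothetical helper: keep only the label ranked most prominent.

    Assumes ranked_labels is ordered by the assessor, most prominent first.
    """
    if not ranked_labels:
        raise ValueError("block has no label")
    top = ranked_labels[0]
    if top not in BLOCK_CLASSES:
        raise KeyError(f"unknown class: {top}")
    return top


def class_distribution(single_labels):
    """Count blocks per class, e.g. to inspect how skewed the corpus is."""
    return Counter(single_labels)
```

For example, a block the assessor tagged as both "Main Content" and "Reporter Name", in that order, would be stored as "Main Content" only. Counting the forced labels over the whole corpus is one simple way to see the skewed distribution the slide mentions.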