1
Stylistic and lexical co-training for webpage block classification
  • Chee How Lee, Min-Yen Kan and Sandra Lai


  • National University of Singapore
  • kanmy@comp.nus.edu.sg


2
Web page blocks
  • What’s a web page block?


  • Parts of a web page with different functions
    • e.g., main content, site navigation, advertisements
    • Distinguishing these is important for many tasks


  • Different names for the same thing
    • fragments, elements, blocks, tiles
3
Uses of block classification
  • Extract content to mobile device
    • “Just the facts, ma’am”
    • Summarization → better (whole) page classification


  • Advertisement blocking


  • Fragment versioning


  • Distinguish navigation from content
    • better link-based ranking
4
Approaches to classification
  • Earliest systems used hard-coded wrappers
    • Content-focused (e.g., largest table cell)
    • Didn’t scale

  • Now, multi-class classification using mixed features: lexical, structural and spatial.
    • HTML Path structure
      • (Yang et al. 03, Shih and Karger 04)
    • Spatial random walk using browser-exposed DOM tree
      • Allows precise layout information (Xin and Lee 04)
5
Which approach to use
  • An obvious approach is to build a supervised classifier
    • Train on labeled examples (f1,f2,…,fi,…,fn, C)
    • Test by distilling features (f1,f2,…,fi,…,fn) → C = ? (sketch below)
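  A minimal sketch of this supervised baseline, not the paper's actual pipeline: the feature values, class names, and the scikit-learn classifier choice are illustrative assumptions.

    # Train on labeled block feature vectors (f1, ..., fn, C),
    # then predict C from an unseen block's features.
    from sklearn.tree import DecisionTreeClassifier

    X_train = [[0.80, 120, 3],   # hypothetical block features
               [0.10, 15, 40],
               [0.05, 8, 1]]
    y_train = ["content", "navigation", "advertisement"]

    clf = DecisionTreeClassifier().fit(X_train, y_train)

    # Distill the same features from a new block and classify it.
    print(clf.predict([[0.70, 95, 2]]))   # e.g. ['content']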

  • Labeled training data is costly → need to use unlabeled data
  • The feature sets are largely orthogonal
  • → Try co-training!
6
Co-training (Blum and Mitchell, 1998)
  • Two learners with separate views of the same problem
  • B&M’s canonical example: classifying web pages using two views
    • Link structure
    • Text on the page
7
Co-training (cont’d)
  • Use one classifier to help the other
    • e.g., pages the link classifier is confident about are passed as correct answers to the text-based classifier (see the sketch below).


  • Assumes the individual classifiers are reasonably accurate to start with
    • Otherwise the noise level escalates
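  A minimal sketch of a B&M-style co-training loop, following the description above. The classifier interface (fit / predict_with_confidence), the confidence threshold, and pooling both views' confident predictions are illustrative assumptions, not PARCELS's actual implementation.

    def co_train(clf_a, clf_b, labeled, unlabeled,
                 rounds=10, threshold=0.9):
        for _ in range(rounds):
            clf_a.fit(labeled)            # each view retrains on the
            clf_b.fit(labeled)            # shared labeled pool
            confident = []
            for example in unlabeled:
                for clf in (clf_a, clf_b):
                    label, conf = clf.predict_with_confidence(example)
                    if conf >= threshold:
                        # One view's confident guess becomes a (noisy)
                        # training label for the other view's next round.
                        confident.append((example, label))
                        break
            labeled = labeled + confident
            unlabeled = [e for e in unlabeled
                         if not any(e is ex for ex, _ in confident)]
        return clf_a, clf_b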
8
Architecture
  • B&M co-training handles only binary classification; our architecture adapts it to multiple classes
  • Also handles the skewed class distribution
9
PARCELS
  • PARser for Content Extraction & Layout Structure


  • Goals:
    • Coarse-grained classification
    • Fine-grained information extraction
    • Work on a variety of sources
    • Open-source, reference implementation


10
Target Classification
  • News stories
    • Domain-specific fine-grained classes (denoted by *)
    • Needs XHTML / CSS support


  • Blocks can have multiple classes
    • Multi-class labels forced to a single class
    • Assessor picks most prominent class


  • Resulting corpus has a skewed class distribution
    • 50 sites from Google News
    • Pages not well-formed: run Tidy first
  • Target classes:
    • Main Content
    • Site Navigation
    • Search
    • Supporting content
    • Links supporting content
    • Image supporting content
    • Sub headers
    • Site image
    • Advertisements*
    • Links to related articles*
    • Newsletter / alert links*
    • Date or Time of article*
    • Source Station (country of report)*
    • Reporter Name*
11
Lexical and Stylistic Co-training
  • Split the document into blocks using the DOM tree (see the sketch at the end of this slide)
    • Nontrivial (blocks overlap; visual segments differ from the DOM structure)


  • Co-train
    • Learner 1 – Stylistic learner
      • Spatial and structural relationships
      • External view: relates each block to other blocks


    • Learner 2 – Lexical learner
      • POS and link-related features
      • Internal view: classifies block content irrespective of other blocks
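  A rough sketch of DOM-based block splitting, assuming BeautifulSoup; the block-level tag set, the leaf-block heuristic, and the page.html file name are assumptions, and as noted above the real segmentation problem is nontrivial.

    from bs4 import BeautifulSoup

    BLOCK_TAGS = {"table", "td", "div", "p", "ul"}

    def split_blocks(node, blocks):
        # Direct children that are themselves block-level containers.
        children = [c for c in node.find_all(recursive=False)
                    if c.name in BLOCK_TAGS]
        if not children:
            blocks.append(node)          # leaf: treat as one block
        else:
            for child in children:       # recurse into nested blocks
                split_blocks(child, blocks)
        return blocks

    soup = BeautifulSoup(open("page.html").read(), "html.parser")
    blocks = split_blocks(soup.body, [])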
12
Stylistic Features
  • Layout: guessed from first-level DOM nodes
    • Linear
    • <Table>: use reading order, cell-type propagation
    • XHTML / CSS (e.g., <DIV>): translate relative to absolute positioning, model depth


  • Font (including CSS): relative features


  • Image size
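  An illustrative extraction of two of these stylistic features; the Block/Img containers and feature names are assumptions for the sketch, and the layout modeling (reading order, absolute positioning) is omitted.

    from dataclasses import dataclass, field

    @dataclass
    class Img:
        width: int
        height: int

    @dataclass
    class Block:
        font_size: float
        images: list = field(default_factory=list)

    def stylistic_features(block, dominant_font_size):
        return {
            # Font size relative to the page's dominant font.
            "rel_font_size": block.font_size / dominant_font_size,
            # Total image area in the block.
            "image_area": sum(i.width * i.height for i in block.images),
        }

    print(stylistic_features(Block(18.0, [Img(468, 60)]), 12.0))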



13
Lexical Features
  • For each block:
    • POS tag distribution in text


    • Stemmed tokens weighted by TF×IDF
      • IDF from Stanford’s WebBase


    • Number of words


    • Alt text of images


    • Hyperlink type (e.g., embedded image, text, mailto)
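  A toy sketch of two of these lexical features (word count and TF×IDF weights); stemming, POS-tag distributions, alt text, and link-type features are omitted, and the document-frequency table, which the paper derives from Stanford's WebBase, is a hand-made assumption here.

    import math, re
    from collections import Counter

    def lexical_features(text, doc_freq, num_docs):
        tokens = re.findall(r"[a-z']+", text.lower())
        return {
            "num_words": len(tokens),
            # TF x IDF weight per token (stemming omitted here).
            "tfidf": {t: n * math.log(num_docs / (1 + doc_freq.get(t, 0)))
                      for t, n in Counter(tokens).items()},
        }

    print(lexical_features("Reuters reports from Singapore today",
                           {"reports": 50000}, 1000000))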

14
Evaluations
  • Adapted co-training:
    • Sample balancing: preserve the class ratio among the noisily labeled examples (sketch below); performance is poor without it
    • Replace unlabeled data at each round
  • Use BoosTexter: handles word features easily
  • Five-fold cross-validation
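  A minimal sketch of the sample-balancing step under stated assumptions: how the per-class quota is computed and how excess examples are dropped is guessed for illustration, not taken from the paper.

    import random

    def balance_sample(newly_labeled, class_ratio):
        # Keep confident predictions in proportion to the labeled
        # set's class ratio, so majority classes don't swamp the pool.
        by_class = {}
        for example, label in newly_labeled:
            by_class.setdefault(label, []).append((example, label))
        total = len(newly_labeled)
        kept = []
        for label, examples in by_class.items():
            quota = int(total * class_ratio.get(label, 0.0))
            kept += random.sample(examples, min(quota, len(examples)))
        return kept

    pool = [("b1", "content"), ("b2", "content"), ("b3", "nav")]
    print(balance_sample(pool, {"content": 0.5, "nav": 0.5}))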


  • General performance?


  • Specific performance on:
    • Fine-grained classification?
    • XHTML / DIV pages?
    • Others’ tasks?
15
General performance
  • Statistically significant improvement
  • Improvement on large classes at the expense of minority classes
    • Despite sample balancing
  • No fine-grained classes detected
16
XHTML / DIV Evaluation
  • Smaller dataset
    • 1/5 the size; limited sites to sample from
    • Both annotated and unannotated data sets were smaller
    • As a result, fewer co-training iterations
  • Single-view model still seems to do better
17
Coarse-grained model
  • Slightly different splitting model than earlier work
  • Fewer training examples
  • No significant gain from co-training, but comparable to other work (19.5% error vs. 14-18% error)
18
Conclusion
  • Co-training model for web block classification
  • Achieves a 28.5% reduction in error on the main task
  • However, fails at:
    • Detecting fine-grained classes
      • → Exploit templates, IE methods, path similarity and context


    • Likely needs more unlabeled data
      • → Re-run using more experimental data


    • Dependent on the learning model
      • → Looking to change learning package
19
Question time!
  • Any questions?




  • http://parcels.sourceforge.net/


  • Available in late November 2004
  • Annotator and evaluation tools provided
  • Handles XHTML and DIV / CSS
  • Open source, GPL’ed code