1
Survey on Metadata Extraction and Indexing
  • Chen Xi
  • Wang Xiaohang
2
Agenda
  • Extracting metadata


  • Evaluating the metadata extracted as index terms


3
Automatic Document Metadata Extraction using Support Vector Machine
  • Hui Han, Lee Giles, et al.


4
Metadata elements
  • Taggable Metadata
    • Elements: title, author, affiliation, etc.
    • Method: directly identified
  • Non-taggable Metadata
    • Elements: subject, coverage, genre, etc.
    • Method: classified into given topics via text categorization (TC)
5
Single-class vs. Multi-class lines
  • Single-class line
    • Each line with a single element
    • e.g. <title> Automatic Document Metadata Extraction using Support Vector Machine </title>
  • Multi-class line
    • Each line with several elements
    • e.g. <author> Hui Han, C. Lee Giles, Eren Manavoglu </author>
6
Two-Step Extraction
  • Line classification
    • Line → single-class / multi-class


  • Chunk identification
    • Multi-class line → multiple elements

7
Line Classification – step 1
  • Independent classification
    • SVM
    • Feature selection:
      • Word-specific
        • “X. Wang” → “X.” is a single capital; “Wang” is a capitalized non-dictionary word
      • Line-specific
        • no. of words, line number, percentage of non-dictionary words etc.
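The word- and line-specific features above can be sketched as follows. This is a minimal illustration, not the paper's implementation; the toy `DICTIONARY` and the exact feature set are assumptions for demonstration.

```python
import re

# Toy dictionary for illustration only; the paper would use a real word list.
DICTIONARY = {"automatic", "document", "metadata", "extraction", "using",
              "support", "vector", "machine"}

def word_features(word):
    """Word-specific features, e.g. 'X.' is a single capital and
    'Wang' is a capitalized non-dictionary word."""
    return {
        "single_cap": bool(re.fullmatch(r"[A-Z]\.?", word)),
        "cap_non_dict": word[0].isupper() and word.lower() not in DICTIONARY,
    }

def line_features(line, line_number):
    """Line-specific features: word count, position of the line in the
    header, and percentage of non-dictionary words."""
    words = line.split()
    non_dict = [w for w in words if w.lower().strip(".,") not in DICTIONARY]
    return {
        "num_words": len(words),
        "line_number": line_number,
        "pct_non_dict": len(non_dict) / len(words) if words else 0.0,
    }
```

Vectors like these would then be fed to the SVM line classifier.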



8
Line Classification – step 2
  • Contextual classification
    • Predictions from step 1 provide context
      • e.g. title lines, author line, affiliation line…
    • Context among lines can be used as features
      • e.g. title lines are often followed by an author line
    • Contextual features are combined with the other features, and the classifier is trained iteratively
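The iterative contextual step can be sketched as a fixed-point loop: each line's features are extended with its neighbours' predicted labels, and classification is repeated until the labels stabilize. The `classify` rule below is a toy stand-in for the trained SVM, and the feature names are assumptions for illustration.

```python
def classify(features, context):
    # Toy stand-in for an SVM decision: a line with names right after
    # a "title" line is labeled "author"; otherwise keep the
    # independent (step-1) guess.
    if context.get("prev") == "title" and features["has_names"]:
        return "author"
    return features["independent_guess"]

def contextual_classification(lines, max_iters=5):
    """Iteratively reclassify lines using neighbours' labels as context."""
    labels = [f["independent_guess"] for f in lines]  # step-1 output
    for _ in range(max_iters):
        new_labels = []
        for i, feats in enumerate(lines):
            context = {"prev": labels[i - 1] if i > 0 else None}
            new_labels.append(classify(feats, context))
        if new_labels == labels:  # fixed point reached, stop early
            break
        labels = new_labels
    return labels
```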




9
Chunk identification
  • Chunking an N-class line = finding N-1 chunk boundaries
  • Boundary candidates
    • Space
      • authors - “Hui Han C. Lee Giles”
    • Punctuation
      • affiliation & address – “NUS, 10 Kent Ridge Crescent, Singapore 119260”





10
Summary
  • Classification-based method using SVM
  • Independent and contextual features


11
Automatic Identification and Organization of Index Terms for Interactive Browsing
  • Nina Wacholder, David K. Evans, Judith L. Klavans
12
Problems discussed
  • Automatically identify phrases as index terms in a dynamic text browser (DTB)



  • What constitutes useful index terms?
    • Standard IR metrics do not apply to this task.

13
Index Identification
                    ---- Head sorting method
  • Head: the syntactically and semantically most important word of a phrase.
    • “Filter” is the head of “coffee filter”, “oil filter”, etc.
  • Simplex NPs (SNPs) are identified instead of Complex NPs (CNPs)
    • “A form” and “cancer causing asbestos” are two SNPs
    • “A form of cancer causing asbestos” is a CNP.
  • NPs are sorted by head according to their significance (frequency)
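Head sorting can be sketched as follows. Taking the last word of an SNP as its head is a simplifying assumption (the paper uses real syntactic analysis); NPs are grouped by head and heads are ranked by how many NPs they govern.

```python
from collections import defaultdict

def head_sort(simplex_nps):
    """Group simplex NPs by head and rank heads by descending frequency."""
    by_head = defaultdict(list)
    for np in simplex_nps:
        head = np.split()[-1].lower()  # crude head = last word (assumption)
        by_head[head].append(np)
    # Heads governing more NPs are treated as more significant.
    return sorted(by_head.items(), key=lambda kv: -len(kv[1]))

# e.g. "filter" heads three NPs and ranks first:
ranking = head_sort(["coffee filter", "oil filter", "air filter", "a form"])
```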
14
What constitutes a useful index
                   ---- how to evaluate
  • Precision and Recall do not apply to index terms.
  • Three basic criteria:
    • Coherence
    • Usefulness of index terms
    • Thoroughness of coverage of document content



15
Coherence - 3 ratings
  • Coherent: coherent and a well-formed NP
  • Incoherent: neither coherent nor an NP
  • Intermediate: coherent, but not a well-formed NP
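The three-way rating can be sketched as a small decision function. The two boolean inputs would come from human judges; treating "intermediate" as coherent-but-not-an-NP is an interpretation of the slide, not a claim from the paper.

```python
def coherence_rating(is_coherent, is_np):
    """Map judge decisions to one of the three coherence ratings."""
    if is_coherent and is_np:
        return "coherent"
    if not is_coherent and not is_np:
        return "incoherent"
    return "intermediate"  # coherent but not a well-formed NP (assumption)
```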


16
Usefulness of index terms
  • Qualitative judgment
    • On a scale from 1–5 (1 = high quality, 5 = junk)


  • Terms identified by 3 domain-independent techniques:
    • Key words
    • Technical terms
    • Head sorted NPs
17
Usefulness of index terms
18
Thoroughness of coverage of document
  • Evaluate the number of terms identified relative to the size of the text.
    • (# unique NPs) / (# words in a document) is around 0.27
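The thoroughness measure above is a simple ratio; a sketch, assuming NP extraction has already been done elsewhere:

```python
def thoroughness(unique_nps, document_text):
    """Ratio of unique NPs identified to words in the document.
    The paper reports values around 0.27 for its method."""
    words = document_text.split()
    return len(set(unique_nps)) / len(words) if words else 0.0
```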
19
Summary
  • A new way to evaluate index terms besides precision and recall.
    • Qualitative + Quantitative metrics


    • Open question: what is a good thoroughness ratio?


20
Issues
  • Extraction
    • Taggable Metadata – Information Extraction
    • Non-taggable Metadata – Text Categorization / Summarization
  • Evaluation
    • How to make use of modifiers.
21
Thank You!!   =)
  • Q&A time