Survey on Metadata Extraction and Indexing
Chen Xi
Wang Xiaohang

Agenda
Extracting metadata
Evaluating the metadata extracted as index terms

Automatic Document Metadata Extraction using Support Vector Machine
Hui Han, Lee Giles, et al.

Metadata elements
Taggable Metadata
Elements: title, author, affiliation, etc.
Method: directly identified
Non-taggable Metadata
Elements: subject, coverage, genre, etc.
Method: classified into given topics via text categorization (TC)

Single-class vs. Multi-class lines
Single-class line
Each line with a single element
e.g. <title> Automatic Document Metadata Extraction using Support Vector Machine </title>
Multi-class line
Each line with several elements
e.g. <author> Hui Han, C. Lee Giles, Ere Manavoglu</author>

Two-Step Extraction
Line classification
Line → single class / multi-class
Chunk identification
Multi-class line → multiple elements

Line Classification – step 1
Independent classification
SVM
Feature selection:
Word-specific
“X. Wang” → single capital + capitalized non-dictionary word
Line-specific
no. of words, line number, percentage of non-dictionary words etc.
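
The two feature families above can be sketched as follows. This is a minimal illustration, not the authors' feature set: the toy dictionary, the label names, and the helpers `word_feature` and `line_features` are all assumptions made for the example. In the paper's setting, vectors built from such features would be fed to the SVM.

```python
# Hypothetical stand-in for a real dictionary; the source does not specify one.
TOY_DICTIONARY = {"automatic", "document", "metadata", "extraction", "using",
                  "support", "vector", "machine"}

def word_feature(token, dictionary=TOY_DICTIONARY):
    """Map a token to a coarse word-specific feature label (assumed names)."""
    word = token.strip(".,;:")
    if len(word) == 1 and word.isupper():
        return "single_capital"
    if word[:1].isupper() and word.lower() not in dictionary:
        return "capital_non_dictionary_word"
    if word.lower() in dictionary:
        return "dictionary_word"
    return "other"

def line_features(line, line_number, dictionary=TOY_DICTIONARY):
    """Collect the line-specific features listed on the slide."""
    tokens = line.split()
    non_dict = sum(1 for t in tokens
                   if t.strip(".,;:").lower() not in dictionary)
    return {
        "word_labels": [word_feature(t, dictionary) for t in tokens],
        "num_words": len(tokens),
        "line_number": line_number,
        "pct_non_dictionary": non_dict / len(tokens) if tokens else 0.0,
    }
```

For example, `line_features("X. Wang", 2)` labels the two tokens as a single capital and a capitalized non-dictionary word, matching the slide's example.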

Line Classification – step 2
Contextual classification
Predictions from step 1 provide context
e.g. title lines, author line, affiliation line…
Context among lines can be used as features
e.g. title lines followed by author line
Combining contextual features and other features, then trained iteratively
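
A rough sketch of how neighboring predictions can be turned into features and applied repeatedly. It simplifies the slide in one important way: the slide describes iterative training, while this example only re-labels lines with a fixed classifier. The `classify` callable is a hypothetical stand-in for the trained SVM, and the sentinel labels `<start>` and `<end>` are assumptions.

```python
def context_features(labels, i):
    """Neighbor labels for line i, with sentinels at the document edges."""
    prev_label = labels[i - 1] if i > 0 else "<start>"
    next_label = labels[i + 1] if i + 1 < len(labels) else "<end>"
    return {"prev_label": prev_label, "next_label": next_label}

def contextual_pass(base_features, labels, classify, max_iters=5):
    """Re-label lines using neighbor predictions until labels stop changing.

    base_features: one dict of independent features per line
    labels:        step-1 predictions, one per line
    classify:      hypothetical callable mapping a feature dict to a label
    """
    for _ in range(max_iters):
        new_labels = [
            classify({**base_features[i], **context_features(labels, i)})
            for i in range(len(labels))
        ]
        if new_labels == labels:  # converged: context no longer changes anything
            break
        labels = new_labels
    return labels
```

With a classifier that has learned "title lines are followed by an author line", an uncertain line directly after a predicted title would be re-labeled as author on the first pass.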

Chunk identification
Chunking an N-class line = finding N-1 chunk boundaries
Boundary candidates
Space
authors - “Hui Han C. Lee Giles”
Punctuation
affiliation & address - “NUS, 10 Kent Ridge Crescent, Singapore 119260”
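
To make the "N-1 boundaries" idea concrete, the sketch below enumerates every way to split a line into N chunks at candidate boundaries (spaces, or only spaces that follow punctuation). It covers candidate generation only; how the best chunking is chosen is not shown on the slide and is left out. The function names and the `punctuation_only` flag are assumptions for illustration.

```python
from itertools import combinations

def split_at(tokens, boundaries):
    """Join tokens into chunks, cutting before each boundary index."""
    chunks, start = [], 0
    for b in boundaries:
        chunks.append(" ".join(tokens[start:b]))
        start = b
    chunks.append(" ".join(tokens[start:]))
    return chunks

def candidate_chunkings(line, n_chunks, punctuation_only=False):
    """All ways to place n_chunks - 1 boundaries among the candidates."""
    tokens = line.split()
    positions = list(range(1, len(tokens)))  # every space is a candidate
    if punctuation_only:
        # keep only spaces that directly follow a comma or semicolon
        positions = [p for p in positions if tokens[p - 1].endswith((",", ";"))]
    return [split_at(tokens, combo)
            for combo in combinations(positions, n_chunks - 1)]
```

For the author line “Hui Han C. Lee Giles” with two authors, the four space candidates yield four chunkings, one of which is the correct “Hui Han” / “C. Lee Giles”. For the address line with `punctuation_only=True` and three chunks, the two commas leave a single candidate chunking.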

Summary
Classification-based method using SVM
Independent and contextual features

Automatic Identification and Organization of Index Terms for Interactive Browsing
Nina Wacholder, David K. Evans, Judith L. Klavans

Problems discussed
Automatically identify phrases as index terms in a dynamic text browser (DTB)
What constitutes a useful index term
Standard IR metrics do not apply to this task

Index Identification – head sorting method
Head: most important word syntactically and semantically.
“Filter” is the head of “coffee filter”, “oil filter”, etc.
Simplex NPs (SNPs) are identified instead of complex NPs (CNPs)
“A form” and “cancer causing asbestos” are two SNPs
“A form of cancer causing asbestos” is a CNP
NPs are grouped by head, and heads are ranked by significance (frequency)
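
A minimal sketch of head sorting, assuming the simplex NPs have already been extracted. Taking the last word of an English simplex NP as its head is a common heuristic adopted here for illustration; it is not necessarily how the authors' system finds heads. Frequency is counted as the number of NPs sharing a head.

```python
from collections import defaultdict

def head_of(noun_phrase):
    """Assumed heuristic: the head of a simplex NP is its last word."""
    return noun_phrase.split()[-1].lower()

def head_sort(noun_phrases):
    """Group NPs by head, then order heads by frequency (ties alphabetical)."""
    groups = defaultdict(list)
    for phrase in noun_phrases:
        groups[head_of(phrase)].append(phrase)
    return sorted(groups.items(), key=lambda item: (-len(item[1]), item[0]))
```

Calling `head_sort(["coffee filter", "oil filter", "a form"])` places the head “filter” first with both of its NPs, followed by “form”.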

What constitutes a useful index – how to evaluate
Precision and recall do not apply to index terms.
Three basic criteria:
Coherence
Usefulness of index terms
Thoroughness of coverage of document content

Coherence - 3 ratings
Coherent: coherent and an NP
Incoherent: neither coherent nor an NP
Intermediate: coherent

Usefulness of index terms
Qualitative judgment
On a scale from 1 to 5 (1 = high quality, 5 = junk)
Terms identified by 3 domain-independent techniques:
Key words
Technical terms
Head sorted NPs

Thoroughness of coverage of document
Evaluated as the number of terms identified relative to the size of the text.
(# unique NPs) / (# words in a document) is around 0.27
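
The ratio above is simple to compute once the NPs are extracted. The sketch below assumes whitespace tokenization and case-insensitive deduplication of NPs; both are choices made for this example, not details taken from the paper.

```python
def thoroughness(noun_phrases, text):
    """Ratio of unique NPs to the number of words in the document."""
    n_words = len(text.split())  # assumption: whitespace-delimited words
    if n_words == 0:
        return 0.0
    unique_nps = {phrase.lower() for phrase in noun_phrases}
    return len(unique_nps) / n_words
```

A document of 1,000 words with 270 unique NPs would give the 0.27 figure quoted on the slide.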

Summary
A new way to evaluate index terms besides precision and recall.
Qualitative + Quantitative metrics
Open question: what is a good thoroughness ratio?

Issues
Extraction
Taggable metadata – information extraction
Non-taggable metadata – text categorization/summarization
Evaluation
How to make use of modifiers.

Thank You!!   =)
Q&A time