Survey on Metadata Extraction and Indexing
Agenda
- Extracting metadata
- Evaluating the extracted metadata as index terms

Automatic Document Metadata Extraction using Support Vector Machines
Hui Han, C. Lee Giles, et al.

Metadata Elements
- Taggable metadata
  - Elements: title, author, affiliation, etc.
  - Method: directly identified
- Non-taggable metadata
  - Elements: subject, coverage, genre, etc.
  - Method: classified into given topics via text categorization (TC)

Single-Class vs. Multi-Class Lines
- Single-class line
  - Each line carries a single metadata element
  - e.g. <title>Automatic Document Metadata Extraction using Support Vector Machines</title>
- Multi-class line
  - Each line carries several elements
  - e.g. <author>Hui Han, C. Lee Giles, Eren Manavoglu</author>

Two-Step Extraction
- Step 1: Line classification
  - Label each line as single-class or multi-class
- Step 2: Chunk identification
  - Split a multi-class line into its multiple elements

Line Classification – Step 1
- Independent classification
  - Classifier: SVM
- Feature selection
  - Word-specific features
    - e.g. "X. Wang": "X." is a single capital letter, "Wang" is a capitalized non-dictionary word
  - Line-specific features
    - e.g. number of words, line number, percentage of non-dictionary words, etc.

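The step-1 features above can be sketched in pure Python. This is an illustrative reconstruction, not the paper's exact feature set: the feature names, the helper functions, and the toy dictionary are all made up for the example.

```python
import string

# Toy stand-in for an English dictionary lookup (the paper would use a
# real word list).
DICTIONARY = {"automatic", "document", "metadata", "extraction",
              "using", "support", "vector", "machines"}

def word_features(word):
    """Word-specific features, e.g. for author-name words like 'X. Wang'."""
    bare = word.strip(string.punctuation)
    return {
        "single_cap": len(bare) == 1 and bare.isupper(),       # "X."
        "cap_non_dict": (bare[:1].isupper()
                         and bare.lower() not in DICTIONARY),  # "Wang"
    }

def line_features(line, line_number):
    """Line-specific features fed to the SVM alongside word features."""
    words = line.split()
    non_dict = [w for w in words
                if w.strip(string.punctuation).lower() not in DICTIONARY]
    return {
        "num_words": len(words),
        "line_number": line_number,
        "pct_non_dict": len(non_dict) / len(words) if words else 0.0,
    }
```

Each line's feature dictionary would then be vectorized and passed to an SVM for independent (per-line) classification.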
Line Classification – Step 2
- Contextual classification
  - Predictions from step 1 provide context
    - e.g. title lines, author lines, affiliation lines...
  - Context among lines can be used as features
    - e.g. title lines are typically followed by an author line
  - Contextual features are combined with the other features, and the classifier is trained iteratively

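The contextual step can be sketched as follows: the labels predicted for neighboring lines in the previous round become extra features, and classification is repeated until the labels stop changing. The rule-based "classifier" below is a toy stand-in for the paper's SVM, and all names here are illustrative.

```python
def context_features(labels, i, window=1):
    """Labels of the lines around line i (from the previous iteration)."""
    feats = {}
    for off in range(-window, window + 1):
        if off == 0:
            continue
        j = i + off
        feats[f"nbr{off:+d}"] = labels[j] if 0 <= j < len(labels) else "pad"
    return feats

def reclassify(labels):
    """Toy contextual rule: a line right after a 'title' line is 'author'."""
    new = list(labels)
    for i in range(len(labels)):
        ctx = context_features(labels, i)
        if ctx.get("nbr-1") == "title" and labels[i] == "unknown":
            new[i] = "author"
    return new

def iterate(labels):
    """Repeat contextual classification until predictions stabilize."""
    while True:
        new = reclassify(labels)
        if new == labels:
            return new
        labels = new
```

In the paper the iteration retrains/reapplies an SVM with the contextual features; here the fixed rule just demonstrates the feedback loop.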
Chunk Identification
- Chunking an N-class line = finding N-1 chunk boundaries
- Boundary candidates
  - Spaces
    - e.g. authors: "Hui Han C. Lee Giles"
  - Punctuation
    - e.g. affiliation & address: "NUS, 10 Kent Ridge Crescent, Singapore 119260"

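A minimal sketch of the candidate-generation side of chunking: every space or punctuation position in an N-class line is a potential boundary, and N-1 of them must be chosen. The function names are hypothetical, and the paper scores candidates with a classifier rather than splitting at fixed positions.

```python
import re

def boundary_candidates(line):
    """Positions where a chunk boundary could fall: after punctuation
    (with trailing space) or after a space."""
    return [m.end() for m in re.finditer(r"[,;]\s*|\s", line)]

def split_at(line, boundaries):
    """Cut the line at the chosen boundary positions."""
    cuts = [0] + sorted(boundaries) + [len(line)]
    return [line[a:b].strip(" ,;") for a, b in zip(cuts, cuts[1:])]
```

For the affiliation/address example, choosing the two comma positions as boundaries recovers the three elements "NUS", "10 Kent Ridge Crescent", and "Singapore 119260".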
Summary
- Classification-based extraction method using SVMs
- Uses both independent and contextual features

|
Automatic Identification
and Organization of Index Terms for Interactive Browsing
|
|
|
Nina Wacholder, David K.Evans, Judith
L.Klavans |
Problems Discussed
- Automatically identifying phrases to serve as index terms in a dynamic text browser (DTB)
- What constitutes useful index terms
  - Standard IR metrics do not apply to this task

Index Identification ---- Head Sorting Method
- Head: the most important word in a phrase, syntactically and semantically
  - e.g. "filter" is the head of "coffee filter", "oil filter", etc.
- Simplex NPs (SNPs) are identified instead of complex NPs (CNPs)
  - "a form" and "cancer causing asbestos" are two SNPs
  - "a form of cancer causing asbestos" is a CNP
- NPs are sorted by head according to their significance (frequency)

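Head sorting can be sketched with a common simplification: take the last word of an English simplex NP as its head, group NPs under their head, and rank heads by how many NPs share them. This is an illustrative approximation, not the paper's full head-identification procedure.

```python
from collections import defaultdict

def head(simplex_np):
    """Head of a simplex NP, approximated as its last word."""
    return simplex_np.split()[-1].lower()

def sort_by_head(simplex_nps):
    """Group NPs by head; most frequent heads (most significant) first."""
    groups = defaultdict(list)
    for np in simplex_nps:
        groups[head(np)].append(np)
    return sorted(groups.items(), key=lambda kv: len(kv[1]), reverse=True)
```

On the slide's example, "coffee filter" and "oil filter" would land in the same group under the head "filter".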
What Constitutes a Useful Index ---- How to Evaluate
- Precision and recall do not apply to index terms
- Three basic criteria:
  - Coherence
  - Usefulness of index terms
  - Thoroughness of coverage of document content

Coherence - 3 Ratings
- Coherent: coherent && an NP
- Incoherent: not coherent && not an NP
- Intermediate: coherent only (not an NP)

Usefulness of Index Terms
- Qualitative judgment
  - Rated on a scale from 1-5 (high quality - junk)
- Terms identified by 3 domain-independent techniques:
  - Keywords
  - Technical terms
  - Head-sorted NPs

Thoroughness of Coverage of Document Content
- Evaluated as the number of terms identified relative to the size of the text
- (# unique NPs) / (# words in a document) is around 0.27

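The thoroughness measure above reduces to a one-line ratio. A minimal sketch (the function name is made up, and the tiny document in the usage example is invented; the ~0.27 figure is what the paper reports on its own corpus):

```python
def thoroughness(unique_nps, document_text):
    """(# unique NPs identified) / (# words in the document)."""
    return len(set(unique_nps)) / len(document_text.split())
```

For example, two unique NPs found in a nine-word document give a ratio of 2/9 ≈ 0.22.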
Summary
- A new way to evaluate index terms besides precision and recall
- Qualitative + quantitative metrics
- Open question: what is a good thoroughness ratio?

Issues
- Extraction
  - Taggable metadata: information extraction
  - Non-taggable metadata: text categorization / summarization
- Evaluation
  - How to make use of modifiers

Thank You!! =)