Survey on Metadata Extraction and Indexing

Chen Xi

Wang Xiaohang

Agenda

Extracting metadata

Evaluating the metadata extracted as index terms

Automatic Document Metadata Extraction using Support Vector Machine

Hui Han, Lee Giles, et al.

Metadata elements

Taggable Metadata

Elements: title, author, affiliation, etc.

Method: directly identified

Non-taggable Metadata

Elements: subject, coverage, genre, etc.

Method: classified into given topics via text categorization (TC)

Single-class vs. multi-class lines

Single-class line

Each line with a single element

e.g. <title> Automatic Document Metadata Extraction using Support Vector Machine </title>

Multi-class line

Each line with several elements

e.g. <author> Hui Han, C. Lee Giles, Ere Manavoglu</author>

Two-Step Extraction

Line classification

Line → single class / multi-class

Chunk identification

Multi-class line → multiple elements

Line Classification – step 1

Independent classification

SVM

Feature selection:

Word-specific

“X. Wang”: “X.” is a single capital letter; “Wang” is a capitalized non-dictionary word

Line-specific

number of words, line number, percentage of non-dictionary words, etc.
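The word-specific and line-specific features listed above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' actual feature set: the toy DICTIONARY and the names word_class and line_features are hypothetical and introduced for this example.

```python
# Hypothetical toy dictionary; a real system would load a full word list.
DICTIONARY = {"automatic", "document", "metadata", "extraction", "using",
              "support", "vector", "machine", "university", "of"}

def word_class(token):
    """Map a token to a coarse word-specific class (illustrative classes)."""
    word = token.strip(".,;:")
    if len(word) == 1 and word.isupper():
        return "single_cap"        # e.g. the "X." in "X. Wang"
    if word[:1].isupper() and word.lower() not in DICTIONARY:
        return "cap_non_dict"      # e.g. "Wang"
    if word.lower() in DICTIONARY:
        return "dict_word"
    return "other"

def line_features(line, line_number):
    """Compute the line-specific features named on the slide."""
    tokens = line.split()
    non_dict = [t for t in tokens
                if t.strip(".,;:").lower() not in DICTIONARY]
    return {
        "num_words": len(tokens),
        "line_number": line_number,
        "pct_non_dict": len(non_dict) / len(tokens) if tokens else 0.0,
        "word_classes": [word_class(t) for t in tokens],
    }

features = line_features("X. Wang", line_number=2)
```

A feature dictionary like this would still need to be turned into a numeric vector before being passed to an SVM; that step is omitted here.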

Line Classification – step 2

Contextual classification

Predictions from step 1 provide context

e.g. title lines, author line, affiliation line…

Context among lines can be used as features

e.g. title lines followed by author line

Contextual features are combined with the other features, and the classifiers are trained iteratively
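A minimal sketch of the two-step idea, assuming the classifiers are available as plain callables. SVM training is omitted, so the loop below only shows how the labels predicted for neighbouring lines are fed back as features. All function names and the toy rules are hypothetical stand-ins for illustration.

```python
def contextual_features(predicted_labels, index, window=1):
    """Labels predicted for neighbouring lines, used as extra features."""
    feats = {}
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = index + offset
        inside = 0 <= j < len(predicted_labels)
        feats[f"label_at_{offset:+d}"] = predicted_labels[j] if inside else "none"
    return feats

def iterate_classification(lines, independent, contextual, rounds=3):
    labels = [independent(line) for line in lines]        # step 1
    for _ in range(rounds):                               # step 2, repeated
        new_labels = [contextual(line, contextual_features(labels, i))
                      for i, line in enumerate(lines)]
        if new_labels == labels:                          # labels are stable
            break
        labels = new_labels
    return labels

# Toy stand-ins for the trained classifiers (assumptions, not real models).
def toy_independent(line):
    return "author" if "," in line else "title"

def toy_contextual(line, ctx):
    label = toy_independent(line)
    if label == "author" and ctx["label_at_-1"] == "author":
        return "affiliation"   # a comma line right after an author line
    return label

lines = ["Automatic Document Metadata Extraction",
         "Hui Han, C. Lee Giles",
         "Penn State University, PA"]
labels = iterate_classification(lines, toy_independent, toy_contextual)
```

In this toy run the third line is first labelled as an author line and is corrected to an affiliation line once the label of the preceding line is available as context.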

Chunk identification

Chunking an N-class line = finding N-1 chunk boundaries

Boundary candidates

Space

authors - “Hui Han C. Lee Giles”

Punctuation

affiliation & address –

“NUS, 10 Kent Ridge Crescent, Singapore 119260”
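The boundary search can be illustrated with a short sketch that enumerates every way of choosing N-1 boundaries among the candidate separators. Choosing the best chunking among these candidates (for example by scoring each chunk with a classifier) is not shown. The function name candidate_chunkings and the separator patterns are assumptions made for this example.

```python
import re
from itertools import combinations

# Assumed separator patterns for the two kinds of boundary candidates.
PATTERNS = {"space": r"\s+", "punctuation": r"[,;]\s*"}

def candidate_chunkings(line, n_classes, kind):
    """Return every split of `line` into n_classes chunks at candidate boundaries."""
    separators = list(re.finditer(PATTERNS[kind], line))
    results = []
    for chosen in combinations(separators, n_classes - 1):
        chunks, start = [], 0
        for match in chosen:
            chunks.append(line[start:match.start()])
            start = match.end()
        chunks.append(line[start:])
        results.append(chunks)
    return results

address = candidate_chunkings(
    "NUS, 10 Kent Ridge Crescent, Singapore 119260", 2, "punctuation")
authors = candidate_chunkings("Hui Han C. Lee Giles", 2, "space")
```

For the address line there are two commas, so a 2-class line has two candidate chunkings. For the author line there are four spaces, so there are four candidates, one of which separates "Hui Han" from "C. Lee Giles".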

Summary

Classification-based method using SVM

Independent and contextual features

Automatic Identification and Organization of Index Terms for Interactive Browsing

Nina Wacholder, David K. Evans, Judith L. Klavans

Problems discussed

Automatically identify phrases as index terms in a dynamic text browser (DTB)

What constitutes a useful index term?

Standard IR metrics do not apply to this task.

Index Identification
---- Head sorting method

Head: most important word syntactically and semantically.

“Filter” is the head of “coffee filter”, “oil filter”, etc.

Simplex NPs (SNPs) are identified instead of complex NPs (CNPs)

“A form” and “cancer causing asbestos” are two SNPs.

“A form of cancer causing asbestos” is a CNP.

NPs are sorted by head, and heads are ranked by significance (frequency)
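Head sorting can be sketched in a few lines. The sketch assumes that the head of an English simplex NP is its last word and that significance is simply the number of NPs sharing a head; the measure used in the paper may differ. The names head_of and sort_by_head are hypothetical.

```python
from collections import defaultdict

def head_of(simplex_np):
    # Assumption: the head of an English simplex NP is its last word.
    return simplex_np.lower().split()[-1]

def sort_by_head(simplex_nps):
    """Group NPs under their head, most frequent heads first."""
    groups = defaultdict(list)
    for phrase in simplex_nps:
        groups[head_of(phrase)].append(phrase)
    # Ties between equally frequent heads are broken alphabetically.
    return sorted(groups.items(), key=lambda item: (-len(item[1]), item[0]))

ranked = sort_by_head(["coffee filter", "oil filter", "a form",
                       "cancer causing asbestos", "water filter"])
```

Grouping by head puts "coffee filter", "oil filter" and "water filter" next to each other under "filter", which is what lets a user browse related terms together.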

What constitutes useful index
---- how to evaluate

Precision and Recall do not apply to index terms.

Three basic criteria:

Coherence

Usefulness of index terms

Thoroughness of coverage of document content

Coherence - 3 ratings

Coherent: coherent and an NP

Incoherent: neither coherent nor an NP

Intermediate: coherent

Usefulness of index terms

Qualitative judgment

On a scale from 1 to 5 (1 = high quality, 5 = junk)

Terms identified by 3 domain-independent techniques:

Key words

Technical terms

Head sorted NPs

Usefulness of index terms

Thoroughness of coverage of document

Evaluate the number of terms identified relative to the size of the text.

(number of unique NPs) / (number of words in a document) is around 0.27
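The ratio above is straightforward to compute. The sketch below assumes whitespace tokenization for the word count and takes the NP list as given (NP extraction itself is not shown). The function name thoroughness and the sample text are made up for illustration.

```python
def thoroughness(noun_phrases, document_text):
    """Unique NPs divided by the number of words in the document."""
    n_words = len(document_text.split())
    return len(set(noun_phrases)) / n_words if n_words else 0.0

text = "the coffee filter and the oil filter need a replacement"
nps = ["the coffee filter", "the oil filter", "a replacement", "the oil filter"]
ratio = thoroughness(nps, text)
```

Here the repeated NP is counted once, giving 3 unique NPs over 10 words.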

Summary

A new way to evaluate index terms besides precision and recall.

Qualitative + Quantitative metrics

What is a good thoroughness ratio?

Issues

Extraction

Taggable Metadata: Information Extraction

Non-taggable Metadata: Text Categorization/Summarization

Evaluation

How to make use of modifiers.

Thank You!! =)

Q&A time

