Metadata creation and management*

Module 6 Min-Yen KAN

*Parts of this lecture come from Lilian Tang’s lecture material at the Univ. of Surrey

What is metadata, anyways?

Data about data

From the DB community

“Cataloging or indexing information that [information professionals] create to arrange, describe, and otherwise enhance access to an information object”

-- Gilliland-Swetland (1998)

“Data that describe attributes of a resource, characterize its relationships, support its discovery and effective use and exist in an electronic environment”

-- Vellucci (1998)

Outline

What is Metadata?

Some Frameworks

Packaging Metadata

Warwick Framework

Structural Metadata

Hidden Web Metadata

OAI

SDARTS

Crosswalking and Automated Extraction

Metadata formats

HTML Metadata

AACR2 / TEIH / MARC / Z39.50

Dublin Core

SingCORE

Types of metadata

Administrative

Structural

Descriptive

Intellectual Property Rights

Use

Metadata attributes

Data types: MIME

Multipurpose Internet Mail Extensions

(text/plain, image/jpeg, application/msword)

Simple format, pre-web

Can encode an unofficial type using the x- subtype prefix (e.g., audio/x-pn-realaudio)

Application type: an application is needed to handle this data

Wild success shows a simple system is best:

Good for adoption / authoring

Good for common denominator
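
The type/subtype convention above is simple enough to handle in a few lines. The following is a minimal sketch, not a full parser: the names `MediaType`, `parse_media_type`, and `needs_helper_application` are made up for illustration, and parameters such as `charset` are simply dropped.

```python
from typing import NamedTuple


class MediaType(NamedTuple):
    top_level: str    # e.g. "text", "image", "audio", "application"
    subtype: str      # e.g. "plain", "jpeg", "x-pn-realaudio"
    unofficial: bool  # True when the subtype carries the "x-" prefix


def parse_media_type(value: str) -> MediaType:
    """Split a 'type/subtype' string; parameters after ';' are ignored."""
    essence = value.split(";", 1)[0].strip().lower()
    top_level, sep, subtype = essence.partition("/")
    if not sep or not top_level or not subtype:
        raise ValueError(f"not a type/subtype pair: {value!r}")
    return MediaType(top_level, subtype, subtype.startswith("x-"))


def needs_helper_application(media: MediaType) -> bool:
    """The 'application' top-level type signals data meant for a specific program."""
    return media.top_level == "application"


if __name__ == "__main__":
    for text in ("text/plain", "application/msword", "audio/x-pn-realaudio"):
        print(text, parse_media_type(text))
```

The small surface area (two fields and one prefix rule) is arguably what made the scheme easy to adopt.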

Complex objects and granularity

DOI identifier records: multiple versions of a single document (hi res / low res)

Syntax should be mirrored in reference metadata

MPEG 7

Audio/visual metadata

Based again on how people search (The Potato Eaters)

I’m looking for a picture of a group.

I’d like it to be a family group.

This family should be doing something that would be typical for a family, like sitting around a table with food in front of them, looking grateful for what they have to eat.

Facet analysis is a good approach

Objective (“of”)

Subjective (“about”)

HTML Meta tags

<HTML><HEAD>

<META NAME="attribute" CONTENT="value">

</HEAD>… </HTML>

Not regulated or controlled

You can add your own tags

Only certain ones parsed by finding aids (e.g., GoogleBot)

Many tags use other metadata formats
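
As a rough illustration of how a finding aid might read these tags, here is a sketch using Python's standard `html.parser`. It assumes the value is carried in the standard HTML `content` attribute; the class name `MetaTagCollector` and the sample page are invented for this example, and a real crawler would only act on the tag names it recognizes.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collect name/content pairs from <meta> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.metadata: dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        fields = dict(attrs)
        name, content = fields.get("name"), fields.get("content")
        if name and content is not None:
            self.metadata[name.lower()] = content


def extract_meta(html_text: str) -> dict[str, str]:
    collector = MetaTagCollector()
    collector.feed(html_text)
    collector.close()
    return collector.metadata


if __name__ == "__main__":
    # Hypothetical page, including a made-up tag to show that anything goes.
    page = (
        '<HTML><HEAD><META NAME="keywords" CONTENT="metadata, cataloging">'
        '<META NAME="x-reviewer" CONTENT="mk"></HEAD><BODY></BODY></HTML>'
    )
    print(extract_meta(page))
```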

MARC / AACR 2 / TEIH

Machine Readable Cataloging

Standard for encoding cataloging data (bibliographic and authority)

Standoff Annotation (External)

Anglo American Cataloguing Rules 2

Set of rules used for collecting bibliographic data and for formulating access points (for authors, titles, subjects, related works, etc.)

Regulates format and number of access points

Text Encoding Initiative Header

Header, similar to <HEAD> in HTML

Is located within the document (Internal)

Z39.50

Protocol for clients to ask queries of servers

Librarians use __________ to devise values for fields to be encoded by __________ or in ____________. This data is accessible by users using the ________ protocol.

Difficulties in Naming Authorities

People

Use most common name:

Dr Seuss

Not Theodor Seuss Geisel

Geographic Names

Use latest name:

Democratic Republic of the Congo

Not Zaïre

Data must be constantly updated to provide users with ________________ – not an easy job
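
One way to picture an authority file is as a lookup from every known variant to a single preferred form. The sketch below is only illustrative: the `AUTHORITY` table, its entries, and the function names are assumptions, and a production authority file would also carry dates, sources, and cross-references.

```python
def normalize(name: str) -> str:
    """Case-fold and drop punctuation so near-identical spellings collide."""
    kept = "".join(ch for ch in name.casefold() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


# Hypothetical authority table: preferred heading -> known variant forms.
AUTHORITY = {
    "Dr Seuss": ["Theodor Seuss Geisel", "Theodore Seuss Geisel", "Dr. Seuss"],
}


def build_index(authority: dict[str, list[str]]) -> dict[str, str]:
    """Map every normalized form (preferred or variant) to the preferred heading."""
    index = {}
    for preferred, variants in authority.items():
        for form in [preferred, *variants]:
            index[normalize(form)] = preferred
    return index


def preferred_name(name: str, index: dict[str, str]) -> str:
    """Return the preferred heading, or the input unchanged when it is unknown."""
    return index.get(normalize(name), name)


if __name__ == "__main__":
    index = build_index(AUTHORITY)
    print(preferred_name("theodore seuss GEISEL", index))
```

Keeping such a table current is the hard part: every renamed place or newly discovered pseudonym means another variant row.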

Dublin Core Elements

A __________________ set of metadata attributes used for interoperability. Has recommended values for some fields.

Besides Title, Creator, Publisher, Contributor, 11 other fields:

Subject

Subject, expressed as keywords, key phrases or classification codes that describe a topic of the resource.

value from a controlled vocabulary or formal classification scheme.

Dublin Core Elements (Cont’d)

Description

An account of the content of the resource.

(e.g., an abstract, ToC, graphical representation of content or a free-text account of the content)

Date

creation or availability of the resource

ISO 8601 (e.g., YYYY-MM-DD)

Type

The nature or genre of the content of the resource

value from a controlled vocabulary (e.g., DCMI Type Vocabulary)

Format

Media-type or dimensions. Also, identifies the software, hardware, or other equipment needed to display or operate the resource.

value from a controlled vocabulary (e.g., MIME)

Identifier

Unambiguous reference to the resource within a given context.

Use a formal identification system (e.g., URI, DOI, ISBN)

DC Elements (Cont’d)

Source

Reference to a resource from which the present resource is derived (e.g., past edition)

Language

Language of the intellectual content of the resource.

Use RFC 3066 with ISO 639 (e.g., “en-GB”)

Relation

Reference to a related resource

Coverage

The extent or scope of the resource (e.g., location, time period)

value from a controlled vocabulary

Rights

Statement of copyright or a reference to one

If absent, no assumptions may be made
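
To make the element set concrete, here is a sketch that holds a record as a plain dictionary and writes it out as HTML meta tags using the common `DC.` name prefix. The record values are invented, and `is_iso_date` and `to_meta_tags` are hypothetical helper names; the date check covers only the YYYY-MM-DD form mentioned above.

```python
from datetime import date
from html import escape

# The fifteen Dublin Core elements covered in these slides.
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}


def is_iso_date(value: str) -> bool:
    """True when the value parses as an ISO 8601 calendar date."""
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True


def to_meta_tags(record: dict) -> list[str]:
    """Render a record as <meta name="DC.element" content="..."> lines."""
    tags = []
    for element, value in record.items():
        if element not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {element}")
        values = value if isinstance(value, list) else [value]
        for item in values:  # elements are repeatable
            tags.append(
                f'<meta name="DC.{element}" content="{escape(item, quote=True)}">'
            )
    return tags


if __name__ == "__main__":
    # Hypothetical record for illustration only.
    record = {
        "title": "Metadata creation & management",
        "subject": ["metadata", "cataloging"],
        "date": "2004-02-15",
        "format": "text/html",
        "language": "en-GB",
    }
    assert is_iso_date(record["date"])
    print("\n".join(to_meta_tags(record)))
```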

Warwick Example

STARTS: A Metasearching Protocol

Defines:

Query language

Results format

Metadata for the collection

No specified transport layer or implementation

Built to assist metasearchers.

Distributed Search? Why?
“Surface” Web vs. “Hidden” Web

Hidden Web: Examples

Query Probing

Idea: Send different queries to categorize data

Demo Time! (if it works)

Focused Probing: Sampling

Transform each rule into a query

For each query:

Send to database

Record number of matches

Retrieve top-k matching documents

At the end of round:

Analyze matches for each category

Choose category to focus on
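
One round of the procedure above can be sketched as follows. This is a toy under stated assumptions: the "database" is an in-memory list of strings, `search` stands in for a remote search form, each rule is reduced to a plain keyword query, and the category with the most matches is chosen as the focus.

```python
from collections import defaultdict


def search(database: list[str], query: str, k: int) -> tuple[int, list[str]]:
    """Stand-in for a remote search interface: (match count, top-k documents)."""
    terms = query.lower().split()
    hits = [doc for doc in database if all(t in doc.lower().split() for t in terms)]
    return len(hits), hits[:k]


def probe_round(database: list[str], rules: dict[str, list[str]], k: int = 2):
    """Send every rule's query, tally matches per category, pick the focus."""
    matches = defaultdict(int)
    sample = []
    for category, queries in rules.items():
        for query in queries:
            count, top_docs = search(database, query, k)
            matches[category] += count          # record number of matches
            for doc in top_docs:                # keep top-k documents as a sample
                if doc not in sample:
                    sample.append(doc)
    focus = max(matches, key=matches.get)       # category to focus on next
    return focus, dict(matches), sample


if __name__ == "__main__":
    # Hypothetical hidden-web database and probing rules.
    database = [
        "heart surgery recovery",
        "cancer treatment trial",
        "heart disease risk",
        "basketball playoff score",
    ]
    rules = {"Health": ["heart", "cancer"], "Sports": ["basketball", "soccer"]}
    print(probe_round(database, rules))
```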

Adjusting Document Frequencies

We know ranking r of words according to document frequency in sample

We know absolute document frequency f of some words from one-word queries

Mandelbrot’s formula empirically relates word frequency f to rank r

We use curve-fitting to estimate the absolute frequency of all words in sample
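
Mandelbrot's formula is usually written f = P (r + p)^(-B). The sketch below fits P and B by least squares in log space from the few words whose absolute frequency is known, then extrapolates to other ranks. As a simplifying assumption it keeps the offset p fixed (0 by default) rather than fitting it; the function names and the sample numbers are made up.

```python
import math


def fit_mandelbrot(known: dict[int, float], p: float = 0.0) -> tuple[float, float]:
    """Fit log f = log P - B * log(r + p) to {rank: absolute frequency} pairs."""
    if len(known) < 2:
        raise ValueError("need at least two (rank, frequency) points")
    xs = [math.log(rank + p) for rank in known]
    ys = [math.log(freq) for freq in known.values()]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("ranks must differ")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (P, B)


def estimate_frequency(rank: int, P: float, B: float, p: float = 0.0) -> float:
    """Predicted absolute document frequency for a word at the given rank."""
    return P * (rank + p) ** (-B)


if __name__ == "__main__":
    # Hypothetical: frequencies learned from one-word queries at ranks 1, 5, 20.
    known = {1: 1000.0, 5: 200.0, 20: 50.0}
    P, B = fit_mandelbrot(known)
    print(P, B, estimate_frequency(10, P, B))
```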

Focused Probing

Algorithm:

Send general queries to determine high level category

Send progressively more specific queries to determine mid- and lower-level categories
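
The general-to-specific descent can be sketched as a walk down a category tree. This is a simplification under stated assumptions: `coverage` is a hypothetical callable returning how many probe matches a category received, only the single best child is followed at each level, and the threshold value is arbitrary.

```python
from typing import Callable


def classify(
    tree: dict[str, list[str]],
    coverage: Callable[[str], int],
    root: str = "Root",
    threshold: int = 10,
) -> list[str]:
    """Descend from general to specific categories while probes keep matching."""
    path = [root]
    node = root
    while tree.get(node):  # stop at leaves (no children listed)
        best = max(tree[node], key=coverage)
        if coverage(best) < threshold:
            break          # not enough evidence to go deeper
        path.append(best)
        node = best
    return path


if __name__ == "__main__":
    # Hypothetical hierarchy and probe match counts.
    tree = {
        "Root": ["Health", "Sports"],
        "Health": ["Cardiology", "Oncology"],
        "Sports": ["Basketball"],
    }
    counts = {"Health": 120, "Sports": 8, "Cardiology": 75, "Oncology": 4, "Basketball": 6}
    print(classify(tree, lambda category: counts.get(category, 0)))
```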

Automatic Extraction of Metadata

Rule-based scripts (fragile):

DC Dot Demo: http://www.ukoln.ac.uk/cgi-bin/dcdot.pl

Still heavily cited and used!

Wrapper induction: localized extraction

Define a local context and features to match and extract

Text classification: document-level classification

Use features over the entire document to determine classification.
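
To illustrate the "local context" idea behind wrappers, the sketch below extracts whatever sits between a left and a right delimiter. The delimiters here are written by hand, whereas wrapper induction would learn them from labeled example pages; the function name and the sample page are assumptions.

```python
def extract_between(text: str, left: str, right: str) -> list[str]:
    """Return the strings found between each left/right delimiter pair."""
    values = []
    position = 0
    while True:
        start = text.find(left, position)
        if start == -1:
            break
        start += len(left)
        end = text.find(right, start)
        if end == -1:
            break
        values.append(text[start:end].strip())
        position = end + len(right)
    return values


if __name__ == "__main__":
    # Hypothetical page layout; a small change in markup would break the rule.
    page = "<b>Title:</b> Metadata Basics<br><b>Author:</b> A. Writer<br>"
    print(extract_between(page, "<b>Author:</b>", "<br>"))
    print(extract_between(page, "<b>Title:</b>", "<br>"))
```

The same fragility noted for rule-based scripts applies: the rule depends entirely on the page keeping its layout.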

Crosswalking

The transfer of metadata from one format to another

Retrofitting = ____________

Aids accessibility and discovery

Complementary to OAI / SDARTS (which are ________ approaches)

Mostly done manually by specialists

CS work to be done here!
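
At its simplest, a crosswalk is a field-to-field mapping table. The sketch below maps a handful of MARC tags to Dublin Core elements. It is heavily simplified: subfields and indicators are ignored, the mapping covers only a few well-known tags, and the sample record is invented. Tags without a mapping are reported rather than silently dropped, since that is where specialist judgment is needed.

```python
# Simplified tag-level mapping (assumption: subfields are ignored).
MARC_TO_DC = {
    "245": "title",        # title statement
    "100": "creator",      # main entry, personal name
    "260": "publisher",    # publication information
    "650": "subject",      # topical subject heading
    "520": "description",  # summary note
    "020": "identifier",   # ISBN
}


def crosswalk(marc_fields: list[tuple[str, str]]) -> tuple[dict[str, list[str]], list[str]]:
    """Convert (tag, value) pairs to Dublin Core; also return unmapped tags."""
    dublin_core: dict[str, list[str]] = {}
    unmapped: list[str] = []
    for tag, value in marc_fields:
        element = MARC_TO_DC.get(tag)
        if element is None:
            unmapped.append(tag)
        else:
            dublin_core.setdefault(element, []).append(value)
    return dublin_core, unmapped


if __name__ == "__main__":
    # Hypothetical record for illustration only.
    record = [
        ("100", "Writer, A."),
        ("245", "Metadata basics"),
        ("650", "Metadata"),
        ("650", "Cataloging"),
        ("300", "200 p."),
    ]
    print(crosswalk(record))
```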

Reference

Dublin Core Tool List

http://www.lub.lu.se/tk/metadata/dctoollist.html

And many others

The Getty Research Institute

http://www.getty.edu/research/institute/

Crosswalking

http://www.ukoln.ac.uk/metadata/interoperability/

Summing Up

Metadata authoring is highly intricate but serves two complementary purposes

Inventory

Access (what we care more about)

Uses controlled vocabulary (CV) standards (licensing drawback)

Automated approaches have promise …

To access and annotate more data

But output generally needs rework or NLP post-processing to make the data fit the standard

Questions?!?

