Metadata creation and management*
Module 6              Min-Yen KAN
*Parts of this lecture come from
Lilian Tang’s lecture material at the Univ. of Surrey

What is metadata, anyways?
Data about data
From the DB community
“Cataloging or indexing information that [information professionals] create to arrange, describe, and otherwise enhance access to an information object”
-- Gilliland-Swetland (1998)
“Data that describes attributes of a resource, characterize its relationships, support its discovery and effective use and exist in an electronic environment”
-- Vellucci (1998)

Outline
What is Metadata?
Some Frameworks
Packaging Metadata
Warwick Framework
Structural Metadata
Hidden Web Metadata
OAI
SDARTS
Crosswalking and Automated Extraction
Metadata formats
HTML Metadata
AACR2 / TEIH / MARC / Z39.50
Dublin Core
SingCORE

Types of metadata
Administrative
Structural
Descriptive
Intellectual Property Rights
Use

Metadata attributes

Data types: MIME
Multipurpose Internet Mail Extensions
(text/plain, image/jpeg, application/msword)
Simple format, pre-web
Can code an unofficial type using the x- subtype prefix (e.g., audio/x-pn-realaudio)
Application tag: need to use an application to handle this data
Wild success shows a simple system is best:
Good for adoption / authoring
Good for common denominator
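A minimal sketch of how little machinery the type/subtype scheme needs. The function name `parse_mime` and the returned dictionary keys are illustrative assumptions, not part of any standard library.

```python
def parse_mime(mime: str) -> dict:
    """Split a MIME type string into its top-level type and subtype."""
    top, sep, sub = mime.strip().lower().partition("/")
    if not sep or not top or not sub:
        raise ValueError(f"not a type/subtype pair: {mime!r}")
    return {
        "type": top,
        "subtype": sub,
        # Unofficial subtypes carry the x- prefix.
        "unofficial": sub.startswith("x-"),
        # The application type signals that a dedicated program is needed.
        "needs_application": top == "application",
    }


for example in ("text/plain", "image/jpeg", "application/msword",
                "audio/x-pn-realaudio"):
    print(example, parse_mime(example))
```

One string split is enough to recover both facets, which is part of why the format was so easy to adopt.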

Complex objects and granularity
DOI identifier records: multiple versions of a single document (hi res / low res)
Syntax should be mirrored in reference metadata

MPEG-7

Audio/visual metadata
Based again on how people search
(The Potato Eaters)
I’m looking for a picture of a group.
I’d like it to be a family group.
This family should be doing something that would be typical for a family, like sitting around a table with food in front of them, looking grateful for what they have to eat.
Facet analysis is a good approach
Objective (“of”)
Subjective (“about”)

HTML Meta tags
<HTML><HEAD>
  <META NAME="attribute" CONTENT="value">
</HEAD>… </HTML>
Not regulated or controlled
You can add your own tags
Only certain ones parsed by finding aids (e.g., GoogleBot)
Many tags use other metadata formats
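A small sketch of how a finding aid might read meta tags, using Python's standard `html.parser` module. It assumes the standard NAME/CONTENT attribute pair. The class name `MetaTagParser`, the helper `extract_meta`, and the sample page are made up for illustration.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collect NAME/CONTENT pairs from <META> tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr = dict(attrs)
        name, content = attr.get("name"), attr.get("content")
        if name and content is not None:
            self.meta[name.lower()] = content


def extract_meta(html):
    parser = MetaTagParser()
    parser.feed(html)
    parser.close()
    return parser.meta


# Hypothetical page; the second tag borrows its name from Dublin Core.
page = """<HTML><HEAD>
  <META NAME="keywords" CONTENT="metadata, digital libraries">
  <META NAME="DC.title" CONTENT="Metadata creation and management">
</HEAD><BODY>...</BODY></HTML>"""

print(extract_meta(page))
```

Because tag names are unregulated, the parser accepts any name it sees; deciding which names to trust is left to the finding aid.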

MARC / AACR 2 / TEIH
Machine Readable Cataloging
Standard for encoding cataloging data (bibliographic and authority)
Standoff Annotation (External)
Anglo-American Cataloguing Rules, 2nd edition
Set of rules used for collecting bibliographic data and for formulating access points (for authors, titles, subjects, related works, etc.)
Regulates format and number of access points
Text Encoding Initiative Header
Header, similar to <HEAD> in HTML
Is located within the document (Internal)
Z39.50
Protocol for clients to ask queries of servers
Librarians use __________ to devise values for fields to be encoded by __________ or in ____________. This data is accessible by users using the ________ protocol.

Difficulties in Naming Authorities
People
Use most common name:
Dr. Seuss
Not Theodor Seuss Geisel
Geographic Names
Use latest name:
  Namibia
Not South West Africa
Data must be constantly updated to provide users with ________________ – not an easy job

Dublin Core Elements
A __________________ set of metadata attributes used for interoperability.  Has recommended values for some fields.
Besides Title, Creator, Publisher, Contributor, 11 other fields:
Subject
Topic of the resource, expressed as keywords, key phrases, or classification codes.
Recommended: a value from a controlled vocabulary or formal classification scheme.

Dublin Core Elements (Cont’d)
Description
An account of the content of the resource.
(e.g., an abstract, ToC, graphical representation of content or a free-text account of the content)
Date
Date of creation or availability of the resource
ISO 8601 (e.g., YYYY-MM-DD)
Type
The nature or genre of the content of the resource
value from a controlled vocabulary (e.g., DCMI Type Vocabulary)
Format
Media-type or dimensions. Also, identifies the software, hardware, or other equipment needed to display or operate the resource.
value from a controlled vocabulary (e.g., MIME)
Identifier
Unambiguous reference to the resource within a given context.
Use a formal identification system (e.g., URI, DOI, ISBN)

DC Elements (Cont’d)
Source
Reference to a resource from which the present resource is derived (e.g., past edition)
Language
Language of the intellectual content of the resource.
Use RFC 3066 with ISO 639 (e.g., “en-GB”)
Relation
Reference to a related resource
Coverage
The extent or scope of the resource (e.g., location, time period)
value from a controlled vocabulary
Rights
Statement of copyright or a reference to one
If absent, no assumptions may be made
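The element list and its recommended value formats can be sketched as a simple record check. This is only an illustration: the function name `check_record` and the sample values (including the date) are made up, and only the date format and the presence of a rights statement are checked.

```python
from datetime import date

# The fifteen elements named on these slides.
DC_ELEMENTS = {
    "title", "creator", "publisher", "contributor", "subject", "description",
    "date", "type", "format", "identifier", "source", "language",
    "relation", "coverage", "rights",
}


def check_record(record):
    """Return a list of problems found in a Dublin Core-style dict."""
    problems = []
    for name in record:
        if name not in DC_ELEMENTS:
            problems.append(f"unknown element: {name}")
    if "date" in record:
        try:
            date.fromisoformat(record["date"])  # expects YYYY-MM-DD
        except ValueError:
            problems.append(f"date is not ISO 8601: {record['date']}")
    if "rights" not in record:
        problems.append("no rights statement: make no assumptions about reuse")
    return problems


# Hypothetical record; every value here is illustrative only.
record = {
    "title": "Metadata creation and management",
    "creator": "Min-Yen KAN",
    "date": "2004-01-15",
    "format": "text/plain",
    "language": "en-GB",
    "rights": "All rights reserved",
}
print(check_record(record))
print(check_record({"titel": "typo in element name", "date": "15 Jan 2004"}))
```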

Warwick Example

STARTS: A Metasearching Protocol
Defines:
Query language
Results format
Metadata for the collection
No specified transport layer or implementation
Built to assist metasearchers.

Distributed Search? Why?
“Surface” Web vs. “Hidden” Web

Hidden Web: Examples

Query Probing
Idea: Send different queries to categorize data
Demo Time!
(if it works)

Focused Probing: Sampling
Transform each rule into a query
For each query:
Send to database
Record number of matches
Retrieve top-k matching documents
At the end of round:
Analyze matches for each category
Choose category to focus on
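One round of the sampling loop above can be sketched as follows. The in-memory `database`, the `search` stand-in for a real query interface, and the `rules` mapping from categories to query words are all toy assumptions.

```python
def search(database, query, k):
    """Toy search interface: returns (match count, top-k matching documents)."""
    matches = [doc for doc in database if query in doc.lower().split()]
    return len(matches), matches[:k]


def probe_round(database, rules, k=2):
    """Run one round: one query per rule, then pick the category to focus on."""
    counts = {category: 0 for category in rules}
    sample = []
    for category, queries in rules.items():
        for query in queries:
            n_matches, top_docs = search(database, query, k)
            counts[category] += n_matches      # record number of matches
            for doc in top_docs:               # keep top-k documents as sample
                if doc not in sample:
                    sample.append(doc)
    focus = max(counts, key=counts.get)        # category with the most matches
    return focus, counts, sample


# Hypothetical hidden-web database and classification rules.
database = [
    "heart disease and diet",
    "cancer screening guidelines",
    "diet advice for heart patients",
    "football league results",
]
rules = {"health": ["heart", "cancer"], "sports": ["football", "tennis"]}

focus, counts, sample = probe_round(database, rules)
print(focus, counts, len(sample))
```

The retrieved documents double as a content sample of the database, which is what the frequency estimates are later computed from.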

Adjusting Document Frequencies
We know ranking r of words according to document frequency in sample
We know absolute document frequency f of some words from one-word queries
Mandelbrot’s formula empirically connects word frequency f and ranking r
We use curve-fitting to estimate the absolute frequency of all words in sample
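A sketch of the curve-fitting step, assuming the usual form of Mandelbrot's formula, f = P (r + p)^(-B). The slide does not spell out the formula or the fitting method, so the parameter names P, p, B and the grid search over p are assumptions. The "known" frequencies are synthetic, generated from the formula itself.

```python
import math


def fit_mandelbrot(known, p_grid=None):
    """Fit f = P * (r + p) ** -B to (rank, frequency) pairs.

    For a fixed p, log f is linear in log(r + p), so P and B come from
    ordinary least squares; the p with the smallest squared error wins.
    """
    if p_grid is None:
        p_grid = [i / 10 for i in range(0, 51)]
    best = None
    ys = [math.log(f) for _, f in known]
    n = len(known)
    mean_y = sum(ys) / n
    for p in p_grid:
        xs = [math.log(r + p) for r, _ in known]
        mean_x = sum(xs) / n
        sxx = sum((x - mean_x) ** 2 for x in xs)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        slope = sxy / sxx
        intercept = mean_y - slope * mean_x
        error = sum((intercept + slope * x - y) ** 2 for x, y in zip(xs, ys))
        if best is None or error < best[0]:
            best = (error, math.exp(intercept), p, -slope)
    _, P, p, B = best
    return P, p, B


def estimate(rank, P, p, B):
    """Estimated absolute document frequency of the word at this rank."""
    return P * (rank + p) ** -B


# Synthetic "one-word query" results at a few ranks (illustrative values).
SYN_SCALE, SYN_SHIFT, SYN_EXPONENT = 1000.0, 2.0, 1.2
known = [(r, SYN_SCALE * (r + SYN_SHIFT) ** -SYN_EXPONENT) for r in (1, 5, 20, 50)]

P, p, B = fit_mandelbrot(known)
print(f"P={P:.1f} p={p:.1f} B={B:.2f}")
print(f"estimated frequency at rank 10: {estimate(10, P, p, B):.1f}")
```

Only a handful of words need exact counts; the fitted curve supplies estimates for every other rank in the sample.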

Focused Probing
Algorithm:
Send general queries to determine high level category
Send progressively more specific queries to determine mid- and lower-level categories
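The general-to-specific descent can be sketched as a walk down a category tree. The tree layout (`queries` and `children` keys), the toy database, and the rule of always following the single best-scoring child are simplifying assumptions.

```python
def count_matches(database, queries):
    """Toy stand-in for a search interface that reports match counts."""
    return sum(
        sum(1 for doc in database if q in doc.lower().split()) for q in queries
    )


def classify(database, tree, path=()):
    """Descend the category tree, following the child with the most matches."""
    if not tree:
        return path
    scores = {
        name: count_matches(database, node["queries"])
        for name, node in tree.items()
    }
    best = max(scores, key=scores.get)
    if scores[best] == 0:
        return path  # no evidence for any subcategory: stop here
    return classify(database, tree[best]["children"], path + (best,))


# Hypothetical category hierarchy with probe queries at each level.
tree = {
    "health": {
        "queries": ["disease", "diet"],
        "children": {
            "cardiology": {"queries": ["heart"], "children": {}},
            "oncology": {"queries": ["cancer"], "children": {}},
        },
    },
    "sports": {"queries": ["football"], "children": {}},
}
database = [
    "heart disease and diet",
    "diet advice for heart patients",
    "cancer screening guidelines",
]

print(classify(database, tree))
```

Queries for a subcategory are sent only after its parent has won, so unpromising branches cost nothing.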

Automatic Extraction of Metadata
Rule-based scripts (fragile):
DC Dot Demo: http://www.ukoln.ac.uk/cgi-bin/dcdot.pl
Still heavily cited and used!
Wrapper induction: localized extraction
Define a local context and features to match and extract
Text classification: whole-document classification
Use features over the entire document to determine classification.
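A minimal sketch of the rule-based approach, and of why it is fragile. The two regular-expression rules and the function name `extract` are illustrative assumptions; they are not how DC-dot itself works.

```python
import re

# Two hand-written rules: one for the title, one for an ISO-style date.
TITLE_RULE = re.compile(r"<title>(.*?)</title>", re.IGNORECASE | re.DOTALL)
DATE_RULE = re.compile(r"\b(\d{4}-\d{2}-\d{2})\b")


def extract(html):
    """Apply each rule to the page and collect whatever matches."""
    record = {}
    title = TITLE_RULE.search(html)
    if title:
        record["title"] = " ".join(title.group(1).split())
    found_date = DATE_RULE.search(html)
    if found_date:
        record["date"] = found_date.group(1)
    return record


# Hypothetical pages: the second breaks both rules with small variations.
good = "<html><head><title>Metadata  creation</title></head><body>Updated 2004-01-15</body></html>"
odd = '<html><head><TITLE lang="en">Metadata creation</TITLE></head><body>Updated 15 Jan 2004</body></html>'

print(extract(good))
print(extract(odd))
```

An extra attribute on the title tag or a differently written date is enough to make the rules miss, which is the fragility noted above.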

Crosswalking
The transfer of metadata from one format to another
Retrofitting = ____________
Aids accessibility and discovery
Complementary to OAI / SDARTS (which are ________ approaches)
Mostly done manually by specialists
CS work to be done here!
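At its simplest, a crosswalk is a field-to-field mapping table. The sketch below assumes a deliberately simplified MARC-to-Dublin Core table. The field choices are illustrative, not an official crosswalk, and the sample record values are made up.

```python
# Simplified, illustrative mapping from MARC tags to Dublin Core elements.
MARC_TO_DC = {
    "245": "title",
    "100": "creator",
    "260b": "publisher",
    "650": "subject",
    "520": "description",
}


def crosswalk(marc_record):
    """Map a {MARC tag: [values]} dict to a {DC element: [values]} dict.

    Tags without a mapping are returned separately so that a specialist
    can decide what to do with them.
    """
    dc_record, unmapped = {}, []
    for field, values in marc_record.items():
        target = MARC_TO_DC.get(field)
        if target is None:
            unmapped.append(field)
            continue
        dc_record.setdefault(target, []).extend(values)
    return dc_record, unmapped


# Hypothetical MARC-like record.
marc = {
    "245": ["The Cat in the Hat"],
    "100": ["Seuss, Dr."],
    "650": ["Cats -- Fiction", "Stories in rhyme"],
    "300": ["1 volume"],
}

dc, leftover = crosswalk(marc)
print(dc)
print("unmapped MARC fields:", leftover)
```

The unmapped list is where the manual work shows up: richer formats carry fields that have no counterpart in a simpler one.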

Reference
Dublin Core Tool List
http://www.lub.lu.se/tk/metadata/dctoollist.html
And many others
The Getty Research Institute
http://www.getty.edu/research/institute/
Crosswalking
http://www.ukoln.ac.uk/metadata/interoperability/

Summing Up
Metadata authoring is highly intricate but serves two complementary purposes
Inventory
Access (what we care more about)
Uses controlled vocabulary (CV) standards (licensing is a drawback)
Automated approaches have promise …
To access and annotate more data
But output generally needs rework or NLP post-processing to make the data fit the standard
Questions?!?