Metadata creation and
management*
|
|
|
Module 6 Min-Yen KAN |
|
*Parts of this lecture come from
Lilian Tang’s lecture material at the Univ. of Surrey |
What is metadata,
anyways?
|
|
|
|
Data about data |
|
From the DB community |
|
|
|
“Cataloging or indexing information
that [information professionals] create to arrange, describe, and otherwise
enhance access to an information object” |
|
-- Gilliland-Swetland (1998) |
|
|
|
“Data that describe attributes of a
resource, characterize its relationships, support its discovery and effective
use, and exist in an electronic environment” |
|
-- Vellucci (1998) |
Outline
|
|
|
|
What is Metadata? |
|
Some Frameworks |
|
Packaging Metadata |
|
Warwick Framework |
|
|
|
Structural Metadata |
|
Hidden Web Metadata |
|
OAI |
|
SDARTS |
|
|
|
Crosswalking and Automated Extraction |
|
Metadata formats |
|
|
|
HTML Metadata |
|
AACR2 / TEIH /
MARC / Z39.50 |
|
Dublin Core |
|
SingCORE |
Types of metadata
|
|
|
Administrative |
|
Structural |
|
Descriptive |
|
Intellectual Property Rights |
|
Use |
Metadata attributes
Data types: MIME
|
|
|
|
|
Multipurpose Internet Mail Extensions |
|
(text/plain, image/jpeg,
application/msword) |
|
Simple format, pre-web |
|
Can define an unofficial type using
the x- subtype prefix (e.g., audio/x-pn-realaudio) |
|
Application tag: need to use an
application to handle this data |
|
Wild success shows a simple system is
best: |
|
Good for adoption / authoring |
|
Good for common denominator |
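A minimal sketch of how a MIME type string splits into a type and a subtype, and how the unofficial x- convention can be detected. The helper names `split_mime`, `is_unofficial`, and `needs_application` are illustrative assumptions; only `mimetypes.guess_type` comes from Python's standard library.

```python
import mimetypes

def split_mime(mime):
    """Split 'type/subtype' into its two lower-cased parts."""
    top, _, sub = mime.lower().partition("/")
    return top, sub

def is_unofficial(mime):
    """True when the subtype carries the unofficial 'x-' prefix."""
    return split_mime(mime)[1].startswith("x-")

def needs_application(mime):
    """True for the 'application' top-level type (data handled by a program)."""
    return split_mime(mime)[0] == "application"

# Guess a MIME type from a file name using the standard library's table.
guessed, _encoding = mimetypes.guess_type("notes.txt")
print(guessed)
print(is_unofficial("audio/x-pn-realaudio"))
print(needs_application("application/msword"))
```

The whole scheme fits in a single string comparison, which is part of why it is easy to author and easy to adopt.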
Complex objects and
granularity
|
|
|
|
DOI identifier records: multiple
versions of a single document (hi res / low res) |
|
Syntax should be mirrored in reference
metadata |
MPEG 7
Audio/visual metadata
|
|
|
|
Based again on how people search
(The Potato Eaters) |
|
I’m looking for a picture of a group. |
|
I’d like it to be a family group. |
|
This family should be doing something
that would be typical for a family, like sitting around a table with food in
front of them, looking grateful for what they have to eat. |
|
|
|
Facet analysis is a good approach |
|
Objective (“of”) |
|
Subjective (“about”) |
HTML Meta tags
|
|
|
<HTML><HEAD> |
|
<META NAME="attribute" CONTENT="value"> |
|
</HEAD>… </HTML> |
|
|
|
Not regulated or
controlled |
|
You can add your
own tags |
|
Only certain ones
are parsed by finding aids
(e.g., Googlebot) |
|
Many tags use other
metadata formats |
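A minimal sketch of how a finding aid might read meta tags, using Python's standard `html.parser`. The class name `MetaExtractor`, the function `extract_meta`, and the sample page are assumptions made for illustration. The `DC.Title` name shows one way a tag can carry a value from another metadata format (here, Dublin Core).

```python
from html.parser import HTMLParser

class MetaExtractor(HTMLParser):
    """Collect NAME/CONTENT pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names for us.
        if tag != "meta":
            return
        attr = dict(attrs)
        name, content = attr.get("name"), attr.get("content")
        if name and content is not None:
            self.meta[name.lower()] = content

def extract_meta(html):
    parser = MetaExtractor()
    parser.feed(html)
    return parser.meta

page = """<html><head>
<meta name="description" content="Lecture notes on metadata">
<meta name="DC.Title" content="Metadata creation and management">
<meta name="my-own-tag" content="anything goes">
</head><body></body></html>"""

print(extract_meta(page))
```

A real crawler would look only at the handful of names it recognizes and ignore the rest, which is why self-invented tags are harmless but rarely useful.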
MARC / AACR2 / TEIH
|
|
|
|
Machine Readable Cataloging |
|
Standard for encoding cataloging data
(bibliographic and authority) |
|
Standoff Annotation (External) |
|
|
|
Anglo-American Cataloguing Rules, 2nd edition (AACR2) |
|
Set of rules used for collecting
bibliographic data and for formulating access points (for authors, titles,
subjects, related works, etc.) |
|
Regulates format and number of access
points |
|
|
|
Text Encoding Initiative Header |
|
Header, similar to <HEAD> in HTML |
|
Is located within the document
(Internal) |
|
|
|
Z39.50 |
|
Protocol for clients to ask queries of
servers |
|
|
|
Librarians use __________ to devise
values for fields to be encoded by __________
or in ____________. This data
is accessible by users using the ________protocol. |
Difficulties in Naming
Authorities
|
|
|
|
People |
|
Use most common name: |
|
Dr Seuss |
|
Not Theodor Seuss Geisel |
|
Geographic Names |
|
Use latest name: |
|
Namibia |
|
Not South West Africa |
|
|
|
|
|
Data must be constantly updated to
provide users with ________________ – not an easy job |
Dublin Core Elements
|
|
|
|
A __________________ set of metadata
attributes used for interoperability.
Has recommended values for some fields. |
|
Besides Title, Creator, Publisher,
Contributor, 11 other fields: |
|
|
|
Subject |
|
Subject, expressed as keywords, key
phrases or classification codes that describe a topic of the resource. |
|
value from a controlled vocabulary or
formal classification scheme. |
Dublin Core Elements
(Cont’d)
|
|
|
|
Description |
|
An account of the content of the
resource. |
|
(e.g., an abstract, ToC, graphical
representation of content or a free-text account of the content) |
|
Date |
|
creation or availability of the
resource |
|
ISO 8601 (e.g., YYYY-MM-DD) |
|
Type |
|
The nature or genre of the content of
the resource |
|
value from a controlled vocabulary
(e.g., DCMI Type Vocabulary) |
|
Format |
|
Media-type or dimensions. Also,
identifies the software, hardware, or other equipment needed to display or
operate the resource. |
|
value from a controlled vocabulary
(e.g., MIME) |
|
Identifier |
|
Unambiguous reference to the resource
within a given context. |
|
Use a formal identification system,
(e.g., URI, DOI, ISBN) |
DC Elements (Cont’d)
|
|
|
|
Source |
|
Reference to a resource from which the
present resource is derived (e.g., past edition) |
|
Language |
|
Language of the intellectual content of
the resource. |
|
Use RFC 3066 with ISO 639 codes (e.g.,
“en-GB”) |
|
Relation |
|
Reference to a related resource |
|
Coverage |
|
The extent or scope of the resource
(e.g., location, time period) |
|
value from a controlled vocabulary |
|
Rights |
|
Statement of copyright or a reference
to one |
|
If absent, no assumptions may be made |
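A minimal sketch of a Dublin Core record built with Python's standard `xml.etree.ElementTree`. The namespace URI is the published one for the Dublin Core element set. The `<record>` wrapper, the function name `build_record`, and the sample date and identifier are assumptions made for illustration.

```python
import datetime
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

def build_record(fields):
    """Build a <record> element with one dc:* child per field."""
    if "date" in fields:
        # Raises ValueError when the date is not a valid ISO 8601 date.
        datetime.date.fromisoformat(fields["date"])
    record = ET.Element("record")  # hypothetical wrapper element
    for name, value in fields.items():
        child = ET.SubElement(record, f"{{{DC_NS}}}{name}")
        child.text = value
    return record

record = build_record({
    "title": "Metadata creation and management",
    "creator": "Min-Yen KAN",
    "date": "2004-01-01",                        # hypothetical date
    "type": "Text",                              # DCMI Type Vocabulary
    "format": "text/html",                       # MIME type
    "identifier": "https://example.org/module6", # hypothetical URI
    "language": "en-GB",                         # RFC 3066 tag
})
print(ET.tostring(record, encoding="unicode"))
```

Only the date is validated here; checking the other fields against their controlled vocabularies would follow the same pattern.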
Warwick Example
STARTS: A Metasearching
Protocol
|
|
|
|
Defines: |
|
Query language |
|
Results format |
|
Metadata for the collection |
|
|
|
No specified transport layer or
implementation |
|
Built to assist metasearchers. |
Distributed Search?
Why?
“Surface” Web vs. “Hidden” Web
Hidden Web: Examples
Query Probing
|
|
|
Idea: Send different queries to
categorize data |
|
|
|
Demo Time!
(if it works) |
Focused Probing: Sampling
|
|
|
|
Transform each rule into a query |
|
For each query: |
|
Send to database |
|
Record number of matches |
|
Retrieve top-k matching documents |
|
At the end of round: |
|
Analyze matches for each category |
|
Choose category to focus on |
|
|
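One round of the sampling loop above can be sketched as follows. The in-memory `database`, the `search` stand-in for a remote search interface, and the rule list are all assumptions made for illustration; a real hidden-web database would be queried through its search form.

```python
from collections import Counter

def search(database, query, k):
    """Stand-in for a remote search: returns (match count, top-k documents)."""
    matches = [doc for doc in database if query in doc.lower().split()]
    return len(matches), matches[:k]

def probe_round(database, rules, k=2):
    """rules: list of (query, category) pairs derived from classifier rules."""
    counts = Counter()
    sample = []
    for query, category in rules:
        n_matches, top_docs = search(database, query, k)
        counts[category] += n_matches      # record number of matches
        for doc in top_docs:               # keep top-k documents as a sample
            if doc not in sample:
                sample.append(doc)
    return counts, sample

def choose_focus(counts):
    """Pick the category with the most matches for the next round."""
    return counts.most_common(1)[0][0]

database = [
    "cancer treatment trial results",
    "diabetes diet and treatment",
    "cancer screening guidelines",
    "football league results",
]
rules = [("cancer", "Health"), ("diabetes", "Health"), ("football", "Sports")]

counts, sample = probe_round(database, rules)
print(counts, choose_focus(counts))
```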
Adjusting Document
Frequencies
|
|
|
We know ranking r of words according to
document frequency in sample |
|
|
|
We know absolute document frequency f
of some words from one-word queries |
|
|
|
Mandelbrot’s formula empirically
relates word frequency f to rank r |
|
|
|
We use curve-fitting to estimate the
absolute frequency of all words in sample |
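A minimal sketch of the curve-fitting step, assuming the commonly cited Zipf-Mandelbrot form f = P (r + p)^(-B). To keep the fit a simple linear regression in log space, the offset p is treated as a fixed input here, which is a simplification; the function names and the synthetic data are also assumptions.

```python
import math

def fit_mandelbrot(known, p=0.0):
    """Fit f = P * (r + p) ** (-B) by least squares on log f vs. log(r + p).

    known: list of (rank, absolute_frequency) pairs, e.g. sample ranks of
    words whose true frequencies came from one-word queries.
    Returns (P, B).
    """
    xs = [math.log(rank + p) for rank, _ in known]
    ys = [math.log(freq) for _, freq in known]
    n = len(known)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

def estimate_frequency(rank, P, B, p=0.0):
    """Estimated absolute document frequency for a word at the given rank."""
    return P * (rank + p) ** (-B)

# Synthetic check: points generated from the formula itself.
known = [(r, 1000.0 * (r + 2.0) ** -1.1) for r in (1, 5, 20, 100)]
P, B = fit_mandelbrot(known, p=2.0)
print(round(P, 3), round(B, 3))
print(estimate_frequency(10, P, B, p=2.0))
```

With the fitted P and B, every word in the sample gets an estimated absolute frequency from its rank alone.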
Focused Probing
|
|
|
|
Algorithm: |
|
Send general queries to determine high
level category |
|
Send progressively more specific
queries to determine mid- and lower-level categories |
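The descent from general to specific categories can be sketched as a walk down a category tree. The tree layout, the probe words, the `threshold` parameter, and the toy document list are assumptions made for illustration, not the method's actual rule set.

```python
def classify(count_matches, node, threshold=1):
    """Follow, at each level, the child category whose probes match most."""
    path = [node["name"]]
    while node.get("children"):
        scored = [
            (sum(count_matches(q) for q in child["probes"]), child)
            for child in node["children"]
        ]
        best_score, best = max(scored, key=lambda pair: pair[0])
        if best_score < threshold:
            break  # not enough evidence to go deeper
        path.append(best["name"])
        node = best
    return path

hierarchy = {"name": "Root", "children": [
    {"name": "Health", "probes": ["treatment", "patient"], "children": [
        {"name": "Oncology", "probes": ["cancer", "tumor"], "children": []},
        {"name": "Cardiology", "probes": ["heart"], "children": []},
    ]},
    {"name": "Sports", "probes": ["league", "score"], "children": []},
]}

docs = ["cancer treatment trial", "tumor patient care",
        "heart patient diet", "cancer screening"]

def count_matches(query):
    """Toy stand-in for the match count a database reports for a query."""
    return sum(query in doc.split() for doc in docs)

print(classify(count_matches, hierarchy))
```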
Automatic Extraction of
Metadata
|
|
|
|
Rule-based scripts (fragile): |
|
DC Dot Demo: http://www.ukoln.ac.uk/cgi-bin/dcdot.pl |
|
Still heavily cited and used! |
|
|
|
Wrapper induction: localized extraction |
|
Define a local context and features to
match and extract |
|
|
|
Text classification: whole-document classification |
|
Use features over the entire document
to determine classification. |
|
|
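A minimal sketch of the rule-based style of extraction, with a few regular expressions mapped to Dublin Core field names. The patterns, the `RULES` table, and the sample text (including its date) are assumptions made for illustration; this is not how DC Dot itself works.

```python
import re

# Each rule: (Dublin Core field, pattern with one capturing group).
RULES = [
    ("title", re.compile(r"^Title:\s*(.+)$", re.MULTILINE | re.IGNORECASE)),
    ("creator", re.compile(r"^(?:Author|By):\s*(.+)$", re.MULTILINE | re.IGNORECASE)),
    ("date", re.compile(r"\b(\d{4}-\d{2}-\d{2})\b")),
]

def extract(text):
    """Apply every rule; keep the first match for each field."""
    found = {}
    for field, pattern in RULES:
        match = pattern.search(text)
        if match:
            found[field] = match.group(1).strip()
    return found

doc = (
    "Title: Metadata creation and management\n"
    "Author: Min-Yen KAN\n"
    "Last updated 2004-01-01\n"   # hypothetical date
)
print(extract(doc))
print(extract("Written by someone, some time ago"))  # no rule fires
```

The second call shows the fragility: any layout the rules did not anticipate yields nothing.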
Crosswalking
|
|
|
|
The transfer of metadata from one
format to another |
|
Retrofitting = ____________ |
|
|
|
Aids accessibility and discovery |
|
Complementary to OAI / SDARTS (which
are ________ approaches) |
|
Mostly done manually by specialists |
|
CS work to be done here! |
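A minimal sketch of a crosswalk as a field-to-field mapping, here from a few MARC tags to Dublin Core elements. The mapping is deliberately simplified and illustrative, not an official crosswalk table, and the record values are made up for the example.

```python
# Simplified, illustrative mapping (MARC tag -> Dublin Core element).
MARC_TO_DC = {
    "245": "title",       # title statement
    "100": "creator",     # main entry, personal name
    "260": "publisher",   # publication information
    "650": "subject",     # topical subject heading
    "020": "identifier",  # ISBN
}

def crosswalk(marc_record, mapping=MARC_TO_DC):
    """marc_record: list of (tag, value) pairs.

    Returns (dc, unmapped): dc maps each Dublin Core element to a list of
    values; unmapped lists source tags with no target in the mapping.
    """
    dc = {}
    unmapped = []
    for tag, value in marc_record:
        field = mapping.get(tag)
        if field is None:
            unmapped.append(tag)
            continue
        dc.setdefault(field, []).append(value)
    return dc, unmapped

record = [
    ("245", "The Cat in the Hat"),
    ("100", "Seuss, Dr."),
    ("650", "Cats"),
    ("650", "Stories in rhyme"),
    ("300", "1 volume"),  # physical description: no target in this mapping
]
dc, unmapped = crosswalk(record)
print(dc)
print(unmapped)
```

The `unmapped` list is where the manual work shows up: fields that exist in one format and have no clean equivalent in the other.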
Reference
|
|
|
|
Dublin Core Tool List |
|
http://www.lub.lu.se/tk/metadata/dctoollist.html |
|
And many others |
|
The Getty Research Institute |
|
http://www.getty.edu/research/institute/ |
|
Crosswalking |
|
http://www.ukoln.ac.uk/metadata/interoperability/ |
|
|
Summing Up
|
|
|
|
Metadata authoring is highly intricate but
serves two complementary purposes |
|
Inventory |
|
Access (what we care more about) |
|
Uses controlled vocabulary (CV) standards (licensing drawback) |
|
|
|
Automated approaches have promise … |
|
To access and annotate more data |
|
But generally need rework or NLP
post-processing to make the data fit a standard |
|
|
|
Questions?!? |