1
Metadata creation and management*
  • Module 6              Min-Yen KAN
  • *Parts of this lecture come from
    Lilian Tang’s lecture material at the Univ. of Surrey
2
What is metadata, anyways?
  • Data about data
    • From the DB community

  • “Cataloging or indexing information that [information professionals] create to arrange, describe, and otherwise enhance access to an information object”
    • -- Gilliland-Swetland (1998)

  • “Data that describes attributes of a resource, characterize its relationships, support its discovery and effective use and exist in an electronic environment”
    • -- Vellucci (1998)
3
Outline
  • What is Metadata?
    • Some Frameworks
  • Packaging Metadata
    • Warwick Framework

  • Structural Metadata
  • Hidden Web Metadata
    • OAI
    • SDARTS


  • Crosswalking and Automated Extraction
  • Metadata formats


  • HTML Metadata
  • AACR2 / TEIH /
    MARC / Z39.50
  • Dublin Core
  • SingCORE
4
Types of metadata
  • Administrative
  • Structural
  • Descriptive
  • Intellectual Property Rights
  • Use
5
Metadata attributes
6
Data types: MIME
  • Multipurpose Internet Mail Extensions
    • (text/plain, image/jpeg, application/msword)
    • Simple format, pre-web
    • Can code an unofficial type using the x- subtype prefix (e.g., audio/x-pn-realaudio)
    • Application tag: need to use an application to handle this data
    • Wild success shows a simple system is best:
      • Good for adoption / authoring
      • Good for common denominator
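
The type/subtype structure and the x- convention can be shown in a few lines. This is a minimal sketch, not part of the lecture: the names MimeType and parse_mime are made up for illustration, and the check covers only the x- subtype prefix mentioned above.

```python
from typing import NamedTuple


class MimeType(NamedTuple):
    top_level: str    # e.g. "text", "image", "audio", "application"
    subtype: str      # e.g. "plain", "jpeg", "x-pn-realaudio"
    unofficial: bool  # True when the subtype carries the "x-" prefix


def parse_mime(value: str) -> MimeType:
    """Split a 'type/subtype' string and flag unofficial x- subtypes."""
    top_level, sep, subtype = value.strip().lower().partition("/")
    if not sep or not top_level or not subtype:
        raise ValueError(f"not a type/subtype pair: {value!r}")
    return MimeType(top_level, subtype, subtype.startswith("x-"))


for example in ("text/plain", "application/msword", "audio/x-pn-realaudio"):
    print(parse_mime(example))
```

Only the last of the three examples is flagged as unofficial; an "application" top-level type signals that a separate application is needed to handle the data.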
7
Complex objects and granularity
    • DOI identifier records: multiple versions of a single document (hi res / low res)
    • Syntax should be mirrored in reference metadata
8
MPEG 7
9
Audio/visual metadata
  • Based again on how people search
    (The Potato Eaters)
    • I’m looking for a picture of a group.
    • I’d like it to be a family group.
    • This family should be doing something that would be typical for a family, like sitting around a table with food in front of them, look grateful for what they have to eat.

  • Facet analysis is a good approach
    • Objective (“of”)
    • Subjective (“about”)
10
HTML Meta tags
  • <HTML><HEAD>
  •   <META NAME="attribute" CONTENT="value">
  • </HEAD>… </HTML>


  • Not regulated or
    controlled
  • You can add your
    own tags
  • Only certain ones
    parsed by finding aids
    (e.g., Googlebot)
  • Many tags use other
    metadata formats
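
As a rough illustration of how a finding aid might read these tags, the sketch below collects name/value pairs using Python's standard html.parser module. It assumes the standard HTML NAME/CONTENT attribute pair; the class name MetaTagCollector, the helper extract_meta, and the sample page are invented for this example.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collect NAME/CONTENT pairs from <META> tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling us.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name is not None and content is not None:
            self.meta[name.lower()] = content


def extract_meta(html_text):
    """Return a dict of all named meta tags found in html_text."""
    collector = MetaTagCollector()
    collector.feed(html_text)
    collector.close()
    return collector.meta


# Hypothetical page: one common tag and one home-made tag.
page = """<HTML><HEAD>
<META NAME="keywords" CONTENT="metadata, digital libraries">
<META NAME="my-own-tag" CONTENT="anything">
</HEAD><BODY>...</BODY></HTML>"""

print(extract_meta(page))
```

The parser accepts the home-made tag just as readily as the common one, which is the point of the "not regulated" bullet: whether a tag is used is entirely up to the consumer.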
11
MARC / AACR 2 / TEIH
  • Machine Readable Cataloging
    • Standard for encoding cataloging data (bibliographic and authority)
    • Standoff Annotation (External)


  • Anglo American Cataloguing Rules 2
    • Set of rules used for collecting bibliographic data and for formulating access points (for authors, titles, subjects, related works, etc.)
    • Regulates format and number of access points


  • Text Encoding Initiative Header
    • Header, similar to <HEAD> in HTML
    • Is located within the document (Internal)


  • Z39.50
    • Protocol for clients to ask queries of servers


  • Librarians use __________ to devise values for fields to be encoded by __________  or in ____________.  This data is accessible by users using the ________protocol.
12
Difficulties in Naming Authorities
  • People
    • Use most common name:
    • Dr Seuss
    • Not Theodor Seuss Geisel
  • Geographic Names
    • Use latest name:
    •   Democratic Republic of the Congo
    • Not Zaïre



  • Data must be constantly updated to provide users with ________________ – not an easy job
13
Dublin Core Elements
  • A __________________ set of metadata attributes used for interoperability.  Has recommended values for some fields.











  • Besides Title, Creator, Publisher, Contributor, 11 other fields:


  • Subject
    • Subject, expressed as keywords, key phrases or classification codes that describe a topic of the resource.
    • value from a controlled vocabulary or formal classification scheme.
14
Dublin Core Elements (Cont’d)
  • Description
    • An account of the content of the resource.
    • (e.g., an abstract, ToC, graphical representation of content or a free-text account of the content)
  • Date
    • creation or availability of the resource
    • ISO 8601 (e.g., YYYY-MM-DD)
  • Type
    • The nature or genre of the content of the resource
    • value from a controlled vocabulary (e.g., DCMI Type Vocabulary)
  • Format
    • Media-type or dimensions. Also, identifies the software, hardware, or other equipment needed to display or operate the resource.
    • value from a controlled vocabulary (e.g., MIME)
  • Identifier
    • Unambiguous reference to the resource within a given context.
    • Use a formal identification system (e.g., URI, DOI, ISBN)
15
DC Elements (Cont’d)
  • Source
    • Reference to a resource from which the present resource is derived (e.g., past edition)
  • Language
    • Language of the intellectual content of the resource.
    • Use RFC 3066 with ISO 639 (e.g., “en-GB”)
  • Relation
    • Reference to a related resource
  • Coverage
    • The extent or scope of the resource (e.g., location, time period)
    • value from a controlled vocabulary
  • Rights
    • Statement of copyright or a reference to one
    • If absent, no assumptions may be made
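
To make the element list concrete, here is a small sketch of a Dublin Core record held as a Python dict, with a check on the Date value and a rendering as HTML meta tags. The "DC." name prefix follows the common convention for embedding Dublin Core in HTML. The helper names is_iso_date and to_meta_tags are hypothetical, and the date in the record is a placeholder rather than a real publication date.

```python
import re
from datetime import date
from html import escape

# Illustrative record. Title, creator and language reuse values that appear in
# these slides; the date is a placeholder, not a real publication date.
record = {
    "title": "Metadata creation and management",
    "creator": "Min-Yen KAN",
    "date": "2000-01-01",
    "type": "Text",
    "format": "text/html",
    "language": "en-GB",
}


def is_iso_date(value):
    """True if value is a real calendar date written as YYYY-MM-DD."""
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):
        return False
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True


def to_meta_tags(dc_record):
    """Render each element as an HTML <meta> tag with a 'DC.' name prefix."""
    return [
        f'<meta name="DC.{element}" content="{escape(value, quote=True)}">'
        for element, value in dc_record.items()
    ]


assert is_iso_date(record["date"])
print("\n".join(to_meta_tags(record)))
```

Only the Date field is validated here; Type, Format and Language would each need a lookup against their own controlled vocabulary.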
16
Warwick Example
17
STARTS: A Metasearching Protocol
  • Defines:
    • Query language
    • Results format
    • Metadata for the collection


  • No specified transport layer or implementation
  • Built to assist metasearchers.
18
Distributed Search? Why?
“Surface” Web vs. “Hidden” Web
19
Hidden Web: Examples
20
Query Probing
  • Idea: Send different queries to categorize data


  • Demo Time!
    (if it works)
21
Focused Probing: Sampling
  • Transform each rule into a query
  • For each query:
    • Send to database
    • Record number of matches
    • Retrieve top-k matching documents
  • At the end of round:
    • Analyze matches for each category
    • Choose category to focus on
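
One round of the procedure above might look like the following sketch. Everything in it is a toy assumption: the in-memory DOCUMENTS list stands in for a remote database, RULES is a made-up set of query-to-category rules, and search is a hypothetical stand-in for the database's real query interface.

```python
from collections import Counter

# Toy stand-in for a searchable hidden-web database (hypothetical documents).
DOCUMENTS = [
    "heart disease and blood pressure",
    "cancer treatment trials",
    "blood tests for cancer screening",
    "basketball playoffs schedule",
]

# Hypothetical classification rules: query word -> category.
RULES = {
    "cancer": "Health",
    "blood": "Health",
    "basketball": "Sports",
    "soccer": "Sports",
}


def search(query, k):
    """Return (number of matches, top-k matching documents)."""
    matches = [doc for doc in DOCUMENTS if query in doc.split()]
    return len(matches), matches[:k]


def probe_round(rules, k=2):
    """Run one round of probing and pick the category to focus on."""
    match_counts = Counter()
    sample = []
    for query, category in rules.items():
        count, top_docs = search(query, k)      # send to database
        match_counts[category] += count         # record number of matches
        for doc in top_docs:                    # keep top-k documents
            if doc not in sample:
                sample.append(doc)
    focus = match_counts.most_common(1)[0][0]   # category with most matches
    return focus, match_counts, sample


focus, counts, sample = probe_round(RULES)
print(focus, dict(counts), len(sample))
```

The retrieved sample doubles as a content summary of the database, which is what the document-frequency estimates on the next slide are computed from.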


22
Adjusting Document Frequencies
  • We know ranking r of words according to document frequency in sample


  • We know absolute document frequency f of some words from one-word queries


  • Mandelbrot’s formula empirically connects word frequency f and ranking r


  • We use curve-fitting to estimate the absolute frequency of all words in sample
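
The curve-fitting step can be sketched as follows, assuming the usual form of Mandelbrot's formula, f = P (r + p)^(-B). The fitting strategy here (a grid search over p with a log-space least-squares line for each candidate) is one simple choice made for this example, not necessarily the method used in the original work; the function names and the candidate range for p are assumptions.

```python
import math


def fit_mandelbrot(ranks, freqs, p_candidates=None):
    """Fit f = P * (r + p) ** (-B) to known (rank, frequency) pairs.

    For a fixed p the model is linear in log space:
        log f = log P - B * log(r + p)
    so we try each candidate p and keep the least-squares best fit.
    Needs at least two distinct ranks.
    """
    if p_candidates is None:
        p_candidates = [i / 10 for i in range(0, 51)]  # 0.0 .. 5.0 (assumed range)
    ys = [math.log(f) for f in freqs]
    n = len(ys)
    mean_y = sum(ys) / n
    best = None
    for p in p_candidates:
        xs = [math.log(r + p) for r in ranks]
        mean_x = sum(xs) / n
        sxx = sum((x - mean_x) ** 2 for x in xs)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        slope = sxy / sxx
        intercept = mean_y - slope * mean_x
        error = sum((intercept + slope * x - y) ** 2 for x, y in zip(xs, ys))
        if best is None or error < best[0]:
            best = (error, math.exp(intercept), p, -slope)
    _, P, p, B = best
    return P, p, B


def estimate_frequency(rank, P, p, B):
    """Estimated absolute document frequency of the word at this rank."""
    return P * (rank + p) ** (-B)
```

In use, the ranks and frequencies passed to fit_mandelbrot would be those of the words whose absolute frequency is known from one-word queries; estimate_frequency then fills in every other rank in the sample.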
23
Focused Probing
  • Algorithm:
    • Send general queries to determine high level category
    • Send progressively more specific queries to determine mid- and lower-level categories
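
The general-to-specific descent can be sketched as a walk down a category tree. The hierarchy, probe words, and documents below are all hypothetical, and counting matches in a local list stands in for sending queries to a real database.

```python
# Hypothetical two-level category hierarchy: node -> {child: probe words}.
HIERARCHY = {
    "Root": {"Health": ["medicine", "disease"], "Sports": ["game", "team"]},
    "Health": {"Cancer": ["tumor", "chemotherapy"], "Heart": ["cardiac", "artery"]},
}

# Toy stand-in for the database being classified.
DOCUMENTS = [
    "new medicine for tumor treatment",
    "chemotherapy and disease outcomes",
    "cardiac disease risk",
    "team game report",
]


def count_matches(documents, word):
    """Number of documents containing the probe word."""
    return sum(1 for doc in documents if word in doc.split())


def classify(documents, hierarchy, root="Root"):
    """Descend from general to specific categories, following the best match."""
    path = []
    node = root
    while node in hierarchy:
        scores = {
            child: sum(count_matches(documents, w) for w in probes)
            for child, probes in hierarchy[node].items()
        }
        best = max(scores, key=scores.get)
        if scores[best] == 0:
            break  # no probe matched; stop at the current level
        path.append(best)
        node = best
    return path


print(classify(DOCUMENTS, HIERARCHY))
```

Only the subtree under the winning category is probed at each level, so the number of queries grows with the depth of the hierarchy rather than with its total size.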
24
Automatic Extraction of Metadata
  • Rule-based scripts (fragile):
    • DC Dot Demo: http://www.ukoln.ac.uk/cgi-bin/dcdot.pl
    • Still heavily cited and used!


  • Wrapper induction: localized extraction
    • Define a local context and features to match and extract


  • Text classification: document-level classification
    • Use features over the entire document to determine classification.
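
A rule-based extractor of the first kind can be as small as the sketch below, which guesses a Title value from an HTML page. The two regular-expression rules and the function name guess_title are illustrative choices and are not taken from DC-dot.

```python
import re

# Two hand-written rules, tried in order.
TITLE_RULE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)
H1_RULE = re.compile(r"<h1[^>]*>(.*?)</h1>", re.IGNORECASE | re.DOTALL)
TAG = re.compile(r"<[^>]+>")


def guess_title(html_text):
    """Return a title from <title>, else the first <h1>, else None."""
    for rule in (TITLE_RULE, H1_RULE):
        match = rule.search(html_text)
        if match:
            text = TAG.sub("", match.group(1))   # drop nested markup
            text = " ".join(text.split())        # normalise whitespace
            if text:
                return text
    return None


print(guess_title("<html><head><TITLE> Metadata\n 101 </TITLE></head></html>"))
```

The fragility is easy to see: a page that puts its title in an image, a styled paragraph, or a script-generated element defeats both rules, and each new layout needs another hand-written rule.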


25
Crosswalking
  • The transfer of metadata from one format to another
    • Retrofitting = ____________


  • Aids accessibility and discovery
  • Complementary to OAI / SDARTS (which are ________ approaches)
  • Mostly done manually by specialists
    • CS work to be done here!
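
At its simplest, a crosswalk is a lookup table from fields in one format to elements in another. The sketch below maps a handful of MARC field tags to Dublin Core elements. The mapping is deliberately simplified for illustration and is not an official crosswalk; the flat (tag, value) record layout and the sample record are assumptions made for this example.

```python
# Simplified, illustrative MARC tag -> Dublin Core element mapping.
# A real crosswalk also deals with subfields, indicators and repeated fields.
MARC_TO_DC = {
    "100": "creator",      # main entry, personal name
    "245": "title",        # title statement
    "260": "publisher",    # publication information
    "520": "description",  # summary note
    "650": "subject",      # topical subject heading
}


def crosswalk(marc_record):
    """Convert a list of (tag, value) pairs into a Dublin Core dict.

    Returns the converted record plus the list of tags that had no mapping,
    so that a specialist can review what was lost.
    """
    dc = {}
    unmapped = []
    for tag, value in marc_record:
        element = MARC_TO_DC.get(tag)
        if element is None:
            unmapped.append(tag)
        else:
            dc.setdefault(element, []).append(value)
    return dc, unmapped


# Hypothetical flattened record, for illustration only.
sample = [
    ("100", "Seuss, Dr."),
    ("245", "Green eggs and ham"),
    ("650", "Stories in rhyme"),
    ("300", "62 p."),
]

print(crosswalk(sample))
```

The unmapped list shows where manual work comes in: fields with no counterpart in the target format are either dropped or need a specialist's judgment.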
26
Reference
  • Dublin Core Tool List
    • http://www.lub.lu.se/tk/metadata/dctoollist.html
    • And many others
  • The Getty Research Institute
    • http://www.getty.edu/research/institute/
  • Crosswalking
    • http://www.ukoln.ac.uk/metadata/interoperability/


27
Summing Up
  • Metadata authoring is highly intricate but serves two complementary purposes
    • Inventory
    • Access (what we care more about)
    • Uses controlled vocabulary (CV) standards (licensing drawback)

  • Automated approaches have promise …
    • To access and annotate more data
    • But generally need rework or NLP post-processing to make the data fit the standard


  • Questions?!?