|
1
|
- Module 6 Min-Yen KAN
- *Parts of this lecture come from
Lilian Tang’s lecture material at the Univ. of Surrey
|
|
2
|
- Data about data
- “Cataloging or indexing information that [information professionals]
create to arrange, describe, and otherwise enhance access to an
information object”
- -- Gilliland-Swetland (1998)
- “Data that describe attributes of a resource, characterize its
relationships, support its discovery and effective use, and exist in an
electronic environment”
|
|
3
|
- What is Metadata?
- Packaging Metadata
- Structural Metadata
- Hidden Web Metadata
- Crosswalking and Automated Extraction
- Metadata formats
- HTML Metadata
- AACR2 / TEIH / MARC / Z39.50
- Dublin Core
- SingCORE
|
|
4
|
- Administrative
- Structural
- Descriptive
- Intellectual Property Rights
- Use
|
|
5
|
|
|
6
|
- Multipurpose Internet Mail Extensions
- (text/plain, image/jpeg, application/msword)
- Simple format, pre-web
- Can encode an unofficial type using the x- subtype prefix (e.g.,
audio/x-pn-realaudio)
- Application tag: need to use an application to handle this data
- Wild success shows a simple system is best:
- Good for adoption / authoring
- Good for common denominator
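
The type/subtype convention above can be sketched in a few lines of Python. This is only an illustration: the function name `parse_media_type` and the returned dictionary layout are assumptions made here, not part of the MIME specification.

```python
def parse_media_type(media_type):
    """Split a 'type/subtype' string and flag unofficial 'x-' subtypes."""
    top, _, sub = media_type.strip().lower().partition("/")
    if not top or not sub:
        raise ValueError(f"not a type/subtype pair: {media_type!r}")
    # By convention, a subtype starting with "x-" is unofficial.
    return {"type": top, "subtype": sub, "unofficial": sub.startswith("x-")}


for example in ("text/plain", "application/msword", "audio/x-pn-realaudio"):
    print(parse_media_type(example))
```

The simplicity of the format shows here: a single split on `/` is enough to decide which application should handle the data.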
|
|
7
|
- DOI identifier records: multiple versions of a single document (hi res
/ low res)
- Syntax should be mirrored in reference metadata
|
|
8
|
|
|
9
|
- Based again on how people search
(The Potato Eaters)
- I’m looking for a picture of a group.
- I’d like it to be a family group.
- This family should be doing something that would be typical for a
family, like sitting around a table with food in front of them, looking
grateful for what they have to eat.
- Facet analysis is a good approach
- Objective (“of”)
- Subjective (“about”)
|
|
10
|
- <HTML><HEAD>
- <META NAME="attribute" CONTENT="value">
- </HEAD>… </HTML>
- Not regulated or controlled
- You can add your own tags
- Only certain ones parsed by finding aids (e.g., GoogleBot)
- Many tags use other metadata formats
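
As a rough sketch of how a finding aid might read these tags, the snippet below collects name/content pairs from a page's META elements using Python's standard `html.parser` module. The class name `MetaExtractor`, the helper `extract_meta`, and the sample page are made up for illustration.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collect name/content pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr_map = dict(attrs)
        name, content = attr_map.get("name"), attr_map.get("content")
        if name and content is not None:
            self.meta[name.lower()] = content


def extract_meta(html):
    parser = MetaExtractor()
    parser.feed(html)
    return parser.meta


# Hypothetical page mixing a plain keyword tag with a Dublin Core style tag.
page = """<html><head>
<meta name="keywords" content="metadata, cataloging">
<meta name="DC.title" content="A sample page">
</head><body>Hello</body></html>"""
print(extract_meta(page))
```

Because nothing regulates the tag names, a real crawler would keep only the names it recognizes and ignore the rest.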
|
|
11
|
- Machine Readable Cataloging
- Standard for encoding cataloging data (bibliographic and authority)
- Standoff Annotation (External)
- Anglo American Cataloguing Rules 2
- Set of rules used for collecting bibliographic data and for formulating
access points (for authors, titles, subjects, related works, etc.)
- Regulates format and number of access points
- Text Encoding Initiative Header
- Header, similar to <HEAD> in HTML
- Is located within the document (Internal)
- Z39.50
- Protocol for clients to ask queries of servers
- Librarians use __________ to devise values for fields to be encoded by
__________ or in ____________. This data is accessible by users using the
________ protocol.
|
|
12
|
- People
- Use most common name:
- Dr Seuss
- Not Theodore Seuss Geisel
- Geographic Names
- Use latest name:
- Namibia
- Not South West Africa
- Democratic Republic of the Congo
- Not Zaïre
- Data must be constantly updated to provide users with ________________ –
not an easy job
|
|
13
|
- A __________________ set of metadata attributes used for
interoperability. Has recommended
values for some fields.
- Besides Title, Creator, Publisher, Contributor, 11 other fields:
- Subject
- Subject, expressed as keywords, key phrases or classification codes
that describe a topic of the resource.
- value from a controlled vocabulary or formal classification scheme.
|
|
14
|
- Description
- An account of the content of the resource.
- (e.g., an abstract, ToC, graphical representation of content or a
free-text account of the content)
- Date
- creation or availability of the resource
- ISO 8601 (e.g., YYYY-MM-DD)
- Type
- The nature or genre of the content of the resource
- value from a controlled vocabulary (e.g., DCMI Type Vocabulary)
- Format
- Media-type or dimensions. Also, identifies the software, hardware, or
other equipment needed to display or operate the resource.
- value from a controlled vocabulary (e.g., MIME)
- Identifier
- Unambiguous reference to the resource within a given context.
- Use a formal identification system (e.g., URI, DOI, ISBN)
|
|
15
|
- Source
- Reference to a resource from which the present resource is derived
(e.g., past edition)
- Language
- Language of the intellectual content of the resource.
- use RFC 3066 with ISO 639 (e.g., “en-GB”)
- Relation
- Reference to a related resource
- Coverage
- The extent or scope of the resource (e.g., location, time period)
- value from a controlled vocabulary
- Rights
- Statement of copyright or a reference to one
- If absent, no assumptions may be made
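
One possible way to put the fifteen elements to work is to write them into HTML META tags using the common `DC.` name prefix. The sketch below assumes a flat record with one string value per element; the function name `dc_to_meta` and the sample values are hypothetical.

```python
from datetime import date
from html import escape

# The fifteen element names listed on the slides, in lowercase.
DC_ELEMENTS = {
    "title", "creator", "publisher", "contributor", "subject",
    "description", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}


def dc_to_meta(record):
    """Render a {element: value} record as a list of <meta> tag strings."""
    lines = []
    for element, value in record.items():
        if element not in DC_ELEMENTS:
            raise KeyError(f"not a Dublin Core element: {element}")
        if element == "date":
            # Raises ValueError if the value is not a valid ISO 8601 date.
            date.fromisoformat(value)
        lines.append(f'<meta name="DC.{element}" content="{escape(value)}">')
    return lines


# Hypothetical record, for illustration only.
sample = {
    "title": "Sample report",
    "date": "2004-03-15",
    "language": "en-GB",
    "format": "text/html",
}
print("\n".join(dc_to_meta(sample)))
```

Only the date is checked here. Fields such as Type, Format, and Coverage would need a lookup against their controlled vocabularies, which this sketch leaves out.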
|
|
16
|
|
|
17
|
- Defines:
- Query language
- Results format
- Metadata for the collection
- No specified transport layer or implementation
- Built to assist metasearchers.
|
|
18
|
|
|
19
|
|
|
20
|
- Idea: Send different queries to categorize data
- Demo Time!
(if it works)
|
|
21
|
- Transform each rule into a query
- For each query:
- Send to database
- Record number of matches
- Retrieve top-k matching documents
- At the end of round:
- Analyze matches for each category
- Choose category to focus on
|
|
22
|
- We know the ranking r of words according to document frequency in the
sample
- We know the absolute document frequency f of some words from one-word
queries
- Mandelbrot’s formula empirically relates word frequency f to rank r
- We use curve-fitting to estimate the absolute frequency of all words in
the sample
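
The curve-fitting step might look like the sketch below. It assumes the usual form of Mandelbrot's formula, f = P (r + p)^(-B), and fits it by ordinary least squares in log space while trying a few candidate values of p. The candidate grid, the function names, and the sample counts are all assumptions made for illustration.

```python
import math


def fit_mandelbrot(known, p_grid=(0.0, 0.5, 1.0, 2.0, 5.0)):
    """Fit f = P * (r + p) ** -B to (rank, frequency) pairs; return (P, p, B).

    Needs at least two pairs with distinct ranks (ranks start at 1).
    """
    best = None
    for p in p_grid:
        # In log space the formula is linear: log f = log P - B * log(r + p).
        xs = [math.log(r + p) for r, _ in known]
        ys = [math.log(f) for _, f in known]
        n = len(known)
        mean_x, mean_y = sum(xs) / n, sum(ys) / n
        sxx = sum((x - mean_x) ** 2 for x in xs)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        slope = sxy / sxx
        intercept = mean_y - slope * mean_x
        error = sum((intercept + slope * x - y) ** 2 for x, y in zip(xs, ys))
        if best is None or error < best[0]:
            best = (error, math.exp(intercept), p, -slope)
    _, P, p, B = best
    return P, p, B


def estimate_frequency(rank, P, p, B):
    """Estimated absolute document frequency of the word at a given rank."""
    return P * (rank + p) ** (-B)


# Hypothetical counts obtained from one-word queries: (rank in sample, matches).
known = [(1, 5000), (10, 700), (100, 60)]
P, p, B = fit_mandelbrot(known)
print(estimate_frequency(50, P, p, B))
```

Words whose frequency was never probed directly then get an estimate from their rank in the sample alone.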
|
|
23
|
- Algorithm:
- Send general queries to determine high level category
- Send progressively more specific queries to determine mid- and
lower-level categories
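
A greedy version of this top-down descent is sketched below. It follows only the best-scoring child at each level, which is a simplification chosen for brevity. The category tree, the probe words, and the match counts are hypothetical.

```python
def classify_hierarchically(count_matches, tree, root="Root"):
    """Walk down the category tree, following the child with the most matches.

    tree maps a category to {child_category: [probe queries]};
    count_matches(query) returns the number of matches the database reports.
    """
    path = []
    node = root
    while tree.get(node):
        scores = {
            child: sum(count_matches(q) for q in queries)
            for child, queries in tree[node].items()
        }
        best = max(scores, key=scores.get)
        if scores[best] == 0:
            break  # no evidence for any child, so stop at this level
        path.append(best)
        node = best
    return path


# Hypothetical hierarchy: general probes at the top, specific ones below.
TREE = {
    "Root": {"Health": ["medicine"], "Sports": ["game"]},
    "Health": {"Cancer": ["chemotherapy"], "Heart": ["cardiology"]},
}
# Hypothetical match counts a database might report for each probe.
COUNTS = {"medicine": 120, "game": 4, "chemotherapy": 75, "cardiology": 10}

print(classify_hierarchically(lambda q: COUNTS.get(q, 0), TREE))
```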
|
|
24
|
- Rule-based scripts (fragile):
- DC Dot Demo: http://www.ukoln.ac.uk/cgi-bin/dcdot.pl
- Still heavily cited and used!
- Wrapper induction: localized extraction
- Define a local context and features to match and extract
- Text classification: whole-document classification
- Use features over the entire document to determine classification.
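
To make the idea of a local context concrete, here is a hand-written left/right delimiter wrapper. A wrapper induction system would learn such delimiters from labelled pages; this sketch skips the learning step, and the delimiters and sample page are made up.

```python
import re


def make_lr_wrapper(left, right):
    """Build an extractor that returns the text found between two delimiters."""
    pattern = re.compile(re.escape(left) + r"(.*?)" + re.escape(right), re.DOTALL)

    def extract(page):
        return [match.strip() for match in pattern.findall(page)]

    return extract


# Hypothetical page layout in which the author sits inside a known span.
extract_author = make_lr_wrapper('<span class="author">', "</span>")
page = '<h1>Report</h1><span class="author"> A. Writer </span><span class="date">2004</span>'
print(extract_author(page))
```

Like the rule-based scripts, such a wrapper breaks as soon as the page layout changes, which is why learning it automatically is attractive.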
|
|
25
|
- The transfer of metadata from one format to another
- Retrofitting = ____________
- Aids accessibility and discovery
- Complementary to OAI / SDARTS (which are ________ approaches)
- Mostly done manually by specialists
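
At its simplest, a crosswalk is a lookup table from one format's fields to another's. The mapping below from MARC field tags to Dublin Core elements is a deliberately simplified assumption for illustration; published crosswalks are more detailed and handle subfields. The sample record is made up.

```python
# Simplified, illustrative mapping from MARC tags to Dublin Core elements.
MARC_TO_DC = {
    "245": "title",
    "100": "creator",
    "260": "publisher",
    "650": "subject",
    "520": "description",
    "020": "identifier",
}


def crosswalk(marc_record):
    """Convert a list of (tag, value) pairs to Dublin Core.

    Returns (dc, unmapped): dc maps each element to a list of values,
    and unmapped lists the tags that had no target element.
    """
    dc = {}
    unmapped = []
    for tag, value in marc_record:
        element = MARC_TO_DC.get(tag)
        if element is None:
            unmapped.append(tag)
        else:
            dc.setdefault(element, []).append(value)
    return dc, unmapped


# Hypothetical record, for illustration only.
record = [
    ("245", "Sample title"),
    ("100", "Sample, Author"),
    ("650", "Metadata"),
    ("650", "Cataloging"),
    ("300", "1 volume"),
]
print(crosswalk(record))
```

The `unmapped` list hints at why the work is mostly manual: fields with no clean counterpart in the target format need a specialist's judgment.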
|
|
26
|
- Dublin Core Tool List
- http://www.lub.lu.se/tk/metadata/dctoollist.html
- And many others
- The Getty Research Institute
- http://www.getty.edu/research/institute/
- Crosswalking
- http://www.ukoln.ac.uk/metadata/interoperability/
|
|
27
|
- Metadata authoring is highly intricate but serves two complementary purposes
- Inventory
- Access (what we care more about)
- Uses controlled vocabulary (CV) standards (licensing is a drawback)
- Automated approaches have promise …
- To access and annotate more data
- But they generally need rework or NLP post-processing to make the data
fit the standard
- Questions?!?
|