1	Metadata creation and management* Module 6 Min-Yen KAN *Parts of this lecture come from Lilian Tang’s lecture material at the Univ. of Surrey
2	What is metadata, anyways? Data about data From the DB community “Cataloging or indexing information that [information professionals] create to arrange, describe, and otherwise enhance access to an information object” -- Gilliland-Swetland (1998) “Data that describe attributes of a resource, characterize its relationships, support its discovery and effective use, and exist in an electronic environment” -- Vellucci (1998)
3	Outline What is Metadata? Some Frameworks Packaging Metadata Warwick Framework Structural Metadata Hidden Web Metadata OAI SDARTS Crosswalking and Automated Extraction Metadata formats HTML Metadata AACR2 / TEIH / MARC / Z39.50 Dublin Core SingCORE
4	Types of metadata Administrative Structural Descriptive Intellectual Property Rights Use
5	Metadata attributes
6	Data types: MIME Multipurpose Internet Mail Extensions (text/plain, image/jpeg, application/msword) Simple format, pre-web Can code an unofficial type using the x- subtype prefix (e.g., audio/x-pn-realaudio) Application tag: need to use an application to handle this data Wild success shows a simple system is best: Good for adoption / authoring Good for common denominator
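A minimal sketch of how the type/subtype syntax on this slide can be handled in code. The helper name parse_media_type is hypothetical, and treating any subtype that starts with "x-" as unofficial is a simplification of the convention described above. The last line uses Python's standard mimetypes module to guess a type from a file extension.

```python
import mimetypes


def parse_media_type(value):
    """Split a 'type/subtype' string and flag unofficial 'x-' subtypes.

    Hypothetical helper for illustration; it ignores parameters such as
    '; charset=utf-8'.
    """
    top, _, sub = value.strip().lower().partition("/")
    if not top or not sub:
        raise ValueError(f"not a type/subtype pair: {value!r}")
    return {"type": top, "subtype": sub, "unofficial": sub.startswith("x-")}


for example in ("text/plain", "application/msword", "audio/x-pn-realaudio"):
    print(parse_media_type(example))

# The standard library can also guess a media type from a file name.
print(mimetypes.guess_type("notes.txt")[0])
```

The small surface area (one slash, one optional prefix) is part of why the scheme was easy to adopt.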
7	Complex objects and granularity DOI identifier records: multiple versions of a single document (hi res / low res) Syntax should be mirrored in reference metadata
8	MPEG 7
9	Audio/visual metadata Based again on how people search (The Potato Eaters) I’m looking for a picture of a group. I’d like it to be a family group. This family should be doing something that would be typical for a family, like sitting around a table with food in front of them, look grateful for what they have to eat. Facet analysis is a good approach Objective (“of”) Subjective (“about”)
10	HTML Meta tags <HTML><HEAD> <META NAME="attribute" CONTENT="value"> </HEAD>… </HTML> Not regulated or controlled You can add your own tags Only certain ones parsed by finding aids (e.g., Googlebot) Many tags use other metadata formats
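A rough sketch of how a finding aid might read meta tags from a page, using Python's standard html.parser module. It assumes the usual name/content attribute pair; the class name MetaTagCollector and the sample page are made up for illustration. Because the tags are unregulated, the collector simply keeps every name it sees and leaves it to the caller to decide which ones to trust.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collect <meta name=... content=...> pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content is not None:
            self.meta[name.lower()] = content


# Hypothetical sample page.
page = """<html><head>
<meta name="keywords" content="metadata, cataloging">
<meta name="DC.title" content="Metadata creation and management">
</head><body>...</body></html>"""

collector = MetaTagCollector()
collector.feed(page)
print(collector.meta)
```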
11	MARC / AACR 2 / TEIH Machine Readable Cataloging Standard for encoding cataloging data (bibliographic and authority) Standoff Annotation (External) Anglo American Cataloguing Rules 2 Set of rules used for collecting bibliographic data and for formulating access points (for authors, titles, subjects, related works, etc.) Regulates format and number of access points Text Encoding Initiative Header Header, similar to <HEAD> in HTML Is located within the document (Internal) Z39.50 Protocol for clients to ask queries of servers Librarians use __________ to devise values for fields to be encoded by __________ or in ____________. This data is accessible by users using the ________protocol.
12	Difficulties in Naming Authorities People Use most common name: Dr. Seuss, not Theodor Seuss Geisel Geographic Names Use latest name: Namibia, not South West Africa Data must be constantly updated to provide users with ________________ – not an easy job
13	Dublin Core Elements A __________________ set of metadata attributes used for interoperability. Has recommended values for some fields. Besides Title, Creator, Publisher, Contributor, 11 other fields: Subject Subject, expressed as keywords, key phrases or classification codes that describe a topic of the resource. value from a controlled vocabulary or formal classification scheme.
14	Dublin Core Elements (Cont’d) Description An account of the content of the resource. (e.g., an abstract, ToC, graphical representation of content or a free-text account of the content) Date creation or availability of the resource ISO 8601 (e.g., YYYY-MM-DD) Type The nature or genre of the content of the resource value from a controlled vocabulary (e.g., DCMI Type Vocabulary) Format Media-type or dimensions. Also, identifies the software, hardware, or other equipment needed to display or operate the resource. value from a controlled vocabulary (e.g., MIME) Identifier Unambiguous reference to the resource within a given context. Use a formal identification system (e.g., URI, DOI, ISBN)
15	DC Elements (Cont’d) Source Reference to a resource from which the present resource is derived (e.g., past edition) Language Language of the intellectual content of the resource. use RFC 3066 with ISO 639 (e.g., “en-GB”) Relation Reference to a related resource Coverage The extent or scope of the resource (e.g., location, time period) value from a controlled vocabulary Rights Statement of copyright or a reference to one If absent, no assumptions may be made
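To make the fifteen elements concrete, here is a small sketch that holds a Dublin Core record as a Python dictionary, checks the element names and the date format, and writes the record out as HTML meta tags with a "DC." prefix. The function names and the sample values (other than the lecture title and author) are hypothetical, and the date check only accepts the full YYYY-MM-DD form of ISO 8601 mentioned on the slide.

```python
from datetime import date
from html import escape

DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def check_record(record):
    """Return a list of problems found in a Dublin Core record (a dict)."""
    problems = []
    for key in record:
        if key not in DC_ELEMENTS:
            problems.append(f"unknown element: {key}")
    if "date" in record:
        try:
            date.fromisoformat(record["date"])
        except ValueError:
            problems.append(f"date is not YYYY-MM-DD: {record['date']}")
    return problems


def to_meta_tags(record):
    """Render each element as an HTML <meta> tag with a 'DC.' name prefix."""
    return [
        f'<meta name="DC.{key}" content="{escape(value, quote=True)}">'
        for key, value in record.items()
    ]


# Sample record; the date and identifier are made-up illustration values.
record = {
    "title": "Metadata creation and management",
    "creator": "Min-Yen Kan",
    "date": "2004-02-16",
    "type": "Text",
    "format": "text/html",
    "language": "en",
    "identifier": "http://example.org/lectures/module6",
}

print(check_record(record))
print("\n".join(to_meta_tags(record)))
```

A real validator would also check Type, Format and Language against their controlled vocabularies; that is left out here to keep the sketch short.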
16	Warwick Example
17	STARTS: A Metasearching Protocol Defines: Query language Results format Metadata for the collection No specified transport layer or implementation Built to assist metasearchers.
18	Distributed Search? Why? “Surface” Web vs. “Hidden” Web
19	Hidden Web: Examples
20	Query Probing Idea: Send different queries to categorize data Demo Time! (if it works)
21	Focused Probing: Sampling Transform each rule into a query For each query: Send to database Record number of matches Retrieve top-k matching documents At the end of round: Analyze matches for each category Choose category to focus on
22	Adjusting Document Frequencies We know ranking r of words according to document frequency in sample We know absolute document frequency f of some words from one-word queries Mandelbrot’s formula connects empirically word frequency f and ranking r We use curve-fitting to estimate the absolute frequency of all words in sample
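Mandelbrot's formula is usually written f = P (r + p)^(-B), where P, p and B are constants fitted to the collection. The sketch below shows one simple way to do the curve fitting in plain Python: try a grid of values for p, and for each one fit log f = log P - B log(r + p) by least squares. The function names, the grid, and the sample numbers are assumptions for illustration, not the procedure used in the original work.

```python
import math


def fit_mandelbrot(points, p_grid=None):
    """Fit f = P * (r + p) ** (-B) to (rank, frequency) pairs.

    Grid search over p; for each p, ordinary least squares in log-log space.
    Returns (P, p, B) for the candidate with the smallest squared error.
    """
    if p_grid is None:
        p_grid = [i * 0.25 for i in range(41)]  # 0.0, 0.25, ..., 10.0
    ys = [math.log(f) for _, f in points]
    n = len(points)
    mean_y = sum(ys) / n
    best = None
    for p in p_grid:
        xs = [math.log(r + p) for r, _ in points]
        mean_x = sum(xs) / n
        sxx = sum((x - mean_x) ** 2 for x in xs)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        slope = sxy / sxx
        intercept = mean_y - slope * mean_x
        error = sum((intercept + slope * x - y) ** 2 for x, y in zip(xs, ys))
        if best is None or error < best[0]:
            best = (error, math.exp(intercept), p, -slope)
    _, P, p, B = best
    return P, p, B


def estimate_frequency(rank, P, p, B):
    """Estimated absolute document frequency of the word at a given rank."""
    return P * (rank + p) ** (-B)


# Made-up example: sample ranks of three words whose absolute document
# frequencies were observed through one-word queries.
known = [(1, 950), (10, 120), (100, 12)]
P, p, B = fit_mandelbrot(known)
print(round(estimate_frequency(50, P, p, B)))
```

Once P, p and B are fitted from the few words with known frequencies, estimate_frequency can be applied to the rank of every other word in the sample.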
23	Focused Probing Algorithm: Send general queries to determine high level category Send progressively more specific queries to determine mid- and lower- categories
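The sampling round and the top-down descent from the focused probing slides can be sketched together as follows. Everything here is a toy: the category hierarchy, the probe words, the in-memory ToyDatabase and its search method are all made up, and the category is chosen by a plain highest-match-count rule rather than the thresholds a real system would use.

```python
# Hypothetical category hierarchy and one-word probe queries per category.
HIERARCHY = {
    "Root": ["Health", "Computers"],
    "Health": ["Diseases", "Fitness"],
    "Computers": ["Programming", "Hardware"],
}
PROBES = {
    "Health": ["medicine", "patient"],
    "Computers": ["software", "keyboard"],
    "Diseases": ["cancer", "diabetes"],
    "Fitness": ["exercise", "yoga"],
    "Programming": ["python", "compiler"],
    "Hardware": ["cpu", "motherboard"],
}


class ToyDatabase:
    """Stand-in for a hidden-web database reachable only through queries."""

    def __init__(self, documents):
        self.documents = documents

    def search(self, word, k):
        hits = [d for d in self.documents if word in d.lower().split()]
        return len(hits), hits[:k]  # number of matches, top-k documents


def focused_probe(db, k=2):
    """Descend the hierarchy, probing the children of the current category."""
    category, path, sample = "Root", ["Root"], []
    while category in HIERARCHY:
        matches_per_child = {}
        for child in HIERARCHY[category]:
            total = 0
            for word in PROBES[child]:
                count, top_docs = db.search(word, k)
                total += count
                for doc in top_docs:
                    if doc not in sample:
                        sample.append(doc)
            matches_per_child[child] = total
        best = max(matches_per_child, key=matches_per_child.get)
        if matches_per_child[best] == 0:
            break  # no evidence for any child; stop at this level
        category = best
        path.append(category)
    return path, sample


docs = [
    "patient treated for diabetes with new medicine",
    "cancer patient support group",
    "diabetes diet and exercise advice",
    "medicine dosage for cancer therapy",
    "keyboard shortcuts for software",
]
path, sample = focused_probe(ToyDatabase(docs))
print(path)
print(len(sample), "documents sampled")
```

The document sample collected along the way is what the frequency adjustment step works from.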
24	Automatic Extraction of Metadata Rule-based scripts (fragile): DC Dot Demo: http://www.ukoln.ac.uk/cgi-bin/dcdot.pl Still heavily cited and used! Wrapper induction: localized extraction Define a local context and features to match and extract Text classification: classification Use features over the entire document to determine classification.
25	Crosswalking The transfer of metadata from one format to another Retrofitting = ____________ Aids accessibility and discovery Complementary to OAI / SDARTS (which are ________ approaches) Mostly done manually by specialists CS work to be done here!
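At its simplest, a crosswalk is a lookup table from fields in one format to fields in another, plus a record of what could not be mapped. The sketch below maps a few MARC tags to Dublin Core elements. The table is a deliberately simplified illustration, not an official crosswalk, it ignores subfields and indicators, and the sample record values are made up.

```python
# Simplified, illustrative mapping from MARC tags to Dublin Core elements.
MARC_TO_DC = {
    "245": "title",        # title statement
    "100": "creator",      # main entry, personal name
    "260": "publisher",    # publication information
    "650": "subject",      # topical subject heading
    "520": "description",  # summary note
    "020": "identifier",   # ISBN
}


def crosswalk(marc_record):
    """Convert {tag: [values]} into {dc_element: [values]}.

    Returns the Dublin Core record and the list of tags with no mapping,
    so a specialist can review what was lost.
    """
    dc_record, unmapped = {}, []
    for tag, values in marc_record.items():
        element = MARC_TO_DC.get(tag)
        if element is None:
            unmapped.append(tag)
            continue
        dc_record.setdefault(element, []).extend(values)
    return dc_record, unmapped


# Made-up sample record.
marc = {
    "245": ["The Cat in the Hat"],
    "100": ["Seuss, Dr."],
    "650": ["Cats -- Fiction", "Stories in rhyme"],
    "300": ["61 p."],
}
dc, leftovers = crosswalk(marc)
print(dc)
print("unmapped tags:", leftovers)
```

The unmapped list is where most of the manual specialist work ends up: fields with no clean counterpart in the target format.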
26	Reference Dublin Core Tool List http://www.lub.lu.se/tk/metadata/dctoollist.html And many others The Getty Research Institute http://www.getty.edu/research/institute/ Crosswalking http://www.ukoln.ac.uk/metadata/interoperability/
27	Summing Up Metadata authoring is highly intricate but serves two complementary purposes: Inventory Access (what we care more about) Uses controlled vocabulary (CV) standards (licensing drawback) Automated approaches have promise … To access and annotate more data But generally need re-work, or NLP post-processing, to make data fit the standard Questions?!?