Manual cataloging and indexing
Week 3                Min-Yen KAN
*Heavily drawn from Lancaster (1998), Indexing and Abstracting in Theory and Practice

Objectives of the Library
Ranganathan (1957)
Books are for use
Every reader his book
Every book its reader
Save time of the reader
The library is a growing organism

Mesopotamian Catalogs
Mesopotamians kept track of their tablets with a list of their incipits:
What is it?
A poem?
_______

Some Definitions
(Subject) Indexing
Assigning index terms to represent a document
Assists in document retrieval
Classification
Assigning a label to a document to assist in organizing that information
Not necessarily semantic labels

Linguistic Relativity
Also known as Sapir-Whorf hypothesis
A loose definition:
Our language to some extent determines the way in which we view and think about the world around us.
An example: time
Tomorrow = day after today
بكرة ("bukra") = some point in the future
The result?
____________  representation
Every representative offers a _____
Many AI researchers reject / ignore this notion

Steps in Subject Indexing
Conceptual analysis
Determine “aboutness”
Computational approaches: TF × IDF
Translation
Expressing the concepts as index terms
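Below is a minimal sketch in Python of the TF × IDF weighting mentioned above for estimating "aboutness"; the toy documents, the length-normalized TF and the log-based IDF are illustrative assumptions, not the only possible formulation.

    import math
    from collections import Counter

    def tf_idf(docs):
        # One common variant: term frequency normalized by document length,
        # times the log of the inverse document frequency.
        n = len(docs)
        doc_terms = [Counter(d.lower().split()) for d in docs]
        df = Counter()                       # in how many documents each term appears
        for terms in doc_terms:
            df.update(terms.keys())
        weights = []
        for terms in doc_terms:
            total = sum(terms.values())
            weights.append({t: (c / total) * math.log(n / df[t])
                            for t, c in terms.items()})
        return weights

    docs = ["the history of sociology",
            "a history of libraries and indexing",
            "indexing and abstracting in practice"]
    for w in tf_idf(docs):
        # The highest-weighted terms approximate the document's "aboutness".
        print(sorted(w.items(), key=lambda kv: -kv[1])[:3])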

Conceptual analysis
Generic: What is it about? What's the main content?
e.g., The History of Sociology
Specific: Why has it been added to our collection? What aspects will our users be interested in?
cf. "Every reader his book"
Thus, organizations index ________
Different _______ (specialty, general interest)
Different _______ (own materials, 3rd party)

Index terms
1. Libraries
2.
3.
4.
5.

Number of index terms in record
Long (Exhaustive)
Gives good _____ at cost of ______
Few records fit in the UI
Hard to figure out which are main aspects
Short (Selective)
Gives good ______ at cost of ____
Less work
In practice: offer levels of indexing for different tasks
(ranging from a few index terms up to a full abstract)

Translation
Extraction: use terms directly from the source itself
Assignment: use terms from an outside source.
Usually from a controlled vocabulary.

Controlled vocabularies
Benefits
(Potentially) high precision, high recall
Question: which of these components is more important?
Drawbacks
Costly to construct and maintain
Difficult to use
Requires knowledge of the CV

Controlled vocabulary objectives
Control / suggest synonyms, pick an authoritative term
Especially for entities: people (maiden names to married names), places (St. Petersburg)
Distinguish among homographs (e.g., mercury, turkey)
Link terms via their relationships (hierarchical is-a relations, plus associative ones)

Difficulties in Naming Authorities
People
Use most common name:
Dr Seuss
Not Theodor Seuss Geisel
Geographic Names
Use latest name:
  Namibia, not South West Africa
  (similarly, the Democratic Republic of the Congo, not Zaïre)
Data must be constantly updated to provide users with best access points – not an easy job

Controlled vocabulary usability
Good structure to find the appropriate term
Standard fields in a CV:
USE/UF: Use instead / Use For (authoritative)
BT/NT: Broader / Narrower Term in terms of hierarchy
RT: Related Term (Associative Term)
Applied by experienced personnel
A large vocabulary can be hard to map to
Question: What to do if the controlled vocabulary has no term for the concept to be indexed?
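As a rough sketch (with made-up terms), the USE/UF, BT/NT and RT fields above can be pictured as one small record per term; following USE references leads an indexer or a program to the authoritative term, while BT/NT/RT support browsing the structure.

    # Hypothetical controlled-vocabulary entries; field names follow the slide (USE/UF, BT, NT, RT).
    thesaurus = {
        "automobiles": {"UF": ["cars", "motor cars"], "BT": ["vehicles"],
                        "NT": ["sports cars"], "RT": ["highways"]},
        "cars":        {"USE": "automobiles"},
    }

    def authoritative(term):
        # Follow USE references until the authoritative (preferred) term is reached.
        entry = thesaurus.get(term, {})
        return authoritative(entry["USE"]) if "USE" in entry else term

    print(authoritative("cars"))           # -> automobiles
    print(thesaurus["automobiles"]["BT"])  # broader terms, for browsing up the hierarchy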

Controlled vocabulary examples
General CVs
Sears List of Subject Headings
More general divisions, not intended for research libraries
Geared towards general subdivisions
Library of Congress Subject Headings (LCSH)
Comprehensive, very large, over five volumes
Domain-specific CV
Medical Subject Headings (MeSH)
Byproduct of indexing at the U.S. National Library of Medicine (NLM)
Art & Architecture Thesaurus (AAT)
Objects, images, architecture, styles
ERIC Thesaurus
Educational materials (journals, lesson plans and computer files)

Classification

Objectives of classification
Uniqueness
Be able to fetch a specific resource given a call number
Notational Permanence
(Seldom) have to reorganize/reassign labels
(e.g., paradigm shift in mathematics)
Comprehensiveness
Can successfully classify most things
Serendipity
Collocate related subjects together
Ease of Use
Ways of resolving ambiguities
(e.g., given religious architecture and Egyptian architecture, where does an article on the architecture of Egyptian temples go?)

Types of classification
Enumerative
Produce an alphabetical list of subject headings, assign numbers to each heading in alphabetical order
Hierarchical
Recursively divides subjects hierarchically, from most general to most specific
Faceted (analytico-synthetic):
Analytic: Divides subjects into mutually exclusive orthogonal facets
Synthetic: Combine facets to get a new class
(from Taylor, 1992)

Dewey Decimal Classification
Divide knowledge into ten classes
Recursively divide these categories into ten (or fewer) classes
Each division assigns another digit
What type of classification scheme is it?
000 Generalities
100 Philosophy & psychology
200 Religion
300 Social sciences
400 Language
500 Natural sciences & mathematics
600 Technology (Applied sciences)
700 The arts
800 Literature & rhetoric
900 Geography & history
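A small sketch of how the decimal notation encodes the hierarchy: each additional digit narrows the class, so a call number can be expanded into its chain of broader classes. Only three real summary captions for one branch (500, 510, 512) are included; an actual schedule is far larger.

    # A few summary captions for one branch; a full DDC schedule has thousands of entries.
    captions = {"5": "Natural sciences & mathematics", "51": "Mathematics", "512": "Algebra"}

    def ddc_path(number):
        # Return the chain of increasingly specific classes encoded by a DDC number.
        digits = number.replace(".", "")
        return [(digits[:i].ljust(3, "0"), captions.get(digits[:i], "?"))
                for i in range(1, len(digits) + 1)]

    print(ddc_path("512"))
    # [('500', 'Natural sciences & mathematics'), ('510', 'Mathematics'), ('512', 'Algebra')]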

ACM Classification scheme
Four-level tree
(3 coded levels and a fourth uncoded level)
16 General Terms
H. Information Systems
H.0 GENERAL
H.1 MODELS AND PRINCIPLES
H.2 DATABASE MANAGEMENT (E.5)
H.3 INFORMATION STORAGE AND RETRIEVAL
H.4 INFORMATION SYSTEMS APPLICATIONS
H.5 INFORMATION INTERFACES AND PRESENTATION (e.g., HCI) (I.7)
H.m MISCELLANEOUS
I. Computing Methodologies
I.0 GENERAL
I.1 SYMBOLIC AND ALGEBRAIC MANIPULATION
I.2 ARTIFICIAL INTELLIGENCE
I.3 COMPUTER GRAPHICS
I.4 IMAGE PROCESSING AND COMPUTER VISION
I.5 PATTERN RECOGNITION
I.6 SIMULATION AND MODELING (G.3)
I.7 DOCUMENT AND TEXT PROCESSING (H.4, H.5)
I.m MISCELLANEOUS

Faceted Indexing
Facet – a characteristic of the resource (e.g., language)
Each facet organized hierarchically
allow drill-down browsing
represented by
set values (taxonomy)
continuous values (spectrum)
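A minimal sketch of drill-down browsing over facets; the facet names (language, period) and the records are hypothetical.

    # Hypothetical records, each described by a few facets.
    records = [
        {"title": "Wooden chairs", "language": "en", "period": "18th century"},
        {"title": "Chaises en bois", "language": "fr", "period": "18th century"},
        {"title": "Modern tables", "language": "en", "period": "20th century"},
    ]

    def drill_down(items, **facet_values):
        # Keep only items whose facets match every selected value.
        return [r for r in items
                if all(r.get(f) == v for f, v in facet_values.items())]

    print(drill_down(records, language="en"))
    print(drill_down(records, language="en", period="18th century"))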

Colon Classification
Ranganathan proposed 5 basic facets (PMEST):
Personality – the subject matter
Material
Energy – process or action
Space
Time
Each facet has its own classification schedule
String together the notation from each facet to get the classification number
Example:
The design of wooden furniture in 18th century America
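A rough sketch of the synthetic step for this example: look up each PMEST facet in its own (hypothetical) schedule and string the notations together. The notation values below are invented for illustration and are not actual Colon Classification numbers.

    # Hypothetical facet schedules (real Colon Classification notation differs).
    schedules = {
        "Personality": {"furniture": "F1"},
        "Material":    {"wood": "M3"},
        "Energy":      {"design": "E2"},
        "Space":       {"America": "S7"},
        "Time":        {"18th century": "T18"},
    }

    def colon_number(**facets):
        # String the facet notations together in PMEST order, separated by colons.
        order = ["Personality", "Material", "Energy", "Space", "Time"]
        return ":".join(schedules[f][facets[f]] for f in order if f in facets)

    print(colon_number(Personality="furniture", Material="wood",
                       Energy="design", Space="America", Time="18th century"))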

Classification Maintenance
DDC and LCSH _____
centralized systems
Nowadays, rely on a distributed approach to update
Either hierarchically determined authorities
Or arbitration of conflicts
Think of CVS and other source control systems

To think about…
Now that we have free-text searching, do you feel controlled vocabularies are still necessary or not? What do you feel their impact will be in the future of the digital library?
How would you improve the ACM classification scheme?  How to deal with legacy schemes?
Booksellers also need to use classification to shelve books.  Which type of classification do you think booksellers use?  Would you make any adaptations to the classification schemes shown today?

Metadata creation and management*
*Parts of this lecture come from
Lilian Tang’s lecture material at the Univ. of Surrey

What is metadata, anyways?
Data about data
From the DB community
"Cataloging or indexing information that [information professionals] create to arrange, describe, and otherwise enhance access to an information object"
-- Gilliland-Swetland (1998)
"Data that describes attributes of a resource, characterizes its relationships, supports its discovery and effective use, and exists in an electronic environment"
-- Vellucci (1998)

Outline
What is Metadata?
Some Frameworks
Packaging Metadata
Warwick Framework
Structural Metadata
Hidden Web Metadata
OAI
SDARTS
Crosswalking and Automated Extraction
Metadata formats
HTML Metadata
AACR2 / TEIH / MARC / Z39.50
Dublin Core

Types of metadata
Administrative
Structural
Descriptive
Intellectual Property Rights
Use

Metadata attributes

Data types: MIME
Multipurpose Internet Mail Extensions
(e.g., text/plain, image/jpeg, application/msword)
Simple format, pre-web
Can code an unofficial type by prefixing the subtype with x- (e.g., audio/x-pn-realaudio)
Application tag: need to use an application to handle this data
Wild success shows a simple system is best:
Good for adoption / authoring
Good for common denominator
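As a small illustration, Python's standard mimetypes module maps filenames to MIME types of the kind listed above; the filenames are made up, and unknown extensions simply come back as None.

    import mimetypes

    for name in ["report.txt", "scan.jpg", "thesis.doc", "page.html"]:
        mime, _ = mimetypes.guess_type(name)
        print(name, "->", mime)   # e.g. report.txt -> text/plain, scan.jpg -> image/jpeg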

Complex objects and granularity
DOI identifier records: multiple versions of a single document (hi res / low res)
Syntax should be mirrored in reference metadata

MPEG 7

Audio/visual metadata
Based again on how people search
(The Potato Eaters)
I’m looking for a picture of a group.
I’d like it to be a family group.
This family should be doing something that would be typical for a family, like sitting around a table with food in front of them, looking grateful for what they have to eat.
Facet analysis is a good approach
Objective (“of”)
Subjective (“about”)

HTML Meta tags
<HTML><HEAD>
  <META NAME="attribute" CONTENT="value">
</HEAD>… </HTML>
Not regulated or controlled
You can add your own tags
Only certain ones parsed by finding aids (e.g., GoogleBot)
Many tags use other metadata formats
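A minimal sketch, using Python's standard html.parser, of how a finding aid might pull NAME/CONTENT pairs out of META tags; the sample page and its tag values are made up.

    from html.parser import HTMLParser

    class MetaExtractor(HTMLParser):
        # Collect name -> content pairs from <meta> tags in the document head.
        def __init__(self):
            super().__init__()
            self.meta = {}
        def handle_starttag(self, tag, attrs):
            if tag == "meta":
                d = dict(attrs)
                if "name" in d and "content" in d:
                    self.meta[d["name"].lower()] = d["content"]

    page = """<html><head>
      <meta name="description" content="Lecture notes on cataloging">
      <meta name="DC.Creator" content="Min-Yen Kan">
    </head><body>...</body></html>"""

    p = MetaExtractor()
    p.feed(page)
    print(p.meta)   # {'description': ..., 'dc.creator': ...}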

MARC / AACR 2 / TEIH
Machine Readable Cataloging
Standard for encoding cataloging data (bibliographic and authority)
Standoff Annotation (External)
Anglo American Cataloguing Rules 2
Set of rules used for collecting bibliographic data and for formulating access points (for authors, titles, subjects, related works, etc.)
Regulates format and number of access points
Text Encoding Initiative Header
Header, similar to <HEAD> in HTML
Is located within the document (Internal)
Z39.50
Protocol for clients to ask queries of servers
Librarians use AACR2 / TEI rules to devise values for fields, encoded externally in MARC or internally in a TEIH. This data is then accessible to users via the Z39.50 protocol.
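A rough sketch of the division of labour just described: the same AACR2-derived values encoded externally as (heavily simplified) MARC fields and internally as a TEI-header-like string. The MARC tags shown (100, 245, 260, 650) are common bibliographic fields, but real records carry indicators and subfields omitted here.

    # AACR2 governs what the values and access points are ...
    record = {"author": "Geisel, Theodor Seuss", "title": "The Cat in the Hat",
              "published": "New York : Random House, 1957", "subject": "Cats -- Fiction"}

    # ... MARC encodes them externally (standoff), keyed by numeric field tags ...
    marc = {"100": record["author"], "245": record["title"],
            "260": record["published"], "650": record["subject"]}

    # ... while a TEI header carries similar data inside the document itself.
    teih = ("<teiHeader><fileDesc><titleStmt>"
            f"<title>{record['title']}</title><author>{record['author']}</author>"
            "</titleStmt></fileDesc></teiHeader>")

    print(marc)
    print(teih)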

Data Types
Used to describe the different types of (complex) objects in the digital library
Structural facets of documents

Dublin Core Elements
A common denominator set of metadata attributes used for interoperability.  Has recommended values for some fields.
Besides Title, Creator, Publisher, Contributor, 11 other fields:
Subject
Subject, expressed as keywords, key phrases or classification codes that describe a topic of the resource.
value from a controlled vocabulary or formal classification scheme.

Dublin Core Elements (Con’t)
Description
An account of the content of the resource.
(e.g., an abstract, ToC, graphical representation of content or a free-text account of the content)
Date
creation or availability of the resource
ISO 8601 (e.g., YYYY-MM-DD)
Type
The nature or genre of the content of the resource
value from a controlled vocabulary (e.g., DCMI Type Vocabulary)
Format
Media-type or dimensions. Also, identifies the software, hardware, or other equipment needed to display or operate the resource.
value from a controlled vocabulary (e.g., MIME)
Identifier
Unambiguous reference to the resource within a given context.
Use a formal identification system (e.g., URI, DOI, ISBN)

DC Elements (Con’t)
Source
Reference to a resource from which the present resource is derived (e.g., past edition)
Language
Language of the intellectual content of the resource.
use RFC 3066 with ISO 639 (e.g., "en-GB")
Relation
Reference to a related resource
Coverage
The extent or scope of the resource (e.g., location, time period)
value from a controlled vocabulary
Rights
Statement of copyright or a reference to one
If absent, no assumptions may be made
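A hypothetical Dublin Core record for these lecture notes, showing the recommended encodings mentioned above (ISO 8601 dates, MIME formats, RFC 3066 language codes); all of the values are illustrative.

    dc_record = {
        "Title":       "Manual cataloging and indexing (Week 3 lecture notes)",
        "Creator":     "Kan, Min-Yen",
        "Subject":     "Indexing; Classification; Controlled vocabularies",
        "Description": "Lecture slides on subject indexing, classification and metadata.",
        "Date":        "2004-09-01",          # ISO 8601 (illustrative date)
        "Type":        "Text",                 # DCMI Type Vocabulary
        "Format":      "application/pdf",      # MIME
        "Identifier":  "http://example.org/week3-slides",   # hypothetical URI
        "Language":    "en-SG",                # RFC 3066 / ISO 639
        "Rights":      "Copyright the author",
    }
    for element, value in dc_record.items():
        print(f"{element}: {value}")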

Warwick Example

STARTS: A Metasearching Protocol
Defines:
Query language
Results format
Metadata for the collection
No specified transport layer or implementation
Built to assist metasearchers.

Distributed Search? Why?
“Surface” Web vs. “Hidden” Web

Hidden Web: Examples

Query Probing
Idea: Send different queries to categorize data
Demo Time!
(if it works)

Focused Probing: Sampling
Transform each rule into a query
For each query:
Send to database
Record number of matches
Retrieve top-k matching documents
At the end of round:
Analyze matches for each category
Choose category to focus on
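A rough sketch of one sampling round, assuming a hypothetical search interface search(query, k) that returns a match count plus the top-k documents; the rule-to-query transformation is simplified to sending the rule's keywords directly. The category chosen at the end of the round would drive the next, more specific round of probing.

    def probe_round(rules, search, k=4):
        # One round of query probing: send each rule as a query, tally evidence per category.
        matches_per_category = {}
        sample = []
        for category, keywords in rules:            # e.g. ("Health", ["cancer", "aids"])
            query = " ".join(keywords)              # transform rule into a query
            count, top_docs = search(query, k)      # send to database, record number of matches
            matches_per_category[category] = matches_per_category.get(category, 0) + count
            sample.extend(top_docs)                 # retrieve top-k matching documents
        focus = max(matches_per_category, key=matches_per_category.get)
        return focus, sample                        # category to focus on in the next round

    # Toy stand-in for a real (hidden-web) search interface.
    def fake_search(query, k):
        corpus = ["cancer treatment trial", "aids research funding", "basketball scores"]
        hits = [d for d in corpus if any(w in d for w in query.split())]
        return len(hits), hits[:k]

    print(probe_round([("Health", ["cancer", "aids"]), ("Sports", ["basketball"])], fake_search))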

Adjusting Document Frequencies
We know ranking r of words according to document frequency in sample
We know absolute document frequency f of some words from one-word queries
Mandelbrot's formula empirically connects word frequency f and rank r
We use curve-fitting to estimate the absolute frequency of all words in sample
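A sketch of this adjustment, assuming Mandelbrot's form f(r) = P (r + p)^(-B) and that NumPy/SciPy are available: fit P, p and B from the few words whose absolute document frequencies are known, then extrapolate to every word in the sample. The data points below are made up; a real run would use the probe results.

    import numpy as np
    from scipy.optimize import curve_fit

    def mandelbrot(r, P, p, B):
        # Mandelbrot's generalization of Zipf's law: frequency as a function of rank.
        return P * (r + p) ** (-B)

    # Absolute document frequencies known for a few words (from one-word probe queries),
    # indexed by their rank in the sample; these numbers are made up.
    known_ranks = np.array([1.0, 5.0, 20.0, 100.0])
    known_freqs = np.array([5000.0, 1200.0, 350.0, 80.0])

    params, _ = curve_fit(mandelbrot, known_ranks, known_freqs,
                          p0=[5000.0, 1.0, 1.0], bounds=(0, np.inf))

    # Estimate the absolute frequency of every sampled word from its rank alone.
    all_ranks = np.arange(1, 201, dtype=float)
    estimated = mandelbrot(all_ranks, *params)
    print(estimated[:5].round(1))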

Focused Probing
Algorithm:
Send general queries to determine high level category
Send progressively more specific queries to determine mid- and lower- categories

Automatic Extraction of Metadata
Rule-based scripts (fragile):
DC Dot Demo: http://www.ukoln.ac.uk/cgi-bin/dcdot.pl
Still heavily cited and used!
Wrapper induction: localized extraction
Define a local context and features to match and extract
Text classification: document-level classification
Use features over the entire document to determine the class
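A toy illustration of the "localized extraction" idea behind wrapper induction: a stable local context around the target field (here a hand-written regular expression standing in for an induced rule) is used to extract it. The pages and the field are made up.

    import re

    # Made-up semi-structured pages sharing one layout (e.g., generated by the same CMS).
    pages = [
        "<b>Author:</b> J. Smith<br><b>Title:</b> Probing the hidden web",
        "<b>Author:</b> A. Tan<br><b>Title:</b> Crosswalking metadata",
    ]

    # The wrapper exploits the stable local context around the field it extracts;
    # here the rule is hand-written, where wrapper induction would learn it from examples.
    author_rule = re.compile(r"<b>Author:</b>\s*(.*?)<br>")

    for page in pages:
        match = author_rule.search(page)
        print(match.group(1) if match else None)   # -> J. Smith, A. Tan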

MeURLin
Classification of URLs into Open Directory Project categories
http://www.onlineshawnee.com/stories/072901/ent shelton.shtml
Doesn’t require webpage, just address
About 1/3 to 1/2 as accurate as approaches that use the full page text
Uses scalable segmentation and expansion techniques
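A rough sketch of the kind of URL segmentation such an approach starts from: split the address into tokens that can serve as classification features, which further segmentation and expansion would then refine. The splitting heuristics and the example URL are simplified assumptions, not the actual MeURLin algorithm.

    import re

    STOP = {"http", "https", "www", "com", "html", "htm", "shtml"}

    def url_tokens(url):
        # Split a URL into candidate feature tokens (simplified segmentation).
        parts = re.split(r"[/\.\-_?=&:]+", url.lower())
        return [p for p in parts if p and p not in STOP]

    # Hypothetical address; note that no page content is needed, only the URL itself.
    print(url_tokens("http://www.example.com/sports/2001/07/29/red-sox-win.html"))
    # -> ['example', 'sports', '2001', '07', '29', 'red', 'sox', 'win']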

Crosswalking
The transfer of metadata from one format to another
Retrofitting = updating old metadata to a newer format
Aids accessibility and discovery
Complementary to OAI / SDARTS (which are centralized approaches)
Mostly done manually by specialists
CS work to be done here!
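As a toy sketch, a crosswalk can be seen as a mapping table applied to each record; the (simplified) MARC-to-Dublin-Core mapping below covers only a few fields, where real crosswalks must handle many more fields and subfield nuances.

    # Illustrative subset of a MARC-to-Dublin-Core crosswalk.
    MARC_TO_DC = {"100": "Creator", "245": "Title", "260": "Publisher", "650": "Subject"}

    def crosswalk(marc_record):
        # Translate a flat MARC-like record into Dublin Core elements.
        return {MARC_TO_DC[tag]: value
                for tag, value in marc_record.items() if tag in MARC_TO_DC}

    marc = {"100": "Geisel, Theodor Seuss", "245": "The Cat in the Hat",
            "260": "New York : Random House, 1957", "650": "Cats -- Fiction"}
    print(crosswalk(marc))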

Reference
Dublin Core Tool List
http://www.lub.lu.se/tk/metadata/dctoollist.html
And many others
The Getty Research Institute
http://www.getty.edu/research/institute/
Crosswalking
http://www.ukoln.ac.uk/metadata/interoperability/

Summing Up
Metadata authoring is highly intricate but serves two complementary purposes
Inventory
Access (what we care more about)
Uses CV standards (licensing drawback)
Automated approaches have promise …
To access and annotate more data
But generally needs re-work or NLP post-processing to make the data fit the standard
Questions?!?