Manual cataloging and indexing

Week 3 Min-Yen KAN

*heavily drawn from Lancaster (98) Indexing and Abstracting in Theory and Practice

Objectives of the Library

Ranganathan (1957)

Books are for use

Every reader his book

Every book its reader

Save time of the reader

The library is a growing organism

Mesopotamian Catalogs

Mesopotamians kept track of their tablets with a list of their incipits:

What is it?

A poem?

_______

Some Definitions

(Subject) Indexing

Assigning index terms to represent a document

Assists in document retrieval

Classification

Assigning a label to a document to assist in organizing that information

Not necessarily semantic labels

Linguistic Relativity

Also known as Sapir-Whorf hypothesis

A loose definition:

Our language to some extent determines the way in which we view and think about the world around us.

An example: time

Tomorrow = day after today

ةﺮﻜﺑ (“bukra”) = some point in the future

The result?

____________ representation

Every representative offers a _____

Many AI researchers reject / ignore this notion

Steps in Subject Indexing

Conceptual analysis

Determine “aboutness”

Computational approaches: TF × IDF

Translation

Expressing the concepts as index terms

Conceptual analysis

Generic: What is it about? What’s the main content

e.g., The History of Sociology

Specific: Why has it been added to our collection? What aspects will our users be interested in?

c.f., “Every reader his book”

Thus, organizations index ________

Different _______ (specialty, general interest)

Different _______ (own materials, 3^rd party)

Index terms

1. Libraries

2.

3.

4.

5.

Number of index terms in record

Long (Exhaustive)

Gives good _____ at cost of ______

Few records fit in the UI

Hard to figure out which are main aspects

Short (Selective)

Gives good ______ at cost of ____

Less work

In practice: offer levels of indexing for tasks

Index Terms

Abstract

Translation

Extraction: use terms directly from the source itself

Assignment: use terms from an outside source.

Usually from a controlled vocabulary.

Controlled vocabularies

Benefits

(Potentially) high precision, high recall

Question: which of these components is more important?

Drawbacks

Costly to construct and maintain

Is difficult to use

Need CV knowledge

Controlled vocabulary objectives

Control / suggest synonyms, pick an authoritative term

Especially for entities: people (maiden names to married names), places (St. Petersburg)

Distinguish among homographs (e.g., mercury, turkey)

Link terms with their relationship (is-a and all others (associative))

Difficulties in Naming Authorities

People

Use most common name:

Dr Seuss

Not Theodore Seuss Geisel

Geographic Names

Use latest name:

Namibia

Not Zaïre

Data must be constantly updated to provide users with best access points – not an easy job

Controlled vocabulary usability

Good structure to find the appropriate term

Standard fields in an CV:

USE/UF: Use instead / Use For (authoritative)

BT/NT: Broader / Narrower Term in terms of hierarchy

RT: Related Term (Associative Term)

Applied by experienced personnel

A large vocabulary can be hard to map to

Question: What to do if the controlled vocabulary has no term for the concept to be indexed?

Controlled vocabulary examples

General CVs

Sears List of Subject Headings

More general divisions, not intended for research libraries

Geared towards general subdivisions

Library of Congress Subject Headings (LCSH)

Comprehensive, very large, over five volumes

Domain-specific CV

Medical Subject Headings (MeSH)

Byproduct of indexing the NLM

Art & Architecture Thesaurus (AAT)

Object, images, architecture, styles

ERIC Thesaurus

Educational materials (journals, lesson plans and computer files)

Classification

Objectives of classification

Uniqueness

Be able to fetch a specific resource given a call number

Notational Permanence

(Seldom) have to reorganize/reassign labels

(e.g., paradigm shift in mathematics)

Comprehensiveness

Can successfully classify most things

Serendipity

Collocate related subjects together

Ease of Use

Ways of resolving ambiguities

(e.g., given religious architecture and Egyptian architecture, where does an article on the architecture of Egyptian temples go?)

Types of classification

Enumerative

Produce an alphabetical list of subject headings, assign numbers to each heading in alphabetical order

Hierarchical

Recursively divides subjects hierarchically, from most general to most specific

Faceted (analytico-synthetic):

Analytic: Divides subjects into mutually exclusive orthogonal facets

Synthetic: Combine facets to get a new class

- From Taylor (92)

Dewey Decimal Classification

Divide knowledge into ten classes

Recursively divide these categories into ten (or fewer classes)

Assign another digit

What type of classification scheme is it?

000 Generalities

100 Philosophy & psychology

200 Religion

300 Social sciences

400 Language

500 Natural sciences & mathematics

600 Technology (Applied sciences)

700 The arts

800 Literature & rhetoric

900 Geography & history

ACM Classification scheme

Four-level tree

3 coded levels and

a fourth uncoded level)

16 General Terms

H. Information Systems

H.0 GENERAL

H.1 MODELS AND PRINCIPLES

H.2 DATABASE MANAGEMENT (E.5)

H.3 INFORMATION STORAGE AND RETRIEVAL

H.4 INFORMATION SYSTEMS APPLICATIONS

H.5 INFORMATION INTERFACES ANDPRESENTATION (e.g., HCI) (I.7)

H.m MISCELLANEOUS

I. Computing Methodologies

I.0 GENERAL

I.1 SYMBOLIC AND ALGEBRAIC MANIPULATION

I.2 ARTIFICIAL INTELLIGENCE

I.3 COMPUTER GRAPHICS

I.4 IMAGE PROCESSING AND COMPUTER VISION

I.5 PATTERN RECOGNITION

I.6 SIMULATION AND MODELING (G.3)

I.7 DOCUMENT AND TEXT PROCESSING (H.4, H.5)

I.m MISCELLANEOUS

Faceted Indexing

Facet – a characteristic of the resource (e.g., language)

Each facet organized hierarchically

allow drill-down browsing

represented by

set values (taxonomy)

continuous values (spectrum)

Colon Classification

Raganathan proposed 5 basic facets (PMEST):

Personality – the subject matter

Material

Energy – process or action

Space

Time

Each facet would have
its own classification schedule

String together notation
to get classification number

Example:

The design of wooden furniture in 18th century America

Classification Maintenance

DDC and LCSH _____
centralized systems

Nowadays, rely on a distributed approach to update

Either hierarchically determined authorities

Or arbitration of conflicts

Think CVS and source control systems

To think about…

Now that we have free-text searching, do you feel controlled vocabularies are still necessary or not? What do you feel their impact will be in the future of the digital library?

How would you improve the ACM classification scheme? How to deal with legacy schemes?

Booksellers also need to use classification to shelve books. Which type of classification do you think booksellers use? Would you make any adaptations to the classification schemes shown today?

Metadata creation and management*

*Parts of this lecture come from
Lilian Tang’s lecture material at the Univ. of Surrey

What is metadata, anyways?

Data about data

From the DB community

“Cataloging or indexing information that [information professions] create to arrange, describe, and otherwise enhance access to an information object”

-- Gilliland-Swetland (1998)

“Data that describes attributes of a resource, characterize its relationships, support its discover and effective use and exist in an electronic environment”

-- Vellucci (1998)

Outline

What is Metadata?

Some Frameworks

Packaging Metadata

Warwick Framework

Structural Metadata

Hidden Web Metadata

OAI

SDARTS

Crosswalking and Automated Extraction

Metadata formats

HTML Metadata

AACR2 / TEIH /
MARC / Z39.50

Dublin Core

Types of metadata

Administrative

Structural

Descriptive

Intellectual Property Rights

Use

Metadata attributes

Data types: MIME

Multipurpose Internet Mail Extensions

(text/plain, img/jpg application/msword)

Simple format, pre-web

Can code an unofficial type using x-subtype prefix (e.g., audio/x-pn-realaudio)

Application tag: need to use an application to handle this data

Wild success shows a simple system is best:

Good for adoption / authoring

Good for common denominator

Complex objects and granularity

DOI identifier records: multiple versions of a single document (hi res / low res)

Syntax should be mirrored in reference metadata

MPEG 7

Audio/visual metadata

Based again on how people search
(The Potato Eaters)

I’m looking for a picture of a group.

I’d like it to be a family group.

This family should be doing something that would be typical for a family, like sitting around a table with food in front of them, look grateful for what they have to eat.

Facet analysis is a good approach

Objective (“of”)

Subjective (“about”)

HTML Meta tags

<HTML><HEAD>

<META NAME=“attribute” VALUE=“value”>

</HEAD>… </HTML>

Not regulated or
controlled

You can add your
own tags

Only certain ones
parsed by finding aids
(e.g., GoogleBot)

Many tags use other
metadata formats

MARC / AACR 2 / TEIH

Machine Readable Cataloging

Standard for encoding cataloging data (bibliographic and authority)

Standoff Annotation (External)

Anglo American Cataloguing Rules 2

Set of rules used for collecting bibliographic data and for formulating access points (for authors, titles, subjects, related works, etc.)

Regulates format and number of access points

Text Encoding Initiative Header

Header, similar to <HEAD> in HTML

Is located within the document (Internal)

Z39.50

Protocol for clients to ask queries of servers

Librarians use AACR2 / TEI to devise values for fields to be encoded by MARC (external) or in TEIH (internal). This data is accessible by users using the Z39.50 protocol.

Data Types

Used to describe the different types of (complex) objects in the digital library

Structural facets of documents

Dublin Core Elements

A common denominator set of metadata attributes used for interoperability. Has recommended values for some fields.

Besides Title, Creator, Publisher, Contributor, 11 other fields:

Subject

Subject, expressed as keywords, key phrases or classification codes that describe a topic of the resource.

value from a controlled vocabulary or formal classification scheme.

Dublin Core Elements (Con’t)

Description

An account of the content of the resource.

(e.g., an abstract, ToC, graphical representation of content or a free-text account of the content)

Date

creation or availability of the resource

ISO 8601 (e.g., YYYY-MM-DD)

Type

The nature or genre of the content of the resource

value from a controlled vocabulary (e.g., DCMI Type Vocabulary)

Format

Media-type or dimensions. Also, identifies the software, hardware, or other equipment needed to display or operate the resource.

value from a controlled vocabulary (e.g., MIME)

Identifier

Unambiguous reference to the resource within a given context.

Use a formal identification system, (e.g., URI, DOI, ISBN)

DC Elements (Con’t)

Source

Reference to a resource from which the present resource is derived (e.g., past edition)

Language

Language of the intellectual content of the resource.

use RFC 3066 with ISO639 (e.g., “en-GB”)

Relation

Reference to a related resource

Coverage

The extent or scope of the resource (e.g., location, time period)

value from a controlled vocabulary

Rights

Statement of copyright or a reference to one

If absent, no assumptions may be made

Warwick Example

STARTS: A Metasearching Protocol

Defines:

Query language

Results format

Metadata for the collection

No specified transport layer or implementation

Built to assist metasearchers.

Distributed Search? Why?
“Surface” Web vs. “Hidden” Web

Hidden Web: Examples

Query Probing

Idea: Send different queries to categorize data

Demo Time!
(if it works)

Focused Probing: Sampling

Transform each rule into a query

For each query:

Send to database

Record number of matches

Retrieve top-k matching documents

At the end of round:

Analyze matches for each category

Choose category to focus on

Adjusting Document Frequencies

We know ranking r of words according to document frequency in sample

We know absolute document frequency f of some words from one-word queries

Mandelbrot’s formula connects empirically word frequency f and ranking r

We use curve-fitting to estimate the absolute frequency of all words in sample

Focused Probing

Algorithm:

Send general queries to determine high level category

Send progressively more specific queries to determine mid- and lower- categories

Automatic Extraction of Metadata

Rule-based scripts (fragile):

DC Dot Demo: http://www.ukoln.ac.uk/cgi-bin/dcdot.pl

Still heavily cited and used!

Wrapper induction: localized extraction

Define a local context and features to match and extract

Text classification: classification

Use features over the entire document to determine classification.

MeURLin

Classification of URLs to the
Open Directory Project

http://www.onlineshawnee.com/stories/072901/ent shelton.shtml

Doesn’t require webpage, just address

About 1/2 - 1/3 as accurate as full words approaches

Uses scalable segmentation and expansion techniques

Crosswalking

The transfer of metadata from one format to another

Retrofitting = updating old metadata to a newer format

Aids accessibility and discovery

Complementary to OAI / SDARTS (which are centralized approaches)

Mostly done manually by specialists

CS work to be done here!

Reference

Dublin Core Tool List

http://www.lub.lu.se/tk/metadata/dctoollist.html

And many others

The Getty Research Institute

http://www.getty.edu/research/institute/

Crosswalking

http://www.ukoln.ac.uk/metadata/interoperability/

Summing Up

Metadata authoring highly intricate but two complementary purposes

Inventory

Access (what we care more about)

Uses CV standards (licensing drawback)

Automated approaches have promise …

To access and annotate more data

But generally needs re-work, or NLP post-processing to make data fit standard

Questions?!?


	Week 3 Min-Yen KAN
	*heavily drawn from Lancaster (98) Indexing and Abstracting in Theory and Practice


	Ranganathan (1957)
	Books are for use
	Every reader his book
	Every book its reader
	Save time of the reader
	The library is a growing organism


	Mesopotamians kept track of their tablets with a list of their incipits:








	What is it?
	A poem?
	_______


	(Subject) Indexing
		Assigning index terms to represent a document
		Assists in document retrieval

	Classification
		Assigning a label to a document to assist in organizing that information
		Not necessarily semantic labels


	Also known as Sapir-Whorf hypothesis
	A loose definition:
	Our language to some extent determines the way in which we view and think about the world around us.

	An example: time
		Tomorrow = day after today
		ةﺮﻜﺑ (“bukra”) = some point in the future

	The result?
		____________ representation
		Every representative offers a _____
		Many AI researchers reject / ignore this notion


Conceptual analysis
	Determine “aboutness”
		Computational approaches: TF × IDF

Translation
	Expressing the concepts as index terms


	Generic: What is it about? What’s the main content
		e.g., The History of Sociology
	Specific: Why has it been added to our collection? What aspects will our users be interested in?
		c.f., “Every reader his book”

	Thus, organizations index ________
		Different _______ (specialty, general interest)
		Different _______ (own materials, 3^rd party)


	Long (Exhaustive)
		Gives good _____ at cost of ______
		Few records fit in the UI
		Hard to figure out which are main aspects

	Short (Selective)
		Gives good ______ at cost of ____
		Less work

	In practice: offer levels of indexing for tasks
		Index Terms
		Abstract


	Extraction: use terms directly from the source itself

	Assignment: use terms from an outside source.
		Usually from a controlled vocabulary.


Benefits
	(Potentially) high precision, high recall
	Question: which of these components is more important?

Drawbacks
	Costly to construct and maintain
	Is difficult to use
		Need CV knowledge


	Control / suggest synonyms, pick an authoritative term
		Especially for entities: people (maiden names to married names), places (St. Petersburg)

	Distinguish among homographs (e.g., mercury, turkey)

	Link terms with their relationship (is-a and all others (associative))


	People
		Use most common name:
		Dr Seuss
		Not Theodore Seuss Geisel
	Geographic Names
		Use latest name:
		Namibia
		Not Zaïre


	Data must be constantly updated to provide users with best access points – not an easy job


Good structure to find the appropriate term
	Standard fields in an CV:
		USE/UF: Use instead / Use For (authoritative)
		BT/NT: Broader / Narrower Term in terms of hierarchy
		RT: Related Term (Associative Term)

Applied by experienced personnel
	A large vocabulary can be hard to map to

	Question: What to do if the controlled vocabulary has no term for the concept to be indexed?


General CVs
	Sears List of Subject Headings
		More general divisions, not intended for research libraries
		Geared towards general subdivisions

	Library of Congress Subject Headings (LCSH)
		Comprehensive, very large, over five volumes
Domain-specific CV
	Medical Subject Headings (MeSH)
		Byproduct of indexing the NLM

	Art & Architecture Thesaurus (AAT)
		Object, images, architecture, styles

	ERIC Thesaurus
		Educational materials (journals, lesson plans and computer files)


	Uniqueness
		Be able to fetch a specific resource given a call number

	Notational Permanence
		(Seldom) have to reorganize/reassign labels
		(e.g., paradigm shift in mathematics)

	Comprehensiveness
		Can successfully classify most things

	Serendipity
		Collocate related subjects together

	Ease of Use
		Ways of resolving ambiguities
		(e.g., given religious architecture and Egyptian architecture, where does an article on the architecture of Egyptian temples go?)


	Enumerative
		Produce an alphabetical list of subject headings, assign numbers to each heading in alphabetical order

	Hierarchical
		Recursively divides subjects hierarchically, from most general to most specific

	Faceted (analytico-synthetic):
		Analytic: Divides subjects into mutually exclusive orthogonal facets
		Synthetic: Combine facets to get a new class

	- From Taylor (92)


	Divide knowledge into ten classes
	Recursively divide these categories into ten (or fewer classes)
		Assign another digit

	What type of classification scheme is it?
	000 Generalities
	100 Philosophy & psychology
	200 Religion
	300 Social sciences
	400 Language
	500 Natural sciences & mathematics
	600 Technology (Applied sciences)
	700 The arts
	800 Literature & rhetoric
	900 Geography & history


	Four-level tree
		3 coded levels and
		a fourth uncoded level)

	16 General Terms
	H. Information Systems
	H.0 GENERAL
	H.1 MODELS AND PRINCIPLES
	H.2 DATABASE MANAGEMENT (E.5)
	H.3 INFORMATION STORAGE AND RETRIEVAL
	H.4 INFORMATION SYSTEMS APPLICATIONS
	H.5 INFORMATION INTERFACES ANDPRESENTATION (e.g., HCI) (I.7)
	H.m MISCELLANEOUS
	I. Computing Methodologies
	I.0 GENERAL
	I.1 SYMBOLIC AND ALGEBRAIC MANIPULATION
	I.2 ARTIFICIAL INTELLIGENCE
	I.3 COMPUTER GRAPHICS
	I.4 IMAGE PROCESSING AND COMPUTER VISION
	I.5 PATTERN RECOGNITION
	I.6 SIMULATION AND MODELING (G.3)
	I.7 DOCUMENT AND TEXT PROCESSING (H.4, H.5)
	I.m MISCELLANEOUS


Facet – a characteristic of the resource (e.g., language)

Each facet organized hierarchically
	allow drill-down browsing
	represented by
		set values (taxonomy)
		continuous values (spectrum)


	Raganathan proposed 5 basic facets (PMEST):
		Personality – the subject matter
		Material
		Energy – process or action
		Space
		Time

	Each facet would have its own classification schedule

	String together notation to get classification number


	Example:
	The design of wooden furniture in 18th century America


DDC and LCSH _____ centralized systems

Nowadays, rely on a distributed approach to update
	Either hierarchically determined authorities
	Or arbitration of conflicts
		Think CVS and source control systems


	Now that we have free-text searching, do you feel controlled vocabularies are still necessary or not? What do you feel their impact will be in the future of the digital library?

	How would you improve the ACM classification scheme? How to deal with legacy schemes?

	Booksellers also need to use classification to shelve books. Which type of classification do you think booksellers use? Would you make any adaptations to the classification schemes shown today?


	*Parts of this lecture come from Lilian Tang’s lecture material at the Univ. of Surrey


	Data about data
		From the DB community

	“Cataloging or indexing information that [information professions] create to arrange, describe, and otherwise enhance access to an information object”
		-- Gilliland-Swetland (1998)

	“Data that describes attributes of a resource, characterize its relationships, support its discover and effective use and exist in an electronic environment”
		-- Vellucci (1998)


	What is Metadata?
		Some Frameworks
	Packaging Metadata
		Warwick Framework

	Structural Metadata
	Hidden Web Metadata
		OAI
		SDARTS

	Crosswalking and Automated Extraction
	Metadata formats

	HTML Metadata
	AACR2 / TEIH / MARC / Z39.50
	Dublin Core


	Administrative
	Structural
	Descriptive
	Intellectual Property Rights
	Use


Multipurpose Internet Mail Extensions
	(text/plain, img/jpg application/msword)
	Simple format, pre-web
	Can code an unofficial type using x-subtype prefix (e.g., audio/x-pn-realaudio)
	Application tag: need to use an application to handle this data
	Wild success shows a simple system is best:
		Good for adoption / authoring
		Good for common denominator


		DOI identifier records: multiple versions of a single document (hi res / low res)
		Syntax should be mirrored in reference metadata


	Based again on how people search (The Potato Eaters)
		I’m looking for a picture of a group.
		I’d like it to be a family group.
		This family should be doing something that would be typical for a family, like sitting around a table with food in front of them, look grateful for what they have to eat.

	Facet analysis is a good approach
		Objective (“of”)
		Subjective (“about”)


	<HTML><HEAD>
	<META NAME=“attribute” VALUE=“value”>
	</HEAD>… </HTML>

	Not regulated or controlled
	You can add your own tags
	Only certain ones parsed by finding aids (e.g., GoogleBot)
	Many tags use other metadata formats