Manual cataloging and indexing

Module 5 Min-Yen KAN

*heavily drawn from Lancaster (98) Indexing and Abstracting in Theory and Practice

Mesopotamian Catalogs

Mesopotamians kept track of their tablets with a list of their incipits:

What is it?

A poem?

Some Definitions

(Subject) Indexing

Assigning index terms to represent a document

Assists in document retrieval

Classification

Assigning a label to a document to assist in organizing that information

Not necessarily semantic labels

Steps in Subject Indexing

Conceptual analysis

Determine “aboutness”

Computational approaches: TF × IDF

Translation

Expressing the concepts as index terms

In controlled vocabularies, similar to Taylor’s (68) compromised need
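The TF × IDF weighting mentioned under conceptual analysis can be sketched as follows. This is a minimal illustration, not a production indexer: the three titles are a made-up toy collection, tokenization is plain whitespace splitting, and the function names `tf_idf` and `top_terms` are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    tf(t, d) = raw count of term t in document d
    idf(t)   = log(N / df(t)), where df(t) is the number of documents containing t
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


def top_terms(doc_weights, k=3):
    """Highest-weighted terms first; ties are broken alphabetically."""
    return sorted(doc_weights, key=lambda t: (-doc_weights[t], t))[:k]


# Toy collection (hypothetical titles, no stemming or stop-word list).
docs = [
    "the history of sociology",
    "the history of mathematics",
    "the sociology of religion",
]
weights = tf_idf(docs)
for doc, w in zip(docs, weights):
    print(doc, "->", top_terms(w, k=2))
```

Terms that occur in every document ("the", "of") receive weight 0, while a term found in only one document ("mathematics", "religion") ranks highest for that document, which is one rough computational proxy for "aboutness".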

Conceptual analysis

Generic: What is it about? What’s the main content?

e.g., The History of Sociology

Specific: Why has it been added to our collection? What aspects will our users be interested in?

cf. “Every reader his book”

Thus, organizations index differently

Different subjects (specialty, general interest)

Different materials (own materials, third-party)

Index terms

1. Libraries

2.

3.

4.

5.

Number of index terms in record

Long (Exhaustive)

Gives good recall at cost of precision

Few records fit in the UI

Hard to figure out which are main aspects

Short (Selective)

Gives good precision at cost of recall

Less work

In practice: offer levels of indexing for tasks

Index Terms

Abstract
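The recall/precision trade-off between exhaustive and selective indexing can be made concrete with a small sketch. The four documents, their index terms, and the relevance judgments below are all invented for illustration; `search` and `precision_recall` are hypothetical helper names.

```python
def search(index, term):
    """Return the ids of all documents indexed under the given term."""
    return {doc_id for doc_id, terms in index.items() if term in terms}


def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved; recall = hits / relevant."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Exhaustive indexing: every topic a document touches gets a term.
exhaustive = {
    "d1": {"sociology", "history"},
    "d2": {"sociology", "religion"},
    "d3": {"economics", "sociology"},    # substantial section on sociology
    "d4": {"mathematics", "sociology"},  # mentions sociology only in passing
}

# Selective indexing: only the main topic of each document gets a term.
selective = {
    "d1": {"sociology"},
    "d2": {"sociology"},
    "d3": {"economics"},
    "d4": {"mathematics"},
}

# Assumed relevance judgments for the query "sociology".
relevant = {"d1", "d2", "d3"}

exhaustive_pr = precision_recall(search(exhaustive, "sociology"), relevant)
selective_pr = precision_recall(search(selective, "sociology"), relevant)
print("exhaustive:", exhaustive_pr)
print("selective: ", selective_pr)
```

With these assumed judgments the exhaustive index finds every relevant document but also returns d4, while the selective index returns only relevant documents but misses d3.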

Translation

Extraction: use terms directly from the source itself

Assignment: use terms from an outside source.

Usually from a controlled vocabulary.

Controlled vocabularies

Benefits

(Potentially) high precision, high recall

Question: which of these components is more important?

Drawbacks

Costly to construct and maintain

Difficult to use

Need CV knowledge

Controlled vocabulary objectives

Control / suggest synonyms, pick an authoritative term

Especially for entities: people (maiden names to married names), places (St. Petersburg)

Distinguish among homographs (e.g., mercury, turkey)

Link terms by their relationships (is-a, and all others (associative))

Controlled vocabulary usability

Good structure to find the appropriate term

Standard fields in a CV:

USE/UF: Use instead / Use For (authoritative)

BT/NT: Broader / Narrower Term in terms of hierarchy

RT: Related Term (Associative Term)

Applied by experienced personnel

A large vocabulary can be hard to map to

Question: What to do if the controlled vocabulary has no term for the concept to be indexed?
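The USE/UF, BT/NT and RT fields described above map naturally onto a small data structure. The sketch below is one possible layout under stated assumptions: the class names `Term` and `Thesaurus`, the method names, and the sample entries are all hypothetical and are not taken from any real controlled vocabulary.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    label: str
    use_for: set = field(default_factory=set)   # UF: non-preferred entry terms
    broader: set = field(default_factory=set)   # BT
    narrower: set = field(default_factory=set)  # NT
    related: set = field(default_factory=set)   # RT


class Thesaurus:
    def __init__(self):
        self.terms = {}  # authoritative label -> Term
        self.use = {}    # lowercased non-preferred label -> authoritative label (USE)

    def add(self, label, use_for=(), broader=(), related=()):
        term = self.terms.setdefault(label, Term(label))
        for alt in use_for:
            term.use_for.add(alt)
            self.use[alt.lower()] = label
        for parent in broader:
            # BT and NT are reciprocal, so record both directions.
            term.broader.add(parent)
            self.terms.setdefault(parent, Term(parent)).narrower.add(label)
        for other in related:
            # RT is symmetric.
            term.related.add(other)
            self.terms.setdefault(other, Term(other)).related.add(label)
        return term

    def resolve(self, label):
        """Map an entry term to its authoritative term, or None if the CV has no match."""
        if label in self.terms:
            return label
        return self.use.get(label.lower())


cv = Thesaurus()
cv.add("St. Petersburg", use_for=["Leningrad", "Petrograd"], broader=["Cities"])
# Homographs are kept apart with parenthetical qualifiers.
cv.add("Mercury (planet)", broader=["Planets"])
cv.add("Mercury (element)", broader=["Metals"], related=["Thermometers"])

print(cv.resolve("Leningrad"))       # authoritative term for a synonym
print(cv.terms["Cities"].narrower)   # NT entries generated from BT
print(cv.resolve("Stalingrad"))      # no term in this CV
```

A `None` result from `resolve` is exactly the situation raised in the question above: the indexer must then choose a broader term, propose a new term, or fall back to free text.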

Controlled vocabulary examples

General CVs

Sears List of Subject Headings

More general divisions, not intended for research libraries

Geared towards general subdivisions

Library of Congress Subject Headings (LCSH)

Comprehensive, very large, over five volumes

Domain-specific CV

Medical Subject Headings (MeSH)

Byproduct of indexing the NLM (National Library of Medicine) collection

Art & Architecture Thesaurus (AAT)

Objects, images, architecture, styles

ERIC Thesaurus

Educational materials (journals, lesson plans and computer files)

Classification

Objectives of classification

Uniqueness

Be able to fetch a specific resource given a call number

Notational Permanence

(Seldom) have to reorganize/reassign labels

(e.g., paradigm shift in mathematics)

Comprehensive

Can successfully classify most things

Serendipity

Collocate related subjects together

Ease of Use

Ways of resolving ambiguities

(e.g., given religious architecture and Egyptian architecture, where does an article on the architecture of Egyptian temples go?)

Types of classification

Enumerative

Produce an alphabetical list of subject headings, assign numbers to each heading in alphabetical order

Hierarchical

Recursively divides subjects hierarchically, from most general to most specific

Faceted (analytico-synthetic):

Analytic: Divides subjects into mutually exclusive orthogonal facets

Synthetic: Combine facets to get a new class

- From Taylor (92)

Dewey Decimal Classification

Divide knowledge into ten classes

Recursively divide these categories into ten (or fewer) classes

Assign another digit

What type of classification scheme is it?

000 Generalities

100 Philosophy & psychology

200 Religion

300 Social sciences

400 Language

500 Natural sciences & mathematics

600 Technology (Applied sciences)

700 The arts

800 Literature & rhetoric

900 Geography & history
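The "divide into ten, assign another digit" rule means that the ancestors of a Dewey number can be read directly off its digits. The sketch below assumes a well-formed three-digit number with an optional decimal extension; the function names are hypothetical, and only the ten main classes listed above are labeled.

```python
DDC_MAIN_CLASSES = {
    "000": "Generalities",
    "100": "Philosophy & psychology",
    "200": "Religion",
    "300": "Social sciences",
    "400": "Language",
    "500": "Natural sciences & mathematics",
    "600": "Technology (Applied sciences)",
    "700": "The arts",
    "800": "Literature & rhetoric",
    "900": "Geography & history",
}


def ddc_path(number):
    """Return the three-digit classes from most general to most specific.

    Only the part before the decimal point is used; digits after the point
    refine the last class further and are ignored in this sketch.
    """
    digits = number.split(".")[0]
    if len(digits) != 3 or not digits.isdigit():
        raise ValueError(f"expected a three-digit class number, got {number!r}")
    path = []
    for depth in range(1, 4):
        code = digits[:depth].ljust(3, "0")  # pad with zeros: "5" -> "500"
        if code not in path:
            path.append(code)
    return path


def ddc_main_class(number):
    """Look up the label of the top-level class a number falls under."""
    return DDC_MAIN_CLASSES[ddc_path(number)[0]]


print(ddc_path("025.04"))      # three levels, read off digit by digit
print(ddc_main_class("510"))   # top-level label for a number in the 500s
```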

ACM Classification scheme

Four-level tree

(3 coded levels and a fourth uncoded level)

16 General Terms

H. Information Systems

H.0 GENERAL

H.1 MODELS AND PRINCIPLES

H.2 DATABASE MANAGEMENT (E.5)

H.3 INFORMATION STORAGE AND RETRIEVAL

H.4 INFORMATION SYSTEMS APPLICATIONS

H.5 INFORMATION INTERFACES AND PRESENTATION (e.g., HCI) (I.7)

H.m MISCELLANEOUS

I. Computing Methodologies

I.0 GENERAL

I.1 SYMBOLIC AND ALGEBRAIC MANIPULATION

I.2 ARTIFICIAL INTELLIGENCE

I.3 COMPUTER GRAPHICS

I.4 IMAGE PROCESSING AND COMPUTER VISION

I.5 PATTERN RECOGNITION

I.6 SIMULATION AND MODELING (G.3)

I.7 DOCUMENT AND TEXT PROCESSING (H.4, H.5)

I.m MISCELLANEOUS
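The three coded levels can be recovered from a code by splitting on the dots. The following sketch assumes codes of the form `H`, `H.3` or `H.3.3`; the function names are hypothetical, and the label table holds only a subset of the nodes listed above, so deeper nodes are reported as unlisted rather than guessed.

```python
# Subset of the labels shown above; third-level nodes are not included here.
ACM_LABELS = {
    "H": "Information Systems",
    "H.2": "DATABASE MANAGEMENT",
    "H.3": "INFORMATION STORAGE AND RETRIEVAL",
    "I": "Computing Methodologies",
    "I.2": "ARTIFICIAL INTELLIGENCE",
    "I.7": "DOCUMENT AND TEXT PROCESSING",
}


def acm_path(code):
    """Expand a code into its ancestors, e.g. 'H.3.3' -> ['H', 'H.3', 'H.3.3']."""
    parts = [p for p in code.strip().split(".") if p]  # tolerates a trailing dot, as in 'H.'
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"expected one to three coded levels, got {code!r}")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]


def describe(code):
    """Pair every node on the path with its label, if this sketch knows it."""
    return [(node, ACM_LABELS.get(node, "(not listed here)")) for node in acm_path(code)]


for node, label in describe("H.3.3"):
    print(node, "-", label)
```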

Faceted Indexing

Facet – a characteristic of the resource (e.g., language)

Each facet organized hierarchically

allow drill-down browsing

represented by

set values (taxonomy)

continuous values (spectrum)
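Drill-down browsing over facets amounts to filtering records on one facet after another. The sketch below is illustrative only: the records, the facet names (`language`, `format`, `year`) and the helper names are assumptions. A set-valued facet is matched by equality and a continuous facet by a `(low, high)` range.

```python
from collections import Counter

# Hypothetical catalog records.
records = [
    {"title": "A", "language": "English", "format": "book", "year": 1998},
    {"title": "B", "language": "English", "format": "journal", "year": 2003},
    {"title": "C", "language": "French", "format": "book", "year": 1975},
    {"title": "D", "language": "English", "format": "book", "year": 1968},
]


def matches(record, facet, wanted):
    value = record.get(facet)
    if isinstance(wanted, tuple):  # continuous facet: inclusive (low, high) range
        low, high = wanted
        return value is not None and low <= value <= high
    return value == wanted         # set-valued facet: exact match


def drill_down(records, **selected):
    """Keep only the records that satisfy every selected facet."""
    return [r for r in records
            if all(matches(r, facet, wanted) for facet, wanted in selected.items())]


def facet_counts(records, facet):
    """Counts shown next to each facet value in a browsing interface."""
    return Counter(r[facet] for r in records)


english = drill_down(records, language="English")
english_books_90s = drill_down(english, format="book", year=(1990, 1999))
print(facet_counts(english, "format"))
print([r["title"] for r in english_books_90s])
```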

Colon Classification

Ranganathan proposed 5 basic facets (PMEST):

Personality – the subject matter

Matter – the material or substance

Energy – process or action

Space

Time

Each facet would have its own classification schedule

String together notation to get classification number

Example:

The design of wooden furniture in 18th century America
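One way to read the example is to analyze the title into the five facets and then string the facet notations together. The sketch below is a hedged illustration only: the facet codes and the main class `X` are invented placeholders, not real Colon Classification schedule entries, and the connecting symbols follow the commonly cited convention, which should be treated as an assumption here.

```python
from typing import NamedTuple, Optional


class PMEST(NamedTuple):
    personality: Optional[str] = None
    matter: Optional[str] = None  # the M facet: material or substance
    energy: Optional[str] = None
    space: Optional[str] = None
    time: Optional[str] = None


# Connecting symbol placed before each facet (assumed convention).
CONNECTORS = {"personality": ",", "matter": ";", "energy": ":", "space": ".", "time": "'"}

# Hypothetical schedules: every code below is made up for illustration.
SCHEDULES = {
    "personality": {"furniture": "F"},
    "matter": {"wood": "W"},
    "energy": {"design": "D"},
    "space": {"America": "A"},
    "time": {"18th century": "18"},
}


def synthesize(main_class, facets, schedules=SCHEDULES):
    """Synthetic step: concatenate the notation of each facet that applies."""
    notation = main_class
    for name in PMEST._fields:
        value = getattr(facets, name)
        if value is None:
            continue  # a facet that does not apply is simply left out
        notation += CONNECTORS[name] + schedules[name][value]
    return notation


# Analytic step for "The design of wooden furniture in 18th century America".
facets = PMEST(
    personality="furniture",
    matter="wood",
    energy="design",
    space="America",
    time="18th century",
)
number = synthesize("X", facets)
print(number)
```

The analytic step is the intellectual work (deciding that "wooden" is the matter and "design" the energy); the synthetic step is mechanical once each facet has its own schedule.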

To think about…

Now that we have free-text searching, do you feel controlled vocabularies are still necessary or not? What do you feel their impact will be in the future of the digital library?

How would you improve the ACM classification scheme? How would you deal with legacy schemes?

Booksellers also need to use classification to shelve books. Which type of classification do you think booksellers use? Would you make any adaptations to the classification schemes shown today?

