Manual cataloging and indexing
Module 5                Min-Yen KAN
*heavily drawn from Lancaster (98) Indexing and Abstracting in Theory and Practice

Mesopotamian Catalogs
Mesopotamians kept track of their tablets with a list of their incipits:
What is it?
A poem?

Some Definitions
(Subject) Indexing
Assigning index terms to represent a document
Assists in document retrieval
Classification
Assigning a label to a document to assist in organizing that information
Not necessarily semantic labels

Steps in Subject Indexing
Conceptual analysis
Determine “aboutness”
Computational approaches: TF × IDF
Translation
Expressing the concepts as index terms
In controlled vocabularies, similar to Taylor’s (68) compromised need
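
A minimal sketch of the TF × IDF weighting mentioned under conceptual analysis: terms that are frequent in one document but rare across the collection score highest, which is one computational stand-in for "aboutness". The whitespace tokenizer, the unsmoothed log(N / df) form of IDF, and the three sample titles are assumptions made for illustration; practical systems usually add stemming, stop lists, and smoothing.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term frequency within this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Hypothetical three-document collection.
docs = [
    "history of sociology",
    "history of architecture",
    "sociology of religion",
]
weights = tf_idf(docs)
# The highest-weighted term is a candidate index term for the second document.
top_term = max(weights[1], key=weights[1].get)
```

A term that occurs in every document ("of" here) gets weight zero, so it is never proposed as an index term.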

Conceptual analysis
Generic: What is it about? What’s the main content?
e.g., The History of Sociology
Specific: Why has it been added to our collection? What aspects will our users be interested in?
cf. “Every reader his book”
Thus, organizations index differently
Different subjects (specialty, general interest)
Different materials (own materials, 3rd party)

Index terms
1. Libraries
2.
3.
4.
5.

Number of index terms in record
Long (Exhaustive)
Gives good recall at cost of precision
Few records fit in the UI
Hard to figure out which are main aspects
Short (Selective)
Gives good precision at cost of recall
Less work
In practice: offer levels of indexing for tasks
Index Terms
Abstract
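
The recall/precision trade-off between exhaustive and selective indexing can be sketched with a toy collection. Everything below is hypothetical: the four records, the ordering of each record's terms from most to least central, and the relevance judgments for the query. Selective indexing is simulated by keeping only the first `max_terms` terms of each record.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved record ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical records; index terms are listed most central first.
records = {
    "doc1": ["sociology", "history", "religion"],
    "doc2": ["architecture", "religion", "egypt"],
    "doc3": ["religion", "sociology"],
    "doc4": ["history", "egypt", "architecture", "religion"],
}
# Assumed relevance judgments for the query "religion".
relevant = {"doc1", "doc2", "doc3"}


def search(term, max_terms):
    """Match the term against only the first max_terms index terms."""
    return {doc for doc, terms in records.items() if term in terms[:max_terms]}


exhaustive = precision_recall(search("religion", 10), relevant)
selective = precision_recall(search("religion", 1), relevant)
```

With these made-up judgments, the exhaustive index retrieves every relevant record plus one marginal one, while the one-term index retrieves a single, clearly relevant record.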

Translation
Extraction: use terms directly from the source itself
Assignment: use terms from an outside source.
Usually from a controlled vocabulary.

Controlled vocabularies
Benefits
(Potentially) high precision, high recall
Question: which of these components is more important?
Drawbacks
Costly to construct and maintain
Difficult to use
Need CV knowledge

Controlled vocabulary objectives
Control / suggest synonyms, pick an authoritative term
Especially for entities: people (maiden names to married names), places (St. Petersburg)
Distinguish among homographs (e.g., mercury, turkey)
Link terms with their relationship (is-a and all others (associative))

Controlled vocabulary usability
Good structure to find the appropriate term
Standard fields in a CV:
USE/UF: Use instead / Use For (authoritative)
BT/NT: Broader / Narrower Term in terms of hierarchy
RT: Related Term (Associative Term)
Applied by experienced personnel
A large vocabulary can be hard to map to
Question: What to do if the controlled vocabulary has no term for the concept to be indexed?
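
The standard fields above can be sketched as a small lookup structure. The entries are invented for illustration and are not taken from any real controlled vocabulary; `authoritative` and `broader_chain` are hypothetical helper names.

```python
# Hypothetical thesaurus entries using the USE/UF, BT/NT and RT fields.
thesaurus = {
    "automobiles": {
        "UF": ["cars"],
        "BT": ["motor vehicles"],
        "NT": ["sports cars"],
        "RT": ["roads"],
    },
    "cars": {"USE": "automobiles"},
    "motor vehicles": {"NT": ["automobiles"]},
    "sports cars": {"BT": ["automobiles"]},
}


def authoritative(term):
    """Follow USE references until a preferred term is reached."""
    seen = set()
    while term in thesaurus and "USE" in thesaurus[term]:
        if term in seen:
            raise ValueError("circular USE reference at " + term)
        seen.add(term)
        term = thesaurus[term]["USE"]
    return term


def broader_chain(term):
    """List successively broader terms, following the first BT each time."""
    chain = []
    term = authoritative(term)
    while thesaurus.get(term, {}).get("BT"):
        term = thesaurus[term]["BT"][0]
        chain.append(term)
    return chain


preferred = authoritative("cars")
ancestors = broader_chain("cars")
```

A term that is missing from the vocabulary is returned unchanged by `authoritative`, which is where the question above comes in: the indexer still has to decide what to do with it.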

Controlled vocabulary examples
General CVs
Sears List of Subject Headings
More general divisions, not intended for research libraries
Geared towards general subdivisions
Library of Congress Subject Headings (LCSH)
Comprehensive, very large, over five volumes
Domain-specific CV
Medical Subject Headings (MeSH)
Byproduct of indexing the NLM
Art & Architecture Thesaurus (AAT)
Object, images, architecture, styles
ERIC Thesaurus
Educational materials (journals, lesson plans and computer files)

Classification

Objectives of classification
Uniqueness
Be able to fetch a specific resource given a call number
Notational Permanence
(Seldom) have to reorganize/reassign labels
(e.g., paradigm shift in mathematics)
Comprehensive
Can successfully classify most things
Serendipity
Collocate related subjects together
Ease of Use
Ways of resolving ambiguities
(e.g., given religious architecture and Egyptian architecture, where does an article on the architecture of Egyptian temples go?)

Types of classification
Enumerative
Produce an alphabetical list of subject headings, assign numbers to each heading in alphabetical order
Hierarchical
Recursively divides subjects hierarchically, from most general to most specific
Faceted (analytico-synthetic):
Analytic: Divides subjects into mutually exclusive orthogonal facets
Synthetic: Combine facets to get a new class
- From Taylor (92)

Dewey Decimal Classification
Divide knowledge into ten classes
Recursively divide these categories into ten (or fewer) classes
Assign another digit
What type of classification scheme is it?
000 Generalities
100 Philosophy & psychology
200 Religion
300 Social sciences
400 Language
500 Natural sciences & mathematics
600 Technology (Applied sciences)
700 The arts
800 Literature & rhetoric
900 Geography & history
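
The digit-by-digit division described above means each leading digit of a Dewey number names a progressively narrower class. A minimal sketch, assuming call numbers are given as strings such as "025.4"; only the ten main classes listed above are included, and `ddc_path` and `main_class` are hypothetical helper names.

```python
DDC_MAIN_CLASSES = {
    0: "Generalities",
    1: "Philosophy & psychology",
    2: "Religion",
    3: "Social sciences",
    4: "Language",
    5: "Natural sciences & mathematics",
    6: "Technology (Applied sciences)",
    7: "The arts",
    8: "Literature & rhetoric",
    9: "Geography & history",
}


def ddc_path(call_number):
    """Return the hundreds, tens and units levels of a Dewey number."""
    digits = call_number.split(".")[0].zfill(3)
    if len(digits) != 3 or not digits.isdigit():
        raise ValueError("not a Dewey number: " + call_number)
    # Each level keeps one more significant digit than the level before it.
    return [digits[0] + "00", digits[:2] + "0", digits]


def main_class(call_number):
    """Name the main class given by the first digit."""
    return DDC_MAIN_CLASSES[int(ddc_path(call_number)[0][0])]


path = ddc_path("025.4")
label = main_class("510")
```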

ACM Classification scheme
Four-level tree (3 coded levels and a fourth uncoded level)
16 General Terms
H. Information Systems
H.0 GENERAL
H.1 MODELS AND PRINCIPLES
H.2 DATABASE MANAGEMENT (E.5)
H.3 INFORMATION STORAGE AND RETRIEVAL
H.4 INFORMATION SYSTEMS APPLICATIONS
H.5 INFORMATION INTERFACES AND PRESENTATION (e.g., HCI) (I.7)
H.m MISCELLANEOUS
I. Computing Methodologies
I.0 GENERAL
I.1 SYMBOLIC AND ALGEBRAIC MANIPULATION
I.2 ARTIFICIAL INTELLIGENCE
I.3 COMPUTER GRAPHICS
I.4 IMAGE PROCESSING AND COMPUTER VISION
I.5 PATTERN RECOGNITION
I.6 SIMULATION AND MODELING (G.3)
I.7 DOCUMENT AND TEXT PROCESSING (H.4, H.5)
I.m MISCELLANEOUS

Faceted Indexing
Facet – a characteristic of the resource (e.g., language)
Each facet organized hierarchically
allow drill-down browsing
represented by
set values (taxonomy)
continuous values (spectrum)
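
Drill-down browsing over facets can be sketched as filtering on a path prefix for set-valued (taxonomy) facets and on a range for continuous facets. The records, the facet names `place` and `year`, and the `drill_down` helper are all hypothetical.

```python
def drill_down(items, paths=None, ranges=None):
    """Keep items whose taxonomy facets start with the given path and
    whose continuous facets fall inside the given (low, high) range."""
    paths = paths or {}
    ranges = ranges or {}
    selected = []
    for item in items:
        in_paths = all(item[f][:len(p)] == p for f, p in paths.items())
        in_ranges = all(lo <= item[f] <= hi for f, (lo, hi) in ranges.items())
        if in_paths and in_ranges:
            selected.append(item)
    return selected


# Hypothetical resources: "place" is a taxonomy path, "year" a continuous value.
resources = [
    {"title": "Report A", "place": ("Asia", "Singapore"), "year": 1998},
    {"title": "Report B", "place": ("Asia", "Japan"), "year": 1985},
    {"title": "Report C", "place": ("Europe", "France"), "year": 2001},
]

asia = drill_down(resources, paths={"place": ("Asia",)})
asia_90s = drill_down(
    resources, paths={"place": ("Asia",)}, ranges={"year": (1990, 1999)}
)
```

Each additional facet narrows the previous result set, which is the browsing behaviour the slide calls drill-down.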

Colon Classification
Ranganathan proposed 5 basic facets (PMEST):
Personality – the subject matter
Material
Energy – process or action
Space
Time
Each facet would have its own classification schedule
String together notation to get classification number
Example:
The design of wooden furniture in 18th century America
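
A sketch of how facet notations might be strung together for the example above. The schedule codes below are invented placeholders, NOT real Colon Classification notation, and the connector symbols placed before each facet are likewise an assumption made for illustration; only the PMEST ordering comes from the slide.

```python
# Placeholder schedules: codes are invented, not real Colon Classification.
SCHEDULES = {
    "personality": {"furniture": "F"},
    "material": {"wood": "w"},
    "energy": {"design": "d"},
    "space": {"America": "A"},
    "time": {"18th century": "18"},
}
# Assumed connector symbol written before each facet's code.
CONNECTORS = {
    "personality": "",
    "material": ";",
    "energy": ":",
    "space": ".",
    "time": "'",
}
ORDER = ["personality", "material", "energy", "space", "time"]


def classify(**facets):
    """Look up each supplied facet in its schedule and join in PMEST order."""
    parts = []
    for facet in ORDER:
        if facet in facets:
            parts.append(CONNECTORS[facet] + SCHEDULES[facet][facets[facet]])
    return "".join(parts)


number = classify(
    personality="furniture",
    material="wood",
    energy="design",
    space="America",
    time="18th century",
)
```

Because every facet has its own schedule, a new subject only needs a new combination of existing codes rather than a new entry in one enumerated list.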

To think about…
Now that we have free-text searching, do you feel controlled vocabularies are still necessary or not? What do you feel their impact will be in the future of the digital library?
How would you improve the ACM classification scheme?  How to deal with legacy schemes?
Booksellers also need to use classification to shelve books.  Which type of classification do you think booksellers use?  Would you make any adaptations to the classification schemes shown today?