1
Manual cataloging and indexing
  • Week 3                Min-Yen KAN
  • *heavily drawn from Lancaster (98) Indexing and Abstracting in Theory and Practice
2
Objectives of the Library
  • Ranganathan (1957)
  • Books are for use
  • Every reader his book
  • Every book its reader
  • Save time of the reader
  • The library is a growing organism
3
Mesopotamian Catalogs
  • Mesopotamians kept track of their tablets with a list of their incipits:









  • What is it?
  • A poem?
  • _______


4
Some Definitions
  • (Subject) Indexing
    • Assigning index terms to represent a document
    • Assists in document retrieval


  • Classification
    • Assigning a label to a document to assist in organizing that information
    • Not necessarily semantic labels
5
Linguistic Relativity
  • Also known as the Sapir-Whorf hypothesis
  • A loose definition:
  • Our language to some extent determines the way in which we view and think about the world around us.


  • An example: time
    • Tomorrow = day after today
    • بكرة (“bukra”) = some point in the future

  • The result?
    • ____________  representation
    • Every representation offers a _____
    • Many AI researchers reject / ignore this notion
6
Steps in Subject Indexing
  • Conceptual analysis
    • Determine “aboutness”
      • Computational approaches: TF × IDF

  • Translation
    • Expressing the concepts as index terms
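
A minimal sketch of the TF × IDF idea mentioned under conceptual analysis: a term is a good "aboutness" candidate when it is frequent in one document but rare across the collection. The three toy documents, the whitespace tokenizer, and the raw-count × log weighting are assumptions for illustration; many other weighting variants exist.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document (raw TF x log IDF)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Toy collection (hypothetical titles).
docs = [
    "the history of sociology",
    "the history of libraries",
    "indexing and abstracting in libraries",
]
weights = tf_idf(docs)
# The highest-weighted term is the candidate index term for the first document.
top_term = max(weights[0], key=weights[0].get)
print(top_term)
```

Terms shared by several documents ("the", "history") get low weights, so the term unique to the first document stands out.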
7
Conceptual analysis
  • Generic: What is it about? What’s the main content
    • e.g., The History of Sociology
  • Specific: Why has it been added to our collection? What aspects will our users be interested in?
    • cf. “Every reader his book”

  • Thus, organizations index ________
    • Different _______ (specialty, general interest)
    • Different _______ (own materials, 3rd party)
8
Index terms
  • 1. Libraries
  • 2.
  • 3.
  • 4.
  • 5.
9
Number of index terms in record
  • Long (Exhaustive)
    • Gives good _____ at cost of ______
    • Few records fit in the UI
    • Hard to figure out which are main aspects


  • Short (Selective)
    • Gives good ______ at cost of ____
    • Less work


  • In practice: offer levels of indexing for tasks
    • Index Terms
    • Abstract
10
Translation
  • Extraction: use terms directly from the source itself


  • Assignment: use terms from an outside source.
    • Usually from a controlled vocabulary.


11
Controlled vocabularies
  • Benefits
    • (Potentially) high precision, high recall
    • Question: which of these components is more important?


  • Drawbacks
    • Costly to construct and maintain
    • Difficult to use
      • Need CV knowledge
12
Controlled vocabulary objectives
  • Control / suggest synonyms, pick an authoritative term
    • Especially for entities: people (maiden names to married names), places (St. Petersburg)


  • Distinguish among homographs (e.g., mercury, turkey)


  • Link terms with their relationship (is-a and all others (associative))
13
Difficulties in Naming Authorities
  • People
    • Use most common name:
    • Dr Seuss
    • Not Theodor Seuss Geisel
  • Geographic Names
    • Use latest name:
    • Democratic Republic of the Congo
    • Not Zaïre



  • Data must be constantly updated to provide users with best access points – not an easy job
14
Controlled vocabulary usability
  • Good structure to find the appropriate term
    • Standard fields in a CV:
      • USE/UF: Use instead / Used For (authoritative)
      • BT/NT: Broader / Narrower Term in terms of hierarchy
      • RT: Related Term (Associative Term)


  • Applied by experienced personnel
    • A large vocabulary can be hard to map to


    • Question: What to do if the controlled vocabulary has no term for the concept to be indexed?
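
The USE/UF, BT/NT and RT fields can be pictured as a small record per term. The sketch below is one possible in-memory layout, not a standard thesaurus format; the `Term` class, the `authoritative` helper, and the sample entries (built on the Dr Seuss and mercury examples from earlier slides) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Term:
    """One controlled-vocabulary entry with the standard relationship fields."""
    label: str
    use: Optional[str] = None                           # USE: preferred term
    used_for: List[str] = field(default_factory=list)   # UF: non-preferred variants
    broader: List[str] = field(default_factory=list)    # BT
    narrower: List[str] = field(default_factory=list)   # NT
    related: List[str] = field(default_factory=list)    # RT

# Hypothetical sample entries.
vocab = {
    "Dr Seuss": Term("Dr Seuss", used_for=["Geisel, Theodor Seuss"]),
    "Geisel, Theodor Seuss": Term("Geisel, Theodor Seuss", use="Dr Seuss"),
    "Mercury (planet)": Term("Mercury (planet)", broader=["Planets"]),
    "Mercury (element)": Term("Mercury (element)", broader=["Metals"]),
}

def authoritative(label, vocab):
    """Follow USE references until the preferred term is reached."""
    seen = set()
    while label in vocab and vocab[label].use is not None and label not in seen:
        seen.add(label)
        label = vocab[label].use
    return label

print(authoritative("Geisel, Theodor Seuss", vocab))
```

A label that is not in the vocabulary is returned unchanged, which is one simple way to surface the "no term for the concept" case raised in the question above.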


15
Controlled vocabulary examples
  • General CVs
    • Sears List of Subject Headings
      • More general divisions, not intended for research libraries
      • Geared towards general subdivisions


    • Library of Congress Subject Headings (LCSH)
      • Comprehensive, very large, over five volumes
  • Domain-specific CV
    • Medical Subject Headings (MeSH)
      • Byproduct of indexing the NLM


    • Art & Architecture Thesaurus (AAT)
      • Object, images, architecture, styles

    • ERIC Thesaurus
      • Educational materials (journals, lesson plans and computer files)

16
Classification
17
Objectives of classification
  • Uniqueness
    • Be able to fetch a specific resource given a call number


  • Notational Permanence
    • (Seldom) have to reorganize/reassign labels
    • (e.g., paradigm shift in mathematics)


  • Comprehensiveness
    • Can successfully classify most things


  • Serendipity
    • Collocate related subjects together


  • Ease of Use
    • Ways of resolving ambiguities
    • (e.g., given religious architecture and Egyptian architecture, where does an article on the architecture of Egyptian temples go?)

18
Types of classification
  • Enumerative
    • Produce an alphabetical list of subject headings, assign numbers to each heading in alphabetical order


  • Hierarchical
    • Recursively divides subjects hierarchically, from most general to most specific


  • Faceted (analytico-synthetic):
    • Analytic: Divides subjects into mutually exclusive orthogonal facets
    • Synthetic: Combine facets to get a new class


  • - From Taylor (92)
19
Dewey Decimal Classification
  • Divide knowledge into ten classes
  • Recursively divide these categories into ten (or fewer) classes
    • Assign another digit

  • What type of classification scheme is it?
  • 000 Generalities
  • 100 Philosophy & psychology
  • 200 Religion
  • 300 Social sciences
  • 400 Language
  • 500 Natural sciences & mathematics
  • 600 Technology (Applied sciences)
  • 700 The arts
  • 800 Literature & rhetoric
  • 900 Geography & history
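
Because each level is split into at most ten parts and each split adds one digit, a three-digit DDC number can be unpacked digit by digit. The sketch below uses only the ten main classes listed above; the function names `ddc_path` and `main_class` are assumptions made for this example, and anything after the decimal point is ignored.

```python
DDC_CLASSES = {
    "0": "Generalities",
    "1": "Philosophy & psychology",
    "2": "Religion",
    "3": "Social sciences",
    "4": "Language",
    "5": "Natural sciences & mathematics",
    "6": "Technology (Applied sciences)",
    "7": "The arts",
    "8": "Literature & rhetoric",
    "9": "Geography & history",
}

def ddc_path(number):
    """Expand a DDC number into its three hierarchy levels, most general first."""
    digits = number.split(".")[0]
    if len(digits) != 3 or not digits.isdigit():
        raise ValueError(f"expected three leading digits, got {number!r}")
    return [digits[0] + "00", digits[:2] + "0", digits]

def main_class(number):
    """Name of the top-level class, taken from the first digit."""
    return DDC_CLASSES[ddc_path(number)[0][0]]

print(ddc_path("025.3"), main_class("025.3"))
```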


20
ACM Classification scheme
  • Four-level tree
    • 3 coded levels and
    • a fourth uncoded level


  • 16 General Terms
  • H. Information Systems
  • H.0 GENERAL
  • H.1 MODELS AND PRINCIPLES
  • H.2 DATABASE MANAGEMENT (E.5)
  • H.3 INFORMATION STORAGE AND RETRIEVAL
  • H.4 INFORMATION SYSTEMS APPLICATIONS
  • H.5 INFORMATION INTERFACES AND PRESENTATION (e.g., HCI) (I.7)
  • H.m MISCELLANEOUS
  • I. Computing Methodologies
  • I.0 GENERAL
  • I.1 SYMBOLIC AND ALGEBRAIC MANIPULATION
  • I.2 ARTIFICIAL INTELLIGENCE
  • I.3 COMPUTER GRAPHICS
  • I.4 IMAGE PROCESSING AND COMPUTER VISION
  • I.5 PATTERN RECOGNITION
  • I.6 SIMULATION AND MODELING (G.3)
  • I.7 DOCUMENT AND TEXT PROCESSING (H.4, H.5)
  • I.m MISCELLANEOUS
21
Faceted Indexing
  • Facet – a characteristic of the resource (e.g., language)


  • Each facet organized hierarchically
    • allow drill-down browsing
    • represented by
      • set values (taxonomy)
      • continuous values (spectrum)
22
Colon Classification
  • Ranganathan proposed 5 basic facets (PMEST):
    • Personality – the subject matter
    • Matter
    • Energy – process or action
    • Space
    • Time


  • Each facet would have its own classification schedule


  • String together notation to get classification number



  • Example:
  • The design of wooden furniture in 18th century America
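
One way to read the example is to assign each phrase to a PMEST facet and then string the facets together. The sketch below does that with plain words in place of schedule codes; a real Colon Classification number would use the notation from each facet's schedule. The connecting symbols (`;` before Matter, `:` before Energy, `.` before Space, `'` before Time) follow the commonly cited convention and should be treated as an assumption here, as should the `PMEST` class itself.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PMEST:
    """Facet analysis of one subject; field values here are words, not schedule codes."""
    personality: str
    matter: str
    energy: str
    space: str
    time: str

    def notation(self):
        # Synthesis step: join the facets with their connecting symbols.
        return (f"{self.personality};{self.matter}:{self.energy}"
                f".{self.space}'{self.time}")

# Analysis of "The design of wooden furniture in 18th century America".
example = PMEST(
    personality="furniture",
    matter="wood",
    energy="design",
    space="America",
    time="18th century",
)
print(example.notation())  # furniture;wood:design.America'18th century
```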



23
Classification Maintenance
  • DDC and LCSH _____ centralized systems


  • Nowadays, rely on a distributed approach to update
    • Either hierarchically determined authorities
    • Or arbitration of conflicts
      • Think CVS and source control systems
24
To think about…
  • Now that we have free-text searching, do you feel controlled vocabularies are still necessary or not? What do you feel their impact will be in the future of the digital library?


  • How would you improve the ACM classification scheme?  How to deal with legacy schemes?


  • Booksellers also need to use classification to shelve books.  Which type of classification do you think booksellers use?  Would you make any adaptations to the classification schemes shown today?
25
Metadata creation and management*
  • *Parts of this lecture come from Lilian Tang’s lecture material at the Univ. of Surrey
26
What is metadata, anyways?
  • Data about data
    • From the DB community

  • “Cataloging or indexing information that [information professionals] create to arrange, describe, and otherwise enhance access to an information object”
    • -- Gilliland-Swetland (1998)

  • “Data that describe attributes of a resource, characterize its relationships, support its discovery and effective use, and exist in an electronic environment”
    • -- Vellucci (1998)
27
Outline
  • What is Metadata?
    • Some Frameworks
  • Packaging Metadata
    • Warwick Framework

  • Structural Metadata
  • Hidden Web Metadata
    • OAI
    • SDARTS


  • Crosswalking and Automated Extraction
  • Metadata formats


  • HTML Metadata
  • AACR2 / TEIH / MARC / Z39.50
  • Dublin Core
28
Types of metadata
  • Administrative
  • Structural
  • Descriptive
  • Intellectual Property Rights
  • Use
33
Metadata attributes
34
Data types: MIME
  • Multipurpose Internet Mail Extensions
    • (text/plain, image/jpeg, application/msword)
    • Simple format, pre-web
    • Can code an unofficial type using the x- subtype prefix (e.g., audio/x-pn-realaudio)
    • Application tag: need to use an application to handle this data
    • Wild success shows a simple system is best:
      • Good for adoption / authoring
      • Good for common denominator
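
The simplicity of the format shows in how little code is needed to read it. The sketch below splits a type/subtype string and flags the two cases named above: an unofficial `x-` subtype and the `application` top-level type. The `parse_mime` helper and its dictionary keys are made up for this illustration, and parameters such as `charset` are not handled.

```python
def parse_mime(mime):
    """Split 'type/subtype' and flag unofficial subtypes and application data."""
    top, sep, sub = mime.strip().lower().partition("/")
    if not sep or not top or not sub:
        raise ValueError(f"not a type/subtype pair: {mime!r}")
    return {
        "type": top,
        "subtype": sub,
        "unofficial": sub.startswith("x-"),        # x- prefix: unregistered subtype
        "needs_application": top == "application",  # handled by an external program
    }

for example in ("text/plain", "application/msword", "audio/x-pn-realaudio"):
    print(example, parse_mime(example))
```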
35
Complex objects and granularity
    • DOI identifier records: multiple versions of a single document (hi res / low res)
    • Syntax should be mirrored in reference metadata
36
MPEG 7
37
Audio/visual metadata
  • Based again on how people search
    (The Potato Eaters)
    • I’m looking for a picture of a group.
    • I’d like it to be a family group.
    • This family should be doing something that would be typical for a family, like sitting around a table with food in front of them, looking grateful for what they have to eat.

  • Facet analysis is a good approach
    • Objective (“of”)
    • Subjective (“about”)
38
HTML Meta tags
  • <HTML><HEAD>
  •   <META NAME="attribute" CONTENT="value">
  • </HEAD>… </HTML>


  • Not regulated or controlled
  • You can add your own tags
  • Only certain ones parsed by finding aids (e.g., Googlebot)
  • Many tags use other metadata formats
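
A finding aid that wants these tags has to pull the name/content pairs out of the page head. The following is a minimal sketch using Python's standard `html.parser` module and the standard `content` attribute; the `MetaCollector` class and the sample page are invented for the example, and real crawlers apply their own rules about which names they honor.

```python
from html.parser import HTMLParser

class MetaCollector(HTMLParser):
    """Collect <meta name=... content=...> pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name, content = attributes.get("name"), attributes.get("content")
        if name and content is not None:
            self.meta[name.lower()] = content

# Hypothetical page: one ad hoc tag and one tag borrowing a Dublin Core element name.
page = """<HTML><HEAD>
<META NAME="keywords" CONTENT="cataloging, indexing">
<META NAME="DC.creator" CONTENT="Example Author">
</HEAD><BODY></BODY></HTML>"""

collector = MetaCollector()
collector.feed(page)
print(collector.meta)
```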
39
MARC / AACR 2 / TEIH
  • Machine Readable Cataloging
    • Standard for encoding cataloging data (bibliographic and authority)
    • Standoff Annotation (External)


  • Anglo American Cataloguing Rules 2
    • Set of rules used for collecting bibliographic data and for formulating access points (for authors, titles, subjects, related works, etc.)
    • Regulates format and number of access points


  • Text Encoding Initiative Header
    • Header, similar to <HEAD> in HTML
    • Is located within the document (Internal)


  • Z39.50
    • Protocol for clients to ask queries of servers


  • Librarians use AACR2 / TEI to devise values for fields to be encoded by MARC (external) or in TEIH (internal).  This data is accessible by users using the Z39.50 protocol.
40
Data Types
  • Used to describe the different types of (complex) objects in the digital library


  • Structural facets of documents


41
Dublin Core Elements
  • A common denominator set of metadata attributes used for interoperability.  Has recommended values for some fields.











  • Besides Title, Creator, Publisher, Contributor, 11 other fields:


  • Subject
    • Subject, expressed as keywords, key phrases or classification codes that describe a topic of the resource.
    • value from a controlled vocabulary or formal classification scheme.
42
Dublin Core Elements (Cont’d)
  • Description
    • An account of the content of the resource.
    • (e.g., an abstract, ToC, graphical representation of content or a free-text account of the content)
  • Date
    • creation or availability of the resource
    • ISO 8601 (e.g., YYYY-MM-DD)
  • Type
    • The nature or genre of the content of the resource
    • value from a controlled vocabulary (e.g., DCMI Type Vocabulary)
  • Format
    • Media-type or dimensions. Also, identifies the software, hardware, or other equipment needed to display or operate the resource.
    • value from a controlled vocabulary (e.g., MIME)
  • Identifier
    • Unambiguous reference to the resource within a given context.
    • Use a formal identification system, (e.g., URI, DOI, ISBN)
43
DC Elements (Cont’d)
  • Source
    • Reference to a resource from which the present resource is derived (e.g., past edition)
  • Language
    • Language of the intellectual content of the resource.
    • use RFC 3066 with ISO 639 (e.g., “en-GB”)
  • Relation
    • Reference to a related resource
  • Coverage
    • The extent or scope of the resource (e.g., location, time period)
    • value from a controlled vocabulary
  • Rights
    • Statement of copyright or a reference to one
    • If absent, no assumptions may be made
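
To make the element list concrete, here is a sketch of a small Dublin Core record and one common way of embedding it in a web page, as `<meta>` tags whose names carry a `DC.` prefix. All field values are hypothetical sample data, and `to_meta_tags` is a helper written for this example rather than part of any Dublin Core toolkit.

```python
from datetime import date
from html import escape

# Hypothetical record; values follow the recommended encodings where the slides name one.
record = {
    "title": "Manual cataloging and indexing",
    "subject": "Indexing; Classification",
    "date": date(2004, 1, 15).isoformat(),  # ISO 8601, made-up date
    "type": "Text",                         # DCMI Type Vocabulary term
    "format": "text/html",                  # MIME type
    "language": "en-GB",                    # RFC 3066 language tag
}

def to_meta_tags(record):
    """Render each element as an HTML meta tag named DC.<element>."""
    return [
        f'<meta name="DC.{element}" content="{escape(str(value), quote=True)}">'
        for element, value in record.items()
    ]

for tag in to_meta_tags(record):
    print(tag)
```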
44
Warwick Example
45
STARTS: A Metasearching Protocol
  • Defines:
    • Query language
    • Results format
    • Metadata for the collection


  • No specified transport layer or implementation
  • Built to assist metasearchers.
46
Distributed Search? Why?
“Surface” Web vs. “Hidden” Web
47
Hidden Web: Examples
48
Query Probing
  • Idea: Send different queries to categorize data


  • Demo Time! (if it works)
49
Focused Probing: Sampling
  • Transform each rule into a query
  • For each query:
    • Send to database
    • Record number of matches
    • Retrieve top-k matching documents
  • At the end of round:
    • Analyze matches for each category
    • Choose category to focus on
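
The round described above can be sketched against a toy in-memory "database". Everything here is an assumption made for illustration: the `search` interface, the rule format (category mapped to query strings), and the choice to focus on the category with the most matches. A real system would query a remote search form and would likely use a more careful selection criterion.

```python
def search(database, query, k):
    """Toy search interface: number of matches plus the top-k matching documents."""
    terms = query.lower().split()
    hits = [doc for doc in database
            if all(term in doc.lower().split() for term in terms)]
    return len(hits), hits[:k]

def probe_round(database, rules, k=2):
    """Run one probing round; rules maps category -> list of query strings."""
    matches = {category: 0 for category in rules}
    sample = []
    for category, queries in rules.items():
        for query in queries:
            count, top_docs = search(database, query, k)  # send to database
            matches[category] += count                     # record number of matches
            sample.extend(d for d in top_docs if d not in sample)  # keep top-k docs
    focus = max(matches, key=matches.get)  # category to focus on next (assumed rule)
    return focus, matches, sample

database = [
    "heart disease and blood pressure",
    "cancer treatment trials",
    "blood cancer diagnosis",
    "soccer league results",
]
rules = {"Health": ["cancer", "blood"], "Sports": ["soccer", "league results"]}
focus, matches, sample = probe_round(database, rules)
print(focus, matches, len(sample))
```

The retrieved documents double as a sample of the database's content, which is what the document-frequency estimates are later computed from.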


50
Adjusting Document Frequencies
  • We know ranking r of words according to document frequency in sample


  • We know absolute document frequency f of some words from one-word queries


  • Mandelbrot’s formula connects empirically word frequency f and ranking r


  • We use curve-fitting to estimate the absolute frequency of all words in sample
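
Mandelbrot's formula is usually written f = P (r + p)^(-B). The sketch below illustrates the curve-fitting step under a simplifying assumption, p = 0, which reduces the formula to a straight line in log-log space that ordinary least squares can fit. The function names and the sample numbers are invented for the example; fitting p as well would need a non-linear optimizer.

```python
import math

def fit_power_law(known):
    """Fit log f = log P - B log r by least squares; known maps rank -> frequency."""
    xs = [math.log(rank) for rank in known]
    ys = [math.log(freq) for freq in known.values()]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    B = -slope
    P = math.exp(mean_y + B * mean_x)  # intercept of the fitted line
    return P, B

def estimate(rank, P, B):
    """Estimated absolute document frequency for a word at the given sample rank."""
    return P * rank ** (-B)

# Ranks come from the sample; absolute frequencies from one-word queries (made-up values).
known = {1: 1000.0, 10: 100.0, 100: 10.0}
P, B = fit_power_law(known)
print(round(P), round(B, 3), round(estimate(5, P, B)))
```

Once P and B are fitted from the few words with known frequencies, `estimate` can be applied to every other rank in the sample.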
51
Focused Probing
  • Algorithm:
    • Send general queries to determine high level category
    • Send progressively more specific queries to determine mid- and lower- categories
52
Automatic Extraction of Metadata
  • Rule-based scripts (fragile):
    • DC Dot Demo: http://www.ukoln.ac.uk/cgi-bin/dcdot.pl
    • Still heavily cited and used!


  • Wrapper induction: localized extraction
    • Define a local context and features to match and extract


  • Text classification: classification
    • Use features over the entire document to determine classification.


53
MeURLin
  • Classification of URLs to the Open Directory Project


  • http://www.onlineshawnee.com/stories/072901/ent shelton.shtml


  • Doesn’t require webpage, just address
  • About 1/2 - 1/3 as accurate as full words approaches
  • Uses scalable segmentation and expansion techniques
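
To give a feel for what features a URL alone can offer, the sketch below splits the example address into tokens and then breaks a run-together token into dictionary words. This is a generic illustration, not MeURLin's actual segmentation or expansion method; the `url_tokens` and `segment` helpers and the two-word lexicon are assumptions.

```python
import re

def url_tokens(url):
    """Split a URL into lowercase alphanumeric tokens at any other character."""
    return [t for t in re.split(r"[^A-Za-z0-9]+", url.lower()) if t]

def segment(token, lexicon):
    """Split a token into lexicon words if they cover it fully, else keep it whole."""
    best = {0: []}  # best[i] = one segmentation of token[:i]
    for end in range(1, len(token) + 1):
        for start in range(end):
            if start in best and token[start:end] in lexicon:
                best[end] = best[start] + [token[start:end]]
                break
    return best.get(len(token), [token])

url = "http://www.onlineshawnee.com/stories/072901/ent shelton.shtml"
lexicon = {"online", "shawnee"}  # tiny hypothetical word list
features = [word for token in url_tokens(url) for word in segment(token, lexicon)]
print(features)
```

The resulting word list could then be fed to any text classifier in place of the page's full text.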
54
Crosswalking
  • The transfer of metadata from one format to another
    • Retrofitting = updating old metadata to a newer format


  • Aids accessibility and discovery
  • Complementary to OAI / SDARTS (which are centralized approaches)
  • Mostly done manually by specialists
    • CS work to be done here!
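
At its simplest, a crosswalk is a lookup table from the fields of one format to the elements of another. The sketch below maps a few MARC field tags to Dublin Core element names. The four-entry table is a simplified illustration, not an official crosswalk, and the sample record is made up. Tags with no mapping are reported rather than dropped, since deciding what to do with them is where the specialist work lies.

```python
# Illustrative subset only: MARC tag -> Dublin Core element.
MARC_TO_DC = {
    "245": "title",      # title statement
    "100": "creator",    # main entry, personal name
    "650": "subject",    # topical subject heading
    "260": "publisher",  # publication information
}

def crosswalk(marc_fields):
    """Map (tag, value) pairs to a DC-style dict; collect unmapped tags separately."""
    dc, unmapped = {}, []
    for tag, value in marc_fields:
        element = MARC_TO_DC.get(tag)
        if element is None:
            unmapped.append(tag)
        else:
            dc.setdefault(element, []).append(value)  # DC elements are repeatable
    return dc, unmapped

# Hypothetical source record.
marc_record = [
    ("100", "Example, Author"),
    ("245", "An example title"),
    ("650", "Indexing"),
    ("650", "Classification"),
    ("999", "local data"),
]
dc_record, unmapped = crosswalk(marc_record)
print(dc_record, unmapped)
```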
55
Reference
  • Dublin Core Tool List
    • http://www.lub.lu.se/tk/metadata/dctoollist.html
    • And many others
  • The Getty Research Institute
    • http://www.getty.edu/research/institute/
  • Crosswalking
    • http://www.ukoln.ac.uk/metadata/interoperability/


56
Summing Up
  • Metadata authoring is highly intricate but serves two complementary purposes
    • Inventory
    • Access (what we care more about)
    • Uses CV standards (licensing drawback)

  • Automated approaches have promise …
    • To access and annotate more data
    • But generally need rework or NLP post-processing to make the data fit a standard


  • Questions?!?