1
|
- Week 3 Min-Yen KAN
- *heavily drawn from Lancaster (98) Indexing and Abstracting in Theory
and Practice
|
2
|
- Ranganathan (1957)
- Books are for use
- Every reader his book
- Every book its reader
- Save time of the reader
- The library is a growing organism
|
3
|
- Mesopotamians kept track of their tablets with a list of their incipits:
- What is it?
- A poem?
- _______
|
4
|
- (Subject) Indexing
- Assigning index terms to represent a document
- Assists in document retrieval
- Classification
- Assigning a label to a document to assist in organizing that
information
- Not necessarily semantic labels
|
5
|
- Also known as Sapir-Whorf hypothesis
- A loose definition:
- Our language to some extent determines the way in which we view and
think about the world around us.
- An example: time
- Tomorrow = day after today
- بكرة (“bukra”) = some point in the future
- The result?
- ____________ representation
- Every representation offers a _____
- Many AI researchers reject / ignore this notion
|
6
|
- Conceptual analysis
- Determine “aboutness”
- Computational approaches: TF × IDF
- Translation
- Expressing the concepts as index terms
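A minimal sketch of the TF × IDF idea named above, assuming whitespace tokenization and the common weighting tf × log(N / df). The slide does not fix a particular variant, so the formula, the function name `tf_idf`, and the toy documents are all illustrative assumptions.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, weight = tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Toy collection (made up for illustration).
docs = [
    "library index index terms",
    "library catalog records",
    "library metadata records",
]
weights = tf_idf(docs)
# Candidate index terms for the first document, highest weight first.
top_terms = sorted(weights[0], key=weights[0].get, reverse=True)[:2]
```

A term that appears in every document ("library" here) gets weight zero, which is why it makes a poor index term for telling these documents apart.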
|
7
|
- Generic: What is it about? What’s the main content?
- e.g., The History of Sociology
- Specific: Why has it been added to our collection? What aspects will our
users be interested in?
- cf., “Every reader his book”
- Thus, organizations index ________
- Different _______ (specialty, general interest)
- Different _______ (own materials, 3rd party)
|
8
|
|
9
|
- Long (Exhaustive)
- Gives good _____ at cost of ______
- Few records fit in the UI
- Hard to figure out which are main aspects
- Short (Selective)
- Gives good ______ at cost of ____
- Less work
- In practice: offer levels of indexing for tasks
|
10
|
- Extraction: use terms directly from the source itself
- Assignment: use terms from an outside source.
- Usually from a controlled vocabulary.
|
11
|
- Benefits
- (Potentially) high precision, high recall
- Question: which of these components is more important?
- Drawbacks
- Costly to construct and maintain
- Is difficult to use
|
12
|
- Control / suggest synonyms, pick an authoritative term
- Especially for entities: people (maiden names to married names), places
(St. Petersburg)
- Distinguish among homographs (e.g., mercury, turkey)
- Link terms with their relationship (is-a and all others (associative))
|
13
|
- People
- Use most common name:
- Dr Seuss
- Not Theodor Seuss Geisel
- Geographic Names
- Use latest name:
- Namibia
- Not South West Africa
- Data must be constantly updated to provide users with best access points
– not an easy job
|
14
|
- Good structure to find the appropriate term
- Standard fields in a CV:
- USE/UF: Use instead / Use For (authoritative)
- BT/NT: Broader / Narrower Term in terms of hierarchy
- RT: Related Term (Associative Term)
- Applied by experienced personnel
- A large vocabulary can be hard to map to
- Question: What to do if the controlled vocabulary has no term for the
concept to be indexed?
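One way to picture the USE/UF, BT/NT and RT fields is as a small linked data structure. The sketch below assumes a dict-based vocabulary; the terms, the `VOCAB` layout, and the helper names are hypothetical and not taken from any real thesaurus.

```python
# Hypothetical mini-vocabulary; labels are illustrative only.
VOCAB = {
    "Automobiles": {
        "UF": ["Cars"],               # this term is Used For "Cars"
        "BT": ["Motor vehicles"],     # broader term
        "NT": ["Sports cars"],        # narrower term
        "RT": ["Roads"],              # related (associative) term
    },
    "Cars": {"USE": "Automobiles"},   # non-preferred entry pointing to the authority
    "Motor vehicles": {"NT": ["Automobiles"]},
    "Sports cars": {"BT": ["Automobiles"]},
}

def preferred(term):
    """Follow USE references until an authoritative term is reached."""
    seen = set()
    while "USE" in VOCAB.get(term, {}):
        if term in seen:
            raise ValueError(f"circular USE reference at {term!r}")
        seen.add(term)
        term = VOCAB[term]["USE"]
    return term

def broader_chain(term):
    """List ancestors by walking the first BT link upward."""
    chain = []
    term = preferred(term)
    while VOCAB.get(term, {}).get("BT"):
        term = VOCAB[term]["BT"][0]
        chain.append(term)
    return chain
```

Here `preferred("Cars")` resolves to the authoritative heading, and `broader_chain("Sports cars")` climbs the hierarchy. A term missing from the vocabulary is returned unchanged, which leaves the open question above (what to do when no term exists) to the indexer.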
|
15
|
- General CVs
- Sears List of Subject Headings
- More general divisions, not intended for research libraries
- Geared towards general subdivisions
- Library of Congress Subject Headings (LCSH)
- Comprehensive, very large, over five volumes
- Domain-specific CV
- Medical Subject Headings (MeSH)
- Byproduct of indexing the NLM
- Art & Architecture Thesaurus (AAT)
- Object, images, architecture, styles
- ERIC Thesaurus
- Educational materials (journals, lesson plans and computer files)
|
16
|
|
17
|
- Uniqueness
- Be able to fetch a specific resource given a call number
- Notational Permanence
- (Seldom) have to reorganize/reassign labels
- (e.g., paradigm shift in mathematics)
- Comprehensiveness
- Can successfully classify most things
- Serendipity
- Collocate related subjects together
- Ease of Use
- Ways of resolving ambiguities
- (e.g., given religious architecture and Egyptian architecture, where
does an article on the architecture of Egyptian temples go?)
|
18
|
- Enumerative
- Produce an alphabetical list of subject headings, assign numbers to
each heading in alphabetical order
- Hierarchical
- Recursively divides subjects hierarchically, from most general to most
specific
- Faceted (analytico-synthetic):
- Analytic: Divides subjects into mutually exclusive orthogonal facets
- Synthetic: Combine facets to get a new class
- - From Taylor (92)
|
19
|
- Divide knowledge into ten classes
- Recursively divide these categories into ten (or fewer classes)
- What type of classification scheme is it?
- 000 Generalities
- 100 Philosophy & psychology
- 200 Religion
- 300 Social sciences
- 400 Language
- 500 Natural sciences & mathematics
- 600 Technology (Applied sciences)
- 700 The arts
- 800 Literature & rhetoric
- 900 Geography & history
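Because each level divides into at most ten, the digits of a call number spell out its position in the scheme. A small sketch, assuming three-digit class numbers with an optional decimal extension; only the ten main-class captions from the list above are used, and the function names are illustrative.

```python
# Captions of the ten main classes, indexed by the hundreds digit.
DDC_MAIN_CLASSES = [
    "Generalities",
    "Philosophy & psychology",
    "Religion",
    "Social sciences",
    "Language",
    "Natural sciences & mathematics",
    "Technology (Applied sciences)",
    "The arts",
    "Literature & rhetoric",
    "Geography & history",
]

def class_path(call_number):
    """Return the hundreds, tens and units levels that contain a call number."""
    whole = int(float(call_number))   # drop any decimal extension, e.g. "025.3" -> 25
    if not 0 <= whole <= 999:
        raise ValueError(f"outside 000-999: {call_number!r}")
    digits = f"{whole:03d}"
    return [digits[0] + "00", digits[:2] + "0", digits]

def main_class(call_number):
    """Return the main class (number and caption) for a call number."""
    hundreds = class_path(call_number)[0]
    return f"{hundreds} {DDC_MAIN_CLASSES[int(hundreds[0])]}"
```

For example, `class_path("025.3")` gives the three enclosing levels and `main_class("512")` gives the top-level class.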
|
20
|
- Four-level tree (3 coded levels and a fourth uncoded level)
- 16 General Terms
- H. Information Systems
- H.0 GENERAL
- H.1 MODELS AND PRINCIPLES
- H.2 DATABASE MANAGEMENT (E.5)
- H.3 INFORMATION STORAGE AND RETRIEVAL
- H.4 INFORMATION SYSTEMS APPLICATIONS
- H.5 INFORMATION INTERFACES AND PRESENTATION (e.g., HCI) (I.7)
- H.m MISCELLANEOUS
- I. Computing Methodologies
- I.0 GENERAL
- I.1 SYMBOLIC AND ALGEBRAIC MANIPULATION
- I.2 ARTIFICIAL INTELLIGENCE
- I.3 COMPUTER GRAPHICS
- I.4 IMAGE PROCESSING AND COMPUTER VISION
- I.5 PATTERN RECOGNITION
- I.6 SIMULATION AND MODELING (G.3)
- I.7 DOCUMENT AND TEXT PROCESSING (H.4, H.5)
- I.m MISCELLANEOUS
|
21
|
- Facet – a characteristic of the resource (e.g., language)
- Each facet organized hierarchically
- allow drill-down browsing
- represented by
- set values (taxonomy)
- continuous values (spectrum)
|
22
|
- Ranganathan proposed 5 basic facets (PMEST):
- Personality – the subject matter
- Material
- Energy – process or action
- Space
- Time
- Each facet would have its own classification schedule
- String together notation to get classification number
- Example:
- The design of wooden furniture in 18th century America
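The "string together" step can be sketched as follows. The facet codes are entirely made up for the furniture example, and the connecting symbols are assumed to follow the usual Colon Classification punctuation (comma, semicolon, colon, period, apostrophe); neither is taken from a real schedule.

```python
# (facet name, connecting symbol) in PMEST citation order.
# The symbols are an assumption based on Colon Classification practice.
FACET_ORDER = [
    ("personality", ","),
    ("material", ";"),
    ("energy", ":"),
    ("space", "."),
    ("time", "'"),
]

def synthesize(main_class, facets):
    """String facet codes together in PMEST order, skipping absent facets."""
    parts = [main_class]
    for name, symbol in FACET_ORDER:
        code = facets.get(name)
        if code:
            parts.append(symbol + code)
    return "".join(parts)

# Hypothetical codes for "the design of wooden furniture in 18th century America".
furniture = {
    "personality": "FURN",   # the subject matter: furniture
    "material": "WOOD",
    "energy": "DESIGN",      # the process or action
    "space": "US",
    "time": "1700s",
}
number = synthesize("X", furniture)
```

The analytic step is choosing one value per facet; the synthetic step is the concatenation. A work with no time or place aspect simply leaves those facets out.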
|
23
|
- DDC and LCSH _____ centralized systems
- Nowadays, rely on a distributed approach to update
- Either hierarchically determined authorities
- Or arbitration of conflicts
- Think CVS and source control systems
|
24
|
- Now that we have free-text searching, do you feel controlled vocabularies are still necessary or not? What do you feel their impact will be in the future of the digital library?
- How would you improve the ACM classification scheme? How to deal with legacy schemes?
- Booksellers also need to use classification to shelve books. Which type of classification do you think booksellers use? Would you make any adaptations to the classification schemes shown today?
|
25
|
- *Parts of this lecture come from
Lilian Tang’s lecture material at the Univ. of Surrey
|
26
|
- Data about data
- “Cataloging or indexing information that [information professionals]
create to arrange, describe, and otherwise enhance access to an
information object”
- -- Gilliland-Swetland (1998)
- “Data that describe attributes of a resource, characterize its
relationships, support its discovery and effective use, and exist in an
electronic environment”
|
27
|
- What is Metadata?
- Packaging Metadata
- Structural Metadata
- Hidden Web Metadata
- Crosswalking and Automated Extraction
- Metadata formats
- HTML Metadata
- AACR2 / TEIH /
MARC / Z39.50
- Dublin Core
|
28
|
- Administrative
- Structural
- Descriptive
- Intellectual Property Rights
- Use
|
33
|
|
34
|
- Multipurpose Internet Mail Extensions
- (text/plain, image/jpeg, application/msword)
- Simple format, pre-web
- Can code an unofficial type using the x- subtype prefix (e.g.,
audio/x-pn-realaudio)
- Application tag: need to use an application to handle this data
- Wild success shows a simple system is best:
- Good for adoption / authoring
- Good for common denominator
|
35
|
- DOI identifier records: multiple versions of a single document (hi res
/ low res)
- Syntax should be mirrored in reference metadata
|
36
|
|
37
|
- Based again on how people search
(The Potato Eaters)
- I’m looking for a picture of a group.
- I’d like it to be a family group.
- This family should be doing something that would be typical for a
family, like sitting around a table with food in front of them, look
grateful for what they have to eat.
- Facet analysis is a good approach
- Objective (“of”)
- Subjective (“about”)
|
38
|
- <HTML><HEAD>
- <META NAME="attribute" CONTENT="value">
- </HEAD>… </HTML>
- Not regulated or controlled
- You can add your own tags
- Only certain ones parsed by finding aids (e.g., GoogleBot)
- Many tags use other metadata formats
39
|
- Machine Readable Cataloging
- Standard for encoding cataloging data (bibliographic and authority)
- Standoff Annotation (External)
- Anglo American Cataloguing Rules 2
- Set of rules used for collecting bibliographic data and for formulating
access points (for authors, titles, subjects, related works, etc.)
- Regulates format and number of access points
- Text Encoding Initiative Header
- Header, similar to <HEAD> in HTML
- Is located within the document (Internal)
- Z39.50
- Protocol for clients to ask queries of servers
- Librarians use AACR2 / TEI to devise values for fields to be encoded by MARC (external) or in TEIH (internal). This data is accessible to users through the Z39.50 protocol.
|
40
|
- Used to describe the different types of (complex) objects in the digital
library
- Structural facets of documents
|
41
|
- A common denominator set of metadata attributes used for
interoperability. Has recommended
values for some fields.
- Besides Title, Creator, Publisher, Contributor, 11 other fields:
- Subject
- Subject, expressed as keywords, key phrases or classification codes
that describe a topic of the resource.
- value from a controlled vocabulary or formal classification scheme.
|
42
|
- Description
- An account of the content of the resource.
- (e.g., an abstract, ToC, graphical representation of content or a
free-text account of the content)
- Date
- creation or availability of the resource
- ISO 8601 (e.g., YYYY-MM-DD)
- Type
- The nature or genre of the content of the resource
- value from a controlled vocabulary (e.g., DCMI Type Vocabulary)
- Format
- Media-type or dimensions. Also, identifies the software, hardware, or
other equipment needed to display or operate the resource.
- value from a controlled vocabulary (e.g., MIME)
- Identifier
- Unambiguous reference to the resource within a given context.
- Use a formal identification system, (e.g., URI, DOI, ISBN)
|
43
|
- Source
- Reference to a resource from which the present resource is derived
(e.g., past edition)
- Language
- Language of the intellectual content of the resource.
- use RFC 3066 with ISO 639 (e.g., “en-GB”)
- Relation
- Reference to a related resource
- Coverage
- The extent or scope of the resource (e.g., location, time period)
- value from a controlled vocabulary
- Rights
- Statement of copyright or a reference to one
- If absent, no assumptions may be made
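A Dublin Core record is often embedded in a web page as META tags whose names carry a "DC." prefix. The sketch below assumes that convention and checks only the date format; the record values and the function name `dc_meta_tags` are hypothetical.

```python
from datetime import date
from html import escape

def dc_meta_tags(record):
    """Render a {element: value} record as <meta name="DC.element"> tags."""
    if "date" in record:
        # Raises ValueError unless the value is a valid ISO 8601 calendar date.
        date.fromisoformat(record["date"])
    return [
        f'<meta name="DC.{element}" content="{escape(str(value))}">'
        for element, value in record.items()
    ]

# Made-up record for illustration.
record = {
    "title": "Indexing notes",
    "date": "2004-08-25",
    "language": "en-GB",
    "format": "text/html",
}
tags = dc_meta_tags(record)
```

Values for Type, Format, Language and Coverage would normally be checked against their recommended vocabularies as well; that step is left out here.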
|
44
|
|
45
|
- Defines:
- Query language
- Results format
- Metadata for the collection
- No specified transport layer or implementation
- Built to assist metasearchers.
|
46
|
|
47
|
|
48
|
- Idea: Send different queries to categorize data
- Demo Time!
(if it works)
|
49
|
- Transform each rule into a query
- For each query:
- Send to database
- Record number of matches
- Retrieve top-k matching documents
- At the end of round:
- Analyze matches for each category
- Choose category to focus on
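One round of the loop above can be sketched with an in-memory stand-in for the remote database. The rules, documents, and function names are invented for illustration, and "choose category to focus on" is simplified to "pick the category with the most matches".

```python
def search(database, query, k=3):
    """Stand-in for a remote search interface: (match count, top-k documents)."""
    hits = [doc for doc in database if query in doc.lower().split()]
    return len(hits), hits[:k]

def probe_round(database, rules, k=3):
    """Send one query per rule, total the matches per category, pick a focus."""
    totals = {category: 0 for category in rules}
    sample = []
    for category, queries in rules.items():
        for query in queries:
            count, top_docs = search(database, query, k)
            totals[category] += count     # record number of matches
            sample.extend(top_docs)       # keep retrieved documents as a sample
    focus = max(totals, key=totals.get)   # category to probe further next round
    return focus, totals, sample

# Hypothetical rules (category -> probe queries) and hidden-web database.
rules = {"health": ["cancer", "diabetes"], "sports": ["soccer", "tennis"]}
database = [
    "new cancer therapy trial",
    "diabetes diet advice",
    "cancer screening guide",
    "tennis open results",
]
focus, totals, sample = probe_round(database, rules)
```

In a real setting `search` would be an HTTP request to the database's query form, and only the reported match counts and result pages would be visible to the prober.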
|
50
|
- We know ranking r of words according to document frequency in sample
- We know absolute document frequency f of some words from one-word
queries
- Mandelbrot’s formula empirically connects word frequency f and ranking r
- We use curve-fitting to estimate the absolute frequency of all words in
sample
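The slide does not spell the formula out; the form usually cited is f = P (r + p)^(-B). Under that assumption, a rough curve-fitting sketch: for a fixed p, log f is linear in log(r + p), so P and B follow from least squares, and p is picked from a small grid. The grid values and function names are illustrative choices.

```python
import math

def fit_mandelbrot(known, p_grid=(0.0, 0.5, 1.0, 2.0, 5.0)):
    """Fit f = P * (r + p) ** -B to known (rank, frequency) pairs.

    Needs at least two pairs with distinct ranks (ranks start at 1).
    """
    ys = [math.log(f) for _, f in known]
    mean_y = sum(ys) / len(ys)
    best = None
    for p in p_grid:
        xs = [math.log(r + p) for r, _ in known]
        mean_x = sum(xs) / len(xs)
        sxx = sum((x - mean_x) ** 2 for x in xs)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        slope = sxy / sxx
        intercept = mean_y - slope * mean_x
        error = sum((intercept + slope * x - y) ** 2 for x, y in zip(xs, ys))
        # Keep the p whose straight-line fit has the smallest squared error.
        if best is None or error < best[0]:
            best = (error, math.exp(intercept), p, -slope)
    _, P, p, B = best
    return P, p, B

def estimate_frequency(rank, P, p, B):
    """Estimated absolute document frequency of the word at a given rank."""
    return P * (rank + p) ** -B

# Synthetic "known" frequencies, generated from the formula itself for the demo.
known = [(r, 1000 * (r + 2.0) ** -1.2) for r in (1, 5, 20, 100)]
P, p, B = fit_mandelbrot(known)
```

With the fitted parameters, `estimate_frequency` fills in a frequency for every rank in the sample, including words that were never sent as one-word queries.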
|
51
|
- Algorithm:
- Send general queries to determine high level category
- Send progressively more specific queries to determine mid- and lower-level
categories
|
52
|
- Rule-based scripts (fragile):
- DC Dot Demo: http://www.ukoln.ac.uk/cgi-bin/dcdot.pl
- Still heavily cited and used!
- Wrapper induction: localized extraction
- Define a local context and features to match and extract
- Text classification: classification
- Use features over the entire document to determine classification.
|
53
|
- Classification of URLs to the
Open Directory Project
- http://www.onlineshawnee.com/stories/072901/ent shelton.shtml
- Doesn’t require webpage, just address
- About 1/3 to 1/2 as accurate as approaches that use the page's full text
- Uses scalable segmentation and expansion techniques
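The simplest part of segmentation, splitting a URL at punctuation, can be sketched as below. This is only a rough stand-in: it does not split run-together words inside a token or expand abbreviations, and the stop list and example URL are assumptions.

```python
import re
from urllib.parse import urlsplit

# Tokens that say nothing about the topic (illustrative list).
STOP_TOKENS = {"www", "com", "htm", "html", "shtml"}

def url_tokens(url):
    """Split a URL's host and path into lowercase alphabetic tokens."""
    parts = urlsplit(url)
    text = f"{parts.netloc} {parts.path}".lower()
    tokens = re.findall(r"[a-z]+", text)
    return [token for token in tokens if token not in STOP_TOKENS]

# Hypothetical URL for illustration.
tokens = url_tokens("http://www.example.com/stories/sports/local_team.shtml")
```

The resulting tokens would then be fed to an ordinary text classifier in place of the words of the page itself.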
|
54
|
- The transfer of metadata from one format to another
- Retrofitting = updating old metadata to a newer format
- Aids accessibility and discovery
- Complementary to OAI / SDARTS (which are centralized approaches)
- Mostly done manually by specialists
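At its core a crosswalk is a field-to-field mapping table. The sketch below maps a handful of MARC tags to Dublin Core elements; it is an illustrative subset only (real crosswalks also deal with subfields, indicators, and lossy mappings), and the sample field values are placeholders.

```python
# Illustrative subset of a MARC-to-Dublin-Core mapping.
MARC_TO_DC = {
    "100": "creator",       # main entry, personal name
    "245": "title",         # title statement
    "260": "publisher",     # publication information
    "520": "description",   # summary note
    "650": "subject",       # topical subject heading
}

def crosswalk(marc_fields):
    """Convert [(tag, value), ...] into ({dc_element: [values]}, unmapped_tags)."""
    record = {}
    unmapped = []
    for tag, value in marc_fields:
        element = MARC_TO_DC.get(tag)
        if element is None:
            unmapped.append(tag)   # left for a specialist to review
        else:
            record.setdefault(element, []).append(value)
    return record, unmapped

# Placeholder values, not a real catalog record.
fields = [
    ("245", "Example Title"),
    ("100", "Doe, Jane"),
    ("650", "Indexing"),
    ("650", "Abstracting"),
    ("300", "(physical description)"),
]
record, unmapped = crosswalk(fields)
```

The `unmapped` list is where the manual work shows up: every tag without a target element needs a human decision about whether to drop it, merge it, or extend the mapping.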
|
55
|
- Dublin Core Tool List
- http://www.lub.lu.se/tk/metadata/dctoollist.html
- And many others
- The Getty Research Institute
- http://www.getty.edu/research/institute/
- Crosswalking
- http://www.ukoln.ac.uk/metadata/interoperability/
|
56
|
- Metadata authoring is highly intricate but serves two complementary purposes
- Inventory
- Access (what we care more about)
- Uses CV standards (licensing drawback)
- Automated approaches have promise …
- To access and annotate more data
- But generally need rework, or NLP post-processing, to make data fit the
standard
- Questions?!?
|