Digital Libraries
Orientation
Week 1                Min-Yen KAN

What is a library?
A place set apart to contain books for reading, study, or reference.
(Not applied, e.g. to the shop or warehouse of a bookseller.)
A building … containing a collection of books for the use of the public or of some particular portion of it, or of the members of some society or the like;
a public institution or establishment, charged with the care of a collection of books, and the duty of rendering the books accessible to those who require to use them.

What is a library?
A private commercial establishment for the lending of books, the borrower paying either a fixed sum for each book lent or a periodical subscription.
a great mass of learning or knowledge;
the objects of a person's study, the sources on which he depends for instruction.
Computers. An organized collection of routines, esp. of tested routines suitable for a particular model of computer
Biology. a collection of sequences of DNA …  that represent the genetic material of a particular organism or tissue

Introduction
Bush’s “As we may think”
Writes this at the end of WW II
_____ was the first computer, born to compute ballistic tables fast
_______ just invented 5 years ago
_______ (“display technology”) still a less than perfect process.
_______ (“storage technology”) was a mature and stable technology.

Vannevar Bush (1890-1974)
Director of the Office of Scientific Research and Development
led 6000 scientists in R&D for WWII
Predicted many technological advances
the “memex” is one whose spirit we are implementing
the purpose was to provide scientists the capability to exchange information; to have access to the totality of recorded information

Design for Memex (c. 1945)

Memex
Integrated computer, keyboard, and desk
“mechanized private file and library”
remove drudgery from information retrieval
suggested implementation was microfilm
various user operations  are suggested
_______________ was the main purpose
“the process of tying two items together is the important thing”
prelude to hypertext...

Memex
Information could come pre-associatively indexed, but the key point was ______________
WWW still does not provide that today
Bush observes that tools change our way of doing, and expand the horizons before us
full impact of WWW and DLs still not known

What is a Digital Library (DL)?
“a collection of information that is both digitized and organized” (Lesk)
there are a number of alternative definitions, but this seems fair enough
no mention of ________, _________, __________, etc.
The aim is not just to reform the current library system; rather, we aim to organize and access the “information overload”

Outline for today
Introduction to libraries √
Course administration
Reading and writing research
To think about

Course administration
Teaching staff
Web sites
Objective
Syllabus
Assessment overview
Survey paper and project
Any questions?

Teaching staff
Lecturer:
Min-Yen Kan (“Min”)
kanmy@comp.nus.edu.sg
Office: S15 05-05
6875-1885
Hours: 4-6 pm Tuesdays
Interests:
rock climbing, ballroom dancing, and inline skating… and digital libraries!

Course web sites
http://ivle.nus.edu.sg/
Discussion forum
Any questions related to the course should be raised on this forum
I expect you to talk amongst yourselves to answer questions, so I will not answer many questions here.
Send me emails for urgent or personal matters
Announcements!
Workbin: Lecture notes (purposely incomplete!)
http://www.comp.nus.edu.sg/~cs5244
Grading specification
Other supplementary content

Objective
Building, using, presenting and maintaining large volumes of information
Contrast computational approaches with traditional library science methods

Hey min, go over the website!
http://www.comp.nus.edu.sg/~cs5244

Discussions
Class participation is very important. There are no “dumb” questions. You will only be penalized for “no” questions / comments.
Possibilities:
Name tags
Cold calls
Small group discussion and presentation

Midterm and Final
1 hour midterm (10%) and a
2 hour final (20%)
Both basically of the same format
Calculation questions – that have an exact answer
Essay questions – many to look at tradeoffs in the digital library realm
Not necessarily right or wrong answers

Literature survey
Each student will pick an area of study to survey at least 4 papers in detail.
Must be interesting to you
Journal or conference papers from an authority list
Limit to 6 pages
Individual work only
Give your perspective on area’s future
Add value by comparing strengths and weaknesses of different approaches.

Final project
Students will self-organize into groups for the final projects, shortly after the survey papers are due.
Requires original work
Cooperation and coordination
Report as a conference submission
Poster presentation to the public
Sample topics on the web page

Outline for today
Introduction to libraries √
Course administration √
Reading and writing research
To think about

Reading and writing research papers
References:
http://www.cse.ogi.edu/~dylan/efficientReading.html
ftp://fast.cs.utah.edu/pub/writing-papers.ps
This section is partially from Surendar Chandra of the University of Notre Dame.

Why do you read a paper?
Understand and learn new contributions
However…
Not all papers are “good”
Not all papers are “interesting”
Not all papers are “worthwhile” for you
You have to learn to identify a good paper and spend your time wisely
Breadth
Depth
React

Reading a research paper
What is this paper about?
Read the title and the abstract
If you still don’t know what this paper is about, then this is a poorly-written paper.
 Read the conclusion
Are you now sure you know what this paper is about? If not, throw it away.
Read the ___________
Read the ____________
Read _____________ and captions

How to read a paper
See who wrote it, where it was published, and when it was written (credibility)
Skim references
Are the authors aware of relevant related work?
Do you know the work that they cite?
Do you know other work that they should have cited?

How to read a paper - depth
Approach with scientific skepticism
Read with context of other things that you’ve read in mind
It’s only one part of the puzzle of a subject
Examine the assumptions.  Are they:
Reasonable?
What are the limitations of the work
There are always limitations!  Did they disclose them?

How to read a paper - depth
Examine the methods:
Did they measure what they claim?
Can they explain what they observed?
Want an analysis of why the system behaves a certain way, not raw data.
Did they have adequate controls?
Were tests carried out in a standard way? Were the performance metrics standard?
If not, do they explain their metrics clearly?

How to read a paper - depth
Examine the statistics:
“Lies, d*mned lies and statistics”
Appropriate statistical tests applied properly?
Did they do proper error analysis?
Are the results statistically significant?

How to read a paper - depth
Examine the conclusions:
Do the conclusions follow logically from the experiments?
What other explanations are there for the observed effects?
What other conclusions or correlations are in the data that were not pointed out?

How to read a paper - react
Take notes
Highlight major points
React to the points in the paper
Place this work with your own experience
If you doubt a statement, note your objection
Summarize what you read
Good practice: maintain your own bibliography of all papers that you ever read
___________ !

How to write a research paper
Write it such that anyone who reads it using the method we just discussed understands the idea
Clearly explain what problem you are solving, why it is interesting and how your solution solves this interesting problem
Be crisp. Explain what your contributions are, what your ideas are and what are others’ ideas

Any questions?
Introduction to libraries √
Course administration √
Reading and writing research √

To think about for discussion
What are the functions of a traditional library?
Are these the same functions in the digital library?
How is the digital library different from:
_________?
_________?

Coffee Break

Digital Libraries
Week 1                      Min-Yen KAN
Implementation of
(Textual) Information Retrieval

What is information retrieval?
Part of the information seeking process
Matches a query with most relevant documents
View a query as a ______________

Searching in books
_______
_______
_______
Procedure:
Look up topic
Find the page
Skim page to find topic
Index, 11, 103-151, 443
Audio, 476
Comparison of methods 143-145
Granularity, 105, 112
N-gram, 170-172
Of integer sequences, 11
Of musical themes, 11
Of this book, 103, 507ff
Within inverted file entry, see skipping
Index compression, 114-129, 198-201, 235-237
Batched, 125,128
Bernoulli, 119-122, 128, 150, 247, 421
Context-sensitive, 125-126
Global, 115-121
Hyperbolic model, 123-124, 150
In MG, 421-423
Interpolative coding, 126-128
Local, 115, 121-122, 247
Nonparameterized, 115-119
Observed frequency, 121, 124-125, 128, 247
Parameterized, 115
Performance of, 128-129, 421
Skewed Bernoulli, 122-123, 138, 150
Within-document frequencies, 198-201
Index Construction, 223-261 (see also inversion)
bitmaps, 255-256

Information retrieval
Algorithm
(Permute query to fit index)
Search index
Go to resource
(Permute query to fit item)
(Search for item)

What to index?
Book indices have key words and phrases
Search engines index ____________
Why the disparity?
What do people really search for?

Trading precision for size
Can save up to 32% without too much loss:
Stemming
Usually just word inflection
Information → Inform = Informal, Informed
Case folding
N.B.: keep odd variants (e.g., NeXT, LaTeX)
Stop words
Don’t index common words, people won’t search on them anyways
Pop Quiz: Which of these techniques are more effective?
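The three techniques above can be sketched as one small normalization pipeline. This is only an illustration: the stop list, the list of case-sensitive spellings, and the suffix-stripping `crude_stem` are made-up stand-ins (a real system would use a proper stemmer and a longer stop list).

```python
import re

# Hypothetical, deliberately tiny stop list; real systems use longer ones.
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}

# Odd variants whose capitalization should survive case folding.
KEEP_CASE = {"NeXT", "LaTeX"}


def crude_stem(word):
    """Strip a few common English suffixes (a toy stand-in for a real stemmer)."""
    for suffix in ("ation", "ing", "ed", "al", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Tokenize, case-fold, drop stop words, and stem what is left."""
    terms = []
    for token in re.findall(r"[A-Za-z]+", text):
        if token in KEEP_CASE:
            terms.append(token)      # keep the odd variant exactly as written
            continue
        token = token.lower()        # case folding
        if token in STOP_WORDS:
            continue                 # stop words never reach the index
        terms.append(crude_stem(token))
    return terms


terms = index_terms("The Information in LaTeX")
```

With this toy stemmer, "information", "informal" and "informed" all collapse to the same index term, which is the loss of precision the slide refers to.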

Indexing output
Output = L_W, D_D, I_{W×D}
Inverted File (Index)
Postings, e.g., w_t → (d_1, f_{w_t,d_1}), (d_2, f_{w_t,d_2}), …, (d_n, f_{w_t,d_n})
Variable length records
Lexicon:
String w_t
Document frequency f_t
Address within inverted file I_t
Sorted, fixed length records
×       D1 D2 D3 D4 D5 D6 … Dm
W1           1        1
W2       2            1
W3        1
W4                         1           1
W5        1           1
W6            1       1   1
Wn
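A minimal in-memory sketch of the two output structures, assuming documents have already been reduced to lists of index terms. The three toy documents and the name `build_index` are invented for illustration. The "address" is simply a position in a Python list, standing in for a byte offset in the inverted file.

```python
from collections import Counter


def build_index(docs):
    """docs maps document number -> list of index terms.

    Returns (lexicon, inverted_file):
      inverted_file is a list of postings lists [(d, f_dt), ...];
      lexicon maps term -> (f_t, address), where address indexes inverted_file.
    """
    postings = {}
    for d in sorted(docs):
        for term, f_dt in Counter(docs[d]).items():
            postings.setdefault(term, []).append((d, f_dt))

    lexicon, inverted_file = {}, []
    for term in sorted(postings):                    # lexicon is kept sorted
        f_t = len(postings[term])                    # number of docs with term
        lexicon[term] = (f_t, len(inverted_file))    # address of its postings
        inverted_file.append(postings[term])         # variable-length record
    return lexicon, inverted_file


docs = {1: ["a", "a", "b", "b", "c"], 2: ["a", "a", "d", "b"], 3: ["b", "d"]}
lexicon, inverted_file = build_index(docs)
```

Looking up a term is then two steps: find it in the lexicon, then follow the address to its postings.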

Trading precision for size, redux
Pop Quiz: Which of these techniques are more effective?
Typical:
Lexicon: 30 MB; Inverted File: 400 MB
Stemming
Affects Lexicon
Case folding
Affects Lexicon
Stop words
Affects Inverted File

Is fine-grained indexing worthwhile?
Problem: still have to scan document to find the term.
Cons:
Need access methods to take advantage
Extra storage space overhead (variable sized)
Alternative methods:
Hierarchical encoding (doc #, para #, sent #, word #) to shrink offset size
Split long documents into n shorter ones.

Inverted file compression
Clue: Encode gap length instead of offset
Use small number of bits to encode  more common gap lengths
(e.g., Huffman encoding)
Better: Use a distribution of expected gap length (e.g., Bernoulli process)
If p = probability that any word x appears in doc y,
then P(gap size = z) = (1-p)^z p. This is a geometric distribution.
Works for intra and inter-document index compression
Why does it hold for documents as well as words?
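A small sketch of the gap idea, under one stated assumption: z is read here as the number of documents skipped between two occurrences, so the stored difference between successive document numbers is z + 1. The function names are made up for illustration, and no actual bit-level coder (Huffman or otherwise) is shown.

```python
def to_gaps(doc_ids):
    """Replace sorted document numbers by differences between neighbours."""
    gaps, previous = [], 0
    for d in doc_ids:
        gaps.append(d - previous)
        previous = d
    return gaps


def from_gaps(gaps):
    """Invert to_gaps by keeping a running sum."""
    doc_ids, total = [], 0
    for g in gaps:
        total += g
        doc_ids.append(total)
    return doc_ids


def gap_probability(z, p):
    """P(z documents without the term, then one with it) = (1-p)^z * p."""
    return (1 - p) ** z * p


gaps = to_gaps([3, 8, 9, 15])
restored = from_gaps(gaps)
```

Because small gaps are the most probable under this model, a coder that gives them the shortest codes spends few bits on frequent terms.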

Building the index – Memory based inversion
Takes lots of main memory, ugh!
Can we reduce the memory requirement?

Sort-based inversion
Idea: try to make random access of disk (memory) sequential
// Phase I – collection of term appearances on disk
For each document D_d in collection, 1 ≤ d ≤ N
    Read D_d, parsing it into index terms
    For each index term t in D_d
        Calculate f_{d,t}
        Dump to file a tuple (t, d, f_{d,t})
// Phase II – sort tuples
Sort all the tuples (t, d, f) using External Mergesort
// Phase III – write output file
Read the tuples in sorted order and create inverted file

Sort based inversion: example
<a,1,2>
<b,1,2>
<c,1,1>
<a,2,2>
<d,2,1>
<b,2,1>
<b,3,1>
<d,3,1>
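The three phases can be sketched in a few lines of Python. Two assumptions to flag: the three toy documents are invented so that Phase I produces exactly the tuples listed in the example, and an ordinary in-memory sort stands in for the external mergesort that a real collection would need.

```python
from collections import Counter
from itertools import groupby


def sort_based_inversion(docs):
    """docs is a list of term lists; documents are numbered from 1."""
    # Phase I: one (t, d, f_dt) tuple per distinct term in each document.
    tuples = []
    for d, terms in enumerate(docs, start=1):
        for t, f_dt in Counter(terms).items():
            tuples.append((t, d, f_dt))

    # Phase II: sort by term, then by document number.
    # (In-memory sort here; the slide's version is an external mergesort.)
    tuples.sort()

    # Phase III: read tuples in sorted order, one postings list per term.
    return {
        t: [(d, f) for _, d, f in group]
        for t, group in groupby(tuples, key=lambda item: item[0])
    }


docs = [["a", "a", "b", "b", "c"], ["a", "a", "d", "b"], ["b", "d"]]
inverted = sort_based_inversion(docs)
```

After sorting, all tuples for one term sit next to each other, so the inverted file can be written in a single sequential pass.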

Using a first pass for the lexicon
Gets us f_t and N
Savings: For any t, we know f_t in advance, so can use an array instead of a linked list (shrinks record by 40%!)

Lexicon-based inversion
Partition inversion as |I|/|M| = k smaller problems
build 1/k of inverted index on each pass
(e.g., a-b, b-c, …, y-z)
Tuned to fit amount of main memory in machine
Just remember boundary words
Can pair with disk strategy
Create k temporary files and write tuples (t, d, f_{d,t}) for each partition on first pass
Each second pass builds index from temporary file
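A rough sketch of the multi-pass idea, without the temporary-file variant. The boundary words, the toy documents, and the function name are assumptions for illustration. Each pass rereads the whole collection but keeps only one partition's postings in memory.

```python
from bisect import bisect_right
from collections import Counter


def lexicon_based_inversion(docs, boundaries):
    """Build the index in len(boundaries) + 1 passes over docs.

    boundaries is a sorted list of boundary words; partition i holds the
    terms t with boundaries[i-1] <= t < boundaries[i].
    """
    inverted = {}
    for part in range(len(boundaries) + 1):
        partial = {}  # only this partition's postings are held in memory
        for d, terms in enumerate(docs, start=1):
            for t, f_dt in Counter(terms).items():
                if bisect_right(boundaries, t) == part:
                    partial.setdefault(t, []).append((d, f_dt))
        # Stand-in for appending this partition to the inverted file on disk.
        for t in sorted(partial):
            inverted[t] = partial[t]
    return inverted


docs = [["a", "a", "b", "b", "c"], ["a", "a", "d", "b"], ["b", "d"]]
inverted = lexicon_based_inversion(docs, boundaries=["c"])
```

More boundary words mean more passes but less memory per pass, which is the tuning knob the slide mentions.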

Inversion – Summary of Techniques
How do these techniques stack up?
Assume a 5 GB corpus and 40 MB main memory machine
Technique                Memory (MB)   Disk (GB)   Time (hours)
*Linked lists (memory)          4000           0              6
Linked lists (disk)               30           4           1100
Sort-based                        40           8             20
Lexicon-based                     40           0             79
Lexicon w/ disk                   40           4             12

Query Matching
Now that we have an index, how do we answer queries?

Query Matching
Assuming a simple word matching engine:
For each query term t
    Stem t
    Search lexicon
    Record f_t and its inverted entry address, I_t
Select a query term t
Set list of candidates, C = I_t
For each remaining term t
    Read its I_t
    For each d in C, if d not in I_t set C = C – {d}
X and Y and Z – high _______
X or Y or Z – high _______
Which algorithm is the above?
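The candidate-pruning loop above can be sketched as follows. Assumptions: query terms arrive already stemmed, the lexicon maps each term to (f_t, address), and the small index is made up for illustration. The sketch also starts from the rarest term, which the slide leaves open ("select a query term").

```python
# Toy index: lexicon maps term -> (f_t, address into inverted_file).
lexicon = {"a": (2, 0), "b": (3, 1), "c": (1, 2), "d": (2, 3)}
inverted_file = [
    [(1, 2), (2, 2)],          # a
    [(1, 2), (2, 1), (3, 1)],  # b
    [(1, 1)],                  # c
    [(2, 1), (3, 1)],          # d
]


def conjunctive_query(terms, lexicon, inverted_file):
    """Return the set of documents that contain every query term."""
    entries = []
    for t in terms:
        if t not in lexicon:
            return set()           # a missing term empties the whole query
        entries.append(lexicon[t])  # record (f_t, address)
    if not entries:
        return set()

    entries.sort()                 # smallest f_t first: small candidate set
    candidates = {d for d, _ in inverted_file[entries[0][1]]}
    for _, address in entries[1:]:
        docs_with_term = {d for d, _ in inverted_file[address]}
        candidates &= docs_with_term   # drop every d not in this entry
    return candidates


hits = conjunctive_query(["a", "d"], lexicon, inverted_file)
```

The candidate set can only shrink, so each extra term makes the result more selective.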

Boolean Model
Query processing strategy:
Join less frequent terms first
Even in ORs, as merging takes longer than lookup
Problems with Boolean model:
Retrieves too many or too few documents
Longer documents tend to match more often because they have a larger vocabulary
Need ranked retrieval to help out

Deciding ranking
Boolean assigns same importance to all terms in a query
5566 concert dates in Singapore
“5566” has same weight as “date”
One way:
Assign weights to the words, make more important words worth more
Process results in q and d vectors: (word, weight), (word, weight) … (word, weight)

Term Frequency
Xxxxxxxxxxxxxx IBM xxxxxxxxxxx xxxxxxxx xxxxxxxxxxx IBM xxxxxxx xxxxxxxxxx xxxxxxxx Apple.  Xxxxxxxxxx xxxxxxxxxx IBM xxxxxxxx.  Xxxxxxxxxx xxxxxxxx Compaq.  Xxxxxxxxx xxxxxxx IBM.
(Relative) term frequency can indicate importance.
r_{d,t} = f_{d,t}
r_{d,t} = 1 + ln f_{d,t}
r_{d,t} = K + (1 − K) × f_{d,t} / max_i f_{d,i}

Inverse Document Frequency
Consider a future device for individual use, which is a sort of mechanized private file and library. It needs a name, and, to coin one at random, "memex" will do.
Words with higher f_t are less discriminative.
Use inverse to measure importance:
w_t = 1/f_t
w_t = ln(1 + N/f_t) ← this one is most common
w_t = ln(1 + f_m/f_t), where f_m is the max observed frequency

This is TF*IDF
Many variants, but all capture:
Term frequency:
r_{d,t} as being __________________
Inverse Document Frequency:
w_t as being ___________________
Standard formulation is:
w_{d,t} = r_{d,t} × w_t
= (1 + ln f_{d,t}) × ln(1 + N/f_t)
Problem:
r_{d,t} grows as document grows, need to normalize; otherwise biased towards _____________
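The standard formulation translates directly into code. In this sketch, `f_dt` is the count of the term in the document, `f_t` the number of documents containing it, and `N` the collection size. Returning 0 for an absent term is an assumption added here, since ln 0 is undefined.

```python
import math


def tf_idf(f_dt, f_t, N):
    """w_{d,t} = (1 + ln f_{d,t}) * ln(1 + N / f_t); 0.0 if the term is absent."""
    if f_dt == 0:
        return 0.0
    r_dt = 1 + math.log(f_dt)     # term frequency component
    w_t = math.log(1 + N / f_t)   # inverse document frequency component
    return r_dt * w_t


# Same within-document count, but one term is rare and one is common.
rare = tf_idf(f_dt=2, f_t=1, N=100)
common = tf_idf(f_dt=2, f_t=50, N=100)
```

The rare term gets the larger weight even though both occur twice in the document.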

Calculating Similarity
Euclidean Distance - bad
M(Q, D_d) = sqrt(Σ_t |w_{q,t} – w_{d,t}|^2)
Dissimilarity Measure; use reciprocal
Has problem with long documents, why?
Actually don’t care about vector length, just their direction
Want to measure difference in direction

Cosine Similarity
If X and Y are two n-dimensional vectors:
X · Y = |X| |Y| cos θ
cos θ = (X · Y) / (|X| |Y|)
= 1 when identical
= 0 when orthogonal
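A short sketch contrasting the two measures on made-up weight vectors (all names and numbers are illustrative). The "long" document points in the same direction as the query but is ten times longer. Vectors are assumed to be nonzero, since the cosine is undefined otherwise.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def norm(x):
    return math.sqrt(dot(x, x))


def cosine(x, y):
    """cos θ = (X · Y) / (|X| |Y|); assumes neither vector is all zeros."""
    return dot(x, y) / (norm(x) * norm(y))


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


query = [1.0, 1.0, 0.0]
short_doc = [1.0, 1.0, 0.0]
long_doc = [10.0, 10.0, 0.0]   # same direction, ten times the length

cos_long = cosine(query, long_doc)
dist_long = euclidean(query, long_doc)
```

Cosine treats the short and long documents as equally good matches, while Euclidean distance penalizes the long one purely for its length.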

Calculating the ranked list
To get the ranked list, we use doc. accumulators:
For each query term t, in order of increasing f_t,
    Read its inverted file entry I_t
    Update acc. for each doc in I_t: A_d += ln(1 + f_{d,t}) × w_t
For each A_d in A
    A_d /= W_d // that’s basically cos θ, don’t use w_q
Report top r of A
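The accumulator loop can be sketched as below. The toy index and the document lengths W_d are invented numbers (in a real system W_d would be precomputed from each document's weights), and a plain sort is used for the final step rather than a heap.

```python
import math

# Toy index: lexicon maps term -> (f_t, address into inverted_file).
lexicon = {"a": (2, 0), "b": (3, 1), "c": (1, 2), "d": (2, 3)}
inverted_file = [
    [(1, 2), (2, 2)],          # a
    [(1, 2), (2, 1), (3, 1)],  # b
    [(1, 1)],                  # c
    [(2, 1), (3, 1)],          # d
]
doc_lengths = {1: 2.0, 2: 2.0, 3: 1.0}   # hypothetical W_d values


def ranked_query(terms, lexicon, inverted_file, doc_lengths, N, r):
    """Return up to r (doc, score) pairs, best first."""
    accumulators = {}
    known = [t for t in terms if t in lexicon]
    for t in sorted(known, key=lambda term: lexicon[term][0]):  # increasing f_t
        f_t, address = lexicon[t]
        w_t = math.log(1 + N / f_t)
        for d, f_dt in inverted_file[address]:
            accumulators[d] = accumulators.get(d, 0.0) + math.log(1 + f_dt) * w_t
    for d in accumulators:
        accumulators[d] /= doc_lengths[d]     # normalize by document length
    ranked = sorted(accumulators.items(), key=lambda item: item[1], reverse=True)
    return ranked[:r]


results = ranked_query(["a", "d"], lexicon, inverted_file, doc_lengths, N=3, r=3)
```

Only documents that contain at least one query term ever get an accumulator, which is why the dictionary stays small for narrow queries.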

Accumulator Storage
Holding all possible accumulators is expensive
Could need one for each document if query is broad
In practice, use fixed |A| wrt main memory.  What to do when all used?
Quit: _________________
Continue _____________________

Selecting r entries from accumulators
Want to return documents with largest cos values.
How? Use a min-heap
Load r A values into the heap H
Process remaining |A| − r values
    If A_d > min{H} then
        Delete min{H}, add A_d, and sift
// H now contains the top r exact cosine values
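A sketch of the min-heap selection using Python's standard `heapq` module. The accumulator values are made up. `heapq.heapreplace` performs the "delete min, add, and sift" step in one call.

```python
import heapq


def top_r(accumulators, r):
    """Return the r largest (doc, score) pairs, best first, using a min-heap."""
    if r <= 0:
        return []
    heap = []                                   # holds (score, doc) pairs
    for d, a_d in accumulators.items():
        if len(heap) < r:
            heapq.heappush(heap, (a_d, d))      # load the first r values
        elif a_d > heap[0][0]:                  # heap[0] is min{H}
            heapq.heapreplace(heap, (a_d, d))   # delete min, add A_d, sift
    return [(d, a_d) for a_d, d in sorted(heap, reverse=True)]


scores = {1: 0.50, 2: 0.82, 3: 0.64, 4: 0.10}
best = top_r(scores, 2)
```

The heap never holds more than r entries, so selection costs roughly |A| log r comparisons instead of sorting every accumulator.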

To think about
How do you deal with a dynamic collection?
How do you support phrasal searching?
What about wildcard searching?
What types of wildcard searching are common?