Digital Libraries
|
|
|
Orientation |
|
|
|
Week 1 Min-Yen KAN |
What is a library?
|
|
|
|
A place set apart to contain books for
reading, study, or reference. |
|
(Not applied, e.g. to the shop or
warehouse of a bookseller.) |
|
A building … containing a collection of
books for the use of the public or of some particular portion of it, or of
the members of some society or the like; |
|
a public institution or establishment,
charged with the care of a collection of books, and the duty of rendering the
books accessible to those who require to use them. |
What is a library?
|
|
|
A private commercial establishment for
the lending of books, the borrower paying either a fixed sum for each book
lent or a periodical subscription. |
|
a great mass of learning or knowledge; |
|
the objects of a person's study, the
sources on which he depends for instruction. |
|
Computers. An organized collection of
routines, esp. of tested routines suitable for a particular model of computer |
|
Biology. a collection of sequences of
DNA … that represent the genetic
material of a particular organism or tissue |
|
|
Introduction
|
|
|
|
Bush’s “As we may think” |
|
|
|
Writes this at the end of WW II |
|
_____ was the first computer, born to
compute ballistic tables fast |
|
_______ just invented 5 years ago |
|
_______ (“display technology”) still a
less than perfect process. |
|
_______ (“storage technology”) was a
mature and stable technology. |
Vannevar Bush (1890-1974)
|
|
|
|
Director of the Office of Scientific
Research and Development |
|
led 6000 scientists in R&D for
WWII |
|
Predicted many technological advances |
|
the “memex” is one whose spirit we are
implementing |
|
the purpose was to provide scientists
the capability to exchange information; to have access to the totality of
recorded information |
Design for Memex (c. 1945)
Memex
|
|
|
|
Integrated computer, keyboard, and desk |
|
“mechanized private file and library” |
|
remove drudgery from information
retrieval |
|
suggested implementation was microfilm |
|
various user operations are suggested |
|
_______________ was the main purpose |
|
“the process of tying two items
together is the important thing” |
|
prelude to hypertext... |
Memex
|
|
|
|
Information could come
pre-associatively indexed, but the key point was ______________ |
|
WWW still does not provide that today |
|
Bush observes that tools change our way
of doing, and expand the horizons before us |
|
full impact of WWW and DLs still not
known |
|
|
What is a Digital Library
(DL)?
|
|
|
|
“a collection of information that is
both digitized and organized” (Lesk) |
|
there are a number of alternative
definitions, but this seems fair enough |
|
no mention of ________, _________,
__________, etc. |
|
|
|
The goal is not just to reform the current
library system; rather, we aim to |
|
organize and access the “information
overload” |
Outline for today
|
|
|
Introduction to libraries √ |
|
Course administration |
|
Reading and writing research |
|
To think about |
Course administration
|
|
|
Teaching staff |
|
Web sites |
|
Objective |
|
Syllabus |
|
Assessment overview |
|
Survey paper and project |
|
|
|
Any questions? |
Teaching staff
|
|
|
|
Lecturer: |
|
Min-Yen Kan (“Min”) |
|
kanmy@comp.nus.edu.sg |
|
Office: S15 05-05 |
|
6875-1885 |
|
Hours: 4-6 pm Tuesdays |
|
Interests:
rock climbing, ballroom dancing, and inline skating… and digital libraries! |
Course web sites
|
|
|
|
|
http://ivle.nus.edu.sg/ |
|
Discussion forum |
|
Any questions related to the course
should be raised on this forum |
|
I expect you to talk amongst yourselves
to answer questions, so I will not answer many questions here. |
|
Send me emails for urgent or personal
matters |
|
Announcements! |
|
Workbin: Lecture notes (purposely
incomplete!) |
|
|
|
http://www.comp.nus.edu.sg/~cs5244 |
|
Grading specification |
|
Other supplementary content |
Objective
|
|
|
Building, using, presenting and
maintaining large volumes of information |
|
Contrast computational approaches with
traditional library science methods |
Hey min, go over the
website!
|
|
|
http://www.comp.nus.edu.sg/~cs5244 |
|
|
Discussions
|
|
|
Class participation is very important.
There are no “dumb” questions. You will only be penalized for “no” questions
/ comments. |
|
|
|
Possibilities: |
|
Name tags |
|
Cold calls |
|
Small group discussion and presentation |
Midterm and Final
|
|
|
|
|
1 hour midterm (10%) and a
2 hour final (20%) |
|
Both basically of the same format |
|
Calculation questions – that have an
exact answer |
|
Essay questions – many to look at
tradeoffs in the digital library realm |
|
Not necessarily right or wrong answers |
|
|
Literature survey
|
|
|
Each student will pick an area of study
and survey at least 4 papers in detail. |
|
|
|
Must be interesting to you |
|
Journal or conference papers from an
authority list |
|
Limit to 6 pages |
|
Individual work only |
|
Give your perspective on area’s future |
|
Add value by comparing strengths and
weaknesses of different approaches. |
Final project
|
|
|
Students will self-organize into groups
for the final projects, shortly after the survey papers are due. |
|
|
|
Requires original work |
|
Cooperation and coordination |
|
Report as a conference submission |
|
Poster presentation to the public |
|
Sample topics on the web page |
Outline for today
|
|
|
Introduction to libraries √ |
|
Course administration √ |
|
Reading and writing research |
|
To think about |
Reading and writing
research papers
|
|
|
References: |
|
|
|
http://www.cse.ogi.edu/~dylan/efficientReading.html |
|
|
|
ftp://fast.cs.utah.edu/pub/writing-papers.ps |
|
|
|
This section is adapted in part from Surendar Chandra
of the University of Notre Dame. |
|
|
Why do you read a paper?
|
|
|
|
Understand and learn new contributions |
|
|
|
However… |
|
Not all papers are “good” |
|
Not all papers are “interesting” |
|
Not all papers are “worthwhile” for you |
|
|
|
You have to learn to identify a good
paper and spend your time wisely |
|
Breadth |
|
Depth |
|
React |
Reading a research paper
|
|
|
|
|
What is this paper about? |
|
Read the title and the abstract |
|
If you still don’t know what this paper
is about, then this is a poorly-written paper. |
|
Read the conclusion |
|
Are you now sure you know what this
paper is about? If not, throw it away. |
|
|
|
Read the ___________ |
|
Read the ____________ |
|
Read _____________ and captions |
How to read a paper
|
|
|
|
See who wrote it, where it was
published, and when it was written (credibility) |
|
Skim references |
|
Are the authors aware of relevant
related work? |
|
Do you know the work that they cite? |
|
Do you know other work that they should
have cited? |
How to read a paper -
depth
|
|
|
|
|
Approach with scientific skepticism |
|
Read with context of other things that
you’ve read in mind |
|
It’s only one part of the puzzle of a
subject |
|
|
|
Examine the assumptions. Are they: |
|
Reasonable? |
|
What are the limitations of the work |
|
There are always limitations! Did they disclose them? |
How to read a paper -
depth
|
|
|
|
|
Examine the methods: |
|
Did they measure what they claim? |
|
|
|
Can they explain what they observed? |
|
Want an analysis of why the system
behaves a certain way, not raw data. |
|
|
|
Did they have adequate controls? |
|
|
|
Were tests carried out in a standard
way? Were the performance metrics standard? |
|
If not, do they explain their metrics
clearly? |
How to read a paper -
depth
|
|
|
|
Examine the statistics:
“Lies, d*mned lies and statistics” |
|
Appropriate statistical tests applied
properly? |
|
Did they do proper error analysis? |
|
Are the results statistically
significant? |
How to read a paper -
depth
|
|
|
|
Examine the conclusions: |
|
Do the conclusions follow logically
from the experiments? |
|
What other explanations are there for
the observed effects? |
|
What other conclusions or correlations
are in the data that were not pointed out? |
How to read a paper -
react
|
|
|
|
Take notes |
|
Highlight major points |
|
React to the points in the paper |
|
Place this work with your own
experience |
|
If you doubt a statement, note your
objection |
|
|
|
Summarize what you read |
|
Good practice: maintain your own
bibliography of all papers that you ever read |
|
___________ ! |
How to write a research
paper
|
|
|
Write it such that anyone who reads it
using the method we just discussed understands the idea |
|
|
|
Clearly explain what problem you are
solving, why it is interesting and how your solution solves this interesting
problem |
|
|
|
Be crisp. Explain what your
contributions are, which ideas are yours and which are others’ |
Any questions?
|
|
|
Introduction to libraries √ |
|
Course administration √ |
|
Reading and writing research √ |
To think about for
discussion
|
|
|
|
What are the functions of a traditional
library? |
|
Are these the same functions as in the digital
library? |
|
How is the digital library different
from: |
|
_________? |
|
_________? |
Coffee Break
Digital Libraries
|
|
|
Week 1 Min-Yen KAN |
|
Implementation of
(Textual) Information Retrieval |
What is information
retrieval?
|
|
|
Part of the information seeking process |
|
Matches a query with most relevant
documents |
|
View a query as a ______________- |
|
|
Searching in books
|
|
|
|
_______ |
|
_______ |
|
_______ |
|
|
|
Procedure: |
|
Look up topic |
|
Find the page |
|
Skim page to find topic |
|
… |
|
Index, 11, 103-151, 443 |
|
Audio, 476 |
|
Comparison of methods 143-145 |
|
Granularity, 105, 112 |
|
N-gram, 170-172 |
|
Of integer sequences, 11 |
|
Of musical themes, 11 |
|
Of this book, 103, 507ff |
|
Within inverted file entry, see skipping |
|
Index compression, 114-129, 198-201,
235-237 |
|
Batched, 125,128 |
|
Bernoulli, 119-122, 128, 150, 247, 421 |
|
Context-sensitive, 125-126 |
|
Global, 115-121 |
|
Hyperbolic model, 123-124, 150 |
|
In MG, 421-423 |
|
Interpolative coding, 126-128 |
|
Local, 115, 121-122, 247 |
|
Nonparameterized, 115-119 |
|
Observed frequency, 121, 124-125, 128,
247 |
|
Parameterized, 115 |
|
Performance of, 128-129. 421 |
|
Skewed Bernoulli, 122-123, 138, 150 |
|
Within-document frequencies, 198-201 |
|
Index Construction, 223-261 (see also inversion) |
|
bitmaps, 255-256 |
|
… |
Information retrieval
|
|
|
|
Algorithm |
|
(Permute query to fit index) |
|
Search index |
|
Go to resource |
|
(Permute query to fit item) |
|
(Search for item) |
What to index?
|
|
|
|
Book indices have key words and
phrases |
|
Search engines index ____________ |
|
|
|
Why the disparity? |
|
What do people really search for? |
|
|
|
|
Trading precision for
size
|
|
|
|
Can save up to 32% without too much
loss: |
|
|
|
Stemming |
|
Usually just word inflection |
|
Information → Inform = Informal,
Informed |
|
|
|
Case folding |
|
N.B.: keep odd variants (e.g., NeXT,
LaTeX) |
|
|
|
Stop words |
|
Don’t index common words; people won’t
search on them anyway |
|
|
|
Pop Quiz: Which of these techniques is
more effective? |
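The three techniques on this slide can be sketched as one small normalization step. This is a minimal illustration, not the course's reference implementation: the stop list, the suffix list, and the names `crude_stem` and `index_terms` are all hypothetical, and the suffix stripper is only a stand-in for a real stemmer.

```python
import re

# Hypothetical stop list and suffix list, chosen only for illustration.
STOP_WORDS = {"a", "an", "and", "of", "the", "to", "in", "is"}
SUFFIXES = ("ation", "ing", "ed", "al", "s")
KEEP_CASE = {"NeXT", "LaTeX"}  # odd variants that are not case folded


def crude_stem(word):
    """Strip the first matching suffix; a stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Return the terms that would be indexed for a piece of text."""
    terms = []
    for token in re.findall(r"[A-Za-z]+", text):
        if token in KEEP_CASE:
            terms.append(token)
            continue
        token = token.lower()            # case folding
        if token in STOP_WORDS:          # stop-word removal
            continue
        terms.append(crude_stem(token))  # stemming
    return terms
```

With these assumed lists, "Information", "Informal" and "Informed" all map to the single index term "inform", which is where the lexicon savings come from.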
Indexing output
|
|
|
|
Output = $L_W$, $D_D$, $I_{W \times D}$ |
|
|
|
Inverted File (Index) |
|
Postings (e.g., $w_t \to (d_1, f_{w_t,d_1}), (d_2, f_{w_t,d_2}), \ldots, (d_n, f_{w_t,d_n})$) |
|
Variable length records |
|
|
|
Lexicon: |
|
String $w_t$ |
|
Document frequency $f_t$ |
|
Address within inverted file $I_t$ |
|
Sorted, fixed length records |
|
|
|
× D1 D2 D3 D4 D5 D6 … Dm |
|
|
|
W1 1 1 |
|
W2 2 1 |
|
W3 1 |
|
W4 1 1 |
|
W5 1 1 |
|
W6 1 1 1 |
|
… |
|
Wn |
|
|
|
|
|
|
|
|
|
|
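The lexicon and inverted file described on this slide can be sketched with ordinary Python containers. This is a simplified in-memory picture under stated assumptions: the name `build_index` is hypothetical, the "address" is just a list position rather than a byte offset, and documents are assumed to be already tokenized.

```python
from collections import Counter, defaultdict


def build_index(docs):
    """docs maps a document number d to its list of terms.

    Returns (lexicon, inverted_file). inverted_file is a list of postings
    lists; lexicon maps each term t to (f_t, address of its postings list).
    """
    postings = defaultdict(list)
    for d in sorted(docs):
        for t, f_dt in sorted(Counter(docs[d]).items()):
            postings[t].append((d, f_dt))   # variable-length record

    lexicon, inverted_file = {}, []
    for t in sorted(postings):              # lexicon is kept sorted
        lexicon[t] = (len(postings[t]), len(inverted_file))
        inverted_file.append(postings[t])
    return lexicon, inverted_file
```

For example, `build_index({1: ["a", "b", "a"], 2: ["b", "c"]})` gives term "b" a document frequency of 2 and a postings list holding one (document, frequency) pair per document.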
Trading precision for
size, redux
|
|
|
|
Pop Quiz: Which of these techniques is
more effective? |
|
|
|
Typical: |
|
Lexicon = 30 MB Inverted File: 400 MB |
|
|
|
Stemming |
|
Affects Lexicon |
|
|
|
Case folding |
|
Affects Lexicon |
|
|
|
Stop words |
|
Affects Inverted File |
Is fine-grained indexing
worthwhile?
|
|
|
|
Problem: still have to scan document to
find the term. |
|
|
|
|
|
|
|
|
|
Cons: |
|
Need access methods to take advantage |
|
Extra storage space overhead (variable
sized) |
|
Alternative methods: |
|
Hierarchical encoding (doc #, para #,
sent #, word #) to shrink offset size |
|
Split long documents into n shorter
ones. |
Inverted file compression
|
|
|
|
Clue: Encode gap length instead of
offset |
|
Use small number of bits to encode more common gap lengths |
|
(e.g., Huffman encoding) |
|
Better: Use a distribution of expected
gap length (e.g., Bernoulli process) |
|
If p is the probability that any word x appears in
doc y, then |
|
$P(\text{gap size} = z) = (1-p)^{z}\,p$, which is a geometric
distribution. |
|
|
|
Works for intra and inter-document
index compression |
|
Why does it hold for documents as well
as words? |
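A small sketch of the gap idea, assuming postings are stored as ascending document numbers. The function names are hypothetical, and `ideal_bits` only reports the code length an ideal coder would assign under the geometric model; it is not an actual encoder.

```python
import math


def to_gaps(doc_numbers):
    """Turn an ascending list of document numbers into gap lengths."""
    gaps, previous = [], 0
    for d in doc_numbers:
        gaps.append(d - previous)
        previous = d
    return gaps


def from_gaps(gaps):
    """Invert to_gaps by taking running sums."""
    doc_numbers, total = [], 0
    for g in gaps:
        total += g
        doc_numbers.append(total)
    return doc_numbers


def gap_probability(z, p):
    """P(gap size = z) = (1 - p)^z * p, the formula from the slide."""
    return (1 - p) ** z * p


def ideal_bits(z, p):
    """Code length in bits that an ideal coder would give gap z."""
    return -math.log2(gap_probability(z, p))
```

Small gaps are the probable ones under this model, so they receive the short codes, which is the same intuition as the Huffman suggestion above.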
Building the index –
Memory based inversion
|
|
|
Takes lots of main memory, ugh! |
|
Can we reduce the memory requirement? |
Sort-based inversion
|
|
|
Idea: try to make random access of disk
(memory) sequential |
|
|
|
// Phase I – collection of term
appearances on disk |
|
For each document $D_d$ in the
collection, 1 ≤ d ≤ N |
|
Read $D_d$, parsing it into
index terms |
|
For each index term t in $D_d$ |
|
Calculate $f_{d,t}$ |
|
Dump to file a tuple $(t, d, f_{d,t})$ |
|
// Phase II – sort tuples |
|
Sort all the tuples (t,d,f) using
External Mergesort |
|
|
|
// Phase III – write output file |
|
Read the tuples in sorted order and
create inverted file |
Sort based inversion:
example
|
|
|
<a,1,2> |
|
<b,1,2> |
|
<c,1,1> |
|
<a,2,2> |
|
<d,2,1> |
|
<b,2,1> |
|
<b,3,1> |
|
<d,3,1> |
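The three phases can be sketched in a few lines. This is an in-memory illustration only: a plain list and `list.sort` stand in for the temporary file and the external mergesort, and the three toy documents are hypothetical, chosen so that Phase I produces the tuples listed on this slide.

```python
from collections import Counter
from itertools import groupby


def sort_based_inversion(docs):
    # Phase I: one (t, d, f_dt) tuple per term per document.
    tuples = []
    for d in sorted(docs):
        for t, f_dt in Counter(docs[d]).items():
            tuples.append((t, d, f_dt))
    # Phase II: sort by term, then document (in-memory stand-in
    # for the external mergesort).
    tuples.sort()
    # Phase III: group the sorted run into one postings list per term.
    return {t: [(d, f) for _, d, f in group]
            for t, group in groupby(tuples, key=lambda x: x[0])}


docs = {1: ["a", "b", "a", "b", "c"],
        2: ["a", "d", "a", "b"],
        3: ["b", "d"]}
index = sort_based_inversion(docs)
```

After sorting, all tuples for a term are adjacent and already in document order, so the inverted file can be written in a single sequential pass.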
|
|
Using a first pass for
the lexicon
|
|
|
|
Gets us $f_t$ and N |
|
Savings: For any t, we know $f_t$,
so can use an array instead of a linked list (shrinks record by 40%!) |
Lexicon-based inversion
|
|
|
|
Partition inversion as |I|/|M| = k
smaller problems |
|
build 1/k of inverted index on each
pass |
|
(e.g., a-b, b-c, …, y-z) |
|
Tuned to fit amount of main memory in
machine |
|
Just remember boundary words |
|
|
|
Can pair with disk strategy |
|
Create k temporary files and write
tuples $(t, d, f_{d,t})$ for each partition on first pass |
|
Each second pass builds index from
temporary file |
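A rough sketch of the partitioning idea without the disk strategy, assuming the boundary words are already known from a first pass. The name `partitioned_inversion` is hypothetical, and updating one dictionary stands in for appending each finished partition to the inverted file on disk.

```python
from bisect import bisect_right
from collections import Counter


def partitioned_inversion(docs, boundaries):
    """Build the index in len(boundaries) + 1 passes over docs.

    boundaries is a sorted list of boundary words; partition i holds
    the terms t with boundaries[i-1] <= t < boundaries[i].
    """
    index = {}
    for part in range(len(boundaries) + 1):
        partial = {}                      # only this must fit in memory
        for d in sorted(docs):
            for t, f_dt in Counter(docs[d]).items():
                if bisect_right(boundaries, t) == part:
                    partial.setdefault(t, []).append((d, f_dt))
        index.update(partial)             # stands in for writing to disk
    return index
```

Each pass rereads the whole collection but keeps only one partition's postings in memory, which is the trade of time for memory that the slide describes.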
|
|
Inversion – Summary of
Techniques
|
|
|
How do these techniques stack up? |
|
Assume a 5 GB corpus and 40 MB main
memory machine |
|
|
|
Technique                Memory (MB)   Disk (GB)   Time (hours) |
*Linked lists (memory)   4000          0           6 |
Linked lists (disk)      30            4           1100 |
Sort-based               40            8           20 |
Lexicon-based            40            0           79 |
Lexicon w/ disk          40            4           12 |
Query Matching
|
|
|
Now that we have an index, how do we
answer queries? |
Query Matching
|
|
|
|
Assuming a simple word matching engine: |
|
|
|
For each query term t |
|
Stem t |
|
Search lexicon |
|
Record $f_t$ and its inverted
entry address, $I_t$ |
|
Select a query term t |
|
Set list of candidates, C = $I_t$ |
|
For each remaining term t |
|
Read its $I_t$ |
|
For each d in C, if d not in $I_t$,
set C = C – {d} |
|
|
|
X and Y and Z – high _______ |
|
X or Y or Z – high _______ |
|
Which algorithm is the above? |
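The procedure above can be sketched as follows. The data layout is an assumption made for illustration: `lexicon[t]` is a pair ($f_t$, $I_t$) and `inverted_file[I_t]` is a list of (d, $f_{d,t}$) pairs. Query terms are assumed to be stemmed already, and the function name is hypothetical.

```python
def conjunctive_query(terms, lexicon, inverted_file):
    """Return the set of documents that contain every query term."""
    if not terms:
        return set()
    entries = []
    for t in terms:
        if t not in lexicon:
            return set()                  # a missing term empties the result
        entries.append(lexicon[t])        # record (f_t, I_t)
    entries.sort()                        # start with the least frequent term
    _, address = entries[0]
    candidates = {d for d, _ in inverted_file[address]}
    for _, address in entries[1:]:
        docs_with_t = {d for d, _ in inverted_file[address]}
        candidates &= docs_with_t         # C = C - {d} for every d not in I_t
    return candidates
```

Starting from the least frequent term keeps the candidate set as small as possible from the first step.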
|
|
Boolean Model
|
|
|
|
Query processing strategy: |
|
Join less frequent terms first |
|
Even in ORs, as merging takes longer
than lookup |
|
|
|
Problems with Boolean model: |
|
Retrieves too many or too few documents |
|
Longer documents tend to match more
often because they have a larger vocabulary |
|
Need ranked retrieval to help out |
Deciding ranking
|
|
|
|
Boolean assigns same importance to all
terms in a query |
|
|
|
5566 concert dates in Singapore |
|
“5566” has same weight as “date” |
|
|
|
One way: |
|
Assign weights to the words, make more
important words worth more |
|
Process results in q and d vectors:
(word, weight), (word, weight) … (word, weight) |
Term Frequency
|
|
|
Xxxxxxxxxxxxxx IBM xxxxxxxxxxx
xxxxxxxx xxxxxxxxxxx IBM xxxxxxx xxxxxxxxxx xxxxxxxx Apple. Xxxxxxxxxx xxxxxxxxxx IBM xxxxxxxx. Xxxxxxxxxx xxxxxxxx Compaq. Xxxxxxxxx xxxxxxx IBM. |
|
|
|
(Relative) term frequency can indicate
importance. |
|
$R_{d,t} = f_{d,t}$ |
|
$R_{d,t} = 1 + \ln f_{d,t}$ |
|
$R_{d,t} = K + (1-K)\,\dfrac{f_{d,t}}{\max_i f_{d,i}}$ |
Inverse Document
Frequency
|
|
|
|
Consider a future device for individual
use, which is a sort of mechanized private file and library. It needs a name,
and, to coin one at random, "memex" will do. |
|
|
|
Words with higher $f_t$ are
less discriminative. |
|
Use inverse to measure importance: |
|
$w_t = 1/f_t$ |
|
$w_t = \ln(1 + N/f_t)$ ← this
one is most common |
|
$w_t = \ln(1 + f_m/f_t)$,
where $f_m$ is the max observed frequency |
|
|
|
|
|
|
This is TF*IDF
|
|
|
|
Many variants, but all capture: |
|
Term frequency:
$R_{d,t}$ as being __________________ |
|
|
|
Inverse Document Frequency:
$W_t$ as being ___________________ |
|
|
|
Standard formulation is:
$w_{d,t} = r_{d,t} \times w_t = (1 + \ln f_{d,t}) \times \ln(1 + N/f_t)$ |
|
|
|
Problem: |
|
$r_{d,t}$ grows as document
grows, need to normalize; otherwise biased towards _____________ |
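The standard formulation on this slide translates directly into code. A minimal sketch: the function names are hypothetical, and treating an absent term ($f_{d,t} = 0$) as weight zero is an assumption, since the slide's formula is only defined for terms that occur.

```python
import math


def tf(f_dt):
    """r_{d,t} = 1 + ln f_{d,t}; zero if the term is absent."""
    return 1 + math.log(f_dt) if f_dt > 0 else 0.0


def idf(N, f_t):
    """w_t = ln(1 + N / f_t)."""
    return math.log(1 + N / f_t)


def tf_idf(f_dt, N, f_t):
    """w_{d,t} = r_{d,t} * w_t."""
    return tf(f_dt) * idf(N, f_t)
```

For a fixed within-document frequency, a term found in 10 of 1000 documents gets a larger weight than one found in 500 of them.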
Calculating Similarity
|
|
|
|
Euclidean Distance - bad |
|
$M(Q, D_d) = \sqrt{\sum_t |w_{q,t} - w_{d,t}|^2}$ |
|
Dissimilarity Measure; use reciprocal |
|
Has problem with long documents, why? |
|
|
|
Actually don’t care about vector
length, just their direction |
|
Want to measure difference in direction |
Cosine Similarity
|
|
|
|
If X and Y are two n-dimensional
vectors: |
|
$X \cdot Y = |X|\,|Y| \cos\theta$ |
|
$\cos\theta = \dfrac{X \cdot Y}{|X|\,|Y|}$ |
|
|
|
= 1 when identical |
|
= 0 when orthogonal |
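A short sketch of the cosine measure for sparse term vectors. Storing a vector as a {term: weight} dictionary is an assumption made for illustration, as is returning 0 for an all-zero vector.

```python
import math


def cosine(x, y):
    """cos θ between two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * y[t] for t, w in x.items() if t in y)
    norm_x = math.sqrt(sum(w * w for w in x.values()))
    norm_y = math.sqrt(sum(w * w for w in y.values()))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)
```

Scaling every weight in one vector by the same factor leaves the result unchanged, which is why document length no longer dominates the score.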
Calculating the ranked
list
|
|
|
To get the ranked list, we use doc.
accumulators: |
|
|
|
For each query term t, in order of
increasing $f_t$, |
|
Read its inverted file entry $I_t$ |
|
Update the accumulator for each doc d in
$I_t$: $A_d \leftarrow A_d + \ln(1 + f_{d,t}) \times w_t$ |
|
For each $A_d$ in A |
|
$A_d \leftarrow A_d / W_d$ //
that’s basically cos θ, don’t use $w_q$ |
|
Report top r of A |
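The accumulator loop can be sketched as below. Several things here are assumptions for illustration: `lexicon[t]` holds ($f_t$, $I_t$), `inverted_file[I_t]` holds (d, $f_{d,t}$) pairs, `doc_norms[d]` holds a precomputed $W_d$, the term weight is taken as $w_t = \ln(1 + N/f_t)$, and a full sort replaces the heap-based selection for brevity.

```python
import math


def rank(query_terms, lexicon, inverted_file, doc_norms, N, r):
    """Return the top r (score, d) pairs for a query, best first."""
    known = [t for t in query_terms if t in lexicon]
    known.sort(key=lambda t: lexicon[t][0])      # increasing f_t
    acc = {}                                     # accumulators A_d
    for t in known:
        f_t, address = lexicon[t]
        w_t = math.log(1 + N / f_t)
        for d, f_dt in inverted_file[address]:
            acc[d] = acc.get(d, 0.0) + math.log(1 + f_dt) * w_t
    # Normalize by document length W_d; the query length is the same
    # for every document, so it does not change the ordering.
    scored = [(a / doc_norms[d], d) for d, a in acc.items()]
    scored.sort(reverse=True)
    return scored[:r]
```

Only documents that share at least one term with the query ever receive an accumulator, which is what makes the fixed-size accumulator question on the next slide relevant.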
|
|
Accumulator Storage
|
|
|
|
Holding all possible accumulators is
expensive |
|
Could need one for each document if
query is broad |
|
|
|
In practice, use fixed |A| wrt main
memory. What to do when all used? |
|
Quit: _________________ |
|
Continue _____________________ |
Selecting r entries from
accumulators
|
|
|
|
|
Want to return documents with largest
cos values. |
|
|
|
How? Use a min-heap |
|
Load r A values into the heap H |
|
Process remaining $|A| - r$ values |
|
If $A_d$ > min{H} then |
|
Delete min{H}, add $A_d$, and
sift |
|
// H now contains the top r exact
cosine values |
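The min-heap selection can be sketched with Python's standard `heapq` module. The function name `top_r` and the {d: $A_d$} dictionary layout are assumptions for illustration; ties on score are broken by document number because the heap holds (score, d) pairs.

```python
import heapq


def top_r(accumulators, r):
    """Return the r largest (A_d, d) pairs, best first."""
    items = [(a, d) for d, a in accumulators.items()]
    heap = items[:r]                  # load r values into the heap H
    heapq.heapify(heap)
    for item in items[r:]:            # process the remaining |A| - r values
        if heap and item > heap[0]:   # A_d > min{H}
            heapq.heapreplace(heap, item)   # delete min, add A_d, sift
    return sorted(heap, reverse=True)
```

The heap never holds more than r entries, so selection costs O(|A| log r) rather than the O(|A| log |A|) of sorting every accumulator.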
|
|
To think about
|
|
|
|
How do you deal with a dynamic
collection? |
|
How do you support phrasal searching? |
|
What about wildcard searching? |
|
What types of wildcard searching are
common? |
|
|