1
Digital Libraries
  • Orientation


  • Week 1                Min-Yen KAN
2
What is a library?
  • A place set apart to contain books for reading, study, or reference.
    • (Not applied, e.g. to the shop or warehouse of a bookseller.)
  • A building … containing a collection of books for the use of the public or of some particular portion of it, or of the members of some society or the like;
  • a public institution or establishment, charged with the care of a collection of books, and the duty of rendering the books accessible to those who require to use them.
3
What is a library?
  • A private commercial establishment for the lending of books, the borrower paying either a fixed sum for each book lent or a periodical subscription.
  • a great mass of learning or knowledge;
  • the objects of a person's study, the sources on which he depends for instruction.
  • Computers. An organized collection of routines, esp. of tested routines suitable for a particular model of computer
  • Biology. a collection of sequences of DNA …  that represent the genetic material of a particular organism or tissue


4
Introduction
  • Bush’s “As we may think”


    • Writes this at the end of WW II
    • _____ was the first computer, born to compute ballistic tables fast
    • _______ just invented 5 years ago
    • _______ (“display technology”) still a less than perfect process.
    • _______ (“storage technology”) was a mature and stable technology.
5
Vannevar Bush (1890-1974)
  • Director of the Office of Scientific Research and Development
    • led 6000 scientists in R&D for WWII
  • Predicted many technological advances
    • the “memex” is one whose spirit we are implementing
    • the purpose was to provide scientists the capability to exchange information; to have access to the totality of recorded information
6
Design for Memex (c. 1945)
7
Memex
  • Integrated computer, keyboard, and desk
  • “mechanized private file and library”
    • remove drudgery from information retrieval
    • suggested implementation was microfilm
    • various user operations  are suggested
  • _______________ was the main purpose
    • “the process of tying two items together is the important thing”
    • prelude to hypertext...
8
Memex
  • Information could come pre-associatively indexed, but the key point was ______________
    • WWW still does not provide that today
  • Bush observes that tools change our way of doing, and expand the horizons before us
    • full impact of WWW and DLs still not known

9
What is a Digital Library (DL)?
  • “a collection of information that is both digitized and organized” (Lesk)
    • there are a number of alternative definitions, but this seems fair enough
    • no mention of ________, _________, __________, etc.


  • The aim is not just to reform the current library system; rather, we aim to
    • organize and access the “information overload”
10
Outline for today
  • Introduction to libraries √
  • Course administration
  • Reading and writing research
  • To think about
11
Course administration
  • Teaching staff
  • Web sites
  • Objective
  • Syllabus
  • Assessment overview
  • Survey paper and project


  • Any questions?
12
Teaching staff
  • Lecturer:
    • Min-Yen Kan (“Min”)
    • kanmy@comp.nus.edu.sg
    • Office: S15 05-05
    • 6875-1885
    • Hours: 4-6 pm Tuesdays
    • Interests: rock climbing, ballroom dancing, and inline skating… and digital libraries!
13
Course web sites
  • http://ivle.nus.edu.sg/
    • Discussion forum
      • Any questions related to the course should be raised on this forum
      • I expect you to talk amongst yourselves to answer questions, so I will rarely answer questions here.
      • Send me emails for urgent or personal matters
    • Announcements!
    • Workbin: Lecture notes (purposely incomplete!)


  • http://www.comp.nus.edu.sg/~cs5244
    • Grading specification
    • Other supplementary content
14
Objective
  • Building, using, presenting and maintaining large volumes of information
  • Contrast computational approaches with traditional library science methods
15
Hey min, go over the website!
  • http://www.comp.nus.edu.sg/~cs5244


16
Discussions
  • Class participation is very important. There are no “dumb” questions. You will only be penalized for “no” questions / comments.


  • Possibilities:
  • Name tags
  • Cold calls
  • Small group discussion and presentation
17
Midterm and Final
  • 1-hour midterm (10%) and a 2-hour final (20%)
    • Both basically of the same format
    • Calculation questions – that have an exact answer
    • Essay questions – many to look at tradeoffs in the digital library realm
      • Not necessarily right or wrong answers


18
Literature survey
  • Each student will pick an area of study to survey at least 4 papers in detail.


  • Must be interesting to you
  • Journal or conference papers from an authority list
  • Limit to 6 pages
  • Individual work only
  • Give your perspective on area’s future
  • Add value by comparing strengths and weaknesses of different approaches.
19
Final project
  • Students will self-organize into groups for the final projects, shortly after the survey papers are due.


  • Requires original work
  • Cooperation and coordination
  • Report as a conference submission
  • Poster presentation to the public
  • Sample topics on the web page
20
Outline for today
  • Introduction to libraries √
  • Course administration √
  • Reading and writing research
  • To think about
21
Reading and writing research papers
  • References:


  • http://www.cse.ogi.edu/~dylan/efficientReading.html


  •  ftp://fast.cs.utah.edu/pub/writing-papers.ps


  • This section is partially from Surendar Chandra of the University of Notre Dame.


22
Why do you read a paper?
  • Understand and learn new contributions


  • However…
    • Not all papers are “good”
    • Not all papers are “interesting”
    • Not all papers are “worthwhile” for you


  • You have to learn to identify a good paper and spend your time wisely
    • Breadth
    • Depth
    • React
23
Reading a research paper
  • What is this paper about?
    • Read the title and the abstract
      • If you still don’t know what this paper is about, then this is a poorly-written paper.
    •  Read the conclusion
      • Are you now sure you know what this paper is about? If not, throw it away.


  • Read the ___________
  • Read the ____________
  • Read _____________ and captions
24
How to read a paper
  • See who wrote it, where it was published, and when it was written (credibility)
  • Skim references
    • Are the authors aware of relevant related work?
    • Do you know the work that they cite?
    • Do you know other work that they should have cited?
25
How to read a paper - depth
  • Approach with scientific skepticism
  • Read with context of other things that you’ve read in mind
    • It’s only one part of the puzzle of a subject


  • Examine the assumptions
    • Are they reasonable?
    • What are the limitations of the work?
      • There are always limitations!  Did they disclose them?
26
How to read a paper - depth
  • Examine the methods:
    • Did they measure what they claim?


    • Can they explain what they observed?
      • Want an analysis of why the system behaves a certain way, not raw data.


    • Did they have adequate controls?


    • Were tests carried out in a standard way? Were the performance metrics standard?
      • If not, do they explain their metrics clearly?
27
How to read a paper - depth
  • Examine the statistics:
    “Lies, d*mned lies and statistics”
    • Appropriate statistical tests applied properly?
    • Did they do proper error analysis?
    • Are the results statistically significant?
28
How to read a paper - depth
  • Examine the conclusions:
    • Do the conclusions follow logically from the experiments?
    • What other explanations are there for the observed effects?
    • What other conclusions or correlations are in the data that were not pointed out?
29
How to read a paper - react
  • Take notes
  • Highlight major points
  • React to the points in the paper
    • Relate this work to your own experience
    • If you doubt a statement, note your objection

  • Summarize what you read
    • Good practice: maintain your own bibliography of all papers that you ever read
    • ___________ !
30
How to write a research paper
  • Write it such that anyone who reads it using the method we just discussed understands the idea


  • Clearly explain what problem you are solving, why it is interesting and how your solution solves this interesting problem


  • Be crisp. Explain what your contributions are, what your ideas are and what are others’ ideas
31
Any questions?
  • Introduction to libraries √
  • Course administration √
  • Reading and writing research √
32
To think about for discussion
  • What are the functions of a traditional library?
  • Are these same functions in the digital library?
  • How is the digital library different from:
    • _________?
    • _________?
33
Coffee Break
34
Digital Libraries
  • Week 1                      Min-Yen KAN
  • Implementation of (Textual) Information Retrieval
36
What is information retrieval?
  • Part of the information seeking process
  • Matches a query with most relevant documents
  • View a query as a ______________-


37
Searching in books
  • _______
  • _______
  • _______


  • Procedure:
    • Look up topic
    • Find the page
    • Skim page to find topic
  • …
  • Index, 11, 103-151, 443
  • Audio, 476
  • Comparison of methods 143-145
  • Granularity, 105, 112
  • N-gram, 170-172
  • Of integer sequences, 11
  • Of musical themes, 11
  • Of this book, 103, 507ff
  • Within inverted file entry, see skipping
  • Index compression, 114-129, 198-201, 235-237
  • Batched, 125,128
  • Bernoulli, 119-122, 128, 150, 247, 421
  • Context-sensitive, 125-126
  • Global, 115-121
  • Hyperbolic model, 123-124, 150
  • In MG, 421-423
  • Interpolative coding, 126-128
  • Local, 115, 121-122, 247
  • Nonparameterized, 115-119
  • Observed frequency, 121, 124-125, 128, 247
  • Parameterized, 115
  • Performance of, 128-129. 421
  • Skewed Bernoulli, 122-123, 138, 150
  • Within-document frequencies, 198-201
  • Index Construction, 223-261 (see also inversion)
  • bitmaps, 255-256
  • …
38
Information retrieval
  • Algorithm
    • (Permute query to fit index)
    • Search index
    • Go to resource
    • (Permute query to fit item)
    • (Search for item)
39
What to index?
  • Book indices have key words and phrases
  • Search engines index ____________


  • Why the disparity?
  • What do people really search for?



40
Trading precision for size
  • Can save up to 32% without too much loss:


  • Stemming
    • Usually just word inflection
    • Information → Inform = Informal, Informed

  • Case folding
    • N.B.: keep odd variants (e.g., NeXT, LaTeX)

  • Stop words
    • Don’t index common words; people won’t search on them anyway

  • Pop Quiz: Which of these techniques is more effective?
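
The three techniques on this slide can be sketched as one term-normalization step. This is only an illustration: the stop list, the suffix list, and the names `toy_stem` and `index_terms` are made up for the example, and the suffix stripper is a crude stand-in for a real stemmer.

```python
import re

# Hypothetical, tiny stop list; real systems use much longer ones.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}
# Mixed-case tokens kept as-is, following the N.B. on the slide.
KEEP_CASE = {"NeXT", "LaTeX"}
# Toy suffix list standing in for a real stemming algorithm.
SUFFIXES = ("ation", "ing", "ed", "al", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Tokenize, case-fold, drop stop words, and stem."""
    terms = []
    for token in re.findall(r"[A-Za-z]+", text):
        if token in KEEP_CASE:
            terms.append(token)      # odd variants are not case-folded
            continue
        word = token.lower()         # case folding
        if word in STOP_WORDS:       # stop-word removal
            continue
        terms.append(toy_stem(word))
    return terms
```

With these assumptions, "information", "informal", and "informed" all collapse to the single index term "inform", which is where the loss of precision comes from.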
41
Indexing output
  • Output = L_W, D_D, I_{W×D}


  • Inverted File (Index)
    • Postings, e.g., w_t → (d_1, f_{w_t,d_1}), (d_2, f_{w_t,d_2}), …, (d_n, f_{w_t,d_n})
    • Variable length records

  • Lexicon:
    • String Wt
    • Document frequency ft
    • Address within inverted file It
    • Sorted, fixed length records
  • [Figure: W × D matrix with rows W1 … Wn and columns D1 … Dm; most cells are empty, the rest hold small counts such as 1 or 2]
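
As a rough sketch of the two outputs described on this slide, the code below builds a lexicon and an inverted file in memory. The function name `build_index` and the use of a Python list index as the "address" are assumptions made for illustration; a real system would store byte offsets into a file.

```python
from collections import Counter


def build_index(docs):
    """docs: list of token lists; document numbers start at 1.

    Returns (lexicon, inverted_file), where
      lexicon maps term -> (document frequency f_t, address I_t) and
      inverted_file[I_t] is the postings list [(d, f_dt), ...] of that term.
    """
    postings = {}
    for d, tokens in enumerate(docs, start=1):
        for term, f_dt in Counter(tokens).items():
            postings.setdefault(term, []).append((d, f_dt))

    inverted_file = []   # variable-length records, one per term
    lexicon = {}         # sorted terms, each with f_t and an address
    for address, term in enumerate(sorted(postings)):
        inverted_file.append(postings[term])
        lexicon[term] = (len(postings[term]), address)
    return lexicon, inverted_file
```

Looking up a term is then two steps: find it in the lexicon, then follow the address to its postings.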





42
Trading precision for size, redux
  • Pop Quiz: Which of these techniques is more effective?


  • Typical:
  • Lexicon: 30 MB; Inverted File: 400 MB
  • Stemming
    • Affects Lexicon

  • Case folding
    • Affects Lexicon


  • Stop words
    • Affects Inverted File
43
Is fine-grained indexing worthwhile?
  • Problem: still have to scan document to find the term.





  • Cons:
    • Need access methods to take advantage
    • Extra storage space overhead (variable sized)
  • Alternative methods:
    • Hierarchical encoding (doc #, para #, sent #, word #) to shrink offset size
    • Split long documents into n shorter ones.
44
Inverted file compression
  • Clue: Encode gap length instead of offset
  • Use small number of bits to encode  more common gap lengths
    • (e.g., Huffman encoding)
  • Better: Use a distribution of expected gap length (e.g., Bernoulli process)
    • If p = probability that any word x appears in doc y, then P(gap size = z) = (1 − p)^z p, which is a geometric distribution.

  • Works for intra and inter-document index compression
    • Why does it hold for documents as well as words?
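
A small sketch of the gap idea: store differences between successive document numbers instead of the numbers themselves, and model the gap sizes with the geometric form given above. The function names are hypothetical, and the convention that z counts the items skipped before the next occurrence (so z starts at 0) is an assumption about how the slide's formula is meant.

```python
def to_gaps(doc_ids):
    """Replace a sorted list of document numbers by successive differences."""
    gaps, previous = [], 0
    for d in doc_ids:
        gaps.append(d - previous)
        previous = d
    return gaps


def from_gaps(gaps):
    """Invert to_gaps by taking running sums."""
    ids, total = [], 0
    for g in gaps:
        total += g
        ids.append(total)
    return ids


def gap_probability(z, p):
    """(1 - p)^z * p; z is assumed to count the items skipped
    before the next occurrence, so z >= 0."""
    return (1 - p) ** z * p
```

Small gaps are the most probable ones under this model, which is why a code that spends fewer bits on small numbers pays off.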
45
Building the index – Memory based inversion
  • Takes lots of main memory, ugh!
  • Can we reduce the memory requirement?
46
Sort-based inversion
  • Idea: try to make random access of disk (memory) sequential


  • // Phase I – collection of term appearances on disk
  • For each document Dd in collection, 1 ≤ d ≤ N
    • Read Dd, parsing it into index terms
    • For each index term t in Dd
      • Calculate fd,t
      • Dump to file a tuple (t,d,fd,t)
  • // Phase II – sort tuples
  • Sort all the tuples (t,d,f) using External Mergesort


  • // Phase III – write output file
  • Read the tuples in sorted order and create inverted file
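
The three phases can be sketched in a few lines. This is a simplification under stated assumptions: the tuples are held in a Python list rather than dumped to disk, and the built-in sort stands in for the external mergesort, so it shows the data flow but not the memory savings. The name `sort_based_inversion` is made up for the example.

```python
from collections import Counter
from itertools import groupby


def sort_based_inversion(docs):
    """docs: list of token lists; document numbers start at 1."""
    # Phase I: collect one tuple (t, d, f_dt) per term per document.
    tuples = []
    for d, tokens in enumerate(docs, start=1):
        for t, f_dt in Counter(tokens).items():
            tuples.append((t, d, f_dt))

    # Phase II: sort by term, then document number.
    # (In-memory sort used here in place of an external mergesort.)
    tuples.sort()

    # Phase III: read the tuples in sorted order and emit one
    # postings list per term.
    return {
        t: [(d, f_dt) for _, d, f_dt in group]
        for t, group in groupby(tuples, key=lambda item: item[0])
    }
```

After the sort, all tuples for one term are adjacent, so the final phase is a single sequential read.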
47
Sort based inversion: example
  • <a,1,2>
  • <b,1,2>
  • <c,1,1>
  • <a,2,2>
  • <d,2,1>
  • <b,2,1>
  • <b,3,1>
  • <d,3,1>


48
Using a first pass for the lexicon
  • Gets us fd,t and N
    • Savings: For any t, we know fd,t, so can use an array instead of a linked list (shrinks record by 40%!)
49
Lexicon-based inversion
  • Partition inversion as |I|/|M| = k smaller problems
    • build 1/k of inverted index on each pass
    • (e.g., a-b, b-c, …, y-z)
    • Tuned to fit amount of main memory in machine
    • Just remember boundary words

  • Can pair with disk strategy
    • Create k temporary files and write tuples (t,d,fd,t) for each partition on first pass
    • Each second pass builds index from temporary file
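
A minimal sketch of the partitioning idea, without the temporary files: each pass rereads the collection and keeps only the terms that fall in the current lexicon range. The function name and the way boundary words are passed in are assumptions for illustration.

```python
from bisect import bisect_right
from collections import Counter


def partitioned_inversion(docs, boundaries):
    """boundaries: sorted list of boundary words that split the
    vocabulary into len(boundaries) + 1 ranges."""
    k = len(boundaries) + 1
    index = {}
    for part in range(k):            # one pass over the collection per partition
        partial = {}                 # only this part of the index is held at once
        for d, tokens in enumerate(docs, start=1):
            for t, f_dt in Counter(tokens).items():
                # A term belongs to the range given by how many
                # boundary words are <= it.
                if bisect_right(boundaries, t) == part:
                    partial.setdefault(t, []).append((d, f_dt))
        index.update(partial)        # stands in for appending to the output file
    return index
```

The trade-off is visible in the loop structure: memory per pass shrinks by roughly a factor of k, but the collection is parsed k times.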


50
Inversion – Summary of Techniques
  • How do these techniques stack up?
  • Assume a 5 GB corpus and 40 MB main memory machine


  Technique                 Memory (MB)   Disk (GB)   Time (hours)
  *Linked lists (memory)           4000           0              6
  Linked lists (disk)                30           4           1100
  Sort-based                         40           8             20
  Lexicon-based                      40           0             79
  Lexicon w/ disk                    40           4             12
51
Query Matching
  • Now that we have an index, how do we answer queries?
52
Query Matching
  • Assuming a simple word matching engine:


  • For each query term t
    • Stem t
    • Search lexicon
    • Record ft and its inverted entry address, It
  • Select a query term t
  • Set list of candidates, C = It
  • For each remaining term t
    • Read its It
    • For each d in C, if d not in It set C = C – {d}


    • X and Y and Z – high _______
    • X or Y or Z – high _______
    • Which algorithm is the above?
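
The matching loop above might look like this in code. It is a sketch under assumptions: the lexicon maps each term to a pair (ft, It) with It a list index, query terms are taken to be stemmed already, and the starting term is chosen as the rarest one, which the pseudocode leaves open.

```python
def conjunctive_query(terms, lexicon, inverted_file):
    """lexicon: term -> (f_t, address I_t);
    inverted_file[I_t]: postings list [(d, f_dt), ...].
    Returns the set of documents containing every query term."""
    entries = []
    for t in terms:
        if t not in lexicon:
            return set()             # a missing term empties the result
        entries.append(lexicon[t])   # record f_t and I_t
    if not entries:
        return set()

    entries.sort()                   # assumption: start with the rarest term
    _, first_address = entries[0]
    candidates = {d for d, _ in inverted_file[first_address]}

    for _, address in entries[1:]:
        docs = {d for d, _ in inverted_file[address]}
        # Drop every candidate that is absent from this term's entry.
        candidates = {d for d in candidates if d in docs}
    return candidates
```

Starting from the shortest postings list keeps the candidate set small from the first step.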


53
Boolean Model
  • Query processing strategy:
    • Join less frequent terms first
    • Even in ORs, as merging takes longer than lookup


  • Problems with Boolean model:
    • Retrieves too many or too few documents
    • Longer documents tend to match more often because they have a larger vocabulary
    • Need ranked retrieval to help out
54
Deciding ranking
  • Boolean assigns same importance to all terms in a query


  • 5566 concert dates in Singapore
    • “5566” has same weight as “date”

  • One way:
    • Assign weights to the words, make more important words worth more
    • Process results in q and d vectors: (word, weight), (word, weight) … (word, weight)
55
Term Frequency
  • Xxxxxxxxxxxxxx IBM xxxxxxxxxxx xxxxxxxx xxxxxxxxxxx IBM xxxxxxx xxxxxxxxxx xxxxxxxx Apple.  Xxxxxxxxxx xxxxxxxxxx IBM xxxxxxxx.  Xxxxxxxxxx xxxxxxxx Compaq.  Xxxxxxxxx xxxxxxx IBM.


  • (Relative) term frequency can indicate importance.
  • Rd,t = fd,t
  • Rd,t = 1 + ln fd,t
  • Rd,t = (K + (1-K)           )
56
Inverse Document Frequency
  • Consider a future device for individual use, which is a sort of mechanized private file and library. It needs a name, and, to coin one at random, "memex" will do.


  • Words with higher ft are less discriminative.
  • Use inverse to measure importance:
    • wt = 1/ft
    • wt = ln (1 + N/ft) ← this one is most common
    • wt = ln (1 + fm/ft), where fm is the max observed frequency



58
This is TF*IDF
  • Many variants, but all capture:
    • Term frequency:
      Rd,t as being __________________


    • Inverse Document Frequency:
      Wt as being ___________________

  • Standard formulation is:
    wd,t = rd,t × wt = (1 + ln fd,t) × ln (1 + N/ft)


  • Problem:
    • rd,t grows as document grows, need to normalize; otherwise biased towards _____________
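
The standard formulation translates directly into code. The sketch below also computes a document vector length, which is one common way to do the normalization mentioned in the last bullet; the function names and the dictionary layout are assumptions for the example.

```python
import math


def tf_idf(f_dt, f_t, N):
    """w_dt = (1 + ln f_dt) * ln(1 + N / f_t); 0 if the term is absent."""
    if f_dt == 0:
        return 0.0
    return (1 + math.log(f_dt)) * math.log(1 + N / f_t)


def document_weights(term_counts, doc_freq, N):
    """term_counts: term -> f_dt for one document;
    doc_freq: term -> f_t over the whole collection."""
    return {t: tf_idf(f, doc_freq[t], N) for t, f in term_counts.items()}


def vector_length(weights):
    """Euclidean length of a weight vector, usable as a normalizer."""
    return math.sqrt(sum(w * w for w in weights.values()))
```

A term that occurs in every document (ft = N) still gets weight ln 2 under this variant, so it is down-weighted rather than dropped.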
59
Calculating Similarity
  • Euclidean Distance - bad
    • M(Q, D_d) = sqrt(Σ_t |w_{q,t} − w_{d,t}|^2)
    • Dissimilarity Measure; use reciprocal
    • Has problem with long documents, why?

  • Actually don’t care about vector length, just their direction
    • Want to measure difference in direction
60
Cosine Similarity
  • If X and Y are two n-dimensional vectors:
    • X · Y = |X| |Y| cos θ
    • cos θ = (X · Y) / (|X| |Y|)


    • = 1 when identical
    • = 0 when orthogonal
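
For sparse term-weight vectors, the cosine can be computed over dictionaries. This is a minimal sketch; the function name and the choice to return 0 for an all-zero vector are assumptions.

```python
import math


def cosine(x, y):
    """x, y: sparse vectors as dicts mapping term -> weight."""
    dot = sum(w * y.get(t, 0.0) for t, w in x.items())
    norm_x = math.sqrt(sum(w * w for w in x.values()))
    norm_y = math.sqrt(sum(w * w for w in y.values()))
    if norm_x == 0 or norm_y == 0:
        return 0.0   # assumption: an empty vector matches nothing
    return dot / (norm_x * norm_y)
```

Scaling either vector leaves the result unchanged, which is the property that makes cosine less sensitive to document length than Euclidean distance.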
61
Calculating the ranked list
  • To get the ranked list, we use doc. accumulators:
  • For each query term t, in order of increasing ft,
    • Read its inverted file entry It
    • Update acc. for each doc in It: Ad += ln (1 + fd,t) × wt
  • For each Ad in A
    • Ad /= Wd // that’s basically cos θ, don’t use wq
  • Report top r of A
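
A sketch of the accumulator loop, following the update rule as written on the slide. The data layout is assumed: `postings` maps a term to its list of (d, fd,t) pairs, `doc_lengths` maps d to Wd, and wt is taken to be ln(1 + N/ft). Ties are broken by document number, which the slide does not specify.

```python
import math


def ranked_query(terms, postings, doc_lengths, N, r):
    """Return the r best document numbers, best first."""
    accumulators = {}
    known = [t for t in terms if t in postings]
    # Process terms in order of increasing document frequency f_t.
    for t in sorted(known, key=lambda term: len(postings[term])):
        w_t = math.log(1 + N / len(postings[t]))
        for d, f_dt in postings[t]:
            accumulators[d] = accumulators.get(d, 0.0) + math.log(1 + f_dt) * w_t

    # Normalize by document length W_d; the query length is the same
    # for every document, so it does not affect the ordering.
    scores = {d: a / doc_lengths[d] for d, a in accumulators.items()}
    return sorted(scores, key=lambda d: (-scores[d], d))[:r]
```

Only documents that contain at least one query term ever get an accumulator, so the dictionary stays far smaller than the collection for narrow queries.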


62
Accumulator Storage
  • Holding all possible accumulators is expensive
    • Could need one for each document if query is broad


  • In practice, use fixed |A| wrt main memory.  What to do when all used?
    • Quit: _________________
    • Continue _____________________
63
Selecting r entries from accumulators
  • Want to return documents with largest cos values.


  • How? Use a min-heap
    • Load r A values into the heap H
    • Process remaining A-r values
      • If Ad > min{H} then
        • Delete min{H}, add Ad, and sift
    • // H now contains the top r exact cosine values
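
The heap procedure maps onto Python's `heapq` module, which implements a min-heap on a list. The function name `top_r` and the (score, document) tuple layout are choices made for this sketch.

```python
import heapq


def top_r(accumulators, r):
    """accumulators: dict mapping d -> A_d.
    Returns up to r pairs (A_d, d), largest score first."""
    if r <= 0:
        return []
    heap = []
    for d, a in accumulators.items():
        if len(heap) < r:
            heapq.heappush(heap, (a, d))      # load the first r values
        elif a > heap[0][0]:                  # compare with min{H}
            heapq.heapreplace(heap, (a, d))   # delete min, add A_d, sift
    return sorted(heap, reverse=True)
```

The heap never holds more than r entries, so selecting the top r from |A| accumulators costs O(|A| log r) rather than the O(|A| log |A|) of a full sort.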


64
To think about
  • How do you deal with a dynamic collection?
  • How do you support phrasal searching?
  • What about wildcard searching?
    • What types of wildcard searching are common?