|
1
|
- Orientation
- Week 1 Min-Yen KAN
|
|
2
|
- A place set apart to contain books for reading, study, or reference.
- (Not applied, e.g. to the shop or warehouse of a bookseller.)
- A building … containing a collection of books for the use of the public
or of some particular portion of it, or of the members of some society
or the like;
- a public institution or establishment, charged with the care of a
collection of books, and the duty of rendering the books accessible to
those who require to use them.
|
|
3
|
- A private commercial establishment for the lending of books, the
borrower paying either a fixed sum for each book lent or a periodical subscription.
- a great mass of learning or knowledge;
- the objects of a person's study, the sources on which he depends for
instruction.
- Computers. An organized collection of routines, esp. of tested routines
suitable for a particular model of computer
- Biology. a collection of sequences of DNA … that represent the genetic material of
a particular organism or tissue
|
|
4
|
- Bush’s “As we may think”
- Writes this at the end of WW II
- _____ was the first computer, born to compute ballistic tables fast
- _______ just invented 5 years ago
- _______ (“display technology”) still a less than perfect process.
- _______ (“storage technology”) was a mature and stable technology.
|
|
5
|
- Director of the Office of Scientific Research and Development
- led 6,000 scientists in R&D for WWII
- Predicted many technological advances
- the “memex” is one whose spirit we are implementing
- the purpose was to provide scientists the capability to exchange
information; to have access to the totality of recorded information
|
|
6
|
|
|
7
|
- Integrated computer, keyboard, and desk
- “mechanized private file and library”
- remove drudgery from information retrieval
- suggested implementation was microfilm
- various user operations are suggested
- _______________ was the main purpose
- “the process of tying two items together is the important thing”
- prelude to hypertext...
|
|
8
|
- Information could come pre-associatively indexed, but the key point was
______________
- WWW still does not provide that today
- Bush observes that tools change our way of doing, and expand the
horizons before us
- full impact of WWW and DLs still not known
|
|
9
|
- “a collection of information that is both digitized and organized”
(Lesk)
- there are a number of alternative definitions, but this seems fair enough
- no mention of ________, _________, __________, etc.
- It is not just to reform the current library system; rather, we aim to
- organize and access the “information overload”
|
|
10
|
- Introduction to libraries √
- Course administration
- Reading and writing research
- To think about
|
|
11
|
- Teaching staff
- Web sites
- Objective
- Syllabus
- Assessment overview
- Survey paper and project
- Any questions?
|
|
12
|
- Lecturer:
- Min-Yen Kan (“Min”)
- kanmy@comp.nus.edu.sg
- Office: S15 05-05
- 6875-1885
- Hours: 4-6 pm Tuesdays
- Interests: rock climbing, ballroom dancing, and inline skating… and digital libraries!
|
|
13
|
- http://ivle.nus.edu.sg/
- Discussion forum
- Any questions related to the course should be raised on this forum
- I expect you to talk amongst yourselves to answer questions, so I will not answer questions here much.
- Send me emails for urgent or personal matters
- Announcements!
- Workbin: Lecture notes (purposely incomplete!)
- http://www.comp.nus.edu.sg/~cs5244
- Grading specification
- Other supplementary content
|
|
14
|
- Building, using, presenting and maintaining large volumes of information
- Contrast computational approaches with traditional library science
methods
|
|
15
|
- http://www.comp.nus.edu.sg/~cs5244
|
|
16
|
- Class participation is very important. There are no “dumb” questions. You will only be penalized for having no questions or comments.
- Possibilities:
- Name tags
- Cold calls
- Small group discussion and presentation
|
|
17
|
- 1-hour midterm (10%) and a 2-hour final (20%)
- Both basically of the same format
- Calculation questions – these have an exact answer
- Essay questions – many look at tradeoffs in the digital library realm
- Not necessarily right or wrong answers
|
|
18
|
- Each student will pick an area of study and survey at least 4 papers in detail.
- Must be interesting to you
- Journal or conference papers from an authority list
- Limit to 6 pages
- Individual work only
- Give your perspective on area’s future
- Add value by comparing strengths and weaknesses of different approaches.
|
|
19
|
- Students will self-organize into groups for the final projects, shortly
after the survey papers are due.
- Requires original work
- Cooperation and coordination
- Report as a conference submission
- Poster presentation to the public
- Sample topics on the web page
|
|
20
|
- Introduction to libraries √
- Course administration √
- Reading and writing research
- To think about
|
|
21
|
- References:
- http://www.cse.ogi.edu/~dylan/efficientReading.html
- ftp://fast.cs.utah.edu/pub/writing-papers.ps
- This section is partially adapted from Surendar Chandra of the University of Notre Dame.
|
|
22
|
- Understand and learn new contributions
- However…
- Not all papers are “good”
- Not all papers are “interesting”
- Not all papers are “worthwhile” for you
- You have to learn to identify a good paper and spend your time wisely
|
|
23
|
- What is this paper about?
- Read the title and the abstract
- If you still don’t know what this paper is about, then this is a
poorly-written paper.
- Read the conclusion
- Are you now sure you know what this paper is about? If not, throw it
away.
- Read the ___________
- Read the ____________
- Read _____________ and captions
|
|
24
|
- See who wrote it, where it was published, and when it was written (credibility)
- Skim references
- Are the authors aware of relevant related work?
- Do you know the work that they cite?
- Do you know other work that they should have cited?
|
|
25
|
- Approach with scientific skepticism
- Read with context of other things that you’ve read in mind
- It’s only one part of the puzzle of a subject
- Examine the assumptions: are they reasonable?
- What are the limitations of the work?
- There are always limitations! Did the authors disclose them?
|
|
26
|
- Examine the methods:
- Did they measure what they claim?
- Can they explain what they observed?
- Want an analysis of why the system behaves a certain way, not raw
data.
- Did they have adequate controls?
- Were tests carried out in a standard way? Were the performance metrics
standard?
- If not, do they explain their metrics clearly?
|
|
27
|
- Examine the statistics: “Lies, d*mned lies and statistics”
- Appropriate statistical tests applied properly?
- Did they do proper error analysis?
- Are the results statistically significant?
|
|
28
|
- Examine the conclusions:
- Do the conclusions follow logically from the experiments?
- What other explanations are there for the observed effects?
- What other conclusions or correlations are in the data that were not
pointed out?
|
|
29
|
- Take notes
- Highlight major points
- React to the points in the paper
- Place this work in the context of your own experience
- If you doubt a statement, note your objection
- Summarize what you read
- Good practice: maintain your own bibliography of all papers that you
ever read
- ___________ !
|
|
30
|
- Write it such that anyone who reads it using the method we just
discussed understands the idea
- Clearly explain what problem you are solving, why it is interesting and
how your solution solves this interesting problem
- Be crisp. Explain what your contributions are: which ideas are yours and which are others’.
|
|
31
|
- Introduction to libraries √
- Course administration √
- Reading and writing research √
- To think about
|
|
32
|
- What are the functions of a traditional library?
- Are these same functions in the digital library?
- How is the digital library different from:
|
|
33
|
|
|
34
|
- Week 1 Min-Yen KAN
- Implementation of (Textual) Information Retrieval
|
|
35
|
|
|
36
|
- Part of the information seeking process
- Matches a query with most relevant documents
- View a query as a ______________
|
|
37
|
- _______
- _______
- _______
- Procedure:
- Look up topic
- Find the page
- Skim page to find topic
- …
- Index, 11, 103-151, 443
- Audio, 476
- Comparison of methods 143-145
- Granularity, 105, 112
- N-gram, 170-172
- Of integer sequences, 11
- Of musical themes, 11
- Of this book, 103, 507ff
- Within inverted file entry, see skipping
- Index compression, 114-129, 198-201, 235-237
- Batched, 125, 128
- Bernoulli, 119-122, 128, 150, 247, 421
- Context-sensitive, 125-126
- Global, 115-121
- Hyperbolic model, 123-124, 150
- In MG, 421-423
- Interpolative coding, 126-128
- Local, 115, 121-122, 247
- Nonparameterized, 115-119
- Observed frequency, 121, 124-125, 128, 247
- Parameterized, 115
- Performance of, 128-129, 421
- Skewed Bernoulli, 122-123, 138, 150
- Within-document frequencies, 198-201
- Index Construction, 223-261 (see also inversion)
- bitmaps, 255-256
- …
|
|
38
|
- Algorithm
- (Permute query to fit index)
- Search index
- Go to resource
- (Permute query to fit item)
- (Search for item)
|
|
39
|
- Book indices have key words and phrases
- Search engines index ____________
- Why the disparity?
- What do people really search for?
|
|
40
|
- Can save up to 32% without too much loss:
- Stemming
- Usually just word inflection
- Information → Inform (which also matches Informal, Informed)
- Case folding
- N.B.: keep odd variants (e.g., NeXT, LaTeX)
- Stop words
- Don’t index common words; people won’t search on them anyway
- Pop Quiz: Which of these techniques are more effective?
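A minimal sketch of these three normalization steps in Python. The suffix stripper is a toy stand-in for a real stemmer (e.g., Porter's), and the stop list and KEEP_AS_IS set are illustrative assumptions, not a recommended configuration:

```python
# Toy index-normalization pipeline: stemming, case folding, stop words.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in"}  # tiny sample list
KEEP_AS_IS = {"NeXT", "LaTeX"}                            # odd variants to preserve

def toy_stem(word):
    # Crude suffix stripping; a stand-in for a real stemmer.
    for suffix in ("ation", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def normalize(tokens):
    out = []
    for tok in tokens:
        if tok not in KEEP_AS_IS:
            tok = tok.lower()            # case folding
        if tok.lower() in STOP_WORDS:    # stop word removal
            continue
        out.append(toy_stem(tok))
    return out

print(normalize("The Informing of LaTeX users".split()))
# ['inform', 'LaTeX', 'user']
```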
|
|
41
|
- Output = L_W, D_D, I_{W×D}
- Inverted File (Index)
- Postings (e.g., w_t → (d_1, f_{w_t,d_1}), (d_2, f_{w_t,d_2}), …, (d_n, f_{w_t,d_n}))
- Variable length records
- Lexicon:
- String w_t
- Document frequency f_t
- Address within inverted file I_t
- Sorted, fixed length records
- [Figure: example term–document matrix I_{W×D} with terms W_1 … W_n as rows, documents D_1 … D_m as columns, and cells holding within-document frequencies]
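A minimal sketch of this structure in Python: a dict-based lexicon records each term's document frequency f_t, and the postings lists hold (d, f_{d,t}) pairs. The toy collection is chosen to match the ⟨t, d, f⟩ tuple example used later in the inversion slides:

```python
from collections import defaultdict

def build_index(docs):
    """Build lexicon (term -> f_t) and postings (term -> [(d, f_{d,t}), ...])."""
    postings = defaultdict(list)
    for d, text in sorted(docs.items()):
        counts = defaultdict(int)
        for tok in text.split():
            counts[tok] += 1                    # f_{d,t}
        for t, f in sorted(counts.items()):
            postings[t].append((d, f))          # variable-length record
    lexicon = {t: len(plist) for t, plist in postings.items()}  # f_t
    return lexicon, postings

docs = {1: "a a b b c", 2: "a a d b", 3: "b d"}
lexicon, postings = build_index(docs)
print(lexicon["b"], postings["b"])   # 3 [(1, 2), (2, 1), (3, 1)]
```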
|
|
42
|
- Pop Quiz: Which of these techniques are more effective?
- Typical sizes:
- Lexicon: 30 MB; inverted file: 400 MB
- Stemming
- Case folding
- Stop words
|
|
43
|
- Problem: still have to scan document to find the term.
- Cons:
- Need access methods to take advantage
- Extra storage space overhead (variable sized)
- Alternative methods:
- Hierarchical encoding (doc #, para #, sent #, word #) to shrink offset
size
- Split long documents into n shorter ones.
|
|
44
|
- Key idea: encode gap length instead of offset
- Use a small number of bits to encode the more common gap lengths
- Better: use a distribution of expected gap lengths (e.g., a Bernoulli process)
- If p = probability that a given word appears in a given document,
- then Pr(gap = z) = (1 − p)^(z−1) p. This gives a geometric distribution.
- Works for intra and inter-document index compression
- Why does it hold for documents as well as words?
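A sketch of gap encoding in Python, using the nonparameterized Elias gamma code for concreteness; a Bernoulli-tuned scheme would instead use a Golomb code whose parameter depends on p, omitted here:

```python
def to_gaps(doc_ids):
    # Store d-gaps instead of absolute ids; frequent terms yield small gaps.
    prev, out = 0, []
    for d in doc_ids:
        out.append(d - prev)
        prev = d
    return out

def elias_gamma(x):
    # Elias gamma code for x >= 1: (|binary(x)| - 1) zeros, then binary(x).
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

ids = [3, 7, 8, 12, 30]
print([elias_gamma(g) for g in to_gaps(ids)])
# gaps [3, 4, 1, 4, 18] -> ['011', '00100', '1', '00100', '000010010']
```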
|
|
45
|
- Takes lots of main memory, ugh!
- Can we reduce the memory requirement?
|
|
46
|
- Idea: try to make random access of disk (memory) sequential
- // Phase I – collect term appearances on disk
- For each document D_d in the collection, 1 ≤ d ≤ N
- Read D_d, parsing it into index terms
- For each index term t in D_d
- Calculate f_{d,t}
- Dump a tuple (t, d, f_{d,t}) to file
- // Phase II – sort tuples
- Sort all the tuples (t, d, f) using external mergesort
- // Phase III – write output file
- Read the tuples in sorted order and create the inverted file (sketched below)
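A runnable sketch of the three phases on a toy collection (the same one as the tuple example on the next slide). An in-memory sort stands in for the external mergesort, which is the part that matters once the tuples exceed main memory:

```python
import os, tempfile

def sort_based_inversion(docs):
    # Phase I: stream (t, d, f_{d,t}) tuples to disk.
    tmp = tempfile.NamedTemporaryFile("w", delete=False, suffix=".tuples")
    for d, text in sorted(docs.items()):
        counts = {}
        for tok in text.split():
            counts[tok] = counts.get(tok, 0) + 1
        for t, f in counts.items():
            tmp.write(f"{t}\t{d}\t{f}\n")
    tmp.close()
    # Phase II: sort tuples by (t, d). A real system uses external mergesort.
    with open(tmp.name) as fh:
        tuples = [line.split() for line in fh]
    os.unlink(tmp.name)
    tuples.sort(key=lambda x: (x[0], int(x[1])))
    # Phase III: read in sorted order and build the inverted file.
    index = {}
    for t, d, f in tuples:
        index.setdefault(t, []).append((int(d), int(f)))
    return index

docs = {1: "a a b b c", 2: "a a d b", 3: "b d"}
print(sort_based_inversion(docs)["b"])   # [(1, 2), (2, 1), (3, 1)]
```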
|
|
47
|
- <a,1,2>
- <b,1,2>
- <c,1,1>
- <a,2,2>
- <d,2,1>
- <b,2,1>
- <b,3,1>
- <d,3,1>
|
|
48
|
- Gets us f_{d,t} and N
- Savings: for any t, we know f_{d,t} in advance, so we can use an array vs. a linked list (shrinks the record by 40%!)
|
|
49
|
- Partition inversion into k = |I|/|M| smaller problems
- Build 1/k of the inverted index on each pass
- (e.g., a–b, b–c, …, y–z)
- Tuned to fit the amount of main memory in the machine
- Just remember boundary words
- Can pair with the disk strategy
- Create k temporary files and write tuples (t, d, f_{d,t}) for each partition on the first pass
- Each second pass builds its part of the index from its temporary file (sketched below)
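A sketch of the multi-pass partitioned approach, assuming the boundary words are known in advance. Each pass re-reads the whole collection but only inverts terms falling in its partition, so memory use stays near |I|/k:

```python
def partitioned_inversion(docs, boundaries):
    # boundaries, e.g. ["h", "p"], split the term space into k = 3
    # partitions: ["", "h"), ["h", "p"), ["p", end).
    index = {}
    for lo, hi in zip([""] + boundaries, boundaries + [None]):
        for d, text in sorted(docs.items()):      # one full pass per partition
            counts = {}
            for tok in text.split():
                if tok >= lo and (hi is None or tok < hi):
                    counts[tok] = counts.get(tok, 0) + 1
            for t, f in sorted(counts.items()):
                index.setdefault(t, []).append((d, f))
    return index

docs = {1: "apple pear mango", 2: "pear zebra apple"}
print(partitioned_inversion(docs, ["h", "p"])["pear"])   # [(1, 1), (2, 1)]
```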
|
|
50
|
- How do these techniques stack up?
- Assume a 5 GB corpus and 40 MB main memory machine
- Technique               Memory (MB)   Disk (GB)   Time (hours)
- *Linked lists (memory)         4000           0              6
- Linked lists (disk)              30           4           1100
- Sort-based                       40           8             20
- Lexicon-based                    40           0             79
- Lexicon w/ disk                  40           4             12
|
|
51
|
- Now that we have an index, how do we answer queries?
|
|
52
|
- Assuming a simple word-matching engine:
- For each query term t
- Stem t
- Search the lexicon
- Record f_t and its inverted entry address, I_t
- Select a query term t
- Set the list of candidates, C = I_t
- For each remaining term t
- Read its I_t
- For each d in C, if d is not in I_t, set C = C − {d}
- X and Y and Z – high _______
- X or Y or Z – high _______
- Which algorithm is the above?
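A sketch of the conjunctive (AND) case in Python, joining the least frequent term first so the candidate set C starts as small as possible and can only shrink (stemming and lexicon lookup are assumed already done):

```python
def boolean_and(query_terms, lexicon, postings):
    # Process terms in order of increasing f_t.
    terms = sorted(query_terms, key=lambda t: lexicon[t])
    C = {d for d, f in postings[terms[0]]}          # initial candidates
    for t in terms[1:]:
        C &= {d for d, f in postings[t]}            # drop d not in I_t
    return sorted(C)

lexicon  = {"a": 2, "b": 3, "d": 2}                 # f_t values
postings = {"a": [(1, 2), (2, 2)],
            "b": [(1, 2), (2, 1), (3, 1)],
            "d": [(2, 1), (3, 1)]}
print(boolean_and(["b", "a", "d"], lexicon, postings))   # [2]
```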
|
|
53
|
- Query processing strategy:
- Join less frequent terms first
- Even in ORs, as merging takes longer than lookup
- Problems with Boolean model:
- Retrieves too many or too few documents
- Longer documents tend to match more often because they have a larger vocabulary
- Need ranked retrieval to help out
|
|
54
|
- Boolean assigns same importance to all terms in a query
- 5566 concert dates in Singapore
- “5566” has same weight as “date”
- One way:
- Assign weights to the words, make more important words worth more
- This process yields q and d vectors: (word, weight), (word, weight), …, (word, weight)
|
|
55
|
- Xxxxxxxxxxxxxx IBM xxxxxxxxxxx xxxxxxxx xxxxxxxxxxx IBM xxxxxxx
xxxxxxxxxx xxxxxxxx Apple.
Xxxxxxxxxx xxxxxxxxxx IBM xxxxxxxx. Xxxxxxxxxx xxxxxxxx Compaq. Xxxxxxxxx xxxxxxx IBM.
- (Relative) term frequency can indicate importance.
- r_{d,t} = f_{d,t}
- r_{d,t} = 1 + ln f_{d,t}
- r_{d,t} = K + (1 − K) (f_{d,t} / f_{d,max})
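The three r_{d,t} variants, computed for the counts in the example document above (IBM appears four times, Apple once); K = 0.5 is an assumed value for the augmented form:

```python
import math

def tf_raw(f):              return f                 # r_{d,t} = f_{d,t}
def tf_log(f):              return 1 + math.log(f)   # r_{d,t} = 1 + ln f_{d,t}
def tf_aug(f, fmax, K=0.5):                          # r_{d,t} = K + (1-K) f/f_max
    return K + (1 - K) * f / fmax

fmax = 4                                             # IBM is the most frequent term
for term, f in [("IBM", 4), ("Apple", 1)]:
    print(term, tf_raw(f), round(tf_log(f), 2), tf_aug(f, fmax))
# IBM 4 2.39 1.0
# Apple 1 1.0 0.625
```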
|
|
56
|
- Consider a future device for individual use, which is a sort of
mechanized private file and library. It needs a name, and, to coin one
at random, "memex" will do.
|
|
57
|
- Consider a future device for individual use, which is a sort of mechanized
private file and library. It needs a name, and, to coin one at random,
"memex" will do.
- Words with higher f_t are less discriminative.
- Use the inverse to measure importance:
- w_t = 1/f_t
- w_t = ln(1 + N/f_t) ← this one is most common
- w_t = ln(1 + f_m/f_t), where f_m is the max observed frequency
|
|
58
|
- Many variants, but all capture:
- Term frequency: r_{d,t} as being __________________
- Inverse Document Frequency: w_t as being ___________________
- Standard formulation is: w_{d,t} = r_{d,t} × w_t = (1 + ln f_{d,t}) × ln(1 + N/f_t)
- Problem:
- r_{d,t} grows as the document grows; need to normalize, otherwise biased towards _____________
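A sketch of the standard formulation above; the counts are made up to show how a rare term outweighs a common one at equal f_{d,t}:

```python
import math

def tfidf(f_dt, f_t, N):
    # w_{d,t} = (1 + ln f_{d,t}) * ln(1 + N / f_t)
    return (1 + math.log(f_dt)) * math.log(1 + N / f_t)

# N = 1000 documents; both terms occur 3 times in the document.
print(round(tfidf(3, 10, 1000), 2))    # in 10 docs (rare):    9.69
print(round(tfidf(3, 900, 1000), 2))   # in 900 docs (common): 1.57
```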
|
|
59
|
- Euclidean distance – bad
- M(Q, D_d) = sqrt(Σ_t (w_{q,t} − w_{d,t})²)
- Dissimilarity measure; use the reciprocal
- Has a problem with long documents; why?
- Actually don’t care about vector length, just their direction
- Want to measure difference in direction
|
|
60
|
- If X and Y are two n-dimensional vectors:
- X · Y = |X| |Y| cos θ
- cos θ = X · Y / (|X| |Y|)
- = 1 when identical
- = 0 when orthogonal
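A sketch of cosine similarity over sparse term-weight vectors (dicts); the last line shows that scaling a vector changes its length but not its direction, so the score is unchanged:

```python
import math

def cosine(x, y):
    # x, y: sparse vectors as {term: weight} dicts.
    dot = sum(w * y.get(t, 0.0) for t, w in x.items())
    nx = math.sqrt(sum(w * w for w in x.values()))
    ny = math.sqrt(sum(w * w for w in y.values()))
    return dot / (nx * ny)

q = {"digital": 1.0, "library": 1.0}
d = {"digital": 2.0, "library": 1.0, "systems": 1.0}
print(round(cosine(q, d), 3))                                   # 0.866
print(round(cosine(d, {t: 10 * w for t, w in d.items()}), 3))   # 1.0
```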
|
|
61
|
- To get the ranked list, we use document accumulators:
- For each query term t, in order of increasing f_t:
- Read its inverted file entry I_t
- Update the accumulator for each doc in I_t: A_d += ln(1 + f_{d,t}) × w_t
- For each A_d in A:
- A_d /= W_d // that’s basically cos θ; don’t use w_q
- Report the top r of A (sketched below)
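A sketch of the accumulator loop, assuming the document weights W_d are precomputed (the doc_norms dict here is a made-up example) and reusing the toy postings from the Boolean example:

```python
import math

def rank(query, lexicon, postings, doc_norms, N, r=3):
    A = {}                                                # accumulators A_d
    for t in sorted(query, key=lambda u: lexicon.get(u, N)):  # increasing f_t
        if t not in postings:
            continue
        w_t = math.log(1 + N / lexicon[t])                # IDF weight
        for d, f_dt in postings[t]:
            A[d] = A.get(d, 0.0) + math.log(1 + f_dt) * w_t
    for d in A:
        A[d] /= doc_norms[d]                              # divide by W_d
    return sorted(A.items(), key=lambda kv: -kv[1])[:r]   # top r

lexicon  = {"a": 2, "b": 3}
postings = {"a": [(1, 2), (2, 2)], "b": [(1, 2), (2, 1), (3, 1)]}
norms    = {1: 2.0, 2: 1.8, 3: 1.0}                       # assumed precomputed W_d
print(rank(["a", "b"], lexicon, postings, norms, N=3))
```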
|
|
62
|
- Holding all possible accumulators is expensive
- Could need one for each document if query is broad
- In practice, use a fixed |A| sized to main memory. What to do when all are used?
- Quit: _________________
- Continue _____________________
|
|
63
|
- Want to return documents with the largest cos values.
- How? Use a min-heap (sketched below)
- Load r values of A into the heap H
- Process the remaining |A| − r values
- If A_d > min{H} then
- Delete min{H}, add A_d, and sift
- // H now contains the top r exact cosine values
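A sketch with Python's heapq (a min-heap): the root always holds the smallest of the current top r, so each remaining value needs only one comparison against it:

```python
import heapq

def top_r(values, r):
    H = values[:r]
    heapq.heapify(H)                 # min-heap of the first r values
    for a in values[r:]:
        if a > H[0]:                 # beats the current minimum?
            heapq.heapreplace(H, a)  # delete min, add a, sift
    return sorted(H, reverse=True)   # H holds the top r exact values

A = [0.12, 0.87, 0.45, 0.91, 0.33, 0.78, 0.05]
print(top_r(A, 3))                   # [0.91, 0.87, 0.78]
```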
|
|
64
|
- How do you deal with a dynamic collection?
- How do you support phrasal searching?
- What about wildcard searching?
- What types of wildcard searching are common?
|