|
1
|
- Module 3
Min-Yen KAN
- Implementation of
(Textual) Information Retrieval
|
|
2
|
- Part of the information seeking process
- Matches a query with most relevant documents
- View a query as a mini-document
|
|
3
|
- ___
- ___
- ___
- Procedure:
- Look up topic
- Find the page
- Skim page to find topic
- …
- Index, 11, 103-151, 443
- Audio, 476
- Comparison of methods 143-145
- Granularity, 105, 112
- N-gram, 170-172
- Of integer sequences, 11
- Of musical themes, 11
- Of this book, 103, 507ff
- Within inverted file entry, see skipping
- Index compression, 114-129, 198-201, 235-237
- Batched, 125, 128
- Bernoulli, 119-122, 128, 150, 247, 421
- Context-sensitive, 125-126
- Global, 115-121
- Hyperbolic model, 123-124, 150
- In MG, 421-423
- Interpolative coding, 126-128
- Local, 115, 121-122, 247
- Nonparameterized, 115-119
- Observed frequency, 121, 124-125, 128, 247
- Parameterized, 115
- Performance of, 128-129, 421
- Skewed Bernoulli, 122-123, 138, 150
- Within-document frequencies, 198-201
- Index Construction, 223-261 (see also inversion)
- bitmaps, 255-256
- …
|
|
4
|
- Algorithm
- (Permute query to fit index)
- Search index
- Go to resource
- (Permute query to fit item)
- (Search for item)
|
|
5
|
- Book indices have key words and phrases
- Search engines index (all) words
- Why the disparity?
- What do people really search for?
|
|
6
|
- Can save up to 32% without too much loss:
- Stemming
- Usually just word inflection
- Information → Inform = Informal, Informed
- Case folding
- N.B.: keep odd variants (e.g., NeXT, LaTeX)
- Stop words
- Don’t index common words; people won’t search on them anyway
- Pop Quiz: Which of these techniques is most effective?
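The three techniques above can be sketched as a single normalization step. This is a minimal illustration only: the stop list, the suffix rules, and the names `crude_stem` and `index_terms` are assumptions made for the example, not a real stemmer such as Porter's.

```python
import re

# Hypothetical stop list and suffix rules, chosen only for illustration.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "on"}
SUFFIXES = ("ation", "ing", "ed", "al", "s")
KEEP_CASE = {"NeXT", "LaTeX"}  # odd variants whose case is preserved


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Tokenize, case-fold, drop stop words, then stem."""
    terms = []
    for token in re.findall(r"[A-Za-z]+", text):
        if token in KEEP_CASE:
            terms.append(token)  # keep the odd variant untouched
            continue
        folded = token.lower()
        if folded in STOP_WORDS:
            continue
        terms.append(crude_stem(folded))
    return terms
```

With these toy rules, "Information", "Informal" and "Informed" all map to the same index term, which is the conflation shown on the slide.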
|
|
7
|
- Output = Lw,DD,IW×D
- Inverted File (Index)
- Postings (e.g., $w_t \to (d_1, f_{w_t,d_1}), (d_2, f_{w_t,d_2}), \ldots, (d_n, f_{w_t,d_n})$)
- Variable length records
- Lexicon:
- String Wt
- Document frequency ft
- Address within inverted file It
- Sorted, fixed length records
- × D1 D2 D3 D4 D5 D6 … Dm
- W1 1 1
- W2 2 1
- W3 1
- W4 1 1
- W5 1 1
- W6 1 1 1
- …
- Wn
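A small in-memory sketch of the lexicon and inverted file described above. The function name `build_inverted_index` and the use of a list position as the "address" are assumptions for illustration; a real inverted file would store byte offsets on disk.

```python
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """docs: dict mapping document number d -> list of index terms."""
    postings = defaultdict(list)  # term -> [(d, f_dt), ...] in increasing d
    for d in sorted(docs):
        for term, f_dt in sorted(Counter(docs[d]).items()):
            postings[term].append((d, f_dt))

    inverted_file = []  # variable-length records, one per term
    lexicon = []        # sorted (term, f_t, address) records
    for term in sorted(postings):
        # f_t is the number of documents containing the term.
        lexicon.append((term, len(postings[term]), len(inverted_file)))
        inverted_file.append(postings[term])
    return lexicon, inverted_file
```

Looking up a term then means a search of the sorted lexicon followed by one read of the inverted file at the recorded address.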
|
|
8
|
- Pop Quiz: Which of these techniques is most effective?
- Typical:
- Lexicon = 30 MB Inverted File: 400 MB
- Stemming
- Case folding
- Stop words
|
|
9
|
- Problem: still have to scan document to find the term.
- Cons:
- Need access methods to take advantage
- Extra storage space overhead (variable sized)
- Alternative methods:
- Hierarchical encoding (doc #, para #, sent #, word #) to shrink offset
size
- Split long documents into n shorter ones.
|
|
10
|
- Clue: Encode gap length instead of offset
- Use small number of bits to encode more common gap lengths
- Better: Use a distribution of expected gap length (e.g., Bernoulli
process)
- If p = prob. that any word x appears in doc y, then the probability of a gap of size z is $(1-p)^{z}\, p$. This is a geometric distribution.
- Works for intra and inter-document index compression
- _________________________________
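To make gap encoding concrete, the sketch below converts document numbers to gaps and codes each gap with an Elias gamma code, so small (common) gaps take few bits. Gamma is a non-parameterized code swapped in here for simplicity; it is not the Bernoulli-based code the slide alludes to, and the function names are illustrative.

```python
def to_gaps(doc_numbers):
    """Turn an increasing list of document numbers into gap lengths."""
    gaps, previous = [], 0
    for d in doc_numbers:
        gaps.append(d - previous)
        previous = d
    return gaps


def gamma_encode(n):
    """Elias gamma code for n >= 1: unary length prefix, then binary value."""
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary


def gamma_decode(bits):
    """Decode a concatenation of gamma codes back into integers."""
    values, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":  # count the unary prefix
            zeros += 1
            i += 1
        values.append(int(bits[i : i + zeros + 1], 2))
        i += zeros + 1
    return values
```

A gap of 1 costs a single bit under this convention, while a gap of 15 costs seven bits.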
|
|
11
|
- Takes lots of main memory, ugh!
- Can we reduce the memory requirement?
|
|
12
|
- Idea: try to make random access of disk (memory) sequential
- // Phase I – collection of term appearances on disk
- For each document Dd in collection, 1 ≤ d ≤ N
- Read Dd, parsing it into index terms
- For each index term t in Dd
- Calculate fd,t
- Dump to file a tuple (t,d,fd,t)
- // Phase II – sort tuples
- Sort all the tuples (t,d,f) using External Mergesort
- // Phase III – write output file
- Read the tuples in sorted order and create inverted file
|
|
13
|
- <a,1,2>
- <b,1,2>
- <c,1,1>
- <a,2,2>
- <d,2,1>
- <b,2,1>
- <b,3,1>
- <d,3,1>
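The tuples above can be pushed through Phases II and III in a few lines. This is a sketch under one loud assumption: an in-memory `sort()` stands in for the external mergesort, which only works because the example is tiny.

```python
from itertools import groupby

# (t, d, f_dt) tuples in the order Phase I would dump them.
tuples = [
    ("a", 1, 2), ("b", 1, 2), ("c", 1, 1),
    ("a", 2, 2), ("d", 2, 1), ("b", 2, 1),
    ("b", 3, 1), ("d", 3, 1),
]

# Phase II: sort by term, then document (stand-in for external mergesort).
tuples.sort()

# Phase III: one sequential pass over the sorted run builds each entry.
inverted = {
    term: [(d, f) for _, d, f in group]
    for term, group in groupby(tuples, key=lambda rec: rec[0])
}
```

After sorting, all tuples for a term are adjacent, so each inverted file entry is produced by a single sequential read.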
|
|
14
|
- Gets us fd,t and N
- Savings: For any t, we know fd,t, so can use an array vs. a linked list
(shrinks record by 40%!)
|
|
15
|
- Partition inversion as |I|/|M| = k smaller problems
- build 1/k of inverted index on each pass
- (e.g., a-b, b-c, …, y-z)
- Tuned to fit amount of main memory in machine
- Just remember boundary words
- Can pair with disk strategy
- Create k temporary files and write tuples (t,d,fd,t) for
each partition on first pass
- Each second pass builds index from temporary file
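A rough sketch of the partitioned build, assuming the boundary words are already chosen. The names `partition_of` and `build_in_passes` are hypothetical, and updating a dictionary stands in for writing each finished partition to disk.

```python
from bisect import bisect_right
from collections import defaultdict


def partition_of(term, boundaries):
    """Index of the partition whose term range contains `term`."""
    return bisect_right(boundaries, term)


def build_in_passes(tuples, boundaries):
    """Build the index one partition per pass over the (t, d, f) tuples."""
    index = {}
    for k in range(len(boundaries) + 1):
        part = defaultdict(list)  # only this partition is held in memory
        for term, d, f in tuples:
            if partition_of(term, boundaries) == k:
                part[term].append((d, f))
        index.update(sorted(part.items()))  # stands in for writing to disk
    return index
```

Only the sorted list of boundary words has to be remembered between passes; here a term equal to a boundary word falls into the later partition.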
|
|
16
|
- How do these techniques stack up?
- Assume a 5 GB corpus and 40 MB main memory machine
- Technique                Memory (MB)   Disk (GB)   Time (hours)
- *Linked lists (memory)   4000          0           6
- Linked lists (disk)      30            4           1100
- Sort-based               40            8           20
- Lexicon-based            40            0           79
- Lexicon w/ disk          40            4           12
|
|
17
|
- Now that we have an index, how do we answer queries?
|
|
18
|
- Assuming a simple word matching engine:
- For each query term t
- Stem t
- Search lexicon
- Record ft and its inverted entry address, It
- Select a query term t
- Set list of candidates, C = It
- For each remaining term t
- Read its It
- For each d in C, if d not in It set C = C – {d}
- X and Y and Z – high precision
- X or Y or Z – high recall
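The AND case of the procedure above might look like the following sketch. It assumes query terms are already stemmed, that the lexicon is a dictionary mapping a term to `(f_t, address)`, and that starting with the rarest term is the selection rule; all of these are illustrative choices rather than something the slide specifies.

```python
def conjunctive_query(terms, lexicon, inverted_file):
    """AND query; lexicon maps term -> (f_t, address), a hypothetical layout."""
    entries = []
    for t in terms:
        if t not in lexicon:
            return []  # a missing term empties an AND query
        entries.append(lexicon[t])
    entries.sort()  # rarest term first keeps the candidate set small

    # C starts as the documents of the first (rarest) term.
    candidates = [d for d, _ in inverted_file[entries[0][1]]]
    for _, address in entries[1:]:
        docs = {d for d, _ in inverted_file[address]}
        # Drop every candidate that the current term does not contain.
        candidates = [d for d in candidates if d in docs]
    return candidates
```

An OR query would take the union of the entries instead, which trades precision for recall.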
|
|
19
|
- Query processing strategy:
- ______________________
- Even in ORs, as merging takes longer than lookup
- Problems with Boolean model:
- _______________________________
- Longer documents tend to match more often because they have a
larger vocabulary
- Need ranked retrieval to help out
|
|
20
|
- Boolean assigns same importance to all terms in a query
- F4 concert dates in Singapore
- Problem: “F4” has same weight as “date”
- One way:
- Assign weights to the words, make more important words worth more
- Process results in q and d vectors: (word, weight), (word, weight) …
(word, weight)
|
|
21
|
- Xxxxxxxxxxxxxx IBM xxxxxxxxxxx xxxxxxxx xxxxxxxxxxx IBM xxxxxxx
xxxxxxxxxx xxxxxxxx Apple.
Xxxxxxxxxx xxxxxxxxxx IBM xxxxxxxx. Xxxxxxxxxx xxxxxxxx Compaq. Xxxxxxxxx xxxxxxx IBM.
- (Relative) term frequency can indicate importance.
- Rd,t = fd,t
- Rd,t = 1 + ln fd,t
- Rd,t = K + (1-K) fd,t / max_i fd,i
|
|
22
|
- Consider a future device for individual use, which is a sort of
mechanized private file and library. It needs a name, and, to coin one
at random, "memex" will do.
|
|
23
|
- Consider a future device for individual use, which is a sort of mechanized
private file and library. It needs a name, and, to coin one at random,
"memex" will do.
- Words with higher ft are less discriminative.
- Use inverse to measure importance:
- wt = 1/ft
- wt = ln (1 + N/ft) ← this one is most common
- wt = ln (1 + fm/ft), where fm
is the max observed frequency
|
|
24
|
- Many variants, but all capture:
- Term frequency:
Rd,t as _____________________
- Inverse Document Frequency:
Wt as ________________________
- Standard formulation is:
wd,t = rd,t × wt
= (1+ ln(fd,t)) × ln (1 + N/ft)
- Problem:
- rd,t grows as document grows, need to normalize; otherwise
biased towards _______________
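The standard formulation above translates directly into code. A minimal sketch, where the function name `term_weight` is an assumption and the term is assumed to occur at least once in the document (f_dt >= 1).

```python
import math


def term_weight(f_dt, N, f_t):
    """w_dt = (1 + ln f_dt) * ln(1 + N / f_t), for f_dt >= 1."""
    r_dt = 1 + math.log(f_dt)    # term frequency component
    w_t = math.log(1 + N / f_t)  # inverse document frequency component
    return r_dt * w_t
```

For a fixed within-document frequency, a term found in few documents gets a larger weight than one found in many, and repeating a term raises its weight only logarithmically.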
|
|
25
|
- Euclidean Distance - bad
- $M(Q, D_d) = \sqrt{\sum_t |w_{q,t} - w_{d,t}|^2}$
- Dissimilarity Measure; use reciprocal
- Has problem with long documents, why?
- Actually don’t care about vector length, just their direction
- Want to measure difference in direction
|
|
26
|
- If X and Y are two n-dimensional vectors:
- X · Y = |X| |Y| cos θ
- cos θ = (X · Y) / (|X| |Y|)
- = 1 when the vectors point in the same direction
- = 0 when orthogonal
|
|
27
|
- To get the ranked list, we use doc. accumulators:
- For each query term t, in order of increasing ft,
- Read its inverted file entry It
- Update acc. for each doc in It: Ad += ln (1 + fd,t) × wt
- For each Ad in A
- Ad /= Wd // that’s basically cos θ, don’t use wq
- Report top r of A
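The accumulator procedure above can be sketched as follows. The assumptions: postings live in a dictionary of `(d, f_dt)` lists, `doc_lengths` holds precomputed W_d values, w_t is taken as ln(1 + N/f_t), and a plain sort picks the top r. The function name `rank` is illustrative.

```python
import math
from collections import defaultdict


def rank(query_terms, postings, doc_lengths, N, r):
    """postings: term -> [(d, f_dt)]; doc_lengths: d -> W_d (vector length)."""
    present = [t for t in query_terms if t in postings]
    present.sort(key=lambda t: len(postings[t]))  # increasing f_t

    acc = defaultdict(float)  # A_d, created only for documents that match
    for t in present:
        w_t = math.log(1 + N / len(postings[t]))
        for d, f_dt in postings[t]:
            acc[d] += math.log(1 + f_dt) * w_t

    # Normalize by document length only; the query length is the same
    # for every document and does not change the ranking.
    scores = {d: a / doc_lengths[d] for d, a in acc.items()}
    return sorted(scores, key=lambda d: (-scores[d], d))[:r]
```

Dividing by W_d is what stops a long document from winning simply because it accumulates more matches.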
|
|
28
|
- Holding all possible accumulators is expensive
- Could need one for each document if query is broad
- In practice, use fixed |A| wrt main memory. What to do when all used?
- Quit: _____________________
- Continue ____________________
|
|
29
|
- Want to return documents with largest cos values.
- How? Use a ____________
- Load r A values into the _____
- Process remaining |A| − r values
- If Ad > min{H} then
- Delete min{H}, add Ad, and sift
- // H now contains the top r exact cosine values
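One structure that fits the steps above is a min-heap of size r. The sketch below uses Python's standard `heapq` module; the function name `top_r` and the `(A_d, d)` pair layout are assumptions, and r is assumed to be at least 1.

```python
import heapq


def top_r(scores, r):
    """scores: {d: A_d}. Return the r best (A_d, d) pairs, best first."""
    items = [(a, d) for d, a in scores.items()]
    heap = items[:r]
    heapq.heapify(heap)  # min-heap: heap[0] is the smallest kept value
    for item in items[r:]:
        if item > heap[0]:
            heapq.heapreplace(heap, item)  # delete the min, add, and sift
    return sorted(heap, reverse=True)
```

Each of the remaining values costs at most one comparison against the current minimum plus an occasional O(log r) sift, which is cheaper than sorting all of A when r is small.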
|
|
30
|
- How do you deal with a dynamic collection?
- How do you support phrasal searching?
- What about wildcard searching?
- What types of wildcard searching are common?
|