Fundamentals of
Information Retrieval
Module 3                      Min-Yen KAN
Implementation of
(Textual) Information Retrieval

What is information retrieval?
Part of the information seeking process
Matches a query with most relevant documents
View a query as a mini-document

Searching in books
___
___
___
Procedure:
Look up topic
Find the page
Skim page to find topic
Index, 11, 103-151, 443
Audio, 476
Comparison of methods 143-145
Granularity, 105, 112
N-gram, 170-172
Of integer sequences, 11
Of musical themes, 11
Of this book, 103, 507ff
Within inverted file entry, see skipping
Index compression, 114-129, 198-201, 235-237
Batched, 125, 128
Bernoulli, 119-122, 128, 150, 247, 421
Context-sensitive, 125-126
Global, 115-121
Hyperbolic model, 123-124, 150
In MG, 421-423
Interpolative coding, 126-128
Local, 115, 121-122, 247
Nonparameterized, 115-119
Observed frequency, 121, 124-125, 128, 247
Parameterized, 115
Performance of, 128-129, 421
Skewed Bernoulli, 122-123, 138, 150
Within-document frequencies, 198-201
Index Construction, 223-261 (see also inversion)
bitmaps, 255-256

Information retrieval
Algorithm
(Permute query to fit index)
Search index
Go to resource
(Permute query to fit item)
(Search for item)

What to index?
Book indices have key words and phrases
Search engines index (all) words
Why the disparity?
What do people really search for?

Trading precision for size
Can save up to 32% without too much loss:
Stemming
Usually just word inflection
Information → Inform (which also matches Informal, Informed)
Case folding
N.B.: keep odd variants (e.g., NeXT, LaTeX)
Stop words
Don’t index common words; people won’t search on them anyway
Pop Quiz: Which of these techniques are more effective?
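The three techniques above can be sketched as follows. This is a toy illustration: the stop list and suffix rules are invented, not a real stemmer (a production system would use something like the Porter stemmer, and would keep odd-case variants such as NeXT rather than folding everything).

```python
# Toy normalization pipeline: case folding, stop words, naive stemming.
STOP_WORDS = {"the", "a", "an", "of", "and", "or", "to", "in"}
SUFFIXES = ("ing", "ed", "s")  # crude inflection stripping

def normalize(token):
    token = token.lower()              # case folding (note: this also
                                       # folds variants like NeXT, LaTeX)
    if token in STOP_WORDS:            # stop-word removal
        return None
    for suf in SUFFIXES:               # naive suffix stemming
        if token.endswith(suf) and len(token) > len(suf) + 2:
            return token[:-len(suf)]
    return token

tokens = "The dogs walked in the park".split()
terms = [n for n in (normalize(t) for t in tokens) if n is not None]
# terms == ["dog", "walk", "park"]
```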

Indexing output
Output = L_W, D_D, I_{W×D} (lexicon over the W words, the D documents, and a W×D index)
Inverted File (Index)
Postings (e.g., wt → (d1, fwt,d1), (d2, fwt,d2), …, (dn, fwt,dn))
Variable length records
Lexicon:
String wt
Document frequency ft
Address within inverted file It
Sorted, fixed length records
[Example: a W×D term–document matrix with rows W1…Wn (words) and columns D1…Dm (documents); each nonzero cell holds the frequency fwt,d.]
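A minimal sketch of building the two output structures, lexicon and postings, from a toy collection (the document texts and IDs below are invented for illustration):

```python
from collections import defaultdict

# Toy collection: doc id -> text
docs = {1: "cold cold hot", 2: "hot mild", 3: "cold mild mild"}

# Postings: variable-length records  w_t -> [(d, f_wt,d), ...]
postings = defaultdict(list)
for d in sorted(docs):
    counts = defaultdict(int)
    for w in docs[d].split():
        counts[w] += 1
    for w in sorted(counts):
        postings[w].append((d, counts[w]))

# Lexicon: fixed-length records (term -> document frequency f_t);
# on disk each record would also store the inverted-file address I_t.
lexicon = {t: len(plist) for t, plist in postings.items()}
```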

Trading precision for size, redux
Pop Quiz: Which of these techniques are more effective?
Typical:
Lexicon: 30 MB; Inverted file: 400 MB
Stemming
Affects Lexicon
Case folding
Affects Lexicon
Stop words
Affects Inverted File

Is fine-grained indexing worthwhile?
Problem: still have to scan document to find the term.
Cons:
Need access methods to take advantage
Extra storage space overhead (variable sized)
Alternative methods:
Hierarchical encoding (doc #, para #, sent #, word #) to shrink offset size
Split long documents into n shorter ones.

Inverted file compression
Clue: Encode gap length instead of offset
Use small number of bits to encode more common gap lengths
(e.g., Huffman encoding)
Better: Use a distribution of expected gap length (e.g., Bernoulli process)
If p = prob. that any word x appears in doc y, then
P(gap size = z) = (1 − p)^(z−1) · p.  This is a geometric distribution.
Works for intra and inter-document index compression
_________________________________
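A sketch of the gap idea: document numbers in a postings list become d-gaps, and each gap is coded with Elias gamma as a stand-in for a distribution-tuned code (the doc-ID list below is invented):

```python
def to_gaps(doc_ids):
    # Ascending doc numbers -> d-gaps (first gap measured from 0)
    gaps, prev = [], 0
    for d in doc_ids:
        gaps.append(d - prev)
        prev = d
    return gaps

def gamma_encode(n):
    # Elias gamma: (len(bin(n)) - 1) zeros, then bin(n).
    # Small (common) gaps get short codes, as with Huffman coding.
    b = bin(n)[2:]
    return "0" * (len(b) - 1) + b

doc_ids = [3, 5, 20, 21, 23, 76]
gaps = to_gaps(doc_ids)                       # [3, 2, 15, 1, 2, 53]
bits = "".join(gamma_encode(g) for g in gaps)
```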

Building the index – Memory based inversion
Takes lots of main memory, ugh!
Can we reduce the memory requirement?

Sort-based inversion
Idea: turn the random (memory) accesses into sequential disk accesses
// Phase I – collection of term appearances on disk
For each document Dd in collection, 1 ≤ d ≤ N
Read Dd, parsing it into index terms
For each index term t in Dd
Calculate fd,t
Dump to file a tuple (t,d,fd,t)
// Phase II – sort tuples
Sort all the tuples (t,d,f) using External Mergesort
// Phase III – write output file
Read the tuples in sorted order and create inverted file
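The three phases can be sketched in a few lines; the toy documents are chosen to reproduce the tuples in the example that follows. (A real system would use an external mergesort on disk in Phase II, as the pseudocode notes; here the sort is in memory.)

```python
from collections import Counter

# Toy collection: doc id -> text
docs = {1: "a a b b c", 2: "a a d b", 3: "b d"}

# Phase I: collect (t, d, f_dt) tuples
tuples = []
for d in sorted(docs):
    for t, f in sorted(Counter(docs[d].split()).items()):
        tuples.append((t, d, f))

# Phase II: sort the tuples (external mergesort in practice)
tuples.sort()

# Phase III: read tuples in sorted order to build the inverted file
inverted = {}
for t, d, f in tuples:
    inverted.setdefault(t, []).append((d, f))
```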

Sort based inversion: example
<a,1,2>
<b,1,2>
<c,1,1>
<a,2,2>
<d,2,1>
<b,2,1>
<b,3,1>
<d,3,1>

Using a first pass for the lexicon
Gets us ft and N
Savings: For any t, we know ft, so can use an array vs. a linked list (shrinks record by 40%!)

Lexicon-based inversion
Partition inversion as |I|/|M| = k smaller problems
build 1/k of inverted index on each pass
(e.g., a-b, b-c, …, y-z)
Tuned to fit amount of main memory in machine
Just remember boundary words
Can pair with disk strategy
Create k temporary files and write tuples (t,d,fd,t) for each partition on first pass
Each second pass builds index from temporary file

Inversion – Summary of Techniques
How do these techniques stack up?
Assume a 5 GB corpus and 40 MB main memory machine
Technique               Memory (MB)  Disk (GB)  Time (hours)
*Linked lists (memory)        4,000          0             6
Linked lists (disk)              30          4         1,100
Sort-based                       40          8            20
Lexicon-based                    40          0            79
Lexicon w/ disk                  40          4            12

Query Matching
Now that we have an index, how do we answer queries?

Query Matching
Assuming a simple word matching engine:
For each query term t
Stem t
Search lexicon
Record ft and its inverted entry address, It
Select a query term t
Set list of candidates, C = It
For each remaining term t
Read its It
For each d in C, if d not in It set C = C – {d}
X and Y and Z – high precision
X or Y or Z – high recall
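The conjunctive (AND) matching steps above can be sketched as follows, processing terms rarest-first so the candidate set C starts small (the postings below are invented):

```python
def and_match(query_terms, inverted):
    # Seed candidates from the rarest term, then intersect the rest
    terms = sorted(query_terms, key=lambda t: len(inverted.get(t, [])))
    if not terms:
        return set()
    candidates = {d for d, f in inverted.get(terms[0], [])}
    for t in terms[1:]:
        candidates &= {d for d, f in inverted.get(t, [])}
    return candidates

# Invented postings: term -> [(doc, f_dt), ...]
inverted = {"x": [(1, 2), (2, 1), (4, 1)],
            "y": [(2, 3), (4, 2)],
            "z": [(2, 1), (3, 1), (4, 5)]}
```

For example, `and_match(["x", "y", "z"], inverted)` intersects down to the documents containing all three terms.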

Boolean Model
Query processing strategy:
______________________
Even in ORs, as merging takes longer than lookup
Problems with Boolean model:
_______________________________
Longer documents tend to match more often because they have a larger vocabulary
Need ranked retrieval to help out

Deciding ranking
Boolean assigns same importance to all terms in a query
F4 concert dates in Singapore
Problem: “F4” has same weight as “date”
One way:
Assign weights to the words, make more important words worth more
Process results in q and d vectors: (word, weight), (word, weight) … (word, weight)

Term Frequency
Xxxxxxxxxxxxxx IBM xxxxxxxxxxx xxxxxxxx xxxxxxxxxxx IBM xxxxxxx xxxxxxxxxx xxxxxxxx Apple.  Xxxxxxxxxx xxxxxxxxxx IBM xxxxxxxx.  Xxxxxxxxxx xxxxxxxx Compaq.  Xxxxxxxxx xxxxxxx IBM.
(Relative) term frequency can indicate importance.
rd,t = fd,t
rd,t = 1 + ln fd,t
rd,t = K + (1 − K) × fd,t / maxt fd,t

Inverse Document Frequency
Consider a future device for individual use, which is a sort of mechanized private file and library. It needs a name, and, to coin one at random, "memex" will do.
Words with higher ft are less discriminative.
Use inverse to measure importance:
wt = 1/ft
wt = ln (1 + N/ft) ← this one is most common
wt = ln (1 + fm/ft), where fm is the max observed frequency

This is TF*IDF
Many variants, but all capture:
Term frequency:
Rd,t as _____________________
Inverse Document Frequency:
Wt as ________________________
Standard formulation is:
wd,t = rd,t × wt
= (1+ ln(fd,t)) × ln (1 + N/ft)
Problem:
rd,t grows as document grows, need to normalize; otherwise biased towards _______________
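The standard formulation above can be computed directly. The collection statistics below (N, ft, fd,t) are invented for illustration:

```python
import math

# Invented statistics for one term t in one document d
N = 1000      # documents in the collection
f_t = 20      # documents containing t  (document frequency)
f_dt = 3      # occurrences of t in d   (within-document frequency)

r_dt = 1 + math.log(f_dt)       # term-frequency component
w_t = math.log(1 + N / f_t)     # inverse-document-frequency component
w_dt = r_dt * w_t               # combined TF*IDF weight
```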

Calculating Similarity
Euclidean Distance - bad
M(Q,Dd) = sqrt( Σt (wq,t − wd,t)² )
Dissimilarity Measure; use reciprocal
Has problem with long documents, why?
Actually don’t care about vector length, just their direction
Want to measure difference in direction

Cosine Similarity
If X and Y are two n-dimensional vectors:
X · Y = |X| |Y| cos θ
cos θ = X · Y / |X| |Y|
= 1 when identical
= 0 when orthogonal
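A minimal cosine implementation over dense weight vectors (in practice the dot product is computed sparsely over postings, as the accumulator method that follows shows):

```python
import math

def cosine(x, y):
    # cos(theta) = (X . Y) / (|X| |Y|)
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0
```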

Calculating the ranked list
To get the ranked list, we use doc. accumulators:
For each query term t, in order of increasing ft,
Read its inverted file entry It
Update acc. for each doc in It: Ad += ln(1 + fd,t) × wt
For each Ad in A
Ad /= Wd  // ≈ cos θ; the query weights wq,t are omitted
Report top r of A
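A sketch of the accumulator loop above on a toy index; the collection size N, the postings, and the precomputed document norms Wd are all invented:

```python
import math

N = 3
inverted = {"a": [(1, 2), (2, 2)], "b": [(1, 2), (2, 1), (3, 1)]}
doc_norms = {1: 2.0, 2: 1.8, 3: 1.0}   # precomputed W_d values

A = {}
# Process query terms in order of increasing f_t (rarest first)
for t in sorted(inverted, key=lambda t: len(inverted[t])):
    w_t = math.log(1 + N / len(inverted[t]))        # IDF weight
    for d, f_dt in inverted[t]:
        A[d] = A.get(d, 0.0) + math.log(1 + f_dt) * w_t
for d in A:
    A[d] /= doc_norms[d]    # normalize; approximately cos(theta)
```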

Accumulator Storage
Holding all possible accumulators is expensive
Could need one for each document if query is broad
In practice, use fixed |A| wrt main memory.  What to do when all used?
Quit: _____________________
Continue ____________________

Selecting r entries from accumulators
Want to return documents with largest cos values.
How? Use a ____________
Load r A values into the _____
Process remaining A-r values
If Ad > min{H} then
Delete min{H}, add Ad, and sift
// H now contains the top r exact cosine values
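The min{H} and sift operations above suggest a size-r min-heap; a sketch using Python's heapq (the accumulator values are invented):

```python
import heapq

def top_r(accumulators, r):
    heap = []                       # min-heap H of (score, doc), size <= r
    for d, score in accumulators.items():
        if len(heap) < r:
            heapq.heappush(heap, (score, d))
        elif score > heap[0][0]:    # beats min{H}: replace and sift
            heapq.heapreplace(heap, (score, d))
    return sorted(heap, reverse=True)   # best first

A = {1: 0.9, 2: 0.3, 3: 0.7, 4: 0.5}   # invented accumulator values
```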

To think about
How do you deal with a dynamic collection?
How do you support phrasal searching?
What about wildcard searching?
What types of wildcard searching are common?