Fundamentals of Information Retrieval
Module 3, Min-Yen KAN

Implementation of (Textual) Information Retrieval
What is information retrieval?

Part of the information seeking process.
Matches a query with the most relevant documents.
A query can itself be viewed as a mini-document.

Searching in books

Procedure:
Look up the topic in the index
Find the page
Skim the page to find the topic
…

An excerpt from a book index:

Index, 11, 103-151, 443
  Audio, 476
  Comparison of methods, 143-145
  Granularity, 105, 112
  N-gram, 170-172
  Of integer sequences, 11
  Of musical themes, 11
  Of this book, 103, 507ff
  Within inverted file entry, see skipping
Index compression, 114-129, 198-201, 235-237
  Batched, 125, 128
  Bernoulli, 119-122, 128, 150, 247, 421
  Context-sensitive, 125-126
  Global, 115-121
  Hyperbolic model, 123-124, 150
  In MG, 421-423
  Interpolative coding, 126-128
  Local, 115, 121-122, 247
  Nonparameterized, 115-119
  Observed frequency, 121, 124-125, 128, 247
  Parameterized, 115
  Performance of, 128-129, 421
  Skewed Bernoulli, 122-123, 138, 150
  Within-document frequencies, 198-201
Index construction, 223-261 (see also inversion)
  Bitmaps, 255-256
…
Information retrieval

Algorithm:
(Permute the query to fit the index)
Search the index
Go to the resource
(Permute the query to fit the item)
(Search for the item)
What to index?

Book indices have key words and phrases.
Search engines index (all) words.

Why the disparity?
What do people really search for?

Trading precision for size

Can save up to 32% of index size without too much loss:

Stemming
Usually just strips word inflection
Information → Inform = Informal, Informed

Case folding
N.B.: keep odd variants (e.g., NeXT, LaTeX)

Stop words
Don't index common words; people won't search on them anyway.

Pop Quiz: Which of these techniques is most effective?
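
A toy sketch of these three steps in Python. The suffix stripping is a crude stand-in for a real stemmer (e.g., Porter's), and the stop word list is a tiny illustrative subset, not the one any particular engine uses:

    # Toy normalization pipeline: case folding, stop word removal,
    # and a crude inflectional "stemmer" (illustrative only).
    STOP_WORDS = {"the", "a", "an", "of", "and", "or", "to", "in"}  # tiny subset
    KEEP_VARIANTS = {"NeXT", "LaTeX"}  # odd casings worth preserving

    def stem(word):
        # Strip a few common inflectional suffixes (very rough).
        for suffix in ("ing", "ed", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)]
        return word

    def normalize(tokens):
        out = []
        for tok in tokens:
            folded = tok if tok in KEEP_VARIANTS else tok.lower()
            if folded in STOP_WORDS:
                continue  # don't index common words
            out.append(stem(folded))
        return out

    print(normalize("The NeXT machines are shipping".split()))
    # -> ['NeXT', 'machine', 'are', 'shipp']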
Indexing output

Output = L_W, D_D, I_{W×D}
(a lexicon over W words, D documents, and a W×D index matrix)

Inverted file (index):
Postings, e.g., w_t → (d_1, f_{w_t,d_1}), (d_2, f_{w_t,d_2}), …, (d_n, f_{w_t,d_n})
Variable-length records

Lexicon:
String w_t
Document frequency f_t
Address within the inverted file, I_t
Sorted, fixed-length records

Example W × D matrix of within-document frequencies (blank = 0):

        D1  D2  D3  D4  D5  D6  …  Dm
  W1     1       1
  W2         2           1
  W3             1
  W4     1           1
  W5         1               1
  W6     1       1       1
  …
  Wn

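A minimal in-memory rendering of this output in Python. Doc ids index into the collection, and the lexicon's disk address I_t is omitted since everything here lives in one dict:

    from collections import Counter, defaultdict

    def build_index(docs):
        """docs: list of token lists; doc ids are 1-based positions."""
        postings = defaultdict(list)   # w_t -> [(d, f_{w_t,d}), ...]
        for d, tokens in enumerate(docs, start=1):
            for t, f in sorted(Counter(tokens).items()):
                postings[t].append((d, f))
        # Lexicon: term -> document frequency f_t (the address I_t
        # would be a disk offset in a real implementation).
        lexicon = {t: len(pl) for t, pl in postings.items()}
        return lexicon, postings

    lex, post = build_index([["a", "a", "b", "b", "c"], ["a", "a", "d", "b"]])
    print(lex["b"], post["b"])   # -> 2 [(1, 2), (2, 1)]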
Trading precision for size, redux

Pop Quiz: Which of these techniques is most effective?

Typical sizes:
Lexicon = 30 MB; Inverted file = 400 MB

Stemming
Affects the lexicon

Case folding
Affects the lexicon

Stop words
Affect the inverted file
Is fine-grained indexing worthwhile?

Problem: with document-level postings, we still have to scan the document to find the term. Finer-grained postings (e.g., word-level offsets) avoid the scan, but:

Cons:
Need access methods to take advantage of them
Extra storage space overhead (variable-sized entries)

Alternative methods:
Hierarchical encoding (doc #, para #, sent #, word #) to shrink offset size
Split long documents into n shorter ones.
Inverted file compression

Clue: encode gap lengths instead of offsets.
Use a small number of bits to encode the more common gap lengths (e.g., Huffman coding).

Better: use a distribution of expected gap lengths (e.g., a Bernoulli process).
If p = the probability that a word x appears in a document y, then
P(gap size = z) = (1 - p)^(z-1) p, a geometric distribution.

Works for both intra- and inter-document index compression.
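
A minimal sketch of gap encoding in Python, paired with Elias gamma codes as one concrete "few bits for small, common gaps" scheme; the slide does not prescribe gamma codes specifically, so that part is an illustrative choice:

    def to_gaps(postings):
        """Convert ascending doc numbers into gaps (first gap is from 0)."""
        prev, gaps = 0, []
        for d in postings:
            gaps.append(d - prev)
            prev = d
        return gaps

    def elias_gamma(n):
        """Encode n >= 1: unary length prefix, then n in binary."""
        bits = bin(n)[2:]
        return "0" * (len(bits) - 1) + bits

    postings = [3, 5, 20, 21, 23, 76]
    gaps = to_gaps(postings)            # [3, 2, 15, 1, 2, 53]
    code = "".join(elias_gamma(g) for g in gaps)
    print(gaps, len(code), "bits")      # small gaps cost few bits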
Building the index – Memory-based inversion

Simplest approach: build the postings lists (e.g., as linked lists) entirely in main memory, then write them out.
Takes lots of main memory, ugh!
Can we reduce the memory requirement?
Sort-based inversion

Idea: turn random access to disk (memory) into sequential access.

// Phase I: collect term appearances on disk
For each document D_d in the collection, 1 ≤ d ≤ N
    Read D_d, parsing it into index terms
    For each index term t in D_d
        Calculate f_{d,t}
        Dump a tuple (t, d, f_{d,t}) to a file

// Phase II: sort tuples
Sort all the tuples (t, d, f) using external mergesort

// Phase III: write the output file
Read the tuples in sorted order and create the inverted file
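
A condensed sketch of the three phases in Python; the in-memory sorted list stands in for Phase II's external mergesort, which in a real system runs over tuples on disk:

    from collections import Counter

    def sort_based_inversion(docs):
        # Phase I: collect (t, d, f_{d,t}) tuples (on disk, in a real system).
        tuples = []
        for d, tokens in enumerate(docs, start=1):
            for t, f in Counter(tokens).items():
                tuples.append((t, d, f))
        # Phase II: sort; a real implementation uses external mergesort
        # because the tuples do not fit in memory.
        tuples.sort()
        # Phase III: read in sorted order and emit the inverted file.
        inverted = {}
        for t, d, f in tuples:
            inverted.setdefault(t, []).append((d, f))
        return inverted

    print(sort_based_inversion([["a", "a", "b"], ["b", "c"]]))
    # -> {'a': [(1, 2)], 'b': [(1, 1), (2, 1)], 'c': [(2, 1)]}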
Sort-based inversion: example

Tuples dumped in Phase I, one run per document:

<a,1,2> <b,1,2> <c,1,1>
<a,2,2> <d,2,1> <b,2,1>
<b,3,1> <d,3,1>

After the Phase II sort:
<a,1,2> <a,2,2> <b,1,2> <b,2,1> <b,3,1> <c,1,1> <d,2,1> <d,3,1>
Using a first pass for the lexicon

A first pass over the collection gets us f_t for each term and N.
Savings: for any t, we then know the length of its inverted list up front, so we can use an array instead of a linked list (shrinks each record by 40%!).
Lexicon-based inversion

Partition the inversion into |I|/|M| = k smaller problems,
building 1/k of the inverted index on each pass
(e.g., a-b, b-c, …, y-z)
Tuned to fit the amount of main memory in the machine
Just remember the boundary words

Can pair with a disk strategy:
Create k temporary files, writing each tuple (t, d, f_{d,t}) to its partition's file on the first pass
Each second pass builds its part of the index from one temporary file

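A minimal sketch of this two-pass, partitioned scheme in Python; the boundary words and the use of lists in place of temporary files are illustrative assumptions:

    from collections import Counter

    # Hypothetical partition boundaries ("just remember boundary words").
    BOUNDARIES = ["g", "n", "t", None]   # a-f, g-m, n-s, t-z

    def partition_of(term):
        for k, upper in enumerate(BOUNDARIES):
            if upper is None or term < upper:
                return k

    def lexicon_based_inversion(docs):
        # First pass: route each (t, d, f_{d,t}) tuple to its partition's
        # temporary file (a plain list here).
        temp = [[] for _ in BOUNDARIES]
        for d, tokens in enumerate(docs, start=1):
            for t, f in Counter(tokens).items():
                temp[partition_of(t)].append((t, d, f))
        # Second passes: build 1/k of the index at a time, in memory.
        inverted = {}
        for part in temp:
            for t, d, f in sorted(part):
                inverted.setdefault(t, []).append((d, f))
        return inverted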
Inversion – Summary of Techniques

How do these techniques stack up?
Assume a 5 GB corpus and a machine with 40 MB of main memory.

Technique                Memory (MB)   Disk (GB)   Time (hours)
*Linked lists (memory)          4000           0              6
Linked lists (disk)               30           4           1100
Sort-based                        40           8             20
Lexicon-based                     40           0             79
Lexicon w/ disk                   40           4             12
Query Matching

Now that we have an index, how do we answer queries?
Query Matching

Assuming a simple word-matching engine:

For each query term t
    Stem t
    Search the lexicon
    Record f_t and its inverted entry address, I_t
Select a query term t
Set the list of candidates C = I_t
For each remaining term t
    Read its I_t
    For each d in C, if d is not in I_t, set C = C - {d}

X and Y and Z – high precision
X or Y or Z – high recall

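A sketch of this conjunctive (AND) loop in Python over set-valued postings; seeding C with the rarest term is an assumption about which term the pseudocode "selects":

    def and_match(query_terms, postings, stem):
        """postings: term -> set of doc ids; stem: normalizer function."""
        terms = [stem(t) for t in query_terms]
        # Start from the rarest term so the candidate set stays small.
        terms.sort(key=lambda t: len(postings.get(t, ())))
        candidates = set(postings.get(terms[0], ()))
        for t in terms[1:]:
            candidates &= postings.get(t, set())  # keep only docs also in I_t
            if not candidates:
                break
        return candidates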
Boolean Model

Query processing strategy:
______________________
Even for ORs, as merging takes longer than lookup

Problems with the Boolean model:
_______________________________
Longer documents tend to match more often because they have a larger vocabulary
Need ranked retrieval to help out
Deciding ranking

Boolean matching assigns the same importance to all terms in a query.

"F4 concert dates in Singapore"
Problem: "F4" gets the same weight as "date"

One way out:
Assign weights to the words, making more important words worth more.
Processing then turns q and d into vectors: (word, weight), (word, weight), …, (word, weight)
Term Frequency

Xxxxxxxxxxxxxx IBM xxxxxxxxxxx xxxxxxxx xxxxxxxxxxx IBM xxxxxxx xxxxxxxxxx xxxxxxxx Apple. Xxxxxxxxxx xxxxxxxxxx IBM xxxxxxxx. Xxxxxxxxxx xxxxxxxx Compaq. Xxxxxxxxx xxxxxxx IBM.

(Relative) term frequency can indicate importance:

r_{d,t} = f_{d,t}
r_{d,t} = 1 + ln f_{d,t}
r_{d,t} = K + (1 - K) f_{d,t} / f_{d,max}, where f_{d,max} is the largest term count in d
Inverse Document Frequency

Consider a future device for individual use, which is a sort of mechanized private file and library. It needs a name, and, to coin one at random, "memex" will do.

Words with higher f_t are less discriminative.
Use the inverse to measure importance:

w_t = 1 / f_t
w_t = ln(1 + N/f_t)   ← this one is the most common
w_t = ln(1 + f_m/f_t), where f_m is the maximum observed frequency
This is TF*IDF

Many variants, but all capture:

Term frequency: r_{d,t} as _____________________

Inverse document frequency: w_t as ________________________

The standard formulation is:
w_{d,t} = r_{d,t} × w_t = (1 + ln f_{d,t}) × ln(1 + N/f_t)

Problem:
r_{d,t} grows as the document grows and needs to be normalized; otherwise we are biased towards _______________
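
The standard formulation, transcribed directly into Python; N, f_t, and f_{d,t} are the collection size, document frequency, and within-document frequency defined on the earlier slides:

    import math

    def tfidf(f_dt, f_t, N):
        """w_{d,t} = (1 + ln f_{d,t}) * ln(1 + N/f_t); zero if t is absent."""
        if f_dt == 0 or f_t == 0:
            return 0.0
        r_dt = 1.0 + math.log(f_dt)        # term frequency component
        w_t = math.log(1.0 + N / f_t)      # inverse document frequency
        return r_dt * w_t

    # A term appearing 3 times in d, in 10 of 1,000,000 documents:
    print(tfidf(3, 10, 1_000_000))   # rare term -> large weight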
Calculating Similarity

Euclidean distance – bad
M(Q, D_d) = sqrt(Σ_t (w_{q,t} - w_{d,t})^2)
A dissimilarity measure; use its reciprocal
Has a problem with long documents. Why?

We actually don't care about vector length, just direction.
We want to measure the difference in direction.
Cosine Similarity

If X and Y are two n-dimensional vectors:
X · Y = |X| |Y| cos θ
cos θ = X · Y / (|X| |Y|)

cos θ = 1 when identical in direction
cos θ = 0 when orthogonal
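
A small Python sketch of cosine similarity over (term → weight) dictionaries, which is one natural way to store the q and d vectors from the ranking slide:

    import math

    def cosine(x, y):
        """Cosine of the angle between two sparse term-weight vectors."""
        dot = sum(w * y.get(t, 0.0) for t, w in x.items())
        nx = math.sqrt(sum(w * w for w in x.values()))
        ny = math.sqrt(sum(w * w for w in y.values()))
        if nx == 0.0 or ny == 0.0:
            return 0.0
        return dot / (nx * ny)

    print(cosine({"f4": 2.0, "concert": 1.0}, {"concert": 3.0}))  # ~0.447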
Calculating the ranked list

To get the ranked list, we use document accumulators:

For each query term t, in order of increasing f_t
    Read its inverted file entry, I_t
    Update the accumulator for each doc in I_t: A_d += ln(1 + f_{d,t}) × w_t
For each A_d in A
    A_d /= W_d   // that's basically cos θ; don't use w_q
Report the top r of A

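A sketch of this accumulator loop in Python; the precomputed document vector lengths W_d and the tf and idf forms from the earlier slides are assumed, and query-term weights are ignored as the slide notes:

    import math
    from collections import defaultdict

    def rank(query_terms, postings, f_t, W, N, r):
        """postings: t -> [(d, f_dt), ...]; f_t: doc freqs; W: doc lengths."""
        A = defaultdict(float)                        # document accumulators
        terms = [t for t in query_terms if t in f_t]
        for t in sorted(terms, key=lambda t: f_t[t]):  # increasing f_t
            w_t = math.log(1.0 + N / f_t[t])
            for d, f_dt in postings.get(t, []):
                A[d] += math.log(1.0 + f_dt) * w_t
        for d in A:
            A[d] /= W[d]                              # basically cos theta
        return sorted(A.items(), key=lambda kv: -kv[1])[:r]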
Accumulator Storage

Holding all possible accumulators is expensive.
We could need one per document if the query is broad.

In practice, use a fixed |A| sized to main memory. What to do when all are used?
Quit: _____________________
Continue: ____________________
Selecting r entries from accumulators

We want to return the documents with the largest cos values.

How? Use a ____________
Load r A values into the _____
Process the remaining |A| - r values:
    If A_d > min{H} then
        Delete min{H}, add A_d, and sift
// H now contains the top r exact cosine values

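Concretely, a structure that supports min{H} and sift is a min-heap; a sketch with Python's heapq:

    import heapq

    def top_r(accumulators, r):
        """Return the r largest (score, doc) pairs from an iterable."""
        H = []                                    # min-heap of at most r entries
        for d, score in accumulators:
            if len(H) < r:
                heapq.heappush(H, (score, d))     # load the first r values
            elif score > H[0][0]:                 # A_d > min{H}?
                heapq.heapreplace(H, (score, d))  # delete min, add, sift
        return sorted(H, reverse=True)            # H holds the top r

    print(top_r([(1, 0.2), (2, 0.9), (3, 0.5), (4, 0.7)], 2))
    # -> [(0.9, 2), (0.7, 4)]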
To think about

How do you deal with a dynamic collection?
How do you support phrasal searching?
What about wildcard searching?
What types of wildcard searching are common?