Fundamentals of
Information Retrieval
Module 3                      Min-Yen KAN
Implementation of
(Textual) Information Retrieval

What is information retrieval?
Part of the information seeking process
Matches a query with most relevant documents
View a query as a mini-document

Searching in books
___
___
___
Procedure:
Look up topic
Find the page
Skim page to find topic
Index, 11, 103-151, 443
Audio, 476
Comparison of methods 143-145
Granularity, 105, 112
N-gram, 170-172
Of integer sequences, 11
Of musical themes, 11
Of this book, 103, 507ff
Within inverted file entry, see skipping
Index compression, 114-129, 198-201, 235-237
Batched, 125, 128
Bernoulli, 119-122, 128, 150, 247, 421
Context-sensitive, 125-126
Global, 115-121
Hyperbolic model, 123-124, 150
In MG, 421-423
Interpolative coding, 126-128
Local, 115, 121-122, 247
Nonparameterized, 115-119
Observed frequency, 121, 124-125, 128, 247
Parameterized, 115
Performance of, 128-129, 421
Skewed Bernoulli, 122-123, 138, 150
Within-document frequencies, 198-201
Index Construction, 223-261 (see also inversion)
bitmaps, 255-256

Information retrieval
Algorithm
(Permute query to fit index)
Search index
Go to resource
(Permute query to fit item)
(Search for item)

What to index?
Book indices have key words and phrases
Search engines index (all) words
Why the disparity?
What do people really search for?

Trading precision for size
Can save up to 32% without too much loss:
Stemming
Usually just word inflection
Information → Inform (which also matches Informal, Informed)
Case folding
N.B.: keep odd variants (e.g., NeXT, LaTeX)
Stop words
Don’t index common words; people won’t search on them anyway
Pop Quiz: Which of these techniques are more effective?
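The three techniques above can be sketched as follows. This is a toy illustration: the stop list and suffix rules are invented, not a real stemmer (a production system would use something like the Porter stemmer, and would keep odd-case variants such as NeXT rather than folding everything).

```python
# Toy normalization pipeline: case folding, stop words, naive stemming.
STOP_WORDS = {"the", "a", "an", "of", "and", "or", "to", "in"}
SUFFIXES = ("ing", "ed", "s")  # crude inflection stripping

def normalize(token):
    token = token.lower()              # case folding (note: this also
                                       # folds variants like NeXT, LaTeX)
    if token in STOP_WORDS:            # stop-word removal
        return None
    for suf in SUFFIXES:               # naive suffix stemming
        if token.endswith(suf) and len(token) > len(suf) + 2:
            return token[:-len(suf)]
    return token

tokens = "The dogs walked in the park".split()
terms = [n for n in (normalize(t) for t in tokens) if n is not None]
# terms == ["dog", "walk", "park"]
```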

Indexing output
Output = L_W, D_D, I_{W×D} (lexicon over the W words, the D documents, and a W×D index)
Inverted File (Index)
Postings (e.g., wt → (d1, fwt,d1), (d2, fwt,d2), …, (dn, fwt,dn))
Variable length records
Lexicon:
String wt
Document frequency ft
Address within inverted file It
Sorted, fixed length records
[Example: a W×D term–document matrix with rows W1…Wn (words) and columns D1…Dm (documents); each nonzero cell holds the frequency fwt,d.]
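A minimal sketch of building the two output structures, lexicon and postings, from a toy collection (the document texts and IDs below are invented for illustration):

```python
from collections import defaultdict

# Toy collection: doc id -> text
docs = {1: "cold cold hot", 2: "hot mild", 3: "cold mild mild"}

# Postings: variable-length records  w_t -> [(d, f_wt,d), ...]
postings = defaultdict(list)
for d in sorted(docs):
    counts = defaultdict(int)
    for w in docs[d].split():
        counts[w] += 1
    for w in sorted(counts):
        postings[w].append((d, counts[w]))

# Lexicon: fixed-length records (term -> document frequency f_t);
# on disk each record would also store the inverted-file address I_t.
lexicon = {t: len(plist) for t, plist in postings.items()}
```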

Trading precision for size, redux
Pop Quiz: Which of these techniques are more effective?
Typical:
Lexicon: 30 MB; Inverted file: 400 MB
Stemming
Affects Lexicon
Case folding
Affects Lexicon
Stop words
Affects Inverted File

Is fine-grained indexing worthwhile?
Problem: still have to scan document to find the term.
Cons:
Need access methods to take advantage
Extra storage space overhead (variable sized)
Alternative methods:
Hierarchical encoding (doc #, para #, sent #, word #) to shrink offset size
Split long documents into n shorter ones.

Inverted file compression
Clue: Encode gap length instead of offset
Use small number of bits to encode more common gap lengths
(e.g., Huffman encoding)
Better: Use a distribution of expected gap length (e.g., Bernoulli process)
If p = prob. that any word x appears in doc y, then
P(gap size = z) = (1 − p)^(z−1) · p.  This is a geometric distribution.
Works for intra and inter-document index compression
_________________________________
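A sketch of the gap idea: document numbers in a postings list become d-gaps, and each gap is coded with Elias gamma as a stand-in for a distribution-tuned code (the doc-ID list below is invented):

```python
def to_gaps(doc_ids):
    # Ascending doc numbers -> d-gaps (first gap measured from 0)
    gaps, prev = [], 0
    for d in doc_ids:
        gaps.append(d - prev)
        prev = d
    return gaps

def gamma_encode(n):
    # Elias gamma: (len(bin(n)) - 1) zeros, then bin(n).
    # Small (common) gaps get short codes, as with Huffman coding.
    b = bin(n)[2:]
    return "0" * (len(b) - 1) + b

doc_ids = [3, 5, 20, 21, 23, 76]
gaps = to_gaps(doc_ids)                       # [3, 2, 15, 1, 2, 53]
bits = "".join(gamma_encode(g) for g in gaps)
```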

Building the index – Memory based inversion
Takes lots of main memory, ugh!
Can we reduce the memory requirement?

Sort-based inversion
Idea: turn the random (memory) accesses into sequential disk accesses
// Phase I – collection of term appearances on disk
For each document Dd in collection, 1 ≤ d ≤ N
Read Dd, parsing it into index terms
For each index term t in Dd
Calculate fd,t
Dump to file a tuple (t,d,fd,t)
// Phase II – sort tuples
Sort all the tuples (t,d,f) using External Mergesort
// Phase III – write output file
Read the tuples in sorted order and create inverted file
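The three phases can be sketched in a few lines; the toy documents are chosen to reproduce the tuples in the example that follows. (A real system would use an external mergesort on disk in Phase II, as the pseudocode notes; here the sort is in memory.)

```python
from collections import Counter

# Toy collection: doc id -> text
docs = {1: "a a b b c", 2: "a a d b", 3: "b d"}

# Phase I: collect (t, d, f_dt) tuples
tuples = []
for d in sorted(docs):
    for t, f in sorted(Counter(docs[d].split()).items()):
        tuples.append((t, d, f))

# Phase II: sort the tuples (external mergesort in practice)
tuples.sort()

# Phase III: read tuples in sorted order to build the inverted file
inverted = {}
for t, d, f in tuples:
    inverted.setdefault(t, []).append((d, f))
```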

Sort based inversion: example
<a,1,2>
<b,1,2>
<c,1,1>
<a,2,2>
<d,2,1>
<b,2,1>
<b,3,1>
<d,3,1>

Using a first pass for the lexicon
Gets us ft and N
Savings: For any t, we know ft, so can use an array vs. a linked list (shrinks record by 40%!)

Lexicon-based inversion
Partition inversion as |I|/|M| = k smaller problems
build 1/k of inverted index on each pass
(e.g., a-b, b-c, …, y-z)
Tuned to fit amount of main memory in machine
Just remember boundary words
Can pair with disk strategy
Create k temporary files and write tuples (t,d,fd,t) for each partition on first pass
Each second pass builds index from temporary file

Inversion – Summary of Techniques
How do these techniques stack up?
Assume a 5 GB corpus and 40 MB main memory machine
Technique               Memory (MB)  Disk (GB)  Time (hours)
*Linked lists (memory)        4,000          0             6
Linked lists (disk)              30          4         1,100
Sort-based                       40          8            20
Lexicon-based                    40          0            79
Lexicon w/ disk                  40          4            12

Query Matching
Now that we have an index, how do we answer queries?

Query Matching
Assuming a simple word matching engine:
For each query term t
Stem t
Search lexicon
Record ft and its inverted entry address, It
Select a query term t
Set list of candidates, C = It
For each remaining term t
Read its It
For each d in C, if d not in It set C = C – {d}
X and Y and Z – high precision
X or Y or Z – high recall
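The conjunctive (AND) matching steps above can be sketched as follows, processing terms rarest-first so the candidate set C starts small (the postings below are invented):

```python
def and_match(query_terms, inverted):
    # Seed candidates from the rarest term, then intersect the rest
    terms = sorted(query_terms, key=lambda t: len(inverted.get(t, [])))
    if not terms:
        return set()
    candidates = {d for d, f in inverted.get(terms[0], [])}
    for t in terms[1:]:
        candidates &= {d for d, f in inverted.get(t, [])}
    return candidates

# Invented postings: term -> [(doc, f_dt), ...]
inverted = {"x": [(1, 2), (2, 1), (4, 1)],
            "y": [(2, 3), (4, 2)],
            "z": [(2, 1), (3, 1), (4, 5)]}
```

For example, `and_match(["x", "y", "z"], inverted)` intersects down to the documents containing all three terms.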

Boolean Model
Query processing strategy:
______________________
Even in ORs, as merging takes longer than lookup
Problems with Boolean model:
_______________________________
Longer documents tend to match more often because they have a larger vocabulary
Need ranked retrieval to help out

Deciding ranking
Boolean assigns same importance to all terms in a query
F4 concert dates in Singapore
Problem: “F4” has same weight as “date”
One way:
Assign weights to the words, make more important words worth more
Process results in q and d vectors: (word, weight), (word, weight) … (word, weight)

Term Frequency
Xxxxxxxxxxxxxx IBM xxxxxxxxxxx xxxxxxxx xxxxxxxxxxx IBM xxxxxxx xxxxxxxxxx xxxxxxxx Apple.  Xxxxxxxxxx xxxxxxxxxx IBM xxxxxxxx.  Xxxxxxxxxx xxxxxxxx Compaq.  Xxxxxxxxx xxxxxxx IBM.
(Relative) term frequency can indicate importance.
rd,t = fd,t
rd,t = 1 + ln fd,t
rd,t = K + (1 − K) × fd,t / maxt fd,t

Inverse Document Frequency
Consider a future device for individual use, which is a sort of mechanized private file and library. It needs a name, and, to coin one at random, "memex" will do.
Words with higher ft are less discriminative.
Use inverse to measure importance:
wt = 1/ft
wt = ln (1 + N/ft) ← this one is most common
wt = ln (1 + fm/ft), where fm is the max observed frequency

This is TF*IDF
Many variants, but all capture:
Term frequency:
Rd,t as _____________________
Inverse Document Frequency:
Wt as ________________________
Standard formulation is:
wd,t = rd,t × wt
= (1+ ln(fd,t)) × ln (1 + N/ft)
Problem:
rd,t grows as document grows, need to normalize; otherwise biased towards _______________
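The standard formulation above can be computed directly. The collection statistics below (N, ft, fd,t) are invented for illustration:

```python
import math

# Invented statistics for one term t in one document d
N = 1000      # documents in the collection
f_t = 20      # documents containing t  (document frequency)
f_dt = 3      # occurrences of t in d   (within-document frequency)

r_dt = 1 + math.log(f_dt)       # term-frequency component
w_t = math.log(1 + N / f_t)     # inverse-document-frequency component
w_dt = r_dt * w_t               # combined TF*IDF weight
```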

Calculating Similarity
Euclidean Distance - bad
M(Q,Dd) = sqrt( Σt (wq,t − wd,t)² )
Dissimilarity Measure; use reciprocal
Has problem with long documents, why?
Actually don’t care about vector length, just their direction
Want to measure difference in direction

Cosine Similarity
If X and Y are two n-dimensional vectors:
X · Y = |X| |Y| cos θ
cos θ = X · Y / |X| |Y|
= 1 when identical
= 0 when orthogonal
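A minimal cosine implementation over dense weight vectors (in practice the dot product is computed sparsely over postings, as the accumulator method that follows shows):

```python
import math

def cosine(x, y):
    # cos(theta) = (X . Y) / (|X| |Y|)
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0
```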

Calculating the ranked list
To get the ranked list, we use doc. accumulators:
For each query term t, in order of increasing ft,
Read its inverted file entry It
Update acc. for each doc in It: Ad += ln(1 + fd,t) × wt
For each Ad in A
Ad /= Wd  // ≈ cos θ; the query weights wq,t are omitted
Report top r of A
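A sketch of the accumulator loop above on a toy index; the collection size N, the postings, and the precomputed document norms Wd are all invented:

```python
import math

N = 3
inverted = {"a": [(1, 2), (2, 2)], "b": [(1, 2), (2, 1), (3, 1)]}
doc_norms = {1: 2.0, 2: 1.8, 3: 1.0}   # precomputed W_d values

A = {}
# Process query terms in order of increasing f_t (rarest first)
for t in sorted(inverted, key=lambda t: len(inverted[t])):
    w_t = math.log(1 + N / len(inverted[t]))        # IDF weight
    for d, f_dt in inverted[t]:
        A[d] = A.get(d, 0.0) + math.log(1 + f_dt) * w_t
for d in A:
    A[d] /= doc_norms[d]    # normalize; approximately cos(theta)
```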

Accumulator Storage
Holding all possible accumulators is expensive
Could need one for each document if query is broad
In practice, use fixed |A| wrt main memory.  What to do when all used?
Quit: _____________________
Continue ____________________

Selecting r entries from accumulators
Want to return documents with largest cos values.
How? Use a ____________
Load r A values into the _____
Process remaining A-r values
If Ad > min{H} then
Delete min{H}, add Ad, and sift
// H now contains the top r exact cosine values
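The min{H} and sift operations above suggest a size-r min-heap; a sketch using Python's heapq (the accumulator values are invented):

```python
import heapq

def top_r(accumulators, r):
    heap = []                       # min-heap H of (score, doc), size <= r
    for d, score in accumulators.items():
        if len(heap) < r:
            heapq.heappush(heap, (score, d))
        elif score > heap[0][0]:    # beats min{H}: replace and sift
            heapq.heapreplace(heap, (score, d))
    return sorted(heap, reverse=True)   # best first

A = {1: 0.9, 2: 0.3, 3: 0.7, 4: 0.5}   # invented accumulator values
```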

To think about
How do you deal with a dynamic collection?
How do you support phrasal searching?
What about wildcard searching?
What types of wildcard searching are common?