CS 5244: Orientation - Practical IR

10 Aug 2004

Is fine-grained indexing worthwhile?

• Problem: with only a document-level index, we still have to scan the document to find the term.

• Cons:

  ◦ Need access methods that can take advantage of the stored positions

  ◦ Extra storage overhead (postings become variable-sized)

• Alternative methods:

  ◦ Hierarchical encoding (doc #, para #, sent #, word #) to shrink offset size

  ◦ Split long documents into n shorter ones
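The two alternatives above can be illustrated with a minimal Python sketch. It is not taken from the lecture: the function names (`bits_needed`, `hierarchical_addresses`, `relocate`) and the nested-list document layout are assumptions made for illustration. The idea shown is that both methods replace one large offset with smaller numbers, which need fewer bits per field.

```python
def bits_needed(max_value: int) -> int:
    """Bits for a fixed-width field that must hold values 0..max_value."""
    return max(1, max_value.bit_length())


def hierarchical_addresses(doc):
    """Yield (flat_offset, (para, sent, word)) for every word.

    `doc` is assumed to be a list of paragraphs, each a list of
    sentences, each a list of words (a hypothetical layout).
    """
    flat = 0
    for p, para in enumerate(doc):
        for s, sent in enumerate(para):
            for w, _word in enumerate(sent):
                yield flat, (p, s, w)
                flat += 1


def relocate(position: int, chunk_size: int):
    """Map an offset in a long document to (chunk number, offset in chunk),
    as if the document had been split into pieces of `chunk_size` words."""
    return divmod(position, chunk_size)


# Toy document: 2 paragraphs, 3 sentences, 6 words in total.
toy_doc = [[["a", "b"], ["c"]], [["d", "e", "f"]]]
addresses = list(hierarchical_addresses(toy_doc))
# The last word has flat offset 5 and hierarchical address (1, 0, 2).

# Splitting into 1000-word chunks keeps every in-chunk offset below 1000,
# so it fits in 10 bits instead of the 12 bits needed for offset 3920.
chunk, offset = relocate(3920, 1000)
```

Whether the hierarchical form actually saves space depends on how each component is coded; the sketch only shows that the individual numbers become smaller.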

Image (D1, 2), (D4, 1)

Implicit (D2, 1), (D3, 1) …

Index (D5, 3), (D2, 1) …

Inverse (D2, 2)

Internet (D1, 2), (D3, 2) …

Image (D1, 2; 10, 205), (D4, 1; 3993)

Implicit (D2, 1; 242), (D3, 1; 233) …

Index (D5, 3; 20, 42, 3920), (D2, 1 …

Inverse (D2, 2; 599, 847)

Internet (D1, 2; 12, 43), (D3, 2; 302, …
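The second listing stores, for each term, the document, the frequency, and the positions of each occurrence. A minimal sketch of how such a word-level index might be built follows. It assumes whitespace tokenization and word offsets (the slide does not say whether its offsets count words or bytes), and the sample documents and function names are hypothetical.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map term -> {doc_id: [word positions]} for a dict of doc_id -> text."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split()):
            index[token].setdefault(doc_id, []).append(pos)
    return index


def postings(index, term):
    """Return postings as (doc_id, frequency, positions), sorted by doc_id."""
    return [
        (doc_id, len(positions), positions)
        for doc_id, positions in sorted(index.get(term, {}).items())
    ]


# Hypothetical toy collection, not the documents behind the slide's figures.
sample_docs = {
    "D1": "image retrieval on the internet uses image features",
    "D2": "an inverse index is an index",
}
sample_index = build_positional_index(sample_docs)
image_postings = postings(sample_index, "image")
```

With the positions stored in the posting, a lookup can jump straight to each occurrence instead of scanning the document, at the cost of the extra storage noted above.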