¡Clue: Encode gap length instead of offset
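A minimal sketch of the idea (helper names `to_gaps`/`from_gaps` are illustrative, not from the slides): a sorted postings list of document IDs is stored as its first ID plus the successive differences, which are typically much smaller numbers than the absolute offsets.

```python
def to_gaps(postings):
    """Store the first docID, then each successive difference (gap)."""
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def from_gaps(gaps):
    """Rebuild absolute docIDs by cumulative summation of the gaps."""
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out

postings = [33, 47, 154, 159, 202]
gaps = to_gaps(postings)           # [33, 14, 107, 5, 43]
assert from_gaps(gaps) == postings  # lossless round trip
```

The transform itself saves nothing until paired with a variable-length code, but small gaps dominate for frequent words, which is exactly what the codes below exploit.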
¡Use a small number of bits to encode the more common gap lengths
l(e.g., Huffman encoding)
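As a sketch of that point (an illustrative toy, not the slides' method), a Huffman code built over observed gap frequencies assigns the shortest bit strings to the most common gaps:

```python
import heapq
from collections import Counter

def huffman_codes(freqs):
    """Build a Huffman code: frequent symbols get shorter bit strings."""
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)  # unique key so dicts are never compared
    while len(heap) > 1:
        fa, _, a = heapq.heappop(heap)
        fb, _, b = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in a.items()}
        merged.update({s: "1" + c for s, c in b.items()})
        heapq.heappush(heap, (fa + fb, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

gaps = [1, 1, 1, 1, 2, 2, 3, 7]        # small gaps dominate
codes = huffman_codes(Counter(gaps))
assert len(codes[1]) < len(codes[7])   # common gap 1 gets a shorter code
```

Huffman needs the empirical gap frequencies up front; the distributional approach below replaces that table with a single parameter.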
¡Better: Use a distribution of expected gap length (e.g.,
Bernoulli process)
lIf p = prob that any word x appears in doc y,
lthen P(gap size = z) = (1-p)^z · p. This is a geometric distribution.
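A quick numerical check of the formula (p = 0.1 is an arbitrary illustration): the probabilities P(gap = z) = (1-p)^z · p sum to 1, and the smallest gap is the most likely, which is why codes biased toward small gaps (Golomb codes, for instance, are optimal for geometric distributions) compress well.

```python
p = 0.1                                  # assumed per-doc occurrence prob
probs = [(1 - p) ** z * p for z in range(1000)]

assert abs(sum(probs) - 1.0) < 1e-3      # probabilities (nearly) sum to 1
assert probs[0] == max(probs)            # the smallest gap is most likely
assert all(a > b for a, b in zip(probs, probs[1:]))  # monotone decay
```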
¡Works for intra- and inter-document index compression
lWhy does it hold for documents as well as words?