27 Aug 2003
CS 6210: Module 3
10/31
Inverted file compression
- Clue: encode the gap length between successive occurrences instead of each absolute offset
- Use a small number of bits to encode the more common gap lengths
  - (e.g., Huffman encoding)
- Better: model the distribution of expected gap lengths (e.g., as a Bernoulli process)
  - If p = probability that any word x appears in doc y, then
  - $P(\text{gap size} = z) = (1-p)^z \, p$, which is a geometric distribution.
- Works for both intra- and inter-document index compression
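The gap idea and the geometric model above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the lecture's own implementation: the function names (`to_gaps`, `from_gaps`, `gap_probability`, `ideal_bits`) are hypothetical, document ids are assumed to be positive integers in sorted order, and z is assumed to count the documents skipped between two occurrences (so z = id difference - 1), which is what makes the formula a valid distribution starting at z = 0.

```python
import math


def to_gaps(doc_ids):
    """Turn a sorted list of document ids into differences between successive ids."""
    gaps = []
    prev = 0  # assumption: ids start at 1, so the first gap is the first id itself
    for d in doc_ids:
        gaps.append(d - prev)
        prev = d
    return gaps


def from_gaps(gaps):
    """Invert to_gaps: a running sum of the gaps recovers the original ids."""
    ids = []
    total = 0
    for g in gaps:
        total += g
        ids.append(total)
    return ids


def gap_probability(z, p):
    """Geometric model: P(gap size = z) = (1 - p)**z * p.

    z is assumed to be the number of documents skipped (z >= 0),
    and p the probability that the word appears in any given document.
    """
    return (1 - p) ** z * p


def ideal_bits(z, p):
    """Information content, in bits, of a gap of size z under the geometric model."""
    return -math.log2(gap_probability(z, p))


if __name__ == "__main__":
    postings = [3, 8, 9, 15, 40]   # made-up posting list for illustration
    gaps = to_gaps(postings)
    print(gaps)
    print(from_gaps(gaps) == postings)
    # Short gaps are the likely ones, so they deserve the short codes.
    for z in (0, 2, 10):
        print(z, round(ideal_bits(z, 0.5), 2))
```

Under this model the ideal code length grows linearly with z, which is why short codes go to short gaps. A practical coder in that spirit (Golomb coding is one commonly used choice, named here as an aside rather than something the slide states) approximates these lengths with whole bits.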
[Figure: occurrences in the Bible of the words "Bridegroom", "Twelfth", and "Jezebel"]