10 Aug 2004
CS 5244: Orientation
Inverted file compression
• Clue: Encode gap length instead of offset
• Use a small number of bits to encode the more common gap lengths
  - (e.g., Huffman encoding)
• Better: Use a distribution of expected gap length (e.g., Bernoulli process)
  - If p = probability that any word x appears in doc y,
  - then $P(\text{gap size} = z) = (1-p)^z \, p$, which is a geometric distribution.
• Works for intra- and inter-document index compression
  - Why does it hold for documents as well as words?
[Figure: Occurrences in the Bible of "Bridegroom", "Twelfth", and "Jezebel"]
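
The gap idea can be sketched in a few lines of Python. This is an illustration only, not the course's reference implementation. The function names (`to_gaps`, `from_gaps`, `gap_probability`, `ideal_bits`) are hypothetical. It assumes document numbers start at 1 and that z in the formula counts the documents skipped between two occurrences, so z = gap - 1. The ideal code length, -log2 of the probability, shows why common (short) gaps should get few bits.

```python
import math


def to_gaps(postings):
    """Turn a sorted list of document numbers into gaps between neighbours."""
    gaps = []
    previous = 0  # assumption: document numbers start at 1
    for doc_id in postings:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps


def from_gaps(gaps):
    """Rebuild the original document numbers by summing the gaps."""
    postings = []
    total = 0
    for gap in gaps:
        total += gap
        postings.append(total)
    return postings


def gap_probability(z, p):
    """Geometric model: z documents without the word, then one with it."""
    return (1 - p) ** z * p


def ideal_bits(z, p):
    """Bits an ideal code would spend on a gap that skips z documents."""
    return -math.log2(gap_probability(z, p))


if __name__ == "__main__":
    postings = [3, 8, 9, 15]
    gaps = to_gaps(postings)
    print(gaps)  # [3, 5, 1, 6]
    for gap in gaps:
        z = gap - 1  # assumption: z counts skipped documents
        print(gap, round(ideal_bits(z, p=0.25), 2))
```

Under this model, short gaps are the most probable and so receive the shortest codes. A smaller p (a rarer word) shifts probability toward longer gaps.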