¡Clue: Encode gap length instead of offset
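A minimal sketch of the idea (helper names `to_gaps`/`from_gaps` are illustrative, not from the slides): a sorted postings list of document IDs is stored as its first ID plus the successive differences, which are typically much smaller numbers than the absolute offsets.

```python
def to_gaps(postings):
    """Store the first docID, then each successive difference (gap)."""
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def from_gaps(gaps):
    """Rebuild absolute docIDs by cumulative summation of the gaps."""
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out

postings = [33, 47, 154, 159, 202]
gaps = to_gaps(postings)           # [33, 14, 107, 5, 43]
assert from_gaps(gaps) == postings  # lossless round trip
```

The transform itself saves nothing until paired with a variable-length code, but small gaps dominate for frequent words, which is exactly what the codes below exploit.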
¡Use a small number of bits to encode the more common gap lengths
l(e.g., Huffman encoding)
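As a sketch of that point (an illustrative toy, not the slides' method), a Huffman code built over observed gap frequencies assigns the shortest bit strings to the most common gaps:

```python
import heapq
from collections import Counter

def huffman_codes(freqs):
    """Build a Huffman code: frequent symbols get shorter bit strings."""
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)  # unique key so dicts are never compared
    while len(heap) > 1:
        fa, _, a = heapq.heappop(heap)
        fb, _, b = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in a.items()}
        merged.update({s: "1" + c for s, c in b.items()})
        heapq.heappush(heap, (fa + fb, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

gaps = [1, 1, 1, 1, 2, 2, 3, 7]        # small gaps dominate
codes = huffman_codes(Counter(gaps))
assert len(codes[1]) < len(codes[7])   # common gap 1 gets a shorter code
```

Huffman needs the empirical gap frequencies up front; the distributional approach below replaces that table with a single parameter.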
¡Better: Use a distribution of expected gap length (e.g.,
Bernoulli process)
lIf p = prob that any word x appears in doc y,
lthen P(gap size = z) = (1-p)^z · p. This is a geometric distribution.
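A quick numerical check of the formula (p = 0.1 is an arbitrary illustration): the probabilities P(gap = z) = (1-p)^z · p sum to 1, and the smallest gap is the most likely, which is why codes biased toward small gaps (Golomb codes, for instance, are optimal for geometric distributions) compress well.

```python
p = 0.1                                  # assumed per-doc occurrence prob
probs = [(1 - p) ** z * p for z in range(1000)]

assert abs(sum(probs) - 1.0) < 1e-3      # probabilities (nearly) sum to 1
assert probs[0] == max(probs)            # the smallest gap is most likely
assert all(a > b for a, b in zip(probs, probs[1:]))  # monotone decay
```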
¡Works for intra- and inter-document index compression
lWhy does it hold for documents as well as words?