Subset problem
○ Problem: If a document is just a
subset of another document, the standard VS
model may show low similarity
  ● Example: cosine(D1, D2) = 0.61 (binary term weights)
    D1: <A, B, C>,
    D2: <A, B, C, D, E, F, G, H>
○ Shivakumar and Garcia-Molina (1995): use
only "close" words when computing similarity in the VSM
  ● Close = comparable frequency in both documents, defined by a
    tunable ε distance.
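
The contrast can be checked numerically. The sketch below first reproduces the 0.61 cosine for D1 and D2 using raw term counts, then computes an overlap score restricted to close words. The closeness test (f1/f2 + f2/f1 < ε), the one-sided normalization, the default ε = 2.5, and all function names are assumptions made for illustration, not necessarily the exact formulation of Shivakumar and Garcia-Molina.

```python
import math
from collections import Counter


def cosine(doc1, doc2):
    """Standard VSM cosine similarity over raw term counts."""
    v1, v2 = Counter(doc1), Counter(doc2)
    dot = sum(v1[t] * v2[t] for t in v1.keys() & v2.keys())
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


def close_words(f1, f2, eps):
    """Shared terms with comparable counts (assumed test: f1/f2 + f2/f1 < eps).

    The ratio sum equals 2 when the counts are identical, so eps must exceed 2.
    """
    return {
        t for t in f1.keys() & f2.keys()
        if f1[t] / f2[t] + f2[t] / f1[t] < eps
    }


def subset_measure(f1, f2, eps):
    """Overlap on close words, normalized by the first document only."""
    denom = sum(c * c for c in f1.values())
    if denom == 0:
        return 0.0
    return sum(f1[t] * f2[t] for t in close_words(f1, f2, eps)) / denom


def overlap_similarity(doc1, doc2, eps=2.5):
    """Larger of the two one-sided measures, capped at 1 (hypothetical helper)."""
    f1, f2 = Counter(doc1), Counter(doc2)
    score = max(subset_measure(f1, f2, eps), subset_measure(f2, f1, eps))
    return min(1.0, score)


d1 = ["A", "B", "C"]
d2 = ["A", "B", "C", "D", "E", "F", "G", "H"]

print(round(cosine(d1, d2), 2))    # 0.61
print(overlap_similarity(d1, d2))  # 1.0
```

Because each one-sided measure is normalized by a single document's length, the extra terms D through H in D2 no longer pull the score down, and D1 is flagged as fully contained in D2.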