11 Oct 2005
CS 5244 - Computational Document Analysis
Subset problem
• Problem: if one document is just a subset of another, the standard vector space (VS) model may report low similarity
  ◦ Example: cosine(D1, D2) = 0.61 (with binary term weights: 3 / (√3 · √8))
    D1: <A, B, C>
    D2: <A, B, C, D, E, F, G, H>
• Shivakumar and Garcia-Molina (95): use only close words in the VSM
  ◦ Close = comparable frequency in both documents, defined by a tunable distance ε
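
The example and the close-word idea can be checked with a small sketch. Everything named here is an illustrative assumption rather than the authors' implementation: the helper names `cosine` and `close_words`, the value `eps=2.5`, and in particular the closeness test `eps - (a/b + b/a) > 0`, which is one plausible way to express "comparable frequency". The slide only states that closeness is governed by a tunable ε.

```python
import math
from collections import Counter


def cosine(f1, f2, vocab=None):
    """Cosine similarity of two term-frequency maps.

    If vocab is given, both the dot product and the norms are
    restricted to those words.
    """
    if vocab is None:
        vocab = set(f1) | set(f2)
    dot = sum(f1.get(w, 0) * f2.get(w, 0) for w in vocab)
    n1 = math.sqrt(sum(f1.get(w, 0) ** 2 for w in vocab))
    n2 = math.sqrt(sum(f2.get(w, 0) ** 2 for w in vocab))
    return dot / (n1 * n2) if n1 and n2 else 0.0


def close_words(f1, f2, eps=2.5):
    """Words whose frequencies in the two documents are comparable.

    Assumed test (hypothetical): eps - (a/b + b/a) > 0.
    a/b + b/a is at least 2, with equality only when a == b, so eps
    must exceed 2. Words missing from either document are never close.
    """
    close = set()
    for w in set(f1) & set(f2):
        a, b = f1[w], f2[w]
        if a > 0 and b > 0 and eps - (a / b + b / a) > 0:
            close.add(w)
    return close


# The two documents from the slide, as term-frequency maps.
d1 = Counter(["A", "B", "C"])
d2 = Counter(["A", "B", "C", "D", "E", "F", "G", "H"])

plain = cosine(d1, d2)                            # standard VS model
restricted = cosine(d1, d2, close_words(d1, d2))  # close words only

print(round(plain, 2))       # 0.61
print(round(restricted, 2))  # 1.0
```

In this sketch the words D to H never enter the comparison because they are absent from D1, so the subset document scores as a full match. A larger `eps` admits words with more unequal frequencies.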