• Calculate distance between p1, p2
• VSM: L1 distance $\sum_f |P_{f1} - P_{f2}|$
• VSM: L2 Euclidean distance $\left(\sum_f |P_{f1} - P_{f2}|^2\right)^{1/2}$
• Weighted feature combinations
• For text features, can use edit distance
  - Calculate using dynamic programming
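The distance measures listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each program p1, p2 is assumed to be already reduced to an equal-length list of numeric feature values, the function names are hypothetical, the weighted variant applies weights to the L1 form (the slide does not say which form is weighted), and the edit distance assumes unit cost for insert, delete, and substitute.

```python
from math import sqrt


def l1_distance(p1, p2):
    """L1 distance: sum over features of |P_f1 - P_f2|."""
    return sum(abs(a - b) for a, b in zip(p1, p2))


def l2_distance(p1, p2):
    """L2 (Euclidean) distance: square root of the summed squared differences."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p1, p2)))


def weighted_l1_distance(p1, p2, weights):
    """L1 distance with one weight per feature (assumed form of weighting)."""
    return sum(w * abs(a - b) for w, a, b in zip(weights, p1, p2))


def edit_distance(s, t):
    """Edit (Levenshtein) distance via dynamic programming, unit costs assumed.

    Only the previous row of the DP table is kept, so memory is O(len(t)).
    """
    prev = list(range(len(t) + 1))  # distance from "" to each prefix of t
    for i, cs in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]
```

For example, `l2_distance([0, 0], [3, 4])` gives `5.0`, and `edit_distance("kitten", "sitting")` gives `3`.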
• Detect and flag copies
• Assume top n% as possible plagiarisms
• Use a tuned similarity threshold
• Other way: do tuning on supervised set (learn weights for features: Bilenko and Mooney)
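The first two flagging strategies above might look like the following sketch. It assumes pairwise distances have already been computed and stored in a dict keyed by pair (a hypothetical layout), and that a smaller distance means a more similar pair, so the similarity threshold is expressed here as a maximum distance. The function names are illustrative, and the supervised weight-learning option is not shown.

```python
import math


def flag_top_percent(pair_distances, n_percent):
    """Flag the n% of pairs with the smallest distance as possible copies."""
    ranked = sorted(pair_distances, key=pair_distances.get)  # closest first
    k = math.ceil(len(ranked) * n_percent / 100)
    return ranked[:k]


def flag_below_threshold(pair_distances, max_distance):
    """Flag every pair whose distance is at or below a tuned threshold."""
    return [pair for pair, d in pair_distances.items() if d <= max_distance]


# Hypothetical distances between four submissions a, b, c, d.
distances = {
    ("a", "b"): 0.1,
    ("a", "c"): 0.9,
    ("b", "c"): 0.4,
    ("a", "d"): 0.7,
}
suspects_top = flag_top_percent(distances, 50)
suspects_thr = flag_below_threshold(distances, 0.4)
```

The top-n% rule always flags some pairs, even when nobody copied, while the threshold rule can return an empty list but depends on how well the threshold was tuned.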
What are some problems with these approaches?