Computational Analysis of Genre, Authorship and Duplication

11 Oct 2005

CS 5244 - Computational Document Analysis

Distance calculations

Calculate distance between p1, p2

- VSM: L1 distance, $\sum_f |P_{f,1} - P_{f,2}|$

- VSM: L2 (Euclidean) distance, $\left(\sum_f |P_{f,1} - P_{f,2}|^2\right)^{1/2}$

- Weighted feature combinations

- For text features, can use edit distance

  - Calculate using dynamic programming
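
The distance measures listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each document is assumed to be a dict mapping a feature name to its value, the edit distance is assumed to be the unit-cost Levenshtein variant, and all function names are hypothetical rather than taken from the lecture.

```python
import math


def l1_distance(p1, p2):
    """L1 distance: sum over features f of |P_f1 - P_f2|.

    Assumption: p1 and p2 are dicts of feature -> value; a missing
    feature counts as 0.
    """
    features = set(p1) | set(p2)
    return sum(abs(p1.get(f, 0.0) - p2.get(f, 0.0)) for f in features)


def l2_distance(p1, p2):
    """L2 (Euclidean) distance: square root of the summed squared differences."""
    features = set(p1) | set(p2)
    return math.sqrt(sum((p1.get(f, 0.0) - p2.get(f, 0.0)) ** 2 for f in features))


def combined_distance(distances, weights):
    """Weighted feature combination: a weighted sum of per-feature distances.

    Assumption: the weights are given; the slides do not fix a scheme.
    """
    return sum(weights[name] * d for name, d in distances.items())


def edit_distance(a, b):
    """Edit distance via dynamic programming (unit-cost Levenshtein, assumed).

    Only the previous row of the DP table is kept, so memory is O(len(b)).
    """
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if ca == cb else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]
```

For example, `edit_distance("kitten", "sitting")` needs two substitutions and one insertion. The row-by-row table is a design choice for brevity; keeping the full table would also allow the alignment itself to be recovered.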

Detect and flag copies

- Flag the top n% most similar pairs as possible plagiarism

- Or use a tuned similarity threshold

- Alternatively, tune on a supervised (labeled) set,
  learning weights for the features (Bilenko and Mooney)
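
The first two flagging strategies above can be sketched as follows. The code assumes pairwise distances have already been computed and stored in a dict keyed by document pair, so "most similar" means "smallest distance" and the similarity threshold becomes a distance cutoff. The function names and the input layout are hypothetical, and the cutoff value would still have to be tuned on real data.

```python
import math


def flag_top_percent(pair_distances, n_percent):
    """Flag the n_percent of pairs with the smallest distance.

    Assumption: pair_distances maps (doc_a, doc_b) -> distance.
    """
    ranked = sorted(pair_distances, key=pair_distances.get)
    k = math.ceil(len(ranked) * n_percent / 100)
    return ranked[:k]


def flag_below_threshold(pair_distances, threshold):
    """Flag every pair whose distance is at or below a tuned cutoff."""
    return [pair for pair, d in pair_distances.items() if d <= threshold]
```

Note the difference in behaviour: the top-n% rule always flags a fixed share of pairs, even when no copying occurred, while the threshold rule can flag none or all of them.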

What are some problems with these approaches?