• Signature-based methods are common; design-based methods assume domain knowledge.
  ◦ The importance of granularity and ordering changes between domains
• Difficult to scale up
  ◦ Most work performs only pairwise comparison
  ◦ Low-complexity clustering may help as a first pass
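
As a rough illustration of the first-pass idea above, the sketch below groups documents by a single cheap key so that the expensive pairwise comparison only runs inside each group. The slides do not name a clustering method; the one-hash, MinHash-style key used here is a swapped-in technique, and the function names (`shingles`, `bucket_key`, `first_pass_clusters`, `candidate_pairs`) and the shingle size `k=3` are assumptions made for illustration.

```python
import zlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (lowercased)."""
    words = text.lower().split()
    if len(words) <= k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def bucket_key(text, k=3):
    """Cheap cluster key (assumed scheme): smallest CRC32 over the shingles."""
    hashes = [zlib.crc32(s.encode("utf-8")) for s in shingles(text, k)]
    return min(hashes) if hashes else None


def first_pass_clusters(docs, k=3):
    """Group document ids by bucket key in one linear pass over the corpus."""
    clusters = defaultdict(list)
    for doc_id, text in docs.items():
        clusters[bucket_key(text, k)].append(doc_id)
    return clusters


def candidate_pairs(docs, k=3):
    """Return only the pairs that share a cluster, for detailed comparison later."""
    pairs = set()
    for ids in first_pass_clusters(docs, k).values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs
```

Identical documents always share a key, and documents with heavily overlapping shingle sets are likely to, so a detailed pairwise measure can be restricted to the output of `candidate_pairs` instead of all document pairs. A single key will miss some near-duplicates; using several independent keys per document would trade extra work for better recall.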
References
• Belkouche et al. (04) Plagiarism Detection in Software Designs, ACM Southeast Conference.
• Shivakumar & Garcia-Molina (95) SCAM: A copy detection mechanism for digital documents, Proc. of DL 95.
• Bilenko and Mooney (03) Adaptive duplicate detection using learnable string similarity measures, Proc. of KDD 03.
• Khmelev and Teahan (03) A repetition based measure for verification of text collections and for text categorization, Proc. SIGIR 03.
• Ramaswamy et al. (04) Automatic detection of fragments in dynamically generated web pages, Proc. WWW 04.