11 Oct 2005
CS 5244 - Computational Document Analysis
Conclusion
• Signature-based methods are common; design-based methods assume domain knowledge.
  - The importance of granularity and ordering varies between domains.
• Difficult to scale up
  - Most work does only pairwise comparison.
  - Low-complexity clustering may help as a first pass.
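One way to picture a cheap first pass before pairwise comparison is sketched below. It is only an illustration under stated assumptions, not a method from the lecture or the cited papers: it swaps in min-hash bucketing over word shingles as the "low-complexity clustering" step, and every name in it (`shingles`, `salted_hash`, `candidate_pairs`, `near_duplicates`, the parameters `k`, `num_hashes`, `threshold`) is hypothetical. Documents are grouped in one linear pass, and the expensive pairwise score is computed only for documents that share a bucket.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles in text (k is the granularity knob)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def salted_hash(salt, shingle):
    """Deterministic hash; md5 is used for run-to-run stability, not security."""
    data = f"{salt}:{shingle}".encode("utf-8")
    return int(hashlib.md5(data).hexdigest(), 16)


def candidate_pairs(docs, k=3, num_hashes=4):
    """First pass (assumed technique: min-hash bucketing).

    Each document is placed in one bucket per salt, keyed by the smallest
    salted hash among its shingles. Documents with high shingle overlap are
    likely to share at least one bucket. Cost is linear in the number of
    documents, unlike all-pairs comparison.
    """
    buckets = defaultdict(list)
    for doc_id, text in docs.items():
        doc_shingles = shingles(text, k)
        for salt in range(num_hashes):
            key = (salt, min(salted_hash(salt, s) for s in doc_shingles))
            buckets[key].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def jaccard(a, b):
    """Jaccard similarity of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def near_duplicates(docs, k=3, threshold=0.5, num_hashes=4):
    """Second pass: score only the candidate pairs from the first pass."""
    results = []
    for a, b in sorted(candidate_pairs(docs, k, num_hashes)):
        score = jaccard(shingles(docs[a], k), shingles(docs[b], k))
        if score >= threshold:
            results.append((a, b, score))
    return results
```

Calling `near_duplicates` on a dictionary mapping document IDs to text returns `(id_a, id_b, score)` tuples for pairs that pass the threshold. The first pass can miss pairs with low overlap, so `num_hashes` trades recall against the number of pairwise comparisons; the right shingle size `k` depends on the domain, in line with the point about granularity above.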
• References
  - Belkouche et al. (2004). Plagiarism Detection in Software Designs. ACM Southeast Conference.
  - Shivakumar & Garcia-Molina (1995). SCAM: A copy detection mechanism for digital documents. Proc. DL 95.
  - Bilenko & Mooney (2003). Adaptive duplicate detection using learnable string similarity measures. Proc. KDD 03.
  - Khmelev & Teahan (2003). A repetition based measure for verification of text collections and for text categorization. Proc. SIGIR 03.
  - Ramaswamy et al. (2004). Automatic detection of fragments in dynamically generated web pages. Proc. WWW 04.