1	Digital Libraries Computational Literary Analysis, Duplicate and Plagiarism Detection Week 9 Min-Yen KAN
2	Outline Literary Analysis Authorship detection Genre classification Duplicate Detection Web pages Plagiarism Detection In text In programs
3	The Federalist papers A series of 85 papers written by Jay, Hamilton and Madison Intended to help persuade voters to ratify the US constitution
4	Disputed papers of the Federalist Most of the papers have attribution, but the authorship of 12 papers is disputed: either Hamilton or Madison Want to determine who wrote these papers Also known as textual forensics
5	Wordprint and Stylistics Claim: Authors leave a unique wordprint in the documents which they author Claim: Authors also exhibit certain stylistic patterns in their publications
6	Feature Selection Content-specific features (Foster 90) key words, special characters Style markers Word- or character-based features length of words, vocabulary richness Function words (Mosteller & Wallace 64) Structural features Email: Title or signature, paragraph separators (de Vel et al. 01) Can generalize to HTML tags To think about: artifact of authoring software?
7	Bayes' theorem on function words Mosteller & Wallace (M & W) examined the frequency of 100 function words Used Bayes' theorem and linear regression to find weights to fit the observed data Sample words: as, at, be, do, down, even, has, have, her, is, it, its, no, not, now, or, our, shall, than, that, the, this, to, up
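The idea of applying Bayes' theorem to function-word counts can be sketched as below. This is a minimal illustration, not M & W's actual model: it swaps in a multinomial naive Bayes with add-one smoothing, uses only a subset of the sample words from the slide, and the function names (`train`, `attribute`) are hypothetical.

```python
import math
import re
from collections import Counter

# Subset of the sample function words on the slide (assumption: any fixed list works).
FUNCTION_WORDS = ["as", "at", "be", "do", "has", "have", "is", "it", "no", "not",
                  "or", "than", "that", "the", "this", "to", "up"]

def function_word_counts(text):
    """Count how often each function word occurs in the lowercased text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t in FUNCTION_WORDS)

def train(texts):
    """Estimate add-one smoothed log-probabilities of function words for one author."""
    total = Counter()
    for text in texts:
        total.update(function_word_counts(text))
    denom = sum(total.values()) + len(FUNCTION_WORDS)
    return {w: math.log((total[w] + 1) / denom) for w in FUNCTION_WORDS}

def attribute(text, models, priors=None):
    """Return (best_author, scores), where scores are unnormalized log posteriors."""
    authors = list(models)
    if priors is None:
        priors = {a: 1.0 / len(authors) for a in authors}  # uniform prior by default
    counts = function_word_counts(text)
    scores = {
        a: math.log(priors[a]) + sum(n * models[a][w] for w, n in counts.items())
        for a in authors
    }
    return max(scores, key=scores.get), scores

# Toy usage with made-up training text for two hypothetical authors.
models = {
    "A": train(["to the to the to the up"]),
    "B": train(["it is not it is not as it"]),
}
best, scores = attribute("it is not to be", models)
```

Because function words are frequent and largely topic-independent, even this crude model separates the two toy authors; a real study would need far more text per author.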
8	A Funeral Elegy and Primary Colors “Give anonymous offenders enough verbal rope and column inches, and they will hang themselves for you, every time” – Donald Foster in Author Unknown A Funeral Elegy: Foster attributed this poem to W.S. Initially rejected, but identified his anonymous reviewer Foster also attributed Primary Colors to Newsweek columnist Joe Klein Analyzes text mainly by hand
9	Foster’s features Very large feature space, look for distinguishing features: Topic words Punctuation Misused common words Irregular spelling and grammar Some specific features (most compound): Adverbs ending with “y”: talky Parenthetical connectives: … , then, … Nouns ending with “mode”, “style”: crisis mode, outdoor-stadium style
10	Typology of English texts Five dimensions … Involved vs. informational production Narrative? Explicit vs. situation-dependent Persuasive? Abstract? … targeting these genres Intimate, interpersonal interactions Face-to-face conversations Scientific exposition Imaginative narrative General narrative exposition
11	Features used (e.g., Dimension 1) Biber also gives a feature inventory for each dimension. Features listed: THAT deletion, contractions, BE as main verb, WH questions, 1st person pronouns, 2nd person pronouns, general hedges, nouns, word length, prepositions, type/token ratio. [Figure: genres placed along the Dimension 1 scale (35 down to -20), from highest to lowest: face-to-face conversations; personal letters; interviews; prepared speeches; general fiction; editorials; academic prose and press reportage; official documents.]
12	Discriminant analysis for text genres Karlgren and Cutting (94) Same text genre categories as Biber Simple count and average metrics Discriminant analysis (using SPSS software) 64% precision over four categories
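Simple count and average metrics of this kind are cheap to compute. The sketch below is only illustrative: the particular features, the pronoun lists, and the name `genre_features` are assumptions chosen to echo Biber's inventory, not the exact feature set used by Karlgren and Cutting.

```python
import re

# Hypothetical pronoun lists; a real system would use a fuller inventory.
FIRST_PERSON = {"i", "me", "my", "mine", "we", "us", "our", "ours"}
SECOND_PERSON = {"you", "your", "yours"}

def genre_features(text):
    """Compute a few simple count and average metrics for one text."""
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n = len(tokens)
    if n == 0:
        raise ValueError("text contains no word tokens")
    return {
        "avg_word_length": sum(len(t) for t in tokens) / n,
        "type_token_ratio": len(set(tokens)) / n,
        "avg_sentence_length": n / len(sentences),
        "first_person_per_1000": 1000 * sum(t in FIRST_PERSON for t in tokens) / n,
        "second_person_per_1000": 1000 * sum(t in SECOND_PERSON for t in tokens) / n,
        # Crude proxy: any token with an apostrophe (also catches possessives).
        "apostrophe_forms_per_1000": 1000 * sum("'" in t for t in tokens) / n,
    }

features = genre_features("I don't think you know. We do!")
```

The resulting feature vectors could then be fed to any discriminant or classification procedure; the slide's discriminant analysis step is not reproduced here.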
13	Genre vs. Subject (Lee & Myaeng 02) Genre: style and purpose of text Subject: content of text What about the interaction between the two? Study found that certain genres overlap significantly in subject vocabulary So, want to use terms that cover more subjects represented by a genre Do this by selecting terms that: Appear in a large ratio of documents belonging to the genre Appear evenly distributed among the subject classes that represent the genre Discriminate this genre from others
14	Putting the constraints together
15	In summary… Genre and authorship analysis rely on highly frequent evidence that is portable across document subjects. Contrast with subject/text classification, which looks for specific keywords as evidence. References: Mosteller & Wallace (63) Inference in an authorship problem, J. American Statistical Association 58(3). Karlgren & Cutting (94) Recognizing Text Genres with Simple Metrics Using Discriminant Analysis, Proc. of COLING-94. de Vel, Anderson, Corney & Mohay (01) Mining Email Content for Author Identification Forensics, SIGMOD Record. Foster (00) Author Unknown, Owl Books, PE1421 Fos. Biber (89) A typology of English texts, Linguistics 27(3). Lee & Myaeng (02) Text genre classification with genre-revealing and subject-revealing features, Proc. of SIGIR 02.
16	To think about… The Mosteller-Wallace method examines function words while Foster’s method uses key words. What are the advantages and disadvantages of these two different methods? What are the implications of an application that would emulate the wordprint of another author? What are some of the potential effects of being able to undo anonymity?
17	Water Break See you in five minutes! I will hold a short tutorial for HW #2 at the end of class today.
18	Copy detection
19	Duplicate detection characteristics Plagiarism copies intentionally may obfuscate target and source relation Self-plagiarism* copy from one’s own work Often to offer for background of work in incremental research (near) Clone/duplicate same functionality in code / citation data but in different modules by different developers Fragment web page content generated by content manager interferes with spiders’ re-sampling rate
20	Signature method Register signature of authority doc Check a query doc against existing signature Flag down very similar documents Some design choices have to be made: How to compute a signature How to judge similarity between signatures
21	Effect of granularity Divide the document into smaller chunks document – no division sentence window of n words Large chunks Lower probability of match, higher threshold Small chunks Smaller number of unique chunks Lower search complexity
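The register-then-check scheme with a window of n words can be sketched as follows. This is one possible design under stated assumptions: SHA-1 hashes of overlapping word windows serve as the signature, and the similarity judgment is the fraction of the query's chunks found in the registered document. The names `chunk_signature` and `containment` are hypothetical.

```python
import hashlib
import re

def chunk_signature(text, n=5):
    """Hash every sliding window of n words; the set of hashes is the signature."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        # Text shorter than one window: treat the whole text as a single chunk.
        windows = [" ".join(words)] if words else []
    else:
        windows = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return {hashlib.sha1(w.encode("utf-8")).hexdigest() for w in windows}

def containment(query_sig, registered_sig):
    """Fraction of the query's chunks that also occur in the registered document."""
    if not query_sig:
        return 0.0
    return len(query_sig & registered_sig) / len(query_sig)

registered = chunk_signature("the quick brown fox jumps over the lazy dog", n=3)
copied = chunk_signature("quick brown fox jumps", n=3)
unrelated = chunk_signature("an entirely different sentence appears here", n=3)
```

Raising `n` makes each chunk more distinctive but less likely to survive small edits, which is the granularity trade-off on the slide.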
22	Signature methods For text documents Checksum Keywords N-gram (usually character) inventory Grammatical phrases For source code Words, characters and lines Halstead profile (Ignores comments) Operator histogram e.g., frequency of each type sorted Operand histogram
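An operator histogram for C-like source might be computed as in the sketch below. The regular expressions are simplifying assumptions: comments are stripped first (as the Halstead profile ignores them), string literals are not handled, and "sorted" is read here as ordering operator types by descending frequency.

```python
import re
from collections import Counter

# Block and line comments; removed before counting.
COMMENT = re.compile(r"/\*.*?\*/|//[^\n]*", re.DOTALL)

# Longer operators come first so that "++" is not counted as two "+".
OPERATOR = re.compile(
    r"<<=|>>=|\+\+|--|->|<<|>>|<=|>=|==|!=|&&|\|\||[-+*/%&|^]=|[-+*/%=<>!&|^~?:]"
)

def operator_histogram(source):
    """Return (operator, count) pairs sorted by descending count."""
    code = COMMENT.sub(" ", source)
    return Counter(OPERATOR.findall(code)).most_common()

hist = operator_histogram("while (*s1) *s3++ = *s1++; /* a + b */")
```

Comparing only the sorted counts (dropping the operator labels) gives a profile that is harder to disguise by swapping equivalent operators, at the cost of more false matches.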
23	Distance calculations Calculate distance between p₁, p₂ VSM: L₁ distance $\sum_f |P_{f1} - P_{f2}|$ VSM: L₂ (Euclidean) distance $\left(\sum_f |P_{f1} - P_{f2}|^2\right)^{1/2}$ Weighted feature combinations For text features, can use edit distance Calculate using dynamic programming Detect and flag copies Assume top n% as possible plagiarisms Use a tuned similarity threshold Other way: do tuning on supervised set (learn weights for features: Bilenko and Mooney)
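The distances on this slide are short to write down. The sketch below assumes profiles are stored as dictionaries mapping a feature f to its value, with missing features treated as 0, and uses unit-cost Levenshtein distance as the edit distance; the function names are hypothetical.

```python
import math

def l1_distance(p1, p2):
    """Sum over features f of |P_f1 - P_f2|."""
    features = set(p1) | set(p2)
    return sum(abs(p1.get(f, 0) - p2.get(f, 0)) for f in features)

def l2_distance(p1, p2):
    """Square root of the sum over features f of |P_f1 - P_f2|^2."""
    features = set(p1) | set(p2)
    return math.sqrt(sum((p1.get(f, 0) - p2.get(f, 0)) ** 2 for f in features))

def edit_distance(a, b):
    """Levenshtein distance by dynamic programming, keeping two rows of the table."""
    prev = list(range(len(b) + 1))  # distance from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]
```

`edit_distance` works on any sequences, so it can compare strings character by character or lists of tokens, which is the more natural unit for text and source code.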
24	Subset problem Problem: If a document is just a subset of another document, standard VS model may show low similarity Example: cosine (D₁,D₂) = .61 D₁: <A, B, C>, D₂: <A, B, C, D, E, F, G, H> Shivakumar and Garcia-Molina (95): use only close words in VSM Close = comparable frequency, defined by a tunable ε distance.
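The .61 in the example can be reproduced directly, assuming each term occurs once in each document (binary weights). The dot product is 3 and the vector lengths are √3 and √8, so the cosine is 3/√24 ≈ 0.61 even though D₁ is entirely contained in D₂.

```python
import math

def cosine(d1, d2):
    """Cosine similarity between two term-weight dictionaries."""
    dot = sum(w * d2.get(t, 0) for t, w in d1.items())
    norm1 = math.sqrt(sum(w * w for w in d1.values()))
    norm2 = math.sqrt(sum(w * w for w in d2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

d1 = dict.fromkeys("ABC", 1)       # D1: <A, B, C>
d2 = dict.fromkeys("ABCDEFGH", 1)  # D2: <A, B, C, D, E, F, G, H>
print(round(cosine(d1, d2), 2))    # 0.61
```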
25	R-measure: amount repeated in other documents (Khmelev and Teahan) Normalized sum, over all suffixes S of the text, of $Q(S \mid T_1, \ldots, T_n)$, where $Q(S \mid T_1, \ldots, T_n)$ = length of the longest prefix of S repeated in any one of the other documents Computed easily using suffix array data structure More effective than simple longest common substring
26	R-measure example T = cat_sat_on T1 = the_cat_on_a_mat T2 = the_cat_sat
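The example can be worked through with a brute-force sketch. Two assumptions are made here: the normalization is taken to be 2/(l(l+1)), where l is the length of T (the largest value the sum of suffix matches can reach is l(l+1)/2), and R is taken as the square root of that normalized sum. Substring search replaces the suffix array for clarity, so this is far slower than the method on the previous slide.

```python
import math

def longest_prefix_match(s, docs):
    """Q(S | T1..Tn): length of the longest prefix of s found in any single doc."""
    best = 0
    for doc in docs:
        k = best
        # If a prefix of length k+1 occurs in doc, all shorter prefixes do too.
        while k < len(s) and s[:k + 1] in doc:
            k += 1
        best = max(best, k)
    return best

def suffix_match_lengths(text, docs):
    """Q for every suffix of text, from the full text down to its last character."""
    return [longest_prefix_match(text[i:], docs) for i in range(len(text))]

def r_measure(text, docs):
    """Square root of the normalized sum of suffix match lengths (assumed form)."""
    l = len(text)
    if l == 0:
        return 0.0
    total = sum(suffix_match_lengths(text, docs))
    return math.sqrt(2.0 * total / (l * (l + 1)))

T = "cat_sat_on"
others = ["the_cat_on_a_mat", "the_cat_sat"]
lengths = suffix_match_lengths(T, others)  # [7, 6, 5, 4, 3, 5, 4, 3, 2, 1]
```

The first five suffixes match best in T2 ("cat_sat" and its tails) and the last five in T1 ("at_on" and its tails), giving a sum of 40 out of a possible 55.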
27	Computer program plagiarism Use stylistic rules to compile fingerprint: Commenting Variable names Formatting Style (e.g., K&R) Use this along with program structure Edit distance
/***********************************
 * This function concatenates the first and
 * second string into the third string.
 ***********************************/
void strcat(char *string1, char *string2, char *string3)
{
    char *ptr1, *ptr2;
    ptr2 = string3;
    /* Copy first string */
    for (ptr1 = string1; *ptr1; ptr1++) {
        *(ptr2++) = *ptr1;
    }
    ...

/*
 * concatenate s2 to s1 into s3.
 * Enough memory for s3 must already be allocated. No checks !!!!!!
 */
mysc(s1, s2, s3)
char *s1, *s2, *s3;
{
    while (*s1) *s3++ = *s1++;
    while (*s2) *s3++ = *s2++;
}
28	Design-based methods Idea: capture syntactic and semantic flow rather than token identity (for source code) Replace variable names with IDs correlated with symbol table and data type Decompose each program p into regions of sequential statements, conditionals, looping blocks – recurse on these Calculate similarity from root node downwards
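The first step, replacing variable names with IDs, can be illustrated with a much simplified sketch. It assumes a crude regex tokenizer and a small hypothetical keyword list, and it numbers identifiers by order of first appearance rather than consulting a symbol table or data types as the slide describes.

```python
import re

# Hypothetical, incomplete keyword list for C-like code.
C_KEYWORDS = {"char", "int", "void", "while", "for", "if", "else", "return"}

# Identifiers, integer literals, or any single non-space character.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")

def normalize_identifiers(source):
    """Tokenize source and rename each identifier to ID1, ID2, ... by first use."""
    ids = {}
    out = []
    for tok in TOKEN.findall(source):
        if re.match(r"[A-Za-z_]", tok) and tok not in C_KEYWORDS:
            tok = ids.setdefault(tok, f"ID{len(ids) + 1}")
        out.append(tok)
    return out

original = normalize_identifiers("while (*s1) *s3++ = *s1++;")
renamed = normalize_identifiers("while (*src) *dst++ = *src++;")
```

After normalization the two loops yield the same token sequence, so renaming variables alone no longer hides the copy; the region decomposition and tree comparison are not sketched here.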
29	Recursive region coding
30	Fragments of a web page Which are duplicated? Changed?
31	Defining fragments Base case: each web page is a fragment Inductive step: each part of a fragment is also a fragment if Shared: it is shared among at least n other fragments (n > 1) and is not subsumed by a parent fragment Different: it changes at a different rate than fragments containing it
32	Conclusion Signature-based methods common, design-based assumes domain knowledge. The importance of granularity and ordering changes between domains Difficult to scale up Most work only does pairwise comparison Low complexity clustering may help as a first pass References Belkouche et al. (04) Plagiarism Detection in Software Designs, ACM Southeast Conference Shivakumar & Garcia-Molina (95) SCAM: A copy detection mechanism for digital documents, Proc. of DL 95. Bilenko and Mooney (03) Adaptive duplicate detection using learnable string similarity measures, Proc. of KDD 03. Khmelev and Teahan (03) A repetition based measure for verification of text collections and for text categorization, Proc. SIGIR 03 Ramaswamy et al. (04) Automatic detection of fragments in dynamically generated web pages, Proc. WWW 04.
33	To think about… How to free duplicate detection algorithms from needing to do pairwise comparisons? What size chunk would you use for signature based methods for images, music, video? Would you encode a structural dependency as well (ordering using edit distance) or not (bag of chunks using VSM) for these other media types?