1	Computational Literary Analysis Authorship Attribution and Plagiarism Detection Module 9 Min-Yen KAN
2	The Federalist papers A series of 85 papers written by Jay, Hamilton and Madison Intended to help persuade voters to ratify the US Constitution
3	Disputed papers of the Federalist Most of the papers have attribution, but the authorship of 12 papers is disputed: either Hamilton or Madison Want to determine who wrote these papers Also known as textual forensics
4	Wordprint and Stylistics Claim: Authors leave a unique wordprint in the documents which they author Claim: Authors also exhibit certain stylistic patterns in their publications
5	Feature Selection Content-specific features (Foster 90) key words, special characters Style markers Word- or character-based features (Yule 38) length of words, vocabulary richness Function words (Mosteller & Wallace 64) Structural features Email: Title or signature, paragraph separators (de Vel et al. 01) Can generalize to HTML tags To think about: artifact of authoring software?
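A minimal sketch of how a few of the style markers above could be computed. The tokenizer, the function name `style_markers`, and the abbreviated `FUNCTION_WORDS` list are assumptions made for illustration; they are not taken from the cited papers.

```python
import re
from collections import Counter

# Hypothetical, abbreviated function-word list (an assumption for this sketch).
FUNCTION_WORDS = ("as", "at", "be", "it", "the", "to")

def style_markers(text):
    """Return simple word-based style markers for one text."""
    # Assumed tokenizer: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        raise ValueError("text contains no word tokens")
    counts = Counter(tokens)
    n = len(tokens)
    return {
        # Word-based feature: average token length in characters.
        "mean_word_length": sum(len(t) for t in tokens) / n,
        # Vocabulary richness: distinct tokens divided by total tokens.
        "type_token_ratio": len(counts) / n,
        # Function-word usage, expressed per 1000 tokens.
        "function_word_rate": {w: 1000 * counts[w] / n for w in FUNCTION_WORDS},
    }

if __name__ == "__main__":
    print(style_markers("The cat sat on the mat"))
```

Structural features (signatures, paragraph separators, HTML tags) would need a parser for the specific document type and are left out here.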
6	Bayes' theorem on function words M & W examined the frequency of 100 function words Smoothed these frequencies using a negative binomial (not Poisson) distribution Used Bayes’ theorem and linear regression to find weights that fit the observed data Sample words: as do has is no or than this at down have it not our that to be even her its now shall the up
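A rough sketch of the Bayesian idea on this slide, not a reconstruction of Mosteller and Wallace's actual procedure. It assumes each function word's count follows a negative binomial distribution with an author-specific mean and a shared dispersion `r`, and that words are independent. The author labels `"a"` and `"b"`, the rates, and the default `r` are invented for illustration.

```python
import math

def neg_binomial_logpmf(k, mean, r):
    """Log P(K = k) for a negative binomial with the given mean and dispersion r."""
    p = r / (r + mean)  # "success" probability implied by mean and r
    return (math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
            + r * math.log(p) + k * math.log(1 - p))

def log_posterior_odds(counts, rates, r=2.0, prior_a=0.5):
    """Log odds that author "a" (rather than "b") wrote a text with these counts.

    counts: {word: observed count in the disputed text}
    rates:  {word: {"a": expected count, "b": expected count}}  (hypothetical values)
    """
    log_odds = math.log(prior_a / (1 - prior_a))  # prior odds
    for word, k in counts.items():
        # Bayes' theorem in log form: add each word's log likelihood ratio.
        log_odds += neg_binomial_logpmf(k, rates[word]["a"], r)
        log_odds -= neg_binomial_logpmf(k, rates[word]["b"], r)
    return log_odds

if __name__ == "__main__":
    rates = {"shall": {"a": 5.0, "b": 1.0}}  # invented numbers
    print(log_posterior_odds({"shall": 6}, rates))  # positive: favours "a"
    print(log_posterior_odds({"shall": 0}, rates))  # negative: favours "b"
```

A positive result favours author "a", a negative one author "b"; the magnitude grows as more discriminating words are added.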
7	A Funeral Elegy and Primary Colors “Give anonymous offenders enough verbal rope and column inches, and they will hang themselves for you, every time” – Donald Foster in Author Unknown A Funeral Elegy: Foster attributed this poem to W.S. Initially rejected, but identified his anonymous reviewer Foster also attributed Primary Colors to Newsweek columnist Joe Klein Analyzes text mainly by hand
8	Foster’s features Very large feature space, look for distinguishing features: Topic words Punctuation __________________ Irregular spelling and grammar Some specific features (most compound): Adverbs ending with “y”: talky Parenthetical connectives: … , then, … Nouns ending with “mode”, “style”: crisis mode, outdoor-stadium style
9	Typology of English texts Five dimensions … Involved vs. informational production Narrative? Explicit vs. situation-dependent Persuasive? Abstract? … targeting these genres Intimate, interpersonal interactions Face-to-face conversations Scientific exposition Imaginative narrative General narrative exposition
10	Features used (e.g., Dimension 1) Biber also gives a feature inventory for each dimension THAT deletion Contractions BE as main verb WH questions 1st person pronouns 2nd person pronouns General hedges Nouns Word Length Prepositions Type/Token Ratio
[Figure: genres placed along Dimension 1 on a scale from 35 down to -20, highest to lowest: face-to-face conversations; personal letters, interviews; prepared speeches; general fiction; editorials; academic prose, press reportage; official documents]
11	Discriminant analysis for text genres Karlgren and Cutting (94) Same text genre categories as Biber Simple count and average metrics Discriminant analysis (in SPSS) 64% precision over four categories
12	Recent developments Using machine learning techniques to assist genre analysis and authorship detection Fung & Mangasarian (03) use SVMs and Bosch & Smith (98) use LP to confirm claim that the disputed papers are Madison’s They use counts of up to three sets of function words as their features: -0.5242 × as + 0.8895 × our + 4.9235 × upon ≥ 4.7368 Many other studies out there…
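The inequality on this slide is a linear decision rule over three function-word features. A minimal sketch of evaluating it follows; the slide does not state the units of the features or which author lies on which side of the plane, so the code only reports the side and treats the feature values as given. The function names are hypothetical.

```python
# Weights and threshold copied from the slide.
WEIGHTS = {"as": -0.5242, "our": 0.8895, "upon": 4.9235}
THRESHOLD = 4.7368

def hyperplane_score(features):
    """Weighted sum of the three function-word features (missing words count as 0)."""
    return sum(w * features.get(word, 0.0) for word, w in WEIGHTS.items())

def on_upper_side(features):
    """True if the text satisfies the slide's inequality (score >= threshold)."""
    return hyperplane_score(features) >= THRESHOLD

if __name__ == "__main__":
    # Invented feature values, for illustration only.
    print(on_upper_side({"as": 0.0, "our": 0.0, "upon": 1.0}))
    print(on_upper_side({"as": 1.0, "our": 0.0, "upon": 1.0}))
```

Because the weight on "upon" dominates, that single word does most of the work in this rule.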
13	Copy detection Prevention – stop or disable copying process Detection – decide if one source is the same as another
14	Copy / duplicate detection Compute signature for documents Register signature of authority doc Check a query doc against existing signature Variations: Length: document / sentence* / window Signature: checksum / _______ / phrases
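The register-then-check workflow above can be sketched as follows. This is an illustrative assumption, not a specific published system: signatures are SHA-1 hashes of sliding word windows, and the class name `Registry` and its methods are hypothetical.

```python
import hashlib
import re

def window_chunks(text, size):
    """Split text into overlapping windows of `size` words."""
    words = re.findall(r"\w+", text.lower())
    if not words:
        return []
    last_start = max(len(words) - size, 0)
    return [" ".join(words[i:i + size]) for i in range(last_start + 1)]

def signatures(text, size=5):
    """One checksum-style signature per window."""
    return {hashlib.sha1(c.encode("utf-8")).hexdigest()
            for c in window_chunks(text, size)}

class Registry:
    """Stores signatures of registered (authority) documents."""

    def __init__(self, size=5):
        self.size = size
        self.index = {}  # signature -> set of document ids

    def register(self, doc_id, text):
        for sig in signatures(text, self.size):
            self.index.setdefault(sig, set()).add(doc_id)

    def check(self, text):
        """Return {doc_id: fraction of the query's signatures found in that document}."""
        sigs = signatures(text, self.size)
        hits = {}
        for sig in sigs:
            for doc_id in self.index.get(sig, ()):
                hits[doc_id] = hits.get(doc_id, 0) + 1
        return {doc_id: n / len(sigs) for doc_id, n in hits.items()}

if __name__ == "__main__":
    reg = Registry(size=3)
    reg.register("orig", "the quick brown fox jumps over the lazy dog")
    print(reg.check("quick brown fox jumps"))
```

The `size` parameter is the chunk length: larger windows match less often by chance, while smaller windows yield fewer distinct chunks to store and search.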
15	Granularity Large chunks Lower probability of match, higher threshold Small chunks Smaller number of unique chunks Lower search complexity
16	Subset problem If a document consists of just a subset of another document, standard VS model may show low similarity Example: Cosine (D₁,D₂) = .61 D₁: <A, B, C>, D₂: <A, B, C, D, E, F, G, H> Shivakumar and Garcia-Molina (95): use only close words in VSM Close = _________________, defined by a tunable ε distance.
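The example on this slide can be checked directly. The sketch below computes the cosine for D₁ and D₂ and, for contrast, a simple asymmetric containment score. The containment function is a stand-in chosen for illustration; it is not the measure defined by Shivakumar and Garcia-Molina.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term sequences, using raw term counts."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0

def containment(a, b):
    """Fraction of a's distinct terms that also occur in b (asymmetric)."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa) if sa else 0.0

if __name__ == "__main__":
    d1 = ["A", "B", "C"]
    d2 = ["A", "B", "C", "D", "E", "F", "G", "H"]
    print(round(cosine(d1, d2), 2))  # 0.61, as on the slide
    print(containment(d1, d2))       # 1.0: every term of D1 occurs in D2
    print(containment(d2, d1))       # 0.375
```

The cosine is 3 / (√3 · √8) ≈ 0.61 even though D₁ is wholly contained in D₂, which is the subset problem the slide describes.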
17	Computer program plagiarism Use stylistic rules to compile fingerprint: Commenting ___________ Formatting Style (e.g., K&R) Use this along with program structure Edit distance What about hypertext structure?
/***********************************
 * This function concatenates the first and
 * second string into the third string.
 ***********************************/
void strcat(char *string1, char *string2, char *string3)
{
    char *ptr1, *ptr2;
    ptr2 = string3;
    /* Copy first string */
    for(ptr1=string1;*ptr1;ptr1++) {
        *(ptr2++) = *ptr1;
    }

/*
 * concatenate s2 to s1 into s3.
 * Enough memory for s3 must already be allocated. No checks !!!!!!
 */
mysc(s1, s2, s3)
char *s1, *s2, *s3;
{
    while (*s1)
        *s3++ = *s1++;
    while (*s2)
        *s3++ = *s2++;
}
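The slide lists edit distance as one way to compare program structure. A minimal sketch, assuming programs are first reduced to token sequences: the tokenizer regex and the function names are illustrative choices, and the distance is plain Levenshtein distance over tokens.

```python
import re

def tokenize(code):
    """Assumed tokenizer: identifiers, integers, and single non-space symbols."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)

def edit_distance(a, b):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # delete x
                            curr[j - 1] + 1,           # insert y
                            prev[j - 1] + (x != y)))   # substitute or match
        prev = curr
    return prev[-1]

if __name__ == "__main__":
    left = tokenize("while (*s1) *s3++ = *s1++;")
    right = tokenize("while (*p) *q++ = *p++;")
    print(edit_distance(left, right))
```

Renaming variables still changes tokens under this scheme; mapping every identifier to a single placeholder before comparing is one way to make the distance reflect structure rather than naming.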
18	Conclusion Find attributes that are _____between texts for a collection, but _____ across different collections Difficult to scale up to many authors and many sources Most work only does pairwise comparison _______ may help as a first pass for plagiarism detection
19	To think about… The Mosteller-Wallace method examines function words while Foster’s method uses key words. What are the advantages and disadvantages of these two different methods? What are the implications of an application that would emulate the wordprint of another author? What are some of the potential effects of being able to undo anonymity? Self-plagiarism is common in the scientific community. Should we condone this practice?
20	References
Foster (00) Author Unknown. Owl Books. PE1421 Fos
Biber (89) A typology of English texts. Linguistics 27(3)
Shivakumar & Garcia-Molina (95) SCAM: A copy detection mechanism for digital documents. Proc. of DL 95
Mosteller & Wallace (63) Inference in an authorship problem. J American Statistical Association 58(3)
Karlgren & Cutting (94) Recognizing Text Genres with Simple Metrics Using Discriminant Analysis. Proc. of COLING-94
de Vel, Anderson, Corney & Mohay (01) Mining Email Content for Author Identification Forensics. SIGMOD Record