Computational Literary Analysis
Authorship Attribution and Plagiarism Detection
Module 9
Min-Yen KAN

The Federalist papers
A series of 85 papers written by Jay, Hamilton and Madison
Intended to help persuade voters to ratify the US constitution

Disputed papers of the Federalist
Most of the papers are attributed, but the authorship of 12 papers is disputed
Either Hamilton or Madison
Want to determine who wrote these papers
Also known as textual forensics

Wordprint and Stylistics
Claim: Authors leave a unique wordprint in the documents which they author
Claim: Authors also exhibit certain stylistic patterns in their publications

Feature Selection
Content-specific features (Foster 90)
key words, special characters
Style markers
Word- or character-based features (Yule 38)
length of words, vocabulary richness
Function words (Mosteller & Wallace 64)
Structural features
Email: Title or signature, paragraph separators
(de Vel et al. 01)
Can generalize to HTML tags
To think about: artifact of authoring software?
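The word-based style markers above can be computed with a few lines of code. This is a minimal sketch, not taken from any of the cited papers; the function name `style_markers` and the tokenization rule are assumptions made for illustration.

```python
import re

def style_markers(text):
    """Return two simple word-based style markers for one text."""
    # Lowercased alphabetic tokens; an inner apostrophe keeps "don't" as one word.
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    if not words:
        return {"avg_word_length": 0.0, "type_token_ratio": 0.0}
    return {
        "avg_word_length": sum(len(w) for w in words) / len(words),
        # Vocabulary richness: distinct words (types) over all words (tokens).
        "type_token_ratio": len(set(words)) / len(words),
    }

sample = "The cat sat on the mat"
print(style_markers(sample))
```

Note that the type/token ratio falls as texts get longer, so it is only comparable across samples of similar length.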

Bayes Theorem on function words
Mosteller & Wallace (M & W) examined the frequency of 100 function words
Smoothed these frequencies using a negative binomial (not Poisson) distribution
Used Bayes’ theorem and linear regression to find weights that fit the observed data
Sample words:
as do has is no or than this
at down have it not our that to
be even her its now shall the up
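A rough sketch of the idea: count function words, then compare how well each author's typical rates explain the counts. To keep the arithmetic short this sketch uses a Poisson model, which is a simplification (the slide notes that M & W used the negative binomial instead). All rates and counts below are made-up illustrative values, and the function names are hypothetical.

```python
import math
import re

def rates_per_thousand(text, vocabulary):
    """Rate of each vocabulary word per 1000 running words."""
    words = re.findall(r"[a-z]+", text.lower())
    total = len(words)
    if total == 0:
        return {w: 0.0 for w in vocabulary}
    return {w: 1000.0 * words.count(w) / total for w in vocabulary}

def poisson_log_odds(counts, n_words, rates_a, rates_b):
    """Log odds of author A over author B for the observed counts (equal priors)."""
    log_odds = 0.0
    for word, k in counts.items():
        lam_a = rates_a[word] * n_words / 1000.0  # expected count under A
        lam_b = rates_b[word] * n_words / 1000.0  # expected count under B
        # log Poisson(k; lam_a) - log Poisson(k; lam_b); the log k! terms cancel.
        log_odds += k * math.log(lam_a / lam_b) - (lam_a - lam_b)
    return log_odds

# Hypothetical rates per 1000 words for two authors (not real Federalist data).
author_a = {"to": 40.0, "our": 1.0}
author_b = {"to": 35.0, "our": 3.0}
observed = {"to": 80, "our": 2}  # counts in a hypothetical 2000-word paper

print(poisson_log_odds(observed, 2000, author_a, author_b))  # positive favors A
```

Bayes' theorem then turns the log odds into a posterior: with equal priors, a positive value favors author A and a negative value favors author B.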

A Funeral Elegy and Primary Colors
“Give anonymous offenders enough verbal rope and column inches, and they will hang themselves for you, every time” – Donald Foster in Author Unknown
A Funeral Elegy: Foster attributed this poem, signed “W.S.”, to William Shakespeare
His submission was initially rejected, but he identified the anonymous reviewer
Foster also attributed Primary Colors to Newsweek columnist Joe Klein
Analyzes text mainly by hand

Foster’s features
Very large feature space, look for distinguishing features:
Topic words
Punctuation
__________________
Irregular spelling and grammar
Some specific features (most compound):
Adjectives ending with “y”: talky
Parenthetical connectives: … , then, …
Nouns ending with “mode”, “style”: crisis mode, outdoor-stadium style
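Foster works mainly by hand, but some of the surface patterns listed above can be pre-screened automatically. The sketch below is only an illustration: the two regular expressions are hypothetical approximations of the "mode/style" compounds and the parenthetical "then" and are not Foster's own procedure.

```python
import re

# Hypothetical patterns modeled on the examples listed above.
MODE_STYLE = re.compile(r"\b[\w-]+[ -](?:mode|style)\b", re.IGNORECASE)
PARENTHETICAL_THEN = re.compile(r",\s*then\s*,", re.IGNORECASE)

def surface_feature_counts(text):
    """Count candidate matches; each hit still needs a human check."""
    return {
        "mode_style_compounds": len(MODE_STYLE.findall(text)),
        "parenthetical_then": len(PARENTHETICAL_THEN.findall(text)),
    }

text = "He was in crisis mode. It was, then, an outdoor-stadium style rally."
print(surface_feature_counts(text))
```

Patterns this loose will also match ordinary phrases such as "in style", so the counts are candidates for inspection rather than evidence.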

Typology of English texts
Five dimensions …
Involved vs. informational production
Narrative?
Explicit vs. situation-dependent
Persuasive?
Abstract?
… targeting these genres
Intimate, interpersonal interactions
Face-to-face conversations
Scientific exposition
Imaginative narrative
General narrative exposition

Features used (e.g., Dimension 1)
Biber also gives a feature inventory for each dimension
THAT deletion
Contractions
BE as main verb
WH questions
1st person pronouns
2nd person pronouns
General hedges
Nouns
Word Length
Prepositions
Type/Token Ratio
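Figure: genres placed along Dimension 1 (scale from 35 down to -20; positions approximate)
Face-to-face conversations: 35
Personal letters: 20
Interviews: between 20 and 15
Prepared speeches: between 5 and 0
General fiction: between 0 and -5
Editorials: -10
Academic prose; Press reportage: -15
Official documents: between -15 and -20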
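One way to place a text on such a scale is to standardize each feature rate across the corpus, then add the features associated with the involved end and subtract those associated with the informational end. The sketch below assumes that contractions and 1st/2nd person pronouns count positively and that nouns and prepositions count negatively; the feature subset, names, and rates are all illustrative, not Biber's data.

```python
from statistics import mean, pstdev

POSITIVE = ["contractions", "first_person_pronouns", "second_person_pronouns"]
NEGATIVE = ["nouns", "prepositions"]

def dimension_scores(corpus):
    """corpus: {text_id: {feature: rate per 1000 words}} -> {text_id: score}."""
    features = POSITIVE + NEGATIVE
    stats = {}
    for f in features:
        values = [rates[f] for rates in corpus.values()]
        # Fall back to 1.0 when a feature never varies, to avoid dividing by zero.
        stats[f] = (mean(values), pstdev(values) or 1.0)
    scores = {}
    for text_id, rates in corpus.items():
        z = {f: (rates[f] - stats[f][0]) / stats[f][1] for f in features}
        scores[text_id] = sum(z[f] for f in POSITIVE) - sum(z[f] for f in NEGATIVE)
    return scores

# Made-up rates per 1000 words for two hypothetical texts.
corpus = {
    "conversation": {"contractions": 40, "first_person_pronouns": 60,
                     "second_person_pronouns": 30, "nouns": 150, "prepositions": 80},
    "official_document": {"contractions": 0, "first_person_pronouns": 5,
                          "second_person_pronouns": 1, "nouns": 300, "prepositions": 150},
}
print(dimension_scores(corpus))
```

With only two texts every standardized value is +1 or -1, so the example mainly shows the sign convention: the conversational text scores high and the official document scores low.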

Discriminant analysis for text genres
Karlgren and Cutting (94)
Same text genre categories as Biber
Simple count and average metrics
Discriminant analysis (in SPSS)
64% precision over four categories

Recent developments
Using machine learning techniques to assist genre analysis and authorship detection
Fung & Mangasarian (03) use support vector machines (SVMs) and Bosch & Smith (98) use linear programming (LP) to confirm the claim that the disputed papers are Madison’s
They use counts of up to three sets of function words as their features
-0.5242 × as + 0.8895 × our + 4.9235 × upon ≥ 4.7368
Many other studies out there…
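The inequality above is a linear decision rule over three function-word features, and applying it takes one weighted sum. The slide does not state the unit of the features or which author lies on which side, so the sketch below assumes rates per 1000 words and only reports whether the inequality holds; the sample rates are invented.

```python
WEIGHTS = {"as": -0.5242, "our": 0.8895, "upon": 4.9235}
THRESHOLD = 4.7368

def plane_margin(rates):
    """Weighted sum of the three word rates minus the threshold."""
    return sum(w * rates.get(word, 0.0) for word, w in WEIGHTS.items()) - THRESHOLD

def satisfies_inequality(rates):
    """True when the paper falls on the '>=' side of the separating plane."""
    return plane_margin(rates) >= 0

# Hypothetical rates (assumed to be occurrences per 1000 words).
heavy_upon = {"as": 8.0, "our": 2.0, "upon": 3.0}
light_upon = {"as": 8.0, "our": 2.0, "upon": 0.2}

print(satisfies_inequality(heavy_upon), satisfies_inequality(light_upon))
```

The large weight on "upon" means that this single word does most of the separating in the example.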

Copy detection
Prevention –
stop or disable copying process
Detection –
decide if one source is the same as another

Copy / duplicate detection
Compute signature for documents
Register signature of authority doc
Check a query doc against existing signature
Variations:
Length: document / sentence* / window
Signature: checksum / _______  / phrases
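The register-then-check workflow can be sketched with sentence-level checksums. This is one possible instance of the scheme, not a specific published system: the class name `SignatureRegistry`, the naive sentence splitter, and the choice of SHA-1 as the checksum are all assumptions.

```python
import hashlib
import re

def sentence_signatures(text):
    """Checksum each sentence after normalizing case and whitespace."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    signatures = set()
    for sentence in sentences:
        normalized = " ".join(sentence.lower().split())
        if normalized:
            signatures.add(hashlib.sha1(normalized.encode("utf-8")).hexdigest())
    return signatures

class SignatureRegistry:
    """Stores signatures of registered documents and scores query documents."""

    def __init__(self):
        self._docs = {}

    def register(self, doc_id, text):
        self._docs[doc_id] = sentence_signatures(text)

    def check(self, query_text):
        """Fraction of the query's sentences found in each registered document."""
        query = sentence_signatures(query_text)
        if not query:
            return {}
        return {doc_id: len(query & sigs) / len(query)
                for doc_id, sigs in self._docs.items()}

registry = SignatureRegistry()
registry.register("original", "A b c. D e f. G h i.")
print(registry.check("D e  F. X y z."))
```

Exact checksums catch only verbatim copies after normalization; changing a single word in a sentence changes its signature.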

Granularity
Large chunks
Lower probability of match, higher threshold
Small chunks
Smaller number of unique chunks
Lower search complexity
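The effect of chunk size can be seen by comparing two near-identical sentences with word windows of different lengths. A small sketch, with hypothetical function names and an invented example:

```python
def word_windows(text, k):
    """Set of all runs of k consecutive words in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def window_overlap(query, source, k):
    """Fraction of the query's k-word windows that also occur in the source."""
    q = word_windows(query, k)
    if not q:
        return 0.0
    return len(q & word_windows(source, k)) / len(q)

source = "the quick brown fox jumps over the lazy dog"
query = "the quick red fox jumps over the sleepy dog"

for k in (1, 3, 5):
    print(k, window_overlap(query, source, k))
```

With two words changed, single-word chunks still overlap heavily, three-word chunks overlap only a little, and five-word chunks do not match at all, which illustrates why larger chunks have a lower probability of matching.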

Subset problem
If a document consists of just a subset of another document, the standard vector space (VS) model may show low similarity
Example: Cosine (D1,D2) = .61
D1: <A, B, C>,
D2: <A, B, C, D, E, F, G, H>
Shivakumar and Garcia-Molina (95): use only close words in VSM
Close = _________________, defined by a tunable ε distance.
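The cosine value in the example can be reproduced directly, and an asymmetric measure shows how the subset relation can be recovered. The `subset_measure` below is only loosely modeled on the close-words idea: its closeness test (frequency ratios summing to less than ε) and the default ε = 2.5 are assumptions of this sketch, so check the SCAM paper for the exact definition.

```python
import math
from collections import Counter

def cosine(d1, d2):
    """Cosine similarity of two documents given as lists of words."""
    f1, f2 = Counter(d1), Counter(d2)
    dot = sum(f1[w] * f2[w] for w in f1)
    norm = (math.sqrt(sum(v * v for v in f1.values()))
            * math.sqrt(sum(v * v for v in f2.values())))
    return dot / norm if norm else 0.0

def subset_measure(d1, d2, epsilon=2.5):
    """How much of d1 is covered by d2, counting only 'close' words."""
    f1, f2 = Counter(d1), Counter(d2)
    # Assumption: a shared word is close when f1/f2 + f2/f1 < epsilon.
    close = [w for w in f1 if w in f2 and f1[w] / f2[w] + f2[w] / f1[w] < epsilon]
    denom = sum(v * v for v in f1.values())
    return sum(f1[w] * f2[w] for w in close) / denom if denom else 0.0

d1 = list("ABC")
d2 = list("ABCDEFGH")
print(round(cosine(d1, d2), 2))
print(subset_measure(d1, d2), subset_measure(d2, d1))
```

Because the measure is normalized by the first document only, D1 is fully covered by D2 even though the cosine is low; taking the larger of the two directions flags the pair as a possible copy.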

Computer program plagiarism
Use stylistic rules to compile fingerprint:
Commenting
___________
Formatting
Style (e.g., K&R)
Use this along with program structure
Edit distance
What about hypertext structure?
/***********************************
 * This function concatenates the first and
 * second string into the third string.
 ***********************************/
void strcat(char *string1, char *string2, char *string3)
{
    char *ptr1, *ptr2;
    ptr2 = string3;
    /*
     * Copy first string
     */
    for (ptr1 = string1; *ptr1; ptr1++) {
        *(ptr2++) = *ptr1;
    }
    /*
     * Copy second string
     */
    for (ptr1 = string2; *ptr1; ptr1++) {
        *(ptr2++) = *ptr1;
    }
    *ptr2 = '\0';  /* terminate the result */
}

/*
 * concatenate s2 to s1 into s3.
 * Enough memory for s3 must already be allocated. No checks !!!!!!
 */
mysc(s1, s2, s3)
      char *s1, *s2, *s3;
{
  while (*s1)
    *s3++ = *s1++;
  while (*s2)
    *s3++ = *s2++;
  *s3 = '\0';  /* terminate the result */
}
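The two programs above differ in commenting, naming, and formatting but share the same structure. One way to compare structure is to strip comments, replace every identifier with a placeholder, and take the edit distance between the resulting token sequences. The sketch below is a simplified illustration; the keyword list and tokenizer are assumptions and handle only a small part of C.

```python
import re

KEYWORDS = {"char", "for", "while", "void", "int", "return", "if", "else"}

def normalized_tokens(c_source):
    """Strip comments and replace every non-keyword identifier with ID."""
    no_comments = re.sub(r"/\*.*?\*/", " ", c_source, flags=re.DOTALL)
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", no_comments)
    return [t if not re.match(r"[A-Za-z_]", t) or t in KEYWORDS else "ID"
            for t in tokens]

def edit_distance(a, b):
    """Levenshtein distance between two token sequences."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(previous[j] + 1,              # delete x
                               current[j - 1] + 1,           # insert y
                               previous[j - 1] + (x != y)))  # substitute
        previous = current
    return previous[-1]

one = normalized_tokens("while (*s1) /* copy */ *s3++ = *s1++;")
two = normalized_tokens("while (*a)\n  *b++ = *a++;")
print(edit_distance(one, two))
```

Renaming variables, removing comments, and reformatting leave the distance at zero, while a structural change such as swapping a `while` loop for a `for` loop raises it.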

Conclusion
Find attributes that are _____ between texts for a collection, but _____ across different collections
Difficult to scale up to many authors and many sources
Most work only does pairwise comparison
_______ may help as a first pass for plagiarism detection

To think about…
The Mosteller-Wallace method examines function words while Foster’s method uses key words. What are the advantages and disadvantages of these two different methods?
What are the implications of an application that would emulate the wordprint of another author?
What are some of the potential effects of being able to undo anonymity?
Self-plagiarism is common in the scientific community. Should we condone this practice?

References
Foster (00) Author Unknown. Owl Books. PE1421 Fos
Biber (89) A typology of English texts. Linguistics, 27(3).
Shivakumar & Garcia-Molina (95) SCAM: A copy detection mechanism for digital documents. Proc. of DL 95.
Mosteller & Wallace (63) Inference in an authorship problem. J American Statistical Association 58(3).
Karlgren & Cutting (94) Recognizing Text Genres with Simple Metrics Using Discriminant Analysis. Proc. of COLING-94.
de Vel, Anderson, Corney & Mohay (01) Mining Email Content for Author Identification Forensics. SIGMOD Record.