Computational Literary Analysis
Authorship Attribution and Plagiarism Detection
Module 9, Min-Yen KAN

The Federalist papers
A series of 85 papers written by Jay, Hamilton and Madison
Intended to help persuade voters to ratify the US Constitution

Disputed papers of the Federalist
Most of the papers have an agreed attribution, but the authorship of 12 papers is disputed
  Either Hamilton or Madison
Want to determine who wrote these papers
  Also known as textual forensics

Wordprint and Stylistics
Claim: Authors leave a unique wordprint in the documents which they author
Claim: Authors also exhibit certain stylistic patterns in their publications

Feature Selection
Content-specific features (Foster 90)
  key words, special characters
Style markers
  Word- or character-based features (Yule 38)
    length of words, vocabulary richness
  Function words (Mosteller & Wallace 64)
Structural features
  Email: title or signature, paragraph separators (de Vel et al. 01)
  Can generalize to HTML tags
  To think about: artifact of authoring software?
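
The style markers above can be computed with very little code. The following is a minimal sketch, not any cited author's procedure: the tokenizer, the function name `style_markers`, and the five-word `FUNCTION_WORDS` list are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny function-word list (an assumption for this sketch).
FUNCTION_WORDS = ("the", "of", "to", "upon", "on")

def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def style_markers(text, function_words=FUNCTION_WORDS):
    """Return average word length, type/token ratio and function-word rates per 1000 tokens."""
    tokens = tokenize(text)
    if not tokens:
        raise ValueError("text contains no word tokens")
    counts = Counter(tokens)
    n = len(tokens)
    return {
        "avg_word_length": sum(len(t) for t in tokens) / n,
        "type_token_ratio": len(counts) / n,  # a simple vocabulary-richness measure
        "function_word_rate": {w: 1000 * counts[w] / n for w in function_words},
    }

markers = style_markers("The cat sat on the mat")
```

Rates are normalized per 1000 tokens so that texts of different lengths can be compared; the type/token ratio is still length-sensitive and is best compared across samples of similar size.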

Bayes’ Theorem on function words
M & W examined the frequency of 100 function words
Smoothed these frequencies using a negative binomial (not Poisson) distribution
Used Bayes’ theorem and linear regression to find weights that fit the observed data
Sample words:
  as do has is no or than this
  at down have it not our that to
  be even her its now shall the up
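
To make the Bayesian step concrete, here is a rough sketch of posterior log odds between two candidate authors, with each function-word count modeled by a negative binomial distribution. It is only an illustration of the idea: the per-1000-word rates, the shared dispersion `r`, the independence of words, and the names `nb_log_pmf` and `log_odds` are assumptions, not Mosteller and Wallace's actual parameters or procedure.

```python
import math

def nb_log_pmf(k, mean, r):
    """Log-probability of count k under a negative binomial with the given mean and dispersion r."""
    p = r / (r + mean)
    return (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
            + r * math.log(p) + k * math.log(1 - p))

def log_odds(counts, n_words, rates_a, rates_b, r=2.0, prior_log_odds=0.0):
    """Log posterior odds of author A over author B, treating words as independent."""
    total = prior_log_odds
    for word, k in counts.items():
        # Expected count in a text of n_words, given a rate per 1000 words.
        mean_a = rates_a[word] * n_words / 1000
        mean_b = rates_b[word] * n_words / 1000
        total += nb_log_pmf(k, mean_a, r) - nb_log_pmf(k, mean_b, r)
    return total

# Invented rates per 1000 words for two hypothetical authors A and B.
rates_a = {"upon": 3.0, "to": 40.0}
rates_b = {"upon": 0.2, "to": 35.0}
score = log_odds({"upon": 6, "to": 80}, 2000, rates_a, rates_b)
```

A positive `score` favors author A and a negative one favors author B. In this toy input almost all of the evidence comes from the word with the most different rates, which is why rare but discriminating function words matter.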

A Funeral Elegy and Primary Colors
“Give anonymous offenders enough verbal rope and column inches, and they will hang themselves for you, every time” (Donald Foster in Author Unknown)
A Funeral Elegy: Foster attributed this poem, signed “W.S.”, to William Shakespeare
  His attribution was initially rejected, but he identified his anonymous reviewer
Foster also attributed Primary Colors to Newsweek columnist Joe Klein
Analyzes text mainly by hand

Foster’s features
Very large feature space; look for distinguishing features:
  Topic words
  Punctuation
  __________________
  Irregular spelling and grammar
Some specific features (most compound):
  Adjectives ending with “y”: talky
  Parenthetical connectives: … , then, …
  Nouns ending with “mode”, “style”: crisis mode, outdoor-stadium style

Typology of English texts (Biber 89)
Five dimensions …
  Involved vs. informational production
  Narrative?
  Explicit vs. situation-dependent
  Persuasive?
  Abstract?
… targeting these genres
  Intimate, interpersonal interactions
  Face-to-face conversations
  Scientific exposition
  Imaginative narrative
  General narrative exposition

Features used (e.g., Dimension 1)
Biber also gives a feature inventory for each dimension
  THAT deletion
  Contractions
  BE as main verb
  WH questions
  1st person pronouns
  2nd person pronouns
  General hedges
  Nouns
  Word Length
  Prepositions
  Type/Token Ratio
[Figure: genres placed along the Dimension 1 scale, which runs from 35 down to -20. From top to bottom: face-to-face conversations (about 35); personal letters and interviews (about 20); prepared speeches (between 5 and 0); general fiction (between 0 and -5); editorials (about -10); academic prose and press reportage (about -15); official documents (between -15 and -20)]
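
One way to read the Dimension 1 scale is as a sum of standardized feature frequencies. The sketch below assumes that the features from THAT deletion through general hedges load positively (involved production) and that nouns, word length, prepositions and type/token ratio load negatively (informational production). That sign split, the reduced feature set, and all corpus statistics are assumptions for illustration, not Biber's published figures.

```python
# Subset of the slide's features; the sign split is an assumption of this sketch.
POSITIVE = ("contractions", "first_person_pronouns", "second_person_pronouns")
NEGATIVE = ("nouns", "prepositions")

def dimension_score(freqs, corpus_mean, corpus_std):
    """Sum of z-scores of positive features minus sum of z-scores of negative features."""
    def z(feature):
        return (freqs[feature] - corpus_mean[feature]) / corpus_std[feature]
    return sum(z(f) for f in POSITIVE) - sum(z(f) for f in NEGATIVE)

# Invented corpus statistics (frequencies per 1000 words).
MEAN = {"contractions": 10, "first_person_pronouns": 20, "second_person_pronouns": 10,
        "nouns": 200, "prepositions": 100}
STD = {"contractions": 5, "first_person_pronouns": 10, "second_person_pronouns": 5,
       "nouns": 40, "prepositions": 20}

chatty = {"contractions": 30, "first_person_pronouns": 50, "second_person_pronouns": 25,
          "nouns": 120, "prepositions": 80}
formal = {"contractions": 0, "first_person_pronouns": 0, "second_person_pronouns": 0,
          "nouns": 280, "prepositions": 140}

chatty_score = dimension_score(chatty, MEAN, STD)
formal_score = dimension_score(formal, MEAN, STD)
```

Under these made-up numbers the conversational text lands at the positive end of the scale and the formal text at the negative end, mirroring the ordering of genres in the figure.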

Discriminant analysis for text genres
Karlgren and Cutting (94)
  Same text genre categories as Biber
  Simple count and average metrics
  Discriminant analysis (in SPSS)
  64% precision over four categories

Recent developments
Using machine learning techniques to assist genre analysis and authorship detection
Fung & Mangasarian (03) use SVMs and Bosch & Smith (98) use LP to confirm the claim that the disputed papers are Madison’s
They use counts of small sets of function words (up to three words) as their features
  -0.5242 · as + 0.8895 · our + 4.9235 · upon ≥ 4.7368 (each word stands for its frequency in the paper)
Many other studies out there…
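
The separating plane above is easy to apply once the three word frequencies are known. This sketch only evaluates the inequality from the slide; the frequency unit (for example, occurrences per 1000 words) and the function names are assumptions, and since the slide does not say which author lies on which side, the code reports only whether the inequality holds.

```python
# Coefficients and threshold copied from the slide.
WEIGHTS = {"as": -0.5242, "our": 0.8895, "upon": 4.9235}
THRESHOLD = 4.7368

def plane_value(freqs):
    """Weighted sum of the three function-word frequencies (missing words count as 0)."""
    return sum(w * freqs.get(word, 0.0) for word, w in WEIGHTS.items())

def satisfies_inequality(freqs):
    """True when the paper falls on the '>= threshold' side of the plane."""
    return plane_value(freqs) >= THRESHOLD

# Invented frequencies for two hypothetical papers.
heavy_upon = {"as": 0.0, "our": 0.0, "upon": 1.0}
light_upon = {"as": 8.0, "our": 1.0, "upon": 0.2}
```

Because the weight on "upon" dominates, a paper's side of the plane is driven mostly by how often that one word appears.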

Copy detection
Prevention: stop or disable copying process
Detection: decide if one source is the same as another

Copy / duplicate detection
Compute signature for documents
  Register signature of authority doc
  Check a query doc against existing signature
Variations:
  Length: document / sentence* / window
  Signature: checksum / _______ / phrases
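
The register-then-check workflow can be sketched with sentence-level checksums. This is one possible reading of the slide, not a specific published system: the class name `SignatureRegistry`, the naive sentence splitter, and the choice of SHA-1 as the checksum are assumptions.

```python
import hashlib
import re

def sentence_signatures(text):
    """Return the set of checksums of the text's whitespace-normalized, lowercased sentences."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    signatures = set()
    for sentence in sentences:
        normalized = " ".join(sentence.lower().split())
        if normalized:
            signatures.add(hashlib.sha1(normalized.encode("utf-8")).hexdigest())
    return signatures

class SignatureRegistry:
    """Stores signatures of authority documents and scores query documents against them."""

    def __init__(self):
        self._docs = {}

    def register(self, doc_id, text):
        self._docs[doc_id] = sentence_signatures(text)

    def check(self, text):
        """Map each registered doc to the fraction of query sentences it also contains."""
        query = sentence_signatures(text)
        if not query:
            return {}
        return {doc_id: len(query & sigs) / len(query)
                for doc_id, sigs in self._docs.items()}

registry = SignatureRegistry()
registry.register("orig", "Authors leave a wordprint. Style is hard to hide. Function words help.")
result = registry.check("Style is hard to hide. This sentence is new.")
```

Exact checksums only match sentences that are identical after normalization, so a single changed word hides a copied sentence; that weakness is what motivates the other signature choices listed under Variations.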

Granularity
Large chunks
  Lower probability of match, higher threshold
Small chunks
  Smaller number of unique chunks
  Lower search complexity
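
The trade-off can be seen on a toy pair of sentences that differ in one word. The sketch assumes chunks are overlapping word windows; the helper names and the example sentences are made up for illustration.

```python
def chunks(words, size):
    """All overlapping windows of `size` consecutive words."""
    return [tuple(words[i:i + size]) for i in range(len(words) - size + 1)]

def chunk_stats(text_a, text_b, size):
    """Count distinct chunks across both texts and the fraction of A's chunks found in B."""
    a = set(chunks(text_a.lower().split(), size))
    b = set(chunks(text_b.lower().split(), size))
    shared = len(a & b) / len(a) if a else 0.0
    return {"unique_chunks": len(a | b), "shared_fraction": shared}

TEXT_A = "the quick brown fox jumps over the lazy dog"
TEXT_B = "the quick brown cat jumps over the lazy dog"

small = chunk_stats(TEXT_A, TEXT_B, 1)  # single-word chunks
large = chunk_stats(TEXT_A, TEXT_B, 3)  # three-word chunks
```

With single-word chunks there are fewer distinct chunks to index and most of them match; with three-word chunks one substituted word breaks every window that contains it, so the match rate drops.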

Subset problem
If a document consists of just a subset of another document, the standard vector space (VS) model may show low similarity
  Example: Cosine(D1, D2) = .61 for D1: <A, B, C>, D2: <A, B, C, D, E, F, G, H>
Shivakumar and Garcia-Molina (95): use only close words in VSM
  Close = _________________, defined by a tunable ε distance.
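
The .61 figure can be reproduced directly from term-count vectors. The sketch below computes the cosine for the slide's example and, for contrast, a simple containment score; `containment` is an illustrative asymmetric measure chosen for this example and is not the SCAM measure.

```python
import math
from collections import Counter

def cosine(d1, d2):
    """Cosine similarity between the term-count vectors of two token lists."""
    c1, c2 = Counter(d1), Counter(d2)
    dot = sum(c1[t] * c2[t] for t in c1)
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (norm1 * norm2)

def containment(d1, d2):
    """Fraction of d1's distinct terms that also occur in d2."""
    s1 = set(d1)
    return len(s1 & set(d2)) / len(s1)

D1 = ["A", "B", "C"]
D2 = ["A", "B", "C", "D", "E", "F", "G", "H"]

similarity = cosine(D1, D2)   # 3 / sqrt(3 * 8)
overlap = containment(D1, D2)
```

D1 is entirely contained in D2, yet the cosine is dragged down by the five terms that only D2 has; an asymmetric measure makes the subset relation visible.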

Computer program plagiarism
Use stylistic rules to compile fingerprint:
  Commenting
  ___________
  Formatting
  Style (e.g., K&R)
Use this along with program structure
  Edit distance
What about hypertext structure?

/***********************************
 * This function concatenates the first and
 * second string into the third string.
 ***********************************/
void strcat(char *string1, char *string2, char *string3)
{
    char *ptr1, *ptr2;
    ptr2 = string3;
    /*
     * Copy first string
     */
    for(ptr1=string1;*ptr1;ptr1++) {
        *(ptr2++) = *ptr1;
    }
    /*
     * Copy second string
     */
    for(ptr1=string2;*ptr1;ptr1++) {
        *(ptr2++) = *ptr1;
    }
    *ptr2 = '\0';
}

/*
 * concatenate s2 to s1 into s3.
 * Enough memory for s3 must already be allocated. No checks !!!!!!
 */
mysc(s1, s2, s3)
char *s1, *s2, *s3;
{
    while (*s1)
        *s3++ = *s1++;

    while (*s2)
        *s3++ = *s2++;

    *s3 = '\0';
}
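
Edit distance over token sequences is one simple way to compare program structure once comments and layout are set aside. The following is a minimal sketch using the standard Levenshtein recurrence; the crude tokenizer and the function names are assumptions for illustration, not a specific plagiarism-detection tool.

```python
import re

def tokens(source):
    """Split source text into identifiers, numbers and single punctuation characters."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", source)

def edit_distance(a, b):
    """Levenshtein distance between two sequences (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]

line_a = tokens("*s3++ = *s1++;")
line_b = tokens("*s3++ = *s2++;")
distance = edit_distance(line_a, line_b)
```

Working on tokens rather than characters makes the distance insensitive to whitespace; mapping every identifier to a common placeholder before comparing would also make it insensitive to renaming.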

Conclusion
Find attributes that are _____ between texts for a collection, but _____ across different collections
Difficult to scale up to many authors and many sources
  Most work only does pairwise comparison
_______ may help as a first pass for plagiarism detection

To think about…
The Mosteller-Wallace method examines function words while Foster’s method uses key words. What are the advantages and disadvantages of these two different methods?
What are the implications of an application that would emulate the wordprint of another author?
What are some of the potential effects of being able to undo anonymity?
Self-plagiarism is common in the scientific community. Should we condone this practice?

References
Foster (00) Author Unknown. Owl Books. PE1421 Fos
Biber (89) A typology of English texts, Linguistics, 27(3)
Shivakumar & Garcia-Molina (95) SCAM: A copy detection mechanism for digital documents, Proc. of DL 95
Mosteller & Wallace (63) Inference in an authorship problem, J American Statistical Association 58(3)
Karlgren & Cutting (94) Recognizing Text Genres with Simple Metrics Using Discriminant Analysis, Proc. of COLING-94
de Vel, Anderson, Corney & Mohay (01) Mining Email Content for Author Identification Forensics, SIGMOD Record