1
Computational Literary Analysis
  • Authorship Attribution and
    Plagiarism Detection


  • Module 9, Min-Yen KAN
2
The Federalist papers
  • A series of 85 papers written by Jay, Hamilton and Madison


  • Intended to help persuade voters to ratify the US Constitution
3
Disputed papers of the Federalist
  • Most of the papers have a known author, but the authorship of 12 papers is disputed
    • Either Hamilton or Madison

  • Want to determine who wrote these papers
    • Also known as textual forensics
4
Wordprint and Stylistics
  • Claim: Authors leave a unique wordprint in the documents which they author


  • Claim: Authors also exhibit certain stylistic patterns in their publications
5
Feature Selection
  • Content-specific features (Foster 90)
    • key words, special characters

  • Style markers
    • Word- or character-based features (Yule 38)
      • length of words, vocabulary richness
    • Function words (Mosteller & Wallace 64)


  • Structural features
    • Email: Title or signature, paragraph separators (de Vel et al. 01)
    • Can generalize to HTML tags
    • To think about: artifact of authoring software?
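
The style markers above can be sketched in a few lines. This is a minimal illustration, not any cited author's procedure: the tokenizer, the function name `style_markers`, and the six-word `FUNCTION_WORDS` list are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical, tiny function-word list for illustration only.
FUNCTION_WORDS = ["the", "of", "to", "and", "upon", "by"]

def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def style_markers(text):
    """Return a dict of simple style markers for one document."""
    tokens = tokenize(text)
    if not tokens:
        raise ValueError("text contains no word tokens")
    counts = Counter(tokens)
    n = len(tokens)
    features = {
        "avg_word_length": sum(len(t) for t in tokens) / n,
        # Type/token ratio as a crude vocabulary-richness measure.
        "type_token_ratio": len(counts) / n,
    }
    for word in FUNCTION_WORDS:
        # Rate per 1000 tokens so documents of different length compare.
        features["fw_" + word] = 1000.0 * counts[word] / n
    return features

sample = "The power of the people is vested in the people."
markers = style_markers(sample)
```

Type/token ratio depends on text length, so it is only comparable across documents of similar size.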
6
Bayes Theorem on function words
  • Mosteller & Wallace (M & W) examined the frequency of 100 function words
  • Smoothed these frequencies using a negative binomial (not Poisson) distribution





  • Used Bayes’ theorem and linear regression to find weights to fit for observed data


  • Sample words:
    • as, at, be, do, down, even, has, have, her, is, it, its, no, not, now, or, our, shall, than, that, the, this, to, up
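
A minimal sketch of the Bayesian step, assuming each word count follows a negative binomial distribution and that words are independent. The rates and dispersion values in `AUTHOR_MODELS` are hypothetical, not the parameters fitted by Mosteller and Wallace, and the regression-based weighting mentioned above is omitted.

```python
import math

def neg_binomial_logpmf(k, mean, dispersion):
    """Log P(count = k) for a negative binomial with the given mean.

    `dispersion` (r > 0) controls the extra variance relative to a Poisson:
    variance = mean + mean**2 / dispersion.
    """
    r = dispersion
    return (math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
            + r * math.log(r / (r + mean))
            + k * math.log(mean / (r + mean)))

# HYPOTHETICAL (rate per 1000 words, dispersion) pairs, for illustration only.
AUTHOR_MODELS = {
    "Hamilton": {"upon": (3.0, 2.0), "to": (40.0, 10.0)},
    "Madison": {"upon": (0.3, 2.0), "to": (35.0, 10.0)},
}

def log_odds(counts, n_words, author_a, author_b, prior_odds=1.0):
    """Log posterior odds of author_a over author_b by Bayes' theorem."""
    total = math.log(prior_odds)
    scale = n_words / 1000.0  # expected count grows with paper length
    for word, k in counts.items():
        rate_a, r_a = AUTHOR_MODELS[author_a][word]
        rate_b, r_b = AUTHOR_MODELS[author_b][word]
        total += neg_binomial_logpmf(k, rate_a * scale, r_a)
        total -= neg_binomial_logpmf(k, rate_b * scale, r_b)
    return total

paper_counts = {"upon": 6, "to": 78}  # made-up counts for a 2000-word paper
score = log_odds(paper_counts, 2000, "Hamilton", "Madison")
```

A positive `score` favors the first author and a negative one favors the second; the magnitude reflects how strongly the counts discriminate under the assumed parameters.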
7
A Funeral Elegy and Primary Colors
  • “Give anonymous offenders enough verbal rope and column inches, and they will hang themselves for you, every time” – Donald Foster in Author Unknown
  • A Funeral Elegy: Foster attributed this poem, signed W.S., to William Shakespeare
    • His work was initially rejected, but he identified his anonymous reviewer
  • Foster also attributed Primary Colors to Newsweek columnist Joe Klein


  • Analyzes text mainly by hand
8
Foster’s features
  • Very large feature space, look for distinguishing features:
    • Topic words
    • Punctuation
    • __________________
    • Irregular spelling and grammar


  • Some specific features (most compound):
    • Adjectives ending with “y”: talky
    • Parenthetical connectives: … , then, …
    • Nouns ending with “mode”, “style”: crisis mode, outdoor-stadium style

9
Typology of English texts
  • Five dimensions …
    • Involved vs. informational production
    • Narrative?
    • Explicit vs. situation-dependent
    • Persuasive?
    • Abstract?
  • … targeting these genres
    • Intimate, interpersonal interactions
    • Face-to-face conversations
    • Scientific exposition
    • Imaginative narrative
    • General narrative exposition
10
Features used (e.g., Dimension 1)
  • Biber also gives a feature inventory for each dimension


  • THAT deletion
  • Contractions
  • BE as main verb
  • WH questions
  • 1st person pronouns
  • 2nd person pronouns
  • General hedges
  • Nouns
  • Word Length
  • Prepositions
  • Type/Token Ratio


  • Approximate Dimension 1 scores by genre (read from the figure's scale):
    • 35: Face-to-face conversations
    • 20: Personal letters
    • 15 to 20: Interviews
    • 0 to 5: Prepared speeches
    • -5 to 0: General fiction
    • -10: Editorials
    • -15: Academic prose; Press reportage
    • -20 to -15: Official documents
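
A minimal sketch of how a text could be placed on such a scale, assuming a dimension score is the sum of standardized feature frequencies at one pole minus the sum at the other. The split of features into two poles, the feature names, and all corpus statistics below are assumptions for illustration; the slide does not label the poles.

```python
# ASSUMPTION: this split of the feature list into two poles is illustrative.
INVOLVED_FEATURES = ["contractions", "first_person_pronouns", "wh_questions"]
INFORMATIONAL_FEATURES = ["nouns", "prepositions", "type_token_ratio"]

def dimension_score(freqs, corpus_mean, corpus_sd):
    """Sum of z-scores for one pole minus the sum for the other pole."""
    def z(name):
        return (freqs[name] - corpus_mean[name]) / corpus_sd[name]
    involved = sum(z(f) for f in INVOLVED_FEATURES)
    informational = sum(z(f) for f in INFORMATIONAL_FEATURES)
    return involved - informational

# Hypothetical corpus statistics (per 1000 words, except type/token ratio).
corpus_mean = {"contractions": 10, "first_person_pronouns": 25,
               "wh_questions": 0.5, "nouns": 180, "prepositions": 110,
               "type_token_ratio": 50}
corpus_sd = {"contractions": 10, "first_person_pronouns": 25,
             "wh_questions": 0.5, "nouns": 30, "prepositions": 20,
             "type_token_ratio": 5}

conversation = {"contractions": 30, "first_person_pronouns": 50,
                "wh_questions": 1.0, "nouns": 150, "prepositions": 90,
                "type_token_ratio": 45}
report = {"contractions": 0, "first_person_pronouns": 0,
          "wh_questions": 0.0, "nouns": 210, "prepositions": 130,
          "type_token_ratio": 55}

conversation_score = dimension_score(conversation, corpus_mean, corpus_sd)
report_score = dimension_score(report, corpus_mean, corpus_sd)
```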


11
Discriminant analysis for text genres
  • Karlgren and Cutting (94)
    • Same text genre categories as Biber
    • Simple count and average metrics
    • Discriminant analysis (in SPSS)
    • 64% precision over four categories
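
A minimal sketch of discriminant analysis on simple metrics, written as a two-class Fisher linear discriminant in plain Python rather than SPSS. It is a simplification of the multi-category setting above, and the two features, the genre labels, and every data point are hypothetical.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length feature tuples."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]

def pooled_covariance(class_a, class_b):
    """Pooled within-class covariance matrix for two classes of 2-D points."""
    cov = [[0.0, 0.0], [0.0, 0.0]]
    for rows in (class_a, class_b):
        mu = mean_vector(rows)
        for r in rows:
            d = [r[0] - mu[0], r[1] - mu[1]]
            for i in range(2):
                for j in range(2):
                    cov[i][j] += d[i] * d[j]
    dof = len(class_a) + len(class_b) - 2
    return [[c / dof for c in row] for row in cov]

def fit_discriminant(class_a, class_b):
    """Return (weights, threshold); a score >= threshold predicts class A."""
    mu_a, mu_b = mean_vector(class_a), mean_vector(class_b)
    (a, b), (c, d) = pooled_covariance(class_a, class_b)
    det = a * d - b * c
    if det == 0:
        raise ValueError("pooled covariance matrix is singular")
    diff = [mu_a[0] - mu_b[0], mu_a[1] - mu_b[1]]
    # weights = inverse(covariance) * (mean_a - mean_b), for the 2x2 case
    weights = [(d * diff[0] - b * diff[1]) / det,
               (-c * diff[0] + a * diff[1]) / det]
    midpoint = [(mu_a[0] + mu_b[0]) / 2, (mu_a[1] + mu_b[1]) / 2]
    threshold = weights[0] * midpoint[0] + weights[1] * midpoint[1]
    return weights, threshold

def predict(x, weights, threshold, label_a, label_b):
    score = weights[0] * x[0] + weights[1] * x[1]
    return label_a if score >= threshold else label_b

# Hypothetical features: (average word length, 1st person pronouns per 1000 words)
informative = [(5.1, 2.0), (5.4, 1.0), (4.9, 4.0), (5.2, 3.0)]
imaginative = [(4.2, 30.0), (4.4, 25.0), (4.0, 38.0), (4.3, 28.0)]

weights, threshold = fit_discriminant(informative, imaginative)
guess = predict((5.0, 3.0), weights, threshold, "informative", "imaginative")
```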
12
Recent developments
  • Using machine learning techniques to assist genre analysis and authorship detection


    • Fung & Mangasarian (03) use SVMs and Bosch & Smith (98) use linear programming (LP) to confirm the claim that the disputed papers are Madison’s


    • They use counts of up to three sets of function words as their features


      • Example separating plane over three word frequencies: -0.5242 as + 0.8895 our + 4.9235 upon ≥ 4.7368

    • Many other studies out there…
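
A minimal sketch of applying a separating plane like the one above to a text. The coefficients and threshold are copied from the slide; treating the features as rates per 1000 words is an assumption, and the slide does not say which author lies on which side, so the code only reports the side.

```python
import re

# Coefficients and threshold copied from the separating plane on the slide.
WEIGHTS = {"as": -0.5242, "our": 0.8895, "upon": 4.9235}
THRESHOLD = 4.7368

def rates_per_thousand(text, words):
    """ASSUMPTION: features are word rates per 1000 tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        raise ValueError("text contains no word tokens")
    return {w: 1000.0 * tokens.count(w) / len(tokens) for w in words}

def plane_score(rates):
    """Weighted sum of the three function-word rates."""
    return sum(WEIGHTS[w] * rates[w] for w in WEIGHTS)

def on_upper_side(rates):
    """True if the text satisfies the inequality (score >= threshold)."""
    return plane_score(rates) >= THRESHOLD

heavy_upon = {"as": 8.0, "our": 1.0, "upon": 3.0}   # made-up rates
no_upon = {"as": 8.0, "our": 1.0, "upon": 0.0}      # made-up rates
```

Because the weight on "upon" dominates, texts that use that word often fall on the upper side of the plane under these made-up rates.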
13
Copy detection
  • Prevention: stop or disable the copying process
  • Detection: decide if one source is the same as another
14
Copy / duplicate detection
  • Compute signature for documents
    • Register signature of authority doc
    • Check a query doc against existing signature

  • Variations:
    • Length: document / sentence* / window
    • Signature: checksum / _______  / phrases
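
A minimal sketch of the register-then-check workflow, assuming sentence-length units and a checksum signature. The class `SignatureRegistry`, the normalization step, and the choice of SHA-256 are illustrative assumptions, not a description of any specific system.

```python
import hashlib
import re

def sentence_signatures(text):
    """Return a set of checksums, one per normalized sentence."""
    signatures = set()
    for sentence in re.split(r"[.!?]+", text):
        # Normalize case, punctuation and whitespace before hashing.
        normalized = " ".join(re.findall(r"[a-z0-9]+", sentence.lower()))
        if normalized:
            digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
            signatures.add(digest)
    return signatures

class SignatureRegistry:
    """Hypothetical registry mapping document ids to sentence signatures."""

    def __init__(self):
        self._docs = {}

    def register(self, doc_id, text):
        self._docs[doc_id] = sentence_signatures(text)

    def check(self, query_text):
        """Fraction of the query's sentences found in each registered doc."""
        query = sentence_signatures(query_text)
        if not query:
            return {}
        return {doc_id: len(query & sigs) / len(query)
                for doc_id, sigs in self._docs.items()}

registry = SignatureRegistry()
registry.register("original", "The cat sat. The dog ran. Birds fly.")
overlap = registry.check("the dog ran! Fish swim.")
```

A checksum only matches exact (normalized) sentences, so a single changed word hides the copy; that is the trade-off between signature types listed above.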
15
Granularity
  • Large chunks
    • Lower probability of match, higher threshold


  • Small chunks
    • Smaller number of unique chunks
    • Lower search complexity
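
A minimal sketch of the granularity trade-off, assuming chunks are overlapping windows of `size` words. The two sentences and the helper names are made up for illustration.

```python
def word_chunks(text, size):
    """Set of overlapping word windows of the given size."""
    words = text.lower().split()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

def matched_fraction(query, source, size):
    """Fraction of the query's chunks that also occur in the source."""
    query_chunks = word_chunks(query, size)
    if not query_chunks:
        return 0.0
    return len(query_chunks & word_chunks(source, size)) / len(query_chunks)

source = "the quick brown fox jumps over the lazy dog near the quiet river bank"
edited = "the quick brown fox leaps over the lazy dog near the noisy river bank"

# Two changed words: the matched fraction drops as the chunk size grows.
fractions = {size: matched_fraction(edited, source, size) for size in (1, 2, 4, 8)}
unique_counts = {size: len(word_chunks(source, size)) for size in (1, 2)}
```

In this example the single-word chunks repeat ("the" occurs three times), so there are fewer unique chunks to store at size 1 than at size 2, while the 8-word windows no longer match at all after two small edits.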
16
Subset problem
  • If a document consists of just a subset of another document, standard VS model may show low similarity
    • Example: Cosine (D1,D2) = .61
      D1: <A, B, C>,
      D2: <A, B, C, D, E, F, G, H>


  • Shivakumar and Garcia-Molina (95): use only close words in VSM
    • Close = _________________, defined by a tunable ε distance.
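
The cosine value in the example can be checked directly, assuming binary term weights. The `containment` function is added only as an illustrative asymmetric alternative for the subset case; it is not the measure of Shivakumar and Garcia-Molina.

```python
import math

def cosine(doc_a, doc_b):
    """Cosine similarity of two term -> weight dictionaries."""
    dot = sum(w * doc_b.get(term, 0.0) for term, w in doc_a.items())
    norm_a = math.sqrt(sum(w * w for w in doc_a.values()))
    norm_b = math.sqrt(sum(w * w for w in doc_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def containment(doc_a, doc_b):
    """Fraction of doc_a's terms that also occur in doc_b (asymmetric)."""
    if not doc_a:
        return 0.0
    return sum(1 for term in doc_a if term in doc_b) / len(doc_a)

d1 = {term: 1.0 for term in "ABC"}
d2 = {term: 1.0 for term in "ABCDEFGH"}

similarity = cosine(d1, d2)      # 3 / (sqrt(3) * sqrt(8)), about 0.61
subset_score = containment(d1, d2)
```

D1 is wholly contained in D2, yet the cosine is only about 0.61 because D2's extra terms inflate its norm; the asymmetric score for D1 against D2 is 1.0.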
17
Computer program plagiarism
  • Use stylistic rules to compile fingerprint:
    • Commenting
    • ___________
    • Formatting
    • Style (e.g., K&R)
  • Use this along with program structure
    • Edit distance
    • What about hypertext structure?
    /***********************************
     * This function concatenates the first and
     * second string into the third string.
     *************************************/
    void strcat(char *string1, char *string2, char *string3)
    {
      char *ptr1, *ptr2;
      ptr2 = string3;
      /*
       * Copy first string
       */
      for (ptr1 = string1; *ptr1; ptr1++) {
        *(ptr2++) = *ptr1;
      }
      /*
       * Copy second string
       */
      for (ptr1 = string2; *ptr1; ptr1++) {
        *(ptr2++) = *ptr1;
      }
      /* Terminate the result */
      *ptr2 = '\0';
    }




    /*
     * concatenate s2 to s1 into s3.
     * Enough memory for s3 must already be allocated. No checks !!!!!!
     */
    mysc(s1, s2, s3)
          char *s1, *s2, *s3;
    {
      while (*s1)
        *s3++ = *s1++;

      while (*s2)
        *s3++ = *s2++;

      *s3 = '\0';
    }
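
A minimal sketch of comparing program structure by edit distance, assuming comments are stripped and every non-keyword identifier is collapsed to a placeholder so that renaming variables does not hide a copy. The tokenizer, the short `C_KEYWORDS` set, and the three loop snippets are illustrative assumptions.

```python
import re

# Small, incomplete keyword set; enough for the snippets below.
C_KEYWORDS = {"char", "void", "int", "while", "for", "if", "else", "return"}

def structure_tokens(source):
    """Strip comments, then map every non-keyword identifier to 'ID'."""
    source = re.sub(r"/\*.*?\*/", " ", source, flags=re.S)
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\+\+|--|[^\s\w]", source)
    return [t if t in C_KEYWORDS or not re.match(r"[A-Za-z_]", t) else "ID"
            for t in tokens]

def edit_distance(a, b):
    """Levenshtein distance between two sequences (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # delete x
                           cur[j - 1] + 1,            # insert y
                           prev[j - 1] + (x != y)))   # substitute or match
        prev = cur
    return prev[-1]

loop_a = "while (*s1) *s3++ = *s1++;  /* copy */"
loop_b = "while (*src) *dst++ = *src++;"
loop_c = "for (p = s; *p; p++) { *(q++) = *p; }"

renamed = edit_distance(structure_tokens(loop_a), structure_tokens(loop_b))
restructured = edit_distance(structure_tokens(loop_a), structure_tokens(loop_c))
```

Renaming identifiers and deleting comments leaves the token sequence unchanged, so `renamed` is zero, while rewriting the loop as a `for` statement gives a nonzero distance.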


18
Conclusion
  • Find attributes that are _____between texts for a collection, but _____ across different collections
  • Difficult to scale up to many authors and many sources
    • Most work only does pairwise comparison
    • _______ may help as a first pass for plagiarism detection
19
To think about…
  • The Mosteller-Wallace method examines function words while Foster’s method uses key words. What are the advantages and disadvantages of these two different methods?


  • What are the implications of an application that would emulate the wordprint of another author?


  • What are some of the potential effects of being able to undo anonymity?


  • Self-plagiarism is common in the scientific community.  Should we condone this practice?
20
References
  • Foster (00) Author Unknown. Owl Books PE1421 Fos
  • Biber (89) A typology of English texts, Linguistics, 27(3)
  • Shivakumar & Garcia-Molina (95) SCAM: A copy detection mechanism for digital documents, Proc. of DL 95
  • Mosteller & Wallace (63) Inference in an authorship problem, J American Statistical Association 58(3)
  • Karlgren & Cutting (94) Recognizing Text Genres with Simple Metrics Using Discriminant Analysis, Proc. of COLING-94.
  • de Vel, Anderson, Corney & Mohay (01) Mining Email Content for Author Identification Forensics, SIGMOD Record