|
1
|
- Authorship Attribution and Plagiarism Detection
- Module 9, Min-Yen KAN
|
|
2
|
- The Federalist Papers: a series of 85 papers written by Jay, Hamilton and Madison
- Intended to help persuade voters to ratify the US Constitution
|
|
3
|
- Most of the papers are attributed, but the authorship of 12 papers is disputed
- Either Hamilton or Madison
- Want to determine who wrote these papers
- Also known as textual forensics
|
|
4
|
- Claim: Authors leave a unique wordprint in the documents they author
- Claim: Authors also exhibit certain stylistic patterns in their publications
|
|
5
|
- Content-specific features (Foster 90)
- key words, special characters
- Style markers
- Word- or character-based features (Yule 38)
- length of words, vocabulary richness
- Function words (Mosteller & Wallace 64)
- Structural features
- Email: Title or signature, paragraph separators (de Vel et al. 01)
- Can generalize to HTML tags
- To think about: artifact of authoring software?
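The word-based style markers above can be computed with very little code. The following is a minimal sketch, not the method of any cited paper: the tokenizer (a simple regular expression) and the function name `style_markers` are assumptions made for illustration.

```python
import re

def style_markers(text):
    """Compute two simple word-based style markers for a text.

    Tokenization is an assumption: runs of letters and apostrophes, lowercased.
    """
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    if not tokens:
        return {"avg_word_length": 0.0, "type_token_ratio": 0.0}
    return {
        # Mean number of characters per token (length of words)
        "avg_word_length": sum(len(t) for t in tokens) / len(tokens),
        # Distinct words divided by total words (a crude vocabulary richness measure)
        "type_token_ratio": len(set(tokens)) / len(tokens),
    }

markers = style_markers("the cat saw the dog")
print(markers)
```

The type/token ratio falls as a text gets longer, so it is only comparable between samples of similar length.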
|
|
6
|
- Mosteller & Wallace (M & W) examined the frequency of 100 function words
- Smoothed these frequencies using a negative binomial (not Poisson) distribution
- Used Bayes’ theorem and linear regression to find weights that fit the observed data
- Sample words:
- as do has is no or than this
- at down have it not our that to
- be even her its now shall the up
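As a starting point for this kind of analysis, function word frequencies can be turned into rates per 1,000 words. This sketch only covers the counting step, not the negative binomial smoothing or the Bayesian weighting. The word subset, the tokenizer, and the per-1,000 normalization are assumptions for illustration.

```python
import re
from collections import Counter

# Illustrative subset of the sample words listed above (not the full set of 100).
FUNCTION_WORDS = ["as", "at", "be", "do", "has", "not", "our", "shall", "the", "to", "up"]

def function_word_rates(text, words=FUNCTION_WORDS, per=1000):
    """Return the number of occurrences of each function word per `per` tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    if total == 0:
        return {w: 0.0 for w in words}
    return {w: per * counts[w] / total for w in words}

sample = "The people shall not be denied the right to be heard."
rates = function_word_rates(sample)
print(rates["the"], rates["be"], rates["our"])
```

Rates from a disputed paper can then be compared against rates from papers of known authorship.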
|
|
7
|
- “Give anonymous offenders enough verbal rope and column inches, and they will hang themselves for you, every time” (Donald Foster, in Author Unknown)
- A Funeral Elegy: Foster attributed this poem to W.S.
- Initially rejected, but Foster identified his anonymous reviewer
- Foster also attributed Primary Colors to Newsweek columnist Joe Klein
- Analyzes text mainly by hand
|
|
8
|
- Very large feature space, look for distinguishing features:
- Topic words
- Punctuation
- __________________
- Irregular spelling and grammar
- Some specific features (most compound):
- Adverbs ending with “y”: talky
- Parenthetical connectives: … , then, …
- Nouns ending with “mode”, “style”: crisis mode, outdoor-stadium style
|
|
9
|
- Five dimensions …
- Involved vs. informational production
- Narrative?
- Explicit vs. situation-dependent
- Persuasive?
- Abstract?
- … targeting these genres
- Intimate, interpersonal interactions
- Face-to-face conversations
- Scientific exposition
- Imaginative narrative
- General narrative exposition
|
|
10
|
- Biber also gives a feature inventory for each dimension
- THAT deletion
- Contractions
- BE as main verb
- WH questions
- 1st person pronouns
- 2nd person pronouns
- General hedges
- Nouns
- Word Length
- Prepositions
- Type/Token Ratio
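- Figure: genres placed along a score axis running from 35 down to -20, highest first:
- Face-to-face conversations (about 35)
- Personal letters (about 20)
- Interviews (between 20 and 15)
- Prepared speeches (between 5 and 0)
- General fiction (between 0 and -5)
- Editorials (about -10)
- Academic prose; Press reportage (about -15)
- Official documents (between -15 and -20)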
|
|
11
|
- Karlgren and Cutting (94)
- Same text genre categories as Biber
- Simple count and average metrics
- Discriminant analysis (in SPSS)
- 64% precision over four categories
|
|
12
|
- Using machine learning techniques to assist genre analysis and authorship detection
- Fung & Mangasarian (03) use SVMs and Bosch & Smith (98) use LP to confirm the claim that the disputed papers are Madison’s
- They use counts of up to three sets of function words as their features
- Example separating plane: -0.5242·as + 0.8895·our + 4.9235·upon ≥ 4.7368
- Many other studies out there…
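A linear rule like the one on this slide is cheap to evaluate once the weights are known. The sketch below only applies the inequality and does not train an SVM or solve an LP. The weights and threshold are copied from the slide. Treating the feature values as function word rates is an assumption, and the slide does not say which author falls on which side of the plane, so the function returns only a boolean.

```python
# Weights and threshold as given on the slide.
WEIGHTS = {"as": -0.5242, "our": 0.8895, "upon": 4.9235}
THRESHOLD = 4.7368

def plane_score(features):
    """Weighted sum of the three function word features (missing words count as 0)."""
    return sum(weight * features.get(word, 0.0) for word, weight in WEIGHTS.items())

def satisfies_rule(features):
    """True if the features lie on the '>= threshold' side of the plane."""
    return plane_score(features) >= THRESHOLD

# Hypothetical feature values, for illustration only.
example = {"as": 1.0, "our": 1.0, "upon": 1.0}
print(plane_score(example), satisfies_rule(example))
```

Because "upon" carries by far the largest weight, a text with a high rate of "upon" satisfies the rule almost regardless of the other two features.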
|
|
13
|
- Prevention: stop or disable the copying process
- Detection: decide if one source is the same as another
|
|
14
|
- Compute signature for documents
- Register signature of authority doc
- Check a query doc against existing signature
- Variations:
- Length: document / sentence* / window
- Signature: checksum / _______ / phrases
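The register-then-check workflow can be sketched with sentence-length units and a checksum signature. This is one possible arrangement, not a specific published system: the class name `SignatureRegistry`, the sentence splitter, the normalization, and the choice of SHA-1 as the checksum are all assumptions.

```python
import hashlib
import re

def sentence_signatures(text):
    """Return the set of checksums of the normalized sentences in `text`."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    signatures = set()
    for sentence in sentences:
        # Normalize: lowercase, keep only letters and digits, single spaces
        normalized = " ".join(re.findall(r"[a-z0-9]+", sentence.lower()))
        if normalized:
            signatures.add(hashlib.sha1(normalized.encode("utf-8")).hexdigest())
    return signatures

class SignatureRegistry:
    """Hypothetical in-memory registry mapping signatures to document ids."""

    def __init__(self):
        self._index = {}

    def register(self, doc_id, text):
        for sig in sentence_signatures(text):
            self._index.setdefault(sig, set()).add(doc_id)

    def check(self, text):
        """Return {doc_id: fraction of query sentences found in that document}."""
        sigs = sentence_signatures(text)
        hits = {}
        for sig in sigs:
            for doc_id in self._index.get(sig, ()):
                hits[doc_id] = hits.get(doc_id, 0) + 1
        return {doc_id: n / len(sigs) for doc_id, n in hits.items()}

registry = SignatureRegistry()
registry.register("orig", "Authors leave a wordprint. Style is hard to hide.")
print(registry.check("Style is hard to hide! This part is new."))
```

An exact checksum only matches verbatim (after normalization) copies of a sentence, so a single changed word defeats it.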
|
|
15
|
- Large chunks
- Lower probability of match, higher threshold
- Small chunks
- Smaller number of unique chunks
- Lower search complexity
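The effect of chunk size on match probability can be seen with sliding word windows. This is a small sketch under assumptions: chunks are k-word windows over whitespace tokens, and `chunk_overlap` is a hypothetical helper that reports the share of the query's unique chunks found in a registered text.

```python
def word_chunks(text, k):
    """Return the set of k-word sliding-window chunks in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def chunk_overlap(registered, query, k):
    """Fraction of the query's unique k-word chunks also present in `registered`."""
    query_chunks = word_chunks(query, k)
    if not query_chunks:
        return 0.0
    return len(query_chunks & word_chunks(registered, k)) / len(query_chunks)

original = "the quick brown fox jumps over the lazy dog"
edited = "the quick brown cat jumps over the lazy dog"
for k in (1, 3, 5):
    print(k, chunk_overlap(original, edited, k))
```

With one word changed, the overlap drops from 0.875 at k=1 to about 0.57 at k=3 and 0.2 at k=5: larger chunks are less likely to match.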
|
|
16
|
- If a document consists of just a subset of another document, the standard vector space (VS) model may show low similarity
- Example: Cosine(D1, D2) = .61 for D1: <A, B, C> and D2: <A, B, C, D, E, F, G, H>
- Shivakumar and Garcia-Molina (95): use only close words in the VSM
- Close = _________________, defined by a tunable ε distance.
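The cosine value in the example can be checked directly with term-count vectors. The sketch below also computes a simple containment fraction to show that D1 is entirely inside D2. That second measure is only an illustration of the subset problem and is not the measure defined by Shivakumar and Garcia-Molina.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between the term-count vectors of two token lists."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def containment(a, b):
    """Fraction of the distinct terms of `a` that also occur in `b` (illustrative only)."""
    terms = set(a)
    return len(terms & set(b)) / len(terms) if terms else 0.0

d1 = list("ABC")
d2 = list("ABCDEFGH")
print(round(cosine(d1, d2), 2))   # 0.61
print(containment(d1, d2))        # 1.0
```

The dot product is 3 and the vector lengths are √3 and √8, so the cosine is 3/√24 ≈ 0.61 even though every term of D1 appears in D2.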
|
|
17
|
- Use stylistic rules to compile fingerprint:
- Commenting
- ___________
- Formatting
- Style (e.g., K&R)
- Use this along with program structure
- Edit distance
- What about hypertext structure?
    /***********************************
     * This function concatenates the first and
     * second string into the third string.
     *************************************/
    void strcat(char *string1, char *string2, char *string3)
    {
        char *ptr1, *ptr2;
        ptr2 = string3;
        /*
         * Copy first string
         */
        for(ptr1=string1;*ptr1;ptr1++) {
            *(ptr2++) = *ptr1;
        }

    /*
     * concatenate s2 to s1 into s3.
     * Enough memory for s3 must
     * already be allocated. No checks !!!!!!
     */
    mysc(s1, s2, s3)
    char *s1, *s2, *s3;
    {
        while (*s1)
            *s3++ = *s1++;
        while (*s2)
            *s3++ = *s2++;
    }
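Edit distance over program structure can be sketched by reducing each program to a sequence of structure tokens and computing the Levenshtein distance between the sequences. The token sequences below are hypothetical, written by hand for illustration and not produced by a parser. The unit cost for each insert, delete, and substitute operation is also an assumption.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (unit-cost insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x from a
                curr[j - 1] + 1,         # insert y into a
                prev[j - 1] + (x != y),  # substitute x with y, or match at no cost
            ))
        prev = curr
    return prev[-1]

# Hypothetical structure tokens for two programs (illustration only).
program_a = ["decl", "assign", "for", "copy"]
program_b = ["while", "copy", "while", "copy"]
print(edit_distance(program_a, program_b))
print(edit_distance("kitten", "sitting"))
```

A small distance relative to the sequence lengths suggests similar structure. Stylistic fingerprints such as commenting and formatting would be compared separately.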
|
|
18
|
- Find attributes that are _____ between texts for a collection, but _____ across different collections
- Difficult to scale up to many authors and many sources
- Most work only does pairwise comparison
- _______ may help as a first pass for plagiarism detection
|
|
19
|
- The Mosteller-Wallace method examines function words while Foster’s method uses key words. What are the advantages and disadvantages of these two different methods?
- What are the implications of an application that would emulate the wordprint of another author?
- What are some of the potential effects of being able to undo anonymity?
- Self-plagiarism is common in the scientific community. Should we condone this practice?
|
|
20
|
- Foster (00) Author Unknown. Owl Books. PE1421 Fos
- Biber (89) A typology of English texts. Linguistics, 27(3)
- Shivakumar & Garcia-Molina (95) SCAM: A copy detection mechanism for digital documents. Proc. of DL 95
- Mosteller & Wallace (63) Inference in an authorship problem. J. American Statistical Association, 58(3)
- Karlgren & Cutting (94) Recognizing Text Genres with Simple Metrics Using Discriminant Analysis. Proc. of COLING-94
- de Vel, Anderson, Corney & Mohay (01) Mining Email Content for Author Identification Forensics. SIGMOD Record
|