Computational Literary Analysis

Authorship Attribution and Plagiarism Detection

Module 9 Min-Yen KAN

The Federalist papers

A series of 85 papers written by Jay, Hamilton and Madison

Intended to help persuade voters to ratify the US constitution

Disputed papers of the Federalist

Most of the papers have attribution, but the authorship of 12 papers is disputed

Either Hamilton or Madison

Want to determine who wrote these papers

Also known as textual forensics

Wordprint and Stylistics

Claim: Authors leave a unique wordprint in the documents which they author

Claim: Authors also exhibit certain stylistic patterns in their publications

Feature Selection

Content-specific features (Foster 90)

key words, special characters

Style markers

Word- or character-based features (Yule 38)

length of words, vocabulary richness

Function words (Mosteller & Wallace 64)

Structural features

Email: Title or signature, paragraph separators (de Vel et al. 01)

Can generalize to HTML tags

To think about: artifact of authoring software?
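A minimal sketch of how the word-based style markers listed above might be computed. The function name `style_markers`, the tokenization rule, and the example word list are illustrative assumptions and are not taken from the cited studies.

```python
import re
from collections import Counter

def style_markers(text, function_words):
    """Compute a few simple word-based style markers for one text."""
    # Assumption: a token is a run of ASCII letters, optionally followed by
    # an apostrophe part (so "don't" is one token of length 5).
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    if not tokens:
        raise ValueError("text contains no word tokens")
    counts = Counter(tokens)
    n = len(tokens)
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return {
        # Length of words
        "mean_word_length": sum(len(t) for t in tokens) / n,
        # One possible vocabulary richness measure: share of word types
        # that occur exactly once
        "hapax_share": hapaxes / len(counts),
        # Function word rates per 1000 tokens
        "function_rates": {w: 1000.0 * counts[w] / n for w in function_words},
    }

markers = style_markers("the cat sat on the mat and the dog sat too",
                        ["the", "upon", "on"])
print(markers)
```

Rates per 1000 tokens make texts of different lengths comparable, which matters when documents vary in size.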

Bayes Theorem on function words

M & W examined the frequency of 100 function words

Smoothed these frequencies using a negative binomial (not Poisson) distribution

Used Bayes’ theorem and linear regression to find weights that fit the observed data

Sample words:

as do has is no or than this

at down have it not our that to

be even her its now shall the up
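A rough sketch of the idea, not Mosteller and Wallace's actual procedure: each function word count is modeled with a negative binomial distribution per author, and Bayes' theorem combines the per-word likelihood ratios into log odds. The rates, the shared dispersion value, and the independence of words are all assumptions made for illustration.

```python
import math

def neg_binomial_logpmf(k, mean, dispersion):
    """Log P(count = k) for a negative binomial with the given mean (> 0)
    and dispersion (the 'size' parameter, > 0)."""
    r = dispersion
    return (math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
            + r * math.log(r / (r + mean))
            + k * math.log(mean / (r + mean)))

def log_odds(counts, n_words, rates_a, rates_b, dispersion=5.0, prior_log_odds=0.0):
    """Log posterior odds of author A over author B.

    counts: observed count of each function word in the disputed text
    rates_a, rates_b: each author's rate per 1000 words (hypothetical values)
    Words are treated as independent, which is a simplifying assumption.
    """
    total = prior_log_odds
    for word, k in counts.items():
        mean_a = rates_a[word] * n_words / 1000.0
        mean_b = rates_b[word] * n_words / 1000.0
        total += (neg_binomial_logpmf(k, mean_a, dispersion)
                  - neg_binomial_logpmf(k, mean_b, dispersion))
    return total

# Hypothetical rates per 1000 words, invented for this example
RATES_A = {"upon": 3.0, "to": 40.0}
RATES_B = {"upon": 0.2, "to": 35.0}

favors_a = log_odds({"upon": 6, "to": 80}, 2000, RATES_A, RATES_B)
favors_b = log_odds({"upon": 0, "to": 70}, 2000, RATES_A, RATES_B)
print(favors_a, favors_b)
```

A positive value favors author A and a negative value favors author B. A word whose rate differs strongly between the authors contributes far more than a word both use at similar rates.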

A Funeral Elegy and Primary Colors

“Give anonymous offenders enough verbal rope and column inches, and they will hang themselves for you, every time” – Donald Foster in Author Unknown

A Funeral Elegy: Foster attributed this poem to W.S.

His submission was initially rejected, but he identified his anonymous reviewer

Foster also attributed Primary Colors to Newsweek columnist Joe Klein

Analyzes text mainly by hand

Foster’s features

Very large feature space, look for distinguishing features:

Topic words

Punctuation

__________________

Irregular spelling and grammar

Some specific features (most compound):

Adjectives ending with “y”: talky

Parenthetical connectives: … , then, …

Nouns ending with “mode”, “style”: crisis mode, outdoor-stadium style
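Foster works mainly by hand, but two of the specific features above are regular enough to count automatically. The patterns below are a rough sketch under that assumption; the names `MODE_STYLE`, `PARENTHETICAL_THEN` and `surface_features` are hypothetical, and the patterns overgenerate (for example, "in style" would also match).

```python
import re

# Any word (possibly hyphenated) followed by "mode" or "style"
MODE_STYLE = re.compile(r"\b([\w-]+) (mode|style)\b", re.IGNORECASE)
# "then" set off by commas on both sides
PARENTHETICAL_THEN = re.compile(r",\s*then\s*,", re.IGNORECASE)

def surface_features(text):
    """Collect candidate 'mode'/'style' compounds and count parenthetical 'then'."""
    return {
        "mode_style_compounds": [" ".join(m) for m in MODE_STYLE.findall(text)],
        "parenthetical_then": len(PARENTHETICAL_THEN.findall(text)),
    }

sample = ("The campaign was in crisis mode. "
          "It was, then, an outdoor-stadium style rally.")
features = surface_features(sample)
print(features)
```

Candidates found this way would still need manual review, since the patterns only look at surface form and not at part of speech.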

Typology of English texts

Five dimensions …

Involved vs. informational production

Narrative?

Explicit vs. situation-dependent

Persuasive?

Abstract?

… targeting these genres

Intimate, interpersonal interactions

Face-to-face conversations

Scientific exposition

Imaginative narrative

General narrative exposition

Features used (e.g., Dimension 1)

Biber also gives a feature inventory for each dimension

THAT deletion

Contractions

BE as main verb

WH questions

1st person pronouns

2nd person pronouns

General hedges

Nouns

Word Length

Prepositions

Type/Token Ratio

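[Figure: Dimension 1 scale running from 35 down to -20, with genres placed as follows, top to bottom: Face-to-face conversations (at 35); Personal letters (at 20); Interviews (between 20 and 15); Prepared speeches (between 5 and 0); General fiction (between 0 and -5); Editorials (at -10); Academic prose and Press reportage (at -15); Official documents (between -15 and -20).]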
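A simplified sketch of how a Dimension 1 score could be assembled from a handful of the features listed above. It assumes the score is the sum of standardized values for features associated with involved production minus those associated with informational production. The word lists, the feature subset, and the reference means and standard deviations are illustrative assumptions, not Biber's values.

```python
import re

FIRST_PERSON = {"i", "me", "my", "mine", "myself", "we", "us", "our", "ours", "ourselves"}
SECOND_PERSON = {"you", "your", "yours", "yourself", "yourselves"}
POSITIVE = ("contractions", "first_person", "second_person")   # involved end
NEGATIVE = ("word_length", "type_token_ratio")                 # informational end

def dimension1_features(text):
    """Extract a small subset of Dimension 1 features from raw text."""
    # Assumption: straight apostrophes only; possessives such as "Foster's"
    # are (wrongly) counted as contractions by this crude rule.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    n = len(tokens)
    if n == 0:
        raise ValueError("text contains no word tokens")
    return {
        "contractions": 1000.0 * sum("'" in t for t in tokens) / n,
        "first_person": 1000.0 * sum(t in FIRST_PERSON for t in tokens) / n,
        "second_person": 1000.0 * sum(t in SECOND_PERSON for t in tokens) / n,
        "word_length": sum(len(t.replace("'", "")) for t in tokens) / n,
        "type_token_ratio": len(set(tokens)) / n,
    }

def dimension1_score(features, means, stds):
    """Sum of z-scores of positive features minus sum of z-scores of negative ones."""
    z = {k: (features[k] - means[k]) / stds[k] for k in features}
    return sum(z[k] for k in POSITIVE) - sum(z[k] for k in NEGATIVE)

# Placeholder reference statistics; real ones would come from a corpus
MEANS = {k: 0.0 for k in POSITIVE + NEGATIVE}
STDS = {k: 1.0 for k in POSITIVE + NEGATIVE}

chatty = dimension1_features("I don't think you know me")
formal = dimension1_features("The committee reviewed the proposed regulations")
print(dimension1_score(chatty, MEANS, STDS), dimension1_score(formal, MEANS, STDS))
```

With these placeholder statistics the conversational sentence scores higher than the formal one, which matches the direction of the scale in the figure.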

Discriminant analysis for text genres

Karlgren and Cutting (94)

Same text genre categories as Biber

Simple count and average metrics

Discriminant analysis (in SPSS)

64% precision over four categories
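Karlgren and Cutting ran their analysis in SPSS. As a stand-in, the sketch below fits a two-class Fisher linear discriminant on two simple metrics in plain Python. The choice of metrics, the genre labels, and all data points are invented for illustration and do not reproduce their experiment.

```python
def mean2(rows):
    """Mean vector of a list of 2-feature rows."""
    n = len(rows)
    return [sum(r[i] for r in rows) / n for i in range(2)]

def scatter2(rows, m):
    """2x2 within-class scatter matrix around mean m."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - m[0], r[1] - m[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

def fit_fisher(class_a, class_b):
    """Return (weights, cutoff) so that w . x > cutoff predicts class A."""
    ma, mb = mean2(class_a), mean2(class_b)
    sa, sb = scatter2(class_a, ma), scatter2(class_b, mb)
    sw = [[sa[i][j] + sb[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    if det == 0:
        raise ValueError("pooled scatter matrix is singular")
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [ma[0] - mb[0], ma[1] - mb[1]]
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    # Cutoff at the projection of the midpoint between the class means
    cutoff = w[0] * (ma[0] + mb[0]) / 2 + w[1] * (ma[1] + mb[1]) / 2
    return w, cutoff

def predict(w, cutoff, x):
    return "A" if w[0] * x[0] + w[1] * x[1] > cutoff else "B"

# Invented (mean word length, pronouns per 1000 words) pairs
FICTION = [(4.0, 60.0), (4.2, 55.0), (3.9, 70.0)]      # class A
REPORTAGE = [(5.1, 10.0), (5.3, 15.0), (4.9, 8.0)]     # class B

w, cutoff = fit_fisher(FICTION, REPORTAGE)
print(predict(w, cutoff, (4.1, 58.0)), predict(w, cutoff, (5.2, 12.0)))
```

Handling four genre categories would need a multi-class extension, which this two-class sketch does not cover.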

Recent developments

Using machine learning techniques to assist genre analysis and authorship detection

Fung & Mangasarian (03) use SVMs and Bosch & Smith (98) use linear programming (LP) to confirm the claim that the disputed papers are Madison’s

They use counts of up to three sets of function words as their features

-0.5242 as + 0.8895 our + 4.9235 upon ≥ 4.7368

Many other studies out there…
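The separating plane quoted above can be evaluated directly. This sketch assumes that `as`, `our` and `upon` stand for each word's rate in a paper (for example per 1000 words, which the slide does not state). The slide also does not say which side of the plane belongs to which author, so the function only reports whether the inequality holds. The input rates are invented.

```python
WEIGHTS = {"as": -0.5242, "our": 0.8895, "upon": 4.9235}
THRESHOLD = 4.7368

def plane_value(rates):
    """Weighted sum of the three function word rates."""
    return sum(weight * rates[word] for word, weight in WEIGHTS.items())

def satisfies_inequality(rates):
    """True if the paper falls on the '>= threshold' side of the plane."""
    return plane_value(rates) >= THRESHOLD

# Hypothetical rates for two papers
paper_x = {"as": 8.0, "our": 1.0, "upon": 3.0}
paper_y = {"as": 8.0, "our": 1.0, "upon": 0.2}
print(satisfies_inequality(paper_x), satisfies_inequality(paper_y))
```

Because the weight on "upon" is much larger than the other two, the rate of that single word largely decides the side.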

Copy detection

Prevention: stop or disable the copying process

Detection: decide if one source is the same as another

Copy / duplicate detection

Compute signature for documents

Register signature of authority doc

Check a query doc against existing signature

Variations:

Length: document / sentence* / window

Signature: checksum / _______ / phrases
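A minimal sketch of the register-then-check flow above, using one of the listed variations: sentence-length chunks with a checksum signature. The `Registry` class, the normalization rule, and the choice of SHA-256 as the checksum are assumptions made for this example.

```python
import hashlib
import re

def sentence_signatures(text):
    """Return the set of checksums of the normalized sentences in text."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    signatures = set()
    for sentence in sentences:
        # Normalize case and punctuation so trivial edits do not change the checksum
        normalized = " ".join(re.findall(r"\w+", sentence.lower()))
        if normalized:
            signatures.add(hashlib.sha256(normalized.encode("utf-8")).hexdigest())
    return signatures

class Registry:
    """Stores signatures of registered (authority) documents."""

    def __init__(self):
        self._index = {}  # signature -> set of document ids

    def register(self, doc_id, text):
        for sig in sentence_signatures(text):
            self._index.setdefault(sig, set()).add(doc_id)

    def check(self, text):
        """Fraction of the query's sentences found in each registered document."""
        sigs = sentence_signatures(text)
        hits = {}
        for sig in sigs:
            for doc_id in self._index.get(sig, ()):
                hits[doc_id] = hits.get(doc_id, 0) + 1
        return {doc_id: count / len(sigs) for doc_id, count in hits.items()}

registry = Registry()
registry.register("orig", "The cat sat. The dog ran. Birds fly.")
result = registry.check("the cat sat! Something new.")
print(result)
```

A checksum only matches exact chunks after normalization, so a single changed word inside a sentence hides that sentence from this check.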

Granularity

Large chunks

Lower probability of match, higher threshold

Small chunks

Smaller number of unique chunks

Lower search complexity
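The effect of chunk size can be seen with a small experiment on word windows. The helper names `window_chunks` and `containment` are hypothetical, and the two sentences are invented. The query differs from the source by one word.

```python
def window_chunks(text, k):
    """Set of all k-word windows in text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def containment(query, source, k):
    """Fraction of the query's k-word chunks that also occur in the source."""
    query_chunks = window_chunks(query, k)
    if not query_chunks:
        return 0.0
    return len(query_chunks & window_chunks(source, k)) / len(query_chunks)

source = "the quick brown fox jumps over the lazy dog"
query = "the quick brown cat jumps over the lazy dog"
small = containment(query, source, 1)
large = containment(query, source, 3)
print(small, large)
```

With single-word chunks one substitution costs one chunk out of eight, while with three-word windows the same substitution breaks three of the seven windows, so larger chunks match less often.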

Subset problem

If a document consists of just a subset of another document, the standard vector space (VS) model may show low similarity

Example: D₁ = <A, B, C>, D₂ = <A, B, C, D, E, F, G, H>, Cosine(D₁, D₂) = .61

Shivakumar and Garcia-Molina (95): use only close words in VSM

Close = _________________, defined by a tunable ε distance.
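The cosine value in the subset example can be checked with a few lines, assuming each term occurs once so that every vector component is 0 or 1. The `cosine` helper is written for this example.

```python
import math
from collections import Counter

def cosine(terms_a, terms_b):
    """Cosine similarity between two bags of terms."""
    ca, cb = Counter(terms_a), Counter(terms_b)
    dot = sum(ca[t] * cb[t] for t in ca)  # Counter returns 0 for missing terms
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

d1 = ["A", "B", "C"]
d2 = ["A", "B", "C", "D", "E", "F", "G", "H"]
similarity = cosine(d1, d2)
print(round(similarity, 2))
```

The dot product is 3 and the norms are √3 and √8, giving 3/√24 ≈ 0.61, even though every term of D₁ appears in D₂.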

Computer program plagiarism

Use stylistic rules to compile fingerprint:

Commenting

___________

Formatting

Style (e.g., K&R)

Use this along with program structure

Edit distance

What about hypertext structure?

/***********************************
 * This function concatenates the first and
 * second string into the third string.
 ***********************************/
void strcat(char *string1, char *string2, char *string3)
{
    char *ptr1, *ptr2;
    ptr2 = string3;
    /*
     * Copy first string
     */
    for(ptr1=string1;*ptr1;ptr1++) {
        *(ptr2++) = *ptr1;
    }
    /*
     * Copy second string and terminate the result
     */
    for(ptr1=string2;*ptr1;ptr1++) {
        *(ptr2++) = *ptr1;
    }
    *ptr2 = '\0';
}

/*
 * concatenate s2 to s1 into s3.
 * Enough memory for s3 must already be allocated. No checks !!!!!!
 */
mysc(s1, s2, s3)
      char *s1, *s2, *s3;
{
  while (*s1)
      *s3++ = *s1++;
  while (*s2)
      *s3++ = *s2++;
  *s3 = '\0';
}
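One way to compare program structure with edit distance is to strip comments, replace every identifier with a placeholder, and compute the Levenshtein distance between the resulting token sequences. The sketch below takes that approach. The tokenizer, the small keyword list, and the function names are assumptions for this example and cover only a fragment of C.

```python
import re

KEYWORDS = {"char", "int", "void", "for", "while", "if", "else", "return"}

def structure_tokens(code):
    """Tokenize C-like code, dropping comments and masking identifiers as 'ID'."""
    code = re.sub(r"/\*.*?\*/", " ", code, flags=re.DOTALL)
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\+\+|--|\S", code)
    return [t if t in KEYWORDS or not re.match(r"[A-Za-z_]", t) else "ID"
            for t in tokens]

def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute or match
        prev = curr
    return prev[-1]

original = structure_tokens("while (*s1) *s3++ = *s1++;")
renamed = structure_tokens("while (*a) /* copy */ *c++ = *a++;")
print(edit_distance(original, renamed))
```

Renaming variables and adding comments leaves the masked token sequence unchanged, so such disguises do not increase the distance. A rewrite from a `while` loop to a `for` loop would.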

Conclusion

Find attributes that are _____ between texts for a collection, but _____ across different collections

Difficult to scale up to many authors and many sources

Most work only does pairwise comparison

_______ may help as a first pass for plagiarism detection

To think about…

The Mosteller-Wallace method examines function words while Foster’s method uses key words. What are the advantages and disadvantages of these two different methods?

What are the implications of an application that would emulate the wordprint of another author?

What are some of the potential effects of being able to undo anonymity?

Self-plagiarism is common in the scientific community. Should we condone this practice?

References

Foster (00) Author Unknown. Owl Books PE1421 Fos

Biber (89) A typology of English texts, Linguistics, 27(3)

Shivakumar & Garcia-Molina (95) SCAM: A copy detection mechanism for digital documents, Proc. of DL 95

Mosteller & Wallace (63) Inference in an authorship problem, J American Statistical Association 58(3)

Karlgren & Cutting (94) Recognizing Text Genres with Simple Metrics Using Discriminant Analysis, Proc. of COLING-94.

de Vel, Anderson, Corney & Mohay (01) Mining Email Content for Author Identification Forensics, SIGMOD Record

