Digital Libraries
Computational Literary Analysis, Duplicate and Plagiarism Detection
Week 9                Min-Yen KAN

Outline
Literary Analysis
Authorship detection
Genre classification
Duplicate Detection
Web pages
Plagiarism Detection
In text
In programs

The Federalist papers
A series of 85 papers written by Jay, Hamilton and Madison
Intended to help persuade voters to ratify the US constitution

Disputed papers of the Federalist
Most of the papers have attribution, but the authorship of 12 papers is disputed
Either Hamilton or Madison
Want to determine who wrote these papers
Also known as textual forensics

Wordprint and Stylistics
Claim: Authors leave a unique wordprint in the documents which they author
Claim: Authors also exhibit certain stylistic patterns in their publications

Feature Selection
Content-specific features (Foster 90)
key words, special characters
Style markers
Word- or character-based features
length of words, vocabulary richness
Function words (Mosteller & Wallace 63)
Structural features
Email: Title or signature, paragraph separators
(de Vel et al. 01)
Can generalize to HTML tags
To think about: artifact of authoring software?
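
A minimal sketch of two of the word-based style markers listed above (length of words, vocabulary richness as type/token ratio). The tokenizer regex and the name style_markers are illustrative assumptions, not taken from any cited paper.

```python
import re


def style_markers(text: str) -> dict:
    """Compute two simple, content-independent style markers."""
    # Assumption: a word is a run of letters or apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return {"avg_word_length": 0.0, "type_token_ratio": 0.0}
    return {
        # Mean number of characters per word token.
        "avg_word_length": sum(len(w) for w in words) / len(words),
        # Distinct words divided by total words (vocabulary richness).
        "type_token_ratio": len(set(words)) / len(words),
    }


sample = "The people of the state have the right to decide"
markers = style_markers(sample)
print(markers)
```

Type/token ratio depends on text length, so it is only comparable across samples of similar size.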

Bayes Theorem on function words
Mosteller & Wallace examined the frequency of 100 function words
Used Bayes’ theorem and linear regression to fit weights to the observed data
Sample words:
as do has is no or than this
at down have it not our that to
be even her its now shall the up
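
A rough sketch of scoring a disputed text by function-word usage. It uses the sample words above as the vocabulary and swaps in a plain multinomial naive Bayes with add-one smoothing and equal priors, which is a simplification and not the model Mosteller & Wallace fitted. The toy texts are invented and are not Federalist text.

```python
import math
import re
from collections import Counter

# The sample function words from the slide (24 of the 100).
FUNCTION_WORDS = ("as do has is no or than this "
                  "at down have it not our that to "
                  "be even her its now shall the up").split()


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def rates_per_thousand(text):
    """Occurrences of each function word per 1000 word tokens."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = max(len(tokens), 1)
    return {w: 1000.0 * counts[w] / total for w in FUNCTION_WORDS}


def log_odds(disputed, known_a, known_b):
    """Positive values favour author A, negative values favour author B."""
    def word_log_probs(text):
        counts = Counter(t for t in tokenize(text) if t in FUNCTION_WORDS)
        total = sum(counts.values()) + len(FUNCTION_WORDS)  # add-one smoothing
        return {w: math.log((counts[w] + 1) / total) for w in FUNCTION_WORDS}

    lp_a, lp_b = word_log_probs(known_a), word_log_probs(known_b)
    return sum(lp_a[t] - lp_b[t]
               for t in tokenize(disputed) if t in FUNCTION_WORDS)


# Invented toy texts, for illustration only.
author_a = "it is up to the people to do this now"
author_b = "our state shall have no more than that"
disputed = "it is to be done now"
score = log_odds(disputed, author_a, author_b)
print("favours A" if score > 0 else "favours B")
```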

A Funeral Elegy and Primary Colors
“Give anonymous offenders enough verbal rope and column inches, and they will hang themselves for you, every time” – Donald Foster in Author Unknown
A Funeral Elegy: Foster attributed this poem to W.S. (William Shakespeare)
His attribution was initially rejected, but he identified his anonymous reviewer
Foster also attributed Primary Colors to Newsweek columnist Joe Klein
Analyzes text mainly by hand

Foster’s features
Very large feature space, look for distinguishing features:
Topic words
Punctuation
Misused common words
Irregular spelling and grammar
Some specific features (most compound):
Adverbs ending with “y”: talky
Parenthetical connectives: … , then, …
Nouns ending with “mode”, “style”: crisis mode, outdoor-stadium style

Typology of English texts
Five dimensions …
Involved vs. informational production
Narrative?
Explicit vs. situation-dependent
Persuasive?
Abstract?
… targeting these genres
Intimate, interpersonal interactions
Face-to-face conversations
Scientific exposition
Imaginative narrative
General narrative exposition

Features used (e.g., Dimension 1)
Biber also gives a feature inventory for each dimension
THAT deletion
Contractions
BE as main verb
WH questions
1st person pronouns
2nd person pronouns
General hedges
Nouns
Word Length
Prepositions
Type/Token Ratio
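Figure: genres placed on the Dimension 1 scale (approximate positions read from the axis, 35 down to -20)
Face-to-face conversations: about 35
Personal letters: about 20
Interviews: between 20 and 15
Prepared speeches: between 5 and 0
General fiction: between 0 and -5
Editorials: about -10
Academic prose; Press reportage: about -15
Official documents: between -15 and -20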

Discriminant analysis for text genres
Karlgren and Cutting (94)
Same text genre categories as Biber
Simple count and average metrics
Discriminant analysis (using SPSS software)
64% precision over four categories

Genre vs. Subject (Lee & Myaeng 02)
Genre: style and purpose of text
Subject: content of text
What about the interaction between the two?
Study found that certain genres overlap significantly in subject vocabulary
So, want to use terms that cover more subjects represented by a genre
Do this by selecting terms that:
Appear in a large ratio of documents belonging to the genre
Appear evenly distributed among the subject classes that represent the genre
Discriminate this genre from others

Putting the constraints together

In summary…
Genre and authorship analysis rely on highly frequent evidence that is portable across document subjects.
Contrast with subject/text classification which looks for specific keywords as evidence.
References:
Mosteller & Wallace (63) Inference in an authorship problem, J American Statistical Association 58(3)
Karlgren & Cutting (94) Recognizing Text Genres with Simple Metrics Using Discriminant Analysis, Proc. of COLING-94.
de Vel, Anderson, Corney & Mohay (01) Mining Email Content for Author Identification Forensics, SIGMOD Record
Foster (00) Author Unknown. Owl Books PE1421 Fos
Biber (89) A typology of English texts, Linguistics, 27(3)
Lee and Myaeng (02) Text genre classification with genre-revealing and subject-revealing features, SIGIR 02

To think about…
The Mosteller-Wallace method examines function words while Foster’s method uses key words. What are the advantages and disadvantages of these two different methods?
What are the implications of an application that would emulate the wordprint of another author?
What are some of the potential effects of being able to undo anonymity?

Water Break
See you in five minutes!
I will hold a short tutorial for HW #2 at the end of class today.

Copy detection

Duplicate detection characteristics
Plagiarism
copies intentionally
may obfuscate
target and source relation
Self-plagiarism*
copy from one’s own work
Often done to supply background for incremental research
(near) Clone/duplicate
same functionality in code / citation data
but in different modules by different developers
Fragment
web page content generated by content manager
interferes with spiders’ re-sampling rate

Signature method
Register signature of authority doc
Check a query doc against existing signature
Flag down very similar documents
Some design choices have to be made:
How to compute a signature
How to judge similarity between signatures

Effect of granularity
Divide the document into smaller chunks
document – no division
sentence
window of n words
Large chunks
Lower probability of match, higher threshold
Small chunks
Smaller number of unique chunks
Lower search complexity
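
A small sketch of how chunk granularity changes what a signature match can see. Overlapping word windows, SHA-1 as the chunk checksum, and the helper names are assumptions made for illustration.

```python
import hashlib
import re


def sentence_chunks(text):
    """Split on whitespace that follows sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def window_chunks(text, n):
    """Overlapping windows of n words (the whole text if it is shorter)."""
    words = text.split()
    if len(words) < n:
        return [" ".join(words)] if words else []
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def signatures(chunks):
    """One checksum per chunk; case is folded before hashing."""
    return {hashlib.sha1(c.lower().encode("utf-8")).hexdigest() for c in chunks}


def overlap(query_chunks, registered_chunks):
    """Fraction of query chunk signatures that are already registered."""
    q, r = signatures(query_chunks), signatures(registered_chunks)
    return len(q & r) / len(q) if q else 0.0


registered = "the quick brown fox jumps over the lazy dog"
query = "a quick brown fox jumps over a sleeping cat"

whole_doc = overlap([query], [registered])
windows = overlap(window_chunks(query, 3), window_chunks(registered, 3))
print(whole_doc, windows)
```

With the document as a single chunk, the partial copy is missed entirely. With 3-word windows, three of the seven query chunks match.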

Signature methods
For text documents
Checksum
Keywords
N-gram (usually character) inventory
Grammatical phrases
For source code
Words, characters and lines
Halstead profile
(Ignores comments)
Operator histogram
e.g., frequency of each type sorted
Operand histogram
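
A hedged sketch of two of the signatures listed above: a character n-gram inventory for text, and a sorted operator histogram for C-like source code. The operator regex is a deliberately small assumption (for example, it would split "->" into two operators), and it ignores comments as the slide suggests.

```python
import re
from collections import Counter


def char_ngram_inventory(text, n=3):
    """Counts of every character n-gram after whitespace and case folding."""
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + n]
                   for i in range(len(normalized) - n + 1))


# Assumption: a short list of C operators, two-character forms tried first.
OPERATOR_PATTERN = re.compile(r"\+\+|--|==|!=|<=|>=|&&|\|\||[-+*/%=<>!&|]")


def operator_histogram(source):
    """Operator frequencies sorted from most to least frequent."""
    without_comments = re.sub(r"/\*.*?\*/", "", source, flags=re.DOTALL)
    counts = Counter(OPERATOR_PATTERN.findall(without_comments))
    return sorted(counts.values(), reverse=True)


code = "while (*s1) *s3++ = *s1++; /* copy */"
print(operator_histogram(code))          # [3, 2, 1]
print(char_ngram_inventory("abcab", 2))  # 'ab' occurs twice
```

Sorting the frequencies discards which operator produced which count, so the signature survives simple renaming but is also coarser.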

Distance calculations
Calculate distance between p1, p2
VSM: L1 distance $\sum_f |P_{f1} - P_{f2}|$
VSM: L2 Euclidean distance $\left(\sum_f |P_{f1} - P_{f2}|^2\right)^{1/2}$
Weighted feature combinations
For text features, can use edit distance
Calculate using dynamic programming
Detect and flag copies
Assume top n% as possible plagiarisms
Use a tuned similarity threshold
Other way: do tuning on supervised set
(learn weights for features: Bilenko and Mooney)
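
A compact sketch of the distance calculations above, with feature profiles represented as dictionaries (an assumed representation) and unit-cost edit distance computed by dynamic programming.

```python
import math


def l1_distance(p1, p2):
    """Sum of absolute feature differences; missing features count as 0."""
    features = set(p1) | set(p2)
    return sum(abs(p1.get(f, 0) - p2.get(f, 0)) for f in features)


def l2_distance(p1, p2):
    """Euclidean distance over the union of features."""
    features = set(p1) | set(p2)
    return math.sqrt(sum((p1.get(f, 0) - p2.get(f, 0)) ** 2
                         for f in features))


def edit_distance(a, b):
    """Unit-cost insert/delete/substitute distance, one DP row at a time."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute or match
        prev = curr
    return prev[-1]


p1 = {"a": 1, "b": 2}
p2 = {"a": 3}
print(l1_distance(p1, p2), l2_distance(p1, p2))
print(edit_distance("kitten", "sitting"))
```

edit_distance works on any sequences, so it can compare lists of chunk signatures as well as strings.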

Subset problem
Problem: If a document is just a subset of another document, the standard VS model may show low similarity
Example: cosine (D1,D2) = .61
D1: <A, B, C>,
D2: <A, B, C, D, E, F, G, H>
Shivakumar and Garcia-Molina (95): use only close words in VSM
Close = comparable frequency, defined by a tunable ε distance.
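
The example above can be checked directly, together with a simplified asymmetric measure in the spirit of Shivakumar and Garcia-Molina. The closeness test (frequency ratio plus its inverse below ε), the value ε = 2.5, and the cap at 1.0 are assumptions of this sketch, not a faithful reimplementation of SCAM.

```python
import math
from collections import Counter


def cosine(d1, d2):
    dot = sum(d1[w] * d2[w] for w in d1.keys() & d2.keys())
    norm = (math.sqrt(sum(v * v for v in d1.values()))
            * math.sqrt(sum(v * v for v in d2.values())))
    return dot / norm if norm else 0.0


def close_words(d1, d2, epsilon=2.5):
    """Shared words whose frequencies in the two documents are comparable."""
    return {w for w in d1.keys() & d2.keys()
            if d1[w] / d2[w] + d2[w] / d1[w] < epsilon}


def subset_measure(d1, d2, epsilon=2.5):
    """How much of d1 is covered by d2, normalized by d1 alone."""
    denom = sum(v * v for v in d1.values())
    if not denom:
        return 0.0
    covered = sum(d1[w] * d2[w] for w in close_words(d1, d2, epsilon))
    return min(covered / denom, 1.0)


def containment_similarity(d1, d2, epsilon=2.5):
    return max(subset_measure(d1, d2, epsilon),
               subset_measure(d2, d1, epsilon))


D1 = Counter(["A", "B", "C"])
D2 = Counter(["A", "B", "C", "D", "E", "F", "G", "H"])
print(round(cosine(D1, D2), 2))        # 0.61
print(containment_similarity(D1, D2))  # 1.0
```

Normalizing by only one document's length is what lets a fully contained document score 1.0 even though the cosine is low.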

R-measure: amount repeated in other documents (Khmelev and Teahan)
Normalized sum of lengths of all suffixes of the text repeated in other documents
where Q(S|T1…Tn) = length of longest prefix of S repeated in any one document
Computed easily using suffix array data structure
More effective than simple longest common substring

R-measure example
T = cat_sat_on
T1 = the_cat_on_a_mat
T2 = the_cat_sat
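
A naive sketch that works through this example. It assumes the normalization from Khmelev and Teahan's paper, where the squared measure is 2 / (l(l+1)) times the sum of Q over all suffixes of T, with l the length of T. The quadratic-time search here stands in for the suffix array mentioned above.

```python
import math


def longest_repeated_prefix(s, documents):
    """Q(S | T1..Tn): longest prefix of s found inside any one document."""
    best = 0
    for doc in documents:
        k = 0
        while k < len(s) and s[:k + 1] in doc:
            k += 1
        best = max(best, k)
    return best


def suffix_scores(text, documents):
    """Q for every suffix of text, from the longest suffix to the shortest."""
    return [longest_repeated_prefix(text[i:], documents)
            for i in range(len(text))]


def r_squared(text, documents):
    """Assumed normalization: 2 / (l * (l + 1)) times the sum of Q values."""
    l = len(text)
    if l == 0:
        return 0.0
    return 2.0 * sum(suffix_scores(text, documents)) / (l * (l + 1))


T = "cat_sat_on"
T1 = "the_cat_on_a_mat"
T2 = "the_cat_sat"

scores = suffix_scores(T, [T1, T2])
print(scores)                                   # [7, 6, 5, 4, 3, 5, 4, 3, 2, 1]
print(round(r_squared(T, [T1, T2]), 3))         # 0.727
print(round(math.sqrt(r_squared(T, [T1, T2])), 3))
```

The first five suffixes are matched by T2 ("cat_sat") and the last five by T1 ("at_on"), so every character of T is covered by some repeat even though neither document contains T in full.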

Computer program plagiarism
Use stylistic rules to compile fingerprint:
Commenting
Variable names
Formatting
Style (e.g., K&R)
Use this along with program structure
Edit distance
/***********************************
* This function concatenates the first and
* second string into the third string.
*************************************/
void strcat(char *string1, char *string2, char *string3)
{
 char *ptr1, *ptr2;
 ptr2 = string3;
/*
 * Copy first string
 */
for(ptr1=string1;*ptr1;ptr1++) {
*(ptr2++) = *ptr1;
}
/*
 * Copy second string
 */
for(ptr1=string2;*ptr1;ptr1++) {
*(ptr2++) = *ptr1;
}
}
/*
 * concatenate s2 to s1 into s3.
 * Enough memory for s3 must already be allocated. No checks !!!!!!
 */
mysc(s1, s2, s3)
      char *s1, *s2, *s3;
{
  while (*s1)
    *s3++ = *s1++;
  while (*s2)
    *s3++ = *s2++;
}

Design-based methods
Idea: capture syntactic and semantic flow rather than token identity (for source code)
Replace variable names with IDs correlated with symbol table and data type
Decompose each p into regions of
sequential statements
conditionals
looping blocks – recurse on these
Calculate similarity from root node downwards
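
A minimal sketch of the variable-renaming step only. It assigns IDs in order of first appearance, which is an assumption that stands in for the symbol table and data type information a real design-based tool would use. The keyword list and token regex are also simplified.

```python
import re

# Assumption: a tiny keyword list, enough for the examples below.
C_KEYWORDS = {"char", "int", "void", "while", "for", "if", "else", "return"}
TOKEN = re.compile(r"[A-Za-z_]\w*|\+\+|--|\S")


def normalize_identifiers(source):
    """Replace each identifier with ID0, ID1, ... by first appearance."""
    ids = {}
    tokens = []
    for tok in TOKEN.findall(source):
        if re.match(r"[A-Za-z_]", tok) and tok not in C_KEYWORDS:
            ids.setdefault(tok, f"ID{len(ids)}")
            tokens.append(ids[tok])
        else:
            tokens.append(tok)
    return tokens


original = "while (*s1) *s3++ = *s1++;"
renamed = "while (*src) *dst++ = *src++;"

print(normalize_identifiers(original))
print(normalize_identifiers(original) == normalize_identifiers(renamed))
```

After normalization the two loops produce the same token sequence, so renaming variables alone no longer hides the copy.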

Recursive region coding

Fragments of a web page
Which are duplicated?  Changed?

Defining fragments
Base case: each web page is a fragment
Inductive step: each part of a fragment is also a fragment if
Shared: it is shared among at least n other fragments (n > 1) and is not subsumed by a parent fragment
Different: it changes at a different rate than fragments containing it
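
A rough sketch of the "shared" condition only. It assumes each page has already been split into candidate blocks (the page names and block strings here are hypothetical) and ignores both subsumption by a parent fragment and the change-rate condition.

```python
from collections import defaultdict


def shared_fragments(pages, min_pages=2):
    """Blocks that occur on at least min_pages pages, with those pages."""
    containing = defaultdict(set)
    for name, blocks in pages.items():
        for block in blocks:
            containing[block].add(name)
    return {block: sorted(names)
            for block, names in containing.items()
            if len(names) >= min_pages}


# Hypothetical pages, each already split into candidate blocks.
pages = {
    "p1": ["header", "story one", "footer"],
    "p2": ["header", "story two", "footer"],
    "p3": ["header", "story three"],
}
shared = shared_fragments(pages)
print(shared)
```

The threshold is a parameter, so the slide's n can be passed in directly.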

Conclusion
Signature-based methods are common; design-based methods assume domain knowledge.
The importance of granularity and ordering changes between domains
Difficult to scale up
Most work only does pairwise comparison
Low complexity clustering may help as a first pass
References
Belkouche et al. (04) Plagiarism Detection in Software Designs, ACM Southeast Conference
Shivakumar & Garcia-Molina (95) SCAM: A copy detection mechanism for digital documents, Proc. of DL 95.
Bilenko and Mooney (03) Adaptive duplicate detection using learnable string similarity measures, Proc. of KDD 03.
Khmelev and Teahan (03) A repetition based measure for verification of text collections and for text categorization, Proc. SIGIR 03
Ramaswamy et al. (04) Automatic detection of fragments in dynamically generated web pages, Proc. WWW 04.

To think about…
How can duplicate detection algorithms be freed from pairwise comparisons?
What size chunk would you use for signature based methods for images, music, video? Would you encode a structural dependency as well (ordering using edit distance) or not (bag of chunks using VSM) for these other media types?