1
Digital Libraries
  • Computational Literary Analysis, Duplicate and Plagiarism Detection


  • Week 9, Min-Yen KAN
2
Outline
  • Literary Analysis
    • Authorship detection
    • Genre classification
  • Duplicate Detection
    • Web pages
  • Plagiarism Detection
    • In text
    • In programs


3
The Federalist papers
  • A series of 85 papers written by Jay, Hamilton and Madison


  • Intended to help persuade voters to ratify the US constitution
4
Disputed papers of the Federalist
  • Most of the papers have attribution, but the authorship of 12 papers is disputed
    • Either Hamilton or Madison

  • Want to determine who wrote these papers
    • Also known as textual forensics
5
Wordprint and Stylistics
  • Claim: Authors leave a unique wordprint in the documents which they author


  • Claim: Authors also exhibit certain stylistic patterns in their publications
6
Feature Selection
  • Content-specific features (Foster 90)
    • key words, special characters

  • Style markers
    • Word- or character-based features
      • length of words, vocabulary richness
    • Function words (Mosteller & Wallace 64)


  • Structural features
    • Email: Title or signature, paragraph separators
      (de Vel et al. 01)
    • Can generalize to HTML tags
    • To think about: artifact of authoring software?
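
The word-based style markers listed above can be computed with a few lines of code. The following is a minimal sketch, assuming a crude regular-expression tokenizer and a short, hypothetical function-word list chosen only for illustration; it is not the feature set of any of the cited studies.

```python
import re
from collections import Counter

# Hypothetical, abbreviated function-word list, for illustration only.
FUNCTION_WORDS = ("as", "at", "be", "do", "has", "have", "is", "it",
                  "no", "not", "or", "than", "that", "the", "this", "to", "up")

def style_markers(text):
    """Return simple word-based style markers for one text."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        raise ValueError("text contains no words")
    counts = Counter(words)
    return {
        # Average number of characters per word token.
        "mean_word_length": sum(len(w) for w in words) / len(words),
        # Type/token ratio as a crude vocabulary-richness measure.
        "type_token_ratio": len(counts) / len(words),
        # Occurrences per 1000 words for each function word.
        "function_word_rates": {
            w: 1000.0 * counts[w] / len(words) for w in FUNCTION_WORDS
        },
    }

markers = style_markers("The cat sat on the mat. It is not the dog.")
print(markers["type_token_ratio"])
```

Type/token ratio depends strongly on text length, so comparisons are only meaningful between samples of similar size.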
7
Bayes’ theorem on function words
  • Mosteller & Wallace examined the frequency of 100 function words





  • Used Bayes’ theorem and linear regression to find weights to fit for observed data


  • Sample words: as, at, be, do, down, even, has, have, her, is, it, its, no, not, now, or, our, shall, than, that, the, this, to, up
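
One simple way to see how Bayes' theorem turns function-word counts into an attribution is sketched below. It assumes each word's count follows a Poisson distribution with an author-specific rate, which is a simplification made for this example and not a reproduction of the study's procedure. The function name, the author labels A and B, and all rates are invented for illustration.

```python
import math

def poisson_log_odds(counts, n_words, rates_a, rates_b, prior_odds=1.0):
    """Log odds that author A (rather than B) wrote a text.

    counts:  observed count of each marker word in the text
    n_words: length of the text in words
    rates_a, rates_b: each author's usage rate per 1000 words
    """
    scale = n_words / 1000.0
    log_odds = math.log(prior_odds)
    for word, k in counts.items():
        lam_a = rates_a[word] * scale  # expected count under author A
        lam_b = rates_b[word] * scale  # expected count under author B
        # Log of the Poisson likelihood ratio; the k! terms cancel.
        log_odds += k * math.log(lam_a / lam_b) - (lam_a - lam_b)
    return log_odds

# Toy rates per 1000 words, invented for illustration (not from the study).
rates_a = {"to": 40.0, "up": 7.0}
rates_b = {"to": 35.0, "up": 11.0}
observed = {"to": 70, "up": 24}

score = poisson_log_odds(observed, 2000, rates_a, rates_b)
print("favors A" if score > 0 else "favors B")
```

With these toy numbers the observed counts sit closer to B's rates, so the log odds come out negative and the text is attributed to B. Each word contributes independently, which is why many weak function-word cues can add up to strong evidence.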
8
A Funeral Elegy and Primary Colors
  • “Give anonymous offenders enough verbal rope and column inches, and they will hang themselves for you, every time” – Donald Foster in Author Unknown
  • A Funeral Elegy: Foster attributed this poem to W.S.
    • Initially rejected, but Foster identified his anonymous reviewer
  • Foster also attributed Primary Colors to Newsweek columnist Joe Klein


  • Analyzes text mainly by hand
9
Foster’s features
  • Very large feature space, look for distinguishing features:
    • Topic words
    • Punctuation
    • Misused common words
    • Irregular spelling and grammar


  • Some specific features (most compound):
    • Adverbs ending with “y”: talky
    • Parenthetical connectives: … , then, …
    • Nouns ending with “mode”, “style”: crisis mode, outdoor-stadium style

10
Typology of English texts
  • Five dimensions …
    • Involved vs. informational production
    • Narrative?
    • Explicit vs. situation-dependent
    • Persuasive?
    • Abstract?
  • … targeting these genres
    • Intimate, interpersonal interactions
    • Face-to-face conversations
    • Scientific exposition
    • Imaginative narrative
    • General narrative exposition
11
Features used (e.g., Dimension 1)
  • Biber also gives a feature inventory for each dimension


  • THAT deletion
  • Contractions
  • BE as main verb
  • WH questions
  • 1st person pronouns
  • 2nd person pronouns
  • General hedges
  • Nouns
  • Word Length
  • Prepositions
  • Type/Token Ratio


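  • [Figure: genres placed along the Dimension 1 scale, which runs from about 35 (involved) to about -20 (informational). Approximate positions as read from the scale: face-to-face conversations 35; personal letters and interviews 15 to 20; prepared speeches 0 to 5; general fiction -5 to 0; editorials -10; academic prose and press reportage -15; official documents -20 to -15.]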
12
Discriminant analysis for text genres
  • Karlgren and Cutting (94)
    • Same text genre categories as Biber
    • Simple count and average metrics
    • Discriminant analysis (using SPSS software)
    • 64% precision over four categories
13
Genre vs. Subject (Lee & Myaeng 02)
  • Genre: style and purpose of text
  • Subject: content of text
  • What about the interaction between the two?


  • Study found that certain genres overlap significantly in subject vocabulary
  • So, want to use terms that cover more subjects represented by a genre
  • Do this by selecting terms that:
    • Appear in a large ratio of documents belonging to the genre
    • Appear evenly distributed among the subject classes that represent the genre
    • Discriminate this genre from others
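
The three selection criteria above can be combined into a single term score. The sketch below is one possible reading, not the formula from the paper: coverage is the share of the genre's documents containing the term, evenness is the normalized entropy of the term's per-subject document ratios, and discrimination compares coverage inside and outside the genre. The function name, the multiplication of the three factors, and the tiny corpus are all assumptions made for illustration.

```python
import math
from collections import defaultdict

def genre_term_score(docs, genre, term):
    """Score how well `term` reveals `genre` under the three criteria.

    docs: list of (genre, subject, set_of_terms) triples.
    How each criterion is measured, and that they are multiplied,
    are assumptions of this sketch.
    """
    in_genre = [d for d in docs if d[0] == genre]
    others = [d for d in docs if d[0] != genre]
    if not in_genre:
        return 0.0

    # 1. Ratio of the genre's documents that contain the term.
    coverage = sum(term in d[2] for d in in_genre) / len(in_genre)
    if coverage == 0:
        return 0.0

    # 2. Evenness across the genre's subject classes: normalized entropy
    #    of the per-subject document ratios (1.0 means perfectly even).
    by_subject = defaultdict(list)
    for _, subject, terms in in_genre:
        by_subject[subject].append(term in terms)
    ratios = [sum(v) / len(v) for v in by_subject.values()]
    total = sum(ratios)
    probs = [r / total for r in ratios if r > 0]
    if len(by_subject) > 1:
        entropy = sum(-p * math.log(p) for p in probs)
        evenness = entropy / math.log(len(by_subject))
    else:
        evenness = 1.0

    # 3. Discrimination: the term should be rarer outside the genre.
    outside = sum(term in d[2] for d in others) / len(others) if others else 0.0
    discrimination = coverage / (coverage + outside)

    return coverage * evenness * discrimination

# Tiny made-up corpus: two genres, two subjects.
corpus = [
    ("editorial", "politics", {"should", "government"}),
    ("editorial", "sports", {"should", "team"}),
    ("report", "politics", {"government", "said"}),
    ("report", "sports", {"team", "said"}),
]
genre_word = genre_term_score(corpus, "editorial", "should")
subject_word = genre_term_score(corpus, "editorial", "government")
```

Here "should" appears in every editorial regardless of subject and never in reports, so it scores high, while "government" is tied to one subject and scores zero for the genre.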
14
Putting the constraints together
15
In summary…
  • Genre and authorship analysis rely on highly frequent evidence that is portable across document subjects.
  • Contrast with subject/text classification which looks for specific keywords as evidence.


  • References:
  • Mosteller & Wallace (63) Inference in an authorship problem, J American Statistical Association 58(3)
  • Karlgren & Cutting (94) Recognizing Text Genres with Simple Metrics Using Discriminant Analysis, Proc. of COLING-94.
  • de Vel, Anderson, Corney & Mohay (01) Mining Email Content for Author Identification Forensics, SIGMOD Record
  • Foster (00) Author Unknown. Owl Books PE1421 Fos
  • Biber (89) A typology of English texts, Linguistics, 27(3)
  • Lee and Myaeng (02) Text genre classification with genre-revealing and subject-revealing features, SIGIR 02


16
To think about…
  • The Mosteller-Wallace method examines function words while Foster’s method uses key words. What are the advantages and disadvantages of these two different methods?


  • What are the implications of an application that would emulate the wordprint of another author?


  • What are some of the potential effects of being able to undo anonymity?


17
Water Break
  • See you in five minutes!


  • I will hold a short tutorial for HW #2 at the end of class today.
18
Copy detection
19
Duplicate detection characteristics
  • Plagiarism
    • copies intentionally
    • may obfuscate
    • target and source relation

  • Self-plagiarism*
    • copy from one’s own work
    • Often to offer for background of work in incremental research
  • (near) Clone/duplicate
    • same functionality in code / citation data
    • but in different modules by different developers
  • Fragment
    • web page content generated by content manager
    • interferes with spiders’ re-sampling rate

20
Signature method
  • Register signature of authority doc
  • Check a query doc against existing signature
  • Flag down very similar documents


  • Some design choices have to be made:
  • How to compute a signature
  • How to judge similarity between signatures
21
Effect of granularity
  • Divide the document into smaller chunks
    • Document: no division
    • Sentence
    • Window of n words


  • Large chunks
    • Lower probability of match, higher threshold


  • Small chunks
    • Smaller number of unique chunks
    • Lower search complexity
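
The register-and-check cycle and the effect of chunk size can be tried out with a small sketch. It assumes chunks are overlapping windows of n words, a signature is the set of SHA-1 hashes of those chunks, and similarity is the share of query chunks found in the registered signature. The function names and these choices are illustrative assumptions, not a specific published system.

```python
import hashlib

def signature(text, n=3):
    """Set of hashes of every overlapping window of n words."""
    words = text.lower().split()
    if len(words) < n:
        windows = [" ".join(words)]
    else:
        windows = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return {hashlib.sha1(w.encode("utf-8")).hexdigest() for w in windows}

def overlap(query_sig, registered_sig):
    """Fraction of the query's chunks that also occur in the registered doc."""
    return len(query_sig & registered_sig) / len(query_sig)

registered_text = "the quick brown fox jumps over the lazy dog"
query_text = "a quick brown fox jumps over a sleeping cat"

# Larger chunks: fewer matches for the same pair of documents.
print(overlap(signature(query_text, 3), signature(registered_text, 3)))
# Single-word chunks: more matches, fewer distinct chunks to store.
print(overlap(signature(query_text, 1), signature(registered_text, 1)))
```

With three-word windows, 3 of the query's 7 chunks match; with single words, 5 of 8 do. The same pair of documents therefore needs a different flagging threshold at each granularity.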
22
Signature methods
  • For text documents
  • Checksum
  • Keywords
  • N-gram (usually character) inventory
  • Grammatical phrases


  • For source code
  • Words, characters and lines
  • Halstead profile
    • (Ignores comments)
    • Operator histogram
      • e.g., frequency of each type sorted
    • Operand histogram
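
An operator and operand histogram of the kind listed above can be approximated with Python's standard `tokenize` module. This is a sketch for Python source only, under the assumption that keywords and punctuation count as operators while identifiers and literals count as operands; a full Halstead profile defines these classes per language. The function names are hypothetical.

```python
import io
import keyword
import tokenize
from collections import Counter

def halstead_profile(source):
    """Operator and operand histograms for Python source; comments are skipped."""
    operators, operands = Counter(), Counter()
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.OP:
            operators[tok.string] += 1
        elif tok.type == tokenize.NAME:
            # Keywords act as operators; other names are operands.
            target = operators if keyword.iskeyword(tok.string) else operands
            target[tok.string] += 1
        elif tok.type in (tokenize.NUMBER, tokenize.STRING):
            operands[tok.string] += 1
    return operators, operands

def sorted_frequencies(histogram):
    """Frequencies from most to least common, ignoring the token names."""
    return sorted(histogram.values(), reverse=True)

src_a = "total = 0\nfor x in items:  # sum up\n    total = total + x\n"
src_b = "s = 0\nfor item in data:\n    s = s + item  # accumulate\n"

ops_a, opnds_a = halstead_profile(src_a)
ops_b, opnds_b = halstead_profile(src_b)
print(sorted_frequencies(opnds_a) == sorted_frequencies(opnds_b))
```

Because only sorted frequencies are compared, renaming variables and rewriting comments leaves the profile unchanged, which is what makes it usable as a signature for source code.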

23
Distance calculations
  • Calculate distance between p1, p2
  • VSM: L1 distance $\sum_f |P_{f1} - P_{f2}|$
  • VSM: L2 (Euclidean) distance $\left(\sum_f |P_{f1} - P_{f2}|^2\right)^{1/2}$
  • Weighted feature combinations
  • For text features, can use edit distance
    • Calculate using dynamic programming

  • Detect and flag copies
  • Assume top n% as possible plagiarisms
  • Use a tuned similarity threshold
  • Other way: do tuning on supervised set
    (learn weights for features: Bilenko and Mooney)
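
The distance measures above are short to write down. The sketch below assumes each profile is a dictionary mapping a feature to its value, with missing features treated as zero, and uses unit-cost Levenshtein distance as the edit distance; these representation choices are assumptions for illustration.

```python
import math

def l1_distance(p1, p2):
    """Sum of absolute per-feature differences between two profiles."""
    features = set(p1) | set(p2)
    return sum(abs(p1.get(f, 0) - p2.get(f, 0)) for f in features)

def l2_distance(p1, p2):
    """Euclidean distance between two profiles."""
    features = set(p1) | set(p2)
    return math.sqrt(sum((p1.get(f, 0) - p2.get(f, 0)) ** 2 for f in features))

def edit_distance(a, b):
    """Levenshtein distance by dynamic programming, keeping one row at a time."""
    prev = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        curr = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

profile_1 = {"a": 3, "b": 1}
profile_2 = {"a": 1, "c": 2}
print(l1_distance(profile_1, profile_2))   # 5
print(l2_distance(profile_1, profile_2))   # 3.0
print(edit_distance("kitten", "sitting"))  # 3
```

`edit_distance` works on any sequences, so it can compare lists of words or tokens as well as strings of characters.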
24
Subset problem
  • Problem: If a document is just a subset of another document, the standard vector space model may show low similarity
    • Example: cosine (D1,D2) = .61
      D1: <A, B, C>,
      D2: <A, B, C, D, E, F, G, H>


  • Shivakumar and Garcia-Molina (95): use only close words in VSM
    • Close = comparable frequency, defined by a tunable ε distance.
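
The cosine value in the example above can be checked directly. The sketch assumes binary term weights for D1 and D2. The `containment` function is added only for contrast; it is an illustrative asymmetric measure and not the SCAM similarity measure.

```python
import math

def cosine(d1, d2):
    """Cosine similarity between two term-weight dictionaries."""
    dot = sum(w * d2.get(t, 0) for t, w in d1.items())
    norm1 = math.sqrt(sum(w * w for w in d1.values()))
    norm2 = math.sqrt(sum(w * w for w in d2.values()))
    return dot / (norm1 * norm2)

def containment(d1, d2):
    """Share of d1's terms that also appear in d2 (asymmetric)."""
    return sum(1 for t in d1 if t in d2) / len(d1)

D1 = dict.fromkeys("ABC", 1)
D2 = dict.fromkeys("ABCDEFGH", 1)

print(round(cosine(D1, D2), 2))  # 0.61
print(containment(D1, D2))       # 1.0
```

Cosine is pulled down by the five terms that occur only in D2, even though every term of D1 is copied. An asymmetric measure reports the full overlap.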
25
R-measure: amount repeated in other documents (Khmelev and Teahan)
  • Normalized sum of lengths of all suffixes of the text repeated in other documents



  • where Q(S|T1…Tn) = length of longest prefix of S repeated in any one document


    • Computed easily using suffix array data structure
    • More effective than simple longest common substring
26
R-measure example
  • T = cat_sat_on
  • T1 = the_cat_on_a_mat
  • T2 = the_cat_sat
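
The example can be worked through in code. The sketch below assumes the normalizing factor is 2 / (l(l+1)) for a text of length l, so that a text repeated in full scores 1. It uses naive substring search instead of a suffix array to keep the code short.

```python
def longest_repeated_prefix(s, documents):
    """Q(S | T1..Tn): length of the longest prefix of s found in any one document."""
    best = 0
    for doc in documents:
        length = 0
        # If a prefix is absent from doc, every longer prefix is absent too.
        while length < len(s) and s[:length + 1] in doc:
            length += 1
        best = max(best, length)
    return best

def r_measure(text, documents):
    """Normalized sum of Q over every suffix of text."""
    n = len(text)
    total = sum(longest_repeated_prefix(text[i:], documents) for i in range(n))
    return 2.0 * total / (n * (n + 1))

T = "cat_sat_on"
others = ["the_cat_on_a_mat", "the_cat_sat"]

# Q for each suffix: 7, 6, 5, 4, 3 (found in T2) and 5, 4, 3, 2, 1 (found in T1).
print(round(r_measure(T, others), 3))  # 0.727
```

The Q values sum to 40 and the normalizer is 2 / (10 × 11), giving 80 / 110 ≈ 0.727. The longest common substring alone ("cat_sat", length 7) would miss the second repeated region "at_on".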



27
Computer program plagiarism
  • Use stylistic rules to compile fingerprint:
    • Commenting
    • Variable names
    • Formatting
    • Style (e.g., K&R)
  • Use this along with program structure
    • Edit distance


  /***********************************
  * This function concatenates the first and
  * second string into the third string.
  *************************************/
  void strcat(char *string1, char *string2, char *string3)
  {
   char *ptr1, *ptr2;
   ptr2 = string3;
  /*
   * Copy first string
   */
  for(ptr1=string1;*ptr1;ptr1++) {
  *(ptr2++) = *ptr1;
  }




  /*
   * concatenate s2 to s1 into s3.
   * Enough memory for s3 must already be allocated. No checks !!!!!!
   */
  mysc(s1, s2, s3)
        char *s1, *s2, *s3;
  {
    while (*s1)
      *s3++ = *s1++;


    while (*s2)
      *s3++ = *s2++;
  }


28
Design-based methods
  • Idea: capture syntactic and semantic flow rather than token identity (for source code)


  • Replace variable names with IDs correlated with symbol table and data type
  • Decompose each program p into regions of
    • sequential statements
    • conditionals
    • looping blocks – recurse on these
  • Calculate similarity from root node downwards
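
The identifier-replacement step can be illustrated with Python's standard `ast` module. The sketch covers only part of the idea above: it renames identifiers to positional IDs and compares whole syntax trees for equality. It assumes Python source, does not track data types, and does not compute a graded region-by-region similarity. The class and function names are hypothetical.

```python
import ast

class _Normalizer(ast.NodeTransformer):
    """Rename identifiers to v0, v1, ... in order of first appearance."""

    def __init__(self):
        self.ids = {}

    def _id(self, name):
        return self.ids.setdefault(name, f"v{len(self.ids)}")

    def visit_Name(self, node):
        node.id = self._id(node.id)
        return node

    def visit_arg(self, node):
        node.arg = self._id(node.arg)
        return node

    def visit_FunctionDef(self, node):
        node.name = self._id(node.name)
        self.generic_visit(node)  # recurse into arguments and nested blocks
        return node

def structure(source):
    """Identifier-independent dump of the program's syntax tree."""
    tree = _Normalizer().visit(ast.parse(source))
    return ast.dump(tree)  # line numbers are omitted by default

prog_a = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
prog_b = ("def add_all(values):\n    acc = 0  # running sum\n"
          "    for v in values:\n        acc += v\n    return acc\n")
prog_c = "def size(xs):\n    return len(xs)\n"

print(structure(prog_a) == structure(prog_b))  # True
print(structure(prog_a) == structure(prog_c))  # False
```

Renamed variables, changed comments, and reformatting do not affect the comparison, because none of them survive in the normalized tree.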



29
Recursive region coding
30
Fragments of a web page
  • Which are duplicated?  Changed?
31
Defining fragments
  • Base case: each web page is a fragment
  • Inductive step: each part of a fragment is also a fragment if
    • Shared: it is shared among at least n other fragments (n > 1) and is not subsumed by a parent fragment
    • Different: it changes at a different rate than fragments containing it
32
Conclusion
  • Signature-based methods are common; design-based methods assume domain knowledge.
    • The importance of granularity and ordering changes between domains
  • Difficult to scale up
    • Most work only does pairwise comparison
    • Low complexity clustering may help as a first pass


  • References
  • Belkhouche et al. (04) Plagiarism Detection in Software Designs, ACM Southeast Conference
  • Shivakumar & Garcia-Molina (95) SCAM: A copy detection mechanism for digital documents, Proc. of DL 95.
  • Bilenko and Mooney (03) Adaptive duplicate detection using learnable string similarity measures, Proc. of KDD 03.
  • Khmelev and Teahan (03) A repetition based measure for verification of text collections and for text categorization, Proc. SIGIR 03
  • Ramaswamy et al. (04) Automatic detection of fragments in dynamically generated web pages, Proc. WWW 04.
33
To think about…
  • How to free duplicate detection algorithms from needing to do pairwise comparisons?


  • What size chunk would you use for signature based methods for images, music, video? Would you encode a structural dependency as well (ordering using edit distance) or not (bag of chunks using VSM) for these other media types?