1
Digital Libraries
  • Collaborative Filtering and Recommender Systems


  • Week 12                Min-Yen KAN
2
Information Seeking, recap
  • In information seeking, we may seek others’ opinions:
  • Recommender systems may use collaborative filtering algorithms to generate their recommendations


3
Is it IR? Clustering?
  • Information Retrieval:
    • Uses content of document


  • Recommendation Systems:
    • Uses item’s metadata
      • Item – item recommendation
    • Collaborative Filtering
      • User – user recommendation
      • Find users similar to the current user,
      • Then return their recommendations
    • Clustering can be used to find recommendations
4
Collaborative Filtering
  • Effective when untainted data is available
  • Typically have to deal with sparse data
    • Users vote on only a subset of all the items they’ve seen


  • Data:
    • Explicit: recommendations, reviews, ratings
    • Implicit: queries, browsing history, past purchases, session logs


  • Approaches
    • Model based – derive a user model and use for prediction
    • Memory based – use entire database


  • Functions
    • Predict – estimate the rating a user would give to an item
    • Recommend – produce an ordered list of items of interest to the user
5
Memory-based CF
  • Assume the active user a has rated the items in I_a
  • Mean rating given by (following Breese et al. 98):

      v̄_a = (1 / |I_a|) Σ_{i ∈ I_a} v_{a,i}

  • Expected rating of a new item j given by:

      p_{a,j} = v̄_a + κ Σ_u w(a,u) (v_{u,j} − v̄_u)

    • w(a,u) weights user u by similarity to a (next slide); κ normalizes the weights
6
Correlation
  • How to find similar users?
    • Check the correlation between the active user’s ratings and each other user’s
    • Use the Pearson correlation, summing over the items i that both users have rated:

        w(a,u) = Σ_i (v_{a,i} − v̄_a)(v_{u,i} − v̄_u) / √( Σ_i (v_{a,i} − v̄_a)² Σ_i (v_{u,i} − v̄_u)² )

      • Generates a value between −1 and 1
      • 1 (perfect agreement), 0 (no correlation), −1 (perfect disagreement)
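  • A minimal Python sketch of the prediction above with Pearson weights (illustrative code, not from the deck; users are dicts mapping item → rating):

    import numpy as np

    def pearson_weight(ratings_a, ratings_u):
        # Pearson correlation computed over the items both users rated
        common = sorted(set(ratings_a) & set(ratings_u))
        if len(common) < 2:
            return 0.0
        va = np.array([ratings_a[i] for i in common], dtype=float)
        vu = np.array([ratings_u[i] for i in common], dtype=float)
        da, du = va - va.mean(), vu - vu.mean()
        denom = np.sqrt((da ** 2).sum() * (du ** 2).sum())
        return float(da @ du / denom) if denom else 0.0

    def predict(active, others, item):
        # Active user's mean rating plus the weighted, mean-centred
        # deviations of every other user who rated `item`
        mean_a = np.mean(list(active.values()))
        num = den = 0.0
        for u in others:
            if item in u:
                w = pearson_weight(active, u)
                num += w * (u[item] - np.mean(list(u.values())))
                den += abs(w)
        return mean_a + num / den if den else mean_a

    a = {"alien": 5, "shrek": 2}
    others = [{"alien": 4, "shrek": 1, "antz": 2},
              {"alien": 1, "shrek": 5, "antz": 5}]
    print(predict(a, others, "antz"))   # ~2.67: follows the like-minded user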



7
Two modifications
  • Sparse data
    • Default Voting
      • Assume users would agree on some items they haven’t had a chance to rate
      • Assign all unobserved items a neutral or slightly negative default rating
      • Smooths correlation values when data are sparse


  • Balancing Votes:
    • Inverse User Frequency
      • Universally liked items contribute little to correlation
      • weight(j) = ln (# users / # users voting for item j)
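  • A one-function Python sketch of these weights (illustrative, not from the deck):

    import math

    def iuf_weights(num_users, votes_per_item):
        # Universally rated items get weight ln(1) = 0; rare items score higher
        return {j: math.log(num_users / n_j)
                for j, n_j in votes_per_item.items()}

    print(iuf_weights(1000, {"titanic": 1000, "eraserhead": 50}))
    # {'titanic': 0.0, 'eraserhead': ~3.0}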

8
Model-based methods: NB Clustering
  • Assume all users belong to one of several hidden types C = {C1, C2, …, Cn}
    • Find the model (class) of the active user
      • e.g., horror movie lovers
      • This class is hidden (latent)
    • Then apply the model to predict the vote
  • With votes conditionally independent given the class (as in Breese et al. 98):

      Pr(C = c, v_1, …, v_m) = Pr(C = c) Π_i Pr(v_i | C = c)
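  • A small Python sketch of prediction under this model, assuming the class priors and per-class vote distributions have already been learned (e.g., with EM); the toy numbers and item names below are made up:

    import numpy as np

    def predict_vote(priors, cond, observed, item, values=(1, 2, 3, 4, 5)):
        # priors[c] = Pr(C=c); cond[c][i][v] = Pr(vote v on item i | C=c)
        # observed maps item -> the active user's vote
        post = np.array([priors[c] *
                         np.prod([cond[c][i][v] for i, v in observed.items()])
                         for c in range(len(priors))])
        post /= post.sum()          # posterior over the hidden class
        # Expected vote on `item`, mixing the classes by their posterior
        return sum(v * sum(post[c] * cond[c][item][v]
                           for c in range(len(priors)))
                   for v in values)

    priors = [0.5, 0.5]             # e.g., horror lovers vs. haters
    cond = [{"alien": {1: .05, 2: .05, 3: .1, 4: .3, 5: .5},
             "barb":  {1: .4, 2: .3, 3: .1, 4: .1, 5: .1}},
            {"alien": {1: .5, 2: .3, 3: .1, 4: .05, 5: .05},
             "barb":  {1: .1, 2: .1, 3: .1, 4: .3, 5: .4}}]
    print(predict_vote(priors, cond, {"alien": 5}, "barb"))   # ~2.35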




9
Detecting tainted data
  • Shill = a decoy who acts enthusiastically in order to stimulate the participation of others


  • Push: cause an item’s rating to rise
  • Nuke: cause an item’s rating to fall
10
Properties of shilling
  • Given current user-user recommender systems:
    • An item with more variable ratings is easier to shill
    • An item with fewer ratings is easier to shill
    • An item whose rating is already far from the mean is easier to shill further in that direction
11
Attacking a recommender system
  • Introduce new users who rate target item with high/low value


  • To avoid detection, rate other items so that the shill’s mean matches the average user’s and its rating distribution looks normal


12
Shilling, continued
  • Recommendation is different from prediction
    • Recommendation produces an ordered list; most people look at only the first n items

  • Obtain recommendations for new items before the item is released
    • Default value
13
To think about…
  • How would you combine user-user and item-item recommendation systems?


  • How does the type of product influence the recommendation algorithm you might choose?


  • What are the key differences in a model-based versus a memory-based system?
14
References
  • A good survey paper to start with:
    • Breese, Heckerman & Kadie (1998) Empirical Analysis of Predictive Algorithms for Collaborative Filtering. In Proc. of Uncertainty in AI.


  • Shilling
    • Lam and Riedl (2004) Shilling Recommender Systems for Fun and Profit. In Proc. WWW 2004.


  • Collaborative Filtering Research Papers
    • http://jamesthornton.com/cf/
15
Mee Goreng Break
  • See ya!
16
Digital Libraries
  • Computational Literary Analysis


  • Week 12                Min-Yen KAN
17
The Federalist papers
  • A series of 85 papers written by Jay, Hamilton and Madison


  • Intended to help persuade voters to ratify the U.S. Constitution
18
Disputed papers of the Federalist
  • Most of the papers have attributions, but the authorship of 12 papers is disputed
    • Either Hamilton or Madison

  • Want to determine who wrote these papers
    • Also known as textual forensics
19
Wordprint and Stylistics
  • Claim: Authors leave a unique wordprint in the documents which they author


  • Claim: Authors also exhibit certain stylistic patterns in their publications
20
Feature Selection
  • Content-specific features (Foster 90)
    • key words, special characters

  • Style markers
    • Word- or character-based features (Yule 38)
      • length of words, vocabulary richness
    • Function words (Mosteller & Wallace 64)


  • Structural features
    • Email: Title or signature, paragraph separators
      (de Vel et al. 01)
    • Can generalize to HTML tags
    • To think about: artifact of authoring software?
21
Bayes Theorem on function words
  • M & W examined the frequencies of 100 function words
  • Modeled these frequencies with a negative binomial (not Poisson) distribution





  • Used Bayes’ theorem and linear regression to find weights that fit the observed data


  • Sample words: as, at, be, do, down, even, has, have, her, is, it, its, no, not, now, or, our, shall, than, that, the, this, to, up
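  • A simplified Python sketch of the Bayes-factor idea on function-word counts; M & W’s actual model uses negative binomials with carefully fitted weights, so the Poisson form and all rates below are illustrative assumptions only:

    import math

    def log_odds(counts, n_words, rate_h, rate_m):
        # Log posterior odds of Hamilton vs. Madison (equal priors)
        # counts: word -> occurrences in the disputed paper
        # rate_*: word -> expected occurrences per 1000 words per author
        total = 0.0
        for w, c in counts.items():
            lam_h = rate_h[w] * n_words / 1000.0   # Poisson means
            lam_m = rate_m[w] * n_words / 1000.0
            total += c * math.log(lam_h / lam_m) - (lam_h - lam_m)
        return total                                # > 0 favors Hamilton

    # "upon" is famously Hamiltonian; the rates here are rough guesses.
    # One "upon" in 2000 words is a low rate, so this comes out negative
    print(log_odds({"upon": 1}, 2000, {"upon": 3.0}, {"upon": 0.2}))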
22
A Funeral Elegy and Primary Colors
  • “Give anonymous offenders enough verbal rope and column inches, and they will hang themselves for you, every time” – Donald Foster in Author Unknown
  • A Funeral Elegy: Foster attributed this poem to W.S. (William Shakespeare)
    • The attribution was initially rejected, but Foster identified his anonymous reviewer
  • Foster also attributed Primary Colors to Newsweek columnist Joe Klein


  • Foster analyzes texts mainly by hand
23
Foster’s features
  • Very large feature space, look for distinguishing features:
    • Topic words
    • Punctuation
    • Misused common words
    • Irregular spelling and grammar


  • Some specific features (most compound):
    • Adverbs ending with “y”: talky
    • Parenthetical connectives: … , then, …
    • Nouns ending with “mode”, “style”: crisis mode, outdoor-stadium style
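  • A rough Python sketch of extracting a few such features with regular expressions (the patterns and connective list are illustrative, not Foster’s actual inventory):

    import re

    TEXT = "He gave a talky, crisis-mode speech, and, then, left."

    features = {
        "words_ending_in_y":
            re.findall(r"\b[a-z]+y\b", TEXT),
        "parenthetical_connectives":
            re.findall(r",\s*(then|however|indeed)\s*,", TEXT),
        "mode_or_style_compounds":
            re.findall(r"\b\w+[-\s](?:mode|style)\b", TEXT),
    }
    print(features)   # {'words_ending_in_y': ['talky'], ...}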

24
Typology of English texts
  • Five dimensions …
    • Involved vs. informational production
    • Narrative vs. non-narrative concerns
    • Explicit vs. situation-dependent reference
    • Overt expression of persuasion
    • Abstract vs. non-abstract information
  • … targeting these genres
    • Intimate, interpersonal interactions
    • Face-to-face conversations
    • Scientific exposition
    • Imaginative narrative
    • General narrative exposition
25
Features used (e.g., Dimension 1)
  • Biber also gives a feature inventory for each dimension


  • THAT deletion
  • Contractions
  • BE as main verb
  • WH questions
  • 1st person pronouns
  • 2nd person pronouns
  • General hedges
  • Nouns
  • Word Length
  • Prepositions
  • Type/Token Ratio


  • [Figure: mean Dimension 1 scores by genre. Approximately: face-to-face conversations 35; personal letters and interviews 20; prepared speeches just above 0; general fiction just below 0; editorials −10; academic prose and press reportage −15; official documents below −15]


26
Discriminant analysis for text genres
  • Karlgren and Cutting (94)
    • Same text genre categories as Biber
    • Simple count and average metrics
    • Discriminant analysis (in SPSS)
    • 64% precision over four categories
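  • A scikit-learn sketch of the same idea on toy data (Karlgren & Cutting worked in SPSS; the features and documents below are illustrative, not theirs):

    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    def simple_metrics(text):
        # Simple count/average metrics in the spirit of the paper
        words = text.split()
        return [sum(len(w) for w in words) / len(words),   # avg word length
                len(set(words)) / len(words),              # type/token ratio
                sum(w.lower() in ("i", "you", "we") for w in words)]

    docs = ["I think you know we should talk soon",
            "the experimental results demonstrate statistically significant gains",
            "you and I could just chat about it later",
            "subsequent analysis confirms the hypothesized correlation structure"]
    genres = ["conversation", "exposition", "conversation", "exposition"]

    clf = LinearDiscriminantAnalysis().fit(
        [simple_metrics(d) for d in docs], genres)
    print(clf.predict([simple_metrics("we know you will talk it over")]))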
27
Recent developments
  • Using machine learning techniques to assist genre analysis and authorship detection


    • Fung & Mangasarian (03) use SVMs and Bosch & Smith (98) use LP to confirm the claim that the disputed papers are Madison’s


    • They use counts of sets of up to three function words as their features


      • One resulting separating plane (evaluated in the sketch below): −0.5242·as + 0.8895·our + 4.9235·upon ≥ 4.7368

    • Many other studies out there…
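  • A quick Python check of that plane; treating the ≥ side as Hamilton is an assumption here (it fits his famously heavy use of "upon"), and the input rates are made-up per-1000-word frequencies:

    def hamilton_side(rate_as, rate_our, rate_upon):
        # True if the paper falls on the >= side of the separating plane
        return (-0.5242 * rate_as + 0.8895 * rate_our
                + 4.9235 * rate_upon) >= 4.7368

    print(hamilton_side(0.5, 0.5, 3.0))   # heavy 'upon' user -> True
    print(hamilton_side(2.0, 0.3, 0.1))   # rare 'upon' -> False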
28
Copy detection
  • Prevention – stop or disable the copying process
  • Detection – decide if one source is the same as another
29
Copy / duplicate detection
  • Compute signature for documents
    • Register the signature of the authoritative document
    • Check a query document against the registered signatures

  • Variations:
    • Length: document / sentence* / window
    • Signature: checksum / keywords / phrases
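  • A minimal Python sketch of the window + checksum variant: hash every w-word window and compare the resulting sets (the chunking and hash choice here are illustrative):

    import hashlib

    def signature(text, window=8):
        # Set of hashes of all `window`-word chunks of the text
        words = text.split()
        chunks = (" ".join(words[i:i + window])
                  for i in range(max(1, len(words) - window + 1)))
        return {hashlib.md5(c.encode()).hexdigest() for c in chunks}

    def overlap(query_sig, registered_sig):
        # Fraction of the query's chunks already registered
        return len(query_sig & registered_sig) / len(query_sig)

    reg = signature("the cat sat on the mat and looked at the dog outside")
    qry = signature("my cat sat on the mat and looked at a dog")
    print(overlap(qry, reg))   # 0.25: one 8-word chunk is shared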
30
R-measure
  • Normalized sum of the lengths of all suffixes of the text that are repeated in other documents:

      R(T | T1 … Tn) = Σ_{i=1..|T|} Q(T_i … T_{|T|} | T1 … Tn) / ( |T|(|T|+1)/2 )

  • where Q(S | T1 … Tn) = the length of the longest prefix of S repeated in any one document


    • Computed easily using suffix array data structure
    • More effective than simple longest common substring
31
R-measure example
  • T = cat_sat_on
  • T1 = the_cat_on_a_mat
  • T2 = the_cat_sat
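  • A naive Python sketch of the computation (quadratic scan for clarity; the suffix-array approach from the previous slide makes Q fast). For this example it prints 40/55 ≈ 0.73:

    def q_len(s, docs):
        # Length of the longest prefix of s occurring in any one document
        best = 0
        for k in range(1, len(s) + 1):
            if any(s[:k] in d for d in docs):
                best = k
            else:
                break
        return best

    def r_measure(t, docs):
        # Normalized sum of Q over all suffixes of t (range 0 .. 1)
        total = sum(q_len(t[i:], docs) for i in range(len(t)))
        return total / (len(t) * (len(t) + 1) / 2)

    print(r_measure("cat_sat_on", ["the_cat_on_a_mat", "the_cat_sat"]))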



32
Granularity
  • Large chunks
    • Lower probability of match, higher threshold


  • Small chunks
    • Smaller number of unique chunks
    • Lower search complexity
33
Subset problem
  • If a document consists of just a subset of another document, the standard vector-space (VS) model may show low similarity
    • Example: Cosine (D1,D2) = .61
      D1: <A, B, C>,
      D2: <A, B, C, D, E, F, G, H>


  • Shivakumar and Garcia-Molina (95): use only close words in VSM
    • Close = comparable frequency, defined by a tunable ε distance.
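  • A short Python check of the slide’s number, using binary term vectors (sets of terms):

    import math

    def cosine(d1, d2):
        # Cosine similarity of two binary term vectors
        return len(d1 & d2) / math.sqrt(len(d1) * len(d2))

    # D1 is fully contained in D2, yet the similarity is only ~0.61
    print(round(cosine({"A", "B", "C"},
                       {"A", "B", "C", "D", "E", "F", "G", "H"}), 2))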
34
Computer program plagiarism
  • Use stylistic rules to compile fingerprint:
    • Commenting
    • Variable names
    • Formatting
    • Style (e.g., K&R)
  • Use this along with program structure
    • Edit distance
    • What about hypertext structure?
  /***********************************
   * This function concatenates the first and
   * second string into the third string.
   ***********************************/
  void strcat(char *string1, char *string2, char *string3)
  {
      char *ptr1, *ptr2;

      ptr2 = string3;
      /*
       * Copy first string
       */
      for (ptr1 = string1; *ptr1; ptr1++) {
          *(ptr2++) = *ptr1;
      }
      /*
       * Copy second string and terminate
       */
      for (ptr1 = string2; *ptr1; ptr1++) {
          *(ptr2++) = *ptr1;
      }
      *ptr2 = '\0';
  }

  /*
   * concatenate s2 to s1 into s3.
   * Enough memory for s3 must already be allocated. No checks !!!!!!
   */
  mysc(s1, s2, s3)
        char *s1, *s2, *s3;
  {
      while (*s1)
          *s3++ = *s1++;

      while (*s2)
          *s3++ = *s2++;
      *s3 = '\0';
  }


35
Conclusion
  • Find attributes that are stable (low variance) across texts within a collection, but differ across collections
  • Difficult to scale up to many authors and many sources
    • Most work only does pairwise comparison
    • Clustering may help as a first pass for plagiarism detection
36
To think about…
  • The Mosteller-Wallace method examines function words while Foster’s method uses key words. What are the advantages and disadvantages of these two different methods?


  • What are the implications of an application that would emulate the wordprint of another author?


  • What are some of the potential effects of being able to undo anonymity?


  • Self-plagiarism is common in the scientific community.  Should we condone this practice?
37
References
  • Foster (00) Author Unknown. Owl Books. PE1421 Fos
  • Biber (89) A Typology of English Texts. Linguistics, 27(3).
  • Shivakumar & Garcia-Molina (95) SCAM: A Copy Detection Mechanism for Digital Documents. In Proc. of DL ’95.
  • Mosteller & Wallace (63) Inference in an Authorship Problem. Journal of the American Statistical Association, 58(3).
  • Karlgren & Cutting (94) Recognizing Text Genres with Simple Metrics Using Discriminant Analysis. In Proc. of COLING-94.
  • de Vel, Anderson, Corney & Mohay (01) Mining Email Content for Author Identification Forensics. SIGMOD Record.