1
|
- Collaborative Filtering and Recommender Systems
- Week 12 Min-Yen KAN
|
2
|
- In information seeking, we may seek others’ opinions:
- Recommender systems may use collaborative filtering algorithms to
generate their recommendations
|
3
|
- Information Retrieval:
- Recommendation Systems:
- Use an item’s metadata
- Item–item recommendation
- Collaborative Filtering:
- User–user recommendation
- Find users similar to the current user,
- then return their recommendations
- Clustering can be used to find recommendations
|
4
|
- Effective when untainted data is available
- Typically have to deal with sparse data
- Users will only vote on a subset of all the items they’ve seen
- Data:
- Explicit: recommendations, reviews, ratings
- Implicit: query, browser, past purchases, session logs
- Approaches
- Model based – derive a user model and use for prediction
- Memory based – use entire database
- Functions
- Predict – predict the rating a user would give to an item
- Recommend – produce an ordered list of items of interest to the user
|
5
|
- Assume active user a has rated the items in I_a
- Mean rating given by: v̄_a = (1 / |I_a|) Σ_{j ∈ I_a} v_{a,j}
- Expected rating of a new item j given by: p_{a,j} = v̄_a + κ Σ_i w(a,i)(v_{i,j} − v̄_i), where w(a,i) is the similarity weight between users a and i and κ normalizes the weights (Breese et al. 98)
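The memory-based prediction step can be sketched directly from the Breese et al. (98) formulation. A minimal sketch; the ratings, user names, and similarity weights below are invented for illustration:

```python
# ratings: user -> {item -> rating}; weights: similarity w(a, i) of each
# other user i to the active user "a" (values here are made up).
def mean_rating(ratings, user):
    votes = ratings[user]
    return sum(votes.values()) / len(votes)

def predict(ratings, weights, a, item):
    """p(a,item) = mean(a) + kappa * sum_i w(a,i) * (v(i,item) - mean(i))."""
    num, norm = 0.0, 0.0
    for i, w in weights.items():
        if i != a and item in ratings[i]:
            num += w * (ratings[i][item] - mean_rating(ratings, i))
            norm += abs(w)
    if norm == 0:
        return mean_rating(ratings, a)           # no evidence: fall back to mean
    return mean_rating(ratings, a) + num / norm  # kappa = 1 / sum of |w|

ratings = {
    "a":  {"m1": 4, "m2": 5},
    "u1": {"m1": 5, "m2": 5, "m3": 4},
    "u2": {"m1": 2, "m2": 1, "m3": 1},
}
weights = {"u1": 0.9, "u2": -0.8}
print(round(predict(ratings, weights, "a", "m3"), 2))  # -> 4.3
```

Note that a negatively correlated user (u2) still contributes evidence: their low rating of m3, being below their own mean, pushes the prediction up.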
|
6
|
- How do we find similar users?
- Check the correlation between the active user’s ratings and each other user’s ratings
- Use the Pearson correlation, computed over the items both users have rated: w(a,i) = Σ_j (v_{a,j} − v̄_a)(v_{i,j} − v̄_i) / √(Σ_j (v_{a,j} − v̄_a)² · Σ_j (v_{i,j} − v̄_i)²)
- Generates a value between −1 and 1
- 1 (perfect agreement), 0 (no correlation), −1 (perfect disagreement)
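The Pearson weight can be sketched as follows, assuming ratings are stored as per-user dicts (the helper name and sample data are mine):

```python
from math import sqrt

def pearson(votes_a, votes_b):
    """Pearson correlation over the items both users have rated."""
    common = set(votes_a) & set(votes_b)
    if len(common) < 2:
        return 0.0  # not enough overlap to correlate
    ma = sum(votes_a[j] for j in common) / len(common)
    mb = sum(votes_b[j] for j in common) / len(common)
    num = sum((votes_a[j] - ma) * (votes_b[j] - mb) for j in common)
    den = (sqrt(sum((votes_a[j] - ma) ** 2 for j in common))
           * sqrt(sum((votes_b[j] - mb) ** 2 for j in common)))
    return num / den if den else 0.0

a = {"m1": 5, "m2": 3, "m3": 4}
b = {"m1": 4, "m2": 2, "m3": 3}   # same preferences, shifted down by 1
c = {"m1": 1, "m2": 4, "m3": 2}   # mostly opposite preferences
print(pearson(a, b))   # ~ 1.0: perfect agreement
print(pearson(a, c))   # negative: disagreement
```

Because Pearson subtracts each user’s mean, user b (a systematically harsher rater with identical preferences) still correlates perfectly with user a.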
|
7
|
- Sparse data
- Default Voting
- Users would agree on some items that they didn’t get a chance to rate
- Assume all unobserved items have a neutral or negative rating
- Smooths correlation values in sparse data
- Balancing Votes:
- Inverse User Frequency
- Universally liked items are not important to the correlation
- weight(j) = ln (# users / # users voting for item j)
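The inverse-user-frequency weight above can be computed directly; a minimal sketch (the function name and counts are mine):

```python
from math import log

def iuf(n_users, n_voters_for_j):
    """weight(j) = ln(# users / # users voting for item j)."""
    return log(n_users / n_voters_for_j)

print(iuf(1000, 1000))          # universally rated item -> 0.0 (no weight)
print(round(iuf(1000, 10), 2))  # rarely rated item -> high weight
```

This mirrors inverse document frequency in IR: an item everyone rates carries no discriminating information about user similarity, so its weight is driven to zero.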
|
8
|
- Assume all users belong to several different types C = {C1,C2,
…, Cn}
- Find the model (class) of active user
- E.g., horror movie lovers
- This class is hidden
- Then apply model to predict vote
|
9
|
- Shill = a decoy who acts enthusiastically in order to stimulate the
participation of others
- Push: cause an item’s rating to rise
- Nuke: cause an item’s rating to fall
|
10
|
- Given current user-user recommender systems:
- An item with more variable recommendations is easier to shill
- An item with fewer recommendations is easier to shill
- An item farther from the mean value is easier to shill further in the same direction
|
11
|
- Introduce new users who rate the target item with a high/low value
- To avoid detection, rate other items to force each new user’s mean toward the average value and its rating distribution toward normal
|
12
|
- Recommendation is different from prediction
- Recommendation produces an ordered list; most people only look at the first n items
- Recommendations for a new item can be obtained before the item is released
|
13
|
- How would you combine user-user and item-item recommendation systems?
- How does the type of product influence the recommendation algorithm you
might choose?
- What are the key differences in a model-based versus a memory-based
system?
|
14
|
- A good survey paper to start with:
- Breese, Heckerman & Kadie (1998) Empirical Analysis of Predictive Algorithms for Collaborative Filtering. In Proc. of Uncertainty in AI.
- Shilling
- Lam and Riedl (2004) Shilling Recommender Systems for Fun and Profit.
In Proc. WWW 2004.
- Collaborative Filtering Research Papers
- http://jamesthornton.com/cf/
|
15
|
|
16
|
- Computational Literary Analysis
- Week 12 Min-Yen KAN
|
17
|
- A series of 85 papers written by Jay, Hamilton and Madison
- Intended to help persuade voters to ratify the US constitution
|
18
|
- Most of the papers have attribution, but the authorship of 12 papers is disputed
- Each disputed paper was written by either Hamilton or Madison
- Want to determine who wrote these papers
- Want to determine who wrote these papers
- Also known as textual forensics
|
19
|
- Claim: Authors leave a unique wordprint in the documents which they
author
- Claim: Authors also exhibit certain stylistic patterns in their
publications
|
20
|
- Content-specific features (Foster 90)
- key words, special characters
- Style markers
- Word- or character-based features (Yule 38)
- length of words, vocabulary richness
- Function words (Mosteller & Wallace 64)
- Structural features
- Email: Title or signature, paragraph separators
(de Vel et al. 01)
- Can generalize to HTML tags
- To think about: artifact of authoring software?
|
21
|
- M & W examined the frequency of 100 function words
- Smoothed these frequencies using a negative binomial (not Poisson) distribution
- Used Bayes’ theorem and linear regression to find weights that fit the observed data
- Sample words:
- as do has is no or than this
- at down have it not our that to
- be even her its now shall the up
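A sketch of turning the sample word list into a per-1000-words “wordprint” feature vector, Mosteller–Wallace style. Tokenization here is naive whitespace splitting, and the input sentence is invented:

```python
# The slide's sample of M & W marker words.
FUNCTION_WORDS = ["as", "do", "has", "is", "no", "or", "than", "this",
                  "at", "down", "have", "it", "not", "our", "that", "to",
                  "be", "even", "her", "its", "now", "shall", "the", "up"]

def wordprint(text):
    """Rate per 1000 words of each marker word in the text."""
    tokens = text.lower().split()
    rate = 1000.0 / len(tokens)
    return {w: rate * tokens.count(w) for w in FUNCTION_WORDS}

profile = wordprint("it is not to be that the people shall have no voice")
print(round(profile["is"], 1), profile["as"])  # -> 83.3 0.0
```

Function words make good authorship features precisely because they are topic-independent: every author must use them, so their rates reflect habit rather than subject matter.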
|
22
|
- “Give anonymous offenders enough verbal rope and column inches, and they
will hang themselves for you, every time” – Donald Foster in Author
Unknown
- A Funeral Elegy: Foster attributed this poem to W.S.
- The attribution was initially rejected, but Foster identified his anonymous reviewer
- Foster also attributed Primary Colors to Newsweek columnist Joe Klein
- Analyzes text mainly by hand
|
23
|
- Very large feature space, look for distinguishing features:
- Topic words
- Punctuation
- Misused common words
- Irregular spelling and grammar
- Some specific features (mostly compound features):
- Adverbs ending with “y”: talky
- Parenthetical connectives: … , then, …
- Nouns ending with “mode”, “style”: crisis mode, outdoor-stadium style
|
24
|
- Five dimensions …
- Involved vs. informational production
- Narrative vs. non-narrative concerns
- Explicit vs. situation-dependent reference
- Overt expression of persuasion
- Abstract vs. non-abstract information
- … targeting these genres
- Intimate, interpersonal interactions
- Face-to-face conversations
- Scientific exposition
- Imaginative narrative
- General narrative exposition
|
25
|
- Biber also gives a feature inventory for each dimension
- THAT deletion
- Contractions
- BE as main verb
- WH questions
- 1st person pronouns
- 2nd person pronouns
- General hedges
- Nouns
- Word Length
- Prepositions
- Type/Token Ratio
- [Figure: text types scored on the involved vs. informational dimension, from face-to-face conversations (≈35), personal letters, and interviews at the involved end, through prepared speeches (≈0) and general fiction, to editorials, academic prose, press reportage, and official documents (≈−20) at the informational end]
|
26
|
- Karlgren and Cutting (94)
- Same text genre categories as Biber
- Simple count and average metrics
- Discriminant analysis (in SPSS)
- 64% precision over four categories
|
27
|
- Using machine learning techniques to assist genre analysis and
authorship detection
- Fung & Mangasarian (03) use SVMs and Bosch & Smith (98) use linear programming to confirm the claim that the disputed papers are Madison’s
- They use counts of up to three sets of function words as their features
- E.g., a separating plane over word rates: −0.5242·as + 0.8895·our + 4.9235·upon ≥ 4.7368
- Many other studies out there…
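The linear discriminant above can be evaluated mechanically. A sketch, where the inputs are per-document rates of the words as, our, and upon; the sample rates below are invented, and which side of the threshold maps to which author is left to the cited papers:

```python
# Coefficients and threshold are taken from the slide; everything else is
# illustrative.
def discriminant(rate_as, rate_our, rate_upon):
    return -0.5242 * rate_as + 0.8895 * rate_our + 4.9235 * rate_upon

THRESHOLD = 4.7368

score = discriminant(5.0, 2.0, 2.0)  # made-up word rates for one document
print(round(score, 3), score >= THRESHOLD)
```

The large weight on "upon" reflects a well-known asymmetry in the Federalist data: Hamilton and Madison differ sharply in how often they use that word, making it highly discriminative.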
|
28
|
- Prevention – stop or disable the copying process
- Detection – decide if one source is the same as another
|
29
|
- Compute a signature for each document
- Register the signature of the authority document
- Check a query document against the existing signatures
- Variations:
- Length: document / sentence* / window
- Signature: checksum / keywords / phrases
|
30
|
- Normalized sum of the lengths of all suffixes of the text that are repeated in other documents
- where Q(S | T1…Tn) = length of the longest prefix of S repeated in any one document
- Computed efficiently using the suffix array data structure
- More effective than the simple longest common substring
|
31
|
- T = cat_sat_on
- T1 = the_cat_on_a_mat
- T2 = the_cat_sat
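The example above can be worked through in code. This is one plausible reading of the measure: for each suffix of T, Q is the length of its longest prefix occurring in any comparison document, and the score normalizes the sum of the Q values by the total suffix length. The brute-force search below is quadratic; a suffix array makes it efficient, as the previous slide notes:

```python
def longest_repeated_prefix(suffix, docs):
    """Q(S | T1..Tn): longest prefix of `suffix` found in any document."""
    for k in range(len(suffix), 0, -1):
        if any(suffix[:k] in d for d in docs):
            return k
    return 0

def match_score(T, docs):
    """Normalized sum of Q over all suffixes of T (one reading of the slide)."""
    qs = [longest_repeated_prefix(T[i:], docs) for i in range(len(T))]
    return sum(qs) / sum(len(T) - i for i in range(len(T)))

T  = "cat_sat_on"
T1 = "the_cat_on_a_mat"
T2 = "the_cat_sat"
print(longest_repeated_prefix(T, [T1, T2]))   # -> 7 ("cat_sat" occurs in T2)
print(round(match_score(T, [T1, T2]), 3))     # -> 0.727
```

Note why this beats a single longest common substring: T shares "cat_sat" with T2 and "at_on" with T1, and the suffix-based sum credits both overlaps at once.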
|
32
|
- Large chunks
- Lower probability of match, higher threshold
- Small chunks
- Smaller number of unique chunks
- Lower search complexity
|
33
|
- If a document consists of just a subset of another document, the standard vector-space model may show low similarity
- Example: D1 = <A, B, C>, D2 = <A, B, C, D, E, F, G, H>; Cosine (D1, D2) = .61
- Shivakumar and Garcia-Molina (95): use only close words in VSM
- Close = comparable frequency, defined by a tunable ε distance.
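The slide’s cosine number can be checked with binary term vectors, showing how full containment still yields only moderate similarity:

```python
from math import sqrt

def cosine(d1, d2):
    """Cosine similarity of two binary term-set vectors."""
    common = set(d1) & set(d2)
    return len(common) / (sqrt(len(d1)) * sqrt(len(d2)))

D1 = {"A", "B", "C"}
D2 = {"A", "B", "C", "D", "E", "F", "G", "H"}
print(round(cosine(D1, D2), 2))  # -> 0.61, despite D1 being fully contained in D2
```

Every term of D1 appears in D2, yet the score is only 0.61 because D2’s extra terms inflate its vector length; this is the mismatch that motivates SCAM’s close-word restriction.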
|
34
|
- Use stylistic rules to compile fingerprint:
- Commenting
- Variable names
- Formatting
- Style (e.g., K&R)
- Use this along with program structure
- Edit distance
- What about hypertext structure?
/*************************************
 * This function concatenates the first and
 * second string into the third string.
 *************************************/
void strcat(char *string1, char *string2, char *string3)
{
    char *ptr1, *ptr2;

    ptr2 = string3;
    /*
     * Copy first string
     */
    for (ptr1 = string1; *ptr1; ptr1++) {
        *(ptr2++) = *ptr1;
    }
    /*
     * Copy second string and terminate
     */
    for (ptr1 = string2; *ptr1; ptr1++) {
        *(ptr2++) = *ptr1;
    }
    *ptr2 = '\0';
}

/*
 * concatenate s2 to s1 into s3.
 * Enough memory for s3 must
 * already be allocated. No checks !!!!!!
 */
mysc(s1, s2, s3)
char *s1, *s2, *s3;
{
    while (*s1)
        *s3++ = *s1++;
    while (*s2)
        *s3++ = *s2++;
}
|
35
|
- Find attributes that are stable (low variance) across texts within a collection, but that differ across collections
- Difficult to scale up to many authors and many sources
- Most work only does pairwise comparison
- Clustering may help as a first pass for plagiarism detection
|
36
|
- The Mosteller-Wallace method examines function words while Foster’s
method uses key words. What are the advantages and disadvantages of
these two different methods?
- What are the implications of an application that would emulate the
wordprint of another author?
- What are some of the potential effects of being able to undo anonymity?
- Self-plagiarism is common in the scientific community. Should we condone this practice?
|
37
|
- Foster (00) Author Unknown. Owl Books PE1421 Fos
- Biber (89) A typology of English texts, Linguistics, 27(3)
- Shivakumar & Garcia-Molina (95) SCAM: A copy detection mechanism for
digital documents, Proc. of DL 95
- Mosteller & Wallace (63) Inference in an authorship problem, Journal of the American Statistical Association 58(3)
- Karlgren & Cutting (94) Recognizing Text Genres with Simple Metrics
Using Discriminant Analysis, Proc. of COLING-94.
- de Vel, Anderson, Corney & Mohay (01) Mining Email Content for
Author Identification Forensics, SIGMOD Record
|