|
1
|
- Module 7: Applied Bibliometrics
- KAN Min-Yen
- *Parts of these lecture notes come from Manning, Raghavan and Schütze @ Stanford CS
|
|
2
|
- Idea: mine hyperlink information in the Web
- Assumptions:
- Links often connect related pages
- A link between pages is a recommendation
- “people vote with their links”
|
|
3
|
- Using link counts as simple measures of popularity
- Two basic suggestions:
- Undirected popularity: a page’s in-links plus out-links (e.g., 3 + 2 = 5)
- Directed popularity: the number of its in-links (e.g., 3)
|
|
4
|
- Retrieve all pages meeting the text query (say, venture capital), perhaps using the Boolean model
- Order these by link popularity (either variant on the previous slide)
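The two-step scheme above can be sketched as follows. This is a toy illustration: the link graph and the set of pages matching the Boolean query are invented.

```python
# Toy graph: page -> pages it links to (an invented example).
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}

# Count in-links (directed popularity).
in_links = {page: 0 for page in links}
for targets in links.values():
    for t in targets:
        in_links[t] += 1

# Suppose the Boolean query matched pages "a", "b", and "c":
# order them by in-link count.
matching = ["a", "b", "c"]
ranked = sorted(matching, key=lambda p: in_links[p], reverse=True)
print(ranked)  # "c" has 3 in-links; "a" and "b" have 1 each
```

Note that ranking is done only over the pages that match the query; the link counts themselves are query-independent.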
|
|
5
|
- Imagine a browser doing a random walk on web pages:
- Start at a random page
- At each step, follow one of the n links on that page, each with probability 1/n
- Do this repeatedly. Use the “long-term visit rate” as the page’s score
|
|
6
|
- The web is full of dead ends.
- What sites have dead ends?
- Our random walk can get stuck.
|
|
7
|
- At each step, with probability 10%, teleport to a random web page
- With the remaining probability (90%), follow a random link on the page
- If the page is a dead end, stay put in this case
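The teleporting walk above can be simulated directly. A minimal sketch, on an invented three-page graph where "c" is a dead end; the 10% teleport probability follows the slide, and at a dead end the walker stays put unless it teleports.

```python
import random

# Invented toy graph: page -> out-links ("c" is a dead end).
links = {"a": ["b"], "b": ["a", "c"], "c": []}
pages = list(links)

random.seed(0)  # for reproducibility
visits = {p: 0 for p in pages}
state = random.choice(pages)
steps = 50_000
for _ in range(steps):
    visits[state] += 1
    if random.random() < 0.10:
        state = random.choice(pages)           # teleport
    elif links[state]:
        state = random.choice(links[state])    # follow a random out-link
    # else: dead end -- stay put in this case

# Empirical long-term visit rates.
rates = {p: visits[p] / steps for p in pages}
```

With the stay-put rule, the dead end "c" can only be left by teleporting, so it accumulates a large share of the visits.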
|
|
8
|
- Now we cannot get stuck locally
- There is a long-term rate at which any page is visited (not obvious; we will show this)
- How do we compute this visit rate?
|
|
9
|
- A Markov chain consists of n states, plus an n×n transition probability matrix P.
- At each step, we are in exactly one of the states.
- For 1 ≤ i,k ≤ n, the matrix entry Pik tells us the probability of k being the next state, given we are currently in state i.
|
|
10
|
- Clearly, for all i, Σk Pik = 1
- Markov chains are abstractions of random walks
|
|
11
|
- A Markov chain is ergodic if
- you have a path from any state to any other
- you can be in any state at every time step, with non-zero probability
- With teleportation, our Markov chain is ergodic
|
|
12
|
- For any ergodic Markov chain, there is a unique long-term visit rate for each state
- Over a long period, we’ll visit each state in proportion to this rate
- It doesn’t matter where we start
|
|
13
|
- A probability (row) vector x = (x1, …, xn) tells us where the walk is at any point
- E.g., (0 0 0 … 1 … 0 0 0) means we’re in state i (the 1 is in position i)
|
|
14
|
- If the probability vector is x = (x1, …, xn) at this step, what is it at the next step?
- Recall that row i of the transition probability matrix P tells us where we go next from state i.
- So from x, our next state is distributed as xP.
|
|
15
|
- Regardless of where we start, we eventually reach the steady state a
- Start with any distribution (say x = (1 0 … 0))
- After one step, we’re at xP
- After two steps at xP^2, then xP^3, and so on.
- “Eventually” means for “large” k, xP^k ≈ a
- Algorithm: multiply x by increasing powers of P until the product looks stable
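The power-iteration algorithm above can be sketched in a few lines. The 3×3 matrix P below is an invented example whose rows sum to 1 and which already incorporates teleportation, so the chain is ergodic and the iteration converges.

```python
# Invented 3-state transition matrix; each row sums to 1.
P = [
    [0.05, 0.90, 0.05],
    [0.45, 0.10, 0.45],
    [0.90, 0.05, 0.05],
]
n = len(P)

x = [1.0, 0.0, 0.0]  # any starting distribution works
for _ in range(1000):
    # One step of the walk: x <- xP (row vector times matrix).
    nxt = [sum(x[i] * P[i][k] for i in range(n)) for k in range(n)]
    if max(abs(nxt[k] - x[k]) for k in range(n)) < 1e-12:
        break  # the product looks stable
    x = nxt

# x now approximates the steady-state vector a, which satisfies a = aP.
```

Starting from a different distribution gives the same limit, which is exactly the "it doesn't matter where we start" property of ergodic chains.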
|
|
16
|
- Pre-processing:
- Given the graph of links, build the matrix P
- From it, compute the steady-state vector a
- The pagerank ai is a scaled number between 0 and 1
- Query processing:
- Retrieve pages meeting the query
- Rank them by their pagerank
- Order is query-independent
|
|
17
|
- In response to a query, instead of an ordered list of pages each meeting the query, find two sets of inter-related pages:
- Hub pages are good lists of links on a subject.
- e.g., “Bob’s list of cancer-related links.”
- Authority pages occur recurrently on good hubs for the subject.
- Best suited for broad topic queries rather than for known-item searches.
- Gets at a broader slice of common opinion.
|
|
18
|
- Thus, a good hub for a topic points to many authoritative pages for that topic.
- A good authority for a topic is pointed to by many good hubs for that topic.
- Circular definition - we will turn this into an iterative computation.
|
|
20
|
- Extract from the web a base set of pages that could be good hubs or authorities.
- From these, identify a small set of top hub and authority pages.
|
|
21
|
- Given a text query (say university), use a text index to get all pages containing university.
- Call this the root set of pages
- Add in any page that either:
- points to a page in the root set, or
- is pointed to by a page in the root set
- Call this the base set
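The expansion from root set to base set can be sketched as below. The link structure and helper mappings are invented; in practice, out-links come from parsing the root-set pages and in-links from a connectivity server.

```python
# Invented link data: out_links from parsing pages,
# in_links as a connectivity server would report them.
out_links = {"u1": ["u2", "x1"], "u2": ["u1"]}
in_links = {"u1": ["u2", "x2"], "u2": ["u1"]}

root_set = {"u1", "u2"}  # pages containing the query term
base_set = set(root_set)
for page in root_set:
    base_set.update(out_links.get(page, []))  # pages the root set points to
    base_set.update(in_links.get(page, []))   # pages pointing into the root set
```

Here "x1" and "x2" enter the base set without containing the query term at all, which is what lets HITS find authorities that never mention the query words.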
|
|
23
|
- Root set typically 200-1000 nodes.
- Base set may have up to 5000 nodes.
- How do you find the base set nodes?
- Follow out-links by parsing root set pages.
- Get in-links (and out-links) from a connectivity server.
|
|
24
|
- Compute, for each page x in the base set, a hub score h(x) and an authority score a(x).
- Initialize: for all x, h(x) ← 1; a(x) ← 1
- Iteratively update all h(x), a(x)
- After the iterations:
- highest h() scores are hubs
- highest a() scores are authorities
|
|
25
|
- Repeat the following updates, for all x:
- h(x) ← Σ{x→y} a(y) (sum of authority scores of pages x points to)
- a(x) ← Σ{y→x} h(y) (sum of hub scores of pages pointing to x)
|
|
26
|
- Relative values of scores will converge after a few iterations
- We only require the relative ordering of the h() and a() scores - not their absolute values
- In practice, ~5 iterations needed
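The iterative scheme can be sketched as follows, on an invented four-page graph with two pure hubs pointing at two pure authorities. Scores are normalised each round, which is harmless since only their relative ordering matters.

```python
# Invented base-set graph: page -> out-links.
links = {
    "hub1": ["auth1", "auth2"],
    "hub2": ["auth2"],
    "auth1": [],
    "auth2": [],
}
pages = list(links)

# Initialize: for all x, h(x) <- 1, a(x) <- 1.
h = {x: 1.0 for x in pages}
a = {x: 1.0 for x in pages}

for _ in range(5):  # ~5 iterations suffice in practice
    # a(x) <- sum of h(y) over pages y pointing to x
    a = {x: sum(h[y] for y in pages if x in links[y]) for x in pages}
    # h(x) <- sum of a(y) over pages y that x points to
    h = {x: sum(a[y] for y in links[x]) for x in pages}
    # Normalise; only the relative ordering of scores matters.
    norm_a, norm_h = sum(a.values()), sum(h.values())
    a = {x: v / norm_a for x, v in a.items()}
    h = {x: v / norm_h for x, v in h.items()}
```

Afterwards, the pages with the highest h() scores are the hubs and those with the highest a() scores are the authorities; here "hub1" (which points to both authorities) out-scores "hub2", and "auth2" (pointed to by both hubs) out-scores "auth1".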
|
|
27
|
- Use only link analysis after the base set is assembled
- iterative scoring is query-dependent
- Iterative computation after text index retrieval - significant overhead
|
|
28
|
- How does the selection of the base set influence the computation of H & A?
- Can we embed the computation of H & A in the standard VS retrieval algorithm?
- A pagerank score is a global score. Can there be a fusion between H & A (which are query-sensitive) and pagerank? How would you do it?
|