1
Pagerank and HITS*
  • Module 7 Applied Bibliometrics
  • KAN Min-Yen
  • *Parts of these lecture notes come from Manning, Raghavan and Schütze @ Stanford CS
2
Connectivity analysis
  •  Idea: mine hyperlink information in the Web
  •  Assumptions:
    •  Links often connect related pages
    •  A link between pages is a recommendation
      • “people vote with their links”
3
Query-independent ordering
  • Using link counts as simple measures of popularity


  • Two basic suggestions:
    • Undirected popularity:
      • in-links plus out-links (3+2=5)
    • Directed popularity:
      • number of its in-links (3)
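Both variants can be sketched as simple counts over a link map (a minimal sketch; the `popularity` name and the link-map format are our own, not from the slides — the example mirrors a page with 3 in-links and 2 out-links):

```python
def popularity(page, out_links):
    """Link-count popularity of a page.

    out_links maps each page to the list of pages it points to.
    Returns (undirected, directed):
      undirected = in-links + out-links; directed = in-links only."""
    in_count = sum(page in targets for targets in out_links.values())
    out_count = len(out_links.get(page, []))
    return in_count + out_count, in_count


# Hypothetical link map: a, b, c point to x; x points to a and b.
out_links = {"a": ["x"], "b": ["x"], "c": ["x"], "x": ["a", "b"]}
undirected, directed = popularity("x", out_links)  # (3+2, 3)
```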
4
Algorithm
  • Retrieve all pages meeting the text query (say venture capital), perhaps by using the Boolean model


  • Order these by link popularity
    (either variant on the previous page)
5
Pagerank scoring
  • Imagine a browser doing a random walk on web pages:
    • Start at a random page
    • At each step, follow one of the n links on that page, each with 1/n probability
  • Do this repeatedly.  Use the “long-term visit rate” as the page’s score
6
Not quite enough
  • The web is full of dead ends.
    • What sites have dead ends?
    • Our random walk can get stuck.
7
Teleporting
  • At each step, with probability 10%, teleport to a random web page


  • With remaining probability (90%), follow a random link on the page
    • If a dead-end, stay put in this case
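The teleporting walk can be simulated directly (an illustrative sketch; the function name and link-map format are ours, but the step rule — 10% teleport, 90% follow a random link, stay put at a dead end — follows the slide):

```python
import random

def teleporting_walk(links, pages, steps=100_000, teleport_p=0.10, seed=0):
    """Simulate the teleporting random surfer and count visits per page.

    links maps each page to its list of out-linked pages (empty = dead end).
    Returns each page's fraction of visits, i.e. its long-term visit rate."""
    rng = random.Random(seed)
    visits = {p: 0 for p in pages}
    current = rng.choice(pages)                   # start at a random page
    for _ in range(steps):
        visits[current] += 1
        if rng.random() < teleport_p:
            current = rng.choice(pages)           # 10%: teleport anywhere
        elif links.get(current):
            current = rng.choice(links[current])  # 90%: follow a random link
        # else: dead end -- stay put in this case
    return {p: v / steps for p, v in visits.items()}
```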


8
Result of teleporting
  • Now we cannot get stuck locally
  • There is a long-term rate at which any page is visited (not obvious, will show this)
    • How do we compute this visit rate?
9
Markov chains
  • A Markov chain consists of n states, plus an n×n transition probability matrix P.
    • At each step, we are in exactly one of the states.
    • For 1 ≤ i,k ≤ n, the matrix entry Pik tells us the probability of k being the next state, given we are currently in state i.
10
Markov chains
  • Clearly, for all i, Σk Pik = 1  (each row of P is a probability distribution)
  •  Markov chains are abstractions of random walks
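As a tiny numeric sketch (the three-state matrix below is our own example, not from the slides), the row-sum property can be checked directly:

```python
# Hypothetical 3-state chain: entry P[i][k] = Pr(next state = k | current state = i).
P = [
    [0.0, 0.5, 0.5],
    [1.0, 0.0, 0.0],
    [0.5, 0.0, 0.5],
]
# For all i, the entries of row i sum to 1: from state i the walk must go somewhere.
assert all(abs(sum(row) - 1.0) < 1e-12 for row in P)
```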
11
Ergodic Markov chains
  • A Markov chain is ergodic if
    • you have a path from any state to any other
    • you can be in any state at every time step, with non-zero probability



    • With teleportation, our Markov chain is ergodic
12
Steady State
  • For any ergodic Markov chain, there is a unique long-term visit rate for each state
    • Over a long period, we’ll visit each state in proportion to this rate
    • It doesn’t matter where we start

13
Probability vectors
  • A probability (row) vector x = (x1, … xn) tells us where the walk is at any point
  • E.g., (0 0 0 … 1 … 0 0 0), with the 1 in position i, means we’re in state i.
14
Change in probability vector
  • If the probability vector is  x = (x1, … xn) at this step, what is it at the next step?
  • Recall that row i of the transition prob. matrix P tells us where we go next from state i.
  • So from x, our next state is distributed as xP.
15
Pagerank algorithm
  • Regardless of where we start, we eventually reach the steady state a
    • Start with any distribution (say x = (1 0 … 0))
    • After one step, we’re at xP
    • After two steps at xP², then xP³, and so on.
    • “Eventually” means that for “large” k, xPᵏ = a
  • Algorithm: multiply x by increasing powers of P until the product looks stable
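The algorithm above can be sketched as power iteration (an illustrative sketch: the matrix construction follows the teleporting surfer from the earlier slides, but the function name, link-map format, and stay-put handling of dead ends are our reading of them):

```python
def pagerank(links, pages, teleport_p=0.10, tol=1e-10):
    """Multiply x by increasing powers of P until the product looks stable.

    links maps each page to its list of out-linked pages."""
    n = len(pages)
    index = {p: i for i, p in enumerate(pages)}
    # Build P: teleport mass spread uniformly, link-following mass on out-links;
    # a dead end keeps its remaining mass on itself (stay put).
    P = [[teleport_p / n] * n for _ in pages]
    for p in pages:
        out = links.get(p, [])
        if out:
            for q in out:
                P[index[p]][index[q]] += (1 - teleport_p) / len(out)
        else:
            P[index[p]][index[p]] += 1 - teleport_p
    x = [1.0 / n] * n                       # any starting distribution works
    for _ in range(1000):                   # xP, xP², xP³, ...
        nxt = [sum(x[i] * P[i][k] for i in range(n)) for k in range(n)]
        if max(abs(u - v) for u, v in zip(x, nxt)) < tol:
            break
        x = nxt
    return dict(zip(pages, nxt))
```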
16
Pagerank summary
  • Pre-processing:
    • Given graph of links, build matrix P
    • From it compute a
    • The pagerank ai is a scaled number between 0 and 1
  • Query processing:
    • Retrieve pages meeting query
    • Rank them by their pagerank
    • Order is query-independent
17
Hyperlink-Induced Topic Search (HITS)
  • In response to a query, instead of an ordered list of pages each meeting the query, find two sets of inter-related pages:
    • Hub pages are good lists of links on a subject.
      • e.g., “Bob’s list of cancer-related links.”
    • Authority pages occur recurrently on good hubs for the subject.
  • Best suited for broad-topic searches rather than for known-item searches.
  • Gets at a broader slice of common opinion.
18
Hubs and Authorities
  • Thus, a good hub for a topic points to many authoritative pages for that topic.


  • A good authority for a topic is pointed to by many good hubs for that topic.


  • Circular definition - will turn this into an iterative computation.
19
Hubs and Authorities
20
High-level scheme
  • Extract from the web a base set of pages that could be good hubs or authorities.


  • From these, identify a small set of top hub and authority pages
    •  iterative algorithm
21
Base set
  • Given text query (say university), use a text index to get all pages containing university.
    • Call this the root set of pages
  • Add in any page that either:
    • points to a page in the root set, or
    • is pointed to by a page in the root set
  • Call this the base set
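The expansion step can be sketched over precomputed link maps (a minimal sketch; the `base_set` name and the assumption that in-link and out-link maps are already available — e.g., from a connectivity server — are ours):

```python
def base_set(root_set, out_links, in_links):
    """Expand a root set of pages into the base set.

    out_links[p]: pages p points to; in_links[p]: pages that point to p."""
    base = set(root_set)
    for p in root_set:
        base.update(out_links.get(p, []))  # pages the root set points to
        base.update(in_links.get(p, []))   # pages pointing into the root set
    return base
```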
22
23
Assembling the base set
  • Root set typically 200-1000 nodes.
  • Base set may have up to 5000 nodes.
  • How do you find the base set nodes?


    • Follow out-links by parsing root set pages.


    • Get in-links (and out-links) from a connectivity server.
24
Distilling hubs and authorities
  • Compute, for each page x in the base set, a hub score h(x) and an authority score a(x).
  • Initialize: for all x, h(x) ← 1; a(x) ← 1
  • Iteratively update all h(x), a(x);
  • After iterations:
    • highest h() scores are hubs
    • highest a() scores are authorities
25
Iterative update
  • Repeat the following updates, for all x:
    • h(x) ← Σx→y a(y)   (sum of the authority scores of the pages x points to)
    • a(x) ← Σy→x h(y)   (sum of the hub scores of the pages that point to x)
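The iteration can be sketched as follows (an illustrative sketch; the function name and link-map format are ours — scores are rescaled each pass, since only their relative order matters):

```python
def hits(pages, out_links, iterations=5):
    """Distill hub scores h(x) and authority scores a(x) over the base set.

    out_links[p] lists the pages p points to."""
    h = {p: 1.0 for p in pages}             # initialize: h(x) <- 1
    a = {p: 1.0 for p in pages}             # initialize: a(x) <- 1
    for _ in range(iterations):             # ~5 iterations suffice in practice
        # a(x) <- sum of h(y) over links y -> x
        a = {x: sum(h[y] for y in pages if x in out_links.get(y, [])) for x in pages}
        # h(x) <- sum of a(y) over links x -> y
        h = {x: sum(a[y] for y in out_links.get(x, [])) for x in pages}
        for s in (a, h):                    # rescale to keep values bounded
            total = sum(s.values()) or 1.0
            for x in s:
                s[x] /= total
    return h, a
```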
26
How many iterations?
  • Relative values of scores will converge after a few iterations
  • We only require the relative ordering of the h() and a() scores - not their absolute values
  • In practice, ~5 iterations needed
27
Things to think about
  • Use only link analysis after base set assembled
    • iterative scoring is query-specific
  • Iterative computation after text index retrieval - significant overhead
28
Things to think about
  • How does the selection of the base set influence computation of H & As?
  • Can we embed the computation of H & A during the standard VS retrieval algorithm?
  • A pagerank score is a global score.  Can there be a fusion between H&A (which are query sensitive) and pagerank?  How would you do it?