Pagerank and HITS*
Module 7 Applied Bibliometrics
KAN Min-Yen
*Part of these lecture notes come from Manning, Raghavan and Schütze @ Stanford CS

Connectivity analysis
 Idea: mine hyperlink information in the Web
 Assumptions:
 Links often connect related pages
 A link between pages is a recommendation
“people vote with their links”

Query-independent ordering
Using link counts as simple measures of popularity
Two basic suggestions:
Undirected popularity:
a page's in-links plus out-links (e.g., 3 in-links + 2 out-links = 5)
Directed popularity:
the number of its in-links (e.g., 3)

Algorithm
Retrieve all pages meeting the text query (say venture capital), perhaps using the Boolean model
Order these by link popularity
(either variant on the previous page)
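
The directed (in-link) variant above can be sketched in a few lines, assuming the link graph is given as hypothetical (source, target) pairs:

```python
from collections import Counter

def directed_popularity(links):
    """Count in-links per page; links is a list of (source, target) pairs."""
    return Counter(target for _, target in links)

# Toy graph: a and c both point to b; b points to c.
links = [("a", "b"), ("c", "b"), ("b", "c")]
scores = directed_popularity(links)
# b has 2 in-links, c has 1, a has 0
```

Pages retrieved for the query would then be sorted by `scores[page]`, highest first.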

Pagerank scoring
Imagine a browser doing a random walk on web pages:
Start at a random page
At each step, follow one of the n links on that page, each with 1/n probability
Do this repeatedly.  Use the “long-term visit rate” as the page’s score
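
A minimal simulation of this idea, on a toy graph where every page has at least one out-link (no dead ends yet), estimates the long-term visit rate by counting visits:

```python
import random

def visit_rates(start, out_links, steps=100_000, seed=0):
    """Estimate long-term visit rates by simulating the random walk.

    Assumes every page has at least one out-link (no dead ends).
    """
    rng = random.Random(seed)
    counts = {}
    page = start
    for _ in range(steps):
        page = rng.choice(out_links[page])   # follow a random out-link
        counts[page] = counts.get(page, 0) + 1
    return {p: c / steps for p, c in counts.items()}

# Two-page cycle a -> b -> a: each page is visited half the time.
rates = visit_rates("a", {"a": ["b"], "b": ["a"]})
```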

Not quite enough
The web is full of dead ends.
What sites have dead ends?
Our random walk can get stuck.

Teleporting
At each step, with probability 10%, teleport to a random web page
With remaining probability (90%), follow a random link on the page
If a dead-end, stay put in this case
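
One step of this teleporting surfer can be sketched as follows (a toy version; the page names and dictionary layout are illustrative):

```python
import random

def surf_step(page, out_links, pages, teleport=0.1):
    """One move of the teleporting random surfer."""
    if random.random() < teleport:
        return random.choice(pages)        # teleport to a random page
    succs = out_links.get(page, [])
    if not succs:
        return page                        # dead end: stay put
    return random.choice(succs)            # follow a random out-link
```

With `teleport=0.1` the surfer jumps to a uniformly random page 10% of the time, matching the slide's probabilities.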

Result of teleporting
Now we cannot get stuck locally
There is a long-term rate at which any page is visited (not obvious, will show this)
How do we compute this visit rate?

Markov chains
A Markov chain consists of n states, plus an n×n transition probability matrix P.
At each step, we are in exactly one of the states.
For 1 ≤ i,k ≤ n, the matrix entry P_ik tells us the probability of k being the next state, given we are currently in state i.

Markov chains
Clearly, for all i, Σ_k P_ik = 1 (each row of P is a probability distribution)
 Markov chains are abstractions of random walks

Ergodic Markov chains
A Markov chain is ergodic if
you have a path from any state to any other
you can be in any state at every time step, with non-zero probability
With teleportation, our Markov chain is ergodic

Steady State
For any ergodic Markov chain, there is a unique long-term visit rate for each state
Over a long period, we’ll visit each state in proportion to this rate
It doesn’t matter where we start

Probability vectors
A probability (row) vector x = (x1, … xn) tells us where the walk is at any point
E.g., (0 0 … 1 … 0 0), with a 1 in position i, means we're in state i.

Change in probability vector
If the probability vector is  x = (x1, … xn) at this step, what is it at the next step?
Recall that row i of the transition probability matrix P tells us where we go next from state i.
So from x, our next state is distributed as xP.
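
One step of this update, on a hypothetical 2-state chain, looks like this (the matrix values are made up for illustration):

```python
# From state 0, go to state 1 with probability 1;
# from state 1, go to either state with probability 1/2.
P = [[0.0, 1.0],
     [0.5, 0.5]]

def next_distribution(x, P):
    """Return xP: the distribution after one step of the chain."""
    n = len(P)
    return [sum(x[i] * P[i][k] for i in range(n)) for k in range(n)]

x = [1.0, 0.0]                  # start surely in state 0
# after one step the walk is surely in state 1
```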

Pagerank algorithm
Regardless of where we start, we eventually reach the steady state a
Start with any distribution (say x = (1 0 … 0))
After one step, we're at xP
After two steps at xP^2, then xP^3, and so on.
"Eventually" means: for "large" k, xP^k ≈ a
Algorithm: multiply x by increasing powers of P until the product looks stable
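
The power-iteration algorithm above, as a minimal sketch (plain lists, no libraries; P must already be the row-stochastic, teleport-adjusted matrix):

```python
def pagerank(P, tol=1e-10, max_iter=1000):
    """Repeatedly replace x by xP until the product looks stable."""
    n = len(P)
    x = [1.0 / n] * n                       # any starting distribution works
    for _ in range(max_iter):
        nxt = [sum(x[i] * P[i][k] for i in range(n)) for k in range(n)]
        if max(abs(u - v) for u, v in zip(nxt, x)) < tol:
            return nxt                      # converged: this is a
        x = nxt
    return x

# Toy 2-state chain; its steady state is a = (1/3, 2/3).
P = [[0.0, 1.0],
     [0.5, 0.5]]
a = pagerank(P)
```

Because the chain is ergodic, the same `a` comes out regardless of the starting distribution.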

Pagerank summary
Pre-processing:
Given graph of links, build matrix P
From it compute a
The pagerank a_i is a scaled number between 0 and 1
Query processing:
Retrieve pages meeting query
Rank them by their pagerank
Order is query-independent

Hyperlink-Induced Topic Search (HITS)
In response to a query, instead of an ordered list of pages each meeting the query, find two sets of inter-related pages:
Hub pages are good lists of links on a subject.
e.g., “Bob’s list of cancer-related links.”
Authority pages occur recurrently on good hubs for the subject.
Best suited for broad topic queries rather than for known-item searches.
Gets at a broader slice of common opinion.

Hubs and Authorities
Thus, a good hub for a topic points to many authoritative pages for that topic.
A good authority for a topic is pointed to by many good hubs for that topic.
Circular definition - will turn this into an iterative computation.

High-level scheme
Extract from the web a base set of pages that could be good hubs or authorities.
From these, identify a small set of top hub and authority pages via an iterative algorithm

Base set
Given text query (say university), use a text index to get all pages containing university.
Call this the root set of pages
Add in any page that either:
points to a page in the root set, or
is pointed to by a page in the root set
Call this the base set
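
The root-to-base expansion can be sketched as below, assuming the link graph is a hypothetical dict from page to its out-link targets:

```python
def base_set(root, out_links):
    """Expand a root set into the HITS base set."""
    base = set(root)
    for page in root:
        base.update(out_links.get(page, []))   # pages the root set points to
    for page, succs in out_links.items():
        if any(t in root for t in succs):
            base.add(page)                     # pages pointing into the root set
    return base

# root = {"u1"}: "x" -> "u1" and "u1" -> "y", so x and y both join the base set
```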

Assembling the base set
Root set typically 200-1000 nodes.
Base set may have up to 5000 nodes.
How do you find the base set nodes?
Follow out-links by parsing root set pages.
Get in-links (and out-links) from a connectivity server.

Distilling hubs and authorities
Compute, for each page x in the base set, a hub score h(x) and an authority score a(x).
Initialize: for all x, h(x) ← 1; a(x) ← 1
Iteratively update all h(x), a(x);
After iterations:
highest h() scores are hubs
highest a() scores are authorities

Iterative update
Repeat the following updates, for all x:
h(x) ← Σ a(y), summed over all pages y that x points to
a(x) ← Σ h(y), summed over all pages y that point to x

How many iterations?
Relative values of scores will converge after a few iterations
We only require the relative order of the h() and a() scores - not their absolute values
In practice, ~5 iterations needed
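
The update-and-normalize loop above can be sketched as follows (the scaling keeps scores bounded; only their relative order matters; the toy page names are illustrative):

```python
import math

def hits(pages, out_links, iterations=5):
    """Iterate the hub/authority updates with normalization."""
    h = {p: 1.0 for p in pages}
    a = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # a(x) <- sum of h(y) over pages y pointing to x
        a = {x: sum(h[y] for y in pages if x in out_links.get(y, []))
             for x in pages}
        # h(x) <- sum of a(y) over pages y that x points to
        h = {x: sum(a[y] for y in out_links.get(x, [])) for x in pages}
        na = math.sqrt(sum(v * v for v in a.values())) or 1.0
        nh = math.sqrt(sum(v * v for v in h.values())) or 1.0
        a = {x: v / na for x, v in a.items()}
        h = {x: v / nh for x, v in h.items()}
    return h, a

# "hub" points at both other pages, so it should get the top hub score
pages = ["hub", "p1", "p2"]
out_links = {"hub": ["p1", "p2"]}
h, a = hits(pages, out_links)
```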

Things to think about
Use only link analysis after the base set is assembled
iterative scoring is query-dependent
Iterative computation follows text index retrieval - significant overhead

Things to think about
How does the selection of the base set influence computation of H & As?
Can we embed the computation of H & A during the standard VS retrieval algorithm?
A pagerank score is a global score.  Can there be a fusion between H&A (which are query sensitive) and pagerank?  How would you do it?