Module 7: Applied Bibliometrics
KAN Min-Yen
*Part of these lecture notes come from Manning, Raghavan and Schütze @ Stanford CS
Idea: mine hyperlink information on the Web
Assumptions:
- Links often connect related pages
- A link between pages is a recommendation ("people vote with their links")
Using link counts as simple measures of popularity
Two basic suggestions:
- Undirected popularity: a page's in-links plus out-links (e.g., 3 + 2 = 5)
- Directed popularity: the number of its in-links (e.g., 3)
To rank:
- Retrieve all pages meeting the text query (say, venture capital), perhaps by using the Boolean model
- Order these by link popularity (either variant above)
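The directed variant can be sketched in a few lines. This is an illustrative example, not part of the notes: `links` (page → pages it links to) and `matches` (pages satisfying the text query) are hypothetical inputs.

```python
# Sketch: rank query-matching pages by directed popularity (in-link count).
from collections import Counter

def rank_by_inlinks(links, matches):
    # Count, for every page, how many pages link to it.
    in_counts = Counter(dst for dsts in links.values() for dst in dsts)
    # Order the query matches by descending in-link count.
    return sorted(matches, key=lambda page: in_counts[page], reverse=True)

links = {"a": ["c"], "b": ["c"], "c": ["d"], "d": []}
print(rank_by_inlinks(links, ["d", "c"]))  # ['c', 'd']: c has 2 in-links, d has 1
```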
Imagine a browser doing a random walk on web pages:
- Start at a random page
- At each step, follow one of the n links on that page, each with probability 1/n
Do this repeatedly. Use the "long-term visit rate" as the page's score.
Problem: the web is full of dead ends (pages with no out-links), so our random walk can get stuck.
What sites have dead ends?
The fix: teleporting.
- At each step, with probability 10%, teleport to a random web page
- With the remaining probability (90%), follow a random link on the page
- At a dead end, stay put in this case
Now we cannot get stuck locally.
There is a long-term rate at which any page is visited (not obvious; we will show this).
How do we compute this visit rate?
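Before answering analytically, the long-term visit rate can be estimated empirically by simulating the walk. A minimal sketch, assuming a toy three-page graph, step count, and seed chosen purely for illustration:

```python
# Monte Carlo sketch of the teleporting random walk described above.
import random
from collections import Counter

def visit_rates(links, steps=100_000, teleport=0.10, seed=42):
    rng = random.Random(seed)
    pages = list(links)
    page = rng.choice(pages)          # start at a random page
    visits = Counter()
    for _ in range(steps):
        if rng.random() < teleport:
            page = rng.choice(pages)          # teleport to a random page
        elif links[page]:
            page = rng.choice(links[page])    # follow a random out-link
        # else: dead end and no teleport this step -> stay put
        visits[page] += 1
    # Long-term visit rate = fraction of steps spent at each page.
    return {p: visits[p] / steps for p in pages}

rates = visit_rates({"a": ["b"], "b": ["a", "c"], "c": []})
```

With teleportation, every page gets a non-zero rate and the rates sum to 1.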
A Markov chain consists of n states, plus an n×n transition probability matrix P.
At each step, we are in exactly one of the states.
For 1 ≤ i, k ≤ n, the matrix entry P_ik tells us the probability of k being the next state, given we are currently in state i.
Clearly, for all i, Σ_k P_ik = 1 (each row of P sums to 1).
Markov chains are abstractions of random walks.
A Markov chain is ergodic if:
- there is a path from any state to any other, and
- you can be in any state at every time step, with non-zero probability
With teleportation, our Markov chain is ergodic.
For any ergodic Markov chain, there is a unique long-term visit rate for each state:
- over a long period, we visit each state in proportion to this rate
- it doesn't matter where we start
A probability (row) vector x = (x_1, …, x_n) tells us where the walk is at any point.
E.g., (0 0 0 … 1 … 0 0 0) means we're in state i.
If the probability vector is x = (x_1, …, x_n) at this step, what is it at the next step?
Recall that row i of the transition probability matrix P tells us where we go next from state i.
So from x, our next state is distributed as xP.
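The one-step update xP can be checked on a tiny example. The two-state matrix below is made up for illustration:

```python
import numpy as np

# Two-state chain: row i of P is the next-state distribution from state i.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
x = np.array([1.0, 0.0])   # currently in state 0 with certainty
next_x = x @ P             # distribution at the next step
print(next_x)              # [0.9 0.1]: row 0 of P, as expected
```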
Regardless of where we start, we eventually reach the steady state a:
- Start with any distribution (say x = (1 0 … 0))
- After one step, we're at xP
- After two steps, at xP², then xP³, and so on
"Eventually" means that for "large" k, xP^k ≈ a.
Algorithm: multiply x by increasing powers of P until the product looks stable.
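This power-iteration algorithm is a short loop. A minimal sketch, assuming P is the teleport-adjusted (ergodic) transition matrix; the 2×2 example matrix is invented for illustration:

```python
import numpy as np

def steady_state(P, tol=1e-10, max_iter=1000):
    n = P.shape[0]
    x = np.full(n, 1.0 / n)        # any starting distribution works
    for _ in range(max_iter):
        x_next = x @ P             # one more power of P
        if np.abs(x_next - x).sum() < tol:
            break                  # product looks stable: steady state reached
        x = x_next
    return x_next

P = np.array([[0.1, 0.9],
              [0.6, 0.4]])
a = steady_state(P)                # converges to (0.4, 0.6)
```

One can verify the result satisfies a = aP: 0.4 = 0.1·0.4 + 0.6·0.6.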
Pre-processing:
- Given the graph of links, build the matrix P
- From it, compute a
- The PageRank a_i is a scaled number between 0 and 1
Query processing:
- Retrieve pages meeting the query
- Rank them by their PageRank
- Order is query-independent
Hyperlink-Induced Topic Search (HITS)
In response to a query, instead of an ordered list of pages each meeting the query, find two sets of inter-related pages:
- Hub pages are good lists of links on a subject (e.g., "Bob's list of cancer-related links")
- Authority pages occur recurrently on good hubs for the subject
Best suited for broad-topic queries rather than for known-item searches.
Gets at a broader slice of common opinion.
Thus, a good hub for a topic points to many authoritative pages for that topic.
A good authority for a topic is pointed to by many good hubs for that topic.
Circular definition: we will turn this into an iterative computation.
Extract from the web a base set of pages that could be good hubs or authorities.
From these, identify a small set of top hub and authority pages via an iterative algorithm.
Given a text query (say, university), use a text index to get all pages containing university.
- Call this the root set of pages
Add in any page that either:
- points to a page in the root set, or
- is pointed to by a page in the root set
Call this the base set.
The root set typically has 200-1000 nodes; the base set may have up to 5000 nodes.
How do you find the base set nodes?
- Follow out-links by parsing root set pages
- Get in-links (and out-links) from a connectivity server
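The expansion step can be sketched as set operations. The `out_links` and `in_links` lookups are assumed inputs here (in practice, in-links would come from a connectivity server, not a local dict):

```python
# Sketch of root-set expansion into the base set.
def build_base_set(root_set, out_links, in_links):
    base = set(root_set)
    for page in root_set:
        base.update(out_links.get(page, []))   # pages the root set points to
        base.update(in_links.get(page, []))    # pages pointing into the root set
    return base

out_links = {"u1": ["p1"], "u2": []}
in_links = {"u1": ["q1"], "u2": ["q2"]}
base = build_base_set({"u1", "u2"}, out_links, in_links)
print(sorted(base))  # ['p1', 'q1', 'q2', 'u1', 'u2']
```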
Distilling hubs and authorities
Compute, for each page x in the base set, a hub score h(x) and an authority score a(x).
Initialize: for all x, h(x) ← 1; a(x) ← 1.
Iteratively update all h(x), a(x).
After iterations:
- pages with the highest h() scores are hubs
- pages with the highest a() scores are authorities
Repeat the following updates, for all x:
- h(x) ← Σ_{x→y} a(y)   (sum of authority scores of pages x points to)
- a(x) ← Σ_{y→x} h(y)   (sum of hub scores of pages pointing to x)
Relative values of the scores will converge after a few iterations.
We only require the relative order of the h() and a() scores, not their absolute values.
In practice, ~5 iterations are needed.
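The iteration above can be sketched directly; `links` maps each base-set page to the base-set pages it points to, and the two-hub toy graph is invented for illustration:

```python
# HITS sketch: iterate hub/authority updates on the base-set subgraph.
def hits(links, iterations=5):
    pages = set(links) | {y for ys in links.values() for y in ys}
    h = {p: 1.0 for p in pages}       # initialize all hub scores to 1
    a = {p: 1.0 for p in pages}       # initialize all authority scores to 1
    for _ in range(iterations):
        # a(x) <- sum of h(y) over pages y that point to x
        a = {p: sum(h[q] for q in pages if p in links.get(q, ())) for p in pages}
        # h(x) <- sum of a(y) over pages y that x points to
        h = {p: sum(a[y] for y in links.get(p, ())) for p in pages}
        # Normalise each step: only the relative values of the scores matter.
        na, nh = sum(a.values()) or 1.0, sum(h.values()) or 1.0
        a = {p: v / na for p, v in a.items()}
        h = {p: v / nh for p, v in h.items()}
    return h, a

h, a = hits({"hub1": ["auth1", "auth2"], "hub2": ["auth1"]})
```

As expected, `auth1` (pointed to by both hubs) outscores `auth2`, and `hub1` (which points to both authorities) outscores `hub2`.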
Link analysis is used only after the base set is assembled.
The iterative scoring is query-dependent.
The iterative computation comes after text index retrieval, adding significant overhead.
How does the selection of the base set influence the computation of hubs and authorities?
Can we embed the computation of hubs and authorities in the standard vector space retrieval algorithm?
A PageRank score is a global score. Can there be a fusion between hub/authority scores (which are query-sensitive) and PageRank? How would you do it?