|
1
|
- Module 7: Applied Bibliometrics
- KAN Min-Yen
- *Parts of these lecture notes come from Manning, Raghavan and Schütze @ Stanford CS
|
|
2
|
- Idea: mine hyperlink information in the Web
- Assumptions:
- Links often connect related pages
- A link between pages is a recommendation
- “people vote with their links”
|
|
3
|
- Using link counts as simple measures of popularity
- Two basic suggestions:
- Undirected popularity: a page’s in-links plus out-links (e.g., 3 + 2 = 5)
- Directed popularity: the number of its in-links (e.g., 3)
|
|
4
|
- Retrieve all pages meeting the text query (say, venture capital), perhaps using the Boolean model
- Order these by link popularity (either variant on the previous slide)
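The two-step scheme above can be sketched as follows. This is a toy illustration: the link graph and the set of pages matching the Boolean query are invented.

```python
# Toy graph: page -> pages it links to (an invented example).
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}

# Count in-links (directed popularity).
in_links = {page: 0 for page in links}
for targets in links.values():
    for t in targets:
        in_links[t] += 1

# Suppose the Boolean query matched pages "a", "b", and "c":
# order them by in-link count.
matching = ["a", "b", "c"]
ranked = sorted(matching, key=lambda p: in_links[p], reverse=True)
print(ranked)  # "c" has 3 in-links; "a" and "b" have 1 each
```

Note that ranking is done only over the pages that match the query; the link counts themselves are query-independent.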
|
|
5
|
- Imagine a browser doing a random walk on web pages:
- Start at a random page
- At each step, follow one of the n links on that page, each with probability 1/n
- Do this repeatedly. Use the “long-term visit rate” as the page’s score
|
|
6
|
- The web is full of dead ends.
- What sites have dead ends?
- Our random walk can get stuck.
|
|
7
|
- At each step, with probability 10%, teleport to a random web page
- With the remaining probability (90%), follow a random link on the page
- If the page is a dead end, stay put in this case
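The teleporting walk above can be simulated directly. A minimal sketch, on an invented three-page graph where "c" is a dead end; the 10% teleport probability follows the slide, and at a dead end the walker stays put unless it teleports.

```python
import random

# Invented toy graph: page -> out-links ("c" is a dead end).
links = {"a": ["b"], "b": ["a", "c"], "c": []}
pages = list(links)

random.seed(0)  # for reproducibility
visits = {p: 0 for p in pages}
state = random.choice(pages)
steps = 50_000
for _ in range(steps):
    visits[state] += 1
    if random.random() < 0.10:
        state = random.choice(pages)           # teleport
    elif links[state]:
        state = random.choice(links[state])    # follow a random out-link
    # else: dead end -- stay put in this case

# Empirical long-term visit rates.
rates = {p: visits[p] / steps for p in pages}
```

With the stay-put rule, the dead end "c" can only be left by teleporting, so it accumulates a large share of the visits.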
|
|
8
|
- Now we cannot get stuck locally
- There is a long-term rate at which any page is visited (not obvious; we will show this)
- How do we compute this visit rate?
|
|
9
|
- A Markov chain consists of n states, plus an n×n transition probability matrix P.
- At each step, we are in exactly one of the states.
- For 1 ≤ i,k ≤ n, the matrix entry Pik tells us the probability of k being the next state, given we are currently in state i.
|
|
10
|
- Clearly, for all i, Σk Pik = 1
- Markov chains are abstractions of random walks
|
|
11
|
- A Markov chain is ergodic if
- you have a path from any state to any other
- you can be in any state at every time step, with non-zero probability
- With teleportation, our Markov chain is ergodic
|
|
12
|
- For any ergodic Markov chain, there is a unique long-term visit rate for each state
- Over a long period, we’ll visit each state in proportion to this rate
- It doesn’t matter where we start
|
|
13
|
- A probability (row) vector x = (x1, …, xn) tells us where the walk is at any point
- E.g., (0 0 0 … 1 … 0 0 0) means we’re in state i (the 1 is in position i)
|
|
14
|
- If the probability vector is x = (x1, …, xn) at this step, what is it at the next step?
- Recall that row i of the transition probability matrix P tells us where we go next from state i.
- So from x, our next state is distributed as xP.
|
|
15
|
- Regardless of where we start, we eventually reach the steady state a
- Start with any distribution (say x = (1 0 … 0))
- After one step, we’re at xP
- After two steps at xP^2, then xP^3, and so on.
- “Eventually” means for “large” k, xP^k ≈ a
- Algorithm: multiply x by increasing powers of P until the product looks stable
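The power-iteration algorithm above can be sketched in a few lines. The 3×3 matrix P below is an invented example whose rows sum to 1 and which already incorporates teleportation, so the chain is ergodic and the iteration converges.

```python
# Invented 3-state transition matrix; each row sums to 1.
P = [
    [0.05, 0.90, 0.05],
    [0.45, 0.10, 0.45],
    [0.90, 0.05, 0.05],
]
n = len(P)

x = [1.0, 0.0, 0.0]  # any starting distribution works
for _ in range(1000):
    # One step of the walk: x <- xP (row vector times matrix).
    nxt = [sum(x[i] * P[i][k] for i in range(n)) for k in range(n)]
    if max(abs(nxt[k] - x[k]) for k in range(n)) < 1e-12:
        break  # the product looks stable
    x = nxt

# x now approximates the steady-state vector a, which satisfies a = aP.
```

Starting from a different distribution gives the same limit, which is exactly the "it doesn't matter where we start" property of ergodic chains.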
|
|
16
|
- Pre-processing:
- Given the graph of links, build the matrix P
- From it, compute the steady-state vector a
- The pagerank ai is a scaled number between 0 and 1
- Query processing:
- Retrieve pages meeting the query
- Rank them by their pagerank
- Order is query-independent
|
|
17
|
- In response to a query, instead of an ordered list of pages each meeting the query, find two sets of inter-related pages:
- Hub pages are good lists of links on a subject.
- e.g., “Bob’s list of cancer-related links.”
- Authority pages occur recurrently on good hubs for the subject.
- Best suited for broad topic queries rather than for known-item searches.
- Gets at a broader slice of common opinion.
|
|
18
|
- Thus, a good hub for a topic points to many authoritative pages for that topic.
- A good authority for a topic is pointed to by many good hubs for that topic.
- Circular definition - we will turn this into an iterative computation.
|
|
20
|
- Extract from the web a base set of pages that could be good hubs or authorities.
- From these, identify a small set of top hub and authority pages.
|
|
21
|
- Given a text query (say university), use a text index to get all pages containing university.
- Call this the root set of pages
- Add in any page that either:
- points to a page in the root set, or
- is pointed to by a page in the root set
- Call this the base set
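The expansion from root set to base set can be sketched as below. The link structure and helper mappings are invented; in practice, out-links come from parsing the root-set pages and in-links from a connectivity server.

```python
# Invented link data: out_links from parsing pages,
# in_links as a connectivity server would report them.
out_links = {"u1": ["u2", "x1"], "u2": ["u1"]}
in_links = {"u1": ["u2", "x2"], "u2": ["u1"]}

root_set = {"u1", "u2"}  # pages containing the query term
base_set = set(root_set)
for page in root_set:
    base_set.update(out_links.get(page, []))  # pages the root set points to
    base_set.update(in_links.get(page, []))   # pages pointing into the root set
```

Here "x1" and "x2" enter the base set without containing the query term at all, which is what lets HITS find authorities that never mention the query words.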
|
|
23
|
- Root set typically 200-1000 nodes.
- Base set may have up to 5000 nodes.
- How do you find the base set nodes?
- Follow out-links by parsing root set pages.
- Get in-links (and out-links) from a connectivity server.
|
|
24
|
- Compute, for each page x in the base set, a hub score h(x) and an authority score a(x).
- Initialize: for all x, h(x) ← 1; a(x) ← 1
- Iteratively update all h(x), a(x)
- After the iterations:
- highest h() scores are hubs
- highest a() scores are authorities
|
|
25
|
- Repeat the following updates, for all x:
- h(x) ← Σ{x→y} a(y) (sum of authority scores of pages x points to)
- a(x) ← Σ{y→x} h(y) (sum of hub scores of pages pointing to x)
|
|
26
|
- Relative values of scores will converge after a few iterations
- We only require the relative ordering of the h() and a() scores - not their absolute values
- In practice, ~5 iterations needed
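The iterative scheme can be sketched as follows, on an invented four-page graph with two pure hubs pointing at two pure authorities. Scores are normalised each round, which is harmless since only their relative ordering matters.

```python
# Invented base-set graph: page -> out-links.
links = {
    "hub1": ["auth1", "auth2"],
    "hub2": ["auth2"],
    "auth1": [],
    "auth2": [],
}
pages = list(links)

# Initialize: for all x, h(x) <- 1, a(x) <- 1.
h = {x: 1.0 for x in pages}
a = {x: 1.0 for x in pages}

for _ in range(5):  # ~5 iterations suffice in practice
    # a(x) <- sum of h(y) over pages y pointing to x
    a = {x: sum(h[y] for y in pages if x in links[y]) for x in pages}
    # h(x) <- sum of a(y) over pages y that x points to
    h = {x: sum(a[y] for y in links[x]) for x in pages}
    # Normalise; only the relative ordering of scores matters.
    norm_a, norm_h = sum(a.values()), sum(h.values())
    a = {x: v / norm_a for x, v in a.items()}
    h = {x: v / norm_h for x, v in h.items()}
```

Afterwards, the pages with the highest h() scores are the hubs and those with the highest a() scores are the authorities; here "hub1" (which points to both authorities) out-scores "hub2", and "auth2" (pointed to by both hubs) out-scores "auth1".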
|
|
27
|
- Use only link analysis after the base set is assembled
- iterative scoring is query-dependent
- Iterative computation after text index retrieval - significant overhead
|
|
28
|
- How does the selection of the base set influence the computation of H & A?
- Can we embed the computation of H & A in the standard VS retrieval algorithm?
- A pagerank score is a global score. Can there be a fusion between H & A (which are query-sensitive) and pagerank? How would you do it?
|