Module 7: Applied Bibliometrics
KAN Min-Yen
*Part of these lecture notes come from Manning, Raghavan and Schütze @ Stanford CS
Idea: mine hyperlink information on the Web
Assumptions:
- Links often connect related pages
- A link between pages is a recommendation ("people vote with their links")
Using link counts as simple measures of popularity
Two basic suggestions:
- Undirected popularity: a page's in-links plus out-links (e.g., 3 + 2 = 5)
- Directed popularity: the number of its in-links (e.g., 3)
To rank:
- Retrieve all pages meeting the text query (say, venture capital), perhaps by using the Boolean model
- Order these by link popularity (either variant above)
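The directed variant can be sketched in a few lines. This is an illustrative example, not part of the notes: `links` (page → pages it links to) and `matches` (pages satisfying the text query) are hypothetical inputs.

```python
# Sketch: rank query-matching pages by directed popularity (in-link count).
from collections import Counter

def rank_by_inlinks(links, matches):
    # Count, for every page, how many pages link to it.
    in_counts = Counter(dst for dsts in links.values() for dst in dsts)
    # Order the query matches by descending in-link count.
    return sorted(matches, key=lambda page: in_counts[page], reverse=True)

links = {"a": ["c"], "b": ["c"], "c": ["d"], "d": []}
print(rank_by_inlinks(links, ["d", "c"]))  # ['c', 'd']: c has 2 in-links, d has 1
```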
Imagine a browser doing a random walk on web pages:
- Start at a random page
- At each step, follow one of the n links on that page, each with probability 1/n
Do this repeatedly. Use the "long-term visit rate" as the page's score.
Problem: the web is full of dead ends (pages with no out-links), so our random walk can get stuck.
What sites have dead ends?
The fix: teleporting.
- At each step, with probability 10%, teleport to a random web page
- With the remaining probability (90%), follow a random link on the page
- At a dead end, stay put in this case
Now we cannot get stuck locally.
There is a long-term rate at which any page is visited (not obvious; we will show this).
How do we compute this visit rate?
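Before answering analytically, the long-term visit rate can be estimated empirically by simulating the walk. A minimal sketch, assuming a toy three-page graph, step count, and seed chosen purely for illustration:

```python
# Monte Carlo sketch of the teleporting random walk described above.
import random
from collections import Counter

def visit_rates(links, steps=100_000, teleport=0.10, seed=42):
    rng = random.Random(seed)
    pages = list(links)
    page = rng.choice(pages)          # start at a random page
    visits = Counter()
    for _ in range(steps):
        if rng.random() < teleport:
            page = rng.choice(pages)          # teleport to a random page
        elif links[page]:
            page = rng.choice(links[page])    # follow a random out-link
        # else: dead end and no teleport this step -> stay put
        visits[page] += 1
    # Long-term visit rate = fraction of steps spent at each page.
    return {p: visits[p] / steps for p in pages}

rates = visit_rates({"a": ["b"], "b": ["a", "c"], "c": []})
```

With teleportation, every page gets a non-zero rate and the rates sum to 1.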
A Markov chain consists of n states, plus an n×n transition probability matrix P.
At each step, we are in exactly one of the states.
For 1 ≤ i, k ≤ n, the matrix entry P_ik tells us the probability of k being the next state, given we are currently in state i.
Clearly, for all i, Σ_k P_ik = 1 (each row of P sums to 1).
Markov chains are abstractions of random walks.
A Markov chain is ergodic if:
- there is a path from any state to any other, and
- you can be in any state at every time step, with non-zero probability
With teleportation, our Markov chain is ergodic.
For any ergodic Markov chain, there is a unique long-term visit rate for each state:
- over a long period, we visit each state in proportion to this rate
- it doesn't matter where we start
A probability (row) vector x = (x_1, …, x_n) tells us where the walk is at any point.
E.g., (0 0 0 … 1 … 0 0 0) means we're in state i.
If the probability vector is x = (x_1, …, x_n) at this step, what is it at the next step?
Recall that row i of the transition probability matrix P tells us where we go next from state i.
So from x, our next state is distributed as xP.
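The one-step update xP can be checked on a tiny example. The two-state matrix below is made up for illustration:

```python
import numpy as np

# Two-state chain: row i of P is the next-state distribution from state i.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
x = np.array([1.0, 0.0])   # currently in state 0 with certainty
next_x = x @ P             # distribution at the next step
print(next_x)              # [0.9 0.1]: row 0 of P, as expected
```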
Regardless of where we start, we eventually reach the steady state a:
- Start with any distribution (say x = (1 0 … 0))
- After one step, we're at xP
- After two steps, at xP², then xP³, and so on
"Eventually" means that for "large" k, xP^k ≈ a.
Algorithm: multiply x by increasing powers of P until the product looks stable.
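This power-iteration algorithm is a short loop. A minimal sketch, assuming P is the teleport-adjusted (ergodic) transition matrix; the 2×2 example matrix is invented for illustration:

```python
import numpy as np

def steady_state(P, tol=1e-10, max_iter=1000):
    n = P.shape[0]
    x = np.full(n, 1.0 / n)        # any starting distribution works
    for _ in range(max_iter):
        x_next = x @ P             # one more power of P
        if np.abs(x_next - x).sum() < tol:
            break                  # product looks stable: steady state reached
        x = x_next
    return x_next

P = np.array([[0.1, 0.9],
              [0.6, 0.4]])
a = steady_state(P)                # converges to (0.4, 0.6)
```

One can verify the result satisfies a = aP: 0.4 = 0.1·0.4 + 0.6·0.6.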
Pre-processing:
- Given the graph of links, build the matrix P
- From it, compute a
- The PageRank a_i is a scaled number between 0 and 1
Query processing:
- Retrieve pages meeting the query
- Rank them by their PageRank
- Order is query-independent
Hyperlink-Induced Topic Search (HITS)
In response to a query, instead of an ordered list of pages each meeting the query, find two sets of inter-related pages:
- Hub pages are good lists of links on a subject (e.g., "Bob's list of cancer-related links")
- Authority pages occur recurrently on good hubs for the subject
Best suited for broad-topic queries rather than for known-item searches.
Gets at a broader slice of common opinion.
Thus, a good hub for a topic points to many authoritative pages for that topic.
A good authority for a topic is pointed to by many good hubs for that topic.
Circular definition: we will turn this into an iterative computation.
Extract from the web a base set of pages that could be good hubs or authorities.
From these, identify a small set of top hub and authority pages via an iterative algorithm.
Given a text query (say, university), use a text index to get all pages containing university.
- Call this the root set of pages
Add in any page that either:
- points to a page in the root set, or
- is pointed to by a page in the root set
Call this the base set.
The root set typically has 200-1000 nodes; the base set may have up to 5000 nodes.
How do you find the base set nodes?
- Follow out-links by parsing root set pages
- Get in-links (and out-links) from a connectivity server
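The expansion step can be sketched as set operations. The `out_links` and `in_links` lookups are assumed inputs here (in practice, in-links would come from a connectivity server, not a local dict):

```python
# Sketch of root-set expansion into the base set.
def build_base_set(root_set, out_links, in_links):
    base = set(root_set)
    for page in root_set:
        base.update(out_links.get(page, []))   # pages the root set points to
        base.update(in_links.get(page, []))    # pages pointing into the root set
    return base

out_links = {"u1": ["p1"], "u2": []}
in_links = {"u1": ["q1"], "u2": ["q2"]}
base = build_base_set({"u1", "u2"}, out_links, in_links)
print(sorted(base))  # ['p1', 'q1', 'q2', 'u1', 'u2']
```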
Distilling hubs and authorities
Compute, for each page x in the base set, a hub score h(x) and an authority score a(x).
Initialize: for all x, h(x) ← 1; a(x) ← 1.
Iteratively update all h(x), a(x).
After iterations:
- pages with the highest h() scores are hubs
- pages with the highest a() scores are authorities
Repeat the following updates, for all x:
- h(x) ← Σ_{x→y} a(y)   (sum of authority scores of pages x points to)
- a(x) ← Σ_{y→x} h(y)   (sum of hub scores of pages pointing to x)
Relative values of the scores will converge after a few iterations.
We only require the relative order of the h() and a() scores, not their absolute values.
In practice, ~5 iterations are needed.
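The iteration above can be sketched directly; `links` maps each base-set page to the base-set pages it points to, and the two-hub toy graph is invented for illustration:

```python
# HITS sketch: iterate hub/authority updates on the base-set subgraph.
def hits(links, iterations=5):
    pages = set(links) | {y for ys in links.values() for y in ys}
    h = {p: 1.0 for p in pages}       # initialize all hub scores to 1
    a = {p: 1.0 for p in pages}       # initialize all authority scores to 1
    for _ in range(iterations):
        # a(x) <- sum of h(y) over pages y that point to x
        a = {p: sum(h[q] for q in pages if p in links.get(q, ())) for p in pages}
        # h(x) <- sum of a(y) over pages y that x points to
        h = {p: sum(a[y] for y in links.get(p, ())) for p in pages}
        # Normalise each step: only the relative values of the scores matter.
        na, nh = sum(a.values()) or 1.0, sum(h.values()) or 1.0
        a = {p: v / na for p, v in a.items()}
        h = {p: v / nh for p, v in h.items()}
    return h, a

h, a = hits({"hub1": ["auth1", "auth2"], "hub2": ["auth1"]})
```

As expected, `auth1` (pointed to by both hubs) outscores `auth2`, and `hub1` (which points to both authorities) outscores `hub2`.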
Link analysis is used only after the base set is assembled.
The iterative scoring is query-dependent.
The iterative computation comes after text index retrieval, adding significant overhead.
How does the selection of the base set influence the computation of hubs and authorities?
Can we embed the computation of hubs and authorities in the standard vector space retrieval algorithm?
A PageRank score is a global score. Can there be a fusion between hub/authority scores (which are query-sensitive) and PageRank? How would you do it?