1
From last time
  • Examined DL policy and some specific examples


  • Undoing the Digital Divide –
    • Unequal access rights for privileged / unprivileged
    • Preservation via indexing and archiving of most valuable knowledge

2
Introduction to Bibliometrics
  • Module 7 Applied Bibliometrics
  • KAN Min-Yen
3
What is Bibliometrics?
  • Statistical and other forms of quantitative analysis


  • Used to discover and chart the growth patterns of information
      • Production
      • Use
4
Outline
  • What is bibliometrics? √
  • Bibliometric laws


  • Properties of information and its production
5
Properties of Academic Literature
  • Growth
  • Fragmentation
  • Obsolescence
  • Linkage
6
Growth
  • Exponential rate for several centuries: “information overload”
  • 1st known scientific journals: 1665
  • Today:
    • LINC has about 15,000 in all libraries


  • Factors:
    • Ease of publication
    • Ease of use and increased availability
    • Known reputation
7
Zipf-Yule-Pareto Law
  • $P_n \approx 1/n^a$
    • where $P_n$ is the frequency of occurrence of the nth-ranked item and a ≈ 1.


      • “The probability of occurrence of a value of some variable starts high and tapers off. Thus, a few values occur very often while many others occur rarely.”

  • Pareto – for land ownership in the 1800’s
  • Zipf – for word frequency
  • Also known as the 80/20 rule and as Zipf-Mandelbrot
  • Used to measure citations per paper:
    • # of papers cited n times is about $1/n^a$ of those cited once, where a ≈ 1
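
One way to check whether ranked data follows this law is to fit a straight line to log(frequency) against log(rank); the negated slope is an estimate of a. The following is a minimal sketch under that assumption. The helper names `rank_frequencies` and `estimate_exponent` are hypothetical, and the data is synthetic.

```python
from collections import Counter
import math

def rank_frequencies(tokens):
    """Return item frequencies sorted from most to least common."""
    return [count for _, count in Counter(tokens).most_common()]

def estimate_exponent(freqs):
    """Negated least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return -num / den

# Synthetic frequencies that follow 1/n exactly (scaled by 1200),
# so the fitted exponent should come out close to 1.
ideal = [1200 / n for n in range(1, 11)]
a_hat = estimate_exponent(ideal)
```

On real word counts or citation counts the fit is only approximate, and the tail of rarely occurring items tends to be noisy.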
8
Random processes and Zipfian behavior
  • Some random processes can also result in Zipfian behavior:


    • At the beginning there is one “seminal” paper.
    • Every sequential paper makes at most ten citations (or cites all preceding papers if their number does not exceed ten).
    • All preceding papers have an equal probability to be cited.

  • Result: A Zipfian curve, with a≈1.
    What’s your conclusion?
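
The random process above can be simulated directly, which makes it easier to inspect the resulting citation counts. This is a sketch only: it assumes each new paper cites exactly min(10, number of preceding papers) distinct earlier papers, chosen uniformly, and the function name `simulate_citations` is hypothetical.

```python
import random

def simulate_citations(num_papers, max_refs=10, seed=0):
    """Simulate the process: paper 0 is the seminal paper, and every later
    paper cites up to max_refs earlier papers, all equally likely."""
    rng = random.Random(seed)
    cited = [0] * num_papers  # cited[i] = number of times paper i is cited
    for new in range(1, num_papers):
        k = min(max_refs, new)  # cite all predecessors if there are <= max_refs
        for old in rng.sample(range(new), k):
            cited[old] += 1
    return cited

cited = simulate_citations(100)
ranked = sorted(cited, reverse=True)  # citation counts by rank
```

Plotting `ranked` on log-log axes is one way to examine the shape of the curve before drawing a conclusion.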
9
Lotka’s Law
  • The number of authors making n contributions is about $1/n^a$ of those making one contribution, where a ≈ 2.


  • Implications:
    • A small number of authors produce a large number of papers, e.g., 10% of authors produce half of the literature in a field
    • Those who achieve success in writing papers are likely to continue having it
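
As a small worked example of the law, the sketch below tabulates the expected number of authors with n papers, given a count of single-paper authors. The function name and the starting figure of 100 authors are illustrative assumptions.

```python
def lotka_expected_authors(single_paper_authors, max_n, a=2.0):
    """Expected number of authors with n papers, for n = 1..max_n,
    as 1/n^a of the number of authors with exactly one paper."""
    return {n: single_paper_authors / n ** a for n in range(1, max_n + 1)}

# With 100 single-paper authors and a = 2, about 25 authors are
# expected to have two papers and about 6 to have four.
expected = lotka_expected_authors(100, 4)
```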

10
Lotka’s Law in Action
11
Bradford’s Law of Scattering
  • Journals in a field can be divided into three parts, each with about one-third of all articles:


    • 1) a core of a few journals,
    • 2) a second zone, with more journals, and
    • 3) a third zone, with the bulk of journals.

  • The number of journals in the three zones is in the ratio $1 : n : n^2$


  • To think about: Why is this true?
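
The zones can be computed from per-journal article counts by ranking journals from most to least productive and cutting whenever another third of the articles has been covered. This is a sketch of one possible way to draw the boundaries, not a standard definition; `bradford_zones` is a hypothetical name and the counts are made up.

```python
def bradford_zones(article_counts, zones=3):
    """Return the number of journals in each zone, where each zone holds
    roughly an equal share of all articles."""
    ranked = sorted(article_counts, reverse=True)
    total = sum(ranked)
    sizes, running, in_zone, zone = [], 0, 0, 1
    for count in ranked:
        running += count
        in_zone += 1
        # Close the current zone once its cumulative share is reached.
        if zone < zones and running >= zone * total / zones:
            sizes.append(in_zone)
            in_zone = 0
            zone += 1
    sizes.append(in_zone)  # whatever remains forms the last zone
    return sizes

# 1 journal with 36 articles, 3 with 12 each, 9 with 4 each.
counts = [36] + [12] * 3 + [4] * 9
sizes = bradford_zones(counts)
```

For this made-up data the zone sizes are 1, 3 and 9 journals, i.e. the ratio 1 : n : n² with n = 3.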
12
Fragmentation
  • Influenced by scientific method
    • Information is continuous, but discretized into standard chunks
    • (e.g., conference papers, journal article, surveys, texts, Ph.D. thesis)

  • One paper reports one experiment
  • Scientists aim to publish in diverse places
13
Fragmentation
  • Motivation from academia
    • The “popularity contest”
    • Getting others to use your intellectual property and credit you with it
      • Spread your knowledge wide across disciplines

    • Academic yardstick for tenure (and for hiring)
      • The more the better – fragment your results
      • The higher quality the better – chase best journals

  • To think about: what is fragmentation’s relation to the aforementioned bibliometric laws?
14
Obsolescence
  • Literature gets outdated fast!
    • Chemistry: ½ of references are < 8 yrs. old
    • Physics: ½ of references are < 5 yrs. old
  • Textbooks are outdated by the time they are published
  • Practical implications in the digital library
  • What about computer science?


  • To think about: Is it really outdated-ness that is measured or something else?
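
The "half of references younger than N years" figures can be read as the median age of the references in a paper or journal. The sketch below computes that median under this reading; the function name and the years are made up for illustration.

```python
from statistics import median

def reference_half_life(citing_year, reference_years):
    """Median age (in years) of the works cited by a paper."""
    ages = [citing_year - year for year in reference_years]
    return median(ages)

# A paper from 2003 citing works from these years: ages are 1, 3, 4, 8, 23.
half_life = reference_half_life(2003, [2002, 2000, 1999, 1995, 1980])
```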
15
ISI Impact Factor
16
Half Life Decay in Action
17
Expected Citation Rates
  • From a large sample can calculate expected rates of citations
    • For journals vs. conferences
    • For specific journals vs. other ones

  • Can compare a researcher’s productivity against this expected rate
    • Basis for promotion
18
 
19
Linkage
  • Citations in scientific papers are important:
    • Demonstrate awareness of background
    • Prior work being built upon
    • Substantiate claims
    • Contrast to competing work


    • Any other reasons?


    • This is one of the main reasons why the # of citations by itself is not a good basis for evaluation.
20
Non-trivial to unify citations
  • Citations have different styles:




  • Citeseer tried edit distance, structured field recognition
    • Settled on word (unigram) + section n-gram matching after normalization
    • More work to be done here: OpCit GPL code
21
Computational Analysis of Links
  • If we know what type of citations/links exist, that can help:


    • In scientific articles:
      • In calculating impact
      • In relevance judgment (browsing → survey paper)
      • Checking whether the paper’s authors are informed

    • In DL item retrieval:
      • In classifying items pointed by a link
      • In calculating an item’s importance (removal of self-citations)

22
Calculating citation types
  • Teufel (00): creates Rhetorical Document Profiles
    • Capitalizes on fixed structure and argumentative goals in scientific articles (e.g. Related Work)
    • Uses discourse cue phrases and the position of the citation to classify a zone (e.g., “In contrast to [1], we …”)


23
Using link text for classification
  • The link text that describes a page in another page can be used for classification.


  • Amitay (98) extended this concept by ranking nearby text fragments using (among other things) positional information.
    • XXXX: …. … .. … ..
    • … … … .. …. XXX, …. … .. … …
    • … XXXX[ … ] [ … ] [ …. ]
24
Ranking related papers in retrieval
  • Citeseer uses two forms of relatedness to recommend “related articles”:


    • TF × IDF
      • If above a threshold, report it

    • CC (Common Citation) × IDF
      • CC = Bibliographic Coupling
      • If two papers share a rare citation, this is more important than if they share a common one.
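
The idea behind CC × IDF can be sketched as follows: each shared citation contributes a weight that shrinks as the cited work becomes more commonly cited. The sketch assumes the weight is simply 1 / (number of papers citing the work); the exact weighting Citeseer uses is not given here, and `make_ccidf` is a hypothetical name.

```python
from collections import defaultdict

def make_ccidf(citations):
    """citations maps each paper to the set of works it cites.
    Returns a function scoring the relatedness of two papers."""
    cited_by = defaultdict(int)
    for refs in citations.values():
        for ref in refs:
            cited_by[ref] += 1  # how many papers cite this work

    def score(p, q):
        shared = citations[p] & citations[q]
        # Assumed weighting: rarer shared citations count for more.
        return sum(1.0 / cited_by[ref] for ref in shared)

    return score

citations = {
    "A": {"common", "rare"},
    "B": {"common", "rare"},
    "C": {"common"},
    "D": {"common"},
}
ccidf = make_ccidf(citations)
```

Here A and B share both a common and a rare citation, so they score higher with each other than A does with C, which shares only the common one.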
25
Citation Analysis
  • Deciding which (web sites, authors) are most prominent
26
Citation Analysis
  • Despite shortcomings, still useful
  • Citation links viewed as a DAG
  • Incoming and outgoing links have different treatments


27
Sociometric experiment types
  • Ego-centered: focal person and its alters
    (Wasserman and Faust, pg. 53)


  • Small World: how many actors a respondent is away from a target


28
Prominence
  • Consider a node prominent if its ties make it particularly visible to other nodes in the network
    (adapted from WF, pg 172)


    • Centrality – no distinction between incoming and outgoing edges (thus directionality doesn’t matter). How involved is the node in the graph?


    • Prestige – “Status”.  Ranking the prestige of nodes among other nodes. In-degree counts towards prestige.
29
Centrality
  • How central is a particular
    • Graph?
    • Node?
  • Graph-wide measures assist in comparing graphs, subgraphs
30
Node Degree Centrality
  • Degree (In + Out)
  • Normalized Degree ((In + Out) / Possible)
    • What’s max possible?
  • Variance of Degrees
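
These three measures can be computed in a few lines. The sketch below assumes a directed graph with no self-loops or repeated edges, which gives one possible answer for the maximum degree: 2(n − 1). The function name and the example graph are illustrative.

```python
from statistics import pvariance

def degree_centrality(nodes, edges):
    """Return (degree, normalized degree, variance of degrees)
    for a directed graph given as a list of (src, dst) edges."""
    degree = {v: 0 for v in nodes}
    for src, dst in edges:
        degree[src] += 1  # out-link of src
        degree[dst] += 1  # in-link of dst
    # Assumption: no self-loops or repeated edges, so a node has at most
    # (n - 1) in-links plus (n - 1) out-links.
    possible = 2 * (len(nodes) - 1)
    normalized = {v: d / possible for v, d in degree.items()}
    return degree, normalized, pvariance(list(degree.values()))

nodes = ["A", "B", "C"]
edges = [("A", "B"), ("C", "B"), ("B", "A")]
degree, normalized, variance = degree_centrality(nodes, edges)
```

The variance is a graph-wide number: it is large when a few nodes hold most of the links and zero when all nodes have the same degree.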
31
Distance Centrality
  • Closeness = minimal distance
  • Sum of shortest paths should be minimal in a central graph
  • (Jordan) Center = subset of nodes that have minimal sum distance to all nodes.


  • What about disconnected components?
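
Distance-based centrality can be sketched with a breadth-first search from every node. The code assumes an undirected, unweighted graph stored as an adjacency dict; treating a node that cannot reach everyone as having infinite distance sum is only one possible convention for disconnected components.

```python
from collections import deque

def distance_sums(adj):
    """Sum of shortest-path distances from each node to all other nodes."""
    sums = {}
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from source
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        # Assumed convention: unreachable nodes make the sum infinite.
        reachable_all = len(dist) == len(adj)
        sums[source] = sum(dist.values()) if reachable_all else float("inf")
    return sums

def center(adj):
    """Nodes with the minimal sum of distances to all other nodes."""
    sums = distance_sums(adj)
    best = min(sums.values())
    return {v for v, s in sums.items() if s == best}

# A path graph A - B - C - D - E; the middle node is the center.
path = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"],
        "D": ["C", "E"], "E": ["D"]}
sums = distance_sums(path)
```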
32
Betweenness Centrality
  • A node is central iff it lies between other nodes on their shortest path.
  • If there is more than one shortest path,
    • Treat each with equal weight
    • Use some weighting scheme
      • Inverse of path length
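
The equal-weight variant can be sketched by counting shortest paths with a breadth-first search and giving each node the fraction of shortest paths between a pair that pass through it. The code assumes an undirected, unweighted graph and counts ordered pairs, so every unordered pair contributes twice; function names are hypothetical.

```python
from collections import deque
from itertools import permutations

def bfs_counts(adj, source):
    """Distances from source and the number of shortest paths to each node."""
    dist, sigma = {source: 0}, {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                sigma[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:  # u precedes v on a shortest path
                sigma[v] += sigma[u]
    return dist, sigma

def betweenness(adj):
    """Equal-weight betweenness over ordered pairs (s, t)."""
    info = {s: bfs_counts(adj, s) for s in adj}
    score = {v: 0.0 for v in adj}
    for s, t in permutations(adj, 2):
        dist_s, sigma_s = info[s]
        if t not in dist_s:
            continue  # t is unreachable from s
        for v in adj:
            if v in (s, t) or v not in dist_s:
                continue
            dist_v, sigma_v = info[v]
            # v lies on a shortest s-t path iff the distances add up.
            if t in dist_v and dist_s[v] + dist_v[t] == dist_s[t]:
                score[v] += sigma_s[v] * sigma_v[t] / sigma_s[t]
    return score

path = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
diamond = {"A": ["B", "C"], "B": ["A", "D"],
           "C": ["A", "D"], "D": ["B", "C"]}
path_score = betweenness(path)
diamond_score = betweenness(diamond)
```

In the diamond graph, A and D are joined by two shortest paths, so B and C each receive half of that pair's credit.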
33
References (besides readings)
  • Bollen and Luce (02) Evaluation of Digital Library Impact and User Communities by Analysis of Usage Patterns http://www.dlib.org/dlib/june02/bollen/06bollen.html


  • Kaplan and Nelson (00) Determining the publication impact of a digital library
    http://download.interscience.wiley.com/cgi-bin/fulltext?ID=69503874&PLACEBO=IE.pdf&mode=pdf


  • Wasserman and Faust (94) Social Network Analysis (on reserve)
34
Things to think about
  • What’s the relationship between these three laws (Bradford, Zipf-Yule-Pareto and Lotka)?
  • How would you define the three zones in Bradford’s law?


35
Pagerank and HITS*
  • Module 7 Applied Bibliometrics
  • KAN Min-Yen
  • *Part of these lecture notes come from Manning, Raghavan and Schütze @ Stanford CS
36
Connectivity analysis
  •  Idea: mine hyperlink information in the Web
  •  Assumptions:
    •  Links often connect related pages
    •  A link between pages is a recommendation
      • “people vote with their links”
37
Query-independent ordering
  • Using link counts as simple measures of popularity


  • Two basic suggestions:
    • Undirected popularity:
      • in-links plus out-links (3+2=5)
    • Directed popularity:
      • number of its in-links (3)
38
Algorithm
  • Retrieve all pages meeting the text query (say venture capital), perhaps by using Boolean model


  • Order these by link popularity
    (either variant on the previous page)
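
The two steps can be sketched as below. The sketch assumes a Boolean AND over whitespace-separated words for the text match and represents the link graph as (source, destination) pairs; `rank_by_link_popularity` and the toy pages are hypothetical.

```python
def rank_by_link_popularity(pages, links, query, directed=True):
    """pages maps URL -> text; links is a list of (src, dst) pairs.
    Returns the matching URLs ordered by link popularity."""
    terms = set(query.lower().split())
    # Step 1: Boolean AND retrieval (every query term must appear).
    matches = [url for url, text in pages.items()
               if terms <= set(text.lower().split())]
    # Step 2: count in-links, plus out-links for the undirected variant.
    popularity = {url: 0 for url in matches}
    for src, dst in links:
        if dst in popularity:
            popularity[dst] += 1
        if not directed and src in popularity:
            popularity[src] += 1
    return sorted(matches, key=lambda url: popularity[url], reverse=True)

pages = {
    "a": "venture capital firms",
    "b": "venture capital news",
    "c": "capital cities",
}
links = [("a", "b"), ("c", "b"), ("b", "a")]
ranking = rank_by_link_popularity(pages, links, "venture capital")
```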
39
Pagerank scoring
  • Imagine a browser doing a random walk on web pages:
    • Start at a random page
    • At each step, follow one of the n links on that page, each with 1/n probability
  • Do this repeatedly.  Use the “long-term visit rate” as the page’s score
40
Not quite enough
  • The web is full of dead ends.
    • What sites have dead ends?
    • Our random walk can get stuck.
41
Teleporting
  • At each step, with probability 10%, teleport to a random web page


  • With remaining probability (90%), follow a random link on the page
    • If a dead-end, stay put in this case
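
The walk with teleporting can be written down as a matrix of step probabilities. This sketch assumes pages are numbered 0..n−1 and that a teleport picks uniformly among all n pages (including the current one); `transition_matrix` is a hypothetical name.

```python
def transition_matrix(links, n, teleport=0.1):
    """Row i gives the probabilities of the next page when on page i."""
    out = [[] for _ in range(n)]
    for src, dst in links:
        out[src].append(dst)
    # Teleport share: probability `teleport` spread evenly over all pages.
    P = [[teleport / n] * n for _ in range(n)]
    for i in range(n):
        if out[i]:
            # Follow one of the out-links, each equally likely.
            for j in out[i]:
                P[i][j] += (1 - teleport) / len(out[i])
        else:
            # Dead end: stay put with the remaining probability.
            P[i][i] += 1 - teleport
    return P

# Page 0 links to pages 1 and 2, page 1 links to page 2, page 2 is a dead end.
P = transition_matrix([(0, 1), (0, 2), (1, 2)], n=3)
```

Every row of the result sums to 1, and every entry is non-zero because of the teleport share.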


42
Result of teleporting
  • Now we cannot get stuck locally
  • There is a long-term rate at which any page is visited (not obvious, will show this)
    • How do we compute this visit rate?
43
Markov chains
  • A Markov chain consists of n states, plus an n×n transition probability matrix P.
    • At each step, we are in exactly one of the states.
    • For 1 ≤ i,k ≤ n, the matrix entry $P_{ik}$ tells us the probability of k being the next state, given we are currently in state i.
44
Markov chains
  • Clearly, for all i, $\sum_{k=1}^{n} P_{ik} = 1$.
  •  Markov chains are abstractions of random walks
45
Ergodic Markov chains
  • A Markov chain is ergodic if
    • you have a path from any state to any other
    • you can be in any state at every time step, with non-zero probability



    • With teleportation, our Markov chain is ergodic
46
Steady State
  • For any ergodic Markov chain, there is a unique long-term visit rate for each state
    • Over a long period, we’ll visit each state in proportion to this rate
    • It doesn’t matter where we start

47
Probability vectors
  • A probability (row) vector x = (x1, … xn) tells us where the walk is at any point
  • E.g., (000…1…000), with the 1 in position i, means we’re in state i.
48
Change in probability vector
  • If the probability vector is  x = (x1, … xn) at this step, what is it at the next step?
  • Recall that row i of the transition probability matrix P tells us where we go next from state i.
  • So from x, our next state is distributed as xP.
49
Pagerank algorithm
  • Regardless of where we start, we eventually reach the steady state a
    • Start with any distribution (say x=(10…0))
    • After one step, we’re at xP
    • After two steps at $xP^2$, then $xP^3$ and so on.
    • “Eventually” means for “large” k, $xP^k = a$
  • Algorithm: multiply x by increasing powers of P until the product looks stable
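
The algorithm can be sketched in plain Python. "Looks stable" is interpreted here as no entry changing by more than a small tolerance between steps, which is an assumption, as are the function names and the 2-state example matrix.

```python
def step(x, P):
    """One step of the walk: the row vector x times the matrix P."""
    n = len(P)
    return [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]

def steady_state(P, tol=1e-12, max_steps=10_000):
    """Repeatedly multiply by P until the distribution stops changing."""
    x = [1.0] + [0.0] * (len(P) - 1)  # start in state 0: x = (1 0 ... 0)
    for _ in range(max_steps):
        nxt = step(x, P)
        if max(abs(a - b) for a, b in zip(x, nxt)) < tol:
            return nxt
        x = nxt
    return x

# A small ergodic chain; solving a = aP by hand gives a = (1/3, 2/3).
P = [[0.5, 0.5],
     [0.25, 0.75]]
a = steady_state(P)
```

Multiplying the vector by P at every step avoids forming the matrix powers explicitly, which matters when P is as large as the web graph.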
50
Pagerank summary
  • Pre-processing:
    • Given graph of links, build matrix P
    • From it compute a
    • The pagerank $a_i$ is a scaled number between 0 and 1
  • Query processing:
    • Retrieve pages meeting query
    • Rank them by their pagerank
    • Order is query-independent
51
Hyperlink-Induced Topic Search (HITS)
  • In response to a query, instead of an ordered list of pages each meeting the query, find two sets of inter-related pages:
    • Hub pages are good lists of links on a subject.
      • e.g., “Bob’s list of cancer-related links.”
    • Authority pages occur recurrently on good hubs for the subject.
  • Best suited for “broad topic” browsing queries rather than for known-item queries.
  • Gets at a broader slice of common opinion.
52
Hubs and Authorities
  • Thus, a good hub page for a topic points to many authoritative pages for that topic.


  • A good authority page for a topic is pointed to by many good hubs for that topic.


  • Circular definition - will turn this into an iterative computation.
53
Hubs and Authorities
54
High-level scheme
  • Extract from the web a base set of pages that could be good hubs or authorities.


  • From these, identify a small set of top hub and authority pages
    •  iterative algorithm
55
Base set
  • Given text query (say university), use a text index to get all pages containing university.
    • Call this the root set of pages
  • Add in any page that either:
    • points to a page in the root set, or
    • is pointed to by a page in the root set
  • Call this the base set
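
Expanding the root set into the base set is a single pass over the link graph. The sketch assumes the links are available as (source, destination) pairs; the page names are made up.

```python
def base_set(root, links):
    """Root set plus every page that links to it or is linked from it."""
    base = set(root)
    for src, dst in links:
        if dst in root:
            base.add(src)  # src points to a page in the root set
        if src in root:
            base.add(dst)  # dst is pointed to by a page in the root set
    return base

root = {"r"}
links = [("p", "r"), ("r", "q"), ("x", "y")]
base = base_set(root, links)
```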
56
 
57
Assembling the base set
  • Root set typically 200-1000 nodes.
  • Base set may have up to 5000 nodes.
  • How do you find the base set nodes?


    • Follow out-links by parsing root set pages.


    • Get in-links (and out-links) from a connectivity server.
58
Distilling hubs and authorities
  • Compute, for each page x in the base set, a hub score h(x) and an authority score a(x).
  • Initialize: for all x, h(x) ← 1; a(x) ← 1;
  • Iteratively update all h(x), a(x);
  • After iterations:
    • highest h() scores are hubs
    • highest a() scores are authorities
59
Iterative update
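  • Repeat the following updates, for all x:
    • $h(x) \leftarrow \sum_{x \to y} a(y)$ (sum over pages y that x points to)
    • $a(x) \leftarrow \sum_{y \to x} h(y)$ (sum over pages y that point to x)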
60
How many iterations?
  • Relative values of scores will converge after a few iterations
  • We only require the relative order of the h() and a() scores - not their absolute values
  • In practice, ~5 iterations needed
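
The iterative computation of hub and authority scores can be sketched as follows. Two details are assumptions of this sketch: authorities are updated first and hubs are then computed from the fresh authority scores, and scores are rescaled after each round so that the largest is 1 (only the relative order matters).

```python
def hits(nodes, links, iterations=5):
    """Return (hub, authority) scores for a base set given as
    a list of nodes and a list of (src, dst) links."""
    hub = {x: 1.0 for x in nodes}
    auth = {x: 1.0 for x in nodes}
    for _ in range(iterations):
        # Authority score: sum of hub scores of pages pointing to x.
        auth = {x: 0.0 for x in nodes}
        for src, dst in links:
            auth[dst] += hub[src]
        # Hub score: sum of authority scores of pages x points to.
        hub = {x: 0.0 for x in nodes}
        for src, dst in links:
            hub[src] += auth[dst]
        # Rescale so values stay bounded; relative order is unchanged.
        top_a = max(auth.values()) or 1.0
        top_h = max(hub.values()) or 1.0
        auth = {x: s / top_a for x, s in auth.items()}
        hub = {x: s / top_h for x, s in hub.items()}
    return hub, auth

nodes = ["h1", "h2", "a1", "a2"]
links = [("h1", "a1"), ("h1", "a2"), ("h2", "a1")]
hub, auth = hits(nodes, links)
```

In this toy graph `a1` is pointed to by both hubs and ends up as the top authority, while `h1`, which points to both authorities, ends up as the top hub.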
61
Things to think about
  • Use only link analysis after base set assembled
    • iterative scoring is query-independent
  • Iterative computation after text index retrieval - significant overhead
62
Things to think about
  • How does the selection of the base set influence computation of H & As?
  • Can we embed the computation of H & A during the standard VS retrieval algorithm?
  • A pagerank score is a global score.  Can there be a fusion between H&A (which are query sensitive) and pagerank?  How would you do it?
  • How do you relate CCIDF in Citeseer to Pagerank?