1
Introduction to Bibliometrics
  • Module 7     Applied Bibliometrics
     KAN Min-Yen
2
What is Bibliometrics?
  • Statistical and other forms of quantitative analysis


  • Used to discover and chart the growth patterns of information
      • Production
      • Use
3
Outline
  • What is bibliometrics? √
  • Bibliometric laws


  • Properties of information and its production
4
Properties of Academic Literature
  • Growth
  • Fragmentation
  • Obsolescence
  • Linkage
5
Growth
  • Exponential rate for several centuries: “information overload”
  • 1st known scientific journal: ~1600
  • Today:
    • LINC lists about 15,000 journals across all libraries


  • Factors:
    • Ease of publication
    • Ease of use and increased availability
    • Known reputation
6
Zipf-Yule-Pareto Law
  • P_n ≈ 1/n^a
    • where P_n is the frequency of occurrence of the nth ranked item, and a is some constant.


      • “The probability of occurrence of a value of some variable starts high and tapers off. Thus, a few values occur very often while many others occur rarely.”

  • Pareto – for land ownership in the 1800’s
  • Zipf – for word frequency
  • Also known as the 80/20 rule and as Zipf-Mandelbrot
  • Used to measure citations per paper:
  • # of papers cited n times is about 1/n^a of those being cited once, where a ≈ ____
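
One way to check whether a set of counts follows P_n ≈ 1/n^a is to fit a straight line to log(frequency) against log(rank); the negated slope is an estimate of a. The sketch below is illustrative only: the function name `fit_zipf_exponent` and the synthetic data (generated with a = 1.5) are assumptions made for the example, not values from the lecture.

```python
import math

def fit_zipf_exponent(frequencies):
    """Estimate a in P_n ≈ C / n^a by least squares on the log-log rank/frequency plot."""
    ranked = sorted((f for f in frequencies if f > 0), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(f) for f in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return -covariance / variance  # slope is -a, so negate it

# Synthetic frequencies that follow 1/n^a exactly, with an arbitrary a = 1.5.
synthetic = [1000.0 / n ** 1.5 for n in range(1, 51)]
estimated_a = fit_zipf_exponent(synthetic)
```

Real citation or word counts are noisy in the tail, so a fit like this is only a rough diagnostic.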
7
Random processes and Zipfian behavior
  • Some random processes can also result in Zipfian behavior:


    • At the beginning there is one “seminal” paper.
    • Every subsequent paper makes at most ten citations (or cites all preceding papers if their number does not exceed ten).
    • All preceding papers have an equal probability to be cited.

  • Result: A Zipfian curve, with a≈1.
    What’s your conclusion?
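
The random process above can be simulated in a few lines. This is a sketch under stated assumptions: each new paper cites exactly ten distinct earlier papers once ten exist (the slide says "at most ten"), and the function name, the number of papers, and the seed are chosen only for illustration.

```python
import random
from collections import Counter

def simulate_citations(num_papers, max_refs=10, seed=0):
    """Return cited[i] = number of citations received by paper i."""
    rng = random.Random(seed)
    cited = [0] * num_papers  # paper 0 is the "seminal" paper
    for new in range(1, num_papers):
        # Cite all preceding papers if there are no more than max_refs of them.
        k = min(max_refs, new)
        # Every preceding paper is equally likely to be chosen.
        for target in rng.sample(range(new), k):
            cited[target] += 1
    return cited

counts = simulate_citations(2000)
ranked = sorted(counts, reverse=True)   # rank/frequency view of the result
histogram = Counter(counts)             # number of papers cited n times
```

Plotting `ranked` on log-log axes is one way to inspect the shape of the resulting curve.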
8
Lotka’s Law
  • The number of authors making n contributions is about 1/n^a of those making one contribution, where a ≈__


  • Implications:
    • A small number of authors produce a large number of papers, e.g., 10% of authors produce half of the literature in a field
    • Those who achieve success in writing papers are likely to continue having it

9
Lotka’s Law in Action
10
Bradford’s Law of Scattering
  • Journals in a field can be divided into three parts, each with about one-third of all articles:


    • 1) a core of a few journals,
    • 2) a second zone, with more journals, and
    • 3) a third zone, with the bulk of journals.

  • The number of journals is __:__:__


  • To think about: Why is this true?
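
A minimal sketch of the zoning described above: rank journals by the number of articles they contribute, then cut the ranked list wherever the running total passes each third of all articles. The function name `bradford_zones` and the article counts are invented for illustration, and the zone sizes they produce say nothing about the ratio found in real literature.

```python
def bradford_zones(articles_per_journal, zones=3):
    """Split journals (given as article counts) into zones of roughly equal article totals."""
    ranked = sorted(articles_per_journal, reverse=True)
    total = sum(ranked)
    result = [[] for _ in range(zones)]
    running = 0  # articles assigned so far
    zone = 0
    for count in ranked:
        # Move to the next zone once the current one has reached its share.
        while zone < zones - 1 and running >= total * (zone + 1) / zones:
            zone += 1
        result[zone].append(count)
        running += count
    return result

# Invented data: one large journal, three medium ones, six small ones.
counts = [30, 10, 10, 10, 5, 5, 5, 5, 5, 5]
zoned = bradford_zones(counts)
journals_per_zone = [len(z) for z in zoned]
articles_per_zone = [sum(z) for z in zoned]
```

With real data the thirds rarely divide evenly, so the cut points are a judgment call.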
11
Fragmentation
  • Influenced by scientific method
    • Information is continuous, but discretized into ________________
    • (e.g., conference papers, journal article, surveys, texts, Ph.D. thesis)

  • One paper reports one experiment
  • Scientists aim to publish in _____________
12
Fragmentation
  • Motivation from academia
    • The “popularity contest”
    • Getting others to use your intellectual property and credit you with it
      • Spread your knowledge wide across disciplines

    • Academic yardstick for tenure (and for hiring)
      • The more the better – __________________
      • The higher quality the better – ___________

  • To think about: what is fragmentation’s relation to the aforementioned bibliometric laws?
13
Obsolescence
  • Literature gets outdated fast!
    • Chemistry: ½ of references are < 8 yrs. old
    • Physics: ½ of references are < 5 yrs. old
  • Textbooks are outdated by the time they are published
  • Practical implications in the digital library
  • What about computer science?


  • To think about: Is it really outdated-ness that is measured or something else?
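
The "½ of references" figures above amount to a median: the age below which half of a paper's (or a field's) references fall. A small sketch, with a hypothetical function name and invented publication years:

```python
from statistics import median

def reference_half_life(citing_year, reference_years):
    """Median age, in years, of the references cited by a work published in citing_year."""
    ages = [citing_year - year for year in reference_years]
    return median(ages)

# Invented example: a 2003 paper and the years of the works it cites.
half_life = reference_half_life(2003, [2002, 2001, 2000, 1998, 1995, 1990, 1985, 1970])
```

Here the ages are 1, 2, 3, 5, 8, 13, 18 and 33 years, so half of the references are younger than 6.5 years.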
14
ISI Impact Factor
15
Half-Life Decay in Action
16
Expected Citation Rates
  • From a large sample can calculate expected rates of citations
    • For journals vs. conferences
    • For specific journals vs. other ones

  • Can measure a researcher’s productivity against this expected rate
    • Basis for promotion
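
One possible reading of the comparison above: estimate the expected citations per paper for each venue type from a large sample, then divide a researcher's actual citations by what those rates predict. The function names, the venue labels, and every number below are hypothetical.

```python
from collections import defaultdict

def expected_rates(sample):
    """sample: iterable of (venue, citations). Returns mean citations per paper by venue."""
    totals = defaultdict(int)
    papers = defaultdict(int)
    for venue, citations in sample:
        totals[venue] += citations
        papers[venue] += 1
    return {venue: totals[venue] / papers[venue] for venue in totals}

def relative_citation_ratio(researcher_papers, rates):
    """Actual citations divided by the citations expected for the same venues."""
    actual = sum(citations for _, citations in researcher_papers)
    expected = sum(rates[venue] for venue, _ in researcher_papers)
    return actual / expected

# Invented sample and researcher record.
rates = expected_rates([("journal", 12), ("journal", 8), ("conference", 3), ("conference", 5)])
ratio = relative_citation_ratio([("journal", 15), ("conference", 2)], rates)
```

A ratio above 1 means the researcher's papers are cited more than the sample predicts for those venues.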
17
 
18
Linkage
  • Citations in scientific papers are important:
    • Demonstrate awareness of background
    • Prior work being built upon
    • Substantiate claims
    • Contrast to competing work


    • Any other reasons?


    • One of the main reasons the number of citations by itself is not a good rationale for evaluation.
19
Non-trivial to unify citations
  • Citations have different styles:




  • Citeseer tried edit distance, structured field recognition
    • Settled on _______________________________________________________________
    • More work to be done here: OpCit GPL code
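
Edit distance, one of the approaches mentioned above, counts the single-character insertions, deletions and substitutions needed to turn one string into another. The sketch below is a textbook Levenshtein implementation, not Citeseer's code; the `similarity` helper and the two citation strings are invented for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b, computed row by row."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if the characters match
            ))
        prev = curr
    return prev[-1]

def similarity(a, b):
    """Distance scaled to [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest

# Two invented renderings of the same citation in different styles.
style_one = "Smith, J. (1999). A study of things. Journal of Examples, 3(2)."
style_two = "J. Smith. A study of things. J. of Examples 3(2), 1999."
score = similarity(style_one.lower(), style_two.lower())
```

Reordered fields, as in the two strings above, are penalized heavily by plain edit distance, which hints at why unifying citations is non-trivial.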
20
Computational Analysis of Links
  • If we know what type of citations/links exist, that can help:


    • In scientific articles:
      • In calculating impact
      • In relevance judgment (browsing → survey paper)
      • ______________________________

    • In DL item retrieval:
      • In classifying items pointed by a link
      • _________________________
21
Calculating citation types
  • Teufel (00): creates Rhetorical Document Profiles
    • Capitalizes on fixed structure and argumentative goals in scientific articles (e.g. Related Work)
    • Uses _______________________________ of citation to classify a zone (e.g., “In contrast to [1], we …”)


22
Using link text for classification
  • The link text that describes a page in another page can be used for classification.


  • Amitay (98) extended this concept by ranking nearby text fragments using (among other things) __________.
    • XXXX: …. … .. … ..
    • … … … .. …. XXX, …. … .. … …
    • … XXXX[ … ] [ … ] [ …. ]
23
Ranking related papers in retrieval
  • Citeseer uses two forms of relatedness to recommend “related articles”:


    • TF × IDF
      • If above a threshold, report it

    • CC (Common Citation) × IDF
      • CC = Bibliographic Coupling
      • If two papers share a rare citation, this is more important than if they share a common one.
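
A sketch of the CC × IDF idea: score a pair of papers by summing a rarity weight over the citations they share. The weight log(N / df), where df is the number of papers citing a given reference, is an assumed form of IDF chosen for illustration and is not taken from Citeseer; the function name and the toy reference lists are likewise invented.

```python
import math

def cc_idf(refs_by_paper, paper_a, paper_b):
    """Sum of IDF weights over the references that paper_a and paper_b have in common."""
    n = len(refs_by_paper)
    # Document frequency: how many papers cite each reference.
    df = {}
    for refs in refs_by_paper.values():
        for ref in set(refs):
            df[ref] = df.get(ref, 0) + 1
    shared = set(refs_by_paper[paper_a]) & set(refs_by_paper[paper_b])
    return sum(math.log(n / df[ref]) for ref in shared)

# Invented data: every paper cites "common"; only p1 and p2 cite "rare".
refs = {
    "p1": ["common", "rare"],
    "p2": ["common", "rare"],
    "p3": ["common"],
    "p4": ["common"],
}
rare_pair = cc_idf(refs, "p1", "p2")
common_pair = cc_idf(refs, "p1", "p3")
```

Sharing the rarely cited reference raises the score, while the reference that everyone cites contributes nothing.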
24
Citation Analysis
  • Deciding which (web sites, authors) are most prominent
25
Citation Analysis
  • Despite shortcomings, still useful
  • Citation links viewed as a DAG
  • Incoming and outgoing links have different treatments


26
Prominence
  • Consider a node prominent if its ties make it particularly visible to other nodes in the network
    (adapted from WF, pg 172)


    • Centrality – no distinction between incoming and outgoing edges (thus directionality doesn’t matter). How involved is the node in the graph?


    • Prestige – “Status”. Ranking the prestige of nodes among other nodes. In-degree counts towards prestige.
27
Centrality
  • How central is a particular
    • Graph?
    • Node?
  • Graph-wide measures assist in comparing graphs, subgraphs
28
Node Degree Centrality
  • Degree (In + Out)
  • Normalized Degree ((In + Out) / Possible)
    • What’s max possible?
  • Variance of Degrees
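
The three measures above can be sketched for a small citation graph. The normalization assumes a simple directed graph with no self-loops, so each node can have at most n − 1 incoming and n − 1 outgoing edges; that choice, the function name, and the example edges are assumptions made for illustration.

```python
from statistics import pvariance

def degree_centrality(nodes, edges):
    """Return (degree, normalized degree, variance of degrees) for a directed graph."""
    degree = {v: 0 for v in nodes}
    for src, dst in edges:
        degree[src] += 1  # outgoing edge
        degree[dst] += 1  # incoming edge
    # Assumed maximum: n - 1 possible in-edges plus n - 1 possible out-edges.
    max_possible = 2 * (len(nodes) - 1)
    normalized = {v: d / max_possible for v, d in degree.items()}
    return degree, normalized, pvariance(list(degree.values()))

# Invented graph: an edge (x, y) means "x cites y".
nodes = ["a", "b", "c", "d"]
edges = [("b", "a"), ("c", "a"), ("d", "a"), ("d", "c")]
degree, normalized, variance = degree_centrality(nodes, edges)
```

A graph in which one node holds most of the edges has a high degree variance, which makes the variance usable as a graph-wide measure.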
29
Distance Centrality
  • Closeness = minimal distance
  • Sum of shortest paths should be minimal in a central graph
  • ____________ = subset of nodes that have minimal sum distance to all nodes.


  • What about disconnected components?
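
A sketch of distance-based centrality on an unweighted graph: run a breadth-first search from every node and sum the shortest-path lengths to all other nodes. Treating a node that cannot reach every other node as having an infinite sum is only one possible convention for disconnected components, and the function name and example graph are invented.

```python
from collections import deque

def distance_sums(adjacency):
    """Map each node to the sum of its shortest-path distances to all other nodes."""
    sums = {}
    for source in adjacency:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        reached_all = len(dist) == len(adjacency)
        # Assumed convention: unreachable nodes make the sum infinite.
        sums[source] = sum(dist.values()) if reached_all else float("inf")
    return sums

# Invented undirected path graph: a - b - c - d - e.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
sums = distance_sums(path)
smallest = min(sums.values())
most_central = [v for v, s in sums.items() if s == smallest]
```

On the path graph the middle node has the smallest sum, matching the intuition that it is closest to everything else.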
30
Betweenness Centrality
  • A node is central iff it lies between other nodes on their shortest path.
  • If there is more than one shortest path,
    • Treat each with equal weight
    • Use some weighting scheme
      • Inverse of path length
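
A sketch of betweenness using the first option above, where each shortest path between a pair of nodes carries equal weight. For every ordered pair (s, t), a node v receives the fraction of shortest s-to-t paths that pass through it. This is a straightforward all-pairs formulation written for clarity rather than speed; the function names and the example graph are invented.

```python
from collections import deque
from itertools import permutations

def shortest_path_counts(adjacency, source):
    """BFS returning (distance, number of shortest paths) from source to each reachable node."""
    dist = {source: 0}
    paths = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                paths[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:
                paths[v] += paths[u]  # every shortest path to u extends to v
    return dist, paths

def betweenness(adjacency):
    info = {s: shortest_path_counts(adjacency, s) for s in adjacency}
    score = {v: 0.0 for v in adjacency}
    for s, t in permutations(adjacency, 2):
        dist_s, paths_s = info[s]
        if t not in dist_s:
            continue  # t is unreachable from s
        for v in adjacency:
            if v in (s, t) or v not in dist_s:
                continue
            dist_v, paths_v = info[v]
            # v lies on a shortest s-t path iff the distances add up exactly.
            if t in dist_v and dist_s[v] + dist_v[t] == dist_s[t]:
                score[v] += paths_s[v] * paths_v[t] / paths_s[t]
    return score

# Invented undirected graph: a 4-cycle a-b-d-c-a with a tail d-e.
graph = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a", "d"], "d": ["b", "c", "e"], "e": ["d"]}
scores = betweenness(graph)
```

Because ordered pairs are used, each undirected pair is counted twice; halve the scores if unordered pairs are preferred.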
31
References (besides readings)
  • Bollen and Luce (02) Evaluation of Digital Library Impact and User Communities by Analysis of Usage Patterns http://www.dlib.org/dlib/june02/bollen/06bollen.html


  • Kaplan and Nelson (00) Determining the publication impact of a digital library
    http://download.interscience.wiley.com/cgi-bin/fulltext?ID=69503874&PLACEBO=IE.pdf&mode=pdf


  • Wasserman and Faust (94) Social Network Analysis (on reserve)
32
Things to think about
  • What’s the relationship between these three laws (Bradford, Zipf-Yule-Pareto and Lotka)?


  • How would you define the three zones in Bradford’s law?