Introduction to Bibliometrics
Module 7: Applied Bibliometrics
 KAN Min-Yen

What is Bibliometrics?
Statistical and other forms of quantitative analysis
Used to discover and chart the growth patterns of information
Production
Use

Outline
What is bibliometrics? √
Bibliometric laws
Properties of information and its production

Properties of Academic Literature
Growth
Fragmentation
Obsolescence
Linkage

Growth
Exponential rate for several centuries: “information overload”
First known scientific journals: 1665 (Journal des sçavans, Philosophical Transactions)
Today:
LINC lists about 15,000 journals across all libraries
Factors:
Ease of publication
Ease of use and increased availability
Known reputation

Zipf-Yule-Pareto Law
P_n ≈ 1/n^a
where P_n is the frequency of occurrence of the nth ranked item, and a is some constant.
“The probability of occurrence of a value of some variable starts high and tapers off. Thus, a few values occur very often while many others occur rarely.”
Pareto – for land ownership in the 1800s
Zipf – for word frequency
Also known as the 80/20 rule and as Zipf-Mandelbrot
Used to model the number of citations per paper:
The number of papers cited n times is about 1/n^a of the number cited once, where a ≈ ____
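The rank-frequency relation above can be explored numerically. The following is a minimal sketch, assuming frequencies exactly proportional to 1/n^a; the function names, the item count of 1000, and the exponent a = 1.0 are illustrative choices, not values from the slides.

```python
def zipf_frequencies(num_items, a=1.0):
    """Return normalized frequencies proportional to 1/n^a for ranks 1..num_items."""
    weights = [1.0 / (n ** a) for n in range(1, num_items + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def top_share(freqs, fraction):
    """Share of all occurrences accounted for by the top `fraction` of ranked items."""
    k = max(1, int(len(freqs) * fraction))
    return sum(freqs[:k])

# Illustrative (assumed) parameters: 1000 ranked items, exponent a = 1.0.
freqs = zipf_frequencies(1000, a=1.0)
print(freqs[0] / freqs[1])      # ratio between rank 1 and rank 2 is 2^a
print(top_share(freqs, 0.2))    # how much the top 20% of items account for
```

Varying a shows how strongly the head of the distribution dominates, which is the intuition behind the 80/20 label.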

Random processes and Zipfian behavior
Some random processes can also result in Zipfian behavior:
At the beginning there is one “seminal” paper.
Every subsequent paper makes at most ten citations (or cites all preceding papers if their number does not exceed ten).
All preceding papers have an equal probability to be cited.
Result: A Zipfian curve, with a≈1.
What’s your conclusion?
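The random process described on this slide is easy to simulate. The sketch below is one possible reading of the three rules; the function name, the number of papers (2000), and the random seed are assumptions made for illustration. It prints a few rank/citation-count pairs so the shape of the curve can be inspected directly rather than taken on trust.

```python
import random

def simulate_random_citations(num_papers, max_refs=10, seed=0):
    """Each new paper cites up to max_refs earlier papers, chosen uniformly at random."""
    rng = random.Random(seed)
    cited_count = [0] * num_papers      # index 0 is the seminal paper
    for new in range(1, num_papers):
        k = min(max_refs, new)          # cite all predecessors if there are at most max_refs
        for target in rng.sample(range(new), k):
            cited_count[target] += 1
    return cited_count

counts = simulate_random_citations(2000)
ranked = sorted(counts, reverse=True)
for rank in (1, 10, 100, 1000):
    print(rank, ranked[rank - 1])       # citation count of the paper at this rank
```

Plotting rank against count on log-log axes is a quick way to judge how close the result is to a straight line.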

Lotka’s Law
The number of authors making n contributions is about 1/n^a of those making one contribution, where a ≈__
Implications:
A small number of authors produce a large number of papers, e.g., 10% of authors produce half of the literature in a field
Those who achieve success in writing papers are likely to continue having it
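To see how the exponent shapes author productivity, the relation can be tabulated for a few candidate values. This is a sketch only: the function names, the 1000 single-paper authors, the cap of 50 papers per author, and the exponents tried are all hypothetical, and none of them is meant as the value left blank above.

```python
def lotka_author_counts(single_paper_authors, max_n, a):
    """Expected number of authors with n papers, for n = 1..max_n, under 1/n^a."""
    return [single_paper_authors / n ** a for n in range(1, max_n + 1)]

def share_of_papers_by_prolific(author_counts, min_papers):
    """Fraction of all papers written by authors with at least min_papers papers."""
    papers = [count * n for n, count in enumerate(author_counts, start=1)]
    return sum(papers[min_papers - 1:]) / sum(papers)

# Illustrative exponents (assumed), with 1000 single-paper authors.
for a in (1.5, 2.0, 2.5):
    counts = lotka_author_counts(1000, 50, a)
    print(a, round(counts[1]), round(share_of_papers_by_prolific(counts, 5), 2))
```

Each printed row gives the exponent, the expected number of two-paper authors, and the share of papers coming from authors with five or more papers.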

Lotka’s Law in Action

Bradford’s Law of Scattering
Journals in a field can be divided into three parts, each with about one-third of all articles:
1) a core of a few journals,
2) a second zone, with more journals, and
3) a third zone, with the bulk of journals.
The number of journals is __:__:__
To think about: Why is this true?
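One way to make the three zones concrete is to rank journals by productivity and cut the ranking wherever the cumulative article count passes one-third and two-thirds of the total. The sketch below does exactly that; the function name, the cut rule, and the article counts are assumptions for illustration, not data from a real field.

```python
def bradford_zones(articles_per_journal, num_zones=3):
    """Split journals, ranked by productivity, into zones with roughly equal article totals."""
    ranked = sorted(articles_per_journal, reverse=True)
    total = sum(ranked)
    zones = [[] for _ in range(num_zones)]
    running = 0
    zone = 0
    for count in ranked:
        # Advance to the next zone once the cumulative total reaches its boundary.
        while zone < num_zones - 1 and running >= total * (zone + 1) / num_zones:
            zone += 1
        zones[zone].append(count)
        running += count
    return zones

# Hypothetical article counts for twelve journals in one field.
counts = [50, 30, 20, 15, 15, 10, 10, 10, 5, 5, 5, 5]
zones = bradford_zones(counts)
print([len(z) for z in zones])   # number of journals per zone
print([sum(z) for z in zones])   # number of articles per zone
```

With real data the zones rarely hold exactly one-third of the articles each, so the cut rule itself is a design choice worth questioning.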

Fragmentation
Influenced by scientific method
Information is continuous, but discretized into ________________
(e.g., conference papers, journal articles, surveys, texts, Ph.D. theses)
One paper reports one experiment
Scientists aim to publish in _____________

Fragmentation
Motivation from academia
The “popularity contest”
Getting others to use your intellectual property and credit you with it
Spread your knowledge wide across disciplines
Academic yardstick for tenure (and for hiring)
The more the better – __________________
The higher quality the better – ___________
To think about: what is fragmentation’s relation to the aforementioned bibliometric laws?

Obsolescence
Literature gets outdated fast!
Chemistry: half of all references are less than 8 years old
Physics: half of all references are less than 5 years old
Textbooks are outdated by the time they are published
Practical implications in the digital library
What about computer science?
To think about: Is it really outdated-ness that is measured or something else?
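The "half of all references are younger than N years" figures are median reference ages. A minimal sketch of that calculation follows; the function name and the reference years are hypothetical.

```python
from statistics import median

def citation_half_life(citing_year, reference_years):
    """Median age of the references: half of them are younger than this many years."""
    ages = [citing_year - year for year in reference_years]
    return median(ages)

# Hypothetical reference list of a paper published in 2004.
refs = [2003, 2002, 2002, 2000, 1999, 1997, 1995, 1990]
print(citation_half_life(2004, refs))
```

Note that this measures what authors choose to cite, which is not necessarily the same thing as what has become outdated.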

ISI Impact Factor

Half-Life Decay in Action

Expected Citation Rates
From a large sample, one can calculate expected citation rates
For journals vs. conferences
For specific journals vs. other ones
Can compare a researcher’s productivity against this expected rate
Basis for promotion
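The comparison described above can be sketched as a ratio of actual to expected citations. The code below assumes the expected rate is simply the mean citations per paper for each venue type in the sample; the function names and all numbers are hypothetical.

```python
from collections import defaultdict

def expected_citation_rates(sample):
    """Mean citations per paper for each venue, from (venue, citations) pairs."""
    totals = defaultdict(lambda: [0, 0])    # venue -> [citation sum, paper count]
    for venue, cites in sample:
        totals[venue][0] += cites
        totals[venue][1] += 1
    return {venue: s / n for venue, (s, n) in totals.items()}

def relative_impact(papers, rates):
    """Actual citations divided by the citations expected for the same venues."""
    actual = sum(cites for _, cites in papers)
    expected = sum(rates[venue] for venue, _ in papers)
    return actual / expected

# Hypothetical sample and researcher record.
sample = [("journal", 12), ("journal", 8), ("conference", 3), ("conference", 5)]
rates = expected_citation_rates(sample)
researcher = [("journal", 15), ("conference", 2)]
print(rates)
print(relative_impact(researcher, rates))   # above 1 means more cited than expected
```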

Linkage
Citations in scientific papers are important:
Demonstrate awareness of background
Prior work being built upon
Substantiate claims
Contrast to competing work
Any other reasons?
This is one of the main reasons why citation counts by themselves are not a good basis for evaluation.

Non-trivial to unify citations
Citations have different styles:
Citeseer tried edit distance and structured field recognition
Settled on _______________________________________________________________
More work to be done here: OpCit GPL code
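To illustrate why edit distance is a natural first attempt, here is a sketch that normalizes two citation strings and compares them with Levenshtein distance. It is not Citeseer's implementation; the function names, the normalization rule, and the example strings are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def normalize(citation):
    """Lowercase, replace punctuation with spaces, and collapse runs of spaces."""
    kept = "".join(ch.lower() if ch.isalnum() else " " for ch in citation)
    return " ".join(kept.split())

def similarity(c1, c2):
    """1.0 for identical normalized strings, lower as more edits are needed."""
    a, b = normalize(c1), normalize(c2)
    longest = max(len(a), len(b)) or 1
    return 1 - edit_distance(a, b) / longest

# Two hypothetical renderings of the same reference in different styles.
s1 = "Wasserman and Faust (94) Social Network Analysis"
s2 = "S. Wasserman, K. Faust. Social network analysis. 1994."
print(similarity(s1, s2))
```

Reordered fields, such as the year moving from the middle to the end, are penalized heavily by plain edit distance, which hints at why character-level matching alone is not enough.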

Computational Analysis of Links
If we know what type of citations/links exist, that can help:
In scientific articles:
In calculating impact
In relevance judgment (browsing → survey paper)
______________________________
In DL item retrieval:
In classifying items pointed by a link
_________________________

Calculating citation types
Teufel (00): creates Rhetorical Document Profiles
Capitalizes on fixed structure and argumentative goals in scientific articles (e.g. Related Work)
Uses _______________________________ of citation to classify a zone (e.g., “In contrast to [1], we …”)

Using link text for classification
The link text that describes a page in another page can be used for classification.
Amitay (98) extended this concept by ranking nearby text fragments using (among other things) __________.
[Figure: example layouts of link text (XXXX) among nearby text fragments]

Ranking related papers in retrieval
Citeseer uses two forms of relatedness to recommend “related articles”:
TF × IDF
If above a threshold, report it
CC (Common Citation) × IDF
CC = Bibliographic Coupling
If two papers share a rare citation, this is more important than if they share a common one.
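The CC × IDF idea can be sketched as follows: each shared reference contributes a weight that is larger the fewer papers cite it. The slides do not give the exact weighting formula, so log(N / df) is an assumption here, as are the function names and the toy reference lists.

```python
import math

def citation_idf(reference_lists):
    """IDF weight for each cited item: items cited by few papers score higher."""
    n = len(reference_lists)
    df = {}
    for refs in reference_lists.values():
        for r in set(refs):
            df[r] = df.get(r, 0) + 1
    return {r: math.log(n / count) for r, count in df.items()}

def cc_idf(paper_a, paper_b, reference_lists, idf):
    """Sum of IDF weights over the references the two papers share."""
    shared = set(reference_lists[paper_a]) & set(reference_lists[paper_b])
    return sum(idf[r] for r in shared)

# Hypothetical collection: every paper cites "classic"; only p1 and p2 cite "rare".
reference_lists = {
    "p1": ["classic", "rare"],
    "p2": ["classic", "rare"],
    "p3": ["classic"],
    "p4": ["classic", "other"],
}
idf = citation_idf(reference_lists)
print(cc_idf("p1", "p2", reference_lists, idf))   # shares a rare citation
print(cc_idf("p1", "p3", reference_lists, idf))   # shares only a common one
```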

Citation Analysis
Deciding which (web sites, authors) are most prominent

Citation Analysis
Despite shortcomings, still useful
Citation links viewed as a DAG
Incoming and outgoing links have different treatments

Prominence
Consider a node prominent if its ties make it particularly visible to other nodes in the network
(adapted from Wasserman and Faust, p. 172)
Centrality – no distinction between incoming and outgoing edges (thus directionality doesn’t matter). How involved is the node in the graph?
Prestige – “Status”. Ranks the prestige of a node relative to other nodes. In-degree counts towards prestige.

Centrality
How central is a particular
Graph?
Node?
Graph-wide measures assist in comparing graphs, subgraphs

Node Degree Centrality
Degree (In + Out)
Normalized Degree ((In + Out) / Possible)
What’s max possible?
Variance of Degrees
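The three degree measures can be computed directly from an edge list. The sketch below assumes a simple directed graph (no self-loops, no repeated edges), under which a node can have at most n - 1 incoming and n - 1 outgoing edges; the function names and the four-node citation graph are hypothetical.

```python
from statistics import pvariance

def degree_counts(nodes, edges):
    """Return (in_degree, out_degree) dicts for a list of (citing, cited) edges."""
    in_deg = {v: 0 for v in nodes}
    out_deg = {v: 0 for v in nodes}
    for citing, cited in edges:
        out_deg[citing] += 1
        in_deg[cited] += 1
    return in_deg, out_deg

def normalized_degree(nodes, edges):
    """(In + Out) divided by the maximum possible, assumed here to be 2 * (n - 1)."""
    in_deg, out_deg = degree_counts(nodes, edges)
    max_possible = 2 * (len(nodes) - 1)
    return {v: (in_deg[v] + out_deg[v]) / max_possible for v in nodes}

# Hypothetical citation graph: B, C and D cite A; D also cites C.
nodes = ["A", "B", "C", "D"]
edges = [("B", "A"), ("C", "A"), ("D", "A"), ("D", "C")]
in_deg, out_deg = degree_counts(nodes, edges)
degrees = [in_deg[v] + out_deg[v] for v in nodes]
print(normalized_degree(nodes, edges))
print(pvariance(degrees))   # graph-wide measure: variance of the node degrees
```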

Distance Centrality
Closeness = minimal distance
Sum of shortest paths should be minimal in a central graph
____________ = subset of nodes that have minimal sum distance to all nodes.
What about disconnected components?
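A sketch of the sum-of-distances calculation, using breadth-first search on an unweighted, undirected graph. Treating unreachable nodes as infinitely far away is only one possible answer to the question about disconnected components; that choice, the function names, and the example graph are assumptions.

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest-path lengths (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def distance_sums(adj):
    """Sum of shortest-path lengths from each node; infinite if some node is unreachable."""
    sums = {}
    for v in adj:
        dist = bfs_distances(adj, v)
        sums[v] = sum(dist.values()) if len(dist) == len(adj) else float("inf")
    return sums

def minimal_sum_nodes(adj):
    """Nodes whose summed distance to all other nodes is smallest."""
    sums = distance_sums(adj)
    best = min(sums.values())
    return sorted(v for v, s in sums.items() if s == best)

# Hypothetical path graph A - B - C - D - E.
path = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C", "E"], "E": ["D"]}
print(distance_sums(path))
print(minimal_sum_nodes(path))
```

Adding an isolated node makes every sum infinite under this convention, which shows why disconnected graphs need special handling.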

Betweenness Centrality
A node is central iff it lies between other nodes on their shortest path.
If there is more than one shortest path,
Treat each with equal weight
Use some weighting scheme
Inverse of path length
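The equal-weight option can be sketched by counting shortest paths: for each pair of nodes, a node in between receives the fraction of that pair's shortest paths that pass through it. The code assumes an unweighted, undirected graph and uses hypothetical function names and a hypothetical four-node cycle; it is a direct pairwise calculation, not an optimized algorithm.

```python
from collections import deque
from itertools import combinations

def shortest_path_counts(adj, source):
    """BFS returning distances and the number of shortest paths from source."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                sigma[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:
                sigma[v] += sigma[u]     # every shortest path to u extends to v
    return dist, sigma

def betweenness(adj):
    """Sum over node pairs of the fraction of their shortest paths through each node."""
    info = {v: shortest_path_counts(adj, v) for v in adj}
    score = {v: 0.0 for v in adj}
    for s, t in combinations(adj, 2):
        dist_s, sigma_s = info[s]
        if t not in dist_s:
            continue                     # s and t are not connected
        dist_t, sigma_t = info[t]
        for v in adj:
            if v in (s, t) or v not in dist_s:
                continue
            if dist_s[v] + dist_t[v] == dist_s[t]:
                score[v] += sigma_s[v] * sigma_t[v] / sigma_s[t]
    return score

# Hypothetical 4-cycle: opposite corners are joined by two equally short paths.
square = {"A": ["B", "D"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C", "A"]}
print(betweenness(square))
```

In the cycle each pair of opposite corners has two shortest paths, so each intermediate node is credited with half a path for that pair.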

References (besides readings)
Bollen and Luce (02) Evaluation of Digital Library Impact and User Communities by Analysis of Usage Patterns http://www.dlib.org/dlib/june02/bollen/06bollen.html
Kaplan and Nelson (00) Determining the publication impact of a digital library http://download.interscience.wiley.com/cgi-bin/fulltext?ID=69503874&PLACEBO=IE.pdf&mode=pdf
Wasserman and Faust (94) Social Network Analysis (on reserve)

Things to think about
What’s the relationship between these three laws (Bradford, Zipf-Yule-Pareto and Lotka)?
How would you define the three zones in Bradford’s law?