Introduction to Bibliometrics
Module 7 Applied Bibliometrics
KAN Min-Yen
What is Bibliometrics?
Statistical and other forms of quantitative analysis
Used to discover and chart the growth patterns of information
  Production
  Use
Outline
What is bibliometrics? √
Bibliometric laws
Properties of information and its production
Properties of Academic Literature
Growth
Fragmentation
Obsolescence
Linkage
Growth
Exponential rate for several centuries: “information overload”
1st known scientific journal: 1665
Today:
  LINC has about 15,000 in all libraries
Factors:
  Ease of publication
  Ease of use and increased availability
  Known reputation
Zipf-Yule-Pareto Law
$P_n \approx 1/n^a$
where $P_n$ is the frequency of occurrence of the nth-ranked item, and a is some constant.
“The probability of occurrence of a value of some variable starts high and tapers off. Thus, a few values occur very often while many others occur rarely.”
Pareto – for land ownership in the 1800s
Zipf – for word frequency
Also known as the 80/20 rule and as Zipf-Mandelbrot
Used to measure citings per paper:
  # of papers cited n times is about $1/n^a$ of those being cited once, where a ≈ ____
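A minimal sketch of how the exponent a could be estimated from data, assuming the law holds as a straight line on a log-log plot of frequency against rank. The function names and the synthetic data are illustrative assumptions, not part of the lecture.

```python
import math
from collections import Counter

def rank_frequency(items):
    """Return item frequencies sorted from most to least common (rank 1 first)."""
    return sorted(Counter(items).values(), reverse=True)

def estimate_exponent(freqs):
    """Estimate a by least squares on the log-log rank-frequency plot.

    If P_n is proportional to 1/n^a, then log P_n = c - a * log n,
    so the fitted slope is -a.
    """
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return -sxy / sxx

# Synthetic check (assumed data): frequencies built to follow 1/n exactly.
ideal = [1000.0 / n for n in range(1, 51)]
a_hat = estimate_exponent(ideal)
```

On real counts (word frequencies, citations per paper) the fit is only approximate, and the tail of rarely seen items tends to dominate a plain least-squares estimate.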
Random processes and Zipfian behavior
Some random processes can also result in Zipfian behavior:
  At the beginning there is one “seminal” paper.
  Every subsequent paper makes at most ten citations (or cites all preceding papers if their number does not exceed ten).
  All preceding papers have an equal probability of being cited.
Result: A Zipfian curve, with a ≈ 1. What’s your conclusion?
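The random process above can be simulated directly. This is a sketch under the stated rules only; the function name, the number of papers and the fixed seed are assumptions chosen for illustration.

```python
import random
from collections import Counter

def simulate_citations(num_papers, max_refs=10, seed=0):
    """Simulate the random citation model.

    Paper 0 is the seminal paper. Each later paper cites up to max_refs
    earlier papers, chosen uniformly at random without repetition.
    Returns the number of citations received by each paper.
    """
    rng = random.Random(seed)
    cited = Counter()
    for paper in range(1, num_papers):
        # Cite all preceding papers if there are no more than max_refs of them.
        k = min(max_refs, paper)
        for target in rng.sample(range(paper), k):
            cited[target] += 1
    return [cited[p] for p in range(num_papers)]

counts = simulate_citations(2000)
# Rank papers by citations received, ready for a log-log rank plot.
ranked = sorted(counts, reverse=True)
```

Plotting `ranked` against rank on log-log axes is one way to inspect how closely the simulated curve follows a Zipfian shape.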
Lotka’s Law
The number of authors making n contributions is about $1/n^a$ of those making one contribution, where a ≈ __
Implications:
  A small number of authors produce a large number of papers, e.g., 10% of authors produce half of the literature in a field
  Those who achieve success in writing papers are likely to continue having it
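A small sketch of what the law predicts once a is fixed. The slide leaves a open; the value 2.0 used below is only a placeholder assumption for illustration, as are the function name and the author count.

```python
def lotka_expected_authors(single_paper_authors, a, max_n):
    """Expected number of authors with n = 1..max_n contributions,
    given how many authors made exactly one contribution."""
    return {n: single_paper_authors / n ** a for n in range(1, max_n + 1)}

# a = 2.0 is a placeholder assumption, not a value taken from the slide.
expected = lotka_expected_authors(single_paper_authors=100, a=2.0, max_n=5)
```

Comparing these expected counts with observed author counts in a field is one way to check how well the law fits.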
Lotka’s Law in Action
Bradford’s Law of Scattering
Journals in a field can be divided into three parts, each with about one-third of all articles:
  1) a core of a few journals,
  2) a second zone, with more journals, and
  3) a third zone, with the bulk of journals.
The number of journals is __:__:__
To think about: Why is this true?
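One possible way to carve ranked journals into three zones of roughly equal article share is sketched below. The cut rule (assign each journal by the cumulative share reached before it) and the article counts are assumptions for illustration; other zone definitions are possible.

```python
def bradford_zones(article_counts, num_zones=3):
    """Split journals, ranked by article count, into zones that each hold
    roughly an equal share of all articles.

    Returns a list of zones; each zone is a list of per-journal counts.
    Assumes at least one article in total.
    """
    ranked = sorted(article_counts, reverse=True)
    total = sum(ranked)
    zones = [[] for _ in range(num_zones)]
    running = 0
    for count in ranked:
        # The zone is decided by the cumulative share before this journal.
        zone = min(num_zones - 1, running * num_zones // total)
        zones[zone].append(count)
        running += count
    return zones

# Hypothetical field: 1 large journal, 3 medium ones, 10 small ones.
zones = bradford_zones([30] + [10] * 3 + [3] * 10)
journals_per_zone = [len(z) for z in zones]
articles_per_zone = [sum(z) for z in zones]
```

With real data the zones rarely split evenly, so the ratio of journals per zone depends on where the cuts are placed.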
Fragmentation
Influenced by scientific method
Information is continuous, but discretized into ________________
  (e.g., conference papers, journal articles, surveys, texts, Ph.D. theses)
One paper reports one experiment
Scientists aim to publish in _____________
Fragmentation
Motivation from academia
  The “popularity contest”
  Getting others to use your intellectual property and credit you with it
  Spread your knowledge wide across disciplines
Academic yardstick for tenure (and for hiring)
  The more the better – __________________
  The higher quality the better – ___________
To think about: what is fragmentation’s relation to the aforementioned bibliometric laws?
Obsolescence
Literature gets outdated fast!
  ½ of references are < 8 yrs. old in Chemistry
  ½ of references are < 5 yrs. old in Physics
  Textbooks are outdated when published
Practical implications in the digital library
What about computer science?
To think about: Is it really outdated-ness that is measured or something else?
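The “½ of references younger than N years” figures can be read as a median reference age. A minimal sketch, assuming that reading; the function name and the years are made up for illustration.

```python
from statistics import median

def citation_half_life(citing_year, reference_years):
    """Median age of a paper's references: half of them are younger than this."""
    ages = [citing_year - year for year in reference_years]
    return median(ages)

# Hypothetical reference list for a paper published in 2004.
half_life = citation_half_life(2004, [2003, 2001, 2000, 1998, 1990, 1975])
```

Pooling the references of many papers from one field, rather than one paper, would give a field-level estimate.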
ISI Impact Factor
Half Life
Decay in Action
Expected Citation Rates
From a large sample, can calculate expected rates of citations
  For journals vs. conferences
  For specific journals vs. other ones
Can compare a researcher’s productivity against this expected rate
  Basis for promotion
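A sketch of the comparison, assuming the expected rate is simply the mean citation count per venue type in the sample. The venue labels, counts and function names are hypothetical.

```python
from statistics import mean

def expected_rates(sample):
    """sample: list of (venue, citation_count) pairs.
    Returns the mean citation count per venue."""
    by_venue = {}
    for venue, cites in sample:
        by_venue.setdefault(venue, []).append(cites)
    return {venue: mean(counts) for venue, counts in by_venue.items()}

def relative_rate(papers, rates):
    """papers: one researcher's (venue, citation_count) pairs.
    Returns observed citations divided by the expected number."""
    observed = sum(cites for _, cites in papers)
    expected = sum(rates[venue] for venue, _ in papers)
    return observed / expected

# Hypothetical sample and researcher.
rates = expected_rates([("journal", 10), ("journal", 20),
                        ("conference", 2), ("conference", 4)])
ratio = relative_rate([("journal", 30), ("conference", 6)], rates)
```

A ratio above 1 means the researcher is cited more than expected for the venues they publish in.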
Linkage
Citations in scientific papers are important:
  Demonstrate awareness of background
  Prior work being built upon
  Substantiate claims
  Contrast to competing work
Any other reasons?
One of the main reasons the # of citations by itself is not a good rationale for evaluation.
Non-trivial to unify citations
Citations have different styles:
Citeseer tried edit distance, structured field recognition
Settled on _______________________________________________________________
More work to be done here: OpCit GPL code
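For the edit-distance option, a minimal sketch of Levenshtein distance and a length-normalized similarity between two citation strings. This is a generic illustration, not Citeseer’s actual code; the function names and the example strings are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character insertions,
    deletions and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def similarity(a, b):
    """Case-insensitive similarity in [0, 1]; 1 means identical strings."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - edit_distance(a, b) / longest

# Two hypothetical renderings of the same citation.
score = similarity("A. Author. Some Title. 2000.",
                   "Author, A. (2000) Some title")
```

Reordered fields (author before or after year) are penalized heavily by plain edit distance, which is one reason matching on recognized fields is attractive.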
Computational Analysis of Links
If we know what type of citations/links exist, that can help:
In scientific articles:
  In calculating impact
  In relevance judgment (browsing → survey paper)
  ______________________________
In DL item retrieval:
  In classifying items pointed to by a link
  _________________________
Calculating citation types
Teufel (00): creates Rhetorical Document Profiles
  Capitalizes on fixed structure and argumentative goals in scientific articles (e.g. Related Work)
  Uses _______________________________ of citation to classify a zone (e.g., “In contrast to [1], we …”)
Using link text for classification
The link text that describes a page in another page can be used for classification.
Amitay (98) extended this concept by ranking nearby text fragments using (among other things) __________.
XXXX: …. … .. … ..
… … … .. …. XXX, …. … .. … …
… XXXX[ … ] [ … ] [ …. ]
Ranking related papers in retrieval
Citeseer uses two forms of relatedness to recommend “related articles”:
TF × IDF
  If above a threshold, report it
CC (Common Citation) × IDF
  CC = Bibliographic Coupling
  If two papers share a rare citation, this is more important than if they share a common one.
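A sketch of the CC × IDF idea, assuming IDF is log(N / number of documents citing the reference). The exact weighting Citeseer uses is not given here, and the names and numbers are hypothetical.

```python
import math

def cc_idf(refs_a, refs_b, citing_docs, num_docs):
    """Score two papers by their shared references, weighting each shared
    reference by its inverse document frequency.

    citing_docs[r] is how many documents in the collection cite r.
    """
    shared = set(refs_a) & set(refs_b)
    return sum(math.log(num_docs / citing_docs[r]) for r in shared)

# Hypothetical collection of 1000 documents.
citing_docs = {"rare_ref": 10, "common_ref": 500}
score_rare = cc_idf(["rare_ref", "x"], ["rare_ref", "y"], citing_docs, 1000)
score_common = cc_idf(["common_ref", "x"], ["common_ref", "y"], citing_docs, 1000)
```

Sharing the rarely cited reference yields the higher score, matching the intuition on the slide.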
Citation Analysis
Deciding which (web sites, authors) are most prominent
Citation Analysis
Despite shortcomings, still useful
Citation links viewed as a DAG
Incoming and outgoing links have different treatments
Prominence
Consider a node prominent if its ties make it particularly visible to other nodes in the network (adapted from WF, pg 172)
Centrality – no distinction between incoming and outgoing edges (thus directionality doesn’t matter). How involved is the node in the graph?
Prestige – “Status”. Ranking the prestige of nodes among other nodes. In-degree counts towards prestige.
Centrality
How central is a particular
  Graph?
  Node?
Graph-wide measures assist in comparing graphs, subgraphs
Node Degree Centrality
Degree (In + Out)
Normalized Degree ((In + Out) / Possible)
  What’s max possible?
Variance of Degrees
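A sketch of the three measures for a small directed graph. The normalization assumes a simple directed graph without self-loops, so a node can have at most 2(g − 1) incident edges among g nodes; that choice, the function name and the example graph are assumptions.

```python
from statistics import pvariance

def degree_centrality(nodes, edges):
    """Return (degree, normalized degree, variance of degrees).

    edges is a list of (source, target) pairs in a directed graph.
    """
    in_deg = {n: 0 for n in nodes}
    out_deg = {n: 0 for n in nodes}
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
    # Assumed maximum: every other node both cites and is cited by this node.
    max_possible = 2 * (len(nodes) - 1)
    degree = {n: in_deg[n] + out_deg[n] for n in nodes}
    normalized = {n: degree[n] / max_possible for n in nodes}
    return degree, normalized, pvariance(list(degree.values()))

# Hypothetical citation graph: B, C and D cite A; D also cites C.
degree, normalized, variance = degree_centrality(
    ["A", "B", "C", "D"],
    [("B", "A"), ("C", "A"), ("D", "A"), ("D", "C")],
)
```

The variance of degrees is a graph-wide figure: it is larger when a few nodes hold most of the links.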
Distance Centrality
Closeness = minimal distance
Sum of shortest paths should be minimal in a central graph
____________ = subset of nodes that have minimal sum distance to all nodes.
What about disconnected components?
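A sketch that finds the nodes with minimal summed shortest-path distance in an unweighted, undirected graph, using breadth-first search. Treating unreachable nodes as infinitely far away is only one possible answer to the disconnected case; the names and the example graph are assumptions.

```python
from collections import deque

def total_distance(adj, source):
    """Sum of shortest-path lengths from source to every other node.
    Returns infinity if some node cannot be reached."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adj[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    if len(dist) < len(adj):
        return float("inf")
    return sum(dist.values())

def min_total_distance_nodes(adj):
    """Set of nodes whose summed distance to all other nodes is minimal."""
    totals = {n: total_distance(adj, n) for n in adj}
    best = min(totals.values())
    return {n for n, t in totals.items() if t == best}

# Hypothetical path graph A - B - C - D - E.
path = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"],
        "D": ["C", "E"], "E": ["D"]}
central = min_total_distance_nodes(path)
```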
Betweenness Centrality
A node is central iff it lies between other nodes on their shortest path.
If there is more than one shortest path,
  Treat each with equal weight
  Use some weighting scheme
    Inverse of path length
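A sketch of the equal-weight option for an unweighted, undirected graph: each shortest path between a pair contributes 1 / (number of shortest paths for that pair) to every intermediate node. The accumulation follows Brandes’ algorithm, which is a swapped-in standard technique rather than something named on the slide; the example graphs are hypothetical.

```python
from collections import deque

def betweenness(adj):
    """Betweenness of every node, splitting credit equally among all
    shortest paths between each pair of nodes."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        # BFS from s: count shortest paths (sigma) and record predecessors.
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        preds = {v: [] for v in adj}
        sigma[s], dist[s] = 1, 0
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, passing credit to predecessors.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each undirected pair was visited from both of its endpoints.
    return {v: b / 2 for v, b in score.items()}

# Diamond graph: two equally short routes between A and D, and between B and C.
diamond = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D"], "D": ["B", "C"]}
diamond_scores = betweenness(diamond)
line_scores = betweenness({"A": ["B"], "B": ["A", "C"], "C": ["B"]})
```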
References (besides readings)
Bollen and Luce (02) Evaluation of Digital Library Impact and User Communities by Analysis of Usage Patterns. http://www.dlib.org/dlib/june02/bollen/06bollen.html
Kaplan and Nelson (00) Determining the publication impact of a digital library. http://download.interscience.wiley.com/cgi-bin/fulltext?ID=69503874&PLACEBO=IE.pdf&mode=pdf
Wasserman and Faust (94) Social Network Analysis (on reserve)
Things to think about
What’s the relationship between these three laws (Bradford, Zipf-Yule-Pareto and Lotka)?
How would you define the three zones in Bradford’s law?