|
1
|
- Module 7: Applied Bibliometrics
KAN Min-Yen
|
|
2
|
- Statistical and other forms of quantitative analysis
- Used to discover and chart the growth patterns of information
|
|
3
|
- What is bibliometrics? √
- Bibliometric laws
- Properties of information and its production
|
|
4
|
- Growth
- Fragmentation
- Obsolescence
- Linkage
|
|
5
|
- Exponential rate for several centuries: “information overload”
- 1st known scientific journals: 1665 (Journal des sçavans and Philosophical Transactions)
- Today:
- LINC has about 15,000 in all libraries
- Factors:
- Ease of publication
- Ease of use and increased availability
- Known reputation
|
|
6
|
- P_n ≈ 1/n^a
- where P_n is the frequency of occurrence of the nth-ranked item, and a is some constant.
- “The probability of occurrence of a value of some variable starts high
and tapers off. Thus, a few values occur very often while many others
occur rarely.”
- Pareto – for land ownership in the 1800’s
- Zipf – for word frequency
- Also known as the 80/20 rule and as Zipf-Mandelbrot
- Used to measure citations per paper:
- # of papers cited n times is about 1/n^a of those cited once, where a ≈ ____
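The rank-frequency relation above can be made concrete with a small sketch. The function names `zipf_frequencies` and `top_share` are invented for this illustration, and the exponent `a=1.0` is an arbitrary value chosen to make the example run, not the answer to the blank on the slide.

```python
def zipf_frequencies(num_items: int, a: float) -> list[float]:
    """Return P_n proportional to 1 / n**a for ranks 1..num_items, normalized to sum to 1."""
    weights = [1.0 / (n ** a) for n in range(1, num_items + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def top_share(frequencies: list[float], fraction: float) -> float:
    """Share of the total mass held by the top `fraction` of ranked items."""
    k = max(1, round(len(frequencies) * fraction))
    return sum(frequencies[:k])


# Illustrative parameters only (assumed, not from the slides).
freqs = zipf_frequencies(1000, a=1.0)
print(freqs[0] / freqs[1])    # ratio of rank 1 to rank 2 is 2**a
print(top_share(freqs, 0.2))  # mass held by the top 20% of items
```

Varying `a` shows how quickly the tail thins out: a larger exponent concentrates more of the mass in the first few ranks.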
|
|
7
|
- Some random processes can also result in Zipfian behavior:
- At the beginning there is one “seminal” paper.
- Every subsequent paper makes at most ten citations (or cites all preceding papers if their number does not exceed ten).
- All preceding papers have an equal probability to be cited.
- Result: A Zipfian curve, with a≈1.
What’s your conclusion?
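The random process described above is easy to simulate, which may help in forming your own conclusion. This is a minimal sketch under two assumptions that the slide does not state: each paper cites exactly ten distinct earlier papers once ten are available, and the seed and paper count are arbitrary. The function name `simulate_citations` is made up for the example.

```python
import random
from collections import Counter


def simulate_citations(num_papers: int, max_refs: int = 10, seed: int = 0) -> list[int]:
    """Simulate the process and return citation counts sorted from most to least cited."""
    rng = random.Random(seed)
    cited = Counter()
    # Paper 0 is the seminal paper and cites nothing.
    for paper in range(1, num_papers):
        earlier = range(paper)
        if paper <= max_refs:
            refs = list(earlier)                  # cite every preceding paper
        else:
            refs = rng.sample(earlier, max_refs)  # uniform choice, no repeats
        cited.update(refs)
    return sorted((cited[p] for p in range(num_papers)), reverse=True)


counts = simulate_citations(200)
for rank, c in enumerate(counts[:10], start=1):
    print(rank, c)
```

Plotting `counts` against rank on log-log axes is one way to check how closely the result follows a straight line.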
|
|
8
|
- The number of authors making n contributions is about 1/n^a of those making one contribution, where a ≈__
- Implications:
- A small number of authors produce large number of papers, e.g., 10% of
authors produce half of literature in a field
- Those who achieve success in writing papers are likely to continue having it
|
|
9
|
|
|
10
|
- Journals in a field can be divided into three parts, each with about
one-third of all articles:
- 1) a core of a few journals,
- 2) a second zone, with more journals, and
- 3) a third zone, with the bulk of journals.
- The number of journals is __:__:__
- To think about: Why is this true?
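The three-zone split can be tried out on data. The sketch below ranks journals by article count and assigns each one to a zone according to the cumulative share of articles already covered. The function name `bradford_zones` and the article counts are invented toy values, not measurements from any field.

```python
def bradford_zones(article_counts: list[int]) -> list[list[int]]:
    """Split journals into three zones holding roughly one-third of all articles each.

    Assumes at least one article in total.
    """
    ranked = sorted(article_counts, reverse=True)
    total = sum(ranked)
    zones: list[list[int]] = [[], [], []]
    running = 0
    for count in ranked:
        # Zone is decided by the cumulative share *before* adding this journal.
        zone = min(2, 3 * running // total)
        zones[zone].append(count)
        running += count
    return zones


# Toy data (assumed): articles per journal in a hypothetical field.
counts_per_journal = [60, 20, 15, 10, 8, 7] + [2] * 15 + [1] * 30
zones = bradford_zones(counts_per_journal)
print([len(z) for z in zones])  # journals per zone
print([sum(z) for z in zones])  # articles per zone
```

Comparing the number of journals in each zone on real data is one way to approach the ratio asked for above.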
|
|
11
|
- Influenced by scientific method
- Information is continuous, but discretized into ________________
- (e.g., conference papers, journal article, surveys, texts, Ph.D.
thesis)
- One paper reports one experiment
- Scientists aim to publish in _____________
|
|
12
|
- Motivation from academia
- The “popularity contest”
- Getting others to use your intellectual property and credit you with it
- Spread your knowledge wide across disciplines
- Academic yardstick for tenure (and for hiring)
- The more the better – __________________
- The higher quality the better – ___________
- To think about: what is fragmentation’s relation to the aforementioned
bibliometric laws?
|
|
13
|
- Literature gets outdated fast!
- ½ references < 8 yrs. Chemistry
- ½ references < 5 yrs. Physics
- Textbooks are outdated by the time they are published
- Practical implications in the digital library
- What about computer science?
- To think about: Is it really outdated-ness that is measured or something
else?
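The "½ references < N yrs." figures can be read as a half-life: the median age of the references in a paper. A minimal sketch, assuming that reading; the function name and the years below are made up for illustration.

```python
from statistics import median


def reference_half_life(citing_year: int, cited_years: list[int]) -> float:
    """Median age of references: half of them are at most this many years old."""
    ages = [citing_year - year for year in cited_years]
    return median(ages)


# Toy reference list (assumed) for a paper published in 2004.
cited = [2003, 2002, 2001, 1999, 1995, 1990, 1980]
print(reference_half_life(2004, cited))
```

Note that this measures what authors choose to cite, which is relevant to the question above about whether outdated-ness is really what is being measured.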
|
|
14
|
|
|
15
|
|
|
16
|
- From a large sample, one can calculate expected citation rates
- For journals vs. conferences
- For specific journals vs. other ones
- Can compare a researcher’s productivity against this expected rate
|
|
17
|
|
|
18
|
- Citations in scientific papers are important:
- Demonstrate awareness of background
- Prior work being built upon
- Substantiate claims
- Contrast to competing work
- Any other reasons?
- This variety of purposes is one of the main reasons why the # of citations by itself is not a good basis for evaluation.
|
|
19
|
- Citations have different styles:
- Citeseer tried edit distance, structured field recognition
- Settled on
_______________________________________________________________
- More work to be done here: OpCit GPL code
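Of the approaches listed, edit distance is the simplest to sketch. The code below is a generic Levenshtein distance with a length-normalized threshold; it is not Citeseer's implementation and not the method named in the blank above. The function `same_citation`, its normalization, and the `threshold` value are assumptions for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def same_citation(a: str, b: str, threshold: float = 0.3) -> bool:
    """Guess whether two citation strings refer to the same work (hypothetical rule)."""
    a, b = a.lower().strip(), b.lower().strip()
    longest = max(len(a), len(b)) or 1
    return edit_distance(a, b) / longest <= threshold


short = "Wasserman and Faust (94) Social Network Analysis"
long_ = "Wasserman, S. and Faust, K. (1994). Social Network Analysis."
other = "Kaplan and Nelson (00) Determining the publication impact of a digital library"
print(same_citation(short, long_), same_citation(short, other))
```

A fixed threshold is fragile when citation styles differ in field order, which is one motivation for the structured field recognition also mentioned above.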
|
|
20
|
- If we know what type of citations/links exist, that can help:
- In scientific articles:
- In calculating impact
- In relevance judgment (browsing → survey paper)
- ______________________________
- In DL item retrieval:
- In classifying items pointed by a link
- _________________________
|
|
21
|
- Teufel (00): creates Rhetorical Document Profiles
- Capitalizes on fixed structure and argumentative goals in scientific
articles (e.g. Related Work)
- Uses _______________________________ of citation to classify a zone (e.g., In contrast to [1], we …)
|
|
22
|
- The link text that describes a page in another page can be used for classification.
- Amitay (98) extended this concept by ranking nearby text fragments using (among other things) __________.
|
|
23
|
- Citeseer uses two forms of relatedness to recommend “related articles”:
- TF × IDF
- If above a threshold, report it
- CC (Common Citation) × IDF
- CC = Bibliographic Coupling
- If two papers share a rare citation, this is more important than if
they share a common one.
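The idea that a shared rare citation matters more than a shared common one can be sketched by weighting each shared reference with an inverse document frequency. The formula below (sum of log(N / df) over shared references) is one plausible reading, not necessarily Citeseer's exact scoring; the function name `cc_idf` and the toy corpus are invented.

```python
import math


def cc_idf(papers: dict[str, set[str]], a: str, b: str) -> float:
    """Bibliographic coupling of papers a and b, weighting rare shared citations higher."""
    n = len(papers)
    score = 0.0
    for ref in papers[a] & papers[b]:
        # df: number of papers in the collection that cite this reference.
        df = sum(1 for refs in papers.values() if ref in refs)
        score += math.log(n / df)
    return score


# Toy corpus (assumed): paper id -> set of cited works.
papers = {
    "P1": {"common", "rare"},
    "P2": {"common", "rare"},
    "P3": {"common"},
    "P4": {"common"},
}
print(cc_idf(papers, "P1", "P2"))  # shares a rare citation
print(cc_idf(papers, "P3", "P4"))  # shares only a citation everyone makes
```

A citation made by every paper contributes log(1) = 0, so it adds nothing to relatedness under this weighting.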
|
|
24
|
- Deciding which (web sites, authors) are most prominent
|
|
25
|
- Despite shortcomings, still useful
- Citation links viewed as a DAG
- Incoming and outgoing links have different treatments
|
|
26
|
- Consider a node prominent if its ties make it particularly visible to
other nodes in the network
(adapted from WF, pg 172)
- Centrality – no distinction between incoming and outgoing edges (thus directionality doesn’t matter). Measures how involved the node is in the graph.
- Prestige – “status”. Ranks the prestige of a node among other nodes. In-degree counts towards prestige.
|
|
27
|
- How central is a particular node?
- Graph-wide measures assist in comparing graphs, subgraphs
|
|
28
|
- Degree (In + Out)
- Normalized Degree ((In + Out) / Possible)
- Variance of Degrees
|
|
29
|
- Closeness = minimal distance
- Sum of shortest paths should be minimal in a central graph
- ____________ = subset of nodes that have minimal sum distance to all
nodes.
- What about disconnected components?
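Closeness can be sketched with a breadth-first search over an undirected, unweighted graph. For disconnected components, this example adopts one possible convention (an assumption, not the slide's answer): only reachable nodes are counted, and a node that reaches nothing gets 0. The function names and the toy graph are invented.

```python
from collections import deque


def bfs_distances(adj: dict[str, list[str]], source: str) -> dict[str, int]:
    """Shortest-path length from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def closeness(adj: dict[str, list[str]], node: str) -> float:
    """Reachable-node count divided by the sum of distances (higher = more central)."""
    dist = bfs_distances(adj, node)
    total = sum(dist.values())
    reachable = len(dist) - 1
    return reachable / total if total else 0.0


# Toy graph (assumed): a path A-B-C-D plus an isolated node E.
adj = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"], "E": []}
for v in adj:
    print(v, closeness(adj, v))
```

Nodes B and C have the smallest sum of distances within their component, so they score highest here.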
|
|
30
|
- A node is central iff it lies between other nodes on their shortest
path.
- If there is more than one shortest path,
- Treat each with equal weight
- Use some weighting scheme
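The "treat each with equal weight" option can be sketched with the standard Brandes-style accumulation for unweighted, undirected graphs: when a pair of nodes has k shortest paths, each path contributes 1/k. The function name and the toy graph are assumptions for this illustration.

```python
from collections import deque


def betweenness(adj: dict[str, list[str]]) -> dict[str, float]:
    """Betweenness centrality, splitting credit equally among tied shortest paths."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        # Forward pass: BFS from s, counting shortest paths (sigma) to each node.
        dist = {s: 0}
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        preds: dict[str, list[str]] = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
                if dist[w] == dist[u] + 1:
                    sigma[w] += sigma[u]
                    preds[w].append(u)
        # Backward pass: push path dependencies from far nodes toward s.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for u in preds[w]:
                delta[u] += sigma[u] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Undirected graph: every pair was counted from both endpoints.
    return {v: b / 2 for v, b in score.items()}


# Toy graph (assumed): a square A-B-C-D-A, where opposite corners have two shortest paths.
square = {"A": ["B", "D"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C", "A"]}
print(betweenness(square))
```

In the square, each pair of opposite corners has two shortest paths, so each intermediate node receives half a unit of credit for that pair.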
|
|
31
|
- Bollen and Luce (02). Evaluation of Digital Library Impact and User Communities by Analysis of Usage Patterns. http://www.dlib.org/dlib/june02/bollen/06bollen.html
- Kaplan and Nelson (00). Determining the publication impact of a digital library. http://download.interscience.wiley.com/cgi-bin/fulltext?ID=69503874&PLACEBO=IE.pdf&mode=pdf
- Wasserman and Faust (94). Social Network Analysis (on reserve).
|
|
32
|
- What’s the relationship between these three laws (Bradford,
Zipf-Yule-Pareto and Lotka)?
- How would you define the three zones in Bradford’s law?
|