1
Search Engine Driven
Author Disambiguation
  • Yee Fan Tan and Min-Yen Kan
  • Department of Computer Science
    National University of Singapore, Singapore
  • {tanyeefa,kanmy}@comp.nus.edu.sg

  • Dongwon Lee
  • College of Information Sciences and Technology
    The Pennsylvania State University, USA
  • dongwon@psu.edu
2
Introduction
  • Bibliographic digital libraries
    • Contain a large number of publication metadata records
    • e.g. Citeseer, DBLP
    • Commonly used to measure impact of researchers on community
  • Problem
    • What happens when different authors share the same name?
3
Motivation @ DBLP
4
 
5
Author disambiguation with mixed citations
  • Given:
    • An author name string X representing k unique individuals
    • A list of citations C containing the name X
  • The task:
    • For each citation c in C, determine which of these k individuals c belongs to
  • A k-way classification or clustering problem
6
Internal Resources
  • Past work (Lee et al. 05, Han et al. 05) used internal resources:
    • Knowledge encoded in the records themselves
    • Used field similarity, common co-author strings for clustering
  • Problems with using only internal resources
    • May provide insufficient information, or information that is difficult to extract
    • e.g., two papers on the same topic using disjoint keywords in their titles
  • Therefore, we use resources external to the citation data
7
Research Hypothesis
  • Hypothesis: Using external resources, such as URLs, would help disambiguate author names with mixed citations
  • Many factors to consider:
    • Which external resources to use: URLs, web page contents, affiliations, etc.
    • How to use them: both internal and external? How to mix?
    • How to apply external resources? Weighting?
  • Preliminary study focuses on the case of using URLs with simple weighting
8
External Resources
  • A layperson doing this task with unfamiliar publications might query a search engine with the paper title
  • Our method tries to approximate this
  • For each citation c in C
    • Query a search engine with the title of c as a phrase search to obtain a set of relevant URLs
    • Represent c as a feature vector over the relevant URLs, weighted by a weighting scheme
  • Apply hierarchical agglomerative clustering (HAC) on C to derive k clusters
    • Cosine similarity
    • Tested with single link, complete link and group average
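
The clustering step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the sparse dict representation, the function names `cosine` and `hac`, and the example hostnames are all assumptions, and "average" here means the mean of cross-cluster pairwise similarities.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[key] for key, w in u.items() if key in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def hac(vectors, k, linkage="single"):
    """Merge the most similar pair of clusters until only k remain."""
    n = len(vectors)
    # Precompute similarities between all pairs of citations.
    sim = {(i, j): cosine(vectors[i], vectors[j])
           for i, j in combinations(range(n), 2)}

    def cluster_sim(a, b):
        sims = [sim[(min(i, j), max(i, j))] for i in a for j in b]
        if linkage == "single":      # most similar cross-cluster pair
            return max(sims)
        if linkage == "complete":    # least similar cross-cluster pair
            return min(sims)
        if linkage == "average":     # mean over cross-cluster pairs
            return sum(sims) / len(sims)
        raise ValueError("unknown linkage: " + linkage)

    clusters = [[i] for i in range(n)]
    while len(clusters) > k:
        a, b = max(combinations(range(len(clusters)), 2),
                   key=lambda p: cluster_sim(clusters[p[0]], clusters[p[1]]))
        clusters[a] = clusters[a] + clusters[b]  # a < b, so deleting b is safe
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical citations, each a weighted vector over returned hosts.
citations = [
    {"a.example.edu": 1.0, "agg.example.com": 0.1},
    {"a.example.edu": 1.0},
    {"b.example.org": 1.0, "agg.example.com": 0.1},
    {"b.example.org": 1.0},
]
clusters = hac(citations, k=2, linkage="single")
```

In this toy input the low-weight aggregator host is the only feature shared across the two authors, so the two personal-page hosts drive the merges.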
9
Weighting: Inverse Host Frequency (IHF)
  • Observation
    • Not all URLs are equally useful
    • e.g., aggregator services
  • Desired weighting scheme
    • Low weights to aggregator web sites
    • High weights to personal and group publication pages
  • Inverse Host Frequency (IHF)
    • Similar to Inverse Document Frequency (IDF) in information retrieval
  • Consider citations of top 100 authors in DBLP (by number of citations)
  • For each such citation, query a search engine with its title to obtain URLs, then truncate them to their hostnames
  • If a hostname h has frequency f(h), then its IHF is
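
The IHF formula itself is not shown in this text. Purely as an illustration, the sketch below assumes an IDF-style form, log(N / f(h)), where N is the total number of returned URLs and f(h) counts the URLs on hostname h. Both the formula and the function names `host_frequencies` and `ihf_weights` are assumptions, not the authors' definition.

```python
import math
from collections import Counter
from urllib.parse import urlparse


def host_frequencies(url_lists):
    """Count how many returned URLs fall on each hostname."""
    counts = Counter()
    for urls in url_lists:
        for url in urls:
            host = urlparse(url).hostname
            if host:
                counts[host] += 1
    return counts


def ihf_weights(counts):
    """ASSUMED IDF-style weight log(N / f(h)); not the formula from the slides."""
    total = sum(counts.values())
    return {host: math.log(total / freq) for host, freq in counts.items()}


# Hypothetical search results for three citation titles.
url_lists = [
    ["http://agg.example.com/p1", "http://alice.example.edu/pubs"],
    ["http://agg.example.com/p2"],
    ["http://agg.example.com/p3", "http://bob.example.org/papers"],
]
counts = host_frequencies(url_lists)
weights = ihf_weights(counts)
```

Under this assumed form, the frequently returned aggregator host receives a lower weight than the hosts that appear only once, which matches the desired behaviour described above.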
10
Weighting: Inverse Host Frequency (IHF)
  • We notice that using hostnames alone may be problematic
    • Especially when a host has multiple hostnames, or also appears as an IP address, and these have dissimilar distributions
    • e.g. www.informatik.uni-trier.de, ftp.informatik.uni-trier.de and 136.199.54.185 are the same host
  • Therefore, we also experimented with
    • Domain (e.g. uni-trier.de)
    • Resolving hostnames to IP addresses
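
The three ways of keying a URL (hostname, domain, IP address) could be sketched as below. The function names are hypothetical, and `domain_of` uses a naive "last two labels" rule as an assumption; it works for uni-trier.de but would be wrong for suffixes such as co.uk. `ip_of` accepts an injectable resolver so it can be exercised without network access.

```python
import ipaddress
import socket
from urllib.parse import urlparse


def hostname_of(url):
    """Extract the hostname part of a URL (empty string if absent)."""
    return urlparse(url).hostname or ""


def domain_of(hostname):
    """Naive domain: keep the last two labels (assumption, see lead-in)."""
    try:
        ipaddress.ip_address(hostname)
        return hostname  # already an IP address, nothing to truncate
    except ValueError:
        pass
    return ".".join(hostname.split(".")[-2:])


def ip_of(hostname, resolve=socket.gethostbyname):
    """Resolve a hostname to an IPv4 address; fall back to the hostname."""
    try:
        return resolve(hostname)
    except OSError:
        return hostname


# Offline example with a stand-in resolver built from the slide's example host.
fake_dns = {
    "www.informatik.uni-trier.de": "136.199.54.185",
    "ftp.informatik.uni-trier.de": "136.199.54.185",
}
keys = {ip_of(h, resolve=fake_dns.__getitem__) for h in fake_dns}
```

With the stand-in resolver, both hostnames collapse to a single key, which is the effect the IP-based variant aims for.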
11
Evaluation
  • Dataset
    • Manually-disambiguated dataset of 24 ambiguous names in computer science domain
    • Each ambiguous name represented 2 unique authors (k = 2) except for one where it represented 3
    • On average, 30 citations are attributed to each name
    • Proportion of largest class ranges from 50% to 97%
  • Search engine
    • Google (http://www.google.com/)
12
Evaluation
  • Single link performs best
    • Good for clustering citations from different publication pages together (some pages list only selected publications)
    • Some authors have disparate research areas, not well represented by a centroid vector
  • Resolving hostnames to IP addresses gives the best accuracy
13
Comparison to [Lee et al. 05, IQIS]
14
Discussion
15
Discussion
  • Apparent correlation between accuracy and average number of URLs returned per citation
    • Author names with few URLs tend to fare poorly since results are mainly aggregator web sites
  • We do not observe any apparent relation between accuracy and number of citations for an author name
    • This suggests that our algorithm scales to large numbers of citations
  • Analysis of returned URLs is very fast; execution time is dominated by search engine querying
    • Querying may already be done while spidering, so our algorithm is time-efficient
16
Conclusion
  • Summary
    • We focused on using URLs returned from searching citation titles
    • Respectable average accuracy of 0.836 using IP addresses with single-link HAC
  • Future work
    • Explore other sources of information, such as the publication venues of the citations and the actual contents of the web pages
    • Combine knowledge gained externally and internally to obtain improved performance