1
Document Clustering for DLs
  • CS6210: DLs & Computing in the Humanities
  • 16 October 2003
2
Outline

  • Document Clustering approaches in DLs
      • Lin Li


  • Document Features for clustering/classification
      • Wong Swee Seong
3
A Comparison of Document Clustering Techniques
  • Authors:
    • Michael Steinbach
    • George Karypis
    • Vipin Kumar
4
What are the general applications of document clustering?
  • Improve the precision or recall in information retrieval systems.


  • Automatically generate hierarchical clusters of documents.


  • For document classification.


  • Organizing the results returned by a search engine.
5
Application of Document Clustering in DLs
  • As a post-process to organize search results from DL search engine.
  • The same word can mean different things in different contexts.
6
Common Clustering Techniques
  • Agglomerative hierarchical clustering
  • K-means clustering
  • Bisecting k-means clustering
7
Agglomerative Hierarchical Clustering
  • Bottom-up strategy


  • Each document represented as a weighted attribute vector.


  • Greedy: each step merges the closest pair of clusters.


  • Global: every merge decision considers all pairwise cluster similarities.
8
Distance Functions
  • Single Link – O(n²)
    • Distance = minimum document distance between 2 clusters.
  • Complete Link – O(n³)
    • Distance = maximum document distance between 2 clusters.
  • Group Average – O(n²)
    • Distance = average document distance between 2 clusters.
  • Distance function – cosine measure
    • Cosine(d1, d2) = (d1 • d2) / (||d1|| ||d2||)
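The linkage rules and the cosine measure above can be sketched in a few lines; this is a minimal illustrative pass (single link shown, documents assumed to be dicts of term weights), not the paper's implementation:

```python
import math

def cosine(d1, d2):
    """Cosine similarity between two term-weight vectors (dicts)."""
    dot = sum(w * d2.get(t, 0.0) for t, w in d1.items())
    n1 = math.sqrt(sum(w * w for w in d1.values()))
    n2 = math.sqrt(sum(w * w for w in d2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def single_link_agglomerative(docs, k):
    """Greedy bottom-up merging until k clusters remain.

    Single link: the similarity of two clusters is the maximum
    pairwise document similarity (i.e. the minimum distance)
    between them.
    """
    clusters = [[i] for i in range(len(docs))]
    while len(clusters) > k:
        best = (-1.0, 0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sim = max(cosine(docs[i], docs[j])
                          for i in clusters[a] for j in clusters[b])
                if sim > best[0]:
                    best = (sim, a, b)
        _, a, b = best
        clusters[a].extend(clusters.pop(b))
    return clusters
```

Swapping `max` for `min` or the mean gives complete link and group average, respectively.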

9
K-means Clustering
  • Takes input parameter k, and partitions a set of n documents into k clusters.
  • Intracluster similarity is high.
  • Intercluster similarity is low.
  • Cluster similarity is measured with respect to the mean vector of the documents in a cluster, known as the centroid.
10
K-means Algorithm
  • Select k documents randomly as centroids.
  • Assign all documents to their closest centroids.
  • Recompute the centroid of each cluster.
  • Repeat steps 2 & 3 until centroids do not change.
  • Computational complexity – O(nkt), for n documents, k clusters and t iterations.
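The four steps above can be sketched as follows; a minimal pure-Python version on dense vectors with squared Euclidean distance (an illustrative choice; the paper works with cosine similarity on document vectors):

```python
import random

def kmeans(docs, k, max_iter=100, seed=0):
    """Basic k-means; returns a cluster index for each document.

    Mirrors the slide: pick k documents at random as centroids,
    assign every document to its closest centroid, recompute the
    centroids, and repeat until the assignments stop changing.
    """
    rng = random.Random(seed)
    centroids = [list(c) for c in rng.sample(docs, k)]
    assign = [-1] * len(docs)
    for _ in range(max_iter):
        new_assign = [
            min(range(k), key=lambda c: sum((x - y) ** 2
                for x, y in zip(doc, centroids[c])))
            for doc in docs
        ]
        if new_assign == assign:  # converged
            break
        assign = new_assign
        for c in range(k):
            members = [docs[i] for i, a in enumerate(assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return assign
```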


11
Bisecting K-means Algorithm
  • Start with a single cluster containing all documents.
  • Pick a cluster to split.
  • Find 2 sub-clusters using the basic k-means algorithm.
  • Repeat the bisecting step X times and select the split that results in the highest overall similarity.
  • Repeat the previous three steps until the desired number of clusters is reached.

  • Splitting criterion: largest cluster
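A minimal sketch of the loop above, splitting the largest cluster each round; the trial selection here minimizes the sum of squared distances to the sub-cluster means (one concrete way to score "overall similarity", assumed for illustration):

```python
import random

def two_means(points, idxs, trials=5, seed=0):
    """One bisecting step: split the cluster `idxs` into two with
    basic 2-means, keeping the best of several trials (lowest sum
    of squared distances to the sub-cluster means)."""
    rng = random.Random(seed)
    best_split, best_sse = None, float("inf")
    for _ in range(trials):
        c = rng.sample(idxs, 2)                    # random seeds
        cents = [points[c[0]], points[c[1]]]
        for _ in range(20):                        # basic 2-means
            groups = [[], []]
            for i in idxs:
                d = [sum((x - y) ** 2 for x, y in zip(points[i], ct))
                     for ct in cents]
                groups[d.index(min(d))].append(i)
            cents = [
                [sum(col) / len(g)
                 for col in zip(*(points[i] for i in g))] if g else ct
                for g, ct in zip(groups, cents)
            ]
        sse = sum(sum((x - y) ** 2 for x, y in zip(points[i], cents[gi]))
                  for gi in (0, 1) for i in groups[gi])
        if sse < best_sse:
            best_split, best_sse = groups, sse
    return best_split

def bisecting_kmeans(points, k):
    """Repeatedly split the largest cluster until k clusters exist."""
    clusters = [list(range(len(points)))]
    while len(clusters) < k:
        big = max(clusters, key=len)               # splitting criterion
        clusters.remove(big)
        clusters.extend(two_means(points, big))
    return clusters
```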
12
Evaluation
  • Agglomerative hierarchical clustering is generally superior in quality to standard k-means.
  • Speed is important, so a fast algorithm is preferred:
    • Bisecting k-means
  • Suffix tree
    • Linear time complexity
    • Suffix tree built incrementally
    • O. Zamir and O. Etzioni. Web document clustering: A feasibility demonstration. In SIGIR, 1998.
13
Web Classification using Support Vector Machine
  • Authors:
    • Aixin Sun
    • Ee-Peng Lim
    • Wee-Keong Ng
14
Motivation
  • Web Document Classification
    • A good example of a large, unorganized collection of heterogeneous digital information media.
  • Managing the Information
    • Use a search tool like Google to locate the information.
    • Pre-classify the information into groups for categorical browsing.
15
Objective
  • Study the impact of different web page features on the performance of web classification using SVM.
    • Using text features as the baseline, experiment with different combinations of text, title, and hyperlink features.
16
Experiments
  • Experiment features
    • Text only (X)
    • Text + Title (T)
    • Text + Anchor Words (A)
    • Text + Title + Anchor Words (TA)
  • Dataset
    • WebKB, containing 4159 web pages from the computer science departments of 4 universities (Cornell, Texas, Washington & Wisconsin).
    • 7 categories – student, faculty, staff, department, course, project & other.
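One simple way to realize the X/T/A/TA feature combinations is to merge all tokens into a single bag-of-words, tagging title and anchor tokens with a prefix so the classifier can weight them separately. The prefixing scheme is an illustrative assumption, not necessarily the authors' encoding:

```python
def combine_features(text, title="", anchors="",
                     use_title=False, use_anchor=False):
    """Build a bag-of-words feature dict for one web page.

    text    - body text of the page (baseline X features)
    title   - the page's <title> text (T features)
    anchors - words from hyperlinks pointing at the page (A features)
    """
    feats = {}
    for tok in text.lower().split():
        feats[tok] = feats.get(tok, 0) + 1
    if use_title:
        for tok in title.lower().split():
            key = "title:" + tok          # tagged so SVM sees it apart
            feats[key] = feats.get(key, 0) + 1
    if use_anchor:
        for tok in anchors.lower().split():
            key = "anchor:" + tok
            feats[key] = feats.get(key, 0) + 1
    return feats
```

The four experiment settings then correspond to the flag combinations: X = neither flag, T = `use_title`, A = `use_anchor`, TA = both.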
17
Results
18
Discussion
  • Indication that structural or contextual information of a document may help in classifying or clustering documents.
    • E.g. Scientific paper: Title, Abstract, Keywords, Introduction, Related Work, Method, Experiment, Conclusion
  • Other linkage information, such as references both to and from a document, can be very useful features as well.
    • E.g. What books are purchased or borrowed together
19
Q & A