Document Clustering for DLs
CS6210: DLs & Computing in the Humanities
16 October 2003

Outline
Document clustering approaches in DLs
- Lin Li
Document Features for clustering/classification
- Wong Swee Seong

A Comparison of Document Clustering Techniques
Authors:
Michael Steinbach
George Karypis
Vipin Kumar

What are the general applications of document clustering?
Improve the precision or recall in information retrieval systems.
Automatically generate hierarchical clusters of documents.
For document classification.
Organizing the results returned by a search engine.

Application of Document Clustering in DLs
As a post-processing step to organize search results from a DL search engine.
The same word can mean different things in different contexts.

Common Clustering Techniques
Agglomerative hierarchical clustering
K-means clustering
Bisecting k-means clustering

Agglomerative Hierarchical Clustering
Bottom-up strategy
Each document represented as a weighted attribute vector.
Greedy.
Global.

Distance Functions
Single Link -- O(n²)
Distance = minimum document distance between 2 clusters.
Complete Link -- O(n³)
Distance = maximum document distance between 2 clusters.
Group Average -- O(n²)
Distance = average document distance between 2 clusters.
Underlying document similarity -- cosine measure
Cosine(d1, d2) = (d1 · d2) / (||d1|| ||d2||)
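A minimal sketch of the cosine measure and the single-link cluster distance in plain Python (the function names are illustrative, not from the paper):

```python
import math

def cosine(d1, d2):
    # Cosine similarity of two equal-length weighted attribute vectors.
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)

def single_link(cluster1, cluster2):
    # Single-link distance: the closest pair of documents across the
    # two clusters, using 1 - cosine as the document distance.
    return min(1 - cosine(a, b) for a in cluster1 for b in cluster2)
```

Complete link and group average follow the same pattern, with the maximum and the mean of the pairwise distances, respectively.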

K-means Clustering
Takes an input parameter k and partitions a set of n documents into k clusters such that:
Intracluster similarity is high.
Intercluster similarity is low.
Cluster similarity is measured with respect to the mean value of the documents in a cluster, known as the centroid.

K-means Algorithm
1. Select k documents at random as the initial centroids.
2. Assign each document to its closest centroid.
3. Recompute the centroid of each cluster.
4. Repeat steps 2 & 3 until the centroids no longer change.
Computational complexity -- O(nkt), where t is the number of iterations.
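The steps above can be sketched in plain Python (a toy illustration on low-dimensional vectors with squared Euclidean distance; real document vectors are high-dimensional, and a cosine-based variant is typical for text):

```python
import random

def kmeans(docs, k, max_iter=100, seed=0):
    """Partition docs (lists of floats) into k clusters; return labels and centroids."""
    rng = random.Random(seed)
    # Step 1: select k documents at random as the initial centroids.
    centroids = [list(d) for d in rng.sample(docs, k)]
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each document to its closest centroid.
        new_labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2
                                  for x, y in zip(doc, centroids[c])))
            for doc in docs
        ]
        # Step 4: stop when the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: recompute each centroid as the mean of its cluster.
        for c in range(k):
            members = [d for d, lab in zip(docs, labels) if lab == c]
            if members:
                centroids[c] = [sum(xs) / len(members) for xs in zip(*members)]
    return labels, centroids
```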

Bisecting K-means Algorithm
Starts with a single cluster containing all documents.
1. Pick a cluster to split.
2. Find 2 sub-clusters using the basic k-means algorithm.
3. Repeat step 2 X times and select the split that results in the highest overall similarity.
Repeat steps 1, 2 and 3 until the desired number of clusters is reached.

Splitting criterion: the largest cluster.
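A self-contained sketch of the procedure, assuming squared Euclidean distance and using negative within-cluster spread as a stand-in for "overall similarity" (all names, and the trial count standing in for X, are illustrative):

```python
import random

def dist2(a, b):
    # Squared Euclidean distance between two vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def two_means(docs, rng, iters=20):
    # One bisection: a basic 2-means split of docs into two sub-clusters.
    c0, c1 = (list(d) for d in rng.sample(docs, 2))
    for _ in range(iters):
        part0 = [d for d in docs if dist2(d, c0) <= dist2(d, c1)]
        part1 = [d for d in docs if dist2(d, c0) > dist2(d, c1)]
        if not part0 or not part1:
            break
        c0 = [sum(xs) / len(part0) for xs in zip(*part0)]
        c1 = [sum(xs) / len(part1) for xs in zip(*part1)]
    if not part1:                     # guard against a degenerate split
        part1 = [part0.pop()]
    if not part0:
        part0 = [part1.pop()]
    return part0, part1

def cohesion(cluster):
    # Stand-in for "overall similarity": negative spread around the centroid.
    centroid = [sum(xs) / len(cluster) for xs in zip(*cluster)]
    return -sum(dist2(d, centroid) for d in cluster)

def bisecting_kmeans(docs, k, trials=5, seed=0):
    rng = random.Random(seed)
    clusters = [list(docs)]             # start with a single cluster
    while len(clusters) < k:
        clusters.sort(key=len)          # splitting criterion: largest cluster
        target = clusters.pop()
        # Try the split `trials` times and keep the most cohesive result.
        best = max((two_means(target, rng) for _ in range(trials)),
                   key=lambda parts: cohesion(parts[0]) + cohesion(parts[1]))
        clusters.extend(best)
    return clusters
```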

Evaluation
Agglomerative hierarchical clustering generally produces better clusters than k-means, but is slower.
Speed is important, so a fast algorithm is preferred:
Bisecting k-means
Suffix tree
Linear time complexity
Suffix tree built incrementally
O. Zamir and O. Etzioni. Web document clustering: A feasibility demonstration. In SIGIR, 1998.

Web Classification using Support Vector Machine
Authors:
Aixin Sun
Ee-Peng Lim
Wee-Keong Ng

Motivation
Web Document Classification
A good example of a large, unorganized collection of heterogeneous digital information media.
Managing the Information
Use a search tool like Google to locate the information.
Pre-classify the information into groups for categorical browsing.

Objective
Study the impact of different web page features on the performance of web classification using SVM.
Using text features as the baseline, experiment with different combinations of text, title, and hyperlink (anchor) features.
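One hedged sketch of how such feature combinations might be built (the tagging scheme and all names here are illustrative, not the paper's exact method): tokens from each source are prefixed so the classifier sees, e.g., a title word as a distinct feature from the same word in the body text.

```python
def combine_features(text, title=None, anchor_words=None):
    """Bag-of-words features from selected page components.

    Tokens are prefixed by their source, so 'course' in a title becomes
    a different feature from 'course' in the body text.
    """
    counts = {}
    for tag, content in (("text", text), ("title", title),
                         ("anchor", anchor_words)):
        if not content:
            continue
        for token in content.lower().split():
            key = tag + ":" + token
            counts[key] = counts.get(key, 0) + 1
    return counts

# Feature set "TA": text + title + anchor words.
features = combine_features(
    text="introduction to operating systems",
    title="CS course page",
    anchor_words="course homepage")
```

Passing only `text` corresponds to the baseline feature set (X); the other combinations (T, A, TA) follow by including or omitting the optional arguments.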

Experiments
Experiment features
Text only (X)
Text + Title (T)
Text + Anchor Words (A)
Text + Title + Anchor Words (TA)
Dataset
WebKB, containing 4159 web pages from the computer science departments of 4 universities (Cornell, Texas, Washington & Wisconsin).
7 categories – student, faculty, staff, department, course, project & other.

Results

Discussion
An indication that the structural or contextual information of a document may help in classifying or clustering documents.
E.g. a scientific paper: Title, Abstract, Keywords, Introduction, Related work, Method, Experiment, Conclusion
Other linkage information, such as references both to and from a document, can be a very useful feature as well.
E.g. which books are purchased or borrowed together

Q & A