Document Clustering for DLs
CS6210: DLs & Computing in the Humanities
16 October 2003

Outline
Document clustering approaches in DLs
- Lin Li
Document Features for clustering/classification
- Wong Swee Seong

A Comparison of Document Clustering Techniques
Authors:
Michael Steinbach
George Karypis
Vipin Kumar

What are the general applications of document clustering?
Improve the precision or recall in information retrieval systems.
Automatically generate hierarchical clusters of documents.
For document classification.
Organizing the results returned by a search engine.

Application of Document Clustering in DLs
As a post-processing step to organize search results from a DL search engine.
The same word can mean different things in different contexts.

Common Clustering Techniques
Agglomerative hierarchical clustering
K-means clustering
Bisecting k-means clustering

Agglomerative Hierarchical Clustering
Bottom-up strategy
Each document represented as a weighted attribute vector.
Greedy.
Global.

Distance Functions
Single Link -- O(n²)
Distance = minimum document distance between 2 clusters.
Complete Link -- O(n³)
Distance = maximum document distance between 2 clusters.
Group Average -- O(n²)
Distance = average document distance between 2 clusters.
Underlying document similarity -- cosine measure
Cosine(d1, d2) = (d1 · d2) / (||d1|| ||d2||)
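A minimal sketch of the cosine measure and the single-link cluster distance in plain Python (the function names are illustrative, not from the paper):

```python
import math

def cosine(d1, d2):
    # Cosine similarity of two equal-length weighted attribute vectors.
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)

def single_link(cluster1, cluster2):
    # Single-link distance: the closest pair of documents across the
    # two clusters, using 1 - cosine as the document distance.
    return min(1 - cosine(a, b) for a in cluster1 for b in cluster2)
```

Complete link and group average follow the same pattern, with the maximum and the mean of the pairwise distances, respectively.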

K-means Clustering
Takes an input parameter k and partitions a set of n documents into k clusters such that:
Intracluster similarity is high.
Intercluster similarity is low.
Cluster similarity is measured with respect to the mean value of the documents in a cluster, known as the centroid.

K-means Algorithm
1. Select k documents at random as the initial centroids.
2. Assign each document to its closest centroid.
3. Recompute the centroid of each cluster.
4. Repeat steps 2 & 3 until the centroids no longer change.
Computational complexity -- O(nkt), where t is the number of iterations.
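The steps above can be sketched in plain Python (a toy illustration on low-dimensional vectors with squared Euclidean distance; real document vectors are high-dimensional, and a cosine-based variant is typical for text):

```python
import random

def kmeans(docs, k, max_iter=100, seed=0):
    """Partition docs (lists of floats) into k clusters; return labels and centroids."""
    rng = random.Random(seed)
    # Step 1: select k documents at random as the initial centroids.
    centroids = [list(d) for d in rng.sample(docs, k)]
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each document to its closest centroid.
        new_labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2
                                  for x, y in zip(doc, centroids[c])))
            for doc in docs
        ]
        # Step 4: stop when the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: recompute each centroid as the mean of its cluster.
        for c in range(k):
            members = [d for d, lab in zip(docs, labels) if lab == c]
            if members:
                centroids[c] = [sum(xs) / len(members) for xs in zip(*members)]
    return labels, centroids
```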

Bisecting K-means Algorithm
Starts with a single cluster containing all documents.
1. Pick a cluster to split.
2. Find 2 sub-clusters using the basic k-means algorithm.
3. Repeat step 2 X times and select the split that results in the highest overall similarity.
Repeat steps 1, 2 and 3 until the desired number of clusters is reached.

Splitting criterion: the largest cluster.
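A self-contained sketch of the procedure, assuming squared Euclidean distance and using negative within-cluster spread as a stand-in for "overall similarity" (all names, and the trial count standing in for X, are illustrative):

```python
import random

def dist2(a, b):
    # Squared Euclidean distance between two vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def two_means(docs, rng, iters=20):
    # One bisection: a basic 2-means split of docs into two sub-clusters.
    c0, c1 = (list(d) for d in rng.sample(docs, 2))
    for _ in range(iters):
        part0 = [d for d in docs if dist2(d, c0) <= dist2(d, c1)]
        part1 = [d for d in docs if dist2(d, c0) > dist2(d, c1)]
        if not part0 or not part1:
            break
        c0 = [sum(xs) / len(part0) for xs in zip(*part0)]
        c1 = [sum(xs) / len(part1) for xs in zip(*part1)]
    if not part1:                     # guard against a degenerate split
        part1 = [part0.pop()]
    if not part0:
        part0 = [part1.pop()]
    return part0, part1

def cohesion(cluster):
    # Stand-in for "overall similarity": negative spread around the centroid.
    centroid = [sum(xs) / len(cluster) for xs in zip(*cluster)]
    return -sum(dist2(d, centroid) for d in cluster)

def bisecting_kmeans(docs, k, trials=5, seed=0):
    rng = random.Random(seed)
    clusters = [list(docs)]             # start with a single cluster
    while len(clusters) < k:
        clusters.sort(key=len)          # splitting criterion: largest cluster
        target = clusters.pop()
        # Try the split `trials` times and keep the most cohesive result.
        best = max((two_means(target, rng) for _ in range(trials)),
                   key=lambda parts: cohesion(parts[0]) + cohesion(parts[1]))
        clusters.extend(best)
    return clusters
```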

Evaluation
Agglomerative hierarchical clustering generally produces better clusters than k-means, but is slower.
Speed is important, so a fast algorithm is preferred:
Bisecting k-means
Suffix tree
Linear time complexity
Suffix tree built incrementally
O. Zamir and O. Etzioni. Web document clustering: A feasibility demonstration. In SIGIR, 1998.

Web Classification using Support Vector Machine
Authors:
Aixin Sun
Ee-Peng Lim
Wee-Keong Ng

Motivation
Web Document Classification
A good example of a large, unorganized collection of heterogeneous digital information media.
Managing the Information
Use a search tool like Google to locate the information.
Pre-classify the information into groups for categorical browsing.

Objective
Study the impact of different web page features on the performance of web classification using SVM.
Using text features as the baseline, experiment with different combinations of text, title, and hyperlink (anchor) features.
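One hedged sketch of how such feature combinations might be built (the tagging scheme and all names here are illustrative, not the paper's exact method): tokens from each source are prefixed so the classifier sees, e.g., a title word as a distinct feature from the same word in the body text.

```python
def combine_features(text, title=None, anchor_words=None):
    """Bag-of-words features from selected page components.

    Tokens are prefixed by their source, so 'course' in a title becomes
    a different feature from 'course' in the body text.
    """
    counts = {}
    for tag, content in (("text", text), ("title", title),
                         ("anchor", anchor_words)):
        if not content:
            continue
        for token in content.lower().split():
            key = tag + ":" + token
            counts[key] = counts.get(key, 0) + 1
    return counts

# Feature set "TA": text + title + anchor words.
features = combine_features(
    text="introduction to operating systems",
    title="CS course page",
    anchor_words="course homepage")
```

Passing only `text` corresponds to the baseline feature set (X); the other combinations (T, A, TA) follow by including or omitting the optional arguments.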

Experiments
Experiment features
Text only (X)
Text + Title (T)
Text + Anchor Words (A)
Text + Title + Anchor Words (TA)
Dataset
WebKB, containing 4159 web pages from the computer science departments of 4 universities (Cornell, Texas, Washington & Wisconsin).
7 categories – student, faculty, staff, department, course, project & other.

Results

Discussion
An indication that the structural or contextual information of a document may help in classifying or clustering documents.
E.g. a scientific paper: Title, Abstract, Keywords, Introduction, Related work, Method, Experiment, Conclusion
Other linkage information, such as references both to and from a document, can be a very useful feature as well.
E.g. which books are purchased or borrowed together

Q & A