|
1
|
- CS6210: DLs & Computing in the Humanities
- 16 October 2003
|
|
2
|
- Document Clustering approaches in DLs
- Document Features for clustering/classification
|
|
3
|
- Authors:
- Michael Steinbach
- George Karypis
- Vipin Kumar
|
|
4
|
- Improve the precision or recall in information retrieval systems.
- Automatically generate hierarchical clusters of documents.
- For document classification.
- Organizing the results returned by a search engine.
|
|
5
|
- As a post-processing step to organize search results from a DL search engine.
- The same word can mean different things in different contexts.
|
|
6
|
- Agglomerative hierarchical clustering
- K-means clustering
- Bisecting k-means clustering
|
|
7
|
- Bottom-up strategy.
- Each document is represented as a weighted attribute vector.
- Greedy.
- Global.
|
|
8
|
- Single Link -- O(n²)
- Distance = minimum document distance between 2 clusters.
- Complete Link -- O(n³)
- Distance = maximum document distance between 2 clusters.
- Group Average -- O(n²)
- Distance = average document distance between 2 clusters.
- Distance function -- cosine measure
- Cosine(d1, d2) = (d1 • d2) / (||d1|| ||d2||)
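The cosine measure and the three linkage rules above can be sketched in plain Python (a minimal illustration over small dense vectors; the helper names are mine, not the paper's):

```python
import math

def cosine(d1, d2):
    """Cosine(d1, d2) = (d1 . d2) / (||d1|| ||d2||)."""
    dot = sum(a * b for a, b in zip(d1, d2))
    return dot / (math.sqrt(sum(a * a for a in d1)) *
                  math.sqrt(sum(b * b for b in d2)))

def single_link(c1, c2):
    # Distance = minimum document distance between the 2 clusters.
    return min(1 - cosine(a, b) for a in c1 for b in c2)

def complete_link(c1, c2):
    # Distance = maximum document distance between the 2 clusters.
    return max(1 - cosine(a, b) for a in c1 for b in c2)

def group_average(c1, c2):
    # Distance = average document distance between the 2 clusters.
    dists = [1 - cosine(a, b) for a in c1 for b in c2]
    return sum(dists) / len(dists)
```

By construction, the single-link distance never exceeds the group-average distance, which never exceeds the complete-link distance.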
|
|
9
|
- Takes an input parameter k and partitions a set of n documents into k
clusters, so that:
- Intracluster similarity is high.
- Intercluster similarity is low.
- Cluster similarity is measured with respect to the mean value of the
documents in a cluster, known as the centroid.
|
|
10
|
- 1. Select k documents randomly as the initial centroids.
- 2. Assign all documents to their closest centroids.
- 3. Recompute the centroid of each cluster.
- 4. Repeat steps 2 & 3 until the centroids no longer change.
- Computational complexity -- O(nkt), where t is the number of iterations.
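The steps above can be sketched as a short k-means routine (a minimal sketch on dense vectors; squared Euclidean distance is used here for brevity, whereas the document-clustering setting would use the cosine measure):

```python
import random

def kmeans(docs, k, max_iter=100):
    """Basic k-means; O(n*k*t), where t is the number of iterations."""
    # Step 1: select k documents randomly as the initial centroids.
    centroids = random.sample(docs, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Step 2: assign every document to its closest centroid.
        clusters = [[] for _ in range(k)]
        for d in docs:
            i = min(range(k), key=lambda j: sum(
                (x - c) ** 2 for x, c in zip(d, centroids[j])))
            clusters[i].append(d)
        # Step 3: recompute the centroid (mean) of each cluster.
        new_centroids = [
            [sum(xs) / len(c) for xs in zip(*c)] if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        # Step 4: stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return clusters, centroids
```

Each iteration touches every document once per centroid, which gives the O(nkt) bound stated above.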
|
|
11
|
- Starts with a single cluster containing all documents.
- 1. Pick a cluster to split.
- 2. Find 2 sub-clusters using the basic k-means algorithm.
- 3. Repeat step 2 X times and select the split that results in the highest overall similarity.
- 4. Repeat steps 1, 2 and 3 until the desired number of clusters is reached.
Splitting criterion: the largest cluster
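A self-contained sketch of the procedure above (squared Euclidean distance stands in for the cosine measure, and negative within-cluster scatter stands in for "overall similarity" -- both simplifying assumptions of this sketch, not the paper's exact choices):

```python
import random

def two_means(docs, max_iter=50):
    """Step 2: find 2 sub-clusters using the basic k-means algorithm."""
    centroids = random.sample(docs, 2)
    clusters = ([], [])
    for _ in range(max_iter):
        clusters = ([], [])
        for d in docs:
            dists = [sum((x - c) ** 2 for x, c in zip(d, ctr))
                     for ctr in centroids]
            clusters[dists.index(min(dists))].append(d)
        new = [[sum(xs) / len(c) for xs in zip(*c)] if c else centroids[i]
               for i, c in enumerate(clusters)]
        if new == centroids:
            break
        centroids = new
    return clusters

def overall_similarity(clusters):
    # Higher = tighter clusters (negative total scatter around centroids).
    total = 0.0
    for c in clusters:
        if not c:
            continue
        ctr = [sum(xs) / len(c) for xs in zip(*c)]
        total += sum(sum((x - m) ** 2 for x, m in zip(d, ctr)) for d in c)
    return -total

def bisecting_kmeans(docs, k, trials=8):
    clusters = [list(docs)]          # start with a single cluster
    while len(clusters) < k:
        # Splitting criterion: pick the largest cluster.
        target = max(clusters, key=len)
        clusters.remove(target)
        # Step 3: repeat the split `trials` times, keep the best one.
        best = max((two_means(target) for _ in range(trials)),
                   key=overall_similarity)
        clusters.extend(list(c) for c in best)
    return clusters
```

Here `trials` plays the role of X in step 3: each split is attempted several times and only the tightest result is kept.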
|
|
12
|
- Agglomerative hierarchical clustering is superior to k-means.
- When speed is important, a fast algorithm is preferred:
- Suffix tree
- Linear time complexity
- The suffix tree is built incrementally.
- O. Zamir and O. Etzioni. Web Document Clustering: A Feasibility Demonstration. In Proceedings of SIGIR, 1998.
|
|
13
|
- Authors:
- Aixin Sun
- Ee-Peng Lim
- Wee-Keong Ng
|
|
14
|
- Web Document Classification
- A good example of a large, unorganized collection of heterogeneous
digital information media.
- Managing the information:
- A search tool like Google to locate the information.
- Pre-classify the information into groups for categorical browsing.
|
|
15
|
- Study the impact of different web page features on the performance of
web page classification using SVM.
- With text features as the baseline, experiment with different combinations
of text, title, and hyperlink (anchor) features.
|
|
16
|
- Experiment features
- Text only (X)
- Text + Title (T)
- Text + Anchor Words (A)
- Text + Title + Anchor Words (TA)
- Dataset
- WebKB, containing 4159 web pages from the computer science departments of 4
universities (Cornell, Texas, Washington & Wisconsin).
- 7 categories -- student, faculty, staff, department, course, project
& other.
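One way the four feature settings (X, T, A, TA) can be realized is by tagging each token with the field it came from, so a title word and the same body word become distinct features; this encoding is an assumption of the sketch, not necessarily the paper's exact scheme:

```python
def make_features(text, title="", anchors=(), mode="X"):
    """Token list for one web page under the four feature settings.

    mode: "X" = text only, "T" = text + title, "A" = text + anchor
    words, "TA" = text + title + anchor words.  Field prefixes keep
    the fields distinct in the resulting feature space.
    """
    tokens = ["text:" + w for w in text.lower().split()]
    if "T" in mode:
        tokens += ["title:" + w for w in title.lower().split()]
    if "A" in mode:
        tokens += ["anchor:" + w for a in anchors for w in a.lower().split()]
    return tokens
```

The resulting token lists would then be turned into weighted vectors (e.g. TF-IDF) before being fed to the SVM.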
|
|
18
|
- An indication that the structural or contextual information of a document may
help in classifying or clustering documents.
- E.g. a scientific paper: Title, Abstract, Keywords, Introduction, Related
Work, Method, Experiment, Conclusion.
- Other linkage information, such as references both to and from a document,
can be a VERY useful feature as well.
- E.g. which books are purchased or borrowed together.
|
|