Document Clustering for DLs

CS6210: DLs & Computing in the Humanities
16 October 2003
Outline

- Document Clustering approaches in DLs (Lin Li)
- Document Features for clustering/classification (Wong Swee Seong)
A Comparison of Document Clustering Techniques

Authors:
- Michael Steinbach
- George Karypis
- Vipin Kumar
What are the general applications of document clustering?

- Improving precision or recall in information retrieval systems.
- Automatically generating hierarchical clusters of documents.
- Document classification.
- Organizing the results returned by a search engine.
Application of Document Clustering in DLs

- As a post-process to organize search results from a DL search engine.
- Useful because the same word can mean different things in different contexts.
Common Clustering Techniques

- Agglomerative hierarchical clustering
- K-means clustering
- Bisecting k-means clustering
Agglomerative Hierarchical Clustering

- Bottom-up strategy.
- Each document is represented as a weighted attribute vector.
- Greedy: merges the closest pair of clusters at each step.
- Global: every pair of clusters is considered when choosing a merge.
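The bottom-up merge loop can be sketched in a few lines of Python (a toy illustration, not the authors' implementation; `dist` stands in for any cluster-to-cluster distance function, and documents are left as opaque objects):

```python
def agglomerate(docs, dist, target=1):
    """Greedy bottom-up clustering: start with one singleton cluster per
    document and repeatedly merge the globally closest pair of clusters
    until only `target` clusters remain."""
    clusters = [[d] for d in docs]
    while len(clusters) > target:
        # Global, greedy step: scan every pair of clusters for the closest two.
        i, j = min(
            ((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
            key=lambda pair: dist(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters
```

Stopping at `target` clusters yields a flat clustering; recording each merge instead would give the full hierarchy.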
Distance Functions

- Single Link -- O(n²)
  - Distance = minimum document distance between 2 clusters.
- Complete Link -- O(n³)
  - Distance = maximum document distance between 2 clusters.
- Group Average -- O(n²)
  - Distance = average document distance between 2 clusters.
- Document distance function: the cosine measure
  - cosine(d1, d2) = (d1 · d2) / (||d1|| ||d2||)
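As a concrete illustration of the cosine measure (a sketch, assuming documents are stored as sparse {term: weight} dictionaries):

```python
import math

def cosine(d1, d2):
    """cosine(d1, d2) = (d1 . d2) / (||d1|| ||d2||) for sparse
    {term: weight} vectors; 1.0 = same direction, 0.0 = no shared terms."""
    dot = sum(w * d2.get(t, 0.0) for t, w in d1.items())
    norm1 = math.sqrt(sum(w * w for w in d1.values()))
    norm2 = math.sqrt(sum(w * w for w in d2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    return dot / (norm1 * norm2)
```

Because document term weights are non-negative, the result lies in [0, 1], so 1 - cosine serves directly as a distance.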
K-means Clustering

- Takes an input parameter k and partitions a set of n documents into k clusters such that:
  - intra-cluster similarity is high;
  - inter-cluster similarity is low.
- Cluster similarity is measured with respect to the mean vector of the documents in a cluster, known as the centroid.
K-means Algorithm

1. Select k documents randomly as centroids.
2. Assign all documents to their closest centroids.
3. Recompute the centroid of each cluster.
4. Repeat steps 2 & 3 until the centroids do not change.

- Computational complexity: O(nkt), where t is the number of iterations.
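The steps above map almost line for line onto a small Python sketch (documents as dense vectors, i.e. lists of floats; a toy illustration, not the paper's implementation):

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two dense vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(docs, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 1: select k documents at random as the initial centroids.
    centroids = [list(d) for d in rng.sample(docs, k)]
    for _ in range(max_iter):
        # Step 2: assign every document to its closest centroid.
        labels = [min(range(k), key=lambda c: squared_distance(d, centroids[c]))
                  for d in docs]
        # Step 3: recompute each centroid as the mean of its cluster.
        new_centroids = []
        for c in range(k):
            members = [docs[i] for i, lab in enumerate(labels) if lab == c]
            new_centroids.append(
                [sum(col) / len(members) for col in zip(*members)]
                if members else centroids[c])
        # Step 4: stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids
```

Each of the t iterations compares all n documents against k centroids, which is where the O(nkt) complexity on the slide comes from.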
Bisecting K-means Algorithm

1. Start with a single cluster containing all documents.
2. Pick a cluster to split.
3. Find 2 sub-clusters using the basic k-means algorithm.
4. Repeat step 3 X times and select the split that results in the highest overall similarity.
5. Repeat steps 2-4 until the desired number of clusters is reached.

- Splitting criterion: the largest cluster.
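The loop can be sketched in Python (a toy version with dense vectors and squared-distance cost in place of overall similarity; `trials` plays the role of X above):

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def bisect(cluster, trials, rng):
    """Split one cluster with basic 2-means, `trials` times, and keep the
    split whose two sub-clusters are tightest overall."""
    best_groups, best_cost = None, float("inf")
    for _ in range(trials):
        centroids = [list(d) for d in rng.sample(cluster, 2)]
        for _ in range(50):
            groups = [[], []]
            for d in cluster:
                closer = (squared_distance(d, centroids[1])
                          < squared_distance(d, centroids[0]))
                groups[closer].append(d)
            new = [[sum(col) / len(g) for col in zip(*g)] if g else centroids[i]
                   for i, g in enumerate(groups)]
            if new == centroids:
                break
            centroids = new
        cost = sum(min(squared_distance(d, c) for c in centroids) for d in cluster)
        if all(groups) and cost < best_cost:
            best_groups, best_cost = groups, cost
    # Fallback in case every trial produced an empty sub-cluster.
    return best_groups if best_groups else [cluster[:1], cluster[1:]]

def bisecting_kmeans(docs, k, trials=5, seed=0):
    rng = random.Random(seed)
    clusters = [list(docs)]                 # start with a single cluster
    while len(clusters) < k:
        victim = max(clusters, key=len)     # splitting criterion: largest cluster
        clusters.remove(victim)
        clusters.extend(bisect(victim, trials, rng))
    return clusters
```

Because each split only runs 2-means, every bisection is cheap, which is the source of the algorithm's speed advantage.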
Evaluation

- Agglomerative hierarchical clustering is generally regarded as superior to k-means.
- But speed is important, so a fast algorithm is preferred:
  - Bisecting k-means
  - Suffix tree clustering
    - Linear time complexity
    - The suffix tree is built incrementally
- O. Zamir and O. Etzioni. Web Document Clustering: A Feasibility Demonstration. In SIGIR, 1998.
Web Classification using Support Vector Machine

Authors:
- Aixin Sun
- Ee-Peng Lim
- Wee-Keong Ng
Motivation

- Web Document Classification
  - The Web is a good example of a large, unorganized collection of heterogeneous digital information media.
- Managing the Information
  - Use a search tool like Google to locate the information.
  - Pre-classify the information into groups for categorical browsing.
Objective

- Study the impact of different web page features on the performance of web classification using SVM.
- With text features as the baseline, experiment with different combinations of text, title, and hyperlink (anchor) features.
Experiments

- Experiment features
  - Text only (X)
  - Text + Title (T)
  - Text + Anchor Words (A)
  - Text + Title + Anchor Words (TA)
- Dataset
  - WebKB, containing 4159 web pages from the computer science departments of 4 universities (Cornell, Texas, Washington & Wisconsin).
  - 7 categories: student, faculty, staff, department, course, project & other.
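One way the four feature schemes might be assembled before training a linear SVM, as a hedged sketch (the `page` layout and the field prefixes are assumptions of this illustration, not the paper's actual representation):

```python
def build_features(page, use_title=False, use_anchors=False):
    """Assemble one page's token list for the X / T / A / TA schemes.
    `page` maps "text", "title", and "anchors" to token lists; anchor
    words come from hyperlinks pointing at the page.  The prefixes keep
    a title word distinct from the same word in the body text."""
    tokens = list(page["text"])                             # X: text only (baseline)
    if use_title:
        tokens += ["title:" + w for w in page["title"]]     # + T
    if use_anchors:
        tokens += ["anchor:" + w for w in page["anchors"]]  # + A
    return tokens

page = {"text": ["course", "syllabus"],
        "title": ["cs6210"],
        "anchors": ["timetable"]}
ta = build_features(page, use_title=True, use_anchors=True)  # the TA scheme
```

Each token list would then be turned into a weighted vector (e.g. TF-IDF) before being fed to the SVM learner.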
Results
Discussion

- Indication that the structural or contextual information of a document may help in classifying or clustering documents.
  - E.g. a scientific paper: Title, Abstract, Keywords, Introduction, Related Work, Method, Experiment, Conclusion.
- Other linkage information, such as references both to and from a document, can be very useful features as well.
  - E.g. which books are purchased or borrowed together.
Q & A