|
1
|
- CS6210: DLs & Computing in the Humanities
- 16 October 2003
|
|
2
|
- Document Clustering approaches in DLs
- Document Features for clustering/classification
|
|
3
|
- Authors:
- Michael Steinbach
- George Karypis
- Vipin Kumar
|
|
4
|
- Improve the precision or recall in information retrieval systems.
- Automatically generate hierarchical clusters of documents.
- For document classification.
- Organizing the results returned by a search engine.
|
|
5
|
- As a post-processing step to organize search results from a DL search engine.
- The same word can mean different things in different contexts.
|
|
6
|
- Agglomerative hierarchical clustering
- K-means clustering
- Bisecting k-means clustering
|
|
7
|
- Bottom-up strategy.
- Each document is represented as a weighted attribute vector.
- Greedy.
- Global.
|
|
8
|
- Single Link -- O(n²)
- Distance = minimum document distance between 2 clusters.
- Complete Link -- O(n³)
- Distance = maximum document distance between 2 clusters.
- Group Average -- O(n²)
- Distance = average document distance between 2 clusters.
- Distance function -- cosine measure
- Cosine(d1, d2) = (d1 • d2) / (||d1|| ||d2||)
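The cosine measure and the three linkage rules above can be sketched in plain Python (a minimal illustration over small dense vectors; the helper names are mine, not the paper's):

```python
import math

def cosine(d1, d2):
    """Cosine(d1, d2) = (d1 . d2) / (||d1|| ||d2||)."""
    dot = sum(a * b for a, b in zip(d1, d2))
    return dot / (math.sqrt(sum(a * a for a in d1)) *
                  math.sqrt(sum(b * b for b in d2)))

def single_link(c1, c2):
    # Distance = minimum document distance between the 2 clusters.
    return min(1 - cosine(a, b) for a in c1 for b in c2)

def complete_link(c1, c2):
    # Distance = maximum document distance between the 2 clusters.
    return max(1 - cosine(a, b) for a in c1 for b in c2)

def group_average(c1, c2):
    # Distance = average document distance between the 2 clusters.
    dists = [1 - cosine(a, b) for a in c1 for b in c2]
    return sum(dists) / len(dists)
```

By construction, the single-link distance never exceeds the group-average distance, which never exceeds the complete-link distance.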
|
|
9
|
- Takes an input parameter k and partitions a set of n documents into k
clusters, so that:
- Intracluster similarity is high.
- Intercluster similarity is low.
- Cluster similarity is measured with respect to the mean value of the
documents in a cluster, known as the centroid.
|
|
10
|
- 1. Select k documents randomly as the initial centroids.
- 2. Assign all documents to their closest centroids.
- 3. Recompute the centroid of each cluster.
- 4. Repeat steps 2 & 3 until the centroids no longer change.
- Computational complexity -- O(nkt), where t is the number of iterations.
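The steps above can be sketched as a short k-means routine (a minimal sketch on dense vectors; squared Euclidean distance is used here for brevity, whereas the document-clustering setting would use the cosine measure):

```python
import random

def kmeans(docs, k, max_iter=100):
    """Basic k-means; O(n*k*t), where t is the number of iterations."""
    # Step 1: select k documents randomly as the initial centroids.
    centroids = random.sample(docs, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Step 2: assign every document to its closest centroid.
        clusters = [[] for _ in range(k)]
        for d in docs:
            i = min(range(k), key=lambda j: sum(
                (x - c) ** 2 for x, c in zip(d, centroids[j])))
            clusters[i].append(d)
        # Step 3: recompute the centroid (mean) of each cluster.
        new_centroids = [
            [sum(xs) / len(c) for xs in zip(*c)] if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        # Step 4: stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return clusters, centroids
```

Each iteration touches every document once per centroid, which gives the O(nkt) bound stated above.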
|
|
11
|
- Starts with a single cluster containing all documents.
- 1. Pick a cluster to split.
- 2. Find 2 sub-clusters using the basic k-means algorithm.
- 3. Repeat step 2 X times and select the split that results in the highest overall similarity.
- 4. Repeat steps 1, 2 and 3 until the desired number of clusters is reached.
Splitting criterion: the largest cluster
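A self-contained sketch of the procedure above (squared Euclidean distance stands in for the cosine measure, and negative within-cluster scatter stands in for "overall similarity" -- both simplifying assumptions of this sketch, not the paper's exact choices):

```python
import random

def two_means(docs, max_iter=50):
    """Step 2: find 2 sub-clusters using the basic k-means algorithm."""
    centroids = random.sample(docs, 2)
    clusters = ([], [])
    for _ in range(max_iter):
        clusters = ([], [])
        for d in docs:
            dists = [sum((x - c) ** 2 for x, c in zip(d, ctr))
                     for ctr in centroids]
            clusters[dists.index(min(dists))].append(d)
        new = [[sum(xs) / len(c) for xs in zip(*c)] if c else centroids[i]
               for i, c in enumerate(clusters)]
        if new == centroids:
            break
        centroids = new
    return clusters

def overall_similarity(clusters):
    # Higher = tighter clusters (negative total scatter around centroids).
    total = 0.0
    for c in clusters:
        if not c:
            continue
        ctr = [sum(xs) / len(c) for xs in zip(*c)]
        total += sum(sum((x - m) ** 2 for x, m in zip(d, ctr)) for d in c)
    return -total

def bisecting_kmeans(docs, k, trials=8):
    clusters = [list(docs)]          # start with a single cluster
    while len(clusters) < k:
        # Splitting criterion: pick the largest cluster.
        target = max(clusters, key=len)
        clusters.remove(target)
        # Step 3: repeat the split `trials` times, keep the best one.
        best = max((two_means(target) for _ in range(trials)),
                   key=overall_similarity)
        clusters.extend(list(c) for c in best)
    return clusters
```

Here `trials` plays the role of X in step 3: each split is attempted several times and only the tightest result is kept.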
|
|
12
|
- Agglomerative hierarchical clustering is superior to k-means.
- When speed is important, a fast algorithm is preferred:
- Suffix tree
- Linear time complexity
- The suffix tree is built incrementally.
- O. Zamir and O. Etzioni. Web Document Clustering: A Feasibility Demonstration. In Proceedings of SIGIR, 1998.
|
|
13
|
- Authors:
- Aixin Sun
- Ee-Peng Lim
- Wee-Keong Ng
|
|
14
|
- Web Document Classification
- A good example of a large, unorganized collection of heterogeneous
digital information media.
- Managing the information:
- A search tool like Google to locate the information.
- Pre-classify the information into groups for categorical browsing.
|
|
15
|
- Study the impact of different web page features on the performance of
web page classification using SVM.
- With text features as the baseline, experiment with different combinations
of text, title, and hyperlink (anchor) features.
|
|
16
|
- Experiment features
- Text only (X)
- Text + Title (T)
- Text + Anchor Words (A)
- Text + Title + Anchor Words (TA)
- Dataset
- WebKB, containing 4159 web pages from the computer science departments of 4
universities (Cornell, Texas, Washington & Wisconsin).
- 7 categories -- student, faculty, staff, department, course, project
& other.
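One way the four feature settings (X, T, A, TA) can be realized is by tagging each token with the field it came from, so a title word and the same body word become distinct features; this encoding is an assumption of the sketch, not necessarily the paper's exact scheme:

```python
def make_features(text, title="", anchors=(), mode="X"):
    """Token list for one web page under the four feature settings.

    mode: "X" = text only, "T" = text + title, "A" = text + anchor
    words, "TA" = text + title + anchor words.  Field prefixes keep
    the fields distinct in the resulting feature space.
    """
    tokens = ["text:" + w for w in text.lower().split()]
    if "T" in mode:
        tokens += ["title:" + w for w in title.lower().split()]
    if "A" in mode:
        tokens += ["anchor:" + w for a in anchors for w in a.lower().split()]
    return tokens
```

The resulting token lists would then be turned into weighted vectors (e.g. TF-IDF) before being fed to the SVM.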
|
|
18
|
- An indication that the structural or contextual information of a document may
help in classifying or clustering documents.
- E.g. a scientific paper: Title, Abstract, Keywords, Introduction, Related
Work, Method, Experiment, Conclusion.
- Other linkage information, such as references both to and from a document,
can be a VERY useful feature as well.
- E.g. which books are purchased or borrowed together.
|
|