Data Clustering
Learning Objectives
After this unit, students should be able to
- explain the metrics used to assess the quality of clustering.
- compare and contrast various kinds of clustering.
- describe \(K\)-means clustering along with its assumptions and limitations.
- describe techniques to find the value of \(K\) in the \(K\)-means clustering algorithm.
- describe hierarchical clustering techniques.
- explain the role of various linkages.
A cluster is a group of datapoints that exhibit similar characteristics. Data clustering is an unsupervised learning technique that partitions a dataset into distinct clusters. The goal of clustering is to ensure that data points within the same cluster are more similar to each other than to those in other clusters. Unlike supervised classification, clustering does not require labeled data. Instead, it aims to identify latent patterns and structures within the data, which can subsequently be used to label datapoints. As a result, clustering is often employed as a foundational step in exploratory data analysis.
Data clustering has a wide range of applications across various fields. Some of them are listed below.
- In marketing and retail, data clustering is used for customer segmentation, which groups customers based on their purchasing patterns, demographics, etc. This helps companies design targeted marketing campaigns.
- In image and video analysis, data clustering is used as part of image segmentation and object detection.
- In social network analysis, data clustering helps identify communities of individuals who are closely connected to each other.
- In biological data analysis, data clustering is used to identify groups of genes/proteins with similar biological functions.
- In financial analytics, data clustering is used to group stocks with similar market trends.
Similarity Metrics
Clusters are defined based on the notion of similarity between datapoints. The choice of data representation directly affects the choice of similarity metric. Data representation determines the structure and format of the data (e.g., numerical, categorical, textual), while the similarity metric is the mathematical method used to assess how alike or different data points are. Numerical data can be easily represented as vectors in high-dimensional space. Categorical features can be encoded into vectors by one-hot encoding. Text data can be encoded into vectors by transforming it into count-vector or tf-idf vector format1.
Following are the most common similarity metrics used on numerical data (a short code sketch follows the list):
- Euclidean Distance. It measures the straight-line distance between two points in Euclidean space. For \(x, y \in \mathbb{R}^d\), \(d_{\text{euc}}(x, y) = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2}\).
- Manhattan Distance. It measures the distance between two points along axes at right angles, which is useful when differences along each dimension are independent. For \(x, y \in \mathbb{R}^d\), \(d_{\text{man}}(x, y) = \sum_{i=1}^{d} |x_i - y_i|\).
- Cosine Similarity. It measures the cosine of the angle between two vectors. For \(x, y \in \mathbb{R}^d\), \(\text{sim}(x, y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}\).
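The short Python sketch below shows how these three metrics can be computed with NumPy; the function names are illustrative and not taken from any particular library.

```python
# Minimal sketch of the three numerical metrics above, using NumPy.
import numpy as np

def euclidean_distance(x, y):
    # Straight-line distance: square root of the sum of squared differences.
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan_distance(x, y):
    # Distance along the axes: sum of absolute differences.
    return np.sum(np.abs(x - y))

def cosine_similarity(x, y):
    # Cosine of the angle between the two vectors.
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
print(euclidean_distance(x, y))   # ~3.74
print(manhattan_distance(x, y))   # 6.0
print(cosine_similarity(x, y))    # 1.0, since the vectors point in the same direction
```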
Following are the most common similarity metrics used on categorical data (also illustrated in the sketch after the list):
- Hamming Distance. It counts the number of positions at which two sequences/strings of the same length differ. In other words, it counts the number of substitutions required to change one string into another.
- Jaccard Similarity. It measures the similarity between two sets as the fraction of items that are members of both. For two sets \(A\) and \(B\), \(J(A, B) = \frac{|A \cap B|}{|A \cup B|}\).
- Edit Distance (Levenshtein Distance). It measures the dissimilarity between two strings as the minimum number of single-character changes (insertions, deletions or substitutions) needed to convert one string into the other.
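A similarly minimal sketch of the three categorical/string metrics is given below; the function names are illustrative and no external library is assumed.

```python
# Minimal sketch of Hamming distance, Jaccard similarity and edit distance.

def hamming(s, t):
    # Number of positions at which two equal-length strings differ.
    assert len(s) == len(t), "Hamming distance requires equal-length strings"
    return sum(a != b for a, b in zip(s, t))

def jaccard(A, B):
    # |A ∩ B| / |A ∪ B| for two sets.
    return len(A & B) / len(A | B)

def edit_distance(s, t):
    # Classic dynamic-programming Levenshtein distance.
    dp = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(len(s) + 1):
        dp[i][0] = i
    for j in range(len(t) + 1):
        dp[0][j] = j
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(s)][len(t)]

print(hamming("karolin", "kathrin"))          # 3
print(jaccard({1, 2, 3}, {2, 3, 4}))          # 0.5
print(edit_distance("kitten", "sitting"))     # 3
```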
Types of Clustering
There are different kinds of clustering, based both on the kind of data and on the technique used for clustering.
Type | Comments | Key-Algorithms |
---|---|---|
Partitional | It partitions the data into a fixed (predefined) number of clusters wherein each datapoint belongs to exactly one cluster. | \(K\)-means, \(K\)-medoids |
Hierarchical | It builds a hierarchy of clusters by either starting with individual data points and progressively merging them (agglomerative) or starting with a single cluster and progressively splitting it (divisive). | AGNES (agglomerative), DIANA (divisive) |
Overlapping | It partitions the data into a fixed (predefined) number of clusters wherein each datapoint may belong to more than one cluster at the same time. | Probabilistic clustering, Fuzzy clustering |
K-Means Clustering
\(K\)-Means Clustering is one of the most widely used partition-based clustering algorithms. The goal of \(K\)-Means is to partition a dataset into \(K\) clusters, where each data point belongs to the cluster with the nearest mean (centroid).
Formalism
Given an unlabeled dataset \(\mathcal{D} = \{x_1, x_2, x_3, ..., x_n\}\), we want to partition it into \(K\) clusters \(\{C_1, C_2, ..., C_K\}\). Let \(\{\mu_1, \mu_2, ..., \mu_K \}\) denote the set of cluster representatives, one for each cluster; these are the centroids in the case of \(K\)-means clustering.
Let \(\delta_{ij}\) denote the membership indicator: \(\delta_{ij}\) equals \(1\) if \(x_i\) is a member of \(C_j\), and \(0\) otherwise. \(K\)-means clustering seeks a partition that minimises the intra-cluster distance defined below:
\[
\text{SSE} = \sum_{i=1}^{n} \sum_{j=1}^{K} \delta_{ij} \, \lVert x_i - \mu_j \rVert^2
\]
Since Euclidean distance is used as the measure of dissimilarity, the intra-cluster distance resembles the residual error in regression. As a result, this metric is commonly referred to as the sum of squared errors (SSE).
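As a toy illustration of this objective (the variable names below are illustrative), the SSE of a candidate partition can be computed directly from the assignments and centroids:

```python
# Toy computation of the SSE objective for a fixed assignment (a sketch).
import numpy as np

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])
labels = np.array([0, 0, 1, 1])        # compact encoding of the delta_ij indicators
K = 2

# The centroid of each cluster is the mean of its member points.
centroids = np.array([X[labels == j].mean(axis=0) for j in range(K)])

# Sum of squared distances of every point to the centroid of its own cluster.
sse = np.sum((X - centroids[labels]) ** 2)
print(sse)
```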
Greedy Solution
- Initialisation. Randomly choose \(K\) initial datapoints as centroids from the data.
- Assignment. Assign each datapoint to the cluster of its nearest centroid.
- Update. Recalculate the centroids based on the new assignment.
- Repeat. Repeat the assignment and update steps until the centroids no longer change or a predefined number of iterations has been reached.
An illustrative animation of the algorithm on Wikipedia provides an insightful visualisation.
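A minimal NumPy sketch of this greedy procedure is given below; the function name, the convergence test, and the iteration cap are assumptions made for illustration.

```python
# A sketch of the greedy K-means procedure in NumPy.
import numpy as np

def k_means(X, K, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialisation: randomly choose K distinct datapoints as centroids.
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iters):
        # Assignment: each datapoint joins the cluster of its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update: recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(K)
        ])
        # Repeat until the centroids no longer change (or max_iters is reached).
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5.0])  # toy 2-D data
labels, centroids = k_means(X, K=2)
```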
Choosing \(K\)
The choice of \(K\) not only impacts the quality of the clusters but also affects the convergence of the greedy solution. The \(K\)-means algorithm always converges, but it may not converge to the global minimum. It may converge to a local minimum depending on the data as well as the initial placement of the centroids. Therefore, different initialisations can lead to different clusters for the same value of \(K\).
The elbow method is a heuristic used to determine the optimal number of clusters in \(K\)-means clustering. It is based on the observation that as the number of clusters increases, the intra-cluster distance (SSE) decreases. However, after a certain point, adding more clusters doesn't significantly reduce the SSE, and the plot of SSE against the number of clusters starts to bend like an elbow. The elbow method suggests that the optimal number of clusters is the point where this bend occurs. It is important to note that the elbow method is not always definitive, and sometimes the elbow point can be unclear or ambiguous.
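The sketch below (assuming scikit-learn and matplotlib are available) plots SSE against the number of clusters; scikit-learn exposes the SSE of a fitted model as `inertia_`.

```python
# A hedged sketch of the elbow heuristic using scikit-learn's KMeans.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5.0])  # toy 2-D data

k_values = range(1, 11)
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in k_values]

plt.plot(k_values, sse, marker="o")
plt.xlabel("Number of clusters $K$")
plt.ylabel("SSE (intra-cluster distance)")
plt.show()   # look for the K at which the curve bends like an elbow
```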
The silhouette method is another way to determine the optimal number of clusters in \(K\)-means clustering. For every datapoint, it first computes the average distance between the datapoint and all other points in its own cluster (denoted as \(a_i\)) and the average distance between the datapoint and all points in the nearest cluster that is not its own (denoted as \(b_i\)). It then computes the silhouette coefficient \(s_i\) for the \(i^{th}\) datapoint using the following formula:
\[
s_i = \frac{b_i - a_i}{\max(a_i, b_i)}
\]
The value of the silhouette coefficient lies in \([-1, 1]\): a value closer to \(1\) indicates a well-clustered datapoint, whereas a value closer to \(-1\) indicates a poorly-clustered datapoint. The average silhouette coefficient over the entire dataset gives an overall measure of how well the clusters are formed. The silhouette method chooses the value of \(K\) that produces the highest average silhouette coefficient over the entire dataset.
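A corresponding sketch (assuming scikit-learn) picks the \(K\) with the highest average silhouette coefficient; `silhouette_score` averages \(s_i\) over the whole dataset.

```python
# A hedged sketch of choosing K via the average silhouette coefficient.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5.0])  # toy 2-D data

scores = {}
for k in range(2, 11):                                     # silhouette needs K >= 2
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)                # mean s_i over the dataset

best_k = max(scores, key=scores.get)                       # K with the highest score
print(best_k)
```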
Limitations
- \(K\)-means is highly sensitive to the initial choice of centroids.
- \(K\)-means struggles to find clusters with non-globular shapes (see the sketch after this list).
- \(K\)-means struggles to identify clusters with different densities.
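To make the non-globular-shape limitation concrete, the hedged sketch below (assuming scikit-learn is available) runs \(K\)-means on two concentric rings; because the algorithm relies on distances to centroids, its partition cuts straight across both rings instead of separating them.

```python
# A hedged illustration of the non-globular-shape limitation of K-means.
from sklearn.cluster import KMeans
from sklearn.datasets import make_circles

# Two concentric rings: the natural grouping is one cluster per ring.
X, true_rings = make_circles(n_samples=500, factor=0.3, noise=0.05, random_state=0)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Comparing `labels` with `true_rings` (or plotting both) shows that the
# K-means partition does not follow the ring structure.
```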
Variants of K-means Clustering
- K-means++ starts with one randomly chosen datapoint as the initial centroid and sequentially chooses the remaining \(K-1\) centroids so that they are well spread out over the dataset (a seeding sketch follows this list). It reduces the bias induced by the initial choice of centroids and helps avoid the generation of empty clusters.
- X-means starts by running the \(K\)-means algorithm with \(K=2\). It then further splits each cluster using \(K\)-means clustering with \(K=2\), using scoring functions that evaluate the quality of the clusters to decide whether a cluster should be split further. Thus, it automatically determines the final number of clusters as part of this process.
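A minimal sketch of the K-means++ seeding step is shown below (the function name is illustrative); scikit-learn's KMeans uses this seeding by default via init="k-means++".

```python
# A sketch of K-means++ seeding: the first centroid is chosen uniformly at random;
# each subsequent centroid is sampled with probability proportional to the squared
# distance to the nearest centroid already chosen.
import numpy as np

def kmeans_pp_init(X, K, seed=0):
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]                  # first centroid: uniform choice
    for _ in range(K - 1):
        dists = np.linalg.norm(X[:, None, :] - np.array(centroids)[None, :, :], axis=2)
        d2 = dists.min(axis=1) ** 2                        # squared distance to nearest centroid
        probs = d2 / d2.sum()                              # far-away points are more likely
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5.0])  # toy 2-D data
initial_centroids = kmeans_pp_init(X, K=2)
```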
Hierarchical Clustering
Hierarchical clustering builds a hierarchy of clusters by either starting with individual data points and progressively merging them (agglomerative) or starting with a single cluster and progressively splitting it (divisive). Divisive clustering starts with a single cluster containing all datapoints, whereas agglomerative clustering starts with every datapoint in its own cluster. Each of these techniques requires a way to compute the distance between two clusters, either to find the optimal division or to find the optimal merge. The outcome of hierarchical clustering is shown using a tree-like structure called a dendrogram.
Linkage
Linkage refers to the method used to calculate the distance between clusters when merging them. The choice of linkage directly affects the shape and structure of the final clusters. The popular techniques are given in the following table.
Linkage | Distance | Characteristics | Comments |
---|---|---|---|
Single | Shortest distance between two points in two clusters | Produces long, chain-like clusters | Good for irregularly shaped clusters |
Complete | Longest distance between two points in two clusters | Produces globular, well-separated clusters | Sensitive to outliers |
Average | Average distance between two points in two clusters | Balanced compact clusters | Struggles to find clusters of different shapes |
Centroid | Distance between centroids of clusters | Compact centroid based clusters | Struggles with identification of non-convex clusters. |
Ward | Increase in the variance if the clusters were merged | Minimises variance | Balanced, well-separated, spherical clusters |

Agglomerative clustering repeatedly merges the two clusters with the smallest linkage distance until it reaches one cluster containing all datapoints. The divisive clustering algorithm selects a cluster to split into two smaller clusters, typically using a method like \(K\)-means or by calculating distances using a linkage.
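The hedged sketch below (assuming SciPy and matplotlib are available) performs agglomerative clustering, draws the dendrogram, and cuts the tree into a chosen number of clusters; the linkage name can be swapped for any of the options in the table above.

```python
# A sketch of agglomerative clustering and its dendrogram using SciPy.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

X = np.vstack([np.random.randn(30, 2), np.random.randn(30, 2) + 5.0])  # toy 2-D data

Z = linkage(X, method="ward")     # also: "single", "complete", "average", "centroid"
dendrogram(Z)                     # tree-like visualisation of the successive merges
plt.show()

labels = fcluster(Z, t=2, criterion="maxclust")  # cut the dendrogram into 2 clusters
```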
Comments
Unlike \(K\)-means, hierarchical clustering does not require the number of clusters to be predefined. It allows for exploration of different cluster levels by cutting the dendrogram at an appropriate threshold. Every run of the hierarchical clustering algorithm results in the same dendrogram. Despite these advantages, hierarchical clustering algorithms are computationally intensive for large datasets.
K-Means versus K-NN
People often confuse K-Means and K-Nearest Neighbors (K-NN) models. K-Means is an unsupervised clustering algorithm, while K-NN is a supervised classification method. Similar to K-Means, K-NN relies on a concept of similarity (or dissimilarity) between data points. It classifies a new data point by identifying the \(K\) nearest points to it and assigns a label based on the majority label among those \(K\) neighbors.
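The hedged snippet below (assuming scikit-learn) makes the distinction concrete: the clustering model never sees the labels, while the classifier cannot be fit without them.

```python
# KMeans (unsupervised) versus KNeighborsClassifier (supervised), side by side.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5.0])  # toy 2-D data
y = np.array([0] * 50 + [1] * 50)                # labels exist only for the supervised model

cluster_ids = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)  # no y involved
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)                           # y is required
print(knn.predict([[0.0, 0.0], [5.0, 5.0]]))
```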
- Please refer to the Appendix for various ways to represent textual data. ↩