
Principal Component Analysis

Learning Objectives

After this unit, students should be able to

  • reason about the need for dimensionality reduction.
  • describe Principal Component Analysis (PCA).
  • interpret the linear algebraic constructs underlying PCA.
  • explore variants of PCA.

Large datasets can be cumbersome to work with. By reducing the data size, we can improve processing speed, making tasks like analysis, modeling, and training machine learning algorithms significantly faster. Very large datasets can contain a lot of noise or irrelevant information. Reduction techniques can help us focus on the most relevant data points, potentially leading to more accurate and insightful findings. Data reduction can be achieved by reducing

  • the number of datapoints. Sampling techniques can be used to reduce the number of datapoints; a brief sketch follows this list. Sampling is commonly employed for the exploratory analysis of large datasets. One should use valid statistical techniques to ensure that the sample represents the population.
  • the number of features. Dimensionality1 reduction techniques presented in this unit can be used to reduce the number of features.
  • the number of values for features. Feature aggregation techniques presented in Unit 13 can be used to reduce the number of feature values.
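
As a hedged illustration of the first point, the snippet below draws a simple random sample with pandas. The synthetic DataFrame and the 10% sampling fraction are assumptions made only for this example.

import numpy as np
import pandas as pd

# Synthetic stand-in for a large dataset (assumption for the example)
df = pd.DataFrame(np.random.rand(100_000, 5), columns=list("abcde"))

# Draw a 10% simple random sample; fix random_state for reproducibility
sample_df = df.sample(frac=0.1, random_state=42)
print(sample_df.shape)  # (10000, 5)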

Curse of Dimensionality

Higher dimensions increase the sparsity of the data, i.e., the number of datapoints per unit volume drops. Consider the following figure, wherein we randomly sample \(100\) points in one- to four-dimensional space. For each of the four cases, the number of datapoints lying inside a \(1/2\)-unit sided hypercube is counted. We observe that the data becomes exponentially sparse as the dimensionality increases.

Figure: number of points inside a \(1/2\)-unit hypercube in one to four dimensions. Source: Eran Raviv
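
The following sketch reproduces this experiment under stated assumptions: it samples \(100\) points uniformly in the unit hypercube for one to four dimensions and counts how many fall inside a \(1/2\)-unit sided hypercube. The placement of the smaller hypercube (centred in the unit cube) is an assumption for illustration.

import numpy as np

rng = np.random.default_rng(0)

for d in range(1, 5):
    # 100 points sampled uniformly in the d-dimensional unit hypercube
    points = rng.random((100, d))
    # Count points inside a 1/2-sided hypercube centred in the unit cube
    inside = np.all((points > 0.25) & (points < 0.75), axis=1).sum()
    print(f"dimension {d}: {inside} of 100 points inside the smaller hypercube")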

Why do we care about the sparsity of the data? When data becomes sparse, the distances between data points become distorted, with all points appearing nearly equidistant from one another. This distortion of distances washes out the patterns present in the data and makes it challenging to identify outliers effectively. Thus, dimensionality reduction can improve the performance of machine learning models by focusing on the most informative features and reducing the impact of irrelevant or redundant features. This can lead to more accurate and robust predictions.

Motivation

Often recognized as a cornerstone of applied linear algebra 2, PCA is a powerful mathematical technique for reducing data complexity. It not only identifies the most informative directions within the data but also measures the information lost when condensing it. While the in-depth proof might not be essential for this course, a basic grasp of PCA's inner workings will enhance your understanding. To gain a well-rounded perspective, we'll explore some key aspects of PCA. Brushing up on linear algebra fundamentals beforehand is highly recommended.

Consider the following scatter plot of \(500\) two-dimensional datapoints. Together, they consume \(1000\) units of storage. On closer examination, we observe that all datapoints lie in the close vicinity of the \(x=y\) line.

Figure: scatter plot of the \(500\) datapoints clustered around the \(x = y\) line.

To save memory, we decide to retain only the \(x\)-coordinates of the data points. This halves the storage, and because the \(y\)-coordinate of each point is roughly equal to its \(x\)-coordinate, we lose relatively little information. Now, imagine rotating the original axes by \(45\) degrees so that the new \(x\)-axis aligns with the \(x = y\) line. In this new coordinate system, the first coordinates capture almost all of the variation, while the second coordinates are close to zero. By retaining only the first coordinates in this new system, we lose even less information. This exemplifies how a simple change of coordinate system (a.k.a. change of basis) helps reduce the complexity of the data. This concept is fundamental to PCA, which identifies the directions that capture the maximum variance of the data. While we can visually determine these directions in two-dimensional data, this is not feasible for high-dimensional data.
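
Here is a minimal numerical sketch of this change of basis, assuming synthetic data scattered around the \(x = y\) line: after a \(45\)-degree rotation, almost all of the variance lies along the first new coordinate.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic data close to the line x = y (assumption for the example)
x = rng.normal(size=500)
data = np.column_stack([x, x + rng.normal(scale=0.1, size=500)])

# Rotation matrix for a 45-degree change of basis
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
rotated = data @ R

# Variance along each new axis: almost all of it lies on the first axis
print(rotated.var(axis=0))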

Derivation

Let us assume that the dataset comprises \(n\) datapoints with \(m\) numerical features. Thus, the dataset can be represented as an \(n \times m\) matrix, \(X \in \mathbb{R}^{n \times m}\), wherein every row represents a datapoint. We also assume that the data has been centred so that every feature has mean \(0\). We want to find a change of basis matrix \(W \in \mathbb{R}^{m \times k}\) that transforms the original data to a \(k ~ (< m)\) dimensional space using the following equation.

\[ X_{\text{new}} = X W, \qquad X_{\text{new}} \in \mathbb{R}^{n \times k} \]

As explained in the earlier example, we want to find the direction of maximum variance in the data, since such a direction retains the maximum fraction of the information in the data. \(W\) contains the vectors that enable this transformation. Let \(\mathbf{w}_1\) denote a unit vector (a vector with magnitude one) that points in the direction of maximum variance. The variance of the data projected onto \(\mathbf{w}_1\) can be calculated as follows:

\[ \begin{aligned} \sigma^2_{\mathbf{w}_1} &= \frac{1}{n} \sum_i (\mathbf{w}_1 \cdot \mathbf{x}_i)^2 \\ &= \frac{1}{n} (X\mathbf{w}_1)^T (X\mathbf{w}_1) \\ &= \mathbf{w}_1^T \left( \frac{1}{n} X^T X \right) \mathbf{w}_1 \\ &= \mathbf{w}_1^T \Sigma \mathbf{w}_1 \end{aligned} \]

where \(\Sigma = \frac{1}{n} X^T X\) denotes the covariance matrix of the centred data. The goal is to maximise this variance subject to the constraint that \(\mathbf{w}_1^T\mathbf{w}_1 = 1\) (a unit vector). Using optimisation techniques from calculus3 (Lagrange multipliers), we obtain the following solution:

\[ \Sigma \mathbf{w}_1 = \lambda_1 \mathbf{w}_1 \]

This is known as an eigenvalue equation in linear algebra. Thus, the eigenvectors of the covariance matrix point in the directions of high variance in the data, and the problem of finding \(W\) reduces to finding the eigenvectors of the covariance matrix.

For a full-rank square matrix, the number of non-zero eigenvalues (and the number of associated eigenvectors) equals the number of rows of the matrix. Thus, the covariance matrix has up to \(m\) eigenvectors. Let us sort all eigenvalues (and associated eigenvectors) in descending order, so that \(\lambda_1\) denotes the largest eigenvalue and \(\mathbf{w}_1\) denotes the eigenvector associated with it. \(\mathbf{w}_1\) points in the direction of the highest variance in the data. The explained variance of the \(i^{th}\) eigenvector is the fraction of the total variance in the data that it accounts for. It is computed as follows:

\[ \text{Explained Variance}_i = \frac{\lambda_i}{\sum_j \lambda_j} \]

Practically, we keep only the top \(k ~ (< m)\) eigenvectors that together explain most of the variance in the data.
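
To connect the derivation to code, here is a hedged from-scratch sketch using NumPy: it centres the data, forms the covariance matrix, sorts the eigenpairs by eigenvalue, and keeps the top \(k\) eigenvectors. The synthetic data matrix and the choice \(k = 2\) are assumptions for the example; the library-based version follows in the next section.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic data matrix with n = 500 datapoints and m = 5 features (assumption)
X = rng.normal(size=(500, 5))
X = X - X.mean(axis=0)          # centre every feature at mean 0

Sigma = (X.T @ X) / X.shape[0]  # covariance matrix of the centred data

# Eigendecomposition; eigh is suited to symmetric matrices
eigenvalues, eigenvectors = np.linalg.eigh(Sigma)

# Sort eigenpairs in descending order of eigenvalue
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Fraction of total variance explained by each eigenvector
explained_variance = eigenvalues / eigenvalues.sum()

k = 2                            # number of retained directions (assumption)
W = eigenvectors[:, :k]          # change-of-basis matrix, m x k
X_new = X @ W                    # data in the k-dimensional space
print(explained_variance, X_new.shape)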

PCA in Python

The derivation of PCA may sound complicated, but thanks to scikit-learn the implementation isn't. It is as simple as the following:

from sklearn.decomposition import PCA

# Let X denote the data matrix (scikit-learn centres each feature internally)
# Passing a float in (0, 1) chooses k so that 90% of the variance is explained
pca = PCA(n_components=0.9)
pca.fit(X)

# Number of retained components, k
pca.n_components_

# Transforming the data into the lower-dimensional space
X_new = pca.transform(X)
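
The fitted estimator also exposes the fraction of variance explained by each retained component and the eigenvectors themselves. The lines below are a small follow-up under the same assumptions about X.

# Fraction of variance explained by each retained component
pca.explained_variance_ratio_

# The k eigenvectors (principal directions), one per row
pca.components_

# Approximate reconstruction of the data in the original space
X_approx = pca.inverse_transform(X_new)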

Comments

PCA is indeed a powerful technique to reduce the dimensionality of complex datasets. It also helps with the visualisation of high-dimensional data. But we need to be equally aware of the limitations of PCA.

  • Linearity. PCA performs a linear transformation of the data and fails to capture non-linear correlations. Consider the following image: PCA (in blue) indeed chooses the direction of maximum variance, but that direction does not capture most of the information contained in the data. Variants of PCA, such as ICA and CCA, work on this limitation; a hedged sketch of a non-linear variant follows this list.

Figure: a non-linear dataset where the direction of maximum variance (PCA, in blue) misses most of the structure.

  • Semantic Loss. Features in the original data carry semantics for users. The transformed features may lose that semantics. For instance, suppose the original data consists of features such as age, marital status, and income. The direction of maximum variance may be \(1.2\,\text{age} + 0.04\,\text{marital status} + 2.3\,\text{income}\). Thus, the transformed space loses the semantics of the original data.
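
As a sketch of a non-linear variant (not discussed in this unit), scikit-learn's KernelPCA applies PCA in a feature space induced by a kernel and can follow curved structure that linear PCA misses. The moon-shaped toy dataset and the RBF kernel parameters below are assumptions chosen purely for illustration.

from sklearn.datasets import make_moons
from sklearn.decomposition import KernelPCA, PCA

# Non-linearly structured toy data (assumption for the example)
X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)

# Linear PCA keeps only the single direction of maximum variance
X_linear = PCA(n_components=1).fit_transform(X)

# Kernel PCA with an RBF kernel can capture the curved structure
X_kernel = KernelPCA(n_components=1, kernel="rbf", gamma=15).fit_transform(X)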

  1. The dimensionality of the data equals the number of features in the data. 

  2. A tutorial on PCA 

  3. PCA