Kernel Density Estimation

Learning Objectives

After this unit, students should be able to

  • explain the need for kernel density estimation.
  • describe kernel density estimation.
  • describe the properties of non-parametric models.
  • compare and contrast parametric versus non-parametric models.


Kernel Density Estimation (KDE) is a non-parametric method for estimating the probability density function of a random variable. In the Bayesian picture of machine learning, the observed data are assumed to be a sample from an unknown data-generating distribution. KDE lets us approximate that unknown distribution from a dataset, providing insights that are crucial for data analysis, decision-making, and model building.


Formalism

The kernel in kernel density estimation is a smooth, symmetric function used to estimate the density around each data point. The parameter that controls the smoothness of the estimate is called the bandwidth of the kernel. The most commonly used kernels are the Gaussian, uniform, and Epanechnikov kernels.

Given a set of data points \(x_1, x_2, \ldots, x_n\), the estimated probability density at a new point \(x\) is given by

\[ \hat{p}(x) = \frac{1}{n} \sum_{i=1}^n K_h(x, x_i) \]

where \(K_h\) is a kernel with bandwidth \(h\).

KDE places a kernel function centered at every datapoint. The probability density estimate is calculated by summing up the contributions from all the kernels, effectively smoothing the distribution.
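To make the formula concrete, here is a minimal sketch of KDE with a Gaussian kernel, assuming only NumPy; the function names (`gaussian_kernel`, `kde`) and the two-component synthetic data are illustrative choices, not part of the original text.

```python
import numpy as np

def gaussian_kernel(u):
    # Standard Gaussian kernel: K(u) = exp(-u^2 / 2) / sqrt(2*pi)
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def kde(x, data, h):
    # p_hat(x) = (1/n) * sum_i K_h(x, x_i), with K_h(x, x_i) = K((x - x_i) / h) / h
    u = (x - data[:, None]) / h            # shape (n, len(x)) via broadcasting
    return gaussian_kernel(u).mean(axis=0) / h

rng = np.random.default_rng(0)
# Bimodal sample: a kernel is placed at each of the 200 data points.
data = np.concatenate([rng.normal(-2, 0.5, 100), rng.normal(2, 1.0, 100)])
grid = np.linspace(-6, 6, 200)
density = kde(grid, data, h=0.4)

# Sanity check: the estimated density should integrate to approximately 1.
print(density.sum() * (grid[1] - grid[0]))
```

Summing the scaled kernels at each grid point is exactly the smoothing described above: every data point contributes a small bump, and the bumps add up to the overall density estimate.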

Effect of Bandwidth

The bandwidth determines the range of influence of each data point, which directly affects the shape and accuracy of the estimated density. A kernel with a small bandwidth closely follows the training data points, producing a highly detailed, fluctuating density estimate that may reflect noise or random variation in the data. A kernel with a large bandwidth, on the other hand, smooths out the influence of individual data points; this generally oversmooths the data, blurring details and producing overly general trends. An optimal bandwidth strikes a balance between these extremes and can be estimated using cross-validation.

The following figure shows the effect of the choice of bandwidth.

Figure: effect of bandwidth on the density estimate. Source: KDE - Andrey Akinshin
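As one way to estimate the bandwidth by cross-validation, the following sketch uses scikit-learn's `KernelDensity` together with `GridSearchCV`, which scores each candidate bandwidth by the held-out log-likelihood; the candidate grid and the synthetic data are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
# Same bimodal sample as before, reshaped to (n_samples, n_features).
data = np.concatenate([rng.normal(-2, 0.5, 100), rng.normal(2, 1.0, 100)])[:, None]

# Try 20 bandwidths between ~0.1 and ~3.2; 5-fold CV picks the one that
# maximizes the average log-likelihood of the held-out folds.
search = GridSearchCV(
    KernelDensity(kernel="gaussian"),
    {"bandwidth": np.logspace(-1, 0.5, 20)},
    cv=5,
)
search.fit(data)
print("best bandwidth:", search.best_params_["bandwidth"])
```

Too small a bandwidth scores poorly because held-out points fall between the narrow bumps; too large a bandwidth scores poorly because the estimate is too flat. Cross-validation finds the middle ground automatically.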

Non-Parametric Models

Non-parametric models are statistical models that do not assume a fixed, predefined form for the underlying distribution of the data.

  • The number of parameters in the hypothesis of a parametric model does not depend on the number of training data points. For instance, linear regression always learns \(d + 1\) coefficients, where \(d\) denotes the number of predictors, regardless of the number of training data points. In contrast, the number of parameters in the hypothesis of a non-parametric model may grow with the number of training data points.

  • Parametric models undergo a training process to estimate a fixed set of parameters from the training data. Once training is complete, the model no longer needs the training dataset to make predictions. In contrast, non-parametric models don't have a distinct training phase; instead, they have hyperparameters (such as the bandwidth in KDE) that are tuned through techniques like grid search or cross-validation. Non-parametric models require access to the training dataset when making predictions, as the sketch after this list illustrates.

  • Non-parametric models tend to overfit on small training datasets. In general, they require larger datasets to produce accurate results.
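The contrast above can be seen directly in code. In this hedged sketch, a fitted linear regression keeps only \(d + 1\) numbers, while a fitted KDE retains the entire training set (scikit-learn's `KernelDensity` stores it in its internal `tree_` structure); the synthetic data and printed checks are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))                      # n = 500 points, d = 3 predictors
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=500)

# Parametric: after fitting, prediction needs only d + 1 learned numbers.
lin = LinearRegression().fit(X, y)
print(lin.coef_.size + 1)                          # 4, regardless of n

# Non-parametric: the fitted KDE must keep all 500 training points around.
kde = KernelDensity(bandwidth=0.5).fit(X)
print(np.asarray(kde.tree_.data).shape)            # (500, 3)
```

Doubling the training set leaves the linear model at 4 parameters but doubles what the KDE must store and scan at prediction time, which is precisely the parametric versus non-parametric distinction drawn above.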