Data Transformation
Learning Objectives
After this unit, students should be able to
- describe the need for feature construction.
- describe feature normalisation techniques.
- describe feature encoding techniques.
- describe feature aggregation techniques.
- describe heuristic-based feature reduction techniques.
- describe data sampling techniques.
Both data cleaning and data transformation are instrumental steps within the data pre-processing phase. However, they address distinct aspects of readying data for meaningful exploration. Imagine data as a messy room. Data cleaning is like tidying up the room - organising things, throwing away trash, and putting misplaced items back where they belong. Data transformation, on the other hand, is like renovating the room - knocking down walls, adding furniture, or completely changing the layout to suit a new purpose.
Data transformation encompasses a range of techniques. We will explore them in this unit.
Feature Construction
A large number of features affects not only the efficiency but also the effectiveness of the models trained on the data. Often, features are redundant and do not provide any new information. We can replace such redundant features by constructing new ones. New features can provide additional insight into the underlying processes or mechanisms driving the data; they may help explain the relationships between variables and enhance the interpretability of the model's predictions. An example is provided in the following table, where profit per sold unit provides more useful information if the analyst is interested in deriving profit-related insights from the data. In [Unit X], we will also see that feature construction can help linear models learn non-linear relationships in the data.
Units Sold | Selling Cost (Old Feature) | Purchasing Cost (Old Feature) | Profit/Unit (New Feature)
---|---|---|---
3 | 6 | 2 | 4
2 | 3 | 3 | 0
1 | 4 | 5 | -1
2 | 9 | 5 | 4
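As a minimal sketch, assuming the data sits in a pandas DataFrame with the illustrative column names below, the new feature can be constructed directly from the old ones:

```python
import pandas as pd

# Toy dataset mirroring the table above (column names are illustrative).
df = pd.DataFrame({
    "units_sold": [3, 2, 1, 2],
    "selling_cost": [6, 3, 4, 9],
    "purchasing_cost": [2, 3, 5, 5],
})

# Construct the new feature: profit earned per sold unit.
df["profit_per_unit"] = df["selling_cost"] - df["purchasing_cost"]

# Optionally drop the now-redundant original features.
df = df.drop(columns=["selling_cost", "purchasing_cost"])
print(df)
```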
Feature Normalisation
Feature normalisation is a preprocessing technique used to rescale numerical features to a standard range or distribution. The goal of normalisation is to ensure that all features have similar scales. There are various methods for normalisation.
- Min-max normalisation. This method rescales the data to a fixed range, typically between \(0\) and \(1\). For a given feature \(X\), the min-max normalised value is computed as follows:
\[X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}}\]
where \(X_{max}\) and \(X_{min}\) are the maximum and minimum permissible values of the feature. It is evident that min-max normalisation assumes that the maximum and minimum values are known prior to the analysis.
- Z-normalisation (Standardisation). This method transforms the data to have a mean of \(0\) and a standard deviation of \(1\). For a given feature \(X\), the \(Z\)-value (a.k.a. standard score) is computed as follows:
\[Z = \frac{X - \mu}{\sigma}\]
where \(\mu\) and \(\sigma\) are the mean and standard deviation of the feature. Unlike min-max normalisation, standardisation does not restrict feature values to a specified range after the transformation. Standardisation is also less sensitive to outliers than min-max normalisation, because it uses the mean and standard deviation rather than the extreme values of the feature.
Normalising features ensures that all features have a similar scale or range of values. This is important for algorithms that are sensitive to the scale of features, such as the gradient descent-based optimisation algorithms used in machine learning models: normalisation prevents features with larger magnitudes from dominating the learning process. It also improves the interpretability of the model by ensuring that the coefficients or weights associated with each feature are on a comparable scale, which makes it easier to interpret the relative importance of each feature in the model.
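A minimal sketch of both methods using NumPy (the array values are illustrative):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])  # illustrative feature values

# Min-max normalisation: rescale to the [0, 1] range.
x_minmax = (x - x.min()) / (x.max() - x.min())

# Z-normalisation (standardisation): zero mean, unit standard deviation.
x_z = (x - x.mean()) / x.std()

print(x_minmax)  # [0.   0.25 0.5  0.75 1.  ]
print(x_z)       # mean of x_z is 0, standard deviation is 1
```

In practice, ready-made implementations such as scikit-learn's MinMaxScaler and StandardScaler are typically used instead of hand-written formulas.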
Feature Encoding
Many machine learning algorithms require numerical input, but datasets often contain categorical features. Encoding categorical features converts them into a numerical format, allowing algorithms to process them effectively. Ordinal encoding assigns an integer value to each value of the feature; if the original feature is not ordinal, the encoded feature may induce an artificial ordering bias in the data. One-hot encoding creates a new binary feature for every value of the original feature and assigns a value of \(1\) to indicate the presence of that value (and \(0\) otherwise). One-hot encoding is widely used in machine learning because it preserves the categorical nature of variables without assuming any ordinal relationship between categories. Dummy variable encoding also creates binary columns for each category, but it drops one of the categories to avoid multicollinearity. Such encoding is particularly useful if the data is to be used to train a linear model; we will learn more about it in [Unit XX]. The following figure shows a small example for each of these types of feature encoding.
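Complementing the figure, a rough pandas sketch (the colour values are illustrative) that produces the three encodings is:

```python
import pandas as pd

df = pd.DataFrame({"colour": ["red", "green", "blue", "green"]})

# Ordinal encoding: map each category to an integer code.
df["colour_ordinal"] = df["colour"].astype("category").cat.codes

# One-hot encoding: one binary column per category.
one_hot = pd.get_dummies(df["colour"], prefix="colour")

# Dummy variable encoding: drop one category to avoid multicollinearity.
dummies = pd.get_dummies(df["colour"], prefix="colour", drop_first=True)

print(pd.concat([df, one_hot], axis=1))
print(dummies)
```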
Feature Aggregation
Feature aggregation refers to adjusting the granularity of the values of features. For categorical features, one can merge several categories into a single category or split one category into its sub-categories. For numerical features, binning is a popular technique for reducing the granularity of continuous data. To perform binning, the data is sorted and then divided into bins of equal width or equal frequency, and all datapoints within a bin are replaced by a representative value from that bin. An example of binning is shown in the following figure.

One may look at binning as a way to transform a numerical feature into a categorical feature.
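A minimal sketch of binning with pandas (the price values are illustrative):

```python
import pandas as pd

prices = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

# Equal-width binning: three bins spanning equal value ranges.
width_bins = pd.cut(prices, bins=3)

# Equal-frequency binning: three bins with (roughly) equal numbers of points.
freq_bins = pd.qcut(prices, q=3)

# Replace each value with a representative of its bin (here, the bin mean).
smoothed = prices.groupby(freq_bins).transform("mean")

print(pd.DataFrame({"price": prices, "equal_width": width_bins,
                    "equal_freq": freq_bins, "smoothed": smoothed}))
```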
Heuristic Based Feature Reduction
Let us illustrate heuristic-based methods using the following toy dataset with four features.
F1 | F2 | F3 | F4
---|---|---|---
3 | 5 | 3 | 6
  | 5 | 4 | 8
  | 5 | 6 | 13
1 | 5 | 6 | 10
  | 6 | 5 | 10
  | 5 | 2 | 4
  | 5 | 2 | 3
The missing value ratio method removes features that have a large proportion of missing values (such as \(F1\) in the example). The low variance filter removes features whose values barely change and hence do not provide any meaningful information (such as \(F2\) in the example). The high correlation filter removes one feature from each pair of almost perfectly correlated features (such as \(F3\) and \(F4\) in the example). All of these techniques require the analyst to tune the threshold values and qualitatively assess improvements in the model.
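A hedged sketch of the three filters applied to the toy dataset above, using pandas (the 50%, 0.5, and 0.9 thresholds are illustrative and would need tuning in practice):

```python
import numpy as np
import pandas as pd

# Toy dataset from the table above; NaN marks missing values.
df = pd.DataFrame({
    "F1": [3, np.nan, np.nan, 1, np.nan, np.nan, np.nan],
    "F2": [5, 5, 5, 5, 6, 5, 5],
    "F3": [3, 4, 6, 6, 5, 2, 2],
    "F4": [6, 8, 13, 10, 10, 4, 3],
})

# Missing value ratio: drop features with more than 50% missing values (drops F1).
missing_ratio = df.isna().mean()
df = df.drop(columns=missing_ratio[missing_ratio > 0.5].index)

# Low variance filter: drop features whose variance is below a threshold (drops F2).
variances = df.var()
df = df.drop(columns=variances[variances < 0.5].index)

# High correlation filter: drop one feature of any pair with |correlation| > 0.9 (drops F4).
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
df = df.drop(columns=to_drop)

print(df)  # only F3 remains
```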
Data Sampling
A feature is said to be imbalanced if there is a severe skew in the distribution of its values. For instance, consider a dataset with \(500\) images of cats and \(4\) images of dogs. Such a dataset has a strong bias towards the cat class. This bias in the training dataset can influence machine learning algorithms and may lead some of them to ignore the minority class altogether. To avoid this, random sampling techniques can be used.
- Random Undersampling. This technique randomly removes examples from the majority class to balance it with the minority class. This approach is suitable for imbalanced datasets that have a sufficient number of examples in the minority class to fit a model. A limitation of undersampling is that the (randomly) deleted examples from the majority class may contain patterns that are important, or perhaps critical, for fitting a robust decision boundary.
- Random Oversampling. This technique randomly duplicates examples from the minority class in the dataset. The duplication may lead to overfitting on the minority class.
Both of these are heuristic-based methods for balancing the dataset. There are more advanced techniques, such as the Synthetic Minority Oversampling Technique (SMOTE), which synthetically creates new datapoints for the minority class by interpolating between existing minority-class datapoints. There is a Python library named imbalanced-learn dedicated to handling imbalanced datasets.
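A brief sketch using imbalanced-learn (the two-dimensional toy data below is illustrative):

```python
import numpy as np
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Illustrative imbalanced dataset: 50 majority-class and 5 minority-class examples.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(50, 2)), rng.normal(3, 1, size=(5, 2))])
y = np.array([0] * 50 + [1] * 5)

# Random undersampling: discard majority-class examples.
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)

# Random oversampling: duplicate minority-class examples.
X_over, y_over = RandomOverSampler(random_state=0).fit_resample(X, y)

# SMOTE: synthesise new minority-class examples by interpolation.
X_smote, y_smote = SMOTE(random_state=0, k_neighbors=3).fit_resample(X, y)

print(np.bincount(y_under), np.bincount(y_over), np.bincount(y_smote))
```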