
Data Cleaning

Learning Objectives

After this unit, students should be able to

  • explain the need to clean data.
  • describe the metrics of data quality.
  • identify and treat missing values in data.
  • detect noise in the data.

In data analytics, the principle of garbage in, garbage out plays an important role: analyses conducted on messy, unprocessed data lead to unreliable results. Data quality is paramount for reliable insights and informed decision-making across all industries. According to a Forbes survey, data scientists spend about \(60\%\) of their time simply cleaning and organising data. The following are a few reasons why business entities emphasise data cleaning in their analytics projects.

  • Ensures Data Integrity: Clean data fosters trust in the analytical process. By removing errors and inconsistencies, data cleaning guarantees the foundation of your analysis is accurate and reliable.
  • Improves Analytical Accuracy: Flawed data leads to skewed results. Cleaning ensures the analysis reflects the true representation of the data, preventing misleading conclusions and fostering more accurate insights.
  • Optimises Resource Allocation: Data is a valuable asset. Clean data empowers businesses to optimise resource allocation by highlighting genuine trends and patterns, leading to more effective strategies and resource utilisation.
  • Enhances Decision-Making: Business decisions are only as good as the data informing them. Clean data empowers leaders to make well-informed choices with greater confidence, mitigating risks associated with inaccurate information.

Quality Assessment Metrics

  • Accuracy is affected by the presence of erroneous data. Erroneous data, such as typos or readings from faulty or uncalibrated sensors, introduce bias. This can lead to misleading conclusions and, ultimately, poor decision-making.
  • Completeness is affected by missing features or values. Models trained on incomplete data suffer from poor effectiveness and a lack of insight.
  • Consistency is affected by disparities that arise when datasets are aggregated. Inconsistent data, such as varying units of measurement, yield ineffective models if not preprocessed.
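
These metrics can be checked programmatically before any analysis begins. The pandas sketch below is a minimal example of such a quality report; the DataFrame contents and column names are assumptions for illustration.

```python
import pandas as pd
import numpy as np

# Toy dataset; the values and column names are assumptions for illustration.
df = pd.DataFrame({
    "temperature": [21.5, 22.0, -999.0, 21.8, np.nan],  # -999 is a sensor error code
    "unit":        ["C", "C", "C", "F", "C"],            # mixed units
})

# Completeness: share of non-missing values per column.
print("Completeness:\n", 1 - df.isna().mean())

# Consistency: a measurement column should use a single unit.
print("Units found:", df["unit"].unique())

# Accuracy (rough proxy): flag readings outside a plausible range.
print("Implausible readings:\n", df[df["temperature"] < -100])
```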

Missing Values

There are several reasons why data might be missing.

  • Human Error. During data collection or entry, mistakes can happen: people might forget to record values or enter them incorrectly, and equipment malfunctions can also occur.
  • Study Design. The design of the study might lead to missing data. For instance, participants might refuse to answer certain questions, or some data points might not be applicable to every subject.
  • Technical Issues. Data can get corrupted or lost due to improper storage or technical glitches.

How to handle missing values in the data? The simplest approach is to eliminate the data points that contain them. However, this method is feasible only when the missing data constitute a small portion of the overall dataset. Alternatively, rather than removing data points, one can drop attributes with a significant number of missing values. More advanced strategies, such as data imputation, can also be employed to substitute missing values. All three options are sketched below.
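
As a minimal sketch of these options in pandas (the toy DataFrame and the 50% missing-value cutoff are assumptions for illustration):

```python
import pandas as pd
import numpy as np

# Toy data with missing entries; the values are assumptions for illustration.
df = pd.DataFrame({
    "age":    [25, np.nan, 31, 40, np.nan],
    "income": [50_000, 62_000, np.nan, 58_000, 61_000],
    "notes":  [np.nan, np.nan, np.nan, "vip", np.nan],
})

# Option 1: eliminate data points (rows) containing missing values.
rows_dropped = df.dropna()

# Option 2: drop attributes (columns) with a significant share of missing values.
cols_kept = df.loc[:, df.isna().mean() <= 0.5]  # assumed 50% cutoff

# Option 3: simple imputation - fill numeric gaps with the column median.
imputed = df.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].median())
```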

Noisy Data

Noise is introduced into data at both the collection and processing phases. The most common causes are listed below:

  • Measurement errors are inaccuracies introduced during the data collection process. They may arise from human error, equipment malfunction, or environmental factors.
  • Sampling error is introduced by the process through which a sample is drawn from the population.
  • Outliers are data points that deviate significantly from the rest of the dataset. They may be caused by measurement errors, anomalies, or rare events.
  • Missing values can also be considered noise in the data.

How to remove noise from the data? Removing noise is not a straightforward task for analysts. Descriptive statistical tools can help assess the noise within the data; however, validating its removal also requires input from domain experts. One such tool is sketched below.
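
As one example of a descriptive statistical tool, the snippet below flags candidate outliers with the interquartile-range (IQR) rule; the sample readings are assumptions for illustration, and the 1.5 multiplier is the conventional choice.

```python
import pandas as pd

# Sample sensor readings; the values are assumptions for illustration.
readings = pd.Series([9.8, 10.1, 10.0, 9.9, 10.2, 25.0, 10.0, 9.7])

# IQR rule: flag points beyond 1.5 * IQR from the quartiles.
q1, q3 = readings.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

candidates = readings[(readings < lower) | (readings > upper)]
print("Candidate outliers (to be confirmed with a domain expert):")
print(candidates)
```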

Outliers can be useful!

Outliers are data points that deviate significantly from the rest of the dataset, so they can have a substantial impact on models trained on it. Sometimes, however, outliers play the role of signal rather than noise. If we are analysing fraud or training an anomaly detector, then outliers are precisely the targets of such models. In these cases, we do not remove the outliers from the dataset, as the sketch below illustrates.
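
As a minimal sketch of treating outliers as the target, the snippet below uses scikit-learn's IsolationForest to flag anomalous transaction amounts; the synthetic data and the 5% contamination rate are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic transactions: mostly typical amounts plus a few extreme ones
# (the data and the 5% contamination rate are assumptions for illustration).
rng = np.random.default_rng(42)
amounts = np.concatenate([rng.normal(50, 10, 95), [500, 650, 800, 720, 900]])
X = amounts.reshape(-1, 1)

# Fit the detector; fit_predict() returns -1 for outliers and 1 for inliers.
detector = IsolationForest(contamination=0.05, random_state=42)
labels = detector.fit_predict(X)

# Here the outliers are the signal: keep and inspect them, do not discard them.
print("Flagged amounts:", amounts[labels == -1])
```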