Kinds of Data and Analytics Problems

Learning Objectives

After this unit, students should be able to

classify data as nominal, ordinal, interval or ratio.
classify data according to its representation (structured, semi-structured or unstructured).
distinguish between predictors and responses.
describe the difference between labeled and unlabeled datasets.
classify an analytics problem as regression, classification or pattern discovery.

Kinds of Data

Statisticians classify data into one of four categories based on its level of measurement. The categories are tabulated as follows:

	Nominal	Ordinal	Interval	Ratio
What is it?	Values are categorical without any quantitative connotation.	Values are categorical with a comparative connotation.	Values are numerical and equally spaced from one another, but zero is an arbitrary reference point.	Values are numerical and have a well-defined notion of zero.
Supported Operations.	Equality (\(=, \neq\)).	Equality (\(=, \neq\)) and relational (\(<, >, \leq, \geq\))	Equality (\(=, \neq\)), relational (\(<, >, \leq, \geq\)), addition (\(+\)) and subtraction (\(-\))	Equality (\(=, \neq\)), relational (\(<, >, \leq, \geq\)) and all arithmetic (\(+, -, \times, /\))
Examples.	Names of colours, Genders	Grades, Any response on a Likert scale	Temperature measured in Celsius/Fahrenheit (zero is relative)	Temperature measured in Kelvin, Weight of a person, Account balance

Nominal and ordinal data are referred to as categorical data, whereas interval and ratio data are referred to as numerical data.
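
The supported operations in the table can be illustrated with a small sketch. The grading scale and the temperature readings below are made-up values chosen only for illustration; the point is that comparing ordinal values is meaningful, subtracting interval values is meaningful, but dividing them only becomes meaningful once the scale has a true zero.

```python
# Ordinal: order is meaningful, but arithmetic on the values is not.
GRADE_ORDER = ["F", "D", "C", "B", "A"]  # hypothetical scale, worst to best


def grade_rank(grade: str) -> int:
    """Position of a grade in the ordering; only comparisons of ranks are meaningful."""
    return GRADE_ORDER.index(grade)


a_beats_c = grade_rank("A") > grade_rank("C")  # relational comparison is valid

# Interval: differences are meaningful, ratios are not.
celsius_today, celsius_yesterday = 20.0, 10.0  # made-up readings
difference = celsius_today - celsius_yesterday  # "10 degrees warmer" is a valid statement


# Ratio: Kelvin has a well-defined zero, so ratios become meaningful.
def to_kelvin(celsius: float) -> float:
    return celsius + 273.15


naive_ratio = celsius_today / celsius_yesterday  # 2.0, yet today is not "twice as hot"
kelvin_ratio = to_kelvin(celsius_today) / to_kelvin(celsius_yesterday)  # roughly 1.035

print(a_beats_c, difference, naive_ratio, round(kelvin_ratio, 3))
```

The naive ratio of the Celsius readings suggests a doubling, while the same two temperatures expressed in Kelvin differ by only a few percent.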

Computer scientists classify data based on how it is stored. The categories are as follows.

	(Well-) Structured Data	Semi-Structured Data	Unstructured Data
What is it?	Data adheres to a predefined data model. Each data point has the same fixed set of attributes.	Some parts of the data have a rigid structure and some do not.	Data does not have any fixed data model.
Popular Formats.	`.csv`, `.xlsx`, `.sql`	`.json`, `.xml`	Text, images, video, audio data
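
As a rough illustration of the first two columns, the sketch below parses a tiny CSV snippet and a tiny JSON snippet using only the Python standard library. The records themselves are invented for this example. In the CSV case every row shares one set of attributes, whereas the JSON records agree on some fields and differ on others.

```python
import csv
import io
import json

# Structured: every row has the same fixed set of columns.
csv_text = "name,age\nAsha,34\nBen,29\n"  # made-up data
rows = list(csv.DictReader(io.StringIO(csv_text)))
structured_keys = {frozenset(row.keys()) for row in rows}

# Semi-structured: records share some fields but may differ in others.
json_text = '[{"name": "Asha", "age": 34}, {"name": "Ben", "phones": ["555-0100"]}]'
records = json.loads(json_text)
semi_structured_keys = {frozenset(record.keys()) for record in records}

# Number of distinct attribute sets found in each dataset.
print(len(structured_keys), len(semi_structured_keys))
```

The CSV rows yield a single attribute set, while the JSON records yield two, which is what makes the latter only partially structured.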

Each data point within a dataset possesses a set of attributes (or features). In structured data, these attributes are defined by the data model, while in unstructured data, further processing is required to discover them. For instance, in a census dataset, each data point represents an individual with attributes such as age, education, marital status, and salary. These attributes can be categorized into two groups: predictors and responses. A response is an attribute that an analyst wants to predict or forecast, whereas a predictor is an attribute that facilitates this task. This categorization is not intrinsic to the dataset; rather, it is defined by the analytics problem at hand. For example, if the objective is to predict an individual's salary based on their age and education, then age and education serve as predictors, while salary acts as the response. A dataset whose attributes are classified as predictors and responses is called a labeled dataset; otherwise, it is called an unlabeled dataset.
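
The salary example can be sketched in a few lines of Python. The two census records are invented for illustration, and the names `PREDICTORS`, `RESPONSE`, `X` and `y` are conventions assumed here rather than anything prescribed by the dataset. Changing the analytics problem would simply mean choosing different attribute names for the two groups.

```python
# A toy census dataset; the values are made up for illustration.
census = [
    {"age": 34, "education": "Bachelors", "marital_status": "Married", "salary": 52000},
    {"age": 29, "education": "Masters", "marital_status": "Single", "salary": 61000},
]

# The analytics problem, not the dataset, decides which attributes play which role.
PREDICTORS = ["age", "education"]
RESPONSE = "salary"

# X holds the predictor values for each individual, y the corresponding response.
X = [[person[name] for name in PREDICTORS] for person in census]
y = [person[RESPONSE] for person in census]

print(X)
print(y)
```

Marital status is present in the data but plays neither role in this particular problem.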

Kinds of Predictive Analytics Problems

The majority of predictive analytics problems fall into one of three categories: regression, classification, and clustering. We will learn about each of these categories and their models later in the course.

	Regression	Classification	Clustering (Pattern Discovery)
Setup.	Labeled dataset with numerical responses.	Labeled dataset with categorical responses.	Unlabeled dataset.
Popular models.	Linear regression, Neural networks, Bayesian Regression	Logistic regression, Naive Bayes, Neural networks	\(K\)-Means clustering, Autoencoders, Latent Dirichlet Allocation
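
The "Setup" row can be read as a simple decision rule, sketched below. The function `kind_of_problem` is a hypothetical helper written for this illustration, not part of any library. It assumes that a missing response (`None`) means the dataset is unlabeled and that numerical responses are stored as Python numbers, so categories encoded as integers would be mistaken for a regression target.

```python
from numbers import Number
from typing import Optional, Sequence


def kind_of_problem(responses: Optional[Sequence[object]]) -> str:
    """Guess the problem category from the response values (hypothetical helper)."""
    if responses is None:
        # Unlabeled dataset: no response to predict.
        return "clustering"
    # bool is a subclass of int in Python, so treat True/False as categorical.
    if all(isinstance(r, Number) and not isinstance(r, bool) for r in responses):
        # Labeled dataset with numerical responses.
        return "regression"
    # Labeled dataset with categorical responses.
    return "classification"


print(kind_of_problem([52000, 61000]))      # salaries
print(kind_of_problem(["spam", "ham"]))     # class labels
print(kind_of_problem(None))                # no response available
```

In practice the analyst, not the data type, makes this call, since the level of measurement of the response determines whether it is numerical or categorical.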