
Introduction to Machine Learning

Learning Objectives

After this unit, students should be able to

  • describe the idea behind machine learning.
  • identify the hypothesis set and goodness-of-fit measure of a machine learning model.
  • compare and contrast various kinds of learning.

Let us consider two tasks.

For the first task, you are given the following functional forms.

\[ \begin{aligned} f_1(x) &= \sin{x} + \cos^2{x} \\ f_2(x) &= \log{\sin{x}} + 3\sqrt{x} \\ f_3(x) &= e^{x^3} + \sqrt{\tan{4x}} - x + 2 \end{aligned} \]

Your task is to compute the value of each of these functions at a specified value of \(x\), say \(x = 2.5\). Isn't this a Programming 101 task? A straightforward program in any preferred language can evaluate these functions. Even for functions of far greater complexity, as long as an analytical expression is available, computation is not a daunting task.
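As a minimal sketch, here is the first task in Python. The helper names `f1`, `f2`, `f3` are ours; note that \(f_2\) and \(f_3\) are only real-valued where \(\sin{x} > 0\) and \(\tan{4x} \geq 0\), which holds at \(x = 2.5\).

```python
# Direct evaluation of the three functions at x = 2.5,
# using only the standard library.
import math

def f1(x):
    return math.sin(x) + math.cos(x) ** 2

def f2(x):
    # Defined only where sin(x) > 0.
    return math.log(math.sin(x)) + 3 * math.sqrt(x)

def f3(x):
    # Defined only where tan(4x) >= 0.
    return math.exp(x ** 3) + math.sqrt(math.tan(4 * x)) - x + 2

x = 2.5
for name, f in [("f1", f1), ("f2", f2), ("f3", f3)]:
    print(f"{name}({x}) = {f(x):.4f}")
```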

For the second task, you are given pairs \((x, y = f(x))\) for various values of \(x\). A scatter plot of one such example is shown in the figure below. Your task is to estimate the analytical form of the function \(f\). Is this as straightforward as the first task? There are many analytical forms that would approximately fit the given data. The fitted-functions panel shows linear, quadratic, and exponential functions fitted to the data (a code sketch of such a fit follows the figure). Which of these is the answer to the second task? How can we definitively determine whether a given answer is accurate?

Figure: scatter plot of the data (left) and linear, quadratic, and exponential functions fitted to it (right).
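The following sketch shows how such fits could be produced. The dataset from the figure is not reproduced here, so synthetic data (with an assumed linear ground truth) stands in for it.

```python
# Fit the three candidate families to (x, y) pairs and compare RMSE.
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)
x = np.linspace(0.5, 3.0, 30)
y = 1.5 * x + 0.7 + rng.normal(scale=0.3, size=x.size)  # synthetic stand-in data

linear = lambda x, a, b: a * x + b
quadratic = lambda x, a, b, c: a * x ** 2 + b * x + c
exponential = lambda x, a: np.exp(a * x)

for name, h in [("linear", linear), ("quadratic", quadratic),
                ("exponential", exponential)]:
    params, _ = curve_fit(h, x, y)                      # least-squares fit
    rmse = np.sqrt(np.mean((h(x, *params) - y) ** 2))
    print(f"{name}: params={np.round(params, 3)}, RMSE={rmse:.3f}")
```

All three families produce *some* fit; the question the rest of this unit answers is how to compare them in a principled way.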

Definition: Machine Learning

Given a dataset, machine learning is the discovery of a function, from a set of possible functions, that accurately represents the patterns in the dataset. The search space is vast, since there are uncountably many real-valued functions.

Notation

Before diving into the formal definition of machine learning, we'll establish a common set of symbols and notations that we'll consistently use throughout this course.

  • A dataset is denoted by \(D\). Individual datapoints are denoted \(d_1, d_2, \ldots\)
  • Unless otherwise specified, the dataset consists of \(n\) datapoints.
  • Each datapoint \(d_i\) is an \(m\)-vector, where \(m\) denotes the number of features.
  • A labeled datapoint \(d_i\) is represented as a pair \((x_i, y_i)\), where \(y_i\) is the label of the datapoint \(x_i\).
  • A dataset of labeled datapoints is called a labeled dataset; otherwise, it is called an unlabeled dataset.

Defining Machine Learning

Hypothesis Set

The hypothesis set, denoted \(\mathcal{H}\), is the set of all candidate functions that could map the dataset to the desired output. Examples of hypothesis sets representing linear, quadratic, and exponential functions are shown below.

\[ \begin{aligned} \mathcal{H}_{linear} &= \{ax + b | a, b \in \mathbb{R} \} \\ \mathcal{H}_{quadratic} &= \{ax^2 + bx + c | a, b, c \in \mathbb{R} \} \\ \mathcal{H}_{exponential} &= \{e^{ax} | a \in \mathbb{R} \} \\ \end{aligned} \]
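In code, each hypothesis set can be mirrored as a family of functions indexed by its parameters. A minimal sketch (helper names like `linear_h` are ours, not standard):

```python
# Each hypothesis set becomes a factory that maps a choice of
# parameters to one concrete hypothesis h from the set.
import math

def linear_h(a, b):
    return lambda x: a * x + b

def quadratic_h(a, b, c):
    return lambda x: a * x ** 2 + b * x + c

def exponential_h(a):
    return lambda x: math.exp(a * x)

h = linear_h(2.0, -1.0)   # one element of H_linear
print(h(3.0))             # prints 5.0
```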

Each of these sets is enormous. We need a way to quantify which function fits the data better than another.

Goodness of Fit

Goodness of fit quantifies how well a particular function fits the observed data. For a given hypothesis \(h\) and dataset \(D\), it is denoted \(\ell_D(h)\). Some popular goodness-of-fit measures are given below. A goodness-of-fit function is also called a loss function when it quantifies the degree of "unfitness" of the specified hypothesis to the dataset.

\[ \begin{aligned} \ell_D(h) &= \sqrt{\frac{\sum_i (h(x_i) - y_i)^2}{n}} &&\text{Root Mean Squared Error (RMSE)} \\ \ell_D(h) &= \frac{\sum_i |h(x_i) - y_i|}{n} &&\text{Mean Absolute Error (MAE)} \\ \ell_D(h) &= \frac{\sum_i \mathbb{I}(h(x_i) \neq y_i)}{n} &&\text{Misclassification Error} \end{aligned} \]
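These measures translate directly into code. A minimal sketch, written against a hypothesis `h` and paired feature/label lists `xs`, `ys` (names are ours):

```python
# The three goodness-of-fit measures from the display above.
import math

def rmse(h, xs, ys):
    return math.sqrt(sum((h(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs))

def mae(h, xs, ys):
    return sum(abs(h(x) - y) for x, y in zip(xs, ys)) / len(xs)

def misclassification(h, xs, ys):
    # Indicator I(h(x_i) != y_i), summed over the dataset and averaged.
    return sum(h(x) != y for x, y in zip(xs, ys)) / len(xs)
```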

Machine Learning as Loss Minimisation

For a given dataset \(D\), hypothesis set \(\mathcal{H}\), and loss function \(\ell_D\), machine learning can be defined as the procedure of finding the hypothesis \(\hat{h} \in \mathcal{H}\) that minimises the loss on the given dataset. Mathematically, it can be written as the following optimisation problem:

\[ \hat{h} = \underset{h \in \mathcal{H}}{\arg\min} ~~ \ell_D(h) \]
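The crudest way to instantiate this \(\arg\min\) is an exhaustive search over a discretised hypothesis set. A minimal sketch over \(\mathcal{H}_{linear}\) with a toy dataset (assumed for illustration); practical learners replace the grid with calculus or gradient descent:

```python
# Grid-search the parameters (a, b) of H_linear and keep the
# hypothesis with the lowest RMSE on the dataset.
import math
import itertools

xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.9, 3.1, 4.8, 7.2]   # toy dataset, assumed for illustration

def rmse(h):
    return math.sqrt(sum((h(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs))

grid = [i / 10 for i in range(-50, 51)]   # a, b each range over [-5, 5]
a_hat, b_hat = min(itertools.product(grid, grid),
                   key=lambda ab: rmse(lambda x: ab[0] * x + ab[1]))
print(a_hat, b_hat, rmse(lambda x: a_hat * x + b_hat))
```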

Types of Learning

  • Supervised Learning. Supervised learning models require a labeled dataset. These models find a hypothesis that maps each datapoint to its label. If the labels are numeric, the model is called a regression model; for categorical labels, it is called a classification model. Examples include linear regression, logistic regression, and decision trees.

  • Unsupervised Learning. Unsupervised learning models do not require labels, so they work on unlabeled as well as labeled datasets. These models aim to discover clusters that reveal latent patterns in the data. Examples include \(K\)-means clustering and Latent Dirichlet Allocation.

  • Parametric Learning. In parametric learning models, the hypothesis function takes a parametric form. These models assume that the latent patterns in the dataset can be captured by a fixed set of parameters. For instance, a linear hypothesis of the form \(h(x) = ax + b\) is fully specified by the two parameters \(a\) and \(b\).

  • Non-parametric Learning. In non-parametric learning models, the hypothesis function does not take a fixed parametric form. These models assume that the latent patterns in the dataset cannot be captured by a fixed set of parameters. For instance, a non-parametric hypothesis may assign a different hypothesis function to describe the behaviour of each datapoint, so the aggregate hypothesis may take the form \(h(x) = \frac{1}{n} \sum_i h_i(x)\). Thus the number of parameters of a non-parametric model grows in proportion to the number of datapoints. (A sketch contrasting the two styles follows this list.)
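A minimal sketch of the parametric/non-parametric contrast on the same toy data (assumed for illustration): the parametric model keeps exactly two numbers, while the non-parametric one keeps every datapoint and averages the \(k\) nearest labels at query time.

```python
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.2, 1.1, 1.9, 3.2, 3.9]   # toy dataset, assumed for illustration

# Parametric: least-squares line, fully described by (a, b).
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
     / sum((x - mx) ** 2 for x in xs))
b = my - a * mx

# Non-parametric: k-nearest-neighbour average. The "parameters" are
# the stored datapoints themselves, so they grow with the dataset.
def knn(x, k=2):
    nearest = sorted(zip(xs, ys), key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

print(a * 2.5 + b, knn(2.5))   # both predict f(2.5), by different routes
```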