
Decision Trees

Learning Objectives

After this unit, students should be able to

  • describe the decision tree model.
  • compute purity metrics.
  • build decision trees.
  • analyse the results of decision trees.
  • handle numerical attributes.
  • describe the importance of pruning decision trees.

A decision tree is a hierarchical structure used in decision-making and machine learning. Unlike many other classification models, its intuitive structure makes it a highly interpretable model. Decision trees have been widely used in business studies, especially in operations research, to learn patterns in data and derive appropriate strategies. Consider the toy dataset that we used in Unit 30, replicated below for convenience.

Age  Edu       Marital  Income  Credit
23   Masters   Single   75k     Yes
35   Bachelor  Married  50k     No
26   Masters   Single   70k     Yes
41   PhD       Single   95k     Yes
18   Bachelor  Single   40k     No
55   Masters   Married  85k     No
30   Bachelor  Single   60k     No
35   PhD       Married  60k     Yes
28   PhD       Married  65k     Yes

The following are two (of many) possible decision trees for the toy dataset.

Tree 1

Tree 2

Nodes in every decision tree can be classified into one of three kinds. The root node, the topmost node in the tree, represents the entire dataset or problem. Each internal (or Decision) node represents a decision or test based on a feature (like a "yes/no" question). Each terminal (or Leaf) node provides the final outcome or prediction. Every internal node represents a decision that partitions the dataset into at least two subsets based on the result of the decision. Thus, the general process to build the tree can be described as follows:

  1. Start at the root node with all datapoints in the dataset.
  2. If all datapoints have the same (or similar) label, you have finished processing the present node.
  3. Otherwise, formulate a decision to split the dataset into smaller subsets.
  4. Recursively apply the procedure to every node created by the split.

One can manually apply this procedure to a small dataset to draw decision trees as shown above. Choosing an appropriate feature in the dataset to formulate a decision is not a trivial task. We require a quantitative metric to help us make this choice.
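
The recursive procedure can also be sketched directly in Python. The sketch below is deliberately generic: the splitting rule is passed in as a `choose_split` callable (a hypothetical helper, not something defined in this unit), precisely because choosing that rule is the question the following sections address.

```python
from collections import Counter

def build_tree(points, labels, choose_split):
    """Generic recursive tree builder (illustrative sketch only).

    `choose_split` is supplied by the caller: given the datapoints and
    labels at a node, it should return a boolean test over a single
    datapoint, or None if no useful decision can be formulated.
    """
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]

    # Step 2: if every datapoint carries the same label, stop at a leaf.
    if majority_count == len(labels):
        return {"leaf": True, "prediction": majority_label}

    # Step 3: formulate a decision to split the data at this node.
    test = choose_split(points, labels)
    if test is None:
        return {"leaf": True, "prediction": majority_label}

    # Partition the datapoints into the subsets induced by the decision.
    yes = [(x, y) for x, y in zip(points, labels) if test(x)]
    no = [(x, y) for x, y in zip(points, labels) if not test(x)]
    if not yes or not no:  # degenerate split: fall back to a leaf
        return {"leaf": True, "prediction": majority_label}

    # Step 4: recursively apply the same procedure to each child node.
    return {
        "leaf": False,
        "test": test,
        "yes": build_tree([x for x, _ in yes], [y for _, y in yes], choose_split),
        "no": build_tree([x for x, _ in no], [y for _, y in no], choose_split),
    }
```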

Algorithms

Finding the optimal decision tree is an NP-hard problem (no efficient exact algorithm is known). CART, ID3, and C4.5 are widely used algorithms that construct near-optimal decision trees. The scikit-learn package uses an optimised version of the CART algorithm; the details can be found in its documentation.
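
As a quick illustration of the scikit-learn interface mentioned above, the sketch below fits a tree on just the two numeric columns (Age and Income, in thousands) of the toy dataset. The categorical columns would need to be encoded first, so this only demonstrates the API rather than reproducing the hand-built trees above.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Age and Income (in thousands) from the toy dataset, with the Credit label.
X = [[23, 75], [35, 50], [26, 70], [41, 95], [18, 40],
     [55, 85], [30, 60], [35, 60], [28, 65]]
y = ["Yes", "No", "Yes", "Yes", "No", "No", "No", "Yes", "Yes"]

clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(X, y)

# Print the learned tree as indented text rules.
print(export_text(clf, feature_names=["Age", "Income"]))
```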

Impurity Metrics

Decision trees use various impurity metrics (sometimes also called purity metrics) to measure how mixed the classes within a node are, and hence whether the node should be split further into children nodes. The goal of the tree-building process is to make each node as pure as possible, meaning the data within the node is dominated by a single class. The following impurity metrics are widely used in decision tree algorithms.

  • Gini Impurity. Gini impurity measures the probability of incorrectly classifying a randomly chosen element from a set if it were labelled randomly according to the distribution of classes in the set. For a binary classifier, it ranges from \(0\) (perfectly pure node) to \(0.5\) (perfectly impure node). For a dataset with \(K\) classes, it is computed as follows:

    \[ Gini = 1 - \sum_{i=1}^K p_i^2 \]

    where \(p_i\) is the probability of an element being classified into class \(i\). For instance, if a certain node contains \(60\%\) of its datapoints in class A and \(40\%\) in class B, the Gini impurity is \(1 - (0.6^2 + 0.4^2) = 0.48\).

  • Entropy. Entropy is an information-theoretic metric that measures the randomness or uncertainty in the dataset. For a binary classifier, it ranges from \(0\) (perfectly pure node) to \(1\) (perfectly impure node). For a dataset with \(K\) classes, it is computed (using the base-2 logarithm) as follows:

    \[ Entropy = - \sum_{i=1}^K p_i \log_2{(p_i)} \]

    where \(p_i\) is the probability of an element being classified into class \(i\). For instance, if a certain node contains \(60\%\) of its datapoints in class A and \(40\%\) in class B, the entropy is \(-(0.6\log_2{0.6} + 0.4\log_2{0.4}) = 0.97\).

  • Misclassification Rate. The misclassification rate computes the fraction of samples that would be misclassified if every sample in the node were assigned to the majority class. For a dataset with \(K\) classes, it is computed as follows:

    \[ Error = 1 - \max(p_1, p_2, \dots, p_K) \]

    where \(p_i\) is the probability of an element being classified into class \(i\). For instance, if a certain node contains \(60\%\) of its datapoints in class A and \(40\%\) in class B, the misclassification rate is \(1 - 0.6 = 0.4\).

Since the misclassification rate favours the majority class, it is typically used for early termination of node splitting. Gini impurity and entropy are the most commonly used metrics in decision trees. In practice, they often lead to similar results. Gini impurity tends to be slightly faster to compute since it does not involve computing logarithms.
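
These three metrics are easy to verify numerically. The short sketch below recomputes the 60%/40% example used above; it is only a check on the arithmetic, not an implementation tied to any particular library.

```python
from math import log2

def gini(probs):
    # Gini impurity: 1 minus the sum of squared class probabilities.
    return 1 - sum(p ** 2 for p in probs)

def entropy(probs):
    # Entropy in bits: -sum of p * log2(p), skipping empty classes.
    return -sum(p * log2(p) for p in probs if p > 0)

def misclassification(probs):
    # Misclassification rate: 1 minus the majority-class probability.
    return 1 - max(probs)

probs = [0.6, 0.4]  # 60% class A, 40% class B
print(gini(probs))               # 0.48
print(entropy(probs))            # approximately 0.971
print(misclassification(probs))  # 0.4
```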

Information Gain

Decision trees use information gain (derived from an impurity metric) to determine the best feature and threshold to split the data at each node. The primary objective is to minimize the impurity of the resulting subsets after each split (equivalently, to maximize their purity), meaning that each subset should contain data that is predominantly from a single class, a.k.a. a pure node. Information gain quantifies this improvement in purity, guiding the decision tree to split the data in the most informative way.

Consider that a node \(u\) is split into \(k\) children \((v_1, v_2, ..., v_k)\) based on a certain decision. For instance, a node can be split based on a categorical predictor with \(k\) classes. Let \(|u|\) denote the number of datapoints in node \(u\); similarly, \(|v_i|\) denotes the number of datapoints in node \(v_i\). Let \(I(\cdot)\) denote an impurity metric. The information gain is computed as follows:

\[ Information ~ Gain~ = I(u) - \sum_{i=1}^k \frac{|v_i|}{|u|} I(v_i) \]

Every decision splits node \(u\) into its children. Information gain quantifies the quality of the decision and decision trees choose the decision that offers the maximum information gain.
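
A minimal sketch of this computation, using entropy as the impurity metric \(I(\cdot)\) (Gini impurity or the misclassification rate could be substituted without changing the structure):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent_labels, child_label_lists):
    """I(u) minus the size-weighted impurity of the children of u."""
    n = len(parent_labels)
    weighted_children = sum(
        len(child) / n * entropy(child) for child in child_label_lists
    )
    return entropy(parent_labels) - weighted_children
```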

Illustration

Let us refer back to the toy dataset that we used in Unit 30. We can represent the dataset using a tuple, such as \((5, 4)\), that denotes the number of datapoints in each of the two classes (yes/no). Let \(u\) denote the node with this dataset. We will use entropy as the impurity metric for this illustration.

  • Entropy of the dataset. \(I(u) = I((5, 4)) = -(\frac{5}{9}\log_2\frac{5}{9} + \frac{4}{9}\log_2\frac{4}{9}) = 0.99\).

  • Decision 1. Let us say we decide to split the dataset based on the Marital feature. This decision would split \(u\) into \(v_1\) (node with married people) and \(v_2\) (node with single people). We can compute the information gain as follows:

    \[ \begin{aligned} I(v_1) &= I((2, 2)) = 1 \\ I(v_2) &= I((3, 2)) = 0.97 \\ Information ~ Gain~ &= 0.99 - \left(\frac{4}{9} \cdot 1 + \frac{5}{9} \cdot 0.97 \right) = 0.006 \end{aligned} \]
  • Decision 2. Let us say we decide to split the dataset based on the Edu feature. This decision would split \(u\) into \(v_1\) (node with bachelor degree holders) and \(v_2\) (node with master or PhD holders). We can compute the information gain as follows:

    \[ \begin{aligned} I(v_1) &= I((0, 3)) = 0 \\ I(v_2) &= I((5, 1)) = 0.65 \\ Information ~ Gain~ &= 0.99 - \left(\frac{3}{9} \cdot 0 + \frac{6}{9} \cdot 0.65 \right) = 0.556 \end{aligned} \]

Since the second decision provides higher information gain, the decision tree will prefer Decision 2 over Decision 1 to split \(u\) into two subsets.
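
The two gains can be double-checked with a few lines of Python. The class-count tuples below are exactly the ones used in the illustration; the small differences from the hand-computed values come from rounding the intermediate entropies.

```python
from math import log2

def H(counts):
    """Entropy (base 2) of a node given its class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

parent = (5, 4)
# Decision 1: split on Marital -> married (2, 2) and single (3, 2)
gain_1 = H(parent) - (4/9 * H((2, 2)) + 5/9 * H((3, 2)))
# Decision 2: split on Edu -> bachelor (0, 3) and masters/PhD (5, 1)
gain_2 = H(parent) - (3/9 * H((0, 3)) + 6/9 * H((5, 1)))

print(round(gain_1, 3), round(gain_2, 3))  # approximately 0.007 and 0.558
```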

Non-Binary Splits

In decision trees, handling non-binary splits (i.e., splits involving more than two possible branches) depends on the type of data and the algorithm's design. Non-binary splits are common when working with categorical features that have multiple categories, or with certain modifications of the decision tree algorithm.

  • Multiway splitting is possible when the decision involves a categorical predictor. Every value of the predictor gives rise to one subset of the dataset after the split. It works well when there are only a few distinct values, but it can result in an unwieldy number of branches when there are many.

  • Another alternative is to convert the categorical variable into a binary representation and use binary splits for each category. For instance, we can split the toy dataset on the Edu feature into bachelor degree holders versus other degree holders.

  • Numerical predictors can lead to countless possible ways to split the dataset. A binary split for numerical predictors can be created using a threshold, but selecting the right threshold isn't always straightforward. Domain knowledge can guide the choice of the threshold. In the absence of such knowledge, various percentiles can be tested as potential threshold values, with the one that provides the highest information gain being selected for the split (see the sketch after this list).
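
A minimal sketch of the percentile-based threshold search for a numerical predictor. The candidate percentiles and the nearest-rank approximation below are arbitrary illustrative choices, and the split is scored here with a Gini-based gain (entropy would work the same way).

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_threshold(values, labels, percentiles=(10, 25, 50, 75, 90)):
    """Try a few percentiles of a numerical feature as candidate
    thresholds and keep the one giving the highest gain."""
    parent = gini(labels)
    n = len(labels)
    best = None
    for q in sorted(set(percentiles)):
        # Nearest-rank percentile as the candidate threshold.
        threshold = sorted(values)[max(0, int(n * q / 100) - 1)]
        left = [y for v, y in zip(values, labels) if v <= threshold]
        right = [y for v, y in zip(values, labels) if v > threshold]
        if not left or not right:
            continue
        gain = parent - (len(left) / n * gini(left) + len(right) / n * gini(right))
        if best is None or gain > best[1]:
            best = (threshold, gain)
    return best  # (threshold, gain), or None if no valid split was found

# Income (in thousands) and Credit labels from the toy dataset.
income = [75, 50, 70, 95, 40, 85, 60, 60, 65]
credit = ["Yes", "No", "Yes", "Yes", "No", "No", "No", "Yes", "Yes"]
print(best_threshold(income, credit))
```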

Pruning

Decision trees tend to overfit the training data easily. An overfitted tree performs well on the training data but generalizes poorly to unseen data, resulting in lower accuracy on test data. Large and deep decision trees are not only less efficient but also hard to interpret. Pruning simplifies the tree by removing nodes or branches that contribute little to predictive power, leading to a more generalized model that performs better on unseen data.

There are two main approaches to pruning a decision tree: pre-pruning (early stopping) and post-pruning (post-processing).

  • Pre-pruning stops the growth of the decision tree early, preventing it from becoming overly complex. The tree is constructed with certain restrictions, and growth halts as soon as any of the stopping criteria is met. The restrictions can be in terms of a maximum allowed depth (stop splitting once the tree reaches a certain depth), a minimum number of samples per node (split a node only if it contains at least this many datapoints), or a minimum information gain threshold (see the sketch after this list).

  • Post-pruning starts by fully growing the tree, then simplifies it by removing nodes or branches that do not contribute significantly to model performance. It evaluates the impact of removing a branch and prunes it if doing so does not substantially increase the error rate.
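
Both pruning styles are exposed by scikit-learn: pre-pruning through constructor limits such as `max_depth`, `min_samples_split`, and `min_impurity_decrease`, and post-pruning through minimal cost-complexity pruning (`ccp_alpha`). The sketch below uses a built-in dataset purely for illustration; in practice the pruning strength would be chosen by cross-validation rather than picked from the path as done here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: restrict growth while the tree is being built.
pre_pruned = DecisionTreeClassifier(
    max_depth=3,                 # maximum allowed depth
    min_samples_split=20,        # minimum samples required to split a node
    min_impurity_decrease=0.01,  # minimum gain required to keep a split
    random_state=0,
).fit(X_train, y_train)

# Post-pruning: compute the cost-complexity pruning path on the full tree,
# then refit with a chosen alpha (larger alpha = more aggressive pruning).
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
post_pruned = DecisionTreeClassifier(
    ccp_alpha=path.ccp_alphas[-2],  # an arbitrary alpha from the path, for illustration
    random_state=0,
).fit(X_train, y_train)

print(pre_pruned.score(X_test, y_test), post_pruned.score(X_test, y_test))
```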