Decision Trees And Random Forests
What are decision trees?
According to Introduction to Statistical Learning (7th printing):
“Tree-based methods for regression and classification… involve stratifying or segmenting the predictor space into a number of simple regions… Since the set of splitting rules used to segment the predictor space can be summarized in a tree, these types of approaches are known as decision tree methods.”
Decision trees are one of those things better explained and defined with a picture.
What are these decision tree methods?
Decision trees can be categorized by (1) the algorithm used to build it and (2) whether the tree is applied to a regression or classification problem.
Algorithms for Building Decision Trees
There are many decision tree algorithms. They include ID3 (Iterative Dichotomizer 3), C4.5, CART (Classification and Regression Tree), CHAID, QUEST, GUIDE, CRUISE, and CTREE. The three most common are ID3, C4.5, and CART.
This guide will focus on CART. Coincidentally, sklearn also uses CART.
Classification Trees
Let’s walk through how to build a classification tree (that is, a decision tree where the target variable is categorical) using CART. Our example dataset will be a very, very small sample of the Titanic data set, which records people who boarded the Titanic. Our target variable is whether a passenger survived.
The attributes are Pclass, Sex, Age, and Survived. Their possible values include upper, middle, lower, female, and so on.
import pandas as pd
from sklearn import tree
import warnings; warnings.filterwarnings('ignore')
train = pd.read_csv('../../src/train.csv')
train = train.set_index('PassengerId')
# take a tiny, reproducible sample of the Titanic data
toy_feats = ['Pclass', 'Sex', 'Age']
seed = 2
toy = train[toy_feats + ['Survived']].sample(n = 5, random_state = seed)
# relabel the numeric passenger classes with readable names
ec_class = {1: 'upper', 2: 'middle', 3: 'lower'}
for old, new in ec_class.items():
    toy.loc[toy.Pclass == old, 'Pclass'] = new
toy
| PassengerId | Pclass | Sex | Age | Survived |
|---|---|---|---|---|
| 708 | upper | male | 42.0 | 1 |
| 38 | lower | male | 21.0 | 0 |
| 616 | middle | female | 24.0 | 1 |
| 170 | lower | male | 28.0 | 0 |
| 69 | lower | female | 17.0 | 1 |
How do we split our root node?
Let’s take a small detour and consider this data set.
pd.DataFrame({'dates': ['1/2/18', '3/12/18', '8/3/18'],
              'weather': ['cold', 'cold', 'hot'],
              'doIeatIceCream': ['no', 'no', 'yes']})
|   | dates | weather | doIeatIceCream |
|---|---|---|---|
| 0 | 1/2/18 | cold | no |
| 1 | 3/12/18 | cold | no |
| 2 | 8/3/18 | hot | yes |
The decision tree for this dataset is obvious: split on weather. The answer is obvious because the weather nodes are pure, i.e., I ate ice cream on every hot day and on no cold day.
Now let’s return to the Titanic dataset. ID3, CART, and C4.5 all split based on purity; the differences lie in how many nodes a split produces and in how purity is measured. CART only creates binary trees, meaning each split produces exactly two nodes.
To measure purity, CART uses the Gini index or the twoing index. We will use the Gini index because it is simpler. We need to find the Gini index for every possible split and pick the split with the lowest Gini index: a smaller Gini index indicates more purity (less impurity).
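Concretely, “find the Gini index for every possible split and pick the lowest” can be mechanized with a few lines of pandas. A minimal sketch, assuming the toy DataFrame built above (the Gini index itself is defined in the next section):
def gini(labels):
    # Gini index of a node: 1 minus the sum of squared class proportions
    props = labels.value_counts(normalize = True)
    return 1 - (props ** 2).sum()

def split_gini(df, mask, target = 'Survived'):
    # weighted average of the two child nodes' Gini indices
    left, right = df[mask], df[~mask]
    return len(left) / len(df) * gini(left[target]) + len(right) / len(df) * gini(right[target])

split_gini(toy, toy.Sex == 'male')     # the candidate split on Sex; ≈ 0.27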
Gini Index
Sometimes the Gini index is written another way; the proof that the two formulas are equal is a one-line identity, shown below.
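For reference, with $\hat{p}_{mk}$ denoting the proportion of training observations in the $m$th node that belong to the $k$th class, the two forms and the identity connecting them are:

$$ G_m = \sum_{k=1}^{K} \hat{p}_{mk}\,(1 - \hat{p}_{mk}) = \sum_{k=1}^{K} \hat{p}_{mk} - \sum_{k=1}^{K} \hat{p}_{mk}^{2} = 1 - \sum_{k=1}^{K} \hat{p}_{mk}^{2}, $$

since the class proportions in a node sum to one.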
Binary Variables
Let’s calculate the Gini index for the male node, the female node, and finally the split on Sex.
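Working this out by hand on the toy sample (three males, one of whom survived; two females, both of whom survived):

$$ G_{\text{male}} = 1 - \left(\tfrac{1}{3}\right)^{2} - \left(\tfrac{2}{3}\right)^{2} = \tfrac{4}{9}, \qquad G_{\text{female}} = 1 - 1^{2} - 0^{2} = 0, $$

$$ G_{\text{split on Sex}} = \tfrac{3}{5}\cdot\tfrac{4}{9} + \tfrac{2}{5}\cdot 0 = \tfrac{4}{15} \approx 0.27. $$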
Categorical Variables
Recall that CART only does binary splits, so we cannot do a three-way split on the Pclass variable. We can only do a split where one branch covers one or two of the values and the other branch covers the remaining values.
We need to find Gini indices for the following possible splits: upper vs. not upper, middle vs. not middle, and lower vs. not lower.
Here is how to calculate the Gini index for the split into “is upper class” and “is not upper class.”
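Using the toy sample again (one upper-class passenger, who survived; four non-upper passengers, two of whom survived):

$$ G_{\text{upper}} = 1 - 1^{2} = 0, \qquad G_{\text{not upper}} = 1 - \left(\tfrac{2}{4}\right)^{2} - \left(\tfrac{2}{4}\right)^{2} = \tfrac{1}{2}, $$

$$ G_{\text{split on upper}} = \tfrac{1}{5}\cdot 0 + \tfrac{4}{5}\cdot\tfrac{1}{2} = 0.4, $$

which is more impure than the ≈ 0.27 we got for the split on Sex.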
As you can see, this is a very tedious process. Fortunately, sklearn can build our tree for us. Unfortunately, the version of sklearn used here does not support tree pruning (more on this in the pruning section below).
from sklearn import tree
dumm = pd.get_dummies(toy)
clf = tree.DecisionTreeClassifier()
clf = clf.fit(dumm.drop(columns = ['Survived']), dumm.Survived)
clf
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False, random_state=None,
splitter='best')
import graphviz
# keep feature names in the same order as the columns used to fit the tree
ff = list(dumm.drop(columns = ['Survived']).columns)
dot_data = tree.export_graphviz(clf, out_file=None,
                                feature_names = ff)
graph = graphviz.Source(dot_data)
graph
Tree Pruning
If we stopped here, we would get very good predictions on the training set, but probably not on the test set. Decision trees are particularly prone to overfitting, so we should prune our decision tree. (To prune an actual tree or bush is to trim it.) CART uses cost complexity pruning, also known as weakest link pruning. From Introduction to Statistical Learning (7th printing): when using cost complexity pruning, “rather than considering every possible subtree [as we do in cross-validation], we consider a sequence of trees indexed by a nonnegative tuning parameter α.”
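The version of sklearn used above predates pruning support, but newer releases (0.22+) do implement minimal cost-complexity pruning through the ccp_alpha parameter of DecisionTreeClassifier. A minimal sketch of how that could look, assuming the dumm DataFrame built earlier:
# newer sklearn (0.22+) exposes minimal cost-complexity pruning via ccp_alpha
X, y = dumm.drop(columns = ['Survived']), dumm.Survived
path = tree.DecisionTreeClassifier(random_state = 0).cost_complexity_pruning_path(X, y)
for alpha in path.ccp_alphas:
    pruned = tree.DecisionTreeClassifier(random_state = 0, ccp_alpha = alpha).fit(X, y)
    print(f'alpha={alpha:.3f}  leaves={pruned.get_n_leaves()}')
# in practice, pick alpha by cross-validation rather than eyeballing leaf counts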
Notes About Using sklearn to Build Decision Trees
sklearn is missing some functionality: it only implements the CART algorithm, and it “does not support categorical variables for now.” Note that we had to one-hot encode our dataframe before passing it to sklearn’s tree functions. If Pclass had more than three values, one-hot encoding alone would not have been enough to build a decision tree with categorical variables; we would need to create columns indicating when a row takes on one of two values of an attribute.
Consider this dataset where Pclass has four values, not three.
# DataFrame.append was removed in newer pandas; use pd.concat to add the extra passenger
new_row = pd.DataFrame({'Pclass': ['dirt_poor'], 'Sex': ['female'], 'Age': [40], 'Survived': [0]})
new_toy = pd.concat([toy, new_row])
new_toy
|   | Pclass | Sex | Age | Survived |
|---|---|---|---|---|
| 708 | upper | male | 42.0 | 1 |
| 38 | lower | male | 21.0 | 0 |
| 616 | middle | female | 24.0 | 1 |
| 170 | lower | male | 28.0 | 0 |
| 69 | lower | female | 17.0 | 1 |
| 0 | dirt_poor | female | 40.0 | 0 |
dummies = pd.get_dummies(new_toy); dummies
|   | Age | Survived | Pclass_dirt_poor | Pclass_lower | Pclass_middle | Pclass_upper | Sex_female | Sex_male |
|---|---|---|---|---|---|---|---|---|
| 708 | 42.0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
| 38 | 21.0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 616 | 24.0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
| 170 | 28.0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 69 | 17.0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| 0 | 40.0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
Passing in a dummied dataframe with the Sex_female column would effectively give us the same decision tree as passing in a dataframe without Sex_female.
dummies = dummies.drop(columns=['Sex_female']); dummies.head()
|   | Age | Survived | Pclass_dirt_poor | Pclass_lower | Pclass_middle | Pclass_upper | Sex_male |
|---|---|---|---|---|---|---|---|
| 708 | 42.0 | 1 | 0 | 0 | 0 | 1 | 1 |
| 38 | 21.0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 616 | 24.0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 170 | 28.0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 69 | 17.0 | 1 | 0 | 1 | 0 | 0 | 0 |
This is why we did not need to add a Pclass_lowVmid column and the like: we already have Pclass_upper, and that’s enough. However, if Pclass had four values, our dummied dataframe would not have enough columns to make a decision tree with categorical variables. We can easily make a work-around so that sklearn can handle categorical variables.
dummies['Pclass_dirtVlow'] = dummies.Pclass_dirt_poor + dummies.Pclass_lower
dummies['Pclass_dirtVmid'] = dummies.Pclass_dirt_poor + dummies.Pclass_middle
dummies['Pclass_dirtVup'] = dummies.Pclass_dirt_poor + dummies.Pclass_upper
dummies['Pclass_lowVmid'] = dummies.Pclass_lower + dummies.Pclass_middle
dummies['Pclass_lowVup'] = dummies.Pclass_lower + dummies.Pclass_upper
dummies['Pclass_midVup'] = dummies.Pclass_middle + dummies.Pclass_upper
dummies.head()
|   | Age | Survived | Pclass_dirt_poor | Pclass_lower | Pclass_middle | Pclass_upper | Sex_male | Pclass_dirtVlow | Pclass_dirtVmid | Pclass_dirtVup | Pclass_lowVmid | Pclass_lowVup | Pclass_midVup |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 708 | 42.0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 1 |
| 38 | 21.0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 |
| 616 | 24.0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 |
| 170 | 28.0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 |
| 69 | 17.0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 |
Other Measures of Purity
Entropy
Entropy is defined as

$$ D = - \sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}, $$

where $\hat{p}_{mk}$ is the proportion of training observations in the mth region that are from the kth class.
Like the Gini index, a smaller entropy implies less impurity. The Gini index and entropy are in fact quite similar numerically.
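To see how close the two measures are, here is a quick sketch comparing them for a two-class node as the proportion p of class one varies (entropy computed with base-2 logs):
import numpy as np
p = np.linspace(0.05, 0.95, 10)        # proportion of class one in the node
gini = 2 * p * (1 - p)                 # 1 - p^2 - (1 - p)^2 simplifies to 2p(1 - p)
entropy = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
for pi, g, e in zip(p, gini, entropy):
    print(f'p={pi:.2f}  gini={g:.3f}  entropy={e:.3f}')
Both measures peak at p = 0.5 and fall to zero as the node becomes pure, which is why swapping one for the other rarely changes the tree much.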
You can tell sklearn to use entropy rather than the Gini index with sklearn.tree.DecisionTreeClassifier(criterion = 'entropy'). Note that using entropy means sklearn is not technically using CART. This is a technical detail that rarely matters: in practice we prune and use random forests, and when we are using random forests the exact splitting criterion doesn’t really make a difference.
Information Gain
Introduction to Data Mining by Tan, Steinbach, and Kumar describes information gain well.
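In short, and in the spirit of that book’s presentation (using $I(\cdot)$ for whichever impurity measure is chosen), information gain is the drop in impurity achieved by a split:

$$ \Delta = I(\text{parent}) - \sum_{j=1}^{J} \frac{N(v_j)}{N}\, I(v_j), $$

where the $v_j$ are the child nodes, $N(v_j)$ is the number of training observations landing in child $v_j$, and $N$ is the number in the parent. Since $I(\text{parent})$ is the same for every candidate split of a node, picking the split with the lowest weighted Gini index is the same as picking the split with the largest gain when $I$ is the Gini index.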
Comparing Impurity Measures
Bagging, Random Forests, and Boosting
Bagging, random forests, and boosting use many decision trees as building blocks to construct a more powerful prediction model.
Bagging (Bootstrap AGGregating)
To understand bagging, we need to know what it means to bootstrap data. To bootstrap is to resample with replacement from the original sample multiple times and to compute a statistic from each resample (each bootstrapped sample). We can gain insight from the statistics computed on the bootstrapped samples. Bootstrapping can be applied in many contexts, not just decision trees. Knowing where the term bootstrap comes from may help us remember its definition: it comes from the idiom “to pull yourself up by your bootstraps,” meaning to help or save yourself without help from others. Bootstrapping gleans insight from the sample we already have rather than by obtaining more samples.
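As a small illustration (a sketch, not part of the original analysis), here is one way to bootstrap the mean Age of the toy sample:
import numpy as np
rng = np.random.RandomState(seed)
boot_means = []
for _ in range(1000):
    # resample the rows with replacement and record the statistic of interest
    resample = toy.Age.sample(frac = 1, replace = True, random_state = rng)
    boot_means.append(resample.mean())
print(np.mean(boot_means), np.std(boot_means))   # bootstrap estimate of the mean and its spread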
Bagging is also not unique to decision trees; it is just particularly effective with them. Bootstrap aggregation, or bagging, is a technique where we bootstrap the data (i.e., resample it with replacement), build a predictive model from each bootstrapped sample, and then aggregate the predictions of all those models, for example by averaging for regression or by majority voting for classification. Bagging reduces the variance of the prediction model.
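sklearn ships a generic implementation of this idea in sklearn.ensemble.BaggingClassifier. A minimal sketch, assuming the dumm DataFrame built earlier, that bags 100 decision trees:
from sklearn.ensemble import BaggingClassifier
X, y = dumm.drop(columns = ['Survived']), dumm.Survived
# each of the 100 trees is fit on a bootstrap resample of the rows;
# predictions are aggregated by majority vote
bag = BaggingClassifier(tree.DecisionTreeClassifier(), n_estimators = 100,
                        bootstrap = True, random_state = seed)
bag = bag.fit(X, y)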
Random Forests
From Introduction to Statistical Learning (7th printing), “Random forests provide an improvement over bagged trees by way of a small tweak that decorrelates the trees. As in bagging, we build a number of decision trees on bootstrapped training samples. But when building these decision trees, each time a split in a tree is considered, a random sample of m predictors is chosen as split candidates from the full set of p predictors. The split is allowed to use only one of those m predictors. A fresh sample of m predictors is taken at each split, and typically we choose m ≈ sqrt(p) where p is the number of predictors, also called attributes or features.”
By randomly choosing which predictors may be used at each split, we decorrelate the trees. With bagging alone, we get many trees that generally split the same way; the decision trees are correlated. Random forests use decorrelated trees.
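In sklearn this decorrelation is controlled by max_features. A minimal sketch, again assuming the dumm DataFrame from earlier:
from sklearn.ensemble import RandomForestClassifier
X, y = dumm.drop(columns = ['Survived']), dumm.Survived
# max_features='sqrt' considers only ~sqrt(p) randomly chosen predictors at each split
rf = RandomForestClassifier(n_estimators = 100, max_features = 'sqrt', random_state = seed)
rf = rf.fit(X, y)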
Example of Random Forest
The description above is very abstract. Assistant Professor Josh Starmer does a very good job of walking through how to build a random forest with a toy dataset. He uses a classification example and aggregates the predictions with a simple majority vote. Remember that there are many ways to aggregate in bagging.