Decision Trees and Random Forests on Titanic Data
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer
from sklearn import tree
from sklearn.preprocessing import OneHotEncoder
import graphviz
from sklearn.ensemble import RandomForestClassifier
Load Data
raw_train = pd.read_csv('../../src/train.csv', index_col='PassengerId')
train = raw_train.copy()
train.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked
---|---|---|---|---|---|---|---|---|---|---|---
1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S
2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C
3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S
4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S
5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S
Counting Number of Unique Values for Each Feature
`Name` Is Unique, So We Are Dropping It
A column whose values are all unique has no predictive value. It could serve as an index, but we already have `PassengerId` as our index.
Recall that our training set has 891 rows.
for f in train:
    n_unique = len(train[f].unique())  # number of distinct values; NaN counts as one
    print(f, n_unique)
Survived 2
Pclass 3
Name 891
Sex 2
Age 89
SibSp 7
Parch 7
Ticket 681
Fare 248
Cabin 148
Embarked 4
train = train.drop(columns=['Name'])
`Ticket` and `Cabin` Are Categorical Features with Many Unique Values
Recall that a categorical feature whose values are all unique has no predictive value. It follows that a feature with a large number of unique values relative to the number of rows has little predictive value, if any. `Ticket` and `Cabin` may have little to no predictive value for this reason. Let's look at `Ticket` and `Cabin` more closely.
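As a quick sanity check (a sketch, not part of the original run), we can look at the number of distinct values per feature relative to the 891 training rows:

# Fraction of distinct (non-null) values relative to the number of rows;
# values near 1.0 mean the feature behaves almost like an identifier.
train[['Ticket', 'Cabin']].nunique() / len(train)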
`Ticket` Doesn't Appear Useful
pd.options.display.max_rows = 12
train.Ticket.value_counts()
347082 7
1601 7
CA. 2343 7
CA 2144 6
3101295 6
347088 6
..
29105 1
PC 17473 1
113056 1
323951 1
27267 1
345364 1
Name: Ticket, Length: 681, dtype: int64
train = train.drop(columns=['Ticket'])
`Cabin` Is Probably Not Useful
Let's drop `Cabin`. The letters at the front of the cabin values may be useful, but probably not: only about 23% of passengers (204 of 891) have a cabin value at all, and the letters are probably correlated with wealth and socio-economic status, which we already capture with `Pclass`. If our models are not effective, we can try using `Cabin` later.
train.Cabin.value_counts()
B96 B98 4
C23 C25 C27 4
G6 4
F33 3
C22 C26 3
E101 3
..
E31 1
A14 1
B3 1
A6 1
C30 1
B86 1
Name: Cabin, Length: 147, dtype: int64
cabin_grp = train.Cabin.str[0]
cabin_grp.value_counts()
C 59
B 47
D 33
E 32
A 15
F 13
G 4
T 1
Name: Cabin, dtype: int64
n_notnull_cg = cabin_grp.count(); n_notnull_cg  # count() gives the number of non-null cabin values
204
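To sanity-check the intuition that the cabin letter mostly tracks what `Pclass` already tells us, a quick cross-tab could be used (a sketch, using the `cabin_grp` series extracted above):

# Cabin letter vs. passenger class, to see whether each letter
# concentrates in a single class.
pd.crosstab(cabin_grp, train.Pclass)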
train = train.drop(columns = ['Cabin'])
Identifying Numerical and Categorical Variables
target = 'Survived'
is_num = train.dtypes.apply(lambda dt: dt == 'float64' or dt == 'int64')
num_cols = train.dtypes.index[is_num]
num_f = set(num_cols) - {target}
num_f
{'Age', 'Fare', 'Parch', 'Pclass', 'SibSp'}
cat_f = set(train.columns) - num_f - {target}
cat_f
{'Embarked', 'Sex'}
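The same split can be written more compactly with `select_dtypes` (an equivalent sketch; the `_alt` names are only for illustration):

# Numerical vs. categorical columns via pandas dtype selection.
num_f_alt = set(train.select_dtypes(include='number').columns) - {target}
cat_f_alt = set(train.select_dtypes(exclude='number').columns)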
fig, axes = plt.subplots(nrows=1, ncols=len(cat_f),
                         figsize=(16, 8),
                         sharey=True)
for i, f in enumerate(cat_f):
    pd.crosstab(train[f], train['Survived'])\
        .plot(kind='bar', stacked=True, ax=axes[i])
fig, axes = plt.subplots(nrows=len(num_f), ncols=1,
                         figsize=(16, 28))
for i, f in enumerate(num_f):
    f_col = train[f].copy()
    if len(train[f].unique()) > 7:
        # Bin high-cardinality numeric features so the bar chart stays readable.
        f_col = pd.cut(train[f], 12)
    ax = axes[i]
    pd.crosstab(f_col, train['Survived'])\
        .plot(kind='bar', stacked=True, ax=ax, fontsize=15)
    ax.tick_params(labelrotation=15, labelsize=10)
plt.rc('axes', labelsize=10)
plt.subplots_adjust(bottom=0.01)
Note About Not Deskewing Data
Note that although `Parch`, `Fare`, and `SibSp` are highly skewed, we do not need to deskew those features: skewed distributions are not harmful to decision tree models. If anything, deskewing would only add a step without changing anything, since tree splits are threshold-based and are unaffected by monotonic transforms such as a log.
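A quick illustration of the point (a sketch; numpy is imported here just for the check): a log transform would sharply reduce the measured skew of `Fare`, but because the transform is monotonic, the splits a decision tree can make are unchanged.

import numpy as np

# Skewness of Fare before and after a log1p transform; the transform is
# monotonic, so a decision tree's achievable splits are the same either way.
print(train.Fare.skew(), np.log1p(train.Fare).skew())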
Imputing NaNs
Let's keep this simple: we will impute NaNs in categorical variables with the mode and in numerical variables with the mean.
X = train.drop(columns=['Survived']).copy(); y = train.Survived.copy()
X.isnull().sum().sum()
179
num_f = list(num_f); cat_f = list(cat_f)
imp_X = X.copy()
cat_imp = SimpleImputer(strategy='most_frequent')
num_imp = SimpleImputer(strategy='mean')
imp_X[cat_f] = cat_imp.fit_transform(imp_X[cat_f])
imp_X[num_f] = num_imp.fit_transform(imp_X[num_f])
imp_X.isnull().sum().sum()
0
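As an aside (a sketch, not used below), the two imputers could be bundled into a single `ColumnTransformer`, which makes it easier to apply exactly the same preprocessing to the test set later:

from sklearn.compose import ColumnTransformer

# One transformer that imputes categorical columns with the mode and
# numerical columns with the mean; fit on X, then reuse on the test set.
imputer = ColumnTransformer([
    ('cat', SimpleImputer(strategy='most_frequent'), cat_f),
    ('num', SimpleImputer(strategy='mean'), num_f),
])
# imputer.fit_transform(X) returns a NumPy array with columns in cat_f + num_f order.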
dumm_X = pd.get_dummies(imp_X); dumm_X.head(3)
PassengerId | Pclass | Age | SibSp | Parch | Fare | Sex_female | Sex_male | Embarked_C | Embarked_Q | Embarked_S
---|---|---|---|---|---|---|---|---|---|---
1 | 3.0 | 22.0 | 1.0 | 0.0 | 7.2500 | 0 | 1 | 0 | 0 | 1
2 | 1.0 | 38.0 | 1.0 | 0.0 | 71.2833 | 1 | 0 | 1 | 0 | 0
3 | 3.0 | 26.0 | 0.0 | 0.0 | 7.9250 | 1 | 0 | 0 | 0 | 1
Preparing Test Set
test = pd.read_csv('../../src/test.csv', index_col='PassengerId')
test.head()
PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked
---|---|---|---|---|---|---|---|---|---|---
892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q
893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0000 | NaN | S
894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q
895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S
896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S
test = test[X.columns]
test.head(2)
PassengerId | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked
---|---|---|---|---|---|---|---
892 | 3 | male | 34.5 | 0 | 0 | 7.8292 | Q
893 | 3 | female | 47.0 | 1 | 0 | 7.0000 | S
test.isnull().sum().sum()
87
test[cat_f] = cat_imp.transform(test[cat_f])
test[num_f] = num_imp.transform(test[num_f])
test.isnull().sum().sum()
0
def get_test_dummies(test_df, X_dummies):
    test_dummies = pd.get_dummies(data=test_df)
    # Dummy columns that exist in the training set but not in the test set:
    # add them to the test set as all zeros.
    values_not_in_test_dummies = set(X_dummies.columns) - set(test_dummies.columns)
    for v in values_not_in_test_dummies:
        test_dummies[v] = 0
    # Dummy columns that exist in the test set but not in the training set
    # (e.g., in a housing dataset, GarageCars might span [1, 4] in training
    # while a test house has more garage spaces): drop them.
    values_not_in_X = set(test_dummies.columns) - set(X_dummies.columns)
    test_dummies = test_dummies.drop(columns=list(values_not_in_X))
    # Put the test columns in the same order as the training columns.
    test_dummies = test_dummies[X_dummies.columns.tolist()]
    return test_dummies
dumm_test = get_test_dummies(test, dumm_X)
dumm_test.isnull().sum()
Pclass 0
Age 0
SibSp 0
Parch 0
Fare 0
Sex_female 0
Sex_male 0
Embarked_C 0
Embarked_Q 0
Embarked_S 0
dtype: int64
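The same alignment can be done in one line with `reindex` (an equivalent sketch; `dumm_test_alt` is just an illustrative name): dummy columns missing from the test set are added as zeros, and test-only columns are dropped.

dumm_test_alt = pd.get_dummies(test).reindex(columns=dumm_X.columns, fill_value=0)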
Decision Tree Classifier
Decision trees on their own are not very accurate; they are prone to overfitting.
Random forests, however, can be powerful.
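Before fitting and submitting, a quick cross-validation comparison (a sketch, not part of the original submission flow) gives a sense of how much the forest's averaging helps over a single tree:

from sklearn.model_selection import cross_val_score

# 5-fold CV accuracy of a single tree vs. a 100-tree forest on the training data.
tree_cv = cross_val_score(tree.DecisionTreeClassifier(random_state=0), dumm_X, y, cv=5)
rf_cv = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=0), dumm_X, y, cv=5)
print(tree_cv.mean(), rf_cv.mean())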
dtree = tree.DecisionTreeClassifier()
dtree.fit(dumm_X, y)
ff = dumm_X.columns
dot_data = tree.export_graphviz(dtree, out_file=None,
feature_names = ff)
graph = graphviz.Source(dot_data)
graph
predictions = pd.DataFrame({'Survived': dtree.predict(dumm_test)},
index = dumm_test.index)
predictions.to_csv('dtree-naive-predictions.csv')
Kaggle accuracy = 0.727272; 9837th place out of 10736
Random Forest Classifier
rf = RandomForestClassifier(n_estimators=50)
rf.fit(dumm_X, y)
predictions = pd.DataFrame({'Survived': rf.predict(dumm_test)},
index = dumm_test.index)
predictions.to_csv('rf-predictions.csv')
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=None,
oob_score=False, random_state=None, verbose=0,
warm_start=False)
Kaggle score: 0.74162, which is unsurprisingly an improvement over the single decision tree.
rf100 = RandomForestClassifier(n_estimators=100)
rf100.fit(dumm_X, y)
predictions = pd.DataFrame({'Survived': rf100.predict(dumm_test)},
index = dumm_test.index)
predictions.to_csv('rf100-predictions.csv')
The score improved to 0.76076.
Is An Accuracy of 0.76076 Good?
Not really. It is better than guessing that every passenger died: about 62% of passengers died, so that rule would be about 62% accurate. However, if we simply guessed that every female passenger survived and every male passenger died, we would be right about 78.6% of the time, which beats the 76% accuracy of our best random forest.
pd.crosstab(train.Sex, train.Survived, normalize = 'all')
Sex | Survived = 0 | Survived = 1
---|---|---
female | 0.090909 | 0.261504
male | 0.525253 | 0.122334
0.261+0.525
0.786
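Equivalently, both baselines can be computed directly from the training labels (a quick sketch):

# Accuracy of the 'everyone died' rule and of the 'females survive, males die' rule.
everyone_died_acc = (train.Survived == 0).mean()
gender_rule_acc = ((train.Sex == 'female').astype(int) == train.Survived).mean()
print(everyone_died_acc, gender_rule_acc)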
How Can We Do Better?
We can try boosted tree models such as XGBoost or AdaBoost and compare which tree-based approach produces better results; a sketch of that next step follows.
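As a minimal sketch of that next step (AdaBoost ships with scikit-learn; XGBoost would require the separate xgboost package), using the same dummy-encoded features with an illustrative, untuned configuration and a hypothetical output filename:

from sklearn.ensemble import AdaBoostClassifier

# Boosted trees on the same features; hyperparameters are not tuned.
ada = AdaBoostClassifier(n_estimators=100, random_state=0)
ada.fit(dumm_X, y)
pd.DataFrame({'Survived': ada.predict(dumm_test)},
             index=dumm_test.index).to_csv('ada-predictions.csv')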