Predicting Housing Prices 06: Failed Attempt with a Random Forest

Experimenting with Random Forest

Random forests do not perform as well as linear models on this data set.

import numpy as np
import pandas as pd
import warnings; warnings.filterwarnings('ignore')
%matplotlib inline
import pickle

train = pd.read_pickle('../data/train.p')
X = train.drop(columns=['SalePrice'])
y = train.SalePrice

test = pd.read_pickle('../data/test.p')
rows_test = set(test.index)
rows_train = set(X.index)

# One-hot encode train and test together so both end up with the same dummy columns
dummies = pd.get_dummies(pd.concat([X, test]), drop_first=True)
X_dummies = dummies.loc[list(rows_train)]
test_dummies = dummies.loc[list(rows_test)]

from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=100)
rf.fit(X_dummies, y)
print(rf.score(X_dummies, y))  # in-sample R^2, not a held-out score

0.9849464232457522

predictions = pd.DataFrame({'SalePrice': rf.predict(test_dummies)},
                           index=test_dummies.index)

predictions.to_csv('rf-predictions.csv')


What about Cross Validation?

When building our linear models, we used cross-validation to tune our hyperparameters. That is, we did not use sklearn.linear_model.Lasso, but rather sklearn.linear_model.LassoCV. We are not using cross-validation here because a random forest already does something similar by construction: it fits many decision trees on bootstrapped samples of the training data and aggregates their predictions (yes, 'aggregating' is vague; there are many ways to aggregate, averaging being the usual one for regression) into a single model.
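If we did want a LassoCV-style sanity check on the forest, a minimal sketch (assuming the X_dummies and y built above; this check was not part of the original run) could look like this:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validated R^2 for the same kind of forest, for comparison
# with the in-sample score printed above
cv_scores = cross_val_score(RandomForestRegressor(n_estimators=100),
                            X_dummies, y, cv=5)
print(cv_scores.mean())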

Many textbooks say not to use cross-validation with random forests and to rely instead on "bagging" (short for bootstrap aggregation), since each tree's out-of-bag rows act as a built-in validation set (note that this is not the same as training several random forests and picking the best one). The usual argument is that cross-validation requires too much computational power. Computers have gotten more powerful since then, but it still seems unnecessary to cross-validate when bagging gives a comparable error estimate more efficiently. I'm also not sure sklearn even offers an option to build a random forest using cross-validation instead of the default bagging.
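In sklearn, the closest thing to this bagging-based estimate is the out-of-bag score. A minimal sketch, again assuming the X_dummies and y from above (not part of the original run):

# oob_score=True scores each tree on the rows left out of its bootstrap sample,
# giving a cross-validation-like estimate without any extra model fits
rf_oob = RandomForestRegressor(n_estimators=100, oob_score=True)
rf_oob.fit(X_dummies, y)
print(rf_oob.oob_score_)  # out-of-bag R^2, roughly comparable to a cross-validated score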