Predicting Housing Prices 010 Data Prep Formatting

import numpy as np
import pandas as pd
import warnings; warnings.filterwarnings('ignore')
import pickle
import lib.clean_helper as cl

Looking at Data and Data Preparation

My first step is to always look at some basic descriptors of the dataset, make sure the data loaded correctly, and format the data as necessary. After determining how I want my dataset to be formatted, I will make those formats in an python file that I will load in each notebook. Rather than load the raw dataset, I will load my python file.

Loading Data

raw_train = pd.read_csv('../data/kaggle_train.csv', index_col='Id')
train = raw_train.copy()

Getting Basic Descriptors of Dataset

I always start by making sure the loaded data matches the data description and get a general idea of the dataset in the process

Looking at Dimensions of Dataframe

The dataframe has 81 columns which matches the dataset description. According to the data description, the dataset includes 79 explanatory variables. This does not conflict with the loaded dataset because one column, Id does not contain an explanatory variable, and SalePrice is the target not an explantory variable.

For reference, the description can be found in the doc directory of this repository.

dimensions = train.shape; print(dimensions)
(1460, 80)
train.head(3)
MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities LotConfig ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
Id
1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub Inside ... 0 NaN NaN NaN 0 2 2008 WD Normal 208500
2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub FR2 ... 0 NaN NaN NaN 0 5 2007 WD Normal 181500
3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub Inside ... 0 NaN NaN NaN 0 9 2008 WD Normal 223500

3 rows × 80 columns

train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1460 entries, 1 to 1460
Data columns (total 80 columns):
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-null object
Exterior2nd      1460 non-null object
MasVnrType       1452 non-null object
MasVnrArea       1452 non-null float64
ExterQual        1460 non-null object
ExterCond        1460 non-null object
Foundation       1460 non-null object
BsmtQual         1423 non-null object
BsmtCond         1423 non-null object
BsmtExposure     1422 non-null object
BsmtFinType1     1423 non-null object
BsmtFinSF1       1460 non-null int64
BsmtFinType2     1422 non-null object
BsmtFinSF2       1460 non-null int64
BsmtUnfSF        1460 non-null int64
TotalBsmtSF      1460 non-null int64
Heating          1460 non-null object
HeatingQC        1460 non-null object
CentralAir       1460 non-null object
Electrical       1459 non-null object
1stFlrSF         1460 non-null int64
2ndFlrSF         1460 non-null int64
LowQualFinSF     1460 non-null int64
GrLivArea        1460 non-null int64
BsmtFullBath     1460 non-null int64
BsmtHalfBath     1460 non-null int64
FullBath         1460 non-null int64
HalfBath         1460 non-null int64
BedroomAbvGr     1460 non-null int64
KitchenAbvGr     1460 non-null int64
KitchenQual      1460 non-null object
TotRmsAbvGrd     1460 non-null int64
Functional       1460 non-null object
Fireplaces       1460 non-null int64
FireplaceQu      770 non-null object
GarageType       1379 non-null object
GarageYrBlt      1379 non-null float64
GarageFinish     1379 non-null object
GarageCars       1460 non-null int64
GarageArea       1460 non-null int64
GarageQual       1379 non-null object
GarageCond       1379 non-null object
PavedDrive       1460 non-null object
WoodDeckSF       1460 non-null int64
OpenPorchSF      1460 non-null int64
EnclosedPorch    1460 non-null int64
3SsnPorch        1460 non-null int64
ScreenPorch      1460 non-null int64
PoolArea         1460 non-null int64
PoolQC           7 non-null object
Fence            281 non-null object
MiscFeature      54 non-null object
MiscVal          1460 non-null int64
MoSold           1460 non-null int64
YrSold           1460 non-null int64
SaleType         1460 non-null object
SaleCondition    1460 non-null object
SalePrice        1460 non-null int64
dtypes: float64(3), int64(34), object(43)
memory usage: 923.9+ KB

Data Preparation

Renaming 3SsnPorch to ThreeSsnPorch

It is easier to work with a column whose name does not start with an number character

train = train.rename(columns={ '3SsnPorch': 'ThreeSsnPorch'})

Classifying Numerical and Categorical Features

train.dtypes
MSSubClass         int64
MSZoning          object
LotFrontage      float64
LotArea            int64
Street            object
Alley             object
LotShape          object
LandContour       object
Utilities         object
LotConfig         object
LandSlope         object
Neighborhood      object
Condition1        object
Condition2        object
BldgType          object
HouseStyle        object
OverallQual        int64
OverallCond        int64
YearBuilt          int64
YearRemodAdd       int64
RoofStyle         object
RoofMatl          object
Exterior1st       object
Exterior2nd       object
MasVnrType        object
MasVnrArea       float64
ExterQual         object
ExterCond         object
Foundation        object
BsmtQual          object
                  ...   
BedroomAbvGr       int64
KitchenAbvGr       int64
KitchenQual       object
TotRmsAbvGrd       int64
Functional        object
Fireplaces         int64
FireplaceQu       object
GarageType        object
GarageYrBlt      float64
GarageFinish      object
GarageCars         int64
GarageArea         int64
GarageQual        object
GarageCond        object
PavedDrive        object
WoodDeckSF         int64
OpenPorchSF        int64
EnclosedPorch      int64
ThreeSsnPorch      int64
ScreenPorch        int64
PoolArea           int64
PoolQC            object
Fence             object
MiscFeature       object
MiscVal            int64
MoSold             int64
YrSold             int64
SaleType          object
SaleCondition     object
SalePrice          int64
Length: 80, dtype: object

Ordinal Categorical Features

The following features are ordinal, but that does not necessarily mean they should be numerical. There are two reasons why I am making the following ordinal features categorical.

  1. We cannot say with certainty that the difference between two consecutive values of these ordinal features is not the same as the difference between between two other consecutive values. Take OverallQual for example. It would be reasonable to say that the difference between a house with “OverallQual 4 Below Average” (data description) and a house with “OverallQual 5 Average” is not the same as the difference between a house with OverallQual being very excellent and a house with OverallQual being excellent. We would say there is diminishing returns.

  2. We could scale the values to reflect the true relationship, but that will likely introduce a lot of error. It is better to not use the ordinality of these features than to use it incorrectly.

OverallQual (Ordinal): Rates the overall material and finish of the house

  10	Very Excellent
  9	Excellent
  8	Very Good
  7	Good
  6	Above Average
  5	Average
  4	Below Average
  3	Fair
  2	Poor
  1	Very Poor
ordinal_f = ['MSSubClass' ,'OverallQual',
             'OverallCond',
             'FullBath','HalfBath',
             'BedroomAbvGr','KitchenAbvGr',
             'TotRmsAbvGrd','Fireplaces',
             'MoSold', 'BsmtCond', 'BsmtQual',
             'ExterCond', 'ExterQual',
             'GarageQual', 'GarageCond', 'GarageCars',
             'HeatingQC', 'KitchenQual',
             'PoolQC',
             'BsmtHalfBath', 'BsmtFullBath']            
train[ordinal_f] = train[ordinal_f].apply(cl.obj_to_ordinal)

Cleaning Exterior2nd

From data description:

Exterior2nd: Exterior covering on house (if more than one material)

  AsbShng	Asbestos Shingles
  AsphShn	Asphalt Shingles
  BrkComm	Brick Common
  BrkFace	Brick Face
  CBlock	Cinder Block
  CemntBd	Cement Board
  HdBoard	Hard Board
  ImStucc	Imitation Stucco
  MetalSd	Metal Siding
  Other	Other
  Plywood	Plywood
  PreCast	PreCast
  Stone	Stone
  Stucco	Stucco
  VinylSd	Vinyl Siding
  Wd Sdng	Wood Siding
  WdShing	Wood Shingles
train.Exterior2nd.drop_duplicates().sort_values()
Id
70      AsbShng
299     AsphShn
196     Brk Cmn
31      BrkFace
1371     CBlock
24      CmentBd
8       HdBoard
275     ImStucc
2       MetalSd
596       Other
13      Plywood
530       Stone
64       Stucco
1       VinylSd
16      Wd Sdng
4       Wd Shng
Name: Exterior2nd, dtype: object
def clean_ext_typos(ext_col):
    ext_col[ext_col == 'Brk Cmn'] = 'BrkComm'
    ext_col[ext_col == 'Wd Shng'] = 'WdShing'
    ext_col[ext_col == 'CmentBd'] = 'CemntBd'
    
    return ext_col
train.Exterior2nd = cl.clean_ext_typos(train.Exterior2nd)

Nans that Add Information

Nan values are difficult to work with programmatically.The data description lists the following features as having nan value that add information unlike nans that mean a data point is unknown or missing. Those features describe physical structures that may not exist in a home. For example, a nan value in PoolQC means there is no pool. It stands to reason that if PoolQC is nan, then PoolArea should be nan or zero. Let’s replace nan values in columns with strings such as ‘none’ or the number zero.

  • Alley
  • BsmtQual
  • BsmtCond
  • BsmtExposure
  • BsmtFinType1
  • BsmtFinSF1
  • BsmtFinType2
  • BsmtFinSF2
  • BsmtUnfSF
  • TotalBsmtSF
  • BsmtFullBath
  • BsmtHalfBath
  • FireplaceQu
  • GarageType
  • GarageYrBlt
  • GarageFinish
  • GarageCars
  • GarageArea
  • GarageQual
  • GarageCond
  • PoolArea
  • PoolQC
  • Fence
  • MiscFeature

Replacing Nans with String ‘none’

train = cl.nan_to_str_none(train)
train.to_pickle('../data/to_clean_1_train.p')