Predicting Housing Prices 010 Data Prep Formatting
import numpy as np
import pandas as pd
import warnings; warnings.filterwarnings('ignore')
import pickle
import lib.clean_helper as cl
Looking at Data and Data Preparation
My first step is to always look at some basic descriptors of the dataset, make sure the data loaded correctly, and format the data as necessary. After determining how I want my dataset to be formatted, I will make those formats in an python file that I will load in each notebook. Rather than load the raw dataset, I will load my python file.
Loading Data
raw_train = pd.read_csv('../data/kaggle_train.csv', index_col='Id')
train = raw_train.copy()
Getting Basic Descriptors of Dataset
I always start by making sure the loaded data matches the data description and get a general idea of the dataset in the process
Looking at Dimensions of Dataframe
The dataframe has 81 columns which matches the dataset description. According to the data description, the dataset includes 79 explanatory variables. This does not conflict with the loaded dataset because one column, Id
does not contain an explanatory variable, and SalePrice
is the target not an explantory variable.
For reference, the description can be found in the doc directory of this repository.
dimensions = train.shape; print(dimensions)
(1460, 80)
train.head(3)
MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Id | |||||||||||||||||||||
1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | FR2 | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | Inside | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
3 rows × 80 columns
train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1460 entries, 1 to 1460
Data columns (total 80 columns):
MSSubClass 1460 non-null int64
MSZoning 1460 non-null object
LotFrontage 1201 non-null float64
LotArea 1460 non-null int64
Street 1460 non-null object
Alley 91 non-null object
LotShape 1460 non-null object
LandContour 1460 non-null object
Utilities 1460 non-null object
LotConfig 1460 non-null object
LandSlope 1460 non-null object
Neighborhood 1460 non-null object
Condition1 1460 non-null object
Condition2 1460 non-null object
BldgType 1460 non-null object
HouseStyle 1460 non-null object
OverallQual 1460 non-null int64
OverallCond 1460 non-null int64
YearBuilt 1460 non-null int64
YearRemodAdd 1460 non-null int64
RoofStyle 1460 non-null object
RoofMatl 1460 non-null object
Exterior1st 1460 non-null object
Exterior2nd 1460 non-null object
MasVnrType 1452 non-null object
MasVnrArea 1452 non-null float64
ExterQual 1460 non-null object
ExterCond 1460 non-null object
Foundation 1460 non-null object
BsmtQual 1423 non-null object
BsmtCond 1423 non-null object
BsmtExposure 1422 non-null object
BsmtFinType1 1423 non-null object
BsmtFinSF1 1460 non-null int64
BsmtFinType2 1422 non-null object
BsmtFinSF2 1460 non-null int64
BsmtUnfSF 1460 non-null int64
TotalBsmtSF 1460 non-null int64
Heating 1460 non-null object
HeatingQC 1460 non-null object
CentralAir 1460 non-null object
Electrical 1459 non-null object
1stFlrSF 1460 non-null int64
2ndFlrSF 1460 non-null int64
LowQualFinSF 1460 non-null int64
GrLivArea 1460 non-null int64
BsmtFullBath 1460 non-null int64
BsmtHalfBath 1460 non-null int64
FullBath 1460 non-null int64
HalfBath 1460 non-null int64
BedroomAbvGr 1460 non-null int64
KitchenAbvGr 1460 non-null int64
KitchenQual 1460 non-null object
TotRmsAbvGrd 1460 non-null int64
Functional 1460 non-null object
Fireplaces 1460 non-null int64
FireplaceQu 770 non-null object
GarageType 1379 non-null object
GarageYrBlt 1379 non-null float64
GarageFinish 1379 non-null object
GarageCars 1460 non-null int64
GarageArea 1460 non-null int64
GarageQual 1379 non-null object
GarageCond 1379 non-null object
PavedDrive 1460 non-null object
WoodDeckSF 1460 non-null int64
OpenPorchSF 1460 non-null int64
EnclosedPorch 1460 non-null int64
3SsnPorch 1460 non-null int64
ScreenPorch 1460 non-null int64
PoolArea 1460 non-null int64
PoolQC 7 non-null object
Fence 281 non-null object
MiscFeature 54 non-null object
MiscVal 1460 non-null int64
MoSold 1460 non-null int64
YrSold 1460 non-null int64
SaleType 1460 non-null object
SaleCondition 1460 non-null object
SalePrice 1460 non-null int64
dtypes: float64(3), int64(34), object(43)
memory usage: 923.9+ KB
Data Preparation
Renaming 3SsnPorch
to ThreeSsnPorch
It is easier to work with a column whose name does not start with an number character
train = train.rename(columns={ '3SsnPorch': 'ThreeSsnPorch'})
Classifying Numerical and Categorical Features
train.dtypes
MSSubClass int64
MSZoning object
LotFrontage float64
LotArea int64
Street object
Alley object
LotShape object
LandContour object
Utilities object
LotConfig object
LandSlope object
Neighborhood object
Condition1 object
Condition2 object
BldgType object
HouseStyle object
OverallQual int64
OverallCond int64
YearBuilt int64
YearRemodAdd int64
RoofStyle object
RoofMatl object
Exterior1st object
Exterior2nd object
MasVnrType object
MasVnrArea float64
ExterQual object
ExterCond object
Foundation object
BsmtQual object
...
BedroomAbvGr int64
KitchenAbvGr int64
KitchenQual object
TotRmsAbvGrd int64
Functional object
Fireplaces int64
FireplaceQu object
GarageType object
GarageYrBlt float64
GarageFinish object
GarageCars int64
GarageArea int64
GarageQual object
GarageCond object
PavedDrive object
WoodDeckSF int64
OpenPorchSF int64
EnclosedPorch int64
ThreeSsnPorch int64
ScreenPorch int64
PoolArea int64
PoolQC object
Fence object
MiscFeature object
MiscVal int64
MoSold int64
YrSold int64
SaleType object
SaleCondition object
SalePrice int64
Length: 80, dtype: object
Ordinal Categorical Features
The following features are ordinal, but that does not necessarily mean they should be numerical. There are two reasons why I am making the following ordinal features categorical.
-
We cannot say with certainty that the difference between two consecutive values of these ordinal features is not the same as the difference between between two other consecutive values. Take OverallQual for example. It would be reasonable to say that the difference between a house with “OverallQual 4 Below Average” (data description) and a house with “OverallQual 5 Average” is not the same as the difference between a house with OverallQual being very excellent and a house with OverallQual being excellent. We would say there is diminishing returns.
-
We could scale the values to reflect the true relationship, but that will likely introduce a lot of error. It is better to not use the ordinality of these features than to use it incorrectly.
OverallQual (Ordinal): Rates the overall material and finish of the house
10 Very Excellent 9 Excellent 8 Very Good 7 Good 6 Above Average 5 Average 4 Below Average 3 Fair 2 Poor 1 Very Poor
ordinal_f = ['MSSubClass' ,'OverallQual',
'OverallCond',
'FullBath','HalfBath',
'BedroomAbvGr','KitchenAbvGr',
'TotRmsAbvGrd','Fireplaces',
'MoSold', 'BsmtCond', 'BsmtQual',
'ExterCond', 'ExterQual',
'GarageQual', 'GarageCond', 'GarageCars',
'HeatingQC', 'KitchenQual',
'PoolQC',
'BsmtHalfBath', 'BsmtFullBath']
train[ordinal_f] = train[ordinal_f].apply(cl.obj_to_ordinal)
Cleaning Exterior2nd
From data description:
Exterior2nd: Exterior covering on house (if more than one material)
AsbShng Asbestos Shingles AsphShn Asphalt Shingles BrkComm Brick Common BrkFace Brick Face CBlock Cinder Block CemntBd Cement Board HdBoard Hard Board ImStucc Imitation Stucco MetalSd Metal Siding Other Other Plywood Plywood PreCast PreCast Stone Stone Stucco Stucco VinylSd Vinyl Siding Wd Sdng Wood Siding WdShing Wood Shingles
train.Exterior2nd.drop_duplicates().sort_values()
Id
70 AsbShng
299 AsphShn
196 Brk Cmn
31 BrkFace
1371 CBlock
24 CmentBd
8 HdBoard
275 ImStucc
2 MetalSd
596 Other
13 Plywood
530 Stone
64 Stucco
1 VinylSd
16 Wd Sdng
4 Wd Shng
Name: Exterior2nd, dtype: object
def clean_ext_typos(ext_col):
ext_col[ext_col == 'Brk Cmn'] = 'BrkComm'
ext_col[ext_col == 'Wd Shng'] = 'WdShing'
ext_col[ext_col == 'CmentBd'] = 'CemntBd'
return ext_col
train.Exterior2nd = cl.clean_ext_typos(train.Exterior2nd)
Nans that Add Information
Nan values are difficult to work with programmatically.The data description lists the following features as having nan value that add information unlike nans that mean a data point is unknown or missing. Those features describe physical structures that may not exist in a home. For example, a nan value in PoolQC means there is no pool. It stands to reason that if PoolQC is nan, then PoolArea should be nan or zero. Let’s replace nan values in columns with strings such as ‘none’ or the number zero.
- Alley
- BsmtQual
- BsmtCond
- BsmtExposure
- BsmtFinType1
- BsmtFinSF1
- BsmtFinType2
- BsmtFinSF2
- BsmtUnfSF
- TotalBsmtSF
- BsmtFullBath
- BsmtHalfBath
- FireplaceQu
- GarageType
- GarageYrBlt
- GarageFinish
- GarageCars
- GarageArea
- GarageQual
- GarageCond
- PoolArea
- PoolQC
- Fence
- MiscFeature
Replacing Nans with String ‘none’
train = cl.nan_to_str_none(train)
train.to_pickle('../data/to_clean_1_train.p')