 
					Predicting Housing Prices 010 Data Prep Formatting
import numpy as np
import pandas as pd
import warnings; warnings.filterwarnings('ignore')
import pickle
import lib.clean_helper as cl
Looking at Data and Data Preparation
My first step is to always look at some basic descriptors of the dataset, make sure the data loaded correctly, and format the data as necessary. After determining how I want my dataset to be formatted, I will make those formats in an python file that I will load in each notebook. Rather than load the raw dataset, I will load my python file.
Loading Data
raw_train = pd.read_csv('../data/kaggle_train.csv', index_col='Id')
train = raw_train.copy()
Getting Basic Descriptors of Dataset
I always start by making sure the loaded data matches the data description and get a general idea of the dataset in the process
Looking at Dimensions of Dataframe
The dataframe has 81 columns which matches the dataset description. According to the data description, the dataset includes 79 explanatory variables. This does not conflict with the loaded dataset because one column, Id does not contain an explanatory variable, and SalePrice is the target not an explantory variable.
For reference, the description can be found in the doc directory of this repository.
dimensions = train.shape; print(dimensions)
(1460, 80)
train.head(3)
| MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Id | |||||||||||||||||||||
| 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 | 
| 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | FR2 | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 | 
| 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | Inside | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 | 
3 rows × 80 columns
train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1460 entries, 1 to 1460
Data columns (total 80 columns):
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-null object
Exterior2nd      1460 non-null object
MasVnrType       1452 non-null object
MasVnrArea       1452 non-null float64
ExterQual        1460 non-null object
ExterCond        1460 non-null object
Foundation       1460 non-null object
BsmtQual         1423 non-null object
BsmtCond         1423 non-null object
BsmtExposure     1422 non-null object
BsmtFinType1     1423 non-null object
BsmtFinSF1       1460 non-null int64
BsmtFinType2     1422 non-null object
BsmtFinSF2       1460 non-null int64
BsmtUnfSF        1460 non-null int64
TotalBsmtSF      1460 non-null int64
Heating          1460 non-null object
HeatingQC        1460 non-null object
CentralAir       1460 non-null object
Electrical       1459 non-null object
1stFlrSF         1460 non-null int64
2ndFlrSF         1460 non-null int64
LowQualFinSF     1460 non-null int64
GrLivArea        1460 non-null int64
BsmtFullBath     1460 non-null int64
BsmtHalfBath     1460 non-null int64
FullBath         1460 non-null int64
HalfBath         1460 non-null int64
BedroomAbvGr     1460 non-null int64
KitchenAbvGr     1460 non-null int64
KitchenQual      1460 non-null object
TotRmsAbvGrd     1460 non-null int64
Functional       1460 non-null object
Fireplaces       1460 non-null int64
FireplaceQu      770 non-null object
GarageType       1379 non-null object
GarageYrBlt      1379 non-null float64
GarageFinish     1379 non-null object
GarageCars       1460 non-null int64
GarageArea       1460 non-null int64
GarageQual       1379 non-null object
GarageCond       1379 non-null object
PavedDrive       1460 non-null object
WoodDeckSF       1460 non-null int64
OpenPorchSF      1460 non-null int64
EnclosedPorch    1460 non-null int64
3SsnPorch        1460 non-null int64
ScreenPorch      1460 non-null int64
PoolArea         1460 non-null int64
PoolQC           7 non-null object
Fence            281 non-null object
MiscFeature      54 non-null object
MiscVal          1460 non-null int64
MoSold           1460 non-null int64
YrSold           1460 non-null int64
SaleType         1460 non-null object
SaleCondition    1460 non-null object
SalePrice        1460 non-null int64
dtypes: float64(3), int64(34), object(43)
memory usage: 923.9+ KB
Data Preparation
Renaming 3SsnPorch to ThreeSsnPorch
It is easier to work with a column whose name does not start with an number character
train = train.rename(columns={ '3SsnPorch': 'ThreeSsnPorch'})
Classifying Numerical and Categorical Features
train.dtypes
MSSubClass         int64
MSZoning          object
LotFrontage      float64
LotArea            int64
Street            object
Alley             object
LotShape          object
LandContour       object
Utilities         object
LotConfig         object
LandSlope         object
Neighborhood      object
Condition1        object
Condition2        object
BldgType          object
HouseStyle        object
OverallQual        int64
OverallCond        int64
YearBuilt          int64
YearRemodAdd       int64
RoofStyle         object
RoofMatl          object
Exterior1st       object
Exterior2nd       object
MasVnrType        object
MasVnrArea       float64
ExterQual         object
ExterCond         object
Foundation        object
BsmtQual          object
                  ...   
BedroomAbvGr       int64
KitchenAbvGr       int64
KitchenQual       object
TotRmsAbvGrd       int64
Functional        object
Fireplaces         int64
FireplaceQu       object
GarageType        object
GarageYrBlt      float64
GarageFinish      object
GarageCars         int64
GarageArea         int64
GarageQual        object
GarageCond        object
PavedDrive        object
WoodDeckSF         int64
OpenPorchSF        int64
EnclosedPorch      int64
ThreeSsnPorch      int64
ScreenPorch        int64
PoolArea           int64
PoolQC            object
Fence             object
MiscFeature       object
MiscVal            int64
MoSold             int64
YrSold             int64
SaleType          object
SaleCondition     object
SalePrice          int64
Length: 80, dtype: object
Ordinal Categorical Features
The following features are ordinal, but that does not necessarily mean they should be numerical. There are two reasons why I am making the following ordinal features categorical.
- 
    We cannot say with certainty that the difference between two consecutive values of these ordinal features is not the same as the difference between between two other consecutive values. Take OverallQual for example. It would be reasonable to say that the difference between a house with “OverallQual 4 Below Average” (data description) and a house with “OverallQual 5 Average” is not the same as the difference between a house with OverallQual being very excellent and a house with OverallQual being excellent. We would say there is diminishing returns. 
- 
    We could scale the values to reflect the true relationship, but that will likely introduce a lot of error. It is better to not use the ordinality of these features than to use it incorrectly. 
OverallQual (Ordinal): Rates the overall material and finish of the house
10 Very Excellent 9 Excellent 8 Very Good 7 Good 6 Above Average 5 Average 4 Below Average 3 Fair 2 Poor 1 Very Poor
ordinal_f = ['MSSubClass' ,'OverallQual',
             'OverallCond',
             'FullBath','HalfBath',
             'BedroomAbvGr','KitchenAbvGr',
             'TotRmsAbvGrd','Fireplaces',
             'MoSold', 'BsmtCond', 'BsmtQual',
             'ExterCond', 'ExterQual',
             'GarageQual', 'GarageCond', 'GarageCars',
             'HeatingQC', 'KitchenQual',
             'PoolQC',
             'BsmtHalfBath', 'BsmtFullBath']            
train[ordinal_f] = train[ordinal_f].apply(cl.obj_to_ordinal)
Cleaning Exterior2nd
From data description:
Exterior2nd: Exterior covering on house (if more than one material)
AsbShng Asbestos Shingles AsphShn Asphalt Shingles BrkComm Brick Common BrkFace Brick Face CBlock Cinder Block CemntBd Cement Board HdBoard Hard Board ImStucc Imitation Stucco MetalSd Metal Siding Other Other Plywood Plywood PreCast PreCast Stone Stone Stucco Stucco VinylSd Vinyl Siding Wd Sdng Wood Siding WdShing Wood Shingles
train.Exterior2nd.drop_duplicates().sort_values()
Id
70      AsbShng
299     AsphShn
196     Brk Cmn
31      BrkFace
1371     CBlock
24      CmentBd
8       HdBoard
275     ImStucc
2       MetalSd
596       Other
13      Plywood
530       Stone
64       Stucco
1       VinylSd
16      Wd Sdng
4       Wd Shng
Name: Exterior2nd, dtype: object
def clean_ext_typos(ext_col):
    ext_col[ext_col == 'Brk Cmn'] = 'BrkComm'
    ext_col[ext_col == 'Wd Shng'] = 'WdShing'
    ext_col[ext_col == 'CmentBd'] = 'CemntBd'
    
    return ext_col
train.Exterior2nd = cl.clean_ext_typos(train.Exterior2nd)
Nans that Add Information
Nan values are difficult to work with programmatically.The data description lists the following features as having nan value that add information unlike nans that mean a data point is unknown or missing. Those features describe physical structures that may not exist in a home. For example, a nan value in PoolQC means there is no pool. It stands to reason that if PoolQC is nan, then PoolArea should be nan or zero. Let’s replace nan values in columns with strings such as ‘none’ or the number zero.
- Alley
- BsmtQual
- BsmtCond
- BsmtExposure
- BsmtFinType1
- BsmtFinSF1
- BsmtFinType2
- BsmtFinSF2
- BsmtUnfSF
- TotalBsmtSF
- BsmtFullBath
- BsmtHalfBath
- FireplaceQu
- GarageType
- GarageYrBlt
- GarageFinish
- GarageCars
- GarageArea
- GarageQual
- GarageCond
- PoolArea
- PoolQC
- Fence
- MiscFeature
Replacing Nans with String ‘none’
train = cl.nan_to_str_none(train)
train.to_pickle('../data/to_clean_1_train.p')