Market Segmentation Using Customer Data 01 Data Preparation & Eda
library(mlbench); library(ggplot2); library(GGally); library(reshape2) # melt fn
Starting Project
We start by just getting a general idea of the data and formatting the data as appropriate. For example, the dataset may load some numbers in as string data type when a numeric data type would be more appropriate.
We will put code that pulls the dataset and makes those edits in an R file, called load_data.r, that I will run at the top of each notebook. The following notebooks will not run wholesale_data <- read.csv('../data/dataset.csv', header = TRUE)
but rather source('../src/load_data.r')
wholesale_data = read.csv('../data/dataset.csv', header = TRUE)
Dataset Description
We will just get some basic descriptors of the data set such as the dimensions, memory size, what the first dozen or so rows look like, data types in the dataset, and the dataset’s memory size. We will make sure that the data loaded correctly and matches the description from the UCI Machine Learning Repository.
# dim stands for dimensions
dim(wholesale_data)
<ol class=list-inline> <li>440</li> <li>8</li> </ol>
# size in memory
object.size(wholesale_data)
15512 bytes
# first dozen or so rows
head(wholesale_data)
2 | 3 | 12669 | 9656 | 7561 | 214 | 2674 | 1338 |
2 | 3 | 7057 | 9810 | 9568 | 1762 | 3293 | 1776 |
2 | 3 | 6353 | 8808 | 7684 | 2405 | 3516 | 7844 |
1 | 3 | 13265 | 1196 | 4221 | 6404 | 507 | 1788 |
2 | 3 | 22615 | 5410 | 7198 | 3915 | 1777 | 5185 |
2 | 3 | 9413 | 8259 | 5126 | 666 | 1795 | 1451 |
#str stands for structure
str(wholesale_data)
'data.frame': 440 obs. of 8 variables:
$ Channel : int 2 2 2 1 2 2 2 2 1 2 ...
$ Region : int 3 3 3 3 3 3 3 3 3 3 ...
$ Fresh : int 12669 7057 6353 13265 22615 9413 12126 7579 5963 6006 ...
$ Milk : int 9656 9810 8808 1196 5410 8259 3199 4956 3648 11093 ...
$ Grocery : int 7561 9568 7684 4221 7198 5126 6975 9426 6192 18881 ...
$ Frozen : int 214 1762 2405 6404 3915 666 480 1669 425 1159 ...
$ Detergents_Paper: int 2674 3293 3516 507 1777 1795 3140 3321 1716 7425 ...
$ Delicassen : int 1338 1776 7844 1788 5185 1451 545 2566 750 2098 ...
Summary Statistics
summary(wholesale_data)
Channel Region Fresh Milk
Min. :1.000 Min. :1.000 Min. : 3 Min. : 55
1st Qu.:1.000 1st Qu.:2.000 1st Qu.: 3128 1st Qu.: 1533
Median :1.000 Median :3.000 Median : 8504 Median : 3627
Mean :1.323 Mean :2.543 Mean : 12000 Mean : 5796
3rd Qu.:2.000 3rd Qu.:3.000 3rd Qu.: 16934 3rd Qu.: 7190
Max. :2.000 Max. :3.000 Max. :112151 Max. :73498
Grocery Frozen Detergents_Paper Delicassen
Min. : 3 Min. : 25.0 Min. : 3.0 Min. : 3.0
1st Qu.: 2153 1st Qu.: 742.2 1st Qu.: 256.8 1st Qu.: 408.2
Median : 4756 Median : 1526.0 Median : 816.5 Median : 965.5
Mean : 7951 Mean : 3071.9 Mean : 2881.5 Mean : 1524.9
3rd Qu.:10656 3rd Qu.: 3554.2 3rd Qu.: 3922.0 3rd Qu.: 1820.2
Max. :92780 Max. :60869.0 Max. :40827.0 Max. :47943.0
Data Conditioning
Fixing Spelling Error
There is a typo in the dataset. “Delicatessen” is spelled incorrectly. We will change column name to “Deli” which is more commonly used than the long form in the U.S. at least.
colnames(wholesale_data)[8] <- 'Deli'
head(wholesale_data)
2 | 3 | 12669 | 9656 | 7561 | 214 | 2674 | 1338 |
2 | 3 | 7057 | 9810 | 9568 | 1762 | 3293 | 1776 |
2 | 3 | 6353 | 8808 | 7684 | 2405 | 3516 | 7844 |
1 | 3 | 13265 | 1196 | 4221 | 6404 | 507 | 1788 |
2 | 3 | 22615 | 5410 | 7198 | 3915 | 1777 | 5185 |
2 | 3 | 9413 | 8259 | 5126 | 666 | 1795 | 1451 |
Making Dataset More Readable and Understandable
From the output of str(wholesale_data)
, we see that all the data is of type integer which matches the UCI Machine Learning repository’s description of the dataset. I will change those integers to the actual names of channels and regions.
We can what region 1, 2, and 3 correspond by matching the frequency of each region with the frequency given in the dataset description. In the dataset, the regions are called region 1, 2, and 3.
From the dataset description:
REGION Frequency
Lisbon 77
Oporto 47
Other Region 316
Total 440
We used the same method to determine what channels 1 and 2 were.
CHANNEL Frequency
Horeca 298
Retail 142
Total 440
# This will give me the frequency of 1 and 2 in Channel column
cat("Channel"); table(wholesale_data$Channel)
# This will give me the frequency of 1, 2, and 3 in Region column
cat("Region"); table(wholesale_data$Region)
Channel
1 2
298 142
Region
1 2 3
77 47 316
We can infer the following.
Region 1 is Lisbon
Region 2 is Oporto
Region 3 is “Other Region”.
Channel 1 is Horeca (which stands for hotel/restaurant/cafe).
Channel 2 is retail.
wholesale_data$Channel[wholesale_data$Channel == 1] <- "Horeca"
wholesale_data$Channel[wholesale_data$Channel == 2] <- "Retail"
wholesale_data$Region[wholesale_data$Region == 1] <- "Lisbon"
wholesale_data$Region[wholesale_data$Region == 2] <- "Oporto"
wholesale_data$Region[wholesale_data$Region == 3] <- "Other"
By changing the datatypes of the Channel and Region columns from integers to factors (categorical values in R), we are also telling R that those two features are not numerical factors. This affects calculations. We see this in how the summary statistics have changed.
summary(wholesale_data)
Channel Region Fresh Milk
Length:440 Length:440 Min. : 3 Min. : 55
Class :character Class :character 1st Qu.: 3128 1st Qu.: 1533
Mode :character Mode :character Median : 8504 Median : 3627
Mean : 12000 Mean : 5796
3rd Qu.: 16934 3rd Qu.: 7190
Max. :112151 Max. :73498
Grocery Frozen Detergents_Paper Deli
Min. : 3 Min. : 25.0 Min. : 3.0 Min. : 3.0
1st Qu.: 2153 1st Qu.: 742.2 1st Qu.: 256.8 1st Qu.: 408.2
Median : 4756 Median : 1526.0 Median : 816.5 Median : 965.5
Mean : 7951 Mean : 3071.9 Mean : 2881.5 Mean : 1524.9
3rd Qu.:10656 3rd Qu.: 3554.2 3rd Qu.: 3922.0 3rd Qu.: 1820.2
Max. :92780 Max. :60869.0 Max. :40827.0 Max. :47943.0
Checking for NA and Null Values
## checkg for NA
# in R: TRUE + TRUE = 2
print(sum(is.na(wholesale_data)))
print(sum(is.null(wholesale_data)))
[1] 0
[1] 0