Statistics and Statistical Data Mining - Coursework¶

Introduction

In this task, we were instructed to implement several data pre-processing and engineering
steps that are common in data science and machine learning. These steps involve several
key topics in statistics.

The dataset we were provided for this task contains data about house sale prices.
The file ‘data_description.txt’ contains a detailed description of all the variables,
what they represent, their values and so on. The target variable is ‘SalePrice’,
which is the house’s sale price in US dollars.

Implementation

1.

Import the required libraries.

In [1]:

Copied!





# to handle datasets
import pandas as pd
import numpy as np

# to visualise al the columns in the dataframe
pd.pandas.set_option('display.max_columns', None)
# suppress some warning
pd.options.mode.chained_assignment = None  # default='warn'

# ===== Add Imports here =====

# to visualize qq plot
import pylab

# to do normality tests 
import scipy.stats as stats
from scipy.stats import normaltest
from scipy.stats import chisquare
from scipy.stats import shapiro

# to do train/test data split
from sklearn.model_selection import train_test_split

# to one-hot encode data sets
from sklearn.preprocessing import OneHotEncoder

# to scale data sets
from sklearn.preprocessing import MinMaxScaler

# to handle datasets
import pandas as pd
import numpy as np

# to visualise al the columns in the dataframe
pd.pandas.set_option('display.max_columns', None)
# suppress some warning
pd.options.mode.chained_assignment = None  # default='warn'

# ===== Add Imports here =====

# to visualize qq plot
import pylab

# to do normality tests 
import scipy.stats as stats
from scipy.stats import normaltest
from scipy.stats import chisquare
from scipy.stats import shapiro

# to do train/test data split
from sklearn.model_selection import train_test_split

# to one-hot encode data sets
from sklearn.preprocessing import OneHotEncoder

# to scale data sets
from sklearn.preprocessing import MinMaxScaler

2.

Load the data using pandas and plot a Histogram of the SalePrice
column. This code is provided for you, do not change it

In [2]:

Copied!

# load dataset
data = pd.read_csv('house-price-data.csv')

data.hist(column='SalePrice',  bins = 250)
# load dataset
data = pd.read_csv('house-price-data.csv')

data.hist(column='SalePrice',  bins = 250)

Out[2]:

array([[<Axes: title={'center': 'SalePrice'}>]], dtype=object)

No description has been provided for this image

3.

The SalePrice column is not normally distributed (i.e. not Gaussian),
prove this by running a statistical test and obtaining and interpreting the
p-value.

[5 marks]

[1][2]

In [3]:

Copied!





# Test to check normality of the SalePrice column
stats.probplot(data['SalePrice'], dist="norm", plot=pylab)
pylab.show()

# do a normality test
normaltest_p = normaltest(data['SalePrice'])[1]
chisquare_p = chisquare(data['SalePrice'])[1]
shapiro_p = shapiro(data['SalePrice'])[1]
print("Test p's:")
print(f"- Normal: {normaltest_p}")
print(f"- Chi-Square: {chisquare_p}")
print(f"- Shapiro-Wilk: {shapiro_p}")

if (normaltest_p > 0.05 or chisquare_p > 0.05 or shapiro_p > 0.05):
    print("Does look Gaussian!")
else:
    print("Does not look Gaussian!")
# Test to check normality of the SalePrice column
stats.probplot(data['SalePrice'], dist="norm", plot=pylab)
pylab.show()

# do a normality test
normaltest_p = normaltest(data['SalePrice'])[1]
chisquare_p = chisquare(data['SalePrice'])[1]
shapiro_p = shapiro(data['SalePrice'])[1]
print("Test p's:")
print(f"- Normal: {normaltest_p}")
print(f"- Chi-Square: {chisquare_p}")
print(f"- Shapiro-Wilk: {shapiro_p}")

if (normaltest_p > 0.05 or chisquare_p > 0.05 or shapiro_p > 0.05):
    print("Does look Gaussian!")
else:
    print("Does not look Gaussian!")

Test p's:
- Normal: 2.283848353787728e-133
- Chi-Square: 0.0
- Shapiro-Wilk: 3.2072044604461286e-33
Does not look Gaussian!

4.

Split data into train and test sets making sure the test set is 30% of the
original data and the remaining 70% are for training. This code is
provided for you, do not change it.

[3]

In [4]:

Copied!





### Split data into train and test sets 
X = data.drop(columns=['SalePrice'])
y = data['SalePrice']

X_train, X_test, y_train, y_test = train_test_split(
    X, # features or predictive variables
    y, # target
    test_size=0.3, # portion of dataset to allocate to test set
    random_state=10, # we are setting the seed here
)

X_train.shape, X_test.shape
### Split data into train and test sets 
X = data.drop(columns=['SalePrice'])
y = data['SalePrice']

X_train, X_test, y_train, y_test = train_test_split(
    X, # features or predictive variables
    y, # target
    test_size=0.3, # portion of dataset to allocate to test set
    random_state=10, # we are setting the seed here
)

X_train.shape, X_test.shape

Out[4]:

((1022, 54), (438, 54))

5.

Create a list of all categorical variables (by checking their type in the
original dataset).

[2 marks]

Correct solution says there are 19 of them.

[4]

In [5]:

Copied!





# let's identify the categorical variables
# we will capture those of type object

# get columns filtered by type 'object'
category_columns = X_train.select_dtypes(include=['object'])
cat_vars = category_columns.dtypes.index.tolist()

# number of categorical variables
len(cat_vars)
# let's identify the categorical variables
# we will capture those of type object

# get columns filtered by type 'object'
category_columns = X_train.select_dtypes(include=['object'])
cat_vars = category_columns.dtypes.index.tolist()

# number of categorical variables
len(cat_vars)

Out[5]:

6.

Using the training set (X_train), create a list of all categorical variables
that contain missing data and print the percentage of missing values
per variable in X_train.

[3 marks]

[5]

In [6]:

Copied!





# make a list of the categorical variables that contain missing values

rows = X_train.shape[0]

cat_vars_with_na = (
    category_columns
        .isnull()
        .aggregate(func= lambda x: 100*sum(x)/rows)
        .sort_values(ascending= False)
        .loc[lambda x : x > 0.0]
)

# print percentage of missing values per variable

print(cat_vars_with_na)
# make a list of the categorical variables that contain missing values

rows = X_train.shape[0]

cat_vars_with_na = (
    category_columns
        .isnull()
        .aggregate(func= lambda x: 100*sum(x)/rows)
        .sort_values(ascending= False)
        .loc[lambda x : x > 0.0]
)

# print percentage of missing values per variable

print(cat_vars_with_na)

Alley           93.444227
Fence           80.528376
FireplaceQu     46.868885
GarageCond       5.381605
GarageQual       5.381605
GarageFinish     5.381605
BsmtFinType1     2.544031
BsmtFinType2     2.544031
BsmtExposure     2.544031
BsmtCond         2.544031
BsmtQual         2.544031
dtype: float64

7.

Using the result of the previous step: For categorical variables with
more than 10% of data missing, replace missing data with the word
'Missing', in other variables replace the missing data with the most
frequent category in the training set (Apply the replacement to X_train
and X_test and make sure it is based on the results you have obtained
from the training set).

[5 marks]

In [7]:

Copied!

# variables to impute with the string missing
with_string_missing = cat_vars_with_na.loc[lambda x : x > 10.0].index.tolist()

# variables to impute with the most frequent category
with_frequent_category = cat_vars_with_na.loc[lambda x : x <= 10.0].index.tolist()
# variables to impute with the string missing
with_string_missing = cat_vars_with_na.loc[lambda x : x > 10.0].index.tolist()

# variables to impute with the most frequent category
with_frequent_category = cat_vars_with_na.loc[lambda x : x <= 10.0].index.tolist()

In [8]:

Copied!

print(with_string_missing)
print(with_frequent_category)
print(with_string_missing)
print(with_frequent_category)

['Alley', 'Fence', 'FireplaceQu']
['GarageCond', 'GarageQual', 'GarageFinish', 'BsmtFinType1', 'BsmtFinType2', 'BsmtExposure', 'BsmtCond', 'BsmtQual']

In [9]:

Copied!

# replace missing values in X_train and X_test with new label: "Missing"

X_train[with_string_missing] = X_train[with_string_missing].fillna("Missing")
X_test[with_string_missing] = X_test[with_string_missing].fillna("Missing")
# replace missing values in X_train and X_test with new label: "Missing"

X_train[with_string_missing] = X_train[with_string_missing].fillna("Missing")
X_test[with_string_missing] = X_test[with_string_missing].fillna("Missing")

In [10]:

Copied!

# replace missing values in X_train and X_test with the mode of each variable

X_train[with_frequent_category] = X_train[with_frequent_category].fillna(X_train.mode().iloc[0])
X_test[with_frequent_category] = X_test[with_frequent_category].fillna(X_test.mode().iloc[0])
# replace missing values in X_train and X_test with the mode of each variable

X_train[with_frequent_category] = X_train[with_frequent_category].fillna(X_train.mode().iloc[0])
X_test[with_frequent_category] = X_test[with_frequent_category].fillna(X_test.mode().iloc[0])

In [11]:

Copied!

# check that we have no missing information in the engineered variables
X_train[cat_vars_with_na.index.tolist()].isnull().sum()
# check that we have no missing information in the engineered variables
X_train[cat_vars_with_na.index.tolist()].isnull().sum()

Out[11]:

Alley           0
Fence           0
FireplaceQu     0
GarageCond      0
GarageQual      0
GarageFinish    0
BsmtFinType1    0
BsmtFinType2    0
BsmtExposure    0
BsmtCond        0
BsmtQual        0
dtype: int64

8.

Create a list of all numerical variables (do not include SalePrice).

[2 marks]

Correct solution says there are 35 of them.

In [12]:

Copied!

# now let's identify the numerical variables

# get columns filtered by type 'int64' or 'float'
num_columns = X_train.select_dtypes(include=['int64','float'])
num_vars = num_columns.dtypes

# number of numerical variables
len(num_vars)
# now let's identify the numerical variables

# get columns filtered by type 'int64' or 'float'
num_columns = X_train.select_dtypes(include=['int64','float'])
num_vars = num_columns.dtypes

# number of numerical variables
len(num_vars)

Out[12]:

9.

Create a list of all numerical variables that contain missing data and
print out the percentage of missing values per variable (use the training
data).

[3 marks]

In [13]:

Copied!





# make a list with the numerical variables that contain missing values
rows = X_train.shape[0]

vars_with_na = (
    num_columns
        .isnull()
        .aggregate(func= lambda x: 100*sum(x)/rows)
        .sort_values(ascending= False)
        .loc[lambda x : x > 0.0]
)

# print percentage of missing values per variable

print(vars_with_na)
# make a list with the numerical variables that contain missing values
rows = X_train.shape[0]

vars_with_na = (
    num_columns
        .isnull()
        .aggregate(func= lambda x: 100*sum(x)/rows)
        .sort_values(ascending= False)
        .loc[lambda x : x > 0.0]
)

# print percentage of missing values per variable

print(vars_with_na)

LotFrontage    17.318982
GarageYrBlt     5.381605
MasVnrArea      0.489237
dtype: float64

10.

Using the result of the previous step: For numerical variables with less
than 15% of data missing, replace missing data with the mean of the
variable, in other variables replace the missing data with the median of
the variable in the training set (Apply the replacement to X_train and
X_test and make sure it is based on the results you have obtained from
the training set).

[5 marks]

In [14]:

Copied!

# variables to impute with the median
with_mean = vars_with_na.loc[lambda x : x < 15.0].index.tolist()

# variables to impute with the mean
with_median = vars_with_na.loc[lambda x : x >= 15.0].index.tolist()
# variables to impute with the median
with_mean = vars_with_na.loc[lambda x : x < 15.0].index.tolist()

# variables to impute with the mean
with_median = vars_with_na.loc[lambda x : x >= 15.0].index.tolist()

In [15]:

Copied!

# replace missing values as we described above

X_train[with_mean] = X_train[with_mean].fillna(X_train[with_mean].mean())
X_test[with_mean] = X_test[with_mean].fillna(X_test[with_mean].mean())

X_train[with_median] = X_train[with_median].fillna(X_train[with_median].median())
X_test[with_median] = X_test[with_median].fillna(X_test[with_median].median())
# replace missing values as we described above

X_train[with_mean] = X_train[with_mean].fillna(X_train[with_mean].mean())
X_test[with_mean] = X_test[with_mean].fillna(X_test[with_mean].mean())

X_train[with_median] = X_train[with_median].fillna(X_train[with_median].median())
X_test[with_median] = X_test[with_median].fillna(X_test[with_median].median())

In [16]:

Copied!

# check that we have no more missing values in the engineered variables
X_test[vars_with_na.index.tolist()].isnull().sum()
# check that we have no more missing values in the engineered variables
X_test[vars_with_na.index.tolist()].isnull().sum()

Out[16]:

LotFrontage    0
GarageYrBlt    0
MasVnrArea     0
dtype: int64

11.

In the train and test sets, replace the values of variables 'YearBuilt',
'YearRemodAdd' and 'GarageYrBlt' with the time elapsed between
them and the year in which the house was sold 'YrSold'. After that drop
the 'YrSold' column.

[5 marks]

In [17]:

Copied!

X_train[['YearBuilt','YearRemodAdd','GarageYrBlt']] = X_train[['YearBuilt','YearRemodAdd','GarageYrBlt']].subtract(X_train['YrSold'], axis=0) * -1
X_test[['YearBuilt','YearRemodAdd','GarageYrBlt']] = X_test[['YearBuilt','YearRemodAdd','GarageYrBlt']].subtract(X_test['YrSold'], axis=0) * -1
X_train[['YearBuilt','YearRemodAdd','GarageYrBlt']] = X_train[['YearBuilt','YearRemodAdd','GarageYrBlt']].subtract(X_train['YrSold'], axis=0) * -1
X_test[['YearBuilt','YearRemodAdd','GarageYrBlt']] = X_test[['YearBuilt','YearRemodAdd','GarageYrBlt']].subtract(X_test['YrSold'], axis=0) * -1

In [18]:

Copied!





# now we drop YrSold
X_train.drop(['YrSold'], axis=1, inplace=True)
X_test.drop(['YrSold'], axis=1, inplace=True)
X_train.head()
# now we drop YrSold
X_train.drop(['YrSold'], axis=1, inplace=True)
X_test.drop(['YrSold'], axis=1, inplace=True)
X_train.head()

Out[18]:

	LotFrontage	LotArea	Alley	LandContour	BldgType	HouseStyle	OverallQual	OverallCond	YearBuilt	YearRemodAdd	RoofStyle	MasVnrArea	ExterQual	ExterCond	BsmtQual	BsmtCond	BsmtExposure	BsmtFinType1	BsmtFinSF1	BsmtFinType2	BsmtFinSF2	BsmtUnfSF	TotalBsmtSF	HeatingQC	1stFlrSF	2ndFlrSF	GrLivArea	BsmtFullBath	FullBath	HalfBath	BedroomAbvGr	KitchenAbvGr	KitchenQual	TotRmsAbvGrd	Fireplaces	FireplaceQu	GarageYrBlt	GarageFinish	GarageCars	GarageArea	GarageQual	GarageCond	WoodDeckSF	OpenPorchSF	EnclosedPorch	ScreenPorch	Fence	MiscVal	MoSold
912	51.0	6120	Missing	Lvl	1Fam	1Story	5	7	81	56	Gable	0.0	TA	TA	TA	TA	No	Rec	489	Unf	0	279	768	TA	1015	0	1015	0	1	0	3	1	TA	6	0	Missing	81.0	Unf	1	450	TA	TA	0	0	112	120	MnPrv	620	7
373	79.0	10634	Missing	Lvl	1Fam	1Story	5	6	56	56	Gable	0.0	TA	TA	TA	TA	No	BLQ	428	LwQ	180	0	608	TA	1319	0	1319	1	1	0	3	1	TA	5	0	Missing	56.0	Unf	1	270	TA	TA	66	0	0	0	GdWo	0	11
20	101.0	14215	Missing	Lvl	1Fam	2Story	8	5	1	0	Gable	380.0	Gd	TA	Ex	TA	Av	Unf	0	Unf	0	1158	1158	Ex	1158	1218	2376	0	3	1	4	1	Gd	9	1	Gd	1.0	RFn	3	853	TA	TA	240	154	0	0	Missing	0	11
800	79.0	12798	Missing	HLS	1Fam	2Story	6	5	11	11	Gable	0.0	TA	TA	Gd	TA	Gd	GLQ	462	Unf	0	154	616	Gd	616	1072	1688	1	2	1	4	1	Gd	8	0	Missing	11.0	RFn	2	603	TA	TA	403	114	185	0	Missing	400	5
152	70.0	14803	Missing	Lvl	1Fam	2Story	6	5	35	35	Gable	252.0	TA	TA	TA	TA	No	Rec	416	Unf	0	409	825	Gd	1097	896	1993	0	2	1	4	1	TA	8	1	Gd	35.0	RFn	2	495	TA	TA	0	66	0	0	GdWo	0	6

12.

Apply mappings to categorical variables that have an order.

Some of the categorical variables have values with an assigned order (in total there should be 14 of them),
related to quality (For more information, check the data description file).
This means, we can replace categories by numbers to determine quality.
For example, values in the 'BsmtExposure' can be mapped as follows: 'No' can be mapped to 1, 'Mn' can be mapped to 2, 'Av' can be mapped to 3 and 'Gd' can be mapped to 4.

One way of doing this is to manually create mappings similar to the example given.
Each mapping can be saved as a Python dictionary and used to perform the actual mapping to transform the described variables from categorical to numerical.

To Make it easier for you, here are groups of variables that have the same mappings (Hint: you can map both categories 'Missing' and 'NA' to 0):

- The following variable groups have the same mapping: 
    - ['ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond', 'HeatingQC', 'KitchenQual', 'FireplaceQu','GarageQual', 'GarageCond']
    - ['BsmtFinType1', 'BsmtFinType2']

- Each of the following variables has its own mapping: 'BsmtExposure', 'GarageFinish', 'Fence'

[5 marks]

(in total there should be 14 of them).

[6]

In [19]:

Copied!

# re-map strings to numbers, which determine quality

qual_mappings = {'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5, 'Missing': 0, 'NA': 0}

qual_vars = ['ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond', 'HeatingQC', 'KitchenQual', 'FireplaceQu', 'GarageQual', 'GarageCond']

## Loop through the variables and apply the mapping here
# Apply mapping on the train and test sets

X_train[qual_vars] = X_train[qual_vars].applymap(lambda x: qual_mappings[x])
X_test[qual_vars] = X_test[qual_vars].applymap(lambda x: qual_mappings[x])
# re-map strings to numbers, which determine quality

qual_mappings = {'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5, 'Missing': 0, 'NA': 0}

qual_vars = ['ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond', 'HeatingQC', 'KitchenQual', 'FireplaceQu', 'GarageQual', 'GarageCond']

## Loop through the variables and apply the mapping here
# Apply mapping on the train and test sets

X_train[qual_vars] = X_train[qual_vars].applymap(lambda x: qual_mappings[x])
X_test[qual_vars] = X_test[qual_vars].applymap(lambda x: qual_mappings[x])

In [20]:

Copied!

exposure_mappings = {'No': 1, 'Mn': 2, 'Av': 3, 'Gd': 4}

var = 'BsmtExposure'

## Apply the mapping here
# Apply mapping on the train and test sets

X_train[var] = X_train[var].apply(lambda x: exposure_mappings[x])
X_test[var] = X_test[var].apply(lambda x: exposure_mappings[x])
exposure_mappings = {'No': 1, 'Mn': 2, 'Av': 3, 'Gd': 4}

var = 'BsmtExposure'

## Apply the mapping here
# Apply mapping on the train and test sets

X_train[var] = X_train[var].apply(lambda x: exposure_mappings[x])
X_test[var] = X_test[var].apply(lambda x: exposure_mappings[x])

In [21]:

Copied!

finish_mappings = {'Missing': 0, 'NA': 0, 'Unf': 1, 'LwQ': 2, 'Rec': 3, 'BLQ': 4, 'ALQ': 5, 'GLQ': 6}

finish_vars = ['BsmtFinType1', 'BsmtFinType2']

## Loop through the variables and apply the mapping here
# Apply mapping on the train and test sets

X_train[finish_vars] = X_train[finish_vars].applymap(lambda x: finish_mappings[x])
X_test[finish_vars] = X_test[finish_vars].applymap(lambda x: finish_mappings[x])
finish_mappings = {'Missing': 0, 'NA': 0, 'Unf': 1, 'LwQ': 2, 'Rec': 3, 'BLQ': 4, 'ALQ': 5, 'GLQ': 6}

finish_vars = ['BsmtFinType1', 'BsmtFinType2']

## Loop through the variables and apply the mapping here
# Apply mapping on the train and test sets

X_train[finish_vars] = X_train[finish_vars].applymap(lambda x: finish_mappings[x])
X_test[finish_vars] = X_test[finish_vars].applymap(lambda x: finish_mappings[x])

In [22]:

Copied!

garage_mappings = {'Missing': 0, 'NA': 0, 'Unf': 1, 'RFn': 2, 'Fin': 3}

var = 'GarageFinish'

## Apply the mapping here
# Apply mapping on the train and test sets

X_train[var] = X_train[var].apply(lambda x: garage_mappings[x])
X_test[var] = X_test[var].apply(lambda x: garage_mappings[x])
garage_mappings = {'Missing': 0, 'NA': 0, 'Unf': 1, 'RFn': 2, 'Fin': 3}

var = 'GarageFinish'

## Apply the mapping here
# Apply mapping on the train and test sets

X_train[var] = X_train[var].apply(lambda x: garage_mappings[x])
X_test[var] = X_test[var].apply(lambda x: garage_mappings[x])

In [23]:

Copied!

fence_mappings = {'Missing': 0, 'NA': 0, 'MnWw': 1, 'GdWo': 2, 'MnPrv': 3, 'GdPrv': 4}

var = 'Fence'

## Apply the mapping here
# Apply mapping on the train and test sets

X_train[var] = X_train[var].apply(lambda x: fence_mappings[x])
X_test[var] = X_test[var].apply(lambda x: fence_mappings[x])
fence_mappings = {'Missing': 0, 'NA': 0, 'MnWw': 1, 'GdWo': 2, 'MnPrv': 3, 'GdPrv': 4}

var = 'Fence'

## Apply the mapping here
# Apply mapping on the train and test sets

X_train[var] = X_train[var].apply(lambda x: fence_mappings[x])
X_test[var] = X_test[var].apply(lambda x: fence_mappings[x])

In [24]:

Copied!

# check absence of na in the train set
[var for var in X_train.columns if X_train[var].isnull().sum() > 0]
# check absence of na in the train set
[var for var in X_train.columns if X_train[var].isnull().sum() > 0]

Out[24]:

[]

In [25]:

Copied!

X_train.head()
X_train.head()

Out[25]:

	LotFrontage	LotArea	Alley	LandContour	BldgType	HouseStyle	OverallQual	OverallCond	YearBuilt	YearRemodAdd	RoofStyle	MasVnrArea	ExterQual	ExterCond	BsmtQual	BsmtCond	BsmtExposure	BsmtFinType1	BsmtFinSF1	BsmtFinType2	BsmtFinSF2	BsmtUnfSF	TotalBsmtSF	HeatingQC	1stFlrSF	2ndFlrSF	GrLivArea	BsmtFullBath	FullBath	HalfBath	BedroomAbvGr	KitchenAbvGr	KitchenQual	TotRmsAbvGrd	Fireplaces	FireplaceQu	GarageYrBlt	GarageFinish	GarageCars	GarageArea	GarageQual	GarageCond	WoodDeckSF	OpenPorchSF	EnclosedPorch	ScreenPorch	Fence	MiscVal	MoSold
912	51.0	6120	Missing	Lvl	1Fam	1Story	5	7	81	56	Gable	0.0	3	3	3	3	1	3	489	1	0	279	768	3	1015	0	1015	0	1	0	3	1	3	6	0	0	81.0	1	1	450	3	3	0	0	112	120	3	620	7
373	79.0	10634	Missing	Lvl	1Fam	1Story	5	6	56	56	Gable	0.0	3	3	3	3	1	4	428	2	180	0	608	3	1319	0	1319	1	1	0	3	1	3	5	0	0	56.0	1	1	270	3	3	66	0	0	0	2	0	11
20	101.0	14215	Missing	Lvl	1Fam	2Story	8	5	1	0	Gable	380.0	4	3	5	3	3	1	0	1	0	1158	1158	5	1158	1218	2376	0	3	1	4	1	4	9	1	4	1.0	2	3	853	3	3	240	154	0	0	0	0	11
800	79.0	12798	Missing	HLS	1Fam	2Story	6	5	11	11	Gable	0.0	3	3	4	3	4	6	462	1	0	154	616	4	616	1072	1688	1	2	1	4	1	4	8	0	0	11.0	2	2	603	3	3	403	114	185	0	0	400	5
152	70.0	14803	Missing	Lvl	1Fam	2Story	6	5	35	35	Gable	252.0	3	3	3	3	1	3	416	1	0	409	825	4	1097	896	1993	0	2	1	4	1	3	8	1	4	35.0	2	2	495	3	3	0	66	0	0	2	0	6

13.

For the remaining five categorical variables (the variables that you did not apply value mappings to, they should be five variables),
you will need to group those categories that are present in less than 1% of the observations in the training set.
That is, all values of categorical variables that are shared by less than 1% of houses in the training set will be replaced by the string "Rare" in both the training and test set. 


Find rare labels in the remaining categorical variables and replace them with the category 'Rare'.
Rare labels are those categories that only appear in a small percentage of the observations (in our case in < 1%).
If you look at unique values in a categorical variable in the training set and count how many times each of the unique values appear in the variable,
you can compute the percentage of each unique value by dividing its count by the total number of observations.
Remember to make the computions using the training set and replacement in both training and test sets.

[5 marks]

[7]

In [26]:

Copied!

# capture all quality variables

qual_vars  = qual_vars + finish_vars + ['BsmtExposure','GarageFinish','Fence']

# capture the remaining categorical variables
# (those that we did not re-map)

cat_others = [
    var for var in cat_vars if var not in qual_vars
]

len(cat_others)
# capture all quality variables

qual_vars  = qual_vars + finish_vars + ['BsmtExposure','GarageFinish','Fence']

# capture the remaining categorical variables
# (those that we did not re-map)

cat_others = [
    var for var in cat_vars if var not in qual_vars
]

len(cat_others)

Out[26]:

In [27]:

Copied!

#values before mapping
for cat in cat_others:
    print(str(cat) + ": " + str(list(dict.fromkeys(X_train[cat].sort_values().tolist()))))
#values before mapping
for cat in cat_others:
    print(str(cat) + ": " + str(list(dict.fromkeys(X_train[cat].sort_values().tolist()))))

Alley: ['Grvl', 'Missing', 'Pave']
LandContour: ['Bnk', 'HLS', 'Low', 'Lvl']
BldgType: ['1Fam', '2fmCon', 'Duplex', 'Twnhs', 'TwnhsE']
HouseStyle: ['1.5Fin', '1.5Unf', '1Story', '2.5Fin', '2.5Unf', '2Story', 'SFoyer', 'SLvl']
RoofStyle: ['Flat', 'Gable', 'Gambrel', 'Hip', 'Mansard', 'Shed']

In [28]:

Copied!





## Loop through the variables and apply the replacement here
# Apply replacement on the train and test sets

for cat in cat_others:
    frequencies = X_train[cat].value_counts(normalize=True)
    
    mapping = X_train[cat].map(frequencies)
    X_train[cat] = X_train[cat].mask(mapping < 0.01, 'Rare')
    
    mapping = X_test[cat].map(frequencies)
    X_test[cat] = X_test[cat].mask(mapping < 0.01, 'Rare')
## Loop through the variables and apply the replacement here
# Apply replacement on the train and test sets

for cat in cat_others:
    frequencies = X_train[cat].value_counts(normalize=True)
    
    mapping = X_train[cat].map(frequencies)
    X_train[cat] = X_train[cat].mask(mapping < 0.01, 'Rare')
    
    mapping = X_test[cat].map(frequencies)
    X_test[cat] = X_test[cat].mask(mapping < 0.01, 'Rare')

In [29]:

Copied!

#values after mapping
for cat in cat_others:
    print(str(cat) + ": " + str(list(dict.fromkeys(X_train[cat].sort_values().tolist()))))
#values after mapping
for cat in cat_others:
    print(str(cat) + ": " + str(list(dict.fromkeys(X_train[cat].sort_values().tolist()))))

Alley: ['Grvl', 'Missing', 'Pave']
LandContour: ['Bnk', 'HLS', 'Low', 'Lvl']
BldgType: ['1Fam', '2fmCon', 'Duplex', 'Twnhs', 'TwnhsE']
HouseStyle: ['1.5Fin', '1Story', '2Story', 'Rare', 'SFoyer', 'SLvl']
RoofStyle: ['Gable', 'Hip', 'Rare']

14.

Perform one hot encoding to transform the previous five categorical
variables into binary variables. Make sure you do it correctly for both
the training and testing sets. After this, remember to drop the original
five categorical variables (the ones with the strings) from the training
and test after the encoding.

[5 marks]

[8]

In [30]:

Copied!





## Loop through the variables and apply the one-hot encoding here
# Apply encoding on the train and test sets
# Pay attention and make sure you do it correctly

enc = OneHotEncoder(sparse_output=False,drop='first')

X_train = pd.concat([X_train, pd.DataFrame(enc.fit_transform(X_train[cat_others]), index=X_train.index, columns=enc.get_feature_names_out())], axis='columns')
X_test = pd.concat([X_test, pd.DataFrame(enc.transform(X_test[cat_others]), index=X_test.index, columns=enc.get_feature_names_out())], axis='columns')
## Loop through the variables and apply the one-hot encoding here
# Apply encoding on the train and test sets
# Pay attention and make sure you do it correctly

enc = OneHotEncoder(sparse_output=False,drop='first')

X_train = pd.concat([X_train, pd.DataFrame(enc.fit_transform(X_train[cat_others]), index=X_train.index, columns=enc.get_feature_names_out())], axis='columns')
X_test = pd.concat([X_test, pd.DataFrame(enc.transform(X_test[cat_others]), index=X_test.index, columns=enc.get_feature_names_out())], axis='columns')

In [31]:

Copied!

X_train
X_train

Out[31]:

	LotFrontage	LotArea	Alley	LandContour	BldgType	HouseStyle	OverallQual	OverallCond	YearBuilt	YearRemodAdd	RoofStyle	MasVnrArea	ExterQual	ExterCond	BsmtQual	BsmtCond	BsmtExposure	BsmtFinType1	BsmtFinSF1	BsmtFinType2	BsmtFinSF2	BsmtUnfSF	TotalBsmtSF	HeatingQC	1stFlrSF	2ndFlrSF	LowQualFinSF	GrLivArea	BsmtFullBath	BsmtHalfBath	FullBath	HalfBath	BedroomAbvGr	KitchenAbvGr	KitchenQual	TotRmsAbvGrd	Fireplaces	FireplaceQu	GarageYrBlt	GarageFinish	GarageCars	GarageArea	GarageQual	GarageCond	WoodDeckSF	OpenPorchSF	EnclosedPorch	3SsnPorch	ScreenPorch	PoolArea	Fence	MiscVal	MoSold	Alley_Missing	Alley_Pave	LandContour_HLS	LandContour_Low	LandContour_Lvl	BldgType_2fmCon	BldgType_Duplex	BldgType_Twnhs	BldgType_TwnhsE	HouseStyle_1Story	HouseStyle_2Story	HouseStyle_Rare	HouseStyle_SFoyer	HouseStyle_SLvl	RoofStyle_Hip	RoofStyle_Rare
912	51.0	6120	Missing	Lvl	1Fam	1Story	5	7	81	56	Gable	0.0	3	3	3	3	1	3	489	1	0	279	768	3	1015	0	0	1015	0	0	1	0	3	1	3	6	0	0	81.0	1	1	450	3	3	0	0	112	0	120	0	3	620	7	1.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0
373	79.0	10634	Missing	Lvl	1Fam	1Story	5	6	56	56	Gable	0.0	3	3	3	3	1	4	428	2	180	0	608	3	1319	0	0	1319	1	0	1	0	3	1	3	5	0	0	56.0	1	1	270	3	3	66	0	0	0	0	0	2	0	11	1.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0
20	101.0	14215	Missing	Lvl	1Fam	2Story	8	5	1	0	Gable	380.0	4	3	5	3	3	1	0	1	0	1158	1158	5	1158	1218	0	2376	0	0	3	1	4	1	4	9	1	4	1.0	2	3	853	3	3	240	154	0	0	0	0	0	0	11	1.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0
800	79.0	12798	Missing	HLS	1Fam	2Story	6	5	11	11	Gable	0.0	3	3	4	3	4	6	462	1	0	154	616	4	616	1072	0	1688	1	0	2	1	4	1	4	8	0	0	11.0	2	2	603	3	3	403	114	185	0	0	0	0	400	5	1.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0
152	70.0	14803	Missing	Lvl	1Fam	2Story	6	5	35	35	Gable	252.0	3	3	3	3	1	3	416	1	0	409	825	4	1097	896	0	1993	0	0	2	1	4	1	3	8	1	4	35.0	2	2	495	3	3	0	66	0	0	0	0	2	0	6	1.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1393	60.0	10800	Pave	Lvl	2fmCon	1.5Fin	6	7	103	8	Gable	0.0	3	3	2	3	1	1	0	1	0	482	482	5	1221	691	0	1912	0	0	2	0	3	2	3	7	1	3	5.0	1	2	672	4	3	0	25	212	0	0	0	0	0	4	0.0	1.0	0.0	0.0	1.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
1344	85.0	11103	Missing	Lvl	1Fam	2Story	7	5	1	1	Gable	0.0	4	3	4	3	1	1	0	1	0	728	728	5	728	728	0	1456	0	0	2	1	3	1	4	8	1	3	1.0	3	2	440	3	3	0	0	0	0	0	0	0	0	7	1.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0
527	67.0	14948	Missing	Lvl	1Fam	2Story	9	5	0	0	Hip	268.0	5	3	5	3	3	6	1330	1	0	122	1452	5	1476	1237	0	2713	1	0	2	1	3	1	5	11	1	4	-0.0	3	3	858	3	3	126	66	0	0	0	0	0	0	11	1.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	1.0	0.0
1149	50.0	9000	Missing	Lvl	1Fam	2Story	7	9	89	21	Hip	0.0	3	4	3	3	1	5	624	1	0	26	650	5	832	650	0	1482	0	1	1	0	3	1	3	7	0	0	79.0	1	2	324	3	3	0	0	0	0	0	0	0	0	7	1.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	1.0	0.0
1289	86.0	11065	Missing	Lvl	1Fam	2Story	8	5	0	0	Gable	788.0	4	3	4	3	2	1	0	1	0	1085	1085	5	1120	850	0	1970	0	0	2	1	3	1	5	8	1	4	-0.0	3	3	753	3	3	177	74	0	0	0	0	0	0	10	1.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0

1022 rows × 69 columns

In [32]:

Copied!

# check absence of na in the train set
[var for var in X_train.columns if X_train[var].isnull().sum() > 0]
# check absence of na in the train set
[var for var in X_train.columns if X_train[var].isnull().sum() > 0]

Out[32]:

[]

In [33]:

Copied!

# check absence of na in the test set
[var for var in X_test.columns if X_test[var].isnull().sum() > 0]
# check absence of na in the test set
[var for var in X_test.columns if X_test[var].isnull().sum() > 0]

Out[33]:

[]

In [34]:

Copied!

X_train.drop(cat_others, axis=1, inplace=True)
X_test.drop(cat_others, axis=1, inplace=True)
X_train.drop(cat_others, axis=1, inplace=True)
X_test.drop(cat_others, axis=1, inplace=True)

15.

Feature Scaling. Now we know all variables in our two datasets (i.e. the
training and test sets) are numerical, the final step in this exercise is to
apply scaling by making sure the minimum value in each variable is 0
and the maximum value is 1. For this step, you can use
MinMaxScaler() from sci-kit learn. Make sure you apply it correctly by
transforming the test set based on the training set.

[5 marks]

In [35]:

Copied!





## Apply scaling here
# Make sure you do it correctly
scaler = MinMaxScaler()
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns = X_train.columns)
X_test = pd.DataFrame(scaler.transform(X_test), columns = X_test.columns)
## Apply scaling here
# Make sure you do it correctly
scaler = MinMaxScaler()
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns = X_train.columns)
X_test = pd.DataFrame(scaler.transform(X_test), columns = X_test.columns)

In [36]:

Copied!

print(X_train.mean().mean())
print(X_test.mean().mean())
print(X_train.mean().mean())
print(X_test.mean().mean())

0.24977436797009803
0.24728067105086157

After applying all the previous steps, the overall mean value of all entries in
the training set was approximately 0.249 and in the test set was
approximately 0.247.

In [37]:

Copied!

pip freeze > requirements.txt
pip freeze > requirements.txt

Note: you may need to restart the kernel to use updated packages.

References

Ebner, J. (2022) How to Use Sklearn train_test_split in Python - Sharp Sight. Available at: https://www.sharpsightlabs.com/blog/scikit-train_test_split/ (Accessed: 23 May 2023).

Gilad, M. (2020) ‘Handling rare categorical values in Pandas 🐼’, Gett Tech, 6 February. Available at: https://medium.com/gett-engineering/handling-rare-categorical-values-in-pandas-d1e3f17475f0 (Accessed: 23 May 2023).

How to Check the Data Type in Pandas DataFrame – Data to Fish (2021). Available at: https://datatofish.com/data-type-pandas-dataframe/ (Accessed: 23 May 2023).

Pandas Series filter() Function (2022) Spark By {Examples}. Available at: https://sparkbyexamples.com/pandas/pandas-series-filter/ (Accessed: 23 May 2023).

Pankaj (2022) Pandas DataFrame apply() Examples | DigitalOcean. Available at: https://www.digitalocean.com/community/tutorials/pandas-dataframe-apply-examples (Accessed: 23 May 2023).

Pygirl (2021) ‘Answer to “How to Avoid getting NaN when applying One Hot Encoding in Pandas”’, Stack Overflow. Available at: https://stackoverflow.com/a/67857871 (Accessed: 23 May 2023).

Tiwari, S. (2021) Guide to Normality Tests in Python. Available at: https://kaggle.com/code/shashwatwork/guide-to-normality-tests-in-python (Accessed: 23 May 2023).

Zach (2022) ‘How to Test for Normality in Python (4 Methods)’, Statology, 11 June. Available at: https://www.statology.org/normality-test-python/ (Accessed: 23 May 2023).