For the coursework, please make sure to implement your own code and not use libraries (except where explicitly asked). You will need to present your own code that performs nested cross-validation, implements the k-nearest neighbour algorithm, builds confusion matrices, and estimates distances between data samples.
The purpose of this coursework is to help you:
- Get familiar with common Python modules and functions used for ML
- Get practical experience implementing ML methods in Python
- Get practical experience with parameter selection for ML methods
- Get practical experience evaluating ML methods and applying cross-validation
Notes:
- Don't use libraries that implement kNN or cross-validation. We want to see your code!
- Remember to comment all of your code (see here for tips: https://stackabuse.com/commenting-python-code/). You can also make use of Jupyter Markdown, where appropriate, to improve the layout of your code and documentation.
- Please add docstrings to all of your functions, so that users can get information on inputs/outputs and what each function does by typing SHIFT+TAB over the function name. For more detail on Python docstrings, see here: https://numpydoc.readthedocs.io/en/latest/format.html
- When a question allows a free-form answer (e.g. what do you observe?), create a new markdown cell below and answer the question in the notebook.
- Always save your notebook when you are done (this is not automatic)!
- Upload your completed notebook using the VLE.
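As an illustration of the docstring note above, here is a minimal sketch of a numpydoc-style docstring on a small helper. The function name `euclidean_distance` and its signature are hypothetical choices made for this example, not a required interface for the coursework.

```python
import math


def euclidean_distance(a, b):
    """
    Compute the Euclidean distance between two samples.

    Parameters
    ----------
    a : sequence of float
        Feature values of the first sample.
    b : sequence of float
        Feature values of the second sample; must have the same length as `a`.

    Returns
    -------
    float
        The straight-line distance between `a` and `b`.

    Raises
    ------
    ValueError
        If `a` and `b` have different lengths.
    """
    if len(a) != len(b):
        raise ValueError("samples must have the same number of features")
    # square root of the summed squared differences, feature by feature
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))  # prints 5.0
```

With a docstring in this layout, SHIFT+TAB in Jupyter (or `help(euclidean_distance)`) shows the parameters, return value, and possible errors without the reader opening the source.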
Plagiarism: please make sure that the material you submit has been created by you. Any sources you use for code should be properly referenced. Your code will be checked for plagiarism using appropriate software.
Marking
The grades in this coursework are allocated approximately as follows:
Component | Marks |
---|---|
Data exploration (+ 2 questions) | 9 |
Code, documentation & comments (kNN + evaluation + NCV) | 12 |
Results (kNN folds + summary + confusion matrices) | 9 |
Final questions | 9 |
Overall quality & use of Markdown | 6 |
Total available | 45 |
import seaborn as sns
import math
1. Exploratory Data Analysis [9 pts]
In this coursework we are going to be working with the Wine dataset. This is a 178-sample dataset that categorises three types of Italian wine using 13 features. The code below loads the Wine dataset and selects a subset of four features for you to work with.
# set matplotlib backend to inline
%matplotlib inline
# import modules
from sklearn import datasets
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# load data
wine=datasets.load_wine()
#print(wine.DESCR)
# this dataset has 13 features, we will only choose a subset of these
df_wine = pd.DataFrame(wine.data, columns = wine.feature_names )
selected_features = ['alcohol','flavanoids','color_intensity','ash']
# extract the data as numpy arrays of features, X, and target, y
X = df_wine[selected_features].values
y = wine.target
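Before plotting, it can help to confirm what was loaded. The following is a minimal sketch of such a check, assuming scikit-learn and NumPy are available; the names `wine_check`, `chosen`, `col_idx`, `X_check`, and `y_check` are hypothetical and chosen so that the cell does not overwrite `X` and `y` above.

```python
import numpy as np
from sklearn import datasets

# Reload the dataset so that this cell runs on its own
wine_check = datasets.load_wine()
chosen = ['alcohol', 'flavanoids', 'color_intensity', 'ash']

# Column positions of the chosen features within the full 13-feature array
col_idx = [list(wine_check.feature_names).index(name) for name in chosen]
X_check = wine_check.data[:, col_idx]
y_check = wine_check.target

print(X_check.shape)            # (178, 4)
print(np.bincount(y_check))     # [59 71 48] samples per class
print(np.isnan(X_check).sum())  # 0 missing values

# Per-feature minimum and maximum: the ranges differ between features,
# which matters for distance-based methods such as kNN
print(X_check.min(axis=0), X_check.max(axis=0))
```

The class counts show that the three classes are not perfectly balanced, which is worth keeping in mind when reading accuracy figures later on.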
1.1. Visualising the data
The first part of tackling any ML problem is visualising the data in order to understand some of the properties of the problem at hand. When there are only a small number of classes and features, it is possible to use scatter plots to visualise interactions between different pairings of features.
The following image shows what such a visualisation might look like on the Iris dataset that you worked on during the Topic exercises.
Your first task is to recreate a similar grid for the Wine dataset, with each off-diagonal subplot showing the interaction between two features, and each of the classes represented as a different colour. The on-diagonal subplots (representing a single feature) should show a distribution (or histogram) for that feature.
You should create a function that, given data X and labels y, plots this grid. The function should be invoked something like this: myplotGrid(X,y,...)
where X is your training data and y contains the labels (you may also supply additional optional arguments). You can use an appropriate library to help you create the visualisation. You might want to code it yourself using the matplotlib functions scatter and hist. However, this is not strictly necessary here, so try not to spend too much time on this.
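# define plotting function
def myplotGrid(x_data, y_data, columns=None, class_name='class'):
    """
    Plot a grid of pairwise feature plots, coloured by class.

    Parameters
    ----------
    x_data : array-like of shape (n_samples, n_features)
        Feature values.
    y_data : array-like of shape (n_samples,)
        Class label of each sample.
    columns : list of str, optional
        Feature names used as axis labels.
    class_name : str, optional
        Name of the label column; also the legend title.
    """
    df = pd.DataFrame(x_data)
    if columns is not None and len(columns) > 0:
        df.columns = columns
    df[class_name] = y_data
    sns.set_theme(style="ticks")
    # one colour per class; the diagonal shows each feature's distribution
    n_classes = len(np.unique(y_data))
    sns.pairplot(df, hue=class_name, palette=sns.color_palette()[:n_classes])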
# run the plotting function
myplotGrid(X,y,selected_features,'wine')
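For comparison, the same kind of grid can be built directly from the matplotlib functions scatter and hist mentioned in the task description. The sketch below is one possible layout, not the expected solution: the function name `plot_grid_matplotlib`, the bin count, and the marker size are assumptions made for illustration, and the demo data is random stand-in data rather than the Wine dataset.

```python
import numpy as np
import matplotlib.pyplot as plt


def plot_grid_matplotlib(X, y, feature_names):
    """Draw an n x n grid: histograms on the diagonal, scatter plots elsewhere."""
    X = np.asarray(X)
    y = np.asarray(y)
    n_features = X.shape[1]
    classes = np.unique(y)
    fig, axes = plt.subplots(n_features, n_features,
                             figsize=(3 * n_features, 3 * n_features),
                             squeeze=False)
    for row in range(n_features):
        for col in range(n_features):
            ax = axes[row, col]
            for label in classes:
                mask = y == label
                if row == col:
                    # on-diagonal: distribution of a single feature, per class
                    ax.hist(X[mask, col], bins=15, alpha=0.5, label=str(label))
                else:
                    # off-diagonal: column feature on x, row feature on y
                    ax.scatter(X[mask, col], X[mask, row], s=10, label=str(label))
            if row == n_features - 1:
                ax.set_xlabel(feature_names[col])
            if col == 0:
                ax.set_ylabel(feature_names[row])
    axes[0, 0].legend(title="class")
    fig.tight_layout()
    return fig, axes


# Stand-in data (random, NOT the Wine dataset), used only to exercise the function
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 3))
y_demo = np.repeat([0, 1, 2], 20)
fig_demo, axes_demo = plot_grid_matplotlib(X_demo, y_demo, ["f0", "f1", "f2"])
```

In the notebook, the equivalent call on the real data would be `plot_grid_matplotlib(X, y, selected_features)`. The seaborn version above is shorter; the matplotlib version gives direct control over each subplot.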