This week, you are asked to consider the iris data that we used to explore the use of the k-nearest neighbours (KNN) classifier.
During our week 6 lecture we noted that one of the defining characteristics of the machine learning approach (versus the statistical approach) is that the full data set needs to be split, at least 2 ways (train-test) or preferably 3 ways (train-validate-test).
Usually this is not difficult, because the data sets considered in machine learning are generally larger than those considered in the statistics-led approach.
Remember: our goal is to build a prediction model, so our metrics to evaluate our model must be based on how well the model predicts.
In this lab, I highlight the main steps and ask you to fill in the gaps when implementing this lab in a Jupyter notebook, noting what you find.
We first import the usual python packages, so they are available for use later.
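A minimal set of imports for this lab might look like the following (the exact packages you need may vary with what you choose to do later):

```python
# Standard imports for data handling and the scikit-learn iris/KNN workflow
import numpy as np
import pandas as pd
from sklearn import datasets
```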
The well-known iris data is already in scikit-learn, so we can load it easily:
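A sketch of the loading step; load_iris returns a dict-like Bunch object:

```python
from sklearn import datasets

# Load the iris data; the result behaves like a python dict
iris = datasets.load_iris()
print(iris.keys())  # includes 'data', 'target', 'target_names', 'feature_names'
```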
You will notice that the iris data is actually a python dict, notably
containing data, target, target_names and feature_names fields.
It is not yet in the standard format for supervised learning for scikit-learn,
which expects a dataframe of features (we can call this X) and a vector
y (of labels in this instance) that we wish to fit.
X and y should each contain numeric data (they do this already) but the
columns of X should have more meaningful names. Indeed, the names are already
listed in feature_names but you need to assign them yourself.
FOR YOU: You need to create the pandas dataframe X from the numpy array,
using pd.DataFrame(), taking care to replace the default 0,1,2,3 column names with
those in feature_names.
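One way to do this (a sketch, not the only way) is to pass the column names directly to pd.DataFrame():

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()

# Replace the default 0,1,2,3 column names with the names in feature_names
X = pd.DataFrame(iris["data"], columns=iris["feature_names"])
y = iris["target"]
```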
The first step is to scale the numeric features in X using your choice of
scaler from sklearn.preprocessing. There was some example code in Friday's
lecture, or you can refer to the relevant entry, such as that for
StandardScaler, in the online manual.
In this dataset, the choice of scaler has little effect, but for other datasets, one choice might be better than the others. Indeed, searching for the best choice of scaler could even be considered an example of hyperparameter tuning.
FOR YOU: To derive the scaling from the data and apply the resulting
scaling transformation to the same data, you can use the .fit_transform()
method on the scaler, applying it to X.
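A sketch of the scaling step, using StandardScaler as one possible choice:

```python
import pandas as pd
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X = pd.DataFrame(iris["data"], columns=iris["feature_names"])

# fit_transform() learns the scaling parameters from X and applies them in one step
scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
```

Wrapping the result back in a DataFrame keeps the meaningful column names, since fit_transform() returns a plain numpy array.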
The scikit-learn splitting function needs to be imported first
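The import is a single line from sklearn.model_selection:

```python
# train_test_split lives in the model_selection module
from sklearn.model_selection import train_test_split
```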
For this data, I suggest a train-test split of 80:20 which works out as 120 training observations and 30 test observations.
Note that, with classification data having very unbalanced labels, the split
should be stratified according to the distribution of labels. For example, if
the overall data has 50 times as many non-fraudulent transactions as fraudulent
transactions, the same ratio should apply to the training and test sets.
Generally speaking, even if the ratio of label values is 1:1:1 as it is here,
you might as well request a stratified split anyway. You can do this by setting
stratify = y in the call to train_test_split.
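A sketch of the stratified 80:20 split; random_state=42 is an arbitrary choice, included only so the split is reproducible:

```python
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X = pd.DataFrame(iris["data"], columns=iris["feature_names"])
y = iris["target"]

# 80:20 split (120 train, 30 test), stratified by the label distribution
Xtrain, Xtest, ytrain, ytest = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
```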
For the iris data, we can choose any non-empty subset of the 4 available features (choose from 4 individual features, 6 double features, 4 triple features and all 4 features, making 15 possible feature subsets in all).
Each of these feature subsets can be used when providing training data for the knn classifier that we used in week 2.
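A sketch of fitting KNN on one feature subset; the two petal features and k = 5 are arbitrary example choices here, and you should try others:

```python
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X = pd.DataFrame(iris["data"], columns=iris["feature_names"])
y = iris["target"]

# Example feature subset: one of the 15 possible non-empty subsets
features = ["petal length (cm)", "petal width (cm)"]
Xtrain, Xtest, ytrain, ytest = train_test_split(
    X[features], y, test_size=0.2, stratify=y, random_state=42)

# Fit the classifier on the training data, then predict the test labels
model = KNeighborsClassifier(n_neighbors=5)
model.fit(Xtrain, ytrain)
ypred = model.predict(Xtest)
```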
FOR YOU: For each model you fit, you need to compute the predicted
labels for the test set using model.predict(Xtest).
Note that we have different choices of k and of the feature sets we used
for fitting the data. We can score each choice by computing the accuracy
of the model when it comes to predicting labels for the test set.
As described in class, optimising a model to perform well on the training set without considering its performance on the test set can lead to overfitting and poor predictive performance overall.
Recall that we actually know the true labels of the test set (they were provided in the data) but we chose to ignore them when applying the model to predict those labels. In practice, there will be differences between the true and predicted labels. The (normalised) accuracy is 1 when each of the predicted labels matches the corresponding true label. We are looking to maximise this accuracy score.
FOR YOU: For each model you fit, you need to calculate its accuracy score using something like
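A sketch of the scoring step using accuracy_score from sklearn.metrics (again, k = 5, all four features and random_state=42 are arbitrary example choices):

```python
import pandas as pd
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X = pd.DataFrame(iris["data"], columns=iris["feature_names"])
y = iris["target"]
Xtrain, Xtest, ytrain, ytest = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

model = KNeighborsClassifier(n_neighbors=5).fit(Xtrain, ytrain)
ypred = model.predict(Xtest)

# Normalised accuracy: the fraction of test labels predicted correctly (1 is perfect)
acc = accuracy_score(ytest, ypred)
print(acc)
```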
FOR YOU: Looking across the models you tried, which had the best accuracy? Do you notice any patterns?
The iris data is a little on the small side for K-fold cross validation, unless the number of folds is kept small (like 3).
Cross validation has the benefit of letting us estimate the uncertainty in the accuracy scores: if many scores are similar, and each has high uncertainty, ranking them is not very meaningful.
In future labs with larger datasets, it would make sense to compare configurations of hyperparameters using cross validation, rather than a single train-test split for each.
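As a preview, 3-fold cross validation might look like the following sketch; the pipeline (StandardScaler plus KNN with k = 5) is an example configuration, not a recommendation. For classifiers, cross_val_score with an integer cv uses stratified folds automatically:

```python
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X = pd.DataFrame(iris["data"], columns=iris["feature_names"])
y = iris["target"]

# Pipeline ensures the scaler is refitted on each training fold, avoiding leakage
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

# One accuracy score per fold; the spread across folds indicates uncertainty
scores = cross_val_score(pipe, X, y, cv=3, scoring="accuracy")
print(scores.mean(), scores.std())
```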