This week, you are asked to consider the iris data that we used to explore the use of the k-nearest neighbours (KNN) classifier.
During our week 6 lecture we noted that one of the defining characteristics of the machine learning approach (versus the statistical approach) is that the full data set needs to be split, at least 2 ways (train-test) or preferably 3 ways (train-validate-test).
Usually this is not difficult, because the data sets considered in machine learning are generally larger than those considered in the statistics-led approach.
Remember: our goal is to build a prediction model, so our metrics to evaluate our model must be based on how well the model predicts.
In this lab, I highlight the main steps and ask you to fill in the gaps when implementing this lab in a Jupyter notebook, noting what you find.
We first import the usual python packages, so they are available for use later.
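A minimal set of imports for this lab might look like the following (the exact packages you need may vary with what you choose to do later):

```python
# Standard imports for data handling and the scikit-learn iris/KNN workflow
import numpy as np
import pandas as pd
from sklearn import datasets
```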
The well-known iris data is already in scikit-learn, so we can load it easily:
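A sketch of the loading step; load_iris returns a dict-like Bunch object:

```python
from sklearn import datasets

# Load the iris data; the result behaves like a python dict
iris = datasets.load_iris()
print(iris.keys())  # includes 'data', 'target', 'target_names', 'feature_names'
```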
You will notice that the iris data is actually a python dict, notably
containing data, target, target_names and feature_names fields.
It is not yet in the standard format for supervised learning for scikit-learn,
which expects a dataframe of features (we can call this X) and a vector
y (of labels in this instance) that we wish to fit.
X and y should each contain numeric data (they do this already) but the
columns of X should have more meaningful names. Indeed, the names are already
listed in feature_names but you need to assign them yourself.
FOR YOU: You need to create the pandas dataframe X from the numpy array,
using pd.DataFrame(), taking care to replace the default 0,1,2,3 column names with
those in feature_names.
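One way to do this (a sketch, not the only way) is to pass the column names directly to pd.DataFrame():

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()

# Replace the default 0,1,2,3 column names with the names in feature_names
X = pd.DataFrame(iris["data"], columns=iris["feature_names"])
y = iris["target"]
```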
The first step is to scale the numeric features in X using your choice of
scaler from sklearn.preprocessing. There was some example code in Friday's
lecture, or you can refer to the relevant entry, such as that for
StandardScaler, in the online manual.
In this dataset, the choice of scaler has little effect, but for other datasets, one choice might be better than the others. Indeed, searching for the best choice of scaler could even be considered an example of hyperparameter tuning.
FOR YOU: To derive the scaling from the data and apply the resulting
scaling transformation to the same data, you can use the .fit_transform()
method on the scaler, applying it to X.
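A sketch of the scaling step, using StandardScaler as one possible choice:

```python
import pandas as pd
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X = pd.DataFrame(iris["data"], columns=iris["feature_names"])

# fit_transform() learns the scaling parameters from X and applies them in one step
scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
```

Wrapping the result back in a DataFrame keeps the meaningful column names, since fit_transform() returns a plain numpy array.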
The scikit-learn splitting function needs to be imported first
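The import is a single line from sklearn.model_selection:

```python
# train_test_split lives in the model_selection module
from sklearn.model_selection import train_test_split
```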
For this data, I suggest a train-test split of 80:20 which works out as 120 training observations and 30 test observations.
Note that, with classification data having very unbalanced labels, the split
should be stratified according to the distribution of labels. For example, if
the overall data has 50 times as many non-fraudulent transactions as fraudulent
transactions, the same ratio should apply to the training and test sets.
Generally speaking, even if the ratio of label values is 1:1:1 as it is here,
you might as well request a stratified split anyway. You can do this by setting
stratify = y in the call to train_test_split.
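A sketch of the stratified 80:20 split; random_state=42 is an arbitrary choice, included only so the split is reproducible:

```python
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X = pd.DataFrame(iris["data"], columns=iris["feature_names"])
y = iris["target"]

# 80:20 split (120 train, 30 test), stratified by the label distribution
Xtrain, Xtest, ytrain, ytest = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
```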
For the iris data, we can choose any non-empty subset of the 4 available features (choose from 4 individual features, 6 double features, 4 triple features and all 4 features, making 15 possible feature subsets in all).
Each of these feature subsets can be used when providing training data for the knn classifier that we used in week 2.
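A sketch of fitting KNN on one feature subset; the two petal features and k = 5 are arbitrary example choices here, and you should try others:

```python
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X = pd.DataFrame(iris["data"], columns=iris["feature_names"])
y = iris["target"]

# Example feature subset: one of the 15 possible non-empty subsets
features = ["petal length (cm)", "petal width (cm)"]
Xtrain, Xtest, ytrain, ytest = train_test_split(
    X[features], y, test_size=0.2, stratify=y, random_state=42)

# Fit the classifier on the training data, then predict the test labels
model = KNeighborsClassifier(n_neighbors=5)
model.fit(Xtrain, ytrain)
ypred = model.predict(Xtest)
```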
FOR YOU: For each model you fit, you need to compute the predicted
labels for the test set using model.predict(Xtest).
Note that we have different choices of k and of the feature sets we used
for fitting the data. We can score each choice by computing the accuracy
of the model when it comes to predicting labels for the test set.
As described in class, optimising a model to perform well on the training set without considering its performance on the test set can lead to overfitting and poor predictive performance overall.
Recall that we actually know the true labels of the test set (they were provided in the data) but we chose to ignore them when applying the model to predict those labels. In practice, there will be differences between the true and predicted labels. The (normalised) accuracy is 1 when each of the predicted labels matches the corresponding true label. We are looking to maximise this accuracy score.
FOR YOU: For each model you fit, you need to calculate its accuracy score using something like
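A sketch of the scoring step using accuracy_score from sklearn.metrics (again, k = 5, all four features and random_state=42 are arbitrary example choices):

```python
import pandas as pd
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X = pd.DataFrame(iris["data"], columns=iris["feature_names"])
y = iris["target"]
Xtrain, Xtest, ytrain, ytest = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

model = KNeighborsClassifier(n_neighbors=5).fit(Xtrain, ytrain)
ypred = model.predict(Xtest)

# Normalised accuracy: the fraction of test labels predicted correctly (1 is perfect)
acc = accuracy_score(ytest, ypred)
print(acc)
```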
FOR YOU: Looking across the models you tried, which had the best accuracy? Do you notice any patterns?
The iris data is a little on the small side for K-fold cross validation, unless the number of folds is kept small (like 3).
Cross validation has the benefit of letting us estimate the uncertainty in the accuracy scores: if many scores are similar, and each has high uncertainty, ranking them is not very meaningful.
In future labs with larger datasets, it would make sense to compare configurations of hyperparameters using cross validation, rather than a single train-test split for each.
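As a preview, 3-fold cross validation might look like the following sketch; the pipeline (StandardScaler plus KNN with k = 5) is an example configuration, not a recommendation. For classifiers, cross_val_score with an integer cv uses stratified folds automatically:

```python
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X = pd.DataFrame(iris["data"], columns=iris["feature_names"])
y = iris["target"]

# Pipeline ensures the scaler is refitted on each training fold, avoiding leakage
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

# One accuracy score per fold; the spread across folds indicates uncertainty
scores = cross_val_score(pipe, X, y, cv=3, scoring="accuracy")
print(scores.mean(), scores.std())
```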