The Iris dataset

The Iris dataset, despite its simplicity, helps to illustrate many aspects of model building for classification problems.

First, classification is about assigning labels to data, but there are also related problems concerned with

  1. predicting the characteristics of the data that share the same label
  2. predicting the boundaries between those classes

It could be argued that the k-nearest neighbours (KNN) and decision tree (DT) classification algorithms address problems 1 and 2, respectively, when they learn from the training data.

The associated notebook shows how this works: for each pair of features it builds a model and visualises its performance by plotting the data (training and test) together with the regions associated with each label.
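The per-feature-pair approach can be sketched as follows. This is not the notebook's code, just a minimal illustration using scikit-learn: a KNN model is fitted on each of the six feature pairs and its test accuracy reported (the choices of `n_neighbors=5`, a 70/30 split, and `random_state=0` are assumptions for the sketch).

```python
from itertools import combinations

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Iris: 150 samples, 4 features, 3 classes.
iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Fit one KNN model per feature pair and record its test accuracy.
scores = {}
for i, j in combinations(range(X.shape[1]), 2):
    model = KNeighborsClassifier(n_neighbors=5)
    model.fit(X_train[:, [i, j]], y_train)
    pair = (iris.feature_names[i], iris.feature_names[j])
    scores[pair] = model.score(X_test[:, [i, j]], y_test)

# Rank the feature pairs by how well they separate the classes.
for pair, acc in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{pair[0]} + {pair[1]}: {acc:.2f}")
```

In the notebook the same loop would also plot each model's decision regions; here only the accuracies are printed to keep the sketch short.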

The focus in this notebook is not on model selection but on understanding how feature choice and hyperparameter settings affect the quality and stability of the predictions.

The plots show that some feature pairs separate the 3 classes more cleanly than others. The choice of hyperparameter (k for KNN, maximum depth for DT) also adjusts the boundaries in ways that affect the trade-off between bias and variance.
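The bias–variance effect of each hyperparameter can be seen by comparing training and test accuracy as k or the depth varies. The sketch below assumes the petal features (a pair that separates the classes well) and particular hyperparameter grids; small k or unlimited depth tends to track the training data closely (low bias, high variance), while large k or shallow depth smooths the boundaries (higher bias, lower variance).

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X = X[:, 2:4]  # petal length and petal width only (assumed feature pair)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

print("KNN: k -> train / test accuracy")
for k in (1, 5, 15, 50):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(f"{k:>4} -> {knn.score(X_train, y_train):.2f} / "
          f"{knn.score(X_test, y_test):.2f}")

print("DT: max_depth -> train / test accuracy")
for depth in (1, 2, 4, None):
    dt = DecisionTreeClassifier(max_depth=depth, random_state=0)
    dt.fit(X_train, y_train)
    print(f"{str(depth):>4} -> {dt.score(X_train, y_train):.2f} / "
          f"{dt.score(X_test, y_test):.2f}")
```

A large gap between training and test accuracy at small k (or unconstrained depth) is the variance side of the trade-off; falling accuracy on both sets at large k (or depth 1) is the bias side.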