Module Overview

Welcome to Data Mining.
We will start this module with a warm-up lab to set up the software and carry out some basic data analysis. The formal lecture will give an overview of the module: what will be covered, and how it will be delivered and assessed.

Motivating Example

This week we review some of the most useful pandas commands and look at how to classify iris plants by species.
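As a taste of what this involves, here is a minimal sketch. It assumes the copy of the iris data bundled with scikit-learn (version 0.23 or later, for `as_frame=True`); the module itself may load the data differently, and the 2.5 cm threshold is only an illustrative choice.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Load the iris measurements as a pandas DataFrame.
iris = load_iris(as_frame=True)
df = iris.frame.copy()  # four measurement columns plus an integer 'target'

# Turn the integer target codes into readable species names.
df["species"] = df["target"].map(dict(enumerate(iris.target_names)))

print(df.head())                     # first five rows
print(df["species"].value_counts())  # number of plants per species

# Mean petal length per species: a first hint that species can be told apart.
petal_means = df.groupby("species")["petal length (cm)"].mean()
print(petal_means)

# A one-rule "classifier": short petals suggest setosa (illustrative threshold).
predicted_setosa = df["petal length (cm)"] < 2.5
print((predicted_setosa == (df["species"] == "setosa")).mean())
```

The last line prints the fraction of plants for which the single rule agrees with the true setosa / not-setosa label.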

Data Handling

Introduction to Python and Numpy

  • Python features for data manipulation
  • Array handling with numpy
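The two topics above can be sketched in a few lines. This is only an illustration with made-up values, not the lab material itself.

```python
import numpy as np

# Plain Python: a list comprehension builds a list in one expression.
squares = [n ** 2 for n in range(5)]

# numpy: a 3x4 array holding the integers 0..11.
a = np.arange(12).reshape(3, 4)

col_means = a.mean(axis=0)  # mean of each column
centred = a - col_means     # broadcasting subtracts the means from every row
evens = a[a % 2 == 0]       # a boolean mask selects the matching elements
last_col = a[:, -1]         # slicing: every row, last column

print(squares, col_means, evens, last_col, sep="\n")
```

Operations such as `a - col_means` work on whole arrays at once, which is usually shorter and faster than an explicit Python loop.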

Exploratory Data Analysis

Before we begin constructing models of our data, we need to see what kind of data we have, what is being measured, how clean it is, and so on.

  • Data and metadata
  • Statistics for understanding
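A first pass over a data set might look like the sketch below. The small table is invented purely for illustration (the column names and values are assumptions); the pandas calls are the point.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration, with a couple of missing values.
df = pd.DataFrame({
    "height_cm": [170.0, 165.5, np.nan, 180.2, 175.0],
    "weight_kg": [65.0, 58.5, 72.0, np.nan, 80.0],
    "group": ["a", "b", "a", "b", "a"],
})

print(df.dtypes)                     # metadata: what type is each column?
missing = df.isna().sum()            # how clean is it? missing values per column
summary = df.describe()              # count, mean, std, quartiles (numeric columns)
counts = df["group"].value_counts()  # frequency of each category

print(missing, summary, counts, sep="\n\n")
```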

Exploratory Data Analysis 2

Continuing our review of exploratory data analysis, we consider richer analytics on the data, leading to the identification of features for prediction.

  • Review of EDA Phase 1
  • Analysing features and targets
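One simple way to relate features to a target is to rank the features by their correlation with it. The sketch below uses synthetic data (an assumption made for illustration): `x1` drives the target `y`, while `x2` is unrelated noise.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic features: x1 influences the target, x2 does not.
df = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
})
df["y"] = 3.0 * df["x1"] + rng.normal(scale=0.5, size=n)

# Pearson correlation of every feature with the target.
corr_with_target = df.corr()["y"].drop("y")

# Rank features by the strength (absolute value) of that correlation.
ranked = corr_with_target.abs().sort_values(ascending=False)
print(ranked)
```

Correlation only captures linear relationships, so a low value does not by itself show that a feature is useless.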

Data Modelling

This week we will discuss general concepts and issues in the construction of data mining models.

  • Types of models
  • Model building as a process
  • Modelling concerns

Regression 1

Sometimes we need to predict a numeric value, or a set of such values, given existing (training) data.

  • Motivation - fitting a line to data
  • Perspectives - optimisation, linear algebra, statistics
  • Measuring the quality of the results
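The line-fitting idea can be sketched from the linear algebra perspective using numpy's least-squares solver. The data here are synthetic (a true slope of 2 and intercept of 1 are assumptions chosen for illustration), and R² and RMSE stand in for "quality of the results".

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a straight line plus Gaussian noise.
x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=x.size)

# Design matrix: one column for x, one column of ones for the intercept.
A = np.column_stack([x, np.ones_like(x)])
(slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)

# Quality of fit: R^2 (fraction of variance explained) and RMSE.
y_hat = slope * x + intercept
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot
rmse = np.sqrt(ss_res / x.size)

print(f"slope={slope:.3f} intercept={intercept:.3f} R^2={r2:.3f} RMSE={rmse:.3f}")
```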

Classification 1

Given labeled training data, develop models to classify new data based on what we have seen in the training data.

  • Contrast with regression
  • Classification metrics
  • Classification using logistic regression
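A minimal sketch of this workflow, assuming scikit-learn and its bundled iris data; the 70/30 split, the `random_state` and `max_iter=1000` are illustrative choices rather than settings taken from the module.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold back 30% of the labeled data to play the role of "new" data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Fit on the training data only, then predict the held-back labels.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Two common classification metrics.
acc = accuracy_score(y_test, y_pred)   # fraction of correct predictions
cm = confusion_matrix(y_test, y_pred)  # rows: true class, columns: predicted class

print(acc)
print(cm)
```

Unlike regression, the prediction here is a class label, so the metrics count correct and incorrect labels rather than measuring numeric error.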

Regression 2

We continue our introduction to regression, considering how to make it more robust.

  • Regularisation
  • Transformation of variables
  • Regression in practice
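Regularisation can be illustrated with ridge regression, which adds a penalty `lam` on the size of the coefficients. The sketch below uses the closed-form solution on synthetic data; `ridge_fit` is a hypothetical helper written for this example, and the intercept is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 30 samples, 5 features, two of which are irrelevant.
n, p = 30, 5
X = rng.normal(size=(n, p))
true_w = np.array([2.0, -1.0, 0.0, 0.0, 0.5])
y = X @ true_w + rng.normal(scale=0.5, size=n)


def ridge_fit(X, y, lam):
    """Solve (X^T X + lam * I) w = X^T y for w (no intercept term)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)


w_ols = ridge_fit(X, y, lam=0.0)     # lam = 0 is ordinary least squares
w_ridge = ridge_fit(X, y, lam=10.0)  # lam > 0 shrinks the coefficients

print(np.linalg.norm(w_ols), np.linalg.norm(w_ridge))
```

A larger `lam` pulls the coefficients towards zero, trading a little bias for lower variance.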

Classification 2

Classification with techniques that rely on probability-based models.

  • Decision Trees
  • Naive Bayes
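Both techniques can be tried side by side with scikit-learn. This is a sketch under assumptions: the bundled iris data, a Gaussian variant of naive Bayes, and a tree depth of 3 are illustrative choices, not settings from the module.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

models = {
    "decision tree": DecisionTreeClassifier(max_depth=3, random_state=0),
    "naive Bayes": GaussianNB(),
}

# Fit each model on the training data and score it on the held-back data.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # accuracy on the test set
    print(name, scores[name])

# Naive Bayes also reports a probability for each class, not just a label.
proba = models["naive Bayes"].predict_proba(X_test[:1])
print(proba)
```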

Clustering

Given unlabeled data, look for subsets that help to improve understanding of the overall data set.

  • Partitioning
  • Hierarchies
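The two approaches can be compared on a toy problem. The sketch assumes scikit-learn, with k-means standing in for partitioning and agglomerative clustering for hierarchies; the two well-separated synthetic blobs are invented for illustration.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

rng = np.random.default_rng(0)

# Two well-separated 2-D blobs of 50 points each; no labels are given.
blob_a = rng.normal(loc=0.0, scale=1.0, size=(50, 2))
blob_b = rng.normal(loc=10.0, scale=1.0, size=(50, 2))
X = np.vstack([blob_a, blob_b])

# Partitioning: k-means assigns each point to the nearest of k centroids.
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Hierarchies: agglomerative clustering merges the closest groups step by step
# and stops here once two clusters remain.
hc_labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)

print(km_labels)
print(hc_labels)
```

The cluster numbers themselves are arbitrary: the two methods may call the same group 0 or 1, so compare the groupings rather than the label values.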