Data Exploration

Using Jupyter with pandas and other modules greatly simplifies data exploration. However, the process matters more than the tools. Indeed, one could carry out all of the data exploration for this course using SQL, Excel, Matlab or even a general-purpose programming language, but it would require more time and effort.

So before we start on Jupyter and friends, we want you to explore a small dataset, auto-mpg.csv. You can use any software tool or environment that you like. Our goal here is to help you realise that, even if data science is new to you, you may be bringing more skills to the table than you think. The data file contains descriptive data on a selection of cars produced for the US market between 1970 and 1982.

Typical things we look for include:

For each attribute (column), find the following information:
The attribute type, e.g. nominal, ordinal, numeric.
Percentage of missing values in the data.
Statistical numerical measures — centre (mean, median, mode), spread (min, max, range, standard deviation), symmetry (skewness).
Statistical graphical representations — bar plots, histograms, ... even pie charts.
Are there any rows whose value for the attribute appears in no other row (i.e. unique values)?
Are there any outliers?
Which attributes seem to be linked?
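
To make the numerical measures in this checklist concrete, here is a minimal sketch in plain Python, using only the standard library, of how the missing-value percentage, centre, spread and skewness of one numeric attribute might be computed. The column names (`name`, `mpg`, `cylinders`) and the five rows are invented for illustration and are not taken from auto-mpg.csv; the helper `summarise_numeric` is likewise hypothetical. Skewness is computed here as the mean of the cubed z-scores using the population standard deviation, which is one of several common definitions.

```python
import csv
import io
import statistics

# Hypothetical stand-in for a real CSV file. These rows are invented
# for illustration and are NOT values from auto-mpg.csv.
SAMPLE_CSV = """name,mpg,cylinders
car_a,18.0,8
car_b,,4
car_c,24.0,4
car_d,31.0,4
car_e,15.0,8
"""


def summarise_numeric(rows, column):
    """Return missing-value percentage and basic statistics for one column."""
    raw = [row[column].strip() for row in rows]
    # Treat empty strings as missing; everything else is assumed numeric.
    values = [float(v) for v in raw if v != ""]
    n = len(values)

    mean = statistics.mean(values)
    std_dev = statistics.pstdev(values)  # population standard deviation

    # Skewness as the mean of cubed z-scores (0.0 if there is no spread).
    if std_dev > 0:
        skewness = sum(((v - mean) / std_dev) ** 3 for v in values) / n
    else:
        skewness = 0.0

    return {
        "missing_pct": 100.0 * (len(raw) - n) / len(raw),
        "mean": mean,
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
        "std_dev": std_dev,
        "skewness": skewness,
    }


# Parse the sample text exactly as one would parse an open file.
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
summary = summarise_numeric(rows, "mpg")
for measure, value in summary.items():
    print(f"{measure}: {value:.2f}")
```

To try this on a real file, replace `io.StringIO(SAMPLE_CSV)` with an open file handle and pass the name of a numeric column. The sketch treats only empty cells as missing, so a file that marks missing values in another way (for example with a placeholder character) would need an extra check.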

Later we will perform the same analysis using pandas, but for now the focus is on discovering interesting behaviour when you look at data.