Typically the Tuesday lab sessions will be based on the material covered in the preceding Friday's lecture. For the first week we will focus on installing the software needed for this module.
To achieve this, the aims of this lab session are:
Install / Verify a Suitable Python Environment
We will use a Python 3.10+ distribution. All additional modules used during this course have been tested with it, and no incompatibilities are known at this time.
Overview of Anaconda interface
Here is a minimal overview of the Anaconda distribution and jupyter-lab interface.
Use Jupyter to write a simple script where pandas is used to read a csv file
We start the Jupyter notebook server and write a simple Python script in the browser to test the setup and show how jupyter-lab is used. To stop the server, press Ctrl-C in the terminal where it is running and answer y to confirm.
Review of Basic Data Exploration Skills.
You are invited to take a small dataset based on records of American cars and their fuel usage, and to explore it using the software tool of your choice. Later we will perform the same analysis using pandas, but for now the focus is on investigating the data using the tools with which you are familiar.
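As a rough sketch of the kind of test script meant in the third aim above, the following reads CSV data with pandas and prints a quick summary. The inline text stands in for a CSV file on disk, and its column names and values are made up for illustration; they are not the dataset used in this module.

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV file; the columns and values are
# invented for illustration and are not the module's dataset.
csv_text = """name,cylinders,mpg
car_a,4,30.5
car_b,6,21.0
car_c,8,15.5
"""

# pd.read_csv accepts a file path or any file-like object.
cars = pd.read_csv(io.StringIO(csv_text))

print(cars.head())          # first rows of the table
print(cars.shape)           # (number of rows, number of columns)
print(cars["mpg"].mean())   # average of one numeric column
```

With a real file, the io.StringIO(...) argument would be replaced by the path to the file.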
Before starting, it might be a good idea to give a short primer on terminology:
Programming Language
Python is a programming language. It consists of a Python interpreter, which can parse Python scripts (text files with a .py extension) and execute them, and a collection of libraries/modules which extend the Python language, adding capabilities such as game development, networking, database access, efficient data storage and processing, etc.
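To make the idea of a script concrete, here is a minimal sketch of one that reports which interpreter is running it and whether it meets the 3.10+ requirement of this module. The file name check_python.py is only a suggestion; if saved under that name, it could be run from a terminal with python check_python.py.

```python
import sys

# Version of the interpreter that is executing this script.
major, minor = sys.version_info[:2]
print(f"Running Python {major}.{minor} from {sys.executable}")

# This module assumes Python 3.10 or later.
meets_requirement = sys.version_info >= (3, 10)
print("Python 3.10+ available:", meets_requirement)
```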
Distributions
One can install Python directly from the Python Software Foundation, but people have developed distributions which also include commonly used Python modules, editors and other tools useful when developing in Python. Anaconda is probably the most popular distribution for data mining. The full installation is large (about 5.7 GB), but using it greatly simplifies setup and usage.
Modules
Like most programming languages, Python's functionality can be extended by importing additional libraries or modules. For data mining, the most popular modules are:
pandas for data manipulation and input/output.
numpy for efficient processing of tabular data (used by pandas).
matplotlib for general 2D visualisation.
seaborn for statistical visualisations.
...
Package Manager
Keeping track of the large number of modules in a Python distribution is a non-trivial task because of incompatibilities between different versions of the various modules, most of which have their own release schedules. A package manager attempts to simplify this task. The default package manager in Python is called pip, but within the Anaconda distribution the preferred package manager is conda.
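Since version mismatches are the main source of trouble, it can help to see which versions are actually installed. The following is a minimal sketch using the standard library's importlib.metadata; the helper name installed_versions is made up for this example, and the list of modules is simply the one given above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(names):
    """Map each package name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


modules = ["pandas", "numpy", "matplotlib", "seaborn"]
for name, ver in installed_versions(modules).items():
    print(f"{name}: {ver if ver is not None else 'not installed'}")
```

A module reported as not installed would normally be added with the package manager (conda within Anaconda, pip otherwise) rather than by hand.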
Editors/Interfaces
You can perform all of your data mining activities using Python scripts and an editor (VS Code is an excellent option). However, an alternative interface has become standard in the data mining community. It allows for more interactive development and is based on the concept of a notebook containing a mixture of code cells for execution and markdown cells for documentation.
jupyter-lab is a browser-based interface, backed by a local server, for creating, editing and executing Python notebooks.
Note that there is an alternative, older interface called jupyter-notebook. It does not offer any advantages over either jupyter-lab or VS Code, so I do not encourage its use.
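To illustrate the notebook concept described above: a notebook file (.ipynb) is stored as JSON text holding a list of cells. The sketch below builds a minimal two-cell notebook by hand, assuming version 4 of the notebook format; the cell contents are made up, and in practice jupyter-lab creates and maintains this structure for you.

```python
import json

# A minimal notebook: one markdown cell for documentation and
# one code cell for execution (notebook format version 4 assumed).
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# Week 1 test\n", "Checking that the setup works."],
        },
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": ["print('Hello from a code cell')"],
        },
    ],
}

# Serialise to the JSON text that would be stored in a .ipynb file.
text = json.dumps(notebook, indent=1)

cell_types = [cell["cell_type"] for cell in notebook["cells"]]
print(cell_types)
```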