Install / Verify a Suitable Python Environment

Contents

  1. Background
    The tools that data science practitioners use have changed significantly in recent years. While we don't have time to go into much background detail here, the next section mentions some of the main concepts and includes links for further reading.
  2. Installation
    Next we cover installation of python using either a full anaconda distribution (about 3.1 GB) or a reduced miniconda distribution (about 248MB).
  3. Extra Packages
    In addition to the main python data analysis modules of pandas, scikit-learn, etc., we will use numerous other modules which will be listed here.

Background

Traditionally, data mining was done using a suite of different tools:

Nowadays we distinguish between two data mining usage patterns:

Traditional approaches introduce a lot of complexity (in the form of interfaces and context shifts) and require significant know-how, and are not optimal for either usage pattern.

Many analysts like to use a single, GUI-based tool for their workflows. ISL Clementine (spun out from the University of Birmingham in the 1990s) was the tool that introduced the workflow metaphor and is now available as IBM SPSS Modeler Professional. Such tools are expensive, which has encouraged competition from lower-priced or free equivalents. KNIME is one of its (commercial) open source competitors (others include RapidMiner and Orange), using the same template as the original product but with enhanced features and performance.

More recently, the R (and python) data science communities have embraced the literate programming paradigm of knitr and Jupyter notebooks, which allows them to script function calls to powerful libraries, weaving this code with its own output (including graphs) and text that is typeset as LaTeX or markdown, respectively. Such code can be extracted and inserted into applications using Big Data frameworks with relatively little effort.

Both the R and python data science communities are very active and solutions such as KNIME now position themselves as platforms that can be extended with user-provided modules written in a variety of languages, notably including R and python.

Therefore, we will use Jupyter as the primary learning tool, both for sourcing the data and for the data mining process itself.

Installation

While any Python 3.10+ distribution should be suitable for our needs, we recommend you either install a full anaconda distribution (about 3.1 GB download) or a reduced miniconda distribution (about 250MB download).

Which should you choose? To be honest, it doesn't really matter but

So with miniconda you will have a faster initial download and use less hard disk space (because you install only the packages you have asked for), but you will need to install extra packages more often than an anaconda user.

It is possible to have multiple python distributions installed on your system; you just have to manage the system paths so that you can switch between them.
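When several distributions are installed, it helps to know which one a terminal will actually pick up. The following is a minimal sketch, using only the standard library, that reports the interpreter running the script and the first python found on the system path. The function name first_on_path is illustrative only.

```python
import shutil
import sys


def first_on_path(command="python"):
    """Return the full path of the first `command` found on PATH, or None."""
    return shutil.which(command)


if __name__ == "__main__":
    # The interpreter that is executing this script.
    print("Interpreter running this script:", sys.executable)
    # The interpreter a terminal would start if you typed `python`.
    print("First 'python' on PATH:", first_on_path("python"))
```

If the two paths differ, the system path is pointing at a different distribution from the one you used to run the script.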

Finally, there are other options, all of which should be suitable for this course's needs. However, while we are happy to help sort out issues, our priority will be issues with anaconda and miniconda.

Python comes with an inbuilt package manager called pip. This is also a good option, although conda arguably does more of the work for you.

Anaconda (Full python distribution)

To install anaconda:

  * Pick any suitable 64-bit, Python 3.10+ installer for your system from anaconda.
  * Run the installer; the default options should work for your system. Under MacOS, I typically install into my home directory, so I end up with a folder ~/anaconda3.

Miniconda (python distribution)

To install miniconda:

Verify Installation

Regardless of which option (anaconda or miniconda) you picked, once it is installed you should be able to run python from the command line.

On MS Windows, you should start the Anaconda Prompt app, with no need to activate conda (since it has been added to the prompt in that app).

On other platforms (Linux and MacOS, say) you can start your preferred terminal and activate conda yourself, like this:

conda activate

Either way, you should see that your prompt has been prefixed with (base), indicating that the default (base) environment has been activated.

Students should then check that python is available:

python

You should see version information followed by the python prompt:

Python 3.12.5 | packaged by Anaconda, Inc. | (main, Sep 12 2024, 18:27:27) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>

Type 1+1 at the python prompt and press Enter. You should get 2, or we are in trouble...

>>> 1+1
2
>>> quit()

Use quit() to exit the python shell and return to the command prompt.
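As a non-interactive alternative to the check above, here is a minimal sketch of a script that confirms the interpreter meets the Python 3.10+ recommendation from the Installation section. The name meets_minimum is illustrative and not part of any library.

```python
import sys

# Minimum version recommended in the Installation section.
MINIMUM = (3, 10)


def meets_minimum(version=None, minimum=MINIMUM):
    """Return True if `version` (e.g. (3, 12, 5)) is at least `minimum`."""
    if version is None:
        # Default to the interpreter that is running this script.
        version = sys.version_info
    # Compare only the major and minor components.
    return tuple(version[:2]) >= tuple(minimum)


if __name__ == "__main__":
    print("Running Python", sys.version.split()[0])
    print("Suitable for this course:", meets_minimum())
```

Save it to a file and run it with python from the same terminal in which you activated conda.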

Extra Packages

If using anaconda, you can install a package using the Anaconda Navigator GUI.

Both anaconda and miniconda use a package manager called conda. To install a package from the command line, you typically use the command

conda install -y PACKAGE_NAME

The -y option causes the install to automatically carry out any required installs/updates instead of asking for user confirmation.

Note: some packages are not installable using conda. In these cases, or if you are using a standard (non-anaconda variant) distribution, you can install them with the pip installer using

pip install PACKAGE_NAME
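After installing with either conda or pip, you may want to confirm which version ended up in your environment. The following is a minimal sketch using the standard library's importlib.metadata. The helper name installed_version and the three example package names are illustrative choices, and the sketch assumes the package was installed with the usual Python packaging metadata.

```python
from importlib import metadata


def installed_version(distribution):
    """Return the installed version string of `distribution`, or None if absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        # No installed distribution with this name was found.
        return None


if __name__ == "__main__":
    # Example names only; substitute whichever package you just installed.
    for name in ("numpy", "pandas", "scikit-learn"):
        print(f"{name}: {installed_version(name) or 'not installed'}")
```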

Side Note: Environments

Both conda and pip (via venv) support the idea of environments, where within one python distribution you might have many environments. This helps to avoid version conflicts between modules. Anaconda and miniconda both start with one default environment called base. If you wish to use environments, there are two steps: 1) create an environment using

conda create -n data_mining

and 2) activate it using

conda activate data_mining

Since all of the packages that we plan to use are (currently, Sep 2024) compatible, we won't need to use multiple environments, but they are a good idea (e.g., if you already use conda packages for other purposes). If you are interested, there is a nice introduction at Happy Belly Bioinformatics.
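If you do use environments, it is easy to lose track of which one is active. The sketch below reports it from inside python. It assumes that conda sets the CONDA_DEFAULT_ENV environment variable when an environment is activated; the function name environment_summary is illustrative only.

```python
import os
import sys


def environment_summary():
    """Collect details that identify the active environment."""
    return {
        # Set by `conda activate`; None when no conda environment is active.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
        # Root directory of the environment the running interpreter belongs to.
        "prefix": sys.prefix,
    }


if __name__ == "__main__":
    for key, value in environment_summary().items():
        print(f"{key}: {value}")
```

With the default environment active you would expect conda_env to be base; after activating data_mining it should show that name instead.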

Packages that need installing if using miniconda (they are already included in anaconda)

conda install -y numpy
conda install -y scipy
conda install -y matplotlib
conda install -y seaborn
conda install -c conda-forge -y jupyterlab

(last update 2024-04-19)

Packages that need installing if using miniconda or anaconda

conda install -c conda-forge -y dtale
conda install -c conda-forge -y voila
conda install -c conda-forge -y featuretools
conda install -c anaconda -y graphviz
conda install -c anaconda -y pydot

conda install -c conda-forge -y phik
conda install -c conda-forge -y pingouin

(last update 2024-04-19)
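Once the installs have finished, a quick way to see whether anything was missed is to ask python which modules it can find. This is a minimal sketch, not an official course check: it assumes each import name matches the conda package name listed above, and graphviz is left out because the conda package of that name may provide the Graphviz command-line tools rather than an importable module. The names COURSE_MODULES and missing_modules are illustrative.

```python
from importlib.util import find_spec

# Assumption: import names match the conda package names listed above.
COURSE_MODULES = [
    "numpy", "scipy", "matplotlib", "seaborn", "jupyterlab",
    "dtale", "voila", "featuretools", "pydot", "phik", "pingouin",
]


def missing_modules(names):
    """Return the subset of `names` that cannot be found for import."""
    # find_spec locates a top-level module without importing it.
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(COURSE_MODULES)
    if missing:
        print("Not found:", ", ".join(missing))
    else:
        print("All listed modules were found.")
```

Any name reported as not found can be installed with the matching conda install line above.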