Overview

This week's lecture is being presented in notebook format, instead of slides, because it is easier to show how pandas works by using it in practice.

Today we look at pandas, which plays a huge role in the Python data science ecosystem. Typically, it provides the data structures that are used to store and manipulate the data used for machine learning.

Much of the attention later in the module is on machine learning algorithms and their interpretation and validation. However, the machine learning pipeline and model building steps are heavily dependent on pandas.

Introduction to Pandas Series structure

This notebook introduces the Pandas Series data structure and how it is used to store list and/or dictionary data.
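As a rough illustration of the idea (not taken from the notebook itself), a Series can be built from either a list or a dictionary. The values and variable names below are made up for the example.

```python
import pandas as pd

# Illustrative values only; they are not from the lecture notebooks.
from_list = pd.Series([10, 20, 30])
from_dict = pd.Series({"a": 1.5, "b": 2.5, "c": 3.5})

# A list gets default integer labels; a dictionary's keys become the labels.
print(from_list.index.tolist())
print(from_dict.index.tolist())
```

In both cases the result is a one-dimensional array of values paired with an index of labels.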

Accessing data in Pandas Series

This notebook describes indexing and other strategies for querying Pandas series.
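A minimal sketch of the kinds of query this covers, assuming a small hypothetical Series of scores: label-based lookup with `.loc`, position-based lookup with `.iloc`, and filtering with a boolean mask.

```python
import pandas as pd

# Hypothetical data, used only to illustrate the access patterns.
scores = pd.Series([55, 72, 90], index=["ann", "bob", "cat"])

by_label = scores.loc["bob"]      # look up by index label
by_position = scores.iloc[0]      # look up by integer position
above_60 = scores[scores > 60]    # keep only entries where the mask is True

print(by_label, by_position)
print(above_60)
```

Using `.loc` and `.iloc` explicitly avoids ambiguity when the index labels are themselves integers.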

Introduction to Dataframe data structure

This notebook introduces the Pandas Dataframe data structure and how it is used to store tabular data in a very flexible way.
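For orientation, one common way to build a small table is from a dictionary of columns. This is only a sketch with invented column names and values; the notebook may construct its examples differently.

```python
import pandas as pd

# Each key becomes a column; columns may hold different data types.
df = pd.DataFrame(
    {
        "name": ["ann", "bob", "cat"],   # hypothetical values
        "age": [31, 45, 27],
        "score": [55.0, 72.5, 90.0],
    }
)

print(df.shape)    # (rows, columns)
print(df.dtypes)   # one dtype per column
```

Each column of the table is itself a Series, and all columns share the same row index.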

Loading data into Dataframe data structure and examining its structure

This notebook describes how CSV files (a common format for ML data) can be loaded into a Pandas Dataframe and how various indexing strategies can be used to refine how data elements are accessed.
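The loading step might look roughly like the sketch below. To keep it self-contained, an in-memory string stands in for the CSV file; the column names and values are invented, and in practice the first argument to `pd.read_csv` would be the path to a file in your data subfolder.

```python
import io

import pandas as pd

# Stand-in for a CSV file on disk; contents are made up for illustration.
csv_text = """id,feature_a,feature_b,label
101,0.5,1.2,pos
102,0.7,0.9,neg
103,0.1,2.4,pos
"""

# index_col selects a column to use as the row labels instead of 0, 1, 2, ...
df = pd.read_csv(io.StringIO(csv_text), index_col="id")

print(df.head())   # first few rows
df.info()          # column names, non-null counts, and dtypes
```

Choosing a meaningful index column at load time lets later lookups use those labels directly, for example `df.loc[102]`.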

Accessing data in Pandas Dataframes

This notebook describes 2-D indexing and other access strategies for querying Pandas Dataframes.
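A minimal sketch of 2-D access, assuming a small hypothetical table: `.loc` takes a row label and a column label, `.iloc` takes integer positions, and a boolean mask on the rows can be combined with a column selection.

```python
import pandas as pd

# Hypothetical table, used only to illustrate the access patterns.
df = pd.DataFrame(
    {"age": [31, 45, 27], "score": [55.0, 72.5, 90.0]},
    index=["ann", "bob", "cat"],
)

cell = df.loc["bob", "score"]               # single value by row and column label
block = df.iloc[0:2, 0:1]                   # first two rows, first column, by position
older = df.loc[df["age"] > 30, ["score"]]   # filter rows, then pick columns

print(cell)
print(block)
print(older)
```

Passing a list of column names, as in `["score"]`, returns a Dataframe, whereas a single name returns a Series.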

Data

The dataset used in some of the notebooks can be found here (right-click and save it in a data subfolder of the folder where you save the notebooks for this class).