EEB125 Lecture 7: Introduction to Pandas#
Feb 26, 2025#
Karen Reid#
Introducing pandas#
So far this semester, you’ve worked in base Python, using only the data types, functions, and methods that are built into Python.
For the next few weeks, we’ll learn how to use one of the most common libraries for doing data science in Python: pandas.
What is pandas?#
pandas “is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool,
built on top of the Python programming language.”
Today, we’ll learn how to use pandas to:
Read in a dataset from a CSV file
Identify, use, and differentiate two new Pandas data types, DataFrame and Series
Describe the properties of a dataset representation in Pandas
Inspect parts of a large dataset
Perform simple data cleaning and data transformation operations on a dataset
Compute some summary statistics on a dataset
Importing pandas#
Because pandas doesn’t come built-in with Python, we need to import it to be able to use it in our code.
This is done with a Python statement called an import statement.
import pandas
Common alternate: import with a renaming (“nickname”):
import pandas as pd
Reading in data from a CSV file#
Using pandas, we can read in data from a csv file using the read_csv function.
species_data = pd.read_csv('PanTHERIA_WR05_Aug2008.csv')
Let’s explore: what is species_data?
species_data
type(species_data)
Formally, species_data is a DataFrame, which is a custom data type defined by pandas to represent tabular data.
Exploring DataFrames#
We can use the DataFrame.head() method to quickly see the first few rows.
species_data.head()
We use the .shape attribute to obtain the number of rows and columns of a DataFrame.
An attribute is like a method, but it just stores a piece of data, and is not a function.
You do not write parentheses after an attribute name.
species_data.shape
We can access just the number of rows or columns by using indexing on .shape (with square brackets), just like with lists.
num_rows = species_data.shape[0]
num_cols = species_data.shape[1]
print(f"There are {num_rows} rows and {num_cols} columns in the dataset.")
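To make the attribute-versus-method distinction concrete, here is a small sketch using a toy DataFrame (the column names here are made up for illustration, not taken from the PanTHERIA data):

```python
import pandas as pd

# A small toy DataFrame for illustration.
toy = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})

print(toy.shape)    # an attribute: no parentheses -> (3, 2)
print(toy.head(2))  # a method: parentheses required; returns the first 2 rows
```

Writing `toy.shape()` would raise an error, because a tuple is not callable; writing `toy.head` without parentheses gives you the method itself rather than calling it.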
DataFrame columns properties#
One of the most important properties of a DataFrame is its columns.
Each column has two important pieces of “metadata”:
the column’s name
the column’s type (i.e., the type of data stored in that column)
We can see the column names by accessing the .columns attribute of a DataFrame.
species_data.columns
The .columns attribute has a special type called Index, which is like a list.
You don’t need to worry about what Index is exactly, but if you want you can convert it into a list:
list(species_data.columns)
We can access column types by using the .dtypes attribute:
species_data.dtypes
Pandas uses its own custom data types to represent large datasets efficiently. They typically correspond to Python’s built-in data types.
For example:
float64 corresponds to a Python float
object is a special dtype that means “any value”
Note: by default, Pandas reads in text column data as object, not string.
We’ll see how to improve this later this lecture.
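As a small illustration of this default (using a toy Series rather than the PanTHERIA data), text data comes in with the generic object dtype, while convert_dtypes infers the more specific string dtype:

```python
import pandas as pd

# Text data gets the generic "object" dtype by default...
text = pd.Series(["lion", "tiger", "bear"])
print(text.dtype)  # object

# ...but convert_dtypes() infers the more specific "string" dtype.
converted = text.convert_dtypes()
print(converted.dtype)  # string
```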
Finally, we can use the DataFrame.info() method to display all of the previous information and more:
species_data.info()
Summary#
Given a DataFrame, we can access the following attributes/methods to obtain information about it.
| Attribute/Method | Description |
|---|---|
| `.shape` | (number of rows, number of columns) |
| `.columns` | column names |
| `.dtypes` | column names and types |
| `.info()` | all of the above, and more (e.g. non-null counts) |
| `.head()` | display the first few rows of the DataFrame |
Data Wrangling: Columns#
In data science, data wrangling (sometimes called data cleaning) is the process of turning raw data into a format more suitable for subsequent computation, analysis, and visualization.
There are many different types of data wrangling, but for now we’ll look at four techniques centred on columns:
renaming columns
converting column types
identifying and replacing “invalid” values
extracting a subset of columns to work with
Renaming columns#
We rename columns by using the DataFrame.rename(columns=...) method, where we pass in a dictionary mapping “original column name” to “new column name”.
old_to_new = {
'MSW05_Genus': 'Genus',
'MSW05_Species': 'Species',
'1-1_ActivityCycle': 'Activity Cycle',
'5-1_AdultBodyMass_g': 'Adult Body Mass (g)',
'2-1_AgeatEyeOpening_d': 'Age at Eye Opening (days)',
'17-1_MaxLongevity_m': 'Max Longevity (months)'
}
species_data_renamed = species_data.rename(columns=old_to_new)
species_data_renamed.head()
Converting column types#
We can also ask Pandas to automatically choose the best column types for an existing DataFrame.
This is done with the DataFrame.convert_dtypes() method.
species_data_converted = species_data_renamed.convert_dtypes()
species_data_converted.dtypes
Identifying and replacing “missing” values#
The PanTHERIA dataset uses a special value, -999, to represent missing or unknown data.
Instead of leaving these values in our DataFrame, we’ll replace them with a special pandas value called NA.
species_data_with_na = species_data_converted.replace(-999, pd.NA)
species_data_with_na.head()
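After this replacement, one quick way to check how much data is missing is to count the NA values in each column with `.isna().sum()`. A minimal sketch with a toy DataFrame (the column names and values are made up; the real dataset is much larger):

```python
import pandas as pd

# Toy data using -999 as the "missing" sentinel, like PanTHERIA.
toy = pd.DataFrame({"mass": [-999, 500.0, -999],
                    "longevity": [120.0, -999, 240.0]})
toy_with_na = toy.replace(-999, pd.NA)

# .isna() marks each NA as True; .sum() counts the Trues per column.
print(toy_with_na.isna().sum())  # mass: 2, longevity: 1
```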
Extracting a subset of columns#
Sometimes our full dataset contains too much information, and we only care about a subset of the data.
One common occurrence is when we only want a subset of the columns in a dataset.
For example, suppose we only care about the genus, species, body mass, and longevity of each species in our dataset.
Extracting a subset of columns#
We select a subset of columns in two steps:
Define a list containing the column names that we want to select.
Use square bracket “lookup” syntax on a
DataFrame, with the list inside the square brackets.
columns_to_keep = [
'Genus',
'Species',
'Adult Body Mass (g)',
'Max Longevity (months)'
]
species_data_final = species_data_with_na[columns_to_keep]
species_data_final.head()
Data Transformation: computing on columns#
A typical step in analysis of a dataset is to perform computations on individual columns, or operations that combine columns in some way.
For example:
Add 1 to each value in a column
Multiply the values in two columns together
“Find and Replace” values in a column
Retrieving a column by name#
We can extract a single column from a DataFrame using square brackets with a single string instead of a list of strings.
masses = species_data_final['Adult Body Mass (g)']
masses
But what exactly is masses?
type(masses)
masses is a Series, which is a pandas data type that represents a single column of data.
A Series is similar to a DataFrame, but it can only hold one “series” of data, rather than storing a whole table.
But most of the descriptive attributes/methods we learned for DataFrames can be applied to Series as well:
masses.shape
masses.dtype
masses.info()
# We can even obtain the original column name from the Series
masses.name
But if Series are a simplified version of DataFrames, why bother with them?
Because we can perform computations on Series “one element at a time”, without needing to use for loops!
Example: transform a single Series#
Goal: Given the species masses, convert to kg by dividing each one by 1000 and rounding to one decimal place.
Example: for a single mass like 492714.47, we’d compute
round(492714.47 / 1000, 1) # 492.7
But we want to do this for every mass!
masses_kg = masses / 1000
masses_kg_rounded = masses_kg.round(1)
masses_kg_rounded
Example: combine two Series#
Now let’s consider another problem: we’ll calculate the ratio between the longevity and mass of each species.
Example: for Camelus dromedarius, we’ll compute
480.0 / 492714.47
But again, we want to do this for each species!
masses = species_data_final["Adult Body Mass (g)"]
longevities = species_data_final["Max Longevity (months)"]
longevities / masses
Adding a new column to a DataFrame#
In addition to creating new variables to store computed Series, it is common to modify an existing DataFrame by adding a computed Series as a new column.
We can do this using square bracket notation again, this time on the left-hand side of an assignment statement.
# You don't need to worry about the following line.
# It just hides a warning message that's beyond the scope of this course.
pd.set_option('mode.chained_assignment', None)
species_data_final["Longevity-to-Mass Ratio"] = longevities / masses
species_data_final
WARNING!#
Warning: the previous code cell changes the existing data frame species_data_final, rather than creating a new DataFrame.
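If you want to keep the original DataFrame intact, one common approach is to make an explicit copy first and add the new column to the copy. A minimal sketch with a toy DataFrame (names and values invented for illustration):

```python
import pandas as pd

toy = pd.DataFrame({"mass": [500.0, 1200.0],
                    "longevity": [120.0, 240.0]})

# .copy() creates an independent DataFrame, so the original stays unchanged.
toy_with_ratio = toy.copy()
toy_with_ratio["ratio"] = toy_with_ratio["longevity"] / toy_with_ratio["mass"]

print("ratio" in toy.columns)             # False: original unchanged
print("ratio" in toy_with_ratio.columns)  # True
```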
Boolean Series and filtering rows#
Another common type of data transformation is to filter for specific rows in a dataset based on one or more conditions.
Goal: filter the rows of the dataset to keep the species with a mass greater than or equal to 100 kg.
As a first step, we create a boolean Series that stores True for the rows we want to keep, and False for the other rows.
is_large = species_data_final["Adult Body Mass (g)"] >= 100000
is_large
Then, we use this Series to index species_data_final by using square bracket notation.
species_data_final[is_large]
Note: lots of square brackets!#
One of the tricky things about DataFrames is that there are different ways of obtaining subsets of the dataset that all have very similar code syntax:
species_data_final[...]
The key principle is that the type of the value inside the square brackets determines what kind of “subsetting” operation is being performed.
| Type inside `[]` | Example | Return type | Which columns? | Which rows? |
|---|---|---|---|---|
| `str` | `species_data_final["Genus"]` | `Series` | The one specified | All rows |
| `list` of `str` | `species_data_final[columns_to_keep]` | `DataFrame` | The ones specified | All rows |
| boolean `Series` | `species_data_final[is_large]` | `DataFrame` | All columns | The ones specified |
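All three subsetting styles can be seen side by side on a toy DataFrame (the names and values here are made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({"genus": ["Canis", "Felis", "Loxodonta"],
                    "mass": [30.0, 4.0, 5000.0]})

one_column = toy["mass"]              # str inside -> Series (that column, all rows)
two_columns = toy[["genus", "mass"]]  # list inside -> DataFrame (those columns, all rows)
heavy_rows = toy[toy["mass"] > 100]   # boolean Series inside -> DataFrame (all columns, matching rows)

print(type(one_column).__name__)   # Series
print(type(two_columns).__name__)  # DataFrame
print(len(heavy_rows))             # 1
```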
Logical operators: & and |#
Sometimes we want to filter on two conditions.
To start, suppose we have these two boolean Series:
is_large = species_data_final["Adult Body Mass (g)"] >= 100000
is_long_lived = species_data_final["Max Longevity (months)"] >= 240
There are two common ways to filter based on a combination of these two conditions.
Filter 1: find rows where the species is large and is long-lived.
To do this, we use the & operator to combine the two Series.
filter1 = is_large & is_long_lived
species_data_final[filter1]
Filter 2: find rows where the species is large or is long-lived.
To do this, we use the | operator to combine the two Series.
filter2 = is_large | is_long_lived
species_data_final[filter2]
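One caveat worth knowing: if you write the comparisons and the & (or |) on a single line, each comparison must be wrapped in parentheses, because & binds more tightly than >= in Python. A sketch with a toy DataFrame (values invented for illustration):

```python
import pandas as pd

toy = pd.DataFrame({"mass": [500.0, 200000.0, 150000.0],
                    "longevity": [60.0, 300.0, 100.0]})

# Parentheses around each comparison are required when combining inline.
both = toy[(toy["mass"] >= 100000) & (toy["longevity"] >= 240)]
either = toy[(toy["mass"] >= 100000) | (toy["longevity"] >= 240)]

print(len(both))    # 1: only the second row satisfies both conditions
print(len(either))  # 2: the second and third rows satisfy at least one
```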
Exploratory analysis: sorting and basic descriptive statistics#
Sorting#
Suppose we want to take our DataFrame and sort it by the "Adult Body Mass (g)" column to see which species have the largest mass.
We do this by using the DataFrame.sort_values(by=...) method, where we pass in a str that names the column to sort by.
species_data_final.sort_values(by="Adult Body Mass (g)")
By default, the column values are sorted in ascending (low-to-high) order.
If we want to sort in descending (high-to-low) order, we can pass in an optional argument ascending=False to DataFrame.sort_values:
species_data_final.sort_values(by="Adult Body Mass (g)", ascending=False)
Descriptive statistics#
Here are five simple descriptive statistics that we can use to describe a collection of numbers:
sum
count (i.e., size; number of elements)
mean (average)
min
max
Unsurprisingly, we can compute all of these on any Pandas Series containing numeric data by calling a corresponding Series method.
| Statistic | Method |
|---|---|
| sum | `Series.sum()` |
| count | `Series.count()` |
| mean | `Series.mean()` |
| min | `Series.min()` |
| max | `Series.max()` |
Note: all five of these methods ignore NA values.
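To see that NA-ignoring behaviour concretely, here is a toy Series with one missing value; count() reports only the non-missing entries, and mean() averages over them:

```python
import pandas as pd

# A nullable-float Series with one missing value.
values = pd.Series([10.0, pd.NA, 30.0], dtype="Float64")

print(values.count())  # 2: the NA is ignored
print(values.sum())    # 40.0
print(values.mean())   # 20.0, i.e. 40.0 / 2
```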
Let’s start by extracting the body mass column (again).
totals = species_data_final["Adult Body Mass (g)"]
totals.head()
totals.sum()
totals.count()
totals.mean()
totals.min()
totals.max()
Further reading#
pandas is the most complex part of Python we’ve studied so far in this course, and so we expect you’ll need to review and practice more as we dive deeper into this library.
The official pandas website has some great introductory materials to continue with.