EEB125 Lecture 8: Merging, Grouping, and Plotting

EEB125 Lecture 8: Merging, Grouping, and Plotting#

March 5, 2025

Karen Reid

Recap#

Last class, we began learning how to use the Python library pandas, which provides lots of helpful data types and functions for performing data science.

We talked about a few different ways to extract data from a DataFrame. Fill in the table below:

Type inside `[...]`	Example	Return type	Which columns?	Which rows?
str`	`species_data_final["Adult Body Mass (g)"]`
	`species_data_final[["Genus", "Species"]]`
	`species_data_final[is_large]`

Let’s start by reading in the same PanTHERIA dataset as last week, from the file called PanTHERIA_WR05_Aug2008.csv.

import pandas as pd

species_raw_data = pd.read_csv("PanTHERIA_WR05_Aug2008.csv")

Next, let’s review some of the basic data cleaning operations from last class.

Extract the following columns, and rename them.

Original Name	New Name
`MSW05_Order`	`Order`
`MSW05_Binomial`	`Binomial`
`5-1_AdultBodyMass_g`	`Mass (g)`
`23-1_SexualMaturityAge_d`	`Maturity (days)`
`14-1_InterbirthInterval_d`	`Interbirth (days)`
`17-1_MaxLongevity_m`	`Longevity (months)`
`15-1_LitterSize`	`Litter Size`

Convert all columns into their most specific types (using DataFrame.convert_types()).
Replace all -999 with pd.NA.

# 1. Extract columns
important_columns = [
    "MSW05_Order",
    "MSW05_Binomial",
    "5-1_AdultBodyMass_g",
    "23-1_SexualMaturityAge_d",
    "14-1_InterbirthInterval_d",
    "17-1_MaxLongevity_m",
    "15-1_LitterSize"
]

species_subset_data = species_raw_data[important_columns]

# 1. Rename columns
old_to_new = {
    "MSW05_Order": "Order",
    "MSW05_Binomial": "Binomial",
    "5-1_AdultBodyMass_g": "Mass (g)",
    "23-1_SexualMaturityAge_d": "Maturity (days)",
    "14-1_InterbirthInterval_d": "Interbirth (days)",
    "17-1_MaxLongevity_m": "Longevity (months)",
    "15-1_LitterSize": "Litter Size"
}
species_renamed_data = species_subset_data.rename(columns=old_to_new)

# 2.Convert column types
species_converted_data = species_renamed_data.convert_dtypes()
species_converted_data.info()

# Replace -999 values
species_data = species_converted_data.replace(-999, pd.NA)
species_data

Exploring Maturity Age vs. Endangered Level#

Question: Are mammals that take longer go grow up at greater risk of extinction?

Right now, we have mammal maturity age: species_data["Maturity (days)"].

But what about extinction level?

We have information about extinction level, but it’s in a different dataset: iucn_status.csv.

iucn_raw_data = pd.read_csv("iucn_status.csv")
iucn_data = iucn_raw_data.convert_dtypes()
print(iucn_data.head())

Now it seems like we can “match” the species column with the PanTHERIA’s Binomial column, but unfortunately the format is different:

display(iucn_data["species"].head())
display(species_data["Binomial"].sort_values())

Working with text columns#

Let’s look at a single value first.

'Abditomys_latidens'

We need to replace the underscore _ with a space. For a single Python string, we can use the str.replace(..., ...) method:

single_species = "Abditomys_latidens"
# Add line to replace '_'

How do we do this for an entire Series? Is there a method like .round()?

Yes! Each Series has access to many string-specific methods, analogous to the built-in str methods in Python.

But there’s a catch: these methods must be accessed through a .str attribute. Here are some examples.

species_series = iucn_data["species"]
species_series.str.upper()

species_series = iucn_data["species"]
species_series.str.len()

And the one that we want:

species_series = iucn_data["species"]
species_series.str.replace("_", " ")

Hooray! Now let’s store this Series as a new column in iucn_data:

species_series = iucn_data["species"]
formatted_species_series = species_series.str.replace("_", " ")
# add a new column called "Formatted Species" to the iucn_data DataFrame

iucn_data.head()

Merging `DataFrame`s#

In pandas, a merge operation allows us to combine two different DataFrames, matching rows by their values on a specific column in each DataFrame.

In our datasets, species_data["Binomial"] has the same data as iucn_data["Formatted Species"].

display(species_data["Binomial"].head())
display(iucn_data["Formatted Species"].head())

Formally, we merge two DataFrames using a pandas function called merge. We’ll use merge with four arguments:

left: the first DataFrame to merge
right: the second DataFrame to merge
left_on: the name of the column in the left DataFrame to match with
right_on: the name of the column in the right DataFrame to match with

Here’s how we can call merge for our two datasets:

combined_data = pd.merge(
    left=species_data,
    right=iucn_data,
    left_on="Binomial",
    right_on="Formatted Species"
)
combined_data.head()

Grouping#

Now suppose we want to determine the average maturity age for each iucn_status level.

We can do this manually by filtering for each group separately (similar to what you did on last week’s homework).

is_cr = combined_data["iucn_status"] == "CR"  # Create a boolean Series
cr_species = ?????            # Use the Series to *filter rows* of the dataset
????         # Extract a column and compute a mean

is_en = combined_data["iucn_status"] == "EN"
en_species = combined_data[is_en]
en_species["Maturity (days)"].mean()

It’s also possible to use a for loop to try each of the possible statuses (at least, assuming we know them in advance).

statuses = ["CR", "EN", "VU", "NT", "LC"]

for status in statuses:
    is_status = combined_data["iucn_status"] == status
    species_with_status = combined_data[is_status]
    mean = species_with_status["Maturity (days)"].mean()
    print(f"{status} species average time to maturity (in days): {mean}")

But there’s a better approach that uses a new Pandas concept: grouping. This allows us to take a DataFrame and put the rows into groups based on their value in a given column.

combined_data.groupby("iucn_status")

A DataFrameGroupBy object is conceptually similar to a “dictionary of DataFrames”, but is implemented by pandas in a more complex way. You aren’t responsible for the details of this type of data, but you do need to know about how you can work with these groups.

Retrieving a group#

Given a DataFrameGroupBy object, we can obtain the DataFrame for one of the groups using the DataFrameGroupBy.get_group method.

groups_by_status = combined_data.groupby("iucn_status")
groups_by_status.get_group("EN")

Displaying a DataFrameGroupBy#

But what if we want to see the whole object?

# Loop over the groups
for key, item in groups_by_status:
    print(groups_by_status.get_group(key), "\n\n")

Computing descriptive statistics per group (!)#

Recall from last class, we can compute descriptive statistics on a numerical column by using Series methods:

Series.count()
Series.sum()
Series.min()
Series.max()
Series.mean()

We can do the same thing on a per-group basis in two steps:

First, extract the relevant column (using familiar square bracket syntax).
Call the desired descriptive statistics method.

Example: calculate the mean age to maturity for each IUCN status.

groups_by_status = combined_data.groupby("iucn_status")
maturities_by_status = groups_by_status["Maturity (days)"]
print(maturities_by_status.mean())

Since the result is just a regular Series, we can apply the same methods as last time—including sorting it!

maturities_by_status.mean().sort_values(ascending=False)

Data Visualization#

The last topic we’ll learn about today is how to do some basic data visualization using pandas. Both Series and DataFrame have multiple methods for plotting data, which can be accessed through the .plot attribute.

Plotting a `Series`#

We’ll start with just plotting a Series, using the grouped means from the previous example.

mean_maturities = groups_by_status["Maturity (days)"].mean().sort_values(ascending=False)
mean_maturities

Let’s plot this series using a bar plot:

mean_maturities.plot.bar()

Each plotting method accepts optional arguments to configure the appearance of the plot. For example:

title: the title of the plot
xlabel: the label for the x-axis
ylabel: the label for the y-axis

mean_maturities.plot.bar(
    title="Mean Age to Maturity by IUCN Status",
    xlabel="IUCN Status",
    ylabel="Mean Age to Maturity (days)"
)

By default, Series.plot.bar uses the Series index to label the bars. We can customize these labels by using the Series.rename(index=...).

old_to_new = {
    "DD": "Data Deficient",
    "EN": "Endangered",
    "CR": "Critically Endangered",
    "VU": "Vulnerable",
    "NT": "Near Threatened",
    "LC": "Least Concern",
    "EW": "Extinct in the Wild",
    "EX": "Extinct"
}

mean_maturities_renamed = mean_maturities.rename(index=old_to_new)

Plotting a DataFrame is similar to plotting a Series, but we must specify the columns to use for the x- and y-axes of the plot.

For example, we can take our combined_data and plot each species’ litter size against its maturity age as a scatter plot.

mean_maturities_renamed.plot.bar(
    title="Mean Age to Maturity by IUCN Status",
    xlabel="IUCN Status",
    ylabel="Mean Age to Maturity (days)"
)

Plotting (columns from) a `DataFrame`#

Since DataFrames have multiple columns, we need to specify the columns we want to visualize.

Example: create a scatter plot that plots litter size against age to maturity for each species.

combined_data.plot.scatter(
    x="Litter Size",
    y="Maturity (days)",
    title="Litter Size vs. Mean Age to Maturity Among Mammals"
)

Example: create a box plot that plots IUCN status against age to maturity.

combined_data.plot.box(
    by="iucn_status",
    column="Maturity (days)",
    title="Age to Maturity Among Mammals by IUCN Status"
)

EEB125 Lecture 8: Merging, Grouping, and Plotting

Contents

EEB125 Lecture 8: Merging, Grouping, and Plotting#

Recap#

Exploring Maturity Age vs. Endangered Level#

Working with text columns#

Merging `DataFrame`s#

Grouping#

Retrieving a group#

Displaying a DataFrameGroupBy#

Computing descriptive statistics per group (!)#

Data Visualization#

Plotting a `Series`#

Plotting (columns from) a `DataFrame`#

Further reading#

EEB125 Lecture 8: Merging, Grouping, and Plotting

Contents

EEB125 Lecture 8: Merging, Grouping, and Plotting#

Recap#

Exploring Maturity Age vs. Endangered Level#

Working with text columns#

Merging DataFrames#

Grouping#

Retrieving a group#

Displaying a DataFrameGroupBy#

Computing descriptive statistics per group (!)#

Data Visualization#

Plotting a Series#

Plotting (columns from) a DataFrame#

Further reading#

Merging `DataFrame`s#

Plotting a `Series`#

Plotting (columns from) a `DataFrame`#