EEB125 Lecture 8: Merging, Grouping, and Plotting#
March 5, 2025
Karen Reid
Recap#
Last class, we began learning how to use the Python library pandas, which provides lots of helpful data types and functions for performing data science.
We talked about a few different ways to extract data from a DataFrame. Fill in the table below:
Type inside |
Example |
Return type |
Which columns? |
Which rows? |
|---|---|---|---|---|
str` |
|
|||
|
||||
|
Let’s start by reading in the same PanTHERIA dataset as last week, from the file called PanTHERIA_WR05_Aug2008.csv.
import pandas as pd
species_raw_data = pd.read_csv("PanTHERIA_WR05_Aug2008.csv")
Next, let’s review some of the basic data cleaning operations from last class.
Extract the following columns, and rename them.
Original Name
New Name
MSW05_OrderOrderMSW05_BinomialBinomial5-1_AdultBodyMass_gMass (g)23-1_SexualMaturityAge_dMaturity (days)14-1_InterbirthInterval_dInterbirth (days)17-1_MaxLongevity_mLongevity (months)15-1_LitterSizeLitter SizeConvert all columns into their most specific types (using
DataFrame.convert_types()).Replace all
-999withpd.NA.
# 1. Extract columns
important_columns = [
"MSW05_Order",
"MSW05_Binomial",
"5-1_AdultBodyMass_g",
"23-1_SexualMaturityAge_d",
"14-1_InterbirthInterval_d",
"17-1_MaxLongevity_m",
"15-1_LitterSize"
]
species_subset_data = species_raw_data[important_columns]
# 1. Rename columns
old_to_new = {
"MSW05_Order": "Order",
"MSW05_Binomial": "Binomial",
"5-1_AdultBodyMass_g": "Mass (g)",
"23-1_SexualMaturityAge_d": "Maturity (days)",
"14-1_InterbirthInterval_d": "Interbirth (days)",
"17-1_MaxLongevity_m": "Longevity (months)",
"15-1_LitterSize": "Litter Size"
}
species_renamed_data = species_subset_data.rename(columns=old_to_new)
# 2.Convert column types
species_converted_data = species_renamed_data.convert_dtypes()
species_converted_data.info()
# Replace -999 values
species_data = species_converted_data.replace(-999, pd.NA)
species_data
Exploring Maturity Age vs. Endangered Level#
Question: Are mammals that take longer go grow up at greater risk of extinction?
Right now, we have mammal maturity age: species_data["Maturity (days)"].
But what about extinction level?
We have information about extinction level, but it’s in a different dataset: iucn_status.csv.
iucn_raw_data = pd.read_csv("iucn_status.csv")
iucn_data = iucn_raw_data.convert_dtypes()
print(iucn_data.head())
Now it seems like we can “match” the species column with the PanTHERIA’s Binomial column, but unfortunately the format is different:
display(iucn_data["species"].head())
display(species_data["Binomial"].sort_values())
Working with text columns#
Let’s look at a single value first.
'Abditomys_latidens'
We need to replace the underscore _ with a space.
For a single Python string, we can use the str.replace(..., ...) method:
single_species = "Abditomys_latidens"
# Add line to replace '_'
How do we do this for an entire Series? Is there a method like .round()?
Yes! Each Series has access to many string-specific methods, analogous to the built-in str methods in Python.
But there’s a catch: these methods must be accessed through a .str attribute. Here are some examples.
species_series = iucn_data["species"]
species_series.str.upper()
species_series = iucn_data["species"]
species_series.str.len()
And the one that we want:
species_series = iucn_data["species"]
species_series.str.replace("_", " ")
Hooray! Now let’s store this Series as a new column in iucn_data:
species_series = iucn_data["species"]
formatted_species_series = species_series.str.replace("_", " ")
# add a new column called "Formatted Species" to the iucn_data DataFrame
iucn_data.head()
Merging DataFrames#
In pandas, a merge operation allows us to combine two different DataFrames, matching rows by their values on a specific column in each DataFrame.
In our datasets, species_data["Binomial"] has the same data as iucn_data["Formatted Species"].
display(species_data["Binomial"].head())
display(iucn_data["Formatted Species"].head())
Formally, we merge two DataFrames using a pandas function called merge.
We’ll use merge with four arguments:
left: the firstDataFrameto mergeright: the secondDataFrameto mergeleft_on: the name of the column in theleftDataFrameto match withright_on: the name of the column in therightDataFrameto match with
Here’s how we can call merge for our two datasets:
combined_data = pd.merge(
left=species_data,
right=iucn_data,
left_on="Binomial",
right_on="Formatted Species"
)
combined_data.head()
Grouping#
Now suppose we want to determine the average maturity age for each iucn_status level.
We can do this manually by filtering for each group separately (similar to what you did on last week’s homework).
is_cr = combined_data["iucn_status"] == "CR" # Create a boolean Series
cr_species = ????? # Use the Series to *filter rows* of the dataset
???? # Extract a column and compute a mean
is_en = combined_data["iucn_status"] == "EN"
en_species = combined_data[is_en]
en_species["Maturity (days)"].mean()
It’s also possible to use a for loop to try each of the possible statuses (at least, assuming we know them in advance).
statuses = ["CR", "EN", "VU", "NT", "LC"]
for status in statuses:
is_status = combined_data["iucn_status"] == status
species_with_status = combined_data[is_status]
mean = species_with_status["Maturity (days)"].mean()
print(f"{status} species average time to maturity (in days): {mean}")
But there’s a better approach that uses a new Pandas concept: grouping.
This allows us to take a DataFrame and put the rows into groups based on their value in a given column.
combined_data.groupby("iucn_status")
A DataFrameGroupBy object is conceptually similar to a “dictionary of DataFrames”, but is implemented by pandas in a more complex way.
You aren’t responsible for the details of this type of data, but you do need to know about how you can work with these groups.
Retrieving a group#
Given a DataFrameGroupBy object, we can obtain the DataFrame for one of the groups using the DataFrameGroupBy.get_group method.
groups_by_status = combined_data.groupby("iucn_status")
groups_by_status.get_group("EN")
Displaying a DataFrameGroupBy#
But what if we want to see the whole object?
# Loop over the groups
for key, item in groups_by_status:
print(groups_by_status.get_group(key), "\n\n")
Computing descriptive statistics per group (!)#
Recall from last class, we can compute descriptive statistics on a numerical column by using Series methods:
Series.count()Series.sum()Series.min()Series.max()Series.mean()
We can do the same thing on a per-group basis in two steps:
First, extract the relevant column (using familiar square bracket syntax).
Call the desired descriptive statistics method.
Example: calculate the mean age to maturity for each IUCN status.
groups_by_status = combined_data.groupby("iucn_status")
maturities_by_status = groups_by_status["Maturity (days)"]
print(maturities_by_status.mean())
Since the result is just a regular Series, we can apply the same methods as last time—including sorting it!
maturities_by_status.mean().sort_values(ascending=False)
Data Visualization#
The last topic we’ll learn about today is how to do some basic data visualization using pandas.
Both Series and DataFrame have multiple methods for plotting data, which can be accessed through the .plot attribute.
Plotting a Series#
We’ll start with just plotting a Series, using the grouped means from the previous example.
mean_maturities = groups_by_status["Maturity (days)"].mean().sort_values(ascending=False)
mean_maturities
Let’s plot this series using a bar plot:
mean_maturities.plot.bar()
Each plotting method accepts optional arguments to configure the appearance of the plot. For example:
title: the title of the plotxlabel: the label for the x-axisylabel: the label for the y-axis
mean_maturities.plot.bar(
title="Mean Age to Maturity by IUCN Status",
xlabel="IUCN Status",
ylabel="Mean Age to Maturity (days)"
)
By default, Series.plot.bar uses the Series index to label the bars.
We can customize these labels by using the Series.rename(index=...).
old_to_new = {
"DD": "Data Deficient",
"EN": "Endangered",
"CR": "Critically Endangered",
"VU": "Vulnerable",
"NT": "Near Threatened",
"LC": "Least Concern",
"EW": "Extinct in the Wild",
"EX": "Extinct"
}
mean_maturities_renamed = mean_maturities.rename(index=old_to_new)
Plotting a DataFrame is similar to plotting a Series, but we must specify the columns to use for the x- and y-axes of the plot.
For example, we can take our combined_data and plot each species’ litter size against its maturity age as a scatter plot.
mean_maturities_renamed.plot.bar(
title="Mean Age to Maturity by IUCN Status",
xlabel="IUCN Status",
ylabel="Mean Age to Maturity (days)"
)
Plotting (columns from) a DataFrame#
Since DataFrames have multiple columns, we need to specify the columns we want to visualize.
Example: create a scatter plot that plots litter size against age to maturity for each species.
combined_data.plot.scatter(
x="Litter Size",
y="Maturity (days)",
title="Litter Size vs. Mean Age to Maturity Among Mammals"
)
Example: create a box plot that plots IUCN status against age to maturity.
combined_data.plot.box(
by="iucn_status",
column="Maturity (days)",
title="Age to Maturity Among Mammals by IUCN Status"
)