Homework 8: Investigating Mammalian Fecundity and Conservation using Filtering, Joins, and Arithmetic#
Logistics#
Due date: The homework is due at 11:59pm on Tuesday, March 11.
You will submit your work on MarkUs. To submit your work:
Download this file (Homework_8.ipynb) from JupyterHub. (See our JupyterHub Guide for detailed instructions.)
Submit this file to MarkUs under the hw8 assignment. (See our MarkUs Guide for detailed instructions.)
All homeworks will take place in a Jupyter notebook (like this one).
Introduction#
For this week’s homework, we are going to continue to work with the PanTHERIA dataset and the IUCN categories.
We will create a new metric using the PanTHERIA data that estimates: how many offspring do individuals within each species produce throughout their lifetime, on average? We call this “lifetime fecundity”. We will be looking to see whether there is a relationship between average lifetime fecundity and a species’ risk of going extinct.
In this homework, you will:
Start a data story in a notebook exploring the question: is the number of offspring birthed by a lineage related to its risk of extinction?
Write and use advanced Boolean expressions to filter specific observations in our dataset. (Specifically, you’re encouraged to practice using comparison operators such as !=, <=, >=, >, <.)
Join two related datasets to create a larger, more comprehensive dataset.
Perform arithmetic on several pandas Series to estimate the maximum theoretical number of offspring that mothers within each species are capable of bearing throughout their lifetime.
Question#
The overarching question you’re answering in this homework:
Is there a difference in IUCN category between species with smaller mean lifetime fecundity and species with larger mean lifetime fecundity?
Problem 1: Read in the data files#
Import the raw data from the PanTHERIA (PanTHERIA_WR05_Aug2008.csv) and phylacine (phylacine.csv) datasets and name the DataFrames as pantheria_raw and iucn_raw, respectively.
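As a reminder of how pd.read_csv behaves, here is a minimal sketch on made-up data. The CSV text, the column names, and the variable toy_raw are hypothetical and are not part of the homework files; with a real file you would pass its filename instead of the in-memory text.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk (not homework data).
csv_text = "species,mass_g\nMus musculus,19.3\nLoxodonta africana,3824540\n"

# pd.read_csv accepts a file path or any file-like object and returns a DataFrame.
toy_raw = pd.read_csv(io.StringIO(csv_text))

# The first line of the CSV becomes the column names.
print(toy_raw.head())
```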
# The following code is provided for you; please do not change it.
import pandas as pd
pd.set_option('mode.chained_assignment', None)
# Write your code here
# Check your work
display(pantheria_raw.head())
display(iucn_raw.head())
Problem 2: Cleaning the data#
You’ll now perform various data cleaning operations on these two datasets, similar to what you did last week.
At each step, we’ve specified a variable to store the result in, so that all of your work can be autograded.
Note that as we saw in lecture, all of these steps create a new DataFrame, rather than modifying an existing DataFrame. (That makes it easier for you to check your work at each step.)
You should use the result of the previous step as the “input” of the next step.
Problem 2a: Cleaning the PanTHERIA data#
Extract just the columns 'MSW05_Order', 'MSW05_Binomial', '23-1_SexualMaturityAge_d', '14-1_InterbirthInterval_d', '17-1_MaxLongevity_m', and '15-1_LitterSize', in the order listed. Store the resulting DataFrame in pantheria_data. You are encouraged, but not required, to create a new list variable to store the column names, just like we did in lecture.
Rename the columns according to the table below. Store the result in pantheria_data_renamed.

| Old column name | New column name |
| --- | --- |
| MSW05_Order | Order |
| MSW05_Binomial | Genus_Species |
| 23-1_SexualMaturityAge_d | Age to Maturity (days) |
| 14-1_InterbirthInterval_d | Interbirth Interval (days) |
| 17-1_MaxLongevity_m | Max Longevity (months) |
| 15-1_LitterSize | Litter Size |

Use the DataFrame.convert_dtypes() method to automatically convert each column into its most appropriate type, storing the resulting DataFrame in a variable called pantheria_data_converted.
Finally, use the DataFrame.replace(old, new) method to replace all occurrences of -999 with pd.NA. Store the result in a variable called pantheria_data_clean.
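The four-step pattern above (select, rename, convert, replace) can be sketched on a small made-up table. Everything here (the DataFrame toy, its column names, and the toy_* variables) is hypothetical and only illustrates how each step feeds the next; it is not the homework solution.

```python
import pandas as pd

# Hypothetical data; -999 plays the role of a missing-value code.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["a", "b", "c"],
    "x_raw": [1.5, -999, 3.0],
    "unused": [0, 0, 0],
})

# Step 1: select columns using a list of names (a new DataFrame is returned).
columns_to_keep = ["name", "x_raw"]
toy_data = toy[columns_to_keep]

# Step 2: rename columns with a dictionary mapping old names to new names.
toy_renamed = toy_data.rename(columns={"name": "Name", "x_raw": "X"})

# Step 3: let pandas pick the most appropriate dtype for each column.
toy_converted = toy_renamed.convert_dtypes()

# Step 4: replace the missing-value code with pd.NA.
toy_clean = toy_converted.replace(-999, pd.NA)

print(toy_clean)
```

Because each step returns a new DataFrame, the original toy still contains -999 afterwards, which makes it easy to compare the result of each step with its input.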
# Write your code here
# Check your work
pantheria_data_clean.head()
Problem 2b: Cleaning the IUCN data#
Extract just the columns 'Binomial.1.2' and 'IUCN.Status.1.2'. Store the resulting DataFrame in iucn_data.
Rename the columns to Genus_Species_IUCN and IUCN Status, respectively. Store the resulting DataFrame in iucn_data_renamed.
Convert column types using DataFrame.convert_dtypes, and store the resulting DataFrame in iucn_data_clean.
# Write your code here
# Check your work
iucn_data_clean.head()
Problem 3: Merging the DataFrames#
Now let’s do something we just learned this week: merge the two DataFrames together.
To do so, we’ll need to make sure that the two “Genus_Species” columns in the DataFrames match.
We’ll take an approach that is similar to, but slightly different from, the one we used in lecture.
Problem 3a: String formatting#
Create a new Series called genus_species_formatted that consists of the 'Genus_Species' column from pantheria_data_clean, except with all spaces (" ") replaced by underscores ("_"). To do this, you’ll need to extract the right column from the DataFrame and then use the Series.str.replace(old, new) method on the column.
Modify pantheria_data_clean by adding the Series from the previous step to it under the column name 'Genus_Species_Formatted'.
Reminder: because your code for this question actually modifies pantheria_data_clean, if you want to restart you should re-run all cells above this one (in the JupyterHub menu, select Cell -> Run All Above).
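To illustrate the two steps (format a string column, then attach it as a new column), here is a minimal sketch on hypothetical data. The names toy, label, and label_formatted are made up, and a hyphen is used as the replacement character purely for illustration.

```python
import pandas as pd

# Hypothetical data (not from the homework files).
toy = pd.DataFrame({"label": ["red fox", "grey wolf"], "count": [3, 5]})

# The .str accessor applies a string method to every value in a Series.
# regex=False treats the pattern as a literal string.
label_formatted = toy["label"].str.replace(" ", "-", regex=False)

# Assigning to a new column name modifies the existing DataFrame in place.
toy["label_formatted"] = label_formatted

print(toy)
```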
# Write your code here
# Check your work
pantheria_data_clean.head()
Problem 3b: Merge the two DataFrames#
Merge pantheria_data_clean and iucn_data_clean using the pd.merge function.
You’ll need to determine the appropriate arguments for left_on and right_on.
Name the resulting DataFrame joined_pantheria_iucn_data.
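As a reminder of what left_on and right_on do, here is a minimal sketch with two hypothetical tables whose key columns have different names. The DataFrames left and right and all of their columns are made up for illustration.

```python
import pandas as pd

# Two hypothetical tables that share keys under different column names.
left = pd.DataFrame({"key_left": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key_right": ["b", "c", "d"], "y": [10, 20, 30]})

# left_on / right_on name the column to match in each table.
# By default pd.merge performs an inner join: only keys present in both tables are kept.
merged = pd.merge(left, right, left_on="key_left", right_on="key_right")

print(merged)
```

Here only the keys "b" and "c" appear in both tables, so the merged DataFrame has two rows, and it keeps both key columns.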
# Write your code here
# Check your work
joined_pantheria_iucn_data.head()
Problem 4: Eliminate irrelevant IUCN categories#
Now that we have our joined DataFrame, we’re almost ready to perform the computation necessary to answer our question.
But first, the IUCN status values 'DD' and 'EP' are not useful to us, so we’ll remove them.
Extract all rows from joined_pantheria_iucn_data with IUCN categories OTHER THAN 'DD' and 'EP'. Name this resulting DataFrame pantheria_iucn_clean.
You are strongly encouraged to create your own variable to store the boolean Series you’re using as a filter. You’ll need to use a comparison operator (e.g., == or !=) along with one of the two logical operators, either & or |.
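The general pattern of combining two comparisons into one boolean Series can be sketched as follows. The data and the status codes "XX" and "YY" are hypothetical. Note that each comparison is wrapped in parentheses, because & binds more tightly than != in Python.

```python
import pandas as pd

# Hypothetical data (the codes "XX" and "YY" are made up).
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "status": ["LC", "XX", "EN", "YY"],
})

# Each comparison produces a boolean Series; & combines them element by element.
keep = (toy["status"] != "XX") & (toy["status"] != "YY")

# Indexing with a boolean Series keeps only the rows where it is True.
toy_kept = toy[keep]

print(toy_kept)
```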
# Write your code here
# Check your work
pantheria_iucn_clean
Problem 5: Computing fecundity#
Using pantheria_iucn_clean, you will estimate a new measurement that we will call Max Lifetime Fecundity.
This will be computed using the following columns:
'Age to Maturity (days)': How long it takes for the average individual to grow to maturity. This is measured in days as the interval between birth and the time when the individual first reproduces.
'Max Longevity (months)': How long can individuals within each species live, expressed in months.
'Interbirth Interval (days)': How long do adult females wait, on average, between giving birth and becoming pregnant again?
'Litter Size': How many babies do females within each species have at one time, on average?
The maximum fecundity of a species is calculated using the following formula:
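The formula itself does not appear in this text. Going by the four column descriptions above, one consistent reading is the number of litters that fit into the reproductive lifespan, multiplied by the litter size, with all durations expressed in years. Treat this as an assumption to confirm against the lecture notes, not as the official definition:

```latex
\[
\text{Max Fecundity}
  = \frac{\text{Max Longevity} - \text{Age to Maturity}}{\text{Interbirth Interval}}
    \times \text{Litter Size}
\]
```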
Problem 5a: Adding the column#
Your task is to add a new column called 'Max Fecundity' to pantheria_iucn_clean that contains the maximum fecundity of each species. Do not perform any rounding.
NOTE: currently, the age to maturity, longevity, and interbirth interval columns use different units. You’ll first need to convert them to years by dividing by 365 (for days) or 12 (for months) before you can use the above formula.
Do not modify the existing pantheria_iucn_clean for these unit conversions; instead, use new variables to store the converted Series.
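Here is a minimal sketch of converting units into new variables and then doing arithmetic on the converted Series. The columns duration_d and span_m and the computed ratio are hypothetical; they only show the pattern and are not the fecundity calculation.

```python
import pandas as pd

# Hypothetical data: one column in days, one in months.
toy = pd.DataFrame({
    "duration_d": [730.0, 365.0],
    "span_m": [120.0, 24.0],
})

# Store converted values in new variables; the original columns are left untouched.
duration_years = toy["duration_d"] / 365
span_years = toy["span_m"] / 12

# Arithmetic between two Series is applied row by row.
toy["durations_per_span"] = span_years / duration_years

print(toy)
```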
# Write your code here
# Check your work
pantheria_iucn_clean
Problem 5b: Sort#
Finally, use the DataFrame.sort_values method to sort pantheria_iucn_clean in ascending order of its 'Max Fecundity' column. You may, but are not required to, store the result in a variable.
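As a quick reminder of how sort_values behaves, here is a sketch on hypothetical data. The DataFrame toy and its columns are made up.

```python
import pandas as pd

# Hypothetical data.
toy = pd.DataFrame({"name": ["a", "b", "c"], "score": [2.5, 0.5, 1.0]})

# sort_values sorts in ascending order by default and returns a new DataFrame.
toy_sorted = toy.sort_values("score")

print(toy_sorted)
```

The original toy keeps its row order; only the returned DataFrame is sorted.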
# Write your code here
Problem 6: Computing the average Max Fecundity for each IUCN Status#
You will now calculate the average Max Fecundity value for each IUCN Status group.
As in lecture, this will involve two steps:
Group pantheria_iucn_clean by the IUCN Status column, using the DataFrame.groupby() method.
Compute the mean of the Max Fecundity column for the grouped data.
Store the output of these steps in a new variable called iucn_avg_fecundity. This variable should be of type Series, and associate each IUCN category with the average of the Max Fecundity values for the species in that category.
You may store the output of Step 1 in another variable, if you wish, or chain both the steps together in one command.
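The two-step group-then-aggregate pattern can be sketched on hypothetical data as follows. The column names group and value are made up for illustration.

```python
import pandas as pd

# Hypothetical data with two groups.
toy = pd.DataFrame({
    "group": ["x", "y", "x", "y"],
    "value": [1.0, 2.0, 3.0, 6.0],
})

# Step 1: group the rows by the values in one column.
grouped = toy.groupby("group")

# Step 2: select one column of the grouped data and average it within each group.
group_means = grouped["value"].mean()

# The result is a Series indexed by group label.
print(type(group_means))
print(group_means)
```

The same result can be written as a single chained expression, toy.groupby("group")["value"].mean().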
# Write your code here
# This code is provided to check your work. Do not modify it.
print(type(iucn_avg_fecundity))
display(iucn_avg_fecundity.sort_values())
Conclusion#
Based on your analysis, answer each of these questions:
Explain, in biological terms, what our new 'Max Fecundity' column measures. (3 marks)
What can you say about the relationship between the IUCN Status and the average maximum fecundity of species? (3 marks)
WRITE YOUR RESPONSE HERE.