Homework 8: Investigating Mammalian Fecundity and Conservation using Filtering, Joins, and Arithmetic#

Logistics#

Due date: The homework is due at 11:59pm on Tuesday, March 11.

You will submit your work on MarkUs. To submit your work:

  1. Download this file (Homework_8.ipynb) from JupyterHub. (See our JupyterHub Guide for detailed instructions.)

  2. Submit this file to MarkUs under the hw8 assignment. (See our MarkUs Guide for detailed instructions.)

All homeworks will take place in a Jupyter notebook (like this one).

Introduction#

For this week’s homework, we are going to continue to work with the PanTHERIA dataset and the IUCN categories.

We will create a new metric using the PanTHERIA data that estimates how many offspring individuals within each species produce over their lifetime, on average. We call this “lifetime fecundity”. We will then look at whether there is a relationship between average lifetime fecundity and a species’ risk of going extinct.

In this homework, you will:

  • Start a data story in a notebook exploring the question: is the number of offspring birthed by a lineage related to its risk of extinction?

  • Write and use advanced Boolean expressions to filter specific observations in our dataset. (Specifically, you’re encouraged to practice using comparison operators such as !=, <=, >=, >, and <.)

  • Join two related datasets to create a larger, more comprehensive dataset.

  • Perform arithmetic on several pandas Series to estimate the maximum theoretical number of offspring that mothers within each species are capable of bearing throughout their lifetime.

Question#

The overarching question you’re answering in this homework:

Is there a difference in IUCN category between species with smaller mean lifetime fecundity and species with larger mean lifetime fecundity?

Problem 1: Read in the data files#

Import the raw data from the PanTHERIA (PanTHERIA_WR05_Aug2008.csv) and phylacine (phylacine.csv) datasets and name the DataFrames as pantheria_raw and iucn_raw, respectively.
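
As a reminder of how reading a CSV works, here is a minimal sketch on made-up data. The in-memory CSV text and the name toy_raw are hypothetical and stand in for a real file; with an actual file you would pass its path as a string instead.

```python
import io
import pandas as pd

# Hypothetical two-row CSV standing in for a real data file on disk.
toy_csv = io.StringIO(
    "species,mass_g\n"
    "Mus musculus,19.3\n"
    "Panthera leo,158623.9\n"
)

# pd.read_csv accepts a file path or a file-like object and returns a DataFrame.
toy_raw = pd.read_csv(toy_csv)
print(toy_raw.shape)
```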

# The following code is provided for you; please do not change it.
import pandas as pd
pd.set_option('mode.chained_assignment', None) 

# Write your code here

# Check your work
display(pantheria_raw.head())
display(iucn_raw.head())

Problem 2: Cleaning the data#

You’ll now perform various data cleaning operations on these two datasets, similar to what you did last week. At each step, we’ve specified a variable to store the result in, so that all of your work can be autograded. Note that as we saw in lecture, all of these steps create a new DataFrame, rather than modifying an existing DataFrame. (That makes it easier for you to check your work at each step.) You should use the result of the previous step as the “input” of the next step.

Problem 2a: Cleaning the PanTHERIA data#

  1. Extract just the columns 'MSW05_Order', 'MSW05_Binomial', '23-1_SexualMaturityAge_d', '14-1_InterbirthInterval_d', '17-1_MaxLongevity_m', and '15-1_LitterSize', in the order listed. Store the resulting DataFrame in pantheria_data.

    You are encouraged, but not required, to create a new list variable to store the column names, just like we did in lecture.

  2. Rename the columns according to the table below. Store the result in pantheria_data_renamed.

    Old column name              New column name
    MSW05_Order                  Order
    MSW05_Binomial               Genus_Species
    23-1_SexualMaturityAge_d     Age to Maturity (days)
    14-1_InterbirthInterval_d    Interbirth Interval (days)
    17-1_MaxLongevity_m          Max Longevity (months)
    15-1_LitterSize              Litter Size

  3. Use the DataFrame.convert_dtypes() method to automatically convert each column into its most appropriate type, storing the resulting DataFrame in a variable called pantheria_data_converted.

  4. Finally, use the DataFrame.replace(old, new) method to replace all occurrences of -999 with pd.NA. Store the result in a variable called pantheria_data_clean.
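
The four steps above follow a common select, rename, convert, replace pattern. The sketch below shows that pattern on a small made-up table; all column names and variable names in it are hypothetical and chosen only for illustration. Each step returns a new DataFrame and leaves its input untouched.

```python
import pandas as pd

# Hypothetical raw table; -999 marks a missing measurement.
toy_raw = pd.DataFrame({
    "id": [1, 2, 3],
    "raw_name": ["a b", "c d", "e f"],
    "raw_mass": [10.5, -999.0, 3.25],
    "unused": ["x", "y", "z"],
})

# Step 1: keep only the columns of interest, in the listed order.
wanted_columns = ["raw_name", "raw_mass"]
toy_data = toy_raw[wanted_columns]

# Step 2: rename columns with a dict mapping old names to new names.
toy_renamed = toy_data.rename(columns={"raw_name": "Name", "raw_mass": "Mass (g)"})

# Step 3: let pandas choose a suitable nullable dtype for each column.
toy_converted = toy_renamed.convert_dtypes()

# Step 4: replace the -999 sentinel with a proper missing value.
toy_clean = toy_converted.replace(-999, pd.NA)

print(toy_clean)
```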

# Write your code here

# Check your work
pantheria_data_clean.head()

Problem 2b: Cleaning the IUCN data#

  1. Extract just the columns 'Binomial.1.2' and 'IUCN.Status.1.2'. Store the resulting DataFrame in iucn_data.

  2. Rename the columns to Genus_Species_IUCN and IUCN Status, respectively. Store the resulting DataFrame in iucn_data_renamed.

  3. Convert column types using DataFrame.convert_dtypes, and store the resulting DataFrame in iucn_data_clean.

# Write your code here

# Check your work
iucn_data_clean.head()

Problem 3: Merging the DataFrames#

Now let’s do something we just learned this week: merge the two DataFrames together. To do so, we’ll need to make sure that the two “Genus_Species” columns in the DataFrames match. We’ll take an approach that is similar to, but slightly different from, the one we used in lecture.

Problem 3a: String formatting#

  1. Create a new Series called genus_species_formatted that consists of the 'Genus_Species' column from pantheria_data_clean, except with all spaces (" ") replaced by underscores ("_"). To do this, you’ll need to extract the right column from the DataFrame and then use the Series.str.replace(old, new) method on that column.

  2. Modify pantheria_data_clean by adding the Series from the previous step to it under the column name 'Genus_Species_Formatted'.

    Reminder: because your code for this question actually modifies pantheria_data_clean, if you want to restart you should re-run all cells above this one (in the JupyterHub menu, select Cell -> Run All Above).
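
A minimal sketch of these two steps on made-up data follows. The DataFrame toy and its column names are hypothetical. Note that the .str accessor belongs to a Series (a single column), and that assigning to a new column label modifies the DataFrame itself.

```python
import pandas as pd

# Hypothetical table with names that contain spaces.
toy = pd.DataFrame({"Name": ["Mus musculus", "Panthera leo"]})

# .str exposes vectorized string methods on a Series;
# replace returns a new Series and leaves the original column unchanged.
name_formatted = toy["Name"].str.replace(" ", "_")

# Assigning to a new column label adds the Series to the existing DataFrame.
toy["Name_Formatted"] = name_formatted

print(toy)
```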

# Write your code here

# Check your work
pantheria_data_clean.head()

Problem 3b: Merge the two DataFrames#

Merge pantheria_data_clean and iucn_data_clean using the pd.merge function. You’ll need to determine the appropriate arguments for left_on and right_on.

Name the resulting DataFrame joined_pantheria_iucn_data.
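
To illustrate how left_on and right_on work when the key columns have different names, here is a small sketch with two made-up tables. The tables, column names, and values are hypothetical. By default pd.merge performs an inner join, so only rows whose keys appear in both tables are kept.

```python
import pandas as pd

# Two hypothetical tables whose key columns have different names.
left = pd.DataFrame({"key_left": ["a_b", "c_d", "e_f"], "x": [1, 2, 3]})
right = pd.DataFrame({"key_right": ["c_d", "e_f", "g_h"], "status": ["LC", "EN", "VU"]})

# left_on names the key column in the first table, right_on the key in the second.
# The default how="inner" keeps only keys present in both tables.
merged = pd.merge(left, right, left_on="key_left", right_on="key_right")

print(merged)
```

Both key columns are kept in the result, so it is worth comparing the row counts of the inputs and the output to see how many rows had no match.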

# Write your code here

# Check your work
joined_pantheria_iucn_data.head()

Problem 4: Eliminate irrelevant IUCN categories#

Now that we have our joined DataFrame, we’re almost ready to perform the computation necessary to answer our question. But first, the IUCN status values 'DD' and 'EP' are not useful to us, so we’ll remove them.

  1. Extract all rows from joined_pantheria_iucn_data with IUCN categories OTHER THAN 'DD' and 'EP'. Name this resulting DataFrame pantheria_iucn_clean.

    You are strongly encouraged to create your own variable to store the boolean Series you’re using as a filter. You’ll need to use a comparison operator (e.g., == or !=) along with one of the two logical operators, either & or |.
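
The sketch below shows the general shape of a "neither of these two values" filter on made-up data. The table and the colour values are hypothetical. Each comparison must be wrapped in parentheses because & and | bind more tightly than != and == in Python.

```python
import pandas as pd

# Hypothetical table; we want rows whose colour is neither "red" nor "blue".
toy = pd.DataFrame({
    "item": ["w", "x", "y", "z"],
    "colour": ["red", "blue", "green", "red"],
})

# Each comparison yields a boolean Series; & combines them element by element.
keep = (toy["colour"] != "red") & (toy["colour"] != "blue")

# Indexing with a boolean Series keeps only the rows where it is True.
filtered = toy[keep]

print(filtered)
```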

# Write your code here

# Check your work
pantheria_iucn_clean

Problem 5: Computing fecundity#

Using pantheria_iucn_clean, you will estimate a new measurement that we will call Max Lifetime Fecundity.

This will be computed using the following columns:

'Age to Maturity (days)': How long it takes for the average individual to grow to maturity. This is measured in days as the interval between birth and the time when the individual first reproduces.

'Max Longevity (months)': How long can individuals within each species live, expressed in months.

'Interbirth Interval (days)': How long do adult females wait, on average, between giving birth and becoming pregnant again?

'Litter Size': How many babies do females within each species have at one time, on average?

The maximum fecundity of a species is calculated using the following formula:

\[ \frac{\text{max longevity} - \text{age to maturity}}{\text{interbirth interval}} \times \text{litter size} \]

Problem 5a: Adding the column#

Your task is to add a new column called 'Max Fecundity' to pantheria_iucn_clean that contains the maximum fecundity of each species. Do not perform any rounding.

NOTE: currently, the age to maturity, max longevity, and interbirth interval columns use different units. You’ll first need to convert them to years by dividing by 365 (for days) or 12 (for months) before you can use the above formula. Do not modify the existing columns of pantheria_iucn_clean for these unit conversions; instead, use new variables to store the converted Series.
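
As a worked illustration of the unit conversion and the formula, here is a sketch on two made-up species. The table toy, its short column names, and all of the numbers are hypothetical and were chosen so the arithmetic is easy to check by hand. Arithmetic between Series is applied row by row.

```python
import pandas as pd

# Hypothetical measurements for two made-up species.
toy = pd.DataFrame({
    "longevity_months": [120, 36],
    "maturity_days": [730, 365],
    "interval_days": [365.0, 182.5],
    "litter": [2, 3],
})

# Convert everything to years, storing the results in new variables
# rather than overwriting the original columns.
longevity_years = toy["longevity_months"] / 12   # 10.0, 3.0
maturity_years = toy["maturity_days"] / 365      # 2.0, 1.0
interval_years = toy["interval_days"] / 365      # 1.0, 0.5

# (max longevity - age to maturity) / interbirth interval * litter size
toy["fecundity"] = (longevity_years - maturity_years) / interval_years * toy["litter"]

print(toy["fecundity"].tolist())  # [16.0, 12.0]
```

For the first row, the species has 10 - 2 = 8 reproductive years, one litter per year, and 2 offspring per litter, giving 16.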

# Write your code here

# Check your work
pantheria_iucn_clean

Problem 5b: Sort#

Finally, use the DataFrame.sort_values method to sort pantheria_iucn_clean in ascending order of its 'Max Fecundity' column. You may, but are not required to, store the result in a variable.
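
A short sketch of sort_values on made-up data follows; the table and column names are hypothetical. The method sorts in ascending order by default and returns a new DataFrame, so the original keeps its row order.

```python
import pandas as pd

# Hypothetical table to sort.
toy = pd.DataFrame({"name": ["a", "b", "c"], "score": [3.5, 1.0, 2.25]})

# Sort by the "score" column; ascending=True is the default.
toy_sorted = toy.sort_values("score")

print(toy_sorted)
```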

# Write your code here

Problem 6: Computing the average Max Fecundity for each IUCN Status#

You will now calculate the average Max Fecundity value for each IUCN Status group.

Like in the lecture, this will involve two steps:

  1. Group pantheria_iucn_clean by the IUCN Status column, using the DataFrame.groupby() method.

  2. Compute the mean of the Max Fecundity column for the grouped data.

Store the output of these steps in a new variable called iucn_avg_fecundity. This variable should be of type Series, and associate each IUCN category with the average of the Max Fecundity values for the species in that category.

You may store the output of Step 1 in another variable, if you wish, or chain both the steps together in one command.
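
The group-then-average pattern can be sketched on made-up data as follows. The table, the group labels, and the variable names are hypothetical. Selecting a single column from the grouped object before calling mean() yields a Series indexed by group label.

```python
import pandas as pd

# Hypothetical table with two groups.
toy = pd.DataFrame({
    "group": ["x", "y", "x", "y"],
    "value": [1.0, 10.0, 3.0, 20.0],
})

# Step 1: group the rows by the "group" column.
grouped = toy.groupby("group")

# Step 2: take the mean of "value" within each group.
group_means = grouped["value"].mean()

print(type(group_means))
print(group_means)
```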

# Write your code here

# This code is provided to check your work. Do not modify it.
print(type(iucn_avg_fecundity))
display(iucn_avg_fecundity.sort_values())

Conclusion#

Based on your analysis, answer each of these questions:

  1. Explain, in biological terms, what our new 'Max Fecundity' column measures. (3 marks)

  2. What can you say about the relationship between the IUCN Status and the average maximum fecundity of species? (3 marks)

WRITE YOUR RESPONSE HERE.