Homework 10: Bootstrap#

Logistics#

Due date: The homework is due 11:59 pm on Tuesday, March 25.

You will submit your work on MarkUs. To submit your work:

  1. Download this file (Homework_10.ipynb) from JupyterHub. (See our JupyterHub Guide for detailed instructions.)

  2. Submit this file to MarkUs under the hw10 assignment. (See our MarkUs Guide for detailed instructions.) All homeworks will take place in a Jupyter notebook (like this one). When you are done, you will download this notebook and submit it to MarkUs. We’ve incuded submission instructions at the end of this notebook.

Introduction#

For this week’s homework, we will look at the median percentage of coral coverage over all of the quadrats sampled in the LTER data. We will analyze data coming from six sites within a coral reef in French Polynesia, from 2005 to 2022. Throughout these sites, with a quadrat sampling design along transects, percentage cover of corals, macroalgae, microalgae and sand were estimated across different depths and temperatures.

Question#

General Question: What is the distribution of the percentage of coral coverage in each of the quadrats? Can we provide an range that does a good job of estimating the median of the percentage of coral coverage?

Instructions and Learning Objectives#

You will be creating and submitting a data story answering a data science question.

In this homework, you will:

  • Create a data story in a notebook exploring the question.

  • Work with the Moorea Coral reef Long-term ecological research dataset to investigate changes in biodiversity as temperature changes.

  • Visualize and analyze the distribution of the percentage of coral coverage

  • Create a 95% confidence interval for the median percentage of coral coverage.

Setup#

First import numpy, pandas, and matplotlib by running the cell below.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Then fill in your student number to use as a random seed.

# Please fill in this cell with your student number (as an int)
student_number = ...

assert type(student_number) == int, "Did you fill in the student_number variable correctly?"

Data section#

The this part of your notebook should read the raw data, extract a DataFrame containing the important columns, rename the columns, and filter out missing values.

You might find it helpful to name intermediate values in your algorithms - e.g., corals_data.head(), corals_clean.head(). That way you can examine them to make sure they have the type you expect and that they look like what you expect. Very helpful when debugging!

Create the following pandas DataFrames:

  • corals_data: the DataFrame created by reading in the LTER_data.csv file.

  • corals_clean: the DataFrame with column names converted to ‘snake case’ format using <columns>.str.lower(), <columns>.str.replace(), and <columns>.str.strip(). Snake cases uses fully lowercase letters and underscores instead of spaces. It is the recommended naming convention for Python. For example, if we have the column name: “Tomo is a Great prof “, the snake case equivalent would be ‘tomo_is_a_great_prof’. Furthermore, we will replace the following symbols to the corresponding symbols/text:symbols/text:

    • ” ” with “_”,

    • “%” to “percent”

    • “metres” to “m”

    • “celsius” to “c”

    • “temperature” to “temp”

    Check that your column names are correct using print(list(corals_clean.columns)).

  • corals_select_data: the DataFrame with the only following columns selected:

    • site

    • year

    • corals_percent

    These columns will be relevant to this week’s anlaysis.

# Write your code below
corals_data = # read the file

corals_clean = corals_data.copy() # do not modify this line

co_columns = corals_clean.columns # do not modify this line; use the variable to access the "uncleaned" column names

corals_clean = # rename the columns

print(list(corals_clean.columns)) # do not modify this line; prints your "claned" column names to help you verify your work

corals_select_data = # select the relevant columns from corals_clean

Exploring the data#

Create a histogram of the percentage of coral coverage using the bins variable provided to specify the histogram bins. You do not need to store the result in a variable.

# Write your code below
bins = np.arange(0, 110, 5) # do not modify this line; use it to specify the histogram bins

Comment on the shape of the histogram and the distribution of the coral percentage. What is a better representation of the centre of the data, the mean or the median? Why? (2 marks)

Answer here

Compute the following summaries:

  • the median percentage of coral coverage, and store the result in median_coverage

  • the mean percentage of coral coverage, and store the result in mean_coverage

# Write your code below

# Check the values
print(f"The median coral coverage in % is {median_coverage}.")
print(f"The mean coral coverage in % is {mean_coverage}.")

Method: Bootstrap Confidence Interval#

Here we will run a bootstrap to estimate resampling technique by:

  • creating a resampling function that calculates the median of a resample

  • running the resampling function multiple times and having multiple resample medians

  • getting the 2.5% and 97.5% percentiles of all of our resample medians to get a 95% confidence interval.

Create a function called one_bootstrap_med() which will take one argument data (a DataFrame), resample the corals_percent column of data, and calculate and return the median of the new sample.

## Complete the function
def one_bootstrap_med(data):
    # Fill in the code below
    ...

Now compute 10,000 bootstrap medians by writing a for loop. Use your student number as a seed prior to running the bootstrap. Save the medians as a list in a variable called bootstrap_medians.

# The following line of code uses your student number to set a random seed.
# This ensures that every time you run this cell, you'll get the same result.
# Do not modify this line of code!
np.random.seed(student_number)

# Write your code below
bootstrap_medians = []

Create a histogram of the 10,000 bootstrap median coral coverage percentages. Set bins=12 inside the plotting function to set the number of bins to 12. You do not need to store the result in a variable.

# Write your code below

Calculate the lower bound and upper bound of the 95% confidence intervals for the median percentage of coral coverage in a given quadrat. Save these values as lower_bound and upper_bound, respectively.

# Write your code below
# Check the computed values
print(f"The lower bound of the 95% confidence interval is {lower_bound}.")
print(f"The uppwer bound of the 95% confidence interval is {upper_bound}.")

Conclusion#

Provide a 1-2 sentence explanation of what the 95% confidence interval represents. (1 mark)

Answer here