Week 11: Exercises#

A practice exercise with sample solutions.

Introduction#

For these exercises, we will investigate the predominance of coral presence in relation to temperature. We will analyze data coming from six sites within a coral reef in French Polynesia. Scientists collected data there from 2005 to 2022. The sampling design involves swimming away from the shore over the corals, and every 10 meters take a photo of the bottom surface, using a square of a defined size. This kind of data collection uses something called a “quadrat sampling design along several linear transects increasing in depth”.

Once in the lab, the biodiversity is estimated in terms of surface cover. How much area of the square is occupied by each species or abiotic component (E.g. corals, algae, sand). Since the linear transect runs from towards the inner ocean, as the scientists swimm off-shore, depth increases and temperature decreases.

We will explore how the surface cover changes as these two conditions change.

Main Question#

Questions: At which temperature do we find a greater percentage cover of corals? Do corals prefer warmer temperatures than the mean temperature across sites or colder temperatures?

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Hypothesis testing#

Exercise 1:#

Create the following pandas dataframes:

  • corals_data: the DataFrame created by reading the LTER_data file.

  • corals_clean: the DataFrame with cleaner column names of your choice.

# Write your answer to Exercise 1 in this cell
# Sample solution
corals_data = pd.read_csv("LTER_data.csv")

co_columns = corals_data.columns

# We"re replacing spaces with _ and "%" with "percent" and so on.
co_columns_clean = co_columns.str.lower().str.replace(
    " ", "_").str.replace(
    "%", "percent").str.replace(
    "metres", "m").str.replace(
    "celsius", "c").str.replace(
    "temperature", "temp").str.strip("_")

corals_data.columns = co_columns_clean

corals_clean = corals_data
corals_clean
# Check your column names:
list(corals_clean.columns)

Exercise 2#

Select the columns relevant to this analysis. Those are:

  • site

  • year

  • temp_c

  • corals_percent

Name the final result corals_select_data_raw.

# Write your answer to Exercise 2 in this cell
# Sample solution
corals_select_data_raw = corals_data[["site",
                                      "year",
                                      "temp_c",
                                      "corals_percent"]]

corals_select_data_raw

Exercise 3#

A quadrat is a square in the grid.

Make a copy of corals_select_data_raw and name it corals_select_data.

Create a column indicating whether a quadrat has a greater percentage surface area of coral surface area.

Create a column in corals_select_data named "is_percent_corals_over_50" that has the value 1 if the quadrat has a percentage surface area of corals greater than or equal to 50%, or otherwise, has the value 0.

# Write your answer to Exercise 3 in this cell
# Sample solution
corals_select_data = corals_select_data_raw.copy()

corals_select_data.loc[corals_select_data["corals_percent"] >= 50.0, 
                       "is_percent_corals_over_50"] = 1 
corals_select_data.loc[corals_select_data["corals_percent"] < 50.0,
                       "is_percent_corals_over_50"] = 0

corals_select_data

Exercise 4#

Create a function to compute the difference in mean temperature between quadrats with lower coral surface area and greater coral surface area.

Name the function mean_diff_temperature_by_coral_sa.

The function has one parameter:

  • data: the DataFrame containing information about the coral surface area by quadrat and by temperature.

Using column "temp_c" in data, your function should compute the mean temperature of of all quadrats belonging to the two groups indicated in is_percent_corals_over_50.

The function then returns that difference in mean temperature.

# Write your answer to Exercise 4 in this cell
# Sample solution
def mean_diff_temperature_by_coral_sa(data):
    
    mean_temp_high_coral_sa = data.loc[data["is_percent_corals_over_50"] == 1, "temp_c"].mean()
    
    mean_temp_low_coral_sa = data.loc[data["is_percent_corals_over_50"] == 0, "temp_c"].mean()
    
    return mean_temp_low_coral_sa - mean_temp_high_coral_sa

Exercise 5#

Call your function to calculate the observed difference in mean temperature between quadrats with low coral surface area and greater coral surface area, and name your result observed_diff_in_mean_temperature.

(Later on, you’re going to call that function inside a for loop to compute a sampling distribution.)

# Write your answer to Exercise 5 in this cell
# Sample solution
observed_diff_in_mean_temperature = mean_diff_temperature_by_coral_sa(corals_select_data)

observed_diff_in_mean_temperature

Exercise 6#

Now we’re going to shuffle the two groups (the 1”s and 0’s marking whether a sample has high or low coral coverage) in order to determine if there is a meaningful difference if the group assignment is random. (This is equivalent to testing for a difference assuming the null hypothesis.)

Write a helper function that shuffles our labels (values) found in the column is_percent_corals_over_50. Name the function shuffle_labels.

The function will have one parameter:

  • data: the DataFrame containing information about the coral surface area by quadrat AND temperature.

Your function will return a series that has shuffled the labels in the column is_percent_corals_over_50.

# Write your answer to Exercise 6 in this cell
# Solution, delete to create handout
def shuffle_labels(data):
    
    return data["is_percent_corals_over_50"].sample(
        frac=1, replace=False).reset_index(drop=True)

Exercise 7#

Simulate the sample means to generate a sampling distribution.

Our null hypothesis is that there is no difference in sampling means between the two groups.

Here, we will write code for the machinery of our hypothesis test. Specifically, we will simulate samples under assumption of the null hypothesis — that is, we will generate a sampling distribution of the difference in mean temperature between our two groups of quadrats.

Brace yourselves.

We’re going to collect random sampling distribution of mean differences in a list named resample_test_statistics. We will use this list to plot a histogram.

Here are the details:

  • Initialize resample_test_statistics as the empty list. (In the for loop, you’ll append to this list.)

  • Write a for loop to collect 1000 outcomes.

    • Each iteration, call your shuffle_labels function to shuffle the labels of column "is_percent_corals_over_50" from corals_shuffled_data. Assign these shuffled labels to the column "is_percent_corals_over_50" in the corals_shuffled_data dataframe.

    • Now call your mean_diff_temperature_by_coral_sa function to calculate the difference in mean temperature between quadrats with low and high coral surface area in your random shuffled sample. Name this value mean_diff.

    • Append mean_diff to resample_test_statistics.

  • Outside of the for loop, call pd.Series to create a series based on list resample_test_statistics.

# Write your answer to Exercise 7 in this cell
# Sample solution
# np.random.seed(2) # if you wish to keep the same values

corals_shuffled_data = corals_select_data.copy()

resample_test_statistics = []

for _ in range(1000):
    
    resampled_labels = shuffle_labels(corals_shuffled_data)
    
    corals_shuffled_data["is_percent_corals_over_50"]=resampled_labels
    
    resample_test_statistics.append(
        mean_diff_temperature_by_coral_sa(corals_shuffled_data))
    
resample_test_statistics = pd.Series(resample_test_statistics)
resample_test_statistics

Exercise 8#

How likely are we to observe a mean difference that is as extreme as the mean difference that we observed in Exercise 5?

Calculate the p-value of the random sampling result. We can calculate a p-value by following these steps:

  • Take the absolute value of this Series. We do this because we want to consider both tails of our sample distribution. (resample_test_statistics.abs() will calculate this.)

  • Create a boolean series that is True if the absolute value of the test statistic is geater than or equal to the observed mean difference, observed_diff_in_mean_temperature.

  • Finally, determine the probability by calculating the proportion of entries that are True in resample_test_statistics_series. Name your result p_value.

# Write your answer to Exercise 8 in this cell
# Sample solution
abs_test_statistics = resample_test_statistics.abs()

p_value = (abs_test_statistics >= np.abs(observed_diff_in_mean_temperature)).mean()

p_value

Exercise 9#

Visualize the p-value. Although, we don”t need a visualization to calculate our p-value, it might help in illustrating whether our p-value makes sense. In one plot, plot the following three things:

  • a histogram plotting the test statistics

  • two vertical lines plotting the observed difference in mean temperature between quadrats of the two groups.

# Write your answer to Exercise 9 in this cell
# Solution, delete to create handout
plt.hist(resample_test_statistics, color="pink", edgecolor="white", bins=15)
plt.vlines(x=observed_diff_in_mean_temperature, 
           ymin=0, ymax=170, color="red", linestyle="dotted")
plt.vlines(x=-1 * observed_diff_in_mean_temperature, # check both directions
           ymin=0, ymax=170, color="red", linestyle="dotted")
observed_diff_in_mean_temperature

Exercise 10: Interpretation Questions#

  1. State the difference in mean temperature between quadrats with high coral surface area and those with low coral surface area. From your results, state the p-value, and give its interpretation. What conclusions can you make from the observed p-value.

  2. If you were to do further analysis to study how surface area covered by corals is different for various sampling units (Quadrats), which additional environmental variables might you consider? Why? Write 3-5 sentences identifying 1-2 variables of interest and what differences in percentage covered by corals you might expect to find.

Regression#

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

pantheria = pd.read_csv("pantheria.txt", sep = "\t")

important_columns = ["5-1_AdultBodyMass_g", "23-1_SexualMaturityAge_d", 
                     "17-1_MaxLongevity_m","25-1_WeaningAge_d", 
                     "9-1_GestationLen_d"]

sub_pantheria = pantheria[important_columns]

columnnames = {"5-1_AdultBodyMass_g": "bodymass",
               "23-1_SexualMaturityAge_d": "maturity",
               "17-1_MaxLongevity_m": "longevity",
               "25-1_WeaningAge_d": "weaning",
               "9-1_GestationLen_d": "gestation"}

rn_pantheria = sub_pantheria.rename(columns=columnnames)

rn_pantheria = rn_pantheria.dropna()

panthdat_trans = rn_pantheria.copy()

panthdat_trans.head()

trans_cols = ["bodymass", "longevity", "gestation", "weaning", "maturity"]

panthdat_trans[trans_cols] = (panthdat_trans[trans_cols].apply(np.log))

panthdat_trans.head()

Question 1#

We are interested in the shape of the distribution for maturation time.

Make a suitable plot.

Describe the shape of the distribution and explain the role of log-transformation in generating this shape.

Answer: Our data were originally lopsided. Calculating the logarithm of the measurements helps to bring the more spread out extreme values closer together and yield a more even and less lopsided frequency distribution. This will help us to better plot our data with body size and to fit a regression model.

# Write your answer to Question 1 in this cell
# Sample solution
panthdat_trans["maturity"].plot.hist(bins=20)
plt.show()

Question 2#

Furthermore, it makes sense to visualize our data before we fit our regression model. Create scatter plot of temperature and the proportion of corals on the reef. What is the direction of the relationship. Describe the strength of the relationship.

Answer: The relationship is ‘positive’ - that is, the two variables are positively correlated. As the independent variable increases, so does the dependent variable.

# Write your answer to Question 2 in this cell
# Sample solution
plt.scatter(panthdat_trans["bodymass"],
            panthdat_trans["maturity"],
            alpha=0.3)

plt.show()

Question 3#

Fit the linear regression line with bodymass as the independent variable. Print the summary of the model. Report the coefficient that the model estimates, giving an interpretation for each.

# Write your answer to Question 3 in this cell
# Sample solution
from statsmodels.formula.api import ols


regmod = ols("maturity ~ bodymass", data = panthdat_trans) # setup the model

regmod_fit = regmod.fit() # estimate/fit the model 

print(regmod_fit.summary().tables[1]) # get parameter estimates

Sample answer: The coefficient estimate is 4.4298. That means that a longer time to maturation is associated with larger body masses, meaning that larger mammals take longer to grow up.

Question 4#

Determine and interpret the R-squared value of the model. What does this metric tell us about our regression model?

# Write your answer to Question 4 in this cell
# Sample solution
print(regmod_fit.rsquared)

Question 5#

Calculate the residuals of the regression model. Use a scatter plot to plot the residuals (y-axis) against the fitted values (x-axis). Comment on any patterns (or lack of patterns you might see). Are the values fairly evenly distributed?

# Write your answer to Question 5 in this cell
# Sample solution
residuals = regmod_fit.resid
fitted = regmod_fit.fittedvalues

plt.scatter(fitted, residuals);