Week 11: Exercises#
A practice exercise with sample solutions.
Introduction#
For these exercises, we will investigate the predominance of coral presence in relation to temperature. We will analyze data coming from six sites within a coral reef in French Polynesia. Scientists collected data there from 2005 to 2022. The sampling design involves swimming away from the shore over the corals, and every 10 meters take a photo of the bottom surface, using a square of a defined size. This kind of data collection uses something called a âquadrat sampling design along several linear transects increasing in depthâ.
Once in the lab, the biodiversity is estimated in terms of surface cover. How much area of the square is occupied by each species or abiotic component (E.g. corals, algae, sand). Since the linear transect runs from towards the inner ocean, as the scientists swimm off-shore, depth increases and temperature decreases.
We will explore how the surface cover changes as these two conditions change.
Main Question#
Questions: At which temperature do we find a greater percentage cover of corals? Do corals prefer warmer temperatures than the mean temperature across sites or colder temperatures?
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Hypothesis testing#
Exercise 1:#
Create the following pandas dataframes:
corals_data: theDataFramecreated by reading theLTER_datafile.corals_clean: theDataFramewith cleaner column names of your choice.
# Write your answer to Exercise 1 in this cell
# Sample solution
corals_data = pd.read_csv("LTER_data.csv")
co_columns = corals_data.columns
# We"re replacing spaces with _ and "%" with "percent" and so on.
co_columns_clean = co_columns.str.lower().str.replace(
" ", "_").str.replace(
"%", "percent").str.replace(
"metres", "m").str.replace(
"celsius", "c").str.replace(
"temperature", "temp").str.strip("_")
corals_data.columns = co_columns_clean
corals_clean = corals_data
corals_clean
# Check your column names:
list(corals_clean.columns)
Exercise 2#
Select the columns relevant to this analysis. Those are:
siteyeartemp_ccorals_percent
Name the final result corals_select_data_raw.
# Write your answer to Exercise 2 in this cell
# Sample solution
corals_select_data_raw = corals_data[["site",
"year",
"temp_c",
"corals_percent"]]
corals_select_data_raw
Exercise 3#
A quadrat is a square in the grid.
Make a copy of corals_select_data_raw and name it corals_select_data.
Create a column indicating whether a quadrat has a greater percentage surface area of coral surface area.
Create a column in corals_select_data named "is_percent_corals_over_50" that has the value 1 if the quadrat has a percentage surface area of corals greater than or equal to 50%, or otherwise, has the value 0.
# Write your answer to Exercise 3 in this cell
# Sample solution
corals_select_data = corals_select_data_raw.copy()
corals_select_data.loc[corals_select_data["corals_percent"] >= 50.0,
"is_percent_corals_over_50"] = 1
corals_select_data.loc[corals_select_data["corals_percent"] < 50.0,
"is_percent_corals_over_50"] = 0
corals_select_data
Exercise 4#
Create a function to compute the difference in mean temperature between quadrats with lower coral surface area and greater coral surface area.
Name the function mean_diff_temperature_by_coral_sa.
The function has one parameter:
data: theDataFramecontaining information about the coral surface area by quadrat and by temperature.
Using column "temp_c" in data, your function should compute the mean temperature of
of all quadrats belonging to the two groups indicated in is_percent_corals_over_50.
The function then returns that difference in mean temperature.
# Write your answer to Exercise 4 in this cell
# Sample solution
def mean_diff_temperature_by_coral_sa(data):
mean_temp_high_coral_sa = data.loc[data["is_percent_corals_over_50"] == 1, "temp_c"].mean()
mean_temp_low_coral_sa = data.loc[data["is_percent_corals_over_50"] == 0, "temp_c"].mean()
return mean_temp_low_coral_sa - mean_temp_high_coral_sa
Exercise 5#
Call your function to calculate the observed difference in mean temperature between quadrats with low coral surface area and greater coral surface area, and name your result observed_diff_in_mean_temperature.
(Later on, youâre going to call that function inside a for loop to compute a sampling distribution.)
# Write your answer to Exercise 5 in this cell
# Sample solution
observed_diff_in_mean_temperature = mean_diff_temperature_by_coral_sa(corals_select_data)
observed_diff_in_mean_temperature
Exercise 6#
Now weâre going to shuffle the two groups (the 1âs and 0âs marking whether a sample has high or low coral coverage) in order to determine if there is a meaningful difference if the group assignment is random. (This is equivalent to testing for a difference assuming the null hypothesis.)
Write a helper function that shuffles our labels (values) found in the column is_percent_corals_over_50. Name the function shuffle_labels.
The function will have one parameter:
data: theDataFramecontaining information about the coral surface area by quadrat AND temperature.
Your function will return a series that has shuffled the labels in the column is_percent_corals_over_50.
# Write your answer to Exercise 6 in this cell
# Solution, delete to create handout
def shuffle_labels(data):
return data["is_percent_corals_over_50"].sample(
frac=1, replace=False).reset_index(drop=True)
Exercise 7#
Simulate the sample means to generate a sampling distribution.
Our null hypothesis is that there is no difference in sampling means between the two groups.
Here, we will write code for the machinery of our hypothesis test. Specifically, we will simulate samples under assumption of the null hypothesis â that is, we will generate a sampling distribution of the difference in mean temperature between our two groups of quadrats.
Brace yourselves.
Weâre going to collect random sampling distribution of mean differences in a list named resample_test_statistics. We will use this list to plot a histogram.
Here are the details:
Initialize
resample_test_statisticsas the empty list. (In the for loop, youâll append to this list.)Write a for loop to collect 1000 outcomes.
Each iteration, call your
shuffle_labelsfunction to shuffle the labels of column"is_percent_corals_over_50"fromcorals_shuffled_data. Assign these shuffled labels to the column"is_percent_corals_over_50"in thecorals_shuffled_datadataframe.Now call your
mean_diff_temperature_by_coral_safunction to calculate the difference in mean temperature between quadrats with low and high coral surface area in your random shuffled sample. Name this valuemean_diff.Append
mean_difftoresample_test_statistics.
Outside of the for loop, call
pd.Seriesto create a series based on listresample_test_statistics.
# Write your answer to Exercise 7 in this cell
# Sample solution
# np.random.seed(2) # if you wish to keep the same values
corals_shuffled_data = corals_select_data.copy()
resample_test_statistics = []
for _ in range(1000):
resampled_labels = shuffle_labels(corals_shuffled_data)
corals_shuffled_data["is_percent_corals_over_50"]=resampled_labels
resample_test_statistics.append(
mean_diff_temperature_by_coral_sa(corals_shuffled_data))
resample_test_statistics = pd.Series(resample_test_statistics)
resample_test_statistics
Exercise 8#
How likely are we to observe a mean difference that is as extreme as the mean difference that we observed in Exercise 5?
Calculate the p-value of the random sampling result. We can calculate a p-value by following these steps:
Take the absolute value of this
Series. We do this because we want to consider both tails of our sample distribution. (resample_test_statistics.abs()will calculate this.)Create a boolean series that is
Trueif the absolute value of the test statistic is geater than or equal to the observed mean difference,observed_diff_in_mean_temperature.Finally, determine the probability by calculating the proportion of entries that are
Trueinresample_test_statistics_series. Name your resultp_value.
# Write your answer to Exercise 8 in this cell
# Sample solution
abs_test_statistics = resample_test_statistics.abs()
p_value = (abs_test_statistics >= np.abs(observed_diff_in_mean_temperature)).mean()
p_value
Exercise 9#
Visualize the p-value. Although, we donât need a visualization to calculate our p-value, it might help in illustrating whether our p-value makes sense. In one plot, plot the following three things:
a histogram plotting the test statistics
two vertical lines plotting the observed difference in mean temperature between quadrats of the two groups.
# Write your answer to Exercise 9 in this cell
# Solution, delete to create handout
plt.hist(resample_test_statistics, color="pink", edgecolor="white", bins=15)
plt.vlines(x=observed_diff_in_mean_temperature,
ymin=0, ymax=170, color="red", linestyle="dotted")
plt.vlines(x=-1 * observed_diff_in_mean_temperature, # check both directions
ymin=0, ymax=170, color="red", linestyle="dotted")
observed_diff_in_mean_temperature
Exercise 10: Interpretation Questions#
State the difference in mean temperature between quadrats with high coral surface area and those with low coral surface area. From your results, state the p-value, and give its interpretation. What conclusions can you make from the observed p-value.
If you were to do further analysis to study how surface area covered by corals is different for various sampling units (Quadrats), which additional environmental variables might you consider? Why? Write 3-5 sentences identifying 1-2 variables of interest and what differences in percentage covered by corals you might expect to find.
Regression#
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
pantheria = pd.read_csv("pantheria.txt", sep = "\t")
important_columns = ["5-1_AdultBodyMass_g", "23-1_SexualMaturityAge_d",
"17-1_MaxLongevity_m","25-1_WeaningAge_d",
"9-1_GestationLen_d"]
sub_pantheria = pantheria[important_columns]
columnnames = {"5-1_AdultBodyMass_g": "bodymass",
"23-1_SexualMaturityAge_d": "maturity",
"17-1_MaxLongevity_m": "longevity",
"25-1_WeaningAge_d": "weaning",
"9-1_GestationLen_d": "gestation"}
rn_pantheria = sub_pantheria.rename(columns=columnnames)
rn_pantheria = rn_pantheria.dropna()
panthdat_trans = rn_pantheria.copy()
panthdat_trans.head()
trans_cols = ["bodymass", "longevity", "gestation", "weaning", "maturity"]
panthdat_trans[trans_cols] = (panthdat_trans[trans_cols].apply(np.log))
panthdat_trans.head()
Question 1#
We are interested in the shape of the distribution for maturation time.
Make a suitable plot.
Describe the shape of the distribution and explain the role of log-transformation in generating this shape.
Answer: Our data were originally lopsided. Calculating the logarithm of the measurements helps to bring the more spread out extreme values closer together and yield a more even and less lopsided frequency distribution. This will help us to better plot our data with body size and to fit a regression model.
# Write your answer to Question 1 in this cell
# Sample solution
panthdat_trans["maturity"].plot.hist(bins=20)
plt.show()
Question 2#
Furthermore, it makes sense to visualize our data before we fit our regression model. Create scatter plot of temperature and the proportion of corals on the reef. What is the direction of the relationship. Describe the strength of the relationship.
Answer: The relationship is âpositiveâ - that is, the two variables are positively correlated. As the independent variable increases, so does the dependent variable.
# Write your answer to Question 2 in this cell
# Sample solution
plt.scatter(panthdat_trans["bodymass"],
panthdat_trans["maturity"],
alpha=0.3)
plt.show()
Question 3#
Fit the linear regression line with bodymass as the independent variable. Print the summary of the model. Report the coefficient that the model estimates, giving an interpretation for each.
# Write your answer to Question 3 in this cell
# Sample solution
from statsmodels.formula.api import ols
regmod = ols("maturity ~ bodymass", data = panthdat_trans) # setup the model
regmod_fit = regmod.fit() # estimate/fit the model
print(regmod_fit.summary().tables[1]) # get parameter estimates
Sample answer: The coefficient estimate is 4.4298. That means that a longer time to maturation is associated with larger body masses, meaning that larger mammals take longer to grow up.
Question 4#
Determine and interpret the R-squared value of the model. What does this metric tell us about our regression model?
# Write your answer to Question 4 in this cell
# Sample solution
print(regmod_fit.rsquared)
Question 5#
Calculate the residuals of the regression model. Use a scatter plot to plot the residuals (y-axis) against the fitted values (x-axis). Comment on any patterns (or lack of patterns you might see). Are the values fairly evenly distributed?
# Write your answer to Question 5 in this cell
# Sample solution
residuals = regmod_fit.resid
fitted = regmod_fit.fittedvalues
plt.scatter(fitted, residuals);