Homework 11: Linear Regression#
Logistics#
Due date: The homework is due 11:59 pm on Tuesday, April 1.
You will submit your work on MarkUs. To submit your work:
Download this file (Homework_11.ipynb) from JupyterHub. (See our JupyterHub Guide for detailed instructions.)
Submit this file to MarkUs under the hw11 assignment. (See our MarkUs Guide for detailed instructions.)
All homework is completed in a Jupyter notebook (like this one). When you are done, you will download this notebook and submit it to MarkUs. We’ve included submission instructions at the end of this notebook.
Introduction#
For this week’s homework, we will look at how the length and width of a fish relate to its weight, using the data in the fish.csv file. This dataset records seven common fish species sold in fish markets: Bream, Parkki, Perch, Pike, Roach, Smelt, and Whitefish.
Question#
General Question: How do the length and width of a fish relate to its weight, and how well can they be used to predict it?
Instructions and Learning Objectives#
In this homework, you will:
Create a data story in a notebook exploring the question.
Visualize and analyze the relationships between length, width and weight.
Create and compare different linear regression models.
Setup#
First import numpy, pandas, matplotlib, seaborn, and statsmodels.formula.api by running the cell below.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf
Data section#
This part of your notebook should read the raw data and extract a DataFrame containing the columns of interest.
Create the following pandas DataFrames:
fish_data: the DataFrame created by reading in the fish.csv file.
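As a reminder of the general pattern, here is a minimal sketch of reading CSV data and keeping a few columns. It uses a tiny made-up table held in memory instead of a file on disk; the names toy_csv, toy_data, and toy_subset, the species column, and the numbers are all illustrative assumptions, not the contents of fish.csv.

```python
import io
import pandas as pd

# Made-up two-row CSV standing in for a real file (values are illustrative only).
toy_csv = io.StringIO(
    "species,width_cm,weight_g\n"
    "Bream,4.0,242.0\n"
    "Roach,3.2,120.0\n"
)

# pd.read_csv accepts a file path or any file-like object.
toy_data = pd.read_csv(toy_csv)

# Select the columns of interest by passing a list of column names.
toy_subset = toy_data[["width_cm", "weight_g"]]
print(toy_subset.head())
```

With a real file, the first argument would be the file name as a string.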
# Write your code below
Exploring the data#
Create histograms of the following:
vertical length of the fish (in cm)
width of the fish (in cm)
weight of the fish (in grams)
You do not need to store the results in a variable. Label the y-axis with “Frequency” and the x-axes with “Vertical Length (cm)”, “Width (cm)” and “Weight (g)” where appropriate. Start with bins=15 to create 15 bins and then try different numbers of bins as you see fit for each histogram. (3 marks)
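The effect of the bins argument can be sketched on made-up data. The example below is one possible approach using matplotlib directly (seaborn offers similar functionality); the toy DataFrame, its value column, and the axis label are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Made-up, right-skewed values; this is not the fish data.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"value": rng.exponential(scale=2.0, size=200)})

fig, ax = plt.subplots()
# bins=15 splits the observed range into 15 equal-width intervals.
counts, bin_edges, _ = ax.hist(toy["value"], bins=15)
ax.set_xlabel("Value (units)")
ax.set_ylabel("Frequency")
```

Fifteen bins always produce sixteen bin edges. Changing bins redraws the same data at a coarser or finer resolution, which can reveal or hide modes.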
# Write your code below
# add additional cells as necessary
Comment on the shape of each histogram and the distribution of the vertical length, width, and weight of the fish in terms of the skewness, the range, and the number of modes. (3 marks)
Answer goes here
Create two scatterplots:
one with width_cm on the x-axis and weight_g on the y-axis.
one with length_vertical_cm on the x-axis and weight_g on the y-axis.
Label the axes with “Vertical Length (cm)”, “Width (cm)” and “Weight (g)” where appropriate. You do not need to save the values in a variable. (2 marks)
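For reference, a labelled scatterplot can be sketched as follows. The columns height_cm and mass_g and their values are hypothetical placeholders, and matplotlib's scatter is only one of several ways to draw such a plot.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up measurements for illustration only.
toy = pd.DataFrame({
    "height_cm": [1.0, 2.0, 3.0, 4.0, 5.0],
    "mass_g": [1.2, 3.9, 9.1, 15.8, 25.3],
})

fig, ax = plt.subplots()
# First argument goes on the x-axis, second on the y-axis.
points = ax.scatter(toy["height_cm"], toy["mass_g"])
ax.set_xlabel("Height (cm)")
ax.set_ylabel("Mass (g)")
```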
# Write your code below
# add additional cells as necessary
Comment on the shape of each scatterplot. Specifically, comment on whether the trends look “positive” or “negative” and whether they appear linear or not. (2 marks)
Answer goes here
Create a new column in fish_data called weight_g_sqrt that contains the square root of each weight_g value. Hint: this can be done by using .apply(np.sqrt).
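The hint can be illustrated on a small made-up column. The toy DataFrame and the names area and area_sqrt are assumptions for this sketch.

```python
import numpy as np
import pandas as pd

# Made-up values with exact square roots, for illustration only.
toy = pd.DataFrame({"area": [4.0, 9.0, 16.0]})

# .apply(np.sqrt) takes the square root of every value in the column;
# assigning to a new column name adds that column to the DataFrame.
toy["area_sqrt"] = toy["area"].apply(np.sqrt)
print(toy)
```

Calling np.sqrt(toy["area"]) directly gives the same result.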
# Write your code below
Now create two new scatterplots:
one with width_cm on the x-axis and weight_g_sqrt on the y-axis.
one with length_vertical_cm on the x-axis and weight_g_sqrt on the y-axis.
Label the axes with “Vertical Length (cm)”, “Width (cm)” and “Weight Sqrt” where appropriate. (1 mark)
# Write your code below
# add additional cells as necessary
Comment on the linearity of the scatterplots that use the square root of weight. (1 mark)
Answer goes here
Methods#
Set up a linear regression called regmod1, with weight_g_sqrt as the dependent variable and width_cm as the independent variable. Fit the model (call the result regmod1_fit) and display the parameter estimates (using .params).
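The set up, fit, and inspect steps can be sketched with the statsmodels formula interface on made-up data. The names toy, toy_model, and toy_fit and the columns x and y are hypothetical; the formula string puts the dependent variable on the left of ~ and the independent variable on the right.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Made-up points scattered around a straight line; not the fish data.
toy = pd.DataFrame({
    "x": [0.0, 1.0, 2.0, 3.0],
    "y": [1.1, 2.9, 5.2, 6.8],
})

toy_model = smf.ols("y ~ x", data=toy)  # set up the model
toy_fit = toy_model.fit()               # estimate the coefficients
print(toy_fit.params)                   # Intercept and slope for x
```

The .params result is a pandas Series indexed by term name, so individual estimates can be read with toy_fit.params["Intercept"] or toy_fit.params["x"].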
# Write your code below
What is the estimated line of best fit for regmod1? Provide an interpretation for the y-intercept and slope estimates. Comment on whether the y-intercept makes sense. (2 marks)
Answer goes here
Set up another linear regression called regmod2, with weight_g_sqrt as the dependent variable and both width_cm and length_vertical_cm as the independent variables. Fit the model (call the result regmod2_fit) and display the parameter estimates (using .params).
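With more than one independent variable, the terms on the right of the formula are joined with +. The sketch below uses made-up data constructed to lie exactly on y = 1 + 2*x1 + 3*x2, so the recovered estimates are easy to check; all names are hypothetical.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Made-up data generated exactly from y = 1 + 2*x1 + 3*x2 (no noise).
toy = pd.DataFrame({
    "x1": [0.0, 1.0, 2.0, 3.0, 4.0],
    "x2": [1.0, 0.0, 2.0, 1.0, 3.0],
})
toy["y"] = 1 + 2 * toy["x1"] + 3 * toy["x2"]

# "y ~ x1 + x2" requests an intercept plus one slope per listed variable.
toy_fit2 = smf.ols("y ~ x1 + x2", data=toy).fit()
print(toy_fit2.params)
```

Each slope is interpreted as the expected change in y for a one-unit increase in that variable while the other variable is held fixed.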
# Write your code below
What is the estimated line of best fit for regmod2? Provide an interpretation for the y-intercept and slope estimates. (2 marks)
Answer goes here
Print out the summary tables for both models (regmod1 and regmod2), which display the p-values and 95% confidence intervals associated with the intercept and slope estimates. (1 mark)
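A fitted statsmodels result exposes the summary table through .summary(), and the same p-values and confidence intervals are also available individually. The sketch below uses simulated data; the names and the simulated relationship are assumptions for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data: a linear trend plus random noise (not the fish data).
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)
toy = pd.DataFrame({"x": x, "y": 2 + 0.5 * x + rng.normal(0, 1, size=50)})

toy_fit = smf.ols("y ~ x", data=toy).fit()

# Full table: estimates, standard errors, p-values (P>|t|), 95% intervals.
print(toy_fit.summary())

# The same quantities as pandas objects.
toy_pvalues = toy_fit.pvalues
toy_ci = toy_fit.conf_int(alpha=0.05)  # columns 0 and 1 are lower and upper bounds
```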
# Write your code below
Calculate the \(R^2\) for both models, and save them in regmod1_rsquared and regmod2_rsquared.
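For a fitted statsmodels result, \(R^2\) is available as the .rsquared attribute. The sketch below, on made-up data with hypothetical names, also recomputes it by hand as one minus the ratio of the residual sum of squares to the total sum of squares, to show what the number measures.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Made-up points for illustration only.
toy = pd.DataFrame({
    "x": [0.0, 1.0, 2.0, 3.0],
    "y": [1.1, 2.9, 5.2, 6.8],
})
toy_fit = smf.ols("y ~ x", data=toy).fit()

# R^2 as reported by statsmodels.
toy_rsquared = toy_fit.rsquared

# The same value by hand: 1 - SS_residual / SS_total.
ss_res = (toy_fit.resid ** 2).sum()
ss_tot = ((toy["y"] - toy["y"].mean()) ** 2).sum()
manual_rsquared = 1 - ss_res / ss_tot
print(toy_rsquared, manual_rsquared)
```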
# Write your code below
Conclusion#
Based on the scatterplots, p-values, confidence intervals, and \(R^2\), which of the two models would you select to model a fish’s weight (square rooted)? You will be graded on the justification(s) you provide. (2 marks)
Answer goes here