GGR274 Homework 7: Summary Statistics, Histograms and Simulation#
Logistics#
Due date: The homework is due 23:59 on Monday, March 03.
You will submit your work on MarkUs. To submit your work:
Download this file (Homework_7.ipynb) from JupyterHub. (See our JupyterHub Guide for detailed instructions.)
Submit this file to MarkUs under the hw7 assignment. (See our MarkUs Guide for detailed instructions.)
All homeworks will take place in a Jupyter notebook (like this one). When you are done, you will download this notebook and submit it to MarkUs.
Introduction#
For this week’s homework, we’ll investigate the behaviour of sample statistics and distributions as we vary our sample size. Specifically, we’ll be investigating the mean amount of time spent cleaning by respondents. Furthermore, we will extend our analysis by studying how our sample mean estimate tends to change when we take samples of increasing sizes.
Question#
Question: How much time on average do respondents spend on indoor house cleaning? How does our estimate of a sample mean change as we take increasingly larger samples?
Instructions and Learning Objectives#
In this homework, you will:
Work with the Time Use dataset from lecture to investigate properties of sampling means as the sample size changes
Create and modify for loops and functions to run sampling simulations
Visualize data using histograms and scatter plots
Task 1 - Read in data#
The Data part of your notebook should read the raw data, extract a DataFrame
containing the important columns, rename the columns, and filter out missing values.
You might find it helpful to name intermediate values in your algorithms. That way you can examine them to make sure they have the type you expect and that they look like what you expect. Very helpful when debugging!
Step 1#
Create the following pandas DataFrames:
time_use_data_raw: the DataFrame created by reading the gss_tu2016_week7.csv file. (1 mark)
time_use_dur: the DataFrame containing the following columns from time_use_data_raw: CASEID, dur18. (1 mark)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# add your code below
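The two operations described in Step 1 (reading a CSV file, then keeping a subset of columns) can be sketched on toy data. This is only an illustration: the in-memory CSV text and the column names id, minutes, and city are invented for the example and are not part of the Time Use dataset.

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV file on disk; the columns are invented.
csv_text = io.StringIO(
    "id,minutes,city\n"
    "1,30,Toronto\n"
    "2,0,Ottawa\n"
    "3,45,Toronto\n"
)

# pd.read_csv accepts a file path or a file-like object such as csv_text.
raw = pd.read_csv(csv_text)

# Indexing with a list of column names returns a DataFrame with only those columns.
subset = raw[["id", "minutes"]]

print(type(subset).__name__)
print(list(subset.columns))
```

Note the double square brackets: the outer pair indexes the DataFrame and the inner pair is a Python list of column names.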
Step 2#
time_use_dur could use more informative column names.
Replace CASEID and dur18 in time_use_dur by
creating a dictionary new_column_names that maps the column names from time_use_dur to the values "participant_id" and "time_spent_cleaning". (1 mark)
creating a new DataFrame stored in time_use_data that is a copy of time_use_dur, but with the columns renamed using new_column_names. (1 mark)
# add your code below
# Step 2 check that you have the correct column names
expected_columnnames = ["participant_id", "time_spent_cleaning"]
try:
    assert expected_columnnames == list(time_use_data.columns)
    print("Column names are correct!")
except Exception:
    # Reached if the assertion fails or if time_use_data is not defined yet
    print("Something is wrong, check your column names")
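As a rough illustration of renaming columns with a dictionary, here is a sketch on a toy DataFrame. The names toy, new_names, person, and minutes_reading are made up for this example; the pattern relies on the rename method of a DataFrame, which returns a renamed copy and leaves the original unchanged.

```python
import pandas as pd

# Toy DataFrame with invented column names.
toy = pd.DataFrame({"id": [1, 2, 3], "minutes": [30, 0, 45]})

# Keys are the current column names, values are the replacement names.
new_names = {"id": "person", "minutes": "minutes_reading"}

# rename returns a new DataFrame; toy itself keeps its old column names.
renamed = toy.rename(columns=new_names)

print(list(toy.columns))
print(list(renamed.columns))
```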
Task 2 - Compute and Visualize Distribution#
Step 1#
Compute the distribution of the column time_spent_cleaning for respondents that spent at least some time cleaning (i.e., had a non-zero value of time_spent_cleaning) in time_use_data using the describe function. To do this:
Create a pandas Series called clean_nonzero that only has respondents with non-zero values of time_spent_cleaning.
Use the describe function to describe the distribution of time_spent_cleaning, and store the results in a variable called summary_stats.
# add your code below
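The filter-then-describe pattern can be sketched on toy data as follows. The DataFrame and the names toy, is_nonzero, nonzero_minutes, and stats are assumptions made for the example; only the general technique (a boolean condition used to select rows, followed by describe) carries over.

```python
import pandas as pd

# Toy data with invented column names; two of the five rows are zero.
toy = pd.DataFrame(
    {"person": [1, 2, 3, 4, 5], "minutes_reading": [0, 30, 60, 0, 90]}
)

# A boolean Series: True where the value is non-zero.
is_nonzero = toy["minutes_reading"] != 0

# Select the rows where the condition holds and keep one column, giving a Series.
nonzero_minutes = toy.loc[is_nonzero, "minutes_reading"]

# describe returns a Series of summary statistics (count, mean, std, min, ...).
stats = nonzero_minutes.describe()
print(stats)
```

Individual statistics can be read by label, for example stats["mean"] or stats["max"].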
Step 2#
Visualize clean_nonzero by creating a histogram using matplotlib with the following parameters:
bins=time_bins, linewidth = 1.3, edgecolor="black", color="white"
Label the horizontal axis (x-axis) Time spent cleaning (minutes).
time_bins = np.arange(start=0, stop=900, step=30) # DO NOT CHANGE THIS LINE
# Add your code below
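To see how bin edges built with np.arange interact with a histogram, here is a small sketch on invented values. The arrays values and bin_edges and the axis label are hypothetical; the styling arguments mirror the ones listed in this step.

```python
import numpy as np
import matplotlib.pyplot as plt

# Invented values, purely for illustration.
values = np.array([5, 12, 18, 25, 25, 33, 47])

# np.arange excludes the stop value, so the edges are 0, 10, 20, 30, 40, 50.
bin_edges = np.arange(start=0, stop=60, step=10)

# plt.hist returns the bin counts, the bin edges, and the drawn bars.
counts, edges, _ = plt.hist(
    values, bins=bin_edges, linewidth=1.3, edgecolor="black", color="white"
)
plt.xlabel("Value (toy units)")

print(counts)
print(edges)
```

Because the stop value is excluded, the last edge here is 50 rather than 60. Any value larger than the last edge would not be counted in any bin, which is worth keeping in mind when reading a histogram drawn with fixed edges.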
Step 3#
In a markdown cell, describe the distribution of the data, pointing out features like the mode (where most observations lie), skew (whether the distribution is symmetric or not), and potential outliers. Do our results make sense given what we know about time spent on daily cleaning?
Are there any strange values? Specifically, does it make sense that there are more people that spend less/more than 30 minutes cleaning in a day than those who spend more than 4 hours? Briefly explain. (2 marks)
Answer here:
Task 3: Compute the empirical mean#
The empirical mean is the mean of all observed data. In this homework, we distinguish this from the sample mean, which is the mean of a sample (a portion) of the data.
Compute the empirical mean time spent cleaning by respondents and name it empirical_mean_time_spent_cleaning
. (1 mark)
# add your code below
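The distinction between the mean of all observed data and the mean of one sample can be illustrated with a toy Series. The numbers and the names values, overall_mean, one_sample, and same_sample are invented; the sketch assumes the sample method of a pandas Series, which draws without replacement by default.

```python
import pandas as pd

# Invented data standing in for "all observed data".
values = pd.Series([10, 20, 30, 40, 50, 60])

# Mean of every observation.
overall_mean = values.mean()

# Mean of one random sample of 3 observations (without replacement by default).
one_sample = values.sample(n=3, random_state=1)
sample_mean = one_sample.mean()

# The same random_state reproduces exactly the same sample.
same_sample = values.sample(n=3, random_state=1)

print(overall_mean)
print(sample_mean)
print(one_sample.equals(same_sample))
```

The sample mean generally differs from the mean of all the data, and it changes whenever a different random_state is used.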
Task 4: Set up a Simulation Experiment#
You will investigate the behaviour of sample means for the following sample sizes:
5, 10, 20, 50, 100, 200, 500, and 1000.
Step 1 - Specify Sample Sizes#
Create a numpy array named sample_sizes with the aforementioned values in the specified order using np.array(). (1 mark)
# add your code below
Step 2 - Simulating Sample Means#
In this part, you will complete a function that creates and returns a list of the sample means of the sample draws.
Name the function simulate_sample_means.
The function will have two arguments:
data: a pandas Series or column of a DataFrame that we are sampling
N: an int, the size of the sample we draw
Your function should make and return a list of 100 sample means, each computed from a sample of size N drawn from data. Sample without replacement.
The function will return a list. The list will be of size 100, with each element in the list representing the sample mean from the sample of size N.
Hint: Initialize an empty list used to store the sample means. Inside your for loop generate a sample from the data, calculate the sample mean, and append it to your list.
# Finish the function header and complete the function body.
#
def <replace-with-function-name>(data, N):
    '''Return a list of 100 sample means from a sample of size N from data.'''
    # This next statement is for reproducibility: each random number is generated
    # mathematically based on the previous random number, and we can say which
    # number to start with when we call sample. This allows us to have reproducibility
    # with "random" numbers and so we can autotest! Yay!
    seed = 0
    # Here, create any variables that need to be updated inside the loop.

    # In the next few lines, we will write a for loop to:
    # - generate a sample of size N and compute sample mean.
    # - append the sample mean to the list of sample means.
    # repeat 100 times.
    for _ in range(100):
        seed += 5 # Don't change this line
        # Here, write code to:
        # 1) Take a sample of data, and calculate the sample mean.
        #    When you call .sample, make sure you use random_state=seed as one of the
        #    arguments.
        # 2) Append the sample mean to the list of sample means.

    return <replace-with-your-output-variable>
#check your work
simulate_sample_means(clean_nonzero, 5)
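The accumulate-in-a-loop pattern from the hint can be sketched with a different statistic so the structure is visible on its own. Everything here is hypothetical: the function collect_sample_medians, its draws parameter, and the toy Series are invented for illustration, and it records medians over only 4 draws rather than means over 100.

```python
import pandas as pd


def collect_sample_medians(values, n, draws=4):
    """Return a list holding the median of each of `draws` samples of size n."""
    seed = 0
    medians = []  # accumulator, created before the loop
    for _ in range(draws):
        seed += 5  # a new seed for each draw, so the samples differ
        sample = values.sample(n=n, random_state=seed)
        medians.append(sample.median())
    return medians


toy = pd.Series(range(1, 21))  # the integers 1 to 20
first_run = collect_sample_medians(toy, 5)
second_run = collect_sample_medians(toy, 5)

print(first_run)
print(first_run == second_run)
```

Because the seed sequence restarts at the same value on every call, two calls with the same arguments return identical lists, which is what makes this kind of simulation testable.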
Task 5 - Simulate Sample Means#
In this part, you will complete a code block that computes and compiles simulated means for each sample size.
For each sample size in sample_sizes, call the function simulate_sample_means from the previous step to calculate 100 sample means at that sample size. You’re going to build a dictionary where each key is a sample size and each value is the corresponding list of means that simulate_sample_means returned.
Accumulating information in a dictionary#
Remember in lecture we used a for loop to add up a series of numbers? And then we used a for loop to accumulate a list of means? As it turns out, you can use the same technique to make a dictionary.
Here’s how you add a key/value pair to a dictionary (this is also called “inserting”):
d = {} # this creates an empty dictionary named "d"
d["key1"] = "value1"
d
d["key2"] = "value2"
d
d["key1"] = "new_value"
d
You can accumulate a new dictionary using a for loop:
ta_to_course = {}
for name in ["Ibrahim", "Adrienne", "Asana", "Yifeng", "Yongxin"]:
    ta_to_course[name] = 'GGR274'
ta_to_course
for name in ["Meng", "Alan"]:
    ta_to_course[name] = 'EEB125'
ta_to_course
# use "key" to access the associated value
ta_to_course["Meng"]
Step 1 - Create a dictionary of simulated means for each sample size#
As you loop through each element in sample_sizes (you can loop through the numpy array as you would a list), you will pass the current sample size to the function simulate_sample_means (specifically, the argument N). You will be sampling from the cleaned dataset, so make sure to pass the value of clean_nonzero to the data parameter.
The result of calling simulate_sample_means is a list of means. Add a new key/value pair to all_sample_means_by_sample_size. The key is the current sample size and the value is the list of means.
Finally, we will be checking this in the autotester:
all_sample_means_by_sample_size: a dictionary mapping the sample sizes to a list of 100 sample means. Because we’ll use the same random seed, we’ll get the same “random” sequence. That means that we can autotest it. (2 marks)
# Fill in your code below
# Start with an empty dictionary
all_sample_means_by_sample_size = ...
# # Finish the code
for ...
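Looping over a numpy array to build a dictionary of lists can be sketched on an unrelated toy problem. The names sizes and squares_by_size are invented, and the values stored here are lists of squares rather than sample means.

```python
import numpy as np

# Invented array of keys; a numpy array can be looped over like a list.
sizes = np.array([2, 3, 4])

squares_by_size = {}  # start with an empty dictionary
for size in sizes:
    # Key: the current element. Value: a list built for that element.
    squares_by_size[size] = [i ** 2 for i in range(size)]

print(squares_by_size.keys())
print(squares_by_size[3])
```

Each pass through the loop inserts one key/value pair, so the finished dictionary has one entry per element of the array, in the order the elements were visited.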
Step 2 - Answer this question#
Briefly explain what the keys and values represent in the dictionary all_sample_means_by_sample_size.
You can inspect the keys by using the keys method all_sample_means_by_sample_size.keys().
You can inspect the values by using the values method all_sample_means_by_sample_size.values().
print(all_sample_means_by_sample_size.keys())
print(all_sample_means_by_sample_size.values())
Answer here:
Task 6 - Plot the results#
Step 1: Create Data for Plotting#
In this section you will calculate the mean of the simulated sample means at each sample size.
Create the following variables:
sample_means_by_sample_size: a DataFrame created from the dictionary all_sample_means_by_sample_size. (1 mark)
mean_of_sample_means_by_sample_size: compute the column means of sample_means_by_sample_size, that is, the mean of the sample means at each sample size. (1 mark)
diff_sample_mean_empirical_means_by_sample_size: the difference between the mean of sample means and the empirical mean at each sample size. (1 mark)
# add your code below
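The three transformations named in this step (dictionary of lists to DataFrame, column means, subtraction of a single number) can be sketched on toy values. The dictionary draws_by_size, the reference value, and the other names are invented for illustration.

```python
import pandas as pd

# Invented dictionary: each key becomes a column, each list becomes that column's rows.
# All lists must have the same length.
draws_by_size = {2: [1.0, 3.0, 5.0], 4: [2.0, 3.0, 4.0]}

table = pd.DataFrame(draws_by_size)

# mean() on a DataFrame averages each column and returns a Series
# whose index holds the column labels (here 2 and 4).
column_means = table.mean()

# Subtracting a single number applies to every entry of the Series.
reference = 2.5
differences = column_means - reference

print(table)
print(column_means)
print(differences)
```

The index of the resulting Series is the set of original dictionary keys, which is what makes it convenient to use as the horizontal axis of a plot.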
Step 2: Plot the data#
In this section you will plot the results.
Create a scatter plot using matplotlib with
diff_sample_mean_empirical_means_by_sample_size.index on the horizontal axis (x-axis) and
diff_sample_mean_empirical_means_by_sample_size on the vertical axis (y-axis)
by completing the code below. You may find the reference page of matplotlib.pyplot.scatter() helpful.
Label the horizontal axis with the text Sample size and the vertical axis with the text Difference between sample mean and population mean.
plt.axhline(y=0, linestyle=":", color="coral") # DO NOT change this line; it adds a dotted horizontal line at y=0
plt.scatter(
# complete the function call
);
# add your code below to label the axes
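A scatter plot that uses the index of a Series for the horizontal axis and its values for the vertical axis might look like the following sketch. The Series differences and both axis labels are invented for the example.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Invented Series: the index plays the role of the x values.
differences = pd.Series([0.5, -0.2, 0.1], index=[10, 100, 1000])

# Dotted horizontal reference line at y=0.
plt.axhline(y=0, linestyle=":", color="coral")

# First argument: x positions. Second argument: y positions.
points = plt.scatter(differences.index, differences)

plt.xlabel("Group size (toy)")
plt.ylabel("Difference (toy)")
```

Points close to the dotted line correspond to differences close to zero.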
Task 7 - Answer the following Questions#
Include cells with your answers to each of these questions:
What is the empirical mean time spent cleaning by respondents per day (in minutes)? Does this value make sense? Why or why not? Answer in one line. (1 mark)
Answer here:
Based on your final scatter plot, what trend or pattern do you notice between sample size and difference between the mean of sample means and empirical mean? Does the difference decrease or increase with sample size? Explain why this trend is seen, drawing on your understanding of randomness of sampling. (2 marks)
Answer here:
If you were to do further analysis to study how the time spent cleaning is different for various subpopulations, which additional sociodemographic variables might you consider? Why? Write 3-5 sentences identifying 1-2 variables (e.g. age - don’t pick this!) of interest and what differences in cleaning time you might expect to find.
Answer here: