GGR274 Homework 7: Summary Statistics, Histograms and Simulation#

Logistics#

Due date: The homework is due 23:59 on Monday, March 03.

You will submit your work on MarkUs. To submit your work:

  1. Download this file (Homework_7.ipynb) from JupyterHub. (See our JupyterHub Guide for detailed instructions.)

  2. Submit this file to MarkUs under the hw7 assignment. (See our MarkUs Guide for detailed instructions.)

All homeworks will take place in a Jupyter notebook (like this one). When you are done, you will download this notebook and submit it to MarkUs.

Introduction#

For this week’s homework, we’ll investigate the behaviour of sample statistics and distributions as we vary our sample size. Specifically, we’ll be investigating the mean amount of time spent cleaning by respondents. Furthermore, we will extend our analysis by studying how our sample mean estimate tends to change when we take samples of increasing sizes.

Question#

Question: How much time on average do respondents spend on indoor house cleaning? How does our estimate of a sample mean change as we take increasingly larger samples?

Instructions and Learning Objectives#

In this homework, you will:

  • Work with the Time Use dataset from lecture to investigate properties of sample means as the sample size changes

  • Create and modify for loops and functions to run sampling simulations

  • Visualize data using histograms and scatter plots

Task 1 - Read in data#

The Data part of your notebook should read the raw data, extract a DataFrame containing the important columns, rename the columns, and filter out missing values.

You might find it helpful to name intermediate values in your algorithms. That way you can examine them to make sure they have the type you expect and that they look like what you expect. Very helpful when debugging!

Step 1#

Create the following pandas DataFrames:

  • time_use_data_raw: the DataFrame created by reading the gss_tu2016_week7.csv file. (1 mark)

  • time_use_dur: the DataFrame containing the following columns from time_use_data_raw: CASEID, dur18. (1 mark)
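As a minimal sketch of the read-then-select pattern, the example below uses a tiny in-memory CSV instead of the course file. The column names `ID`, `minutes` and `region`, and every variable name starting with `toy_`, are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV file on disk (not the homework data).
toy_csv = io.StringIO("ID,minutes,region\n1,30,A\n2,0,B\n3,45,A\n")

# pd.read_csv accepts a file path or a file-like object.
toy_raw = pd.read_csv(toy_csv)

# Indexing with a list of column names returns a DataFrame
# containing only those columns, in the order listed.
toy_subset = toy_raw[["ID", "minutes"]]
print(toy_subset)
```

Naming the intermediate value `toy_raw` makes it easy to inspect the full table before narrowing it down.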

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# add your code below

Step 2#

time_use_dur could use more informative column names.

Replace CASEID and dur18 in time_use_dur by

  • creating a dictionary new_column_names that maps the column names from time_use_dur to the values "participant_id" and "time_spent_cleaning". (1 mark)

  • creating a new DataFrame stored in time_use_data that is a copy of time_use_dur, but with the columns renamed using new_column_names. (1 mark)
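The renaming pattern can be sketched on a toy DataFrame. The names `toy`, `toy_new_names`, `toy_renamed` and the column labels are hypothetical and are not the ones the homework asks for.

```python
import pandas as pd

# Toy DataFrame with made-up column names (not the homework data).
toy = pd.DataFrame({"ID": [1, 2, 3], "minutes": [30, 0, 45]})

# Keys are the existing column names; values are the new names.
toy_new_names = {"ID": "person", "minutes": "time_spent"}

# rename returns a new DataFrame by default and leaves the original unchanged.
toy_renamed = toy.rename(columns=toy_new_names)

print(list(toy_renamed.columns))
print(list(toy.columns))
```

Because `rename` returns a copy by default, the original `toy` keeps its old column names.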

# add your code below

# Step 2 check that you have the correct column names

expected_columnnames = ["participant_id", "time_spent_cleaning"]

try:
    assert expected_columnnames == list(time_use_data.columns)
    print("Column names are correct!")
except Exception:
    print("Something is wrong, check your column names")

Task 2 - Compute and Visualize Distribution#

Step 1#

Summarize the distribution of the column time_spent_cleaning for respondents who spent at least some time cleaning (i.e., had a non-zero value of time_spent_cleaning) in time_use_data using the describe function. To do this:

  • Create a pandas Series called clean_nonzero that only has respondents with non-zero values of time_spent_cleaning.

  • Use the describe function to describe the distribution of time_spent_cleaning, and store the results in a variable called summary_stats.
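Here is a small sketch of filtering a Series with a boolean condition and then summarizing it. The Series `toy_minutes` and its values are invented for illustration.

```python
import pandas as pd

# Made-up values, including some zeros.
toy_minutes = pd.Series([0, 30, 0, 45, 120, 15], name="minutes")

# A boolean mask keeps only the entries where the condition is True.
toy_nonzero = toy_minutes[toy_minutes != 0]

# describe() reports count, mean, std, min, quartiles and max.
toy_summary = toy_nonzero.describe()
print(toy_summary)
```

The result of `describe()` is itself a Series, so individual statistics can be looked up by label, for example `toy_summary["mean"]`.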

# add your code below

Step 2#

Visualize clean_nonzero by creating a histogram using matplotlib with the following parameters:

bins=time_bins, linewidth = 1.3, edgecolor="black", color="white"

Label the horizontal axis (x-axis) Time spent cleaning (minutes).

time_bins = np.arange(start=0, stop=900, step=30) # DO NOT CHANGE THIS LINE
# Add your code below
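To see how bin edges built with `np.arange` group values, the sketch below counts values per bin with `np.histogram`, which is the binning that matplotlib's `hist` relies on. The edges and values here are made up and much smaller than `time_bins`.

```python
import numpy as np

# Hypothetical bin edges: 0, 30, 60, 90 (the stop value 120 itself is excluded).
toy_bins = np.arange(start=0, stop=120, step=30)

toy_values = np.array([5, 10, 29, 30, 45, 89, 90])

# np.histogram counts how many values fall in each bin. Every bin is
# half-open [left, right) except the last, which also includes its right edge.
counts, edges = np.histogram(toy_values, bins=toy_bins)

print(edges)
print(counts)
```

Note that `np.arange` stops before its `stop` value, so the last bin edge is one step below `stop`; values beyond the last edge are not counted.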

Step 3#

In a markdown cell, describe the distribution of the data, pointing out features like the mode (where most observations lie), skew (whether the shape of the distribution is symmetric or not), and potential outliers. Do our results make sense given what we know about time spent on daily cleaning?

Are there any strange values? Specifically, does it make sense that there are more people that spend less/more than 30 minutes cleaning in a day than those who spend more than 4 hours? Briefly explain. (2 marks)

Answer here:

Task 3: Compute the empirical mean#

The empirical mean is the mean of all observed data. In this homework, we distinguish this from the sample mean, which is the mean of a sample, or portion, of all the data.

Compute the empirical mean time spent cleaning by respondents and name it empirical_mean_time_spent_cleaning. (1 mark)
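The distinction between the two kinds of mean can be illustrated on a toy Series. The values, the sample size of 3, and the `random_state` below are arbitrary choices for this sketch.

```python
import pandas as pd

# Made-up "full" data set of six observations.
toy_all = pd.Series([10, 20, 30, 40, 50, 60])

# Mean of every observation (the "empirical mean" in this homework's terms).
toy_empirical_mean = toy_all.mean()

# Mean of a random subset of 3 observations (a "sample mean").
toy_sample_mean = toy_all.sample(n=3, random_state=1).mean()

print(toy_empirical_mean)
print(toy_sample_mean)
```

The sample mean depends on which observations happen to be drawn, so it usually differs somewhat from the empirical mean.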

# add your code below

Task 4: Set up a Simulation Experiment#

You will investigate the behaviour of sample means for the following sample sizes:

5, 10, 20, 50, 100, 200, 500, and 1000.

Step 1 - Specify Sample Sizes#

Create a numpy array named sample_sizes with the aforementioned values in the specified order using np.array(). (1 mark)

# add your code below

Step 2 - Simulating Sample Means#

In this part, you will complete a function that creates and returns a list of the sample means of the sample draws.

Name the function simulate_sample_means.

The function will have two arguments:

  • data: a pandas Series or column of a DataFrame that we are sampling

  • N: an int, the size of the sample we draw

Your function should make and return a list of 100 sample means, each computed from a sample of size N drawn from data. Sample without replacement.

The function will return a list. The list will be of size 100, with each element in the list representing the sample mean from the sample of size N.

Hint: Initialize an empty list used to store the sample means. Inside your for loop generate a sample from the data, calculate the sample mean, and append it to your list.
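The role of `random_state` in `.sample` can be sketched on toy data. The Series `toy_data` and the seed values 42 and 43 are arbitrary and are not the seeds used by the homework function.

```python
import pandas as pd

# The numbers 1 to 20, made up for illustration.
toy_data = pd.Series(range(1, 21))

# The same random_state yields the same "random" sample every time.
first = toy_data.sample(n=5, random_state=42)
second = toy_data.sample(n=5, random_state=42)
print(first.equals(second))

# By default, sample draws without replacement, so no row appears twice.
print(first.index.is_unique)

# A different random_state generally gives a different sample.
third = toy_data.sample(n=5, random_state=43)
print(third.mean())
```

This is why changing the seed on every pass through a loop gives a fresh sample each time while keeping the whole run repeatable.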

# Finish the function header and complete the function body.
# 
def <replace-with-function-name>(data, N):

    '''Return a list of 100 sample means from a sample of size N from data.'''
    
    
    # This next statement is for reproducibility: each random number is generated
    # mathematically based on the previous random number, and we can say which
    # number to start with when we call sample. This allows us to have reproducibility
    # with "random" numbers and so we can autotest! Yay!
    seed = 0

    # Here, create any variables that need to be updated inside the loop.

    
    # In the next few lines, we will write a for loop to:
    # - generate a sample of size N and compute sample mean.
    # - append the sample mean to the list of sample means.
    # repeat 100 times.

    for _ in range(100):
        seed += 5 # Don't change this line
        
        
        # Here, write code to:
        # 1) Take a sample of data, and calculate the sample mean.
        #    When you call .sample, make sure you use random_state=seed as one of the
        #    arguments.
        # 2) Append the sample mean to the list of sample means.
        
        
         
    return <replace-with-your-output-variable>

# check your work
simulate_sample_means(clean_nonzero, 5)

Task 5 - Simulate Sample Means#

In this part, you will complete a code block that computes and compiles simulated means for each sample size.

For each sample size in sample_sizes, call the function simulate_sample_means from the previous step to calculate 100 sample means at that sample size. You’re going to build a dictionary where each key is a sample size and each value is the corresponding list of means that simulate_sample_means returned.

Accumulating information in a dictionary#

Remember in lecture we used a for loop to add up a series of numbers? And then we used a for loop to accumulate a list of means? As it turns out, you can use the same technique to make a dictionary.

Here’s how you add a key/value pair to a dictionary (this is also called “inserting”):

d = {} # this creates an empty dictionary named "d"
d["key1"] = "value1"
d
d["key2"] = "value2"
d
d["key1"] = "new_value"
d

You can accumulate a new dictionary using a for loop:

ta_to_course = {}
for name in ["Ibrahim", "Adrienne", "Asana", "Yifeng", "Yongxin"]:
    ta_to_course[name] = 'GGR274'

ta_to_course
for name in ["Meng", "Alan"]:
    ta_to_course[name] = 'EEB125'

ta_to_course
# use "key" to access the associated value
ta_to_course["Meng"]
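The same accumulation pattern works when the keys are numbers and the values are lists. The sketch below uses a made-up dictionary, `multiples_by_number`, that is unrelated to the homework data.

```python
# Accumulate a dictionary whose keys are numbers and whose values are lists.
multiples_by_number = {}

for number in [2, 3, 5]:
    # Build the list of the first four multiples of this number.
    multiples = []
    for k in range(1, 5):
        multiples.append(number * k)

    # Insert the key/value pair: the number maps to its list of multiples.
    multiples_by_number[number] = multiples

print(multiples_by_number)
```

Each pass through the outer loop adds exactly one key, so the finished dictionary has one entry per element of the list being looped over.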

Step 1 - Create a dictionary of simulated means for each sample size#

As you loop through each element in sample_sizes (you can loop through the numpy array as you would a list), you will pass the current sample size to the function simulate_sample_means (specifically, the argument N). You will be sampling from the cleaned dataset, so make sure to pass the value of clean_nonzero to the data parameter.

The result of calling simulate_sample_means is a list of means. Add a new key/value pair to all_sample_means_by_sample_size. The key is the current sample size and the value is the list of means.

Finally, we will check the following in the autotester:

  • all_sample_means_by_sample_size: a dictionary mapping the sample sizes to a list of 100 sample means. Because we’ll use the same random seed, we’ll get the same “random” sequence. That means that we can autotest it. (2 marks)

# Fill in your code below
# Start with an empty dictionary
all_sample_means_by_sample_size = ...

# Finish the code

for ... 

Step 2 - Answer this question#

Briefly explain what the keys and values represent in the dictionary all_sample_means_by_sample_size.

  • You can inspect the keys by using the keys method all_sample_means_by_sample_size.keys().

  • You can inspect the values by using the values method all_sample_means_by_sample_size.values().

print(all_sample_means_by_sample_size.keys())
print(all_sample_means_by_sample_size.values())

Answer here:

Task 6 - Plot the results#

Step 1: Create Data for Plotting#

In this section, you will calculate the mean of the simulated sample means at each sample size.

Create the following variables:

  • sample_means_by_sample_size: a DataFrame created from the dictionary all_sample_means_by_sample_size. (1 mark)

  • mean_of_sample_means_by_sample_size: the column means of sample_means_by_sample_size, that is, the mean of the sample means at each sample size. (1 mark)

  • diff_sample_mean_empirical_means_by_sample_size: the difference between the mean of sample means and the empirical mean at each sample size. (1 mark)
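As a sketch of these three operations on toy data: a dictionary of equal-length lists becomes a DataFrame, `mean()` averages each column, and subtracting a number shifts every entry. All names and values below are invented for illustration.

```python
import pandas as pd

# Toy dictionary: each key becomes a column label,
# and each list becomes that column's values.
toy_means = {5: [10.0, 14.0, 12.0], 10: [11.0, 12.0, 10.0]}
toy_df = pd.DataFrame(toy_means)

# mean() on a DataFrame averages down each column and returns
# a Series indexed by the column labels.
toy_column_means = toy_df.mean()

# Subtracting a single number from a Series subtracts it from every element.
toy_reference = 11.5  # made-up reference value
toy_diff = toy_column_means - toy_reference

print(toy_diff)
```

The index of `toy_diff` holds the original dictionary keys, which is convenient when the keys are the quantity you want on a plot's horizontal axis.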

# add your code below

Step 2: Plot the data#

In this section you will plot the results.

Create a scatter plot using matplotlib with

  • diff_sample_mean_empirical_means_by_sample_size.index on the horizontal axis (x-axis) and

  • diff_sample_mean_empirical_means_by_sample_size on the vertical axis (y-axis)

by completing the code below. You may find the reference page of matplotlib.pyplot.scatter() helpful.

Label the horizontal axis with the text Sample size and the vertical axis with the text Difference between sample mean and population mean.
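A scatter plot of a Series against its own index can be sketched as follows. This example uses matplotlib's Axes interface (`ax.scatter`, `ax.set_xlabel`) with made-up values and labels; the pyplot functions `plt.scatter`, `plt.xlabel` and `plt.ylabel` do the equivalent on the current plot.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values indexed by made-up group sizes (not homework results).
toy_series = pd.Series([1.5, -0.8, 0.3], index=[5, 10, 20])

fig, ax = plt.subplots()

# The index supplies the x-coordinates and the values supply the y-coordinates.
ax.scatter(toy_series.index, toy_series)

# Axis labels are set separately from the plotting call.
ax.set_xlabel("Group size")
ax.set_ylabel("Value")
```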

plt.axhline(y=0, linestyle=":", color="coral") # DO NOT change this line; it adds a dotted horizontal line at y=0
plt.scatter(
    # complete the function call
);
# add your code below to label the axes

Task 7 - Answer the following Questions#

Include cells with your answers to each of these questions:

  1. What is the empirical mean time spent cleaning by respondents per day (in minutes)? Does this value make sense? Why or why not? Answer in one line. (1 mark)

Answer here:

  2. Based on your final scatter plot, what trend or pattern do you notice between sample size and the difference between the mean of sample means and the empirical mean? Does the difference decrease or increase with sample size? Explain why this trend is seen, drawing on your understanding of the randomness of sampling. (2 marks)

Answer here:

  3. If you were to do further analysis to study how the time spent cleaning is different for various subpopulations, which additional sociodemographic variables might you consider? Why? Write 3-5 sentences identifying 1-2 variables (e.g. age - don’t pick this!) of interest and what differences in cleaning time you might expect to find.

Answer here: