GGR274 Homework 7: Summary Statistics, Histograms and Simulation#

Logistics#

Due date: The homework is due at 23:59 on Monday, March 04.

You will submit your work on MarkUs. To submit your work:

  1. Download this file (Homework_7.ipynb) from JupyterHub. (See our JupyterHub Guide for detailed instructions.)

  2. Submit this file to MarkUs under the hw7 assignment. (See our MarkUs Guide for detailed instructions.) All homeworks will take place in a Jupyter notebook (like this one). When you are done, you will download this notebook and submit it to MarkUs.

Piloting MarkUs JupyterHub Extension (optional)#

Starting with this week’s lab and homework, we’re piloting a new way to submit files to MarkUs directly from JupyterHub (without needing to download them). This is optional so you can still submit your work the usual way, but if you have some time please try it out by following the instructions on the MarkUs Guide.

Introduction#

For this week’s homework, we’ll investigate the behaviour of sample statistics and distributions as we vary our sample size. Specifically, we’ll be investigating the mean amount of time spent cleaning by respondents. Furthermore, we will extend our analysis by studying how our sample mean estimate tends to change when we take samples of increasing sizes.

Question#

Question: How much time on average do respondents spend on indoor house cleaning? How does our estimate of a sample mean change as we take increasingly larger samples?

Instructions and Learning Objectives#

In this homework, you will:

  • Work with the Time Use dataset from lecture to investigate properties of sampling means as the sample size changes

  • Create and modify for loops and functions to run sampling simulations

  • Visualize data using histograms and scatter plots

Task 1 - Read in data#

The Data part of your notebook should read the raw data, extract a DataFrame containing the important columns, rename the columns, and filter out missing values.

You might find it helpful to name intermediate values in your algorithms. That way you can examine them to make sure they have the type you expect and that they look like what you expect. Very helpful when debugging!

Step 1#

Create the following pandas DataFrames:

  • time_use_data_raw: the DataFrame created by reading the gss_tu2016_main_file.csv file. (1 mark)

  • time_use_dur: the DataFrame containing the following columns from time_use_data_raw: 'CASEID', 'dur18'. (1 mark) (We test this after any changes are made to it. We do not check the initial value.)
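The read-then-select pattern can be sketched on a tiny made-up CSV held in memory. This is only an illustration, not the homework solution: the names toy_csv, toy_raw and toy_subset, the extra column other_col, and all of the values are hypothetical.

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV file on disk (all values are made up).
toy_csv = io.StringIO(
    "CASEID,dur18,other_col\n"
    "1,0,7\n"
    "2,30,8\n"
    "3,45,9\n"
)

# pd.read_csv accepts a file path or a file-like object such as StringIO.
toy_raw = pd.read_csv(toy_csv)

# Indexing with a list of column names returns a DataFrame
# that contains only those columns, in the order listed.
toy_subset = toy_raw[["CASEID", "dur18"]]
print(toy_subset)
```

With a real file you would pass the file name as a string to pd.read_csv instead of the in-memory object.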

import pandas as pd
import matplotlib.pyplot as plt

# add your code below

Step 2#

time_use_dur could use more informative column names.

Replace CASEID and dur18 in time_use_dur by

  • creating a dictionary new_column_names that maps the column names from time_use_dur to the values 'participant_id' and 'time_spent_cleaning'. (1 mark)

  • creating a new DataFrame named time_use_data that is a copy of time_use_dur, but with the columns renamed using new_column_names. (1 mark)
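As a hedged illustration of renaming with a dictionary, here is a toy example. The DataFrame toy, the dictionary toy_names and the column names are all hypothetical. Note that rename returns a new DataFrame and leaves the original unchanged.

```python
import pandas as pd

# Hypothetical toy DataFrame with uninformative column names.
toy = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Keys are the old column names, values are the new ones.
toy_names = {"a": "first", "b": "second"}

# rename returns a renamed copy; toy itself keeps its original names.
toy_renamed = toy.rename(columns=toy_names)

print(list(toy.columns))
print(list(toy_renamed.columns))
```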

# add your code below
# Step 2 check that you have the correct column names

expected_columnnames = ['participant_id', 'time_spent_cleaning']

try:
    assert expected_columnnames == list(time_use_data.columns)
    print('Column names are correct!')
except:
    print('Something is wrong, check your column names')

Task 2 - Compute and Visualize Distribution#

Step 1#

Summarize the distribution of the column time_spent_cleaning in time_use_data for respondents who spent at least some time cleaning (i.e., had a non-zero value of time_spent_cleaning) using the describe function. To do this:

  • Create a pandas Series called clean_nonzero that only has respondents with non-zero values of time_spent_cleaning.

  • Use the describe function to describe the distribution of time_spent_cleaning, and store the results in a variable called summary_stats.
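The two bullet points combine boolean filtering with describe. A minimal sketch on a made-up Series follows; toy_minutes, toy_nonzero and toy_summary are hypothetical names and the values are invented.

```python
import pandas as pd

# Hypothetical Series of minutes; two of the five values are zero.
toy_minutes = pd.Series([0, 10, 0, 20, 30], name="minutes")

# A boolean condition inside square brackets keeps only the rows
# where the condition is True.
toy_nonzero = toy_minutes[toy_minutes != 0]

# describe returns a Series of summary statistics
# (count, mean, std, min, quartiles, max).
toy_summary = toy_nonzero.describe()
print(toy_summary)
```

Individual statistics can be looked up by label, for example toy_summary['mean'].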

# add your code below

Step 2#

Visualize clean_nonzero by creating a histogram using matplotlib with the following parameters:

bins = 25, edgecolor='black', linewidth = 1.3, color = 'lightgrey'

Label the horizontal axis (x-axis) Time spent cleaning (minutes).
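For reference, this is roughly how styling arguments and an axis label are passed to a matplotlib histogram. It is a sketch on made-up data with deliberately different parameter values from the ones required above; toy_values and the label text are hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data: eight made-up observations.
toy_values = pd.Series([1, 2, 2, 3, 3, 3, 4, 8])

fig, ax = plt.subplots()

# hist returns the bin counts, the bin edges, and the drawn bars.
counts, bin_edges, patches = ax.hist(
    toy_values,
    bins=4,              # number of equal-width bins
    edgecolor="black",   # outline colour of each bar
    linewidth=1.0,       # outline thickness
    color="lightblue",   # fill colour
)

# Label the horizontal axis.
ax.set_xlabel("Toy value (units)")
```

The same keyword arguments also work with plt.hist, and plt.xlabel sets the label when you are not holding an Axes object.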

# Add your code below

Step 3#

In a markdown cell, describe the distribution of the data, pointing out features like the mode (where most observations lie), skew, and potential outliers. Do our results make sense given what we know about time spent on daily cleaning?

Are there any strange values? Specifically, does it make sense for people to spend 0 mins cleaning? Briefly explain. (2 marks)

Answer goes here …

Task 3: Compute the empirical mean#

The empirical mean is the mean of all observed data. We distinguish this from the sample mean, which is the mean of a sample, or portion, of all the data.

Compute the empirical mean time spent cleaning by respondents and name it empirical_mean_time_spent_cleaning. (1 mark)

# add your code below

Task 4: Set up a Simulation Experiment#

You will investigate the behaviour of sample means for the following sample sizes:

5, 10, 20, 50, 100, 200, 500, and 1000.

Step 1 - Specify Sample Sizes#

Create a list named sample_sizes with the aforementioned values in the specified order. (1 mark)

# add your code below

Step 2 - Simulating Sample Means#

In this part, you will complete a function that creates and returns a list of the sample means of the sample draws.

Name the function simulate_sample_means.

The function will have two arguments:

  • data: a pandas Series or column of a DataFrame that we are sampling

  • N: an int, the size of the sample we draw

Your function should draw 100 samples, each of size N, from data, and make and return a list of the 100 sample means.

Sample without replacement.

The function will return a list, sample_means. The list will be of size 100, with each element in the list representing the sample mean from the sample of size N. Hint: Initialize an empty list used to store the sample means. Inside your for loop generate a sample from the data, calculate the sample mean, and append it to your list.
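To see why a seed makes the simulation reproducible, here is a small sketch of a single draw with Series.sample. It is not the function you are asked to write: toy_data, sample_a and sample_b are hypothetical names and the values are made up.

```python
import pandas as pd

# Hypothetical Series to sample from.
toy_data = pd.Series([5, 10, 15, 20, 25, 30, 35, 40])

# sample draws without replacement by default (replace=False),
# so no row can appear twice in one sample.
sample_a = toy_data.sample(n=3, random_state=1)
sample_b = toy_data.sample(n=3, random_state=1)

# The same random_state gives exactly the same "random" sample.
print(sample_a.equals(sample_b))

# The sample mean is the mean of the drawn values.
print(sample_a.mean())
```

Changing random_state on each iteration, as the seed += 1 line in the template does, gives a different but still reproducible sample each time.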

Wherever there is a commented chunk of code of the form var_name = ..., replace the ... with the appropriate value or expression.

# Finish the function header and complete the function body.
# 
def 

    '''Return a list of 100 sample means from a sample of size N from data.'''
    
    
    # This next statement is for reproducibility: each random number is generated
    # mathematically based on the previous random number, and we can say which
    # number to start with when we call sample. This allows us to have reproducibility
    # with "random" numbers and so we can autotest! Yay!
    seed = 0

    # Create any variables you need here, such as the list of sample means you are
    # accumulating.

    
    # In the next few lines, we will write a for loop to:
    # - generate a sample of size N and compute sample mean.
    # - append the sample mean to the list of sample means.
    # repeat 100 times.

    for _ in range(100):
        seed += 1 # Don't change this line
        
        
        # Here, write code to:
        # 1) Take a sample of data, and calculate the sample mean.
        #    When you call .sample, make sure you use random_state=seed as one of the
        #    arguments.
        # 2) Append the sample mean to the list of sample means.
        
        
         
    return ???
# Check your work
simulate_sample_means(clean_nonzero, 5)

Task 5 - Simulate Sample Means#

In this part, you will complete a code block that computes and compiles simulated means for each sample size.

For each sample size in sample_sizes, call the function simulate_sample_means from the previous step to calculate 100 sample means at that sample size. You’re going to build a dictionary where each key is a sample size and each value is the corresponding list of means that simulate_sample_means returned.

Accumulating information in a dictionary#

Remember in lecture we used a for loop to add up a series of numbers? And then we used a for loop to accumulate a list of means? As it turns out, you can use the same technique to make a dictionary.

Here’s how you add a key/value pair to a dictionary (this is also called “inserting”):

d = {}
d['key1'] = 'value1'
d
d['key2'] = 'value2'
d
d['key1'] = 'new_value'
d

You can accumulate a new dictionary using a for loop:

ta_to_course = {}
for name in ['Amber', 'Martin', 'Davia', 'KP', 'Ilan']:
    ta_to_course[name] = 'GGR274'

ta_to_course
for name in ['Matt', 'Fiona']:
    ta_to_course[name] = 'EEB125'

print(ta_to_course)
print(ta_to_course['Matt'])

Step 1 - Create a dictionary of simulated means for each sample size#

As you loop through each element in sample_sizes, you will pass the current sample size to the function simulate_sample_means (specifically, the argument N). You will be sampling from the cleaned dataset, so make sure to pass the value of clean_nonzero to the data parameter.

The result of calling simulate_sample_means is a list of means. Add a new key/value pair to all_sample_means_by_sample_size. The key is the current sample size and the value is the list of means.

Finally we will be checking this in the autotester:

  • all_sample_means_by_sample_size: a dictionary mapping each sample size to a list of 100 sample means. (Because we’ll use the same random seed, we’ll get the same “random” sequence. That means that we can autotest it. Yay!) (2 marks)

# Fill in your code below
# Start with an empty dictionary
all_sample_means_by_sample_size = ...

# Finish the code

for ... 

Step 2 - Answer this question#

Briefly explain what the keys and values represent in the dictionary all_sample_means_by_sample_size. You can obtain the keys by using the keys method all_sample_means_by_sample_size.keys().

answer to question …

Task 6 - Plot the results#

Step 1: Create Data for Plotting#

In this section you will calculate the mean of the simulated sample means at each sample size.

Create the following variables:

  • sample_means_by_sample_size: a DataFrame created from the dictionary all_sample_means_by_sample_size. (1 mark)

  • mean_of_sample_means_by_sample_size: the column means of sample_means_by_sample_size, that is, the mean of the sample means at each sample size. (1 mark)

  • diff_sample_mean_empirical_means_by_sample_size: the difference between mean of sample means and the empirical mean at each sample size. (1 mark)
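The three steps above (dictionary to DataFrame, column means, difference from a reference value) can be sketched on toy numbers. Everything here is hypothetical: the dictionary toy_means_by_size, its keys and values, and the reference value toy_reference are made up for illustration.

```python
import pandas as pd

# Hypothetical dictionary: each key maps to a list of equal length.
toy_means_by_size = {2: [1.0, 3.0, 5.0], 4: [2.0, 3.0, 4.0]}

# Each key becomes a column; each list becomes that column's values.
toy_frame = pd.DataFrame(toy_means_by_size)

# mean() on a DataFrame averages each column and returns a Series
# whose index holds the column labels (here the keys 2 and 4).
toy_column_means = toy_frame.mean()

# Subtracting a single number subtracts it from every element.
toy_reference = 2.5
toy_diff = toy_column_means - toy_reference
print(toy_diff)
```

Because the index of the resulting Series holds the original dictionary keys, it can be used directly as the x-values of a plot.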

# add your code below

Step 2: Plot the data#

In this section you will plot the results.

Create a scatter plot using matplotlib with

  • diff_sample_mean_empirical_means_by_sample_size.index on the horizontal axis (x-axis) and

  • diff_sample_mean_empirical_means_by_sample_size on the vertical axis (y-axis).

Label the horizontal axis with the text Sample size and the vertical axis with the text Difference between sample mean and population mean.
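As a rough sketch of plotting a Series against its own index, consider the toy example below. The Series toy_diff, its values and the axis label text are hypothetical, so substitute the variable and labels specified above in your own plot.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical Series: the index plays the role of the x-values.
toy_diff = pd.Series([0.8, -0.3, 0.1], index=[5, 50, 500])

fig, ax = plt.subplots()

# scatter takes the x-values first, then the y-values.
ax.scatter(toy_diff.index, toy_diff)

# Label both axes.
ax.set_xlabel("Toy x label")
ax.set_ylabel("Toy y label")
```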

# add your code below

Task 7 - Answer the following Questions#

Include cells with your answers to each of these questions:

  1. What is the empirical mean time spent cleaning by respondents per day (in minutes)? Does this value make sense? Why or why not? Answer in one line. (1 mark)

Answer goes here …

  2. Based on your final scatter plot, what trend or pattern do you notice between sample size and the difference between the mean of sample means and the empirical mean? Does the difference decrease or increase with sample size? Explain why this trend is seen, drawing on your understanding of the randomness of sampling. (2 marks)

Answer goes here …

  3. If you were to do further analysis to study how the time spent cleaning is different for various subpopulations, which additional sociodemographic variables might you consider? Why? Write 3-5 sentences identifying 1-2 variables (e.g. age - don’t pick this!) of interest and what differences in cleaning time you might expect to find.

Answer goes here …