GG274 Homework 8: Hypothesis Testing

Logistics

Due date: The homework is due at 23:59 on Monday, March 11.

You will submit your work on MarkUs. To submit your work:

  1. Download this file (Homework_8.ipynb) from JupyterHub. (See our JupyterHub Guide for detailed instructions.)

  2. Submit this file to MarkUs under the hw8 assignment. (See our MarkUs Guide for detailed instructions.)

All homeworks will take place in a Jupyter notebook (like this one). When you are done, you will download this notebook and submit it to MarkUs.

Piloting MarkUs JupyterHub Extension (optional)

We’re piloting a new way to submit files to MarkUs directly from JupyterHub (without needing to download them). This is optional, so you can still submit your work the usual way. If you have some time, please try it out by following the instructions in the MarkUs Guide.

Introduction

Residential instability is one component of the Ontario Marginalization Index. It includes indicators of the types and density of residential accommodations, and certain family structure characteristics, such as living alone and dwelling ownership (see OCHPP).

In this homework you will explore the following question:

Are mental health visits different in Toronto neighbourhoods with higher “residential instability”?

import numpy as np
import pandas as pd
%pip install xlrd

Step 1 - Read the Neighbourhood Instability data into a pandas DataFrame

a) The data is stored in 1_marg_neighb_toronto_2006_OnMarg.xls, a Microsoft Excel file with the extension .xls.

Use the pandas function read_excel to read the sheet Neighbourhood_Toronto_OnMarg into a pandas DataFrame named marg_neighb.

# Write your code below

Use marg_neighb to create another DataFrame called instability_df that has three columns: 'Neighb id ', 'Neighbourhood name ', 'INSTABILITY'.

# Write your code below

b) Rename the columns of instability_df using the following table. The DataFrame with the new column names should still be called instability_df (i.e., don’t change the name of the DataFrame).

| Original column name | New column name |
| --- | --- |
| Neighb id | Neighbid |
| INSTABILITY | INSTABILITY |
| Neighbourhood name | name |

# Write your code below
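The selecting and renaming pattern described in this step can be sketched on a small toy DataFrame. Everything below (the DataFrame `toy`, its column labels, and its values) is hypothetical and only illustrates the general pandas idiom; it is not the homework data or its solution.

```python
import pandas as pd

# Hypothetical toy data; the labels are made up and two of them
# deliberately end with a trailing space.
toy = pd.DataFrame({
    "id ": [1, 2, 3],
    "area name ": ["A", "B", "C"],
    "score": [0.5, -0.2, 1.1],
    "unused": [10, 20, 30],
})

# Select a subset of columns by passing a list of labels.
# .copy() gives an independent DataFrame rather than a view of toy.
toy_subset = toy[["id ", "area name ", "score"]].copy()

# Rename with a dict that maps old labels to new labels, and assign
# the result back to the same variable name.
toy_subset = toy_subset.rename(columns={"id ": "id", "area name ": "name"})

print(list(toy_subset.columns))
```

Labels must match exactly, including any trailing spaces, so it can help to inspect the `columns` attribute of a DataFrame before selecting or renaming.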

Step 2 - Read the mental health visit data into a pandas DataFrame

a) In this step you will read in data on rates of mental health visits stored in 2_ahd_neighb_db_ast_hbp_mhv_copd_2012.xls into a pandas DataFrame named mentalhealth_neighb.

# Write your code below

b) Create a new DataFrame mhvisitrates by selecting the columns in mentalhealth_neighb that correspond to Neighbourhood ID, Neighbourhood Name, and ‘Age-Standardized rate of Mental Health Visits (2012), All Ages 20+’. Rename the last of these columns in mhvisitrates to mhvisitrates_mf. When you rename this column, don’t change the name of the DataFrame mhvisitrates.

# Write your code below

Step 3 - Merge mental health visits and instability

In this step you will merge mhvisitrates with instability_df.

a) Merge mhvisitrates with instability_df and name this DataFrame mhvisitinstab.

# Write your code below
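As a rough illustration of how a merge combines two tables on a shared key, here is a minimal sketch on toy data. The DataFrames `left` and `right` and the key column `id` are hypothetical and are not the homework's column names.

```python
import pandas as pd

# Hypothetical toy tables that share a key column named "id".
left = pd.DataFrame({"id": [1, 2, 3], "rate": [7.1, 8.4, 6.0]})
right = pd.DataFrame({"id": [2, 3, 4], "score": [0.3, -0.5, 1.2]})

# merge performs an inner join by default: only keys present in
# both tables are kept.
merged = left.merge(right, on="id")

print(merged)
```

When the key columns have different names in the two tables, `left_on` and `right_on` can be used instead of `on`. Comparing the row counts before and after a merge is a quick way to check whether any keys failed to match.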

Step 4

a) Create a new column in mhvisitinstab named instab_HL that categorizes neighbourhoods. The new column should have two possible values:

  • 'High', if the neighbourhood’s INSTABILITY value is greater than or equal to the mean

  • 'Low', if the neighbourhood’s INSTABILITY value is less than the mean

# Write your code below
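One common way to build a two-level column from a numeric threshold is `np.where`. The sketch below uses a hypothetical toy DataFrame with made-up values and column names (`score`, `level`); it shows the idiom only and assumes nothing about the homework data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the mean of these values is exactly 4.0.
toy = pd.DataFrame({"score": [1, 4, 9, 2]})

threshold = toy["score"].mean()

# np.where(condition, value_if_true, value_if_false) works element-wise.
# Values equal to the threshold satisfy >= and are labelled "High".
toy["level"] = np.where(toy["score"] >= threshold, "High", "Low")

print(toy)
```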

b) Compute the frequency distribution of instab_HL. Save the results in instab_HL_frequencies.

# Write your code below
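A frequency distribution of a categorical column can be obtained with `value_counts`. The following is a small sketch on a hypothetical Series named `levels`, not the homework column.

```python
import pandas as pd

# Hypothetical toy labels.
levels = pd.Series(["High", "Low", "Low", "High", "Low"], name="level")

# Counts of each distinct value.
counts = levels.value_counts()

# Passing normalize=True returns proportions instead of counts.
proportions = levels.value_counts(normalize=True)

print(counts)
print(proportions)
```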

c) Is there evidence that Toronto has many neighbourhoods that have residential instability? Briefly explain. (1 mark)

answer goes here …

Step 5 - Do neighbourhoods with high residential instability have more mental health visits compared to neighbourhoods with low residential instability?

a) Use the DataFrame describe method to summarize the distribution of mhvisitrates_mf in mhvisitinstab grouped by instab_HL. Store the results in median_table.

# Write your code below

Use median_table to compute the difference in medians between neighbourhoods with high and low instability. Store this value in median_diff.

# Write your code below
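To illustrate how a grouped `describe` table can be indexed, here is a hedged sketch on hypothetical toy data (the names `toy`, `level`, `rate`, `summary`, and `toy_diff` are made up). In the table that `describe` returns, the median appears in the column labelled `'50%'`.

```python
import pandas as pd

# Hypothetical toy data with two groups.
toy = pd.DataFrame({
    "level": ["High", "High", "High", "Low", "Low", "Low"],
    "rate": [10.0, 12.0, 20.0, 5.0, 7.0, 6.0],
})

# One row per group; columns are count, mean, std, min, 25%, 50%, 75%, max.
summary = toy.groupby("level")["rate"].describe()

# .loc[row_label, column_label] picks a single cell of the summary table.
toy_diff = summary.loc["High", "50%"] - summary.loc["Low", "50%"]

print(summary)
print(toy_diff)
```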

Step 6 - Set up a simulation in Python to test if the medians are equal

a) In this step you will write a function random_shuffle_median that returns a simulated value of the median difference (a simulated value of the test statistic) of mental health visit rates in neighbourhoods with high versus low residential instability assuming that there really is no difference in mental health visit rates between these types of neighbourhoods.

A step-by-step explanation of a similar function was given in lecture, and you can follow this example to help guide you through this step.

The function random_shuffle_median is started for you below. Your task is to complete the function by filling in the ....

Try writing a meaningful docstring for random_shuffle_median. The pandas docstring guide has some great examples and guidelines. (NB: this will not be graded)

def random_shuffle_median():
    """
    Put your docstring here (optional)
    """

    # shuffle the column of mhvisitinstab that corresponds to high/low instability

    instab_HL_shuffle = mhvisitinstab[...].sample(frac=1, replace=False).reset_index(drop = True)
    
    # calculate the median visit rate for high and low instability neighbourhoods

    visitrate_low_shuffle = mhvisitinstab.loc[instab_HL_shuffle == ..., ...].median()
    
    visitrate_high_shuffle  = mhvisitinstab.loc[instab_HL_shuffle ==  ..., ...].median()
    
    shuffled_diff = ... - ...
    
    return shuffled_diff

print(random_shuffle_median.__doc__)
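The mechanics of shuffling a label column can be seen on a small toy example. The DataFrame `toy`, its columns, and the seed below are hypothetical; the sketch only shows what `sample` and `reset_index` do to a Series and makes no claim about the homework's answer.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # hypothetical seed, only to make the toy run repeatable

# Hypothetical toy data with the default 0..n-1 index.
toy = pd.DataFrame({
    "group": ["High", "High", "Low", "Low", "Low"],
    "value": [10.0, 12.0, 5.0, 6.0, 7.0],
})

# sample(frac=1, replace=False) returns every label exactly once,
# in random order, still carrying the original index labels.
shuffled = toy["group"].sample(frac=1, replace=False)

# reset_index(drop=True) discards the scrambled index and renumbers
# the shuffled labels 0..n-1.
shuffled_reset = shuffled.reset_index(drop=True)

# Used as a boolean mask, the renumbered Series lines up with toy's rows,
# so each row is paired with a randomly reassigned label.
high_values = toy.loc[shuffled_reset == "High", "value"]

print(shuffled)
print(shuffled_reset)
print(high_values)
```

Because pandas aligns boolean masks by index label, comparing the output of `shuffled` and `shuffled_reset` shows why the index handling matters here.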

b) Explain the purpose of

mhvisitinstab[...].sample(frac=1, replace=False).reset_index(drop = True)

in 1-2 sentences.

answer goes here

Step 7 - Compute the distribution of simulated values of the median difference assuming the null hypothesis is true

We will use your student number to generate data for this homework. Complete the assignment statement below by typing your student number as an int. In other words assign your student number as an integer to the variable student_number.

# Replace the ... with your student number
student_number = ...

# This checks that you correctly typed in your student_number as an int.
# Make sure there's no error when you run this cell!
assert type(student_number) == int

a) Write a function called shuffled_diffs that returns a list. The function should use a for loop that calls the function random_shuffle_median repeatedly and collects each returned value in the list. The number of times that the for loop iterates should be controlled by a function parameter named number_of_shuffles.

# Write your code below
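The general pattern of calling a simulation function repeatedly and collecting the results in a list might look like the sketch below. Both `simulate_once` and `repeat_simulation` are hypothetical stand-ins introduced for illustration; they are not the functions named in this step.

```python
import random


def simulate_once():
    """Hypothetical stand-in that returns one simulated value."""
    return random.gauss(0, 1)


def repeat_simulation(number_of_runs):
    """Call simulate_once number_of_runs times and return the values as a list."""
    results = []
    for _ in range(number_of_runs):
        results.append(simulate_once())
    return results


toy_sims = repeat_simulation(100)
print(len(toy_sims))
```

Passing the number of repetitions as a parameter makes it easy to test with a small number of runs before launching a long simulation.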

b) Use shuffled_diffs to compute 10000 simulated median differences between high and low instability neighbourhoods assuming that there is no difference in median mental health visit rates between high and low instability neighbourhoods. Store the values in shuffled_diffs_10000.

np.random.seed(student_number)

# Write your code below

c) Plot the distribution of the 10,000 simulated values stored in shuffled_diffs_10000 using a matplotlib histogram. Name the plot nullhypothesis_distribution_plot. Label the horizontal axis as 'Difference in median visit rates for high and low instability neighbourhoods' and the vertical axis as 'Frequency'.

import matplotlib.pyplot as plt

# Write your code below
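A labelled matplotlib histogram can be sketched as follows. The data are hypothetical random numbers, and the variable names, bin count, and axis label text are illustrative choices rather than the ones this step requires.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical toy values standing in for simulated statistics.
rng = np.random.default_rng(0)
toy_values = rng.normal(size=500)

fig, ax = plt.subplots()

# hist returns the bin counts, the bin edges, and the drawn patches.
bin_counts, bin_edges, _ = ax.hist(toy_values, bins=20, edgecolor="black")

ax.set_xlabel("Simulated value (toy data)")
ax.set_ylabel("Frequency")
```

In a notebook the figure is displayed when the cell finishes running; in a script, `plt.show()` would be needed.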

Step 8 - Compute the p-value

a) Compute the number of simulated differences in medians in shuffled_diffs_10000 that are greater than or equal to the observed median difference (median_diff). Store this value in rightextreme.

# Write your code below

b) Compute the number of simulated differences in medians in shuffled_diffs_10000 that are less than the negative of the observed median difference (-median_diff). Store this value in leftextreme.

# Write your code below

c) Use rightextreme and leftextreme to compute the p-value. Store the p-value in pvalue.

# Write your code below
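As a hedged illustration of turning counts of extreme simulated values into a proportion, the sketch below uses a hypothetical array `toy_sims` and a hypothetical observed statistic `toy_observed`. It assumes one common two-sided convention (values at least as large as the observed statistic, plus values below its negative); the definitions given in parts a) and b) take precedence over this assumption.

```python
import numpy as np

# Hypothetical simulated statistics and observed statistic.
toy_sims = np.array([-3.0, -1.0, 0.0, 0.5, 1.0, 2.0, 2.5, -2.5, 0.2, -0.4])
toy_observed = 2.0

# Comparing an array with a number gives a boolean array;
# summing it counts the True entries.
n_right = (toy_sims >= toy_observed).sum()   # 2.0 and 2.5
n_left = (toy_sims < -toy_observed).sum()    # -3.0 and -2.5

# Assumed convention: proportion of simulated values at least as
# extreme as the observed one, in either direction.
toy_pvalue = (n_right + n_left) / len(toy_sims)

print(n_right, n_left, toy_pvalue)
```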

Step 9 - Communicate what you did in the steps above

a) In a few sentences introduce the question that you explored in this homework (see the beginning of this homework). For example, why do you or others think this is an important question? (1 mark)

answer goes here …

b) Briefly describe the data sources that you used to answer the question. Which statistical variables did you use and why did you use these variables? (1 mark)

answer goes here …

c) What computational and statistical methods or analyses did you use to answer the question? Briefly describe these methods and how they were used to answer the question. (1 mark)

answer goes here …

d) Briefly describe the results of your statistical analysis in a few sentences. (1 mark)

answer goes here …

e) What conclusions about the question you set out to answer are supported by the data and your statistical analysis of the data? State at least one limitation of your conclusions. (See the USC Research Guide section on study limitations) (1 mark)

answer goes here …