{ "cells": [ { "cell_type": "markdown", "id": "bd13e811", "metadata": {}, "source": [ "# GG274 Homework 8: Hypothesis Testing\n", "\n", "Residential instability is one component of the [Ontario Marginalization Index](https://www.publichealthontario.ca/-/media/Documents/O/2017/on-marg-technical.pdf?la=en&sc_lang=en&hash=EED54DF437EDEDA2DFE1A00A4B14A50A). It includes indicators of the types and density of residential accommodations and certain family structure characteristics, such as living alone and dwelling ownership. ([see OCHPP](https://www.ontariohealthprofiles.ca/canmargCAN.php))\n", "\n", "In this homework you will explore the following question:\n", "\n", "> **Are mental health visits different in Toronto neighbourhoods with higher \"residential instability\"?**" ] }, { "cell_type": "code", "execution_count": 22, "id": "498ebc38", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "markdown", "id": "dd16cb9a", "metadata": {}, "source": [ "## Step 1 - Read the Neighbourhood Instability data into a `pandas` `DataFrame`\n", "\n", "a) The data is stored in `1_marg_neighb_toronto_2006_OnMarg.xls` - a Microsoft Excel file format with file extension `.xls`.\n", "\n", "Use the `pandas` function `read_excel` to read the sheet `Neighbourhood_Toronto_OnMarg` into a `pandas` `DataFrame` named `marg_neighb`. \n" ] }, { "cell_type": "code", "execution_count": 23, "id": "0e70a55c", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Neighb id Neighbourhood name POPULATION INSTABILITY \\\n", "0 1 West Humber-Clairville 32252 -0.6639 \n", "1 2 Mount Olive-Silverstone-Jamestown 32127 -0.1081 \n", "2 3 Thistletown-Beaumond Heights 9928 -0.3131 \n", "3 4 Rexdale-Kipling 10725 0.1866 \n", "4 5 Elms-Old Rexdale 9879 -0.0150 \n", "\n", " INSTABILITY_Q DEPRIVATION DEPRIVATION_Q ETHNICCONCENTRATION \\\n", "0 1 0.1620 3 2.4548 \n", "1 1 1.0195 5 3.7433 \n", "2 1 0.3460 4 1.6220 \n", "3 2 0.4704 4 1.2396 \n", "4 2 0.8040 5 1.9911 \n", "\n", " ETHNICCONCENTRATION_Q DEPENDENCY DEPENDENCY_Q ONMARG_COMBINED_Q \n", "0 5 -0.2021 3 2.4 \n", "1 5 -0.5975 1 2.4 \n", "2 4 0.2845 5 2.8 \n", "3 3 0.2734 5 2.8 \n", "4 4 -0.3527 2 2.6 " ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "marg_neighb = pd.read_excel('1_marg_neighb_toronto_2006_OnMarg.xls',\n", " sheet_name='Neighbourhood_Toronto_OnMarg', header=1)\n", "\n", "marg_neighb.head()" ] }, { "cell_type": "markdown", "id": "a9b80a4d", "metadata": {}, "source": [ "Use `marg_neighb` to create another `DataFrame` called `instability_df` that has three columns: `'Neighb id ', 'Neighbourhood name ', 'INSTABILITY'`." ] }, { "cell_type": "code", "execution_count": 24, "id": "260275b5", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Neighb id Neighbourhood name INSTABILITY\n", "0 1 West Humber-Clairville -0.6639\n", "1 2 Mount Olive-Silverstone-Jamestown -0.1081\n", "2 3 Thistletown-Beaumond Heights -0.3131\n", "3 4 Rexdale-Kipling 0.1866\n", "4 5 Elms-Old Rexdale -0.0150" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "instability_df = marg_neighb[marg_neighb.columns[[0, 1, 3]]]\n", "\n", "instability_df.head()" ] }, { "cell_type": "markdown", "id": "ba21291a", "metadata": {}, "source": [ "b) Rename the column names of `instability_df` using the following table. The DataFrame with the new column names should be called `instability_df` (i.e., don't change the name of the DataFrame).\n", "\n", "Original column name | New column name\n", "----|----\n", "Neighb id | Neighbid\n", "INSTABILITY | INSTABILITY\n", "Neighbourhood name | name\n" ] }, { "cell_type": "code", "execution_count": 25, "id": "911d50e5", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Neighbid name INSTABILITY\n", "0 1 West Humber-Clairville -0.6639\n", "1 2 Mount Olive-Silverstone-Jamestown -0.1081\n", "2 3 Thistletown-Beaumond Heights -0.3131\n", "3 4 Rexdale-Kipling 0.1866\n", "4 5 Elms-Old Rexdale -0.0150" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "colnames = {'Neighb id ': 'Neighbid',\n", " 'INSTABILITY' : 'INSTABILITY',\n", " 'Neighbourhood name ': 'name'}\n", "\n", "instability_df = instability_df.copy()\n", "\n", "instability_df.rename(columns = colnames, inplace=True)\n", "\n", "instability_df.head()" ] }, { "cell_type": "markdown", "id": "cf32e981", "metadata": {}, "source": [ "## Step 2 - Read the mental health visit data into a `pandas` `DataFrame`.\n", "\n", "a) In this step you will read in data on rates of mental health visits stored in `2_ahd_neighb_db_ast_hbp_mhv_copd_2012.xls` into a `pandas` `DataFrame` named `mentalhealth_neighb`.\n" ] }, { "cell_type": "code", "execution_count": 26, "id": "0243bcf1", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Unnamed: 0 Unnamed: 1 Male Female Both sexes \\\n", "0 1 West Humber-Clairville 938 1168 2106 \n", "1 2 Mount Olive-Silverstone-Jamestown 866 1130 1996 \n", "2 3 Thistletown-Beaumond Heights 275 410 685 \n", "3 4 Rexdale-Kipling 328 453 781 \n", "4 5 Elms-Old Rexdale 287 396 683 \n", "\n", " Male.1 Female.1 Both sexes.1 Male.2 Female.2 ... Female.12 \\\n", "0 13915 14046 27961 6.6 8.2 ... 7.2 \n", "1 12256 13082 25338 7.0 8.6 ... 8.5 \n", "2 4124 4453 8577 6.4 9.2 ... 8.1 \n", "3 4130 4470 8600 7.7 10.0 ... 9.0 \n", "4 3787 4028 7815 7.4 9.6 ... 8.1 \n", "\n", " Both sexes.12 Rate Ratio**, Both sexes.4 H/ L/ NS, Both sexes.4 \\\n", "0 7.2 0.85 L \n", "1 8.2 0.96 NS \n", "2 7.9 0.93 NS \n", "3 9.1 1.07 NS \n", "4 7.5 0.88 NS \n", "\n", " (95% CI) LL, Male.4 (95% CI) UL, Male.4 (95% CI) LL, Female.4 \\\n", "0 6.1 8.3 6.2 \n", "1 6.6 9.4 7.2 \n", "2 5.9 9.9 6.4 \n", "3 7.2 11.6 7.3 \n", "4 4.8 9.2 6.2 \n", "\n", " (95% CI) UL, Female.4 (95% CI) LL, Both sexes.4 (95% CI) UL, Both sexes.4 \n", "0 8.3 6.4 8.0 \n", "1 10.0 7.3 9.3 \n", "2 10.0 6.7 9.3 \n", "3 11.0 7.8 10.6 \n", "4 10.5 6.1 9.1 \n", "\n", "[5 rows x 81 columns]" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mentalhealth_neighb = pd.read_excel('2_ahd_neighb_db_ast_hbp_mhv_copd_2012.xls',\n", " sheet_name='2_MentalHealthV_2012', header=11)\n", "mentalhealth_neighb.head()" ] }, { "cell_type": "markdown", "id": "637787f9", "metadata": {}, "source": [ "b) Create a new DataFrame `mhvisitrates` by selecting the columns in `mentalhealth_neighb` that correspond to Neighbourhood ID, Neighbourhood Name, and 'Age-Standardized rate of Mental Health Visits (2012), All Ages 20+'. Rename the last of these columns in `mhvisitrates` to `mhvisitrates_mf`. When you rename this column, don't change the name of the DataFrame `mhvisitrates`." ] }, { "cell_type": "code", "execution_count": 27, "id": "449b31ab", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Neighbid name mhvisitrates_mf\n", "0 1 West Humber-Clairville 7.4\n", "1 2 Mount Olive-Silverstone-Jamestown 7.8\n", "2 3 Thistletown-Beaumond Heights 7.8\n", "3 4 Rexdale-Kipling 8.9\n", "4 5 Elms-Old Rexdale 8.5" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mhvisitrates = mentalhealth_neighb[mentalhealth_neighb.columns[[0, 1, 10]]]\n", "\n", "colnames = {'Unnamed: 0': 'Neighbid',\n", " 'Both sexes.2' : 'mhvisitrates_mf',\n", " 'Unnamed: 1' : 'name'}\n", "\n", "mhvisitrates = mhvisitrates.copy()\n", "\n", "mhvisitrates.rename(columns = colnames, inplace=True)\n", "\n", "mhvisitrates.head()" ] }, { "cell_type": "markdown", "id": "6e4619af", "metadata": {}, "source": [ "## Step 3 - Merge mental health visits and instability\n", "\n", "In this step you will merge `mhvisitrates` with `instability_df`.\n", "\n", "a) Merge `mhvisitrates` with `instability_df` and name this DataFrame `mhvisitinstab`." ] }, { "cell_type": "code", "execution_count": 28, "id": "5debecca", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Neighbid name mhvisitrates_mf INSTABILITY\n", "0 1 West Humber-Clairville 7.4 -0.6639\n", "1 2 Mount Olive-Silverstone-Jamestown 7.8 -0.1081\n", "2 3 Thistletown-Beaumond Heights 7.8 -0.3131\n", "3 4 Rexdale-Kipling 8.9 0.1866\n", "4 5 Elms-Old Rexdale 8.5 -0.0150" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mhvisitinstab = mhvisitrates.merge(instability_df, on = ['Neighbid', 'name'])\n", "mhvisitinstab.head()" ] }, { "cell_type": "markdown", "id": "b748c831", "metadata": {}, "source": [ "## Step 4\n", "\n", "a) Create a new column in `mhvisitinstab` named `instab_HL` that categorizes neighbourhoods. The new column should have two possible values:\n", "\n", "- `'High'`, if the neighbourhood's INSTABILITY value is greater than or equal to the mean\n", "- `'Low'`, if the neighbourhood's INSTABILITY value is less than the mean" ] }, { "cell_type": "code", "execution_count": 29, "id": "bcf2d441", "metadata": {}, "outputs": [], "source": [ "mean_instab = mhvisitinstab['INSTABILITY'].mean()\n", "\n", "mhvisitinstab.loc[mhvisitinstab['INSTABILITY'] >= mean_instab, 'instab_HL'] = 'High'\n", "\n", "mhvisitinstab.loc[mhvisitinstab['INSTABILITY'] < mean_instab, 'instab_HL'] = 'Low'" ] }, { "cell_type": "markdown", "id": "05edf48a", "metadata": {}, "source": [ "b) Compute the frequency distribution of `instab_HL`. Save the results in `instab_HL_frequencies`." ] }, { "cell_type": "code", "execution_count": 30, "id": "12d2561d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "instab_HL\n", "Low 70\n", "High 66\n", "Name: count, dtype: int64" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "instab_HL_frequencies = mhvisitinstab['instab_HL'].value_counts()\n", "\n", "instab_HL_frequencies" ] }, { "cell_type": "markdown", "id": "66a0de88", "metadata": {}, "source": [ "c) Is there evidence that Toronto has many neighbourhoods that have residential instability? Briefly explain. __(1 mark)__" ] }, { "cell_type": "markdown", "id": "10873e1a", "metadata": {}, "source": [ "> There is more than one possible way to answer the question, e.g.,\n", ">\n", "> i. The data shows that about half of the neighbourhoods in Toronto, 66 out of 136, have residential instability measures that are at or above the city-wide mean. It is difficult to give a definite answer to the question based on this result, since it only provides a comparison within the city.\n", ">\n", "> ii. (a somewhat more naive answer) About half of the neighbourhoods in Toronto have high residential instability.\n", ">\n", "> Any answer with sensible reasoning based on the data is acceptable." ] }, { "cell_type": "markdown", "id": "c240f8cb", "metadata": {}, "source": [ "## Step 5 - Do neighbourhoods with high residential instability have more mental health visits compared to neighbourhoods with low residential instability?\n", "\n", "a) Use the `DataFrame` `describe` method to summarize the distribution of `mhvisitrates_mf` in `mhvisitinstab` **grouped by** `instab_HL`. Store the results in `median_table`." ] }, { "cell_type": "code", "execution_count": 31, "id": "26535b3d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " count mean std min 25% 50% 75% max\n", "instab_HL \n", "High 66.0 8.362121 0.937388 5.9 7.825 8.45 9.075 10.5\n", "Low 70.0 7.867143 1.000539 5.7 7.025 7.80 8.775 9.9" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "median_table = mhvisitinstab.groupby('instab_HL')['mhvisitrates_mf'].describe()\n", "\n", "median_table" ] }, { "cell_type": "markdown", "id": "2cd0db75", "metadata": {}, "source": [ "Use `median_table` to compute the difference in medians between neighbourhoods with high and low instability. Store this value in `median_diff`. " ] }, { "cell_type": "code", "execution_count": 32, "id": "978a9570", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.6499999999999995" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# select the medians by group label rather than by position\n", "median_diff = median_table.loc['High', '50%'] - median_table.loc['Low', '50%']\n", "\n", "median_diff" ] }, { "cell_type": "markdown", "id": "5da0b797", "metadata": {}, "source": [ "## Step 6 - Set up a simulation in Python to test if the medians are equal\n", "\n", "a) In this step you will write a function `random_shuffle_median` that returns a simulated value of the median difference (a simulated value of the test statistic) of mental health visit rates in neighbourhoods with high versus low residential instability, assuming that there really is no difference in mental health visit rates between these types of neighbourhoods.\n", "\n", "A step-by-step explanation of a similar function was given in lecture, and you can follow this example to help guide you through this step.\n", "\n", "The function `random_shuffle_median` is started for you below. Your task is to complete the function by filling in the `...`.\n", "\n", "Try writing a meaningful docstring for `random_shuffle_median`. The `pandas` [docstring guide](https://pandas.pydata.org/docs/development/contributing_docstring.html) has some great examples and guidelines. 
(NB: this will not be graded)" ] }, { "cell_type": "code", "execution_count": 33, "id": "81c53baa", "metadata": {}, "outputs": [], "source": [ "# def random_shuffle_median():\n", "# \"\"\"\n", "# Put your docstring here (optional)\n", "# \"\"\"\n", "\n", "# # shuffle the column of mhvisitinstab that corresponds to high/low instability\n", "\n", "# instab_HL_shuffle = mhvisitinstab[...].sample(frac=1, replace=False).reset_index(drop = True)\n", " \n", "# # calculate the median visit rate for high and low instability neighbourhoods\n", "\n", "# visitrate_low_shuffle = mhvisitinstab.loc[instab_HL_shuffle == ..., ...].median()\n", " \n", "# visitrate_high_shuffle = mhvisitinstab.loc[instab_HL_shuffle == ..., ...].median()\n", " \n", "# shuffled_diff = ... - ...\n", " \n", "# return shuffled_diff\n" ] }, { "cell_type": "code", "execution_count": 34, "id": "20b977f6", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.0" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def random_shuffle_median():\n", "    \"\"\"\n", "    Randomly shuffle the high/low instability labels in mhvisitinstab and\n", "    return the difference in median visit rates (high minus low) between\n", "    the shuffled groups.\n", "    \"\"\"\n", "\n", "    # shuffle the labels; resetting the index lines them up with the original rows by position\n", "    instab_HL_shuffle = mhvisitinstab['instab_HL'].sample(frac=1, replace=False).reset_index(drop = True)\n", "    # median visit rate within each shuffled group\n", "    visitrate_low_shuffle = mhvisitinstab.loc[instab_HL_shuffle == 'Low', 'mhvisitrates_mf'].median()\n", "    visitrate_high_shuffle = mhvisitinstab.loc[instab_HL_shuffle == 'High', 'mhvisitrates_mf'].median()\n", "    # simulated test statistic: high minus low\n", "    shuffled_diff = visitrate_high_shuffle - visitrate_low_shuffle\n", "    return shuffled_diff\n", "\n", "random_shuffle_median()\n" ] }, { "cell_type": "code", "execution_count": 35, "id": "4b22ef64", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "    Randomly shuffle the high/low instability labels in mhvisitinstab and\n", "    return the difference in median visit rates (high minus low) between\n", "    the shuffled groups.\n", "    \n" ] } ], "source": [ "print(random_shuffle_median.__doc__)" ] }, { "cell_type": "markdown", "id": "2cd897a2", "metadata": {}, "source": [ "b) Explain the purpose of \n", "\n", "```python\n", "mhvisitinstab[...].sample(frac=1, replace=False).reset_index(drop = True)\n", "```\n", "\n", "in 1-2 sentences." ] }, { "cell_type": "markdown", "id": "1727b2fb", "metadata": {}, "source": [ "> The line randomly samples 100% (`frac=1`) of the selected column (`'instab_HL'`) without replacement (`replace=False`), which shuffles the labels, and then resets the index (`reset_index(drop=True)`) so that the shuffled labels line up with the original rows by position. It mimics random assignment of the high vs. low labels to the neighbourhoods." ] }, { "cell_type": "markdown", "id": "3b012943", "metadata": {}, "source": [ "## Step 7 - Compute the distribution of simulated values of the median difference assuming the null hypothesis is true\n", "\n" ] }, { "cell_type": "markdown", "id": "976af74e", "metadata": {}, "source": [ "We will use your student number to generate data for this homework. Complete the assignment statement below by typing your student number as an `int`. In other words, assign your student number as an integer to the variable `student_number`.\n" ] }, { "cell_type": "code", "execution_count": 36, "id": "f50aed80", "metadata": {}, "outputs": [], "source": [ "# # Replace the ... with your student number\n", "student_number = 12345\n", "\n", "# # This checks that you correctly typed in your student_number as an int.\n", "# # Make sure there's no error when you run this cell!\n", "# assert type(student_number) == int" ] }, { "cell_type": "markdown", "id": "32fb4d54", "metadata": {}, "source": [ "a) Write a function called `shuffled_diffs` that returns a list. The function should use a `for` loop that calls the function `random_shuffle_median` an arbitrary number of times. 
The number of times that the `for` loop iterates should be controlled by a function parameter named `number_of_shuffles`. " ] }, { "cell_type": "code", "execution_count": 37, "id": "9c843e7c", "metadata": {}, "outputs": [], "source": [ "def shuffled_diffs(number_of_shuffles):\n", "    # collect one simulated median difference per shuffle\n", "    diffs = []\n", "    for _ in range(number_of_shuffles):\n", "        diffs.append(random_shuffle_median())\n", "    return diffs" ] }, { "cell_type": "markdown", "id": "f06d8325", "metadata": {}, "source": [ "b) Use `shuffled_diffs` to compute 10000 simulated median differences between high and low instability neighbourhoods assuming that there is no difference in median mental health visit rates between high and low instability neighbourhoods. Store the values in `shuffled_diffs_10000`." ] }, { "cell_type": "code", "execution_count": 38, "id": "2edcc0f2", "metadata": {}, "outputs": [], "source": [ "np.random.seed(student_number)\n", "\n", "shuffled_diffs_10000 = shuffled_diffs(10000)" ] }, { "cell_type": "markdown", "id": "c2ec390e", "metadata": {}, "source": [ "c) Plot the distribution of the 10,000 simulated values stored in `shuffled_diffs_10000` using a `matplotlib` histogram. Name the plot `nullhypothesis_distribution_plot`. Label the horizontal axis as `'Difference in median visit rates for high and low instability neighbourhoods'` and the vertical axis as `'Frequency'`." ] }, { "cell_type": "code", "execution_count": 39, "id": "add94761", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Text(0, 0.5, 'Frequency')" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import matplotlib.pyplot as plt\n", "\n", "nullhypothesis_distribution_plot = plt.hist(shuffled_diffs_10000)\n", "\n", "plt.xlabel('Difference in median visit rates for high and low instability neighbourhoods')\n", "\n", "plt.ylabel('Frequency')" ] }, { "cell_type": "markdown", "id": "e699095a", "metadata": {}, "source": [ "## Step 8 - Compute the p-value\n", "\n", "a) Compute the number of simulated differences in medians in `shuffled_diffs_10000` that are greater than or equal to the observed median difference (`median_diff`). Store this value in `rightextreme`." ] }, { "cell_type": "code", "execution_count": 40, "id": "1a34886f", "metadata": {}, "outputs": [], "source": [ "# convert the list to an array so the comparison is element-wise\n", "rightextreme = np.array(shuffled_diffs_10000) >= median_diff\n", "\n", "rightextreme = rightextreme.sum()" ] }, { "cell_type": "markdown", "id": "8391adaf", "metadata": {}, "source": [ "b) Compute the number of simulated differences in medians in `shuffled_diffs_10000` that are less than the negative of the observed median difference (`-median_diff`). Store this value in `leftextreme`." ] }, { "cell_type": "code", "execution_count": 41, "id": "50e101da", "metadata": {}, "outputs": [], "source": [ "leftextreme = np.array(shuffled_diffs_10000) < -median_diff\n", "\n", "leftextreme = leftextreme.sum()" ] }, { "cell_type": "markdown", "id": "117f46b6", "metadata": {}, "source": [ "c) Use `rightextreme` and `leftextreme` to compute the p-value. Store the p-value in `pvalue`." ] }, { "cell_type": "code", "execution_count": 42, "id": "1cebcf55", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.0064" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pvalue = (leftextreme + rightextreme) / 10000\n", "\n", "pvalue" ] }, { "cell_type": "markdown", "id": "b40e7f61", "metadata": {}, "source": [ "## Step 9 - Communicate what you did in the steps above\n", "\n", "\n", "a) In a few sentences introduce the question that you explored in this homework (see the [beginning](#gg274-homework-8-hypothesis-testing) of this homework). For example, why do you or others think this is an important question? (__1 mark__)" ] }, { "cell_type": "markdown", "id": "904406ca", "metadata": {}, "source": [ "> We investigated whether residents of neighbourhoods with relatively unstable residential statuses visit mental health services more or less frequently than residents of neighbourhoods with relatively stable residential statuses. The analysis may tell us whether marginalized neighbourhoods need more support for their mental health." ] }, { "cell_type": "markdown", "id": "dd1f4fd1", "metadata": {}, "source": [ "b) Briefly describe the data sources that you used to answer the question. Which statistical variables did you use and why did you use these variables? (__1 mark__)" ] }, { "cell_type": "markdown", "id": "331f2bb5", "metadata": {}, "source": [ "> We used the residential instability measure from the Ontario Marginalization Index and the age-standardized rates of mental health visits among those aged 20 and over. We computed the mean of the instability measure to dichotomize the neighbourhoods into those whose measures are at or above the mean and those whose measures are below the mean. This allowed us to label the neighbourhoods as one of two groups - high vs. low residential instability. \n", ">\n", "> We then computed the difference in the median mental health visit rates between the two groups. 
The medians provided the \"typical\" visit rates (the centres of the distributions) of the two groups." ] }, { "cell_type": "markdown", "id": "1a773e29", "metadata": {}, "source": [ "c) What computational and statistical methods or analyses did you use to answer the question? Briefly describe these methods and how they were used to answer the question. (__1 mark__)" ] }, { "cell_type": "markdown", "id": "4d0390e8", "metadata": {}, "source": [ "> We simulated the distribution of the difference in median mental health visit rates under the assumption that there is no difference in the median rates between the two groups, by repeatedly shuffling the high/low labels at random (a permutation test). We then located the observed difference in the simulated distribution to assess how likely it would be to observe a value at least as extreme when there is actually no difference. When the location of the value in the distribution indicates that it is unlikely, we can reasonably conclude that the median mental health visit rates are different between the two groups.\n", ">\n", "> Note: Full marks if the data and method used are reasonably well explained between part b) and c)." ] }, { "cell_type": "markdown", "id": "a6bc5e37", "metadata": {}, "source": [ "d) Briefly describe the results of your statistical analysis in a few sentences. (__1 mark__)" ] }, { "cell_type": "markdown", "id": "2affe526", "metadata": {}, "source": [ "> After locating the observed difference in the simulated distribution, the resulting p-value is 0.0064, which is very small. It is reasonable to reject the hypothesis that there is no difference between the two groups and conclude that the mental health visit rates are different between high and low residential instability neighbourhoods. Specifically, the visit rates are higher in neighbourhoods with higher residential instability."
] }, { "cell_type": "markdown", "id": "edd25c8b", "metadata": {}, "source": [ "e) What conclusions can you draw about the question you set out to answer that are supported by the data and statistical analysis of the data? State at least one limitation of your conclusions. (See the [USC Research Guide section on study limitations](https://libguides.usc.edu/writingguide/limitations)) (__1 mark__)" ] }, { "cell_type": "markdown", "id": "8d121d9a", "metadata": {}, "source": [ "> The data shows that the typical mental health visit rate is higher in Toronto neighbourhoods with relatively unstable residential statuses. Our statistical analysis provides reasonably strong evidence that there is a systematic difference and that the observed difference is unlikely to be due to chance alone. \n", ">\n", "> Limitations may include...\n", ">\n", "> - the mental health visit rate data are from 2012 whereas the Ontario Marginalization Index is from 2006; a significant change in the neighbourhood composition may have altered the residential stability.\n", "> - the mental health visit rates are based on OHIP only (this requires some digging into the data source: https://www.ontariohealthprofiles.ca/o_documents/aboutTheDataON/1_AboutTheData_AdultHealthDisease.pdf); other types of mental health visits may be relevant.\n", "> - ...\n", ">\n", "> Any reasonable limitation identified is acceptable." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.13" }, "vscode": { "interpreter": { "hash": "440dd12f919b48e435ef15d7652bb5c9f2f802a3e9de582e9da805c841a6f459" } } }, "nbformat": 4, "nbformat_minor": 5 }