{ "cells": [ { "cell_type": "markdown", "id": "18a303b0-cf44-4c68-9d49-ec524a677a0b", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Class 7\n", "\n", "## Mar. 4, 2025\n", "\n", "_Where we left off last week_\n", "\n", "- understand the role of empirical distributions (sample and statistic) and simulation\n", "\n", "_New topics for today_\n", "\n", "- Models\n", "\n", "- Comparing two samples\n", "\n", "- Causality\n" ] }, { "cell_type": "markdown", "id": "ace25f10", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Sampling from a Population\n", "\n", "The law of averages also holds when the random sample is drawn from individuals in a large population.\n", "\n", "```\n", " persdur Duration - Personal activities\n", "\n", " VALUE LABEL\n", " 0 No time spent doing this activity\n", " 9996 Valid skip\n", " 9997 Don't know\n", " 9998 Refusal\n", " 9999 Not stated\n", "\n", "\n", " luc_rst Population centre indicator\n", "\n", " VALUE LABEL\n", " 1 Larger urban population centres (CMA/CA)\n", " 2 Rural areas and small population centres (non CMA/CA)\n", " 3 Prince Edward Island\n", " 6 Valid skip\n", " 7 Don't know\n", " 8 Refusal\n", " 9 Not stated\n", "```" ] }, { "cell_type": "code", "execution_count": 1, "id": "6391b618", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(17390, 3)\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/var/folders/0j/ybsv4ncn5w50v40vdh5jjlww0000gn/T/ipykernel_4201/752036752.py:1: DeprecationWarning: \n", "Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),\n", "(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)\n", "but was not found to be installed on your system.\n", "If this would cause problems for you,\n", "please provide us feedback at 
https://github.com/pandas-dev/pandas/issues/54466\n", " \n", " import pandas as pd\n" ] }, { "data": { "text/html": [ "
<table>\n", " <thead>\n", " <tr><th></th><th>CASEID</th><th>persdur</th><th>luc_rst</th></tr>\n", " </thead>\n", " <tbody>\n",
" <tr><th>0</th><td>10000</td><td>30</td><td>1</td></tr>\n",
" <tr><th>1</th><td>10001</td><td>30</td><td>1</td></tr>\n",
" <tr><th>2</th><td>10002</td><td>30</td><td>1</td></tr>\n",
" <tr><th>3</th><td>10003</td><td>75</td><td>1</td></tr>\n",
" <tr><th>4</th><td>10004</td><td>15</td><td>1</td></tr>\n",
" </tbody>\n", "</table>
" ], "text/plain": [ " CASEID persdur luc_rst\n", "0 10000 30 1\n", "1 10001 30 1\n", "2 10002 30 1\n", "3 10003 75 1\n", "4 10004 15 1" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "timeuse = pd.read_csv(\"gss_tu2016_main_file.csv\")\n", "\n", "important_cols = [\"CASEID\", \"persdur\", \"luc_rst\"]\n", "\n", "timeuse_subset = timeuse[important_cols]\n", "print(timeuse_subset.shape)\n", "timeuse_subset.head()" ] }, { "cell_type": "markdown", "id": "a2a5cf5e", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Transform time spent on personal activities from minutes to hours, and add this new column called `\"persdur_hour\"` to `timeuse_subset`." ] }, { "cell_type": "code", "execution_count": 2, "id": "9737dba9", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n", " <thead>\n", " <tr><th></th><th>CASEID</th><th>persdur</th><th>luc_rst</th><th>persdur_hour</th></tr>\n", " </thead>\n", " <tbody>\n",
" <tr><th>0</th><td>10000</td><td>30</td><td>1</td><td>0.50</td></tr>\n",
" <tr><th>1</th><td>10001</td><td>30</td><td>1</td><td>0.50</td></tr>\n",
" <tr><th>2</th><td>10002</td><td>30</td><td>1</td><td>0.50</td></tr>\n",
" <tr><th>3</th><td>10003</td><td>75</td><td>1</td><td>1.25</td></tr>\n",
" <tr><th>4</th><td>10004</td><td>15</td><td>1</td><td>0.25</td></tr>\n",
" </tbody>\n", "</table>
" ], "text/plain": [ " CASEID persdur luc_rst persdur_hour\n", "0 10000 30 1 0.50\n", "1 10001 30 1 0.50\n", "2 10002 30 1 0.50\n", "3 10003 75 1 1.25\n", "4 10004 15 1 0.25" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "timeuse_subset = timeuse_subset.copy()\n", "timeuse_subset[\"persdur_hour\"] = (timeuse_subset[\"persdur\"] / 60)\n", "timeuse_subset.head()" ] }, { "cell_type": "markdown", "id": "8d13d7c7", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Let's zoom in on respondents that had a personal activities time between 0.25 and 3 hours." ] }, { "cell_type": "code", "execution_count": 3, "id": "2b204790", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n", " <thead>\n", " <tr><th></th><th>CASEID</th><th>persdur</th><th>luc_rst</th><th>persdur_hour</th></tr>\n", " </thead>\n", " <tbody>\n",
" <tr><th>count</th><td>13004.000000</td><td>13004.000000</td><td>13004.000000</td><td>13004.000000</td></tr>\n",
" <tr><th>mean</th><td>18669.473624</td><td>55.479314</td><td>1.248693</td><td>0.924655</td></tr>\n",
" <tr><th>std</th><td>5032.139991</td><td>33.644386</td><td>0.494974</td><td>0.560740</td></tr>\n",
" <tr><th>min</th><td>10000.000000</td><td>15.000000</td><td>1.000000</td><td>0.250000</td></tr>\n",
" <tr><th>25%</th><td>14294.750000</td><td>30.000000</td><td>1.000000</td><td>0.500000</td></tr>\n",
" <tr><th>50%</th><td>18675.500000</td><td>45.000000</td><td>1.000000</td><td>0.750000</td></tr>\n",
" <tr><th>75%</th><td>23010.500000</td><td>70.000000</td><td>1.000000</td><td>1.166667</td></tr>\n",
" <tr><th>max</th><td>27389.000000</td><td>180.000000</td><td>3.000000</td><td>3.000000</td></tr>\n",
" </tbody>\n", "</table>
" ], "text/plain": [ " CASEID persdur luc_rst persdur_hour\n", "count 13004.000000 13004.000000 13004.000000 13004.000000\n", "mean 18669.473624 55.479314 1.248693 0.924655\n", "std 5032.139991 33.644386 0.494974 0.560740\n", "min 10000.000000 15.000000 1.000000 0.250000\n", "25% 14294.750000 30.000000 1.000000 0.500000\n", "50% 18675.500000 45.000000 1.000000 0.750000\n", "75% 23010.500000 70.000000 1.000000 1.166667\n", "max 27389.000000 180.000000 3.000000 3.000000" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "zoom = (timeuse_subset[\"persdur_hour\"] >= 0.25) & (timeuse_subset[\"persdur_hour\"] <= 3)\n", "\n", "timeuse_subset.loc[zoom].describe()" ] }, { "cell_type": "markdown", "id": "6d6fcf61", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Empirical Distribution of the Sample\n", "\n", "- Let's think of the 13,004 respondent times (hours) as a population, and draw random samples from it with replacement. \n", "\n", "- Below is a histogram of the distribution." ] }, { "cell_type": "code", "execution_count": 4, "id": "9664412c", "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "persdur_zoom = timeuse_subset.loc[zoom, \"persdur_hour\"]\n", "\n", "persdur_zoom.plot.hist(bins=10, edgecolor=\"white\", color=\"darkgrey\");" ] }, { "cell_type": "markdown", "id": "c0229407", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "A random sample of 100 ..." ] }, { "cell_type": "code", "execution_count": 5, "id": "dcd4298d", "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "persdur_zoom.sample(\n", " 100, \n", " replace=True\n", " ).plot.hist(bins=10, edgecolor=\"white\", color=\"darkgrey\");" ] }, { "cell_type": "markdown", "id": "7ab338c2", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- The values that occur with the least frequency are less likely to occur in small random samples.\n", "- As the size of the random sample increases the sample will resemble the population, with high probability." ] }, { "cell_type": "markdown", "id": "23dbe931", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Interactive exploration of empirical distribution of a sample" ] }, { "cell_type": "code", "execution_count": null, "id": "b401d770", "metadata": {}, "outputs": [], "source": [ "from ipywidgets import interact\n", "import ipywidgets as widgets\n", "\n", "def emp_hist_plot(n):\n", " persdur_zoom.sample(n, replace = True).plot.hist(bins=10, edgecolor=\"white\", color=\"darkgrey\");\n", "\n", "interact(emp_hist_plot, n = widgets.IntSlider(min = 10, max= 500, step=50, value=10));" ] }, { "cell_type": "markdown", "id": "786fc970", "metadata": {}, "source": [ "The histogram of the \"population\"." 
] }, { "cell_type": "code", "execution_count": 6, "id": "a467477a", "metadata": {}, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAkQAAAGdCAYAAADzOWwgAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjguMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8g+/7EAAAACXBIWXMAAA9hAAAPYQGoP6dpAAAnoklEQVR4nO3de3BUZZ7/8U+HpBMu6YSLSaAIkBVFIrcFB9LehptEyVogWCsjQhQcBza4hIyg1FowwuxGcQBxRHFXIbjKIuwMzgjDJQaIoyQggchFBx1lDbO5MYtJQ0aSkJzfH/7SZZOASdvJ6fC8X1VdZZ9++uTbp9rwru7THYdlWZYAAAAMFmL3AAAAAHYjiAAAgPEIIgAAYDyCCAAAGI8gAgAAxiOIAACA8QgiAABgPIIIAAAYL9TuAdqD+vp6FRcXKzIyUg6Hw+5xAABAM1iWpfPnz6tXr14KCbn6a0AEUTMUFxcrPj7e7jEAAIAfzpw5o969e191DUHUDJGRkZK+PaAul8vmaQAAQHN4PB7Fx8d7/x2/GoKoGRreJnO5XAQRAADtTHNOd+GkagAAYDyCCAAAGI8gAgAAxiOIAACA8QgiAABgPIIIAAAYjyACAADGI4gAAIDxCCIAAGA8gggAABiPIAIAAMYjiAAAgPEIIgAAYDyCCAAAGI8gCgKWZdk9Qou1x5kBALiSULsHgORwOJSfny+Px2P3KM3icrmUlJRk9xgAAAQMQRQkPB6PKioq7B4DAAAj8ZYZAAAwHkEEAACMRxABAADjEUQAAMB4BBEAADAeQQQAAIxHEAEAAOMRRAAAwHgEEQAAMB5BBAAAjEcQAQAA4xFEAADAeAQRAAAwHkEEAACMRxABAADjEUQAAMB4BBEAADAeQQQAAIxHEAEAAOMRRAAAwHgEEQAAMB5BBAAAjEcQAQAA4xFEAADAeAQRAAAwHkEEAACMRxABAADjEUQAAMB4BBEAADBe0ATRs88+K4fDofT0dO+2ixcvKi0tTd27d1eXLl00depUlZWV+dyvqKhIKSkp6tSpk2JiYrRw4UJdunTJZ83+/fs1fPhwhYeHq3///srKymqDRwQAANqLoAiijz76SK+++qqGDBnis33BggV69913tXXrVuXm5qq4uFhTpkzx3l5XV6eUlBTV1NTowIED2rhxo7KysrRkyRLvmtOnTyslJUVjxoxRYWGh0tPT9eijj2r37t1t9vgAAEBwsz2ILly4oOnTp+s//uM/1LVrV+/2yspKvf7661q1apXGjh2rESNGaMOGDTpw4IDy8/MlSXv27NEnn3yiN998U8OGDdM999yj5cuXa+3ataqpqZEkrVu3TgkJCVq5cqUGDhyoefPm6f7779fq1attebwAACD42B5EaWlpSklJ0fjx4322FxQUqLa21mf7TTfdpD59+igvL0+SlJeXp8GDBys2Nta7Jjk5WR6PRydPnvSuuXzfycnJ3n00pbq6Wh6Px+cCAACuXaF2/vDNmzfryJEj+uijjxrdVlpaKqfTqejoaJ/tsbGxKi0t9a75bgw13N5w29XWeDweffPNN+rYsWOjn52ZmalnnnnG78cFAADaF9teITpz5ozmz5+vt956SxEREXaN0aTFixersrLSezlz5ozdIwEAgFZkWxAVFBSovLxcw4cPV2hoqEJDQ5Wbm6sXX3xRoaGhio2NVU1NjSoqKnzuV1ZWpri4OElSXFxco0+dNVz/vjUul6vJV4ckKTw8XC6Xy+cCAACuXbYF0bhx43T8+HEVFhZ6L7fccoumT5/u/e+wsDDl5OR473Pq1CkVFRXJ7XZLktxut44fP67y8nLvmuzsbLlcLiUmJnrXfHcfDWsa9gEAAGDbOUSRkZEaNGiQz
7bOnTure/fu3u2zZ89WRkaGunXrJpfLpccff1xut1tJSUmSpAkTJigxMVEzZszQihUrVFpaqqefflppaWkKDw+XJM2ZM0cvvfSSFi1apFmzZmnv3r3asmWLduzY0bYPGAAABC1bT6r+PqtXr1ZISIimTp2q6upqJScn6+WXX/be3qFDB23fvl1z586V2+1W586dlZqaqmXLlnnXJCQkaMeOHVqwYIHWrFmj3r1767XXXlNycrIdDwkAAAQhh2VZlt1DBDuPx6OoqChVVla22vlEe/bsaXS+VLCKjo7WhAkT7B4DAICrasm/37Z/DxEAAIDdCCIAAGA8gggAABiPIAIAAMYjiAAAgPEIIgAAYDyCCAAAGI8gAgAAxiOIAACA8QgiAABgPIIIAAAYjyACAADGI4gAAIDxCCIAAGA8gggAABiPIAIAAMYjiAAAgPEIIgAAYDyCCAAAGI8gAgAAxiOIAACA8QgiAABgPIIIAAAYjyACAADGI4gAAIDxCCIAAGA8gggAABiPIAIAAMYjiAAAgPEIIgAAYDyCCAAAGI8gAgAAxiOIAACA8QgiAABgPIIIAAAYjyACAADGI4gAAIDxCCIAAGA8gggAABiPIAIAAMYjiAAAgPEIIgAAYDyCCAAAGI8gAgAAxiOIAACA8QgiAABgPIIIAAAYjyACAADGI4gAAIDxCCIAAGA8gggAABiPIAIAAMYjiAAAgPEIIgAAYDyCCAAAGI8gAgAAxiOIAACA8QgiAABgPIIIAAAYjyACAADGI4gAAIDxCCIAAGA8gggAABiPIAIAAMYjiAAAgPEIIgAAYDyCCAAAGI8gAgAAxrM1iF555RUNGTJELpdLLpdLbrdbO3fu9N5+8eJFpaWlqXv37urSpYumTp2qsrIyn30UFRUpJSVFnTp1UkxMjBYuXKhLly75rNm/f7+GDx+u8PBw9e/fX1lZWW3x8AAAQDthaxD17t1bzz77rAoKCnT48GGNHTtWkyZN0smTJyVJCxYs0LvvvqutW7cqNzdXxcXFmjJlivf+dXV1SklJUU1NjQ4cOKCNGzcqKytLS5Ys8a45ffq0UlJSNGbMGBUWFio9PV2PPvqodu/e3eaPFwAABCeHZVmW3UN8V7du3fT888/r/vvv13XXXadNmzbp/vvvlyT96U9/0sCBA5WXl6ekpCTt3LlT//AP/6Di4mLFxsZKktatW6cnn3xSZ8+eldPp1JNPPqkdO3boxIkT3p8xbdo0VVRUaNeuXc2ayePxKCoqSpWVlXK5XIF/0JL27NmjioqKVtl3oEVHR2vChAl2jwEAwFW15N/voDmHqK6uTps3b1ZVVZXcbrcKCgpUW1ur8ePHe9fcdNNN6tOnj/Ly8iRJeXl5Gjx4sDeGJCk5OVkej8f7KlNeXp7PPhrWNOyjKdXV1fJ4PD4XAABw7bI9iI4fP64uXbooPDxcc+bM0bZt25SYmKjS0lI5nU5FR0f7rI+NjVVpaakkqbS01CeGGm5vuO1qazwej7755psmZ8rMzFRUVJT3Eh8fH4iHCgAAgpTtQTRgwAAVFhbq4MGDmjt3rlJTU/XJJ5/YOtPixYtVWVnpvZw5c8bWeQAAQOsKtXsAp9Op/v37S5JGjBihjz76SGvWrNEDDzygmpoaVVRU+LxKVFZWpri4OElSXFycDh065LO/hk+hfXfN5Z9MKysrk8vlUseOHZucKTw8XOHh4QF5fAAAIPjZ/grR5err61VdXa0RI0YoLCxMOTk53ttOnTqloqIiud1uSZLb7dbx48dVXl7uXZOdnS2Xy6XExETvmu/uo2FNwz4AAABsfYVo8eLFuueee9SnTx+dP39emzZt0v79+7V7925FRUVp9uzZysjIULdu3eRyufT444/L7XYrKSlJkjRhwgQlJiZqxowZWrFihUpLS/X0008rLS3N+wrPnDlz9NJLL2nRokWaNWuW9u7dqy1btmjHjh12PnQAABBEbA2i8vJyzZw5UyUlJYqKi
tKQIUO0e/du3XXXXZKk1atXKyQkRFOnTlV1dbWSk5P18ssve+/foUMHbd++XXPnzpXb7Vbnzp2VmpqqZcuWedckJCRox44dWrBggdasWaPevXvrtddeU3Jycps/XgAAEJyC7nuIghHfQ+SL7yECALQH7fJ7iAAAAOxCEAEAAOMRRAAAwHgEEQAAMB5BBAAAjEcQAQAA4xFEAADAeAQRAAAwHkEEAACMRxABAADjEUQAAMB4BBEAADAeQQQAAIznVxB9+eWXgZ4DAADANn4FUf/+/TVmzBi9+eabunjxYqBnAgAAaFN+BdGRI0c0ZMgQZWRkKC4uTj/72c906NChQM8GAADQJvwKomHDhmnNmjUqLi7W+vXrVVJSottvv12DBg3SqlWrdPbs2UDPCQAA0Gp+0EnVoaGhmjJlirZu3arnnntOf/7zn/XEE08oPj5eM2fOVElJSaDmBAAAaDU/KIgOHz6sf/qnf1LPnj21atUqPfHEE/riiy+UnZ2t4uJiTZo0KVBzAgAAtJpQf+60atUqbdiwQadOndLEiRP1xhtvaOLEiQoJ+bavEhISlJWVpX79+gVyVgAAgFbhVxC98sormjVrlh5++GH17NmzyTUxMTF6/fXXf9BwAAAAbcGvIPr888+/d43T6VRqaqo/uwcAAGhTfp1DtGHDBm3durXR9q1bt2rjxo0/eCgAAIC25FcQZWZmqkePHo22x8TE6N/+7d9+8FAAAABtya8gKioqUkJCQqPtffv2VVFR0Q8eCmgNlmXZPUKLtceZAaA98uscopiYGB07dqzRp8g+/vhjde/ePRBzAQHncDiUn58vj8dj9yjN4nK5lJSUZPcYAGAEv4LoJz/5if75n/9ZkZGRuvPOOyVJubm5mj9/vqZNmxbQAYFA8ng8qqiosHsMAECQ8SuIli9frv/5n//RuHHjFBr67S7q6+s1c+ZMziECAADtjl9B5HQ69fbbb2v58uX6+OOP1bFjRw0ePFh9+/YN9HwAAACtzq8ganDjjTfqxhtvDNQsAAAAtvAriOrq6pSVlaWcnByVl5ervr7e5/a9e/cGZDgAAIC24FcQzZ8/X1lZWUpJSdGgQYPkcDgCPRcAAECb8SuINm/erC1btmjixImBngcAAKDN+fXFjE6nU/379w/0LAAAALbwK4h+/vOfa82aNXyLLgAAuCb49ZbZBx98oH379mnnzp26+eabFRYW5nP7b3/724AMBwAA0Bb8CqLo6Gjdd999gZ4FAADAFn4F0YYNGwI9BwAAgG38OodIki5duqT33ntPr776qs6fPy9JKi4u1oULFwI2HAAAQFvw6xWir776SnfffbeKiopUXV2tu+66S5GRkXruuedUXV2tdevWBXpOAACAVuPXK0Tz58/XLbfcoq+//lodO3b0br/vvvuUk5MTsOEAAADagl+vEP3xj3/UgQMH5HQ6fbb369dP//u//xuQwQAAANqKX68Q1dfXq66urtH2v/zlL4qMjPzBQwEAALQlv4JowoQJeuGFF7zXHQ6HLly4oKVLl/LnPAAAQLvj11tmK1euVHJyshITE3Xx4kU9+OCD+vzzz9WjRw/913/9V6BnBAAAaFV+BVHv3r318ccfa/PmzTp27JguXLig2bNna/r06T4nWQMAALQHfgWRJIWGhuqhhx4K5CwAAAC28CuI3njjjavePnPmTL+GAQAAsINfQTR//nyf67W1tfrb3/4mp9OpTp06EUQAAKBd8etTZl9//bXP5cKFCzp16pRuv/12TqoGAADtjt9/y+xyN9xwg5599tlGrx4BAAAEu4AFkfTtidbFxcWB3CUAAECr8+scot///vc+1y3LUklJiV566SXddtttARkMAACgrfgVRJMnT/a57nA4dN1112ns2LFauXJlIOYCAABoM34FUX19faDnAAAAsE1AzyECAABoj/x6hSgjI6PZa1etWuXPjwAAAGgzfgXR0aNHdfToUdXW1mrAgAGSpM8++0wdO
nTQ8OHDvescDkdgpgQAAGhFfgXRvffeq8jISG3cuFFdu3aV9O2XNT7yyCO644479POf/zygQwIAALQmv84hWrlypTIzM70xJEldu3bVL3/5Sz5lBgAA2h2/gsjj8ejs2bONtp89e1bnz5//wUMhuEVERMiyLLvHAAAgYPx6y+y+++7TI488opUrV2rkyJGSpIMHD2rhwoWaMmVKQAdE8AkLC5PD4VB+fr48Ho/d4zRLXFychgwZYvcYAIAg5VcQrVu3Tk888YQefPBB1dbWfruj0FDNnj1bzz//fEAHRPDyeDyqqKiwe4xmiYyMtHsEAEAQ8yuIOnXqpJdfflnPP/+8vvjiC0nS9ddfr86dOwd0OAAAgLbwg76YsaSkRCUlJbrhhhvUuXNnzisBAADtkl9B9H//938aN26cbrzxRk2cOFElJSWSpNmzZ/ORewAA0O74FUQLFixQWFiYioqK1KlTJ+/2Bx54QLt27QrYcAAAAG3Br3OI9uzZo927d6t3794+22+44QZ99dVXARkMAACgrfj1ClFVVZXPK0MNzp07p/Dw8GbvJzMzUz/60Y8UGRmpmJgYTZ48WadOnfJZc/HiRaWlpal79+7q0qWLpk6dqrKyMp81RUVFSklJUadOnRQTE6OFCxfq0qVLPmv279+v4cOHKzw8XP3791dWVlbzHzAAALim+RVEd9xxh9544w3vdYfDofr6eq1YsUJjxoxp9n5yc3OVlpam/Px8ZWdnq7a2VhMmTFBVVZV3zYIFC/Tuu+9q69atys3NVXFxsc93HdXV1SklJUU1NTU6cOCANm7cqKysLC1ZssS75vTp00pJSdGYMWNUWFio9PR0Pfroo9q9e7c/Dx8AAFxj/HrLbMWKFRo3bpwOHz6smpoaLVq0SCdPntS5c+f04YcfNns/l59vlJWVpZiYGBUUFOjOO+9UZWWlXn/9dW3atEljx46VJG3YsEEDBw5Ufn6+kpKStGfPHn3yySd67733FBsbq2HDhmn58uV68skn9Ytf/EJOp1Pr1q1TQkKC98+KDBw4UB988IFWr16t5ORkfw4BAAC4hvj1CtGgQYP02Wef6fbbb9ekSZNUVVWlKVOm6OjRo7r++uv9HqayslKS1K1bN0lSQUGBamtrNX78eO+am266SX369FFeXp4kKS8vT4MHD1ZsbKx3TXJysjwej06ePOld8919NKxp2Mflqqur5fF4fC4AAODa1eJXiGpra3X33Xdr3bp1+pd/+ZeADVJfX6/09HTddtttGjRokCSptLRUTqdT0dHRPmtjY2NVWlrqXfPdGGq4veG2q63xeDz65ptv1LFjR5/bMjMz9cwzzwTssQEAgODW4leIwsLCdOzYsYAPkpaWphMnTmjz5s0B33dLLV68WJWVld7LmTNn7B4JAAC0Ir/eMnvooYf0+uuvB2yIefPmafv27dq3b5/PR/nj4uJUU1PT6O9llZWVKS4uzrvm8k+dNVz/vjUul6vRq0OSFB4eLpfL5XMBAADXLr9Oqr506ZLWr1+v9957TyNGjGj0N8xWrVrVrP1YlqXHH39c27Zt0/79+5WQkOBz+4gRIxQWFqacnBxNnTpVknTq1CkVFRXJ7XZLktxut/71X/9V5eXliomJkSRlZ2fL5XIpMTHRu+YPf/iDz76zs7O9+wAAAGZrURB9+eWX6tevn06cOKHhw4dLkj777DOfNQ6Ho9n7S0tL06ZNm/S73/1OkZGR3nN+oqKi1LFjR0VFRWn27NnKyMhQt27d5HK59Pjjj8vtdispKUmSNGHCBCUmJmrGjBlasWKFSktL9fTTTystLc37nUhz5szRSy+9pEWLFmnWrFnau3evtmzZoh07drTk4QMAgGtUi4LohhtuUElJifbt2yfp2z/V8eKLLzY6Ybm5XnnlFUnS6NGjfbZv2LBBDz/8sCRp9erVCgkJ0dSpU1VdXa3k5GS9/PLL3rUdOnTQ9u3bNXfuXLndbnXu3FmpqalatmyZd
01CQoJ27NihBQsWaM2aNerdu7dee+01PnIPAAAktTCILv9r9jt37vT5EsWWunx/TYmIiNDatWu1du3aK67p27dvo7fELjd69GgdPXq0xTMCAIBrn18nVTdoTtAAAAAEuxYFkcPhaHSOUEvOGQIAAAhGLX7L7OGHH/aerHzx4kXNmTOn0afMfvvb3wZuQgAAgFbWoiBKTU31uf7QQw8FdBgAAAA7tCiINmzY0FpzAAAA2OYHnVQNAABwLSCIAACA8QgiAABgPIIIAAAYjyACAADGI4gAAIDxCCIAAGA8gggAABiPIAIAAMYjiAAAgPEIIgAAYDyCCAAAGI8gAgAAxiOIAACA8QgiAABgPIIIAAAYjyACAADGI4gAAIDxCCIAAGA8gggAABiPIAIAAMYjiAAAgPEIIgAAYDyCCAAAGI8gAgAAxiOIAACA8QgiAABgPIIIAAAYjyACAADGI4gAAIDxCCIAAGA8gggAABiPIAIAAMYjiIAgFRERIcuy7B6jxdrjzAAQavcAAJoWFhYmh8Oh/Px8eTweu8dpFpfLpaSkJLvHAIAWI4iAIOfxeFRRUWH3GABwTeMtMwAAYDyCCAAAGI8gAgAAxiOIAACA8QgiAABgPIIIAAAYjyACAADGI4gAAIDxCCIAAGA8gggAABiPIAIAAMYjiAAAgPEIIgAAYDyCCAAAGI8gAgAAxiOIAACA8QgiAABgPIIIAAAYjyACAADGI4gAAIDxCCIAAGA8gggAABiPIAIAAMYjiAAAgPEIIgAAYDyCCAAAGI8gAgAAxrM1iN5//33de++96tWrlxwOh9555x2f2y3L0pIlS9SzZ0917NhR48eP1+eff+6z5ty5c5o+fbpcLpeio6M1e/ZsXbhwwWfNsWPHdMcddygiIkLx8fFasWJFaz80AADQjtgaRFVVVRo6dKjWrl3b5O0rVqzQiy++qHXr1ungwYPq3LmzkpOTdfHiRe+a6dOn6+TJk8rOztb27dv1/vvv67HHHvPe7vF4NGHCBPXt21cFBQV6/vnn9Ytf/EL//u//3uqPDwAAtA+hdv7we+65R/fcc0+Tt1mWpRdeeEFPP/20Jk2aJEl64403FBsbq3feeUfTpk3Tp59+ql27dumjjz7SLbfcIkn69a9/rYkTJ+pXv/qVevXqpbfeeks1NTVav369nE6nbr75ZhUWFmrVqlU+4QQAAMwVtOcQnT59WqWlpRo/frx3W1RUlEaNGqW8vDxJUl5enqKjo70xJEnjx49XSEiIDh486F1z5513yul0etckJyfr1KlT+vrrr5v82dXV1fJ4PD4XAABw7QraICotLZUkxcbG+myPjY313lZaWqqYmBif20NDQ9WtWzefNU3t47s/43KZmZmKioryXuLj43/4AwIAAEEraIPITosXL1ZlZaX3cubMGbtHAgAArShogyguLk6SVFZW5rO9rKzMe1tcXJzKy8t9br906ZLOnTvns6apfXz3Z1wuPDxcLpfL5wIAAK5dQRtECQkJiouLU05Ojnebx+PRwYMH5Xa7JUlut1sVFRUqKCjwrtm7d6/q6+s1atQo75r3339ftbW13jXZ2dkaMGCAunbt2kaPBgAABDNbg+jChQsqLCxUYWGhpG9PpC4sLFRRUZEcDofS09P1y1/+Ur///e91/PhxzZw5U7169dLkyZMlSQMHDtTdd9+tn/70pzp06JA+/PBDzZs3T9OmTVOvXr0kSQ8++KCcTqdmz56tkydP6u2339aaNWuUkZFh06MGAADBxtaP3R8+fFhjxozxXm+IlNTUVGVlZWnRokWqqqrSY489poqKCt1+++3atWuXIiIivPd56623NG/ePI0bN04hISGaOnWqXnzxRe/tUVFR2rNnj9LS0jRixAj16NFDS5Ys4SP3AADAy9YgGj16tCzLuuLtDodDy5Yt07Jly664plu3btq0adNVf86QIUP0xz/+0e85AQDAtS1ozyECAABoKwQRgICJiIi46qu+w
ao9zgwgsGx9ywzAtSUsLEwOh0P5+fnt5hveXS6XkpKS7B4DgM0IIgAB5/F4VFFRYfcYANBsvGUGAACMRxABAADjEUQAAMB4BBEAADAeQQQAAIxHEAEAAOMRRAAAwHgEEQAAMB5BBAAAjEcQAQAA4xFEAADAeAQRAAAwHkEEAACMRxABAADjEUQAAMB4BBEAADAeQQQAAIxHEAEAAOMRRAAAwHgEEQAAMB5BBAAAjEcQAQAA4xFEAADAeAQRAAAwHkEEAACMRxABQDtkWZbdI/ilvc6Na1+o3QMAAFrO4XAoPz9fHo/H7lGazeVyKSkpye4xgCYRRADQTnk8HlVUVNg9BnBN4C0zAABgPIIIAAAYjyACAADGI4gAAIDxCCIAAGA8gggAABiPIAJgtIiICL4sEADfQwTAbGFhYe3uSw7j4uI0ZMgQu8cArikEEQCofX3JYWRkpN0jANcc3jIDAADGI4gAAIDxCCIAAGA8gggAABiPIAIAAMYjiAAAgPEIIgAAYDyCCAAAGI8gAgAAxiOIAACA8QgiAABgPIIIAAAYjyACAADGI4gAAIDxCCIAAGA8gggAABiPIAIAAMYjiAAAuArLsuweocXa48x2C7V7AAAAgpnD4VB+fr48Ho/dozSLy+VSUlKS3WO0OwQRAKBNREREyLIsORwOu0dpMY/Ho4qKCrvHaJb2epztnpkgAgC0ibCwsHb3aktcXJyGDBli9xgt0h6PczC8qkUQAQDaVHt6tSUyMtLuEfzWno5zMOCkagAAYDyCCAAAGI8gAgAAxiOIAACA8QgiAABgPKOCaO3aterXr58iIiI0atQoHTp0yO6RAABAEDAmiN5++21lZGRo6dKlOnLkiIYOHark5GSVl5fbPRoAALCZMUG0atUq/fSnP9UjjzyixMRErVu3Tp06ddL69evtHg0AANjMiC9mrKmpUUFBgRYvXuzdFhISovHjxysvL6/R+urqalVXV3uvV1ZWSlKrfuNnhw4dFBYW1mr7DyTLsuTxeJi5lTFz22DmttMe52bmttGhQ4dW+Te2YZ/N+WO3RgTRX//6V9XV1Sk2NtZne2xsrP70pz81Wp+Zmalnnnmm0fb4+PhWmxEAALSO8+fPKyoq6qprjAiillq8eLEyMjK81+vr63Xu3Dl179693f2xvEDweDyKj4/XmTNn5HK57B6n3eI4BgbHMTA4joHBcQyM1jqOlmXp/Pnz6tWr1/euNSKIevTooQ4dOqisrMxne1lZmeLi4hqtDw8PV3h4uM+26Ojo1hyxXXC5XPwPHwAcx8DgOAYGxzEwOI6B0RrH8fteGWpgxEnVTqdTI0aMUE5OjndbfX29cnJy5Ha7bZwMAAAEAyNeIZKkjIwMpaam6pZbbtHIkSP1wgsvqKqqSo888ojdowEAAJsZE0QPPPCAzp49qyVLlqi0tFTDhg3Trl27Gp1ojcbCw8O1dOnSRm8jomU4joHBcQwMjmNgcBwDIxiOo8NqzmfRAAAArmFGnEMEAABwNQQRAAAwHkEEAACMRxABAADjEUSQJK1du1b9+vVTRESERo0apUOHDl1xbVZWlhwOh88lIiKiDacNPu+//77uvfde9erVSw6HQ++888733mf//v0aPny4wsPD1b9/f2VlZbX6nMGupcdx//79jZ6LDodDpaWlbTNwkMrMzNSPfvQjRUZGKiYmRpMnT9apU6e+935bt27VTTfdpIiICA0ePFh/+MMf2mDa4OXPceT3Y2OvvPKKhgwZ4v3SRbfbrZ07d171PnY8Fwki6O2331ZGRoaWLl2qI0eOaOjQoUpOTlZ5efkV7+NyuVRSUuK9fPXVV204cfCpqqrS0KFDtXbt2matP336tFJSUjRmzBgVFhYqPT1djz76qHbv3t3Kkwa3lh7HBqdOnfJ5PsbExLTShO1Dbm6u0tLSlJ+fr+zsbNXW1mrChAmqqqq64n0OHDign/zkJ5o9e7aOHj2qyZMna/LkyTpx4kQbTh5c/
DmOEr8fL9e7d289++yzKigo0OHDhzV27FhNmjRJJ0+ebHK9bc9FC8YbOXKklZaW5r1eV1dn9erVy8rMzGxy/YYNG6yoqKg2mq79kWRt27btqmsWLVpk3XzzzT7bHnjgASs5ObkVJ2tfmnMc9+3bZ0myvv766zaZqb0qLy+3JFm5ublXXPOP//iPVkpKis+2UaNGWT/72c9ae7x2oznHkd+PzdO1a1frtddea/I2u56LvEJkuJqaGhUUFGj8+PHebSEhIRo/frzy8vKueL8LFy6ob9++io+Pv2rpo2l5eXk+x1ySkpOTr3rMcWXDhg1Tz549ddddd+nDDz+0e5ygU1lZKUnq1q3bFdfwnPx+zTmOEr8fr6aurk6bN29WVVXVFf90ll3PRYLIcH/9619VV1fX6Bu7Y2Njr3gexoABA7R+/Xr97ne/05tvvqn6+nrdeuut+stf/tIWI18TSktLmzzmHo9H33zzjU1TtT89e/bUunXr9Jvf/Ea/+c1vFB8fr9GjR+vIkSN2jxY06uvrlZ6erttuu02DBg264rorPSdNPx+rQXOPI78fm3b8+HF16dJF4eHhmjNnjrZt26bExMQm19r1XDTmT3cgcNxut0/Z33rrrRo4cKBeffVVLV++3MbJYJoBAwZowIAB3uu33nqrvvjiC61evVr/+Z//aeNkwSMtLU0nTpzQBx98YPco7VpzjyO/H5s2YMAAFRYWqrKyUv/93/+t1NRU5ebmXjGK7MArRIbr0aOHOnTooLKyMp/tZWVliouLa9Y+wsLC9Pd///f685//3BojXpPi4uKaPOYul0sdO3a0aaprw8iRI3ku/n/z5s3T9u3btW/fPvXu3fuqa6/0nGzu74FrWUuO4+X4/fgtp9Op/v37a8SIEcrMzNTQoUO1Zs2aJtfa9VwkiAzndDo1YsQI5eTkeLfV19crJyfniu/vXq6urk7Hjx9Xz549W2vMa47b7fY55pKUnZ3d7GOOKyssLDT+uWhZlubNm6dt27Zp7969SkhI+N778JxszJ/jeDl+Pzatvr5e1dXVTd5m23OxVU/ZRruwefNmKzw83MrKyrI++eQT67HHHrOio6Ot0tJSy7Isa8aMGdZTTz3lXf/MM89Yu3fvtr744guroKDAmjZtmhUREWGdPHnSrodgu/Pnz1tHjx61jh49akmyVq1aZR09etT66quvLMuyrKeeesqaMWOGd/2XX35pderUyVq4cKH16aefWmvXrrU6dOhg7dq1y66HEBRaehxXr15tvfPOO9bnn39uHT9+3Jo/f74VEhJivffee3Y9hKAwd+5cKyoqytq/f79VUlLivfztb3/zrrn8/+sPP/zQCg0NtX71q19Zn376qbV06VIrLCzMOn78uB0PISj4cxz5/djYU089ZeXm5lqnT5+2jh07Zj311FOWw+Gw9uzZY1lW8DwXCSJYlmVZv/71r60+ffpYTqfTGjlypJWfn++97cc//rGVmprqvZ6enu5dGxsba02cONE6cuSIDVMHj4aPf19+aThuqamp1o9//ONG9xk2bJjldDqtv/u7v7M2bNjQ5nMHm5Yex+eee866/vrrrYiICKtbt27W6NGjrb1799ozfBBp6hhK8nmOXf7/tWVZ1pYtW6wbb7zRcjqd1s0332zt2LGjbQcPMv4cR34/NjZr1iyrb9++ltPptK677jpr3Lhx3hiyrOB5Ljosy7Ja9zUoAACA4MY5RAAAwHgEEQAAMB5BBAAAjEcQAQAA4xFEAADAeAQRAAAwHkEEAACMRxABAADjEUQAAMB4BBEAADAeQQQAAIxHEAEAAOP9P1CetU2+SpbcAAAAAElFTkSuQmCC", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "persdur_zoom.plot.hist(bins=10, edgecolor=\"white\", color=\"darkgrey\");" ] }, { "cell_type": "markdown", "id": "cafd2d6f", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Questions\n", "\n", "- Describe how the empirical histogram changes as the the value `n` gets larger." ] }, { "cell_type": "markdown", "id": "fef3a8a1", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Empirical Distribution of a Statistic\n", "\n", "The Law of Averages implies that with high probability, the empirical distribution of a large random sample will resemble the distribution of the population from which the sample was drawn.\n", "\n", "The resemblance is visible in two histograms: the empirical histogram of a large random sample is likely to resemble the histogram of the population." ] }, { "cell_type": "markdown", "id": "944a7331", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Statistical Parameter\n", "\n", "Numerical quantities associated with a (statistical) population are called *statistical parameters* or *parameters*. 
For the population of respondents in `persdur_zoom`, we know the value of the parameter \"median time (hours) spent on personal activities\":" ] }, { "cell_type": "code", "execution_count": 7, "id": "637eee6b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.75" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "persdur_zoom.median()" ] }, { "cell_type": "markdown", "id": "2010df46", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "More frequently, we are interested in parameters that are unknown to us.\n", "\n", "- In a population of voters, what percent will vote for Candidate A?\n", "\n", "- In a population of TikTok users, what is the largest number of followers for a user?\n", "\n", "- In a population of Air Canada flights, what is the median departure delay?\n", "\n", "- In a population of commercial air flights, what is the average [fatal accident rate](https://www.cnn.com/2025/02/19/business/airplane-crashes-statistics/index.html)?" ] }, { "cell_type": "markdown", "id": "3fad42c7", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Statistic\n", "\n", "For this, we will rely on data from a large random sample drawn from the population.\n", "\n", "A *statistic* (note the singular!) is any number computed using the data in a sample. The sample median, therefore, is a statistic. \n", "\n", "Remember that `persdur_zoom.sample(100)` contains a random sample of 100 respondents from `persdur_zoom`. 
The observed value of the sample median is:" ] }, { "cell_type": "markdown", "id": "8170f8fc", "metadata": {}, "source": [ "![](stats.png)" ] }, { "cell_type": "code", "execution_count": 8, "id": "7ad59370", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.75" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "persdur_zoom.sample(100).median()" ] }, { "cell_type": "markdown", "id": "f836e0e0", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Our sample – one set of 100 people – gave us one observed value of the statistic. This raises an important problem of inference:\n", "\n", "**The statistic could have been different.**\n", "\n", "A fundamental consideration in using any statistic based on a random sample is that *the sample could have come out differently*, and therefore the statistic could have come out differently too." ] }, { "cell_type": "markdown", "id": "2d8fe60f", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Run the cell above a few times to see how the answer varies. Often it is equal to 0.75, the same value as the population parameter. But sometimes it is different." ] }, { "cell_type": "markdown", "id": "c1c15f98", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Example: Poll Tracker\n", "\n", "- Consider [the polling averages of the Canadian parties](https://newsinteractives.cbc.ca/elections/poll-tracker/canada/) as of today.\n", "\n", "- Each poll consists of responses from a random sample of survey participants.\n", "\n", "- Even polls conducted during the same period may not yield the same results." 
] }, { "cell_type": "markdown", "id": "29a7b193", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "![](korea-poll-graph-2022.png)\n", "\n", "_Source: https://poll-mbc.co.kr/bk/2022_president.html_" ] }, { "cell_type": "markdown", "id": "917e4166", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Uh oh...\n", "\n", "- We don't know the population parameter.\n", "\n", "- Our statistic from a random sample may be a good estimate, but it may not be, because a different sample would give a different value." ] }, { "cell_type": "markdown", "id": "272a6b93", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "\n", "**Just how different could the statistic have been?** \n", "\n", "One way to answer this is to simulate the statistic many times and note the values. \n", "A histogram of those values will tell us about the **distribution of the statistic**." ] }, { "cell_type": "markdown", "id": "9f6da1dd", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Simulating a Statistic\n", "\n", "We will simulate the sample median using the steps below. You can replace the sample size of 100 by any other sample size, and the sample median by any other statistic.\n", "\n", "**Step 1: Decide which statistic to simulate.** We have already decided that: we are going to simulate the median of a random sample of size 100 drawn from the population of time use survey respondents whose time spent on personal activities was between 0.25 and 3.0 hours.\n", "\n", "**Step 2: Define a function that returns one simulated value of the statistic.** Draw a random sample of size 100 and compute the median of the sample. We did this in the code cell above. Here it is again, encapsulated in a function." 
] }, { "cell_type": "code", "execution_count": 9, "id": "5eb378fb", "metadata": {}, "outputs": [], "source": [ "def random_sample_median():\n", "    return persdur_zoom.sample(100).median()" ] }, { "cell_type": "code", "execution_count": 10, "id": "1eec73a6", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "0.7083333333333333" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "random_sample_median()" ] }, { "cell_type": "markdown", "id": "3979a0fb", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "**Step 3: Decide how many simulated values to generate.** Let's do 5,000 repetitions.\n", "\n", "**Step 4: Use a `for` loop to generate a list of simulated values.** Start by creating an empty list in which to collect our results. We will then set up a `for` loop for generating all the simulated values. The body of the loop will consist of generating one simulated value of the sample median, and appending it to our collection list.\n", "\n", "The simulation takes a noticeable amount of time to run. That is because it is performing 5,000 repetitions of the process of drawing a sample of size 100 and computing its median. That's a lot of sampling and repeating!" ] }, { "cell_type": "markdown", "id": "1c6a6530", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Let's break down this step a bit further:\n", "\n", "- Set up an empty list called `sim_medians`." ] }, { "cell_type": "code", "execution_count": 12, "id": "f20eea7b", "metadata": {}, "outputs": [], "source": [ "sim_medians = [] # empty list" ] }, { "cell_type": "markdown", "id": "306bbb0d", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Use the `append` method to append values to `sim_medians`." 
] }, { "cell_type": "code", "execution_count": 13, "id": "e5b24aeb", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[0.9166666666666666]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sim_medians.append(random_sample_median())\n", "sim_medians" ] }, { "cell_type": "markdown", "id": "82fd17f1", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Run the cell above several times and you will see that values keep getting appended to `sim_medians`.\n", "\n", "- Each time the cell is run, a random sample of 100 is drawn, and its median is calculated and appended to the list." ] }, { "cell_type": "code", "execution_count": 14, "id": "a93270f7", "metadata": {}, "outputs": [], "source": [ "sim_medians = []\n", "\n", "num_sims = 5000\n", "\n", "for _ in range(num_sims):\n", "    sim_medians.append(random_sample_median())" ] }, { "cell_type": "markdown", "id": "fcc87430", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Empirical Distribution of a Statistic\n", "\n", "We can now examine the **empirical frequency distribution** of the median statistic." ] }, { "cell_type": "markdown", "id": "0d479dcb", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Create a `pandas` Series from the list `sim_medians`, then use the `describe` method to summarize the distribution." 
] }, { "cell_type": "code", "execution_count": 15, "id": "6524a594", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 5000.000000\n", "mean 0.800458\n", "std 0.081846\n", "min 0.500000\n", "25% 0.750000\n", "50% 0.750000\n", "75% 0.833333\n", "max 1.000000\n", "dtype: float64" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.Series(sim_medians).describe()" ] }, { "cell_type": "markdown", "id": "03518bb4", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "- The distribution of the median can be visualized using a histogram. \n", "\n", "- The histogram can tell us how frequently certain values of the median occur in random samples." ] }, { "cell_type": "code", "execution_count": 16, "id": "5127a82f", "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "# create bins of length 0.1 \n", "# starting at 0.5 and ending at 1.1\n", "\n", "median_bins = np.arange(start = 0.5, stop = 1.1, step = 0.1) # use a numpy function here\n", "\n", "# use the bins to plot empirical distribution of medians\n", "\n", "plt.hist(sim_medians, bins=median_bins, edgecolor=\"white\", color=\"darkgrey\");" ] }, { "cell_type": "markdown", "id": "96d02581", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The exact counts of medians in each interval can be examined using `pd.cut` and `value_counts`." ] }, { "cell_type": "code", "execution_count": 17, "id": "0cfab888", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(0.7, 0.8] 2727\n", "(0.8, 0.9] 1275\n", "(0.9, 1.0] 437\n", "(1.0, 1.1] 285\n", "(0.6, 0.7] 269\n", "(0.5, 0.6] 5\n", "Name: count, dtype: int64" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# create a pandas series so we can use cut function\n", "sim_means_series = pd.Series(sim_medians)\n", "\n", "# frequency of values in each bin\n", "pd.cut(sim_means_series, median_bins).value_counts()" ] }, { "cell_type": "code", "execution_count": 18, "id": "94226cd1", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(0.7, 0.8] 0.5454\n", "(0.8, 0.9] 0.2550\n", "(0.9, 1.0] 0.0874\n", "(1.0, 1.1] 0.0570\n", "(0.6, 0.7] 0.0538\n", "(0.5, 0.6] 0.0010\n", "Name: count, dtype: float64" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# relative frequency (proportion) of values in each bin\n", "pd.cut(sim_means_series, median_bins).value_counts() / num_sims" ] }, { "cell_type": "markdown", "id": "68d81e97", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- The histogram shows that the median values between 0.70 and 0.80 have the highest probability of 
occurring.\n", "\n", "- It also shows that a random sample would rarely yield a median in the range 0.5 to 0.6 (about 0.1% of the simulated samples), and only occasionally one in the range 1.0 to 1.1 (about 5.7%)." ] }, { "cell_type": "markdown", "id": "5888f9e1", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## What can we learn from simulation?\n", "\n", "- If we could generate all possible random samples of size 100, we would know all possible values of the statistic (the sample median), as well as the probabilities of all those values. We could visualize all the values and probabilities in the probability histogram of the statistic." ] }, { "cell_type": "markdown", "id": "3c898c41", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- But in many situations, including this one, the number of possible samples is too large for a computer to enumerate, and purely mathematical calculation of the probabilities can be intractably difficult. All of this also assumes that you have access to the whole population." ] }, { "cell_type": "markdown", "id": "1d28b7fc", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- This is where empirical histograms come in." ] }, { "cell_type": "markdown", "id": "24b20368", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- We know that by the Law of Averages, the empirical histogram of the statistic is likely to resemble the probability histogram of the statistic, if the sample size is large and if you repeat the random sampling process numerous times." ] }, { "cell_type": "markdown", "id": "ce9dbb41", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- This means that simulating random processes repeatedly is a way of approximating probability distributions *without figuring out the probabilities mathematically or generating all possible random samples*. 
" ] }, { "cell_type": "markdown", "id": "f2b39896", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Thus computer simulations become a powerful tool in data science. They can help data scientists understand the properties of random quantities that would be complicated to analyze in other ways." ] }, { "cell_type": "markdown", "id": "f6d31097-9363-4fb2-9ecc-1aac7b179527", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Models\n", "\n", "- A model is a set of assumptions about the data. In many cases models include assumptions about random (stochastic) processes used to generate the data.\n", "\n", "- Data scientists are often in a position of formulating and assessing models." ] }, { "cell_type": "markdown", "id": "4cebbb19-5679-4ed2-9ffd-012b985fb312", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Goals of Data Science\n", "\n", "- Deeper understanding of the world.\n", "\n", "- Make the world a better place to live.\n", "\n", "- For example, help expose injustice.\n", "\n", "- The following example demonstrates how a model can achieve such goals." ] }, { "cell_type": "markdown", "id": "8c9e9d64", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Jury Selection\n", "\n", "- U.S. Constitution grants equal protection under the law" ] }, { "cell_type": "markdown", "id": "ecbd30b8", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- All defendants have the right to due process " ] }, { "cell_type": "markdown", "id": "a2ee959c", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Robert Swain, a Black man, was convicted in Talladega County, AL" ] }, { "cell_type": "markdown", "id": "a4ce2705", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- He appealed to the U.S. 
Supreme Court\n", "\n", "- Main reason: Unfair jury selection in the County’s trials" ] }, { "cell_type": "markdown", "id": "97cb022e", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- At the time of the trial, only men aged 21 or more were eligible to serve on juries in Talladega County. In that population, 26% of the men were Black. \n", "\n", "- But only eight men among the panel of 100 men (that is, 8%) were Black." ] }, { "cell_type": "markdown", "id": "df094ad7", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- The U.S. Supreme Court reviewed the appeal and concluded, “the overall percentage disparity has been small.” But was this assertion reasonable? " ] }, { "cell_type": "markdown", "id": "a6eef555-860c-47d6-93f4-3f82f0c227b0", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- If jury panelists were selected at random from the county’s eligible population, there would be some chance variation. We wouldn’t get exactly 26 Black panelists on every 100-person panel. But would we expect as few as eight?" ] }, { "cell_type": "markdown", "id": "cc99dd90-1b5c-467e-b07d-14024fc83bb7", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## A model of random selection\n", "\n", "- A model of the data is that the panel was selected at random and ended up with a small number of Black panelists just due to chance.\n", "\n", "- Since the panel was supposed to resemble the population of all eligible jurors, the model of random selection is important to assess. Let’s see if it stands up to scrutiny.\n", "\n", "- The `numpy.random` function `multinomial(n, pvals, size)` can be used to simulate sample proportions or counts with two or more categories." 
] }, { "cell_type": "markdown", "id": "64577a5a-a0c2-400b-a2f5-f7281d742c86", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Example 1: Rolling a six-sided die 20 times" ] }, { "cell_type": "code", "execution_count": 19, "id": "237a8f9b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[3, 3, 4, 4, 5, 1]])" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "\n", "# number of times die is rolled\n", "sample_size = 20\n", "\n", "# number of experiments\n", "num_simulations = 1\n", "\n", "# probability of each side\n", "true_probabilities = [1/6] * 6\n", "\n", "# number of times each side appears\n", "counts = np.random.multinomial(sample_size, true_probabilities, size=num_simulations)\n", "\n", "counts" ] }, { "cell_type": "code", "execution_count": 20, "id": "51f82363-2dcb-4f92-8c66-80f0acd9897e", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Sample counts: \n", " [[3 3 4 4 5 1]]\n", "Sample proportions: \n", " [[0.15 0.15 0.2 0.2 0.25 0.05]]\n", "True probabilities: \n", " [0.16666666666666666, 0.16666666666666666, 0.16666666666666666, 0.16666666666666666, 0.16666666666666666, 0.16666666666666666]\n" ] } ], "source": [ "proportions = counts / sample_size\n", "\n", "print(\"Sample counts: \\n\", counts) \n", "print(\"Sample proportions: \\n\", proportions)\n", "print(\"True probabilities: \\n\", true_probabilities)" ] }, { "cell_type": "markdown", "id": "0f2119c0-adcc-4095-90cf-b5e6f1dace60", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Example 2: Rolling a loaded six-sided die (twice as likely to land on 6) 100 times, repeated 3 times" ] }, { "cell_type": "code", "execution_count": 21, "id": "970611e6-05ef-4836-a1a4-57f076c7e0f9", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Sample counts: \n", " [[10 17 9 14 16 34]\n", " [12 18 10 14 13 33]\n", " [16 14 16 
14 15 25]]\n", "Sample proportions: \n", " [[0.1 0.17 0.09 0.14 0.16 0.34]\n", " [0.12 0.18 0.1 0.14 0.13 0.33]\n", " [0.16 0.14 0.16 0.14 0.15 0.25]]\n", "True probabilities: \n", " [0.14285714285714285, 0.14285714285714285, 0.14285714285714285, 0.14285714285714285, 0.14285714285714285, 0.2857142857142857]\n" ] } ], "source": [ "sample_size = 100\n", "\n", "num_simulations = 3\n", "\n", "true_probabilities = [1 / 7] * 5 + [2 / 7]\n", "\n", "counts = np.random.multinomial(sample_size, true_probabilities, size=num_simulations)\n", "\n", "proportions = counts / sample_size\n", "\n", "print(\"Sample counts: \\n\", counts) \n", "print(\"Sample proportions: \\n\", proportions)\n", "print(\"True probabilities: \\n\", true_probabilities)" ] }, { "cell_type": "markdown", "id": "4c6f2a05-cf84-4a68-8f8c-12eeb4313acc", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "- Let's use this to simulate the jury selection process.\n", "\n", "- The size of the jury panel is 100, so `sample_size` is 100. \n", "\n", "- The distribution from which we will draw the sample is the distribution in the population of eligible jurors: 26% of them were Black, so 100% - 26% = 74% were White (a very simplistic assumption, but let's go with it for now). \n", "\n", "- This means `true_probabilities` is `[0.26, 0.74]`.\n", "\n", "- One simulation is below." ] }, { "cell_type": "code", "execution_count": 22, "id": "eca2d986", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[23, 77]])" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_size = 100\n", "\n", "true_probabilities = [0.26, 0.74]\n", "\n", "num_simulations = 1\n", "\n", "counts = np.random.multinomial(sample_size, true_probabilities, size=num_simulations)\n", "\n", "counts" ] }, { "cell_type": "code", "execution_count": 23, "id": "98626e7c-772e-4dda-b191-1c39e8fa735b", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
BlackWhite
00.230.77
\n", "
" ], "text/plain": [ " Black White\n", "0 0.23 0.77" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "proportions = counts / sample_size\n", "\n", "sim_counts = pd.DataFrame(proportions, columns = [\"Black\", \"White\"])\n", "\n", "sim_counts" ] }, { "cell_type": "code", "execution_count": 24, "id": "e974b9ae", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.23" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sim_counts.iloc[0,0]" ] }, { "cell_type": "markdown", "id": "d54a719d", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Quick side note on Indexing a `pandas` `DataFrame`\n", "\n", "- `iloc` and `loc` are both indexers, used with square brackets, that access data in a `DataFrame`.\n", "\n", "- `loc` uses labels to access rows and columns.\n", "\n", "- `iloc` uses integer positions to access rows and columns." ] }, { "cell_type": "code", "execution_count": 25, "id": "f8833462", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(1, 2)\n" ] } ], "source": [ "print(sim_counts.shape)" ] }, { "cell_type": "code", "execution_count": 26, "id": "38a9f6d3", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RangeIndex(start=0, stop=1, step=1)\n" ] } ], "source": [ "print(sim_counts.index)" ] }, { "cell_type": "code", "execution_count": 27, "id": "1426a8f6", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Index(['Black', 'White'], dtype='object')\n" ] } ], "source": [ "print(sim_counts.columns)" ] }, { "cell_type": "code", "execution_count": 28, "id": "29aaea0e", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.77\n", "0.77\n" ] } ], "source": [ "# row 0 column 1\n", "print(sim_counts.iloc[0, 1])\n", "\n", "# access row 0 column 1 using label names\n", "print(sim_counts.loc[0, \"White\"])" ] }, { "cell_type": "code", 
"execution_count": 29, "id": "69f40599", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.23\n", "0.23\n" ] } ], "source": [ "print(sim_counts.iloc[0, 0])\n", "\n", "print(sim_counts.loc[0, \"Black\"])" ] }, { "cell_type": "markdown", "id": "1e1069d8-4602-40fb-be47-7e21430eefc7", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Back to simulation ... Simulate one value\n", "\n", "- Let's write a function to simulate one value." ] }, { "cell_type": "code", "execution_count": 30, "id": "f3d23692-67a0-4205-b466-7f4a5db54f54", "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "def simulate_one_count():\n", " sample_size = 100 # jury size\n", "\n", " true_probabilities = [0.26, 0.74] #true prob of race\n", " \n", " num_simulations = 1 # number of simulations\n", " # get the random counts\n", " counts = np.random.multinomial(sample_size, true_probabilities, size=num_simulations) \n", " \n", " # store in data frame\n", " sim_counts = pd.DataFrame(counts, columns = [\"Black\", \"White\"])\n", " \n", " return sim_counts.iloc[0,0]\n" ] }, { "cell_type": "code", "execution_count": 31, "id": "3e371a58-6c50-458f-ad07-e0dd7592e1ce", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "29" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "simulate_one_count()" ] }, { "cell_type": "markdown", "id": "06961dc2-9562-46ef-adc4-a190aa44d76d", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Simulate multiple values\n", "\n", "- Our analysis is focused on the variability in the counts. 
\n", "\n", "- Let’s generate 10,000 simulated values of the count and see how they vary.\n", "\n", "- We will do this by using a for loop and collecting all the simulated counts in a list called `sim_counts`" ] }, { "cell_type": "code", "execution_count": 32, "id": "ccea85b6-57ee-4291-86a7-3aa6c56a31d8", "metadata": {}, "outputs": [], "source": [ "sim_counts = []\n", "\n", "number_sims = 10000\n", "\n", "for _ in np.arange(number_sims):\n", " sim_counts.append(simulate_one_count())" ] }, { "cell_type": "code", "execution_count": 33, "id": "cbc747ac-ea7c-4d32-8b37-15f6db0d9dc0", "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import matplotlib.pyplot as plt\n", "\n", "plt.hist(sim_counts, bins = np.arange(5.5, 50, 1), edgecolor = 'black', \n", " color = 'grey', density = True);\n", "plt.xlabel('Count in random sample')\n", "plt.ylabel('Frequency')\n", "plt.scatter(8, 0, color = 'orange', s = 150);" ] }, { "cell_type": "markdown", "id": "830c8bf5-d743-41db-a2df-24019c281c07", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Conclusion of the data analysis\n", "\n", "- The histogram shows that if we select a panel of size 100 at random from the eligible population, we are very unlikely to get counts of Black panelists that are as low as the eight that were observed on the panel in the trial.\n", "\n", "- This is evidence that the model of random selection of the jurors in the panel is not consistent with the data from the panel. While it is possible that the panel could have been generated by chance, our simulation demonstrates that it is extremely unlikely.\n", "\n", "- Therefore the most *reasonable* conclusion is that the assumption of random selection is unjustified for this jury panel." ] }, { "cell_type": "markdown", "id": "8c372dea-0534-444e-b3ba-0ebecb785b89", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "- The simulation could also have been done without a `for` loop, by passing `size=num_simulations` to `np.random.multinomial`.\n", "\n", "- This is an example of a 'vectorized' computation; vectorized computations are usually faster than non-vectorized ones." 
] }, { "cell_type": "code", "execution_count": 34, "id": "f665c7b0-8d5f-4e85-94a7-4127c52615e0", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "10000" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_size = 100\n", "\n", "true_probabilities = [0.26, 0.74]\n", "\n", "num_simulations = 10000\n", "\n", "counts = np.random.multinomial(sample_size, true_probabilities, size=num_simulations)\n", "\n", "len(counts)" ] }, { "cell_type": "code", "execution_count": 35, "id": "be364d12", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[26, 74],\n", " [24, 76],\n", " [24, 76],\n", " ...,\n", " [34, 66],\n", " [22, 78],\n", " [27, 73]])" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "counts" ] }, { "cell_type": "markdown", "id": "338e88fa-54e3-4150-8ba0-c35aab5f3600", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Comparing two samples \n", "\n", "### Are mental health visit rates higher in Toronto Neighbourhoods with higher material deprivation?\n", "\n", "> Material deprivation is closely connected to poverty and it refers to inability for individuals and communities to access and attain basic material needs. The indicators included in this dimension measure income, quality of housing, educational attainment, and family structure characteristics. 
\n", "\n", "See [2011 Ontario Marginalization Index Documentation](https://www.publichealthontario.ca/-/media/Documents/O/2017/on-marg-technical.pdf?la=en&sc_lang=en&hash=EED54DF437EDEDA2DFE1A00A4B14A50A) and [Toronto Health Profiles website](http://www.torontohealthprofiles.ca/index.php?varTab=HPDtbl)" ] }, { "cell_type": "markdown", "id": "b99a6b64", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Data wrangling\n", "\n", "- The next few slides involve reading the data into pandas and getting it ready for analysis.\n", "- This won't be covered in detail in class, since we have already covered this process in previous classes." ] }, { "cell_type": "markdown", "id": "29dc46a3-8122-450f-b237-71bb36b92dd4", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- The data is stored in `1_marg_neighb_toronto_2006_OnMarg.xls`, a Microsoft Excel file with the extension `.xls`.\n", "\n", "- Use the `pandas` function `read_excel` with the `sheet_name` parameter." 
] }, { "cell_type": "markdown", "id": "2b7c8916-4b64-4249-96b5-b2cd552620b2", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Neighbourhood deprivation scores" ] }, { "cell_type": "code", "execution_count": 36, "id": "c9df7fb0", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: xlrd in /Users/moon/anaconda3/envs/ggr274/lib/python3.10/site-packages (2.0.1)\n", "\n", "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m23.3.2\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m25.0.1\u001b[0m\n", "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n", "Note: you may need to restart the kernel to use updated packages.\n" ] } ], "source": [ "# this python library is not pre-installed\n", "# and is required to use `pandas` `read_excel`\n", "%pip install xlrd" ] }, { "cell_type": "code", "execution_count": 37, "id": "4a8ca612-4deb-42ea-9735-cd5baaa9c8ea", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Unnamed: 0Ontario Marginalization Index. Toronto Neighbourhoods 2006, \\nQuintiles: Material Deprivation, Residential Instability, Dependency, Ethnic ConcentrationUnnamed: 2Unnamed: 3Unnamed: 4Unnamed: 5Unnamed: 6Unnamed: 7Unnamed: 8Unnamed: 9Unnamed: 10Unnamed: 11
0Neighb idNeighbourhood namePOPULATIONINSTABILITYINSTABILITY_QDEPRIVATIONDEPRIVATION_QETHNICCONCENTRATIONETHNICCONCENTRATION_QDEPENDENCYDEPENDENCY_QONMARG_COMBINED_Q
11West Humber-Clairville32252-0.663910.16232.45485-0.202132.4
22Mount Olive-Silverstone-Jamestown32127-0.108111.019553.74335-0.597512.4
33Thistletown-Beaumond Heights9928-0.313110.34641.62240.284552.8
44Rexdale-Kipling107250.186620.470441.239630.273452.8
\n", "
" ], "text/plain": [ " Unnamed: 0 \\\n", "0 Neighb id \n", "1 1 \n", "2 2 \n", "3 3 \n", "4 4 \n", "\n", " Ontario Marginalization Index. Toronto Neighbourhoods 2006, \\nQuintiles: Material Deprivation, Residential Instability, Dependency, Ethnic Concentration \\\n", "0 Neighbourhood name \n", "1 West Humber-Clairville \n", "2 Mount Olive-Silverstone-Jamestown \n", "3 Thistletown-Beaumond Heights \n", "4 Rexdale-Kipling \n", "\n", " Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 \\\n", "0 POPULATION INSTABILITY INSTABILITY_Q DEPRIVATION DEPRIVATION_Q \n", "1 32252 -0.6639 1 0.162 3 \n", "2 32127 -0.1081 1 1.0195 5 \n", "3 9928 -0.3131 1 0.346 4 \n", "4 10725 0.1866 2 0.4704 4 \n", "\n", " Unnamed: 7 Unnamed: 8 Unnamed: 9 Unnamed: 10 \\\n", "0 ETHNICCONCENTRATION ETHNICCONCENTRATION_Q DEPENDENCY DEPENDENCY_Q \n", "1 2.4548 5 -0.2021 3 \n", "2 3.7433 5 -0.5975 1 \n", "3 1.622 4 0.2845 5 \n", "4 1.2396 3 0.2734 5 \n", "\n", " Unnamed: 11 \n", "0 ONMARG_COMBINED_Q \n", "1 2.4 \n", "2 2.4 \n", "3 2.8 \n", "4 2.8 " ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "marg_neighb = pd.read_excel(\"1_marg_neighb_toronto_2006_OnMarg.xls\", \n", " sheet_name=\"Neighbourhood_Toronto_OnMarg\")\n", "marg_neighb.head()" ] }, { "cell_type": "markdown", "id": "55c67bf1-56b8-44c0-a7e2-e65a65ccd021", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "If we specify `header` parameter then it will include column names." ] }, { "cell_type": "code", "execution_count": 38, "id": "65407a96-4452-4df2-a67e-b47eb8c88d24", "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Neighb idNeighbourhood namePOPULATIONINSTABILITYINSTABILITY_QDEPRIVATIONDEPRIVATION_QETHNICCONCENTRATIONETHNICCONCENTRATION_QDEPENDENCYDEPENDENCY_QONMARG_COMBINED_Q
01West Humber-Clairville32252-0.663910.162032.45485-0.202132.4
12Mount Olive-Silverstone-Jamestown32127-0.108111.019553.74335-0.597512.4
23Thistletown-Beaumond Heights9928-0.313110.346041.622040.284552.8
34Rexdale-Kipling107250.186620.470441.239630.273452.8
45Elms-Old Rexdale9879-0.015020.804051.99114-0.352722.6
\n", "
" ], "text/plain": [ " Neighb id Neighbourhood name POPULATION INSTABILITY \\\n", "0 1 West Humber-Clairville 32252 -0.6639 \n", "1 2 Mount Olive-Silverstone-Jamestown 32127 -0.1081 \n", "2 3 Thistletown-Beaumond Heights 9928 -0.3131 \n", "3 4 Rexdale-Kipling 10725 0.1866 \n", "4 5 Elms-Old Rexdale 9879 -0.0150 \n", "\n", " INSTABILITY_Q DEPRIVATION DEPRIVATION_Q ETHNICCONCENTRATION \\\n", "0 1 0.1620 3 2.4548 \n", "1 1 1.0195 5 3.7433 \n", "2 1 0.3460 4 1.6220 \n", "3 2 0.4704 4 1.2396 \n", "4 2 0.8040 5 1.9911 \n", "\n", " ETHNICCONCENTRATION_Q DEPENDENCY DEPENDENCY_Q ONMARG_COMBINED_Q \n", "0 5 -0.2021 3 2.4 \n", "1 5 -0.5975 1 2.4 \n", "2 4 0.2845 5 2.8 \n", "3 3 0.2734 5 2.8 \n", "4 4 -0.3527 2 2.6 " ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "marg_neighb = pd.read_excel(\"1_marg_neighb_toronto_2006_OnMarg.xls\", \n", " sheet_name=\"Neighbourhood_Toronto_OnMarg\",\n", " header=1)\n", "marg_neighb.head()" ] }, { "cell_type": "markdown", "id": "49ac2173-9117-471a-9f5c-0b1c4ffdc6e7", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Get the column names of `marg_neighb`." ] }, { "cell_type": "code", "execution_count": 39, "id": "d4432d19", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['Neighb id ', 'Neighbourhood name ', 'POPULATION', 'INSTABILITY',\n", " 'INSTABILITY_Q', 'DEPRIVATION', 'DEPRIVATION_Q', 'ETHNICCONCENTRATION',\n", " 'ETHNICCONCENTRATION_Q', 'DEPENDENCY', 'DEPENDENCY_Q',\n", " 'ONMARG_COMBINED_Q'],\n", " dtype='object')" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "marg_neighb.columns" ] }, { "cell_type": "markdown", "id": "52b1b861-3b0c-4dec-ba6b-ebebe99a37a7", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Select the columns corresponding to `Neighb id ` (column 0), `Neighbourhood name ` (column 1), and `DEPRIVATION` (deprivation score - column 5)." 
] }, { "cell_type": "code", "execution_count": 40, "id": "ff901bbb-2102-491a-9bf0-591fc33e3736", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Neighb idNeighbourhood nameDEPRIVATION
01West Humber-Clairville0.1620
12Mount Olive-Silverstone-Jamestown1.0195
23Thistletown-Beaumond Heights0.3460
34Rexdale-Kipling0.4704
45Elms-Old Rexdale0.8040
\n", "
" ], "text/plain": [ " Neighb id Neighbourhood name DEPRIVATION\n", "0 1 West Humber-Clairville 0.1620\n", "1 2 Mount Olive-Silverstone-Jamestown 1.0195\n", "2 3 Thistletown-Beaumond Heights 0.3460\n", "3 4 Rexdale-Kipling 0.4704\n", "4 5 Elms-Old Rexdale 0.8040" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "marg_neighb[marg_neighb.columns[[0, 1, 5]]].head()" ] }, { "cell_type": "markdown", "id": "09d87100-b7e4-488f-86c8-62e8bd6aa10e", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Create a new `DataFrame` called `depriv` with only those columns." ] }, { "cell_type": "code", "execution_count": 41, "id": "90d10f12-990c-4f04-af85-bd5edc876fdf", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Neighb idNeighbourhood nameDEPRIVATION
01West Humber-Clairville0.1620
12Mount Olive-Silverstone-Jamestown1.0195
23Thistletown-Beaumond Heights0.3460
34Rexdale-Kipling0.4704
45Elms-Old Rexdale0.8040
\n", "
" ], "text/plain": [ " Neighb id Neighbourhood name DEPRIVATION\n", "0 1 West Humber-Clairville 0.1620\n", "1 2 Mount Olive-Silverstone-Jamestown 1.0195\n", "2 3 Thistletown-Beaumond Heights 0.3460\n", "3 4 Rexdale-Kipling 0.4704\n", "4 5 Elms-Old Rexdale 0.8040" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "depriv = marg_neighb[marg_neighb.columns[[0, 1, 5]]]\n", "depriv.head()" ] }, { "cell_type": "markdown", "id": "2bc6a2f9", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Rename the columns of `depriv`." ] }, { "cell_type": "code", "execution_count": 42, "id": "e92c6b52-701c-4cde-b686-b57ac13b0c5f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
n_idnamedeprivation
01West Humber-Clairville0.1620
12Mount Olive-Silverstone-Jamestown1.0195
23Thistletown-Beaumond Heights0.3460
34Rexdale-Kipling0.4704
45Elms-Old Rexdale0.8040
\n", "
" ], "text/plain": [ " n_id name deprivation\n", "0 1 West Humber-Clairville 0.1620\n", "1 2 Mount Olive-Silverstone-Jamestown 1.0195\n", "2 3 Thistletown-Beaumond Heights 0.3460\n", "3 4 Rexdale-Kipling 0.4704\n", "4 5 Elms-Old Rexdale 0.8040" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "colnames = {\"Neighb id \": \"n_id\",\n", " \"DEPRIVATION\" : \"deprivation\",\n", " \"Neighbourhood name \": \"name\"}\n", "\n", "depriv = depriv.copy()\n", "\n", "depriv.rename(columns=colnames, inplace=True)\n", "depriv.head()" ] }, { "cell_type": "markdown", "id": "4d25ce3e-1a54-403c-b89b-5beef7f02441", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Mental health visit rates\n", "\n", "- Read in data on rates of mental health visits stored in `2_ahd_neighb_db_ast_hbp_mhv_copd_2012.xls`.\n", "\n", "- Use `read_excel` with `sheet_name` parameter" ] }, { "cell_type": "code", "execution_count": 43, "id": "038eaed7-e6f7-45f8-9a95-60c6c42d7dcc", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Unnamed: 0Unnamed: 1MaleFemaleBoth sexesMale.1Female.1Both sexes.1Male.2Female.2...Female.12Both sexes.12Rate Ratio**, Both sexes.4H/ L/ NS, Both sexes.4(95% CI) LL, Male.4(95% CI) UL, Male.4(95% CI) LL, Female.4(95% CI) UL, Female.4(95% CI) LL, Both sexes.4(95% CI) UL, Both sexes.4
01West Humber-Clairville938116821061391514046279616.68.2...7.27.20.85L6.18.36.28.36.48.0
12Mount Olive-Silverstone-Jamestown866113019961225613082253387.08.6...8.58.20.96NS6.69.47.210.07.39.3
23Thistletown-Beaumond Heights2754106854124445385776.49.2...8.17.90.93NS5.99.96.410.06.79.3
34Rexdale-Kipling3284537814130447086007.710.0...9.09.11.07NS7.211.67.311.07.810.6
45Elms-Old Rexdale2873966833787402878157.49.6...8.17.50.88NS4.89.26.210.56.19.1
\n", "

5 rows × 81 columns

\n", "
" ], "text/plain": [ " Unnamed: 0 Unnamed: 1 Male Female Both sexes \\\n", "0 1 West Humber-Clairville 938 1168 2106 \n", "1 2 Mount Olive-Silverstone-Jamestown 866 1130 1996 \n", "2 3 Thistletown-Beaumond Heights 275 410 685 \n", "3 4 Rexdale-Kipling 328 453 781 \n", "4 5 Elms-Old Rexdale 287 396 683 \n", "\n", " Male.1 Female.1 Both sexes.1 Male.2 Female.2 ... Female.12 \\\n", "0 13915 14046 27961 6.6 8.2 ... 7.2 \n", "1 12256 13082 25338 7.0 8.6 ... 8.5 \n", "2 4124 4453 8577 6.4 9.2 ... 8.1 \n", "3 4130 4470 8600 7.7 10.0 ... 9.0 \n", "4 3787 4028 7815 7.4 9.6 ... 8.1 \n", "\n", " Both sexes.12 Rate Ratio**, Both sexes.4 H/ L/ NS, Both sexes.4 \\\n", "0 7.2 0.85 L \n", "1 8.2 0.96 NS \n", "2 7.9 0.93 NS \n", "3 9.1 1.07 NS \n", "4 7.5 0.88 NS \n", "\n", " (95% CI) LL, Male.4 (95% CI) UL, Male.4 (95% CI) LL, Female.4 \\\n", "0 6.1 8.3 6.2 \n", "1 6.6 9.4 7.2 \n", "2 5.9 9.9 6.4 \n", "3 7.2 11.6 7.3 \n", "4 4.8 9.2 6.2 \n", "\n", " (95% CI) UL, Female.4 (95% CI) LL, Both sexes.4 (95% CI) UL, Both sexes.4 \n", "0 8.3 6.4 8.0 \n", "1 10.0 7.3 9.3 \n", "2 10.0 6.7 9.3 \n", "3 11.0 7.8 10.6 \n", "4 10.5 6.1 9.1 \n", "\n", "[5 rows x 81 columns]" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mentalhealth_neighb = pd.read_excel(\"2_ahd_neighb_db_ast_hbp_mhv_copd_2012.xls\", \n", " sheet_name = \"2_MentalHealthV_2012\", \n", " header = 11)\n", "mentalhealth_neighb.head()" ] }, { "cell_type": "code", "execution_count": 44, "id": "120278f2-7a7a-4f72-956c-d687034f3df2", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "Index(['Unnamed: 0', 'Unnamed: 1', ' Male', 'Female', 'Both sexes',\n", " ' Male.1', 'Female.1', 'Both sexes.1', ' Male.2', 'Female.2',\n", " 'Both sexes.2', 'Rate Ratio**, Both sexes', 'H/ L/ NS, Both sexes',\n", " '(95% CI) LL, Male', '(95% CI) UL, Male', '(95% CI) LL, Female',\n", " '(95% CI) UL, Female', '(95% CI) LL, Both sexes',\n", " '(95% CI) UL, Both 
sexes', ' Male.3', 'Female.3', 'Both sexes.3',\n", " 'Rate Ratio**, Both sexes.1', 'H/ L/ NS, Both sexes.1',\n", " '(95% CI) LL, Male.1', '(95% CI) UL, Male.1', '(95% CI) LL, Female.1',\n", " '(95% CI) UL, Female.1', '(95% CI) LL, Both sexes.1',\n", " '(95% CI) UL, Both sexes.1', ' Male.4', 'Female.4', 'Both sexes.4',\n", " ' Male.5', 'Female.5', 'Both sexes.5', ' Male.6', 'Female.6',\n", " 'Both sexes.6', 'Rate Ratio**, Both sexes.2', 'H/ L/ NS, Both sexes.2',\n", " '(95% CI) LL, Male.2', '(95% CI) UL, Male.2', '(95% CI) LL, Female.2',\n", " '(95% CI) UL, Female.2', '(95% CI) LL, Both sexes.2',\n", " '(95% CI) UL, Both sexes.2', ' Male.7', 'Female.7', 'Both sexes.7',\n", " ' Male.8', 'Female.8', 'Both sexes.8', ' Male.9', 'Female.9',\n", " 'Both sexes.9', 'Rate Ratio**, Both sexes.3', 'H/ L/ NS, Both sexes.3',\n", " '(95% CI) LL, Male.3', '(95% CI) UL, Male.3', '(95% CI) LL, Female.3',\n", " '(95% CI) UL, Female.3', '(95% CI) LL, Both sexes.3',\n", " '(95% CI) UL, Both sexes.3', ' Male.10', 'Female.10', 'Both sexes.10',\n", " ' Male.11', 'Female.11', 'Both sexes.11', ' Male.12', 'Female.12',\n", " 'Both sexes.12', 'Rate Ratio**, Both sexes.4', 'H/ L/ NS, Both sexes.4',\n", " '(95% CI) LL, Male.4', '(95% CI) UL, Male.4', '(95% CI) LL, Female.4',\n", " '(95% CI) UL, Female.4', '(95% CI) LL, Both sexes.4',\n", " '(95% CI) UL, Both sexes.4'],\n", " dtype='object')" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mentalhealth_neighb.columns" ] }, { "cell_type": "code", "execution_count": 45, "id": "c59fbbf0-7190-4f03-8753-67acd96e76c6", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Unnamed: 0Unnamed: 1Both sexes.2
01West Humber-Clairville7.4
12Mount Olive-Silverstone-Jamestown7.8
23Thistletown-Beaumond Heights7.8
34Rexdale-Kipling8.9
45Elms-Old Rexdale8.5
\n", "
" ], "text/plain": [ " Unnamed: 0 Unnamed: 1 Both sexes.2\n", "0 1 West Humber-Clairville 7.4\n", "1 2 Mount Olive-Silverstone-Jamestown 7.8\n", "2 3 Thistletown-Beaumond Heights 7.8\n", "3 4 Rexdale-Kipling 8.9\n", "4 5 Elms-Old Rexdale 8.5" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mh_visit_rates = mentalhealth_neighb[mentalhealth_neighb.columns[[0, 1, 10]]] # n_id, name, age-standardized rate\n", "mh_visit_rates.head()" ] }, { "cell_type": "markdown", "id": "eba71044", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "- `Unnamed: 0` corresponds to `n_id` in `depriv`\n", "\n", "- `Unnamed: 1` corresponds to `name` in `depriv`\n", "\n", "- `Both sexes.2` (column 10) corresponds to Age-Standardized rate of Mental Health Visits (2012), All Ages 20+ for both sexes\n", "\n", "- Rename the columns of `mh_visit_rates` so that columns shared with `depriv` have the same name." ] }, { "cell_type": "code", "execution_count": 46, "id": "ea078baf-66fd-4dd7-bf07-f913098627ff", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['n_id', 'name', 'mh_visit_rates_mf'], dtype='object')" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "colnames = {\"Unnamed: 0\": \"n_id\",\n", " \"Both sexes.2\" : \"mh_visit_rates_mf\",\n", " \"Unnamed: 1\" : \"name\"}\n", "\n", "mh_visit_rates = mh_visit_rates.copy()\n", "\n", "mh_visit_rates.rename(columns=colnames, inplace=True)\n", "\n", "mh_visit_rates.columns" ] }, { "cell_type": "markdown", "id": "5771b295", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Merging mental health visits and deprivation score" ] }, { "cell_type": "markdown", "id": "bea97dd1-23aa-4f67-8231-58bebda46a9c", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ "- Merge `mh_visit_rates` and `depriv` using `merge()`." 
] }, { "cell_type": "code", "execution_count": 47, "id": "e6baaf28-f07b-4ba2-8ad0-6eb164c7f62e", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
n_idnamemh_visit_rates_mfdeprivation
01West Humber-Clairville7.40.1620
12Mount Olive-Silverstone-Jamestown7.81.0195
23Thistletown-Beaumond Heights7.80.3460
34Rexdale-Kipling8.90.4704
45Elms-Old Rexdale8.50.8040
\n", "
" ], "text/plain": [ " n_id name mh_visit_rates_mf deprivation\n", "0 1 West Humber-Clairville 7.4 0.1620\n", "1 2 Mount Olive-Silverstone-Jamestown 7.8 1.0195\n", "2 3 Thistletown-Beaumond Heights 7.8 0.3460\n", "3 4 Rexdale-Kipling 8.9 0.4704\n", "4 5 Elms-Old Rexdale 8.5 0.8040" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mh_visit_depriv = mh_visit_rates.merge(depriv, on = [\"n_id\", \"name\"])\n", "mh_visit_depriv.head()" ] }, { "cell_type": "markdown", "id": "c1b305fe-c7d0-4a3a-b8a0-2bc9fefb857f", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Creating a categorical variable based on a numerical variable\n", "\n", "- We will create a variable that categorizes neighbourhoods above and below the median deprivation score." ] }, { "cell_type": "code", "execution_count": 48, "id": "af70608c-2667-424a-823d-971386e008ec", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.13124999999999998" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "median_depriv = mh_visit_depriv[\"deprivation\"].median()\n", "median_depriv" ] }, { "cell_type": "code", "execution_count": 49, "id": "ba7cd20f-5420-4ab3-a7aa-7b181b1178be", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
n_idnamemh_visit_rates_mfdeprivationdepriv_binary
131136West Hill8.90.8138High
132137Woburn7.20.5257High
133138Eglinton East8.31.0344High
134139Scarborough Village9.21.3915High
135140Guildwood8.2-0.4834Low
\n", "
" ], "text/plain": [ " n_id name mh_visit_rates_mf deprivation depriv_binary\n", "131 136 West Hill 8.9 0.8138 High\n", "132 137 Woburn 7.2 0.5257 High\n", "133 138 Eglinton East 8.3 1.0344 High\n", "134 139 Scarborough Village 9.2 1.3915 High\n", "135 140 Guildwood 8.2 -0.4834 Low" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mh_visit_depriv = mh_visit_depriv.copy()\n", "\n", "# create a new column in mh_visit_depriv called depriv_binary\n", "# classify neighbourhoods above the median as High\n", "\n", "mh_visit_depriv.loc[mh_visit_depriv[\"deprivation\"] > median_depriv, \"depriv_binary\"] = \"High\"\n", "\n", "# classify neighbourhoods at or below the median as Low\n", "mh_visit_depriv.loc[mh_visit_depriv[\"deprivation\"] <= median_depriv, \"depriv_binary\"] = \"Low\"\n", "\n", "mh_visit_depriv.tail(n=5) # print last 5 rows of mh_visit_depriv" ] }, { "cell_type": "markdown", "id": "416f67d7-72b4-44cc-8682-55adc6d50eea", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Do neighbourhoods with high deprivation have more mental health visits than neighbourhoods with low deprivation?\n", "\n", "- We can compare the mean mental health visit rates between the two groups." 
] }, { "cell_type": "code", "execution_count": 50, "id": "29eb5419-a1c2-41ca-aa88-217c17a63e76", "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "depriv_binary\n", "High 8.480882\n", "Low 7.733824\n", "Name: mh_visit_rates_mf, dtype: float64" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mean_table = mh_visit_depriv.groupby(\"depriv_binary\")[\"mh_visit_rates_mf\"].mean()\n", "mean_table" ] }, { "cell_type": "code", "execution_count": 51, "id": "5f2e46d8-d86f-427a-b962-cabc24798599", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.7470588235294136" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# High minus Low, selected by label rather than by position\n", "observed_mean_difference = mean_table[\"High\"] - mean_table[\"Low\"]\n", "observed_mean_difference" ] }, { "cell_type": "markdown", "id": "815883ca", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Does the difference represent a _real_ difference between the two populations or did we have a peculiar group in 2012?" ] }, { "cell_type": "markdown", "id": "1cf9fe2a-cecd-49d5-9300-9590f4959bdd", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## The Logic of Hypothesis Testing\n", "\n", "### 1. Hypotheses\n", "\n", "Two claims:\n", "\n", "1. There is no difference in the mean mental health visit rates between high and low deprivation neighbourhoods. This is called the **null** hypothesis.\n", "\n", "2. There is a difference in the mean mental health visit rates between high and low deprivation neighbourhoods. This is called the **alternative** hypothesis." ] }, { "cell_type": "markdown", "id": "396a61b0-0ea1-4fba-b654-3765b3b4bbc1", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### 2. 
Test statistic\n", "\n", "The test statistic is a number, **calculated from the data**, that captures what we're interested in.\n", "\n", "What would be a useful test statistic for this study?" ] }, { "cell_type": "markdown", "id": "b2e72994", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### 3. Simulate what the null hypothesis predicts will happen\n", "\n", "- If the null hypothesis is true then the mean values of high and low deprivation neighbourhoods will be the same regardless of how they are named or labelled. \n", "\n", "- That means we can randomly assign (or shuffle) the `High` and `Low` labels among the neighbourhoods and the mean difference should be close to 0." ] }, { "cell_type": "markdown", "id": "5e099fb7", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "+ Imagine we have 68 playing cards labelled `High` and 68 cards labelled `Low`." ] }, { "cell_type": "markdown", "id": "00fcfb17", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Shuffle the cards ..." ] }, { "cell_type": "markdown", "id": "f5a9c833", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Assign the cards to the 136 neighbourhoods then calculate the mean difference between high and low. This is one simulated value of the test statistic. " ] }, { "cell_type": "markdown", "id": "5727b614", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Shuffle the cards again ..." ] }, { "cell_type": "markdown", "id": "49af6fa4", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Assign the cards to the 136 neighbourhoods then calculate the mean difference between high and low. This is another simulated value of the test statistic. 
" ] }, { "cell_type": "markdown", "id": "88bfd64b-4672-4fda-aec2-98e8d0b835da", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Continue shuffling, assigning to neighbourhoods, and computing the mean difference." ] }, { "cell_type": "markdown", "id": "44c9b5ce-be09-41e1-973d-26f2da32b541", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Shuffling" ] }, { "cell_type": "markdown", "id": "03a939ea-d800-4c95-b7f0-0a5097ccd498", "metadata": {}, "source": [ "- The _observed_ difference in mean rate of mental health visits between high and low deprivation is 0.75.\n", "\n", "- Could this difference be due to chance?\n", "\n", "- Let's repeat this study **assuming that the difference is due to chance**.\n", "\n", "- Suppose that the (true) mean mental health visit rate in high deprivation neighbourhoods is equal to the (true) mean mental health visit rate in low deprivation neighbourhoods. Then the **labels** of `\"depriv_binary\"` `High` and `Low` are ..." ] }, { "cell_type": "markdown", "id": "b1a14244", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Quiz\n", "\n", "If the (true) mean mental health visit rate in high deprivation neighbourhoods is equal to the (true) mean mental health visit rate in low deprivation neighbourhoods, then the **labels** of `\"depriv_binary\"` `High` and `Low` on neighbourhoods are ...\n", "\n", "A. interchangeable (`High` can be swapped for `Low` without affecting the mean mental health visit rates)\n", "\n", "B. 
not interchangeable (changing any label from `High` to `Low` affects the mean mental health visit rates)\n" ] }, { "cell_type": "markdown", "id": "ba4c9cab-b103-46ac-bb42-bff179ec26e7", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "So, if \n", "\n", "> (The true) mean mental health visit rates in high deprivation neighbourhoods are equal to the (true) mean mental health visit rates in low deprivation neighbourhoods\n", "\n", "Then\n", "\n", "> The difference in mean mental health visit rates between `High` and `Low` deprivation neighbourhoods after switching any two `\"depriv_binary\"` labels was a possible observation.\n", "\n", "Thus,\n", "\n", "> Re-calculating the difference in mean mental health visit rates between `High` and `Low` deprivation neighbourhoods for **all possible** combinations of the **labels** will provide the distribution of the possible differences." ] }, { "cell_type": "code", "execution_count": 52, "id": "6d21e588", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "5.94910575592826e+39" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# the number of ways to give 68 of the 136 neighbourhoods the label High\n", "import math\n", "math.factorial(136) / (\n", "    math.factorial(68) * math.factorial(68)\n", ")" ] }, { "cell_type": "markdown", "id": "dcf90b52", "metadata": {}, "source": [ "_It will take ages (of the universe) even with the fastest supercomputer._" ] }, { "cell_type": "markdown", "id": "423918e0", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Approximate via simulation\n", "\n", "So, if \n", "\n", "> (The true) mean mental health visit rates in high deprivation neighbourhoods are equal to the (true) mean mental health visit rates in low deprivation neighbourhoods\n", "\n", "Then\n", "\n", "> The difference in mean mental health visit rates between `High` and `Low` deprivation neighbourhoods after switching any two `\"depriv_binary\"` 
labels was a possible observation.\n", "\n", "Thus,\n", "\n", "> Re-calculating the difference in mean mental health visit rates between `High` and `Low` deprivation neighbourhoods for **a set of randomly shuffled labels** will provide the _approximate_ distribution of the possible differences.\n" ] }, { "cell_type": "markdown", "id": "60311806", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Random shuffling\n", "\n", "- We can randomly shuffle using the `sample` method in `pandas`.\n", "- The parameter `frac` in the `pandas` `sample` method refers to the fraction of rows to return. `frac=1` means all the rows are returned." ] }, { "cell_type": "code", "execution_count": 53, "id": "6848f37f-6c35-4994-aa0a-4c5043acb247", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 High\n", "1 High\n", "2 High\n", "3 High\n", "4 High\n", "5 High\n", "6 Low\n", "7 High\n", "8 Low\n", "9 Low\n", "Name: depriv_binary, dtype: object" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mh_visit_depriv.iloc[0:10, 4]" ] }, { "cell_type": "code", "execution_count": 54, "id": "bc77ed95", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "8 Low\n", "5 High\n", "0 High\n", "2 High\n", "1 High\n", "9 Low\n", "7 High\n", "3 High\n", "6 Low\n", "4 High\n", "Name: depriv_binary, dtype: object" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.random.seed(7) # for reproducibility\n", "\n", "mh_visit_depriv.iloc[0:10, 4].sample(frac=1, replace=False)" ] }, { "cell_type": "markdown", "id": "de341713", "metadata": {}, "source": [ "- `reset_index` resets the index to the default row labels 0, 1, 2, ... \n", "- The argument `drop=True` in `reset_index` means the old index left over from `sample` is discarded instead of being saved as a column." 
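
A minimal sketch of what `sample` and `reset_index` do, using a small made-up Series rather than the course data. The name `labels` and its values are assumptions for illustration, and `random_state=7` is used here in place of `np.random.seed` so the block is self-contained.

```python
import pandas as pd

# Hypothetical toy labels: three High and three Low.
labels = pd.Series(["High", "High", "Low", "Low", "High", "Low"], name="label")

# frac=1 with replace=False returns every row exactly once, in random order.
shuffled = labels.sample(frac=1, replace=False, random_state=7)

# The original row labels travel with the values after sampling;
# reset_index(drop=True) discards them and renumbers the rows from 0.
shuffled_reset = shuffled.reset_index(drop=True)

print(shuffled.index.tolist())        # a reordering of 0..5
print(shuffled_reset.index.tolist())  # back to 0, 1, 2, ...
print(shuffled_reset.value_counts().to_dict())
```

Because sampling is done without replacement, the shuffled column always keeps the same number of `High` and `Low` labels as the original; only their positions change.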
] }, { "cell_type": "code", "execution_count": 55, "id": "c4adb67d-31e7-41c8-9059-b7cf4da5fe98", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 Low\n", "1 High\n", "2 High\n", "3 High\n", "4 High\n", "5 Low\n", "6 High\n", "7 High\n", "8 Low\n", "9 High\n", "Name: depriv_binary, dtype: object" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.random.seed(7)\n", "\n", "mh_visit_depriv.iloc[0:10,4].sample(frac=1, replace=False).reset_index(drop=True)" ] }, { "cell_type": "markdown", "id": "3db92599-56a9-4b80-a2d0-df97f548b63d", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "We can investigate what happens if the labels for `High` and `Low` are randomly shuffled for the first 11 rows (index labels 0 to 10). " ] }, { "cell_type": "code", "execution_count": 56, "id": "0f2511b5-0b93-4998-bafc-178d676dea54", "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
mh_visit_rates_mfdepriv_binary
07.4High
17.8High
27.8High
38.9High
48.5Low
57.6High
67.8High
78.9Low
88.2High
98.0Low
107.8Low
\n", "
" ], "text/plain": [ " mh_visit_rates_mf depriv_binary\n", "0 7.4 High\n", "1 7.8 High\n", "2 7.8 High\n", "3 8.9 High\n", "4 8.5 Low\n", "5 7.6 High\n", "6 7.8 High\n", "7 8.9 Low\n", "8 8.2 High\n", "9 8.0 Low\n", "10 7.8 Low" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# mental health visit rates for the first 11 rows\n", "# (.loc[0:10] includes both endpoints)\n", "\n", "visits = mh_visit_depriv.loc[0:10, \"mh_visit_rates_mf\"]\n", "\n", "# randomly shuffle depriv_binary for the same rows\n", "# previous index is dropped because we don't need it\n", "\n", "shuffled_depriv = mh_visit_depriv.loc[0:10, \"depriv_binary\"].sample(\n", "    frac=1, replace=False).reset_index(drop=True)\n", "\n", "# put the two Series in a list as input to\n", "# pd.concat\n", "\n", "L = [visits, shuffled_depriv]\n", "\n", "# combine the two columns\n", "pd.concat(L, axis = 1)\n" ] }, { "cell_type": "markdown", "id": "acbf5c74-0237-4ada-96cd-28724d635ac6", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Set up the simulation in Python\n", "\n", "**Step 1:** Shuffle the column `\"depriv_binary\"`\n", "\n", "- To do this we will use `pandas` `sample` with the following parameters:\n", "\n", "    + `frac=1` (sample 100% of values)\n", "    + `replace=False` (sample without replacement, so we keep the same number of `High` and `Low` labels as in the original sample)\n", " \n", "- `reset_index(drop=True)` (use the default index, the row labels 0, 1, 2, ... in `pandas`, so that we can assign the shuffled labels to the neighbourhoods)" ] }, { "cell_type": "code", "execution_count": 57, "id": "9a2421bd-4015-453a-b9b1-4fbf1e8aef8e", "metadata": {}, "outputs": [], "source": [ "depriv_binary_shuffle = mh_visit_depriv[\"depriv_binary\"].sample(\n", "    frac=1, replace=False).reset_index(drop=True)" ] }, { "cell_type": "markdown", "id": "ed59b00b-538b-4803-8170-5a924ae6bc32", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "**Step 2:** Assign the shuffled labels to the neighbourhoods and compute the mean rate of mental health visits" ] }, { 
"cell_type": "code", "execution_count": 58, "id": "dd0d0037-9c0d-434b-ae0d-7018289ef8a5", "metadata": {}, "outputs": [], "source": [ "visit_rate_high_shuffle = mh_visit_depriv.loc[depriv_binary_shuffle == \"High\", \"mh_visit_rates_mf\"].mean()\n", "visit_rate_low_shuffle = mh_visit_depriv.loc[depriv_binary_shuffle == \"Low\", \"mh_visit_rates_mf\"].mean()" ] }, { "cell_type": "markdown", "id": "9c881906-05e4-4e81-abd2-9bf49552cffb", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "**Step 3:** Compute the mean difference between the groups" ] }, { "cell_type": "code", "execution_count": 59, "id": "1591d57d-717a-4e67-8e1c-6c4cea8ff10a", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "-0.08823529411764675" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "visit_rate_high_shuffle - visit_rate_low_shuffle " ] }, { "cell_type": "markdown", "id": "1b6e1bb3-5f8f-47fa-95f9-94b34a7ac4e8", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "- Steps 1 - 3 are an algorithm for computing one simulated value of the mean difference when the null hypothesis is true.\n", "\n", "- Let's create a function to do Steps 1 - 3 that returns a simulated value of the mean difference (a simulated value of the test statistic)." 
] }, { "cell_type": "code", "execution_count": 60, "id": "6c49132b-0559-4632-b6ae-887f77eb0d07", "metadata": {}, "outputs": [], "source": [ "def random_shuffle_mean():\n", "    # step 1: shuffle the labels\n", "    depriv_binary_shuffle = mh_visit_depriv[\"depriv_binary\"].sample(\n", "        frac=1, replace=False).reset_index(drop=True)\n", "\n", "    # step 2: mean visit rate under each shuffled label\n", "    visit_rate_high_shuffle = mh_visit_depriv.loc[\n", "        depriv_binary_shuffle == \"High\", \"mh_visit_rates_mf\"].mean()\n", "    visit_rate_low_shuffle = mh_visit_depriv.loc[\n", "        depriv_binary_shuffle == \"Low\", \"mh_visit_rates_mf\"].mean()\n", "\n", "    # step 3: High minus Low\n", "    shuffled_diff = visit_rate_high_shuffle - visit_rate_low_shuffle\n", "\n", "    return shuffled_diff" ] }, { "cell_type": "code", "execution_count": 61, "id": "4612d41d-46c9-4633-a8eb-257cada992c5", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "-0.00588235294117645" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "random_shuffle_mean()" ] }, { "cell_type": "markdown", "id": "0b0acd61-a822-45c7-a1fc-2a217ea55ef7", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Distribution of simulated values of the mean difference assuming the null hypothesis is true\n", "\n", "- Statistical tests based on shuffles or permutations of the data are called permutation tests. " ] }, { "cell_type": "code", "execution_count": 62, "id": "8dc984a6-12e7-44c6-8689-6e9f9670dae6", "metadata": {}, "outputs": [], "source": [ "shuffled_diffs = []\n", "\n", "for _ in range(5000):\n", "    shuffled_diffs.append(random_shuffle_mean())" ] }, { "cell_type": "code", "execution_count": 63, "id": "f0a79597-b5ed-499f-86d7-3d1ae2dda22a", "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import matplotlib.pyplot as plt\n", "\n", "plt.hist(shuffled_diffs, color=\"grey\", edgecolor=\"white\")\n", "\n", "plt.vlines(x=observed_mean_difference, ymin=0, ymax=1600, color=\"pink\")\n", "\n", "plt.vlines(x=-1 * observed_mean_difference, ymin=0, ymax=1600, color=\"pink\");" ] }, { "cell_type": "markdown", "id": "e675b176-7eea-46ef-abc6-4ea6d12b2226", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## p-value\n", "\n", "- Assuming that the null hypothesis is true, the p-value gives a measure of the probability of getting data that are *at least* as unusual or extreme as the sample data.\n", "\n", "- What does \"at least as unusual\" mean?\n", "\n", "- An unusual value is a simulated mean difference (`High` minus `Low`) that is at least as far from 0 as the one we observed, in either direction.\n", "\n", "- That is, simulated values greater than or equal to the observed mean difference (the test statistic) of 0.75, or less than or equal to -0.75." 
] }, { "cell_type": "markdown", "id": "27462f52-dc20-4723-8310-39e1bb6d0abd", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The number of simulated mean differences that are greater than or equal to the observed mean difference can be computed using:" ] }, { "cell_type": "code", "execution_count": 64, "id": "b9931f03-9f89-4dda-9e85-4222ea35e5fd", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# convert the list to an array so the comparison is element-wise\n", "right_extreme = np.array(shuffled_diffs) >= observed_mean_difference\n", "\n", "right_extreme.sum()" ] }, { "cell_type": "markdown", "id": "78641883-0bb9-4e65-ba63-1d40e3b6c88d", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The number of simulated mean differences that are less than or equal to -1 times the observed mean difference can be computed using:" ] }, { "cell_type": "code", "execution_count": 65, "id": "fa253024-283a-49de-a895-29c2fe064240", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "left_extreme = np.array(shuffled_diffs) <= -1 * observed_mean_difference\n", "\n", "left_extreme.sum()" ] }, { "cell_type": "markdown", "id": "82bae078-3e00-4023-bfd0-683af3c5d820", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The proportion of simulations that are unusual or extreme is:" ] }, { "cell_type": "code", "execution_count": 66, "id": "01aa3a37-1b8e-4c1d-a96f-cf1f92976b61", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.0" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(right_extreme.sum() + left_extreme.sum()) / len(shuffled_diffs)" ] }, { "cell_type": "markdown", "id": "a76cffb3-ef04-4643-9710-781ed34f0d4d", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "We can also write a function to carry out this computation." 
] }, { "cell_type": "code", "execution_count": 67, "id": "1c63e16e-3a87-4feb-8b03-d5019759b713", "metadata": {}, "outputs": [], "source": [ "def pvalue_two_sided(shuff_diffs, obs_diffs):\n", "    # convert to an array so the comparisons below are element-wise\n", "    shuff_diffs = np.asarray(shuff_diffs)\n", "\n", "    right_extreme_count = (shuff_diffs >= obs_diffs).sum()\n", "\n", "    left_extreme_count = (shuff_diffs <= -1 * obs_diffs).sum()\n", "\n", "    all_extreme = right_extreme_count + left_extreme_count\n", "\n", "    pval = all_extreme / len(shuff_diffs)\n", "\n", "    return pval" ] }, { "cell_type": "code", "execution_count": 68, "id": "3c7e5631-76a3-4f32-9418-5e4e785cfbfd", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.0" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pvalue_two_sided(shuff_diffs=shuffled_diffs, obs_diffs=observed_mean_difference)" ] }, { "cell_type": "markdown", "id": "cecd15b0-2bb6-4023-8475-87267138f739", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "- This is called the **p-value**. \n", "\n", "- The entire procedure is often referred to as **significance testing**.\n", "\n", "- None of the 5000 permuted samples resulted in a mean difference at least as extreme as what we observed." ] }, { "cell_type": "markdown", "id": "ff7ec95d-3774-43bf-bbab-ceb0a5b9a413", "metadata": { "slideshow": { "slide_type": "subslide" }, "tags": [] }, "source": [ "## Conclusions\n", "\n", "So, there are two possibilities:\n", "\n", "1. The null hypothesis of no difference is true, but we have observed a **very rare** value of the mean difference in our study.\n", "\n", "2. The null hypothesis is **false**, and there is a difference between high and low deprivation.\n", "\n", "It's scientific convention to treat a small p-value as evidence in favour of 2 \n", "(i.e., the null hypothesis is false, and there is evidence of a difference in the means)." 
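
The whole procedure, from shuffling the labels to the p-value, can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the data are synthetic (two made-up groups of 68 values), and the names `values`, `is_high` and `mean_difference` are hypothetical and not part of the analysis above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): two groups of 68 values each.
high = rng.normal(loc=8.5, scale=1.0, size=68)
low = rng.normal(loc=7.7, scale=1.0, size=68)

values = np.concatenate([high, low])
is_high = np.array([True] * 68 + [False] * 68)


def mean_difference(values, is_high):
    """Mean of the High group minus mean of the Low group."""
    return values[is_high].mean() - values[~is_high].mean()


observed = mean_difference(values, is_high)

# Shuffle the labels many times and recompute the statistic each time.
n_shuffles = 2000
shuffled = np.array([
    mean_difference(values, rng.permutation(is_high))
    for _ in range(n_shuffles)
])

# Two-sided p-value: share of shuffles at least as far from 0 as the observed value.
p_value = np.mean(np.abs(shuffled) >= abs(observed))
print(observed, p_value)
```

Comparing absolute values counts both tails at once, which matches adding the right and left counts when the observed difference is positive.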
] }, { "cell_type": "markdown", "id": "0035ac2f", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## History of Significance Testing\n", "\n", "R.A. Fisher made fundamental contributions to statistics including developing significance testing.\n", "\n", "> \"_He was from an early age a supporter of certain eugenic ideas, and it is for this reason that he has been accused of being a racist and an advocate of forced sterilisation (Evans 2020). His promotion of eugenics has recently caused various organisations to remove his name from awards and dedications of buildings._\" ([Bodmer et al, 2021](https://www.nature.com/articles/s41437-020-00394-6))" ] }, { "cell_type": "markdown", "id": "d33f18f0-ca00-4489-aad5-9fad858a9b4b", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Causality\n", "\n", "Imagine...\n", "\n", "
- You have a headache.\n", "
\n", "\n" ] }, { "cell_type": "markdown", "id": "4720a94f", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "
- You take an Aspirin at 10:00 to relieve your pain.\n", "
- Your pain goes away after 30 minutes.\n", "
\n", "\n" ] }, { "cell_type": "markdown", "id": "366ff202", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "
- Now, you go back in time to 10:00 and you don't take an Aspirin.\n", "
- Your pain goes away after 48 minutes.\n", "
\n", "\n" ] }, { "cell_type": "markdown", "id": "c3ca77ca", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The **causal effect** of taking an Aspirin is 18 minutes (48 minutes - 30 minutes).\n" ] }, { "cell_type": "markdown", "id": "a07027d9", "metadata": {}, "source": [ "![](causal.jpeg)" ] }, { "cell_type": "markdown", "id": "7859b66e", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Potential outcomes and randomized controlled trials\n", "\n", "- Establishing causality involves comparing these **potential outcomes**. " ] }, { "cell_type": "markdown", "id": "b7a9922e", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- The problem is that we can never observe both taking an Aspirin and not taking an Aspirin (in the same person at the same time under the same conditions)." ] }, { "cell_type": "markdown", "id": "9e350596", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- A close approximation to comparing potential outcomes is to randomly assign people to two groups, so that the groups are similar on average (age, sex, income, etc.), and then have one group take Aspirin after a headache while the other group takes a fake Aspirin (sugar pill/placebo) after a headache. This is an example of a randomized controlled trial." ] }, { "cell_type": "markdown", "id": "8d4f8cd5-2b31-46d3-aad4-897bbb3021e2", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Then the mean difference in time to pain relief between the two groups should be due to Aspirin and not to other factors related to why people may or may not take an Aspirin. 
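
A minimal sketch of why random assignment helps, using made-up numbers: the 48-minute baseline, the constant 18-minute effect, the 10-minute spread, the seed, and names such as `minutes_with_aspirin` are all assumptions for illustration, not course data. Each simulated person has both potential outcomes, but only one of them is observed, depending on a coin flip.

```python
import numpy as np

rng = np.random.default_rng(274)  # arbitrary seed, for reproducibility only
n_people = 10_000

# Hypothetical potential outcomes: minutes until pain relief for each person.
minutes_without_aspirin = rng.normal(loc=48, scale=10, size=n_people)
minutes_with_aspirin = minutes_without_aspirin - 18  # assumed constant effect

# Random assignment: a fair coin flip decides who takes the real Aspirin.
takes_aspirin = rng.random(n_people) < 0.5

# Only one potential outcome per person is ever observed.
observed_minutes = np.where(
    takes_aspirin, minutes_with_aspirin, minutes_without_aspirin
)

# Difference in group means: placebo group minus Aspirin group.
estimated_effect = (
    observed_minutes[~takes_aspirin].mean()
    - observed_minutes[takes_aspirin].mean()
)
print(round(estimated_effect, 1))
```

With a sample this large, the estimated difference should land close to the assumed 18 minutes, even though no simulated person is observed under both conditions.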
" ] }, { "cell_type": "markdown", "id": "5c44bf71-07a9-46af-aa37-6f46db59c025", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Does Material Deprivation Cause increases in mental health visits?\n", "\n", "- Neighbourhoods can't be randomly assigned to high or low deprivationdepravity, like people could be randomized to take a real or fake Aspirin.\n", "\n", "- There could be many other factors related to why a neighbourhood has a low or high deprivation score. \n", "\n", "- This means that when we are comparing neighbourhoods with low deprivation to high deprivation the differences in a certain outcome such as mental health visit rates could be due to factors other than deprivation such as environmental factors that are associated with mental illness, but happen to be more or less prevalent in one of the deprivation groups.\n", "\n" ] }, { "cell_type": "markdown", "id": "a13007b5-a193-4a36-b630-906be48578dc", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Just in case you have seen a two-sample t-test ...\n", "\n", "This is similar to a two-sample t-test (we won't cover in this course except here) except not as flexible. 
" ] }, { "cell_type": "code", "execution_count": 69, "id": "9be7e8d7-eb51-45a5-b774-e2c7eb8348d1", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The p-value from the two-sample t-test is: 0.0\n" ] } ], "source": [ "from scipy import stats\n", "\n", "mh_visits_low = mh_visit_depriv.loc[\n", " mh_visit_depriv[\"depriv_binary\"] == \"Low\", \"mh_visit_rates_mf\"]\n", "\n", "mh_visits_high = mh_visit_depriv.loc[\n", " mh_visit_depriv[\"depriv_binary\"] == \"High\", \"mh_visit_rates_mf\"]\n", "\n", "statistic, pvalue = stats.ttest_ind(mh_visits_high, mh_visits_low)\n", "\n", "print(\"The p-value from the two-sample t-test is: \", round(pvalue, 3))" ] }, { "cell_type": "markdown", "id": "798bf90b", "metadata": {}, "source": [ "The two-sample t-test involves: \n", "\n", "1. Computing $t_{\\text observed} = \\frac{\\bar{x_{\\text Low}} - \\bar{x_{\\text High}}}{SD}$, where $SD$ is an estimate of the standard deviation of the mean difference.\n", "\n", "2. Computing the p-value using $t_{\\text observed}$ and the appropriate $t$-distribution.\n", "\n", "3. Assessing several statistical assumptions about the data to ensure the accuracy of the p-value.\n" ] }, { "cell_type": "markdown", "id": "5a671c9e", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Review of this class\n", "\n", "- Comparing two samples by simulating the distribution of a test statistic (e.g., the difference in two means) assuming the null hypothesis is true (i.e., there is no difference in the means) performs a statistical test:\n", "\n", " \n", " 1. Stated the null and alternative hypotheses;\n", " \n", " 2. imulated the test statistic assuming the null hypotheis by:\n", " \n", " + Randomly shuffling the group labels (e.g., high/low deprivation);\n", " \n", " + Calculating the test statistic in the shuffled data se; and\n", " \n", " + Repeating the previous two steps a large number of times (e.g., 5,000).\n", " \n", " 3. 
Computed the p-value as a measure of how consistent the data are with the null hypothesis. \n", "\n", " + The p-value is computed by counting the simulated values that are at least as extreme as the observed value in the positive or negative direction, and dividing by the number of simulations.\n" ] }, { "cell_type": "markdown", "id": "9575cf62", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "\n", "- If the p-value is small then the data are inconsistent with the null hypothesis and there is evidence in support of the alternative hypothesis.\n", "\n", "- A large p-value isn't an indication of evidence in support of the null hypothesis. The procedure only tests whether you have data inconsistent with the null hypothesis.\n", "\n", "- A small p-value may not be **causal** evidence for the alternative hypothesis unless random assignment was used as part of the study design." ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "ggr274", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.13" } }, "nbformat": 4, "nbformat_minor": 5 }