{ "cells": [ { "cell_type": "markdown", "id": "204a9e40", "metadata": {}, "source": [ "# GGR274 Lab 4: Introduction to Data Wrangling, Part 1\n", "\n", "## Logistics\n", "\n", "Like last week, our lab grade will be based on attendance and submission of a few small tasks to MarkUs during the lab session (or by 23:59 on Thursday).\n", "\n", "Complete the tasks in this Jupyter notebook and submit your completed file to [MarkUs](https://markus-ds.teach.cs.toronto.edu).\n", "Here are the instructions for submitting to MarkUs (same as last week):\n", "\n", "1. Download this file (`Lab_4.ipynb`) from JupyterHub. (See [our JupyterHub Guide](../../../guides/jupyterhub_guide.ipynb) for detailed instructions.)\n", "2. Submit this file to MarkUs under the **lab4** assignment. (See [our MarkUs Guide](../../../guides/markus_guide.ipynb) for detailed instructions.)\n", "\n", "Note: there's no autograding set up for this week's lab, but your TA will be checking that your submitted lab file is complete as part of your \"lab attendance\" grade." 
] }, { "cell_type": "markdown", "id": "535ac824", "metadata": {}, "source": [ "## Task 1: Read the CSV file into a `DataFrame`\n", "\n", "Read the CSV file `ArrestsStripSearches.csv`, which contains the Toronto Police Race and Identity Based Data on arrests and strip searches, into a pandas `DataFrame` called `police_df`.\n", "\n", "_The file is located in the same folder as the notebook._" ] }, { "cell_type": "code", "execution_count": 1, "id": "da33b4b3", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/var/folders/0j/ybsv4ncn5w50v40vdh5jjlww0000gn/T/ipykernel_79053/1181357276.py:1: DeprecationWarning: \n", "Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),\n", "(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)\n", "but was not found to be installed on your system.\n", "If this would cause problems for you,\n", "please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466\n", " \n", " import pandas as pd\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " _id Arrest_Year Arrest_Month EventID ArrestID PersonID \\\n", "0 1 2020 July-Sept 1005907 6017884.0 326622 \n", "1 2 2020 July-Sept 1014562 6056669.0 326622 \n", "2 3 2020 Oct-Dec 1029922 6057065.0 326622 \n", "3 4 2021 Jan-Mar 1052190 6029059.0 327535 \n", "4 5 2021 Jan-Mar 1015512 6040372.0 327535 \n", "\n", " Perceived_Race Sex Age_group__at_arrest_ \\\n", "0 White M Aged 35 to 44 years \n", "1 White M Aged 35 to 44 years \n", "2 Unknown or Legacy M Aged 35 to 44 years \n", "3 Black M Aged 25 to 34 years \n", "4 South Asian M Aged 25 to 34 years \n", "\n", " Youth_at_arrest__under_18_years ... Actions_at_arrest___Resisted__d \\\n", "0 Not a youth ... NaN \n", "1 Not a youth ... NaN \n", "2 Not a youth ... NaN \n", "3 Not a youth ... NaN \n", "4 Not a youth ... NaN \n", "\n", " Actions_at_arrest___Mental_inst Actions_at_arrest___Assaulted_o \\\n", "0 NaN NaN \n", "1 NaN NaN \n", "2 NaN NaN \n", "3 NaN NaN \n", "4 NaN NaN \n", "\n", " Actions_at_arrest___Cooperative SearchReason_CauseInjury \\\n", "0 1.0 NaN \n", "1 NaN NaN \n", "2 NaN NaN \n", "3 NaN NaN \n", "4 NaN NaN \n", "\n", " SearchReason_AssistEscape SearchReason_PossessWeapons \\\n", "0 NaN NaN \n", "1 NaN NaN \n", "2 NaN NaN \n", "3 NaN NaN \n", "4 NaN NaN \n", "\n", " SearchReason_PossessEvidence ItemsFound ObjectId \n", "0 NaN NaN 1 \n", "1 NaN NaN 2 \n", "2 NaN NaN 3 \n", "3 NaN NaN 4 \n", "4 NaN NaN 5 \n", "\n", "[5 rows x 26 columns]" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "police_df = pd.read_csv('ArrestsStripSearches.csv')\n", "\n", "police_df.head()" ] }, { "cell_type": "markdown", "id": "fe4903d4", "metadata": {}, "source": [ "## Task 2: Create a `DataFrame` of arrests in 2021\n", "\n", "### a. 
Subset the tabular data by rows\n", "\n", "- Create a boolean `Series` named `Arrests_2021`, computed from `police_df`, that is `True` for each row where `Arrest_Year` is `2021` and `False` otherwise.\n", "\n", "- Use `Arrests_2021` to select the rows of `police_df` that correspond to arrests in 2021. Save this new `DataFrame` in a variable called `police_2021_df`. Examine the `head()` of this `DataFrame`.\n" ] }, { "cell_type": "code", "execution_count": 5, "id": "840760c8", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " _id Arrest_Year Arrest_Month EventID ArrestID PersonID Perceived_Race \\\n", "3 4 2021 Jan-Mar 1052190 6029059.0 327535 Black \n", "4 5 2021 Jan-Mar 1015512 6040372.0 327535 South Asian \n", "5 6 2021 Apr-June 1019145 6060688.0 327535 South Asian \n", "6 7 2021 Jan-Mar 1035445 6053833.0 330778 Black \n", "7 8 2021 Jan-Mar 1050464 6063477.0 330778 Black \n", "\n", " Sex Age_group__at_arrest_ Youth_at_arrest__under_18_years ... \\\n", "3 M Aged 25 to 34 years Not a youth ... \n", "4 M Aged 25 to 34 years Not a youth ... \n", "5 M Aged 25 to 34 years Not a youth ... \n", "6 M Aged 25 to 34 years Not a youth ... \n", "7 M Aged 25 to 34 years Not a youth ... \n", "\n", " Actions_at_arrest___Resisted__d Actions_at_arrest___Mental_inst \\\n", "3 NaN NaN \n", "4 NaN NaN \n", "5 NaN NaN \n", "6 NaN NaN \n", "7 NaN NaN \n", "\n", " Actions_at_arrest___Assaulted_o Actions_at_arrest___Cooperative \\\n", "3 NaN NaN \n", "4 NaN NaN \n", "5 NaN 1.0 \n", "6 NaN NaN \n", "7 NaN 1.0 \n", "\n", " SearchReason_CauseInjury SearchReason_AssistEscape \\\n", "3 NaN NaN \n", "4 NaN NaN \n", "5 NaN NaN \n", "6 NaN NaN \n", "7 NaN NaN \n", "\n", " SearchReason_PossessWeapons SearchReason_PossessEvidence ItemsFound \\\n", "3 NaN NaN NaN \n", "4 NaN NaN NaN \n", "5 NaN NaN NaN \n", "6 NaN NaN NaN \n", "7 NaN NaN NaN \n", "\n", " ObjectId \n", "3 4 \n", "4 5 \n", "5 6 \n", "6 7 \n", "7 8 \n", "\n", "[5 rows x 26 columns]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Arrests_2021 = police_df['Arrest_Year'] == 2021\n", "\n", "police_2021_df = police_df[Arrests_2021]\n", "\n", "police_2021_df.head()" ] }, { "cell_type": "markdown", "id": "b97e5e1e", "metadata": {}, "source": [ "### b. Inspect columns of the tabular data\n", "\n", "- Create a variable called `Arrest_column_names` that stores a list of the column names of `police_2021_df` and print the list. 
" ] }, { "cell_type": "code", "execution_count": 3, "id": "f89b19a6", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['_id',\n", " 'Arrest_Year',\n", " 'Arrest_Month',\n", " 'EventID',\n", " 'ArrestID',\n", " 'PersonID',\n", " 'Perceived_Race',\n", " 'Sex',\n", " 'Age_group__at_arrest_',\n", " 'Youth_at_arrest__under_18_years',\n", " 'ArrestLocDiv',\n", " 'StripSearch',\n", " 'Booked',\n", " 'Occurrence_Category',\n", " 'Actions_at_arrest___Concealed_i',\n", " 'Actions_at_arrest___Combative__',\n", " 'Actions_at_arrest___Resisted__d',\n", " 'Actions_at_arrest___Mental_inst',\n", " 'Actions_at_arrest___Assaulted_o',\n", " 'Actions_at_arrest___Cooperative',\n", " 'SearchReason_CauseInjury',\n", " 'SearchReason_AssistEscape',\n", " 'SearchReason_PossessWeapons',\n", " 'SearchReason_PossessEvidence',\n", " 'ItemsFound',\n", " 'ObjectId']" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Arrest_column_names = list(police_2021_df)\n", "\n", "Arrest_column_names" ] }, { "cell_type": "markdown", "id": "ac06b58c", "metadata": {}, "source": [ "## Task 3: Select columns from `police_2021_df` and examine the distributions of each column\n", "\n", "- Create a new `DataFrame` from `police_2021_df` with the following columns: `_id`, `Sex`, `Perceived_Race`, `SearchReason_PossessWeapons`, and assign it to the variable `police_2021_raceweapons`. Examine the `head()` of this `DataFrame`.\n", "\n", "- Use `.value_counts()` to compute the distributions of `Sex`, `Perceived_Race`, and `SearchReason_PossessWeapons`. You do not need to save these values in variables, though you may do so if you want." ] }, { "cell_type": "code", "execution_count": 6, "id": "1f5e7c72", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " _id Sex Perceived_Race SearchReason_PossessWeapons\n", "3 4 M Black NaN\n", "4 5 M South Asian NaN\n", "5 6 M South Asian NaN\n", "6 7 M Black NaN\n", "7 8 M Black NaN" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "police_2021_raceweapons = police_2021_df[['_id', 'Sex', 'Perceived_Race', 'SearchReason_PossessWeapons']]\n", "\n", "police_2021_raceweapons.head()" ] }, { "cell_type": "code", "execution_count": 7, "id": "0fee1817", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Sex\n", "M 26815\n", "F 6479\n", "U 3\n", "Name: count, dtype: int64" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "police_2021_df['Sex'].value_counts()" ] }, { "cell_type": "code", "execution_count": 8, "id": "c3c49996", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Perceived_Race\n", "White 14116\n", "Black 8878\n", "Unknown or Legacy 2444\n", "East/Southeast Asian 2361\n", "South Asian 1871\n", "Middle-Eastern 1730\n", "Latino 960\n", "Indigenous 935\n", "Name: count, dtype: int64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "police_2021_df['Perceived_Race'].value_counts()" ] }, { "cell_type": "code", "execution_count": 9, "id": "a71d1796", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "SearchReason_PossessWeapons\n", "0.0 478\n", "1.0 208\n", "Name: count, dtype: int64" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "police_2021_df['SearchReason_PossessWeapons'].value_counts()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.13" }, "vscode": { "interpreter": { "hash": 
"8b8edaa195e148f815789564e9a10f57d8b792ac9e1a5daafce5fbae42bebd0e" } } }, "nbformat": 4, "nbformat_minor": 5 }