GGR274 Lab 4: Introduction to Data Wrangling, Part 1
Logistics
Like last week, your lab grade will be based on attendance and on submitting a few small tasks to MarkUs during the lab session (or by 23:59 on Thursday).
Complete the tasks in this Jupyter notebook and submit your completed file to MarkUs. Here are the instructions for submitting to MarkUs (same as last week):
Download this file (Lab_4.ipynb) from JupyterHub. (See our JupyterHub Guide for detailed instructions.)
Submit this file to MarkUs under the lab4 assignment. (See our MarkUs Guide for detailed instructions.)
Note: there’s no autograding set up for this week’s lab, but your TA will be checking that your submitted lab file is complete as part of your “lab attendance” grade.
Task 1: Read the CSV file into a DataFrame
Read the CSV file ArrestsStripSearches.csv into a pandas DataFrame called police_df. The file contains the Toronto Police Race and Identity Based Data on arrests and strip searches, and it is located in the same folder as the notebook.
import pandas as pd
police_df = pd.read_csv('ArrestsStripSearches.csv')
police_df.head()
| | _id | Arrest_Year | Arrest_Month | EventID | ArrestID | PersonID | Perceived_Race | Sex | Age_group__at_arrest_ | Youth_at_arrest__under_18_years | ... | Actions_at_arrest___Resisted__d | Actions_at_arrest___Mental_inst | Actions_at_arrest___Assaulted_o | Actions_at_arrest___Cooperative | SearchReason_CauseInjury | SearchReason_AssistEscape | SearchReason_PossessWeapons | SearchReason_PossessEvidence | ItemsFound | ObjectId |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2020 | July-Sept | 1005907 | 6017884.0 | 326622 | White | M | Aged 35 to 44 years | Not a youth | ... | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | 1 |
| 1 | 2 | 2020 | July-Sept | 1014562 | 6056669.0 | 326622 | White | M | Aged 35 to 44 years | Not a youth | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2 |
| 2 | 3 | 2020 | Oct-Dec | 1029922 | 6057065.0 | 326622 | Unknown or Legacy | M | Aged 35 to 44 years | Not a youth | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 3 |
| 3 | 4 | 2021 | Jan-Mar | 1052190 | 6029059.0 | 327535 | Black | M | Aged 25 to 34 years | Not a youth | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 4 |
| 4 | 5 | 2021 | Jan-Mar | 1015512 | 6040372.0 | 327535 | South Asian | M | Aged 25 to 34 years | Not a youth | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 5 |
5 rows × 26 columns
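If you want to experiment with pd.read_csv without touching the lab file, the sketch below may help. It reads a tiny, made-up CSV from a string instead of from disk; the names csv_text and toy_df and the three rows of data are hypothetical and are not part of the lab dataset.

```python
import io

import pandas as pd

# Hypothetical three-row CSV standing in for a real file on disk.
csv_text = """_id,Arrest_Year,Sex
1,2020,M
2,2021,F
3,2021,M
"""

# pd.read_csv accepts a file path or any file-like object,
# so a StringIO buffer works the same way as a file name.
toy_df = pd.read_csv(io.StringIO(csv_text))

print(toy_df.shape)    # (number of rows, number of columns)
print(toy_df.dtypes)   # the type pandas inferred for each column
print(toy_df.head(2))  # only the first two rows
```

Checking shape and dtypes right after loading is a quick way to confirm that the file was read as expected. For a wide table such as police_df, where head() abbreviates the middle columns as "...", running pd.set_option("display.max_columns", None) beforehand should make pandas display every column.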
Task 2: Create a DataFrame of arrests in 2021
a. Subset the tabular data by rows
Create a boolean variable named Arrests_2021, computed from police_df, that is True if Arrest_Year is 2021 and False otherwise.
Use Arrests_2021 to select the rows of police_df that correspond to arrests in 2021. Save this new DataFrame in a variable called police_2021_df. Examine the head() of this DataFrame.
Arrests_2021 = police_df['Arrest_Year'] == 2021
police_2021_df = police_df[Arrests_2021]
police_2021_df.head()
| | _id | Arrest_Year | Arrest_Month | EventID | ArrestID | PersonID | Perceived_Race | Sex | Age_group__at_arrest_ | Youth_at_arrest__under_18_years | ... | Actions_at_arrest___Resisted__d | Actions_at_arrest___Mental_inst | Actions_at_arrest___Assaulted_o | Actions_at_arrest___Cooperative | SearchReason_CauseInjury | SearchReason_AssistEscape | SearchReason_PossessWeapons | SearchReason_PossessEvidence | ItemsFound | ObjectId |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 4 | 2021 | Jan-Mar | 1052190 | 6029059.0 | 327535 | Black | M | Aged 25 to 34 years | Not a youth | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 4 |
| 4 | 5 | 2021 | Jan-Mar | 1015512 | 6040372.0 | 327535 | South Asian | M | Aged 25 to 34 years | Not a youth | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 5 |
| 5 | 6 | 2021 | Apr-June | 1019145 | 6060688.0 | 327535 | South Asian | M | Aged 25 to 34 years | Not a youth | ... | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | 6 |
| 6 | 7 | 2021 | Jan-Mar | 1035445 | 6053833.0 | 330778 | Black | M | Aged 25 to 34 years | Not a youth | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 7 |
| 7 | 8 | 2021 | Jan-Mar | 1050464 | 6063477.0 | 330778 | Black | M | Aged 25 to 34 years | Not a youth | ... | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | 8 |
5 rows × 26 columns
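To see what the boolean variable in this task looks like, here is a minimal sketch on a four-row toy table. The names toy_df, is_2021, and toy_2021 and the values in the table are made up for illustration; only the technique (comparing a column to a value, then indexing with the result) mirrors the task.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration.
toy_df = pd.DataFrame({
    "_id": [1, 2, 3, 4],
    "Arrest_Year": [2020, 2021, 2021, 2020],
})

# Comparing a column to a value gives a boolean Series
# with one True/False entry per row.
is_2021 = toy_df["Arrest_Year"] == 2021
print(is_2021.tolist())

# Indexing with the boolean Series keeps only the rows where it is True.
# The kept rows retain their original index labels.
toy_2021 = toy_df[is_2021]
print(toy_2021.index.tolist())

# Summing a boolean Series counts its True values,
# which should match the number of rows kept.
print(is_2021.sum(), len(toy_2021))
```

Because filtering keeps the original index labels, the head() of police_2021_df above starts at index 3 rather than 0: rows 0 to 2 are 2020 arrests and were dropped.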
b. Inspect columns of the tabular data
Create a variable called Arrest_column_names that stores a list of the column names of police_2021_df, and print the list.
Arrest_column_names = list(police_2021_df)
Arrest_column_names
['_id',
'Arrest_Year',
'Arrest_Month',
'EventID',
'ArrestID',
'PersonID',
'Perceived_Race',
'Sex',
'Age_group__at_arrest_',
'Youth_at_arrest__under_18_years',
'ArrestLocDiv',
'StripSearch',
'Booked',
'Occurrence_Category',
'Actions_at_arrest___Concealed_i',
'Actions_at_arrest___Combative__',
'Actions_at_arrest___Resisted__d',
'Actions_at_arrest___Mental_inst',
'Actions_at_arrest___Assaulted_o',
'Actions_at_arrest___Cooperative',
'SearchReason_CauseInjury',
'SearchReason_AssistEscape',
'SearchReason_PossessWeapons',
'SearchReason_PossessEvidence',
'ItemsFound',
'ObjectId']
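Calling list() on a DataFrame works because iterating over a DataFrame yields its column labels. The sketch below, on a hypothetical one-row toy_df, compares that approach with the more explicit .columns.tolist(); both should give the same list.

```python
import pandas as pd

# Hypothetical one-row table; only the column labels matter here.
toy_df = pd.DataFrame({"_id": [1], "Sex": ["M"], "Perceived_Race": ["White"]})

# Iterating over a DataFrame yields its column labels,
# so list() collects them in order.
names_from_list = list(toy_df)

# .columns is an Index object; .tolist() converts it to a plain Python list.
names_from_columns = toy_df.columns.tolist()

print(names_from_list)
print(names_from_list == names_from_columns)

# A quick membership check before selecting a column by name.
print("Sex" in toy_df.columns)
```

Either form is fine for this task. The .columns version may be easier to read later, since it says explicitly that you want column names.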
Task 3: Select columns from police_2021_df and examine the distribution of each column
Create a new DataFrame from police_2021_df with the following columns: _id, Sex, Perceived_Race, SearchReason_PossessWeapons, and assign it to the variable police_2021_raceweapons. Examine the head() of this DataFrame.
Use .value_counts() to compute the distributions of Sex, Perceived_Race, and SearchReason_PossessWeapons. You do not need to save these values in variables, though you may do so if you want.
police_2021_raceweapons = police_2021_df[['_id', 'Sex', 'Perceived_Race', 'SearchReason_PossessWeapons']]
police_2021_raceweapons.head()
| | _id | Sex | Perceived_Race | SearchReason_PossessWeapons |
|---|---|---|---|---|
| 3 | 4 | M | Black | NaN |
| 4 | 5 | M | South Asian | NaN |
| 5 | 6 | M | South Asian | NaN |
| 6 | 7 | M | Black | NaN |
| 7 | 8 | M | Black | NaN |
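The double square brackets in the column selection are easy to misread, so here is a small sketch of the difference between single and double brackets. The table toy_df and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration.
toy_df = pd.DataFrame({
    "_id": [1, 2],
    "Sex": ["M", "F"],
    "Booked": [1, 0],
})

# Single brackets with one label return a Series (a single column).
one_column = toy_df["Sex"]

# Double brackets pass a list of labels and return a DataFrame,
# with columns in the order given in the list.
two_columns = toy_df[["Sex", "_id"]]

print(type(one_column).__name__)
print(type(two_columns).__name__)
print(two_columns.columns.tolist())
```

The inner brackets are an ordinary Python list, so you could also store the list of column names in a variable first and pass that variable inside the outer brackets.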
police_2021_df['Sex'].value_counts()
Sex
M 26815
F 6479
U 3
Name: count, dtype: int64
police_2021_df['Perceived_Race'].value_counts()
Perceived_Race
White 14116
Black 8878
Unknown or Legacy 2444
East/Southeast Asian 2361
South Asian 1871
Middle-Eastern 1730
Latino 960
Indigenous 935
Name: count, dtype: int64
police_2021_df['SearchReason_PossessWeapons'].value_counts()
SearchReason_PossessWeapons
0.0 478
1.0 208
Name: count, dtype: int64
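Note that the counts for SearchReason_PossessWeapons add up to 686, while the counts for Sex add up to 33,297. This is consistent with most rows having a missing value (NaN) in that column, because .value_counts() leaves NaN out by default. The sketch below shows this behaviour on a hypothetical six-value Series named toy_flag, along with the dropna and normalize parameters.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with three missing values; not taken from the lab data.
toy_flag = pd.Series([1.0, 0.0, np.nan, np.nan, 0.0, np.nan], name="toy_flag")

# By default, NaN values are excluded from the counts.
counts = toy_flag.value_counts()
print(counts)

# dropna=False adds a row that counts the missing values.
counts_with_missing = toy_flag.value_counts(dropna=False)
print(counts_with_missing)

# normalize=True returns proportions instead of counts
# (computed over the non-missing values here).
shares = toy_flag.value_counts(normalize=True)
print(shares)
```

If you try dropna=False on the lab columns, compare the total with len(police_2021_df) to check that every row is accounted for.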