GGR274 Lab 4: Introduction to Data Wrangling, Part 1

Logistics

Like last week, your lab grade will be based on attendance and on submitting a few small tasks to MarkUs during the lab session (or by 23:59 on Thursday).

Complete the tasks in this Jupyter notebook and submit your completed file to MarkUs. Here are the instructions for submitting to MarkUs (same as last week):

Download this file (Lab_4.ipynb) from JupyterHub. (See our JupyterHub Guide for detailed instructions.)
Submit this file to MarkUs under the lab4 assignment. (See our MarkUs Guide for detailed instructions.)

Note: there’s no autograding set up for this week’s lab, but your TA will be checking that your submitted lab file is complete as part of your “lab attendance” grade.

Task 1: Read the CSV file into a `DataFrame`

Read the CSV file ArrestsStripSearches.csv into a pandas DataFrame called police_df. The file contains the Toronto Police Race and Identity Based Data - Arrests and Strip Searches dataset.

The file is located in the same folder as the notebook.

import pandas as pd

police_df = pd.read_csv('ArrestsStripSearches.csv')

police_df.head()

	_id	Arrest_Year	Arrest_Month	EventID	ArrestID	PersonID	Perceived_Race	Sex	Age_group__at_arrest_	Youth_at_arrest__under_18_years	...	Actions_at_arrest___Resisted__d	Actions_at_arrest___Mental_inst	Actions_at_arrest___Assaulted_o	Actions_at_arrest___Cooperative	SearchReason_CauseInjury	SearchReason_AssistEscape	SearchReason_PossessWeapons	SearchReason_PossessEvidence	ItemsFound	ObjectId
0	1	2020	July-Sept	1005907	6017884.0	326622	White	M	Aged 35 to 44 years	Not a youth	...	NaN	NaN	NaN	1.0	NaN	NaN	NaN	NaN	NaN	1
1	2	2020	July-Sept	1014562	6056669.0	326622	White	M	Aged 35 to 44 years	Not a youth	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2
2	3	2020	Oct-Dec	1029922	6057065.0	326622	Unknown or Legacy	M	Aged 35 to 44 years	Not a youth	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	3
3	4	2021	Jan-Mar	1052190	6029059.0	327535	Black	M	Aged 25 to 34 years	Not a youth	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	4
4	5	2021	Jan-Mar	1015512	6040372.0	327535	South Asian	M	Aged 25 to 34 years	Not a youth	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	5

5 rows × 26 columns
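
After loading a file, it is often worth confirming that the result has the size and column types you expect. The sketch below is only an illustration. As an assumption, it reads a few columns of the five rows displayed above from an in-memory string instead of from ArrestsStripSearches.csv, so that it runs without the file. The names sample_csv and sample_df are hypothetical.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real file: a few columns from the five rows
# displayed above, stored in a string so the example runs without the CSV.
sample_csv = """_id,Arrest_Year,Arrest_Month,PersonID,Perceived_Race,Sex
1,2020,July-Sept,326622,White,M
2,2020,July-Sept,326622,White,M
3,2020,Oct-Dec,326622,Unknown or Legacy,M
4,2021,Jan-Mar,327535,Black,M
5,2021,Jan-Mar,327535,South Asian,M
"""

# pd.read_csv accepts a file path or a file-like object such as StringIO.
sample_df = pd.read_csv(io.StringIO(sample_csv))

# shape is a (rows, columns) tuple; dtypes lists the type of each column.
print(sample_df.shape)
print(sample_df.dtypes)
```

The same two checks, police_df.shape and police_df.dtypes, can be run on the full DataFrame.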

Task 2: Create a `DataFrame` of arrests in 2021

a. Subset the tabular data by rows

Create a boolean Series named Arrests_2021, computed from police_df, that is True for each row where Arrest_Year is 2021 and False otherwise.
Use Arrests_2021 to select rows of police_df that correspond to arrests in 2021. Save this new DataFrame in a variable called police_2021_df. Examine the head() of this DataFrame.

Arrests_2021 = police_df['Arrest_Year'] == 2021

police_2021_df = police_df[Arrests_2021]

police_2021_df.head()

	_id	Arrest_Year	Arrest_Month	EventID	ArrestID	PersonID	Perceived_Race	Sex	Age_group__at_arrest_	Youth_at_arrest__under_18_years	...	Actions_at_arrest___Resisted__d	Actions_at_arrest___Mental_inst	Actions_at_arrest___Assaulted_o	Actions_at_arrest___Cooperative	SearchReason_CauseInjury	SearchReason_AssistEscape	SearchReason_PossessWeapons	SearchReason_PossessEvidence	ItemsFound	ObjectId
3	4	2021	Jan-Mar	1052190	6029059.0	327535	Black	M	Aged 25 to 34 years	Not a youth	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	4
4	5	2021	Jan-Mar	1015512	6040372.0	327535	South Asian	M	Aged 25 to 34 years	Not a youth	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	5
5	6	2021	Apr-June	1019145	6060688.0	327535	South Asian	M	Aged 25 to 34 years	Not a youth	...	NaN	NaN	NaN	1.0	NaN	NaN	NaN	NaN	NaN	6
6	7	2021	Jan-Mar	1035445	6053833.0	330778	Black	M	Aged 25 to 34 years	Not a youth	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	7
7	8	2021	Jan-Mar	1050464	6063477.0	330778	Black	M	Aged 25 to 34 years	Not a youth	...	NaN	NaN	NaN	1.0	NaN	NaN	NaN	NaN	NaN	8

5 rows × 26 columns
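
The comparison in part a produces one True/False value per row, and indexing with that result keeps only the rows marked True. As a minimal sketch, the example below repeats the idea on a small hypothetical DataFrame (toy_df), built from a few values in the output above. The names toy_df, is_2021, n_2021 and toy_2021_df are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical miniature of police_df, using values from the rows shown above.
toy_df = pd.DataFrame({
    "_id": [1, 2, 3, 4, 5],
    "Arrest_Year": [2020, 2020, 2020, 2021, 2021],
    "Perceived_Race": ["White", "White", "Unknown or Legacy", "Black", "South Asian"],
})

# Comparing a column to a value gives one boolean per row.
is_2021 = toy_df["Arrest_Year"] == 2021

# True counts as 1, so summing the mask counts the matching rows.
n_2021 = is_2021.sum()

# Indexing with the mask keeps only the rows where it is True.
# The original index labels (3 and 4 here) are preserved.
toy_2021_df = toy_df[is_2021]

print(n_2021)
print(toy_2021_df)
```

Because filtering preserves the original index labels, the head() of police_2021_df starts at index 3 rather than 0. If a fresh 0-based index is wanted, reset_index(drop=True) renumbers the rows.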

b. Inspect columns of the tabular data

Create a variable called Arrest_column_names that stores a list of the column names of police_2021_df and print the list.

Arrest_column_names = list(police_2021_df)

Arrest_column_names

['_id',
 'Arrest_Year',
 'Arrest_Month',
 'EventID',
 'ArrestID',
 'PersonID',
 'Perceived_Race',
 'Sex',
 'Age_group__at_arrest_',
 'Youth_at_arrest__under_18_years',
 'ArrestLocDiv',
 'StripSearch',
 'Booked',
 'Occurrence_Category',
 'Actions_at_arrest___Concealed_i',
 'Actions_at_arrest___Combative__',
 'Actions_at_arrest___Resisted__d',
 'Actions_at_arrest___Mental_inst',
 'Actions_at_arrest___Assaulted_o',
 'Actions_at_arrest___Cooperative',
 'SearchReason_CauseInjury',
 'SearchReason_AssistEscape',
 'SearchReason_PossessWeapons',
 'SearchReason_PossessEvidence',
 'ItemsFound',
 'ObjectId']
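
Calling list() on a DataFrame works because iterating over a DataFrame yields its column labels. A more explicit spelling uses the .columns attribute. The sketch below compares the two on a hypothetical two-column DataFrame (cols_df); its contents are made up for illustration.

```python
import pandas as pd

# Hypothetical two-column DataFrame used only for illustration.
cols_df = pd.DataFrame({"Sex": ["M", "F"], "Perceived_Race": ["Black", "White"]})

# Iterating over a DataFrame yields its column labels, so list(df) works.
names_from_list = list(cols_df)

# The .columns attribute is an Index; tolist() converts it to a plain list.
names_from_columns = cols_df.columns.tolist()

# Both approaches give the same list of names.
print(names_from_list == names_from_columns)
```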

Task 3: Select columns from `police_2021_df` and examine the distributions of each column

Create a new DataFrame from police_2021_df with the following columns: _id, Sex, Perceived_Race, SearchReason_PossessWeapons and assign it to the variable police_2021_raceweapons. Examine the head() of this DataFrame.
Use .value_counts() to compute the distributions of Sex, Perceived_Race, SearchReason_PossessWeapons. You do not need to save these values in variables, though you may do so if you want.

police_2021_raceweapons = police_2021_df[['_id', 'Sex', 'Perceived_Race', 'SearchReason_PossessWeapons']]

police_2021_raceweapons.head()

	_id	Sex	Perceived_Race	SearchReason_PossessWeapons
3	4	M	Black	NaN
4	5	M	South Asian	NaN
5	6	M	South Asian	NaN
6	7	M	Black	NaN
7	8	M	Black	NaN
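
The double square brackets in the selection above are an outer pair for indexing and an inner pair forming a list of column names. As a hedged illustration, the sketch below uses a hypothetical miniature DataFrame (mini_df) with values taken from the output above. It shows that a list of names returns a DataFrame, while a single name returns a Series.

```python
import pandas as pd

# Hypothetical miniature DataFrame with values taken from the output above.
mini_df = pd.DataFrame({
    "_id": [4, 5, 6],
    "Sex": ["M", "M", "M"],
    "Perceived_Race": ["Black", "South Asian", "South Asian"],
    "Arrest_Year": [2021, 2021, 2021],
})

# A list of names inside the brackets selects several columns: a DataFrame.
subset_df = mini_df[["_id", "Sex", "Perceived_Race"]]

# A single name selects one column: a Series.
sex_series = mini_df["Sex"]

print(type(subset_df).__name__)
print(type(sex_series).__name__)
```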

police_2021_df['Sex'].value_counts()

Sex
M    26815
F     6479
U        3
Name: count, dtype: int64

police_2021_df['Perceived_Race'].value_counts()

Perceived_Race
White                   14116
Black                    8878
Unknown or Legacy        2444
East/Southeast Asian     2361
South Asian              1871
Middle-Eastern           1730
Latino                    960
Indigenous                935
Name: count, dtype: int64

police_2021_df['SearchReason_PossessWeapons'].value_counts()

SearchReason_PossessWeapons
0.0    478
1.0    208
Name: count, dtype: int64
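
The counts for SearchReason_PossessWeapons sum to 686, far fewer than the 33,297 rows counted for Sex, because value_counts() leaves out missing values (NaN) by default. The sketch below shows the dropna and normalize parameters on a small hypothetical Series (flags); its values are made up to illustrate the behaviour and do not come from the dataset.

```python
import pandas as pd

# Hypothetical values chosen only to illustrate the behaviour.
nan = float("nan")
flags = pd.Series([1.0, 0.0, nan, nan, 0.0], name="flag")

# Default: NaN entries are excluded from the counts.
counts_default = flags.value_counts()

# dropna=False adds a row for the missing values.
counts_with_nan = flags.value_counts(dropna=False)

# normalize=True reports proportions (of the non-missing values) instead of counts.
proportions = flags.value_counts(normalize=True)

print(counts_default)
print(counts_with_nan)
print(proportions)
```

Applied to the lab data, police_2021_df['SearchReason_PossessWeapons'].value_counts(dropna=False) would also report how many rows have no recorded value.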
