GGR274 Lab 4: Introduction to Data Wrangling, Part 1
Logistics
Like last week, our lab grade will be based on attendance and submission of a few small tasks to MarkUs during the lab session (or by 23:59 on Thursday).
Complete the tasks in this Jupyter notebook and submit your completed file to MarkUs. Here are the instructions for submitting to MarkUs (same as last week):
Download this file (Lab_4.ipynb) from JupyterHub. (See our JupyterHub Guide for detailed instructions.)
Submit this file to MarkUs under the lab4 assignment. (See our MarkUs Guide for detailed instructions.)
Note: You can use autotests with this week’s lab to see if you are on the right track. Your TA and instructors can look in more detail to see if you answered all questions. The submission is a part of your Lab attendance grade.
Task 1: Read the CSV file into a DataFrame
a. Read the file
Read the CSV file ArrestsStripSearches.csv into a pandas DataFrame called police_df. The file contains records from the Toronto Police Race and Identity Based Data on arrests and strip searches. Inspect the data with .head().
The file is located in the same folder as the notebook.
import pandas as pd
police_df = pd.read_csv("ArrestsStripSearches.csv")
police_df.head()
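As a minimal sketch of what pd.read_csv and .head() do, the example below reads a tiny in-memory CSV instead of the real file. The names toy_csv, toy_df and first_rows, and every value in the toy data, are invented for illustration and are not taken from ArrestsStripSearches.csv.

```python
import io

import pandas as pd

# Hypothetical toy data standing in for ArrestsStripSearches.csv.
# These rows are invented for illustration only.
toy_csv = io.StringIO(
    "_id,Arrest_Year,Sex\n"
    "1,2020,M\n"
    "2,2021,F\n"
    "3,2021,M\n"
)

# Same call as in the lab, applied to an in-memory file object.
toy_df = pd.read_csv(toy_csv)

# head(n) returns the first n rows; with no argument it returns the first 5.
first_rows = toy_df.head(2)
print(first_rows)
```

The first line of the CSV becomes the column names, and each later line becomes one row of the DataFrame.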
b. Check the size
Use .shape to extract the size of the table as a tuple, (number of rows, number of columns).
Store the size in a variable named police_df_size.
police_df_size = police_df.shape
police_df_size
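To illustrate the tuple that .shape returns, here is a small sketch on a made-up DataFrame. The names toy_df, toy_size, n_rows and n_cols are assumptions for this example only.

```python
import pandas as pd

# Hypothetical toy table: 3 rows and 2 columns (values invented).
toy_df = pd.DataFrame({
    "_id": [1, 2, 3],
    "Arrest_Year": [2020, 2021, 2021],
})

# .shape is an attribute, not a method, so it takes no parentheses.
toy_size = toy_df.shape

# The tuple can be unpacked into the row count and the column count.
n_rows, n_cols = toy_size
print(n_rows, n_cols)
```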
Task 2: Create a DataFrame of arrests in 2021
a. Subset the tabular data by rows
Create a boolean variable named Arrests_2021, computed from police_df, that is True if Arrest_Year is 2021 and False otherwise.
Use Arrests_2021 to select the rows of police_df that correspond to arrests in 2021. Save this new DataFrame in a variable called police_2021_df. Examine the head() of this DataFrame.
Arrests_2021 = police_df["Arrest_Year"] == 2021
police_2021_df = police_df[Arrests_2021]
police_2021_df.head()
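The row selection above relies on a boolean mask: comparing a column to a value produces one True or False per row, and indexing the DataFrame with that mask keeps only the True rows. The sketch below shows this on invented data; toy_df, is_2021 and toy_2021_df are hypothetical names.

```python
import pandas as pd

# Hypothetical toy table (values invented for illustration).
toy_df = pd.DataFrame({
    "_id": [1, 2, 3, 4],
    "Arrest_Year": [2020, 2021, 2021, 2020],
})

# Comparing a column to a value gives a boolean Series, one entry per row.
is_2021 = toy_df["Arrest_Year"] == 2021
print(is_2021.tolist())

# Indexing with the mask keeps only the rows where the mask is True.
# The original row labels (index values) are preserved in the result.
toy_2021_df = toy_df[is_2021]
print(toy_2021_df)
```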
b. Check the size
Use .shape[0] to extract the number of rows of the subset, police_2021_df.
Store the arrest count in a variable named arrest_counts_2021.
arrest_counts_2021 = police_2021_df.shape[0]
arrest_counts_2021
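As a quick cross-check, the row count can be obtained in more than one way. This sketch, again on invented data with hypothetical names, compares .shape[0], len() and the sum of the boolean mask (True counts as 1). The lab asks for .shape[0]; the other two are shown only for comparison.

```python
import pandas as pd

# Hypothetical toy table (values invented for illustration).
toy_df = pd.DataFrame({"Arrest_Year": [2020, 2021, 2021, 2020]})
is_2021 = toy_df["Arrest_Year"] == 2021
toy_2021_df = toy_df[is_2021]

count_from_shape = toy_2021_df.shape[0]  # first element of (rows, columns)
count_from_len = len(toy_2021_df)        # len() of a DataFrame is its row count
count_from_mask = int(is_2021.sum())     # each True contributes 1 to the sum

print(count_from_shape, count_from_len, count_from_mask)
```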
c. Inspect columns of the tabular data
Create a variable called Arrest_column_names that stores a list of the column names of police_2021_df and print the list.
Arrest_column_names = list(police_2021_df)
Arrest_column_names
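Calling list() on a DataFrame works because iterating over a DataFrame yields its column labels. The sketch below, using an invented table and hypothetical variable names, shows that this gives the same result as the more explicit .columns.tolist().

```python
import pandas as pd

# Hypothetical toy table (values invented for illustration).
toy_df = pd.DataFrame({
    "_id": [1, 2],
    "Sex": ["M", "F"],
    "Arrest_Year": [2021, 2021],
})

names_from_list = list(toy_df)                # iterating a DataFrame yields column labels
names_from_columns = toy_df.columns.tolist()  # explicit alternative

print(names_from_list)
```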
Task 3: Select columns from police_2021_df and examine the distributions of each column
Create a new DataFrame from police_2021_df with the following columns: _id, Sex, Perceived_Race, SearchReason_PossessWeapons, and assign it to the variable police_2021_raceweapons. Examine the head() of this DataFrame.
Use .value_counts() to compute the distributions of Sex, Perceived_Race, and SearchReason_PossessWeapons. You do not need to save these values in variables, though you may do so if you want.
police_2021_raceweapons = police_2021_df[["_id", "Sex", "Perceived_Race", "SearchReason_PossessWeapons"]]
police_2021_raceweapons.head()
police_2021_df["Sex"].value_counts()
police_2021_df["Perceived_Race"].value_counts()
police_2021_df["SearchReason_PossessWeapons"].value_counts()
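The sketch below illustrates the two indexing forms used in this task on invented data: a list of names inside the brackets returns a DataFrame, while a single name returns a Series, which is what .value_counts() is called on here. All names and values are hypothetical. The normalize=True argument is an optional extra, not required by the lab, that reports proportions instead of counts.

```python
import pandas as pd

# Hypothetical toy table (values invented for illustration).
toy_df = pd.DataFrame({
    "_id": [1, 2, 3, 4],
    "Sex": ["M", "F", "M", "M"],
    "Arrest_Year": [2021, 2021, 2021, 2021],
})

# A list of column names (double brackets) selects a DataFrame.
toy_subset = toy_df[["_id", "Sex"]]

# A single column name selects a Series.
sex_col = toy_df["Sex"]

# value_counts() tallies each distinct value, most frequent first.
sex_counts = sex_col.value_counts()

# normalize=True returns proportions that sum to 1 instead of raw counts.
sex_share = sex_col.value_counts(normalize=True)

print(sex_counts)
print(sex_share)
```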