GGR274 Lab 5: Data Transformations, Grouped Data, and Data Visualization
Logistics
Like last week, our lab grade will be based on attendance and submission of a few small tasks to MarkUs during the lab session (or by 23:59 on Thursday).
Complete the tasks in this Jupyter notebook and submit your completed file to MarkUs. Here are the instructions for submitting to MarkUs (same as last week):
Download this file (Lab_5.ipynb) from JupyterHub. (See our JupyterHub Guide for detailed instructions.)
Submit this file to MarkUs under the lab5 assignment. (See our MarkUs Guide for detailed instructions.)
Note: there’s no autograding set up for this week’s lab, but your TA will be checking that your submitted lab file is complete as part of your “lab attendance” grade.
Lab 5 Introduction
In this lab, you will work with a data set called time_use_prov. This data set is derived from the Statistics Canada General Social Survey's (GSS) Time Use (TU) Survey Main File and from a data set containing aggregated provincial data. This week you will create box plots and bar graphs, and use the logical operators from the Week 4 material to build subsets of the data to visualize.
As usual, these labs are meant to facilitate your understanding of the material from lectures in a low-stakes environment. Please feel free to refer to your lecture content, collaborate with your peers, and seek out help from your TAs.
Task 1
Read the CSV file 'time_use_prov.csv' into a pandas DataFrame named prov_data.
import pandas as pd
# Write your code here
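If you want to see how reading a CSV behaves before working with the lab file, here is a minimal sketch on a tiny made-up table. The text in csv_text and the name toy_data are assumptions for illustration only; they are not the contents of 'time_use_prov.csv'.

```python
import io

import pandas as pd

# Made-up CSV text standing in for a file on disk (not the lab data).
csv_text = "age,Sleep duration\n1,480\n3,450\n6,510\n"

# pd.read_csv accepts a file path or any file-like object,
# so io.StringIO lets us try it without creating a file.
toy_data = pd.read_csv(io.StringIO(csv_text))

# A quick look at what was read: first rows and (rows, columns).
print(toy_data.head())
print(toy_data.shape)
```

For the lab itself you would pass the file name as a string instead of the io.StringIO object.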
Task 2
a) Create a new column in prov_data named 'age_bin'. The values of 'age_bin' should be obtained from the 'age' column in prov_data, which has the values:
Age group of respondent (groups of 10)
VALUE LABEL
1 15 to 24 years
2 25 to 34 years
3 35 to 44 years
4 45 to 54 years
5 55 to 64 years
6 65 to 74 years
7 75 years and over
96 Valid skip
97 Don't know
98 Refusal
99 Not stated
'age_bin' should have the values 'youth', 'young', 'middle', and 'senior', defined as:
'youth': ages 15-24
'young': ages 25-44
'middle': ages 45-64
'senior': ages 65+
# Write your code here
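One possible approach, sketched on a made-up DataFrame named toy (an assumption, not the lab data), uses boolean conditions with .loc to fill the new column one group at a time. The lab does not say what to do with codes 96 to 99, so this sketch simply leaves them missing; treat that as an assumption too.

```python
import pandas as pd

# Made-up 'age' codes: one of each valid group plus a "Not stated" code.
toy = pd.DataFrame({"age": [1, 2, 3, 4, 5, 6, 7, 99]})

# Each line selects rows with a boolean condition and writes a label
# into 'age_bin' for just those rows.
toy.loc[toy["age"] == 1, "age_bin"] = "youth"
toy.loc[(toy["age"] >= 2) & (toy["age"] <= 3), "age_bin"] = "young"
toy.loc[(toy["age"] >= 4) & (toy["age"] <= 5), "age_bin"] = "middle"
toy.loc[(toy["age"] >= 6) & (toy["age"] <= 7), "age_bin"] = "senior"

# Rows that matched no condition (here, code 99) stay missing.
print(toy)
```

Note the parentheses around each comparison: & binds more tightly than >= and <=, so leaving them out raises an error or gives the wrong result.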
b) Compute the distribution of age_bin as a count, and store this in age_bin_count_dist. Then compute the distribution of age_bin as a proportion of the total population, and store this in age_bin_prop_dist.
# Write your code here
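As a rough illustration of counts versus proportions, here is a sketch using value_counts on a small made-up Series. The names bins, counts, and props are hypothetical and chosen only for this example.

```python
import pandas as pd

# Made-up labels, not the lab data.
bins = pd.Series(["young", "young", "middle", "senior"], name="age_bin")

# Number of rows in each category.
counts = bins.value_counts()

# normalize=True divides each count by the total number of
# non-missing values, giving proportions that sum to 1.
props = bins.value_counts(normalize=True)

print(counts)
print(props)
```

By default value_counts ignores missing values, so the proportions are relative to the rows that have a label.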
c) Sort the values of age_bin_prop_dist in ascending order (smallest to largest) using the sort_values method. The code is
age_bin_prop_dist.sort_values(ascending = True, inplace = True)
The inplace = True parameter in sort_values modifies age_bin_prop_dist. What do you predict would happen to age_bin_prop_dist if we used age_bin_prop_dist.sort_values(ascending=True, inplace = False) instead?
# Write your code here
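After you have written down your prediction, one way to check it is to try both calls on a small made-up Series and compare the order of the index before and after. The names and numbers below are assumptions for illustration only.

```python
import pandas as pd

# Made-up proportions, deliberately not in sorted order.
props = pd.Series({"young": 0.50, "middle": 0.25, "senior": 0.15, "youth": 0.10})

# inplace=False: the sorted result comes back as a new Series.
sorted_copy = props.sort_values(ascending=True, inplace=False)
order_after_copy_call = list(props.index)
print(order_after_copy_call)      # order of props after the call
print(list(sorted_copy.index))    # order of the returned Series

# inplace=True: props itself is reordered.
props.sort_values(ascending=True, inplace=True)
print(list(props.index))
```

Comparing the three printed lists shows which object each call changes.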
d) Create a bar plot of age_bin_prop_dist.
Feel free to explore different aesthetic options by changing parameters for the plotting function. (See the documentation here.)
# Write your code here
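To give a sense of the kind of options you can change, here is a sketch of a bar plot drawn from a made-up Series. The values, the colour, and the axis labels are all assumptions chosen for illustration; they are not required by the lab.

```python
import pandas as pd

# Made-up proportions, already sorted from smallest to largest.
props = pd.Series({"youth": 0.10, "senior": 0.15, "middle": 0.25, "young": 0.50})

# Series.plot.bar draws one bar per index label and returns a
# matplotlib Axes, which can be customized further.
ax = props.plot.bar(color="steelblue", rot=0)
ax.set_xlabel("Age group")
ax.set_ylabel("Proportion of respondents")
```

Here rot=0 keeps the category labels horizontal, and color sets the fill colour of the bars.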
Task 3
a) Create a boxplot of Sleep duration by age_bin. Store this plot in sleep_by_age_boxplots. Use figsize = (8, 8) in the pandas.DataFrame.boxplot function.
b) Set the label on the y-axis (vertical axis) to Sleep duration by using the .set_ylabel() method, as follows:
sleep_by_age_boxplots.set_ylabel('Sleep duration')
Feel free to customize the plot further to your liking with the help of the documentation.
# Write your code here
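The sketch below shows how the column and by arguments of DataFrame.boxplot fit together, using a small made-up DataFrame. The name toy, the name toy_boxplots, and the numbers are assumptions for illustration, not the lab data.

```python
import pandas as pd

# Made-up sleep values for three groups (not the lab data).
toy = pd.DataFrame({
    "Sleep duration": [480, 510, 450, 420, 540, 500],
    "age_bin": ["youth", "youth", "young", "young", "senior", "senior"],
})

# column picks the values to summarize; by picks the grouping column,
# so one box is drawn per category of 'age_bin'.
toy_boxplots = toy.boxplot(column="Sleep duration", by="age_bin", figsize=(8, 8))

# With a single column, the result is a matplotlib Axes,
# so the usual label methods are available.
toy_boxplots.set_ylabel("Sleep duration")
```

The categories on the horizontal axis are taken from the values of the grouping column.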