GGR274 Lab 5: Data Transformations, Grouped Data, and Data Visualization
Logistics
Like last week, our lab grade will be based on attendance and submission of a few small tasks to MarkUs during the lab session (or by 23:59 on Thursday).
Complete the tasks in this Jupyter notebook and submit your completed file to MarkUs. Here are the instructions for submitting to MarkUs (same as last week):
Download this file (Lab_5.ipynb) from JupyterHub. (See our JupyterHub Guide for detailed instructions.)
Submit this file to MarkUs under the lab5 assignment. (See our MarkUs Guide for detailed instructions.)
Note: Use the autotests for this week’s lab to check whether you are on the right track. Follow the steps carefully so that your answers match the solution not only in how they appear on screen, but also in data types, whitespace, rounding, etc.
Lab 5 Introduction
In this lab, you will work with a data set called time_use_prov. It is derived from the Statistics Canada General Social Survey’s (GSS) Time Use (TU) Survey Main File, as well as from a data set containing aggregated provincial data. This week you will create box plots and bar graphs, and use the logical operators from the Week 4 material to build subsets of the data to visualize.
As usual, these labs are meant to facilitate your understanding of the material from lectures in a low-stakes environment. Please feel free to refer to your lecture content, collaborate with your peers, and seek out help from your TAs.
Task 1
Read the CSV file "time_use_prov.csv" into a pandas DataFrame named prov_data.
import pandas as pd
# Write your code here
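As a minimal sketch of how pd.read_csv turns CSV text into a DataFrame, the example below reads from an in-memory string instead of a file, so it runs without the lab data. The column names and values are made up for illustration and are not the contents of time_use_prov.csv.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file on disk.
csv_text = "respondent,age,Sleep duration\n1,2,480\n2,6,510\n3,1,450\n"

# pd.read_csv accepts a file path or a file-like object such as StringIO.
toy_data = pd.read_csv(io.StringIO(csv_text))

# Inspect the result: number of rows and columns, then the column names.
print(toy_data.shape)
print(list(toy_data.columns))
```

With a real file, the path string (for example "time_use_prov.csv") takes the place of the StringIO object.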
Task 2
a) Create a new column in prov_data named "age_bin". The values of "age_bin" should be obtained from the "age" column in prov_data, which has the values:
Age group of respondent (groups of 10)
VALUE LABEL
1 15 to 24 years
2 25 to 34 years
3 35 to 44 years
4 45 to 54 years
5 55 to 64 years
6 65 to 74 years
7 75 years and over
96 Valid skip
97 Don't know
98 Refusal
99 Not stated
"age_bin" should have the values "youth", "young", "middle", and "senior", defined as:
"youth": ages 15-24
"young": ages 25-44
"middle": ages 45-64
"senior": ages 65+
# Write your code here
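One possible approach, sketched here on a small made-up DataFrame rather than on prov_data, is to build a boolean mask for each category with comparison and logical operators and then assign a label to the matching rows with .loc. The name toy and its values are hypothetical, and other approaches would work equally well.

```python
import pandas as pd

# Hypothetical data: "age" holds group codes like those in the table above.
toy = pd.DataFrame({"age": [1, 2, 3, 4, 5, 6, 7, 99]})

# One boolean mask per category, built with comparison and logical operators.
is_youth = toy["age"] == 1
is_young = (toy["age"] >= 2) & (toy["age"] <= 3)
is_middle = (toy["age"] == 4) | (toy["age"] == 5)
is_senior = toy["age"].isin([6, 7])

# Assign a label to the rows selected by each mask.
toy.loc[is_youth, "age_bin"] = "youth"
toy.loc[is_young, "age_bin"] = "young"
toy.loc[is_middle, "age_bin"] = "middle"
toy.loc[is_senior, "age_bin"] = "senior"

# Rows matched by no mask (code 99 here) are left as missing values.
print(toy)
```

The three styles of mask (range comparison with &, alternatives with |, and isin) are interchangeable here; they are shown side by side only for comparison.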
b) Compute the distribution of age_bin as counts, and store the count distribution in age_bin_count_dist. Then compute the distribution of age_bin as proportions of the total population, and store this in age_bin_prop_dist.
# Write your code here
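As a hedged illustration of the difference between counts and proportions, the sketch below applies value_counts to a made-up Series of labels. The variable names and values are hypothetical and are not taken from the lab data.

```python
import pandas as pd

# Hypothetical labels standing in for a categorical column.
labels = pd.Series(
    ["young", "senior", "young", "youth", "middle", "young", "senior", "middle"]
)

# Counts: how many rows carry each label.
count_dist = labels.value_counts()

# Proportions: normalize=True divides each count by the number of non-missing rows.
prop_dist = labels.value_counts(normalize=True)

print(count_dist)
print(prop_dist)
```

By default value_counts leaves missing values out, so the proportions sum to 1 over the rows that have a label.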
c) Sort the values of age_bin_prop_dist in ascending order (smallest to largest) using the sort_values method. The code is:
age_bin_prop_dist.sort_values(ascending=True, inplace=True)
(Not graded) The inplace=True parameter in sort_values modifies age_bin_prop_dist itself. What do you predict would happen to age_bin_prop_dist if we used age_bin_prop_dist.sort_values(ascending=True, inplace=False) instead?
# Write your code here
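One way to check a prediction like this is to try both settings on a small Series. The sketch below uses made-up values and a hypothetical variable named props, not the lab data.

```python
import pandas as pd

# Hypothetical proportions, deliberately not in sorted order.
props = pd.Series({"young": 0.375, "senior": 0.25, "youth": 0.125})
order_before = list(props.index)

# inplace=False returns a new, sorted Series and leaves props as it was.
result = props.sort_values(ascending=True, inplace=False)
order_after_false = list(props.index)

# inplace=True sorts props itself; the method then returns None.
returned = props.sort_values(ascending=True, inplace=True)
order_after_true = list(props.index)

print(order_before, order_after_false, order_after_true)
print(list(result.index), returned)
```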
d) (Not graded) Create a bar plot of age_bin_prop_dist.
Feel free to explore different aesthetic options by changing the parameters of the plotting function. (See the plotting function's documentation.)
# Write your code here
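As a rough sketch of how plotting parameters change the look of a bar plot, the example below draws a made-up Series with Series.plot.bar. The values, colour, and labels are arbitrary choices for illustration, and the example assumes matplotlib is installed alongside pandas.

```python
import pandas as pd

# Hypothetical proportions to plot.
props = pd.Series({"youth": 0.125, "senior": 0.25, "young": 0.375})

# color, rot, and title are a few of the options that can be adjusted.
ax = props.plot.bar(color="steelblue", rot=0, title="Toy proportions")

# Axis labels can be set on the returned Axes object.
ax.set_xlabel("Group")
ax.set_ylabel("Proportion")
```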
Task 3
a) Create a boxplot of Sleep duration by age_bin and store it in sleep_by_age_boxplots by completing the code below.
Use figsize=(8, 8) inside the pandas.DataFrame.boxplot() function.
Set the label on the x-axis to Age Group by using the .set_xlabel() method, as follows:
sleep_by_age_boxplots.set_xlabel("Age Group")
Set the label on the y-axis to (minute) by using the .set_ylabel() method, as follows:
sleep_by_age_boxplots.set_ylabel("(minute)")
# Complete the code below by replacing the placeholder comments
sleep_by_age_boxplots = prov_data.boxplot(
    column= # complete the line
    by= # complete the line
    figsize= # complete the line
);
# add the axis labels
# If the plot does not appear even though there is no error, try running the line below.
# sleep_by_age_boxplots.figure
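To show how the column, by, and figsize arguments fit together, here is a minimal sketch on a made-up DataFrame. The column names "group" and "minutes", the values, and the figure size are hypothetical, and the example assumes matplotlib is installed alongside pandas.

```python
import pandas as pd

# Hypothetical data: one numeric column and one grouping column.
toy = pd.DataFrame(
    {
        "group": ["a", "a", "a", "b", "b", "b"],
        "minutes": [420, 480, 510, 390, 450, 540],
    }
)

# column picks the values to summarize, by picks the grouping variable,
# and figsize sets the width and height of the figure in inches.
toy_boxplots = toy.boxplot(column="minutes", by="group", figsize=(6, 4))

# With a single column, the call returns one Axes object, so labels can be set on it.
toy_boxplots.set_xlabel("Group")
toy_boxplots.set_ylabel("(minute)")
```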
b) (Not graded) Feel free to customize a copy of the plot, sleep_by_age_boxplots_copy, further to your liking with the help of the documentation.
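# Write your code here
# Note: this assignment gives the same plot object a second name; it does not create an independent copy.
sleep_by_age_boxplots_copy = sleep_by_age_boxplots
# If the plot does not appear even though there is no error, try running the line below.
sleep_by_age_boxplots_copy.figure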