GGR274 Lab 5: Data Transformations, Grouped Data, and Data Visualization

Logistics

Like last week, your lab grade will be based on attendance and on submitting a few small tasks to MarkUs during the lab session (or by 23:59 on Thursday).

Complete the tasks in this Jupyter notebook and submit your completed file to MarkUs. Here are the instructions for submitting to MarkUs (same as last week):

  1. Download this file (Lab_5.ipynb) from JupyterHub. (See our JupyterHub Guide for detailed instructions.)

  2. Submit this file to MarkUs under the lab5 assignment. (See our MarkUs Guide for detailed instructions.)

Note: there’s no autograding set up for this week’s lab, but your TA will be checking that your submitted lab file is complete as part of your “lab attendance” grade.

Lab 5 Introduction

In this lab, you will work with a data set called time_use_prov. This data set is derived from the Statistics Canada General Social Survey’s (GSS) Time Use (TU) Survey Main File, as well as a data set containing aggregated provincial data. This week you will create bar plots and box plots, and use the logical operators from the Week 4 material to build subsets of the data to visualize.

As usual, these labs are meant to facilitate your understanding of the material from lectures in a low-stakes environment. Please feel free to refer to your lecture content, collaborate with your peers, and seek out help from your TAs.

Task 1

Read the CSV file 'time_use_prov.csv' into a pandas DataFrame named prov_data.

import pandas as pd

# Write your code here
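As a minimal sketch of how pd.read_csv turns CSV text into a DataFrame, the example below reads a tiny in-memory table instead of a file on disk. The column names and values are made up for illustration and are not taken from time_use_prov.csv; with a real file you would pass its filename as the first argument.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
# The columns and values here are illustrative only.
csv_text = "age,sleep\n1,480\n3,450\n6,510\n"

# pd.read_csv accepts a filename or any file-like object.
demo = pd.read_csv(io.StringIO(csv_text))

# Inspect the first rows and the (rows, columns) shape.
print(demo.head())
print(demo.shape)
```

Checking head() and shape right after loading is a quick way to confirm the file was read as expected.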

Task 2

a) Create a new column in prov_data named 'age_bin'. The values of 'age_bin' should be derived from the 'age' column in prov_data, which takes the following coded values:

            Age group of respondent (groups of 10)

           VALUE  LABEL
               1  15 to 24 years
               2  25 to 34 years
               3  35 to 44 years
               4  45 to 54 years
               5  55 to 64 years
               6  65 to 74 years
               7  75 years and over
              96  Valid skip
              97  Don't know
              98  Refusal
              99  Not stated

'age_bin' should have the values 'youth', 'young', 'middle', and 'senior', defined as:

  • 'youth' : ages 15-24

  • 'young' : ages 25-44

  • 'middle' : ages 45-64

  • 'senior' : ages 65+

# Write your code here
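One possible approach, sketched on a small made-up DataFrame (demo is a hypothetical name, not the lab data), combines boolean conditions with .loc to fill the new column one group at a time. Note that a condition such as age >= 6 would also match the special codes 96 to 99, so the sketch matches codes 6 and 7 explicitly.

```python
import pandas as pd

# Toy data using the coded age groups; 99 stands for "Not stated".
demo = pd.DataFrame({"age": [1, 2, 3, 4, 5, 6, 7, 99]})

# Assign a label to the rows matching each condition.
demo.loc[demo["age"] == 1, "age_bin"] = "youth"
demo.loc[(demo["age"] == 2) | (demo["age"] == 3), "age_bin"] = "young"
demo.loc[(demo["age"] >= 4) & (demo["age"] <= 5), "age_bin"] = "middle"
demo.loc[(demo["age"] == 6) | (demo["age"] == 7), "age_bin"] = "senior"

# Rows that match no condition (here, code 99) are left as missing.
print(demo)
```

The | operator means "or" and & means "and"; each condition needs its own parentheses because these operators bind more tightly than comparisons.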

b) Compute the distribution of age_bin as counts, and store this in age_bin_count_dist. Then compute the distribution of age_bin as proportions of the total number of respondents, and store this in age_bin_prop_dist.

# Write your code here
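As a rough illustration of counts versus proportions, the sketch below uses value_counts on a small made-up Series (the names bins, counts, and props are hypothetical). Passing normalize=True divides each count by the total, so the results sum to 1.

```python
import pandas as pd

# Toy labels: 1 youth, 2 young, 3 middle, 2 senior (8 in total).
bins = pd.Series(
    ["youth", "young", "young", "middle", "middle", "middle", "senior", "senior"],
    name="age_bin",
)

# Number of rows in each category.
counts = bins.value_counts()

# Share of rows in each category (count divided by the total).
props = bins.value_counts(normalize=True)

print(counts)
print(props)
```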

c) Sort the values of age_bin_prop_dist in ascending order (smallest to largest) using the sort_values method. The code is:

age_bin_prop_dist.sort_values(ascending = True, inplace = True)

The inplace = True parameter in sort_values modifies age_bin_prop_dist itself rather than returning a sorted copy. What do you predict would happen to age_bin_prop_dist if we used age_bin_prop_dist.sort_values(ascending = True, inplace = False) instead?

# Write your code here
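If you want to check your prediction, one way is to experiment on a small made-up Series (s below is hypothetical, not the lab data) and compare its order before and after each call.

```python
import pandas as pd

# Toy proportions in an arbitrary order.
s = pd.Series({"young": 0.3, "youth": 0.1, "senior": 0.2, "middle": 0.4})

# Without inplace=True (the default), a new sorted Series is returned
# and s keeps its original order.
sorted_copy = s.sort_values(ascending=True)
print(list(s.index))
print(list(sorted_copy.index))

# With inplace=True, the Series itself is reordered.
s_inplace = s.copy()
s_inplace.sort_values(ascending=True, inplace=True)
print(list(s_inplace.index))
```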

d) Create a bar plot of age_bin_prop_dist.

Feel free to explore different aesthetic options by changing parameters of the plotting function. (See the documentation here.)

# Write your code here
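A minimal sketch of a bar plot from a Series, assuming matplotlib is installed and using made-up proportions: the color, rot, and axis-label choices are only examples of the kind of aesthetic options you can change.

```python
import pandas as pd

# Toy proportions, already sorted from smallest to largest.
props = pd.Series({"youth": 0.1, "senior": 0.2, "young": 0.3, "middle": 0.4})

# plot.bar draws one bar per index label and returns a matplotlib Axes.
ax = props.plot.bar(color="steelblue", rot=0)

# The Axes object can be used to adjust labels after plotting.
ax.set_xlabel("Age bin")
ax.set_ylabel("Proportion")
```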

Task 3

a) Create a boxplot of Sleep duration by age_bin. Store this plot in sleep_by_age_boxplots. Use figsize = (8, 8) in the pandas.DataFrame.boxplot function.

b) Set the label on the y-axis (vertical axis) to Sleep duration by using the .set_ylabel() method, as follows:

sleep_by_age_boxplots.set_ylabel('Sleep duration')

Feel free to customize the plot further to your liking with the help of the documentation.

# Write your code here
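The sketch below shows grouped boxplots on a small made-up DataFrame, assuming matplotlib is installed and that the sleep column is literally named 'Sleep duration' (an assumption; check the column names in prov_data). The values are invented for illustration.

```python
import pandas as pd

# Toy data: two groups with three sleep durations each.
demo = pd.DataFrame(
    {
        "Sleep duration": [480, 500, 450, 520, 430, 470],
        "age_bin": ["youth", "youth", "youth", "senior", "senior", "senior"],
    }
)

# column= picks the variable to summarize; by= draws one box per group.
# With a single column, the call returns a matplotlib Axes.
ax = demo.boxplot(column="Sleep duration", by="age_bin", figsize=(8, 8))

# Label the vertical axis.
ax.set_ylabel("Sleep duration")
```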