Homework 5
Logistics
Due date: The homework is due at 23:59 on Monday, February 12.
You will submit your work on MarkUs. To submit your work:
Download this file (Homework_5.ipynb) from JupyterHub. (See our JupyterHub Guide for detailed instructions.)
Submit this file to MarkUs under the hw5 assignment. (See our MarkUs Guide for detailed instructions.)
All homeworks will take place in a Jupyter notebook (like this one). When you are done, you will download this notebook and submit it to MarkUs.
Introduction
In this homework we explore:
row and column selection
creating new columns
grouping
summary statistics
visualizing distributions
Question: Explore sleeping, exercising, and socializing among Canadians.
Task 1
a) Use the pandas method read_csv to read the file gss_tu2016_main_file.csv into a DataFrame. Store this DataFrame in a variable called time_use_df.
import pandas as pd
# Write your code here
b) Create a subset of time_use_df with only the following columns: dur41, dur47, sleepdur, agegr10, prv. To do this, follow these steps:
Create a list called analysis_columns with the column names.
Use analysis_columns to select these columns from time_use_df and store this DataFrame in a variable called time_use_subset_df.
# Write your code here
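As a generic illustration of the list-based selection described in b), here is a minimal sketch on a made-up DataFrame. The names toy_df, wanted, and toy_subset and the columns a, b, c are hypothetical and are not part of the survey data.

```python
import pandas as pd

# Hypothetical toy data (not the GSS file)
toy_df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# A list of column names, passed inside [] to keep only those columns
wanted = ["a", "c"]
toy_subset = toy_df[wanted]
```

Indexing with a list returns a new DataFrame that contains only the listed columns, in the order given.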
c) In the next steps you will rename the columns of time_use_subset_df according to the following table:
Old name | New name |
---|---|
| |
| |
| |
| |
| |
Step 1: Create a dictionary called new_col_names with each Old name as a key and each New name as the corresponding value.
# Write your code here
Step 2: Use new_col_names to rename the columns of time_use_subset_df and store the DataFrame with renamed columns in a variable called time_use_subset_renamed_df.
# Write your code here
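The dictionary-based renaming in Steps 1 and 2 can be sketched on made-up data as follows. The column names x1 and x2, their replacements, and the variable names are hypothetical placeholders, not the names from the table above.

```python
import pandas as pd

# Hypothetical toy data with short, uninformative column names
toy_df = pd.DataFrame({"x1": [10, 20], "x2": [30, 40]})

# Keys are the old names, values are the new names
toy_names = {"x1": "First measure", "x2": "Second measure"}

# rename returns a new DataFrame; toy_df itself is left unchanged
toy_renamed = toy_df.rename(columns=toy_names)
```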
Task 2
Create columns in time_use_subset_renamed_df that convert time use from minutes to hours. Since 60 minutes is equal to 1 hour, we can divide the time use columns by 60 to compute the time in hours.
To do this, create new columns in time_use_subset_renamed_df called Socializing time (hour), Exercising time (hour), and Sleep time (hour). These columns are (respectively) Socializing time, Exercising time, and Sleep time in hours.
# Write your code here
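A minimal sketch of the minutes-to-hours conversion on made-up data is shown below. The column Reading time and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: time in minutes
toy_df = pd.DataFrame({"Reading time": [30, 90, 0]})

# Dividing a column by 60 divides every entry, giving time in hours
toy_df["Reading time (hour)"] = toy_df["Reading time"] / 60
```

Assigning to a column name that does not yet exist adds a new column to the DataFrame.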
Task 3
Some respondents in the time use survey spent no time exercising, socializing, or sleeping. In this section we will create a DataFrame that only has respondents who spent time sleeping, exercising, and socializing. In other words, respondents who spent no time on at least one of these activities will be excluded.
a) Create a boolean Series called well_balanced that is True if time spent exercising and time spent sleeping and time spent socializing are all greater than 0, and False otherwise.
# Write your code here
b) Use well_balanced to filter (i.e. select) the rows of time_use_subset_renamed_df where respondents had non-zero times of sleeping, exercising, and socializing. Store this filtered DataFrame in well_balanced_df.
# Write your code here
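Combining several conditions into one boolean Series and using it to filter rows can be sketched on made-up data as follows. The columns p and q and the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with two numeric columns
toy_df = pd.DataFrame({"p": [0, 5, 3], "q": [2, 0, 4]})

# Each comparison gives a boolean Series; & combines them row by row.
# The parentheses are needed because & binds more tightly than >.
both_positive = (toy_df["p"] > 0) & (toy_df["q"] > 0)

# Keep only the rows where the boolean Series is True
toy_filtered = toy_df[both_positive]
```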
c) The number of rows in a pandas DataFrame can be computed with len(). For example, len(well_balanced_df) is the number of rows in well_balanced_df. Compute the number of respondents who were removed from time_use_subset_renamed_df when it was filtered using well_balanced and store this number in a variable called diff.
# Write your code here
d) Use diff to compute the percentage of respondents removed from time_use_subset_renamed_df. Round the percentage to two decimal places, and store the result in a variable called pct_lost.
# Write your code here
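The arithmetic in c) and d) can be sketched with made-up row counts. The numbers and variable names below are hypothetical; in practice the counts would come from len() applied to the two DataFrames.

```python
# Hypothetical row counts before and after filtering
n_before = 7
n_after = 5

# Number of rows removed by the filter
n_removed = n_before - n_after

# Share of the original rows that were removed, as a percentage
# rounded to two decimal places
pct_removed = round(n_removed / n_before * 100, 2)
```

The percentage is taken relative to the number of rows before filtering, since that is the group the rows were removed from.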
Task 4
In this section you will explore the distributions of time spent socializing, exercising, and sleeping by age group and province.
a) Compute the mean hours spent sleeping, socializing, and exercising by age group using .groupby on well_balanced_df. Store this DataFrame in a variable called group_means.
b) Create a new column in group_means called Total time (hour) that is the sum of the time (in hours) spent sleeping, exercising, and socializing.
c) Create a new index for group_means using the labels of Age group found in the code book (gss_tu2016_codebook.txt) and store the values in a list called index_new.
d) Change the index of group_means to correspond to index_new.
# Write your code here
# Display group_means to check that the index has been updated.
# On the left-hand side you should see the Age group labels, from "15-24" to "75+".
group_means
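Steps a) to d) follow a common pattern: group, average, add a derived column, then relabel the index. A minimal sketch on made-up data is shown below. The columns grp, u, and v and the labels "first" and "second" are hypothetical and are not taken from the code book.

```python
import pandas as pd

# Hypothetical toy data: a numeric group code and two measurements
toy_df = pd.DataFrame({
    "grp": [1, 1, 2, 2],
    "u": [1.0, 3.0, 2.0, 4.0],
    "v": [0.5, 1.5, 1.0, 3.0],
})

# Mean of the selected columns within each group; the group codes
# become the index of the result, sorted in increasing order
toy_means = toy_df.groupby("grp")[["u", "v"]].mean()

# New column as the row-wise sum of the two means
toy_means["total"] = toy_means["u"] + toy_means["v"]

# Replace the numeric index with one label per group, in the same order
toy_means.index = ["first", "second"]
```

The list assigned to the index must have one entry per row, listed in the same order as the sorted group codes.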
e) Sort group_means in descending order of Total time (hour). Store this sorted DataFrame in a variable called group_means_sorted.
# Write your code here
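Sorting a DataFrame by one column in descending order can be sketched on made-up data as follows. The column score and the index labels are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with a labelled index
toy_df = pd.DataFrame({"score": [2.0, 9.0, 5.0]}, index=["a", "b", "c"])

# ascending=False puts the largest value first;
# sort_values returns a new DataFrame and leaves toy_df unchanged
toy_sorted = toy_df.sort_values("score", ascending=False)
```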
f) Use well_balanced_df to create three side-by-side boxplots of time spent (in hours) socializing, exercising, and sleeping for each age group, using layout = (1, 3) and figsize = (20, 10). Store these boxplots in a variable called time_boxplots.
# Write your code here
Task 5
The tick marks on the horizontal axes of time_boxplots are not informative unless the viewer knows which age group each value represents. Fix the boxplot labels by recoding Age group using the labels in the code book (see gss_tu2016_codebook.txt).
a) First, create a copy of well_balanced_df (using the DataFrame .copy() method), and store it in a variable called well_balanced_age_label_df. For that new DataFrame, recode Age group by adding a column called Age group label with the Age group labels found in the code book.
# Write your code here
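One possible way to recode numeric codes as text labels is a dictionary combined with the Series .map() method. The sketch below uses made-up data; the column code and the labels "low" and "high" are hypothetical and are not taken from the code book.

```python
import pandas as pd

# Hypothetical toy data with numeric category codes
toy_df = pd.DataFrame({"code": [1, 2, 1]})

# Work on a copy so the original DataFrame is not modified
toy_copy = toy_df.copy()

# Dictionary from each numeric code to its text label
toy_labels = {1: "low", 2: "high"}

# map looks up each code in the dictionary and returns the label
toy_copy["code label"] = toy_copy["code"].map(toy_labels)
```

Any code that is missing from the dictionary is mapped to NaN, so it is worth checking that every code has a label.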
b) Create the same boxplots as in Task 4 f) using layout = (1, 3) and figsize = (20, 10), but use Age group label to create the boxplot, so that the ticks on the horizontal axes of the boxplot are informative. Store this boxplot in a variable called time_boxplots_age_label.
# Write your code here
Task 6 (Written Discussion)
a) Which age group spends the most time sleeping, exercising, and socializing? Does your ranking change if you use mean or median as a summary measure of time? Briefly explain why your ranking does or does not change, and which ranking is the better representation.
b) Which age group shows the most variability in time spent socializing? Provide a brief explanation of why this group shows the most variability.
c) State one limitation of basing this data analysis on only respondents that spent more than zero time sleeping, exercising, and socializing. Briefly explain why it’s a limitation to your findings in Tasks 4 and 5.
Answer Task 6 here.