Homework 5#

Logistics#

Due date: The homework is due 23:59 on Monday, February 12.

You will submit your work on MarkUs. To submit your work:

Download this file (Homework_5.ipynb) from JupyterHub. (See our JupyterHub Guide for detailed instructions.)
Submit this file to MarkUs under the hw5 assignment. (See our MarkUs Guide for detailed instructions.)

All homework assignments take place in a Jupyter notebook (like this one). When you are done, download the notebook and submit it to MarkUs.

Introduction#

In this homework we explore:

row and column selection
creating new columns
grouping
summary statistics
visualizing distributions

Question: Explore sleeping, exercising, and socializing among Canadians.

Task 1#

a) Use the pandas method read_csv to read the file gss_tu2016_main_file.csv into a DataFrame. Store this DataFrame in a variable called time_use_df.
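As a minimal sketch of how read_csv behaves, the example below reads a tiny CSV from an in-memory text buffer instead of a file on disk. The data, the column names, and the variable names toy_csv and toy_df are made up for illustration and are not part of the survey file.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file on disk.
toy_csv = io.StringIO("id,minutes\n1,30\n2,45\n")

# read_csv accepts a file path or a file-like object and returns a DataFrame.
toy_df = pd.read_csv(toy_csv)

print(toy_df.shape)  # (number of rows, number of columns)
```

With a real file, the first argument would be the file name as a string rather than a buffer.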

import pandas as pd

# Write your code here

b) Create a subset of time_use_df with only the following columns: dur41, dur47, sleepdur, agegr10, prv. To do this follow these steps:

Create a list called analysis_columns with the column names.
Use analysis_columns to select these columns from time_use_df and store this DataFrame in a variable called time_use_subset_df.
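The two steps above can be sketched on a toy DataFrame. The column names a, b, c and the variables wanted and toy_subset are hypothetical and chosen only to show the pattern of selecting columns with a list.

```python
import pandas as pd

# Toy data with three columns (hypothetical names).
toy_df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# A list of the column names to keep.
wanted = ["a", "c"]

# Indexing a DataFrame with a list of names returns a DataFrame
# containing only those columns, in the order given by the list.
toy_subset = toy_df[wanted]

print(toy_subset)
```

The original DataFrame is left unchanged, so it still has all three columns.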

# Write your code here

c) In the next steps you will rename the columns of time_use_subset_df according to the following table:

Old name	New name
`dur41`	`Socializing time`
`dur47`	`Exercising time`
`sleepdur`	`Sleep time`
`agegr10`	`Age group`
`prv`	`Province`

Step 1: Create a dictionary called new_col_names with each Old name as a key and each New name as the corresponding value.

# Write your code here

Step 2: Use new_col_names to rename the columns of time_use_subset_df and store the DataFrame with renamed columns in a variable called time_use_subset_renamed_df.
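A minimal sketch of renaming columns with a dictionary, using toy data. The names dur_a, dur_b, "Reading time", "Cooking time", toy_names, and toy_renamed are invented for this example and do not come from the survey.

```python
import pandas as pd

# Toy data with terse column names (hypothetical).
toy_df = pd.DataFrame({"dur_a": [10, 20], "dur_b": [5, 15]})

# Keys are the old names, values are the new names.
toy_names = {"dur_a": "Reading time", "dur_b": "Cooking time"}

# rename returns a new DataFrame by default; toy_df keeps its old names.
toy_renamed = toy_df.rename(columns=toy_names)

print(list(toy_renamed.columns))
```

Because rename returns a new DataFrame unless told otherwise, the result has to be assigned to a variable to be kept.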

# Write your code here

Task 2#

Create columns in time_use_subset_renamed_df that convert time use from minutes to hours. Since 60 minutes equals 1 hour, we can divide the time use columns by 60 to compute the time in hours.

To do this create new columns in time_use_subset_renamed_df called

Socializing time (hour),
Exercising time (hour), and
Sleep time (hour)

These columns hold, respectively, Socializing time, Exercising time, and Sleep time expressed in hours.
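The unit conversion can be sketched on a single toy column. The column names "Reading time" and "Reading time (hour)" are hypothetical stand-ins for the real ones.

```python
import pandas as pd

# Toy data: time in minutes (hypothetical column name).
toy_df = pd.DataFrame({"Reading time": [30, 90, 120]})

# Dividing a column by 60 divides every value in it;
# assigning to a new column name adds that column to the DataFrame.
toy_df["Reading time (hour)"] = toy_df["Reading time"] / 60

print(toy_df)
```

The same pattern applies once per column when several columns need converting.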

# Write your code here

Task 3#

Some respondents in the time use survey spent no time on at least one of exercising, socializing, or sleeping. In this section we will create a DataFrame that only has respondents who spent time sleeping, exercising, and socializing. In other words, respondents who spent no time on one or more of these activities will be excluded.

a) Create a boolean Series called well_balanced that is True if time spent exercising and time spent sleeping and time spent socializing are all greater than 0, and False otherwise.

# Write your code here

b) Use well_balanced to filter (i.e. select) the rows of time_use_subset_renamed_df where respondents had non-zero times of sleeping, exercising, and socializing. Store this filtered DataFrame in well_balanced_df.
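Parts a) and b) follow a common pattern: build a boolean Series, then use it to select rows. Here is a sketch with two toy columns; the names x, y, both_positive, and toy_filtered are made up for illustration.

```python
import pandas as pd

# Toy data with two numeric columns (hypothetical names).
toy_df = pd.DataFrame({"x": [0, 2, 3], "y": [1, 0, 4]})

# Each comparison yields a boolean Series; & combines them row by row.
# The parentheses are required because & binds more tightly than >.
both_positive = (toy_df["x"] > 0) & (toy_df["y"] > 0)

# Indexing with a boolean Series keeps only the rows where it is True.
toy_filtered = toy_df[both_positive]

print(both_positive.tolist())
print(toy_filtered)
```

A third condition can be chained on with another & in the same way.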

# Write your code here

c) The number of rows in a pandas DataFrame can be computed by len(). For example, len(well_balanced_df) is the number of rows in well_balanced_df. Compute the number of respondents who were removed from time_use_subset_renamed_df when it was filtered using well_balanced and store this number in a variable called diff.

# Write your code here

d) Use diff to compute the percentage of respondents removed from time_use_subset_renamed_df. Round the percentage to two decimal places, and store the result in a variable called pct_lost.
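The arithmetic in parts c) and d) can be sketched with toy data. This example assumes the percentage is taken relative to the number of rows before filtering; the names toy_df, toy_kept, n_removed, and pct_removed are hypothetical.

```python
import pandas as pd

# Toy data: 7 rows, 2 of which have x equal to 0.
toy_df = pd.DataFrame({"x": [0, 1, 2, 0, 3, 4, 5]})
toy_kept = toy_df[toy_df["x"] > 0]

# Rows removed = rows before filtering minus rows after filtering.
n_removed = len(toy_df) - len(toy_kept)

# Share of the original rows that were removed, as a percentage
# rounded to two decimal places with the built-in round function.
pct_removed = round(n_removed / len(toy_df) * 100, 2)

print(n_removed, pct_removed)
```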

# Write your code here

Task 4#

In this section you will explore the distributions of time spent socializing, exercising, and sleeping by age group and province.

a) Compute the mean hours spent sleeping, socializing, and exercising by age group using .groupby on well_balanced_df. Store this DataFrame in a variable called group_means.

b) Create a new column in group_means called Total time (hour) that is the sum of the time (in hours) spent sleeping, exercising, and socializing.

c) Create a new index for group_means using the labels of Age group found in the code book (gss_tu2016_codebook.txt) and store the values in a list called index_new.

d) Change the index of group_means to correspond to index_new.
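Parts a) to d) can be sketched end to end on toy data. The column names group, a, b, total and the labels "first" and "second" are invented for this example; the real labels come from the code book.

```python
import pandas as pd

# Toy data: a numeric group code and two measurement columns.
toy_df = pd.DataFrame({
    "group": [1, 1, 2, 2],
    "a": [1.0, 3.0, 2.0, 4.0],
    "b": [2.0, 2.0, 6.0, 8.0],
})

# Mean of the selected columns within each group; the group codes become the index.
toy_means = toy_df.groupby("group")[["a", "b"]].mean()

# New column as the row-wise sum of existing columns.
toy_means["total"] = toy_means["a"] + toy_means["b"]

# Replace the numeric index with readable labels.
# The list must have one label per row, in the same order as the rows
# (groupby sorts the group keys in ascending order by default).
toy_labels = ["first", "second"]
toy_means.index = toy_labels

print(toy_means)
```

Assigning a list to the index matches labels to rows purely by position, so it is worth displaying the result to confirm that each label sits next to the intended group.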

# Write your code here


# Display group_means to check that the index has been updated.
# On the left-hand side you should see the Age group labels, from "15-24" to "75+".
group_means

e) Sort group_means in descending order of Total time (hour). Store this sorted DataFrame in a variable called group_means_sorted.
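Sorting by one column in descending order can be sketched as follows. The DataFrame, its total column, and the row labels are toy values chosen for illustration.

```python
import pandas as pd

# Toy summary table with labelled rows (hypothetical values).
toy_means = pd.DataFrame(
    {"total": [4.0, 10.0, 7.0]},
    index=["first", "second", "third"],
)

# sort_values returns a new DataFrame; ascending=False puts the largest value first.
toy_sorted = toy_means.sort_values("total", ascending=False)

print(toy_sorted)
```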

# Write your code here

f) Use well_balanced_df to create three side-by-side boxplots using layout = (1, 3) and figsize = (20, 10) of time spent (in hours) socializing, exercising, and sleeping for each age group. Store these boxplots in a variable called time_boxplots.
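A rough sketch of grouped side-by-side boxplots with the pandas DataFrame.boxplot method, assuming matplotlib is installed. The column names group, a, b, c and the variable toy_boxplots are hypothetical. The call to matplotlib.use("Agg") is only there so the sketch runs as a plain script without a display; it is an assumption of this example and is not needed inside a notebook.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook

import pandas as pd

# Toy data: one grouping column and three numeric columns.
toy_df = pd.DataFrame({
    "group": [1, 1, 1, 2, 2, 2],
    "a": [1.0, 2.0, 3.0, 2.0, 4.0, 6.0],
    "b": [0.5, 1.0, 1.5, 1.0, 1.0, 2.0],
    "c": [7.0, 8.0, 9.0, 6.0, 7.0, 8.0],
})

# One panel per listed column, each with one box per value of "group".
# layout=(1, 3) arranges the panels in one row of three;
# figsize sets the figure width and height in inches.
toy_boxplots = toy_df.boxplot(
    column=["a", "b", "c"],
    by="group",
    layout=(1, 3),
    figsize=(20, 10),
)
```

The tick labels on each horizontal axis are the raw values of the grouping column, which is why numeric codes appear there unless the column is recoded first.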

# Write your code here

Task 5#

The tick labels on the horizontal axes of time_boxplots are not informative unless the viewer knows which age group each value represents. Fix the labels of the boxplots by recoding Age group using the labels in the code book (see gss_tu2016_codebook.txt).

a) First, create a copy of well_balanced_df (using the DataFrame .copy() method), and store it in a variable called well_balanced_age_label_df. For that new DataFrame, recode Age group by adding a column called Age group label with the Age group labels found in the code book.
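One possible way to recode numeric codes into labels is the Series .map method with a dictionary, sketched below on toy data. The codes, the labels "low", "medium", "high", and all variable names are invented for this example; the real codes and labels are in the code book.

```python
import pandas as pd

# Toy data: a column of numeric codes (hypothetical).
toy_df = pd.DataFrame({"group": [1, 2, 3, 1]})

# copy() makes an independent DataFrame, so the original is not modified.
toy_copy = toy_df.copy()

# Dictionary from code to label (made-up labels).
toy_labels = {1: "low", 2: "medium", 3: "high"}

# map looks up each code in the dictionary; codes missing from it become NaN.
toy_copy["group label"] = toy_copy["group"].map(toy_labels)

print(toy_copy)
```

Working on a copy keeps the numeric codes available in the original DataFrame for any later step that still needs them.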

# Write your code here

b) Create the same boxplots as in Task 4 f) using layout = (1, 3) and figsize = (20, 10), but group by Age group label so that the tick labels on the horizontal axes of the boxplots are informative. Store these boxplots in a variable called time_boxplots_age_label.

# Write your code here

Task 6 (Written Discussion)#

a) Which age group spends the most time sleeping, exercising, and socializing? Does your ranking change if you use the mean or the median as a summary measure of time? Briefly explain why your ranking does or does not change, and which ranking is the best representation.

b) Which age group shows the most variability in time spent socializing? Provide a brief explanation of why this group shows the most variability.

c) State one limitation of basing this data analysis on only respondents who spent more than zero time sleeping, exercising, and socializing. Briefly explain why it’s a limitation to your findings in Tasks 4 and 5.

Answer Task 6 here.