GGR274 Homework 4: Time Use Survey Data

Introduction

For this week’s homework, you will use the Statistics Canada GSS Time Use Dataset. This time, we’re going to dig into some of the well-being variables (feeling rushed) and respondent characteristic variables (how people commute to work).

Question

The question you’re answering in this homework:

Among Canadians who live in rural communities, is it less common to feel rushed and take transit to work, or to feel rushed and not take transit to work?

Homework Instructions and Learning Objectives

The goal of this homework is to answer the question above by performing these steps:

  • Read the Time Use Dataset into a pandas DataFrame.

  • Select specific columns and rows of the DataFrame.

  • Compute the proportions of rural respondents that feel rushed and either use or don’t use public transit to commute to work.

  • Interpret the results of the analysis.

Task 1

a) Read the time use data set stored in the csv file gss_tu2016_main_file.csv into a pandas DataFrame and store the DataFrame in a variable named time_use_data_raw.

The file is located in the same folder as the notebook.

import pandas as pd

time_use_data_raw = pd.read_csv("gss_tu2016_main_file.csv")

b) The columns we will need for the analysis to answer the question are:

  • CASEID: participant ID

  • luc_rst: Urban/Rural

  • gtu_110: How often does one feel rushed?

  • ctw_140c: Commute to work - Public transit

Create a new DataFrame using time_use_data_raw that only contains the four columns in the order listed above. The first column should be CASEID, the second column should be luc_rst, etc. This new DataFrame should be stored in a variable named time_use_data.

time_use_data = time_use_data_raw[["CASEID",
                                   "luc_rst",
                                   "gtu_110",
                                   "ctw_140c"]]

c) Create a Python dictionary, stored in a variable called new_column_names, that maps each old column name to its new column name according to the following table:

| old name | new name |
| --- | --- |
| CASEID | case_ID |
| luc_rst | urban_rural |
| gtu_110 | feeling_rushed |
| ctw_140c | public_transit |

You’ll use this dictionary to rename the columns in part (d) below.

new_column_names = {"CASEID": "case_ID",
                    "luc_rst": "urban_rural",
                    "gtu_110": "feeling_rushed",
                    "ctw_140c": "public_transit"}

d) Use the dictionary new_column_names created in the previous step to rename the columns of the DataFrame stored in time_use_data. Store this new DataFrame in a variable called clean_time_use_data.

clean_time_use_data = time_use_data.rename(columns = new_column_names)
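As a minimal sketch of how rename behaves, the toy DataFrame below (made-up values, hypothetical variable names prefixed with toy_) shows two things: rename returns a new DataFrame and leaves the original untouched, and dictionary keys that do not match any column are ignored by default.

```python
import pandas as pd

# Toy data standing in for the survey columns; the values are made up.
toy = pd.DataFrame({"CASEID": [101, 102], "luc_rst": [1, 2]})

# "gtu_110" is not a column of toy; rename() skips it by default.
toy_names = {"CASEID": "case_ID",
             "luc_rst": "urban_rural",
             "gtu_110": "feeling_rushed"}

# rename() returns a renamed copy rather than changing toy in place.
toy_clean = toy.rename(columns=toy_names)

print(list(toy.columns))        # original column names
print(list(toy_clean.columns))  # renamed column names
```

This is why the result has to be stored in a new variable such as clean_time_use_data: without the assignment, the renamed DataFrame would be discarded.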

Task 2

a) Use the codebook for the Time Use Survey in the file gss_tu2016_codebook.txt to guide you in creating boolean variables using clean_time_use_data that correspond to the following conditions, and store the results in the variable names specified below.

| Condition | variable name |
| --- | --- |
| Commutes to work by taking public transit | transit_yes |
| Does not commute to work by taking public transit | transit_no |
| Respondent feels rushed (regardless of the frequency) | feeling_rushed |
| Lives in a rural area or small population centre | rural |

Tip: go to File -> Open menu action to find the gss_tu2016_codebook.txt file in the same folder as this notebook.

# commute to work - public transit: 1 = Yes
transit_yes = (clean_time_use_data['public_transit'] == 1)
# commute to work - public transit: 2 = No
transit_no = (clean_time_use_data['public_transit'] == 2)
# general time use - feel rushed: 1 to 5 = Every day to Less than once a month
feeling_rushed = (clean_time_use_data['feeling_rushed'] <= 5)
# population centre indicator: 2 = Rural areas and small population centres
rural = (clean_time_use_data['urban_rural'] == 2)
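A minimal sketch of what these comparisons produce, using made-up response codes rather than the survey data. It assumes that 6 means "Never" and uses 99 as a hypothetical non-response code; the codebook is the authority on the actual codes. The comparison is applied element by element, so the result is a Series with one True or False per respondent, and <= 5 excludes both the "Never" code and any larger special codes.

```python
import pandas as pd

# Made-up responses. Assumption: 6 means "Never"; 99 is a hypothetical
# non-response code used only for this illustration.
toy_rushed_codes = pd.Series([1, 5, 6, 99, 3])

# Element-wise comparison: one True/False value per respondent.
toy_feeling_rushed = toy_rushed_codes <= 5

print(toy_feeling_rushed.tolist())
```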

b) In this part of the task you will investigate the data types of one of the variables you created in the previous part.

i) Store the data type of transit_yes in a variable called transit_col_type and print the value of transit_col_type.

transit_col_type = type(transit_yes)

print(transit_col_type)
<class 'pandas.core.series.Series'>

ii) Store the data type of values in transit_yes in a variable called transit_data_type and print the value of transit_data_type.

transit_data_type = transit_yes.dtype

print(transit_data_type)
bool

c) Briefly explain the difference between the values of transit_col_type and transit_data_type.

Each column in a pandas DataFrame is a pandas Series, so transit_col_type is the type of the container object (pandas.core.series.Series). transit_data_type is the data type of the values stored inside that Series, which in this case is bool.
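The distinction can be sketched on a toy Series (made-up values, hypothetical name toy_mask): type() describes the object that holds the values, while the dtype attribute describes the values themselves.

```python
import pandas as pd

# A boolean Series built the same way as transit_yes, from made-up codes.
toy_mask = pd.Series([1, 2, 1]) == 1

toy_container_type = type(toy_mask)  # the container: a pandas Series
toy_value_type = toy_mask.dtype      # the values inside it: bool

print(toy_container_type)
print(toy_value_type)
```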

Task 3

In this section you will write a program in a series of steps to analyse the data.

Use the DataFrame clean_time_use_data and the variables that you created in Task 2 a).

The data analysis will be implemented by writing a Python program to compute two proportions that you will express as percentages (i.e., multiplying by 100).

\[{\text{Percent}_\text{Transit}} = \frac{\text{Total number of respondents in rural areas that take transit and feel rushed}}{\text{Total number of respondents in rural areas}}\times 100 \]
\[{\text{Percent}_\text{No Transit}} = \frac{\text{Total number of respondents in rural areas that do not take transit and feel rushed}}{\text{Total number of respondents in rural areas}}\times 100 \]

The program will be written in a series of steps.

a) Create a variable called total_rural that stores the total number of respondents that live in a rural area. Print the value of this variable. This is the value of: \(\text{Total number of respondents in rural areas}\) in the proportions above.

total_rural = len(clean_time_use_data[rural])

print(total_rural)
3551
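As a quick cross-check of this counting approach, here is a sketch on made-up data with hypothetical toy_ names. Because True counts as 1 and False as 0, summing a boolean Series should give the same count as taking the length of the filtered DataFrame.

```python
import pandas as pd

# Made-up urban/rural codes; 2 stands for rural, as in the task above.
toy = pd.DataFrame({"urban_rural": [1, 2, 2, 3, 2]})
toy_rural = toy["urban_rural"] == 2

# Two equivalent ways to count the rows where the condition holds.
count_by_len = len(toy[toy_rural])
count_by_sum = int(toy_rural.sum())

print(count_by_len, count_by_sum)
```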

b) Create a variable called rural_rush_transit that is True if a respondent has ever felt rushed AND uses public transit to commute to work AND lives in a rural area.

Then, use this variable to select rows in clean_time_use_data and then compute the number of such rows, storing the result in a variable called rural_rush_transit_num. This is the value of: \(\text{Total number of respondents in rural areas that take transit and feel rushed}\).

rural_rush_transit = (feeling_rushed & transit_yes & rural)
rural_rush_transit_num = len(clean_time_use_data[rural_rush_transit])
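A minimal sketch of combining conditions with &, on made-up data with hypothetical toy_ names. When the comparisons are written inline rather than stored in variables first, each one needs its own parentheses, because & binds more tightly than == and <=.

```python
import pandas as pd

# Made-up rows; the codes follow the conventions used in Task 2.
toy = pd.DataFrame({
    "urban_rural":    [2, 2, 2, 1],
    "feeling_rushed": [1, 6, 3, 2],
    "public_transit": [1, 1, 2, 1],
})

# A row is True only if all three conditions hold for that row.
toy_mask = (
    (toy["feeling_rushed"] <= 5)
    & (toy["public_transit"] == 1)
    & (toy["urban_rural"] == 2)
)

toy_count = int(toy_mask.sum())
print(toy_mask.tolist())
print(toy_count)
```

Only the first row satisfies all three conditions: the second never feels rushed, the third does not take transit, and the fourth is not rural.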

c) Calculate the proportion of respondents in rural areas that feel rushed and use public transit to commute to work. Store the result in a variable called rural_rush_transit_prop.

rural_rush_transit_prop = rural_rush_transit_num/total_rural

d) Print the value of rural_rush_transit_prop multiplied by 100 and rounded to two decimal places with the percent character (i.e., “%”) added to the end of the proportion. This is the value of: \({\text{Percent}_\text{Transit}}\).

rounded = round(rural_rush_transit_prop * 100, 2)
print(f"{rounded}%")
1.15%
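One detail of this formatting approach, sketched with a made-up percentage: round() returns a number, so a trailing zero is not printed, whereas the .2f format specifier always shows two decimal places. Either may be acceptable here; the sketch only illustrates the difference.

```python
percent = 2.5  # made-up value for illustration

# round() yields the float 2.5, which prints without a trailing zero.
with_round = f"{round(percent, 2)}%"

# The .2f format specifier pads the output to two decimal places.
with_format = f"{percent:.2f}%"

print(with_round)
print(with_format)
```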

e) Use the print function to print the following sentence:

The number of people that use transit and feel rushed is {XX}.

Fill in the value of {XX}.

print(f"The number of people that use transit and feel rushed is {len(clean_time_use_data[rural_rush_transit])}.")
The number of people that use transit and feel rushed is 41.

f) Create a variable called rural_rush_notransit that is True if a respondent has ever felt rushed AND does not use public transit to work AND lives in a rural area.

Then, use this variable to select rows in clean_time_use_data and then compute the number of such rows, storing the result in a variable called rural_rush_notransit_num.

This is the value of: \(\text{Total number of respondents in rural areas that do not take transit and feel rushed}\).

rural_rush_notransit = (feeling_rushed & transit_no & rural)
rural_rush_notransit_num = len(clean_time_use_data[rural_rush_notransit])

g) Use rural_rush_notransit to select rows in clean_time_use_data and compute the proportion of rural respondents (i.e., rows) that feel rushed and do not take public transit. Store the proportion in a variable rural_rush_monthly_prop.

rural_rush_monthly_prop = rural_rush_notransit_num/total_rural

h) Print the value of rural_rush_monthly_prop multiplied by 100 and rounded to two decimal places with the percent character (i.e., “%”) added to the end of the proportion. This is the value of: \({\text{Percent}_\text{No Transit}}\).

notransit_rounded = round(rural_rush_monthly_prop * 100, 2)
print(f"{notransit_rounded}%")
40.52%

i) Use the print function to print the following sentence:

The number of people that do not use transit and feel rushed is {XX}.

Fill in the value of {XX}.

print(f"The number of people that do not use transit and feel rushed is {len(clean_time_use_data[rural_rush_notransit])}.")
The number of people that do not use transit and feel rushed is 1439.
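When interpreting the two percentages, it may help to remember that they share the denominator total_rural, but transit_yes and transit_no need not cover every rural respondent (for example, respondents who were not asked the commuting question). The sketch below uses made-up data and hypothetical toy_ names, with 6 as an assumed skip code for public_transit, to show that the two percentages can add up to less than the overall share of rural respondents who feel rushed.

```python
import pandas as pd

# Made-up rows. Assumption: public_transit code 6 is a hypothetical
# "skipped" code; check the codebook for the real special codes.
toy = pd.DataFrame({
    "urban_rural":    [2, 2, 2, 2, 1],
    "feeling_rushed": [1, 2, 6, 3, 1],
    "public_transit": [1, 2, 2, 6, 1],
})
toy_rural = toy["urban_rural"] == 2
toy_rushed = toy["feeling_rushed"] <= 5
toy_transit_yes = toy["public_transit"] == 1
toy_transit_no = toy["public_transit"] == 2


def percent_of_rural(mask):
    """Percent of rural rows that also satisfy mask, to two decimals."""
    return round(float((mask & toy_rural).sum() / toy_rural.sum() * 100), 2)


pct_transit = percent_of_rural(toy_rushed & toy_transit_yes)
pct_no_transit = percent_of_rural(toy_rushed & toy_transit_no)
pct_rushed = percent_of_rural(toy_rushed)

print(pct_transit, pct_no_transit, pct_rushed)
```

In this toy data one rural respondent feels rushed but has the skip code for transit, so that row counts toward the overall rushed share but toward neither of the two transit percentages.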

Task 4

Answer the following questions.

a) Is the data analysis above sufficient to answer the original question? If yes then explain why it’s sufficient, otherwise explain what type of analysis would have provided appropriate information to help answer the question. Briefly explain your reasoning.

Sample answer (there are several possibilities):

The data analysis is sufficient to answer the original question. Among respondents in rural communities, 40.52% feel rushed and do not take transit, compared to 1.15% who feel rushed and take transit. This is evidence that, among Canadians in rural communities, feeling rushed and taking transit is less prevalent than feeling rushed and not taking transit.

b) Does the data analysis you performed above provide evidence that Canadians who live in rural areas and use public transit have a poorer mental health than those who don’t use public transit to commute to work? If the analysis doesn’t support this claim then describe an analysis that would give you evidence to evaluate this claim. Briefly explain your reasoning.

Sample answer (there are several possibilities):

If we assume that feeling rushed is equivalent to poor mental health, then this claim could be justified. But this is a tenuous assumption, since there is no reason to assume that the two measures reflect the same underlying issue. Instead of gtu_110 (feeling rushed), we could have used srh_115 (self-rated mental health) and conducted the same analysis with that column in place of gtu_110.
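One possible shape for such an analysis is sketched below on made-up data. Several things here are assumptions rather than part of the homework: the column name mental_health is a hypothetical rename of srh_115, the coding (1 = Excellent through 5 = Poor) should be confirmed in the codebook, and treating codes 4 and 5 as "poorer mental health" is a choice made only for illustration. Unlike Task 3, this sketch computes the percentage within each transit group (a conditional proportion), which speaks more directly to a comparison between transit users and non-users.

```python
import pandas as pd

# Made-up rows. Assumed coding for mental_health (hypothetical name for
# srh_115): 1 = Excellent ... 5 = Poor. Confirm against the codebook.
toy = pd.DataFrame({
    "urban_rural":    [2, 2, 2, 2, 2, 1],
    "public_transit": [1, 1, 2, 2, 2, 1],
    "mental_health":  [4, 2, 1, 5, 3, 5],
})

# Keep rural respondents with a Yes (1) or No (2) answer on transit.
toy_rural = toy[toy["urban_rural"] == 2]
toy_answered = toy_rural[toy_rural["public_transit"].isin([1, 2])]

# Flag fair or poor self-rated mental health (codes 4 and 5 only, so
# that any larger special codes would not be counted).
toy_flagged = toy_answered.assign(
    fair_or_poor=toy_answered["mental_health"].isin([4, 5])
)

# Percent flagged within each transit group (1 = Yes, 2 = No).
toy_share = toy_flagged.groupby("public_transit")["fair_or_poor"].mean() * 100

print(toy_share.round(2))
```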

Marking Rubric

| Section | 0 | 1 | 2 | 3 |
| --- | --- | --- | --- | --- |
| Computational questions (for each part) | auto test fails | auto test passes | NA | NA |
| Qualitative questions (for each part) | No answer | The question is answered but no explanation is given | The question is answered but the explanation is not supported or weakly supported | The question is answered and the explanation is supported |