GGR274 Midterm Test#

Read this section before starting the test

Test Instructions#

  • The test is divided into eight questions. Complete all questions below.

  • The answers to the questions will be submitted on MarkUs using a similar workflow to the lab and homework assignments, except that MarkUs will not give you feedback on passing or failing the autotests.

  • We have provided some code to help you check your programming work in each cell.

  • Answers where you are asked to write Python code will be autograded, and written answers will be graded manually by the teaching team.

Marking Rubric#

| Section | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| python computation steps | auto test fails | auto test passes | NA | NA |
| Describe what you did to the data (for each part) | No answer | A partial description is given that explains what the python code did to the data | A full description that uses data science terminology is given that explains what the python code did to the data and why this step is important | NA |
| Conclusion (for each part) | No answer | The question is answered but no explanation is given | The question is answered but the explanation is not supported or weakly supported by the data | The question is answered and the explanation is supported by the data |

Aids Allowed and Academic Integrity#

  • You are allowed to use any materials from the course or any other written sources (e.g., books, websites).

  • You are not allowed to directly receive or give help during the test period. In other words, all work must be your own, and you must not discuss or post any information about this test with anyone during the test period.

  • As a student, you alone are responsible for ensuring the integrity of your work and for understanding what constitutes an academic offense.

Time Allowed#

  • The test will be available at 12:00 PM on Sunday, February 18, and must be submitted on MarkUs under the midterm-makeup section (see Submission Instructions) by 2:05 PM on Sunday, February 18.

  • Late tests will receive a grade of zero unless you have an approved accommodation from your instructor.

How do I ask a question during the test?#

  • The teaching team will be available on the class Zoom link during the test in case you have any questions.

  • The class discussion forum on Piazza will be disabled during the test period.

Submission Instructions#

You will submit your work on MarkUs. To submit your work:

  1. Download this file (Midterm.ipynb) from JupyterHub. (See our JupyterHub Guide for detailed instructions.)

  2. Submit this file to MarkUs under the midterm assignment. (See our MarkUs Guide for detailed instructions.) All homeworks will take place in a Jupyter notebook (like this one). When you are done, you will download this notebook and submit it to MarkUs.

Questions#

Below is an outline of the questions that you will answer in this midterm test.

For your convenience, you can keep track of which questions you have answered by adding an x between the square brackets [  ].

- [ ] 
- [x] 

The above Markdown will render like this

  • [ ]

  • [x]

Check marks are not required for full marks. This is to help you stay organized during the test, but please feel free to ignore this if you wish.

List of Questions in this Test#

Below are hyperlinks to individual test questions. A hyperlink back to this section is provided at the end of most questions.

# Compute total marks in this test and print out a nice message.

num_question_parts = 21

total_test_marks = 41

time = 120

print(f"""
There are {num_question_parts} question parts.\n
The total number of marks in this test is {total_test_marks}. \n
The total amount of time to complete this test is {time} minutes. \n
\U0001F447The test questions start below.\U0001F447
""")

Question 1#

Canada has a current population of 39,401,996, which we have stored below in the variable canada_population.

Of these, the total number of immigrants is 8,361,505, which we have stored below in the variable num_immigrants.

[Take me back to Questions section.]

# These variables are given to you. Do not change them!
canada_population = "39,401,996"
num_immigrants = 8361505

Question 1a#

Use the Python functions int and str.replace to convert canada_population to an integer.

Step 1: Use str.replace(",", "") to remove the , from the string '39,401,996'. Store the result in canada_population_no_comma. (1 mark)

Step 2: Use int to convert canada_population_no_comma to an integer. Store the result in canada_population_int. (1 mark)
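
As a minimal sketch of these two steps on a different, made-up value (the names example_text, example_no_comma, and example_int are illustrative and are not part of the test):

```python
# Hypothetical value, not the one used in this question
example_text = "1,234,567"

# Step 1: remove every comma from the string
example_no_comma = example_text.replace(",", "")

# Step 2: convert the comma-free string to an integer
example_int = int(example_no_comma)

print(example_no_comma)  # 1234567 (still a string)
print(example_int + 1)   # 1234568 (arithmetic now works)
```

Note that str.replace returns a new string and leaves the original unchanged, which is why the result is stored in a new variable.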

[Take me back to Questions section.]

# Write your code here

Question 1b#

Use the variables in Question 1a to calculate the percentage of Canada’s population who are immigrants, as a float between 0.00% and 100.00%, rounded to 2 decimal places. Store the result in a variable called percent_immigrants. (1 mark)
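
The general pattern of turning a fraction into a rounded percentage can be sketched as follows, using made-up counts (part, whole, and percent are illustrative names, not the variables required in this question):

```python
# Hypothetical counts, not the values from Question 1
part = 1
whole = 3

# Multiply the fraction by 100 to get a percentage,
# then round the result to 2 decimal places
percent = round(part / whole * 100, 2)

print(percent)  # 33.33
```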

[Take me back to Questions section.]

# Write your code here


# This code is provided for you to check your work
print(f"Percentage of immigrants: {percent_immigrants}%")

Question 2#

We have given you a long string in the code cell below. You have the following tasks:

  1. First, split the string into sentences using the split() method with a correct argument. Store the result in a variable called my_sentences. (0.5 mark)

  2. Then, create an empty list called my_lengths (0.5 mark).

  3. Using a for loop, iterate over my_sentences. Inside the loop, split each sentence into words (use the split() method again). You can store the resulting list of words for the current sentence in a variable called current_words. (2 marks)

  4. Calculate the length of each sentence in words, and append the length to my_lengths (1 mark).

You can print out your two variables my_sentences and my_lengths to check your work for this question. Remember you can also print out variables from inside the loop if you want to check intermediary steps.
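
The overall split-then-loop pattern can be sketched on a made-up string that uses a different separator from the one in this question (example_text, example_parts, and example_lengths are illustrative names):

```python
# Hypothetical text whose parts are separated by "; "
example_text = "apples and pears; oranges; kiwi fruit and plums"

# Split the text into parts using the separator as the argument
example_parts = example_text.split("; ")

example_lengths = []
for part in example_parts:
    # split() with no argument splits on whitespace, giving a list of words
    words = part.split()
    example_lengths.append(len(words))

print(example_parts)    # ['apples and pears', 'oranges', 'kiwi fruit and plums']
print(example_lengths)  # [3, 1, 4]
```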

[Take me back to Questions section.]

# This variable is given to you: do not change it!
mindfulness_statement = "Breathe deeply. Trust your preparation. Remember: data doesn't judge"
# Write your code here

# This code is provided to you to check your work
print(my_sentences)
print(my_lengths)

Question 3#

In the questions below, you will use data from the Statistics Canada Time Use dataset to explore how time spent reading vs. watching TV/videos differs between age groups.

The code blocks and questions below will guide you through this analysis.

[Take me back to Questions section.]

Question 3a#

We will use your student number as data for this midterm. Complete the assignment statement below by typing your student number as an int. In other words, assign your student number as an integer to the variable student_number. (1 mark)

[Take me back to Questions section.]

# Replace the ... with your student number
student_number = ...

# This checks that you correctly typed in your student_number as an int.
# Make sure there's no error when you run this cell!
assert type(student_number) == int

Question 3b#

Read the CSV file gss_tu2016_main_file.csv into a DataFrame and store it in a variable named time_use_all_data. (1 mark)
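
As a small illustration of how pandas.read_csv turns CSV text into a DataFrame, the sketch below reads from an in-memory string instead of a file on disk (the CSV content and the names example_csv and example_df are made up for this example):

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk
example_csv = io.StringIO("city,population\nToronto,100\nOttawa,50\n")

# pd.read_csv accepts a file path or a file-like object and returns a DataFrame;
# the first line of the CSV becomes the column names
example_df = pd.read_csv(example_csv)

print(example_df.shape)          # (2, 2)
print(list(example_df.columns))  # ['city', 'population']
```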

[Take me back to Questions section.]

import pandas as pd


# Write your answer in this cell


# This code is provided to help you check your work
time_use_all_data.head()

Question 3c#

Run the code cell below. This code uses pandas.DataFrame.sample to take a random sample of 75% of the rows from time_use_all_data, and names the resulting DataFrame time_use_data. (1 mark)
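
A minimal sketch of how sample behaves with the frac and random_state arguments, using a small made-up DataFrame (example_df, sample_a, and sample_b are illustrative names, and 42 is an arbitrary seed):

```python
import pandas as pd

# Hypothetical DataFrame with 8 rows
example_df = pd.DataFrame({"value": range(8)})

# frac=0.75 keeps 75% of the rows (6 of 8 here);
# random_state fixes the random draw so it can be reproduced
sample_a = example_df.sample(frac=0.75, random_state=42)
sample_b = example_df.sample(frac=0.75, random_state=42)

print(len(sample_a))              # 6
print(sample_a.equals(sample_b))  # True
```

Because the seed determines which rows are drawn, two students with different seeds will generally work with different samples of the same size.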

[Take me back to Questions section.]

#
# CAUTION: Don't modify this code cell
#
seed = student_number % 1000
time_use_data = time_use_all_data.sample(frac = 0.75, random_state = seed)

# The following checks that your DataFrame has the correct shape.
# You will automatically get the 1 mark if the first message ('Your time_use_data ... move on!') is printed.

if time_use_data.shape == (13042, 350):
    print('Your time_use_data variable has the correct shape, and you are ready to move on!')
else:
    print('Please check that you have entered your student number correctly and have run the previous cells.')

Question 4#

Question 4a#

Create a list of these column names from time_use_data and store this list in a variable named important_columns:

  • 'agegr10': Age group of respondent (groups of 10)

  • 'readdur': Duration - Reading - Online or paper version

  • 'tvdur': Duration - Watching television or videos

Important: Make sure that the column names in important_columns are in the order presented above (i.e., the first element of the list should be 'agegr10').

(1 mark)

[Take me back to Questions section.]

# Write your answer in this cell


# Check your work: important_columns should be a list of 3 strings
print(f"important_columns: {important_columns}")

Question 4b#

Use important_columns to create a new DataFrame from time_use_data that only contains the three columns listed in Question 4a and store this new DataFrame in a variable named sub_time_use_data. (1 mark)
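
Selecting columns with a list of names can be sketched on a made-up DataFrame as follows (example_df, wanted_columns, and subset are illustrative names, not the ones required in this question):

```python
import pandas as pd

# Hypothetical DataFrame with three columns
example_df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
wanted_columns = ["c", "a"]

# Indexing with a list of column names returns a new DataFrame
# containing only those columns, in the order given by the list
subset = example_df[wanted_columns]

print(list(subset.columns))  # ['c', 'a']
print(subset.shape)          # (2, 2)
```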

[Take me back to Questions section.]

# Write your answer in this cell


# This code is provided to help you check your work
sub_time_use_data.head()

Question 4c#

Briefly describe in plain language what you did to the data when you completed Question 4a and Question 4b. Put your answer in the Markdown cell below. (2 marks)

[Take me back to Questions section.]

Write your answer to 4c here

Definition of Data Values#

agegr10 : Age group of respondent (groups of 10)

           VALUE  LABEL
               1  15 to 24 years
               2  25 to 34 years
               3  35 to 44 years
               4  45 to 54 years
               5  55 to 64 years
               6  65 to 74 years
               7  75 years and over
              96  Valid skip
              97  Don't know
              98  Refusal
              99  Not stated

           Data type: numeric
           Missing-data codes: 96-99

readdur : Duration - Reading - Online or paper version

    VALUE  LABEL
       0  No time spent doing this activity
    9996  Valid skip
    9997  Don't know
    9998  Refusal
    9999  Not stated

    Data type: numeric
    Missing-data codes: 9996-9999

tvdur : Duration - Watching television or videos

    VALUE  LABEL
       0  No time spent doing this activity
    9996  Valid skip
    9997  Don't know
    9998  Refusal
    9999  Not stated

    Data type: numeric
    Missing-data codes: 9996-9999

Question 4d#

Using Code cell 1 written by a Data Scientist below, answer the following question:

Which column in sub_time_use_data is transformed in that code cell? Briefly explain how you know. (2 marks)

[Take me back to Questions section.]

Write your answer to 4d here

Question 4e#

In Code cell 1 written by a Data Scientist below, briefly explain why

cond = (sub_time_use_data['agegr10'] == 6) | (sub_time_use_data['agegr10'] == 7)
sub_time_use_data.loc[cond, 'agegr10 recode'] = '65 years and over'

required 2 logical operators, as opposed to the 1 used for the other age group created. (2 marks)

[Take me back to Questions section.]

Write your answer to 4e here

Code cell 1 written by a Data Scientist#

Run the code cell below, but do not modify the code.

# This code was written by a Data Scientist
#
# CAUTION: Don't modify this code cell
#
sub_time_use_data = sub_time_use_data.copy()

sub_time_use_data.loc[sub_time_use_data['agegr10'] == 5, 'agegr10 recode'] = '55-64'

cond = (sub_time_use_data['agegr10'] == 6) | (sub_time_use_data['agegr10'] == 7)

sub_time_use_data.loc[cond, 'agegr10 recode'] = '65 years and over'

sub_time_use_data.loc[sub_time_use_data['agegr10'] > 7, 'agegr10 recode'] = None

Code cell 2 written by a Data Scientist#

Run the code cell below, but do not modify the code.

# This code was written by a Data Scientist
#
# CAUTION: Don't modify this code cell
#
column_names = {
    'agegr10 recode' : 'Age Group', 
    'readdur': 'Reading (min)',
    'tvdur': 'Watching (min)'
}

sub_time_use_rename = sub_time_use_data.copy()

sub_time_use_rename.rename(columns = column_names, inplace = True)
sub_time_use_rename = sub_time_use_rename[['Age Group', 'Reading (min)', 'Watching (min)']]

Question 4f#

In the Markdown cell below, briefly explain what value of the inplace parameter was used in the rename method in Code cell 2 written by a Data Scientist, and why it was used. Hint: You may find this documentation helpful. (2 marks)

[Take me back to Questions section.]

Write your answer to 4f here.

Question 5#

The two duration columns in sub_time_use_rename currently list their times in minutes.

Add two new columns to sub_time_use_rename called 'Reading (hours)' and 'Watching (hours)', which contain their respective durations converted into hours (by dividing the number of minutes by 60). (2 marks, 1 mark each)
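
As a sketch of the unit conversion on a made-up DataFrame (the column names 'Walking (min)' and 'Walking (hours)' are illustrative and do not appear in the test data):

```python
import pandas as pd

# Hypothetical durations in minutes
example_df = pd.DataFrame({"Walking (min)": [30, 90, 120]})

# Dividing a column by 60 converts every value from minutes to hours;
# assigning to a new column name adds that column to the DataFrame
example_df["Walking (hours)"] = example_df["Walking (min)"] / 60

print(example_df["Walking (hours)"].tolist())  # [0.5, 1.5, 2.0]
```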

[Take me back to Questions section.]

# Write your answer in this cell





# Check your work: you should see your two new columns added to sub_time_use_rename
sub_time_use_rename.head()

Question 6#

Use sub_time_use_rename to:

  • Create a boolean Series for younger adults that is True if a respondent is aged 55-64 and False otherwise. Name this series younger_adults. (1 mark)

  • Create a boolean Series for older adults that is True if a respondent is 65 years and older, and False otherwise. Name this series older_adults. (1 mark)
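
A boolean Series is produced by comparing a column to a value. A minimal sketch on made-up data (example_df, the 'Group' column, and is_group_a are illustrative names):

```python
import pandas as pd

# Hypothetical group labels
example_df = pd.DataFrame({"Group": ["A", "B", "A", "C"]})

# The comparison is applied to every row and yields True or False for each one
is_group_a = example_df["Group"] == "A"

print(is_group_a.tolist())  # [True, False, True, False]
print(is_group_a.sum())     # 2 (each True counts as 1)
```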

[Take me back to Questions section.]

# Write your answer here



# Check your work
print(younger_adults.value_counts())
print(older_adults.value_counts())

Question 7#

Question 7a#

Use the DataFrame method describe to summarize the following distributions:

  • 'Reading (hours)' for the 55-64 age group in sub_time_use_rename. Name the summary of the distribution younger_adults_reading_summary. (1 mark)

  • 'Watching (hours)' for the 55-64 age group in sub_time_use_rename. Name the summary of the distribution younger_adults_watching_summary. (1 mark)

  • 'Reading (hours)' for the 65 years and older age group in sub_time_use_rename. Name the summary of the distribution older_adults_reading_summary. (1 mark)

  • 'Watching (hours)' for the 65 years and older age group in sub_time_use_rename. Name the summary of the distribution older_adults_watching_summary. (1 mark)
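
One possible way to summarize a column for a single group is to filter the rows with a boolean Series and then call describe. The sketch below assumes made-up data and illustrative names (example_df, is_group_a, group_a_summary):

```python
import pandas as pd

# Hypothetical groups and durations
example_df = pd.DataFrame({
    "Group": ["A", "A", "B", "B"],
    "Hours": [1.0, 3.0, 2.0, 6.0],
})

# True for the rows that belong to group A
is_group_a = example_df["Group"] == "A"

# Keep only group A rows of the 'Hours' column, then summarize them
group_a_summary = example_df.loc[is_group_a, "Hours"].describe()

print(group_a_summary["count"])  # 2.0
print(group_a_summary["mean"])   # 2.0
print(group_a_summary["max"])    # 3.0
```

The summary also includes the standard deviation, minimum, and quartiles, which can be read from the same result by their labels (for example 'std' or '50%').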

[Take me back to Questions section.]

# Write your answer here
# Check your work (55-64 age group)
print(younger_adults_reading_summary)
print(younger_adults_watching_summary)
# Check your work (65+ age group)
print(older_adults_reading_summary)
print(older_adults_watching_summary)

Question 7b#

The statement sub_time_use_prop = sub_time_use_rename.copy() in the code cell below creates a copy of sub_time_use_rename named sub_time_use_prop.

Add a new column named Reading proportion to sub_time_use_prop which contains the following proportion:

\[\frac{\text{Time spent reading in hours}}{\text{Time spent reading in hours} + \text{Time spent watching in hours}}\]

(1 mark)
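
A proportion of this form can be sketched on made-up data as follows (the column names 'A (hours)', 'B (hours)', and 'A proportion' are illustrative and are not the names required in this question):

```python
import pandas as pd

# Hypothetical durations in hours for two activities
example_df = pd.DataFrame({
    "A (hours)": [1.0, 2.0, 0.0],
    "B (hours)": [3.0, 2.0, 0.0],
})

# Denominator: total time spent on both activities, computed row by row
total = example_df["A (hours)"] + example_df["B (hours)"]

# Proportion of the total that was spent on activity A
example_df["A proportion"] = example_df["A (hours)"] / total

print(example_df["A proportion"].tolist())  # [0.25, 0.5, nan]
```

In the last row both durations are 0, so the proportion is 0 divided by 0, which pandas stores as NaN (a missing value) rather than raising an error.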

[Take me back to Questions section.]

sub_time_use_prop = sub_time_use_rename.copy()
# Write your answer here

# Check your work
sub_time_use_prop.head()

Question 7c#

In this step, you will create a boxplot to compare the distributions of the proportion of time spent reading out of time spent reading or watching between the 55-64 and 65 years and older age groups.

Use the DataFrame method called boxplot to create the following boxplot from sub_time_use_prop:

  • A boxplot of the 'Reading proportion' for the two age groups 55-64 and 65 years and older. Store the result in a variable called boxplot_reading. (1 mark)

Do not use the optional input arguments such as figsize to avoid potential errors in MarkUs.

[Take me back to Questions section.]

# Write your answer in this cell


# Check your work
boxplot_reading;

Question 8#

Answer the following questions:

Question 8a#

Compare the distributions of time spent reading between 55-64 and 65+ age groups. Which group spends more time reading on average? What do the numerical descriptions of the distributions suggest? Briefly explain your reasoning. (3 marks)

[Take me back to Questions section.]

Write your answer to 8a here

Question 8b#

Compare the distributions of time spent watching between 55-64 and 65+ age groups. Which group spends more time watching? What do the numerical descriptions of the distributions suggest? Briefly explain your reasoning. (3 marks)

[Take me back to Questions section.]

Write your answer to 8b here

Question 8c#

Compare the distributions of the proportion of time spent reading out of time spent reading or watching between 55-64 and 65+ age groups. What does the boxplot suggest? Briefly explain your reasoning. (3 marks)

[Take me back to Questions section.]

Write your answer to 8c here

Question 8d#

What limitations might there be in using a box plot to investigate the relationship between the proportion of time spent reading and age group? What information is difficult to interpret? Hint: Compare the list of numerical descriptions using .describe() and the information you can clearly retrieve from the boxplot. (3 marks)

[Take me back to Questions section.]

Write your answer to 8d here