GGR274 Midterm Test#
Read this section before starting the test
Test Instructions#
The test is divided into eight questions. Complete all questions below.
The answers to the questions will be submitted on MarkUs using a similar workflow to the lab and homework assignments, except that MarkUs will not give you feedback on passing or failing the autotests.
We have provided some code to help you check your programming work in each cell.
Answers where you are asked to write Python code will be autograded, and written answers will be graded manually by the teaching team.
Marking Rubric#
Section | 0 | 1 | 2 | 3 |
---|---|---|---|---|
Python computation steps | Autotest fails | Autotest passes | NA | NA |
Describe what you did to the data (for each part) | No answer | A partial description is given that explains what the Python code did to the data | A full description that uses data science terminology is given that explains what the Python code did to the data and why this step is important | NA |
Conclusion (for each part) | No answer | The question is answered but no explanation is given | The question is answered but the explanation is not supported or weakly supported by the data | The question is answered and the explanation is supported by the data |
Aids Allowed and Academic Integrity#
You are allowed to use any materials from the course or any other written sources (e.g., books, websites).
You are not allowed to directly receive or give help during the test period. In other words, all work must be your own, and you must not discuss or post any information about this test with anyone during the test period.
As a student, you alone are responsible for ensuring the integrity of your work and for understanding what constitutes an academic offense.
Time Allowed#
The test will be available at 12:00 PM on Sunday, February 18, and must be submitted on MarkUs under the midterm-makeup section (see Submission Instructions) by 2:05 PM on Sunday, February 18.
Late tests will receive a grade of zero unless you have an approved accommodation from your instructor.
How do I ask a question during the test?#
The teaching team will be available on the class Zoom link during the test in case you have any questions.
The class discussion forum on Piazza will be disabled during the test period.
Submission Instructions#
You will submit your work on MarkUs. To submit your work:
Download this file (Midterm.ipynb) from JupyterHub. (See our JupyterHub Guide for detailed instructions.)
Submit this file to MarkUs under the midterm assignment. (See our MarkUs Guide for detailed instructions.)
All homeworks will take place in a Jupyter notebook (like this one). When you are done, you will download this notebook and submit it to MarkUs.
Questions#
Below is an outline of the questions that you will answer in this midterm test.
For your convenience, you can keep track of which questions you have answered by ticking off the boxes: add an x between the square brackets [ ].
- [ ]
- [x]
The above Markdown will render like this
[ ]
[x]
Check marks are not required for full marks. This is to help you stay organized during the test, but please feel free to ignore this if you wish.
List of Questions in this Test#
Below are hyperlinks to individual test questions. A hyperlink back to this section is provided at the end of most questions.
Question 1a (1 mark)
Question 1b (1 mark)
Question 2 (4 marks)
Question 3a (1 mark)
Question 3b (1 mark)
Question 3c (1 mark)
Question 4a (1 mark)
Question 4b (1 mark)
Question 4c (2 marks)
Question 4d (2 marks)
Question 4e (2 marks)
Question 4f (2 marks)
Question 5 (2 marks)
Question 6 (2 marks)
Question 7a (4 marks)
Question 7b (1 mark)
Question 7c (1 mark)
Question 8a (3 marks)
Question 8b (3 marks)
Question 8c (3 marks)
Question 8d (3 marks)
# Compute total marks in this test and print out a nice message.
num_question_parts = 21
total_test_marks = 41
time = 120
print(f"""
There are {num_question_parts} question parts.\n
The total number of marks in this test is {total_test_marks}. \n
The total amount of time to complete this test is {time} minutes. \n
\U0001F447The test questions start below.\U0001F447
""")
Question 1#
Canada has a current population of 39,401,996, which we have stored below in the variable canada_population.
Of these, the total number of immigrants is 8,361,505, which we have stored below in the variable num_immigrants.
[Take me back to Questions section.]
# These variables are given to you. Do not change them!
canada_population = "39,401,996"
num_immigrants = 8361505
Question 1a#
Use the Python functions int and str.replace to convert canada_population to an integer.
Step 1: Use str.replace(",", "") to remove the commas from the string '39,401,996'. Store the result in canada_population_no_comma. (1 mark)
Step 2: Use int to convert canada_population_no_comma to an integer. Store the result in canada_population_int. (1 mark)
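As a minimal sketch of the two steps on a different value, the example below strips the commas from a string and then converts it to an integer. The variable names and the number are hypothetical and are not part of the test.

```python
# Hypothetical value for illustration only (not the test's variable).
city_population = "2,794,356"

# Step 1: remove every comma; the result is still a string.
city_population_no_comma = city_population.replace(",", "")

# Step 2: convert the comma-free string to an integer.
city_population_int = int(city_population_no_comma)

print(city_population_no_comma)
print(type(city_population_int))
```

Calling int directly on a string that still contains commas raises a ValueError, which is why the replacement comes first.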
[Take me back to Questions section.]
# Write your code here
Question 1b#
Use the variables in Question 1a to calculate the percentage of Canada’s population who are immigrants, as a float between 0.00% and 100.00%, rounded to 2 decimal places.
Store the result in a variable called percent_immigrants. (1 mark)
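A percentage rounded to 2 decimal places can be sketched as follows. The numbers and the names part, whole, and percent_part are made up for illustration.

```python
# Hypothetical counts for illustration only.
part = 250
whole = 4000

# Divide the part by the whole, scale to a percentage, then round to 2 decimals.
percent_part = round(part / whole * 100, 2)

print(f"Percentage: {percent_part}%")
```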
[Take me back to Questions section.]
# Write your code here
# This code is provided for you to check your work
print(f"Percentage of immigrants: {percent_immigrants}%")
Question 2#
We have given you a long string in the code cell below. You have the following tasks:
First, split the string into sentences using the split() method with a correct argument. Store the result in a variable called my_sentences. (0.5 mark)
Then, create an empty list called my_lengths. (0.5 mark)
Using a for loop, iterate over my_sentences. Split each sentence into words inside the loop (use the split() method again). You can store the resulting list corresponding to the current sentence in the current_words variable inside the loop. (2 marks)
Calculate the length of each sentence in words, and append the length to my_lengths. (1 mark)
You can print out your two variables my_sentences and my_lengths to check your work for this question.
Remember you can also print out variables from inside the loop if you want to check intermediary steps.
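The pattern described above (split, loop, split again, append) might look like the sketch below on a different string. The text and the names prefixed with example_ are hypothetical, and the separator is an assumption that suits this particular example string.

```python
# Hypothetical text for illustration only.
example_text = "Plan ahead. Read each question twice. Check your work"

# Split into sentences on a period followed by a space.
example_sentences = example_text.split(". ")

# Collect the number of words in each sentence.
example_lengths = []
for sentence in example_sentences:
    current_words = sentence.split()  # no argument: split on whitespace
    example_lengths.append(len(current_words))

print(example_sentences)
print(example_lengths)
```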
[Take me back to Questions section.]
# This variable is given to you: do not change it!
mindfulness_statement = "Breathe deeply. Trust your preparation. Remember: data doesn't judge"
# Write your code here
# This code is provided to you to check your work
print(my_sentences)
print(my_lengths)
Question 3#
In the questions below you will use data from the Statistics Canada Time Use dataset to explore how time spent reading vs. watching TV/videos differs between age groups.
The code blocks and questions below will guide you through this analysis.
[Take me back to Questions section.]
Question 3a#
We will use your student number as data for this midterm. Complete the assignment statement below by typing your student number as an int. In other words, assign your student number as an integer to the variable student_number. (1 mark)
[Take me back to Questions section.]
# Replace the ... with your student number
student_number = ...
# This checks that you correctly typed in your student_number as an int.
# Make sure there's no error when you run this cell!
assert type(student_number) == int
Question 3b#
Read the CSV file gss_tu2016_main_file.csv into a DataFrame and store it in a variable named time_use_all_data. (1 mark)
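For reference, pandas.read_csv accepts a file path or a file-like object. The sketch below reads a tiny in-memory CSV so that it runs without any file on disk; the column names and values are invented for illustration.

```python
import io

import pandas as pd

# A small, made-up CSV held in memory instead of a file on disk.
csv_text = "id,minutes\n1,30\n2,0\n3,45\n"

# read_csv parses the header row into column names and the rest into rows.
example_df = pd.read_csv(io.StringIO(csv_text))

print(example_df.head())
```

With a file in the same folder as the notebook, the file name would be passed as a string in place of the io.StringIO object.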
[Take me back to Questions section.]
import pandas as pd
# Write your answer in this cell
# This code is provided to help you check your work
time_use_all_data.head()
Question 3c#
Run the code cell below. This code uses pandas.DataFrame.sample to take a random sample of 75% of the rows from time_use_all_data, and names the resulting DataFrame time_use_data. (1 mark)
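To illustrate how frac and random_state behave in pandas.DataFrame.sample, the sketch below samples a small made-up DataFrame twice with the same seed. The data and the seed value 42 are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical DataFrame with 100 rows.
example_df = pd.DataFrame({"value": range(100)})

# frac=0.75 keeps 75% of the rows; random_state fixes which rows are drawn.
sample_a = example_df.sample(frac=0.75, random_state=42)
sample_b = example_df.sample(frac=0.75, random_state=42)

print(sample_a.shape)
print(sample_a.equals(sample_b))
```

Because the seed is fixed, rerunning the cell draws the same rows each time, while a different seed would generally draw a different set of rows of the same size.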
[Take me back to Questions section.]
#
# CAUTION: Don't modify this code cell
#
seed = student_number % 1000
time_use_data = time_use_all_data.sample(frac = 0.75, random_state = seed)
# The following checks that your DataFrame has the correct shape.
# You will automatically get the 1 mark if the first message ('Your time_use_data ... move on!') is printed.
if time_use_data.shape == (13042, 350):
    print('Your time_use_data variable has the correct shape, and you are ready to move on!')
else:
    print('Please check that you have entered your student number correctly and have run the previous cells.')
Question 4#
Question 4a#
Create a list of these column names from time_use_data and store this list in a variable named important_columns:
'agegr10': Age group of respondent (groups of 10)
'readdur': Duration - Reading - Online or paper version
'tvdur': Duration - Watching television or videos
Important: Make sure that the column names in important_columns are in the order presented above (i.e., the first element of the list should be 'agegr10').
(1 mark)
[Take me back to Questions section.]
# Write your answer in this cell
# Check your work: important_columns should be a list of 3 strings
print(f"important_columns: {important_columns}")
Question 4b#
Use important_columns to create a new DataFrame from time_use_data that only contains the three columns listed in Question 4a and store this new DataFrame in a variable named sub_time_use_data. (1 mark)
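Selecting a subset of columns with a list of names can be sketched on a toy DataFrame as follows. The column names a, b, and c and the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical DataFrame with three columns.
example_df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# A list of column names; its order determines the column order of the result.
wanted = ["c", "a"]

# Indexing with a list returns a new DataFrame holding only those columns.
subset = example_df[wanted]

print(subset)
```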
[Take me back to Questions section.]
# Write your answer in this cell
# This code is provided to help you check your work
sub_time_use_data.head()
Question 4c#
Briefly describe in plain language what you did to the data when you completed Question 4a and Question 4b. Put your answer in the Markdown cell below. (2 marks)
[Take me back to Questions section.]
Write your answer to 4c here
Definition of Data Values#
agegr10 : Age group of respondent (groups of 10)
VALUE LABEL
1 15 to 24 years
2 25 to 34 years
3 35 to 44 years
4 45 to 54 years
5 55 to 64 years
6 65 to 74 years
7 75 years and over
96 Valid skip
97 Don't know
98 Refusal
99 Not stated
Data type: numeric
Missing-data codes: 96-99
readdur : Duration - Reading - Online or paper version
VALUE LABEL
0 No time spent doing this activity
9996 Valid skip
9997 Don't know
9998 Refusal
9999 Not stated
Data type: numeric
Missing-data codes: 9996-9999
tvdur : Duration - Watching television or videos
VALUE LABEL
0 No time spent doing this activity
9996 Valid skip
9997 Don't know
9998 Refusal
9999 Not stated
Data type: numeric
Missing-data codes: 9996-9999
Question 4d#
Using Code cell 1 written by a Data Scientist below, answer the following question:
Which column in sub_time_use_data is transformed in that code cell? Briefly explain how you know. (2 marks)
[Take me back to Questions section.]
Write your answer to 4d here
Question 4e#
In the Code cell 1 written by a Data Scientist below, briefly explain why
cond = (sub_time_use_data['agegr10'] == 6) | (sub_time_use_data['agegr10'] == 7)
sub_time_use_data.loc[cond, 'agegr10 recode'] = '65 years and over'
required 2 logical operators as opposed to 1 used for the other age group created. (2 marks)
[Take me back to Questions section.]
Write your answer to 4e here
Code cell 1 written by a Data Scientist#
Run the code cell below, but do not modify the code.
# This code was written by a Data Scientist
#
# CAUTION: Don't modify this code cell
#
sub_time_use_data = sub_time_use_data.copy()
sub_time_use_data.loc[sub_time_use_data['agegr10'] == 5, 'agegr10 recode'] = '55-64'
cond = (sub_time_use_data['agegr10'] == 6) | (sub_time_use_data['agegr10'] == 7)
sub_time_use_data.loc[cond, 'agegr10 recode'] = '65 years and over'
sub_time_use_data.loc[sub_time_use_data['agegr10'] > 7, 'agegr10 recode'] = None
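The same recoding pattern (a boolean condition passed to .loc, with | combining two comparisons) can be sketched on a toy DataFrame. The column names group_code and group_label and the labels are hypothetical; the isin call at the end is an alternative way to build the combined condition, not something used in the cell above.

```python
import pandas as pd

# Hypothetical codes; 99 stands in for a missing-data code.
toy = pd.DataFrame({"group_code": [1, 2, 3, 99]})

# One comparison selects the rows for a single code.
toy.loc[toy["group_code"] == 1, "group_label"] = "low"

# Two comparisons joined with | select rows matching either code.
either = (toy["group_code"] == 2) | (toy["group_code"] == 3)
toy.loc[either, "group_label"] = "high"

# isin builds the same boolean Series from a list of codes.
same_mask = toy["group_code"].isin([2, 3])

print(toy)
print(either.equals(same_mask))
```

Rows that match none of the conditions are left as missing values in the new column.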
Code cell 2 written by a Data Scientist#
Run the code cell below, but do not modify the code.
# This code was written by a Data Scientist
#
# CAUTION: Don't modify this code cell
#
column_names = {
    'agegr10 recode' : 'Age Group',
    'readdur': 'Reading (min)',
    'tvdur': 'Watching (min)'
}
sub_time_use_rename = sub_time_use_data.copy()
sub_time_use_rename.rename(columns = column_names, inplace = True)
sub_time_use_rename = sub_time_use_rename[['Age Group', 'Reading (min)', 'Watching (min)']]
Question 4f#
In the Markdown cell below, briefly explain what value of the inplace parameter was used in the rename method in Code cell 2 written by a Data Scientist, and why it was used. Hint: You may find this documentation helpful. (2 marks)
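As a small illustration of how the inplace parameter changes the behaviour of DataFrame.rename, the sketch below renames a column on a toy DataFrame both ways. The DataFrame and column names are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({"x": [1, 2]})

# Default (inplace=False): a renamed copy is returned and toy is unchanged.
renamed_copy = toy.rename(columns={"x": "X"})
columns_before = list(toy.columns)

# inplace=True: toy itself is modified and the method returns None.
result = toy.rename(columns={"x": "X"}, inplace=True)
columns_after = list(toy.columns)

print(columns_before, columns_after, result)
```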
[Take me back to Questions section.]
Write your answer to 4f here.
Question 5#
The two duration columns in sub_time_use_rename currently list their times in minutes.
Add two new columns to sub_time_use_rename called 'Reading (hours)' and 'Watching (hours)', which contain their respective durations converted into hours (by dividing the number of minutes by 60). (2 marks, 1 mark each)
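Adding a column computed from an existing one can be sketched as below. The 'Sleep (min)' column and its values are hypothetical.

```python
import pandas as pd

# Hypothetical durations in minutes.
toy = pd.DataFrame({"Sleep (min)": [480, 90, 0]})

# Arithmetic on a column applies to every row; assigning to a new name adds a column.
toy["Sleep (hours)"] = toy["Sleep (min)"] / 60

print(toy)
```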
[Take me back to Questions section.]
# Write your answer in this cell
# Check your work: you should see your two new columns added to sub_time_use_rename
sub_time_use_rename.head()
Question 6#
Use sub_time_use_rename to:
Create a boolean Series for younger adults that is True if a respondent is aged 55-64 and False otherwise. Name this series younger_adults. (1 mark)
Create a boolean Series for older adults that is True if a respondent is 65 years and older, and False otherwise. Name this series older_adults. (1 mark)
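A boolean Series of this kind comes from comparing a column to a value. The sketch below uses a hypothetical 'Size' column, including one missing entry to show how it is treated.

```python
import pandas as pd

# Hypothetical categories; None represents a missing label.
toy = pd.DataFrame({"Size": ["small", "large", "small", None]})

# The comparison is evaluated row by row; a missing label compares as False.
is_small = toy["Size"] == "small"

print(is_small.value_counts())
```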
[Take me back to Questions section.]
# Write your answer here
# Check your work
print(younger_adults.value_counts())
print(older_adults.value_counts())
Question 7#
Question 7a#
Use the DataFrame method describe to summarize the following distributions:
'Reading (hours)' for the 55-64 age group in sub_time_use_rename. Name the summary of the distribution younger_adults_reading_summary. (1 mark)
'Watching (hours)' for the 55-64 age group in sub_time_use_rename. Name the summary of the distribution younger_adults_watching_summary. (1 mark)
'Reading (hours)' for the 65 years and older age group in sub_time_use_rename. Name the summary of the distribution older_adults_reading_summary. (1 mark)
'Watching (hours)' for the 65 years and older age group in sub_time_use_rename. Name the summary of the distribution older_adults_watching_summary. (1 mark)
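One way to summarize a column for a single group is to filter the rows with a boolean Series and then call describe. This is a sketch on made-up data; the names Group, Hours, in_a, and summary_a are hypothetical.

```python
import pandas as pd

# Hypothetical data with two groups.
toy = pd.DataFrame({
    "Group": ["a", "a", "b", "b"],
    "Hours": [1.0, 3.0, 2.0, 6.0],
})

# Boolean Series marking the rows that belong to group "a".
in_a = toy["Group"] == "a"

# Keep only those rows of the 'Hours' column, then summarize them.
summary_a = toy.loc[in_a, "Hours"].describe()

print(summary_a)
```

The result is a Series indexed by statistic name (count, mean, std, min, the quartiles, and max), so individual values can be read with, for example, summary_a["mean"].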
[Take me back to Questions section.]
# Write your answer here
# Check your work (55-64 age group)
print(younger_adults_reading_summary)
print(younger_adults_watching_summary)
# Check your work (65+ age group)
print(older_adults_reading_summary)
print(older_adults_watching_summary)
Question 7b#
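Below, sub_time_use_prop = sub_time_use_rename.copy() creates a copy of sub_time_use_rename named sub_time_use_prop.
Add a new column named Reading proportion to sub_time_use_prop which contains the proportion of time spent reading out of the total time spent reading or watching:
Reading proportion = Reading (hours) / (Reading (hours) + Watching (hours))
(1 mark)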
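A proportion of one column out of the sum of two columns can be sketched as below. The columns 'A (hours)' and 'B (hours)' are hypothetical, and the middle row shows what happens when both durations are zero.

```python
import pandas as pd

# Hypothetical durations; the second row has zero time in both activities.
toy = pd.DataFrame({
    "A (hours)": [1.0, 0.0, 3.0],
    "B (hours)": [3.0, 0.0, 1.0],
})

# Share of A out of the combined time in A and B, computed row by row.
toy["A proportion"] = toy["A (hours)"] / (toy["A (hours)"] + toy["B (hours)"])

print(toy)
```

A row where both durations are zero yields a missing value (NaN) because the denominator is zero, so such rows drop out of later summaries and plots.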
[Take me back to Questions section.]
sub_time_use_prop = sub_time_use_rename.copy()
# Write your answer here
# Check your work
sub_time_use_prop.head()
Question 7c#
In this step you will create a boxplot to compare the distribution of the proportion of time spent reading (out of time spent reading or watching) between the 55-64 and 65 years and older age groups.
Use the DataFrame method called boxplot to create the following boxplot from sub_time_use_prop:
A boxplot of the 'Reading proportion' for the two age groups 55-64 and 65 years and older. Store the result in a variable called boxplot_reading. (1 mark)
Do not use the optional input arguments such as figsize to avoid potential errors in MarkUs.
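The column and by parameters of DataFrame.boxplot draw one box per group. The sketch below uses made-up data and hypothetical names; the backend line is an assumption that lets it run outside a notebook and is not needed in JupyterHub.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch also runs outside a notebook

import pandas as pd

# Hypothetical scores for two groups.
toy = pd.DataFrame({
    "Group": ["a", "a", "a", "b", "b", "b"],
    "Score": [0.1, 0.2, 0.4, 0.5, 0.7, 0.9],
})

# column picks the values to plot; by picks the grouping variable.
boxplot_example = toy.boxplot(column="Score", by="Group")
```

Rows whose grouping value is missing are left out of the plot, since they belong to no group.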
[Take me back to Questions section.]
# Write your answer in this cell
# Check your work
boxplot_reading;
Question 8#
Answer the following questions:
Question 8a#
Compare the distributions of time spent reading between 55-64 and 65+ age groups. Which group spends more time reading on average? What do the numerical descriptions of the distributions suggest? Briefly explain your reasoning. (3 marks)
[Take me back to Questions section.]
Write your answer to 8a here
Question 8b#
Compare the distributions of time spent watching between 55-64 and 65+ age groups. Which group spends more time watching? What do the numerical descriptions of the distributions suggest? Briefly explain your reasoning. (3 marks)
[Take me back to Questions section.]
Write your answer to 8b here
Question 8c#
Compare the distributions of the proportion of time spent reading out of time spent reading or watching between 55-64 and 65+ age groups. What does the boxplot suggest? Briefly explain your reasoning. (3 marks)
[Take me back to Questions section.]
Write your answer to 8c here
Question 8d#
What limitations might there be in using a box plot to investigate the relationship between the proportion of time spent reading and age group? What information is difficult to interpret? Hint: Compare the list of numerical descriptions using .describe() and the information you can clearly retrieve from the boxplot. (3 marks)
[Take me back to Questions section.]
Write your answer to 8d here