{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Homework 5\n",
"\n",
"## Logistics\n",
"\n",
"**Due date**: The homework is due at 23:59 on Monday, February 12.\n",
"\n",
"You will submit your work on [MarkUs](https://markus-ds.teach.cs.toronto.edu).\n",
"To submit your work:\n",
"\n",
"1. Download this file (`Homework_5.ipynb`) from JupyterHub. (See [our JupyterHub Guide](../../../guides/jupyterhub_guide.ipynb) for detailed instructions.)\n",
"2. Submit this file to MarkUs under the **hw5** assignment. (See [our MarkUs Guide](../../../guides/markus_guide.ipynb) for detailed instructions.)\n",
"\n",
"All homework assignments take place in a Jupyter notebook (like this one). When you are done, download the notebook and submit it to MarkUs.\n",
"\n",
"## Introduction\n",
"\n",
"In this homework, we explore:\n",
"- row and column selection\n",
"- creating new columns\n",
"- grouping\n",
"- summary statistics\n",
"- visualizing distributions\n",
"\n",
"**Question:** Explore sleeping, exercising, and socializing among Canadians."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Task 1\n",
"\n",
"a) Use the `pandas` method `read_csv` to read the file `gss_tu2016_main_file.csv` into a DataFrame. Store this `DataFrame` in a variable called `time_use_df`."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"time_use_df = pd.read_csv('gss_tu2016_main_file.csv')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"b) Create a subset of `time_use_df` with only the following columns: `dur41`, `dur47`, `sleepdur`, `agegr10`, `prv`. To do this, follow these steps:\n",
"\n",
"- Create a list called `analysis_columns` with the column names.\n",
"- Use `analysis_columns` to select these columns from `time_use_df` and store this `DataFrame` in a variable called `time_use_subset_df`."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" dur41 dur47 sleepdur agegr10 prv\n",
"0 0 0 510 5 46\n",
"1 90 0 420 5 59\n",
"2 0 0 570 4 47\n",
"3 395 60 510 6 35\n",
"4 0 0 525 2 35\n",
"... ... ... ... ... ...\n",
"17385 0 0 560 1 24\n",
"17386 0 0 600 5 24\n",
"17387 125 77 510 7 24\n",
"17388 0 0 785 6 24\n",
"17389 15 0 450 5 35\n",
"\n",
"[17390 rows x 5 columns]"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"analysis_columns = ['dur41', 'dur47', 'sleepdur', 'agegr10', 'prv']\n",
"\n",
"time_use_subset_df = time_use_df[analysis_columns]\n",
"\n",
"time_use_subset_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"c) In the next steps you will rename the columns of `time_use_subset_df` according to the following table:\n",
"\n",
"Old name | New name\n",
"---------|------------\n",
"`dur41` |`Socializing time`\n",
"`dur47` |`Exercising time`\n",
"`sleepdur`| `Sleep time`\n",
"`agegr10`|`Age group`\n",
"`prv` |`Province` \n",
"\n",
"Step 1: Create a dictionary called `new_col_names` with each *Old name* as a key and each *New name* as the corresponding value."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'dur41': 'Socializing time',\n",
" 'dur47': 'Exercising time',\n",
" 'sleepdur': 'Sleep time',\n",
" 'agegr10': 'Age group',\n",
" 'prv': 'Province'}"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"new_col_names = {\n",
" 'dur41': 'Socializing time', \n",
" 'dur47': 'Exercising time', \n",
" 'sleepdur': 'Sleep time', \n",
" 'agegr10': 'Age group',\n",
" 'prv': 'Province'\n",
"}\n",
"\n",
"new_col_names"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Step 2: Use `new_col_names` to rename the columns of `time_use_subset_df` and store the DataFrame with renamed columns in a variable called `time_use_subset_renamed_df`."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Socializing time Exercising time Sleep time Age group Province\n",
"0 0 0 510 5 46\n",
"1 90 0 420 5 59\n",
"2 0 0 570 4 47\n",
"3 395 60 510 6 35\n",
"4 0 0 525 2 35\n",
"... ... ... ... ... ...\n",
"17385 0 0 560 1 24\n",
"17386 0 0 600 5 24\n",
"17387 125 77 510 7 24\n",
"17388 0 0 785 6 24\n",
"17389 15 0 450 5 35\n",
"\n",
"[17390 rows x 5 columns]"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"time_use_subset_renamed_df = time_use_subset_df.rename(columns=new_col_names)\n",
"\n",
"time_use_subset_renamed_df\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Task 2\n",
"\n",
"Create columns in `time_use_subset_renamed_df` that convert time use from minutes to hours. Since 60 minutes is equal to 1 hour, we can divide the time use columns by 60 to compute the time in hours.\n",
"\n",
"To do this, create new columns in `time_use_subset_renamed_df` called \n",
"\n",
" + `Socializing time (hour)`, \n",
" + `Exercising time (hour)`, and \n",
" + `Sleep time (hour)` \n",
" \n",
"that contain (respectively) `Socializing time`, `Exercising time`, and `Sleep time` in hours."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Socializing time Exercising time Sleep time Age group Province \\\n",
"0 0 0 510 5 46 \n",
"1 90 0 420 5 59 \n",
"2 0 0 570 4 47 \n",
"3 395 60 510 6 35 \n",
"4 0 0 525 2 35 \n",
"... ... ... ... ... ... \n",
"17385 0 0 560 1 24 \n",
"17386 0 0 600 5 24 \n",
"17387 125 77 510 7 24 \n",
"17388 0 0 785 6 24 \n",
"17389 15 0 450 5 35 \n",
"\n",
" Socializing time (hour) Exercising time (hour) Sleep time (hour) \n",
"0 0.000000 0.000000 8.500000 \n",
"1 1.500000 0.000000 7.000000 \n",
"2 0.000000 0.000000 9.500000 \n",
"3 6.583333 1.000000 8.500000 \n",
"4 0.000000 0.000000 8.750000 \n",
"... ... ... ... \n",
"17385 0.000000 0.000000 9.333333 \n",
"17386 0.000000 0.000000 10.000000 \n",
"17387 2.083333 1.283333 8.500000 \n",
"17388 0.000000 0.000000 13.083333 \n",
"17389 0.250000 0.000000 7.500000 \n",
"\n",
"[17390 rows x 8 columns]"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"time_use_subset_renamed_df['Socializing time (hour)'] = time_use_subset_renamed_df['Socializing time'] / 60\n",
"\n",
"time_use_subset_renamed_df['Exercising time (hour)'] = time_use_subset_renamed_df['Exercising time'] / 60\n",
"\n",
"time_use_subset_renamed_df['Sleep time (hour)'] = time_use_subset_renamed_df['Sleep time'] / 60\n",
"\n",
"time_use_subset_renamed_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Task 3\n",
"\n",
"Some respondents in the time use survey spent no time exercising, socializing, or sleeping. In this section we will create a `DataFrame` that only has respondents who spent time sleeping, exercising, and socializing. In other words, respondents who spent no time on at least one of these activities will be excluded.\n",
"\n",
"a) Create a boolean `Series` called `well_balanced` that is `True` if time spent exercising **and** time spent sleeping **and** time spent socializing are all greater than 0, and `False` otherwise."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"well_balanced = (\n",
" (time_use_subset_renamed_df['Sleep time (hour)'] > 0) & \n",
" (time_use_subset_renamed_df['Exercising time (hour)'] > 0) & \n",
" (time_use_subset_renamed_df['Socializing time (hour)'] > 0)\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"b) Use `well_balanced` to filter (i.e. select) the rows of `time_use_subset_renamed_df` where respondents had non-zero times of sleeping, exercising, and socializing. Store this filtered DataFrame in `well_balanced_df`."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Socializing time Exercising time Sleep time Age group Province \\\n",
"3 395 60 510 6 35 \n",
"7 180 60 440 5 59 \n",
"23 80 230 330 6 46 \n",
"48 455 15 270 6 35 \n",
"54 130 185 670 1 12 \n",
"... ... ... ... ... ... \n",
"17325 25 15 640 6 47 \n",
"17336 105 100 525 6 59 \n",
"17351 40 90 540 5 46 \n",
"17366 120 90 490 6 59 \n",
"17387 125 77 510 7 24 \n",
"\n",
" Socializing time (hour) Exercising time (hour) Sleep time (hour) \n",
"3 6.583333 1.000000 8.500000 \n",
"7 3.000000 1.000000 7.333333 \n",
"23 1.333333 3.833333 5.500000 \n",
"48 7.583333 0.250000 4.500000 \n",
"54 2.166667 3.083333 11.166667 \n",
"... ... ... ... \n",
"17325 0.416667 0.250000 10.666667 \n",
"17336 1.750000 1.666667 8.750000 \n",
"17351 0.666667 1.500000 9.000000 \n",
"17366 2.000000 1.500000 8.166667 \n",
"17387 2.083333 1.283333 8.500000 \n",
"\n",
"[741 rows x 8 columns]"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"well_balanced_df = time_use_subset_renamed_df[well_balanced]\n",
"\n",
"well_balanced_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"c) The number of rows in a `pandas` `DataFrame` can be computed by `len()`. For example, `len(well_balanced_df)` is the number of rows in `well_balanced_df`. Compute the number of respondents who were *removed* from `time_use_subset_renamed_df` when it was filtered using `well_balanced` and store this number in a variable called `diff`."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"16649"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"diff = len(time_use_subset_renamed_df) - len(well_balanced_df)\n",
"\n",
"diff"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"d) Use `diff` to compute the percentage of respondents removed from `time_use_subset_renamed_df`. Round the percentage to two decimal places, and store the resulting value in a variable called `pct_lost`."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"95.74"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pct_lost = round(diff / len(time_use_subset_renamed_df) * 100, 2)\n",
"\n",
"pct_lost"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Task 4\n",
"\n",
"In this section you will explore the distributions of time spent socializing, exercising, and sleeping by age group and province.\n",
"\n",
"a) Compute the mean hours spent sleeping, socializing, and exercising by age group using `.groupby` on `well_balanced_df`. Store this DataFrame in a variable called `group_means`.\n",
"\n",
"b) Create a new column in `group_means` called `Total time (hour)` that is the sum of the time (in hours) spent sleeping, exercising, and socializing.\n",
"\n",
"c) Create a new index for `group_means` using the labels of Age group found in the code book (`gss_tu2016_codebook.txt`) and store the values in a list called `index_new`. \n",
"\n",
"d) Change the index of `group_means` to correspond to `index_new`. "
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Socializing time (hour) Exercising time (hour) Sleep time (hour) \\\n",
"15-24 2.707031 1.372917 8.918750 \n",
"25-34 2.127004 1.321730 8.060338 \n",
"35-44 1.802305 1.254433 8.079787 \n",
"45-54 1.754045 1.121359 8.249191 \n",
"55-64 2.236025 1.150311 8.266046 \n",
"65-74 2.073184 1.221688 8.342949 \n",
"75+ 2.068452 1.016270 8.640873 \n",
"\n",
" Total time (hour) \n",
"15-24 12.998698 \n",
"25-34 11.509072 \n",
"35-44 11.136525 \n",
"45-54 11.124595 \n",
"55-64 11.652381 \n",
"65-74 11.637821 \n",
"75+ 11.725595 "
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"group_means = well_balanced_df.groupby('Age group')[[\n",
" 'Socializing time (hour)', \n",
" 'Exercising time (hour)',\n",
" 'Sleep time (hour)'\n",
"]].mean()\n",
"\n",
"group_means['Total time (hour)'] = group_means.sum(axis=1)\n",
"\n",
"index_new = [\n",
" '15-24',\n",
" '25-34',\n",
" '35-44',\n",
" '45-54',\n",
" '55-64',\n",
" '65-74',\n",
" '75+'\n",
"]\n",
"\n",
"group_means.index = index_new\n",
"\n",
"\n",
"# Display group_means to check that the index has been updated.\n",
"# On the left-hand side you should see the Age group labels, from \"15-24\" to \"75+\".\n",
"group_means"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"e) Sort `group_means` in descending order of `Total time (hour)`. Store this sorted `DataFrame` in a variable called `group_means_sorted`."
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Socializing time (hour) Exercising time (hour) Sleep time (hour) \\\n",
"15-24 2.707031 1.372917 8.918750 \n",
"75+ 2.068452 1.016270 8.640873 \n",
"55-64 2.236025 1.150311 8.266046 \n",
"65-74 2.073184 1.221688 8.342949 \n",
"25-34 2.127004 1.321730 8.060338 \n",
"35-44 1.802305 1.254433 8.079787 \n",
"45-54 1.754045 1.121359 8.249191 \n",
"\n",
" Total time (hour) \n",
"15-24 12.998698 \n",
"75+ 11.725595 \n",
"55-64 11.652381 \n",
"65-74 11.637821 \n",
"25-34 11.509072 \n",
"35-44 11.136525 \n",
"45-54 11.124595 "
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"group_means_sorted = group_means.sort_values(by='Total time (hour)', ascending=False)\n",
"\n",
"group_means_sorted\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"f) Use `well_balanced_df` to create three side-by-side boxplots using `layout = (1, 3)` and `figsize = (20, 10)` of time spent (in hours) socializing, exercising, and sleeping for each age group. Store these boxplots in a variable called `time_boxplots`."
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
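{
"data": {
"image/png": "",
"text/plain": [
"[Figure: side-by-side boxplots of time spent (in hours) socializing, exercising, and sleeping, by Age group]"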
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"time_boxplots = well_balanced_df.boxplot(\n",
" column = ['Socializing time (hour)', 'Exercising time (hour)','Sleep time (hour)'], \n",
" by = 'Age group', \n",
" figsize = (20, 10),\n",
" layout = (1, 3)\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Task 5\n",
"\n",
"The tick marks on the horizontal axes of `time_boxplots` are not informative unless the viewer knows which age group each value represents. Fix the labels of the boxplots by recoding `Age group` using the labels in the code book (see `gss_tu2016_codebook.txt`).\n",
"\n",
"a) First, create a copy of `well_balanced_df` (using the `DataFrame` `.copy()` method), and store it in a variable called `well_balanced_age_label_df`. For that new `DataFrame`, recode `Age group` by adding a column called `Age group label` with the Age group labels found in the code book."
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Socializing time Exercising time Sleep time Age group Province \\\n",
"3 395 60 510 6 35 \n",
"7 180 60 440 5 59 \n",
"23 80 230 330 6 46 \n",
"48 455 15 270 6 35 \n",
"54 130 185 670 1 12 \n",
"... ... ... ... ... ... \n",
"17325 25 15 640 6 47 \n",
"17336 105 100 525 6 59 \n",
"17351 40 90 540 5 46 \n",
"17366 120 90 490 6 59 \n",
"17387 125 77 510 7 24 \n",
"\n",
" Socializing time (hour) Exercising time (hour) Sleep time (hour) \\\n",
"3 6.583333 1.000000 8.500000 \n",
"7 3.000000 1.000000 7.333333 \n",
"23 1.333333 3.833333 5.500000 \n",
"48 7.583333 0.250000 4.500000 \n",
"54 2.166667 3.083333 11.166667 \n",
"... ... ... ... \n",
"17325 0.416667 0.250000 10.666667 \n",
"17336 1.750000 1.666667 8.750000 \n",
"17351 0.666667 1.500000 9.000000 \n",
"17366 2.000000 1.500000 8.166667 \n",
"17387 2.083333 1.283333 8.500000 \n",
"\n",
" Age group label \n",
"3 65-74 \n",
"7 55-64 \n",
"23 65-74 \n",
"48 65-74 \n",
"54 15-24 \n",
"... ... \n",
"17325 65-74 \n",
"17336 65-74 \n",
"17351 55-64 \n",
"17366 65-74 \n",
"17387 75+ \n",
"\n",
"[741 rows x 9 columns]"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"well_balanced_age_label_df = well_balanced_df.copy()\n",
"\n",
"well_balanced_age_label_df.loc[well_balanced_age_label_df['Age group'] == 1, 'Age group label'] = '15-24'\n",
"\n",
"well_balanced_age_label_df.loc[well_balanced_age_label_df['Age group'] == 2, 'Age group label'] = '25-34'\n",
"\n",
"well_balanced_age_label_df.loc[well_balanced_age_label_df['Age group'] == 3, 'Age group label'] = '35-44'\n",
"\n",
"well_balanced_age_label_df.loc[well_balanced_age_label_df['Age group'] == 4, 'Age group label'] = '45-54'\n",
"\n",
"well_balanced_age_label_df.loc[well_balanced_age_label_df['Age group'] == 5, 'Age group label'] = '55-64'\n",
"\n",
"well_balanced_age_label_df.loc[well_balanced_age_label_df['Age group'] == 6, 'Age group label'] = '65-74'\n",
"\n",
"well_balanced_age_label_df.loc[well_balanced_age_label_df['Age group'] == 7, 'Age group label'] = '75+'\n",
"\n",
"well_balanced_age_label_df"
]
},
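{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a hedged aside, the same recoding can be written more compactly by looking each code up in a dictionary with `Series.map`. The sketch below is only an illustration: `toy_df` and `age_labels` are hypothetical names, the toy values are made up, and the code-to-label pairs are the ones used in the cell above.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy data standing in for well_balanced_df (made-up values)\n",
"toy_df = pd.DataFrame({'Age group': [6, 5, 1, 7]})\n",
"\n",
"# Code-to-label pairs, as used in the cell above\n",
"age_labels = {1: '15-24', 2: '25-34', 3: '35-44', 4: '45-54',\n",
"              5: '55-64', 6: '65-74', 7: '75+'}\n",
"\n",
"# map() replaces each code with its label; codes missing from the dictionary become NaN\n",
"toy_df['Age group label'] = toy_df['Age group'].map(age_labels)\n",
"\n",
"toy_df\n",
"```\n",
"\n",
"Keeping the labels in one dictionary means a typo only needs to be fixed in one place, at the cost of using a method that may not have been covered yet in the course."
]
},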
{
"cell_type": "markdown",
"metadata": {},
"source": [
"b) Create the same boxplots as in Task 4 f) using `layout = (1, 3)` and `figsize = (20, 10)`, but use `Age group label` to create the boxplots, so that the tick labels on the horizontal axes are informative. Store these boxplots in a variable called `time_boxplots_age_label`."
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
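{
"data": {
"image/png": "",
"text/plain": [
"[Figure: side-by-side boxplots of time spent (in hours) socializing, exercising, and sleeping, by Age group label]"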
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"time_boxplots_age_label = well_balanced_age_label_df.boxplot(\n",
" column = ['Socializing time (hour)', 'Exercising time (hour)', 'Sleep time (hour)'], \n",
" by = 'Age group label', \n",
" figsize = (20, 10),\n",
" layout = (1, 3)\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Task 6 (Written Discussion)\n",
"\n",
"a) Which age group spends the most time sleeping, exercising, and socializing? Does your ranking change if you use the mean or the median as a summary measure of time? Briefly explain why your ranking does or does not change, and which ranking is the better representation.\n",
"\n",
"b) Which age group shows the most variability in time spent socializing? Provide a brief explanation of why this group shows the most variability.\n",
"\n",
"c) State one limitation of basing this data analysis only on respondents who spent more than zero time sleeping, exercising, and socializing. Briefly explain why it is a limitation of your findings in Tasks 4 and 5."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> *Sample solutions*\n",
">\n",
"> a) \n",
"> - Yes, the ranking of total time changes depending on whether the mean or the median is used.\n",
"> - If the median is used, then the ranking is: 15-24, 25-34, 75+, ...\n",
"> - If the mean is used, then the ranking is: 15-24, 75+, 55-64, ...\n",
"> - The boxplots show outliers in sleep and socializing time that push the mean higher, but the median is not influenced by these observations.\n",
"> - The median would be a more suitable choice since it's not influenced by outliers.\n",
">\n",
"> b) The boxplot is longest for the 15-24 age group, so this group shows the most variability.\n",
">\n",
"> c) About 96% of the data is excluded, so the results might be different if these observations were included."
]
}
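,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible way to check how the ranking in a) depends on the summary measure is to compute both the mean and the median of total time per age group and sort by each. The sketch below is not part of the required solution: `toy_df` is hypothetical with made-up values, and it assumes a per-respondent `Total time (hour)` column, which would first have to be added to `well_balanced_age_label_df`.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy data (made-up values) standing in for well_balanced_age_label_df\n",
"toy_df = pd.DataFrame({\n",
"    'Age group label': ['15-24', '15-24', '15-24', '75+', '75+', '75+'],\n",
"    'Total time (hour)': [10.0, 11.0, 20.0, 12.0, 12.5, 13.0],\n",
"})\n",
"\n",
"# Mean and median total time for each age group\n",
"summary = toy_df.groupby('Age group label')['Total time (hour)'].agg(['mean', 'median'])\n",
"\n",
"# Rank the age groups under each summary measure, largest first\n",
"rank_by_mean = summary['mean'].sort_values(ascending=False).index.tolist()\n",
"rank_by_median = summary['median'].sort_values(ascending=False).index.tolist()\n",
"\n",
"print(rank_by_mean)\n",
"print(rank_by_median)\n",
"```\n",
"\n",
"In this toy data, a single large value pulls the 15-24 mean above the 75+ mean while the 15-24 median stays lower, so the two rankings differ."
]
}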
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.13"
},
"vscode": {
"interpreter": {
"hash": "8b8edaa195e148f815789564e9a10f57d8b792ac9e1a5daafce5fbae42bebd0e"
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}