Lecture 4: Introduction to pandas and Data Wrangling#

  • Introduction to working with (tabular) data using pandas

    • Import a csv file into a pandas DataFrame

  • Selecting rows of a DataFrame

  • Selecting columns of a DataFrame

  • Computing summary statistics on a DataFrame

pandas#

  • pandas is a python package that makes it easy to work with tabular data.

  • What is tabular data?

Data Example 1#

Is this data tabular?

Data Example 2#

Is this data tabular?

Text is taken from the Stanford Encyclopedia of Philosophy.

Data Example 3#

Is this data tabular?

Raw Data#

The data from Example 3 is from

https://open.toronto.ca/dataset/police-race-and-identity-based-data-collection-arrests-strip-searches/

Is this raw data?

What is this data’s provenance?

Comma Separated Value (csv) Files#

  • A csv file is a text file that uses a comma (this is an example of a delimiter) to separate values.

id, person, department
0, MM, STA
1, IM, DCS 
2, CL, GGR
  • csv files often have a “.csv” extension as part of the file name. For example, GGR274faculty.csv.

  • Spreadsheet programs such as Excel are often used to create, read, and store csv files. But, any program (e.g., Jupyter notebooks) that can manage text files can do the same.

pandas#

  • pandas is a Python package and is the “fundamental high-level building block for doing practical, real-world data analysis in Python” (see pandas Getting started).

  • We will study and use the primary data structures of pandas, Series (1-dimensional) and DataFrame (2-dimensional).

Create a pandas Series#

Import pandas#

First import the pandas package.

import pandas

This will allow us to access functions and methods in the pandas package in our Jupyter notebook. But, we can use the Python keyword as to abbreviate pandas.

import pandas as pd

Create a pandas Series from a List#

people = ['MM', 'IM', 'CL']

faculty_series = pd.Series(people)

faculty_series
dept = ['STA', 'DCS', 'GGR']

department_series = pd.Series(dept)

department_series

Creating a Boolean Series based on a Condition#

Create a Series where the element is True if department_series is equal to STA and False otherwise.

department_series = pd.Series(dept)

department_series
department_series == 'STA'

Create a Series where the element is True if department_series is equal to GGR and False otherwise.

print(department_series)

Create a Series where the element is True if department_series is equal to STA OR GGR and False otherwise.

print(department_series)

(department_series == 'STA') | (department_series == 'GGR')

Create a Series where the element is True if department_series is equal to STA AND GGR and False otherwise.

print(department_series)

(department_series == 'STA') & (department_series == 'GGR')

Create a Series where the element is
True if department_series equal to STA is NOT EQUAL to department_series equal to GGR,
and False otherwise (this is tricky 😨)

print(department_series)

# TRY!

Boolean logic with pandas Series#

When comparing Boolean Series in pandas we use different logical operators

Series1 = pd.Series([True, False, True])

Series2 = pd.Series([False, False, True])

Operation

Description

Result of operation in a list

Series1 & Series2

Series1 and Series 2

[False, False, True]

Series1 | Series2

Series1 or Series 2

[True, False, True]

Series1 != Series2

Series1 not equal to Series 2

[True, False, False]

Create a pandas DataFrame using a dictionary#

  • A dictionary store data in key-value pairs.

  • A popular way to create a dictionary is to use curly braces {} and colons : to separate keys and values (key:values)

candy_dict = {'candy': ['fuzzy peach', 'gummy bear', 'chocolate']}

  • the key of candy_dict is candy

  • the values of candy are: 'fuzzy peach', 'gummy bear', 'chocolate'

candy_dict = {'candy': ['fuzzy peach', 'gummy bear', 'chocolate']}
type(candy_dict)

We can create a dict of GGR274 course faculty.

data = {'academic department' : ['STA', 'DCS', 'GGR'], 
        'faculty': ['MM', 'IM', 'CL'],
        'favourite candy': ['nurungji', 'gummy bear', 'chocolate'],
        'name length': [len('Michael Moon'), len('Ilya Musabirov'), len('Chunjiang Li')]}

data

Let’s store data in a pandas DataFrame.

pd.DataFrame(data)

Now, let’s store the pandas DataFrame above in a variable called GGR274fac_df.

GGR274fac_df = pd.DataFrame(data)

GGR274fac_df

Select rows of a DataFrame using a list of True & False values (a.ka. Boolean values)#

Let’s remove the second row.

print(GGR274fac_df)

GGR274fac_df[[True, False, True]]
  • What happened?

  • How can I remove the first row?

GGR274fac_df[[True, False, False]]

from IPython.display import IFrame

url = "https://forms.office.com/Pages/ResponsePage.aspx?id=JsKqeAMvTUuQN7RtVsVSEC-tsJgdMo9KmhgqULaAHJxUMDVXRUM1VEI2TlVRS0IwVTNaSUIwS0VMQi4u&embed=true"

IFrame(url, width = 640, height = 480)

Select columns of a DataFrame using a list of Column Names#

  • The column names in the DataFrame GGR274fac_df can be obtained using list().

  • There are other ways to get the column names, but we will focus on this for now.

list(GGR274fac_df)
  • To select the column favourite candy we can add it in quotation marks inside the square brackets [] at the end of the DataFrame name.

  • For example: \(\underbrace{\texttt{GGR274fac_df}}_\text{DataFrame Name}\underbrace{\texttt{['favourite candy']}}_\text{Column Name}\)

GGR274fac_df['favourite candy']
my_list_of_column_names = ['favourite candy', 'name length']

GGR274fac_df[my_list_of_column_names]
GGR274fac_df[['favourite candy']]
GGR274fac_df_column_names = list(GGR274fac_df)

print(f"The list of column names is: {GGR274fac_df_column_names}")
GGR274fac_df_column_names[0]
GGR274fac_df[GGR274fac_df_column_names[0]]

Which column number will be selected using:

  • GGR274fac_df[GGR274fac_df_column_names[0]]

  • GGR274fac_df[GGR274fac_df_column_names[3]]

Select rows of a DataFrame#

  • Rows can be selected from a DataFrame using a list of Boolean values.

  • \(\underbrace{\texttt{GGR274fac_df}}_\text{DataFrame Name}\underbrace{\texttt{[[True, False, True]]}}_\text{List of Boolean values}\)

    • selects the first and third rows of the DataFrame since the first and third values are True. The second row is not selected since the second element of the list is False.

my_important_condition = [True, False, True]

my_important_condition
GGR274fac_cond = GGR274fac_df[my_important_condition] # select rows

my_important_columns = ['academic department', 'faculty']

GGR274fac_cond[my_important_columns] # select columns

Select rows and columns of a DataFrame#

Combine these two lines of code:

GGR274fac_cond = GGR274fac_df[my_important_condition] # select rows

GGR274fac_cond[mycolumns] # select columns

to select rows and columns.

GGR274fac_df

Exercise#

Create a pandas DataFrame with three columns:

  1. Your first name and two people sitting close to you – your (new??) friends.

  2. The distance from home to the U of T St. George campus for you and your two (new?) friends.

  3. The month and day of you and your two (new?) friends’ birthday.

# create your DataFrame here.

Create a pandas DataFrame from a csv file#

  • Data will usually be stored in a file such as a csv.

  • It’s very convenient to “read” the file into a pandas DataFrame, since pandas has many methods that can manipulate tabular data. Otherwise, we could use base python to do these manipulations.

Time Use dataset#

Dataset: Statistics Canada General Social Survey’s (GSS) Time Use (TU) Survey

  • Tracks how people spend their time

  • two parts:

    • an episode file where each row describes an event for one person; each person has one row per event (like “make coffee”) that they experienced.

    • a main file that includes meta-information about the individuals and also includes summary information from the episode file; there is 1 row of information per person.

  • we will stick to the ‘main’ file for now

The data are stored in file gss_tu2016_main_file.csv. The name uses abbreviations:

  • GSS: general social survey

  • tu2016: time use from the year 2016

  • csv: comma-separated values

Import Time Use Survey Data using pandas#

import pandas as pd

timeuse_filename = 'gss_tu2016_main_file.csv'

time_use_data = pd.read_csv(timeuse_filename)

time_use_data.head()

Question#

Do urban residents with more children report feeling more rushed than those with fewer children?

Let’s narrow the question even further …

Do urban residents with one or more children report feeling more rushed every day versus never feeling rushed than those with no children?

Among urban respondents that feel rushed daily or never feel rushed our data analysis will consist of computing the two proportions:

\[\text{Proportion}_\text{at least one kid} = \frac{\text{Number of respondents that feel rushed daily}}{\text{Number of respondents with at least one kid that never feel rushed or feel rushed daily}}\]
\[\text{Proportion}_\text{no kids} = \frac{\text{Number of respondents that feel rushed daily}}{\text{Number of respondents with no kids that never feel rushed or feel rushed daily}}\]

Wrangle the Time Use Survey Data#

Create a new DataFrame with only the relevant columns needed for the data analysis. In other words, create a subset of time_use_data.

Selecting columns from a DataFrame#

What columns are relevant?

To create an easy to use data set we will only keep the following columns:

  • CASEID: participant ID

  • luc_rst: large urban centre vs rural and small towns

  • chh0014c: number of kids 14 or under

  • gtu_110: feeling rushed

important_columns = ["CASEID","luc_rst","chh0014c","gtu_110"] 

subset_time_use_data = time_use_data[important_columns]
subset_time_use_data.head()

Rename columns#

Use the rename function to rename columns.

columnnames = {'CASEID': 'Participant ID',
               'luc_rst': 'Urban/Rural',
               'chh0014c': 'Kids under 14',
               'gtu_110': 'Feeling Rushed'}

subset_time_use_data_colnames = subset_time_use_data.rename(columns=columnnames)

list(subset_time_use_data_colnames)

From the code book

 luc_rst            Population centre indicator

           VALUE  LABEL
               1  Larger urban population centres (CMA/CA)
               2  Rural areas and small population centres (non CMA/CA)
               3  Prince Edward Island
               6  Valid skip
               7  Don't know
               8  Refusal
               9  Not stated

           Data type: numeric
           Missing-data codes: 6-9
           Record/column: 1/59

 chh0014c           Child(ren) in household - 0 to 14 years

           VALUE  LABEL
               0  None
               1  One
               2  Two
               3  Three or more
               6  Valid skip
               7  Don't know
               8  Refusal
               9  Not stated

           Data type: numeric
           Missing-data codes: 6-9
           Record/column: 1/40


      gtu_110            General time use - Feel rushed

           VALUE  LABEL
               1  Every day
               2  A few times a week
               3  About once a week
               4  About once a month
               5  Less than once a month
               6  Never
              96  Valid skip
              97  Don't know
              98  Refusal
              99  Not stated

           Data type: numeric
           Missing-data codes: 96-99
           Record/columns: 1/60-61

Select respondents that live in urban areas#

Select the Urban/Rural column.

 luc_rst            Population centre indicator

           VALUE  LABEL
               1  Larger urban population centres (CMA/CA)
               2  Rural areas and small population centres (non CMA/CA)
               3  Prince Edward Island
               6  Valid skip
               7  Don't know
               8  Refusal
               9  Not stated

           Data type: numeric
           Missing-data codes: 6-9
           Record/column: 1/59

urbanrural_col = subset_time_use_data_colnames["Urban/Rural"]
  • Create a Boolean variable that is True if respondent lives in urban area and False otherwise.

urban # crate the Boolean variable

Look at the head (first 5 rows) of the Series using .head()

urban.head()

We can combine

urbanrural_col = subset_time_use_data_colnames["Urban/Rural"]

urban = urbanrural_col == 1

into one line of code

urban = (subset_time_use_data_colnames["Urban/Rural"] == 1)
urban.sum() # add up the total number of `True` values

Select all the participants (rows) who live in urban areas in subset_time_use_data_colnames using the Boolean series urban.

subset_time_use_data_colnames[urban].head()
subset_time_use_data_colnames.shape
  • There are 17 390 rows and 4 columns in subset_time_use_data_colnames.

  • DataFrame.shape returns a tuple (a Python data type that stores multiple values).

  • Items in a tuple can be accessed in a similar way to a list.

Define a DataFrame with only urban respondents and relevant columns#

urban_df = subset_time_use_data_colnames[urban]

urban_df.head()

Examine the distributions of each column using value_counts()#

Distribution of Kids under 14.

urban_df['Kids under 14'].value_counts()
 chh0014c           Child(ren) in household - 0 to 14 years

           VALUE  LABEL
               0  None
               1  One
               2  Two
               3  Three or more
               6  Valid skip
               7  Don't know
               8  Refusal
               9  Not stated

           Data type: numeric
           Missing-data codes: 6-9
           Record/column: 1/40

Distribution of Feeling Rushed.

urban_df['Feeling Rushed'].value_counts()

      gtu_110            General time use - Feel rushed

           VALUE  LABEL
               1  Every day
               2  A few times a week
               3  About once a week
               4  About once a month
               5  Less than once a month
               6  Never
              96  Valid skip
              97  Don't know
              98  Refusal
              99  Not stated

           Data type: numeric
           Missing-data codes: 96-99
           Record/columns: 1/60-61

Distribution of Urban/Rural.

urban_df['Urban/Rural'].value_counts()

Let’s compute

\[\text{Proportion}_\text{at least one kid} = \frac{\text{Number of respondents that feel rushed daily}}{\text{Number of respondents with at least one kid that never feel rushed or feel rushed daily}}\]
\[\text{Proportion}_\text{no kids} = \frac{\text{Number of respondents that feel rushed daily}}{\text{Number of respondents with no kids that never feel rushed or feel rushed daily}}\]

to answer

Do urban residents with one or more children report feeling more rushed every day versus never feeling rushed than those with no children?

Let’s start with the denominator of \(\text{Proportion}_\text{at least one kid}\):

\({\text{Number of respondents with at least one kid that never feel rushed or feel rushed daily}}\)

Define a Boolean variable kids_norush that is:

  • True if:

    • a respondent has at least one kid (urban_df['Kids under 14'] >= 1) and

    • never feels rushed (urban_df['Feeling Rushed'] == 6)

  • False otherwise.

kids_norush = (urban_df['Kids under 14'] >= 1) & (urban_df['Feeling Rushed'] == 6)

Define a Boolean variable kids_rush that is:

  • True if:

    • a respondent has at least one kid (urban_df['Kids under 14'] >= 1) and

    • never feels rushed (urban_df['Feeling Rushed'] == 1)

  • False otherwise.

kids_rush = ((urban_df['Kids under 14'] >= 1) & (urban_df['Feeling Rushed'] == 1))

Compute the total number respondents with urban_df[Kids under 14] >= 1 that feel rushed daily or never feel rushed.

Total_kids_norush = kids_norush.sum() 

Total_kids_rush = kids_rush.sum()

Total_kids_norush + Total_kids_rush

Now compute the proportion that feel rushed among those with at least one kid.

prop_kids = Total_kids_rush / (Total_kids_norush + Total_kids_rush)

prop_kids

Let’s do the same for respondents with no kids

nokids_norush = ((urban_df['Kids under 14'] == 0) & (urban_df['Feeling Rushed'] == 6))

nokids_rush = ((urban_df['Kids under 14'] == 0) & (urban_df['Feeling Rushed'] == 1))

Total_nokids_norush = nokids_norush.sum() 

Total_nokids_rush = nokids_rush.sum()

prop_nokids = Total_nokids_rush/(Total_nokids_norush + Total_nokids_rush)

prop_nokids

Let’s multiply by 100 and round to decimal places to express as a percentage and print out an informative description of the statistic.

percent_kids = round(prop_kids * 100, 2)

percent_nokids = round(prop_nokids * 100, 2)

print(percent_kids)

print(percent_nokids)

Let’s add a more detailed description.

nokidstext = 'The percentage of respondents with kids that feel rushed is'

print(f"{nokidstext} {percent_kids}%.") # print interpretation of percent with kids

print('\n') # add a blank line

kidstext = 'The percentage of respondents with no kids that feel rushed is'

print(f"{kidstext} {percent_nokids}%.") # print interpretation of percent with no kids

print('\n')  # add a blank line

difftext = 'more respondents with kids feel rushed compared to respondents without kids.'

print(f"{round(percent_kids - percent_nokids, 2)}% {difftext}")
# feel free to further experiment with the DataFrame