Class 10

Contents

Class 10#

Mar. 19, 2024#

By the end of this class, you’ll be able to:

  • Fit linear regression;

  • Know what is spatial data;

  • Learn a new package geopandas;

  • Do some basic spatial analysis (we’ll learn more in next 2 weeks!)

Review of linear regression#

What is the effect of recent immigration on asthma rates in Toronto neighbourhoods?#

Data#

Sociodemographic data#

  • Read sociodemographic data from the excel file 1_socdem_neighb_2006-2.xls with sheet name socdem_2006 into pandas using read_excel.

import pandas as pd
import matplotlib.pyplot as plt 
import numpy as np
import seaborn as sns
from statsmodels.formula.api import ols
fname = '1_socdem_neighb_2006-2.xls'
sname = 'socdem_2006'
socdem_neighb = pd.read_excel(fname, sheet_name = sname, header = 10) # the table starts in row #11
socdem_neighb.head()
Neighbourhood id Neighbourhood Name Total Population % 19 and under % 65 and over % Living alone % Aged 65+ living alone % Lone parent families § % 1 year mobility in population Median household income before-tax $ ... % Less than high school education ** % With a University degree ** % No knowledge of English/French % Recent immigrants-within 5 years % Recent immigrants-within 10 years % Immigrants % Visible minority Top 3 Home Languages for non-English speaking households, #1 Top 3 Home Languages for non-English speaking households, #2 Top 3 Home Languages for non-English speaking households, #3
0 1 West Humber-Clairville 32252 26.8 11.3 3.7 11.7 26.3 11.6 63413 ... 18.4 22.9 5.4 12.2 22.1 60.0 77.9 Panjabi (Punjabi) Gujarati Spanish
1 2 Mount Olive-Silverstone-Jamestown 32127 32.2 8.1 4.1 15.4 31.5 15.7 48146 ... 20.1 23.6 7.6 22.2 35.1 67.1 85.3 Panjabi (Punjabi) Gujarati Arabic
2 3 Thistletown-Beaumond Heights 9928 25.3 16.7 5.4 17.5 30.1 10.9 55032 ... 16.7 20.7 5.0 9.7 19.3 55.9 53.8 Panjabi (Punjabi) Italian Spanish
3 4 Rexdale-Kipling 10725 23.6 18.3 10.2 28.8 33.3 13.2 52428 ... 15.3 18.8 3.0 8.3 14.0 47.2 41.8 Spanish Panjabi (Punjabi) Croatian
4 5 Elms-Old Rexdale 9437 29.9 10.5 5.8 13.0 39.1 9.7 53779 ... 15.2 18.8 3.6 9.8 18.2 53.8 60.5 Spanish Italian Urdu

5 rows × 28 columns

Which column should we select to measure immigration? There are several choices.

socdem_neighb.columns
Index(['Neighbourhood id', 'Neighbourhood Name', 'Total Population',
       '% 19 and under', '% 65 and over', '% Living alone',
       '% Aged 65+ living alone', '% Lone parent families §',
       '% 1 year mobility in population',
       'Median household income before-tax $',
       'Median household income after-tax $ ‡',
       '% Families-Incidence of low income before-tax',
       '% Families-Incidence of low income after-tax ‡',
       '% Individuals-Incidence of low income before-tax',
       '% Individuals-Incidence of low income after-tax ‡',
       '% Rented Dwellings', '% Unemployment rate *',
       '% Not in labour force *', '% Less than high school education **',
       '% With a University degree **', '% No knowledge of English/French',
       '% Recent immigrants-within 5 years',
       '% Recent immigrants-within 10 years', '% Immigrants',
       '% Visible minority',
       'Top 3 Home Languages for non-English speaking households, #1',
       'Top 3 Home Languages for non-English speaking households, #2',
       'Top 3 Home Languages for non-English speaking households, #3'],
      dtype='object')
  • Let’s use '% Recent immigrants-within 5 years' to represent recent immigrants in a neighborhood.

  • Later on we will want a few more sociodemographic columns, so let’s create a new DataFrame and rename the columns.

cols = ['Neighbourhood id', 'Neighbourhood Name', 
        'Median household income after-tax $ ‡', 
        '% Rented Dwellings', '% Unemployment rate *', 
        '% Recent immigrants-within 5 years', '% Visible minority']

socdem_neighb = socdem_neighb[cols]

colnames = {cols[0] : 'Neighbid',
            cols[1] : 'name',
            cols[2] : 'median_income',
            cols[3] : 'rented_dwell',
            cols[4] : 'unemployment',
            cols[5] : 'immigration5',
            cols[6] : 'vismin'}

socdem_neighb = socdem_neighb.copy()

socdem_neighb.rename(columns = colnames, inplace = True)
socdem_neighb.head()
Neighbid name median_income rented_dwell unemployment immigration5 vismin
0 1 West Humber-Clairville 56220 27.4 7.0 12.2 77.9
1 2 Mount Olive-Silverstone-Jamestown 43975 52.0 9.9 22.2 85.3
2 3 Thistletown-Beaumond Heights 49800 34.4 6.3 9.7 53.8
3 4 Rexdale-Kipling 46033 45.1 7.7 8.3 41.8
4 5 Elms-Old Rexdale 48055 41.2 7.8 9.8 60.5

Asthma data#

  • Read data from the excel file 1_ahd_neighb_db_ast_hbp_mhv_copd_2007.xls with sheet name 1_ahd_neighb_asthma_2007 into pandas using read_excel.

fname = '1_ahd_neighb_db_ast_hbp_mhv_copd_2007.xls'
sname = '1_ahd_neighb_asthma_2007'

asthma_neighb = pd.read_excel(fname, sheet_name = sname, header = 11) #the table starts with row #12
asthma_neighb.head()
Unnamed: 0 Unnamed: 1 Demographics ÂȘ % With asthma LL (95% CI) UL (95% CI) Demographics ÂȘ.1 % With asthma.1 LL (95% CI) .1 UL (95% CI) .1 ... Demographics ÂȘ.7 % With asthma.7 LL (95% CI) .7 UL (95% CI) .7 Demographics ÂȘ.8 % With asthma.8 LL (95% CI) .8 UL (95% CI) .8 Rate Ratio**.2 H/L/NS.2
0 1 West Humber-Clairville 11977 9.7 9.2 10.2 11770 10.6 10.0 11.2 ... 4435 12.3 11.3 13.3 8921 10.9 10.3 11.6 1.07 H
1 2 Mount Olive-Silverstone-Jamestown 11157 7.9 7.4 8.4 11462 10.2 9.6 10.8 ... 3819 13.5 12.5 14.6 7726 11.2 10.5 11.9 1.10 H
2 3 Thistletown-Beaumond Heights 3307 9.4 8.4 10.4 3322 10.5 9.5 11.6 ... 1307 12.3 10.5 14.1 2587 10.8 9.6 12.0 1.06 NS
3 4 Rexdale-Kipling 3317 9.7 8.7 10.7 3438 13.0 11.9 14.2 ... 1468 13.2 11.5 14.9 2897 10.4 9.3 11.5 1.02 NS
4 5 Elms-Old Rexdale 3209 10.2 9.1 11.2 3379 13.0 11.8 14.1 ... 1314 15.1 13.1 17.0 2610 11.8 10.6 13.0 1.16 H

5 rows × 44 columns

  • The asthma rates are broken down by sex and age.

  • Which rates should we choose?

  • For this example, let’s select asthma rates for both sexes age 20 - 64.

asthma_neighb.columns
Index(['Unnamed: 0', 'Unnamed: 1', 'Demographics ÂȘ', '% With asthma',
       'LL (95% CI) ', 'UL (95% CI) ', 'Demographics ÂȘ.1', '% With asthma.1',
       'LL (95% CI) .1', 'UL (95% CI) .1', 'Demographics ÂȘ.2',
       '% With asthma.2', 'LL (95% CI) .2', 'UL (95% CI) .2', 'Rate Ratio**',
       'H/L/NS', 'Demographics ÂȘ.3', '% With asthma.3', 'LL (95% CI) .3',
       'UL (95% CI) .3', 'Demographics ÂȘ.4', '% With asthma.4',
       'LL (95% CI) .4', 'UL (95% CI) .4', 'Demographics ÂȘ.5',
       '% With asthma.5', 'LL (95% CI) .5', 'UL (95% CI) .5', 'Rate Ratio**.1',
       'H/L/NS.1', 'Demographics ÂȘ.6', '% With asthma.6', 'LL (95% CI) .6',
       'UL (95% CI) .6', 'Demographics ÂȘ.7', '% With asthma.7',
       'LL (95% CI) .7', 'UL (95% CI) .7', 'Demographics ÂȘ.8',
       '% With asthma.8', 'LL (95% CI) .8', 'UL (95% CI) .8', 'Rate Ratio**.2',
       'H/L/NS.2'],
      dtype='object')
# Select the neghborhood_id, neighborhood_name, and both sexes with asthma

important_cols = asthma_neighb.columns[[0, 1, 11]]

colnames = {important_cols[0]: 'Neighbid',
           important_cols[1] : 'name',
           important_cols[2] : 'asthma_pct'}

asthma_rates = asthma_neighb.copy()

asthma_rates = asthma_rates[important_cols]

asthma_rates.rename(columns = colnames, inplace=True)

asthma_rates.head()
Neighbid name asthma_pct
0 1 West Humber-Clairville 10.2
1 2 Mount Olive-Silverstone-Jamestown 9.1
2 3 Thistletown-Beaumond Heights 10.0
3 4 Rexdale-Kipling 11.4
4 5 Elms-Old Rexdale 11.6

Merge Sociodemographic and asthma data#

  • To examine the relationship between asthma and immigration we will need to merge socdem_neighb and asthma_rates.

  • The DataFrames can be merged on neighbourhood id.

# Two datasets are merged based on neighborhood id

asthma_socdem = asthma_rates.merge(socdem_neighb, left_on='Neighbid', right_on='Neighbid')

asthma_socdem.head()
Neighbid name_x asthma_pct name_y median_income rented_dwell unemployment immigration5 vismin
0 1 West Humber-Clairville 10.2 West Humber-Clairville 56220 27.4 7.0 12.2 77.9
1 2 Mount Olive-Silverstone-Jamestown 9.1 Mount Olive-Silverstone-Jamestown 43975 52.0 9.9 22.2 85.3
2 3 Thistletown-Beaumond Heights 10.0 Thistletown-Beaumond Heights 49800 34.4 6.3 9.7 53.8
3 4 Rexdale-Kipling 11.4 Rexdale-Kipling 46033 45.1 7.7 8.3 41.8
4 5 Elms-Old Rexdale 11.6 Elms-Old Rexdale 48055 41.2 7.8 9.8 60.5

A regression model of asthma on immigration#

Plot the data#

  • What are the dependent and independent variables?

  • Does the relationship look linear?

sns.regplot(y = 'asthma_pct', x = 'immigration5', data = asthma_socdem, ci = None);
../../../_images/d731387bdc7010332bbb35039f73b1b1096929ecda865ff3d490ffd89ff56692.png

Describe the relationship between asthma and immigration.

Fit the regression model#

reg_asthmaimm = ols('asthma_pct ~ immigration5', data = asthma_socdem) # setup the model

reg_asthmaimm_fit = reg_asthmaimm.fit() # estimate/fit the model 

Statistical summary of the regression model#

reg_asthmaimm_summ = reg_asthmaimm_fit.summary()

reg_asthmaimm_summ.tables[1]
coef std err t P>|t| [0.025 0.975]
Intercept 11.8266 0.218 54.328 0.000 11.396 12.257
immigration5 -0.1597 0.019 -8.475 0.000 -0.197 -0.122

The regression model is the equation:#

\[\texttt{asthma_pct} = 11.8266 - 0.1597 \times \texttt{immigration5}\]
  • The slope indicates that for a 1% increase in immigration in the past 5 years the percentage of asthma cases in a neighbourhood decrease by 0.1597%.

  • The y-intercept indicates that when a neighbourhood has 0% immigration the past 5 years then asthma rates are 11%.

  • Are there any neighbourhoods with 0% immigration in the past 5 years?

asthma_socdem['immigration5'].describe()
count    140.000000
mean       9.864286
std        6.037045
min        1.200000
25%        4.975000
50%        8.650000
75%       13.100000
max       32.800000
Name: immigration5, dtype: float64
  • The y-intercept in regression models often doesn’t have a sensible interpretation, but is often mathematically important for regression to work well.

Accuracy of the regression model#

reg_asthmaimm_fit.rsquared
0.34233239704200125
plt.scatter(x = reg_asthmaimm_fit.fittedvalues , y = reg_asthmaimm_fit.resid)
plt.axhline(y = 0)
plt.xlabel('fitted values')
plt.ylabel('residuals')
Text(0, 0.5, 'residuals')
../../../_images/bd0b4e15217751ef5c3338559baea3ef15936b868b1119ac48b541b4f4cf079d.png
  • The R-squared statistics indicates a relatively good explanation of the model (quite larger than 0)

  • The residual plot indicates some important variables are missing, as we can see larger abolute values as fitted values increase (not a random distributed pattern)

Summary#

Therefore, we can conclude that the immigration rate in recent 5 years in a neighborhood is negatively associated with the prevalence of asthma in the neighborhood.

Why not causal relationship?

What is the implication?

Training and testing a model#

  • Ideally we could assess the accuracy of this linear regression model on a new set of data (i.e., immigration rates in Toronto neighbourhoods in future years.)

  • But, we don’t have this data.

  • One trick that data scientists use to fit the model on part of the data (training data) and use the remaining part (testing data) to test the accuracy.

  • If we only have one data set then the training data can be obtained by randomly selecting rows from the data to fit the regression model, then the remaining rows can be used to test the accuracy of the model.

Splitting a pandas DataFrame by index#

data = {'a' : [1, 2, 3, 4, 5]}

df = pd.DataFrame(data)

df.index
RangeIndex(start=0, stop=5, step=1)

A Boolean array that is True if the index is 0 or 1 and False otherwise.

df.index.isin([0,1])
array([ True,  True, False, False, False])
df[df.index.isin([0,1])]
a
0 1
1 2

To create a Boolean series that is False if the index is 0 or 1 and True otherwise we can negate df.index.isin([0,1]) this Boolean series using the ~ operator ~df.index.isin([0,1])

~df.index.isin([0,1])
array([False, False,  True,  True,  True])
df[~df.index.isin([0,1])]
a
2 3
3 4
4 5

Creating training and test data sets from a single dataset#

Step 1:#

  • split the data into a training set with 75% of the rows.

  • use the remaining 25% of the data for testing.

np.random.seed(11) # for reproducibility

# randomly select 75% of neighbourhoods
reg_df_train = asthma_socdem.sample(frac = 0.75, replace = False) 

# get index of training data
train_index = reg_df_train.index
  • Exclude indicies from reg_df_train using pandas isin function.

# exclude rows in training to define test data
reg_df_test = asthma_socdem[~asthma_socdem.index.isin(train_index)]

print(asthma_socdem.shape)

print(reg_df_train.shape)

print(reg_df_test.shape)
(140, 9)
(105, 9)
(35, 9)

Step 2: Fit the regression model on the training data#

reg_train = ols('asthma_pct ~ immigration5', data = reg_df_train) # setup the model

reg_train_fit = reg_train.fit() # estimate/fit the model 

Step 3: Compute fitted values using training data#

# use the model fit on the training data to predict asthma rates
# from the test set. 

predvalues_train = reg_train_fit.fittedvalues

# Another way to compute
# predvalues_train = reg_train_fit.predict(reg_df_train['immigration5'])

Step 4: Evaluate accuracy using root mean-squared error on training data#

  • Another measure of accuracy of regression models.

  • Compares observed values of the dependent variable with the predicted values.

rmse.jpg

It can be computed using the rmse function from statsmodels.

# Compute root mean-squared error for the training data.
from statsmodels.tools.eval_measures import rmse

rmse(predvalues_train, reg_df_train['asthma_pct'])
1.3987103239995489
  • The observed asthma rates deviate from the predicted asthma rates in each neighbourhood by 1.4% in the training set.

  • Is this an acceptable prediction error?

  • Let’s examine the accuracy of the linear regression model on the test set

Step 5: Evaluate the accuracy of regression model on the test data using root mean-squared error#

  • Compute predictions using test data

predvalues_test = reg_train_fit.predict(reg_df_test['immigration5'])
  • Now compute the root mean-squared error for the test data using the model fit on the training data.

rmse(predvalues_test, reg_df_test['asthma_pct'])
1.1096219805062997

Conclusion of data analysis using linear regression model#

  • The regression model appears to provide an accurate fit the data, although both the scatter plot and plot of residuals indicate that there is a lot of variation in asthma rates in neighbourhoods when the immigrations rates are small (between 10%-20%).

  • As the percentage of immigrants to a neighbourhood within the last 5 years increases, the asthma rates will decreases.

  • Regression analyses on aggregated data such as neighbourhoods can be misleading when compared to non-aggregated data.

Are there other sociodemographic factors in our data that might effect asthma rates?#

  • Let’s create a matrix scatter plot using pairplot in the seaborn library.

  • pairplot will visualize paired relationship between variables in a dataset.

asthma_socdem.columns
Index(['Neighbid', 'name_x', 'asthma_pct', 'name_y', 'median_income',
       'rented_dwell', 'unemployment', 'immigration5', 'vismin'],
      dtype='object')
# reduce the column for better presentation
to_show = ['asthma_pct','median_income','rented_dwell','immigration5','vismin']
sns.pairplot(asthma_socdem[to_show]);
../../../_images/99e5dc5bd7173ef809331fbc109a49c718ab8d873b6dfbdbe360ecc6c442d4df.png
  • Let’s consider median income

  • Fit the regression model on all the data and produce a statistical summary.

reg_asthmaimmunem = ols('asthma_pct ~ immigration5 + median_income', data = asthma_socdem) # setup the model

reg_asthmaimmunem_fit = reg_asthmaimmunem.fit() # estimate/fit the model 

reg_asthmaimmunem_summ = reg_asthmaimmunem_fit.summary()

reg_asthmaimmunem_summ.tables[1]
coef std err t P>|t| [0.025 0.975]
Intercept 14.3041 0.497 28.758 0.000 13.321 15.288
immigration5 -0.2054 0.019 -10.750 0.000 -0.243 -0.168
median_income -4.048e-05 7.45e-06 -5.431 0.000 -5.52e-05 -2.57e-05
  • Notice that the slope estimate (coefficient) for median_income is 0, but so is the p-value and range of plausible values.

  • That seems odd. What’s happening?

  • We can see from the scatter plot that immigration5 and median_income have a linear relationship.

  • When two variables have a linear relationship multiple linear regression models produce strange results.

  • It’s important that independent variables in a multiple regression model are unrelated.

sns.regplot(y = 'immigration5', x = 'median_income', data = asthma_socdem, ci = None);
../../../_images/5b8cf5c3f8b2828fec4d943bb3b98bf84d0e3e78aa89320afaf60c4bbbb58de2.png

A quick review of linear regression#

  • Step 1: prepare data (Which dataset I want to import? Which variables I want to analyze? How I can ‘wrangle’ the data, i.e., produce new variables and merge them into one dataframe for further analysis?)

  • Step 2: Explore pair relationship (Using sns.pairplot to explore potential relationships. Is there a linear relationship?)

  • Step 3: Fit model

  • Step 4: Accuracy (Which indicators I want to use? e.g., r-squared, residual plot, RMSE)

  • Step 5: Display the full model and explain (What does the model mean?)

Introduction to spatial data#

Why Spatial Data Matters#

What are spatial data?#

Spatial data are data that have some geographic attribute(s). So you can think of it like each row having a pointer to where the row goes on a map.

Why do we want spatial data?#

By locating things on a map, we can look for how certain variables present vary across space. We can look for things like clustering of high values or cold spots (aka hot spots/cold spots) or dispersal. Patterns across space can reveal interesting drivers of social phenomena.

toronto income

Types of Spatial Data#

raster data & vector data#

vector and raster

Raster data#

For example, satellite and aerial images

landsat data

Vector data#

For example, points, lines, and polygons

points lines polys

How are these data stored in a computer?#

There are many ways!

Generally raster data are stored as a matrix#

matrix raster

Vector data is stored in many different ways
#

Two common ones are geojson and shapefiles.

What are these files made of?

Geojson is pretty straightforward:

geojson example

Shapefiles are more complicated and involve multiple distinct files that contain information about spatial locations, coordinate systems and projections, attributes, etc. We’ll focus on geojson for this class.

Introducing GeoPandas#

How do we work with spatial data in python?#

We’re going to use something called ‘geopandas’

https://geopandas.org/en/stable/

There are a few other libraries that we need, like mapclassify. There are a lot of different tools out there, but we’ll stick to these two (and the xls reader) for today

# # Let's install and then load libraries
# import sys
# !{sys.executable} -m pip install geopandas
# !{sys.executable} -m pip install mapclassify
# !{sys.executable} -m pip install xlrd
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import mapclassify
import xlrd

Now we want to import a spatial data file.#

We’ll be working with two spatial data files today: Toronto_Neighbourhoods.geojson and toronto_rails.geojson

We’ll also use the health data we were playing with last week: 1_ahd_neighb_db_ast_hbp_mhv_copd_2007.xls

To load the spatial data we need to create something known as a geodataframe:#

nbrhd = gpd.GeoDataFrame.from_file("Toronto_Neighbourhoods.geojson") ## Like pd.read_xxx(file_name)
nbrhd.head()
AREA_ATTR_ID AREA_DESC AREA_ID AREA_LONG_CODE AREA_NAME AREA_SHORT_CODE CLASSIFICATION CLASSIFICATION_CODE LATITUDE LONGITUDE OBJECTID PARENT_AREA_ID Shape__Area Shape__Length X Y _id geometry
0 26005521 Casa Loma (96) 2480141 096 Casa Loma (96) 096 None None None None 17545105 None 3.678385e+06 8214.176485 None None 1 POLYGON ((-79.41469 43.67391, -79.41485 43.674...
1 26005520 Annex (95) 2480140 095 Annex (95) 095 None None None None 17545121 None 5.337192e+06 10513.883143 None None 2 POLYGON ((-79.39414 43.66872, -79.39588 43.668...
2 26005519 Caledonia-Fairbank (109) 2480139 109 Caledonia-Fairbank (109) 109 None None None None 17545137 None 2.955857e+06 6849.911724 None None 3 POLYGON ((-79.46021 43.68156, -79.46044 43.681...
3 26005444 Woodbine Corridor (64) 2480064 064 Woodbine Corridor (64) 064 None None None None 17545153 None 3.052518e+06 7512.966773 None None 4 POLYGON ((-79.31485 43.66674, -79.31660 43.666...
4 26005443 Lawrence Park South (103) 2480063 103 Lawrence Park South (103) 103 None None None None 17545169 None 6.211341e+06 13530.370002 None None 5 POLYGON ((-79.41096 43.70408, -79.41165 43.703...
# Let's look at the geometry column - this is the data that 
# describes the polygon representing the first neighbourhood

print(nbrhd.loc[0,'geometry'])
POLYGON ((-79.41469317817781 43.6739104164259, -79.41484930122832 43.6743388247927, -79.4155279126094 43.67606998537741, -79.4157867581137 43.6767302521386, -79.4164385645546 43.6783329020511, -79.4165782543862 43.6787785512344, -79.4167583944682 43.6792484522149, -79.41698203537291 43.6798247440047, -79.4170902471251 43.6801047004014, -79.4174797570886 43.681051830379, -79.4176666295158 43.6815699096212, -79.4176976020857 43.68163169892501, -79.4178116560388 43.6819252777403, -79.4181213711598 43.6827105222857, -79.4182734938897 43.6830961662335, -79.418301440907 43.683167503583, -79.4184313411692 43.6834966841996, -79.4184870757371 43.6836294842548, -79.4185410795905 43.6837582054538, -79.4184092922561 43.683786393574, -79.4182955283344 43.6834937157887, -79.4154265333151 43.684065660547, -79.4135200432832 43.6844456915062, -79.4134087861596 43.6844695957643, -79.4134686959176 43.6846245536214, -79.4124643275333 43.6848288476745, -79.4118960009694 43.6849418545125, -79.4126020656339 43.6866857692443, -79.4128218223812 43.687229039671, -79.4122731860895 43.687340632143, -79.4123095851789 43.6874306640936, -79.4124449310775 43.6877690817767, -79.4125749764611 43.6881026705283, -79.4126022442033 43.68820308289101, -79.4126262355444 43.6883388768546, -79.412631283351 43.6884334123732, -79.4126432068819 43.6885168725437, -79.4126690850027 43.6886690142942, -79.4121798023448 43.6887725684515, -79.4121317450359 43.688782747996, -79.4094911620232 43.6893416576956, -79.4095159141932 43.6894373160946, -79.4079543147094 43.6897847976179, -79.4064026019847 43.6900806794424, -79.4048987011668 43.6903958607535, -79.4030176369546 43.6907821281771, -79.4030622386805 43.6907397518989, -79.403084589916 43.6906872662864, -79.4030847025454 43.6906327275232, -79.40233853963132 43.6887211977866, -79.4014954899744 43.6866406966408, -79.401126389152 43.6857160700657, -79.400841057476 43.6850272500519, -79.4005303647662 43.6843098010236, -79.4002092949982 43.6835267044631, -79.4001688947727 43.6834159800984, -79.3998035302044 43.6825434875815, -79.3992154226597 43.6811276314469, -79.3990654446524 43.6807665464922, -79.3989394746847 43.6804604265612, -79.3987331620365 43.679962228305, -79.3985629657772 43.6795542728627, -79.3983515160598 43.6790391648102, -79.3981409020124 43.6785307360415, -79.3980121612171 43.6782038734666, -79.3979440512351 43.6780262977103, -79.3979313672526 43.6779949618024, -79.397997906573 43.6779648847871, -79.39810766305662 43.6779154844303, -79.3982178162304 43.677866084320804, -79.3983283660932 43.677816684457504, -79.3984396598895 43.6777672941492, -79.3985509707706 43.677722377402, -79.39866305797811 43.6776729695768, -79.3987751498381 43.6776280623042, -79.39888799806091 43.677583155590305, -79.3990012429834 43.6775382491152, -79.399114879728 43.6774888512157, -79.3992025285702 43.6774574239244, -79.3992901799736 43.6774169952501, -79.3993786050701 43.6773810678483, -79.3994674269127 43.677345149728104, -79.3995566132194 43.6773137235166, -79.3996462037999 43.6772777969312, -79.3997357818565 43.6772418792657, -79.3998261336784 43.6772104448643, -79.3999168946437 43.6771790287522, -79.4000076183382 43.6771475945326, -79.4000991356532 43.6771161699357, -79.4001906603179 43.6770802536138, -79.4002825618979 43.6770488292052, -79.4003752372778 43.6770218970549, -79.40046791256 43.6769949738305, -79.4005609796537 43.6769635502124, -79.4006540144064 43.6769366361508, -79.40074786265342 43.6769052040578, -79.4008416959652 43.6768827821947, -79.4009359160751 43.6768513592708, -79.4010305181557 43.6768289379187, -79.4011394914127 43.6767930269395, -79.40124844475712 43.6767616164984, -79.4013577948506 43.6767302062957, -79.4014675168886 43.6766987963088, -79.401577635645 43.6766674045606, -79.4016877469976 43.6766404773541, -79.4017982426678 43.6766135773755, -79.401909142447 43.6765821679758, -79.4020200223858 43.6765552591109, -79.4021313064015 43.6765238498247, -79.4022429725496 43.6765014420761, -79.4023546335109 43.6764745335542, -79.4024666788314 43.6764476252511, -79.4025787240519 43.6764207168379, -79.4026915556349 43.6763983006426, -79.40280397272 43.67637139232491, -79.402883149386 43.6763541860873, -79.4029167917251 43.6763489758964, -79.4030299826882 43.6763265686732, -79.4031435776805 43.6762996520191, -79.4049670168388 43.6759208047062, -79.4073338149418 43.675429007407, -79.407847991862 43.67532218132661, -79.4113608949251 43.6745959152122, -79.4121860970333 43.6744315128604, -79.4137161812489 43.6741115018232, -79.4146748682077 43.6739143093347, -79.41469317817781 43.6739104164259))
# Otherwise we can continue to treat the variable like a dataframe ... 
# it's what we have been doing all along!

nbrhd_names = nbrhd["AREA_NAME"]
type(nbrhd_names)
pandas.core.series.Series

Now if we want to make a map, what do we do?#

It’s easy! You can “plot” a geodataframe and geopandas will make it a map!

nbrhd.plot() 
<AxesSubplot:>
../../../_images/27e712e60dc91b07a0b35350413459ec3351f1bd6903e40a2a2c0b5eb62327e4.png

That’s nice
#

But we want to be a little more sophisticated with our maps.

basic_gif

The next cell goes through a better way to create a map “figure”.

Let’s first create Figure and Axes objects.

fig and ax

First, let’s use plt.subplots to create a figure variable “fig” and an axes variables called “ax”.

The first line of code sets the stage - it says how many images - by rows/columns, and we can set other figure attributes in here, like figure size using figsize.

To plot the map, we need to call the plot command .plot like before, but this time we want to say which ‘cell’ or ax cell it will go in, i.e., ax=axes. We also can use the column parameter to shade each row (that represents a polygon) by an attribute value.

fig, axes = plt.subplots(1, 1, figsize = (12,12))

print(type(fig)) #Figure object is like the sheet we are drawing on
print(type(axes))  #AxesSubplots are like cells within the sheet

nbrhd.plot(column = "AREA_NAME", ax = axes);
<class 'matplotlib.figure.Figure'>
<class 'matplotlib.axes._subplots.AxesSubplot'>
../../../_images/1ef46f2889d9c4626293574800eddb25e22fb7ccb194f6d053137ce64cb49661.png

What if we want to do 2 maps in one figure?

fig, axes = plt.subplots(1, 2)

#notice there are just 2 things in the axes object
print(axes.shape)

# pick the first cell in the figure
ax = axes[0]
nbrhd.plot(column = "AREA_NAME",
           ax=ax)

# pick the second cell in the figure
ax = axes[1]
nbrhd.plot(ax=ax)

#resize
current_fig = plt.gcf() #get current figure aka gcf
current_fig.set_size_inches(15,10) #set figure width
current_fig.savefig('torontomaps.png',dpi=100) #export maps as an image file!
(2,)
../../../_images/9262229187a3325950a280e13573f69baf327062d460403bc3e853453f6df21b.png

A note about axes:

  • if you only has one ‘axes’ (i.e., plt.subplots(1,1)) then you can just say ax = axes

  • if you have a 1 row and multiple columns or multiple rows and 1 column, then you would say ax = axes[n] where n is the axes cell you’d like to put the figure in

  • if you have a matrix of cells of axes (i.e., plt.subplots(2,2)) then you need to say ax = axes[n][m] where n is the row number and m is the column number of the cell

Let’s try with a column that contains a number
 what happens there?#

let’s also add a new parameter “cmap” = color map and a legend.

This link will provide more info on other things we can do with plotting geodataframes: https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoDataFrame.plot.html

fig, axes = plt.subplots(1,1)

# We want to plot the area of each neighborhood
nbrhd.plot(column = "Shape__Area", ax = axes, 
           cmap = "spring", legend = True)

current_fig = plt.gcf()
current_fig.set_size_inches(10, 10)
../../../_images/149ff182d47c37e9eb7da49c1a020fa4ae588ce77bc106e2a822282b6111a780.png

let’s try again with neighbourhoods:#

#another way to resize
fig, axes = plt.subplots(1,1, figsize = (15,15))


nbrhd.plot(column = "AREA_NAME", ax = axes,
           cmap = "Pastel1", legend = True)
<AxesSubplot:>
../../../_images/9323f71fa819ff3c5dffc0d0c6052f92ac02e0341b053e72fabd1ef49f21cf9c.png


 These are silly maps. Who cares how big the neighbourhoods are or who can deal with a weird pastel map with a legend that is way too big


We want to know what is happening inside of them!#

How do we bring two pieces of info together? How do we bring data from a table into a geodataframe?

We can use joins! Remember those?

How do we bring data from a table into a geodataframe?#

We can use merge, just like what we did before!

Let’s join the asthma rates into the neighborhood and map them out.

# simplify the geodataframe ready

important_spat_cols = nbrhd.columns[[4, 5, 17]]
colnames_spat = {important_spat_cols[0]: 'name',
           important_spat_cols[1] : 'nbrhd_spat_id',
           important_spat_cols[2] : 'geometry'}

nbrhd_simple = nbrhd.copy()

nbrhd_simple = nbrhd_simple[important_spat_cols]

nbrhd_simple.rename(columns = colnames_spat, inplace=True)

# Notice that the "nbrhd_spat_id" is a string and not a number... that may be an issue
print("The data type of nbrhd_spat_id is: ", type(nbrhd_simple["nbrhd_spat_id"][0]))
nbrhd_simple.head()
The data type of nbrhd_spat_id is:  <class 'str'>
name nbrhd_spat_id geometry
0 Casa Loma (96) 096 POLYGON ((-79.41469 43.67391, -79.41485 43.674...
1 Annex (95) 095 POLYGON ((-79.39414 43.66872, -79.39588 43.668...
2 Caledonia-Fairbank (109) 109 POLYGON ((-79.46021 43.68156, -79.46044 43.681...
3 Woodbine Corridor (64) 064 POLYGON ((-79.31485 43.66674, -79.31660 43.666...
4 Lawrence Park South (103) 103 POLYGON ((-79.41096 43.70408, -79.41165 43.703...
#get the asthma data

fname = '1_ahd_neighb_db_ast_hbp_mhv_copd_2007.xls' #file name
sname = '1_ahd_neighb_asthma_2007' #sheet name in excel file

#store excel sheet with asthma data in a dataframe variable
asthma_neighb = pd.read_excel(fname, sheet_name = sname, header = 11)
asthma_neighb.head()
Unnamed: 0 Unnamed: 1 Demographics ÂȘ % With asthma LL (95% CI) UL (95% CI) Demographics ÂȘ.1 % With asthma.1 LL (95% CI) .1 UL (95% CI) .1 ... Demographics ÂȘ.7 % With asthma.7 LL (95% CI) .7 UL (95% CI) .7 Demographics ÂȘ.8 % With asthma.8 LL (95% CI) .8 UL (95% CI) .8 Rate Ratio**.2 H/L/NS.2
0 1 West Humber-Clairville 11977 9.7 9.2 10.2 11770 10.6 10.0 11.2 ... 4435 12.3 11.3 13.3 8921 10.9 10.3 11.6 1.07 H
1 2 Mount Olive-Silverstone-Jamestown 11157 7.9 7.4 8.4 11462 10.2 9.6 10.8 ... 3819 13.5 12.5 14.6 7726 11.2 10.5 11.9 1.10 H
2 3 Thistletown-Beaumond Heights 3307 9.4 8.4 10.4 3322 10.5 9.5 11.6 ... 1307 12.3 10.5 14.1 2587 10.8 9.6 12.0 1.06 NS
3 4 Rexdale-Kipling 3317 9.7 8.7 10.7 3438 13.0 11.9 14.2 ... 1468 13.2 11.5 14.9 2897 10.4 9.3 11.5 1.02 NS
4 5 Elms-Old Rexdale 3209 10.2 9.1 11.2 3379 13.0 11.8 14.1 ... 1314 15.1 13.1 17.0 2610 11.8 10.6 13.0 1.16 H

5 rows × 44 columns

#let's simplify this by only pulling a few relevant columns
important_cols = asthma_neighb.columns[[0, 1, 11]]
colnames = {important_cols[0]: 'Neighbid',
           important_cols[1] : 'name',
           important_cols[2] : 'asthma_pct'}

asthma_rates = asthma_neighb.copy()

asthma_rates = asthma_rates[important_cols]

asthma_rates.rename(columns = colnames, inplace=True)

print("The data type of Neighbid is: ", type(asthma_rates["Neighbid"][0]))
asthma_rates.head()
The data type of Neighbid is:  <class 'numpy.int64'>
Neighbid name asthma_pct
0 1 West Humber-Clairville 10.2
1 2 Mount Olive-Silverstone-Jamestown 9.1
2 3 Thistletown-Beaumond Heights 10.0
3 4 Rexdale-Kipling 11.4
4 5 Elms-Old Rexdale 11.6

Note that the nbrhd_spat_id in Neighborhood GeoDataFrame is string#

Let’s convert it into a number.

# Create a new variable in the nbrhd_simple geodataframe and store the new number
# version of the neighbourhood id here
nbrhd_simple["Neighbid"] = nbrhd_simple["nbrhd_spat_id"].astype(int)

# we can sort by the id - and check to see if they line up with the asthma data 
nbrhd_simple.sort_values('Neighbid').head()
name nbrhd_spat_id geometry Neighbid
30 West Humber-Clairville (1) 001 POLYGON ((-79.55236 43.70947, -79.55238 43.709... 1
40 Mount Olive-Silverstone-Jamestown (2) 002 POLYGON ((-79.60338 43.75786, -79.60205 43.758... 2
124 Thistletown-Beaumond Heights (3) 003 POLYGON ((-79.57751 43.73384, -79.57806 43.734... 3
122 Rexdale-Kipling (4) 004 POLYGON ((-79.55512 43.71510, -79.55504 43.714... 4
22 Elms-Old Rexdale (5) 005 POLYGON ((-79.55512 43.71510, -79.55569 43.716... 5

Now let’s do the join!#

nbrhd_simple = nbrhd_simple.merge(asthma_rates, on="Neighbid")
nbrhd_simple.head()
name_x nbrhd_spat_id geometry Neighbid name_y asthma_pct
0 Casa Loma (96) 096 POLYGON ((-79.41469 43.67391, -79.41485 43.674... 96 Casa Loma 10.1
1 Annex (95) 095 POLYGON ((-79.39414 43.66872, -79.39588 43.668... 95 Annex 9.5
2 Caledonia-Fairbank (109) 109 POLYGON ((-79.46021 43.68156, -79.46044 43.681... 109 Caledonia-Fairbank 11.3
3 Woodbine Corridor (64) 064 POLYGON ((-79.31485 43.66674, -79.31660 43.666... 64 Woodbine Corridor 13.0
4 Lawrence Park South (103) 103 POLYGON ((-79.41096 43.70408, -79.41165 43.703... 103 Lawrence Park South 10.7

Now that we have merged our spatial data with our attribute data
#

we can plot the values using the geodataframe maps plotting capability#

#generate our figure/axes and set fig size
fig, axes = plt.subplots(1, 1, figsize = (12,12))

#let's map asthma percent rates across toronto!
nbrhd_simple.plot(column = "asthma_pct",
                  legend = True, ax = axes, 
                  cmap = "YlOrRd")
<AxesSubplot:>
../../../_images/a16e2c45b81dbaaa498798125905025b7d69d93e48b47c01e511706610205e3c.png

Hmm - Looks like there are some spatial patterns!#

We can also look at non-spatial figures to explore our data. Doing this helps us understand the ‘shape’ of our data in different ways.

Let’s create a histogram of the same data in the following cell:

# Just like what we do in pandas
nbrhd_simple['asthma_pct'].hist(bins=8)

plt.xlabel('Asthma Percent')
plt.ylabel('Number of neighbourhoods')
plt.title('Distribution of percent of residents with asthma in Toronto, ON')
plt.show()
../../../_images/97b669d417da2443b603d0d0ac6d01c7b17feaf191306d779e148f0c2eab12e8.png

Interesting!#

This looks like a “normal” distribution:

  • Many neighbourhoods with values in the middle range, but some at the extremes (both low and high).

But when we look at the map, we notice that neighbourhoods with high values tend to be near other neighbourhoods with high values, and low value neighbourhoods near others with low values.

Maybe there is some spatial mechanism causing this!

To dig into this more, let’s classify the data using quartiles
#

To do this we add a scheme parameter to the plot function and use quantiles. the k value tells the plot function how many quantiles you want. We want four quantiles (0-25%, 25%-50%, 50%-75%, 75%-100%).

More information about plot function

We do this for a few reasons:

  • (1) It makes it easier to see groups with similar values that are close to eachother and

  • (2) it makes the legend easier to interpret.

Note that we also have added an edgecolor parameter which tells the map what color to outline neighbourhoods with.

fig, axes = plt.subplots(1,1, figsize =  (10,10))
nbrhd_simple.plot(column='asthma_pct', scheme='quantiles', 
                  k=4, cmap='YlOrRd', edgecolor='black', 
                  ax = axes, legend=True)
<AxesSubplot:>
../../../_images/bea1a976986bf78fb632a6327b4bddcdccbef9fb158bb6102ccacbedafb36804.png

This map looks a lot better, I think! But the legend is overlapping the map. We should move it.

The legend options are kind of complicated, but you can find more about stuff here: https://matplotlib.org/3.5.1/api/_as_gen/matplotlib.pyplot.legend.html

I’m going to show you a few things you can do, like move the legend, adjust font size, and add a title:

fig, axes = plt.subplots(1,1, figsize =  (12,12))

nbrhd_simple.plot(column='asthma_pct', scheme='quantiles', 
                  k=4, cmap='Greens', edgecolor='grey', 
                  ax = axes, legend=True, 
                  legend_kwds={'loc': 4, 'title': 'Percent Asthmatic',
                               'title_fontsize': 23,'fontsize': 15}) #'loc'=4 means lower right
<AxesSubplot:>
../../../_images/58575b3e7cf372c98f39147ef51da65e3a6327803b7a2783ad099709e758ad33.png

Can we show descriptive stats charts next to a map?

Yes!

#First create a figure sheet with 2 axes cells (and resize the figure)
fig, axes = plt.subplots(1,2, figsize = (15,5))

#let's put the map in the first axes cell:
nbrhd_simple.plot(column='asthma_pct', scheme='quantiles', 
                  k=4, cmap='Greens', edgecolor='grey', linewidth = 0.2, #notice I added linewidth to narrow the borders
                  ax = axes[0], legend=True, 
                  legend_kwds={'loc': 4, 'title': 'Percent Asthmatic', 
                               'title_fontsize': 10,'fontsize': 9})

#we can put the histogram we made earlier in the second axes cell:
nbrhd_simple['asthma_pct'].hist(bins=8, ax = axes[1])
plt.xlabel('Asthma Percent')
plt.ylabel('Number of neighbourhoods')
plt.title('Distribution of percent of residents with asthma in Toronto, ON')
Text(0.5, 1.0, 'Distribution of percent of residents with asthma in Toronto, ON')
../../../_images/91594a7a8775a9b150f3190b5c4a4c393736900d7e38dc16263e6b10624a59ad.png

First steps into “spatial analysis”
#

OK great! That chart is interesting and gets us thinking. But one of the powers of spatial analysis is relating environments to spatial phenomena.

Coordinate systems and projections#

When working with spatial data, you may need to think about coordinate systems and geographic projections.

Because we are taking data that exist on a sphere and laying them out flat on a plane, all spatial data is projected.

projections

If you have two layers of spatial data it is always good to check that they are in the same projection, or else you will get errors or bad results.

If the files are not in the same projection, it’s pretty simple to convert one of the files to the same projection as the other file:

# double check coordinate systems - we always want to work in the same projection

print("The original projection of railways is: ", railways.crs)
print("The original projection of nbrhd_simple is: ", nbrhd_simple.crs)

# reproject railways so it's the same as our neighbourhood layer
railways = railways.to_crs(nbrhd_simple.crs) #we can just borrow the "crs" (coordinate reference system) from the other spatial file
print("The NEW projection of railways is: ", railways.crs)
print("And nbrhd_simple is still: ", nbrhd_simple.crs)
railways.plot()
The original projection of railways is:  EPSG:4269
The original projection of nbrhd_simple is:  EPSG:4326
The NEW projection of railways is:  EPSG:4326
And nbrhd_simple is still:  EPSG:4326
<AxesSubplot:>
../../../_images/97eff3470ed1e39b220f931434c350daff57c8163f8e0a8cfd447921a8bbffc2.png
nbrhd_simple.head()
name_x nbrhd_spat_id geometry Neighbid name_y asthma_pct
0 Casa Loma (96) 096 POLYGON ((-79.41469 43.67391, -79.41485 43.674... 96 Casa Loma 10.1
1 Annex (95) 095 POLYGON ((-79.39414 43.66872, -79.39588 43.668... 95 Annex 9.5
2 Caledonia-Fairbank (109) 109 POLYGON ((-79.46021 43.68156, -79.46044 43.681... 109 Caledonia-Fairbank 11.3
3 Woodbine Corridor (64) 064 POLYGON ((-79.31485 43.66674, -79.31660 43.666... 64 Woodbine Corridor 13.0
4 Lawrence Park South (103) 103 POLYGON ((-79.41096 43.70408, -79.41165 43.703... 103 Lawrence Park South 10.7

Now that our two geodataframe spatial layers are in the same projection, we can plot them together#

We want to create 1 map and tell geopandas to put the two spatial layers in the same axes cell:

fig, axes = plt.subplots(1,1, figsize = (15,10))



nbrhd_simple.plot(column='asthma_pct', scheme='quantiles', 
                  k=4, cmap='Greens', edgecolor='grey',
                  linewidth = .2, ax = axes, legend=True, 
                  legend_kwds={'loc': 4, 'title': 'Percent Asthmatic', 
                               'title_fontsize': 20,'fontsize': 15})

railways.plot(edgecolor="red", linewidth = .5, ax = axes) #notice i'm putting this in the same axes cell
<AxesSubplot:>
../../../_images/91e69d335768617107d0e125a543c8ae1cedf256f29a3361698ac240a00e2ed3.png
railways.columns
Index(['OBJECTID', 'OGF_ID', 'TRACKNID', 'TRACKSEGID', 'TRACKNAME',
       'TRACKCLASS', 'REGULATOR', 'TRANSPTYPE', 'USETYPE', 'GAUGE',
       'NUMTRACKS', 'ELECTRIC', 'STATUS', 'SPEEDFREIT', 'SPEEDPASSE',
       'UNITOFSPEE', 'SOURCEID', 'OPERATOENA', 'OPERATOFNA', 'OPERATORMK',
       'OPSUBSTART', 'OPSUBEND', 'OPSUBUDIS', 'OWNERENA', 'OWNERFNA',
       'TRKUSR1ENA', 'TRKUSR1FNA', 'TRKUSR1RMK', 'TRKUSR2ENA', 'TRKUSR2FNA',
       'TRKUSR2RMK', 'TRKUSR3ENA', 'TRKUSR3FNA', 'TRKUSR3RMK', 'TRKUSR4ENA',
       'TRKUSR4FNA', 'TRKUSR4RMK', 'SUBDI1NID', 'SUBDI1NAME', 'SUBD1START',
       'SUBD1END', 'SUB1UNITDI', 'SUBDI2NID', 'SUBDI2NAME', 'SUBD2START',
       'SUBD2END', 'SUB2UNITDI', 'ADMINAREA', 'SPECVERS', 'SECURCLASS',
       'GEOCREDATE', 'GEOREVDATE', 'GEOACQTECH', 'GEOACCURA', 'GEOPROVIDE',
       'ATTCREDATE', 'ATTREVDATE', 'ATTACQTECH', 'ATTPROVIDE', 'EFFECTIVE_',
       'SHAPE_Leng', 'SHAPE_1_Le', 'geometry'],
      dtype='object')

Hmm - could there be something there?#

Maybe!

But what are the things that might be happening here?

Let’s do a “spatial join”#

What is a spatial join?

alt text

Basically, we are relating information from two tables that share a spatial relationship.

So information from layer a’s table will be joined with information from layer b’s table where the two intersect/touch/are within/etc.

We will use “intersect” here, but you can refer to the geopandas documentation to learn more:

https://geopandas.org/en/stable/gallery/spatial_joins.html

join_left_df = nbrhd_simple.sjoin(railways, how="left", predicate = "intersects")
join_left_df
name_x nbrhd_spat_id geometry Neighbid name_y asthma_pct index_right OBJECTID OGF_ID TRACKNID ... GEOACQTECH GEOACCURA GEOPROVIDE ATTCREDATE ATTREVDATE ATTACQTECH ATTPROVIDE EFFECTIVE_ SHAPE_Leng SHAPE_1_Le
0 Casa Loma (96) 096 POLYGON ((-79.41469 43.67391, -79.41485 43.674... 96 Casa Loma 10.1 1280.0 1720.0 10131.0 2579cdd5ccaf447e93bec017362a451a ... Vector Data 10.0 Provincial/Territorial 201306 None Vector Data Federal 2013-07-30 0.0 0.026163
0 Casa Loma (96) 096 POLYGON ((-79.41469 43.67391, -79.41485 43.674... 96 Casa Loma 10.1 3222.0 8267.0 16018.0 ced49be659214db99e606ea1f3153f03 ... Vector Data 10.0 Provincial/Territorial 201306 None Vector Data Federal 2013-07-30 0.0 0.043467
0 Casa Loma (96) 096 POLYGON ((-79.41469 43.67391, -79.41485 43.674... 96 Casa Loma 10.1 888.0 9370.0 17250.0 05f860e758f746be971872822cb137c8 ... Vector Data 10.0 Provincial/Territorial 201306 None Vector Data Federal 2013-07-30 0.0 0.060344
0 Casa Loma (96) 096 POLYGON ((-79.41469 43.67391, -79.41485 43.674... 96 Casa Loma 10.1 1844.0 10459.0 3086.0 1fd8c25a828f46829c173713f49c63d3 ... Vector Data 10.0 Provincial/Territorial 201306 None Vector Data Federal 2013-07-30 0.0 0.008783
1 Annex (95) 095 POLYGON ((-79.39414 43.66872, -79.39588 43.668... 95 Annex 9.5 1280.0 1720.0 10131.0 2579cdd5ccaf447e93bec017362a451a ... Vector Data 10.0 Provincial/Territorial 201306 None Vector Data Federal 2013-07-30 0.0 0.026163
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
137 Humbermede (22) 022 POLYGON ((-79.52628 43.73640, -79.52649 43.736... 22 Humbermede 10.8 1239.0 9798.0 2436.0 816b3f97bace44149b88174dc78a7e55 ... Orthoimage 10.0 Federal 201306 None Vector Data Federal 2013-07-30 0.0 0.002236
137 Humbermede (22) 022 POLYGON ((-79.52628 43.73640, -79.52649 43.736... 22 Humbermede 10.8 2305.0 15763.0 8280.0 9da306e9415d4e7fa5b006a39dc550c1 ... Orthoimage 10.0 Federal 201306 None Vector Data Federal 2013-07-30 0.0 0.002528
137 Humbermede (22) 022 POLYGON ((-79.52628 43.73640, -79.52649 43.736... 22 Humbermede 10.8 717.0 5591.0 13590.0 632d434dd2b24ef5aaf8544d87f467f5 ... Orthoimage 10.0 Federal 201306 None Vector Data Federal 2013-07-30 0.0 0.012516
138 Willowdale West (37) 037 POLYGON ((-79.44043 43.76340, -79.44052 43.763... 37 Willowdale West 8.0 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
139 Willowdale East (51) 051 POLYGON ((-79.41225 43.76669, -79.41232 43.766... 51 Willowdale East 6.0 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

2528 rows × 69 columns

join_left_df.loc[join_left_df['TRACKNID'].isnull()]
name_x nbrhd_spat_id geometry Neighbid name_y asthma_pct index_right OBJECTID OGF_ID TRACKNID ... GEOACQTECH GEOACCURA GEOPROVIDE ATTCREDATE ATTREVDATE ATTACQTECH ATTPROVIDE EFFECTIVE_ SHAPE_Leng SHAPE_1_Le
4 Lawrence Park South (103) 103 POLYGON ((-79.41096 43.70408, -79.41165 43.703... 103 Lawrence Park South 10.7 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
6 Henry Farm (53) 053 POLYGON ((-79.35966 43.76649, -79.35966 43.766... 53 Henry Farm 6.7 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
10 Pleasant View (46) 046 POLYGON ((-79.34346 43.79517, -79.34359 43.795... 46 Pleasant View 7.7 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
11 Steeles (116) 116 POLYGON ((-79.34132 43.81565, -79.33988 43.816... 116 Steeles 6.5 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
13 Brookhaven-Amesbury (30) 030 POLYGON ((-79.50296 43.69574, -79.50415 43.696... 30 Brookhaven-Amesbury 12.2 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
16 Yonge-Eglinton (100) 100 POLYGON ((-79.41096 43.70408, -79.40962 43.704... 100 Yonge-Eglinton 10.8 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
18 Englemount-Lawrence (32) 032 POLYGON ((-79.43854 43.70586, -79.43855 43.705... 32 Englemount-Lawrence 10.1 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
19 Bridle Path-Sunnybrook-York Mills (41) 041 POLYGON ((-79.39008 43.72768, -79.39016 43.728... 41 Bridle Path-Sunnybrook-York Mills 9.2 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
21 Danforth East York (59) 059 POLYGON ((-79.34627 43.68272, -79.34645 43.683... 59 Danforth Village-East York 12.5 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
22 Elms-Old Rexdale (5) 005 POLYGON ((-79.55512 43.71510, -79.55569 43.716... 5 Elms-Old Rexdale 11.6 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
24 North St.James Town (74) 074 POLYGON ((-79.38057 43.67161, -79.37947 43.671... 74 North St. James Town 8.9 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
25 Highland Creek (134) 134 POLYGON ((-79.17527 43.78021, -79.17535 43.780... 134 Highland Creek 10.8 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
27 Don Valley Village (47) 047 POLYGON ((-79.36414 43.77164, -79.36412 43.771... 47 Don Valley Village 7.2 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
28 Mount Pleasant West (104) 104 POLYGON ((-79.39611 43.69566, -79.39620 43.696... 104 Mount Pleasant West 8.6 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
29 Westminster-Branson (35) 035 POLYGON ((-79.44043 43.76340, -79.44004 43.761... 35 Westminster-Branson 7.3 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
41 Bay Street Corridor (76) 076 POLYGON ((-79.38752 43.65067, -79.38663 43.650... 76 Bay Street Corridor 7.3 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
42 Bathurst Manor (34) 034 POLYGON ((-79.43998 43.76156, -79.43967 43.761... 34 Bathurst Manor 9.2 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
44 L'Amoreaux (117) 117 POLYGON ((-79.28857 43.79607, -79.28921 43.795... 117 L'Amoreaux 8.1 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
45 Trinity-Bellwoods (81) 081 POLYGON ((-79.42241 43.64349, -79.42255 43.643... 81 Trinity-Bellwoods 9.5 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
47 Glenfield-Jane Heights (25) 025 POLYGON ((-79.52628 43.73640, -79.52721 43.740... 25 Glenfield-Jane Heights 11.7 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
49 Eringate-Centennial-West Deane (11) 011 POLYGON ((-79.56730 43.65425, -79.56755 43.654... 11 Eringate-Centennial-West Deane 11.2 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
54 Taylor-Massey (61) 061 POLYGON ((-79.28776 43.68978, -79.29269 43.688... 61 Crescent Town 9.9 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
55 Humewood-Cedarvale (106) 106 POLYGON ((-79.41849 43.68363, -79.41913 43.683... 106 Humewood-Cedarvale 10.8 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
60 Rustic (28) 028 POLYGON ((-79.50384 43.70288, -79.50425 43.704... 28 Rustic 11.5 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
62 Forest Hill North (102) 102 POLYGON ((-79.42556 43.70099, -79.42669 43.700... 102 Forest Hill North 8.9 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
63 Hillcrest Village (48) 048 POLYGON ((-79.34346 43.79517, -79.34351 43.795... 48 Hillcrest Village 6.9 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
64 Humber Heights-Westmount (8) 008 POLYGON ((-79.51329 43.69338, -79.51330 43.692... 8 Humber Heights-Westmount 9.8 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
65 Danforth (66) 066 POLYGON ((-79.32364 43.68337, -79.32477 43.683... 66 Danforth Village-Toronto 11.0 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
66 Newtonbrook West (36) 036 POLYGON ((-79.44576 43.79241, -79.44335 43.792... 36 Newtonbrook West 7.2 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
67 Black Creek (24) 024 POLYGON ((-79.53488 43.77269, -79.53441 43.772... 24 Black Creek 11.5 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
68 University (79) 079 POLYGON ((-79.40772 43.65648, -79.40847 43.658... 79 University 8.1 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
70 Kensington-Chinatown (78) 078 POLYGON ((-79.40401 43.64719, -79.40419 43.647... 78 Kensington-Chinatown 8.6 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
71 Old East York (58) 058 POLYGON ((-79.35056 43.69941, -79.35041 43.699... 58 Old East York 13.7 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
74 Palmerston-Little Italy (80) 080 POLYGON ((-79.42640 43.65360, -79.42744 43.656... 80 Palmerston-Little Italy 10.4 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
78 Bedford Park-Nortown (39) 039 POLYGON ((-79.40493 43.73601, -79.40503 43.735... 39 Bedford Park-Nortown 10.2 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
83 Willowridge-Martingrove-Richview (7) 007 POLYGON ((-79.52661 43.68233, -79.52705 43.682... 7 Willowridge-Martingrove-Richview 9.9 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
84 Woodbine-Lumsden (60) 060 POLYGON ((-79.31743 43.69644, -79.31775 43.697... 60 Woodbine-Lumsden 13.1 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
91 Morningside (135) 135 POLYGON ((-79.22552 43.78925, -79.22549 43.789... 135 Morningside 10.9 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
93 Lawrence Park North (105) 105 POLYGON ((-79.39008 43.72768, -79.39199 43.727... 105 Lawrence Park North 11.5 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
94 Clanton Park (33) 033 POLYGON ((-79.43371 43.73615, -79.43626 43.734... 33 Clanton Park 9.8 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
96 Markland Wood (12) 012 POLYGON ((-79.55959 43.63624, -79.55957 43.636... 12 Markland Wood 10.6 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
97 Maple Leaf (29) 029 POLYGON ((-79.49249 43.72010, -79.49185 43.720... 29 Maple Leaf 10.2 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
101 Etobicoke West Mall (13) 013 POLYGON ((-79.55959 43.63624, -79.55980 43.636... 13 Etobicoke West Mall 9.8 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
107 Church-Yonge Corridor (75) 075 POLYGON ((-79.37672 43.66242, -79.37658 43.662... 75 Church-Yonge Corridor 9.8 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
112 Newtonbrook East (50) 050 POLYGON ((-79.42008 43.79800, -79.41973 43.798... 50 Newtonbrook East 6.6 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
119 Oakwood Village (107) 107 POLYGON ((-79.45028 43.69569, -79.44934 43.695... 107 Oakwood-Vaughan 10.7 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
121 Mount Pleasant East (99) 099 POLYGON ((-79.37749 43.71309, -79.37739 43.712... 99 Mount Pleasant East 10.5 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
122 Rexdale-Kipling (4) 004 POLYGON ((-79.55512 43.71510, -79.55504 43.714... 4 Rexdale-Kipling 11.4 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
124 Thistletown-Beaumond Heights (3) 003 POLYGON ((-79.57751 43.73384, -79.57806 43.734... 3 Thistletown-Beaumond Heights 10.0 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
125 The Beaches (63) 063 POLYGON ((-79.31485 43.66674, -79.31356 43.667... 63 The Beaches 11.4 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
128 Forest Hill South (101) 101 POLYGON ((-79.42556 43.70099, -79.42314 43.701... 101 Forest Hill South 10.3 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
129 Princess-Rosethorn (10) 010 POLYGON ((-79.52679 43.65093, -79.52688 43.650... 10 Princess-Rosethorn 10.4 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
130 Lansing-Westgate (38) 038 POLYGON ((-79.43998 43.76156, -79.44004 43.761... 38 Lansing-Westgate 9.0 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
138 Willowdale West (37) 037 POLYGON ((-79.44043 43.76340, -79.44052 43.763... 37 Willowdale West 8.0 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
139 Willowdale East (51) 051 POLYGON ((-79.41225 43.76669, -79.41232 43.766... 51 Willowdale East 6.0 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

55 rows × 69 columns

Woah!#

Now we have 2528 rows
 that’s way more rows than neighbourhoods. What happened?

Spatial join creates a new row for every instance of ‘intersection’. So if railway 1 and railway 2 both go through neighbourhood a, then there will be two rows with neighbourhood a with one row joined with railway a data and the other joined with railway b data.

Also, notice that neighbourhoods that didn’t have any railway go through them were appended with NaNs. This is saying there is no data about railways for these neighbourhoods because there were none there.

Because we just want to know what neighbourhoods have ANY railway go through them, we can take the joined results and remove duplicate rows of each neighbourhood.

no_duplicates_gdf = join_left_df.copy()
no_duplicates_gdf = no_duplicates_gdf.drop_duplicates('Neighbid')
no_duplicates_gdf.shape
(140, 69)

Let’s check this new, no duplicate geodataframe to see if looks right.#

We can plot the map using “index_right” which is the index from the railway layer. The values are meaningless, but it shows us where there ARE values and where there are not.

no_duplicates_gdf.plot(column = "index_right")
<AxesSubplot:>
../../../_images/10e4f37de901fafae8b17623e7f05ef893257f60feabc12955fbb90f2208fc28.png

This looks good, based on visual inspection#

But we need a column that tells us there IS or there IS NOT a railway in each neighbourhood. We can go back to some of the skills we learned earlier in the course to do this.

We will first create two series of TRUE and FALSE values to show where a value exists in the “index_right” column.

The series rail_present will be true if there is a value in the column - implying a railway goes through the neighbourhood, and false otherwise.

The series rail_not_present will be true if there is a NaN value in the row where the railway info is, and false otherwise.

rail_present = no_duplicates_gdf["index_right"] >= 0
rail_present.value_counts()
True     85
False    55
Name: index_right, dtype: int64
rail_not_present = no_duplicates_gdf["index_right"].isnull()
rail_not_present.value_counts()
False    85
True     55
Name: index_right, dtype: int64

Let’s clean things up a little and create the final geodataframe we want to work with#

We can clean things up and then create a map and a descriptive statistics plot to show our findings.

#create list of columns relevant to our analysis
columns_to_keep = ["name_x","Neighbid","name_y","asthma_pct","index_right","geometry"]

#create our final analysis geodataframe and create a column 'rail_present' that is true 
#if railways are in the neighbourhood and false if not:
final_toronto_analysis = no_duplicates_gdf[columns_to_keep].copy()
final_toronto_analysis.loc[rail_present, 'rail_present'] = True
final_toronto_analysis.loc[rail_not_present, 'rail_present'] = False
final_toronto_analysis.head()
name_x Neighbid name_y asthma_pct index_right geometry rail_present
0 Casa Loma (96) 96 Casa Loma 10.1 1280.0 POLYGON ((-79.41469 43.67391, -79.41485 43.674... True
1 Annex (95) 95 Annex 9.5 1280.0 POLYGON ((-79.39414 43.66872, -79.39588 43.668... True
2 Caledonia-Fairbank (109) 109 Caledonia-Fairbank 11.3 1305.0 POLYGON ((-79.46021 43.68156, -79.46044 43.681... True
3 Woodbine Corridor (64) 64 Woodbine Corridor 13.0 1940.0 POLYGON ((-79.31485 43.66674, -79.31660 43.666... True
4 Lawrence Park South (103) 103 Lawrence Park South 10.7 NaN POLYGON ((-79.41096 43.70408, -79.41165 43.703... False
fig, axes = plt.subplots(1,2, figsize = (15,7))

final_toronto_analysis.boxplot(column="asthma_pct", 
                               by = "rail_present", 
                               ax = axes[0])

final_toronto_analysis.plot(column='asthma_pct', scheme='quantiles', 
                  k=2, cmap='Greens', edgecolor='grey', linewidth = 0.2, 
                  ax = axes[1], legend=True, 
                  legend_kwds={'loc': 4, 'title': 'Percent Asthmatic', 
                               'title_fontsize': 14,'fontsize': 11})
railways.plot(ax=axes[1], edgecolor='red',linewidth=0.5)

axes[0].set_title("")
axes[0].set_ylabel("Asthma Percent", fontsize = 15)
axes[0].set_xlabel("Rail Present?", fontsize = 15)

fig.suptitle("Percent of Population with Asthma for Neighbourhoods with Railways", 
             fontsize = 20, fontweight = "bold")
Text(0.5, 0.98, 'Percent of Population with Asthma for Neighbourhoods with Railways')
../../../_images/4d1f6b9c74aab35b663ed677730e50412106a61a7b0d8667587e199be51fe4c0.png

It seems that the presence of railway in neighborhood is related to higher asthma rate!

But is it a significant relationship?

Remember the comparison between two groups before!

Let’s do a hypothesis testing using shuffling method.

  • Null hypothesis: There is no differences of asthma rates between neighborhoods with railways and without railways.

  • Alternative hypothesis: Neighborhoods with railways have higher asthma rates than the neighborhoods without railways.

Let’s do a exercise.

# Simulation
simulated_diffs = []

for _ in range(5000):
    simulated = final_toronto_analysis['rail_present'].sample(frac=1, replace = False).reset_index(drop = True)
    mean_railway = final_toronto_analysis.loc[simulated == True, 'asthma_pct'].mean()
    mean_norailway = final_toronto_analysis.loc[simulated == False, 'asthma_pct'].mean()
    sim_diff = mean_railway - mean_norailway
    simulated_diffs.append(sim_diff)

# Calculate observed difference
obs_mean_railway = final_toronto_analysis.loc[final_toronto_analysis['rail_present'] == True, 'asthma_pct'].mean()
obs_mean_norailway = final_toronto_analysis.loc[final_toronto_analysis['rail_present'] == False, 'asthma_pct'].mean()

obs_diff = obs_mean_railway - obs_mean_norailway

# Calculate p-value
simulated_diffs_df = pd.DataFrame({'sim_diffs' : simulated_diffs})

numgreater = (simulated_diffs_df['sim_diffs'] >= obs_diff).sum()
print('The number of simulated differences greater than the observed difference is:', numgreater)

numsmaller = (simulated_diffs_df['sim_diffs'] < -1 * obs_diff).sum()
print('The number of simulated differences smaller than the observed difference is:', numsmaller)

pvalue = (numgreater + numsmaller)/ 5000

print('The p-value is:', pvalue)
The number of simulated differences greater than the observed difference is: 11
The number of simulated differences smaller than the observed difference is: 13
The p-value is: 0.0048

Great! We can say the asthma rates in neighborhoods with railways are significantly higher than the rates in neighborhoods without railways.

But is this a causal relationship?