Class 11#

Mar. 26, 2024#

By the end of this class, you’ll be able to:

  • Load in spatial data

  • Join geodataframes with attribute tables

  • Plot maps in multiple ways

  • Gain some introductory knowledge about clustering (we’ll talk about it next week!)

# ## Let's install and then load libraries
# %pip install geopandas
# %pip install mapclassify
# %pip install contextily
# %pip install libpysal
# %pip install esda
# %pip install splot
# %pip install xlrd
# %pip install numba==0.53
# Let's start with importing the packages

import numpy as np
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import mapclassify
import esda
import splot
import libpysal as lps
import contextily as cx
import xlrd

A review of spatial data and mapping#

Let’s bring in the same files we worked with last week#

We’ll be working with the Toronto Neighbourhoods spatial data and the health data today:

  • Toronto_Neighbourhoods.geojson

  • 1_ahd_neighb_db_ast_hbp_mhv_copd_2007.xls

1. Using GeoPandas to load in spatial data#

nbrhd = gpd.GeoDataFrame.from_file("Toronto_Neighbourhoods.geojson")

The content of the spatial data

nbrhd.head(2)
AREA_ATTR_ID AREA_DESC AREA_ID AREA_LONG_CODE AREA_NAME AREA_SHORT_CODE CLASSIFICATION CLASSIFICATION_CODE LATITUDE LONGITUDE OBJECTID PARENT_AREA_ID Shape__Area Shape__Length X Y _id geometry
0 26005521 Casa Loma (96) 2480141 096 Casa Loma (96) 096 None None None None 17545105 None 3.678385e+06 8214.176485 None None 1 POLYGON ((-79.41469 43.67391, -79.41485 43.674...
1 26005520 Annex (95) 2480140 095 Annex (95) 095 None None None None 17545121 None 5.337192e+06 10513.883143 None None 2 POLYGON ((-79.39414 43.66872, -79.39588 43.668...

What is geometry?

first_neighbor = nbrhd.loc[0,'geometry']
print(first_neighbor)
POLYGON ((-79.41469317817781 43.6739104164259, -79.41484930122832 43.6743388247927, -79.4155279126094 43.67606998537741, -79.4157867581137 43.6767302521386, -79.4164385645546 43.6783329020511, -79.4165782543862 43.6787785512344, -79.4167583944682 43.6792484522149, -79.41698203537291 43.6798247440047, -79.4170902471251 43.6801047004014, -79.4174797570886 43.681051830379, -79.4176666295158 43.6815699096212, -79.4176976020857 43.68163169892501, -79.4178116560388 43.6819252777403, -79.4181213711598 43.6827105222857, -79.4182734938897 43.6830961662335, -79.418301440907 43.683167503583, -79.4184313411692 43.6834966841996, -79.4184870757371 43.6836294842548, -79.4185410795905 43.6837582054538, -79.4184092922561 43.683786393574, -79.4182955283344 43.6834937157887, -79.4154265333151 43.684065660547, -79.4135200432832 43.6844456915062, -79.4134087861596 43.6844695957643, -79.4134686959176 43.6846245536214, -79.4124643275333 43.6848288476745, -79.4118960009694 43.6849418545125, -79.4126020656339 43.6866857692443, -79.4128218223812 43.687229039671, -79.4122731860895 43.687340632143, -79.4123095851789 43.6874306640936, -79.4124449310775 43.6877690817767, -79.4125749764611 43.6881026705283, -79.4126022442033 43.68820308289101, -79.4126262355444 43.6883388768546, -79.412631283351 43.6884334123732, -79.4126432068819 43.6885168725437, -79.4126690850027 43.6886690142942, -79.4121798023448 43.6887725684515, -79.4121317450359 43.688782747996, -79.4094911620232 43.6893416576956, -79.4095159141932 43.6894373160946, -79.4079543147094 43.6897847976179, -79.4064026019847 43.6900806794424, -79.4048987011668 43.6903958607535, -79.4030176369546 43.6907821281771, -79.4030622386805 43.6907397518989, -79.403084589916 43.6906872662864, -79.4030847025454 43.6906327275232, -79.40233853963132 43.6887211977866, -79.4014954899744 43.6866406966408, -79.401126389152 43.6857160700657, -79.400841057476 43.6850272500519, -79.4005303647662 43.6843098010236, -79.4002092949982 43.6835267044631, -79.4001688947727 43.6834159800984, -79.3998035302044 43.6825434875815, -79.3992154226597 43.6811276314469, -79.3990654446524 43.6807665464922, -79.3989394746847 43.6804604265612, -79.3987331620365 43.679962228305, -79.3985629657772 43.6795542728627, -79.3983515160598 43.6790391648102, -79.3981409020124 43.6785307360415, -79.3980121612171 43.6782038734666, -79.3979440512351 43.6780262977103, -79.3979313672526 43.6779949618024, -79.397997906573 43.6779648847871, -79.39810766305662 43.6779154844303, -79.3982178162304 43.677866084320804, -79.3983283660932 43.677816684457504, -79.3984396598895 43.6777672941492, -79.3985509707706 43.677722377402, -79.39866305797811 43.6776729695768, -79.3987751498381 43.6776280623042, -79.39888799806091 43.677583155590305, -79.3990012429834 43.6775382491152, -79.399114879728 43.6774888512157, -79.3992025285702 43.6774574239244, -79.3992901799736 43.6774169952501, -79.3993786050701 43.6773810678483, -79.3994674269127 43.677345149728104, -79.3995566132194 43.6773137235166, -79.3996462037999 43.6772777969312, -79.3997357818565 43.6772418792657, -79.3998261336784 43.6772104448643, -79.3999168946437 43.6771790287522, -79.4000076183382 43.6771475945326, -79.4000991356532 43.6771161699357, -79.4001906603179 43.6770802536138, -79.4002825618979 43.6770488292052, -79.4003752372778 43.6770218970549, -79.40046791256 43.6769949738305, -79.4005609796537 43.6769635502124, -79.4006540144064 43.6769366361508, -79.40074786265342 43.6769052040578, -79.4008416959652 43.6768827821947, -79.4009359160751 43.6768513592708, 
-79.4010305181557 43.6768289379187, -79.4011394914127 43.6767930269395, -79.40124844475712 43.6767616164984, -79.4013577948506 43.6767302062957, -79.4014675168886 43.6766987963088, -79.401577635645 43.6766674045606, -79.4016877469976 43.6766404773541, -79.4017982426678 43.6766135773755, -79.401909142447 43.6765821679758, -79.4020200223858 43.6765552591109, -79.4021313064015 43.6765238498247, -79.4022429725496 43.6765014420761, -79.4023546335109 43.6764745335542, -79.4024666788314 43.6764476252511, -79.4025787240519 43.6764207168379, -79.4026915556349 43.6763983006426, -79.40280397272 43.67637139232491, -79.402883149386 43.6763541860873, -79.4029167917251 43.6763489758964, -79.4030299826882 43.6763265686732, -79.4031435776805 43.6762996520191, -79.4049670168388 43.6759208047062, -79.4073338149418 43.675429007407, -79.407847991862 43.67532218132661, -79.4113608949251 43.6745959152122, -79.4121860970333 43.6744315128604, -79.4137161812489 43.6741115018232, -79.4146748682077 43.6739143093347, -79.41469317817781 43.6739104164259))
first_neighbor
[figure: the first neighbourhood rendered as a polygon]

Another example of geometry

rails =  gpd.GeoDataFrame.from_file("toronto_rails.geojson")
print(rails.loc[0,'geometry'])
MULTILINESTRING ((-79.5125670999999 43.814361500000075, -79.51253669999994 43.81447250000008, -79.512524 43.81452850000005, -79.51251939999992 43.814607, -79.51251739999992 43.814716900000064, -79.51252729999993 43.814817, -79.51254659999995 43.81492840000004, -79.51261019999993 43.815184800000054, -79.51270719999991 43.815575100000046, -79.51291559999993 43.816454400000055, -79.51339439999992 43.81834120000008, -79.51371579999994 43.819633, -79.513803 43.820002, -79.51385459999995 43.82019790000004, -79.51389059999991 43.820318, -79.51392659999993 43.82045130000006))
first_rail_points = [(-79.5125670999999,43.814361500000075),
                    (-79.51253669999994, 43.81447250000008),
                    (-79.512524, 43.81452850000005),
                    (-79.51251939999992,43.814607),
                    (-79.51251739999992,43.814716900000064),
                    (-79.51252729999993,43.814817),
                    (-79.51254659999995,43.81492840000004),
                    (-79.51261019999993,43.815184800000054),
                    (-79.51270719999991,43.815575100000046), 
                    (-79.51291559999993,43.816454400000055), 
                    (-79.51339439999992,43.81834120000008), 
                    (-79.51371579999994,43.819633), 
                    (-79.513803,43.820002), 
                    (-79.51385459999995, 43.82019790000004), 
                    (-79.51389059999991, 43.820318), 
                    (-79.51392659999993, 43.82045130000006),]
x, y = zip(* first_rail_points)
x
(-79.5125670999999,
 -79.51253669999994,
 -79.512524,
 -79.51251939999992,
 -79.51251739999992,
 -79.51252729999993,
 -79.51254659999995,
 -79.51261019999993,
 -79.51270719999991,
 -79.51291559999993,
 -79.51339439999992,
 -79.51371579999994,
 -79.513803,
 -79.51385459999995,
 -79.51389059999991,
 -79.51392659999993)
fig, axes = plt.subplots(1, 1, figsize = (6,6))

plt.plot(x,y)
plt.scatter(x, y, c='red');
[figure: the rail points plotted as a blue line with red markers on Cartesian axes]
rails.loc[0,'geometry']
[figure: the rail MULTILINESTRING rendered as a geo-object]

Note: the difference between plotting on a Cartesian coordinate system and plotting as a geo-object comes from the coordinate reference system used by the GeoDataFrame.

rails.crs
<Geographic 2D CRS: EPSG:4269>
Name: NAD83
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: North America - onshore and offshore: Canada - Alberta; British Columbia; Manitoba; New Brunswick; Newfoundland and Labrador; Northwest Territories; Nova Scotia; Nunavut; Ontario; Prince Edward Island; Quebec; Saskatchewan; Yukon. Puerto Rico. United States (USA) - Alabama; Alaska; Arizona; Arkansas; California; Colorado; Connecticut; Delaware; Florida; Georgia; Hawaii; Idaho; Illinois; Indiana; Iowa; Kansas; Kentucky; Louisiana; Maine; Maryland; Massachusetts; Michigan; Minnesota; Mississippi; Missouri; Montana; Nebraska; Nevada; New Hampshire; New Jersey; New Mexico; New York; North Carolina; North Dakota; Ohio; Oklahoma; Oregon; Pennsylvania; Rhode Island; South Carolina; South Dakota; Tennessee; Texas; Utah; Vermont; Virginia; Washington; West Virginia; Wisconsin; Wyoming. US Virgin Islands. British Virgin Islands.
- bounds: (167.65, 14.92, -40.73, 86.45)
Datum: North American Datum 1983
- Ellipsoid: GRS 1980
- Prime Meridian: Greenwich

Therefore, a GeoDataFrame can be understood as a GeoSeries (the geometry column) combined with several ordinary DataFrame Series (the attribute columns).

[figure: diagram of a GeoDataFrame as a GeoSeries plus DataFrame Series]
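A quick way to verify this with the nbrhd geodataframe we loaded above:

# The geometry column is a GeoSeries, while the attribute columns are
# ordinary pandas Series
print(type(nbrhd['geometry']))   # geopandas.geoseries.GeoSeries
print(type(nbrhd['AREA_NAME']))  # pandas.core.series.Series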

2. Simplify the geodataframe#

nbrhd.columns
Index(['AREA_ATTR_ID', 'AREA_DESC', 'AREA_ID', 'AREA_LONG_CODE', 'AREA_NAME',
       'AREA_SHORT_CODE', 'CLASSIFICATION', 'CLASSIFICATION_CODE', 'LATITUDE',
       'LONGITUDE', 'OBJECTID', 'PARENT_AREA_ID', 'Shape__Area',
       'Shape__Length', 'X', 'Y', '_id', 'geometry'],
      dtype='object')
# important columns
important_spat_cols = nbrhd.columns[[4, 5, 17]]
important_spat_cols
Index(['AREA_NAME', 'AREA_SHORT_CODE', 'geometry'], dtype='object')
# rename dictionary
colnames_spat = {important_spat_cols[0]: 'name',
           important_spat_cols[1] : 'nbrhd_spat_id',
           important_spat_cols[2] : 'geometry'}
colnames_spat
{'AREA_NAME': 'name',
 'AREA_SHORT_CODE': 'nbrhd_spat_id',
 'geometry': 'geometry'}
# Select the important columns
nbrhd_simple = nbrhd.copy()
nbrhd_simple = nbrhd_simple[important_spat_cols]
nbrhd_simple.rename(columns = colnames_spat, inplace=True)
nbrhd_simple.head()
name nbrhd_spat_id geometry
0 Casa Loma (96) 096 POLYGON ((-79.41469 43.67391, -79.41485 43.674...
1 Annex (95) 095 POLYGON ((-79.39414 43.66872, -79.39588 43.668...
2 Caledonia-Fairbank (109) 109 POLYGON ((-79.46021 43.68156, -79.46044 43.681...
3 Woodbine Corridor (64) 064 POLYGON ((-79.31485 43.66674, -79.31660 43.666...
4 Lawrence Park South (103) 103 POLYGON ((-79.41096 43.70408, -79.41165 43.703...

Convert our spatial ID from a string to a number:

  • remember: the type of the spatial ID in the health dataset is int, but here in the spatial dataset it is a string.

#create a new variable in the nbrhd_simple geodataframe and store the new number
#version of the neighbourhood id here
nbrhd_simple["Neighbid"] = nbrhd_simple["nbrhd_spat_id"].astype(int)

nbrhd_simple.head()
name nbrhd_spat_id geometry Neighbid
0 Casa Loma (96) 096 POLYGON ((-79.41469 43.67391, -79.41485 43.674... 96
1 Annex (95) 095 POLYGON ((-79.39414 43.66872, -79.39588 43.668... 95
2 Caledonia-Fairbank (109) 109 POLYGON ((-79.46021 43.68156, -79.46044 43.681... 109
3 Woodbine Corridor (64) 064 POLYGON ((-79.31485 43.66674, -79.31660 43.666... 64
4 Lawrence Park South (103) 103 POLYGON ((-79.41096 43.70408, -79.41165 43.703... 103
print(type(nbrhd_simple.loc[0,'nbrhd_spat_id']))
print(type(nbrhd_simple.loc[0,'Neighbid']))
<class 'str'>
<class 'numpy.int64'>

3. Import health information and simplify it#

fname = '1_ahd_neighb_db_ast_hbp_mhv_copd_2007.xls' #file name
sname = '1_ahd_neighb_asthma_2007' #sheet name in excel file

#store excel sheet with asthma data in a dataframe variable
asthma_neighb = pd.read_excel(fname, sheet_name = sname, header = 11)
asthma_neighb.head(2)
Unnamed: 0 Unnamed: 1 Demographics ª % With asthma LL (95% CI) UL (95% CI) Demographics ª.1 % With asthma.1 LL (95% CI) .1 UL (95% CI) .1 ... Demographics ª.7 % With asthma.7 LL (95% CI) .7 UL (95% CI) .7 Demographics ª.8 % With asthma.8 LL (95% CI) .8 UL (95% CI) .8 Rate Ratio**.2 H/L/NS.2
0 1 West Humber-Clairville 11977 9.7 9.2 10.2 11770 10.6 10.0 11.2 ... 4435 12.3 11.3 13.3 8921 10.9 10.3 11.6 1.07 H
1 2 Mount Olive-Silverstone-Jamestown 11157 7.9 7.4 8.4 11462 10.2 9.6 10.8 ... 3819 13.5 12.5 14.6 7726 11.2 10.5 11.9 1.10 H

2 rows × 44 columns

asthma_neighb.columns
Index(['Unnamed: 0', 'Unnamed: 1', 'Demographics ª', '% With asthma',
       'LL (95% CI) ', 'UL (95% CI) ', 'Demographics ª.1', '% With asthma.1',
       'LL (95% CI) .1', 'UL (95% CI) .1', 'Demographics ª.2',
       '% With asthma.2', 'LL (95% CI) .2', 'UL (95% CI) .2', 'Rate Ratio**',
       'H/L/NS', 'Demographics ª.3', '% With asthma.3', 'LL (95% CI) .3',
       'UL (95% CI) .3', 'Demographics ª.4', '% With asthma.4',
       'LL (95% CI) .4', 'UL (95% CI) .4', 'Demographics ª.5',
       '% With asthma.5', 'LL (95% CI) .5', 'UL (95% CI) .5', 'Rate Ratio**.1',
       'H/L/NS.1', 'Demographics ª.6', '% With asthma.6', 'LL (95% CI) .6',
       'UL (95% CI) .6', 'Demographics ª.7', '% With asthma.7',
       'LL (95% CI) .7', 'UL (95% CI) .7', 'Demographics ª.8',
       '% With asthma.8', 'LL (95% CI) .8', 'UL (95% CI) .8', 'Rate Ratio**.2',
       'H/L/NS.2'],
      dtype='object')
# important columns
## Note: instead of copying the column names, we can select by position to avoid any miscopies.
important_cols = asthma_neighb.columns[[0, 1, 10, 11]] 

# rename dictionary
colnames = {important_cols[0] : 'Neighbid', 
            important_cols[1] : 'name', 
            important_cols[2] : 'adult_pop',
            important_cols[3] : 'asthma_pct'}
# Select important columns
asthma_rates = asthma_neighb.copy()

asthma_rates = asthma_rates[important_cols]

asthma_rates.rename(columns = colnames, inplace=True)

asthma_rates.head()
Neighbid name adult_pop asthma_pct
0 1 West Humber-Clairville 23747 10.2
1 2 Mount Olive-Silverstone-Jamestown 22619 9.1
2 3 Thistletown-Beaumond Heights 6629 10.0
3 4 Rexdale-Kipling 6755 11.4
4 5 Elms-Old Rexdale 6588 11.6

4. Join our neighbourhood gdf with our asthma rate df:#

Take a look at both datasets

asthma_rates.head()
Neighbid name adult_pop asthma_pct
0 1 West Humber-Clairville 23747 10.2
1 2 Mount Olive-Silverstone-Jamestown 22619 9.1
2 3 Thistletown-Beaumond Heights 6629 10.0
3 4 Rexdale-Kipling 6755 11.4
4 5 Elms-Old Rexdale 6588 11.6

For the spatial data, we sort by neighbourhood ID to check consistency:

nbrhd_simple.sort_values('Neighbid').head()
name nbrhd_spat_id geometry Neighbid
30 West Humber-Clairville (1) 001 POLYGON ((-79.55236 43.70947, -79.55238 43.709... 1
40 Mount Olive-Silverstone-Jamestown (2) 002 POLYGON ((-79.60338 43.75786, -79.60205 43.758... 2
124 Thistletown-Beaumond Heights (3) 003 POLYGON ((-79.57751 43.73384, -79.57806 43.734... 3
122 Rexdale-Kipling (4) 004 POLYGON ((-79.55512 43.71510, -79.55504 43.714... 4
22 Elms-Old Rexdale (5) 005 POLYGON ((-79.55512 43.71510, -79.55569 43.716... 5
nbrhd_simple = nbrhd_simple.merge(asthma_rates, on="Neighbid")
nbrhd_simple.head()
name_x nbrhd_spat_id geometry Neighbid name_y adult_pop asthma_pct
0 Casa Loma (96) 096 POLYGON ((-79.41469 43.67391, -79.41485 43.674... 96 Casa Loma 6640 10.1
1 Annex (95) 095 POLYGON ((-79.39414 43.66872, -79.39588 43.668... 95 Annex 20869 9.5
2 Caledonia-Fairbank (109) 109 POLYGON ((-79.46021 43.68156, -79.46044 43.681... 109 Caledonia-Fairbank 6615 11.3
3 Woodbine Corridor (64) 064 POLYGON ((-79.31485 43.66674, -79.31660 43.666... 64 Woodbine Corridor 8346 13.0
4 Lawrence Park South (103) 103 POLYGON ((-79.41096 43.70408, -79.41165 43.703... 103 Lawrence Park South 9709 10.7

5. Map the attribute data (asthma)#

We need to first set the “stage” of the figure.

Then we plot:

  • the asthma rate (column = “asthma_pct”),

  • in 5 quantiles (scheme = “quantiles”, k=5),

  • in the stage (ax = axes),

  • with legend (legend = True).

# First, let's create a figure variable "fig" and an axes variable called "axes".
# This code sets the stage - it says how many images to draw, by rows/columns.
# We can also set other figure attributes here, like figure size using figsize.

#import matplotlib.pyplot as plt
fig, axes = plt.subplots(1, 1, figsize = (12,12))

# Then plot the data with certain arguments:
# These two statements need to be in the same cell, i.e., run together
nbrhd_simple.plot(column='asthma_pct', scheme='quantiles', 
                  k=5, cmap='YlGn', edgecolor='grey', 
                  ax = axes, legend=True, 
                  legend_kwds={'loc': 4, 'title': 'Percent Asthmatic', 
                               'title_fontsize': 16,'fontsize': 12})
<AxesSubplot:>
[figure: choropleth map of asthma rates in 5 quantiles with a legend]

Make a figure with two maps side by side


fig, axes = plt.subplots(1, 2, figsize = (12,12))

# pick the first cell in the figure
nbrhd_simple.plot(column = "name_x",
           ax=axes[0], cmap = "tab20")

# pick the second cell in the figure
nbrhd_simple.plot(column = "asthma_pct",
           ax=axes[1], cmap = "YlGn")
<AxesSubplot:>
[figure: two maps side by side - neighbourhood names (left) and asthma rates (right)]

POP QUIZ!!#

  1. How would we make a figure with 3 maps in 1 row?

  2. How would we make a figure with 3 maps in 1 column?

  3. How would we make a figure with 6 maps in 3 rows and 2 columns?

# 3 maps in 1 row

fig, axes = plt.subplots(1, 3, figsize = (12,12))

nbrhd_simple.plot(ax=axes[0])
nbrhd_simple.plot(ax=axes[1])
nbrhd_simple.plot(ax=axes[2])
<AxesSubplot:>
[figure: 3 maps in 1 row]
# 3 maps in 1 column

fig, axes = plt.subplots(3, 1, figsize = (12,12))

nbrhd_simple.plot(ax=axes[0])
nbrhd_simple.plot(ax=axes[1])
nbrhd_simple.plot(ax=axes[2])
<AxesSubplot:>
[figure: 3 maps in 1 column]
# 6 maps in 3 rows and 2 columns

fig, axes = plt.subplots(3, 2, figsize = (12,12))

nbrhd_simple.plot(ax=axes[0][0])
nbrhd_simple.plot(ax=axes[0][1])
nbrhd_simple.plot(ax=axes[1][0])
nbrhd_simple.plot(ax=axes[1][1])
nbrhd_simple.plot(ax=axes[2][0])
nbrhd_simple.plot(ax=axes[2][1])
<AxesSubplot:>
[figure: 6 maps in 3 rows and 2 columns]

More on Cartography#

To this point we have used quantiles to classify continuous values in maps.#

What other common options do we have? See the documentation for GeoDataFrame.plot and mapclassify#

  1. Equal Intervals

  • This classification method “divides the data into equal size classes (e.g., 0-10, 10-20, 20-30, etc.) and works best on data that is generally spread across the entire range. CAUTION: Avoid equal interval if your data are skewed to one end or if you have one or two really large outlier values.”

  2. Natural Breaks

  • This classification method “is a kind of ‘optimal’ classification scheme that finds class breaks that will minimize within-class variance and maximize between-class differences.” However, these breaks are unique to each dataset, which makes it difficult to compare across different maps.
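We can also compute the break points of each scheme directly with mapclassify; here is a small sketch using the asthma rates we merged earlier:

# Compare the class breaks each scheme produces for the asthma rates
q = mapclassify.Quantiles(nbrhd_simple['asthma_pct'], k=4)
ei = mapclassify.EqualInterval(nbrhd_simple['asthma_pct'], k=4)
nb = mapclassify.NaturalBreaks(nbrhd_simple['asthma_pct'], k=4)
print(q.bins)   # upper bounds of the quartile classes
print(ei.bins)  # equal-width class bounds
print(nb.bins)  # Jenks "natural" class bounds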

Before plotting the variable, we usually want to look at its distribution to see which classification scheme fits.

nbrhd_simple.asthma_pct.hist(bins=8)
<AxesSubplot:>
[figure: histogram of asthma_pct with 8 bins]

The distribution is close to normal, which means that equal intervals should be fine.

Let’s use the scheme parameter to plot our asthma percentages using quartiles, equal intervals, and natural breaks.

fig, axes = plt.subplots(1,3, figsize =  (20,10))

nbrhd_simple.plot(column='asthma_pct', scheme='Quantiles', 
                  k=4, cmap='YlOrRd', edgecolor='black', 
                  ax = axes[0], legend=True,                  
                  legend_kwds={'loc': 4, 'title': 'Percent Asthmatic', 
                               'title_fontsize': 8,'fontsize': 8})

nbrhd_simple.plot(column='asthma_pct', scheme='EqualInterval', 
                  k=4, cmap='YlOrRd', edgecolor='black', 
                  ax = axes[1], legend=True,                  
                  legend_kwds={'loc': 4, 'title': 'Percent Asthmatic', 
                               'title_fontsize': 8,'fontsize': 8})

nbrhd_simple.plot(column='asthma_pct', scheme='NaturalBreaks', 
                  k=4, cmap='YlOrRd', edgecolor='black', 
                  ax = axes[2], legend=True,                  
                  legend_kwds={'loc': 4, 'title': 'Percent Asthmatic', 
                               'title_fontsize': 8,'fontsize': 8})

axes[0].set_title("Quartiles", fontsize = 20)
axes[1].set_title("Equal Intervals", fontsize = 20)
axes[2].set_title("Natural Breaks", fontsize = 20)

current_fig = plt.gcf() #get current figure aka gcf
current_fig.savefig('asthma_three_class.png',dpi=100) #export maps as an image file!
[figure: asthma percentage mapped with quartiles, equal intervals, and natural breaks]

Another way to show the figure: save it and import it into your presentation! [figure: the exported asthma_three_class.png]

Adding a background#

What if we want to add a background to our maps? It’s pretty easy!

We will use the contextily library to do this.

But first we need to double check that our coordinate reference system is properly set. To layer our maps on top of a web background map, we need to make sure the projection is set to something known as “Web Mercator”: https://epsg.io/3857

That is, make our spatial data have the same coordinate reference system as the web map.

#check current crs: WGS 84 (EPSG:4326)
nbrhd_simple.crs
<Geographic 2D CRS: EPSG:4326>
Name: WGS 84
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: World.
- bounds: (-180.0, -90.0, 180.0, 90.0)
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich

Let’s reproject the map into Web Mercator and compare to the original projection:#

nbrhd_webmap = nbrhd_simple.to_crs("EPSG:3857")

#let's throw in a polar projection just for fun! 
## https://www.esri.com/arcgis-blog/products/imagery/imagery/two-views-from-the-top-of-the-world/
nbrhd_polar = nbrhd_simple.to_crs("EPSG:3995")

fig, axes = plt.subplots(1,3, figsize =  (20,10))

#original
nbrhd_simple.plot(ax = axes[0])
#web mercator
nbrhd_webmap.plot(ax = axes[1])
#polar projection
nbrhd_polar.plot(ax = axes[2])

axes[0].set_title("WGS 84", fontsize = 20)
axes[1].set_title("Web Mercator", fontsize = 20)
axes[2].set_title("Polar", fontsize = 20)
Text(0.5, 1.0, 'Polar')
[figure: the neighbourhoods plotted in WGS 84, Web Mercator, and polar projections]

Note: The coordinate reference system will change how the geometry is plotted in a 2D map.

To add the web basemap, we can do the following#

Remember, we have imported the contextily package as cx.

fig, axes = plt.subplots(1,1, figsize = (10,10))

#alpha parameter makes the map transparent
nbrhd_webmap.plot(column = "asthma_pct", scheme = "quantiles",
                  k = 4, cmap = "Reds", ax = axes,
                  alpha=0.5, edgecolor='k')

cx.add_basemap(axes)
[figure: asthma quantile map layered over a web basemap]

The contextily package also provides different map sources.

For example,

Under the CartoDB provider:
- Positron
- PositronNoLabels 
- PositronOnlyLabels 
- DarkMatter 
- DarkMatterNoLabels 
- DarkMatterOnlyLabels 
- Voyager 
- VoyagerNoLabels
- VoyagerOnlyLabels
- VoyagerLabelsUnder
fig, axes = plt.subplots(1,1, figsize = (10,10))

#alpha parameter makes the map transparent
nbrhd_webmap.plot(column = "asthma_pct", scheme = "quantiles",
                  k = 4, cmap = "Reds", ax = axes,
                  alpha=0.5, edgecolor='k')

cx.add_basemap(axes,
              source=cx.providers.CartoDB.DarkMatter)
[figure: asthma quantile map over the CartoDB DarkMatter basemap]

What else? We can visualize the data analysis results together!#

Recall that we’ve conducted a linear regression on the relationship between asthma and immigration.

We can plot the data description, spatial distribution, and linear regression together.

1. Load in the immigration data#

fname = '1_socdem_neighb_2006-2.xls'
sname = 'socdem_2006'
socdem_neighb = pd.read_excel(fname, sheet_name = sname, header = 10)
socdem_neighb.head()
Neighbourhood id Neighbourhood Name Total Population % 19 and under % 65 and over % Living alone % Aged 65+ living alone % Lone parent families § % 1 year mobility in population Median household income before-tax $ ... % Less than high school education ** % With a University degree ** % No knowledge of English/French % Recent immigrants-within 5 years % Recent immigrants-within 10 years % Immigrants % Visible minority Top 3 Home Languages for non-English speaking households, #1 Top 3 Home Languages for non-English speaking households, #2 Top 3 Home Languages for non-English speaking households, #3
0 1 West Humber-Clairville 32252 26.8 11.3 3.7 11.7 26.3 11.6 63413 ... 18.4 22.9 5.4 12.2 22.1 60.0 77.9 Panjabi (Punjabi) Gujarati Spanish
1 2 Mount Olive-Silverstone-Jamestown 32127 32.2 8.1 4.1 15.4 31.5 15.7 48146 ... 20.1 23.6 7.6 22.2 35.1 67.1 85.3 Panjabi (Punjabi) Gujarati Arabic
2 3 Thistletown-Beaumond Heights 9928 25.3 16.7 5.4 17.5 30.1 10.9 55032 ... 16.7 20.7 5.0 9.7 19.3 55.9 53.8 Panjabi (Punjabi) Italian Spanish
3 4 Rexdale-Kipling 10725 23.6 18.3 10.2 28.8 33.3 13.2 52428 ... 15.3 18.8 3.0 8.3 14.0 47.2 41.8 Spanish Panjabi (Punjabi) Croatian
4 5 Elms-Old Rexdale 9437 29.9 10.5 5.8 13.0 39.1 9.7 53779 ... 15.2 18.8 3.6 9.8 18.2 53.8 60.5 Spanish Italian Urdu

5 rows × 28 columns

socdem_neighb.columns
Index(['Neighbourhood id', 'Neighbourhood Name', 'Total Population',
       '% 19 and under', '% 65 and over', '% Living alone',
       '% Aged 65+ living alone', '% Lone parent families §',
       '% 1 year mobility in population',
       'Median household income before-tax $',
       'Median household income after-tax $ ‡',
       '% Families-Incidence of low income before-tax',
       '% Families-Incidence of low income after-tax ‡',
       '% Individuals-Incidence of low income before-tax',
       '% Individuals-Incidence of low income after-tax ‡',
       '% Rented Dwellings', '% Unemployment rate *',
       '% Not in labour force *', '% Less than high school education **',
       '% With a University degree **', '% No knowledge of English/French',
       '% Recent immigrants-within 5 years',
       '% Recent immigrants-within 10 years', '% Immigrants',
       '% Visible minority',
       'Top 3 Home Languages for non-English speaking households, #1',
       'Top 3 Home Languages for non-English speaking households, #2',
       'Top 3 Home Languages for non-English speaking households, #3'],
      dtype='object')
# Important columns
important_cols = socdem_neighb.columns[[0, 1, 10, 21]]

rename_dic = {
    important_cols[0]: 'Neighbid',
    important_cols[1]: 'name',
    important_cols[2]: 'median_income',
    important_cols[3]: 'immigration5'
}
rename_dic
{'Neighbourhood id': 'Neighbid',
 'Neighbourhood Name': 'name',
 'Median household income after-tax $ ‡': 'median_income',
 '% Recent immigrants-within 5 years': 'immigration5'}
# Select important columns
socdem_neighb = socdem_neighb.copy()
socdem_neighb = socdem_neighb[important_cols]
socdem_neighb.rename(columns = rename_dic, inplace = True)
socdem_neighb.head()
Neighbid name median_income immigration5
0 1 West Humber-Clairville 56220 12.2
1 2 Mount Olive-Silverstone-Jamestown 43975 22.2
2 3 Thistletown-Beaumond Heights 49800 9.7
3 4 Rexdale-Kipling 46033 8.3
4 5 Elms-Old Rexdale 48055 9.8

2. Merge the datasets#

Merge socdem_neighb into nbrhd_webmap, because we want to map it on a basemap.

nbrhd_webmap = nbrhd_webmap.merge(socdem_neighb, on='Neighbid')

nbrhd_webmap.head()
name_x nbrhd_spat_id geometry Neighbid name_y adult_pop asthma_pct name median_income immigration5
0 Casa Loma (96) 096 POLYGON ((-8840403.206 5415116.967, -8840420.5... 96 Casa Loma 6640 10.1 Casa Loma 59151 4.9
1 Annex (95) 095 POLYGON ((-8838115.394 5414318.191, -8838308.4... 95 Annex 20869 9.5 Annex 46012 4.7
2 Caledonia-Fairbank (109) 109 POLYGON ((-8845470.121 5416294.143, -8845495.7... 109 Caledonia-Fairbank 6615 11.3 Caledonia-Fairbank 46712 7.6
3 Woodbine Corridor (64) 064 POLYGON ((-8829288.812 5414013.407, -8829483.7... 64 Woodbine Corridor 8346 13.0 Woodbine Corridor 49287 4.5
4 Lawrence Park South (103) 103 POLYGON ((-8839987.390 5419761.937, -8840063.9... 103 Lawrence Park South 9709 10.7 Lawrence Park South 99440 2.5

3. Conduct linear regression#

from statsmodels.formula.api import ols
reg_asthmaimm = ols('asthma_pct ~ immigration5', data = nbrhd_webmap) # setup the model
reg_asthmaimm_fit = reg_asthmaimm.fit()
reg_asthmaimm_sum = reg_asthmaimm_fit.summary()
reg_asthmaimm_sum.tables[1]
coef std err t P>|t| [0.025 0.975]
Intercept 11.8266 0.218 54.328 0.000 11.396 12.257
immigration5 -0.1597 0.019 -8.475 0.000 -0.197 -0.122

4. Visualize the results#

The regression shows a significant negative coefficient on the recent-immigration percentage, indicating a significant linear relationship.

Therefore, we can visualize these two variables as a scatter plot with a fitted regression line.

What else do we want?

  • data description of the dependent variable

  • maps of both variables

Then we can plot the four figures in a 2×2 grid.

fig, axes = plt.subplots(2,2, figsize = (15,15))

# Description of the dependent variable
nbrhd_simple.asthma_pct.hist(bins=8, ax=axes[0][0])

# Regression
import seaborn as sns
sns.regplot(y = 'asthma_pct', x = 'immigration5', data = nbrhd_webmap, ci = None, ax=axes[0][1])


# Map of the dependent variable
nbrhd_webmap.plot(column = "asthma_pct", scheme = "quantiles",
                  k = 4, cmap = "Reds", ax = axes[1][0],
                  alpha=0.5, edgecolor='k',legend = True,
                 legend_kwds={'loc': 4, 'title': 'Percent Asthmatic', 
                               'title_fontsize': 12,'fontsize': 12})
cx.add_basemap(axes[1][0])

# Map of the independent variable
nbrhd_webmap.plot(column = "immigration5", scheme = "quantiles",
                  k = 4, cmap = "Blues", ax = axes[1][1],
                  alpha=0.5, edgecolor='k',legend = True,
                 legend_kwds={'loc': 4, 'title': 'Immigration rate', 
                               'title_fontsize': 12,'fontsize': 12})
cx.add_basemap(axes[1][1])

# Make some adjustment
axes[0][0].set_title("Historgram of asthma percentage", fontsize = 20)
axes[0][0].xaxis.set_label_text('Asthma percentage %')
axes[0][0].xaxis.get_label().set_fontsize(12)
axes[0][0].yaxis.set_label_text('Proportion %')
axes[0][0].yaxis.get_label().set_fontsize(12)

axes[0][1].set_title("Regression of asthma percentage on immigration rate", fontsize = 20)
axes[0][1].xaxis.set_label_text('Immigration rate recent 5 years %')
axes[0][1].xaxis.get_label().set_fontsize(12)
axes[0][1].yaxis.set_label_text('Asthma percentage %')
axes[0][1].yaxis.get_label().set_fontsize(12)

axes[1][0].set_title("Asthma percentage in Toronto", fontsize = 20)
axes[1][1].set_title("Immigration rate recent 5 years in Toronto", fontsize = 20)


current_fig = plt.gcf() 
current_fig.savefig('results.png',dpi=100) #export maps as an image file!
[figure: 2×2 panel - histogram, regression plot, and maps of asthma and immigration]

[figure: the exported results.png, re-imported for display]

Introduction to clustering#

What is clustering?#

Clustering, generally, is the act of grouping similar things with each other.

We can do this with non-spatial data as well as spatial data.

One common clustering method is k-means, which we won’t cover in this class, but if you’re interested, there’s a nice tutorial on the method here:

https://towardsdatascience.com/machine-learning-algorithms-part-9-k-means-example-in-python-f2ad05ed5203

Why do we want to identify clusters?#

Because clusters identify groups of data with similar attribute values, we can use them to begin to understand what drives the grouping.

For example, we may have data on individuals’ income outcomes and education levels (e.g., total number of years in schools, colleges, and universities).

If we perform a cluster analysis, we can identify discrete groups which may be easier to work with when we want to do further analysis.

These groups may be something like:

  • cluster 1: high income, more education

  • cluster 2: high income, less education

  • cluster 3: low income, less education

  • cluster 4: low income, more education
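For a quick taste of what this looks like in code, here is a toy k-means sketch on made-up income and education data (this assumes scikit-learn is installed, and the numbers are purely illustrative):

# A toy k-means sketch on synthetic data (assumes scikit-learn is installed)
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
income = rng.normal(60000, 20000, size=200)  # hypothetical incomes
education = rng.normal(14, 3, size=200)      # hypothetical years of schooling
X = np.column_stack([income, education])

# Ask for 4 clusters, mirroring the four groups above
# (in practice we would standardize the columns first, since income
# dominates the scale here)
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
print(km.labels_[:10])  # cluster assignment of the first 10 individuals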

From there, we can look at other attributes of those clusters.

This approach is commonly used in demographic analysis where many attributes are considered to create discrete clusters.

A good example of this used in practice is Environics Analytics PRIZM Segmentation:

https://environicsanalytics.com/en-ca/data/segmentation

[figure: examples of different PRIZM segments]

You can look up what ‘demographic segment’ or cluster your postal code belongs to here!

Spatial clustering, at a high level, is doing the same thing but one attribute to consider is geography.#

Spatial clustering involves looking at:

  • how some attribute is expressed across space;

  • whether similar attribute values are near each other.

If we want to do ‘spatial clustering’ of a variable, what two pieces of information do we need? (don’t overthink it!!!)#

Autocorrelation#

Autocorrelation literally means ‘self-correlation’.

So instead of looking at how two attributes relate to each other (e.g., regression), we want to explore how a single attribute relates to itself across neighbouring observations.

Typically, we’re talking about temporal autocorrelation or spatial autocorrelation.

Spatial Autocorrelation#

How does a variable value in a specific location correlate with the variable values in its neighbours?

  • regress a value on the values of its neighbours!

So again, what information do we need to answer this question?#

1) For each observation (in this case, neighborhood) we need to know their neighbours!#

To do this we can create something known as a ‘weights matrix’.

A weights matrix is a matrix that describes whether or not any one observation is ‘neighbours’ with another.

There are many ways we can do this, but let’s focus on two common ones: the queen contiguity matrix and the rook contiguity matrix:

[figure: queen vs. rook contiguity illustrated on a grid of cells]
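We will walk through this code properly next week, but as a preview, here is a sketch of how contiguity weights can be built with the libpysal package we imported as lps, using our nbrhd_simple geodataframe:

# Preview sketch: contiguity-based spatial weights with libpysal
queen_w = lps.weights.Queen.from_dataframe(nbrhd_simple)  # neighbours share an edge or a corner
rook_w = lps.weights.Rook.from_dataframe(nbrhd_simple)    # neighbours share an edge only
print(queen_w.neighbors[0])  # the queen neighbours of neighbourhood 0
print(rook_w.neighbors[0])   # the rook neighbours of neighbourhood 0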

2) Next we need to describe the value of our attribute of interest in neighbours#

To do this we create a variable known as ‘spatial lag’.

Spatial lag for neighbourhood i is often just the average of some attribute value across all of neighbourhood i’s neighbours (as determined by our weights matrix!).
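Continuing the preview, once we have weights, the spatial lag is a one-liner (a sketch using the queen weights from above):

# Preview sketch: spatial lag of the asthma rate
queen_w.transform = 'r'  # row-standardize so each neighbourhood's weights sum to 1
asthma_lag = lps.weights.lag_spatial(queen_w, nbrhd_simple['asthma_pct'])
print(asthma_lag[:5])    # neighbour-average asthma rate for the first five neighbourhoods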

3) Finally, now we can see how a neighbourhood’s attribute compares to its neighbours.#

This setup allows us to see if neighbourhoods with high attribute values are next to other neighbourhoods with high values.

We can then use a number of statistical tests to determine if there are ‘significant’ clusters in our dataset.
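One common test is Moran’s I, available in the esda package we imported earlier; a preview sketch:

# Preview sketch: global Moran's I for spatial autocorrelation
mi = esda.Moran(nbrhd_simple['asthma_pct'], queen_w)
print(mi.I)      # Moran's I (positive = similar values cluster in space)
print(mi.p_sim)  # pseudo p-value from random permutations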

We will walk through this code in detail next week.#