{
"cells": [
{
"cell_type": "markdown",
"id": "41ce46ff",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": "slide"
},
"tags": []
},
"source": [
"# Welcome to GGR274\n",
"\n",
"- What is Data Science?\n",
"- What is the role of Statistics in Data Science?"
]
},
{
"cell_type": "markdown",
"id": "320af18d",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Where is Data Science Used in Society?\n",
"\n",
"### Questions\n",
"\n",
"- Is there a Bike Share bike available?\n",
"\n",
"- Will I experience another TTC delay today?\n",
"\n",
"- Will this person commit a crime in the future?\n",
"\n",
"- Does COVID-19 affect males more than females?\n",
"\n",
"- What movies might this person enjoy?\n",
"\n",
"- How many people attended Indian Day Schools in Canada?"
]
},
{
"cell_type": "markdown",
"id": "c5abf094",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Bike Share Toronto\n",
"\n",
"**Is there a bike available at St. George St. / Hoskin Ave.?**"
]
},
{
"cell_type": "markdown",
"id": "19a6a68c",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Visualization using maps\n",
"\n",
"**Don't worry about understanding the code today. It's complicated, but we will learn it as the course goes on.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e77fafc5",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": "slide"
},
"tags": []
},
"outputs": [],
"source": [
"import folium\n",
"\n",
"m = folium.Map(location=[43.664288100559816, -79.39800603825044], zoom_start=18)\n",
"m"
]
},
{
"cell_type": "markdown",
"id": "403f6401",
"metadata": {
"jp-MarkdownHeadingCollapsed": true,
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Bike Share data is ugly\n",
"\n",
"But, it can be made beautiful with a little bit of programming ..."
]
},
{
"cell_type": "markdown",
"id": "9578440c",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"## Step 1\n",
"\n",
"- Read the data into Python."
]
},
{
"cell_type": "markdown",
"id": "dab7bb35",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"```\n",
"{\"last_updated\":1637611674,\"ttl\":30,\"data\":{\"stations\":[{\"station_id\":\"7000\",\"name\":\"Fort York Blvd / Capreol Ct\",\"physical_configuration\":\"REGULAR\",\"lat\":43.639832,\"lon\":-79.395954,\"altitude\":0.0,\"address\":\"Fort York Blvd / Capreol Ct\",\"capacity\":35,\"rental_methods\":[\"KEY\",\"TRANSITCARD\",\"CREDITCARD\",\"PHONE\"],\"groups\":[],\"obcn\":\"647-643-9607\",\"nearby_distance\":500.0},{\"station_id\":\"7001\",\"name\":\"Wellesley Station Green P\",\"physical_configuration\":\"REGULAR\",\"lat\":43.66496415990742,\"lon\":-79.38355031526893,\"altitude\":0.0,\"address\":\"Yonge / Wellesley\",\"post_code\":\"M4Y 1G7\",\"capacity\":17,\"rental_methods\":[\"KEY\",\"TRANSITCARD\",\"CREDITCARD\",\"PHONE\"],\"groups\":[],\"obcn\":\"416-617-9576\",\"nearby_distance\":500.0},\n",
"\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "2a2bb792",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"🙀 Don't worry about understanding all the details for now ... "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f861c4d0",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"station_url = 'https://tor.publicbikesystem.net/ube/gbfs/v1/en/station_information'\n",
"stationinfo = pd.read_json(station_url)\n",
"\n",
"stationlist = stationinfo['data'].iloc[0]"
]
},
{
"cell_type": "markdown",
"id": "451b4b3a",
"metadata": {},
"source": [
"The first station data looks a bit better:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6698ad6b",
"metadata": {},
"outputs": [],
"source": [
"stationlist[0]"
]
},
{
"cell_type": "markdown",
"id": "bd4c91c5",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Step 2\n",
"\n",
"- A bit of programming is needed to **transform the data** into a format that can be displayed on the map above."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d885beb3",
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "-"
}
},
"outputs": [],
"source": [
"# Convert the station records (JSON) to GeoJSON features\n",
"def featuredict(lon, lat, id1, name, capacity):\n",
"    # Build one GeoJSON Point feature for a single station\n",
"    feature = {\n",
"        \"type\": \"Feature\",\n",
"        \"geometry\": {\n",
"            \"type\": \"Point\",\n",
"            \"coordinates\": [lon, lat]\n",
"        },\n",
"        \"properties\": {\n",
"            \"station_id\": id1,\n",
"            \"name\": name,\n",
"            \"capacity\": capacity\n",
"        }\n",
"    }\n",
"    return feature\n",
"\n",
"\n",
"def create_station_list(station_records):\n",
"    # Build a list with one GeoJSON feature per station record\n",
"    features = []\n",
"    for station in station_records:\n",
"        features.append(\n",
"            featuredict(\n",
"                station['lon'],\n",
"                station['lat'],\n",
"                station['station_id'],\n",
"                station['name'],\n",
"                station['capacity']\n",
"            )\n",
"        )\n",
"    return features\n",
"\n",
"stations = create_station_list(stationlist)\n",
"stations[0:2]"
]
},
{
"cell_type": "markdown",
"id": "05cc9d7e",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Step 2 (Cont'd)\n",
"\n",
"- More data transformation ..."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ebc8212c",
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [],
"source": [
"# Wrap the list of station features in a GeoJSON FeatureCollection\n",
"\n",
"def add_GeoJSON_formatting(features):\n",
"    collection = {\n",
"        \"type\": \"FeatureCollection\",\n",
"        \"features\": features\n",
"    }\n",
"    return collection\n",
"\n",
"\n",
"stations_geoj = add_GeoJSON_formatting(stations)\n",
"stations_geoj['features'][0]"
]
},
{
"cell_type": "markdown",
"id": "0b0899ce",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## A Data Science Application\n",
"\n",
"**Question:** How many bikes are available?\n",
"\n",
"**Data:** TO Bike Share data\n",
"\n",
"**Communicate the results:** Visualize the data on a map showing where the bikes are located\n",
"\n",
"**Next Steps:** Predict how many bikes are available at hourly intervals"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3f943b82",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"# Add the stations to the map; hovering over a station shows its details\n",
"folium.GeoJson(\n",
"    stations_geoj,\n",
"    name=\"Bikes\",\n",
"    tooltip=folium.features.GeoJsonTooltip(fields=['station_id', 'name', 'capacity'], localize=True)\n",
").add_to(m)\n",
"m"
]
},
{
"cell_type": "markdown",
"id": "71c5eb78",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## TTC Subway Delays"
]
},
{
"cell_type": "markdown",
"id": "5c1219f2-51b7-4938-a193-530371d7ecf5",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"source": [
"## Steps to Read the Public Data\n",
"\n",
"1. Extract data file locations."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "13b38d3b-b4cf-4ea6-ab65-f0a1f0ae3c41",
"metadata": {
"editable": true,
"scrolled": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"outputs": [],
"source": [
"import requests\n",
"\n",
"# Toronto Open Data is stored in a CKAN instance. Its APIs are documented here:\n",
"# https://docs.ckan.org/en/latest/api/\n",
"\n",
"# To hit the API, you'll be making requests to:\n",
"base_url = \"https://ckan0.cf.opendata.inter.prod-toronto.ca\"\n",
"\n",
"# Datasets are called \"packages\". Each package can contain many \"resources\".\n",
"# To retrieve the metadata for this package and its resources, use the package name in the dataset page's URL:\n",
"url = base_url + \"/api/3/action/package_show\"\n",
"params = {\"id\": \"ttc-subway-delay-data\"}\n",
"package = requests.get(url, params=params).json()\n",
"xlsx_list = []\n",
"\n",
"# To get resource data:\n",
"for resource in package[\"result\"][\"resources\"]:\n",
"\n",
"    # To get metadata for non datastore_active resources:\n",
"    if not resource[\"datastore_active\"]:\n",
"        url = base_url + \"/api/3/action/resource_show?id=\" + resource[\"id\"]\n",
"        resource_metadata = requests.get(url).json()['result']\n",
"        # From here, you can use the \"url\" attribute to download this file\n",
"        print(resource_metadata['url'])\n",
"        xlsx_list.append(resource_metadata['url'])"
]
},
{
"cell_type": "markdown",
"id": "e580eb9c-c90d-47f0-9282-fcf3907d8fe3",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"source": [
"2. Read the files of interest, say files for 2022 and 2023."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1696ce1d-5074-45a7-9e32-1a484493fb1e",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"outputs": [],
"source": [
"import re\n",
"import pandas as pd\n",
"\n",
"# Keep only the file URLs that mention 2022 or 2023\n",
"p = re.compile(\".+(2022|2023)\")\n",
"df_list = []\n",
"\n",
"for x in xlsx_list:\n",
"    if p.match(x):\n",
"        print(x)\n",
"        df_list.append(pd.read_excel(x))\n",
"\n",
"# Stack the yearly tables into a single DataFrame\n",
"ttc_delays = pd.concat(df_list)\n",
"ttc_delays"
]
},
{
"cell_type": "markdown",
"id": "f6b2fddc",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": "slide"
},
"tags": []
},
"source": [
"## What can we learn from the data?\n",
"\n",
"Plotting (with a bit of manipulation) usually helps us see patterns that we might not see otherwise."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fa2b0fc2-cf86-4341-80cb-86518c27d235",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"outputs": [],
"source": [
"import seaborn as sns\n",
"\n",
"# Total delay minutes per station, keeping the 15 stations with the most\n",
"ttc_delays_stn = ttc_delays.groupby(\n",
"    \"Station\"  # station names need cleaning...\n",
").agg(\n",
"    {\"Min Delay\": \"sum\"}\n",
").reset_index().sort_values(\n",
"    \"Min Delay\", ascending=False\n",
").head(15)\n",
"# print(ttc_delays_stn)\n",
"g = sns.barplot(\n",
"    data=ttc_delays_stn,\n",
"    x=\"Min Delay\",\n",
"    y=\"Station\"\n",
")\n",
"g.set(xlabel=\"Total Length of Delays (min)\", ylabel=\"Top 15 Stations\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "59140615",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"outputs": [],
"source": [
"# Extract the year and month of each delay, then count delays per month\n",
"ttc_delays[\"Year\"] = ttc_delays[\"Date\"].apply(lambda x: x.year)\n",
"ttc_delays[\"Month\"] = ttc_delays[\"Date\"].apply(lambda x: x.month)\n",
"ttc_delays_ymd = ttc_delays.groupby(\n",
"    [\"Year\", \"Month\"]\n",
").size().reset_index(name=\"Number of Delays\")\n",
"\n",
"g = sns.lineplot(\n",
"    data=ttc_delays_ymd,\n",
"    x=\"Month\",\n",
"    y=\"Number of Delays\",\n",
"    hue=\"Year\"\n",
")\n",
"g.set(xticks=[1, 12], xticklabels=[\"Jan\", \"Dec\"], xlabel=\"\")"
]
},
{
"cell_type": "markdown",
"id": "eeeb4198",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## How many people attended Federal Day Schools?\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "598e6d95",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": "slide"
},
"tags": []
},
"outputs": [],
"source": [
"from IPython.display import IFrame\n",
"IFrame(\"https://indiandayschools.org/\", 1200,1000)"
]
},
{
"cell_type": "markdown",
"id": "35c37979",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Questions \n",
"\n",
"- What is the provenance of the data used in the expert report?"
]
},
{
"cell_type": "markdown",
"id": "448aebb6",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"- How does this data relate to the original documents?"
]
},
{
"cell_type": "markdown",
"id": "f92f28dd",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"- What were the record retention policies of Day Schools?"
]
},
{
"cell_type": "markdown",
"id": "8c45482b",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"- If we don't know the data's provenance, we cannot assess how reliable the total number of people who attended Day Schools, as estimated from this data, really is."
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.1"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autoclose": true,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
}
},
"nbformat": 4,
"nbformat_minor": 5
}