{ "cells": [ { "cell_type": "markdown", "id": "41ce46ff", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "# Welcome to GGR274\n", "\n", "- What is Data Science?\n", "- What is the role of Statistics in Data Science?" ] }, { "cell_type": "markdown", "id": "320af18d", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Where is Data Science Used in Society?\n", "\n", "### Questions\n", "\n", "- Is there a Bike Share bike available?\n", "\n", "- Will I experience another TTC delay today?\n", "\n", "- Will this person commit a crime in the future?\n", "\n", "- Does COVID-19 affect males more than females?\n", "\n", "- What movies might this person enjoy?\n", "\n", "- How many people attended Indian Day Schools in Canada?" ] }, { "cell_type": "markdown", "id": "c5abf094", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Bike Share Toronto\n", "\n", "**Is there a bike available at St. George St. /Hoskin Ave.?**\n", "\n", " \n", "\n" ] }, { "cell_type": "markdown", "id": "19a6a68c", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Visualization using maps \n", "\n", " **Don't worry about understanding the code today - it's complicated, but we will be learning as the course goes on** " ] }, { "cell_type": "code", "execution_count": null, "id": "e77fafc5", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [], "source": [ "import folium\n", "\n", "m = folium.Map(location=[43.664288100559816, -79.39800603825044], zoom_start =18)\n", "m" ] }, { "cell_type": "markdown", "id": "403f6401", "metadata": { "jp-MarkdownHeadingCollapsed": true, "slideshow": { "slide_type": "slide" } }, "source": [ "## Bike Share data is ugly\n", "\n", "But, it can be made beautiful with a little bit of programming ..." 
] }, { "cell_type": "markdown", "id": "9578440c", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "## Step 1\n", "\n", "- Read the data into Python." ] }, { "cell_type": "markdown", "id": "dab7bb35", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "```\n", "{\"last_updated\":1637611674,\"ttl\":30,\"data\":{\"stations\":[{\"station_id\":\"7000\",\"name\":\"Fort York Blvd / Capreol Ct\",\"physical_configuration\":\"REGULAR\",\"lat\":43.639832,\"lon\":-79.395954,\"altitude\":0.0,\"address\":\"Fort York Blvd / Capreol Ct\",\"capacity\":35,\"rental_methods\":[\"KEY\",\"TRANSITCARD\",\"CREDITCARD\",\"PHONE\"],\"groups\":[],\"obcn\":\"647-643-9607\",\"nearby_distance\":500.0},{\"station_id\":\"7001\",\"name\":\"Wellesley Station Green P\",\"physical_configuration\":\"REGULAR\",\"lat\":43.66496415990742,\"lon\":-79.38355031526893,\"altitude\":0.0,\"address\":\"Yonge / Wellesley\",\"post_code\":\"M4Y 1G7\",\"capacity\":17,\"rental_methods\":[\"KEY\",\"TRANSITCARD\",\"CREDITCARD\",\"PHONE\"],\"groups\":[],\"obcn\":\"416-617-9576\",\"nearby_distance\":500.0},\n", "\n", "```" ] }, { "cell_type": "markdown", "id": "2a2bb792", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "🙀 Don't worry about understanding all the details for now ..." ] }, { "cell_type": "code", "execution_count": null, "id": "f861c4d0", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "import pandas as pd\n", "\n", "station_url = 'https://tor.publicbikesystem.net/ube/gbfs/v1/en/station_information'\n", "stationinfo = pd.read_json(station_url)\n", "\n", "stationlist = stationinfo['data'].iloc[0]" ] }, { "cell_type": "markdown", "id": "451b4b3a", "metadata": {}, "source": [ "The first station data looks a bit better:" ] }, { "cell_type": "code", "execution_count": null, "id": "6698ad6b", "metadata": {}, "outputs": [], "source": [ "stationlist[0]" ] }, { "cell_type": "markdown", "id": "bd4c91c5", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Step 2\n", "\n", "- A bit of programming is needed to **transform the data** into a format that can be displayed on the map above." ] }, { "cell_type": "code", "execution_count": null, "id": "d885beb3", "metadata": { "scrolled": true, "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "# convert JSON format to GeoJSON format\n", "def featuredict(lon, lat, id1, name, capacity):\n", "    # named `feature` so the built-in `dict` is not shadowed\n", "    feature = {\n", "        \"type\": \"Feature\",\n", "        \"geometry\": {\n", "            \"type\": \"Point\",\n", "            \"coordinates\": [lon, lat]\n", "        },\n", "        \"properties\": {\n", "            \"station_id\": id1,\n", "            \"name\": name,\n", "            \"capacity\": capacity\n", "        }\n", "    }\n", "    return feature\n", "\n", "\n", "# build a GeoJSON Feature for every station in the list\n", "def create_station_list(l):\n", "    m = []\n", "    for station in l:\n", "        m.append(\n", "            featuredict(\n", "                station['lon'],\n", "                station['lat'],\n", "                station['station_id'],\n", "                station['name'],\n", "                station['capacity']\n", "            )\n", "        )\n", "    return m\n", "\n", "stations = create_station_list(stationlist)\n", "stations[0:2]" ] }, { "cell_type": "markdown", "id": "05cc9d7e", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Step 2 (Cont'd)\n", "\n", "- More data transformation ..."
] }, { "cell_type": "code", "execution_count": null, "id": "ebc8212c", "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "# convert stations list to GeoJSON format\n", "\n", "def add_GeoJSON_formatting(features):\n", "    # named `collection` so the built-in `dict` is not shadowed\n", "    collection = {\n", "        \"type\": \"FeatureCollection\",\n", "        \"features\": features\n", "    }\n", "    return collection\n", "\n", "\n", "stations_geoj = add_GeoJSON_formatting(stations)\n", "stations_geoj['features'][0]" ] }, { "cell_type": "markdown", "id": "0b0899ce", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## A Data Science Application\n", "\n", "**Question:** How many bikes are available?\n", "\n", "**Data:** TO Bike Share data\n", "\n", "**Communicate the results:** Visualize the data on a map where the bikes are located\n", "\n", "**Next Steps:** Predict how many bikes are available at hourly intervals" ] }, { "cell_type": "code", "execution_count": null, "id": "3f943b82", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "folium.GeoJson(\n", "    stations_geoj,\n", "    name = \"Bikes\",\n", "    tooltip = folium.features.GeoJsonTooltip(fields=['station_id','name','capacity'], localize=True)\n", ").add_to(m)\n", "m" ] }, { "cell_type": "markdown", "id": "71c5eb78", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## TTC Subway Delays" ] }, { "cell_type": "markdown", "id": "cd30d28b-2f7f-4321-99d3-f25f5fdec99d", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "id": "5c1219f2-51b7-4938-a193-530371d7ecf5", "metadata": { "editable": true, "slideshow": { "slide_type": "subslide" }, "tags": [] }, "source": [ "## Steps to Read the Public Data\n", "\n", "1. Extract data file locations."
] }, { "cell_type": "code", "execution_count": null, "id": "13b38d3b-b4cf-4ea6-ab65-f0a1f0ae3c41", "metadata": { "editable": true, "scrolled": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "import requests\n", "# Toronto Open Data is stored in a CKAN instance. Its APIs are documented here:\n", "# https://docs.ckan.org/en/latest/api/\n", "\n", "# To hit our API, you'll be making requests to:\n", "base_url = \"https://ckan0.cf.opendata.inter.prod-toronto.ca\"\n", "\n", "# Datasets are called \"packages\". Each package can contain many \"resources\"\n", "# To retrieve the metadata for this package and its resources, use the package name in this page's URL:\n", "url = base_url + \"/api/3/action/package_show\"\n", "params = { \"id\": \"ttc-subway-delay-data\"}\n", "package = requests.get(url, params = params).json()\n", "xlsx_list = []\n", "\n", "# To get resource data:\n", "for resource in package[\"result\"][\"resources\"]:\n", "\n", "    # To get metadata for non datastore_active resources:\n", "    if not resource[\"datastore_active\"]:\n", "        url = base_url + \"/api/3/action/resource_show?id=\" + resource[\"id\"]\n", "        resource_metadata = requests.get(url).json()['result']\n", "        # From here, you can use the \"url\" attribute to download this file\n", "        print(resource_metadata['url'])\n", "        xlsx_list.append(resource_metadata['url'])\n" ] }, { "cell_type": "markdown", "id": "e580eb9c-c90d-47f0-9282-fcf3907d8fe3", "metadata": { "editable": true, "slideshow": { "slide_type": "subslide" }, "tags": [] }, "source": [ "2. Read the files of interest, for example the files for 2022 and 2023."
] }, { "cell_type": "code", "execution_count": null, "id": "1696ce1d-5074-45a7-9e32-1a484493fb1e", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "import re\n", "import pandas as pd\n", "p = re.compile(\".+(2022|2023)\")\n", "df_list = []\n", "\n", "for x in xlsx_list:\n", "    if p.match(x):\n", "        print(x)\n", "        df_list.append(pd.read_excel(x))\n", "\n", "ttc_delays = pd.concat(df_list)\n", "ttc_delays" ] }, { "cell_type": "markdown", "id": "f6b2fddc", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "## What can we learn from the data?\n", "\n", "Plotting (with a bit of manipulation) usually helps us see patterns that we might not see otherwise." ] }, { "cell_type": "code", "execution_count": null, "id": "fa2b0fc2-cf86-4341-80cb-86518c27d235", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "import seaborn as sns\n", "ttc_delays_stn = ttc_delays.groupby(\n", "    \"Station\" # station names need cleaning...\n", ").agg(\n", "    {\"Min Delay\": \"sum\"}\n", ").reset_index().sort_values(\n", "    \"Min Delay\", ascending = False\n", ").head(15)\n", "# print(ttc_delays_stn)\n", "g = sns.barplot(\n", "    data = ttc_delays_stn,\n", "    x = \"Min Delay\",\n", "    y = \"Station\"\n", ")\n", "g.set(xlabel = \"Total Length of Delays (min)\", ylabel = \"Top 15 Stations\")" ] }, { "cell_type": "code", "execution_count": null, "id": "59140615", "metadata": { "editable": true, "slideshow": { "slide_type": "subslide" }, "tags": [] }, "outputs": [], "source": [ "ttc_delays[\"Year\"] = ttc_delays[\"Date\"].apply(lambda x: x.year)\n", "ttc_delays[\"Month\"] = ttc_delays[\"Date\"].apply(lambda x: x.month)\n", "ttc_delays_ymd = ttc_delays.groupby(\n", "    [\"Year\", \"Month\"]\n", ").size().reset_index(name = \"Number of Delays\")\n", "\n", "g = sns.lineplot(\n", "    data = ttc_delays_ymd,\n", "    x = \"Month\",\n", "    y = \"Number of Delays\",\n", "    hue = \"Year\"\n", ")\n", "g.set(xticks = [1, 12], xticklabels = [\"Jan\", \"Dec\"], xlabel = \"\")" ] }, { "cell_type": "markdown", "id": "eeeb4198", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## How many people attended Federal Day Schools?\n" ] }, { "cell_type": "code", "execution_count": null, "id": "598e6d95", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [], "source": [ "from IPython.display import IFrame\n", "IFrame(\"https://indiandayschools.org/\", 1200,1000)" ] }, { "cell_type": "markdown", "id": "3e199e87", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "![](FederalDaySchools/cover_expertreport.png)" ] }, { "cell_type": "markdown", "id": "6cbd4c44", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "![](FederalDaySchools/title_expertreport.png)" ] }, { "cell_type": "markdown", "id": "f97f06da", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "![](FederalDaySchools/estimates_expertreport.png)" ] }, { "cell_type": "markdown", "id": "622ee42b", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "![](FederalDaySchools/datapg1_expertreport.png)" ] }, { "cell_type": "markdown", "id": "f7d8f745", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "![](FederalDaySchools/datapg2_expertreport.png)" ] }, { "cell_type": "markdown", "id": "748cbf17", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "![](FederalDaySchools/datapg3_expertreport.png)" ] }, { "cell_type": "markdown", "id": "ad472b24", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "\n", "![](FederalDaySchools/c-8149-00418.png)" ] }, { "cell_type": "markdown", "id": "44f749e3", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "![](FederalDaySchools/c-8171-00008.png)" ] }, { "cell_type": "markdown", "id": "35c37979", "metadata": { "slideshow": { 
"slide_type": "slide" } }, "source": [ "## Questions \n", "\n", "- What is the provenance of the data used in the expert report?" ] }, { "cell_type": "markdown", "id": "448aebb6", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- How does this data relate to the original documents?" ] }, { "cell_type": "markdown", "id": "f92f28dd", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- What were record retention policies of Day Schools?" ] }, { "cell_type": "markdown", "id": "8c45482b", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- If we don't know the data's provenance then it's not possible to assess the reliability of the total number of people that attended Day School from this data." ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.1" }, "latex_envs": { "LaTeX_envs_menu_present": true, "autoclose": true, "autocomplete": true, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 1, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": false } }, "nbformat": 4, "nbformat_minor": 5 }