GGR274H1 Final Project Instructions#

The final project for this course will use data science methods, including programming and statistical analysis, applied to data from Ontario Community Health Profiles on one of the topics below. You will present your findings in a five-minute pre-recorded video, in the style of an oral presentation at an academic Data Science Conference (DSC).

Deliverables#

  1. The Jupyter notebook (.ipynb) that produced the slides for the pre-recorded video presentation, together with the related data files (.xlsx), in an archive (.zip), to be submitted in MarkUs by April 8, 23:59 (11:59 PM) (extended from April 5).

Here are the instructions for submitting to MarkUs:

  • Download your Jupyter Notebook file from JupyterHub and rename it as Final_Project.ipynb.

  • Download the data files you used from JupyterHub.

  • Create a new folder and put the Notebook file (.ipynb) and the data files (.xlsx) in the same folder.

  • Because the data files are in the same folder as the notebook, you can import them directly in the notebook using code like what we usually use in homework and class (e.g., pd.read_excel(<filename>, <sheetname>, header=)).

  • Compress the folder into an archive (.zip) and rename it final_project.zip.

  • Submit this archive to MarkUs under the Project assignment.

  2. A 5-minute pre-recorded video presentation summarizing your work, in the style of a talk you would give at the Data Science Conference (DSC).

  • If the Jupyter notebook and other files are submitted after the deadline but no more than 24 hours late, you will lose 10% of your overall final project mark. Project files will not be accepted more than 24 hours after the deadline.
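
The MarkUs packaging steps above can be sketched in Python. This is only an illustration under stated assumptions: it works in a temporary directory, and the empty placeholder files stand in for your real notebook and data. Compressing the folder through your operating system's file manager works just as well.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path

# Work in a temporary directory so the sketch does not touch real files.
workdir = Path(tempfile.mkdtemp())

# Folder holding the notebook and the data files side by side.
project_dir = workdir / "final_project"
project_dir.mkdir()

# Empty placeholders standing in for the real notebook and data file.
(project_dir / "Final_Project.ipynb").write_text("{}")
(project_dir / "income_Toronto_2021_7.xlsx").write_bytes(b"")

# Compress the folder; make_archive appends ".zip" to the base name.
archive_path = shutil.make_archive(
    base_name=str(workdir / "final_project"),
    format="zip",
    root_dir=workdir,
    base_dir="final_project",
)

# List the archive contents to confirm both files are inside.
with zipfile.ZipFile(archive_path) as zf:
    archived_names = sorted(zf.namelist())
print(archived_names)
```

Because the notebook and data sit in the same folder, a bare filename passed to pd.read_excel inside the notebook resolves without any path changes on the TA's machine.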

Conference Slides#

  • You should produce your conference slides using RISE, the Jupyter notebook slideshow extension, which is available on the UofT JupyterHub. The template of a Jupyter notebook document to produce the slides for your presentation is here.

  • You will be allowed to present five content slides. The five slides do not include title page and do not include slides to break up sections described below.

  • Your presentation slides should include reproducible Python output (i.e., graphs and tables should be produced by Python code, not hard-coded or inserted as images), but not Python code unless it’s directly related to one of the sections below. See the template and tips for an example of how to write code chunks to do this. Put simply, you may want to use the skip slide type to hide a Python code cell and the fragment slide type to show multiple cells on one slide.

  • Your title slide must include your name, tutorial section (e.g., TUTXXXX), and the link to your presentation video.

  • Your slides should include the following sections:

    • Introduction

    • Methods

    • Results

    • Conclusion
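
For reference, the skip and fragment slide types mentioned above are stored in each cell's metadata inside the .ipynb file, which is plain JSON. The sketch below builds two minimal cell dictionaries to show where that setting lives. The helper name make_code_cell and the cell sources are hypothetical, and in practice you would normally set the slide type through the notebook interface rather than by editing JSON.

```python
import json


def make_code_cell(source, slide_type):
    """Build a minimal notebook code cell with a slideshow slide type."""
    return {
        "cell_type": "code",
        "execution_count": None,
        "metadata": {"slideshow": {"slide_type": slide_type}},
        "outputs": [],
        "source": source,
    }


# Left out of the slideshow entirely: the setup cell is marked "skip".
setup_cell = make_code_cell("import pandas as pd", "skip")

# Revealed as an extra step on the current slide: marked "fragment".
plot_cell = make_code_cell("df.plot()", "fragment")

# Show the metadata as it would appear inside the .ipynb file.
print(json.dumps(setup_cell["metadata"], indent=2))
```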

A few guidelines for an effective video presentation:

  • Your slides should be clear, concise, and easy to read quickly.

  • Do not use small fonts.

  • Figures often display information more efficiently than text.

  • Numbered or bulleted lists convey points in slides more effectively than blocks of text.

Oral Presentation#

You will be asked to present your slides in a 5 minute pre-recorded video presentation that summarizes your work. This time limit is firm and we will not view any content past 5 minutes.

  • You may not work with others in the class, as this is an individual project. You may discuss the project with professors and the course TAs using office hours.

Video Recording and Submission Instructions#

  1. First, record your video presentation.

    • You can refer to this demo video regarding how to make this recording.

  2. Upload your video recording to the University of Toronto MyMedia platform, https://mymedia.library.utoronto.ca.

    • Martin has made another demo video for this step.

  3. After uploading the video, obtain the “Permanent Link” to the video by clicking on the “Share” button. This link should look something like

    https://play.library.utoronto.ca/watch/1234567abcdefg...
    
  4. Add this URL to the top cell of your Jupyter notebook file, and clearly label it as the link to your presentation recording.

  5. Then, submit your Jupyter notebook file and all data used in your Jupyter notebook file to MarkUs under project-submission: Project Final Submission.

    Your TA will need to be able to see your full notebook file, run the code, and access the video URL, so please double-check notebook, data, and video before making your final submission!

Evaluation#

| Grade component | Value |
| --- | --- |
| Content of slides | 50% |
| Reproducibility of slide content | 10% |
| Video Presentation | 40% |

Content of slides#

  • The rubric for the content of the conference slides evaluation is below.

import pandas as pd
slidesrubric = pd.read_csv('confrenceslidesrubric.csv', keep_default_na=False)
slidesrubric.style.hide(axis='index')
| Criteria | Category | Excellent | Good | Adequate | Poor |
| --- | --- | --- | --- | --- | --- |
| Content | Reasonable scope | The scope of the analysis is clear and questions can be fully addressed using the available data. | The scope of the analysis is clear and questions can be reasonably addressed using the available data. | The scope of the analysis is less clear, the questions can somewhat be addressed using the available data with slight modifications. | The questions are beyond the scope, cannot be reasonably addressed with the available data; need to resort to additional data or complete modification. |
| | Data wrangling | Creative use of data wrangling to produce informative variables. | Appropriate use of data wrangling to create sensible variables. | Some use of data wrangling to create new variables. | No evidence of data wrangling to create any variables. |
| | Graphical display | Choice of graphs are appropriate and creative; graphs reveal useful information and tell a story. Meaningful captions and titles. | Choice of graphs are appropriate; graphs reveal useful information, but are not self-sufficient. Might require some explaining. | Choice of graphs are appropriate; graphs reveal some useful information. Might require some explaining and minor changes to titles/axes/labels, etc. | A lack of visual aid; graphs are inappropriate, reveal no information. |
| | Statistical methods | The choice of method is appropriate; analyses are complete; diverse and creative use of more than one approach. | The choice of method is appropriate; some non-essential analyses are missing. | The choice of method is somewhat appropriate; some analyses are missing. | The choice of method is inappropriate; essential analyses are missing. |
| | Appropriate conclusion | Results are clearly and completely summarized. Appropriate limitations and concerns clearly stated. | Results are completely summarized. Some limitations and concerns are stated. | Some results are summarized. The conclusion is not appropriate and no mentioning of any limitations. | Results are not summarized and conclusion is missing. |
| Writing | Organization | Contents are very well organized under the appropriate section and subsection headings. | Contents are organized under the appropriate section and subsection headings. | Contents are somewhat organized under section and subsection headings. | Contents are poorly organized under section and subsection headings. |
| | Overall Writing | Very polished and well written. | Few errors in spelling, punctuation, and/or grammar. Mostly clear and understandable. | Partly unclear, but mostly understandable. Several errors in spelling, punctuation, and/or grammar. | Too many errors in spelling, punctuation, and/or grammar, which make it unclear and difficult to follow. |

Reproducibility of Conference Slides#

  • You will submit the Jupyter notebook that was used to produce your conference slides.

  • Your TA will attempt to reproduce your conference slides using the Jupyter notebook (.ipynb) and data files you submit.

  • To make your notebook reproducible, save the notebook and the data files in the same folder, as described in the “Deliverables” section at the beginning of this document.

  • If your TA cannot run the .ipynb file you submit to reproduce your conference slide content then you will receive 0; if the TA has to make minor changes to get it to run then you will receive 1; and if it runs with no changes then you will receive 2.

Oral Video Presentation#

  • You will give a 5 minute presentation about your work.

  • At the beginning of the video, you must clearly show your student ID to the camera, with your face also visible. The grading TA must be able to identify you and read your student ID number. Failure to present your student ID will result in a grade of 0 for the video presentation. We recommend that you update your Quercus profile with a picture in which your face is clearly identifiable.

  • The presentation should show you speaking on one part of the screen and your slides on another part of the screen.

  • The presentation should not exceed 5 minutes. Any video beyond 5 minutes will not be viewed by the grading TA, and will not be considered when marking.

  • In the video you should describe your project’s:

    • Introduction

    • Data

    • Methods

    • Results

    • Conclusion

  • The rubric for the oral video presentation is below.

oralrubric = pd.read_csv('Oralpresrubric.csv', keep_default_na=False)
oralrubric.style.hide(axis='index')
| Criteria | Excellent (Rare) | Good (Common) | Adequate (Common) | Poor (Very rare) |
| --- | --- | --- | --- | --- |
| Speech clarity | Words were articulated clearly and distinctly, and very easy to understand. | Words were articulated clearly and distinctly most of the time, easy to understand. | Clear attempts to enunciate, with some occasional mumbling, but still understandable. | A lot of word slurring or mumbling, barely understandable. |
| Content Clarity | Just the right amount of explanation and details were given, the presentation effectively achieved its points. | Sufficient explanation and details were given, the presentation achieved most points. | Some explanation, too little or too much details were given, the presentation achieved some points. | No explanation, insufficient or too many unnecessary details, the presentation was confusing and had no clear objectives. |
| Transitional Phrases | Effective use of words and phrases to enhance the flow and signal transitions. | Good use of words and phrases to control the flow and signal transitions. | Some use of transitional words and phrases to signal transitions. | Lack of transitions and a poor progression of flow. |
| Vocabulary | Accurate use of statistical terms and phrases, and the presentation was professional and polished. | Good use of statistical terms and phrases whenever necessary. | Demonstrated efforts to incorporate statistical terms and phrases, but some were used inaccurately. | Completely inaccurate and wrong use of statistical terms and phrases and signals a lack of understanding. |
| Delivery | Well-paced, good volume, and the presenter was confident. | Good pace, and the presenter seemed confident. | Pacing could be improved, volume was not consistent. | Poor pacing, barely audible. |
| The wow factor | Overall an excellent and impressive presentation. | | | |

Data Analysis Expectations#

You will carry out a data analysis on data from Ontario Community Health Profiles using Python to address the topic below.

We expect that your analysis will require data wrangling, exploratory data analysis (plots and summary statistics), statistical tests and modeling. Your project does not need to include all of these statistical methods nor does it need to include all of the variables in the data set. You might also choose not to include all observations, or to make new variables from the data that may be more suitable for answering your questions of interest.

The goal is not to carry out an exhaustive analysis, nor to apply everything you have learned in the course. The goal is to demonstrate that you have learned how to use Python, that you can appropriately apply the methods we have covered in class to address a question, and that you can effectively interpret and present the results.

The Data#

Information about data from the About Page from the Ontario Community Health Profiles website:

“Our overall goal is to support action to reduce health inequities in Ontario.

How we achieve our goal: Through our open-access, freely accessible website, we provide free health and socio-demographic data and maps for everyone to use, download, customize and share. Interactive functionality allows users the ability to view multiple variables on one map. A user guide provides guidance on navigation while methods documents fully describe data sources and analyses.

Our strategy:

  • Provide health profiles of Ontario communities for Ontario communities with relevant and timely information in user-friendly format;

  • Provide technical support and mechanisms for communities to access data;

  • Conduct a series of workshops to foster access to and use of health data for decision-making, advocacy and policy, and to stimulate collaboration;

  • Respond to the needs of users and adapt to new ways of providing data.”

There is a wealth of information about health outcomes and socioeconomic characteristics, all summarized at the neighborhood level. These data come from a number of sources - e.g., socioeconomic data come from the Census, hospital admission data come from the Canadian Institute for Health Information, and chronic disease data come from the Institute for Clinical Evaluative Sciences.

What we give you#

You are provided with a number of datasets with health and socioeconomic data at the Toronto Neighbourhood level.

The health data include a dataset on adult health outcomes (AHD_2018-19_RPDB2019_neighb_Toronto.xlsx), a dataset on preventive health care measures (PR_2018-20_RPDB2019_neighb_TOR.xlsx), and a dataset on emergency department visits (EDC_2018-20_RPDB2019_neighb_All_HU_LU_TOR.xlsx).

The socioeconomic data include a dataset on income (income_Toronto_2021_7.xlsx), a dataset on population characteristics (population_characteristics_Toronto_2021_7.xlsx), and a dataset on household and dwelling characteristics (housing_dwellings_Toronto_2021_7.xlsx).

To complete your project, you only need to “wrangle” and analyze two of the above mentioned datasets, but you are also welcome to incorporate more than two datasets and to explore spatial patterns using the tools we learned in the last few weeks of class. A spatial file of Toronto Neighbourhoods that can be joined to the previously mentioned data is available as Neighbourhoods.geojson.
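
As a minimal sketch of combining two datasets, the example below merges two toy tables on a shared neighbourhood identifier. The column names (neighb_id, median_income, asthma_rate) and all values are hypothetical stand-ins, not the actual columns of the provided files, so check the real headers before adapting it.

```python
import pandas as pd

# Toy stand-ins for two of the provided datasets; column names and
# values are hypothetical, not taken from the real files.
income = pd.DataFrame({
    "neighb_id": [1, 2, 3, 4],
    "median_income": [52000, 61000, 48000, 75000],
})
health = pd.DataFrame({
    "neighb_id": [1, 2, 3, 5],
    "asthma_rate": [14.2, 12.8, 15.9, 11.1],
})

# An outer merge with indicator=True records which rows matched.
merged = income.merge(health, on="neighb_id", how="outer", indicator=True)

# Neighbourhoods that appear in only one dataset deserve a closer look.
unmatched = merged[merged["_merge"] != "both"]

# Keep only neighbourhoods present in both datasets for the analysis.
combined = merged[merged["_merge"] == "both"].drop(columns="_merge")
print(combined)
```

Inspecting the unmatched rows before dropping them is a cheap way to catch identifier mismatches (for example, IDs stored as text in one file and as numbers in the other).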

All data files are in the data folder or available here (link only available when you open this notebook in JupyterHub).

You are welcome to think about the data we have presented you and come up with your own question to answer, but a few questions are also provided below to help get you started. There are many ways you can address these questions in the data. You will need to focus on your unique research question and make choices about what variables are important to your question. You do not need to consider every variable in the data set.

Project Questions#

Below are a few example questions that can inspire your project topics. These general questions will require you to decide which data to use to answer these questions.

  1. What are the relationships between sociodemographic and income variables and health outcomes? Using the dataset on emergency department visits, explore the following:

    • Classify neighbourhoods based on their income quintiles (5 groups, 20% each), and explore the demographics (e.g., living alone, language) within each group.

    • Repeat the above by classifying neighbourhoods as quintiles by a sociodemographic variable of your choice, and summarize income.

    • Analyze how high urgency and low urgency emergency department visits are associated with income and sociodemographic variables. Is there an association between income and high (or low) urgency ED visits?

    • If these relationships exist, discuss whether they make sense and why; if they do not exist, what might be missing? You may need to do a little research to figure this part out. Cite your sources.
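
One possible way to start on question 1 is sketched below with pandas. The data are made up, and the column names (median_income, pct_living_alone, low_urgency_ed_rate) are hypothetical, so treat this as an illustration of the quintile grouping idea rather than the expected solution.

```python
import pandas as pd

# Made-up data for ten neighbourhoods; column names are hypothetical.
df = pd.DataFrame({
    "neighb_id": range(1, 11),
    "median_income": [38000, 42000, 45000, 51000, 56000,
                      60000, 67000, 73000, 88000, 104000],
    "pct_living_alone": [18.0, 16.5, 15.0, 14.2, 13.1,
                         12.4, 11.0, 10.3, 9.1, 8.0],
    "low_urgency_ed_rate": [9.5, 9.1, 8.4, 8.0, 7.2,
                            7.0, 6.1, 5.8, 5.0, 4.4],
})

# qcut splits neighbourhoods into 5 equal-sized income groups (quintiles).
df["income_quintile"] = pd.qcut(
    df["median_income"], q=5, labels=[1, 2, 3, 4, 5]
)

# Summarize demographics and ED visits within each income quintile.
quintile_summary = df.groupby("income_quintile", observed=True)[
    ["pct_living_alone", "low_urgency_ed_rate"]
].mean()

# Pearson correlation between income and low urgency ED visits.
income_ed_corr = df["median_income"].corr(df["low_urgency_ed_rate"])

print(quintile_summary)
print(income_ed_corr)
```

The same pattern applies to the second bullet: swap the variable passed to qcut for a sociodemographic variable and summarize income within each group.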

  2. Do certain adult health conditions (like asthma) have similar spatial patterns? (NB: some of the methods mentioned in this question will be covered later in the course)

    • Explore the geography of adult asthma prevalence by mapping it using the libpysal and geopandas libraries.

    • Use a measure of global spatial autocorrelation (for example Moran’s I) to determine if the outcomes are actually spatially clustered.

    • Use a spatial clustering tool (e.g., the Local G or LISA statistics we covered in class) to map where hot spots and cold spots of the diseases are.

    • Compare the patterns and state whether you think there is significant overlap. If you can do this using code, all the better :).

    • To go one step further, use the sociodemographic data to explore whether the disease maps correspond to patterns in sociodemographic variables (of your choosing).
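
To make the global Moran's I idea concrete, here is a hand-rolled sketch using only NumPy. It is not the library workflow from class: the four-neighbourhood layout, the binary contiguity weights, and the values are all made up, and no significance test is included. For the project you would normally build weights from Neighbourhoods.geojson and rely on the spatial analysis libraries used in the course.

```python
import numpy as np


def morans_i(values, weights):
    """Global Moran's I for values under a spatial weights matrix."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    n = values.size
    z = values - values.mean()   # deviations from the mean
    s0 = weights.sum()           # sum of all spatial weights
    return (n / s0) * (z @ weights @ z) / (z @ z)


# Four hypothetical neighbourhoods in a row: 1-2-3-4.
# w[i, j] = 1 when neighbourhoods i and j share a border.
w = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

# Similar values sit next to each other: positive autocorrelation.
i_clustered = morans_i([10.0, 11.0, 20.0, 21.0], w)

# High and low values alternate: negative autocorrelation.
i_alternating = morans_i([10.0, 20.0, 10.0, 20.0], w)

print(i_clustered, i_alternating)
```

Values of I above zero suggest that similar rates cluster in neighbouring areas, while values below zero suggest a checkerboard-like pattern. Whether a given value is statistically significant needs a separate test, which this sketch omits.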

  3. Are differences in the number of preventive procedures (e.g., mammograms, pap smears, etc.) by age group (as classified in the preventive health care measures dataset) related to different sociodemographic variables?

    • The preventive procedures in the dataset are broken down by age group (the age groups vary somewhat depending on the procedure). Select two procedures from the dataset, and explore whether older people participate in more preventive procedures than younger people.

    • For the two procedures, calculate the differences in percent (e.g., % who had a mammogram) by age group for each neighbourhood. How different are these percentages, on average, for the procedures you picked?

    • For each procedure, identify the neighbourhoods with the biggest absolute differences (use the top 10%, i.e., values at or above the 90th percentile). How do sociodemographic variables in these neighbourhoods differ from those in neighbourhoods with smaller differences (i.e., the remaining 90%)?

    • Given what you’ve learned, discuss why you think these differences might exist. Feel free to create a map to help justify your explanation.
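
The age-group comparison in question 3 could be approached along the lines of the sketch below. The data are made up, and the column names (pct_screened_younger, pct_screened_older, median_income) are hypothetical placeholders for whichever procedure, age groups, and sociodemographic variable you choose.

```python
import pandas as pd

# Made-up screening percentages for ten neighbourhoods, one procedure.
screening = pd.DataFrame({
    "neighb_id": range(1, 11),
    "pct_screened_younger": [55.0, 60.0, 58.0, 62.0, 50.0,
                             65.0, 59.0, 61.0, 57.0, 63.0],
    "pct_screened_older": [57.0, 61.0, 70.0, 63.0, 53.0,
                           64.0, 60.0, 65.0, 58.0, 40.0],
    "median_income": [48000, 52000, 71000, 55000, 46000,
                      60000, 58000, 66000, 50000, 39000],
})

# Difference in percent screened between the two age groups.
screening["diff"] = (
    screening["pct_screened_older"] - screening["pct_screened_younger"]
)
screening["abs_diff"] = screening["diff"].abs()

# Average difference across neighbourhoods.
mean_diff = screening["diff"].mean()

# Flag neighbourhoods at or above the 90th percentile of absolute difference.
cutoff = screening["abs_diff"].quantile(0.90)
screening["large_gap"] = screening["abs_diff"] >= cutoff

# Compare a sociodemographic variable between the two groups.
income_by_gap = screening.groupby("large_gap")["median_income"].mean()

print(mean_diff)
print(income_by_gap)
```

With only ten toy rows, the top group contains a single neighbourhood. With the full set of Toronto neighbourhoods the group will be larger, and a statistical test becomes more meaningful.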