GGR274 Lecture 3: Introduction to Programming, Part 2#
Recap#
Last lecture, we learned about:
Five basic Python data types:
| Data type | Description | Example |
|---|---|---|
| int | Integers | 10, 0, -3 |
| float | Non-integer numbers | 10.3, -3.5 |
| bool | True/False | True, False |
| str | Text | "January is cold" |
| list | A sequence of data | ["January", "is", "cold"] |
Last lecture, we learned about:
Operations we can perform on different data types
Arithmetic: +, -, *, /
Comparison: ==, !=, <, <=, >, >=
Substring search: in
Boolean (logic): and, or, not
Indexing: my_string[i], my_list[i]
Note: indexing starts at 0, not 1!
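As a quick reminder of what zero-based indexing looks like, here is a minimal sketch. The variable names and values are made up for this example.

```python
# Hypothetical example values (not from the lecture data)
my_string = "January"
my_list = ["January", "is", "cold"]

first_letter = my_string[0]             # index 0 is the FIRST character
second_word = my_list[1]                # index 1 is the SECOND element
last_word = my_list[len(my_list) - 1]   # the last valid index is len(...) - 1

print(first_letter)  # J
print(second_word)   # is
print(last_word)     # cold
```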
Last lecture, we learned about:
Functions
len
type
abs
sum
String methods
str.upper()
str.lower()
str.replace(old, new)
str.split()
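The functions and string methods listed above can be tried out in one short cell. This is only a sketch with made-up example values. Note that string methods return a new string and leave the original unchanged.

```python
message = "January is cold"   # made-up example string
numbers = [10, -3, 5]         # made-up example list

# Functions: written as function_name(value)
print(len(message))     # 15 (number of characters, including spaces)
print(type(numbers))    # <class 'list'>
print(abs(-3))          # 3
print(sum(numbers))     # 12

# String methods: written as value.method_name(...)
upper_message = message.upper()
lower_message = message.lower()
warm_message = message.replace("cold", "warm")
words = message.split()

print(upper_message)    # JANUARY IS COLD
print(lower_message)    # january is cold
print(warm_message)     # January is warm
print(words)            # ['January', 'is', 'cold']
print(message)          # January is cold (the original string is unchanged)
```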
Learning Objectives#
In this lecture, you will learn to:
“Reset” Python in your notebook
Avoid a common pitfall involving changing variables
Use if statements and for loops to control how Python code gets executed
Read in data from files
Navigate through and find your course files on JupyterHub
Jupyter Notebook Tips#
If Python seems to be behaving strangely or you’ve run a bunch of cells and lost track of what you’re doing, don’t panic!
Reset Python in your notebook by doing the following:
Go to Kernel -> Restart & Clear Output.
Python will restart, and all cell outputs will be removed, leaving only your code.
Then, select the cell you’re currently working on, and go to Cell -> Run All Above.
Or, if you want to re-run all cells, select Cell -> Run All instead.
Common pitfall: changing variables and running cells out of order#
Demo!
my_number = 10
my_number
my_number = 20
my_number = my_number + 100
my_number
Avoiding this pitfall:
Prefer making cells “self-contained”: define and use variables in the same cell.
If you’re using a variable, make sure it’s been defined in an earlier cell.
After defining a variable, avoid changing the variable’s value in a different cell. Treat the variable as “read-only” after the cell it’s been defined in.
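One way to follow this advice is sketched below, assuming the whole block is a single notebook cell. The name result is made up for this example. Because the cell stores the computed value in a new variable instead of overwriting my_number, re-running it any number of times gives the same output.

```python
# One self-contained "cell": the variable is defined and used in the same place.
my_number = 10

# Store the computed value under a NEW name instead of overwriting my_number.
result = my_number + 100

print(my_number)  # 10
print(result)     # 110
```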
If statements#
So far, all of our code has been executed one line at a time, in top-down order:
name = "Karen"
age = 18
print(f"Hello {name}! How old are you?")
print(f"Wow {name}, you're only {age} years old? Why do they let you teach?")
print(f"Goodbye {name}, nice to meet you. :)")
Sometimes, we want to run different code depending on the values of our variables.
An if statement lets us execute some code only if a specific condition is met.
name = "Karen"
age = 18
print(f"Hello {name}! How old are you?")
# NEW
if age < 20:
    print(f"Wow {name}, you're only {age} years old? Why do they let you teach?")
print(f"Goodbye {name}, nice to meet you. :)")
If statement terminology#
if <condition>:
    <statement1>
    <statement2>
    ...
We call the expression between the if and the colon the if condition.
We call the statements indented on the line(s) after the colon the if branch.
Warning: indentation matters! Python uses indentation to determine what code is part of an if statement vs. what’s after the if statement.
if 3 > 5:
    print("Line 1")
    print("Line 2")
vs.
if 3 > 5:
    print("Line 1")
print("Line 2")
if, elif, else#
We can use an optional else block after the if block to execute code only when the if condition is False.
name = "Karen"
age = 18
print(f"Hello {name}! How old are you?")
if age < 20:
    print(f"Wow {name}, you're only {age} years old? Why do they let you teach?")
else:
    print(f"Ah yes {name}, {age} is indeed a reasonable age to be teaching.")
print(f"Goodbye {name}, nice to meet you. :)")
Finally, we can include zero or more elif blocks between the if and else to express more than two cases.
name = "Karen"
age = 18
print(f"Hello {name}! How old are you?")
if age < 20:
    print(f"Wow {name}, you're only {age} years old? Why do they let you teach?")
elif age > 100:
    print(f"Wow {name}, you're already {age} years old? Why do they let you teach?")
elif age == 57:
    print(f"Ah {name}, you are the perfect age and they should let you do whatever you want. 😎")
else:
    print(f"Ah yes {name}, {age} is indeed a reasonable age to be teaching.")
print(f"Goodbye {name}, nice to meet you. :)")
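One detail worth seeing in action: Python checks the conditions from top to bottom and runs only the first branch whose condition is True, even if a later condition is also True. The sketch below uses a made-up age and made-up labels to show this.

```python
age = 15  # hypothetical value chosen so that more than one condition is True

if age < 20:
    label = "under 20"
elif age < 18:
    # Never reached: any age below 18 is also below 20,
    # so the first branch always wins.
    label = "under 18"
else:
    label = "20 or over"

print(label)  # under 20
```

Because of this, the order of the elif blocks matters: put the most specific condition first.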
Exercise Break#
When is a year a leap year? A leap year is a year that has 366 days by adding February 29 to the calendar. Leap years occur on years that are evenly divisible by 4, except years that are evenly divisible by 100 but not by 400.
This question uses the modulus operator, %, which finds the integer remainder of the division between two integers. For example, x % 4 == 0 if x is evenly divisible by 4. In other words, x % 4 == 0 is true if x / 4 has no remainder.
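A few concrete values may help before attempting the exercise. This is just an illustrative sketch, and the variable names are made up.

```python
remainder_a = 10 % 4    # 10 = 2 * 4 + 2, so the remainder is 2
remainder_b = 12 % 4    # 12 = 3 * 4 + 0, so the remainder is 0

print(remainder_a)        # 2
print(remainder_b)        # 0
print(remainder_b == 0)   # True: 12 is evenly divisible by 4

# The same idea applied to a year:
print(2100 % 100 == 0)    # True: 2100 is evenly divisible by 100
print(2100 % 400 == 0)    # False: 2100 = 5 * 400 + 100
```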
Write conditional statements below so that the statement f"{year} is a leap year" will be printed if year is a leap year and f"{year} is not a leap year" will be printed if year is not a leap year.
Test your statements on the following years:
2025 is not a leap year
2024 is a leap year
2000 is a leap year
2100 is not a leap year
year = 2024
if year % 4 == 0:
    if year % 100 == 0:
        if year % 400 == 0:
            print(f"{year} is a leap year")
        else:
            print(f"{year} is not a leap year")
    else:
        print(f"{year} is a leap year")
else:
    print(f"{year} is not a leap year")
# or more compactly
if (year % 4 == 0 and year % 100 != 0) or year % 400 == 0:
    print(f"{year} is a leap year")
else:
    print(f"{year} is not a leap year")
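To check the compact condition against all four test years at once, one option is sketched below. It uses two things this lecture has not covered in detail: defining a function (is_leap_year is a hypothetical helper name) and a for loop, which is introduced in the next section. It also compares against calendar.isleap from Python's standard library as an independent check.

```python
import calendar

def is_leap_year(year):
    """Return True if year is a leap year (hypothetical helper for this sketch)."""
    return (year % 4 == 0 and year % 100 != 0) or year % 400 == 0

test_years = [2025, 2024, 2000, 2100]
results = []
for year in test_years:
    results.append(is_leap_year(year))
    # Print our answer next to the standard library's answer.
    print(year, is_leap_year(year), calendar.isleap(year))

print(results)  # [False, True, True, False]
```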
For loops#
Recall that we can represent and operate on a collection of data using the list data type:
numbers = [10, 20, 30, 40, 50]
sum(numbers)
Sometimes we want to execute a piece of code once per element in a collection.
We can do this using a for loop.
numbers = [10, 20, 30, 40, 50]
for number in numbers:
    print(number)
For loop terminology#
for <variable> in <collection>:
    <statement1>
    <statement2>
    ...
We call <variable> the (for) loop variable. It refers to each element of the <collection>, one at a time.
We say that the for loop iterates over the <collection>.
We call the statements indented after the colon the (for) loop body.
We call each repetition of the loop body an iteration of the loop.
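The terminology above can be traced with a small sketch. The list and the counter names are made up for this example. The indented lines form the loop body and run once per iteration, while the final print is not indented and runs once after the loop finishes.

```python
numbers = [10, 20, 30]   # the collection being iterated over
iteration_count = 0
running_total = 0

for number in numbers:   # number is the loop variable
    # Loop body: runs once for each element of numbers
    iteration_count = iteration_count + 1
    running_total = running_total + number
    print(f"Iteration {iteration_count}: number is {number}, running total is {running_total}")

# Not indented, so this is after the loop and runs only once
print(f"Done after {iteration_count} iterations.")
```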
Using loops: compute and print#
Problem: Given a list of names, compute the length of each name and display each result.
names = ["Karen Reid", "Chunjiang Li", "Michael Moon"]
for name in names:
    name_length = len(name)
    print(f"The length of '{name}' is {name_length}.")
Using loops: compute and save#
Problem: Given a list of names, compute the length of each name and store each result (so that we can use these values later).
names = ["Karen Reid", "Chunjiang Li", "Michael Moon"]
for name in names:
    name_length = len(name)
    # How do we "save" the length of the name?
To accomplish this, we will:
Create a new list variable to store the results.
In the loop body, use the (new!) list.append method to add each length to the new list.
names = ["Karen Reid", "Chunjiang Li", "Michael Moon"]
lengths = []
for name in names:
    name_length = len(name)
    lengths.append(name_length)
lengths
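A common mistake with this pattern is creating the empty list inside the loop body, which resets it on every iteration. The sketch below contrasts the two placements. The name wrong_lengths is made up for this illustration.

```python
names = ["Karen Reid", "Chunjiang Li", "Michael Moon"]

# Pitfall: the list is re-created on every iteration,
# so only the last name's length survives.
for name in names:
    wrong_lengths = []
    wrong_lengths.append(len(name))

# Correct: create the list ONCE, before the loop.
lengths = []
for name in names:
    lengths.append(len(name))

print(wrong_lengths)  # [12]
print(lengths)        # [10, 12, 12]
```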
Using loops: filtering#
We can combine a for loop and an if statement to iterate over a collection and decide what to do based on whether the current value satisfies a condition.
Problem: Given a list of names, compute the length of each name that contains an "e", and store each result.
names = ["Karen Reid", "Chunjiang Li", "Michael Moon"]
lengths = []
for name in names:
    if "e" in name:
        name_length = len(name)
        lengths.append(name_length)
lengths
Exercise Break#
1. Given a list of names, compute the length of each name that contains an "a" and a "c", and store each result.
2. Given a list of names, compute the length of each name and store each result that is \(\geq 11\).
names = ["Karen Reid", "Chunjiang Li", "Michael Moon"]
# Write your code for Question 1 here.
# Write your code for Question 2 here.
len(names[0])
# Solutions (only look at this after attempting the problems yourself!)
names = ["Karen Reid", "Chunjiang Li", "Michael Moon"]
# Question 1
lengths1 = []
for name in names:
    if ("a" in name) and ("c" in name):
        name_length = len(name)
        lengths1.append(name_length)
print(f"lengths1 is {lengths1}")
# Question 2
lengths2 = []
for name in names:
    name_length = len(name)
    if name_length >= 11:
        lengths2.append(name_length)
print(f"lengths2 is {lengths2}")
Reading data from files#
So far, we’ve been working entirely with data we defined inside our notebook.
But in practice, we’ll be reading in large quantities of data from files. Let’s learn how to do this in JupyterHub and Python!
Running example: Canada’s Electoral Districts#
Today we’ll work with a small dataset of Canada’s 338 electoral districts, sourced from Elections Canada.
We’ve downloaded this data in a file called ED-Canada_2016.csv and added it to the course website in the same folder as this notebook.
Let’s take a look now. The easiest way to get to the file is to go to File -> Open… in the menu. Demo time!
CSV files#
ED-Canada_2016.csv is a type of file called a comma-separated values (CSV) file, which is a common way of storing tabular data.
In a CSV file:
Each line of the file represents one row
Within one line, each column entry is separated by a comma
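To make the two rules above concrete, here is a minimal sketch using a made-up piece of CSV text. The district names and numbers are invented and are not the contents of the real file.

```python
# Made-up CSV text with two rows and two columns (name, population).
csv_text = "District A,100\nDistrict B,250\n"

# Each line of the text is one row of the table.
rows = csv_text.splitlines()
print(rows)  # ['District A,100', 'District B,250']

# Within one row, the column entries are separated by commas.
for row in rows:
    columns = row.split(",")
    print(columns)
```

Notice that every entry is still a string at this point, including the numbers.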
Reading file data: open and readlines#
Now that we’ve seen the file, let’s learn how to read the file’s contents into Python.
Formally, we do this in two steps:
Open the file.
Read the file data into Python, line by line.
# Step 1: open the file
district_file = open("ED-Canada_2016.csv", encoding="utf-8")
district_file
# Step 2: read the file data into a list of strings, one per line
district_data = district_file.readlines()
district_data
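The cell above leaves the file open. A widely used alternative, not covered in this lecture, is Python's with statement, which closes the file automatically once the indented block ends. The sketch below is self-contained, so it first writes a small stand-in file with made-up contents to a temporary folder. The file name example_districts.csv and its contents are assumptions for this example only.

```python
import os
import tempfile

# Create a small stand-in file so this example runs anywhere.
temp_dir = tempfile.mkdtemp()
example_path = os.path.join(temp_dir, "example_districts.csv")
with open(example_path, "w", encoding="utf-8") as example_file:
    example_file.write("District A,100\nDistrict B,250\n")

# Open the file, read all lines, and close it automatically.
with open(example_path, encoding="utf-8") as example_file:
    example_lines = example_file.readlines()

print(example_lines)        # ['District A,100\n', 'District B,250\n']
print(example_file.closed)  # True
```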
Data processing#
Let’s look at just the first line from the file:
district_data[0]
There are two annoying things about this line:
It’s a single string, but really stores two pieces of data.
There’s a strange \n at the end of the string, representing a line break.
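Both issues can be handled with string methods from last lecture plus str.strip and int. Here is a minimal sketch on a single made-up line. The real file's first line will have different values.

```python
line = "District A,100\n"   # made-up line in the same shape as the file's lines

# Issue 1: one string holds two pieces of data, so split it at the comma.
entries = line.split(",")
print(entries)               # ['District A', '100\n']

# Issue 2: remove the trailing line break (and any surrounding spaces).
population_text = entries[1].strip()
print(population_text)       # 100

# The population is still a string, so convert it to an int before doing math.
population = int(population_text)
print(population + 1)        # 101
```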
Goal: take the list district_data and extract just the population counts, converting to int. We’ll develop this one together!
populations = []
for line in district_data:
    entries = line.split(",")
    population_entry = entries[1].strip()
    population_int = int(population_entry)
    populations.append(population_int)
populations
Now we can compute!#
num_populations = len(populations)
total_population = sum(populations)
max_population = max(populations)
min_population = min(populations)
avg_population = total_population / num_populations
print(f"Number of population entries: {num_populations}.")
print(f"Sum of populations: {total_population}.")
print(f"Maximum district population: {max_population}.")
print(f"Minimum district population: {min_population}.")
print(f"Average district population: {avg_population}.")
Dictionaries#
It’s kind of worrying to have just a list of numbers without any label for what the numbers mean. Python has a data structure called a dictionary that is better suited to associating labels with values. As you will see, it is similar to a list, but instead of using a sequential index to access values, a dictionary stores key-value pairs, and we look up each value by its key.
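Before building a dictionary from the file, it may help to see the basic operations on a tiny one. This is only a sketch, and the city names and numbers are made up.

```python
# A small dictionary: each key (a string) is paired with a value (an int).
populations_by_city = {"City A": 100, "City B": 250}

# Look up a value by its key, rather than by a position like 0 or 1.
print(populations_by_city["City A"])    # 100

# Assigning to a new key adds a pair; assigning to an existing key replaces its value.
populations_by_city["City C"] = 75
populations_by_city["City A"] = 120

print(len(populations_by_city))         # 3
print("City B" in populations_by_city)  # True (in checks the keys)
print(populations_by_city)              # {'City A': 120, 'City B': 250, 'City C': 75}
```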
Using the district_data lines that we read from ED-Canada_2016.csv, let’s see how we can build a dictionary with this data instead of a list of populations:
# create an empty dictionary
district_populations_dict = {}
for line in district_data:
    entries = line.split(",")
    district_name = entries[0].strip()
    population_entry = entries[1].strip()
    population_int = int(population_entry)
    district_populations_dict[district_name] = population_int
print(district_populations_dict)
Now let’s use the dictionary to compute the same values as above#
# These are not great variable names, but I wanted to create new names rather
# than repeat the ones from above to avoid confusion.
num_districts = len(district_populations_dict)
total_population_dict = sum(district_populations_dict.values())
max_population_dict = max(district_populations_dict.values())
min_population_dict = min(district_populations_dict.values())
avg_population_dict = total_population_dict / num_districts
print(f"Number of population entries: {num_districts}.")
print(f"Sum of populations: {total_population_dict}.")
print(f"Maximum district population: {max_population_dict}.")
print(f"Minimum district population: {min_population_dict}.")
print(f"Average district population: {avg_population_dict}.")
Dictionary Keys#
We can also loop over a dictionary. In the example below, the variable district refers to a different key in the dictionary on each iteration of the loop. We could also have used district_populations_dict.keys() instead of district_populations_dict to loop over the same keys.
for district in district_populations_dict:
    print(district)
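The sketch below, on a made-up two-entry dictionary, shows that looping over the dictionary directly and looping over .keys() visit the same keys. It also shows the related .items() method, which is not used in this lecture and gives the key and the value together on each iteration.

```python
example_populations = {"District A": 100, "District B": 250}  # made-up data

# Looping over the dictionary itself visits each key.
keys_direct = []
for district in example_populations:
    keys_direct.append(district)

# Looping over .keys() visits the same keys in the same order.
keys_explicit = []
for district in example_populations.keys():
    keys_explicit.append(district)

# Looping over .items() gives each key together with its value.
pairs = []
for district, population in example_populations.items():
    pairs.append(f"{district}: {population}")

print(keys_direct)    # ['District A', 'District B']
print(keys_explicit)  # ['District A', 'District B']
print(pairs)          # ['District A: 100', 'District B: 250']
```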
Exercise Break#
Let’s put some of this together. Print the following message for each district that has Toronto in its name. Use district_populations_dict that was defined above.
f"{name} has population {district_populations_dict[name]}"
In English, what we want to do is check each dictionary key to see if the string Toronto is in the key. If it is, then we will print the message.
for district in district_populations_dict:
    if "Toronto" in district:
        population = district_populations_dict[district]
        print(f"{district} has population {population}")
Bridge to pandas#
From lists to data frames#
Python lists are very good at storing one-dimensional data (one “column” of data).
But in practice, we don’t just want to store or compute on one column—we want to compute on an entire table, containing multiple columns!
While it is possible to do this in Python lists, it’s more cumbersome to do so. It’s also possible to do this with Python dictionaries, but we’d have to do all the work.
So in practice, we use a new data type called DataFrame, which is used to store (two-dimensional) tabular data.
Importing pandas#
The DataFrame type doesn’t come with base Python, but instead as part of the pandas library for doing data science in Python.
To use a library in Python, we need to import it into our code.
You are going to start to see a lot of conventions. For example, you will almost always see pandas imported as below. There is nothing magical about pd, but when we all use it as a shorthand for pandas, it is easy to remember what it means.
import pandas as pd
This line of code loads the pandas library, calling it pd.
Now we can use the pd.read_csv function to read in this data!
district_df = pd.read_csv("ED-Canada_2016.csv", header=None)
district_df
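With header=None, pandas labels the columns 0 and 1 because the file has no header row. If you would rather have descriptive labels, pd.read_csv also accepts a names parameter. The sketch below assumes pandas is installed and uses made-up CSV text (via io.StringIO) in place of the real file. The column names "district" and "population" are choices made for this example, not part of the data.

```python
import io

import pandas as pd

# Made-up CSV text standing in for a file with no header row.
csv_text = "District A,100\nDistrict B,250\n"

example_df = pd.read_csv(
    io.StringIO(csv_text),             # behaves like an open file
    header=None,                       # the data has no header row
    names=["district", "population"],  # labels we choose for the two columns
)

print(example_df)
print(example_df["population"].sum())  # 350
```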
Preview: Creating your own data frame#
Next lecture, you’ll work with data frames in earnest. Some of the examples you’ll look at will involve creating small data frames “by hand”, since they’re easier to conceptualize than large data frames.
Creating a data frame manually consists of two steps.
Step 1: Create a dictionary of your data#
A dictionary is another type of Python collection that lets you create associated pairs of data. For us, a dictionary can map column names (strings) to column values (lists) in a table.
For example:
my_data = {
    "Name": ["Karen Reid", "Chunjiang Li", "Michael Moon"],
    "Age": [101, 18, 19]
}
my_data
Step 2: Turn the dictionary into a data frame#
Now, we can turn that dictionary into a pandas DataFrame:
my_data_frame = pd.DataFrame(my_data)
my_data_frame
Of course, if we wanted to we could combine the two steps into one larger statement. That’s more convenient, but a bit less explicit about what’s going on. Make sure you understand the previous two steps separately, and then study how they’re combined below!
my_data_frame2 = pd.DataFrame({
    "Name": ["Karen Reid", "Chunjiang Li", "Michael Moon"],
    "Age": [101, 18, 19]
})
my_data_frame2