Week 2: Reading a file and handling basic data types#
EEB125 | W2025#
Tomo Parins-Fukuchi#
Learning Objectives#
By the end of this lecture, you will be able to
work with basic Python data types to answer data science questions
open and read a data file using Python and Jupyter Notebook
specifically, a dataset capturing occurrences of fossils across Canada
parse through the data to extract useful information
Why Python?#
Python is a nice, general purpose programming language
It has excellent support for data science tools
It is easy to learn and sets you up for success with other tools
Techniques and concepts#
Ints vs floats
Operators
Manipulating strings
Variables
Lists
Looping
Tabular data
The most fossiliferous(?) Canadian province
Data types#
Programming languages specify different types of data
E.g., numbers and letters/words are often represented as different types
In Python, some basic types are:
integers (‘ints’): whole numbers (0,1,2)
floats: decimal numbers (e.g., 1.347)
strings: letters and words (e.g., ‘horse’, ‘cat’)
Integers#
Whole numbers (can be positive, negative, or zero)
type(23)
Floats#
Decimal values
type(23.54321)
Operators#
Python has built-in operators for performing operations on ints and floats
In addition to +, there are others: -, *, **, /, %
Let’s explore what they do:
2 / 3
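As a quick sketch of what each operator does (the numbers are arbitrary and chosen only for illustration):

```python
# Arithmetic operators applied to two example integers
total = 7 + 2         # addition
difference = 7 - 2    # subtraction
product = 7 * 2       # multiplication
power = 7 ** 2        # exponentiation (7 squared)
quotient = 7 / 2      # division: the result is a float, even for two ints
remainder = 7 % 2     # modulo: the remainder left over after division

print(total, difference, product, power, quotient, remainder)
print(type(quotient))
```

Note that / gives back a float even when both inputs are ints.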
Strings#
Words, letters, etc
type("michael jordan")
Variables#
Programming languages store data using variables
A programming variable represents a piece of data
Can generally be any type
Variables reference value(s) that we assign to them
We can then manipulate those variables to perform tasks
Example#
Create the variable ‘x’ and assign an integer value to it
x = 1
Example#
x is now stored in memory. We will make another and call it ‘y’
x = 1
y = 1
Example#
We can now use both of these variables to do some arithmetic:
x = 1
y = 1
x + y
Example#
We can assign pretty much any data type to a variable:
a = "spongebob "
b = "squarepants"
print(a)
print(b)
Example#
We also can use some of the same operators on other data types:
a = "spongebob "
b = "squarepants"
a+b
Reassigning to variables#
We can also change the value assigned to a variable:
b = "loserpants"
print(a+b)
# can even reassign a variable to itself
b = b
print(a+b)
a = a + b
print(a)
‘Print’ statement#
Often, we may want to see the results of some operation in our notebook or computer terminal
We can use the “print” function in Python to do this
print("whoever thought that i would be the greatest growing up?")
Example#
Adding two strings basically slams both together. Can we add a string and a number?
#a+x
Nope!#
Interpreting errors is one of the most important parts of programming
This is telling us that we cannot mix these data types when adding
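A minimal sketch of this error, and one possible way around it. The try/except wrapper is only here so the example keeps running after the error; converting with str() is covered later in this lecture.

```python
a = "spongebob "
x = 1

try:
    a + x   # adding a string and an int is not allowed
except TypeError as err:
    message = str(err)
    print("TypeError:", message)

# One option: convert the number to a string before adding
combined = a + str(x)
print(combined)
```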
Other errors#
We have to be very careful with what we tell Python. It is very particular.
E.g.:
real = 33
# print(rea)
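Uncommenting that line raises a NameError, because Python has never seen a variable called rea. A small sketch that catches the error so we can look at it (the try/except wrapper is an addition for illustration):

```python
real = 33

try:
    print(rea)   # typo: the variable is actually called 'real'
except NameError as err:
    error_name = type(err).__name__
    print(error_name, "-", err)
```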
Other rules for variables#
variable names can contain only letters, numbers, underscores
cannot start with a number
no spaces
don’t use Python keywords (e.g., for) or the names of built-in functions (e.g., print)
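These rules can be checked with tools from the standard library. The sketch below uses the string method isidentifier() and the keyword module; the candidate names are hypothetical and exist only to illustrate each rule.

```python
import keyword

# Hypothetical candidate names, chosen only to illustrate the rules
candidates = ["province_recs", "2nd_fossil", "fossil count", "for", "print"]

results = {}
for name in candidates:
    legal = name.isidentifier()          # allowed characters, no leading digit, no spaces
    reserved = keyword.iskeyword(name)   # True for reserved words such as 'for'
    results[name] = (legal, reserved)
    print(name, legal, reserved)
```

Here print passes both checks because it is a built-in function rather than a reserved keyword. Assigning to it is legal, but it overwrites the function, so it is still best avoided.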
What else can we do with strings?#
Python has many built-in tools for manipulating strings. Let’s explore some:
# make uppercase
test = " what's the deal "
#" what's the deal ".upper()
print(test.upper())
# remove whitespace
test = " what's the deal "
print(test.strip())
# do both
test = " what's the deal "
print(test.upper().strip())
# do both and reassign to original variable
test = " what's the deal "
print(test)
test = test.upper().strip()
print(test)
# replace part of a string
print(test)
test = test.replace("WHAT'S","THIS IS")
print(test)
Converting between types#
In some cases, Python will also allow us to convert between data types
# we can convert a float to a string:
a = 3.21
a = str(a)
type(a)
# now, if we try to treat it like it is a float, we will get into trouble:
#print(a+3.3)
# so we can also convert it back:
a = float(a)
print(a)
print(a+3.3)
Be careful with types#
This flexibility allows us to be very sloppy with types
Not always obvious what type a variable will be
Be careful
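A short sketch of how the same operator can behave differently depending on type (the variable names are made up for this example):

```python
n_text = "3"   # a string that looks like a number
n_int = 3      # an actual integer

text_sum = n_text + n_text   # strings are joined end to end
int_sum = n_int + n_int      # integers are added

print(text_sum, type(text_sum))
print(int_sum, type(int_sum))

# Converting explicitly makes the intent clear
as_number = int(n_text) + n_int
print(as_number)
```

Checking with type() before doing arithmetic is a cheap way to catch this kind of mix-up.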
Containers#
Besides the three we have examined, Python has many other data types
Some of these we can refer to as ‘containers’
These allow us to store many values within a single variable
Lists#
Lists are a very common Python container
We can specify a list using square brackets
my_list = []
Generates an empty list. We can create a list with things in it:
emcees = ["cole","rocky","21","drake"]
print(emcees)
Lists#
We can also add items to an existing list:
emcees.append("kendrick")
print(emcees)
Indexing lists#
We can select individual items from a list by ‘indexing’ it
# the first item of any Python sequence is accessed as the zeroth item
#print(emcees)
print(emcees[0])
print(emcees[1])
# can also index from the other end
print(emcees[-2])
Slicing lists#
We can select ranges of items from a list by ‘slicing’ it
print(emcees)
print(emcees[1:3])
print(emcees)
print(emcees[1:])
print(emcees)
print(emcees[:3])
# lists also have tools associated with them
# how many times does "rocky" appear in the list emcees?
print(emcees.count("rocky"))
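To summarize indexing and slicing in one place, here is a self-contained sketch that rebuilds the list so it can be run on its own. The main thing to notice is that the end index of a slice is not included.

```python
emcees = ["cole", "rocky", "21", "drake", "kendrick"]

first = emcees[0]          # index 0 is the first item
last = emcees[-1]          # index -1 is the last item
middle = emcees[1:3]       # items at index 1 and 2; index 3 is excluded
from_second = emcees[1:]   # everything from index 1 to the end
first_three = emcees[:3]   # everything before index 3

print(first, last)
print(middle)
print(from_second)
print(first_three)
```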
Creating lists from strings#
Python also has built-in tools for creating a list of smaller strings from a longer string
emcees_str = "cole, rocky, 21, drake, kendrick"
#print(emcees_str)
emcees_ls = emcees_str.split(",")
print(emcees_ls)
Creating strings from lists#
We can also join the elements of a list composed of all strings into a larger string
print('string')
emcees_str2 = ",".join(emcees_ls)
print(emcees_str2)
print(type(emcees_str2))
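One detail worth noticing: the separator passed to split() is removed exactly as written, so splitting on "," leaves the space after each comma attached to the next item. A small sketch comparing two separators:

```python
emcees_str = "cole, rocky, 21, drake, kendrick"

by_comma = emcees_str.split(",")          # the space after each comma is kept
by_comma_space = emcees_str.split(", ")   # comma and space are both removed

print(by_comma)
print(by_comma_space)

# Joining with the same separator rebuilds the original string
rejoined = ", ".join(by_comma_space)
print(rejoined == emcees_str)
```

When the spacing in a file is inconsistent, splitting on "," and then calling strip() on each item is usually the safer choice.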
Other tools#
Python has many built-in tools for dealing with a variety of data types
# how long is a list?
#print(len(emcees_ls))
# how long is a string?
print(len(emcees_str))
Reviewing#
Python represents data using different types
ints, floats, strings
We can assign data to variables
This allows us to commit it to memory and perform operations later
We can store data in ‘containers’
We can access data stored in containers by indexing and slicing
Containers can also be assigned to variables
Looping#
Much of computing fundamentally involves performing an operation on one piece of data at a time
Given a collection of data, we can examine one item at a time by constructing a loop
For loops#
One way of looping over data in Python is by using a ‘for loop’
print(emcees_ls)
for z in emcees_ls:
    print(z)
print(z)
print(emcees_ls)
print(emcees_ls)
for mc in emcees_ls:
    print(len(mc))
    print(mc.strip().upper())
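Printing inside a loop shows the results but does not keep them. A common pattern, sketched here with a self-contained copy of the list, is to start with an empty list and append one cleaned item per pass through the loop:

```python
emcees_ls = ["cole", " rocky", " 21", " drake", " kendrick"]

cleaned = []                    # start with an empty list
for mc in emcees_ls:
    tidy = mc.strip().upper()   # remove whitespace, then make uppercase
    cleaned.append(tidy)        # save the result for later use

print(cleaned)
```

The indented lines are the body of the loop and run once per item. The final print is not indented, so it runs once, after the loop has finished.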
Tabular data#
Much of the data that we will work with is in ‘tabular’ format
Data contained within rows and columns
Sort of like an Excel spreadsheet
Tabular data#
We are often used to seeing data in table form:
| Team | # of Cups | Last cup | Country |
|---|---|---|---|
| Canadiens | 23 | 1993 | Canada |
| Maple Leafs | 13 | 1967 | Canada |
| Red Wings | 11 | 2008 | US |
| Bruins | 6 | 2011 | US |
| Blackhawks | 6 | 2015 | US |
| Oilers | 5 | 1990 | Canada |
| Penguins | 5 | 2017 | US |
Tabular data#
A common way of storing such data for programming is by separating each cell by a pre-specified character. Commas are common:
team,nCups,lastCup,country
Canadiens,23,1993,Canada
MapleLeafs,13,1967,Canada
RedWings,11,2008,US
Bruins,6,2011,US
Blackhawks,6,2015,US
Oilers,5,1990,Canada
Penguins,5,2017,US
We often refer to this as a “comma-separated values” (csv) file
Columns are separated by commas
The first line usually contains a guide to the data (the ‘header’)
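As a rough sketch of how this format maps onto the tools from earlier in the lecture, the example below stores a few rows of the table in a string and pulls them apart with split(). Storing the text in a triple-quoted string is only a stand-in for a real file.

```python
# A few rows of the table above, stored as text instead of a file
csv_text = """team,nCups,lastCup,country
Canadiens,23,1993,Canada
MapleLeafs,13,1967,Canada
RedWings,11,2008,US"""

rows = csv_text.split("\n")     # one string per line
header = rows[0].split(",")     # the first line names the columns

total_cups = 0
for row in rows[1:]:
    cells = row.split(",")      # one string per cell
    n_cups = int(cells[1])      # cells are strings until we convert them
    total_cups = total_cups + n_cups
    print(cells[0], n_cups)

print(header)
print(total_cups)
```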
Reading and writing data#
We often want to work with data stored in external files
This requires us to read files into our active memory so we can perform analyses
We may also want to write new files to save modified data or analyses
We can do this using the open function in Python
print(open)
‘Calling’ (using) a function#
To actually use a function, we add parentheses after the function name:
open()
#open()
Calling open()#
Functions require ‘arguments’: information we provide that is necessary to perform the function
We are being told that our call needs to specify a 'file'
We need to tell Python where to find the file we want to open
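A small sketch that triggers this error on purpose and catches it, so the message can be read without stopping the notebook (the try/except wrapper is an addition for illustration; the exact wording of the message may differ between Python versions):

```python
caught = False
try:
    open()   # no arguments: Python does not know which file we mean
except TypeError as err:
    caught = True
    print("TypeError:", err)
```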
Today’s Data#
We will be using data from the Paleobiology Database (https://paleobiodb.org/#/)
Public database of fossil occurrences all over the world
What species did a fossil come from?
Where did it occur?
When did it occur?
Reading a file#
We can use open to read a file that is stored in the current working directory and save it to a variable:
file = open("TESTFILE.csv")
file = open("pbdb_data.csv","r")
print(file)
Reading a file#
A file object that gives us access to the file's contents is now assigned to the variable file
How can we make this information human-readable?
lines = file.readlines() will read all of the lines of text contained within the file and assign them to the variable lines
lines = file.readlines()
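An alternative worth knowing is to open the file inside a with block, which closes the file automatically when the block ends. The sketch below first writes a tiny stand-in file so that it can run anywhere; the file name and its contents are made up and are not the course dataset.

```python
import os
import tempfile

# Write a tiny stand-in file (name and contents are invented for this example)
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "example.csv")
with open(path, "w") as out_file:
    out_file.write("name,province\nfossil_a,Alberta\nfossil_b,Ontario\n")

# 'with' closes the file automatically once the indented block ends
with open(path, "r") as in_file:
    lines = in_file.readlines()

print(len(lines))
print(lines)
```

Each string in lines still ends with a newline character, which is why strip() is useful when processing lines.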
Reading a file#
We now have our information in a way we can interpret
The lines of the file are stored as strings contained within a list
The first line is a “header”: it describes the data
How can we examine this header?
header = lines[0]
print(header)
# assign the rest of the data to a variable called data
data = lines[1:]
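Once the header is split into column names, we can look up where a column sits instead of counting by eye. This is a minimal sketch: the lines and column names below are invented stand-ins, not the real header of the course data file.

```python
# Invented stand-in for the list returned by readlines()
lines = [
    "name,age,province\n",
    "fossil_a,Cretaceous,Alberta\n",
    "fossil_b,Devonian,Ontario\n",
]

header = lines[0].strip().split(",")      # list of column names
province_col = header.index("province")   # position of the column we want

first_record = lines[1].strip().split(",")
print(header)
print(province_col)
print(first_record[province_col])
```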
Which Canadian province is the most fossil-rich?#
We have the lines of our data stored in a list
That means we can loop over the lines and extract data one line at a time
# create an empty list that we will populate with data
province_recs = []
for line in lines[1:]:
    line_dat = line.strip().split(",")
    province = line_dat[-1].strip()
    province_recs.append(province)
Quiz#
What type is province_recs?
What information does it contain from our data file?
print(province_recs)
One at a time#
We could look up the number of fossils one province/territory at a time
Count the number of times that province/territory occurs in the list using .count()
province = "Ontario"
province_recs.count(province)
Brainstorm#
How else might we do this? Is there a more efficient way to get results for every province?
province_ls = ["Ontario","British Columbia","Alberta","Saskatchewan","Manitoba","Newfoundland and Labrador","Northwest Territories","Yukon","Prince Edward Island","Nunavut","Nova Scotia","Quebec","New Brunswick"]
print(province_ls)
for i in province_ls:
    print(i,province_recs.count(i))
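To answer the original question directly, one option is to keep track of the largest count seen so far while looping. The sketch below uses a short made-up list of records in place of the real data, and an if statement to compare counts.

```python
# Made-up records standing in for the real province_recs list
province_recs = ["Alberta", "Ontario", "Alberta", "Quebec", "Alberta", "Ontario"]
province_ls = ["Ontario", "Alberta", "Quebec"]

top_province = ""
top_count = 0
for prov in province_ls:
    n = province_recs.count(prov)   # number of records for this province
    if n > top_count:               # keep the largest count seen so far
        top_province = prov
        top_count = n

print(top_province, top_count)
```

If two provinces tie, this version keeps whichever one appears first in province_ls.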