Week 2: Reading a file and handling basic data types#

EEB125 | W2025#

Tomo Parins-Fukuchi#

Learning Objectives#

By the end of this lecture, you will be able to

  • work with basic Python data types to answer data science questions

  • open and read a data file using Python and Jupyter Notebook

    • specifically, a dataset capturing occurrences of fossils across Canada

  • parse through the data to extract useful information

Why Python?#

  • Python is a nice, general purpose programming language

  • It has excellent support for data science tools

  • It is easy to learn and sets you up for success with other tools

Techniques and concepts#

  • Ints vs floats

  • Operators

  • Manipulating strings

  • Variables

  • Lists

  • Looping

  • Tabular data

  • The most fossiliferous(?) Canadian province

Data types#

Programming languages specify different types of data

  • E.g., numbers and letters/words are often represented as different types

  • In Python, some basic types are:

    • integers (‘ints’): whole numbers (e.g., 0, 1, 2)

    • floats: decimal numbers (e.g., 1.347)

    • strings: letters and words (e.g., ‘horse’, ‘cat’)

Integers#

  • Whole numbers (can be positive or negative)

type(23)

Floats#

  • Decimal values

type(23.54321)

Operators#

  • Python has built-in operators for performing operations on ints and floats

  • In addition to +, there are others:

    • -,*,**,/,%

  • Let’s explore what they do:

2 / 3
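As a quick sketch of what each of these operators does, the cell below applies them to the same pair of numbers. The values 7 and 2 and the variable names are arbitrary choices for illustration.

```python
# arbitrary example values
first = 7
second = 2

difference = first - second   # subtraction
product = first * second      # multiplication
power = first ** second       # exponentiation (7 to the power of 2)
quotient = first / second     # division, which always returns a float
remainder = first % second    # modulo: the remainder left after division

print(difference, product, power, quotient, remainder)
```

Note that / returns a float even when both inputs are ints.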

Strings#

  • Words, letters, etc

type("michael jordan")

Variables#

Programming languages store data using variables

  • A programming variable represents a piece of data

    • Can generally be any type

  • Variables reference value(s) that we assign to them

  • We can then manipulate those variables to perform tasks

Example#

Create the variable ‘x’ and assign an integer value to it

x = 1

Example#

x is now stored in memory. We will make another and call it ‘y’

x = 1
y = 1

Example#

We can now use both of these variables to do some arithmetic:

x = 1
y = 1
x + y

Example#

We can assign pretty much any data type to a variable:

a = "spongebob "
b = "squarepants"
print(a)
print(b)

Example#

We also can use some of the same operators on other data types:

a = "spongebob "
b = "squarepants"
a+b

Reassigning to variables#

We can also change the value assigned to a variable:

b = "loserpants"
print(a+b)
# reassigning a variable to itself is allowed, but changes nothing

b = b
print(a+b)
# we can also use a variable's current value when reassigning it
a = a + b
print(a)

Example#

Adding two strings basically slams both together. Can we add a string and a number?

#a+x

Nope!#

Interpreting errors is one of the most important parts of programming

  • This is telling us that we cannot mix these data types when adding
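A minimal sketch of this error and one way around it is shown below. It uses try/except, which we have not covered yet, only so that the cell keeps running after the error. The values of a and x are the same as in the earlier examples.

```python
a = "spongebob "
x = 1

# adding a string and an int raises a TypeError
try:
    a + x
except TypeError as err:
    print("TypeError:", err)

# converting the int to a string first makes the addition valid
combined = a + str(x)
print(combined)
```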

Other errors#

  • We have to be very careful with what we tell Python. It is very particular.

  • E.g.:

real = 33
# print(rea)

Other rules for variables#

  • variable names can contain only letters, numbers, underscores

  • cannot start with a number

  • no spaces

  • don’t use Python keywords (e.g., for, if), and avoid reusing the names of built-in functions (e.g., print)
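If you are unsure whether a name is allowed, Python can check it for you. The sketch below uses the string method isidentifier() and the standard-library keyword module. The candidate names are made up for illustration.

```python
import keyword

# hypothetical candidate names, chosen to illustrate each rule
candidates = ["fossil_count", "fossil2", "2fossil", "fossil count", "for"]

results = []
for name in candidates:
    # isidentifier() checks the characters used and the first character
    # keyword.iskeyword() checks whether the name is reserved by Python
    usable = name.isidentifier() and not keyword.iskeyword(name)
    results.append(usable)
    print(name, usable)
```

Only the first two names pass: the third starts with a number, the fourth contains a space, and the fifth is a keyword.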

What else can we do with strings?#

  • Python has many built-in tools for manipulating strings. Let’s explore some:

# make uppercase 

test = "    what's the deal    "

#"    what's the deal    ".upper()

print(test.upper())
# remove whitespace
test = "    what's the deal    "
print(test.strip())
# do both
test = "    what's the deal    "
print(test.upper().strip())
# do both and reassign to original variable
test = "    what's the deal    "
print(test)
test = test.upper().strip()
print(test)
# replace part of a string

print(test)
test = test.replace("WHAT'S","THIS IS")
print(test)

Converting between types#

  • In some cases, Python will also allow us to convert between data types

# we can convert a float to a string:

a = 3.21
a = str(a)
type(a)
# now, if we try to treat it like it is a float, we will get into trouble:

#print(a+3.3)
# so we can also convert it back:

a = float(a)
print(a)
print(a+3.3)

Be careful with types#

  • This flexibility allows us to be very sloppy with types

  • Not always obvious what type a variable will be

  • Be careful
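A small sketch of how this can go wrong: the same + operator gives very different results depending on the type. The variable names here are made up for illustration.

```python
count_as_text = "3"   # a string that happens to look like a number
count_as_int = 3

text_sum = count_as_text + count_as_text   # strings are joined together
int_sum = count_as_int + count_as_int      # ints are added
print(text_sum)
print(int_sum)

# checking the type first, then converting, avoids the surprise
print(type(count_as_text))
converted_sum = int(count_as_text) + count_as_int
print(converted_sum)
```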

Containers#

  • Besides the three we have examined, Python has many other data types

  • Some of these we can refer to as ‘containers’

    • These allow us to store many values within a single variable

Lists#

  • Lists are a very common Python container

  • We can specify a list using square brackets

my_list = []

This generates an empty list. We can also create a list with things in it:

emcees = ["cole","rocky","21","drake"]
print(emcees)

Lists#

We can also add items to an existing list:

emcees.append("kendrick")
print(emcees)

Indexing lists#

  • We can select individual items from a list by ‘indexing’ it

# the first item of a list (or any Python sequence) is at index 0
#print(emcees)
print(emcees[0])
print(emcees[1])
# can also index from the other end
print(emcees[-2])

Slicing lists#

  • We can select ranges of items from a list by ‘slicing’ it

print(emcees)
print(emcees[1:3])
print(emcees)
print(emcees[1:])
print(emcees)
print(emcees[:3])
# lists also have tools associated with them
# how many times does "rocky" appear in the list emcees?
print(emcees.count("rocky"))

Creating lists from strings#

  • Python also has built-in tools for creating a list of smaller strings from a longer string

emcees_str = "cole, rocky, 21, drake, kendrick"
#print(emcees_str)
emcees_ls = emcees_str.split(",")
print(emcees_ls)
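One detail worth noticing in the output: split(",") cuts only at the comma, so the space that followed each comma stays attached to the next item. A minimal sketch comparing two separators (the names with_spaces and without_spaces are made up for illustration):

```python
emcees_str = "cole, rocky, 21, drake, kendrick"

# splitting on "," keeps the space that followed each comma
with_spaces = emcees_str.split(",")
print(with_spaces)

# splitting on ", " (comma plus space) removes it
without_spaces = emcees_str.split(", ")
print(without_spaces)
```

This only works when every comma is followed by exactly one space. Using .strip() on each item is the more general fix.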

Creating strings from lists#

  • We can also join the elements of a list composed of all strings into a larger string

print('string')
emcees_str2 = ",".join(emcees_ls)
print(emcees_str2)
print(type(emcees_str2))

Other tools#

  • Python has many built-in tools for dealing with a variety of data types

# how long is a list?
#print(len(emcees_ls))
# how long is a string?
print(len(emcees_str))

Reviewing#

  • Python represents data using different types

    • ints, floats, strings

  • We can assign data to variables

    • This allows us to commit it to memory and perform operations later

  • We can store data in ‘containers’

    • We can access data stored in containers by indexing or slicing

    • Containers can also be assigned to variables

Looping#

  • Much of computing fundamentally involves performing an operation on one piece of data at a time

  • Given a collection of data, we can examine one item at a time by constructing a loop

For loops#

  • One way of looping over data in Python is by using a ‘for loop’

print(emcees_ls)

for z in emcees_ls:
    print(z)

print(emcees_ls)

for mc in emcees_ls:
    print(len(mc))
    print(mc.strip().upper())
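Printing inside a loop shows the result but does not save it. A common pattern, sketched below, is to start with an empty list and append to it on each pass through the loop. The list is written out here so the cell runs on its own, and the name cleaned is made up for illustration.

```python
# same contents as emcees_ls after splitting on ","
emcees_ls = ["cole", " rocky", " 21", " drake", " kendrick"]

# start with an empty list, then add one cleaned-up item per pass
cleaned = []
for mc in emcees_ls:
    cleaned.append(mc.strip().upper())

print(cleaned)
```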

Tabular data#

Much of the data that we will work with is in ‘tabular’ format

  • Data contained within rows and columns

  • Sort of like an Excel spreadsheet

Tabular data#

We are often used to seeing data in table form:

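| Team        | # of Cups | Last cup | Country |
|-------------|-----------|----------|---------|
| Canadiens   | 23        | 1993     | Canada  |
| Maple Leafs | 13        | 1967     | Canada  |
| Red Wings   | 11        | 2008     | US      |
| Bruins      | 6         | 2011     | US      |
| Blackhawks  | 6         | 2015     | US      |
| Oilers      | 5         | 1990     | Canada  |
| Penguins    | 5         | 2017     | US      |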

Tabular data#

  • A common way of storing such data for programming is by separating each cell by a pre-specified character. Commas are common:

team,nCups,lastCup,country
Canadiens,23,1993,Canada
MapleLeafs,13,1967,Canada
RedWings,11,2008,US
Bruins,6,2011,US
Blackhawks,6,2015,US
Oilers,5,1990,Canada
Penguins,5,2017,US
  • We often refer to this as a “comma-separated values” (csv) file

  • Columns are separated by commas

  • The first line usually contains a guide to the data (the ‘header’)
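Using only the string and list tools from earlier, we can already pull this format apart. The sketch below stores the csv text in a triple-quoted string (a way of writing a string that spans several lines) instead of reading it from a file. The names csv_text, rows, cells, teams, and total_cups are made up for illustration.

```python
csv_text = """team,nCups,lastCup,country
Canadiens,23,1993,Canada
MapleLeafs,13,1967,Canada
RedWings,11,2008,US
Bruins,6,2011,US
Blackhawks,6,2015,US
Oilers,5,1990,Canada
Penguins,5,2017,US"""

# one string per row
rows = csv_text.split("\n")

# the header row names the columns
header = rows[0].split(",")
print(header)

teams = []
total_cups = 0
for row in rows[1:]:
    cells = row.split(",")
    teams.append(cells[0])
    # every cell starts out as a string, so convert before adding
    total_cups = total_cups + int(cells[1])

print(teams)
print(total_cups)
```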

Reading and writing data#

We often want to work with data stored in external files

  • This requires us to read files into our active memory so we can perform analyses

  • We may also want to write new files to save modified data or analyses

  • We can do this using the open function in Python

print(open)

‘Calling’ (using) a function#

To actually use a function, we add parentheses after the function name:

open()

#open()

Calling open()#

Functions often require ‘arguments’: information we provide that is necessary to perform the function

  • We are being told that our call needs to specify a 'file'

  • We need to tell Python where to find the file we want to open

Today’s Data#

  • We will be using data from the Paleobiology Database (https://paleobiodb.org/#/)

  • Public database of fossil occurrences all over the world

    • What species did a fossil come from?

    • Where did it occur?

    • When did it occur?

Reading a file#

We can use open to read a file that is stored in the current working directory and save it to a variable:

file = open("TESTFILE.csv")

file = open("pbdb_data.csv","r")
print(file)

Reading a file#

The information from the file is now assigned to the variable file

  • How can we make this information human-readable?

lines = file.readlines()

This reads all of the lines of text contained within the file and assigns them to the variable lines

Reading a file#

We now have our information in a way we can interpret

  • The lines of the file are stored as strings contained within a list

  • The first line is a “header”: it describes the data

  • How can we examine this header?

header = lines[0]
print(header)
# assign the rest of the data to a variable called data
data = lines[1:]
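The whole sequence (open, read, separate the header from the data) can be sketched in a cell that runs without pbdb_data.csv. It first writes a tiny stand-in file to a temporary folder. The file name example.csv, its two columns, and its rows are hypothetical and do not match the real dataset.

```python
import os
import tempfile

# hypothetical stand-in data; the real pbdb_data.csv has different columns
example_text = "genus,province\nGenusA,Alberta\nGenusB,Ontario\n"

# write the stand-in data to a temporary file so there is something to read
path = os.path.join(tempfile.mkdtemp(), "example.csv")
out_file = open(path, "w")
out_file.write(example_text)
out_file.close()

# open the file for reading and pull every line into a list of strings
file = open(path, "r")
lines = file.readlines()
# reading again returns an empty list: the file has already been read to the end
second_read = file.readlines()
file.close()

header = lines[0]
data = lines[1:]
print(header.strip())
print(len(data))
print(second_read)
```

Each string in lines still ends with a newline character, which is why .strip() is useful before printing or splitting.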

Which Canadian province is the most fossil-rich?#

  • We have the lines of our data stored in a list

  • That means we can loop over the lines and extract data one line at a time

# create an empty list that we will populate with data
province_recs = []

for line in lines[1:]:
    line_dat = line.strip().split(",")
    province = line_dat[-1].strip()
    province_recs.append(province)

Quiz#

  • What type is province_recs?

  • What information does it contain from our data file?

print(province_recs)

One at a time#

  • We could look up the number of fossils one province/territory at a time

  • Count the number of times that province/territory occurs in the list using .count()

province = "Ontario"
province_recs.count(province)

Brainstorm#

  • How else might we do this? Is there a more efficient way to get results for every province?

province_ls = ["Ontario","British Columbia","Alberta","Saskatchewan","Manitoba","Newfoundland and Labrador","Northwest Territories","Yukon","Prince Edward Island","Nunavut","Nova Scotia","Quebec","New Brunswick"]
print(province_ls)
for i in province_ls:
    print(i,province_recs.count(i))
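One possible alternative, not the approach used in this lecture, is Counter from Python's standard-library collections module. It tallies every distinct value in a single pass, so we do not need to type out the province names in advance. The short list below is a hypothetical stand-in for province_recs so that the cell runs on its own.

```python
from collections import Counter

# hypothetical stand-in for province_recs
province_recs = ["Alberta", "Ontario", "Alberta", "Quebec", "Alberta", "Ontario"]

# tally how many times each distinct value appears
counts = Counter(province_recs)

# most_common() returns (value, count) pairs, largest count first
for province, n in counts.most_common():
    print(province, n)
```

Calling .count() inside a loop rereads the whole list once per province. That is fine for 13 provinces and territories, but it becomes slow when there are many categories.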

END#