EEB125 Lecture 7: Introduction to Pandas#
Feb 26, 2025#
Karen Reid#
Introducing pandas#
So far this semester, you’ve worked in base Python, using only the data types, functions, and methods that are built into Python.
For the next few weeks, we’ll learn how to use one of the most common libraries for doing data science in Python: pandas.
What is pandas?#
pandas “is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool,
built on top of the Python programming language.”
Today, we’ll learn how to use pandas to:
Read in a dataset from a CSV file
Identify, use, and differentiate two new Pandas data types, DataFrame and Series
Describe the properties of a dataset representation in Pandas
Inspect parts of a large dataset
Perform simple data cleaning and data transformation operations on a dataset
Compute some summary statistics on a dataset
Importing pandas#
Because pandas doesn’t come built-in with Python, we need to import it to be able to use it in our code.
This is done with a Python statement called an import statement.
import pandas
Common alternate: import with a renaming (“nickname”):
import pandas as pd
Reading in data from a CSV file#
Using pandas, we can read in data from a csv file using the read_csv function.
species_data = pd.read_csv('PanTHERIA_WR05_Aug2008.csv')
Let’s explore: what is species_data?
species_data
type(species_data)
Formally, species_data is a DataFrame, which is a custom data type defined by pandas to represent tabular data.
Exploring DataFrames#
We can use the DataFrame.head() method to quickly see the first few rows.
species_data.head()
We use the .shape attribute to obtain the number of rows and columns of a DataFrame.
An attribute is like a method, but it just stores a piece of data, and is not a function.
You do not write parentheses after an attribute name.
species_data.shape
We can access just the number of rows or columns by using indexing on .shape (with square brackets), just like with lists.
num_rows = species_data.shape[0]
num_cols = species_data.shape[1]
print(f"There are {num_rows} rows and {num_cols} columns in the dataset.")
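To make the attribute-versus-method distinction concrete, here is a small sketch using a toy DataFrame (the column names here are made up for illustration, not taken from the PanTHERIA data):

```python
import pandas as pd

# A small toy DataFrame for illustration.
toy = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})

print(toy.shape)    # an attribute: no parentheses -> (3, 2)
print(toy.head(2))  # a method: parentheses required; returns the first 2 rows
```

Writing `toy.shape()` would raise an error, because a tuple is not callable; writing `toy.head` without parentheses gives you the method itself rather than calling it.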
DataFrame columns properties#
One of the most important properties of a DataFrame is its columns.
Each column has two important pieces of “metadata”:
the column’s name
the column’s type (i.e., the type of data stored in that column)
We can see the column names by accessing the .columns attribute of a DataFrame.
species_data.columns
The .columns attribute has a special type called Index, which is like a list.
You don’t need to worry about what Index is exactly, but if you want you can convert it into a list:
list(species_data.columns)
We can access column types by using the .dtypes attribute:
species_data.dtypes
Pandas uses its own custom data types to represent large datasets efficiently. They typically correspond to Python’s built-in data types.
For example:
float64 corresponds to a Python float
object is a special dtype that means “any value”
Note: by default, Pandas reads in text column data as object, not string.
We’ll see how to improve this later this lecture.
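As a small illustration of this default (using a toy Series rather than the PanTHERIA data), text data comes in with the generic object dtype, while convert_dtypes infers the more specific string dtype:

```python
import pandas as pd

# Text data gets the generic "object" dtype by default...
text = pd.Series(["lion", "tiger", "bear"])
print(text.dtype)  # object

# ...but convert_dtypes() infers the more specific "string" dtype.
converted = text.convert_dtypes()
print(converted.dtype)  # string
```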
Finally, we can use the DataFrame.info() method to display all of the previous information and more:
species_data.info()
Summary#
Given a DataFrame, we can access the following attributes/methods to obtain information about it.
| Attribute/Method | Description |
|---|---|
| `.shape` | (number of rows, number of columns) |
| `.columns` | column names |
| `.dtypes` | column names and types |
| `.info()` | all of the above, and more (e.g. non-null counts) |
| `.head()` | display the first few rows of the DataFrame |
Data Wrangling: Columns#
In data science, data wrangling (sometimes called data cleaning) is the process of turning raw data into a format more suitable for subsequent computation, analysis, and visualization.
There are many different types of data wrangling, but for now we’ll look at four techniques centred on columns:
renaming columns
converting column types
identifying and replacing “invalid” values
extracting a subset of columns to work with
Renaming columns#
We rename columns by using the DataFrame.rename(columns=...) method, where we pass in a dictionary mapping “original column name” to “new column name”.
old_to_new = {
'MSW05_Genus': 'Genus',
'MSW05_Species': 'Species',
'1-1_ActivityCycle': 'Activity Cycle',
'5-1_AdultBodyMass_g': 'Adult Body Mass (g)',
'2-1_AgeatEyeOpening_d': 'Age at Eye Opening (days)',
'17-1_MaxLongevity_m': 'Max Longevity (months)'
}
species_data_renamed = species_data.rename(columns=old_to_new)
species_data_renamed.head()
Converting column types#
We can also ask Pandas to automatically choose the best column types for an existing DataFrame.
This is done with the DataFrame.convert_dtypes() method.
species_data_converted = species_data_renamed.convert_dtypes()
species_data_converted.dtypes
Identifying and replacing “missing” values#
The PanTHERIA dataset uses a special value, -999, to represent missing or unknown data.
Instead of leaving these values in our DataFrame, we’ll replace them with a special pandas value called NA.
species_data_with_na = species_data_converted.replace(-999, pd.NA)
species_data_with_na.head()
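After this replacement, one quick way to check how much data is missing is to count the NA values in each column with `.isna().sum()`. A minimal sketch with a toy DataFrame (the column names and values are made up; the real dataset is much larger):

```python
import pandas as pd

# Toy data using -999 as the "missing" sentinel, like PanTHERIA.
toy = pd.DataFrame({"mass": [-999, 500.0, -999],
                    "longevity": [120.0, -999, 240.0]})
toy_with_na = toy.replace(-999, pd.NA)

# .isna() marks each NA as True; .sum() counts the Trues per column.
print(toy_with_na.isna().sum())  # mass: 2, longevity: 1
```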
Extracting a subset of columns#
Sometimes our full dataset contains too much information, and we only care about a subset of the data.
One common occurrence is when we only want a subset of the columns in a dataset.
For example, suppose we only care about the genus, species, body mass, and longevity of each species in our dataset.
Extracting a subset of columns#
We select a subset of columns in two steps:
Define a list containing the column names that we want to select.
Use square bracket “lookup” syntax on a
DataFrame, with the list inside the square brackets.
columns_to_keep = [
'Genus',
'Species',
'Adult Body Mass (g)',
'Max Longevity (months)'
]
species_data_final = species_data_with_na[columns_to_keep]
species_data_final.head()
Data Transformation: computing on columns#
A typical step in analysis of a dataset is to perform computations on individual columns, or operations that combine columns in some way.
For example:
Add 1 to each value in a column
Multiply the values in two columns together
“Find and Replace” values in a column
Retrieving a column by name#
We can extract a single column from a DataFrame using square brackets with a single string instead of a list of strings.
masses = species_data_final['Adult Body Mass (g)']
masses
But what exactly is masses?
type(masses)
masses is a Series, which is a pandas data type that represents a single column of data.
A Series is similar to a DataFrame, but it can only hold one “series” of data, rather than storing a whole table.
But most of the descriptive attributes/methods we learned for DataFrames can be applied to Series as well:
masses.shape
masses.dtype
masses.info()
# We can even obtain the original column name from the Series
masses.name
But if Series are a simplified version of DataFrames, why bother with them?
Because we can perform computations on Series “one element at a time”, without needing to use for loops!
Example: transform a single Series#
Goal: Given the species masses, convert to kg by dividing each one by 1000 and rounding to one decimal place.
Example: for a single mass like 492714.47, we’d compute
round(492714.47 / 1000, 1) # 492.7
But we want to do this for every mass!
masses_kg = masses / 1000
masses_kg_rounded = masses_kg.round(1)
masses_kg_rounded
Example: combine two Series#
Now let’s consider another problem: we’ll calculate the ratio between the longevity and mass of each species.
Example: for Camelus dromedarius, we’ll compute
480.0 / 492714.47
But again, we want to do this for each species!
masses = species_data_final["Adult Body Mass (g)"]
longevities = species_data_final["Max Longevity (months)"]
longevities / masses
Adding a new column to a DataFrame#
In addition to creating new variables to store computed Series, it is common to modify an existing DataFrame by adding a computed Series as a new column.
We can do this using square bracket notation again, this time on the left-hand side of an assignment statement.
# You don't need to worry about the following line.
# It just hides a warning message that's beyond the scope of this course.
pd.set_option('mode.chained_assignment', None)
species_data_final["Longevity-to-Mass Ratio"] = longevities / masses
species_data_final
WARNING!#
Warning: the previous code cell changes the existing data frame species_data_final, rather than creating a new DataFrame.
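If you want to keep the original DataFrame intact, one common approach is to make an explicit copy first and add the new column to the copy. A minimal sketch with a toy DataFrame (names and values invented for illustration):

```python
import pandas as pd

toy = pd.DataFrame({"mass": [500.0, 1200.0],
                    "longevity": [120.0, 240.0]})

# .copy() creates an independent DataFrame, so the original stays unchanged.
toy_with_ratio = toy.copy()
toy_with_ratio["ratio"] = toy_with_ratio["longevity"] / toy_with_ratio["mass"]

print("ratio" in toy.columns)             # False: original unchanged
print("ratio" in toy_with_ratio.columns)  # True
```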
Boolean Series and filtering rows#
Another common type of data transformation is to filter for specific rows in a dataset based on one or more conditions.
Goal: filter the rows of the dataset to keep the species with a mass greater than or equal to 100 kg.
As a first step, we create a boolean Series that stores True for the rows we want to keep, and False for the other rows.
is_large = species_data_final["Adult Body Mass (g)"] >= 100000
is_large
Then, we use this Series to index species_data_final by using square bracket notation.
species_data_final[is_large]
Note: lots of square brackets!#
One of the tricky things about DataFrames is that there are different ways of obtaining subsets of the dataset that all have very similar code syntax:
species_data_final[...]
The key principle is that the type of the value inside the square brackets determines what kind of “subsetting” operation is being performed.
| Type inside `[]` | Example | Return type | Which columns? | Which rows? |
|---|---|---|---|---|
| `str` | `species_data_final["Genus"]` | `Series` | The one specified | All rows |
| `list` of `str` | `species_data_final[columns_to_keep]` | `DataFrame` | The ones specified | All rows |
| boolean `Series` | `species_data_final[is_large]` | `DataFrame` | All columns | The ones specified |
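All three subsetting styles can be seen side by side on a toy DataFrame (the names and values here are made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({"genus": ["Canis", "Felis", "Loxodonta"],
                    "mass": [30.0, 4.0, 5000.0]})

one_column = toy["mass"]              # str inside -> Series (that column, all rows)
two_columns = toy[["genus", "mass"]]  # list inside -> DataFrame (those columns, all rows)
heavy_rows = toy[toy["mass"] > 100]   # boolean Series inside -> DataFrame (all columns, matching rows)

print(type(one_column).__name__)   # Series
print(type(two_columns).__name__)  # DataFrame
print(len(heavy_rows))             # 1
```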
Logical operators: & and |#
Sometimes we want to filter on two conditions.
To start, suppose we have these two boolean Series:
is_large = species_data_final["Adult Body Mass (g)"] >= 100000
is_long_lived = species_data_final["Max Longevity (months)"] >= 240
There are two common ways to filter based on a combination of these two conditions.
Filter 1: find rows where the species is large and is long-lived.
To do this, we use the & operator to combine the two Series.
filter1 = is_large & is_long_lived
species_data_final[filter1]
Filter 2: find rows where the species is large or is long-lived.
To do this, we use the | operator to combine the two Series.
filter2 = is_large | is_long_lived
species_data_final[filter2]
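One caveat worth knowing: if you write the comparisons and the & (or |) on a single line, each comparison must be wrapped in parentheses, because & binds more tightly than >= in Python. A sketch with a toy DataFrame (values invented for illustration):

```python
import pandas as pd

toy = pd.DataFrame({"mass": [500.0, 200000.0, 150000.0],
                    "longevity": [60.0, 300.0, 100.0]})

# Parentheses around each comparison are required when combining inline.
both = toy[(toy["mass"] >= 100000) & (toy["longevity"] >= 240)]
either = toy[(toy["mass"] >= 100000) | (toy["longevity"] >= 240)]

print(len(both))    # 1: only the second row satisfies both conditions
print(len(either))  # 2: the second and third rows satisfy at least one
```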
Exploratory analysis: sorting and basic descriptive statistics#
Sorting#
Suppose we want to take our DataFrame and sort it by the "Adult Body Mass (g)" column to see which species have the largest mass.
We do this by using the DataFrame.sort_values(by=...) method, where we pass in a str that names the column to sort by.
species_data_final.sort_values(by="Adult Body Mass (g)")
By default, the column values are sorted in ascending (low-to-high) order.
If we want to sort in descending (high-to-low) order, we can pass in an optional argument ascending=False to DataFrame.sort_values:
species_data_final.sort_values(by="Adult Body Mass (g)", ascending=False)
Descriptive statistics#
Here are five simple descriptive statistics that we can use to describe a collection of numbers:
sum
count (i.e., size; number of elements)
mean (average)
min
max
Unsurprisingly, we can compute all of these on any Pandas Series containing numeric data by calling a corresponding Series method.
| Statistic | Method |
|---|---|
| sum | `Series.sum()` |
| count | `Series.count()` |
| mean | `Series.mean()` |
| min | `Series.min()` |
| max | `Series.max()` |
Note: all five of these methods ignore NA values.
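To see that NA-ignoring behaviour concretely, here is a toy Series with one missing value; count() reports only the non-missing entries, and mean() averages over them:

```python
import pandas as pd

# A nullable-float Series with one missing value.
values = pd.Series([10.0, pd.NA, 30.0], dtype="Float64")

print(values.count())  # 2: the NA is ignored
print(values.sum())    # 40.0
print(values.mean())   # 20.0, i.e. 40.0 / 2
```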
Let’s start by extracting the body mass column (again).
totals = species_data_final["Adult Body Mass (g)"]
totals.head()
totals.sum()
totals.count()
totals.mean()
totals.min()
totals.max()
Further reading#
pandas is the most complex part of Python we’ve studied so far in this course, and so we expect you’ll need to review and practice more as we dive deeper into this library.
The official pandas website has some great introductory materials to continue with.