Class 5: Statistical variables, distributions, life history and conservation

EEB 125#

Today’s data story:#

Are mammals that take longer to grow up at greater risk of extinction?#

Read in maturation data#

How long does each species usually take to grow to maturity?
Measured in days

file = open("maturity.csv")
lines = file.readlines()
header = lines[0]
data = lines[1:]
#print(header)
#print(data[1:4])

Read in IUCN data#

Extinction risk across mammalian species

# get our data read in and prepped

iucn=open("iucn_status.csv")
iucn_lines = iucn.readlines()
iucn_header = iucn_lines[0]
iucn_data = iucn_lines[1:]

IUCN Red List#

To assess extinction risk, we will use IUCN status
The IUCN (International Union for Conservation of Nature) is a conservation organization that maintains the Red List, a catalog of extinction risk for wild species
https://www.iucnredlist.org/

IUCN Red List#

We will need to combine information from both datasets to ask our question

print(iucn_header)
print(header)

Our approach:#

We will calculate the mean maturation time for all of the mammal species within a given risk category
To do this, we need to merge the two datasets by linking the information for every species that appears in both

Set up both datasets#

Map the IUCN threat level to each species using a dictionary

sp_iucn = {}
for line in iucn_data:
    line_dat = line.strip().split(",")
    species = line_dat[1]
    iucn_risk = line_dat[2]
    sp_iucn[species] = iucn_risk

Set up both datasets#

Map maturation time to each species using a dictionary

sp_mat = {}
for line in data:
    line_dat = line.strip().split(",")
    species = line_dat[1]
    mat_time = line_dat[2]
    if mat_time != "NA":
        sp_mat[species] = float(mat_time) / 365 # convert to years
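To see what the dictionary-building loops do, here is a minimal sketch run on a few made-up lines. The species names and values are invented for illustration and are not taken from maturity.csv; the column layout (row id, species, value) is only assumed from the indices used in the code above.

```python
# Hypothetical lines in the layout the loop above assumes:
# column 0 = row id, column 1 = species, column 2 = maturation time in days
toy_data = [
    "1,Species_a,365\n",
    "2,Species_b,NA\n",
    "3,Species_c,730\n",
]

toy_sp_mat = {}
for line in toy_data:
    line_dat = line.strip().split(",")   # drop the newline, then split on commas
    species = line_dat[1]
    mat_time = line_dat[2]
    if mat_time != "NA":                 # skip species with no measurement
        toy_sp_mat[species] = float(mat_time) / 365  # convert days to years

print(toy_sp_mat)   # {'Species_a': 1.0, 'Species_c': 2.0}
```

Species_b is left out of the dictionary because its value is "NA", which is why a later lookup for a species can fail.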

Linking things up#

We will want to calculate the mean maturation time for the species within each risk category
First, how can we find what unique risk categories exist in our dataset?

risk_cat = sp_iucn.values() 

unique_risk_cat = set(risk_cat)
print(unique_risk_cat)
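As a small illustration of why set() answers this question, the sketch below uses a made-up list of risk codes (not the real dataset). A set keeps one copy of each value and has no fixed order, so sorted() is used here only to make the printout predictable.

```python
# Made-up list of risk codes with repeats
toy_risk = ["LC", "EN", "LC", "VU", "EN", "LC"]

unique_toy = set(toy_risk)          # duplicates collapse to a single entry
print(sorted(unique_toy))           # ['EN', 'LC', 'VU']
print(len(toy_risk), len(unique_toy))   # 6 3
```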

Getting set up#

Create a container that links maturation times with IUCN risk level
We will store the maturation times associated with each level in a list

iucn_mat = {}
for cat in unique_risk_cat:
    iucn_mat[cat] = []

Link our two dictionaries to a third#

Both of our dictionaries, sp_mat and sp_iucn, have species names as the keys
We can use the keys of one to look up the values from the other
We then need to add values to our third dictionary, iucn_mat

Dictionary overload#

sp_mat: keys = species name, values = maturation time
sp_iucn: keys = species name, values = iucn risk level
iucn_mat: keys = iucn risk level, values = empty list (to be populated with maturation times)

The approach (in English)#

iterate through sp_mat
- keys are species, values are maturation time
look up the IUCN risk level stored in sp_iucn using the keys we are iterating over
populate the lists associated with each key in iucn_mat

for sp in sp_mat:
    mat = sp_mat[sp]
    try:
        iucn_cat = sp_iucn[sp]           # raises KeyError if the species has no IUCN entry
        iucn_mat[iucn_cat].append(mat)
    except KeyError:
        continue                         # skip species missing from the IUCN data
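The linking step can be traced by hand on toy dictionaries. This is a sketch with invented species and values; it checks membership with `in` rather than try/except, which is one alternative way to skip species that appear in only one dataset.

```python
# Invented example data: Species_c has no IUCN entry, Species_d has no maturation time
toy_sp_mat = {"Species_a": 1.0, "Species_b": 4.5, "Species_c": 2.0}
toy_sp_iucn = {"Species_a": "LC", "Species_b": "EN", "Species_d": "LC"}

toy_iucn_mat = {"LC": [], "EN": []}      # one empty list per risk category
for sp in toy_sp_mat:
    if sp in toy_sp_iucn:                # only species present in both dictionaries
        cat = toy_sp_iucn[sp]
        toy_iucn_mat[cat].append(toy_sp_mat[sp])

print(toy_iucn_mat)   # {'LC': [1.0], 'EN': [4.5]}
```

Only the two species shared by both dictionaries end up in the result.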

Calculate means#

loop through iucn_mat and calculate the mean for each risk level

# let's define a function that does this
def mean(pop):
    tot = 0
    for i in pop:
        tot += i
    avg = tot / len(pop)   # assumes pop has at least one value
    return avg

iucn_means = {}

for cat in iucn_mat:
    mat_times = iucn_mat[cat]
    if len(mat_times)>0:
        cat_mean = mean(mat_times)
        iucn_means[cat] = cat_mean

for cat in iucn_means:
    print(cat, iucn_means[cat])
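One way to check a hand-written function is to compare it with Python's built-in statistics module on a small example. The numbers below are made up for the check.

```python
import statistics

def mean(pop):
    # same logic as the function defined above: add everything up, divide by the count
    tot = 0
    for i in pop:
        tot += i
    return tot / len(pop)

toy_times = [1.0, 2.0, 6.0]           # made-up maturation times in years
print(mean(toy_times))                # 3.0
print(statistics.mean(toy_times))     # 3.0
```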

What categories should we consider ‘at risk’?#

We will say anything above level 2 (VU, EN, CR, EW, EX) is “at risk”, while levels 1 and 2 (LC, NT) are not. Level 0 (DD, data deficient) cannot be classified.

iucn_map={'LC':1,'NT':2,'VU':3,'EN':4,'CR':5,'EW':6,'EX':7,'DD':0}

Mark species at risk, or not at risk#

Species above level 2 are marked True (at risk), species at level 1 or 2 are marked False, and data deficient species (level 0) are skipped.

sp_threat = {}
for line in iucn_data:
    line_dat = line.strip().split(",")
    species = line_dat[1]
    iucn_risk = line_dat[2]
    risk_numeric = iucn_map[iucn_risk]
    threat = False
    if risk_numeric > 2:
        threat = True
    elif risk_numeric == 0:
        continue
    sp_threat[species] = threat
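The classification rule can be tried on individual codes. This is a sketch only: `is_at_risk` is a hypothetical helper written for this example and is not part of the lecture code. It returns None for data deficient species, which the loop above handles by skipping them.

```python
iucn_map = {'LC': 1, 'NT': 2, 'VU': 3, 'EN': 4, 'CR': 5, 'EW': 6, 'EX': 7, 'DD': 0}

def is_at_risk(code):
    # hypothetical helper: True above level 2, False at level 1 or 2, None for DD
    level = iucn_map[code]
    if level == 0:
        return None
    return level > 2

for code in ["LC", "NT", "VU", "DD"]:
    print(code, is_at_risk(code))
# LC False
# NT False
# VU True
# DD None
```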

Our approach:#

We will calculate the mean maturation time for at-risk vs. not-at-risk mammals

Set up both datasets#

Map maturation time to each species using a dictionary

sp_mat = {}
for line in data:
    line_dat = line.strip().split(",")
    species = line_dat[1]
    mat_time = line_dat[2]
    if mat_time != "NA":
        sp_mat[species] = float(mat_time) / 365 # convert to years

Link our two dictionaries to a third#

Both of our dictionaries, sp_mat and sp_threat, have species names as the keys
We can use the keys of one to look up the values from the other
We then need to add values to a third dictionary, threat_mat

threat_mat={True:[],False:[]}

Dictionary overload#

sp_mat: keys = species name, values = maturation time
sp_threat: keys = species name, values = True (at risk) or False (not at risk)
threat_mat: keys = True or False, values = empty list (to be populated with maturation times)

The approach (in English)#

iterate through sp_mat
- keys are species, values are maturation time
look up the threat status stored in sp_threat using the keys we are iterating over
populate the lists associated with each key in threat_mat

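for sp in sp_mat:
    mat = sp_mat[sp]
    try:
        threat = sp_threat[sp]           # raises KeyError if the species was skipped or is missing
        threat_mat[threat].append(mat)
    except KeyError:
        continue                         # skip species with no threat status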

print(threat_mat)

Calculate means#

loop through threat_mat and calculate the mean for threatened vs non-threatened species

threat_means = {}

for threat in threat_mat:
    mat_times = threat_mat[threat]    
    threat_mean = mean(mat_times)
    threat_means[threat] = threat_mean

print(threat_means)

Today’s data story:#

Are mammals that take longer to grow up at greater risk of extinction?#

Possibly? We will learn more sophisticated statistical techniques later to examine questions like this more closely

Statistical Distributions#

What is a statistical distribution?
How can a distribution be summarized?
What questions can we answer using a distribution?

What is the distribution of conservation risk across mammals?#

How many species belong to each category?

# create two lists, one that will be the keys and another that will be values
keys = list(iucn_map.keys())
vals = [0 for i in range(len(keys))]   #  we will tally species counts for each iucn level later

# wrap the two lists together into a single dictionary 
iucn_counts = dict(zip(keys,vals))  
print(iucn_counts)
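The zip step pairs the first key with the first value, the second with the second, and so on. The sketch below uses a shortened, made-up key list and also shows dict.fromkeys, a built-in alternative that gives the same result when every starting value is 0.

```python
keys = ["LC", "NT", "VU"]          # shortened list of categories for illustration
vals = [0, 0, 0]

paired = dict(zip(keys, vals))     # pairs items by position
same = dict.fromkeys(keys, 0)      # every key starts at the same value

print(paired)           # {'LC': 0, 'NT': 0, 'VU': 0}
print(paired == same)   # True
```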

What is the distribution of conservation risk across mammals?#

for line in iucn_data:
    line_dat = line.strip().split(",")
    iucn_risk = line_dat[2]
    iucn_counts[iucn_risk]+=1

What is the distribution of conservation risk across mammals?#

print(iucn_counts)
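Tallying like this is common enough that Python's standard library includes a tool for it. The sketch below applies collections.Counter to a made-up list of risk codes; it is an alternative to the loop above, not what the lecture code uses.

```python
from collections import Counter

toy_risks = ["LC", "EN", "LC", "VU", "LC"]   # made-up risk codes
toy_counts = Counter(toy_risks)              # counts how often each value appears

# A Counter returns 0 for a key it has never seen
print(toy_counts["LC"], toy_counts["EN"], toy_counts["CR"])   # 3 1 0
```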

Importing modules#

Often, something we want to do is so common that someone has already written code that does it
This code is packaged in the form of Python ‘modules’
We need to import these modules to use this code
We will use one for plotting data called matplotlib
- (you will not need to do this yourself yet; just watch for now)

import matplotlib.pyplot as plt

What is the distribution of conservation risk across mammals?#

The height of each bar represents the frequency (the number of species), and the labels on the horizontal axis represent the conservation risk levels.
This is called the frequency distribution of conservation risk.

What is the distribution of conservation risk across mammals?#

If we want to plot proportions instead of counts, we can transform iucn_counts by dividing each count by the total number of species.
This is called the relative frequency distribution of conservation risk.
Q: About what proportion of mammals is at risk?

rel_counts = [i/sum(iucn_counts.values()) for i in iucn_counts.values()]
plt.bar(iucn_counts.keys(),rel_counts)
plt.show()
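The conversion from counts to proportions can be checked without plotting. The counts below are invented so that the arithmetic is easy to follow; they are not the real IUCN tallies.

```python
toy_counts = {"LC": 4, "VU": 2, "EN": 2}     # made-up counts, 8 species in total

total = sum(toy_counts.values())
toy_rel = {}
for cat in toy_counts:
    toy_rel[cat] = toy_counts[cat] / total   # proportion of all species in this category

print(toy_rel)                  # {'LC': 0.5, 'VU': 0.25, 'EN': 0.25}
print(sum(toy_rel.values()))    # 1.0

# proportion at risk in this toy example: VU + EN
print(toy_rel["VU"] + toy_rel["EN"])   # 0.5
```

Relative frequencies always add up to 1, which makes distributions built from different numbers of species comparable.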

Summarizing the distribution of a continuous variable#

What is the distribution of the time it takes to grow up across mammals?#

Variation#

Variation is one of the most important concepts in statistics and biology
Standard deviation measures how far values typically fall from the mean
- Large values mean lots of variation and small values mean less variation.
Other measures of variation also exist (e.g., the range: max - min)

Variance#

How far from the mean are the data, on average?
Calculate the difference between each data point and the mean
Square each difference so that positive and negative differences do not cancel out
Calculate the mean of these squared differences

def variance(data,mean_val):
    diffs=[]
    for i in data:
        diff = i - mean_val
        sq_diff = diff ** 2
        diffs.append(sq_diff)
    var = mean(diffs)
    return var
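A worked example on made-up numbers shows each step. Because the function above divides by the number of data points, it matches statistics.pvariance (the population variance). statistics.variance divides by n - 1 instead and gives a slightly larger value.

```python
import statistics

toy = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # made-up data
toy_mean = sum(toy) / len(toy)                    # 40 / 8 = 5.0

sq_diffs = []
for x in toy:
    sq_diffs.append((x - toy_mean) ** 2)          # 9, 1, 1, 1, 0, 0, 4, 16
toy_var = sum(sq_diffs) / len(sq_diffs)           # 32 / 8 = 4.0

print(toy_mean, toy_var)             # 5.0 4.0
print(statistics.pvariance(toy))     # divides by n, same as toy_var
print(statistics.variance(toy))      # divides by n - 1 (32 / 7), so it is larger
```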

Standard Deviation#

Square root of the variance
Describes the variation in values, expressed in the same units as the data

import math

def st_dev(data,mean_val):
    var = variance(data,mean_val)
    sd = math.sqrt(var)
    return sd
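To build intuition, the sketch below compares two made-up datasets that share the same mean but differ in spread. `pop_sd` is a hypothetical helper written for this example; it computes the mean itself rather than taking it as an argument.

```python
import math

def pop_sd(values):
    # hypothetical helper: square root of the mean squared difference from the mean
    m = sum(values) / len(values)
    var = sum((v - m) ** 2 for v in values) / len(values)
    return math.sqrt(var)

tight = [4.0, 5.0, 6.0]   # mean 5, values close to the mean
wide = [1.0, 5.0, 9.0]    # mean 5, values far from the mean

print(pop_sd(tight), pop_sd(wide))
```

The second dataset has differences from the mean that are four times larger, so its standard deviation is also four times larger.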

Calculate standard deviations#

loop through threat_mat and calculate the standard deviation for threatened vs non-threatened species

threat_sds = {}

for threat in threat_mat:
    mat_times = threat_mat[threat]  
    threat_mean = threat_means[threat]
    threat_sd = st_dev(mat_times,threat_mean)
    threat_sds[threat] = threat_sd
print(threat_sds)

for threat in threat_means:
    threat_mean = threat_means[threat]
    threat_sd = threat_sds[threat]
    print(threat,threat_mean,threat_sd)

Histograms#

We can also visualize central tendency and variation using a type of plot called a histogram

plt.hist(list(sp_mat.values()))   # convert the dictionary values to a list before plotting
plt.show()

Histograms#

We can also rescale the histogram with density=True so that the total area of the bars is 1, which shows the relative frequency of species across maturation times

plt.hist(list(sp_mat.values()),density=True)   # convert the dictionary values to a list before plotting
plt.show()
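What a histogram computes can be sketched by hand without matplotlib. The values and bin edges below are made up. The bins here include their left edge and exclude their right edge, a simplification that is safe because no toy value falls on the final edge.

```python
toy_times = [0.5, 1.0, 1.5, 2.5, 3.5, 3.9]   # made-up maturation times (years)
bin_edges = [0.0, 2.0, 4.0]                   # two bins, each 2 years wide

# count how many values fall into each bin
counts = [0] * (len(bin_edges) - 1)
for t in toy_times:
    for b in range(len(counts)):
        if bin_edges[b] <= t < bin_edges[b + 1]:
            counts[b] += 1

# density = count / (number of values * bin width), so the bar areas sum to 1
n = len(toy_times)
densities = []
for b in range(len(counts)):
    width = bin_edges[b + 1] - bin_edges[b]
    densities.append(counts[b] / (n * width))

area = sum(d * (bin_edges[b + 1] - bin_edges[b]) for b, d in enumerate(densities))
print(counts)      # [3, 3]
print(densities)   # [0.25, 0.25]
print(area)        # 1.0
```

With density scaling, the bar heights depend on bin width, so it is the area of a bar (not its height) that gives the proportion of species in that bin.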

Histograms#

Histograms can be useful to visualize differences in how data are distributed

plt.hist(threat_mat[False],density=True,alpha=0.5,label="not at risk")
plt.hist(threat_mat[True],density=True,alpha=0.5,label="at risk")
plt.legend(loc='upper right')
plt.xlabel("maturation time (years)")
plt.show()

Midterm#

Computer-based
Will take place in this classroom at our normal lecture time
- Feb. 12, 1pm-3pm

Format#

Mix of:
- (simple) programming exercises (i.e., produce your own code)
- code reading/interpretation (i.e., explain some pre-written code)
- data interpretation (i.e., look at some data summaries and interpret them)
Will be largely similar to the structure of a homework assignment
