Class 5: Statistical variables, distributions, life history and conservation#

EEB 125#

Today’s data story:#

Are mammals that take longer to grow up at greater risk of extinction?#

Read in maturation data#

  • How long does each species usually take to grow to maturity?

  • Measured in days

file = open("maturity.csv")
lines = file.readlines()
header = lines[0]
data = lines[1:]
#print(header)
#print(data[1:4])

Read in IUCN data#

  • Extinction risk across mammalian speciecs

# get our data read in and prepped

iucn=open("iucn_status.csv")
iucn_lines = iucn.readlines()
iucn_header = iucn_lines[0]
iucn_data = iucn_lines[1:]

IUCN Red List#

  • To assess extinction risk, we will use IUCN status

  • IUCN is a conservation organization that manages information on threats to wild animals

  • https://www.iucnredlist.org/

IUCN Red List#

IUCN Red List#

  • We will need to combine information from both datasets to ask our question

print(iucn_header)
print(header)

Our approach:#

  • We will calculate the mean maturation time for all of the mammal species within a given risk category

  • Need to merge the two datasets by linking information for all the species shared between datasets

Setup both datasets#

  • Map the IUCN threat level to each species using a dictionary

sp_iucn = {}
for line in iucn_data:
    line_dat = line.strip().split(",")
    species = line_dat[1]
    iucn_risk = line_dat[2]
    sp_iucn[species] = iucn_risk

Setup both datasets#

  • Map maturation time to each species using a dictionary

sp_mat = {}
for line in data:
    line_dat = line.strip().split(",")
    species = line_dat[1]
    mat_time = line_dat[2]
    if mat_time != "NA":
        sp_mat[species] = float(mat_time) / 365 # convert to years

Linking things up#

  • We will want to calculate the mean maturation time for the species within each risk category

  • First, how can we find what unique risk categories exist in our dataset?

risk_cat = sp_iucn.values() 
unique_risk_cat = set(risk_cat)
print(unique_risk_cat)

Getting setup#

  • create our container that links maturation times with IUCN risk level

  • we will want to store the maturation times associated with each level in a list

iucn_mat = {}
for cat in unique_risk_cat:
    iucn_mat[cat] = []

Dictionary overload#

  • sp_mat: keys = species name, values = maturation time

  • sp_iucn: keys = species name, values = iucn risk level

  • iucn_mat: keys = iucn risk level, values = empty list (to be populated with maturation times)

The approach (in English)#

  • iterate through sp_mat

    • keys are species, values are maturation time

  • look up the IUCN risk level stored in sp_iucn using the keys we are iterating over

  • populate the lists associated with each key in iucn_mat

for sp in sp_mat:
    mat = sp_mat[sp]
    try:
        iucn_cat = sp_iucn[sp]           ## 
        iucn_mat[iucn_cat].append(mat)
    except:
        continue

Calculate means#

  • loop through iucn_mat and calculate the mean for each risk level

# let's calculate a function that does this
def mean(pop):
    tot = 0
    for i in pop:
        tot += i
    mean = tot / len(pop)
    return mean
iucn_means = {}

for cat in iucn_mat:
    mat_times = iucn_mat[cat]
    if len(mat_times)>0:
        cat_mean = mean(mat_times)
        iucn_means[cat] = cat_mean
for cat in iucn_means:
    print(cat, iucn_means[cat])

What categories should we consider ‘at risk’?#

What categories should we consider ‘at risk’?#

  • We will say anything above level 2 is “at risk”, while anything below is not.

iucn_map={'LC':1,'NT':2,'VU':3,'EN':4,'CR':5,'EW':6,'EX':7,'DD':0}

Mark species at risk, or not at risk#

  • We will say anything above level 2 is “at risk”, while anything below is not.

sp_threat = {}
for line in iucn_data:
    line_dat = line.strip().split(",")
    species = line_dat[1]
    iucn_risk = line_dat[2]
    risk_numeric = iucn_map[iucn_risk]
    threat = False
    if risk_numeric > 2:
        threat = True
    elif risk_numeric == 0:
        continue
    sp_threat[species] = threat

Our approach:#

  • We will calculate the mean maturation time for all at risk, vs not at risk mammals

Setup both datasets#

  • Map maturation time to each species using a dictionary

sp_mat = {}
for line in data:
    line_dat = line.strip().split(",")
    species = line_dat[1]
    mat_time = line_dat[2]
    if mat_time != "NA":
        sp_mat[species] = float(mat_time) / 365 # convert to years

Link our two dictionaries to a third#

  • Both of our dictionaries, sp_mat and sp_iucn have species names as the keys

  • We can use the keys of one to look up the values from the other

  • We then need to add values to a third dictionary, threat_mat

threat_mat={True:[],False:[]}

Dictionary overload#

  • sp_mat: keys = species name, values = maturation time

  • sp_iucn: keys = species name, values = iucn risk level

  • threat_mat: keys = threat, values = empty list (to be populated with maturation times)

The approach (in English)#

  • iterate through sp_mat

    • keys are species, values are maturation time

  • look up the IUCN risk level stored in sp_iucn using the keys we are iterating over

  • populate the lists associated with each key in threat_mat

for sp in sp_mat:
    mat = sp_mat[sp]
    try:
        threat = sp_threat[sp]
        threat_mat[threat].append(mat)
    except:
        continue
print(threat_mat)

Calculate means#

  • loop through iucn_mat and calculate the mean for threatened vs non-threatened species

threat_means = {}

for threat in threat_mat:
    mat_times = threat_mat[threat]    
    threat_mean = mean(mat_times)
    threat_means[threat] = threat_mean

print(threat_means)

Today’s data story:#

Are mammals that take longer to grow up at greater risk of extinction?#

Today’s data story:#

Are mammals that take longer to grow up at greater risk of extinction?#

  • Possibly? We will learn more sophisticated statistical techniques later to examine questions like this more closely

Statistical Distributions#

  • What is a statistical distribution?

  • How can a distribution be summarized?

  • What questions can we answer using a distribution?

What is the distribution of conservation risk across mammals?#

How many species belong to each category?

# create two lists, one that will be the keys and another that will be values
keys = list(iucn_map.keys())
vals = [0 for i in range(len(keys))]   #  we will tally species counts for each iucn level later

# wrap the two lists together into a single dictionary 
iucn_counts = dict(zip(keys,vals))  
print(iucn_counts)

What is the distribution of conservation risk across mammals?#

for line in iucn_data:
    line_dat = line.strip().split(",")
    iucn_risk = line_dat[2]
    iucn_counts[iucn_risk]+=1

What is the distribution of conservation risk across mammals?#

print(iucn_counts)

Importing modules#

  • There are often times where something we want to do is so common that someone has already written code that does it

  • These are packeged in the form of python ‘modules’

  • We need to import these modules to use this code

  • We will use one for plotting data called matplotlib

    • (you will not need to do this yourself yet– just watch for now)

import matplotlib.pyplot as plt

What is the distribution of conservation risk across mammals?#

  • The bars represent the frequency of observations and the labels on the horizontal axis represent the number of species at a conservation risk level.

  • This is called the frequency distribution of conservation risk.

What is the distribution of conservation risk across mammals?#

rel_counts = [i/sum(iucn_counts.values()) for i in iucn_counts.values()]
plt.bar(iucn_counts.keys(),rel_counts)
plt.show()
  • If we want to plot proportions instead of counts then we can transform activity_dist by dividing by the total number of observations.

  • This is called the relative frequency distribution of activity.

  • Q: About what proportion of mammals is at risk?

rel_counts = [i/sum(iucn_counts.values()) for i in iucn_counts.values()]
plt.bar(iucn_counts.keys(),rel_counts)
plt.show()

Summarizing the distribution of a continuous variable#

What is the distribution of the time it takes to grow up across mammals?#

Variation#

  • One of the most important concepts in statistics and biology

  • Standard deviation is average deviation from the mean

    • Large values mean lots of variation and small values mean less variation.

  • Other measures of variation also exist (e.g., the range– max - min)

Variance#

  • How far from the mean are the data, on average?

  • Calculate the difference between each data point and the mean

  • Calculate the mean of these differences

def variance(data,mean_val):
    diffs=[]
    for i in data:
        diff = i - mean_val
        sq_diff = diff ** 2
        diffs.append(sq_diff)
    var = mean(diffs)
    return var

Standard Deviation#

  • Square root of the variance

  • Descibes the variation in values, expressed in the same units as the data

import math

def st_dev(data,mean_val):
    var = variance(data,mean_val)
    sd = math.sqrt(var)
    return sd

Calculate means#

  • loop through iucn_mat and calculate the mean for each risk level

threat_sds = {}

for threat in threat_mat:
    mat_times = threat_mat[threat]  
    threat_mean = threat_means[threat]
    threat_sd = st_dev(mat_times,threat_mean)
    threat_sds[threat] = threat_sd
threat_sds
for threat in threat_means:
    threat_mean = threat_means[threat]
    threat_sd = threat_sds[threat]
    print(threat,threat_mean,threat_sd)

Histograms#

  • We can also visualize central tendency and variation using a type of plot called a histogram

plt.hist(sp_mat.values())
plt.show()

Histograms#

  • We can also visualize the relative frequency, or proportion of individuals with each maturation time

plt.hist(sp_mat.values(),density=True)
plt.show()

Histograms#

  • Histograms can be useful to visualize differences in how data are distributed

plt.hist(threat_mat[False],density=True,alpha=0.5,label="not at risk")
plt.hist(threat_mat[True],density=True,alpha=0.5,label="at risk")
plt.legend(loc='upper right')
plt.xlabel("maturation time (years)")
plt.show()

Midterm#

  • Computer-based

  • Will take place in this classroom at our normal lecture time

    • Feb. 12, 1pm-3pm

Format#

  • Mix of:

    • (simple) programming exercises (i.e., produce your own code)

    • code reading/interpretation (i.e., explain some pre-written code)

    • data interpretation (i.e., look at some data summaries and interpret them)

  • Will be largely similar to the structure of a homework assignment