Class 7#

Mar. 5, 2024#

Where we left off last week

  • understand the role of empirical distributions (sample and statistic) and simulation

New topics for today

  • Models

  • Comparing two samples

  • Causality

Sampling from a Population#

The law of averages also holds when the random sample is drawn from individuals in a large population.

 persdur            Duration - Personal activities

           VALUE  LABEL
               0  No time spent doing this activity
            9996  Valid skip
            9997  Don't know
            9998  Refusal
            9999  Not stated


 luc_rst            Population centre indicator

           VALUE  LABEL
               1  Larger urban population centres (CMA/CA)
               2  Rural areas and small population centres (non CMA/CA)
               3  Prince Edward Island
               6  Valid skip
               7  Don't know
               8  Refusal
               9  Not stated
import pandas as pd
timeuse = pd.read_csv('gss_tu2016_main_file.csv')

importantcols = ['CASEID', 'persdur', 'luc_rst']

timeuse_subset = timeuse[importantcols]
print(timeuse_subset.shape)
timeuse_subset.head()

Transform time spent on personal activities from minutes to hours, and add this new column called 'persdur_hour' to timeuse_subset.

timeuse_subset = timeuse_subset.copy()
timeuse_subset['persdur_hour'] = (timeuse_subset['persdur']/60)
timeuse_subset.head()

Let’s zoom in on respondents that had a personal activities time between 0.25 and 3 hours.

zoom = (timeuse_subset['persdur_hour'] >= 0.25) & (timeuse_subset['persdur_hour'] <= 3)

timeuse_subset.loc[zoom].describe()

Empirical Distribution of the Sample#

  • Let’s think of the 13,004 respondent times (hours) as a population, and draw random samples from it with replacement.

  • Below is a histogram of the distribution.

persdur_zoom = timeuse_subset.loc[zoom, 'persdur_hour']

persdur_zoom.plot.hist(bins = 10, color = 'grey', edgecolor = 'black')

A random sample of 100 …

persdur_zoom.sample(100, replace = True).plot.hist(bins = 10, color = 'grey', edgecolor = 'black')
  • The values that occur with the least frequency are less likely to occur in small random samples.

  • As the size of the random sample increases the sample will resemble the population, with high probability.

Interactive exploration of empirical distribution of a sample#

from ipywidgets import interact
import ipywidgets as widgets

def emp_hist_plot(n):
    persdur_zoom.sample(n, replace = True).plot.hist(bins = 10, color = 'grey', edgecolor = 'black')

interact(emp_hist_plot, n = widgets.IntSlider(min = 10, max= 500, step=50, value=10))

The histogram of the “population”.

persdur_zoom.plot.hist(bins = 10, color = 'grey', edgecolor = 'black');

Questions#

  • Describe how the empirical histogram changes as the value of n gets larger.

Empirical Distribution of a Statistic#

The Law of Averages implies that with high probability, the empirical distribution of a large random sample will resemble the distribution of the population from which the sample was drawn.

The resemblance is visible in two histograms: the empirical histogram of a large random sample is likely to resemble the histogram of the population.
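
As a quick check of this resemblance (a sketch, not part of the original notes; the sample size of 1,000 and the shared bins are arbitrary choices), we can draw one large random sample and plot it on the density scale alongside the population histogram.

import numpy as np
import matplotlib.pyplot as plt

# use the same bins for both histograms so the shapes are directly comparable
shared_bins = np.linspace(persdur_zoom.min(), persdur_zoom.max(), 11)

large_sample = persdur_zoom.sample(1000, replace = True)

plt.hist(persdur_zoom, bins = shared_bins, density = True, color = 'grey',
         edgecolor = 'black', alpha = 0.5, label = 'population')
plt.hist(large_sample, bins = shared_bins, density = True, color = 'white',
         edgecolor = 'red', alpha = 0.5, label = 'sample of 1000')
plt.xlabel('Personal activities (hours)')
plt.legend();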

Statistical Parameter#

Frequently, we are interested in numerical quantities associated with a population.

  • In a population of voters, what percent will vote for Candidate A?

  • In a population of Facebook users, what is the largest number of Facebook friends that the users have?

  • In a population of Air Canada flights, what is the median departure delay?

Numerical quantities associated with a (statistical) population are called statistical parameters or parameters. For the population of respondents in persdur_zoom, we know the value of the parameter “median time (hours) spent on personal activities”:

persdur_zoom.median()

Statistic#

In many situations, we will be interested in figuring out the value of an unknown parameter. For this, we will rely on data from a large random sample drawn from the population.

A statistic (note the singular!) is any number computed using the data in a sample. The sample median, therefore, is a statistic.

Remember that persdur_zoom.sample(100) contains a random sample of 100 respondents from persdur_zoom. The observed value of the sample median is:

persdur_zoom.sample(100).median()

Our sample – one set of 100 people – gave us one observed value of the statistic. This raises an important problem of inference:

The statistic could have been different. A fundamental consideration in using any statistic based on a random sample is that the sample could have come out differently, and therefore the statistic could have come out differently too.

Run the cell above a few times to see how the answer varies. Often it is equal to 0.75, the same value as the population parameter. But sometimes it is different.

Just how different could the statistic have been? One way to answer this is to simulate the statistic many times and note the values. A histogram of those values will tell us about the distribution of the statistic.

Simulating a Statistic#

We will simulate the sample median using the steps below. You can replace the sample size of 100 with any other sample size, and the sample median with any other statistic.

Step 1: Decide which statistic to simulate. We have already decided that: we are going to simulate the median of a random sample of size 100 drawn from the population of time use survey respondents who spent between 0.25 and 3.0 hours on personal activities.

Step 2: Define a function that returns one simulated value of the statistic. Draw a random sample of size 100 and compute the median of the sample. We did this in the code cell above. Here it is again, encapsulated in a function.

def random_sample_median():
    return persdur_zoom.sample(100).median()
random_sample_median()

Step 3: Decide how many simulated values to generate. Let’s do 5,000 repetitions.

Step 4: Use a for loop to generate a list of simulated values. Start by creating an empty list in which to collect our results. We will then set up a for loop for generating all the simulated values. The body of the loop will consist of generating one simulated value of the sample median, and appending it to our collection list.

The simulation takes a noticeable amount of time to run. That is because it is performing 5,000 repetitions of the process of drawing a sample of size 100 and computing its median. That’s a lot of sampling and repeating!

Let’s break down this step a bit further:

  • set up an empty list called sim_medians.

sim_medians = [] # empty list
  • Use the append function to append values to sim_medians.

sim_medians.append(random_sample_median())
sim_medians
  • Run the cell above several times and you will see that values keep getting appended to sim_medians.

  • Each time the cell is run, a random sample of 100 is drawn, its median is calculated, and the median is appended to the list.

sim_medians = []

num_sims = 5000 

for _ in range(num_sims):
    sim_medians.append(random_sample_median())

Empirical Distribution of a Statistic#

We can now examine the empirical frequency distribution of the median statistic.

Create a pandas series using the list sim_medians then use the describe function to describe the distribution.

pd.Series(sim_medians).describe()
  • The distribution of the median can be visualized using a histogram.

  • The histogram can tell us how frequent certain values of the median occur in random samples.

import numpy as np
import matplotlib.pyplot as plt
# create bins of length 0.1 
# starting at 0.5 and ending at 1.1

median_bins = np.arange(0.5, 1.1, 0.1) 

# use the bins to plot empirical distribution of medians

plt.hist(sim_medians, edgecolor = 'black', color = 'grey', bins = median_bins);

The exact counts of medians in each interval can be examined using pd.cut and value_counts.

# create a pandas series so we can use the cut function
sim_medians_series = pd.Series(sim_medians)

# frequency of values in each bin
print(pd.cut(sim_medians_series, median_bins).value_counts())

# relative frequency (proportion) of values in each bin
pd.cut(sim_medians_series, median_bins).value_counts()/num_sims
  • The histogram shows that simulated medians between 0.70 and 0.80 occur with the highest probability (relative frequency 0.5314 in this simulation; a direct check follows this list).

  • This means that a random sample would rarely yield a median value in the range of 0.5 - 0.6 or 1.0 - 1.1.
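
Where does a number like 0.5314 come from? Here is a direct check (a sketch, not in the original notes) that recomputes the relative frequency of the (0.7, 0.8] bin from sim_medians; the exact value will vary a little from run to run.

# convert the list to a numpy array so we can compare it to numbers elementwise
sim_medians_arr = np.array(sim_medians)

# proportion of simulated medians that fall in the (0.7, 0.8] bin
print(np.mean((sim_medians_arr > 0.7) & (sim_medians_arr <= 0.8)))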

What can we learn from simulation?#

  • If we could generate all possible random samples of size 100, we would know all possible values of the statistic (the sample median), as well as the probabilities of all those values. We could visualize all the values and probabilities in the probability histogram of the statistic.

  • But in many situations including this one, the number of all possible samples is large enough to exceed the capacity of the computer, and purely mathematical calculations of the probabilities can be intractably difficult.

  • This is where empirical histograms come in.

  • We know that by the Law of Averages, the empirical histogram of the statistic is likely to resemble the probability histogram of the statistic, if the sample size is large and if you repeat the random sampling process numerous times.

  • This means that simulating random processes repeatedly is a way of approximating probability distributions without figuring out the probabilities mathematically or generating all possible random samples.

  • Thus computer simulations become a powerful tool in data science. They can help data scientists understand the properties of random quantities that would be complicated to analyze in other ways.

Models#

  • A model is a set of assumptions about the data. In many cases models include assumptions about random (stochastic) processes used to generate the data.

  • Data scientists are often in a position of formulating and assessing models.

Goals of Data Science#

  • Deeper understanding of the world.

  • Make the world a better place to live.

  • For example, help expose injustice.

  • The skills you are learning help empower you to do this.

Jury Selection#

  • U.S. Constitution grants equal protection under the law

  • All defendants have the right to due process

  • Robert Swain, a Black man, was convicted in Talladega County, AL

  • He appealed to the U.S. Supreme Court

  • Main reason: Unfair jury selection in the County’s trials

  • At the time of the trial, only men aged 21 or more were eligible to serve on juries in Talladega County. In that population, 26% of the men were Black.

  • But only eight men among the panel of 100 men (that is, 8%) were Black.

  • The U.S. Supreme Court reviewed the appeal and concluded, “the overall percentage disparity has been small.” But was this assertion reasonable?

  • If jury panelists were selected at random from the county’s eligible population, there would be some chance variation. We wouldn’t get exactly 26 Black panelists on every 100-person panel. But would we expect as few as eight?

A model of random selection#

  • A model of the data is that the panel was selected at random and ended up with a small number of Black panelists just due to chance.

  • Since the panel was supposed to resemble the population of all eligible jurors, the model of random selection is important to assess. Let’s see if it stands up to scrutiny.

  • The numpy.random function multinomial(n, pvals, size) can be used to simulate sample proportions or counts with two or more categories.

Example 1: rolling a six-sided die 20 times#

import numpy as np

# number of times die is rolled
sample_size = 20

# number of experiments
num_simulations = 1

# probability of each side
true_probabilities = [1 / 6] * 6

# number of times each side appears
counts = np.random.multinomial(sample_size, true_probabilities, size = num_simulations)

counts
proportions = counts / sample_size

print('Sample counts: \n', counts) 
print('Sample proportions: \n', proportions)

Example 2: rolling a loaded six-sided die (twice as likely to land on 6) 100 times, repeated 3 times#

sample_size = 100

num_simulations = 3

true_probabilities = [1 / 7] * 5 + [2 / 7]

counts = np.random.multinomial(sample_size, true_probabilities, size = num_simulations)

proportions = counts / sample_size

print('Sample counts:\n', counts) 
print('Sample proportions: \n', proportions)
  • Let’s use this to simulate the jury selection process.

  • The size of the jury panel is 100, so sample_size is 100.

  • The distribution from which we will draw the sample is the distribution in the population of eligible jurors: 26% of them were Black, so we treat the remaining 100% - 26% = 74% as white (a very simplistic assumption, but let’s go with it for now).

  • This means true_probabilities is [0.26, 0.74].

  • One simulation is below.

import numpy as np
import pandas as pd

sample_size = 100

true_probabilities = [0.26, 0.74]

num_simulations = 1

counts = np.random.multinomial(sample_size, true_probabilities, size = num_simulations)

counts
proportions = counts / sample_size

# store the simulated proportions in a DataFrame (one row per simulation)
sim_counts = pd.DataFrame(proportions, columns = ['Black', 'White'])

print(sim_counts)

sim_counts.iloc[0,0]

Quick side note on Indexing a pandas DataFrame#

  • iloc and loc are functions that can access data in a DataFrame.

  • loc uses labels to access rows and columns.

  • iloc uses integers to access rows and columns.

print(sim_counts.shape)
print(sim_counts.index)
print(sim_counts.columns)
# row 0 column 1
print(sim_counts.iloc[0, 1])

# access row 0 column 1 using label names
print(sim_counts.loc[0, 'White'])
print(sim_counts.iloc[0, 0])

print(sim_counts.loc[0,'Black'])

Back to simulation … Simulate one value#

  • Let’s write a function to simulate one value.

def simulate_one_count():
    sample_size = 100 # jury size

    true_probabilities = [0.26, 0.74] #true prob of race
    
    num_simulations = 1 # number of simulations
    # get the random counts
    counts = np.random.multinomial(sample_size, true_probabilities, size = num_simulations) 
    
    # store in data frame
    sim_counts = pd.DataFrame(counts, columns = ['Black', 'White'])
    
    return sim_counts.iloc[0,0]
simulate_one_count()

Simulate multiple values#

  • Our analysis is focused on the variability in the counts.

  • Let’s generate 10,000 simulated values of the count and see how they vary.

  • We will do this by using a for loop and collecting all the simulated counts in a list called sim_counts

sim_counts = []

number_sims = 10000

for _ in np.arange(number_sims):
    sim_counts.append(simulate_one_count())
import matplotlib.pyplot as plt

plt.hist(sim_counts, bins = np.arange(5.5, 50, 1), edgecolor = 'black', 
         color = 'grey', density = True);
plt.xlabel('Count in random sample')
plt.ylabel('Density')
# mark the observed count of 8 Black panelists on the panel
plt.scatter(8, 0, color = 'orange', s = 150);

Conclusion of the data analysis#

  • The histogram shows that if we select a panel of size 100 at random from the eligible population, we are very unlikely to get counts of Black panelists that are as low as the eight that were observed on the panel in the trial.

  • This is evidence that the model of random selection of the jurors in the panel is not consistent with the data from the panel. While it is possible that the panel could have been generated by chance, our simulation demonstrates that it is hugely unlikely.

  • Therefore the most reasonable conclusion is that the assumption of random selection is unjustified for this jury panel.

  • The simulation could also have been done with a single call to np.random.multinomial by setting size = 10000, instead of looping.

  • This is an example of a ‘vectorized’ computation, which is usually faster than the equivalent non-vectorized (loop-based) computation.

sample_size = 100

true_probabilities = [0.26, 0.74]

num_simulations = 10000

counts = np.random.multinomial(sample_size, true_probabilities, size = num_simulations)

print(len(counts))

counts
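
As a rough check (a sketch, not in the original notes), the first column of the vectorized counts holds the number of Black panelists in each simulated panel, so we can estimate directly how often a random panel would contain eight or fewer Black panelists.

# number of Black panelists in each of the 10,000 simulated panels
black_counts = counts[:, 0]

# estimated probability that a randomly selected panel has 8 or fewer Black panelists
print(np.mean(black_counts <= 8))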

Comparing two samples#

Are mental health visit rates higher in Toronto Neighbourhoods with higher material deprivation?#

Material deprivation is closely connected to poverty; it refers to the inability of individuals and communities to access and attain basic material needs. The indicators included in this dimension measure income, quality of housing, educational attainment, and family structure characteristics.

See 2011 Ontario Marginalization Index Documentation

Data wrangling#

  • The next few slides involve reading the data into pandas and getting it ready for analysis.

  • This won’t be covered in detail in class, but we have already covered this process in previous classes.

  • The data is stored in 1_marg_neighb_toronto_2006_OnMarg.xls - a Microsoft Excel file format with file extension .xls.

  • Use pandas function read_excel with sheet_name parameter.

Neighbourhood deprivation scores#

# the xlrd library is not pre-installed, and `pandas`
# `read_excel` requires it to read .xls files
%pip install xlrd
marg_neighb = pd.read_excel('1_marg_neighb_toronto_2006_OnMarg.xls', 
                            sheet_name='Neighbourhood_Toronto_OnMarg')
marg_neighb.head()

If we specify the header parameter (here header=1, i.e., use the second row of the sheet as the header row), then the column names are read from that row.

marg_neighb = pd.read_excel('1_marg_neighb_toronto_2006_OnMarg.xls', 
                            sheet_name='Neighbourhood_Toronto_OnMarg', 
                            header=1)
marg_neighb.head()

Get the column names of marg_neighb.

marg_neighb.columns
# the elements can be accessed
# similar to a list

marg_neighb.columns[0:2]

Select the columns corresponding to Neighb id (column 0), Neighbourhood name (column 1), and DEPRIVATION (deprivation score - column 5).

marg_neighb[marg_neighb.columns[[0, 1, 5]]].head()

Create a new DataFrame called depriv with only those columns.

depriv = marg_neighb[marg_neighb.columns[[0, 1, 5]]]

Rename the columns of depriv

colnames = {'Neighb id ': 'Neighbid',
           'DEPRIVATION' : 'DEPRIVATION',
           'Neighbourhood name ': 'name'}

depriv = depriv.copy()

depriv.rename(columns = colnames, inplace= True)
depriv.head()

Mental health visit rates#

  • Read in data on rates of mental health visits stored in 2_ahd_neighb_db_ast_hbp_mhv_copd_2012.xls.

  • Use read_excel with sheet_name parameter

mentalhealth_neighb = pd.read_excel('2_ahd_neighb_db_ast_hbp_mhv_copd_2012.xls', 
                                    sheet_name = '2_MentalHealthV_2012', 
                                    header = 11)
mentalhealth_neighb.head()
mentalhealth_neighb.columns
mhvisitrates = mentalhealth_neighb[mentalhealth_neighb.columns[[0, 1, 10]]]
mhvisitrates.head()
  • Unnamed: 0 corresponds to Neighbid in depriv

  • Unnamed: 1 corresponds to name in depriv

  • Both sexes.2 (column 10) corresponds to the age-standardized rate of mental health visits (2012), all ages 20+, for both sexes

  • Rename the columns of mhvisitrates so that the columns shared with depriv have the same names.

colnames = {'Unnamed: 0': 'Neighbid',
           'Both sexes.2' : 'mhvisitrates_mf',
           'Unnamed: 1' : 'name'}

mhvisitrates = mhvisitrates.copy()

mhvisitrates.rename(columns = colnames, inplace=True)

mhvisitrates.columns

Merging mental health visits and deprivation score#

merge mhvisitrates and depriv

mhvisitdepriv = mhvisitrates.merge(depriv, on = ['Neighbid', 'name'])
mhvisitdepriv.head()
  • We will create a variable that categorizes neighbourhoods above and below the median deprivation score.

median_depriv = mhvisitdepriv['DEPRIVATION'].median()
median_depriv
mhvisitdepriv = mhvisitdepriv.copy()

# create a new column in mhvisitdepriv called depriv_HL
# classify neighbourhoods at or above the median as High

mhvisitdepriv.loc[mhvisitdepriv['DEPRIVATION'] >= median_depriv, 'depriv_HL'] = 'High'

# classify neighbourhoods below the median as Low
mhvisitdepriv.loc[mhvisitdepriv['DEPRIVATION'] < median_depriv, 'depriv_HL'] = 'Low'

mhvisitdepriv.tail(n=3) # print last 3 rows of mhvisitdepriv

Do Neighbourhoods with high deprivation have more mental health visits compared to Neighbourhoods with low deprivation?#

  • The mean rate of mental health visits in the high deprivation neighbourhoods is 8.80 and the mean rate in the low deprivation neighbourhoods is 7.48.

mean_table = mhvisitdepriv.groupby('depriv_HL')['mhvisitrates_mf'].mean()
mean_table
observed_mean_difference = mean_table.iloc[0] - mean_table.iloc[1]
observed_mean_difference
  • Let’s compare the distribution of mental health visit rates by high/low deprivation.
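
One simple way to make that comparison (a sketch, not from the original notes; the figure size and default bins are arbitrary choices) is to plot the two groups’ histograms side by side.

import matplotlib.pyplot as plt

# mental health visit rates for each deprivation group
high_rates = mhvisitdepriv.loc[mhvisitdepriv['depriv_HL'] == 'High', 'mhvisitrates_mf']
low_rates = mhvisitdepriv.loc[mhvisitdepriv['depriv_HL'] == 'Low', 'mhvisitrates_mf']

fig, axes = plt.subplots(1, 2, figsize = (10, 4), sharex = True, sharey = True)

axes[0].hist(high_rates, color = 'grey', edgecolor = 'black')
axes[0].set_title('High deprivation')
axes[0].set_xlabel('Mental health visit rate')

axes[1].hist(low_rates, color = 'grey', edgecolor = 'black')
axes[1].set_title('Low deprivation')
axes[1].set_xlabel('Mental health visit rate');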

The Logic of Hypothesis Testing#

1. Hypotheses#

Two claims:

  1. There is no difference in the mean mental health visit rates between high and low deprivation neighbourhoods. This is called the null hypothesis.

  2. There is a difference in the mean mental health visit rates between high and low deprivation neighbourhoods. This is called the alternative hypothesis.

2. Test statistic#

The test statistic is a number, calculated from the data, that captures what we’re interested in.

What would be a useful test statistic for this study?

3. Simulate what the null hypothesis predicts will happen#

  • If the null hypothesis is true then the mean values of high and low deprivation neighbourhoods will be the same regardless of how they are named or labelled.

  • That means we can randomly assign (or shuffle) the neighbourhood names between the high and low deprivation groups, and the mean difference should be close to 0.

  • Imagine we have 68 playing cards labelled High and 68 cards labelled Low.

  • Shuffle the cards …

  • Assign the cards to the 136 neighbourhoods then calculate the mean difference between high and low. This is one simulated value of the test statistic.

  • Shuffle the cards again …

  • Assign the cards to the 136 neighbourhoods then calculate the mean difference between high and low. This is another simulated value of the test statistic.

  • Continue shuffling, assigning to neighbourhoods, and computing the mean difference.

Shuffling#

  • The observed difference in mean rate of mental health visits between high and low deprivation is 0.75.

  • Could this difference be due to chance?

  • Let’s repeat this study assuming that the difference is due to chance.

  • Suppose that the (true) mean mental health visit rate in high deprivation neighbourhoods is equal to the (true) mean mental health visit rate in low deprivation neighbourhoods. Then the labels of 'depriv_HL' High and Low are …

Quiz#

If the (true) mean mental health visit rate in high deprivation neighbourhoods is equal to the (true) mean mental health visit rate in low deprivation neighbourhoods, then the labels of 'depriv_HL' High and Low on neighbourhoods are …

A. interchangeable (High can be swapped for Low without affecting the mean mental health visit rates)

B. not interchangeable

C. random

D. the same, but different on average.

E. different, but the same on average.

So, if

(1) the (true) mean mental health visit rate in high deprivation neighbourhoods is equal to the (true) mean mental health visit rate in low deprivation neighbourhoods,

(2) the labels High and Low are randomly shuffled, and

(3) the mean mental health visit rates for High and Low deprivation are re-calculated using the shuffled labels,

then the difference between the re-calculated means should be close to 0, differing from it only by chance.

mhvisitdepriv.iloc[0:10,[2,4]]

Random shuffling#

  • We can randomly shuffle using the sample function in pandas.

  • The parameter frac in the pandas sample function refers to the fraction of rows to return. frac = 1 means all the rows are returned.

import numpy as np

np.random.seed(7) # for reproducibility

mhvisitdepriv.iloc[0:10, 4].sample(frac = 1, replace = False)
  • reset_index resets the index.

  • the parameter drop = True in reset_index tells pandas not to keep the old (pre-shuffle) index as a column.

np.random.seed(7)

mhvisitdepriv.iloc[0:10,4].sample(frac = 1, replace = False).reset_index(drop = True)

We can investigate what happens if the labels for High and Low are randomly shuffled for the first 10 rows.

# mental health visit rates for the first 10 rows
# (.loc slicing is inclusive, so 0:9 selects rows 0 through 9)

visits = mhvisitdepriv.loc[0:9, 'mhvisitrates_mf']

# randomly shuffle depriv_HL for the same 10 rows
# previous index is dropped because we don't need it

shuffled_depriv = mhvisitdepriv.loc[0:9, 'depriv_HL'].sample(frac = 1, replace = False).reset_index(drop = True)

# put two dataframes in a list as input to
# pd.concat

L = [visits, shuffled_depriv]

# combine two columns
pd.concat(L, axis = 1)

Set up the simulation in python#

Step 1: Shuffle the column 'depriv_HL'

  • To do this we will use pandas sample with the following parameters:

    • frac = 1 (sample 100% of the values)

    • replace = False (sample without replacement, so we keep exactly the same number of High and Low labels as in the original sample)

  • reset_index(drop = True) (reset to the default integer index so that the shuffled labels line up with the rows of mhvisitdepriv)

depriv_HL_shuffle = mhvisitdepriv['depriv_HL'].sample(frac = 1, replace=False).reset_index(drop = True)

Step 2: Assign the shuffled labels to the neighbourhoods and compute the mean rate of mental health visits

visitrate_high_shuffle = mhvisitdepriv.loc[depriv_HL_shuffle == 'High', 'mhvisitrates_mf'].mean()

visitrate_low_shuffle = mhvisitdepriv.loc[depriv_HL_shuffle == 'Low', 'mhvisitrates_mf'].mean()

Step 3: Compute the mean difference between the groups

visitrate_high_shuffle - visitrate_low_shuffle 
  • Steps 1 - 3 form an algorithm for computing one simulated value of the mean difference when the null hypothesis is true.

  • Let’s create a function to do Steps 1 - 3 that returns a simulated value of the mean difference (a simulated value of the test statistic).

def random_shuffle_mean():
    # step 1: shuffle the High/Low labels
    depriv_HL_shuffle = mhvisitdepriv['depriv_HL'].sample(frac = 1, replace=False).reset_index(drop = True)
    
    # step 2: compute the mean visit rate for each shuffled group
    visitrate_high_shuffle = mhvisitdepriv.loc[depriv_HL_shuffle == 'High', 'mhvisitrates_mf'].mean()
    visitrate_low_shuffle = mhvisitdepriv.loc[depriv_HL_shuffle == 'Low', 'mhvisitrates_mf'].mean()
    
    # step 3: mean difference (High minus Low) under the shuffled labels
    shuffled_diff = visitrate_high_shuffle - visitrate_low_shuffle
    
    return shuffled_diff
random_shuffle_mean()

Distribution of simulated values of the mean difference assuming the null hypothesis is true#

  • Statistical tests based on shuffles or permutations of the data are called permutation tests.

shuffled_diffs = []

for _ in range(5000):
    shuffled_diffs.append(random_shuffle_mean())
import matplotlib.pyplot as plt

plt.hist(shuffled_diffs, color = 'grey', edgecolor = 'black')

plt.vlines(x = observed_mean_difference, ymin = 0, ymax = 1600, color = 'red')

plt.vlines(x = -1*observed_mean_difference, ymin = 0, ymax = 1600, color = 'red')

p-value#

  • Assuming that the null hypothesis is true, the p-value gives a measure of the probability of getting data that are at least as unusual or extreme as the sample data.

  • What does “at least as unusual” mean?

  • An unusual value would be a simulated mean difference that is at least as far from 0 as the observed difference, in either direction.

  • That is, simulated values that are greater than or equal to the observed mean difference (the test statistic) 0.75 (High minus Low), or less than or equal to -0.75.

The number of simulated mean differences that are greater than or equal to the observed mean difference can be computed using:

# convert the list to a numpy array so it can be compared to a number elementwise
rightextreme = np.array(shuffled_diffs) >= observed_mean_difference

rightextreme.sum()

The number of simulated mean differences that are less than or equal to -1 times the observed mean difference can be computed using:

leftextreme = np.array(shuffled_diffs) <= -1*observed_mean_difference

leftextreme.sum()

The proportion of simulations that are at least as unusual or extreme is:

(rightextreme.sum() + leftextreme.sum()) / 5000

We can also write a function to carry out this computation.

def pvalue2(shuff_diffs, obs_diffs):
    
    shuff_diffs = np.array(shuff_diffs)  # so the comparisons work elementwise
    
    rightextreme_count = (shuff_diffs >= obs_diffs).sum()
    
    leftextreme_count = (shuff_diffs <= -1*obs_diffs).sum()
    
    allextreme = rightextreme_count + leftextreme_count
    
    pval = allextreme / len(shuff_diffs)
    
    return pval  
pvalue2(shuff_diffs = shuffled_diffs, obs_diffs = observed_mean_difference)
  • This is called the p-value.

  • The entire procedure is often referred to as significance testing.

  • None of the 5,000 permuted samples resulted in a mean difference more extreme than the one we observed.

Conclusions#

So, there are two possibilities:

  1. The null hypothesis of no difference is true, but we have observed a very rare value of the mean difference in our study.

  2. The null hypothesis is false, and there is a difference between high and low deprivation.

It is scientific convention to treat a small p-value as evidence in favour of 2 (i.e., the null hypothesis is false, and there is evidence of a difference in the means).

History of Significance Testing#

R.A. Fisher made many scientific contributions including developing significance testing.

“He was from an early age a supporter of certain eugenic ideas, and it is for this reason that he has been accused of being a racist and an advocate of forced sterilisation (Evans 2020). His promotion of eugenics has recently caused various organisations to remove his name from awards and dedications of buildings.” (Bodmer et al., 2021)

Causality#

Imagine the following hypothetical scenario:

  • you have a headache

  • you take an Aspirin at 11:00 to relieve your pain. Your pain goes away after 30 minutes.

Now, you go back in time to 11:00 …

  • you don’t take an Aspirin at 11:00. Your pain goes away after 48 minutes.

The causal effect of taking an Aspirin is 18 minutes (48 minutes - 30 minutes).

Potential outcomes and randomized control trials#

  • Establishing causality involves comparing these potential outcomes.

  • The problem is that we can never observe both taking an Aspirin and not taking an Aspirin (in the same person, at the same time, under the same conditions).

  • A close approximation to comparing potential outcomes is to compare two groups of people that are similar on average (age, sex, income, etc.), except one group is allowed to take Aspirin after a headache and the other group takes a fake Aspirin (a sugar pill/placebo) after a headache. This is an example of a randomized controlled trial.

  • Then the mean difference between time to pain relief should be due to Aspirin and not other factors related to why people may or may not take an Aspirin.
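
To make the idea of potential outcomes concrete, here is a small simulated example (a sketch with made-up numbers based on the Aspirin story above, not real trial data; the population size, the baseline relief times, and the normal noise are all assumptions). Randomly assigning treatment lets the difference in observed group means approximate the average causal effect, even though we never see both potential outcomes for any one person.

import numpy as np

rng = np.random.default_rng(0)

n = 10000  # hypothetical study participants

# made-up potential outcomes (minutes until pain relief) for every person;
# assume, for simplicity, everyone's relief is exactly 18 minutes faster with Aspirin
relief_without = rng.normal(48, 10, size = n)  # if the person does not take Aspirin
relief_with = relief_without - 18              # if the same person takes Aspirin

# randomly assign half of the people to take Aspirin
takes_aspirin = rng.permutation(np.repeat([True, False], n // 2))

# we only ever observe one potential outcome per person
observed = np.where(takes_aspirin, relief_with, relief_without)

# the difference in group means is close to the true average causal effect of 18 minutes
print(observed[~takes_aspirin].mean() - observed[takes_aspirin].mean())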

Does Material Deprivation Cause increases in mental health visits?#

  • Neighbourhoods can’t be randomly assigned to high or low deprivation, the way people can be randomized to take a real or fake Aspirin.

  • There could be many other factors related to why a neighbourhood has a low or high deprivation score.

  • This means that when we compare neighbourhoods with low deprivation to those with high deprivation, differences in an outcome such as mental health visit rates could be due to factors other than deprivation, for example genetic or environmental factors associated with mental illness that happen to be more or less prevalent in one of the deprivation groups.

Just in case you have seen a two-sample t-test …#

The permutation test above is similar to a two-sample t-test (which we won’t cover in this course beyond this aside), although the t-test is not as flexible because it relies on additional assumptions about the data.

from scipy import stats

mhvisits_low = mhvisitdepriv.loc[mhvisitdepriv['depriv_HL'] == 'Low', 'mhvisitrates_mf']

mhvisits_high = mhvisitdepriv.loc[mhvisitdepriv['depriv_HL'] == 'High', 'mhvisitrates_mf']

statistic, pvalue = stats.ttest_ind(mhvisits_high, mhvisits_low)

print('The p-value from the two-sample t-test is: ', round(pvalue, 3))

The two-sample t-test involves:

  1. Computing \(t_{\text{observed}} = \frac{\bar{x}_{\text{High}} - \bar{x}_{\text{Low}}}{SD}\), where \(SD\) is an estimate of the standard deviation of the difference in sample means.

  2. Computing the p-value using \(t_{\text{observed}}\) and the appropriate \(t\)-distribution (a sketch of these two steps follows this list).

  3. Assessing several statistical assumptions about the data to ensure the accuracy of the p-value.
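
For reference, here is a minimal sketch of steps 1 and 2 (not in the original notes), assuming the pooled, equal-variance standard deviation that ttest_ind uses by default, and reusing mhvisits_high and mhvisits_low from the cell above.

import numpy as np
from scipy import stats

x_high = mhvisits_high.to_numpy()
x_low = mhvisits_low.to_numpy()

n1, n2 = len(x_high), len(x_low)

# pooled variance estimate (the equal-variance assumption used by ttest_ind by default)
pooled_var = ((n1 - 1) * x_high.var(ddof = 1) + (n2 - 1) * x_low.var(ddof = 1)) / (n1 + n2 - 2)

# SD in step 1: the estimated standard deviation of the difference in sample means
sd_of_difference = np.sqrt(pooled_var * (1 / n1 + 1 / n2))

t_observed = (x_high.mean() - x_low.mean()) / sd_of_difference

# step 2: two-sided p-value from the t-distribution with n1 + n2 - 2 degrees of freedom
p_value = 2 * stats.t.sf(abs(t_observed), df = n1 + n2 - 2)

print(round(t_observed, 3), round(p_value, 3))

Up to rounding, the result should match the statistic and p-value returned by stats.ttest_ind above.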

Review of this class#

  • Comparing two samples by simulating the distribution of a test statistic (e.g., the difference in two means) assuming the null hypothesis is true (i.e., there is no difference in the means).

  • To simulate values of the test statistic when the null hypothesis is true, we:

    • Stated the null and alternative hypotheses.

    • Randomly shuffled the group labels (e.g., high/low deprivation).

    • Calculated the test statistic in the shuffled data set.

    • Repeated the shuffling and calculation a large number of times (e.g., 5,000).

  • Computed the p-value as a measure of how consistent the data are with the null hypothesis.

  • The p-value is computed by summing the number of simulated values that are more extreme in the positive or negative direction.

  • If the p-value is small then the data are inconsistent with the null hypothesis and there is evidence in support of the alternative hypothesis.

  • A small p-value may not be causal evidence for the alternative hypothesis unless random assignment was used as part of the study design.