One- and two-sample statistics
from itertools import product
from pathlib import Path
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.weightstats import ttest_ind
base_path = Path().cwd().parent / "data/stats"
One- and two-sample statistics are statistics where we compare one group to some predefined mean or distribution, or compare whether two groups are different. There are parametric and non-parametric tests.
The t-test¶
The t-test is perhaps one of the simplest statistical tests you can run. Usually in electrophysiology we are comparing two different (independent) groups, comparing two different timepoints or a drug treatment within a group (paired, not independent), or comparing whether a single group is different from a predefined mean (one-sample t-test, which is also what a paired t-test reduces to when run on the differences). You can run a t-test using Scipy or Statsmodels in Python. I always recommend getting the confidence intervals of the difference between the two groups. When the confidence interval of the difference does not include 0, that usually indicates a statistically significant effect, and the further your interval is from 0 the larger the effect size. Since the p-value depends on sample size, this helps interpret the p-value. If you are running a t-test between two different groups there are several things to consider.
1. Are your data normally distributed? Yes, go to 2. (A quick normality-check sketch follows this list.)
2. Are your two groups independent? Yes, use a z-test, t-test, or Welch’s t-test. No, use a paired t-test.
3. Is the variance of your two groups the same? Yes, use the standard t-test. No, use Welch’s t-test.
4. Are your sample sizes unequal? Use Welch’s t-test.
5. How large are your sample sizes? Larger than ~30, use the z-test.
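If you want a quick check for step 1, here is a minimal sketch using the Shapiro-Wilk test. The simulated sample is a placeholder, not the dataset used later in this chapter:
rng = np.random.default_rng(0)
sample = rng.normal(0, 1, size=20)  # placeholder data standing in for one of your groups
stat, p = stats.shapiro(sample)  # Shapiro-Wilk; a small p-value suggests the data are not normal
print(f"Shapiro-Wilk W = {stat:.3f}, p = {p:.3f}")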
The output¶
You will get the test statistic, a p-value and the degrees of freedom. You can also get confidence intervals. For simple tests like t-tests I prefer to get bootstrapped confidence intervals.
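As a minimal sketch of what a bootstrapped confidence interval of the difference can look like (the mean_diff statistic and the simulated groups here are placeholders, not the data loaded below):
rng = np.random.default_rng(0)
group_a = rng.normal(10, 2, size=12)  # placeholder groups
group_b = rng.normal(8, 2, size=12)

def mean_diff(x, y, axis):
    return np.mean(x, axis=axis) - np.mean(y, axis=axis)

res = stats.bootstrap([group_a, group_b], mean_diff, n_resamples=9999)
print(res.confidence_interval)  # if the interval excludes 0, the effect is likely meaningful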
Scipy vs Statsmodels¶
I prefer Scipy for the t-tests, but I will include the code for Statsmodels as well.
Two-sample t-test with equal variances¶
Scipy¶
data = pd.read_csv(base_path / "ttest.csv", header=None)
one = data.loc[data[1] == "one", 0]
two = data.loc[data[1] == "two", 0]
output = stats.ttest_ind(one, two)
print(output)
print(output.confidence_interval())
TtestResult(statistic=np.float64(2.808477410819165), pvalue=np.float64(0.01023650912422613), df=np.float64(22.0))
ConfidenceInterval(low=np.float64(0.6512716748203615), high=np.float64(4.328502280179638))
Statsmodels¶
output = ttest_ind(one, two, alternative="two-sided", usevar="pooled")
print(output)
(np.float64(2.808477410819165), np.float64(0.01023650912422613), np.float64(22.0))
Comparing a two-sample t-test with unequal variances with and without Welch’s correction¶
You will see that without Welch’s correction the degrees of freedom (df) are inflated and the p-value is smaller than it should be. One thing to note is that Welch’s correction will converge towards a standard t-test if the variances and sample sizes are equal.
Scipy¶
data = pd.read_csv(base_path / "ttest_welch.csv", header=None)
one = data.loc[data[1] == "one", 0]
two = data.loc[data[1] == "two", 0]
output = stats.ttest_ind(one, two, equal_var=True)
output_w = stats.ttest_ind(one, two, equal_var=False)
print(output, output_w, sep="\n")
TtestResult(statistic=np.float64(3.940909741688904), pvalue=np.float64(0.0006965299598973857), df=np.float64(22.0))
TtestResult(statistic=np.float64(3.940909741688904), pvalue=np.float64(0.0017280885405381943), df=np.float64(12.84406845836461))
Statsmodels¶
output = ttest_ind(one, two, alternative="two-sided", usevar="pooled")
output_w = ttest_ind(one, two, alternative="two-sided", usevar="unequal")
print(output, output_w, sep="\n")
(np.float64(3.940909741688904), np.float64(0.0006965299598973857), np.float64(22.0))
(np.float64(3.940909741688904), np.float64(0.0017280885405381926), np.float64(12.844068458364609))
One tailed t-test¶
One-tailed t-tests are used when you have a specific hypothesis about the direction of your effect. This needs to be chosen before you collect your data. An example of when this would be used is if you have a treatment that “needs” to be more effective than a previous treatment, or if you are replicating data and already have a hypothesis about the direction of your effect. The direction of your effect is determined by which group you have as one and two. If you expect one to be larger than two (positive difference between the groups) you want larger/greater. If you expect one to be smaller than two (negative difference between the groups) then you want less/smaller.
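With Scipy this is just the alternative argument. A quick sketch reusing the one and two groups loaded above (assuming a reasonably recent Scipy; output omitted):
greater = stats.ttest_ind(one, two, alternative="greater")  # tests whether `one` is larger than `two`
less = stats.ttest_ind(one, two, alternative="less")  # tests whether `one` is smaller than `two`
print(greater.pvalue, less.pvalue)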
Paired t-test¶
The paired t-test is used when you have a subject that has two measures and you want to compare whether the two measures are different. Paired data (aka within-subject data) violates the independence assumption of a traditional t-test. A paired t-test will also increase your statistical power by reducing between-subject variance, since each subject essentially acts as its own control.
Scipy¶
data = pd.read_csv(base_path / "ttest.csv", header=None)
one = data.loc[data[1] == "one", 0]
two = data.loc[data[1] == "two", 0]
output = stats.ttest_rel(one, two)
print(output)
TtestResult(statistic=np.float64(2.6762147294376857), pvalue=np.float64(0.02155343272358373), df=np.int64(11))
Statsmodels¶
Statsmodels does not have a dedicated paired t-test. Since the equivalent one-sample t-test on the differences is pretty simple to run, we will compute it manually here. If you want to test whether your difference differs from some value other than zero, replace the 0 with that value.
diffs = one.values - two.values
m_diff = diffs.mean() - 0
se = diffs.std(ddof=1) / np.sqrt(diffs.size)
t_value = m_diff / se
p = stats.t.sf(np.abs(t_value), df=diffs.size - 1) * 2 # two-sided test
print(f"statistic: {t_value}, pvalue: {p}, df: {diffs.size - 1}")statistic: 2.6762147294376857, pvalue: 0.02155343272358373, df: 11
Rank-based two-sample statistics¶
There are a variety of non-parametric two-sample rank-based statistics, such as the Mann-Whitney U and Wilcoxon signed-rank tests. Rank-based tests can be used with continuous or ordinal data. There are some things you should be aware of before using rank-based tests. If your data are continuous, try transforming your data before running a rank-based test. I often see people fall back to Mann-Whitney for non-normal data; however, in most cases a simple log-transform would have been enough to make the data fit the assumptions of the t-test and its variants. Another is that for rank-based tests in biological sciences we generally want to test whether the location or the scale is different, but not both, because that makes interpreting our data hard. To do so we are most interested in the median. If your two samples come from the same family of distributions (say, both gamma) but with different shapes AND dispersion, you may not have a difference in medians (due to location or scale), which is what we really want to test with rank-based tests. The test is still valid; you just cannot say where the difference between your samples is. If you do not know the terms location, shape, scale, or dispersion, check out the Distributions chapter. Non-parametric rank-based statistics really have almost as many assumptions as parametric statistics. Rank-based statistics can also be sensitive to the number of ties, that is, the number of identical values between your two groups. Rank-based statistics are also not a substitute for parametric tests when your sample size is small (e.g., n=3 for each group). The last case where rank-based tests may be useful is when you have outliers.
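As a quick illustration of the log-transform point, here is a sketch on simulated lognormal (right-skewed) placeholder data, not the data used elsewhere in this chapter:
rng = np.random.default_rng(0)
skewed_a = rng.lognormal(mean=1.0, sigma=0.5, size=15)  # placeholder right-skewed groups
skewed_b = rng.lognormal(mean=1.4, sigma=0.5, size=15)
print(stats.ttest_ind(np.log(skewed_a), np.log(skewed_b)))  # t-test on the log scale, where the data are roughly normal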
Mann-Whitney U¶
Mann-Whitney is a rank-based statistical test for two independent samples. Rather than compare the mean difference, Mann-Whitney essentially scores how many values in one group are larger than the values in the other group and creates a U score for each group, which is used to generate a p-value. One thing to note is that the U value that Scipy outputs changes depending on which group is one and which group is two, but the p-value will be the same since the p-value is calculated from the smaller U value. The second U value can be calculated as U2 = n1*n2 - U1. Another thing to note is that you have to use a different method to calculate the p-value if your sample size is small; the Mann-Whitney test has a table of precomputed critical values for smaller sample sizes. For the Mann-Whitney test to be valid your data must come from the same distribution type, otherwise interpreting what difference your significant p-value is showing is next to impossible. When reporting Mann-Whitney p-values you should also report the difference between the medians (or the median of each group), the Hodges-Lehmann estimator, which acts as an effect-size measure for the Mann-Whitney test, and the probability of superiority, which gives you the probability that a randomly chosen value from one group will be larger than a randomly chosen value from the other.
Some other parameters in the Scipy version: alternative is used if you expect group one to be greater or less than group two, and method defines whether to use the exact, asymptotic, or permutation p-value calculation. “auto” is probably the best for most things, but Scipy recommends passing a PermutationMethod instance if there are fewer than 10 samples in one of the groups and there are ties.
def hodges_lehmann(group1, group2):
    """Median of all pairwise differences (computed as group2 - group1)."""
    pairwise_diffs = [y - x for x, y in product(group1, group2)]
    return np.median(pairwise_diffs)


def probability_of_superiority(group1, group2):
    """P(Y > X), with Y from group2 and X from group1 (common language effect size)."""
    pairs = [(x, y) for x, y in product(group1, group2)]
    return np.mean([y > x for x, y in pairs])


def rank_biserial(U1, n1, n2):
    """Rank-biserial correlation for a Mann-Whitney U test.

    Args:
        U1 (float): Mann-Whitney U value for group one/n1
        n1 (int): Number of values in group one
        n2 (int): Number of values in group two

    Returns:
        float: Rank biserial correlation value
    """
    return (2 * U1 / (n1 * n2)) - 1
data = pd.read_csv(base_path / "ttest.csv", header=None)
one = data.loc[data[1] == "one", 0]
two = data.loc[data[1] == "two", 0]
mannu = stats.mannwhitneyu(one, two, method="auto")
print(mannu, f"U2: {len(one)*len(two)-mannu[0]}")
print(f"Rank-biserial correlation: {rank_biserial(mannu[0], len(one), len(two))}")
print(f"Hodges_lehmann: {hodges_lehmann(one.values, two.values)}")
print(f"Probability of superiority: {probability_of_superiority(one.values, two.values)}")MannwhitneyuResult(statistic=np.float64(116.0), pvalue=np.float64(0.012022825407617439)) U2: 28.0
Rank-biserial correlation: 0.6111111111111112
Hodges_lehmann: -2.500757395
Probability of superiority: 0.19444444444444445
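For small samples with ties, the permutation-based p-value mentioned above can be requested like this (a sketch; assumes a recent Scipy, roughly 1.11+, where stats.PermutationMethod is available):
mannu_perm = stats.mannwhitneyu(one, two, method=stats.PermutationMethod())  # permutation p-value instead of "auto"
print(mannu_perm.pvalue)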
Brunner-Munzel¶
Brunner-Munzel is a rank-based test with very few assumptions that tests whether the probability of getting large values is the same in both groups. It is robust to outliers, and it can be used when the shape or dispersion of the distributions differ or when the distributions themselves are different. However, there are some requirements which mean it will not be useful in many cases. One is that Scipy recommends a sample size greater than 10 for both groups, so most biological experiments will not hit this threshold. The other is that responses must be ordinal or can be assumed to be ordinal (e.g., number of votes).
stats.brunnermunzel(one, two)
BrunnerMunzelResult(statistic=np.float64(-3.3166247903554), pvalue=np.float64(0.0036251885202779398))
Bootstrap and permutation tests¶
Bootstrap and permutation tests are resampling tests. They are really more a method than a single test, since you can apply them to many test statistics, including parametric ones. Some people insist that these tests are better for small sample sizes; however, the core problem remains that a small sample size is still too small.
Permutation test¶
The permutation test tests the null hypothesis that all samples come from the same distribution. The permutation test takes the following steps:
1. Get the difference between your two groups.
2. Pool your groups.
3. Randomly assign values to each group. The number of assigned values should be the same as the size of the original groups. Ideally you want to choose all unique permutations without repeats; however, sometimes that is not possible with large sample sizes, so you use a subset.
4. Find the difference between the groups.
5. Repeat steps 3 and 4 n times.
6. Find whether your difference is significant: the p-value is the fraction of permuted differences at least as extreme as the observed difference.

The permutation test can also be run with other test statistics, such as the t statistic or the Mann-Whitney U.
def statistic(x, y, axis):
    return np.mean(y, axis=axis) - np.mean(x, axis=axis)


res = stats.permutation_test([one, two], statistic, permutation_type="independent", alternative="two-sided")
print(res.pvalue)
0.007
Bootstrap test¶
The bootstrap test takes a different approach by repeatedly sampling with replacement from each group and comparing the observed difference to the bootstrap distribution of differences. Scipy’s bootstrap takes some extra work to get a p-value: we need to get the percentile rank of the observed difference within the mean-centered bootstrap distribution.
def statistic(x, y, axis):
    return np.mean(y, axis=axis) - np.mean(x, axis=axis)


res = stats.bootstrap([one, two], statistic)
observed_mean = np.mean(one) - np.mean(two)
bs_replicates_shifted = res.bootstrap_distribution - res.bootstrap_distribution.mean()
emp_diff_pctile_rnk = stats.percentileofscore(bs_replicates_shifted, observed_mean)
auc_right = emp_diff_pctile_rnk / 100
auc_left = 1 - emp_diff_pctile_rnk / 100
auc_tail = auc_left if auc_left < auc_right else auc_right
p_value = auc_tail * 2
print(f"p-value: {p_value}")
p-value: 0.002200220022002153