Statistics
Statistics is one of those topics where I feel like the classes need to be updated for a modern world. Most of the statistics you will do will be run on a computer, yet in the stats class I took we never touched a computer. It is important to know some of the math under the hood of the statistical models, but we have limited time and cannot become experts in everything. There are very smart people who have created statistical tools we should use. Programming does allow you to see more of the inner workings of statistical tests. This tutorial will show you best practices for using Python (this chapter) and R (the next chapter) to run statistics. The R chapter will mostly be programming examples since the Python chapter will cover more of the whys. R will cover more mixed-model statistics since its ecosystem for that is much better. Lastly, we will cover only frequentist methods, not Bayesian.
Statistics as models of data¶
It is important to think of statistics as modeling our data. Models are often simplified versions of what we see in real life. Generally, we are trying to find the simplest model that most fully recapitulates our actual data. Since we will never be able to fully model our data, there is error between what our model predicts and what our data actually are. We also need to be keenly aware of the assumptions that statistical models make. Failure of data to meet the assumptions of a statistical model can result in false positives. More complex models can relax some of these assumptions but can also reduce the power of our model. When we get to the regression models there are a variety of ways we can compare different regression models within a regression class (OLS, GLM, mixed model, etc.).
Data¶
Data that we collect are often called realizations that are generated by a random function (i.e. a distribution). Data can come in several types and the type determines the test we can use.
Continuous¶
This is the most common data type in biology and is the most common dependent variable type that we collect. Continuous means we could, in principle, record any real value with infinite precision; in reality, our measurements have finite precision. Examples are rheobase, frequency, membrane resistance, etc. Continuous data can come from a variety of continuous distributions, which we cover in the distributions chapter.
Categorical¶
Categorical data is very common in biology and is most commonly an independent variable. Categorical data has a finite number of values that are unordered and usually the result of some grouping like genotype or sex. Categorical data can be a dependent variable as well, but that is much less common.
Ordinal¶
Ordinal data is probably less common in biology. It is data with a finite number of values (discrete) that have some intrinsic order, like vote counts (more is better than fewer) or date-times. It is inherently ordered but not continuous.
Binary¶
This is 0 or 1, success or failure. Binary data is not that common in biology but can be very useful. Binary data can be generated by applying a single cutoff to continuous data. This is useful if you have data that is heavily skewed, kurtotic, or bimodal, where you can split values into two groups and test that instead. Binary data fall under the binomial distribution.
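As a rough sketch of this idea (the data and cutoff below are hypothetical, and the median is just one possible cutoff), dichotomizing a skewed continuous measurement in Python might look like:

```python
# A minimal sketch of converting a skewed continuous measurement
# into a binary outcome with a single cutoff. Values are made up.
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical right-skewed continuous data (e.g. event frequencies in Hz)
frequency = rng.lognormal(mean=1.0, sigma=0.8, size=30)

# Choose a cutoff (here the overall median) and code each observation as 0 or 1
cutoff = np.median(frequency)
responder = (frequency > cutoff).astype(int)

print(responder)          # array of 0s and 1s
print(responder.mean())   # proportion of "successes"
```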
Basic terminology¶
Before we dive into stats we will cover some basic terminology. Part of understanding a topic is sometimes just learning the language the topic uses.
Random function/variable and realizations¶
A random function/variable is a mathematical function that generates realizations. Sometimes, and maybe often, we do not know the actual random function/variable that generates our data, and this introduces error. Realizations are the data we work with; they are what comes out of a random function/variable. Realizations can be a single number or a set of numbers. The random component is essentially the probability of getting a specific realization. You do not choose the realization; it is randomly drawn from a random function like a probability distribution.
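To make this concrete, here is a minimal sketch in which a normal distribution plays the role of the random function and each draw is a realization (the mean and SD used here are made up):

```python
# Sketch of a random variable generating realizations: the distribution
# (normal with a chosen mean and SD) is the random function, and each
# draw from it is a realization. Parameters are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

one_realization = rng.normal(loc=70.0, scale=5.0)              # a single number
many_realizations = rng.normal(loc=70.0, scale=5.0, size=10)   # a set of numbers

print(one_realization)
print(many_realizations)
```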
Central tendency¶
Central tendency is the typical value for a dataset or, more formally, for a probability distribution. We typically use the mean as a measure of central tendency; however, there are several other measures such as the median and mode.
Mean: The weighted sum of all the values in a dataset. The basic mean just assigns a weight of 1/n to each value, where n is the number of samples in your dataset. Many statistical tests that use the normal distribution under the hood are comparing the standard mean. There are other versions of the mean, such as the geometric and power mean, but they are not used in the statistical tests we will cover.
Median: Percentile-based measure where the median is the 50th percentile of your data. It is typically reported with the interquartile range (IQR) or other percentiles such as the 25th-75th percentile. If you have an odd number of values, the median is the middle value when you sort the values; otherwise it is the average of the two middle values.
Mode: The most common value in a set of values. It is typically only defined for categorical, ordinal, or other sets of values where there may be repeating values.
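Here is a small sketch computing all three measures on a made-up set of values with numpy and scipy:

```python
# Sketch of the three measures of central tendency on a made-up sample.
import numpy as np
from scipy import stats

values = np.array([2, 3, 3, 4, 5, 5, 5, 8, 12])

print(np.mean(values))      # weighted sum with weights of 1/n
print(np.median(values))    # 50th percentile
print(stats.mode(values))   # ModeResult with the most common value and its count
```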
Spread or dispersion of data¶
There are several ways to measure the spread/dispersion of your data. The most commonly used in science is the SEM; however, it is probably the least useful for understanding your data.
Variance (Var): The spread of your data around the mean. It is just the standard deviation squared, so it is not in the units of your data but in those units squared. The variance is very important in the mathematical theory of generalized linear models.
Standard deviation (STD or STDEV): The spread of your data around the mean. The standard deviation is very important for statistics because most parametric statistical models assume the standard deviation of both groups is the same.
Standard error of the mean (SEM): The precision of your estimated mean. SEM decreases as n increases, but not linearly. Many people plot the SEM because it makes error bars look small. The SEM is only indirectly related to statistical tests, in that p-values, like the SEM, depend on sample size.
Homogeneity of variances (homoscedastic): When two or more groups have the same variance. Homogeneity of variances is one of the assumptions of many parametric statistics.
Heterogeneity of variances (heteroscedastic): When two or more groups have different variances. Heteroscedasticity can lead to incorrect results from hypothesis tests. Sometimes heteroscedasticity can be corrected by transforming your data; other times you will need specialized statistical tests.
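A short sketch of these dispersion measures, plus Levene's test for homogeneity of variances, on two hypothetical groups:

```python
# Sketch of the dispersion measures above, plus Levene's test for
# homogeneity of variances. Groups and values are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(loc=10.0, scale=2.0, size=12)
group_b = rng.normal(loc=10.0, scale=4.0, size=12)

print(np.var(group_a, ddof=1))   # sample variance (in units squared)
print(np.std(group_a, ddof=1))   # sample standard deviation
print(stats.sem(group_a))        # standard error of the mean

# Levene's test: null hypothesis is that the groups have equal variances
stat, p = stats.levene(group_a, group_b)
print(stat, p)
```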
Model¶
A model is essentially anything you can put your data into and get out something that helps you describe your data. All statistical tests model your data. Models make assumptions about your data, so you need to make sure your data fits those assumptions. Models are often a simplified version of what happens in real life, thus they have error. We generally want to minimize the error of our statistical tests. Below are some terms you will see when describing a statistical model.
Between-subject: Between-subject factors are things like genotype, sex, or age (if you have restricted age groups like adult, child, infant). Between-subject tests are the most common in basic neuroscience and biology.
Within-subject (repeated measures): Some sort of measure that is collected from a subject two or more times. This can be cells within a mouse, mice within a litter, or students within a school within a district within a state. Usually repeated measures are things like baseline, drug, drug+drug in the same cell, or timepoints in the same mouse. Mixed-model and repeated-measures regression/ANOVA are two ways to account for within-subject structure; however, these models have different assumptions. Within-subject effects are often ignored in neuroscience and biology. Ignoring within-subject structure can distort p-values since within-subject tests partition within- and between-mouse differences (see the sketch after this list). Within-subject tests often require larger sample sizes for better estimates.
Test statistic: This is a value calculated from the test that can be used for hypothesis testing. You will see t, F, and chi-squared test statistics, among others. The test statistics also have an associated degrees of freedom that is used to look up the resulting p-value. F and t values are in some cases interchangeable, such as a simple regression with a single categorical predictor versus a t-test (where F = t²).
Predictor or independent variable: These are the “x” values in your data. Many tests are essentially solving for coefficients that show how much these predictor variables affect the outcome or dependent variable.
Outcome or dependent variable: These are the “y” values in your data. These depend on your “x” values or are the outcome of your “x” values.
Parametric: We covered this in the distributions chapter but parametric just means you have distributions that have parameters to describe them such as mean, STD, scale, shape, etc. Parametric statistics are most of what we will cover.
Non-parametric: These statistical tests are often rank-based tests. They essentially rank two groups and compare how many values in one group are larger than those in the other group. Non-parametric statistics do not assume your data follows a known distribution but some do assume that each group follows the same distribution.
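To make the between- vs within-subject distinction concrete (the sketch referenced above), here is a rough example with hypothetical data: an unpaired t-test that ignores the repeated measurements, a paired t-test that accounts for them, and a random-intercept mixed model via statsmodels (the column names value, treatment, and mouse are made up):

```python
# Sketch of between- vs within-subject tests on simulated data.
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n_mice = 8
baseline = rng.normal(10, 2, size=n_mice)
drug = baseline + rng.normal(1.0, 1.0, size=n_mice)  # correlated with baseline

# Between-subject test (ignores that the same mice were measured twice)
print(stats.ttest_ind(baseline, drug))

# Within-subject (repeated measures) test on the same data
print(stats.ttest_rel(baseline, drug))

# Mixed model with a random intercept per mouse (closer to what the
# R chapter covers), built from a long-format DataFrame
df = pd.DataFrame({
    "value": np.concatenate([baseline, drug]),
    "treatment": ["baseline"] * n_mice + ["drug"] * n_mice,
    "mouse": list(range(n_mice)) * 2,
})
model = smf.mixedlm("value ~ treatment", data=df, groups=df["mouse"]).fit()
print(model.summary())
```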
Multiple comparisons¶
This is when you compare one group with many other groups and usually get a p-value for each comparison. Multiple-comparison p-values need to be corrected for the many comparisons. Multiple comparisons are usually run as post hoc tests for an ANOVA or on RNA expression data. Multiple comparisons are not a substitute for an ANOVA, which I have seen used too many times.
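As a small sketch, statsmodels can correct a set of p-values for multiple comparisons (the p-values below are made up, and Holm is just one of several available methods):

```python
# Sketch of correcting p-values for multiple comparisons with statsmodels.
from statsmodels.stats.multitest import multipletests

raw_p = [0.001, 0.01, 0.04, 0.20, 0.50]  # hypothetical p-values

reject, corrected_p, _, _ = multipletests(raw_p, alpha=0.05, method="holm")
print(reject)        # which null hypotheses we can still reject
print(corrected_p)   # adjusted p-values
```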
Bootstrapping¶
Resampling from a set of data over and over again. You can bootstrap with replacement where you can draw a value multiple times, or without replacement where you cannot draw a value multiple times. Bootstrapping can be used to construct non-parametric statistics. Bootstrapping assumes your data is independent and identically distributed.
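A minimal sketch of bootstrapping with replacement to get a percentile-based interval for the mean (the data and number of resamples are arbitrary):

```python
# Sketch of bootstrapping the mean by resampling with replacement.
import numpy as np

rng = np.random.default_rng(3)
data = rng.normal(loc=5.0, scale=1.5, size=25)  # hypothetical sample

boot_means = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(10_000)
])

# A simple percentile-based 95% interval for the mean
print(np.percentile(boot_means, [2.5, 97.5]))
```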
Independent and identically distributed¶
Independent means that each data point you have was not influenced by, or related to, any other data point in your dataset. Knowledge of one value does not give you information about another value. The independence assumption is violated when you have within-subject (repeated) measurements. Identically distributed means that all values in your dataset are drawn from the same probability distribution. In a way, many statistical tests with categorical independent variables (predictors) are testing whether "identically distributed" is false: the means are different, with the assumption that the variance is the same. Other tests, such as Levene's test, test whether the variance is different.
Null hypothesis¶
Baseline assumption that there is no difference in a parameter between different groups. Usually the parameter is the mean for normally distributed data but could be the variance for something like Levene’s test.
Hypothesis testing¶
Testing to determine whether the data provided are sufficient to reject a hypothesis. Most statistical tests we run will be used to determine whether we can reject the null hypothesis, usually the hypothesis that there is no difference between groups. Typically your hypothesis does not tell you the direction or magnitude of the difference.
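A short sketch of a hypothesis test with scipy: comparing two hypothetical groups and deciding whether to reject the null hypothesis at the conventional (but arbitrary) alpha of 0.05:

```python
# Sketch of a hypothesis test: can we reject the null hypothesis that
# two groups have the same mean? Data are simulated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
wildtype = rng.normal(loc=100.0, scale=15.0, size=12)
knockout = rng.normal(loc=85.0, scale=15.0, size=12)

result = stats.ttest_ind(wildtype, knockout)
print(result.statistic, result.pvalue)

# Conventional (but arbitrary) decision rule at alpha = 0.05
print("reject null" if result.pvalue < 0.05 else "fail to reject null")
```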
Significance testing¶
Generating a number, usually the p-value, that we use to determine whether we reject the null hypothesis (usually that there is no difference). Significance testing could more broadly include effect sizes. Random fact: Karl Pearson and Ronald Fisher are thought to have developed significance testing due to their interest in eugenics and testing for differences between different populations of people to justify their views.
Sample size¶
The number of samples or n in your dataset.
Effect size¶
This number tells you how large your difference or effect is in your statistical test. Effect sizes include Cohen's d, eta/partial-eta squared, omega/partial-omega squared, and the odds ratio; you can even use confidence intervals to convey the size of an effect, though they are not technically considered an effect size.
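As a sketch, Cohen's d for two independent groups can be computed by hand from the group means and the pooled standard deviation (the data here are simulated):

```python
# Sketch of Cohen's d for two independent groups using the pooled
# standard deviation. Data are simulated for illustration.
import numpy as np

rng = np.random.default_rng(5)
group_a = rng.normal(10.0, 2.0, size=15)
group_b = rng.normal(12.0, 2.0, size=15)

n_a, n_b = group_a.size, group_b.size
pooled_sd = np.sqrt(
    ((n_a - 1) * group_a.var(ddof=1) + (n_b - 1) * group_b.var(ddof=1))
    / (n_a + n_b - 2)
)
cohens_d = (group_b.mean() - group_a.mean()) / pooled_sd
print(cohens_d)
```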
Residuals¶
The difference between your fitted model and the actual data for each set of X values. Linear regression models assume that residuals are normally distributed with a mean of zero and constant variance. Residuals are one of the key ways we assess the fit of a linear regression model, even if the y values are not normally distributed.
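A minimal sketch of pulling residuals out of an OLS fit with statsmodels (the x and y values are simulated for illustration):

```python
# Sketch of extracting residuals from a fitted OLS model with statsmodels.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + rng.normal(0, 1.5, size=50)  # simulated linear relationship

X = sm.add_constant(x)        # add an intercept column
fit = sm.OLS(y, X).fit()

residuals = fit.resid         # observed y minus fitted y
print(residuals.mean())       # should be ~0 for OLS
print(residuals.std(ddof=1))  # roughly the residual standard error
```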
Sample vs population¶
Sample data or a sample statistic refers to instances where we have just a portion, or sample, of a larger population. In most cases we are working with sample statistics since we almost never have access to the full population. Think about how there are, in principle, a nearly infinite number of mice, but we could never run an experiment on all of them, so we just get a sample of those mice. Because of this we end up using sample statistics rather than population statistics. Since we are using sample statistics we always lose one degree of freedom because we have to estimate a parameter; this is a constraint on our measurements.
Confidence level and intervals¶
Confidence intervals are a way to assess the effect size of a statistic, like the mean or the difference between two variables, when working with sample data. If you have a confidence level of 95% then your interval would be (2.5%, 97.5%); the values in the interval are percentiles. A 95% confidence interval tells you that if you repeatedly sampled from the population and computed an interval each time, 95% of those intervals would contain the true value of the parameter. With frequentist statistics we estimate a single parameter that has no distribution, so we cannot assign it a probability of falling within a range of values.
How are confidence intervals calculated?¶
The confidence interval percentiles can be drawn from a t-distribution and then scaled by the standard error of the model parameter, or they can be bootstrapped with something like bias-corrected and accelerated (BCa) confidence intervals. You can imagine that confidence intervals act as a bridge between the sample data and the actual population data. We do not know the population parameter; we can only estimate it. Confidence levels help us understand the confidence we have in estimating the population parameter. Lastly, the smaller the confidence level, the narrower the interval, so confidence intervals are a trade-off between confidence and precision. Low confidence (narrow intervals) --> high precision. The more sure you want to be about the population parameter, the wider you will want your intervals.
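As a sketch, here is a t-distribution-based 95% confidence interval for a mean computed by hand, alongside scipy's BCa bootstrap interval (the sample is simulated):

```python
# Sketch of a t-based 95% confidence interval for a mean, plus scipy's
# bootstrap interval (BCa is the default method). Data are simulated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
sample = rng.normal(loc=50.0, scale=8.0, size=20)

# t-based interval: mean +/- t_crit * SEM
mean = sample.mean()
sem = stats.sem(sample)
t_crit = stats.t.ppf(0.975, df=sample.size - 1)
print(mean - t_crit * sem, mean + t_crit * sem)

# Bootstrap interval (bias-corrected and accelerated is the default)
res = stats.bootstrap((sample,), np.mean, confidence_level=0.95)
print(res.confidence_interval)
```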
Credible intervals¶
We will not use credible intervals since they come from Bayesian statistics, but we will go over what they are since they are often confused with confidence intervals. Bayesian statistics calculate distributions for parameters rather than single point estimates. Credible intervals tell you that there is an x% chance that the actual value falls within the interval.
Degrees of freedom¶
Degrees of freedom relates the number of parameters you are estimating to the size of your dataset. When we take a sample of a larger population we are estimating a population parameter, so we lose a degree of freedom for things like the standard deviation. For regression we lose a degree of freedom for each parameter that we estimate. Degrees of freedom help us reduce bias in our estimates. This is why you cannot estimate the sample standard deviation of a single value: the degrees of freedom would be zero.
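As a small sketch, numpy's ddof argument makes the degrees-of-freedom idea concrete: it controls whether the standard deviation divides by n (population formula) or n - 1 (sample formula):

```python
# Sketch of why we lose a degree of freedom for the sample standard
# deviation: numpy's ddof argument sets the divisor to n - ddof.
import numpy as np

x = np.array([4.0, 7.0, 9.0, 10.0])

print(np.std(x, ddof=0))  # population formula, divides by n
print(np.std(x, ddof=1))  # sample formula, divides by n - 1

# With a single value the sample formula has 0 degrees of freedom
print(np.std(np.array([4.0]), ddof=1))  # nan (division by zero)
```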