
Statistics

Statistics is one of those topics where I feel the classes need to be updated for the modern world. Most of the statistics you will do will be run on a computer, yet in the stats class I took we never touched a computer. It is important to know some of the math under the hood of statistical models, but we have limited time and cannot become experts in everything; very smart people have already created statistical tools we should use. Programming also lets you see more of the inner workings of statistical tests. This tutorial will show you best practices for running statistics in Python (this chapter) and R (the next chapter). The R chapter will mostly be programming examples, since this Python chapter covers more of the whys, and R will cover more mixed-model statistics since its ecosystem for that is much better. Lastly, we will cover only frequentist methods, not Bayesian ones.

Statistics as models of data

It is important to think of statistics as modeling our data. Models are often simplified versions of what we see in real life. Generally, we are trying to find the simplest model that most fully recapitulates our actual data. Since we can never fully model our data, there is error between what our model predicts and what our data actually are. We also need to be keenly aware of the assumptions that statistical models make: failure of the data to meet a model's assumptions can result in false positives. More complex models can relax some of these assumptions but can also reduce the power of our analysis. When we get to the regression models, there are a variety of ways to compare different regression models within a regression class (OLS, GLM, mixed model, etc.).

Data

The data we collect are often called realizations, which are generated by a random function (i.e., a distribution). Data can come in several types, and the type determines which tests we can use.

Continuous

This is the most common data type in biology and the most common dependent variable type that we collect. Continuous means we could, in principle, collect any value with infinite precision; in reality, we typically get real values measured with finite precision. Examples are rheobase, frequency, membrane resistance, etc. Continuous data can come from a variety of continuous distributions, which we cover in the distributions chapter.

Categorical

Categorical data is very common in biology and is most commonly an independent variable. Categorical data has a finite number of unordered values and usually results from some grouping, like genotype or sex. Categorical data can be a dependent variable as well, but that is much less common.

Ordinal

Ordinal data is probably less common in biology. It is data with a finite number of values (discrete) that have some intrinsic order, like vote counts (more is better than less) or date-times: something that is inherently ordered but not continuous.

Binary

This is 0 or 1, success or failure. Binary data is not that common in biology but can be very useful. Binary data can be generated by applying a single cutoff to continuous data. This is useful if you have data that is heavily skewed, kurtotic, or bimodal: you can split the values into two groups and test those instead. Binary data fall under the binomial distribution.
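
As a minimal sketch (assuming numpy is available), here is one way to binarize a skewed continuous measurement with a single cutoff; the variable names and the cutoff value are hypothetical.

```python
# Minimal sketch: binarizing a skewed continuous measurement with a cutoff.
# The cutoff of 100 and the simulated "rheobase" values are made up.
import numpy as np

rng = np.random.default_rng(0)
rheobase = rng.lognormal(mean=4.5, sigma=0.6, size=50)  # heavily skewed data

cutoff = 100.0  # hypothetical threshold
is_high = (rheobase > cutoff).astype(int)  # 1 = above cutoff, 0 = at or below

print(is_high.sum(), "of", is_high.size, "cells exceed the cutoff")
```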

Basic terminology

Before we start covering stats, we will cover some basic terminology. Part of understanding a topic is sometimes just learning the language it uses.

Random function/variable and realizations

A random function/variable is a mathematical function that generates realizations. Sometimes, and perhaps often, we do not know the actual random function/variable that generates our data, and this introduces error. Realizations are the data we work with; they are what comes out of a random function/variable. A realization can be a single number or a set of numbers. The random component is essentially the probability of getting a specific realization: you do not choose the realization, it is randomly drawn from a random function like a probability distribution.
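
As a minimal sketch (assuming numpy), we can treat a normal distribution as the random function and draw realizations from it; the mean and standard deviation here are arbitrary.

```python
# Minimal sketch: drawing realizations from a normal distribution with numpy.
import numpy as np

rng = np.random.default_rng(42)
one_realization = rng.normal(loc=0.0, scale=1.0)             # a single number
many_realizations = rng.normal(loc=0.0, scale=1.0, size=10)  # a set of numbers

print(one_realization)
print(many_realizations)
```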

Central tendency

Central tendency is the typical value for a dataset or, in particular, a probability distribution. We typically use the mean as the measure of central tendency; however, there are several other measures, such as the median and mode.
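
A minimal sketch of the three common measures, assuming numpy and scipy are available; the example data are made up to show how an outlier pulls the mean but not the median.

```python
# Minimal sketch: mean, median, and mode of a small made-up dataset.
import numpy as np
from scipy import stats

data = np.array([1.2, 1.5, 1.5, 2.0, 2.3, 8.9])  # note the outlier at 8.9

print("mean:", np.mean(data))        # pulled upward by the outlier
print("median:", np.median(data))    # robust to the outlier
print("mode:", stats.mode(data).mode)  # most frequent value
```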

Spread or dispersion of data

There are several ways to measure the spread/dispersion of your data. The most commonly reported in science is the SEM; however, it is probably the least useful for understanding your data, since it describes the precision of the estimated mean rather than the spread of the data itself.
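
A minimal sketch of some common spread measures, assuming numpy and scipy; the data are made up for illustration.

```python
# Minimal sketch: sample SD, SEM, IQR, and range of a made-up dataset.
import numpy as np
from scipy import stats

data = np.array([4.1, 5.3, 5.8, 6.2, 7.0, 9.4])

sd = np.std(data, ddof=1)        # sample standard deviation
sem = stats.sem(data)            # standard error of the mean (sd / sqrt(n))
iqr = stats.iqr(data)            # interquartile range, robust to outliers
data_range = data.max() - data.min()

print(sd, sem, iqr, data_range)
```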

Model

A model is essentially anything you can put your data into and get out something that helps you describe your data. All statistical tests model your data. Models make assumptions about your data, so you need to make sure your data fits those assumptions. Models are often a simplified version of what happens in real life, so they have error, and we generally want to minimize the error of our statistical tests. Below are some terms you will see when describing a statistical model.

Multiple comparisons

This is when you compare one group with many other groups and usually get a p-value for each comparison. Multiple comparison p-values need to be corrected for the many comparisons. Multiple comparisons are usually run as post hoc tests after an ANOVA or on RNA expression data. Multiple comparisons are not a substitute for an ANOVA, which I have seen done too many times.
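
As a minimal sketch, here is how a set of p-values might be corrected with statsmodels; the p-values are made up, and 'fdr_bh' is just one of several available methods (e.g. 'bonferroni', 'holm').

```python
# Minimal sketch: correcting made-up p-values for multiple comparisons.
from statsmodels.stats.multitest import multipletests

pvals = [0.001, 0.02, 0.04, 0.30, 0.45]  # hypothetical per-comparison p-values
reject, pvals_corrected, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")

print(reject)           # which comparisons survive correction
print(pvals_corrected)  # adjusted p-values
```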

Bootstrapping

Resampling from a set of data over and over again. You can resample with replacement, where a value can be drawn multiple times, or without replacement, where it cannot. Bootstrapping can be used to construct non-parametric statistics. Bootstrapping assumes your data are independent and identically distributed.
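
A minimal sketch of bootstrapping the mean by resampling with replacement, assuming numpy; the data are simulated for illustration.

```python
# Minimal sketch: bootstrap distribution of the mean from simulated data.
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=10.0, scale=2.0, size=30)

n_boot = 5000
boot_means = np.empty(n_boot)
for i in range(n_boot):
    resample = rng.choice(data, size=data.size, replace=True)  # with replacement
    boot_means[i] = resample.mean()

# Percentile interval from the bootstrap distribution of the mean
print(np.percentile(boot_means, [2.5, 97.5]))
```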

Independent and identically distributed

Independent means that each data point was not influenced by or related to any other data point in your dataset: knowledge of one value does not give you information about another value. The independence assumption is violated in within-subject (repeated-measures) designs, which is why they require within-subject tests. Identically distributed means that all values in your dataset are drawn from the same probability distribution. In a way, many statistical tests with categorical independent variables (predictors) are testing whether "identically distributed" is false: whether the means differ, under the assumption that the variance is the same. Other tests, such as Levene's test, test whether the variance differs.
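
As a minimal sketch, here is a variance comparison with Levene's test from scipy; the two groups are simulated so that they share a mean but differ in variance.

```python
# Minimal sketch: testing whether two simulated groups have equal variances.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
group_a = rng.normal(loc=5.0, scale=1.0, size=25)
group_b = rng.normal(loc=5.0, scale=3.0, size=25)  # same mean, larger variance

stat, p_value = stats.levene(group_a, group_b)
print(stat, p_value)  # a small p-value suggests the variances differ
```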

Null hypothesis

The baseline assumption that there is no difference in a parameter between groups. Usually the parameter is the mean for normally distributed data, but it could be the variance for something like Levene's test.

Hypothesis testing

Testing to determine whether the data provide sufficient evidence to reject some hypothesis. Most statistical tests we run are used to determine whether we can reject the null hypothesis, usually that there is no difference between groups. Typically your hypothesis does not tell you the direction or magnitude of the difference.
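
A minimal sketch of testing the null hypothesis that two groups share the same mean, using an independent-samples t-test from scipy; the "control" and "treated" data are simulated.

```python
# Minimal sketch: independent-samples t-test on simulated groups.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
control = rng.normal(loc=10.0, scale=2.0, size=20)
treated = rng.normal(loc=12.0, scale=2.0, size=20)

t_stat, p_value = stats.ttest_ind(control, treated)
print(t_stat, p_value)  # a small p-value is evidence against the null
```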

Significance testing

Generating a number, usually the p-value, that we use to determine whether we reject the null hypothesis (usually that there is no difference). Significance testing could more broadly include effect sizes. Random fact: Karl Pearson and Ronald Fisher are thought to have developed significance testing partly due to their interest in eugenics and in testing for differences between populations of people to justify their views.

Sample size

The number of samples, or n, in your dataset.

Effect size

A number that tells you how large the difference or effect is in your statistical test. These include r², Cohen's d, eta/partial-eta squared, omega/partial-omega squared, and the odds ratio; confidence intervals can also convey the size of an effect, though they are not formally considered an effect size.
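
As a minimal sketch, Cohen's d for two independent groups can be computed by hand with numpy, using the pooled standard deviation in the denominator; the data are simulated for illustration.

```python
# Minimal sketch: Cohen's d = (mean difference) / pooled standard deviation.
import numpy as np

rng = np.random.default_rng(4)
group_a = rng.normal(loc=10.0, scale=2.0, size=20)
group_b = rng.normal(loc=12.0, scale=2.0, size=20)

n_a, n_b = group_a.size, group_b.size
pooled_sd = np.sqrt(
    ((n_a - 1) * group_a.var(ddof=1) + (n_b - 1) * group_b.var(ddof=1))
    / (n_a + n_b - 2)
)
cohens_d = (group_b.mean() - group_a.mean()) / pooled_sd
print(cohens_d)
```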

Residuals

The difference between your fitted model and the actual data for each set of X values. Linear regression models assume that the residuals are normally distributed with a mean of zero and constant variance. Residuals are one of the key ways we assess the fit of a linear regression model, even if y itself is not normally distributed.
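
A minimal sketch of fitting an ordinary least squares model with statsmodels and pulling out its residuals; the x/y data are simulated for illustration.

```python
# Minimal sketch: OLS fit on simulated data, then inspect the residuals.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + rng.normal(scale=1.5, size=50)  # linear relationship plus noise

X = sm.add_constant(x)        # add an intercept column
results = sm.OLS(y, X).fit()

residuals = results.resid     # observed y minus fitted y
print(residuals.mean())       # close to zero for an OLS fit with an intercept
```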

Sample vs population

Sample data or a sample statistic refers to instances where we have just a portion, or sample, of a larger population. In most cases we are working with sample statistics, since we almost never have access to the full population. Think about how there is, in effect, an unlimited number of mice: we could never run an experiment on all of them, so we work with a sample. Because of this we use sample statistics rather than population statistics, and we always lose one degree of freedom because we have to estimate a parameter; this is a constraint on our measurements.

Confidence level and intervals

Confidence intervals are a way to assess the effect size of a statistic, like the mean or the difference between two variables, when working with sample data. If you have a confidence level of 95%, then your interval would be (2.5%, 97.5%); the values in the interval are percentiles. A 95% confidence interval tells you that if you repeatedly sampled from the population, 95% of the resulting intervals would contain the true value of the parameter. With frequentist statistics we estimate a single parameter that has no distribution, so we cannot assign it a probability of falling within a range of values.

How are confidence intervals calculated?

The confidence interval percentiles can be drawn from a t-distribution and then scaled by the standard deviation (standard error) of the model parameter, or they can be bootstrapped with something like bias-corrected and accelerated (BCa) confidence intervals. You can imagine that confidence intervals act as a bridge between the sample data and the actual population: we do not know the population parameter, we can only estimate it, and the confidence level helps us understand the confidence we have in that estimate. Lastly, the smaller the confidence level, the narrower the interval, so confidence intervals are a trade-off between confidence and precision: low confidence gives narrow intervals (high precision), while the more sure you want to be about the population parameter, the wider you will want your intervals.
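
As a minimal sketch, here are both approaches on simulated data: a t-distribution based interval for the mean, and a BCa bootstrap interval via scipy.stats.bootstrap (available in scipy 1.7 and later).

```python
# Minimal sketch: 95% confidence intervals for the mean of simulated data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
data = rng.normal(loc=10.0, scale=2.0, size=30)

# t-based interval: mean +/- t * SEM
ci_t = stats.t.interval(0.95, df=data.size - 1, loc=data.mean(), scale=stats.sem(data))
print(ci_t)

# Bias-corrected and accelerated (BCa) bootstrap interval
res = stats.bootstrap((data,), np.mean, confidence_level=0.95, method="BCa")
print(res.confidence_interval)
```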

Credible intervals

We will not use credible intervals, since they belong to Bayesian statistics, but we will go over what they are since they are often confused with confidence intervals. Bayesian statistics calculates distributions for parameters rather than single point estimates. A credible interval tells you that there is an x% chance that the actual value falls within the interval.

Degrees of freedom

Degrees of freedom relate the number of parameters you are estimating to the size of your dataset. When we take a sample of a larger population, we are estimating a population parameter, so we lose a degree of freedom for things like the standard deviation. For regression, we lose a degree of freedom for each parameter we estimate. Degrees of freedom help us reduce bias in our samples. This is why you cannot estimate the sample standard deviation of a single value: the degrees of freedom would be zero.
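
As a minimal sketch with numpy, the sample standard deviation divides by n - 1 (ddof=1) because one parameter, the mean, is estimated from the same data; with a single value the degrees of freedom drop to zero and the estimate is undefined.

```python
# Minimal sketch: population vs sample standard deviation and degrees of freedom.
import numpy as np

data = np.array([4.0, 6.0, 8.0, 10.0])

population_sd = np.std(data, ddof=0)  # divides by n
sample_sd = np.std(data, ddof=1)      # divides by n - 1 (one df lost to the mean)

print(population_sd, sample_sd)

# With a single value there are 0 degrees of freedom, so the sample SD is undefined:
print(np.std(np.array([4.0]), ddof=1))  # numpy returns nan (with a warning)
```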