Synthesizing data (IN PROGRESS)
Synthesizing data is rarely covered in slice electrophysiology, but I feel it can help you understand how your data is generated and also provides some insight into the statistical assumptions we make about our data.
import warnings
from pathlib import Path
import numpy as np
import pandas as pd
from bokeh.io import output_notebook, show
from bokeh.layouts import layout, row
from bokeh.models import ColumnDataSource, Div
from bokeh.plotting import figure
from bokeh.transform import factor_cmap, jitter
from scipy import stats
warnings.filterwarnings("ignore", category=RuntimeWarning)
cwd = Path.cwd().parent / "data/pv"
df = pd.read_csv(cwd / "mini_data.csv")
output_notebook()

Point processes¶
Point processes are a mathematical way to describe how points are randomly located in a space; time is just a one-dimensional space along the real line. We can describe point processes in time by their interevent intervals (IEIs), the spacing between successive events.
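To make the event-time/IEI relationship concrete, here is a minimal sketch: event times are the cumulative sum of the IEIs, and the first difference of the event times recovers the IEIs. The 3 Hz rate and the number of events are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw interevent intervals (IEIs) from an exponential distribution (mean 1/3 s)
ieis = rng.exponential(scale=1 / 3, size=10)

# Event times are the cumulative sum of the IEIs
event_times = np.cumsum(ieis)

# Going back from event times to IEIs is just a first difference
recovered = np.diff(event_times)
```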
Poisson point process¶
In the distributions section we cover the Poisson distribution, which gives one of the simplest ways to generate a point process. The Poisson process assumes that events are independent of each other and are randomly distributed over time. With the Poisson distribution you could model how many minis or spikes are likely to occur in a time frame based on a rate. To simulate a Poisson process you draw IEIs from an exponential distribution. We can use the Fano Factor to measure how “Poissonian” timeseries data is, with one big caveat: the Fano Factor is rate dependent.
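As a quick check of the claim above: if the IEIs are exponential, the counts per window are Poisson distributed, so the Fano Factor (variance of the counts divided by their mean) should be close to 1. A rough sketch, assuming a 3 Hz rate and 1 second bins:

```python
import numpy as np

rng = np.random.default_rng(1)
rate, total_time = 3.0, 10_000.0

# Build a long event train from exponential IEIs
ieis = rng.exponential(scale=1 / rate, size=int(rate * total_time * 2))
times = np.cumsum(ieis)
times = times[times < total_time]

# Count events in 1 s bins and compute the Fano Factor
counts, _ = np.histogram(times, bins=np.arange(0, total_time + 1.0))
fano = np.var(counts) / np.mean(counts)
print(f"Fano Factor: {fano:.3f}")  # close to 1 for a Poisson process
```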
Below I will create two different point processes and show how, at face value, they look similar, but when you dig deeper you find important differences. While this analysis may not be super useful for mPSCs, I believe understanding the basics of point processes can help you think about how your data is generated. Think about what processes might lead your events to be non-randomly distributed in time. What if the Fano Factor differs between your experimental groups? Can you use this information to inform your interpretation of the data and future experiments? Remember that when you put TTX in your bath you are blocking any evoked activity, so your events should be randomly distributed over time.
Pseudo Poisson process: Creating a point process by randomly distributing points in 10 second segments using a fixed 3 Hz rate for each segment¶
rng = np.random.default_rng(42)
ieis_pseudo = []
train_pseudo = []
for i in range(30):
    times = rng.uniform(0, 10, size=30)
    times = np.sort(times)
    ieis_pseudo.extend(np.diff(times))
    train_pseudo.extend(times + i * 10)

Poisson process: Creating a point process by drawing from an exponential distribution for each 10 second segment.¶
ieis_poisson = []
train_poisson = []
num_minis = 30
length = 10
for i in range(30):
    start = stats.expon.rvs(size=1, scale=1 / 3)
    times = np.array(start)
    # draw ISIs in batches large enough to overshoot the segment length
    three_stds = int(np.ceil(num_minis + 3 * np.sqrt(num_minis)))
    while times[-1] < length:
        isi = stats.expon.rvs(size=three_stds, scale=1 / 3)
        t_last_mepscs = times[-1]
        times = np.r_[times, t_last_mepscs + np.cumsum(isi)]
    stop = times.searchsorted(length)
    times = times[:stop]
    ieis_poisson.extend(np.diff(times))
    train_poisson.extend(times + i * 10)

Below you can see the IEI distribution for each timeseries as well as a “continuous” event rate. The continuous event rate is just the number of events in a bin divided by the bin width (in this case 300 seconds divided into 50 bins gives 6 second bins).
pseudo_div = Div(
text="""
<h2>Pseudo Poisson</h2>
"""
)
loc, scale = stats.expon.fit(ieis_pseudo)
fig_pseudo = figure(
height=250,
width=350,
title=f"Expected scale: {1 / 3:.3f} vs Fit scale: {scale:.3f}",
)
hist, bins = np.histogram(ieis_pseudo, bins=50, density=True)
hist_data = fig_pseudo.quad(
top=hist,
bottom=0,
left=bins[:-1],
right=bins[1:],
alpha=0.5,
)
x = np.linspace(
stats.expon.ppf(0.0001, scale=scale), stats.expon.ppf(0.9999, scale=scale), 100
)
line = fig_pseudo.line(x, stats.expon.pdf(x, scale=scale), line_color="black")
ts_pseudo = figure(height=250, title="Continuous event rate")
hist, bins = np.histogram(train_pseudo, bins=50)
hist = hist / (300 / 50)
hist_data = ts_pseudo.quad(
top=hist,
bottom=0,
left=bins[:-1],
right=bins[1:],
alpha=0.5,
)
poisson_div = Div(
text="""
<h2>Poisson</h2>
"""
)
loc, scale = stats.expon.fit(ieis_poisson)
fig_poisson = figure(
height=250,
width=350,
title=f"Expected scale: {1 / 3:.3f} vs Fit scale: {scale:.3f}",
)
hist, bins = np.histogram(ieis_poisson, bins=50, density=True)
hist_data = fig_poisson.quad(
top=hist,
bottom=0,
left=bins[:-1],
right=bins[1:],
alpha=0.5,
)
x = np.linspace(
stats.expon.ppf(0.0001, scale=scale), stats.expon.ppf(0.9999, scale=scale), 100
)
line = fig_poisson.line(x, stats.expon.pdf(x, scale=scale), line_color="black")
ts_poisson = figure(height=250, title="Continuous event rate")
hist, bins = np.histogram(train_poisson, bins=50)
hist = hist / (300 / 50)
hist_data = ts_poisson.quad(
top=hist,
bottom=0,
left=bins[:-1],
right=bins[1:],
alpha=0.5,
)
show(
layout(
[pseudo_div], [fig_pseudo, ts_pseudo], [poisson_div], [fig_poisson, ts_poisson]
)
)

By eye these two timeseries look almost identical. If you analyzed them using rate alone you would not find a statistical difference, since the mean IEI is not different. To see the differences we need to look deeper into the structure of the timeseries. To do this we can calculate dispersion, or the Fano Factor, which we talked about in the distributions chapter. To calculate the Fano Factor we just need to bin the data and get the mean and variance of the counts in each bin. However, we are going to bin our data using several different bin sizes. If you run the code below several times you will notice that, on average, the “pure” Poisson process has a higher Fano Factor across all the time bins.
def fano_factor(timestamps, total_time, bin_size):
    n_bins = int(total_time / bin_size)
    bins = np.linspace(0, total_time, n_bins + 1)
    counts, _ = np.histogram(timestamps, bins)
    return np.var(counts) / np.mean(counts)

def fano_factor_bins(timestamps, total_time, b0=0.5, b1=5, bn=100):
    bsizes = np.linspace(b0, b1, num=bn)
    ff = np.zeros(bsizes.size)
    for index, bin_size in enumerate(bsizes):
        ff[index] = fano_factor(timestamps, total_time, bin_size)
    return bsizes, ff

total_time = 10 * 30
pseudo_ff = figure(height=250, width=350, title="Pseudo Poisson")
pseudo_ff.scatter(*fano_factor_bins(train_pseudo, total_time, b1=2.5))
poisson_ff = figure(height=250, width=350, title="Poisson")
poisson_ff.scatter(*fano_factor_bins(train_poisson, total_time, b1=2.5))
show(row(pseudo_ff, poisson_ff))

You can see the difference between these two timeseries when we compare the Fano Factor across bin sizes. The Pseudo Poisson shows a decreasing Fano Factor with increasing bin width, which means the signal becomes more regular, or predictable. The Poisson process shows a Fano Factor that hovers around 1 for most of the bin sizes.
There are some caveats to the way I created the Poisson process. I created short 10 second chunks and concatenated them together. This technically violates an assumption of the Poisson process; however, I ran it this way to model how mPSC data is collected. Usually you record short 5-20 second chunks, and these chunks are usually not 100% contiguous due to how digitizers work. The Fano Factor will also depend on your refractory period, the minimum time between events. Neurons have a reset period between spikes, and PSCs cannot be resolved when they occur too close together.
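One way to mimic that resolution limit is to merge events that fall within some minimum separation of the previously detected event. A rough sketch, assuming a hypothetical 5 ms detection limit:

```python
import numpy as np

rng = np.random.default_rng(4)
rate, total_time, min_sep = 5.0, 1000.0, 0.005

# Ideal Poisson train from exponential ISIs
times = np.cumsum(rng.exponential(scale=1 / rate, size=int(rate * total_time * 2)))
times = times[times < total_time]

# Drop any event closer than min_sep to the previous *detected* event
detected = [times[0]]
for t in times[1:]:
    if t - detected[-1] >= min_sep:
        detected.append(t)
detected = np.array(detected)

print(f"{times.size} true events, {detected.size} detected")
```

The detected train now has no IEI shorter than `min_sep`, so it looks slightly more regular than the true Poisson train even though nothing about the underlying process changed.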
Below I simulate an exponential with a refractory period. You will notice that the Fano Factor tends to be lower the longer the refractory period, even though the scale is technically the same. The refractory period does change the rate, so our effective rate is:

$$\lambda_{\text{eff}} = \frac{\lambda}{1 + \lambda \tau_r}$$

where $\lambda_{\text{eff}}$ is the effective rate, $\lambda$ is the rate you put in, and $\tau_r$ is the refractory period. If you want to correct for the refractory period we can adjust the input rate with this equation:

$$\lambda = \frac{\lambda_{\text{target}}}{1 - \lambda_{\text{target}} \tau_r}$$

where $\lambda_{\text{target}}$ is your target rate and $\lambda$ is the rate you actually pass to the generator. If we don’t correct for the refractory period you can see that the Fano Factor decreases, which means the regularity of the signal increases. The effect gets more noticeable as you increase the rate. So if we wanted to analyze a signal with a refractory period, or, say, recover events in a noisy environment, we could create synthetic data to test our hypotheses and compare to our actual data.
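To sanity-check the effective-rate formula, we can simulate a refractory train and compare the measured rate against λ / (1 + λτ). The rate, refractory period, and simulation length below are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)
rate, ref_period, total_time = 5.0, 0.01, 50_000.0

# Exponential ISIs shifted by a fixed refractory period
isis = rng.exponential(scale=1 / rate, size=int(rate * total_time * 2)) + ref_period
times = np.cumsum(isis)
times = times[times < total_time]

measured_rate = times.size / total_time
predicted_rate = rate / (1 + rate * ref_period)
print(f"measured: {measured_rate:.3f} Hz, predicted: {predicted_rate:.3f} Hz")
```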
def r_eff(rate, ref_period):
    return rate / (1 - rate * ref_period)

offsets = [
    (0, 3),
    (0, 5),
    (0.005, 3),
    (0.005, 5),
    (0.01, 3),
    (0.01, 5),
    (0.005, np.round(r_eff(3, 0.005), decimals=3)),
    (0.005, np.round(r_eff(5, 0.005), decimals=3)),
    (0.001, np.round(r_eff(3, 0.001), decimals=3)),
    (0.001, np.round(r_eff(5, 0.001), decimals=3)),
]
time = 300
trains = []
for rf, rate in offsets:
    ff = []
    for j in range(100):
        start = stats.expon.rvs(size=1, scale=1 / rate)
        event_times = np.array(start)
        # draw ISIs in batches large enough to overshoot the recording length
        three_stds = int(np.ceil(time * 3 + 3 * np.sqrt(time * 3)))
        while event_times[-1] < time:
            isi = stats.expon.rvs(size=three_stds, scale=1 / rate) + rf
            t_last_mepscs = event_times[-1]
            event_times = np.r_[event_times, t_last_mepscs + np.cumsum(isi)]
        stop = event_times.searchsorted(time)
        event_times = event_times[:stop]
        ff.append(fano_factor(event_times, time, 1))
    trains.append(pd.DataFrame({"fano_factor": ff, "setting": str((rf, float(rate)))}))
trains = pd.concat(trains)
cts = list(sorted(trains["setting"].unique()))
source = ColumnDataSource(trains)
source1 = ColumnDataSource(trains.groupby("setting", as_index=False).mean())
fig = figure(height=250, width=400, x_range=cts)
xy_line = fig.scatter(
x=jitter("setting", 0.3, range=fig.x_range),
y="fano_factor",
source=source,
alpha=0.6,
color=factor_cmap("setting", "Paired10", cts),
)
fig.xaxis.major_label_orientation = "vertical"
xy_line1 = fig.scatter(
x="setting",
y="fano_factor",
size=15,
marker="diamond",
source=source1,
alpha=0.8,
color="grey",
)
show(fig)

You will notice that if we look at the Fano Factor when binning at 1 second, the longer the refractory period and the higher the event frequency, the more regular the signal becomes.
Inhomogeneous Poisson process and Hawkes processes¶
There are other ways to model point processes. Inhomogeneous Poisson processes use a time-varying intensity function to modulate the output of a Poisson process. A Hawkes process removes the assumption that events are independent of each other by adding a self-exciting parameter. Hawkes processes can even be used to generate other point processes (called a multivariate model). These processes are closer to what occurs in real life. However, a Poisson process can be useful as a ground truth to determine whether your data is randomly distributed in time and whether your timepoints are independent of each other.
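As a sketch of how an inhomogeneous Poisson process can be generated, one common approach is thinning: generate a homogeneous process at an upper-bound rate, then keep each event with probability λ(t) / λ_max. The sinusoidal intensity below is just an illustrative choice, not something from the data above:

```python
import numpy as np

rng = np.random.default_rng(3)
total_time = 300.0
lam_max = 6.0  # upper bound on the intensity


def intensity(t):
    # Illustrative time-varying rate: oscillates between 1 and 5 Hz
    return 3.0 + 2.0 * np.sin(2 * np.pi * t / 60.0)


# 1) Homogeneous Poisson process at the maximum rate
n = rng.poisson(lam_max * total_time)
candidates = np.sort(rng.uniform(0, total_time, size=n))

# 2) Thin: keep each candidate with probability intensity(t) / lam_max
keep = rng.uniform(size=candidates.size) < intensity(candidates) / lam_max
events = candidates[keep]

print(f"{events.size} events, mean rate ~ {events.size / total_time:.2f} Hz")
```

The resulting train clusters around the peaks of the intensity function, so its Fano Factor at coarse bin sizes will sit above 1, unlike the homogeneous case.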
More coming soon!