For many, Bayesian statistics still feels somewhat alien.
There are these subjective priors, but no hard test statistics: what gives?
On the right you can find slides of a presentation I gave recently at a workshop in Princeton to clarify why I think that Bayesian methods are not only logically sound, but have also become straightforward to use. To demonstrate, we'll implement that same inference problem in three different ways.
## Data
We look into a problem of finding a peak in noisy data with a sloped background, which can arise in particle physics and other multi-component event processes.
Let's generate some data first:
```python
%matplotlib inline
%config InlineBackend.figure_format = 'svg'
import matplotlib.pyplot as plt
import numpy as np
def signal(theta, x):
l, m, s, a, b = theta
peak = l * np.exp(-(m-x)**2 / (2*s**2))
background = a + b*x
return peak + background
def plot_results(x, y, y_err, samples=None, predictions=None):
fig = plt.figure()
ax = fig.gca()
ax.errorbar(x, y, yerr=y_err, fmt=".k", capsize=0, label="Data")
x0 = np.linspace(-0.2, 1.2, 100)
ax.plot(x0, signal(theta, x0), "r", label="Truth", zorder=0)
if samples is not None:
inds = np.random.randint(len(samples), size=50)
for i,ind in enumerate(inds):
theta_ = samples[ind]
if i==0:
label='Posterior'
else:
label=None
ax.plot(x0, signal(theta_, x0), "C0", alpha=0.1, zorder=-1, label=label)
elif predictions is not None:
for i, pred in enumerate(predictions):
if i==0:
label='Posterior'
else:
label=None
ax.plot(x0, pred, "C0", alpha=0.1, zorder=-1, label=label)
ax.legend(frameon=False)
ax.set_xlabel("x")
ax.set_ylabel("y")
fig.tight_layout()
plt.close();
return fig
# random x locations
N = 40
np.random.seed(0)
x = np.random.rand(N)
# evaluate the true model at the given x values
theta = [1, 0.5, 0.1, -0.1, 0.4]
y = signal(theta, x)
# add heteroscedastic Gaussian uncertainties only in y direction
y_err = np.random.uniform(0.05, 0.25, size=N)
y = y + np.random.normal(0, y_err)
# plot
plot_results(x, y, y_err)
```
![svg](bayesian-inference-three-ways_files/output_1_0.svg)
## Markov Chain Monte Carlo
[`emcee`](https://emcee.readthedocs.io/) is implemented in pure python, and it only requires to evaluate the log of the posterior as a function of parameters $\theta$. Using the log is useful because it makes the analytic evaluates easier for the exponential family of distributions and because it deals better with the very small numbers that typically arise.
```python
import emcee
def log_likelihood(theta, x, y, yerr):
y_model = signal(theta, x)
chi2 = (y - y_model)**2 / (yerr**2)
return np.sum(-chi2 / 2)
def log_prior(theta):
if all(theta > -2) and (theta[2] > 0) and all(theta < 2):
return 0
return -np.inf
def log_posterior(theta, x, y, yerr):
lp = log_prior(theta)
if np.isfinite(lp):
lp += log_likelihood(theta, x, y, yerr)
return lp
# create a small ball around the MLE the initialize each walker
nwalkers, ndim = 30, 5
theta_guess = [0.5, 0.6, 0.2, -0.2, 0.1]
pos = theta_guess + 1e-4 * np.random.randn(nwalkers, ndim)
# run emcee
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_posterior, args=(x, y, y_err))
sampler.run_mcmc(pos, 10000, progress=True);
```
100%|██████████| 10000/10000 [00:05<00:00, 1856.57it/s]
We should always inspect the resulting chains should to identify the burn-in period, and visually check for stationarity:
```python
fig, axes = plt.subplots(ndim, sharex=True)
mcmc_samples = sampler.get_chain()
labels = ["l", "m", "s", "a", "b"]
for i in range(ndim):
ax = axes[i]
ax.plot(mcmc_samples[:, :, i], "k", alpha=0.3, rasterized=True)
ax.set_xlim(0, 1000)
ax.set_ylabel(labels[i])
axes[-1].set_xlabel("step number");
```
![svg](bayesian-inference-three-ways_files/output_5_0.svg)
We now need to thin out the chains because the sample are correlated. For that `emcee` has a method to compute the auto-correlation time of each parameters. We can then combine the samples across all walkers:
```python
tau = sampler.get_autocorr_time()
print("Autocorrelation time:", tau)
mcmc_samples = sampler.get_chain(discard=300, thin=np.int32(np.max(tau)/2), flat=True)
print("Remaining samples:", mcmc_samples.shape)
```
Autocorrelation time: [122.51626866 75.87228105 137.195509 54.63572513 79.0331587 ]
Remaining samples: (4260, 5)
From the creator of `emcee`, Dan Foreman-Mackey, comes also this useful package to visualize the posterior samples: [`corner`](http://corner.readthedocs.io/). Let's use it:
```python
import corner
corner.corner(mcmc_samples, labels=labels, truths=theta);
```
![svg](bayesian-inference-three-ways_files/output_9_0.svg)
While posterior samples are the main currency for inference, the parameter contours are difficult to interpret themselves. It's much more straightforward to use the samples to generate new data because we have much more intuition in the data space. Here are model evaluation from 50 random samples:
```python
plot_results(x, y, y_err, samples=mcmc_samples)
```
![svg](bayesian-inference-three-ways_files/output_11_0.svg)
## Hamiltonian Monte Carlo
Having gradients provides much more guidance in high-dimensional settings. To enable general inference, we need a framework to compute gradients of (essentially) arbitrary probabilistic models. The key development is **automatic differentiation**, which is a computational framework to trace the path of parameters through the various manipulations. One such framework is [`jax`](https://jax.readthedocs.io/). In practice, this means that any function that ordinarily is implemented in `numpy` needs to be replaced by its analog in `jax` because the latter knows how to compute the gradients of function.
In addition, we need the ability to compute gradients of _probability distributions_. This capability is implemented in several **Probabilistic Programming Languages**, e.g. [NumPyro](https://num.pyro.ai/). Let's see how to do **automatic inference**:
```python
import jax.numpy as jnp
import jax.random as random
import numpyro
import numpyro.distributions as dist
from numpyro.infer import MCMC, NUTS
def model(x, y=None, y_err=0.1):
# define parameters (incl. prior ranges)
l = numpyro.sample('l', dist.Uniform(-2, 2))
m = numpyro.sample('m', dist.Uniform(-2, 2))
s = numpyro.sample('s', dist.Uniform(0, 2))
a = numpyro.sample('a', dist.Uniform(-2, 2))
b = numpyro.sample('b', dist.Uniform(-2, 2))
# implement the model
# needs jax numpy for differentiability here
peak = l * jnp.exp(-(m-x)**2 / (2*s**2))
background = a + b*x
y_model = peak + background
# notice that we clamp the outcome of this sampling to the observation y
numpyro.sample('obs', dist.Normal(y_model, y_err), obs=y)
# need to split the key for jax's random implementation
rng_key = random.PRNGKey(0)
rng_key, rng_key_ = random.split(rng_key)
# run HMC with NUTS
kernel = NUTS(model, target_accept_prob=0.9)
mcmc = MCMC(kernel, num_warmup=1000, num_samples=3000)
mcmc.run(rng_key_, x=x, y=y, y_err=y_err)
mcmc.print_summary()
```
sample: 100%|██████████| 4000/4000 [00:03<00:00, 1022.99it/s, 17 steps of size 2.08e-01. acc. prob=0.94]
mean std median 5.0% 95.0% n_eff r_hat
a -0.13 0.05 -0.13 -0.22 -0.05 1151.15 1.00
b 0.46 0.07 0.46 0.36 0.57 1237.44 1.00
l 0.98 0.05 0.98 0.89 1.06 1874.34 1.00
m 0.50 0.01 0.50 0.49 0.51 1546.56 1.01
s 0.11 0.01 0.11 0.09 0.12 1446.08 1.00
Number of divergences: 0
Visualization works the same as before, and `corner` already understand the `mcmc` structure from Numpyro:
```python
# just for reference, we'll add the truths
truths = {k:v for k,v in zip(labels, theta)}
fig = corner.corner(mcmc, truths=truths);
```
![svg](bayesian-inference-three-ways_files/output_15_0.svg)
Because we already implemented the entire probabilistic model (in contrast to `emcee`, where we only implemented the posterior), we can directly create posterior predictions from the samples. Below, we set the noise to zero to give us the noise-free representation of the pure model:
```python
from numpyro.infer import Predictive
# make predictions from posterior
hmc_samples = mcmc.get_samples()
predictive = Predictive(model, hmc_samples)
# need to set noise to zero
# since the full model contains noise contribution
predictions = predictive(rng_key_, x=x0, y_err=0)['obs']
# select 50 predictions to show
inds = random.randint(rng_key_, (50,) , 0, mcmc.num_samples)
predictions = predictions[inds]
plot_results(x, y, y_err, predictions=predictions)
```
![svg](bayesian-inference-three-ways_files/output_17_0.svg)
## Simulation-based Inference
There are situations where we either cannot or don't want to compute the likelihood. However, observing a simulator, i.e. learning the mapping between the inputs $\theta$ and outputs $\mathcal{D}$ of the simulator, allows to form an approximate surrogate for the likelihood or the posterior. One important difference to the traditional simulation case, which yields noise-free models, we need to _add noise_ to the simulations, and the noise model should match the observational noises as closely as possible. Otherwise it is impossible to differentiate changes in the data due to noise from those due to changes in parameters.
```python
import torch
from sbi import utils as utils
low = torch.zeros(ndim)
low[3] = -1
high = 1*torch.ones(ndim)
high[0] = 2
prior = utils.BoxUniform(low=low, high=high)
def simulator(theta, x, y_err):
# signal model
l, m, s, a, b = theta
peak = l * torch.exp(-(m-x)**2 / (2*s**2))
background = a + b*x
y_model = peak + background
# add noise consistent with observations
y = y_model + y_err * torch.randn(len(x))
return y
```
Let's have a look at the outputs of the noisy simulator:
```python
plt.errorbar(x, this_simulator(torch.tensor(theta)), yerr=y_err, fmt=".r", capsize=0)
plt.errorbar(x, y, yerr=y_err, fmt=".k", capsize=0)
plt.plot(x0, signal(theta, x0), "k", label="truth")
```
[]
![svg](bayesian-inference-three-ways_files/output_21_1.svg)
Now, we use [`sbi`](www.mackelab.org/sbi/) to train a neural posterior estimate (NPE) from these simulations. It uses a conditional normalizing flow to learn how to generate the posterior distribution given some data:
```python
from sbi.inference.base import infer
this_simulator = lambda theta: simulator(theta, torch.tensor(x), torch.tensor(y_err))
posterior = infer(this_simulator, prior, method='SNPE', num_simulations=10000)
```
Running 10000 simulations.: 0%| | 0/10000 [00:00, ?it/s]
Neural network successfully converged after 172 epochs.
At inference time, we simply evaluate this neural posterior, _conditioned on the actual data_ $y$:
```python
sbi_samples = posterior.sample((10000,), x=torch.tensor(y))
sbi_samples = sbi_samples.detach().numpy()
```
Drawing 10000 posterior samples: 0%| | 0/10000 [00:00, ?it/s]
This runs in virtually no time. And then we visualize the posterior samples again:
```python
corner.corner(sbi_samples, labels=labels, truths=theta);
```
![svg](bayesian-inference-three-ways_files/output_27_0.svg)
```python
plot_results(x, y, y_err, samples=sbi_samples)
```
![svg](bayesian-inference-three-ways_files/output_28_0.svg)
The results for SBI are not quite as good as those of MCMC and HMC. They could be improved by training on more simulations, and also by tuning the architecture of the normalizing flow network. That is the price we have to pay for giving up the ideal summary statistic: the likelihood.
However, SBI enables approximate Bayesian inference even in cases where no likelihood is available.
# Bayesian inference three ways