# Bayesian inference three ways

For many, Bayesian statistics still feels somewhat alien. There are these subjective priors, but no hard test statistics: what gives? On the right you can find slides of a presentation I gave recently at a workshop in Princeton to clarify why I think that Bayesian methods are not only logically sound, but have also become straightforward to use. To demonstrate, we'll implement the same inference problem in three different ways.
## Data
We'll look at the problem of finding a peak in noisy data on top of a sloped background, a situation that arises in particle physics and other multi-component measurement processes.
Let's generate some data first:
```python
%matplotlib inline
%config InlineBackend.figure_format = 'svg'
import matplotlib.pyplot as plt
import numpy as np

# dense grid for plotting model curves (also used in later sections)
x0 = np.linspace(-0.2, 1.2, 100)

def signal(theta, x):
    l, m, s, a, b = theta
    peak = l * np.exp(-(m-x)**2 / (2*s**2))
    background = a + b*x
    return peak + background

def plot_results(x, y, y_err, samples=None, predictions=None):
    fig = plt.figure()
    ax = fig.gca()
    ax.errorbar(x, y, yerr=y_err, fmt=".k", capsize=0, label="Data")
    ax.plot(x0, signal(theta, x0), "r", label="Truth", zorder=0)
    if samples is not None:
        inds = np.random.randint(len(samples), size=50)
        for i, ind in enumerate(inds):
            theta_ = samples[ind]
            label = "Posterior" if i == 0 else None
            ax.plot(x0, signal(theta_, x0), "C0", alpha=0.1, zorder=-1, label=label)
    elif predictions is not None:
        for i, pred in enumerate(predictions):
            label = "Posterior" if i == 0 else None
            ax.plot(x0, pred, "C0", alpha=0.1, zorder=-1, label=label)
    ax.legend(frameon=False)
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    fig.tight_layout()
    plt.close()
    return fig

# random x locations
N = 40
np.random.seed(0)
x = np.random.rand(N)

# evaluate the true model at the given x values
theta = [1, 0.5, 0.1, -0.1, 0.4]
y = signal(theta, x)

# add heteroscedastic Gaussian uncertainties, only in the y direction
y_err = np.random.uniform(0.05, 0.25, size=N)
y = y + np.random.normal(0, y_err)

# plot
plot_results(x, y, y_err)
```
![svg](bayesian-inference-three-ways_files/output_1_0.svg)
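Since the data carry independent Gaussian uncertainties, the log-likelihood we will write down in the next section is, up to an additive constant, just $-\chi^2/2$. A quick standalone check of this identity with `scipy.stats` (the model values and uncertainties below are made up for illustration):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
y_model = rng.normal(size=5)                    # hypothetical noise-free model values
y_err = np.array([0.05, 0.1, 0.15, 0.2, 0.25])  # per-point uncertainties
y_obs = y_model + y_err * rng.normal(size=5)    # noisy "observations"

# exact Gaussian log-likelihood
log_like = norm.logpdf(y_obs, loc=y_model, scale=y_err).sum()

# -chi^2/2 plus the normalization constant
chi2 = np.sum((y_obs - y_model) ** 2 / y_err**2)
const = -0.5 * np.sum(np.log(2 * np.pi * y_err**2))
print(np.isclose(log_like, -0.5 * chi2 + const))  # True
```

Because `y_err` is fixed by the observations, the constant does not depend on the parameters and can safely be dropped when sampling.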
## Markov Chain Monte Carlo
[`emcee`](https://emcee.readthedocs.io/) is implemented in pure Python and only requires a function that evaluates the log of the posterior as a function of the parameters $\theta$. Working with the log is useful because it simplifies the analytic evaluation for the exponential family of distributions and because it copes better with the very small numbers that typically arise.
```python
import emcee

def log_likelihood(theta, x, y, yerr):
    y_model = signal(theta, x)
    chi2 = (y - y_model)**2 / yerr**2
    return np.sum(-chi2 / 2)

def log_prior(theta):
    if all(theta > -2) and (theta[2] > 0) and all(theta < 2):
        return 0
    return -np.inf

def log_posterior(theta, x, y, yerr):
    lp = log_prior(theta)
    if np.isfinite(lp):
        lp += log_likelihood(theta, x, y, yerr)
    return lp

# create a small ball around an initial guess to initialize each walker
nwalkers, ndim = 30, 5
theta_guess = [0.5, 0.6, 0.2, -0.2, 0.1]
pos = theta_guess + 1e-4 * np.random.randn(nwalkers, ndim)

# run emcee
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_posterior, args=(x, y, y_err))
sampler.run_mcmc(pos, 10000, progress=True);
```
100%|██████████| 10000/10000 [00:05<00:00, 1856.57it/s]
We should always inspect the resulting chains to identify the burn-in period and visually check for stationarity:
```python
fig, axes = plt.subplots(ndim, sharex=True)
mcmc_samples = sampler.get_chain()
labels = ["l", "m", "s", "a", "b"]
for i in range(ndim):
    ax = axes[i]
    ax.plot(mcmc_samples[:, :, i], "k", alpha=0.3, rasterized=True)
    ax.set_xlim(0, 1000)
    ax.set_ylabel(labels[i])
axes[-1].set_xlabel("step number");
```
![svg](bayesian-inference-three-ways_files/output_5_0.svg)
We now need to thin out the chains because successive samples are correlated. For that, `emcee` provides a method to compute the autocorrelation time of each parameter. We can then combine the samples across all walkers:
```python
tau = sampler.get_autocorr_time()
print("Autocorrelation time:", tau)
mcmc_samples = sampler.get_chain(discard=300, thin=np.int32(np.max(tau)/2), flat=True)
print("Remaining samples:", mcmc_samples.shape)
```
Autocorrelation time: [122.51626866 75.87228105 137.195509 54.63572513 79.0331587 ]
Remaining samples: (4260, 5)
Dan Foreman-Mackey, the creator of `emcee`, also maintains a useful package to visualize posterior samples: [`corner`](http://corner.readthedocs.io/). Let's use it:
```python
import corner
corner.corner(mcmc_samples, labels=labels, truths=theta);
```
![svg](bayesian-inference-three-ways_files/output_9_0.svg)
While posterior samples are the main currency of inference, the parameter contours are difficult to interpret by themselves. It is much more straightforward to use the samples to generate new data, because we have far better intuition in data space. Here are model evaluations from 50 random samples:
```python
plot_results(x, y, y_err, samples=mcmc_samples)
```
![svg](bayesian-inference-three-ways_files/output_11_0.svg)
## Hamiltonian Monte Carlo
Having gradients provides much more guidance in high-dimensional settings. To enable general inference, we need a framework to compute gradients of (essentially) arbitrary probabilistic models. The key development is **automatic differentiation**, a computational technique that traces the parameters through the various operations applied to them. One such framework is [`jax`](https://jax.readthedocs.io/). In practice, this means that any function ordinarily implemented in `numpy` needs to be replaced by its analog in `jax`, because the latter knows how to compute gradients of functions.
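As a minimal standalone illustration (with made-up data), `jax.grad` turns the chi-square log-likelihood from the previous section into its gradient with respect to $\theta$, which we can verify against a finite difference:

```python
import jax
import jax.numpy as jnp

def log_likelihood(theta, x, y, y_err):
    l, m, s, a, b = theta
    y_model = l * jnp.exp(-(m - x) ** 2 / (2 * s**2)) + a + b * x
    return -0.5 * jnp.sum((y - y_model) ** 2 / y_err**2)

# made-up data and parameters for the check
x = jnp.linspace(0.0, 1.0, 20)
y = 0.9 * jnp.exp(-(0.5 - x) ** 2 / (2 * 0.1**2)) - 0.1 + 0.4 * x
theta = jnp.array([1.0, 0.5, 0.1, -0.1, 0.4])

# gradient with respect to the first argument (theta)
grad = jax.grad(log_likelihood)(theta, x, y, 0.1)

# finite-difference check of the first component
eps = 1e-3
fd = (log_likelihood(theta.at[0].add(eps), x, y, 0.1)
      - log_likelihood(theta.at[0].add(-eps), x, y, 0.1)) / (2 * eps)
print(jnp.allclose(grad[0], fd, rtol=1e-2))  # True
```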
In addition, we need the ability to compute gradients of _probability distributions_. This capability is implemented in several **Probabilistic Programming Languages**, e.g. [NumPyro](https://num.pyro.ai/). Let's see how to do **automatic inference**:
```python
import jax.numpy as jnp
import jax.random as random
import numpyro
import numpyro.distributions as dist
from numpyro.infer import MCMC, NUTS

def model(x, y=None, y_err=0.1):
    # define parameters (incl. prior ranges)
    l = numpyro.sample('l', dist.Uniform(-2, 2))
    m = numpyro.sample('m', dist.Uniform(-2, 2))
    s = numpyro.sample('s', dist.Uniform(0, 2))
    a = numpyro.sample('a', dist.Uniform(-2, 2))
    b = numpyro.sample('b', dist.Uniform(-2, 2))
    # implement the model
    # needs jax numpy for differentiability here
    peak = l * jnp.exp(-(m-x)**2 / (2*s**2))
    background = a + b*x
    y_model = peak + background
    # notice that we clamp the outcome of this sampling to the observation y
    numpyro.sample('obs', dist.Normal(y_model, y_err), obs=y)

# need to split the key for jax's random implementation
rng_key = random.PRNGKey(0)
rng_key, rng_key_ = random.split(rng_key)

# run HMC with NUTS
kernel = NUTS(model, target_accept_prob=0.9)
mcmc = MCMC(kernel, num_warmup=1000, num_samples=3000)
mcmc.run(rng_key_, x=x, y=y, y_err=y_err)
mcmc.print_summary()
```
sample: 100%|██████████| 4000/4000 [00:03<00:00, 1022.99it/s, 17 steps of size 2.08e-01. acc. prob=0.94]
mean std median 5.0% 95.0% n_eff r_hat
a -0.13 0.05 -0.13 -0.22 -0.05 1151.15 1.00
b 0.46 0.07 0.46 0.36 0.57 1237.44 1.00
l 0.98 0.05 0.98 0.89 1.06 1874.34 1.00
m 0.50 0.01 0.50 0.49 0.51 1546.56 1.01
s 0.11 0.01 0.11 0.09 0.12 1446.08 1.00
Number of divergences: 0
Visualization works the same as before, and `corner` already understands the `mcmc` structure from NumPyro:
```python
# just for reference, we'll add the truths
truths = {k:v for k,v in zip(labels, theta)}
fig = corner.corner(mcmc, truths=truths);
```
![svg](bayesian-inference-three-ways_files/output_15_0.svg)
Because we already implemented the entire probabilistic model (in contrast to `emcee`, where we only implemented the posterior), we can directly create posterior predictions from the samples. Below, we set the noise to zero to give us the noise-free representation of the pure model:
```python
from numpyro.infer import Predictive
# make predictions from posterior
hmc_samples = mcmc.get_samples()
predictive = Predictive(model, hmc_samples)
# need to set noise to zero
# since the full model contains noise contribution
predictions = predictive(rng_key_, x=x0, y_err=0)['obs']
# select 50 predictions to show
inds = random.randint(rng_key_, (50,) , 0, mcmc.num_samples)
predictions = predictions[inds]
plot_results(x, y, y_err, predictions=predictions)
```
![svg](bayesian-inference-three-ways_files/output_17_0.svg)
## Simulation-based Inference
There are situations where we either cannot or do not want to compute the likelihood. However, observing a simulator, i.e. learning the mapping between the simulator's inputs $\theta$ and outputs $\mathcal{D}$, allows us to form an approximate surrogate for the likelihood or the posterior. One important difference from the traditional simulation case, which yields noise-free models: we need to _add noise_ to the simulations, and the noise model should match the observational noise as closely as possible. Otherwise it is impossible to distinguish changes in the data due to noise from those due to changes in the parameters.
```python
import torch
from sbi import utils as utils

low = torch.zeros(ndim)
low[3] = -1
high = 1*torch.ones(ndim)
high[0] = 2
prior = utils.BoxUniform(low=low, high=high)

def simulator(theta, x, y_err):
    # signal model
    l, m, s, a, b = theta
    peak = l * torch.exp(-(m-x)**2 / (2*s**2))
    background = a + b*x
    y_model = peak + background
    # add noise consistent with observations
    y = y_model + y_err * torch.randn(len(x))
    return y
```
Let's have a look at the outputs of the noisy simulator:
```python
plt.errorbar(x, simulator(torch.tensor(theta), torch.tensor(x), torch.tensor(y_err)), yerr=y_err, fmt=".r", capsize=0)
plt.errorbar(x, y, yerr=y_err, fmt=".k", capsize=0)
plt.plot(x0, signal(theta, x0), "k", label="truth");
```
![svg](bayesian-inference-three-ways_files/output_21_1.svg)
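Before training on these simulations, it is worth checking quantitatively, not just by eye, that the simulator's noise matches the observations. One simple diagnostic is the pull distribution, (simulation − model) / `y_err`, which should have roughly zero mean and unit variance. A standalone sketch with placeholder values (plain NumPy for brevity):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
y_err = rng.uniform(0.05, 0.25, size=n)  # heteroscedastic uncertainties, as in the data
y_model = np.zeros(n)                    # placeholder for the noise-free model
y_sim = y_model + y_err * rng.normal(size=n)

pulls = (y_sim - y_model) / y_err
print(pulls.mean(), pulls.std())  # should be close to 0 and 1
```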
Now we use [`sbi`](https://www.mackelab.org/sbi/) to train a neural posterior estimator (NPE) on these simulations. It uses a conditional normalizing flow to learn how to generate the posterior distribution given some data:
```python
from sbi.inference.base import infer
this_simulator = lambda theta: simulator(theta, torch.tensor(x), torch.tensor(y_err))
posterior = infer(this_simulator, prior, method='SNPE', num_simulations=10000)
```
Running 10000 simulations.: 0%| | 0/10000 [00:00, ?it/s]
Neural network successfully converged after 172 epochs.
At inference time, we simply evaluate this neural posterior, _conditioned on the actual data_ $y$:
```python
sbi_samples = posterior.sample((10000,), x=torch.tensor(y))
sbi_samples = sbi_samples.detach().numpy()
```
Drawing 10000 posterior samples: 0%| | 0/10000 [00:00, ?it/s]
This runs in virtually no time. And then we visualize the posterior samples again:
```python
corner.corner(sbi_samples, labels=labels, truths=theta);
```
![svg](bayesian-inference-three-ways_files/output_27_0.svg)
```python
plot_results(x, y, y_err, samples=sbi_samples)
```
![svg](bayesian-inference-three-ways_files/output_28_0.svg)
The results for SBI are not quite as good as those of MCMC and HMC. They could be improved by training on more simulations, and also by tuning the architecture of the normalizing flow network. That is the price we have to pay for giving up the ideal summary statistic: the likelihood.
However, SBI enables approximate Bayesian inference even in cases where no likelihood is available.