Jay's Blog

An Overview of Bayesian Inference

A few weeks ago I wrote about Kuhn’s theory of paradigm shifts and how it relates to Bayesian inference. In this post I want to back up a little bit and explain what Bayesian inference is, and eventually rediscover the idea of a paradigm shift just from understanding how Bayesian inference works.

Bayesian inference is important in its own right for many reasons beyond just improving our understanding of philosophy of science. Bayesianism is at its heart an extremely powerful mathematical method of using evidence to make predictions. Almost any time you see anyone making predictions that involve probabilities—whether that’s a projection of election results like the ones from FiveThirtyEight, a prediction for the results of a big sports game, or just a weather forecast telling you the chances of rain tomorrow—you’re seeing the results of a Bayesian inference.

Bayesian inference is also the foundation of many machine learning and artificial intelligence tools. Amazon wants to predict how likely you are to buy things. Netflix wants to predict how likely you are to like a show. Image recognition programs want to predict whether that picture contains a bird. And self-driving cars want to predict whether they’re going to crash into that wall.

You’re using tools based on Bayesian inference every day, and probably at this very moment.1 So it’s worth understanding how they work.

The basic idea of Bayesian inference is that we start with some prior probability that describes what we originally believe the world is like in terms of probability, by specifying the probabilities of various things happening. Then we make observations of the world, and update our beliefs, giving our conclusion as a posterior probability.

As a really simple example: suppose I tell you I’ve flipped a coin, but I don’t tell you how it landed. Your prior is probably a 50% chance that it shows heads, and a 50% chance that it shows tails. After you get to look at the coin, you update your prior beliefs to reflect your new knowledge. Your posterior probability says there is a 100% chance that it shows heads and a 0% chance that it shows tails.2

The rule we use to update our beliefs is called Bayes’s Theorem (hence the name “Bayesian inference”). Specifically, we use the mathematical formula $P(H |E) = \frac{ P(E|H) P(H)}{P(E)},$ where

• $H$ is some hypothesis we had—some thing we thought might maybe happen—and $P(H)$ is how likely we originally thought that hypothesis was.
• $E$ is the evidence we just observed, and $P(E)$ is how likely we originally thought we were to see that evidence.
• $P(E|H)$ is the most complicated bit to explain. It tells us, if we assume that our hypothesis $H$ is true, how likely we originally thought seeing the evidence $E$ would be. So it tells us what we would have thought before seeing the new evidence, if we had assumed the hypothesis $H$ was true.
• $P(H|E)$ is the new, updated, posterior probability we give to the hypothesis $H$, after seeing the evidence $E$.

Let’s work through a quick example. Suppose I have a coin, and you think that there’s a 50% chance it’s a fair coin, and a 50% chance that it actually has two heads. So we have $P(H_{fair}) = .5$ and $P(H_{unfair}) = .5$.

Now you flip the coin ten times, and it comes up heads all ten times. If the coin is fair, this is pretty unlikely! The probability of that happening is $\frac{1}{2}^{10} = \frac{1}{1024}$, so we have $P(E|H_{fair}) = \frac{1}{1024}$. But if the coin is two-headed, this will definitely happen; the probability of getting ten heads is 100%, or $1$. So when you see this, you probably conclude that the coin is unfair.

Now let’s work through that same chain of reasoning algebraically. If the coin is fair, the probability of seeing ten heads in a row is $\frac{1}{2^{10}} = \frac{1}{1024}$. And if the coin is unfair, the probability is 1. So if we think there’s a 50% chance the coin is fair, and a 50% chance it’s unfair, then the overall probability of seeing ten heads in a row is \begin{align} P(H_{fair}) \cdot P(E | H_{fair}) + P(H_{unfair}) \cdot P(E | H_{unfair}) \\\ = .5 \cdot \frac{1}{1024} + .5 \cdot 1 = \frac{1025}{2048} \approx .5005. \end{align}

By Bayes’s Theorem, we have \begin{align} P(H_{fair} | E) &= \frac{ P(E | H_{fair}) P(H_{fair})}{P(E)} \\
& = \frac{ \frac{1}{1024} \cdot .5}{\frac{1025}{2048}} = \frac{1}{1025} \\
P(H_{unfair} | E) & = \frac{ P(E | H_{unfair}) P(H_{unfair})}{P(E)} \\
&= \frac{1 \cdot \frac{1}{2}}{\frac{1025}{2048}} = \frac{1024}{1025}. \end{align} Thus we conclude that the probability the coin is fair is $\frac{1}{1025} \approx .001$, and the probability it is two-headed is $\frac{1024}{1025} \approx .999$. This matches what our intuition tells us: if it comes up ten heads in a row, it probably isn’t fair.

But let’s tweak things a bit. Suppose I have a table with a thousand coins, and I tell you that all of them are fair except one two-headed one. You pick one at random, flip it ten times, and see ten heads. Now what do you think?

You have exactly the same evidence, but now your prior is different. Your prior tells you that $P(H_{fair}) = \frac{999}{1000}$ and $P(H_{unfair}) = \frac{1}{1000}$. We can do the same calculations as before. We have \begin{align} P(H_{fair}) \cdot P(E | H_{fair}) + P(H_{unfair}) \cdot P(E | H_{unfair}) \\
= \frac{999}{1000} \cdot \frac{1}{1024} + \frac{1}{1000} \cdot 1 \approx .00198 \end{align}

\begin{align} P(H_{fair} | E) &= \frac{ P(E | H_{fair}) P(H_{fair})}{P(E)} \\
& = \frac{ \frac{1}{1024} \cdot \frac{999}{1000}}{.00198} \approx .494 \\
P(H_{unfair} | E) & = \frac{ P(E | H_{unfair}) P(H_{unfair})}{P(E)} \\
&= \frac{1 \cdot \frac{1}{1000}}{.00198} \approx .506. \end{align} So now you should think it’s about equally likely that your coin is fair or unfair. 3

Why does this happen? If you have a fair coin, then seeing ten heads in a row is pretty unlikely. But having an unfair coin is also unlikely, because of the thousand coins you could have picked, only one was unfair. In this example those two unlikelinesses cancel out almost exactly, leaving us uncertain whether you got a (normal) fair coin and then a surprisingly unlikely result, or if you got a surprisingly unfair coin and then the normal, expected result.

In other words, you should definitely be somewhat surprised to see ten heads in a row. Remember, we worked out that your prior probability of seeing that is just $P(E) \approx .00198$—less than two tenths of a percent! But there are two different ways to get that unusual result, and you don’t know which of those unusual things happened.

Bayesian inference also does a good job of handling evidence that disproves one of your hypotheses. Suppose you have the same prior we were just discussing: $999$ fair coins, and one two-headed coin. What happens if you flip the coin once and it comes up tails?

Informally, we immediately realize that we can’t be flipping a two-headed coin. It came up tails, after all. So how does this work out in the math?

If the coin is fair, we have a $50\%$ chance of getting tails, and a $50\%$ chance of getting heads. If the coin is unfair, we have a $0\%$ chance of tails and a $100\%$ chance of heads. So we compute: \begin{align} P(H_{fair}) \cdot P(E | H_{fair}) + P(H_{unfair}) \cdot P(E | H_{unfair}) \\
= \frac{999}{1000} \cdot \frac{1}{2} + \frac{1}{1000} \cdot 0 = \frac{999}{2000} \end{align}

\begin{align} P(H_{fair} | E) &= \frac{ P(E | H_{fair}) P(H_{fair})}{P(E)} \\
& = \frac{ \frac{1}{2} \cdot \frac{999}{1000}}{\frac{999}{2000}} = 1 \\
P(H_{unfair} | E) & = \frac{ P(E | H_{unfair}) P(H_{unfair})}{P(E)} \\
&= \frac{0 \cdot \frac{1}{1000}}{\frac{999}{2000}} = 0. \end{align}

Thus the math agrees with us: once we see a tails, the probability that we’re flipping a two-headed coin is zero.

As long as everything behaves well, we can use these techniques to update our beliefs. In fact, this method is pretty powerful. We can prove that it is the best possible decision rule according to a few different sets of criteria4; and there are pretty good guarantees about eventually converging to the right answer after collecting enough evidence.

But there are still a few ways Bayesian inference can go wrong.

What if you get tails and keep flipping the coin—and get ten tails in a row? We’ll still draw the same conclusion: the coin can’t be double-headed, so it’s definitely fair. (You can work through the equations on this if you like; they’ll look just like the last computation I did, but longer). And if we keep flipping and get a thousand tails in a row, or a million, our computation will still tell us yes, the coin is definitely fair.

But before we get to a million flips, we might start suspecting, pretty strongly, that the coin is not fair. When it comes up tails a thousand times in a row, we probably suspect that in fact the coin has two tails. 5 So why doesn’t the math reflect this at all?

In this case, we made a mistake at the very beginning. Our prior told us that there was a $99.9\%$ chance we had a fair coin, and a $.1\%$ chance that we had a coin with two heads. And that means that our prior left no room for the possibility that our coin did anything else. We said our prior was $P(H_{fair}) = \frac{999}{1000} \qquad P(H_{unfair}) = \frac{1}{1000};$ but we really should have said $P(H_{fair}) = \frac{999}{1000} \qquad P(H_{two\ heads}) = \frac{1}{1000} \qquad P(H_{two\ tails}) = 0.$ And since we started with the belief that a two-tailed coin was impossible, no amount of evidence will cause us to change our beliefs. Thus Bayesian inference follows the old rule of Sherlock Holmes: “when you have excluded the impossible, whatever remains, however improbable, must be the truth.”

This example demonstrates both the power and the problems of doing Bayesian inference. The power is that it reflects what we already know. If something is known to be quite rare, then we probably didn’t just encounter it. (It’s more likely that I saw a random bear than a sasquatch—and that’s true even if sasquatch exist, since bear sightings are clearly more common). And if something is outright impossible, we don’t need to spend a lot of time thinking about the implications of it happening.

The problem is that in pure Bayesian inference, you’re trapped by your prior. If your prior thinks the “true” hypothesis is possible, then eventually, with enough evidence, you will conclude that the true hypothesis is extremely likely. But if your prior gives no probability to the true hypothesis, then no amount of evidence can ever change your mind. If we start out with $P(H) = 0$, then it is mathematically impossible to update your prior to believe that $H$ is possible.

But Douglas Adams neatly explained the flaw in the Sherlock Holmes principle in the voice of his character Dirk Gently:

The impossible often has a kind of integrity to it which the merely improbable lacks. How often have you been presented with an apparently rational explanation of something that works in all respects other than one, which is that it is hopelessly improbable?…The first idea merely supposes that there is something we don’t know about, and God knows there are enough of those. The second, however, runs contrary to something fundamental and human which we do know about. We should therefore be very suspicious of it and all its specious rationality.

In real life, when we see something we had thought was extremely improbable, we often reconsider our beliefs about what is possible. Maybe there’s some possibility we had originally dismissed, or not even considered, that makes our evidence look reasonable or even likely; and if we change our prior to include that possibility, suddenly our evidence makes sense. This is the “paradigm shift” I talked about in my recent post on Thomas Kuhn, and extremely unlikely evidence, like our extended series of tails, is a Kuhnian anomaly.

But rethinking your prior isn’t really allowed by the mathematics and machinery of Bayesian inference—it’s something else, something outside of the procedure, that we do to cover for the shortcomings of unaugmented Bayesianism.

Let’s return to the coin-flipping thought experiment; there’s one other way it can go wrong that I want to tell you about. Suppose you fix your prior to acknowledge the possibility that is two-headed or two-tailed. (We could even set up our prior to include the possibility that the coin is two-sided but biased— so that the coin comes up head 70% of the time, say. I’m going to ignore this case completely because it makes the calculations a lot more complicated and doesn’t actually clarify anything. But it’s important that we can do that if we want to).6

You assign the prior probabilities $P(H_{fair}) = \frac{98}{100} \qquad P(H_{two\ heads}) = \frac{1}{100} \qquad P(H_{two\ tails}) = \frac{1}{100},$ giving a 1% chance of each possible double-sided coin. (This is a higher chance than you gave it before, but clearly when I give you these coins I’ve been messing with you, so you should probably be less certain of everything). You flip the coin.

And it lands on its edge.

What does our rule of inference tell us now? We can try to do the same calculations we did before. The first thing we need to calculate is $P(E)$, which is easy. We started out by assuming this couldn’t happen, so the prior probability of seeing the coin landing on its side is zero!

(Algebraically, a fair coin has a 50% chance of heads and a 50% chance of tails. So if the coin is fair, then $P(E|H_{fair}) = 0$. But if the coin has a 100% chance of heads, then $P(E| H_{two\ heads}) = 0$. And if the coin has a 100% chance of tails, then $P(E| H_{two\ tails}) = 0$. Thus \begin{align} P(E) &= P(E|H_{fair}) \cdot P(H_{fair}) + P(E|H_{two\ heads}) \cdot P(H_{two\ heads}) + P(E|H_{two\ heads}) \cdot P(H_{two\ heads}) \\
& = 0 \cdot \frac{98}{100} + 0 \cdot \frac{1}{100} + 0 \cdot \frac{1}{100} = 0. \end{align} So we conclude that $P(E) = 0$).

Now we can actually calculate our new, updated, posterior probabilities—or can we? We have the formula that $P(H_{fair} | E) = \frac{ P(E | H_{fair}) P(H_{fair})}{P(E)}.$ But with the probabilities we just calculated, this works out to $P(H_{fair} | E) = \frac{ 0 \cdot \frac{98}{100}}{0} = \frac{0}{0}.$ And our calculation has broken down completely; $\frac{0}{0}$ isn’t a number, let alone a useful probability.

Even more so than the last example, this is a serious Kuhnian anomaly. If we ever try to update and get $\frac{0}{0}$ as a response, something has gone wrong. We had said that something was totally impossible, and then it happened. All we can do is back up and choose a new prior.

And Bayesian inference can’t tell us how to do that.

There are a few different ways people try to get around this problem. But that’s another post.

1. I’m old enough to remember the late nineties, when spam was such a big problem that email became almost unusable. These days when I complain about email spam it’s usually my employer sending too many messages out through internal mailing lists; but there was a period in the nineties when for every legitimate email you’d get four or five filled with links to pr0n sites or trying to sell you v1@gr@ and c1@lis CHEAP!!! It was a major problem. Entire conferences were held on developing methods to defeat the spam problem.

These days I see about one true spam message like that per year. And one major reason for that is the invention of effective spam filters using Bayesian inference to predict whether a given email is spam or legitimate. So you’re using Bayesian tools right now purely by not receiving dozens of unwanted pornographic pictures in your email inbox every day.

2. This particular example is far too simple to really be worth setting up the Bayesian framework, but it gives a pretty direct and explicit demonstration of what all the pieces mean.

3. The exact probabilities are 999/2023 and 1024/2023. As a bonus, try to see why having some of those exact numbers makes sense, and reassures us that we did this right.

4. I’m primarily thinking of two really important results here. Cox’s Theorem gives a collection of reasonable-sounding conditions, and proves that Bayesian inference is the only possible rule that satisfies them all. Dutch Book Arguments show that this inference rule protects you from making a collection of bets which are guaranteed to lose you money.

5. No, you can’t just check this by looking at the coin. Because I said so.

More seriously, it’s pretty common to have experiments where you can see the results, but can’t inspect the mechanism by which those results are reached. In a particle collider you can see the tracks of exiting particles, but you can’t actually observe the collision. In an educational study, you can look at students’ test results, but you can’t look inside their brains and observe exactly when the learning happens. So it’s useful for this thought experiment to assume we can see how the coin lands, but can never look at both sides at the same time.

6. Gelman and Nolan have argued that it’s not physically possible to bias a coin flip in this way. This is arguably another reason to ignore the possibility that a coin is biased. And if you believe Gelman and Nolan’s argument, then you should have a low or zero prior probability that the coin is biased. But the actual reason I’m ignoring it is to avoid computing integrals in public.

Tags: math bayes probability statistics explainer