# Jay's Blog

## An Overview of Bayesian Inference

A few weeks ago I wrote about Kuhn’s theory of paradigm shifts and how it relates to Bayesian inference. In this post I want to back up a little bit and explain what Bayesian inference is, and eventually rediscover the idea of a paradigm shift just from understanding how Bayesian inference works.

Bayesian inference is important in its own right for many reasons beyond just improving our understanding of philosophy of science. Bayesianism is at its heart an extremely powerful mathematical method of using evidence to make predictions. Almost any time you see anyone making predictions that involve probabilities—whether that’s a projection of election results like the ones from FiveThirtyEight, a prediction for the results of a big sports game, or just a weather forecast telling you the chances of rain tomorrow—you’re seeing the results of a Bayesian inference.

Bayesian inference is also the foundation of many machine learning and artificial intelligence tools. Amazon wants to predict how likely you are to buy things. Netflix wants to predict how likely you are to like a show. Image recognition programs want to predict whether that picture contains a bird. And self-driving cars want to predict whether they’re going to crash into that wall.

You’re using tools based on Bayesian inference every day, and probably at this very moment.1 So it’s worth understanding how they work.

The basic idea of Bayesian inference is that we start with a prior probability, which describes what we originally believe about the world by specifying the probabilities of various things happening. Then we make observations of the world and update our beliefs, giving our conclusion as a posterior probability.

As a really simple example: suppose I tell you I’ve flipped a coin, but I don’t tell you how it landed. Your prior is probably a 50% chance that it shows heads, and a 50% chance that it shows tails. After you get to look at the coin, you update your prior beliefs to reflect your new knowledge. Your posterior probability says there is a 100% chance that it shows heads and a 0% chance that it shows tails.2

The rule we use to update our beliefs is called Bayes’s Theorem (hence the name “Bayesian inference”). Specifically, we use the mathematical formula $P(H |E) = \frac{ P(E|H) P(H)}{P(E)},$ where

• $H$ is some hypothesis we had—some thing we thought might maybe happen—and $P(H)$ is how likely we originally thought that hypothesis was.
• $E$ is the evidence we just observed, and $P(E)$ is how likely we originally thought we were to see that evidence.
• $P(E|H)$ is the most complicated bit to explain. It tells us, if we assume that our hypothesis $H$ is true, how likely we originally thought seeing the evidence $E$ would be. So it tells us what we would have thought before seeing the new evidence, if we had assumed the hypothesis $H$ was true.
• $P(H|E)$ is the new, updated, posterior probability we give to the hypothesis $H$, after seeing the evidence $E$.
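To make the pieces of the formula concrete, here is a minimal Python sketch of a single Bayesian update. The function name `bayes_update`, the hypothesis labels, and the numbers are all made up for illustration. The sketch also assumes the hypotheses are mutually exclusive and cover every possibility, so that $P(E)$ can be computed by adding up $P(E|H) P(H)$ over all of them.

```python
from fractions import Fraction

def bayes_update(prior, likelihood):
    """Return the posterior P(H|E) for every hypothesis H.

    prior:      dict mapping hypothesis -> P(H)
    likelihood: dict mapping hypothesis -> P(E|H)
    Assumes the hypotheses are mutually exclusive and exhaustive.
    """
    # P(E): total probability of the evidence across all hypotheses
    p_evidence = sum(prior[h] * likelihood[h] for h in prior)
    # Bayes's Theorem, applied to each hypothesis in turn
    return {h: likelihood[h] * prior[h] / p_evidence for h in prior}

# Hypothetical numbers, chosen only to keep the arithmetic easy to check
prior = {"H1": Fraction(1, 4), "H2": Fraction(3, 4)}
likelihood = {"H1": Fraction(1, 2), "H2": Fraction(1, 6)}

posterior = bayes_update(prior, likelihood)
print(posterior["H1"], posterior["H2"])  # 1/2 1/2
```

Using `Fraction` instead of floating point keeps the results exact, which makes it easier to compare them with hand calculations.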

Let’s work through a quick example. Suppose I have a coin, and you think that there’s a 50% chance it’s a fair coin, and a 50% chance that it actually has two heads. So we have $P(H_{fair}) = .5$ and $P(H_{unfair}) = .5$.

Now you flip the coin ten times, and it comes up heads all ten times. If the coin is fair, this is pretty unlikely! The probability of that happening is $\left(\frac{1}{2}\right)^{10} = \frac{1}{1024}$, so we have $P(E|H_{fair}) = \frac{1}{1024}$. But if the coin is two-headed, this will definitely happen; the probability of getting ten heads is 100%, or $1$. So when you see this, you probably conclude that the coin is unfair.

Now let’s work through that same chain of reasoning algebraically. If the coin is fair, the probability of seeing ten heads in a row is $\frac{1}{2^{10}} = \frac{1}{1024}$. And if the coin is unfair, the probability is 1. So if we think there’s a 50% chance the coin is fair, and a 50% chance it’s unfair, then the overall probability of seeing ten heads in a row is \begin{align} P(H_{fair}) \cdot P(E | H_{fair}) + P(H_{unfair}) \cdot P(E | H_{unfair}) \\ = .5 \cdot \frac{1}{1024} + .5 \cdot 1 = \frac{1025}{2048} \approx .5005. \end{align}

By Bayes’s Theorem, we have \begin{align} P(H_{fair} | E) &= \frac{ P(E | H_{fair}) P(H_{fair})}{P(E)} \\
& = \frac{ \frac{1}{1024} \cdot .5}{\frac{1025}{2048}} = \frac{1}{1025} \\
P(H_{unfair} | E) & = \frac{ P(E | H_{unfair}) P(H_{unfair})}{P(E)} \\
&= \frac{1 \cdot \frac{1}{2}}{\frac{1025}{2048}} = \frac{1024}{1025}. \end{align} Thus we conclude that the probability the coin is fair is $\frac{1}{1025} \approx .001$, and the probability it is two-headed is $\frac{1024}{1025} \approx .999$. This matches what our intuition tells us: if it comes up ten heads in a row, it probably isn’t fair.
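The arithmetic above can be double-checked with a few lines of Python. This is only a sketch: the dictionary keys `"fair"` and `"two_heads"` are labels invented here, and exact fractions are used so the output can be compared directly with the hand calculation.

```python
from fractions import Fraction

# Prior: 50% fair, 50% two-headed
prior = {"fair": Fraction(1, 2), "two_heads": Fraction(1, 2)}
# Likelihood of ten heads in a row under each hypothesis
likelihood = {"fair": Fraction(1, 2) ** 10, "two_heads": Fraction(1)}

# P(E): overall probability of seeing ten heads
p_evidence = sum(prior[h] * likelihood[h] for h in prior)
# Posterior for each hypothesis via Bayes's Theorem
posterior = {h: likelihood[h] * prior[h] / p_evidence for h in prior}

print(p_evidence)              # 1025/2048
print(posterior["fair"])       # 1/1025
print(posterior["two_heads"])  # 1024/1025
```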

But let’s tweak things a bit. Suppose I have a table with a thousand coins, and I tell you that all of them are fair except one two-headed one. You pick one at random, flip it ten times, and see ten heads. Now what do you think?

You have exactly the same evidence, but now your prior is different. Your prior tells you that $P(H_{fair}) = \frac{999}{1000}$ and $P(H_{unfair}) = \frac{1}{1000}$. We can do the same calculations as before. We have \begin{align} P(H_{fair}) \cdot P(E | H_{fair}) + P(H_{unfair}) \cdot P(E | H_{unfair}) \\
= \frac{999}{1000} \cdot \frac{1}{1024} + \frac{1}{1000} \cdot 1 \approx .00198 \end{align}

\begin{align} P(H_{fair} | E) &= \frac{ P(E | H_{fair}) P(H_{fair})}{P(E)} \\
& = \frac{ \frac{1}{1024} \cdot \frac{999}{1000}}{.00198} \approx .494 \\
P(H_{unfair} | E) & = \frac{ P(E | H_{unfair}) P(H_{unfair})}{P(E)} \\
&= \frac{1 \cdot \frac{1}{1000}}{.00198} \approx .506. \end{align} So now you should think it’s about equally likely that your coin is fair or unfair. 3
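To see how much the prior matters, here is a small sketch that wraps the same calculation in a function. The name `posterior_fair` and its parameters `n_coins` and `n_heads` are hypothetical; the function assumes exactly one coin on the table is two-headed and the rest are fair.

```python
from fractions import Fraction

def posterior_fair(n_coins, n_heads):
    """P(fair | n_heads heads in a row), for a coin drawn at random
    from n_coins coins of which exactly one is two-headed."""
    prior_fair = Fraction(n_coins - 1, n_coins)
    prior_two_heads = Fraction(1, n_coins)
    like_fair = Fraction(1, 2) ** n_heads  # P(E | fair)
    like_two_heads = Fraction(1)           # P(E | two-headed)
    p_evidence = prior_fair * like_fair + prior_two_heads * like_two_heads
    return prior_fair * like_fair / p_evidence

print(posterior_fair(2, 10))                    # 1/1025  (the 50/50 prior)
print(posterior_fair(1000, 10))                 # 999/2023
print(round(float(posterior_fair(1000, 10)), 3))  # 0.494
```

The evidence is identical in both calls; only the prior changes, and the posterior moves from nearly zero to roughly one half.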

Why does this happen? If you have a fair coin, then seeing ten heads in a row is pretty unlikely. But having an unfair coin is also unlikely, because of the thousand coins you could have picked, only one was unfair. In this example those two unlikelinesses cancel out almost exactly, leaving us uncertain whether you got a (normal) fair coin and then a surprisingly unlikely result, or if you got a surprisingly unfair coin and then the normal, expected result.

In other words, you should definitely be somewhat surprised to see ten heads in a row. Remember, we worked out that your prior probability of seeing that is just $P(E) \approx .00198$—less than two tenths of a percent! But there are two different ways to get that unusual result, and you don’t know which of those unusual things happened.

Bayesian inference also does a good job of handling evidence that disproves one of your hypotheses. Suppose you have the same prior we were just discussing: $999$ fair coins, and one two-headed coin. What happens if you flip the coin once and it comes up tails?

Informally, we immediately realize that we can’t be flipping a two-headed coin. It came up tails, after all. So how does this work out in the math?

If the coin is fair, we have a $50\%$ chance of getting tails, and a $50\%$ chance of getting heads. If the coin is unfair, we have a $0\%$ chance of tails and a $100\%$ chance of heads. So we compute: \begin{align} P(H_{fair}) \cdot P(E | H_{fair}) + P(H_{unfair}) \cdot P(E | H_{unfair}) \\
= \frac{999}{1000} \cdot \frac{1}{2} + \frac{1}{1000} \cdot 0 = \frac{999}{2000} \end{align}

\begin{align} P(H_{fair} | E) &= \frac{ P(E | H_{fair}) P(H_{fair})}{P(E)} \\
& = \frac{ \frac{1}{2} \cdot \frac{999}{1000}}{\frac{999}{2000}} = 1 \\
P(H_{unfair} | E) & = \frac{ P(E | H_{unfair}) P(H_{unfair})}{P(E)} \\
&= \frac{0 \cdot \frac{1}{1000}}{\frac{999}{2000}} = 0. \end{align}

Thus the math agrees with us: once we see a tails, the probability that we’re flipping a two-headed coin is zero.
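The same check in code, as a rough sketch. The table `p_outcome` and the helper `posterior_after` are names made up for this illustration; the numbers are the ones from the thousand-coin prior.

```python
from fractions import Fraction

prior = {"fair": Fraction(999, 1000), "two_heads": Fraction(1, 1000)}

# P(outcome | hypothesis) for a single flip
p_outcome = {
    "fair": {"heads": Fraction(1, 2), "tails": Fraction(1, 2)},
    "two_heads": {"heads": Fraction(1), "tails": Fraction(0)},
}

def posterior_after(outcome):
    """Posterior over hypotheses after observing one flip."""
    p_evidence = sum(prior[h] * p_outcome[h][outcome] for h in prior)
    return {h: prior[h] * p_outcome[h][outcome] / p_evidence for h in prior}

after_tails = posterior_after("tails")
after_heads = posterior_after("heads")

print(after_tails["fair"], after_tails["two_heads"])  # 1 0
print(after_heads["fair"], after_heads["two_heads"])  # 999/1001 2/1001
```

A single tails rules out the two-headed coin entirely, while a single heads only nudges the probabilities a little.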

As long as everything behaves well, we can use these techniques to update our beliefs. In fact, this method is pretty powerful. We can prove that it is the best possible decision rule according to a few different sets of criteria4; and there are pretty good guarantees about eventually converging to the right answer after collecting enough evidence.

But there are still a few ways Bayesian inference can go wrong.

What if you get tails and keep flipping the coin—and get ten tails in a row? We’ll still draw the same conclusion: the coin can’t be double-headed, so it’s definitely fair. (You can work through the equations on this if you like; they’ll look just like the last computation I did, but longer). And if we keep flipping and get a thousand tails in a row, or a million, our computation will still tell us yes, the coin is definitely fair.

But before we get to a million flips, we might start suspecting, pretty strongly, that the coin is not fair. When it comes up tails a thousand times in a row, we probably suspect that in fact the coin has two tails. 5 So why doesn’t the math reflect this at all?

In this case, we made a mistake at the very beginning. Our prior told us that there was a $99.9\%$ chance we had a fair coin, and a $.1\%$ chance that we had a coin with two heads. And that means that our prior left no room for the possibility that our coin did anything else. We said our prior was $P(H_{fair}) = \frac{999}{1000} \qquad P(H_{unfair}) = \frac{1}{1000};$ but we really should have said $P(H_{fair}) = \frac{999}{1000} \qquad P(H_{two\ heads}) = \frac{1}{1000} \qquad P(H_{two\ tails}) = 0.$ And since we started with the belief that a two-tailed coin was impossible, no amount of evidence will cause us to change our beliefs. Thus Bayesian inference follows the old rule of Sherlock Holmes: “when you have excluded the impossible, whatever remains, however improbable, must be the truth.”

This example demonstrates both the power and the problems of doing Bayesian inference. The power is that it reflects what we already know. If something is known to be quite rare, then we probably didn’t just encounter it. (It’s more likely that I saw a random bear than a sasquatch—and that’s true even if sasquatch exist, since bear sightings are clearly more common). And if something is outright impossible, we don’t need to spend a lot of time thinking about the implications of it happening.

The problem is that in pure Bayesian inference, you’re trapped by your prior. If your prior thinks the “true” hypothesis is possible, then eventually, with enough evidence, you will conclude that the true hypothesis is extremely likely. But if your prior gives no probability to the true hypothesis, then no amount of evidence can ever change your mind. If you start out with $P(H) = 0$, then it is mathematically impossible to update your prior to believe that $H$ is possible.
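Here is a small sketch of that trap. It compares two priors over thirty tails in a row: one that gives the two-tailed coin probability zero, and one that gives it a one-in-a-million chance. The one-in-a-million figure, the thirty flips, and all of the names (`closed_prior`, `open_prior`, `after_n_tails`) are assumptions made up for this illustration.

```python
from fractions import Fraction

# P(tails | hypothesis) for a single flip
P_TAILS = {"fair": Fraction(1, 2), "two_heads": Fraction(0), "two_tails": Fraction(1)}

def update_on_tails(belief):
    """One Bayesian update after seeing a single tails."""
    p_evidence = sum(belief[h] * P_TAILS[h] for h in belief)
    return {h: belief[h] * P_TAILS[h] / p_evidence for h in belief}

def after_n_tails(prior, n):
    """Apply the update n times, once per observed tails."""
    belief = dict(prior)
    for _ in range(n):
        belief = update_on_tails(belief)
    return belief

# Prior that rules out a two-tailed coin entirely
closed_prior = {"fair": Fraction(999, 1000), "two_heads": Fraction(1, 1000),
                "two_tails": Fraction(0)}
# Prior that leaves a one-in-a-million chance (hypothetical number)
open_prior = {"fair": Fraction(998999, 1000000), "two_heads": Fraction(1, 1000),
              "two_tails": Fraction(1, 1000000)}

closed_post = after_n_tails(closed_prior, 30)
open_post = after_n_tails(open_prior, 30)

print(closed_post["two_tails"])                  # 0
print(float(open_post["two_tails"]) > 0.99)      # True
```

With a prior of exactly zero, the two-tailed hypothesis stays at zero no matter how many tails show up. With even a tiny nonzero prior, thirty tails are enough to make it the overwhelming favorite.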

But Douglas Adams neatly explained the flaw in the Sherlock Holmes principle in the voice of his character Dirk Gently:

The impossible often has a kind of integrity to it which the merely improbable lacks. How often have you been presented with an apparently rational explanation of something that works in all respects other than one, which is that it is hopelessly improbable?…The first idea merely supposes that there is something we don’t know about, and God knows there are enough of those. The second, however, runs contrary to something fundamental and human which we do know about. We should therefore be very suspicious of it and all its specious rationality.

In real life, when we see something we had thought was extremely improbable, we often reconsider our beliefs about what is possible. Maybe there’s some possibility we had originally dismissed, or not even considered, that makes our evidence look reasonable or even likely; and if we change our prior to include that possibility, suddenly our evidence makes sense. This is the “paradigm shift” I talked about in my recent post on Thomas Kuhn, and extremely unlikely evidence, like our extended series of tails, is a Kuhnian anomaly.

But rethinking your prior isn’t really allowed by the mathematics and machinery of Bayesian inference—it’s something else, something outside of the procedure, that we do to cover for the shortcomings of unaugmented Bayesianism.

Let’s return to the coin-flipping thought experiment; there’s one other way it can go wrong that I want to tell you about. Suppose you fix your prior to acknowledge the possibility that the coin is two-headed or two-tailed. (We could even set up our prior to include the possibility that the coin is two-sided but biased, so that the coin comes up heads 70% of the time, say. I’m going to ignore this case completely because it makes the calculations a lot more complicated and doesn’t actually clarify anything. But it’s important that we can do that if we want to).6

You assign the prior probabilities $P(H_{fair}) = \frac{98}{100} \qquad P(H_{two\ heads}) = \frac{1}{100} \qquad P(H_{two\ tails}) = \frac{1}{100},$ giving a 1% chance of each possible double-sided coin. (This is a higher chance than you gave it before, but clearly when I give you these coins I’ve been messing with you, so you should probably be less certain of everything). You flip the coin.

And it lands on its edge.

What does our rule of inference tell us now? We can try to do the same calculations we did before. The first thing we need to calculate is $P(E)$, which is easy. We started out by assuming this couldn’t happen, so the prior probability of seeing the coin landing on its side is zero!

(Algebraically, a fair coin has a 50% chance of heads and a 50% chance of tails, which leaves nothing for landing on its edge. So if the coin is fair, then $P(E|H_{fair}) = 0$. And if the coin has a 100% chance of heads, then $P(E| H_{two\ heads}) = 0$. And if the coin has a 100% chance of tails, then $P(E| H_{two\ tails}) = 0$. Thus \begin{align} P(E) &= P(E|H_{fair}) \cdot P(H_{fair}) + P(E|H_{two\ heads}) \cdot P(H_{two\ heads}) + P(E|H_{two\ tails}) \cdot P(H_{two\ tails}) \\
& = 0 \cdot \frac{98}{100} + 0 \cdot \frac{1}{100} + 0 \cdot \frac{1}{100} = 0. \end{align} So we conclude that $P(E) = 0$).

Now we can actually calculate our new, updated, posterior probabilities—or can we? We have the formula that $P(H_{fair} | E) = \frac{ P(E | H_{fair}) P(H_{fair})}{P(E)}.$ But with the probabilities we just calculated, this works out to $P(H_{fair} | E) = \frac{ 0 \cdot \frac{98}{100}}{0} = \frac{0}{0}.$ And our calculation has broken down completely; $\frac{0}{0}$ isn’t a number, let alone a useful probability.
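The breakdown shows up in code just as it does on paper. In this sketch (the variable names are invented for illustration), every hypothesis assigns probability zero to an edge landing, so the update divides by zero and Python refuses to produce a number.

```python
from fractions import Fraction

prior = {"fair": Fraction(98, 100),
         "two_heads": Fraction(1, 100),
         "two_tails": Fraction(1, 100)}

# Every hypothesis says an edge landing is impossible
p_edge = {h: Fraction(0) for h in prior}

# P(E) = 0, because each term in the sum is zero
p_evidence = sum(prior[h] * p_edge[h] for h in prior)

try:
    posterior_fair = p_edge["fair"] * prior["fair"] / p_evidence
    outcome = "updated"
except ZeroDivisionError:
    # 0/0: Bayes's Theorem has nothing to say here
    outcome = "undefined"

print(p_evidence)  # 0
print(outcome)     # undefined
```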

Even more so than the last example, this is a serious Kuhnian anomaly. If we ever try to update and get $\frac{0}{0}$ as a response, something has gone wrong. We had said that something was totally impossible, and then it happened. All we can do is back up and choose a new prior.

And Bayesian inference can’t tell us how to do that.

There are a few different ways people try to get around this problem. But that’s another post.

1. I’m old enough to remember the late nineties, when spam was such a big problem that email became almost unusable. These days when I complain about email spam it’s usually my employer sending too many messages out through internal mailing lists; but there was a period in the nineties when for every legitimate email you’d get four or five filled with links to pr0n sites or trying to sell you v1@gr@ and c1@lis CHEAP!!! It was a major problem. Entire conferences were held on developing methods to defeat the spam problem.

These days I see about one true spam message like that per year. And one major reason for that is the invention of effective spam filters using Bayesian inference to predict whether a given email is spam or legitimate. So you’re using Bayesian tools right now purely by not receiving dozens of unwanted pornographic pictures in your email inbox every day.

2. This particular example is far too simple to really be worth setting up the Bayesian framework, but it gives a pretty direct and explicit demonstration of what all the pieces mean.

3. The exact probabilities are 999/2023 and 1024/2023. As a bonus, try to see why having some of those exact numbers makes sense, and reassures us that we did this right.

4. I’m primarily thinking of two really important results here. Cox’s Theorem gives a collection of reasonable-sounding conditions, and proves that Bayesian inference is the only possible rule that satisfies them all. Dutch Book Arguments show that this inference rule protects you from making a collection of bets which are guaranteed to lose you money.

5. No, you can’t just check this by looking at the coin. Because I said so.

More seriously, it’s pretty common to have experiments where you can see the results, but can’t inspect the mechanism by which those results are reached. In a particle collider you can see the tracks of exiting particles, but you can’t actually observe the collision. In an educational study, you can look at students’ test results, but you can’t look inside their brains and observe exactly when the learning happens. So it’s useful for this thought experiment to assume we can see how the coin lands, but can never look at both sides at the same time.

6. Gelman and Nolan have argued that it’s not physically possible to bias a coin flip in this way. This is arguably another reason to ignore the possibility that a coin is biased. And if you believe Gelman and Nolan’s argument, then you should have a low or zero prior probability that the coin is biased. But the actual reason I’m ignoring it is to avoid computing integrals in public.

Scott Alexander at Slate Star Codex has been blogging lately about Thomas Kuhn and the idea of paradigm shifts in science. This is a topic near and dear to my heart, so I wanted to take the opportunity to share some of my thoughts and answer some questions that Scott asked in his posts.

### The Big Idea

I’m going to start with my own rough summary of what I take from Kuhn’s work. But since this is all in response to Scott’s book review of The Structure of Scientific Revolutions, you may want to read his post first.

The main idea I draw from Kuhn’s work is that science and knowledge aren’t only, or even primarily, a collection of facts. Observing the world and incorporating evidence is important to learning about the world, but evidence can’t really be interpreted or used without a prior framework or model through which to interpret it. For example, check out this Twitter thread: researchers were able to draw thousands of different and often mutually contradictory conclusions from a single data set by varying the theoretical assumptions they used to analyze it.

Kuhn also provided a response to Popperian falsificationism. No theory can ever truly be falsified by observation, because you can force almost any observation to match most theories with enough special cases and extra rules added in. And it’s often quite difficult to tell whether a given extra rule is an important development in scientific knowledge, or merely motivated reasoning to protect a familiar theory. After all, if you claim that objects with different weights fall at the same speed, you then have to explain why that doesn’t apply to bowling balls and feathers.

This is often described as the theory-ladenness of observation. Even when we think we’re directly perceiving things, those perceptions are always mediated by our theories of how the world works and can’t be fully separated from them. This is most obvious when engaging in a complicated indirect experiment: there’s a lot of work going on between “I’m hearing a clicking sound from this thing I’m holding in my hand” and “a bunch of atoms just ejected alpha particles from their nuclei”.

But even in more straightforward scenarios, any inference comes with a lot of theory behind it. I drop two things that weigh different amounts, and see that the heavier one falls faster—proof that Galileo was wrong!

Or even more mundanely: I look through my window when I wake up, see a puddle, and conclude that it rained overnight. Of course I’m relying on the assumption that when I look through my window I actually see what’s on the other side of it, and not, say, a clever science-fiction style holoscreen. But more importantly, my conclusion that it rained depends on a lot of assumptions I normally wouldn’t explicitly mention—that rain would leave a puddle, and that my patio would be dry if it hadn’t rained.

(In fact, I discovered several months after moving in that my air conditioner condensation tray overflows on hot days. So the presence of puddles doesn’t actually tell me that it rained overnight).

Even direct perception, what we can see right in front of us, is mediated by internal modeling our brains do to put our observations into some comprehensible context. This is why optical illusions work so well; they hijack the modeling assumptions of your perceptual system to make you “see” things that aren’t there.

There are no black dots in this picture.
Who are you going to believe: me, or your own eyes?

### What does this tell us about science?

Kuhn divides scientific practice into three categories. The first he calls pre-science, where there is no generally accepted model to interpret observations. Most of life falls into this category—which makes sense, because most of life isn’t “science”. Subjects like history and psychology with multiple competing “schools” of thought are pre-scientific, because while there are a number of useful and informative models that we can use to understand parts of the subject, no single model provides a coherent shared context for all of our evidence. There is no unifying consensus perspective that basically explains everything we know.

A model that does achieve such a coherent consensus is called a paradigm. A paradigm is a theory that explains all the known evidence in a reasonable and satisfactory way. When there is a consensus paradigm, Kuhn says that we have “normal science”. And in normal science, the idea that scientists are just collecting more facts actually makes sense. Everyone is using the same underlying theory, so no one needs to spend time arguing about it; the work of science is just to collect more data to interpret within that theory.

But sometimes during the course of normal science you find anomalies, evidence that your paradigm can’t readily explain. If you have one or two anomalies, the best response is to assume that they really are anomalies—there’s something weird going on there, but it isn’t a problem for the paradigm.

A great example of an unimportant anomaly is the OPERA experiment from a few years ago that measured neutrinos traveling faster than the speed of light. This meant one of two things: either special relativity, a key component of the modern physics paradigm, was wrong; or there was an error somewhere in a delicate measurement process. Pretty much everyone assumed that the measurement was flawed, and pretty much everyone was right.

In contrast, sometimes the anomalies aren’t so easy to resolve. Scientists find more and more anomalies, more results that the dominant paradigm can’t explain. It becomes clear the paradigm is flawed, and can’t provide a satisfying explanation for the evidence. At this point people start experimenting with other models, and with luck, eventually find something new and different that explains all the evidence, old and new, normal and anomalous. A new paradigm takes over, and normal science returns.

(Notice that the old paradigm was never falsified, since you can always add epicycles to make the new data fit. In fact, the proverbial “epicycles” were added to the Ptolemaic model of the solar system to make it fit astronomical observations. In the early days of the Copernican model, it actually fit the evidence worse than the Ptolemaic model did—but it didn’t require the convoluted epicycles that made the Ptolemaic model work. Sabine Hossenfelder describes this process as, not falsification, but “implausification”: “a continuously adapted theory becomes increasingly difficult and arcane—not to say ugly—and eventually practitioners lose interest.”)

Importantly, Kuhn argued that two different paradigms would be incommensurable, so different from each other that communication between them is effectively impossible. I think this is sometimes overblown, but also often underestimated. Imagine trying to explain a modern medical diagnosis to someone who believes in four humors theory. Or remember how difficult it is to have conversations with someone whose politics are very different from your own; the background assumptions about how the world works are sometimes so different that it’s hard to agree even on basic facts.1

### Scott’s example questions

Now I can turn to the very good questions Scott asks in section II of his book review.

For example, consider three scientific papers I’ve looked at on this blog recently….What paradigm is each of these working from?

As a preliminary note, if we’re maintaining the Kuhnian distinction between a paradigm on the one hand and a model or a school of thought on the other, it is plausible that none of these are working in true paradigms. One major difficulty in many fields, especially the social sciences, is that there isn’t a paradigm that unifies all our disparate strands of knowledge. But asking what possibly-incommensurable model or theory these papers are working from is still a useful and informative exercise.

I’m going to discuss the first study Scott mentions in a fair amount of depth, because it turned out I had a lot to say about it. I’ll follow that up by making briefer comments on his other two examples.

#### Cipriani, Ioannidis, et al.

– Cipriani, Ioannidis, et al perform a meta-analysis of antidepressant effect sizes and find that although almost all of them seem to work, amitriptyline works best.

This is actually a great example of some of the ways paradigms and models shape science. The study is a meta-analysis of various antidepressants to assess their effectiveness. So what’s the underlying model here?

Probably the best answer is: “depression is a real thing that can be caused or alleviated by chemicals”. Think about how completely incoherent this entire study would seem to a Szaszian who thinks that mental illnesses are just choices made by people with weird preferences, to a medieval farmer who thinks mental illnesses are caused by demonic possession, or to a natural-health advocate who thinks that “chemicals” are bad for you. The medical model of mental illness is powerful and influential enough that we often don’t even notice we’re relying on it, or that there are alternatives. But it’s not the only model that we could use.2

While this is the best answer to Scott’s question, it’s not the only one. When Scott originally wrote about this study he compared it to one he had done himself, which got very different results. Since they’re (mostly) studying the same drugs, in the same world, they “should” get similar results. But they don’t. Why not?

I’m not in any position to actually answer that question, since I don’t know much about psychiatric medications. But I can point out one very plausible reason: the studies made different modeling assumptions. And Scott highlights some of these assumptions himself in his analysis. For instance, he looks at the way Cipriani et al. control for possible bias in studies:

I’m actually a little concerned about the exact way he did this. If a pharma company sponsored a trial, he called the pharma company’s drug’s results biased, and the comparison drugs unbiased….

But surely if Lundbeck wants to make Celexa look good [relative to clomipramine], they can either finagle the Celexa numbers upward, finagle the clomipramine numbers downward, or both. If you flag Celexa as high risk of being finagled upwards, but don’t flag clomipramine as at risk of being finagled downwards, I worry you’re likely to understate clomipramine’s case.

I make a big deal of this because about a dozen of the twenty clomipramine studies included in the analysis were very obviously pharma companies using clomipramine as the comparison for their own drug that they wanted to make look good; I suspect some of the non-obvious ones were too. If all of these are marked as “no risk of bias against clomipramine”, we’re going to have clomipramine come out looking pretty bad.

Cipriani et al. had a model for which studies were producing reliable data, and fed it into their meta-analysis. Notice they aren’t denying or ignoring the numbers that were reported, but they are interpreting them differently based on background assumptions they have about the way studies work. And Scott is disagreeing with those assumptions and suggesting a different set of assumptions instead.

(For bonus points, look at why Scott flags this specific case. Cipriani et al. rated clomipramine badly, but Scott’s experience is that clomipramine is quite good. This is one of Kuhn’s paradigm-violating anomalies: the model says you should expect one result, but you observe another. Sometimes this causes you to question the observation; sometimes a drug that “everyone knows” is great actually doesn’t do very much. But sometimes it causes you to question the model instead.)

Scott’s model here isn’t really incommensurable with Cipriani et al.’s in a deep sense. But the difference in models does make numbers incommensurable. An odds ratio of 1.5 means something very different if your model expects it to be biased downwards than it does if you expect it to be neutral—or biased upwards. You can’t escape this sort of assumption just by “looking at the numbers”.

And this is true even though Scott and Cipriani et al. are largely working with the same sorts of models. They both believe in the medical model of mental illness. Their paradigm does include the idea that randomized controlled trials work, as Scott suggests in his piece. A bit more subtly, their shared paradigm also includes whatever instruments they use to measure antidepressant effectiveness. Since Cipriani et al. is actually a meta-analysis, they don’t address this directly. But each study they include is probably using some sort of questionnaire to assess how depressed people are. The numbers they get are only coherent or meaningful at all if you think that questionnaire is measuring something you care about.

There’s one more paradigm choice here that I want to draw attention to, because it’s important, and because I know Scott is interested in it, and because we may be in the middle of a moderate paradigm shift right now.

Studies like this one tend to assume that a given drug will work about the same for everyone. And then people find that no antidepressant works consistently for everyone, and they all have small effect sizes, and conclude that maybe antidepressants aren’t very useful. But that’s hard to square with the fact that people regularly report massive benefits from going on antidepressants. We found an anomaly!

A number of researchers, including Scott himself, have suggested that any given person will respond well to some antidepressants and poorly to others. So when a study says that bupropion (or whatever) has a small effect on average, maybe that doesn’t mean bupropion isn’t helping anyone. Maybe instead it’s helping some people quite a lot, and it’s completely useless for other people, and so on average its effect is small but positive.
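A toy calculation shows why the average alone can’t distinguish these two pictures. Every number here is invented purely for illustration (100 patients, improvement measured in made-up “points”); nothing comes from any real study.

```python
from statistics import mean

N_PATIENTS = 100

# Picture 1 (hypothetical): the drug helps everyone a little
uniform_response = [2.0] * N_PATIENTS

# Picture 2 (hypothetical): the drug helps 20 people a lot and 80 not at all
mixed_response = [10.0] * 20 + [0.0] * 80

avg_uniform = mean(uniform_response)
avg_mixed = mean(mixed_response)

# Fraction of patients with a large improvement in the second picture
share_helped_a_lot = sum(1 for x in mixed_response if x >= 5.0) / N_PATIENTS

print(avg_uniform, avg_mixed)  # 2.0 2.0
print(share_helped_a_lot)      # 0.2
```

Both populations report the same small average effect, even though one of them contains a group of people who benefit enormously.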

But this is a completely different way of thinking clinically and scientifically about these drugs. And it potentially undermines the entire idea behind meta-analyses like Cipriani et al. If our data is useless because we’re doing too much averaging, then averaging all our averages together isn’t really going to help. Maybe we should be doing something entirely different. We just need to figure out what.

#### Ceballos, Ehrlich et al.

– Ceballos, Ehrlich, et al calculate whether more species have become extinct recently than would be expected based on historical background rates; after finding almost 500 extinctions since 1900, they conclude they definitely have.

I actually think Scott mostly answers his own questions here.

As for the extinction paper, surely it can be attributed to some chain of thought starting with Cuvier’s catastrophism, passing through Lyell, and continuing on to the current day, based on the idea that the world has changed dramatically over its history and new species can arise and old ones disappear. But is that “the” paradigm of biology, or ecology, or whatever field Ceballos and Ehrlich are working in? Doesn’t it also depend on the idea of species, a different paradigm starting with Linnaeus and developed by zoologists over the ensuing centuries? It looks like it dips into a bunch of different paradigms, but is not wholly within any.

The paper is using a model where

• Species is a real and important distinction;
• Species extinction is a thing that happens and matters;
• Their calculated background rate for extinction is the relevant comparison.

(You can in fact see a lot of their model/paradigm come through pretty clearly in the “Discussion” section of the paper, which is good writing practice.)

Scott seems concerned that it might dip into a whole bunch of paradigms, but I don’t think that’s really a problem. Any true unifying paradigm will include more than one big idea; on the other hand, if there isn’t a true paradigm, you’d expect research to sometimes dip into multiple models or schools of thought. My impression is that biology is closer to having a real paradigm than not, but I can’t say for sure.

#### Terrell et al.

– Terrell et al examine contributions to open source projects and find that men are more likely to be accepted than women when adjusted for some measure of competence they believe is appropriate, suggesting a gender bias.

Social science tends to be less paradigm-y than the physical sciences, and this sort of politically-charged sociological question is probably the least paradigm-y of all, in that there’s no well-developed overarching framework that can be used to explain and understand data. If you can look at a study and know that people will immediately start arguing about what it “really means”, there’s probably no paradigm.

There is, however, a model underlying any study like this, as there is for any sort of research. Here I’d summarize it something like:

• Gender is an interesting and important construct;
• Acceptance rates for pull requests are a measure of (perceived) code quality;
• Their program that evaluated “obvious gender cues” does a good job of evaluating gender as perceived by other GitHub users;
• The “insider versus outsider” measure they report is important;
• The confounders they check are important, and the confounders they don’t check aren’t.

Basically, any time you get to do some comparisons and not others, or report some numbers and not others, you have to fall back on a model or paradigm to tell you which comparisons are actually important. Without some guiding model, you’d just have to report every number you measured in a giant table.

Now, sometimes people actually do this. They measure a whole bunch of data, and then they try to correlate everything with everything else, and see what pops up. This is not usually good research practice.

If you had exactly this same paper except, instead of “men and women” it covered “blondes and brunettes”, you’d probably be able to communicate the content of the paper to other people; but they’d probably look at you kind of funny, because why would that possibly matter?

### Anomalies and Bayes

Possibly the most interesting thing Scott has posted is his Grand Unified Chart relating Kuhnian theories to related ideas in other disciplines. The chart takes the Kuhnian ideas of “paradigm”, “data”, and “anomaly” and identifies equivalents from other fields. (I’ve flipped the order of the second and third columns here). In political discourse Scott relates them to “ideology”, “facts”, and “cognitive dissonance”; in psychology he relates them to “prediction”, “sense data”, and “surprisal”.

In the original version of the chart, several entries in the “anomalies” column were left blank. He has since filled some of them in, and removed a couple of other rows. I think his answer for the “Bayesian probability” row is wrong; but I think it’s interestingly wrong, in a way that effectively illuminates some of the philosophical and practical issues with Bayesian reasoning.

A quick informal refresher: in Bayesian inference, we start with some prior probability that describes what we originally believe the world is like, by specifying the probabilities of various things happening. Then we make observations of the world, and update our beliefs, giving our conclusion as a posterior probability.

The rule we use to update our beliefs is called Bayes’s Theorem (hence the name “Bayesian inference”). Specifically, we use the mathematical formula $P(H |E) = \frac{ P(E|H) P(H)}{P(E)},$ where $P$ is the probability function, $H$ is some hypothesis, and $E$ is our new evidence.

I have often drawn the same comparison Scott draws between a Kuhnian paradigm and a Bayesian prior. (They’re not exactly the same, and I’ll come back to this in a bit). And certainly Kuhnian “data” and Bayesian “evidence” correspond pretty well. But the Bayesian equivalent of the Kuhnian anomaly isn’t really the KL-divergence that Scott suggests.

KL-divergence is a mathematical way to measure how far apart two probability distributions are. So it’s an appropriate way to look at two priors and tell how different they are. But you never directly observe a probability distribution—just a collection of data points—so KL-divergence doesn’t tell you how surprising your data is. (Your prior does that on its own).
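To make the "distance between two priors" idea concrete, here is a minimal sketch using the standard definition $D_{KL}(p \| q) = \sum_h p(h) \log \frac{p(h)}{q(h)}$. The two priors and the hypothesis names `fair` and `biased` are made up for illustration.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in nats, for two distributions over the same hypotheses."""
    total = 0.0
    for h, p_h in p.items():
        if p_h == 0:
            continue  # terms with p(h) = 0 contribute nothing
        if q[h] == 0:
            return math.inf  # p allows something that q rules out entirely
        total += p_h * math.log(p_h / q[h])
    return total

# Two hypothetical priors over the same pair of hypotheses.
prior_a = {"fair": 0.5, "biased": 0.5}
prior_b = {"fair": 0.9, "biased": 0.1}

print(kl_divergence(prior_a, prior_a))  # identical priors are zero apart
print(kl_divergence(prior_a, prior_b))
print(kl_divergence(prior_b, prior_a))  # not symmetric in general
```

Notice that both arguments are whole distributions. There is no slot in this function for a single observed data point.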

But “surprising evidence” isn’t the same thing as an anomaly. If you make a new observation that was likely under your prior, you get an updated posterior probability and everything is fine. And if you make a new observation that was unlikely under your prior, you get an updated posterior probability and everything is fine. As long as the true3 hypothesis is in your prior at all, you’ll converge to it with enough evidence; that’s one of the great strengths of Bayesian inference. So even a very surprising observation doesn’t force you to rethink your model.

In contrast, if you make a new observation that was impossible under your prior, you hit a literal divide-by-zero error. If your prior says that $E$ can’t happen, then you can’t actually carry out the Bayesian update calculation, because Bayes’s rule tells you to divide by $P(E)$—which is zero. And this is the Bayesian equivalent of a Kuhnian anomaly.
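Here is a minimal sketch of that failure, assuming a toy prior with just two hypotheses about a coin (the names `fair` and `two_headed` and all the numbers are invented for illustration). Surprising evidence updates cleanly; evidence that every hypothesis rules out makes $P(E) = 0$ and the division fails.

```python
def bayes_update(prior, likelihood):
    """Return the posterior P(H | E) for each hypothesis H.

    prior:      dict mapping hypothesis -> P(H)
    likelihood: dict mapping hypothesis -> P(E | H) for the observed E
    """
    # P(E) is the sum over all hypotheses of P(E | H) * P(H).
    p_evidence = sum(likelihood[h] * prior[h] for h in prior)
    # This division raises ZeroDivisionError when the prior says E cannot happen.
    return {h: likelihood[h] * prior[h] / p_evidence for h in prior}

# Hypothetical prior: the coin is almost certainly fair, possibly two-headed.
prior = {"fair": 0.99, "two_headed": 0.01}

# Ten heads in a row is very surprising under this prior, but the update works.
after_ten_heads = bayes_update(prior, {"fair": 0.5 ** 10, "two_headed": 1.0})

# A coin landing on its edge is impossible under both hypotheses.
try:
    bayes_update(prior, {"fair": 0.0, "two_headed": 0.0})
    anomaly = False
except ZeroDivisionError:
    anomaly = True

print(after_ten_heads, anomaly)
```

In this sketch the surprising evidence just moves most of the probability onto `two_headed`. Only the impossible observation breaks the calculation.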

We can imagine a robot in an Asimov short story encountering this situation, trying to divide by zero, and crashing fatally. But people aren’t quite so easy to crash, and an intelligently designed AI wouldn’t be either. We can do something that a simple Bayesian inference algorithm doesn’t allow: we can invent a new prior and start over from the beginning. We can shift paradigms.

A theoretically perfect Bayesian inference algorithm would start with a universal prior—a prior that gives positive probability to every conceivable hypothesis and every describable piece of evidence. No observation would ever be impossible under the universal prior, so no update would require division by zero.

But it’s easier to talk about such a prior than it is to actually come up with one. The usual example I hear is the Solomonoff prior, but it is known to be uncomputable. I would guess that any useful universal prior would be similarly uncomputable. But even if I’m wrong and a theoretically computable universal prior exists, there’s definitely no way we could actually carry out the infinitely many computations it would require.

Any practical use of Bayesian inference, or really any sort of analysis, has to restrict itself to considering only a few classes of hypotheses. And that means that sometimes, the “true” hypothesis won’t be in your prior. Your prior gives it a zero probability. And that means that as you run more experiments and collect more evidence, your results will look weirder and weirder. Eventually you might get one of those zero-probability results, those anomalies. And then you have to start over.

A lot of the work of science—the “normal” work—is accumulating more evidence and feeding it to the (metaphorical) Bayesian machine. But the most difficult and creative part is coming up with better hypotheses to include in the prior. Once the “true” hypothesis is in your prior, collecting more evidence will drive its probability up. But you need to add the hypothesis to your prior first. And that’s what a paradigm shift looks like.

It’s important to remember that this is an analogy; a paradigm isn’t exactly the same thing as a prior. Just as “surprising evidence” isn’t an anomaly, two priors with slightly different probabilities put on some hypotheses aren’t operating in different paradigms.

Instead, a paradigm comes before your prior. Your paradigm tells you what counts as a hypothesis, what you should include in your prior and what you should leave out. You can have two different priors in the same paradigm; you can’t have the same prior in two different paradigms. Which is kind of what it means to say that different paradigms are incommensurable.

This is probably the biggest weakness of Bayesian inference, in practice. Bayes gives you a systematic way of evaluating the hypotheses you have based on the evidence you see. But it doesn’t help you figure out what sort of hypotheses you should be considering in the first place; you need some theoretical foundation to do that.

Have questions about philosophy of science? Questions about Bayesian inference? Want to tell me I got Kuhn completely wrong? Tweet me @ProfJayDaigle or leave a comment below, and let me know!

1. If you’re interested in the political angle on this more than the scientific, check out the talk I gave at TedxOccidentalCollege last year.

2. In fact, this was my third or fourth answer in the first draft of this section. Then I looked at it again and realized it was by far the best answer. That’s how paradigms work: as long as everything is working normally, you don’t even have to think about the fact that they’re there.

3. "True" isn’t really the most accurate word to use here, but it works well enough and I want to avoid another thousand-word digression on the subject of metaphysics.

## Numerical Semigroups and Delta Sets

In this post I want to outline my main research project, which involves non-unique factorization in numerical semigroups. I’m going to define semigroups and numerical semigroups; explain what non-unique factorization means; define the invariant I study, called the delta set; and talk about some of the specific questions I’m interested in.

### Semigroups

A semigroup is a set $S$ with one associative operation. This really just means we have a set of things, and some way of combining any two of them to get another. Semigroups generalize the more common idea of a group, which has an identity and inverses in addition to the associative operation. Every group is also a semigroup, but not every semigroup is a group.1

The simplest example of a semigroup is the natural numbers $\mathbb{N}$, with the operation of addition: we can add any two natural numbers together, but without negative numbers we don’t have any way to subtract, which would be an inverse. This is the free semigroup on one generator, which means we can get every element by starting with $1$ and adding it to itself some number of times.

Other examples of semigroups are:

• $\mathbb{N}^n, +$: ordered $n$-tuples of natural numbers.
• $\mathbb{N}, \times$: the natural numbers using multiplication as the operation. This has infinitely many generators, since we need to start with every prime number to get every possible natural number.
• String Concatenation: we can take our set to be the set of all strings of English letters, and we combine two strings by just sticking the second one after the first.
• Block Monoids are semigroups whose elements are lists of group elements that multiply out to zero, with concatenation as the operation.

Numerical semigroups, which are the main object I study, are formally defined as sub-semigroups of the natural numbers, but that phrase doesn’t actually explain a lot if you’re not already familiar with the field. However, I can explain what they actually are much less technically and more simply.

### Numerical Semigroups

We can define the numerical semigroup generated by $a_1, \dots, a_k$ to be the set of integers $\langle a_1, \dots, a_k \rangle = \{ n_1 a_1 + \dots + n_k a_k : n_i \in \mathbb{Z}_{\geq 0} \}.$ In other words, our semigroup is the set of all the numbers you can get by adding up the generators some number of times, but without allowing subtraction.

I like to think about the Chicken McNugget semigroup to explain this. When I was a kid, at McDonald’s you could get a 4-piece, 6-piece, or 9-piece order of Chicken McNuggets.2 And then we can ask: which numbers of nuggets is it possible to order?

You certainly can’t order one, two, or three nuggets. You can order four, but not five. You can order six, but not seven. You can get eight by ordering two 4-pieces, nine by ordering one 9-piece, and ten by ordering a 4-piece and a 6-piece. There’s no way to order exactly eleven nuggets, and it turns out we can get any number of nuggets past that exactly. (This makes eleven the Frobenius number for this semigroup). We can summarize all this in the table below:

$\begin{array}{cc} 1 & \text{not possible} \\\ 2 & \text{not possible} \\\ 3 & \text{not possible} \\\ 4 & = 1 \cdot 4 + 0 \cdot 6 + 0 \cdot 9 \\\ 5 & \text{not possible} \\\ 6 & = 0 \cdot 4 + 1 \cdot 6 + 0 \cdot 9 \\\ 7 & \text{not possible} \\\ 8 & = 2 \cdot 4 + 0 \cdot 6 + 0 \cdot 9 \\\ 9 & = 0 \cdot 4 + 0 \cdot 6 + 1 \cdot 9 \\\ 10 & = 1 \cdot 4 + 1 \cdot 6 + 0 \cdot 9 \\\ 11 & \text{not possible} \\\ 12 & = 3 \cdot 4 + 0 \cdot 6 + 0 \cdot 9 \\\ & = 0 \cdot 4 + 2 \cdot 6 + 0 \cdot 9 \\\ 13 & = 1 \cdot 4 + 0 \cdot 6 + 1 \cdot 9 \end{array}$
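If you want to check the table by machine, here is a small sketch that marks which order sizes are possible. The function name `representable` and the cutoff of 50 are my own choices for illustration. The cutoff is safe here because 12 through 15 are all possible, and from four consecutive possible numbers we can reach everything larger by adding 4-pieces.

```python
def representable(limit, generators):
    """Return a list r where r[n] is True exactly when n is a non-negative
    integer combination of the generators, for 0 <= n <= limit."""
    r = [False] * (limit + 1)
    r[0] = True  # ordering nothing at all
    for n in range(1, limit + 1):
        # n is possible if removing one generator leaves a possible number
        r[n] = any(n >= g and r[n - g] for g in generators)
    return r

can_order = representable(50, (4, 6, 9))
gaps = [n for n, ok in enumerate(can_order) if not ok]

print(gaps)       # the impossible orders
print(max(gaps))  # the Frobenius number
```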

Looking at this table you might notice something else: there are two rows for the number 12, because we can order 12 nuggets in two different ways: we can order three 4-piece orders, or two 6-piece orders. We call each of these ways of ordering twelve nuggets a factorization of 12 with respect to the generators $4,6,9$. And not only do we have two different factorizations of 12; they actually have different numbers of factors!

If we look at larger numbers, the variety in factorizations becomes far greater. Consider this table of ways to factor 36: $\begin{array}{cc} \text{factorization} & \text{length} \\\ 9 \cdot 4 + 0 \cdot 6 + 0 \cdot 9 & 9 \\\ 6 \cdot 4 + 2 \cdot 6 + 0 \cdot 9 & 8 \\\ 3 \cdot 4 + 4 \cdot 6 + 0 \cdot 9 & 7 \\\ 3 \cdot 4 + 1 \cdot 6 + 2 \cdot 9 & 6 \\\ 0 \cdot 4 + 6 \cdot 6 + 0 \cdot 9 & 6 \\\ 0 \cdot 4 + 3 \cdot 6 + 2 \cdot 9 & 5 \\\ 0 \cdot 4 + 0 \cdot 6 + 4 \cdot 9 & 4 \end{array}$ We have seven distinct ways we can factor 36. The shortest has four factors and the longest has nine; every length in between is represented.
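A brute-force enumeration is enough to reproduce a table like this. The sketch below is one simple way to do it (the helper name `factorizations` is mine, and this is not meant as an efficient algorithm): it tries every possible count of the first generator and recurses on the rest.

```python
def factorizations(n, generators):
    """Yield every tuple of non-negative coefficients, one per generator,
    whose weighted sum is exactly n."""
    if not generators:
        if n == 0:
            yield ()
        return
    first, rest = generators[0], generators[1:]
    for count in range(n // first + 1):
        # use `count` copies of the first generator, then factor what is left
        for tail in factorizations(n - count * first, rest):
            yield (count,) + tail

facts_36 = list(factorizations(36, (4, 6, 9)))
lengths_36 = sorted(sum(f) for f in facts_36)

for f in facts_36:
    print(f, "length", sum(f))
```

Each tuple lists the number of 4-, 6-, and 9-piece orders, so `(3, 1, 2)` is the factorization $3 \cdot 4 + 1 \cdot 6 + 2 \cdot 9$.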

From here we can ask a number of questions. How many ways can we order a given number of chicken nuggets? How many different lengths can these factorizations have? What patterns can we find?

All this is very different from what we’re used to. When we factor integers into prime numbers, the Fundamental Theorem of Arithmetic tells us that there is a unique way to do this. We generally learn this in grade school, and so from a very young age we’re used to having only one way to factor things. But this unique factorization property isn’t universal, and it doesn’t apply here.

Numerical semigroups essentially never have unique factorization. But we want to find ways to measure how not-unique their factorization is.

### The Delta Set

In my research I study something called the delta set of a semigroup. The delta set is a way of measuring how complicated the relationships among different factorizations can get.

For an element $x$ in a semigroup, we can look at all the factorizations of $x$, and then we can look at all the possible lengths of these factorizations. (In our example above, we had $\mathbf{L}(36) = \{4,5,6,7,8,9\}$; we don’t repeat the $6$ because we only care about which lengths are possible, and not how many times they occur). Then we can ask a bunch of questions about these sets of lengths.

A simple thing to compute is the elasticity of an element, which is just the ratio of the length of the longest factorization to the length of the shortest, and tells you how much the lengths can vary. (The elasticity of $36$ is $9/4$). A good exercise is to convince yourself that the largest elasticity of any element in a semigroup is the ratio of the largest generator to the smallest generator. (And thus that $36$ has the maximum possible elasticity for $\langle 4, 6, 9 \rangle$).

The delta set is a bit more complicated. The delta set of $x$ is the set of successive differences in lengths. So instead of looking at the shortest and longest factorizations, we look at all of them, and see what sort of gaps show up. (For our example, the delta set is just $\Delta(36) = \{1\}$, since there’s a factorization of each length between $4$ and $9$. If the set of lengths were $\{3,5,8,15\}$ then the delta set would be $\{2,3,7\}$).
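Both invariants are easy to compute once you have a set of lengths. Here is a short sketch, with function names of my own choosing, applied to the two sets of lengths mentioned above.

```python
from fractions import Fraction

def elasticity(lengths):
    """Ratio of the longest factorization length to the shortest."""
    return Fraction(max(lengths), min(lengths))

def delta_set(lengths):
    """Set of gaps between consecutive values in a set of lengths."""
    ordered = sorted(set(lengths))
    return {b - a for a, b in zip(ordered, ordered[1:])}

lengths_36 = {4, 5, 6, 7, 8, 9}

print(elasticity(lengths_36))     # 9/4
print(delta_set(lengths_36))      # {1}
print(delta_set({3, 5, 8, 15}))   # the gaps 2, 3 and 7
```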

We want to understand the whole semigroup, not just individual elements. So we often want to talk about the delta set of an entire semigroup, which is just the union of the delta sets of all the elements. So $\Delta(S)$ tells us what kind of gaps can appear in any set of lengths for any element of the semigroup. It turns out that for the Chicken McNugget semigroup $S = \langle 4,6,9 \rangle$, the delta set is just $\Delta(S) = \{1\}$. This means that the delta set of any element is just $\{1\}$, and thus that every set of lengths is a set of consecutive integers $\{n,n+1, \dots, n+k \}$.

### What Do We Know?

Delta sets can be a little tricky to compute. It’s fairly easy to show a number is in the delta set of a semigroup: find an element, calculate all the factorization lengths, and see that you have a gap of the desired size. But to show that a number is not in the delta set of the semigroup, you have to show that it isn’t in the delta set of any element, which is much trickier.

However, there are a few things we do know.

• The smallest element of the delta set is the greatest common divisor of the other elements of the delta set. This means that $\{2,3\}$ can’t be the delta set of any semigroup, since $2$ isn’t the GCD of $2$ and $3$.

• If $S = \langle a, b \rangle$ is generated by exactly two elements, then $\Delta(S) = \{b - a\}$. More generally, if $S = \langle a, a+d, a+2d, \dots, a+kd \rangle$ then $\Delta(S) = \{d\}$. (We call such semigroups “arithmetic semigroups” since their generating set is an arithmetic progression).

• For any numerical semigroup $S$, there is a finite collection of (computable) elements called the Betti elements, and the maximum element of the delta set of $S$ is in the delta set of at least one of the Betti elements.

• Finally and most importantly, the delta set is eventually periodic. This means that if you check the delta sets for a (possibly large but known) number of elements of the semigroup, you will see everything you can possibly see. This makes it possible to compute the delta set of any given semigroup and know you haven’t left anything out. 3

But this is nearly everything that we really know about delta sets. There are a lot of open questions left, which primarily fall into two categories:

1. For some nice category of semigroup, compute the delta set. We’ve already seen this question answered for semigroups generated by arithmetic sequences; we also have complete or partial answers for semigroups generated by generalized arithmetic sequences, geometric sequences, and compound sequences.

2. The realization problem: given a set of natural numbers, is it the delta set of some numerical semigroup? We don’t actually know a lot about this. About the only thing that we know can’t happen is a minimum element that isn’t the GCD of the set. But to show that something can happen, about all we can do is find a specific semigroup that has that delta set. There’s a lot of room to explore here.

### Non-Minimal Generating Sets

In my research I introduce one more complication. Earlier we talked about the Chicken McNugget semigroup, of all the ways we can build orders out of 4, 6, or 9 chicken nuggets. But McDonald’s also offers a 20 piece order of chicken nuggets. 4

From a purely algebraic perspective, this doesn’t change anything. Anything we can get with 20 piece orders, we can get with a combination of 4 and 6 pieces, so we have the same set and the same operation, and thus the same semigroup. (We say that 20 isn’t “irreducible” because we can factor it into other simpler elements). So in this sense, nothing should change.

But the set of factorizations does change. If we replicate our earlier table of factorizations of 36 but now allow $20$ as a factor, we get $\begin{array}{cc} \text{factorization} & \text{length} \\\ 9 \cdot 4 + 0 \cdot 6 + 0 \cdot 9 + 0 \cdot 20 & 9 \\\ 6 \cdot 4 + 2 \cdot 6 + 0 \cdot 9 + 0 \cdot 20 & 8 \\\ 3 \cdot 4 + 4 \cdot 6 + 0 \cdot 9 + 0 \cdot 20 & 7 \\\ 3 \cdot 4 + 1 \cdot 6 + 2 \cdot 9 + 0 \cdot 20 & 6 \\\ 0 \cdot 4 + 6 \cdot 6 + 0 \cdot 9 + 0 \cdot 20 & 6 \\\ 0 \cdot 4 + 3 \cdot 6 + 2 \cdot 9 + 0 \cdot 20 & 5 \\\ \color{blue}{4 \cdot 4 + 0 \cdot 6 + 0 \cdot 9 + 1 \cdot 20} & \color{blue}{5} \\\ 0 \cdot 4 + 0 \cdot 6 + 4 \cdot 9 + 0 \cdot 20 & 4 \\\ \color{blue}{1 \cdot 4 + 2 \cdot 6 + 0 \cdot 9 + 1 \cdot 20 } & \color{blue}{4} \end{array}$ The extra generator gives us the two additional factorizations in blue.

Now every question we asked about factorizations in numerical semigroups, we can ask again for factorizations with respect to our non-minimal generating set. For instance, we can ask for the delta set with respect to our generating set. For 36 above, we see that the delta set is still $\{1\}$, just as it was before; nothing has changed.

But let’s look instead at the element 20. With our old generating set of $4,6,9$, we can only get 20 nuggets in two ways. But with our non-minimal generating set, we have three different ways to order 20 nuggets: $20 = 5 \cdot 4 = 2 \cdot 4 + 2 \cdot 6 = 1 \cdot 20$. These three “factorizations” have lengths 5, 4, and 1, and a little experimentation will convince you that they’re the only possible factorizations. Therefore our set of lengths is $\mathbf{L}(20) = \{1,4,5\}$ and the delta set is $\Delta(20) = \{1,3\}$.

This is a big change! With the original, minimal generating set, the delta set of the entire semigroup was $\{1\}$. There was no element with a length gap larger than 1. But by adding a new generator in, we can get an element whose delta set is $\{1,3\}$. And a little experimentation shows us that $26 = 5 \cdot 4 + 1 \cdot 6 = 2 \cdot 4 + 3 \cdot 6 = 2 \cdot 4 + 2 \cdot 9 = 1 \cdot 6 + 1 \cdot 20$ and thus $\mathbf{L}(26) = \{2,4,5,6\}$ and $\Delta(26) = \{1,2\}$. So the delta set for the entire semigroup is $\{1,2,3\}$.5 We’ve gotten a different delta set for the exact same semigroup, but using a different set of generators.
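The "little experimentation" can be automated. The sketch below builds the set of lengths for every element up to a cutoff by dynamic programming, then takes the union of the delta sets. The function names and the cutoff of 200 are my own assumptions, and a finite search like this only illustrates the claim about the whole semigroup; it does not prove it.

```python
def length_sets(limit, generators):
    """Return a list where entry n is the set of factorization lengths of n
    with respect to the given generators, for 0 <= n <= limit."""
    lengths = [set() for _ in range(limit + 1)]
    lengths[0] = {0}  # the empty factorization
    for n in range(1, limit + 1):
        for g in generators:
            if g <= n:
                # every factorization of n - g extends by one copy of g
                lengths[n] |= {k + 1 for k in lengths[n - g]}
    return lengths

def delta_set(lengths):
    """Set of gaps between consecutive values in a set of lengths."""
    ordered = sorted(lengths)
    return {b - a for a, b in zip(ordered, ordered[1:])}

lengths = length_sets(200, (4, 6, 9, 20))
observed = set().union(*(delta_set(s) for s in lengths))

print(lengths[20], delta_set(lengths[20]))
print(lengths[26], delta_set(lengths[26]))
print(observed)  # every gap seen among elements up to the cutoff
```

Working with sets of lengths directly, rather than listing every factorization, keeps the computation small even for larger elements.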

This raises a number of questions for us to study. We can start with our previous two questions: given a semigroup (and a non-minimal set of generators), what is the delta set? And given a set, is it the delta set of some semigroup and non-minimal generating set? But we also have a new question: what happens to the delta set of a semigroup as we continually add things to the generating set? Can we make the delta set bigger? Can we make it smaller? What ways of adding generators produce interesting patterns?

There’s a lot of fertile ground here. A few questions have been answered already, in a paper I cowrote with Scott Chapman, Rolf Hoyer, and Nathan Kaplan in 2010. For instance, it is always possible to force the delta set to be $\{1\}$ by adding more elements to the generating set. A couple other groups have done some work since then, but as far as I know, nothing else has been published.

But hopefully I’ve convinced you that there are quite a few interesting and unanswered questions in this field. Many of the answers should be accessible with a bit of work, and I hope to be able to provide some of them soon.

1. There is also something called a “monoid”, which has an identity element but no inverses; thus every group is a monoid and every monoid is a semigroup. The presence of an identity element doesn’t actually matter for any of the questions we’re asking, so researchers use the terms “semigroup” and “monoid” more or less interchangeably.

2. For some reason, they switched over to 4-, 6-, and 10-piece orders when I was a teenager. That semigroup is much less interesting, so I’m going to pretend that never happened.

3. This result was originally proven by Scott Chapman, Rolf Hoyer, and Nathan Kaplan in 2008, during an undergraduate REU research program I was also participating in. But the original result had an unfortunately large bound, so using this to compute delta sets wasn’t really practically feasible. In 2014, a paper by J. I. García-García, M. A. Moreno-Frías, and A. Vigneron-Tenorio improved the bound dramatically and made computation of delta sets feasible on personal computers.

4. My parents would never let me order this when I was a child, and I’m still bitter.

5. I haven’t actually shown that you can’t get a gap bigger than $3$. But it’s true.

## The difference between science and engineering

I wrote this essay a few years back elsewhere on the internet. It still seems relevant, so I’m posting this updated and lightly edited version.

I’ve noticed that people regularly get confused, on a number of subjects, by the difference between science and engineering.
In summary: science is sensitive and finds facts; engineering is robust and gives praxis. Many problems happen when we confuse science for engineering and completely modify our praxis based on the results of a couple of studies in an unsettled area.

(Thanks to Cowbirds in Love for the perfect comic strip)

### The difference between science and engineering

As a rough definition, science is a system of techniques for finding out facts about the world. Engineering, in contrast, is the technique of using science to produce tools we can consistently use in the world. Engineering produces things that have useful effects. (And I’ll also point to a third category, of “folk traditions,” which are tools we use in the world that are not particularly founded in science.)

These things are importantly different. Science depends on a large number of people putting together a lot of little pieces, and building up an edifice of facts that together give us a good picture of how things work. It’s fine if any one experiment or study is flawed, because in the limit of infinite experiments we figure out what’s going on. (See for example Scott Alexander’s essay Beware the Man of One Study for excellent commentary on this problem).

Similarly, it’s fine if any one experiment holds in only very restricted cases, or detects a subtle effect that can only be seen with delicate machinery. The point is to build up a large number of data points and use them to generate a model of the world.

Engineering, in contrast, has to be robust. If I want to detect the Higgs Boson once, to find out if it exists, I can do that in a giant machine that costs billions of dollars and requires hundreds of hours of analysis. If I want to build a Higgs Boson detector into a cell phone, that doesn’t work.

This means two things. First is that we need to understand things much better for engineering than for science. In science it’s fine to say “The true effect is between +3 and -7 with 95% probability”. If that’s what we know, then that’s what we know. And an experiment that shrinks the bell curve by half a unit is useful. For engineering, we generally need to have a much better idea of what the true effect is. (Imagine trying to build an airplane based on the knowledge that acceleration due to gravity is probably between 9 and 13 m/s^2).

Second is that science in general cares about much smaller effects than engineering does. It was a very long time before engineering needed relativistic corrections due to gravity, say. A fact can be true but not (yet) useful or relevant, and then it’s in the domain of science but not engineering.

### Why does this matter?

The distinction is, I think, fairly clear when we talk about physics. In particular, we understand the science of physics quite well, at least on every-day scales. And our practice of the engineering of physics is also quite well-developed, enough so that people rarely use folk traditions in place of engineering any more. (“I don’t know why this bridge stays up, but this is how daddy built them.”)

But people get much more confused when we move over to, say, psychology, or sociology, or nutrition. Researchers are doing a lot of science on these subjects, and doing good work. So there’s a ton of papers out there saying that eggs are good, or eggs are bad, or eggs are good for you but only until next Monday, or whatever.

And people often have one of two reactions to this situation. The first is to read one study and say “See, here’s the scientific study. It says eggs are bad for you. Why are you still eating eggs? Are you denying the science?” And the second reaction is to say that obviously the scientists can’t agree, and so we don’t know anything and maybe the whole scientific approach is flawed.

But the real situation is that we’re struggling to develop a science of nutrition. And that’s hard. We’ve put in a lot of work, and we know some things. But we don’t really have enough information to do engineering—to say “Okay, to optimize cardiovascular health you need to cut your simple carbs by 7%, eat an extra 10g of monounsaturated fats every day, and eat 200g of protein every Wednesday”, or whatever. We just don’t know enough.

And this is where folk traditions come in. Folk traditions are attempts to answer questions that we need decent answers to, that have been developed over time, and that are presumably non-horrible because they haven’t failed obviously and spectacularly yet. A person who eats “like grandma did” is probably on average at least as healthy as a person who tried to follow every trendy bit of scientistic nutrition advice from the past thirty years.

### Trendy teaching as confusing science for engineering

So where do I see this coming up other than nutrition? Well, the subject that really got me thinking about it was “scientific” teaching practices. I’ve attended a few workshops on “modern” teaching techniques like the use of clickers, and when I tell people about them I often get comments disparaging cargo cult teaching methods.

In general there’s a big split among university professors between people who want to teach in a more “traditional” way and people who want to teach in a more “scientific” way. With bad blood on both sides.

And my biggest problem with the “scientific” side is that some of their studies are so bad. I’d like good studies on teaching methods. I’d like a good engineering of teaching. But we don’t have one yet, and acting like “we have three studies, now we know the best thing to do” is just silly.

(Which shouldn’t be read as full-throated support for the “traditionalists”! The science is good enough to tell us some things about some things, and I do try to engage in judicious supplementation of folk teaching traditions with information from recent research. But the research is not in a good enough state to be dispositive, or produce an engineering discipline, or completely displace the folk tradition).

### Other examples

A few of my friends have complained about the sad state of exercise science; but I think they’re really complaining about the lack of exercise engineering. We are doing basic research that tells us about how the body responds to exercise. We don’t know enough to give advice that improves much on “do the things people have been doing for a while that seem to work”.

A lot of “lifehacks” boil down to “We read a study, and based on this study, here are three simple things you can do to accomplish X.” But a study is science, not engineering. Sometimes helpful, but easy to overinterpret. Don’t take any one study too seriously, and if what you’re doing works, don’t totally overhaul it because you read a study.

Similarly, any comment about how you can be more effective socially by doing this one trick is usually science, not engineering.

Lots of economics and public policy debates sound like this. “This study shows that raising the minimum wage (increases/decreases/has no effect on) unemployment.” All three of those statements can be true! There are a lot of studies with a lot of different results. We’re starting to develop an engineering practice of economic policy, but it’s in its infancy at best.

Or see this essay’s account of scientifically studying the most effective way for police to respond to domestic violence charges, for a good example of confusing science and engineering. Bonus points for the following quote:

> Reflection upon these results led Sherman to formulate his “defiance” theory of the criminal sanction, beginning with the inauspicious generalization that, according to the evidence, “legal punishment either reduces, increases, or has no effect on future crimes, depending on the type of offenders, offenses, social settings and levels of analysis.” This is a fancy way of saying “we don’t know what works.”

### Marketing: engineering versus folk traditions

The field of marketing presents a good contrast between engineering and folk traditions. We have a mental image of a sleazy salesman, who has a whole host of interpersonal tactics that have been honed over centuries, even millennia, of sleazy salesmanship. And this works.

And there’s an entirely different field of marketing research and focus groups. And this shows what’s necessary to turn science into engineering. There’s a whole bunch of basic research about psychology that goes into designing marketing campaigns. But people also do focus groups, to gather a ton of data on how people respond to minute differences.

And, more importantly, they do A/B testing, which gives pretty good data on how actual people respond to actual differences. And by iterating a ton of A/B testing, you have a pretty good idea that people will buy 5% more if you use the green packaging, or whatever.

## An easier approach to partial fractions decomposition

I always found partial fraction decomposition incredibly annoying and tedious. But it turns out there’s a much easier way to compute it. (I learned this a couple years ago from Chris Towse).

Suppose we want to find a partial fraction decomposition for $\frac{7x+2}{(x+2)^2 (x-1)}$. The normal method is to take your fraction and write it as a sum of real numbers over your polynomial denominators:

$$\frac{7x+2}{(x+2)^2 (x-1)} = \frac{A}{x+2} + \frac{B}{(x+2)^2} + \frac{C}{x-1}.$$

(For this reason, my high school calculus teacher called this the “ABC method”). Then we clear denominators:

$$7x+2 = A(x+2)(x-1) + B(x-1) + C(x+2)^2,$$

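and, matching the coefficients of each power of $x$, we get a system of linear equations:

$$\begin{aligned} A + C &= 0 \\ A + B + 4C &= 7 \\ -2A - B + 4C &= 2. \end{aligned}$$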

This is a system of linear equations, so we can solve it by any of the usual methods, and we get $A = -1, B = 4, C = 1$, so

$$\frac{7x+2}{(x+2)^2 (x-1)} = \frac{-1}{x+2} + \frac{4}{(x+2)^2} + \frac{1}{x-1}.$$

And now we can integrate or do whatever else we needed to do with our fraction.

This process can get super tedious. In particular, solving the linear system at the end isn’t difficult but it is really annoying and easy to screw up if you do it by hand. (I used NumPy instead. Computer algebra systems are your friend).
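For the curious, here is a minimal sketch of what handing that system to NumPy might look like. The variable names `M` and `rhs` are just for illustration, and this is one reasonable way to do it rather than the exact code I ran. Each row of the matrix is one power of $x$ in $7x+2 = A(x+2)(x-1) + B(x-1) + C(x+2)^2$.

```python
import numpy as np

# Coefficient matrix for the unknowns (A, B, C), one row per power of x,
# read off from 7x + 2 = A(x+2)(x-1) + B(x-1) + C(x+2)^2.
M = np.array([
    [ 1.0,  0.0, 1.0],   # x^2:   A      +  C = 0
    [ 1.0,  1.0, 4.0],   # x^1:   A +  B + 4C = 7
    [-2.0, -1.0, 4.0],   # x^0: -2A -  B + 4C = 2
])
rhs = np.array([0.0, 7.0, 2.0])

# Solve M @ (A, B, C) = rhs; the results are floating point.
A, B, C = np.linalg.solve(M, rhs)
print(A, B, C)
```

The answers come back as floats, so expect values that are very close to $-1$, $4$, and $1$ rather than exact integers.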

It turns out there’s a much easier way to do this. It’s motivated by complex analysis residue integrals, but you can do it without actually knowing any complex analysis.

Let’s go back to our equation from earlier:

$$\frac{7x+2}{(x+2)^2 (x-1)} = \frac{A}{x+2} + \frac{B}{(x+2)^2} + \frac{C}{x-1}.$$

Instead of clearing all the denominators, let’s just clear one. If we multiply by $(x-1)$ on both sides, we get

$$\frac{7x+2}{(x+2)^2} = \frac{A(x-1)}{x+2} + \frac{B(x-1)}{(x+2)^2} + C.$$

This doesn’t look much nicer at first, but look at what happens if we evaluate at $x=1$. The left-hand side becomes $\frac{9}{9} = 1$. On the right-hand side, the $A$ and $B$ terms each carry a factor of $x-1$, so they go away completely, and we’re just left with $C$. So we immediately see that $C = 1$.

We can find $A$ and $B$ the same way, with a bit more care. Multiplying our equation by $x+2$ doesn’t help, because we’ll still have a factor of $x+2$ in the denominator. But if we multiply by $(x+2)^2$ we get

$$\frac{7x+2}{x-1} = A(x+2) + B + \frac{C(x+2)^2}{x-1},$$

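and evaluating at $x = -2$ gives $4 = B$. To get $A$, we need to do a little bit of work and subtract off the $B$ term:

$$\frac{A}{x+2} + \frac{C}{x-1} = \frac{7x+2}{(x+2)^2 (x-1)} - \frac{4}{(x+2)^2} = \frac{3x+6}{(x+2)^2 (x-1)} = \frac{3}{(x+2)(x-1)},$$

so multiplying by $x+2$ and evaluating at $x = -2$ gives

$$A = \frac{3}{-2-1} = -1.$$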
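If you want to double-check the arithmetic, the three evaluations above can be sketched in a few lines of Python. This is only an illustration: the helper names `numerator`, `original`, and `decomposed` are made up for this example, and it uses the standard library's `Fraction` so everything stays exact.

```python
from fractions import Fraction

def numerator(x):
    """Numerator of our fraction, 7x + 2."""
    return 7 * x + 2

# C: multiply through by (x - 1), then evaluate at x = 1.
x = Fraction(1)
C = numerator(x) / (x + 2) ** 2

# B: multiply through by (x + 2)^2, then evaluate at x = -2.
x = Fraction(-2)
B = numerator(x) / (x - 1)

# A: subtracting B/(x+2)^2 leaves (7x + 2 - 4(x - 1)) / ((x+2)^2 (x-1)),
# which simplifies by hand to 3 / ((x+2)(x-1)).
# Multiply by (x + 2) and evaluate at x = -2.
A = Fraction(3) / (x - 1)

def original(t):
    return Fraction(7 * t + 2) / ((t + 2) ** 2 * (t - 1))

def decomposed(t):
    return A / (t + 2) + B / (t + 2) ** 2 + C / (t - 1)

# Spot-check that both sides agree away from the poles at -2 and 1.
for t in (Fraction(0), Fraction(2), Fraction(5, 3)):
    assert original(t) == decomposed(t)

print(A, B, C)
```

The spot-check at a few sample points is not a proof, but two rational functions of this degree that agree at several points and share the same denominator are a strong sign nothing went wrong.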

That might feel like it took longer, but that’s mostly because I actually worked through all the algebra with the new version. No NumPy here! I actually suspect the first way is more efficient if you’re doing a really big decomposition, because it parallelizes a bunch of stuff, and linear equation solvers are pretty efficient.

But for reasonable-sized problems I’d much rather do the second method, no question. And this makes me almost want to actually teach partial fraction decomposition next time I teach calc 2.