<h1 id="hypothesis-testing-part-3">Hypothesis Testing and its Discontents, Part 3: What Can We Do?</h1>
<p><em>Jay Daigle · July 25, 2022 · <a href="https://jaydaigle.net/blog/hypothesis-testing-part-3">jaydaigle.net/blog/hypothesis-testing-part-3</a></em></p>
<p>Hypothesis testing is central to the way we do science, but it has major flaws that have encouraged widespread shoddy research. In <a href="/blog/hypothesis-testing-part-1/">part 1</a> of this series, we looked at the historical origins of hypothesis testing, and described two different approaches: Fisher’s significance testing, and Neyman-Pearson hypothesis testing. In <a href="/blog/hypothesis-testing-part-2/">part 2</a> we saw how modern researchers use hypothesis testing in practice. We looked at theoretical reasons the tools we use aren’t suited for many questions we want to ask, and also at the ways these tools encourage researchers to <em>misuse</em> them and draw dubious conclusions from questionable methods.</p>
<p>In this essay we’ll look at a number of methods that can help us draw better conclusions, and avoid the pitfalls of crappy hypothesis testing. We’ll start with some smaller and more conservative ideas, which basically involve doing hypothesis testing <em>better</em>. Then we’ll look at more radical changes, taking the focus away from hypothesis tests and seeing the other ways we can organize and contribute to scientific knowledge.</p>
<h2 id="what-was-hypothesis-testing">1. What was hypothesis testing, again?</h2>
<p>But first, let’s remember what we’re talking about. The first two parts of this series answered two basic questions: how does hypothesis testing work, and how does it break?</p>
<p>In part 1, we learned about two major historical approaches to the idea of hypothesis testing: one by Fisher, and the other by Neyman and Pearson. Both start with a “null hypothesis”, which is usually an idea we’re trying to <em>disprove</em>. Then we collect some data, and analyze it under the assumption that the null hypothesis is true.</p>
<p>Fisher’s significance testing computes a \(p\)-value, which is the probability of seeing the experimental result you got <em>if</em> the null hypothesis is true. It is <strong><em>not</em></strong> the probability that the null hypothesis is false, but it does measure how much evidence your experiment provides against the null hypothesis. We say the result is <em>significant</em> if the \(p\)-value is below some pre-defined threshold, generally \(5\)%. <strong>If the null is actually false, we should be able to reliably produce these low \(p\)-values</strong>; Fisher wrote that a “scientific fact should be regarded as experimentally established only if a properly designed experiment <em>rarely fails</em> to give this level of significance”.</p>
<p>Neyman and Pearson didn’t worry about establishing facts; instead, they focused on making actionable, yes-or-no decisions. A Neyman-Pearson null hypothesis is generally that we should refuse to take some specific action, which may or may not be useful. We figure out how bad it would be to take the action if it is useless, and how much we’d miss out on if it’s useful, and use that to set a threshold; then we collect data and use our threshold to decide whether to act. <strong>This approach doesn’t tell us what to <em>believe</em>, just what to <em>do</em>.</strong> Sometimes we think that acting is probably useful, but that acting wrongly would be catastrophic so it would be wiser to do nothing. The Neyman-Pearson method takes that logic into account, and biases us towards inaction, making type I errors less common at the expense of making type II errors more common.<strong title="We could reverse this, and err on the side of acting, if we think wrongly doing nothing has worse downsides than wrongly acting. But it's pretty uncommon to do it that way in practice."><sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup></strong></p>
<p>Modern researchers use an awkward combination of these methods. Like Fisher, we want to discover true facts; but we use Neyman and Pearson’s technical approach of setting specific thresholds. We set a false positive threshold (usually \(5\)%) and ideally a false negative threshold (we want it to be less than \(20\)%), and run our experiment. If we get a \(p\)-value less than the threshold—data that would be pretty weird <em>if</em> the null hypothesis is true, so weird it would only happen once every twenty experiments we run—then we “reject the null” and believe some alternative hypothesis. If our \(p\)-value is bigger, meaning our data wouldn’t look too weird if the null hypothesis is true, then we “fail to reject” the null and err on the side of believing the null hypothesis.</p>
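<p>As a small illustration of this hybrid procedure (the numbers here are invented for the example), we can simulate a hypothesis test on a possibly-biased coin in a few lines of Python:</p>

```python
import random

random.seed(0)

# Null hypothesis: the coin is fair.
# Observed (made-up) data: 61 heads in 100 flips.
n_flips, observed_heads = 100, 61

# Estimate the p-value by simulation: how often does a *fair* coin give
# a result at least as far from 50 heads as the one we observed?
n_sims = 20_000
extreme = 0
for _ in range(n_sims):
    heads = sum(random.random() < 0.5 for _ in range(n_flips))
    if abs(heads - 50) >= abs(observed_heads - 50):
        extreme += 1

p_value = extreme / n_sims
alpha = 0.05  # the conventional significance threshold

print(f"p ≈ {p_value:.3f}")
print("reject the null" if p_value < alpha else "fail to reject the null")
```

<p>The simulation gives \(p\) of about \(0.035\): data this lopsided would show up in only about \(3.5\)% of experiments on a fair coin, so under the usual threshold we reject the null.</p>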
<p>There are a few major problems with this setup.</p>
<ul>
<li>
<h4 id="artificial-decisiveness">Artificial decisiveness</h4>
<p>The Neyman-Pearson method makes a definitive choice between two distinct courses of action. This reinforces a general tendency to <a href="https://statmodeling.stat.columbia.edu/2019/09/13/deterministic-thinking-dichotomania/">force questions into yes-or-no binaries</a>, even when that sort of clean dichotomy isn’t realistic or appropriate to the question. Hypothesis testing tells us whether something exists, but not really how common or how big it is.<strong title="We've seen the effects of this unnecessary dichotomization over and over again during the pandemic. We argued about whether masks &quot;work&quot; or &quot;don't work&quot;, rather than discussing how well different types of masks work and how we could make them better. I know people who are still extremely careful to wear masks everywhere, but who wear cloth masks rather than medical—a combination that makes very little sense outside of this false binary."><sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup></strong></p>
<p><img src="/assets/blog/hypothesis-testing/size-matters-not.jpeg" alt="Yoda: &quot;Size matters not.&quot;" class="blog-image center" />
<em class="blog-image center">Unfortunately, Yoda is wrong. Sometimes we do care about size.</em></p>
<p>And more importantly, <strong>scientific knowledge is always provisional</strong>, so we need to continually revise our beliefs based on new information. But Neyman-Pearson is designed to make a final decision and close the book on the question, which just isn’t how science needs to work.</p>
</li>
<li>
<h4 id="bias-towards-the-null">Bias towards the null</h4>
<p>Neyman-Pearson creates a bias towards the null hypothesis, so rejecting the null feels like learning something new, while failing to reject is a default outcome. On one hand, this means it’s not a good tool if we want to show the null is true<strong title="There are [variants of hypothesis testing] that help you show some null hypothesis is (probably) basically right. But they're not nearly as common as the more standard setup."><sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup></strong>. On the other hand, a study that fails to reject the null feels like a failed study, and that’s a huge problem if the null really is true! This can <a href="https://en.wikipedia.org/wiki/Publication_bias">bias the studies we actually see</a> since many non-rejections aren’t published. <strong>It doesn’t help us that most research is accurate if <a href="/blog/hypothesis-testing-part-2#most-findings-false">most published papers are not</a>.</strong></p>
</li>
<li>
<h4 id="motivated-reasoning-and-p-hacking">Motivated reasoning and \(p\)-hacking</h4>
<p>Since researchers don’t want to fail, and do want to discover new things and get published, they have an incentive to <em>find</em> a way to reject the null.<strong title="[Nosek, Spies, and Motyl write] about the experience of carefully replicating some interesting work before publication, and seeing the effect vanish: &quot;Incentives for surprising, innovative results are strong in science. Science thrives by challenging prevailing assumptions and generating novel ideas and evidence that push the field in new directions. We cannot expect to eliminate the disappointment that we felt by “losing” an exciting result. That is not the problem, or at least not one for which the fix would improve scientific progress. The real problem is that the incentives for publishable results can be at odds with the incentives for accurate results. This produces a conflict of interest....The solution requires making incentives for _getting it right_ competitive with the incentives for _getting it published_.&quot;"><sup id="fnref:4"><a href="#fn:4" class="footnote">4</a></sup></strong> When done deliberately, we call this \(p\)-hacking, and there are a variety of <a href="https://replicationindex.com/2015/01/24/qrps/">questionable research practices</a> that can help us wrongly and artificially reject a null hypothesis. Worse, the <a href="https://www.americanscientist.org/article/the-statistical-crisis-in-science">garden of forking paths</a> means you can effectively \(p\)-hack without even knowing that you’re doing it, fudging both your theory and your data until they match.</p>
</li>
<li>
<h4 id="low-power-creates-misleading-results">Low power creates misleading results</h4>
<p>At the same time, many studies <a href="https://marginalrevolution.com/marginalrevolution/2022/07/quantitative-political-science-research-is-greatly-underpowered.html">have low <em>power</em></a>, meaning they probably won’t reject the null even if it is actually false. Combined with publication bias, this can make the published literature unreliable: in some subfields, a <a href="https://www.science.org/doi/10.1126/science.aac4716">majority of published results are untrue</a>. What’s more, when underpowered studies do find something, they tend to <a href="https://statmodeling.stat.columbia.edu/2022/06/28/published-estimates-of-group-differences-in-multisensory-integration-are-inflated/">overestimate the effect</a>, leading us to think everything works better than it actually does.</p>
</li>
</ul>
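<p>That last point is easy to see in a simulation. Here's a sketch in pure Python, with invented numbers: a true effect of \(0.2\) standard deviations, studied with samples of \(30\):</p>

```python
import math
import random

random.seed(1)

true_effect = 0.2   # a small true effect, in standard-deviation units
n = 30              # a small sample, so each study is underpowered
z_crit = 1.96       # two-sided 5% significance threshold

significant_estimates = []
for _ in range(20_000):
    # Each simulated study estimates the effect from n noisy observations.
    sample_mean = sum(random.gauss(true_effect, 1.0) for _ in range(n)) / n
    std_error = 1.0 / math.sqrt(n)
    if abs(sample_mean / std_error) > z_crit:   # "statistically significant"
        significant_estimates.append(sample_mean)

avg = sum(significant_estimates) / len(significant_estimates)
print(f"true effect: {true_effect}")
print(f"average estimate among significant studies: {avg:.2f}")
```

<p>The studies that clear the significance bar report an average effect more than twice the true one, because only unusually large estimates manage to be significant at this sample size.</p>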
<p>Despite all these problems, hypothesis testing is extremely useful—when we have a question it’s good for, and use it properly. So we’ll start by seeing how to make hypothesis testing work correctly, and some of the ways science has been shifting over the past couple decades to do a better job at significance testing.</p>
<h2 id="replication">2. Replication: Fisher’s principle</h2>
<p>To create reliable knowledge we need to <em>replicate</em> our results; there will always randomly be some bad studies and replication is the only way to weed them out. (There’s a reason it’s the “replication crisis” and not the “some bad studies” crisis.) Any one study may produce weird data through bad luck; but <strong>if we can get a specific result consistently, then we’ve found something real.</strong><strong title="The result we've found doesn't necessarily mean what we think it means, and that is its own tricky problem. But if you get a consistent effect then you've found _something_ even if you don't understand it yet."><sup id="fnref:5"><a href="#fn:5" class="footnote">5</a></sup></strong></p>
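<p>The arithmetic behind this is simple: false positives happen, but <em>independent</em> false positives on the same question get rare very quickly.</p>

```python
# If a single well-run study wrongly rejects a true null 5% of the time,
# the chance that k independent replications all do so by luck is alpha^k.
alpha = 0.05
for k in (1, 2, 3):
    print(f"{k} independent studies all significant by luck: {alpha ** k:.6f}")
# 0.05, then 0.0025, then 0.000125
```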
<p>In some fields it’s common for important results to get replicated early and often. I’ve written <a href="/blog/replication-crisis-math/">before</a> about how mathematicians are continuously replicating major papers by using their ideas in future work, and even just by reading them. Any field where <a href="https://statmodeling.stat.columbia.edu/2022/03/04/biology-as-a-cumulative-science-and-the-relevance-of-this-idea-to-replication/">research is iterative</a> will generally have this same advantage.</p>
<p>In other fields replication is less automatic. Checking important results would take active effort, and often doesn’t happen at all. Complex experiments may be too expensive and specialized to replicate: the average phase \(3\) drug trial <a href="https://www.sofpromed.com/how-much-does-a-clinical-trial-cost">costs about \($20\) million</a>, and even an exploratory phase 1 trial costs about \($4\) million. At those prices we’re almost forced to rely on one or two studies, and if we get unlucky with our first study it will be hard to correct our beliefs.<strong title="If a drug is wrongly approved, we continue learning about it through observation of the patients taking it. This is, for instance, how we can be quite certain that the [covid vaccines are effective and extremely safe]. But if we _don't_ approve a drug, there's no followup data to analyze, and the drug stays unapproved."><sup id="fnref:6"><a href="#fn:6" class="footnote">6</a></sup></strong></p>
<p>But sometimes we just don’t treat replication work like it’s important. If we run a new version of an old study and get the same result, it can feel like a waste of time: we “knew that already”. Since our results are old news, it can be hard to get the work published or otherwise acknowledged. But if we run a new version of an old study and <em>don’t</em> get the same result, many researchers will <a href="https://statmodeling.stat.columbia.edu/2016/01/26/more-power-posing/">assume our study must be flawed</a> because they already “know” the first study was right. Replication can be a thankless task.</p>
<p>The replication crisis led many researchers to <a href="https://statmodeling.stat.columbia.edu/2013/07/28/50-shades-of-gray-a-research-story/">reconsider these priorities</a>. Groups like the <a href="https://osf.io/wx7ck/">Many Labs Project</a> and <a href="https://osf.io/ezcuj/">the Reproducibility Project: Psychology</a> have engaged in large scale attempts to replicate famous results in psychology, which helped to clarify which “established” results we can actually trust. Devoting more attention to replication may mean we study fewer ideas and “discover” fewer things, but our knowledge will be much more reliable.<strong title="My favorite suggestion comes from [Daniel Quintana], who wants undergraduate psychology majors to contribute to replication efforts for their senior thesis research. Undergraduate research is often more about developing methodological skill than about producing genuinely innovative work, so it's a good fit for careful replication of already-designed studies."><sup id="fnref:7"><a href="#fn:7" class="footnote">7</a></sup></strong></p>
<h3 id="resistance-to-replication">Resistance to Replication</h3>
<p>Unfortunately, replication work often gets a response somewhere between apathy and active hostility. <strong>Lots of researchers see “failed” replications as actual failures</strong>—the original study managed to reject the null, so why can’t you?</p>
<p><a href="https://xkcd.com/892/"><img src="https://imgs.xkcd.com/comics/null_hypothesis.png" alt="XKCD 892: &quot;I can't believe schools are still teaching kids about the null hypothesis. I remember reading a big study that conclusively disproved it _years_ ago.&quot;" class="blog-image center" /></a>
<em class="blog-image center">Alt text: “Hell, my eighth grade science class managed to conclusively reject it just based on a classroom experiment. It’s pretty sad to hear about million-dollar research teams who can’t even manage that.”</em></p>
<p>Worse, replications that don’t find the original result are often treated like attacks on both the original research and the original researchers. They “followed the rules” and got a publishable result, and now the “data police” are trying to take it away from them. At its worst, this leads to accusations of <a href="https://www.businessinsider.com/susan-fiske-methodological-terrorism-2016-9">methodological terrorism</a>. But even in less hostile discussions, people want to “save” the original result and explain away the failed replication—either by finding <a href="https://en.wikipedia.org/wiki/Data_dredging">some specific subgroup</a> in the replication where the original result seems to hold, or by finding some way the replication differs from the original study and so “doesn’t count”.<strong title="You might wonder if a result that depends heavily on minor differences in study technique can actually be telling us anything important. That's a very good question. It's very easy to run a hypothesis test that basically _can't_ tell us anything interesting; we'll come back to this [later in the piece]."><sup id="fnref:8"><a href="#fn:8" class="footnote">8</a></sup></strong></p>
<p>This desire might seem weird, but it does follow pretty naturally from the Neyman-Pearson framework. The original goal of hypothesis testing is to make a decision and move on—even though that’s not how science should work. <strong>Replication re-opens questions that “were already answered”, which is good for science as a whole but frustrating to people who want to close the question and treat the result as proven.</strong></p>
<h3 id="meta-analysis">Meta-analysis: use all the data</h3>
<p>To make replication fit into a hypothesis testing framework, we often use <em>meta-analysis</em>, which synthesizes the data and results from multiple previous studies. Meta-analysis can be a powerful tool: why wouldn’t we want to use all the data out there, rather than picking just one study to believe? But it also allows us to move fully back into the Neyman-Pearson world. We can treat the whole collection of studies as one giant study, run one hypothesis test on it, and reach one conclusion.</p>
<p>Of course, this leaves us with all the fundamental weaknesses of hypothesis testing: it tries to render a definitive yes-or-no answer, and it’s biased towards sticking with the null hypothesis.</p>
<p>Moreover, a meta-analysis can only be as good as the studies that go into it. If those original studies are both representative and well-conducted, meta-analysis can produce a reliable conclusion. But if the component studies are sloppy and collect garbage data, as <a href="https://trialsjournal.biomedcentral.com/articles/10.1186/s13063-022-06415-5">disturbingly many studies are</a>, the meta-analysis will necessarily produce a garbage result. Good researchers try to screen out unusually bad studies, but if <em>all</em> the studies on some topic are bad then that won’t help.</p>
<p>And if not all studies get published, then <em>any</em> meta-analysis will be drawing on unrepresentative data. Imagine trying to estimate average human height, but the only data you have access to comes from studies of professional basketball players. No matter how careful we are, our estimates will be far too high, because our data all comes from unusually tall people. In the same way, if only unusually significant data gets published, even a perfect meta-analysis will be biased, because it can only use biased data.</p>
<p>Even if all studies get published, the <a href="https://statmodeling.stat.columbia.edu/2021/03/16/the-garden-of-forking-paths-why-multiple-comparisons-can-be-a-problem-even-when-there-is-no-fishing-expedition-or-p-hacking-and-the-research-hypothesis-was-posited-ahead-of-time-2/">garden of forking paths</a> can bias the meta-analysis in exactly the same way, since each study may report an unusually favorable measurement. This is like if some studies report the height of their participants, and others the weight, and others the shoe size—but they all pick the measure that makes their subjects look biggest. Each study might report its data accurately, but we’d still end up with a misleading impression of how large people actually are.</p>
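<p>A quick simulation (pure Python, with illustrative numbers) shows how bad this can get. Suppose a treatment does nothing at all, but only studies that find a significant benefit get published:</p>

```python
import math
import random

random.seed(2)

true_effect = 0.0          # the treatment genuinely does nothing
n = 50                     # observations per study
std_error = 1.0 / math.sqrt(n)
z_crit = 1.96

published = []
for _ in range(5_000):
    estimate = sum(random.gauss(true_effect, 1.0) for _ in range(n)) / n
    # Only studies finding a significant *benefit* make it into print.
    if estimate / std_error > z_crit:
        published.append(estimate)

meta = sum(published) / len(published)
print(f"published: {len(published)} of 5000 studies")
print(f"meta-analytic average of published estimates: {meta:+.2f}")
```

<p>Every individual study here is run honestly, but the meta-analytic average of the published estimates comes out solidly positive even though the true effect is exactly zero.</p>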
<p>Good meta-analyses will look for signs of selective publication, and there are statistical tools like <a href="https://en.wikipedia.org/wiki/Funnel_plot">funnel plots</a> or <a href="https://www.bitss.org/education/mooc-parent-page/week-2-publication-bias/detecting-and-reducing-publication-bias/p-curve-a-tool-for-detecting-publication-bias/">\(p\)-curves</a>, that can sometimes detect these biases in the literature. But these tools aren’t perfect, and of course they don’t tell us what we <em>would have seen</em> in the absence of publication bias. We can try to weed out bad studies after publication, but it’s better not to produce them in the first place.</p>
<p><img src="/assets/blog/hypothesis-testing/p-curve.png" alt="Two graphs illustrating the p-curve. Each graph measures the number of studies which had p=.01, .02, .03, .04, and .05. For experiments they expected to be p-hacked, the curve slopes upwards; for experiments they expected to not be p-hacked, the curve slopes downwards." class="blog-image center" />
<em class="blog-image center">The \(p\)-curve: when there’s \(p\)-hacking or selection bias, we expect most significant studies to be just barely significant. When the effect is real, we expect small \(p\)-values to be much more common than large ones.</em>
<em class="blog-image center">Figure from <a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2256237">Simonsohn, Nelson, and Simmons</a>.</em></p>
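<p>We can reproduce the shape of these curves with a simulation. (This sketch uses a hypothetical \(z\)-test with invented parameters; it doesn't model \(p\)-hacking itself, which is what tilts the flat null curve upward toward \(.05\).)</p>

```python
import math
import random

random.seed(3)

def p_value(z):
    """Two-sided p-value for a z statistic."""
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def p_curve(effect, n=30, n_studies=50_000):
    """Count significant p-values in bins (0,.01], (.01,.02], ..., (.04,.05]."""
    bins = [0] * 5
    for _ in range(n_studies):
        z = random.gauss(effect * math.sqrt(n), 1.0)  # z-statistic of one study
        p = p_value(z)
        if p < 0.05:
            bins[min(int(p / 0.01), 4)] += 1
    return bins

real_curve = p_curve(effect=0.5)   # a genuine effect
null_curve = p_curve(effect=0.0)   # no effect at all
print("real effect:", real_curve)
print("null effect:", null_curve)
```

<p>With a real effect, the counts fall off sharply as \(p\) grows; with no effect, the significant \(p\)-values are spread evenly across the bins.</p>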
<p>But of course, not all meta-analyses are good. Just like researchers have lots of ways to tweak their experiments to get statistical significance, doing a meta-analysis involves making a lot of choices about how to analyze the data, and so there are a lot of opportunities to \(p\)-hack or to get tricked by the garden of forking paths. Meta-analysis is like one giant hypothesis test, which means it can go wrong in exactly the same ways other hypothesis tests do.</p>
<h2 id="preregistration">3. Preregistration: do it right the first time</h2>
<p>Hypothesis testing does have real weaknesses, but many of the real-world problems we deal with only happen when we do it <em>wrong</em>. The point of the Neyman-Pearson method is to set out a threshold that determines whether we should act or not, collect data, and then see whether the data crosses the threshold. If we <a href="https://royalsocietypublishing.org/doi/10.1098/rsos.220099">ignore the result when it doesn’t give the answer we want</a>, then we’re not <em>really</em> using the Neyman-Pearson method at all.</p>
<p>But that’s exactly what happens in many common errors. <strong>When we ignore negative studies, we change the question from “yes or no” to “yes or try again later”.</strong> The garden of forking paths and \(p\)-hacking involve changing the threshold after you see your data. This makes it very easy for your data to clear the threshold, but <em>not</em> very informative.</p>
<p><img src="/assets/blog/hypothesis-testing/TexasSharpShooter-768x646.png" alt="Cartoon of a wall filled with bullet holes, and a cowboy painting a target around each hole." class="blog-image center" />
<em class="blog-image center">It’s easy to hit your target, if you pick the target after you shoot. But you don’t learn anything that way.</em>
<em class="blog-image center">Illustration by Dirk-Jan Hoek, CC-BY</em></p>
<p><strong>For hypothesis testing to work, we have to decide what would count as evidence for our theory <em>before</em> we collect the data.</strong> And then we have to actually follow through on that, even if the data tells us something we don’t want to hear.</p>
<h3 id="public-registries">Public registries</h3>
<p>Following through with this is simple for private decisions, if not always easy. When I want to buy a new kitchen gadget, sometimes I’ll decide how much I’m willing to pay before I check the price. If it turns out to be cheaper than my threshold, I’ll buy it; if it’s more expensive, I won’t. This helps me avoid making dumb decisions like “oh, that fancy pasta roller set is on sale, so it <em>must</em> be a good deal”. I don’t need any fancy way to hold myself accountable, since there’s no one else involved for me to be accountable <em>to</em>. And of course, if the pasta roller is super expensive and I buy it anyway, I’m only hurting myself.</p>
<p>But <strong>science is a public, communal activity, and our decisions and behavior need to be transparent so that other researchers can trust and build on our results.</strong> Even if no one ever lied, it’s so easy for us to fool <em>ourselves</em> that we need some way to guarantee that we did it right—both to other scientists, and to ourselves. Everyone saying, “I <em>swear</em> I didn’t change my mind after the fact, honest!” just isn’t reliable enough.</p>
<p>To create trust and transparency, we can publicly <a href="https://en.wikipedia.org/wiki/Preregistration_(science)">preregister</a> our research procedures. If we publish our plans before conducting the study, everyone else can <em>know</em> we made our decisions <em>before</em> we ran the study, and they can check to see if the analysis we did matches the analysis we said we would do. When done well, this prevents \(p\)-hacking and protects us from the garden of forking paths, because we aren’t making any choices after we see the data.</p>
<p>Public preregistration also limits publication bias. Even if the study produces boring negative results, the preregistration plan is already published, so we know the study happened—it can’t get lost in a file drawer where no one knows about it. This preserves the powerful statistical protection of the Neyman-Pearson method: our false positive rate <em>will</em> be five percent, and no more.</p>
<p>Many journals have implemented <a href="https://www.cos.io/initiatives/registered-reports">registered reports</a>, which allow researchers to submit their study designs for peer review, before they actually conduct the study. This means their work is evaluated based on the quality of the design and on whether the <em>question</em> is interesting; the publication won’t depend on what answer they find, which removes the selection bias towards only seeing positive results. Registered reports also restrict researchers to the analyses they had originally planned, rather than letting them fish around for an interesting result—or at least force them to explain why they changed their minds, so we can adjust for how much fishing they actually did.</p>
<p>The biggest concern about publication bias probably surrounds medical trials, where pharmaceutical companies have an incentive not to publish any work that would show their drugs don’t work. Many regulatory bodies, including the FDA, <a href="https://www.clinicaltrials.gov/ct2/manage-recs/background#RegLawPolicies">require clinical trials to be registered</a>; the NIH also maintains a public database of trial registries and results. And this change had a dramatic impact on the results we saw from clinical trials.</p>
<p><img src="https://ourworldindata.org/uploads/2022/02/Efficacy-in-trials-before-and-after-registration-requirement2.jpg" alt="Graph from OurWorldInData, showing the results of trials funded by the National Heart, Lung, and Blood institute. Before preregistration was required in 2000, most trials showed a substantial benefit. After 2000, most trials show a small and insignificant effect." class="blog-image center" />
<em class="blog-image center">Before widespread preregistration, most trials showed large benefits. When we got more careful, these benefits evaporated.</em></p>
<h3 id="planning-for-power">Planning for power</h3>
<p>Preregistration is also a great opportunity to <a href="https://twitter.com/BalazsAczel/status/1546871350316376064">plan out our study more carefully</a>, and in particular to think about statistical power in advance. Remember the power of a study is the probability that it will reject the null hypothesis if the null is in fact false. We get more power when the study is better and more precise, but also when the effect we’re trying to measure is bigger and more visible: it’s pretty easy to show that cigarette smoking is linked to cancer, because the effect is so dramatic.<strong title="Somewhat infamously, Fisher stubbornly resisted the claim that smoking _caused_ cancer until his death. But he never denied the correlation, which was too dramatic to hide."><sup id="fnref:9"><a href="#fn:9" class="footnote">9</a></sup></strong> But it’s much harder to detect the long-term effects of something like power posing, because the effects will be so small relative to other impacts on our personality.</p>
<p>On the other hand, if the effects are that small, maybe they don’t matter. If some economic policy reduces inflation by \(0.01\)%, then even if we could measure such a small reduction we wouldn’t really care—all we need to know is that the effect is “too small to matter”. With enough precision we could get statistical significance,<strong title="As long as two factors have [any relationship at all], the effect won't be [exactly zero], and with enough data we'll be able to reject the null hypothesis that there's no effect. But that just means &quot;is the effect exactly zero&quot; is often the wrong question; instead we want to know if the effect is big enough to matter."><sup id="fnref:10"><a href="#fn:10" class="footnote">10</a></sup></strong> but that doesn’t mean the result is <a href="https://statisticsbyjim.com/hypothesis-testing/practical-statistical-significance/">practically</a> or <a href="https://www.mhaonline.com/faq/clinical-vs-statistical-significance">clinically</a> significant. During the preregistration process we can decide <a href="http://daniellakens.blogspot.com/2017/05/how-power-analysis-implicitly-reveals.html">what kind of effects would be practically important</a>, and calibrate our studies to find those effects.</p>
<p><img src="/assets/blog/hypothesis-testing/scotty-power.png" alt="Star Trek image: &quot;Do we have the power to pull it off, Scotty?&quot;" class="blog-image center" /></p>
<p>Planning for power also makes it easier to treat negative results as serious scientific contributions. The aphorism says that <a href="https://quoteinvestigator.com/2019/09/17/absence/">absence of evidence is not evidence of absence</a>, but the aphorism is wrong. When a study has high power, we are very likely to see evidence <em>if</em> it exists; so absence of evidence becomes pretty good evidence of absence. If we know our studies have enough power, then our negative results become important and meaningful, and we won’t need to hide them in a file drawer.</p>
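<p>For a simple design, the power calculation is just a bit of arithmetic. Here's a sketch for a two-sided one-sample \(z\)-test, a stand-in for whatever design you actually have; real planning would use a dedicated power-analysis tool:</p>

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def power(effect, n, z_crit=1.96):
    """Power of a two-sided one-sample z-test; effect in sd units."""
    shift = effect * math.sqrt(n)
    # Probability the test statistic lands beyond ±z_crit.
    return (1 - norm_cdf(z_crit - shift)) + norm_cdf(-z_crit - shift)

def sample_size_for(effect, target=0.80):
    """Smallest n that achieves the target power (brute-force search)."""
    n = 1
    while power(effect, n) < target:
        n += 1
    return n

print(f"power at n=30 for an effect of 0.5 sd: {power(0.5, 30):.2f}")
print(f"n needed for 80% power, effect 0.5 sd: {sample_size_for(0.5)}")
print(f"n needed for 80% power, effect 0.1 sd: {sample_size_for(0.1)}")
```

<p>Note how shrinking the smallest effect you care about by a factor of five multiplies the required sample by roughly twenty-five, since the required \(n\) scales like \(1/\text{effect}^2\). That's why deciding what counts as "big enough to matter" has to happen before the study, not after.</p>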
<h3 id="a-limited-tool">A limited tool</h3>
<p>And all of this is fantastic—but it doesn’t address many of the problems science actually presents us with. <strong>Modern hypothesis testing is optimized for taking a clear, well-designed question and giving a simple yes-or-no answer.</strong> That’s a good match for clinical trials, where the question is pretty much “should we use this drug or not?” By the time we’re in Phase 3 trials, we know what we think the drug will accomplish, and we can describe in advance a clean test of whether it will or not. Preregistration solves the implementation problems pretty thoroughly.</p>
<p>But preregistration does limit our ability to explore our data. This is necessary to make hypothesis testing work properly, but it’s still a <em>cost</em>. We really <em>do</em> want to learn new things from our data, not just confirm conjectures we’ve already made. Preregistration can’t help us if we don’t already have a hypothesis we want to test. And often, when we’re doing research, we don’t.</p>
<h2 id="bigger-better-questions">4. Bigger, Better Questions</h2>
<p>Here are some scientific questions we might want to answer:</p>
<ul>
<li>What sorts of fundamental particles exist?</li>
<li>What social factors contribute to crime rates?</li>
<li>How does sleep deprivation affect learning?</li>
<li>How effective is this cancer drug?</li>
<li>How cost-effective is this public health program?</li>
<li>How malleable are all the different steel alloys you can make?</li>
</ul>
<p>None of these are yes-or-no questions. All of them are important parts of the scientific program, but none of them suggest specific hypotheses to run tests on. What do we do instead?</p>
<h3 id="spaghetti-on-the-wall">Spaghetti on the wall</h3>
<p>Maybe the most obvious idea is just to test, well, everything.</p>
<p><img src="/assets/blog/hypothesis-testing/test-all-the-things.jpg" alt="Meme: Test all the things!" class="blog-image center" />
<em class="blog-image center">With apologies to <a href="http://hyperboleandahalf.blogspot.com/2010/06/this-is-why-ill-never-be-adult.html">Allie Brosh</a>.</em></p>
<p>Now, we can’t test literally everything; collecting data takes time and money, and we can only conduct so many experiments. But we can take all the data we already have on crime rates, or on learning; and we can list every hypothesis we can think of and test them all for statistical significance. This <a href="https://en.wikipedia.org/wiki/Data_dredging">data dredging</a> is a very common, <a href="https://xkcd.com/882/">very bad idea</a>, especially in the modern era of <a href="https://journals.sagepub.com/doi/full/10.1177/0268396220915600">machine learning and big data</a>. Mass testing like this takes all the problems of hypothesis testing—false positives, publication bias, low power, and biased estimates—and makes them much worse.</p>
<p><strong>If we test every idea we can think of, most of them will be wrong.</strong> As we saw in part 2, that means a huge fraction of our positive results will be false positives. Sure, if we run all our tests perfectly, then only \(5\)% of our wrong ideas will give false positives. But since we have so many <em>more</em> bad ideas than good ones, we’ll still get way more false positives than true positives. (This is easiest to see in the case where all of our ideas are wrong—then <em>all</em> our positive results will be false positives!)</p>
<p>If we test just twenty different wrong ideas, there’s a roughly two-in-three chance that one of them will fall under the \(5\)% significance threshold, just by luck.<strong title="The odds of getting no false positives after n trials is 0.95^n, so the odds of getting a false positive are 1 - 0.95^n. And 0.95^20 ≈ 0.358, so 1 - 0.95^20 ≈ 0.642. It's a little surprising this is so close to 2/3, but there's a reason for it—sort of. If you compute (1- 1/n)^n you will get approximately 1/e, so the odds of getting a false positive at a 1/20 false positive threshold after 20 trials are roughly 1-1/e ≈ .63."><sup id="fnref:11"><a href="#fn:11" class="footnote">11</a></sup></strong> That’s a lot higher than the false positive rate of \(5\)% that we asked for, and means we are very likely to “discover” something false. And then we’ll waste even more time and resources following up on our surprising new “discovery”.</p>
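<p>The footnote’s arithmetic is easy to check directly. Here’s a quick sketch in Python (the function name is mine, purely for illustration):</p>

```python
def family_false_positive_rate(n_tests, alpha=0.05):
    """Chance that at least one of n_tests true null hypotheses
    is wrongly rejected at significance level alpha."""
    return 1 - (1 - alpha) ** n_tests

# One test behaves as advertised; twenty tests give two-in-three odds of a fluke.
print(round(family_false_positive_rate(1), 3))   # 0.05
print(round(family_false_positive_rate(20), 3))  # 0.642
```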
<p><img src="/assets/blog/hypothesis-testing/spurious-correlation.png" alt="Graph of &quot;divorce rate in Maine&quot; against &quot;per capita consumption of margarine&quot; between 2000 and 2009. The correlation is 99.26%." class="blog-image center" />
<em class="blog-image center">If you test everything, you’ll find a ton of <a href="https://www.tylervigen.com/spurious-correlations">spurious correlations</a> like this one.</em></p>
<h3 id="multiple-comparisons">Multiple Comparisons</h3>
<p>This <a href="https://en.wikipedia.org/wiki/Multiple_comparisons_problem">multiple comparisons problem</a> has a mathematical solution: we can adjust our significance threshold to bring our false positive rate back down. A rough rule of thumb is the <a href="https://en.wikipedia.org/wiki/Bonferroni_correction">Bonferroni correction</a>, where we divide our significance threshold by the number of different ideas we’re testing. If we test twenty ideas but divide our \(5\)% significance threshold by twenty, we get a corrected threshold of \(0.25\)%: each <em>individual</em> test then has only a one-in-four-hundred chance of giving a false positive, and across all twenty tests we have roughly a five percent chance of getting at least one false positive, which is the overall rate we wanted.</p>
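<p>The correction itself is one line of arithmetic. A minimal sketch, reusing the twenty-tests example (the function name and numbers are illustrative, not from any particular library):</p>

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests

corrected = bonferroni_threshold(0.05, 20)   # 0.0025, i.e. 0.25%
family_rate = 1 - (1 - corrected) ** 20      # chance of any false positive
print(corrected)
print(round(family_rate, 3))                 # back near the 0.05 we asked for
```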
<p>The problem is sociological, not mathematical: people don’t <em>like</em> correcting for multiple comparisons, because it makes it harder to reach statistical significance and <a href="https://royalsocietypublishing.org/doi/10.1098/rsos.220099">“win” the science game</a>. Less cynically, correcting for multiple comparisons reduces the power of our studies dramatically, making it harder to discover real and important results. Ken Rothman’s 1990 paper <a href="https://www.jstor.org/stable/20065622">No Adjustments Are Needed for Multiple Comparisons</a> articulates both of these arguments admirably clearly: “scientists should not be so reluctant to explore leads that may turn out to be wrong that they penalize themselves by missing possibly important findings.”</p>
<p>Rothman is right in two important ways. First, researchers should not be penalized for conducting studies that don’t reach statistical significance. Studies that fail to reject the null, or measure a tiny effect, are valuable contributions to our store of knowledge. We tend to overlook and devalue these null results, but that’s a mistake, and one of the major benefits of preregistration is protecting and rewarding them.</p>
<p>Second, it’s important to investigate potential leads that might not pan out. As Isaac Asimov <a href="https://quoteinvestigator.com/2015/03/02/eureka-funny/">may or may not have said</a>, “The most exciting phrase in science is not ‘Eureka!’ but ‘That’s funny…’”; and it’s important to follow up on those unexpected, funny-looking results. After all, we have to find hypotheses somewhere.</p>
<p><strong>But undirected exploration is, very specifically, not hypothesis testing.</strong> Rothman suggests that we often want to “earmark for further investigation” these unexpected findings. But <strong>hypothesis testing isn’t designed to flag ideas for future study; instead a hypothesis test <em>concludes</em> the study, with (in theory) a definitive answer.</strong> Rothman’s goals are correct and important, but hypothesis testing and statistical significance aren’t the right tools for those goals.<strong title="From what I can tell, Rothman may well agree with me. His [twitter feed] features arguments against [using statistical significance] and [dichotomized hypotheses in place of estimation], which is roughly the position I'm advocating. But _if_ you're doing hypothesis testing, you should try to do it correctly."><sup id="fnref:12"><a href="#fn:12" class="footnote">12</a></sup></strong></p>
<h3 id="jump-to-conclusions">Jumping to conclusions</h3>
<p>At some point, though, we do generate some hypotheses.<strong title="You might notice that I'm not really saying anything about where we find these hypotheses. There's a good reason for that. Finding hypotheses is hard! It's also the most _creative_ and unstructured part of the scientific process. The question is important, but I don't have a good answer."><sup id="fnref:13"><a href="#fn:13" class="footnote">13</a></sup></strong> If we’re studying how memory interacts with speech, we might hypothesize that <a href="https://pubmed.ncbi.nlm.nih.gov/2295225/">describing a face verbally will make you worse at recognizing it later</a>, which gives us something concrete to test. Or, more tenuously, if we’re studying the ways that sexism affects decision-making, we might hypothesize that <a href="https://www.washingtonpost.com/news/monkey-cage/wp/2014/06/05/hurricanes-vs-himmicanes/">hurricanes with feminine names are more deadly because people don’t take them as seriously</a>.</p>
<p>And then we can test these hypotheses, and reject the null or not, and then—what? What does that tell us?</p>
<p><img src="/assets/blog/hypothesis-testing/what-did-we-learn.jpg" alt="Spongebob meme: &quot;What did we learn today?&quot;" class="blog-image center" /></p>
<p>We have a problem, because these hypotheses <em>aren’t</em> the questions we really want to answer. If <a href="https://www.vox.com/2020/1/8/21051869/indoor-air-pollution-student-achievement">installing air filters in classrooms increases measured learning outcomes</a>, that’s a fairly direct answer to the question of whether installing air filters in classrooms can help children learn, so a hypothesis test really can answer our question. But we shouldn’t decide that sexism is fake just because <a href="https://statmodeling.stat.columbia.edu/2016/04/02/himmicanes-and-hurricanes-update/">feminine names probably don’t make hurricanes deadlier</a>!<strong title="For that matter, if feminine hurricane names were _less_ dangerous we could easily tell a story about how _that_ was evidence for sexism. That's the garden of forking paths popping up again, where many different results could be evidence for our theory."><sup id="fnref:14"><a href="#fn:14" class="footnote">14</a></sup></strong> We should only care about the hurricane-names thing if we think it tells us something about our actual, real-world concerns.</p>
<p>And that means we can’t just test one random hypothesis relating to our big theoretical question and call it a day. We need to develop hypotheses that are reasonably connected to the questions we care about, and we need to approach those questions from <a href="https://www.nature.com/articles/d41586-018-01023-3">many different perspectives</a> to make sure we’re not missing anything. That means <strong>there’s a ton of work <em>other</em> than hypothesis testing that we need to do if we want our hypothesis tests to tell us anything useful</strong>:<strong title="In their wonderfully named (and very readable) paper [Why hypothesis testers should spend less time testing hypotheses], Anne Scheel, Leonid Tiokhin, Peder Isager, and Daniël Lakens call this the _derivation chain_: the empirical and conceptual linkages that allow you to derive broad theoretical claims from the specific hypotheses you test. "><sup id="fnref:15"><a href="#fn:15" class="footnote">15</a></sup></strong></p>
<ul>
<li><strong>Defining terms:</strong> First we need to decide what question we’re actually trying to answer! There are a lot of different things people mean by “sexism” or “memory” or “crime”, and our research will be confused unless we make sure we’re consistently talking about the same thing.<strong title="This is one of the major skills you develop in math courses, because a lot of the work of math is figuring out what question you're trying to answer. I've written about this [before], but I also recommend Keith Devlin's [excellent post] on what "mathematical thinking" is, especially the story he tells after the long blockquote."><sup id="fnref:16"><a href="#fn:16" class="footnote">16</a></sup></strong></li>
<li><strong>Causal modeling:</strong> What sort of relationships do we expect to see? If our theory on the Big Question is true, what experimental results does that imply? What other factors could confound or interfere with these effects? We need to know what relationships we’re looking for before we can design tests for them.</li>
<li><strong>Developing measurements:</strong> How will we measure the inputs and outputs to our theory? What numbers will we use to measure crime levels, or educational improvement, or ability to remember faces? Are the things we’re measuring closely connected to the definitions we chose earlier? It’s easy to measure <em>something</em> but hard to make sure the measurement <a href="https://en.wikipedia.org/wiki/Goodhart's_law">tells us what we want to know</a>.</li>
<li><strong>Determining scope:</strong> When do we expect our theory to work, and for what sort of extreme results do we expect it to break down? What experiments should we not bother running? It’s worth studying whether mild air pollution makes learning harder, without worrying about the major health effects that we know severe pollution causes.</li>
<li><strong>Auxiliary assumptions:</strong> What extra assumptions are we making in all the previous steps, and how can we verify them? Does installing classroom air filters actually reduce pollution? Do people who verbally described a face try equally hard at the later recall task? How can we tell? We can’t avoid making assumptions, but we can try to be explicit about them, and check the ones that could cause problems.</li>
</ul>
<p>Without all this work, we can come up with hypotheses, but they won’t make sense. We can run experiments, but we can’t interpret them. And we can do hypothesis tests, but we can’t use them to answer big questions.</p>
<h2 id="failing-to-measure-up">5. Failing to measure up</h2>
<p>And sometimes we have a direct question that presents a clear experiment to run, but not a clear <em>hypothesis</em>. Questions like “How effective is this cancer drug?” or “how malleable is this steel alloy?” aren’t big theoretical questions, but also aren’t specific hypotheses that can be right or wrong. We want <em>numbers</em>.</p>
<p>In practice we often use hypothesis testing to answer these questions anyway—but with an awkward kludge. We can test a null hypothesis like “this public health program doesn’t save lives”. If we fail to reject the null, we conclude that it doesn’t help <em>at all</em>; if we do reject the null, we see how many lives the program saved in our experiment, and use that as an estimate of its effectiveness.</p>
<p>This works well enough that we kinda get away with it, but it introduces consequential biases into our measurements. If the measured effect is small, we <a href="https://statmodeling.stat.columbia.edu/2020/09/17/we-want-certainty-even-when-its-not-appropriate/">round it down to zero</a>, concluding there is no benefit when there may well be a small but real benefit (or a small but real harm). And if significant studies are more likely to be seen than non-significant studies, we will see <a href="https://statmodeling.stat.columbia.edu/2022/05/25/the-failure-of-null-hypothesis-significance-testing-when-studying-incremental-changes-and-what-to-do-about-it/">more unusually good results than unusually bad ones</a>, which means we will believe basically everything is more effective than it actually is.<strong title="We also sometimes find that our conclusions depend on exactly which questions we ask. Imagine a study where we need a 5% difference to be significant, and Drug A produces a 3% improvement over placebo and Drug B produces a 7% improvement. Then the effect of Drug A isn't significant, and the effect of Drug B is, so we say that Drug A doesn't work and Drug B does. But the difference between Drug A and Drug B is _not_ significant—so if we ask that question, we conclude that the two drugs are equally good! [The difference between "significant" and "not significant" is not itself statistically significant], so it matters exactly which hypothesis we choose to test."><sup id="fnref:17"><a href="#fn:17" class="footnote">17</a></sup></strong></p>
<p>We shouldn’t be surprised that hypothesis testing does a bad job of measuring things, because hypothesis testing isn’t designed to measure things. It’s specifically designed to <em>not</em> report a measurement, and just tell us whether we should act or not. It’s the wrong tool for this job.</p>
<p>We can and should do better. A study in which mortality decreases by \(0.1\)% is evidence that the program <em>works</em>—possibly weak evidence, but still evidence! And if we <a href="https://onlinelibrary.wiley.com/doi/10.1111/jeb.14009">skip the hypothesis testing and put measurement first</a>, we can represent that fact accurately.</p>
<h3 id="compatibility-checking">Compatibility checking</h3>
<p>The simplest thing to do would be to just average all our measurements and report that number. This is a type of <em>point estimate</em>, the single number that most accurately reflects our best guess at the true value of whatever we’re measuring.</p>
<p>But a point estimate by itself doesn’t give as much information as we need. We need to measure our uncertainty around that estimate, and describe how <em>confident</em> we are in it. A drug that definitely makes you a bit healthier is very different from one that could save your life and could kill you, and it’s important to be clear which one we’re talking about.</p>
<p>We can supplement our point estimate with a <em>confidence interval</em>, also called a <em>compatibility interval</em>, which is sort of like a backwards hypothesis test. We give all the values that are compatible with our measurement—values that would make our estimate relatively unsurprising. <strong>Rather than starting with a single null hypothesis and checking whether our measurement is compatible with it, we start with the measurement, and describe all the hypotheses that would be compatible.</strong></p>
<p>The definition is a bit more technical, and easy to get slightly wrong: If we run \(100\) experiments, and generate a \(95\)% confidence interval for each experiment, then the true value will lie in about \(95\) of those intervals. A common mistake is to say that if we generate one confidence interval, the true value has a \(95\)% chance of landing in it, but that’s <a href="https://statmodeling.stat.columbia.edu/2019/04/21/no-its-not-correct-to-say-that-you-can-be-95-sure-that-the-true-value-will-be-in-the-confidence-interval/">backwards, and not quite right</a>.<strong title="Sometimes we can look at our interval after the fact and make an informed guess whether it's one of the good intervals or the bad intervals. If I run a small study to measure average adult heights, there's some risk I get a 95% confidence interval that contains, say, everything between five feet and six feet. Based on outside knowledge, I'm pretty much 100% confident in that interval, not just 95%. "><sup id="fnref:18"><a href="#fn:18" class="footnote">18</a></sup></strong> But <em>before</em> we run the experiment, we expect a \(95\%\) chance that the true value will be in the confidence interval we compute.</p>
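<p>The “\(95\) out of \(100\)” framing is easy to see by simulation. Here’s a rough sketch, using a normal approximation for the interval; the true mean, sample sizes, and function name are all made-up illustrations, not from the text:</p>

```python
import random

def approx_95_interval(sample):
    """Rough 95% compatibility interval for a population mean,
    via the normal approximation (reasonable at this sample size)."""
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    half_width = 1.96 * (var / n) ** 0.5
    return mean - half_width, mean + half_width

random.seed(0)           # reproducible runs
TRUE_MEAN = 10.0         # the "true value" our experiments try to estimate
trials = 1000
covered = 0
for _ in range(trials):
    sample = [random.gauss(TRUE_MEAN, 2.0) for _ in range(100)]
    lo, hi = approx_95_interval(sample)
    if lo <= TRUE_MEAN <= hi:
        covered += 1
print(covered / trials)  # close to 0.95, as the definition promises
```

<p>Note that the coverage guarantee is a property of the <em>procedure</em>, averaged over many repetitions; any single interval either contains the true value or it doesn’t.</p>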
<p><img src="/assets/blog/hypothesis-testing/confidence-intervals.png" alt="a diagram of a collection of confidence intervals" class="blog-image center" />
<em class="blog-image center">Each vertical bar is a compatibility interval from one experiment, with a circle at the point estimate. Three of the intervals don’t include the true value, which is roughly \(5\)% of the \(50\) intervals.</em>
<em class="blog-image center">Image by <a href="https://commons.wikimedia.org/wiki/File:Neyman_Construction_Confidence_Intervals.png">Randy.l.goodrich</a>, <a href="https://creativecommons.org/licenses/by-sa/4.0">CC BY-SA 4.0</a></em></p>
<p>Mathematically, these intervals are closely related to hypothesis tests. <strong>A result is statistically significant if the null hypothesis (often \(0\)) lies outside the compatibility interval.</strong> So in a sense a compatibility interval gives the same information as a hypothesis test, just in a different format. But changing the format shifts the emphasis of our work, and the way we think about it. Rather than starting by picking a specific claim and then saying yes or no, we give a <em>number</em>, and talk about what theories and models are compatible with it. This avoids needing to pick a specific hypothesis. It also gives our readers more information, rather than compressing our answer into a simple binary.</p>
<p>Focusing on compatibility intervals can also help avoid publication bias, and make it easier to use all the data that’s been collected. When we report measurements and compatibility intervals, we can’t “fail to reject” a null hypothesis. Every study will succeed at producing <em>an estimate</em>, and a compatibility interval, so every study produces knowledge we can use, and no study will “fail” and be hidden in a file drawer. Some studies might be designed and run better than others, and so give more precise estimates and narrower compatibility intervals. We can give more weight to these studies when forming an opinion. But we won’t discard a study just for yielding an answer we didn’t expect.</p>
<h2 id="bayes">6. Bayesian statistics: the other path</h2>
<p>Throughout this series, we’ve used the language and perspective of <a href="https://en.wikipedia.org/wiki/Frequentist_inference">frequentist statistics</a>. This is the older and more classical approach to statistics, which defines probability in terms of repeated procedures. “If we test a true null hypothesis a hundred times, we’ll only reject it about five times”. “If we run this sampling procedure a hundred times, the compatibility interval will include the true value about \(95\) times.” This approach to probability is philosophically straightforward, and leads to relatively simple calculations.</p>
<p>But there are questions it absolutely can’t answer—like “what is the probability my null hypothesis is true?”—since we can’t frame them in terms of repeated trials. Remember, <strong>the \(p\)-value is <em>not</em> the probability the null is false.</strong> Its definition is a complicated conditional hypothetical that’s hard to state clearly in English: it’s the probability that we would observe a result at least as extreme as the one we actually did observe, under the assumption that the null hypothesis is true. This is easy to compute, but it’s difficult to understand what it <em>means</em> (which is why I wrote like <a href="/blog/hypothesis-testing-part-1/">six thousand words trying to explain it</a>).</p>
<p>But there’s another school of statistics that <em>can</em> produce answers to those questions. <a href="https://en.wikipedia.org/wiki/Bayesian_inference">Bayesian inference</a>, which I’ve <a href="https://jaydaigle.net/blog/overview-of-bayesian-inference/">written about before</a>, lets us assign probabilities to pretty much any statement we can come up with. This is great, because <strong>it can directly answer almost any question we actually have. But it’s also much, <em>much</em> harder to use</strong>, because it requires much more data and more computation. And the bigger and more abstract the question we ask, the worse this gets.</p>
<p>Bayesian inference needs three distinct pieces of information:</p>
<ul>
<li>The probability of seeing our data, assuming the hypothesis is true, which is just the \(p\)-value we’ve been discussing;</li>
<li>The probability of seeing our data, assuming the hypothesis is <em>false</em>, which is another \(p\)-value; and</li>
<li>The <em>prior probability</em> that our hypothesis is true, based on the evidence we had <em>before</em> running the experiment.</li>
</ul>
<p>Then we run an experiment, collect data, and use a formula called <a href="https://en.wikipedia.org/wiki/Bayes'_theorem">Bayes’s theorem</a> to produce a <em>posterior probability</em>, our final estimate of the likelihood our hypothesis is true.<strong title="We saw examples of this calculation in part 2, when we [calculated what fraction of positive results were true positives]. Note that we had to make assumptions about what fraction of null hypotheses are true; that's the Bayesian prior probability. Tables like the ones we used there show up a lot in simple Bayesian calculations."><sup id="fnref:19"><a href="#fn:19" class="footnote">19</a></sup></strong></p>
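<p>The update itself is a short calculation. Here’s a minimal sketch with made-up illustrative numbers, in the spirit of the tables from part 2: a \(10\)% prior, \(80\)% power, and a \(5\)% false positive rate:</p>

```python
def posterior_probability(prior, p_data_if_true, p_data_if_false):
    """Bayes's theorem: probability the hypothesis is true,
    given that we observed this data."""
    true_branch = prior * p_data_if_true
    false_branch = (1 - prior) * p_data_if_false
    return true_branch / (true_branch + false_branch)

# Under these assumptions, a significant result is real only about
# two-thirds of the time.
print(round(posterior_probability(0.10, 0.80, 0.05), 2))  # 0.64
```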
<p>That’s a lot more complicated! First of all, we have to compute two \(p\)-values, not just one. But second, we calculate the extra \(p\)-value under the assumption that “our hypothesis is false”, and that covers a lot of ground. If our hypothesis is that some drug prevents cancer deaths, then the alternative includes “the drug does nothing”, “the drug increases cancer deaths”, “the drug prevents some deaths and causes others”, and even silly stuff like “aliens are secretly interfering with our experiments”. To do the Bayesian calculation we need to list every possible way our hypothesis could be false, and compute how likely each of those ways is and how plausible each one makes our data. That gets very complicated very quickly.</p>
<p>(In contrast, Fisher’s approach starts by assuming the null hypothesis is true, and ignores every other possibility. This makes the calculation much easier to actually do, but it also limits how much we can actually conclude. High \(p\)-value? Nothing weird. Low \(p\)-value? Something is weird. But that’s all we learn.)</p>
<p>And <em>third</em>, even if we can do all those calculations somehow, we need that prior probability. We want to figure out how likely it is that a drug prevents cancer. And as the first step, we have to plug in…the probability that the drug prevents cancer. We don’t know that! That’s what we’re trying to compute!</p>
<p>Bayesian machinery is great for refining and updating numbers we already have. And the more data we collect, the less the prior probability matters; we’ll eventually wind up in the correct place. So in practice, we just pick a prior that’s easy to compute with, plug it into Bayes’s theorem, and try to collect enough data that we expect our answer to be basically right.</p>
<p>And that brings us back to where we began, with replication. The more experiments we run, the more we can learn.</p>
<h2 id="conclusion">7. Conclusion: (Good) data is king</h2>
<p>I closed out part 2 with an <a href="https://xkcd.com/2400/">xkcd statistics tip</a>: “always try to get data that’s good enough that you don’t need to do statistics on it.” Here at the end of part 3, we find ourselves in exactly the same place. But this time, I hope you see that tip, not as a punchline, but as actionable advice.</p>
<p>Modern hypothesis testing “works”, statistically, as long as you ask exactly the questions it answers, and are extremely careful in how you use it. But we often misuse it by collecting flawed or inadequate data and then drawing strong, sweeping conclusions. We run small studies and then \(p\)-hack our results into significance, rather than running the careful, expensive studies that would genuinely justify our theoretical claims. We report the results as over-simplified yes-or-no answers rather than trying to communicate the complicated, messy things we observed. And if we manage to reject the null on one study we issue press releases claiming it confirms all our grand theories about society.</p>
<p><a href="https://xkcd.com/2494/"><img src="https://imgs.xkcd.com/comics/flawed_data.png" alt="XKCD 2494: &quot;We realized all our data is flawed. Good: ...so we're not sure about our conclusions. Bad: ...so we did lots of math and then decided our data was actually fine. Very bad: ...so we trained an AI to generate better data.&quot;" class="blog-image center" /></a>
<em class="blog-image center">Too often, we use statistics to help us pretend bad data is actually good.</em></p>
<p>In this essay we’ve seen a number of possible solutions, but they’re basically all versions of “collect more and better data”:</p>
<ul>
<li>Do enough foundational work that you can formulate good hypotheses, and figure out what data you need to draw usable conclusions.</li>
<li>If you have numerical data, use the numbers, rather than throwing away information and just giving a single yes or no.</li>
<li>Preregister your studies, to make sure your data is useful and you’re not altering it to fit your conclusions.</li>
<li>Replicate your studies, so you collect more data that can either confirm or correct your beliefs.</li>
</ul>
<p>Even the Bayesian approach comes back to this. Bayesianism relies on the prior probability; but that really just means that, if we already have some knowledge before we run the experiment, we should use it!</p>
<p>Statistics is powerful and useful. We couldn’t do good science without it. But data—empirical observation—is the core of science. Statistics helps us understand the data we have, and it helps us figure out what data we need. But if our data sucks, statistics alone cannot save us.</p>
<hr />
<p><em>Have questions about hypothesis testing? Is there something I didn’t cover, or even got completely wrong? Do you have a great idea for doing science better? Tweet me <a href="https://twitter.com/profjaydaigle">@ProfJayDaigle</a> or leave a comment below.</em></p>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>We could reverse this, and err on the side of acting, if we think wrongly doing nothing has worse downsides than wrongly acting. But it’s pretty uncommon to do it that way in practice. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>We’ve seen the effects of this unnecessary dichotomization over and over again during the pandemic. We argued about whether masks “work” or “don’t work”, rather than discussing how well different types of masks work and how we could make them better. I know people who are still extremely careful to wear masks everywhere, but who wear cloth masks rather than medical—a combination that makes very little sense outside of this false binary. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3">
<p>There are <a href="https://journals.sagepub.com/doi/full/10.1177/2515245918770963">variants of hypothesis testing</a> that help you show some null hypothesis is (probably) basically right. But they’re not nearly as common as the more standard setup. <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
<li id="fn:4">
<p><a href="http://arxiv.org/pdf/1205.4251.pdf">Nosek, Spies, and Motyl write</a> about the experience of carefully replicating some interesting work before publication, and seeing the effect vanish: "Incentives for surprising, innovative results are strong in science. Science thrives by challenging prevailing assumptions and generating novel ideas and evidence that push the field in new directions. We cannot expect to eliminate the disappointment that we felt by “losing” an exciting result. That is not the problem, or at least not one for which the fix would improve scientific progress. The real problem is that the incentives for publishable results can be at odds with the incentives for accurate results. This produces a conflict of interest….The solution requires making incentives for <em>getting it right</em> competitive with the incentives for <em>getting it published</em>." <a href="#fnref:4" class="reversefootnote">↩</a></p>
</li>
<li id="fn:5">
<p>The result we’ve found doesn’t necessarily mean what we think it means, and that is its own tricky problem. But if you get a consistent effect then you’ve found <em>something</em> even if you don’t understand it yet. <a href="#fnref:5" class="reversefootnote">↩</a></p>
</li>
<li id="fn:6">
<p>If a drug is wrongly approved, we continue learning about it through observation of the patients taking it. This is, for instance, how we can be quite certain that the <a href="https://www.hopkinsmedicine.org/health/conditions-and-diseases/coronavirus/is-the-covid19-vaccine-safe">covid vaccines are effective and extremely safe</a>. But if we <em>don’t</em> approve a drug, there’s no followup data to analyze, and the drug stays unapproved. <a href="#fnref:6" class="reversefootnote">↩</a></p>
</li>
<li id="fn:7">
<p>My favorite suggestion comes from <a href="https://www.nature.com/articles/s41562-021-01192-8">Daniel Quintana</a>, who wants undergraduate psychology majors to contribute to replication efforts for their senior thesis research. Undergraduate research is often more about developing methodological skill than about producing genuinely innovative work, so it’s a good fit for careful replication of already-designed studies. <a href="#fnref:7" class="reversefootnote">↩</a></p>
</li>
<li id="fn:8">
<p>You might wonder if a result that depends heavily on minor differences in study technique can actually be telling us anything important. That’s a very good question. It’s very easy to run a hypothesis test that basically <em>can’t</em> tell us anything interesting; we’ll come back to this <a href="#jump-to-conclusions">later in the piece</a>. <a href="#fnref:8" class="reversefootnote">↩</a></p>
</li>
<li id="fn:9">
<p>Somewhat infamously, Fisher stubbornly resisted the claim that smoking <em>caused</em> cancer until his death. But he never denied the correlation, which was too dramatic to hide. <a href="#fnref:9" class="reversefootnote">↩</a></p>
</li>
<li id="fn:10">
<p>As long as two factors have <a href="https://www.gwern.net/Everything">any relationship at all</a>, the effect won’t be <a href="https://statmodeling.stat.columbia.edu/2017/06/29/lets-stop-talking-published-research-findings-true-false/">exactly zero</a>, and with enough data we’ll be able to reject the null hypothesis that there’s no effect. But that just means “is the effect exactly zero” is often the wrong question; instead we want to know if the effect is big enough to matter. <a href="#fnref:10" class="reversefootnote">↩</a></p>
</li>
<li id="fn:11">
<p>The probability of getting no false positives after \(n\) trials is \(0.95^n\), so the probability of getting at least one false positive is \(1 - 0.95^n\). And \(0.95^{20} \approx 0.358\), so \(1 - 0.95^{20} \approx 0.642\).</p>
<p>It’s a little surprising this is so close to \(2/3\), but there’s a reason for it—sort of. If you compute \( (1- 1/n)^n\) for large \(n\) you will get approximately \(1/e\), so the probability of getting at least one false positive at a \(1/n\) false positive threshold after \(n\) trials is roughly \(1-1/e \approx 0.63\). <a href="#fnref:11" class="reversefootnote">↩</a></p>
</li>
<li id="fn:12">
<p>From what I can tell, Rothman may well agree with me. His <a href="https://twitter.com/ken_rothman">twitter feed</a> features arguments against <a href="https://twitter.com/_MiguelHernan/status/1476928329794027522">using statistical significance</a> and <a href="https://twitter.com/vamrhein/status/1526879947104702465">dichotomized hypotheses in place of estimation</a>, which is roughly the position I’m advocating. But <em>if</em> you’re doing hypothesis testing, you should try to do it correctly. <a href="#fnref:12" class="reversefootnote">↩</a></p>
</li>
<li id="fn:13">
<p>You might notice that I’m not really saying anything about where we find these hypotheses. There’s a good reason for that. Finding hypotheses is hard! It’s also the most <em>creative</em> and unstructured part of the scientific process. The question is important, but I don’t have a good answer. <a href="#fnref:13" class="reversefootnote">↩</a></p>
</li>
<li id="fn:14">
<p>For that matter, if feminine hurricane names were <em>less</em> dangerous we could easily tell a story about how <em>that</em> was evidence for sexism. That’s the garden of forking paths popping up again, where many different results could be evidence for our theory. <a href="#fnref:14" class="reversefootnote">↩</a></p>
</li>
<li id="fn:15">
<p>In their wonderfully named (and very readable) paper <a href="https://journals.sagepub.com/doi/10.1177/1745691620966795">Why hypothesis testers should spend less time testing hypotheses</a>, Anne Scheel, Leonid Tiokhin, Peder Isager, and Daniël Lakens call this the <em>derivation chain</em>: the empirical and conceptual linkages that allow you to derive broad theoretical claims from the specific hypotheses you test. <a href="#fnref:15" class="reversefootnote">↩</a></p>
</li>
<li id="fn:16">
<p>This is one of the major skills you develop in math courses, because a lot of the work of math is figuring out what question you’re trying to answer. I’ve written about this <a href="/blog/asking-the-right-question/">before</a>, but I also recommend Keith Devlin’s <a href="http://devlinsangle.blogspot.com/2012/08/what-is-mathematical-thinking.html">excellent post</a> on what “mathematical thinking” is, especially the story he tells after the long blockquote. <a href="#fnref:16" class="reversefootnote">↩</a></p>
</li>
<li id="fn:17">
<p>We also sometimes find that our conclusions depend on exactly which questions we ask. Imagine a study where we need a \(5\)% difference to be significant, and Drug A produces a \(3\)% improvement over placebo and Drug B produces a \(7\)% improvement. Then the effect of Drug A isn’t significant, and the effect of Drug B is, so we say that Drug A doesn’t work and Drug B does.</p>
<p>But the difference between Drug A and Drug B is <em>not</em> significant—so if we ask that question, we conclude that the two drugs are equally good! <a href="https://statmodeling.stat.columbia.edu/2016/05/25/the-difference-between-significant-and-not-significant-is-not-itself-statistically-significant-education-edition/">The difference between "significant" and "not significant" is not itself statistically significant</a>, so it matters exactly which hypothesis we choose to test. <a href="#fnref:17" class="reversefootnote">↩</a></p>
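<p>(A deliberately crude sketch of this dichotomization trap, in Python. The numbers are the hypothetical ones above, and I’m treating the \(5\)% cutoff as a fixed threshold on the point estimates, ignoring the standard errors a real significance test would use:)</p>

```python
cutoff = 5               # improvement needed for "significance", in points
drug_a, drug_b = 3, 7    # improvements over placebo, in points

print(drug_a >= cutoff)            # False: "Drug A doesn't work"
print(drug_b >= cutoff)            # True:  "Drug B works"
print(drug_b - drug_a >= cutoff)   # False: yet A vs. B is not significant
```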
</li>
<li id="fn:18">
<p>Sometimes we can look at our interval after the fact and make an informed guess whether it’s one of the good intervals or the bad intervals. If I run a small study to measure average adult heights, there’s some risk I get a \(95\)% confidence interval that contains, say, everything between five feet and six feet. Based on outside knowledge, I’m pretty much \(100\)% confident in that interval, not just \(95\)%. <a href="#fnref:18" class="reversefootnote">↩</a></p>
</li>
<li id="fn:19">
<p>We saw examples of this calculation in part 2, when we <a href="/blog/hypothesis-testing-part-2/#most-findings-false">calculated what fraction of positive results were true positives</a>. Note that we had to make assumptions about what fraction of null hypotheses are true; that’s the Bayesian prior probability. Tables like the ones we used there show up a lot in simple Bayesian calculations. <a href="#fnref:19" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Jay DaigleThis is the third part of a three-part series on hypothesis testing. Hypothesis testing is central to the way we do science, but it has major flaws that have encouraged widespread shoddy research. In this essay we consider methods that can help us draw better conclusions, and avoid the pitfalls of hypothesis testing. We start with some smaller and more conservative ideas, which basically involve doing hypothesis testing _better_. Then we'll look at more radical changes, taking the focus away from hypothesis tests and seeing the other ways we can organize and contribute to scientific knowledge.Hypothesis Testing and its Discontents, Part 2: The Conquest of Decision Theory2022-05-24T00:00:00-07:002022-05-24T00:00:00-07:00https://jaydaigle.net/blog/hypothesis-testing-part-2<p>This is the second part of a three-part series on hypothesis testing.</p>
<p>In <a href="/blog/hypothesis-testing-part-1/">part 1</a> of this series, we looked at the historical origins of hypothesis testing, and described two different approaches to the idea: Fisher’s significance testing, and Neyman-Pearson hypothesis testing. In this essay, we’ll see how modern researchers use hypothesis testing in practice. And in <a href="https://jaydaigle.net/blog/hypothesis-testing-part-3/">part 3</a> we’ll talk about alternatives to hypothesis testing that can help us avoid replication crisis-type problems.</p>
<p>The modern method is an awkward mix of Fisher’s goals and Neyman and Pearson’s methods that attempts to provide a one-size-fits-all solution for scientific statistics. The inconsistencies within this approach are a major contributor to the replication crisis, making bad science both more likely and more visible.</p>
<h2 id="modern-hypothesis-testing">Modern Hypothesis Testing</h2>
<p>The two approaches to hypothesis testing we saw in part 1 were each designed to answer specific questions.</p>
<p><strong>Fisher’s significance testing</strong> specifies a null hypothesis, and <strong>measures how much evidence our experiment provides</strong> against that null hypothesis. This is measured by the \(p\)-value, which tells us how likely our evidence would be if the null hypothesis is true. (It does <em>not</em> tell us how likely the null hypothesis is to be true!)</p>
<p><strong>Neyman-Pearson hypothesis testing helps us make a decision between two courses of action</strong>, like prescribing a drug or not. We weigh the costs of getting it wrong in either direction, and decide which direction we want to default to if the evidence is unclear. The null hypothesis is that we should take that default action (such as not prescribing the drug), and the alternative is that we should take the other action (prescribing the drug).</p>
<p>Based on our weighing of the costs of making a mistake in either direction, and the amount of information we have to work with, we set a “false positive” threshold \(\alpha\) and a “false negative” threshold \(\beta\). These numbers are tricky to understand and describe correctly, even for experienced researchers. I encourage you to go read part 1 if you haven’t already, but in brief:</p>
<ul>
<li>The number \(\alpha\) measures the chance that, <em>if</em> the drug doesn’t work and isn’t worth taking, we will screw up and prescribe it anyway.</li>
<li>The number \(\beta\) measures the chance that, <em>if</em> the drug works and is worth taking, we’ll make a mistake and withhold it.</li>
</ul>
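<p>Because \(\alpha\) is a statement about what happens across many repeated experiments, it’s easy to check by simulation. Here’s a sketch in Python (my own illustration, not from the original papers): when the null hypothesis really is true, a test calibrated at \(\alpha = 0.05\) wrongly rejects about \(5\)% of the time.</p>

```python
import random
from statistics import NormalDist, mean

random.seed(0)
z_crit = NormalDist().inv_cdf(0.975)  # two-sided 5% threshold, about 1.96

# Simulate experiments where the null is TRUE: no effect, pure noise.
n_obs, n_experiments = 30, 20_000
rejections = 0
for _ in range(n_experiments):
    sample = [random.gauss(0, 1) for _ in range(n_obs)]
    z = mean(sample) * n_obs ** 0.5   # z statistic (variance known to be 1)
    if abs(z) > z_crit:
        rejections += 1

# alpha is the rate of wrongly rejecting a true null: about 0.05 here.
print(rejections / n_experiments)
```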
<p><strong>The Neyman-Pearson method doesn’t try to tell us whether the drug “really works”</strong>; it <em>only</em> tells us how we should weigh the risks of making the two possible mistakes. <strong>Fisher’s method takes a very different approach and tries to measure the evidence</strong> to help us decide what to believe; but it does not give a clean yes-or-no answer.</p>
<p>Modern statistical hypothesis testing is a weird mishmash of these two approaches. We report \(p\)-values as evidence for or against the null hypothesis, as in Fisher-style significance testing. But we <em>also</em> try to give a yes-or-no, accept-or-reject verdict, as in the Neyman-Pearson approach. And while either approach can be useful on its own, the combination loses the key statistical benefits of each and leaves us in a bit of a muddle.</p>
<h3 id="the-modern-approach-in-practice">The modern approach in practice</h3>
<p>Modern researchers generally do something like this:</p>
<ul>
<li>First we choose a significance level \(\alpha\). We usually default to \(\alpha = .05\), but we sometimes make it lower if we want to be really confident in our conclusions. Particle physicists often use an \(\alpha\) of about \(.0000003\), or about \(1\) in \(3.5\) million.<strong title="This is the probability of getting data five standard deviations away from the mean. So you'll often see this reported as a significance threshold of 5σ. Related are the Six Sigma techniques for ensuring manufacturing quality, though somewhat counterintuitively they typically only aim for 4.5σ of accuracy."><sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup></strong></li>
</ul>
<ul>
<li>
<p>Next we specify a null hypothesis, which is usually something like “the thing we’re studying has no effect”. We generally choose a null hypothesis that we <em>don’t</em> believe, because our machinery will attempt to <em>disprove</em> our null.</p>
<p>If we want to prove that a new drug helps prevent cancer, our null hypothesis will be that the drug has no effect on cancer rates. If we want to show that hiring practices are racially discriminatory, our null hypothesis will be that race has no effect on whether people get hired.</p>
</li>
<li>
<p>Technically, we also have an alternative hypothesis: “this drug does help prevent cancer”, or “hiring practices are affected by race”. This alternative hypothesis is often what we actually believe, but we typically don’t make it very precise during the design of the experiment. Specifying the alternative hypothesis well is a really important part of research design, but it’s a bit tangential to this essay so we won’t talk about it much here.</p>
</li>
<li>
<p>We run the experiment, do a Fisher-style significance test, and report the \(p\)-value we get. If it’s less than \(\alpha\), we reject the null hypothesis, and generally consider the experiment to have successfully proven our alternative is true. If the \(p\)-value is greater than \(\alpha\), we don’t reject the null hypothesis,<strong title="It is common for people to be sloppy here and say they "accept" the null. In fact, I wrote that in my first draft of this paragraph. But it's bad practice to say that, because even a very high p-value doesn't provide good evidence that the null hypothesis is true. Our methods are designed to default to the null hypothesis when the data is ambiguous. Neyman _did_ use the phrase "accept the null", but in the context of a decision process, where "accepting the null" means taking some specific, concrete action implied by the null, rather than more generally committing to believe something."><sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup></strong> and often view the experiment as a failure.</p>
</li>
</ul>
<p>There are a few problems with this approach, but most of them stem from the same core issue: <strong>classical statistical tools are incredibly fragile.</strong> If you use them <em>exactly</em> as described, you are mathematically guaranteed to get some specific benefit. (In a correct Neyman-Pearson setup, for instance, you are guaranteed a false positive rate of size \(\alpha\). ) But you get <em>exactly</em> that guarantee, and possibly nothing more. My friend Nostalgebraist <a href="https://nostalgebraist.tumblr.com/post/161645122124/bayes-a-kinda-sorta-masterpost">analogizes</a> on Tumblr:</p>
<blockquote>
<p>The classical toolbox also has a lot of oddities….The labels on the tools say things like “won’t melt below 300° F,” and you <em>are in fact</em> guaranteed <em>that</em>, but the same screwdriver might turn out to instantly vaporize when placed in water, or when held in the left hand. Whatever is not guaranteed on the label is possible, however dangerous or just plain dumb it may be.</p>
</blockquote>
<p>This fragility means that if you carelessly combine two tools, you often lose the guarantees of each of them, and wind up with a screwdriver that melts at room temperature and <em>also</em> vaporizes when held in your left hand. And you may not get anything at all in return—other than, I suppose, the inherent benefits of being careless and lazy.</p>
<p class="center"><a href="https://www.egscomics.com/comic/2015-05-01"><img src="/assets/blog/hypothesis-testing/lazy-egscomics.png" alt="Panel from El Goonish Shive comic: "Shoot, I'm going to be lazy all the time forever now. It gets _results_." /></a></p>
<p class="center"><em>Sure, being lazy gets results. But they might not replicate.</em></p>
<h3 id="the-wrong-tool-for-the-job">The wrong tool for the job</h3>
<p>The Neyman-Pearson method is designed to give an unambiguous yes-or-no answer to a question, so we can act on the information we currently have. This is exactly what we need when it’s time to make a specific decision about whether or not to open a new factory or change to a different brand of fertilizer. And the method was so successful that in 1955, John Tukey <a href="https://www.tandfonline.com/doi/abs/10.1080/00401706.1960.10489909">expressed concern about</a> the “tendency of decision theory to attempt to conquest all of statistics”.</p>
<p>He worried because <strong>in scientific research we don’t want to make decisions, but reach conclusions</strong>. On the one hand, we don’t need to make a definitive decision <em>right now</em>. If it’s not clear which theory describes the evidence better, we can just say that, and wait for more evidence to come in. On the other hand, we want to eventually reach firm conclusions that we can trust, and use as a foundation for further work. That requires a higher degree of confidence than “the best we can say right now”, which is what Neyman-Pearson gives us. Fisher’s methods, in contrast, were designed to accumulate certainty through repeated consistent experimental results, the sort of thing a true conclusion theory would need.</p>
<p>But because Neyman-Pearson worked so well for a very specific type of problem (and probably also because Fisher was <a href="https://www.newstatesman.com/long-reads/2020/07/ra-fisher-and-science-hatred">kind of terrible</a>), many fields adopted it as a default and use it for pretty much everything. <a href="http://daniellakens.blogspot.com/2022/05/tukey-on-decisions-and-conclusions.html">Daniel Lakens says</a> that in hindsight, Tukey didn’t need to worry, since statistics textbooks for the social sciences don’t even discuss decision theory; but in fact <strong>we’ve largely adopted a tool of decision theory, and repurposed it to reach conclusions instead</strong>.</p>
<p>A decision theory needs to produce a clear, discrete answer to our questions, even if there’s not much evidence available. And unfortunately, our scientific papers regularly try to transmute weak evidence into strong conclusions. We tend to over-interpret <a href="https://slatestarcodex.com/2014/12/12/beware-the-man-of-one-study/">individual studies</a>, especially when one study is all we have. How often have you seen in the news that “a new study proves that” something is true? It’s almost never wise to conclude that a question is resolved because of one study. But the Neyman-Pearson framework is designed to do exactly that, and so inclines us to be overconfident.</p>
<p>Even if you have multiple studies, the same problem shows up in a different form. When there’s a complicated and messy body of research on a topic, we should probably hold complicated and messy beliefs, rather than forming a definitive conclusion. Instead, we often argue about which study is “right” and which is “wrong”, because that’s the lens we use to evaluate research.</p>
<p class="center"><img src="/assets/blog/hypothesis-testing/onion-eggs-good-this-week.png" alt="Screenshot of a short onion article, titled "Eggs Good For You This Week"" class="blog-image" /></p>
<p class="center blog-image"><em>My <a href="https://www.theonion.com/eggs-good-for-you-this-week-1819565159">favorite article from The Onion</a> demonstrates the wrong way to interpret conflicting studies.</em></p>
<p>Of course, sometimes one study <em>is</em> pretty much just wrong! If you have two studies and one shows that a child care program cuts poverty by 50% and the other shows that it increases poverty, at least one of them has to be pretty badly off the mark somehow. But even then, the hypothesis testing framework can mislead us, because of the way it handles the burden of proof.</p>
<h3 id="defaults-matter">Defaults Matter</h3>
<p>Hypothesis testing methods build in a bias toward sticking with the null hypothesis. This is intentional; we’re looking for strong evidence that the null is false, not just something that might check out if we squint really hard. <strong>We want to put the burden of proof on showing that something new is actually happening.</strong></p>
<p><strong>But once a study rejects the null, it’s very easy to be <em>decisive</em> and treat its result as “proven”, and shift the burden of proof onto work that challenges the original study.</strong> So when a paper runs a hypothesis test and concludes that <a href="https://statmodeling.stat.columbia.edu/2014/06/06/hurricanes-vs-himmicanes/">female-named hurricanes are more dangerous than male-named ones</a>, this belief is “proven” and becomes the new default. And since that one study established a new baseline, anyone who disagrees now faces the burden of proof, and faces an uphill battle to convince people.</p>
<p>It’s pretty common for a small early study to find a big effect, and then be followed up by a few larger and better studies that <a href="https://statmodeling.stat.columbia.edu/2016/04/02/himmicanes-and-hurricanes-update/">don’t find the same effect</a>. But all too often people more or less conclude the big effect is real, because that first study found it, and the followups weren’t convincing <em>enough</em> to overcome the presumption that the effect is real.<strong title="Andrew Gelman suggests a helpful time-reversal heuristic: what would you think if you saw the same studies in the opposite order? You'd start with a few large studies establishing no effect, followed by one smaller study showing an effect. In theory that gives you the exact same information, but in practice people would treat it very differently—assuming the first studies actually got published."><sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup></strong></p>
<p>And the Neyman-Pearson framework reinforces this twice. First, because it is intentionally <em>decisive</em>, it encourages us to commit to the result of a single study. Second, rejecting the null hypothesis is seen as strong evidence against the null, but failing to reject is only weak evidence that the null is true. This is why we “fail to reject” rather than simply “accept” the null hypothesis: maybe the null is true, or maybe the experiment just wasn’t sensitive enough to reject it.</p>
<p>So if one study rejects the null and another fails to reject, it’s very easy to assume that the first study was just better. After all, it managed to reject the null, didn’t it? But a reasonable conclusion theory would incorporate both studies, rather than rejecting the one that “failed”.</p>
<h2 id="publication-in-practice">Publication in practice</h2>
<p>So far I’ve discussed theoretical problems with the hypothesis testing framework: reasons it might be the wrong tool for the problems we’re applying it to. But a possibly worse problem is that it’s very easy to <em>misuse</em> hypothesis testing, so that it doesn’t even do its own job correctly. And the structural dynamics of how research gets conducted, published, and distributed tends to encourage this misuse, and amplify the conclusions of sloppy studies.</p>
<h3 id="who-wants-to-be-boring">Who wants to be boring?</h3>
<p><strong>Most academics really care about doing good research and contributing to our knowledge about the world</strong>; otherwise they wouldn’t be academics. The academic career path is long and grueling, and doesn’t pay very well compared to other things that nascent academics could be doing; there’s a reason people say that you shouldn’t get a Ph.D. if you can imagine being happy without one.</p>
<p>But that doesn’t mean research is conducted by cloistered ascetics with no motivations other than a monastic devotion to the truth. <strong>People who do research want to <em>discover interesting things</em>, not spend thirty years on experiments that don’t uncover anything new.</strong> Moreover, they want to discover things that <em>other people</em> think are interesting—people who can give them grants, or jobs, or maybe even book deals and TED talks.</p>
<p>Even without any dishonesty, this shapes the questions people ask, and also the way they answer them. First, people want to reject the null hypothesis, because we see that as strong evidence, but see failing to reject the null as weak evidence. An experiment that fails to reject the null is rarely actually published; all too often, it’s seen as an experiment that simply failed.</p>
<p>Second, people want to prove <em>new</em> and <em>surprising</em> things. It would be extremely easy for me to run a study rejecting the null hypothesis that 15-year-olds are on average about as tall as 5-year-olds. But no one would care about this study—including me—because we already know that.</p>
<p>Now, sometimes it’s worth clearly establishing that obvious things are in fact true. And we do have data on the average height of children at various ages, and it wouldn’t be hard to use that to show that 15-year-olds are taller than 5-year-olds. Collecting that sort of routine data on important topics is <a href="https://twitter.com/ProfJayDaigle/status/1521911837897502723">very useful and important work</a> that we should probably reward more than we do.</p>
<p>But we <em>don’t</em> reward routine data collection heavily, and most of the time researchers are trying to prove surprising new results. And that’s exactly the problem: <strong>new results are “surprising” when you wouldn’t have expected them—which is exactly when they’re unlikely to be true.</strong></p>
<h3 id="most-findings-false">“Why most published research findings are false”</h3>
<p>This quest for surprising results interacts with the statistics of the Neyman-Pearson method in an extremely counterintuitive way. The statistical guarantee is: if we test a true null hypothesis, we’ll get a false rejection about five percent of the time. <strong>But that doesn’t mean a rejection has a five percent chance of being false. And the more studies of true null hypotheses we run, the bigger this difference gets.</strong></p>
<p>We can most easily understand how this works with a couple examples. As a baseline, let’s look at the case where half our null hypotheses are true. Imagine we run two hundred studies, \(100\) with a true null hypothesis and \(100\) with a false null hypothesis. Our false positive rate is \(\alpha = 0.05\), so we’ll reject the null in five of the \(100\) studies where the null is true. And we generally hope for a false negative rate of \(\beta = 0.20\), in which case we reject the null in \(80\) of the \(100\) studies where the null is false.</p>
<table>
<thead>
<tr>
<th> </th>
<th>Null is false</th>
<th>Null is true</th>
<th>Total</th>
</tr>
</thead>
<tbody>
<tr>
<td>Reject the null</td>
<td>80</td>
<td>5</td>
<td>85</td>
</tr>
<tr>
<td>Don’t Reject</td>
<td>20</td>
<td>95</td>
<td>115</td>
</tr>
<tr>
<td>Total</td>
<td>100</td>
<td>100</td>
<td>200</td>
</tr>
</tbody>
</table>
<p>So we have \(85\) positive results, of which \(80\) are true positives and \(5\) are false positives, and so \(5/85 \approx 6\)% of our positive results are false positives.<strong title="You might recognize this as an application of Bayes's theorem, and a basic example of Bayesian inference. Tables like these are very common in Bayesian calculations."><sup id="fnref:4"><a href="#fn:4" class="footnote">4</a></sup></strong> And that’s not too bad—though the fact that it’s <em>higher</em> than the false positive rate of \(5\)% should be a warning sign.</p>
<p>But now imagine our researchers get more ambitious, and start testing more interesting and potentially-surprising findings. This means we should expect more of our null hypotheses to actually be true. If only ten percent of the original \(200\) null hypotheses are false, then we’ll have 180 studies with a true null and only 20 with a false null. We’ll still reject \(80\)% of false null hypotheses, and \(5\)% of true null hypotheses, so our results look like this:</p>
<table>
<thead>
<tr>
<th> </th>
<th>Null is false</th>
<th>Null is true</th>
<th>Total</th>
</tr>
</thead>
<tbody>
<tr>
<td>Reject the null</td>
<td>16</td>
<td>9</td>
<td>25</td>
</tr>
<tr>
<td>Don’t Reject</td>
<td>4</td>
<td>171</td>
<td>175</td>
</tr>
<tr>
<td>Total</td>
<td>20</td>
<td>180</td>
<td>200</td>
</tr>
</tbody>
</table>
<p>Now we only have \(16\) true positives (out of \(20\) cases where we should reject), and we get \(9\) false positives (out of \(180\) cases where we shouldn’t reject the null). So a full \(9/25\) or \(36\)% of our positive results are false positives—much higher than \(5\)%! And often, only the studies that reject the null, and land in the first row of the table, get published at all. So we might find that a third of published papers will have false conclusions.</p>
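<p>The bookkeeping behind both tables is simple enough to wrap in a few lines of Python (the function and its name are my own sketch; the numbers come from the text):</p>

```python
def false_positive_share(n_studies, frac_nulls_false, alpha=0.05, power=0.80):
    """Fraction of null-rejections that are false positives."""
    n_false_null = n_studies * frac_nulls_false   # real effects to find
    n_true_null = n_studies - n_false_null        # nothing really there
    true_positives = power * n_false_null
    false_positives = alpha * n_true_null
    return false_positives / (true_positives + false_positives)

# Half of nulls false: 5/85, about 6% of rejections are false positives.
print(false_positive_share(200, 0.50))
# Only 10% of nulls false: 9/25, so 36% of rejections are false positives.
print(false_positive_share(200, 0.10))
```

<p>Note that the total number of studies cancels out: only the base rate of true effects, \(\alpha\), and the power matter.</p>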
<p><strong>If researchers are regularly testing theories that are unlikely to be true, then most of the positive (and thus published) results can be false, even if the rate of false positives is quite low.</strong> This is the key observation of the famous paper by John Ioannidis that kicked off the replication crisis, <a href="https://en.wikipedia.org/wiki/Why_Most_Published_Research_Findings_Are_False">Why Most Published Research Findings Are False</a>.<strong title="Followups to Ioannidis's paper contend that only about 14% of published biomedical findings are actually false. I'm not in a position to comment on this one way or the other. In psychology, different studies estimate that somewhere between 36% and 62% of published results replicate."><sup id="fnref:5"><a href="#fn:5" class="footnote">5</a></sup></strong></p>
<p>This is sometimes known as the <a href="https://en.wikipedia.org/wiki/Publication_bias">file-drawer effect</a>: we see the studies that get published, but not the “failed” ones that are left in the researchers’ filing cabinets. So even though only thirteen of the \(200\) studies give the wrong answer, \(9\) of the \(25\) that actually get published are wrong.</p>
<p>And no, \(9/25\) isn’t quite a majority, so while this is bad, it doesn’t seem to justify Ioannidis’s claim that “most” published findings are false.</p>
<p>But we haven’t talked about everything that can go wrong yet!</p>
<h3 id="the-problem-of-power">The problem of power</h3>
<p>I said that “we generally hope for a false negative rate of \(\beta = 0.2\)”. But where does that hope come from?</p>
<p>The original Neyman-Pearson framework has two parameters, the false positive rate \(\alpha\) and the false negative rate \(\beta\). You can always make \(\alpha\) lower by accepting a higher \(\beta\), and researchers are supposed to balance these parameters against each other, based on the relative costs of making Type I and Type II errors. But in practice we just <a href="https://doi.org/10.1353/sof.2005.0108">set \(\alpha\) to \(.05\) and move on with our lives</a>; we don’t think about the relative balance of costs, or what it’s really saying about our research.</p>
<p>If our data is good enough, then we can make \(\alpha\) and \(\beta\) both small, and draw conclusions with a fair degree of confidence. But if our data is bad, then the study will be too weak to detect a lot of true effects, and so to keep \(\alpha\) small, we need to make \(\beta\) large. Consequently we say that the <em>power</em> of a study is \(1 - \beta\), which is the <em>true</em> positive rate. A study with high power will usually give the correct answer; a study with low power can’t be trusted.</p>
<p><img src="/assets/blog/hypothesis-testing/abusing-your-power.jpg" alt="Picture of a cat, with text: "Don't even think about abusing your power"" class="blog-image center" /></p>
<p>Much like we typically set \(\alpha = 0.05\), we typically try to get \(\beta \leq 0.2 \), and thus conduct studies with a power of at least \(80\)%. And like with the false positive rate, this number is also not really motivated by anything in particular: the choice is generally attributed to Jacob Cohen, who <a href="http://daniellakens.blogspot.com/2019/05/justifying-your-alpha-by-minimizing-or.html">wrote</a> that</p>
<blockquote>
<p>The \(\beta\) of \(.20\) is chosen with the idea that… Type I errors are of the order of four times as serious as Type II errors. This \(.80\) desired power convention is offered with the hope that it will be ignored whenever an investigator can find a basis in his substantive concerns in his specific research investigation to choose a value <em>ad hock</em>.</p>
</blockquote>
<p>That is, there’s no really good argument for not picking \(\beta = 0.1 \) or \(\beta = 0.3\) instead, but it seems like it’s about the right size if you don’t have any better ideas.</p>
<p>There are two problems here. The minor one is that both of these numbers are pretty arbitrary. If we have enough data that we can get \(\alpha = 0.05,\beta = 0.2\), then we could also choose to reject the null more readily and get something like \(\alpha = 0.1, \beta = 0.11\), with a high false positive rate but a power of \(89\)%; or we could reject the null less often and get \(\alpha = 0.02, \beta = 0.33\), with a low false positive rate but power of only \(67\)%.</p>
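<p>Those trade-off numbers can be reproduced with a one-sided normal (“z test”) approximation, holding the quality of the data fixed. This is my own sketch in Python, calibrated so that \(\alpha = 0.05\) gives \(\beta = 0.20\):</p>

```python
from statistics import NormalDist

norm = NormalDist()

# Hold the data quality fixed: delta measures how detectable the effect
# is, chosen so that alpha = 0.05 yields beta = 0.20 (power 80%).
delta = norm.inv_cdf(1 - 0.05) + norm.inv_cdf(1 - 0.20)

def beta(alpha):
    """False negative rate at a given alpha, data quality held fixed."""
    return norm.cdf(norm.inv_cdf(1 - alpha) - delta)

for a in (0.02, 0.05, 0.10):
    print(f"alpha = {a:.2f}, beta = {beta(a):.2f}, power = {1 - beta(a):.0%}")
```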
<p>Which of those trade-offs we want depends on the specifics of our current question: if Type I and Type II errors are about equally bad, we might want \(\alpha\) and \(\beta\) to be about the same size, but if a Type I error is much, much worse, we should want \(\alpha\) to be much smaller than \(\beta\). We can’t make an informed choice of \(\alpha\) and \(\beta\) without knowing details about the specific decision we’re trying to make.</p>
<p>But when we’re trying to do <em>science</em> it’s not clear what to choose. We can’t really quantify the costs of publishing a paper with a false conclusion; the entire setup of computing practical trade-offs doesn’t make all that much sense when we’re trying to discern the truth rather than make a decision. <strong>This is one major way that the Neyman-Pearson framework isn’t the right tool for science: the entire method is premised on a calculation we can’t do.</strong></p>
<p>But we <em>can</em> just set \(\alpha = 0.05, \beta = 0.20\), and see what happens. And as long as these numbers are a vaguely reasonable size, we’ll probably get vaguely reasonable results. We hope.</p>
<h3 id="where-does-power-come-from">Where does power come from?</h3>
<p>There’s a second problem, though, which is widespread and frequently disastrous. Sometimes \(\beta\) gets so large that a study becomes useless—and we don’t even notice.</p>
<p>For a given \(\alpha\), your \(\beta\) depends on the quality of the data you have. With very good data, you can be very confident about your conclusion in both directions. We have a tremendous amount of data about the relationship between age and height in children, so we can design studies that will have low rates of false positives and false negatives. And physics experiments ask for a false positive rate less than one in a million—and they can actually <em>achieve</em> this because their data is both copious and precise.</p>
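<p>As a quick check of that claim, the two-sided tail probability of a five-standard-deviation result under a normal distribution is:</p>

```python
from scipy.stats import norm

# Probability that a standard normal measurement lands at least
# 5 standard deviations from the mean, in either direction.
p = 2 * norm.sf(5)
print(p)  # roughly 5.7e-07, about one in 1.7 million
```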
<p><strong>But with bad or noisy data, no amount of statistical cleverness can give any degree of confidence in our conclusions.</strong> If you want to study the effect on life expectancy of winning or losing an election to be a US state governor, <a href="https://statmodeling.stat.columbia.edu/2020/07/02/no-i-dont-believe-that-claim-based-on-regression-discontinuity-analysis-that/">you wind up with this scatterplot</a>:</p>
<p class="center"><img src="/assets/blog/hypothesis-testing/governor-life-expectancy.png" alt="Scatterplot with "Percentage vote margin" on the x-axis, from -10 to 10, and "Years alive after election" on the y-axis, from 0 to 60. There is no noticeable pattern." class="blog-image center" />
<em>If your data is this scattered, you will never be able to detect small effects.</em></p>
<p>There aren’t <em>that</em> many governor races, and lifespan after any given race varies from just a couple years to more than fifty, so the data is extremely noisy. If winning an election boosted your lifespan by ten years, we would probably be able to tell. But an effect that large is absurd, and there’s no way to use data like this to pick up changes of just a year or two.</p>
<p>When we said we “ask for” a \(\beta\) below \(0.2\), we really meant “we should collect enough data to get a power of \(80\)%”. That’s not really an option for the governors study, without waiting around for more elections and more dead governors; on that question we’re kind of stuck with the data we have. Despite the Neyman-Pearson inclination to make a firm decision, all we can reasonably do is embrace uncertainty.</p>
<p>If we’re running a laboratory experiment, on the other hand, we can decide how big an effect we’re looking for, and calculate how many people we’d need to study to get a power of \(80\)%. But it’s hard to calculate this correctly, because it depends on how big the effect we’re studying is, and we <em>don’t know how big it is</em> because we <em>haven’t done the study yet</em>. So the calculation is based on a certain amount of guesswork.<strong title="We can also base it on [how big of an effect we _care_ about]. If we're studying reaction times, we might decide that an effect smaller than ten milliseconds is irrelevant, and we don't care about it even if it's real. Then we can choose a study with enough power to detect a 10ms effect at least 80% of the time. But this brings us back to the core issue, that "is there an effect" just isn't a great question, and the Neyman-Pearson method isn't a great tool for answering it. "><sup id="fnref:6"><a href="#fn:6" class="footnote">6</a></sup></strong></p>
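<p>If we’re willing to assume a normal approximation for a two-sided, two-sample comparison, the standard back-of-the-envelope calculation looks like this (the function name and defaults are mine, not from any particular package). The answer is dominated by the effect size, which is exactly the thing we have to guess:</p>

```python
from math import ceil

from scipy.stats import norm

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate subjects per group for a two-sided two-sample test.

    effect_size is Cohen's d: the assumed true difference in means,
    divided by the standard deviation. We must guess it in advance.
    """
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(power)
    return ceil(2 * (z_a + z_b) ** 2 / effect_size ** 2)

# Sample size grows like 1 / d**2, so small effects get expensive fast:
for d in (0.8, 0.5, 0.2):
    print(f"d = {d}: about {n_per_group(d)} subjects per group")
```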
<p>Even if we do this calculation correctly, there’s a real chance that we have to run a really big experiment to get the power we want. (If we’re looking for a small effect, we may have to run a really, <em>really</em> big experiment.) And big experiments are expensive! A lot of researchers skip this step entirely, and just run whatever experiment they can afford, <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4961230/">regardless of how little power it has</a>.</p>
<p>And if the power is low enough, things get very dumb very quickly.</p>
<h3 id="we-need-more-power">We need more power!</h3>
<p>Let’s start by looking at what happens when the power is really, idiotically low. This graph shows what happens when you run an experiment with a power of \(0.06\), which means a false negative rate of \(94\)%. And there are three different problems that pop up.</p>
<p><img src="https://statmodeling.stat.columbia.edu/wp-content/uploads/2014/11/Screen-Shot-2014-11-17-at-11.19.42-AM.png" alt="A diagram of the effects of low-power studies.
This is what "power = 0.06" looks like. Get used to it.
Type S error probability: If the estimate is statistically significant, it has a 24% chance of having the wrong sign.
Exaggeration ratio: If the estimate is statistically significant, it must be at least 9 times higher than the effect size." class="blog-image center" /></p>
<p class="center"><em>Figure by <a href="https://statmodeling.stat.columbia.edu/2014/11/17/power-06-looks-like-get-used/">Andrew Gelman</a>.</em></p>
<p>The obvious problem is that even if the null hypothesis is wrong, we probably won’t reject it, because the data isn’t good enough to <em>show</em> that it’s wrong. Even if the null is false, we’ll fail to reject it \(94\)% of the time! (This is represented by the large white area in the middle of the graph.) But this, at least, is the process working as intended: our goal was to err on the side of not rejecting the null hypothesis, and that is in fact what we’re doing.</p>
<p>But there are two subtler problems, which cause more trouble than just a pile of inconclusive studies. We still manage to reject the null \(6\)% of the time, but because the study is so weak, this only happens when we get unusually lucky. And that happens when our data is much, <em>much</em> further away from the null hypothesis than it usually is. <strong>At a power of \(\mathbf{0.06}\), we only get a significant result when our measurement is <em>nine times</em> as big as the true effect we want to measure.</strong> (This is the red region on the right of Gelman’s graph; he calls it a “Type M error”, for “magnitude”.)</p>
<p>This is a major culprit behind a lot of improbable ideas that come out of shoddy research. In my <a href="/blog/replication-crisis-math/">post on the replication crisis</a> I talked about how a lot of careless research starts out asking whether an effect exists, but finds an effect that’s <em>surprisingly large</em>, and then the story people tell is focused on the dramatic, unexpectedly large effect. But that drama is a necessary result of running underpowered studies.</p>
<p>The study of gubernatorial elections and life expectancy is a perfect example of this process. Just by looking at the graph, you can tell there probably isn’t a big effect. But researchers Barfort, Klemmensen and Larsen found a clever analysis<strong title="Clever analyses like this are often a bad idea; we'll come back to this idea [soon]."><sup id="fnref:7"><a href="#fn:7" class="footnote">7</a></sup></strong> that did <a href="https://www.cambridge.org/core/journals/political-science-research-and-methods/article/abs/longevity-returns-to-political-office/6205207F55C97729E66A8B08D7641572">produce a statistically significant result</a>—and claimed that the difference between narrowly winning and narrowly losing an election was <em>ten years</em> of lifespan. That’s far too large an effect to be believable, but any statistically significant result they got from that data set would have to be equally incredible.</p>
<p>Researchers are motivated to discover new and surprising things; and we, as news consumers, are most interested in new and surprising results. The wild overestimates that these low-power studies produce are surprising and counterintuitive, precisely because they are <em>false</em>. And because they are surprising and counterintuitive, they tend to draw public attention and show up in the news.</p>
<p>But a surprisingly large result isn’t as counterintuitive as one that’s the opposite of what you expect. (Imagine if a study “proved” that 5-year-olds are taller than 15-year-olds!) And low-power studies give us those results too.</p>
<p>Even if we’re studying something that really does (slightly) increase lifespan, we could get unusually <em>unlucky</em>, and randomly observe a bunch of people who die unusually early. If the data is noisy enough and we get unlucky enough, we can get statistically significant evidence that the effect decreases lifespan, when it really increases it.</p>
<p>We see this in the left tail of Gelman’s graph. <strong>When power is \(\mathbf{0.06}\), almost a quarter of statistically significant results will give you a large effect <em>in the wrong direction</em>.</strong> There’s a substantial chance that we get our result exactly backwards.</p>
<p>Now, a power of \(0.06\) is an extreme case, bad even by the usual standards of underpowered research. But the same problems come up with better-but-still-underpowered studies, just to a lesser degree. In fact, both effects are always <em>possible</em>, if your data is unlucky enough. But we’d much prefer having a \(0.1\)% chance of getting the direction of the effect wrong to having a \(24\)% chance. And the lower the power, the bigger an issue this is.</p>
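<p>A short simulation reproduces the numbers in Gelman’s figure. The specific values here (a true effect of \(2\), measured with standard error \(8.1\)) are illustrative choices that give a power of about \(0.06\):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# One million simulated studies of a true effect of 2, each measured
# with standard error 8.1; reject the null when |estimate| > 1.96 SE.
true_effect, se = 2.0, 8.1
estimates = rng.normal(true_effect, se, size=1_000_000)
significant = estimates[np.abs(estimates) > 1.96 * se]

power = len(significant) / len(estimates)            # about 0.06
type_s = np.mean(significant < 0)                    # about 0.24
type_m = np.mean(np.abs(significant)) / true_effect  # roughly 9-10x

print(f"power: {power:.2f}")
print(f"wrong-sign rate among significant results: {type_s:.2f}")
print(f"average exaggeration of significant results: {type_m:.1f}x")
```

<p>The exaggeration is mechanical: a result only counts as significant when the estimate is past \(1.96\) standard errors, which here means past \(15.9\), even though the true effect is only \(2\).</p>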
<h3 id="file-drawer">The revenge of the file drawer</h3>
<p>There should be a saving grace here: if your study has low power, it’s unlikely to reject the null at all. We don’t have a \(24\)% chance of getting a statistically significant result in the wrong direction; because our power is only \(0.06\), we have a <em>six percent chance of having a \(24\)% chance</em> of getting a statistically significant result in the wrong direction. That’s less than two percent, in total.</p>
<p>But <strong>studies that don’t reject the null often don’t get published at all</strong>. There’s a good chance that the 94 studies that fail to reject the null get stuck in a file drawer somewhere; we’re left with a few studies that reject it, but wildly overestimate the effect, and one or two that reject the null in the wrong direction. When that’s all the information we have, it’s hard to figure out what’s really going on.</p>
<p>Let’s make another table of possible research findings, like the ones <a href="#most-findings-false">we used earlier</a> to see how the file-drawer effect works. But this time, instead of assuming a reasonable power of \(80\)%, let’s see what happens when the power is only \(20\)%. If half the hypotheses are true and half are false, we get something like this:</p>
<table>
<thead>
<tr>
<th> </th>
<th>Null is false</th>
<th>Null is true</th>
<th>Total</th>
</tr>
</thead>
<tbody>
<tr>
<td>Reject the null</td>
<td>20</td>
<td>5</td>
<td>25</td>
</tr>
<tr>
<td>Don’t Reject</td>
<td>80</td>
<td>95</td>
<td>175</td>
</tr>
<tr>
<td>Total</td>
<td>100</td>
<td>100</td>
<td>200</td>
</tr>
</tbody>
</table>
<p>With \(80\)% power, our false-positive rate was \(6\)%. But with \(20\)% power, we have \(20\) true positives and \(5\) false positives, and our false-positive rate has risen to \(5/25 = 20\)%.</p>
<p>And if we also suppose that our researchers are testing unlikely theories and so \(90\)% of null hypotheses are true, we get the following truly terrible table:</p>
<table>
<thead>
<tr>
<th> </th>
<th>Null is false</th>
<th>Null is true</th>
<th>Total</th>
</tr>
</thead>
<tbody>
<tr>
<td>Reject the null</td>
<td>4</td>
<td>9</td>
<td>13</td>
</tr>
<tr>
<td>Don’t Reject</td>
<td>16</td>
<td>171</td>
<td>187</td>
</tr>
<tr>
<td>Total</td>
<td>20</td>
<td>180</td>
<td>200</td>
</tr>
</tbody>
</table>
<p>Under these conditions we get \(9\) false positives and only \(4\) true positives, so almost \(70\)% of our positive results are false positives. If the only results we publish are these exciting positive results, then most published findings will, indeed, be false.</p>
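<p>All of these tables come from the same little calculation. As a sketch (the function and its name are mine), here it is in a form where you can plug in any prior and power:</p>

```python
def false_positive_share(prior_true, power, alpha=0.05):
    """Fraction of null-rejecting results that are false positives,
    when a fraction prior_true of tested hypotheses are really true."""
    true_positives = prior_true * power
    false_positives = (1 - prior_true) * alpha
    return false_positives / (true_positives + false_positives)

print(false_positive_share(0.5, 0.80))  # about 6%: the earlier table
print(false_positive_share(0.5, 0.20))  # 20%: the first table above
print(false_positive_share(0.1, 0.20))  # about 69%: the terrible table
```

<p>Publishing null results doesn’t change this share of false positives <em>among rejections</em>; it just makes the rejections a smaller fraction of everything we see.</p>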
<h3 id="the-problem-of-p-hacking-and-the-garden-of-forking-paths">The problem of \(p\)-hacking and the garden of forking paths</h3>
<p>It seems like we could fix this problem just by publishing null results as well. New norms like <a href="https://en.wikipedia.org/wiki/Preregistration_(science)">preregistration of studies</a> and institutions like <a href="https://www.jasnh.com">The Journal of Articles in Support of the Null Hypothesis</a> try to combat the file drawer bias by publishing studies that don’t reject the null, or at least letting us know they happened so we can count them. If we publish just a quarter of null results, then even under the bad assumptions of the last table we get something like this:</p>
<table>
<thead>
<tr>
<th> </th>
<th>Null is false</th>
<th>Null is true</th>
<th>Total</th>
</tr>
</thead>
<tbody>
<tr>
<td>Reject the null</td>
<td>4</td>
<td>9</td>
<td>13</td>
</tr>
<tr>
<td>Don’t Reject, but Publish</td>
<td>4</td>
<td>43</td>
<td>47</td>
</tr>
<tr>
<td>Don’t Reject or Publish</td>
<td>12</td>
<td>128</td>
<td>140</td>
</tr>
<tr>
<td>Total</td>
<td>20</td>
<td>180</td>
<td>200</td>
</tr>
</tbody>
</table>
<p>We see \(60\) published results. The \(4\) results where the null is false and we reject it are correct, as are the \(43\) where the null is true and we don’t reject it, so over \(70\)% of the published results will be true. If we publish more null results, this number only gets better.</p>
<p>But that doesn’t address the fundamental problem, which is that <em>researchers want to discover new, interesting things</em>. <strong>The fact that we mostly publish positive results that reject the null isn’t some accident of history; it’s a result of people trying to show that their ideas are correct.</strong></p>
<p>Since people want to reject the null hypothesis, they’ll work hard to find ways to do this. When done deliberately, this behavior is a form of research misconduct known as <a href="https://twitter.com/ephemeralidea/status/1504459823554908163">\(p\)-hacking</a> or <a href="https://en.wikipedia.org/wiki/Data_dredging">data dredging</a>. There are a variety of sketchy ways to tweak your statistical analysis to get an artificially low \(p\)-value. The most famous version is just running a bunch of experiments and <a href="https://imgs.xkcd.com/comics/significant.png">only reporting the ones with low \(p\)-values</a>.</p>
<p>Somewhat less famous, and less obvious, is the possibility of running one experiment, and then trying to <em>analyze</em> that data in a bunch of different ways and picking the one that makes your position look the best. We actually saw an example of this in <a href="hypothesis-testing-part-1#mileage">part 1</a> of this series, when I looked at my car’s gas mileage. I computed the \(p\)-value in two different ways, and got either \(0.0006\) or \(0.00004\). Either one of these is significant, but if they had been \(0.06\) and \(0.004\) instead, I could have just reported the second one and said “hey look, my data was significant!”</p>
<p>Moreover, it’s pretty common for people to look for secondary, “interaction” effects after looking for a main effect. Sure, watching a five-minute video didn’t have a statistically significant effect on depression in your study group. But maybe it worked on just the women? Or just the Asians? What if we control for income? You can check all the subgroups of your study, and whichever one reaches significance is <em>obviously</em> the interesting one.</p>
<p><a href="https://xkcd.com/1478/"><img src="https://imgs.xkcd.com/comics/p_values.png" alt="XKCD comic, translating p-values into verbal interpretations: "highly significant", "significant", "on the edge of significance". For a high p-value the interpretation is "hey, look at this interesting subgroup analysis"." class="blog-image center" /></a>
<em class="blog-image center">Sometimes your treatment really does have an effect on one specific subgroup. But it’s also an easy out when your main study didn’t reach significance.</em></p>
<p>This approach of doing multiple subgroup analyses, but only reporting one is still research misconduct, if done on purpose. But <strong>it’s possible to get the same effect without actually performing multiple analyses, in a process that Andrew Gelman and Eric Loken call the <a href="https://www.americanscientist.org/article/the-statistical-crisis-in-science">garden of forking paths</a>.</strong></p>
<p>Researchers often make decisions about how to test the data after looking at it for broad trends. If they notice one subgroup obviously sticking out, maybe they want to test it. Or they can tweak some minor parameters, decide to include or exclude outliers, and consider a few minor variations in the way they divide subjects into categories. This is all a reasonable way of looking at data, but it’s a violation of the rules of hypothesis testing, and has the same basic effect as running a bunch of experiments and only reporting the best one.</p>
<p>Most subtly, sometimes more than one pattern will provide support for the researcher’s hypothesis. We generally don’t actually care about specific statistical relationships; we care about broader questions, like “does media consumption affect rates of depression?”<strong title="This difference is the source of a lot of research pitfalls; if you want to dig into this more, I recommend [Tal Yarkoni] on generalizability, [Vazire, Schiavone, and Bottesini] on the four types of validity, and [Scheel, Tiokhin, Isager, and Lakens] on the derivation chain."><sup id="fnref:8"><a href="#fn:8" class="footnote">8</a></sup></strong> <strong>We run specific experiments in order to test these broad questions. And if there are, say, twenty different outcomes that would support our broad theoretical stance, it doesn’t help us very much that each one only has \(\mathbf{5}\)% odds of happening by chance.</strong></p>
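<p>The effect of having many ways to win is easy to simulate. The twenty-outcome setup here is a hypothetical matching the number in the text, with every outcome pure noise:</p>

```python
import numpy as np

rng = np.random.default_rng(1)

# 10,000 simulated studies, each testing 20 independent outcomes where
# the null is true, so every p-value is uniform on [0, 1].
p_values = rng.uniform(size=(10_000, 20))
any_significant = (p_values < 0.05).any(axis=1).mean()

print(f"studies with at least one p < 0.05: {any_significant:.0%}")
# Theory: 1 - 0.95**20, which is about 64%.
```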
<p>Gelman and Loken describe how this applies to research by Daryl Bem, which claims to provide strong evidence for ESP.<strong title="Scott Alexander [has pointed out] that ESP experiments are a great test case for our scientific and statistical methods, because we have extremely high confidence that we already know the true answer."><sup id="fnref:9"><a href="#fn:9" class="footnote">9</a></sup></strong></p>
<blockquote>
<p>In his first experiment, in which 100 students participated in visualizations of images, he found a statistically significant result for erotic pictures but not for nonerotic pictures….</p>
</blockquote>
<blockquote>
<p>But consider all the other comparisons he could have drawn: If the subjects had identified all images at a rate statistically significantly higher than chance, that certainly would have been reported as evidence of ESP. Or what if performance had been higher for the nonerotic pictures? One could easily argue that the erotic images were distracting and only the nonerotic images were a good test of the phenomenon. If participants had performed statistically significantly better in the second half of the trial than in the first half, that would be evidence of learning; if better in the first half, evidence of fatigue.</p>
</blockquote>
<blockquote>
<p>Bem insists his hypothesis “was not formulated from a post hoc exploration of the data,” but a data-dependent analysis would not necessarily look “post hoc.” For example, if men had performed better with erotic images and women with romantic but nonerotic images, there is no reason such a pattern would look like fishing or p-hacking. Rather, it would be seen as a natural implication of the research hypothesis, because there is a considerable amount of literature suggesting sex differences in response to visual erotic stimuli. The problem resides in the one-to-many mapping from scientific to statistical hypotheses.</p>
</blockquote>
<p>We even saw an example of forking paths earlier in this essay, in the <a href="#where-does-power-come-from">study of gubernatorial lifespans</a>. I said the study found a clever analysis to get a significant result. In the data set we saw from Barfort, Klemmensen, and Larsen, the obvious tests like linear regression don’t show any effect of winning margin on lifespan.</p>
<p class="blog-image center"><img src="/assets/blog/hypothesis-testing/governor-life-expectancy-loess.png" alt="The same scatterplot of "Percentage vote margin" on the x-axis and "Years alive after election" on the y-axis. This time a best-fit loess curve is drawn through the data; it again shows no real relationship." class="blog-image center" />
<em>A loess curve is a more sophisticated version of linear regression. It doesn’t show a clear relationship between electoral margin and lifespan. Graph again <a href="https://statmodeling.stat.columbia.edu/2020/07/02/no-i-dont-believe-that-claim-based-on-regression-discontinuity-analysis-that/">by Andrew Gelman</a>.</em></p>
<p>But if you average different candidates with the same electoral margin together, divide them into a group of winners and a group of losers, and then do a regression on each group separately, the two regressions suggest that barely winning a race, rather than barely losing it, improves life expectancy.</p>
<p class="center"><img src="/assets/blog/hypothesis-testing/governor-regression-discontinuity.png" alt="A figure from the Barfort, Klemmensen, and Larsen paper on gubernatorial elections and lifespan, showing their regression discontinuity analysis. It shows lifespan decreasing with increased voteshare, except with a large upwards discontinuity at the crossover from losing to winning." class="blog-image" /></p>
<p class="center blog-image"><em>The discontinuity between the two lines is large enough to be “statistically significant”. But does the data on the right really look qualitatively different from the data on the left?</em></p>
<p>This <a href="https://en.wikipedia.org/wiki/Regression_discontinuity_design">regression discontinuity design</a> isn’t a ridiculous approach to the question, but it’s also probably not the first idea you’d think of. And the paper’s own abstract says they’re not sure which way the effect should run, so <em>any pattern at all</em> would provide support for their research hypothesis. This is a subtle but crucial violation of the hypothesis testing framework, and dramatically inflates the rate of “positive” results.</p>
<h2 id="sowhy-does-science-work-at-all">So…why does science work <em>at all</em>?</h2>
<p>Hopefully I’ve convinced you, first, that the tools of modern hypothesis testing are badly suited for the questions we want them to answer, and second, that the structure of our scientific institutions leads us to regularly misuse them in ways that make them even more misleading. So then, how do we manage to learn anything at all?</p>
<p>Sometimes we don’t! The whole point of the “replication crisis” is that we’re almost having to throw out entire fields wholesale. <strong>When I hear about a promising new drug, or a cool new social psychology study, I <em>assume it’s bullshit</em>, because so many of them are. And that’s a real crisis for the whole idea of “scientific knowledge”.</strong></p>
<p>But in many fields of study we do, in fact, manage to learn things. We know enough physics and chemistry to build things like spaceships and smartphones. And even though a lot of drug studies are nonsense, modern medicine does in fact work.</p>
<p class="center"><img src="/assets/blog/hypothesis-testing/life-expectancy-at-age-10.png" alt="A graph from Our World In Data of life expectancy at age ten in various countries, from 1750 to the present. There is a dramatic increase over the 20th century." class="blog-image" /></p>
<p class="center blog-image"><em>We didn’t increase life expectancy by almost thirty years without learning</em> something <em>about biology.</em></p>
<p>And even in more vulnerable fields like psychology and sociology, we have developed a lot of consistent, replicable, useful knowledge. How did we get that to work, despite our shoddy statistics?</p>
<h3 id="inter-ocular-trauma">Inter-ocular trauma</h3>
<p>If your data are good enough, you can get away with having crappy statistics. One of the best and most useful statistical tools is what Joe Berkson called the <a href="https://stats.stackexchange.com/questions/458069/source-for-inter-ocular-trauma-test-for-significance">inter-ocular traumatic test</a>: “you know what the data mean when the conclusion hits you between the eyes”.</p>
<p><a href="https://xkcd.com/2400/"><img src="https://imgs.xkcd.com/comics/statistics.png" alt="XKCD 2400: graph of covid vaccine efficacy versus placebo. "Statistics tip: always try to get data that's good enough that you don't need to do statistics on it."" style="max-width:800px;" class="blog-image center" /></a></p>
<p class="center blog-image"><em>I didn’t worry that</em> this <em>result was bullshit statistical trickery, because I can easily see the evidence for myself.</em></p>
<p>Conversely, if your data isn’t very good, statistics can’t help you with it very much. John Tukey <a href="https://doi.org/10.2307/2683137">famously wrote</a>:</p>
<blockquote>
<p>The data may not contain the answer. The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.</p>
</blockquote>
<p>None of this means statistics is useless. But if we can consistently get good, high-quality data, we can afford a little sloppiness in our statistical methodology.</p>
<h3 id="putting-the-replication-in-replication-crisis">Putting the “replication” in “replication crisis”</h3>
<p>And this is where the “replication” half of “replication crisis” comes in. <strong>If the signal you’re detecting is real, you can run another experiment, or do another study, and (probably) see the same thing.</strong> In my <a href="https://jaydaigle.net/blog/replication-crisis-math/">post on the replication crisis</a> I wrote about how mathematicians are constantly replicating our important results, just by reading papers; and that protects us from a lot of the flaws plaguing social psychology.</p>
<p>Gelman recently <a href="https://statmodeling.stat.columbia.edu/2022/03/04/biology-as-a-cumulative-science-and-the-relevance-of-this-idea-to-replication/">made a similar point</a> about fields like biology. Because wet lab biology is cumulative, people are continually replicating old work in the process of trying to do new work. A boring false result can survive for a long time, if no one cares enough to use it; an exciting false result will be exposed quickly when people try to build on it and it collapses under the strain.</p>
<p>This is something Fisher himself wrote about clearly and firmly: “A scientific fact should be regarded as experimentally established only if a properly designed experiment <em>rarely fails</em> to give this level of significance”. That is, we shouldn’t accept a result when we successfully do <em>one</em> experiment that produces a low \(p\)-value; but we should listen when we can <em>consistently</em> do experiments with low \(p\)-values.</p>
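<p>Fisher’s standard is quantitatively demanding in a way that a single test is not. Under conventional numbers, for example:</p>

```python
power, alpha = 0.80, 0.05

# A real effect, studied with 80% power, survives three independent
# replications about half the time...
print(f"real effect replicates 3 times: {power ** 3:.3f}")

# ...while a false positive at the 5% level almost never does.
print(f"false positive replicates 3 times: {alpha ** 3:.6f}")
```

<p>That asymmetry, where real effects replicate consistently and flukes do not, is what Fisher’s criterion exploits.</p>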
<p><strong>But the entire concept of “replication” is in opposition to the artificial decisiveness of Neyman-Pearson hypothesis testing.</strong> The Neyman-Pearson method, if taken seriously, asks us to fully commit to believing a theory if our experiment comes up with \(p=0.049\); but that attitude is <em>utterly terrible science</em>. Good scientific practice <em>needs</em> to be able to hold beliefs lightly, revise them when new evidence comes in, and carefully build up solid foundations that can support further work.</p>
<p>The standard approach to hypothesis testing isn’t designed for that. Next time, in <a href="https://jaydaigle.net/blog/hypothesis-testing-part-3/">part 3</a>, we’ll look at some tools that are.</p>
<hr />
<p><em>Have questions about hypothesis testing? Is there something I didn’t cover, or even got completely wrong? Or is there something you’d like to hear more about in the rest of this series? Tweet me <a href="https://twitter.com/profjaydaigle">@ProfJayDaigle</a> or leave a comment below.</em></p>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>This is the probability of getting data five standard deviations away from the mean. So you’ll often see this reported as a significance threshold of \(5 \sigma\). Related is the <a href="https://en.wikipedia.org/wiki/Six_Sigma">Six Sigma techniques</a> for ensuring manufacturing quality, though somewhat counterintuitively they typically only aim for <a href="https://en.wikipedia.org/wiki/Six_Sigma#Role_of_the_1.5_sigma_shift">4.5 \(\sigma\)</a> of accuracy. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>It is common for people to be sloppy here and say they “accept” the null. In fact, I wrote that in my first draft of this paragraph. But it’s bad practice to say that, because even a very high \(p\)-value doesn’t provide good evidence that the null hypothesis is true. Our methods are designed to default to the null hypothesis when the data is ambiguous.</p>
<p>Neyman <em>did</em> use the phrase “accept the null”, but in the context of a decision process, where “accepting the null” means taking some specific, concrete action implied by the null, rather than more generally committing to believe something. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3">
<p>Andrew Gelman suggests a helpful <a href="https://statmodeling.stat.columbia.edu/2016/01/26/more-power-posing/">time-reversal heuristic</a>: what would you think if you saw the same studies in the opposite order? You’d start with a few large studies establishing no effect, followed by one smaller study showing an effect. In theory that gives you the exact same information, but in practice people would treat it very differently—assuming the first studies <a href="#file-drawer">actually got published</a>. <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
<li id="fn:4">
<p>You might recognize this as an application of Bayes’s theorem, and a basic example of <a href="https://jaydaigle.net/blog/overview-of-bayesian-inference/">Bayesian inference</a>. Tables like these are very common in Bayesian calculations. <a href="#fnref:4" class="reversefootnote">↩</a></p>
</li>
<li id="fn:5">
<p>Followups to Ioannidis’s paper contend that only about \(14\)% of published biomedical findings are actually false. I’m not in a position to comment on this one way or the other. In psychology, different studies estimate that somewhere <a href="https://en.wikipedia.org/wiki/Replication_crisis#In_psychology">between \(36\)% and \(62\)%</a> of published results replicate. <a href="#fnref:5" class="reversefootnote">↩</a></p>
</li>
<li id="fn:6">
<p>We can also base it on <a href="https://twitter.com/lakens/status/1524799540250959873">how big of an effect we <em>care</em> about</a>. If we’re studying reaction times, we might decide that an effect smaller than ten milliseconds is irrelevant, and we don’t care about it even if it’s real. Then we can choose a study with enough power to detect a \(10\)<em>ms</em> effect at least \(80\)% of the time.</p>
<p>But this brings us back to the core issue, that “is there an effect” just isn’t a great question, and the Neyman-Pearson method isn’t a great tool for answering it. <a href="#fnref:6" class="reversefootnote">↩</a></p>
</li>
<li id="fn:7">
<p>Clever analyses like this are often a bad idea; we’ll come back to this idea <a href="#file-drawer">soon</a>. <a href="#fnref:7" class="reversefootnote">↩</a></p>
</li>
<li id="fn:8">
<p>This difference is the source of a lot of research pitfalls; if you want to dig into this more, I recommend <a href="https://psyarxiv.com/jqw35">Tal Yarkoni</a> on generalizability, <a href="https://psyarxiv.com/bu4d3/">Vazire, Schiavone, and Bottesini</a> on the four types of validity, and <a href="https://journals.sagepub.com/doi/10.1177/1745691620966795">Scheel, Tiokhin, Isager, and Lakens</a> on the derivation chain. <a href="#fnref:8" class="reversefootnote">↩</a></p>
</li>
<li id="fn:9">
<p>Scott Alexander <a href="https://slatestarcodex.com/2014/04/28/the-control-group-is-out-of-control/">has pointed out</a> that ESP experiments are a great test case for our scientific and statistical methods, because we have extremely high confidence that we already know the true answer. <a href="#fnref:9" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Jay DaigleThis is the second-part of a three-part series on hypothesis testing. Today we'll look at the way we do hypothesis testing in practice, and how it tends to fail. Modern researchers use hypothesis testing as a tool to develop knowledge, but it's really a tool for making decisions, and so it encourages us to draw strong conclusions from weak evidence. It also encourages us to view studies that don't reject the null hypothesis as failures, which leads even honest and dedicated researchers to do shoddy research, producing "statistically significant" results that can't be reproduced.Hypothesis Testing and its Discontents, Part 1: How is it Supposed to Work?2022-03-31T00:00:00-07:002022-03-31T00:00:00-07:00https://jaydaigle.net/blog/hypothesis-testing-part-1<p>In my <a href="https://jaydaigle.net/blog/replication-crisis-math/">last post on the replication crisis</a>, I mentioned the basic ideas of <a href="https://en.wikipedia.org/wiki/Statistical_hypothesis_testing">statistical hypothesis testing</a>. There wasn’t room to give a full explanation in that post, but hypothesis testing is worth understanding, since it’s the foundation of most modern scientific research. It’s a powerful tool, but also incredibly easy to misunderstand and misuse.</p>
<p>This post is the first part of a three-part series explaining what hypothesis testing is and how it works. In this essay I’ll talk about the way hypothesis testing developed historically, in two rival schools of thought. I’ll explain how these two methodologies were originally supposed to work, and why you might (or might not) want to use them. In <a href="/blog/hypothesis-testing-part-2">Part 2</a> I’ll talk about how we do significance testing in practice today, and how that often goes wrong. And in <a href="https://jaydaigle.net/blog/hypothesis-testing-part-3/">Part 3</a> I’ll talk about alternatives to hypothesis testing that can help us avoid replication crisis-type problems.</p>
<h2 id="choose-your-question">Choose your question</h2>
<p>Perhaps the most important step in using math to solve real-world problems is figuring out precisely <a href="https://jaydaigle.net/blog/why-word-problems/">what question you want to ask</a>. Now, there’s a sense in which this process isn’t mathematical. Math can’t tell you, say, whether you want your clothing to be more comfortable or more stylish. No amount of math can tell you how you value inequality versus growth, or whether you’re willing to risk major side effects from an experimental medical treatment.</p>
<p>But math can help you figure out what question you’re asking, by clarifying exactly what questions you <em>could</em> be trying to answer, what their implications are, and what options you have for answering them. The history of hypothesis testing is a debate between people trying to answer different questions, but also a debate about which questions are the most fruitful to ask. Do we want to test a scientific principle? Record a precise measurement? Make a decision?</p>
<p>The statistical tools we use today were developed by specific people,<strong title="Some of these specific people were [pretty awful in one way or another]. Ronald Fisher in particular was [racist] and a [vigorous defender of tobacco companies], though Jerzy Neyman seems to have been [perfectly lovely]. I'm not going to go into detail about their failings, among other things because I'm not especially well-informed on the subject; I recommend the articles I linked if you want to know more."><sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup></strong> at specific times, to answer specific questions. So I want to start off by asking some of those specific questions, and see how early statisticians would approach them and what ideas they developed in response.<strong title="Much of this essay, and especially the historical information on the way these schools of thought developed, draws heavily on the article [Confusion Over Measures of Evidence (p's) Versus Errors (α's) in Classical Statistical Testing] by Hubbard and Bayarri. This extremely readable article is also a fascinating historical artifact, basically predicting the entire contour of the replication crisis in 2003."><sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup></strong> After we’ve seen how Fisher’s significance testing and the Neyman-Pearson hypothesis testing framework worked in their original contexts, we can talk about what questions each tool is best suited to answer, and what types of question neither tool can really handle.</p>
<h2 id="fishers-significance-testing">Fisher’s Significance Testing</h2>
<h3 id="mileage">Are You Surprised?</h3>
<p>In 2016 I got a new car with a fancy new electronic system. And one of the new features was a meter that kept track of my gas mileage. It was fun to watch the mileage adjust as I was driving. (And I may have gotten a little obsessed with trying to eke out another tenth of a mile per gallon by driving funny.)</p>
<p>But how accurate is that mileage number? In 2019 my friend Casey suggested an experiment to me and I decided to try it. For several months, every time I filled up my gas tank, I recorded the mpg number from my car dashboard. I also recorded the number of miles I’d driven and the number of gallons of gas I’d used, which let me calculate the mpg directly.</p>
<table class="smalltable">
<thead>
<tr>
<th>Miles Driven</th>
<th>Gallons</th>
<th>Calculated MPG</th>
<th>Dashboard MPG</th>
<th>Difference</th>
</tr>
</thead>
<tbody>
<tr>
<td>340.7</td>
<td>10.276</td>
<td>33.2</td>
<td>34.2</td>
<td>1.0</td>
</tr>
<tr>
<td>300.1</td>
<td>8.97</td>
<td>33.5</td>
<td>34.7</td>
<td>1.2</td>
</tr>
<tr>
<td>232.6</td>
<td>8.04</td>
<td>28.9</td>
<td>29.0</td>
<td>0.1</td>
</tr>
<tr>
<td>261.8</td>
<td>8.5</td>
<td>30.8</td>
<td>31.1</td>
<td>0.3</td>
</tr>
<tr>
<td>301.3</td>
<td>9.316</td>
<td>32.3</td>
<td>32.5</td>
<td>0.2</td>
</tr>
<tr>
<td>505.1</td>
<td>15.127</td>
<td>33.4</td>
<td>34.8</td>
<td>1.4</td>
</tr>
<tr>
<td>290.3</td>
<td>9.814</td>
<td>29.6</td>
<td>30.3</td>
<td>0.7</td>
</tr>
<tr>
<td>290.2</td>
<td>8.566</td>
<td>33.9</td>
<td>34.9</td>
<td>1.0</td>
</tr>
<tr>
<td>294.9</td>
<td>9.005</td>
<td>32.7</td>
<td>32.8</td>
<td>0.1</td>
</tr>
<tr>
<td>301.4</td>
<td>9.592</td>
<td>31.4</td>
<td>32.0</td>
<td>0.6</td>
</tr>
<tr>
<td>230.9</td>
<td>7.643</td>
<td>30.2</td>
<td>32.0</td>
<td>1.8</td>
</tr>
<tr>
<td>269.2</td>
<td>8.644</td>
<td>31.1</td>
<td>30.8</td>
<td>-0.3</td>
</tr>
<tr>
<td>267</td>
<td>8.327</td>
<td>32.1</td>
<td>32.6</td>
<td>0.5</td>
</tr>
<tr>
<td>319.7</td>
<td>9.42</td>
<td>33.9</td>
<td>34.7</td>
<td>0.8</td>
</tr>
<tr>
<td>314.3</td>
<td>9.868</td>
<td>31.9</td>
<td>33.3</td>
<td>1.4</td>
</tr>
<tr>
<td>264.4</td>
<td>8.693</td>
<td>30.4</td>
<td>31.7</td>
<td>1.3</td>
</tr>
<tr>
<td>273</td>
<td>9.229</td>
<td>29.6</td>
<td>30.4</td>
<td>0.8</td>
</tr>
<tr>
<td>320.2</td>
<td>9.618</td>
<td>33.3</td>
<td>33.3</td>
<td>0.0</td>
</tr>
</tbody>
</table>
<p>These numbers show that my car reported a better mileage than I actually got almost every time. Out of eighteen measurements, my car overestimated sixteen times, underestimated once, and was accurate to one decimal place once. But was this tendency toward overestimation a coincidence? Is my car’s mileage calculation biased high, or did I just get weirdly unlucky?</p>
<p>We can try to get a sense of how easily this could have happened by chance. We took eighteen measurements, and sixteen of them were high. (One was a tie, but we’ll be generous and count it as “not high”.) If the car is equally likely to guess high or low, this is like flipping a coin eighteen times and getting sixteen heads. That’s pretty unlikely: the probability is about \(0.0006\), or \(0.06\)%, or about one in \(1700\). It’s still <em>possible</em> that my car is unbiased and I just got unlucky. But if so, I was extremely unlucky.</p>
<p class="center"><img src="/assets/blog/hypothesis-testing/star-wars-asteroids.jpeg" alt="Screenshot from The Empire Strikes Back, with dialog: "Sir, the possibility of successfully navigating an asteroid field is approximately 3,720 to 1!"" width="75%" /></p>
<p class="center"><em>But still only half as unlucky as Han Solo’s enemies.</em></p>
<h3 id="what-is-a-significance-test">What is a significance test?</h3>
<p>We call this approach a <em>significance test</em>. This approach was developed by <a href="https://en.wikipedia.org/wiki/Ronald_Fisher">Ronald Fisher</a>, following up work by <a href="https://en.wikipedia.org/wiki/Karl_Pearson">Karl Pearson</a> and <a href="https://en.wikipedia.org/wiki/Student%27s_t-distribution">William Sealy Gosset</a>.</p>
<p>We start by formulating a <em>null hypothesis</em> that represents some form of “expected” behavior, which we call \(H_0\). In this case, I expected<strong title="Okay, maybe I didn't _actually_ expect my car to be accurate and unbiased. But it's at least _supposed_ to be true, so it provides a good baseline for comparison."><sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup></strong> my car to correctly measure my gas mileage, without consistent bias in either direction. There are a few ways to make that expectation mathematically precise; in the example above, my precise hypothesis was “an overestimate is just as likely as an underestimate”, or more formally, \(P(\text{overestimate}| H_0 ) = 0.5 \).</p>
<p>(There are other ways to formalize my expectations here. I ignored the size of the errors, and just looked at whether the measured mileage was better or worse than the mileage I calculated. But with a more complicated statistical tool called a <a href="https://en.wikipedia.org/wiki/Student's_t-test">paired \(t\)-test</a> we can use the exact numbers to get a bit more information out of our measurements. When I do this, I get a \(p\)-value of \(0.00004\), or \(0.004\)%—an order of magnitude lower than my first figure.)</p>
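<p>For the curious, here’s a sketch of that computation in Python, using the eighteen differences from the table above. It computes the paired-\(t\) statistic with the standard library only; turning \(t\) into a \(p\)-value means comparing it against a \(t\)-distribution with \(17\) degrees of freedom, which <code>scipy.stats.ttest_rel</code> would do in a single call.</p>

```python
import math

# Dashboard MPG minus calculated MPG, read off the table above
diffs = [1.0, 1.2, 0.1, 0.3, 0.2, 1.4, 0.7, 1.0, 0.1,
         0.6, 1.8, -0.3, 0.5, 0.8, 1.4, 1.3, 0.8, 0.0]

n = len(diffs)
mean = sum(diffs) / n
# Sample standard deviation (n - 1 in the denominator)
sd = math.sqrt(sum((d - mean) ** 2 for d in diffs) / (n - 1))
# t statistic for the null hypothesis "the mean difference is zero"
t = mean / (sd / math.sqrt(n))
print(round(t, 2))  # about 5.27: far out in the tail of a t-distribution
```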
<p>Once we have a null hypothesis, <strong>we compute how unlikely the measurement we actually got would be, if we assume the null hypothesis is true</strong>. And if that sentence looks confusing and grammatically tangled, there’s a reason for that: while this process is absolutely unambiguous mathematically, it has nested “if-then” statements that are hard to think clearly about and don’t translate easily into English. In mathematical notation, we want \( P( \text{measurement} \mid H_0 ) \), which we can read as the probability of seeing our measurement given the null hypothesis.</p>
<p>There are a couple of subtle points here, so I want to be super explicit and run them into the ground. The first is that we need to be careful about what we mean by “how unlikely our result is”, because any <em>specific</em> result is extremely unlikely. The odds of getting the exact sequence I got in my experiment—HHHHHH HHHHHT HHHHHT—are exactly \(1\) in \(2^{18}\). But that specific sequence isn’t special: if you pick any specific sequence, whether it’s all heads like HHHHHH HHHHHH HHHHHH, or half-and-half like HTHTHT HTHTHT HTHTHT, or something totally random like HHHTHT HHHHTT HTTHTT, the odds of getting those exact flips in that exact order are \(1\) in \(2^{18}\).</p>
<p class="center"><img src="/assets/blog/hypothesis-testing/coin-flips.png" alt="A picture of eighteen flipped coins" class="blog-image" /></p>
<p class="center"><em>The probability of getting these exact flips in this exact order is \(1\) in \(2^{18}\), or about \(0.000004\).</em></p>
<p>But that doesn’t tell us anything useful! Fortunately, in the context of hypothesis testing, we can do something smarter. It doesn’t really matter what <em>order</em> we get the heads in; it just matters how many we get, because that tells us how often the car is overestimating my mileage. So we can compute the odds of getting sixteen heads in any order. And getting seventeen or eighteen heads would be even <em>more</em> unlikely, so we include those as well. What we wind up computing is the odds of getting \(16\), \(17\), or \(18\) heads. That’s how I got the number \(0.0006\) earlier.</p>
<p>We say that we want to compute the chance of getting a result <em>at least as bad</em> as what we got. But that requires us to decide what counts as “better” or “worse”; and that depends on what question we’re actually trying to ask. In this context, I’m testing the null hypothesis that my car underestimates as often as it overestimates, so I can basically order the possible results from “most overestimation” to “most underestimation” and find the probability of overestimating at least as often as my car actually did.</p>
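<p>That tail arithmetic is quick to check directly. Here’s a minimal sketch in Python (pure standard library; the variable names are mine, not from the post):</p>

```python
from math import comb

n = 18  # fill-ups recorded

# P(at least 16 heads in 18 fair flips): each count k contributes
# C(18, k) equally-likely sequences out of 2^18 total.
p = sum(comb(n, k) for k in (16, 17, 18)) / 2 ** n
print(p)  # about 0.00066, the "0.0006" quoted above
```

<p>The same one-line sum works for any “at least this extreme” count, which is all a one-sided sign test is.</p>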
<h3 id="what-we-dont-learn">What we don’t learn</h3>
<p>Another subtle point, but an absolutely vital one, is that <strong>the \(p\)-value does <em>not</em> tell us how likely the null hypothesis is to be true</strong>. When we say that \(p = 0.0006\) that does <em>not</em> mean that there’s only a \(0.06\)% chance that my car is accurate! It just measures how unusual my evidence is, <em>if</em> the null hypothesis is true.</p>
<p>Often the question we really care about is how likely the null hypothesis is to be true. There are in fact ways to try to address that directly, which I’ll discuss in Part 3 of this series. But answering that question requires a lot more information than we usually have; Fisher’s significance test doesn’t try. <strong>It just assumes the null hypothesis is true</strong>, and tells us how weird that makes the result look.</p>
<p>Significance testing does numerically measure the strength of the experimental evidence we got: the lower the \(p\)-value, the stronger our evidence. But it doesn’t try to account for any <em>other</em> evidence we have, whether against the null hypothesis or for it. If I get a coin from the bank, flip it ten times and get ten heads, I get \(p \approx 0.001\) for the null hypothesis that it’s a normal coin. But I still expect it to be normal, because most coins are. And if I pick it up and see that it has a normal “tails” side, I’ll be really confident that I just got weirdly lucky<strong title="You might worry about whether it's a two-sided but biased coin. But Gelman and Nolan have argued that [coins physically can't be biased], and I find their argument compelling. If you don't find it compelling, you have to decide how likely you think a weighted coin would be—which is exactly the "other evidence" that Fisher's paradigm doesn't even try to account for."><sup id="fnref:4"><a href="#fn:4" class="footnote">4</a></sup></strong>.</p>
<p>And that’s why <strong>the analysis of my gas mileage above didn’t really have a firm conclusion</strong>. We got a \(p\)-value of \(0.0006\), and determined: “huh, that’s kinda funny”. <em>Either</em> our null hypothesis was false, <em>or</em> something extremely unusual happened. But <strong>the math doesn’t tell us which of those two things to believe</strong>.</p>
<p>And in the case of my car, it doesn’t need to. On the one hand, I’m not all that surprised if the mileage calculator is a little wrong; the super-low \(p\)-value just reinforces what I already suspected. And on the other hand, I’m not really going to do anything different if my mileage is half an mpg lower than my dashboard says. I’m not going to sue Honda, or lead an activist campaign, or try to raise awareness about faulty mileage estimates.</p>
<p>But if I really cared, I could run more experiments. I got \(p = 0.0006\) in my first experiment; but I could do the experiment again. If I get \( p = 0.31\) next time, maybe I should assume the first result was just a fluke. But if I get \(p = 0.0003\) and then \( p = 0.0008\) I’ll see a pattern. And that pattern would make a convincing argument that my car is lying to me.</p>
<p class="center"><img src="/assets/blog/hypothesis-testing/omg-kitten.jpg" alt="picture of a shocked kitten: "OMG I knew it!"" class="blog-image" /></p>
<p>In “The Arrangement of Field Experiments”, Fisher writes that “A scientific fact should be regarded as experimentally established only if a properly designed experiment <em>rarely fails</em> to give this level of significance”. (Italics in the original.) That is, <a href="https://slatestarcodex.com/2014/12/12/beware-the-man-of-one-study/">no one experiment should convince us of anything</a>. Instead, <strong>we should believe our results when we can reliably design experiments that give the same results</strong> (which is arguably the point that we <a href="https://jaydaigle.net/blog/science-vs-engineering/">pass from science to engineering</a>).<strong title="A friend asks if meta-analysis accomplishes the same thing, but meta-analysis is actually a much weaker threshold than the one Fisher gives here. Meta-analysis tries to amplify weak signals and reconcile inconsistent results; Fisher says we should only believe a claim when we can consistently get a strong signal."><sup id="fnref:5"><a href="#fn:5" class="footnote">5</a></sup></strong></p>
<p>But that’s a slow, grinding, painstaking process. And it still doesn’t give us a rule for when to pull the trigger! We just gradually believe the null hypothesis less and less as we collect more data. That’s perfectly fine for doing basic science—maybe even ideal.</p>
<p>But what if the stakes are higher, and more immediate? Sometimes we need to make a real decision, now, with the data we have. So what do we do?</p>
<h2 id="neyman-pearson-hypothesis-testing">Neyman-Pearson Hypothesis Testing</h2>
<h3 id="time-to-make-a-choice">Time to make a choice</h3>
<p>Suppose we’re studying a new drug, which we hope will prevent deaths from cancer. We can collect data on how effective the drug seems to be in trials, but just reporting a \(p\)-value isn’t enough. At some point <strong>we have to make a <em>decision</em>: should we give people the drug, or not?</strong> And Fisher’s methods don’t answer that.<strong title="From what I understand, Fisher was a little contemptuous of the idea that you could answer this question mathematically."><sup id="fnref:6"><a href="#fn:6" class="footnote">6</a></sup></strong></p>
<p><a href="https://en.wikipedia.org/wiki/Jerzy_Neyman">Jerzy Neyman</a> and <a href="https://en.wikipedia.org/wiki/Egon_Pearson">Egon Pearson</a> (the son of Karl Pearson) decided to attack that question head-on. They began by observing that there are two different mistakes we could make, which they called “Type I” and “Type II” errors.</p>
<p>These names are infamously unmemorable, but in their original context they make perfect sense: <strong>whichever mistake we most want to avoid is the “first type” of mistake</strong>. For drug testing, there’s a widespread consensus that it’s worse to prescribe a drug that doesn’t work, or has nasty side effects, than it is to withhold a drug that works as expected.<strong title="I'm not convinced I agree with this, but that's beside the point here. I'll discuss this choice a bit more in Part 2 of this series."><sup id="fnref:7"><a href="#fn:7" class="footnote">7</a></sup></strong> So the Type I error would be prescribing a drug that doesn’t work, and the Type II error would be failing to prescribe a drug that does work. This means we can take “the drug doesn’t work” as our null hypothesis \(H_0\). <strong>But we can contrast our null hypothesis with a specific alternative: that the drug does, in fact, work</strong>. We call this our “alternative hypothesis” \(H_A\). And we get the following classic chart:</p>
<table>
<thead>
<tr>
<th> </th>
<th>Null Hypothesis is false <br /> (Drug works)</th>
<th>Null Hypothesis is true <br /> (Drug doesn’t work)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Give the drug <br /> (Reject the Null)</td>
<td>Correct decision <br /> “True Positive”</td>
<td>First, worse error <br /> (Type I Error) <br /> “False Positive”</td>
</tr>
<tr>
<td>Don’t give the drug <br /> (Don’t Reject)</td>
<td>Second, less bad error <br /> (Type II Error) <br /> “False Negative”</td>
<td>Correct Decision <br /> “True Negative”</td>
</tr>
</tbody>
</table>
<p>This leaves us with a problem. There are two different mistakes we could make. And without getting better data, we can only reduce the Type II errors by increasing the Type I errors: if we’re generally more willing to say “yes, prescribe the drug”, we’ll say “yes” more often when the drug works, but also when it doesn’t. We need to strike some sort of balance between the two risks. But how?</p>
<p><strong>There’s no abstract, mathematical answer to this question; it depends on the specific, practical consequences of the decision we’re making</strong>, and how much we care about the specific trade-offs in play. We already said that a Type I error is worse than a Type II error—but by how much? Is it two times as bad? Five? Ten? We have to decide exactly how we weigh the two risks against each other.</p>
<p>In drug testing, a Type I error means spending money on drugs that don’t work and might hurt people. A Type II error means people don’t get treatment that would help them. If a disease is really bad, we’re more willing to make Type I errors, because a drug that <em>might</em> kill you compares favorably to a disease that <em>definitely</em> will. If a drug is really expensive, or has bad side effects, we might be more willing to make Type II errors, because people will be hurt more by letting a bad drug slip through. And there are dozens more factors like that that we have to weigh against each other.</p>
<p>Once we’ve decided how we want to balance these risks, we can define a threshold for our experiment. If our data crosses that threshold we prescribe the drug; if the data doesn’t cross the threshold, then we don’t. And that’s our decision.</p>
<h3 id="the-risk-of-error">The risk of error</h3>
<p>All this setup leaves us with a pair of numbers that describe the trade-offs we’ve made. The rate of Type I errors is \(\alpha\), which tells us: <em>if</em> the drug doesn’t work, how likely are we to prescribe it? Its mirror is \(\beta\), the rate of Type II errors. This tells us: if the drug <em>does</em> work, how likely are we to withhold it? <strong title="In a medical context, we often talk about the related concepts of _sensitivity_ and _specificity_. Sensitivity is the "true positive" rate 1-β, the probability of correctly prescribing the drug if it would help. Specificity is the "true negative" rate 1-α, the probability of correctly withholding the drug if it would not help. These terms come from diagnostic testing. "Sensitivity" measures the chance of correctly detecting a condition that you have; "specificity" measures the chance of correctly detecting that you don't have a condition. "><sup id="fnref:8"><a href="#fn:8" class="footnote">8</a></sup></strong></p>
<p class="center"><img src="/assets/blog/hypothesis-testing/neyman-pearson-confusion-chart.png" alt="" /></p>
<p class="center"><em>We give the drug if our measurement is bigger than the threshold. If the drug works, we’ll get a result from the right (green) bell curve; if it doesn’t, we’ll get a result from the left (yellow) one.</em></p>
<p class="center"><em>ROC_curves.svg: Sharprderivative work: נדב ס, <a href="https://creativecommons.org/licenses/by-sa/4.0">CC BY-SA 4.0</a>, via <a href="https://commons.wikimedia.org/wiki/File:ROC_curves_colors.svg">Wikimedia Commons</a></em></p>
<p>(You’ll often see \(\alpha\) referred to as the “false positive rate” and \(\beta\) as the “false negative rate”, but that’s a little inexact. In modern practice, the null hypothesis is almost always “there is no effect”, but this isn’t necessary to the framework. If we want to err on the side of prescribing the drug, then “the drug works” would be the null hypothesis and “no it doesn’t” would be the alternative. In that case, rejecting the null would be a <em>negative</em> result and a Type I error would be a false <em>negative</em>.)</p>
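<p>To make the bell-curve picture above concrete, here’s a small sketch in Python. The two distributions and the threshold are illustrative numbers I’ve made up, not anything from a real trial; the point is just that a single threshold fixes both error rates at once.</p>

```python
from math import erf, sqrt

def normal_cdf(x, mu, sigma):
    """P(X <= x) for a normal distribution with mean mu and sd sigma."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

mu_null, mu_alt, sigma = 0.0, 2.0, 1.0  # hypothetical effect sizes
threshold = 1.645  # give the drug when the measurement exceeds this

alpha = 1 - normal_cdf(threshold, mu_null, sigma)  # Type I error rate
beta = normal_cdf(threshold, mu_alt, sigma)        # Type II error rate
print(round(alpha, 3), round(beta, 3))  # alpha is about 0.05, beta about 0.36
```

<p>Sliding the threshold to the right shrinks \(\alpha\) but grows \(\beta\), and vice versa; with these curves fixed, the only way to shrink both is better data.</p>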
<p>But through all this, <strong>we have to be careful about what question we’re asking, and whether our methods can answer it.</strong> Naively we might want to ask something like “how likely is it that this drug works”, but Fisher, Neyman, and Pearson all would have agreed that that’s an incoherent question that can’t really be answered.<strong title="All three were [frequentists], and believed (roughly) that you can only give a "probability" for something repeatable. You can talk about the probability a study will give a null result, since you could run a hundred studies and count how many give the null. But you can't talk about the probability that a given drug works, since there's only the one drug. The major modern alternative to frequentist probability is [Bayesianism], which _does_ think this question makes sense. I've written about Bayesian reasoning [in the past](https://jaydaigle.net/blog/overview-of-bayesian-inference/) and I'll come back to it in Part 3 of this series. But the Neyman-Pearson method is definitely not Bayesian."><sup id="fnref:9"><a href="#fn:9" class="footnote">9</a></sup></strong> (And even if you believe it’s a coherent question, it’s still not an easy one.)</p>
<p class="center"><img src="/assets/blog/hypothesis-testing/seriously-you-didnt-answer-my-question.jpg" alt="An upset-looking cat: "Seriously, you didn't answer my question"" class="blog-image" /></p>
<p>Instead, the probabilities we computed are both conditional: <em>if</em> the drug doesn’t work, how likely are we to prescribe it? And <em>if</em> the drug does work, how likely are we to withhold it? We can use those probabilities to make the best possible decision, given the information we used and the assumptions we made. <strong>But we can’t compute the probability that our decision is correct</strong>, because that’s just not the question that the Neyman-Pearson method can answer.</p>
<h3 id="dont-tell-me-what-to-think">Don’t tell me what to think!</h3>
<p>In fact, the Neyman-Pearson method is even less able to answer that than the Fisher method. Fisher can’t tell us the probability that we’re right, but it’s at least an attempt to figure out whether we’re right, by measuring our experimental evidence against the null hypothesis. But <strong>Neyman-Pearson doesn’t even try to tell us whether the drug “really works” or not. It just tells us what we should <em>do</em>.</strong></p>
<p>And it is very possible to believe that a drug probably works and is safe, but also that we’re <a href="https://en.wikipedia.org/wiki/Primum_non_nocere">not sure enough</a> to go around prescribing it; it’s equally possible to believe a drug probably doesn’t work, but it’s cheap and harmless so we <a href="https://jaydaigle.net/blog/pascalian-medicine/">might as well give it a shot</a>. Neyman himself wrote, in his <em>First Course in Probability and Statistics</em>:</p>
<blockquote>
<p>[T]o accept a hypothesis \(H\) means only to decide to take action \(A\) rather than action \(B\). This does not mean that we necessarily believe that the hypothesis \(H\) is true. Also, [to reject] \(H\) means only that the rule prescribes action \(B\) and does not imply that we believe \(H\) is false.</p>
</blockquote>
<p>Researchers talk about the difference between statistical significance and <a href="https://statisticsbyjim.com/hypothesis-testing/practical-statistical-significance/">practical</a> or <a href="https://www.mhaonline.com/faq/clinical-vs-statistical-significance">clinical significance</a>, but <strong>in the true Neyman-Pearson setup, practical and statistical significance should be the same</strong>. Sure, if your measurements are precise enough, you can detect an effect that’s too small to matter. Conversely, a small pilot experiment can provide exciting, suggestive data without conclusively establishing any facts. But Neyman-Pearson is designed to choose a significance threshold \(\alpha\) to optimize <em>decision-making</em>, and that means that the statistical threshold <em>must</em> be a practically significant threshold.</p>
<p>If we’re trying to make an optimal decision based on limited information, Neyman-Pearson is about the best we can do. And that’s a pretty plausible description of a lot of medical studies. Phase III drug trials are slow, difficult, and expensive; we’re not going to run the whole thing over again just to check. We need a threshold for deciding whether to approve a drug or not, with the information we have; and that threshold is necessarily a practical one.</p>
<p>But scientific research isn’t generally about single isolated decisions; it’s a search for knowledge, an attempt to figure out what’s true and what isn’t. <strong>Neyman-Pearson very specifically <em>wasn’t</em> designed to answer questions about truth, but we try to use it to do science anyway.</strong> I’ll talk about how exactly that works (and doesn’t work) in Part 2 of this series; but (spoilers!) it works out <em>awkwardly</em>, and the mismatch between what Neyman-Pearson does and what we <em>want</em> it to do is a major contributor to the replication crisis.</p>
<h3 id="making-promises">Making promises</h3>
<p>The Neyman-Pearson method doesn’t tell you what to believe, but it does make a very specific promise: if you set your significance threshold to \(\alpha = 5\)%, then your false positive rate will be \(5\)%. This is a statistics theorem, so it really is guaranteed—if you set everything up correctly.</p>
<p class="center"><img src="/assets/blog/hypothesis-testing/promise-kitten.jpg" alt="sad kitten: "Do you promise?"" class="blog-image" /></p>
<p>But that guarantee only applies to the threshold you set <em>before you saw the data</em>. If you run your experiment, do your analysis, and compute \(p = 0.048\), then your result is significant, and the background false positive rate is \(5\)%. But if you run your experiment, do your analysis, and compute \(p = 0.001\), then your result is significant, and the background false positive rate is <em>still</em> \(5\)%. The false positive rate doesn’t get lower just because the \(p\)-value does.</p>
<p>Huh? Isn’t \(p = 0.001\) much stronger evidence than \(p = 0.048\)?</p>
<p>In one sense, yes. That’s what Fisher tells us. But Fisher doesn’t make <em>decisions</em>, and doesn’t make this statistical guarantee. It’s a different tool that answers a different question.</p>
<p>Neyman-Pearson <em>does</em> make a guarantee, but that guarantee is very specific. <strong>If you run a hundred experiments where the null hypothesis is true, you’ll only reject about five times.</strong> (And you get the lowest possible \(\beta\), the fewest possible false negatives, compatible with that false positive rate.) But that’s all you’re guaranteed.</p>
<p>And in particular, if the null hypothesis is true then all \(p\)-values are equally likely. So if you do a hundred experiments, you should expect one of them to give you \(p=0.95\), and one to give \(p = 0.05\), and one to give \(p=0.01\). And that \(0.01\) isn’t, mathematically, special. It’s just one of the five false positives you expect.</p>
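<p>You can watch this happen in a quick simulation (a sketch with made-up data: each “experiment” is pure standard-normal noise, so the null hypothesis is true by construction):</p>

```python
import random
from math import erf, sqrt

def p_value(z):
    """One-sided p-value for a standard-normal test statistic."""
    return 1 - 0.5 * (1 + erf(z / sqrt(2)))

random.seed(0)
# Ten thousand experiments in which there is no real effect
p_values = [p_value(random.gauss(0, 1)) for _ in range(10_000)]

rejections = sum(p < 0.05 for p in p_values)
very_low = sum(p < 0.01 for p in p_values)
print(rejections / 10_000)  # close to 0.05, as Neyman-Pearson promises
print(very_low / 10_000)    # close to 0.01: the "impressive" p-values are
                            # just the expected share of the false positives
```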
<p>If you want the guarantees of Neyman-Pearson’s methods, you can’t treat especially low \(p\)-values as especially, well, <em>special</em>. They land in your critical region. You reject the null. The answer to your question is “yes, prescribe the drug”. And that’s <em>all you get</em>.</p>
<p>And the same reasoning applies to results “trending towards significance”. If your \(p\)-value is \(0.06\), then you’re outside the critical region, you accept the null, and the answer to your question is “no, don’t prescribe it”.</p>
<p>And here’s the weirdest bit. If you get \(p=0.06\), you can change your significance threshold after the fact. Now you’re getting a \(6\)% false positive rate. And maybe that sounds like what you’d expect? But <strong>that also applies, retroactively, to every <em>other</em> time you ran an experiment</strong>, even if you got \(p=0.04\) and didn’t have to change your threshold.</p>
<p>If you set yourself a spending limit of \$20, but then spend \$25 when you see something you really wanted, you didn’t actually have a spending limit of \$20 in the first place. And if you’re willing to lift your \(\alpha\) when your \(p\)-value is too high—if you know that when \(p = 0.06\) you’ll frown, and hesitate, and grudgingly prescribe the drug anyway—then your \(\alpha\) is really \(6\)%, regardless of what you say. You’ll get false positives six percent of the time. You’re answering a slightly different question. Which is fine—<em>if</em> it’s closer to the question you really want to answer.</p>
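<p>Here’s that spending-limit effect in simulation form (again a hypothetical sketch: the decision rules and sample sizes are made up). Both rules announce \(\alpha = 5\)%, but one of them is secretly willing to reject at \(p = 0.055\).</p>

```python
import math
import random

random.seed(1)

def two_sided_p(sample, sigma=1.0):
    """Two-sided z-test p-value for H0: mean = 0, with known sigma."""
    n = len(sample)
    z = (sum(sample) / n) / (sigma / math.sqrt(n))
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def strict(p):
    return p < 0.05   # the threshold we committed to

def flexible(p):
    return p < 0.06   # "0.05" -- but at p = 0.055 we grudgingly reject anyway

trials = 20_000
ps = [two_sided_p([random.gauss(0, 1) for _ in range(40)]) for _ in range(trials)]

# The null is true in every trial, so every rejection is a false positive.
strict_rate = sum(map(strict, ps)) / trials
flexible_rate = sum(map(flexible, ps)) / trials
print(f"committed alpha = 0.05: false positive rate {strict_rate:.3f}")
print(f"'flexible' alpha = 0.05: false positive rate {flexible_rate:.3f}")
```

<p>The rule you actually follow, not the rule you announce, sets your false positive rate: the flexible rule lands at about six percent.</p>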
<h2 id="what-are-they-good-for">What are they good for?</h2>
<p>We’ve seen these two different approaches to significance testing, and which specific questions they’re trying to answer. Now we can try to figure out when to use each of these tools, and when neither of them is quite right.</p>
<h3 id="the-measure-of-some-things">The measure of some things</h3>
<p>If you have a specific, yes-or-no decision you need to make on limited evidence, the Neyman-Pearson framework is fantastic. For a doctor deciding whether to prescribe a drug, or a company doing A/B testing deciding whether to roll out a new feature, it is exactly the right tool. Choose your \(\alpha\) and \(\beta\) intelligently, commit to your threshold, run your experiment, and you’re done.</p>
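<p>As a concrete (and entirely hypothetical) sketch of the framework, here’s what an A/B-testing decision rule might look like in Python. Note what it returns: a decision, not a \(p\)-value.</p>

```python
import math
from statistics import NormalDist

def neyman_pearson_decision(successes, n, p_null=0.5, alpha=0.05):
    """One-shot, one-sided decision: is the success rate better than p_null?

    Returns only "reject" or "accept". The Neyman-Pearson framework
    produces a decision, not a degree of belief, so we deliberately
    don't report any measure of evidence.
    """
    se = math.sqrt(p_null * (1 - p_null) / n)  # normal approximation to the binomial
    z = (successes / n - p_null) / se
    z_crit = NormalDist().inv_cdf(1 - alpha)   # threshold fixed by alpha, chosen up front
    return "reject" if z > z_crit else "accept"

# Hypothetical A/B test against a 50% baseline:
print(neyman_pearson_decision(5400, 10_000))  # "reject": roll out the feature
print(neyman_pearson_decision(5080, 10_000))  # "accept": don't
```

<p>The crucial discipline is that \(\alpha\) goes into the function before the data does; once you’ve seen the numbers, the rule just runs.</p>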
<p>But scientific research doesn’t really work that way. In part, because we accumulate knowledge over time; we don’t need to make a big decision after one study.<strong title="Modern researchers have ways to get around that using tools like meta-analysis: at any given time you can make a decision based on all your information, and when you get new information you can make a new decision. But it's still a bit forced, and not what Neyman-Pearson was designed for."><sup id="fnref:10"><a href="#fn:10" class="footnote">10</a></sup></strong> Fisher’s methods were designed to handle this accumulation of evidence much more adroitly, since they don’t create hard cutoffs: as Fisher wrote, “decisions are final, while the state of opinion derived from a test of significance is provisional, and capable, not only of confirmation, but of revision.”</p>
<p>The bigger problem is that Neyman-Pearson and Fisher are often used to answer the wrong question entirely. <strong>Sometimes in science we just want to know whether something is real.</strong> For example, the Large Hadron Collider wanted to find out <a href="https://en.wikipedia.org/wiki/Higgs_boson#Search_and_discovery">if the Higgs Boson existed</a>. This isn’t really what Neyman-Pearson is built for—remember, it’s for making decisions, not finding the truth— but it is a yes-or-no question, so we can kind of make it work. Fisher’s methods were designed for <em>exactly</em> this question, by measuring how much evidence your experiment gives for the thing’s existence, and they are essentially what the CERN team used.</p>
<p class="center"><img src="/assets/blog/hypothesis-testing/higgs-mordor.jpg" alt="One does not simply find the Higgs Boson" class="blog-image" /></p>
<p><strong>But more often we want to measure something.</strong> This is true even for things like the Higgs search, where the initial announcement of the Higgs boson discovery was for “a new particle with a mass between \(125\) and \(127\) \(\text{GeV}/c^2\)”. It’s even more true in other contexts. In medicine, we want to know <em>how effective</em> a drug will be; in psychology we want to know <em>how strongly</em> a picture can affect our emotions; in public policy we want to know <em>how much</em> a new program will reduce poverty.</p>
<p><strong>And neither Fisher nor Neyman-Pearson answers those questions at all.</strong> It’s just not what they’re designed to do.</p>
<p>I talked about this problem in my <a href="https://jaydaigle.net/blog/replication-crisis-math/#effect-sizes">post on the replication crisis</a>. Amy Cuddy started by asking whether the power pose had an effect—a yes-or-no question. She wound up talking about <em>how large</em> the effect was, which is a completely different question. Hypothesis testing only answers the first question; if you try to use it to <em>measure</em> things you cause yourself all sorts of problems, just like the ones Cuddy ran into.</p>
<p>We also see these problems in research on politically controversial subjects like <a href="https://en.wikipedia.org/wiki/Minimum_wage#Statistical_meta-analyses">the minimum wage</a> and <a href="https://en.wikipedia.org/wiki/Gun_control">gun control</a>. Economic theory suggests that raising the minimum wage should increase unemployment; there’s an extensive literature of dueling empirical studies, with some showing that it does, and others showing that it doesn’t. A lot of ink has been spilled over whether minimum wage increases <em>really</em> increase unemployment, and that’s a genuinely tricky question that I can’t answer.<strong title="Among other things, because the answer is probably &quot;sometimes yes and sometimes no, it depends on the circumstances.&quot; And I don't think anyone seriously doubts that a minimum wage of $100 per hour would increase unemployment, and a minimum wage of $1 per hour would not."><sup id="fnref:11"><a href="#fn:11" class="footnote">11</a></sup></strong></p>
<p>But what I can do is <em>reframe</em> the question. We don’t know if the minimum wage raises are increasing unemployment. But we do know they can’t be increasing it <em>very much</em>. If they were, we’d be able to tell! So the effect may be real, but if it is, it’s <em>small</em>.<strong title="This is the difference between "practical significance" and "statistical significance" we talked about earlier. But that distinction shouldn't arise in a proper Neyman-Pearson setup, which is one way you can tell it's being misused here."><sup id="fnref:12"><a href="#fn:12" class="footnote">12</a></sup></strong> That’s a good enough answer to make policy. But it’s not an answer that hypothesis testing can give you.</p>
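<p>The standard way to make that reframed argument precise is a confidence interval, which brackets the plausible effect sizes instead of issuing a verdict. Here’s a toy Python sketch with simulated data (the effect size and sample size are invented for illustration):</p>

```python
import math
import random
from statistics import NormalDist

random.seed(2)

# Invented scenario: a real but tiny effect (0.02 standard deviations)
# studied with n = 400, which will usually fail to detect it.
n = 400
sample = [random.gauss(0.02, 1.0) for _ in range(n)]

mean = sum(sample) / n
se = 1.0 / math.sqrt(n)             # known sigma = 1, for simplicity
z975 = NormalDist().inv_cdf(0.975)  # about 1.96

lo, hi = mean - z975 * se, mean + z975 * se
print(f"95% interval for the effect: ({lo:+.3f}, {hi:+.3f})")
# The interval will typically straddle 0 ("not significant"), but it
# also rules out every LARGE effect: whatever is going on is small.
```

<p>That’s the minimum-wage argument in miniature: we can’t say the effect is zero, but we can say it isn’t big.</p>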
<p>If we care about the size of what we’re studying, and not just whether it exists at all, there are much better tools to use than hypothesis tests like Fisher or Neyman-Pearson. I’ll talk about some of these in Part 3 of this series.</p>
<h3 id="the-significance-binary">The Significance Binary</h3>
<p>The other major difference between Fisher’s approach and Neyman-Pearson is the degree of nuance allowed in their answers. In Fisher’s formulation, we ask how much evidence our experiment gives against the null hypothesis, which means we can have a lot of shades of gray in our result. The lower the \(p\)-value, the stronger the evidence; a \(p\)-value of \(0.001\) is ten times as good as a \(p\)-value of \(0.01\).</p>
<p>This still doesn’t measure the size of the effect, because you can have lots of evidence for a small effect. (I have plenty of evidence that I can move things by pushing them with my finger, but that won’t allow me to knock over the Washington Monument.) But Fisher’s methods do give a fine-grained, quantitative measurement of something: the strength of the evidence against our null hypothesis.</p>
<p>In contrast, <strong>the Neyman-Pearson formulation doesn’t give us fine distinctions</strong>. We ask if our alternative hypothesis is better than the null, and we get an answer to exactly that question—and that answer can only be “yes” or “no”. The entire continuous \(p\)-value spectrum gets compressed into that single binary verdict, with no middle ground.</p>
<p class="center"><img src="/assets/blog/hypothesis-testing/false-dichotomy.jpg" alt="Image of umpire: &quot;False dichotomy on the play. Arbitrarily reducing a set of many possibilities to only two.&quot;" class="blog-image" /></p>
<p>That’s a huge problem when nuance is important, with consequences visible throughout the body of scientific literature. But the problems are especially bad in contexts like public health communication, where both honesty and clarity save lives.</p>
<p>Our medical establishment uses what’s essentially a Neyman-Pearson framework to evaluate possible treatments. And it is (understandably) conservative about approving new drugs, which means that \(\alpha\) is set quite low. We get a lot of false negative results, denying treatments that would work. And in a terrible misuse of language, when a treatment doesn’t clear our fairly high bar for significance, we tend to say there is <a href="https://twitter.com/zeynep/status/1366175070507384836">“no evidence”</a> for it, or even flatly that it “doesn’t work”—whether we mean that it definitely doesn’t work, or that it probably does but we’re not quite sure yet.</p>
<p>This failing was on full display in the early days of the coronavirus pandemic. In February and March 2020, the Surgeon General issued a statement that masks “are NOT effective in preventing” Covid infections, even though we had good reasons to believe they were; the evidence was real, but not (yet) sufficient to reject the null. In December, the World Health Organization said there was <a href="https://twitter.com/WHO/status/1254160944638447618">no evidence that vaccines would reduce Covid transmission</a>. Again, there was real evidence that vaccines would reduce transmission, but not enough to cross WHO’s Neyman-Pearson-style decision threshold. And because of the binary output of a Neyman-Pearson process, this tentative wait-and-see approach was communicated in the form of definitive, final-sounding judgments.</p>
<p>There are definitely smarter and more sophisticated ways to use hypothesis testing on questions like this. First, it would help just to remember that our results are provisional and not absolute truths. Sometimes we do have to make a decision <em>now</em> about whether to prescribe a treatment, or roll out a new product, or even just change some official guidelines. But that doesn’t mean we’re locked into that decision forever; and simply saying there was “not enough” evidence for masks, rather than “no evidence”, would have been more honest and <em>also</em> made the subsequent reversal less confusing.</p>
<p>Second, when we do have to make decisions, we can be more thoughtful about the trade-offs between false positives and false negatives. It’s become standard to take \(\alpha=0.05\) and let \(\beta\) fall where it may; but the decision theory works best when we think about the actual trade-offs involved, and choose our parameters accordingly. That, too, would have helped with communication around Covid: the risks of having people wear masks for a couple months while we figured out if they helped were low, and we didn’t need to be as cautious about recommending masking as we are about approving a new cancer drug.</p>
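<p>To see that trade-off concretely, here’s a small Python sketch (with made-up numbers) that computes \(\beta\) for a one-sided \(z\)-test at several choices of \(\alpha\): tightening \(\alpha\) always inflates \(\beta\).</p>

```python
from statistics import NormalDist

def beta_for(alpha, effect, n, sigma=1.0):
    """False-negative rate of a one-sided z-test of H0: mean = 0
    against a true mean of `effect`, with known sigma."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha)  # rejection threshold set by alpha
    se = sigma / n ** 0.5
    # We fail to reject when the observed z falls below z_crit;
    # under the alternative, z is normal with mean effect/se.
    return norm.cdf(z_crit - effect / se)

# A modest effect and n = 100: every notch of caution on alpha
# costs us more missed detections.
for alpha in (0.10, 0.05, 0.01):
    print(f"alpha = {alpha:.2f} -> beta = {beta_for(alpha, 0.2, 100):.2f}")
```

<p>For a cheap, low-risk intervention like masking, a table like this is an argument for tolerating a larger \(\alpha\) in exchange for a smaller \(\beta\); for a risky cancer drug, the balance tips the other way.</p>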
<h2 id="where-do-we-stand">Where do we stand?</h2>
<p>Hypothesis tests are ways of using data to give yes-or-no answer to certain questions. They’re extremely powerful in the contexts they were designed for: Neyman-Pearson gives a good rule for making decisions, and Fisher gives a good approach to describing how much evidence your experiment produced. But when you try to apply them outside of those contexts, you can easily get confusing or misleading results.</p>
<p>But this essay has presented both approaches to hypothesis testing more or less as they were originally designed, in their original contexts. Modern hypothesis testing works a little differently. <strong>The Fisher approach gives us a nuanced evaluation of the evidence, but no firm conclusion; the Neyman-Pearson approach gives us a clear answer, but nothing else.</strong></p>
<p>But modern researchers often want both. Modern methods try to deliver. And modern methods often, predictably, fail.</p>
<p>Next time in <a href="/blog/hypothesis-testing-part-2">Part 2</a> we’ll see how the modern approach to hypothesis testing works. And we’ll see how the modifications we’ve made to try to have it both ways lose some of the benefits of both approaches, and invite the sort of research failures that we’ve seen throughout the replication crisis.</p>
<p class="center"><img src="/assets/blog/hypothesis-testing/cats-soon.jpg" alt="Cats staring at city skyline. Caption: &quot;Soon.&quot;" class="blog-image" /></p>
<hr />
<p><em>Have questions about hypothesis testing? Is there something I didn’t cover, or even got completely wrong? Or is there something you’d like to hear more about in the rest of this series? Tweet me <a href="https://twitter.com/profjaydaigle">@ProfJayDaigle</a> or leave a comment below.</em></p>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>Some of these specific people were <a href="https://nautil.us/how-eugenics-shaped-statistics-9365/">pretty awful in one way or another</a>. Ronald Fisher in particular was <a href="https://www.newstatesman.com/uncategorized/2020/07/ra-fisher-and-science-hatred">racist</a> and a <a href="https://priceonomics.com/why-the-father-of-modern-statistics-didnt-believe/">vigorous defender of tobacco companies</a>, though Jerzy Neyman seems to have been <a href="https://daniellakens.blogspot.com/2021/09/jerzy-neyman-positive-role-model-in.html?m=1">perfectly lovely</a>. I’m not going to go into detail about their failings, among other things because I’m not especially well-informed on the subject; I recommend the articles I linked if you want to know more. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>Much of this essay, and especially the historical information on the way these schools of thought developed, draws heavily on the article <a href="https://doi.org/10.1198/0003130031856">Confusion Over Measures of Evidence (\(p\)’s) Versus Errors (\(\alpha\)’s) in Classical Statistical Testing</a> by Hubbard and Bayarri. This extremely readable article is also a fascinating historical artifact, basically predicting the entire contour of the replication crisis in 2003. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3">
<p>Okay, maybe I didn’t <em>actually</em> expect my car to be accurate and unbiased. But it’s at least <em>supposed</em> to be true, so it provides a good baseline for comparison. <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
<li id="fn:4">
<p>You might worry about whether it’s a two-sided but biased coin. But Gelman and Nolan have argued that <a href="https://www.tandfonline.com/doi/abs/10.1198/000313002605">coins physically can’t be biased</a>, and I find their argument compelling. If you don’t find it compelling, you have to decide how likely you think a weighted coin would be—which is exactly the “other evidence” that Fisher’s paradigm doesn’t even try to account for. <a href="#fnref:4" class="reversefootnote">↩</a></p>
</li>
<li id="fn:5">
<p>A friend asks if meta-analysis accomplishes the same thing, but meta-analysis is actually a much weaker threshold than the one Fisher gives here. Meta-analysis tries to amplify weak signals and reconcile inconsistent results; Fisher says we should only believe a claim when we can consistently get a strong signal. <a href="#fnref:5" class="reversefootnote">↩</a></p>
</li>
<li id="fn:6">
<p>From what I understand, Fisher was a little contemptuous of the idea that you could answer this question mathematically. <a href="#fnref:6" class="reversefootnote">↩</a></p>
</li>
<li id="fn:7">
<p>I’m not convinced I agree with this, but that’s beside the point here. I’ll discuss this choice a bit more in Part 2 of this series. <a href="#fnref:7" class="reversefootnote">↩</a></p>
</li>
<li id="fn:8">
<p>In a medical context, we often talk about the related concepts of <em>sensitivity</em> and <em>specificity</em>. Sensitivity is the “true positive” rate \(1-\beta\), the probability of correctly prescribing the drug if it would help. Specificity is the “true negative” rate \(1-\alpha\), the probability of correctly withholding the drug if it would not help.</p>
<p>These terms come from diagnostic testing. “Sensitivity” measures the chance of correctly detecting a condition that you have; “specificity” measures the chance of correctly detecting that you don’t have a condition. <a href="#fnref:8" class="reversefootnote">↩</a></p>
</li>
<li id="fn:9">
<p>All three were <a href="https://en.wikipedia.org/wiki/Frequentist_probability">frequentists</a>, and believed (roughly) that you can only give a “probability” for something repeatable. You can talk about the probability a study will give a null result, since you could run a hundred studies and count how many give the null. But you can’t talk about the probability that a given drug works, since there’s only the one drug.</p>
<p>The major modern alternative to frequentist probability is <a href="https://en.wikipedia.org/wiki/Bayesian_probability">Bayesianism</a>, which <em>does</em> think this question makes sense. I’ve written about Bayesian reasoning in the past and I’ll come back to it in Part 3 of this series. But the Neyman-Pearson method is definitely not Bayesian. <a href="#fnref:9" class="reversefootnote">↩</a></p>
</li>
<li id="fn:10">
<p>Modern researchers have ways to get around that using tools like meta-analysis: at any given time you can make a decision based on all your information, and when you get new information you can make a new decision. But it’s still a bit forced, and not what Neyman-Pearson was designed for. <a href="#fnref:10" class="reversefootnote">↩</a></p>
</li>
<li id="fn:11">
<p>Among other things, because the answer is probably “sometimes yes and sometimes no, it depends on the circumstances.” And I don’t think anyone seriously doubts that a minimum wage of \$100 per hour would increase unemployment, and a minimum wage of \$1 per hour would not. <a href="#fnref:11" class="reversefootnote">↩</a></p>
</li>
<li id="fn:12">
<p>This is the difference between “practical significance” and “statistical significance” we talked about earlier. But that distinction shouldn’t arise in a proper Neyman-Pearson setup, which is one way you can tell it’s being misused here. <a href="#fnref:12" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Jay DaigleThis is the first part of a three-part series explaining what hypothesis testing is and how it works. In this essay I'll talk about the way hypothesis testing developed historically, in two rival schools of thought. I'll explain how these two methodologies were originally supposed to work, and why you might (or might not) want to use them.Why Isn’t There a Replication Crisis in Math?2022-02-02T00:00:00-08:002022-02-02T00:00:00-08:00https://jaydaigle.net/blog/replication-crisis-math<p>One important thing that I think about a lot, even though I have no formal expertise, is the <a href="https://www.vox.com/future-perfect/21504366/science-replication-crisis-peer-review-statistics">replication crisis</a>. A shocking fraction of published research in many fields, including medicine and psychology, is flatly wrong—the results of the studies can’t be obtained in the same way again, and the conclusions don’t hold up to further investigation. Medical researcher John Ioannidis brought this problem to wide attention in 2005 with a paper titled <a href="https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.0020124">Why Most Published Research Findings Are False</a>; attempts to replicate the results of major psychology papers suggest that <a href="https://www.theatlantic.com/science/archive/2018/11/psychologys-replication-crisis-real/576223/">only about half of them hold up</a>. A recent analysis gives <a href="https://apnews.com/article/science-business-health-cancer-marcia-mcnutt-93219170405e3de753651b89d4308461">a similar result for cancer research</a>.</p>
<p>This is a real crisis for the whole process of science. If we can’t rely on the results of famous, large, well-established studies, it’s hard to feel secure in <em>any</em> of our knowledge. It’s probably the most important problem facing the entire project of science right now.</p>
<p>There’s a lot to say about the mathematics we use in social science research, especially the statistics, and how bad math feeds the replication crisis.<strong title="I'm a big fan of the [Data Colada] project, and of [Andrew Gelman's writing] on the subject"><sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup></strong> But I want to approach it from a different angle. <strong>Why doesn’t <em>the field of mathematics</em> have a replication crisis?</strong> And what does that tell us about other fields, that do?</p>
<h2 id="why-doesnt-math-have-a-replication-crisis">Why doesn’t math have a replication crisis?</h2>
<h3 id="maybe-mathematicians-dont-make-mistakes">Maybe mathematicians don’t make mistakes</h3>
<p>Have you, uh, <a href="https://mathwithbaddrawings.com/2017/01/11/why-are-mathematicians-so-bad-at-arithmetic/">met any mathematicians</a>?</p>
<p style="text-align: center;"><a href="https://mathwithbaddrawings.com/2017/01/11/why-are-mathematicians-so-bad-at-arithmetic/"><img src="/assets/blog/replication-crisis-math/sign-error.jpg" alt="Cartoon: &quot;So the tip is...$70? But the meal was only $32...&quot; &quot;Maybe we made a sign error, and they owe us $70.&quot;" width="75%" /></a></p>
<p style="text-align: center;"><em>Comic by Ben Orlin at <a href="https://mathwithbaddrawings.com/2017/01/11/why-are-mathematicians-so-bad-at-arithmetic/">Math with Bad Drawings</a></em></p>
<p style="text-align: center;"><em>At Caltech, they made the youngest non-math major split the check: the closer you were to high school, the more you remembered of basic arithmetic. But everyone knew the math majors were hopeless.</em></p>
<p>More seriously, it’s reasonably well-known among mathematicians that <strong>published math papers are <a href="https://twitter.com/benskuhn/status/1419281164951556097"><em>full</em> of errors</a></strong>. Many of them are eventually fixed, and most of the errors are in a deep sense “unimportant” mistakes. But the frequency with which proof formalization efforts <a href="https://mathoverflow.net/questions/291158/proofs-shown-to-be-wrong-after-formalization-with-proof-assistant">find flaws in widely-accepted proofs</a> suggests that there are plenty more errors in published papers that no one has noticed.</p>
<p>So math has, if not a replication crisis, at least a replication problem. Many of our published papers are flawed. But it doesn’t seem like we have a crisis.</p>
<h3 id="maybe-our-mistakes-get-caught">Maybe our mistakes get caught</h3>
<p>In the social sciences, replicating a paper is hard. You have to get new funding and run a new version of the same experiment. There’s a lot of dispute about how closely you need to replicate all the mechanics of the original experiment for it to “count” as a replication, and sometimes you can’t get a lot of the details you’d need to do it right—especially if the original authors aren’t feeling helpful.<strong title="In theory, all papers should include enough information that you can replicate all the experiments they describe. In practice, I think this basically never happens. There's just too much information, and it's hard to even guess which things are going to be important."><sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup></strong> And after all that work, people won’t even be impressed, because you didn’t do anything original!</p>
<p>But one of the distinctive things about math is that our papers aren’t just records of experiments we did elsewhere. In experimental sciences, the experiment is the “real work” and the paper is just a description of it. But <strong>in math, the paper, itself, is the “real work”</strong>. Our papers don’t describe everything we do, of course. There’s a lot of intellectual exploration and just straight-up messing around that doesn’t get written down anywhere. But the paper contains a (hopefully) complete version of the argument that we’ve constructed.</p>
<p>And that means that <strong>you can <em>replicate</em> a math paper by <em>reading</em> it</strong>. When I’ve served as a peer reviewer I’ve read the papers closely and checked all the steps of the proofs, and that means that I have replicated the results. And any time you want to use an argument from someone else’s paper, you have to work through the details, and that means you’re replicating it again.</p>
<p>The replication crisis is partly the discovery that many major social science results do not replicate. But it’s also the discovery that we hadn’t been trying to replicate them, and we really should have been. In the social sciences we fooled ourselves into thinking our foundation was stronger than it was, by never testing it. But in math we couldn’t avoid testing it.</p>
<h3 id="maybe-the-crisis-is-here-and-we-just-havent-noticed">Maybe the crisis is here, and we just haven’t noticed</h3>
<p>As our mathematics gets more advanced and our results get more complicated, this replication process becomes harder: it takes more time, knowledge, and expertise to understand a single paper. If replication gets hard enough, we may fall into crisis. The crisis might even <a href="https://link.springer.com/article/10.1007/s00283-020-10037-7">already be here</a>; the problems in psychological and medical research existed for decades before they were widely appreciated.</p>
<p>There’s some fascinating work in using <a href="https://www.nature.com/articles/d41586-021-01627-2">computer tools to formally verify proofs</a>, but this is still a niche practice. In theory we are continually re-checking all our work, but in practice that’s inconsistent, so it’s hard to be sure how deep the problems run. (Especially since flawed papers <a href="https://twitter.com/zbMATH/status/1474326312517271560">don’t really get retracted</a> and you pretty much have to talk to active researchers in a field to know which papers you can trust.)</p>
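<p>For a taste of what that formal verification looks like, here’s a toy theorem in Lean 4 (assuming the Mathlib library is available; the theorem and its name are my own trivial example). The point isn’t the statement, it’s that the proof checker verifies every step mechanically, so a compiled proof can’t hide the kind of gap a human referee might miss.</p>

```lean
-- Lean 4 with Mathlib. If this file compiles, the kernel has checked
-- every step of the proof; nothing is taken on faith.
import Mathlib.Tactic

theorem sum_sq (a b : ℕ) : (a + b) ^ 2 = a ^ 2 + 2 * a * b + b ^ 2 := by
  ring
```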
<p style="text-align: center;"><img src="/assets/blog/replication-crisis-math/trust.jpg" alt="Picture of kitten in bubble bath with caption: &quot;my trust, u loses it.&quot;" width="50%" /></p>
<p>But while this is a real possibility that people should take seriously, I’m skeptical that we’re in the middle of a true crisis of replicability.<strong title="I'm sure every practitioner in every field says that, though, even years after the problems become obvious to anyone who looks. So take this with a grain of salt."><sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup></strong> <strong>Many papers have errors, yes—but our major results generally hold up, even when the intermediate steps are wrong!</strong> Our errors can usually be fixed without really changing our conclusions.</p>
<p>Since our main conclusions hold up, we don’t need to fix any downstream papers that relied on those conclusions. We don’t need to substantially revise what we thought we knew. We don’t need to jettison entire fields of research, the way <a href="https://replicationindex.com/2017/02/02/reconstruction-of-a-train-wreck-how-priming-research-went-of-the-rails/comment-page-1/">psychology had to abandon the literature on social priming</a>. There are problems, to be sure, and we could always do better. But it’s not a crisis.</p>
<h3 id="mysterious-intuition">“Mysterious” intuition</h3>
<p>But isn’t it…<em>weird</em>…that our results hold up when our methods don’t? How does that even work?</p>
<p>We get away with it because we can be right for the wrong reasons—<strong>we mostly only try to prove things that are basically true</strong>. Ben Kuhn tweeted a very accurate-feeling summary of the whole situation <a href="https://twitter.com/benskuhn/status/1419281164951556097">in this twitter thread</a>:</p>
<blockquote>
<p>[D]espite the fact that error-correction is really hard, publishing actually false results was quite rare because “people’s intuition about what’s true is mysteriously really good.” Because we mostly only try to prove true things, our conclusions are right even when our proofs are wrong.<strong title="A friend asks: if we mostly know what's true already, why do we need to actually find the proofs? The bad answer is &quot;you're not doing math if you don't prove things&quot;. The good answer is that finding proofs is how we train this mysteriously good intuition; if we didn't work out proofs in detail, we wouldn't be able to make good guesses about the next steps."><sup id="fnref:4"><a href="#fn:4" class="footnote">4</a></sup></strong></p>
</blockquote>
<p>This can make it weirdly difficult to resolve disagreements about whether a proof is actually correct. In a recent example, Shinichi Mochizuki claims that he has <a href="https://www.quantamagazine.org/titans-of-mathematics-clash-over-epic-proof-of-abc-conjecture-20180920/">proven the \(abc\) conjecture</a>, while most mathematicians don’t believe his argument is valid. But everyone involved is pretty confident the \(abc\) conjecture is true; the disagreement is about whether the proof itself is good.</p>
<p style="text-align: center;"><img src="/assets/blog/replication-crisis-math/proof.jpg" alt="Picture of cat walking through kitchen covered in trash: &quot;come find me when you have proof.&quot;" width="75%" /></p>
<p style="text-align: center;"><em>Circumstantial evidence isn’t enough to make mathematicians happy.</em></p>
<p>If we find a counterexample to \(abc\) then Mochizuki is clearly wrong, but so is everyone else. If we find a consensus proof of \(abc\), then Mochizuki’s conclusion is right, but that does very little to make his argument more convincing. He could, very easily, just be lucky.</p>
<h2 id="butpsychologists-have-intuition-too">But—Psychologists have intuition, too</h2>
<p>A lot of psychology results that don’t replicate look a little different from this perspective. Does standing in a <a href="https://en.wikipedia.org/wiki/Power_posing">power pose</a> for a few seconds make you feel more confident? Probably! It sure feels like it does (seriously, stand up and give it a try right now); and it would be weird if it made you feel <em>worse</em>. Does it affect you enough, for a long enough time, to matter much? Probably not. That would also be weird.</p>
<p style="text-align: center;"><img src="/assets/blog/replication-crisis-math/power-pose.jpg" alt="Picture of Amy Cuddy standing in front of a picture of Wonder Woman, in matching poses" width="50%" /></p>
<p style="text-align: center;"><em>Amy Cuddy demonstrating a power pose. <br />
Photo by Erik (HASH) Hersman from Orlando, <a href="https://creativecommons.org/licenses/by/2.0">CC BY 2.0</a>, via <a href="https://commons.wikimedia.org/wiki/File:Power_pose_by_Amy_Cuddy_at_PopTech_2011_(6279920726).jpg">Wikimedia Commons</a></em></p>
<p>The studies we’ve done, when analyzed properly, don’t show a clear, consistent, and measurable effect from a few seconds of power posing. But that’s what you’d expect, right? There’s probably an effect, but it should be too small to reasonably measure. And that’s totally consistent with everything we’ve found.</p>
<p>Amy Cuddy<strong title="I'm going to pick on Amy Cuddy and power posing a lot. That's not entirely fair to Cuddy; the pattern I'm describing is extremely common and easy to fall into, and I could make the same argument about [social priming research] or the [hungry judges study] or the dozens of others. (That's why it's a &quot;replication crisis&quot; and not a &quot;this one researcher made a mistake one time crisis&quot;.) But for simplicity I'm going to stick to the same example for most of this post."><sup id="fnref:5"><a href="#fn:5" class="footnote">5</a></sup></strong> had the intuition that power posing would increase confidence, and set out to prove it—just like Mochizuki had the intuition that the \(abc\) conjecture was true, and set out to prove it. Mochizuki’s proof was bad, but his top-line conclusion was probably right because the \(abc\) conjecture is probably correct. And Cuddy’s studies were flawed, but her intuition at the start was probably right, so her top-line conclusion is probably true.</p>
<p>Well, sort of.</p>
<h3 id="defaulting-to-zero">Defaulting to zero</h3>
<p>Let’s turn Cuddy’s question around for a bit.<strong title="Mathematicians love doing this. I'm a mathematician, so I love doing this. But it's genuinely a useful way to think about what's going on."><sup id="fnref:6"><a href="#fn:6" class="footnote">6</a></sup></strong> What are the chances that power posing has <em>exactly zero</em> effect on your psychology? That would be extremely surprising. Most things you do affect your mindset at least a little.<strong title="This is your regular reminder to stand up, stretch, and drink some water."><sup id="fnref:7"><a href="#fn:7" class="footnote">7</a></sup></strong></p>
<p>So our expectation should be: either power posing makes you a little more confident, or it makes you a little less confident. It also probably makes you either a little more friendly or a little less friendly, a little more or a little less experimental, a little more or a little less agreeable—<strong>an effect of exactly zero would be a surprise</strong>.</p>
<p>But for confidence specifically, it would also be kind of surprising if power posing made you feel less confident. So my default assumption is that power posing causes a small increase in confidence. And nominally, Cuddy’s research asked whether that default assumption is correct.</p>
<p>But that’s just not a great question. It doesn’t really matter if standing in a power pose makes you feel marginally better for five seconds. Not worth a book deal and a TED talk, and barely worth publishing. <strong>Cuddy’s research was interesting because it suggested the effect of power posing was not only positive, but <em>large</em></strong>—enough to make a dramatic, usable impact over an extended period of time.</p>
<p>If Cuddy’s results were true, they would be both surprising and important. But that’s just another way of saying they’re probably not true.</p>
<h3 id="power-and-precision">Power and Precision</h3>
<p>Notice: we’ve shifted to a new, different question. We started out asking “does power posing make you more confident”, but now we’re answering “how much more confident does power posing make you”. This is a better question, sure, but it’s different. And <strong>the statistical tools appropriate to the first question don’t really work for the new and better one.</strong></p>
<p><a href="https://en.wikipedia.org/wiki/Statistical_hypothesis_testing">Statistical hypothesis testing</a> is designed to give a yes/no answer to “is this effect real”. Hypothesis testing is surprisingly complicated to actually explain correctly, and probably deserves <a href="/blog/hypothesis-testing-part-1">an essay</a> or two on its own.<strong title="I originally tried to write a concise explanation to include here. It hit a thousand words and was nowhere near finished, so I decided to save it for later. Update: I have now posted the [first] and [second] essays in a three-part series on hypothesis testing."><sup id="fnref:8"><a href="#fn:8" class="footnote">8</a></sup></strong></p>
<p style="text-align: center;"><img src="/assets/blog/replication-crisis-math/hypothesis-testing.png" alt="Diagram of true and false positives and negatives on a bell curve" width="75%" /></p>
<p style="text-align: center;"><em>I swear this picture makes sense.<br />
ROC_curves.svg: Sharprderivative work: נדב ס, <a href="https://creativecommons.org/licenses/by-sa/4.0">CC BY-SA 4.0</a>, via <a href="https://commons.wikimedia.org/wiki/File:ROC_curves_colors.svg">Wikimedia Commons</a></em></p>
<p>To wildly oversimplify, we measure something, and check if that measurement is so big that it’s unlikely to occur by chance. If yes, we conclude that there’s a real effect from whatever we’re studying. If not, we generally conclude that there’s no effect.</p>
<p>But what if the effect is real, but very small? With this method, we conclude the effect is real if our measurements are big enough. <strong>But if the effect is small, our measurements won’t be <em>big</em>. Our study might not have enough <a href="https://en.wikipedia.org/wiki/Power_of_a_test">power</a> to find the effect</strong> even if it is real.<strong title="This means we have to be really careful about interpreting studies that don't find any effect. A study with low power will find "[no evidence]" of an effect even if the effect is very real, and that can be [just as misleading] as the errors I'm discussing in this essay. More careful researchers will say they "fail to reject the null hypothesis" or "fail to find an effect". If everyone were always that careful I wouldn't need to write this essay."><sup id="fnref:9"><a href="#fn:9" class="footnote">9</a></sup></strong></p>
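<p>To make “power” concrete, here’s a minimal simulation sketch. All numbers are invented for illustration, not taken from any real study: a small but genuinely real effect, tested at the usual 5% significance threshold, at two different sample sizes.</p>

```python
import random

random.seed(0)

def run_study(true_effect, n, trials=10_000):
    """Simulate `trials` studies, each measuring a mean over n subjects
    with unit-variance noise, and return the fraction that come out
    statistically significant (|estimate| > 1.96 standard errors)."""
    se = 1 / n ** 0.5
    significant = 0
    for _ in range(trials):
        estimate = random.gauss(true_effect, se)
        if abs(estimate) > 1.96 * se:
            significant += 1
    return significant / trials

# A real but tiny effect (0.05 standard deviations), two sample sizes:
print(run_study(true_effect=0.05, n=50))      # low power: only ~6% of studies detect it
print(run_study(true_effect=0.05, n=10_000))  # high power: nearly every study detects it
```

<p>The effect is exactly as real in both cases; only the precision of the measurement changes whether studies can see it.</p>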
<p>We could run a more powerful study and find evidence of smaller effects if we could make more precise measurements. This approach has worked really well in fields like physics and chemistry, and a lot of fundamental physical discoveries were driven by new technology that allowed the measurement of smaller effects. Galileo’s experiments with falling speeds required him to invent <a href="https://www.thegreatcoursesdaily.com/the-rolling-ball-experiments-galileos-terrestrial-mechanics/">improved timekeeping methods</a>, and Coulomb developed his inverse-square law after <a href="https://en.wikipedia.org/wiki/Coulomb%27s_law#History">his torsion balance</a> allowed him to precisely measure electrostatic attraction. In the modern era, we built extremely sensitive measurement devices to try to measure <a href="https://en.wikipedia.org/wiki/LIGO">gravitational waves</a> and detect <a href="https://en.wikipedia.org/wiki/Higgs_boson#Search_and_discovery">the Higgs boson</a>.</p>
<p>If power posing increases confidence by 1% for thirty seconds, that would actually be perfectly fine if we could measure confidence to within a hundredth of a percent on a second-to-second basis. But social psychology experiments just don’t work that way—at least, not with our current technology. There’s too much randomness and behavioral variation. Effects of that size just aren’t detectable.</p>
<p>This doesn’t have to be a problem! If we want to know “how big is the effect of power posing”, the answer is “too small to detect”. That’s a fine answer. It tells you that you shouldn’t build any complicated apparatus based on exploiting the power pose. (Or write <a href="https://www.goodreads.com/book/show/25066556-presence">entire books</a> on how it can change your life.)</p>
<p>But the question we started with was “does power posing have an effect at all?”. If the effect is small, we might struggle to tell whether it’s real or not.</p>
<h3 id="but-we-already-know-the-answer">But we already know the answer!</h3>
<p>Imagine you’re a psychologist researching power posing. You measure a small effect, which could just be due to chance. But you’re pretty sure that the effect is real; clearly you didn’t do a good enough job in your study! It’s probably <a href="https://en.wikipedia.org/wiki/Publication_bias">not even worth publishing</a>.</p>
<p style="text-align: center;"><img src="/assets/blog/replication-crisis-math/X-Men-question-answer.gif" alt="Gif from X-Men movie. &quot;Why do you ask questions to which you already know the answers?&quot;" width="75%" /></p>
<p>So you try again. Or someone else tries again. And eventually someone runs a study that <em>does</em> see a large effect. (Occasionally the large effect is due to fraud. Usually it’s methodology with subtler flaws that the researcher doesn’t notice. And sometimes it’s just luck: you’ll get a one-in-twenty outcome once in every twenty tries.)</p>
<p>Now we’re all happy. We were pretty sure that we would see an effect if we looked closely enough. And there it is! At this point no one has an incentive to look for flaws in the study. The result makes sense. (You might remember we said this is the state of a lot of mathematical research.)</p>
<p>But there are two major problems we can run into here. The first is that <strong>our intuition can, in fact, be wrong</strong>. If your process can only ever prove things that you already believed, it’s not a good process; you can’t really learn anything. Andrew Gelman <a href="https://statmodeling.stat.columbia.edu/2021/11/18/fake-drug-studies/">recently made this observation about fraudulent medical research</a>:</p>
<blockquote>
<p>If you frame the situation as, “These drugs work, we just need the paperwork to get them approved, and who cares if we cut a few corners, even if a couple people die of unfortunate reactions to these drugs, they’re still saving thousands of lives,” then, sure, when you think of aggregate utility we shouldn’t worry too much about some fraud here and there…</p>
</blockquote>
<blockquote>
<p>But I don’t know that this optimistic framing is correct. I’m concerned that bad drugs are being approved instead of good drugs….Also, negative data—examples where the treatment fails to work as expected—provide valuable information, and by not doing real trials you’re depriving yourself of opportunities to get this feedback.</p>
</blockquote>
<p>Shoddy research practices make sense if you see scientific studies purely as bureaucratic hoops you have to jump through: it’s “obviously true” that power posing will make you bolder and more confident, and the study is just a box you have to check before you can go around saying that out loud. But <strong>if you want to learn things, or be surprised by your data, you need to be more careful</strong>.</p>
<h2 id="effect-sizes">Effect Sizes Matter</h2>
<h3 id="overestimation">Overestimation</h3>
<p>The second problem can bite you even if your original intuition is right. You start out just wanting to know “is there an effect, y/n?”, but your experiment will make a measurement. You will get an estimate of the <em>size</em> of the effect. And that estimate will be wrong.</p>
<p>Your estimate will be wrong for a silly, almost tautological reason: <strong>if you can only detect large effects, then any effect you detect will be large</strong>. If you keep looking for an effect, over and over again, until finally one study gets lucky and sees it, that study will almost necessarily give <a href="https://statmodeling.stat.columbia.edu/2014/11/17/power-06-looks-like-get-used/">a wild overestimate</a> of the effect size.</p>
<p style="text-align: center;"><img src="https://statmodeling.stat.columbia.edu/wp-content/uploads/2014/11/Screen-Shot-2014-11-17-at-11.19.42-AM.png" alt="A diagram of the effects of low-power studies.
This is what &quot;power = 0.06&quot; looks like. Get used to it.
Type S error probability: If the estimate is statistically significant, it has a 24% chance of having the wrong sign.
Exaggeration ratio: If the estimate is statistically significant, it must be at least 9 times higher than the effect size." width="75%" /></p>
<p style="text-align: center;"><em>If the effect is small relative to your measurement precision, your results are guaranteed to be misleading. Figure by <a href="https://statmodeling.stat.columbia.edu/2014/11/17/power-06-looks-like-get-used/">Andrew Gelman</a>.</em></p>
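<p>You can reproduce this “winner’s curse” with a quick Monte Carlo sketch. The numbers below are illustrative values consistent with Gelman’s figure (a true effect of 2 units measured with standard error 8.1 gives power near 0.06), not data from any actual study.</p>

```python
import random

random.seed(1)

# Illustrative teaching numbers, not data from any actual study:
# a true effect of 2 units measured with standard error 8.1.
TRUE_EFFECT, SE = 2.0, 8.1
CUTOFF = 1.96 * SE  # an estimate is "significant" if it exceeds this

significant = []
for _ in range(200_000):
    estimate = random.gauss(TRUE_EFFECT, SE)
    if abs(estimate) > CUTOFF:
        significant.append(estimate)

power = len(significant) / 200_000
wrong_sign = sum(e < 0 for e in significant) / len(significant)
exaggeration = sum(abs(e) for e in significant) / len(significant) / TRUE_EFFECT

print(f"power = {power:.2f}")             # about 0.06
print(f"Type S rate = {wrong_sign:.2f}")  # about 0.24: significant, but the wrong sign
print(f"exaggeration = {exaggeration:.1f}x")  # roughly 9-10x the true effect
```

<p>Only the lucky, extreme estimates clear the significance cutoff, so the published ones are guaranteed to be wildly too big, and a quarter of them point the wrong way.</p>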
<p>And this is how you wind up with shoddy research telling you that all sorts of things have shockingly large and dramatic impacts on…whatever you’re studying. You start out with the intuition that power posing should increase confidence, which is reasonable enough. You run studies, and eventually one of them agrees with you: power posing does make you more confident. But not just a little. In your study, people who did a little power posing saw big benefits.</p>
<p>To your surprise, you’ve discovered a life-changing innovation. You issue press releases, write a book, give a TED talk, spread the good news of how much you can benefit from this little tweak to your life.</p>
<p>Then other researchers try to probe the effect further—and it vanishes. Most studies don’t find clear evidence at all. The ones that do find something show much smaller effects than you had found. Of course they do. Your study had an unusually rare result, because that’s why it got published in the first place.</p>
<h3 id="dont-forget-your-prior">Don’t forget your prior</h3>
<p>Notice how, in all of this, we lost sight of our original hypothesis. It seemed basically reasonable to think power posing might perk you up a bit. That’s what we originally wanted to test, and that’s the conviction that made us keep trying. But we <em>didn’t</em> start out thinking that it would have a huge, life-altering impact.</p>
<p><strong>A really large result should feel just as weird as no result at all, if not weirder</strong>. And when we stop to think about that, we know it; some research suggests that <a href="https://twitter.com/BrianNosek/status/1034093709971873794">social scientists have a pretty good idea which results are actually plausible</a>, and which are nonsense overestimates. But since we started with the question “is there an effect at all”, the large result we got <em>feels</em> like it confirms our original belief, even though it really doesn’t.</p>
<p>This specific combination is dangerous. The direction of the effect is reasonable and expected, so we accept the study as plausible. The size of the effect is shocking, which makes the study <em>interesting</em>, and gets news coverage and book deals and TED talks.</p>
<p>And this process repeats itself over and over, and the field builds up a huge library of incredible results that <a href="https://statmodeling.stat.columbia.edu/2017/12/15/piranha-problem-social-psychology-behavioral-economics-button-pushing-model-science-eats/">can’t possibly all be true</a>. Eventually the music stops, and there’s a crisis, and that’s where we are today. But it all starts somewhere reasonable: with people trying to prove something that is obviously true.</p>
<h3 id="so-how-is-math-different">So how is math different?</h3>
<p>This is exactly the situation we said math was in. Mathematicians have a pretty good idea of what results should be true; but so do psychologists! Mathematicians sometimes make mistakes, but since they’re mostly trying to prove true things, it all works out okay. Social scientists are also (generally) trying to prove true things, but it doesn’t work out nearly so well. Why not?</p>
<p>In math, a result that’s too good <em>looks</em> just as troubling as one that isn’t good enough. The idea of “<a href="https://en.wikipedia.org/wiki/Proving_too_much">proving too much</a>” is a core tool for reasoning about mathematical arguments. It’s common to critique a proposed proof with something like “if that argument worked, it would prove all numbers are even, and we know that’s wrong”. This happens at all levels of math, whether you’re in college taking Intro to Proofs, or vetting a high-profile attempt to solve a major open problem. <strong>We’re in the habit of checking whether a result is—literally!—too good to be true</strong>.</p>
<p style="text-align: center;"><img src="/assets/blog/replication-crisis-math/anti-gravity-cat.jpg" alt="Picture of a floating cat. &quot;damn anti-gravity cat always disproving ma theorem&quot;" width="50%" /></p>
<p>We could bring a similar approach to social science research. Daniël Lakens <a href="http://daniellakens.blogspot.com/2017/07/impossibly-hungry-judges.html">uses this sort of argument</a> to critique a <a href="https://www.pnas.org/content/108/17/6889.short">famous study</a> on hunger and judicial decisions:</p>
<blockquote>
<p>I think we should dismiss this finding, simply because it is impossible. When we interpret how impossibly large the effect size is, anyone with even a modest understanding of psychology should be able to conclude that it is impossible that this data pattern is caused by a psychological mechanism. As psychologists, we shouldn’t teach or cite this finding, nor use it in policy decisions as an example of psychological bias in decision making.</p>
</blockquote>
<p>Other researchers have found <a href="https://mindhacks.com/2016/12/08/rational-judges-not-extraneous-factors-in-decisions/">specific problems with the study</a>, but Lakens’s point is that we could dismiss the result even before they did. If a proposed proof of Fermat’s last theorem also shows there are no solutions to \(a^2 + b^2 = c^2\), we know it’s <em>wrong</em>, even before we find the specific flaw in the argument. And if a study suggests humans aren’t capable of making reasoned decisions at 11:30 AM, it’s confounded by <em>something</em>, even if we don’t know what.</p>
<p>And yet, while I don’t believe in these studies, and I don’t believe their effect sizes, I still believe their basic claims. I believe that people make worse decisions when they’re hungry. (I know I do.) I believe standing in a power pose can make you feel stronger and more assertive. I believe that <a href="https://www.vox.com/2016/3/14/11219446/psychology-replication-crisis">exercising self-control can deplete your willpower</a>.</p>
<p>But as a mathematician, I’m forced to admit: we don’t have proof.</p>
<hr />
<p><em>Do you think we have a replication crisis in math? Disagree with me about the replication crisis? Think you make better decisions when you’re hungry? Tweet me <a href="https://twitter.com/profjaydaigle">@ProfJayDaigle</a> or leave a comment below.</em></p>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>I’m a big fan of the <a href="http://datacolada.org">Data Colada</a> project, and of <a href="https://statmodeling.stat.columbia.edu/2018/05/07/replication-crisis-centered-social-psychology/">Andrew Gelman’s writing</a> on the subject. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>In theory, all papers should include enough information that you can replicate all the experiments they describe. In practice, I think this basically never happens. There’s just too much information, and it’s hard to even guess which things are going to be important. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3">
<p>I’m sure every practitioner in every field says that, though, even years after the problems become obvious to anyone who looks. So take this with a grain of salt. <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
<li id="fn:4">
<p>A friend asks: if we mostly know what’s true already, why do we need to actually find the proofs? The bad answer is “you’re not doing math if you don’t prove things”. The good answer is that finding proofs is how we train this mysteriously good intuition; if we didn’t work out proofs in detail, we wouldn’t be able to make good guesses about the next steps. <a href="#fnref:4" class="reversefootnote">↩</a></p>
</li>
<li id="fn:5">
<p>I’m going to pick on Amy Cuddy and power posing a lot. That’s not entirely fair to Cuddy; the pattern I’m describing is extremely common and easy to fall into, and I could make the same argument about <a href="https://www.nature.com/articles/d41586-019-03755-2">social priming research</a> or the <a href="https://mindhacks.com/2016/12/08/rational-judges-not-extraneous-factors-in-decisions/">hungry judges study</a> or the dozens of others. (That’s why it’s a “replication crisis” and not a “this one researcher made a mistake one time crisis”.) But for simplicity I’m going to stick to the same example for most of this post. <a href="#fnref:5" class="reversefootnote">↩</a></p>
</li>
<li id="fn:6">
<p>Mathematicians love doing this. I’m a mathematician, so I love doing this. But it’s genuinely a useful way to think about what’s going on. <a href="#fnref:6" class="reversefootnote">↩</a></p>
</li>
<li id="fn:7">
<p>This is your regular reminder to stand up, stretch, and drink some water. <a href="#fnref:7" class="reversefootnote">↩</a></p>
</li>
<li id="fn:8">
<p>I originally tried to write a concise explanation to include here. It hit a thousand words and was nowhere near finished, so I decided to save it for later. Update: I have now posted the <a href="/blog/hypothesis-testing-part-1">first</a> and <a href="/blog/hypothesis-testing-part-2">second</a> essays in a three-part series on hypothesis testing. <a href="#fnref:8" class="reversefootnote">↩</a></p>
</li>
<li id="fn:9">
<p>This means we have to be really careful about interpreting studies that don’t find any effect. A study with low power will find “<a href="https://twitter.com/zeynep/status/1366175070507384836?lang=en">no evidence</a>” of an effect even if the effect is very real, and that can be <a href="https://twitter.com/CT_Bergstrom/status/1487491536010944512">just as misleading</a> as the errors I’m discussing in this essay.</p>
<p>More careful researchers will say they “fail to reject the null hypothesis” or “fail to find an effect”. If everyone were always that careful I wouldn’t need to write this essay. <a href="#fnref:9" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Jay DaigleThe replication crisis is a major problem in medicine and social science; we know that a huge fraction of the published literature is outright wrong. But in math we don't seem to have a similar crisis, despite reasonably frequent minor errors in published papers. Why not, and what can this tell us about the fields that are in crisis?Pascal’s Wager, Medicine, and the Limits of Formal Reasoning2021-11-28T00:00:00-08:002021-11-28T00:00:00-08:00https://jaydaigle.net/blog/pascalian-medicine<p>Scott Alexander at Astral Codex Ten has a good post recently thinking about what he calls <a href="https://astralcodexten.substack.com/p/pascalian-medicine">Pascalian Medicine</a>. As always the entire post is worth reading, but here’s an excerpt:</p>
<blockquote>
<p>Another way of looking at this is that I must think there’s a 25% chance Vitamin D works, and a 10% chance ivermectin does. Both substances are generally safe with few side effects. So (as many commenters brought up) there’s a <a href="https://en.wikipedia.org/wiki/Pascal%27s_wager">Pascal’s Wager</a> like argument that someone with COVID should take both. The downside is some mild inconvenience and cost (both drugs together probably cost $20 for a week-long course). The upside is a well-below-50% but still pretty substantial probability that they could save my life.</p>
</blockquote>
<blockquote>
<p>…</p>
</blockquote>
<blockquote>
<p>But why stop there? Sure, take twenty untested chemicals for COVID. But there are almost as many poorly-tested supplements that purport to treat depression. The cold! The flu! Diabetes! Some of these have known side effects, but others are about as safe as we can ever prove anything to be. Maybe we should be taking twenty untested supplements for every condition!</p>
</blockquote>
<p>Scott doesn’t seem to believe we should do this, but is trying to figure out the actual flaw in this reasoning. The most convincing argument he comes up with is based on how unreliable modern medical studies are, and how easy it is to generate spurious positive results.</p>
<blockquote>
<p>I think ivermectin doesn’t work. I think that it looks like it works, because it has lots of positive studies and a few big-name endorsements. But our current scientific method is so weak and error-prone that any chemical which gets raised to researchers’ attentions and studied in depth will get approximately this amount of positive results and buzz. Look through the thirty different chemicals featured on the sidebar of the ivmmeta site if you don’t believe me.</p>
</blockquote>
<blockquote>
<p>…</p>
</blockquote>
<blockquote>
<p>Probably what I’m doing wrong here is saying that ivermectin having some decent studies raises its probability of working to 5%. I should just say 0.1% or 0.01% or whatever my prior on a randomly-selected medication treating a randomly-selected disease is (higher than you’d think, based on the argument from antibiotics).</p>
</blockquote>
<blockquote>
<p>From the Outside View, this argument seems strong. From the Inside View, I have a lot of trouble looking at a bunch of studies apparently supporting a thing, and no contrary evidence against the thing besides my own skepticism, and saying there’s a less than 1% chance that thing is true.</p>
</blockquote>
<p>The <a href="https://www.lesswrong.com/tag/inside-outside-view">Outside View</a> argument here is <em>completely right</em>, and is a great illustration of the limitations of Bayesian reasoning that I talked about <a href="/blog/paradigms-and-priors/#anomalies-and-bayes">here</a> and <a href="https://jaydaigle.net/blog/overview-of-bayesian-inference/">here</a>.</p>
<h3 id="unknown-unknowns">Unknown Unknowns</h3>
<p>The basic argument for Pascalian medicine goes: okay, suppose ivermectin has a 10% chance of reducing covid mortality by 10%. About a thousand people are dying of covid every <del>week</del> day<strong title="I originally misread the CDC page and interpreted the weekly average of daily numbers as weekly numbers. I've edited the piece throughout to reflect the true numbers, but it doesn't change any of the conclusions, since the same error happened to every rate I discussed in the piece."><sup id="fnref:edit"><a href="#fn:edit" class="footnote">1</a></sup></strong> in the US <a href="https://www.cdc.gov/coronavirus/2019-ncov/covid-data/covidview/index.html">according to the CDC weekly tracker</a>, so the expected benefit of giving all our covid patients ivermectin is something like saving ten lives per day.<strong title="There would also be benefits from fewer people being hospitalized, fewer people suffering long-term health consequences, fewer people being miserable and bedridden for a week, etc. I'm going to talk about deaths pretty exclusively because it's easier to talk about just one number."><sup id="fnref:1"><a href="#fn:1" class="footnote">2</a></sup></strong></p>
<p>Even if you think the probability ivermectin works is only something like 1%, that still adds up to one life saved per day. Since ivermectin is cheap, and “generally safe with few side effects”, an expected value of “saves one life per day” looks pretty good! So maybe we should prescribe it out of an abundance of caution.<strong title="This is very different from claims that ivermectin is a miracle cure, and we should take that instead of getting vaccinated. Ivermectin is at best mildly beneficial; vaccines are safe and effective and you should get a booster shot if you haven't already. We're talking about whether the small possibility of a minor benefit from ivermectin makes it worth taking."><sup id="fnref:2"><a href="#fn:2" class="footnote">3</a></sup></strong></p>
<p>And then we make the same argument about, apparently, twenty other drugs, and we’re taking a crazy drug cocktail. (Scott calls this the Insanity Wolf position.) So it looks like something has gone wrong. But what?</p>
<p style="text-align: center;"><img src="/assets/blog/pascalian/insanity_wolf.jpeg" alt="Insanity Wolf meme: &quot;TAKE EVERY MEDICATION ALL THE TIME BECOME INFINITELY HEALTHY, LIVE FOREVER&quot;" /></p>
<p>We made a basic, common error that really isn’t fully avoidable: we took a bunch of stuff we can’t measure, and decided it didn’t matter. “Generally safe with few side effects” isn’t the same as “perfectly safe”, and “cheap” isn’t the same as “free”. And something like ninety thousand people get covid in the US every day; to save that one life we’re probably giving drugs to tens of thousands of people. How confident are we that our drugs won’t hurt any of them? Especially if we give an Insanity Wolf-style twenty-drug cocktail?</p>
<p>Scott discusses this idea, of course. But I think he seriously underestimates the problem of unknown unknowns here. For well-understood drugs with large probable benefits, the unknown unknowns don’t matter very much. But for long-shot possible payoffs, like with ivermectin, unknown unknowns present a real, unavoidable problem. And the theoretically, mathematically correct response is to throw up our hands and take the Outside View instead.</p>
<h3 id="three-example-drugs">Three Example Drugs</h3>
<p>I want to take a look at three different drugs and do some illustrative calculations for the possible risks and benefits.</p>
<h5 id="paxlovid">Paxlovid</h5>
<p>There are always unknown unknowns, but in many cases we can put bounds on how good, or bad, things can be. <a href="https://en.wikipedia.org/wiki/PF-07321332">Paxlovid</a>, Pfizer’s new antiviral pill, provides a good example of this reasoning. In trials, Paxlovid <a href="https://www.pfizer.com/news/press-release/press-release-detail/pfizers-novel-covid-19-oral-antiviral-treatment-candidate">cut covid hospitalizations and deaths by about 90%</a>.<strong title="These numbers are reported a little weirdly. Looking at the study, it seems like Paxlovid cut hospitalizations by 85%, from 41/612 to 6/607; it cut deaths by 100% from 10/612 to 0/607. I think the 90% figure is the extent to which it cut (hospitalizations plus deaths), since that math checks out, but that's a slightly weird metric to judge by."><sup id="fnref:3"><a href="#fn:3" class="footnote">4</a></sup></strong> Let’s assume that’s a wildly optimistic overestimate, and give it a 50% chance of cutting deaths by 50%. Then in expectation that’s going to save a couple hundred lives each day.</p>
<p>What are the risks? This is a new drug so it’s hard to know what they are; all we know is that (1) Pfizer didn’t expect the side effects to be too bad, based on prior knowledge of this drug class, and (2) they didn’t notice anything too dramatic in the trial they ran. That doesn’t tell us how bad the side effects are, but it does put limits on them: if Paxlovid killed 1% of the people who took it, we’d know.</p>
<p>But suppose Paxlovid kills .1% of everyone who takes it. That’s about as high as it could go without us probably having noticed already, since the trial administered it to about 600 people and none of them died. (And realistically if it killed .1% of people, way more than that would have severe side effects and we probably would have noticed.) If we give Paxlovid to everyone in the US who gets covid, that’s about 90,000 people a day, and Paxlovid would kill 90 people a day. And that’s less than the couple hundred lives it would save.</p>
<p>Now, all of these numbers are <em>extremely handwavy</em>. But I chose them to make Paxlovid look as bad as reasonably possible, and it still comes out looking pretty good. My estimate of the benefit of Paxlovid was a huge lowball; it’s probably going to save closer to 800 lives a day than 200 if we manage to give it to everybody. And on the other hand, I’d be shocked if it’s anywhere <em>near</em> as dangerous as I assumed in the last paragraph. Sure, there’s some minuscule chance that it’s really, really dangerous but only several years after you take it, but since that’s not how these drugs usually work we can round that off to zero.</p>
<p>The benefit of Paxlovid is large enough that it outweighs any vaguely reasonable estimate of the costs. And we don’t need any especially fancy calculations to see that.</p>
<p style="text-align: center;"><img src="https://imgs.xkcd.com/comics/statistics.png" alt="https://xkcd.com/2400 Statistics. &quot;Statistics tip: always try to get data that's good enough that you don't need to do statistics on it.&quot;" /></p>
<p style="text-align: center"><em>We could make basically the same argument about vaccines, except the worst plausible numbers look even better than for Paxlovid.</em></p>
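<p>Spelled out, the worst-case arithmetic from the last few paragraphs looks like this. Every input is one of the deliberately pessimistic assumptions from the text, not a real trial figure.</p>

```python
# All inputs are the deliberately pessimistic assumptions from the text,
# not real trial data.
daily_cases = 90_000          # approximate new US covid cases per day
daily_deaths = 1_000          # approximate US covid deaths per day
p_works = 0.50                # lowballed chance Paxlovid works at all
mortality_cut = 0.50          # lowballed mortality reduction if it works
worst_case_kill_rate = 0.001  # hypothetical worst case: kills 0.1% of takers

expected_lives_saved = daily_deaths * p_works * mortality_cut  # 250 per day
worst_case_lives_lost = daily_cases * worst_case_kill_rate     # 90 per day

# Even stacking the worst plausible cost against a lowballed benefit,
# Paxlovid still comes out ahead.
print(expected_lives_saved > worst_case_lives_lost)  # True
```

<p>The point isn’t the precision of any single number; it’s that the benefit is large enough to dominate even the ugliest defensible cost estimate.</p>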
<h5 id="tylenol">Tylenol</h5>
<p>We can run a similar analysis with common every-day drugs like Tylenol. Scott observes that “We don’t fret over the unknown unknowns of Benadryl or Tylenol or whatever, even though we know their benefits are minor.” But by the same token, we also are reasonably confident that the unknown unknown costs of those drugs are minor. If Tylenol killed .1% of patients who took it, or even .01%, <em>we would know</em>. (And in fact we know Tylenol can cause liver damage, and that is a thing we very much do fret over.) Sure, unknown harms always could exist. But in this case we can be pretty confident that they have to be really small.</p>
<p>Apparently a new, potentially deadly side effect of Tylenol was discovered in 2013. If I’m reading the FDA report correctly, they believe that <a href="https://www.fda.gov/drugs/drug-safety-and-availability/fda-drug-safety-communication-fda-warns-rare-serious-skin-reactions-pain-relieverfever-reducer">one person has died</a> from this side effect since 1969. That’s the scale of side effect that can slip under the radar for a drug as widely taken and studied as Tylenol.</p>
<p>Tylenol could have unknown unknowns, but they won’t be <em>very</em> unknown.</p>
<h5 id="back-to-ivermectin">Back to Ivermectin</h5>
<p>Now compare this with the ivermectin situation. Let’s suppose we give ivermectin a 10% chance of being effective, with a benefit of reducing deaths by 20%. (The Together trial has a non-significant effect of about 10%, so let’s double that.) Then in expectation we’re saving like 2% of lives a day, which is 20 lives saved if we give it to everyone.</p>
<p>How many people would ivermectin have to kill to net out negative? If we give it to 90,000 people every day, then 20 deaths is about .02% of them. So does ivermectin kill about .02% of the people who take it? My guess is, probably not. But that seems a lot more within the realm of “maybe, it’s hard to be sure”.</p>
<p>We also reach the point where a lot of our ass-pull assumptions start to really matter. We said “maybe ivermectin has a 10% chance of working”. Scott’s the expert, not me, but that seems high to me. (Do you really think that one in ten drugs that have vague but mildly-promising data in preliminary trials pan out?) If we say ivermectin has a 1% chance of reducing deaths by 20%, then our expected value is two lives per day.</p>
<p>This could still pencil out as a good trade, but with benefits so small (and uncertain) it could easily not be worth it. Especially if we account for the guaranteed annoyance of taking a pill and the common minor side effects we know ivermectin has.</p>
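The ivermectin arithmetic above can be sketched the same way. The baseline of 1,000 covid deaths a day is an assumption I’m inferring from the “2% of lives … is 20 lives” step; every other input is one of the guesses from the text:

```python
# Expected-value sketch for ivermectin, with the text's made-up inputs.
# baseline_deaths is inferred from the "2% of lives = 20 lives" step.

daily_takers = 90_000
baseline_deaths = 1_000     # assumed daily covid deaths (inferred)
p_works = 0.10              # guessed chance ivermectin works at all
effect_if_works = 0.20      # guessed 20% cut in deaths if it does work

expected_saved = baseline_deaths * p_works * effect_if_works
breakeven_kill_rate = expected_saved / daily_takers

print(f"Expected lives saved: {expected_saved:.0f}/day")   # 20/day
print(f"Break-even kill rate: {breakeven_kill_rate:.2%}")  # 0.02%

# Drop the odds of working to 1% and the benefit shrinks tenfold:
print(f"{baseline_deaths * 0.01 * effect_if_works:.0f} lives/day")
```

Note how sensitive the bottom line is to the guessed inputs, which is exactly the problem the next section takes up.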
<h3 id="the-problem-with-made-up-numbers">The Problem with Made-Up Numbers</h3>
<p>But the larger point here is that <em>all this math is bullshit</em>. Are the odds of ivermectin working 10%? 1%? .01%? Where did that number come from? What do we mean by “working”—is it a 5% improvement? A 50% improvement?<strong title="There are systematic ways of estimating this, but they would all require numbers for &quot;how inflated do you expect non-significant effect sizes in published studies to be?&quot; If you spend a lot of time with the medical literature you might have a number to put here; I don't."><sup id="fnref:4"><a href="#fn:4" class="footnote">5</a></sup></strong> And at the same time, I don’t have real odds for “negative side effects”, which covers a lot of ground. (Scott himself points out that the odds of ivermectin unexpectedly killing you are definitely not zero.) And all this is the simple version of the calculation, where we don’t try to weigh things like “fever from covid might last one day less?” versus “ivermectin can cause fever?”</p>
<p>Scott argued many years ago that <a href="https://slatestarcodex.com/2013/05/02/if-its-worth-doing-its-worth-doing-with-made-up-statistics/">if it’s worth doing, it’s worth doing with made-up statistics</a>. And I don’t really disagree with that essay. Doing experimental calculations with made-up numbers can give us information, and I certainly think the analysis of Paxlovid that I did above tells us something useful. But to learn anything from these calculations, we need our made-up numbers to at least vaguely reflect reality.</p>
<p>Scott wrote:</p>
<blockquote>
<p>Remember the <a href="http://yudkowsky.net/rational/bayes">Bayes mammogram problem</a>? The correct answer is 7.8%; most doctors (and others) intuitively feel like the answer should be about 80%. So doctors – who are specifically trained in having good intuitive judgment about diseases – are wrong by an order of magnitude….But suppose some doctor’s internet is down (you have NO IDEA how much doctors secretly rely on the Internet) and she can’t remember the prevalence of breast cancer. If the doctor thinks her guess will be off by less than an order of magnitude, then making up a number and plugging it into Bayes will be more accurate than just using a gut feeling about how likely the test is to work.</p>
</blockquote>
<p>And this is right, but the caveat at the end is critical. If you have a good estimate of the prevalence of breast cancer, and a bad estimate of the chance of a false positive, then you can use the first number to get a better estimate of the second. But if you have a really good idea of the false positive rate (maybe you’ve seen thousands of positive results and learned which ones turned out to be false positives), but a shaky idea of the prevalence of breast cancer (hell, I have no idea how likely some lump is to be cancerous), you’ll be better off going with your intuition for how accurate the test is—and using that to estimate breast cancer prevalence!</p>
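For reference, the mammogram arithmetic behind that 7.8% figure is one application of Bayes’ theorem. The text only gives the answer; the input numbers below are the standard ones for this puzzle, so treat them as assumed:

```python
# Bayes' theorem for the mammogram problem. Only the 7.8% answer comes
# from the text; these standard puzzle inputs are assumed.

prevalence = 0.01        # P(cancer): 1% of women screened have cancer
sensitivity = 0.80       # P(positive test | cancer)
false_positive = 0.096   # P(positive test | no cancer)

p_positive = sensitivity * prevalence + false_positive * (1 - prevalence)
p_cancer_given_positive = (sensitivity * prevalence) / p_positive

print(f"P(cancer | positive) = {p_cancer_given_positive:.1%}")  # 7.8%
```

The posterior is an order of magnitude below the intuitive ~80% guess, which is exactly the gap the quote describes.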
<p>Scott says that “varying the value of the “unknown unknowns” term until it says whatever justifies our pre-existing intuitions is the coward’s way out.” And this is one of the rare cases where I think he’s completely, unequivocally wrong. This isn’t the coward’s way out; it’s the only thing we can possibly do.</p>
<h3 id="reflective-equilibrium">Reflective Equilibrium</h3>
<p>If you find a convincing argument that generates an unlikely conclusion, you can accept the unlikely conclusion, you can decide that the premises of the argument were flawed, <em>or</em> you can decide the argument itself doesn’t work. If I collect some data, do some statistics, and calculate that taking Tylenol will cut my lifespan by thirty years, I don’t immediately throw away all my Tylenol—I look for where I screwed up my math. And that’s the correct, and rational, response.</p>
<p>If you think A is true and B is false, and find an argument that A implies B, you have three choices: you can decide A is false after all; you can decide B is true after all; or you can decide that the argument actually isn’t valid. Or you can adopt some probabilistic combination: it’s perfectly consistent to believe A is 60% likely to be true, B 60% likely to be false, and the argument 60% likely to be correct. But fundamentally you have to make a choice about which of the three pieces to adjust, and by how much.<strong title="David Chapman calls this [meta-rational reasoning](https://twitter.com/Meaningness/status/1463632030059544576). I see where he's coming from but think that's an unnecessarily complex and provocative way of talking about it."><sup id="fnref:5"><a href="#fn:5" class="footnote">6</a></sup></strong></p>
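That 60/60/60 combination really is coherent, which is easy to check numerically. The independence assumption below is mine, not the text’s:

```python
# Coherence check: believe A with probability 0.6, believe B is false
# with probability 0.6 (so P(B) = 0.4), and believe the argument
# "A implies B" is valid with probability 0.6.
# Assumption (mine, not the text's): A's truth and the argument's
# validity are independent events.

p_A, p_valid, p_B = 0.6, 0.6, 0.4

# Whenever A holds AND the argument is valid, B must hold, so P(B)
# has to be at least P(A and valid):
p_B_floor = p_A * p_valid  # 0.36

print(p_B >= p_B_floor)  # True: the three beliefs can coexist
```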
<p style="text-align: center;"><img src="/assets/blog/pascalian/two-answers.jpg" alt="Picture of kitten raising two paws: &quot;i has two ansers. which you want?&quot;" /></p>
<p>In the case of ivermectin, we have some data from some studies. We have an Inside View argument that, based on expected values computed from that data, taking ivermectin is probably worth it. And we have the Outside View argument that taking random long-shot drugs is not a great idea. And we have to reconcile these somehow.</p>
<p>First, we could reject, or disbelieve, the data. And we totally did that: a bunch of ivermectin studies are fraudulent or incompetent, and Scott <a href="https://astralcodexten.substack.com/p/ivermectin-much-more-than-you-wanted">argues pretty convincingly</a> that some of the honest, competent studies are really picking up the benefits of killing off intestinal parasites. But even after doing that, we’re left with the Pascalian argument: ivermectin probably doesn’t work, but it might, and the costs of taking it are low, so we might as well. Do we listen to that argument, or to our gut belief that this can’t be a good idea?</p>
<p>A common trap that smart, math-oriented people fall into is thinking that the argument with numbers and calculations must be the better one. The Inside View argument did some math, and multiplied some percentages, and came up with an expected value; the Outside View argument comes from a fuzzy intuitive sense that medicine Doesn’t Work That Way. So the mathy argument should win out.</p>
<p style="text-align: center;"><img src="/assets/blog/pascalian/peanuts-opinion.gif" alt="Peanuts comic. &quot;How are you doing in school these days, Charlie Brown?&quot;
&quot;Oh, fairly well, I guess...I'm having most of my trouble in arithmetic..&quot;
&quot;I should think you'd like arithmetic...it's a very precise subject..&quot;
&quot;That's just the trouble. I'm at my best in something where the answers are mostly a matter of opinion!&quot;" width="100%" /></p>
<p>But in this case, we were doing calculations with numbers that were, you might remember, completely made up. Sure, the Outside View argument reflects a fuzzy intuitive sense of whether a random potential cure is likely to help us. The Inside View argument, on the other hand, reflects a fuzzy intuitive sense of whether ivermectin is likely to protect us from covid.</p>
<p>The only real difference is that we took the second fuzzy intuition, put a fuzzy number on it, and plugged it into some cost-benefit analysis formulas. And no matter what fancy formulas we use, they can never make our starting numbers <em>less</em> fuzzy. Given the choice between a fuzzy intuition, and an equally fuzzy intuition that we’ve done math to, I’m inclined to trust the first one. With fewer steps, there are fewer ways to screw up.</p>
<h3 id="finding-the-error">Finding the Error</h3>
<p>At this point I think we’ve reached roughly Scott’s position at the end of his essay. The Outside View argument is winning out in practice, but we haven’t articulated any specific problems with the Inside View argument. And this is uncomfortable, because <em>they can’t both be right</em>. We can say it’s more likely we screwed up the more complicated, mathier argument. But <em>how</em> did we screw it up?</p>
<p>And on reflection, the answer is that we’re confusing two different arguments. I think that “Sure, go ahead and take ivermectin, it probably won’t help but it might, and it probably won’t hurt either” is a pretty reasonable position, and was even more reasonable six months ago, when we knew less than we do now.<strong title="Again, &quot;Ivermectin is a miracle cure, take that instead of getting vaccinated&quot; is, in fact, a completely and totally nonsense position. And many public &quot;ivermectin advocates&quot; are saying that, and they are wrong. But that's not what we're talking about here."><sup id="fnref:6"><a href="#fn:6" class="footnote">7</a></sup></strong></p>
<p>I know a bunch of people who take Vitamin C, even though it’s not clear that accomplishes anything. I myself flip-flop between taking a multivitamin because it seems like it might make me healthier, and not taking a multivitamin because there’s no real evidence that it does. Taking ivermectin in case it’s helpful doesn’t really seem that different.</p>
<p>No, the crazy position is when we go full Insanity Wolf and take twenty different long-shot cures at once. <em>That</em> was the conclusion that seemed like it couldn’t possibly hold up, at least for me. And that’s <em>also</em> the point where it really does seem like the unknown unknowns start piling up. There are twenty different drugs that could all possibly cause negative side effects. There are 190 potential two-drug interactions and over a thousand potential three-drug interactions, and even if interactions are, in Scott’s words, “rarer than laypeople think”, that seems like a lot of room for something weird to happen.</p>
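The interaction counts in that paragraph are just binomial coefficients, easy to verify:

```python
# Potential interactions in a twenty-drug cocktail.
from math import comb

drugs = 20
print(comb(drugs, 2))  # 190 two-drug pairs
print(comb(drugs, 3))  # 1140 three-drug combinations
```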
<p>So this is how we screwed up. We said these drugs are cheap and generally safe. But in order to make our math reasonable, we rounded “generally safe” down to “safe”, and ignored the risks entirely. As long as the risks are small enough, that works fine; but at some point we cross the threshold where we can’t just ignore all the downsides when doing our calculations.</p>
<p>Is taking twenty drugs over that threshold? I don’t know, but it seems likely. Taking that many drugs <em>probably</em> won’t hurt you, but it might! And it will definitely be expensive and annoying, and a lot of those drugs have common mild-but-unpleasant side effects. And the potential benefits are relatively small, and relatively unlikely; it’s easy for them to be swamped by all these downsides.</p>
<p>But now we’re talking about the interaction of hundreds of numbers that are both small and uncertain. We can’t get away with ignoring the risks, but we can’t realistically quantify them either. All we can do is make some half-assed guesses, and our conclusions will change a lot depending on exactly which guesses we make. So we <a href="https://twitter.com/ProfJayDaigle/status/1463598150585888775">can’t do a useful Inside View calculation at all</a>. Instead we’re basically forced to rely on the Outside View argument: taking twenty pills every day that probably don’t even work seems kinda dumb.</p>
<p>But then why take ivermectin specifically, rather than Vitamin D or curcumin or some other possible treatment? I dunno. You’re buying a long-shot lottery ticket. Pick your favorite number and hope it pays out.</p>
<h3 id="the-takeaway">The Takeaway</h3>
<p>A back-of-the-envelope cost-benefit analysis tells us that taking ivermectin for covid might have positive expected value. If we follow that logic to its conclusion, we wind up taking twenty different supplements and this seems like it can’t be wise.</p>
<p>A blinkered view of rationality tells us to ignore our intuition and follow the math. A more expansive view realizes that if the numbers we’re plugging into our cost-benefit analysis are shakier than that intuition, then we should take the intuition seriously. Cost-benefit analyses and other “mathematically rational” methods are only as good as the numbers and arguments that we bring to them.</p>
<p>But even with shaky numbers, we can learn things from comparing our intuitions with the result of our calculations. Figuring out <em>why</em> we get two different answers can teach us a lot about our reasoning, and help us figure out where we went wrong. Taking the full Insanity Wolf cocktail really seems qualitatively different from picking your favorite long-shot drug, but the way we set up our math hid that from us.</p>
<p>Finally: please get vaccinated, and get your booster shot. And if you have a choice between Paxlovid and ivermectin, you should probably take the Paxlovid.</p>
<hr />
<p><em>Questions about cost-benefit analysis, or where the math breaks down? Do you know something I missed? Tweet me <a href="https://twitter.com/profjaydaigle">@ProfJayDaigle</a> or leave a comment below.</em></p>
<div class="footnotes">
<ol>
<li id="fn:edit">
<p>I originally misread the CDC page and interpreted the weekly average of daily numbers as weekly numbers. I’ve edited the piece throughout to reflect the true numbers, but it doesn’t change any of the conclusions, since the same error happened to every rate I discussed in the piece. <a href="#fnref:edit" class="reversefootnote">↩</a></p>
</li>
<li id="fn:1">
<p>There would also be benefits from fewer people being hospitalized, fewer people suffering long-term health consequences, fewer people being miserable and bedridden for a week, etc. I’m going to talk about deaths pretty exclusively because it’s easier to talk about just one number. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>This is very different from claims that ivermectin is a miracle cure, and we should take that instead of getting vaccinated. Ivermectin is at best mildly beneficial; vaccines are safe and effective and you should get a booster shot if you haven’t already. We’re talking about whether the small possibility of a minor benefit from ivermectin makes it worth taking. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3">
<p>These numbers are reported a little weirdly. Looking at the study, it seems like Paxlovid cut hospitalizations by 85%, from 41/612 to 6/607; it cut deaths by 100% from 10/612 to 0/607. I think the 90% figure is the extent to which it cut (hospitalizations plus deaths), since that math checks out, but that’s a slightly weird metric to judge by. <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
<li id="fn:4">
<p>There are systematic ways of estimating this, but they would all require numbers for “how inflated do you expect non-significant effect sizes in published studies to be?” If you spend a lot of time with the medical literature you might have a number to put here; I don’t. <a href="#fnref:4" class="reversefootnote">↩</a></p>
</li>
<li id="fn:5">
<p>David Chapman calls this <a href="https://twitter.com/Meaningness/status/1463632030059544576">meta-rational reasoning</a>. I see where he’s coming from but think that’s an unnecessarily complex and provocative way of talking about it. <a href="#fnref:5" class="reversefootnote">↩</a></p>
</li>
<li id="fn:6">
<p>Again, "Ivermectin is a miracle cure, take that instead of getting vaccinated" is, in fact, a completely and totally nonsense position. And many public "ivermectin advocates" are saying that, and they are wrong. But that’s not what we’re talking about here. <a href="#fnref:6" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Jay DaigleA back-of-the-envelope cost-benefit analysis tells us that taking ivermectin for covid might have positive expected value. If we follow that logic to its conclusion, we wind up taking twenty different supplements and this seems like it can't be wise. Resolving this apparent conflict exposes some of the deep flaws in how we often think about rationality and Bayesian reasoning. A response to a piece by Scott Alexander at Astral Codex Ten.More Thoughts on the Axiom of Choice2021-07-28T00:00:00-07:002021-07-28T00:00:00-07:00https://jaydaigle.net/blog/more-on-the-axiom-of-choice<p>I got a lot of good, interesting comments on my recent <a href="https://jaydaigle.net/blog/what-is-the-axiom-of-choice/">post on the axiom of choice</a> (both on the post itself, and in this <a href="https://news.ycombinator.com/item?id=27836406">very good Hacker News thread</a>). I wanted to answer some common questions and share the most interesting thing I learned.</p>
<h3 id="cant-we-just-pick-at-random">Can’t we just pick at random?</h3>
<p>A lot of people asked why we can’t just avoid the whole problem of the axiom of choice by picking set elements randomly. Because obviously we can just make a bunch of random choices, right? If there’s no limit to what the choices have to look like then there’s no problem.</p>
<p>If you believe that, then you believe the axiom of choice. “We can pick some element from each set, without being fussy about which one we get” is just what the axiom of choice says. And that’s fine. A lot of people believe the axiom of choice! But it’s not an alternative to the axiom of choice; it is the axiom of choice.</p>
<p>The fact that this “just pick at random” idea seems so facially compelling, or “obvious”, is a big part of why many mathematicians want to accept the axiom of choice. It just seems like we should be able to make a bunch of
choices at once, if we’re not picky about which choices we make. It’s only when they are shown the really bizarre implications of getting to make those choices that most people start questioning whether the axiom makes sense.</p>
<h3 id="why-do-we-want-to-believe-the-axiom-of-choice">Why do we want to believe the axiom of choice?</h3>
<p>Another recurring question asked why we <em>should</em> want to believe the axiom of choice. It has a lot of bizarre consequences. In the last post I argued that those consequences aren’t as troubling as they seem, but they’re still weird. Why can’t we just dumpster the axiom of choice and avoid all of them?</p>
<p>One reason is the intuitive plausibility of the “just pick at random” idea. The goal of an axiomatic system is to formalize our list of “basic moves we should be able to make”. The ZF axioms include things like the <a href="https://en.wikipedia.org/wiki/Axiom_of_extensionality">axiom of extensionality</a>, which says that two sets are equal if they have the same elements, and the <a href="https://en.wikipedia.org/wiki/Axiom_of_pairing">axiom of pairing</a>, which says that if \(A\) and \(B\) are sets then we can talk about the set \( \{A, B\} \). These aren’t weird exotic ideas. They’re just things we should be able to do with collections of things. They’re part of the intuition that the word “set” is trying to formalize.</p>
<p>You could see the axiom of choice as something like this—something in our basic, intuitive understanding of what a “set” is, that pre-exists formal definitions. It’s pretty easy to convince people that “choose an element from each set” is a reasonable thing to be able to do. The only problem is that it leads to absurd results like Banach-Tarski or the solution to the Infinite Hats puzzle. But if we satisfy ourselves that those absurdities aren’t a real problem, we return to “this seems like a thing we should be able to do”.</p>
<h3 id="but-really-why-do-we-want-to-believe-the-axiom-of-choice">But really, why do we <em>want</em> to believe the axiom of choice?</h3>
<p>On the other hand, that’s not a very strong reason to really care about the axiom of choice. At best, that leaves us at “why shouldn’t we, it doesn’t hurt anything”, which could just as easily be “why should we, it doesn’t help?” We <em>care</em> about the axiom of choice, and put up with the peripheral weirdness, because it lets us prove a <a href="https://en.wikipedia.org/wiki/Axiom_of_choice#Weaker_forms">variety of other results we care about</a>. These include:</p>
<ul>
<li>Every Hilbert space has an orthonormal basis (so we can put coordinates on function spaces);</li>
<li>Every field has an algebraic closure (very important in number theory—in my research I often wanted to talk about “the algebraic closure” of some large field, and that implicitly relies on the axiom of choice);</li>
<li>The union of countably many countable sets is countable;</li>
<li><a href="https://en.wikipedia.org/wiki/Hahn%E2%80%93Banach_theorem">The Hahn-Banach theorem</a> (lets us extend linear functionals and guarantees that dual spaces are “interesting”);</li>
<li><a href="https://en.wikipedia.org/wiki/G%C3%B6del's_completeness_theorem">Gödel’s completeness theorem</a> for first-order logic;</li>
<li><a href="https://en.wikipedia.org/wiki/Baire_category_theorem">The Baire category theorem</a>, which I don’t even want to try to summarize but which shows up constantly in functional analysis.</li>
</ul>
<p>All of these results are really useful in their respective fields, and we need the axiom of choice to prove them. And that’s a true “need”: these are all provable from ZFC but not from ZF.</p>
<p>These statements aren’t equivalent to the axiom of choice. If we wanted, we could take the above list as a list of new <em>axioms</em> to attach to ZF, and then we wouldn’t be stuck with choice. But that is a really strange and ad-hoc list of foundational axioms. It feels much better to take the one axiom—the axiom of choice, which is reasonably foundational and sounds plausible enough on its own—and get all these consequences for free.</p>
<h3 id="shoenfields-theorem-you-only-need-the-axiom-of-choice-for-weird-things">Shoenfield’s Theorem: You only need the axiom of choice for weird things</h3>
<p>But the coolest thing I learned about after writing the last post is <a href="https://en.wikipedia.org/wiki/Absoluteness#Shoenfield's_absoluteness_theorem">Shoenfield’s Absoluteness Theorem</a>. The statement of this theorem is pretty dense and I don’t think I completely understand it, but it has really nice implications for the axiom of choice.</p>
<p>In the last post I said that the axiom of choice just doesn’t cause problems as long as we’re not getting too far away from finite sets. This applies even to half the results in the previous section.</p>
<ul>
<li>We need the axiom of choice to show that <em>every</em> field has an algebraic closure, but not to show that the rationals do.</li>
<li>We need the axiom of choice to show that <em>every</em> Hilbert space has an orthonormal basis, but not to show that Fourier theory gives an orthonormal basis for \(L^2([-\pi,\pi])\).</li>
<li>We need the axiom of choice to prove the Baire Category Theorem for every complete metric space, but not to prove it for the real numbers or the real function space \(L^2(\mathbb{R}^n)\).</li>
</ul>
<p>Shoenfield’s theorem helps tell us exactly when the axiom of choice is actually going to matter.</p>
<p>In the last post we talked about <em>models</em> of the ZF axioms, which are collections of sets that obey all the rules. Given a model, Kurt Gödel defined something called the <a href="https://en.wikipedia.org/wiki/Constructible_universe">constructible universe</a>, which is a sort of smaller model, contained in the original model, which can be built up explicitly from smaller pieces. The constructible universe usually doesn’t contain everything in the original model, but it will in some sense contain all the simple explicitly describable things in the original model.</p>
<p>But the constructible universe has some extra nice properties. One is that the constructible universe will always satisfy the axiom of choice, even if the original model did not!<strong title="This is how Gödel proved that the axiom of choice must be consistent with the ZF axioms: the constructible universe gives us a model of ZF that also satisfies the axiom of choice."><sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup></strong> Specifically, since we construct the universe in a specific <em>order</em>, everything we’ve constructed can be <a href="/blog/what-is-the-axiom-of-choice/#well-ordering">well-ordered</a>, which implies the axiom of choice. So any theorem that relies on the axiom of choice is automatically true as long as we’re only talking about sets in the constructible universe.</p>
<p>Shoenfield’s theorem extends that result even further. If you have a sufficiently simple question (for a <a href="https://en.wikipedia.org/wiki/Analytical_hierarchy">precise definition of sufficiently simple</a>), then the original model and the constructible universe must give the same answer. Since the axiom of choice always holds in the constructible universe, the answers to these simple questions can’t depend on whether you accept the axiom of choice or not.</p>
<p>What does that mean? Any simple-enough result that you can prove with the axiom of choice, you can also prove without it. That includes everything about Peano arithmetic and basic number theory, and also everything about the <a href="https://news.ycombinator.com/item?id=27855515">correctness of explicit computable algorithms</a>. It also includes <a href="https://en.wikipedia.org/wiki/Axiom_of_choice#cite_ref-16">\(P = NP\) and the Riemann Hypothesis</a>, and a number of other major unsolved problems.</p>
<p>There are questions that the axiom of choice really does matter for. But Gödel and Shoenfield’s results show that they have to be pretty far removed from anything finite or concretely constructible. So in practice, we can use the axiom of choice as a tool to make our work simpler, knowing that it won’t screw up anything practical that really matters.</p>
<hr />
<p><em>Do you have other questions about the axiom of choice? Another cool fact I don’t know about? Or some other math topic you’d like me to explain? Tweet me <a href="https://twitter.com/profjaydaigle">@ProfJayDaigle</a> or leave a comment below.</em></p>
<hr />
<div class="footnotes">
<ol>
<li id="fn:1">
<p>This is how Gödel proved that the axiom of choice must be consistent with the ZF axioms: the constructible universe gives us a model of ZF that also satisfies the axiom of choice. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Jay DaigleI got a lot of good, interesting comments on my recent post on the axiom of choice (both on the post itself, and in this very good Hacker News thread). I wanted to answer some common questions and share the most interesting thing I learned.What is the Axiom of Choice?2021-07-14T00:00:00-07:002021-07-14T00:00:00-07:00https://jaydaigle.net/blog/what-is-the-axiom-of-choice<p>One of the easiest ways to start a (friendly) fight in a group of mathematicians is to bring up the <a href="https://en.wikipedia.org/wiki/Axiom_of_choice">axiom of choice</a>. This axiom has a really interesting place in the foundations of mathematics, and I wanted to see if I can explain what it means and why it’s controversial. As a bonus, we’ll get some insight into what an axiom <em>is</em> and how to think about them, and about how we use math to think about the actual world.</p>
<p style="text-align: center;"><a href="https://xkcd.com/982"><img src="https://imgs.xkcd.com/comics/set_theory.png" alt="xkcd 982: &quot;The axiom of choice allows you to select one element from each set in a collection—and have it executed as an example to the others&quot;" /></a></p>
<p>The axiom seems pretty simple at first:</p>
<blockquote>
<p><strong>Axiom of Choice:</strong> Given a collection of (non-empty) sets, we can choose one element from each set.<strong title="We can be more formal by phrasing this in terms of _choice functions_: given a collection of sets X = {A} there is a function f : X \to ⋃ A such that f(A) ∈ A for each A ∈ X. But I want to keep the discussion as readable as possible if you're not comfortable with the language of formal set theory."><sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup></strong></p>
</blockquote>
<p>Most people find this principle pretty inoffensive, or even obviously right, on first contact. But it’s extremely controversial and produces strong emotions; and unusually for a mathematical debate, there’s essentially no hope of a clear resolution. And I want to try to explain why.</p>
<h3 id="easy-choices">Easy choices</h3>
<p>One reason the axiom of choice can <em>sound</em> trivial is that there are a lot of superficially similar rules that are totally fine; the controversial bit is subtle. So here are a few things that don’t cause controversy:</p>
<ul>
<li>If we have one set, we can definitely pick an element from it. The axiom of choice says if we have a collection of sets, we can pick one element from each set simultaneously.</li>
<li>
<p>But if we can pick an element from one set, can’t we pick an element from the first set, and then the second set, and then the third, etc.? Eventually we’ll pick an element from each set.</p>
<p>This works if we only have a <em>finite</em> collection of sets. So if I have five sets, I can pick one element from each set, by picking an element from the first set, then the second set, then the third, then the fourth, then the fifth. This is sometimes known as the <strong>axiom of finite choice</strong>. And no one argues about this.</p>
<p>But that approach doesn’t work if we have infinitely many sets.<strong title="Using this sort of process on an infinite set is called transfinite induction. Transfinite induction can sometimes allow us to make choices without the axiom, but only if we can put our sets in some order. Conversely, the axiom of choice allows us to use transfinite induction in cases we otherwise couldn't. (Corrected from earlier version; thanks to Sniffnoy for the correction)"><sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup></strong> If we pick elements from one set at a time, we’ll never get to all the sets; there will still be infinitely many left. This infinitude of sets is where the real problem lies. (And things get worse if we have an <a href="https://en.wikipedia.org/wiki/Uncountable_set">uncountably infinite</a> collection of sets, which is too many to even put in order!)</p>
</li>
</ul>
<p style="text-align: center;"><img src="/assets/blog/aoc/count_over_eleventy.jpg" alt="A kitten holding up its paws like it's counting. &quot;Ai can count ober elebenty. Look see? Elebenty one elebenty two elebenty free...&quot;" /></p>
<ul>
<li>
<p>Even if we have an infinite collection of sets, we <em>might</em> be able to pick an element from each set. If the sets have a nice enough pattern to them, we can give an explicit rule that lets us pick an element from each set consistently. For instance, if we have a bunch of sets of positive integers, we can always say something like “pick the smallest number in each set”.</p>
<p>But not every collection of sets allows a deterministic rule like this.<strong title="The set of real numbers doesn't have a smallest element or a largest element. Nor does the set of positive real numbers, or the set of numbers between zero and one. So if we have a collection of sets of real numbers, the rule we used for sets of positive integers doesn't work."><sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup> </strong> The axiom of choice says that we can choose an element from each set, even if we can’t describe a rule for making that choice. If we have infinitely many pairs of shoes we don’t need the axiom of choice, since we can just take the left shoe from each pair; but if we have infinitely many pairs of socks, we do need the axiom of choice.<strong title="This example was originally offered by Bertrand Russell. "><sup id="fnref:4"><a href="#fn:4" class="footnote">4</a></sup></strong></p>
</li>
</ul>
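<p>When such a rule exists, making the choices is just ordinary computation. Here’s a toy sketch in Python of the “pick the smallest number” rule (the function name is mine):</p>

```python
# For a family of non-empty sets of positive integers, "pick the smallest
# element" is an explicit choice function: it selects one element from
# every set simultaneously, with no axiom needed.
def choose_smallest(family):
    return {frozenset(s): min(s) for s in family}

family = [{3, 1, 4}, {1, 5}, {9, 2, 6}]
chosen = choose_smallest(family)
assert chosen[frozenset({3, 1, 4})] == 1
assert chosen[frozenset({9, 2, 6})] == 2
```

<p>The axiom of choice only becomes necessary when no rule like this can be written down, as with Russell’s socks.</p>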
<h3 id="whats-the-problem">What’s the problem?</h3>
<p>The axiom of choice has weird effects precisely because it is so unlimited. It tells us that given any infinite collection of infinite sets, we can pick one option from each set, even if the sets are too big to really understand, and even if we don’t have any extra structure to guide us.</p>
<p>We can see how this matters by looking at a classic logic puzzle, and then taking it to infinity.</p>
<h5 id="the-finite-hat-puzzle">The (finite) hat puzzle</h5>
<p>Imagine a game show host<strong title="The _classic_ version of the puzzle features a sadistic prison warden. While that setup is traditional, it seems unnecessarily violent, so I've replaced it with something friendlier."><sup id="fnref:5"><a href="#fn:5" class="footnote">5</a></sup></strong> is going to line you up with 99 other people, and give each of you a hat to wear, which is either black or white. You can see everyone in front of you, including the colors of their hats; you can’t see your own hat, nor can you see anyone behind you.</p>
<p>Starting at the back of the line, the host will ask each person to guess whether their own hat is black or white. You’ll be able to hear the guesses, and whether they’re right or wrong.</p>
<p>Before the game starts, you all get a few minutes to talk and plan out your strategy. What should you do to get as many correct guesses as possible?</p>
<p>Stop and take a minute to think about this one. It doesn’t require any fancy mathematics, just a cute trick that’s surprisingly useful in other contexts.</p>
<p style="text-align: center;"><img src="https://64.media.tumblr.com/a6eec2d9352742626fe1fbe09b668cec/tumblr_nvpzxaBNRN1qgomego1_500.png" alt="Papyrus from Undertale dressed as Professor Layton: &quot;Human would you like a puzzle?&quot; Small child: &quot;Not really&quot; Papyrus: &quot;Too bad you're getting a puzzle&quot;" />
<em>Drawing by <a href="https://nightmargin.tumblr.com/post/130512412496/professor-skeleton-and-the-mystery-of-why-is">nightmargin</a> on Tumblr</em></p>
<p>As a hint, you can do really, really well. A simple approach that isn’t too bad is to have each odd-numbered person announce the color of the hat in front of them. This guarantees 50 right answers, and on average will get 75. But we can do much better than that.</p>
<p>Ready?</p>
<p>The person in the back of the line (call them \(A\)) doesn’t have any information, so there’s no possible way to guarantee they’ll get it right. But we can make sure everyone else wins. \(A\) can count up all the black hats in front of them and figure out if the number is even or odd. If it’s even, they’ll say “white”; if it’s odd, they’ll say “black”.</p>
<p>The second person \(B\) now knows whether \(A\) saw an even or odd number of black hats. But \(B\) can count up all the black hats <em>they</em> see. If \(A\) saw an even number of black hats but \(B\) sees an odd number, the difference must be \(B\)’s own hat: it’s black. If the two counts have the same parity, \(B\)’s hat is white.</p>
<p>The process continues down the line. \(C\) can tell whether \(A\) saw an even or odd number of black hats, and can also tell whether \(B\) was wearing black or white. Between that information, and seeing all the hats in front of them, \(C\) can figure out their own hat color.</p>
<p>(This sounds like it gets complicated very quickly, but we can streamline it. Count up all the black hats in front of you, and then add 1 to the number every time someone behind you says “black”. When the host reaches you, if the number is even you’re wearing a white hat, and if it’s odd you’re wearing a black hat.)</p>
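<p>The streamlined rule is easy to check with a quick simulation. Here’s a sketch in Python (the setup and names are mine):</p>

```python
import random

def play(hats):
    """Simulate the parity strategy. hats[0] is the back of the line;
    higher indices are further forward. Returns everyone's guesses."""
    guesses = []
    blacks_heard = 0  # times someone behind has said "black"
    for i in range(len(hats)):
        if i == 0:
            # The back person announces the parity of the black hats they
            # see: "white" for even, "black" for odd.
            count = sum(h == 'black' for h in hats[1:])
        else:
            # Everyone else: count black hats in front, plus "black"
            # answers heard so far; even means white, odd means black.
            count = sum(h == 'black' for h in hats[i + 1:]) + blacks_heard
        guess = 'black' if count % 2 else 'white'
        if guess == 'black':
            blacks_heard += 1
        guesses.append(guess)
    return guesses

hats = [random.choice(['black', 'white']) for _ in range(100)]
guesses = play(hats)
# Everyone except possibly the very back person is guaranteed correct.
assert guesses[1:] == hats[1:]
```

<p>The back person’s own guess is still a coin flip; the strategy only guarantees the other 99.</p>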
<p>This exact algorithm is used by a lot of computer systems, especially when transmitting data over noisy connections. Computers store information in bytes, which are strings of eight bits. But often they will only use seven of the bits to store information (for instance, in standard <a href="http://rabbit.eng.miami.edu/info/ascii.html">ASCII encoding</a> there are 128 possible characters, represented as a 7-bit number). In transmission, the eighth bit can be used as a <a href="https://en.wikipedia.org/wiki/Parity_bit">parity bit</a>, which will be 1 if the other digits include an even number of “1”s, and 0 if they include an odd number of “1”s.</p>
<p>Thus every byte should have an odd number of “1”s, and if any byte has an even number of “1”s the system knows it contains an error. In our solution \(A\) is effectively providing a parity bit for the string of hat colors, letting each player infer the information they don’t have: the color of their own hat.</p>
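<p>The parity-bit scheme is short enough to sketch directly (function names are mine; this follows the odd-parity convention described above):</p>

```python
def add_parity_bit(seven_bits):
    """Append an eighth bit so the full byte has an odd number of 1s."""
    ones = seven_bits.count('1')
    return seven_bits + ('0' if ones % 2 else '1')

def looks_valid(byte):
    """A received byte with an even number of 1s must contain an error."""
    return byte.count('1') % 2 == 1

byte = add_parity_bit('1000001')  # ASCII 'A' in 7 bits
assert looks_valid(byte)
# Flip any single bit and the parity check catches it.
corrupted = byte[:3] + ('1' if byte[3] == '0' else '0') + byte[4:]
assert not looks_valid(corrupted)
```

<p>Parity detects any odd number of flipped bits, though not which bits; like \(A\)’s announcement, it supplies exactly one bit of redundancy.</p>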
<h5 id="the-uncountable-hat-puzzle">The uncountable hat puzzle</h5>
<p>That puzzle is fun, and the solution is clever, but there’s nothing especially paradoxical or brain-breaking about it. And it doesn’t involve the axiom of choice at all. But we can write a harder version that does use the axiom of choice, and has truly ridiculous results.<strong title="I think I first heard about this version from Greg Muller at https://cornellmath.wordpress.com/2007/09/13/the-axiom-of-choice-is-wrong/"><sup id="fnref:6"><a href="#fn:6" class="footnote">6</a></sup> </strong></p>
<p style="text-align: center;"><img src="/assets/blog/aoc/this-puzzle-reminds-me-of-a-puzzle.png" alt="Professor Layton's head: Doing a puzzle? That reminds me of a puzzle!" /></p>
<p>Suppose the game host now gets an infinite line of people, so each person can see an infinite collection of people in front of them. (Let’s assume there is a <em>first</em> person in the line, so it’s not infinite in both directions; you have infinitely many people in front of you, but only finitely many behind.) And instead of black or white hats, we’ll write a random real number on each person’s hat: you could have 3 or 7, or \(5.234\) or \(\pi^e\) or \(\Gamma(3.5^{7.2e^2})\). And just to make it harder, you can’t even hear what happens behind you.</p>
<p>This looks plainly impossible. No one who can see your hat can communicate with you at all. Even if they could, there are <a href="https://en.wikipedia.org/wiki/Cantor's_diagonal_argument">more possible hat labels</a> than there are people in line. It seems like everyone working together wouldn’t be able to guarantee even one right answer. But if we can use the axiom of choice, we can guarantee that infinitely many people get the right answer—and even better, only finitely many people will get it wrong. In our endless infinite line, there will be a <em>last</em> wrong person; all the endless people in front of them will guess right.</p>
<p>How can this possibly work? First we’ll think about the set of all possible sequences<strong title="If you don't know what a sequence is, just think of this as an infinite list."><sup id="fnref:7"><a href="#fn:7" class="footnote">7</a></sup></strong> of real numbers. (If we’re being fancy we might call this set \(\mathbb{R}^{\mathbb{N}}\).) We’ll say that two sequences are equivalent if they’re only different in finitely many places. So the sequences \( \Big( 1,2,3,4,5,6, \dots \Big) \) and \( \Big( 17, 2000 \pi, -\frac{345}{e}, 4, 5, 6, \dots \Big) \) are equivalent, but \( \Big( 1,0,3,0,5,0, \dots \Big) \) isn’t equivalent to either of them.</p>
<p>This gives us what’s called an <a href="https://en.wikipedia.org/wiki/Equivalence_relation">equivalence relation</a> on the set of real sequences. Equivalence relations are a widely useful tool, and I might write about them some other time, but for right now the important thing is that they <em>partition</em> the set, or subdivide it into smaller sets of things that are all equivalent to each other. Each thing will be in one and only one smaller set, which we call an <em>equivalence class</em>.</p>
<p>In our case, this means we’ve taken the set of all sequences of real numbers, and split it up into a bunch of equivalence classes of sequences. Every sequence belongs to exactly one equivalence class. And within each equivalence class, all the sequences are equivalent to each other—which means that they only have finitely many differences from each other.</p>
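<p>In symbols, the equivalence relation on \(\mathbb{R}^{\mathbb{N}}\) is:</p>

```latex
% Two sequences are equivalent when they disagree in only finitely many places:
x \sim y \quad\Longleftrightarrow\quad
  \#\{\, n \in \mathbb{N} : x_n \neq y_n \,\} < \infty
% Transitivity holds because the set where x and z disagree is contained in
% the union of two finite sets: where x disagrees with y, and where y
% disagrees with z.
```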
<p>Now we use the axiom of choice. We can <em>choose</em> one representative sequence from each equivalence class, and have everyone memorize this set of chosen sequences. When we all line up, I can see everyone in front of me, so there are only finitely many hats I can’t see. That means the hats I <em>can</em> see already determine the equivalence class of the whole sequence: there’s exactly one sequence on my memorized list that can possibly be equivalent to it.</p>
<p>Now when the host reaches me, I don’t know what’s happened behind me. I don’t know the exact sequence of hat labels. But I don’t need to! I know which equivalence class the sequence is in, and I know which representative sequence we chose for that equivalence class. So I can tell the host the number for my position from the representative sequence that we chose.</p>
<p>I might not be right; I have no way to know until the host tells me. But since we’re all using the <em>same</em> representative sequence that we chose earlier, and that sequence differs from the “true” sequence in only finitely many places, infinitely many of us will answer correctly. And only finitely many will fail.</p>
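<p>We obviously can’t simulate uncountably many infinite sequences, but a finite toy model (entirely my own construction) shows the bookkeeping: if everyone guesses from the agreed representative, the wrong guesses land exactly at the finitely many positions where the true sequence deviates from it.</p>

```python
def representative(n):
    # The representative sequence everyone memorized for this equivalence
    # class; here, the all-zeros sequence.
    return 0.0

# The "true" hat sequence: the representative plus finitely many deviations.
deviations = {0: 3.0, 4: 2.71, 7: -1.0}

def true_hat(n):
    return deviations.get(n, representative(n))

# Person n guesses representative(n); they're wrong exactly where the true
# sequence deviates from the representative -- only finitely often.
wrong = [n for n in range(1000) if representative(n) != true_hat(n)]
assert wrong == [0, 4, 7]
```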
<h3 id="what-does-it-do-for-us">What does it do for us?</h3>
<p>The hat puzzle is obviously a little contrived, but the axiom of choice has a lot of surprising and sometimes disconcerting implications that are relevant to other fields of math. Some of these consequences are apparent paradoxes; others are things we would very much like to be true, and make the axiom of choice extremely useful.</p>
<h5 id="zorns-lemma">Zorn’s lemma</h5>
<p style="text-align: center;"><img src="/assets/blog/aoc/zorns_lemon.png" alt="What's yellow, sour, and equivalent to the axiom of choice? Zorn's Lemon!" /></p>
<p>Zorn’s Lemma is probably the most common use of the axiom of choice, but it’s a little tricky to explain. The formal statement is short enough:</p>
<blockquote>
<p><strong>Zorn’s Lemma:</strong> Every non-empty partially ordered set in which every totally ordered subset has an upper bound contains at least one maximal element.</p>
</blockquote>
<p>But it’s not super obvious what this means. The basic idea is that if we have some set where</p>
<ul>
<li>We can compare two elements and sometimes decide which one is “larger”;</li>
<li>but sometimes neither element counts as “larger”;</li>
<li><del>and we can never have an infinite collection of successively larger elements;</del>
any time we have an infinite collection of successively larger elements, there’s some other element bigger than all of them (thanks to Sniffnoy for the correction);</li>
</ul>
<p>then there must be a “largest” element.<strong title="Sometimes there can be _more than one_ largest element, which is a little weird. But since some pairs of elements can't be compared, you can have multiple elements that don't have anything above them. Imagine a company with two presidents: each of them is a highest-ranking person at the company. And that's why we say 'a' largest element rather than 'the' largest."><sup id="fnref:8"><a href="#fn:8" class="footnote">8</a></sup></strong></p>
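<p>For readers who find symbols clearer than words, here is one standard way to write the statement out:</p>

```latex
% P is a non-empty partially ordered set.
\bigl(\,\forall\, C \subseteq P \text{ totally ordered: }
      \exists\, u \in P \ \forall\, c \in C : c \leq u \,\bigr)
\;\implies\;
\exists\, m \in P \ \text{with no} \ x \in P \ \text{satisfying} \ m < x
% Note that m is "maximal" (nothing sits strictly above it), which is weaker
% than being a maximum (sitting above everything else).
```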
<p>This is surprisingly useful, for one very specific reason: we can build up solutions to our problems step by step, and have a guarantee that we’ll finish. This is a tool we want to use all the time in math. We even tried it earlier: if we have a collection of sets, we can choose an element from the first one, and then the second one, and then the third one….</p>
<p>The problem we ran into is that this will eventually let us choose one element from each of a thousand sets, or a million, or a billion. But we have no guarantee that we can “eventually” choose from each of an infinite, possibly uncountable, collection of sets. Zorn’s lemma <a href="https://gowers.wordpress.com/2008/08/12/how-to-use-zorns-lemma/">solves this exact problem for us</a>, and lets us extend these constructions to infinity. And often when we’re defining functions on an infinite set, that’s exactly what we want to do.</p>
<p>Zorn’s lemma has one more important consequence: it is <em>equivalent</em> to the axiom of choice. We can use the axiom of choice to prove Zorn’s lemma; but we can also use Zorn’s lemma to prove the axiom of choice (by extending the axiom of finite choice to infinity, in exactly the way we were just discussing). We can’t duck the axiom-of-choice question by just making Zorn’s lemma into an axiom; the two are a package deal. If we want the power of Zorn’s lemma, we’re stuck with the axiom of choice and all the weirdness it implies.</p>
<h5 id="well-ordering"><a name="well-ordering">Well-ordering</a></h5>
<blockquote>
<blockquote>
<p>The axiom of choice is obviously true, the well-ordering principle obviously false, and who can tell about Zorn’s lemma?</p>
</blockquote>
</blockquote>
<blockquote>
<blockquote>
<blockquote>
<p><a href="https://books.google.com/books?id=eqUv3Bcd56EC&q=Bona#v=snippet&q=Bona&f=false">Jerry Bona</a></p>
</blockquote>
</blockquote>
</blockquote>
<p>These equivalences are a recurring theme in discussions of the axiom of choice. Another non-obviously equivalent statement is the Well-Ordering Principle, which says we can put any set \(X\) in a <a href="https://en.wikipedia.org/wiki/Well-order">definite order</a>, so that any subset has a “first” element. This is much stranger than it probably sounds. For instance, it’s really easy to put the real numbers in order, but most subsets won’t have a first element. (What’s the smallest real number? What’s the smallest positive real number? What’s the smallest number greater than 3?)</p>
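<p>For reference, the formal definition: a total order \(\leq\) on a set \(X\) is a well-ordering when</p>

```latex
\forall\, S \subseteq X : \ S \neq \varnothing \implies
\exists\, m \in S \ \forall\, s \in S : m \leq s
% Every non-empty subset has a least element. The usual order on the
% integers fails this (the whole set has no least element), as does the
% usual order on the reals.
```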
<p>Indeed, the failure of the usual order on the real numbers to be a well-ordering is a traditional source of internet math flame wars. There have been many <a href="https://forums.whirlpool.net.au/thread/9nxvlq19">forum threads</a> and <a href="https://polymathematics.typepad.com/polymath/2006/06/no_im_sorry_it_.html">blog comment threads</a> arguing endlessly about whether the infinitely repeating decimal \(.\bar{9}\) is actually equal to \(1\). (Yes, it is.)</p>
<p>Skeptics often suggest that maybe \(.\bar{9}\) isn’t <em>quite</em> \(1\), but just very close. Maybe it’s the last number before \(1\), the biggest number smaller than \(1\). But with the normal order for the reals, no such number exists. The reals, in their usual order, are not well-ordered.</p>
<p>But with the axiom of choice, we can make up some <em>other</em> order for the real numbers, where every set has a first number. In fact, for any set, we can look at all the subsets and choose a first element for each one. We need to make sure that we do this consistently, but if we’re careful that’s not a problem, and so we can create a well-ordering on any set.</p>
<p>So what happens if we do this to the real numbers? There’s no real way to describe it—which is exactly why it requires the axiom of choice! You can make your favorite list of numbers and “choose” those to be first; the real difficulty is the need to make infinitely many choices. The axiom of choice lets us do this, but only in a totally non-explicit way that we can’t describe concretely.</p>
<h5 id="the-banach-tarski-paradox">The Banach-Tarski “paradox”</h5>
<p style="text-align: center;"><a href="https://xkcd.com/804"><img src="/assets/blog/aoc/xkcd_pumpkin_carving_edit.png" alt="xkcd 804: Pumpkin Carving. &quot;I carved and carved, and the next thing I knew I had _two_ pumpkins.&quot; &quot;I _told_ you not to take the axiom of choice.&quot;" /></a></p>
<p>But the most famous consequence of the axiom of choice, which probably deserves its own post, is the <a href="https://en.wikipedia.org/wiki/Banach%E2%80%93Tarski_paradox">Banach-Tarski paradox</a>. Banach-Tarski says that if we have a solid three-dimensional ball, we can split it into five non-overlapping sets, rearrange these sets without any stretching or bending, and finish with two balls, each identical to the original ball.<strong title="The more general result is: given any two bounded three-dimensional objects A and B, each with non-empty interior, we can partition A into a finite collection of sets, and then rearrange those sets to get precisely B. In the special case people usually quote, A is 'a ball' and B is 'two balls'."><sup id="fnref:9"><a href="#fn:9" class="footnote">9</a></sup></strong></p>
<p>That means we’ve doubled the volume of our stuff just by moving the pieces around, which seems, um, implausible. We definitely can’t do that with a real ball. But with the axiom of choice, we can define “pieces” of the ball that are so strange that they don’t really have sizes at all. If we put them together one way, we get one volume; if we put them together a different way, we get a different volume. But the components don’t have a well-defined volume, so this is logically consistent. (And thus not actually a paradox, despite the name!)</p>
<h5 id="a-bunch-of-other-things">A bunch of other things</h5>
<p>There’s a <a href="https://en.wikipedia.org/wiki/Axiom_of_choice#Equivalents">long list of statements</a> that are equivalent to the axiom of choice. They show up in fields all over math, and algebra, analysis, and topology all become much simpler if these things are true:</p>
<ul>
<li>Every vector space has a basis</li>
<li>A product of non-empty sets is non-empty</li>
<li>Every set can be made into a group</li>
<li>The product of compact topological spaces is compact</li>
<li><a name="tarski">Tarski’s theorem:</a> If \(A\) is an infinite set, there’s a bijection between \(A\) and \( A \times A \)</li>
</ul>
<p>Since these are all equivalences, we can prove the axiom of choice from any one of them. If you believe <em>any</em> of these statements, you’re stuck believing all of them—and the axiom of choice as well, with all its bizarre ball-cloning, hat-identifying implications.</p>
<h3 id="sois-it-true">So…is it true?</h3>
<blockquote>
<blockquote>
<p>Tarski…tried to publish his theorem (<a href="#tarski">stated above</a>) in the <em>Comptes Rendus Acad. Sci. Paris</em> but Fréchet and Lebesgue refused to present it. Fréchet wrote that an implication between two well known propositions is not a new result. Lebesgue wrote that an implication between two false propositions is of no interest. And Tarski said that after this misadventure he never tried to publish in the <em>Comptes Rendus</em>.</p>
</blockquote>
</blockquote>
<blockquote>
<blockquote>
<blockquote>
<p>Jan Mycielski, <a href="http://www.ams.org/notices/200602/fea-mycielski.pdf"><em>A System of Axioms of Set Theory for the Rationalists</em></a></p>
</blockquote>
</blockquote>
</blockquote>
<p>The big question is: <em>should</em> we believe any of these statements?</p>
<p>That might be a surprising question. Isn’t the whole point of math to have definitive, objectively correct answers? Either we can prove a result is true, or we can’t. We don’t generally ask whether we feel like believing a theorem. We proved it; we’re stuck with it.</p>
<p>But <em>axioms</em> are a little different. We need to decide on our axioms before we can prove things at all—or even decide what counts as a proof. Just like we can’t use a recipe to decide whether we want to make a cake or a cheeseburger, we can’t prove that an axiom is “correct”.</p>
<p>What we can do is look at a cake recipe, see what we’d have to do, and decide that maybe we don’t feel like making a cake after all. And we can look at what an axiom allows us to prove, and decide that maybe we don’t like those results and should pick some different axioms that don’t allow them.</p>
<h5 id="the-zermelo-fraenkel-axioms">The Zermelo-Fraenkel Axioms</h5>
<p>The standard system of axioms we use in math is called <a href="https://en.wikipedia.org/wiki/Zermelo%E2%80%93Fraenkel_set_theory">Zermelo-Fraenkel Set Theory</a>, or just ZF. These are the rules we use as the base for all our work. If we can use them to prove a statement, we just say it’s proven; if a statement contradicts the ZF axioms, we’ve disproven it.</p>
<p style="text-align: center;"><img src="/assets/blog/aoc/set-theory-is-enough-theory-already.jpg" alt="Grumpy Cat says: Set Theory / is enough theory already" /></p>
<p>If the axiom of choice contradicted ZF, then we could forget about it and move on with our lives. But in 1938 Kurt Gödel proved that this isn’t the case: if the ZF axioms are consistent, they remain consistent when we add the axiom of choice.</p>
<p>Similarly, if we could prove the axiom of choice from the ZF axioms, we would have to either accept it as true, or completely rework all the foundations of math<strong title="We've actually done that before. At the beginning of the 20th century, Bertrand Russell and others found deep contradictions in the naive version of set theory in use at the time, and the ZF axioms were developed to avoid those problems. But we'd rather avoid doing it again."><sup id="fnref:10"><a href="#fn:10" class="footnote">10</a></sup></strong>. But we can’t do that either. And this is more than just acknowledging that we haven’t proved it <em>yet</em>: in 1963 Paul Cohen invented a technique called forcing to prove that if ZF is consistent, then we can never prove the axiom of choice from the rest of the ZF axioms.</p>
<p>This combination of results feels a little weird, because it’s so different from the way we usually approach math. Math has a reputation for black-and-white thinking<strong title="I don't like this reputation in any context. Mathematical thinking creates tons of space for nuance and subtlety and shades of gray. But that's probably a different essay."><sup id="fnref:11"><a href="#fn:11" class="footnote">11</a></sup></strong>: there’s a right answer to every question, and other answers are wrong. But here I’m telling you that there is no right answer. We can accept or reject the axiom of choice, and it works equally well either way.</p>
<h5 id="independence-is-normal">Independence is normal</h5>
<p>But this is actually perfectly normal! Suppose I asked you “are triangles isosceles?” The right answer isn’t “yes” <em>or</em> “no”: it depends on the triangle. And there are some theorems we can prove about isosceles triangles, like “if a triangle is isosceles, it has two equal angles”. And there are different theorems we can prove about non-isosceles triangles. The “axiom of isosceles-ness” is independent of the definition of a triangle.</p>
<p>But that might sound a little glib; no one talks about triangles like that. A better example is Euclidean geometry. When Euclid gave his formalization of geometry in <em>Elements</em>, he began with <a href="https://en.wikipedia.org/wiki/Euclidean_geometry#Axioms">five axioms</a> (or “postulates”, as you might have called them in high school geometry). The fifth (and final) postulate, called the <a href="https://en.wikipedia.org/wiki/Parallel_postulate">parallel postulate</a>, proved to be rather awkward.</p>
<blockquote>
<p><strong><a href="https://en.wikipedia.org/wiki/Parallel_postulate">Parallel postulate</a>:</strong> There is at most one line that can be drawn parallel to another given one through an external point.<strong title="This version is more precisely known as Playfair's axiom. Euclid's phrasing (translated from Greek) was 'if a straight line falling on two straight lines make the interior angles on the same side less than two right angles, the two straight lines, if produced indefinitely, meet on that side on which the angles are less than two right angles.' But Playfair's axiom is much simpler to state, and the two statements are equivalent."><sup id="fnref:12"><a href="#fn:12" class="footnote">12</a></sup></strong></p>
</blockquote>
<p>This axiom is extremely important to geometry, but is much more complex and less self-evident than the other four axioms, which are statements like “all right angles are equal” and “we can draw a line connecting any two points”. Two millennia of mathematicians tried to remove this awkward complexity by proving the parallel postulate just from Euclid’s other axioms.</p>
<p>Then in the 1800s, we finally solved this problem—in the other direction. Euclidean geometry, including the parallel postulate, is completely consistent; but it’s also consistent to work with <em>non</em>-Euclidean geometries, in which the parallel postulate is false. Mathematicians constructed <a href="https://en.wikipedia.org/wiki/Non-Euclidean_geometry#Models_of_non-Euclidean_geometry">models</a> of elliptic geometry, in which there are no parallel lines, and of hyperbolic geometry, in which parallel lines are not unique.</p>
<p>What is a model? It’s just something that obeys all the axioms. So the work we do in high school, with pencil and paper on a flat surface, is a model of Euclidean geometry. It follows all five axioms, and any theorem that follows from the Euclidean axioms will be true of our pencil-and-paper work.</p>
<p>But if we work on the surface of a sphere, we get a model of non-Euclidean elliptic geometry. We can define a line to be a <a href="https://en.wikipedia.org/wiki/Great_circle">great circle</a>, a circle that goes fully around a sphere the long way. Any two points lie on exactly one great circle (as long as they aren’t exactly opposite each other; properly, elliptic geometry treats each pair of opposite points as a single point), so these “lines” obey Euclid’s first four axioms. But with a little bit of playing around, you can see that any pair of distinct great circles will intersect in two points. This model doesn’t have any parallel lines at all.</p>
<p style="text-align: center;"><img src="/assets/blog/aoc/Grosskreis.svg" alt="Image of a sphere, with great circles marked." /></p>
<p style="text-align: center"><em>The solid curves are great circles. The solid blue curve is the equator.</em> <br />
<em>The dashed curves aren’t great circles, so they don’t count as lines.</em> <br />
<em>Adapted from <a href="https://commons.wikimedia.org/wiki/File:Grosskreis.svg">Wikimedia Commons</a></em></p>
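<p>You can verify the two-intersection claim with a little vector algebra. Here’s a sketch in Python (the representation and names are mine): each great circle is the slice of the sphere by a plane through the center, so two distinct great circles meet along the planes’ common line, which pierces the sphere at two antipodal points.</p>

```python
import math

def cross(a, b):
    """Cross product of two 3D vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def great_circle_intersection(n1, n2):
    """Identify each great circle with its plane's normal vector; the two
    circles meet where the planes' common line pierces the unit sphere."""
    d = cross(n1, n2)  # direction lying in both planes
    norm = math.sqrt(sum(c * c for c in d))
    p = tuple(c / norm for c in d)
    return p, tuple(-c for c in p)  # two antipodal points

equator = (0, 0, 1)   # normal to the z = 0 plane
meridian = (0, 1, 0)  # normal to the y = 0 plane
p, q = great_circle_intersection(equator, meridian)
assert p == (-1.0, 0.0, 0.0) and q == (1.0, 0.0, 0.0)
```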
<p>We can also build <a href="https://en.wikipedia.org/wiki/Poincar%C3%A9_disk_model">models of hyperbolic geometries</a>, but they’re a little harder to describe. But just one of these models is enough to know that we can’t prove the parallel postulate from Euclid’s other axioms—at least, not unless the other axioms are themselves contradictory. Nor can we disprove it. We have to <em>decide</em> if we want to use the parallel postulate.</p>
<p>This is exactly what Gödel and Cohen did for the axiom of choice. Gödel constructed a model of ZF set theory with choice; Cohen constructed a model of ZF set theory without choice. So we have to decide if we want to use the axiom of choice. And this brings us back to the same question: what are we trying to describe? Is the world we want to study a model of ZF with choice, or without?</p>
<h5 id="how-do-we-choose">How do we choose?</h5>
<p>To decide if we should adopt an axiom, we need to know what our goals are, and what we’re trying to describe. Euclidean geometry is good for arranging furniture in my room, but it’s bad for planning long-range flights, for which the fact that we live on a sphere matters.</p>
<p style="text-align: center;"><img src="/assets/blog/aoc/great_circle_routes.png" alt="A diagram of a great circle flight path. First on a rectangular/planar projection, where it doesn't look like a straight line; then on a sphere, where it does." /></p>
<p style="text-align: center"><em>Plane flight paths don’t look like straight lines on a flat map.</em> <br />
<em>On a sphere we see they really are the shortest, “straightest” path.</em> <br />
<em>Adapted from <a href="https://commons.wikimedia.org/wiki/File:Different_map_projections.png">Wikimedia Commons</a> CC-BY-SA-3.0</em></p>
<p>We should ask the same question about the axiom of choice: what are we trying to describe? Does the axiom of choice bring us closer to describing the world accurately, or farther away? Is the world we want to study a model of ZF with choice, or without?</p>
<p>The obvious answer is that the axiom of choice has absurd and unrealistic results. In the real world we can’t slice up one billiard ball and assemble the pieces into two billiard balls, or save infinitely many people in the hat puzzle. So if the axiom of choice says we can, it must not be describing the real world.</p>
<p>But this argument isn’t terribly persuasive, because every single thing about the uncountable hat puzzle is physically absurd. Even the setup is ridiculous: we can’t have an infinite line of people, and if we were somehow put in an infinite line, we wouldn’t be able to see all the people in it, let alone the numbers on their hats.</p>
<p>The step where we use the axiom of choice is even more unrealistic. We take the uncountably infinite set of real sequences; we partition it into an uncountably infinite collection of infinite sets of sequences; and then we ask everyone to memorize an (infinite!) sequence from each of these infinitely many infinite sets.</p>
<p>I’d have a hard time remembering one list of a hundred numbers. Memorizing a thousand lists of a thousand numbers is extremely unlikely; memorizing infinitely many lists of infinitely many numbers is flatly impossible. And that’s before we ask how we can communicate the lists we’ve chosen to each other, so that each of the (infinitely many) people memorize the <em>same</em> infinite collection of infinite lists.</p>
<p>The Banach-Tarski argument isn’t any better. It splits the ball into only five pieces, sure, but each of those pieces is infinitely complex, enough so that you can’t concretely describe their shapes, let alone actually cut a ball into those pieces. The informal explanation that “you can slice a ball into five pieces and reassemble those pieces into two balls” is not true, because there’s no real way to produce the pieces you need.<strong title="Feynman has a story about this in his memoir. A math grad student described the Banach-Tarski paradox to him, and he bet that it was made up, rather than a real theorem. He was able to wriggle out of losing by pointing out that the grad student had described cutting up an _orange_, and you can't slice a physical object made up of atoms infinitely finely."><sup id="fnref:13"><a href="#fn:13" class="footnote">13</a></sup></strong></p>
<p>In the real world we <em>never see infinite sets</em>. We pretend some sets are infinite because it makes our lives easier. But any principle that <em>only</em> kicks in at infinity will never make contact with reality.</p>
<p style="text-align: center"><img src="/assets/blog/aoc/einstein_stupidity.jpeg" alt="Picture of Einstein: Two things are infinite: the universe and human stupidity; and I'm not sure about the universe." height="50%" width="50%" /></p>
<p style="text-align: center"><em>Einstein <a href="https://quoteinvestigator.com/2010/05/04/universe-einstein">probably didn’t say this</a>, but it’s a good line.</em></p>
<h3 id="not-as-crazy-as-it-seems">Not as crazy as it seems</h3>
<p>This might feel like it’s dodging the question, though. If infinity is fake, why should we use axioms that only matter for infinity? And if we are going to say things about infinity, shouldn’t they make sense?</p>
<p>Maybe it’s fine for a physicist to dismiss mathematical abstractions as unphysical and thus irrelevant. But math is about reasoning through the consequences of abstract hypotheticals! If we’re going to adopt a foundational principle like the axiom of choice, we should really mean that we believe it in every abstract hypothetical situation we’re going to apply it in.</p>
<p>But after we realize how infinity works, our absurd results look somewhat more reasonable.<strong title="This is a common mathematical rhetorical trick. Earlier I was trying to convince you that the implications of the axiom of choice were really weird. Now I'm going to try to convince you that they're perfectly reasonable. This exact two-step happens quite a lot in math exposition. I suspect this is due partially to the demands of pedagogy, and partly to the way we form our mathematical intuition."><sup id="fnref:14"><a href="#fn:14" class="footnote">14</a></sup></strong> Our “successful” strategy in the infinite hat game actually doesn’t give us all that much. Sure, only finitely many people lose; some person in the line will be the last to answer wrong. But what would this look like in practice?</p>
<p>You could imagine the first hundred people all getting the question wrong. But that’s okay; only finitely many people will get it wrong. Then the first thousand people all get it wrong. But we know that at some point a last person will get it wrong and everyone left will get it right. A million people all get it wrong. Everyone gets bored. The game show host decides to leave. And sure enough, only finitely many people ever answered the question wrong!</p>
<p>The axiom of choice argument somehow doesn’t do anything after a finite number of answers. You could have the first million, or the first trillion, people all get the question wrong, and that wouldn’t contradict our proof. All the weirdness happens out at infinity—and we already know that infinity is deeply weird.</p>
<h3 id="whats-the-point">What’s the point?</h3>
<p>The axiom of choice is logically independent of our axioms for set theory, so we can’t ever prove it true or false. And it says deeply strange things about deeply strange situations that can never really happen. So why does it matter?</p>
<h5 id="infinity-is-fake-but-useful">Infinity is fake <em>but useful</em></h5>
<p>The answer is the same as the reason we use infinity at all. Everything we’ve ever seen is finite and discrete: objects are made out of atoms, and even if space and time aren’t truly quantized, our ability to measure them definitely is. But it’s extremely convenient to pretend that reality is continuous, which allows us to solve problems with calculus and other clever math tricks. If the world is “close enough” to being continuous, our answers will be good enough for whatever we’re doing.</p>
<p>Any infinity we care about will come from a limit of finite things. I can measure the width of my office in meters, or centimeters, or millimeters. With the right equipment I could measure it in micrometers or nanometers. I can’t ever measure it with infinite precision, but I can <em>imagine</em> doing that. And it’s really convenient to say the width is a real number, rather than to insist that it must <em>really</em> be some integer number of picometers.</p>
<p>This exact reasoning is basically how all of calculus works. If I want to know how fast my car is going in miles per hour, I can measure the distance it travels in miles over the course of an hour. Or I can see how many miles it goes in a minute, and multiply by sixty. I could measure the number of miles it goes in a second, and multiply by 3600 (or more realistically, measure the number of <em>feet</em> it goes in a second, and multiply by 3600/5280).</p>
<p>But what is the speed “right now”? We imagine taking measurements over these shorter and shorter intervals; in the limit, when our interval is “infinitely short”, we get the instantaneous velocity. And that’s a derivative, which is an extremely powerful tool for doing math and physics.</p>
<p>But we can’t <em>actually</em> measure the distance traveled in an infinitely small window of time. (Nor can we measure the infinitely small time itself.) We’re taking some real, physical, finite measurements. We can measure how far a car goes in one second, multiply by 3600/5280, and then display that number on the dashboard. But the infinite version is something we only imagine.</p>
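<p>We can watch this limit happen numerically. Here’s a minimal sketch in Python; the <code>position</code> function is invented for illustration, not measured from any real car:</p>

```python
# Average speed over shrinking intervals approaches the instantaneous
# speed (the derivative). position(t) is a made-up example: miles
# traveled after t hours by a car that is gradually speeding up.
def position(t):
    return 60 * t + 5 * t**2

def average_speed(t, h):
    # Average speed over the finite interval [t, t + h], in mph
    return (position(t + h) - position(t)) / h

# Shrinking the window h zeroes in on the instantaneous speed at t = 1,
# which calculus says is exactly 70 mph for this position function.
for h in [1.0, 0.1, 0.01, 0.001]:
    print(h, average_speed(1.0, h))
```

<p>Every number in that loop comes from a real, finite interval; the derivative is the value those finite measurements approach, never something we measure directly.</p>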
<h3 id="just-relax">Just relax</h3>
<p>If we’re trying to model the world, any infinite set we have to deal with will be a limit of finite sets. And any infinite family of infinite sets will be a limit of finite families of finite sets. And we know we have choice for finite sets of finite sets. So we can always get choice for these specific infinite sets, if we really need it—just by taking the limit of the elements we chose from our finite families.</p>
<p>What the axiom of choice says is: don’t worry about it. You don’t have to explain <em>how</em> your family of sets came from a finite family. You don’t have to explain <em>how</em> you’re choosing elements. We’ll just assume you can make it work somehow.</p>
<p>That’s what axioms are for. They tell us what we want to just assume we can do, without really explaining how. Our axioms are a list of things we don’t want to have to think about. And in practice, we don’t have to think about whether we can make choices. Any time it really matters, we can.</p>
<hr />
<div class="footnotes">
<ol>
<li id="fn:1">
<p>We can be more formal by phrasing this in terms of <em>choice functions</em>: given a collection of sets \(\mathcal{X} = \{A\}\) there is a function \(f : \mathcal{X} \to \bigcup_{A \in \mathcal{X}} A\) such that \(f(A) \in A \) for each \(A \in \mathcal{X} \). But I want to keep the discussion as readable as possible if you’re not comfortable with the language of formal set theory. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>Using this sort of process on an infinite set is called <a href="https://en.wikipedia.org/wiki/Transfinite_induction">transfinite induction</a>. <del>If we allow transfinite induction then we get the axiom of choice for free. But the axiom of choice also implies that we can do transfinite induction; the two concepts are logically equivalent.</del> Transfinite induction can sometimes allow us to make choices without the axiom, but only if we can put our sets in some order. Conversely, the axiom of choice allows us to <a href="https://en.wikipedia.org/wiki/Transfinite_induction#Relationship_to_the_axiom_of_choice">use transfinite induction in cases we otherwise couldn’t</a>.</p>
<p>Thanks to Sniffnoy for a helpful correction here. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3">
<p>The set of real numbers doesn’t have a smallest element or a largest element. Nor does the set of positive real numbers, or the set of numbers between zero and one. So if we have a collection of sets of real numbers, the rule we used for sets of positive integers doesn’t work. <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
<li id="fn:4">
<p>This example was originally offered by Bertrand Russell. <a href="#fnref:4" class="reversefootnote">↩</a></p>
</li>
<li id="fn:5">
<p>The <em>classic</em> version of the puzzle features a sadistic prison warden. While that setup is traditional, it seems unnecessarily violent, so I’ve replaced it with something friendlier. <a href="#fnref:5" class="reversefootnote">↩</a></p>
</li>
<li id="fn:6">
<p>I think I first heard about this version from <a href="https://cornellmath.wordpress.com/2007/09/13/the-axiom-of-choice-is-wrong/">Greg Muller</a>. <a href="#fnref:6" class="reversefootnote">↩</a></p>
</li>
<li id="fn:7">
<p>If you don’t know what a sequence is, just think of this as an infinite list. <a href="#fnref:7" class="reversefootnote">↩</a></p>
</li>
<li id="fn:8">
<p>Sometimes there can be <em>more than one</em> largest element, which is a little weird. But since some pairs of elements can’t be compared, you can have multiple elements that don’t have anything above them. Imagine a company with two presidents: each of them is the highest-ranking person at the company. And that’s why we say “a” largest element rather than “the” largest. <a href="#fnref:8" class="reversefootnote">↩</a></p>
</li>
<li id="fn:9">
<p>The more general result is: given any two three-dimensional objects \(A\) and \(B\), we can partition \(A\) into a finite collection of sets, and then rearrange those sets to get precisely \(B\). In the special case people usually quote, \(A\) is “a ball” and \(B\) is “two balls”. <a href="#fnref:9" class="reversefootnote">↩</a></p>
</li>
<li id="fn:10">
<p>We’ve actually done that before. At the beginning of the 20th century, Bertrand Russell and others found deep contradictions in the naive version of set theory in use at the time, and the ZF axioms were developed to avoid those problems. But we’d rather avoid doing it again. <a href="#fnref:10" class="reversefootnote">↩</a></p>
</li>
<li id="fn:11">
<p>I don’t like this reputation in any context. Mathematical thinking creates tons of space for nuance and subtlety and shades of grey. But that’s probably a different essay. <a href="#fnref:11" class="reversefootnote">↩</a></p>
</li>
<li id="fn:12">
<p>This version is more precisely known as <a href="https://en.wikipedia.org/wiki/Playfair's_axiom">Playfair’s axiom</a>. Euclid’s phrasing (translated from Greek) was “if a straight line falling on two straight lines make the interior angles on the same side less than two right angles, the two straight lines, if produced indefinitely, meet on that side on which the angles are less than two right angles.” But Playfair’s axiom is much simpler to state, and the two statements are equivalent. <a href="#fnref:12" class="reversefootnote">↩</a></p>
</li>
<li id="fn:13">
<p>Feynman has a story about this in <a href="https://en.wikipedia.org/wiki/Surely_You're_Joking,_Mr._Feynman!">his memoir</a>. A math grad student described the Banach-Tarski paradox to him, and he bet that it was made up, rather than a real theorem. He was able to wriggle out of losing by pointing out that the grad student had described cutting up an <em>orange</em>, and you can’t slice a physical object made up of atoms infinitely finely. <a href="#fnref:13" class="reversefootnote">↩</a></p>
</li>
<li id="fn:14">
<p>This is a common mathematical rhetorical trick. Earlier I was trying to convince you that the implications of the axiom of choice were really weird. Now I’m going to try to convince you that they’re perfectly reasonable. This exact two-step happens quite a lot in math exposition. I suspect this is due partially to the demands of pedagogy, and partly to the way we form our mathematical intuition. <a href="#fnref:14" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Jay DaigleOne of the easiest ways to start a (friendly) fight in a group of mathematicians is to bring up the axiom of choice. I'll explain what it is, why it's so controversial, and hopefully shed some light on how we choose axiomatic systems and what that means for the math we do.Lockdown Recipes&colon; Red Beans and Rice2020-05-25T00:00:00-07:002020-05-25T00:00:00-07:00https://jaydaigle.net/blog/lockdown-recipes-red-beans<p>Since we’re all stuck at home and cooking more than usual, I wanted to share one of my favorite recipes from my childhood, which is also especially suited to our current stuck-at-home ways.<strong title="Yeah, it would have made even more sense to post this two months ago. But two months ago I was trying to figure out how to teach three math classes over the internet instead of recipeblogging."><sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup></strong></p>
<p><img src="/assets/blog/recipes/red_beans_and_rice.jpg" alt="A bowl of red beans and rice" style="width:250px; float:right" /></p>
<p><a href="https://en.wikipedia.org/wiki/Red_beans_and_rice">Red Beans and Rice</a> is a traditional Louisiana Creole dish. It’s cheap, extremely easy, and low-effort to make. The one major downside is that it takes several hours of simmering (that don’t require any attention); in normal times that’s a major disadvantage, but if you’re working from home that’s not a problem at all.</p>
<p>In fact, this dish was originally a solution to a working-from-home dilemma that Louisiana cooks faced. Monday was laundry day, and the women of the house were so busy doing the wash that they couldn’t spend all day tending food on the stove. So this hands-off dish became a traditional Monday dinner.</p>
<p>There are a <em>lot</em> of ways you can vary this dish. I’ll give two straightforward recipes: one for the traditional stovetop method, and one for a faster pressure-cooker method that I use during busier times, which cuts down on both prep and waiting. But I also want to talk about what some of the steps are doing, and how you can change things up to get different flavor profiles if you want.</p>
<h3 id="ingredients">Ingredients</h3>
<h5 id="aromatics">Aromatics</h5>
<ul>
<li>1/2 stick butter</li>
<li>1 chopped onion</li>
<li>4-5 ribs celery</li>
<li>1-2 chopped bell peppers</li>
<li>3 cloves garlic, finely chopped</li>
<li>Tablespoon chopped parsley</li>
<li>Teaspoon chopped thyme</li>
</ul>
<h5 id="body">Body</h5>
<ul>
<li>1 pound dried red kidney beans</li>
<li>1-2 pounds smoked or andouille sausage, sliced into bite-size pieces</li>
<li>6 oz tomato paste (one small can)</li>
</ul>
<h5 id="seasoning">Seasoning</h5>
<ul>
<li>2 bay leaves</li>
<li>quarter cup of brown sugar</li>
<li>1 tablespoon mustard</li>
<li>1 teaspoon paprika</li>
<li>Salt and cayenne pepper to taste</li>
</ul>
<h3 id="traditional-red-beans">Traditional red beans</h3>
<ol>
<li>In a large (at least two gallons) pot, melt the butter over medium heat. Sweat the onions, celery, and bell peppers for 5-10 minutes, until soft and onions are translucent.</li>
<li>Add garlic, parsley, and thyme and sauté for a couple minutes more, until soft.</li>
<li>Rinse the kidney beans and add them to the pot. Add water (or stock) until the beans are covered by an inch or two of liquid, and heat to a high simmer. Cover the pot and leave to simmer.</li>
<li>After a half hour or so, add meat and tomato paste, and stir to combine. Return to a simmer and cover.</li>
<li>After another hour, add seasonings. Return to a simmer and cover again.</li>
<li>Once every hour or so, check on the pot. Top it off with extra liquid if it’s starting to run low, and scrape the bottom a bit to make sure nothing is sticking.</li>
<li>After six to eight hours, the beans should be basically disintegrated: you’ll see the shells floating in the liquid, but the insides of the bean will have absorbed into the liquid base and formed a rich, thick paste. At this point you might want to taste it and adjust seasonings to your preference.</li>
<li>Serve over rice.</li>
</ol>
<h3 id="pressure-cooker-red-beans">Pressure cooker red beans</h3>
<p>Rinse the red beans. Then dump all the ingredients in the pressure cooker. Cook on high pressure for two hours, then simmer until consistency is good. Serve over rice.</p>
<p>(See how easy that was?)</p>
<h3 id="variations">Variations</h3>
<h5 id="aromatics-1">Aromatics</h5>
<p>Onions, celery, and bell peppers are the traditional base for New Orleans stocks and soups, known as the “<a href="https://en.wikipedia.org/wiki/Holy_trinity_(cuisine)">Holy Trinity</a>”. They serve the same role as the French <a href="https://en.wikipedia.org/wiki/Mirepoix_(cuisine)">mirepoix</a> (onions, celery, and carrots) or the Spanish <a href="https://en.wikipedia.org/wiki/Sofrito">sofrito</a> (garlic, onion, peppers, and tomatoes). If you like those other flavor profiles more, you can substitute a different aromatic base. You can also use whatever fat you like for the sautéing.</p>
<p>Some people like to brown their aromatics, while others like to gently sweat them without browning. The flavor profiles are slightly different, so take your pick.</p>
<p>If you want to speed things up a bit, you can sweat your aromatics in a separate skillet while starting the boil on the red beans. I often find this easier to manage, not needing to stir the aromatics in the giant stock pot, but it does require a second pan.</p>
<h5 id="body-1">Body</h5>
<p>The most important aspect here is the kidney beans. It is <em>very important</em> that they stay at a full boil for at least half an hour; kidney beans <a href="https://en.wikipedia.org/wiki/Kidney_bean#Toxicity">are toxic</a> and it takes a good boiling to break those toxins down.</p>
<p>A lot of people like to soak their beans overnight before cooking with them. This makes the toxins break down a bit easier, and also makes them cook faster; it probably cuts the cooking time from eight hours or so down to six. It changes the flavor in a way I don’t like, so I don’t do it. But you might prefer that flavor!</p>
<p>You can definitely substitute in other beans, but you’ll get a different texture. Kidney beans are extremely tough and starchy and give the stock a nice body when completely broken down.</p>
<p>I like the flavor effect of adding a can of tomato paste, but it’s not especially traditional. This is totally optional.</p>
<p>Because the red beans add body, this broth works just fine with plain water. But if you have stock in your kitchen it can add extra layers of flavor and body to your dish. I generally start with homemade stock, and top it off with water as the cooking continues.</p>
<h5 id="meat">Meat</h5>
<p>You can flavor this broth with nearly any meat you have. Traditionally, the cook would use the leftover bones from the Sunday roast to flavor the red bean broth on Monday. If you happen to have some chicken or pork bones left over, you can do <em>far</em> worse than adding them to the pot.</p>
<p>When I’m doing it in the pressure cooker, I often like to take a 3-4 pound bone-in pork shoulder and add that in place of the sausage. I get the broth richness from the bone, and the meat of the pork shoulder falls off into the stew nicely. I haven’t tried this in the traditional method but I’m sure it would work.</p>
<p>If you do use pre-chopped meat like sausage, you can brown it in a separate pan for extra flavor. Extra steps and an extra pan, but extra flavor; your call whether it’s worth it.</p>
<p>Andouille sausage is probably the most standard sausage choice right now. It’s spicy, so you may want something milder. It’s also a bit more expensive than I tend to want to go for this dish; the sausage can easily be more than half the cost of the entire dish. My default option is Hillshire Farms smoked sausage, but you can use whichever firm sausage you like.</p>
<p>And the dish does work fine with no meat at all, if you’d prefer a vegetarian option. Replace the butter with oil and you can make it vegan.</p>
<h5 id="seasoning-1">Seasoning</h5>
<p>This is really flexible. To be honest, I primarily season with a healthy dose of Tony Chachere’s spice mix. I also add the sugar, and either a dollop of oyster sauce or a pinch of MSG powder.</p>
<p>But there are of course lots of options here. I don’t think the mustard is super traditional, but I very much like the effect.</p>
<p>Almost any spices you like can go here. I suspect coriander would be good. Swap out the cayenne pepper for black pepper, or for Tabasco sauce (very traditional in New Orleans food). Or you could change up the flavor profile entirely and push it towards your favorite cuisine. Use an Italian spice blend, or a Mexican blend, or an Indian blend, whatever strikes your fancy. And if you find something that works really well—let me know!</p>
<p><em>Did you make this? What did you think? Do you have a favorite lockdown recipe to share? Tweet me <a href="https://twitter.com/profjaydaigle">@ProfJayDaigle</a> or leave a comment below.</em></p>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>Yeah, it would have made even more sense to post this two months ago. But two months ago I was trying to <a href="https://jaydaigle.net/blog/online-teaching-in-the-time-of-coronavirus/">figure out how to teach three math classes over the internet</a> instead of recipeblogging. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Jay DaigleSince we're all stuck at home and cooking more than usual, I wanted to share one of my favorite recipes from my childhood, which is also especially suited to our current stuck-at-home ways. Red Beans and Rice is a traditional Louisiana Creole dish. It's cheap and extremely easy and low effort to make. The one major downside is that it takes several hours of simmering (that don't require any attention); in normal times that's a major disadvantage, but if you're working from home that's not a problem at all.The SIR Model of Epidemics2020-03-27T00:00:00-07:002020-03-27T00:00:00-07:00https://jaydaigle.net/blog/the-sir-model-of-epidemics<script src="https://sagecell.sagemath.org/static/embedded_sagecell.js"></script>
<script>sagecell.makeSagecell({"inputLocation": ".sage"});</script>
<p>For <em>some</em> reason, a lot of people have gotten really interested in epidemiology lately. Myself included.</p>
<p><img src="/assets/blog/sir/coronavirus.jpg" alt="Picture of a coronavirus, by Alissa Eckert, MS and Dan Higgins, MAMS, courtesy of the CDC" class="center" style="width:350px" /></p>
<p style="text-align: center"><em>I have no idea why.</em></p>
<p>Now, I’m not an epidemiologist. I don’t study infectious diseases. But I do know a little about how mathematical models work, so I wanted to explain how one of the common, simple epidemiological models works. This model isn’t anywhere near good enough to make concrete predictions about what’s going to happen. But it <em>can</em> give some basic intuition about how epidemics progress, and provide some context for what the experts are saying.</p>
<hr />
<p><strong>Disclaimer:</strong> I don’t study epidemics, and I don’t even study differential equation models like this one. I’m basically an interested amateur. I’m going to try my best not to make any predictions, or say anything specific about COVID-19. I don’t know what’s going to happen, and you shouldn’t listen to my guesses, or the guesses of anyone else who isn’t an actual epidemiologist.</p>
<hr />
<h2 id="the-sir-model">The SIR Model</h2>
<h3 id="parameters">Parameters</h3>
<p>The SIR model divides the population into three groups, which give the model its name:</p>
<ul>
<li>$S$ is the number of <strong>S</strong>usceptible people in the population. These are people who aren’t sick yet, but could get sick in the future.</li>
<li>$I$ is the number of <strong>I</strong>nfected people. These are the people who are sick<strong title="Or people who are asymptomatic carriers. This model doesn't worry about who actually gets a fever and starts coughing, just who carries the virus and can maybe infect others."><sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup></strong> right now.</li>
<li>$R$ is the number of people who have <strong>R</strong>ecovered from the virus. They are immune and can’t get sick again.</li>
<li>We also will use $N$ for the total number of people. So $N = S+ I + R$.</li>
</ul>
<p><img src="/assets/blog/sir/knight.jpg" alt="Picture of a Knight, by Paul Mercuri (1860)" class="center" style="width:400px" /></p>
<p style="text-align:center"><em>Not that kind of “sir”.</em></p>
<p>For the purposes of this model, we assume that the total number of people, $N$, doesn’t change. But the number of people in each $S,I,R$ group is changing all the time: susceptible people get infected, and infected people recover. So we write $S(t)$ for the number of susceptible people “at time $t$”—which is just a fancy way of saying that $S(3)$ means the number of susceptible people on the third day.</p>
<h3 id="change-over-time">Change Over Time</h3>
<p>In order to model how these groups evolve over time, we need to know how often those two changes happen. How quickly do sick people recover? And how quickly do susceptible people get sick?</p>
<p>The first question, in this model, is simple. Each infected person has a chance of recovering each day, which we call $\gamma$. So if the average person is sick for two weeks, we have $\gamma = \frac{1}{14}$. And on each day, $\gamma I$ sick people recover from the virus.</p>
<p>The second question is a little trickier. There are basically three things that determine how likely a susceptible person is to get sick: how many people they encounter in a day, what fraction of those people are sick, and how likely a sick person is to transmit the disease. The middle factor, the fraction of people who are sick, is $\frac{I}{N}$. We could think about the other two separately, but for mathematical convenience we group them together and call them $\beta$.</p>
<p>So the chance that a given susceptible person gets sick on each day is $\beta \frac{I}{N}$.<strong title="If we're being fancy, we say that the chance of getting sick is proportional to I/N and that β is the constant of proportionality. But if you're not used to differential equations already I'm not sure that tells you very much."><sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup></strong> And thus the total number of people who get sick each day is $\beta \frac{I}{N} S$.</p>
<p>If these letters look scary, it might help to realize that you’ve probably spent a lot of time lately thinking about $\beta$—although you probably didn’t call it that. The parameter $\beta$ measures how likely you are to get sick. You can decrease it by reducing the number of people you encounter in a day, through “social distancing” (or <a href="https://www.washingtonpost.com/lifestyle/wellness/social-distancing-coronavirus-physical-distancing/2020/03/25/a4d4b8bc-6ecf-11ea-aa80-c2470c6b2034_story.html">physical distancing</a>). And you can decrease it by improved hygiene—better handwashing, not touching your face, and sterilizing common surfaces.</p>
<p>There’s one more number we can extract from this model, which you might have heard of. In a population with no resistance to the disease (so $I$ and $R$ are both small, and we can pretend that $S=N$), a sick person will infect $\beta$ people each day, and will be sick for $\frac{1}{\gamma}$ days, and so will infect a total of $\frac{\beta}{\gamma}$ people. We call this ratio $R_0$; you may have seen in the news that the $R_0$ for COVID-19 is probably about $2.5$.</p>
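<p>These rates are easy to experiment with numerically. Here’s a minimal sketch of one day of transitions in Python. The parameter values are illustrative, not fitted estimates: $\gamma = 1/14$ matches the two-week illness above, and $\beta$ is chosen so that $\beta / \gamma \approx 2.5$.</p>

```python
# One day of transitions in the SIR model, using the rates described
# above. Parameters are illustrative: gamma = 1/14 (two-week average
# illness), beta chosen so R0 = beta/gamma is about 2.5.
def sir_day(S, I, R, beta=0.18, gamma=1/14):
    N = S + I + R
    new_infections = beta * (I / N) * S  # susceptibles who get sick today
    recoveries = gamma * I               # infected people who recover today
    return S - new_infections, I + new_infections - recoveries, R + recoveries

# Early epidemic: almost everyone is still susceptible
S, I, R = sir_day(990, 10, 0)
```

<p>Because $\beta / \gamma &gt; 1$ here, more people get sick each day than recover, so the infected count grows, at least while almost everyone is still susceptible.</p>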
<p><img src="/assets/blog/sir/file-20200128-120039-bogv2t.png" alt="A graph demonstrating exponential growth when R0 = 2" class="center" style="width:377px;" /></p>
<p style="text-align: center;"><em>When $\beta$ is twice as big as $\gamma$, things can get bad very quickly. From <a href="https://theconversation.com/r0-how-scientists-quantify-the-intensity-of-an-outbreak-like-coronavirus-and-predict-the-pandemics-spread-130777">The Conversation</a>, licensed under <a href="http://creativecommons.org/licenses/by-nd/4.0/">CC BY-ND</a></em></p>
<h3 id="assumptions-and-limitations">Assumptions and Limitations</h3>
<p>Like all models, this is a dramatic oversimplification of the real world. Simplification is good, because it means we can actually understand what the model says, and use that to improve our intuitions. But we do need to stay aware of some of the things we’re leaving out, and think about whether they matter.</p>
<p><strong>First</strong>: the model assumes a static population: no one is born and no one dies. This is obviously <em>wrong</em> but it shouldn’t matter too much over the months-long timescale that we’re thinking about here. On the other hand, if you want to model years of disease progression, then you might need to include terms for new susceptible people being born, and for people from all three groups dying.</p>
<p><strong>Second</strong>: the model assumes that recovery gives permanent immunity. Everyone who’s infected will eventually transition to recovered, and recovered people never lose their immunity and become susceptible again. I don’t think we know yet how many people develop immunity after getting COVID-19, or how long that immunity lasts.</p>
<p>But it seems basically reasonable to assume that most people will get immunity for at least several months; in this model we’re simplifying that to assume “all” of them do. And since we’re only trying to model the next several months, it doesn’t matter for our purposes whether immunity will last for one year or ten.</p>
<p><strong>Third</strong>: we assumed that $\beta$ and $\gamma$ are constants, and not changing over time. But a lot of the response to the coronavirus has been designed to decrease $\beta$—and the extent of those changes may vary over time. People will be more or less careful as they get more or less worried, as the disease gets worse or better. And people might just get restless from staying home all the time and start being sloppier. An improved testing regime might also decrease $\beta$, and better treatments could improve $\gamma$.</p>
<p>But the model leaves $\beta$ and $\gamma$ the same at all times. So we can imagine it as describing what would happen if we didn’t change our lifestyle or do anything in response to the virus.</p>
<p><strong>Finally</strong>: the first two factors, combined, mean that the susceptible population can only decrease, and the recovered population can only increase. Since we also hold $\beta$ and $\gamma$ constant, this model of the pandemic will only have one peak. It will never predict periodic or seasonal resurgences of infection, like we see with the flu.</p>
<p><img src="/assets/blog/sir/CDC-influenza-pneumonia-deaths-2015-01-10.gif" alt="graph of flu deaths, 2010 - 2014" class="center" /></p>
<p style="text-align: center;"><em>A graph of flu deaths per week, peaking each winter, from the CDC. The vanilla SIR model will never produce this sort of periodic seasonal pattern.</em></p>
<p><img src="https://miro.medium.com/max/2000/1*ok3NLISRGvK-4SQyDA5KTg.png" alt="stylized graph of possible COVID-19 trajectories" class="center" style="width:500px;" /></p>
<p style="text-align: center;"><em>This green curve imagines a “dance” where we suppress coronavirus infections through an aggressive quarantine, and then spend months alternately relaxing the quarantine until infections get too high, and then tightening it again until infections fall back down. The SIR model doesn’t allow this sort of dynamic variation of $\beta$ and can never produce the green curve.</em></p>
<h3 id="the-whole-system">The Whole System</h3>
<p>If we put all this together we get a <em>system of ordinary nonlinear differential equations</em>. A differential equation is an equation that talks about how quickly something changes; in these equations, we have the rates at which the number of susceptible, infected, and recovered people change. “Ordinary” means that there’s only one input variable; all the parameters change with time, but we’re not taking location as an input or anything. “Nonlinear” means that our equations aren’t in a specific “linear” form that’s really easy to work with.</p>
<p><img src="/assets/blog/sir/13974391215433.jpg" alt="Photo of a Kitten" class="center" style="width:479px" /></p>
<p style="text-align: center"><em>Calling these equations a “nonlinear system” is a lot like calling this kitten a “nondog animal”. It’s not wrong, but it’s kind of weirdly specific if you’re not at a dog show.</em></p>
<p>If you took calculus, you might remember that we often write $\frac{dS}{dt}$ to mean the rate at which $S$ is changing over time. Roughly speaking, it’s the change in the total number of susceptible people over the course of a day. We know that $S$ is decreasing, since susceptible people get sick but we’re assuming that people don’t <em>become</em> susceptible, so $\frac{dS}{dt}$ is negative. And specifically, we worked out that $\frac{dS}{dt}$ is $-\beta \frac{IS}{N}$, since that’s the number of people who get sick each day.</p>
<p>Similarly, we saw that $\frac{dR}{dt}$ is $\gamma I$, the number of people who recover each day. And $\frac{dI}{dt}$ is the number of people who get sick minus the number who recover. All together this gives us:</p>
<p>\begin{align}
\frac{dS}{dt} &= - \beta \frac{IS}{N} \\
\frac{dI}{dt} &= \beta \frac{IS}{N} - \gamma I \\
\frac{dR}{dt} &= \gamma I
\end{align}</p>
<hr />
<h2 id="what-did-we-learn">What Did We Learn?</h2>
<p>Now that we have this model, what’s the point? We can actually do a few different things with a model like this. If we want, we can write down an <a href="https://arxiv.org/abs/1403.2160">exact formula</a> that tells us how many people will be sick on each day. Unfortunately, the exact formula isn’t actually all that helpful. The paper I linked includes lovely equations like</p>
<script type="math/tex; mode=display">z(\psi )= e^{-\mu\int_1^{\psi } \frac{ e^{\Psi (\xi )}}{\xi } \, d\xi } \left[\int_1^{\psi } e^{\Psi (\chi )+\mu\int_1^{\chi } \frac{ e^{\Psi (\xi )}}{\xi } \, d\xi } \, d\chi
-\int_1^{\gamma N_2} e^{\Psi (\chi )+\mu\int_1^{\chi } \frac{ e^{\Psi (\xi )}}{\xi } \, d\xi } \, d\chi +N_3 e^{\mu\int_1^{\gamma N_2} \frac{
e^{\Psi (\xi )}}{\xi } \, d\xi }\right].</script>
<p>And I don’t want to touch a formula that looks like that any more than you do.</p>
<p>Even if the formula were nicer, it wouldn’t be all that useful. Getting an exact solution to the equations doesn’t mean we know exactly how many people are going to get sick. Like all models, this one is a gross oversimplification of the real world. It’s not useful for making exact predictions; and if you want predictions that are <em>kinda</em> accurate, you should talk to the epidemiological experts, who have much more complicated models and much better data.</p>
<h3 id="qualitative-judgments">Qualitative Judgments</h3>
<p>But this model does give us a qualitative sense of how epidemics progress. For instance, in the very early stages of the epidemic, almost everyone will be susceptible. So we can make the further simplifying assumption that $S \approx N$, and get the equation
<script type="math/tex">\frac{dI}{dt} = \beta I.</script>
This is <a href="https://jaydaigle.net/blog/a-neat-argument-for-the-uniqueness-of-e-x/">famously</a> the equation for <a href="https://en.wikipedia.org/wiki/Exponential_growth">exponential growth</a>. And indeed, graphs of new coronavirus infections seem to start nearly perfectly exponential.</p>
<p><img src="https://cdn.i24news.tv/uploads/49/ba/a9/51/db/2f/9b/b6/08/0e/96/64/95/71/70/7f/49baa951db2f9bb6080e96649571707f.png" alt="Comparison of reported Chinese cases with exponential curve" class="center" style="width:320px;" /></p>
<p style="text-align: center;"><em>This graph <a href="https://www.i24news.tv/en/news/international/asia-pacific/1580327226-analysis-at-current-rate-china-virus-could-infect-over-25-000-by-february">from I24 news</a> of reported infections in China almost perfectly matches the exponential curve.</em></p>
<p><img src="https://static01.nyt.com/images/2020/03/20/science/virus-log-chart-1584728689795/virus-log-chart-1584728689795-facebookJumbo.jpg" alt="Linear and logarithmic scale plots of US and Italian coronavirus cases" style="width:600px;" class="center" /></p>
<p style="text-align: center;"><em>This <a href="https://www.nytimes.com/2020/03/20/health/coronavirus-data-logarithm-chart.html">New York Times graph</a> shows the exponential curves in both the US and Italy on the left. The right-hand logarithmic plots look nearly like straight lines, which also reflects the exponential growth pattern.</em></p>
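<p>We can sanity-check this early-epidemic behavior numerically. The sketch below, in plain Python with made-up illustrative values of $\beta$ and $I_0$, steps the equation $\frac{dI}{dt} = \beta I$ forward with Euler’s method and compares the result to the closed-form exponential $I_0 e^{\beta t}$:</p>

```python
import math

# Illustrative values only -- not estimates of anything real.
beta = 0.2    # growth rate per day
I0 = 100.0    # initial number of infections
dt = 0.001    # Euler step size, in days
days = 10

I = I0
for _ in range(int(days / dt)):
    I += beta * I * dt  # dI/dt = beta * I

exact = I0 * math.exp(beta * days)
print(I, exact)  # the Euler value tracks the exact exponential closely
```

<p>With a small enough step size the two values agree to within a small fraction of a percent, which is the “nearly perfectly exponential” growth visible in the graphs above.</p>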
<p>As the epidemic progresses, the numbers of infected and recovered people climb. Each sick person will infect fewer additional people, since more of the people they meet are immune. We can see this in the model: the number of people who get infected each day is $\beta \frac{S}{N} I$. After many people have gotten sick, $\frac{S}{N}$ goes down and so fewer people get infected for a given value of $I$.</p>
<p>The epidemic will peak when people are recovering at least as fast as they get sick. This happens when $\beta \frac{IS}{N} \leq \gamma I$, and thus when $S = \frac{\gamma}{\beta} N$. Remember that $\frac{\beta}{\gamma}$ was our magic number $R_0$, so by the peak of the epidemic, only one person out of every $R_0$ people will have avoided getting sick.</p>
<p>If the estimates of $R_0 \approx 2.5$ are correct, this would mean that the epidemic would peak when something like 60% of the population had gotten sick. And remember, that’s not the end of the epidemic; that’s just the worst part. It would slowly get weaker from that time on, until it eventually fizzles.</p>
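<p>To make that arithmetic concrete, here is the two-line computation in Python, using the same illustrative $R_0 \approx 2.5$ (which, again, is not a prediction):</p>

```python
R0 = 2.5                       # illustrative estimate of R_0
still_susceptible = 1 / R0     # the fraction S/N when the epidemic peaks
ever_infected = 1 - still_susceptible
print(f"{ever_infected:.0%}")  # prints "60%"
```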
<p>(These are <em>not predictions</em>, for many reasons. I’m not an epidemiologist. Any real epidemiologist would be using a much more sophisticated model than this one to try to make real predictions. Don’t pay attention to the specific numbers I use here. But you can get a qualitative sense of what changing these numbers would do—and have more context for understanding what the real experts tell you.)</p>
<p><img src="/assets/blog/sir/imperial_projections_chart.png" alt="Chart" class="center" style="width:448px;" /></p>
<p style="text-align: center"><em>Predictions from actual experts use a ton of data and consider a huge range of possibilities, and generally look like <a href="https://spiral.imperial.ac.uk:8443/handle/10044/1/77482">this table</a> from a team at Imperial College London.</em></p>
<h3 id="numeric-simulations">Numeric Simulations</h3>
<p>There’s one more thing that toy models like this can do. We can use them to run numeric simulations (using <a href="https://en.wikipedia.org/wiki/Euler_method">Euler’s method</a> or something similar). We can see what would happen under our assumptions, and how the results change if we vary those assumptions.</p>
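<p>If you’d like to see the machinery without Sage, here is a minimal Euler’s-method version of the same simulation in plain Python. The parameter values mirror the illustrative ones used in the widget below, and, to repeat the refrain, none of this is a prediction:</p>

```python
# Euler's-method simulation of the SIR equations.
# All parameter values are illustrative, not epidemiological estimates.
beta, gamma = 0.20, 0.07        # transmission and recovery rates
N = 300_000_000                 # total population
I = 100_000.0                   # initial infections
S, R = N - I, 0.0               # everyone else starts out susceptible
dt = 0.1                        # step size, in days

peak_I = 0.0
for _ in range(int(400 / dt)):  # simulate 400 days
    dS = -beta * S * I / N
    dI = beta * S * I / N - gamma * I
    dR = gamma * I
    S, I, R = S + dS * dt, I + dI * dt, R + dR * dt
    peak_I = max(peak_I, I)

print(f"peak share sick at once:  {peak_I / N:.0%}")
print(f"ever infected by day 400: {(I + R) / N:.0%}")
```

<p>Lowering $\beta$ (distancing, hygiene) or raising $\gamma$ (faster recovery) flattens the peak dramatically, which is exactly the experiment the widget below lets you run interactively.</p>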
<p>Below is some code for the SIR model written in SageMath. (I borrowed the code from <a href="https://sage.math.clemson.edu:34567/home/pub/161/">this page</a> at Clemson; I believe the code was written by <a href="http://people.oregonstate.edu/~medlockj/">Jan Medlock</a>.) I’ve primed it with $\gamma = .07$, which means that people are sick for two weeks on average, and $\beta = .2$, which gives us an $R_0$ of about $2.8$.</p>
<p>If you just click “Evaluate”, you’ll see what happens if we run this model using those values of $\beta$ and $\gamma$ over the next 400 days. It’s pretty grim; the epidemic peaks two months out with over a quarter of the country sick at once (the red curve), and in six months well over 80% of the country has fallen ill at some point (the blue curve).<strong title=" Reminder: I don't believe that this will happen, for many reasons. And you shouldn't listen to me if I did. Numbers are for illustrative purposes only and should not be construed as epidemiological advice."><sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup></strong></p>
<p>But with this widget you can play with those assumptions. What happens if we find a way to cure people faster, so $\gamma$ goes down? What if we lower $\beta$, by physical distancing or improved hygiene? The graph improves dramatically. And you can change up all the numbers if you want to. Play around, and see what you learn.</p>
<p>And stay safe out there.</p>
<div class="sage">
<script type="text/x-sage">
# Transmission rate
beta = 0.20
# Recovery rate
gamma = 0.07
# Population size
N = 300000000
# Initial infections
IInit = 100000
SInit = N - IInit
RInit = 0
R0 = beta / gamma
show(r'R_0 = %g' % R0)
# End time
tMax = 400
# Standard SIR model
def ODE_RHS(t, Y):
    (S, I, R) = Y
    dS = - beta * S * I / N
    dI = beta * S * I / N - gamma * I
    dR = gamma * I
    return (dS, dI, dR)
# Set up numerical solution of ODE
solver = ode_solver(function = ODE_RHS,
                    y_0 = (SInit, IInit, RInit),
                    t_span = (0, tMax),
                    algorithm = 'rk8pd')
# Numerically solve
solver.ode_solve(num_points = 1000)
# Plot solution
show(
    plot(solver.interpolate_solution(i = 0), 0, tMax, legend_label = 'S(t)', color = 'green')
    + plot(solver.interpolate_solution(i = 1), 0, tMax, legend_label = 'I(t)', color = 'red')
    + plot(solver.interpolate_solution(i = 2), 0, tMax, legend_label = 'R(t)', color = 'blue')
)
# code from https://sage.math.clemson.edu:34567/home/pub/161/
# Thanks to Jan Medlock
</script>
</div>
<p><em>Have a question about the SIR model? Have other good resources on this to point people at? Or did you catch a mistake? Tweet me <a href="https://twitter.com/profjaydaigle">@ProfJayDaigle</a> or leave a comment below.</em></p>
<p><em>And take care of yourself.</em></p>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>Or people who are asymptomatic carriers. This model doesn’t worry about who actually gets a fever and starts coughing, just who carries the virus and can maybe infect others. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>If we’re being fancy, we say that the chance of getting sick is proportional to $\frac{I}{N}$ and that $\beta$ is the constant of proportionality. But if you’re not used to differential equations already I’m not sure that tells you very much. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3">
<p>Reminder: I don’t believe that this will happen, for many reasons. And you shouldn’t listen to me if I did. Numbers are for illustrative purposes only and should not be construed as epidemiological advice. <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Jay DaigleFor some reason, a lot of people have gotten really interested in epidemiology lately. Myself included. Now, I'm not an epidemiologist. I don't study infectious diseases. But I do know a little about how mathematical models work, so I wanted to explain how one of the common, simple epidemiological models works. This model isn't anywhere near good enough to make concrete predictions about what's going to happen. But it _can_ give some basic intuition about how epidemics progress, and provide some context for what the experts are saying.Online Teaching in the Time of Coronavirus2020-03-14T00:00:00-07:002020-03-14T00:00:00-07:00https://jaydaigle.net/blog/online-teaching-in-the-time-of-coronavirus<p>I’ve been spending a lot of the past week looking at different options for transitioning my teaching online for the rest of the term. There are certainly people far more expert at online instruction than I am, but I wanted to share some of my thoughts and what I’ve found.</p>
<h2 id="handling-assignments">Handling Assignments</h2>
<h3 id="online-assignment-options">Online Assignment Options</h3>
<p>There are a lot of options for doing homework online. Many of these products (like WebAssign) have temporarily made everything freely available. I’m sure some of them are good, but I don’t know much about them.</p>
<p>This term I’ve been experimenting with using <a href="https://webwork.maa.org/">the MAA’s WeBWork system</a>, which has been going quite well. If you can administer your own server it’s completely free; if you can’t, the MAA will give you one trial class and then charge $200 per course you want to host. I don’t know how willing they are to start these up mid-semester, though. WeBWork is hardly a solution to everything, but it works very well for questions with numerical or algebraic answers.</p>
<p>(With WeBWork you can even give assignments that have to be completed inside a narrow window, say, an assignment that is only answerable between 2 and 3:30 on Thursday. So we could maybe use this to somewhat replace tests. Though again, not perfectly.)</p>
<h3 id="written-homework">Written Homework</h3>
<p>Of course, some assignments really need to include a written component. Written homework can probably just be photographed (or scanned) with a mobile phone; I expect most of our students have access to some sort of digital camera. I don’t know anything about the scanning apps but I know they exist. I have in fact graded photographed homework before, and my student graders have expressed a willingness to do this for the rest of the term.</p>
<p>We can also consider encouraging our students, especially in upper-division classes, to start using LaTeX for more assignments. That’s an unreasonable imposition on Calc 1 students but most of the people in the upper-level classes have probably been exposed to it, and it would make a lot of this much simpler. No scanning, no photographing, just emailing in PDFs.</p>
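<p>For students who haven’t used it before, a bare-bones homework file is genuinely tiny. Something like the following, where the title and problem text are obvious placeholders, compiles to a PDF with pdflatex:</p>

```latex
% Bare-bones homework template -- all content here is placeholder.
\documentclass{article}
\usepackage{amsmath,amssymb}

\title{Homework 7}        % placeholder assignment name
\author{Student Name}     % placeholder

\begin{document}
\maketitle

\section*{Problem 1}
Suppose $f(x) = x^2$. Then $f'(x) = 2x$, so \dots

\end{document}
```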
<h2 id="lectures-and-office-hours">Lectures and Office Hours</h2>
<p>I purchased a writing tablet for my computer. This is a peripheral that plugs into your computer and allows you to write/draw with a pen. I specifically ordered a Huion 1060 Plus, which gives a 10x6 writing area and <a href="https://amazon.com/gp/product/B01FTE9HS2/">goes for $70 on Amazon</a>. I haven’t gotten to test it yet, so don’t consider that quite a recommendation. The other thing that gets highly recommended is the <a href="https://amazon.com/Wacom-Drawing-Software-Included-CTL4100/dp/B079HL9YSF">Wacom Intuos</a>, which is supposed to be somewhat nicer but also gives a much smaller writing surface (something like 6x4), so if you write big this might not be comfortable.</p>
<p>I’ve been looking into options to stream lectures and other content. There are really two things I want to do here: the first is to have video conferences where I can stream lectures and share my screen to show written notes, LaTeX’d notes, Mathematica notebooks, etc. The second is to create a persistent space for student interactions. I’d like to create a space where even when I’m not “holding a lecture” or “having office hours”, my students can still ask questions—of each other and of me.</p>
<h3 id="discord">Discord</h3>
<p>I’ve been doing the second thing with Discord for my research group for the past year or so. It works pretty well. You create a room with a bunch of channels and all messages in a channel stay permanently (unless deleted by a moderator). You can scroll up to see what people have talked about in the past. Makes it great for students to have conversations with you and each other, and other students can see what happened in them. (There’s also a private messaging feature, of course.)</p>
<p>Discord is also good for voice calls, and has a screen sharing feature. Both of them worked very smoothly when I tried them, except the screensharing has some limitations that I believe are Linux-specific (in particular, in my multi-monitor setup I can share one window, or my entire desktop, but I can’t share exactly one monitor, which is something I would like to do). I’ve been in touch with <a href="http://www-personal.umich.edu/~speyer/">David Speyer</a>, who’s written up a bunch of thoughts about Discord <a href="https://academia.stackexchange.com/questions/145389/using-discord-to-support-online-teaching/145390#145390">here, with a basic tutorial for setting it up</a>.</p>
<p>One thing about Discord that is both good and bad is that many of our students use it already. (It was designed for online videogame playing, and is now a widely used chat and voice program.) This is good because our students are already familiar with the program and how to use it. It may be bad because that means our students often already have screen names and identities on Discord that they may want to keep separated from their academic/professional personas. If we use some software they have not used before, they can create fresh accounts and keep their online personas appropriately segmented.</p>
<h3 id="oxys-suggestions-bluejeans-and-moodle">Oxy’s Suggestions: BlueJeans and Moodle</h3>
<p>My institution made some software recommendations. BlueJeans is the recommended videoconferencing software. I’ve played around with it a bit and it seems serviceable but not great. (Again, it has some specific issues with Linux that are more or less dealbreakers for me, as well.) One thing I miss from it is that it’s designed for video calls/conferences, but it doesn’t have the capacity to create a persistent chat room. So if I want that persistent interaction space, I’d need to use a second tool; I’d prefer to run everything on one platform if I can.</p>
<p>Moodle has a tool for creating chat rooms, but it’s <em>awful</em>. Do not want. It’s still a good place to post assignments and such if you don’t already have a place to post them and your institution uses Moodle. (If your institution uses some other learning management software, I can’t say much; Moodle is the only one I’ve ever used.)</p>
<h3 id="zoom-videoconferencing">Zoom Videoconferencing</h3>
<p>I’ve been leaning towards a videoconferencing solution called Zoom. The screensharing works great, and the recording feature works great. There’s an ability to create a shared whiteboard space, that I and students can both write on, which seems helpful for virtual office hours.</p>
<p>Zoom has the ability to create a persistent chatroom, and it worked very smoothly in some testing I did today with a couple of my undergraduates. (One of them reported that it “felt really slick”, which is a good sign; most of the experience was pretty seamless.) The videoconferencing can work without anyone making an account, I think, but the persistent chat room would require all our students to make (free) accounts. Anyone with a Gmail account can just log in with that, so that might not be a large barrier.</p>
<p>One major downside is that videoconferences are limited to 40 minutes. They’ve been relaxing this for schools and in affected areas, so I don’t know how much of a limitation this would be in practice. But I also think we could just start again at the end of the 40-minute period if we needed to. (Or maybe just keep formal lectures below forty minutes; it’s hard to ask students to pay attention that long anyway. If you’re posting recorded videos, the usual suggestion seems to be to keep them under ten minutes.)</p>
<h2 id="closing-thoughts">Closing thoughts</h2>
<p>There are a bunch of other resources floating around to help you; I’ve looked at several but unfortunately haven’t been keeping a list. But if you poke around on Twitter or elsewhere there are many people more informed than I am who will offer help!</p>
<p>I know the MAA has a <a href="https://twitter.com/mathcirque/status/1238119797747068929?s=09">recorded online chat on online teaching</a>, though I haven’t looked at it yet.</p>
<p>But the most important thing is not to get hung up on perfection. I didn’t plan to teach my courses remotely this term, and I’m sure they will suffer for lack of direct instructional contact. But that’s okay! And I’m going to be honest with my students about this.</p>
<p>This is a really unfortunate way to finish out the semester. It sucks. But I’m going to do what I can to make it only suck a medium amount. And I hope my students will bear with me and help to make this only medium suck.</p>
<p>We’ll get through this.</p>
<hr />
<p><em>I’d love to hear any ideas or feedback you have about moving to online instruction. And I’m happy to answer any questions I can—we’re in this together.</em> <em>Tweet me <a href="https://twitter.com/profjaydaigle">@ProfJayDaigle</a>, or leave a comment below.</em></p>Jay DaigleI’ve been spending a lot of the past week looking at different options for transitioning my teaching online for the rest of the term. There are certainly people far more expert at online instruction than I am, but I wanted to share some of my thoughts and what I’ve found.