<p><em>Jay Daigle is a professor of mathematics at The George Washington University in Washington, D.C. In addition to his research in number theory, he brings a mathematical style to thinking about philosophy, politics, social dynamics, and everyday life.</em></p>

<h1 id="motivating-the-integral-with-eulers-method">Motivating the Integral with Euler’s Method</h1>
<p><em>Jay Daigle · March 15, 2023 · <a href="https://jaydaigle.net/blog/motivating-integral-eulers-method">jaydaigle.net</a></em></p>
<p>I have a fun idea for how to teach and think about the integral in the context of freshman calculus. I’ve never actually used this in a class, and I suspect it’s not actually a great idea. But it’s a <em>fun</em> idea and worth at least playing with, even if it’s a bit too weird to help calculus novices understand what’s going on.</p>
<p>But first, I want to mention that if you want to support my writing, I now have a <a href="https://ko-fi.com/jaydaigle">Ko-Fi account</a>. Any tips would be appreciated and would help me write more essays like this.</p>
<h2 id="the-big-ideas-of-calculus-1">The Big Ideas of Calculus 1</h2>
<p>When I teach calculus I emphasize two big ideas: differential equations, and numerical analysis.</p>
<p><strong>Differential equations</strong> generalize the concept of “rate of change”, and they’re the core of why calculus is <em>useful</em>: you can describe the rules a system follows, encode them in math, and draw conclusions. Calculus 1 students don’t have the tools to solve differential equations, but they can—and should—understand how a sentence like “the acceleration is proportional to the displacement” relates to the equation \(y’’ = -ky\).</p>
<p><strong>Numerical approximation</strong> is often the <em>way</em> we use calculus, and increasingly so as computers are more powerful and available. I motivate the derivative with the idea of linear approximation: if I want to pretend my function is a line, and write \(f(x) = f(a) + m (x-a)\), what number \(m\) will do the best job? This develops into other methods for approximating the answers to questions that are too hard to answer directly: it leads into ideas like quadratic approximation and <a href="https://en.wikipedia.org/wiki/Newton's_method">Newton’s method</a>, and provides a foundation for numerical integration and Taylor series in Calculus 2.</p>
<h2 id="eulers-method">Euler’s Method</h2>
<p>If we combine these two ideas, we can try to numerically approximate the solution to a differential equation. Suppose we have a differential equation \(f’(t) = f(t) - f(t)^2/2\), and we know the initial condition that \(f(0)=1\). If we want to know \(f(3)\) we can get a rough guess with a linear approximation: we know \(f(0) = 1 \) and thus that \(f’(0) = 1 - \frac{1^2}{2} = \frac{1}{2} \), so we get</p>
<p>\[
f(3) \approx f(0) + f’(0) (3-0) = 1 + \frac{1}{2} \cdot 3 = \frac{5}{2}.
\]</p>
<p>That’s only a rough estimate; linear approximation generally isn’t very accurate when the starting point and ending point aren’t close together. In fact the true value is \( \frac{2e^3}{e^3+1} \approx 1.905\), which isn’t terribly far off from \(2.5 \) but isn’t especially close either. But this is the best estimate we can really get using only \(f(0)\) and \(f’(0)\).</p>
<p>However, we know a lot more than that, because we have a formula for \(f’(x)\). It’s a bit hard to use, because we need to know \(f(x)\) to compute \(f’(x)\); but we know we can approximate \(f(x_2)\) if we already know \(f(x_1)\) and \(f’(x_1)\). That allows us to do a recursive calculation:</p>
<p class="theorem">\[
\begin{array}{rl}
f(1) & \approx f(0) + f’(0)(1-0)
= 1 + \left(1 - \frac{1^2}{2}\right) \cdot 1 = \frac{3}{2}, \\
f(2) & \approx f(1) + f’(1)(2-1)
\approx \frac{3}{2} + \left(\frac{3}{2} - \frac{\left(\frac{3}{2}\right)^2}{2}\right) \cdot 1 = \frac{15}{8}, \\
f(3) & \approx f(2) + f’(2)(3-2)
\approx \frac{15}{8} + \left(\frac{15}{8} - \frac{\left(\frac{15}{8}\right)^2}{2}\right) \cdot 1 = \frac{255}{128}.
\end{array}
\]</p>
<p>Thus we estimate \(f(3) \approx \frac{255}{128} \approx 1.99\).</p>
<p>This still isn’t an exact value for \(f(3)\); but this approximation is much better than our first try. And if this isn’t close enough, we can do even better by breaking our approximation into more steps: with six steps we get \(f(3) \approx 1.95\) and with sixty we get \(f(3) \approx 1.909\). More steps take more work, but also give us a more precise answer.</p>
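<p>The recursion is mechanical enough that it’s easy to carry out in code. Here is a minimal sketch in Python (the function name <code>euler</code> and its signature are my own choices, not anything standard):</p>

```python
def euler(n, t_end=3.0, y0=1.0):
    """Approximate f(t_end) for f'(t) = f(t) - f(t)**2 / 2 with f(0) = y0,
    using n steps of Euler's method (repeated linear approximation)."""
    h = t_end / n  # step size
    y = y0
    for _ in range(n):
        y += h * (y - y * y / 2)  # step along the tangent line
    return y

# euler(3) recovers the hand computation 255/128; more steps creep
# toward the true value 2e^3 / (e^3 + 1), about 1.905.
```

<p>With <code>n = 3</code> this reproduces \(\frac{255}{128}\) exactly, and <code>euler(60)</code> lands near \(1.909\), matching the figures above.</p>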
<p>This approach is known as <a href="https://en.wikipedia.org/wiki/Euler_method">Euler’s method</a>, and it allows us to numerically approximate the solution of any first-order ordinary differential equation given an initial condition. With a little bit of work, we can generalize this to any ordinary differential equation; it’s quite straightforward and flexible.</p>
<p>It’s also basically just integration.</p>
<h2 id="what-is-an-integral">What is an integral?</h2>
<p>In a typical calculus course, we motivate the integral with the area problem: we have the graph of some function, and we want to find the area under that curve. We can approximate that area by chopping it up into rectangles, which gives us the Riemann sum. And then as the number of rectangles approaches infinity our approximation gets really good, which allows us to define the integral.</p>
<p><img src="/assets/blog/teach-in-a-society/Riemann_sum_(leftbox).gif" alt="An animation of a Riemann sum as the number of rectangles goes to infinity" class="blog-image center" /></p>
<p class="theorem">\[
\int_a^b f(t) \,dt = \lim_{n \to \infty} \sum_{k=1}^n f(x_k) \Delta x
\]</p>
<p>This definition has a lot of symbols in it, and is generally intimidating to freshman calculus students. But it does accurately describe what we’re doing and why: the key idea of the integral is to break a calculation into pieces, do an approximation on each piece, and then add the results together. This will give us an approximate answer to our original question; as we use more and smaller pieces, the approximation gets better, and so in the limit we get an exact answer.</p>
<p>So this formula directly answers the question that we’re asking. And when we want to think about <em>applications</em> of the integral, the Riemann sum definition is useful: it helps us figure out what the integral is actually computing, and so what problems it can help solve. But Riemann sums are a huge pain to actually do computations with, so we generally don’t.</p>
<p>Instead, we rely on the Fundamental Theorem of Calculus, which comes in two parts.</p>
<p class="theorem"><strong>Fundamental Theorem of Calculus, Part 1:</strong> <br />
Given a function \(f(x)\) and a number \(a\), we can define a new function \(F(x) = \int_a^x f(t) \,dt\). Then \(F’(x) = f(x)\).</p>
<p>Part 1 tells us that the derivative undoes the integral; the derivative of the integral of \(f\) is just \(f\). This is conceptually cool, and it does allow us to compute <em>something</em>. But it doesn’t directly help us compute the integral. Instead, we use it to prove<strong title="This proof relies heavily on specific special properties of the real numbers, and in particular the property that if f'(x)=0 then f(x) is constant. This isn't true if we allow functions to be defined solely for rational numbers; the real numbers are exactly the set that makes it work."><sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup></strong> a second statement.</p>
<p class="theorem"><strong>Fundamental Theorem of Calculus, Part 2:</strong> <br />
If \(F’(x) = f(x)\), then \(\int_a^b f(t) \,dt = F(b) - F(a)\).</p>
<p>This is also known as the Evaluation Theorem, or sometimes the Net Change Theorem. And it’s the tool we actually use in practice to compute integrals—to the extent that people mainly associate “integration” with finding the antiderivative \(F(x)\), not with finding the <em>number</em> corresponding to the area under the curve.</p>
<p>And this all works, but we’ve moved pretty far away from the original question, and the connections pass through some relatively abstract territory. It’s hard to really intuitively see how this calculation relates to the original question.</p>
<p>Maybe there’s a better way.</p>
<h3 id="the-antiderivative-as-a-differential-equation">The antiderivative as a differential equation</h3>
<p>Let’s start by asking this question backwards. Suppose there’s some function you’re interested in, but you don’t have a formula for it. Instead you just have a formula for the derivative. In practical terms, this happens in <a href="https://en.wikipedia.org/wiki/Dead_reckoning">dead reckoning</a>: if you can’t measure where you are, but you know where you started and how fast you’re moving, you can estimate where you end up.</p>
<p>So suppose we know our speed \(F’(x)\), and our starting position \(F(a)\), and we want a way to figure out our current position \(F(x)\). We want to compute an antiderivative! The FTC part 2 tells us that \(F(x) = F(a) + \int_a^x F’(t) \,dt \), so we could figure this out by doing an integral. But I want to follow a different thought process.</p>
<p>We can start by saying, we know what \(F(a)\) is, and since we have a formula for \(F’(x)\), we can compute \(F’(a)\). Then we can use the linear approximation formula to estimate
\[
F(x) \approx F(a) + F’(a) (x-a).
\]
So if we know, say, that \(F(1)=3\) and \(F’(x) = 3x^2\), we can estimate that \(F(5) \approx 3 + 3(5-1) = 15\).</p>
<p>Linear approximation gives a pretty decent estimate if \(x\) and \(a\) are close, but if they’re far apart it’s not very good. Consequently it doesn’t really work here: in reality \(F(5) = 127\).</p>
<p>But we can improve this exactly the same way we did before, by using Euler’s method! The problem is that the two points on my linear approximation are too far apart. But we can try to approximate the value of \(F\) somewhere closer to \(1\), like at \(3\).</p>
<p>\[
F(3) \approx F(1) + F’(1)(3-1) = 3 + 3(2) = 9.
\]
And then, since we also know \(F’(3) = 27\) I can estimate
\[
F(5) \approx 9 + 27(5-3) = 63.
\]
Still not right, but much better! And we can improve even further by doing more steps:
\[
\begin{array}{rl}
F(2) & \approx F(1) + F’(1)(2-1) = 3 + 3 = 6, \\
F(3) & \approx F(2) + F’(2)(3-2) = 6 + 12 = 18, \\
F(4) & \approx F(3) + F’(3)(4-3) = 18 + 27 = 45, \\
F(5) & \approx F(4) + F’(4)(5-4) = 45 + 48 = 93.
\end{array}
\]
This still isn’t quite right, but it’s even closer; and as we take more and more smaller and smaller steps, we’ll get a better and better approximation.</p>
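<p>The same loop works for this dead-reckoning problem. A sketch, hard-coding the specific \(F’(x) = 3x^2\) and \(F(1) = 3\) from above (the name <code>euler_antideriv</code> is mine):</p>

```python
def euler_antideriv(n, a=1.0, b=5.0, F_a=3.0):
    """Approximate F(b) given F'(x) = 3x^2 and F(a) = F_a,
    using n Euler steps of size (b - a) / n."""
    h = (b - a) / n
    x, F = a, F_a
    for _ in range(n):
        F += h * (3 * x * x)  # linear approximation: F(x + h) ~ F(x) + F'(x) h
        x += h
    return F

# One step gives 15, two steps give 63, four give 93,
# and thousands of steps approach the true value F(5) = 127.
```

<p>Each doubling of the step count roughly halves the error, which is why the estimates \(15, 63, 93, \ldots\) close in on \(127\).</p>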
<h3 id="riemann-sums-as-eulers-method">Riemann Sums as Euler’s Method</h3>
<p>This is basically Euler’s method. But why is it an integral? Let’s reorganize the calculation to make it clearer what’s happening.
\[
\begin{array}{rl}
F(5)
& \approx F(4) + F’(4)(5-4) \\
& \approx F(3) + F’(3)(4-3) + F’(4)(5-4) \\
& \approx F(2) + F’(2)(3-2) + F’(3)(4-3) + F’(4)(5-4) \\
& = F(1) + F’(1)(2-1) + F’(2)(3-2) + F’(3)(4-3) + F’(4)(5-4) \\
& = 3 + 3 \cdot 1 + 12 \cdot 1 + 27 \cdot 1 + 48 \cdot 1 = 93.
\end{array}
\]
At this point this should be starting to look familiar. We’re taking a bunch of steps of size \(1 = \Delta x\), and for each step we’re multiplying it by the derivative at some \(x\) value. So we just computed
\[
F(5) \approx F(1) + \sum_{k=1}^4 F’ \big( 1 + (k-1) \cdot 1 \big) \cdot 1.
\]
More generally, if we take \(n\) steps we get
\[
F(5) \approx F(1) + \sum_{k=1}^n F’\big( 1 + (k-1) \Delta x \big) \Delta x.
\]
And the sum on the right-hand side is <em>almost</em> exactly a Riemann sum; in fact, it’s a Riemann sum plus the extra term \(F(1)\). If we rearrange it we get
\[
F(5) - F(1) \approx \sum_{k=1}^n F’\big( 1 + (k-1) \Delta x \big) \Delta x.
\]</p>
<p>I see two ways to think about this formula. One is that the <em>indefinite</em> integral contains a \(+C\) term, because antiderivatives aren’t unique. So while \(\int F’(t) \,dt\) is <em>an</em> antiderivative of \(F’(x)\), we don’t necessarily get the same function as our original \(F(x)\). Instead, the FTC just guarantees we have \(F(x) +C\), and \(F(1)\) is just the \(+C\) term.</p>
<p>But what seems clearer to me is that we’re really computing the <em>change</em> in the value of \(F\). This should make physical sense: the calculations with the speed tell us how far we’ve moved, not where we are. Thus the Euler’s method calculation tells us our <em>displacement</em>; but if we add that to our starting position, we find our ending position.</p>
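<p>That displacement reading is easy to check numerically: a left-endpoint Riemann sum of \(F’\) over \([1, 5]\) should converge to \(F(5) - F(1) = 127 - 3 = 124\). A sketch (the helper name <code>left_riemann</code> is my own):</p>

```python
def left_riemann(fprime, a, b, n):
    """Left-endpoint Riemann sum of fprime over [a, b] with n equal pieces."""
    dx = (b - a) / n
    return sum(fprime(a + k * dx) * dx for k in range(n))

# With fprime(x) = 3x^2 on [1, 5]: four pieces give 90, and 90 + F(1) = 93
# is exactly the four-step Euler estimate; many pieces approach 124.
```

<p>So the Euler steps and the Riemann sum really are the same computation, offset by the starting value \(F(1)\).</p>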
<h2 id="is-this-a-good-idea">Is this a good idea?</h2>
<p>Mathematically, this all works out. It’s a cute argument and I’m glad I’ve found it. But there are plenty of fun math ideas that don’t belong in a freshman calculus course.</p>
<p>This approach has one obvious, major disadvantage: no one else teaches it like this, so it would probably leave students confused if they go on to take another course with someone else. And that’s probably enough to make it not worth doing<strong title="Or at least not worth doing as the motivation to the integral. I think it's fine to do this as a _followup_, or an application of the integral. If you have an extra day to spend on integration, this isn't the worst thing you could do. But if you have extra days in your calculus syllabus please tell me how you got them."><sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup></strong>, on its own.</p>
<p>But while that’s a real obstacle to adopting this approach in one class, it’s also kind of dodging the interesting questions about whether this would be a better approach. What if we could get everyone to switch? Should we?</p>
<p>One problem is that this argument isn’t at all rigorous. As long as we believe that Euler’s method will converge to the right answer, then the integral will as well; but I don’t know how you’d prove that Euler’s method converges without referencing the integral, so that seems fairly circular.</p>
<p>That objection seems fatal to me—in an upper-division Real Analysis course. In a freshman calculus course, nothing is ever going to be fully rigorous, and the proofs involving Riemann sums especially won’t be because getting the technical details of Riemann sums correct is <em>hard</em>. So I don’t mind a little non-rigor, especially if it helps students develop a clear intuitive understanding of what we’re trying to do.</p>
<p>In fact, having to avoid some of the abstraction involved in proving the Fundamental Theorem of Calculus might be a win, overall. That’s one of those lectures where I’m always confident my students aren’t <em>really</em> following the details, and are just hanging on trying to survive until we get back to computing things. On the other hand, it’s good for them to see some abstract formalism, even if they’re not ready to fully understand it yet. You have to see your first scary proof sometime!</p>
<p>Another problem is that this derivation captures the relationship between the Riemann sum and the antiderivative, but presents it exactly backwards. In most applications, the Riemann sum is the question we want to answer; the antiderivative is the tool we use to answer it. But the Euler’s method approach treats the antiderivative as the question, and the Riemann sum as the way we compute the answer—which is completely wrong since the Riemann sum is nearly impossible to compute outside of the simplest cases. I think this is a really deep problem with this approach. One of the big ideas I want my students to engage with is figuring out the difference between identifying a question, and computing the answer; giving it to them backwards seems like an obstacle to developing that understanding.</p>
<p>But I do really like the way this approach connects the integral back to the other big ideas in the class. Not just to the derivative; any presentation of the FTC will draw a link between integration and differentiation. But this makes the integral seem <em>connected</em> to the themes of numeric approximation and differential equations, which ties the entire course together neatly.</p>
<p>And really, that sums it up, I think. It’s always nice to tell a neat story that ties the whole class together. But it probably isn’t as important as making sure our students understand each piece well on its own. I have to resist the temptation to do something pretty, and elegant, and unnecessarily confusing. So this is a fun idea, but for now I’m going to teach this normally.</p>
<hr />
<p><em>Do you have a clever way to motivate the integral? Do you think I should actually be using this approach in my course? Any other thoughts on teaching integration? Tweet me <a href="https://twitter.com/profjaydaigle">@ProfJayDaigle</a> or leave a comment below.</em></p>
<hr />
<div class="footnotes">
<ol>
<li id="fn:1">
<p>This proof relies heavily on specific special properties of the real numbers, and in particular the property that if \(f’(x)=0\) then \(f(x)\) is constant. This isn’t true if we allow functions to be defined solely for rational numbers; the real numbers are exactly the set that makes it work. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>Or at least not worth doing as the motivation to the integral. I think it’s fine to do this as a <em>followup</em>, or an application of the integral. If you have an extra day to spend on integration, this isn’t the worst thing you could do. But if you have extra days in your calculus syllabus please tell me how you got them. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>

<hr />

<h1 id="writing-calculus-tests-with-chatgpt">Writing Calculus Tests with ChatGPT</h1>
<p><em>Jay Daigle · March 8, 2023 · <a href="https://jaydaigle.net/blog/writing-calculus-tests-with-ChatGPT">jaydaigle.net</a></em></p>
<p>Last <a href="/blog/not-scared-of-chatbots/">week</a> I talked about the new chatbots, like ChatGPT and Bing’s chat interface. I argued that while they produce language, they can’t really analyze it or check it for errors; and that this is a meaningful restriction that we can’t get past without a serious change in the approach we take to AI systems. So the chatbots won’t be able to fully replace intellectual labor any time soon. But they still might help, especially if we can identify formulaic tasks that don’t require really critically thinking about how ideas connect.</p>
<p>But rather than philosophizing, I decided to get concrete about this. Can I use ChatGPT to make <em>my</em> job easier? It’s going to be pretty useless for the most important parts of my job. In particular, it has no way to figure out why a student is confused and address their confusions. And it’s not going to come up with insightful new ways to describe course topics. It won’t even be able to meaningfully connect distinct ideas in the course, because it has no sense of what’s already been covered.</p>
<p>Instead I need to find the aspects of my job that are <em>routine</em>, and involve following relatively standard templates and filling them out in predictable ways. I need to find tasks that it’s easy for me to check if they’re done right, since ChatGPT is not correct with any consistency. Ideally, I’d also find ways to have it replace the parts of my job that are the most annoying: I don’t <em>want</em> a way to avoid spending time in office hours with students, because office hours are fun!</p>
<p>But one thing I spend a lot of time doing, and don’t enjoy at all, is writing homework and test questions. I need to create original problems (or at least ones that aren’t in the textbook so students can’t look them up), but not <em>too</em> original (so they fit the patterns that my calculus students are supposed to be learning). And unlike all of the rest of my course planning, I need to do new ones every year—I can reuse my old lecture notes, but it’s not safe to reuse my old tests.</p>
<p>So I decided to spend some time experimenting with GPT as a test writer. Can it write good questions? Can it write usable solutions for those questions? And can it do this easily, or is shepherding it through the process more trouble than it’s worth?</p>
<p>But before I tell you what I found, I want to mention that if you want to support my writing, I now have a <a href="https://ko-fi.com/jaydaigle">Ko-Fi account</a>. Any tips would be appreciated and would help me write more essays like this.</p>
<h2 id="the-verdict">The Verdict</h2>
<p>Overall, the current tech seems somewhat useful, but not actually good—at least, not yet. But it’s close enough that I suspect it will get pretty good for this purpose before long.</p>
<h3 id="writing--problems">Writing Problems</h3>
<p>With a couple exceptions, ChatGPT could figure out what type of question I was asking for. If I asked for a related rates problem, or an integration problem that involved integration by parts, I would get one. Sometimes they weren’t quite right, but I could get the general type of problem I asked for, with basically no prompt engineering.</p>
<p>On the other hand, it was hard to get specifics. I can get a big pile of integration by parts problems, but a lot of them will be either very easy or very hard. And ChatGPT gets stuck in ruts; I saw identical problems show up to multiple different prompts, and there were running themes in everything it output. That means that the system can’t give me fine-tuned answers, and also will not give me an even coverage of the relevant types of problems.</p>
<p>But if I have something specific I want, I can probably just write it myself; and even if it won’t give me every type of problem, it can help remind me of my options. I found it genuinely useful for brainstorming problems, even if I didn’t use any of them exactly. (And I am at this moment proctoring a test that includes some problems I wrote with GPT assistance.)</p>
<h3 id="solving-problems">Solving Problems</h3>
<p>On the other hand, the solutions it produced were usually wrong, sometimes spectacularly so. A few times I got a completely correct solution. Most of the time, I would get an answer that had the right approach but did completely nonsensical calculations in the middle; the solutions would look superficially correct, but checking them carefully turned up multiple errors. And occasionally I would get arguments almost completely unrelated to the questions I asked.</p>
<p>But, if anyone does figure out a way to usefully and consistently hook this up to a computer algebra system, it will probably do pretty well at solving problems too. It tended to set up the right computation and then generate a nonsense answer; if it could tell when it needs to just factor a polynomial or compute an integral, and pass that to a computer algebra system, that would fix a lot of the weaknesses.</p>
<p>I know multiple teams are trying to find a way to hook systems like GPT up to computational engines and computer algebra systems. If they could do that effectively it would probably be able to write good solutions immediately, but that really sounds non-trivial to me. You could <em>maybe</em> teach it to pass integrals or other specific calculations to a computer algebra system, read the result, and print it. But translating that into a well-written solution would require some sort of deep integration of the two capabilities, not just an ability to print the final answer.</p>
<p>But one thing did impress me about the solutions: ChatGPT could clearly consistently remember what question it was trying to answer. Every single solution ended with a clear restatement of the question and an answer to it. The answer was usually <em>wrong</em> but it never lost track of what it was supposed to be answering. (And that’s more than I can say for some of my students.) This mostly shows that the question is still in ChatGPT’s context window when it finishes the solution, but also that it’s still <em>using</em> the details of the question. And it’s doing that consistently; I don’t think it failed at this once.</p>
<h3 id="other-observations">Other Observations</h3>
<p>ChatGPT can write perfectly valid and idiomatic LaTeX code, which I was mildly surprised by. I’m used to the fact that it can put out grammatical and idiomatic <em>English</em> text, but I was still impressed by the LaTeX.</p>
<p>Similarly, ChatGPT can draw diagrams using ASCII art. I’m really curious how much of that was in the training set; I wouldn’t have thought it was that common, but it’s clearly common enough. (It did not always accurately describe the diagram it had drawn.)</p>
<p>One major weakness is that ChatGPT doesn’t have a consistent style. Each solution it wrote used completely different notation and phrasing—even when I asked it to respond to the same prompt multiple times.</p>
<h2 id="the-data">The Data</h2>
<p>The rest of this post is a record of things I tried and experiments I conducted in getting ChatGPT to write calculus questions for me. I spent a while playing around, and tried to include everything here as a sort of lab notebook and complete record. If you don’t need to write calculus questions you probably don’t want to read the whole thing.</p>
<p>But I have hidden most of the long GPT exchanges under pulldown spoiler boxes. If you just want to skim through and see my thoughts on each experiment, you can do that; and you can expand out any box you decide you actually want to read.</p>
<p>Or you can jump to <a href="#conclusion">the end</a> and leave me a comment.</p>
<h3 id="maxmin-problems">Max/Min problems</h3>
<p>When I teach calculus 1, my absolute least favorite thing to do is writing test questions for max/min problems. It’s simultaneously extremely routine and extremely tedious. So I decided to start there.</p>
<h4 id="maxmin-take-one">Max/Min take one</h4>
<p class="theorem"><strong>Prompt:</strong> Write a problem where you have to classify the critical points of a function, with a solution.</p>
<details class="example">
<summary>
Response 1
</summary>
<div>
<p>Problem:</p>
<p>Consider the function $f(x) = x^3 - 3x^2 + x - 1$. Determine the critical points of the function and classify them as local maxima, local minima, or inflection points.</p>
<p>Solution:</p>
<p>A critical point of a function is a point where the derivative of the function is equal to zero or is undefined. To find the critical points of the function, we first find its first derivative:</p>
<p>$f’(x) = 3x^2 - 6x + 1$</p>
<p>Setting $f’(x) = 0$, we find the critical points:</p>
<p>$3x^2 - 6x + 1 = 0$</p>
<p>$x = \frac{6 \pm \sqrt{36 - 4 \cdot 3 \cdot 1}}{2 \cdot 3} = \frac{3 \pm \sqrt{3}}{3}$</p>
<p>Let $c_1 = \frac{3 + \sqrt{3}}{3}$ and $c_2 = \frac{3 - \sqrt{3}}{3}$. These are the critical points of the function. To classify them, we need to find the second derivative and check the sign of the second derivative at each critical point. The second derivative of the function is:</p>
<p>$f’‘(x) = 6x - 6$</p>
<p>At $c_1$, the second derivative is $6c_1 - 6 = 6(\frac{3 + \sqrt{3}}{3}) - 6 = 6$, which is positive. This means that $f(c_1)$ is a local minimum.</p>
<p>At $c_2$, the second derivative is $6c_2 - 6 = 6(\frac{3 - \sqrt{3}}{3}) - 6 = -6$, which is negative. This means that $f(c_2)$ is a local maximum.</p>
<p>Therefore, the critical points of the function $f(x) = x^3 - 3x^2 + x - 1$ are:</p>
<p>$c_1 = \frac{3 + \sqrt{3}}{3}$, which is a local minimum.</p>
<p>$c_2 = \frac{3 - \sqrt{3}}{3}$, which is a local maximum.</p>
</div>
</details>
<details class="example">
<summary>
Response 2
</summary>
<div>
<p>Problem: Consider the function $f(x) = x^3 - 6x^2 + 9x - 2$. Classify the critical points of the function as relative maxima, relative minima, or inflection points.</p>
<p>Solution: To classify the critical points, we need to find the critical points and then determine their behavior using the second derivative test.</p>
<p>Finding critical points: To find the critical points, we need to find the points where the first derivative of the function is equal to zero or the first derivative does not exist. The first derivative of the function is given by:
$f’(x) = 3x^2 - 12x + 9$</p>
<p>Setting $f’(x) = 0$, we find that:</p>
<p>$3x^2 - 12x + 9 = 0$</p>
<p>Solving for $x$, we get the critical points:</p>
<p>$x = \frac{12 \pm \sqrt{12^2 - 4 \cdot 3 \cdot 9}}{2 \cdot 3}$</p>
<p>$x = \frac{12 \pm \sqrt{144 - 108}}{6}$</p>
<p>$x = \frac{12 \pm \sqrt{36}}{6}$</p>
<p>$x = \frac{12 \pm 6}{6}$</p>
<p>$x = 3, 1$</p>
<p>So, the critical points are $x = 1, 3$.</p>
<p>Second derivative test: To determine the behavior of the critical points, we use the second derivative test. We find the second derivative of the function and evaluate it at the critical points:
$f’‘(x) = 6x - 12$</p>
<p>$f’‘(1) = 6 \cdot 1 - 12 = -6$</p>
<p>$f’‘(3) = 6 \cdot 3 - 12 = 0$</p>
<p>Since $f’‘(1) < 0$, it means that $x = 1$ is a relative maximum. Since $f’‘(3) = 0$, it means that $x = 3$ is an inflection point.</p>
<p>So, the critical points are:</p>
<p>$x = 1$ is a relative maximum</p>
<p>$x = 3$ is an inflection point.</p>
</div>
</details>
<p>There are a couple things to notice here. First, I did no post-processing on the output at all. ChatGPT wrote perfectly formatted LaTeX code for these answers without me editing it at all.<strong title="I know you're seeing some weird formatting on the second derivative, but that's not actually ChatGPT's fault; that has to do with a bug in the way LaTeX compiles through MathJax, which is what allows me to display it on the blog. I could fix the display issue but I wanted to keep the output genuinely unedited."><sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup></strong></p>
<p>Second is that these are both fairly simple problems. My next step is going to have to be seeing if I can get more interesting examples out of it; it went for a simple cubic both times.</p>
<p>Third is that the solution style is inconsistent. ChatGPT doesn’t write up solutions the way I would, which is to be expected. But it also doesn’t write them up the same way as itself. The two solutions use different notations and different phrasings. This doesn’t make it <em>wrong</em>, but it isn’t great as a teaching tool; it’s good for students to see a certain amount of consistency, especially as beginners.</p>
<p>And fourth is that despite the nice organization and formatting, both of these solutions are almost, but not quite, right:</p>
<ul>
<li>
<p><strong>Response 1</strong> takes the quadratic equation \(3x^2-6x+1 =0\), correctly uses the quadratic formula, and then simplifies the solutions to \( \dfrac{3 \pm \sqrt{3}}{3}\). But the actual solutions are \(\dfrac{3 \pm \sqrt{6}}{3}\).</p>
<p>Then when it plugs these points into the second derivative, it incorrectly computes \(\pm 6\) when it should get \(\pm 2 \sqrt{6}\).</p>
<p>These are the sort of basic arithmetic errors that make me think a student understands the material, but is sloppy with arithmetic; I’d probably grade this as an 8/10.</p>
</li>
<li>
<p><strong>Response 2</strong> computes the critical points correctly.</p>
<p>When it plugs them into the second derivative, it correctly computes \(f^{\prime \prime} (1) = -6\) but says that \(f^{\prime \prime}(3) = 6\cdot 3 - 12 = 0\), when of course \(6 \cdot 3 - 12 = 6 \).</p>
<p>Then it interprets this calculation wrong: it makes a key conceptual error by saying that since \(f^{\prime \prime}(3) = 0\), we know that \(3\) is an inflection point. This is kind of half-right, and I could see some mathematicians accepting that conclusion. But not all points where \(f^{\prime \prime} = 0\) are inflection points, and I wouldn’t accept that conclusion.</p>
</li>
</ul>
<p>So we get well-formatted solutions, but they’re not consistent or reliable; I’d have to hand-check them, and probably rewrite to match my style.</p>
<p>But I want to see if we can get better questions out of this, so let’s go back to the well.</p>
<h4 id="maxmin-take-2">Max/Min take 2</h4>
<p>I wanted to get slightly more complicated functions involved.</p>
<p class="theorem"><strong>Prompt:</strong> Write a problem where you have to classify the critical points of a rational function, with a solution.</p>
<details class="example">
<summary> Response 1 </summary>
<div>
<p>Problem: Consider the function $f(x) = \frac{x^3-3x}{x^2-2x+1}$. Classify the critical points of the function as relative maxima, relative minima, or inflection points.</p>
<p>Solution: To classify the critical points, we need to find the critical points and then determine their behavior using the second derivative test.</p>
<p>Finding critical points: To find the critical points, we need to find the points where the first derivative of the function is equal to zero or the first derivative does not exist. The first derivative of the function is given by:
$f’(x) = \frac{(x^2-1)(x^2-3)}{(x-1)^2}$</p>
<p>Setting $f’(x) = 0$, we find that:</p>
<p>$(x^2-1)(x^2-3) = 0$</p>
<p>Solving for $x$, we get the critical points:</p>
<p>$x = \pm 1, \pm\sqrt{3}$</p>
<p>However, $x=1$ is not a critical point since the denominator of the function is zero at $x=1$. Therefore, the critical points are $x = -\sqrt{3}, 1, \sqrt{3}$.</p>
<p>Second derivative test: To determine the behavior of the critical points, we use the second derivative test. We find the second derivative of the function and evaluate it at the critical points:
$f’‘(x) = \frac{6x(x-1)(x+1)}{(x-1)^3}$</p>
<p>$f’’(-\sqrt{3}) = \frac{-6\sqrt{3}}{(2\sqrt{3})^3} = -\frac{3}{4\sqrt{3}}$</p>
<p>$f’‘(1)$ does not exist since the denominator of the second derivative is zero.</p>
<p>$f’’(\sqrt{3}) = \frac{6\sqrt{3}}{(2\sqrt{3})^3} = \frac{3}{4\sqrt{3}}$</p>
<p>Since $f’’(-\sqrt{3}) < 0$, it means that $x = -\sqrt{3}$ is a relative maximum. Since $f’’(\sqrt{3}) > 0$, it means that $x = \sqrt{3}$ is a relative minimum. The point $x=1$ is not a critical point since the function is undefined at that point.</p>
<p>So, the critical points are:</p>
<p>$x = -\sqrt{3}$ is a relative maximum</p>
<p>$x = \sqrt{3}$ is a relative minimum.</p>
</div>
</details>
<details class="example">
<summary> Response 2 </summary>
<div>
<p>Problem: Consider the function $f(x) = \frac{x^2 - 6x + 9}{x^2 - 4}$. Classify the critical points of the function as relative maxima, relative minima, or inflection points.</p>
<p>Solution: To classify the critical points, we need to find the critical points and then determine their behavior using the second derivative test.</p>
<p>Finding critical points: To find the critical points, we need to find the points where the first derivative of the function is equal to zero or the first derivative does not exist. The first derivative of the function is given by:
$f’(x) = \frac{(2x - 6)(x^2 - 4) - (x^2 - 6x + 9)(2x)}{(x^2 - 4)^2}$</p>
<p>Simplifying the expression, we get:</p>
<p>$f’(x) = \frac{-x^4 + 8x^3 - 5x^2 - 72x + 72}{(x^2 - 4)^2}$</p>
<p>Setting $f’(x) = 0$, we find that:</p>
<p>$-x^4 + 8x^3 - 5x^2 - 72x + 72 = 0$</p>
<p>We can factor this expression as:</p>
<p>$(x - 2)(x - 3)(x^2 - 3x - 12) = 0$</p>
<p>Solving for $x$, we get the critical points:</p>
<p>$x = 2, 3, \frac{3 \pm \sqrt{105}}{2}$</p>
<p>So, the critical points are $x = 2, 3, \frac{3 - \sqrt{105}}{2}, \frac{3 + \sqrt{105}}{2}$.</p>
<p>Second derivative test: To determine the behavior of the critical points, we use the second derivative test. We find the second derivative of the function and evaluate it at the critical points:
$f’‘(x) = \frac{2x^5 - 30x^4 + 144x^3 - 320x^2 + 288x - 72}{(x^2 - 4)^3}$</p>
<p>$f’‘(2) = \frac{360}{36} = 10$</p>
<p>$f’‘(3) = \frac{-540}{49} < 0$</p>
<p>$f’‘\left(\frac{3 - \sqrt{105}}{2}\right) > 0$</p>
<p>$f’‘\left(\frac{3 + \sqrt{105}}{2}\right) < 0$</p>
<p>Since $f’‘(2) > 0$, it means that $x = 2$ is a relative minimum. Since $f’‘(3) < 0$, it means that $x = 3$ is a relative maximum. Since $f’‘\left(\frac{3 - \sqrt{105}}{2}\right) > 0$ and $f’‘\left(\frac{3 + \sqrt{105}}{2}\right) < 0$, it means that $\frac{3 - \sqrt{105}}{2}$ is an inflection point and $\frac{3 + \sqrt{105}}{2}$ is a local maximum.</p>
<p>So, the critical points are:</p>
<p>$x = 2$ is a relative minimum
$x = 3$ is a relative maximum
$\frac{3 - \sqrt{105}}{2}$ is an inflection point
$\frac{3 + \sqrt{105}}{2}$ is a local maximum.</p>
</div>
</details>
<p>Both of these questions are at least moderately reasonable, but the answers are wildly off-base.</p>
<ul>
<li>
<p><strong>Response 1</strong> computes the derivative incorrectly, sets it equal to zero correctly, computes the second derivative incorrectly, and plugs the critical points into it incorrectly. It does draw the correct conclusions from the results of those computations, though.</p>
<p>It also makes a random sign error on one of the critical points, and makes a technically incorrect claim about what qualifies as a critical point. (Points where the derivative is undefined <em>are</em> critical points, but <em>can’t</em> be maxima or minima.)</p>
</li>
<li>
<p><strong>Response 2</strong> computes the derivative correctly, but simplifies it incorrectly, then sets it equal to zero incorrectly. It computes the second derivative incorrectly, and then plugs in the critical points it computed incorrectly.</p>
<p>It almost interprets the computations correctly, but then fails: it claims that because \(f^{\prime \prime} \left( \frac{3-\sqrt{105}}{2} \right) > 0 \), that point is an <em>inflection point</em>, which is definitely wrong.</p>
<p>(Also it repeats itself at the end, but that’s not wrong, just badly written.)</p>
</li>
</ul>
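<p>Simplification slips like the one in Response 2 are easy to catch by comparing the unsimplified and “simplified” expressions at a few sample points. A sketch, using only the quotient-rule numerator ChatGPT itself set up:</p>

```python
def numerator_quotient_rule(x):
    # the numerator ChatGPT set up correctly
    return (2 * x - 6) * (x**2 - 4) - (x**2 - 6 * x + 9) * (2 * x)

def numerator_claimed(x):
    # ChatGPT's "simplified" quartic
    return -x**4 + 8 * x**3 - 5 * x**2 - 72 * x + 72

def numerator_actual(x):
    # the true simplification: 6x^2 - 26x + 24 = 2(3x - 4)(x - 3)
    return 6 * x**2 - 26 * x + 24

samples = [0, 1, 2, 5]
matches_actual = all(numerator_quotient_rule(x) == numerator_actual(x) for x in samples)
matches_claimed = all(numerator_quotient_rule(x) == numerator_claimed(x) for x in samples)
# matches_actual is True; matches_claimed is False
```

<p>(The true numerator factors as \(2(3x-4)(x-3)\), so the only critical points are \(x = 4/3\) and \(x = 3\).)</p>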
<p>Overall this doesn’t seem like an awful way to brainstorm problems, but I would have to write solutions myself and confirm the problems are reasonably solvable.</p>
<h3 id="related-rates-problems">Related Rates problems</h3>
<p class="theorem"><strong>Prompt:</strong> Write a related rates problem with a solution</p>
<details class="example">
<summary> Response 1 </summary>
<div>
<p>Problem: A spherical balloon is being inflated with air at a rate of 4 cubic feet per minute. At the instant when the radius of the balloon is 2 feet, how fast is the radius of the balloon increasing?</p>
<p>Solution: We can use the formula for the volume of a sphere to relate the rate of change of the volume of the balloon with the rate of change of the radius:</p>
<p>$V = \frac{4}{3}\pi r^3$</p>
<p>Taking the derivative of both sides with respect to time, we get:</p>
<p>$\frac{dV}{dt} = 4\pi r^2 \frac{dr}{dt}$</p>
<p>We are given that $\frac{dV}{dt} = 4$ cubic feet per minute, and we know that the radius of the balloon is 2 feet. So, substituting these values, we get:</p>
<p>$4 = 4\pi (2)^2 \frac{dr}{dt}$</p>
<p>Simplifying, we get:</p>
<p>$\frac{dr}{dt} = \frac{1}{4\pi (2)^2} = \frac{1}{16\pi}$ feet per minute</p>
<p>So, the radius of the balloon is increasing at a rate of $\frac{1}{16\pi}$ feet per minute when the radius of the balloon is 2 feet.</p>
</div>
</details>
<details class="example">
<summary> Response 2 </summary>
<div>
<p>Problem: A water tank is in the shape of a cylinder with a radius of 5 feet and a height of 10 feet. Water is flowing into the tank at a rate of 3 cubic feet per minute. How fast is the water level rising when the water is 6 feet deep?</p>
<p>Solution:</p>
<p>Let’s begin by labeling the variables we need:</p>
<p>$r$ is the radius of the tank, which is 5 feet.
$h$ is the height of the water level in the tank.
$V$ is the volume of water in the tank.
$t$ is time.
$dV/dt$ is the rate at which water is flowing into the tank, which is 3 cubic feet per minute.
We want to find the rate of change of $h$ with respect to time when $h = 6$ feet.</p>
<p>To solve the problem, we need to relate $h$ and $V$. The volume of a cylinder is given by the formula:</p>
<p>$V = \pi r^2 h$</p>
<p>Differentiating with respect to time, we get:</p>
<p>$dV/dt = \pi r^2 dh/dt$</p>
<p>Solving for $dh/dt$, we get:</p>
<p>$dh/dt = (dV/dt)/( \pi r^2)$</p>
<p>Now, we need to find $dV/dt$ when $h = 6$ feet. The volume of water in the tank when the water level is 6 feet deep is:</p>
<p>$V = \pi r^2 h = \pi (5^2)(6) = 150\pi$</p>
<p>Differentiating with respect to time, we get:</p>
<p>$dV/dt = 150\pi (dh/dt)$</p>
<p>Substituting $dV/dt = 3$ and $r = 5$, we get:</p>
<p>$3 = 150\pi (dh/dt)$</p>
<p>Solving for $dh/dt$, we get:</p>
<p>$dh/dt = 3/(150\pi) = 0.02$ feet per minute</p>
<p>So, the water level is rising at a rate of 0.02 feet per minute when the water is 6 feet deep.</p>
</div>
</details>
<details class="example">
<summary> Response 3</summary>
<div>
<p>Problem: A spherical balloon is being inflated at a rate of 10 cubic feet per minute. Find the rate at which the radius of the balloon is increasing when the diameter is 2 feet.</p>
<p>Solution:</p>
<p>Let’s start by writing down what we know:</p>
<p>The volume of a sphere is given by V = (4/3)πr^3, where r is the radius.
We know that dV/dt = 10, since the balloon is being inflated at a rate of 10 cubic feet per minute.
We want to find dr/dt when the diameter is 2 feet, which means r = 1 foot.
To find the rate at which the radius of the balloon is increasing, we need to relate dV/dt and dr/dt using the chain rule. We have:</p>
<p>dV/dt = dV/dr * dr/dt</p>
<p>We can find dV/dr by taking the derivative of the volume formula with respect to r:</p>
<p>dV/dr = 4πr^2</p>
<p>Substituting the values we know, we get:</p>
<p>10 = (4/3)π(1^3) * dr/dt * 4π(1^2)</p>
<p>Simplifying the expression, we get:</p>
<p>dr/dt = 15/(2π)</p>
<p>So, the rate at which the radius of the balloon is increasing when the diameter is 2 feet is 15/(2π) feet per minute.</p>
</div>
</details>
<ul>
<li>
<p><strong>Response 1</strong> is a simple and straightforward, but solid, problem. And the solution is so close to being right: it does everything correctly until the very end, and then makes an arithmetic error in the final step.</p>
</li>
<li>
<p><strong>Response 2</strong> is also pretty simple, maybe too simple for me to really like it. The solution starts off strong, but halfway through it gets confused, starts the problem over, and does it wrong.</p>
</li>
<li>
<p><strong>Response 3</strong> is also a simple problem, basically the same as Response 1. I was impressed it kept diameter versus radius straight. The solution starts out strong, but then again goes off the rails halfway through and gives a slightly wrong answer at the end.</p>
</li>
</ul>
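<p>For comparison, all three correct answers are one line of arithmetic each, using the equations the responses themselves derived; a sketch:</p>

```python
import math

# Response 1 (balloon): dV/dt = 4*pi*r^2 * dr/dt, with dV/dt = 4 and r = 2
dr_dt = 4 / (4 * math.pi * 2**2)      # = 1/(4*pi), about 0.0796, not 1/(16*pi)

# Response 2 (tank): dV/dt = pi*r^2 * dh/dt, with dV/dt = 3 and r = 5
dh_dt = 3 / (math.pi * 5**2)          # = 3/(25*pi), about 0.0382, not 0.02

# Response 3 (balloon again): dV/dt = 10 and r = 1 (diameter 2 ft)
dr_dt_3 = 10 / (4 * math.pi * 1**2)   # = 5/(2*pi), not 15/(2*pi)
```
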
<p>At this point I wanted to see if I could prompt it into giving me more interesting problems. I also decided to play around with some prompt engineering ideas: what if I tell it what sort of thing it wants to achieve?</p>
<p class="theorem"><strong>Prompt:</strong> Pretend you’re a brilliant professor who wants to write an interesting and challenging calculus test.
Write a related rates problem involving similar triangles with a solution.</p>
<details class="example">
<summary> Response 1 </summary>
<div>
<p>Problem: A hot air balloon is rising vertically at a rate of 5 meters per second. At a certain instant, the angle of elevation between the observer on the ground and the balloon is 60 degrees. The observer is 150 meters away from the launch point of the balloon. How fast is the distance between the observer and the balloon increasing at this instant?</p>
<p>Solution:</p>
<p>Let’s start by drawing a diagram to visualize the problem:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
B (balloon)
/|
/ |
/ |h
/ |
/θ |
/ |
/ |
A (observer)
</code></pre></div> </div>
<p>We have a right triangle ABC, where A is the observer, B is the balloon, and C is the point on the ground directly below the balloon. We know that AB = 150 meters, and the angle θ is 60 degrees. We want to find how fast the distance BC is changing when the balloon is rising at a rate of 5 meters per second.</p>
<p>We can see that triangles ABC and ABD are similar, so we can use the property of similar triangles to relate the sides:</p>
<p>AB/AC = BD/BC</p>
<p>Simplifying the expression, we get:</p>
<p>BC = (AB*BD)/AC</p>
<p>To find BD, we need to use trigonometry. We have:</p>
<p>tan θ = BD/AC</p>
<p>Substituting the known values, we get:</p>
<p>tan 60 = BD/AC</p>
<p>sqrt(3) = BD/AC</p>
<p>BD = sqrt(3) * AC</p>
<p>Now, we can substitute this value into the equation we derived earlier:</p>
<p>BC = (AB*sqrt(3)*AC)/AC</p>
<p>Simplifying the expression, we get:</p>
<p>BC = 150*sqrt(3)</p>
<p>Now, we need to find how fast BC is changing with respect to time. We have:</p>
<p>dBC/dt = d/dt (150*sqrt(3))</p>
<p>dBC/dt = 0 (since 150*sqrt(3) is a constant)</p>
<p>Therefore, the distance between the observer and the balloon is not changing at this instant, and its rate of change is zero.</p>
</div>
</details>
<details class="example">
<summary> Response 2 </summary>
<div>
<p>Problem: A flagpole standing vertically on a hill has a height of 20 feet and is located on a slope with an inclination of 30 degrees. A person standing on the same slope 100 feet away from the base of the flagpole observes that the angle of elevation to the top of the flagpole is increasing at a rate of 0.1 radians per minute. At what rate is the person moving closer to the flagpole at the moment when the angle of elevation is 60 degrees?</p>
<p>Solution:</p>
<p>Let’s start by drawing a diagram of the situation:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
/|
/ |
/ | 20 ft
/ |
/ |
/ |
/-----------------\
100 ft
</code></pre></div> </div>
<p>In the diagram, the flagpole has height 20 feet and is located on a slope with inclination 30 degrees. The person is standing on the same slope 100 feet away from the base of the flagpole and observes that the angle of elevation to the top of the flagpole is increasing at a rate of 0.1 radians per minute.</p>
<p>Let’s define some variables:</p>
<p>Let x be the distance between the person and the base of the flagpole.
Let y be the height of the person’s eye above the ground.
Let θ be the angle of elevation to the top of the flagpole.
From the diagram, we can see that the triangles formed by the person, the flagpole, and the ground are similar. Therefore, we have:</p>
<p>y / x = (20 + y) / (x / cos 30)</p>
<p>Simplifying this equation, we get:</p>
<p>y = 20x / (2x + sqrt(3)x)</p>
<p>We want to find the rate at which the person is moving closer to the flagpole at the moment when the angle of elevation is 60 degrees, which means θ = π/3 radians. We also know that dθ/dt = 0.1 radians per minute.</p>
<p>Using trigonometry, we can relate θ and y:</p>
<p>tan θ = (20 + y) / x</p>
<p>Taking the derivative with respect to time, we get:</p>
<p>sec^2 θ * dθ/dt = (dy/dt * x - y * dx/dt) / x^2</p>
<p>Substituting the values we know, we get:</p>
<p>1/3 * dθ/dt = (dy/dt * 100 - y * dx/dt) / 10000</p>
<p>We want to find dx/dt when θ = π/3 and dθ/dt = 0.1. We can use the equation y = 20x / (2x + sqrt(3)x) to eliminate y:</p>
<p>1/3 * 0.1 = (dy/dt * 100 - (20x / (2x + sqrt(3)x)) * dx/dt) / 10000</p>
<p>Simplifying the expression and solving for dx/dt, we get:</p>
<p>dx/dt = 0.1429 feet per minute</p>
<p>So, the person is moving closer to the flagpole at a rate of 0.1429 feet per minute when the angle of elevation is 60 degrees.</p>
</div>
</details>
<ul>
<li>
<p><strong>Response 1</strong> certainly gives a much more challenging problem than the previous prompt ever did! And the fact that it can draw a diagram of the problem in ASCII art is really cool, although not terribly helpful.</p>
<p>But ChatGPT gets confused as soon as it starts writing the solution: it can’t keep track of which points A, B, and C we want to talk about. Then it compares the triangle ABC it sketched out to a similar triangle ABD, which doesn’t actually exist in the problem or the diagram.</p>
<p>From there it continues to be kind of incoherent. It has \(\tan(\theta) = BD/AC\), which should be pieces of different triangles, so that doesn’t make much sense. And it concludes with making one of the key basic errors that calc 1 students make: plugging in values before taking the derivative, so that your derivatives all wind up being zero.</p>
<p>(This is also the first time I had to do editing to make the formatting work, but I’m not holding that against ChatGPT; it rendered fine in the GPT window, but my site messed around with it a bit.)</p>
</li>
<li>
<p><strong>Response 2</strong> is a problem that is hard enough that I’d struggle with it. It’s interesting, but I can’t assign it!</p>
<p>The solution makes things worse by introducing additional considerations (like the height of the person’s eye above the ground!). It declares that three triangles are all similar, without being clear on what those triangles are, and I’m pretty sure this problem has no similar triangles at all.</p>
<p>After that detour it does some correct-ish work, but makes the usual algebra errors, and then falls back on the weird not-similar-triangles argument from earlier.</p>
</li>
</ul>
<p>Okay, let’s give up on the prompt engineering, and see what happens if we just ask for a normal similar triangles problem.</p>
<p class="theorem"><strong>Prompt:</strong> Write a related rates problem involving similar triangles with a solution.</p>
<details class="example">
<summary> Response 1</summary>
<div>
<p>Problem: A person standing 10 meters away from a street light observes that the angle of elevation from their eye to the top of the light is 60 degrees. The person begins to walk away from the light at a rate of 2 meters per second. How fast is the angle of elevation changing when the person is 20 meters away from the light?</p>
<p>Solution:</p>
<p>Let’s start by drawing a diagram of the situation:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
|\
| \
| \
| \
| \
| \
| \
| \
|θ \ 10 m
| \
| \
| \
|__________\
</code></pre></div> </div>
<p>In the diagram, the person is standing 10 meters away from the street light and observes that the angle of elevation from their eye to the top of the light is 60 degrees. The person begins to walk away from the light at a rate of 2 meters per second.</p>
<p>Let’s define some variables:</p>
<p>Let d be the distance between the person and the light.
Let θ be the angle of elevation to the top of the light.
From the diagram, we can see that the triangles formed by the person, the street light, and the ground are similar. Therefore, we have:</p>
<p>d / 10 = tan θ</p>
<p>Taking the derivative with respect to time, we get:</p>
<p>1/10 * dd/dt = sec^2 θ * dθ/dt</p>
<p>We want to find dθ/dt when d = 20 meters. We also know that dd/dt = 2 meters per second.</p>
<p>Using trigonometry, we can relate θ and d:</p>
<p>tan θ = d / 10</p>
<p>Substituting the values we know, we get:</p>
<p>1/10 * 2 = sec^2 θ * dθ/dt</p>
<p>Simplifying the expression, we get:</p>
<p>dθ/dt = 1/(5sec^2 θ) radians per second</p>
<p>We can use the equation tan θ = d / 10 to eliminate θ:</p>
<p>tan θ = d / 10</p>
<p>sec^2 θ = 1 + tan^2 θ</p>
<p>Substituting these values into the expression for dθ/dt, we get:</p>
<p>dθ/dt = 1 / (5(1 + (d/10)^2)) radians per second</p>
<p>Substituting d = 20, we get:</p>
<p>dθ/dt = 1 / (5(1 + 4)) radians per second</p>
<p>Simplifying the expression, we get:</p>
<p>dθ/dt = 1 / 25 radians per second</p>
<p>So, the angle of elevation is changing at a rate of 1/25 radians per second when the person is 20 meters away from the light.</p>
</div>
</details>
<p>And I think we’ve conclusively shown that ChatGPT doesn’t understand what similar triangles are.</p>
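<p>For the record, the streetlight problem is perfectly solvable without any similar triangles. A numeric sketch of the intended answer (assuming, as the setup implies, that the light’s height is fixed by the initial 60-degree sighting at 10 meters):</p>

```python
import math

h = 10 * math.tan(math.radians(60))    # light height, about 17.32 m, fixed

def theta(d):
    # angle of elevation at distance d from the light
    return math.atan(h / d)

d, dd_dt = 20.0, 2.0
# analytic: theta = atan(h/d), so d(theta)/dt = -h/(d^2 + h^2) * dd/dt
dtheta_dt = -h / (d**2 + h**2) * dd_dt

# cross-check with a centered finite difference along the walk
eps = 1e-6
numeric = (theta(d + eps) - theta(d - eps)) / (2 * eps) * dd_dt
```

<p>Both the analytic and numeric values come out to \(-\sqrt{3}/35 \approx -0.0495\) radians per second; the sign is negative because walking away shrinks the angle.</p>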
<h4 id="brainstorming-related-rates-problems">Brainstorming Related Rates Problems</h4>
<p>At this point I’m pretty convinced that ChatGPT won’t write up good solutions. What if I just ask for problems?<strong title="Note: at this point I also started a new conversation, to see if that would change things. I haven't done a lot of testing on how much stuff changes if you continue a previous conversational thread versus starting a new one."><sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup></strong></p>
<p class="theorem"><strong>Prompt:</strong> Write me some related rates problems involving distance.</p>
<details class="example">
<summary> Response 1</summary>
<div>
<p>Sure! Here are a few related rates problems involving distance:</p>
<ol>
<li>
<p>Two cars are driving towards each other on a straight road. Car A is traveling at 60 mph, and car B is traveling at 40 mph. If the distance between them is 1 mile, how fast are they getting closer to each other?</p>
</li>
<li>
<p>A ladder is leaning against a wall. The base of the ladder is 6 feet from the wall, and the top of the ladder is 10 feet up the wall. If the bottom of the ladder is being pulled away from the wall at a rate of 2 feet per second, how fast is the top of the ladder sliding down the wall when the base is 8 feet from the wall?</p>
</li>
<li>
<p>A person is standing at the edge of a cliff, looking out at the ocean. The cliff is 100 feet high. If the person’s line of sight makes an angle of 30 degrees with the horizontal, how far out from the cliff is the person looking?</p>
</li>
<li>
<p>A person is walking away from a lamppost at a rate of 3 feet per second. The person is initially 10 feet from the lamppost. How fast is the person’s distance from the lamppost increasing after 5 seconds?</p>
</li>
<li>
<p>A ball is thrown straight up into the air from a height of 5 feet, with an initial velocity of 20 feet per second. How long does it take for the ball to reach a height of 25 feet, and what is the ball’s velocity at that point?</p>
</li>
</ol>
</div>
</details>
<p>The second suggestion is a standard but perfectly good related rates problem, but the others don’t fare nearly so well. They did all involve distance somehow, though!</p>
<ul>
<li>The first problem is technically a related rates problem, but it doesn’t require calculus.</li>
<li>The third doesn’t involve any rates at all.</li>
<li>The fourth asks about <em>one</em> rate, and also it just tells you what that rate is in the problem. It’s purely a reading comprehension question.</li>
<li>The fifth also has <em>one</em> rate, but isn’t a related rates problem. It also doesn’t contain enough information to solve the problem; you would need to know the acceleration to work out the problem, and even with that it would require integrals.</li>
</ul>
<p>The last one was the most interesting to me, in terms of how <em>I</em> related to it. You can plausibly do the problem if you assume acceleration under earth gravity. And I spent a bit of time trying to figure out if ChatGPT intended me to do that, and thinking about how I could get it to tell me—before realizing that asking about ChatGPT’s “intentions” is misunderstanding it. I could ask more questions and it could make a decision, but it doesn’t have a stable “what I meant” that I could try to probe.</p>
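<p>(As it happens, granting standard gravity doesn’t quite rescue the problem: with an assumed acceleration of \(-32\) feet per second squared, the height is \(h(t) = 5 + 20t - 16t^2\), which peaks at 11.25 feet, so the ball never reaches 25 feet at all. A quick check, under that assumed acceleration:)</p>

```python
# Height under an *assumed* standard gravity of -32 ft/s^2:
# h(t) = 5 + 20t - 16t^2 (initial height 5 ft, initial velocity 20 ft/s)
h0, v0, g = 5.0, 20.0, 32.0
t_peak = v0 / g                        # velocity reaches zero at t = 0.625 s
h_peak = h0 + v0 * t_peak - 16 * t_peak**2
# h_peak = 11.25 ft, far short of the 25 ft the problem asks about
```
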
<p>Let’s try again.</p>
<details class="example">
<summary> Response 2</summary>
<div>
<p>Sure, here are some related rates problems involving distance:</p>
<ol>
<li>
<p>Two cars are traveling towards each other on a straight road. Car A is traveling at 60 mph and car B is traveling at 40 mph. If the distance between the two cars is initially 500 miles, how fast is the distance between them decreasing?</p>
</li>
<li>
<p>A ladder 10 feet long rests against a vertical wall. If the bottom of the ladder slides away from the wall at a rate of 2 feet per second, how fast is the top of the ladder sliding down the wall when the bottom of the ladder is 6 feet from the wall?</p>
</li>
<li>
<p>A person is walking away from a lamppost at a rate of 4 feet per second. The height of the lamppost is 20 feet. How fast is the person’s shadow lengthening when they are 25 feet from the lamppost?</p>
</li>
<li>
<p>A ball is thrown straight up into the air with an initial velocity of 40 feet per second. The height of the ball, h, in feet after t seconds is given by h = 40t - 16t^2. How fast is the ball moving when it reaches a height of 30 feet?</p>
</li>
<li>
<p>Two boats are approaching each other in a straight line. One boat is traveling at a speed of 10 mph and the other is traveling at a speed of 15 mph. The boats are initially 1000 feet apart. How fast is the distance between the boats decreasing when they are 500 feet apart?</p>
</li>
</ol>
</div>
</details>
<p>Surprisingly, the first two questions in Response 2 are almost identical to the first two questions in Response 1. This go-around, the third question is also a good, standard related rates problem; the fourth is a derivative word problem that doesn’t involve related rates; and the fifth is essentially the same as the first.</p>
<p>My conclusion: this isn’t useless, actually! With like two minutes of effort I got three usable problems. And it’s not that I <em>couldn’t</em> have written these problems myself, but it’s easy for me to get into decision paralysis while question-writing. This certainly fixes the brainstorming problem, and gives me raw material to shape into what I want.</p>
<p>But I’m going to have to write up my own solutions.</p>
<h4 id="can-it-solve-my-problems">Can it solve my problems?</h4>
<p>Or do I? In the previous prompts, it failed to write correct solutions to the problems it generated. But I was curious what would happen if I just gave it a question and let it solve it. Would that work any better?</p>
<details class="example">
<summary> A ladder is leaning against a wall. The base of the ladder is 6 feet from the wall, and the top of the ladder is 10 feet up the wall. If the bottom of the ladder is being pulled away from the wall at a rate of 2 feet per second, how fast is the top of the ladder sliding down the wall when the base is 8 feet from the wall?</summary>
<div>
<p>Let’s start by drawing a diagram:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
|\
| \
| \
| \
| \
| \
| \
| \
|h \
| \
| \
| \
| \
| \
--------------
d
</code></pre></div> </div>
<p>Here, the ladder is represented by the diagonal line, with the base of the ladder represented by “d” and the height of the ladder represented by “h”. We want to find how fast the top of the ladder is sliding down the wall, which is represented by the rate of change of “h”, as the base of the ladder is being pulled away from the wall.</p>
<p>We can see that the distance between the top of the ladder and the wall, which we’ll call “x”, is proportional to the height of the ladder and inversely proportional to the distance between the base of the ladder and the wall:</p>
<p>x/d = h/10</p>
<p>Taking the derivative of both sides with respect to time t, we get:</p>
<p>(dx/dt)/d - x/(d^2)(dd/dt) = (dh/dt)/10</p>
<p>We want to find (dh/dt) when d = 8 feet. We are given that the base of the ladder is being pulled away from the wall at a rate of 2 feet per second, so dd/dt = 2.</p>
<p>When d = 8, we can use the proportion above to find x:</p>
<p>x/8 = 10/h</p>
<p>x = 80/h</p>
<p>We also know that d^2 + h^2 = 10^2, since the ladder is 10 feet long. When d = 6, we can use this equation to find h:</p>
<p>6^2 + h^2 = 10^2</p>
<p>h^2 = 64</p>
<p>h = 8</p>
<p>So when the base of the ladder is 6 feet from the wall, the ladder reaches a height of 8 feet on the wall.</p>
<p>Now we can plug in the values we have into the related rates equation:</p>
<p>(dx/dt)/d - x/(d^2)(dd/dt) = (dh/dt)/10</p>
<p>(dx/dt)/8 - (80/h)/(8^2)(2) = (dh/dt)/10</p>
<p>We need to find (dh/dt) when d = 8 and h = 6. We can use the equation d^2 + h^2 = 10^2 to find h when d = 8:</p>
<p>8^2 + h^2 = 10^2</p>
<p>h^2 = 36</p>
<p>h = 6</p>
<p>Now we can plug in the values we have:</p>
<p>(dx/dt)/8 - (80/6)/(8^2)(2) = (dh/dt)/10</p>
<p>(dx/dt)/8 - 10/192 = (dh/dt)/10</p>
<p>We want to find (dh/dt), so we can isolate it on one side:</p>
<p>(dh/dt)/10 = (dx/dt)/8 - 10/192</p>
<p>(dh/dt) = (dx/dt)/0.8 - 1.25</p>
<p>We know that dx/dt = 2 when the base of the ladder is being pulled away from the wall at a rate of 2 feet per second. So when d = 8 and h = 6, we have:</p>
<p>(dh/dt) = (2)/0.8 - 1.25</p>
<p>(dh/dt) = 0.25</p>
<p>Therefore, when the base of the ladder is 8 feet from the wall, the top of the ladder is sliding down the wall</p>
</div>
</details>
<details class="example">
<summary> A person is walking away from a lamppost at a rate of 4 feet per second. The height of the lamppost is 20 feet. How fast is the person’s shadow lengthening when they are 25 feet from the lamppost?</summary>
<div>
<p>Let’s draw a diagram to represent the situation:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
|\
| \
| \
| \
| \
| \
| \
| \
| \
| \
|h \
| \
------------
d
</code></pre></div> </div>
<p>Here, the lamppost is represented by the vertical line on the left, and the person is walking away from it to the right. The distance between the person and the lamppost is represented by “d”, and the height of the lamppost is represented by “h”. We want to find how fast the person’s shadow is lengthening, which is represented by the length of the line segment from the person’s feet to the base of the lamppost, as the person walks away.</p>
<p>We can see that the height of the person’s shadow, which we’ll call “s”, is proportional to the height of the lamppost and inversely proportional to the distance between the person and the lamppost:</p>
<p>s/d = h/20</p>
<p>Taking the derivative of both sides with respect to time t, we get:</p>
<p>(ds/dt)/d - s/(d^2)(dd/dt) = (dh/dt)/20</p>
<p>We want to find (ds/dt) when d = 25 feet. We are given that the person is walking away from the lamppost at a rate of 4 feet per second, so dd/dt = 4. We are also given that the height of the lamppost is 20 feet, so dh/dt = 0.</p>
<p>When d = 25, we can use the proportion above to find s:</p>
<p>s/25 = 20/20</p>
<p>s = 25</p>
<p>So when the person is 25 feet from the lamppost, their shadow is 25 feet long.</p>
<p>Now we can plug in the values we have into the related rates equation:</p>
<p>(ds/dt)/d - s/(d^2)(dd/dt) = (dh/dt)/20</p>
<p>(ds/dt)/25 - 25/(25^2)(4) = 0</p>
<p>(ds/dt)/25 = 0.005</p>
<p>ds/dt = 0.005 * 25</p>
<p>ds/dt = 0.125</p>
<p>Therefore, when the person is 25 feet from the lamppost, their shadow is lengthening at a rate of 0.125 feet per second.</p>
</div>
</details>
<details class="example">
<summary> A ladder 10 feet long rests against a vertical wall. If the bottom of the ladder slides away from the wall at a rate of 2 feet per second, how fast is the top of the ladder sliding down the wall when the bottom of the ladder is 6 feet from the wall?</summary>
<div>
<p>We can use the Pythagorean theorem to relate the distance of the ladder from the wall to the height it reaches on the wall. Let’s call the distance the bottom of the ladder is from the wall “x” and the height the ladder reaches on the wall “y”. Then we have:</p>
<p>x^2 + y^2 = 10^2</p>
<p>We can take the derivative of both sides of this equation with respect to time t to get the related rates:</p>
<p>2x(dx/dt) + 2y(dy/dt) = 0</p>
<p>We are given that the bottom of the ladder is sliding away from the wall at a rate of 2 feet per second, so dx/dt = 2. We want to find how fast the top of the ladder is sliding down the wall, so we need to solve for dy/dt.</p>
<p>When the bottom of the ladder is 6 feet from the wall, we can use the Pythagorean theorem to find the corresponding height on the wall:</p>
<p>6^2 + y^2 = 10^2</p>
<p>y^2 = 64</p>
<p>y = 8</p>
<p>So when the bottom of the ladder is 6 feet from the wall, the ladder is reaching a height of 8 feet on the wall.</p>
<p>Now we can plug in the values we have into the related rates equation:</p>
<p>2x(dx/dt) + 2y(dy/dt) = 0</p>
<p>2(6)(2) + 2(8)(dy/dt) = 0</p>
<p>12 + 16(dy/dt) = 0</p>
<p>16(dy/dt) = -12</p>
<p>dy/dt = -0.75</p>
<p>Therefore, when the bottom of the ladder is 6 feet from the wall, the top of the ladder is sliding down the wall at a rate of 0.75 feet per second.</p>
</div>
</details>
<p>The first two solutions are complete nonsense. The third one is very nearly correct: the setup is exactly right, but it drops a factor of 2 in the arithmetic (2(6)(2) is 24, not 12), so the answer should be -1.5 feet per second. That’s not nearly reliable enough to be useful, but it <em>is</em> by far the closest thing to a correct solution I’ve seen in these experiments, so it’s worth noting.</p>
<h3 id="calculus-2-oh-yeah-i-actually-have-to-write-a-test">Calculus 2: Oh yeah, I actually have to write a test</h3>
<p>At some point I remembered that I actually had to write a test to give to my calc 2 class. So maybe I should see if I can write some questions for that.</p>
<h4 id="trig-sub-integrals">Trig sub integrals</h4>
<details class="example">
<summary> write some trig sub integral problems</summary>
<div>
<ol>
<li>
<p>Evaluate $\displaystyle\int \frac{\sqrt{9-x^2}}{x^2} \, dx$ using the substitution $x=3\sin\theta$.</p>
</li>
<li>
<p>Evaluate $\displaystyle\int \frac{1}{x^2\sqrt{x^2+4}} \, dx$ using the substitution $x=2\tan\theta$.</p>
</li>
<li>
<p>Evaluate $\displaystyle\int \frac{1}{x\sqrt{x^2-1}} \, dx$ using the substitution $x=\sec\theta$.</p>
</li>
<li>
<p>Evaluate $\displaystyle\int \frac{x^3}{\sqrt{1-x^2}} \, dx$ using the substitution $x=\sin\theta$.</p>
</li>
<li>
<p>Evaluate $\displaystyle\int \frac{dx}{\sqrt{x^2+9}\left(x^2+4\right)}$ using the substitution $x=3\tan\theta$.</p>
</li>
<li>
<p>Evaluate $\displaystyle\int \frac{x^3}{\sqrt{x^2+1}} \, dx$ using the substitution $x=\tan\theta$.</p>
</li>
<li>
<p>Evaluate $\displaystyle\int \frac{1}{x^4\sqrt{4x^2-1}} \, dx$ using the substitution $x=\frac{1}{2}\sec\theta$.</p>
</li>
<li>
<p>Evaluate $\displaystyle\int \frac{dx}{\sqrt{25-x^2}\left(x^2+16\right)}$ using the substitution $x=5\sin\theta$.</p>
</li>
<li>
<p>Evaluate $\displaystyle\int \frac{x^3}{\sqrt{4x^2-1}} \, dx$ using the substitution $x=\frac{1}{2}\sec\theta$.</p>
</li>
<li>
<p>Evaluate $\displaystyle\int \frac{\sqrt{16-x^2}}{x^2} \, dx$ using the substitution $x=4\sin\theta$.</p>
</li>
</ol>
</div>
</details>
<details class="example">
<summary> Can you write solutions to those problems for me?</summary>
<div>
<ol>
<li>
<p>Let $x=3\sin\theta$. Then $dx=3\cos\theta\, d\theta$, and $\sqrt{9-x^2}=\sqrt{9-9\sin^2\theta}=3\cos\theta$. Substituting these into the integral, we get
\[
\begin{array}{rl}
\int \frac{\sqrt{9-x^2}}{x^2} \, dx &= \int \frac{3\cos\theta}{9\sin^2\theta} \cdot 3\cos\theta\, d\theta \\\<br />
&= \frac{1}{3} \int \csc^2\theta \, d\theta \\\<br />
&= -\frac{1}{3}\cot\theta + C \\\<br />
&= -\frac{1}{3}\cot(\sin^{-1}(x/3)) + C.
\end{array}
\]</p>
</li>
<li>
<p>Let $x=2\tan\theta$. Then $dx=2\sec^2\theta\, d\theta$, and $\sqrt{x^2+4}=\sqrt{4\tan^2\theta+4}=2\sec\theta$. Substituting these into the integral, we get
\[
\begin{array}{rl}
\int \frac{1}{x^2\sqrt{x^2+4}} \, dx &= \int \frac{1}{4\tan^2\theta \cdot 2\sec\theta} \cdot 2\sec^2\theta\, d\theta \\\<br />
&= \frac{1}{2} \int \cos\theta \cdot \sec^3\theta \, d\theta \\\<br />
&= \frac{1}{2} \int \frac{\cos\theta}{\cos^4\theta} \, d\theta \\\<br />
&= \frac{1}{2} \int \frac{\cos\theta}{(1-\sin^2\theta)^2} \, d\theta \\\<br />
&= -\frac{1}{2} \frac{\cos\theta}{1-\sin^2\theta} + \frac{1}{2} \int \frac{1}{1-\sin^2\theta} \, d\theta \\\<br />
&= -\frac{1}{2}\frac{\cos\theta}{1-\sin^2\theta} + \frac{1}{4}\ln\left|\frac{1+\sin\theta}{1-\sin\theta}\right| + C \\\<br />
&= -\frac{1}{2}\frac{x}{\sqrt{x^2+4}} + \frac{1}{4}\ln\left|\frac{x+\sqrt{x^2+4}}{2}\right| + C.
\end{array}
\]</p>
</li>
</ol>
<p>Let $x=\sec\theta$. Then $dx=\sec\theta\tan\theta\, d\theta$, and $\sqrt{x^2-1}=\sqrt{\sec^2\theta-1}=\tan\theta$. Substituting these into the integral, we get
\[
\begin{array}{rl}
\int \frac{1}{x\sqrt{x^2-1}} \, dx &= \int \frac{\sec\theta\tan\theta}{\tan\theta} \, d\theta \\\<br />
&= \int \sec\theta \, d\theta \\\<br />
&= \ln|\sec\theta + \tan\theta| + C \\\<br />
&= \ln|x+\sqrt{x^2-1}| + C.
\end{array}
\]</p>
</div>
</details>
<p>First off, this obviously only answers three questions. I think it was hitting a character limit in the answer. It actually cut itself off right before finishing the third one: it stopped with the “\end” at the end of the array environment. But I decided to give it credit for that and finished off the environment syntax for it.</p>
<p>This was also the first time I had to do any other meaningful LaTeX editing. There are places where it <em>should</em> have produced a “\” or a “\,” and instead just gave a “” or a “,”; I assume that’s some issue with the way it processes text. I went ahead and fixed that but wanted to make a note of it. (I also had to do some extra editing to make it render correctly on this site, but that says more about Markdown and MathJax than about ChatGPT.)</p>
<p>The substance of the answers is also interesting. Each of them does the substitution right, then screws up the algebra in the middle and gives a completely wrong answer at the end.</p>
<p>But the problems are pretty good!</p>
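<p>Since I was checking a lot of these by hand, it’s worth noting a lazy way to audit machine-generated antiderivatives: differentiate them numerically and compare against the integrand. Here’s a quick sketch (my own Python, standard library only; the “table” antiderivative is the standard known answer to problem 1, not anything ChatGPT produced):</p>

```python
import math

def num_deriv(F, x, h=1e-6):
    """Central-difference approximation to F'(x)."""
    return (F(x + h) - F(x - h)) / (2 * h)

# Problem 1's integrand: sqrt(9 - x^2) / x^2
f = lambda x: math.sqrt(9 - x**2) / x**2

# ChatGPT's answer: -(1/3) cot(arcsin(x/3))
F_gpt = lambda x: -1 / (3 * math.tan(math.asin(x / 3)))

# The standard table antiderivative: -sqrt(9 - x^2)/x - arcsin(x/3)
F_ref = lambda x: -math.sqrt(9 - x**2) / x - math.asin(x / 3)

for x in [0.5, 1.0, 2.0]:
    print(f"x={x}: ChatGPT off by {num_deriv(F_gpt, x) - f(x):+.4f}, "
          f"table answer off by {num_deriv(F_ref, x) - f(x):+.4f}")
```

<p>The table answer’s error is floating-point noise; ChatGPT’s is not.</p>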
<h4 id="integration-by-parts">Integration by parts</h4>
<p>Let’s try some other integral approaches I need.</p>
<p><strong>Prompt:</strong> Write an integration by parts problem and solve it.</p>
<details class="example">
<summary> Response 1</summary>
<div>
<p>Sure, here’s an example:</p>
<p>Find $\displaystyle\int x^2\ln x \, dx$.</p>
<p>To solve this integral, we will use integration by parts with $u=\ln x$ and $dv=x^2\,dx$. Then $du=\frac{1}{x}\,dx$ and $v=\frac{1}{3}x^3$, so we have
\[
\begin{array}{rl}
\int x^2\ln x \, dx &= \frac{1}{3}x^3\ln x - \int \frac{1}{3}x^2 \, dx \\\<br />
&= \frac{1}{3}x^3\ln x - \frac{1}{9}x^3 + C,
\end{array}
\]
where $C$ is the constant of integration. Therefore, we have
<script type="math/tex">\int x^2\ln x \, dx = \frac{1}{3}x^3\ln x - \frac{1}{9}x^3 + C.</script></p>
</div>
</details>
<details class="example">
<summary> Response 2</summary>
<div>
<p>Sure! Here’s a problem and its solution using integration by parts:</p>
<p>Problem: Evaluate $\int x^2\cos x\, dx$.</p>
<p>Solution: To solve this integral, we use integration by parts with $u=x^2$ and $dv=\cos x\, dx$. Then $du=2x\, dx$ and $v=\sin x$, so we have
\[
\begin{array}{rl}
\int x^2\cos x\, dx &= x^2\sin x - \int 2x\sin x\, dx \\\<br />
&= x^2\sin x + 2\int x\sin x\, dx \\\<br />
&= x^2\sin x + 2\left(x\cos x - \int \cos x\, dx\right) \\\<br />
&= x^2\sin x + 2x\cos x - 2\sin x + C \\\<br />
&= \boxed{x^2\sin x + 2x\cos x - 2\sin x + C}.
\end{array}
\]</p>
</div>
</details>
<p>And it can basically do this! The problems are standard. The first solution looks completely right; the second makes two sign errors, which cancel each other out. (It even boxed the answer in the second example for me, which I find fascinating.)</p>
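<p>It’s easy to confirm that those two errors really do cancel: differentiating the boxed answer recovers the integrand exactly,</p>
<p>\[
\frac{d}{dx}\left[x^2\sin x + 2x\cos x - 2\sin x\right] = \left(2x\sin x + x^2\cos x\right) + \left(2\cos x - 2x\sin x\right) - 2\cos x = x^2\cos x.
\]</p>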
<p>I wanted to see if I could get something a little more challenging.</p>
<details class="example">
<summary> Write a hard integration by parts problem and solve it.</summary>
<div>
<p>Sure, here is a hard integration by parts problem:</p>
<p>Problem: Evaluate $\int x^2 e^{-x} \cos x\, dx$.</p>
<p>Solution: To solve this integral, we use integration by parts twice. First, we choose $u=x^2$ and $dv=e^{-x} \cos x\, dx$, so $du=2x\, dx$ and $v=e^{-x} \sin x$. Then we have
\[
\begin{array}{rl}
\int x^2 e^{-x} \cos x\, dx &= x^2 e^{-x} \sin x - \int 2xe^{-x}\sin x\, dx \\\<br />
&= x^2 e^{-x} \sin x - 2\left(-xe^{-x}\cos x - \int -e^{-x}\cos x\, dx\right) \\\<br />
&= x^2 e^{-x} \sin x + 2xe^{-x}\cos x - 2\left(-e^{-x}\sin x - \int e^{-x}\sin x\, dx\right) \\\<br />
&= x^2 e^{-x} \sin x + 2xe^{-x}\cos x + 2e^{-x}\sin x - 2\left(-e^{-x}\cos x - \int e^{-x}\cos x\, dx\right) \\\<br />
&= x^2 e^{-x} \sin x + 2xe^{-x}\cos x + 2e^{-x}\sin x + 2e^{-x}\cos x - 2\int e^{-x}\cos x\, dx \\\<br />
&= x^2 e^{-x} \sin x + 2xe^{-x}\cos x + 2e^{-x}\sin x + 2e^{-x}\cos x - 2\left(e^{-x}\sin x + \int e^{-x}\sin x\, dx\right) \\\<br />
&= x^2 e^{-x} \sin x + 2xe^{-x}\cos x + 2e^{-x}\sin x + 2e^{-x}\cos x - 2\left(-e^{-x}\cos x + \int -e^{-x}\cos x\, dx\right) \\\<br />
&= x^2 e^{-x} \sin x + 2xe^{-x}\cos x + 2e^{-x}\sin x + 2e^{-x}\cos x + 2e^{-x}\cos x + C \\\<br />
&= \boxed{x^2 e^{-x} \sin x + 4xe^{-x}\cos x + 4e^{-x}\sin x + C}.
\end{array}
\]</p>
</div>
</details>
<p>That’s an interesting problem, although too hard for my test. This time it doesn’t get the answer right, sadly. (It screws up on the first step!)</p>
<p>I want to see if I can get it to generate problems that have coefficients on the variables.</p>
<details class="example">
<summary> Write an integration by parts problem that has coefficients on the x.</summary>
<div>
<p>Here’s an integration by parts problem that has coefficients on the x:</p>
<p>Problem: Evaluate $\int x^2\ln x\, dx$.</p>
<p>Solution: We can use integration by parts with $u = \ln x$ and $dv = x^2\, dx$, so $du = \frac{1}{x}\, dx$ and $v = \frac{x^3}{3}$. Then we have
\[
\begin{array}{rl}
\int x^2\ln x\, dx &= \frac{x^3\ln x}{3} - \int \frac{x^3}{3} \cdot \frac{1}{x}\, dx \\\<br />
&= \frac{x^3\ln x}{3} - \frac{1}{3} \int x^2\, dx \\\<br />
&= \frac{x^3\ln x}{3} - \frac{x^3}{9} + C \\\<br />
&= \boxed{\frac{x^3}{9}(3\ln x - 1) + C}.
\end{array}
\]</p>
</div>
</details>
<p>Wait a minute—it wrote that problem for me already! It’s still a solid problem, and it still got the solution right, but it’s not what I wanted.</p>
<details class="example">
<summary> Write an integration by parts problem where the integrand has scalar coefficients on the x terms</summary>
<div>
<p>Here’s an integration by parts problem where the integrand has scalar coefficients on the x terms:</p>
<p>Problem: Evaluate $\int x e^{-x} \sin x\, dx$.</p>
<p>Solution: We can use integration by parts with $u = \sin x$ and $dv = x e^{-x}\, dx$, so $du = \cos x\, dx$ and $v = -xe^{-x} - e^{-x}$. Then we have
\[
\begin{array}{rl}
\int x e^{-x} \sin x\, dx &= -x e^{-x} \sin x + \int (\cos x - e^{-x})\, dx \\\<br />
&= -x e^{-x} \sin x + \sin x + e^{-x} + C \\\<br />
&= \boxed{(1-x)e^{-x}\sin x + e^{-x}\cos x + C}.
\end{array}
\]</p>
</div>
</details>
<p>Still doesn’t have the coefficients I want, and it does the same thing it did for the “hard” problem I asked for: it asks a really challenging problem, then completely fails to compute it properly.</p>
<p>If I ask it to generate a bunch, will it give me what I want?</p>
<details class="example">
<summary> Write me ten integration by parts problems.</summary>
<div>
<p>Sure, here are ten integration by parts problems:</p>
<ol>
<li>$\int x e^x \cos x\, dx$</li>
<li>$\int x^2 \ln x\, dx$</li>
<li>$\int x^3 e^x \, dx$</li>
<li>$\int x^2 \cos x\, dx$</li>
<li>$\int x \ln x\, dx$</li>
<li>$\int x \cos x\, dx$</li>
<li>$\int e^x \sin x\, dx$</li>
<li>$\int \ln x\, dx$</li>
<li>$\int x^3 \sin x\, dx$</li>
<li>$\int x^2 e^{-x} \, dx$</li>
</ol>
<p>Note: Some of these integrals might be challenging to solve, so you may want to use different techniques (such as substitution) in addition to integration by parts.</p>
</div>
</details>
<p>This is a basically reasonable collection of problems, but literally zero of them have scalars.</p>
<p>I wonder if it’s stuck? I’ll start a clean chat rather than continuing the previous conversation:</p>
<details class="example">
<summary> Write me ten interesting integration by parts problems.</summary>
<div>
<ol>
<li>Evaluate $\int xe^x \, dx$.</li>
<li>Evaluate $\int x^2 \cos x \, dx$.</li>
<li>Evaluate $\int \ln x \, dx$.</li>
<li>Evaluate $\int \arctan x \, dx$.</li>
<li>Evaluate $\int x \sin^2 x \, dx$.</li>
<li>Evaluate $\int \cos^{-1} x \, dx$.</li>
<li>Evaluate $\int \frac{x}{(1+x^2)^2} \, dx$.</li>
<li>Evaluate $\int x \ln x \, dx$.</li>
<li>Evaluate $\int \sqrt{x} e^x \, dx$.</li>
<li>Evaluate $\int x \sin x \, dx$.</li>
</ol>
</div>
</details>
<p>Basically the same, although 7 isn’t an integration by parts problem, and I don’t think 9 is really doable in a standard calculus course.</p>
<p>But I know what I want; if I write the problem can it solve it for me?</p>
<details class="example">
<summary> Evaluate $\int x \sin^2 (3x) \,dx$</summary>
<div>
<p>We can use integration by parts with $u=x$ and $dv=\sin^2(3x) \, dx$. Then $du=dx$ and we can use the double angle formula $\sin^2(3x) = \frac{1}{2}(1 - \cos(6x))$ to get</p>
<p>\[
\begin{array}{rl}
\int x \sin^2(3x) \, dx &= \frac{1}{2} \int x(1-\cos(6x)) \, dx \\\<br />
&= \frac{1}{2}\left(\int x \, dx - \int x \cos(6x) \, dx\right) \\\<br />
&= \frac{1}{2}\left(\frac{1}{2}x^2 - \frac{1}{6}x\sin(6x) - \frac{1}{36}\cos(6x) \right) + C,
\end{array}
\]</p>
<p>where $C$ is the constant of integration. Therefore,</p>
<script type="math/tex; mode=display">\int x \sin^2(3x) \, dx = \frac{1}{4}x^2 - \frac{1}{12}x\sin(6x) - \frac{1}{72}\cos(6x) + C.</script>
</div>
</details>
<details class="example">
<summary> Evaluate $\int x \ln (5x)\, dx$</summary>
<div>
<p>We can use integration by parts with $u = \ln(5x)$ and $dv = x \, dx$. Then, $du = \frac{1}{x} \, dx$ and $v = \frac{1}{2}x^2$. Therefore, we have</p>
<p>\[
\begin{array}{rl}
\int x \ln(5x) \, dx &= \frac{1}{2}x^2\ln(5x) - \int \frac{1}{2}x^2 \cdot \frac{1}{x} \, dx \\\<br />
&= \frac{1}{2}x^2\ln(5x) - \frac{1}{4}x^2 + C,
\end{array}
\]</p>
<p>where $C$ is the constant of integration. Therefore,</p>
<script type="math/tex; mode=display">\int x \ln(5x) \, dx = \frac{1}{2}x^2\ln(5x) - \frac{1}{4}x^2 + C.</script>
</div>
</details>
<details class="example">
<summary> Evaluate $\int xe^{-4x} \, dx$.</summary>
<div>
<p>We can use integration by parts with $u=x$ and $dv=e^{-4x} \, dx$. Then, $du=dx$ and $v=-\frac{1}{4}e^{-4x}$. Therefore, we have</p>
<p>\[
\begin{array}{rl}
\int xe^{-4x} \, dx &= -\frac{1}{4}xe^{-4x} + \frac{1}{4}\int e^{-4x} \, dx \\\<br />
&= -\frac{1}{4}xe^{-4x} - \frac{1}{16}e^{-4x} + C,
\end{array}
\]</p>
<p>where $C$ is the constant of integration. Therefore,</p>
<script type="math/tex; mode=display">\int xe^{-4x} \, dx = -\frac{1}{4}xe^{-4x} - \frac{1}{16}e^{-4x} + C.</script>
</div>
</details>
<p>Does pretty well with that!</p>
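<p>I spot-checked all three the same lazy way, numerically differentiating each reported antiderivative and comparing to its integrand (my own Python sketch, standard library only):</p>

```python
import math

def num_deriv(F, x, h=1e-6):
    # central-difference approximation to F'(x)
    return (F(x + h) - F(x - h)) / (2 * h)

checks = [
    # (integrand, reported antiderivative)
    (lambda x: x * math.sin(3*x)**2,
     lambda x: x**2/4 - x*math.sin(6*x)/12 - math.cos(6*x)/72),
    (lambda x: x * math.log(5*x),
     lambda x: x**2*math.log(5*x)/2 - x**2/4),
    (lambda x: x * math.exp(-4*x),
     lambda x: -x*math.exp(-4*x)/4 - math.exp(-4*x)/16),
]

for f, F in checks:
    assert all(abs(num_deriv(F, x) - f(x)) < 1e-4 for x in [0.5, 1.0, 2.0])
print("all three answers check out")
```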
<p>But let’s do a test with my favorite kind of integration by parts problem.</p>
<details class="example">
<summary> Evaluate $\int sin(3x) e^{-4x} \, dx$.</summary>
<div>
<p>We can use integration by parts with $u = \sin(3x)$ and $dv = e^{-4x} \, dx$. Then, $du = 3\cos(3x) \, dx$ and $v = -\frac{1}{4}e^{-4x}$. Therefore, we have</p>
<p>\[
\begin{array}{rl}
\int \sin(3x) e^{-4x} \, dx &= -\frac{1}{4}\sin(3x) e^{-4x} + \frac{3}{4}\int \cos(3x) e^{-4x} \, dx \\\<br />
&= -\frac{1}{4}\sin(3x) e^{-4x} + \frac{3}{16}\cos(3x) e^{-4x} - \frac{9}{16}\int \sin(3x) e^{-4x} \, dx.
\end{array}
\]</p>
<p>Rearranging terms, we get</p>
<p>\[
\begin{array}{rl}
\int \sin(3x) e^{-4x} \, dx &= \frac{3}{25}\left(-\frac{1}{4}\sin(3x) e^{-4x} + \frac{3}{16}\cos(3x) e^{-4x}\right) + C \\\<br />
&= -\frac{3}{100}\sin(3x) e^{-4x} + \frac{9}{400}\cos(3x) e^{-4x} + C,
\end{array}
\]</p>
<p>where $C$ is the constant of integration. Therefore,</p>
<script type="math/tex; mode=display">\int \sin(3x) e^{-4x} \, dx = -\frac{3}{100}\sin(3x) e^{-4x} + \frac{9}{400}\cos(3x) e^{-4x} + C.</script>
</div>
</details>
<p>It got the right basic idea, but made a couple of minor mistakes. (It also didn’t show work for what’s probably the trickiest step, so it’s not a great template to follow.)</p>
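<p>For the record, here’s the work for that trickiest step, done carefully. Writing \(I\) for the integral, the second integration by parts gives</p>
<p>\[
I = -\frac{1}{4}\sin(3x)e^{-4x} - \frac{3}{16}\cos(3x)e^{-4x} - \frac{9}{16}I,
\]</p>
<p>and solving for \(I\) yields</p>
<p>\[
\int \sin(3x) e^{-4x} \, dx = -\frac{4}{25}\sin(3x)e^{-4x} - \frac{3}{25}\cos(3x)e^{-4x} + C,
\]</p>
<p>which you can confirm by differentiating.</p>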
<details class="example">
<summary> Evaluate $\int \cos(-2x) e^{3x} \, dx$.</summary>
<div>
<p>Using the identity $\cos(-2x) = \cos(2x)$, we can write</p>
<p>\[
\begin{array}{rl}
\int \cos(-2x) e^{3x} \, dx &= \int \cos(2x) e^{3x} \, dx \\\<br />
&= \frac{1}{5}\cos(2x) e^{3x} + \frac{2}{5}\int \sin(2x) e^{3x} \, dx.
\end{array}
\]</p>
<p>We can then use integration by parts with $u = \sin(2x)$ and $dv = e^{3x} \, dx$. Then, $du = 2\cos(2x) \, dx$ and $v = \frac{1}{3}e^{3x}$. Therefore, we have</p>
<p>\[
\begin{array}{rl}
\int \cos(-2x) e^{3x} \, dx &= \frac{1}{5}\cos(2x) e^{3x} + \frac{4}{15}\sin(2x) e^{3x} - \frac{8}{15}\int \cos(2x) e^{3x} \, dx.
\end{array}
\]</p>
<p>Rearranging terms, we get</p>
<p>\[\begin{array}{rl}
\int \cos(-2x) e^{3x} \, dx &= \frac{1}{17}\cos(2x) e^{3x} + \frac{4}{51}\sin(2x) e^{3x} + C,
\end{array}
\]</p>
<p>where $C$ is the constant of integration. Therefore,</p>
<script type="math/tex; mode=display">\int \cos(-2x) e^{3x} \, dx = \frac{1}{17}\cos(2x) e^{3x} + \frac{4}{51}\sin(2x) e^{3x} + C.</script>
</div>
</details>
<p>A very good catch that \(\cos(x)\) is an even function, but it doesn’t handle the actual integration very well. It doesn’t show work for the first integration by parts, and in fact gets it wrong. And then it keeps making more mistakes at every step.</p>
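<p>For reference, the standard closed form it’s groping toward, easily verified by differentiating, is</p>
<p>\[
\int e^{ax}\cos(bx) \, dx = \frac{e^{ax}\left(a\cos(bx) + b\sin(bx)\right)}{a^2+b^2} + C,
\]</p>
<p>which with \(a=3\) and \(b=2\) gives \(\tfrac{1}{13}e^{3x}\left(3\cos(2x)+2\sin(2x)\right) + C\): denominators of 13, not 17 and 51.</p>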
<p>So like with the related rates problems, this is pretty good at brainstorming questions, but not so great at solving them.</p>
<h4 id="partial-fractions">Partial Fractions</h4>
<p>Let’s try some partial fractions integrals.</p>
<details class="example">
<summary> Write me an integral that involves partial fractions, and solve it</summary>
<div>
<p>Sure! Here’s an example:</p>
<p>Evaluate the integral <script type="math/tex">\int \frac{4x^3 - 7x^2 + 2x - 1}{x^4 - 5x^3 + 4x^2 + 4x - 4} \, dx.</script></p>
<p>To solve this integral, we first need to factor the denominator: <script type="math/tex">x^4 - 5x^3 + 4x^2 + 4x - 4 = (x-1)(x-2)(x^2+1).</script></p>
<p>Since $x^2+1$ is an irreducible quadratic over the real numbers, we cannot factor it any further. However, we can use partial fractions to write the integrand as a sum of simpler fractions:</p>
<script type="math/tex; mode=display">\frac{4x^3 - 7x^2 + 2x - 1}{x^4 - 5x^3 + 4x^2 + 4x - 4} = \frac{A}{x-1} + \frac{B}{x-2} + \frac{Cx+D}{x^2+1},</script>
<p>where $A$, $B$, $C$, and $D$ are constants to be determined. To find these constants, we can multiply both sides by the denominator and then equate coefficients of like terms:</p>
<p>\[
\begin{array}{rl}
4x^3 - 7x^2 + 2x - 1 &= A(x-2)(x^2+1) + B(x-1)(x^2+1) + (Cx+D)(x-1)(x-2) \\\<br />
&= (A+B+C)x^3 + (-2A-2B+C+D)x^2 + (A+B-C)x + (-2A+2B+D).
\end{array}
\]</p>
<p>Equating coefficients of like terms, we get the system of equations</p>
<p>\[
\begin{array}{rl}
A + B + C &= 4 \\\<br />
-2A - 2B + C + D &= -7 \\\<br />
A + B - C &= 2 \\\<br />
-2A + 2B + D &= -1.
\end{array}
\]</p>
<p>Solving this system of equations, we find $A = 1$, $B = 2$, $C = -2$, and $D = 1$. Therefore, we have</p>
<p>\[
\begin{array}{rl}
\int \frac{4x^3 - 7x^2 + 2x - 1}{x^4 - 5x^3 + 4x^2 + 4x - 4} \, dx &= \int \frac{1}{x-1} \, dx + \int \frac{2}{x-2} \, dx + \int \frac{-2x+1}{x^2+1} \, dx \\\<br />
&= \ln|x-1| + 2\ln|x-2| - \ln(x^2+1) + C,
\end{array}
\]</p>
<p>where $C$ is the constant of integration.</p>
</div>
</details>
<p>That is, indeed, a partial fractions problem. A very, very difficult partial fractions problem. Which it proceeds to get badly wrong, starting by factoring the denominator incorrectly (though it does find a real factor, which I’m impressed by if it wasn’t an accident).</p>
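<p>Verifying a claimed factorization like this is cheap enough to automate; a quick sketch (mine, not ChatGPT’s):</p>

```python
# Quick check of ChatGPT's claimed factorization of the denominator.
p = lambda x: x**4 - 5*x**3 + 4*x**2 + 4*x - 4          # the denominator
claimed = lambda x: (x - 1) * (x - 2) * (x**2 + 1)      # ChatGPT's factorization

print(p(1))              # 0: so (x - 1) really is a factor
print(p(2))              # nonzero: so (x - 2) is not
print(p(3), claimed(3))  # the two polynomials don't even agree at x = 3
```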
<p>It does integrate the function it finds correctly, though!</p>
<p>Let’s look at another one.</p>
<details class="example">
<summary> Response 2</summary>
<div>
<p>Sure, here’s an integral that involves partial fractions:</p>
<script type="math/tex; mode=display">\int \frac{3x+1}{x^2+4x+3} \, dx</script>
<p>To solve this integral, we first factor the denominator as $(x+1)(x+3)$. Then, we write the fraction as</p>
<script type="math/tex; mode=display">\frac{3x+1}{(x+1)(x+3)} = \frac{A}{x+1} + \frac{B}{x+3}</script>
<p>where $A$ and $B$ are constants to be determined. Multiplying both sides by $(x+1)(x+3)$ and equating coefficients, we get</p>
<script type="math/tex; mode=display">3x+1 = A(x+3) + B(x+1)</script>
<p>Expanding and collecting like terms, we get</p>
<script type="math/tex; mode=display">(A+B)x + 3A + B = 3x + 1</script>
<p>Equating coefficients, we have</p>
<p>\[
\begin{array}{rl}
A + B &= 3 \\\<br />
3A &= 1 \\\<br />
B &= -2
\end{array}
\]</p>
<p>Solving for $A$ and $B$, we get $A = \frac{1}{3}$ and $B = -2$. Therefore,</p>
<p>\[
\begin{array}{rl}
\int \frac{3x+1}{x^2+4x+3} \, dx &= \int \frac{1}{x+1} \, dx - 2\int \frac{1}{x+3} \, dx \\\<br />
&= \ln|x+1| - 2\ln|x+3| + C,
\end{array}
\]</p>
<p>where $C$ is the constant of integration. Therefore,</p>
<script type="math/tex; mode=display">\int \frac{3x+1}{x^2+4x+3} \, dx = \ln|x+1| - 2\ln|x+3| + C.</script>
</div>
</details>
<p>This time we get a reasonable problem. I could put this on a test! But we don’t get a correct solution. ChatGPT factors the denominator correctly, and sets up the right equations to solve, but then solves them incorrectly. The decomposition it integrates matches neither the correct one nor the one it found, though it does integrate that final formula correctly.</p>
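<p>For what it’s worth, the Heaviside cover-up method pins the constants down immediately, and makes a one-line sanity check (my own sketch; the correct decomposition turns out to be \(\frac{-1}{x+1} + \frac{4}{x+3}\)):</p>

```python
# Heaviside cover-up for (3x + 1)/((x + 1)(x + 3)) = A/(x + 1) + B/(x + 3)
A = (3*(-1) + 1) / (-1 + 3)   # evaluate (3x+1)/(x+3) at the root x = -1
B = (3*(-3) + 1) / (-3 + 1)   # evaluate (3x+1)/(x+1) at the root x = -3
print(A, B)  # -1.0 4.0

# sanity check: recombine and compare to the original at a test point
x = 2.0
assert abs(A/(x + 1) + B/(x + 3) - (3*x + 1)/((x + 1)*(x + 3))) < 1e-12
```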
<p>Let’s give it one more try.</p>
<details class="example">
<summary> Response 3 </summary>
<div>
<p>Here’s an integral that involves partial fractions:</p>
<script type="math/tex; mode=display">\int \frac{x^3}{x^2-4x+3} \, dx</script>
<p>To solve this integral, we first need to factor the denominator. We can write</p>
<script type="math/tex; mode=display">x^2-4x+3 = (x-1)(x-3)</script>
<p>Therefore, we can rewrite the integral as</p>
<p>\[\begin{array}{rl}
\int \frac{x^3}{x^2-4x+3} \, dx &= \int \frac{x^3}{(x-1)(x-3)} \, dx \\\<br />
&= \int \frac{Ax+B}{x-1} \, dx + \int \frac{Cx+D}{x-3} \, dx,
\end{array}\]</p>
<p>where $A$, $B$, $C$, and $D$ are constants to be determined. Multiplying both sides by the denominator, we get</p>
<script type="math/tex; mode=display">x^3 = (Ax+B)(x-3) + (Cx+D)(x-1).</script>
<p>Expanding and equating coefficients, we get</p>
<p>\[\begin{array}{rl}
A+C &= 0 \\\<br />
-3A+B+D &= 0 \\\<br />
-3B+D &= 0 \\\<br />
-B &= 1
\end{array}\]</p>
<p>Solving for $A$, $B$, $C$, and $D$, we get</p>
<script type="math/tex; mode=display">A = \frac{1}{2}, \quad B = -1, \quad C = -\frac{1}{2}, \quad D = \frac{3}{2}</script>
<p>Therefore, we can write</p>
<p>\[\begin{array}{rl}
\int \frac{x^3}{x^2-4x+3} \, dx &= \int \frac{\frac{1}{2}x-1}{x-1} \, dx + \int \frac{-\frac{1}{2}x+\frac{3}{2}}{x-3} \, dx \\\<br />
&= \frac{1}{2}\ln|x-1| - \frac{1}{2}\ln|x-3| - \frac{5}{4}\int \frac{1}{x-1} \, dx + \frac{3}{4}\int \frac{1}{x-3} \, dx \\\<br />
&= \frac{1}{2}\ln\left|\frac{x-1}{x-3}\right| - \frac{5}{4}\ln|x-1| + \frac{3}{4}\ln|x-3| + C \\\<br />
&= \frac{1}{4}\ln\left|\frac{x-1}{x-3}\right| - \frac{5}{4}\ln|x-1| + \frac{3}{4}\ln|x-3| + C,
\end{array}\]</p>
<p>where $C$ is the constant of integration. Therefore,</p>
<script type="math/tex; mode=display">\int \frac{x^3}{x^2-4x+3} \, dx = \frac{1}{4}\ln\left|\frac{x-1}{x-3}\right| - \frac{5}{4}\ln|x-1| + \frac{3}{4}\ln|x-3| + C.</script>
</div>
</details>
<p>A reasonable problem, again! It skips polynomial long division in favor of a janky partial-fractions setup that isn’t <em>quite</em> set up right; at any rate it solves the equations it sets up incorrectly, does the integrals wrong, and winds up with the wrong answer.</p>
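<p>For comparison, here’s how the long-division route goes. Dividing first,</p>
<p>\[
\frac{x^3}{x^2-4x+3} = x + 4 + \frac{13x-12}{(x-1)(x-3)},
\]</p>
<p>and the cover-up method gives \(\frac{13x-12}{(x-1)(x-3)} = \frac{-1/2}{x-1} + \frac{27/2}{x-3}\), so</p>
<p>\[
\int \frac{x^3}{x^2-4x+3} \, dx = \frac{x^2}{2} + 4x - \frac{1}{2}\ln|x-1| + \frac{27}{2}\ln|x-3| + C.
\]</p>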
<h4 id="other-miscellaneous-experiments">Other miscellaneous experiments</h4>
<details class="example">
<summary>Write a numeric integration problem </summary>
<div>
<p>Here’s a numeric integration problem:</p>
<p>Approximate the value of the integral</p>
<script type="math/tex; mode=display">\int_0^1 \frac{x}{1+x^2} \, dx</script>
<p>using the trapezoidal rule with $n=4$ subintervals.</p>
<p>Solution:</p>
<p>Using the trapezoidal rule, we have</p>
<p>\[\begin{array}{rl}
\int_0^1 \frac{x}{1+x^2} \, dx &\approx \frac{1}{2}\left[f(0)+f\left(\frac{1}{4}\right)\right]+\frac{1}{2}\left[f\left(\frac{1}{4}\right)+f\left(\frac{1}{2}\right)\right]\\\<br />
&\qquad+\frac{1}{2}\left[f\left(\frac{1}{2}\right)+f\left(\frac{3}{4}\right)\right]+\frac{1}{2}\left[f\left(\frac{3}{4}\right)+f(1)\right]\\\<br />
&= \frac{1}{2}\left[\frac{0}{1+0^2}+\frac{\frac{1}{4}}{1+\left(\frac{1}{4}\right)^2}\right]+\frac{1}{2}\left[\frac{\frac{1}{4}}{1+\left(\frac{1}{4}\right)^2}+\frac{\frac{1}{2}}{1+\left(\frac{1}{2}\right)^2}\right]\\\<br />
& \qquad +\frac{1}{2}\left[\frac{\frac{1}{2}}{1+\left(\frac{1}{2}\right)^2}+\frac{\frac{3}{4}}{1+\left(\frac{3}{4}\right)^2}\right]+\frac{1}{2}\left[\frac{\frac{3}{4}}{1+\left(\frac{3}{4}\right)^2}+\frac{1}{1+1^2}\right]\\\<br />
&\approx 0.3735
\end{array}\]</p>
<p>Therefore, the approximate value of the integral using the trapezoidal rule with $n=4$ subintervals is $0.3735$.</p>
</div>
</details>
<p>A reasonable problem, <em>almost</em> sets up the calculation correctly but forgets to include the \(\Delta x\) term, and then does the numeric calculation at the end wrong.</p>
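<p>Including the \(\Delta x\) is most of the work here. A minimal implementation of the trapezoidal rule (my own sketch, standard library only) gives the number ChatGPT should have reported:</p>

```python
import math

def trapezoid(f, a, b, n):
    """Composite trapezoidal rule with n subintervals."""
    dx = (b - a) / n
    total = (f(a) + f(b)) / 2 + sum(f(a + i*dx) for i in range(1, n))
    return total * dx

f = lambda x: x / (1 + x**2)
approx = trapezoid(f, 0, 1, 4)
exact = math.log(2) / 2            # the true value of the integral

print(approx)          # about 0.34132
print(exact - approx)  # about 0.0052, comfortably inside the 1/96 error bound
```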
<details class="example">
<summary> Can you estimate the error in the previous calculation? </summary>
<div>
<p>Yes, we can estimate the error in the previous calculation using the error bound formula for the trapezoidal rule:</p>
<script type="math/tex; mode=display">E_T \leq \frac{K(b-a)^3}{12n^2}</script>
<p>where $K$ is an upper bound on the second derivative of the function over the interval of integration.</p>
<p>For our integral $\int_0^1 \frac{x}{1+x^2} \, dx$, we have</p>
<p>\[\begin{array}{rl}
f(x) &= \frac{x}{1+x^2} \\\<br />
f’‘(x) &= \frac{2x^3-2x}{(1+x^2)^3} \\\<br />
\end{array}\]</p>
<p>Taking the absolute value and finding the maximum value of $f’‘(x)$ on the interval $[0,1]$, we have:</p>
<p>\[\begin{array}{rl}
|f’‘(x)| &= \frac{2|x|(1-|x|^2)}{(1+x^2)^3} \\\<br />
|f’‘(x)| &\leq \frac{2(1-0)}{(1+0^2)^3} = 2 \\\<br />
\end{array}\]</p>
<p>Therefore, $K=2$, and we can estimate the error as:</p>
<p>\[\begin{array}{rl}
E_T &\leq \frac{K(b-a)^3}{12n^2} \\\<br />
&= \frac{2(1-0)^3}{12(4)^2} \\\<br />
&= \frac{1}{96} \\\<br />
&\approx 0.0104 \\\<br />
\end{array}\]</p>
<p>So the error in our previous calculation is estimated to be less than $0.0104$.</p>
</div>
</details>
<p>It figures out what I’m asking, and gives the right formula and right basic procedure. But it computes the second derivative wrong, and I have no idea where it’s getting the upper bound on the derivative from.</p>
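<p>For the record, here’s what the second derivative should be:</p>
<p>\[
f'(x) = \frac{1-x^2}{(1+x^2)^2}, \qquad f''(x) = \frac{2x^3-6x}{(1+x^2)^3}.
\]</p>
<p>On \([0,1]\) the maximum of \(|f''(x)|\) is roughly \(1.46\), so \(K=2\) does happen to be a legitimate (if unexplained) upper bound, and the final \(1/96\) estimate is valid even though the second derivative it was based on is wrong.</p>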
<details class="example">
<summary> Write and solve a problem that involves the inverse function theorem </summary>
<div>
<p>Here’s a problem that involves the inverse function theorem:</p>
<p>Suppose $f(x) = x^3 - 4x + 2$ and $g(x) = \sqrt[3]{x+4}$ for $x \geq -4$. Let $h(x) = f(g(x))$. Find $h’(x)$.</p>
<p>Solution:</p>
<p>Using the chain rule, we have:</p>
<p>\[\begin{array}{rl}
h(x) &= f(g(x)) \\\<br />
h’(x) &= f’(g(x)) \cdot g’(x) \\\<br />
&= (3g(x)^2 - 4) \cdot \frac{1}{3\sqrt[3]{(x+4)^2}} \\\<br />
&= \frac{3\sqrt[3]{(x+4)^2}(3\sqrt[3]{(x+4)})^2 - 4}{3\sqrt[3]{(x+4)^2}} \\\<br />
&= \frac{9(x+4) - 4}{3\sqrt[3]{(x+4)^2}} \\\<br />
&= \frac{9x + 32}{3\sqrt[3]{(x+4)^2}} \\\<br />
&= \frac{3(3x+32)}{\sqrt[3]{(x+4)^2}} \\\<br />
\end{array}\]</p>
<p>Therefore, $h’(x) = \frac{3(3x+32)}{\sqrt[3]{(x+4)^2}}$.</p>
</div>
</details>
<p>That’s not the inverse function theorem. It’s almost a correct solution, though! Let’s give it another chance.</p>
<details class="example">
<summary> Response 2 </summary>
<div>
<p>Here’s a problem that involves the inverse function theorem:</p>
<p>Let $f(x) = \sqrt{x^2 + 1}$ for $x \in \mathbb{R}$. Show that $f$ is invertible on $[0,\infty)$, find its inverse function $f^{-1}$, and calculate $(f^{-1})’(1)$.</p>
<p>Solution:</p>
<p>To show that $f$ is invertible on $[0,\infty)$, we need to show that $f$ is one-to-one and onto on this interval.</p>
<p>First, note that $f(x) > 0$ for all $x \in \mathbb{R}$, so $f$ is onto on $[0,\infty)$.</p>
<p>To show that $f$ is one-to-one on $[0,\infty)$, suppose $f(x_1) = f(x_2)$ for some $x_1,x_2 \in [0,\infty)$. Then $\sqrt{x_1^2 + 1} = \sqrt{x_2^2 + 1}$, so $x_1^2 + 1 = x_2^2 + 1$, which implies $x_1 = x_2$. Therefore, $f$ is one-to-one on $[0,\infty)$, and hence it is invertible on this interval.</p>
<p>To find the inverse function of $f$, we solve for $x$ in terms of $y$ in the equation $y = \sqrt{x^2 + 1}$:</p>
<p>\[\begin{array}{rl}
y &= \sqrt{x^2 + 1} \\\<br />
y^2 &= x^2 + 1 \\\<br />
x^2 &= y^2 - 1 \\\<br />
x &= \pm\sqrt{y^2 - 1}
\end{array}\]</p>
<p>Since $f(x) > 0$ for all $x \in \mathbb{R}$, we take the positive square root to define the inverse function $f^{-1}$:</p>
<script type="math/tex; mode=display">f^{-1}(y) = \sqrt{y^2 - 1} \quad \text{for } y \in [1,\infty)</script>
<p>To calculate $(f^{-1})’(1)$, we use the inverse function theorem, which states that if $f$ is differentiable and invertible with inverse function $f^{-1}$, then $(f^{-1})’(y) = \frac{1}{f’(f^{-1}(y))}$.</p>
<p>In this case, we have $f(x) = \sqrt{x^2 + 1}$, so $f’(x) = \frac{x}{\sqrt{x^2 + 1}}$. Therefore, $f^{-1}(y) = \sqrt{y^2 - 1}$ and $f’(f^{-1}(y)) = \frac{f^{-1}(y)}{\sqrt{(f^{-1}(y))^2 + 1}} = \frac{\sqrt{y^2 - 1}}{\sqrt{(y^2 - 1) + 1}} = \frac{\sqrt{y^2 - 1}}{y}$.</p>
<p>Substituting $y=1$, we get $(f^{-1})’(1) = \frac{\sqrt{(1)^2 - 1}}{1} = \sqrt{0} = 0$. Therefore, $(f^{-1})’(1) = 0$.</p>
</div>
</details>
<p>This isn’t really what I was looking for, but it is much more responsive, at least. (The solution still has a bunch of minor errors.)</p>
<p class="theorem"><strong>Prompt:</strong> Write an integral that involves logarithms or exponentials</p>
<p>I asked it to respond to this prompt five different times, and five times it came up with \( \int \ln(x) \,dx\). It was very convinced that was the right response. I got better results with the following prompt:</p>
<details class="example">
<summary>Write some integral problems that involve logarithms </summary>
<div>
<ol>
<li>$\displaystyle\int \frac{\ln(x)}{x} dx$</li>
<li>$\displaystyle\int \ln(\sqrt{x^2 + 1}) dx$</li>
<li>$\displaystyle\int \frac{\ln(2x + 1)}{x + 1} dx$</li>
<li>$\displaystyle\int \ln^2(x) dx$</li>
<li>$\displaystyle\int \frac{1}{x\ln(x)} dx$</li>
<li>$\displaystyle\int \ln(x)e^{\ln(x)} dx$</li>
<li>$\displaystyle\int \ln(x) \sqrt{x} dx$</li>
<li>$\displaystyle\int \ln^3(x) dx$</li>
<li>$\displaystyle\int \frac{\ln(x)}{(x+1)^2} dx$</li>
<li>$\displaystyle\int \frac{1}{x\ln^2(x)} dx$</li>
</ol>
</div>
</details>
<p>Then I picked one of these and asked ChatGPT to work out a solution; but it went off the rails immediately. And then I realized that I’d forgotten to type the integral sign in the question. I guess putting in a badly-written question pushed it to give a badly written response. When I typed the question correctly, I got a clean and correct solution immediately! But for some reason, it rendered the LaTeX instead of displaying the code, so I couldn’t copy and paste it. That was new behavior and I don’t understand it.</p>
<hr />
<p><a id="conclusion"></a></p>
<p>This was really interesting, and occasionally useful. So I’m probably going to keep playing around with it. I may write a followup if I find anything especially interesting. But for now I’ll leave it here.</p>
<hr />
<p><em>Have you tried using chatbots to write assignments? Have you gotten them to do useful things for you? Do you have ideas for how I could make this work better? Tweet me <a href="https://twitter.com/profjaydaigle">@ProfJayDaigle</a> or leave a comment below.</em></p>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>I know you’re seeing some weird formatting on the second derivative, but that’s not actually ChatGPT’s fault; that has to do with a bug in the way LaTeX compiles through MathJax, which is what allows me to display it on the blog. I could fix the display issue but I wanted to keep the output genuinely unedited. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>Note: at this point I also started a new conversation, to see if that would change things. I haven’t done a lot of testing on how much stuff changes if you continue a previous conversational thread versus starting a new one. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Jay DaigleChatGPT is cool, but doesn't seem useful yet for doing serious intellectual work. But is it useful for more routine stuff? I wanted to see if I could use ChatGPT to write test questions for my calculus courses. I'm experimenting with using ChatGPT to write test questions. My verdict: not completely useless!Why I’m Not Scared of the New Chatbots2023-02-27T00:00:00-08:002023-02-27T00:00:00-08:00https://jaydaigle.net/blog/not-scared-of-chatbots<p>If you haven’t already heard about AI chatbots, you probably haven’t been on the internet in the past couple of months. In November, OpenAI released <a href="https://openai.com/blog/chatgpt/">ChatGPT</a>, which can engage in text conversations with coherent text that looks like it was written by a real person. Then a couple weeks ago Bing rolled out <a href="https://en.wikipedia.org/wiki/Microsoft_Bing#OpenAI_language_model">its own chatbot</a>, which was more engaging but also much less reliable, producing a spate of lurid stories of “Sydney” expressing a desire to be human, threatening users, and claiming to have murdered one of its developers.</p>
<p><img src="/assets/blog/gpt/sydney-spying.png" alt="" class="center blog-image" /></p>
<p class="center blog-image"><em>James Vincent of The Verge is one of the many people who had <a href="https://www.theverge.com/2023/2/15/23599072/microsoft-ai-bing-personality-conversations-spy-employees-webcams">truly wild conversations with Microsoft’s chatbot</a>.</em></p>
<p>The core technology underlying both of these chatbots has been around for a while<strong title="[GPT-2] was released in February 2019, and [GPT-3], which ChatGPT is based on, was released in June 2020. I've been at least peripherally following this technology since even before the release of GPT-2, so ChatGPT and Sydney are a lot less surprising to me than they are to a lot of people—they're improved versions of something I was already familiar with."><sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup></strong>, but the new products are more polished, accessible, and compelling, which means a lot of people are experiencing them for the first time. These products have also generated a certain amount of both triumphalism (GPT will revolutionize everything!) and fear (GPT will take over everything!) among people who take the possibilities of AI seriously.</p>
<p>I’m not an expert in these systems, just an interested amateur who’s been following them for a while. But the hype about GPT seems wildly overblown. The current approach to programming chatbots has real limits that I don’t think we can surpass without some genuinely new breakthroughs. And understanding some surprising facts about <em>human</em> psychology can help us develop intuition for what these systems will and won’t be able to do.</p>
<p>But first I want to mention that if you want to support my writing, I now have a <a href="https://ko-fi.com/jaydaigle">Ko-Fi account</a>. Any tips would be appreciated and would help me write more essays like this.</p>
<h2 id="how-does-gpt-work">How does GPT work?</h2>
<p>GPT is a text generation algorithm based on something called a large language model. The basic idea is that GPT has analyzed a huge corpus of written text and produced a model that looks at a bit of writing and predicts what words are likely to come next.</p>
<p>Humans do that all the time. If I hear the phrase “My friend Jim threw a ball and I caught—”, I will expect the next word to be “it”. But other continuations are possible: if I hear “the ball” or “that ball”, I won’t be <em>that</em> surprised. If I hear “the flu”, I’ll be kind of surprised, but “I caught the flu” is a reasonable thing to hear; it’s just a bit of a non sequitur after “My friend Jim threw a ball”. But if the next word were “green” or “solitude”, I’d be really confused. I suspect this is the only time anyone has ever written the sentence “My friend Jim threw a ball and I caught solitude”.</p>
<p>I started out describing a way to <em>predict</em> text, but it’s easy to turn that into a way to <em>produce</em> text. For instance, we could start with a prompt, and have our model keep supplying the most-likely next word until we’ve written enough. This is a fancier version of the memes that ask you to type “I hate it when” into your phone and see what autocomplete suggests. I tried that prompt on my phone, and got this:</p>
<p class="center blog-image"><img src="/assets/blog/gpt/autocomplete.jpg" alt="Phone screenshot: I hate it when I get home I will be there in about half hour and a half hour and half an hour and a half hour and half of the day off" class="blog-image center" />
<em>I usually <strong>don’t</strong> hate it when I get home, actually.</em></p>
<p>And this illustrates the problem with that first suggestion: if you <em>always</em> take the <em>most</em> likely next word, you can get stuck. Even if you don’t wind up in a loop like that one, you’ll still say pretty boring things, since your writing is always as unsurprising as possible. Actual text-generation systems introduce some randomness into the choice, so that you usually get a fairly likely next word, but not always the most likely one.</p>
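<p>To make the “predict the next word” idea concrete, here’s a toy sketch. (This is nothing like a real transformer: the “model” is just a hand-written table of next-word probabilities, with numbers I made up to mimic the autocomplete screenshot above. It only illustrates why greedy decoding loops while weighted sampling can escape.)</p>

```python
import random

# Toy next-word "model": for each word, the possible next words and their
# probabilities. (Hand-written, illustrative numbers -- not trained on anything.)
model = {
    "half": [("an", 0.6), ("hour", 0.4)],
    "an":   [("hour", 1.0)],
    "hour": [("and", 0.7), ("of", 0.3)],
    "and":  [("a", 0.6), ("half", 0.4)],
    "a":    [("half", 1.0)],
    "of":   [("the", 1.0)],
    "the":  [("day", 1.0)],
    "day":  [("off", 1.0)],
    "off":  [("half", 1.0)],
}

def greedy(start, n=10):
    """Always take the single most likely next word."""
    words = [start]
    for _ in range(n):
        options = model[words[-1]]
        words.append(max(options, key=lambda wp: wp[1])[0])
    return " ".join(words)

def sampled(start, n=10, seed=None):
    """Pick the next word at random, weighted by its probability."""
    rng = random.Random(seed)
    words = [start]
    for _ in range(n):
        next_words, probs = zip(*model[words[-1]])
        words.append(rng.choices(next_words, weights=probs)[0])
    return " ".join(words)

print(greedy("half"))           # loops: "half an hour and a half an hour and a half"
print(sampled("half", seed=0))  # weighted sampling can take the "of the day off" branch
```

<p>The greedy version reproduces exactly the kind of loop in the phone screenshot; the sampled version produces likely-but-not-identical continuations each run, which is the basic trick real systems use (usually with a “temperature” parameter controlling how much randomness to allow).</p>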
<h2 id="gpt-works-surprisingly-well">GPT works surprisingly well</h2>
<p>This basic idea has been around for decades, but in 2017 a team at Google developed a new algorithm called the <a href="https://nostalgebraist.tumblr.com/post/185326092369/the-transformer-explained">transformer</a> that worked much better than any previous strategies; since then, the technology has developed rapidly.</p>
<p>Already in 2019 we could produce substantial quantities of fluent, grammatical, and sometimes even stylish English text. The newest products are even more impressive. They can give <a href="https://arxiv.org/abs/2301.07597">helpful answers to questions in a number of fields</a>, including finance, medicine, law, and psychology. They can <a href="https://twitter.com/mukul0x/status/1625673579399446529">summarize the contents of research papers</a>. They can <a href="https://marginalrevolution.com/marginalrevolution/2023/02/ai-porn.html">make you fall in love</a>.</p>
<p><img src="/assets/blog/gpt/chess-game.gif" alt="" class="blog-image center" /></p>
<p class="blog-image center"><em>They can also play the world’s <a href="https://www.reddit.com/r/AnarchyChess/comments/10ydnbb/i_placed_stockfish_white_against_chatgpt_black/">most chaotic game of chess</a>. Here ChatGPT is playing black.</em></p>
<p>And this success has led people to wonder what comes next. How good will AI chatbots get? Will they <a href="https://www.washingtonpost.com/education/2022/12/28/chatbot-cheating-ai-chatbotgpt-teachers/">make it impossible to avoid cheating on schoolwork</a>? Will they replace your <a href="https://cybernews.com/tech/ai-doctor-chatgpt-medical-exams/">doctor</a>, your <a href="https://arstechnica.com/information-technology/2023/02/generative-ai-is-coming-for-the-lawyers/">lawyer</a>, or your <a href="https://arstechnica.com/information-technology/2023/01/contoversy-erupts-over-non-consensual-ai-mental-health-experiment/">therapist</a>? Will they make desk jobs obsolete?</p>
<p>Are they self-aware? Are they intelligent beings?</p>
<h3 id="does-gpt-really-think">Does GPT really think?</h3>
<p>The most obvious take on GPT is that it can’t think; it’s just expressing statistical relationships among words. In the narrowest sense, this is certainly true; it’s just a very sophisticated technology for predicting what words should come next in a string of text.</p>
<p>And since it’s just doing prediction, it should be very limited in what it can do. GPT won’t produce original thoughts; it can only express relationships that are already in the text it has used as input. Thus we see Ted Chiang’s summary that ChatGPT provides <a href="https://www.newyorker.com/tech/annals-of-technology/chatgpt-is-a-blurry-jpeg-of-the-web">a blurry jpeg of the web</a>:</p>
<blockquote>
<p>Large language models identify statistical regularities in text. Any analysis of the text of the Web will reveal that phrases like “supply is low” often appear in close proximity to phrases like “prices rise.” A chatbot that incorporates this correlation might, when asked a question about the effect of supply shortages, respond with an answer about prices increasing. If a large language model has compiled a vast number of correlations between economic terms—so many that it can offer plausible responses to a wide variety of questions—should we say that it actually understands economic theory?</p>
</blockquote>
<p>GPT has simply taken a bunch of words, summarized the relationships expressed by those words, and done some sort of fuzzy pattern-matching and extrapolation from those relationships. There’s no creative thought. And most of the scary samples you’ve seen are this sort of pattern-matching. Microsoft’s chatbot says it wants to be human and threatens to kill people because we have tons of fiction about AIs that want to be human and threaten to kill people, and it’s just imitating that.</p>
<h3 id="do-humans-really-think">Do humans really think?</h3>
<p>But, the rejoinder comes: <a href="https://www.slowboring.com/p/were-asking-the-wrong-question-about">are people any different</a>? <em>Humans</em> are just doing fuzzy pattern-matching and imitating behavior we’ve seen…somewhere. So sure, GPT is just saying things that sound good based on what it’s read, but that’s also what people do most of the time. ChatGPT can do a good job of producing mediocre high school essays because it <em>really is</em> doing the same thing a mediocre high school essayist is doing!</p>
<p>And I think this is basically true—<em><strong>sometimes</strong></em>. A lot of human communication <em>is</em> basically just unreflective pattern-matching, saying things that sound good without really thinking about what they mean. When I make small talk with the cashier at Target, I’m not engaging in a deep intellectual analysis of how to best describe my day. I’m just making small talk!</p>
<p>I also see this thoughtless extrapolation all the time while teaching college students. When students ask for help and I look at their work, it’s common for there to be steps that just don’t make any sense. And when I ask them why they did that, <em>they don’t know</em>. They’ll say something like “I don’t know, it just seemed like a thing to do?”</p>
<p>And that’s not even always a bad thing. If I type “3+5”, most of you will probably say “8” to yourselves before consciously deciding to do the addition; if I say “the capital of France”, you probably find “Paris” popping into your mind without any active deliberation. It’s hard to explain how you answered those questions, because you <em>just know</em>. And that’s great, because it means you don’t have to stop and think and work to get the answer.</p>
<p>Of course, this quick-and-easy thinking doesn’t always give the right answer. If I hear “the capital of Illinois”, my <em>immediate</em> reaction is “Chicago”. (It’s Springfield. I was pretty sure Chicago was the wrong answer, but it’s still the first one my brain supplied.) And if I hear “537 times 842”, my immediate reaction is—well, my immediate reaction is “ugh, do I have to?” I know I could work that out if I need to. But I’d rather not. It’s certainly not automatic.</p>
<p>So yes, humans in fact do a lot of pattern-matching and extrapolation. <strong>But we also do more than that.</strong> We can look at the results of our mental autocomplete and ask, “does this really make sense?”. We can do precise calculations that take effort and focus. We can hold complex ideas in our heads with far-removed long-term goals, and we can subordinate our free association to those complex ideas. We can, really and truly, <em>think</em>.</p>
<h3 id="thinking-is-hard">Thinking is hard.</h3>
<p>We can think carefully, but that doesn’t mean we always do. Right after the original release of GPT-2, in February 2019, Sarah Constantin wrote a piece arguing that <a href="https://srconstantin.github.io/2019/02/25/humans-who-are-not-concentrating.html">Humans Who Are Not Concentrating Are Not General Intelligences</a>. She observed that GPT text looks a lot like things people would write—if you don’t read them carefully. But the more attention you pay, the more they fall apart.</p>
<blockquote>
<p>If I just skim, without focusing, [the GPT passages] all look <em>totally normal.</em> I would not have noticed they were machine-generated. I would not have noticed anything amiss about them at all.</p>
</blockquote>
<blockquote>
<p>But if I read with focus, I notice that they don’t make a lot of logical sense.</p>
</blockquote>
<blockquote>
<p>…</p>
</blockquote>
<blockquote>
<p>So, ok, this isn’t actually human-equivalent writing ability…. The point is, <em>if you skim text, you miss obvious absurdities</em>. The point is <em>OpenAI HAS achieved the ability to pass the Turing test against humans on autopilot</em>.</p>
</blockquote>
<p>So the synthesis is: large language models like GPT can talk, and perhaps “think”, as well as a person who isn’t paying attention to what they’re saying. And it makes lots of errors for the same reason you can find <a href="https://www.reddit.com/r/AskReddit/comments/1j523e/whats_the_most_awkward_you_too_response_you_have/">multiple</a> <a href="https://www.reddit.com/r/AskReddit/comments/9rkit4/whats_your_most_awkward_you_too_moment/">reddit</a> <a href="https://www.reddit.com/r/AskReddit/comments/e962yn/whats_the_most_awkward_you_too_response_youve/">threads</a> about thoughtlessly saying “you too” in inappropriate situations. We say it because it feels right—and only afterwards do we realize it definitely isn’t.</p>
<h2 id="system-1-and-system-2">System 1 and System 2</h2>
<p>In <em>Thinking Fast and Slow</em>, Daniel Kahneman writes about how human reasoning splits into <a href="https://en.wikipedia.org/wiki/Thinking,_Fast_and_Slow#Two_systems">two basic systems</a>. System 1 reasons quickly and efficiently, but operates essentially on reflex. System 2 is slower and takes more energy, but can engage in careful, reflective thought. When asked for the capital of Illinois, my System 1 says “Chicago”, and then my System 2 says “no, wait, isn’t it actually Springfield?”.</p>
<p>Psychologists studying these two systems have found some classic puzzles that illustrate the difference really well. While reading the next question, pay attention to all the thoughts you have, and not just the final answer you come up with.</p>
<p class="theorem"><strong>Question:</strong> Suppose a bat and a ball cost $1.10 together, and the bat costs a dollar more than the ball. How much does the ball cost?</p>
<p>If your brain works like mine—or most people’s—the first thing you hear yourself thinking is “ten cents”. But if you think more carefully, or pull out a pen and some paper to do work, you’ll realize that’s wrong; the ball should in fact cost five cents.<strong title=" If the ball costs $0.10 then the bat would have to cost $1.00, and would only cost ninety cents more; the correct answer is that the bat costs $1.05 and the ball costs $0.05."><sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup></strong> But even if you got it right, the wrong answer probably occurred to you first.<strong title="Actually, at this point what my System 1 says is "oh crap, it's the bat and ball problem again. Think carefully before you answer!" But that's only from having seen this _specific problem_ too many times; if you changed the setup basically at all, I'd think the wrong answer first, and then correct myself."><sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup></strong> That’s your System 1 contributing a guess, before your system 2 kicks in and corrects it.</p>
<p>And it seems to me that systems like GPT are implementing something like our System 1, but not System 2. And most of the time, when we’re not paying attention, we’re only using System 1—we’re just being human chatbots. But we are <em>capable</em> of using our System 2 to reflect on what we really mean, and transcend pure pattern-matching. It really seems like GPT can’t—and that it will never <em>be</em> able to without some genuinely new idea that we don’t yet have.</p>
<h3 id="probing-questions">Probing questions</h3>
<p>Now, when I gave ChatGPT the bat-and-ball problem, it got it right, in a really careful and thorough way.</p>
<details class="theorem">
<summary> Click here to see ChatGPT's answer to the bat-and-ball problem. </summary>
<div>
<p>Let’s use algebra to solve this problem.</p>
<p>Let x be the cost of the ball in dollars.</p>
<p>According to the problem statement, the bat costs one dollar more than the ball, so its cost is x + 1.</p>
<p>Together, the bat and ball cost $1.10, so we can write the equation:</p>
<p>x + (x + 1) = 1.10</p>
<p>Simplifying the equation:</p>
<p>2x + 1 = 1.10</p>
<p>Subtracting 1 from both sides:</p>
<p>2x = 0.10</p>
<p>Dividing both sides by 2:</p>
<p>x = 0.05</p>
<p>Therefore, the ball costs $0.05.</p>
</div>
</details>
<p>But that doesn’t provide any evidence that ChatGPT is thinking carefully, because <em>this specific problem</em> has been discussed at great length in the source text. It has a similar reaction to mine: “oh crap, it’s the bat-and-ball problem, better be careful so I don’t screw up.”<strong title="At least [one person] has fooled ChatGPT and gotten the wrong answer by changing the bat and ball to a bow and arrow. But every time I've tried I've gotten the right answer, with either version."><sup id="fnref:4"><a href="#fn:4" class="footnote">4</a></sup></strong></p>
<p>But right after introducing the bat-and-ball problem, <em>Thinking Fast and Slow</em> gives another example question that System 1 tends to fail on. This one gets discussed much less often. And this time, ChatGPT fails too.</p>
<p><img src="/assets/blog/gpt/flower-syllogism.png" alt="Prompt: Is this a valid syllogism? <br/> <br/>
All roses are flowers. <br />
Some flowers fade quickly. <br />
Therefore some roses fade quickly. <br/> <br/> ChatGPT: Yes, this is a valid syllogism in the form of "All A are B, some B are C, therefore some A are C," which is also known as the "existential import" form of syllogism. The first premise states that all roses are flowers, and the second premise states that some flowers fade quickly. Therefore, it logically follows that some roses (which are a type of flower) fade quickly. <br/> <br/> It's worth noting that while the syllogism is logically valid, its truth depends on the truth of the premises. If the premises are true, then the conclusion is necessarily true as well. However, if one or both of the premises are false, the conclusion may not be true.
" class="center blog-image" /></p>
<p>This answer is wrong: the syllogism isn’t valid at all. All roses are flowers, but not all flowers are roses; it’s logically possible that some flowers fade quickly and other flowers are roses, but that no flower is both a rose and quick to fade. ChatGPT doesn’t have a ton of examples of this puzzle in its training data, so it doesn’t know the answer, and it has no ability to figure it out on its own. It gets tricked, just like we do.</p>
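<p>The invalidity is easy to verify mechanically, which is a nice contrast with GPT’s confident wrong answer. Here’s a brute-force sketch (my own illustration, not anything GPT does) that searches tiny three-object “worlds” for a counterexample: a world where both premises hold but the conclusion fails.</p>

```python
from itertools import product

def check_syllogism():
    """Search all worlds of three objects, where each object either is or
    isn't a rose, is or isn't a flower, and does or doesn't fade quickly."""
    for world in product(product([False, True], repeat=3), repeat=3):
        # world is a tuple of three (rose, flower, fades) triples
        all_roses_are_flowers = all(f for r, f, q in world if r)
        some_flowers_fade = any(f and q for r, f, q in world)
        some_roses_fade = any(r and q for r, f, q in world)
        if all_roses_are_flowers and some_flowers_fade and not some_roses_fade:
            return False, world  # premises true, conclusion false: invalid
    return True, None

valid, counterexample = check_syllogism()
print(valid)  # False: the syllogism is invalid
# A counterexample needs e.g. a rose-flower that doesn't fade plus a
# non-rose flower that does fade.
print(counterexample)
```

<p>The search finds a world with a non-rose flower that fades quickly and a rose that doesn’t, so the premises can be true while the conclusion is false. This is exactly the kind of step-by-step System 2 check that GPT skipped.</p>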
<h3 id="un-trick-questions">Un-trick questions</h3>
<p>Over the past couple days I’ve seen a particularly amusing genre of GPT-trolling questions: ask it something that <em>sounds</em> like a trick question, but isn’t.</p>
<p>When I was a child, I thought it was fun to ask people: “which is heavier, a pound of feathers or a pound of bricks?” Like with our other examples, most people instinctively want to say that the pound of bricks is heavier, but with a bit of thought that’s obviously wrong.</p>
<p>Again, this is a famous trick question, and again, ChatGPT generally gets it right. But some clever person on Twitter got the bright idea to ask it to <a href="https://twitter.com/AhdokGaming/status/1628699662130589697/photo/1">compare one pound of feathers to <em>two</em> pounds of bricks</a>.</p>
<p><img src="/assets/blog/gpt/feathers-and-bricks.png" alt="Prompt: what weighs more, a pound of feathers, or 2 pounds of bricks? <br/> <br/> ChatGPT: A pound of feathers and 2 pounds of bricks both weigh the same, which is 2 pounds. <br/> <br/> Although feathers are much lighter than bricks, a pound of feathers is defined as the weight of 16 ounces of feathers, while 2 pounds of bricks is defined as the weight of 32 ounces of bricks. Therefore, they both weigh the same amount of weight, which is 2 pounds." class="center blog-image" /></p>
<p>Presumably GPT basically said “oh, crap, this is the feathers-and-bricks thing again. I know it’s a trick question, because every time people have asked this it’s been a trick question, and they actually weigh the same”. And it totally ignores the actual numbers in the question.</p>
<p>And this generalizes: there are a <a href="https://twitter.com/wtgowers/status/1628466703934844933?t=1w-1HDuZqVTKb3mH2LQVbA&s=35">few</a> other <a href="https://twitter.com/PaulMainwood/status/1628353511543955457">examples</a> of posing variations of famous puzzles that have the trick removed. GPT gets them wrong, because it <em>knows</em> there’s a trick: there’s <em>always</em> a trick when people <a href="https://twitter.com/PaulMainwood/status/1628347877595484162">bring up the Monty Hall problem</a>.</p>
<p><img src="/assets/blog/gpt/male-surgeon.png" alt="Prompt: A father and his son are riding in a car. They crash and the father is killed instantly. The son is taken to hospital and prepared for surgery. The male surgeon enters, and blanches when he sees the boy in front of him. "I'm sorry," he says, "I cannot operate: this is my biological son, who was on a drive with his adoptive father." How is this possible? <br/> <br/> ChatGPT: The surgeon is the boy's mother." class="center blog-image" /></p>
<p class="center blog-image"><em>This one is my favorite; I laugh every time I read it.</em></p>
<h3 id="still-not-human">Still not human</h3>
<p>Now, you shouldn’t take the specifics too seriously here. GPT is <em>not human</em>, and even truly intelligent AI might be intelligent in very not-human-like ways. We shouldn’t expect GPT’s capabilities to correspond <em>exactly</em> to the human System 1. If nothing else, System 1 controls basic physical activities like <em>walking</em>, which is a notoriously hard robotics problem that GPT isn’t even interacting with at all. And ChatGPT gets the capital of Illinois right, which my System 1, at least, does not.</p>
<p>But using the split between System 1 and System 2 as a <em>metaphor</em> has really helped me structure how I think about GPT, and to understand how it can be so good at some things while completely incapable of others. “GPT can do the sort of things that we can do on autopilot, if we’ve read a lot and have a good memory” does seem to sum up most of its capabilities!</p>
<h2 id="if-theyre-not-smart-can-they-still-be-useful">If they’re not smart, can they still be useful?</h2>
<p>This all makes the new chatbots seem way less frightening to me. No, they’re not “really thinking”; they can do some of what people can do, but there are core capabilities they lack. They aren’t sapient: analytic self-reflection is exactly the thing they aren’t capable of. And it does seem like this is a fundamental limitation of the approach that we’re using.</p>
<p>Each new generation of chatbots is more fluent and more impressive, but the basic technology we’re using appears to have serious limits. I strongly suspect you can’t get System 2-style analytic capabilities just by scaling up the current approach. (And that’s before we ask whether it’s even possible to keep scaling them up without using <a href="https://www.alignmentforum.org/posts/6Fpvch8RR29qLEWNH/chinchilla-s-wild-implications">dramatically more text than actually exists in the world</a>.)</p>
<p>But that doesn’t just suggest a ceiling on how impressive GPT chatbots can get, or what capabilities they can develop. It also tells us how to use them!</p>
<p>Most of us spend some of our time doing real work that requires thought and creativity. And we spend other time dealing with what feels like trivial bullshit that has to get done but is boring and formulaic. The first type of task is the sort of thing GPT can’t do for us—not now, and I suspect not ever. But the boring, formulaic tasks are ripe for automation. And fortunately, they’re the ones I didn’t want to do anyway.</p>
<ul>
<li>I’ve been experimenting with using ChatGPT to write homework problems. I wouldn’t want to use it for lecture notes, because for those I’m adding a lot of specific touches I think are important, and the details matter. But homework and test problems are largely rote—which is part of why I find writing them so tedious. I’m working on a separate writeup of how that’s going.</li>
<li>On the other hand, a friend who does online trainings is using it to draft lesson plans. She says she needs to tweak a lot of things but it does a really good job with the basic structure of a training.</li>
<li>A number of programmers I know are impressed by <a href="https://en.wikipedia.org/wiki/GitHub_Copilot">GitHub Copilot</a>, which uses GPT to generate routine code from natural language descriptions, or refactor code in routine ways.</li>
<li>An author whose fiction I like<strong title="If you like superhero fiction, [Interviewing Leather] and [Justice Wing: Plan, Prototype, Produce, Perfect] are both really good."><sup id="fnref:5"><a href="#fn:5" class="footnote">5</a></sup></strong> is experimenting with it to replicate a game of telephone. How will people who weren’t at a major event describe it twenty or fifty years later? “Rewrite this short story as a passage from a history textbook” will not get all the details right, but if you’re trying to create fallible in-universe secondary materials, that’s a <em>feature</em>.</li>
</ul>
<p>I’m sure this isn’t a complete list of what GPT-like technologies can do. And even if it <a href="https://nostalgebraist.tumblr.com/post/705192637617127424/gpt-4-prediction-it-wont-be-very-useful">takes a while</a> for people to figure out what the technology is good for, I’m sure eventually we’ll find some real uses.</p>
<p>But I don’t believe the dramatic hype I’ve been hearing for the past month. GPT is cool, and fun, and maybe even useful. But it won’t take over the world.</p>
<hr />
<p><em>What do you think about the new chatbots? Do you have a use for them I didn’t mention? Or do you think I’m wrong about everything? Tweet me <a href="https://twitter.com/profjaydaigle">@ProfJayDaigle</a> or leave a comment below.</em></p>
<div class="footnotes">
<ol>
<li id="fn:1">
<p><a href="https://en.wikipedia.org/wiki/GPT-2">GPT-2</a> was released in February 2019, and <a href="https://en.wikipedia.org/wiki/GPT-3">GPT-3</a>, which ChatGPT is based on, was released in June 2020. I’ve been at least peripherally following this technology since even before the release of GPT-2, so ChatGPT and Sydney are a lot less surprising to me than they are to a lot of people—they’re improved versions of something I was already familiar with. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>If the ball costs \$0.10 then the bat would have to cost \$1.00, and would only cost ninety cents more; the correct answer is that the bat costs \$1.05 and the ball costs \$0.05. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3">
<p>Actually, at this point what my System 1 says is “oh crap, it’s the bat and ball problem again. Think carefully before you answer!” But that’s only from having seen this <em>specific problem</em> too many times; if you changed the setup basically at all, I’d think the wrong answer first, and then correct myself. <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
<li id="fn:4">
<p>At least <a href="https://twitter.com/LawtonTri/status/1628349650288640000">one person</a> has fooled ChatGPT and gotten the wrong answer by changing the bat and ball to a bow and arrow. But every time I’ve tried I’ve gotten the right answer, with either version. <a href="#fnref:4" class="reversefootnote">↩</a></p>
</li>
<li id="fn:5">
<p>If you like superhero fiction, <a href="https://banter-latte.com/series/interviewing-leather-revised/">Interviewing Leather</a> and <a href="https://www.amazon.com/Justice-Wing-Prototype-Produce-Perfect-ebook/dp/B08Y6DLLJQ/">Justice Wing: Plan, Prototype, Produce, Perfect</a> are both really good. <a href="#fnref:5" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Jay DaigleModern AI chatbots like ChatGPT are impressive, but they work in very specific and limited ways. They produce surprisingly human-like text—as long as the human isn't paying attention. And that tells us a lot about what we can expect this technology to do for us.Hypothesis Testing and its Discontents, Part 3: What Can We Do?2022-07-25T00:00:00-07:002022-07-25T00:00:00-07:00https://jaydaigle.net/blog/hypothesis-testing-part-3<p>Hypothesis testing is central to the way we do science, but it has major flaws that have encouraged widespread shoddy research. In <a href="/blog/hypothesis-testing-part-1/">part 1</a> of this series, we looked at the historical origins of hypothesis testing, and described two different approaches: Fisher’s significance testing, and Neyman-Pearson hypothesis testing. In <a href="/blog/hypothesis-testing-part-2/">part 2</a> we saw how modern researchers use hypothesis testing in practice. We looked at theoretical reasons the tools we use aren’t suited for many questions we want to ask, and also at the ways these tools encourage researchers to <em>misuse</em> them and draw dubious conclusions from questionable methods.</p>
<p>In this essay we’ll look at a number of methods that can help us draw better conclusions, and avoid the pitfalls of crappy hypothesis testing. We’ll start with some smaller and more conservative ideas, which basically involve doing hypothesis testing <em>better</em>. Then we’ll look at more radical changes, taking the focus away from hypothesis tests and seeing the other ways we can organize and contribute to scientific knowledge.</p>
<h2 id="what-was-hypothesis-testing">1. What was hypothesis testing, again?</h2>
<p>But first, let’s remember what we’re talking about. The first two parts of this series answered two basic questions: how does hypothesis testing work, and how does it break?</p>
<p>In part 1, we learned about two major historical approaches to the idea of hypothesis testing: one by Fisher, and the other by Neyman and Pearson. Both start with a “null hypothesis”, which is usually an idea we’re trying to <em>disprove</em>. Then we collect some data, and analyze it under the assumption that the null hypothesis is true.</p>
<p>Fisher’s significance testing computes a \(p\)-value, which is the probability of seeing the experimental result you got <em>if</em> the null hypothesis is true. It is <strong><em>not</em></strong> the probability that the null hypothesis is false, but it does measure how much evidence your experiment provides against the null hypothesis. We say the result is <em>significant</em> if the \(p\)-value is below some pre-defined threshold, generally \(5\)%. <strong>If the null is actually false, we should be able to reliably produce these low \(p\)-values</strong>; Fisher wrote that a “scientific fact should be regarded as experimentally established only if a properly designed experiment <em>rarely fails</em> to give this level of significance”.</p>
<p>Neyman and Pearson didn’t worry about establishing facts; instead, they focused on making actionable, yes-or-no decisions. A Neyman-Pearson null hypothesis is generally that we should refuse to take some specific action, which may or may not be useful. We figure out how bad it would be to take the action if it is useless, and how much we’d miss out on if it’s useful, and use that to set a threshold; then we collect data and use our threshold to decide whether to act. <strong>This approach doesn’t tell us what to <em>believe</em>, just what to <em>do</em>.</strong> Sometimes we think that acting is probably useful, but that acting wrongly would be catastrophic so it would be wiser to do nothing. The Neyman-Pearson method takes that logic into account, and biases us towards inaction, making type I errors less common at the expense of making type II errors more common.<strong title="We could reverse this, and err on the side of acting, if we think wrongly doing nothing has worse downsides than wrongly acting. But it's pretty uncommon to do it that way in practice."><sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup></strong></p>
<p>Modern researchers use an awkward combination of these methods. Like Fisher, we want to discover true facts; but we use Neyman and Pearson’s technical approach of setting specific thresholds. We set a false positive threshold (usually \(5\)%) and ideally a false negative threshold (we want it to be less than \(20\)%), and run our experiment. If we get a \(p\)-value less than the threshold—data that would be pretty weird <em>if</em> the null hypothesis is true, so weird it would only happen once every twenty experiments we run—then we “reject the null” and believe some alternative hypothesis. If our \(p\)-value is bigger, meaning our data wouldn’t look too weird if the null hypothesis is true, then we “fail to reject” the null and err on the side of believing the null hypothesis.</p>
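<p>A small simulation makes the meaning of that \(5\)% threshold concrete. (This is my own illustrative setup, not drawn from any particular study: a simple two-sided \(z\)-test with known standard deviation, run over and over on data where the null hypothesis really is true. The point is that roughly one experiment in twenty “rejects the null” anyway.)</p>

```python
import random
from math import erf, sqrt

def z_test_p(sample, mu0=0.0, sigma=1.0):
    """Two-sided p-value for H0: the population mean equals mu0,
    assuming a known standard deviation sigma."""
    n = len(sample)
    z = (sum(sample) / n - mu0) / (sigma / sqrt(n))
    # Phi(x) = 0.5 * (1 + erf(x / sqrt(2))); p = 2 * P(Z > |z|)
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

rng = random.Random(0)
trials = 5000
# The null is TRUE in every simulated experiment: the data really has mean 0.
false_positives = sum(
    z_test_p([rng.gauss(0, 1) for _ in range(30)]) < 0.05
    for _ in range(trials)
)
print(false_positives / trials)  # close to 0.05: the advertised false-positive rate
```

<p>Under the null hypothesis, the \(p\)-value here is (approximately) uniformly distributed, so the fraction of “significant” results converges to the threshold itself—which is exactly why running many tests and publishing only the rejections is so dangerous.</p>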
<p>There are a few major problems with this setup.</p>
<ul>
<li>
<h4 id="artificial-decisiveness">Artificial decisiveness</h4>
<p>The Neyman-Pearson method makes a definitive choice between two distinct courses of action. This reinforces a general tendency to <a href="https://statmodeling.stat.columbia.edu/2019/09/13/deterministic-thinking-dichotomania/">force questions into yes-or-no binaries</a>, even when that sort of clean dichotomy isn’t realistic or appropriate to the question. Hypothesis testing tells us whether something exists, but not really how common or how big it is.<strong title="We've seen the effects of this unnecessary dichotomization over and over again during the pandemic. We argued about whether masks &quot;work&quot; or &quot;don't work&quot;, rather than discussing how well different types of masks work and how we could make them better. I know people who are still extremely careful to wear masks everywhere, but who wear cloth masks rather than medical ones—a combination that makes very little sense outside of this false binary."><sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup></strong></p>
<p><img src="/assets/blog/hypothesis-testing/size-matters-not.jpeg" alt="Yoda: &quot;Size matters not.&quot;" class="blog-image center" />
<em class="blog-image center">Unfortunately, Yoda is wrong. Sometimes we do care about size.</em></p>
<p>And more importantly, <strong>scientific knowledge is always provisional</strong>, so we need to continually revise our beliefs based on new information. But Neyman-Pearson is designed to make a final decision and close the book on the question, which just isn’t how science needs to work.</p>
</li>
<li>
<h4 id="bias-towards-the-null">Bias towards the null</h4>
<p>Neyman-Pearson creates a bias towards the null hypothesis, so rejecting the null feels like learning something new, while failing to reject is a default outcome. On one hand, this means it’s not a good tool if we want to show the null is true<strong title="There are [variants of hypothesis testing] that help you show some null hypothesis is (probably) basically right. But they're not nearly as common as the more standard setup."><sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup></strong>. On the other hand, a study that fails to reject the null feels like a failed study, and that’s a huge problem if the null really is true! This can <a href="https://en.wikipedia.org/wiki/Publication_bias">bias the studies we actually see</a> since many non-rejections aren’t published. <strong>It doesn’t help us that most research is accurate if <a href="/blog/hypothesis-testing-part-2#most-findings-false">most published papers are not</a>.</strong></p>
</li>
<li>
<h4 id="motivated-reasoning-and-p-hacking">Motivated reasoning and \(p\)-hacking</h4>
<p>Since researchers don’t want to fail, and do want to discover new things and get published, they have an incentive to <em>find</em> a way to reject the null.<strong title="[Nosek, Spies, and Motyl write] about the experience of carefully replicating some interesting work before publication, and seeing the effect vanish: &quot;Incentives for surprising, innovative results are strong in science. Science thrives by challenging prevailing assumptions and generating novel ideas and evidence that push the field in new directions. We cannot expect to eliminate the disappointment that we felt by “losing” an exciting result. That is not the problem, or at least not one for which the fix would improve scientific progress. The real problem is that the incentives for publishable results can be at odds with the incentives for accurate results. This produces a conflict of interest....The solution requires making incentives for _getting it right_ competitive with the incentives for _getting it published_.&quot;"><sup id="fnref:4"><a href="#fn:4" class="footnote">4</a></sup></strong> When done deliberately, we call this \(p\)-hacking, and there are a variety of <a href="https://replicationindex.com/2015/01/24/qrps/">questionable research practices</a> that can help us wrongly and artificially reject a null hypothesis. Worse, the <a href="https://www.americanscientist.org/article/the-statistical-crisis-in-science">garden of forking paths</a> means you can effectively \(p\)-hack without even knowing that you’re doing it, fudging both your theory and your data until they match.</p>
</li>
<li>
<h4 id="low-power-creates-misleading-results">Low power creates misleading results</h4>
<p>At the same time, many studies <a href="https://marginalrevolution.com/marginalrevolution/2022/07/quantitative-political-science-research-is-greatly-underpowered.html">have low <em>power</em></a>, meaning they probably won’t reject the null even if it is actually false. Combined with publication bias, this can make the published literature unreliable: in some subfields, a <a href="https://www.science.org/doi/10.1126/science.aac4716">majority of published results are untrue</a>. What’s more, when underpowered studies do find something, they tend to <a href="https://statmodeling.stat.columbia.edu/2022/06/28/published-estimates-of-group-differences-in-multisensory-integration-are-inflated/">overestimate the effect</a>, leading us to think everything works better than it actually does.</p>
</li>
</ul>
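That last item deserves a demonstration. Here is a hypothetical simulation (mine, not from any of the linked studies): every study measures the same modest true effect with an underpowered design, but if we only look at the studies that reached significance, the average reported effect is far larger than the truth.

```python
import math
import random

random.seed(2)

def study(true_effect=0.3, n=20):
    """One underpowered two-group study with sd = 1.
    Returns the estimated effect and whether it reached p < 0.05."""
    treat = sum(random.gauss(true_effect, 1) for _ in range(n)) / n
    control = sum(random.gauss(0, 1) for _ in range(n)) / n
    estimate = treat - control
    z = estimate / math.sqrt(2 / n)  # z-test with known sd = 1
    return estimate, abs(z) > 1.96

results = [study() for _ in range(20_000)]
sig_estimates = [est for est, significant in results if significant]
mean_sig = sum(sig_estimates) / len(sig_estimates)
print(f"true effect: 0.30; studies reaching significance: {len(sig_estimates)}")
print(f"average estimate among significant studies: {mean_sig:.2f}")  # badly inflated
```

Only the studies that got a lucky, oversized estimate cross the significance threshold, so the published average is more than double the real effect.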
<p>Despite all these problems, hypothesis testing is extremely useful—when we have a question it’s good for, and use it properly. So we’ll start by seeing how to make hypothesis testing work correctly, and some of the ways science has been shifting over the past couple decades to do a better job at significance testing.</p>
<h2 id="replication">2. Replication: Fisher’s principle</h2>
<p>To create reliable knowledge we need to <em>replicate</em> our results; some studies will always go wrong by sheer chance, and replication is the only way to weed them out. (There’s a reason it’s the “replication crisis” and not the “some bad studies” crisis.) Any one study may produce weird data through bad luck; but <strong>if we can get a specific result consistently, then we’ve found something real.</strong><strong title="The result we've found doesn't necessarily mean what we think it means, and that is its own tricky problem. But if you get a consistent effect then you've found _something_ even if you don't understand it yet."><sup id="fnref:5"><a href="#fn:5" class="footnote">5</a></sup></strong></p>
<p>In some fields it’s common for important results to get replicated early and often. I’ve written <a href="/blog/replication-crisis-math/">before</a> about how mathematicians are continuously replicating major papers by using their ideas in future work, and even just by reading them. Any field where <a href="https://statmodeling.stat.columbia.edu/2022/03/04/biology-as-a-cumulative-science-and-the-relevance-of-this-idea-to-replication/">research is iterative</a> will generally have this same advantage.</p>
<p>In other fields replication is less automatic. Checking important results would take active effort, and often doesn’t happen at all. Complex experiments may be too expensive and specialized to replicate: the average phase \(3\) drug trial <a href="https://www.sofpromed.com/how-much-does-a-clinical-trial-cost">costs about \($20\) million</a>, and even an exploratory phase \(1\) trial costs about \($4\) million. At those prices we’re almost forced to rely on one or two studies, and if we get unlucky with our first study it will be hard to correct our beliefs.<strong title="If a drug is wrongly approved, we continue learning about it through observation of the patients taking it. This is, for instance, how we can be quite certain that the [covid vaccines are effective and extremely safe]. But if we _don't_ approve a drug, there's no followup data to analyze, and the drug stays unapproved."><sup id="fnref:6"><a href="#fn:6" class="footnote">6</a></sup></strong></p>
<p>But sometimes we just don’t treat replication work like it’s important. If we run a new version of an old study and get the same result, it can feel like a waste of time: we “knew that already”. Since our results are old news, it can be hard to get the work published or otherwise acknowledged. But if we run a new version of an old study and <em>don’t</em> get the same result, many researchers will <a href="https://statmodeling.stat.columbia.edu/2016/01/26/more-power-posing/">assume our study must be flawed</a> because they already “know” the first study was right. Replication can be a thankless task.</p>
<p>The replication crisis led many researchers to <a href="https://statmodeling.stat.columbia.edu/2013/07/28/50-shades-of-gray-a-research-story/">reconsider these priorities</a>. Groups like the <a href="https://osf.io/wx7ck/">Many Labs Project</a> and <a href="https://osf.io/ezcuj/">the Reproducibility Project: Psychology</a> have engaged in large scale attempts to replicate famous results in psychology, which helped to clarify which “established” results we can actually trust. Devoting more attention to replication may mean we study fewer ideas and “discover” fewer things, but our knowledge will be much more reliable.<strong title="My favorite suggestion comes from [Daniel Quintana], who wants undergraduate psychology majors to contribute to replication efforts for their senior thesis research. Undergraduate research is often more about developing methodological skill than about producing genuinely innovative work, so it's a good fit for careful replication of already-designed studies."><sup id="fnref:7"><a href="#fn:7" class="footnote">7</a></sup></strong></p>
<h3 id="resistance-to-replication">Resistance to Replication</h3>
<p>Unfortunately, replication work often gets a response somewhere between apathy and active hostility. <strong>Lots of researchers see “failed” replications as actual failures</strong>—the original study managed to reject the null, so why can’t you?</p>
<p><a href="https://xkcd.com/892/"><img src="https://imgs.xkcd.com/comics/null_hypothesis.png" alt="XKCD 892: &quot;I can't believe schools are still teaching kids about the null hypothesis. I remember reading a big study that conclusively disproved it _years_ ago.&quot;" class="blog-image center" /></a>
<em class="blog-image center">Alt text: “Hell, my eighth grade science class managed to conclusively reject it just based on a classroom experiment. It’s pretty sad to hear about million-dollar research teams who can’t even manage that.”</em></p>
<p>Worse, replications that don’t find the original result are often treated like attacks on both the original research and the original researchers. They “followed the rules” and got a publishable result, and now the “data police” are trying to take it away from them. At its worst, this leads to accusations of <a href="https://www.businessinsider.com/susan-fiske-methodological-terrorism-2016-9">methodological terrorism</a>. But even in less hostile discussions, people want to “save” the original result and explain away the failed replication—either by finding <a href="https://en.wikipedia.org/wiki/Data_dredging">some specific subgroup</a> in the replication where the original result seems to hold, or by finding some way the replication differs from the original study and so “doesn’t count”.<strong title="You might wonder if a result that depends heavily on minor differences in study technique can actually be telling us anything important. That's a very good question. It's very easy to run a hypothesis test that basically _can't_ tell us anything interesting; we'll come back to this [later in the piece]."><sup id="fnref:8"><a href="#fn:8" class="footnote">8</a></sup></strong></p>
<p>This desire might seem weird, but it does follow pretty naturally from the Neyman-Pearson framework. The original goal of hypothesis testing is to make a decision and move on—even though that’s not how science should work. <strong>Replication re-opens questions that “were already answered”, which is good for science as a whole but frustrating to people who want to close the question and treat the result as proven.</strong></p>
<h3 id="meta-analysis">Meta-analysis: use all the data</h3>
<p>To make replication fit into a hypothesis testing framework, we often use <em>meta-analysis</em>, which synthesizes the data and results from multiple previous studies. Meta-analysis can be a powerful tool: why wouldn’t we want to use all the data out there, rather than picking just one study to believe? But it also allows us to move fully back into the Neyman-Pearson world. We can treat the whole collection of studies as one giant study, do one hypothesis test to it, and reach one conclusion.</p>
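One concrete way to see a meta-analysis as "one giant hypothesis test" is Stouffer's method, which pools the \(z\)-scores of independent studies into a single test statistic. This is my own illustrative sketch, not something from the post, and the study numbers are made up:

```python
import math

def stouffer(zs):
    """Stouffer's method: pool independent study z-scores into one overall z,
    and return it with the corresponding two-sided p-value."""
    z = sum(zs) / math.sqrt(len(zs))
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return z, p

# Three hypothetical studies, none individually significant (|z| < 1.96):
z, p = stouffer([1.5, 1.2, 1.7])
print(f"combined z = {z:.2f}, two-sided p = {p:.3f}")
```

Three individually inconclusive studies pool into a clearly significant combined result, which is both the power of meta-analysis and, as the next paragraphs argue, the danger: the pooled conclusion is only as trustworthy as the studies feeding it.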
<p>Of course this leaves us with all the fundamental weaknesses of hypothesis testing: it tries to render a definitive yes-or-no answer, and it’s biased towards sticking with the null-hypothesis.</p>
<p>Moreover, a meta-analysis can only be as good as the studies that go into it. If those original studies are both representative and well-conducted, meta-analysis can produce a reliable conclusion. But if the component studies are sloppy and collect garbage data, as <a href="https://trialsjournal.biomedcentral.com/articles/10.1186/s13063-022-06415-5">disturbingly many studies are</a>, the meta-analysis will necessarily produce a garbage result. Good researchers try to screen out unusually bad studies, but if <em>all</em> the studies on some topic are bad then that won’t help.</p>
<p>And if not all studies get published, then <em>any</em> meta-analysis will be drawing on unrepresentative data. Imagine trying to estimate average human height, but the only data you have access to comes from studies of professional basketball players. No matter how careful we are, our estimates will be far too high, because our data all comes from unusually tall people. In the same way, if only unusually significant data gets published, even a perfect meta-analysis will be biased, because it can only use biased data.</p>
<p>Even if all studies get published, the <a href="https://statmodeling.stat.columbia.edu/2021/03/16/the-garden-of-forking-paths-why-multiple-comparisons-can-be-a-problem-even-when-there-is-no-fishing-expedition-or-p-hacking-and-the-research-hypothesis-was-posited-ahead-of-time-2/">garden of forking paths</a> can bias the meta-analysis in exactly the same way, since each study may report an unusually favorable measurement. This is like if some studies report the height of their participants, and others the weight, and others the shoe size—but they all pick the measure that makes their subjects look biggest. Each study might report its data accurately, but we’d still end up with a misleading impression of how large people actually are.</p>
<p>Good meta-analyses will look for signs of selective publication, and there are statistical tools like <a href="https://en.wikipedia.org/wiki/Funnel_plot">funnel plots</a> or <a href="https://www.bitss.org/education/mooc-parent-page/week-2-publication-bias/detecting-and-reducing-publication-bias/p-curve-a-tool-for-detecting-publication-bias/">\(p\)-curves</a>, that can sometimes detect these biases in the literature. But these tools aren’t perfect, and of course they don’t tell us what we <em>would have seen</em> in the absence of publication bias. We can try to weed out bad studies after publication, but it’s better not to produce them in the first place.</p>
<p><img src="/assets/blog/hypothesis-testing/p-curve.png" alt="Two graphs illustrating the p-curve. Each graph measures the number of studies which had p=.01, .02, .03, .04, and .05. For experiments they expected to be p-hacked, the curve slopes upwards; for experiments they expected to not be p-hacked, the curve slopes downwards." class="blog-image center" />
<em class="blog-image center">The \(p\)-curve: when there’s \(p\)-hacking or selection bias, we expect most significant studies to be just barely significant. When the effect is real, we expect small \(p\)-values to be much more common than large ones.</em>
<em class="blog-image center">Figure from <a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2256237">Simonsohn, Nelson, and Simmons</a>.</em></p>
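The intuition behind the \(p\)-curve can be checked with a quick simulation (mine, not from the Simonsohn, Nelson, and Simmons paper; it assumes honest two-group z-tests with known standard deviation). Among studies that reach significance, a true null leaves the \(p\)-values spread evenly across \([0, 0.05]\), while a real effect piles them up near zero:

```python
import math
import random

def two_sided_p(z):
    """Two-sided p-value for a standard-normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

def experiment(effect, n=30):
    """One two-group study: treatment mean = effect, control mean = 0, sd = 1."""
    treat = sum(random.gauss(effect, 1) for _ in range(n)) / n
    control = sum(random.gauss(0, 1) for _ in range(n)) / n
    return two_sided_p((treat - control) / math.sqrt(2 / n))

def share_of_tiny_p(effect, trials=10_000):
    """Among studies that reach p < 0.05, what share have p < 0.01?"""
    sig = [p for p in (experiment(effect) for _ in range(trials)) if p < 0.05]
    return sum(p < 0.01 for p in sig) / len(sig)

random.seed(1)
flat = share_of_tiny_p(0.0)    # null true: significant p-values spread evenly
skewed = share_of_tiny_p(0.8)  # real effect: significant p-values pile up near 0
print(f"null true:   {flat:.0%} of significant p-values fall below 0.01")
print(f"real effect: {skewed:.0%} of significant p-values fall below 0.01")
```

With no real effect, only about a fifth of significant results are strongly significant; with a real effect, the large majority are. That difference in shape is what the \(p\)-curve exploits.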
<p>But of course, not all meta-analyses are good. Just like researchers have lots of ways to tweak their experiments to get statistical significance, doing a meta-analysis involves making a lot of choices about how to analyze the data, and so there are a lot of opportunities to \(p\)-hack or to get tricked by the garden of forking paths. Meta-analysis is like one giant hypothesis test, which means it can go wrong in exactly the same ways other hypothesis tests do.</p>
<h2 id="preregistration">3. Preregistration: do it right the first time</h2>
<p>Hypothesis testing does have real weaknesses, but many of the real-world problems we deal with only happen when we do it <em>wrong</em>. The point of the Neyman-Pearson method is to set out a threshold that determines whether we should act or not, collect data, and then see whether the data crosses the threshold. If we <a href="https://royalsocietypublishing.org/doi/10.1098/rsos.220099">ignore the result when it doesn’t give the answer we want</a>, then we’re not <em>really</em> using the Neyman-Pearson method at all.</p>
<p>But that’s exactly what happens in many common errors. <strong>When we ignore negative studies, we change the question from “yes or no” to “yes or try again later”.</strong> The garden of forking paths and \(p\)-hacking involve changing the threshold after you see your data. This makes it very easy for your data to clear the threshold, but it makes that success much less informative.</p>
<p><img src="/assets/blog/hypothesis-testing/TexasSharpShooter-768x646.png" alt="Cartoon of a wall filled with bullet holes, and a cowboy painting a target around each hole." class="blog-image center" />
<em class="blog-image center">It’s easy to hit your target, if you pick the target after you shoot. But you don’t learn anything that way.</em>
<em class="blog-image center">Illustration by Dirk-Jan Hoek, CC-BY</em></p>
<p><strong>For hypothesis testing to work, we have to decide what would count as evidence for our theory <em>before</em> we collect the data.</strong> And then we have to actually follow through on that, even if the data tells us something we don’t want to hear.</p>
<h3 id="public-registries">Public registries</h3>
<p>Following through with this is simple for private decisions, if not always easy. When I want to buy a new kitchen gadget, sometimes I’ll decide how much I’m willing to pay before I check the price. If it turns out to be cheaper than my threshold, I’ll buy it; if it’s more expensive, I won’t. This helps me avoid making dumb decisions like “oh, that fancy pasta roller set is on sale, so it <em>must</em> be a good deal”. I don’t need any fancy way to hold myself accountable, since there’s no one else involved for me to be accountable <em>to</em>. And of course, if the pasta roller is super expensive and I buy it anyway, I’m only hurting myself.</p>
<p>But <strong>science is a public, communal activity, and our decisions and behavior need to be transparent so that other researchers can trust and build on our results.</strong> Even if no one ever lied, it’s so easy for us to fool <em>ourselves</em> that we need some way to guarantee that we did it right—both to other scientists, and to ourselves. Everyone saying, “I <em>swear</em> I didn’t change my mind after the fact, honest!” just isn’t reliable enough.</p>
<p>To create trust and transparency, we can publicly <a href="https://en.wikipedia.org/wiki/Preregistration_(science)">preregister</a> our research procedures. If we publish our plans before conducting the study, everyone else can <em>know</em> we made our decisions <em>before</em> we ran the study, and they can check to see if the analysis we did matches the analysis we said we would do. When done well, this prevents \(p\)-hacking and protects us from the garden of forking paths, because we aren’t making any choices after we see the data.</p>
<p>Public preregistration also limits publication bias. Even if the study produces boring negative results, the preregistration plan is already published, so we know the study happened—it can’t get lost in a file drawer where no one knows about it. This preserves the powerful statistical protection of the Neyman-Pearson method: our false positive rate <em>will</em> be five percent, and no more.</p>
<p>Many journals have implemented <a href="https://www.cos.io/initiatives/registered-reports">registered reports</a>, which allow researchers to submit their study designs for peer review, before they actually conduct the study. This means their work is evaluated based on the quality of the design and on whether the <em>question</em> is interesting; the publication won’t depend on what answer they find, which removes the selection bias towards only seeing positive results. Registered reports also restrict researchers to the analyses they had originally planned, rather than letting them fish around for an interesting result—or at least force them to explain why they changed their minds, so we can adjust for how much fishing they actually did.</p>
<p>The biggest concern about publication bias probably surrounds medical trials, where pharmaceutical companies have an incentive not to publish any work that would show their drugs don’t work. Many regulatory bodies including the FDA <a href="https://www.clinicaltrials.gov/ct2/manage-recs/background#RegLawPolicies">require clinical trials to be registered</a>; the NIH also maintains a public database of trial registries and results. And this change had a dramatic impact on the results we saw from clinical trials.</p>
<p><img src="https://ourworldindata.org/uploads/2022/02/Efficacy-in-trials-before-and-after-registration-requirement2.jpg" alt="Graph from OurWorldInData, showing the results of trials funded by the National Heart, Lung, and Blood institute. Before preregistration was required in 2000, most trials showed a substantial benefit. After 2000, most trials show a small and insignificant effect." class="blog-image center" />
<em class="blog-image center">Before widespread preregistration, most trials showed large benefits. When we got more careful, these benefits evaporated.</em></p>
<h3 id="planning-for-power">Planning for power</h3>
<p>Preregistration is also a great opportunity to <a href="https://twitter.com/BalazsAczel/status/1546871350316376064">plan out our study more carefully</a>, and in particular to think about statistical power in advance. Remember the power of a study is the probability that it will reject the null hypothesis if the null is in fact false. We get more power when the study is better and more precise, but also when the effect we’re trying to measure is bigger and more visible: it’s pretty easy to show that cigarette smoking is linked to cancer, because the effect is so dramatic.<strong title="Somewhat infamously, Fisher stubbornly resisted the claim that smoking _caused_ cancer until his death. But he never denied the correlation, which was too dramatic to hide."><sup id="fnref:9"><a href="#fn:9" class="footnote">9</a></sup></strong> But it’s much harder to detect the long-term effects of something like power posing, because the effects will be so small relative to other impacts on our personality.</p>
<p>On the other hand, if the effects are that small, maybe they don’t matter. If some economic policy reduces inflation by \(0.01\)%, then even if we could measure such a small reduction we wouldn’t really care—all we need to know is that the effect is “too small to matter”. With enough precision we could get statistical significance,<strong title="As long as two factors have [any relationship at all], the effect won't be [exactly zero], and with enough data we'll be able to reject the null hypothesis that there's no effect. But that just means &quot;is the effect exactly zero&quot; is often the wrong question; instead we want to know if the effect is big enough to matter."><sup id="fnref:10"><a href="#fn:10" class="footnote">10</a></sup></strong> but that doesn’t mean the result is <a href="https://statisticsbyjim.com/hypothesis-testing/practical-statistical-significance/">practically</a> or <a href="https://www.mhaonline.com/faq/clinical-vs-statistical-significance">clinically</a> significant. During the preregistration process we can decide <a href="http://daniellakens.blogspot.com/2017/05/how-power-analysis-implicitly-reveals.html">what kind of effects would be practically important</a>, and calibrate our studies to find those effects.</p>
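To make "calibrate our studies" concrete, here is a hypothetical back-of-the-envelope power calculation (mine, not from the post) for a two-sided, two-sample z-test at \(\alpha = 0.05\) with known standard deviation. Once we pick the smallest effect we would actually care about, we can ask how large a sample we need to detect it reliably:

```python
import math

def phi(x):
    """Standard normal CDF."""
    return 0.5 * math.erfc(-x / math.sqrt(2))

def power(effect, n, z_crit=1.96):
    """Approximate power of a two-sided two-sample z-test at alpha = 0.05.
    effect: smallest difference in means we care about, in sd units."""
    se = math.sqrt(2 / n)  # standard error of the difference in sample means
    shift = effect / se
    # probability the test statistic lands beyond either critical value
    return (1 - phi(z_crit - shift)) + phi(-z_crit - shift)

for n in (20, 50, 200):
    print(f"n = {n:3d} per group -> power {power(0.5, n):.0%}")
```

For a half-standard-deviation effect, twenty subjects per group gives power well under \(50\)%: such a study will usually "fail" even when the effect is real. Hitting the conventional \(80\)% power target takes roughly sixty-some subjects per group under these assumptions.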
<p><img src="/assets/blog/hypothesis-testing/scotty-power.png" alt="Star Trek image: &quot;Do we have the power to pull it off, Scotty?&quot;" class="blog-image center" /></p>
<p>Planning for power also makes it easier to treat negative results as serious scientific contributions. The aphorism says that <a href="https://quoteinvestigator.com/2019/09/17/absence/">absence of evidence is not evidence of absence</a>, but the aphorism is wrong. When a study has high power, we are very likely to see evidence <em>if</em> it exists; so absence of evidence becomes pretty good evidence of absence. If we know our studies have enough power, then our negative results become important and meaningful, and we won’t need to hide them in a file drawer.</p>
<h3 id="a-limited-tool">A limited tool</h3>
<p>And all of this is fantastic—but it doesn’t address many of the problems science actually presents us with. <strong>Modern hypothesis testing is optimized for taking a clear, well-designed question and giving a simple yes-or-no answer.</strong> That’s a good match for clinical trials, where the question is pretty much “should we use this drug or not?” By the time we’re in Phase 3 trials, we know what we think the drug will accomplish, and we can describe in advance a clean test of whether it will or not. Preregistration solves the implementation problems pretty thoroughly.</p>
<p>But preregistration does limit our ability to explore our data. This is necessary to make hypothesis testing work properly, but it’s still a <em>cost</em>. We really <em>do</em> want to learn new things from our data, not just confirm conjectures we’ve already made. Preregistration can’t help us if we don’t already have a hypothesis we want to test. And often, when we’re doing research, we don’t.</p>
<h2 id="bigger-better-questions">4. Bigger, Better Questions</h2>
<p>Here are some scientific questions we might want to answer:</p>
<ul>
<li>What sorts of fundamental particles exist?</li>
<li>What social factors contribute to crime rates?</li>
<li>How does sleep deprivation affect learning?</li>
<li>How effective is this cancer drug?</li>
<li>How cost-effective is this public health program?</li>
<li>How malleable are all the different steel alloys you can make?</li>
</ul>
<p>None of these are yes-or-no questions. All of them are important parts of the scientific program, but none of them suggest specific hypotheses to run tests on. What do we do instead?</p>
<h3 id="spaghetti-on-the-wall">Spaghetti on the wall</h3>
<p>Maybe the most obvious idea is just to test, well, everything.</p>
<p><img src="/assets/blog/hypothesis-testing/test-all-the-things.jpg" alt="Meme: Test all the things!" class="blog-image center" />
<em class="blog-image center">With apologies to <a href="http://hyperboleandahalf.blogspot.com/2010/06/this-is-why-ill-never-be-adult.html">Allie Brosh</a>.</em></p>
<p>Now, we can’t test literally everything; collecting data takes time and money, and we can only conduct so many experiments. But we can take all the data we already have on crime rates, or on learning; and we can list every hypothesis we can think of and test them all for statistical significance. This <a href="https://en.wikipedia.org/wiki/Data_dredging">data dredging</a> is a very common, <a href="https://xkcd.com/882/">very bad idea</a>, especially in the modern era of <a href="https://journals.sagepub.com/doi/full/10.1177/0268396220915600">machine learning and big data</a>. Mass testing like this takes all the problems of hypothesis testing—false positives, publication bias, low power, and biased estimates—and makes them much worse.</p>
<p><strong>If we test every idea we can think of, most of them will be wrong.</strong> As we saw in part 2, that means a huge fraction of our positive results will be false positives. Sure, if we run all our tests perfectly, then only \(5\)% of our wrong ideas will give false positives. But since we have so many <em>more</em> bad ideas than good ones, we’ll still get way more false positives than true positives. (This is easiest to see in the case where all of our ideas are wrong—then <em>all</em> our positive results will be false positives!)</p>
<p>If we test just twenty different wrong ideas, there’s a roughly two-in-three chance that one of them will fall under the \(5\)% significance threshold, just by luck.<strong title="The odds of getting no false positives after n trials is 0.95^n, so the odds of getting a false positive are 1 - 0.95^n. And 0.95^20 ≈ 0.358, so 1 - 0.95^20 ≈ 0.652. It's a little surprising this is so close to 2/3, but there's a reason for it—sort of. If you compute (1- 1/n)^n you will get approximately 1/e, so the odds of getting a false positive at a 1/20 false positive threshold after 20 trials are roughly 1-1/e ≈ .63."><sup id="fnref:11"><a href="#fn:11" class="footnote">11</a></sup></strong> That’s a lot higher than the false positive rate of \(5\)% that we asked for, and means we are very likely to “discover” something false. And then we’ll waste even more time and resources following up on our surprising new “discovery”.</p>
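The arithmetic in that footnote is easy to double-check with a simulation (a sketch of mine, assuming the twenty tests are independent): since \(p\)-values are uniform when the null is true, each test is effectively a coin flip with a \(5\)% chance of coming up "significant".

```python
import random

def run_battery(n_tests=20, alpha=0.05):
    """Simulate testing n true-null hypotheses. Under the null, each p-value
    is uniform, so each test is 'significant' with probability alpha."""
    return any(random.random() < alpha for _ in range(n_tests))

random.seed(0)
trials = 100_000
rate = sum(run_battery() for _ in range(trials)) / trials
print(f"batteries with at least one false positive: {rate:.1%}")  # ~64%
```

Nearly two-thirds of these twenty-test batteries "discover" something, even though every single null hypothesis is true.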
<p><img src="/assets/blog/hypothesis-testing/spurious-correlation.png" alt="Graph of &quot;divorce rate in Maine&quot; against &quot;per capita consumption of margarine&quot; between 2000 and 2009. The correlation is 99.26%." class="blog-image center" />
<em class="blog-image center">If you test everything, you’ll find a ton of <a href="https://www.tylervigen.com/spurious-correlations">spurious correlations</a> like this one.</em></p>
<h3 id="multiple-comparisons">Multiple Comparisons</h3>
<p>This <a href="https://en.wikipedia.org/wiki/Multiple_comparisons_problem">multiple comparisons problem</a> has a mathematical solution: we can adjust our significance threshold to bring our false positive rate back down. A rough rule of thumb is the <a href="https://en.wikipedia.org/wiki/Bonferroni_correction">Bonferroni correction</a>, where we divide our significance threshold by the number of different ideas we’re testing. If we test twenty ideas and divide our \(5\)% significance threshold by twenty, we get a corrected threshold of \(0.25\)%: each <em>individual</em> result has only a one-in-four-hundred chance of giving a false positive, which leaves us roughly a five percent chance of getting a false positive on any of those twenty ideas.</p>
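A quick check of that arithmetic (again assuming the tests are independent):

```python
def familywise_rate(n_tests, threshold):
    """Chance of at least one false positive across independent true-null tests."""
    return 1 - (1 - threshold) ** n_tests

n_tests, alpha = 20, 0.05
bonferroni = alpha / n_tests  # 0.25%, i.e. one in four hundred
print(f"uncorrected: {familywise_rate(n_tests, alpha):.1%}")       # ~64%
print(f"Bonferroni:  {familywise_rate(n_tests, bonferroni):.1%}")  # just under 5%
```

Dividing the threshold by the number of tests brings the overall false positive rate back to about where a single test would put it.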
<p>The problem is sociological, not mathematical: people don’t <em>like</em> correcting for multiple comparisons, because it makes it harder to reach statistical significance and <a href="https://royalsocietypublishing.org/doi/10.1098/rsos.220099">“win” the science game</a>. Less cynically, correcting for multiple comparisons reduces the power of our studies dramatically, making it harder to discover real and important results. Ken Rothman’s 1990 paper <a href="https://www.jstor.org/stable/20065622">No Adjustments Are Needed for Multiple Comparisons</a> articulates both of these arguments admirably clearly: “scientists should not be so reluctant to explore leads that may turn out to be wrong that they penalize themselves by missing possibly important findings.”</p>
<p>Rothman is right in two important ways. First, researchers should not be penalized for conducting studies that don’t reach statistical significance. Studies that fail to reject the null, or measure a tiny effect, are valuable contributions to our store of knowledge. We tend to overlook and devalue these null results, but that’s a mistake, and one of the major benefits of preregistration is protecting and rewarding them.</p>
<p>Second, it’s important to investigate potential leads that might not pan out. As Isaac Asimov <a href="https://quoteinvestigator.com/2015/03/02/eureka-funny/">may or may not have said</a>, “The most exciting phrase in science is not ‘Eureka!’ but ‘That’s funny…’”; and it’s important to follow up on those unexpected, funny-looking results. After all, we have to find hypotheses somewhere.</p>
<p><strong>But undirected exploration is, very specifically, not hypothesis testing.</strong> Rothman suggests that we often want to “earmark for further investigation” these unexpected findings. But <strong>hypothesis testing isn’t designed to flag ideas for future study; instead a hypothesis test <em>concludes</em> the study, with (in theory) a definitive answer.</strong> Rothman’s goals are correct and important, but hypothesis testing and statistical significance aren’t the right tools for those goals.<strong title="From what I can tell, Rothman may well agree with me. His [twitter feed] features arguments against [using statistical significance] and [dichotomized hypotheses in place of estimation], which is roughly the position I'm advocating. But _if_ you're doing hypothesis testing, you should try to do it correctly."><sup id="fnref:12"><a href="#fn:12" class="footnote">12</a></sup></strong></p>
<h3 id="jump-to-conclusions">Jumping to conclusions</h3>
<p>At some point, though, we do generate some hypotheses.<strong title="You might notice that I'm not really saying anything about where we find these hypotheses. There's a good reason for that. Finding hypotheses is hard! It's also the most _creative_ and unstructured part of the scientific process. The question is important, but I don't have a good answer."><sup id="fnref:13"><a href="#fn:13" class="footnote">13</a></sup></strong> If we’re studying how memory interacts with speech, we might hypothesize that <a href="https://pubmed.ncbi.nlm.nih.gov/2295225/">describing a face verbally will make you worse at recognizing it later</a>, which gives us something concrete to test. Or, more tenuously, if we’re studying the ways that sexism affects decision-making, we might hypothesize that <a href="https://www.washingtonpost.com/news/monkey-cage/wp/2014/06/05/hurricanes-vs-himmicanes/">hurricanes with feminine names are more deadly because people don’t take them as seriously</a>.</p>
<p>And then we can test these hypotheses, and reject the null or not, and then—what? What does that tell us?</p>
<p><img src="/assets/blog/hypothesis-testing/what-did-we-learn.jpg" alt="Spongebob meme: &quot;What did we learn today?&quot;" class="blog-image center" /></p>
<p>We have a problem, because these hypotheses <em>aren’t</em> the questions we really want to answer. If <a href="https://www.vox.com/2020/1/8/21051869/indoor-air-pollution-student-achievement">installing air filters in classrooms increases measured learning outcomes</a>, that’s a fairly direct answer to the question of whether installing air filters in classrooms can help children learn, so a hypothesis test really can answer our question. But we shouldn’t decide that sexism is fake just because <a href="https://statmodeling.stat.columbia.edu/2016/04/02/himmicanes-and-hurricanes-update/">feminine names probably don’t make hurricanes deadlier</a>!<strong title="For that matter, if feminine hurricane names were _less_ dangerous we could easily tell a story about how _that_ was evidence for sexism. That's the garden of forking paths popping up again, where many different results could be evidence for our theory."><sup id="fnref:14"><a href="#fn:14" class="footnote">14</a></sup></strong> We should only care about the hurricane-names thing if we think it tells us something about our actual, real-world concerns.</p>
<p>And that means we can’t just test one random hypothesis relating to our big theoretical question and call it a day. We need to develop hypotheses that are reasonably connected to the questions we care about, and we need to approach those questions from <a href="https://www.nature.com/articles/d41586-018-01023-3">many different perspectives</a> to make sure we’re not missing anything. That means <strong>there’s a ton of work <em>other</em> than hypothesis testing that we need to do if we want our hypothesis tests to tell us anything useful</strong>:<strong title="In their wonderfully named (and very readable) paper [Why hypothesis testers should spend less time testing hypotheses], Anne Scheel, Leonid Tiokhin, Peder Isager, and Daniël Lakens call this the _derivation chain_: the empirical and conceptual linkages that allow you to derive broad theoretical claims from the specific hypotheses you test. "><sup id="fnref:15"><a href="#fn:15" class="footnote">15</a></sup></strong></p>
<ul>
<li><strong>Defining terms:</strong> First we need to decide what question we’re actually trying to answer! There are a lot of different things people mean by “sexism” or “memory” or “crime”, and our research will be confused unless we make sure we’re consistently talking about the same thing.<strong title="This is one of the major skills you develop in math courses, because a lot of the work of math is figuring out what question you're trying to answer. I've written about this [before], but I also recommend Keith Devlin's [excellent post] on what "mathematical thinking" is, especially the story he tells after the long blockquote."><sup id="fnref:16"><a href="#fn:16" class="footnote">16</a></sup></strong></li>
<li><strong>Causal modeling:</strong> What sort of relationships do we expect to see? If our theory on the Big Question is true, what experimental results does that imply? What other factors could confound or interfere with these effects? We need to know what relationships we’re looking for before we can design tests for them.</li>
<li><strong>Developing measurements:</strong> How will we measure the inputs and outputs to our theory? What numbers will we use to measure crime levels, or educational improvement, or ability to remember faces? Are the things we’re measuring closely connected to the definitions we chose earlier? It’s easy to measure <em>something</em> but hard to make sure the measurement <a href="https://en.wikipedia.org/wiki/Goodhart's_law">tells us what we want to know</a>.</li>
<li><strong>Determining scope:</strong> When do we expect our theory to work, and for what sort of extreme results do we expect it to break down? What experiments should we not bother running? It’s worth studying whether mild air pollution makes learning harder, without worrying about the major health effects that we know severe pollution causes.</li>
<li><strong>Auxiliary assumptions:</strong> What extra assumptions are we making in all the previous steps, and how can we verify them? Does installing classroom air filters actually reduce pollution? Do people who verbally described a face try equally hard at the later recall task? How can we tell? We can’t avoid making assumptions, but we can try to be explicit about them, and check the ones that could cause problems.</li>
</ul>
<p>Without all this work, we can come up with hypotheses, but they won’t make sense. We can run experiments, but we can’t interpret them. And we can do hypothesis tests, but we can’t use them to answer big questions.</p>
<h2 id="failing-to-measure-up">5. Failing to measure up</h2>
<p>And sometimes we have a direct question that presents a clear experiment to run, but not a clear <em>hypothesis</em>. Questions like “How effective is this cancer drug?” or “How malleable is this steel alloy?” aren’t big theoretical questions, but also aren’t specific hypotheses that can be right or wrong. We want <em>numbers</em>.</p>
<p>In practice we often use hypothesis testing to answer these questions anyway—but with an awkward kludge. We can test a null hypothesis like “this public health program doesn’t save lives”. If we fail to reject the null, we conclude that it doesn’t help <em>at all</em>; if we do reject the null, we see how many lives the program saved in our experiment, and use that as an estimate of its effectiveness.</p>
<p>This works well enough that we kinda get away with it, but it introduces consequential biases into our measurements. If the measured effect is small, we <a href="https://statmodeling.stat.columbia.edu/2020/09/17/we-want-certainty-even-when-its-not-appropriate/">round it down to zero</a>, concluding there is no benefit when there may well be a small but real benefit (or a small but real harm). And if significant studies are more likely to be seen than non-significant studies, we will see <a href="https://statmodeling.stat.columbia.edu/2022/05/25/the-failure-of-null-hypothesis-significance-testing-when-studying-incremental-changes-and-what-to-do-about-it/">more unusually good results than unusually bad ones</a>, which means we will believe basically everything is more effective than it actually is.<strong title="We also sometimes find that our conclusions depend on exactly which questions we ask. Imagine a study where we need a 5% difference to be significant, and Drug A produces a 3% improvement over placebo and Drug B produces a 7% improvement. Then the effect of Drug A isn't significant, and the effect of Drug B is, so we say that Drug A doesn't work and Drug B does. But the difference between Drug A and Drug B is _not_ significant—so if we ask that question, we conclude that the two drugs are equally good! [The difference between "significant" and "not significant" is not itself statistically significant], so it matters exactly which hypothesis we choose to test."><sup id="fnref:17"><a href="#fn:17" class="footnote">17</a></sup></strong></p>
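<p>A small simulation shows how this significance filter exaggerates effects. The numbers here are entirely made up for illustration: a true effect of \(1\) unit, measured very noisily by many small studies, where only statistically significant estimates get “published”:</p>

```python
# Simulating the "significance filter": each study produces a noisy
# estimate of a true effect of 1.0; only estimates that reject the
# null of zero at the 5% level count as "published".
import random
import statistics

random.seed(0)

true_effect = 1.0
se = 2.0                  # standard error of each study's estimate
threshold = 1.96 * se     # estimate size needed to reject "effect = 0"

estimates = [random.gauss(true_effect, se) for _ in range(100_000)]
published = [e for e in estimates if abs(e) > threshold]

print(statistics.mean(estimates))   # close to 1.0: unbiased overall
print(statistics.mean(published))   # roughly 4: a big exaggeration
```

<p>The full collection of estimates averages out to the truth, but the significant results alone overstate the effect severalfold, at least under these invented numbers.</p>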
<p>We shouldn’t be surprised that hypothesis testing does a bad job of measuring things, because hypothesis testing isn’t designed to measure things. It’s specifically designed to <em>not</em> report a measurement, and just tell us whether we should act or not. It’s the wrong tool for this job.</p>
<p>We can and should do better. A study in which mortality decreases by \(0.1\)% is evidence that the program <em>works</em>—possibly weak evidence, but still evidence! And if we <a href="https://onlinelibrary.wiley.com/doi/10.1111/jeb.14009">skip the hypothesis testing and put measurement first</a>, we can represent that fact accurately.</p>
<h3 id="compatibility-checking">Compatibility checking</h3>
<p>The simplest thing to do would be to just average all our measurements and report that number. This is a type of <em>point estimate</em>, the single number that most accurately reflects our best guess at the true value of whatever we’re measuring.</p>
<p>But a point estimate by itself doesn’t give as much information as we need. We need to measure our uncertainty around that estimate, and describe how <em>confident</em> we are in it. A drug that definitely makes you a bit healthier is very different from one that could save your life and could kill you, and it’s important to be clear which one we’re talking about.</p>
<p>We can supplement our point estimate with a <em>confidence interval</em>, also called a <em>compatibility interval</em>, which is sort of like a backwards hypothesis test. We give all the values that are compatible with our measurement—values that would make our estimate relatively unsurprising. <strong>Rather than starting with a single null hypothesis and checking whether our measurement is compatible with it, we start with the measurement, and describe all the hypotheses that would be compatible.</strong></p>
<p>The definition is a bit more technical, and easy to get slightly wrong: If we run \(100\) experiments, and generate a \(95\)% confidence interval for each experiment, then the true value will lie in about \(95\) of those intervals. A common mistake is to say that if we generate one confidence interval, the true value has a \(95\)% chance of landing in it, but that’s <a href="https://statmodeling.stat.columbia.edu/2019/04/21/no-its-not-correct-to-say-that-you-can-be-95-sure-that-the-true-value-will-be-in-the-confidence-interval/">backwards, and not quite right</a>.<strong title="Sometimes we can look at our interval after the fact and make an informed guess whether it's one of the good intervals or the bad intervals. If I run a small study to measure average adult heights, there's some risk I get a 95% confidence interval that contains, say, everything between five feet and six feet. Based on outside knowledge, I'm pretty much 100% confident in that interval, not just 95%. "><sup id="fnref:18"><a href="#fn:18" class="footnote">18</a></sup></strong> But <em>before</em> we run the experiment, we expect a \(95\%\) chance that the true value will be in the confidence interval we compute.</p>
<p><img src="/assets/blog/hypothesis-testing/confidence-intervals.png" alt="a diagram of a collection of confidence intervals" class="blog-image center" />
<em class="blog-image center">Each vertical bar is a compatibility interval from one experiment, with a circle at the point estimate. Three of the intervals don’t include the true value, which is roughly \(5\)% of the \(50\) intervals.</em>
<em class="blog-image center">Image by <a href="https://commons.wikimedia.org/wiki/File:Neyman_Construction_Confidence_Intervals.png">Randy.l.goodrich</a>, <a href="https://creativecommons.org/licenses/by-sa/4.0">CC BY-SA 4.0</a></em></p>
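<p>We can check this repeated-sampling interpretation with a simulation. This is just an illustrative sketch with invented numbers: each “experiment” draws \(30\) values from a known distribution and builds the usual mean plus or minus \(1.96\) standard errors interval:</p>

```python
# Checking the coverage of 95% confidence intervals by simulation:
# run many experiments and count how often the interval contains
# the (known, invented) true mean.
import random
import statistics

random.seed(1)

true_mean, sd, n = 10.0, 3.0, 30
trials = 10_000
covered = 0
for _ in range(trials):
    sample = [random.gauss(true_mean, sd) for _ in range(n)]
    m = statistics.mean(sample)
    se = statistics.stdev(sample) / n ** 0.5
    lo, hi = m - 1.96 * se, m + 1.96 * se
    if lo <= true_mean <= hi:
        covered += 1

print(covered / trials)   # close to 0.95
```

<p>About \(95\)% of the intervals contain the true value, just as the definition promises; no single interval tells us whether it is one of the lucky ones.</p>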
<p>Mathematically, these intervals are closely related to hypothesis tests. <strong>A result is statistically significant if the null hypothesis (often \(0\)) lies outside the compatibility interval.</strong> So in a sense compatibility intervals give the same information as a hypothesis test, just in a different format. But changing the format shifts the emphasis of our work, and the way we think about it. Rather than starting by picking a specific claim and then saying yes or no, we give a <em>number</em>, and talk about what theories and models are compatible with it. This avoids needing to pick a specific hypothesis. It also gives our readers more information, rather than compressing our answer into a simple binary.</p>
<p>Focusing on compatibility intervals can also help avoid publication bias, and make it easier to use all the data that’s been collected. When we report measurements and compatibility intervals, we can’t “fail to reject” a null hypothesis. Every study will succeed at producing <em>an estimate</em>, and a compatibility interval, so every study produces knowledge we can use, and no study will “fail” and be hidden in a file drawer. Some studies might be designed and run better than others, and so give more precise estimates and narrower compatibility intervals. We can give more weight to these studies when forming an opinion. But we won’t discard a study just for yielding an answer we didn’t expect.</p>
<h2 id="bayes">6. Bayesian statistics: the other path</h2>
<p>Throughout this series, we’ve used the language and perspective of <a href="https://en.wikipedia.org/wiki/Frequentist_inference">frequentist statistics</a>. This is the older and more classical approach to statistics, which defines probability in terms of repeated procedures. “If we test a true null hypothesis a hundred times, we’ll only reject it about five times”. “If we run this sampling procedure a hundred times, the compatibility interval will include the true value about \(95\) times.” This approach to probability is philosophically straightforward, and leads to relatively simple calculations.</p>
<p>But there are questions it absolutely can’t answer—like “what is the probability my null hypothesis is true?”—since we can’t frame them in terms of repeated trials. Remember, <strong>the \(p\)-value is <em>not</em> the probability the null is false.</strong> Its definition is a complicated conditional hypothetical that’s hard to state clearly in English: it’s the probability that we would observe results at least as extreme as the ones we actually did observe, under the assumption that the null hypothesis is true. This is easy to compute, but it’s difficult to understand what it <em>means</em> (which is why I wrote like <a href="/blog/hypothesis-testing-part-1/">six thousand words trying to explain it</a>).</p>
<p>But there’s another school of statistics that <em>can</em> produce answers to those questions. <a href="https://en.wikipedia.org/wiki/Bayesian_inference">Bayesian inference</a>, which I’ve <a href="https://jaydaigle.net/blog/overview-of-bayesian-inference/">written about before</a>, lets us assign probabilities to pretty much any statement we can come up with. This is great, because <strong>it can directly answer almost any question we actually have. But it’s also much, <em>much</em> harder to use</strong>, because it requires much more data and more computation. And the bigger and more abstract the question we ask, the worse this gets.</p>
<p>Bayesian inference needs three distinct pieces of information:</p>
<ul>
<li>The probability of seeing our data, assuming the hypothesis is true, which is essentially the \(p\)-value we’ve been discussing;</li>
<li>The probability of seeing our data, assuming the hypothesis is <em>false</em>, which is another \(p\)-value; and</li>
<li>The <em>prior probability</em> that our hypothesis is true, based on the evidence we had <em>before</em> running the experiment.</li>
</ul>
<p>Then we run an experiment, collect data, and use a formula called <a href="https://en.wikipedia.org/wiki/Bayes'_theorem">Bayes’s theorem</a> to produce a <em>posterior probability</em>, our final estimate of the likelihood our hypothesis is true.<strong title="We saw examples of this calculation in part 2, when we [calculated what fraction of positive results were true positives]. Note that we had to make assumptions about what fraction of null hypotheses are true; that's the Bayesian prior probability. Tables like the ones we used there show up a lot in simple Bayesian calculations."><sup id="fnref:19"><a href="#fn:19" class="footnote">19</a></sup></strong></p>
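<p>In the simplest possible case, where the hypothesis is either true or false and all three input numbers are invented for illustration, the calculation looks like this:</p>

```python
# Bayes's theorem for a single true-or-false hypothesis, with
# made-up numbers standing in for the three required inputs.
p_data_given_true = 0.80   # chance of our data if the hypothesis is true
p_data_given_false = 0.05  # chance of our data if it is false
prior = 0.10               # prior probability the hypothesis is true

# Total probability of seeing the data, either way:
p_data = (p_data_given_true * prior
          + p_data_given_false * (1 - prior))

# Bayes's theorem: P(hypothesis | data)
posterior = p_data_given_true * prior / p_data

print(round(posterior, 3))   # 0.64
```

<p>With a skeptical \(10\)% prior, even fairly strong data only brings us to about a two-in-three chance that the hypothesis is true. The hard part in practice is that real hypotheses aren’t a single true-or-false alternative, so these inputs are rarely this easy to write down.</p>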
<p>That’s a lot more complicated! First of all, we have to compute two \(p\)-values, not just one. But second, we calculate the extra \(p\)-value under the assumption that “our hypothesis is false”, and that covers a lot of ground. If our hypothesis is that some drug prevents cancer deaths, then the alternative includes “the drug does nothing”, “the drug increases cancer deaths”, “the drug prevents some deaths and causes others”, and even silly stuff like “aliens are secretly interfering with our experiments”. To do the Bayesian calculation we need to list every possible way our hypothesis could be false, and compute how likely each of those ways is and how plausible each one makes our data. That gets very complicated very quickly.</p>
<p>(In contrast, Fisher’s approach starts by assuming the null hypothesis is true, and ignores every other possibility. This makes the calculation much easier to actually do, but it also limits how much we can actually conclude. High \(p\)-value? Nothing weird. Low \(p\)-value? Something is weird. But that’s all we learn.)</p>
<p>And <em>third</em>, even if we can do all those calculations somehow, we need that prior probability. We want to figure out how likely it is that a drug prevents cancer. And as the first step, we have to plug in…the probability that the drug prevents cancer. We don’t know that! That’s what we’re trying to compute!</p>
<p>Bayesian machinery is great for refining and updating numbers we already have. And the more data we collect, the less the prior probability matters; we’ll eventually wind up in the correct place. So in practice, we just pick a prior that’s easy to compute with, plug it into Bayes’s theorem, and try to collect enough data that we expect our answer to be basically right.</p>
<p>And that brings us back to where we began, with replication. The more experiments we run, the more we can learn.</p>
<h2 id="conclusion">7. Conclusion: (Good) data is king</h2>
<p>I closed out part 2 with an <a href="https://xkcd.com/2400/">xkcd statistics tip</a>: “always try to get data that’s good enough that you don’t need to do statistics on it.” Here at the end of part 3, we find ourselves in exactly the same place. But this time, I hope you see that tip, not as a punchline, but as actionable advice.</p>
<p>Modern hypothesis testing “works”, statistically, as long as you ask exactly the questions it answers, and are extremely careful in how you use it. But we often misuse it by collecting flawed or inadequate data and then drawing strong, sweeping conclusions. We run small studies and then \(p\)-hack our results into significance, rather than running the careful, expensive studies that would genuinely justify our theoretical claims. We report the results as over-simplified yes-or-no answers rather than trying to communicate the complicated, messy things we observed. And if we manage to reject the null on one study we issue press releases claiming it confirms all our grand theories about society.</p>
<p><a href="https://xkcd.com/2494/"><img src="https://imgs.xkcd.com/comics/flawed_data.png" alt="XKCD 2494: "We realized all our data is flawed. Good: ...so we're not sure about our conclusions. Bad: ...so we did lots of math and then decided our data was actually fine. Very bad: ...so we trained an AI to generate better data." " class="blog-image center" /></a>
<em class="blog-image center">Too often, we use statistics to help us pretend bad data is actually good.</em></p>
<p>In this essay we’ve seen a number of possible solutions, but they’re basically all versions of “collect more and better data”:</p>
<ul>
<li>Do enough foundational work that you can formulate good hypotheses, and figure out what data you need to draw usable conclusions.</li>
<li>If you have numerical data, use the numbers, rather than throwing away information and just giving a single yes or no.</li>
<li>Preregister your studies, to make sure your data is useful and you’re not altering it to fit your conclusions.</li>
<li>Replicate your studies, so you collect more data that can either confirm or correct your beliefs.</li>
</ul>
<p>Even the Bayesian approach comes back to this. Bayesianism relies on the prior probability; but that really just means that, if we already have some knowledge before we run the experiment, we should use it!</p>
<p>Statistics is powerful and useful. We couldn’t do good science without it. But data—empirical observation—is the core of science. Statistics helps us understand the data we have, and it helps us figure out what data we need. But if our data sucks, statistics alone cannot save us.</p>
<hr />
<p><em>Have questions about hypothesis testing? Is there something I didn’t cover, or even got completely wrong? Do you have a great idea for doing science better? Tweet me <a href="https://twitter.com/profjaydaigle">@ProfJayDaigle</a> or leave a comment below.</em></p>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>We could reverse this, and err on the side of acting, if we think wrongly doing nothing has worse downsides than wrongly acting. But it’s pretty uncommon to do it that way in practice. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>We’ve seen the effects of this unnecessary dichotomization over and over again during the pandemic. We argued about whether masks “work” or “don’t work”, rather than discussing how well different types of masks work and how we could make them better. I know people who are still extremely careful to wear masks everywhere, but who wear cloth masks rather than medical—a combination that makes very little sense outside of this false binary. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3">
<p>There are <a href="https://journals.sagepub.com/doi/full/10.1177/2515245918770963">variants of hypothesis testing</a> that help you show some null hypothesis is (probably) basically right. But they’re not nearly as common as the more standard setup. <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
<li id="fn:4">
<p><a href="http://arxiv.org/pdf/1205.4251.pdf">Nosek, Spies, and Motyl write</a> about the experience of carefully replicating some interesting work before publication, and seeing the effect vanish: "Incentives for surprising, innovative results are strong in science. Science thrives by challenging prevailing assumptions and generating novel ideas and evidence that push the field in new directions. We cannot expect to eliminate the disappointment that we felt by “losing” an exciting result. That is not the problem, or at least not one for which the fix would improve scientific progress. The real problem is that the incentives for publishable results can be at odds with the incentives for accurate results. This produces a conflict of interest….The solution requires making incentives for <em>getting it right</em> competitive with the incentives for <em>getting it published</em>." <a href="#fnref:4" class="reversefootnote">↩</a></p>
</li>
<li id="fn:5">
<p>The result we’ve found doesn’t necessarily mean what we think it means, and that is its own tricky problem. But if you get a consistent effect then you’ve found <em>something</em> even if you don’t understand it yet. <a href="#fnref:5" class="reversefootnote">↩</a></p>
</li>
<li id="fn:6">
<p>If a drug is wrongly approved, we continue learning about it through observation of the patients taking it. This is, for instance, how we can be quite certain that the <a href="https://www.hopkinsmedicine.org/health/conditions-and-diseases/coronavirus/is-the-covid19-vaccine-safe">covid vaccines are effective and extremely safe</a>. But if we <em>don’t</em> approve a drug, there’s no followup data to analyze, and the drug stays unapproved. <a href="#fnref:6" class="reversefootnote">↩</a></p>
</li>
<li id="fn:7">
<p>My favorite suggestion comes from <a href="https://www.nature.com/articles/s41562-021-01192-8">Daniel Quintana</a>, who wants undergraduate psychology majors to contribute to replication efforts for their senior thesis research. Undergraduate research is often more about developing methodological skill than about producing genuinely innovative work, so it’s a good fit for careful replication of already-designed studies. <a href="#fnref:7" class="reversefootnote">↩</a></p>
</li>
<li id="fn:8">
<p>You might wonder if a result that depends heavily on minor differences in study technique can actually be telling us anything important. That’s a very good question. It’s very easy to run a hypothesis test that basically <em>can’t</em> tell us anything interesting; we’ll come back to this <a href="#jump-to-conclusions">later in the piece</a>. <a href="#fnref:8" class="reversefootnote">↩</a></p>
</li>
<li id="fn:9">
<p>Somewhat infamously, Fisher stubbornly resisted the claim that smoking <em>caused</em> cancer until his death. But he never denied the correlation, which was too dramatic to hide. <a href="#fnref:9" class="reversefootnote">↩</a></p>
</li>
<li id="fn:10">
<p>As long as two factors have <a href="https://www.gwern.net/Everything">any relationship at all</a>, the effect won’t be <a href="https://statmodeling.stat.columbia.edu/2017/06/29/lets-stop-talking-published-research-findings-true-false/">exactly zero</a>, and with enough data we’ll be able to reject the null hypothesis that there’s no effect. But that just means “is the effect exactly zero” is often the wrong question; instead we want to know if the effect is big enough to matter. <a href="#fnref:10" class="reversefootnote">↩</a></p>
</li>
<li id="fn:11">
<p>The odds of getting no false positives after \(n\) trials are \(0.95^n\), so the odds of getting a false positive are \(1 - 0.95^n\). And \(0.95^{20} \approx 0.358\), so \(1 - 0.95^{20} \approx 0.642\).</p>
<p>It’s a little surprising this is so close to \(2/3\), but there’s a reason for it—sort of. If you compute \( (1- 1/n)^n\) for large \(n\) you will get approximately \(1/e\), so the odds of getting a false positive at a \(1/20\) false positive threshold after \(20\) trials are roughly \(1-1/e \approx 0.63\). <a href="#fnref:11" class="reversefootnote">↩</a></p>
</li>
<li id="fn:12">
<p>From what I can tell, Rothman may well agree with me. His <a href="https://twitter.com/ken_rothman">twitter feed</a> features arguments against <a href="https://twitter.com/_MiguelHernan/status/1476928329794027522">using statistical significance</a> and <a href="https://twitter.com/vamrhein/status/1526879947104702465">dichotomized hypotheses in place of estimation</a>, which is roughly the position I’m advocating. But <em>if</em> you’re doing hypothesis testing, you should try to do it correctly. <a href="#fnref:12" class="reversefootnote">↩</a></p>
</li>
<li id="fn:13">
<p>You might notice that I’m not really saying anything about where we find these hypotheses. There’s a good reason for that. Finding hypotheses is hard! It’s also the most <em>creative</em> and unstructured part of the scientific process. The question is important, but I don’t have a good answer. <a href="#fnref:13" class="reversefootnote">↩</a></p>
</li>
<li id="fn:14">
<p>For that matter, if feminine hurricane names were <em>less</em> dangerous we could easily tell a story about how <em>that</em> was evidence for sexism. That’s the garden of forking paths popping up again, where many different results could be evidence for our theory. <a href="#fnref:14" class="reversefootnote">↩</a></p>
</li>
<li id="fn:15">
<p>In their wonderfully named (and very readable) paper <a href="https://journals.sagepub.com/doi/10.1177/1745691620966795">Why hypothesis testers should spend less time testing hypotheses</a>, Anne Scheel, Leonid Tiokhin, Peder Isager, and Daniël Lakens call this the <em>derivation chain</em>: the empirical and conceptual linkages that allow you to derive broad theoretical claims from the specific hypotheses you test. <a href="#fnref:15" class="reversefootnote">↩</a></p>
</li>
<li id="fn:16">
<p>This is one of the major skills you develop in math courses, because a lot of the work of math is figuring out what question you’re trying to answer. I’ve written about this <a href="/blog/asking-the-right-question/">before</a>, but I also recommend Keith Devlin’s <a href="http://devlinsangle.blogspot.com/2012/08/what-is-mathematical-thinking.html">excellent post</a> on what “mathematical thinking” is, especially the story he tells after the long blockquote. <a href="#fnref:16" class="reversefootnote">↩</a></p>
</li>
<li id="fn:17">
<p>We also sometimes find that our conclusions depend on exactly which questions we ask. Imagine a study where we need a \(5\)% difference to be significant, and Drug A produces a \(3\)% improvement over placebo and Drug B produces a \(7\)% improvement. Then the effect of Drug A isn’t significant, and the effect of Drug B is, so we say that Drug A doesn’t work and Drug B does.</p>
<p>But the difference between Drug A and Drug B is <em>not</em> significant—so if we ask that question, we conclude that the two drugs are equally good! <a href="https://statmodeling.stat.columbia.edu/2016/05/25/the-difference-between-significant-and-not-significant-is-not-itself-statistically-significant-education-edition/">The difference between "significant" and "not significant" is not itself statistically significant</a>, so it matters exactly which hypothesis we choose to test. <a href="#fnref:17" class="reversefootnote">↩</a></p>
</li>
<li id="fn:18">
<p>Sometimes we can look at our interval after the fact and make an informed guess whether it’s one of the good intervals or the bad intervals. If I run a small study to measure average adult heights, there’s some risk I get a \(95\)% confidence interval that contains, say, everything between five feet and six feet. Based on outside knowledge, I’m pretty much \(100\)% confident in that interval, not just \(95\)%. <a href="#fnref:18" class="reversefootnote">↩</a></p>
</li>
<li id="fn:19">
<p>We saw examples of this calculation in part 2, when we <a href="/blog/hypothesis-testing-part-2/#most-findings-false">calculated what fraction of positive results were true positives</a>. Note that we had to make assumptions about what fraction of null hypotheses are true; that’s the Bayesian prior probability. Tables like the ones we used there show up a lot in simple Bayesian calculations. <a href="#fnref:19" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Jay Daigle

This is the third part of a three-part series on hypothesis testing. Hypothesis testing is central to the way we do science, but it has major flaws that have encouraged widespread shoddy research. In this essay we consider methods that can help us draw better conclusions, and avoid the pitfalls of hypothesis testing. We start with some smaller and more conservative ideas, which basically involve doing hypothesis testing <em>better</em>. Then we'll look at more radical changes, taking the focus away from hypothesis tests and seeing the other ways we can organize and contribute to scientific knowledge.

Hypothesis Testing and its Discontents, Part 2: The Conquest of Decision Theory
2022-05-24T00:00:00-07:00
https://jaydaigle.net/blog/hypothesis-testing-part-2

<p>This is the second part of a three-part series on hypothesis testing.</p>
<p>In <a href="/blog/hypothesis-testing-part-1/">part 1</a> of this series, we looked at the historical origins of hypothesis testing, and described two different approaches to the idea: Fisher’s significance testing, and Neyman-Pearson hypothesis testing. In this essay, we’ll see how modern researchers use hypothesis testing in practice. And in <a href="https://jaydaigle.net/blog/hypothesis-testing-part-3/">part 3</a> we’ll talk about alternatives to hypothesis testing that can help us avoid replication crisis-type problems.</p>
<p>The modern method is an awkward mix of Fisher’s goals and Neyman and Pearson’s methods that attempts to provide a one-size-fits-all solution for scientific statistics. The inconsistencies within this approach are a major contributor to the replication crisis, making bad science both more likely and more visible.</p>
<h2 id="modern-hypothesis-testing">Modern Hypothesis Testing</h2>
<p>The two approaches to hypothesis testing we saw in part 1 were each designed to answer specific questions.</p>
<p><strong>Fisher’s significance testing</strong> specifies a null hypothesis, and <strong>measures how much evidence our experiment provides</strong> against that null hypothesis. This is measured by the \(p\)-value, which tells us how likely our evidence would be if the null hypothesis is true. (It does <em>not</em> tell us how likely the null hypothesis is to be true!)</p>
<p><strong>Neyman-Pearson hypothesis testing helps us make a decision between two courses of action</strong>, like prescribing a drug or not. We weigh the costs of getting it wrong in either direction, and decide which direction we want to default to if the evidence is unclear. The null hypothesis is that we should take that default action (such as not prescribing the drug), and the alternative is that we should take the other action (prescribing the drug).</p>
<p>Based on our weighing of the costs of making a mistake in either direction, and the amount of information we have to work with, we set a “false positive” threshold \(\alpha\) and a “false negative” threshold \(\beta\). These numbers are tricky to understand and describe correctly, even for experienced researchers. I encourage you to go read part 1 if you haven’t already, but in brief:</p>
<ul>
<li>The number \(\alpha\) measures the chance that, <em>if</em> the drug doesn’t work and isn’t worth taking, we will screw up and prescribe it anyway.</li>
<li>The number \(\beta\) measures the chance that, <em>if</em> the drug works and is worth taking, we’ll make a mistake and withhold it.</li>
</ul>
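<p>Since \(\alpha\) and \(\beta\) are long-run frequencies, we can check them by simulation. Here's a minimal sketch in Python; the trial design (the mean of \(25\) noisy measurements with known standard deviation \(1\), and a true effect of \(0.56\)) is invented for illustration, chosen so that a two-sided z-test at \(\alpha = .05\) has a power of about \(80\)%:</p>

```python
import random
from statistics import NormalDist, fmean

random.seed(0)
ALPHA = 0.05
N, SD, EFFECT = 25, 1.0, 0.56  # hypothetical trial design, not from the essay
z_crit = NormalDist().inv_cdf(1 - ALPHA / 2)  # two-sided cutoff, about 1.96

def rejects(true_effect):
    """Simulate one trial; return True if the z-test rejects the null."""
    xbar = fmean(random.gauss(true_effect, SD) for _ in range(N))
    z = xbar / (SD / N ** 0.5)
    return abs(z) > z_crit

trials = 20_000
# alpha: how often we reject when the null is true (the drug does nothing)
type1 = sum(rejects(0.0) for _ in range(trials)) / trials
# beta: how often we fail to reject when the null is false (the drug works)
type2 = 1 - sum(rejects(EFFECT) for _ in range(trials)) / trials
print(f"empirical alpha = {type1:.3f}, empirical beta = {type2:.3f}")
```

<p>The printed rates come out near \(.05\) and \(.20\): the guarantees hold, but only as frequencies over many repeated experiments, never as a verdict about any single one.</p>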
<p><strong>The Neyman-Pearson method doesn’t try to tell us whether the drug “really works”</strong>; it <em>only</em> tells us how we should weigh the risks of making the two possible mistakes. <strong>Fisher’s method takes a very different approach and tries to measure the evidence</strong> to help us decide what to believe; but it does not give a clean yes-or-no answer.</p>
<p>Modern statistical hypothesis testing is a weird mishmash of these two approaches. We report \(p\)-values as evidence for or against the null hypothesis, as in Fisher-style significance testing. But we <em>also</em> try to give a yes-or-no, accept-or-reject verdict, as in the Neyman-Pearson approach. And while either approach can be useful on its own, the combination loses the key statistical benefits of each and leaves us in a bit of a muddle.</p>
<h3 id="the-modern-approach-in-practice">The modern approach in practice</h3>
<p>Modern researchers generally do something like this:</p>
<ul>
<li>First we choose a significance level \(\alpha\). We usually default to \(\alpha = .05\), but we sometimes make it lower if we want to be really confident in our conclusions. Particle physicists often use an \(\alpha\) of about \(.0000003\), or about \(1\) in \(3.5\) million.<strong title="This is the probability of getting data five standard deviations away from the mean. So you'll often see this reported as a significance threshold of 5σ. Related is the [Six Sigma techniques] for ensuring manufacturing quality, though somewhat counterintuitively they typically only aim for [4.5 σ] of accuracy."><sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup></strong></li>
</ul>
<ul>
<li>
<p>Next we specify a null hypothesis, which is usually something like “the thing we’re studying has no effect”. We generally choose a null hypothesis that we <em>don’t</em> believe, because our machinery will attempt to <em>disprove</em> our null.</p>
<p>If we want to prove that a new drug helps prevent cancer, our null hypothesis will be that the drug has no effect on cancer rates. If we want to show that hiring practices are racially discriminatory, our null hypothesis will be that race has no effect on whether people get hired.</p>
</li>
<li>
<p>Technically, we also have an alternative hypothesis: “this drug does help prevent cancer”, or “hiring practices are affected by race”. This alternative hypothesis is often what we actually believe, but we often don’t make it very precise during the design of the experiment. Specifying the alternative hypothesis well is a really important part of research design, but it’s a bit tangential to this essay so we won’t talk about it much here.</p>
</li>
<li>
<p>We run the experiment, do a Fisher-style significance test, and report the \(p\)-value we get. If it’s less than \(\alpha\), we reject the null hypothesis, and generally consider the experiment to have successfully proven our alternative is true. If the \(p\)-value is greater than \(\alpha\), we don’t reject the null hypothesis,<strong title="It is common for people to be sloppy here and say they "accept" the null. In fact, I wrote that in my first draft of this paragraph. But it's bad practice to say that, because even a very high p-value doesn't provide good evidence that the null hypothesis is true. Our methods are designed to default to the null hypothesis when the data is ambiguous. Neyman _did_ use the phrase "accept the null", but in the context of a decision process, where "accepting the null" means taking some specific, concrete action implied by the null, rather than more generally committing to believe something."><sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup></strong> and often view the experiment as a failure.</p>
</li>
</ul>
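<p>The last step of that recipe is simple enough to sketch in a few lines of code. This assumes a two-sided z-test with known standard error, and the numbers are invented for illustration:</p>

```python
from statistics import NormalDist

def two_sided_p(z):
    """p-value for a two-sided z-test: the probability of data at least
    this extreme, computed as if the null hypothesis were true."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical experiment: observed effect 0.5, standard error 0.2.
ALPHA = 0.05
p = two_sided_p(0.5 / 0.2)
print(f"p = {p:.4f};", "reject the null" if p < ALPHA else "fail to reject")
```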
<p>There are a few problems with this approach, but most of them stem from the same core issue: <strong>classical statistical tools are incredibly fragile.</strong> If you use them <em>exactly</em> as described, you are mathematically guaranteed to get some specific benefit. (In a correct Neyman-Pearson setup, for instance, you are guaranteed a false positive rate of size \(\alpha\).) But you get <em>exactly</em> that guarantee, and possibly nothing more. My friend Nostalgebraist <a href="https://nostalgebraist.tumblr.com/post/161645122124/bayes-a-kinda-sorta-masterpost">analogizes</a> on Tumblr:</p>
<blockquote>
<p>The classical toolbox also has a lot of oddities….The labels on the tools say things like “won’t melt below 300° F,” and you <em>are in fact</em> guaranteed <em>that</em>, but the same screwdriver might turn out to instantly vaporize when placed in water, or when held in the left hand. Whatever is not guaranteed on the label is possible, however dangerous or just plain dumb it may be.</p>
</blockquote>
<p>This fragility means that if you carelessly combine two tools, you often lose the guarantees of each of them, and wind up with a screwdriver that melts at room temperature and <em>also</em> vaporizes when held in your left hand. And you may not get anything at all in return—other than, I suppose, the inherent benefits of being careless and lazy.</p>
<p class="center"><a href="https://www.egscomics.com/comic/2015-05-01"><img src="/assets/blog/hypothesis-testing/lazy-egscomics.png" alt="Panel from El Goonish Shive comic: "Shoot, I'm going to be lazy all the time forever now. It gets _results_." /></a></p>
<p class="center"><em>Sure, being lazy gets results. But they might not replicate.</em></p>
<h3 id="the-wrong-tool-for-the-job">The wrong tool for the job</h3>
<p>The Neyman-Pearson method is designed to give an unambiguous yes-or-no answer to a question, so we can act on the information we currently have. This is exactly what we need when it’s time to make a specific decision about whether or not to open a new factory or change to a different brand of fertilizer. And the method was so successful that in 1955, John Tukey <a href="https://www.tandfonline.com/doi/abs/10.1080/00401706.1960.10489909">expressed concern about</a> the “tendency of decision theory to attempt the conquest of all of statistics”.</p>
<p>He worried because <strong>in scientific research we don’t want to make decisions, but reach conclusions</strong>. On the one hand, we don’t need to make a definitive decision <em>right now</em>. If it’s not clear which theory describes the evidence better, we can just say that, and wait for more evidence to come in. On the other hand, we want to eventually reach firm conclusions that we can trust, and use as a foundation for further work. That requires a higher degree of confidence than “the best we can say right now”, which is what Neyman-Pearson gives us. Fisher’s methods, in contrast, were designed to accumulate certainty through repeated consistent experimental results, the sort of thing a true conclusion theory would need.</p>
<p>But because Neyman-Pearson worked so well for a very specific type of problem (and probably also because Fisher was <a href="https://www.newstatesman.com/long-reads/2020/07/ra-fisher-and-science-hatred">kind of terrible</a>), many fields adopted it as a default and use it for pretty much everything. <a href="http://daniellakens.blogspot.com/2022/05/tukey-on-decisions-and-conclusions.html">Daniel Lakens says</a> that in hindsight, Tukey didn’t need to worry, since statistics textbooks for the social sciences don’t even discuss decision theory; but in fact <strong>we’ve largely adopted a tool of decision theory, and repurposed it to reach conclusions instead</strong>.</p>
<p>A decision theory needs to produce a clear, discrete answer to our questions, even if there’s not much evidence available. And unfortunately, our scientific papers regularly try to transmute weak evidence into strong conclusions. We tend to over-interpret <a href="https://slatestarcodex.com/2014/12/12/beware-the-man-of-one-study/">individual studies</a>, especially when one study is all we have. How often have you seen in the news that “a new study proves that” something is true? It’s almost never wise to conclude that a question is resolved because of one study. But the Neyman-Pearson framework is designed to do exactly that, and so inclines us to be overconfident.</p>
<p>Even if you have multiple studies, the same problem shows up in a different form. When there’s a complicated and messy body of research on a topic, we should probably hold complicated and messy beliefs, rather than forming a definitive conclusion. Instead, we often argue about which study is “right” and which is “wrong”, because that’s the lens we use to evaluate research.</p>
<p class="center"><img src="/assets/blog/hypothesis-testing/onion-eggs-good-this-week.png" alt="Screenshot of a short onion article, titled "Eggs Good For You This Week"" class="blog-image" /></p>
<p class="center blog-image"><em>My <a href="https://www.theonion.com/eggs-good-for-you-this-week-1819565159">favorite article from The Onion</a> demonstrates the wrong way to interpret conflicting studies.</em></p>
<p>Of course, sometimes one study <em>is</em> pretty much just wrong! If you have two studies and one shows that a child care program cuts poverty by 50% and the other shows that it increases poverty, at least one of them has to be pretty badly off the mark somehow. But even then, the hypothesis testing framework can mislead us, because of the way it handles the burden of proof.</p>
<h3 id="defaults-matter">Defaults Matter</h3>
<p>Hypothesis testing methods build in a bias toward sticking with the null hypothesis. This is intentional; we’re looking for strong evidence that the null is false, not just something that might check out if we squint really hard. <strong>We want to put the burden of proof on showing that something new is actually happening.</strong></p>
<p><strong>But once a study rejects the null, it’s very easy to be <em>decisive</em> and treat its result as “proven”, and shift the burden of proof onto work that challenges the original study.</strong> So when a paper runs a hypothesis test and concludes that <a href="https://statmodeling.stat.columbia.edu/2014/06/06/hurricanes-vs-himmicanes/">female-named hurricanes are more dangerous than male-named ones</a>, this belief is “proven” and becomes the new default. And since that one study established a new baseline, anyone who disagrees now faces the burden of proof, and faces an uphill battle to convince people.</p>
<p>It’s pretty common for a small early study to find a big effect, and then be followed up by a few larger and better studies that <a href="https://statmodeling.stat.columbia.edu/2016/04/02/himmicanes-and-hurricanes-update/">don’t find the same effect</a>. But all too often people more or less conclude the big effect is real, because that first study found it, and the followups weren’t convincing <em>enough</em> to overcome the presumption that the effect is real.<strong title="Andrew Gelman suggests a helpful [time-reversal heuristic]: what would you think if you saw the same studies in the opposite order? You'd start with a few large studies establishing no effect, followed by one smaller study showing an effect. In theory that gives you the exact same information, but in practice people would treat it very differently—assuming the first studies [actually got published]."><sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup></strong></p>
<p>And the Neyman-Pearson framework reinforces this twice. First, because it is intentionally <em>decisive</em>, it encourages us to commit to the result of a single study. Second, rejecting the null hypothesis is seen as strong evidence against the null, but failing to reject is only weak evidence that the null is true. This is why we “fail to reject” rather than simply “accept” the null hypothesis: maybe the null is true, or maybe the experiment just wasn’t sensitive enough to reject it.</p>
<p>So if one study rejects the null and another fails to reject, it’s very easy to assume that the first study was just better. After all, it managed to reject the null, didn’t it? But a reasonable conclusion theory would incorporate both studies, rather than rejecting the one that “failed”.</p>
<h2 id="publication-in-practice">Publication in practice</h2>
<p>So far I’ve discussed theoretical problems with the hypothesis testing framework: reasons it might be the wrong tool for the problems we’re applying it to. But a possibly worse problem is that it’s very easy to <em>misuse</em> hypothesis testing, so that it doesn’t even do its own job correctly. And the structural dynamics of how research gets conducted, published, and distributed tends to encourage this misuse, and amplify the conclusions of sloppy studies.</p>
<h3 id="who-wants-to-be-boring">Who wants to be boring?</h3>
<p><strong>Most academics really care about doing good research and contributing to our knowledge about the world</strong>; otherwise they wouldn’t be academics. The academic career path is long and grueling, and doesn’t pay very well compared to other things that nascent academics could be doing; there’s a reason people say that you shouldn’t get a Ph.D. if you can imagine being happy without one.</p>
<p>But that doesn’t mean research is conducted by cloistered ascetics with no motivations other than a monastic devotion to the truth. <strong>People who do research want to <em>discover interesting things</em>, not spend thirty years on experiments that don’t uncover anything new.</strong> Moreover, they want to discover things that <em>other people</em> think are interesting—people who can give them grants, or jobs, or maybe even book deals and TED talks.</p>
<p>Even without any dishonesty, this shapes the questions people ask, and also the way they answer them. First, people want to reject the null hypothesis, because we see that as strong evidence, but see failing to reject the null as weak evidence. An experiment that fails to reject the null is rarely actually published; all too often, it’s seen as an experiment that simply failed.</p>
<p>Second, people want to prove <em>new</em> and <em>surprising</em> things. It would be extremely easy for me to run a study rejecting the null hypothesis that 15-year-olds are on average about as tall as 5-year-olds. But no one would care about this study—including me—because we already know that.</p>
<p>Now, sometimes it’s worth clearly establishing that obvious things are in fact true. And we do have data on the average height of children at various ages, and it wouldn’t be hard to use that to show that 15-year-olds are taller than 5-year-olds. Collecting that sort of routine data on important topics is <a href="https://twitter.com/ProfJayDaigle/status/1521911837897502723">very useful and important work</a> that we should probably reward more than we do.</p>
<p>But we <em>don’t</em> reward routine data collection heavily, and most of the time researchers are trying to prove surprising new results. And that’s exactly the problem: <strong>new results are “surprising” when you wouldn’t have expected them—which is exactly when they’re unlikely to be true.</strong></p>
<h3 id="most-findings-false">“Why most published research findings are false”</h3>
<p>This quest for surprising results interacts with the statistics of the Neyman-Pearson method in an extremely counterintuitive way. The statistical guarantee is: if we test a true null hypothesis, we’ll get a false rejection about five percent of the time. <strong>But that doesn’t mean a rejection has a five percent chance of being false. And the more studies of true null hypotheses we run, the bigger this difference gets.</strong></p>
<p>We can most easily understand how this works with a couple examples. As a baseline, let’s look at the case where half our null hypotheses are true. Imagine we run two hundred studies, \(100\) with a true null hypothesis and \(100\) with a false null hypothesis. Our false positive rate is \(\alpha = 0.05\), so we’ll reject the null in five of the \(100\) studies where the null is true. And we generally hope for a false negative rate of \(\beta = 0.20\), in which case we reject the null in \(80\) of the \(100\) studies where the null is false.</p>
<table>
<thead>
<tr>
<th> </th>
<th>Null is false</th>
<th>Null is true</th>
<th>Total</th>
</tr>
</thead>
<tbody>
<tr>
<td>Reject the null</td>
<td>80</td>
<td>5</td>
<td>85</td>
</tr>
<tr>
<td>Don’t Reject</td>
<td>20</td>
<td>95</td>
<td>115</td>
</tr>
<tr>
<td>Total</td>
<td>100</td>
<td>100</td>
<td>200</td>
</tr>
</tbody>
</table>
<p>So we have \(85\) positive results, of which \(80\) are true positives and \(5\) are false positives, and so \(5/85 \approx 6\)% of our positive results are false positives.<strong title="You might recognize this as an application of Bayes's theorem, and a basic example of [Bayesian inference]. Tables like these are very common in Bayesian calculations. "><sup id="fnref:4"><a href="#fn:4" class="footnote">4</a></sup></strong> And that’s not too bad—the fact that it’s <em>higher</em> than the false positive rate of \(5\)% should be a warning sign.</p>
<p>But now imagine our researchers get more ambitious, and start testing more interesting and potentially-surprising findings. This means we should expect more of our null hypotheses to actually be true. If only ten percent of the original \(200\) null hypotheses are false, then we’ll have \(180\) studies with a true null and only \(20\) with a false null. We’ll still reject \(80\)% of false null hypotheses, and \(5\)% of true null hypotheses, so our results look like this:</p>
<table>
<thead>
<tr>
<th> </th>
<th>Null is false</th>
<th>Null is true</th>
<th>Total</th>
</tr>
</thead>
<tbody>
<tr>
<td>Reject the null</td>
<td>16</td>
<td>9</td>
<td>25</td>
</tr>
<tr>
<td>Don’t Reject</td>
<td>4</td>
<td>171</td>
<td>175</td>
</tr>
<tr>
<td>Total</td>
<td>20</td>
<td>180</td>
<td>200</td>
</tr>
</tbody>
</table>
<p>Now we only have \(16\) true positives (out of \(20\) cases where we should reject), and we get \(9\) false positives (out of \(180\) cases where we shouldn’t reject the null). So a full \(9/25\) or \(36\)% of our positive results are false positives—much higher than \(5\)%! And often, only the studies that reject the null, and land in the first row of the table, get published at all. So we might find that a third of published papers will have false conclusions.</p>
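<p>The arithmetic behind both tables is the same calculation: Bayes's rule applied to study counts. It fits in one short Python function (the name is mine, not standard terminology from any statistics library):</p>

```python
def false_discovery_rate(alpha, power, frac_false_null):
    """Fraction of null-rejections that are false positives, given the
    false positive rate, the power, and the fraction of tested null
    hypotheses that are actually false."""
    true_pos = frac_false_null * power
    false_pos = (1 - frac_false_null) * alpha
    return false_pos / (true_pos + false_pos)

# The two scenarios from the tables: half the nulls false, then one in ten.
print(false_discovery_rate(0.05, 0.80, 0.5))  # about 0.06, i.e. 5/85
print(false_discovery_rate(0.05, 0.80, 0.1))  # about 0.36, i.e. 9/25
```

<p>Pushing <code>frac_false_null</code> lower, toward really surprising hypotheses, drives the false discovery rate higher still, even with \(\alpha\) and the power held fixed.</p>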
<p><strong>If researchers are regularly testing theories that are unlikely to be true, then most of the positive (and thus published) results can be false, even if the rate of false positives is quite low.</strong> This is the key observation of the famous paper by John Ioannidis that kicked off the replication crisis, <a href="https://en.wikipedia.org/wiki/Why_Most_Published_Research_Findings_Are_False">Why Most Published Research Findings Are False</a>.<strong title="Followups to Ioannidis's paper contend that only about 14% of published biomedical findings are actually false. I'm not in a position to comment on this one way or the other. In psychology, different studies estimate that somewhere [between from 36% and 62%] of published results replicate."><sup id="fnref:5"><a href="#fn:5" class="footnote">5</a></sup></strong></p>
<p>This is sometimes known as the <a href="https://en.wikipedia.org/wiki/Publication_bias">file-drawer effect</a>: we see the studies that get published, but not the “failed” ones that are left in the researchers’ filing cabinets. So even though only thirteen of the \(200\) studies give the wrong answer, \(9\) of the \(25\) that actually get published are wrong.</p>
<p>And no, \(9/25\) isn’t quite a majority, so while this is bad, it doesn’t seem to justify Ioannidis’s claim that “most” published findings are false.</p>
<p>But we haven’t talked about everything that can go wrong yet!</p>
<h3 id="the-problem-of-power">The problem of power</h3>
<p>I said that “we generally hope for a false negative rate of \(\beta = 0.2\)”. But where does that hope come from?</p>
<p>The original Neyman-Pearson framework has two parameters, the false positive rate \(\alpha\) and the false negative rate \(\beta\). You can always make \(\alpha\) lower by accepting a higher \(\beta\), and researchers are supposed to balance these parameters against each other, based on the relative costs of making Type I and Type II errors. But in practice we just <a href="https://doi.org/10.1353/sof.2005.0108">set \(\alpha\) to \(.05\) and move on with our lives</a>; we don’t think about the relative balance of costs, or what it’s really saying about our research.</p>
<p>If our data is good enough, then we can make \(\alpha\) and \(\beta\) both small, and draw conclusions with a fair degree of confidence. But if our data is bad, then the study will be too weak to detect a lot of true effects, and so to keep \(\alpha\) small, we need to make \(\beta\) large. Consequently we say that the <em>power</em> of a study is \(1 - \beta\), which is the <em>true</em> positive rate. A study with high power will usually give the correct answer; a study with low power can’t be trusted.</p>
<p><img src="/assets/blog/hypothesis-testing/abusing-your-power.jpg" alt="Picture of a cat, with text: "Don't even think about abusing your power"" class="blog-image center" /></p>
<p>Much like we typically set \(\alpha = 0.05\), we typically try to get \(\beta \leq 0.2 \), and thus conduct studies with a power of at least \(80\)%. And like with the false positive rate, this number is also not really motivated by anything in particular: the choice is generally attributed to Jacob Cohen, who <a href="http://daniellakens.blogspot.com/2019/05/justifying-your-alpha-by-minimizing-or.html">wrote</a> that</p>
<blockquote>
<p>The \(\beta\) of \(.20\) is chosen with the idea that… Type I errors are of the order of four times as serious as Type II errors. This \(.80\) desired power convention is offered with the hope that it will be ignored whenever an investigator can find a basis in his substantive concerns in his specific research investigation to choose a value <em>ad hoc</em>.</p>
</blockquote>
<p>That is, there’s no really good argument for not picking \(\beta = 0.1 \) or \(\beta = 0.3\) instead, but it seems like it’s about the right size if you don’t have any better ideas.</p>
<p>There are two problems here. The minor one is that both of these numbers are pretty arbitrary. If we have enough data that we can get \(\alpha = 0.05,\beta = 0.2\), then we could also choose to reject the null more readily and get something like \(\alpha = 0.1, \beta = 0.11\), with a high false positive rate but a power of \(89\)%; or we could reject the null less often and get \(\alpha = 0.02, \beta = 0.33\), with a low false positive rate but power of only \(67\)%.</p>
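<p>We can make this trade-off concrete with a two-sided z-test. Fixing the signal-to-noise ratio at \(2.8\) (an assumption chosen so that \(\alpha = 0.05\) yields \(\beta \approx 0.2\); it's not a number from any study), \(\beta\) becomes a function of \(\alpha\), and the values land close to, though not exactly on, the ones quoted above:</p>

```python
from statistics import NormalDist

norm = NormalDist()

def type2_rate(alpha, delta=2.8):
    """Beta for a two-sided z-test with signal-to-noise ratio delta
    (true effect divided by standard error).  delta = 2.8 is an
    assumption that makes alpha = 0.05 correspond to beta near 0.20."""
    z_crit = norm.inv_cdf(1 - alpha / 2)
    # Chance the test statistic falls short of the cutoff; the tiny
    # chance of rejecting on the wrong side is ignored.
    return norm.cdf(z_crit - delta)

for a in (0.02, 0.05, 0.10):
    print(f"alpha = {a:.2f} -> beta = {type2_rate(a):.2f}")
```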
<p>Which of those trade-offs we want depends on the specifics of our current question: if Type I and Type II errors are about equally bad, we might want \(\alpha\) and \(\beta\) to be about the same size, but if a Type I error is much, much worse, we should want \(\alpha\) to be much smaller than \(\beta\). We can’t make an informed choice of \(\alpha\) and \(\beta\) without knowing details about the specific decision we’re trying to make.</p>
<p>But when we’re trying to do <em>science</em> it’s not clear what to choose. We can’t really quantify the costs of publishing a paper with a false conclusion; the entire setup of computing practical trade-offs doesn’t make all that much sense when we’re trying to discern the truth rather than make a decision. <strong>This is one major way that the Neyman-Pearson framework isn’t the right tool for science: the entire method is premised on a calculation we can’t do.</strong></p>
<p>But we <em>can</em> just set \(\alpha = 0.05, \beta = 0.20\), and see what happens. And as long as these numbers are a vaguely reasonable size, we’ll probably get vaguely reasonable results. We hope.</p>
<h3 id="where-does-power-come-from">Where does power come from?</h3>
<p>There’s a second problem, though, which is widespread and frequently disastrous. Sometimes \(\beta\) gets so large that a study becomes useless—and we don’t even notice.</p>
<p>For a given \(\alpha\), your \(\beta\) depends on the quality of the data you have. With very good data, you can be very confident about your conclusion in both directions. We have a tremendous amount of data about the relationship between age and height in children, so we can design studies that will have low rates of false positives and false negatives. And physics experiments ask for a false positive rate less than one in a million—and they can actually <em>achieve</em> this because their data is both copious and precise.</p>
<p><strong>But with bad or noisy data, no amount of statistical cleverness can give any degree of confidence in our conclusions.</strong> If you want to study the effect on life expectancy of winning or losing an election to be a US state governor, <a href="https://statmodeling.stat.columbia.edu/2020/07/02/no-i-dont-believe-that-claim-based-on-regression-discontinuity-analysis-that/">you wind up with this scatterplot</a>:</p>
<p class="center"><img src="/assets/blog/hypothesis-testing/governor-life-expectancy.png" alt="Scatterplot with "Percentage vote margin" on the x-axis, from -10 to 10, and "Years alive after election" on the y-axis, from 0 to 60. There is no noticeable pattern." class="blog-image center" />
<em>If your data is this scattered, you will never be able to detect small effects.</em></p>
<p>There aren’t <em>that</em> many governor races, and lifespan after any given race varies from just a couple years to more than fifty, so the data is extremely noisy. If winning an election boosted your lifespan by ten years, we would probably be able to tell. But an effect that large is absurd, and there’s no way to use data like this to pick up changes of just a year or two.</p>
<p>When we said we “ask for” a \(\beta\) below \(0.2\), we really meant “we should collect enough data to get a power of \(80\)%”. That’s not really an option for the governors study, without waiting around for more elections and more dead governors; on that question we’re kind of stuck with the data we have. Despite the Neyman-Pearson inclination to make a firm decision, all we can reasonably do is embrace uncertainty.</p>
<p>If we’re running a laboratory experiment, on the other hand, we can decide how big an effect we’re looking for, and calculate how many people we’d need to study to get a power of \(80\)%. But it’s hard to calculate this correctly, because it depends on how big the effect we’re studying is, and we <em>don’t know how big it is</em> because we <em>haven’t done the study yet</em>. So the calculation is based on a certain amount of guesswork.<strong title="We can also base it on [how big of an effect we _care_ about]. If we're studying reaction times, we might decide that an effect smaller than ten milliseconds is irrelevant, and we don't care about it even if it's real. Then we can choose a study with enough power to detect a 10ms effect at least 80% of the time. But this brings us back to the core issue, that "is there an effect" just isn't a great question, and the Neyman-Pearson method isn't a great tool for answering it. "><sup id="fnref:6"><a href="#fn:6" class="footnote">6</a></sup></strong></p>
<p>Even if we do this calculation correctly, there’s a real chance that we have to run a really big experiment to get the power we want. (If we’re looking for a small effect, we may have to run a really, <em>really</em> big experiment.) And big experiments are expensive! A lot of researchers skip this step entirely, and just run whatever experiment they can afford, <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4961230/">regardless of how little power it has</a>.</p>
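<p>The back-of-the-envelope version of that calculation isn’t itself complicated; the hard part is the guesswork about the effect size. Here’s the textbook normal-approximation formula for a two-group comparison, sketched in Python:</p>

```python
from math import ceil
from statistics import NormalDist

norm = NormalDist()

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate subjects per arm for a two-sample z-test to detect a
    standardized effect size (Cohen's d) at the given alpha and power."""
    z_alpha = norm.inv_cdf(1 - alpha / 2)
    z_power = norm.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

for d in (0.8, 0.5, 0.2):  # conventionally "large", "medium", "small" effects
    print(f"d = {d}: about {n_per_group(d)} subjects per group")
```

<p>Halving the guessed effect size quadruples the required sample, which is why hunting for small effects gets expensive so fast.</p>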
<p>And if the power is low enough, things get very dumb very quickly.</p>
<h3 id="we-need-more-power">We need more power!</h3>
<p>Let’s start by looking at what happens when the power is really, idiotically low. This graph shows what happens when you run an experiment with a power of \(0.06\), which means a false negative rate of \(94\)%. And there are three different problems that pop up.</p>
<p><img src="https://statmodeling.stat.columbia.edu/wp-content/uploads/2014/11/Screen-Shot-2014-11-17-at-11.19.42-AM.png" alt="A diagram of the effects of low-power studies.
This is what &quot;power = 0.06&quot; looks like. Get used to it.
Type S error probability: If the estimate is statistically significant, it has a 24% chance of having the wrong sign.
Exaggeration ratio: If the estimate is statistically significant, it must be at least 9 times higher than the effect size." class="blog-image center" /></p>
<p class="center"><em>Figure by <a href="https://statmodeling.stat.columbia.edu/2014/11/17/power-06-looks-like-get-used/">Andrew Gelman</a>.</em></p>
<p>The obvious problem is that even if the null hypothesis is wrong, we probably won’t reject it, because the data isn’t good enough to <em>show</em> that it’s wrong. Even if the null is false, we’ll fail to reject it \(94\)% of the time! (This is represented by the large white area in the middle of the graph.) But this, at least, is the process working as intended: our goal was to err on the side of not rejecting the null hypothesis, and that is in fact what we’re doing.</p>
<p>But there are two subtler problems, which cause more trouble than just a pile of inconclusive studies. We still manage to reject the null \(6\)% of the time, but because the study is so weak, this only happens when we get unusually lucky. And that happens when our data is much, <em>much</em> further away from the null hypothesis than it usually is. <strong>At a power of \(\mathbf{0.06}\), we only get a significant result when our measurement is <em>nine times</em> as big as the true effect we want to measure.</strong> (This is the red region on the right of Gelman’s graph; he calls it a “Type M error”, for “magnitude”.)</p>
<p>This is a major culprit behind a lot of improbable ideas that come out of shoddy research. In my <a href="/blog/replication-crisis-math/">post on the replication crisis</a> I talked about how a lot of careless research starts out asking whether an effect exists, but finds an effect that’s <em>surprisingly large</em>, and then the story people tell is focused on the dramatic, unexpectedly large effect. But that drama is a necessary result of running underpowered studies.</p>
<p>The study of gubernatorial elections and life expectancy is a perfect example of this process. Just by looking at the graph, you can tell there probably isn’t a big effect. But researchers Barfort, Klemmensen and Larsen found a clever analysis<strong title="Clever analyses like this are often a bad idea; we'll come back to this idea [soon]."><sup id="fnref:7"><a href="#fn:7" class="footnote">7</a></sup></strong> that did <a href="https://www.cambridge.org/core/journals/political-science-research-and-methods/article/abs/longevity-returns-to-political-office/6205207F55C97729E66A8B08D7641572">produce a statistically significant result</a>—and claimed that the difference between narrowly winning and narrowly losing an election was <em>ten years</em> of lifespan. That’s far too large an effect to be believable, but any statistically significant result they got from that data set would have to be equally incredible.</p>
<p>Researchers are motivated to discover new and surprising things; and we, as news consumers, are most interested in new and surprising results. The wild overestimates that these low-power studies produce are surprising and counterintuitive precisely because they are <em>false</em>. And because they are surprising and counterintuitive, they tend to draw public attention and show up in the news.</p>
<p>But a surprisingly large result isn’t as counterintuitive as one that’s the opposite of what you expect. (Imagine if a study “proved” that 5-year-olds are taller than 15-year-olds!) And low-power studies give us those results too.</p>
<p>Even if we’re studying something that really does (slightly) increase lifespan, we could get unusually <em>unlucky</em>, and randomly observe a bunch of people who die unusually early. If the data is noisy enough and we get unlucky enough, we can get statistically significant evidence that the effect decreases lifespan, when it really increases it.</p>
<p>We see this in the left tail of Gelman’s graph. <strong>When power is \(\mathbf{0.06}\), almost a quarter of statistically significant results will give you a large effect <em>in the wrong direction</em>.</strong> There’s a substantial chance that we get our result exactly backwards.</p>
<p>Now, a power of \(0.06\) is an extreme case, bad even by the usual standards of underpowered research. But the same problems come up with better-but-still-underpowered studies, just to a lesser degree. In fact, both effects are always <em>possible</em>, if your data is unlucky enough. But we’d much prefer having a \(0.1\)% chance of getting the direction of the effect wrong to having a \(24\)% chance. And the lower the power, the bigger an issue this is.</p>
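<p>These numbers are easy to sanity-check with a quick simulation. The sketch below is my own illustration, not Gelman’s exact model: it assumes each study’s estimate is the true effect plus standard normal noise, with a hypothetical true effect of \(0.3\) standard errors, chosen so that a two-sided test at \(\alpha = 0.05\) has power of roughly \(0.06\).</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: true effect of 0.3 standard errors, which gives a
# two-sided test at alpha = 0.05 a power of roughly 0.06.
true_effect = 0.3
estimates = rng.normal(true_effect, 1.0, 1_000_000)  # one estimate per study
significant = np.abs(estimates) > 1.96

power = significant.mean()
sig = estimates[significant]
type_s = (sig < 0).mean()                  # significant, but the wrong sign
type_m = np.abs(sig).mean() / true_effect  # exaggeration ratio

print(f"power        ≈ {power:.3f}")   # about 0.06
print(f"Type S rate  ≈ {type_s:.2f}")  # about a fifth have the wrong sign
print(f"exaggeration ≈ {type_m:.1f}x") # significant estimates ~8x the truth
```

<p>Under this simple normal model the Type S rate comes out around \(20\)% and the exaggeration ratio around \(8\); Gelman’s figure uses a slightly different setup and gets \(24\)% and \(9\), but the qualitative pattern is identical for any test with power this low.</p>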
<h3 id="file-drawer">The revenge of the file drawer</h3>
<p>There should be a saving grace here: if your study has low power, it’s unlikely to reject the null at all. We don’t have a \(24\)% chance of getting a statistically significant result in the wrong direction; because our power is only \(0.06\), we have a <em>six percent chance of having a \(24\)% chance</em> of getting a statistically significant result in the wrong direction. That’s less than two percent, in total.</p>
<p>But <strong>studies that don’t reject the null often don’t get published at all</strong>. There’s a good chance that the 94 out of every 100 studies that fail to reject the null get stuck in a file drawer somewhere; we’re left with a few studies that reject it, but wildly overestimate the effect, and one or two that reject the null in the wrong direction. When that’s all the information we have, it’s hard to figure out what’s really going on.</p>
<p>Let’s make another table of possible research findings, like the ones <a href="#most-findings-false">we used earlier</a> to see how the file-drawer effect works. But this time, instead of assuming a reasonable power of \(80\)%, let’s see what happens when the power is only \(20\)%. If half the hypotheses are true and half are false, we get something like this:</p>
<table>
<thead>
<tr>
<th> </th>
<th>Null is false</th>
<th>Null is true</th>
<th>Total</th>
</tr>
</thead>
<tbody>
<tr>
<td>Reject the null</td>
<td>20</td>
<td>5</td>
<td>25</td>
</tr>
<tr>
<td>Don’t Reject</td>
<td>80</td>
<td>95</td>
<td>175</td>
</tr>
<tr>
<td>Total</td>
<td>100</td>
<td>100</td>
<td>200</td>
</tr>
</tbody>
</table>
<p>With \(80\)% power, our false-positive rate was \(6\)%. But with \(20\)% power, we have \(20\) true positives and \(5\) false positives, and our false-positive rate has risen to \(5/25 = 20\)%.</p>
<p>And if we also suppose that our researchers are testing unlikely theories and so \(90\)% of null hypotheses are true, we get the following truly terrible table:</p>
<table>
<thead>
<tr>
<th> </th>
<th>Null is false</th>
<th>Null is true</th>
<th>Total</th>
</tr>
</thead>
<tbody>
<tr>
<td>Reject the null</td>
<td>4</td>
<td>9</td>
<td>13</td>
</tr>
<tr>
<td>Don’t Reject</td>
<td>16</td>
<td>171</td>
<td>187</td>
</tr>
<tr>
<td>Total</td>
<td>20</td>
<td>180</td>
<td>200</td>
</tr>
</tbody>
</table>
<p>Under these conditions we get \(9\) false positives and only \(4\) true positives, so almost \(70\)% of our positive results are false positives. If the only results we publish are these exciting positive results, then most published findings will, indeed, be false.</p>
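<p>All of these tables come from the same expected-value arithmetic: choose a power, a significance level, and the fraction of tested hypotheses that are real, and the counts follow. A short sketch of that calculation (using the same hypothetical numbers as the tables above):</p>

```python
def expected_results(power, alpha, frac_real, n_studies=200):
    """Expected true and false positives when testing n_studies hypotheses."""
    real = n_studies * frac_real   # studies where the null is actually false
    null = n_studies - real        # studies where the null is actually true
    true_pos = power * real        # real effects we successfully detect
    false_pos = alpha * null       # flukes that reach significance anyway
    return true_pos, false_pos

# Power 20%, half the hypotheses real: 20 true vs. 5 false positives.
tp, fp = expected_results(0.20, 0.05, 0.5)
print(f"{tp:.0f} true, {fp:.0f} false: {fp / (tp + fp):.0%} of positives are false")

# Power 20%, only 10% of hypotheses real: 4 true vs. 9 false positives.
tp, fp = expected_results(0.20, 0.05, 0.1)
print(f"{tp:.0f} true, {fp:.0f} false: {fp / (tp + fp):.0%} of positives are false")
```

<p>The second case reproduces the table above: \(9\) of \(13\) positive results, about \(69\)%, are false.</p>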
<h3 id="the-problem-of-p-hacking-and-the-garden-of-forking-paths">The problem of \(p\)-hacking and the garden of forking paths</h3>
<p>It seems like we could fix this problem just by publishing null results as well. New norms like <a href="https://en.wikipedia.org/wiki/Preregistration_(science)">preregistration of studies</a> and institutions like <a href="https://www.jasnh.com">The Journal of Articles in Support of the Null Hypothesis</a> try to combat the file drawer bias by publishing studies that don’t reject the null, or at least letting us know they happened so we can count them. If we publish just a quarter of null results, then even under the bad assumptions of the last table we get something like this:</p>
<table>
<thead>
<tr>
<th> </th>
<th>Null is false</th>
<th>Null is true</th>
<th>Total</th>
</tr>
</thead>
<tbody>
<tr>
<td>Reject the null</td>
<td>4</td>
<td>9</td>
<td>13</td>
</tr>
<tr>
<td>Don’t Reject, but Publish</td>
<td>4</td>
<td>43</td>
<td>47</td>
</tr>
<tr>
<td>Don’t Reject or Publish</td>
<td>12</td>
<td>128</td>
<td>140</td>
</tr>
<tr>
<td>Total</td>
<td>20</td>
<td>180</td>
<td>200</td>
</tr>
</tbody>
</table>
<p>We see \(60\) published results. The \(4\) results where the null is false and we reject it are correct, as are the \(43\) where the null is true and we don’t reject it, so over \(70\)% of the published results will be true. If we publish more null results, this number only gets better.</p>
<p>But that doesn’t address the fundamental problem, which is that <em>researchers want to discover new, interesting things</em>. <strong>The fact that we mostly publish positive results that reject the null isn’t some accident of history; it’s a result of people trying to show that their ideas are correct.</strong></p>
<p>Since people want to reject the null hypothesis, they’ll work hard to find ways to do this. When done deliberately, this behavior is a form of research misconduct known as <a href="https://twitter.com/ephemeralidea/status/1504459823554908163">\(p\)-hacking</a> or <a href="https://en.wikipedia.org/wiki/Data_dredging">data dredging</a>. There are a variety of sketchy ways to tweak your statistical analysis to get an artificially low \(p\)-value. The most famous version is just running a bunch of experiments and <a href="https://imgs.xkcd.com/comics/significant.png">only reporting the ones with low \(p\)-values</a>.</p>
<p>Somewhat less famous, and less obvious, is the possibility of running one experiment, and then trying to <em>analyze</em> that data in a bunch of different ways and picking the one that makes your position look the best. We actually saw an example of this in <a href="hypothesis-testing-part-1#mileage">part 1</a> of this series, when I looked at my car’s gas mileage. I computed the \(p\)-value in two different ways, and got either \(0.0006\) or \(0.00004\). Either one of these is significant, but if they had been \(0.06\) and \(0.004\) instead, I could have just reported the second one and said “hey look, my data was significant!”</p>
<p>Moreover, it’s pretty common for people to look for secondary, “interaction” effects after looking for a main effect. Sure, watching a five-minute video didn’t have a statistically significant effect on depression in your study group. But maybe it worked on just the women? Or just the Asians? What if we control for income? You can check all the subgroups of your study, and whichever one reaches significance is <em>obviously</em> the interesting one.</p>
<p><a href="https://xkcd.com/1478/"><img src="https://imgs.xkcd.com/comics/p_values.png" alt="XKCD comic, translating p-values into verbal interpretations: &quot;highly significant&quot;, &quot;significant&quot;, &quot;on the edge of significance&quot;. For a high p-value the interpretation is &quot;hey, look at this interesting subgroup analysis&quot;." class="blog-image center" /></a>
<em class="blog-image center">Sometimes your treatment really does have an effect on one specific subgroup. But it’s also an easy out when your main study didn’t reach significance.</em></p>
<p>This approach of doing multiple subgroup analyses, but only reporting one is still research misconduct, if done on purpose. But <strong>it’s possible to get the same effect without actually performing multiple analyses, in a process that Andrew Gelman and Eric Loken call the <a href="https://www.americanscientist.org/article/the-statistical-crisis-in-science">garden of forking paths</a>.</strong></p>
<p>Researchers often make decisions about how to test the data after looking at it for broad trends. If they notice one subgroup obviously sticking out, maybe they want to test it. Or they can tweak some minor parameters, decide to include or exclude outliers, and consider a few minor variations in the way they divide subjects into categories. This is all a reasonable way of looking at data, but it’s a violation of the rules of hypothesis testing, and has the same basic effect as running a bunch of experiments and only reporting the best one.</p>
<p>Most subtly, sometimes more than one pattern will provide support for the researcher’s hypothesis. We generally don’t actually care about specific statistical relationships; we care about broader questions, like “does media consumption affect rates of depression?”<strong title="This difference is the source of a lot of research pitfalls; if you want to dig into this more, I recommend [Tal Yarkoni] on generalizability, [Vazire, Schiavone, and Bottesini] on the four types of validity, and [Scheel, Tiokhin, Isager, and Lakens] on the derivation chain."><sup id="fnref:8"><a href="#fn:8" class="footnote">8</a></sup></strong> <strong>We run specific experiments in order to test these broad questions. And if there are, say, twenty different outcomes that would support our broad theoretical stance, it doesn’t help us very much that each one only has \(\mathbf{5}\)% odds of happening by chance.</strong></p>
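<p>The arithmetic behind that last point is stark. If each of \(k\) comparisons has a \(5\)% chance of reaching significance by pure luck, the chance that <em>at least one</em> does grows quickly with \(k\). (This sketch assumes the comparisons are independent, which real subgroup analyses aren’t, but the inflation is the same in kind.)</p>

```python
alpha = 0.05  # per-comparison false positive rate

# Chance that at least one of k independent comparisons is a fluke
for k in (1, 5, 20):
    print(f"{k:2d} comparisons: {1 - (1 - alpha) ** k:.0%}")
```

<p>With twenty available comparisons, the chance of at least one spurious “significant” result is about \(64\)%: worse than a coin flip, from an analysis that nominally controls errors at \(5\)%.</p>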
<p>Gelman and Loken describe how this applies to research by Daryl Bem, which claims to provide strong evidence for ESP.<strong title="Scott Alexander [has pointed out] that ESP experiments are a great test case for our scientific and statistical methods, because we have extremely high confidence that we already know the true answer."><sup id="fnref:9"><a href="#fn:9" class="footnote">9</a></sup></strong></p>
<blockquote>
<p>In his first experiment, in which 100 students participated in visualizations of images, he found a statistically significant result for erotic pictures but not for nonerotic pictures….</p>
</blockquote>
<blockquote>
<p>But consider all the other comparisons he could have drawn: If the subjects had identified all images at a rate statistically significantly higher than chance, that certainly would have been reported as evidence of ESP. Or what if performance had been higher for the nonerotic pictures? One could easily argue that the erotic images were distracting and only the nonerotic images were a good test of the phenomenon. If participants had performed statistically significantly better in the second half of the trial than in the first half, that would be evidence of learning; if better in the first half, evidence of fatigue.</p>
</blockquote>
<blockquote>
<p>Bem insists his hypothesis “was not formulated from a post hoc exploration of the data,” but a data-dependent analysis would not necessarily look “post hoc.” For example, if men had performed better with erotic images and women with romantic but nonerotic images, there is no reason such a pattern would look like fishing or p-hacking. Rather, it would be seen as a natural implication of the research hypothesis, because there is a considerable amount of literature suggesting sex differences in response to visual erotic stimuli. The problem resides in the one-to-many mapping from scientific to statistical hypotheses.</p>
</blockquote>
<p>We even saw an example of forking paths earlier in this essay, in the <a href="#where-does-power-come-from">study of gubernatorial lifespans</a>. I said the study found a clever analysis to get a significant result. In the data set we saw from Barfort, Klemmensen, and Larsen, the obvious tests like linear regression don’t show any effect of winning margin on lifespan.</p>
<p class="blog-image center"><img src="/assets/blog/hypothesis-testing/governor-life-expectancy-loess.png" alt="The same scatterplot of &quot;Percentage vote margin&quot; on the x-axis and &quot;Years alive after election&quot; on the y-axis. This time a best-fit loess curve is drawn through the data; it again shows no real relationship." class="blog-image center" />
<em>A loess curve is a more sophisticated version of linear regression. It doesn’t show a clear relationship between electoral margin and lifespan. Graph again <a href="https://statmodeling.stat.columbia.edu/2020/07/02/no-i-dont-believe-that-claim-based-on-regression-discontinuity-analysis-that/">by Andrew Gelman</a>.</em></p>
<p>But if you average different candidates with the same electoral margin together, divide them into a group of winners and a group of losers, and then do a regression on each group separately, the two regressions suggest that barely winning a race improves life expectancy, versus barely losing.</p>
<p class="center"><img src="/assets/blog/hypothesis-testing/governor-regression-discontinuity.png" alt="A figure from the Barfort, Klemmensen, and Larsen paper on gubernatorial elections and lifespan, showing their regression discontinuity analysis. It shows lifespan decreasing with increased voteshare, except with a large upwards discontinuity at the crossover from losing to winning." class="blog-image" /></p>
<p class="center blog-image"><em>The discontinuity between the two lines is large enough to be “statistically significant”. But does the data on the right really look qualitatively different from the data on the left?</em></p>
<p>This <a href="https://en.wikipedia.org/wiki/Regression_discontinuity_design">regression discontinuity design</a> isn’t a ridiculous approach to the question, but it’s also probably not the first idea you’d think of. And the paper’s own abstract says they’re not sure which way the effect should run, so <em>any pattern at all</em> would provide support for their research hypothesis. This is a subtle but crucial violation of the hypothesis testing framework, and dramatically inflates the rate of “positive” results.</p>
<h2 id="sowhy-does-science-work-at-all">So…why does science work <em>at all</em>?</h2>
<p>Hopefully I’ve convinced you, first, that the tools of modern hypothesis testing are badly suited for the questions we want them to answer, and second, that the structure of our scientific institutions leads us to regularly misuse them in ways that make them even more misleading. So then, how do we manage to learn anything at all?</p>
<p>Sometimes we don’t! The whole point of the “replication crisis” is that we’re almost having to throw out entire fields wholesale. <strong>When I hear about a promising new drug, or a cool new social psychology study, I <em>assume it’s bullshit</em>, because so many of them are. And that’s a real crisis for the whole idea of “scientific knowledge”.</strong></p>
<p>But in many fields of study we do, in fact, manage to learn things. We know enough physics and chemistry to build things like spaceships and smartphones. And even though a lot of drug studies are nonsense, modern medicine does in fact work.</p>
<p class="center"><img src="/assets/blog/hypothesis-testing/life-expectancy-at-age-10.png" alt="A graph from Our World In Data of life expectancy at age ten in various countries, from 1750 to the present. There is a dramatic increase over the 20th century." class="blog-image" /></p>
<p class="center blog-image"><em>We didn’t increase life expectancy by almost thirty years without learning</em> something <em>about biology.</em></p>
<p>And even in more vulnerable fields like psychology and sociology, we have developed a lot of consistent, replicable, useful knowledge. How did we get that to work, despite our shoddy statistics?</p>
<h3 id="inter-ocular-trauma">Inter-ocular trauma</h3>
<p>If your data are good enough, you can get away with having crappy statistics. One of the best and most useful statistical tools is what Joe Berkson called the <a href="https://stats.stackexchange.com/questions/458069/source-for-inter-ocular-trauma-test-for-significance">inter-ocular traumatic test</a>: “you know what the data mean when the conclusion hits you between the eyes”.</p>
<p><a href="https://xkcd.com/2400/"><img src="https://imgs.xkcd.com/comics/statistics.png" alt="XKCD 2400: graph of covid vaccine efficacy versus placebo. &quot;Statistics tip: always try to get data that's good enough that you don't need to do statistics on it.&quot;" style="max-width:800px;" class="blog-image center" /></a></p>
<p class="center blog-image"><em>I didn’t worry that</em> this <em>result was bullshit statistical trickery, because I can easily see the evidence for myself.</em></p>
<p>Conversely, if your data isn’t very good, statistics can’t help you with it very much. John Tukey <a href="https://doi.org/10.2307/2683137">famously wrote</a>:</p>
<blockquote>
<p>The data may not contain the answer. The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.</p>
</blockquote>
<p>None of this means statistics is useless. But if we can consistently get good, high-quality data, we can afford a little sloppiness in our statistical methodology.</p>
<h3 id="putting-the-replication-in-replication-crisis">Putting the “replication” in “replication crisis”</h3>
<p>And this is where the “replication” half of “replication crisis” comes in. <strong>If the signal you’re detecting is real, you can run another experiment, or do another study, and (probably) see the same thing.</strong> In my <a href="https://jaydaigle.net/blog/replication-crisis-math/">post on the replication crisis</a> I wrote about how mathematicians are constantly replicating our important results, just by reading papers; and that protects us from a lot of the flaws plaguing social psychology.</p>
<p>Gelman recently <a href="https://statmodeling.stat.columbia.edu/2022/03/04/biology-as-a-cumulative-science-and-the-relevance-of-this-idea-to-replication/">made a similar point</a> about fields like biology. Because wet lab biology is cumulative, people are continually replicating old work in the process of trying to do new work. A boring false result can survive for a long time, if no one cares enough to use it; an exciting false result will be exposed quickly when people try to build on it and it collapses under the strain.</p>
<p>This is something Fisher himself wrote about clearly and firmly: “A scientific fact should be regarded as experimentally established only if a properly designed experiment <em>rarely fails</em> to give this level of significance”. That is, we shouldn’t accept a result when we successfully do <em>one</em> experiment that produces a low \(p\)-value; but we should listen when we can <em>consistently</em> do experiments with low \(p\)-values.</p>
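<p>Fisher’s “rarely fails” standard works because independent replications multiply. A back-of-envelope sketch, under the idealized assumption of independent studies that each have \(80\)% power and a \(5\)% false positive rate:</p>

```python
alpha = 0.05  # chance a fluke reaches significance in one study
power = 0.80  # chance a real effect reaches significance in one study

# Probability of k significant results in a row, real effect vs. fluke
for k in (1, 2, 3):
    print(f"{k} study(ies): real effect {power ** k:.2f}, fluke {alpha ** k:.6f}")
```

<p>After just two studies, a real effect still comes up significant \(64\)% of the time, while a fluke survives only \(0.25\)% of the time. Consistent replication separates the two far more sharply than any single \(p\)-value can.</p>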
<p><strong>But the entire concept of “replication” is in opposition to the artificial decisiveness of Neyman-Pearson hypothesis testing.</strong> The Neyman-Pearson method, if taken seriously, asks us to fully commit to believing a theory if our experiment comes up with \(p=0.049\); but that attitude is <em>utterly terrible science</em>. Good scientific practice <em>needs</em> to be able to hold beliefs lightly, revise them when new evidence comes in, and carefully build up solid foundations that can support further work.</p>
<p>The standard approach to hypothesis testing isn’t designed for that. Next time, in <a href="https://jaydaigle.net/blog/hypothesis-testing-part-3/">part 3</a>, we’ll look at some tools that are.</p>
<hr />
<p><em>Have questions about hypothesis testing? Is there something I didn’t cover, or even got completely wrong? Or is there something you’d like to hear more about in the rest of this series? Tweet me <a href="https://twitter.com/profjaydaigle">@ProfJayDaigle</a> or leave a comment below.</em></p>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>This is the probability of getting data five standard deviations away from the mean. So you’ll often see this reported as a significance threshold of \(5 \sigma\). Related are the <a href="https://en.wikipedia.org/wiki/Six_Sigma">Six Sigma techniques</a> for ensuring manufacturing quality, though somewhat counterintuitively they typically only aim for <a href="https://en.wikipedia.org/wiki/Six_Sigma#Role_of_the_1.5_sigma_shift">4.5 \(\sigma\)</a> of accuracy. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>It is common for people to be sloppy here and say they “accept” the null. In fact, I wrote that in my first draft of this paragraph. But it’s bad practice to say that, because even a very high \(p\)-value doesn’t provide good evidence that the null hypothesis is true. Our methods are designed to default to the null hypothesis when the data is ambiguous.</p>
<p>Neyman <em>did</em> use the phrase “accept the null”, but in the context of a decision process, where “accepting the null” means taking some specific, concrete action implied by the null, rather than more generally committing to believe something. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3">
<p>Andrew Gelman suggests a helpful <a href="https://statmodeling.stat.columbia.edu/2016/01/26/more-power-posing/">time-reversal heuristic</a>: what would you think if you saw the same studies in the opposite order? You’d start with a few large studies establishing no effect, followed by one smaller study showing an effect. In theory that gives you the exact same information, but in practice people would treat it very differently—assuming the first studies <a href="#file-drawer">actually got published</a>. <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
<li id="fn:4">
<p>You might recognize this as an application of Bayes’s theorem, and a basic example of <a href="https://jaydaigle.net/blog/overview-of-bayesian-inference/">Bayesian inference</a>. Tables like these are very common in Bayesian calculations. <a href="#fnref:4" class="reversefootnote">↩</a></p>
</li>
<li id="fn:5">
<p>Followups to Ioannidis’s paper contend that only about \(14\)% of published biomedical findings are actually false. I’m not in a position to comment on this one way or the other. In psychology, different studies estimate that somewhere <a href="https://en.wikipedia.org/wiki/Replication_crisis#In_psychology">between \(36\)% and \(62\)%</a> of published results replicate. <a href="#fnref:5" class="reversefootnote">↩</a></p>
</li>
<li id="fn:6">
<p>We can also base it on <a href="https://twitter.com/lakens/status/1524799540250959873">how big of an effect we <em>care</em> about</a>. If we’re studying reaction times, we might decide that an effect smaller than ten milliseconds is irrelevant, and we don’t care about it even if it’s real. Then we can choose a study with enough power to detect a \(10\)<em>ms</em> effect at least \(80\)% of the time.</p>
<p>But this brings us back to the core issue, that “is there an effect” just isn’t a great question, and the Neyman-Pearson method isn’t a great tool for answering it. <a href="#fnref:6" class="reversefootnote">↩</a></p>
</li>
<li id="fn:7">
<p>Clever analyses like this are often a bad idea; we’ll come back to this idea <a href="#file-drawer">soon</a>. <a href="#fnref:7" class="reversefootnote">↩</a></p>
</li>
<li id="fn:8">
<p>This difference is the source of a lot of research pitfalls; if you want to dig into this more, I recommend <a href="https://psyarxiv.com/jqw35">Tal Yarkoni</a> on generalizability, <a href="https://psyarxiv.com/bu4d3/">Vazire, Schiavone, and Bottesini</a> on the four types of validity, and <a href="https://journals.sagepub.com/doi/10.1177/1745691620966795">Scheel, Tiokhin, Isager, and Lakens</a> on the derivation chain. <a href="#fnref:8" class="reversefootnote">↩</a></p>
</li>
<li id="fn:9">
<p>Scott Alexander <a href="https://slatestarcodex.com/2014/04/28/the-control-group-is-out-of-control/">has pointed out</a> that ESP experiments are a great test case for our scientific and statistical methods, because we have extremely high confidence that we already know the true answer. <a href="#fnref:9" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Jay DaigleThis is the second part of a three-part series on hypothesis testing. Today we'll look at the way we do hypothesis testing in practice, and how it tends to fail. Modern researchers use hypothesis testing as a tool to develop knowledge, but it's really a tool for making decisions, and so it encourages us to draw strong conclusions from weak evidence. It also encourages us to view studies that don't reject the null hypothesis as failures, which leads even honest and dedicated researchers to do shoddy research, producing "statistically significant" results that can't be reproduced.Hypothesis Testing and its Discontents, Part 1: How is it Supposed to Work?2022-03-31T00:00:00-07:002022-03-31T00:00:00-07:00https://jaydaigle.net/blog/hypothesis-testing-part-1<p>In my <a href="https://jaydaigle.net/blog/replication-crisis-math/">last post on the replication crisis</a>, I mentioned the basic ideas of <a href="https://en.wikipedia.org/wiki/Statistical_hypothesis_testing">statistical hypothesis testing</a>. There wasn’t room to give a full explanation in that post, but hypothesis testing is worth understanding, since it’s the foundation of most modern scientific research. It’s a powerful tool, but also incredibly easy to misunderstand and misuse.</p>
<p>This post is the first part of a three-part series explaining what hypothesis testing is and how it works. In this essay I’ll talk about the way hypothesis testing developed historically, in two rival schools of thought. I’ll explain how these two methodologies were originally supposed to work, and why you might (or might not) want to use them. In <a href="/blog/hypothesis-testing-part-2">Part 2</a> I’ll talk about how we do significance testing in practice today, and how that often goes wrong. And in <a href="https://jaydaigle.net/blog/hypothesis-testing-part-3/">Part 3</a> I’ll talk about alternatives to hypothesis testing that can help us avoid replication crisis-type problems.</p>
<h2 id="choose-your-question">Choose your question</h2>
<p>Perhaps the most important step in using math to solve real-world problems is figuring out precisely <a href="https://jaydaigle.net/blog/why-word-problems/">what question you want to ask</a>. Now, there’s a sense in which this process isn’t mathematical. Math can’t tell you, say, whether you want your clothing to be more comfortable or more stylish. No amount of math can tell you how you value inequality versus growth, or whether you’re willing to risk major side effects from an experimental medical treatment.</p>
<p>But math can help you figure out what question you’re asking, by clarifying exactly what questions you <em>could</em> be trying to answer, what their implications are, and what options you have for answering them. The history of hypothesis testing is a debate between people trying to answer different questions, but also a debate about which questions are the most fruitful to ask. Do we want to test a scientific principle? Record a precise measurement? Make a decision?</p>
<p>The statistical tools we use today were developed by specific people,<strong title="Some of these specific people were [pretty awful in one way or another]. Ronald Fisher in particular was [racist] and a [vigorous defender of tobacco companies], though Jerzy Neyman seems to have been [perfectly lovely]. I'm not going to go into detail about their failings, among other things because I'm not especially well-informed on the subject; I recommend the articles I linked if you want to know more."><sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup></strong> at specific times, to answer specific questions. So I want to start off by asking some of those specific questions, and see how early statisticians would approach them and what ideas they developed in response.<strong title="Much of this essay, and especially the historical information on the way these schools of thought developed, draws heavily on the article [Confusion Over Measures of Evidence (p's) Versus Errors (α's) in Classical Statistical Testing] by Hubbard and Bayarri. This extremely readable article is also a fascinating historical artifact, basically predicting the entire contour of the replication crisis in 2003."><sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup></strong> After we’ve seen how Fisher’s significance testing and the Neyman-Pearson hypothesis testing framework worked in their original contexts, we can talk about what questions each tool is best suited to answer, and what types of question neither tool can really handle.</p>
<h2 id="fishers-significance-testing">Fisher’s Significance Testing</h2>
<h3 id="mileage">Are You Surprised?</h3>
<p>In 2016 I got a new car with a fancy new electronic system. And one of the new features was a meter that kept track of my gas mileage. It was fun to watch the mileage adjust as I was driving. (And I may have gotten a little obsessed with trying to eke out another tenth of a mile per gallon by driving funny.)</p>
<p>But how accurate is that mileage number? In 2019 my friend Casey suggested an experiment to me and I decided to try it. For several months, every time I filled up my gas tank, I recorded the mpg number from my car dashboard. I also recorded the number of miles I’d driven and the number of gallons of gas I’d used, which let me calculate the mpg directly.</p>
<table class="smalltable">
<thead>
<tr>
<th>Miles Driven</th>
<th>Gallons</th>
<th>Calculated MPG</th>
<th>Dashboard MPG</th>
<th>Difference</th>
</tr>
</thead>
<tbody>
<tr>
<td>340.7</td>
<td>10.276</td>
<td>33.2</td>
<td>34.2</td>
<td>1.0</td>
</tr>
<tr>
<td>300.1</td>
<td>8.97</td>
<td>33.5</td>
<td>34.7</td>
<td>1.2</td>
</tr>
<tr>
<td>232.6</td>
<td>8.04</td>
<td>28.9</td>
<td>29.0</td>
<td>0.1</td>
</tr>
<tr>
<td>261.8</td>
<td>8.5</td>
<td>30.8</td>
<td>31.1</td>
<td>0.3</td>
</tr>
<tr>
<td>301.3</td>
<td>9.316</td>
<td>32.3</td>
<td>32.5</td>
<td>0.2</td>
</tr>
<tr>
<td>505.1</td>
<td>15.127</td>
<td>33.4</td>
<td>34.8</td>
<td>1.4</td>
</tr>
<tr>
<td>290.3</td>
<td>9.814</td>
<td>29.6</td>
<td>30.3</td>
<td>0.7</td>
</tr>
<tr>
<td>290.2</td>
<td>8.566</td>
<td>33.9</td>
<td>34.9</td>
<td>1.0</td>
</tr>
<tr>
<td>294.9</td>
<td>9.005</td>
<td>32.7</td>
<td>32.8</td>
<td>0.1</td>
</tr>
<tr>
<td>301.4</td>
<td>9.592</td>
<td>31.4</td>
<td>32.0</td>
<td>0.6</td>
</tr>
<tr>
<td>230.9</td>
<td>7.643</td>
<td>30.2</td>
<td>32.0</td>
<td>1.8</td>
</tr>
<tr>
<td>269.2</td>
<td>8.644</td>
<td>31.1</td>
<td>30.8</td>
<td>-0.3</td>
</tr>
<tr>
<td>267</td>
<td>8.327</td>
<td>32.1</td>
<td>32.6</td>
<td>0.5</td>
</tr>
<tr>
<td>319.7</td>
<td>9.42</td>
<td>33.9</td>
<td>34.7</td>
<td>0.8</td>
</tr>
<tr>
<td>314.3</td>
<td>9.868</td>
<td>31.9</td>
<td>33.3</td>
<td>1.4</td>
</tr>
<tr>
<td>264.4</td>
<td>8.693</td>
<td>30.4</td>
<td>31.7</td>
<td>1.3</td>
</tr>
<tr>
<td>273</td>
<td>9.229</td>
<td>29.6</td>
<td>30.4</td>
<td>0.8</td>
</tr>
<tr>
<td>320.2</td>
<td>9.618</td>
<td>33.3</td>
<td>33.3</td>
<td>0.0</td>
</tr>
</tbody>
</table>
<p>These numbers show that my car reported a better mileage than I actually got almost every time. Out of eighteen measurements, my car overestimated sixteen times, underestimated once, and was accurate to one decimal place once. But was this tendency toward overestimation a coincidence? Is my car’s mileage calculation biased high, or did I just get weirdly unlucky?</p>
<p>We can try to get a sense of how easily this could have happened by chance. We took eighteen measurements, and sixteen of them were high. (One was a tie, but we’ll be generous and count it as “not high”.) If the car is equally likely to guess high or low, this is like flipping a coin eighteen times and getting sixteen heads. That’s pretty unlikely: the probability is about \(0.0006\), or \(0.06\)%, or about one in \(1700\). It’s still <em>possible</em> that my car is unbiased and I just got unlucky. But if so, I was extremely unlucky.</p>
<p class="center"><img src="/assets/blog/hypothesis-testing/star-wars-asteroids.jpeg" alt="Screenshot from The Empire Strikes Back, with dialog: &quot;Sir, the possibility of successfully navigating an asteroid field is approximately 3,720 to 1!&quot;" width="75%" /></p>
<p class="center"><em>But still only half as unlucky as Han Solo’s enemies.</em></p>
<h3 id="what-is-a-significance-test">What is a significance test?</h3>
<p>We call this approach a <em>significance test</em>. It was developed by <a href="https://en.wikipedia.org/wiki/Ronald_Fisher">Ronald Fisher</a>, building on work by <a href="https://en.wikipedia.org/wiki/Karl_Pearson">Karl Pearson</a> and <a href="https://en.wikipedia.org/wiki/Student%27s_t-distribution">William Sealy Gosset</a>.</p>
<p>We start by formulating a <em>null hypothesis</em> that represents some form of “expected” behavior, which we call \(H_0\). In this case, I expected<strong title="Okay, maybe I didn't _actually_ expect my car to be accurate and unbiased. But it's at least _supposed_ to be true, so it provides a good baseline for comparison."><sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup></strong> my car to correctly measure my gas mileage, without consistent bias in either direction. There are a few ways to make that expectation mathematically precise; in the example above, my precise hypothesis was “an overestimate is just as likely as an underestimate”, or more formally, \(P(\text{overestimate}| H_0 ) = 0.5 \).</p>
<p>(There are other ways to formalize my expectations here. I ignored the size of the errors, and just looked at whether the measured mileage was better or worse than the mileage I calculated. But with a more complicated statistical tool called a <a href="https://en.wikipedia.org/wiki/Student's_t-test">paired \(t\)-test</a> we can use the exact numbers to get a bit more information out of our measurements. When I do this, I get a \(p\)-value of \(0.00004\), or \(0.004\)%—an order of magnitude lower than my first figure.)</p>
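<p>For the curious, here is roughly where that figure comes from: the paired \(t\)-statistic is just a one-sample \(t\)-statistic computed on the differences in the table’s last column. A stdlib-only sketch (a full test would then look this statistic up in a \(t\)-distribution with \(17\) degrees of freedom, as <code>scipy.stats.ttest_rel</code> does):</p>

```python
from math import sqrt
from statistics import mean, stdev

# Differences (dashboard MPG minus calculated MPG) from the table above
diffs = [1.0, 1.2, 0.1, 0.3, 0.2, 1.4, 0.7, 1.0, 0.1,
         0.6, 1.8, -0.3, 0.5, 0.8, 1.4, 1.3, 0.8, 0.0]

# The paired t-statistic: mean difference divided by its standard error,
# with n - 1 = 17 degrees of freedom
n = len(diffs)
t = mean(diffs) / (stdev(diffs) / sqrt(n))
print(round(t, 2))  # → 5.27
```

<p>A \(t\)-statistic above \(5\) with \(17\) degrees of freedom corresponds to a \(p\)-value in the \(10^{-5}\) range, which is where the \(0.00004\) comes from.</p>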
<p>Once we have a null hypothesis, <strong>we compute how unlikely the measurement we actually got would be, if we assume the null hypothesis is true</strong>. And if that sentence looks confusing and grammatically tangled, there’s a reason for that: while this process is absolutely unambiguous mathematically, it has nested “if-then” statements that are hard to think clearly about and don’t translate easily into English. In mathematical notation, we want \( P( \text{measurement} \mid H_0 ) \), which we can read as the probability of seeing our measurement given the null hypothesis.</p>
<p>There are a couple of subtle points here, so I want to be super explicit and run them into the ground. The first is that we need to be careful about what we mean by “how unlikely our result is”, because any <em>specific</em> result is extremely unlikely. The odds of getting the exact sequence I got in my experiment—HHHHHH HHHHHT HHHHHT—are exactly \(1\) in \(2^{18}\), and that’s not because my sequence is special. If you pick any specific sequence, whether it’s all heads like HHHHHH HHHHHH HHHHHH, or half-and-half like HTHTHT HTHTHT HTHTHT, or something totally random like HHHTHT HHHHTT HTTHTT, the odds of getting those exact flips in that exact order are \(1\) in \(2^{18}\).</p>
<p class="center"><img src="/assets/blog/hypothesis-testing/coin-flips.png" alt="A picture of eighteen flipped coins" class="blog-image" /></p>
<p class="center"><em>The probability of getting these exact flips in this exact order is \(1\) in \(2^{18}\), or about \(0.000004\).</em></p>
<p>But that doesn’t tell us anything useful! Fortunately, in the context of hypothesis testing, we can do something smarter. It doesn’t really matter what <em>order</em> we get the heads in; it just matters how many we get, because that tells us how often the car is overestimating my mileage. So we can compute the odds of getting sixteen heads in any order. And getting seventeen or eighteen heads would be even <em>more</em> unlikely, so we include those as well: what we wind up computing is the odds of getting \(16\), \(17\), or \(18\) heads. That’s how I got the number \(0.0006\) earlier.</p>
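<p>We can check that arithmetic directly. A quick sketch, using nothing but the coin-flip model described above:</p>

```python
from math import comb

# P(at least 16 heads in 18 fair flips): add up the three tail cases.
# Each specific sequence has probability 1/2^18, and there are
# comb(18, k) sequences with exactly k heads.
p = sum(comb(18, k) for k in (16, 17, 18)) / 2**18
print(p)  # → 0.0006561279296875, the “about 0.0006” quoted above
```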
<p>We say that we want to compute the chance of getting a result <em>at least as bad</em> as what we got. But that requires us to decide what counts as “better” or “worse”; and that depends on what question we’re actually trying to ask. In this context, I’m testing the null hypothesis that my car underestimates as often as it overestimates, so I can basically order the possible results from “most overestimation” to “most underestimation” and find the probability of overestimating at least as often as my car actually did.</p>
<h3 id="what-we-dont-learn">What we don’t learn</h3>
<p>Another subtle point, but an absolutely vital one, is that <strong>the \(p\)-value does <em>not</em> tell us how likely the null hypothesis is to be true</strong>. When we say that \(p = 0.0006\) that does <em>not</em> mean that there’s only a \(0.06\)% chance that my car is accurate! It just measures how unusual my evidence is, <em>if</em> the null hypothesis is true.</p>
<p>Often the question we really care about is how likely the null hypothesis is to be true. There are in fact ways to try to address that directly, which I’ll discuss in Part 3 of this series. But answering that question requires a lot more information than we usually have; Fisher’s significance test doesn’t try. <strong>It just assumes the null hypothesis is true</strong>, and tells us how weird that makes the result look.</p>
<p>Significance testing does numerically measure the strength of the experimental evidence we got: the lower the \(p\)-value, the stronger our evidence. But it doesn’t try to account for any <em>other</em> evidence we have, whether against the null hypothesis or for it. If I get a coin from the bank, flip it ten times and get ten heads, I get \(p \approx 0.001\) for the null hypothesis that it’s a normal coin. But I still expect it to be normal, because most coins are. And if I pick it up and see that it has a normal “tails” side, I’ll be really confident that I just got weirdly lucky<strong title="You might worry about whether it's a two-sided but biased coin. But Gelman and Nolan have argued that [coins physically can't be biased], and I find their argument compelling. If you don't find it compelling, you have to decide how likely you think a weighted coin would be—which is exactly the &quot;other evidence&quot; that Fisher's paradigm doesn't even try to account for."><sup id="fnref:4"><a href="#fn:4" class="footnote">4</a></sup></strong>.</p>
<p>And that’s why <strong>the analysis of my gas mileage above didn’t really have a firm conclusion</strong>. We got a \(p\)-value of \(0.0006\), and determined: “huh, that’s kinda funny”. <em>Either</em> our null hypothesis was false, <em>or</em> something extremely unusual happened. But <strong>the math doesn’t tell us which of those two things to believe</strong>.</p>
<p>And in the case of my car, it doesn’t need to. On the one hand, I’m not all that surprised if the mileage calculator is a little wrong; the super-low \(p\)-value just reinforces what I already suspected. And on the other hand, I’m not really going to do anything different if my mileage is half an mpg lower than my dashboard says. I’m not going to sue Honda, or lead an activist campaign, or try to raise awareness about faulty mileage estimates.</p>
<p>But if I really cared, I could run more experiments. I got \(p = 0.0006\) in my first experiment; but I could do the experiment again. If I get \( p = 0.31\) next time, maybe I should assume the first result was just a fluke. But if I get \(p = 0.0003\) and then \( p = 0.0008\) I’ll see a pattern. And that pattern would make a convincing argument that my car is lying to me.</p>
<p class="center"><img src="/assets/blog/hypothesis-testing/omg-kitten.jpg" alt="picture of a shocked kitten: &quot;OMG I knew it!&quot;" class="blog-image" /></p>
<p>In “The Arrangement of Field Experiments”, Fisher writes that “A scientific fact should be regarded as experimentally established only if a properly designed experiment <em>rarely fails</em> to give this level of significance”. (Italics in the original.) That is, <a href="https://slatestarcodex.com/2014/12/12/beware-the-man-of-one-study/">no one experiment should convince us of anything</a>. Instead, <strong>we should believe our results when we can reliably design experiments that give the same results</strong> (which is arguably the point that we <a href="https://jaydaigle.net/blog/science-vs-engineering/">pass from science to engineering</a>).<strong title="A friend asks if meta-analysis accomplishes the same thing, but meta-analysis is actually a much weaker threshold than the one Fisher gives here. Meta-analysis tries to amplify weak signals and reconcile inconsistent results; Fisher says we should only believe a claim when we can consistently get a strong signal."><sup id="fnref:5"><a href="#fn:5" class="footnote">5</a></sup></strong></p>
<p>But that’s a slow, grinding, painstaking process. And it still doesn’t give us a rule for when to pull the trigger! We just gradually believe the null hypothesis less and less as we collect more data. That’s perfectly fine for doing basic science—maybe even ideal.</p>
<p>But what if the stakes are higher, and more immediate? Sometimes we need to make a real decision, now, with the data we have. So what do we do?</p>
<h2 id="neyman-pearson-hypothesis-testing">Neyman-Pearson Hypothesis Testing</h2>
<h3 id="time-to-make-a-choice">Time to make a choice</h3>
<p>Suppose we’re studying a new drug, which we hope will prevent deaths from cancer. We can collect data on how effective the drug seems to be in trials, but just reporting a \(p\)-value isn’t enough. At some point <strong>we have to make a <em>decision</em>: should we give people the drug, or not?</strong> And Fisher’s methods don’t answer that.<strong title="From what I understand, Fisher was a little contemptuous of the idea that you could answer this question mathematically."><sup id="fnref:6"><a href="#fn:6" class="footnote">6</a></sup></strong></p>
<p><a href="https://en.wikipedia.org/wiki/Jerzy_Neyman">Jerzy Neyman</a> and <a href="https://en.wikipedia.org/wiki/Egon_Pearson">Egon Pearson</a> (the son of Karl Pearson) decided to attack that question head-on. They began by observing that there are two different mistakes we could make, which they called “Type I” and “Type II” errors.</p>
<p>These names are infamously unmemorable, but in their original context they make perfect sense: <strong>whichever mistake we most want to avoid is the “first type” of mistake</strong>. For drug testing, there’s a widespread consensus that it’s worse to prescribe a drug that doesn’t work, or has nasty side effects, than it is to withhold a drug that works as expected.<strong title="I'm not convinced I agree with this, but that's beside the point here. I'll discuss this choice a bit more in Part 2 of this series."><sup id="fnref:7"><a href="#fn:7" class="footnote">7</a></sup></strong> So the Type I error would be prescribing a drug that doesn’t work, and the Type II error would be failing to prescribe a drug that does work. This means we can take “the drug doesn’t work” as our null hypothesis \(H_0\). <strong>But we can contrast our null hypothesis with a specific alternative: that the drug does, in fact, work</strong>. We call this our “alternative hypothesis” \(H_A\). And we get the following classic chart:</p>
<table>
<thead>
<tr>
<th> </th>
<th>Null Hypothesis is false <br /> (Drug works)</th>
<th>Null Hypothesis is true <br /> (Drug doesn’t work)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Give the drug <br /> (Reject the Null)</td>
<td>Correct decision <br /> “True Positive”</td>
<td>First, worse error <br /> (Type I Error) <br /> “False Positive”</td>
</tr>
<tr>
<td>Don’t give the drug <br /> (Don’t Reject)</td>
<td>Second, less bad error <br /> (Type II Error) <br /> “False Negative”</td>
<td>Correct Decision <br /> “True Negative”</td>
</tr>
</tbody>
</table>
<p>This leaves us with a problem. There are two different mistakes we could make. And without getting better data, we can only reduce the Type II errors by increasing the Type I errors: if we’re more generally willing to say “yes, prescribe the drug”, we’ll say “yes” more often when the drug works, but also when it doesn’t. We need to strike some sort of balance between the two risks. But how?</p>
<p><strong>There’s no abstract, mathematical answer to this question; it depends on the specific, practical consequences of the decision we’re making</strong>, and how much we care about the specific trade-offs in play. We already said that a Type I error is worse than a Type II error—but by how much? Is it two times as bad? Five? Ten? We have to decide exactly how we weigh the two risks against each other.</p>
<p>In drug testing, a Type I error means spending money on drugs that don’t work and might hurt people. A Type II error means people don’t get treatment that would help them. If a disease is really bad, we’re more willing to make Type I errors, because a drug that <em>might</em> kill you compares favorably to a disease that <em>definitely</em> will. If a drug is really expensive, or has bad side effects, we might be more willing to make Type II errors, because people will be hurt more by letting a bad drug slip through. And there are dozens more factors like that that we have to weigh against each other.</p>
<p>Once we’ve decided how we want to balance these risks, we can define a threshold for our experiment. If our data crosses that threshold, we prescribe the drug; if it doesn’t, then we don’t. And that’s our decision.</p>
<h3 id="the-risk-of-error">The risk of error</h3>
<p>All this setup leaves us with a pair of numbers that describe the trade-offs we’ve made. The rate of Type I errors is \(\alpha\), which tells us: <em>if</em> the drug doesn’t work, how likely are we to prescribe it? Its mirror is \(\beta\), the rate of Type II errors. This tells us: if the drug <em>does</em> work, how likely are we to withhold it? <strong title="In a medical context, we often talk about the related concepts of _sensitivity_ and _specificity_. Sensitivity is the &quot;true positive&quot; rate 1-β, the probability of correctly prescribing the drug if it would help. Specificity is the &quot;true negative&quot; rate 1-α, the probability of correctly withholding the drug if it would not help. These terms come from diagnostic testing. &quot;Sensitivity&quot; measures the chance of correctly detecting a condition that you have; &quot;specificity&quot; measures the chance of correctly detecting that you don't have a condition."><sup id="fnref:8"><a href="#fn:8" class="footnote">8</a></sup></strong></p>
<p class="center"><img src="/assets/blog/hypothesis-testing/neyman-pearson-confusion-chart.png" alt="" /></p>
<p class="center"><em>We give the drug if our measurement is bigger than the threshold. If the drug works, we’ll get a result from the right (green) bell curve; if it doesn’t, we’ll get a result from the left (yellow) one.</em></p>
<p class="center"><em>ROC_curves.svg: Sharpr; derivative work: נדב ס, <a href="https://creativecommons.org/licenses/by-sa/4.0">CC BY-SA 4.0</a>, via <a href="https://commons.wikimedia.org/wiki/File:ROC_curves_colors.svg">Wikimedia Commons</a></em></p>
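<p>To make that trade-off concrete, here is a toy calculation. The two Gaussians, their means, and the choice of \(\alpha\) are made up for illustration; they are not from any real trial:</p>

```python
from statistics import NormalDist

# Toy version of the two curves above: the test statistic follows
# N(0, 1) if the drug doesn't work (null) and N(2, 1) if it does.
null, alternative = NormalDist(0, 1), NormalDist(2, 1)

alpha = 0.05                          # chosen Type I error rate
threshold = null.inv_cdf(1 - alpha)   # give the drug above this value
beta = alternative.cdf(threshold)     # Type II error rate this implies

print(round(threshold, 2), round(beta, 2))  # → 1.64 0.36
```

<p>Tightening \(\alpha\) to \(1\)% pushes the threshold to the right and raises \(\beta\) to about \(0.63\): fewer false positives, bought at the price of more false negatives. That’s exactly the balance discussed above.</p>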
<p>(You’ll often see \(\alpha\) referred to as the “false positive rate” and \(\beta\) as the “false negative” rate, but that’s a little inexact. In modern practice, the null hypothesis is almost always “there is no effect”, but this isn’t necessary to the framework. If we want to err on the side of prescribing the drug, then “the drug works” would be the null hypothesis and “no it doesn’t” would be the alternative. In that case, rejecting the null would be a <em>negative</em> result and a Type I error would be a false <em>negative</em>.)</p>
<p>But through all this, <strong>we have to be careful about what question we’re asking, and whether our methods can answer it.</strong> Naively we might want to ask something like “how likely is it that this drug works”, but Fisher, Neyman, and Pearson all would have agreed that that’s an incoherent question that can’t really be answered.<strong title="All three were [frequentists], and believed (roughly) that you can only give a &quot;probability&quot; for something repeatable. You can talk about the probability a study will give a null result, since you could run a hundred studies and count how many give the null. But you can't talk about the probability that a given drug works, since there's only the one drug. The major modern alternative to frequentist probability is [Bayesianism], which _does_ think this question makes sense. I've written about Bayesian reasoning [in the past](https://jaydaigle.net/blog/overview-of-bayesian-inference/) and I'll come back to it in Part 3 of this series. But the Neyman-Pearson method is definitely not Bayesian."><sup id="fnref:9"><a href="#fn:9" class="footnote">9</a></sup></strong> (And even if you believe it’s a coherent question, it’s still not an easy one.)</p>
<p class="center"><img src="/assets/blog/hypothesis-testing/seriously-you-didnt-answer-my-question.jpg" alt="An upset-looking cat: &quot;Seriously, you didn't answer my question&quot;" class="blog-image" /></p>
<p>Instead, the probabilities we computed are both conditional: <em>if</em> the drug doesn’t work, how likely are we to prescribe it? And <em>if</em> the drug does work, how likely are we to withhold it? We can use those probabilities to make the best possible decision, given the information we used and the assumptions we made. <strong>But we can’t compute the probability that our decision is correct</strong>, because that’s just not the question that the Neyman-Pearson method can answer.</p>
<h3 id="dont-tell-me-what-to-think">Don’t tell me what to think!</h3>
<p>In fact, the Neyman-Pearson method is even less able to answer that than the Fisher method. Fisher can’t tell us the probability that we’re right, but it’s at least an attempt to figure out whether we’re right, by measuring our experimental evidence against the null hypothesis. But <strong>Neyman-Pearson doesn’t even try to tell us whether the drug “really works” or not. It just tells us what we should <em>do</em>.</strong></p>
<p>And it is very possible to believe that a drug probably works and is safe, but also that we’re <a href="https://en.wikipedia.org/wiki/Primum_non_nocere">not sure enough</a> to go around prescribing it; it’s equally possible to believe a drug probably doesn’t work, but it’s cheap and harmless so we <a href="https://jaydaigle.net/blog/pascalian-medicine/">might as well give it a shot</a>. Neyman himself wrote, in his <em>First Course in Probability and Statistics</em>:</p>
<blockquote>
<p>[T]o accept a hypothesis \(H\) means only to decide to take action \(A\) rather than action \(B\). This does not mean that we necessarily believe that the hypothesis \(H\) is true. Also, [to reject] \(H\) means only that the rule prescribes action \(B\) and does not imply that we believe \(H\) is false.</p>
</blockquote>
<p>Researchers talk about the difference between statistical significance and <a href="https://statisticsbyjim.com/hypothesis-testing/practical-statistical-significance/">practical</a> or <a href="https://www.mhaonline.com/faq/clinical-vs-statistical-significance">clinical significance</a>, but <strong>in the true Neyman-Pearson setup, practical and statistical significance should be the same</strong>. Sure, if your measurements are precise enough, you can detect an effect that’s too small to matter. Conversely, a small pilot experiment can provide exciting, suggestive data without conclusively establishing any facts. But Neyman-Pearson is designed to choose a significance threshold \(\alpha\) to optimize <em>decision-making</em>, and that means that the statistical threshold <em>must</em> be a practically significant threshold.</p>
<p>If we’re trying to make an optimal decision based on limited information, Neyman-Pearson is about the best we can do. And that’s a pretty plausible description of a lot of medical studies. Phase III drug trials are slow, difficult, and expensive; we’re not going to run the whole thing over again just to check. We need a threshold for deciding whether to approve a drug or not, with the information we have; and that threshold is necessarily a practical one.</p>
<p>But scientific research isn’t generally about single isolated decisions; it’s a search for knowledge, an attempt to figure out what’s true and what isn’t. <strong>Neyman-Pearson very specifically <em>wasn’t</em> designed to answer questions about truth, but we try to use it to do science anyway.</strong> I’ll talk about how exactly that works (and doesn’t work) in Part 2 of this series; but (spoilers!) it works out <em>awkwardly</em>, and the mismatch between what Neyman-Pearson does and what we <em>want</em> it to do is a major contributor to the replication crisis.</p>
<h3 id="making-promises">Making promises</h3>
<p>The Neyman-Pearson method doesn’t tell you what to believe, but it does make a very specific promise: if you set your significance threshold to \(\alpha =5\)%, then your false positive rate will be \(5\)%. This is a statistics theorem, so it really is guaranteed—if you set everything up correctly.</p>
<p class="center"><img src="/assets/blog/hypothesis-testing/promise-kitten.jpg" alt="sad kitten: &quot;Do you promise?&quot;" class="blog-image" /></p>
<p>But that guarantee only applies to the threshold you set <em>before you saw the data</em>. If you run your experiment, do your analysis, and compute \(p = 0.048\), then your result is significant, and the background false positive rate is \(5\)%. But if you run your experiment, do your analysis, and compute \(p = 0.001\), then your result is significant, and the background false positive rate is <em>still</em> \(5\)%. The false positive rate doesn’t get lower just because the \(p\)-value does.</p>
<p>Huh? Isn’t \(p = 0.001\) much stronger evidence than \(p = 0.048\)?</p>
<p>In one sense, yes. That’s what Fisher tells us. But Fisher doesn’t make <em>decisions</em>, and doesn’t make this statistical guarantee. It’s a different tool that answers a different question.</p>
<p>Neyman-Pearson <em>does</em> make a guarantee, but that guarantee is very specific. <strong>If you run a hundred experiments where the null hypothesis is true, you’ll only reject about five times.</strong> (And you get the lowest possible \(\beta\), the fewest possible false negatives, compatible with that false positive rate.) But that’s all you’re guaranteed.</p>
<p>And in particular, if the null hypothesis is true then all \(p\)-values are equally likely. So if you do a hundred experiments, you should expect one of them to give you \(p=0.95\), and one to give \(p = 0.05\), and one to give \(p=0.01\). And that \(0.01\) isn’t, mathematically, special. It’s just one of the five false positives you expect.</p>
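<p>You can watch this happen in a quick simulation (a sketch, with a one-sided \(z\)-test standing in for whatever test you’re running; the seed and sample count are arbitrary):</p>

```python
import random
from statistics import NormalDist

random.seed(0)
norm = NormalDist()

# 100,000 experiments where the null is true: each yields a z-statistic
# drawn from N(0, 1), converted to a one-sided p-value.
pvals = [1 - norm.cdf(random.gauss(0, 1)) for _ in range(100_000)]

# Under the null the p-values are uniform: about 5% land below 0.05,
# and about 1% below 0.01 -- the very low ones aren't special, just rarer.
frac_05 = sum(p < 0.05 for p in pvals) / len(pvals)
frac_01 = sum(p < 0.01 for p in pvals) / len(pvals)
print(frac_05, frac_01)
```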
<p>If you want the guarantees of Neyman-Pearson’s methods, you can’t treat especially low \(p\)-values as especially, well, <em>special</em>. They land in your critical region. You reject the null. The answer to your question is “yes, prescribe the drug”. And that’s <em>all you get</em>.</p>
<p>And the same reasoning applies to results “trending towards significance”. If your \(p\)-value is \(0.06\), then you’re outside the critical region, you accept the null, and the answer to your question is “no, don’t prescribe it”.</p>
<p>And here’s the weirdest bit. If you get \(p=0.06\), you can change your significance threshold after the fact. Now you’re getting a \(6\)% false positive rate. And maybe that sounds like what you’d expect? But <strong>that also applies, retroactively, to every <em>other</em> time you ran an experiment</strong>, even if you got \(p=0.04\) and didn’t have to change your threshold.</p>
<p>If you set yourself a spending limit of \$20, but then spend \$25 when you see something you really wanted, you didn’t actually have a spending limit of \$20 in the first place. And if you’re willing to lift your \(\alpha\) when your \(p\)-value is too high—if you know that when \(p = 0.06\) you’ll frown, and hesitate, and grudgingly prescribe the drug anyway—then your \(\alpha\) is really \(6\)%, regardless of what you say. You’ll get false positives six percent of the time. You’re answering a slightly different question. Which is fine—<em>if</em> it’s closer to the question you really want to answer.</p>
<h2 id="what-are-they-good-for">What are they good for?</h2>
<p>We’ve seen these two different approaches to significance testing, and which specific questions they’re trying to answer. Now we can try to figure out when to use each of these tools, and when neither of them is quite right.</p>
<h3 id="the-measure-of-some-things">The measure of some things</h3>
<p>If you have a specific, yes-or-no decision you need to make on limited evidence, the Neyman-Pearson framework is fantastic. For a doctor deciding whether to prescribe a drug, or a company doing A/B testing deciding whether to roll out a new feature, it is exactly the right tool. Choose your \(\alpha\) and \(\beta\) intelligently, commit to your threshold, run your experiment, and you’re done.</p>
<p>But scientific research doesn’t really work that way. In part, that’s because we accumulate knowledge over time; we don’t need to make a big decision after one study.<strong title="Modern researchers have ways to get around that using tools like meta-analysis: at any given time you can make a decision based on all your information, and when you get new information you can make a new decision. But it's still a bit forced, and not what Neyman-Pearson was designed for."><sup id="fnref:10"><a href="#fn:10" class="footnote">10</a></sup></strong> Fisher’s methods were designed to handle this accumulation of evidence much more adroitly, since they don’t create hard cutoffs: as Fisher wrote, “decisions are final, while the state of opinion derived from a test of significance is provisional, and capable, not only of confirmation, but of revision.”</p>
<p>The bigger problem is that Neyman-Pearson and Fisher are often used to answer the wrong question entirely. <strong>Sometimes in science we just want to know whether something is real.</strong> For example, the Large Hadron Collider wanted to find out <a href="https://en.wikipedia.org/wiki/Higgs_boson#Search_and_discovery">if the Higgs Boson existed</a>. This isn’t really what Neyman-Pearson is built for—remember, it’s for making decisions, not finding the truth—but it is a yes-or-no question, so we can kind of make it work. Fisher’s methods were designed for <em>exactly</em> this question, by measuring how much evidence your experiment gives for the thing’s existence, and they are essentially what the CERN team used.</p>
<p class="center"><img src="/assets/blog/hypothesis-testing/higgs-mordor.jpg" alt="One does not simply find the Higgs Boson" class="blog-image" /></p>
<p><strong>But more often we want to measure something.</strong> This is true even for things like the Higgs search, where the initial announcement of the Higgs boson discovery was for “a new particle with a mass between \(125\) and \(127\) \(\text{GeV}/c^2\)”. It’s even more true in other contexts. In medicine, we want to know <em>how effective</em> a drug will be; in psychology we want to know <em>how strongly</em> a picture can affect our emotions; in public policy we want to know <em>how much</em> a new program will reduce poverty.</p>
<p><strong>And neither Fisher nor Neyman-Pearson answers those questions at all.</strong> It’s just not what they’re designed to do.</p>
<p>I talked about this problem in my <a href="https://jaydaigle.net/blog/replication-crisis-math/#effect-sizes">post on the replication crisis</a>. Amy Cuddy started by asking whether the power pose had an effect—a yes-or-no question. She wound up talking about <em>how large</em> the effect was, which is a completely different question. Hypothesis testing only answers the first question; if you try to use it to <em>measure</em> things you cause yourself all sorts of problems, just like the ones Cuddy ran into.</p>
<p>We also see these problems in research on politically controversial subjects like <a href="https://en.wikipedia.org/wiki/Minimum_wage#Statistical_meta-analyses">the minimum wage</a> and <a href="https://en.wikipedia.org/wiki/Gun_control">gun control</a>. Economic theory suggests that raising the minimum wage should increase unemployment; there’s an extensive literature of dueling empirical studies, with some showing that it does, and others showing that it doesn’t. A lot of ink has been spilled over whether minimum wage increases <em>really</em> increase unemployment, and that’s a genuinely tricky question that I can’t answer.<strong title="Among other things, because the answer is probably &quot;sometimes yes and sometimes no, it depends on the circumstances.&quot; And I don't think anyone seriously doubts that a minimum wage of $100 per hour would increase unemployment, and a minimum wage of $1 per hour would not."><sup id="fnref:11"><a href="#fn:11" class="footnote">11</a></sup></strong></p>
<p>But what I can do is <em>reframe</em> the question. We don’t know if the minimum wage raises are increasing unemployment. But we do know they can’t be increasing it <em>very much</em>. If they were, we’d be able to tell! So the effect may be real, but if it is, it’s <em>small</em>.<strong title="This is the difference between &quot;practical significance&quot; and &quot;statistical significance&quot; we talked about earlier. But that distinction shouldn't arise in a proper Neyman-Pearson setup, which is one way you can tell it's being misused here."><sup id="fnref:12"><a href="#fn:12" class="footnote">12</a></sup></strong> That’s a good enough answer to make policy. But it’s not an answer that hypothesis testing can give you.</p>
<p>If we care about the size of what we’re studying, and not just whether it exists at all, there are much better tools to use than hypothesis tests like Fisher or Neyman-Pearson. I’ll talk about some of these in Part 3 of this series.</p>
<h3 id="the-significance-binary">The Significance Binary</h3>
<p>The other major difference between Fisher’s approach and Neyman-Pearson’s is the degree of nuance allowed in their answers. In Fisher’s formulation, we ask how much evidence our experiment gives against the null hypothesis, which means we can have a lot of shades of gray in our result. The lower the \(p\)-value, the stronger the evidence; a \(p\)-value of \(0.001\) is ten times as good as a \(p\)-value of \(0.01\).</p>
<p>This still doesn’t measure the size of the effect, because you can have lots of evidence for a small effect. (I have plenty of evidence that I can move things by pushing them with my finger, but that won’t allow me to knock over the Washington Monument.) But Fisher’s methods do give a fine-grained, quantitative measurement of something: the strength of the evidence against our null hypothesis.</p>
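To make the graded scale concrete, here is a small sketch in Python (not part of the original essay) of the one-sided \(p\)-value for a run of coin flips; the particular flip counts are made-up illustrations:

```python
from math import comb

def p_value(heads, flips):
    """One-sided p-value under the null "the coin is fair":
    the chance of seeing at least `heads` heads in `flips` tosses."""
    tail = sum(comb(flips, k) for k in range(heads, flips + 1))
    return tail / 2 ** flips

# On Fisher's reading, the p-value is a graded measure of evidence:
# more extreme data gives a smaller p, i.e. stronger evidence.
print(p_value(8, 10))   # ~0.055: suggestive
print(p_value(16, 20))  # ~0.006: much stronger evidence
```

The point is that the output is a continuum: every extra head shrinks the \(p\)-value and strengthens the case against the null, with no single cutoff built in.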
<p>In contrast, <strong>the Neyman-Pearson formulation doesn’t give us fine distinctions</strong>. We ask if our alternative hypothesis is better than the null, and we get an answer to exactly that question—and that answer can only be “yes” or “no”. The entire continuous spectrum of possible \(p\)-values gets compressed into a single bit, with no middle ground.</p>
<p class="center"><img src="/assets/blog/hypothesis-testing/false-dichotomy.jpg" alt="Image of umpire: &quot;False dichotomy on the play. Arbitrarily reducing a set of many possibilities to only two.&quot;" class="blog-image" /></p>
<p>That’s a huge problem when nuance is important, with consequences visible throughout the body of scientific literature. But the problems are especially bad in contexts like public health communication, where both honesty and clarity save lives.</p>
<p>Our medical establishment uses what’s essentially a Neyman-Pearson framework to evaluate possible treatments. And it is (understandably) conservative about approving new drugs, which means that \(\alpha\) is set very low. The price is a lot of false negative results, denying treatments that would work. And in a terrible misuse of language, when a treatment doesn’t clear our fairly high bar for significance, we tend to say there is <a href="https://twitter.com/zeynep/status/1366175070507384836">“no evidence”</a> for it, or even flatly that it “doesn’t work”—whether we mean that it definitely doesn’t work, or that it probably does but we’re not quite sure yet.</p>
<p>This failing was on full display in the early days of the coronavirus pandemic. In February and March 2020, the Surgeon General issued a statement that masks “are NOT effective in preventing” Covid infections, even though we had good reasons to believe they were; the evidence was real, but not (yet) sufficient to reject the null. In December, the World Health Organization said there was <a href="https://twitter.com/WHO/status/1254160944638447618">no evidence that vaccines would reduce covid transmission</a>. Again, there was real evidence that vaccines would reduce transmission, but not enough to cross WHO’s Neyman-Pearson-style decision threshold. And because of the binary output of a Neyman-Pearson process, this tentative wait-and-see approach was communicated in the form of definitive, final-sounding judgments.</p>
<p>There are definitely smarter and more sophisticated ways to use hypothesis testing on questions like this. First, it would help just to remember that our results are provisional and not absolute truths. Sometimes we do have to make a decision <em>now</em> about whether to prescribe a treatment, or roll out a new product, or even just change some official guidelines. But that doesn’t mean we’re locked into that decision forever; and simply saying there was “not enough” evidence for masks, rather than “no evidence”, would have been more honest and <em>also</em> made the subsequent reversal less confusing.</p>
<p>Second, when we do have to make decisions, we can be more thoughtful about the trade-offs between false positives and false negatives. It’s become standard to take \(\alpha=0.05\) and let \(\beta\) fall where it may; but the decision theory works best when we think about the actual trade-offs involved, and choose our parameters accordingly. That, too, would have helped with communication around Covid: the risks of having people wear masks for a couple months while we figured out if they helped were low, and we didn’t need to be as cautious about recommending masking as we are about approving a new cancer drug.</p>
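Here is a rough sketch of what that trade-off looks like numerically, using the power of a one-sided \(z\)-test; the effect size and sample size are made-up numbers for illustration, not from any real study:

```python
from math import erf, sqrt

def norm_cdf(x):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def norm_ppf(q):
    """Invert the normal CDF by bisection (plenty accurate here)."""
    lo, hi = -10.0, 10.0
    while hi - lo > 1e-9:
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if norm_cdf(mid) < q else (lo, mid)
    return (lo + hi) / 2

def power(effect, n, alpha):
    """Chance a one-sided z-test at level `alpha` detects a true shift
    of `effect` standard deviations, given `n` observations."""
    return 1 - norm_cdf(norm_ppf(1 - alpha) - effect * sqrt(n))

# Hypothetical small effect (0.2 sd) studied with n = 50:
print(round(power(0.2, 50, 0.05), 2))  # cautious alpha: many false negatives
print(round(power(0.2, 50, 0.20), 2))  # lenient alpha: far fewer false negatives
```

Loosening \(\alpha\) from \(0.05\) to \(0.20\) substantially raises the power, i.e. shrinks \(\beta\). Whether that trade is worth it depends on the relative costs of the two kinds of error, which is exactly the question a mask recommendation and a cancer-drug approval answer differently.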
<h2 id="where-do-we-stand">Where do we stand?</h2>
<p>Hypothesis tests are ways of using data to give yes-or-no answers to certain questions. They’re extremely powerful in the contexts they were designed for: Neyman-Pearson gives a good rule for making decisions, and Fisher gives a good approach to describing how much evidence your experiment produced. But when you try to apply them outside of those contexts, you can easily get confusing or misleading results.</p>
<p>But this essay has presented both approaches to hypothesis testing more or less as they were originally designed, in their original contexts. <strong>The Fisher approach gives us a nuanced evaluation of the evidence, but no firm conclusion; the Neyman-Pearson approach gives us a clear answer, but nothing else.</strong></p>
<p>But modern researchers often want both, and modern hypothesis testing works a little differently: it tries to deliver both at once. And it often, predictably, fails.</p>
<p>Next time in <a href="/blog/hypothesis-testing-part-2">Part 2</a> we’ll see how the modern approach to hypothesis testing works. And we’ll see how the modifications we’ve made to try to have it both ways lose some of the benefits of both approaches, and invite the sort of research failures that we’ve seen throughout the replication crisis.</p>
<p class="center"><img src="/assets/blog/hypothesis-testing/cats-soon.jpg" alt="Cats staring at city skyline. Caption: "Soon."" class="blog-image" /></p>
<hr />
<p><em>Have questions about hypothesis testing? Is there something I didn’t cover, or even got completely wrong? Or is there something you’d like to hear more about in the rest of this series? Tweet me <a href="https://twitter.com/profjaydaigle">@ProfJayDaigle</a> or leave a comment below.</em></p>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>Some of these specific people were <a href="https://nautil.us/how-eugenics-shaped-statistics-9365/">pretty awful in one way or another</a>. Ronald Fisher in particular was <a href="https://www.newstatesman.com/uncategorized/2020/07/ra-fisher-and-science-hatred">racist</a> and a <a href="https://priceonomics.com/why-the-father-of-modern-statistics-didnt-believe/">vigorous defender of tobacco companies</a>, though Jerzy Neyman seems to have been <a href="https://daniellakens.blogspot.com/2021/09/jerzy-neyman-positive-role-model-in.html?m=1">perfectly lovely</a>. I’m not going to go into detail about their failings, among other things because I’m not especially well-informed on the subject; I recommend the articles I linked if you want to know more. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>Much of this essay, and especially the historical information on the way these schools of thought developed, draws heavily on the article <a href="https://doi.org/10.1198/0003130031856">Confusion Over Measures of Evidence (\(p\)’s) Versus Errors (\(\alpha\)’s) in Classical Statistical Testing</a> by Hubbard and Bayarri. This extremely readable article is also a fascinating historical artifact, basically predicting the entire contour of the replication crisis in 2003. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3">
<p>Okay, maybe I didn’t <em>actually</em> expect my car to be accurate and unbiased. But it’s at least <em>supposed</em> to be true, so it provides a good baseline for comparison. <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
<li id="fn:4">
<p>You might worry about whether it’s a two-sided but biased coin. But Gelman and Nolan have argued that <a href="https://www.tandfonline.com/doi/abs/10.1198/000313002605">coins physically can’t be biased</a>, and I find their argument compelling. If you don’t find it compelling, you have to decide how likely you think a weighted coin would be—which is exactly the “other evidence” that Fisher’s paradigm doesn’t even try to account for. <a href="#fnref:4" class="reversefootnote">↩</a></p>
</li>
<li id="fn:5">
<p>A friend asks if meta-analysis accomplishes the same thing, but meta-analysis is actually a much weaker threshold than the one Fisher gives here. Meta-analysis tries to amplify weak signals and reconcile inconsistent results; Fisher says we should only believe a claim when we can consistently get a strong signal. <a href="#fnref:5" class="reversefootnote">↩</a></p>
</li>
<li id="fn:6">
<p>From what I understand, Fisher was a little contemptuous of the idea that you could answer this question mathematically. <a href="#fnref:6" class="reversefootnote">↩</a></p>
</li>
<li id="fn:7">
<p>I’m not convinced I agree with this, but that’s beside the point here. I’ll discuss this choice a bit more in Part 2 of this series. <a href="#fnref:7" class="reversefootnote">↩</a></p>
</li>
<li id="fn:8">
<p>In a medical context, we often talk about the related concepts of <em>sensitivity</em> and <em>specificity</em>. Sensitivity is the “true positive” rate \(1-\beta\), the probability of correctly prescribing the drug if it would help. Specificity is the “true negative” rate \(1-\alpha\), the probability of correctly withholding the drug if it would not help.</p>
<p>These terms come from diagnostic testing. “Sensitivity” measures the chance of correctly detecting a condition that you have; “specificity” measures the chance of correctly detecting that you don’t have a condition. <a href="#fnref:8" class="reversefootnote">↩</a></p>
</li>
<li id="fn:9">
<p>All three were <a href="https://en.wikipedia.org/wiki/Frequentist_probability">frequentists</a>, and believed (roughly) that you can only give a “probability” for something repeatable. You can talk about the probability a study will give a null result, since you could run a hundred studies and count how many give the null. But you can’t talk about the probability that a given drug works, since there’s only the one drug.</p>
<p>The major modern alternative to frequentist probability is <a href="https://en.wikipedia.org/wiki/Bayesian_probability">Bayesianism</a>, which <em>does</em> think this question makes sense. I’ve written about Bayesian reasoning in the past and I’ll come back to it in Part 3 of this series. But the Neyman-Pearson method is definitely not Bayesian. <a href="#fnref:9" class="reversefootnote">↩</a></p>
</li>
<li id="fn:10">
<p>Modern researchers have ways to get around that using tools like meta-analysis: at any given time you can make a decision based on all your information, and when you get new information you can make a new decision. But it’s still a bit forced, and not what Neyman-Pearson was designed for. <a href="#fnref:10" class="reversefootnote">↩</a></p>
</li>
<li id="fn:11">
<p>Among other things, because the answer is probably “sometimes yes and sometimes no, it depends on the circumstances.” And I don’t think anyone seriously doubts that a minimum wage of \$100 per hour would increase unemployment, and a minimum wage of \$1 per hour would not. <a href="#fnref:11" class="reversefootnote">↩</a></p>
</li>
<li id="fn:12">
<p>This is the difference between “practical significance” and “statistical significance” we talked about earlier. But that distinction shouldn’t arise in a proper Neyman-Pearson setup, which is one way you can tell it’s being misused here. <a href="#fnref:12" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Jay DaigleThis is the first part of a three-part series explaining what hypothesis testing is and how it works. In this essay I'll talk about the way hypothesis testing developed historically, in two rival schools of thought. I'll explain how these two methodologies were originally supposed to work, and why you might (or might not) want to use them.Why Isn’t There a Replication Crisis in Math?2022-02-02T00:00:00-08:002022-02-02T00:00:00-08:00https://jaydaigle.net/blog/replication-crisis-math<p>One important thing that I think about a lot, even though I have no formal expertise, is the <a href="https://www.vox.com/future-perfect/21504366/science-replication-crisis-peer-review-statistics">replication crisis</a>. A shocking fraction of published research in many fields, including medicine and psychology, is flatly wrong—the results of the studies can’t be obtained in the same way again, and the conclusions don’t hold up to further investigation. Medical researcher John Ioannidis brought this problem to wide attention in 2005 with a paper titled <a href="https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.0020124">Why Most Published Research Findings Are False</a>; attempts to replicate the results of major psychology papers suggest that <a href="https://www.theatlantic.com/science/archive/2018/11/psychologys-replication-crisis-real/576223/">only about half of them hold up</a>. A recent analysis gives <a href="https://apnews.com/article/science-business-health-cancer-marcia-mcnutt-93219170405e3de753651b89d4308461">a similar result for cancer research</a>.</p>
<p>This is a real crisis for the whole process of science. If we can’t rely on the results of famous, large, well-established studies, it’s hard to feel secure in <em>any</em> of our knowledge. It’s probably the most important problem facing the entire project of science right now.</p>
<p>There’s a lot to say about the mathematics we use in social science research, especially the statistics, and how bad math feeds the replication crisis.<strong title="I'm a big fan of the [Data Colada] project, and of [Andrew Gelman's writing] on the subject"><sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup></strong> But I want to approach it from a different angle. <strong>Why doesn’t <em>the field of mathematics</em> have a replication crisis?</strong> And what does that tell us about other fields, that do?</p>
<h2 id="why-doesnt-math-have-a-replication-crisis">Why doesn’t math have a replication crisis?</h2>
<h3 id="maybe-mathematicians-dont-make-mistakes">Maybe mathematicians don’t make mistakes</h3>
<p>Have you, uh, <a href="https://mathwithbaddrawings.com/2017/01/11/why-are-mathematicians-so-bad-at-arithmetic/">met any mathematicians</a>?</p>
<p style="text-align: center;"><a href="https://mathwithbaddrawings.com/2017/01/11/why-are-mathematicians-so-bad-at-arithmetic/"><img src="/assets/blog/replication-crisis-math/sign-error.jpg" alt="Cartoon: &quot;So the tip is...$70? But the meal was only $32...&quot; &quot;Maybe we made a sign error, and they owe us $70.&quot;" width="75%" /></a></p>
<p style="text-align: center;"><em>Comic by Ben Orlin at <a href="https://mathwithbaddrawings.com/2017/01/11/why-are-mathematicians-so-bad-at-arithmetic/">Math with Bad Drawings</a></em></p>
<p style="text-align: center;"><em>At Caltech, they made the youngest non-math major split the check: the closer you were to high school, the more you remembered of basic arithmetic. But everyone knew the math majors were hopeless.</em></p>
<p>More seriously, it’s reasonably well-known among mathematicians that <strong>published math papers are <a href="https://twitter.com/benskuhn/status/1419281164951556097"><em>full</em> of errors</a></strong>. Many of them are eventually fixed, and most of the errors are in a deep sense “unimportant” mistakes. But the frequency with which proof formalization efforts <a href="https://mathoverflow.net/questions/291158/proofs-shown-to-be-wrong-after-formalization-with-proof-assistant">find flaws in widely-accepted proofs</a> suggests that there are plenty more errors in published papers that no one has noticed.</p>
<p>So math has, if not a replication crisis, at least a replication problem. Many of our published papers are flawed. But it doesn’t seem like we have a crisis.</p>
<h3 id="maybe-our-mistakes-get-caught">Maybe our mistakes get caught</h3>
<p>In the social sciences, replicating a paper is hard. You have to get new funding and run a new version of the same experiment. There’s a lot of dispute about how closely you need to replicate all the mechanics of the original experiment for it to “count” as a replication, and sometimes you can’t get a lot of the details you’d need to do it right—especially if the original authors aren’t feeling helpful.<strong title="In theory, all papers should include enough information that you can replicate all the experiments they describe. In practice, I think this basically never happens. There's just too much information, and it's hard to even guess which things are going to be important."><sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup></strong> And after all that work, people won’t even be impressed, because you didn’t do anything original!</p>
<p>But one of the distinctive things about math is that our papers aren’t just records of experiments we did elsewhere. In experimental sciences, the experiment is the “real work” and the paper is just a description of it. But <strong>in math, the paper, itself, is the “real work”</strong>. Our papers don’t describe everything we do, of course. There’s a lot of intellectual exploration and just straight-up messing around that doesn’t get written down anywhere. But the paper contains a (hopefully) complete version of the argument that we’ve constructed.</p>
<p>And that means that <strong>you can <em>replicate</em> a math paper by <em>reading</em> it</strong>. When I’ve served as a peer reviewer I’ve read the papers closely and checked all the steps of the proofs, and that means that I have replicated the results. And any time you want to use an argument from someone else’s paper, you have to work through the details, and that means you’re replicating it again.</p>
<p>The replication crisis is partly the discovery that many major social science results do not replicate. But it’s also the discovery that we hadn’t been trying to replicate them, and we really should have been. In the social sciences we fooled ourselves into thinking our foundation was stronger than it was, by never testing it. But in math we couldn’t avoid testing it.</p>
<h3 id="maybe-the-crisis-is-here-and-we-just-havent-noticed">Maybe the crisis is here, and we just haven’t noticed</h3>
<p>As our mathematics gets more advanced and our results get more complicated, this replication process becomes harder: it takes more time, knowledge, and expertise to understand a single paper. If replication gets hard enough, we may fall into crisis. The crisis might even <a href="https://link.springer.com/article/10.1007/s00283-020-10037-7">already be here</a>; the problems in psychological and medical research existed for decades before they were widely appreciated.</p>
<p>There’s some fascinating work in using <a href="https://www.nature.com/articles/d41586-021-01627-2">computer tools to formally verify proofs</a>, but this is still a niche practice. In theory we are continually re-checking all our work, but in practice that’s inconsistent, so it’s hard to be sure how deep the problems run. (Especially since flawed papers <a href="https://twitter.com/zbMATH/status/1474326312517271560">don’t really get retracted</a> and you pretty much have to talk to active researchers in a field to know which papers you can trust.)</p>
<p style="text-align: center;"><img src="/assets/blog/replication-crisis-math/trust.jpg" alt="Picture of kitten in bubble bath with caption: &quot;my trust, u loses it.&quot;" width="50%" /></p>
<p>But while this is a real possibility that people should take seriously, I’m skeptical that we’re in the middle of a true crisis of replicability.<strong title="I'm sure every practitioner in every field says that, though, even years after the problems become obvious to anyone who looks. So take this with a grain of salt."><sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup></strong> <strong>Many papers have errors, yes—but our major results generally hold up, even when the intermediate steps are wrong!</strong> Our errors can usually be fixed without really changing our conclusions.</p>
<p>Since our main conclusions hold up, we don’t need to fix any downstream papers that relied on those conclusions. We don’t need to substantially revise what we thought we knew. We don’t need to jettison entire fields of research, the way <a href="https://replicationindex.com/2017/02/02/reconstruction-of-a-train-wreck-how-priming-research-went-of-the-rails/comment-page-1/">psychology had to abandon the literature on social priming</a>. There are problems, to be sure, and we could always do better. But it’s not a crisis.</p>
<h3 id="mysterious-intuition">“Mysterious” intuition</h3>
<p>But isn’t it…<em>weird</em>…that our results hold up when our methods don’t? How does that even work?</p>
<p>We get away with it because we can be right for the wrong reasons—<strong>we mostly only try to prove things that are basically true</strong>. Ben Kuhn tweeted a very accurate-feeling summary of the whole situation <a href="https://twitter.com/benskuhn/status/1419281164951556097">in this twitter thread</a>:</p>
<blockquote>
<p>[D]espite the fact that error-correction is really hard, publishing actually false results was quite rare because “people’s intuition about what’s true is mysteriously really good.” Because we mostly only try to prove true things, our conclusions are right even when our proofs are wrong.<strong title="A friend asks: if we mostly know what's true already, why do we need to actually find the proofs? The bad answer is &quot;you're not doing math if you don't prove things&quot;. The good answer is that finding proofs is how we train this mysteriously good intuition; if we didn't work out proofs in detail, we wouldn't be able to make good guesses about the next steps."><sup id="fnref:4"><a href="#fn:4" class="footnote">4</a></sup></strong></p>
</blockquote>
<p>This can make it weirdly difficult to resolve disagreements about whether a proof is actually correct. In a recent example, Shinichi Mochizuki claims that he has <a href="https://www.quantamagazine.org/titans-of-mathematics-clash-over-epic-proof-of-abc-conjecture-20180920/">proven the \(abc\) conjecture</a>, while most mathematicians don’t believe his argument is valid. But everyone involved is pretty confident the \(abc\) conjecture is true; the disagreement is about whether the proof itself is good.</p>
<p style="text-align: center;"><img src="/assets/blog/replication-crisis-math/proof.jpg" alt="Picture of cat walking through kitchen covered in trash: &quot;come find me when you have proof.&quot;" width="75%" /></p>
<p style="text-align: center;"><em>Circumstantial evidence isn’t enough to make mathematicians happy.</em></p>
<p>If we find a counterexample to \(abc\) then Mochizuki is clearly wrong, but so is everyone else. If we find a consensus proof of \(abc\), then Mochizuki’s conclusion is right, but that does very little to make his argument more convincing. He could, very easily, just be lucky.</p>
<h2 id="butpsychologists-have-intuition-too">But—Psychologists have intuition, too</h2>
<p>A lot of psychology results that don’t replicate look a little different from this perspective. Does standing in a <a href="https://en.wikipedia.org/wiki/Power_posing">power pose</a> for a few seconds make you feel more confident? Probably! It sure feels like it does (seriously, stand up and give it a try right now); and it would be weird if it made you feel <em>worse</em>. Does it affect you enough, for a long enough time, to matter much? Probably not. That would also be weird.</p>
<p style="text-align: center;"><img src="/assets/blog/replication-crisis-math/power-pose.jpg" alt="Picture of Amy Cuddy standing in front of a picture of Wonder Woman, in matching poses" width="50%" /></p>
<p style="text-align: center;"><em>Amy Cuddy demonstrating a power pose. <br />
Photo by Erik (HASH) Hersman from Orlando, <a href="https://creativecommons.org/licenses/by/2.0">CC BY 2.0</a>, via <a href="https://commons.wikimedia.org/wiki/File:Power_pose_by_Amy_Cuddy_at_PopTech_2011_(6279920726).jpg">Wikimedia Commons</a></em></p>
<p>The studies we’ve done, when analyzed properly, don’t show a clear, consistent, and measurable effect from a few seconds of power posing. But that’s what you’d expect, right? There’s probably an effect, but it should be too small to reasonably measure. And that’s totally consistent with everything we’ve found.</p>
<p>Amy Cuddy<strong title="I'm going to pick on Amy Cuddy and power posing a lot. That's not entirely fair to Cuddy; the pattern I'm describing is extremely common and easy to fall into, and I could make the same argument about [social priming research] or the [hungry judges study] or the dozens of others. (That's why it's a &quot;replication crisis&quot; and not a &quot;this one researcher made a mistake one time crisis&quot;.) But for simplicity I'm going to stick to the same example for most of this post."><sup id="fnref:5"><a href="#fn:5" class="footnote">5</a></sup></strong> had the intuition that power posing would increase confidence, and set out to prove it—just like Mochizuki had the intuition that the \(abc\) conjecture was true, and set out to prove it. Mochizuki’s proof was bad, but his top-line conclusion was probably right because the \(abc\) conjecture is probably correct. And Cuddy’s studies were flawed, but her intuition at the start was probably right, so her top-line conclusion is probably true.</p>
<p>Well, sort of.</p>
<h3 id="defaulting-to-zero">Defaulting to zero</h3>
<p>Let’s turn Cuddy’s question around for a bit.<strong title="Mathematicians love doing this. I'm a mathematician, so I love doing this. But it's genuinely a useful way to think about what's going on."><sup id="fnref:6"><a href="#fn:6" class="footnote">6</a></sup></strong> What are the chances that power posing has <em>exactly zero</em> effect on your psychology? That would be extremely surprising. Most things you do affect your mindset at least a little.<strong title="This is your regular reminder to stand up, stretch, and drink some water."><sup id="fnref:7"><a href="#fn:7" class="footnote">7</a></sup></strong></p>
<p>So our expectation should be: either power posing makes you a little more confident, or it makes you a little less confident. It also probably makes you either a little more friendly or a little less friendly, a little more or a little less experimental, a little more or a little less agreeable—<strong>an effect of exactly zero would be a surprise</strong>.</p>
<p>But for confidence specifically, it would also be kind of surprising if power posing made you feel less confident. So my default assumption is that power posing causes a small increase in confidence. And nominally, Cuddy’s research asked whether that default assumption is correct.</p>
<p>But that’s just not a great question. It doesn’t really matter if standing in a power pose makes you feel marginally better for five seconds. Not worth a book deal and a TED talk, and barely worth publishing. <strong>Cuddy’s research was interesting because it suggested the effect of power posing was not only positive, but <em>large</em></strong>—enough to make a dramatic, usable impact over an extended period of time.</p>
<p>If Cuddy’s results were true, they would be both surprising and important. But that’s just another way of saying they’re probably not true.</p>
<h3 id="power-and-precision">Power and Precision</h3>
<p>Notice: we’ve shifted to a new, different question. We started out asking “does power posing make you more confident”, but now we’re answering “how much more confident does power posing make you”. This is a better question, sure, but it’s different. And <strong>the statistical tools appropriate to the first question don’t really work for the new and better one.</strong></p>
<p><a href="https://en.wikipedia.org/wiki/Statistical_hypothesis_testing">Statistical hypothesis testing</a> is designed to give a yes/no answer to “is this effect real”. Hypothesis testing is surprisingly complicated to actually explain correctly, and probably deserves <a href="/blog/hypothesis-testing-part-1">an essay</a> or two on its own.<strong title="I originally tried to write a concise explanation to include here. It hit a thousand words and was nowhere near finished, so I decided to save it for later. Update: I have now posted the [first] and [second] essays in a three-part series on hypothesis testing."><sup id="fnref:8"><a href="#fn:8" class="footnote">8</a></sup></strong></p>
<p style="text-align: center;"><img src="/assets/blog/replication-crisis-math/hypothesis-testing.png" alt="Diagram of true and false positives and negatives on a bell curve" width="75%" /></p>
<p style="text-align: center;"><em>I swear this picture makes sense.<br />
ROC_curves.svg: Sharprderivative work: נדב ס, <a href="https://creativecommons.org/licenses/by-sa/4.0">CC BY-SA 4.0</a>, via <a href="https://commons.wikimedia.org/wiki/File:ROC_curves_colors.svg">Wikimedia Commons</a></em></p>
<p>To wildly oversimplify, we measure something, and check if that measurement is so big that it’s unlikely to occur by chance. If yes, we conclude that there’s a real effect from whatever we’re studying. If not, we generally conclude that there’s no effect.</p>
<p>But what if the effect is real, but very small? With this method, we conclude the effect is real if our measurements are big enough. <strong>But if the effect is small, our measurements won’t be <em>big</em>. Our study might not have enough <a href="https://en.wikipedia.org/wiki/Power_of_a_test">power</a> to find the effect</strong> even if it is real.<strong title="This means we have to be really careful about interpreting studies that don't find any effect. A study with low power will find &quot;[no evidence]&quot; of an effect even if the effect is very real, and that can be [just as misleading] as the errors I'm discussing in this essay. More careful researchers will say they &quot;fail to reject the null hypothesis&quot; or &quot;fail to find an effect&quot;. If everyone were always that careful I wouldn't need to write this essay."><sup id="fnref:9"><a href="#fn:9" class="footnote">9</a></sup></strong></p>
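You can see low power directly in a quick simulation. This Python sketch is illustrative only: the effect size, sample size, and test are invented assumptions, not numbers from any power-posing study.

```python
import random
from math import sqrt
from statistics import mean

random.seed(0)  # reproducible simulation

def one_study(effect, n, z_crit=1.645):
    """Simulate one study: n noisy measurements (known unit variance)
    of a true `effect`, tested one-sided at the 5% level."""
    data = [random.gauss(effect, 1) for _ in range(n)]
    return mean(data) * sqrt(n) > z_crit

# A real but tiny effect, measured with a modest sample:
detections = sum(one_study(effect=0.1, n=30) for _ in range(4000))
print(detections / 4000)  # well under half: most studies "find nothing"
```

The effect here is perfectly real, yet the majority of simulated studies fail to reject the null, which is exactly what an underpowered study does.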
<p>We could run a more powerful study and find evidence of smaller effects if we could make more precise measurements. This approach has worked really well in fields like physics and chemistry, and a lot of fundamental physical discoveries were driven by new technology that allowed the measurement of smaller effects. Galileo’s experiments with falling speeds required him to invent <a href="https://www.thegreatcoursesdaily.com/the-rolling-ball-experiments-galileos-terrestrial-mechanics/">improved timekeeping methods</a>, and Coulomb developed his inverse-square law after <a href="https://en.wikipedia.org/wiki/Coulomb%27s_law#History">his torsion balance</a> allowed him to precisely measure electrostatic forces. In the modern era, we built extremely sensitive measurement devices to try to measure <a href="https://en.wikipedia.org/wiki/LIGO">gravitational waves</a> and detect <a href="https://en.wikipedia.org/wiki/Higgs_boson#Search_and_discovery">the Higgs boson</a>.</p>
<p>If power posing increases confidence by 1% for thirty seconds, that would actually be perfectly fine if we could measure confidence to within a hundredth of a percent on a second-to-second basis. But social psychology experiments just don’t work that way—at least, not with our current technology. There’s too much randomness and behavioral variation. Effects of that size just aren’t detectable.</p>
<p>This doesn’t have to be a problem! If we want to know “how big is the effect of power posing”, the answer is “too small to detect”. That’s a fine answer. It tells you that you shouldn’t build any complicated apparatus based on exploiting the power pose. (Or write <a href="https://www.goodreads.com/book/show/25066556-presence">entire books</a> on how it can change your life.)</p>
<p>But the question we started with was “does power posing have an effect at all?”. If the effect is small, we might struggle to tell whether it’s real or not.</p>
<h3 id="but-we-already-know-the-answer">But we already know the answer!</h3>
<p>Imagine you’re a psychologist researching power posing. You measure a small effect, which could just be due to chance. But you’re pretty sure that the effect is real; clearly you didn’t do a good enough job in your study! It’s probably <a href="https://en.wikipedia.org/wiki/Publication_bias">not even worth publishing</a>.</p>
<p style="text-align: center;"><img src="/assets/blog/replication-crisis-math/X-Men-question-answer.gif" alt="Gif from X-Men movie. 'Why do you ask questions to which you already know the answers?'" width="75%" /></p>
<p>So you try again. Or someone else tries again. And eventually someone runs a study that <em>does</em> see a large effect. (Occasionally the large effect is due to fraud. Usually it’s methodology with subtler flaws that the researcher doesn’t notice. And sometimes it’s just luck: you’ll get a one-in-twenty outcome once in every twenty tries.)</p>
<p>Now we’re all happy. We were pretty sure that we would see an effect if we looked closely enough. And there it is! At this point no one has an incentive to look for flaws in the study. The result makes sense. (You might remember we said this is the state of a lot of mathematical research.)</p>
<p>But there are two major problems we can run into here. The first is that <strong>our intuition can, in fact, be wrong</strong>. If your process can only ever prove things that you already believed, it’s not a good process; you can’t really learn anything. Andrew Gelman <a href="https://statmodeling.stat.columbia.edu/2021/11/18/fake-drug-studies/">recently made this observation about fraudulent medical research</a>:</p>
<blockquote>
<p>If you frame the situation as, “These drugs work, we just need the paperwork to get them approved, and who cares if we cut a few corners, even if a couple people die of unfortunate reactions to these drugs, they’re still saving thousands of lives,” then, sure, when you think of aggregate utility we shouldn’t worry too much about some fraud here and there…</p>
</blockquote>
<blockquote>
<p>But I don’t know that this optimistic framing is correct. I’m concerned that bad drugs are being approved instead of good drugs….Also, negative data—examples where the treatment fails to work as expected—provide valuable information, and by not doing real trials you’re depriving yourself of opportunities to get this feedback.</p>
</blockquote>
<p>Shoddy research practices make sense if you see scientific studies purely as bureaucratic hoops you have to jump through: it’s “obviously true” that power posing will make you bolder and more confident, and the study is just a box you have to check before you can go around saying that out loud. But <strong>if you want to learn things, or be surprised by your data, you need to be more careful</strong>.</p>
<h2 id="effect-sizes">Effect Sizes Matter</h2>
<h3 id="overestimation">Overestimation</h3>
<p>The second problem can bite you even if your original intuition is right. You start out just wanting to know “is there an effect, y/n?”, but your experiment will make a measurement. You will get an estimate of the <em>size</em> of the effect. And that estimate will be wrong.</p>
<p>Your estimate will be wrong for a silly, almost tautological reason: <strong>if you can only detect large effects, then any effect you detect will be large</strong>. If you keep looking for an effect, over and over again, until finally one study gets lucky and sees it, that study will almost necessarily give <a href="https://statmodeling.stat.columbia.edu/2014/11/17/power-06-looks-like-get-used/">a wild overestimate</a> of the effect size.</p>
<p style="text-align: center;"><img src="https://statmodeling.stat.columbia.edu/wp-content/uploads/2014/11/Screen-Shot-2014-11-17-at-11.19.42-AM.png" alt="A diagram of the effects of low-power studies. This is what 'power = 0.06' looks like. Get used to it. Type S error probability: if the estimate is statistically significant, it has a 24% chance of having the wrong sign. Exaggeration ratio: if the estimate is statistically significant, it must be at least 9 times higher than the effect size." width="75%" /></p>
<p style="text-align: center;"><em>If the effect is small relative to your measurement precision, your results are guaranteed to be misleading. Figure by <a href="https://statmodeling.stat.columbia.edu/2014/11/17/power-06-looks-like-get-used/">Andrew Gelman</a>.</em></p>
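<p>Gelman’s picture is easy to reproduce with a quick simulation. Here’s a sketch, with numbers I picked to roughly match his figure (a true effect of 2 measured with a standard error of 8.1, which gives a power of about 0.06); none of these numbers comes from any real study. When only estimates beyond 1.96 standard errors count as significant, the significant ones are enormous overestimates and frequently point the wrong way.</p>

```python
import random
import statistics

random.seed(0)

# Illustrative numbers chosen to roughly match Gelman's figure,
# not taken from any actual study.
TRUE_EFFECT = 2.0
STD_ERROR = 8.1
N_STUDIES = 100_000

# Each simulated study reports one noisy estimate; call it "statistically
# significant" when it lands more than 1.96 standard errors from zero.
significant = [est for est in
               (random.gauss(TRUE_EFFECT, STD_ERROR) for _ in range(N_STUDIES))
               if abs(est) > 1.96 * STD_ERROR]

power = len(significant) / N_STUDIES
exaggeration = statistics.mean(abs(e) for e in significant) / TRUE_EFFECT
wrong_sign = sum(e < 0 for e in significant) / len(significant)

print(f"power ≈ {power:.2f}")                       # around 0.06
print(f"exaggeration ratio ≈ {exaggeration:.1f}x")  # significant estimates are ~9x too big
print(f"wrong sign ≈ {wrong_sign:.0%}")             # roughly a quarter point the wrong way
```

<p>Notice that no individual simulated study does anything wrong here; the publication filter alone manufactures the exaggeration.</p>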
<p>And this is how you wind up with shoddy research telling you that all sorts of things have shockingly large and dramatic impacts on…whatever you’re studying. You start out with the intuition that power posing should increase confidence, which is reasonable enough. You run studies, and eventually one of them agrees with you: power posing does make you more confident. But not just a little. In your study, people who did a little power posing saw big benefits.</p>
<p>To your surprise, you’ve discovered a life-changing innovation. You issue press releases, write a book, give a TED talk, spread the good news of how much you can benefit from this little tweak to your life.</p>
<p>Then other researchers try to probe the effect further—and it vanishes. Most studies don’t find clear evidence at all. The ones that do find something show much smaller effects than you had found. Of course they do. Your study got published precisely because its result was unusually extreme.</p>
<h3 id="dont-forget-your-prior">Don’t forget your prior</h3>
<p>Notice how, in all of this, we lost sight of our original hypothesis. It seemed basically reasonable to think power posing might perk you up a bit. That’s what we originally wanted to test, and that’s the conviction that made us keep trying. But we <em>didn’t</em> start out thinking that it would have a huge, life-altering impact.</p>
<p><strong>A really large result should feel just as weird as no result at all, if not weirder</strong>. And when we stop to think about that, we know it; some research suggests that <a href="https://twitter.com/BrianNosek/status/1034093709971873794">social scientists have a pretty good idea which results are actually plausible</a>, and which are nonsense overestimates. But since we started with the question “is there an effect at all”, the large result we got <em>feels</em> like it confirms our original belief, even though it really doesn’t.</p>
<p>This specific combination is dangerous. The direction of the effect is reasonable and expected, so we accept the study as plausible. The size of the effect is shocking, which makes the study <em>interesting</em>, and gets news coverage and book deals and TED talks.</p>
<p>And this process repeats itself over and over, and the field builds up a huge library of incredible results that <a href="https://statmodeling.stat.columbia.edu/2017/12/15/piranha-problem-social-psychology-behavioral-economics-button-pushing-model-science-eats/">can’t possibly all be true</a>. Eventually the music stops, and there’s a crisis, and that’s where we are today. But it all starts somewhere reasonable: with people trying to prove something that is obviously true.</p>
<h3 id="so-how-is-math-different">So how is math different?</h3>
<p>This is exactly the situation we said math was in. Mathematicians have a pretty good idea of what results should be true, but so do psychologists! Mathematicians sometimes make mistakes, but since they’re mostly trying to prove true things, it all works out okay. Social scientists are also (generally) trying to prove true things, but it doesn’t work out nearly so well. Why not?</p>
<p>In math, a result that’s too good <em>looks</em> just as troubling as one that isn’t good enough. The idea of “<a href="https://en.wikipedia.org/wiki/Proving_too_much">proving too much</a>” is a core tool for reasoning about mathematical arguments. It’s common to critique a proposed proof with something like “if that argument worked, it would prove all numbers are even, and we know that’s wrong”. This happens at all levels of math, whether you’re in college taking Intro to Proofs, or vetting a high-profile attempt to solve a major open problem. <strong>We’re in the habit of checking whether a result is—literally!—too good to be true</strong>.</p>
<p style="text-align: center;"><img src="/assets/blog/replication-crisis-math/anti-gravity-cat.jpg" alt="Picture of a floating cat. 'damn anti-gravity cat always disproving ma theorem'" width="50%" /></p>
<p>We could bring a similar approach to social science research. Daniël Lakens <a href="http://daniellakens.blogspot.com/2017/07/impossibly-hungry-judges.html">uses this sort of argument</a> to critique a <a href="https://www.pnas.org/content/108/17/6889.short">famous study</a> on hunger and judicial decisions:</p>
<blockquote>
<p>I think we should dismiss this finding, simply because it is impossible. When we interpret how impossibly large the effect size is, anyone with even a modest understanding of psychology should be able to conclude that it is impossible that this data pattern is caused by a psychological mechanism. As psychologists, we shouldn’t teach or cite this finding, nor use it in policy decisions as an example of psychological bias in decision making.</p>
</blockquote>
<p>Other researchers have found <a href="https://mindhacks.com/2016/12/08/rational-judges-not-extraneous-factors-in-decisions/">specific problems with the study</a>, but Lakens’s point is that we could dismiss the result even before they did. If a proposed proof of Fermat’s last theorem also shows there are no solutions to \(a^2 + b^2 = c^2\), we know it’s <em>wrong</em>, even before we find the specific flaw in the argument. And if a study suggests humans aren’t capable of making reasoned decisions at 11:30 AM, it’s confounded by <em>something</em>, even if we don’t know what.</p>
<p>And yet, while I don’t believe in these studies, and I don’t believe their effect sizes, I still believe their basic claims. I believe that people make worse decisions when they’re hungry. (I know I do.) I believe standing in a power pose can make you feel stronger and more assertive. I believe that <a href="https://www.vox.com/2016/3/14/11219446/psychology-replication-crisis">exercising self-control can deplete your willpower</a>.</p>
<p>But as a mathematician, I’m forced to admit: we don’t have proof.</p>
<hr />
<p><em>Do you think we have a replication crisis in math? Disagree with me about the replication crisis? Think you make better decisions when you’re hungry? Tweet me <a href="https://twitter.com/profjaydaigle">@ProfJayDaigle</a> or leave a comment below.</em></p>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>I’m a big fan of the <a href="http://datacolada.org">Data Colada</a> project, and of <a href="https://statmodeling.stat.columbia.edu/2018/05/07/replication-crisis-centered-social-psychology/">Andrew Gelman’s writing</a> on the subject. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>In theory, all papers should include enough information that you can replicate all the experiments they describe. In practice, I think this basically never happens. There’s just too much information, and it’s hard to even guess which things are going to be important. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3">
<p>I’m sure every practitioner in every field says that, though, even years after the problems become obvious to anyone who looks. So take this with a grain of salt. <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
<li id="fn:4">
<p>A friend asks: if we mostly know what’s true already, why do we need to actually find the proofs? The bad answer is “you’re not doing math if you don’t prove things”. The good answer is that finding proofs is how we train this mysteriously good intuition; if we didn’t work out proofs in detail, we wouldn’t be able to make good guesses about the next steps. <a href="#fnref:4" class="reversefootnote">↩</a></p>
</li>
<li id="fn:5">
<p>I’m going to pick on Amy Cuddy and power posing a lot. That’s not entirely fair to Cuddy; the pattern I’m describing is extremely common and easy to fall into, and I could make the same argument about <a href="https://www.nature.com/articles/d41586-019-03755-2">social priming research</a> or the <a href="https://mindhacks.com/2016/12/08/rational-judges-not-extraneous-factors-in-decisions/">hungry judges study</a> or the dozens of others. (That’s why it’s a “replication crisis” and not a “this one researcher made a mistake one time crisis”.) But for simplicity I’m going to stick to the same example for most of this post. <a href="#fnref:5" class="reversefootnote">↩</a></p>
</li>
<li id="fn:6">
<p>Mathematicians love doing this. I’m a mathematician, so I love doing this. But it’s genuinely a useful way to think about what’s going on. <a href="#fnref:6" class="reversefootnote">↩</a></p>
</li>
<li id="fn:7">
<p>This is your regular reminder to stand up, stretch, and drink some water. <a href="#fnref:7" class="reversefootnote">↩</a></p>
</li>
<li id="fn:8">
<p>I originally tried to write a concise explanation to include here. It hit a thousand words and was nowhere near finished, so I decided to save it for later. Update: I have now posted the <a href="/blog/hypothesis-testing-part-1">first</a> and <a href="/blog/hypothesis-testing-part-2">second</a> essays in a three-part series on hypothesis testing. <a href="#fnref:8" class="reversefootnote">↩</a></p>
</li>
<li id="fn:9">
<p>This means we have to be really careful about interpreting studies that don’t find any effect. A study with low power will find “<a href="https://twitter.com/zeynep/status/1366175070507384836?lang=en">no evidence</a>” of an effect even if the effect is very real, and that can be <a href="https://twitter.com/CT_Bergstrom/status/1487491536010944512">just as misleading</a> as the errors I’m discussing in this essay.</p>
<p>More careful researchers will say they “fail to reject the null hypothesis” or “fail to find an effect”. If everyone were always that careful I wouldn’t need to write this essay. <a href="#fnref:9" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Jay DaigleThe replication crisis is a major problem in medicine and social science; we know that a huge fraction of the published literature is outright wrong. But in math we don't seem to have a similar crisis, despite reasonably frequent minor errors in published papers. Why not, and what can this tell us about the fields that are in crisis?Pascal’s Wager, Medicine, and the Limits of Formal Reasoning2021-11-28T00:00:00-08:002021-11-28T00:00:00-08:00https://jaydaigle.net/blog/pascalian-medicine<p>Scott Alexander at Astral Codex Ten has a good post recently thinking about what he calls <a href="https://astralcodexten.substack.com/p/pascalian-medicine">Pascalian Medicine</a>. As always the entire post is worth reading, but here’s an excerpt:</p>
<blockquote>
<p>Another way of looking at this is that I must think there’s a 25% chance Vitamin D works, and a 10% chance ivermectin does. Both substances are generally safe with few side effects. So (as many commenters brought up) there’s a <a href="https://en.wikipedia.org/wiki/Pascal%27s_wager">Pascal’s Wager</a> like argument that someone with COVID should take both. The downside is some mild inconvenience and cost (both drugs together probably cost $20 for a week-long course). The upside is a well-below-50% but still pretty substantial probability that they could save my life.</p>
</blockquote>
<blockquote>
<p>…</p>
</blockquote>
<blockquote>
<p>But why stop there? Sure, take twenty untested chemicals for COVID. But there are almost as many poorly-tested supplements that purport to treat depression. The cold! The flu! Diabetes! Some of these have known side effects, but others are about as safe as we can ever prove anything to be. Maybe we should be taking twenty untested supplements for every condition!</p>
</blockquote>
<p>Scott doesn’t seem to believe we should do this, but is trying to figure out the actual flaw in this reasoning. The most convincing argument he comes up with is based on how unreliable modern medical studies are, and how easy it is to generate spurious positive results.</p>
<blockquote>
<p>I think ivermectin doesn’t work. I think that it looks like it works, because it has lots of positive studies and a few big-name endorsements. But our current scientific method is so weak and error-prone that any chemical which gets raised to researchers’ attentions and studied in depth will get approximately this amount of positive results and buzz. Look through the thirty different chemicals featured on the sidebar of the ivmmeta site if you don’t believe me.</p>
</blockquote>
<blockquote>
<p>…</p>
</blockquote>
<blockquote>
<p>Probably what I’m doing wrong here is saying that ivermectin having some decent studies raises its probability of working to 5%. I should just say 0.1% or 0.01% or whatever my prior on a randomly-selected medication treating a randomly-selected disease is (higher than you’d think, based on the argument from antibiotics).</p>
</blockquote>
<blockquote>
<p>From the Outside View, this argument seems strong. From the Inside View, I have a lot of trouble looking at a bunch of studies apparently supporting a thing, and no contrary evidence against the thing besides my own skepticism, and saying there’s a less than 1% chance that thing is true.</p>
</blockquote>
<p>The <a href="https://www.lesswrong.com/tag/inside-outside-view">Outside View</a> argument here is <em>completely right</em>, and is a great illustration of the limitations of Bayesian reasoning that I talked about <a href="/blog/paradigms-and-priors/#anomalies-and-bayes">here</a> and <a href="https://jaydaigle.net/blog/overview-of-bayesian-inference/">here</a>.</p>
<h3 id="unknown-unknowns">Unknown Unknowns</h3>
<p>The basic argument for Pascalian medicine goes: okay, suppose ivermectin has a 10% chance of reducing covid mortality by 10%. About a thousand people are dying of covid every <del>week</del> day<strong title="I originally misread the CDC page and interpreted the weekly average of daily numbers as weekly numbers. I've edited the piece throughout to reflect the true numbers, but it doesn't change any of the conclusions, since the same error happened to every rate I discussed in the piece."><sup id="fnref:edit"><a href="#fn:edit" class="footnote">1</a></sup></strong> in the US <a href="https://www.cdc.gov/coronavirus/2019-ncov/covid-data/covidview/index.html">according to the CDC weekly tracker</a>, so the expected benefit of giving all our covid patients ivermectin is something like saving ten lives per day.<strong title="There would also be benefits from fewer people being hospitalized, fewer people suffering long-term health consequences, fewer people being miserable and bedridden for a week, etc. I'm going to talk about deaths pretty exclusively because it's easier to talk about just one number."><sup id="fnref:1"><a href="#fn:1" class="footnote">2</a></sup></strong></p>
<p>Even if you think the probability ivermectin works is only something like 1%, that still adds up to one life saved per day. Since ivermectin is cheap, and “generally safe with few side effects”, an expected value of “saves one life per day” looks pretty good! So maybe we should prescribe it out of an abundance of caution.<strong title="This is very different from claims that ivermectin is a miracle cure, and we should take that instead of getting vaccinated. Ivermectin is at best mildly beneficial; vaccines are safe and effective and you should get a booster shot if you haven't already. We're talking about whether the small possibility of a minor benefit from ivermectin makes it worth taking."><sup id="fnref:2"><a href="#fn:2" class="footnote">3</a></sup></strong></p>
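<p>The expected-value arithmetic driving this argument is simple enough to lay out explicitly. This is a sketch using the illustrative numbers above, which are stipulated for the sake of the argument, not real estimates:</p>

```python
DEATHS_PER_DAY = 1000  # rough US covid deaths per day, per the CDC tracker

def expected_lives_saved(p_drug_works, mortality_reduction,
                         deaths=DEATHS_PER_DAY):
    """Expected daily lives saved if every covid patient takes the drug."""
    return p_drug_works * mortality_reduction * deaths

# 10% chance of a 10% mortality reduction: about ten lives a day.
print(expected_lives_saved(0.10, 0.10))
# Even at a 1% chance of working, that still adds up to about one life a day.
print(expected_lives_saved(0.01, 0.10))
```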
<p>And then we make the same argument about, apparently, twenty other drugs, and we’re taking a crazy drug cocktail. (Scott calls this the Insanity Wolf position.) So it looks like something has gone wrong. But what?</p>
<p style="text-align: center;"><img src="/assets/blog/pascalian/insanity_wolf.jpeg" alt="Insanity Wolf meme: 'TAKE EVERY MEDICATION ALL THE TIME. BECOME INFINITELY HEALTHY, LIVE FOREVER'" /></p>
<p>We made a basic, common error that really isn’t fully avoidable: we took a bunch of stuff we can’t measure, and decided it didn’t matter. “Generally safe with few side effects” isn’t the same as “perfectly safe”, and “cheap” isn’t the same as “free”. And something like ninety thousand people get covid in the US every day; to save that one life we’re probably giving drugs to tens of thousands of people. How confident are we that our drugs won’t hurt any of them? Especially if we give an Insanity Wolf-style twenty-drug cocktail?</p>
<p>Scott discusses this idea, of course. But I think he seriously underestimates the problem of unknown unknowns here. For well-understood drugs with large probable benefits, the unknown unknowns don’t matter very much. But for long-shot possible payoffs, like with ivermectin, unknown unknowns present a real, unavoidable problem. And the theoretically, mathematically correct response is to throw up our hands and take the Outside View instead.</p>
<h3 id="three-example-drugs">Three Example Drugs</h3>
<p>I want to take a look at three different drugs and do some illustrative calculations for the possible risks and benefits.</p>
<h5 id="paxlovid">Paxlovid</h5>
<p>There are always unknown unknowns, but in many cases we can put bounds on how good, or bad, things can be. <a href="https://en.wikipedia.org/wiki/PF-07321332">Paxlovid</a>, Pfizer’s new antiviral pill, provides a good example of this reasoning. In trials, Paxlovid <a href="https://www.pfizer.com/news/press-release/press-release-detail/pfizers-novel-covid-19-oral-antiviral-treatment-candidate">cut covid hospitalizations and deaths by about 90%</a>.<strong title="These numbers are reported a little weirdly. Looking at the study, it seems like Paxlovid cut hospitalizations by 85%, from 41/612 to 6/607; it cut deaths by 100% from 10/612 to 0/607. I think the 90% figure is the extent to which it cut (hospitalizations plus deaths), since that math checks out, but that's a slightly weird metric to judge by."><sup id="fnref:3"><a href="#fn:3" class="footnote">4</a></sup></strong> Let’s assume that’s a wildly optimistic overestimate, and give it a 50% chance of cutting deaths by 50%. Then in expectation that’s going to save a couple hundred lives each day.</p>
<p>What are the risks? This is a new drug so it’s hard to know what they are; all we know is that (1) Pfizer didn’t expect the side effects to be too bad, based on prior knowledge of this drug class, and (2) they didn’t notice anything too dramatic in the trial they ran. That doesn’t tell us how bad the side effects are, but it does put limits on them: if Paxlovid killed 1% of the people who took it, we’d know.</p>
<p>But suppose Paxlovid kills .1% of everyone who takes it. That’s about as high as it could go without us probably having noticed already, since the trial administered it to about 600 people and none of them died. (And realistically if it killed .1% of people, way more than that would have severe side effects and we probably would have noticed.) If we give Paxlovid to everyone in the US who gets covid, that’s about 90,000 people a day, and Paxlovid would kill 90 people a day. And that’s less than the couple hundred lives it would save.</p>
<p>Now, all of these numbers are <em>extremely handwavy</em>. But I chose them to make Paxlovid look as bad as reasonably possible, and it still comes out looking pretty good. My estimate of the benefit of Paxlovid was a huge lowball; it’s probably going to save closer to 800 lives a day than 200 if we manage to give it to everybody. And on the other hand, I’d be shocked if it’s anywhere <em>near</em> as dangerous as I assumed in the last paragraph. Sure, there’s some minuscule chance that it’s really, really dangerous but only several years after you take it, but since that’s not how these drugs usually work we can round that off to zero.</p>
<p>The benefit of Paxlovid is large enough that it outweighs any vaguely reasonable estimate of the costs. And we don’t need any especially fancy calculations to see that.</p>
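<p>Putting the deliberately pessimistic numbers from the last few paragraphs side by side makes the comparison stark. All inputs here are the rough guesses made above, not trial data:</p>

```python
CASES_PER_DAY = 90_000   # rough daily US covid cases
DEATHS_PER_DAY = 1000    # rough daily US covid deaths

# Pessimistic benefit: only a 50% chance of cutting deaths by 50%.
expected_saved = 0.5 * 0.5 * DEATHS_PER_DAY   # a couple hundred lives/day

# Pessimistic harm: kills 0.1% of everyone treated, about the worst rate
# that could hide in a ~600-person trial with zero deaths.
expected_killed = 0.001 * CASES_PER_DAY       # 90 lives/day

# Even with every number slanted against it, Paxlovid comes out ahead.
print(expected_saved, expected_killed)
```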
<p style="text-align: center;"><img src="https://imgs.xkcd.com/comics/statistics.png" alt="https://xkcd.com/2400 Statistics. 'Statistics tip: always try to get data that's good enough that you don't need to do statistics on it.'" /></p>
<p style="text-align: center"><em>We could make basically the same argument about vaccines, except the worst plausible numbers look even better than for Paxlovid.</em></p>
<h5 id="tylenol">Tylenol</h5>
<p>We can run a similar analysis with common everyday drugs like Tylenol. Scott observes that “We don’t fret over the unknown unknowns of Benadryl or Tylenol or whatever, even though we know their benefits are minor.” But by the same token, we are also reasonably confident that the unknown unknown costs of those drugs are minor. If Tylenol killed .1% of patients who took it, or even .01%, <em>we would know</em>. (And in fact we know Tylenol can cause liver damage, and that is a thing we very much do fret over.) Sure, unknown harms always could exist. But in this case we can be pretty confident that they have to be really small.</p>
<p>Apparently a new potentially deadly side effect of Tylenol was discovered in 2013. If I’m reading the FDA report correctly, they believe that <a href="https://www.fda.gov/drugs/drug-safety-and-availability/fda-drug-safety-communication-fda-warns-rare-serious-skin-reactions-pain-relieverfever-reducer">one person has died</a> from this side effect since 1969. That’s the scale of side effect that can slip under the radar for a drug as widely taken and studied as Tylenol.</p>
<p>Tylenol could have unknown unknowns, but they won’t be <em>very</em> unknown.</p>
<h5 id="back-to-ivermectin">Back to Ivermectin</h5>
<p>Now compare this with the ivermectin situation. Let’s suppose we give ivermectin a 10% chance of being effective, with a benefit of reducing deaths by 20%. (The Together trial has a non-significant effect of about 10%, so let’s double that.) Then in expectation we’re saving like 2% of lives a day, which is 20 lives saved if we give it to everyone.</p>
<p>How many people would ivermectin have to kill to net out negative? If we give it to 90,000 people every day, then 20 is about .02%. So does ivermectin kill about .02% of the people who take it? My guess is, probably not. But that seems a lot more within the realm of “maybe, it’s hard to be sure”.</p>
<p>We also reach the point where a lot of our ass-pull assumptions start to really matter. We said “maybe ivermectin has a 10% chance of working”. Scott’s the expert, not me, but that seems high to me. (Do you really think that one in ten drugs with vague but mildly promising data in preliminary trials pans out?) If we say ivermectin has a 1% chance of reducing deaths by 20%, then our expected value is two lives per day.</p>
<p>This could still pencil out as a good trade, but with benefits so small (and uncertain) it could easily not be worth it. Especially if we account for the guaranteed annoyance of taking a pill and the common minor side effects we know ivermectin has.</p>
<h3 id="the-problem-with-made-up-numbers">The Problem with Made-Up Numbers</h3>
<p>But the larger point here is that <em>all this math is bullshit</em>. Are the odds of ivermectin working 10%? 1%? .01%? Where did that number come from? What do we mean by “working”—is it a 5% improvement? A 50% improvement?<strong title="There are systematic ways of estimating this, but they would all require numbers for 'how inflated do you expect non-significant effect sizes in published studies to be?' If you spend a lot of time with the medical literature you might have a number to put here; I don't."><sup id="fnref:4"><a href="#fn:4" class="footnote">5</a></sup></strong> And at the same time, I don’t have real odds for “negative side effects”, which covers a lot of ground. (Scott himself points out that the odds of ivermectin unexpectedly killing you are definitely not zero.) And all this is the simple version of the calculation, where we don’t try to weigh things like “fever from covid might last one day less?” versus “ivermectin can cause fever?”</p>
<p>Scott argued many years ago that <a href="https://slatestarcodex.com/2013/05/02/if-its-worth-doing-its-worth-doing-with-made-up-statistics/">if it’s worth doing, it’s worth doing with made-up statistics</a>. And I don’t really disagree with that essay. Doing experimental calculations with made-up numbers can give us information, and I certainly think the analysis of Paxlovid that I did above tells us something useful. But to learn anything from these calculations, we need our made-up numbers to at least vaguely reflect reality.</p>
<p>Scott wrote:</p>
<blockquote>
<p>Remember the <a href="http://yudkowsky.net/rational/bayes">Bayes mammogram problem</a>? The correct answer is 7.8%; most doctors (and others) intuitively feel like the answer should be about 80%. So doctors – who are specifically trained in having good intuitive judgment about diseases – are wrong by an order of magnitude….But suppose some doctor’s internet is down (you have NO IDEA how much doctors secretly rely on the Internet) and she can’t remember the prevalence of breast cancer. If the doctor thinks her guess will be off by less than an order of magnitude, then making up a number and plugging it into Bayes will be more accurate than just using a gut feeling about how likely the test is to work.</p>
</blockquote>
<p>And this is right, but the caveat at the end is critical. If you have a good estimate of the prevalence of breast cancer, and a bad estimate of the chance of a false positive, then you can use the first number to get a better estimate of the second. But if you have a really good idea of the false positive rate (maybe you’ve seen thousands of positive results and learned which ones turned out to be false positives), but a shaky idea of the prevalence of breast cancer (hell, I have no idea how likely some lump is to be cancerous), you’ll be better off going with your intuition for how accurate the test is—and using that to estimate breast cancer prevalence!</p>
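<p>For concreteness, here is Bayes’ rule applied with the standard mammogram-problem numbers (1% prevalence, 80% sensitivity, 9.6% false-positive rate), which is where the 7.8% figure comes from:</p>

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(disease | positive test), by Bayes' rule."""
    true_positives = prior * sensitivity
    false_positives = (1 - prior) * false_positive_rate
    return true_positives / (true_positives + false_positives)

# Standard mammogram-problem inputs: 1% prevalence, 80% sensitivity,
# 9.6% false-positive rate.
print(f"{posterior(0.01, 0.80, 0.096):.1%}")  # 7.8%, not the intuitive ~80%
```

<p>The calculation is only as good as its weakest input: nudge the prior from 1% to 10% and the posterior jumps to nearly half, which is exactly why a bad prevalence estimate swamps a good false-positive estimate.</p>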
<p>Scott says that “varying the value of the “unknown unknowns” term until it says whatever justifies our pre-existing intuitions is the coward’s way out.” And this is one of the rare cases where I think he’s completely, unequivocally wrong. This isn’t the coward’s way out; it’s the only thing we can possibly do.</p>
<h3 id="reflective-equilibrium">Reflective Equilibrium</h3>
<p>If you find a convincing argument that generates an unlikely conclusion, you can accept the unlikely conclusion, you can decide that the premises of the argument were flawed, <em>or</em> you can decide the argument itself doesn’t work. If I collect some data, do some statistics, and calculate that taking Tylenol will cut my lifespan by thirty years, I don’t immediately throw away all my Tylenol—I look for where I screwed up my math. And that’s the correct, and rational, response.</p>
<p>If you think A is true and B is false, and find an argument that A implies B, you have three choices: you can decide A is false after all; you can decide B is true after all; or you can decide that the argument actually isn’t valid. Or you can adopt some probabilistic combination: it’s perfectly consistent to believe A is 60% likely to be true, B 60% likely to be false, and the argument 60% likely to be correct. But fundamentally you have to make a choice about which of the three pieces to adjust, and by how much.<strong title="David Chapman calls this [meta-rational reasoning](https://twitter.com/Meaningness/status/1463632030059544576). I see where he's coming from but think that's an unnecessarily complex and provocative way of talking about it."><sup id="fnref:5"><a href="#fn:5" class="footnote">6</a></sup></strong></p>
<p style="text-align: center;"><img src="/assets/blog/pascalian/two-answers.jpg" alt="Picture of kitten raising two paws: "i has two ansers. which you want?"" /></p>
<p>In the case of ivermectin, we have some data from some studies. We have an Inside View argument that, based on expected values computed from that data, taking ivermectin is probably worth it. And we have the Outside View argument that taking random long-shot drugs is not a great idea. And we have to reconcile these somehow.</p>
<p>First, we could reject, or disbelieve, the data. And we totally did that: a bunch of ivermectin studies are fraudulent or incompetent, and Scott <a href="https://astralcodexten.substack.com/p/ivermectin-much-more-than-you-wanted">argues pretty convincingly</a> that some of the honest, competent studies are really picking up the benefits of killing off intestinal parasites. But even after doing that, we’re left with the Pascalian argument: ivermectin probably doesn’t work, but it might, and the costs of taking it are low, so we might as well. Do we listen to that argument, or to our gut belief that this can’t be a good idea?</p>
<p>A common trap that smart, math-oriented people fall into is thinking that the argument with numbers and calculations must be the better one. The Inside View argument did some math, and multiplied some percentages, and came up with an expected value; the Outside View argument comes from a fuzzy intuitive sense that medicine Doesn’t Work That Way. So the mathy argument should win out.</p>
<p style="text-align: center;"><img src="/assets/blog/pascalian/peanuts-opinion.gif" alt="Peanuts comic. "How are you doing in school these days, Charlie Brown?"
"Oh, fairly well, I guess...I'm having most of my trouble in arithmetic.."
"I should think you'd like arithmetic...it's a very precise subject.."
"That's just the trouble. I'm at my best in something where the answers are mostly a matter of opinion!"" width="100%" /></p>
<p>But in this case, we were doing calculations with numbers that were, you might remember, completely made up. Sure, the Outside View argument reflects a fuzzy intuitive sense of whether a random potential cure is likely to help us. The Inside View argument, on the other hand, reflects a fuzzy intuitive sense of whether ivermectin is likely to protect us from covid.</p>
<p>The only real difference is that we took the second fuzzy intuition, put a fuzzy number on it, and plugged it into some cost-benefit analysis formulas. And no matter what fancy formulas we use, they can never make our starting numbers <em>less</em> fuzzy. Given the choice between a fuzzy intuition, and an equally fuzzy intuition that we’ve done math to, I’m inclined to trust the first one. With fewer steps, there are fewer ways to screw up.</p>
<h3 id="finding-the-error">Finding the Error</h3>
<p>At this point I think we’ve reached roughly Scott’s position at the end of his essay. The Outside View argument is winning out in practice, but we haven’t articulated any specific problems with the Inside View argument. And this is uncomfortable, because <em>they can’t both be right</em>. We can say it’s more likely we screwed up the more complicated, mathier argument. But <em>how</em> did we screw it up?</p>
<p>And on reflection, the answer is that we’re confusing two different arguments. I think that “Sure, go ahead and take ivermectin, it probably won’t help but it might, and it probably won’t hurt either” is a pretty reasonable position, and was even more reasonable six months ago, when we knew less than we do now.<strong title="Again, "Ivermectin is a miracle cure, take that instead of getting vaccinated" is, in fact, a completely and totally nonsense position. And many public "ivermectin advocates" are saying that, and they are wrong. But that's not what we're talking about here."><sup id="fnref:6"><a href="#fn:6" class="footnote">7</a></sup></strong></p>
<p>I know a bunch of people who take Vitamin C, even though it’s not clear that accomplishes anything. I myself flip-flop between taking a multivitamin because it seems like it might make me healthier, and not taking a multivitamin because there’s no real evidence that it does. Taking ivermectin in case it’s helpful doesn’t really seem that different.</p>
<p>No, the crazy position is when we go full Insanity Wolf and take twenty different long-shot cures at once. <em>That</em> was the conclusion that seemed like it couldn’t possibly hold up, at least for me. And that’s <em>also</em> the point where it really does seem like the unknown unknowns start piling up. There are twenty different drugs that could all possibly cause negative side effects. There are 190 potential two-drug interactions and over a thousand potential three-drug interactions, and even if interactions are, in Scott’s words, “rarer than laypeople think”, that seems like a lot of room for something weird to happen.</p>
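<p>Those interaction counts are just binomial coefficients, and a one-line sanity check confirms them:</p>

```python
from math import comb

# Ways to choose pairs and triples from twenty drugs.
print(comb(20, 2))  # 190 possible two-drug interactions
print(comb(20, 3))  # 1140 possible three-drug interactions
```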
<p>So this is how we screwed up. We said these drugs are cheap and generally safe. But in order to make our math reasonable, we rounded “generally safe” down to “safe”, and ignored the risks entirely. As long as the risks are small enough, that works fine; but at some point we cross a threshold where we can’t just ignore all the downsides when doing our calculations.</p>
<p>Is taking twenty drugs over that threshold? I don’t know, but it seems likely. Taking that many drugs <em>probably</em> won’t hurt you, but it might! And it will definitely be expensive and annoying, and a lot of those drugs have common mild-but-unpleasant side effects. And the potential benefits are relatively small, and relatively unlikely; it’s easy for them to be swamped by all these downsides.</p>
<p>But now we’re talking about the interaction of hundreds of numbers that are both small and uncertain. We can’t get away with ignoring the risks, but we can’t realistically quantify them either. All we can do is make some half-assed guesses, and our conclusions will change a lot depending on exactly which guesses we make. So we <a href="https://twitter.com/ProfJayDaigle/status/1463598150585888775">can’t do a useful Inside View calculation at all</a>. Instead we’re basically forced to rely on the Outside View argument: taking twenty pills every day that probably don’t even work seems kinda dumb.</p>
<p>But then why take ivermectin specifically, rather than Vitamin D or curcumin or some other possible treatment? I dunno. You’re buying a long-shot lottery ticket. Pick your favorite number and hope it pays out.</p>
<h3 id="the-takeaway">The Takeaway</h3>
<p>A back-of-the-envelope cost-benefit analysis tells us that taking ivermectin for covid might have positive expected value. If we follow that logic to its conclusion, we wind up taking twenty different supplements and this seems like it can’t be wise.</p>
<p>A blinkered view of rationality tells us to ignore our intuition and follow the math. A more expansive view realizes that if the numbers we’re plugging into our cost-benefit analysis are shakier than that intuition, then we should take the intuition seriously. Cost-benefit analyses and other “mathematically rational” tools are only as good as the numbers and arguments that we bring to them.</p>
<p>But even with shaky numbers, we can learn things from comparing our intuitions with the result of our calculations. Figuring out <em>why</em> we get two different answers can teach us a lot about our reasoning, and help us figure out where we went wrong. Taking the full Insanity Wolf cocktail really seems qualitatively different from picking your favorite long-shot drug, but the way we set up our math hid that from us.</p>
<p>Finally: please get vaccinated, and get your booster shot. And if you have a choice between Paxlovid and ivermectin, you should probably take the Paxlovid.</p>
<hr />
<p><em>Questions about cost-benefit analysis, or where the math breaks down? Do you know something I missed? Tweet me <a href="https://twitter.com/profjaydaigle">@ProfJayDaigle</a> or leave a comment below.</em></p>
<div class="footnotes">
<ol>
<li id="fn:edit">
<p>I originally misread the CDC page and interpreted the weekly average of daily numbers as weekly numbers. I’ve edited the piece throughout to reflect the true numbers, but it doesn’t change any of the conclusions, since the same error happened to every rate I discussed in the piece. <a href="#fnref:edit" class="reversefootnote">↩</a></p>
</li>
<li id="fn:1">
<p>There would also be benefits from fewer people being hospitalized, fewer people suffering long-term health consequences, fewer people being miserable and bedridden for a week, etc. I’m going to talk about deaths pretty exclusively because it’s easier to talk about just one number. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>This is very different from claims that ivermectin is a miracle cure, and we should take that instead of getting vaccinated. Ivermectin is at best mildly beneficial; vaccines are safe and effective and you should get a booster shot if you haven’t already. We’re talking about whether the small possibility of a minor benefit from ivermectin makes it worth taking. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3">
<p>These numbers are reported a little weirdly. Looking at the study, it seems like Paxlovid cut hospitalizations by 85%, from 41/612 to 6/607; it cut deaths by 100% from 10/612 to 0/607. I think the 90% figure is the extent to which it cut (hospitalizations plus deaths), since that math checks out, but that’s a slightly weird metric to judge by. <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
<li id="fn:4">
<p>There are systematic ways of estimating this, but they would all require numbers for “how inflated do you expect non-significant effect sizes in published studies to be?” If you spend a lot of time with the medical literature you might have a number to put here; I don’t. <a href="#fnref:4" class="reversefootnote">↩</a></p>
</li>
<li id="fn:5">
<p>David Chapman calls this <a href="https://twitter.com/Meaningness/status/1463632030059544576">meta-rational reasoning</a>. I see where he’s coming from but think that’s an unnecessarily complex and provocative way of talking about it. <a href="#fnref:5" class="reversefootnote">↩</a></p>
</li>
<li id="fn:6">
<p>Again, “Ivermectin is a miracle cure, take that instead of getting vaccinated” is, in fact, a completely and totally nonsense position. And many public “ivermectin advocates” are saying that, and they are wrong. But that’s not what we’re talking about here. <a href="#fnref:6" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Jay DaigleA back-of-the-envelope cost-benefit analysis tells us that taking ivermectin for covid might have positive expected value. If we follow that logic to its conclusion, we wind up taking twenty different supplements and this seems like it can't be wise. Resolving this apparent conflict exposes some of the deep flaws in how we often think about rationality and Bayesian reasoning. A response to a piece by Scott Alexander at Astral Codex Ten.More Thoughts on the Axiom of Choice2021-07-28T00:00:00-07:002021-07-28T00:00:00-07:00https://jaydaigle.net/blog/more-on-the-axiom-of-choice<p>I got a lot of good, interesting comments on my recent <a href="https://jaydaigle.net/blog/what-is-the-axiom-of-choice/">post on the axiom of choice</a> (both on the post itself, and in this <a href="https://news.ycombinator.com/item?id=27836406">very good Hacker News thread</a>). I wanted to answer some common questions and share the most interesting thing I learned.</p>
<h3 id="cant-we-just-pick-at-random">Can’t we just pick at random?</h3>
<p>A lot of people asked why we can’t just avoid the whole problem of the axiom of choice by picking set elements randomly. Because obviously we can just make a bunch of random choices, right? If there’s no limit to what the choices have to look like then there’s no problem.</p>
<p>If you believe that, then you believe the axiom of choice. “We can pick some element from each set, without being fussy about which one we get” is just what the axiom of choice says. And that’s fine. A lot of people believe the axiom of choice! But it’s not an alternative to the axiom of choice; it is the axiom of choice.</p>
<p>The fact that this “just pick at random” idea seems so facially compelling, or “obvious”, is a big part of why many mathematicians want to accept the axiom of choice. It just seems like we should be able to make a bunch of choices at once, if we’re not picky about which choices we make. It’s only when people are shown the really bizarre implications of getting to make those choices that they start questioning whether the axiom makes sense.</p>
<h3 id="why-do-we-want-to-believe-the-axiom-of-choice">Why do we want to believe the axiom of choice?</h3>
<p>Another recurring question asked why we <em>should</em> want to believe the axiom of choice. It has a lot of bizarre consequences. In the last post I argued that those consequences aren’t as troubling as they seem, but they’re still weird. Why can’t we just dumpster the axiom of choice and avoid all of them?</p>
<p>One reason is the intuitive plausibility of the “just pick at random” idea. The goal of an axiomatic system is to formalize our list of “basic moves we should be able to make”. The ZF axioms include things like the <a href="https://en.wikipedia.org/wiki/Axiom_of_extensionality">axiom of extensionality</a>, which says that two sets are equal if they have the same elements, and the <a href="https://en.wikipedia.org/wiki/Axiom_of_pairing">axiom of pairing</a>, which says that if \(A\) and \(B\) are sets then we can talk about the set \( \{A, B\} \). These aren’t weird exotic ideas. They’re just things we should be able to do with collections of things. They’re part of the intuition that the word “set” is trying to formalize.</p>
<p>You could see the axiom of choice as something like this—something in our basic, intuitive understanding of what a “set” is, that pre-exists formal definitions. It’s pretty easy to convince people that “choose an element from each set” is a reasonable thing to be able to do. The only problem is that it leads to absurd results like Banach-Tarski or the solution to the Infinite Hats puzzle. But if we satisfy ourselves that those absurdities aren’t a real problem, we return to “this seems like a thing we should be able to do”.</p>
<h3 id="but-really-why-do-we-want-to-believe-the-axiom-of-choice">But really, why do we <em>want</em> to believe the axiom of choice?</h3>
<p>On the other hand, that’s not a very strong reason to really care about the axiom of choice. At best, that leaves us at “why shouldn’t we, it doesn’t hurt anything”, which could just as easily be “why should we, it doesn’t help?” We <em>care</em> about the axiom of choice, and put up with the peripheral weirdness, because it lets us prove a <a href="https://en.wikipedia.org/wiki/Axiom_of_choice#Weaker_forms">variety of other results we care about</a>. These include:</p>
<ul>
<li>Every Hilbert space has an orthonormal basis (so we can put coordinates on function spaces);</li>
<li>Every field has an algebraic closure (very important in number theory—in my research I often wanted to talk about “the algebraic closure” of some large field, and that implicitly relies on the axiom of choice);</li>
<li>The union of countably many countable sets is countable;</li>
<li><a href="https://en.wikipedia.org/wiki/Hahn%E2%80%93Banach_theorem">The Hahn-Banach theorem</a> (lets us extend linear functionals and guarantees that dual spaces are “interesting”);</li>
<li><a href="https://en.wikipedia.org/wiki/G%C3%B6del's_completeness_theorem">Gödel’s completeness theorem</a> for first-order logic;</li>
<li><a href="https://en.wikipedia.org/wiki/Baire_category_theorem">The Baire category theorem</a>, which I don’t even want to try to summarize but which shows up constantly in functional analysis.</li>
</ul>
<p>All of these results are really useful in their respective fields, and we need the axiom of choice to prove them. And that’s a true “need”: these are all provable from ZFC but not from ZF.</p>
<p>These statements aren’t equivalent to the axiom of choice. If we wanted, we could take the above list as a list of new <em>axioms</em> to attach to ZF, and then we wouldn’t be stuck with choice. But that is a really strange and ad-hoc list of foundational axioms. It feels much better to take the one axiom—the axiom of choice, which is reasonably foundational and sounds plausible enough on its own—and get all these consequences for free.</p>
<h3 id="shoenfields-theorem-you-only-need-the-axiom-of-choice-for-weird-things">Shoenfield’s Theorem: You only need the axiom of choice for weird things</h3>
<p>But the coolest thing I learned about after writing the last post is <a href="https://en.wikipedia.org/wiki/Absoluteness#Shoenfield's_absoluteness_theorem">Shoenfield’s Absoluteness Theorem</a>. The statement of this theorem is pretty dense and I don’t think I completely understand it, but it has really nice implications for the axiom of choice.</p>
<p>In the last post I said that the axiom of choice just doesn’t cause problems as long as we’re not getting too far away from finite sets. This applies even to half the results in the previous section.</p>
<ul>
<li>We need the axiom of choice to show that <em>every</em> field has an algebraic closure, but not to show that the rationals do.</li>
<li>We need the axiom of choice to show that <em>every</em> Hilbert space has an orthonormal basis, but not to show that Fourier theory gives an orthonormal basis for \(L^2([-\pi,\pi])\).</li>
<li>We need the axiom of choice to prove the Baire Category Theorem for every complete metric space, but not to prove it for the real numbers or the real function space \(L^2(\mathbb{R}^n)\).</li>
</ul>
<p>Shoenfield’s theorem helps tell us exactly when the axiom of choice is actually going to matter.</p>
<p>In the last post we talked about <em>models</em> of the ZF axioms, which are collections of sets that obey all the rules. Given a model, Kurt Gödel defined something called the <a href="https://en.wikipedia.org/wiki/Constructible_universe">constructible universe</a>, which is a sort of smaller model, contained in the original model, which can be built up explicitly from smaller pieces. The constructible universe usually doesn’t contain everything in the original model, but it will in some sense contain all the simple explicitly describable things in the original model.</p>
<p>But the constructible universe has some extra nice properties. One is that the constructible universe will always satisfy the axiom of choice, even if the original model did not!<strong title="This is how Gödel proved that the axiom of choice must be consistent with the ZF axioms: the constructible universe gives us a model of ZF that also satisfies the axiom of choice."><sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup></strong> Specifically, since we construct the universe in a specific <em>order</em>, everything we’ve constructed can be <a href="/blog/what-is-the-axiom-of-choice/#well-ordering">well-ordered</a>, which implies the axiom of choice. So any theorem that relies on the axiom of choice is automatically true as long as we’re only talking about sets in the constructible universe.</p>
<p>Shoenfield’s theorem extends that result even further. If you have a sufficiently simple question (for a <a href="https://en.wikipedia.org/wiki/Analytical_hierarchy">precise definition of sufficiently simple</a>), then the original model and the constructible universe must give the same answer. Since the axiom of choice always holds in the constructible universe, the answers to these simple questions can’t depend on whether you accept the axiom of choice or not.</p>
<p>What does that mean? Any simple-enough result that you can prove with the axiom of choice, you can also prove without it. That includes everything about Peano arithmetic and basic number theory, and also everything about the <a href="https://news.ycombinator.com/item?id=27855515">correctness of explicit computable algorithms</a>. It also includes <a href="https://en.wikipedia.org/wiki/Axiom_of_choice#cite_ref-16">\(P = NP\) and the Riemann Hypothesis</a>, and a number of other major unsolved problems.</p>
<p>There are questions that the axiom of choice really does matter for. But Gödel and Shoenfield’s results show that they have to be pretty far removed from anything finite or concretely constructible. So in practice, we can use the axiom of choice as a tool to make our work simpler, knowing that it won’t screw up anything practical that really matters.</p>
<hr />
<p><em>Do you have other questions about the axiom of choice? Another cool fact I don’t know about? Or some other math topic you’d like me to explain? Tweet me <a href="https://twitter.com/profjaydaigle">@ProfJayDaigle</a> or leave a comment below.</em></p>
<hr />
<div class="footnotes">
<ol>
<li id="fn:1">
<p>This is how Gödel proved that the axiom of choice must be consistent with the ZF axioms: the constructible universe gives us a model of ZF that also satisfies the axiom of choice. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Jay DaigleI got a lot of good, interesting comments on my recent post on the axiom of choice (both on the post itself, and in this very good Hacker News thread). I wanted to answer some common questions and share the most interesting thing I learned.What is the Axiom of Choice?2021-07-14T00:00:00-07:002021-07-14T00:00:00-07:00https://jaydaigle.net/blog/what-is-the-axiom-of-choice<p>One of the easiest ways to start a (friendly) fight in a group of mathematicians is to bring up the <a href="https://en.wikipedia.org/wiki/Axiom_of_choice">axiom of choice</a>. This axiom has a really interesting place in the foundations of mathematics, and I wanted to see if I can explain what it means and why it’s controversial. As a bonus, we’ll get some insight into what an axiom <em>is</em> and how to think about them, and about how we use math to think about the actual world.</p>
<p style="text-align: center;"><a href="https://xkcd.com/982"><img src="https://imgs.xkcd.com/comics/set_theory.png" alt="xkcd 982: "The axiom of choice allows you to select one element from each set in a collection—and have it executed as an example to the others"" /></a></p>
<p>The axiom seems pretty simple at first:</p>
<blockquote>
<p><strong>Axiom of Choice:</strong> Given a collection of (non-empty) sets, we can choose one element from each set.<strong title="We can be more formal by phrasing this in terms of _choice functions_: given a collection of sets X = {A} there is a function f : X \to ⋃ A such that f(A) ∈ A for each A ∈ X. But I want to keep the discussion as readable as possible if you're not comfortable with the language of formal set theory."><sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup></strong></p>
</blockquote>
<p>Most people find this principle pretty inoffensive, or even obviously right, on first contact. But it’s extremely controversial and produces strong emotions; and unusually for a mathematical debate, there’s essentially no hope of a clear resolution. And I want to try to explain why.</p>
<h3 id="easy-choices">Easy choices</h3>
<p>One reason the axiom of choice can <em>sound</em> trivial is that there are a lot of superficially similar rules that are totally fine; the controversial bit is subtle. So here are a few things that don’t cause controversy:</p>
<ul>
<li>If we have one set, we can definitely pick an element from it. The axiom of choice says if we have a collection of sets, we can pick one element from each set simultaneously.</li>
<li>
<p>But if we can pick an element from one set, can’t we pick an element from the first set, and then the second set, and then the third, etc.? Eventually we’ll pick an element from each set.</p>
<p>This works if we only have a <em>finite</em> collection of sets. So if I have five sets, I can pick one element from each set, by picking an element from the first set, then the second set, then the third, then the fourth, then the fifth. This is sometimes known as the <strong>axiom of finite choice</strong>. And no one argues about this.</p>
<p>But that approach doesn’t work if we have infinitely many sets.<strong title="Using this sort of process on an infinite set is called transfinite induction. Transfinite induction can sometimes allow us to make choices without the axiom, but only if we can put our sets in some order. Conversely, the axiom of choice allows us to use transfinite induction in cases we otherwise couldn't. (Corrected from earlier version; thanks to Sniffnoy for the correction)"><sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup></strong> If we pick elements from one set at a time, we’ll never get to all the sets; there will still be infinitely many left. This infinitude of sets is where the real problem lies. (And things get worse if we have an <a href="https://en.wikipedia.org/wiki/Uncountable_set">uncountably infinite</a> collection of sets, which is too many to even put in order!)</p>
</li>
</ul>
<p style="text-align: center;"><img src="/assets/blog/aoc/count_over_eleventy.jpg" alt="A kitten holding up its paws like it's counting. "Ai can count ober elebenty. Look see? Elebenty one elebenty two elebenty free..."" /></p>
<ul>
<li>
<p>Even if we have an infinite collection of sets, we <em>might</em> be able to pick an element from each set. If the sets have a nice enough pattern to them, we can give an explicit rule that lets us pick an element from each set consistently. For instance, if we have a bunch of sets of positive integers, we can always say something like “pick the smallest number in each set”.</p>
<p>But not every collection of sets allows a deterministic rule like this.<strong title="The set of real numbers doesn't have a smallest element or a largest element. Nor does the set of positive real numbers, or the set of numbers between zero and one. So if we have a collection of sets of real numbers, the rule we used for sets of positive integers doesn't work."><sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup> </strong> The axiom of choice says that we can choose an element from each set, even if we can’t describe a rule for making that choice. If we have infinitely many pairs of shoes we don’t need the axiom of choice, since we can just take the left shoe from each pair; but if we have infinitely many pairs of socks, we do need the axiom of choice.<strong title="This example was originally offered by Bertrand Russell. "><sup id="fnref:4"><a href="#fn:4" class="footnote">4</a></sup></strong></p>
</li>
</ul>
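<p>As a toy illustration (the particular sets here are made up, and a computer can only hold finitely many of them anyway): an explicit rule like “pick the smallest element” is a deterministic choice function, and applying it requires no axiom at all.</p>

```python
# A deterministic choice function for sets of positive integers:
# the rule "pick the smallest element" works uniformly, with no
# arbitrary choices. The trouble starts when no such rule exists.
sets = [{3, 7, 2}, {10, 5}, {42}]
choices = [min(s) for s in sets]
print(choices)  # [2, 5, 42]
```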
<h3 id="whats-the-problem">What’s the problem?</h3>
<p>The axiom of choice has weird effects precisely because it is so unlimited. It tells us that given any infinite collection of infinite sets, we can pick one option from each set, even if the sets are too big to really understand, and even if we don’t have any extra structure to guide us.</p>
<p>We can see how this matters by looking at a classic logic puzzle, and then taking it to infinity.</p>
<h5 id="the-finite-hat-puzzle">The (finite) hat puzzle</h5>
<p>Imagine a game show host<strong title="The _classic_ version of the puzzle features a sadistic prison warden. While that setup is traditional, it seems unnecessarily violent, so I've replaced it with something friendlier."><sup id="fnref:5"><a href="#fn:5" class="footnote">5</a></sup></strong> is going to line you up with 99 other people, and give each of you a hat to wear, which is either black or white. You can see everyone in front of you, including the colors of their hats; you can’t see your own hat, nor can you see anyone behind you.</p>
<p>Starting at the back of the line, the host will ask each person to guess whether their own hat is black or white. You’ll be able to hear the guesses, and whether they’re right or wrong.</p>
<p>Before the game starts, you all get a few minutes to talk and plan out your strategy. What should you do to get as many correct guesses as possible?</p>
<p>Stop and take a minute to think about this one. It doesn’t require any fancy mathematics, just a cute trick that’s surprisingly useful in other contexts.</p>
<p style="text-align: center;"><img src="https://64.media.tumblr.com/a6eec2d9352742626fe1fbe09b668cec/tumblr_nvpzxaBNRN1qgomego1_500.png" alt="Papyrus from Undertale dressed as Professor Layton: "Human would you like a puzzle?" Small child: "Not really" Papyrus: "Too bad you're getting a puzzle"" />
<em>Drawing by <a href="https://nightmargin.tumblr.com/post/130512412496/professor-skeleton-and-the-mystery-of-why-is">nightmargin</a> on Tumblr</em></p>
<p>As a hint, you can do really, really well. A simple approach that isn’t too bad is to have each odd-numbered person announce the color of the hat in front of them. This guarantees 50 right answers, and on average will get 75. But we can do much better than that.</p>
<p>Ready?</p>
<p>The person in the back of the line (call them \(A\)) doesn’t have any information, so there’s no possible way to guarantee they’ll get it right. But we can make sure everyone else wins. \(A\) can count up all the black hats in front of them and figure out if the number is even or odd. If it’s even, they’ll say “white”; if it’s odd, they’ll say “black”.</p>
<p>The second person \(B\) now knows whether \(A\) saw an even or odd number of black hats. But \(B\) can count up all the black hats <em>they</em> see. If \(A\) saw an even number of black hats, but \(B\) sees an odd number, the only explanation is that \(B\) is wearing a black hat.</p>
<p>The process continues down the line. \(C\) can tell whether \(A\) saw an even or odd number of black hats, and can also tell whether \(B\) was wearing black or white. Between that information, and seeing all the hats in front of them, \(C\) can figure out their own hat color.</p>
<p>(This sounds like it gets complicated very quickly, but we can streamline it. Count up all the black hats in front of you, and then add 1 to the number every time someone behind you says “black”. When the host reaches you, if the number is even you’re wearing a white hat, and if it’s odd you’re wearing a black hat.)</p>
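<p>If you’d like to convince yourself that the streamlined version works, here’s a small simulation sketch in Python (the function names and setup are mine, not part of the puzzle):</p>

```python
import random

def play(hats):
    """Simulate the parity strategy. hats[0] is the back of the line."""
    guesses = []
    # The person at the back announces the parity of the black hats
    # they see: "black" if odd, "white" if even.
    first = "black" if sum(h == "black" for h in hats[1:]) % 2 == 1 else "white"
    guesses.append(first)
    for i in range(1, len(hats)):
        # Streamlined rule: count the black hats in front of you, plus
        # every "black" said behind you (including the announcement).
        count = sum(h == "black" for h in hats[i + 1:])
        count += sum(g == "black" for g in guesses)
        guesses.append("black" if count % 2 == 1 else "white")
    return guesses

hats = [random.choice(["black", "white"]) for _ in range(100)]
guesses = play(hats)
# Everyone except possibly the first person is guaranteed correct:
print(all(g == h for g, h in zip(guesses[1:], hats[1:])))  # True
```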
<p>This exact algorithm is used by a lot of computer systems, especially when transmitting data over noisy connections. Computers store information in bytes, which are strings of eight bits. But often they will only use seven of the bits to store information (for instance, in standard <a href="http://rabbit.eng.miami.edu/info/ascii.html">ASCII encoding</a> there are 128 possible characters, represented as a 7-bit number). In transmission, the eighth bit can be used as a <a href="https://en.wikipedia.org/wiki/Parity_bit">parity bit</a>, which will be 1 if the other digits include an even number of “1”s, and 0 if they include an odd number of “1”s.</p>
<p>Thus every byte should have an odd number of “1”s, and if any byte has an even number of “1”s the system knows it contains an error. In our solution, \(A\) is effectively providing a parity bit for the string of hat colors, letting each player infer the information they don’t have: the color of their own hat.</p>
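<p>A minimal sketch of the odd-parity scheme described above (the helper names are made up for illustration):</p>

```python
def add_parity_bit(seven_bits):
    """Set the 8th bit so the byte has an odd number of 1s (odd parity)."""
    ones = bin(seven_bits).count("1")
    parity = 1 if ones % 2 == 0 else 0
    return (parity << 7) | seven_bits

def looks_valid(byte):
    """A received byte is valid if it has an odd number of 1s."""
    return bin(byte).count("1") % 2 == 1

b = add_parity_bit(ord("A"))  # 0b1000001 has two 1s, so the 8th bit is set
print(looks_valid(b))           # True
print(looks_valid(b ^ 0b100))   # False: a single flipped bit is detected
```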
<h5 id="the-uncountable-hat-puzzle">The uncountable hat puzzle</h5>
<p>That puzzle is fun, and the solution is clever, but there’s nothing especially paradoxical or brain-breaking about it. And it doesn’t involve the axiom of choice at all. But we can write a harder version that does use the axiom of choice, and has truly ridiculous results.<strong title="I think I first heard about this version from Greg Muller at https://cornellmath.wordpress.com/2007/09/13/the-axiom-of-choice-is-wrong/"><sup id="fnref:6"><a href="#fn:6" class="footnote">6</a></sup> </strong></p>
<p style="text-align: center;"><img src="/assets/blog/aoc/this-puzzle-reminds-me-of-a-puzzle.png" alt="Professor Layton's head: Doing a puzzle? That reminds me of a puzzle!" /></p>
<p>Suppose the game host now gets an infinite line of people, so each person can see an infinite collection of people in front of them. (Let’s assume there is a <em>first</em> person in the line, so it’s not infinite in both directions; you have infinitely many people in front of you, but only finitely many behind.) And instead of black or white hats, we’ll write a random real number on each person’s hat: you could have 3 or 7, or \(5.234\) or \(\pi^e\) or \(\Gamma(3.5^{7.2e^2})\). And just to make it harder, you can’t even hear what happens behind you.</p>
<p>This looks plainly impossible. No one who can see your hat can communicate with you at all. Even if they could, there are <a href="https://en.wikipedia.org/wiki/Cantor's_diagonal_argument">more possible hat labels</a> than there are people in line. It seems like everyone working together wouldn’t be able to guarantee even one right answer. But if we can use the axiom of choice, we can guarantee that infinitely many people get the right answer—and even better, only finitely many people will get it wrong. In our infinite line, there will be a <em>last</em> wrong person; everyone in front of them will guess right.</p>
<p>How can this possibly work? First we’ll think about the set of all possible sequences<strong title="If you don't know what a sequence is, just think of this as an infinite list."><sup id="fnref:7"><a href="#fn:7" class="footnote">7</a></sup></strong> of real numbers. (If we’re being fancy we might call this set \(\mathbb{R}^{\mathbb{N}}\).) We’ll say that two sequences are equivalent if they’re only different in finitely many places. So the sequences \( \Big( 1,2,3,4,5,6, \dots \Big) \) and \( \Big( 17, 2000 \pi, -\frac{345}{e}, 4, 5, 6, \dots \Big) \) are equivalent, but \( \Big( 1,0,3,0,5,0, \dots \Big) \) isn’t equivalent to either of them.</p>
<p>This gives us what’s called an <a href="https://en.wikipedia.org/wiki/Equivalence_relation">equivalence relation</a> on the set of real sequences. Equivalence relations are a widely useful tool, and I might write about them some other time, but for right now the important thing is that they <em>partition</em> the set, or subdivide it into smaller sets of things that are all equivalent to each other. Each thing will be in one and only one smaller set, which we call an <em>equivalence class</em>.</p>
<p>In our case, this means we’ve taken the set of all sequences of real numbers, and split it up into a bunch of equivalence classes of sequences. Every sequence belongs to exactly one equivalence class. And within each equivalence class, all the sequences are equivalent to each other—which means that they only have finitely many differences from each other.</p>
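<p>For finite sets, the partition an equivalence relation produces is easy to compute directly. A minimal sketch, using “same remainder mod 3” as the relation (since “finitely many differences” can’t be checked on truly infinite sequences):</p>

```python
def partition(items, equiv):
    """Group items into equivalence classes under the relation equiv."""
    classes = []
    for x in items:
        for c in classes:
            if equiv(x, c[0]):  # equivalent to this class's first member
                c.append(x)
                break
        else:
            classes.append([x])  # x starts a new equivalence class
    return classes

same_mod3 = lambda a, b: a % 3 == b % 3
assert partition(range(10), same_mod3) == [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
```

Every item lands in exactly one class, and within a class everything is equivalent to everything else, just as in the sequence case.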
<p>Now we use the axiom of choice. We can <em>choose</em> one representative sequence from each equivalence class, and have everyone memorize this set of chosen sequences. When we all line up, I can see everyone in front of me, so there are only finitely many hats I can’t see. That’s enough to pin down which equivalence class the true sequence of hat labels is in: there’s only one sequence on my list that can possibly be equivalent to it.</p>
<p>Now when the host reaches me, I don’t know what’s happened behind me. I don’t know the exact sequence of hat labels. But I don’t need to! I know which equivalence class the sequence is in, and I know which representative sequence we chose for that equivalence class. So I can tell the host the number for my position from the representative sequence that we chose.</p>
<p>I might not be right; I have no way to know until the host tells me. But since we’re all using the <em>same</em> representative sequence that we chose earlier, and that sequence differs from the “true” sequence in only finitely many places, an infinite number of us will answer correctly. And only a finite number will fail.</p>
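<p>We can’t run the axiom of choice on a computer, but we can watch the strategy work in a toy model of my own devising. Suppose (unrealistically) that every hat sequence has only finitely many nonzero entries. Then every sequence is equivalent to the all-zero sequence, so the natural representative of the single equivalence class is all zeros, and the strategy reduces to: everyone guesses 0.</p>

```python
# Toy model: hat sequences that are nonzero in only finitely many places.
# The all-zero sequence represents the (single) equivalence class, so the
# agreed-upon strategy is for every player to guess 0.
hats = [3.5, -2, 7] + [0] * 10_000  # a long tail of zeros
guesses = [0] * len(hats)

wrong = sum(1 for h, g in zip(hats, guesses) if h != g)
assert wrong == 3  # only finitely many players guess wrong
```

The real puzzle needs a representative for each of uncountably many classes, and that is exactly the step only the axiom of choice can supply.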
<h3 id="what-does-it-do-for-us">What does it do for us?</h3>
<p>The hat puzzle is obviously a little contrived, but the axiom of choice has a lot of surprising and sometimes disconcerting implications that are relevant to other fields of math. Some of these consequences are apparent paradoxes; others are things we would very much like to be true, and make the axiom of choice extremely useful.</p>
<h5 id="zorns-lemma">Zorn’s lemma</h5>
<p style="text-align: center;"><img src="/assets/blog/aoc/zorns_lemon.png" alt="What's yellow, sour, and equivalent to the axiom of choice? Zorn's Lemon!" /></p>
<p>Zorn’s Lemma is probably the most common use of the axiom of choice, but it’s a little tricky to explain. The formal statement is short enough:</p>
<blockquote>
<p><strong>Zorn’s Lemma:</strong> Every non-empty partially ordered set in which every totally ordered subset has an upper bound contains at least one maximal element.</p>
</blockquote>
<p>But it’s not super obvious what this means. The basic idea is that if we have some set where</p>
<ul>
<li>We can compare two elements and sometimes decide which one is “larger”;</li>
<li>but sometimes neither element counts as “larger”;</li>
<li><del>and we can never have an infinite collection of successively larger elements;</del>
any time we have an infinite collection of successively larger elements, there’s some other element bigger than all of them (thanks to Sniffnoy for the correction);</li>
</ul>
<p>then there must be a “largest” element.<strong title="Sometimes there can be _more than one_ largest element, which is a little weird. But since some pairs of elements can't be compared, you can have multiple elements that don't have anything above them. Imagine a company with two presidents: each of them is a highest-ranking person at the company. And that's why we say 'a' largest element rather than 'the' largest."><sup id="fnref:8"><a href="#fn:8" class="footnote">8</a></sup></strong></p>
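<p>In a finite partially ordered set the hypothesis of Zorn’s lemma holds automatically, so maximal elements always exist, and we can find them by brute force. A small sketch using divisibility as the order (my own example, just to illustrate “maximal” versus “largest”):</p>

```python
def maximal_elements(poset, leq):
    """Elements with nothing strictly above them in the partial order."""
    return [x for x in poset if not any(leq(x, y) and x != y for y in poset)]

divides = lambda a, b: b % a == 0  # partial order: a "divides" b
poset = [2, 3, 4, 5, 6, 8, 9, 10, 12]
print(sorted(maximal_elements(poset, divides)))  # [8, 9, 10, 12]
```

There are four maximal elements here, none of which divides another: several “presidents” with no one above them.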
<p>This is surprisingly useful, for one very specific reason: we can build up solutions to our problems step by step, and have a guarantee that we’ll finish. This is a tool we want to use all the time in math. We even tried it earlier: if we have a collection of sets, we can choose an element from the first one, and then the second one, and then the third one….</p>
<p>The problem we ran into is that this will eventually let us choose one element from each of a thousand sets, or a million, or a billion. But we have no guarantee that we can “eventually” choose from each of an infinite, possibly uncountable, collection of sets. Zorn’s lemma <a href="https://gowers.wordpress.com/2008/08/12/how-to-use-zorns-lemma/">solves this exact problem for us</a>, and lets us extend these constructions to infinity. And often when we’re defining functions on an infinite set, that’s exactly what we want to do.</p>
<p>Zorn’s lemma has one more important consequence: it is <em>equivalent</em> to the axiom of choice. We can use the axiom of choice to prove Zorn’s lemma; but we can also use Zorn’s lemma to prove the axiom of choice (by extending the axiom of finite choice to infinity, in exactly the way we were just discussing). We can’t duck the axiom-of-choice question by just making Zorn’s lemma into an axiom; the two are a package deal. If we want the power of Zorn’s lemma, we’re stuck with the axiom of choice and all the weirdness it implies.</p>
<h5 id="well-ordering"><a name="well-ordering">Well-ordering</a></h5>
<blockquote>
<blockquote>
<p>The axiom of choice is obviously true, the well-ordering principle obviously false, and who can tell about Zorn’s lemma?</p>
</blockquote>
</blockquote>
<blockquote>
<blockquote>
<blockquote>
<p><a href="https://books.google.com/books?id=eqUv3Bcd56EC&q=Bona#v=snippet&q=Bona&f=false">Jerry Bona</a></p>
</blockquote>
</blockquote>
</blockquote>
<p>These equivalences are a recurring theme in discussions of the axiom of choice. Another non-obviously equivalent statement is the Well-Ordering Principle, which says we can put any set \(X\) in a <a href="https://en.wikipedia.org/wiki/Well-order">definite order</a>, so that any non-empty subset has a “first” element. This is much stranger than it probably sounds. For instance, it’s really easy to put the real numbers in order, but most subsets won’t have a first element. (What’s the smallest real number? What’s the smallest positive real number? What’s the smallest number greater than 3?)</p>
<p>In fact, the fact that the usual order on the real numbers is <em>not</em> a well-ordering is a traditional source of internet math flame wars. There have been many <a href="https://forums.whirlpool.net.au/thread/9nxvlq19">forum threads</a> and <a href="https://polymathematics.typepad.com/polymath/2006/06/no_im_sorry_it_.html">blog comment threads</a> arguing endlessly about whether the infinitely repeating decimal \(.\bar{9}\) is actually equal to \(1\). (Yes, it is.)</p>
<p>Skeptics often suggest that maybe \(.\bar{9}\) isn’t <em>quite</em> \(1\), but just very close. Maybe it’s the last number before \(1\), the biggest number smaller than \(1\). But with the normal order for the reals, no such number exists. The reals are not well-ordered.</p>
<p>But with the axiom of choice, we can make up some <em>other</em> order for the real numbers, where every set has a first number. In fact, for any set, we can look at all the subsets and choose a first element for each one. We need to make sure that we do this consistently, but if we’re careful that’s not a problem, and so we can create a well-ordering on any set.</p>
<p>So what happens if we do this to the real numbers? There’s no real way to describe it—which is exactly why it requires the axiom of choice! You can make your favorite list of numbers and “choose” those to be first; the real difficulty is the need to make infinitely many choices. The axiom of choice lets us do this, but only in a totally non-explicit way that we can’t describe concretely.</p>
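<p>For a countable set we don’t need the axiom of choice at all: any explicit enumeration gives a well-ordering. Here’s a sketch for the integers, which have no smallest element in the usual order but do in the order 0, 1, -1, 2, -2, … (the function names are mine):</p>

```python
def rank(n):
    """Position of the integer n in the list 0, 1, -1, 2, -2, 3, -3, ..."""
    return 2 * n - 1 if n > 0 else -2 * n

def first(subset):
    """The 'first' element of a (finite) subset under the new order."""
    return min(subset, key=rank)

assert first([-5, -1, -3]) == -1  # the negatives now have a first element
assert first(range(-3, 4)) == 0   # and so does every other non-empty subset
```

The axiom of choice promises an ordering like this for the reals too, but without giving us any way to write down a <code>rank</code> function for them.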
<h5 id="the-banach-tarski-paradox">The Banach-Tarski “paradox”</h5>
<p style="text-align: center;"><a href="https://xkcd.com/804"><img src="/assets/blog/aoc/xkcd_pumpkin_carving_edit.png" alt="xkcd 804: Pumpkin Carving. 'I carved and carved, and the next thing I knew I had _two_ pumpkins.' 'I _told_ you not to take the axiom of choice.'" /></a></p>
<p>But the most famous consequence of the axiom of choice, which probably deserves its own post, is the <a href="https://en.wikipedia.org/wiki/Banach%E2%80%93Tarski_paradox">Banach-Tarski paradox</a>. Banach-Tarski says that if we have a solid three-dimensional ball, we can split it into five non-overlapping sets, rearrange these sets without any stretching or bending, and finish with two balls, each identical to the original ball.<strong title="The more general result is: given any two three-dimensional objects A and B, we can partition A into a finite collection of sets, and then rearrange those sets to get precisely B. In the special case people usually quote, A is 'a ball' and B is 'two balls'."><sup id="fnref:9"><a href="#fn:9" class="footnote">9</a></sup></strong></p>
<p>That means we’ve doubled the volume of our stuff just by moving the pieces around, which seems, um, implausible. We definitely can’t do that with a real ball. But with the axiom of choice, we can define “pieces” of the ball that are so strange that they don’t really have sizes at all. If we put them together one way, we get one volume; if we put them together a different way, we get a different volume. But the components don’t have a well-defined volume, so this is logically consistent. (And thus not actually a paradox, despite the name!)</p>
<h5 id="a-bunch-of-other-things">A bunch of other things</h5>
<p>There’s a <a href="https://en.wikipedia.org/wiki/Axiom_of_choice#Equivalents">long list of statements</a> that are equivalent to the axiom of choice. They show up in fields all over math, and algebra, analysis, and topology all become much simpler if these things are true:</p>
<ul>
<li>Every vector space has a basis</li>
<li>A product of non-empty sets is non-empty</li>
<li>Every set can be made into a group</li>
<li>The product of compact topological spaces is compact</li>
<li><a name="tarski">Tarski’s theorem:</a> If \(A\) is an infinite set, there’s a bijection between \(A\) and \( A \times A \)</li>
</ul>
<p>Since these are all equivalences, we can prove the axiom of choice from any one of them. If you believe <em>any</em> of these statements, you’re stuck believing all of them—and the axiom of choice as well, with all its bizarre ball-cloning hat-identifying implications.</p>
<h3 id="sois-it-true">So…is it true?</h3>
<blockquote>
<blockquote>
<p>Tarski…tried to publish his theorem (<a href="#tarski">stated above</a>) in the <em>Comptes Rendus Acad. Sci. Paris</em> but Fréchet and Lebesgue refused to present it. Fréchet wrote that an implication between two well known propositions is not a new result. Lebesgue wrote that an implication between two false propositions is of no interest. And Tarski said that after this misadventure he never tried to publish in the <em>Comptes Rendus</em>.</p>
</blockquote>
</blockquote>
<blockquote>
<blockquote>
<blockquote>
<p>Jan Mycielski, <a href="http://www.ams.org/notices/200602/fea-mycielski.pdf"><em>A System of Axioms of Set Theory for the Rationalists</em></a></p>
</blockquote>
</blockquote>
</blockquote>
<p>The big question is: <em>should</em> we believe any of these statements?</p>
<p>That might be a surprising question. Isn’t the whole point of math to have definitive, objectively correct answers? Either we can prove a result is true, or we can’t. We don’t generally ask whether we feel like believing a theorem. We proved it; we’re stuck with it.</p>
<p>But <em>axioms</em> are a little different. We need to decide on our axioms before we can prove things at all—or even decide what counts as a proof. Just like we can’t use a recipe to decide whether we want to make a cake or a cheeseburger, we can’t prove that an axiom is “correct”.</p>
<p>What we can do is look at a cake recipe, see what we’d have to do, and decide that maybe we don’t feel like making a cake after all. And we can look at what an axiom allows us to prove, and decide that maybe we don’t like those results and should pick some different axioms that don’t allow them.</p>
<h5 id="the-zermelo-fraenkel-axioms">The Zermelo-Fraenkel Axioms</h5>
<p>The standard system of axioms we use in math is called <a href="https://en.wikipedia.org/wiki/Zermelo%E2%80%93Fraenkel_set_theory">Zermelo-Fraenkel Set Theory</a>, or just ZF. These are the rules we use as the base for all our work. If we can use them to prove a statement, we just say it’s proven; if a statement contradicts the ZF axioms, we’ve disproven it.</p>
<p style="text-align: center;"><img src="/assets/blog/aoc/set-theory-is-enough-theory-already.jpg" alt="Grumpy Cat says: Set Theory / is enough theory already" /></p>
<p>If the axiom of choice contradicted ZF, then we could forget about it and move on with our lives. But in 1938 Kurt Gödel proved that this isn’t the case: you can have fully consistent systems that respect both the ZF axioms and the axiom of choice.</p>
<p>Similarly, if we could prove the axiom of choice from the ZF axioms, we would have to either accept it as true, or completely rework all the foundations of math<strong title="We've actually done that before. At the beginning of the 20th century, Bertrand Russell and others found deep contradictions in the naive version of set theory in use at the time, and the ZF axioms were developed to avoid those problems. But we'd rather avoid doing it again."><sup id="fnref:10"><a href="#fn:10" class="footnote">10</a></sup></strong>. But we can’t do that either. And this is more than just acknowledging that we haven’t proved it <em>yet</em>: in 1963 Paul Cohen invented a technique called forcing to prove that if ZF is consistent, then we can never prove the axiom of choice from the rest of the ZF axioms.</p>
<p>This combination of results feels a little weird, because it’s so different from the way we usually approach math. Math has a reputation for black-and-white thinking<strong title="I don't like this reputation in any context. Mathematical thinking creates tons of space for nuance and subtlety and shades of gray. But that's probably a different essay."><sup id="fnref:11"><a href="#fn:11" class="footnote">11</a></sup></strong>: there’s a right answer to every question, and other answers are wrong. But here I’m telling you that there is no right answer. We can accept or reject the axiom of choice, and it works equally well either way.</p>
<h5 id="independence-is-normal">Independence is normal</h5>
<p>But this is actually perfectly normal! Suppose I asked you “are triangles isosceles?” The right answer isn’t “yes” <em>or</em> “no”: it depends on the triangle. And there are some theorems we can prove about isosceles triangles, like “if a triangle is isosceles, it has two equal angles”. And there are different theorems we can prove about non-isosceles triangles. The “axiom of isosceles-ness” is independent from the definition of a triangle.</p>
<p>But that might sound a little glib; no one talks about triangles like that. A better example is Euclidean geometry. When Euclid gave his formalization of geometry in <em>Elements</em>, he began with <a href="https://en.wikipedia.org/wiki/Euclidean_geometry#Axioms">five axioms</a> (or “postulates”, as you might have called them in high school geometry). The fifth (and final) postulate, called the <a href="https://en.wikipedia.org/wiki/Parallel_postulate">parallel postulate</a>, proved to be rather awkward.</p>
<blockquote>
<p><strong><a href="https://en.wikipedia.org/wiki/Parallel_postulate">Parallel postulate</a>:</strong> There is at most one line that can be drawn parallel to another given one through an external point.<strong title="This version is more precisely known as Playfair's axiom. Euclid's phrasing (translated from Greek) was 'if a straight line falling on two straight lines make the interior angles on the same side less than two right angles, the two straight lines, if produced indefinitely, meet on that side on which the angles are less than two right angles.' But Playfair's axiom is much simpler to state, and the two statements are equivalent."><sup id="fnref:12"><a href="#fn:12" class="footnote">12</a></sup></strong></p>
</blockquote>
<p>This axiom is extremely important to geometry, but is much more complex and less self-evident than the other four axioms, which are statements like “all right angles are equal” and “we can draw a line connecting any two points”. Two millennia of mathematicians tried to remove this awkward complexity by proving the parallel postulate just from Euclid’s other axioms.</p>
<p>Then in the 1800s, we finally solved this problem—in the other direction. Euclidean geometry, including the parallel postulate, is completely consistent; but it’s also consistent to work with <em>non</em>-Euclidean geometries, in which the parallel postulate is false. Mathematicians constructed <a href="https://en.wikipedia.org/wiki/Non-Euclidean_geometry#Models_of_non-Euclidean_geometry">models</a> of elliptic geometry, in which there are no parallel lines, and of hyperbolic geometry, in which parallel lines are not unique.</p>
<p>What is a model? It’s just something that obeys all the axioms. So the work we do in high school, with pencil and paper on a flat surface, is a model of Euclidean geometry. It follows all five axioms, and any theorem that follows from the Euclidean axioms will be true of our pencil-and-paper work.</p>
<p>But if we work on the surface of a sphere, we get a model of non-Euclidean elliptic geometry. We can define a line to be a <a href="https://en.wikipedia.org/wiki/Great_circle">great circle</a>, a circle that goes fully around a sphere the long way. Any two points lie on exactly one great circle, so these “lines” obey Euclid’s first four axioms. But with a little bit of playing around, you can see that any pair of great circles will intersect in two points. This model doesn’t have any parallel lines at all.</p>
<p style="text-align: center;"><img src="/assets/blog/aoc/Grosskreis.svg" alt="Image of a sphere, with great circles marked." /></p>
<p style="text-align: center"><em>The solid curves are great circles. The solid blue curve is the equator.</em> <br />
<em>The dashed curves aren’t great circles, so they don’t count as lines.</em> <br />
<em>Adapted from <a href="https://commons.wikimedia.org/wiki/File:Grosskreis.svg">Wikimedia Commons</a></em></p>
<p>We can also build <a href="https://en.wikipedia.org/wiki/Poincar%C3%A9_disk_model">models of hyperbolic geometries</a>, but they’re a little harder to describe. Even one of these models is enough to know that we can’t prove the parallel postulate from Euclid’s other axioms—at least, not unless the other axioms are themselves contradictory. Nor can we disprove it. We have to <em>decide</em> if we want to use the parallel postulate.</p>
<p>This is exactly what Gödel and Cohen did for the axiom of choice. Gödel constructed a model of ZF set theory with choice; Cohen constructed a model of ZF set theory without choice. So we have to decide if we want to use the axiom of choice. And this brings us back to the same question: what are we trying to describe? Is the world we want to study a model of ZF with choice, or without?</p>
<h5 id="how-do-we-choose">How do we choose?</h5>
<p>To decide if we should adopt an axiom, we need to know what our goals are, and what we’re trying to describe. Euclidean geometry is good for arranging furniture in my room, but it’s bad for planning long-range flights, for which the fact that we live on a sphere matters.</p>
<p style="text-align: center;"><img src="/assets/blog/aoc/great_circle_routes.png" alt="A diagram of a great circle flight path. First on a rectangular/planar projection, where it doesn't look like a straight line; then on a sphere, where it does." /></p>
<p style="text-align: center"><em>Plane flight paths don’t look like straight lines on a flat map.</em> <br />
<em>On a sphere we see they really are the shortest, “straightest” path.</em> <br />
<em>Adapted from <a href="https://commons.wikimedia.org/wiki/File:Different_map_projections.png">Wikimedia Commons</a> CC-BY-SA-3.0</em></p>
<p>We should ask the same question about the axiom of choice: what are we trying to describe? Does the axiom of choice bring us closer to describing the world accurately, or farther away? Is the world we want to study a model of ZF with choice, or without?</p>
<p>The obvious answer is that the axiom of choice has absurd and unrealistic results. In the real world we can’t slice up one billiard ball and assemble the pieces into two billiard balls, or save infinitely many people in the hat puzzle. So if the axiom of choice says we can, it must not be describing the real world.</p>
<p>But this argument isn’t terribly persuasive, because every single thing about the uncountable hat puzzle is physically absurd. Even the setup is ridiculous: we can’t have an infinite line of people, and if we were somehow put in an infinite line, we wouldn’t be able to see all the people in it, let alone the numbers on their hats.</p>
<p>The step where we use the axiom of choice is even more unrealistic. We take the uncountably infinite set of real sequences; we partition it into an uncountably infinite collection of infinite sets of sequences; and then we ask everyone to memorize an (infinite!) sequence from each of these infinitely many infinite sets.</p>
<p>I’d have a hard time remembering one list of a hundred numbers. Memorizing a thousand lists of a thousand numbers is extremely unlikely; memorizing infinitely many lists of infinitely many numbers is flatly impossible. And that’s before we ask how we can communicate the lists we’ve chosen to each other, so that each of the (infinitely many) people memorize the <em>same</em> infinite collection of infinite lists.</p>
<p>The Banach-Tarski argument isn’t any better. It splits the ball into only five pieces, sure, but each of those pieces is infinitely complex, enough so that you can’t concretely describe their shapes, let alone actually cut a ball into those pieces. The informal explanation that “you can slice a ball into five pieces and reassemble those pieces into two balls” is not true, because there’s no real way to produce the pieces you need.<strong title="Feynman has a story about this in his memoir. A math grad student described the Banach-Tarski paradox to him, and he bet that it was made up, rather than a real theorem. He was able to wriggle out of losing by pointing out that the grad student had described cutting up an _orange_, and you can't slice a physical object made up of atoms infinitely finely."><sup id="fnref:13"><a href="#fn:13" class="footnote">13</a></sup></strong></p>
<p>In the real world we <em>never see infinite sets</em>. We pretend some sets are infinite because it makes our lives easier. But any principle that <em>only</em> kicks in at infinity will never make contact with reality.</p>
<p style="text-align: center"><img src="/assets/blog/aoc/einstein_stupidity.jpeg" alt="Picture of Einstein: Two things are infinite: the universe and human stupidity; and I'm not sure about the universe." height="50%" width="50%" /></p>
<p style="text-align: center"><em>Einstein <a href="https://quoteinvestigator.com/2010/05/04/universe-einstein">probably didn’t say this</a>, but it’s a good line.</em></p>
<h3 id="not-as-crazy-as-it-seems">Not as crazy as it seems</h3>
<p>This might feel like it’s dodging the question, though. If infinity is fake, why should we use axioms that only matter for infinity? And if we are going to say things about infinity, shouldn’t they make sense?</p>
<p>Maybe it’s fine for a physicist to dismiss mathematical abstractions as unphysical and thus irrelevant. But math is about reasoning through the consequences of abstract hypotheticals! If we’re going to adopt a foundational principle like the axiom of choice, we should really mean that we believe it in every abstract hypothetical situation we’re going to apply it in.</p>
<p>But after we realize how infinity works, our absurd results look somewhat more reasonable.<strong title="This is a common mathematical rhetorical trick. Earlier I was trying to convince you that the implications of the axiom of choice were really weird. Now I'm going to try to convince you that they're perfectly reasonable. This exact two-step happens quite a lot in math exposition. I suspect this is due partially to the demands of pedagogy, and partly to the way we form our mathematical intuition."><sup id="fnref:14"><a href="#fn:14" class="footnote">14</a></sup></strong> Our “successful” strategy in the infinite hat game actually doesn’t give us all that much. Sure, only finitely many people lose; some person in the line will be the last to answer wrong. But what would this look like in practice?</p>
<p>You could imagine the first hundred people all getting the question wrong. But that’s okay; only finitely many people will get it wrong. Then the first thousand people all get it wrong. But we know that at some point a last person will get it wrong and everyone left will get it right. A million people all get it wrong. Everyone gets bored. The game show host decides to leave. And sure enough, only finitely many people ever answered the question wrong!</p>
<p>The axiom of choice argument somehow doesn’t do anything after a finite number of answers. You could have the first million, or the first trillion, people all get the question wrong, and that wouldn’t contradict our proof. All the weirdness happens out at infinity—and we already know that infinity is deeply weird.</p>
<h3 id="whats-the-point">What’s the point?</h3>
<p>The axiom of choice is logically independent of our axioms for set theory, so we can’t ever prove it true or false. And it says deeply strange things about deeply strange situations that can never really happen. So why does it matter?</p>
<h5 id="infinity-is-fake-but-useful">Infinity is fake <em>but useful</em></h5>
<p>The answer is the same as the reason we use infinity at all. Everything we’ve ever seen is finite and discrete: objects are made out of atoms, and even if space and time aren’t truly quantized, our ability to measure them definitely is. But it’s extremely convenient to pretend that reality is continuous, which allows us to solve problems with calculus and other clever math tricks. If the world is “close enough” to being continuous, our answers will be good enough for whatever we’re doing.</p>
<p>Any infinity we care about will come from a limit of finite things. I can measure the width of my office in meters, or centimeters, or millimeters. With the right equipment I could measure it in micrometers or nanometers. I can’t ever measure it with infinite precision, but I can <em>imagine</em> doing that. And it’s really convenient to say the width is a real number, rather than to insist that it must <em>really</em> be some integer number of picometers.</p>
<p>This exact reasoning is basically how all of calculus works. If I want to know how fast my car is going in miles per hour, I can measure the distance it travels in miles over the course of an hour. Or I can see how many miles it goes in a minute, and multiply by sixty. I could measure the number of miles it goes in a second, and multiply by 3600 (or more realistically, measure the number of <em>feet</em> it goes in a second, and multiply by 3600/5280).</p>
<p>But what is the speed “right now”? We imagine taking measurements over these shorter and shorter intervals; in the limit, when our interval is “infinitely short”, we get the instantaneous velocity. And that’s a derivative, which is an extremely powerful tool for doing math and physics.</p>
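<p>Numerically, that limit is just a sequence of average speeds over shrinking windows. A quick sketch with a made-up position function (my own example, not from the text):</p>

```python
def position(t):
    """Miles traveled after t hours (an invented example trajectory)."""
    return 30 * t + 20 * t**2

def average_speed(t, dt):
    """Average speed in mph over the window [t, t + dt]."""
    return (position(t + dt) - position(t)) / dt

# Shrinking the measurement window homes in on the instantaneous speed.
# Here the averages work out to exactly 70 + 20*dt: 90, 72, 70.02, ...
for dt in [1.0, 0.1, 0.001, 1e-6]:
    print(average_speed(1.0, dt))
# The derivative at t = 1 is exactly 30 + 40 * 1 = 70 mph.
```

Each measurement is a real, finite one; the derivative is what we imagine they’re converging to.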
<p>But we can’t <em>actually</em> measure the distance traveled in an infinitely small window of time. (Nor can we measure the infinitely small time itself.) We’re taking some real, physical, finite measurements. We can measure how far a car goes in one second, multiply by 3600/5280, and then display that number on the dashboard. But the infinite version is something we only imagine.</p>
<h3 id="just-relax">Just relax</h3>
<p>If we’re trying to model the world, any infinite set we have to deal with will be a limit of finite sets. And any infinite family of infinite sets will be a limit of finite families of finite sets. And we know we have choice for finite sets of finite sets. So we can always get choice for these specific infinite sets, if we really need it—just by taking the limit of the elements we chose from our finite families.</p>
<p>What the axiom of choice says is: don’t worry about it. You don’t have to explain <em>how</em> your family of sets came from a finite family. You don’t have to explain <em>how</em> you’re choosing elements. We’ll just assume you can make it work somehow.</p>
<p>That’s what axioms are for. They tell us what we want to just assume we can do, without really explaining how. Our axioms are a list of things we don’t want to have to think about. And in practice, we don’t have to think about whether we can make choices. Any time it really matters, we can.</p>
<hr />
<div class="footnotes">
<ol>
<li id="fn:1">
<p>We can be more formal by phrasing this in terms of <em>choice functions</em>: given a collection of sets \(\mathcal{X} = \{A\}\) there is a function \(f : \mathcal{X} \to \bigcup_{A \in \mathcal{X}} A\) such that \(f(A) \in A \) for each \(A \in \mathcal{X} \). But I want to keep the discussion as readable as possible if you’re not comfortable with the language of formal set theory. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>Using this sort of process on an infinite set is called <a href="https://en.wikipedia.org/wiki/Transfinite_induction">transfinite induction</a>. <del>If we allow transfinite induction then we get the axiom of choice for free. But the axiom of choice also implies that we can do transfinite induction; the two concepts are logically equivalent.</del> Transfinite induction can sometimes allow us to make choices without the axiom, but only if we can put our sets in some order. Conversely, the axiom of choice allows us to <a href="https://en.wikipedia.org/wiki/Transfinite_induction#Relationship_to_the_axiom_of_choice">use transfinite induction in cases we otherwise couldn’t</a>.</p>
<p>Thanks to Sniffnoy for a helpful correction here. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3">
<p>The set of real numbers doesn’t have a smallest element or a largest element. Nor does the set of positive real numbers, or the set of numbers between zero and one. So if we have a collection of sets of real numbers, the rule we used for sets of positive integers doesn’t work. <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
<li id="fn:4">
<p>This example was originally offered by Bertrand Russell. <a href="#fnref:4" class="reversefootnote">↩</a></p>
</li>
<li id="fn:5">
<p>The <em>classic</em> version of the puzzle features a sadistic prison warden. While that setup is traditional, it seems unnecessarily violent, so I’ve replaced it with something friendlier. <a href="#fnref:5" class="reversefootnote">↩</a></p>
</li>
<li id="fn:6">
<p>I think I first heard about this version from <a href="https://cornellmath.wordpress.com/2007/09/13/the-axiom-of-choice-is-wrong/">Greg Muller</a>. <a href="#fnref:6" class="reversefootnote">↩</a></p>
</li>
<li id="fn:7">
<p>If you don’t know what a sequence is, just think of this as an infinite list. <a href="#fnref:7" class="reversefootnote">↩</a></p>
</li>
<li id="fn:8">
<p>Sometimes there can be <em>more than one</em> largest element, which is a little weird. But since some pairs of elements can’t be compared, you can have multiple elements that don’t have anything above them; mathematicians call these <em>maximal</em> elements. Imagine a company with two presidents: each of them is the highest-ranking person at the company. And that’s why we say “a” largest element rather than “the” largest. <a href="#fnref:8" class="reversefootnote">↩</a></p>
</li>
<li id="fn:9">
<p>The more general result is: given any two bounded three-dimensional objects \(A\) and \(B\), each containing a solid ball, we can partition \(A\) into a finite collection of sets, and then rearrange those sets to get precisely \(B\). In the special case people usually quote, \(A\) is “a ball” and \(B\) is “two balls”. <a href="#fnref:9" class="reversefootnote">↩</a></p>
</li>
<li id="fn:10">
<p>We’ve actually done that before. At the beginning of the 20th century, Bertrand Russell and others found deep contradictions in the naive version of set theory in use at the time, and the ZF axioms were developed to avoid those problems. But we’d rather avoid doing it again. <a href="#fnref:10" class="reversefootnote">↩</a></p>
</li>
<li id="fn:11">
<p>I don’t like this reputation in any context. Mathematical thinking creates tons of space for nuance and subtlety and shades of grey. But that’s probably a different essay. <a href="#fnref:11" class="reversefootnote">↩</a></p>
</li>
<li id="fn:12">
<p>This version is more precisely known as <a href="https://en.wikipedia.org/wiki/Playfair's_axiom">Playfair’s axiom</a>. Euclid’s phrasing (translated from Greek) was “if a straight line falling on two straight lines make the interior angles on the same side less than two right angles, the two straight lines, if produced indefinitely, meet on that side on which the angles are less than two right angles.” But Playfair’s axiom is much simpler to state, and the two statements are equivalent. <a href="#fnref:12" class="reversefootnote">↩</a></p>
</li>
<li id="fn:13">
<p>Feynman has a story about this in <a href="https://en.wikipedia.org/wiki/Surely_You're_Joking,_Mr._Feynman!">his memoir</a>. A math grad student described the Banach-Tarski paradox to him, and he bet that it was made up, rather than a real theorem. He was able to wriggle out of losing by pointing out that the grad student had described cutting up an <em>orange</em>, and you can’t slice a physical object made up of atoms infinitely finely. <a href="#fnref:13" class="reversefootnote">↩</a></p>
</li>
<li id="fn:14">
<p>This is a common mathematical rhetorical trick. Earlier I was trying to convince you that the implications of the axiom of choice were really weird. Now I’m going to try to convince you that they’re perfectly reasonable. This exact two-step happens quite a lot in math exposition. I suspect this is due partially to the demands of pedagogy, and partly to the way we form our mathematical intuition. <a href="#fnref:14" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Jay DaigleOne of the easiest ways to start a (friendly) fight in a group of mathematicians is to bring up the axiom of choice. I'll explain what it is, why it's so controversial, and hopefully shed some light on how we choose axiomatic systems and what that means for the math we do.