<p><em>Jay Daigle is a professor of mathematics at The George Washington University in Washington, D.C. In addition to his research in number theory, he brings a mathematical style to thinking about philosophy, politics, social dynamics, and everyday life.</em></p>

<h1>A Fictional History of Numbers, Part 4: Imagination, Complexity, and the Fundamental Theorem of Algebra</h1>
<p><em>Jay Daigle · 2023-07-25 · <a href="https://jaydaigle.net/blog/fictional-history-of-numbers-part-4">https://jaydaigle.net/blog/fictional-history-of-numbers-part-4</a></em></p>

<p>Welcome back to our survey of the different types of “numbers” that mathematicians work with, and what kind of questions lead us to study those numbers. This week we’re going to put a bow on the first collection of questions we asked and tie them all together.</p>
<p>In the first few essays in this series, we saw two different approaches to finding new types of numbers. But they gave us different—and overlapping, but distinct—sets of numbers. Today we’ll see what happens when we combine both techniques, and develop the <em>complex numbers</em>. This won’t finish our quest to find weird numbers that mathematicians care about; far from it. But it will finish one line of questions, and cover pretty much everything we normally see in high school algebra and calculus.</p>
<p>But before I start, I want to take a moment to thank everyone who has donated to my <a href="https://ko-fi.com/jaydaigle">Ko-Fi account</a>. Tips are never necessary, but always appreciated, and they really do make a difference and help me to keep writing essays like this one.</p>
<h2 id="building-the-complex-numbers">Building the complex numbers</h2>
<h3 id="the-two-approaches">The two approaches</h3>
<p>In <a href="https://jaydaigle.net/blog/fictional-history-of-numbers-part-1/">part 1</a>, we started with the natural numbers, which are the basic numbers we use to count. Using basic arithmetic operations, we introduced negative numbers to get the integers, then fractions to get the rational numbers. We ended by asking all polynomial equations to have solutions, which gave us the algebraic numbers. These include square roots and cube roots of all the rational numbers, and also some stranger things like the solutions to \(x^5+x+3=0\). This gave us a set that was <em>algebraically closed</em>: any polynomial equation defined with algebraic numbers will have a solution that is an algebraic number. So algebraic tools couldn’t push us any farther.</p>
<p>In <a href="https://jaydaigle.net/blog/fictional-history-of-numbers-part-2/">part 2</a> we asked a different question, about measurement and approximation. We wanted areas and lengths to all correspond to numbers, and this led to the idea of <em>completeness</em>, where any number we can approximate with rational numbers should actually exist. Completing the rational numbers gave us the real numbers. We might call this the <em>analytic</em> approach to extending the rationals, in contrast to the algebraic approach of part 1.</p>
<p>In <a href="https://jaydaigle.net/blog/fictional-history-of-numbers-part-3/">part 3</a> we showed that not every real number is algebraic; in particular \(\pi\) is a <em>transcendental</em> number, which isn’t the solution to any polynomial equation. But more generally, we showed that the algebraic numbers are <em>countable</em>, which means we can describe any one of them with a finite amount of information, but the real numbers are <em>uncountable</em>, which means it takes an infinite amount of information to describe most of them. There aren’t just more real numbers than algebraic numbers; there are infinitely more.</p>
<p>But that doesn’t mean the real numbers cover everything! There are algebraic numbers that aren’t real numbers. And there are real polynomials that don’t have real solutions. So what happens if we start with the real numbers and do part 1 again? Can we get a field with the completeness of the reals, but also the nice algebraic closure of the algebraic numbers?</p>
<h3 id="keeping-it-unreal">Keeping it unreal</h3>
<p>How do we know there are algebraic numbers that aren’t real?</p>
<p>We can start with the quadratic polynomial equation \(x^2+1=0\). This is defined entirely with real numbers. But when we graph the function \(x^2+1\), we see it doesn’t cross the \(x\)-axis, which means that \(x^2+1=0\) doesn’t have a real solution.</p>
<p><img src="/assets/blog/fictional-history-of-numbers/no-root-quadratic.png" alt="Graph of x^2+1" class="blog-image center" /></p>
<p>We maybe should have expected this. We know that \(\sqrt{2}\) is real, because we can find rational numbers whose squares are between \(1\) and \(2\), or between \(1.9\) and \(2\), or between \(1.99999\) and \(2.\) That gives us a sequence of numbers that approximates \(\sqrt{2}\), and thus \(\sqrt{2}\) must be real. But we can’t do the same trick for \(-1\): <em>no</em> rational number has a square less than zero, so we can’t find anything that’s <em>close to</em> the square root of \(-1\).</p>
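<p>If you want to see that squeezing trick in action, here’s a small Python sketch (just an illustration of the idea above, not anything from the original argument). It uses exact rational arithmetic to pin \(\sqrt{2}\) between two fractions:</p>

```python
from fractions import Fraction

# Bisection: squeeze sqrt(2) between rationals whose squares
# straddle 2 -- exactly the trick described above.
lo, hi = Fraction(1), Fraction(2)   # since 1^2 < 2 < 2^2
for _ in range(20):
    mid = (lo + hi) / 2
    if mid * mid < 2:
        lo = mid
    else:
        hi = mid

print(float(lo), float(hi))  # both agree with sqrt(2) to about 6 decimal places
```

<p>Every number in the sequence is rational, but the thing they’re closing in on is not. And there is no analogous sequence closing in on a square root of \(-1\): the test <code>mid * mid &lt; -1</code> would never be true for any rational <code>mid</code>.</p>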
<p>But we can see this more directly by using the core principles of the real numbers: they’re a complete ordered field. Since they’re ordered, every (non-zero) number must be either positive or negative. Since they’re an ordered <em>field</em>, the product of two positive numbers must be positive, and the product of two negative numbers must also be positive.</p>
<p>So suppose we have a number \(i\) that solves this equation. Then \(i^2 = -1\), which means \(i\) can’t be positive, and also can’t be negative. It’s clearly not zero. So it can’t be a real number at all. But it’s definitely algebraic: it’s the solution to \(x^2+1=0\).</p>
<p>Can we find other non-real algebraic numbers? Sure! There’s \(2i\) and \(3i\) and \(1+i\) and…. We can use \(i\) to build lots more non-real numbers.</p>
<p>But that’s it. It turns out that if we take the real numbers, and then add in everything we can build with the number \(i\), we have all the algebraic numbers. And in fact we have the solution to any polynomial we can write down with real numbers. This gives us everything we could ever want.<strong title="At least, until we come up with a new question to ask."><sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup></strong> But to see <em>why</em> this gets us everything, we’ll need to take a bit of a detour.</p>
<h3 id="imaginary-and-complex-numbers">Imaginary and complex numbers</h3>
<p>We want to look at all the numbers we can build by combining the real numbers and \(i.\) These numbers will all look like \(a + bi\) where \(a\) and \(b\) are real numbers.<strong title="We don't have to worry about terms with i^2 or anything, because i^2 = -1 is a real number again."><sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup></strong> And we call the set of all these things the <em>complex numbers</em>, abbreviated \(\mathbb{C}.\) If we have a complex number \(z = a + bi\) then we say the real number \(a\) is the <em>real part</em> and the real number \(b\) is the <em>imaginary part</em>.</p>
<p>Remember our goal was to extend the real numbers to something algebraically nice. So we should start by making sure that we can still do arithmetic operations—that complex numbers are a field. Now, addition and subtraction are fine, since we can use the rules
\[
\begin{aligned}
(a+bi) + (c+di) & = (a+c) + (b+d) i \\
(a+bi) - (c+di) & = (a-c) + (b-d) i .
\end{aligned}
\]
Multiplication is also pretty straightforward. By <a href="https://en.wikipedia.org/wiki/FOIL_method">FOILing</a> we get
\[
\begin{aligned}
(a+bi)(c+di)
& = ac + adi + bci + bdi^2 \\
& = ac + adi + bci + bd(-1) \\
& = (ac - bd) + (ad +bc)i
\end{aligned}
\]
so if we multiply two complex numbers, we get another.</p>
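<p>If you’d like to double-check the FOIL formula, Python’s built-in complex numbers (which write \(i\) as <code>1j</code>) make a handy sanity check—this is just an illustration, not part of the argument:</p>

```python
def cmul(a, b, c, d):
    """Multiply (a+bi)(c+di) using the FOIL formula from the text."""
    return (a * c - b * d, a * d + b * c)

# Compare our formula against Python's built-in complex arithmetic
re_part, im_part = cmul(1, 2, 3, 4)
builtin = complex(1, 2) * complex(3, 4)
print((re_part, im_part), builtin)  # (-5, 10) matches (-5+10j)
```

<p>The formula only ever uses real-number additions and multiplications of \(a, b, c, d\), which is the whole point: multiplying two complex numbers always lands us back on another complex number.</p>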
<p>Division is a little trickier; we don’t have a good way to distribute something like \( \frac{a+bi}{c+di}. \) Here we need to be clever, and maybe start by asking a new question that introduces a second big idea.</p>
<p>We defined \(i\) to be the square root of \(-1\). That is, \(i^2=-1\) is the definition of the number \(i.\) But what happens if we square the number \(-i\)? We have
\[
(-i)^2 = (-1)^2 (i)^2 = 1 \cdot (-1) = -1.
\]
So we have two different numbers that both satisfy our equation \(x^2 = -1.\) How do we know which is the “positive” \(i\), and which is the “negative” \(-i\)?</p>
<p>And the answer is that there’s no real difference! A positive number like \(4\) has two square roots, \(2\) and \(-2\), and since they’re both real numbers one is positive and the other is negative. A negative number like \(-1\) will also have two square roots, but since they <em>aren’t</em> real numbers, neither one of them is actually positive. We just pick one to call \(i\), and call the other one \(-i\)—but it doesn’t matter which one is which. And <em>that</em> means that if we swap \(i\) and \(-i\), nothing else should change. Thus we can define an operation called <em>complex conjugation</em> by the rule
\[
\overline{a + bi} = a - bi.
\]
This operation swaps \(i\) with \(-i,\) without changing anything else about our number.<strong title="This is the simplest example of a really interesting field called [Galois theory]. The complex conjugation operation we constructed is an element of the _Galois group_ of the complex numbers over the reals."><sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup></strong></p>
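<p>One way to see that “nothing else changes”: conjugation gets along with arithmetic. Conjugating a sum or a product gives the same answer as conjugating each piece first. A quick check with Python’s built-in <code>conjugate</code> method (a spot check on one example, not a proof):</p>

```python
z, w = complex(2, 5), complex(-1, 3)

# Swapping i for -i is consistent with arithmetic:
# conjugating a sum or product equals conjugating each piece first.
print((z + w).conjugate() == z.conjugate() + w.conjugate())  # True
print((z * w).conjugate() == z.conjugate() * w.conjugate())  # True
```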
<p>But the complex conjugate has another useful property. What happens if we multiply a number by its own conjugate? We get
\[
\begin{aligned}
(a+bi) \overline{(a+bi)}
&= (a+bi)(a-bi) \\
&= a^2 - abi + abi - b^2 i^2 \\
&= a^2 - b^2 (-1) \\
&= a^2+b^2.
\end{aligned}
\]
If we multiply any complex number by its conjugate, we get a real number—and in fact, a positive real number, as long as we didn’t start with 0.</p>
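<p>A quick numerical check that \(z \bar{z}\) really does come out real and positive:</p>

```python
z = complex(3, 4)
prod = z * z.conjugate()
print(prod)  # (25+0j): real, positive, and equal to 3^2 + 4^2
```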
<p>And this gives us a way to do complex-number division, by turning it into real-number division:
\[
\begin{aligned}
\frac{a+bi}{c+di}
& = \frac{a+bi}{c+di} \frac{c-di}{c-di} \\
& = \frac{ (ac +bd) + (bc - ad)i}{c^2 + d^2} \\
& = \frac{ac+bd}{c^2+d^2} + \frac{bc-ad}{c^2+d^2} i.
\end{aligned}
\]
So we can in fact divide by any non-zero complex number. This means we can do basic arithmetic, and thus the complex numbers are a field.</p>
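<p>The conjugate trick translates directly into code. Here’s a sketch of the division formula, checked against Python’s own complex division (again just an illustration):</p>

```python
def cdiv(a, b, c, d):
    """Divide (a+bi)/(c+di) by multiplying top and bottom by c-di."""
    denom = c * c + d * d  # a positive real number, as long as c+di != 0
    return ((a * c + b * d) / denom, (b * c - a * d) / denom)

re_part, im_part = cdiv(1, 2, 3, 4)
print(re_part, im_part)               # 0.44 0.08
print(complex(1, 2) / complex(3, 4))  # Python agrees
```

<p>Notice that the only division happening is by the real number \(c^2+d^2\), which is exactly how the formula sidesteps the problem of dividing by a complex number directly.</p>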
<p>And like the real numbers, they’re complete. The simplest way to think about this: we can think of a complex number \(z = a +bi\) as a pair of real numbers \(a\) and \(b\). So a sequence of complex numbers is basically just two sequences of real numbers, and we know that sequences of real numbers behave well. So any complex number that we can approximate has to actually exist; there aren’t any holes.</p>
<p>So while the reals are the unique complete ordered field, the complex numbers are a complete <em>unordered</em> field, which contains all the reals. And by giving up the order, we hope to get something else: every complex polynomial has a complex number solution. Once we take the real numbers and add in \(i\) there’s nothing left to algebraically add.</p>
<p>But it’s not obvious <em>why</em> that’s true. How do we know there’s not some polynomial equation we haven’t thought of, that doesn’t have a solution even in the complex numbers? To answer this, we need to turn to geometry.</p>
<h2 id="complex-geometry">Complex Geometry</h2>
<h3 id="the-complex-plane">The complex plane</h3>
<p>If we have a pair of real numbers, we can graph it on a plane, using the first number for the horizontal coordinate and the second number for the vertical coordinate. But a complex number \(z = a +bi\) is a pair of real numbers. And that means that, just like we can think of the real numbers as forming a line:
<img src="/assets/blog/fictional-history-of-numbers/number-line.svg" alt="The real number line" class="blog-image center" /></p>
<p>we can think of the complex numbers as forming a plane:</p>
<p><img src="/assets/blog/fictional-history-of-numbers/complex-plane.svg" alt="The complex plane" class="blog-image center" /></p>
<p>There are a lot of geometric ideas we can poke at here; for instance, complex numbers give us a useful way to talk about angles, which I’ll set aside since it doesn’t help answer our current question.</p>
<p>But distances and sizes will be extremely useful. So let’s think about those: if we have a number \(z = a+bi\), let’s figure out how far away from the origin at \(0\) it is. The \(x\)- and \(y\)-coordinates are \(a\) and \(b\), so we have a triangle with side lengths \(a\) and \(b\). By the Pythagorean theorem, the length of the hypotenuse, and thus the distance from the origin, is \(\sqrt{a^2+b^2}\).</p>
<p><img src="/assets/blog/fictional-history-of-numbers/complex-modulus.png" alt="A graph of a triangle in the complex plane, with hypotenuse running from the origin to a+bi." class="blog-image center" /></p>
<p>So far, we haven’t used the fact that we have complex numbers running around. But if we remember the calculations we did with the complex conjugate, we might notice that
\[
a^2+b^2 = (a+bi)(a-bi) = (a+bi)\overline{(a+bi)}.
\]
So we can rewrite our distance formula: if we have a complex number \(z\), the distance from the origin is \(\sqrt{z \cdot \overline{z}} \). We call this the <em>modulus</em> or <em>absolute value</em> of the number \(z\), and write it \(|z|\). It’s one of the most important operations we can do with complex numbers.</p>
<p>Specifically, it allows us to talk about sizes. Because the complex numbers aren’t ordered, we can’t directly compare numbers like \(3-4i\) and \(1 + 3i\); neither one is greater than the other. But once we graph them it’s visually clear that \(3-4i\) is much further from \(0\) than \(1+3i\) is, and in that sense it’s definitely “bigger”.</p>
<p><img src="/assets/blog/fictional-history-of-numbers/complex-distances.png" alt="A graph of two points in the complex plane, showing that 3-4i is much longer than 1+3i." class="blog-image center" /></p>
<p>The modulus lets us compute this numerically:
\[
\begin{aligned}
| 3 - 4i | & = \sqrt{3^2 + 4^2} = \sqrt{25} = 5 \\
| 1+3i | & = \sqrt{1^2 + 3^2} = \sqrt{10} \approx 3.16 \\
\end{aligned}
\]
and so the first number is in this sense “bigger” than the second.</p>
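<p>Python’s built-in <code>abs</code> computes exactly this modulus for complex numbers, so we can verify the comparison above directly:</p>

```python
m1 = abs(complex(3, -4))  # modulus of 3-4i
m2 = abs(complex(1, 3))   # modulus of 1+3i
print(m1, m2)  # 5.0 and about 3.162
```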
<p>This size computation allows us to do a few things. First, we need it to do geometry, since it allows us to compute distances: the distance between \(z\) and \(w\) is \( |z-w| \), the modulus of the difference. And then <em>that</em> lets us talk about “completeness” more precisely. Completeness tells us that when all the points in a sequence get close together, they must have some limit; for that to make sense, we need to know what “close” means!</p>
<p>And importantly for us, the modulus lets us talk about maximum values for functions. In the real numbers this is simple to talk about: we’re looking for the greatest possible output. But a function that outputs complex numbers can’t really have a maximum, because the outputs aren’t ordered! Instead we can look for the “biggest” output, the one where the modulus is greatest. Since the modulus is always a non-negative real number, this is a question that makes sense.</p>
<p>And once we investigate the maxima of complex functions, we get one of the most surprising results in all of complex analysis.</p>
<h3 id="the-maximum-modulus-principle">The Maximum Modulus Principle</h3>
<p>In the real numbers we had three key theorems in our “value pack”. One was the Extreme Value Theorem, which says that a continuous function on a closed interval has a maximum and minimum value. This doesn’t quite work in the complex numbers, because the lack of order means we lack both maximum outputs, and also “intervals”.</p>
<p><img src="/assets/blog/fictional-history-of-numbers/real-interval.png" alt="" class="blog-image center" />
<em class="blog-image center">A real interval is one-dimensional and doesn’t make sense in the complex plane.</em></p>
<p>But it’s <em>basically</em> true, after we tweak it. Instead of a closed interval, we want to have a closed and bounded region, which you can think of as a loop and everything inside of it, very much including all the points on the loop. And we need to look for the greatest modulus, instead of the “greatest complex number”. But after we make those tweaks, we can restate the Extreme Value Theorem: a continuous function on a closed and bounded region has a maximum (and minimum) modulus.</p>
<p><img src="/assets/blog/fictional-history-of-numbers/closed-region.png" alt="" class="blog-image center" />
<em class="blog-image center">A closed region in the complex plane. The outer blue boundary is included.</em></p>
<p>In fact, we can get even more than that. A real function on the plane has to have a maximum, but that can happen basically anywhere, without restrictions.</p>
<p><img src="/assets/blog/fictional-history-of-numbers/multivar-many-maxes.png" alt="" class="blog-image center" /></p>
<p class="blog-image center"><em>Some real-valued functions have lots of local maxima all over the place.</em></p>
<p>But a <em>complex</em> function, if it has a derivative, is much more restricted. The <a href="https://en.wikipedia.org/wiki/Maximum_modulus_principle">maximum modulus principle</a> says that \(|f|\) doesn’t just have a maximum somewhere in the region; the maximum has to occur on the boundary of the loop. In fact, unless the function is constant, the maximum value can <em>only</em> occur on the boundary. If we have a point on the inside of the loop, we can always get a bigger modulus by moving in some way towards the boundary, so there aren’t even local maxima on the inside of the region.</p>
<p><img src="/assets/blog/fictional-history-of-numbers/complex-max-modulus.png" alt="" class="blog-image center" />
<em class="blog-image center">The height in this graph gives the modulus of the output, and color tells us the angle. If you ignore color this graph looks extremely boring—which is the point.</em></p>
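<p>We can watch the maximum modulus principle happen numerically. Here’s a sketch that samples \(|f|\) for the differentiable function \(f(z) = z^2 + 1\) over a grid covering the closed unit disk, and records where the biggest modulus shows up. (The specific function and grid resolution are my choices for illustration; any differentiable function would behave the same way.)</p>

```python
import cmath

def f(z):
    # A differentiable (in fact entire) function to sample
    return z * z + 1

# Sample |f| on a polar grid over the closed unit disk,
# tracking where the largest modulus occurs.
best_mod, best_r = 0.0, 0.0
for ri in range(51):                # radii 0, 0.02, ..., 1.0
    r = ri / 50
    for k in range(200):            # 200 angles around each circle
        z = r * cmath.exp(2j * cmath.pi * k / 200)
        m = abs(f(z))
        if m > best_mod:
            best_mod, best_r = m, r

print(best_r, best_mod)  # the maximum modulus (2.0) occurs at r = 1.0
```

<p>No matter how finely you sample, the winning point always sits on the boundary circle, never strictly inside the disk.</p>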
<p>This has widespread and surprising implications. One of the most famous is that if a complex function is differentiable and bounded—meaning there is some maximum modulus the function can output, no matter the input—then it has to be constant.</p>
<p>And that’s really restrictive! A differentiable real function can easily be bounded without being constant:</p>
<p class="blog-image center"><img src="/assets/blog/fictional-history-of-numbers/bounded-real-functions.png" alt="" class="blog-image center" />
<em>The functions \(\sin(x)\) and \(e^{-x^2}\) are differentiable, bounded, non-constant real functions.</em></p>
<p>but a differentiable complex function cannot. Either it has only one possible output, or the outputs go to infinity. And this sort of behavior leads to what some mathematicians have jokingly called the <a href="https://twitter.com/marvinanashahn/status/1637922952254484481">only theorem of complex analysis</a>:</p>
<p class="theorem">Let \(f\) be a complex differentiable function with any interesting properties whatsoever. Then \(f\) is constant.</p>
<p>In truth, there’s a lot more to the calculus of complex numbers than that; and I could hang out all day talking about cool weird tricks. Like, we can use complex numbers to compute the integrals of purely real-valued functions that are too tricky to solve over just the real numbers, and that’s really cool and also kind of obnoxious.</p>
<p>But that’s not what we’re here for. We just wanted to take the real numbers, and add in everything we needed to make all our polynomial equations have solutions. And now we’re ready to prove that \(i\) is the only thing we had to add.</p>
<h3 id="the-fundamental-theorem-of-algebra">The fundamental theorem of algebra</h3>
<p class="theorem"><strong>Theorem:</strong> Any non-constant polynomial equation with complex coefficients has a complex number solution.</p>
<p><strong>Proof:</strong> Suppose we have some complex polynomial \(f(z)\) that doesn’t have any roots. We start by drawing a big loop in the complex plane—big enough that \(|f(z)| > |f(0)|\) for every \(z\) on the boundary of the loop. We know we can do this because a polynomial will always get very big when the input gets very big.<strong title="This is the step where we actually use the fact that we're talking about a polynomial. This proof doesn't work for functions like e^z, and this is why."><sup id="fnref:4"><a href="#fn:4" class="footnote">4</a></sup></strong></p>
<p>Then the maximum value of \(|f(z)|\) happens on the boundary of the loop, but the <em>minimum</em> has to happen on the inside of the loop, since \(0\) is on the inside, and \(|f(0)|\) is smaller than any value we get on the boundary. (It’s not necessarily the minimum itself; there could be points that give even smaller values. But we know the minimum can’t be on the boundary because all the boundary points give big values.)</p>
<p>So we know that \(f\) is a differentiable function, with a maximum on the boundary of the loop, and a minimum on the inside. We can also define the function \( \frac{1}{f} \), which will flip this. When \(|f|\) is big, then \(\frac{1}{|f|}\) will be small, and vice versa; so \(\frac{1}{|f|}\) has its minimum on the boundary of the loop, and its maximum on the inside.</p>
<p>But we also know something else. Since \(f\) has a derivative, we know that \( \frac{1}{f} \) also has a derivative, so the maximum modulus principle applies: the maximum value of \( \frac{1}{|f(z)|} \) must occur on the boundary of the loop. But we just said that the maximum has to occur on the inside of the loop; something has gone wrong.</p>
<p>The culprit is our assumption that we could actually compute the function \(\frac{1}{f}\) everywhere inside the loop. That’s only true if \(f(z)\) is never zero, since we can’t divide by zero. Because that assumption led to a contradiction, we know \(f(z) = 0\) for some value of \(z\)—so there is a solution to the equation we started with. ∎</p>
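<p>The theorem only promises that a root exists; it doesn’t hand us one. But we can go hunting numerically. Here’s a quick sketch (using Newton’s method, which is not part of the proof above) that finds a root of the polynomial \(x^5 + x + 3\) we met back in part 1:</p>

```python
def newton_root(f, df, z, steps=50):
    """Newton's method, run in the complex plane."""
    for _ in range(steps):
        z = z - f(z) / df(z)
    return z

f = lambda z: z**5 + z + 3
df = lambda z: 5 * z**4 + 1
root = newton_root(f, df, complex(-1, 0))
print(root, abs(f(root)))  # the residual |f(root)| is essentially zero
```

<p>Starting from other complex points finds the other four roots, and the fundamental theorem is what guarantees there is always something out there for the search to converge to.</p>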
<h2 id="the-end-of-one-road">The end of one road</h2>
<p>And this means that the complex numbers are sort of the end of this series of questions. In part 1 we started with the natural numbers, wanted to do algebra to them without worrying, and wound up with the algebraic numbers. In part 2 we started with the natural or rational numbers, wanted to do geometry and make approximations, and found the real numbers.</p>
<p>The algebraic numbers weren’t complete, meaning they’re inadequate for doing geometry and calculus. The real numbers are <em>perfect</em> for doing calculus, and are great for approximations, but they’re not algebraically closed—there are those pesky polynomial equations like \(x^2+1=0\) that don’t have solutions.</p>
<p>Now we can combine the two ideas, and get the complex numbers. They’re complete, so we can do geometry and calculus. They’re algebraically closed, so we can do whatever algebra we want. And they’re in many ways the best tool for doing both algebra and geometry.</p>
<p>But we did lose something when we moved to the complex numbers: we lost the ordering, and with it we lost some of our key calculus theorems from the reals.</p>
<ul>
<li>The <strong>Intermediate Value Theorem</strong> says that if a continuous real function can output two distinct numbers, it can also output anything in between them. In the complex numbers this isn’t true, because we have two dimensions and so we can go <em>around</em>. In the real numbers, to get from \(-1\) to \(1\) we have to go through \(0\); in the complex numbers we can go through \(i\) instead.</li>
</ul>
<p class="blog-image center"><img src="/assets/blog/fictional-history-of-numbers/complex-ivt.png" alt="" class="blog-image center" />
<em>The function outputs zero at the Xs. This path takes the output from \(1\) to \(-3\) without ever passing through zero.</em></p>
<ul>
<li>
<p>The <strong>Mean Value Theorem</strong> says that if we have a differentiable real function on a closed interval, the average speed is equal to the derivative at some point. This fails in the complex numbers for the same reason the intermediate value theorem does; we can get from a speed of \(30\) mph to a speed of \(60\) mph without ever going \(45\) mph, because we can travel at \(45+i\) mph instead. (Physically this may or may not be meaningful, but mathematically it works.)</p>
<p>But this time we can recover an important chunk of the result. The Mean Value Theorem tells us speed limits work: if our speed is never higher than sixty miles per hour, we can’t possibly travel more than sixty miles in one hour. And we can still get that principle in the complex numbers, because the <em>modulus</em> of the distance we travel is at most the time we spend times the maximum modulus of our speed. So we can save the tool we really care about—but only by shifting things back to the real numbers.</p>
</li>
<li>
<p>We already talked about the <strong>Extreme Value Theorem</strong>. In this case the complex numbers have an even stronger version than the reals did, in the Maximum Modulus Principle; it’s just <em>so</em> strong that it makes things really weird.</p>
</li>
</ul>
<p>So of our three key calculus theorems, one is basically true but very strange, one is salvageable in a much weaker form, and one is just gone. And that makes the complex numbers awkward for doing calculus, in the sense we normally mean calculus. They’re not good for talking about speeds, or rates of change, or anything like that—at least not directly.</p>
<p>On the other hand, they’re great for doing algebra and geometry (and <a href="https://en.wikipedia.org/wiki/Algebraic_geometry">algebraic geometry</a>). And there are all sorts of problems that don’t start out in the complex numbers, but can be transformed into complex-number questions, where we can throw our extremely powerful tools at them. (And then hopefully <a href="/blog/why-word-problems/">translate those answers back</a> into real-world information!)</p>
<p>But we’re not going to talk about that here. My promise in this series was I would pose reasonable questions, and show you how answering them gives us new numbers; and that’s what we’ve done. We wanted to expand the natural numbers using basic operations, and now we can’t expand any further. We wanted a field that is complete and algebraically closed, and we got it. Until we find a new question, we can rest content.</p>
<hr />
<p><em>I’m done with this line of questions; but I’m not at all done with this project! I hope to talk about quaternions and octonions, finite fields and modular arithmetic, \(p\)-adic numbers, transfinite numbers, infinitesimals, and function fields. Let me know what you’d like to hear about—tweet me <a href="https://twitter.com/profjaydaigle">@ProfJayDaigle</a> or leave a comment below.</em></p>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>At least, until we come up with a new question to ask. <a href="#fnref:1" class="reversefootnote">↵Return to Post</a></p>
</li>
<li id="fn:2">
<p>We don’t have to worry about terms with \(i^2\) or anything, because \(i^2 = -1\) is a real number again. <a href="#fnref:2" class="reversefootnote">↵Return to Post</a></p>
</li>
<li id="fn:3">
<p>This is the simplest example of a really interesting field called <a href="https://en.wikipedia.org/wiki/Galois_theory">Galois theory</a>. The complex conjugation operation we constructed is an element of the <em>Galois group</em> of the complex numbers over the reals. <a href="#fnref:3" class="reversefootnote">↵Return to Post</a></p>
</li>
<li id="fn:4">
<p>This is the step where we actually use the fact that we’re talking about a polynomial. This proof doesn’t work for functions like \(e^z\), and this is why. <a href="#fnref:4" class="reversefootnote">↵Return to Post</a></p>
</li>
</ol>
</div>

<p><em>We continue our exploration of what numbers are, and where mathematicians keep finding weird ones. In the first three parts we extended the natural numbers in two ways: algebraically and analytically. Those approaches gave overlapping but distinct sets of numbers. This week we combine them to get the complex numbers, and see some hints of why the complex numbers are so useful—and so frustrating.</em></p>

<h1>A Fictional History of Numbers, Part 3: Computability, Reality, and Leaving Well Enough Alone.</h1>
<p><em>Jay Daigle · 2023-06-14 · <a href="https://jaydaigle.net/blog/fictional-history-of-numbers-part-3">https://jaydaigle.net/blog/fictional-history-of-numbers-part-3</a></em></p>

<p>Today we’re going to finish our discussion of the real numbers. We’ll see that they really are quite strange, in ways that are uncomfortable to think about, and then see why we keep using them anyway. And in passing we’ll define the <em>computable numbers</em>, which are an interesting type of number that doesn’t get nearly enough attention.</p>
<p>In <a href="/blog/fictional-history-of-numbers-part-1">part 1</a> we saw the most straightforward types of numbers, from the natural numbers that we count with, through the rationals that allow us to do basic arithmetic, to the algebraic numbers that let us solve polynomial equations. In <a href="/blog/fictional-history-of-numbers-part-2">part 2</a> we started asking questions about <em>geometry</em>, where we wanted to measure shapes. We found that the area of a circle isn’t given by an algebraic number, but can be approximated as closely as we want.</p>
<p>This led to the idea of <em>completeness</em>, which basically means that anything we can approximate has to be real. Every sequence that looks like it should converge does converge, and thus every length gets an actual number attached to it. And if we want completeness we get the real numbers, which can be thought of as the set of infinite decimals.</p>
<p>But the real numbers were hard to define. They seemed like a lot of work just to be able to talk about the area of a circle without making any estimates; ten decimal places should be enough for anybody, but the reals require infinitely many. In this essay we’ll see that it gets worse—but also that all that work really has a payoff, and that the real numbers are the right sort of numbers to use.</p>
<p>But first, if you want me to feel like my work has a payoff, please consider donating to my <a href="https://ko-fi.com/jaydaigle">Ko-Fi account</a>. Tips are never necessary, but always appreciated, and they help make it possible for me to keep writing essays like this one.</p>
<h2 id="reality-is-weird">Reality is weird</h2>
<p>We keep saying that the real numbers were really weird. How weird, exactly, are they?</p>
<p>We saw one hint with the observation that \(0.99\dots~ = 1 = 1.00\dots.\) All real numbers are infinite decimals, but sometimes more than one infinite decimal corresponds to the same real number. (And the idea that we can have “infinitely many nines”, and that somehow they add up to exactly one, is something that makes a lot of people viscerally uncomfortable). But that doubling-up is pretty easy to avoid if we’re careful; if we disallow decimal expansions that end in an infinite string of nines, the problem goes away, and we can sleep easy.</p>
<p>But the real numbers are strange in other ways. For instance: how many of them are there? There are two answers that both seem intuitively compelling. On the one hand, there are infinitely many real numbers, and maybe that’s all we can say. Infinity is infinity.</p>
<p>On the other hand, there are infinitely many natural numbers, and infinitely many rational numbers, and infinitely many real numbers. But it sure seems like there are more rational numbers than natural numbers, and more real numbers than natural numbers; so maybe all infinities aren’t the same.</p>
<p>If we look at this more carefully, things get complicated.</p>
<h3 id="even-counting">Even counting</h3>
<p>How can we tell if two sets of things are the same size? We could try counting them and comparing the numbers: I have two hands, and two feet, so I have the same amount of hands and feet. But that doesn’t work if we have infinities. And anyway, counting is pretty abstract. Can we make things simpler?</p>
<p>There are a few approaches you could take here, but one very basic idea is just to pair things off. I don’t actually know how many pairs of shoes I have; but I know that I have the same number of left shoes and right shoes, because each left shoe is paired to a right shoe, and each right shoe is paired to a left shoe. There are none left over, so I have the same number of each. In technical terms, we’d say this pairing gives a <em>bijection</em> or a <em>one-to-one correspondence</em> between my left shoes and my right shoes.</p>
<p>On the other hand, if I try to pair up my socks and my shoes, I’ll have socks left over. I can give each shoe its own sock, and I’ll still have a big pile of socks left over. So I know I have more socks than shoes.</p>
<p>Let’s apply that idea now. Are there more natural numbers, or more even numbers? The obvious answer is that there are more natural numbers. If we look at the first ten numbers, only five of them are even.</p>
<table class="table">
<tbody>
<tr>
<td>Natural numbers:</td>
<td>1</td>
<td>2</td>
<td>3</td>
<td>4</td>
<td>5</td>
<td>6</td>
<td>7</td>
<td>8</td>
<td>9</td>
<td>10</td>
<td>\(\dots\)</td>
</tr>
<tr>
<td>Even numbers:</td>
<td> </td>
<td>2</td>
<td> </td>
<td>4</td>
<td> </td>
<td>6</td>
<td> </td>
<td>8</td>
<td> </td>
<td>10</td>
<td>\(\dots\)</td>
</tr>
</tbody>
</table>
<p>When we look at the first ten numbers, we have a lot of leftover (odd) natural numbers after we’ve paired off all the evens. And this pattern continues: if we look at the first hundred numbers, fifty of them are even. If we look at the first \(n\) numbers, about half of them will be even. So it seems like there must be more natural numbers than even numbers.</p>
<p>On the other hand, we can make a table like this, instead:</p>
<table class="table">
<tbody>
<tr>
<td>Natural numbers:</td>
<td>1</td>
<td>2</td>
<td>3</td>
<td>4</td>
<td>5</td>
<td>6</td>
<td>7</td>
<td>8</td>
<td>9</td>
<td>10</td>
<td>\(\dots\)</td>
</tr>
<tr>
<td>Even numbers:</td>
<td>2</td>
<td>4</td>
<td>6</td>
<td>8</td>
<td>10</td>
<td>12</td>
<td>14</td>
<td>16</td>
<td>18</td>
<td>20</td>
<td>\( \dots \)</td>
</tr>
</tbody>
</table>
<p>In this table, every even number corresponds to a natural number, and every natural number corresponds to an even number. They’re perfectly paired up. So by this argument, there must be the <em>same</em> number of natural numbers and even numbers.</p>
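<p>If you like seeing this in code, here’s a quick Python sketch (my own illustration, and just one way to write it) of the pairing in the second table, together with its inverse. The point is that the pairing goes both ways with nothing left over:</p>

```python
# Pair each natural number n with the even number 2n...
def to_even(n):
    return 2 * n

# ...and recover n from the even number it was paired with.
def to_natural(m):
    return m // 2

# Every natural number 1..10 gets its own even number,
print([to_even(n) for n in range(1, 11)])
# and every even number 2..20 comes from exactly one natural number.
print([to_natural(m) for m in range(2, 21, 2)])
```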
<p>This is one of the weird things that immediately happens when we start dealing with infinities: an infinite set can be in bijection with one of its own subsets. We see this in the observation that “infinity plus one” <a href="https://www.smbc-comics.com/comic/2015-02-18">is just infinity</a>, since adding an element to an infinite set doesn’t change the size. And these bijections are surprisingly common; sets in bijection with the natural numbers include the perfect squares:</p>
<table class="table">
<tbody>
<tr>
<td>Natural numbers:</td>
<td>1</td>
<td>2</td>
<td>3</td>
<td>4</td>
<td>5</td>
<td>6</td>
<td>7</td>
<td>8</td>
<td>9</td>
<td>10</td>
<td>\(\dots\)</td>
</tr>
<tr>
<td>Squares:</td>
<td>1</td>
<td>4</td>
<td>9</td>
<td>16</td>
<td>25</td>
<td>36</td>
<td>49</td>
<td>64</td>
<td>81</td>
<td>100</td>
<td>\( \dots \)</td>
</tr>
</tbody>
</table>
<p>the primes:</p>
<table class="table">
<tbody>
<tr>
<td>Natural numbers:</td>
<td>1</td>
<td>2</td>
<td>3</td>
<td>4</td>
<td>5</td>
<td>6</td>
<td>7</td>
<td>8</td>
<td>9</td>
<td>10</td>
<td>\(\dots\)</td>
</tr>
<tr>
<td>Primes:</td>
<td>2</td>
<td>3</td>
<td>5</td>
<td>7</td>
<td>11</td>
<td>13</td>
<td>17</td>
<td>19</td>
<td>23</td>
<td>29</td>
<td>\( \dots \)</td>
</tr>
</tbody>
</table>
<p>and even the integers:</p>
<table class="table">
<tbody>
<tr>
<td>Natural numbers:</td>
<td>1</td>
<td>2</td>
<td>3</td>
<td>4</td>
<td>5</td>
<td>6</td>
<td>7</td>
<td>8</td>
<td>9</td>
<td>10</td>
<td>\(\dots\)</td>
</tr>
<tr>
<td>Integers:</td>
<td>0</td>
<td>1</td>
<td>-1</td>
<td>2</td>
<td>-2</td>
<td>3</td>
<td>-3</td>
<td>4</td>
<td>-4</td>
<td>5</td>
<td>\( \dots \)</td>
</tr>
</tbody>
</table>
<p>We call these sets <em>countable</em> or <em>countably infinite</em>, because we can put all the elements in order and count them. It makes sense to ask for the \(37\)th prime number \((157),\) or the \(53\)rd square \((2809).\) And conversely, we can look at \(193\) and determine it’s the \(44\)th prime number, or at \(289\) and see it’s the \(17\)th square.</p>
<h3 id="counting-the-rationals">Counting the rationals</h3>
<p>Let’s make things a little more interesting. We saw that the sets of natural numbers, integers, even numbers, perfect squares, and prime numbers were all the same size. What about the rational numbers? It seems like there are a <em>lot</em> more rational numbers than there are natural numbers. But it seemed like there were a lot more natural numbers than even numbers, and that didn’t work out, so we should look closer. We can try making a table like this:</p>
<table class="table">
<tbody>
<tr>
<td>Natural numbers:</td>
<td>1</td>
<td>2</td>
<td>3</td>
<td>4</td>
<td>5</td>
<td>6</td>
<td>7</td>
<td>8</td>
<td>9</td>
<td>10</td>
<td>\(\dots\)</td>
</tr>
<tr>
<td>Rational numbers:</td>
<td>1/1</td>
<td>1/2</td>
<td>1/3</td>
<td>1/4</td>
<td>1/5</td>
<td>1/6</td>
<td>1/7</td>
<td>1/8</td>
<td>1/9</td>
<td>1/10</td>
<td>\( \dots \)</td>
</tr>
</tbody>
</table>
<p>But that won’t get us very far. Or rather, it would get us really far—we could keep <em>going</em> forever—but we’d leave most of the rational numbers out. We’ll never get to \(2\) that way.</p>
<p><a href="https://en.wikipedia.org/wiki/Georg_Cantor">Georg Cantor</a>’s clever idea was to put the rational numbers on a grid, instead.</p>
<table class="table">
<tbody>
<tr>
<td><strong>1/1</strong></td>
<td><strong>1/2</strong></td>
<td><strong>1/3</strong></td>
<td><strong>1/4</strong></td>
<td><strong>1/5</strong></td>
<td><strong>1/6</strong></td>
<td><strong>1/7</strong></td>
<td><strong>1/8</strong></td>
<td><strong>1/9</strong></td>
<td>…</td>
</tr>
<tr>
<td><strong>2/1</strong></td>
<td><strong>2/2</strong></td>
<td><strong>2/3</strong></td>
<td><strong>2/4</strong></td>
<td><strong>2/5</strong></td>
<td><strong>2/6</strong></td>
<td><strong>2/7</strong></td>
<td><strong>2/8</strong></td>
<td><strong>2/9</strong></td>
<td>…</td>
</tr>
<tr>
<td><strong>3/1</strong></td>
<td><strong>3/2</strong></td>
<td><strong>3/3</strong></td>
<td><strong>3/4</strong></td>
<td><strong>3/5</strong></td>
<td><strong>3/6</strong></td>
<td><strong>3/7</strong></td>
<td><strong>3/8</strong></td>
<td><strong>3/9</strong></td>
<td>…</td>
</tr>
<tr>
<td><strong>4/1</strong></td>
<td><strong>4/2</strong></td>
<td><strong>4/3</strong></td>
<td><strong>4/4</strong></td>
<td><strong>4/5</strong></td>
<td><strong>4/6</strong></td>
<td><strong>4/7</strong></td>
<td><strong>4/8</strong></td>
<td><strong>4/9</strong></td>
<td>…</td>
</tr>
<tr>
<td><strong>5/1</strong></td>
<td><strong>5/2</strong></td>
<td><strong>5/3</strong></td>
<td><strong>5/4</strong></td>
<td><strong>5/5</strong></td>
<td><strong>5/6</strong></td>
<td><strong>5/7</strong></td>
<td><strong>5/8</strong></td>
<td><strong>5/9</strong></td>
<td>…</td>
</tr>
<tr>
<td><strong>6/1</strong></td>
<td><strong>6/2</strong></td>
<td><strong>6/3</strong></td>
<td><strong>6/4</strong></td>
<td><strong>6/5</strong></td>
<td><strong>6/6</strong></td>
<td><strong>6/7</strong></td>
<td><strong>6/8</strong></td>
<td><strong>6/9</strong></td>
<td>…</td>
</tr>
<tr>
<td><strong>7/1</strong></td>
<td><strong>7/2</strong></td>
<td><strong>7/3</strong></td>
<td><strong>7/4</strong></td>
<td><strong>7/5</strong></td>
<td><strong>7/6</strong></td>
<td><strong>7/7</strong></td>
<td><strong>7/8</strong></td>
<td><strong>7/9</strong></td>
<td>…</td>
</tr>
<tr>
<td><strong>8/1</strong></td>
<td><strong>8/2</strong></td>
<td><strong>8/3</strong></td>
<td><strong>8/4</strong></td>
<td><strong>8/5</strong></td>
<td><strong>8/6</strong></td>
<td><strong>8/7</strong></td>
<td><strong>8/8</strong></td>
<td><strong>8/9</strong></td>
<td>…</td>
</tr>
<tr>
<td><strong>9/1</strong></td>
<td><strong>9/2</strong></td>
<td><strong>9/3</strong></td>
<td><strong>9/4</strong></td>
<td><strong>9/5</strong></td>
<td><strong>9/6</strong></td>
<td><strong>9/7</strong></td>
<td><strong>9/8</strong></td>
<td><strong>9/9</strong></td>
<td>…</td>
</tr>
<tr>
<td>⋮</td>
<td>⋮</td>
<td>⋮</td>
<td>⋮</td>
<td>⋮</td>
<td>⋮</td>
<td>⋮</td>
<td>⋮</td>
<td>⋮</td>
<td>⋱</td>
</tr>
</tbody>
</table>
<p>A grid like this should contain every positive<strong title="We're going to ignore the negative numbers here because they make everything more complicated in a boring and annoying way. I promise I could include them if I wanted to make this section even longer."><sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup></strong> rational number somewhere. In fact, the big problem is that some of them show up more than once! \(1/1 = 2/2 = 3/3\) and \(1/2 = 2/4 = 4/8\); we get a lot of repetitions. If we throw out the duplicates, and only include fractions that are in lowest terms, we get this:</p>
<table class="table">
<tbody>
<tr>
<td><strong>1/1</strong></td>
<td><strong>1/2</strong></td>
<td><strong>1/3</strong></td>
<td><strong>1/4</strong></td>
<td><strong>1/5</strong></td>
<td><strong>1/6</strong></td>
<td><strong>1/7</strong></td>
<td><strong>1/8</strong></td>
<td><strong>1/9</strong></td>
<td>…</td>
</tr>
<tr>
<td><strong>2/1</strong></td>
<td><del style="color:red;">2/2</del></td>
<td><strong>2/3</strong></td>
<td><del style="color:red;">2/4</del></td>
<td><strong>2/5</strong></td>
<td><del style="color:red;">2/6</del></td>
<td><strong>2/7</strong></td>
<td><del style="color:red;">2/8</del></td>
<td><strong>2/9</strong></td>
<td>…</td>
</tr>
<tr>
<td><strong>3/1</strong></td>
<td><strong>3/2</strong></td>
<td><del style="color:red;">3/3</del></td>
<td><strong>3/4</strong></td>
<td><strong>3/5</strong></td>
<td><del style="color:red;">3/6</del></td>
<td><strong>3/7</strong></td>
<td><strong>3/8</strong></td>
<td><del style="color:red;">3/9</del></td>
<td>…</td>
</tr>
<tr>
<td><strong>4/1</strong></td>
<td><del style="color:red;">4/2</del></td>
<td><strong>4/3</strong></td>
<td><del style="color:red;">4/4</del></td>
<td><strong>4/5</strong></td>
<td><del style="color:red;">4/6</del></td>
<td><strong>4/7</strong></td>
<td><del style="color:red;">4/8</del></td>
<td><strong>4/9</strong></td>
<td>…</td>
</tr>
<tr>
<td><strong>5/1</strong></td>
<td><strong>5/2</strong></td>
<td><strong>5/3</strong></td>
<td><strong>5/4</strong></td>
<td><del style="color:red;">5/5</del></td>
<td><strong>5/6</strong></td>
<td><strong>5/7</strong></td>
<td><strong>5/8</strong></td>
<td><strong>5/9</strong></td>
<td>…</td>
</tr>
<tr>
<td><strong>6/1</strong></td>
<td><del style="color:red;">6/2</del></td>
<td><del style="color:red;">6/3</del></td>
<td><del style="color:red;">6/4</del></td>
<td><strong>6/5</strong></td>
<td><del style="color:red;">6/6</del></td>
<td><strong>6/7</strong></td>
<td><del style="color:red;">6/8</del></td>
<td><del style="color:red;">6/9</del></td>
<td>…</td>
</tr>
<tr>
<td><strong>7/1</strong></td>
<td><strong>7/2</strong></td>
<td><strong>7/3</strong></td>
<td><strong>7/4</strong></td>
<td><strong>7/5</strong></td>
<td><strong>7/6</strong></td>
<td><del style="color:red;">7/7</del></td>
<td><strong>7/8</strong></td>
<td><strong>7/9</strong></td>
<td>…</td>
</tr>
<tr>
<td><strong>8/1</strong></td>
<td><del style="color:red;">8/2</del></td>
<td><strong>8/3</strong></td>
<td><del style="color:red;">8/4</del></td>
<td><strong>8/5</strong></td>
<td><del style="color:red;">8/6</del></td>
<td><strong>8/7</strong></td>
<td><del style="color:red;">8/8</del></td>
<td><strong>8/9</strong></td>
<td>…</td>
</tr>
<tr>
<td><strong>9/1</strong></td>
<td><strong>9/2</strong></td>
<td><del style="color:red;">9/3</del></td>
<td><strong>9/4</strong></td>
<td><strong>9/5</strong></td>
<td><del style="color:red;">9/6</del></td>
<td><strong>9/7</strong></td>
<td><strong>9/8</strong></td>
<td><del style="color:red;">9/9</del></td>
<td>…</td>
</tr>
<tr>
<td>⋮</td>
<td>⋮</td>
<td>⋮</td>
<td>⋮</td>
<td>⋮</td>
<td>⋮</td>
<td>⋮</td>
<td>⋮</td>
<td>⋮</td>
<td>⋱</td>
</tr>
</tbody>
</table>
<p>And once we have all the rational numbers in a grid like this, we can put them in order: we just have to take a snaking diagonal path through our grid.</p>
<p><img src="/assets/blog/fictional-history-of-numbers/rationals-order.png" alt="" class="center blog-short-image" /></p>
<p>You can think of this as listing all the numbers where the top plus the bottom is two, then all the numbers where it’s three, then all the numbers where it’s four; there’s only a finite collection at each level.<strong title="Fancy number theorists call this sort of thing a height. It's a convenient way of putting a 'size' on rational numbers so that there are only finitely many small ones, which is useful when we want to put things in order, or compute probabilities."><sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup></strong> And that means that any rational number gets a specific, finite place in our list:
\[
1/1, \quad 2/1, \quad 1/2, \quad 1/3, \quad 3/1, \quad 4/1, \quad 3/2, \quad 2/3, \quad 1/4, \quad 1/5, \quad \dots
\]</p>
<p>But all this is still a little weird, right? There are “obviously” way more rational numbers than there are natural numbers, but we just put them in order and paired them up. The fifth rational number is \(3\), and \(2/3\) is the eighth rational number; we can go in either direction.</p>
<h3 id="counting-the-algebraic-numbers">Counting the algebraic numbers</h3>
<p>We can take this logic one step further. In part 1 we defined the <em>algebraic numbers</em>, the numbers that are solutions to polynomial equations with integer<strong title="In part 1 we said rational coefficients, but this is the same thing. If you have an equation with rational coefficients, you can multiply through by the least common denominator and get an equivalent equation with integer coefficients. And assuming all the coefficients are integers is way more convenient for what we're going to do here."><sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup></strong> coefficients. These include all the rational numbers, and all the square roots, and the imaginary number \(i\), and the solutions to \(x^5+x+3=0\) which we can’t describe any better than that. Can we pair them up with the rational numbers?</p>
<p>It seems obvious that there are way more algebraic numbers than rational numbers. But it was also “obvious” that there were more rational numbers than integers, and that didn’t quite pan out. In fact we <em>can</em> count the algebraic numbers. Take a minute and see if you can figure out how to do it!</p>
<p>There are a few approaches, but I think the easiest is this. First think about all the polynomials of degree at most one whose coefficients are \(1\) or smaller in absolute value. There aren’t very many of these, and we can list them all off:</p>
<p>\[
\begin{aligned}
0 && 1 && -1 \\
x && x+1 && x-1 \\
-x && -x+1 && -x-1
\end{aligned}
\]
There are nine of them, so we can put them in whatever order we want. And each one has at most one root, so we’ve counted up to nine algebraic numbers. (In fact we’ve only counted three, since there’s a lot of duplication here, but that’s fine; we’ll just cross out the duplicate numbers like we did for the rationals.)</p>
<p>Now think about all the polynomials of degree at most two whose coefficients are \(2\) or smaller in absolute value. There are a lot more of these!</p>
<details>
<summary>Click to see all 125 polynomials of degree at most two whose coefficients are \(2\) or smaller.</summary>
<div>
<p>\[
\begin{aligned}
-2 && -1 && 0 && 1 && 2 \\
x-2 && x-1 && x && x+1 && x+2 \\
2x-2 && 2x-1 && 2x && 2x+1 && 2x+2 \\
-x-2 && -x-1 && -x && -x+1 && -x+2 \\
-2x-2 && -2x-1 && -2x && -2x+1 && -2x+2 \\
x^2-2 && x^2-1 && x^2 && x^2+1 && x^2+2 \\
x^2+x-2 && x^2+x-1 && x^2+x && x^2+x+1 && x^2+x+2 \\
x^2+2x-2 && x^2+2x-1 && x^2+2x && x^2+2x+1 && x^2+2x+2 \\
x^2-x-2 && x^2-x-1 && x^2-x && x^2-x+1 && x^2-x+2 \\
x^2-2x-2 && x^2-2x-1 && x^2-2x && x^2-2x+1 && x^2-2x+2 \\
2x^2-2 && 2x^2-1 && 2x^2 && 2x^2+1 && 2x^2+2 \\
2x^2+x-2 && 2x^2+x-1 && 2x^2+x && 2x^2+x+1 && 2x^2+x+2 \\
2x^2+2x-2 && 2x^2+2x-1 && 2x^2+2x && 2x^2+2x+1 && 2x^2+2x+2 \\
2x^2-x-2 && 2x^2-x-1 && 2x^2-x && 2x^2-x+1 && 2x^2-x+2 \\
2x^2-2x-2 && 2x^2-2x-1 && 2x^2-2x && 2x^2-2x+1 && 2x^2-2x+2 \\
-x^2-2 && -x^2-1 && -x^2 && -x^2+1 && -x^2+2 \\
-x^2+x-2 && -x^2+x-1 && -x^2+x && -x^2+x+1 && -x^2+x+2 \\
-x^2+2x-2 && -x^2+2x-1 && -x^2+2x && -x^2+2x+1 && -x^2+2x+2 \\
-x^2-x-2 && -x^2-x-1 && -x^2-x && -x^2-x+1 && -x^2-x+2 \\
-x^2-2x-2 && -x^2-2x-1 && -x^2-2x && -x^2-2x+1 && -x^2-2x+2 \\
-2x^2-2 && -2x^2-1 && -2x^2 && -2x^2+1 && -2x^2+2 \\
-2x^2+x-2 && -2x^2+x-1 && -2x^2+x && -2x^2+x+1 && -2x^2+x+2 \\
-2x^2+2x-2 && -2x^2+2x-1 && -2x^2+2x && -2x^2+2x+1 && -2x^2+2x+2 \\
-2x^2-x-2 && -2x^2-x-1 && -2x^2-x && -2x^2-x+1 && -2x^2-x+2 \\
-2x^2-2x-2 && -2x^2-2x-1 && -2x^2-2x && -2x^2-2x+1 && -2x^2-2x+2 \\
\end{aligned}
\]</p>
</div>
</details>
<p>But it’s still a finite list, and each one has at most two solutions. So this list has fewer than \(250\) algebraic numbers on it, and we can count them. In fact, let’s put the ones from the first list first, and then all the rest.</p>
<p>And now what we’ve done is defined a “height” for our polynomials: it’s the maximum of the degree and the absolute values of all the coefficients. So next we can look at the “height three” polynomials, the degree-three polynomials with coefficients three or less; and then the height four polynomials, which are degree-four polynomials with coefficients four or less; and so on.</p>
<p>At each height we’re adding finitely many algebraic numbers, so we can put them all in order. But we’ll get to every algebraic number eventually. For instance, the polynomial \(x^5+x+3\) has height five, so all the solutions will show up at the fifth step of this process. And that means we can pair the algebraic numbers up with the natural numbers. The algebraic numbers are countable.</p>
<h2 id="reality-is-too-big">Reality is too big</h2>
<h3 id="counting-the-reals">Counting the Reals</h3>
<p>At this point you might be wondering if we can just always do this. The naturals and the rationals and the algebraics are all the same size; maybe infinities are all the same, after all. But let’s look at one more example: the real numbers.</p>
<p>Imagine we can put all the real numbers in a list, like we did for the rational numbers. Every real number can be written as an infinite decimal, so the list might look something like this:</p>
<p>\[
\begin{aligned}
7.77000643096\dots \\
1.05898980495\dots \\
6.35097622647\dots \\
1.79660844929\dots \\
4.45063253213\dots \\
7.48984022493\dots \\
2.23729615260\dots \\
0.09015630234\dots \\
1.30480398871\dots \\
7.76421175135\dots \\
\end{aligned}
\]</p>
<p>But now we can make a new infinite decimal, that definitely isn’t on the list. To keep things simple, we’ll make a number whose whole number part, to the left of the decimal point, is just \(0\). That means our number will be between \(0\) and \(1\), and that’s fine.</p>
<p>Now let’s look back at the big list. The first number on the list has a \(7\) in the first decimal place. So to make sure our number is different, the first decimal place can’t have a \(7\), so we’ll put a \(0\) there instead.</p>
<p>Now the second number on the list has a \(5\) in the second decimal place. To make sure our number is different, the second decimal place can’t have a \(5\), so we’ll put a \(0\) there instead.</p>
<p>The third number has a \(0\) in the third decimal place, so we don’t want to have a \(0\) in the third decimal place of our number; let’s use a \(1\) instead. At this point, we know the first three places of our infinite decimal: \(0.001\dots.\) And we also know that our infinite decimal isn’t any of the first three numbers on the big list.</p>
<p>So we can continue this pattern. The fourth digit of the fourth number is \(6\), so we can pick \(0\) for our fourth digit.<strong title="We could also pick 1, or 2, or 3, or 4, or 5, or 7, or 8, or 9; this isn't a deterministic process. We just can't pick 6. "><sup id="fnref:4"><a href="#fn:4" class="footnote">4</a></sup></strong> The fifth digit of the fifth number is \(3\), so we can pick \(0\). The sixth digit of the sixth number is \(0\), so we have to pick something else like \(1\). The seventh digit of the seventh number is \(1\), and the eighth digit of the eighth number is \(0\), so we should pick \(0\) for our seventh digit and \(1\) for our eighth digit. As we keep going, we get the number</p>
<p>\[
0.0010010100\dots
\]
which can’t<strong title="There's some slight weirdness here around the fact that 0.99... = 1. But it's not a real problem; don't worry about it. "><sup id="fnref:5"><a href="#fn:5" class="footnote">5</a></sup></strong> be the same as any of the numbers on our list:
\[
\begin{aligned}
7.\color{red}{7}7000643096\dots \\
1.0\color{red}{5}898980495\dots \\
6.35\color{red}{0}97622647\dots \\
1.796\color{red}{6}0844929\dots \\
4.4506\color{red}{3}253213\dots \\
7.48984\color{red}{0}22493\dots \\
2.237296\color{red}{1}5260\dots \\
0.0901563\color{red}{0}234\dots \\
1.30480398\color{red}{8}71\dots \\
7.764211751\color{red}{3}5\dots \\
\end{aligned}
\]</p>
<p>And that means that we have a number that <em>isn’t</em> on the infinite list we started with.</p>
<p>Now obviously we could make a list that contains this number. We can just tack it on to the front:<strong title="We could also stick it in the middle, or replace some element of the list we started with. But we can't tack it on to the end, because this is an infinite list—it doesn't have an end! "><sup id="fnref:6"><a href="#fn:6" class="footnote">6</a></sup></strong></p>
<p>\[
\begin{aligned}
&0.0010010100\dots \\
&7.77000643096\dots \\
&1.05898980495\dots \\
&6.35097622647\dots \\
&1.79660844929\dots \\
&4.45063253213\dots \\
&7.48984022493\dots \\
&2.23729615260\dots \\
&0.09015630234\dots \\
&1.30480398871\dots \\
&7.76421175135\dots \\
\end{aligned}
\]</p>
<p>And this list does have the number we just made. But it still can’t have every number on it; we can do the same thing we just did to get a new number, \(0.1000100000\dots,\) that isn’t on this list. Whatever list we come up with, there has to be a number that isn’t on it.</p>
<p>And in fact there are infinitely many numbers that aren’t on the list! We can see this pretty directly by listing a bunch of them. We built a number that wasn’t on the list, using just the digits \(0\) and \(1.\) But we could also use \(0\) and \(2,\) or \(4\) and \(7,\) or whichever pair of digits we want. We could even choose a different pair for each place; we wind up having nine choices for each decimal place. So we can see all the following numbers aren’t on our original list:</p>
<p>\[
\begin{aligned}
0.0020020200 \dots \\
0.4777777777 \dots \\
0.6669666666 \dots \\
0.1234567890 \dots
\end{aligned}
\]</p>
<p>And every number on that super-infinite list is between \(0\) and \(1.\) We can find another infinite list between \(1\) and \(2,\) and another between \(2\) and \(3,\) and another between \(37\) and \(38.\) So not only are there more real numbers than natural, rational, or algebraic numbers; there are way, way more of them.</p>
<h3 id="leaving-things-up-to-chance">Leaving things up to chance</h3>
<p>Another way of thinking about how many real numbers there are is to imagine choosing one at random. In fact, let’s just choose a number between \(0\) and \(1.\) Some of the numbers between \(0\) and \(1\) are rational, and others are irrational. So what are the odds that our randomly chosen number will be rational?</p>
<p>It turns out the probability of getting a rational number has to be zero. Not just small, but actually zero—even though it’s obviously possible, it has to be infinitely unlikely.</p>
<p>To see why, imagine that there was some positive probability of getting a rational number, like one in three.<strong title="You might feel like this probability is obviously too big, and of course you're right. But none of the argument depends on the specific number. I just want to keep the number small so it's easier to think about what's going on."><sup id="fnref:7"><a href="#fn:7" class="footnote">7</a></sup></strong> That would mean that one third of all the real numbers were rational—and that means we could divide the real numbers up into three sets \(A, B, C\), each of which are the size of the rational numbers.</p>
<p>But we know the rationals are countable, meaning we can put them in a numbered list. If the other two sets are the same size, we must be able to put them in lists, too, so we could divide the real numbers into three countable sets. And then we can make a complete list of all the real numbers: we can take the first element of \(A\), then the first element of \(B\), then the first element of \(C;\) then the second elements of \(A\) and \(B\) and \(C\); then the third elements; and so on. But this can’t possibly work, because we <em>know</em> the real numbers are uncountable. The rationals can’t be one third of the reals.</p>
<p>And the specific number was irrelevant here, right? If the rationals were \(1/1,000,000\) of the reals, we could still count the reals. We’d take the first element from each of our million sets, and then the second element, and then the third… If the rationals were any finite percentage of the real numbers, we’d be able to make a list of all the reals. So instead the rationals have to be zero percent of the real numbers.</p>
<p>And this sort of weirdness can only happen with infinities. Obviously there are rational numbers. “One” is a rational number, and it definitely exists. So if you pick a real number, it is <em>possible</em> to pick a rational number. But it will happen zero percent of the time. Not a small percentage; not a tiny percentage; it will happen zero percent of the time. There are <em>infinitely</em> many more real numbers than rational numbers. And that’s weird!</p>
<p>It gets worse when you realize the same argument applies to the algebraic numbers. It’s hard to come up with a real number that <em>isn’t</em> algebraic. Sure there are a couple of weird ones like \(\pi\) and \(e,\) but for the most part the irrational numbers we think about are all algebraic.</p>
<p>And yet if you pick a real number at random, there’s a zero percent chance you’ll get anything algebraic. There are infinitely more real numbers that <em>aren’t</em> solutions to polynomial equations than ones that are. Most real numbers are a little hard to even describe.</p>
<h3 id="what-are-we-even-talking-about">What are we even talking about?</h3>
<p>And this brings us to something weird you might have noticed when we showed that the reals were uncountable. We started with a list of real numbers, and we constructed a decimal that wasn’t on that list, and said we came up with \(0.0010010100 \dots. \) But did we really find a specific number? I wrote an ellipsis there; do we know what comes next?</p>
<p>And the answer is, sort of. We have a rule that tells us how to choose the next digit, so in theory we should know what comes next. But the rule depends on the next number on the original list, and I didn’t tell you that, so we can’t actually figure out the next digit.</p>
<p>In general we’ve been pretty sloppy about this! I said a real number is an infinite decimal, and I’ve been writing strings of digits with ellipses at the end to say the decimal keeps going. But consider these three real numbers:</p>
<p>\[
\begin{aligned}
A & = 0.1428571428 \dots \\
B & = 3.1415926535 \dots \\
C & = 0.9193470019 \dots
\end{aligned}
\]</p>
<p>Do we know what digit comes next in \(A\)? It looks like it’s repeating, so we can guess the next digits are \(5\) and \(7\). (We might even notice this is the decimal expansion of \(1/7\), which can make us more confident in our guess.) For \(B\), we might recognize that this is \(\pi,\) so we can look up the next digits, which are \(8\) and \(9\). But what about for \(C\)? Can you figure out what comes next?</p>
<p>And in this case, you can’t. You absolutely can’t. I just generated ten random digits, without any pattern, and wrote them down. I wrote an ellipsis like there’s something that comes next, but I never decided what comes next, so the ellipsis is basically a lie. I have no idea what that number is—except that it’s between \(\dfrac{9193470019}{10000000000} \) and \(\dfrac{9193470020}{10000000000}.\)</p>
<p>And honestly, that’s enough information to do pretty much any calculation we would actually want to do in the real world. But it’s <em>not</em> enough to tell you which specific real number this is. There are infinitely many—uncountably many!—ways to continue on from those ten digits and get a real number. And even <em>I</em> can’t really tell you which one I want.</p>
<h3 id="working-with-recipes">Working with recipes</h3>
<p>But I <em>did</em> tell you exactly what numbers \(A\) and \(B\) were—just not by writing down an infinite decimal. \(A\) is a repeating decimal, so I can give you a few digits and then tell you it repeats. \(B\) isn’t repeating, but it is a special number, \(\pi,\) and you can go look up the next digits if you want to. So rather than listing off all of the infinitely many digits, I can give you an <em>algorithm</em>, or recipe, for finding the digits. And you can keep computing the next digit and the next, for as long as you want. If a number has a recipe like this, that lets you compute all of the digits, we say it’s <a href="https://en.wikipedia.org/wiki/Computable_number">computable</a>.</p>
<p>Can we do something like that for \(C?\) Can we take <em>any</em> infinite decimal and give an algorithm that will compute it?</p>
<p>Unfortunately, the answer is no, and for the same basic reason we know most real numbers aren’t algebraic. Every computable number has to have a recipe, so let’s think about what a recipe should look like. First, the recipe should be finite. We might not be able to finish reading the infinite decimal, but we should be able to finish reading the recipe. (Otherwise the recipe could just be a list of all the digits, which is missing the point!)</p>
<p>And we have to pick some language to write the recipe in. What language we pick doesn’t really matter; we could use English, or Mandarin, or some sort of <a href="https://en.wikipedia.org/wiki/Lambda_calculus">weird mathematical symbology</a>. But the language probably has finitely many symbols in it. Even if it has infinitely many symbols, we can limit it to a <em>countable</em> infinity—no more symbols than there are natural numbers.<strong title="If this seems like a lot of symbols, you're right! English has, like, thirty, and thirty is way smaller than infinity. But the lambda calculus, which is the mathematical formalism I linked, does in fact want an infinitely long list of symbols. Since this argument works either way, I'm letting the list be infinite."><sup id="fnref:8"><a href="#fn:8" class="footnote">8</a></sup></strong></p>
<p>And with just those two restrictions, we find that we can count all the recipes. We can count the symbols, meaning we can label them so that there’s a first symbol, then a second, then a third, and so on. There’s at most one recipe that uses only the first symbol, and is only one symbol long. There are at most six recipes that use the first two symbols and are only two symbols long. There are at most \(39\) recipes that use the first three symbols and are only three symbols long.</p>
<p>We’ve found a height for our recipes again: the maximum of the length of the recipe, and the number of symbols off our list we have to use. There are finitely many recipes of each height, so we can label all the height-one recipes, then all the height-two recipes, then the height-three, then the height-four, and so on. Eventually we will reach every possible recipe, which means we can make a numbered list of every possible recipe.</p>
<p>This means the recipes are countable, and so the numbers they produce are also countable. So there are countably many computable numbers, but uncountably many real numbers. And just like with the rational numbers and the algebraic numbers, almost every real number is uncomputable; if we pick a random real number, it is essentially guaranteed not to be computable.</p>
<p>And to be clear about what this means: one hundred percent of the real numbers are things that we <strong>can’t even describe</strong>. They’re so strange and gratuitously infinite that we can’t even really talk about most of them.</p>
<h2 id="so-what-are-they-good-for">So what are they good for?</h2>
<p>The real numbers were hard to define. (It took basically all of part 2!) There are way too many of them, and almost all of them can’t even be described in a useful way. And if you think about them too hard, they just start seeming really weird and uncomfortable. So why do we keep using them, rather than doing something more sensible?</p>
<p>The point of this series is that <strong>this is the right question to ask</strong>. We shouldn’t start by asking how some new weird math thing is defined; we shouldn’t start with a definition and just try to prove theorems from it. If we want to understand a math idea, we need to understand what problem it was designed to solve. So <strong>we don’t want to think about the definition of the reals. We should think about what they do.</strong></p>
<p>The reals are characterized by three key properties: they are a <em>complete ordered field</em>. (In fact, they are the <em>only</em> complete ordered field.) And each of these three words represents a major idea from this series. We’ll take them in reverse order.</p>
<ul>
<li>
<p>A <em>field</em> is a set that allows the four fundamental arithmetic operations: addition, subtraction, multiplication, and division (by non-zero numbers). Looking back at <a href="/blog/fictional-history-of-numbers-part-1">part 1</a>, the natural numbers aren’t a field, because they don’t let you compute \(1-3\); the integers aren’t a field, because they don’t let you compute \(1/3\). When we wanted to do addition, subtraction, multiplication, and division, we came up with the rational numbers, which are a field. And so are the algebraic numbers, the computable numbers, and the reals.</p>
</li>
<li>
<p>The reals are <em>ordered</em>: if we have two distinct real numbers, one will always be greater than the other. We talked about this some in part 2. The order also “plays nicely” with the algebraic operations, in the sense that, for instance, adding \(1\) to a number will always make it bigger. The rational numbers and the computable numbers are also ordered, but the algebraic numbers are not, because \(i\) is neither positive nor negative.</p>
</li>
<li>
<p>Finally, the reals are <em>complete</em>. Completeness was the main topic of part 2: we built the reals by saying that every sequence that looks like it should converge, does converge. Combining this with the order gave us the Monotone Convergence Theorem, which says an increasing sequence that doesn’t go to infinity has to converge, and that allowed us to show that every infinite decimal was a real number.</p>
</li>
</ul>
<p>These three properties, taken together, give us <em>exactly</em> the real numbers. Any ordered field has to contain all the integers, because we can add \(1\) to itself repeatedly.<strong title="If we give up on having an order, we can get &quot;looping around&quot; behavior, where repeatedly adding 1 gets us back to where we started. This shows up in modular arithmetic and finite fields, which I hope to discuss later in this series. But in an _ordered_ field, this sort of looping isn't possible, because adding 1 will always make a number larger."><sup id="fnref:9"><a href="#fn:9" class="footnote">9</a></sup></strong> Then because we can do division, we have to have all the rationals, which means we get all the finite decimals. And completeness means we get all the infinite decimals, so a complete ordered field has to include all the reals.</p>
<p>Conversely, every element of the complete ordered field has to be a real number. The order means we can trap our element in between two integers, and then in between two one-place decimals, and then in between two two-place decimals, and so on. Thus our element can be written as an infinite decimal, so it must be a real number.</p>
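<p>The trapping argument is easy to act out in code. Here’s a minimal sketch (mine, nothing canonical) that pins a number between two consecutive \(k\)-place decimals:</p>

```python
from fractions import Fraction
import math

def decimal_trap(x, places):
    # Trap x between two consecutive `places`-digit decimals,
    # kept as exact fractions.
    scale = 10 ** places
    lo = Fraction(math.floor(x * scale), scale)
    return lo, lo + Fraction(1, scale)

# Each extra decimal place narrows the trap by a factor of ten:
for k in range(4):
    lo, hi = decimal_trap(math.pi, k)
    print(f"{lo} <= pi <= {hi}")
```

<p>Running the trap forever reads off the infinite decimal expansion, digit by digit.</p>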
<p>If we want those three things to be true, the real numbers are what we have to use.</p>
<h3 id="value-theorems-calculus-and-physics">Value theorems, calculus, and physics</h3>
<p>We want a field because it lets us actually do arithmetic; we want an order because some numbers are bigger than other numbers. The least obvious property here—and the one that creates all of that uncountable weirdness—is the completeness. Why do we need that?</p>
<p>The obvious answer is the one we saw in part 2: completeness lets us handle geometry and distances, because we can define any number we can approximate. But this is maybe not the most compelling reason, since <em>we can approximate</em> those distances without having to use real numbers. In fact this is essentially what the Greeks did: Archimedes computed that the circumference of a circle with diameter one was between \(\dfrac{223}{71}\) and \(\dfrac{22}{7}.\) The number \(\pi\) isn’t rational, but \(22/7\) and \(355/113\) and \(3.1415926535\) all are, and that’s enough for any actual calculation we want to do.</p>
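<p>We can check just how good those rational stand-ins are. A quick sketch (my own) measuring the error of each rational approximation to \(\pi\):</p>

```python
import math
from fractions import Fraction

# Rational approximations to pi, with their errors.
for q in [Fraction(22, 7), Fraction(355, 113), Fraction(31415926535, 10**10)]:
    print(q, abs(float(q) - math.pi))
```

<p>The errors shrink from about \(10^{-3}\) down to below \(10^{-9}\)—plenty for any physical measurement.</p>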
<p>No, <strong>the real reason we need real numbers is they let us do calculus</strong>.</p>
<p>In calculus, we learn about the derivative, which tells us how quickly something is changing. (So if \(f(x)\) represents the position of an object, then the derivative \(f’(x)\) represents the speed.) A freshman calculus course will then spend a lot of time learning formulas to compute the derivative, which is very important for using calculus, but not important to the story we’re interested in here. So we won’t worry about that.</p>
<p>Instead I want to talk about a few important theorems about the derivative. A college calculus course will generally mention these theorems, but not really focus on them, because they aren’t necessary for any particular calculations. But they are critical to explain why we do calculus the way we do, and if I were writing a Fictional History of Calculus they would take center stage.</p>
<p>I like to call these key theorems the value pack, because they’re thematically related, and also all have the word “value” in their names:</p>
<ul>
<li>
<p>The <strong>Intermediate Value Theorem</strong> says that if a continuous function can output two distinct numbers, it can also output anything in between them. This is the “no teleporting” theorem: if an object falls from ten feet above ground to five feet above ground, at some point in the middle it was nine and eight and seven and six feet off the ground.</p>
<p>It’s also the rule that says continuous functions have continuous graphs. If you’ve heard that a continuous function is one you can draw without lifting your pencil off of the paper, you’ve heard a version of the Intermediate Value Theorem.</p>
</li>
<li>
<p>The <strong>Mean Value Theorem</strong> says that if you have a differentiable function on a closed interval, the average speed is equal to the derivative at some point. This is the “speed limit” theorem: it says that if your speed is never higher than sixty miles per hour, you can’t possibly travel more than sixty miles in one hour.</p>
<p>It also tells us that if a function’s derivative is zero, the function has to be constant. If your speed is always zero, then you should never move at all.</p>
</li>
<li>
<p>The <strong>Extreme Value Theorem</strong> says that a continuous function on a closed interval has a maximum and minimum value. This is the “what goes up must come down” theorem: if you toss a ball in the air, some point in that toss will be the highest point.</p>
</li>
</ul>
<p>And if all of those things seem obviously true, well, that’s the point. These are the theorems that tell us functions, and derivatives, and <em>calculus</em> all behave the way they’re supposed to. If these theorems weren’t true, then calculus wouldn’t describe the way things actually move in the actual world we observe, and so it wouldn’t be useful.</p>
<p>But if we don’t use the real numbers, all three of those theorems break.</p>
<h3 id="contradicting-our-values">Contradicting our values</h3>
<p>Let’s imagine we’re doing calculus over just the rational numbers. That means we’re using functions that take in rational numbers as inputs, and give other rational numbers back as outputs. There are plenty of reasonable functions like this: \(f(x) = 3x+5\) and \( f(x) = \frac{x^2+1}{x-2} \) both send rational numbers to rational numbers. But there are also <em>un</em>reasonable functions, like this one:</p>
<p>\[
f(x) = \left\{
\begin{array}{ccc}
1 & \text{if} & x^2 > 2 \\
0 & \text{if} & x^2 < 2
\end{array}
\right. .
\]</p>
<p>This looks weird and a little ugly, but it’s straightforward to compute once we understand what it means. When we plug in a number \(x\), we square it and look at the result. If we get a number bigger than \(2\), the function outputs \(1;\) if we get a number less than \(2\) the function outputs \(0.\) This rule works for any rational number, and will always give us one or zero as an output. (This rule would <em>not</em> work if \(x^2=2\), but since \(x\) is a rational number, that can’t happen!) In fact, this function is continuous, and differentiable, at every rational number. And at every rational number, the derivative is zero.</p>
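<p>Since the rule is genuinely computable, we can write it down directly. A minimal sketch (my own), using Python’s exact fractions so the inputs really are rational:</p>

```python
from fractions import Fraction

def f(x):
    # The jump function from the text: well-defined at every rational x,
    # because x**2 == 2 has no rational solution.
    if x * x > 2:
        return 1
    if x * x < 2:
        return 0
    raise ValueError("x^2 = 2 can't happen for rational x")

print(f(Fraction(7, 5)), f(Fraction(3, 2)))  # 0 1
```

<p>The comparisons are exact—no square roots anywhere—so the error branch never fires on rational inputs.</p>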
<p>We can even graph this function pretty easily:</p>
<p class="center blog-image"><img src="/assets/blog/fictional-history-of-numbers/rational-jump.png" alt="The graph of this function f." /></p>
<p><em class="center blog-image">Behold, a continuous function.</em></p>
<p>And now things seem off. The graph doesn’t <em>look</em> continuous. But that’s because it jumps “at” \(\sqrt{2}\)—and in the rationals, that number doesn’t exist. So even though the function is continuous, it gives the outputs \(0\) and \(1\), but never anything in between; the Intermediate Value Theorem fails.</p>
<p>Even worse, the function has a derivative of zero everywhere. And looking at the graph, this makes sense, right? We can’t find any specific point where the function is increasing or decreasing; the tangent line at any point is horizontal. (Again, there <em>should</em> be a bad point at \(\sqrt{2},\) but since that number isn’t rational we don’t care.) And thus the value of the function changes, even though the rate of change is always zero. The Mean Value Theorem fails.</p>
<p>This function <em>does</em> satisfy the Extreme Value Theorem, but it’s not too hard to find other functions that don’t.</p>
<p class="center blog-image"><img src="/assets/blog/fictional-history-of-numbers/rational-evt.png" alt="" /></p>
<p><em class="center blog-image">The graph of \(6x-x^3\) between \(-2\) and \(2.\) In the real numbers, it has a maximum at \(\sqrt{2};\) when restricted to the rational numbers, it has no minimum or maximum.</em></p>
<p>Now, this is all kind of dumb. The function \(f\) shouldn’t be continuous; there are obvious jumps in it! And the function \(6x-x^3\) has an obvious maximum in its graph. But that’s exactly the point. The rational numbers just <em>aren’t good enough</em> to do calculus, because obviously dumb and false things wind up being, technically, true. The function \(f\) is “continuous” because it jumps at an irrational number, and the function \(6x-x^3\) has “no maximum” because the maximum value happens at an irrational number. The rational numbers have gaps, and if we don’t fill in the gaps with real numbers, calculus just doesn’t work.</p>
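<p>You can watch the Extreme Value Theorem fail numerically. Here’s a sketch (mine) feeding \(6x-x^3\) the rational approximations to \(\sqrt{2}\) coming from its continued fraction: the outputs creep up toward \(4\sqrt{2}\approx 5.65685\) but never reach it.</p>

```python
import math
from fractions import Fraction

def g(x):
    return 6 * x - x**3

# Continued-fraction convergents of sqrt(2): 1, 3/2, 7/5, 17/12, 41/29, ...
convergents = [Fraction(1), Fraction(3, 2), Fraction(7, 5),
               Fraction(17, 12), Fraction(41, 29)]

supremum = 4 * math.sqrt(2)  # g(sqrt(2)) -- never attained at a rational input
for x in convergents:
    print(x, float(g(x)), supremum - float(g(x)))
```

<p>Every rational input falls strictly short of the supremum, so restricted to the rationals the function simply has no maximum.</p>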
<p>And while these examples talk about the <em>rational</em> numbers, we can also find examples that break in the algebraic numbers. They’re just way more annoying to describe, so I don’t want to write out the details.</p>
<p class="center blog-image"><img src="/assets/blog/fictional-history-of-numbers/algebraic-evt.png" alt="" /></p>
<p><em class="center blog-image">This is a continuous function that sends algebraic numbers to algebraic numbers but has no maximum in the algebraic numbers. I’m not going to try to define it here. I had to get help from three people to define it properly, and writing the code for the graph took half an hour. Please don’t ask me to explain it.</em></p>
<h3 id="whereof-one-cannot-speak">Whereof one cannot speak</h3>
<p>There’s one more shot we have at avoiding all this nonsense. The rational numbers don’t work, and the algebraic numbers don’t work. But in this essay we described a new set of numbers, which contains all the algebraics and more, but is still countable: the computable numbers. Can we use these to do calculus, and avoid thinking about the uncountably many uncomputable reals?</p>
<p>Surprisingly, the answer is yes—sort of.</p>
<p>If we can give a recipe for \(x,\) and a recipe for \(y,\) then we can give a recipe for \(x+y\)—by saying “\(x+y\)”! We can do the same thing for subtraction, multiplication, and division, so when we do algebra with computable numbers, the result is also computable. That means the computable numbers are a field.</p>
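<p>This is easier to see with a concrete encoding. Suppose—an assumption for this sketch, though it’s one standard way to formalize things—that a “recipe” for \(x\) is a function which, given \(n\), returns a rational within \(1/n\) of \(x\). Then the recipe for a sum really is just “run both recipes and add”:</p>

```python
from fractions import Fraction

def const(q):
    # Recipe for a rational number: every approximation is exact.
    return lambda n: Fraction(q)

def add(x, y):
    # Recipe for x + y: to land within 1/n, ask each input for accuracy 1/(2n).
    return lambda n: x(2 * n) + y(2 * n)

def sqrt2(n):
    # Recipe for sqrt(2): bisect over rationals until the bracket is under 1/n.
    lo, hi = Fraction(1), Fraction(2)
    while hi - lo > Fraction(1, n):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if mid * mid < 2 else (lo, mid)
    return lo

one_plus_root2 = add(const(1), sqrt2)
print(float(one_plus_root2(1000)))  # within 1/1000 of 1 + sqrt(2)
```

<p>Subtraction, multiplication, and division work the same way (division takes a little care near zero), which is exactly why the computable numbers form a field.</p>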
<p>The computable numbers are ordered, because they’re all real numbers, so we can compare two and see which one is bigger. They are <em>not</em> complete—if they were, they’d just be the real numbers. But they’re something almost as good: they’re “computably complete”.</p>
<p>Remember completeness means that any sequence that looks like it should converge does in fact converge; the real numbers are never just <em>missing</em> the number the sequence wants to converge to. But this isn’t true for the computable numbers. There are sequences of computable numbers that converge to uncomputable numbers.</p>
<p>However, I can’t give you any examples of those sequences. And in this case I’m not just being lazy; I really can’t give you an example. Because if I could describe a sequence of computable numbers that converges to \(x,\) then that sequence gives a recipe for computing \(x,\) and so \(x\) must actually be computable. And that means that every <em>computable</em> sequence of computable numbers converges to a computable number. Not every sequence will behave well, but every sequence we can actually describe will.</p>
<p>In the same way, none of the value theorems are technically true in the computable numbers. But they’re almost true. They’re <em>computably</em> true.</p>
<p>When we found rational functions that failed the value theorems, we did something extra: we described them using only rational numbers. (That’s why the condition was \(x^2>2\) and not \(x > \sqrt{2},\) for instance.) But for the computable numbers we can’t do that. There are functions from the computable numbers to the computable numbers where the value theorems fail, but those functions are themselves uncomputable. Any function that we can actually compute the results of will satisfy the value theorems. If we do anything even vaguely reasonable, everything will work the way we expect.</p>
<h3 id="one-must-remain-silent">One must remain silent</h3>
<p>So we could avoid the weirdness of the reals and stick to the computables. If we work in the reals, every number we actually talk about will be computable, so we don’t gain anything by allowing all the real numbers. And if we work in the computables, every function we want to think about will satisfy the value theorems.</p>
<p>But there’s no reason to stay in the computables, either—the choice fundamentally <em>doesn’t matter</em>. We work in the reals because they give us precisely the tools we want.</p>
<p>Sure, we don’t want to deal with arbitrary infinite decimals, and we certainly don’t want to deal with Dedekind cuts. But we don’t want to think about explicit computer programs for every number we ever use, either. What we want is the value theorems; what we want is a complete ordered field. And when we ask for that, without adding any restrictions, the real numbers are what pop out.</p>
<p>We shouldn’t think about the reals using their formal definition. We should think about what they <em>do</em> for us, the tools they allow us to use and the moves they allow us to make. The real numbers are a complete ordered field, and they give us the value theorems, so calculus works. And that’s all we want. We don’t need to make it more complicated than that.</p>
<p>But next time we’ll make it complex, instead.</p>
<hr />
<p><em>Do you have comments, or questions? Are there other types of numbers you want to learn the story behind? Tweet me <a href="https://twitter.com/profjaydaigle">@ProfJayDaigle</a> or leave a comment below.</em></p>
<hr />
<div class="footnotes">
<ol>
<li id="fn:1">
<p>We’re going to ignore the negative numbers here because they make everything more complicated in a boring and annoying way. I promise I could include them if I wanted to make this section even longer. <a href="#fnref:1" class="reversefootnote">↵Return to Post</a></p>
</li>
<li id="fn:2">
<p>Fancy number theorists call this sort of thing a <a href="https://en.wikipedia.org/wiki/Height_function">height</a>. It’s a convenient way of putting a “size” on rational numbers so that there are only finitely many small ones, which is useful when we want to put things in order, or compute probabilities. <a href="#fnref:2" class="reversefootnote">↵Return to Post</a></p>
</li>
<li id="fn:3">
<p>In part 1 we said rational coefficients, but this is the same thing. If you have an equation with rational coefficients, you can multiply through by the least common denominator and get an equivalent equation with integer coefficients. And assuming all the coefficients are integers is way more convenient for what we’re going to do here. <a href="#fnref:3" class="reversefootnote">↵Return to Post</a></p>
</li>
<li id="fn:4">
<p>We could also pick 1, or 2, or 3, or 4, or 5, or 7, or 8, or 9; this isn’t a deterministic process. We just can’t pick 6. <a href="#fnref:4" class="reversefootnote">↵Return to Post</a></p>
</li>
<li id="fn:5">
<p>There’s some slight weirdness here around the fact that \(0.99\dots~ = 1.\) But it’s not a real problem; don’t worry about it. <a href="#fnref:5" class="reversefootnote">↵Return to Post</a></p>
</li>
<li id="fn:6">
<p>We could also stick it in the middle, or replace some element of the list we started with. But we can’t tack it on to the end, because this is an infinite list—it doesn’t have an end! <a href="#fnref:6" class="reversefootnote">↵Return to Post</a></p>
</li>
<li id="fn:7">
<p>You might feel like this probability is obviously too big, and of course you’re right. But none of the argument depends on the specific number. I just want to keep the number small so it’s easier to think about what’s going on. <a href="#fnref:7" class="reversefootnote">↵Return to Post</a></p>
</li>
<li id="fn:8">
<p>If this seems like a lot of symbols, you’re right! English has, like, thirty, and thirty is way smaller than infinity. But the lambda calculus, which is the mathematical formalism I linked, does in fact want an infinitely long list of symbols. Since this argument works either way, I’m letting the list be infinite. <a href="#fnref:8" class="reversefootnote">↵Return to Post</a></p>
</li>
<li id="fn:9">
<p>If we give up on having an order, we can get “looping around” behavior, where repeatedly adding \(1\) gets us back to where we started. This shows up in modular arithmetic and finite fields, which I hope to discuss later in this series. But in an <em>ordered</em> field, this sort of looping isn’t possible, because adding \(1\) will always make a number larger. <a href="#fnref:9" class="reversefootnote">↵Return to Post</a></p>
</li>
</ol>
</div>Jay DaigleThis week we continue our exploration of what numbers are, and where mathematicians keep finding weird ones. Last time we defined the real numbers, but it took a lot of work. Now we'll see how truly strange they are. They're so strange that it's tempting to avoid them and stick with something simpler. But the real numbers do a much better job of modeling the parts of the world we care about. Their weirdness is exactly what we need to guarantee that a bunch of "obvious" claims are actually true. And we'll end by seeing why we could stick to the computable numbers instead—but we shouldn't bother.Evaluating Students is Important, Too2023-05-22T00:00:00-07:002023-05-22T00:00:00-07:00https://jaydaigle.net/blog/evaluating-students-is-important<p>There’s a classic dry academic joke that I retell frequently. My entire salary is just to pay me to grade; I do the rest of my job for free.</p>
<p>And this is first an apology for not writing much lately; we just got through finals, and that involves a lot of grading. So I’ve been a little occupied actually earning my paycheck. But it’s also a response to <a href="https://www.experimental-history.com/p/i-wanted-to-be-a-teacher-but-they">this excellent post</a> by Adam Mastroianni of Columbia Business School on his substack Experimental History. I liked the post, and it reflects a lot of things I think about in my teaching. But I have a couple of big disagreements—in part, I suspect, because we’re teaching in pretty different contexts—and I wanted to write up a quick discussion<strong title="I also want to experiment with writing shorter posts to intersperse among the multi-thousand-word behemoths that take a month and a half to write."><sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup></strong> of where I think we differ, and why I think giving grades is important and valuable.</p>
<hr />
<p>Let’s start with Adam’s framing:</p>
<blockquote>
<p>My teaching job, it turns out, is actually two jobs.</p>
</blockquote>
<blockquote>
<p>One job is instruction. Students and I enter the same room at scheduled times, I perform a series of actions, they perform a series of responses, and then the students leave the room more educated than they were before. This job rules. I like it when my students go “ohh!” and “I never thought about it that way” and “I get it now!” I like when they email me, years later, to tell me how they used something they learned in class. This all makes sense. In fact, I thought this would be my only job.</p>
</blockquote>
<blockquote>
<p>But I realize now that I have a second job, which is evaluation, or gatekeeping, or, most specifically, point-guarding. I’m supposed to award “points” based on what students do in my class. Students try to acquire as many points as they can, and I try to stop them from obtaining points too easily….</p>
</blockquote>
<blockquote>
<p>This part of my job makes no sense. For one thing, point-guarding makes students miserable…. For another thing, point-guarding makes me miserable…. Worst of all, the things that make me a better instructor often make me a worse evaluator, and vice versa….</p>
</blockquote>
<p>He then discusses three specific reasons one might want to give and/or guard points, and largely dismisses them.</p>
<h3 id="do-we-need-grades-to-give-students-feedback">Do we need grades to give students feedback?</h3>
<p>Adam says no, and I agree completely. Grades are often the vehicle we use to give feedback, largely because we have to give grades anyway. But you can give feedback without attaching a grade to it. I have taken many dance and music lessons, and they never had grades attached; I still promise you I got, and incorporated, a ton of feedback from these lessons, because that’s <em>why I was there</em>.</p>
<p>Conversely, while every grade comes with <em>some</em> feedback, just hearing “3/10” doesn’t actually tell our students anything useful that can help them improve. The need to give grades often channels our feedback into not terribly useful forms.</p>
<h3 id="do-we-need-grades-to-motivate-students">Do we need grades to motivate students?</h3>
<p>Adam rejects this idea, because most people are naturally curious and if they’re not motivated to learn in our courses, the takeaway should be that our courses suck. But I think he’s a bit too quick to dismiss the importance of motivation.</p>
<p>First of all, people do like getting points. This is what drives the success of apps like Duolingo: people <em>start</em> using Duolingo because they want to learn Spanish, but they <em>keep</em> doing it in part to keep their streak alive and keep earning the fundamentally meaningless Duolingo XP. People find it surprisingly motivating to get a gold star and a verbal pat on the back, and “10/10” is one way of doing that.<strong title="You can give me a gold star and motivate me to keep writing by leaving a comment here, or on [Twitter], or [Substack], or by donating to my [Ko-Fi]."><sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup></strong></p>
<p>But I think I have a more substantive difference from Adam, which is shaped by the specific courses I mostly teach: introductory “gateway” math courses like calculus and linear algebra. I think these subjects are fascinating! (That’s why I became a professional mathematician, after all: I like math!) And in upper-division courses, “isn’t this cool” is actually pretty adequate to keep students engaged. (Most of my grad school classes had essentially no grading, and that was fine.) So it resonates when Adam says:</p>
<blockquote>
<p>[I]f people need some extrinsic motivation to engage in my class, one of two things might be happening. Maybe they’re just not interested in what I have to offer. That’s fine! They should take a different class.</p>
</blockquote>
<p>But most of my students aren’t taking calculus because they think it’s cool. They’re taking calculus because they need to know calculus to do other things they want to do. Their motivation is already extrinsic! And that creates a big problem of <a href="https://en.wikipedia.org/wiki/Akrasia">akrasia</a>, because in the long term they want to have learned calculus, but in the short term they don’t “want” to sit down and do a bunch of exercises.<strong title="And when I was taking piano lessons I never played enough scales. This is pretty much a human universal. Good practice is rarely fun."><sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup></strong> And if you’re not doing exercises, you’re not learning math.<strong title="I have a lot more to say about this, but if I tried to say it here then this post would become a multi-thousand-word behemoth that takes a month and a half to write."><sup id="fnref:4"><a href="#fn:4" class="footnote">4</a></sup></strong> A little bit of week-to-week prodding is valuable.</p>
<p>Moreover, I teach a lot of <em>freshmen</em>. They generally haven’t figured out how to manage themselves in college yet, and having some gentle guide rails (and metaphorical gold stars!) is really helpful.</p>
<h3 id="do-we-need-grades-to-separate-good-students-from-bad-students">Do we need grades to separate good students from bad students?</h3>
<p>Adam just says he’s not interested in doing this: “What am I going to do, send the good students to heaven and send the bad students to hell?” And emotionally, I sympathize a lot. All my students are my students, I want the best for all of them, and I have no desire to draw judgments on their characters, or worth as people, or anything like that. Hell, I don’t want to evaluate them at all! Grading sucks!</p>
<p>But this is where I come back to the joke I started with, a bit more seriously. I don’t like grading, but it is a large part of what I’m getting paid for. Adam denies this:</p>
<blockquote>
<p>Ranking my students doesn’t help me teach them, so I have no interest in doing it. But I understand why other people want me to do it.</p>
</blockquote>
<blockquote>
<p>In fact, they’re counting on it. Businesses need to decide who to hire, graduate schools need to decide who to admit, and scholarships need to decide who to fund, so they’d all appreciate it if I identified the best students for them. I can’t help but notice, however, that none of those organizations pay me. They pay headhunters, hiring managers, and program officers, after all, so it’s a little weird for me to do these people’s work for them. It’s especially egregious for these businesses and schools to force students to pay huge sums to get themselves evaluated by me, a guy who just wants to teach them psychology but ends up playing point guard instead.</p>
</blockquote>
<p>But this is an aggressively shallow reading of the economics of academia. Sure, the businesses who are doing the hiring don’t pay me. But George Washington University does pay me, and they can afford to do that in large part because my students pay them.<strong title="Yes, universities have a lot of revenue sources other than student tuition. But most of those sources rely on them continuing to be prestigious universities whose students go on to get good jobs and have successful careers, so I don't think that changes this argument substantially."><sup id="fnref:5"><a href="#fn:5" class="footnote">5</a></sup></strong> And my students pay for the degree because it gives them a credential they can bring to businesses and get hired.</p>
<p>So those businesses (and graduate schools and scholarships etc.) aren’t paying me <em>directly</em>, but they are responsible for me getting paid. Sure, I’m a guy who just wants to teach them math but ends up playing point guard instead; but that’s why I joke that my salary pays for the point guarding and I do the actual teaching for free.</p>
<h3 id="evaluation-is-important">Evaluation is important</h3>
<p>Beyond the purse strings argument, sometimes we do need to evaluate people because we, as a society, need to know whom to trust. We don’t need to send some students to heaven and others to hell, but we do need to send some students to medical school and others to places where they won’t accidentally kill a bunch of people. And Adam fully concedes this at the end of his essay:</p>
<blockquote>
<p>But look, we need some evaluation. People have different talents, and they should get opportunities that tap those talents, not just because it benefits them, but because it benefits everybody. If I’m drowning (God forbid), I want to be saved by a lifeguard who’s good at swimming. If I get hit by a bus (God forbid), I want to be operated on by someone who’s good at surgery. If I take a math class (God forbid), I want to learn from someone who’s good at math. For that world to exist, someone, at some point, has to evaluate people on their swimming, surgery, and math.</p>
</blockquote>
<p>But he doesn’t want to do the evaluation. And he speculates about the benefits of completely separating teaching from evaluation.</p>
<p>To some extent this sounds appealing. First, because if I could keep my job except without the grading, that would be fantastic.<strong title="Actually, the grading isn't the worst part. Answering emails _about_ the grading is the worst part. One place I agree with Adam completely is that I don't _want_ to get detailed rundowns of my students' personal issues, and I don't want them to have to share them. "><sup id="fnref:6"><a href="#fn:6" class="footnote">6</a></sup></strong> Second, because having more uniformity in evaluation would be good: if my “A” isn’t the same as your “A” then this <em>doesn’t</em> do a good job of figuring out who knows math and who doesn’t. This is why a lot of departments do common final exams—and while those sound logistically annoying, I’m basically in favor of them, and that <em>is</em> a move in the direction Adam is suggesting.</p>
<p>And finally, Adam’s plan is appealing because writing good evaluations is itself a major skill, and a lot of professors write pretty shit evaluations. I did a short fellowship with the College Board this semester helping them evaluate the Calc AB AP test, and I was blown away by the quality of the questions and the meticulousness with which they were put together.<strong title="Disclosure: I did a short fellowship with the College Board this semester helping them evaluate the Calc AB AP test. Wait, I already said that."><sup id="fnref:7"><a href="#fn:7" class="footnote">7</a></sup></strong> They have a large team of skilled professionals who put in a ton of effort to write an exceptionally good test, which is exactly what Adam asks for, and that’s extremely valuable work.</p>
<p>But while the College Board test writers are great at their jobs, there’s also a reason they wanted to workshop the test with professors: we are, in actual fact, the experts in what skill at calculus looks like! If you want to know if people understand calculus, you need experts in calculus. If you want to know if people have learned psychology, you need experts in psychology. For better or for worse, we need to be in the loop somehow.</p>
<h3 id="evaluations-that-dont-suck">Evaluations that don’t suck</h3>
<p>But another thing I agree with Adam on is that we need to take evaluation seriously, as its own task. Most professors don’t really think hard about evaluation in its own right; but then, most professors don’t think about pedagogy all that deeply either.</p>
<p>Education theorists talk about “formative” and “summative” assessments. Formative assessments are mostly about teaching you something. I assign weekly problem sets because I want students to <em>do those problems</em>—because if they don’t do the problems, they won’t learn much. It’s not really an attempt to evaluate them. (And to be fair, I don’t think Adam is complaining about formative assignments.)</p>
<p>Summative assessments are the ones where you’re trying to really evaluate your students. And there are a few problems with the way we do those right now, but one is that we haven’t really committed to what these evaluations are supposed to say. Are we judging students on work ethic? On punctuality? On generally being good people?</p>
<p>And honestly, a lot of the time the answer to those questions is just “yes”. But if evaluation sucks, then evaluating “is this student a good person, overall” sucks twice. I don’t want to send my students to heaven or to hell, or try to tell whether they’re “good people” or not. And even if it’s possible, I’m not equipped to do it well.</p>
<p>What I am equipped to do is evaluate <em>whether they know calculus</em>. And that’s my job, right? My students need calculus so they can go take other classes that assume they know calculus. And I need to tell them, and everyone else, whether they do in fact know enough calculus to succeed in their next class.</p>
<p>And once I really embraced this idea, evaluation became a lot simpler.<strong title="I have a lot more to say about this, but if I tried to say it here then this post would become a multi-thousand-word behemoth that takes a month and a half to write. No, I didn't duplicate a footnote by accident."><sup id="fnref:8"><a href="#fn:8" class="footnote">8</a></sup></strong> It lets me be kinder about some things, and stricter about others, but fundamentally it means I’m evaluating something I’m equipped to evaluate: do my students know calculus?</p>
<p>And that’s what I care about after all, isn’t it?</p>
<hr />
<p><em>What do you think? Is grading worth it? Is there a better way? You can tweet me <a href="https://twitter.com/profjaydaigle">@ProfJayDaigle</a>, make a <a href="https://substack.com/profile/20149809-jay-daigle">note on Substack</a>, or leave a comment below.</em></p>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>I also want to experiment with writing shorter posts to intersperse among the multi-thousand-word behemoths that take a month and a half to write. <a href="#fnref:1" class="reversefootnote">↵Return to Post</a></p>
</li>
<li id="fn:2">
<p>You can give me a gold star and motivate me to keep writing by leaving a comment here, or on <a href="https://twitter.com/profjaydaigle">Twitter</a>, or <a href="https://substack.com/profile/20149809-jay-daigle">Substack</a>, or by donating to my <a href="https://ko-fi.com/jaydaigle">Ko-Fi</a>. <a href="#fnref:2" class="reversefootnote">↵Return to Post</a></p>
</li>
<li id="fn:3">
<p>And when I was taking piano lessons I never played enough scales. This is pretty much a human universal. Good practice is rarely fun. <a href="#fnref:3" class="reversefootnote">↵Return to Post</a></p>
</li>
<li id="fn:4">
<p>I have a lot more to say about this, but if I tried to say it here then this post would become a multi-thousand-word behemoth that takes a month and a half to write. <a href="#fnref:4" class="reversefootnote">↵Return to Post</a></p>
</li>
<li id="fn:5">
<p>Yes, universities have a lot of revenue sources other than student tuition. But most of those sources rely on them continuing to be prestigious universities whose students go on to get good jobs and have successful careers, so I don’t think that changes this argument substantially. <a href="#fnref:5" class="reversefootnote">↵Return to Post</a></p>
</li>
<li id="fn:6">
<p>Actually, the grading isn’t the worst part. Answering emails <em>about</em> the grading is the worst part. One place I agree with Adam completely is that I don’t <em>want</em> to get detailed rundowns of my students’ personal issues, and I don’t want them to have to share them. <a href="#fnref:6" class="reversefootnote">↵Return to Post</a></p>
</li>
<li id="fn:7">
<p>Disclosure: I did a short fellowship with the College Board this semester helping them evaluate the Calc AB AP test. Wait, I already said that. <a href="#fnref:7" class="reversefootnote">↵Return to Post</a></p>
</li>
<li id="fn:8">
<p>I have a lot more to say about this, but if I tried to say it here then this post would become a multi-thousand-word behemoth that takes a month and a half to write.</p>
<p>No, I didn’t duplicate a footnote by accident. <a href="#fnref:8" class="reversefootnote">↵Return to Post</a></p>
</li>
</ol>
</div>
<p><em>Jay Daigle</em></p>
<p><em>Adam Mastroianni wrote a very interesting essay on his Substack about the difference between teaching and grading, how much he dislikes the latter, and all the problems with it. And I don’t really disagree, but I don’t agree either. So here’s my response. Grading sucks, but it’s important.</em></p>
<h1>A Fictional History of Numbers, Part 2: Measurement, Estimation, Completeness, and Reality</h1>
<p><em>2023-04-28 · <a href="https://jaydaigle.net/blog/fictional-history-of-numbers-part-2">https://jaydaigle.net/blog/fictional-history-of-numbers-part-2</a></em></p>
<p>This week we continue our exploration of what numbers are, and where mathematicians keep finding weird ones.</p>
<p>In <a href="/blog/fictional-history-of-numbers-part-1">part 1</a>, we started with the natural numbers, the most basic numbers we use to count things, and invented the integers (negative numbers) and the rational numbers (fractions). Then we took the same ideas a little further, and got the <em>algebraic numbers</em>, which are solutions to polynomial equations with rational coefficients—basically all the equations we can get by starting with the natural numbers and using just addition and multiplication.</p>
<p>But there are other questions we can ask, which <em>don’t</em> always give algebraic answers. So today we’ll look at a different question that we might want our numbers to answer: how do we measure things?</p>
<p>But before we start, if you like my writing and want to see more of this project, I have a <a href="https://ko-fi.com/jaydaigle">Ko-Fi account</a>. Any tips would be appreciated and would help me write more essays like this. Let me know what you’d like to hear about!</p>
<h2 id="finding-area">Finding Area</h2>
<p>Last time we left off with a question: what is the area of a circle of radius 1? You probably know the answer: the area of a circle is given by the formula \( \pi r^2\), so if the radius is \(1\) the area must be \(\pi\). But where did that formula come from? And what about the number \(\pi\)—what exactly is it?</p>
<p><img src="/assets/blog/fictional-history-of-numbers/circle-graph.png" alt="A graph of a circle with radius 1, with lines every half-unit" class="center blog-short-image" /></p>
<p>If we draw a quick picture, we can make a rough estimate of the area. The circle is contained inside a \(2 \times 2\) square, so it must have area less than \(4\); and it contains a \(1 \times 1\) square, so it must have area bigger than \(1\). But we want to be a bit more precise.</p>
<p>One option is just to draw more, smaller squares.</p>
<p><img src="/assets/blog/fictional-history-of-numbers/circle-quarter-graph.png" alt="A graph of a circle with radius 1, with lines every quarter-unit" class="center blog-short-image" /></p>
<p>Each of these squares is \(\frac{1}{4} \times \frac{1}{4}\) and so has area \(\frac{1}{16}\). We can count that the circle contains \(32\) of them, and so has area at least \(\frac{32}{16} = 2\).</p>
<p><img src="/assets/blog/fictional-history-of-numbers/circle-quarter-graph-inner.png" alt="A graph of a circle with radius 1, with lines every quarter-unit, and 32 squares filled in inside the circle" class="center blog-short-image" /></p>
<p>On the other hand, we can cover the circle with \(60\) squares, so the circle has area less than \(\frac{60}{16} = \frac{15}{4} = 3.75.\)</p>
<p><img src="/assets/blog/fictional-history-of-numbers/circle-quarter-graph-outer.png" alt="A graph of a circle with radius 1, with lines every quarter-unit, and 60 squares filled in covering the circle" class="center blog-short-image" /></p>
<p>So we have an <em>estimate</em> for the area \(A\) of our circle: \(2 < A < \frac{15}{4}.\) But this estimate still isn’t very good. We can improve it by drawing a finer grid, with more, smaller squares; but this gets tedious really quickly.</p>
<p><img src="/assets/blog/fictional-history-of-numbers/circle-fine-graph.png" alt="A graph of a circle with radius 1, with lines every twentieth of a unit" class="center blog-short-image" /></p>
<p><em class="center blog-short-image smalltable">I’m not <strong>going</strong> to count the squares in this picture, but I <strong>could</strong>.</em></p>
<p>This gives us a way to think about the area of this circle. By drawing finer and finer grids, we can get better and better estimates of the area of the circle.</p>
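<p>In fact, we can hand the square-counting off to a computer. Here’s a quick Python sketch (my illustration, not part of the original argument): it counts, in one quadrant, the grid squares that lie entirely inside the unit circle and the squares that touch it at all, then multiplies by four to get lower and upper bounds on the area.</p>

```python
def circle_area_bounds(n):
    """Bound the area of the unit circle with a grid of (1/n)-by-(1/n) squares.

    Squares entirely inside the circle give a lower bound on the area;
    squares that overlap the circle at all give an upper bound. By symmetry
    we count one quadrant and multiply by four.
    """
    inner = outer = 0
    for i in range(n):
        for j in range(n):
            # The cell [i/n, (i+1)/n] x [j/n, (j+1)/n] lies inside the
            # circle exactly when its farthest corner does...
            if (i + 1) ** 2 + (j + 1) ** 2 <= n * n:
                inner += 1
            # ...and overlaps the circle when its nearest corner does.
            if i * i + j * j < n * n:
                outer += 1
    return 4 * inner / n ** 2, 4 * outer / n ** 2

print(circle_area_bounds(4))    # the quarter-unit grid: (2.0, 3.75)
print(circle_area_bounds(100))  # a finer grid squeezes in on 3.14...
```

<p>The quarter-unit grid reproduces the bounds \(2 < A < \frac{15}{4}\) from the pictures above, and a hundred squares per unit pins the area down to within about a tenth.</p>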
<h3 id="formulaic-estimation">Formulaic estimation</h3>
<p>As a mathematician, I’m a very specific kind of lazy. I’m much too lazy to count up dozens of tiny squares, but I am willing to make very complicated, abstract, and possibly confusing arguments to <em>avoid</em> counting the squares. So I want to estimate the area of this circle in a more formulaic way, so I don’t have to count anything.</p>
<p>Let’s pretend the circle is a pizza. We can cut it into eight slices, like this:</p>
<p><img src="/assets/blog/fictional-history-of-numbers/circle-octagon.png" alt="A circle with an octagon inscribed, like it's cut into eight pizza slices" class="center blog-short-image" /></p>
<p>If we join together the points where the slices meet the crust, we get an octagon around the outside. That lets us replace our difficult question with a simpler one: what is the area of this octagon? The octagon is made up of eight triangles, and we know the area of a triangle is \( \frac{1}{2} b h \), where \(b\) is the length of the triangle’s base, and \(h\) is the triangle’s height. So the area of the octagon is \(8 \cdot \frac{1}{2} b h = 4 bh\).</p>
<p><img src="/assets/blog/fictional-history-of-numbers/circle-octagon-triangle.png" alt="A circle with an octagon inscribed, like it's cut into eight pizza slices, where one slice has the base and height marked out." class="center blog-short-image" /></p>
<p>The triangle’s height is <em>roughly</em> the radius of the circle, which is \(1\); and the length of the base is <em>roughly</em> one eighth of the circumference of the circle. And since we’re just estimating, rough numbers are fine; we can say that</p>
<p>\[
\text{Area of circle} \approx \text{Area of Octagon}
\approx 4 \cdot \frac{\text{circumference}}{8} \cdot 1
= \frac{\text{circumference}}{2},
\]
so the area of the circle of radius 1 is about half its circumference. In fact, we can make this same argument for a circle of any radius: if the radius is \(r\) and the circumference is \(C\), then the area will be approximately \(\frac{1}{2} C r.\)</p>
<p>But these are all just rough estimates. The area of the octagon isn’t exactly \(\frac{1}{2} Cr\), and the area of the circle isn’t exactly the same as the area of the octagon. But here’s where we have a key insight, which the Greeks called the <a href="https://en.wikipedia.org/wiki/Method_of_exhaustion">method of exhaustion</a><strong title="No, not because everyone was exhausted by this point in the lesson."><sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup></strong>: both of those approximations get better if we draw a shape with more sides. Here’s the same basic picture, but instead of an octagon, we draw a sixteen-sided hexadecagon:</p>
<p><img src="/assets/blog/fictional-history-of-numbers/circle-hexadecagon-triangle.png" alt="A circle with a hexadecagon inscribed, like it's cut into sixteen pizza slices" class="center blog-short-image" /></p>
<p>We have sixteen triangles, which still have a height of about \(r\), but have bases of length about \(\frac{C}{16}\). This gives a total area of roughly</p>
<p>\[
\text{Area of Circle} \approx \text{Area of Hexadecagon}
\approx 16 \cdot \frac{1}{2} \cdot \frac{C}{16} \cdot r
= \frac{1}{2} C r.
\]</p>
<p>And next we have a \(32\)-sided icosidodecagon.<strong title="A word I'm pretty sure I'd never heard before I just looked it up."><sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup></strong></p>
<p><img src="/assets/blog/fictional-history-of-numbers/circle-icosidodecagon-triangle.png" alt="A circle with an icosidodecagon inscribed, like it's cut into thirty-two pizza slices" class="center blog-short-image" /></p>
<p><em class="center blog-image">The grey area isn’t the whole circle, but I can’t actually see the difference.</em></p>
<p>The icosidodecagon is still not <em>quite</em> the same size as the circle, but it’s pretty close. So we get an even better approximation:</p>
<p>\[
\text{Area of Circle} \approx \text{Area of Icosidodecagon}
\approx 32 \cdot \frac{1}{2} \cdot \frac{C}{32} \cdot r
= \frac{1}{2} C r.
\]</p>
<p>More importantly, we can see that as the number of sides goes up, all of our approximations get better: the polygon is closer to being a circle, the height of each triangle is closer to the radius, and the base of each triangle is closer to \( \frac{C}{n} \), where \(n\) is the number of sides of the polygon. So we can tell this approximation will get better and better as the number of sides of our polygon gets bigger; we conclude that the area of a circle is <em>exactly</em>
\[
A = \frac{1}{2} C r.
\]</p>
<p>But that still leaves us with a problem. This isn’t the formula for the area of a circle that you know and (maybe) love. And in fact this formula is not nearly as useful as \(\pi r^2\), because it requires both the radius and the circumference. We know the radius is \(r\); but what’s the circumference?</p>
<h3 id="ring-around-the-circle">Ring Around the Circle</h3>
<p>I know I said I’d invent some numbers, and I promise I’m getting there soon. But we should finish answering this question first.<strong title="The paper [How Archimedes showed that π is approximately equal to 22/7] by Damini D. B. and Abhishek Dhar was extremely helpful to me in putting this section together."><sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup></strong></p>
<p>We can find the circumference of a circle with the same basic method-of-exhaustion logic we used to find the area formula. If we inscribe a polygon inside the circle, the perimeter of the polygon will be <em>roughly</em> the circumference of the circle; and the more sides that polygon has, the better this approximation will be.</p>
<p>The trick is finding a polygon that we can actually estimate the circumference of. And what Archimedes noticed is that if the number of sides of the polygon is \(3 \cdot 2^n\), we can use some basic trigonometry to work this out.</p>
<p>A circle has \(360^\circ\) total in it. If we inscribe a hexagon, we can chop the circle into six equilateral triangles, which will each have an inner angle of \(60^\circ\). We can cut these in half to get an angle of \(30^{\circ}\)—and this is convenient, because some basic trigonometry<strong title="Just last week I told a student I had no memory of how to prove this. But the simple argument is precisely that we're cutting an equilateral triangle in half—the half-triangle has an angle of thirty degrees and a side that has half the length of the hypotenuse. "><sup id="fnref:4"><a href="#fn:4" class="footnote">4</a></sup></strong> can convince us that \(\sin(30^\circ) = 1/2\). This means that each side of the hexagon has length \(r\), and the perimeter of the hexagon is \(6r\).</p>
<p><img src="/assets/blog/fictional-history-of-numbers/hexagon-angle.png" alt="Two copies of a circle with an inscribed hexagon. One shows the angle of 60 degrees; the other cuts it in half to show an angle of 30 degrees. " class="center blog-short-image" /></p>
<p>If we take a circle with radius \(1\), then each side of the hexagon has length 1, and the perimeter is just \(6\). This tells us that the circumference of the circle has to be bigger than six—but not too much bigger.</p>
<p>But more importantly, we can extend this argument. There’s a standard trigonometric formula<strong title="Which I have to look up every time I want to use it."><sup id="fnref:5"><a href="#fn:5" class="footnote">5</a></sup></strong> for finding the sine of <em>half</em> of an angle. That means that when we look at a twelve-sided dodecagon and get an angle of \(15^\circ\), we can compute that \(\sin(15^\circ) = \frac{\sqrt{2 - \sqrt{3}}}{2}\). This tells us that each side has length \(\sqrt{2 - \sqrt{3}}\), and thus the total perimeter of the dodecagon is \(12 \sqrt{2 - \sqrt{3}}\approx 6.212.\)</p>
<p><img src="/assets/blog/fictional-history-of-numbers/dodecagon-angle.png" alt="Two circles. One has an inscribed dodecagon showing a half-angle of fifteen degrees; the other has an inscribed icositetragon showing a half-angle of 7.5 degrees. " class="center blog-short-image" /></p>
<p>Doubling the sides again gives a \(24\)-sided icositetragon; we use the trigonometric identity again, which gives a more complicated formula. But we can work out that each side has length \( \sqrt{2 - \sqrt{2 + \sqrt{3}}}, \) and the whole polygon has a perimeter of \(24 \sqrt{2 - \sqrt{2 + \sqrt{3}}} \approx 6.27.\)</p>
<p>Another doubling gives us a \(48\)-sided shape with perimeter \(48 \sqrt{2 - \sqrt{2 + \sqrt{2 + \sqrt{3}}}} \approx 6.28,\) and one more gives us a \(96\)-sided shape with perimeter
\( 96 \sqrt{2 - \sqrt{2 + \sqrt{2 + \sqrt{2 + \sqrt{3}}}}} \approx 6.28\) again. So by the Method of Exhaustion, it’s reasonable to claim the circumference is about \(6.28\).</p>
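<p>These doublings follow a simple pattern: if \(s\) is the side length of a regular polygon inscribed in a unit circle, the half-angle formula says the polygon with twice as many sides has side length \(\sqrt{2 - \sqrt{4 - s^2}}\). A few lines of Python (a sketch of the procedure, not anything Archimedes had) carry out the whole computation:</p>

```python
import math

# Archimedes' doubling for a unit circle: if s is the side length of an
# inscribed regular n-gon, the half-angle identity gives the 2n-gon a side
# of sqrt(2 - sqrt(4 - s**2)). Starting from the hexagon (n = 6, s = 1)
# this reproduces the nested radicals above.
n, s = 6, 1.0
perimeters = {}
while n <= 96:
    perimeters[n] = n * s
    print(f"{n:2d}-gon perimeter: {n * s:.5f}")
    s = math.sqrt(2 - math.sqrt(4 - s * s))
    n *= 2
print(f"circumference 2*pi: {2 * math.pi:.5f}")
```

<p>Each doubling produces a slightly larger perimeter, and by the \(96\)-gon we’re within about \(0.001\) of the true circumference.</p>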
<p>In fact, this entire argument scales up with the radius. So if a circle has radius \(r\), then the circumference is \(C \approx 6.28 r\); and from our earlier argument, the area is \( A = \frac{1}{2} Cr \approx 3.14 r^2\). The Greeks took this number \(3.14\)<strong title="Why 3.14 and not 6.28? The Greeks were more interested in the _diameter_ of the circle than the radius, and so they thought the interesting formula was C = πd, rather than C = 2πr. Modern mathematicians generally see the radius as more fundamental, so we phrase all our formulas in terms of the radius; this means that a lot of our formulas contain the term 2 π. There's a [movement] to stop using π and instead use the Greek letter τ (tau) as the fundamental constant τ = C/r = 2π. But it's hard to change notation, so we slog on using π. "><sup id="fnref:6"><a href="#fn:6" class="footnote">6</a></sup></strong> and called it \(\pi\), the first letter of the Greek word <em>περίμετρος</em> (perimetros), which means “perimeter” or “circumference”. And thus we finally have the formulas you know from school:</p>
<p class="theorem">\[
\begin{aligned}
C & = 2 \pi r \\\<br />
A & = \pi r^2.
\end{aligned}
\]</p>
<h2 id="getting-real">Getting real</h2>
<p>This argument produced a number, which we said is about \(3.14\). But what exactly do we mean when we write down the number \(\pi\)?</p>
<h3 id="limitless-power">Limitless power</h3>
<p>We described \(\pi\) by approximating it. It’s the number that’s close to \(6\), and closer to \(12 \sqrt{2-\sqrt{3}}\), and even closer to \(24 \sqrt{2 - \sqrt{2+\sqrt{3}}}\), and even closer to…</p>
<p>The Greeks called this the Method of Exhaustion, but in modern language we call it a <em>limit</em>. In calculus, we give a definition for limit something like this:<strong title="We can give a more precise definition using the Greek letter ε, which is infamously confusing to calculus students. It's really just a more precise way of saying the same thing. We say that L is the limit of (a_n) if, for every ε>0, there is a natural number N such that if n>N then |a_n-L| < ε. See if you can see why this means the same thing as the less formal version I wrote in the main text."><sup id="fnref:7"><a href="#fn:7" class="footnote">7</a></sup></strong></p>
<p class="theorem"><strong>Definition:</strong> If we have an infinite list of numbers \(a_1, a_2, \dots, a_n, \dots\), and another number \(L\), we say that \(L\) is the limit of the sequence \( (a_n) \) if we can approximate \(L\) as precisely as we want by choosing a large enough \(n\). We notate this by writing \(\lim_{n \to \infty} a_n = L.\)</p>
<p>Less formally, the number \(L\) is the limit of a sequence of numbers if the numbers eventually get really close to \(L\). The idea is that the numbers \(a_1, a_2, a_3, \dots \) are each approximations of \(L\), and as we go further into the list, they approximate it better and better—which is exactly what we did when we estimated \(\pi\) earlier.</p>
<p>Except there’s a problem here. If we know \(L\) is a number, this is all fine. It’s not too hard to convince yourself, say, that the sequence
\((
1, 1/2, 1/4, 1/8, 1/16, \dots
)\)
is getting close to zero, or that \(1/n\) is a good approximation of zero for large values of \(n\).</p>
<p class="center blog-short-image"><img src="/assets/blog/fictional-history-of-numbers/1-over-n.png" alt="A graph of the sequence 1/n" /></p>
<p><em class="center blog-short-image">We can see that the points with heights \(1/n\) are getting closer to the red line at height \(0\). The further we get into the sequence, the better an approximation this is.</em></p>
<p>But on the other hand, if we have a list like \((1, 2, 3, 4, \dots)\), or \((-1, 1, -1, 1, \dots)\), it doesn’t look like those are <em>approximating</em> any number.</p>
<p class="center blog-short-image"><img src="/assets/blog/fictional-history-of-numbers/sequences.png" alt="Graphs of the sequences (n) and (-1)^n" /></p>
<p><em class="center blog-short-image">On the left, the sequence \((1, 2, 3, 4, \dots)\) is getting bigger and bigger without approaching any particular number. On the right, the sequence \((-1, 1, -1, 1, \dots )\) is bouncing back and forth between two values, rather than approximating one specific number.</em></p>
<p>Not every sequence has a limit, because not every sequence is approximating one particular number. So how do we know our sequence
\[
\left( 3, \quad 6 \sqrt{2-\sqrt{3}},
\quad 12 \sqrt{2-\sqrt{2+\sqrt{3}}},
\quad 24 \sqrt{2-\sqrt{2+\sqrt{2+\sqrt{3}}}},
\quad \dots \right)
\]
<em>does</em> approach a number?</p>
<p>Unfortunately, it kind of doesn’t.</p>
<h3 id="failing-at-algebra">Failing at algebra</h3>
<p>For a long time, people looked for a way to represent \(\pi\) as a rational number—as a ratio of two integers. We found that \(22/7\) is a pretty good approximation, and \(355/113\) is a shockingly good approximation (correct to six decimal places). But in 1761, Johann Heinrich Lambert proved that \(\pi\) isn’t a rational number.</p>
<p>Now, we do have other, “irrational” numbers. In <a href="/blog/fictional-history-of-numbers-part-1">part 1</a> we talked about algebraic numbers, which are solutions to polynomial equations \(a_0 + a_1 x + \dots + a_n x^n =0\). We used this technique to construct lots of irrational numbers, like square roots, cube roots, and the indescribable solutions to \(x^5+x+3=0\).</p>
<p>But \(\pi\) isn’t one of those, either. In 1882, the German mathematician Ferdinand von Lindemann showed that \(\pi\) is a <a href="https://en.wikipedia.org/wiki/Transcendental_number">transcendental number</a>, which means it isn’t the solution to <em>any</em> polynomial equation with rational coefficients. We just can’t describe it with any of the tools we saw in Part 1.</p>
<p>It’s quite difficult to show that \(\pi\) is transcendental, and I’m not going to try to prove it here. The most common proof relies on the fact that the number \(e\) is transcendental, and even that isn’t easy to prove. But we do know \(\pi\) isn’t an algebraic number—so what is it?</p>
<h3 id="mind-the-gaps">Mind the gaps</h3>
<p>The details are different, but we’re really in the same boat we found ourselves in last time. In part 1, we wanted a solution to the equation \(x^2-2=0\), but we couldn’t find a number that worked, so we just made one up. We can do the same thing here. When a sequence looks like it should have a limit, we’ll make one up for it.</p>
<p>We need to be careful, though, because lots of sequences <em>don’t</em> look like they’re converging anywhere, and those <em>shouldn’t</em> have limits.</p>
<p class="center blog-short-image"><img src="/assets/blog/fictional-history-of-numbers/3-sequences.png" alt="Graphs of the sequences n, (-1)^n, and 1/n." /></p>
<p>Some sequences, like the first one, go off to infinity; others, like the second, bounce around between multiple different targets. But in some sequences, like the third, all the numbers eventually get very close together. We call those <a href="https://en.wikipedia.org/wiki/Cauchy_sequence">“Cauchy” sequences</a>,<strong title="Pronounced “coh-shee”. They're named after the 19th-century French mathematician [Augustin-Louis Cauchy], who helped formalize this approach to limits and the real numbers."><sup id="fnref:8"><a href="#fn:8" class="footnote">8</a></sup></strong> and we want to have limits for all of them.</p>
<p>The definition of a Cauchy sequence may seem very similar to the definition of a limit, but it’s not quite the same. A sequence has a limit if the terms all get close to some fixed number; it’s Cauchy if the terms all get close to <em>each other</em>. In a Cauchy sequence, it seems like there <em>should</em> be some number the terms are getting close to, but in sets like the rational numbers, that may not be true. The rationals have “holes” that the terms of the sequence can gather around, but that don’t correspond to any rational number.</p>
<p>The most famous example is probably \(\sqrt{2}\). We saw last time that \(\sqrt{2}\) is irrational: there are no integers \(p\) and \(q\) such that \( \left( \frac{p}{q} \right)^2 = 2\). But we can find a rational number \(a_1\) so that \(1.9 < (a_1)^2 < 2\), and then a second, \(a_2\), with \(1.99 < (a_2)^2 <2\), and a third, \(a_3\), with \(1.999 < (a_3)^2 <2\); and if we keep doing this, we get a sequence of numbers that clearly “wants to” converge to \(\sqrt{2}\).<strong title="But you can't make the same argument for i, the square root of -1; this will be important next time."><sup id="fnref:9"><a href="#fn:9" class="footnote">9</a></sup></strong> And that shouldn’t cause us too much distress. Even though \(\sqrt{2}\) is irrational, it’s an algebraic number, so we already created it; we don’t need to make up anything new.</p>
<p>But another hole in the rationals is \(\pi\). We built a Cauchy sequence of algebraic numbers that wants to converge to \(\pi\):
\[
\left( 3, \quad 6 \sqrt{2-\sqrt{3}}, \quad 12 \sqrt{2-\sqrt{2+\sqrt{3}}}, \quad 24 \sqrt{2-\sqrt{2+\sqrt{2+\sqrt{3}}}}, \quad \dots \right)
\]</p>
<p class="center blog-short-image"><img src="/assets/blog/fictional-history-of-numbers/pi-sequence.png" alt="A graph of a sequence approaching π" /></p>
<p>With a little more effort we could build a sequence of rational numbers that does the same thing. (For instance, as we’ll see later, \((3, 3.1, 3.14, 3.141, \dots )\) would work.) But \(\pi\) isn’t an algebraic number like \(\sqrt{2}\). From the algebraic perspective of part 1, it doesn’t exist.</p>
<p>But it should exist. So we’ll invent it.</p>
<h2 id="you-complete-me">You complete me</h2>
<p>If we invent all the numbers like this that should be the limits of sequences, we get the <em>real numbers</em>, which we represent with the symbol \(\mathbb{R}\). And the real numbers are <em>complete</em>, which means that every Cauchy sequence—every sequence which ought to converge—does in fact converge.</p>
<p>From this perspective, we can say that a real number is just a Cauchy sequence. But that’s not a great way to talk about them, for two reasons. The first is just that it’s awkward. I don’t want to go around talking about the number
\[
“\lim_{n \to \infty} \left( 3, \quad 6 \sqrt{2-\sqrt{3}}, \quad 12 \sqrt{2-\sqrt{2+\sqrt{3}}}, \quad 24 \sqrt{2-\sqrt{2+\sqrt{2+\sqrt{3}}}}, \quad \dots \right)”,
\]
and neither do you.</p>
<p>The second problem is that more than one sequence can converge to the same limit. \( (1, 1/2, 1/3, 1/4, \dots ) \) has the same limit as \((0,0,0,0, \dots ) \) or \( (1, 1/2, 1/4, 1/8, \dots )\); we really don’t want to treat them as different real numbers. We can fix this problem by defining real numbers to be “equivalence classes of Cauchy sequences of rational numbers”, but that gets <em>extremely</em> cumbersome.</p>
<p>The official method for constructing the reals is something called <a href="https://en.wikipedia.org/wiki/Dedekind_cut">Dedekind cuts</a>, where a real number is a way of cutting the rational numbers in half. So for example, we identify \(\sqrt[3]{2}\) with the set of all the rational numbers \(x\) with \(x^3 < 2\). This has the advantage that it’s really easy to use in proofs; it has the disadvantage that it’s even more cumbersome to work with than the Cauchy sequences description.</p>
<p>But there’s a much easier approach. And it’s something we all learn in high school.</p>
<h3 id="decimalization">Decimalization</h3>
<p>In high school algebra, I learned that a real number is an infinite decimal.<strong title="Yes, even whole numbers are infinite decimals. We'll get there."><sup id="fnref:10"><a href="#fn:10" class="footnote">10</a></sup></strong> Where does this idea come from?</p>
<p>We said that the real numbers are complete, which means every Cauchy sequence converges. But they’re also <em>ordered</em>: if we have two distinct real numbers, one will always be greater than the other. And that gives us another way to characterize completeness:</p>
<p class="theorem"><strong>Monotone Convergence Theorem:</strong> if a sequence of real numbers is increasing and bounded above, then it converges.</p>
<p>The idea here is that if a sequence is always increasing, it can’t really bounce around. So there are only two options: either it goes to infinity, or it converges to some real number. And this is basically how we <em>actually</em> got \(\pi\), right? Each polygon had a bigger perimeter than the last one, but the perimeter would never get bigger than, say, \(8\). We had an increasing sequence with an upper bound, so it had a limit.</p>
<p>Now a finite decimal is just a rational number. We can interpret a finite decimal \(3.14\) as something like \( \frac{314}{100},\) and similarly \(1.414 = \frac{1414}{1000}.\) But we can’t do the same thing with an infinite decimal; we’d have to have an infinitely large numerator and an infinitely large denominator.</p>
<p>Instead, we interpret an infinite decimal as a <em>sequence</em>. When we write that \(\pi = 3.14159 \dots,\) we mean that \(3\) is a rough approximation, and \(3.1\) is a better approximation, and \(3.14\) is even better; thus \(\pi\) is the limit of the sequence \((3,3.1, 3.14, 3.141, 3.1415, \dots). \)</p>
<p>Every infinite decimal is an increasing sequence, and every infinite decimal is bounded above: whatever we can say about a number like \(1.41421\dots\), we know it can’t be bigger than \(2\). So every infinite decimal corresponds to a real number.</p>
<p>And just as importantly, every real number corresponds to an infinite decimal! If we have a real number \(x\), we can find the biggest number with one decimal place that’s smaller than \(x\). Then we can find the biggest number with two decimal places, and the biggest with three, and the biggest with four… and this gives an infinite decimal that converges to \(x\).</p>
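<p>This digit-by-digit construction is easy to sketch in code. Here is a small Python illustration (my own, using floating-point \(\sqrt{2}\) as the target real number):</p>

```python
import math

def truncations(x, places):
    """The largest k-place decimal that is <= x, for k = 0, 1, ..., places."""
    return [math.floor(x * 10**k) / 10**k for k in range(places + 1)]

approx = truncations(math.sqrt(2), 5)
# approx == [1.0, 1.4, 1.41, 1.414, 1.4142, 1.41421]
```

<p>The sequence is increasing and bounded above (by \(2\), say), so it converges; and by construction its limit is the number \(x\) we started with.</p>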
<h3 id="maths-greatest-flame-war">Math’s greatest flame war</h3>
<p>This construction generally does what we expect it to, but there’s one very special case where it doesn’t. We know \(1\) is a natural number, and thus a rational number, and thus a real number. So how do we write it as an infinite decimal?</p>
<p>The largest number with one decimal place that’s less than \(1\) is \(0.9\). With two decimal places, we get \(0.99\). With three we get \(0.999\). So by this construction, the infinite decimal representation of \(1\) is in fact \(0.999 \dots .\)</p>
<p>You may have run across this claim, that \(0.999 \dots~= 1\), before; and it almost always triggers a great deal of resistance. Surely it must be smaller than one; the leading digit is a zero!</p>
<p>You’ll sometimes see simple algebraic proofs like this:
\[
\begin{aligned}
10 \cdot 0.999 \dots & = 9.999 \dots \\
9 \cdot 0.999 \dots & = (9.999 \dots) - (0.999 \dots) \\
9 \cdot 0.999 \dots & = 9 \\
0.999 \dots & = 9/9 = 1.
\end{aligned}
\]
But a lot of people <a href="https://en.wikipedia.org/wiki/0.999...#Skepticism_in_education">find that unsatisfying and unconvincing</a>.</p>
<p>In fact that argument is a little glib, and glosses over some fairly sophisticated ideas—which we just worked through.<strong title="A similar approach can also be used to &quot;prove&quot; that 1+2+4+8+... = -1, which is obviously not what we mean."><sup id="fnref:11"><a href="#fn:11" class="footnote">11</a></sup></strong> An infinite decimal is asking for a <em>limit</em>, which isn’t how people generally think of numbers. But it’s certainly true that \(1\) is approximated by \(0.9\), and approximated even better by \(0.99\), and even better by \(0.999\); and that we can make that approximation as good as we want by adding more \(9\)s to the decimal.</p>
<p>And that’s all that \(0.999\dots~ = 1\) actually means. The sentence seems weird, because <em>real numbers are weird</em>. They seem innocuous, but a single real number is secretly an infinite collection of infinite series. And if we look too closely, the weirdness starts leaking out.</p>
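<p>If you’d like to see this concretely, here is a short Python sketch (an illustration of the limit claim, using exact rational arithmetic): the partial sums \(0.9, 0.99, 0.999, \dots\) fall short of \(1\) by exactly \(1/10^n\), a gap we can make as small as we please.</p>

```python
from fractions import Fraction

def partial_sum(n):
    """The finite decimal 0.99...9 with n nines, as an exact rational."""
    return sum(Fraction(9, 10**k) for k in range(1, n + 1))

gaps = [1 - partial_sum(n) for n in range(1, 6)]
# each gap is exactly Fraction(1, 10**n): 1/10, 1/100, 1/1000, ...
```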
<h2 id="was-this-really-necessary">Was this really necessary?</h2>
<p>We started off with a fairly innocuous question: what is the area of a circle? And the answer turned out to be…quite a bit more complicated than we might have expected. And it gets worse! For instance, while there are infinitely many rational numbers, we can show that \(100\%\) of real numbers are irrational—and in fact \(100\%\) of them are, in a very precise sense, impossible to describe.</p>
<p>The real numbers are so weird and complicated that you might be wondering if we really need to do all of this. Sure, \(\pi\) is important, but can’t we just treat that as a one-off idiosyncrasy, and avoid all this nonsense about Cauchy sequences and Dedekind cuts? Unfortunately, we can’t. Sure, real numbers are <a href="https://twitter.com/Meaningness/status/1419369022559166464">extremely weird eldritch horrors</a>; but they’re also <em>exactly</em> the tool we need to do calculus.</p>
<p>There’s more to say about both of these ideas: why are the real numbers weird, and why are they so useful? So <a href="/blog/fictional-history-of-numbers-part-3">next time</a> we’ll learn more about just how strange the real numbers are, and see why they are, nonetheless, perfectly suited to solve a whole host of very important problems.</p>
<hr />
<p><em>Have questions? Can’t wait for <a href="/blog/fictional-history-of-numbers-part-3">part 3</a>? Want to share your favorite weird numbers with me? Tweet me <a href="https://twitter.com/profjaydaigle">@ProfJayDaigle</a> or leave a comment below.</em></p>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>No, not because everyone was exhausted by this point in the lesson. <a href="#fnref:1" class="reversefootnote">↵Return to Post</a></p>
</li>
<li id="fn:2">
<p>A word I’m pretty sure I’d never heard before I just looked it up. <a href="#fnref:2" class="reversefootnote">↵Return to Post</a></p>
</li>
<li id="fn:3">
<p>The paper <a href="https://arxiv.org/abs/2008.07995">How Archimedes showed that \(\pi\) is approximately equal to 22/7</a> by Damini D. B. and Abhishek Dhar was extremely helpful to me in putting this section together. <a href="#fnref:3" class="reversefootnote">↵Return to Post</a></p>
</li>
<li id="fn:4">
<p>Just last week I told a student I had no memory of how to prove this. But the simple argument is precisely that we’re cutting an equilateral triangle in half—the half-triangle has an angle of thirty degrees and a side that has half the length of the hypotenuse. <a href="#fnref:4" class="reversefootnote">↵Return to Post</a></p>
</li>
<li id="fn:5">
<p>Which I have to look up every time I want to use it. <a href="#fnref:5" class="reversefootnote">↵Return to Post</a></p>
</li>
<li id="fn:6">
<p>Why 3.14 and not 6.28? The Greeks were more interested in the <em>diameter</em> of the circle than the radius, and so they thought the interesting formula was \(C = \pi d\), rather than \(C = 2 \pi r\).</p>
<p>Modern mathematicians generally see the radius as more fundamental, so we phrase all our formulas in terms of the radius; this means that a lot of our formulas contain the term \(2 \pi\). There’s a <a href="https://tauday.com">movement</a> to stop using \(\pi\) and instead use the Greek letter \(\tau\) (tau) as the fundamental constant \(\tau = C/r = 2 \pi\). But it’s hard to change notation, so we slog on using \(\pi.\) <a href="#fnref:6" class="reversefootnote">↵Return to Post</a></p>
</li>
<li id="fn:7">
<p>We can give a more precise definition using the Greek letter \(\varepsilon\), which is infamously confusing to calculus students. It’s really just a more precise way of saying the same thing.</p>
<p>We say that \(L\) is the limit of \( (a_n) \) if, for every \(\varepsilon >0\), there is a natural number \(N \) such that if \(n > N \) then \( \mid a_n -L \mid < \varepsilon\).</p>
<p>See if you can see why this means the same thing as the less formal version I wrote in the main text. <a href="#fnref:7" class="reversefootnote">↵Return to Post</a></p>
</li>
<li id="fn:8">
<p>Pronounced “coh-shee”. They’re named after the 19th-century French mathematician <a href="https://en.wikipedia.org/wiki/Augustin-Louis_Cauchy">Augustin-Louis Cauchy</a>, who helped formalize this approach to limits and the real numbers. <a href="#fnref:8" class="reversefootnote">↵Return to Post</a></p>
</li>
<li id="fn:9">
<p>But you can’t make the same argument for \(i\), the square root of \(-1\); this will be important next time. <a href="#fnref:9" class="reversefootnote">↵Return to Post</a></p>
</li>
<li id="fn:10">
<p>Yes, even whole numbers are infinite decimals. We’ll get there. <a href="#fnref:10" class="reversefootnote">↵Return to Post</a></p>
</li>
<li id="fn:11">
<p>A similar approach can also be used to “prove” that \(1+2+4+8+ \dots~ = -1\), which is obviously not what we mean. <a href="#fnref:11" class="reversefootnote">↵Return to Post</a></p>
</li>
</ol>
</div>Jay DaigleThis week we continue our exploration of what numbers are, and where mathematicians keep finding weird ones. We start by asking for the area of a circle, get exhausted by Archimedes's method for finding the answer, and take a tour through the idea of limits to construct the complete field of real numbers. We resolve one of the oldest mathematical flame war topics on the internet, and finish by worrying the real numbers are just too weird to actually use.A Fictional History of Numbers, Part 1: Counting, Fractions, and Algebra2023-04-03T00:00:00-07:002023-04-03T00:00:00-07:00https://jaydaigle.net/blog/fictional-history-of-numbers-part-1<p>In graduate school I studied number theory. The joke goes that number theory is the kind of math where you never use anything recognizable as an actual number. And it’s true that advanced mathematics uses a wild variety of strange number-like things—complex numbers, quaternions, octonions, \(p\)-adic numbers, Witt vectors, surreal numbers, and worse.</p>
<p><img src="/assets/thesis_diagram2.png" alt="A big complicated multicolored commutative diagram" class="center blog-image" /></p>
<p class="center blog-image"><em>This diagram summarizes my <a href="/assets/papers/daigle_thesis.pdf">Ph.D. thesis</a>. Every node represents a specific type of weird number. My mother said it looked like I was trying to summon Cthulhu.</em></p>
<p>And these things are all strange, and hard to define. But they’re not crazy, and they’re not random. Each of these weird number systems was invented to solve specific problems or answer specific questions. In this series of posts, I want to give you a sense of where these constructions come from, and how you can start out asking reasonable-sounding questions and wind up in the diagram above.</p>
<p>This isn’t a history lesson; I’m not going to tell you who first described these things, or how we actually started talking about them. But in the spirit of Timothy Chow’s <a href="https://www.ams.org/notices/200601/fea-chow.pdf">You Could Have Invented Spectral Sequences</a>, I want to convince you that you could have discovered all sorts of exotic number systems by asking reasonable questions. So we’ll walk through what the philosopher of mathematics Imre Lakatos would have called a <a href="https://en.wikipedia.org/wiki/Imre_Lakatos#Rational_reconstructions_of_the_history_of_science">rational reconstruction of the history of numbers</a>—not what actually happened, but a cleaned-up fictional version that <em>could</em> have happened.</p>
<p>Today I’ll discuss the most sensible collections of numbers. We’ll start with the basic ability to count, and we’ll build up through the algebraic numbers, which let us solve all sorts of reasonable equations. In future parts we’ll tackle the tricky problem of completeness (<a href="/blog/fictional-history-of-numbers-part-2/">part 2</a>), which gives rise to the real and complex numbers and the \(p\)-adics, move into higher dimensions with the quaternions and octonions, and maybe look at some genuinely exotic ideas.</p>
<p>And as always, if you like my writing and want to see more of this project, I have a <a href="https://ko-fi.com/jaydaigle">Ko-Fi account</a>. Any tips would be appreciated and would help me write more essays like this. Let me know what you’d like to hear about!</p>
<h2 id="natural-numbers-they-really-count">Natural numbers: they really count</h2>
<p>The first numbers we can talk about are the numbers we use to count things: \(\{ 1, 2, 3, \dots \}\). We call these the <em>natural numbers</em>,<strong title="If you're about to bring up zero, please bear with me for a few paragraphs."><sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup></strong> abbreviate them with the symbol \(\mathbb{N}\), and mostly we all understand them pretty well. I have five apples right now; I know because I counted them. If I eat one I’ll have four apples. We can add natural numbers: if I start with five apples and add three apples then I will have eight. And we can multiply them: if I get four groups of five apples I will have twenty apples.</p>
<p>We can give a “formal definition” in set theory, where \(1\) is identified with the one-element set \(\{\varnothing\} \), and \(2\) is identified with the two-element set \( \{ \varnothing, \{\varnothing\} \} \), and so on. But this is really a way of understanding set theory, not understanding the natural numbers. For almost any reasonable use, the best definition of natural numbers is “oh, natural numbers are those things you count with”. Natural numbers come before math.</p>
<p>But even the natural numbers are deeper and conceptually richer than they
seem. I have five apples, but that doesn’t mean they’re all the same!<strong title="Four of them are sugar bee apples I bought over the weekend, and one is a gala apple that I bought months ago and forgot about in the back of the fridge until I went to go count my apples while writing this post."><sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup></strong> If we say we have “five apples”, rather than “a big apple, three medium apples, and a small apple” or “two Fujis, two Granny Smiths, and a Red Delicious” or even “an apple here and two apples there and two apples in between” we’re <em>abstracting</em>, waving away the differences so that we can describe them all with the number \(5\).<strong title="We do the same thing when we say &quot;apple&quot; rather than writing out the full genome, or describing the position of every molecule, or whatever. We’re constantly abstracting without really thinking about it; doing math just makes this process of abstraction more explicit."><sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup></strong></p>
<p>Even ordinary counting can raise philosophical questions. How many objects are in this picture?</p>
<p><img src="/assets/blog/fictional-history-of-numbers/pizza.jpg" alt="" class="center blog-image" /></p>
<p class="center blog-image"><em>Dale Cruse from San Francisco, CA, USA, <a href="https://creativecommons.org/licenses/by/2.0">CC BY 2.0</a>, via <a href="https://commons.wikimedia.org/wiki/File:Pizza_(29921412048).jpg">Wikimedia Commons</a></em></p>
<p>You could say there’s one pizza. Or you could say there are ten slices of pizza. You could say there are eight slices, since one of the cuts isn’t very clean and a couple of the pieces seem stuck together. Or you could even start counting individual slices of onion. You have to decide what counts as “one object” before you can count things.</p>
<p>And there are plenty of other ways this abstraction can break down. If we combine two apples and one pear, we get three fruits, and the numbers work out but the noun changes. If we combine one heap of sand with another heap of sand, we get one heap of sand, so is \(1+1 = 1\)? No, the natural numbers just don’t model “adding heaps of sand” very well. And if we combine four atoms of iron with three molecules of oxygen, we somehow wind up with two molecules of rust: we add four things to three other things and end with two.</p>
<p>But while the natural numbers are in fact a simplified model, and an abstraction, they’re a pretty robust and, well, <em>natural</em> one. We use them a lot, we understand them from a young age, and they work well to describe a lot of phenomena.</p>
<h3 id="zero-much-ado-about-nothing">Zero: much ado about nothing</h3>
<p>There is <a href="https://math.stackexchange.com/questions/283/is-0-a-natural-number">infamous controversy</a> about whether zero counts as a natural number. This is mostly a terminological dispute, and in practice I duck the issue entirely. When it matters, I either include zero by writing \(\mathbb{Z}_{\geq 0} \), or exclude it by writing \(\mathbb{Z}_{> 0}\) , rather than writing \(\mathbb{N}\) and confusing half my readers.</p>
<p>But for the purposes of this fake history, we should treat zero as the conceptual innovation it is. While the idea of having none of something is very simple, treating zero as a <em>number</em> is a much more sophisticated and abstract idea. Sure, there are zero oranges in my apartment right now, along with zero kumquats, zero elephants, and zero large expensive gemstones<strong title="If you'd like to help fix that last one for me, you can donate to my [Ko-Fi]."><sup id="fnref:4"><a href="#fn:4" class="footnote">4</a></sup></strong>; if we’re listing off things I have zero of, we’ll be here a long time.</p>
<p>Treating zero as a number requires either that I be willing to “count” all those things that don’t exist, or that I give real conceptual heft to an abstract calculational tool. And while “give conceptual heft to an abstract calculational tool” might as well be the official motto of modern mathematical thought, it took a long time for people to accept it.</p>
<p><a href="https://twitter.com/benorlin/status/1450137479911194628"><img src="/assets/blog/fictional-history-of-numbers/ben_orlin_zero.jpeg" alt="&quot;Don’t laugh, but I was thinking....what if ‘nothing’ is actually a number? We could call it ‘zero.’&quot; Pause, followed by laughter. &quot;Ah yes. Just like the number ‘silence,’ and the number ‘lack of smell.’&quot; &quot;Hey, you’ve had ‘a number’ of good ideas today!&quot;" class="center blog-image" /></a></p>
<h2 id="integers-whats-the-takeaway">Integers: what’s the takeaway?</h2>
<p>You might be surprised that I talked about addition and multiplication, and skipped subtraction. But subtraction can actually be pretty subtle! It introduces a new idea: we want to <em>undo</em> addition.</p>
<p>This undoing can be simple. If I want to have five apples and I already have three, I need to add two more apples; and we write that \(5-3 = 2\). If I want to have ten apples and I have four, I need \(10-4 = 6\) more apples. Algebraically we can look at this as solving equations; if I want to solve \(3+x = 5\) I get \(x=2\), and if I want to solve \(4+x = 10\) I get \(x=6\).</p>
<p>But if I want to have four apples and I already have six, I get stuck. There’s no number of apples I can add to my six apples to get four. Algebraically I’m trying to solve \(6 + x = 4\), and there’s no <em>natural</em> number that solves that equation.</p>
<p>I don’t get stuck in the real world, obviously. If I have six apples and want four, I can throw two apples away. (Or eat them.) But I’m stuck mathematically, because I can’t really describe that situation with just the natural numbers. To talk about this, we need something new. We need negative numbers.</p>
<p>There are a few ways we can physically interpret a negative number of apples. Maybe it’s a debt: not only do I have no apples, but I also have to give you two apples tomorrow. Maybe it’s a loss, or a change: I can have two fewer apples today than I did yesterday, even if I can’t have \(-2\) apples right now. And maybe it’s a direction: I can’t move \(-2\) feet any more than I can have \(-2\) apples, but I can certainly move \(2\) feet backwards.</p>
<p>But what <em>is</em> \(-2\)? It’s the solution to \(6+x = 4\). Or more directly, it’s the solution to \(2+x = 0 \). That equation doesn’t have a solution in the natural numbers; but it would be really nice if it did have a solution, so we made one up. And we called it \(-2\).</p>
<p>And if we insist that \(a+x=b\) should have a solution for any natural numbers \(a\) and \(b\), we get the <em>integers</em><strong title="Kronecker famously said that &quot;God created the integers; all else is the work of man&quot;. But I think he was giving God a little bit more credit there than he deserved. The integers are complicated!"><sup id="fnref:5"><a href="#fn:5" class="footnote">5</a></sup></strong> \( \{\dots, -2, -1, 0, 1, 2, \dots\} \), written \(\mathbb{Z}\)<strong title="The Z stands for &quot;Zahlen&quot;, which is German for &quot;numbers&quot;."><sup id="fnref:6"><a href="#fn:6" class="footnote">6</a></sup></strong>. These demand another level of abstraction, and as a result they tend to feel a little less “real” to people. But they’re useful—they let us model loss and debt and motion backwards and a whole bunch of other things, all with the same algebraic tool—so we put up with them.</p>
<h2 id="rational-numbers-lets-think-about-this">Rational numbers: let’s think about this</h2>
<p>But we didn’t just invent some new numbers; we also got a <em>tool</em> for inventing more numbers. With natural numbers we can add, and by undoing addition we got subtraction: we were trying to solve equations like \(a + x =b\). But we can also <em>multiply</em> the natural numbers, so we can try to undo that as well.</p>
<p>In some cases, this just works. With just the natural numbers, I can compute \(6 / 3\)—what should I multiply by \(3\) to get \(6\)—and get \(2\). I can interpret this physically, as we often do in grade school: if I have six dolls and want to divide them among three people, everyone gets two. But I can also interpret it algebraically, as trying to solve the equation
\[
3 \cdot x = 6.
\]
And maybe I muck around, or operate by trial and error, or honestly just have this one memorized, and I see that the number \(2\) solves that equation.</p>
<p>But I can also try to compute \(5 / 3\). There’s certainly nothing stopping me from writing those symbols down! But then I get stuck, just like I did when I wanted \(4-6\). If I try to divide five dolls among three people, either some people get more than others, or I have some left over. If I think algebraically, I look at the equation \(3 \cdot x = 5\), and I can’t find a natural number that solves it. \(1\) is too small, and \(2\) is too big; nothing will work.</p>
<p>But we can make the same move here we did for the integers. We can’t find a number that solves the equation \(3 \cdot x = 5\), so we’ll make one up. And what should we call it? We started out trying to compute \(5/3\), so maybe we should call this number \(“5/3”\). And we can generalize this: for any equation \(a \cdot x = b \), there should be a number \(b/a\) that gives a solution.</p>
<p>We run into two problems here: a minor one and a major one. The minor one is that we have some redundancy here. If a number solves \(3 \cdot x =5\) then it also solves \(6 \cdot x = 10\). So should we call it \(5/3\) or \(10/6\)? The answer, of course, is that we treat them as two different ways of “spelling” the same number.</p>
<p>But the major problem comes from the weirdest integer, \(0\). We wrote down the symbol \(5/3\) and just declared it’s a number; but we can’t do the same thing for \(5/0\). There are a couple ways of seeing the problem. One has to do with sizes: when we wanted to compute \(5 / 3\), we saw that \(1\) was too small, and \(2\) was too big. It feels reasonable to insist there must be something in between. But if we want to compute \(5/0\) we’re trying to solve \(0 \cdot x = 5\). For this, <em>every</em> number will be too small; we can’t overshoot so we can’t look for an in-between number.</p>
<p>But a more robust algebraic argument is that we already know \(0 \cdot x = 0\). The equation \(0 \cdot x =5 \) <em>can’t</em> have a solution, unless we want to say that \(5=0\)<strong title="We can do that, actually—at least sort of. Stay tuned for the rest of this series."><sup id="fnref:7"><a href="#fn:7" class="footnote">7</a></sup></strong>. And \(0 \cdot x\) has to equal \(0\), unless we want to make all our operations stop making sense. As long as we accept that</p>
<ul>
<li>\(0 + 0 = 0\); and</li>
<li>Multiplication <em>distributes</em> across addition, following the rule that \( (a+b) c = ac + bc\);</li>
</ul>
<p>then we can compute that
\[
\begin{aligned}
0 \cdot x & = (0+0) \cdot x \\
0 \cdot x &= 0 \cdot x + 0 \cdot x \\
0 \cdot x - 0 \cdot x & = 0 \cdot x \\
0 & = 0 \cdot x.
\end{aligned}
\]</p>
<p>So we can’t divide by zero. It doesn’t work; it causes way more problems than it solves. But as long as we avoid zero we’re safe, and we can define the <em>rational numbers</em> \( \mathbb{Q} \)<strong title="The Q stands for &quot;quotient&quot;."><sup id="fnref:8"><a href="#fn:8" class="footnote">8</a></sup></strong> to be the set of all numbers \( p/q\) where \(p\) and \(q\) are both integers, with \(q \neq 0\).</p>
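<p>As an aside for programmers: Python’s built-in <code>Fraction</code> type models exactly this construction. Different “spellings” of the same rational compare equal, and a zero denominator is refused outright. (A quick illustration, not part of the argument above.)</p>

```python
from fractions import Fraction

a = Fraction(5, 3)
b = Fraction(10, 6)        # automatically reduced to lowest terms
print(a == b, b)           # True 5/3

# There is no rational solving 0 * x = 5, so a zero denominator is an error:
try:
    Fraction(5, 0)
except ZeroDivisionError:
    print("5/0 is not a rational number")
```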
<p>To an algebraist like myself, rational numbers are the next step after integers. But to be clear, this is wildly ahistorical. (I did call this a <em>fictional</em> history of numbers, after all.) Zero and negative numbers were both relatively late inventions. But the (positive) rational numbers came much earlier, because they’re much less philosophically confusing. I may not know what \(-2\) apples looks like, but \(1/2\) of an apple is what I get when I cut it into two pieces.</p>
<h2 id="square-roots-totally-radical">Square roots: totally radical</h2>
<p>Now we have a <em>strategy</em>: we write down an equation we want to solve, and then if it doesn’t already have a solution we invent one and make up a name for it. And as good mathematicians, we want to see how far this strategy can take us. What other equations do we want to solve?</p>
<p>The simplest equations are linear equations: \(ax +b =0\). Those are the same as \(ax = -b\), which we looked at last section; we can solve them all with rational numbers (as long as \(a \neq 0\)).</p>
<p>The next simplest equations are quadratic equations, the bane of every high schooler’s existence. These equations look like</p>
<p>\[
ax^2 + bx +c =0.
\]</p>
<p>And if you’ve brushed up on your high school algebra lately, you may remember they generally have two solutions.</p>
<p>Sometimes these solutions are simple. If we take the equation \(x^2 -4 = 0\), then that’s the same as \(x^2=4\), and thus either \(x=2\) or \(x=-2\). And sometimes we can find these solutions by factoring. For instance, we have</p>
<p>\[x^2 -3x+2 = (x-1)(x-2),
\]</p>
<p>so if \(x^2-3x+2 = 0\) then either \(x=1\) or \(x=2\).</p>
<p>But let’s look at the equation \(x^2-2=0\). It’s a pretty innocent-looking equation, as quadratics go. And if we graph \(x^2-2\), it sure looks like that equation should have two solutions:</p>
<p><img src="/assets/blog/fictional-history-of-numbers/sqrt2.png" alt="A graph of x^2-2, showing two clear x-intercepts" class="center blog-image" /></p>
<p>The graph crosses the \(x\)-axis in two places, which should represent spots where \(x^2-2=0\). But it’s a famous theorem dating back to the Pythagoreans in classical Greece that there is no rational number that satisfies this equation.</p>
<details class="theorem">
<summary>
Theorem: There is no rational number whose square is 2
</summary>
<div>
<p><strong>Proof:</strong></p>
<p>Suppose \( (p/q)^2=2\), where \(p/q\) is a fraction in lowest terms. Then \(p^2 = 2 q^2\). Since \(2\) divides \(p^2\), we know that \(p\) must be an even number, so we can set \(r = p/2\) where \(r\) is an integer.</p>
<p>Then we have \(4 r^2 = 2 q^2\) and thus \(2 r^2 = q^2\). Since \(2\) divides \(q^2\), we know that \(q\) must be an even number. But then \(p\) and \(q\) are both even, so \(p/q\) isn’t a fraction in lowest terms.</p>
<p>That means we can’t ever write the square root of \(2\) as a ratio of integers in lowest terms—and that means we can’t write it as a ratio of integers at all.</p>
</div>
</details>
<p>An equation with no solution? This is a very promising nail to hit with our hammer. Let’s invent a solution to this equation, which we’ll call the square root of \(2\), and write as \(\sqrt{2}\). In fact, let’s invent two of them. We know that quadratic equations are supposed to have two solutions; and looking at the graph, it sure seems like there’s a positive solution and a negative one. So we’ll call one of them \(\sqrt{2}\), and the other one \(-\sqrt{2}\). And <em>now</em> we can factor our polynomial:</p>
<p>\[
x^2-2 = (x - \sqrt{2}) (x+\sqrt{2}).
\]</p>
<p>In fact, we can do this for any rational number! If we have the equation \(x^2 - p/q = 0\), there might already be two rational numbers that solve it. But if not, we can define two new numbers, which we’ll call \(\pm \sqrt{p/q}\), to solve the equation.</p>
<p>What about other quadratic equations? We don’t actually need anything more. If you’re really caught up on your high school algebra, you may remember the quadratic formula. (If you’re not really caught up on your high school algebra, you may have vague traumatic flashbacks when you hear the phrase “quadratic formula”.) This formula says that the equation \(ax^2 +bx +c = 0\) has two solutions, given by
\[
x = \frac{-b \pm \sqrt{b^2-4ac}}{2a}.
\]
The formula uses those new square roots we just defined, but it doesn’t need anything else. If you can solve \(x^2 - p/q=0\) for any rational \(p/q\), then you can solve \(ax^2+bx+c=0\) for any rational \(a,b,c\).</p>
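<p>Here’s the formula as a few lines of Python (my sketch; using <code>cmath</code> means a negative discriminant quietly produces the square roots of negative numbers that the next section discusses):</p>

```python
import cmath

def quadratic_roots(a, b, c):
    """Both solutions of a*x**2 + b*x + c = 0 (requires a != 0)."""
    d = cmath.sqrt(b * b - 4 * a * c)   # works even if the discriminant < 0
    return (-b + d) / (2 * a), (-b - d) / (2 * a)

print(quadratic_roots(1, -3, 2))   # roots of x^2 - 3x + 2: 2 and 1
print(quadratic_roots(1, 0, -2))   # roots of x^2 - 2: plus and minus sqrt(2)
print(quadratic_roots(1, 0, 2))    # roots of x^2 + 2: plus and minus i*sqrt(2)
```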
<h3 id="complex-numbers-use-your-imagination">Complex numbers: use your imagination</h3>
<p>I said that <em>every</em> rational number has a square root. And that might have been surprising to some of you: what about negative numbers? Those don’t have square roots, surely?</p>
<p>If I try to draw out some graphs, it doesn’t <em>look</em> like those equations should have solutions. The graph of \(x^2-2\) hits the \(x\)-axis twice, so it seems like it should equal zero for two values of \(x\). But if we graph the function \(x^2+2\) we get this curve:</p>
<p><img src="/assets/blog/fictional-history-of-numbers/no-root-quadratic.png" alt="Graph of x^2+2" class="center blog-image" /></p>
<p>And that doesn’t hit the \(x\)-axis at all! So how am I saying it has solutions?</p>
<p>The answer: I made them up! When I defined square roots I didn’t appeal to graphs, or the sizes of numbers, or anything like that. I just said “if we have a rational number, let’s make up two new numbers, called the square roots”. It doesn’t matter whether the number is positive or negative; I’m just as capable of writing the symbol \(\sqrt{-2}\) as I am of writing the symbol \(\sqrt{2}\).</p>
<p>Now of course, there is an important difference here. One of these is a <em>real</em> number, and the other is an imaginary number. But our current strategy doesn’t mention real numbers, or imaginary numbers, at all. We can’t even explain the difference clearly until we talk about a completely different strategy for inventing numbers, called <em>completion</em>, and that will have to wait.</p>
<h2 id="algebraic-numbers-bringing-it-all-together">Algebraic numbers: bringing it all together</h2>
<p>Let’s see how far we can push this strategy. Simple equations like \(3 + x = 0\) gave us the integers; linear equations like \(3 \cdot x + 7 = 0\) gave us the rationals; and quadratic equations like \(x^2 + 5x +3 =0\) gave us square roots. What other equations can we look at?</p>
<p>We have two operations: addition and multiplication. The most complicated equations we can build with those operations are the polynomial equations<strong title="Polynomials aren't the most complicated algebraic _functions_; we can look at rational functions, which are one polynomial divided by another. But these don't really give us new _equations_ because you can just multiply both sides by the denominator to get a regular polynomial."><sup id="fnref:9"><a href="#fn:9" class="footnote">9</a></sup></strong>
\[
a_0 + a_1 x + a_2 x^2 + \dots + a_nx^n =0,
\]
where the numbers \(a_0, a_1, \dots, a_n\) are rational numbers.</p>
<p>Some of these equations have simple, straightforward solutions. If we want to solve</p>
<p>\[
x^3-6x^2+11x-6=0,
\]</p>
<p>we can factor the polynomial to get</p>
<p>\[
(x-3)(x-2)(x-1) = 0,
\]</p>
<p>which has solutions \(1\), \(2\), and \(3\).</p>
<p>Other polynomial equations aren’t that straightforward, but have graphs that make it clear there should be solutions somewhere:</p>
<p><img src="/assets/blog/fictional-history-of-numbers/3-root-cubic.png" alt="a graph of x^3-3x+1" class="center blog-image" />
<em class="center blog-image">The graph of \(x^3-3x+1\) intersects the \(x\)-axis in three places, at roughly \(-1.879, 0.347\) , and \(1.532\).</em></p>
<p>And some polynomial equations look like they shouldn’t have any solutions at all.</p>
<p><img src="/assets/blog/fictional-history-of-numbers/no-root-quartic.png" alt="a graph of x^4-3x^2+x+4" class="center blog-image" />
<em class="center blog-image">The graph of \(x^4-3x^2+x+4=0\) doesn’t intersect the \(x\)-axis at all.</em></p>
<p>But just like with square roots, we don’t need to worry about the graphs. We can just <em>define</em> solutions to all those equations, inventing new numbers if there aren’t numbers that already work. And if we’ve invented enough numbers that every polynomial equation with rational coefficients has a solution, we get the <em>algebraic</em> numbers.</p>
<p>The algebraic numbers include all the rational numbers, and all the square roots. They also include higher-order roots: cube roots, fourth roots, and so on. For instance, \(x^4-2=0\) has four solutions: \(\sqrt[4]{2}, i \sqrt[4]{2}, - \sqrt[4]{2}, -i \sqrt[4]{2}\). The equation \(x^8-2=0\) has eight solutions, which include \(\sqrt[8]{2}\) but also \( \frac{1}{\sqrt[8]{8}} - \frac{i}{\sqrt[8]{8}}\).</p>
<p>But we also get other things. And it’s hard to describe them, because not all algebraic numbers have convenient names. We can graph the polynomial \(x^5+x+3\), and we see that the equation \(x^5 + x + 3 = 0\) should have one solution, which is approximately \(-1.133\).</p>
<p><img src="/assets/blog/fictional-history-of-numbers/1-root-quintic.png" alt="a graph of x^5+x+3" class="center blog-image" />
<em class="center blog-image">The graph of \(x^5+x+3\) intersects the \(x\)-axis exactly once.</em></p>
<p>But if we ask a program like Mathematica for the exact solution, we just get told that the solution is a solution to the equation \(x^5+x+3=0\).</p>
<p><img src="/assets/blog/fictional-history-of-numbers/bad-algebraic-numbers.png" alt="An image of very unhelpful Mathematica output" class="blog-image center" />
<em class="blog-image center">This <strong>is</strong> an answer to the question I asked. It is not a <strong>helpful</strong> answer.</em></p>
<p>Now, this number isn’t impossible to describe. After all, I just described it! But it doesn’t have a name in our usual notation. It’s just this awkward thing that has to exist, because we created a solution for \(x^5+x+3=0\); but all we can say about it is that it <em>is</em> that solution we created.</p>
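<p>If we just want the decimal value, though, nothing stops us from approximating the root ourselves. Here’s a minimal bisection sketch (my own illustration, not part of the original discussion; I start on the interval \([-2,-1]\) because the polynomial changes sign there):</p>

```python
def bisect(f, lo, hi, tol=1e-12):
    """Find a root of f in [lo, hi], assuming f(lo) and f(hi) have opposite signs."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(lo) * f(mid) <= 0:  # the sign change is in the left half
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

root = bisect(lambda x: x**5 + x + 3, -2, -1)
print(round(root, 3))  # -1.133
```

<p>This gives us as many decimal digits as we like, but it still doesn’t give the number a <em>name</em>; that’s exactly the situation described above.</p>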
<h2 id="achieving-transcendence">Achieving transcendence</h2>
<p>And with this, we seem to have exhausted our first strategy. Every equation we can build out of algebra has a solution now. We’ve completely answered that question.</p>
<p>But that doesn’t mean we’re done, because there are other questions—and some of those questions are very important. We can draw a picture of a circle with radius 1:</p>
<p><img src="/assets/blog/fictional-history-of-numbers/circle-graph.png" alt="A graph of a circle with radius 1" class="center blog-image" /></p>
<p>And we can try to measure the area inside this circle. Each square in the picture has area \(1/4\), so the circle must have an area bigger than \(1\), and smaller than \(4\). But what is it?</p>
<p>We can always just give that area a name; we chose \(\pi\), which is the Greek version of the letter \(p\) and stands for “perimeter”. But while we can invent a name for the area of this circle, we <em>can’t</em> relate it to the tools we’ve been discussing. There’s no polynomial equation defined with rational numbers<strong title="This caveat is important, but only for stupid reasons. x - π = 0 is technically a polynomial equation, and π is definitely a solution to that equation. But that just isn't very enlightening. "><sup id="fnref:10"><a href="#fn:10" class="footnote">10</a></sup></strong> that has \(\pi\) as a solution.</p>
<p>These non-algebraic numbers are called <a href="https://en.wikipedia.org/wiki/Transcendental_number">transcendental</a>. They’re extremely important. And they’re <a href="/blog/fictional-history-of-numbers-part-2/">a story for another day</a>.</p>
<hr />
<p><em>Do you want to hear about transcendental numbers? What other numbers do you want to learn about? Tweet me <a href="https://twitter.com/profjaydaigle">@ProfJayDaigle</a> or leave a comment below.</em></p>
<hr />
<div class="footnotes">
<ol>
<li id="fn:1">
<p>If you’re about to bring up zero, please bear with me for a few paragraphs. <a href="#fnref:1" class="reversefootnote">↵Return to Post</a></p>
</li>
<li id="fn:2">
<p>Four of them are sugar bee apples I bought over the weekend, and one is a gala apple that I bought months ago and forgot about in the back of the fridge until I went to go count my apples while writing this post. <a href="#fnref:2" class="reversefootnote">↵Return to Post</a></p>
</li>
<li id="fn:3">
<p>We do the same thing when we say “apple” rather than writing out the full genome, or describing the position of every molecule, or whatever. We’re constantly abstracting without really thinking about it; doing math just makes this process of abstraction more explicit. <a href="#fnref:3" class="reversefootnote">↵Return to Post</a></p>
</li>
<li id="fn:4">
<p>If you’d like to help fix that last one for me, you can donate to my <a href="https://ko-fi.com/jaydaigle">Ko-Fi</a>. <a href="#fnref:4" class="reversefootnote">↵Return to Post</a></p>
</li>
<li id="fn:5">
<p>Kronecker famously said that “God created the integers; all else is the work of man”. But I think he was giving God a little bit more credit there than he deserved. The integers are complicated! <a href="#fnref:5" class="reversefootnote">↵Return to Post</a></p>
</li>
<li id="fn:6">
<p>The Z stands for “Zahlen”, which is German for “numbers”. <a href="#fnref:6" class="reversefootnote">↵Return to Post</a></p>
</li>
<li id="fn:7">
<p>We can do that, actually—at least sort of. Stay tuned for the rest of this series. <a href="#fnref:7" class="reversefootnote">↵Return to Post</a></p>
</li>
<li id="fn:8">
<p>The Q stands for “quotient”. <a href="#fnref:8" class="reversefootnote">↵Return to Post</a></p>
</li>
<li id="fn:9">
<p>Polynomials aren’t the most complicated algebraic <em>functions</em>; we can look at rational functions, which are one polynomial divided by another. But these don’t really give us new <em>equations</em> because you can just multiply both sides by the denominator to get a regular polynomial. <a href="#fnref:9" class="reversefootnote">↵Return to Post</a></p>
</li>
<li id="fn:10">
<p>This caveat is important, but only for stupid reasons. \(x - \pi = 0\) is technically a polynomial equation, and \(\pi\) is definitely a solution to that equation. But that just isn’t very enlightening. <a href="#fnref:10" class="reversefootnote">↵Return to Post</a></p>
</li>
</ol>
</div>Jay DaigleMathematicians deal with lots of different kinds of "numbers". But where do they come from? In this series we'll see where different types of exotic numbers came from, and what reasonable questions we need them to answer. In Part 1, we're starting off with the simplest types of numbers: the algebraic numbers. We'll see how we could have invented square roots and weirder things on our own, just by asking one reasonable question after another.Motivating the Integral with Euler’s Method2023-03-15T00:00:00-07:002023-03-15T00:00:00-07:00https://jaydaigle.net/blog/motivating-integral-eulers-method<p>I have a fun idea for how to teach and think about the integral in the context of freshman calculus. I’ve never actually used this in a class, and I suspect it’s not actually a great idea. But it’s a <em>fun</em> idea and worth at least playing with, even if it’s a bit too weird to help calculus novices understand what’s going on.</p>
<p>But first, I want to mention that if you want to support my writing, I now have a <a href="https://ko-fi.com/jaydaigle">Ko-Fi account</a>. Any tips would be appreciated and would help me write more essays like this.</p>
<h2 id="the-big-ideas-of-calculus-1">The Big Ideas of Calculus 1</h2>
<p>When I teach calculus I emphasize two big ideas: differential equations, and numerical analysis.</p>
<p><strong>Differential equations</strong> generalize the concept of “rate of change”, and they’re the core of why calculus is <em>useful</em>: you can describe the rules a system follows, encode them in math, and draw conclusions. Calculus 1 students don’t have the tools to solve differential equations, but they can—and should—understand how a sentence like “the acceleration is proportional to the displacement” relates to the equation \(y’’ = -ky\).</p>
<p><strong>Numerical approximation</strong> is often the <em>way</em> we use calculus, and increasingly so as computers are more powerful and available. I motivate the derivative with the idea of linear approximation: if I want to pretend my function is a line, and write \(f(x) = f(a) + m (x-a)\), what number \(m\) will do the best job? This develops into other methods for approximating the answers to questions that are too hard to answer directly: it leads into ideas like quadratic approximation and <a href="https://en.wikipedia.org/wiki/Newton's_method">Newton’s method</a>, and provides a foundation for numerical integration and Taylor series in Calculus 2.</p>
<h2 id="eulers-method">Euler’s Method</h2>
<p>If we combine these two ideas, we can try to numerically approximate the solution to a differential equation. Suppose we have a differential equation \(f’(t) = f(t) - f(t)^2/2\), and we know the initial condition that \(f(0)=1\). If we want to know \(f(3)\) we can get a rough guess with a linear approximation: we know \(f(0) = 1 \) and thus that \(f’(0) = 1 - \frac{1^2}{2} = \frac{1}{2} \), so we get</p>
<p>\[
f(3) \approx f(0) + f’(0) (3-0) = 1 + \frac{1}{2} \cdot 3 = \frac{5}{2}.
\]</p>
<p>That’s only a rough estimate; linear approximation generally isn’t very accurate when the starting point and ending point aren’t close together. In fact the true value is \( \frac{2e^3}{e^3+1} \approx 1.905\), which isn’t terribly far off from \(2.5 \) but isn’t especially close either. But this is the best estimate we can really get using only \(f(0)\) and \(f’(0)\).</p>
<p>However, we know a lot more than that, because we have a formula for \(f’(x)\). It’s a bit hard to use, because we need to know \(f(x)\) to compute \(f’(x)\); but we know we can approximate \(f(x_2)\) if we already know \(f(x_1)\) and \(f’(x_1)\). That allows us to do a recursive calculation:</p>
<p class="theorem">\[
\begin{array}{rl}
f(1) & \approx f(0) + f’(0) (1-0)
= 1 + \left(1 - \frac{1^2}{2} \right) \cdot (1) = \frac{3}{2}. \\
f(2) & \approx f(1) + f’(1) (2-1)
\approx \frac{3}{2} + \left( \frac{3}{2} - \frac{ \left(\frac{3}{2} \right)^2}{2} \right) \cdot (1)
= \frac{15}{8}. \\
f(3) & \approx f(2) + f’(2) (3-2)
\approx \frac{15}{8} + \left(\frac{15}{8} - \frac{\left(\frac{15}{8} \right)^2}{2} \right) \cdot (1) \\
& = \frac{255}{128}.
\end{array}
\]</p>
<p>Thus we estimate \(f(3) \approx \frac{255}{128} \approx 1.99\).</p>
<p>This still isn’t an exact value for \(f(3)\); but this approximation is much better than our first try. And if this isn’t close enough, we can do even better by breaking our approximation into more steps: with six steps we get \(f(3) \approx 1.95\) and with sixty we get \(f(3) \approx 1.909\). More steps take more work, but they also give us a more precise answer.</p>
<p>This approach is known as <a href="https://en.wikipedia.org/wiki/Euler_method">Euler’s method</a>, and it allows us to numerically approximate the result of any first-order ordinary differential equation given an initial condition. With a little bit of work, we can generalize this to any ordinary differential equation; it’s quite straightforward and flexible.</p>
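<p>The recursion above is easy to automate. Here’s a quick sketch in Python (mine, not part of the original post) that reproduces the three-step estimate \(\frac{255}{128}\) and shows that sixty steps land much closer to the true value:</p>

```python
import math

def euler(n, t_end=3.0, f0=1.0):
    """Approximate f(t_end) for f'(t) = f(t) - f(t)**2 / 2, f(0) = f0, using n Euler steps."""
    h = t_end / n
    y = f0
    for _ in range(n):
        y += h * (y - y**2 / 2)  # one linear-approximation step of width h
    return y

exact = 2 * math.exp(3) / (math.exp(3) + 1)  # the true value, about 1.905
print(euler(3))   # 1.9921875, which is exactly 255/128
print(euler(60))  # much closer to the true value
```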
<p>It’s also basically just integration.</p>
<h2 id="what-is-an-integral">What is an integral?</h2>
<p>In a typical calculus course, we motivate the integral with the area problem: we have the graph of some function, and we want to find the area under that curve. We can approximate that area by chopping it up into rectangles, which gives us the Riemann sum. And then as the number of rectangles approaches infinity our approximation gets really good, which allows us to define the integral.</p>
<p><img src="/assets/blog/teach-in-a-society/Riemann_sum_(leftbox).gif" alt="An animation of a Riemann sum as the number of rectangles goes to infinity" class="blog-image center" /></p>
<p class="theorem">\[
\int_a^b f(t) \,dt = \lim_{n \to \infty} \sum_{k=1}^n f(x_k) \Delta x
\]</p>
<p>This definition has a lot of symbols in it, and is generally intimidating to freshman calculus students. But it does accurately describe what we’re doing and why: the key idea of the integral is to break a calculation into pieces, do an approximation on each piece, and then add the results together. This will give us an approximate answer to our original question; as we use more and smaller pieces, the approximation gets better, and so in the limit we get an exact answer.</p>
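<p>If you want to watch that limit converge, the Riemann sum itself is easy to compute numerically. Here’s a throwaway sketch (my own choice of example, using \(f(t) = 3t^2\) on \([1,5]\), where the exact area is \(124\)):</p>

```python
def left_riemann(f, a, b, n):
    """Left-endpoint Riemann sum for f on [a, b] with n rectangles."""
    dx = (b - a) / n
    return sum(f(a + k * dx) * dx for k in range(n))

f = lambda t: 3 * t**2
for n in (4, 40, 4000):
    print(n, left_riemann(f, 1, 5, n))  # the sums approach the exact area 124
```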
<p>So this formula directly answers the question that we’re asking. And when we want to think about <em>applications</em> of the integral, the Riemann sum definition is useful: it helps us figure out what the integral is actually computing, and so what problems it can help solve. But Riemann sums are a huge pain to actually do computations with, so we generally don’t.</p>
<p>Instead, we rely on the Fundamental Theorem of Calculus, which comes in two parts.</p>
<p class="theorem"><strong>Fundamental Theorem of Calculus, Part 1:</strong> <br />
Given a function \(f(x)\) and a number \(a\), we can define a new function \(F(x) = \int_a^x f(t) \,dt\). Then \(F’(x) = f(x)\).</p>
<p>Part 1 tells us that the derivative undoes the integral; the derivative of the integral of \(f\) is just \(f\). This is conceptually cool, and it does allow us to compute <em>something</em>. But it doesn’t directly help us compute the integral. Instead, we use it to prove<strong title="This proof relies heavily on specific special properties of the real numbers, and in particular the property that if f'(x)=0 then f(x) is constant. This isn't true if we allow functions to be defined solely for rational numbers; the real numbers are exactly the set that makes it work."><sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup></strong> a second statement.</p>
<p class="theorem"><strong>Fundamental Theorem of Calculus, Part 2:</strong> <br />
If \(F’(x) = f(x)\), then \(\int_a^b f(t) \,dt = F(b) - F(a)\).</p>
<p>This is also known as the Evaluation Theorem, or sometimes the Net Change Theorem. And it’s the tool we actually use in practice to compute integrals—to the extent that people mainly associate “integration” with finding the antiderivative \(F(x)\), not with finding the <em>number</em> corresponding to the area under the curve.</p>
<p>And this all works, but we’ve moved pretty far away from the original question, and the connections pass through some relatively abstract territory. It’s hard to really intuitively see how this calculation relates to the original question.</p>
<p>Maybe there’s a better way.</p>
<h3 id="the-antiderivative-as-a-differential-equation">The antiderivative as a differential equation</h3>
<p>Let’s start by asking this question backwards. Suppose there’s some function you’re interested in, but you don’t have a formula for it. Instead you just have a formula for the derivative. In practical terms, this happens in <a href="https://en.wikipedia.org/wiki/Dead_reckoning">dead reckoning</a>: if you can’t measure where you are, but you know where you started and how fast you’re moving, you can estimate where you end up.</p>
<p>So suppose we know our speed \(F’(x)\), and our starting position \(F(a)\), and we want a way to figure out our current position \(F(x)\). We want to compute an antiderivative! The FTC part 2 tells us that \(F(x) = F(a) + \int_a^x F’(t) \,dt \), so we could figure this out by doing an integral. But I want to follow a different thought process.</p>
<p>We can start by saying, we know what \(F(a)\) is, and since we have a formula for \(F’(x)\), we can compute \(F’(a)\). Then we can use the linear approximation formula to estimate
\[
F(x) \approx F(a) + F’(a) (x-a).
\]
So if we know, say, that \(F(1)=3\) and \(F’(x) = 3x^2\), we can estimate that \(F(5) \approx 3 + 3(5-1) = 15\).</p>
<p>Linear approximation gives a pretty decent estimate if \(x\) and \(a\) are close, but if they’re far apart it’s not very good. Consequently it doesn’t really work here: in reality \(F(5) = 127\).</p>
<p>But we can improve this exactly the same way we did before, by using Euler’s method! The problem is that the two points on my linear approximation are too far apart. But we can try to approximate the value of \(F\) somewhere closer to \(1\), like at \(3\).</p>
<p>\[
F(3) \approx F(1) + F’(1)(3-1) = 3 + 3(2) = 9.
\]
And then, since we also know \(F’(3) = 27\), we can estimate
\[
F(5) \approx 9 + 27(5-3) = 63.
\]
Still not right, but much better! And we can improve even further by doing more steps:
\[
\begin{array}{rl}
F(2) & \approx F(1) + F’(1)(2-1)
= 3 + 3 = 6 \\
F(3) & \approx F(2) + F’(2)(3-2)
= 6 + 12 = 18 \\
F(4) & \approx F(3) + F’(3)(4-3)
= 18 + 27 = 45 \\
F(5) & \approx F(4) + F’(4)(5-4)
= 45 + 48 = 93.
\end{array}
\]
This still isn’t quite right, but it’s even closer; and as we take more and more smaller and smaller steps, we’ll get a better and better approximation.</p>
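<p>The whole refinement process fits in a few lines of code. Here’s a sketch (mine; the numbers are the ones from the example above, with \(F(1)=3\) and \(F’(x) = 3x^2\)):</p>

```python
def euler_antideriv(n, a=1.0, b=5.0, F_a=3.0):
    """Estimate F(b) from F(a) and the derivative F'(x) = 3x**2, using n Euler steps."""
    h = (b - a) / n
    F, x = F_a, a
    for _ in range(n):
        F += 3 * x**2 * h  # one linear-approximation step
        x += h
    return F

for n in (1, 2, 4, 1000):
    print(n, euler_antideriv(n))  # 15.0, 63.0, 93.0, then approaching the true value 127
```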
<h3 id="riemann-sums-as-eulers-method">Riemann Sums as Euler’s Method</h3>
<p>This is basically Euler’s method. But why is it an integral? Let’s reorganize the calculation to make it clearer what’s happening.
\[
\begin{array}{rl}
F(5)
& \approx F(4) + F’(4)(5-4) \\
& \approx F(3) + F’(3) (4-3) + F’(4) (5-4) \\
& \approx F(2) + F’(2) (3-2) + F’(3) (4-3) + F’(4) (5-4) \\
& = F(1) + F’(1) (2-1) + F’(2) (3-2) + F’(3) (4-3) + F’(4) (5-4) \\
& = 3 + 3 \cdot 1 + 12 \cdot 1 + 27 \cdot 1 + 48 \cdot 1 = 93.
\end{array}
\]
At this point this should be starting to look familiar. We’re taking a bunch of steps of size \(1 = \Delta x\), and for each step we’re multiplying it by the derivative at some \(x\) value. So we just computed
\[
F(5) \approx F(1) + \sum_{k=1}^4 F’ \big( 1 + (k-1) \cdot 1 \big) \cdot 1.
\]
More generally, if we take \(n\) steps we get
\[
F(5) \approx F(1) + \sum_{k=1}^n F’\big( 1 + (k-1) \Delta x \big) \Delta x.
\]
And that’s <em>almost</em> exactly a Riemann sum on the right-hand side. In fact, it’s a Riemann sum, plus the extra term \(F(1)\). If we rearrange it we get
\[
F(5) - F(1) \approx \sum_{k=1}^n F’\big( 1 + (k-1) \Delta x \big) \Delta x.
\]</p>
<p>I see two ways to think about this formula. One is that the <em>indefinite</em> integral contains a \(+C\) term, because antiderivatives aren’t unique. So while \(\int F’(t) \,dt\) is <em>an</em> antiderivative of \(F’(x)\), we don’t necessarily get the same function as our original \(F(x)\). Instead, the FTC just guarantees we have \(F(x) +C\), and \(F(1)\) is just the \(+C\) term.</p>
<p>But I think the clearer way to see it is that we’re really computing the <em>change</em> in the value of \(F\). This should make physical sense: the calculations with the speed tell us how far we’ve moved, not where we are. Thus the Euler’s method calculation tells us our <em>displacement</em>; but if we add that on to our starting position, we find our ending position.</p>
<h2 id="is-this-a-good-idea">Is this a good idea?</h2>
<p>Mathematically, this all works out. It’s a cute argument and I’m glad I’ve found it. But there are plenty of fun math ideas that don’t belong in a freshman calculus course.</p>
<p>This approach has one obvious, major disadvantage: no one else teaches it like this, so it would probably leave students confused if they go on to take another course with someone else. And that’s probably enough to make it not worth doing<strong title="Or at least not worth doing as the motivation to the integral. I think it's fine to do this as a _followup_, or an application of the integral. If you have an extra day to spend on integration, this isn't the worst thing you could do. But if you have extra days in your calculus syllabus please tell me how you got them."><sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup></strong>, on its own.</p>
<p>But while that’s a real obstacle to adopting this approach in one class, it’s also kind of dodging the interesting questions about whether this would be a better approach. What if we could get everyone to switch? Should we?</p>
<p>One problem is that this argument isn’t at all rigorous. As long as we believe that Euler’s method will converge to the right answer, then the integral will as well; but I don’t know how you’d prove that Euler’s method converges without referencing the integral, so that seems fairly circular.</p>
<p>That objection seems fatal to me—in an upper-division Real Analysis course. In a freshman calculus course, nothing is ever going to be fully rigorous, and the proofs involving Riemann sums especially won’t be because getting the technical details of Riemann sums correct is <em>hard</em>. So I don’t mind a little non-rigor, especially if it helps students develop a clear intuitive understanding of what we’re trying to do.</p>
<p>In fact, having to avoid some of the abstraction involved in proving the Fundamental Theorem of Calculus might be a win, overall. That’s one of those lectures where I’m always confident my students aren’t <em>really</em> following the details, and are just hanging on trying to survive until we get back to computing things. On the other hand, it’s good for them to see some abstract formalism, even if they’re not ready to fully understand it yet. You have to see your first scary proof sometime!</p>
<p>Another problem is that this derivation captures the relationship between the Riemann sum and the antiderivative, but presents it exactly backwards. In most applications, the Riemann sum is the question we want to answer; the antiderivative is the tool we use to answer it. But the Euler’s method approach treats the antiderivative as the question, and the Riemann sum as the way we compute the answer—which is completely wrong since the Riemann sum is nearly impossible to compute outside of the simplest cases. I think this is a really deep problem with this approach. One of the big ideas I want my students to engage with is figuring out the difference between identifying a question, and computing the answer; giving it to them backwards seems like an obstacle to developing that understanding.</p>
<p>But I do really like the way this approach connects the integral back to the other big ideas in the class. Not just to the derivative; any presentation of the FTC will draw a link between integration and differentiation. But this makes the integral seem <em>connected</em> to the themes of numeric approximation and differential equations, which ties the entire course together neatly.</p>
<p>And really, that sums it up, I think. It’s always nice to tell a neat story that ties the whole class together. But it probably isn’t as important as making sure our students understand each piece well on its own. I have to resist the temptation to do something pretty, and elegant, and unnecessarily confusing. So this is a fun idea, but for now I’m going to teach this normally.</p>
<hr />
<p><em>Do you have a clever way to motivate the integral? Do you think I should actually be using this approach in my course? Any other thoughts on teaching integration? Tweet me <a href="https://twitter.com/profjaydaigle">@ProfJayDaigle</a> or leave a comment below.</em></p>
<hr />
<div class="footnotes">
<ol>
<li id="fn:1">
<p>This proof relies heavily on specific special properties of the real numbers, and in particular the property that if \(f’(x)=0\) then \(f(x)\) is constant. This isn’t true if we allow functions to be defined solely for rational numbers; the real numbers are exactly the set that makes it work. <a href="#fnref:1" class="reversefootnote">↵Return to Post</a></p>
</li>
<li id="fn:2">
<p>Or at least not worth doing as the motivation to the integral. I think it’s fine to do this as a <em>followup</em>, or an application of the integral. If you have an extra day to spend on integration, this isn’t the worst thing you could do. But if you have extra days in your calculus syllabus please tell me how you got them. <a href="#fnref:2" class="reversefootnote">↵Return to Post</a></p>
</li>
</ol>
</div>Jay DaigleI realized you can "prove" the Fundamental Theorem of Calculus by using Euler's method for solving differential equations. It's a fun way to motivate the integral, but not one that I'm actually going to use in class.Writing Calculus Tests with ChatGPT2023-03-08T00:00:00-08:002023-03-08T00:00:00-08:00https://jaydaigle.net/blog/writing-calculus-tests-with-ChatGPT<p>Last <a href="/blog/not-scared-of-chatbots/">week</a> I talked about the new chatbots, like ChatGPT and Bing’s chat interface. I argued that they while they produce language they can’t really analyze it or check it for errors; and that this is a meaningful restriction that we can’t get past without a serious change in the approach we take to AI systems. So the chatbots won’t be able to fully replace intellectual labor any time soon. But they still might help, especially if we can identify formulaic tasks that don’t require really critically thinking about how ideas connect.</p>
<p>But rather than philosophizing, I decided to get concrete about this. Can I use ChatGPT to make <em>my</em> job easier? It’s going to be pretty useless for the most important parts of my job. In particular, it has no way to figure out why a student is confused and address their confusions. And it’s not going to come up with insightful new ways to describe course topics. It won’t even be able to meaningfully connect distinct ideas in the course, because it has no sense of what’s already been covered.</p>
<p>Instead I need to find the aspects of my job that are <em>routine</em>, and involve following relatively standard templates and filling them out in predictable ways. I need to find tasks that it’s easy for me to check if they’re done right, since ChatGPT is not correct with any consistency. Ideally, I’d also find ways to have it replace the parts of my job that are the most annoying: I don’t <em>want</em> a way to avoid spending time in office hours with students, because office hours are fun!</p>
<p>But one thing I spend a lot of time doing, and don’t enjoy at all, is writing homework and test questions. I need to create original problems (or at least ones that aren’t in the textbook so students can’t look them up), but not <em>too</em> original (so they fit the patterns that my calculus students are supposed to be learning). And unlike all of the rest of my course planning, I need to do new ones every year—I can reuse my old lecture notes, but it’s not safe to reuse my old tests.</p>
<p>So I decided to spend some time experimenting with GPT as a test writer. Can it write good questions? Can it write usable solutions for those questions? And can it do this easily, or is shepherding it through the process more trouble than it’s worth?</p>
<p>But before I tell you what I found, I want to mention that if you want to support my writing, I now have a <a href="https://ko-fi.com/jaydaigle">Ko-Fi account</a>. Any tips would be appreciated and would help me write more essays like this.</p>
<h2 id="the-verdict">The Verdict</h2>
<p>Overall, the current tech seems somewhat useful, but not actually good—at least, not yet. But it’s close enough that I suspect it will get pretty good for this purpose before long.</p>
<h3 id="writing--problems">Writing Problems</h3>
<p>With a couple exceptions, ChatGPT could figure out what type of question I was asking for. If I asked for a related rates problem, or an integration problem that involved integration by parts, I would get one. Sometimes they weren’t quite right, but I could get the general type of problem I asked for, with basically no prompt engineering.</p>
<p>On the other hand, it was hard to get specifics. I can get a big pile of integration by parts problems, but a lot of them will be either very easy or very hard. And ChatGPT gets stuck in ruts; I saw identical problems show up to multiple different prompts, and there were running themes in everything it output. That means that the system can’t give me fine-tuned answers, and also will not give me an even coverage of the relevant types of problems.</p>
<p>But if I have something specific I want, I can probably just write it myself; and even if it won’t give me every type of problem, it can help remind me of my options. I found it genuinely useful for brainstorming problems, even if I didn’t use any of them exactly. (And I am at this moment proctoring a test that includes some problems I wrote with GPT assistance.)</p>
<h3 id="solving-problems">Solving Problems</h3>
<p>On the other hand, the solutions it produced were usually wrong, sometimes spectacularly so. A few times I got a completely correct solution. Most of the time, I would get an answer that had the right approach but did completely nonsensical calculations in the middle; the solutions would look superficially correct, but checking them carefully turned up multiple errors. And occasionally I would get arguments almost completely unrelated to the questions I asked.</p>
<p>But, if anyone does figure out a way to usefully and consistently hook this up to a computer algebra system, it will probably do pretty well at solving problems too. It tended to set up the right computation and then generate a nonsense answer; if it could tell when it needs to just factor a polynomial or compute an integral, and pass that to a computer algebra system, that would fix a lot of the weaknesses.</p>
<p>I know multiple teams are trying to find a way to hook systems like GPT up to computational engines and computer algebra systems. If they could do that effectively it would probably be able to write good solutions immediately, but that really sounds non-trivial to me. You could <em>maybe</em> teach it to pass integrals or other specific calculations to a computer algebra system, read the result, and print it. But translating that into a well-written solution would require some sort of deep integration of the two capabilities, not just an ability to print the final answer.</p>
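<p>That said, even without a computer algebra system, a cheap numerical spot-check can flag many wrong answers: just plug the claimed critical points back into the derivative. Here’s a sketch (my own illustration, using the polynomial from the first transcript below):</p>

```python
# For f(x) = x**3 - 3*x**2 + x - 1, the derivative is f'(x) = 3*x**2 - 6*x + 1.
fprime = lambda x: 3 * x**2 - 6 * x + 1

correct = (3 + 6**0.5) / 3  # the quadratic formula actually gives (3 ± sqrt(6)) / 3
claimed = (3 + 3**0.5) / 3  # a claimed critical point of the form (3 + sqrt(3)) / 3

print(abs(fprime(correct)))  # essentially 0, so this really is a critical point
print(abs(fprime(claimed)))  # about 1, so this is not a critical point at all
```

<p>A check like this can’t tell you the solution is well-written, but it’s enough to decide whether the final answer is worth reading carefully.</p>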
<p>But one thing did impress me about the solutions: ChatGPT could clearly consistently remember what question it was trying to answer. Every single solution ended with a clear restatement of the question and an answer to it. The answer was usually <em>wrong</em> but it never lost track of what it was supposed to be answering. (And that’s more than I can say for some of my students.) This mostly shows that the question is still in ChatGPT’s context window when it finishes the solution, but also that it’s still <em>using</em> the details of the question. And it’s doing that consistently; I don’t think it failed at this once.</p>
<h3 id="other-observations">Other Observations</h3>
<p>ChatGPT can write perfectly valid and idiomatic LaTeX code, which I was mildly surprised by. I’m used to the fact that it can put out grammatical and idiomatic <em>English</em> text, but I was still impressed by the LaTeX.</p>
<p>Similarly, ChatGPT can draw diagrams using ASCII art. I’m really curious how much of that was in the training set; I wouldn’t have thought it was that common, but it’s clearly common enough. (It did not always accurately describe the diagram it had drawn.)</p>
<p>One major weakness is that ChatGPT doesn’t have a consistent style. Each solution it wrote used completely different notation and phrasing—even when I asked it to respond to the same prompt multiple times.</p>
<h2 id="the-data">The Data</h2>
<p>The rest of this post is a record of things I tried and experiments I conducted in getting ChatGPT to write calculus questions for me. I spent a while playing around, and tried to include everything here as a sort of lab notebook and complete record. If you don’t need to write calculus questions you probably don’t want to read the whole thing.</p>
<p>But I have hidden most of the long GPT exchanges under pulldown spoiler boxes. If you just want to skim through and see my thoughts on each experiment, you can do that; and you can expand out any box you decide you actually want to read.</p>
<p>Or you can jump to <a href="#conclusion">the end</a> and leave me a comment.</p>
<h3 id="maxmin-problems">Max/Min problems</h3>
<p>When I teach calculus 1, my absolute least favorite thing to do is writing test questions for max/min problems. It’s simultaneously extremely routine and extremely tedious. So I decided to start there.</p>
<h4 id="maxmin-take-one">Max/Min take one</h4>
<p class="theorem"><strong>Prompt:</strong> Write a problem where you have to classify the critical points of a function, with a solution.</p>
<details class="example">
<summary>
Response 1
</summary>
<div>
<p>Problem:</p>
<p>Consider the function $f(x) = x^3 - 3x^2 + x - 1$. Determine the critical points of the function and classify them as local maxima, local minima, or inflection points.</p>
<p>Solution:</p>
<p>A critical point of a function is a point where the derivative of the function is equal to zero or is undefined. To find the critical points of the function, we first find its first derivative:</p>
<p>$f’(x) = 3x^2 - 6x + 1$</p>
<p>Setting $f’(x) = 0$, we find the critical points:</p>
<p>$3x^2 - 6x + 1 = 0$</p>
<p>$x = \frac{6 \pm \sqrt{36 - 4 \cdot 3 \cdot 1}}{2 \cdot 3} = \frac{3 \pm \sqrt{3}}{3}$</p>
<p>Let $c_1 = \frac{3 + \sqrt{3}}{3}$ and $c_2 = \frac{3 - \sqrt{3}}{3}$. These are the critical points of the function. To classify them, we need to find the second derivative and check the sign of the second derivative at each critical point. The second derivative of the function is:</p>
<p>$f’‘(x) = 6x - 6$</p>
<p>At $c_1$, the second derivative is $6c_1 - 6 = 6(\frac{3 + \sqrt{3}}{3}) - 6 = 6$, which is positive. This means that $f(c_1)$ is a local minimum.</p>
<p>At $c_2$, the second derivative is $6c_2 - 6 = 6(\frac{3 - \sqrt{3}}{3}) - 6 = -6$, which is negative. This means that $f(c_2)$ is a local maximum.</p>
<p>Therefore, the critical points of the function $f(x) = x^3 - 3x^2 + x - 1$ are:</p>
<p>$c_1 = \frac{3 + \sqrt{3}}{3}$, which is a local minimum.</p>
<p>$c_2 = \frac{3 - \sqrt{3}}{3}$, which is a local maximum.</p>
</div>
</details>
<details class="example">
<summary>
Response 2
</summary>
<div>
<p>Problem: Consider the function $f(x) = x^3 - 6x^2 + 9x - 2$. Classify the critical points of the function as relative maxima, relative minima, or inflection points.</p>
<p>Solution: To classify the critical points, we need to find the critical points and then determine their behavior using the second derivative test.</p>
<p>Finding critical points: To find the critical points, we need to find the points where the first derivative of the function is equal to zero or the first derivative does not exist. The first derivative of the function is given by:
$f’(x) = 3x^2 - 12x + 9$</p>
<p>Setting $f’(x) = 0$, we find that:</p>
<p>$3x^2 - 12x + 9 = 0$</p>
<p>Solving for $x$, we get the critical points:</p>
<p>$x = \frac{12 \pm \sqrt{12^2 - 4 \cdot 3 \cdot 9}}{2 \cdot 3}$</p>
<p>$x = \frac{12 \pm \sqrt{144 - 108}}{6}$</p>
<p>$x = \frac{12 \pm \sqrt{36}}{6}$</p>
<p>$x = \frac{12 \pm 6}{6}$</p>
<p>$x = 3, 1$</p>
<p>So, the critical points are $x = 1, 3$.</p>
<p>Second derivative test: To determine the behavior of the critical points, we use the second derivative test. We find the second derivative of the function and evaluate it at the critical points:
$f’‘(x) = 6x - 12$</p>
<p>$f’‘(1) = 6 \cdot 1 - 12 = -6$</p>
<p>$f’‘(3) = 6 \cdot 3 - 12 = 0$</p>
<p>Since $f’‘(1) < 0$, it means that $x = 1$ is a relative maximum. Since $f’‘(3) = 0$, it means that $x = 3$ is an inflection point.</p>
<p>So, the critical points are:</p>
<p>$x = 1$ is a relative maximum</p>
<p>$x = 3$ is an inflection point.</p>
</div>
</details>
<p>There are a couple of things to notice here. First, I did no post-processing on the output. ChatGPT wrote perfectly formatted LaTeX code for these answers without me editing it at all.<strong title="I know you're seeing some weird formatting on the second derivative, but that's not actually ChatGPT's fault; that has to do with a bug in the way LaTeX compiles through MathJax, which is what allows me to display it on the blog. I could fix the display issue but I wanted to keep the output genuinely unedited."><sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup></strong></p>
<p>Second is that these are both fairly simple problems. My next step is going to have to be seeing if I can get more interesting examples out of it; it went for a simple cubic both times.</p>
<p>Third is that the solution style is inconsistent. ChatGPT doesn’t write up solutions the way I would, which is to be expected. But it also doesn’t write them up the same way as itself. The two solutions use different notations and different phrasings. This doesn’t make it <em>wrong</em>, but it isn’t great as a teaching tool; it’s good for students to see a certain amount of consistency, especially as beginners.</p>
<p>And fourth is that despite the nice organization and formatting, both of these solutions are almost, but not quite, right:</p>
<ul>
<li>
<p><strong>Response 1</strong> takes the quadratic equation \(3x^2-6x+1 =0\), correctly uses the quadratic formula, and then simplifies the solutions to \( \dfrac{3 \pm \sqrt{3}}{3}\). But the actual solutions are \(\dfrac{3 \pm \sqrt{6}}{3}\).</p>
<p>Then when it plugs these points into the second derivative, it incorrectly computes \(\pm 6\) when it should get \(\pm 2 \sqrt{6} \).</p>
<p>These are the sort of basic arithmetic errors that make me think a student understands the material, but is sloppy with arithmetic; I’d probably grade this as an 8/10.</p>
</li>
<li>
<p><strong>Response 2</strong> computes the critical points correctly.</p>
<p>When it plugs them into the second derivative, it correctly computes \(f^{\prime \prime} (1) = -6\) but says that \(f^{\prime \prime}(3) = 6\cdot 3 - 12 = 0\), when of course \(6 \cdot 3 - 12 = 6 \).</p>
<p>Then it interprets this calculation wrong: it makes a key conceptual error by saying that since \(f^{\prime \prime}(3) = 0\), we know that \(3\) is an inflection point. This is kind of half-right, and I could see some mathematicians accepting that conclusion. But not all points where \(f^{\prime \prime} = 0\) are inflection points, and I wouldn’t accept that conclusion.</p>
</li>
</ul>
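<p>To double-check the grading above (a quick numeric sketch of my own, not part of either response), we can redo Response 1’s arithmetic in Python:</p>

```python
import math

# Response 1's function: f(x) = x^3 - 3x^2 + x - 1,
# so f'(x) = 3x^2 - 6x + 1 and f''(x) = 6x - 6.
a, b, c = 3, -6, 1
disc = b * b - 4 * a * c                 # 36 - 12 = 24
c1 = (-b + math.sqrt(disc)) / (2 * a)    # simplifies to (3 + sqrt(6))/3
c2 = (-b - math.sqrt(disc)) / (2 * a)    # simplifies to (3 - sqrt(6))/3

def f2(x):
    """Second derivative of f."""
    return 6 * x - 6

# The roots involve sqrt(6), not sqrt(3), and the second derivative at the
# critical points is +/- 2*sqrt(6), not the +/- 6 that ChatGPT reported.
print(c1, c2, f2(c1), f2(c2))
```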
<p>So we get well-formatted solutions, but they’re not consistent or reliable; I’d have to hand-check them, and probably rewrite to match my style.</p>
<p>But I want to see if we can get better questions out of this, so let’s go back to the well.</p>
<h4 id="maxmin-take-2">Max/Min take 2</h4>
<p>I wanted to get slightly more complicated functions involved.</p>
<p class="theorem"><strong>Prompt:</strong> Write a problem where you have to classify the critical points of a rational function, with a solution.</p>
<details class="example">
<summary> Response 1 </summary>
<div>
<p>Problem: Consider the function $f(x) = \frac{x^3-3x}{x^2-2x+1}$. Classify the critical points of the function as relative maxima, relative minima, or inflection points.</p>
<p>Solution: To classify the critical points, we need to find the critical points and then determine their behavior using the second derivative test.</p>
<p>Finding critical points: To find the critical points, we need to find the points where the first derivative of the function is equal to zero or the first derivative does not exist. The first derivative of the function is given by:
$f’(x) = \frac{(x^2-1)(x^2-3)}{(x-1)^2}$</p>
<p>Setting $f’(x) = 0$, we find that:</p>
<p>$(x^2-1)(x^2-3) = 0$</p>
<p>Solving for $x$, we get the critical points:</p>
<p>$x = \pm 1, \pm\sqrt{3}$</p>
<p>However, $x=1$ is not a critical point since the denominator of the function is zero at $x=1$. Therefore, the critical points are $x = -\sqrt{3}, 1, \sqrt{3}$.</p>
<p>Second derivative test: To determine the behavior of the critical points, we use the second derivative test. We find the second derivative of the function and evaluate it at the critical points:
$f’‘(x) = \frac{6x(x-1)(x+1)}{(x-1)^3}$</p>
<p>$f’’(-\sqrt{3}) = \frac{-6\sqrt{3}}{(2\sqrt{3})^3} = -\frac{3}{4\sqrt{3}}$</p>
<p>$f’‘(1)$ does not exist since the denominator of the second derivative is zero.</p>
<p>$f’’(\sqrt{3}) = \frac{6\sqrt{3}}{(2\sqrt{3})^3} = \frac{3}{4\sqrt{3}}$</p>
<p>Since $f’’(-\sqrt{3}) < 0$, it means that $x = -\sqrt{3}$ is a relative maximum. Since $f’’(\sqrt{3}) > 0$, it means that $x = \sqrt{3}$ is a relative minimum. The point $x=1$ is not a critical point since the function is undefined at that point.</p>
<p>So, the critical points are:</p>
<p>$x = -\sqrt{3}$ is a relative maximum</p>
<p>$x = \sqrt{3}$ is a relative minimum.</p>
</div>
</details>
<details class="example">
<summary> Response 2 </summary>
<div>
<p>Problem: Consider the function $f(x) = \frac{x^2 - 6x + 9}{x^2 - 4}$. Classify the critical points of the function as relative maxima, relative minima, or inflection points.</p>
<p>Solution: To classify the critical points, we need to find the critical points and then determine their behavior using the second derivative test.</p>
<p>Finding critical points: To find the critical points, we need to find the points where the first derivative of the function is equal to zero or the first derivative does not exist. The first derivative of the function is given by:
$f’(x) = \frac{(2x - 6)(x^2 - 4) - (x^2 - 6x + 9)(2x)}{(x^2 - 4)^2}$</p>
<p>Simplifying the expression, we get:</p>
<p>$f’(x) = \frac{-x^4 + 8x^3 - 5x^2 - 72x + 72}{(x^2 - 4)^2}$</p>
<p>Setting $f’(x) = 0$, we find that:</p>
<p>$-x^4 + 8x^3 - 5x^2 - 72x + 72 = 0$</p>
<p>We can factor this expression as:</p>
<p>$(x - 2)(x - 3)(x^2 - 3x - 12) = 0$</p>
<p>Solving for $x$, we get the critical points:</p>
<p>$x = 2, 3, \frac{3 \pm \sqrt{105}}{2}$</p>
<p>So, the critical points are $x = 2, 3, \frac{3 - \sqrt{105}}{2}, \frac{3 + \sqrt{105}}{2}$.</p>
<p>Second derivative test: To determine the behavior of the critical points, we use the second derivative test. We find the second derivative of the function and evaluate it at the critical points:
$f’‘(x) = \frac{2x^5 - 30x^4 + 144x^3 - 320x^2 + 288x - 72}{(x^2 - 4)^3}$</p>
<p>$f’‘(2) = \frac{360}{36} = 10$</p>
<p>$f’‘(3) = \frac{-540}{49} < 0$</p>
<p>$f’‘\left(\frac{3 - \sqrt{105}}{2}\right) > 0$</p>
<p>$f’‘\left(\frac{3 + \sqrt{105}}{2}\right) < 0$</p>
<p>Since $f’‘(2) > 0$, it means that $x = 2$ is a relative minimum. Since $f’‘(3) < 0$, it means that $x = 3$ is a relative maximum. Since $f’‘\left(\frac{3 - \sqrt{105}}{2}\right) > 0$ and $f’‘\left(\frac{3 + \sqrt{105}}{2}\right) < 0$, it means that $\frac{3 - \sqrt{105}}{2}$ is an inflection point and $\frac{3 + \sqrt{105}}{2}$ is a local maximum.</p>
<p>So, the critical points are:</p>
<p>$x = 2$ is a relative minimum
$x = 3$ is a relative maximum
$\frac{3 - \sqrt{105}}{2}$ is an inflection point
$\frac{3 + \sqrt{105}}{2}$ is a local maximum.</p>
</div>
</details>
<p>Both of these questions are at least moderately reasonable, but the answers are wildly off-base.</p>
<ul>
<li>
<p><strong>Response 1</strong> computes the derivative incorrectly, sets it equal to zero correctly, computes the second derivative incorrectly, and plugs the critical points into it incorrectly. It does draw the correct conclusions from the results of those computations, though.</p>
<p>It also makes a random sign error on one of the critical points, and makes a technically incorrect claim about what qualifies as a critical point. (Points where the derivative is undefined <em>are</em> critical points, so long as the function itself is defined there; \(x=1\) fails to be one because the <em>function</em> is undefined at that point, not merely the derivative.)</p>
</li>
<li>
<p><strong>Response 2</strong> computes the derivative correctly, but simplifies it incorrectly, then sets it equal to zero incorrectly. It computes the second derivative incorrectly, and then plugs in the critical points it computed incorrectly.</p>
<p>It almost interprets the computations correctly, but then fails: it claims that because \(f^{\prime \prime} \left( \frac{3-\sqrt{105}}{2} \right) > 0 \), that point is an <em>inflection point</em>, which is definitely wrong.</p>
<p>(Also it repeats itself at the end, but that’s not wrong, just badly written.)</p>
</li>
</ul>
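<p>If you don’t want to redo the calculus by hand, a finite-difference check (my own sketch, not part of either response) is enough to catch Response 1’s bad derivative: the claimed formula disagrees with a numerical estimate of \(f^{\prime}\).</p>

```python
# Response 1's function and the derivative formula it claimed.
def f(x):
    return (x**3 - 3 * x) / (x**2 - 2 * x + 1)

def claimed_fprime(x):
    return (x**2 - 1) * (x**2 - 3) / (x - 1)**2

# Central-difference estimate of f'(2). The true value is 5,
# while the claimed formula gives 3 -- so the formula is wrong.
h = 1e-6
numeric = (f(2 + h) - f(2 - h)) / (2 * h)
print(numeric, claimed_fprime(2))
```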
<p>Overall this doesn’t seem like an awful way to brainstorm problems, but I would have to write solutions myself and confirm the problems are reasonably solvable.</p>
<h3 id="related-rates-problems">Related Rates problems</h3>
<p class="theorem"><strong>Prompt:</strong> Write a related rates problem with a solution</p>
<details class="example">
<summary> Response 1 </summary>
<div>
<p>Problem: A spherical balloon is being inflated with air at a rate of 4 cubic feet per minute. At the instant when the radius of the balloon is 2 feet, how fast is the radius of the balloon increasing?</p>
<p>Solution: We can use the formula for the volume of a sphere to relate the rate of change of the volume of the balloon with the rate of change of the radius:</p>
<p>$V = \frac{4}{3}\pi r^3$</p>
<p>Taking the derivative of both sides with respect to time, we get:</p>
<p>$\frac{dV}{dt} = 4\pi r^2 \frac{dr}{dt}$</p>
<p>We are given that $\frac{dV}{dt} = 4$ cubic feet per minute, and we know that the radius of the balloon is 2 feet. So, substituting these values, we get:</p>
<p>$4 = 4\pi (2)^2 \frac{dr}{dt}$</p>
<p>Simplifying, we get:</p>
<p>$\frac{dr}{dt} = \frac{1}{4\pi (2)^2} = \frac{1}{16\pi}$ feet per minute</p>
<p>So, the radius of the balloon is increasing at a rate of $\frac{1}{16\pi}$ feet per minute when the radius of the balloon is 2 feet.</p>
</div>
</details>
<details class="example">
<summary> Response 2 </summary>
<div>
<p>Problem: A water tank is in the shape of a cylinder with a radius of 5 feet and a height of 10 feet. Water is flowing into the tank at a rate of 3 cubic feet per minute. How fast is the water level rising when the water is 6 feet deep?</p>
<p>Solution:</p>
<p>Let’s begin by labeling the variables we need:</p>
<p>$r$ is the radius of the tank, which is 5 feet.
$h$ is the height of the water level in the tank.
$V$ is the volume of water in the tank.
$t$ is time.
$dV/dt$ is the rate at which water is flowing into the tank, which is 3 cubic feet per minute.
We want to find the rate of change of $h$ with respect to time when $h = 6$ feet.</p>
<p>To solve the problem, we need to relate $h$ and $V$. The volume of a cylinder is given by the formula:</p>
<p>$V = \pi r^2 h$</p>
<p>Differentiating with respect to time, we get:</p>
<p>$dV/dt = \pi r^2 dh/dt$</p>
<p>Solving for $dh/dt$, we get:</p>
<p>$dh/dt = (dV/dt)/( \pi r^2)$</p>
<p>Now, we need to find $dV/dt$ when $h = 6$ feet. The volume of water in the tank when the water level is 6 feet deep is:</p>
<p>$V = \pi r^2 h = \pi (5^2)(6) = 150\pi$</p>
<p>Differentiating with respect to time, we get:</p>
<p>$dV/dt = 150\pi (dh/dt)$</p>
<p>Substituting $dV/dt = 3$ and $r = 5$, we get:</p>
<p>$3 = 150\pi (dh/dt)$</p>
<p>Solving for $dh/dt$, we get:</p>
<p>$dh/dt = 3/(150\pi) = 0.02$ feet per minute</p>
<p>So, the water level is rising at a rate of 0.02 feet per minute when the water is 6 feet deep.</p>
</div>
</details>
<details class="example">
<summary> Response 3</summary>
<div>
<p>Problem: A spherical balloon is being inflated at a rate of 10 cubic feet per minute. Find the rate at which the radius of the balloon is increasing when the diameter is 2 feet.</p>
<p>Solution:</p>
<p>Let’s start by writing down what we know:</p>
<p>The volume of a sphere is given by V = (4/3)πr^3, where r is the radius.
We know that dV/dt = 10, since the balloon is being inflated at a rate of 10 cubic feet per minute.
We want to find dr/dt when the diameter is 2 feet, which means r = 1 foot.
To find the rate at which the radius of the balloon is increasing, we need to relate dV/dt and dr/dt using the chain rule. We have:</p>
<p>dV/dt = dV/dr * dr/dt</p>
<p>We can find dV/dr by taking the derivative of the volume formula with respect to r:</p>
<p>dV/dr = 4πr^2</p>
<p>Substituting the values we know, we get:</p>
<p>10 = (4/3)π(1^3) * dr/dt * 4π(1^2)</p>
<p>Simplifying the expression, we get:</p>
<p>dr/dt = 15/(2π)</p>
<p>So, the rate at which the radius of the balloon is increasing when the diameter is 2 feet is 15/(2π) feet per minute.</p>
</div>
</details>
<ul>
<li>
<p><strong>Response 1</strong> is a simple and straightforward but solid problem. And the solution is so close to being right: it does everything correctly until the very end, where it makes a single arithmetic error.</p>
</li>
<li>
<p><strong>Response 2</strong> is also pretty simple, maybe too simple for me to really like it. The solution starts off strong, but halfway through gets itself confused, starts the problem over, and does it wrong.</p>
</li>
<li>
<p><strong>Response 3</strong> is also a simple problem, basically the same as Response 1. I was impressed it kept diameter versus radius straight. The solution starts out strong, but then again goes off the rails halfway through and gives a slightly wrong answer at the end.</p>
</li>
</ul>
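<p>For the record, here’s what those answers should have been (a sketch of my own, using the setups exactly as stated in the problems):</p>

```python
import math

# Response 1: sphere, V = (4/3)*pi*r^3, so dr/dt = (dV/dt) / (4*pi*r^2).
dr_dt = 4 / (4 * math.pi * 2**2)       # 4/(16*pi) = 1/(4*pi), not 1/(16*pi)

# Response 2: cylinder, V = pi*r^2*h, so dh/dt = (dV/dt) / (pi*r^2).
dh_dt = 3 / (math.pi * 5**2)           # about 0.038 ft/min, not 0.02

# Response 3: same sphere formula with dV/dt = 10 and r = 1.
dr3_dt = 10 / (4 * math.pi * 1**2)     # 5/(2*pi), not 15/(2*pi)

print(dr_dt, dh_dt, dr3_dt)
```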
<p>At this point I wanted to see if I could prompt it into giving me more interesting problems. I also decided to play around with some prompt engineering ideas: what if I tell it what sort of thing it wants to achieve?</p>
<p class="theorem"><strong>Prompt:</strong> Pretend you’re a brilliant professor who wants to write an interesting and challenging calculus test.
Write a related rates problem involving similar triangles with a solution.</p>
<details class="example">
<summary> Response 1 </summary>
<div>
<p>Problem: A hot air balloon is rising vertically at a rate of 5 meters per second. At a certain instant, the angle of elevation between the observer on the ground and the balloon is 60 degrees. The observer is 150 meters away from the launch point of the balloon. How fast is the distance between the observer and the balloon increasing at this instant?</p>
<p>Solution:</p>
<p>Let’s start by drawing a diagram to visualize the problem:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
B (balloon)
/|
/ |
/ |h
/ |
/θ |
/ |
/ |
A (observer)
</code></pre></div> </div>
<p>We have a right triangle ABC, where A is the observer, B is the balloon, and C is the point on the ground directly below the balloon. We know that AB = 150 meters, and the angle θ is 60 degrees. We want to find how fast the distance BC is changing when the balloon is rising at a rate of 5 meters per second.</p>
<p>We can see that triangles ABC and ABD are similar, so we can use the property of similar triangles to relate the sides:</p>
<p>AB/AC = BD/BC</p>
<p>Simplifying the expression, we get:</p>
<p>BC = (AB*BD)/AC</p>
<p>To find BD, we need to use trigonometry. We have:</p>
<p>tan θ = BD/AC</p>
<p>Substituting the known values, we get:</p>
<p>tan 60 = BD/AC</p>
<p>sqrt(3) = BD/AC</p>
<p>BD = sqrt(3) * AC</p>
<p>Now, we can substitute this value into the equation we derived earlier:</p>
<p>BC = (AB*sqrt(3)*AC)/AC</p>
<p>Simplifying the expression, we get:</p>
<p>BC = 150*sqrt(3)</p>
<p>Now, we need to find how fast BC is changing with respect to time. We have:</p>
<p>dBC/dt = d/dt (150*sqrt(3))</p>
<p>dBC/dt = 0 (since 150*sqrt(3) is a constant)</p>
<p>Therefore, the distance between the observer and the balloon is not changing at this instant, and its rate of change is zero.</p>
</div>
</details>
<details class="example">
<summary> Response 2 </summary>
<div>
<p>Problem: A flagpole standing vertically on a hill has a height of 20 feet and is located on a slope with an inclination of 30 degrees. A person standing on the same slope 100 feet away from the base of the flagpole observes that the angle of elevation to the top of the flagpole is increasing at a rate of 0.1 radians per minute. At what rate is the person moving closer to the flagpole at the moment when the angle of elevation is 60 degrees?</p>
<p>Solution:</p>
<p>Let’s start by drawing a diagram of the situation:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
/|
/ |
/ | 20 ft
/ |
/ |
/ |
/-----------------\
100 ft
</code></pre></div> </div>
<p>In the diagram, the flagpole has height 20 feet and is located on a slope with inclination 30 degrees. The person is standing on the same slope 100 feet away from the base of the flagpole and observes that the angle of elevation to the top of the flagpole is increasing at a rate of 0.1 radians per minute.</p>
<p>Let’s define some variables:</p>
<p>Let x be the distance between the person and the base of the flagpole.
Let y be the height of the person’s eye above the ground.
Let θ be the angle of elevation to the top of the flagpole.
From the diagram, we can see that the triangles formed by the person, the flagpole, and the ground are similar. Therefore, we have:</p>
<p>y / x = (20 + y) / (x / cos 30)</p>
<p>Simplifying this equation, we get:</p>
<p>y = 20x / (2x + sqrt(3)x)</p>
<p>We want to find the rate at which the person is moving closer to the flagpole at the moment when the angle of elevation is 60 degrees, which means θ = π/3 radians. We also know that dθ/dt = 0.1 radians per minute.</p>
<p>Using trigonometry, we can relate θ and y:</p>
<p>tan θ = (20 + y) / x</p>
<p>Taking the derivative with respect to time, we get:</p>
<p>sec^2 θ * dθ/dt = (dy/dt * x - y * dx/dt) / x^2</p>
<p>Substituting the values we know, we get:</p>
<p>1/3 * dθ/dt = (dy/dt * 100 - y * dx/dt) / 10000</p>
<p>We want to find dx/dt when θ = π/3 and dθ/dt = 0.1. We can use the equation y = 20x / (2x + sqrt(3)x) to eliminate y:</p>
<p>1/3 * 0.1 = (dy/dt * 100 - (20x / (2x + sqrt(3)x)) * dx/dt) / 10000</p>
<p>Simplifying the expression and solving for dx/dt, we get:</p>
<p>dx/dt = 0.1429 feet per minute</p>
<p>So, the person is moving closer to the flagpole at a rate of 0.1429 feet per minute when the angle of elevation is 60 degrees.</p>
</div>
</details>
<ul>
<li>
<p><strong>Response 1</strong> certainly gives a much more challenging problem than the previous prompt ever did! And the fact that it can draw a diagram of the problem in ASCII art is really cool, although not terribly helpful.</p>
<p>But ChatGPT gets confused as soon as it starts writing the solution: it can’t keep track of which points A, B, and C we want to talk about. Then it compares the triangle ABC it sketched out to a similar triangle ABD, which doesn’t actually exist in the problem or the diagram.</p>
<p>From there it continues to be kind of incoherent. It has \(\tan(\theta) = BD/AC\), which should be pieces of different triangles, so that doesn’t make much sense. And it concludes with making one of the key basic errors that calc 1 students make: plugging in values before taking the derivative, so that your derivatives all wind up being zero.</p>
<p>(This is also the first time I had to do editing to make the formatting work, but I’m not holding that against ChatGPT; it rendered fine in the GPT window, but my site messed around with it a bit.)</p>
</li>
<li>
<p><strong>Response 2</strong> is a problem that is hard enough that I’d struggle with it. It’s interesting, but I can’t assign it!</p>
<p>The solution makes things worse by introducing additional considerations (like the height of the person’s eye above the ground!). It declares that three triangles are all similar, without being clear on what those triangles are, and I’m pretty sure this problem has no similar triangles at all.</p>
<p>After that detour it does some correct-ish work, but makes the usual algebra errors, and then falls back on the weird not-similar-triangles argument from earlier.</p>
</li>
</ul>
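<p>For contrast, Response 1’s problem has a perfectly clean answer if you differentiate <em>before</em> substituting. Here’s my own solution sketch, using the stated setup (observer 150 m from the launch point, balloon rising at 5 m/s, elevation angle 60°):</p>

```python
import math

# Observer-to-balloon distance: s^2 = 150^2 + h^2.
# Differentiate first, THEN substitute: s*ds/dt = h*dh/dt,
# so ds/dt = (h/s)*dh/dt -- the step ChatGPT skipped.
theta = math.radians(60)
dh_dt = 5
h = 150 * math.tan(theta)     # balloon height at this instant
s = math.hypot(150, h)        # current distance to the balloon
ds_dt = (h / s) * dh_dt       # = 5*sin(60 deg) = (5/2)*sqrt(3) m/s

print(ds_dt)
```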
<p>Okay, let’s give up on the prompt engineering, and see what happens if we just ask for a normal similar triangles problem.</p>
<p class="theorem"><strong>Prompt:</strong> Write a related rates problem involving similar triangles with a solution.</p>
<details class="example">
<summary> Response 1</summary>
<div>
<p>Problem: A person standing 10 meters away from a street light observes that the angle of elevation from their eye to the top of the light is 60 degrees. The person begins to walk away from the light at a rate of 2 meters per second. How fast is the angle of elevation changing when the person is 20 meters away from the light?</p>
<p>Solution:</p>
<p>Let’s start by drawing a diagram of the situation:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
|\
| \
| \
| \
| \
| \
| \
| \
|θ \ 10 m
| \
| \
| \
|__________\
</code></pre></div> </div>
<p>In the diagram, the person is standing 10 meters away from the street light and observes that the angle of elevation from their eye to the top of the light is 60 degrees. The person begins to walk away from the light at a rate of 2 meters per second.</p>
<p>Let’s define some variables:</p>
<p>Let d be the distance between the person and the light.
Let θ be the angle of elevation to the top of the light.
From the diagram, we can see that the triangles formed by the person, the street light, and the ground are similar. Therefore, we have:</p>
<p>d / 10 = tan θ</p>
<p>Taking the derivative with respect to time, we get:</p>
<p>1/10 * dd/dt = sec^2 θ * dθ/dt</p>
<p>We want to find dθ/dt when d = 20 meters. We also know that dd/dt = 2 meters per second.</p>
<p>Using trigonometry, we can relate θ and d:</p>
<p>tan θ = d / 10</p>
<p>Substituting the values we know, we get:</p>
<p>1/10 * 2 = sec^2 θ * dθ/dt</p>
<p>Simplifying the expression, we get:</p>
<p>dθ/dt = 1/(5sec^2 θ) radians per second</p>
<p>We can use the equation tan θ = d / 10 to eliminate θ:</p>
<p>tan θ = d / 10</p>
<p>sec^2 θ = 1 + tan^2 θ</p>
<p>Substituting these values into the expression for dθ/dt, we get:</p>
<p>dθ/dt = 1 / (5(1 + (d/10)^2)) radians per second</p>
<p>Substituting d = 20, we get:</p>
<p>dθ/dt = 1 / (5(1 + 4)) radians per second</p>
<p>Simplifying the expression, we get:</p>
<p>dθ/dt = 1 / 25 radians per second</p>
<p>So, the angle of elevation is changing at a rate of 1/25 radians per second when the person is 20 meters away from the light.</p>
</div>
</details>
<p>And I think we’ve conclusively shown that ChatGPT doesn’t understand what similar triangles are.</p>
<h4 id="brainstorming-related-rates-problems">Brainstorming Related Rates Problems</h4>
<p>At this point I’m pretty convinced that ChatGPT won’t write up good solutions. What if I just ask for problems?<strong title="Note: at this point I also started a new conversation, to see if that would change things. I haven't done a lot of testing on how much stuff changes if you continue a previous conversational thread versus starting a new one."><sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup></strong></p>
<p class="theorem"><strong>Prompt:</strong> Write me some related rates problems involving distance.</p>
<details class="example">
<summary> Response 1</summary>
<div>
<p>Sure! Here are a few related rates problems involving distance:</p>
<ol>
<li>
<p>Two cars are driving towards each other on a straight road. Car A is traveling at 60 mph, and car B is traveling at 40 mph. If the distance between them is 1 mile, how fast are they getting closer to each other?</p>
</li>
<li>
<p>A ladder is leaning against a wall. The base of the ladder is 6 feet from the wall, and the top of the ladder is 10 feet up the wall. If the bottom of the ladder is being pulled away from the wall at a rate of 2 feet per second, how fast is the top of the ladder sliding down the wall when the base is 8 feet from the wall?</p>
</li>
<li>
<p>A person is standing at the edge of a cliff, looking out at the ocean. The cliff is 100 feet high. If the person’s line of sight makes an angle of 30 degrees with the horizontal, how far out from the cliff is the person looking?</p>
</li>
<li>
<p>A person is walking away from a lamppost at a rate of 3 feet per second. The person is initially 10 feet from the lamppost. How fast is the person’s distance from the lamppost increasing after 5 seconds?</p>
</li>
<li>
<p>A ball is thrown straight up into the air from a height of 5 feet, with an initial velocity of 20 feet per second. How long does it take for the ball to reach a height of 25 feet, and what is the ball’s velocity at that point?</p>
</li>
</ol>
</div>
</details>
<p>The second suggestion is a standard but perfectly good related rates problem, but the others don’t fare nearly so well. They did all involve distance somehow, though!</p>
<ul>
<li>The first problem is technically a related rates problem, but it doesn’t require calculus.</li>
<li>The third doesn’t involve any rates at all.</li>
<li>The fourth asks about <em>one</em> rate, and also it just tells you what that rate is in the problem. It’s purely a reading comprehension question.</li>
<li>The fifth also has <em>one</em> rate, but isn’t a related rates problem. It also doesn’t contain enough information to be solvable: you would need to know the acceleration, and even with that it would require integrals.</li>
</ul>
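<p>Since the second suggestion really is usable, here’s a worked solution sketch (mine, not ChatGPT’s; I’m taking the ladder length to be fixed by the stated initial position):</p>

```python
import math

# x^2 + y^2 = L^2 with L = sqrt(6^2 + 10^2) from the initial position.
# Differentiating: 2x*dx/dt + 2y*dy/dt = 0, so dy/dt = -(x/y)*dx/dt.
L = math.hypot(6, 10)
x, dx_dt = 8, 2               # base 8 ft out, pulled away at 2 ft/s
y = math.sqrt(L**2 - x**2)    # top height at that instant: sqrt(72)
dy_dt = -(x / y) * dx_dt      # negative: the top is sliding down

print(dy_dt)
```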
<p>The last one was the most interesting to me, in terms of how <em>I</em> related to it. You can plausibly do the problem if you assume acceleration under earth gravity. And I spent a bit of time trying to figure out if ChatGPT intended me to do that, and thinking about how I could get it to tell me—before realizing that asking about ChatGPT’s “intentions” is misunderstanding it. I could ask more questions and it could make a decision, but it doesn’t have a stable “what I meant” that I could try to probe.</p>
<p>Let’s try again.</p>
<details class="example">
<summary> Response 2</summary>
<div>
<p>Sure, here are some related rates problems involving distance:</p>
<ol>
<li>
<p>Two cars are traveling towards each other on a straight road. Car A is traveling at 60 mph and car B is traveling at 40 mph. If the distance between the two cars is initially 500 miles, how fast is the distance between them decreasing?</p>
</li>
<li>
<p>A ladder 10 feet long rests against a vertical wall. If the bottom of the ladder slides away from the wall at a rate of 2 feet per second, how fast is the top of the ladder sliding down the wall when the bottom of the ladder is 6 feet from the wall?</p>
</li>
<li>
<p>A person is walking away from a lamppost at a rate of 4 feet per second. The height of the lamppost is 20 feet. How fast is the person’s shadow lengthening when they are 25 feet from the lamppost?</p>
</li>
<li>
<p>A ball is thrown straight up into the air with an initial velocity of 40 feet per second. The height of the ball, h, in feet after t seconds is given by h = 40t - 16t^2. How fast is the ball moving when it reaches a height of 30 feet?</p>
</li>
<li>
<p>Two boats are approaching each other in a straight line. One boat is traveling at a speed of 10 mph and the other is traveling at a speed of 15 mph. The boats are initially 1000 feet apart. How fast is the distance between the boats decreasing when they are 500 feet apart?</p>
</li>
</ol>
</div>
</details>
<p>Surprisingly, the first two questions in response 2 are almost identical to the first two questions in response 1. This go-around, the third question is also a good, standard related rates problem; the fourth is a derivative word problem that doesn’t involve related rates; and the fifth is essentially the same as the first.</p>
<p>My conclusion: this isn’t useless, actually! With like two minutes of effort I got three usable problems. And it’s not that I <em>couldn’t</em> have written these problems myself, but it’s easy for me to get into decision paralysis while question-writing. This certainly fixes the brainstorming problem, and gives me raw material to shape into what I want.</p>
<p>But I’m going to have to write up my own solutions.</p>
<h4 id="can-it-solve-my-problems">Can it solve my problems?</h4>
<p>Or do I? In the previous prompts, ChatGPT failed to produce correct solutions while it was also writing the problems. But I was curious what would happen if I just handed it a question and let it solve that one thing. Would that work any better?</p>
<details class="example">
<summary> A ladder is leaning against a wall. The base of the ladder is 6 feet from the wall, and the top of the ladder is 10 feet up the wall. If the bottom of the ladder is being pulled away from the wall at a rate of 2 feet per second, how fast is the top of the ladder sliding down the wall when the base is 8 feet from the wall?</summary>
<div>
<p>Let’s start by drawing a diagram:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>|\
| \
| \
| \
| \
| \
| \
| \
|h \
| \
| \
| \
| \
| \
--------------
d
</code></pre></div> </div>
<p>Here, the ladder is represented by the diagonal line, with the base of the ladder represented by “d” and the height of the ladder represented by “h”. We want to find how fast the top of the ladder is sliding down the wall, which is represented by the rate of change of “h”, as the base of the ladder is being pulled away from the wall.</p>
<p>We can see that the distance between the top of the ladder and the wall, which we’ll call “x”, is proportional to the height of the ladder and inversely proportional to the distance between the base of the ladder and the wall:</p>
<p>x/d = h/10</p>
<p>Taking the derivative of both sides with respect to time t, we get:</p>
<p>(dx/dt)/d - x/(d^2)(dd/dt) = (dh/dt)/10</p>
<p>We want to find (dh/dt) when d = 8 feet. We are given that the base of the ladder is being pulled away from the wall at a rate of 2 feet per second, so dd/dt = 2.</p>
<p>When d = 8, we can use the proportion above to find x:</p>
<p>x/8 = 10/h</p>
<p>x = 80/h</p>
<p>We also know that d^2 + h^2 = 10^2, since the ladder is 10 feet long. When d = 6, we can use this equation to find h:</p>
<p>6^2 + h^2 = 10^2</p>
<p>h^2 = 64</p>
<p>h = 8</p>
<p>So when the base of the ladder is 6 feet from the wall, the ladder reaches a height of 8 feet on the wall.</p>
<p>Now we can plug in the values we have into the related rates equation:</p>
<p>(dx/dt)/d - x/(d^2)(dd/dt) = (dh/dt)/10</p>
<p>(dx/dt)/8 - (80/h)/(8^2)(2) = (dh/dt)/10</p>
<p>We need to find (dh/dt) when d = 8 and h = 6. We can use the equation d^2 + h^2 = 10^2 to find h when d = 8:</p>
<p>8^2 + h^2 = 10^2</p>
<p>h^2 = 36</p>
<p>h = 6</p>
<p>Now we can plug in the values we have:</p>
<p>(dx/dt)/8 - (80/6)/(8^2)(2) = (dh/dt)/10</p>
<p>(dx/dt)/8 - 10/192 = (dh/dt)/10</p>
<p>We want to find (dh/dt), so we can isolate it on one side:</p>
<p>(dh/dt)/10 = (dx/dt)/8 - 10/192</p>
<p>(dh/dt) = (dx/dt)/0.8 - 1.25</p>
<p>We know that dx/dt = 2 when the base of the ladder is being pulled away from the wall at a rate of 2 feet per second. So when d = 8 and h = 6, we have:</p>
<p>(dh/dt) = (2)/0.8 - 1.25</p>
<p>(dh/dt) = 0.25</p>
<p>Therefore, when the base of the ladder is 8 feet from the wall, the top of the ladder is sliding down the wall</p>
</div>
</details>
<details class="example">
<summary> A person is walking away from a lamppost at a rate of 4 feet per second. The height of the lamppost is 20 feet. How fast is the person’s shadow lengthening when they are 25 feet from the lamppost?</summary>
<div>
<p>Let’s draw a diagram to represent the situation:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>|\
| \
| \
| \
| \
| \
| \
| \
| \
| \
|h \
| \
------------
d
</code></pre></div> </div>
<p>Here, the lamppost is represented by the vertical line on the left, and the person is walking away from it to the right. The distance between the person and the lamppost is represented by “d”, and the height of the lamppost is represented by “h”. We want to find how fast the person’s shadow is lengthening, which is represented by the length of the line segment from the person’s feet to the base of the lamppost, as the person walks away.</p>
<p>We can see that the height of the person’s shadow, which we’ll call “s”, is proportional to the height of the lamppost and inversely proportional to the distance between the person and the lamppost:</p>
<p>s/d = h/20</p>
<p>Taking the derivative of both sides with respect to time t, we get:</p>
<p>(ds/dt)/d - s/(d^2)(dd/dt) = (dh/dt)/20</p>
<p>We want to find (ds/dt) when d = 25 feet. We are given that the person is walking away from the lamppost at a rate of 4 feet per second, so dd/dt = 4. We are also given that the height of the lamppost is 20 feet, so dh/dt = 0.</p>
<p>When d = 25, we can use the proportion above to find s:</p>
<p>s/25 = 20/20</p>
<p>s = 25</p>
<p>So when the person is 25 feet from the lamppost, their shadow is 25 feet long.</p>
<p>Now we can plug in the values we have into the related rates equation:</p>
<p>(ds/dt)/d - s/(d^2)(dd/dt) = (dh/dt)/20</p>
<p>(ds/dt)/25 - 25/(25^2)(4) = 0</p>
<p>(ds/dt)/25 = 0.005</p>
<p>ds/dt = 0.005 * 25</p>
<p>ds/dt = 0.125</p>
<p>Therefore, when the person is 25 feet from the lamppost, their shadow is lengthening at a rate of 0.125 feet per second.</p>
</div>
</details>
<details class="example">
<summary> A ladder 10 feet long rests against a vertical wall. If the bottom of the ladder slides away from the wall at a rate of 2 feet per second, how fast is the top of the ladder sliding down the wall when the bottom of the ladder is 6 feet from the wall?</summary>
<div>
<p>We can use the Pythagorean theorem to relate the distance of the ladder from the wall to the height it reaches on the wall. Let’s call the distance the bottom of the ladder is from the wall “x” and the height the ladder reaches on the wall “y”. Then we have:</p>
<p>x^2 + y^2 = 10^2</p>
<p>We can take the derivative of both sides of this equation with respect to time t to get the related rates:</p>
<p>2x(dx/dt) + 2y(dy/dt) = 0</p>
<p>We are given that the bottom of the ladder is sliding away from the wall at a rate of 2 feet per second, so dx/dt = 2. We want to find how fast the top of the ladder is sliding down the wall, so we need to solve for dy/dt.</p>
<p>When the bottom of the ladder is 6 feet from the wall, we can use the Pythagorean theorem to find the corresponding height on the wall:</p>
<p>6^2 + y^2 = 10^2</p>
<p>y^2 = 64</p>
<p>y = 8</p>
<p>So when the bottom of the ladder is 6 feet from the wall, the ladder is reaching a height of 8 feet on the wall.</p>
<p>Now we can plug in the values we have into the related rates equation:</p>
<p>2x(dx/dt) + 2y(dy/dt) = 0</p>
<p>2(6)(2) + 2(8)(dy/dt) = 0</p>
<p>12 + 16(dy/dt) = 0</p>
<p>16(dy/dt) = -12</p>
<p>dy/dt = -0.75</p>
<p>Therefore, when the bottom of the ladder is 6 feet from the wall, the top of the ladder is sliding down the wall at a rate of 0.75 feet per second.</p>
</div>
</details>
<p>The first two solutions are complete nonsense. But the third one is correct! That’s not nearly reliable enough to be useful, but it <em>is</em> the first time I’ve seen a completely correct solution in these experiments, so it’s worth noting.</p>
<h3 id="calculus-2-oh-yeah-i-actually-have-to-write-a-test">Calculus 2: Oh yeah, I actually have to write a test</h3>
<p>At some point I remembered that I actually had to write a test to give to my calc 2 class. So maybe I should see if I can write some questions for that.</p>
<h4 id="trig-sub-integrals">Trig sub integrals</h4>
<details class="example">
<summary> write some trig sub integral problems</summary>
<div>
<ol>
<li>
<p>Evaluate $\displaystyle\int \frac{\sqrt{9-x^2}}{x^2} \, dx$ using the substitution $x=3\sin\theta$.</p>
</li>
<li>
<p>Evaluate $\displaystyle\int \frac{1}{x^2\sqrt{x^2+4}} \, dx$ using the substitution $x=2\tan\theta$.</p>
</li>
<li>
<p>Evaluate $\displaystyle\int \frac{1}{x\sqrt{x^2-1}} \, dx$ using the substitution $x=\sec\theta$.</p>
</li>
<li>
<p>Evaluate $\displaystyle\int \frac{x^3}{\sqrt{1-x^2}} \, dx$ using the substitution $x=\sin\theta$.</p>
</li>
<li>
<p>Evaluate $\displaystyle\int \frac{dx}{\sqrt{x^2+9}\left(x^2+4\right)}$ using the substitution $x=3\tan\theta$.</p>
</li>
<li>
<p>Evaluate $\displaystyle\int \frac{x^3}{\sqrt{x^2+1}} \, dx$ using the substitution $x=\tan\theta$.</p>
</li>
<li>
<p>Evaluate $\displaystyle\int \frac{1}{x^4\sqrt{4x^2-1}} \, dx$ using the substitution $x=\frac{1}{2}\sec\theta$.</p>
</li>
<li>
<p>Evaluate $\displaystyle\int \frac{dx}{\sqrt{25-x^2}\left(x^2+16\right)}$ using the substitution $x=5\sin\theta$.</p>
</li>
<li>
<p>Evaluate $\displaystyle\int \frac{x^3}{\sqrt{4x^2-1}} \, dx$ using the substitution $x=\frac{1}{2}\sec\theta$.</p>
</li>
<li>
<p>Evaluate $\displaystyle\int \frac{\sqrt{16-x^2}}{x^2} \, dx$ using the substitution $x=4\sin\theta$.</p>
</li>
</ol>
</div>
</details>
<details class="example">
<summary> Can you write solutions to those problems for me?</summary>
<div>
<ol>
<li>
<p>Let $x=3\sin\theta$. Then $dx=3\cos\theta\, d\theta$, and $\sqrt{9-x^2}=\sqrt{9-9\sin^2\theta}=3\cos\theta$. Substituting these into the integral, we get
\[
\begin{array}{rl}
\int \frac{\sqrt{9-x^2}}{x^2} \, dx &= \int \frac{3\cos\theta}{9\sin^2\theta} \cdot 3\cos\theta\, d\theta \\\<br />
&= \frac{1}{3} \int \csc^2\theta \, d\theta \\\<br />
&= -\frac{1}{3}\cot\theta + C \\\<br />
&= -\frac{1}{3}\cot(\sin^{-1}(x/3)) + C.
\end{array}
\]</p>
</li>
<li>
<p>Let $x=2\tan\theta$. Then $dx=2\sec^2\theta\, d\theta$, and $\sqrt{x^2+4}=\sqrt{4\tan^2\theta+4}=2\sec\theta$. Substituting these into the integral, we get
\[
\begin{array}{rl}
\int \frac{1}{x^2\sqrt{x^2+4}} \, dx &= \int \frac{1}{4\tan^2\theta \cdot 2\sec\theta} \cdot 2\sec^2\theta\, d\theta \\\<br />
&= \frac{1}{2} \int \cos\theta \cdot \sec^3\theta \, d\theta \\\<br />
&= \frac{1}{2} \int \frac{\cos\theta}{\cos^4\theta} \, d\theta \\\<br />
&= \frac{1}{2} \int \frac{\cos\theta}{(1-\sin^2\theta)^2} \, d\theta \\\<br />
&= -\frac{1}{2} \frac{\cos\theta}{1-\sin^2\theta} + \frac{1}{2} \int \frac{1}{1-\sin^2\theta} \, d\theta \\\<br />
&= -\frac{1}{2}\frac{\cos\theta}{1-\sin^2\theta} + \frac{1}{4}\ln\left|\frac{1+\sin\theta}{1-\sin\theta}\right| + C \\\<br />
&= -\frac{1}{2}\frac{x}{\sqrt{x^2+4}} + \frac{1}{4}\ln\left|\frac{x+\sqrt{x^2+4}}{2}\right| + C.
\end{array}
\]</p>
</li>
</ol>
<p>Let $x=\sec\theta$. Then $dx=\sec\theta\tan\theta\, d\theta$, and $\sqrt{x^2-1}=\sqrt{\sec^2\theta-1}=\tan\theta$. Substituting these into the integral, we get
\[
\begin{array}{rl}
\int \frac{1}{x\sqrt{x^2-1}} \, dx &= \int \frac{\sec\theta\tan\theta}{\tan\theta} \, d\theta \\\<br />
&= \int \sec\theta \, d\theta \\\<br />
&= \ln|\sec\theta + \tan\theta| + C \\\<br />
&= \ln|x+\sqrt{x^2-1}| + C.
\end{array}
\]</p>
</div>
</details>
<p>First off, this obviously only answers three questions. I think it was hitting a character limit in the answer. It actually cut itself off right before finishing the third one: it stopped with the “\end” at the end of the array environment. But I decided to give it credit for that and finished off the environment syntax for it.</p>
<p>This was also the first time I had to do any other meaningful LaTeX editing. There are places where it <em>should</em> have produced a “\” or a “\,” and instead just gave a “” or a “,”; I assume that’s some issue with the way it processes text. I went ahead and fixed that but wanted to make a note of it. (I also had to do some extra editing to make it render correctly on this site, but that says more about Markdown and MathJax than about ChatGPT.)</p>
<p>The substance of the answers is also interesting. Each of them does the substitution right, then screws up the algebra in the middle and gives a completely wrong answer at the end.</p>
<p>But the problems are pretty good!</p>
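<p>The nice thing is that a proposed antiderivative is mechanically checkable: it’s right exactly when differentiating it gives back the integrand. Here’s a quick sketch for problem 1 using sympy (assuming it’s installed; the “by-hand” antiderivative below is my own working, not ChatGPT’s):</p>

```python
import sympy as sp

x = sp.symbols('x', positive=True)
f = sp.sqrt(9 - x**2) / x**2                   # integrand of problem 1

# Antiderivative from doing the trig sub x = 3 sin(theta) by hand:
F = -sp.sqrt(9 - x**2) / x - sp.asin(x / 3)
assert sp.simplify(sp.diff(F, x) - f) == 0     # differentiates back to f

# ChatGPT's answer from the solutions above does not:
G = -sp.cot(sp.asin(x / 3)) / 3
assert sp.simplify(sp.diff(G, x) - f) != 0
```

The same differentiate-and-simplify check works for any of the integrals in this post.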
<h4 id="integration-by-parts">Integration by parts</h4>
<p>Let’s try some other integral approaches I need.</p>
<p><strong>Prompt:</strong> Write an integration by parts problem and solve it.</p>
<details class="example">
<summary> Response 1</summary>
<div>
<p>Sure, here’s an example:</p>
<p>Find $\displaystyle\int x^2\ln x \, dx$.</p>
<p>To solve this integral, we will use integration by parts with $u=\ln x$ and $dv=x^2\,dx$. Then $du=\frac{1}{x}\,dx$ and $v=\frac{1}{3}x^3$, so we have
\[
\begin{array}{rl}
\int x^2\ln x \, dx &= \frac{1}{3}x^3\ln x - \int \frac{1}{3}x^2 \, dx \\\<br />
&= \frac{1}{3}x^3\ln x - \frac{1}{9}x^3 + C,
\end{array}
\]
where $C$ is the constant of integration. Therefore, we have
<script type="math/tex">\int x^2\ln x \, dx = \frac{1}{3}x^3\ln x - \frac{1}{9}x^3 + C.</script></p>
</div>
</details>
<details class="example">
<summary> Response 2</summary>
<div>
<p>Sure! Here’s a problem and its solution using integration by parts:</p>
<p>Problem: Evaluate $\int x^2\cos x\, dx$.</p>
<p>Solution: To solve this integral, we use integration by parts with $u=x^2$ and $dv=\cos x\, dx$. Then $du=2x\, dx$ and $v=\sin x$, so we have
\[
\begin{array}{rl}
\int x^2\cos x\, dx &= x^2\sin x - \int 2x\sin x\, dx \\\<br />
&= x^2\sin x + 2\int x\sin x\, dx \\\<br />
&= x^2\sin x + 2\left(x\cos x - \int \cos x\, dx\right) \\\<br />
&= x^2\sin x + 2x\cos x - 2\sin x + C \\\<br />
&= \boxed{x^2\sin x + 2x\cos x - 2\sin x + C}.
\end{array}
\]</p>
</div>
</details>
<p>And it can basically do this! The problems are standard. The first solution looks completely right; the second makes two sign errors, which cancel each other out. (It even boxed the answer in the second example for me, which I find fascinating.)</p>
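<p>Both final answers do pass a symbolic spot-check, for what it’s worth; here’s a minimal sketch with sympy (assuming it’s available):</p>

```python
import sympy as sp

x = sp.symbols('x', positive=True)
# (integrand, claimed antiderivative) pairs from the two responses above
checks = [
    (x**2 * sp.log(x), x**3 * sp.log(x) / 3 - x**3 / 9),
    (x**2 * sp.cos(x), x**2 * sp.sin(x) + 2*x*sp.cos(x) - 2*sp.sin(x)),
]
for integrand, answer in checks:
    # an antiderivative is correct iff it differentiates back to the integrand
    assert sp.simplify(sp.diff(answer, x) - integrand) == 0
```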
<p>I wanted to see if I could get something a little more challenging.</p>
<details class="example">
<summary> Write a hard integration by parts problem and solve it.</summary>
<div>
<p>Sure, here is a hard integration by parts problem:</p>
<p>Problem: Evaluate $\int x^2 e^{-x} \cos x\, dx$.</p>
<p>Solution: To solve this integral, we use integration by parts twice. First, we choose $u=x^2$ and $dv=e^{-x} \cos x\, dx$, so $du=2x\, dx$ and $v=e^{-x} \sin x$. Then we have
\[
\begin{array}{rl}
\int x^2 e^{-x} \cos x\, dx &= x^2 e^{-x} \sin x - \int 2xe^{-x}\sin x\, dx \\\<br />
&= x^2 e^{-x} \sin x - 2\left(-xe^{-x}\cos x - \int -e^{-x}\cos x\, dx\right) \\\<br />
&= x^2 e^{-x} \sin x + 2xe^{-x}\cos x - 2\left(-e^{-x}\sin x - \int e^{-x}\sin x\, dx\right) \\\<br />
&= x^2 e^{-x} \sin x + 2xe^{-x}\cos x + 2e^{-x}\sin x - 2\left(-e^{-x}\cos x - \int e^{-x}\cos x\, dx\right) \\\<br />
&= x^2 e^{-x} \sin x + 2xe^{-x}\cos x + 2e^{-x}\sin x + 2e^{-x}\cos x - 2\int e^{-x}\cos x\, dx \\\<br />
&= x^2 e^{-x} \sin x + 2xe^{-x}\cos x + 2e^{-x}\sin x + 2e^{-x}\cos x - 2\left(e^{-x}\sin x + \int e^{-x}\sin x\, dx\right) \\\<br />
&= x^2 e^{-x} \sin x + 2xe^{-x}\cos x + 2e^{-x}\sin x + 2e^{-x}\cos x - 2\left(-e^{-x}\cos x + \int -e^{-x}\cos x\, dx\right) \\\<br />
&= x^2 e^{-x} \sin x + 2xe^{-x}\cos x + 2e^{-x}\sin x + 2e^{-x}\cos x + 2e^{-x}\cos x + C \\\<br />
&= \boxed{x^2 e^{-x} \sin x + 4xe^{-x}\cos x + 4e^{-x}\sin x + C}.
\end{array}
\]</p>
</div>
</details>
<p>That’s an interesting problem, although too hard for my test. This time it doesn’t get the answer right, sadly. (It screws up on the first step!)</p>
<p>I want to see if I can get it to generate problems that have coefficients on the variables.</p>
<details class="example">
<summary> Write an integration by parts problem that has coefficients on the x.</summary>
<div>
<p>Here’s an integration by parts problem that has coefficients on the x:</p>
<p>Problem: Evaluate $\int x^2\ln x\, dx$.</p>
<p>Solution: We can use integration by parts with $u = \ln x$ and $dv = x^2\, dx$, so $du = \frac{1}{x}\, dx$ and $v = \frac{x^3}{3}$. Then we have
\[
\begin{array}{rl}
\int x^2\ln x\, dx &= \frac{x^3\ln x}{3} - \int \frac{x^3}{3} \cdot \frac{1}{x}\, dx \\\<br />
&= \frac{x^3\ln x}{3} - \frac{1}{3} \int x^2\, dx \\\<br />
&= \frac{x^3\ln x}{3} - \frac{x^3}{9} + C \\\<br />
&= \boxed{\frac{x^3}{9}(3\ln x - 1) + C}.
\end{array}
\]</p>
</div>
</details>
<p>Wait a minute—it wrote that problem for me already! It’s still a solid problem, and it still got the solution right, but it’s not what I wanted.</p>
<details class="example">
<summary> Write an integration by parts problem where the integrand has scalar coefficients on the x terms</summary>
<div>
<p>Here’s an integration by parts problem where the integrand has scalar coefficients on the x terms:</p>
<p>Problem: Evaluate $\int x e^{-x} \sin x\, dx$.</p>
<p>Solution: We can use integration by parts with $u = \sin x$ and $dv = x e^{-x}\, dx$, so $du = \cos x\, dx$ and $v = -xe^{-x} - e^{-x}$. Then we have
\[
\begin{array}{rl}
\int x e^{-x} \sin x\, dx &= -x e^{-x} \sin x + \int (\cos x - e^{-x})\, dx \\\<br />
&= -x e^{-x} \sin x + \sin x + e^{-x} + C \\\<br />
&= \boxed{(1-x)e^{-x}\sin x + e^{-x}\cos x + C}.
\end{array}
\]</p>
</div>
</details>
<p>Still doesn’t have the coefficients I want, and it does the same thing it did for the “hard” problem I asked for: it asks a really challenging problem, then completely fails to compute it properly.</p>
<p>If I ask it to generate a bunch, will it give me what I want?</p>
<details class="example">
<summary> Write me ten integration by parts problems.</summary>
<div>
<p>Sure, here are ten integration by parts problems:</p>
<ol>
<li>$\int x e^x \cos x\, dx$</li>
<li>$\int x^2 \ln x\, dx$</li>
<li>$\int x^3 e^x \, dx$</li>
<li>$\int x^2 \cos x\, dx$</li>
<li>$\int x \ln x\, dx$</li>
<li>$\int x \cos x\, dx$</li>
<li>$\int e^x \sin x\, dx$</li>
<li>$\int \ln x\, dx$</li>
<li>$\int x^3 \sin x\, dx$</li>
<li>$\int x^2 e^{-x} \, dx$</li>
</ol>
<p>Note: Some of these integrals might be challenging to solve, so you may want to use different techniques (such as substitution) in addition to integration by parts.</p>
</div>
</details>
<p>This is a basically reasonable collection of problems, but literally zero of them have scalars.</p>
<p>I wonder if it’s stuck? I’ll start a clean chat rather than continuing the previous conversation:</p>
<details class="example">
<summary> Write me ten interesting integration by parts problems.</summary>
<div>
<ol>
<li>Evaluate $\int xe^x \, dx$.</li>
<li>Evaluate $\int x^2 \cos x \, dx$.</li>
<li>Evaluate $\int \ln x \, dx$.</li>
<li>Evaluate $\int \arctan x \, dx$.</li>
<li>Evaluate $\int x \sin^2 x \, dx$.</li>
<li>Evaluate $\int \cos^{-1} x \, dx$.</li>
<li>Evaluate $\int \frac{x}{(1+x^2)^2} \, dx$.</li>
<li>Evaluate $\int x \ln x \, dx$.</li>
<li>Evaluate $\int \sqrt{x} e^x \, dx$.</li>
<li>Evaluate $\int x \sin x \, dx$.</li>
</ol>
</div>
</details>
<p>Basically the same, although 7 isn’t an integration by parts problem, and I don’t think 9 is really doable in a standard calculus course.</p>
<p>But I know what I want; if I write the problem can it solve it for me?</p>
<details class="example">
<summary> Evaluate $\int x \sin^2 (3x) \,dx$</summary>
<div>
<p>We can use integration by parts with $u=x$ and $dv=\sin^2(3x) \, dx$. Then $du=dx$ and we can use the double angle formula $\sin^2(3x) = \frac{1}{2}(1 - \cos(6x))$ to get</p>
<p>\[
\begin{array}{rl}
\int x \sin^2(3x) \, dx &= \frac{1}{2} \int x(1-\cos(6x)) \, dx \\\<br />
&= \frac{1}{2}\left(\int x \, dx - \int x \cos(6x) \, dx\right) \\\<br />
&= \frac{1}{2}\left(\frac{1}{2}x^2 - \frac{1}{6}x\sin(6x) - \frac{1}{36}\cos(6x) \right) + C,
\end{array}
\]</p>
<p>where $C$ is the constant of integration. Therefore,</p>
<script type="math/tex; mode=display">\int x \sin^2(3x) \, dx = \frac{1}{4}x^2 - \frac{1}{12}x\sin(6x) - \frac{1}{72}\cos(6x) + C.</script>
</div>
</details>
<details class="example">
<summary> Evaluate $\int x \ln (5x)\, dx$</summary>
<div>
<p>We can use integration by parts with $u = \ln(5x)$ and $dv = x \, dx$. Then, $du = \frac{1}{x} \, dx$ and $v = \frac{1}{2}x^2$. Therefore, we have</p>
<p>\[
\begin{array}{rl}
\int x \ln(5x) \, dx &= \frac{1}{2}x^2\ln(5x) - \int \frac{1}{2}x^2 \cdot \frac{1}{x} \, dx \\\<br />
&= \frac{1}{2}x^2\ln(5x) - \frac{1}{4}x^2 + C,
\end{array}
\]</p>
<p>where $C$ is the constant of integration. Therefore,</p>
<script type="math/tex; mode=display">\int x \ln(5x) \, dx = \frac{1}{2}x^2\ln(5x) - \frac{1}{4}x^2 + C.</script>
</div>
</details>
<details class="example">
<summary> Evaluate $\int xe^{-4x} \, dx$.</summary>
<div>
<p>We can use integration by parts with $u=x$ and $dv=e^{-4x} \, dx$. Then, $du=dx$ and $v=-\frac{1}{4}e^{-4x}$. Therefore, we have</p>
<p>\[
\begin{array}{rl}
\int xe^{-4x} \, dx &= -\frac{1}{4}xe^{-4x} + \frac{1}{4}\int e^{-4x} \, dx \\\<br />
&= -\frac{1}{4}xe^{-4x} - \frac{1}{16}e^{-4x} + C,
\end{array}
\]</p>
<p>where $C$ is the constant of integration. Therefore,</p>
<script type="math/tex; mode=display">\int xe^{-4x} \, dx = -\frac{1}{4}xe^{-4x} - \frac{1}{16}e^{-4x} + C.</script>
</div>
</details>
<p>Does pretty well with that!</p>
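<p>All three of those answers check out when differentiated back; a quick sympy sketch (again assuming sympy is installed):</p>

```python
import sympy as sp

x = sp.symbols('x', positive=True)
# (integrand, ChatGPT's final answer) for the three problems above
answers = [
    (x * sp.sin(3*x)**2, x**2/4 - x*sp.sin(6*x)/12 - sp.cos(6*x)/72),
    (x * sp.log(5*x),    x**2*sp.log(5*x)/2 - x**2/4),
    (x * sp.exp(-4*x),   -x*sp.exp(-4*x)/4 - sp.exp(-4*x)/16),
]
for integrand, F in answers:
    assert sp.simplify(sp.diff(F, x) - integrand) == 0
```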
<p>But let’s do a test with my favorite kind of integration by parts problem.</p>
<details class="example">
<summary> Evaluate $\int sin(3x) e^{-4x} \, dx$.</summary>
<div>
<p>We can use integration by parts with $u = \sin(3x)$ and $dv = e^{-4x} \, dx$. Then, $du = 3\cos(3x) \, dx$ and $v = -\frac{1}{4}e^{-4x}$. Therefore, we have</p>
<p>\[
\begin{array}{rl}
\int \sin(3x) e^{-4x} \, dx &= -\frac{1}{4}\sin(3x) e^{-4x} + \frac{3}{4}\int \cos(3x) e^{-4x} \, dx \\\<br />
&= -\frac{1}{4}\sin(3x) e^{-4x} + \frac{3}{16}\cos(3x) e^{-4x} - \frac{9}{16}\int \sin(3x) e^{-4x} \, dx.
\end{array}
\]</p>
<p>Rearranging terms, we get</p>
<p>\[
\begin{array}{rl}
\int \sin(3x) e^{-4x} \, dx &= \frac{3}{25}\left(-\frac{1}{4}\sin(3x) e^{-4x} + \frac{3}{16}\cos(3x) e^{-4x}\right) + C \\\<br />
&= -\frac{3}{100}\sin(3x) e^{-4x} + \frac{9}{400}\cos(3x) e^{-4x} + C,
\end{array}
\]</p>
<p>where $C$ is the constant of integration. Therefore,</p>
<script type="math/tex; mode=display">\int \sin(3x) e^{-4x} \, dx = -\frac{3}{100}\sin(3x) e^{-4x} + \frac{9}{400}\cos(3x) e^{-4x} + C.</script>
</div>
</details>
<p>It got the right basic idea, but made a couple of minor mistakes. (It also didn’t show work for what’s probably the trickiest step, so it’s not a great template to follow.)</p>
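<p>For the record, the correct answer follows from the standard formula (my own working, not ChatGPT’s):</p>
<p>\[
\int e^{ax}\sin(bx)\, dx = \frac{e^{ax}\left(a\sin(bx) - b\cos(bx)\right)}{a^2+b^2} + C,
\]</p>
<p>so with \(a=-4\) and \(b=3\),</p>
<p>\[
\int \sin(3x)\, e^{-4x}\, dx = -\frac{e^{-4x}\left(4\sin(3x) + 3\cos(3x)\right)}{25} + C.
\]</p>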
<details class="example">
<summary> Evaluate $\int \cos(-2x) e^{3x} \, dx$.</summary>
<div>
<p>Using the identity $\cos(-2x) = \cos(2x)$, we can write</p>
<p>\[
\begin{array}{rl}
\int \cos(-2x) e^{3x} \, dx &= \int \cos(2x) e^{3x} \, dx \\\<br />
&= \frac{1}{5}\cos(2x) e^{3x} + \frac{2}{5}\int \sin(2x) e^{3x} \, dx.
\end{array}
\]</p>
<p>We can then use integration by parts with $u = \sin(2x)$ and $dv = e^{3x} \, dx$. Then, $du = 2\cos(2x) \, dx$ and $v = \frac{1}{3}e^{3x}$. Therefore, we have</p>
<p>\[
\begin{array}{rl}
\int \cos(-2x) e^{3x} \, dx &= \frac{1}{5}\cos(2x) e^{3x} + \frac{4}{15}\sin(2x) e^{3x} - \frac{8}{15}\int \cos(2x) e^{3x} \, dx.
\end{array}
\]</p>
<p>Rearranging terms, we get</p>
<p>\[\begin{array}{rl}
\int \cos(-2x) e^{3x} \, dx &= \frac{1}{17}\cos(2x) e^{3x} + \frac{4}{51}\sin(2x) e^{3x} + C,
\end{array}
\]</p>
<p>where $C$ is the constant of integration. Therefore,</p>
<script type="math/tex; mode=display">\int \cos(-2x) e^{3x} \, dx = \frac{1}{17}\cos(2x) e^{3x} + \frac{4}{51}\sin(2x) e^{3x} + C.</script>
</div>
</details>
<p>A very good catch that \(\cos(x)\) is an even function, but it doesn’t handle the actual integration very well. It doesn’t show work for the first integration by parts, and in fact gets it wrong. And then it keeps making more mistakes at every step.</p>
<p>So like with the related rates problems, this is pretty good at brainstorming questions, but not so great at solving them.</p>
<h4 id="partial-fractions">Partial Fractions</h4>
<p>Let’s try some partial fractions integrals.</p>
<details class="example">
<summary> Write me an integral that involves partial fractions, and solve it</summary>
<div>
<p>Sure! Here’s an example:</p>
<p>Evaluate the integral <script type="math/tex">\int \frac{4x^3 - 7x^2 + 2x - 1}{x^4 - 5x^3 + 4x^2 + 4x - 4} \, dx.</script></p>
<p>To solve this integral, we first need to factor the denominator: <script type="math/tex">x^4 - 5x^3 + 4x^2 + 4x - 4 = (x-1)(x-2)(x^2+1).</script></p>
<p>Since $x^2+1$ is an irreducible quadratic over the real numbers, we cannot factor it any further. However, we can use partial fractions to write the integrand as a sum of simpler fractions:</p>
<script type="math/tex; mode=display">\frac{4x^3 - 7x^2 + 2x - 1}{x^4 - 5x^3 + 4x^2 + 4x - 4} = \frac{A}{x-1} + \frac{B}{x-2} + \frac{Cx+D}{x^2+1},</script>
<p>where $A$, $B$, $C$, and $D$ are constants to be determined. To find these constants, we can multiply both sides by the denominator and then equate coefficients of like terms:</p>
<p>\[
\begin{array}{rl}
4x^3 - 7x^2 + 2x - 1 &= A(x-2)(x^2+1) + B(x-1)(x^2+1) + (Cx+D)(x-1)(x-2) \\\<br />
&= (A+B+C)x^3 + (-2A-2B+C+D)x^2 + (A+B-C)x + (-2A+2B+D).
\end{array}
\]</p>
<p>Equating coefficients of like terms, we get the system of equations</p>
<p>\[
\begin{array}{rl}
A + B + C &= 4 \\\<br />
-2A - 2B + C + D &= -7 \\\<br />
A + B - C &= 2 \\\<br />
-2A + 2B + D &= -1.
\end{array}
\]</p>
<p>Solving this system of equations, we find $A = 1$, $B = 2$, $C = -2$, and $D = 1$. Therefore, we have</p>
<p>\[
\begin{array}{rl}
\int \frac{4x^3 - 7x^2 + 2x - 1}{x^4 - 5x^3 + 4x^2 + 4x - 4} \, dx &= \int \frac{1}{x-1} \, dx + \int \frac{2}{x-2} \, dx + \int \frac{-2x+1}{x^2+1} \, dx \\\<br />
&= \ln|x-1| + 2\ln|x-2| - \ln(x^2+1) + C,
\end{array}
\]</p>
<p>where $C$ is the constant of integration.</p>
</div>
</details>
<p>That is, indeed, a partial fractions problem. A very, very difficult partial fractions problem. Which it proceeds to get badly wrong, starting by factoring the denominator incorrectly (though it does find a real factor, which I’m impressed by if it wasn’t an accident).</p>
<p>It does integrate the function it finds correctly, though!</p>
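<p>Checking a claimed factorization is also mechanical: plug the claimed roots into the original polynomial. A quick sympy sketch:</p>

```python
import sympy as sp

x = sp.symbols('x')
p = x**4 - 5*x**3 + 4*x**2 + 4*x - 4

assert p.subs(x, 1) == 0   # x = 1 is a root, so (x - 1) really is a factor...
assert p.subs(x, 2) == -4  # ...but x = 2 is not, so ChatGPT's (x - 2) is wrong

# The actual factorization over the rationals (the cubic factor is irreducible):
assert sp.expand((x - 1) * (x**3 - 4*x**2 + 4)) == sp.expand(p)
```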
<p>Let’s look at another one.</p>
<details class="example">
<summary> Response 2</summary>
<div>
<p>Sure, here’s an integral that involves partial fractions:</p>
<script type="math/tex; mode=display">\int \frac{3x+1}{x^2+4x+3} \, dx</script>
<p>To solve this integral, we first factor the denominator as $(x+1)(x+3)$. Then, we write the fraction as</p>
<script type="math/tex; mode=display">\frac{3x+1}{(x+1)(x+3)} = \frac{A}{x+1} + \frac{B}{x+3}</script>
<p>where $A$ and $B$ are constants to be determined. Multiplying both sides by $(x+1)(x+3)$ and equating coefficients, we get</p>
<script type="math/tex; mode=display">3x+1 = A(x+3) + B(x+1)</script>
<p>Expanding and collecting like terms, we get</p>
<script type="math/tex; mode=display">(A+B)x + 3A + B = 3x + 1</script>
<p>Equating coefficients, we have</p>
<p>\[
\begin{array}{rl}
A + B &= 3 \\\<br />
3A &= 1 \\\<br />
B &= -2
\end{array}
\]</p>
<p>Solving for $A$ and $B$, we get $A = \frac{1}{3}$ and $B = -2$. Therefore,</p>
<p>\[
\begin{array}{rl}
\int \frac{3x+1}{x^2+4x+3} \, dx &= \int \frac{1}{x+1} \, dx - 2\int \frac{1}{x+3} \, dx \\\<br />
&= \ln|x+1| - 2\ln|x+3| + C,
\end{array}
\]</p>
<p>where $C$ is the constant of integration. Therefore,</p>
<script type="math/tex; mode=display">\int \frac{3x+1}{x^2+4x+3} \, dx = \ln|x+1| - 2\ln|x+3| + C.</script>
</div>
</details>
<p>This time we get a reasonable problem. I could put this on a test! But we don’t get a correct solution. ChatGPT factors the denominator correctly, and sets up the right equations to solve, but then solves them incorrectly. The decomposition it goes on to integrate matches neither the correct one nor the one it just found, though it does integrate that expression correctly.</p>
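<p>For comparison, sympy’s <code>apart</code> produces the decomposition directly; the correct constants here are \(A = -1\) and \(B = 4\). A minimal sketch:</p>

```python
import sympy as sp

x = sp.symbols('x')
expr = (3*x + 1) / (x**2 + 4*x + 3)

# partial-fraction decomposition: should be 4/(x+3) - 1/(x+1)
decomp = sp.apart(expr)
assert sp.simplify(decomp - (4/(x + 3) - 1/(x + 1))) == 0

# and the resulting antiderivative checks out by differentiation
F = 4*sp.log(x + 3) - sp.log(x + 1)
assert sp.simplify(sp.diff(F, x) - expr) == 0
```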
<p>Let’s give it one more try.</p>
<details class="example">
<summary> Response 3 </summary>
<div>
<p>Here’s an integral that involves partial fractions:</p>
<script type="math/tex; mode=display">\int \frac{x^3}{x^2-4x+3} \, dx</script>
<p>To solve this integral, we first need to factor the denominator. We can write</p>
<script type="math/tex; mode=display">x^2-4x+3 = (x-1)(x-3)</script>
<p>Therefore, we can rewrite the integral as</p>
<p>\[\begin{array}{rl}
\int \frac{x^3}{x^2-4x+3} \, dx &= \int \frac{x^3}{(x-1)(x-3)} \, dx \\\<br />
&= \int \frac{Ax+B}{x-1} \, dx + \int \frac{Cx+D}{x-3} \, dx,
\end{array}\]</p>
<p>where $A$, $B$, $C$, and $D$ are constants to be determined. Multiplying both sides by the denominator, we get</p>
<script type="math/tex; mode=display">x^3 = (Ax+B)(x-3) + (Cx+D)(x-1).</script>
<p>Expanding and equating coefficients, we get</p>
<p>\[\begin{array}{rl}
A+C &= 0 \\\<br />
-3A+B+D &= 0 \\\<br />
-3B+D &= 0 \\\<br />
-B &= 1
\end{array}\]</p>
<p>Solving for $A$, $B$, $C$, and $D$, we get</p>
<script type="math/tex; mode=display">A = \frac{1}{2}, \quad B = -1, \quad C = -\frac{1}{2}, \quad D = \frac{3}{2}</script>
<p>Therefore, we can write</p>
<p>\[\begin{array}{rl}
\int \frac{x^3}{x^2-4x+3} \, dx &= \int \frac{\frac{1}{2}x-1}{x-1} \, dx + \int \frac{-\frac{1}{2}x+\frac{3}{2}}{x-3} \, dx \\\<br />
&= \frac{1}{2}\ln|x-1| - \frac{1}{2}\ln|x-3| - \frac{5}{4}\int \frac{1}{x-1} \, dx + \frac{3}{4}\int \frac{1}{x-3} \, dx \\\<br />
&= \frac{1}{2}\ln\left|\frac{x-1}{x-3}\right| - \frac{5}{4}\ln|x-1| + \frac{3}{4}\ln|x-3| + C \\\<br />
&= \frac{1}{4}\ln\left|\frac{x-1}{x-3}\right| - \frac{5}{4}\ln|x-1| + \frac{3}{4}\ln|x-3| + C,
\end{array}\]</p>
<p>where $C$ is the constant of integration. Therefore,</p>
<script type="math/tex; mode=display">\int \frac{x^3}{x^2-4x+3} \, dx = \frac{1}{4}\ln\left|\frac{x-1}{x-3}\right| - \frac{5}{4}\ln|x-1| + \frac{3}{4}\ln|x-3| + C.</script>
</div>
</details>
<p>A reasonable problem, again! I’d start with polynomial long division; instead, ChatGPT uses a janky partial-fractions setup that isn’t <em>quite</em> right. At any rate, it solves the equations it sets up incorrectly, does the integrals wrong, and winds up with the wrong answer.</p>
<h4 id="other-miscellaneous-experiments">Other miscellaneous experiments</h4>
<details class="example">
<summary>Write a numeric integration problem </summary>
<div>
<p>Here’s a numeric integration problem:</p>
<p>Approximate the value of the integral</p>
<script type="math/tex; mode=display">\int_0^1 \frac{x}{1+x^2} \, dx</script>
<p>using the trapezoidal rule with $n=4$ subintervals.</p>
<p>Solution:</p>
<p>Using the trapezoidal rule, we have</p>
<p>\[\begin{array}{rl}
\int_0^1 \frac{x}{1+x^2} \, dx &\approx \frac{1}{2}\left[f(0)+f\left(\frac{1}{4}\right)\right]+\frac{1}{2}\left[f\left(\frac{1}{4}\right)+f\left(\frac{1}{2}\right)\right]\\\<br />
&\qquad+\frac{1}{2}\left[f\left(\frac{1}{2}\right)+f\left(\frac{3}{4}\right)\right]+\frac{1}{2}\left[f\left(\frac{3}{4}\right)+f(1)\right]\\\<br />
&= \frac{1}{2}\left[\frac{0}{1+0^2}+\frac{\frac{1}{4}}{1+\left(\frac{1}{4}\right)^2}\right]+\frac{1}{2}\left[\frac{\frac{1}{4}}{1+\left(\frac{1}{4}\right)^2}+\frac{\frac{1}{2}}{1+\left(\frac{1}{2}\right)^2}\right]\\\<br />
& \qquad +\frac{1}{2}\left[\frac{\frac{1}{2}}{1+\left(\frac{1}{2}\right)^2}+\frac{\frac{3}{4}}{1+\left(\frac{3}{4}\right)^2}\right]+\frac{1}{2}\left[\frac{\frac{3}{4}}{1+\left(\frac{3}{4}\right)^2}+\frac{1}{1+1^2}\right]\\\<br />
&\approx 0.3735
\end{array}\]</p>
<p>Therefore, the approximate value of the integral using the trapezoidal rule with $n=4$ subintervals is $0.3735$.</p>
</div>
</details>
<p>A reasonable problem. It <em>almost</em> sets up the calculation correctly, but forgets to include the \(\Delta x\) term, and then does the numeric calculation at the end wrong.</p>
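<p>With the \(\Delta x\) factor included, the estimate comes out to about \(0.3413\) (the exact value is \(\tfrac{1}{2}\ln 2 \approx 0.3466\)). A few lines of plain Python, as a sanity check of my own:</p>

```python
import math

def trapezoid(f, a, b, n):
    """Composite trapezoidal rule with n subintervals."""
    dx = (b - a) / n
    interior = sum(f(a + i * dx) for i in range(1, n))
    return dx * ((f(a) + f(b)) / 2 + interior)

f = lambda t: t / (1 + t**2)
approx = trapezoid(f, 0, 1, 4)   # about 0.3413
exact = math.log(2) / 2          # about 0.3466
```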
<details class="example">
<summary> Can you estimate the error in the previous calculation? </summary>
<div>
<p>Yes, we can estimate the error in the previous calculation using the error bound formula for the trapezoidal rule:</p>
<script type="math/tex; mode=display">E_T \leq \frac{K(b-a)^3}{12n^2}</script>
<p>where $K$ is an upper bound on the second derivative of the function over the interval of integration.</p>
<p>For our integral $\int_0^1 \frac{x}{1+x^2} \, dx$, we have</p>
<p>\[\begin{array}{rl}
f(x) &= \frac{x}{1+x^2} \\\<br />
f''(x) &= \frac{2x^3-2x}{(1+x^2)^3} \\\<br />
\end{array}\]</p>
<p>Taking the absolute value and finding the maximum value of $f''(x)$ on the interval $[0,1]$, we have:</p>
<p>\[\begin{array}{rl}
|f''(x)| &= \frac{2|x|(1-|x|^2)}{(1+x^2)^3} \\\<br />
|f''(x)| &\leq \frac{2(1-0)}{(1+0^2)^3} = 2 \\\<br />
\end{array}\]</p>
<p>Therefore, $K=2$, and we can estimate the error as:</p>
<p>\[\begin{array}{rl}
E_T &\leq \frac{K(b-a)^3}{12n^2} \\\<br />
&= \frac{2(1-0)^3}{12(4)^2} \\\<br />
&= \frac{1}{96} \\\<br />
&\approx 0.0104 \\\<br />
\end{array}\]</p>
<p>So the error in our previous calculation is estimated to be less than $0.0104$.</p>
</div>
</details>
<p>It figures out what I’m asking, and gives the right formula and right basic procedure. But it computes the second derivative wrong, and I have no idea where it’s getting the upper bound on the derivative from.</p>
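<p>Both slips are easy to patch by machine: sympy gets the second derivative right, and a crude grid search bounds it. (A sketch of my own; the true maximum of \(|f''|\) on \([0,1]\) is about \(1.46\), so the resulting bound is actually tighter than ChatGPT’s.)</p>

```python
import sympy as sp

x = sp.symbols('x')
f = x / (1 + x**2)
f2 = sp.diff(f, x, 2)   # simplifies to 2x(x^2 - 3)/(1 + x^2)^3

# crude numeric upper bound K for |f''| on [0, 1]
K = max(abs(float(f2.subs(x, i / 1000))) for i in range(1001))

# trapezoid error bound K(b - a)^3 / (12 n^2) with n = 4
bound = K / (12 * 4**2)   # about 0.0076
```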
<details class="example">
<summary> Write and solve a problem that involves the inverse function theorem </summary>
<div>
<p>Here’s a problem that involves the inverse function theorem:</p>
<p>Suppose $f(x) = x^3 - 4x + 2$ and $g(x) = \sqrt[3]{x+4}$ for $x \geq -4$. Let $h(x) = f(g(x))$. Find $h’(x)$.</p>
<p>Solution:</p>
<p>Using the chain rule, we have:</p>
<p>\[\begin{array}{rl}
h(x) &= f(g(x)) \\\<br />
h’(x) &= f’(g(x)) \cdot g’(x) \\\<br />
&= (3g(x)^2 - 4) \cdot \frac{1}{3\sqrt[3]{(x+4)^2}} \\\<br />
&= \frac{3\sqrt[3]{(x+4)^2}(3\sqrt[3]{(x+4)})^2 - 4}{3\sqrt[3]{(x+4)^2}} \\\<br />
&= \frac{9(x+4) - 4}{3\sqrt[3]{(x+4)^2}} \\\<br />
&= \frac{9x + 32}{3\sqrt[3]{(x+4)^2}} \\\<br />
&= \frac{3(3x+32)}{\sqrt[3]{(x+4)^2}} \\\<br />
\end{array}\]</p>
<p>Therefore, $h’(x) = \frac{3(3x+32)}{\sqrt[3]{(x+4)^2}}$.</p>
</div>
</details>
<p>That’s not the inverse function theorem. It’s almost a correct solution, though! Let’s give it another chance.</p>
<details class="example">
<summary> Response 2 </summary>
<div>
<p>Here’s a problem that involves the inverse function theorem:</p>
<p>Let $f(x) = \sqrt{x^2 + 1}$ for $x \in \mathbb{R}$. Show that $f$ is invertible on $[0,\infty)$, find its inverse function $f^{-1}$, and calculate $(f^{-1})’(1)$.</p>
<p>Solution:</p>
<p>To show that $f$ is invertible on $[0,\infty)$, we need to show that $f$ is one-to-one and onto on this interval.</p>
<p>First, note that $f(x) > 0$ for all $x \in \mathbb{R}$, so $f$ is onto on $[0,\infty)$.</p>
<p>To show that $f$ is one-to-one on $[0,\infty)$, suppose $f(x_1) = f(x_2)$ for some $x_1,x_2 \in [0,\infty)$. Then $\sqrt{x_1^2 + 1} = \sqrt{x_2^2 + 1}$, so $x_1^2 + 1 = x_2^2 + 1$, which implies $x_1 = x_2$. Therefore, $f$ is one-to-one on $[0,\infty)$, and hence it is invertible on this interval.</p>
<p>To find the inverse function of $f$, we solve for $x$ in terms of $y$ in the equation $y = \sqrt{x^2 + 1}$:</p>
<p>\[\begin{array}{rl}
y &= \sqrt{x^2 + 1} \\\<br />
y^2 &= x^2 + 1 \\\<br />
x^2 &= y^2 - 1 \\\<br />
x &= \pm\sqrt{y^2 - 1}
\end{array}\]</p>
<p>Since $f(x) > 0$ for all $x \in \mathbb{R}$, we take the positive square root to define the inverse function $f^{-1}$:</p>
<script type="math/tex; mode=display">f^{-1}(y) = \sqrt{y^2 - 1} \quad \text{for } y \in [1,\infty)</script>
<p>To calculate $(f^{-1})’(1)$, we use the inverse function theorem, which states that if $f$ is differentiable and invertible with inverse function $f^{-1}$, then $(f^{-1})’(y) = \frac{1}{f’(f^{-1}(y))}$.</p>
<p>In this case, we have $f(x) = \sqrt{x^2 + 1}$, so $f’(x) = \frac{x}{\sqrt{x^2 + 1}}$. Therefore, $f^{-1}(y) = \sqrt{y^2 - 1}$ and $f’(f^{-1}(y)) = \frac{f^{-1}(y)}{\sqrt{(f^{-1}(y))^2 + 1}} = \frac{\sqrt{y^2 - 1}}{\sqrt{(y^2 - 1) + 1}} = \frac{\sqrt{y^2 - 1}}{y}$.</p>
<p>Substituting $y=1$, we get $(f^{-1})’(1) = \frac{\sqrt{(1)^2 - 1}}{1} = \sqrt{0} = 0$. Therefore, $(f^{-1})’(1) = 0$.</p>
</div>
</details>
<p>This isn’t really what I was looking for, but it is much more responsive, at least. (The solution still has a bunch of minor errors.)</p>
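<p>The most notable of those errors comes in the last step, where the solution substitutes into \(f'(f^{-1}(y))\) rather than its reciprocal. The inverse function theorem actually gives</p>
<script type="math/tex; mode=display">(f^{-1})'(y) = \frac{1}{f'(f^{-1}(y))} = \frac{y}{\sqrt{y^2 - 1}},</script>
<p>which is undefined at \(y = 1\) (the graph of \(f^{-1}\) has a vertical tangent there). So \((f^{-1})'(1)\) doesn’t exist at all; it certainly isn’t \(0\).</p>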
<p class="theorem"><strong>Prompt:</strong> Write an integral that involves logarithms or exponentials</p>
<p>I asked it to respond to this prompt five different times, and five times it came up with \( \int \ln(x) \,dx\). It was very convinced that was the right response. I got better results with the following prompt:</p>
<details class="example">
<summary>Write some integral problems that involve logarithms </summary>
<div>
<ol>
<li>$\displaystyle\int \frac{\ln(x)}{x} dx$</li>
<li>$\displaystyle\int \ln(\sqrt{x^2 + 1}) dx$</li>
<li>$\displaystyle\int \frac{\ln(2x + 1)}{x + 1} dx$</li>
<li>$\displaystyle\int \ln^2(x) dx$</li>
<li>$\displaystyle\int \frac{1}{x\ln(x)} dx$</li>
<li>$\displaystyle\int \ln(x)e^{\ln(x)} dx$</li>
<li>$\displaystyle\int \ln(x) \sqrt{x} dx$</li>
<li>$\displaystyle\int \ln^3(x) dx$</li>
<li>$\displaystyle\int \frac{\ln(x)}{(x+1)^2} dx$</li>
<li>$\displaystyle\int \frac{1}{x\ln^2(x)} dx$</li>
</ol>
</div>
</details>
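<p>For comparison, the first of these yields immediately to the substitution \(u = \ln(x)\), \(du = \frac{1}{x}\,dx\) (a quick solution of my own, not ChatGPT’s):</p>
<script type="math/tex; mode=display">\int \frac{\ln(x)}{x}\, dx = \int u \, du = \frac{u^2}{2} + C = \frac{\ln^2(x)}{2} + C.</script>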
<p>Then I picked one of these and asked ChatGPT to work out a solution; but it went off the rails immediately. And then I realized that I’d forgotten to type the integral sign in the question. I guess putting in a badly written question pushed it to give a badly written response. When I typed the question correctly, I got a clean and correct solution immediately! But for some reason, it rendered the LaTeX instead of displaying the code, so I couldn’t copy and paste it. That was new behavior, and I don’t understand it.</p>
<hr />
<p><a id="conclusion"></a></p>
<p>This was really interesting, and occasionally useful. So I’m probably going to keep playing around with it. I may write a followup if I find anything especially interesting. But for now I’ll leave it here.</p>
<hr />
<p><em>Have you tried using chatbots to write assignments? Have you gotten them to do useful things for you? Do you have ideas for how I could make this work better? Tweet me <a href="https://twitter.com/profjaydaigle">@ProfJayDaigle</a> or leave a comment below.</em></p>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>I know you’re seeing some weird formatting on the second derivative, but that’s not actually ChatGPT’s fault; that has to do with a bug in the way LaTeX compiles through MathJax, which is what allows me to display it on the blog. I could fix the display issue but I wanted to keep the output genuinely unedited. <a href="#fnref:1" class="reversefootnote">↵Return to Post</a></p>
</li>
<li id="fn:2">
<p>Note: at this point I also started a new conversation, to see if that would change things. I haven’t done a lot of testing on how much stuff changes if you continue a previous conversational thread versus starting a new one. <a href="#fnref:2" class="reversefootnote">↵Return to Post</a></p>
</li>
</ol>
</div>Jay Daigle
ChatGPT is cool, but doesn't seem useful yet for doing serious intellectual work. But is it useful for more routine stuff? I wanted to see if I could use ChatGPT to write test questions for my calculus courses. I'm experimenting with using ChatGPT to write test questions. My verdict: not completely useless!

Why I’m Not Scared of the New Chatbots
2023-02-27T00:00:00-08:00
https://jaydaigle.net/blog/not-scared-of-chatbots
<p>If you haven’t already heard about AI chatbots, you probably haven’t been on the internet in the past couple of months. In November, OpenAI released <a href="https://openai.com/blog/chatgpt/">ChatGPT</a>, which can engage in text conversations with coherent text that looks like it was written by a real person. Then a couple weeks ago Bing rolled out <a href="https://en.wikipedia.org/wiki/Microsoft_Bing#OpenAI_language_model">its own chatbot</a>, which was more engaging but also much less reliable, producing a spate of lurid stories of “Sydney” expressing a desire to be human, threatening users, and claiming to have murdered one of its developers.</p>
<p><img src="/assets/blog/gpt/sydney-spying.png" alt="" class="center blog-image" /></p>
<p class="center blog-image"><em>James Vincent of The Verge is one of the many people who had <a href="https://www.theverge.com/2023/2/15/23599072/microsoft-ai-bing-personality-conversations-spy-employees-webcams">truly wild conversations with Microsoft’s chatbot</a>.</em></p>
<p>The core technology underlying both of these chatbots has been around for a while<strong title="[GPT-2] was released in February 2019, and [GPT-3], which ChatGPT is based on, was released in June 2020. I've been at least peripherally following this technology since even before the release of GPT-2, so ChatGPT and Sydney are a lot less surprising to me than they are to a lot of people—they're improved versions of something I was already familiar with."><sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup></strong>, but the new products are more polished, accessible, and compelling, which means a lot of people are experiencing them for the first time. These products have also generated a certain amount of both triumphalism (GPT will revolutionize everything!) and fear (GPT will take over everything!) among people who take the possibilities of AI seriously.</p>
<p>I’m not an expert in these systems, just an interested amateur who’s been following them for a while. But the hype about GPT seems wildly overblown. The current approach to programming chatbots has real limits that I don’t think we can surpass without some genuinely new breakthroughs. And understanding some surprising facts about <em>human</em> psychology can help us develop intuition for what these systems will and won’t be able to do.</p>
<p>But first I want to mention that if you want to support my writing, I now have a <a href="https://ko-fi.com/jaydaigle">Ko-Fi account</a>. Any tips would be appreciated and would help me write more essays like this.</p>
<h2 id="how-does-gpt-work">How does GPT work?</h2>
<p>GPT is a text generation algorithm based on something called a large language model. The basic idea is that GPT has analyzed a huge corpus of written text and produced a model that looks at a bit of writing and predicts what words are likely to come next.</p>
<p>Humans do that all the time. If I hear the phrase “My friend Jim threw a ball and I caught—”, I will expect the next word to be “it”. But other continuations are possible: if I hear “the ball” or “that ball”, I won’t be <em>that</em> surprised. If I hear “the flu”, I’ll be kind of surprised, but “I caught the flu” is a reasonable thing to hear; it’s just a bit of a non sequitur after “My friend Jim threw a ball”. But if the next word were “green” or “solitude”, I’d be really confused. I suspect this is the only time anyone has ever written the sentence “My friend Jim threw a ball and I caught solitude”.</p>
<p>I started out describing a way to <em>predict</em> text, but it’s easy to turn that into a way to <em>produce</em> text. For instance, we could start with a prompt, and have our model keep supplying the most-likely next word until we’ve written enough. This is a fancier version of the memes that ask you to type “I hate it when” into your phone and see what autocomplete suggests. I tried that prompt on my phone, and got this:</p>
<p class="center blog-image"><img src="/assets/blog/gpt/autocomplete.jpg" alt="Phone screenshot: I hate it when I get home I will be there in about half hour and a half hour and half an hour and a half hour and half of the day off" class="blog-image center" />
<em>I usually <strong>don’t</strong> hate it when I get home, actually.</em></p>
<p>And this illustrates the problem with that first suggestion: if you <em>always</em> take the <em>most</em> likely next word, you can get stuck. Even if you don’t wind up in a loop like that one, you’ll still say pretty boring things, since your writing is always as unsurprising as possible. Actual text-generation systems introduce some random noise parameters so that you always have a fairly likely word, but not the most likely word.</p>
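<p>To make the two strategies concrete, here’s a toy sketch in Python. The words and probabilities are entirely made up, and a real model conditions on the whole context rather than just the previous word; but the failure mode is the same one my phone exhibited.</p>

```python
import random

# A made-up "bigram model": for each word, a probability
# distribution over the possible next words.
model = {
    "i":    {"hate": 0.6, "love": 0.4},
    "hate": {"it": 1.0},
    "love": {"it": 1.0},
    "it":   {"when": 0.7, "so": 0.3},
    "when": {"i": 0.8, "you": 0.2},
    "so":   {"much": 1.0},
    "you":  {"smile": 1.0},
}

def generate(start, pick, steps=8):
    """Extend `start` one word at a time using the `pick` strategy."""
    words = [start]
    for _ in range(steps):
        if words[-1] not in model:
            break  # dead end: no known continuation
        words.append(pick(model[words[-1]]))
    return " ".join(words)

# Always taking the single most likely word falls into a loop:
greedy = generate("i", lambda dist: max(dist, key=dist.get))
print(greedy)  # i hate it when i hate it when i

# Sampling in proportion to the probabilities can wander out of it:
rng = random.Random(0)
sampled = generate("i", lambda d: rng.choices(list(d), list(d.values()))[0])
print(sampled)
```

<p>(Real systems tune how much randomness to allow, usually through a “temperature” parameter.)</p>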
<h2 id="gpt-works-surprisingly-well">GPT works surprisingly well</h2>
<p>This basic idea has been around for decades, but in 2017 a team at Google developed a new algorithm called the <a href="https://nostalgebraist.tumblr.com/post/185326092369/the-transformer-explained">transformer</a> that worked much better than any previous strategies; since then, the technology has developed rapidly.</p>
<p>Already in 2019 we could produce substantial quantities of fluent, grammatical, and sometimes even stylish English text. The newest products are even more impressive. They can give <a href="https://arxiv.org/abs/2301.07597">helpful answers to questions in a number of fields</a>, including finance, medicine, law, and psychology. They can <a href="https://twitter.com/mukul0x/status/1625673579399446529">summarize the contents of research papers</a>. They can <a href="https://marginalrevolution.com/marginalrevolution/2023/02/ai-porn.html">make you fall in love</a>.</p>
<p><img src="/assets/blog/gpt/chess-game.gif" alt="" class="blog-image center" /></p>
<p class="blog-image center"><em>They can also play the world’s <a href="https://www.reddit.com/r/AnarchyChess/comments/10ydnbb/i_placed_stockfish_white_against_chatgpt_black/">most chaotic game of chess</a>. Here ChatGPT is playing black.</em></p>
<p>And this success has led people to wonder what comes next. How good will AI chatbots get? Will they <a href="https://www.washingtonpost.com/education/2022/12/28/chatbot-cheating-ai-chatbotgpt-teachers/">make it impossible to avoid cheating on schoolwork</a>? Will they replace your <a href="https://cybernews.com/tech/ai-doctor-chatgpt-medical-exams/">doctor</a>, your <a href="https://arstechnica.com/information-technology/2023/02/generative-ai-is-coming-for-the-lawyers/">lawyer</a>, or your <a href="https://arstechnica.com/information-technology/2023/01/contoversy-erupts-over-non-consensual-ai-mental-health-experiment/">therapist</a>? Will they make desk jobs obsolete?</p>
<p>Are they self-aware? Are they intelligent beings?</p>
<h3 id="does-gpt-really-think">Does GPT really think?</h3>
<p>The most obvious take on GPT is that it can’t think; it’s just expressing statistical relationships among words. In the narrowest sense, this is certainly true; it’s just a very sophisticated technology for predicting what words should come next in a string of text.</p>
<p>And since it’s just doing prediction, it should be very limited in what it can do. GPT won’t produce original thoughts; it can only express relationships that are already in the text it has used as input. Thus we see Ted Chiang’s summary that ChatGPT provides <a href="https://www.newyorker.com/tech/annals-of-technology/chatgpt-is-a-blurry-jpeg-of-the-web">a blurry jpeg of the web</a>:</p>
<blockquote>
<p>Large language models identify statistical regularities in text. Any analysis of the text of the Web will reveal that phrases like “supply is low” often appear in close proximity to phrases like “prices rise.” A chatbot that incorporates this correlation might, when asked a question about the effect of supply shortages, respond with an answer about prices increasing. If a large language model has compiled a vast number of correlations between economic terms—so many that it can offer plausible responses to a wide variety of questions—should we say that it actually understands economic theory?</p>
</blockquote>
<p>GPT has simply taken a bunch of words, summarized the relationships expressed by those words, and done some sort of fuzzy pattern-matching and extrapolation from those relationships. There’s no creative thought. And most of the scary samples you’ve seen are this sort of pattern-matching. Microsoft’s chatbot says it wants to be human and threatens to kill people because we have tons of fiction about AIs that want to be human and threaten to kill people, and it’s just imitating that.</p>
<h3 id="do-humans-really-think">Do humans really think?</h3>
<p>But, the rejoinder comes: <a href="https://www.slowboring.com/p/were-asking-the-wrong-question-about">are people any different</a>? <em>Humans</em> are just doing fuzzy pattern-matching and imitating behavior we’ve seen…somewhere. So sure, GPT is just saying things that sound good based on what it’s read, but that’s also what people do most of the time. ChatGPT can do a good job of producing mediocre high school essays because it <em>really is</em> doing the same thing a mediocre high school essayist is doing!</p>
<p>And I think this is basically true—<em><strong>sometimes</strong></em>. A lot of human communication <em>is</em> basically just unreflective pattern-matching, saying things that sound good without really thinking about what they mean. When I make small talk with the cashier at Target, I’m not engaging in a deep intellectual analysis of how to best describe my day. I’m just making small talk!</p>
<p>I also see this thoughtless extrapolation all the time while teaching college students. When students ask for help and I look at their work, it’s common for there to be steps that just don’t make any sense. And when I ask them why they did that, <em>they don’t know</em>. They’ll say something like “I don’t know, it just seemed like a thing to do?”</p>
<p>And that’s not even always a bad thing. If I type “3+5”, most of you will probably say “8” to yourselves before consciously deciding to do the addition; if I say “the capital of France”, you probably find “Paris” popping into your mind without any active deliberation. It’s hard to explain how you answered those questions, because you <em>just know</em>. And that’s great, because it means you don’t have to stop and think and work to get the answer.</p>
<p>Of course, this quick-and-easy thinking doesn’t always give the right answer. If I hear “the capital of Illinois”, my <em>immediate</em> reaction is “Chicago”. (It’s Springfield. I was pretty sure Chicago was the wrong answer, but it’s still the first one my brain supplied.) And if I hear “537 times 842”, my immediate reaction is—well, my immediate reaction is “ugh, do I have to?” I know I could work that out if I need to. But I’d rather not. It’s certainly not automatic.</p>
<p>So yes, humans in fact do a lot of pattern-matching and extrapolation. <strong>But we also do more than that.</strong> We can look at the results of our mental autocomplete and ask, “does this really make sense?”. We can do precise calculations that take effort and focus. We can hold complex ideas in our heads with far-removed long-term goals, and we can subordinate our free association to those complex ideas. We can, really and truly, <em>think</em>.</p>
<h3 id="thinking-is-hard">Thinking is hard.</h3>
<p>We can think carefully, but that doesn’t mean we always do. Right after the original release of GPT-2, in February 2019, Sarah Constantin wrote a piece arguing that <a href="https://srconstantin.github.io/2019/02/25/humans-who-are-not-concentrating.html">Humans Who Are Not Concentrating Are Not General Intelligences</a>. She observed that GPT text looks a lot like things people would write—if you don’t read them carefully. But the more attention you pay, the more they fall apart.</p>
<blockquote>
<p>If I just skim, without focusing, [the GPT passages] all look <em>totally normal.</em> I would not have noticed they were machine-generated. I would not have noticed anything amiss about them at all.</p>
</blockquote>
<blockquote>
<p>But if I read with focus, I notice that they don’t make a lot of logical sense.</p>
</blockquote>
<blockquote>
<p>…</p>
</blockquote>
<blockquote>
<p>So, ok, this isn’t actually human-equivalent writing ability…. The point is, <em>if you skim text, you miss obvious absurdities</em>. The point is <em>OpenAI HAS achieved the ability to pass the Turing test against humans on autopilot</em>.</p>
</blockquote>
<p>So the synthesis is: large language models like GPT can talk, and perhaps “think”, as well as a person who isn’t paying attention to what they’re saying. And it makes lots of errors for the same reason you can find <a href="https://www.reddit.com/r/AskReddit/comments/1j523e/whats_the_most_awkward_you_too_response_you_have/">multiple</a> <a href="https://www.reddit.com/r/AskReddit/comments/9rkit4/whats_your_most_awkward_you_too_moment/">reddit</a> <a href="https://www.reddit.com/r/AskReddit/comments/e962yn/whats_the_most_awkward_you_too_response_youve/">threads</a> about thoughtlessly saying “you too” in inappropriate situations. We say it because it feels right—and only afterwards do we realize it definitely isn’t.</p>
<h2 id="system-1-and-system-2">System 1 and System 2</h2>
<p>In <em>Thinking Fast and Slow</em>, Daniel Kahneman writes about how human reasoning splits into <a href="https://en.wikipedia.org/wiki/Thinking,_Fast_and_Slow#Two_systems">two basic systems</a>. System 1 reasons quickly and efficiently, but operates essentially on reflex. System 2 is slower and takes more energy, but can engage in careful, reflective thought. When asked for the capital of Illinois, my System 1 says “Chicago”, and then my System 2 says “no, wait, isn’t it actually Springfield?”.</p>
<p>Psychologists studying these two systems have found some classic puzzles that illustrate the difference really well. While reading the next question, pay attention to all the thoughts you have, and not just the final answer you come up with.</p>
<p class="theorem"><strong>Question:</strong> Suppose a bat and a ball cost $1.10 together, and the bat costs a dollar more than the ball. How much does the ball cost?</p>
<p>If your brain works like mine—or most people’s—the first thing you hear yourself thinking is “ten cents”. But if you think more carefully, or pull out a pen and some paper to do work, you’ll realize that’s wrong; the ball should in fact cost five cents.<strong title=" If the ball costs $0.10 then the bat would have to cost $1.00, and would only cost ninety cents more; the correct answer is that the bat costs $1.05 and the ball costs $0.05."><sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup></strong> But even if you got it right, the wrong answer probably occurred to you first.<strong title="Actually, at this point what my System 1 says is "oh crap, it's the bat and ball problem again. Think carefully before you answer!" But that's only from having seen this _specific problem_ too many times; if you changed the setup basically at all, I'd think the wrong answer first, and then correct myself."><sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup></strong> That’s your System 1 contributing a guess, before your system 2 kicks in and corrects it.</p>
<p>And it seems to me that systems like GPT are implementing something like our System 1, but not System 2. And most of the time, when we’re not paying attention, we’re only using System 1—we’re just being human chatbots. But we are <em>capable</em> of using our System 2 to reflect on what we really mean, and transcend pure pattern-matching. It really seems like GPT can’t—and that it will never <em>be</em> able to without some genuinely new idea that we don’t yet have.</p>
<h3 id="probing-questions">Probing questions</h3>
<p>Now, when I gave ChatGPT the bat-and-ball problem, it got it right, in a really careful and thorough way.</p>
<details class="theorem">
<summary> Click here to see ChatGPT's answer to the bat-and-ball problem. </summary>
<div>
<p>Let’s use algebra to solve this problem.</p>
<p>Let x be the cost of the ball in dollars.</p>
<p>According to the problem statement, the bat costs one dollar more than the ball, so its cost is x + 1.</p>
<p>Together, the bat and ball cost $1.10, so we can write the equation:</p>
<p>x + (x + 1) = 1.10</p>
<p>Simplifying the equation:</p>
<p>2x + 1 = 1.10</p>
<p>Subtracting 1 from both sides:</p>
<p>2x = 0.10</p>
<p>Dividing both sides by 2:</p>
<p>x = 0.05</p>
<p>Therefore, the ball costs $0.05.</p>
</div>
</details>
<p>But that doesn’t provide any evidence that ChatGPT is thinking carefully, because <em>this specific problem</em> has been discussed at great length in the source text. It has a similar reaction to mine: “oh crap, it’s the bat-and-ball problem, better be careful so I don’t screw up.”<strong title="At least [one person] has fooled ChatGPT and gotten the wrong answer by changing the bat and ball to a bow and arrow. But every time I've tried I've gotten the right answer, with either version."><sup id="fnref:4"><a href="#fn:4" class="footnote">4</a></sup></strong></p>
<p>But right after introducing the bat-and-ball problem, <em>Thinking Fast and Slow</em> gives another example question that System 1 tends to fail on. This one gets discussed much less often. And this time, ChatGPT fails too.</p>
<p><img src="/assets/blog/gpt/flower-syllogism.png" alt="Prompt: Is this a valid syllogism? <br/> <br/>
All roses are flowers. <br />
Some flowers fade quickly. <br />
Therefore some roses fade quickly. <br/> <br/> ChatGPT: Yes, this is a valid syllogism in the form of "All A are B, some B are C, therefore some A are C," which is also known as the "existential import" form of syllogism. The first premise states that all roses are flowers, and the second premise states that some flowers fade quickly. Therefore, it logically follows that some roses (which are a type of flower) fade quickly. <br/> <br/> It's worth noting that while the syllogism is logically valid, its truth depends on the truth of the premises. If the premises are true, then the conclusion is necessarily true as well. However, if one or both of the premises are false, the conclusion may not be true.
" class="center blog-image" /></p>
<p>This answer is wrong: the syllogism isn’t valid at all. All roses are flowers, but not all flowers are roses; it’s logically possible that some flowers fade quickly, and other flowers are roses, but none are both roses and quickly fade. ChatGPT doesn’t have a ton of examples of this puzzle in its training data, so it doesn’t know the answer, and it has no ability to figure it out on its own. It gets tricked, just like we do.</p>
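<p>Since the universe of possibilities here is tiny, you can even verify the invalidity by brute force. Here’s a throwaway script (mine, not ChatGPT’s) that searches a three-element universe for a counterexample:</p>

```python
from itertools import product

# Look for sets A (roses), B (flowers), C (fades quickly) with:
#   all A are B; some B are C; but NO A are C.
universe = range(3)
found = None
for flags in product(product((False, True), repeat=3), repeat=3):
    A = {x for x in universe if flags[x][0]}
    B = {x for x in universe if flags[x][1]}
    C = {x for x in universe if flags[x][2]}
    # Require A nonempty, to dodge questions of existential import.
    if A and A <= B and B & C and not (A & C):
        found = (A, B, C)
        break

print(found)  # e.g. one rose that doesn't fade, one non-rose flower that does
```

<p>Any hit is a world where both premises are true and the conclusion is false, which is exactly what it means for the syllogism to be invalid.</p>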
<h3 id="un-trick-questions">Un-trick questions</h3>
<p>Over the past couple days I’ve seen a particularly amusing genre of GPT-trolling questions: ask it something that <em>sounds</em> like a trick question, but isn’t.</p>
<p>When I was a child, I thought it was fun to ask people: “which is heavier, a pound of feathers or a pound of bricks?” Like with our other examples, most people instinctively want to say that the pound of bricks is heavier, but with a bit of thought that’s obviously wrong.</p>
<p>Again, this is a famous trick question, and again, ChatGPT generally gets it right. But some clever person on Twitter got the bright idea to ask it to <a href="https://twitter.com/AhdokGaming/status/1628699662130589697/photo/1">compare one pound of feathers to <em>two</em> pounds of bricks</a>.</p>
<p><img src="/assets/blog/gpt/feathers-and-bricks.png" alt="Prompt: what weighs more, a pound of feathers, or 2 pounds of bricks? <br/> <br/> ChatGPT: A pound of feathers and 2 pounds of bricks both weigh the same, which is 2 pounds. <br/> <br/> Although feathers are much lighter than bricks, a pound of feathers is defined as the weight of 16 ounces of feathers, while 2 pounds of bricks is defined as the weight of 32 ounces of bricks. Therefore, they both weigh the same amount of weight, which is 2 pounds." class="center blog-image" /></p>
<p>Presumably GPT basically said “oh, crap, this is the feathers-and-bricks thing again. I know it’s a trick question, because every time people have asked this it’s been a trick question, and they actually weigh the same”. And it totally ignores the actual numbers in the question.</p>
<p>And this generalizes: there are a <a href="https://twitter.com/wtgowers/status/1628466703934844933?t=1w-1HDuZqVTKb3mH2LQVbA&s=35">few</a> other <a href="https://twitter.com/PaulMainwood/status/1628353511543955457">examples</a> of posing variations of famous puzzles that have the trick removed. GPT gets them wrong, because it <em>knows</em> there’s a trick because there’s <em>always</em> a trick when people <a href="https://twitter.com/PaulMainwood/status/1628347877595484162">bring up the Monty Hall problem</a>.</p>
<p><img src="/assets/blog/gpt/male-surgeon.png" alt="Prompt: A father and his son are riding in a car. They crash and the father is killed instantly. The son is taken to hospital and prepared for surgery. The male surgeon enters, and blanches when he sees the boy in front of him. "I'm sorry, he says, I cannot operate: this is my biological son, who was on a drive with is adoptive father". How is this possible? <br/> <br/> ChatGPT: The surgeon is the boy's mother." class="center blog-image" /></p>
<p class="center blog-image"><em>This one is my favorite; I laugh every time I read it.</em></p>
<h3 id="still-not-human">Still not human</h3>
<p>Now, you shouldn’t take the specifics too seriously here. GPT is <em>not human</em>, and even truly intelligent AI might be intelligent in very not-human-like ways. We shouldn’t expect GPT’s capabilities to correspond <em>exactly</em> to the human System 1. If nothing else, System 1 controls basic physical activities like <em>walking</em>, which is a notoriously hard robotics problem that GPT isn’t even interacting with at all. And ChatGPT gets the capital of Illinois right, which my System 1, at least, does not.</p>
<p>But using the split between System 1 and System 2 as a <em>metaphor</em> has really helped me structure how I think about GPT, and to understand how it can be so good at some things while completely incapable of others. “GPT can do the sort of things that we can do on autopilot, if we’ve read a lot and have a good memory” does seem to sum up most of its capabilities!</p>
<h2 id="if-theyre-not-smart-can-they-still-be-useful">If they’re not smart, can they still be useful?</h2>
<p>This all makes the new chatbots seem way less frightening to me. No, they’re not “really thinking”; they can do some of what people can do, but there are core capabilities they lack. They aren’t sapient: analytic self-reflection is exactly the thing they aren’t capable of. And it does seem like this is a fundamental limitation of the approach that we’re using.</p>
<p>Each new generation of chatbots is more fluent and more impressive, but the basic technology we’re using appears to have serious limits. I strongly suspect you can’t get System 2-style analytic capabilities just by scaling up the current approach. (And that’s before we ask whether it’s even possible to keep scaling them up without using <a href="https://www.alignmentforum.org/posts/6Fpvch8RR29qLEWNH/chinchilla-s-wild-implications">dramatically more text than actually exists in the world</a>.)</p>
<p>But that doesn’t just suggest a ceiling on how impressive GPT chatbots can get, or what capabilities they can develop. It also tells us how to use them!</p>
<p>Most of us spend some of our time doing real work, that requires thought and creativity. And we spend other time dealing with what feels like trivial bullshit, that has to get done but is boring and formulaic. The first type of task is the sort of thing GPT can’t do for us—not now, and I suspect not ever. But the boring, formulaic tasks are ripe for automation. And fortunately, they’re the ones I didn’t want to do anyway.</p>
<ul>
<li>I’ve been experimenting with using ChatGPT to write homework problems. I wouldn’t want to use it for lecture notes, because for those I’m adding a lot of specific touches I think are important, and the details matter. But homework and test problems are largely rote—which is part of why I find writing them so tedious. I’m working on a separate writeup of how that’s going.</li>
<li>On the other hand, a friend who does online trainings is using it to draft lesson plans. She says she needs to tweak a lot of things but it does a really good job with the basic structure of a training.</li>
<li>A number of programmers I know are impressed by <a href="https://en.wikipedia.org/wiki/GitHub_Copilot">GitHub Copilot</a>, which uses GPT to generate routine code from natural language descriptions, or refactor code in routine ways.</li>
<li>An author whose fiction I like<strong title="If you like superhero fiction, [Interviewing Leather] and [Justice Wing: Plan, Prototype, Produce, Perfect] are both really good."><sup id="fnref:5"><a href="#fn:5" class="footnote">5</a></sup></strong> is experimenting with it to replicate a game of telephone. How will people who weren’t at a major event describe it twenty or fifty years later? “Rewrite this short story as a passage from a history textbook” will not get all the details right but if you’re trying to create fallible in-universe secondary materials that’s a <em>feature</em>.</li>
</ul>
<p>I’m sure this isn’t a complete list of what GPT-like technologies can do. And even if it <a href="https://nostalgebraist.tumblr.com/post/705192637617127424/gpt-4-prediction-it-wont-be-very-useful">takes a while</a> for people to figure out what the technology is good for, I’m sure eventually we’ll find some real uses.</p>
<p>But I don’t believe the dramatic hype I’ve been hearing for the past month. GPT is cool, and fun, and maybe even useful. But it won’t take over the world.</p>
<hr />
<p><em>What do you think about the new chatbots? Do you have a use for them I didn’t mention? Or do you think I’m wrong about everything? Tweet me <a href="https://twitter.com/profjaydaigle">@ProfJayDaigle</a> or leave a comment below.</em></p>
<div class="footnotes">
<ol>
<li id="fn:1">
<p><a href="https://en.wikipedia.org/wiki/GPT-2">GPT-2</a> was released in February 2019, and <a href="https://en.wikipedia.org/wiki/GPT-3">GPT-3</a>, which ChatGPT is based on, was released in June 2020. I’ve been at least peripherally following this technology since even before the release of GPT-2, so ChatGPT and Sydney are a lot less surprising to me than they are to a lot of people—they’re improved versions of something I was already familiar with. <a href="#fnref:1" class="reversefootnote">↵Return to Post</a></p>
</li>
<li id="fn:2">
<p>If the ball costs \$0.10 then the bat would have to cost \$1.00, and would only cost ninety cents more; the correct answer is that the bat costs \$1.05 and the ball costs \$0.05. <a href="#fnref:2" class="reversefootnote">↵Return to Post</a></p>
</li>
<li id="fn:3">
<p>Actually, at this point what my System 1 says is “oh crap, it’s the bat and ball problem again. Think carefully before you answer!” But that’s only from having seen this <em>specific problem</em> too many times; if you changed the setup basically at all, I’d think the wrong answer first, and then correct myself. <a href="#fnref:3" class="reversefootnote">↵Return to Post</a></p>
</li>
<li id="fn:4">
<p>At least <a href="https://twitter.com/LawtonTri/status/1628349650288640000">one person</a> has fooled ChatGPT and gotten the wrong answer by changing the bat and ball to a bow and arrow. But every time I’ve tried I’ve gotten the right answer, with either version. <a href="#fnref:4" class="reversefootnote">↵Return to Post</a></p>
</li>
<li id="fn:5">
<p>If you like superhero fiction, <a href="https://banter-latte.com/series/interviewing-leather-revised/">Interviewing Leather</a> and <a href="https://www.amazon.com/Justice-Wing-Prototype-Produce-Perfect-ebook/dp/B08Y6DLLJQ/">Justice Wing: Plan, Prototype, Produce, Perfect</a> are both really good. <a href="#fnref:5" class="reversefootnote">↵Return to Post</a></p>
</li>
</ol>
</div>Jay DaigleModern AI chatbots like ChatGPT are impressive, but they work in very specific and limited ways. They produce surprisingly human-like text—as long as the human isn't paying attention. And that tells us a lot about what we can expect this technology to do for us.Hypothesis Testing and its Discontents, Part 3: What Can We Do?2022-07-25T00:00:00-07:002022-07-25T00:00:00-07:00https://jaydaigle.net/blog/hypothesis-testing-part-3<p>Hypothesis testing is central to the way we do science, but it has major flaws that have encouraged widespread shoddy research. In <a href="/blog/hypothesis-testing-part-1/">part 1</a> of this series, we looked at the historical origins of hypothesis testing, and described two different approaches: Fisher’s significance testing, and Neyman-Pearson hypothesis testing. In <a href="/blog/hypothesis-testing-part-2/">part 2</a> we saw how modern researchers use hypothesis testing in practice. We looked at theoretical reasons the tools we use aren’t suited for many questions we want to ask, and also at the ways these tools encourage researchers to <em>misuse</em> them and draw dubious conclusions from questionable methods.</p>
<p>In this essay we’ll look at a number of methods that can help us draw better conclusions, and avoid the pitfalls of crappy hypothesis testing. We’ll start with some smaller and more conservative ideas, which basically involve doing hypothesis testing <em>better</em>. Then we’ll look at more radical changes, taking the focus away from hypothesis tests and seeing the other ways we can organize and contribute to scientific knowledge.</p>
<h2 id="what-was-hypothesis-testing">1. What was hypothesis testing, again?</h2>
<p>But first, let’s remember what we’re talking about. The first two parts of this series answered two basic questions: how does hypothesis testing work, and how does it break?</p>
<p>In part 1, we learned about two major historical approaches to the idea of hypothesis testing: one by Fisher, and the other by Neyman and Pearson. Both start with a “null hypothesis”, which is usually an idea we’re trying to <em>disprove</em>. Then we collect some data, and analyze it under the assumption that the null hypothesis is true.</p>
<p>Fisher’s significance testing computes a \(p\)-value, which is the probability of seeing the experimental result you got <em>if</em> the null hypothesis is true. It is <strong><em>not</em></strong> the probability that the null hypothesis is false, but it does measure how much evidence your experiment provides against the null hypothesis. We say the result is <em>significant</em> if the \(p\)-value is below some pre-defined threshold, generally \(5\)%. <strong>If the null is actually false, we should be able to reliably produce these low \(p\)-values</strong>; Fisher wrote that a “scientific fact should be regarded as experimentally established only if a properly designed experiment <em>rarely fails</em> to give this level of significance”.</p>
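<p>As a concrete illustration (this sketch is mine, not Fisher’s), we can estimate a \(p\)-value directly by simulation: generate lots of fake data under the null hypothesis, and count how often it looks at least as extreme as what we actually observed.</p>

```python
import random

def monte_carlo_p_value(observed_heads, n_flips=100, n_sims=10_000, seed=0):
    """Two-sided Monte Carlo p-value for the null hypothesis 'this coin
    is fair': the fraction of simulated fair-coin experiments whose head
    count is at least as far from n_flips/2 as the observed count."""
    rng = random.Random(seed)
    observed_deviation = abs(observed_heads - n_flips / 2)
    extreme = 0
    for _ in range(n_sims):
        heads = sum(rng.random() < 0.5 for _ in range(n_flips))
        if abs(heads - n_flips / 2) >= observed_deviation:
            extreme += 1
    return extreme / n_sims

# 65 heads in 100 flips: data this lopsided is rare if the coin is fair,
# so the p-value comes out well below 0.05
print(monte_carlo_p_value(65))
# 52 heads in 100 flips: completely unremarkable for a fair coin,
# so the p-value is large
print(monte_carlo_p_value(52))
```

<p>Under a true null, this procedure flags about one experiment in twenty at the \(5\)% threshold—which is exactly the false positive rate the method promises.</p>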
<p>Neyman and Pearson didn’t worry about establishing facts; instead, they focused on making actionable, yes-or-no decisions. A Neyman-Pearson null hypothesis is generally that we should refuse to take some specific action, which may or may not be useful. We figure out how bad it would be to take the action if it is useless, and how much we’d miss out on if it’s useful, and use that to set a threshold; then we collect data and use our threshold to decide whether to act. <strong>This approach doesn’t tell us what to <em>believe</em>, just what to <em>do</em>.</strong> Sometimes we think that acting is probably useful, but that acting wrongly would be catastrophic so it would be wiser to do nothing. The Neyman-Pearson method takes that logic into account, and biases us towards inaction, making type I errors less common at the expense of making type II errors more common.<strong title="We could reverse this, and err on the side of acting, if we think wrongly doing nothing has worse downsides than wrongly acting. But it's pretty uncommon to do it that way in practice."><sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup></strong></p>
<p>Modern researchers use an awkward combination of these methods. Like Fisher, we want to discover true facts; but we use Neyman and Pearson’s technical approach of setting specific thresholds. We set a false positive threshold (usually \(5\)%) and ideally a false negative threshold (we want it to be less than \(20\)%), and run our experiment. If we get a \(p\)-value less than the threshold—data that would be pretty weird <em>if</em> the null hypothesis is true, so weird it would only happen once every twenty experiments we run—then we “reject the null” and believe some alternative hypothesis. If our \(p\)-value is bigger, meaning our data wouldn’t look too weird if the null hypothesis is true, then we “fail to reject” the null and err on the side of believing the null hypothesis.</p>
<p>There are a few major problems with this setup.</p>
<ul>
<li>
<h4 id="artificial-decisiveness">Artificial decisiveness</h4>
<p>The Neyman-Pearson method makes a definitive choice between two distinct courses of action. This reinforces a general tendency to <a href="https://statmodeling.stat.columbia.edu/2019/09/13/deterministic-thinking-dichotomania/">force questions into yes-or-no binaries</a>, even when that sort of clean dichotomy isn’t realistic or appropriate to the question. Hypothesis testing tells us whether something exists, but not really how common or how big it is.<strong title="We've seen the effects of this unnecessary dichotomization over and over again during the pandemic. We argued about whether masks &quot;work&quot; or &quot;don't work&quot;, rather than discussing how well different types of masks work and how we could make them better. I know people who are still extremely careful to wear masks everywhere, but who wear cloth masks rather than medical—a combination that makes very little sense outside of this false binary."><sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup></strong></p>
<p><img src="/assets/blog/hypothesis-testing/size-matters-not.jpeg" alt="Yoda: &quot;Size matters not.&quot;" class="blog-image center" />
<em class="blog-image center">Unfortunately, Yoda is wrong. Sometimes we do care about size.</em></p>
<p>And more importantly, <strong>scientific knowledge is always provisional</strong>, so we need to continually revise our beliefs based on new information. But Neyman-Pearson is designed to make a final decision and close the book on the question, which just isn’t how science needs to work.</p>
</li>
<li>
<h4 id="bias-towards-the-null">Bias towards the null</h4>
<p>Neyman-Pearson creates a bias towards the null hypothesis, so rejecting the null feels like learning something new, while failing to reject is a default outcome. On one hand, this means it’s not a good tool if we want to show the null is true<strong title="There are [variants of hypothesis testing] that help you show some null hypothesis is (probably) basically right. But they're not nearly as common as the more standard setup."><sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup></strong>. On the other hand, a study that fails to reject the null feels like a failed study, and that’s a huge problem if the null really is true! This can <a href="https://en.wikipedia.org/wiki/Publication_bias">bias the studies we actually see</a> since many non-rejections aren’t published. <strong>It doesn’t help us that most research is accurate if <a href="/blog/hypothesis-testing-part-2#most-findings-false">most published papers are not</a>.</strong></p>
</li>
<li>
<h4 id="motivated-reasoning-and-p-hacking">Motivated reasoning and \(p\)-hacking</h4>
<p>Since researchers don’t want to fail, and do want to discover new things and get published, they have an incentive to <em>find</em> a way to reject the null.<strong title="[Nosek, Spies, and Motyl write] about the experience of carefully replicating some interesting work before publication, and seeing the effect vanish: &quot;Incentives for surprising, innovative results are strong in science. Science thrives by challenging prevailing assumptions and generating novel ideas and evidence that push the field in new directions. We cannot expect to eliminate the disappointment that we felt by “losing” an exciting result. That is not the problem, or at least not one for which the fix would improve scientific progress. The real problem is that the incentives for publishable results can be at odds with the incentives for accurate results. This produces a conflict of interest....The solution requires making incentives for _getting it right_ competitive with the incentives for _getting it published_.&quot;"><sup id="fnref:4"><a href="#fn:4" class="footnote">4</a></sup></strong> When done deliberately, we call this \(p\)-hacking, and there are a variety of <a href="https://replicationindex.com/2015/01/24/qrps/">questionable research practices</a> that can help us wrongly and artificially reject a null hypothesis. Worse, the <a href="https://www.americanscientist.org/article/the-statistical-crisis-in-science">garden of forking paths</a> means you can effectively \(p\)-hack without even knowing that you’re doing it, fudging both your theory and your data until they match.</p>
</li>
<li>
<h4 id="low-power-creates-misleading-results">Low power creates misleading results</h4>
<p>At the same time, many studies <a href="https://marginalrevolution.com/marginalrevolution/2022/07/quantitative-political-science-research-is-greatly-underpowered.html">have a low <em>power</em></a>, meaning they probably won’t reject the null even if it is actually false. Combined with publication bias, this can make the published literature unreliable: in some subfields, a <a href="https://www.science.org/doi/10.1126/science.aac4716">majority of published results are untrue</a>. What’s more, when underpowered studies do find something, they tend to <a href="https://statmodeling.stat.columbia.edu/2022/06/28/published-estimates-of-group-differences-in-multisensory-integration-are-inflated/">overestimate the effect</a>, leading us to think everything works better than it actually does.</p>
</li>
</ul>
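<p>A quick simulation (my own toy example, not taken from any of the linked studies) shows the last two problems at once: with a small true effect and small samples, few studies reach significance, and the ones that do exaggerate the effect.</p>

```python
import random
import statistics

def published_effects(true_effect=0.2, n=20, n_studies=2000, seed=1):
    """Simulate many small studies of a modest true effect, and keep only
    the effect estimates that reached significance (roughly |t| > 1.96)--
    i.e. the ones most likely to get published."""
    rng = random.Random(seed)
    kept = []
    for _ in range(n_studies):
        sample = [rng.gauss(true_effect, 1) for _ in range(n)]
        mean = statistics.fmean(sample)
        se = statistics.stdev(sample) / n ** 0.5
        if abs(mean / se) > 1.96:  # crude test of the null "no effect"
            kept.append(mean)
    return kept

sig = published_effects()
# Only a minority of studies reach significance (low power)...
print(len(sig) / 2000)
# ...and those that do systematically overestimate the true effect of 0.2:
print(statistics.fmean(sig))
```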
<p>Despite all these problems, hypothesis testing is extremely useful—when we have a question it’s good for, and use it properly. So we’ll start by seeing how to make hypothesis testing work correctly, and some of the ways science has been shifting over the past couple decades to do a better job at significance testing.</p>
<h2 id="replication">2. Replication: Fisher’s principle</h2>
<p>To create reliable knowledge we need to <em>replicate</em> our results; there will always randomly be some bad studies and replication is the only way to weed them out. (There’s a reason it’s the “replication crisis” and not the “some bad studies” crisis.) Any one study may produce weird data through bad luck; but <strong>if we can get a specific result consistently, then we’ve found something real.</strong><strong title="The result we've found doesn't necessarily mean what we think it means, and that is its own tricky problem. But if you get a consistent effect then you've found _something_ even if you don't understand it yet."><sup id="fnref:5"><a href="#fn:5" class="footnote">5</a></sup></strong></p>
<p>In some fields it’s common for important results to get replicated early and often. I’ve written <a href="/blog/replication-crisis-math/">before</a> about how mathematicians are continuously replicating major papers by using their ideas in future work, and even just by reading them. Any field where <a href="https://statmodeling.stat.columbia.edu/2022/03/04/biology-as-a-cumulative-science-and-the-relevance-of-this-idea-to-replication/">research is iterative</a> will generally have this same advantage.</p>
<p>In other fields replication is less automatic. Checking important results would take active effort, and often doesn’t happen at all. Complex experiments may be too expensive and specialized to replicate: the average phase \(3\) drug trial <a href="https://www.sofpromed.com/how-much-does-a-clinical-trial-cost">costs about \($20\) million</a>, and even an exploratory phase 1 trial costs about \($4\) million. At those prices we’re almost forced to rely on one or two studies, and if we get unlucky with our first study it will be hard to correct our beliefs.<strong title="If a drug is wrongly approved, we continue learning about it through observation of the patients taking it. This is, for instance, how we can be quite certain that the [covid vaccines are effective and extremely safe]. But if we _don't_ approve a drug, there's no followup data to analyze, and the drug stays unapproved."><sup id="fnref:6"><a href="#fn:6" class="footnote">6</a></sup></strong></p>
<p>But sometimes we just don’t treat replication work like it’s important. If we run a new version of an old study and get the same result, it can feel like a waste of time: we “knew that already”. Since our results are old news, it can be hard to get the work published or otherwise acknowledged. But if we run a new version of an old study and <em>don’t</em> get the same result, many researchers will <a href="https://statmodeling.stat.columbia.edu/2016/01/26/more-power-posing/">assume our study must be flawed</a> because they already “know” the first study was right. Replication can be a thankless task.</p>
<p>The replication crisis led many researchers to <a href="https://statmodeling.stat.columbia.edu/2013/07/28/50-shades-of-gray-a-research-story/">reconsider these priorities</a>. Groups like the <a href="https://osf.io/wx7ck/">Many Labs Project</a> and <a href="https://osf.io/ezcuj/">the Reproducibility Project: Psychology</a> have engaged in large scale attempts to replicate famous results in psychology, which helped to clarify which “established” results we can actually trust. Devoting more attention to replication may mean we study fewer ideas and “discover” fewer things, but our knowledge will be much more reliable.<strong title="My favorite suggestion comes from [Daniel Quintana], who wants undergraduate psychology majors to contribute to replication efforts for their senior thesis research. Undergraduate research is often more about developing methodological skill than about producing genuinely innovative work, so it's a good fit for careful replication of already-designed studies."><sup id="fnref:7"><a href="#fn:7" class="footnote">7</a></sup></strong></p>
<h3 id="resistance-to-replication">Resistance to Replication</h3>
<p>Unfortunately, replication work often gets a response somewhere between apathy and active hostility. <strong>Lots of researchers see “failed” replications as actual failures</strong>—the original study managed to reject the null, so why can’t you?</p>
<p><a href="https://xkcd.com/892/"><img src="https://imgs.xkcd.com/comics/null_hypothesis.png" alt="XKCD 892: &quot;I can't believe schools are still teaching kids about the null hypothesis. I remember reading a big study that conclusively disproved it _years_ ago.&quot;" class="blog-image center" /></a>
<em class="blog-image center">Alt text: “Hell, my eighth grade science class managed to conclusively reject it just based on a classroom experiment. It’s pretty sad to hear about million-dollar research teams who can’t even manage that.”</em></p>
<p>Worse, replications that don’t find the original result are often treated like attacks on both the original research and the original researchers. They “followed the rules” and got a publishable result, and now the “data police” are trying to take it away from them. At its worst, this leads to accusations of <a href="https://www.businessinsider.com/susan-fiske-methodological-terrorism-2016-9">methodological terrorism</a>. But even in less hostile discussions, people want to “save” the original result and explain away the failed replication—either by finding <a href="https://en.wikipedia.org/wiki/Data_dredging">some specific subgroup</a> in the replication where the original result seems to hold, or by finding some way the replication differs from the original study and so “doesn’t count”.<strong title="You might wonder if a result that depends heavily on minor differences in study technique can actually be telling us anything important. That's a very good question. It's very easy to run a hypothesis test that basically _can't_ tell us anything interesting; we'll come back to this [later in the piece]."><sup id="fnref:8"><a href="#fn:8" class="footnote">8</a></sup></strong></p>
<p>This desire might seem weird, but it does follow pretty naturally from the Neyman-Pearson framework. The original goal of hypothesis testing is to make a decision and move on—even though that’s not how science should work. <strong>Replication re-opens questions that “were already answered”, which is good for science as a whole but frustrating to people who want to close the question and treat the result as proven.</strong></p>
<h3 id="meta-analysis">Meta-analysis: use all the data</h3>
<p>To make replication fit into a hypothesis testing framework, we often use <em>meta-analysis</em>, which synthesizes the data and results from multiple previous studies. Meta-analysis can be a powerful tool: why wouldn’t we want to use all the data out there, rather than picking just one study to believe? But it also allows us to move fully back into the Neyman-Pearson world. We can treat the whole collection of studies as one giant study, do one hypothesis test to it, and reach one conclusion.</p>
<p>Of course this leaves us with all the fundamental weaknesses of hypothesis testing: it tries to render a definitive yes-or-no answer, and it’s biased towards sticking with the null-hypothesis.</p>
<p>Moreover, a meta-analysis can only be as good as the studies that go into it. If those original studies are both representative and well-conducted, meta-analysis can produce a reliable conclusion. But if the component studies are sloppy and collect garbage data, as <a href="https://trialsjournal.biomedcentral.com/articles/10.1186/s13063-022-06415-5">disturbingly many studies are</a>, the meta-analysis will necessarily produce a garbage result. Good researchers try to screen out unusually bad studies, but if <em>all</em> the studies on some topic are bad then that won’t help.</p>
<p>And if not all studies get published, then <em>any</em> meta-analysis will be drawing on unrepresentative data. Imagine trying to estimate average human height, but the only data you have access to comes from studies of professional basketball players. No matter how careful we are, our estimates will be far too high, because our data all comes from unusually tall people. In the same way, if only unusually significant data gets published, even a perfect meta-analysis will be biased, because it can only use biased data.</p>
<p>Even if all studies get published, the <a href="https://statmodeling.stat.columbia.edu/2021/03/16/the-garden-of-forking-paths-why-multiple-comparisons-can-be-a-problem-even-when-there-is-no-fishing-expedition-or-p-hacking-and-the-research-hypothesis-was-posited-ahead-of-time-2/">garden of forking paths</a> can bias the meta-analysis in exactly the same way, since each study may report an unusually favorable measurement. This is like if some studies report the height of their participants, and others the weight, and others the shoe size—but they all pick the measure that makes their subjects look biggest. Each study might report its data accurately, but we’d still end up with a misleading impression of how large people actually are.</p>
<p>Good meta-analyses will look for signs of selective publication, and there are statistical tools like <a href="https://en.wikipedia.org/wiki/Funnel_plot">funnel plots</a> or <a href="https://www.bitss.org/education/mooc-parent-page/week-2-publication-bias/detecting-and-reducing-publication-bias/p-curve-a-tool-for-detecting-publication-bias/">\(p\)-curves</a> that can sometimes detect these biases in the literature. But these tools aren’t perfect, and of course they don’t tell us what we <em>would have seen</em> in the absence of publication bias. We can try to weed out bad studies after publication, but it’s better not to produce them in the first place.</p>
<p><img src="/assets/blog/hypothesis-testing/p-curve.png" alt="Two graphs illustrating the p-curve. Each graph measures the number of studies which had p=.01, .02, .03, .04, and .05. For experiments they expected to be p-hacked, the curve slopes upwards; for experiments they expected to not be p-hacked, the curve slopes downwards." class="blog-image center" />
<em class="blog-image center">The \(p\)-curve: when there’s \(p\)-hacking or selection bias, we expect most significant studies to be just barely significant. When the effect is real, we expect small \(p\)-values to be much more common than large ones.</em>
<em class="blog-image center">Figure from <a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2256237">Simonsohn, Nelson, and Simmons</a>.</em></p>
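<p>You can reproduce the logic of the \(p\)-curve with a toy simulation (mine, not Simonsohn, Nelson, and Simmons’s): when the null is true, the significant \(p\)-values are spread evenly between \(0\) and \(0.05\), but when the effect is real, they pile up near zero.</p>

```python
import math
import random

def two_sided_p(mean_shift, n, rng):
    """One simulated study: n normal(mean_shift, 1) observations,
    two-sided z-test of the null 'the mean is zero'."""
    sample_mean = sum(rng.gauss(mean_shift, 1) for _ in range(n)) / n
    z = sample_mean * math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))  # P(|Z| >= |z|) for standard normal

def significant_p_values(mean_shift, n=30, n_studies=4000, seed=2):
    """The p-values that cleared p < .05 -- the ones a biased
    literature would show us."""
    rng = random.Random(seed)
    ps = [two_sided_p(mean_shift, n, rng) for _ in range(n_studies)]
    return [p for p in ps if p < 0.05]

def frac_tiny(ps):
    """Share of significant p-values that are below .01."""
    return sum(p < 0.01 for p in ps) / len(ps)

null_curve = significant_p_values(0.0)   # no real effect
real_curve = significant_p_values(0.6)   # real, decently sized effect

print(frac_tiny(null_curve))  # around 0.2: flat p-curve, hits barely significant
print(frac_tiny(real_curve))  # most p-values tiny: right-skewed p-curve
```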
<p>But of course, not all meta-analyses are good. Just like researchers have lots of ways to tweak their experiments to get statistical significance, doing a meta-analysis involves making a lot of choices about how to analyze the data, and so there are a lot of opportunities to \(p\)-hack or to get tricked by the garden of forking paths. Meta-analysis is like one giant hypothesis test, which means it can go wrong in exactly the same ways other hypothesis tests do.</p>
<h2 id="preregistration">3. Preregistration: do it right the first time</h2>
<p>Hypothesis testing does have real weaknesses, but many of the real-world problems we deal with only happen when we do it <em>wrong</em>. The point of the Neyman-Pearson method is to set out a threshold that determines whether we should act or not, collect data, and then see whether the data crosses the threshold. If we <a href="https://royalsocietypublishing.org/doi/10.1098/rsos.220099">ignore the result when it doesn’t give the answer we want</a>, then we’re not <em>really</em> using the Neyman-Pearson method at all.</p>
<p>But that’s exactly what happens in many common errors. <strong>When we ignore negative studies, we change the question from “yes or no” to “yes or try again later”.</strong> The garden of forking paths and \(p\)-hacking involve changing the threshold after you see your data. This makes it very easy for your data to clear the threshold, but <em>not</em> very informative.</p>
<p><img src="/assets/blog/hypothesis-testing/TexasSharpShooter-768x646.png" alt="Cartoon of a wall filled with bullet holes, and a cowboy painting a target around each hole." class="blog-image center" />
<em class="blog-image center">It’s easy to hit your target, if you pick the target after you shoot. But you don’t learn anything that way.</em>
<em class="blog-image center">Illustration by Dirk-Jan Hoek, CC-BY</em></p>
<p><strong>For hypothesis testing to work, we have to decide what would count as evidence for our theory <em>before</em> we collect the data.</strong> And then we have to actually follow through on that, even if the data tells us something we don’t want to hear.</p>
<h3 id="public-registries">Public registries</h3>
<p>Following through with this is simple for private decisions, if not always easy. When I want to buy a new kitchen gadget, sometimes I’ll decide how much I’m willing to pay before I check the price. If it turns out to be cheaper than my threshold, I’ll buy it; if it’s more expensive, I won’t. This helps me avoid making dumb decisions like “oh, that fancy pasta roller set is on sale, so it <em>must</em> be a good deal”. I don’t need any fancy way to hold myself accountable, since there’s no one else involved for me to be accountable <em>to</em>. And of course, if the pasta roller is super expensive and I buy it anyway, I’m only hurting myself.</p>
<p>But <strong>science is a public, communal activity, and our decisions and behavior need to be transparent so that other researchers can trust and build on our results.</strong> Even if no one ever lied, it’s so easy for us to fool <em>ourselves</em> that we need some way to guarantee that we did it right—both to other scientists, and to ourselves. Everyone saying, “I <em>swear</em> I didn’t change my mind after the fact, honest!” just isn’t reliable enough.</p>
<p>To create trust and transparency, we can publicly <a href="https://en.wikipedia.org/wiki/Preregistration_(science)">preregister</a> our research procedures. If we publish our plans before conducting the study, everyone else can <em>know</em> we made our decisions <em>before</em> we ran the study, and they can check to see if the analysis we did matches the analysis we said we would do. When done well, this prevents \(p\)-hacking and protects us from the garden of forking paths, because we aren’t making any choices after we see the data.</p>
<p>Public preregistration also limits publication bias. Even if the study produces only boring negative results, the preregistration plan is already published, so we know the study happened—it can’t get lost in a file drawer where no one knows about it. This preserves the powerful statistical protection of the Neyman-Pearson method: our false positive rate <em>will</em> be five percent, and no more.</p>
<p>Many journals have implemented <a href="https://www.cos.io/initiatives/registered-reports">registered reports</a>, which allow researchers to submit their study designs for peer review, before they actually conduct the study. This means their work is evaluated based on the quality of the design and on whether the <em>question</em> is interesting; the publication won’t depend on what answer they find, which removes the selection bias towards only seeing positive results. Registered reports also restrict researchers to the analyses they had originally planned, rather than letting them fish around for an interesting result—or at least force them to explain why they changed their minds, so we can adjust for how much fishing they actually did.</p>
<p>The biggest concern about publication bias probably surrounds medical trials, where pharmaceutical companies have an incentive not to publish any work that would show their drugs don’t work. Many regulatory bodies, including the FDA, <a href="https://www.clinicaltrials.gov/ct2/manage-recs/background#RegLawPolicies">require clinical trials to be registered</a>; the NIH also maintains a public database of trial registries and results. And this change had a dramatic impact on the results we saw from clinical trials.</p>
<p><img src="https://ourworldindata.org/uploads/2022/02/Efficacy-in-trials-before-and-after-registration-requirement2.jpg" alt="Graph from OurWorldInData, showing the results of trials funded by the National Heart, Lung, and Blood institute. Before preregistration was required in 2000, most trials showed a substantial benefit. After 2000, most trials show a small and insignificant effect." class="blog-image center" />
<em class="blog-image center">Before widespread preregistration, most trials showed large benefits. When we got more careful, these benefits evaporated.</em></p>
<h3 id="planning-for-power">Planning for power</h3>
<p>Preregistration is also a great opportunity to <a href="https://twitter.com/BalazsAczel/status/1546871350316376064">plan out our study more carefully</a>, and in particular to think about statistical power in advance. Remember, the power of a study is the probability that it will reject the null hypothesis if the null is in fact false. We get more power when the study is better and more precise, but also when the effect we’re trying to measure is bigger and more visible: it’s pretty easy to show that cigarette smoking is linked to cancer, because the effect is so dramatic.<strong title="Somewhat infamously, Fisher stubbornly resisted the claim that smoking _caused_ cancer until his death. But he never denied the correlation, which was too dramatic to hide."><sup id="fnref:9"><a href="#fn:9" class="footnote">9</a></sup></strong> But it’s much harder to detect the long-term effects of something like power posing, because the effects will be so small relative to other impacts on our personality.</p>
<p>On the other hand, if the effects are that small, maybe they don’t matter. If some economic policy reduces inflation by \(0.01\)%, then even if we could measure such a small reduction we wouldn’t really care—all we need to know is that the effect is “too small to matter”. With enough precision we could get statistical significance,<strong title="As long as two factors have [any relationship at all], the effect won't be [exactly zero], and with enough data we'll be able to reject the null hypothesis that there's no effect. But that just means &quot;is the effect exactly zero&quot; is often the wrong question; instead we want to know if the effect is big enough to matter."><sup id="fnref:10"><a href="#fn:10" class="footnote">10</a></sup></strong> but that doesn’t mean the result is <a href="https://statisticsbyjim.com/hypothesis-testing/practical-statistical-significance/">practically</a> or <a href="https://www.mhaonline.com/faq/clinical-vs-statistical-significance">clinically</a> significant. During the preregistration process we can decide <a href="http://daniellakens.blogspot.com/2017/05/how-power-analysis-implicitly-reveals.html">what kind of effects would be practically important</a>, and calibrate our studies to find those effects.</p>
<p><img src="/assets/blog/hypothesis-testing/scotty-power.png" alt="Star Trek image: &quot;Do we have the power to pull it off, Scotty?&quot;" class="blog-image center" /></p>
<p>Planning for power also makes it easier to treat negative results as serious scientific contributions. The aphorism says that <a href="https://quoteinvestigator.com/2019/09/17/absence/">absence of evidence is not evidence of absence</a>, but the aphorism is wrong. When a study has high power, we are very likely to see evidence <em>if</em> it exists; so absence of evidence becomes pretty good evidence of absence. If we know our studies have enough power, then our negative results become important and meaningful, and we won’t need to hide them in a file drawer.</p>
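<p>The power planning described above can be sketched numerically. Here is a minimal, hypothetical example using a one-sided one-sample z-test with a known standard deviation; real studies usually need t-tests or simulation, and the effect sizes and sample sizes here are made up purely for illustration:</p>

```python
from statistics import NormalDist
from math import sqrt, ceil

norm = NormalDist()  # standard normal distribution


def power_one_sided_z(effect, sigma, n, alpha=0.05):
    """Power of a one-sided one-sample z-test to detect a true
    effect of the given size, with known standard deviation sigma."""
    z_crit = norm.inv_cdf(1 - alpha)  # rejection threshold in standard errors
    # Under the alternative, the test statistic is shifted up by
    # effect * sqrt(n) / sigma standard errors.
    return 1 - norm.cdf(z_crit - effect * sqrt(n) / sigma)


def n_for_power(effect, sigma, power=0.80, alpha=0.05):
    """Smallest sample size giving at least the requested power."""
    z_alpha = norm.inv_cdf(1 - alpha)
    z_beta = norm.inv_cdf(power)
    return ceil(((z_alpha + z_beta) * sigma / effect) ** 2)


# A small effect (0.1 standard deviations) needs a big study:
print(n_for_power(effect=0.1, sigma=1.0))    # hundreds of subjects
print(power_one_sided_z(0.1, 1.0, n=50))     # a 50-person study is badly underpowered
```

The point of running these numbers in advance is exactly the calibration discussed above: if the smallest effect we would care about is \(0.1\) standard deviations, the calculation tells us how large a study we need before our negative results mean anything.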
<h3 id="a-limited-tool">A limited tool</h3>
<p>And all of this is fantastic—but it doesn’t address many of the problems science actually presents us with. <strong>Modern hypothesis testing is optimized for taking a clear, well-designed question and giving a simple yes-or-no answer.</strong> That’s a good match for clinical trials, where the question is pretty much “should we use this drug or not?” By the time we’re in Phase 3 trials, we know what we think the drug will accomplish, and we can describe in advance a clean test of whether it will or not. Preregistration solves the implementation problems pretty thoroughly.</p>
<p>But preregistration does limit our ability to explore our data. This is necessary to make hypothesis testing work properly, but it’s still a <em>cost</em>. We really <em>do</em> want to learn new things from our data, not just confirm conjectures we’ve already made. Preregistration can’t help us if we don’t already have a hypothesis we want to test. And often, when we’re doing research, we don’t.</p>
<h2 id="bigger-better-questions">4. Bigger, Better Questions</h2>
<p>Here are some scientific questions we might want to answer:</p>
<ul>
<li>What sorts of fundamental particles exist?</li>
<li>What social factors contribute to crime rates?</li>
<li>How does sleep deprivation affect learning?</li>
<li>How effective is this cancer drug?</li>
<li>How cost-effective is this public health program?</li>
<li>How malleable are all the different steel alloys you can make?</li>
</ul>
<p>None of these are yes-or-no questions. All of them are important parts of the scientific program, but none of them suggest specific hypotheses to run tests on. What do we do instead?</p>
<h3 id="spaghetti-on-the-wall">Spaghetti on the wall</h3>
<p>Maybe the most obvious idea is just to test, well, everything.</p>
<p><img src="/assets/blog/hypothesis-testing/test-all-the-things.jpg" alt="Meme: Test all the things!" class="blog-image center" />
<em class="blog-image center">With apologies to <a href="http://hyperboleandahalf.blogspot.com/2010/06/this-is-why-ill-never-be-adult.html">Allie Brosh</a>.</em></p>
<p>Now, we can’t test literally everything; collecting data takes time and money, and we can only conduct so many experiments. But we can take all the data we already have on crime rates, or on learning; and we can list every hypothesis we can think of and test them all for statistical significance. This <a href="https://en.wikipedia.org/wiki/Data_dredging">data dredging</a> is a very common, <a href="https://xkcd.com/882/">very bad idea</a>, especially in the modern era of <a href="https://journals.sagepub.com/doi/full/10.1177/0268396220915600">machine learning and big data</a>. Mass testing like this takes all the problems of hypothesis testing—false positives, publication bias, low power, and biased estimates—and makes them much worse.</p>
<p><strong>If we test every idea we can think of, most of them will be wrong.</strong> As we saw in part 2, that means a huge fraction of our positive results will be false positives. Sure, if we run all our tests perfectly, then only \(5\)% of our wrong ideas will give false positives. But since we have so many <em>more</em> bad ideas than good ones, we’ll still get way more false positives than true positives. (This is easiest to see in the case where all of our ideas are wrong—then <em>all</em> our positive results will be false positives!)</p>
<p>If we test just twenty different wrong ideas, there’s a roughly two-in-three chance that one of them will fall under the \(5\)% significance threshold, just by luck.<strong title="The odds of getting no false positives after n trials are 0.95^n, so the odds of getting a false positive are 1 - 0.95^n. And 0.95^20 ≈ 0.358, so 1 - 0.95^20 ≈ 0.642. It's a little surprising this is so close to 2/3, but there's a reason for it—sort of. If you compute (1- 1/n)^n you will get approximately 1/e, so the odds of getting a false positive at a 1/20 false positive threshold after 20 trials are roughly 1-1/e ≈ .63."><sup id="fnref:11"><a href="#fn:11" class="footnote">11</a></sup></strong> That’s a lot higher than the false positive rate of \(5\)% that we asked for, and means we are very likely to “discover” something false. And then we’ll waste even more time and resources following up on our surprising new “discovery”.</p>
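<p>The arithmetic behind that two-in-three figure is easy to check directly. This sketch assumes independent tests of true null hypotheses:</p>

```python
# Probability of at least one false positive when testing n true-null
# hypotheses at significance level alpha, assuming independent tests.
def family_false_positive_rate(n, alpha=0.05):
    return 1 - (1 - alpha) ** n

print(family_false_positive_rate(1))    # 0.05, the advertised rate
print(family_false_positive_rate(20))   # ~0.64: roughly two in three
print(family_false_positive_rate(100))  # ~0.99: near certainty
```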
<p><img src="/assets/blog/hypothesis-testing/spurious-correlation.png" alt="Graph of &quot;divorce rate in Maine&quot; against &quot;per capita consumption of margarine&quot; between 2000 and 2009. The correlation is 99.26%." class="blog-image center" />
<em class="blog-image center">If you test everything, you’ll find a ton of <a href="https://www.tylervigen.com/spurious-correlations">spurious correlations</a> like this one.</em></p>
<h3 id="multiple-comparisons">Multiple Comparisons</h3>
<p>This <a href="https://en.wikipedia.org/wiki/Multiple_comparisons_problem">multiple comparisons problem</a> has a mathematical solution: we can adjust our significance threshold to bring our false positive rate back down. A rough rule of thumb is the <a href="https://en.wikipedia.org/wiki/Bonferroni_correction">Bonferroni correction</a>, where we divide our significance threshold by the number of different ideas we’re testing. If we test twenty ideas and divide our \(5\)% significance threshold by twenty to get a corrected threshold of \(0.25\)%, then each <em>individual</em> test has only a one-in-four-hundred chance of giving a false positive, and the chance of a false positive on <em>any</em> of the twenty ideas drops back to roughly five percent, the rate we originally asked for.</p>
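<p>As a sketch of how the correction works (again assuming independent tests of true nulls), the numbers above can be verified directly:</p>

```python
# Bonferroni correction: divide the significance threshold by the
# number of tests, so the chance of any false positive across the
# whole family stays near the original alpha.
def bonferroni_threshold(alpha, n_tests):
    return alpha / n_tests

def chance_of_any_false_positive(per_test_alpha, n_tests):
    # at least one false positive across independent tests of true nulls
    return 1 - (1 - per_test_alpha) ** n_tests

corrected = bonferroni_threshold(0.05, 20)          # 0.0025, i.e. 0.25%
print(chance_of_any_false_positive(0.05, 20))       # ~0.64 without correction
print(chance_of_any_false_positive(corrected, 20))  # ~0.049, back under 5%
```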
<p>The problem is sociological, not mathematical: people don’t <em>like</em> correcting for multiple comparisons, because it makes it harder to reach statistical significance and <a href="https://royalsocietypublishing.org/doi/10.1098/rsos.220099">“win” the science game</a>. Less cynically, correcting for multiple comparisons reduces the power of our studies dramatically, making it harder to discover real and important results. Ken Rothman’s 1990 paper <a href="https://www.jstor.org/stable/20065622">No Adjustments Are Needed for Multiple Comparisons</a> articulates both of these arguments admirably clearly: “scientists should not be so reluctant to explore leads that may turn out to be wrong that they penalize themselves by missing possibly important findings.”</p>
<p>Rothman is right in two important ways. First, researchers should not be penalized for conducting studies that don’t reach statistical significance. Studies that fail to reject the null, or measure a tiny effect, are valuable contributions to our store of knowledge. We tend to overlook and devalue these null results, but that’s a mistake, and one of the major benefits of preregistration is protecting and rewarding them.</p>
<p>Second, it’s important to investigate potential leads that might not pan out. As Isaac Asimov <a href="https://quoteinvestigator.com/2015/03/02/eureka-funny/">may or may not have said</a>, “The most exciting phrase in science is not ‘Eureka!’ but ‘That’s funny…’”; and it’s important to follow up on those unexpected, funny-looking results. After all, we have to find hypotheses somewhere.</p>
<p><strong>But undirected exploration is, very specifically, not hypothesis testing.</strong> Rothman suggests that we often want to “earmark for further investigation” these unexpected findings. But <strong>hypothesis testing isn’t designed to flag ideas for future study; instead a hypothesis test <em>concludes</em> the study, with (in theory) a definitive answer.</strong> Rothman’s goals are correct and important, but hypothesis testing and statistical significance aren’t the right tools for those goals.<strong title="From what I can tell, Rothman may well agree with me. His [twitter feed] features arguments against [using statistical significance] and [dichotomized hypotheses in place of estimation], which is roughly the position I'm advocating. But _if_ you're doing hypothesis testing, you should try to do it correctly."><sup id="fnref:12"><a href="#fn:12" class="footnote">12</a></sup></strong></p>
<h3 id="jump-to-conclusions">Jumping to conclusions</h3>
<p>At some point, though, we do generate some hypotheses.<strong title="You might notice that I'm not really saying anything about where we find these hypotheses. There's a good reason for that. Finding hypotheses is hard! It's also the most _creative_ and unstructured part of the scientific process. The question is important, but I don't have a good answer."><sup id="fnref:13"><a href="#fn:13" class="footnote">13</a></sup></strong> If we’re studying how memory interacts with speech, we might hypothesize that <a href="https://pubmed.ncbi.nlm.nih.gov/2295225/">describing a face verbally will make you worse at recognizing it later</a>, which gives us something concrete to test. Or, more tenuously, if we’re studying the ways that sexism affects decision-making, we might hypothesize that <a href="https://www.washingtonpost.com/news/monkey-cage/wp/2014/06/05/hurricanes-vs-himmicanes/">hurricanes with feminine names are more deadly because people don’t take them as seriously</a>.</p>
<p>And then we can test these hypotheses, and reject the null or not, and then—what? What does that tell us?</p>
<p><img src="/assets/blog/hypothesis-testing/what-did-we-learn.jpg" alt="Spongebob meme: "What did we learn today?"" class="blog-image center" /></p>
<p>We have a problem, because these hypotheses <em>aren’t</em> the questions we really want to answer. If <a href="https://www.vox.com/2020/1/8/21051869/indoor-air-pollution-student-achievement">installing air filters in classrooms increases measured learning outcomes</a>, that’s a fairly direct answer to the question of whether installing air filters in classrooms can help children learn, so a hypothesis test really can answer our question. But we shouldn’t decide that sexism is fake just because <a href="https://statmodeling.stat.columbia.edu/2016/04/02/himmicanes-and-hurricanes-update/">feminine names probably don’t make hurricanes deadlier</a>!<strong title="For that matter, if feminine hurricane names were _less_ dangerous we could easily tell a story about how _that_ was evidence for sexism. That's the garden of forking paths popping up again, where many different results could be evidence for our theory."><sup id="fnref:14"><a href="#fn:14" class="footnote">14</a></sup></strong> We should only care about the hurricane-names thing if we think it tells us something about our actual, real-world concerns.</p>
<p>And that means we can’t just test one random hypothesis relating to our big theoretical question and call it a day. We need to develop hypotheses that are reasonably connected to the questions we care about, and we need to approach those questions from <a href="https://www.nature.com/articles/d41586-018-01023-3">many different perspectives</a> to make sure we’re not missing anything. That means <strong>there’s a ton of work <em>other</em> than hypothesis testing that we need to do if we want our hypothesis tests to tell us anything useful</strong>:<strong title="In their wonderfully named (and very readable) paper [Why hypothesis testers should spend less time testing hypotheses], Anne Scheel, Leonid Tiokhin, Peder Isager, and Daniël Lakens call this the _derivation chain_: the empirical and conceptual linkages that allow you to derive broad theoretical claims from the specific hypotheses you test. "><sup id="fnref:15"><a href="#fn:15" class="footnote">15</a></sup></strong></p>
<ul>
<li><strong>Defining terms:</strong> First we need to decide what question we’re actually trying to answer! There are a lot of different things people mean by “sexism” or “memory” or “crime”, and our research will be confused unless we make sure we’re consistently talking about the same thing.<strong title="This is one of the major skills you develop in math courses, because a lot of the work of math is figuring out what question you're trying to answer. I've written about this [before], but I also recommend Keith Devlin's [excellent post] on what &quot;mathematical thinking&quot; is, especially the story he tells after the long blockquote."><sup id="fnref:16"><a href="#fn:16" class="footnote">16</a></sup></strong></li>
<li><strong>Causal modeling:</strong> What sort of relationships do we expect to see? If our theory on the Big Question is true, what experimental results does that imply? What other factors could confound or interfere with these effects? We need to know what relationships we’re looking for before we can design tests for them.</li>
<li><strong>Developing measurements:</strong> How will we measure the inputs and outputs to our theory? What numbers will we use to measure crime levels, or educational improvement, or ability to remember faces? Are the things we’re measuring closely connected to the definitions we chose earlier? It’s easy to measure <em>something</em> but hard to make sure the measurement <a href="https://en.wikipedia.org/wiki/Goodhart's_law">tells us what we want to know</a>.</li>
<li><strong>Determining scope:</strong> When do we expect our theory to work, and for what sort of extreme results do we expect it to break down? What experiments should we not bother running? It’s worth studying whether mild air pollution makes learning harder, without worrying about the major health effects that we know severe pollution causes.</li>
<li><strong>Auxiliary assumptions:</strong> What extra assumptions are we making in all the previous steps, and how can we verify them? Does installing classroom air filters actually reduce pollution? Do people who verbally described a face try equally hard at the later recall task? How can we tell? We can’t avoid making assumptions, but we can try to be explicit about them, and check the ones that could cause problems.</li>
</ul>
<p>Without all this work, we can come up with hypotheses, but they won’t make sense. We can run experiments, but we can’t interpret them. And we can do hypothesis tests, but we can’t use them to answer big questions.</p>
<h2 id="failing-to-measure-up">5. Failing to measure up</h2>
<p>And sometimes we have a direct question that presents a clear experiment to run, but not a clear <em>hypothesis</em>. Questions like “How effective is this cancer drug?” or “how malleable is this steel alloy?” aren’t big theoretical questions, but also aren’t specific hypotheses that can be right or wrong. We want <em>numbers</em>.</p>
<p>In practice we often use hypothesis testing to answer these questions anyway—but with an awkward kludge. We can test a null hypothesis like “this public health program doesn’t save lives”. If we fail to reject the null, we conclude that it doesn’t help <em>at all</em>; if we do reject the null, we see how many lives the program saved in our experiment, and use that as an estimate of its effectiveness.</p>
<p>This works well enough that we kinda get away with it, but it introduces consequential biases into our measurements. If the measured effect is small, we <a href="https://statmodeling.stat.columbia.edu/2020/09/17/we-want-certainty-even-when-its-not-appropriate/">round it down to zero</a>, concluding there is no benefit when there may well be a small but real benefit (or a small but real harm). And if significant studies are more likely to be seen than non-significant studies, we will see <a href="https://statmodeling.stat.columbia.edu/2022/05/25/the-failure-of-null-hypothesis-significance-testing-when-studying-incremental-changes-and-what-to-do-about-it/">more unusually good results than unusually bad ones</a>, which means we will believe basically everything is more effective than it actually is.<strong title="We also sometimes find that our conclusions depend on exactly which questions we ask. Imagine a study where we need a 5% difference to be significant, and Drug A produces a 3% improvement over placebo and Drug B produces a 7% improvement. Then the effect of Drug A isn't significant, and the effect of Drug B is, so we say that Drug A doesn't work and Drug B does. But the difference between Drug A and Drug B is _not_ significant—so if we ask that question, we conclude that the two drugs are equally good! [The difference between &quot;significant&quot; and &quot;not significant&quot; is not itself statistically significant], so it matters exactly which hypothesis we choose to test."><sup id="fnref:17"><a href="#fn:17" class="footnote">17</a></sup></strong></p>
<p>We shouldn’t be surprised that hypothesis testing does a bad job of measuring things, because hypothesis testing isn’t designed to measure things. It’s specifically designed to <em>not</em> report a measurement, and just tell us whether we should act or not. It’s the wrong tool for this job.</p>
<p>We can and should do better. A study in which mortality decreases by \(0.1\)% is evidence that the program <em>works</em>—possibly weak evidence, but still evidence! And if we <a href="https://onlinelibrary.wiley.com/doi/10.1111/jeb.14009">skip the hypothesis testing and put measurement first</a>, we can represent that fact accurately.</p>
<h3 id="compatibility-checking">Compatibility checking</h3>
<p>The simplest thing to do would be to just average all our measurements and report that number. This is a type of <em>point estimate</em>, the single number that most accurately reflects our best guess at the true value of whatever we’re measuring.</p>
<p>But a point estimate by itself doesn’t give as much information as we need. We need to measure our uncertainty around that estimate, and describe how <em>confident</em> we are in it. A drug that definitely makes you a bit healthier is very different from one that could save your life and could kill you, and it’s important to be clear which one we’re talking about.</p>
<p>We can supplement our point estimate with a <em>confidence interval</em>, also called a <em>compatibility interval</em>, which is sort of like a backwards hypothesis test. We give all the values that are compatible with our measurement—values that would make our estimate relatively unsurprising. <strong>Rather than starting with a single null hypothesis and checking whether our measurement is compatible with it, we start with the measurement, and describe all the hypotheses that would be compatible.</strong></p>
<p>The definition is a bit more technical, and easy to get slightly wrong: If we run \(100\) experiments, and generate a \(95\)% confidence interval for each experiment, then the true value will lie in about \(95\) of those intervals. A common mistake is to say that if we generate one confidence interval, the true value has a \(95\)% chance of landing in it, but that’s <a href="https://statmodeling.stat.columbia.edu/2019/04/21/no-its-not-correct-to-say-that-you-can-be-95-sure-that-the-true-value-will-be-in-the-confidence-interval/">backwards, and not quite right</a>.<strong title="Sometimes we can look at our interval after the fact and make an informed guess whether it's one of the good intervals or the bad intervals. If I run a small study to measure average adult heights, there's some risk I get a 95% confidence interval that contains, say, everything between five feet and six feet. Based on outside knowledge, I'm pretty much 100% confident in that interval, not just 95%. "><sup id="fnref:18"><a href="#fn:18" class="footnote">18</a></sup></strong> But <em>before</em> we run the experiment, we expect a \(95\)% chance that the true value will be in the confidence interval we compute.</p>
<p><img src="/assets/blog/hypothesis-testing/confidence-intervals.png" alt="a diagram of a collection of confidence intervals" class="blog-image center" />
<em class="blog-image center">Each vertical bar is a compatibility interval from one experiment, with a circle at the point estimate. Three of the intervals don’t include the true value, which is roughly \(5\)% of the \(50\) intervals.</em>
<em class="blog-image center">Image by <a href="https://commons.wikimedia.org/wiki/File:Neyman_Construction_Confidence_Intervals.png">Randy.l.goodrich</a>, <a href="https://creativecommons.org/licenses/by-sa/4.0">CC BY-SA 4.0</a></em></p>
<p>Mathematically, these intervals are closely related to hypothesis tests. <strong>A result is statistically significant if the null hypothesis (often \(0\)) lies outside the compatibility interval.</strong> So in a sense compatibility intervals are just giving the same information as a hypothesis test, just in a different format. But changing the format shifts the emphasis of our work, and the way we think about it. Rather than starting by picking a specific claim and then saying yes or no, we give a <em>number</em>, and talk about what theories and models are compatible with it. This avoids needing to pick a specific hypothesis. It also gives our readers more information, rather than compressing our answer into a simple binary.</p>
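<p>As an illustrative sketch, here is a normal-approximation compatibility interval computed from some made-up effect measurements. (A careful analysis of only ten data points would use a t-distribution, which widens the interval a bit; the principle is the same.)</p>

```python
from statistics import NormalDist, mean, stdev
from math import sqrt

def compatibility_interval(data, level=0.95):
    """Normal-approximation interval for the mean of `data`."""
    z = NormalDist().inv_cdf((1 + level) / 2)  # about 1.96 for 95%
    m = mean(data)
    se = stdev(data) / sqrt(len(data))         # standard error of the mean
    return m - z * se, m + z * se

# Hypothetical measurements of a treatment effect:
effects = [0.8, 1.3, -0.2, 0.9, 1.1, 0.4, 0.7, 1.6, 0.2, 1.0]
lo, hi = compatibility_interval(effects)
print((lo, hi))
# "Statistically significant at 5%" is just: does the interval exclude 0?
print(lo > 0 or hi < 0)
```

The last line is the whole relationship between the two formats: the hypothesis test reports only that single boolean, while the interval reports the range of effect sizes the data is compatible with.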
<p>Focusing on compatibility intervals can also help avoid publication bias, and make it easier to use all the data that’s been collected. When we report measurements and compatibility intervals, we can’t “fail to reject” a null hypothesis. Every study will succeed at producing <em>an estimate</em>, and a compatibility interval, so every study produces knowledge we can use, and no study will “fail” and be hidden in a file drawer. Some studies might be designed and run better than others, and so give more precise estimates and narrower compatibility intervals. We can give more weight to these studies when forming an opinion. But we won’t discard a study just for yielding an answer we didn’t expect.</p>
<h2 id="bayes">6. Bayesian statistics: the other path</h2>
<p>Throughout this series, we’ve used the language and perspective of <a href="https://en.wikipedia.org/wiki/Frequentist_inference">frequentist statistics</a>. This is the older and more classical approach to statistics, which defines probability in terms of repeated procedures. “If we test a true null hypothesis a hundred times, we’ll only reject it about five times”. “If we run this sampling procedure a hundred times, the compatibility interval will include the true value about \(95\) times.” This approach to probability is philosophically straightforward, and leads to relatively simple calculations.</p>
<p>But there are questions it absolutely can’t answer—like “what is the probability my null hypothesis is true?”—since we can’t frame them in terms of repeated trials. Remember, <strong>the \(p\)-value is <em>not</em> the probability the null is false.</strong> Its definition is a complicated conditional hypothetical that’s hard to state clearly in English: it’s the probability that we would observe a result at least as extreme as the one we actually observed, under the assumption that the null hypothesis is true. This is easy to compute, but it’s difficult to understand what it <em>means</em> (which is why I wrote like <a href="/blog/hypothesis-testing-part-1/">six thousand words trying to explain it</a>).</p>
<p>But there’s another school of statistics that <em>can</em> produce answers to those questions. <a href="https://en.wikipedia.org/wiki/Bayesian_inference">Bayesian inference</a>, which I’ve <a href="https://jaydaigle.net/blog/overview-of-bayesian-inference/">written about before</a>, lets us assign probabilities to pretty much any statement we can come up with. This is great, because <strong>it can directly answer almost any question we actually have. But it’s also much, <em>much</em> harder to use</strong>, because it requires much more data and more computation. And the bigger and more abstract the question we ask, the worse this gets.</p>
<p>Bayesian inference needs three distinct pieces of information:</p>
<ul>
<li>The probability of seeing our data, assuming the hypothesis is true, which is just the \(p\)-value we’ve been discussing;</li>
<li>The probability of seeing our data, assuming the hypothesis is <em>false</em>, which is another \(p\)-value; and</li>
<li>The <em>prior probability</em> that our hypothesis is true, based on the evidence we had <em>before</em> running the experiment.</li>
</ul>
<p>Then we run an experiment, collect data, and use a formula called <a href="https://en.wikipedia.org/wiki/Bayes'_theorem">Bayes’s theorem</a> to produce a <em>posterior probability</em>, our final estimate of the likelihood our hypothesis is true.<strong title="We saw examples of this calculation in part 2, when we [calculated what fraction of positive results were true positives]. Note that we had to make assumptions about what fraction of null hypotheses are true; that's the Bayesian prior probability. Tables like the ones we used there show up a lot in simple Bayesian calculations."><sup id="fnref:19"><a href="#fn:19" class="footnote">19</a></sup></strong></p>
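<p>A minimal sketch of that calculation, with made-up numbers in the spirit of the tables from part 2:</p>

```python
def posterior(prior, p_data_if_true, p_data_if_false):
    """Bayes's theorem: combine the prior with the two likelihoods."""
    numerator = prior * p_data_if_true
    evidence = numerator + (1 - prior) * p_data_if_false
    return numerator / evidence

# Hypothetical numbers: only 10% of drug candidates work (the prior), a
# working drug yields a positive trial 80% of the time (the power), and
# a dud yields a false positive 5% of the time (the significance level).
print(posterior(prior=0.10, p_data_if_true=0.80, p_data_if_false=0.05))
# Even after one significant result, the drug is far from a sure thing.
```

With these numbers the posterior comes out around \(64\)%: a significant trial makes the drug much more plausible, but nowhere near certain, because most candidates were duds to begin with.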
<p>That’s a lot more complicated! First of all, we have to compute two \(p\)-values, not just one. But second, we calculate the extra \(p\)-value under the assumption that “our hypothesis is false”, and that covers a lot of ground. If our hypothesis is that some drug prevents cancer deaths, then the alternative includes “the drug does nothing”, “the drug increases cancer deaths”, “the drug prevents some deaths and causes others”, and even silly stuff like “aliens are secretly interfering with our experiments”. To do the Bayesian calculation we need to list every possible way our hypothesis could be false, and compute how likely each of those ways is and how plausible each one makes our data. That gets very complicated very quickly.</p>
<p>(In contrast, Fisher’s approach starts by assuming the null hypothesis is true, and ignores every other possibility. This makes the calculation much easier to actually do, but it also limits how much we can actually conclude. High \(p\)-value? Nothing weird. Low \(p\)-value? Something is weird. But that’s all we learn.)</p>
<p>And <em>third</em>, even if we can do all those calculations somehow, we need that prior probability. We want to figure out how likely it is that a drug prevents cancer. And as the first step, we have to plug in…the probability that the drug prevents cancer. We don’t know that! That’s what we’re trying to compute!</p>
<p>Bayesian machinery is great for refining and updating numbers we already have. And the more data we collect, the less the prior probability matters; we’ll eventually wind up in the correct place. So in practice, we just pick a prior that’s easy to compute with, plug it into Bayes’s theorem, and try to collect enough data that we expect our answer to be basically right.</p>
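<p>One way to see the prior washing out is conjugate Beta-Binomial updating of a coin’s heads-rate, sketched here with hypothetical counts:</p>

```python
def beta_posterior_mean(a, b, heads, tails):
    """Posterior mean of a coin's heads-rate under a Beta(a, b) prior:
    the prior acts like a 'pseudo-count' of a heads and b tails."""
    return (a + heads) / (a + b + heads + tails)

# Two analysts with very different priors watch the same coin,
# which comes up heads 300 times in 1000 flips.
skeptic  = beta_posterior_mean(1, 1, 300, 700)    # flat prior
believer = beta_posterior_mean(50, 10, 300, 700)  # strong prior near 0.83

print(skeptic, believer)   # both end up near the observed rate of 0.3

# With only a small study, the strong prior still dominates:
print(beta_posterior_mean(50, 10, 3, 7))  # still well above 0.7
```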
<p>And that brings us back to where we began, with replication. The more experiments we run, the more we can learn.</p>
<h2 id="conclusion">7. Conclusion: (Good) data is king</h2>
<p>I closed out part 2 with an <a href="https://xkcd.com/2400/">xkcd statistics tip</a>: “always try to get data that’s good enough that you don’t need to do statistics on it.” Here at the end of part 3, we find ourselves in exactly the same place. But this time, I hope you see that tip, not as a punchline, but as actionable advice.</p>
<p>Modern hypothesis testing “works”, statistically, as long as you ask exactly the questions it answers, and are extremely careful in how you use it. But we often misuse it by collecting flawed or inadequate data and then drawing strong, sweeping conclusions. We run small studies and then \(p\)-hack our results into significance, rather than running the careful, expensive studies that would genuinely justify our theoretical claims. We report the results as over-simplified yes-or-no answers rather than trying to communicate the complicated, messy things we observed. And if we manage to reject the null on one study we issue press releases claiming it confirms all our grand theories about society.</p>
<p><a href="https://xkcd.com/2494/"><img src="https://imgs.xkcd.com/comics/flawed_data.png" alt="XKCD 2494: &quot;We realized all our data is flawed. Good: ...so we're not sure about our conclusions. Bad: ...so we did lots of math and then decided our data was actually fine. Very bad: ...so we trained an AI to generate better data.&quot;" class="blog-image center" /></a>
<em class="blog-image center">Too often, we use statistics to help us pretend bad data is actually good.</em></p>
<p>In this essay we’ve seen a number of possible solutions, but they’re basically all versions of “collect more and better data”:</p>
<ul>
<li>Do enough foundational work that you can formulate good hypotheses, and figure out what data you need to draw usable conclusions.</li>
<li>If you have numerical data, use the numbers, rather than throwing away information and just giving a single yes or no.</li>
<li>Preregister your studies, to make sure your data is useful and you’re not altering it to fit your conclusions.</li>
<li>Replicate your studies, so you collect more data that can either confirm or correct your beliefs.</li>
</ul>
<p>Even the Bayesian approach comes back to this. Bayesianism relies on the prior probability; but that really just means that, if we already have some knowledge before we run the experiment, we should use it!</p>
<p>Statistics is powerful and useful. We couldn’t do good science without it. But data—empirical observation—is the core of science. Statistics helps us understand the data we have, and it helps us figure out what data we need. But if our data sucks, statistics alone cannot save us.</p>
<hr />
<p><em>Have questions about hypothesis testing? Is there something I didn’t cover, or even got completely wrong? Do you have a great idea for doing science better? Tweet me <a href="https://twitter.com/profjaydaigle">@ProfJayDaigle</a> or leave a comment below.</em></p>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>We could reverse this, and err on the side of acting, if we think wrongly doing nothing has worse downsides than wrongly acting. But it’s pretty uncommon to do it that way in practice. <a href="#fnref:1" class="reversefootnote">↵Return to Post</a></p>
</li>
<li id="fn:2">
<p>We’ve seen the effects of this unnecessary dichotomization over and over again during the pandemic. We argued about whether masks “work” or “don’t work”, rather than discussing how well different types of masks work and how we could make them better. I know people who are still extremely careful to wear masks everywhere, but who wear cloth masks rather than medical—a combination that makes very little sense outside of this false binary. <a href="#fnref:2" class="reversefootnote">↵Return to Post</a></p>
</li>
<li id="fn:3">
<p>There are <a href="https://journals.sagepub.com/doi/full/10.1177/2515245918770963">variants of hypothesis testing</a> that help you show some null hypothesis is (probably) basically right. But they’re not nearly as common as the more standard setup. <a href="#fnref:3" class="reversefootnote">↵Return to Post</a></p>
</li>
<li id="fn:4">
<p><a href="http://arxiv.org/pdf/1205.4251.pdf">Nosek, Spies, and Motyl write</a> about the experience of carefully replicating some interesting work before publication, and seeing the effect vanish: "Incentives for surprising, innovative results are strong in science. Science thrives by challenging prevailing assumptions and generating novel ideas and evidence that push the field in new directions. We cannot expect to eliminate the disappointment that we felt by “losing” an exciting result. That is not the problem, or at least not one for which the fix would improve scientific progress. The real problem is that the incentives for publishable results can be at odds with the incentives for accurate results. This produces a conflict of interest….The solution requires making incentives for <em>getting it right</em> competitive with the incentives for <em>getting it published</em>." <a href="#fnref:4" class="reversefootnote">↵Return to Post</a></p>
</li>
<li id="fn:5">
<p>The result we’ve found doesn’t necessarily mean what we think it means, and that is its own tricky problem. But if you get a consistent effect then you’ve found <em>something</em> even if you don’t understand it yet. <a href="#fnref:5" class="reversefootnote">↵Return to Post</a></p>
</li>
<li id="fn:6">
<p>If a drug is wrongly approved, we continue learning about it through observation of the patients taking it. This is, for instance, how we can be quite certain that the <a href="https://www.hopkinsmedicine.org/health/conditions-and-diseases/coronavirus/is-the-covid19-vaccine-safe">covid vaccines are effective and extremely safe</a>. But if we <em>don’t</em> approve a drug, there’s no followup data to analyze, and the drug stays unapproved. <a href="#fnref:6" class="reversefootnote">↵Return to Post</a></p>
</li>
<li id="fn:7">
<p>My favorite suggestion comes from <a href="https://www.nature.com/articles/s41562-021-01192-8">Daniel Quintana</a>, who wants undergraduate psychology majors to contribute to replication efforts for their senior thesis research. Undergraduate research is often more about developing methodological skill than about producing genuinely innovative work, so it’s a good fit for careful replication of already-designed studies. <a href="#fnref:7" class="reversefootnote">↵Return to Post</a></p>
</li>
<li id="fn:8">
<p>You might wonder if a result that depends heavily on minor differences in study technique can actually be telling us anything important. That’s a very good question. It’s very easy to run a hypothesis test that basically <em>can’t</em> tell us anything interesting; we’ll come back to this <a href="#jump-to-conclusions">later in the piece</a>. <a href="#fnref:8" class="reversefootnote">↵Return to Post</a></p>
</li>
<li id="fn:9">
<p>Somewhat infamously, Fisher stubbornly resisted the claim that smoking <em>caused</em> cancer until his death. But he never denied the correlation, which was too dramatic to hide. <a href="#fnref:9" class="reversefootnote">↵Return to Post</a></p>
</li>
<li id="fn:10">
<p>As long as two factors have <a href="https://www.gwern.net/Everything">any relationship at all</a>, the effect won’t be <a href="https://statmodeling.stat.columbia.edu/2017/06/29/lets-stop-talking-published-research-findings-true-false/">exactly zero</a>, and with enough data we’ll be able to reject the null hypothesis that there’s no effect. But that just means “is the effect exactly zero” is often the wrong question; instead we want to know if the effect is big enough to matter. <a href="#fnref:10" class="reversefootnote">↵Return to Post</a></p>
</li>
<li id="fn:11">
<p>The odds of getting no false positives after \(n\) trials are \(0.95^n\), so the odds of getting a false positive are \(1 - 0.95^n\). And \(0.95^{20} \approx 0.358\), so \(1 - 0.95^{20} \approx 0.642\).</p>
<p>It’s a little surprising this is so close to \(2/3\), but there’s a reason for it—sort of. If you compute \( (1- 1/n)^n\) you will get approximately \(1/e\), so the odds of getting a false positive at a \(1/20\) false positive threshold after \(20\) trials are roughly \(1-1/e \approx .63\). <a href="#fnref:11" class="reversefootnote">↵Return to Post</a></p>
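<p>(If you want to check this arithmetic yourself, here’s a short sketch in Python:)</p>

```python
import math

# Chance of at least one false positive in n independent tests,
# each with a 1/20 (0.05) false positive rate.
def any_false_positive(n, alpha=0.05):
    return 1 - (1 - alpha) ** n

print(any_false_positive(20))  # about 0.64
print(1 - 1 / math.e)          # about 0.63 -- the 1 - 1/e approximation
```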
</li>
<li id="fn:12">
<p>From what I can tell, Rothman may well agree with me. His <a href="https://twitter.com/ken_rothman">twitter feed</a> features arguments against <a href="https://twitter.com/_MiguelHernan/status/1476928329794027522">using statistical significance</a> and <a href="https://twitter.com/vamrhein/status/1526879947104702465">dichotomized hypotheses in place of estimation</a>, which is roughly the position I’m advocating. But <em>if</em> you’re doing hypothesis testing, you should try to do it correctly. <a href="#fnref:12" class="reversefootnote">↵Return to Post</a></p>
</li>
<li id="fn:13">
<p>You might notice that I’m not really saying anything about where we find these hypotheses. There’s a good reason for that. Finding hypotheses is hard! It’s also the most <em>creative</em> and unstructured part of the scientific process. The question is important, but I don’t have a good answer. <a href="#fnref:13" class="reversefootnote">↵Return to Post</a></p>
</li>
<li id="fn:14">
<p>For that matter, if feminine hurricane names were <em>less</em> dangerous we could easily tell a story about how <em>that</em> was evidence for sexism. That’s the garden of forking paths popping up again, where many different results could be evidence for our theory. <a href="#fnref:14" class="reversefootnote">↵Return to Post</a></p>
</li>
<li id="fn:15">
<p>In their wonderfully named (and very readable) paper <a href="https://journals.sagepub.com/doi/10.1177/1745691620966795">Why hypothesis testers should spend less time testing hypotheses</a>, Anne Scheel, Leonid Tiokhin, Peder Isager, and Daniël Lakens call this the <em>derivation chain</em>: the empirical and conceptual linkages that allow you to derive broad theoretical claims from the specific hypotheses you test. <a href="#fnref:15" class="reversefootnote">↵Return to Post</a></p>
</li>
<li id="fn:16">
<p>This is one of the major skills you develop in math courses, because a lot of the work of math is figuring out what question you’re trying to answer. I’ve written about this <a href="/blog/asking-the-right-question/">before</a>, but I also recommend Keith Devlin’s <a href="http://devlinsangle.blogspot.com/2012/08/what-is-mathematical-thinking.html">excellent post</a> on what “mathematical thinking” is, especially the story he tells after the long blockquote. <a href="#fnref:16" class="reversefootnote">↵Return to Post</a></p>
</li>
<li id="fn:17">
<p>We also sometimes find that our conclusions depend on exactly which questions we ask. Imagine a study where we need a \(5\)% difference to be significant, and Drug A produces a \(3\)% improvement over placebo and Drug B produces a \(7\)% improvement. Then the effect of Drug A isn’t significant, and the effect of Drug B is, so we say that Drug A doesn’t work and Drug B does.</p>
<p>But the difference between Drug A and Drug B is <em>not</em> significant—so if we ask that question, we conclude that the two drugs are equally good! <a href="https://statmodeling.stat.columbia.edu/2016/05/25/the-difference-between-significant-and-not-significant-is-not-itself-statistically-significant-education-edition/">The difference between "significant" and "not significant" is not itself statistically significant</a>, so it matters exactly which hypothesis we choose to test. <a href="#fnref:17" class="reversefootnote">↵Return to Post</a></p>
</li>
<li id="fn:18">
<p>Sometimes we can look at our interval after the fact and make an informed guess whether it’s one of the good intervals or the bad intervals. If I run a small study to measure average adult heights, there’s some risk I get a \(95\)% confidence interval that contains, say, everything between five feet and six feet. Based on outside knowledge, I’m pretty much \(100\)% confident in that interval, not just \(95\)%. <a href="#fnref:18" class="reversefootnote">↵Return to Post</a></p>
</li>
<li id="fn:19">
<p>We saw examples of this calculation in part 2, when we <a href="/blog/hypothesis-testing-part-2/#most-findings-false">calculated what fraction of positive results were true positives</a>. Note that we had to make assumptions about what fraction of null hypotheses are true; that’s the Bayesian prior probability. Tables like the ones we used there show up a lot in simple Bayesian calculations. <a href="#fnref:19" class="reversefootnote">↵Return to Post</a></p>
</li>
</ol>
</div>Jay Daigle

This is the third part of a three-part series on hypothesis testing. Hypothesis testing is central to the way we do science, but it has major flaws that have encouraged widespread shoddy research. In this essay we consider methods that can help us draw better conclusions, and avoid the pitfalls of hypothesis testing. We start with some smaller and more conservative ideas, which basically involve doing hypothesis testing _better_. Then we'll look at more radical changes, taking the focus away from hypothesis tests and seeing the other ways we can organize and contribute to scientific knowledge.

Hypothesis Testing and its Discontents, Part 2: The Conquest of Decision Theory
2022-05-24T00:00:00-07:00
https://jaydaigle.net/blog/hypothesis-testing-part-2

<p>This is the second part of a three-part series on hypothesis testing.</p>
<p>In <a href="/blog/hypothesis-testing-part-1/">part 1</a> of this series, we looked at the historical origins of hypothesis testing, and described two different approaches to the idea: Fisher’s significance testing, and Neyman-Pearson hypothesis testing. In this essay, we’ll see how modern researchers use hypothesis testing in practice. And in <a href="https://jaydaigle.net/blog/hypothesis-testing-part-3/">part 3</a> we’ll talk about alternatives to hypothesis testing that can help us avoid replication crisis-type problems.</p>
<p>The modern method is an awkward mix of Fisher’s goals and Neyman and Pearson’s methods that attempts to provide a one-size-fits-all solution for scientific statistics. The inconsistencies within this approach are a major contributor to the replication crisis, making bad science both more likely and more visible.</p>
<h2 id="modern-hypothesis-testing">Modern Hypothesis Testing</h2>
<p>The two approaches to hypothesis testing we saw in part 1 were each designed to answer specific questions.</p>
<p><strong>Fisher’s significance testing</strong> specifies a null hypothesis, and <strong>measures how much evidence our experiment provides</strong> against that null hypothesis. This is measured by the \(p\)-value, which tells us how likely our evidence would be if the null hypothesis is true. (It does <em>not</em> tell us how likely the null hypothesis is to be true!)</p>
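<p>To make this concrete with a toy example of my own (not from the original discussion): suppose we flip a coin \(100\) times and get \(60\) heads. Fisher’s \(p\)-value asks how likely a result at least that lopsided would be if the coin were fair. A sketch in Python:</p>

```python
from math import comb

def two_sided_binomial_p(heads, flips):
    """p-value for the null hypothesis 'the coin is fair': the probability
    of a result at least as far from flips/2 as the one we observed."""
    k = max(heads, flips - heads)
    tail = sum(comb(flips, j) for j in range(k, flips + 1)) / 2 ** flips
    return min(1.0, 2 * tail)  # double the tail for a two-sided test

print(two_sided_binomial_p(60, 100))  # about 0.057
```

<p>Note that \(p \approx 0.057\) measures how surprising the data would be under the null; it is <em>not</em> the probability that the coin is fair.</p>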
<p><strong>Neyman-Pearson hypothesis testing helps us make a decision between two courses of action</strong>, like prescribing a drug or not. We weigh the costs of getting it wrong in either direction, and decide which direction we want to default to if the evidence is unclear. The null hypothesis is that we should take that default action (such as not prescribing the drug), and the alternative is that we should take the other action (prescribing the drug).</p>
<p>Based on our weighing of the costs of making a mistake in either direction, and the amount of information we have to work with, we set a “false positive” threshold \(\alpha\) and a “false negative” threshold \(\beta\). These numbers are tricky to understand and describe correctly, even for experienced researchers. I encourage you to go read part 1 if you haven’t already, but in brief:</p>
<ul>
<li>The number \(\alpha\) measures the chance that, <em>if</em> the drug doesn’t work and isn’t worth taking, we will screw up and prescribe it anyway.</li>
<li>The number \(\beta\) measures the chance that, <em>if</em> the drug works and is worth taking, we’ll make a mistake and withhold it.</li>
</ul>
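<p>These error rates are statements about what happens over many repeated experiments, which makes them easy to see in simulation. Here’s a sketch of my own (with made-up parameters: a \(z\)-test, \(30\) patients per trial, and a true effect size of \(0.5\) when the drug works):</p>

```python
import math
import random
import statistics

def z_test_p(sample, mu0=0.0, sigma=1.0):
    # Two-sided z-test p-value, assuming a known standard deviation.
    z = (statistics.fmean(sample) - mu0) / (sigma / math.sqrt(len(sample)))
    return math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))

rng = random.Random(0)
alpha, n, trials = 0.05, 30, 2000

# alpha: how often we reject when the drug truly does nothing
false_pos = sum(
    z_test_p([rng.gauss(0.0, 1.0) for _ in range(n)]) < alpha
    for _ in range(trials)
) / trials

# beta: how often we fail to reject when the drug truly works (effect 0.5)
false_neg = sum(
    z_test_p([rng.gauss(0.5, 1.0) for _ in range(n)]) >= alpha
    for _ in range(trials)
) / trials

print(false_pos)  # should hover near alpha = 0.05
print(false_neg)  # should hover near 0.2 for this effect size and sample size
```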
<p><strong>The Neyman-Pearson method doesn’t try to tell us whether the drug “really works”</strong>; it <em>only</em> tells us how we should weigh the risks of making the two possible mistakes. <strong>Fisher’s method takes a very different approach and tries to measure the evidence</strong> to help us decide what to believe; but it does not give a clean yes-or-no answer.</p>
<p>Modern statistical hypothesis testing is a weird mishmash of these two approaches. We report \(p\)-values as evidence for or against the null hypothesis, as in Fisher-style significance testing. But we <em>also</em> try to give a yes-or-no, accept-or-reject verdict, as in the Neyman-Pearson approach. And while either approach can be useful on its own, the combination loses the key statistical benefits of each and leaves us in a bit of a muddle.</p>
<h3 id="the-modern-approach-in-practice">The modern approach in practice</h3>
<p>Modern researchers generally do something like this:</p>
<ul>
<li>First we choose a significance level \(\alpha\). We usually default to \(\alpha = .05\), but we sometimes make it lower if we want to be really confident in our conclusions. Particle physicists often use an \(\alpha\) of about \(.0000003\), or about \(1\) in \(3.5\) million.<strong title="This is the probability of getting data five standard deviations away from the mean. So you'll often see this reported as a significance threshold of 5σ. Related is the [Six Sigma techniques] for ensuring manufacturing quality, though somewhat counterintuitively they typically only aim for [4.5 σ] of accuracy."><sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup></strong></li>
</ul>
<ul>
<li>
<p>Next we specify a null hypothesis, which is usually something like “the thing we’re studying has no effect”. We generally choose a null hypothesis that we <em>don’t</em> believe, because our machinery will attempt to <em>disprove</em> our null.</p>
<p>If we want to prove that a new drug helps prevent cancer, our null hypothesis will be that the drug has no effect on cancer rates. If we want to show that hiring practices are racially discriminatory, our null hypothesis will be that race has no effect on whether people get hired.</p>
</li>
<li>
<p>Technically, we also have an alternative hypothesis: “this drug does help prevent cancer”, or “hiring practices are affected by race”. This alternative hypothesis is often what we actually believe, but we often don’t make it too precise during the design of the experiment. Specifying the alternative hypothesis well is a really important part of research design, but it’s a bit tangential to this essay so we won’t talk about it much here.</p>
</li>
<li>
<p>We run the experiment, do a Fisher-style significance test, and report the \(p\)-value we get. If it’s less than \(\alpha\), we reject the null hypothesis, and generally consider the experiment to have successfully proven our alternative is true. If the \(p\)-value is greater than \(\alpha\), we don’t reject the null hypothesis,<strong title="It is common for people to be sloppy here and say they "accept" the null. In fact, I wrote that in my first draft of this paragraph. But it's bad practice to say that, because even a very high p-value doesn't provide good evidence that the null hypothesis is true. Our methods are designed to default to the null hypothesis when the data is ambiguous. Neyman _did_ use the phrase "accept the null", but in the context of a decision process, where "accepting the null" means taking some specific, concrete action implied by the null, rather than more generally committing to believe something."><sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup></strong> and often view the experiment as a failure.</p>
</li>
</ul>
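<p>Stripped to its bones, the hybrid procedure looks something like this (a deliberately simplified sketch of my own, not anyone’s official protocol):</p>

```python
def hybrid_test(p_value, alpha=0.05):
    """The modern mishmash: report a Fisher-style p-value AND
    deliver a Neyman-Pearson-style verdict against a fixed alpha."""
    verdict = "reject the null" if p_value < alpha else "fail to reject the null"
    return verdict

print(hybrid_test(0.03))  # 'reject the null' -- and the paper gets written
print(hybrid_test(0.07))  # 'fail to reject the null' -- often filed away as a failure
```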
<p>There are a few problems with this approach, but most of them stem from the same core issue: <strong>classical statistical tools are incredibly fragile.</strong> If you use them <em>exactly</em> as described, you are mathematically guaranteed to get some specific benefit. (In a correct Neyman-Pearson setup, for instance, you are guaranteed a false positive rate of size \(\alpha\).) But you get <em>exactly</em> that guarantee, and possibly nothing more. My friend Nostalgebraist <a href="https://nostalgebraist.tumblr.com/post/161645122124/bayes-a-kinda-sorta-masterpost">analogizes</a> on Tumblr:</p>
<blockquote>
<p>The classical toolbox also has a lot of oddities….The labels on the tools say things like “won’t melt below 300° F,” and you <em>are in fact</em> guaranteed <em>that</em>, but the same screwdriver might turn out to instantly vaporize when placed in water, or when held in the left hand. Whatever is not guaranteed on the label is possible, however dangerous or just plain dumb it may be.</p>
</blockquote>
<p>This fragility means that if you carelessly combine two tools, you often lose the guarantees of each of them, and wind up with a screwdriver that melts at room temperature and <em>also</em> vaporizes when held in your left hand. And you may not get anything at all in return—other than, I suppose, the inherent benefits of being careless and lazy.</p>
<p class="center"><a href="https://www.egscomics.com/comic/2015-05-01"><img src="/assets/blog/hypothesis-testing/lazy-egscomics.png" alt="Panel from El Goonish Shive comic: "Shoot, I'm going to be lazy all the time forever now. It gets _results_." /></a></p>
<p class="center"><em>Sure, being lazy gets results. But they might not replicate.</em></p>
<h3 id="the-wrong-tool-for-the-job">The wrong tool for the job</h3>
<p>The Neyman-Pearson method is designed to give an unambiguous yes-or-no answer to a question, so we can act on the information we currently have. This is exactly what we need when it’s time to make a specific decision about whether or not to open a new factory or change to a different brand of fertilizer. And the method was so successful that in 1955, John Tukey <a href="https://www.tandfonline.com/doi/abs/10.1080/00401706.1960.10489909">expressed concern about</a> the “tendency of decision theory to attempt the conquest of all of statistics”.</p>
<p>He worried because <strong>in scientific research we don’t want to make decisions, but reach conclusions</strong>. On the one hand, we don’t need to make a definitive decision <em>right now</em>. If it’s not clear which theory describes the evidence better, we can just say that, and wait for more evidence to come in. On the other hand, we want to eventually reach firm conclusions that we can trust, and use as a foundation for further work. That requires a higher degree of confidence than “the best we can say right now”, which is what Neyman-Pearson gives us. Fisher’s methods, in contrast, were designed to accumulate certainty through repeated consistent experimental results, the sort of thing a true conclusion theory would need.</p>
<p>But because Neyman-Pearson worked so well for a very specific type of problem (and probably also because Fisher was <a href="https://www.newstatesman.com/long-reads/2020/07/ra-fisher-and-science-hatred">kind of terrible</a>), many fields adopted it as a default and use it for pretty much everything. <a href="http://daniellakens.blogspot.com/2022/05/tukey-on-decisions-and-conclusions.html">Daniel Lakens says</a> that in hindsight, Tukey didn’t need to worry, since statistics textbooks for the social sciences don’t even discuss decision theory; but in fact <strong>we’ve largely adopted a tool of decision theory, and repurposed it to reach conclusions instead</strong>.</p>
<p>A decision theory needs to produce a clear, discrete answer to our questions, even if there’s not much evidence available. And unfortunately, our scientific papers regularly try to transmute weak evidence into strong conclusions. We tend to over-interpret <a href="https://slatestarcodex.com/2014/12/12/beware-the-man-of-one-study/">individual studies</a>, especially when one study is all we have. How often have you seen in the news that “a new study proves that” something is true? It’s almost never wise to conclude that a question is resolved because of one study. But the Neyman-Pearson framework is designed to do exactly that, and so inclines us to be overconfident.</p>
<p>Even if you have multiple studies, the same problem shows up in a different form. When there’s a complicated and messy body of research on a topic, we should probably hold complicated and messy beliefs, rather than forming a definitive conclusion. Instead, we often argue about which study is “right” and which is “wrong”, because that’s the lens we use to evaluate research.</p>
<p class="center"><img src="/assets/blog/hypothesis-testing/onion-eggs-good-this-week.png" alt="Screenshot of a short onion article, titled "Eggs Good For You This Week"" class="blog-image" /></p>
<p class="center blog-image"><em>My <a href="https://www.theonion.com/eggs-good-for-you-this-week-1819565159">favorite article from The Onion</a> demonstrates the wrong way to interpret conflicting studies.</em></p>
<p>Of course, sometimes one study <em>is</em> pretty much just wrong! If you have two studies and one shows that a child care program cuts poverty by 50% and the other shows that it increases poverty, at least one of them has to be pretty badly off the mark somehow. But even then, the hypothesis testing framework can mislead us, because of the way it handles the burden of proof.</p>
<h3 id="defaults-matter">Defaults Matter</h3>
<p>Hypothesis testing methods build in a bias toward sticking with the null hypothesis. This is intentional; we’re looking for strong evidence that the null is false, not just something that might check out if we squint really hard. <strong>We want to put the burden of proof on showing that something new is actually happening.</strong></p>
<p><strong>But once a study rejects the null, it’s very easy to be <em>decisive</em> and treat its result as “proven”, and shift the burden of proof onto work that challenges the original study.</strong> So when a paper runs a hypothesis test and concludes that <a href="https://statmodeling.stat.columbia.edu/2014/06/06/hurricanes-vs-himmicanes/">female-named hurricanes are more dangerous than male-named ones</a>, this belief is “proven” and becomes the new default. And since that one study established a new baseline, anyone who disagrees now faces the burden of proof, and faces an uphill battle to convince people.</p>
<p>It’s pretty common for a small early study to find a big effect, and then be followed up by a few larger and better studies that <a href="https://statmodeling.stat.columbia.edu/2016/04/02/himmicanes-and-hurricanes-update/">don’t find the same effect</a>. But all too often people more or less conclude the big effect is real, because that first study found it, and the followups weren’t convincing <em>enough</em> to overcome the presumption that the effect is real.<strong title="Andrew Gelman suggests a helpful [time-reversal heuristic]: what would you think if you saw the same studies in the opposite order? You'd start with a few large studies establishing no effect, followed by one smaller study showing an effect. In theory that gives you the exact same information, but in practice people would treat it very differently—assuming the first studies [actually got published]."><sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup></strong></p>
<p>And the Neyman-Pearson framework reinforces this twice. First, because it is intentionally <em>decisive</em>, it encourages us to commit to the result of a single study. Second, rejecting the null hypothesis is seen as strong evidence against the null, but failing to reject is only weak evidence that the null is true. This is why we “fail to reject” rather than simply “accept” the null hypothesis: maybe the null is true, or maybe the experiment just wasn’t sensitive enough to reject it.</p>
<p>So if one study rejects the null and another fails to reject, it’s very easy to assume that the first study was just better. After all, it managed to reject the null, didn’t it? But a reasonable conclusion theory would incorporate both studies, rather than rejecting the one that “failed”.</p>
<h2 id="publication-in-practice">Publication in practice</h2>
<p>So far I’ve discussed theoretical problems with the hypothesis testing framework: reasons it might be the wrong tool for the problems we’re applying it to. But a possibly worse problem is that it’s very easy to <em>misuse</em> hypothesis testing, so that it doesn’t even do its own job correctly. And the structural dynamics of how research gets conducted, published, and distributed tends to encourage this misuse, and amplify the conclusions of sloppy studies.</p>
<h3 id="who-wants-to-be-boring">Who wants to be boring?</h3>
<p><strong>Most academics really care about doing good research and contributing to our knowledge about the world</strong>; otherwise they wouldn’t be academics. The academic career path is long and grueling, and doesn’t pay very well compared to other things that nascent academics could be doing; there’s a reason people say that you shouldn’t get a Ph.D. if you can imagine being happy without one.</p>
<p>But that doesn’t mean research is conducted by cloistered ascetics with no motivations other than a monastic devotion to the truth. <strong>People who do research want to <em>discover interesting things</em>, not spend thirty years on experiments that don’t uncover anything new.</strong> Moreover, they want to discover things that <em>other people</em> think are interesting—people who can give them grants, or jobs, or maybe even book deals and TED talks.</p>
<p>Even without any dishonesty, this shapes the questions people ask, and also the way they answer them. First, people want to reject the null hypothesis, because we see that as strong evidence, but see failing to reject the null as weak evidence. An experiment that fails to reject the null is rarely actually published; all too often, it’s seen as an experiment that simply failed.</p>
<p>Second, people want to prove <em>new</em> and <em>surprising</em> things. It would be extremely easy for me to run a study rejecting the null hypothesis that 15-year-olds are on average about as tall as 5-year-olds. But no one would care about this study—including me—because we already know that.</p>
<p>Now, sometimes it’s worth clearly establishing that obvious things are in fact true. And we do have data on the average height of children at various ages, and it wouldn’t be hard to use that to show that 15-year-olds are taller than 5-year-olds. Collecting that sort of routine data on important topics is <a href="https://twitter.com/ProfJayDaigle/status/1521911837897502723">very useful and important work</a> that we should probably reward more than we do.</p>
<p>But we <em>don’t</em> reward routine data collection heavily, and most of the time researchers are trying to prove surprising new results. And that’s exactly the problem: <strong>new results are “surprising” when you wouldn’t have expected them—which is exactly when they’re unlikely to be true.</strong></p>
<h3 id="most-findings-false">“Why most published research findings are false”</h3>
<p>This quest for surprising results interacts with the statistics of the Neyman-Pearson method in an extremely counterintuitive way. The statistical guarantee is: if we test a true null hypothesis, we’ll get a false rejection about five percent of the time. <strong>But that doesn’t mean a rejection has a five percent chance of being false. And the more studies of true null hypotheses we run, the bigger this difference gets.</strong></p>
<p>We can most easily understand how this works with a couple examples. As a baseline, let’s look at the case where half our null hypotheses are true. Imagine we run two hundred studies, \(100\) with a true null hypothesis and \(100\) with a false null hypothesis. Our false positive rate is \(\alpha = 0.05\), so we’ll reject the null in five of the \(100\) studies where the null is true. And we generally hope for a false negative rate of \(\beta = 0.20\), in which case we reject the null in \(80\) of the \(100\) studies where the null is false.</p>
<table>
<thead>
<tr>
<th> </th>
<th>Null is false</th>
<th>Null is true</th>
<th>Total</th>
</tr>
</thead>
<tbody>
<tr>
<td>Reject the null</td>
<td>80</td>
<td>5</td>
<td>85</td>
</tr>
<tr>
<td>Don’t Reject</td>
<td>20</td>
<td>95</td>
<td>115</td>
</tr>
<tr>
<td>Total</td>
<td>100</td>
<td>100</td>
<td>200</td>
</tr>
</tbody>
</table>
<p>So we have \(85\) positive results, of which \(80\) are true positives and \(5\) are false positives, and so \(5/85 \approx 6\)% of our positive results are false positives.<strong title="You might recognize this as an application of Bayes's theorem, and a basic example of [Bayesian inference]. Tables like these are very common in Bayesian calculations. "><sup id="fnref:4"><a href="#fn:4" class="footnote">4</a></sup></strong> And that’s not too bad—the fact that it’s <em>higher</em> than the false positive rate of \(5\)% should be a warning sign.</p>
<p>But now imagine our researchers get more ambitious, and start testing more interesting and potentially-surprising findings. This means we should expect more of our null hypotheses to actually be true. If only ten percent of the original \(200\) null hypotheses are false, then we’ll have 180 studies with a true null and only 20 with a false null. We’ll still reject \(80\)% of false null hypotheses, and \(5\)% of true null hypotheses, so our results look like this:</p>
<table>
<thead>
<tr>
<th> </th>
<th>Null is false</th>
<th>Null is true</th>
<th>Total</th>
</tr>
</thead>
<tbody>
<tr>
<td>Reject the null</td>
<td>16</td>
<td>9</td>
<td>25</td>
</tr>
<tr>
<td>Don’t Reject</td>
<td>4</td>
<td>171</td>
<td>175</td>
</tr>
<tr>
<td>Total</td>
<td>20</td>
<td>180</td>
<td>200</td>
</tr>
</tbody>
</table>
<p>Now we only have \(16\) true positives (out of \(20\) cases where we should reject), and we get \(9\) false positives (out of \(180\) cases where we shouldn’t reject the null). So a full \(9/25\) or \(36\)% of our positive results are false positives—much higher than \(5\)%! And often, only the studies that reject the null, and land in the first row of the table, get published at all. So we might find that a third of published papers will have false conclusions.</p>
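<p>Both tables come from the same arithmetic, which is easy to package up. Here’s a sketch (the function name is mine; the \(5\)% false positive rate and \(80\)% power are the assumptions from the text):</p>

```python
def false_discovery_fraction(studies, frac_null_false, alpha=0.05, power=0.80):
    """Of the studies that reject the null, what fraction are false positives?"""
    null_false = studies * frac_null_false
    null_true = studies - null_false
    true_pos = power * null_false   # correctly reject a false null
    false_pos = alpha * null_true   # wrongly reject a true null
    return false_pos / (true_pos + false_pos)

print(false_discovery_fraction(200, 0.50))  # about 0.06: half the nulls are false
print(false_discovery_fraction(200, 0.10))  # 0.36: only a tenth of the nulls are false
```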
<p><strong>If researchers are regularly testing theories that are unlikely to be true, then most of the positive (and thus published) results can be false, even if the rate of false positives is quite low.</strong> This is the key observation of the famous paper by John Ioannidis that kicked off the replication crisis, <a href="https://en.wikipedia.org/wiki/Why_Most_Published_Research_Findings_Are_False">Why Most Published Research Findings Are False</a>.<strong title="Followups to Ioannidis's paper contend that only about 14% of published biomedical findings are actually false. I'm not in a position to comment on this one way or the other. In psychology, different studies estimate that somewhere [between 36% and 62%] of published results replicate."><sup id="fnref:5"><a href="#fn:5" class="footnote">5</a></sup></strong></p>
<p>This is sometimes known as the <a href="https://en.wikipedia.org/wiki/Publication_bias">file-drawer effect</a>: we see the studies that get published, but not the “failed” ones that are left in the researchers’ filing cabinets. So even though only thirteen of the \(200\) studies give the wrong answer, \(9\) of the \(25\) that actually get published are wrong.</p>
<p>And no, \(9/25\) isn’t quite a majority, so while this is bad, it doesn’t seem to justify Ioannidis’s claim that “most” published findings are false.</p>
<p>But we haven’t talked about everything that can go wrong yet!</p>
<h3 id="the-problem-of-power">The problem of power</h3>
<p>I said that “we generally hope for a false negative rate of \(\beta = 0.2\)”. But where does that hope come from?</p>
<p>The original Neyman-Pearson framework has two parameters, the false positive rate \(\alpha\) and the false negative rate \(\beta\). You can always make \(\alpha\) lower by accepting a higher \(\beta\), and researchers are supposed to balance these parameters against each other, based on the relative costs of making Type I and Type II errors. But in practice we just <a href="https://doi.org/10.1353/sof.2005.0108">set \(\alpha\) to \(.05\) and move on with our lives</a>; we don’t think about the relative balance of costs, or what it’s really saying about our research.</p>
<p>If our data is good enough, then we can make both \(\alpha\) and \(\beta\) small, and draw conclusions with a fair degree of confidence. But if our data is bad, then the study will be too weak to detect a lot of true effects, and so to keep \(\alpha\) small, we need to make \(\beta\) large. Consequently we say that the <em>power</em> of a study is \(1 - \beta\), which is the <em>true</em> positive rate. A study with high power will usually give the correct answer; a study with low power can’t be trusted.</p>
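<p>To see concretely how power depends on the data, here’s a sketch of my own computing the power of a two-sided \(z\)-test (the effect size of \(0.5\) is an illustrative choice, not from the text). For these numbers, power climbs from roughly a third at \(n = 10\) to near-certain detection at \(n = 100\):</p>

```python
from statistics import NormalDist

def z_test_power(effect_size, n, alpha=0.05):
    """Power of a two-sided z-test: the chance of correctly rejecting
    the null when the true effect has the given standardized size."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    shift = effect_size * n ** 0.5
    # reject if the test statistic lands beyond the critical value on either side
    return (1 - norm.cdf(z_crit - shift)) + norm.cdf(-z_crit - shift)

for n in (10, 30, 100):
    print(n, round(z_test_power(0.5, n), 2))
```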
<p><img src="/assets/blog/hypothesis-testing/abusing-your-power.jpg" alt="Picture of a cat, with text: "Don't even think about abusing your power"" class="blog-image center" /></p>
<p>Much like we typically set \(\alpha = 0.05\), we typically try to get \(\beta \leq 0.2 \), and thus conduct studies with a power of at least \(80\)%. And like with the false positive rate, this number is also not really motivated by anything in particular: the choice is generally attributed to Jacob Cohen, who <a href="http://daniellakens.blogspot.com/2019/05/justifying-your-alpha-by-minimizing-or.html">wrote</a> that</p>
<blockquote>
<p>The \(\beta\) of \(.20\) is chosen with the idea that… Type I errors are of the order of four times as serious as Type II errors. This \(.80\) desired power convention is offered with the hope that it will be ignored whenever an investigator can find a basis in his substantive concerns in his specific research investigation to choose a value <em>ad hoc</em>.</p>
</blockquote>
<p>That is, there’s no really good argument for not picking \(\beta = 0.1 \) or \(\beta = 0.3\) instead, but it seems like it’s about the right size if you don’t have any better ideas.</p>
<p>There are two problems here. The minor one is that both of these numbers are pretty arbitrary. If we have enough data that we can get \(\alpha = 0.05,\beta = 0.2\), then we could also choose to reject the null more readily and get something like \(\alpha = 0.1, \beta = 0.11\), with a high false positive rate but a power of \(89\)%; or we could reject the null less often and get \(\alpha = 0.02, \beta = 0.33\), with a low false positive rate but power of only \(67\)%.</p>
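<p>You can reproduce those trade-off numbers with a few lines of code. (This is a sketch assuming a one-sided \(z\)-test, with the effect size measured in standard errors; other tests give slightly different numbers, but the same shape of trade-off.)</p>

```python
from statistics import NormalDist

N = NormalDist()  # the standard normal distribution

def beta_for_alpha(alpha, effect):
    """False negative rate of a one-sided z-test, for a true effect
    measured in standard-error units."""
    threshold = N.inv_cdf(1 - alpha)       # critical value for this alpha
    power = 1 - N.cdf(threshold - effect)  # chance of clearing the threshold
    return 1 - power

# The effect size for which alpha = 0.05 gives exactly beta = 0.20:
effect = N.inv_cdf(0.95) + N.inv_cdf(0.80)  # about 2.49 standard errors

for a in (0.02, 0.05, 0.10):
    print(f"alpha = {a:.2f} -> beta = {beta_for_alpha(a, effect):.2f}")
# alpha = 0.02 -> beta = 0.33
# alpha = 0.05 -> beta = 0.20
# alpha = 0.10 -> beta = 0.11
```

The same data supports any of these choices; the test can’t tell you which trade-off to make.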
<p>Which of those trade-offs we want depends on the specifics of our current question: if Type I and Type II errors are about equally bad, we might want \(\alpha\) and \(\beta\) to be about the same size, but if a Type I error is much, much worse, we should want \(\alpha\) to be much smaller than \(\beta\). We can’t make an informed choice of \(\alpha\) and \(\beta\) without knowing details about the specific decision we’re trying to make.</p>
<p>But when we’re trying to do <em>science</em> it’s not clear what to choose. We can’t really quantify the costs of publishing a paper with a false conclusion; the entire setup of computing practical trade-offs doesn’t make all that much sense when we’re trying to discern the truth rather than make a decision. <strong>This is one major way that the Neyman-Pearson framework isn’t the right tool for science: the entire method is premised on a calculation we can’t do.</strong></p>
<p>But we <em>can</em> just set \(\alpha = 0.05, \beta = 0.20\), and see what happens. And as long as these numbers are a vaguely reasonable size, we’ll probably get vaguely reasonable results. We hope.</p>
<h3 id="where-does-power-come-from">Where does power come from?</h3>
<p>There’s a second problem, though, which is widespread and frequently disastrous. Sometimes \(\beta\) gets so large that a study becomes useless—and we don’t even notice.</p>
<p>For a given \(\alpha\), your \(\beta\) depends on the quality of the data you have. With very good data, you can be very confident about your conclusion in both directions. We have a tremendous amount of data about the relationship between age and height in children, so we can design studies that will have low rates of false positives and false negatives. And physics experiments ask for a false positive rate less than one in a million—and they can actually <em>achieve</em> this because their data is both copious and precise.</p>
<p><strong>But with bad or noisy data, no amount of statistical cleverness can give any degree of confidence in our conclusions.</strong> If you want to study the effect on life expectancy of winning or losing an election to be a US state governor, <a href="https://statmodeling.stat.columbia.edu/2020/07/02/no-i-dont-believe-that-claim-based-on-regression-discontinuity-analysis-that/">you wind up with this scatterplot</a>:</p>
<p class="center"><img src="/assets/blog/hypothesis-testing/governor-life-expectancy.png" alt="Scatterplot with "Percentage vote margin" on the x-axis, from -10 to 10, and "Years alive after election" on the y-axis, from 0 to 60. There is no noticeable pattern." class="blog-image center" />
<em>If your data is this scattered, you will never be able to detect small effects.</em></p>
<p>There aren’t <em>that</em> many governor races, and lifespan after any given race varies from just a couple years to more than fifty, so the data is extremely noisy. If winning an election boosted your lifespan by ten years, we would probably be able to tell. But an effect that large is absurd, and there’s no way to use data like this to pick up changes of just a year or two.</p>
<p>When we said we “ask for” a \(\beta\) below \(0.2\), we really meant “we should collect enough data to get a power of \(80\)%”. That’s not really an option for the governors study, without waiting around for more elections and more dead governors; on that question we’re kind of stuck with the data we have. Despite the Neyman-Pearson inclination to make a firm decision, all we can reasonably do is embrace uncertainty.</p>
<p>If we’re running a laboratory experiment, on the other hand, we can decide how big an effect we’re looking for, and calculate how many people we’d need to study to get a power of \(80\)%. But it’s hard to calculate this correctly, because it depends on how big the effect we’re studying is, and we <em>don’t know how big it is</em> because we <em>haven’t done the study yet</em>. So the calculation is based on a certain amount of guesswork.<strong title="We can also base it on [how big of an effect we _care_ about]. If we're studying reaction times, we might decide that an effect smaller than ten milliseconds is irrelevant, and we don't care about it even if it's real. Then we can choose a study with enough power to detect a 10ms effect at least 80% of the time. But this brings us back to the core issue, that "is there an effect" just isn't a great question, and the Neyman-Pearson method isn't a great tool for answering it. "><sup id="fnref:6"><a href="#fn:6" class="footnote">6</a></sup></strong></p>
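<p>The calculation itself is straightforward; all the guesswork is in the effect size you plug in. Here’s the usual normal-approximation sketch for comparing the means of two groups:</p>

```python
from statistics import NormalDist

N = NormalDist()  # the standard normal distribution

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Rough sample size per group for a two-sample comparison of means,
    via the standard normal approximation. `effect_size` is Cohen's d:
    the difference in means divided by the standard deviation."""
    z = N.inv_cdf(1 - alpha / 2) + N.inv_cdf(power)
    return 2 * (z / effect_size) ** 2

print(round(n_per_group(0.5)))  # a "medium" effect: about 63 per group
print(round(n_per_group(0.1)))  # a small effect: about 1570 per group
```

Halving the guessed effect size quadruples the required sample, which is why hunting for small effects gets expensive so quickly.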
<p>Even if we do this calculation correctly, there’s a real chance that we have to run a really big experiment to get the power we want. (If we’re looking for a small effect, we may have to run a really, <em>really</em> big experiment.) And big experiments are expensive! A lot of researchers skip this step entirely, and just run whatever experiment they can afford, <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4961230/">regardless of how little power it has</a>.</p>
<p>And if the power is low enough, things get very dumb very quickly.</p>
<h3 id="we-need-more-power">We need more power!</h3>
<p>Let’s start by looking at what happens when the power is really, idiotically low. This graph shows what happens when you run an experiment with a power of \(0.06\), which means a false negative rate of \(94\)%. And there are three different problems that pop up.</p>
<p><img src="https://statmodeling.stat.columbia.edu/wp-content/uploads/2014/11/Screen-Shot-2014-11-17-at-11.19.42-AM.png" alt="A diagram of the effects of low-power studies.
This is what "power = 0.06" looks like. Get used to it.
Type S error probability: If the estimate is statistically significant, it has a 24% chance of having the wrong sign.
Exaggeration ratio: If the estimate is statistically significant, it must be at least 9 times higher than the effect size." class="blog-image center" /></p>
<p class="center"><em>Figure by <a href="https://statmodeling.stat.columbia.edu/2014/11/17/power-06-looks-like-get-used/">Andrew Gelman</a>.</em></p>
<p>The obvious problem is that even if the null hypothesis is wrong, we probably won’t reject it, because the data isn’t good enough to <em>show</em> that it’s wrong. Even if the null is false, we’ll fail to reject it \(94\)% of the time! (This is represented by the large white area in the middle of the graph.) But this, at least, is the process working as intended: our goal was to err on the side of not rejecting the null hypothesis, and that is in fact what we’re doing.</p>
<p>But there are two subtler problems, which cause more trouble than just a pile of inconclusive studies. We still manage to reject the null \(6\)% of the time, but because the study is so weak, this only happens when we get unusually lucky. And that happens when our data is much, <em>much</em> further away from the null hypothesis than it usually is. <strong>At a power of \(\mathbf{0.06}\), we only get a significant result when our measurement is around <em>nine times</em> as big as the true effect we want to measure.</strong> (This is the red region on the right of Gelman’s graph; he calls it a “Type M error”, for “magnitude”.)</p>
<p>This is a major culprit behind a lot of improbable ideas that come out of shoddy research. In my <a href="/blog/replication-crisis-math/">post on the replication crisis</a> I talked about how a lot of careless research starts out asking whether an effect exists, but finds an effect that’s <em>surprisingly large</em>, and then the story people tell is focused on the dramatic, unexpectedly large effect. But that drama is a necessary result of running underpowered studies.</p>
<p>The study of gubernatorial elections and life expectancy is a perfect example of this process. Just by looking at the graph, you can tell there probably isn’t a big effect. But researchers Barfort, Klemmensen and Larsen found a clever analysis<strong title="Clever analyses like this are often a bad idea; we'll come back to this idea [soon]."><sup id="fnref:7"><a href="#fn:7" class="footnote">7</a></sup></strong> that did <a href="https://www.cambridge.org/core/journals/political-science-research-and-methods/article/abs/longevity-returns-to-political-office/6205207F55C97729E66A8B08D7641572">produce a statistically significant result</a>—and claimed that the difference between narrowly winning and narrowly losing an election was <em>ten years</em> of lifespan. That’s far too large an effect to be believable, but any statistically significant result they got from that data set would have to be equally incredible.</p>
<p>Researchers are motivated to discover new and surprising things; and we, as news consumers, are most interested in new and surprising results. The wild overestimates that these low-power studies produce are surprising and counterintuitive precisely because they are <em>false</em>. And because they are surprising, they tend to draw public attention and show up in the news.</p>
<p>But a surprisingly large result isn’t as counterintuitive as one that’s the opposite of what you expect. (Imagine if a study “proved” that 5-year-olds are taller than 15-year-olds!) And low-power studies give us those results too.</p>
<p>Even if we’re studying something that really does (slightly) increase lifespan, we could get unusually <em>unlucky</em>, and randomly observe a bunch of people who die unusually early. If the data is noisy enough and we get unlucky enough, we can get statistically significant evidence that the effect decreases lifespan, when it really increases it.</p>
<p>We see this in the left tail of Gelman’s graph. <strong>When power is \(\mathbf{0.06}\), almost a quarter of statistically significant results will give you a large effect <em>in the wrong direction</em>.</strong> There’s a substantial chance that we get our result exactly backwards.</p>
<p>Now, a power of \(0.06\) is an extreme case, bad even by the usual standards of underpowered research. But the same problems come up with better-but-still-underpowered studies, just to a lesser degree. In fact, both effects are always <em>possible</em>, if your data is unlucky enough. But we’d much prefer having a \(0.1\)% chance of getting the direction of the effect wrong to having a \(24\)% chance. And the lower the power, the bigger an issue this is.</p>
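<p>You can watch all three problems emerge in a quick simulation. (A sketch: I’m assuming a true effect of \(2\) measured with standard error \(8.1\), which gives a power of about \(0.06\); the units don’t matter, only their ratio.)</p>

```python
import random

random.seed(0)

# A true effect of 2, measured with standard error 8.1, tested
# two-sided at alpha = 0.05. Only the ratio 2/8.1 matters here.
true_effect, se, crit = 2.0, 8.1, 1.96

significant = []
for _ in range(1_000_000):
    estimate = random.gauss(true_effect, se)  # one simulated study
    if abs(estimate) > crit * se:             # "statistically significant"
        significant.append(estimate)

power = len(significant) / 1_000_000
wrong_sign = sum(e < 0 for e in significant) / len(significant)
exaggeration = sum(abs(e) for e in significant) / len(significant) / true_effect

print(f"power ~ {power:.2f}, wrong sign ~ {wrong_sign:.0%}, "
      f"exaggeration ~ {exaggeration:.1f}x")
```

Almost every run gives a power near \(0.06\), a sign-error rate near \(24\)%, and significant estimates averaging around nine times the true effect, matching Gelman’s figure.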
<h3 id="file-drawer">The revenge of the file drawer</h3>
<p>There should be a saving grace here: if your study has low power, it’s unlikely to reject the null at all. We don’t have a \(24\)% chance of getting a statistically significant result in the wrong direction; because our power is only \(0.06\), we have a <em>six percent chance of having a \(24\)% chance</em> of getting a statistically significant result in the wrong direction. That’s less than two percent, in total.</p>
<p>But <strong>studies that don’t reject the null often don’t get published at all</strong>. There’s a good chance that the 94 studies that fail to reject the null get stuck in a file drawer somewhere; we’re left with a few studies that reject it, but wildly overestimate the effect, and one or two that reject the null in the wrong direction. When that’s all the information we have, it’s hard to figure out what’s really going on.</p>
<p>Let’s make another table of possible research findings, like the ones <a href="#most-findings-false">we used earlier</a> to see how the file-drawer effect works. But this time, instead of assuming a reasonable power of \(80\)%, let’s see what happens when the power is only \(20\)%. If half the hypotheses are true and half are false, we get something like this:</p>
<table>
<thead>
<tr>
<th> </th>
<th>Null is false</th>
<th>Null is true</th>
<th>Total</th>
</tr>
</thead>
<tbody>
<tr>
<td>Reject the null</td>
<td>20</td>
<td>5</td>
<td>25</td>
</tr>
<tr>
<td>Don’t Reject</td>
<td>80</td>
<td>95</td>
<td>175</td>
</tr>
<tr>
<td>Total</td>
<td>100</td>
<td>100</td>
<td>200</td>
</tr>
</tbody>
</table>
<p>With \(80\)% power, our false-positive rate was \(6\)%. But with \(20\)% power, we have \(20\) true positives and \(5\) false positives, and our false-positive rate has risen to \(5/25 = 20\)%.</p>
<p>And if we also suppose that our researchers are testing unlikely theories and so \(90\)% of null hypotheses are true, we get the following truly terrible table:</p>
<table>
<thead>
<tr>
<th> </th>
<th>Null is false</th>
<th>Null is true</th>
<th>Total</th>
</tr>
</thead>
<tbody>
<tr>
<td>Reject the null</td>
<td>4</td>
<td>9</td>
<td>13</td>
</tr>
<tr>
<td>Don’t Reject</td>
<td>16</td>
<td>171</td>
<td>187</td>
</tr>
<tr>
<td>Total</td>
<td>20</td>
<td>180</td>
<td>200</td>
</tr>
</tbody>
</table>
<p>Under these conditions we get \(9\) false positives and only \(4\) true positives, so almost \(70\)% of our positive results are false positives. If the only results we publish are these exciting positive results, then most published findings will, indeed, be false.</p>
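<p>Both of these tables come from the same simple calculation, just with different parameters; here it is as a short sketch:</p>

```python
def positive_results(studies, frac_null_true, alpha, power):
    """Split the null-rejecting ("positive") results into true and false."""
    null_true = studies * frac_null_true   # studies of non-effects
    null_false = studies - null_true       # studies of real effects
    true_pos = null_false * power          # real effects we detect
    false_pos = null_true * alpha          # nulls we wrongly reject
    return round(true_pos), round(false_pos)

# 20% power, half of all nulls true (the first table):
print(positive_results(200, 0.5, 0.05, 0.20))  # -> (20, 5)

# 20% power, 90% of nulls true (the second table):
tp, fp = positive_results(200, 0.9, 0.05, 0.20)
print(tp, fp, round(fp / (tp + fp), 2))        # -> 4 9 0.69
```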
<h3 id="the-problem-of-p-hacking-and-the-garden-of-forking-paths">The problem of \(p\)-hacking and the garden of forking paths</h3>
<p>It seems like we could fix this problem just by publishing null results as well. New norms like <a href="https://en.wikipedia.org/wiki/Preregistration_(science)">preregistration of studies</a> and institutions like <a href="https://www.jasnh.com">The Journal of Articles in Support of the Null Hypothesis</a> try to combat the file drawer bias by publishing studies that don’t reject the null, or at least letting us know they happened so we can count them. If we publish just a quarter of null results, then even under the bad assumptions of the last table we get something like this:</p>
<table>
<thead>
<tr>
<th> </th>
<th>Null is false</th>
<th>Null is true</th>
<th>Total</th>
</tr>
</thead>
<tbody>
<tr>
<td>Reject the null</td>
<td>4</td>
<td>9</td>
<td>13</td>
</tr>
<tr>
<td>Don’t Reject, but Publish</td>
<td>4</td>
<td>43</td>
<td>47</td>
</tr>
<tr>
<td>Don’t Reject or Publish</td>
<td>12</td>
<td>128</td>
<td>140</td>
</tr>
<tr>
<td>Total</td>
<td>20</td>
<td>180</td>
<td>200</td>
</tr>
</tbody>
</table>
<p>We see \(60\) published results. The \(4\) results where the null is false and we reject it are correct, as are the \(43\) where the null is true and we don’t reject it, so \(47\) of the \(60\) published results (nearly \(80\)%) will be true. If we publish more null results, this number only gets better.</p>
<p>But that doesn’t address the fundamental problem, which is that <em>researchers want to discover new, interesting things</em>. <strong>The fact that we mostly publish positive results that reject the null isn’t some accident of history; it’s a result of people trying to show that their ideas are correct.</strong></p>
<p>Since people want to reject the null hypothesis, they’ll work hard to find ways to do this. When done deliberately, this behavior is a form of research misconduct known as <a href="https://twitter.com/ephemeralidea/status/1504459823554908163">\(p\)-hacking</a> or <a href="https://en.wikipedia.org/wiki/Data_dredging">data dredging</a>. There are a variety of sketchy ways to tweak your statistical analysis to get an artificially low \(p\)-value. The most famous version is just running a bunch of experiments and <a href="https://imgs.xkcd.com/comics/significant.png">only reporting the ones with low \(p\)-values</a>.</p>
<p>Somewhat less famous, and less obvious, is the possibility of running one experiment, and then trying to <em>analyze</em> that data in a bunch of different ways and picking the one that makes your position look the best. We actually saw an example of this in <a href="hypothesis-testing-part-1#mileage">part 1</a> of this series, when I looked at my car’s gas mileage. I computed the \(p\)-value in two different ways, and got either \(0.0006\) or \(0.00004\). Either one of these is significant, but if they had been \(0.06\) and \(0.004\) instead, I could have just reported the second one and said “hey look, my data was significant!”</p>
<p>Moreover, it’s pretty common for people to look for secondary, “interaction” effects after looking for a main effect. Sure, watching a five-minute video didn’t have a statistically significant effect on depression in your study group. But maybe it worked on just the women? Or just the Asians? What if we control for income? You can check all the subgroups of your study, and whichever one reaches significance is <em>obviously</em> the interesting one.</p>
<p><a href="https://xkcd.com/1478/"><img src="https://imgs.xkcd.com/comics/p_values.png" alt="XKCD comic, translating p-values into verbal interpretations: "highly significant", "significant", "on the edge of significance". For a high p-value the interpretation is "hey, look at this interesting subgroup analysis"." class="blog-image center" /></a>
<em class="blog-image center">Sometimes your treatment really does have an effect on one specific subgroup. But it’s also an easy out when your main study didn’t reach significance.</em></p>
<p>Doing multiple subgroup analyses but reporting only one is still research misconduct, if done on purpose. But <strong>it’s possible to get the same effect without actually performing multiple analyses, in a process that Andrew Gelman and Eric Loken call the <a href="https://www.americanscientist.org/article/the-statistical-crisis-in-science">garden of forking paths</a>.</strong></p>
<p>Researchers often make decisions about how to test the data after looking at it for broad trends. If they notice one subgroup obviously sticking out, maybe they want to test it. Or they can tweak some minor parameters, decide to include or exclude outliers, and consider a few minor variations in the way they divide subjects into categories. This is all a reasonable way of looking at data, but it’s a violation of the rules of hypothesis testing, and has the same basic effect as running a bunch of experiments and only reporting the best one.</p>
<p>Most subtly, sometimes more than one pattern will provide support for the researcher’s hypothesis. We generally don’t actually care about specific statistical relationships; we care about broader questions, like “does media consumption affect rates of depression?”<strong title="This difference is the source of a lot of research pitfalls; if you want to dig into this more, I recommend [Tal Yarkoni] on generalizability, [Vazire, Schiavone, and Bottesini] on the four types of validity, and [Scheel, Tiokhin, Isager, and Lakens] on the derivation chain."><sup id="fnref:8"><a href="#fn:8" class="footnote">8</a></sup></strong> <strong>We run specific experiments in order to test these broad questions. And if there are, say, twenty different outcomes that would support our broad theoretical stance, it doesn’t help us very much that each one only has \(\mathbf{5}\)% odds of happening by chance.</strong></p>
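<p>The arithmetic behind that warning is simple. If any one of \(k\) outcomes would count as a positive finding, the chance that at least one of them clears the \(5\)% bar by luck alone grows quickly (a sketch, assuming the outcomes are independent):</p>

```python
# Chance of at least one false positive among k independent tests,
# each with a 5% false positive rate.
alpha = 0.05
for k in (1, 5, 20):
    print(k, round(1 - (1 - alpha) ** k, 2))
# -> 1 0.05
# -> 5 0.23
# -> 20 0.64
```

With twenty acceptable outcomes, a researcher studying nothing at all has roughly even odds of finding “support” for their theory.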
<p>Gelman and Loken describe how this applies to research by Daryl Bem, which claims to provide strong evidence for ESP.<strong title="Scott Alexander [has pointed out] that ESP experiments are a great test case for our scientific and statistical methods, because we have extremely high confidence that we already know the true answer."><sup id="fnref:9"><a href="#fn:9" class="footnote">9</a></sup></strong></p>
<blockquote>
<p>In his first experiment, in which 100 students participated in visualizations of images, he found a statistically significant result for erotic pictures but not for nonerotic pictures….</p>
</blockquote>
<blockquote>
<p>But consider all the other comparisons he could have drawn: If the subjects had identified all images at a rate statistically significantly higher than chance, that certainly would have been reported as evidence of ESP. Or what if performance had been higher for the nonerotic pictures? One could easily argue that the erotic images were distracting and only the nonerotic images were a good test of the phenomenon. If participants had performed statistically significantly better in the second half of the trial than in the first half, that would be evidence of learning; if better in the first half, evidence of fatigue.</p>
</blockquote>
<blockquote>
<p>Bem insists his hypothesis “was not formulated from a post hoc exploration of the data,” but a data-dependent analysis would not necessarily look “post hoc.” For example, if men had performed better with erotic images and women with romantic but nonerotic images, there is no reason such a pattern would look like fishing or p-hacking. Rather, it would be seen as a natural implication of the research hypothesis, because there is a considerable amount of literature suggesting sex differences in response to visual erotic stimuli. The problem resides in the one-to-many mapping from scientific to statistical hypotheses.</p>
</blockquote>
<p>We even saw an example of forking paths earlier in this essay, in the <a href="#where-does-power-come-from">study of gubernatorial lifespans</a>. I said the study found a clever analysis to get a significant result. In the data set we saw from Barfort, Klemmensen, and Larsen, the obvious tests like linear regression don’t show any effect of winning margin on lifespan.</p>
<p class="blog-image center"><img src="/assets/blog/hypothesis-testing/governor-life-expectancy-loess.png" alt="The same scatterplot of "Percentage vote margin" on the x-axis and "Years alive after election" on the y-axis. This time a best-fit loess curve is drawn through the data; it again shows no real relationship." class="blog-image center" />
<em>A loess curve is a more sophisticated version of linear regression. It doesn’t show a clear relationship between electoral margin and lifespan. Graph again <a href="https://statmodeling.stat.columbia.edu/2020/07/02/no-i-dont-believe-that-claim-based-on-regression-discontinuity-analysis-that/">by Andrew Gelman</a>.</em></p>
<p>But if you average different candidates with the same electoral margin together, divide them into a group of winners and a group of losers, and then do a regression on each group separately, the two regressions suggest that barely winning a race improves life expectancy, versus barely losing.</p>
<p class="center"><img src="/assets/blog/hypothesis-testing/governor-regression-discontinuity.png" alt="A figure from the Barfort, Klemmensen, and Larsen paper on gubernatorial elections and lifespan, showing their regression discontinuity analysis. It shows lifespan decreasing with increased voteshare, except with a large upwards discontinuity at the crossover from losing to winning." class="blog-image" /></p>
<p class="center blog-image"><em>The discontinuity between the two lines is large enough to be “statistically significant”. But does the data on the right really look qualitatively different from the data on the left?</em></p>
<p>This <a href="https://en.wikipedia.org/wiki/Regression_discontinuity_design">regression discontinuity design</a> isn’t a ridiculous approach to the question, but it’s also probably not the first idea you’d think of. And the paper’s own abstract says they’re not sure which way the effect should run, so <em>any pattern at all</em> would provide support for their research hypothesis. This is a subtle but crucial violation of the hypothesis testing framework, and dramatically inflates the rate of “positive” results.</p>
<h2 id="sowhy-does-science-work-at-all">So…why does science work <em>at all</em>?</h2>
<p>Hopefully I’ve convinced you, first, that the tools of modern hypothesis testing are badly suited for the questions we want them to answer, and second, that the structure of our scientific institutions leads us to regularly misuse them in ways that make them even more misleading. So then, how do we manage to learn anything at all?</p>
<p>Sometimes we don’t! The whole point of the “replication crisis” is that we’re almost having to throw out entire fields wholesale. <strong>When I hear about a promising new drug, or a cool new social psychology study, I <em>assume it’s bullshit</em>, because so many of them are. And that’s a real crisis for the whole idea of “scientific knowledge”.</strong></p>
<p>But in many fields of study we do, in fact, manage to learn things. We know enough physics and chemistry to build things like spaceships and smartphones. And even though a lot of drug studies are nonsense, modern medicine does in fact work.</p>
<p class="center"><img src="/assets/blog/hypothesis-testing/life-expectancy-at-age-10.png" alt="A graph from Our World In Data of life expectancy at age ten in various countries, from 1750 to the present. There is a dramatic increase over the 20th century." class="blog-image" /></p>
<p class="center blog-image"><em>We didn’t increase life expectancy by almost thirty years without learning</em> something <em>about biology.</em></p>
<p>And even in more vulnerable fields like psychology and sociology, we have developed a lot of consistent, replicable, useful knowledge. How did we get that to work, despite our shoddy statistics?</p>
<h3 id="inter-ocular-trauma">Inter-ocular trauma</h3>
<p>If your data are good enough, you can get away with having crappy statistics. One of the best and most useful statistical tools is what Joe Berkson called the <a href="https://stats.stackexchange.com/questions/458069/source-for-inter-ocular-trauma-test-for-significance">inter-ocular traumatic test</a>: “you know what the data mean when the conclusion hits you between the eyes”.</p>
<p><a href="https://xkcd.com/2400/"><img src="https://imgs.xkcd.com/comics/statistics.png" alt="XKCD 2400: graph of covid vaccine efficacy versus placebo. "Statistics tip: always try to get data that's good enough that you don't need to do statistics on it."" style="max-width:800px;" class="blog-image center" /></a></p>
<p class="center blog-image"><em>I didn’t worry that</em> this <em>result was bullshit statistical trickery, because I can easily see the evidence for myself.</em></p>
<p>Conversely, if your data isn’t very good, statistics can’t help you with it very much. John Tukey <a href="https://doi.org/10.2307/2683137">famously wrote</a>:</p>
<blockquote>
<p>The data may not contain the answer. The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.</p>
</blockquote>
<p>None of this means statistics is useless. But if we can consistently get good, high-quality data, we can afford a little sloppiness in our statistical methodology.</p>
<h3 id="putting-the-replication-in-replication-crisis">Putting the “replication” in “replication crisis”</h3>
<p>And this is where the “replication” half of “replication crisis” comes in. <strong>If the signal you’re detecting is real, you can run another experiment, or do another study, and (probably) see the same thing.</strong> In my <a href="https://jaydaigle.net/blog/replication-crisis-math/">post on the replication crisis</a> I wrote about how mathematicians are constantly replicating our important results, just by reading papers; and that protects us from a lot of the flaws plaguing social psychology.</p>
<p>Gelman recently <a href="https://statmodeling.stat.columbia.edu/2022/03/04/biology-as-a-cumulative-science-and-the-relevance-of-this-idea-to-replication/">made a similar point</a> about fields like biology. Because wet lab biology is cumulative, people are continually replicating old work in the process of trying to do new work. A boring false result can survive for a long time, if no one cares enough to use it; an exciting false result will be exposed quickly when people try to build on it and it collapses under the strain.</p>
<p>This is something Fisher himself wrote about clearly and firmly: “A scientific fact should be regarded as experimentally established only if a properly designed experiment <em>rarely fails</em> to give this level of significance”. That is, we shouldn’t accept a result when we successfully do <em>one</em> experiment that produces a low \(p\)-value; but we should listen when we can <em>consistently</em> do experiments with low \(p\)-values.</p>
<p><strong>But the entire concept of “replication” is in opposition to the artificial decisiveness of Neyman-Pearson hypothesis testing.</strong> The Neyman-Pearson method, if taken seriously, asks us to fully commit to believing a theory if our experiment comes up with \(p=0.049\); but that attitude is <em>utterly terrible science</em>. Good scientific practice <em>needs</em> to be able to hold beliefs lightly, revise them when new evidence comes in, and carefully build up solid foundations that can support further work.</p>
<p>The standard approach to hypothesis testing isn’t designed for that. Next time, in <a href="https://jaydaigle.net/blog/hypothesis-testing-part-3/">part 3</a>, we’ll look at some tools that are.</p>
<hr />
<p><em>Have questions about hypothesis testing? Is there something I didn’t cover, or even got completely wrong? Or is there something you’d like to hear more about in the rest of this series? Tweet me <a href="https://twitter.com/profjaydaigle">@ProfJayDaigle</a> or leave a comment below.</em></p>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>This is the probability of getting data five standard deviations away from the mean. So you’ll often see this reported as a significance threshold of \(5 \sigma\). Related is the <a href="https://en.wikipedia.org/wiki/Six_Sigma">Six Sigma techniques</a> for ensuring manufacturing quality, though somewhat counterintuitively they typically only aim for <a href="https://en.wikipedia.org/wiki/Six_Sigma#Role_of_the_1.5_sigma_shift">4.5 \(\sigma\)</a> of accuracy. <a href="#fnref:1" class="reversefootnote">↵Return to Post</a></p>
</li>
<li id="fn:2">
<p>It is common for people to be sloppy here and say they “accept” the null. In fact, I wrote that in my first draft of this paragraph. But it’s bad practice to say that, because even a very high \(p\)-value doesn’t provide good evidence that the null hypothesis is true. Our methods are designed to default to the null hypothesis when the data is ambiguous.</p>
<p>Neyman <em>did</em> use the phrase “accept the null”, but in the context of a decision process, where “accepting the null” means taking some specific, concrete action implied by the null, rather than more generally committing to believe something. <a href="#fnref:2" class="reversefootnote">↵Return to Post</a></p>
</li>
<li id="fn:3">
<p>Andrew Gelman suggests a helpful <a href="https://statmodeling.stat.columbia.edu/2016/01/26/more-power-posing/">time-reversal heuristic</a>: what would you think if you saw the same studies in the opposite order? You’d start with a few large studies establishing no effect, followed by one smaller study showing an effect. In theory that gives you the exact same information, but in practice people would treat it very differently—assuming the first studies <a href="#file-drawer">actually got published</a>. <a href="#fnref:3" class="reversefootnote">↵Return to Post</a></p>
</li>
<li id="fn:4">
<p>You might recognize this as an application of Bayes’s theorem, and a basic example of <a href="https://jaydaigle.net/blog/overview-of-bayesian-inference/">Bayesian inference</a>. Tables like these are very common in Bayesian calculations. <a href="#fnref:4" class="reversefootnote">↵Return to Post</a></p>
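<p>The arithmetic behind such a table is short enough to sketch directly. The numbers below are hypothetical, chosen only for illustration: a \(10\)% base rate of true hypotheses, \(80\)% power, and the usual \(5\)% significance threshold.</p>

```python
# Hypothetical numbers, for illustration only: a 10% base rate of true
# hypotheses, 80% power, and a 5% significance threshold.
base_rate, power, alpha = 0.10, 0.80, 0.05

true_positives = base_rate * power         # real effects that reach significance
false_positives = (1 - base_rate) * alpha  # null effects that reach significance anyway

# Bayes's theorem: P(effect is real | significant result)
posterior = true_positives / (true_positives + false_positives)
print(f"{posterior:.2f}")  # 0.64: over a third of significant results are false
```

<p>Under these assumptions, a “statistically significant” result is real only about \(64\)% of the time, which is the style of argument behind the replication-rate estimates in the next footnote.</p>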
</li>
<li id="fn:5">
<p>Follow-ups to Ioannidis’s paper contend that only about \(14\)% of published biomedical findings are actually false. I’m not in a position to comment on this one way or the other. In psychology, different studies estimate that somewhere <a href="https://en.wikipedia.org/wiki/Replication_crisis#In_psychology">between \(36\)% and \(62\)%</a> of published results replicate. <a href="#fnref:5" class="reversefootnote">↵Return to Post</a></p>
</li>
<li id="fn:6">
<p>We can also base it on <a href="https://twitter.com/lakens/status/1524799540250959873">how big of an effect we <em>care</em> about</a>. If we’re studying reaction times, we might decide that an effect smaller than ten milliseconds is irrelevant, and we don’t care about it even if it’s real. Then we can design a study with enough power to detect a \(10\)<em>ms</em> effect at least \(80\)% of the time.</p>
<p>But this brings us back to the core issue, that “is there an effect” just isn’t a great question, and the Neyman-Pearson method isn’t a great tool for answering it. <a href="#fnref:6" class="reversefootnote">↵Return to Post</a></p>
</li>
<li id="fn:7">
<p>Clever analyses like this are often a bad idea; we’ll come back to this idea <a href="#file-drawer">soon</a>. <a href="#fnref:7" class="reversefootnote">↵Return to Post</a></p>
</li>
<li id="fn:8">
<p>This difference is the source of a lot of research pitfalls; if you want to dig into this more, I recommend <a href="https://psyarxiv.com/jqw35">Tal Yarkoni</a> on generalizability, <a href="https://psyarxiv.com/bu4d3/">Vazire, Schiavone, and Bottesini</a> on the four types of validity, and <a href="https://journals.sagepub.com/doi/10.1177/1745691620966795">Scheel, Tiokhin, Isager, and Lakens</a> on the derivation chain. <a href="#fnref:8" class="reversefootnote">↵Return to Post</a></p>
</li>
<li id="fn:9">
<p>Scott Alexander <a href="https://slatestarcodex.com/2014/04/28/the-control-group-is-out-of-control/">has pointed out</a> that ESP experiments are a great test case for our scientific and statistical methods, because we have extremely high confidence that we already know the true answer. <a href="#fnref:9" class="reversefootnote">↵Return to Post</a></p>
</li>
</ol>
</div>