The anachronisms in the Abagnale book are more dramatic, in part because the book is older (from 2001), and in part because Abagnale is more of a sensationalist to begin with. (It seems a lot of the claims he made about his most impressive capers are less than accurate—which makes sense coming from a successful con man!)
He takes pains to explain cutting-edge technology, like color scanners and laser printers. Then he warns about people using them to forge store gift certificates, when I’m not sure I’ve seen an actual printed gift certificate (as opposed to a gift card) in years. He talks about scanning and printing near-perfect replicas of US currency, which is no longer possible. He describes the exciting new security features in the redesign of the twenty-dollar bill, which I just barely remember being introduced.
But the most jarring bits are in his first real chapter, about check forgery. Partly, again, the technology has gotten better. He complains that many companies print checks on “that familiar blue or green basketweave check paper” you can buy at any office supply store. But it’s not really familiar to me! Instead I just take for granted that all checks have the fancy new security features he’s advocating.
But moreover, he’s amazed that stores accept checks without checking the signature—whereas I’m amazed that stores accept checks at all! He raises the possibility of paper checks dying out, only to dismiss it:
I’ll be long dead, even if I live to a ripe old age, before checks will ever disappear. The amount of checks we write is growing at a rate of more than a billion checks a year. So they’re not even declining in use. They’re growing. I remember fifteen years ago, when we were writing 40 billion checks a year, people said it would never reach 50 billion, and now we’re at almost 70 billion. People happen to like checks. They’re familiar. Many consumers will say, “I like this check. It has some float to it. I like that much better than when the bank immediately goes into my checking account and takes the money out. I also like the idea that I can get the check back and see who I wrote it to and have a record of it.” And we have a very large generation that is not comfortable with smart cards and electronics. They’re leery of new ways of payments, and they don’t fully grasp them.
Electronic banking is still much more of an unknown frontier. And there’s no forgetting the billions of dollars that banks have invested in electronic readers, sorters, and other check processing equipment. We’re not going to just scrap it and plow money into home banking. There are banks out there pushing electronics, but there are a lot of other banks that would just as soon stay with checks.
And that’s all very convincing, except for one thing:
Data collected from https://www.federalreserve.gov/paymentsystems/frps_previous.htm
I can’t explain the discrepancy between Abagnale’s numbers and the Fed’s. Abagnale doesn’t cite a source, and while the Fed is pretty clear that its numbers aren’t totally solid, I know I trust them more. But it sure looks like The Art of the Steal was written at nearly the exact peak of US check-writing. The book is confidently asserting that checks would never fade—to a present-day audience which knows they’re well on their way out.^{1}
Joseph Heath wrote a much more serious book, and a much better researched one. It’s also much more recent, from just 2014. But that makes the anachronisms more disconcerting.
The first thing that really surprised me is his discussion of computer chess. Heath argues (correctly!) that people don’t think in a purely linear, logical-deductive manner, but instead rely on a lot of shortcuts and heuristics. He illustrates the difference by contrasting the human approach to chess-playing with the approach of chess computers like Deep Blue. Computers, he explains, are analyzing millions of branches of the chess decision tree; in contrast, human grandmasters rely on “a heuristic pruning of the decision tree, guided by an intuitive sense of what seem to be the most promising moves or of what sort of position they want on the board.” He goes on to observe that
[N]o one is able to articulate how this initial pruning is done. It is all based on “feel.” … To this day, no one has ever succeeded in reproducing the intuitive style of thinking in a computer, simply because we don’t know how it is done (despite the fact that we ourselves do it)…. The fact that this much computing power can be deployed without yet achieving the “final, generally accepted, victory over the human” is a monument to the power and sophistication of nonrational thought processes in the human mind.
Three years later, Google’s DeepMind unveiled the AlphaZero engine, which uses modern machine learning techniques to do heuristic pruning very similar to what humans do, avoiding the need to crunch through the entire decision tree. To the best of my knowledge, every top chess program now uses these neural network-based heuristics.
I don’t bring this up to criticize Heath. He was correct when he was writing; and his main point is still correct, since he was mostly trying to explain how human thought works, not how to write a chess program. But it’s definitely a moment where I paused and was thrown out of the argument, because my first reaction was “but that isn’t true!” With a belated followup of “…any more”.
But there’s another bit that seems far more jarring and anachronistic today, even though it also seems prescient. Heath writes as an unapologetic liberal^{2}, and his project is to build a modern, renewed liberal politics. So he sets the stage for his argument by discussing some of the problems he sees in the modern Republican party.
The big tent of the American right has always sheltered its share of crazies… There came a point, however, when the sideshow began to take over center stage. Americans woke up to find that their political system was increasingly divided, not between right and left, but between crazy and non-crazy. And what’s more, the crazies seemed to be gaining the upper hand.
He later observes that the American right “always seem to be very angry”, and that
there has also been a significant rise in the amount of bullshit. Lying for political advantage, of course, is as old as the hills. What has changed is that politicians used to worry about getting caught.
He is, of course, describing the 2012 campaign that pitted Mitt Romney against Rick Santorum in the primary and Barack Obama in the general election.
Ten years later, I’m not sure whether to read Heath’s writing as prescient or naive. He forecast the shape of Trumpian politics nearly perfectly, so in that sense he was clearly on to something. But it’s disconcerting to remember a time when we might have viewed Romney and Santorum as shockingly out-of-bounds artists of bullshit.
So those are two different books I’m reading, which both aged surprisingly quickly. I don’t have any grand takeaways from this, or anything. But it’s interesting to see just how unpredictable trends can be. Sometimes they keep going much further than you think they can. And other times, when they seem like they’ll last forever, they stop almost without warning.
What else has aged surprisingly quickly—or surprisingly well? Tweet me @ProfJayDaigle, BlueSky me @profjaydaigle.bsky.social, or leave a comment below.
In Abagnale’s defense, he only claims they won’t disappear, and indeed they haven’t. But the dynamics of check-cashing today are radically different from the dynamics he describes, and his prediction that banks will keep avoiding electronic banking seems particularly off the mark.
In Abagnale’s offense, he has a comment a few chapters later that his children don’t like writing checks and he thinks it’s a generational thing. So he could have seen it coming.
In both senses of the term; he opposes the political right, but he also isn’t a leftist.
In the first few essays in this series, we saw two different approaches to finding new types of numbers. But they gave us different—and overlapping, but distinct—sets of numbers. Today we’ll see what happens when we combine both techniques, and develop the complex numbers. This won’t finish our quest to find weird numbers that mathematicians care about; far from it. But it will finish one line of questions, and cover pretty much everything we normally see in high school algebra and calculus.
But before I start, I want to take a moment to thank everyone who has donated to my Ko-Fi account. Tips are never necessary, but always appreciated, and they really do make a difference and help me to keep writing essays like this one.
In part 1, we started with the natural numbers, which are the basic numbers we use to count. Using basic arithmetic operations, we introduced negative numbers to get the integers, then fractions to get the rational numbers. We ended by asking all polynomial equations to have solutions, which gave us the algebraic numbers. These include square roots and cube roots of all the rational numbers, and also some stranger things like the solutions to \(x^5+x+3=0\). This gave us a set that was algebraically closed: any polynomial equation defined with algebraic numbers will have a solution that is an algebraic number. So algebraic tools couldn’t push us any farther.
In part 2 we asked a different question, about measurement and approximation. We wanted areas and lengths to all correspond to numbers, and this led to the idea of completeness, where any number we can approximate with rational numbers should actually exist. Completing the rational numbers gave us the real numbers. We might call this the analytic approach to extending the rationals, in contrast to the algebraic approach of part 1.
In part 3 we showed that not every real number is algebraic; in particular \(\pi\) is a transcendental number, which isn’t the solution to any polynomial equation. But more generally, we showed that the algebraic numbers are countable, which means we can describe any one of them with a finite amount of information, but the real numbers are uncountable, which means it takes an infinite amount of information to describe most of them. There aren’t just more real numbers than algebraic numbers; there are infinitely more.
But that doesn’t mean the real numbers cover everything! There are algebraic numbers that aren’t real numbers. And there are real polynomials that don’t have real solutions. So what happens if we start with the real numbers and do part 1 again? Can we get a field with the completeness of the reals, but also the nice algebraic closure of the algebraic numbers?
How do we know there are algebraic numbers that aren’t real?
We can start with the quadratic polynomial equation \(x^2+1=0\). This is defined entirely with real numbers. But when we graph the function \(x^2+1\), we see it doesn’t cross the \(x\)-axis, which means that \(x^2+1=0\) doesn’t have a real solution.
We maybe should have expected this. We know that \(\sqrt{2}\) is real, because we can find rational numbers whose squares are between \(1\) and \(2\), or between \(1.9\) and \(2\), or between \(1.99999\) and \(2.\) That gives us a sequence of numbers that approximates \(\sqrt{2}\), and thus \(\sqrt{2}\) must be real. But we can’t do the same trick for \(−1\): no rational number has a square less than zero, so we can’t find anything that’s close to the square root of \(-1\).
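One way to produce those approximating rationals is simple bisection. Here’s a quick sketch in Python, using the exact `Fraction` type so that every approximation really is a rational number (the setup and names are my own illustration, not anything from the essay):

```python
from fractions import Fraction

# Bisect with exact rational arithmetic: keep a rational lo with
# lo^2 < 2 and a rational hi with hi^2 > 2, and halve the gap.
lo, hi = Fraction(1), Fraction(2)
for _ in range(30):
    mid = (lo + hi) / 2
    if mid * mid < 2:
        lo = mid
    else:
        hi = mid

# lo and hi are rationals squeezing sqrt(2) from both sides.
print(float(lo), float(hi))
```

After thirty steps the gap between `lo` and `hi` is exactly \(1/2^{30}\), so both rationals agree with \(\sqrt{2}\) to nine decimal places. No analogous squeeze is possible for \(-1\), since no rational square is ever negative.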
But we can see this more directly by using the core principles of the real numbers: they’re a complete ordered field. Since they’re ordered, every (non-zero) number must be either positive or negative. Since they’re an ordered field, the product of two positive numbers must be positive, and the product of two negative numbers must also be positive.
So suppose we have a number \(i\) that solves this equation. Then \(i^2 = -1\), which means \(i\) can’t be positive, and also can’t be negative. It’s clearly not zero. So it can’t be a real number at all. But it’s definitely algebraic: it’s the solution to \(x^2+1=0\).
Can we find other non-real algebraic numbers? Sure! There’s \(2i\) and \(3i\) and \(1+i\) and…. We can use \(i\) to build lots more non-real numbers.
But that’s it. It turns out that if we take the real numbers, and then add in everything we can build with the number \(i\), we have all the algebraic numbers. And in fact we have the solution to any polynomial we can write down with real numbers. This gives us everything we could ever want.^{1} But to see why this gets us everything, we’ll need to take a bit of a detour.
We want to look at all the numbers we can build by combining the real numbers and \(i.\) These numbers will all look like \(a + bi\) where \(a\) and \(b\) are real numbers.^{2} And we call the set of all these things the complex numbers, abbreviated \(\mathbb{C}.\) If we have a complex number \(z = a + bi\) then we say the real number \(a\) is the real part and the real number \(b\) is the imaginary part.
Remember our goal was to extend the real numbers to something algebraically nice. So we should start by making sure that we can still do arithmetic operations—that complex numbers are a field. Now, addition and subtraction are fine, since we can use the rules \[ \begin{aligned} (a+bi) + (c+di) & = (a+c) + (b+d) i \\ (a+bi) - (c+di) & = (a-c) + (b-d) i . \end{aligned} \] Multiplication is also pretty straightforward. By FOILing we get \[ \begin{aligned} (a+bi)(c+di) & = ac + adi + bci + bdi^2 \\ & = ac + adi + bci + bd(-1) \\ & = (ac - bd) + (ad +bc)i \end{aligned} \] so if we multiply two complex numbers, we get another.
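As a sanity check, we can compare this FOIL rule against Python’s built-in complex type, which implements exactly this arithmetic (the helper name `multiply` is just for illustration):

```python
def multiply(a, b, c, d):
    """Multiply (a + bi)(c + di) using the rule (ac - bd) + (ad + bc)i."""
    return (a * c - b * d, a * d + b * c)

a, b, c, d = 2.0, 3.0, -1.0, 4.0
re, im = multiply(a, b, c, d)
builtin = complex(a, b) * complex(c, d)
print((re, im), builtin)  # (-14.0, 5.0) and (-14+5j)
```

Working it by hand: \((2+3i)(-1+4i) = -2 + 8i - 3i + 12i^2 = -14 + 5i\), matching both computations.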
Division is a little trickier; we don’t have a good way to distribute something like \( \frac{a+bi}{c+di}. \) Here we need to be clever, and maybe start by asking a new question that introduces a second big idea.
We defined \(i\) to be the square root of \(-1\). That is, \(i^2=-1\) is the definition of the number \(i.\) But what happens if we square the number \(-i\)? We have \[ (-i)^2 = (-1)^2 (i)^2 = 1 \cdot (-1) = -1. \] So we have two different numbers that both satisfy our equation \(x^2 = -1.\) How do we know which is the “positive” \(i\), and which is the “negative” \(-i\)?
And the answer is that there’s no real difference! A positive number like \(4\) has two square roots, \(2\) and \(-2\), and since they’re both real numbers one is positive and the other is negative. A negative number like \(-1\) will also have two square roots, but since they aren’t real numbers, neither one of them is actually positive. We just pick one to call \(i\), and call the other one \(-i\)—but it doesn’t matter which one is which. And that means that if we swap \(i\) and \(-i\), nothing else should change. Thus we can define an operation called complex conjugation by the rule \[ \overline{a + bi} = a - bi. \] This operation swaps \(i\) with \(-i,\) without changing anything else about our number.^{3}
But the complex conjugate has another useful property. What happens if we multiply a number by its own conjugate? We get \[ \begin{aligned} (a+bi) \overline{(a+bi)} &= (a+bi)(a-bi) \\ &= a^2 +abi - abi - b^2 i^2 \\ &= a^2 - b^2 (-1) \\ &= a^2+b^2. \end{aligned} \] If we multiply any complex number by its conjugate, we get a real number—and in fact, a positive real number, as long as we didn’t start with 0.
And this gives us a way to do complex-number division, by turning it into real-number division: \[ \begin{aligned} \frac{a+bi}{c+di} & = \frac{a+bi}{c+di} \frac{c-di}{c-di} \\ & = \frac{ (ac +bd) + (bc - ad)i}{c^2 + d^2} \\ & = \frac{ac+bd}{c^2+d^2} + \frac{bc-ad}{c^2+d^2} i. \end{aligned} \] So we can in fact divide by any non-zero complex number. This means we can do basic arithmetic, and thus the complex numbers are a field.
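Here’s the conjugate trick as a small Python sketch; `divide` is a hypothetical helper of mine, checked against the built-in complex division:

```python
def divide(a, b, c, d):
    """Divide (a + bi) / (c + di) by multiplying top and bottom by c - di."""
    denom = c * c + d * d  # (c + di)(c - di) = c^2 + d^2, a real number
    return ((a * c + b * d) / denom, (b * c - a * d) / denom)

re, im = divide(1.0, 2.0, 3.0, -4.0)
print(complex(re, im), complex(1, 2) / complex(3, -4))  # both -0.2 + 0.4i
```

The key move is that the denominator becomes the real number \(c^2 + d^2\), so the division we actually perform is ordinary real-number division.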
And like the real numbers, they’re complete. The simplest way to think about this: we can think of a complex number \(z = a +bi\) as a pair of real numbers \(a\) and \(b\). So a sequence of complex numbers is basically just two sequences of real numbers, and we know that sequences of real numbers behave well. So any complex number that we can approximate has to actually exist; there aren’t any holes.
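We can see this pairing in action: a complex sequence converges exactly when its real and imaginary parts both converge. A quick Python illustration (the sequence is my own toy example):

```python
# z_n = (1 + 1/n) + (1/n)i: the real parts approach 1, the imaginary
# parts approach 0, so the complex sequence approaches 1.
terms = [complex(1 + 1 / n, 1 / n) for n in range(1, 10001)]

# The distance |z_n - 1| = sqrt(2)/n shrinks toward zero.
print(abs(terms[0] - 1), abs(terms[-1] - 1))
```

Tracking the two real sequences separately is all it takes; the limit \(1\) exists because each real coordinate’s limit exists.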
So while the reals are the unique complete ordered field, the complex numbers are a complete unordered field, which contains all the reals. And by giving up the order, we hope to get something else: every complex polynomial has a complex number solution. Once we take the real numbers and add in \(i\) there’s nothing left to algebraically add.
But it’s not obvious why that’s true. How do we know there’s not some polynomial equation we haven’t thought of, that doesn’t have a solution even in the complex numbers? To answer this, we need to turn to geometry.
If we have a pair of real numbers, we can graph it on a plane, using the first number for the horizontal coordinate and the second number for the vertical coordinate. But a complex number \(z = a +bi\) is a pair of real numbers. And that means that, just like we can think of the real numbers as forming a line:
we can think of the complex numbers as forming a plane:
There are a lot of geometric ideas we can poke at here; for instance, complex numbers give us a useful way to talk about angles. But I’m not going to pursue that here, since it doesn’t help answer our current question.
But distances and sizes will be extremely useful. So let’s think about those: if we have a number \(z = a+bi\), let’s figure out how far away from the origin at \(0\) it is. The \(x\)- and \(y\)-coordinates are \(a\) and \(b\), so we have a triangle with side lengths \(a\) and \(b\). By the Pythagorean theorem, the length of the hypotenuse, and thus the distance from the origin, is \(\sqrt{a^2+b^2}\).
So far, we haven’t used the fact that we have complex numbers running around. But if we remember the calculations we did with the complex conjugate, we might notice that \[ a^2+b^2 = (a+bi)(a-bi) = (a+bi)\overline{(a+bi)}. \] So we can rewrite our distance formula: if we have a complex number \(z\), the distance from the origin is \(\sqrt{z \cdot \overline{z}} \). We call this the modulus or absolute value of the number \(z\), and write it \(|z|\). It’s one of the most important operations we can do with complex numbers.
Specifically, it allows us to talk about sizes. Because the complex numbers aren’t ordered, we can’t directly compare numbers like \(3-4i\) and \(1 + 3i\); neither one is greater than the other. But once we graph them it’s visually clear that \(3-4i\) is much further from \(0\) than \(1+3i\) is, and in that sense it’s definitely “bigger”.
The modulus lets us compute this numerically: \[ \begin{aligned} | 3 - 4i | & = \sqrt{3^2 + 4^2} = \sqrt{25} = 5 \\ | 1+3i | & = \sqrt{1^2 + 3^2} = \sqrt{10} \approx 3.16 \\ \end{aligned} \] and so the first number is in this sense “bigger” than the second.
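Python’s built-in `abs()` computes exactly this modulus for complex numbers, so we can check the two numbers above directly:

```python
z, w = complex(3, -4), complex(1, 3)

print(abs(z))           # 5.0, since sqrt(9 + 16) = 5
print(abs(w))           # sqrt(10), about 3.162
print(abs(z) > abs(w))  # True: 3 - 4i is "bigger" in the modulus sense
```

The comparison `abs(z) > abs(w)` is legitimate precisely because both moduli are real numbers, which do have an order.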
This size computation allows us to do a few things. First, we need it to do geometry, since it allows us to compute distances: the distance between \(z\) and \(w\) is \( |z-w|\), the modulus of the difference. And then that lets us talk about “completeness” more precisely. Completeness tells us that when all the points in a sequence get close together, they must have some limit; for that to make sense, we need to know what “close” means!
And importantly for us, the modulus lets us talk about maximum values for functions. In the real numbers this is simple: we’re looking for the greatest possible output. But a function that outputs complex numbers can’t really have a maximum, because the outputs aren’t ordered! Instead we can look for the “biggest” output, where the modulus is greatest. Since the modulus is always a (positive) real number, this is a question that makes sense.
And once we investigate the maxima of complex functions, we get one of the most surprising results in all of complex analysis.
In the real numbers we had three key theorems in our “value pack”. One was the Extreme Value Theorem, which says that a continuous function on a closed interval has a maximum and minimum value. This doesn’t quite work in the complex numbers, because the lack of order means we lack both maximum outputs, and also “intervals”.
A real interval is one-dimensional and doesn’t make sense in the complex plane.
But it’s basically true, after we tweak it. Instead of a closed interval, we want to have a closed and bounded region, which you can think of as a loop and everything inside of it, very much including all the points on the loop. And we need to look for the greatest modulus, instead of the “greatest complex number”. But after we make those tweaks, we can restate the Extreme Value Theorem: a continuous function on a closed and bounded region has a maximum (and minimum) modulus.
A closed region in the complex plane. The outer blue boundary is included.
In fact, we can get even more than that. A continuous real function on a closed and bounded region has to have a maximum, but that maximum can happen basically anywhere, without restrictions.
Some real-valued functions have lots of local maxima all over the place.
But a complex function, if it has a derivative, is much more restricted. The maximum modulus principle says that \(|f|\) doesn’t just have a maximum somewhere in the region; the maximum has to occur on the boundary of the loop. In fact, unless the function is constant, the maximum value can only occur on the boundary. If we have a point on the inside of the loop, we can always get a bigger modulus by moving in some way towards the boundary, so there aren’t even local maxima on the inside of the region.
The height in this graph gives the modulus of the output, and color tells us the angle. If you ignore color this graph looks extremely boring—which is the point.
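We can spot-check the maximum modulus principle numerically. This sketch samples \(f(z) = z^2 + 1\) on a grid covering the closed unit disk and finds where \(|f|\) is largest; as the principle predicts, the winner sits on the boundary circle (the grid size and the function are my choices for illustration):

```python
def f(z):
    return z * z + 1

# Sample a 101 x 101 grid over the square [-1, 1] x [-1, 1],
# keeping only the points inside the closed unit disk.
points = []
for x in range(-50, 51):
    for y in range(-50, 51):
        z = complex(x / 50, y / 50)
        if abs(z) <= 1:
            points.append(z)

best = max(points, key=lambda z: abs(f(z)))
print(best, abs(best))  # the maximizer has |z| = 1: it's on the boundary
```

Here \(|z^2+1| \le |z|^2 + 1 \le 2\) on the disk, with equality only at \(z = \pm 1\), so the numerical maximizer lands on one of those two boundary points.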
This has widespread and surprising implications. One of the most famous is that if a complex function is differentiable and bounded—meaning there is some maximum modulus the function can output, no matter the input—then it has to be constant.
And that’s really restrictive! A differentiable real function can easily be bounded without being constant:
The functions \(\sin(x)\) and \(e^{-x^2}\) are differentiable, bounded, non-constant real functions.
but a differentiable complex function cannot. Either it has only one possible output, or the outputs go to infinity. And this sort of behavior leads to what some mathematicians have jokingly called the only theorem of complex analysis:
Let \(f\) be a complex differentiable function with any interesting properties whatsoever. Then \(f\) is constant.
In truth, there’s a lot more to the calculus of complex numbers than that; and I could hang out all day talking about cool weird tricks. Like, we can use complex numbers to compute the integrals of purely real-valued functions that are too tricky to solve over just the real numbers, and that’s really cool and also kind of obnoxious.
But that’s not what we’re here for. We just wanted to take the real numbers, and add in everything we needed to make all our polynomial equations have solutions. And now we’re ready to prove that \(i\) is the only thing we had to add.
Theorem: Any non-constant polynomial equation with complex coefficients has a complex number solution.
Proof: Suppose we have some complex polynomial \(f(z)\) that doesn’t have any roots. We start by drawing a big loop in the complex plane—big enough that \(|f(z)| > |f(0)|\) for every \(z\) on the boundary of the loop. We know we can do this because a polynomial will always get very big when the input gets very big.^{4}
Then the maximum value of \(|f(z)|\) happens on the boundary of the loop, but the minimum has to happen on the inside of the loop, since \(0\) is on the inside, and \(|f(0)|\) is smaller than any value we get on the boundary. (It’s not necessarily the minimum itself; there could be points that give even smaller values. But we know the minimum can’t be on the boundary because all the boundary points give big values.)
So we know that \(f\) is a differentiable function, with a maximum on the boundary of the loop, and a minimum on the inside. We can also define the function \( \frac{1}{f} \), which will flip this. When \(|f|\) is big, then \(\frac{1}{|f|}\) will be small, and vice versa; so \(\frac{1}{|f|}\) has its minimum on the boundary of the loop, and its maximum on the inside.
But we also know something else. Since \(f\) has a derivative, we know that \( \frac{1}{f} \) also has a derivative, so the maximum modulus principle applies: the maximum value of \( \frac{1}{|f(z)|} \) must occur on the boundary of the loop. But we just said that the maximum has to occur on the inside of the loop; something has gone wrong.
The culprit is our assumption that we could actually compute the function \(\frac{1}{f}\) everywhere inside the loop. That’s only true if \(f(z)\) is never zero, since we can’t divide by zero. Because that assumption led to a contradiction, we know \(f(z) = 0\) for some value of \(z\)—so there is a solution to the equation we started with. ∎
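The proof is non-constructive, but we can watch a root appear numerically. As a sketch (my own example, not part of the proof), Newton’s method run in complex arithmetic on \(f(z) = z^2 + 1\), started from a non-real guess, homes in on the root \(i\) that the reals alone can never reach:

```python
def newton(f, df, z, steps=50):
    """Iterate the usual Newton step z -> z - f(z)/f'(z) in complex arithmetic."""
    for _ in range(steps):
        z = z - f(z) / df(z)
    return z

root = newton(lambda z: z * z + 1, lambda z: 2 * z, complex(0.5, 0.5))
print(root)  # very close to 1j
```

Starting points in the upper half-plane converge to \(i\), and ones in the lower half-plane converge to \(-i\); a purely real starting point would bounce around forever, which is the algorithmic echo of \(x^2+1=0\) having no real solution.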
And this means that the complex numbers are sort of the end of this series of questions. In part 1 we started with the natural numbers, wanted to do algebra to them without worrying, and wound up with the algebraic numbers. In part 2 we started with the natural or rational numbers, wanted to do geometry and make approximations, and found the real numbers.
The algebraic numbers weren’t complete, meaning they’re inadequate for doing geometry and calculus. The real numbers are perfect for doing calculus, and are great for approximations, but they’re not algebraically closed—there are those pesky polynomial equations like \(x^2+1=0\) that don’t have solutions.
Now we can combine the two ideas, and get the complex numbers. They’re complete, so we can do geometry and calculus. They’re algebraically closed, so we can do whatever algebra we want. And they’re in many ways the best tool for doing both algebra and geometry.
But we did lose something when we moved to the complex numbers: we lost the ordering, and with it we lost some of our key calculus theorems from the reals.
The function outputs zero at the Xs. This path takes the output from \(1\) to \(-3\) without ever passing through zero.
The Mean Value Theorem says that if we have a differentiable real function on a closed interval, the average speed is equal to the derivative at some point. This fails in the complex numbers for the same reason the intermediate value theorem does; we can get from a speed of \(30\) mph to a speed of \(60\) mph without ever going \(45\) mph, because we can travel at \(45+i\) mph instead. (Physically this may or may not be meaningful, but mathematically it works.)
But this time we can recover an important chunk of the result. The Mean Value Theorem tells us speed limits work: if our speed is never higher than sixty miles per hour, we can’t possibly travel more than sixty miles in one hour. And we can still get that principle in the complex numbers, because the modulus of the total displacement can be at most the time we spend, times the maximum modulus of the speed. So we can save the tool we really care about—but only by shifting things back to the real numbers.
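Here’s that bound checked numerically with the complex velocity \(45 + i\) from above: the modulus of the total displacement can’t exceed the elapsed time times the largest speed modulus (the Riemann-sum setup is my own sketch):

```python
n = 1000
dt = 1.0 / n                    # one hour, chopped into n small steps

def velocity(t):
    return complex(45, 1)       # a constant complex "speed", in mph

# Approximate the displacement as the integral of velocity over the hour.
displacement = sum(velocity(k * dt) * dt for k in range(n))
max_speed = max(abs(velocity(k * dt)) for k in range(n))

# |displacement| <= (elapsed time) * (max |velocity|)
print(abs(displacement), 1.0 * max_speed)
```

For this constant velocity the bound is (up to rounding) an equality; a wiggly velocity would make the displacement’s modulus strictly smaller, but never larger.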
We already talked about the Extreme Value Theorem. In this case the complex numbers have an even stronger version than the reals did, in the Maximum Modulus Principle; it’s just so strong that it makes things really weird.
So of our three key calculus theorems, one is basically true but very strange, one is salvageable in a much weaker form, and one is just gone. And that makes the complex numbers awkward for doing calculus, in the sense we normally mean calculus. They’re not good for talking about speeds, or rates of change, or anything like that—at least not directly.
On the other hand, they’re great for doing algebra and geometry (and algebraic geometry). And there are all sorts of problems that don’t start out in the complex numbers, but can be transformed into complex-number questions, where we can throw our extremely powerful tools at them. (And then hopefully translate those answers back into real-world information!)
But we’re not going to talk about that here. My promise in this series was I would pose reasonable questions, and show you how answering them gives us new numbers; and that’s what we’ve done. We wanted to expand the natural numbers using basic operations, and now we can’t expand any further. We wanted a field that is complete and algebraically closed, and we got it. Until we find a new question, we can rest content.
I’m done with this line of questions; but I’m not at all done with this project! I hope to talk about quaternions and octonions, finite fields and modular arithmetic, \(p\)-adic numbers, transfinite numbers, infinitesimals, and function fields. Let me know what you’d like to hear about—tweet me @ProfJayDaigle or leave a comment below.
At least, until we come up with a new question to ask.
We don’t have to worry about terms with \(i^2\) or anything, because \(i^2 = -1\) is a real number again.
This is the simplest example of a really interesting field called Galois theory. The complex conjugation operation we constructed is an element of the Galois group of the complex numbers over the reals.
This is the step where we actually use the fact that we’re talking about a polynomial. This proof doesn’t work for functions like \(e^z\), and this is why.
In part 1 we saw the most straightforward types of numbers, from the natural numbers that we count with, through the rationals that allow us to do basic arithmetic, to the algebraic numbers that let us solve polynomial equations. In part 2 we started asking questions about geometry, where we wanted to measure shapes. We found that the area of a circle isn’t given by an algebraic number, but can be approximated as closely as we want.
This led to the idea of completeness, which basically means that anything we can approximate has to be real. Every sequence that looks like it should converge does converge, and thus every length gets an actual number attached to it. And if we want completeness we get the real numbers, which can be thought of as the set of infinite decimals.
But the real numbers were hard to define. They seemed like a lot of work just to be able to talk about the area of a circle without making any estimates; ten decimal places should be enough for anybody, but the reals require infinitely many. In this essay we’ll see that it gets worse—but also that all that work really has a payoff, and that the real numbers are the right sort of numbers to use.
But first, if you want me to feel like my work has a payoff, please consider donating to my Ko-Fi account. Tips are never necessary, but always appreciated, and they help make it possible for me to keep writing essays like this one.
We keep saying that the real numbers were really weird. How weird, exactly, are they?
We saw one hint with the observation that \(0.99\dots~ = 1 = 1.00\dots.\) All real numbers are infinite decimals, but sometimes more than one infinite decimal corresponds to the same real number. (And the idea that we can have “infinitely many nines”, and that somehow they add up to exactly one, is something that makes a lot of people viscerally uncomfortable). But that doubling-up is pretty easy to avoid if we’re careful; if we disallow decimal expansions that end in an infinite string of nines, the problem goes away, and we can sleep easy.
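The \(0.99\dots = 1\) fact can be made concrete with exact rational arithmetic: the \(k\)-digit partial sum falls short of \(1\) by exactly \(1/10^k\), a gap that shrinks below any positive bound. A quick Python check, using the exact `Fraction` type:

```python
from fractions import Fraction

total = Fraction(0)
for k in range(1, 11):
    total += Fraction(9, 10**k)  # add the digit 9 in the k-th decimal place

print(total)      # 9999999999/10000000000
print(1 - total)  # exactly 1/10000000000
```

Since the gap is exactly \(1/10^k\) after \(k\) digits, the only number the infinite string of nines can converge to is \(1\) itself.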
But the real numbers are strange in other ways. For instance: how many of them are there? There are two answers that both seem intuitively compelling. On the one hand, there are infinitely many real numbers, and maybe that’s all we can say. Infinity is infinity.
On the other hand, there are infinitely many natural numbers, and infinitely many rational numbers, and infinitely many real numbers. But it sure seems like there are more rational numbers than natural numbers, and more real numbers than natural numbers; so maybe all infinities aren’t the same.
If we look at this more carefully, things get complicated.
How can we tell if two sets of things are the same size? We could try counting them and comparing the numbers: I have two hands, and two feet, so I have the same number of hands as feet. But that doesn’t work if we have infinities. And anyway, counting is pretty abstract. Can we make things simpler?
There are a few approaches you could take here, but one very basic idea is just to pair things off. I don’t actually know how many pairs of shoes I have; but I know that I have the same number of left shoes and right shoes, because each left shoe is paired to a right shoe, and each right shoe is paired to a left shoe. There are none left over, so I have the same number of each. In technical terms, we’d say this pairing gives a bijection or a one-to-one correspondence between my left shoes and my right shoes.
On the other hand, if I try to pair up my socks and my shoes, I’ll have socks left over. I can give each shoe its own sock, and I’ll still have a big pile of socks left over. So I know I have more socks than shoes.
Let’s apply that idea now. Are there more natural numbers, or more even numbers? The obvious answer is that there are more natural numbers. If we look at the first ten numbers, only five of them are even.
Natural numbers: | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | \(\dots\) |
Even numbers: | 2 | 4 | 6 | 8 | 10 | \(\dots\) |
When we look at the first ten numbers, we have a lot of leftover (odd) natural numbers after we’ve paired off all the evens. And this pattern continues: if we look at the first hundred numbers, fifty of them are even. If we look at the first \(n\) numbers, about half of them will be even. So it seems like there must be more natural numbers than even numbers.
On the other hand, we can make a table like this, instead:
Natural numbers: | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | \(\dots\) |
Even numbers: | 2 | 4 | 6 | 8 | 10 | 12 | 14 | 16 | 18 | 20 | \( \dots \) |
In this table, every even number corresponds to a natural number, and every natural number corresponds to an even number. They’re perfectly paired up. So by this argument, there must be the same number of natural numbers and even numbers.
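This pairing is concrete enough to write down. Here’s a small sketch (in Python, purely for illustration) of the bijection and its inverse:

```python
def to_even(n):
    """Pair the natural number n with the even number 2n."""
    return 2 * n

def from_even(m):
    """Pair the even number m back with the natural number m/2."""
    return m // 2

# Round-tripping in both directions gives back what we started with,
# so nothing is left over on either side: the pairing is a bijection.
assert all(from_even(to_even(n)) == n for n in range(1, 1000))
assert all(to_even(from_even(m)) == m for m in range(2, 2000, 2))
```

The two round-trip checks are exactly the “none left over” condition from the shoe example, just written for numbers.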
This is one of the weird things that immediately happens when we start dealing with infinities: an infinite set can be in bijection with one of its own subsets. We see this in the observation that “infinity plus one” is just infinity, since adding an element to an infinite set doesn’t change the size. And these bijections are surprisingly common; sets in bijection with the natural numbers include the perfect squares:
Natural numbers: | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | \(\dots\) |
Squares: | 1 | 4 | 9 | 16 | 25 | 36 | 49 | 64 | 81 | 100 | \( \dots \) |
the primes:
Natural numbers: | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | \(\dots\) |
Primes: | 2 | 3 | 5 | 7 | 11 | 13 | 17 | 19 | 23 | 29 | \( \dots \) |
and even the integers:
Natural numbers: | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | \(\dots\) |
Integers: | 0 | 1 | -1 | 2 | -2 | 3 | -3 | 4 | -4 | 5 | \( \dots \) |
We call these sets countable or countably infinite, because we can put all the elements in order and count them. It makes sense to ask for the \(37\)th prime number \((157),\) or the \(53\)rd square \((2809).\) And conversely, we can look at \(193\) and determine it’s the \(44\)th prime number, or at \(289\) and see it’s the \(17\)th square.
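These positions are genuinely computable. A quick sketch (Python, with a deliberately naive prime test) verifies the specific claims above:

```python
def nth_prime(n):
    """Return the nth prime, counting 2 as the 1st.
    Naive trial division -- slow, but fine for small n."""
    primes = []
    candidate = 2
    while len(primes) < n:
        if all(candidate % p != 0 for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes[-1]

assert nth_prime(37) == 157     # the 37th prime is 157
assert nth_prime(44) == 193     # 193 is the 44th prime
assert 53 ** 2 == 2809          # the 53rd square is 2809
assert 17 ** 2 == 289           # 289 is the 17th square
```

Being able to ask for “the \(n\)th element” like this, in both directions, is exactly what countability means.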
Let’s make things a little more interesting. We saw that the sets of natural numbers, integers, even numbers, perfect squares, and prime numbers were all the same size. What about the rational numbers? It seems like there are a lot more rational numbers than there are natural numbers. But it seemed like there were a lot more natural numbers than even numbers, and that didn’t work out, so we should look closer. We can try making a table like this:
Natural numbers: | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | \(\dots\) |
Rational numbers: | 1/1 | 1/2 | 1/3 | 1/4 | 1/5 | 1/6 | 1/7 | 1/8 | 1/9 | 1/10 | \( \dots \) |
But that won’t get us very far. Or rather, it would get us really far—we could keep going forever—but we’d leave most of the rational numbers out. We’ll never get to \(2\) that way.
Georg Cantor’s clever idea was to put the rational numbers on a grid, instead.
1/1 | 1/2 | 1/3 | 1/4 | 1/5 | 1/6 | 1/7 | 1/8 | 1/9 | … |
2/1 | 2/2 | 2/3 | 2/4 | 2/5 | 2/6 | 2/7 | 2/8 | 2/9 | … |
3/1 | 3/2 | 3/3 | 3/4 | 3/5 | 3/6 | 3/7 | 3/8 | 3/9 | … |
4/1 | 4/2 | 4/3 | 4/4 | 4/5 | 4/6 | 4/7 | 4/8 | 4/9 | … |
5/1 | 5/2 | 5/3 | 5/4 | 5/5 | 5/6 | 5/7 | 5/8 | 5/9 | … |
6/1 | 6/2 | 6/3 | 6/4 | 6/5 | 6/6 | 6/7 | 6/8 | 6/9 | … |
7/1 | 7/2 | 7/3 | 7/4 | 7/5 | 7/6 | 7/7 | 7/8 | 7/9 | … |
8/1 | 8/2 | 8/3 | 8/4 | 8/5 | 8/6 | 8/7 | 8/8 | 8/9 | … |
9/1 | 9/2 | 9/3 | 9/4 | 9/5 | 9/6 | 9/7 | 9/8 | 9/9 | … |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋱ |
A grid like this should contain every positive^{1} rational number somewhere. In fact, the big problem is that some of them show up more than once! \(1/1 = 2/2 = 3/3\) and \(1/2 = 2/4 = 4/8\); we get a lot of repetitions. If we throw out the duplicates, and only include fractions that are in lowest terms, we get this:
1/1 | 1/2 | 1/3 | 1/4 | 1/5 | 1/6 | 1/7 | 1/8 | 1/9 | … |
2/1 | 2/3 | 2/5 | 2/7 | 2/9 | … | ||||
3/1 | 3/2 | 3/4 | 3/5 | 3/7 | 3/8 | … | |||
4/1 | 4/3 | 4/5 | 4/7 | 4/9 | … | ||||
5/1 | 5/2 | 5/3 | 5/4 | 5/6 | 5/7 | 5/8 | 5/9 | … | |
6/1 | 6/5 | 6/7 | … | ||||||
7/1 | 7/2 | 7/3 | 7/4 | 7/5 | 7/6 | 7/8 | 7/9 | … | |
8/1 | 8/3 | 8/5 | 8/7 | 8/9 | … | ||||
9/1 | 9/2 | 9/4 | 9/5 | 9/7 | 9/8 | … | |||
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋱ |
And once we have all the rational numbers in a grid like this, we can put them in order: we just have to take a snaking diagonal path through our grid.
You can think of this as listing all the numbers where the top plus the bottom is two, then all the numbers where it’s three, then all the numbers where it’s four; there’s only a finite collection at each level.^{2} And that means that any rational number gets a specific, finite place in our list: \[ 1/1, \quad 2/1, \quad 1/2, \quad 1/3, \quad 3/1, \quad 4/1, \quad 3/2, \quad 2/3, \quad 1/4, \quad 1/5, \quad \dots \]
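We can even write the snaking path down as a program. This sketch (Python; the alternating direction at each level is my reading of the path described above) lists the positive rationals in lowest terms:

```python
from itertools import islice
from math import gcd

def rationals():
    """Yield every positive rational (as a top/bottom pair) in lowest
    terms, walking the snaking diagonals: level s holds the fractions
    whose top plus bottom equals s."""
    s = 2
    while True:
        numerators = range(1, s)
        if s % 2 == 1:                      # snake back on odd levels
            numerators = reversed(numerators)
        for top in numerators:
            bottom = s - top
            if gcd(top, bottom) == 1:       # skip duplicates like 2/4
                yield (top, bottom)
        s += 1

first_ten = list(islice(rationals(), 10))
# matches the list in the text: 1/1, 2/1, 1/2, 1/3, 3/1, 4/1, 3/2, 2/3, 1/4, 1/5
```

Every positive rational appears exactly once, at a finite position, which is the whole point: the generator is the bijection.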
But all this is still a little weird, right? There are “obviously” way more rational numbers than there are natural numbers, but we just put them in order and paired them up. The fifth rational number is \(3\), and \(2/3\) is the eighth rational number; we can go in either direction.
We can take this logic one step further. In part 1 we defined the algebraic numbers, the numbers that are solutions to polynomial equations with integer^{3} coefficients. These include all the rational numbers, and all the square roots, and the imaginary number \(i\), and the solutions to \(x^5+x+3=0\) which we can’t describe any better than that. Can we pair them up with the rational numbers?
It seems obvious that there are way more algebraic numbers than rational numbers. But it was also “obvious” that there were more rational numbers than integers, and that didn’t quite pan out. In fact we can count the algebraic numbers. Take a minute and see if you can figure out how to do it!
There are a few approaches, but I think the easiest is this. First think about all the polynomials of degree at most one whose coefficients are at most \(1\) in absolute value. There aren’t very many of these, and we can list them all off:
\[ \begin{aligned} 0 && 1 && -1 \\ x && x+1 && x-1 \\ -x && -x+1 && -x-1 \end{aligned} \] There are nine of them, so we can put them in whatever order we want. And each one has at most one root, so we’ve counted up to nine algebraic numbers. (In fact we’ve only counted three, since there’s a lot of duplication here, but that’s fine; we’ll just cross out the duplicate numbers like we did for the rationals.)
Now think about all the polynomials of degree at most two whose coefficients are at most \(2\) in absolute value. There are a lot more of these!
\[ \begin{aligned} -2 && -1 && 0 && 1 && 2 \\ x-2 && x-1 && x && x+1 && x+2 \\ 2x-2 && 2x-1 && 2x && 2x+1 && 2x+2 \\ -x-2 && -x-1 && -x && -x+1 && -x+2 \\ -2x-2 && -2x-1 && -2x && -2x+1 && -2x+2 \\ x^2-2 && x^2-1 && x^2 && x^2+1 && x^2+2 \\ x^2+x-2 && x^2+x-1 && x^2+x && x^2+x+1 && x^2+x+2 \\ x^2+2x-2 && x^2+2x-1 && x^2+2x && x^2+2x+1 && x^2+2x+2 \\ x^2-x-2 && x^2-x-1 && x^2-x && x^2-x+1 && x^2-x+2 \\ x^2-2x-2 && x^2-2x-1 && x^2-2x && x^2-2x+1 && x^2-2x+2 \\ 2x^2-2 && 2x^2-1 && 2x^2 && 2x^2+1 && 2x^2+2 \\ 2x^2+x-2 && 2x^2+x-1 && 2x^2+x && 2x^2+x+1 && 2x^2+x+2 \\ 2x^2+2x-2 && 2x^2+2x-1 && 2x^2+2x && 2x^2+2x+1 && 2x^2+2x+2 \\ 2x^2-x-2 && 2x^2-x-1 && 2x^2-x && 2x^2-x+1 && 2x^2-x+2 \\ 2x^2-2x-2 && 2x^2-2x-1 && 2x^2-2x && 2x^2-2x+1 && 2x^2-2x+2 \\ -x^2-2 && -x^2-1 && -x^2 && -x^2+1 && -x^2+2 \\ -x^2+x-2 && -x^2+x-1 && -x^2+x && -x^2+x+1 && -x^2+x+2 \\ -x^2+2x-2 && -x^2+2x-1 && -x^2+2x && -x^2+2x+1 && -x^2+2x+2 \\ -x^2-x-2 && -x^2-x-1 && -x^2-x && -x^2-x+1 && -x^2-x+2 \\ -x^2-2x-2 && -x^2-2x-1 && -x^2-2x && -x^2-2x+1 && -x^2-2x+2 \\ -2x^2-2 && -2x^2-1 && -2x^2 && -2x^2+1 && -2x^2+2 \\ -2x^2+x-2 && -2x^2+x-1 && -2x^2+x && -2x^2+x+1 && -2x^2+x+2 \\ -2x^2+2x-2 && -2x^2+2x-1 && -2x^2+2x && -2x^2+2x+1 && -2x^2+2x+2 \\ -2x^2-x-2 && -2x^2-x-1 && -2x^2-x && -2x^2-x+1 && -2x^2-x+2 \\ -2x^2-2x-2 && -2x^2-2x-1 && -2x^2-2x && -2x^2-2x+1 && -2x^2-2x+2 \\ \end{aligned} \]
But it’s still a finite list, and each one has at most two solutions. So this list has fewer than \(250\) algebraic numbers on it, and we can count them. In fact, let’s put the ones from the first list first, and then all the rest.
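The counting here is easy to check mechanically. This sketch (Python) enumerates the coefficient triples behind the big table above:

```python
from itertools import product

# every polynomial a*x^2 + b*x + c with |a|, |b|, |c| <= 2,
# matching the big table above (including the zero polynomial)
quadratics = list(product(range(-2, 3), repeat=3))

assert len(quadratics) == 125   # 5 choices for each of 3 coefficients
# each polynomial has at most two roots, so this stage contributes
# fewer than 250 algebraic numbers
assert 2 * len(quadratics) == 250
```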
And now what we’ve done is defined a “height” for our polynomials: it’s the maximum of the degree and the absolute values of all the coefficients. So next we can look at the “height three” polynomials, of degree at most three with coefficients at most three; and then the height four polynomials, of degree at most four with coefficients at most four; and so on.
At each height we’re adding finitely many algebraic numbers, so we can put them all in order. But we’ll get to every algebraic number eventually. For instance, the polynomial \(x^5+x+3\) has height five, so all the solutions will show up at the fifth step of this process. And that means we can pair the algebraic numbers up with the natural numbers. The algebraic numbers are countable.
At this point you might be wondering if we can just always do this. The naturals and the rationals and the algebraics are all the same size; maybe infinities are all the same, after all. But let’s look at one more example: the real numbers.
Imagine we can put all the real numbers in a list, like we did for the rational numbers. Every real number can be written as an infinite decimal, so the list might look something like this:
\[ \begin{aligned} 7.77000643096\dots \\ 1.05898980495\dots \\ 6.35097622647\dots \\ 1.79660844929\dots \\ 4.45063253213\dots \\ 7.48984022493\dots \\ 2.23729615260\dots \\ 0.09015630234\dots \\ 1.30480398871\dots \\ 7.76421175135\dots \\ \end{aligned} \]
But now we can make a new infinite decimal, that definitely isn’t on the list. To keep things simple, we’ll make a number whose whole number part, to the left of the decimal point, is just \(0\). That means our number will be between \(0\) and \(1\), and that’s fine.
Now let’s look back at the big list. The first number on the list has a \(7\) in the first decimal place. So to make sure our number is different, the first decimal place can’t have a \(7\), so we’ll put a \(0\) there instead.
Now the second number on the list has a \(5\) in the second decimal place. To make sure our number is different, the second decimal place can’t have a \(5\), so we’ll put a \(0\) there instead.
The third number has a \(0\) in the third decimal place, so we don’t want to have a \(0\) in the third decimal place of our number; let’s use a \(1\) instead. At this point, we know the first three places of our infinite decimal: \(0.001\dots.\) And we also know that our infinite decimal isn’t any of the first three numbers on the big list.
So we can continue this pattern. The fourth digit of the fourth number is \(6\), so we can pick \(0\) for our fourth digit.^{4} The fifth digit of the fifth number is \(3\), so we can pick \(0\). The sixth digit of the sixth number is \(0\), so we have to pick something else like \(1\). The seventh digit of the seventh number is \(1\), and the eighth digit of the eighth number is \(0\), so we should pick \(0\) for our seventh digit and \(1\) for our eighth digit. As we keep going, we get the number
\[ 0.0010010100\dots \] which can’t^{5} be the same as any of the numbers on our list: \[ \begin{aligned} 7.\color{red}{7}7000643096\dots \\ 1.0\color{red}{5}898980495\dots \\ 6.35\color{red}{0}97622647\dots \\ 1.796\color{red}{6}0844929\dots \\ 4.4506\color{red}{3}253213\dots \\ 7.48984\color{red}{0}22493\dots \\ 2.237296\color{red}{1}5260\dots \\ 0.0901563\color{red}{0}234\dots \\ 1.30480398\color{red}{8}71\dots \\ 7.764211751\color{red}{3}5\dots \\ \end{aligned} \]
And that means that we have a number that isn’t on the infinite list we started with.
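The diagonal rule is simple enough to run. This sketch (Python) applies it to the ten numbers on the list above, using the rule from the text: put a \(0\) in each place, unless the diagonal digit is already a \(0\), in which case put a \(1\):

```python
# decimal digits of the ten listed numbers (whole-number parts don't matter)
listed = [
    "77000643096", "05898980495", "35097622647", "79660844929",
    "45063253213", "48984022493", "23729615260", "09015630234",
    "30480398871", "76421175135",
]

# flip each diagonal digit: 1 where the diagonal has a 0, else 0
diagonal = "".join("1" if digits[i] == "0" else "0"
                   for i, digits in enumerate(listed))

assert diagonal == "0010010100"
# the new number disagrees with the i-th listed number in the i-th place,
# so it can't equal any of them
assert all(diagonal[i] != digits[i] for i, digits in enumerate(listed))
```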
Now obviously we could make a list that contains this number. We can just tack it on to the front:^{6}
\[ \begin{aligned} &0.0010010100\dots \\ &7.77000643096\dots \\ &1.05898980495\dots \\ &6.35097622647\dots \\ &1.79660844929\dots \\ &4.45063253213\dots \\ &7.48984022493\dots \\ &2.23729615260\dots \\ &0.09015630234\dots \\ &1.30480398871\dots \\ &7.76421175135\dots \\ \end{aligned} \]
And this list does have the number we just made. But it still can’t have every number on it; we can do the same thing we just did to get a new number, \(0.1000100000\dots,\) that isn’t on this list. Whatever list we come up with, there has to be a number that isn’t on it.
And in fact there are infinitely many numbers that aren’t on the list! We can see this pretty directly by listing a bunch of them. We built a number that wasn’t on the list, using just the digits \(0\) and \(1.\) But we could also use \(0\) and \(2,\) or \(4\) and \(7,\) or whichever pair of numbers we want. We could even choose a different pair for each place; we wind up having nine choices for each decimal place. So we can see all the following numbers aren’t on our original list:
\[ \begin{aligned} 0.0020020200 \dots \\ 0.4777777777 \dots \\ 0.6669666666 \dots \\ 0.1234567890 \dots \end{aligned} \]
And every number on that super-infinite list is between \(0\) and \(1.\) We can find another infinite list between \(1\) and \(2,\) and another between \(2\) and \(3,\) and another between \(37\) and \(38.\) So not only are there more real numbers than natural, rational, or algebraic numbers; there are way, way more of them.
Another way of thinking about how many real numbers there are is to imagine choosing one at random. In fact, let’s just choose a number between \(0\) and \(1.\) Some of the numbers between \(0\) and \(1\) are rational, and others are irrational. So what are the odds that our randomly chosen number will be rational?
It turns out this probability of getting a rational number has to be zero. Not just small, but actually zero—even though it’s obviously possible, it has to be infinitely unlikely.
To see why, imagine that there was some positive probability of getting a rational number, like one in three.^{7} That would mean that one third of all the real numbers were rational—and that means we could divide the real numbers up into three sets \(A, B, C\), each of which are the size of the rational numbers.
But we know the rationals are countable, meaning we can put them in a numbered list. If the other two sets are the same size, we must be able to put them in lists, too, so we could divide the real numbers into three countable sets. And then we can make a complete list of all the real numbers: we can take the first element of \(A\), then the first element of \(B\), then the first element of \(C;\) then the second elements of \(A\) and \(B\) and \(C\); then the third elements; and so on. But this can’t possibly work, because we know the real numbers are uncountable. The rationals can’t be one third of the reals.
And the specific number was irrelevant here, right? If the rationals were \(1/1,000,000\) of the reals, we could still count the reals. We’d take the first element from each of our million sets, and then the second element, and then the third… If the rationals were any finite percentage of the real numbers, we’d be able to make a list of all the reals. So instead the rationals have to be zero percent of the real numbers.
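The “first element of each set, then the second, then the third” trick is just round-robin interleaving, and it works for any finite collection of countable sets. Here’s a sketch (Python); since the hypothetical three-way split of the reals can’t actually exist, the example below uses an innocent stand-in, the residue classes mod 3:

```python
from itertools import count, islice

def interleave(*sequences):
    """Round-robin through several infinite sequences: everyone's first
    element, then everyone's second, and so on. If each input can be
    put in a numbered list, so can the union."""
    iterators = [iter(seq) for seq in sequences]
    while True:
        for it in iterators:
            yield next(it)

# three countable sets covering the naturals: residue classes mod 3
A = (3 * n for n in count(1))        # 3, 6, 9, ...
B = (3 * n + 1 for n in count(0))    # 1, 4, 7, ...
C = (3 * n + 2 for n in count(0))    # 2, 5, 8, ...
first_nine = list(islice(interleave(A, B, C), 9))
assert first_nine == [3, 1, 2, 6, 4, 5, 9, 7, 8]
```

This is exactly why the rationals can’t be a third (or a millionth) of the reals: if they were, this interleaving would count the reals.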
And this sort of weirdness can only happen with infinities. Obviously there are rational numbers. “One” is a rational number, and it definitely exists. So if you pick a real number, it is possible to pick a rational number. But it will happen zero percent of the time. Not a small percentage; not a tiny percentage; it will happen zero percent of the time. There are infinitely many more real numbers than rational numbers. And that’s weird!
It gets worse when you realize the same argument applies to the algebraic numbers. It’s hard to come up with a real number that isn’t algebraic. Sure, there are a couple of weird ones like \(\pi\) and \(e,\) but for the most part the irrational numbers we think about are all algebraic.
And yet if you pick a real number at random, there’s a zero percent chance you’ll get anything algebraic. There are infinitely more real numbers that aren’t solutions to polynomial equations than ones that are. Most real numbers are a little hard to even describe.
And this brings us to something weird you might have noticed when we showed that the reals were uncountable. We started with a list of real numbers, and we constructed a decimal that wasn’t on that list, and said we came up with \(0.0010010100 \dots. \) But did we really find a specific number? I wrote an ellipsis there; do we know what comes next?
And the answer is, sort of. We have a rule that tells us how to choose the next digit, so in theory we should know what comes next. But the rule depends on the next number on the original list, and I didn’t tell you that, so we can’t actually figure out the next digit.
In general we’ve been pretty sloppy about this! I said a real number is an infinite decimal, and I’ve been writing strings of digits with ellipses at the end to say the decimal keeps going. But consider these three real numbers:
\[ \begin{aligned} A & = 0.1428571428 \dots \\ B & = 3.1415926535 \dots \\ C & = 0.9193470019 \dots \end{aligned} \]
Do we know what digit comes next in \(A\)? It looks like it’s repeating, so we can guess the next digits are \(5\) and \(7\). (We might even notice this is the decimal expansion of \(1/7\), which can make us more confident in our guess.) For \(B\), we might recognize that this is \(\pi,\) so we can look up the next digits, which are \(8\) and \(9\). But what about for \(C\)? Can you figure out what comes next?
And in this case, you can’t. You absolutely can’t. I just generated ten random digits, without any pattern, and wrote them down. I wrote an ellipsis like there’s something that comes next, but I never decided what comes next, so the ellipsis is basically a lie. I have no idea what that number is—except that it’s between \(\dfrac{9193470019}{10000000000} \) and \(\dfrac{9193470020}{10000000000}.\)
And honestly, that’s enough information to do pretty much any calculation we would actually want to do in the real world. But it’s not enough to tell you which specific real number this is. There are infinitely many—uncountably many!—ways to continue on from those ten digits and get a real number. And even I can’t really tell you which one I want.
But I did tell you exactly what numbers \(A\) and \(B\) were—just not by writing down an infinite decimal. \(A\) is a repeating decimal, so I can give you a few digits and then tell you it repeats. \(B\) isn’t repeating, but it is a special number, \(\pi,\) and you can go look up the next digits if you want to. So rather than listing off all of the infinitely many digits, I can give you an algorithm, or recipe, for finding the digits. And you can keep computing the next digit and the next, for as long as you want. If a number has a recipe like this, that lets you compute all of the digits, we say it’s computable.
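For \(A\), the recipe really is just long division, and it fits in a few lines. This sketch (Python) generates the digits of \(1/7\) one at a time, forever:

```python
from itertools import islice

def decimal_digits(p, q):
    """A recipe for the decimal digits of p/q (with 0 < p < q):
    ordinary long division, carried out one digit at a time, forever."""
    remainder = p
    while True:
        remainder *= 10
        yield remainder // q
        remainder %= q

digits = "".join(str(d) for d in islice(decimal_digits(1, 7), 12))
assert digits == "142857142857"   # 1/7 = 0.142857142857...
```

The generator never finishes, but the recipe itself is finite, and you can pull out as many digits as you like. That’s what makes \(1/7\) computable.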
Can we do something like that for \(C?\) Can we take any infinite decimal and give an algorithm that will compute it?
Unfortunately, the answer is no, and for the same basic reason we know most real numbers aren’t algebraic. Every computable number has to have a recipe, so let’s think about what a recipe should look like. First, the recipe should be finite. We might not be able to finish reading the infinite decimal, but we should be able to finish reading the recipe. (Otherwise the recipe could just be a list of all the digits, which is missing the point!)
And we have to pick some language to write the recipe in. What language we pick doesn’t really matter; we could use English, or Mandarin, or some sort of weird mathematical symbology. But the language probably has finitely many symbols in it. Even if it has infinitely many symbols, we can limit it to a countable infinity—no more symbols than there are natural numbers.^{8}
And with just those two restrictions, we find that we can count all the recipes. We can count the symbols, meaning we can label them so that there’s a first symbol, then a second, then a third, and so on. There’s at most one recipe that uses only the first symbol and is at most one symbol long. There are at most six recipes that use only the first two symbols and are at most two symbols long. There are at most \(39\) recipes that use only the first three symbols and are at most three symbols long.
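Those counts are just counts of strings over a finite alphabet, which we can verify directly:

```python
def recipes_up_to(symbols, max_length):
    """Upper bound on recipes that use only the first `symbols` symbols
    and are at most `max_length` symbols long: count all such strings."""
    return sum(symbols ** length for length in range(1, max_length + 1))

assert recipes_up_to(1, 1) == 1     # at most one recipe
assert recipes_up_to(2, 2) == 6     # 2 one-symbol strings + 4 two-symbol
assert recipes_up_to(3, 3) == 39    # 3 + 9 + 27
```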
We’ve found a height for our recipes again: the maximum of the length of the recipe, and the number of symbols from our list that it uses. There are finitely many recipes of each height, so we can label all the height-one recipes, then all the height-two recipes, then the height-three, then the height-four, and so on. Eventually we will reach every possible recipe, which means we can make a numbered list of every possible recipe.
This means the recipes are countable, and so the numbers they produce are also countable. So there are countably many computable numbers, but uncountably many real numbers. And just like with the rational numbers and the algebraic numbers, almost every real number is uncomputable; if we pick a random real number, it is essentially guaranteed not to be computable.
And to be clear about what this means: one hundred percent of the real numbers are things that we can’t even describe. They’re so strange and gratuitously infinite that we can’t even really talk about most of them.
The real numbers were hard to define. (It took basically all of part 2!) There are way too many of them, and almost all of them can’t even be described in a useful way. And if you think about them too hard, they just start seeming really weird and uncomfortable. So why do we keep using them, rather than doing something more sensible?
The point of this series is that this is the right question to ask. We shouldn’t start by asking how some new weird math thing is defined; we shouldn’t start with a definition and just try to prove theorems from it. If we want to understand a math idea, we need to understand what problem it was designed to solve. So we don’t want to think about the definition of the reals. We should think about what they do.
The reals are characterized by three key properties: they are a complete ordered field. (In fact, they are the only complete ordered field.) And each of these three words represents a major idea from this series. We’ll take them in reverse order.
A field is a set that allows the four fundamental arithmetic operations: addition, subtraction, multiplication, and division (by non-zero numbers). Looking back at part 1, the natural numbers aren’t a field, because they don’t let you compute \(1-3\); the integers aren’t a field, because they don’t let you compute \(1/3\). When we wanted to do addition, subtraction, multiplication, and division, we came up with the rational numbers, which are a field. And so are the algebraic numbers, the computable numbers, and the reals.
The reals are ordered: if we have two distinct real numbers, one will always be greater than the other. We talked about this some in part 2. The order also “plays nicely” with the algebraic operations, in the sense that, for instance, adding \(1\) to a number will always make it bigger. The rational numbers and the computable numbers are also ordered, but the algebraic numbers are not, because \(i\) is neither positive nor negative.
Finally, the reals are complete. Completeness was the main topic of part 2: we built the reals by saying that every sequence that looks like it should converge, does converge. Combining this with the order gave us the Monotone Convergence Theorem, which says an increasing sequence that doesn’t go to infinity has to converge, and that allowed us to show that every infinite decimal was a real number.
These three properties, taken together, give us exactly the real numbers. Any ordered field has to contain all the integers, because we can add \(1\) to itself repeatedly.^{9} Then because we can do division, we have to have all the rationals, which means we get all the finite decimals. And completeness means we get all the infinite decimals, so a complete ordered field has to include all the reals.
Conversely, every element of the complete ordered field has to be a real number. The order means we can trap our element in between two integers, and then in between two one-place decimals, and then in between two two-place decimals, and so on. Thus our element can be written as an infinite decimal, so it must be a real number.
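We can watch this trapping argument in action for a specific element, \(\sqrt{2}\). This sketch (Python) finds the two \(k\)-place decimals that trap it, using only integer arithmetic:

```python
def trap_sqrt2(places):
    """Trap sqrt(2) between two decimals with the given number of places:
    find the largest integer n with n^2 <= 2 * 10^(2 * places), so that
    n / 10^places <= sqrt(2) < (n + 1) / 10^places."""
    target = 2 * 10 ** (2 * places)
    n = 0
    while (n + 1) ** 2 <= target:
        n += 1
    return n

assert trap_sqrt2(0) == 1          # 1     <= sqrt(2) < 2
assert trap_sqrt2(3) == 1414       # 1.414 <= sqrt(2) < 1.415
```

Running this for larger and larger `places` reads off the infinite decimal \(1.41421\dots\) one digit at a time, which is exactly how an element of a complete ordered field gets identified with a real number.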
If we want those three things to be true, the real numbers are what we have to use.
We want a field because it lets us actually do arithmetic; we want an order because some numbers are bigger than other numbers. The least obvious property here—and the one that creates all of that uncountable weirdness—is the completeness. Why do we need that?
The obvious answer is the one we saw in part 2: completeness lets us handle geometry and distances, because we can define any number we can approximate. But this is maybe not the most compelling reason, since we can approximate those distances without having to use real numbers. In fact this is essentially what the Greeks did: Archimedes computed that the circumference of a circle with diameter one was between \(\dfrac{223}{71}\) and \(\dfrac{22}{7}.\) The number \(\pi\) isn’t rational, but \(22/7\) and \(355/113\) and \(3.1415926535\) all are, and that’s enough for any actual calculation we want to do.
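Archimedes’ bounds are both rational, and it takes one line to check how good they are:

```python
import math
from fractions import Fraction

# Archimedes' classical bounds for pi, both rational numbers
lower, upper = Fraction(223, 71), Fraction(22, 7)

assert float(lower) < math.pi < float(upper)
# the two bounds pin pi down to roughly two decimal places
assert float(upper) - float(lower) < 0.003
```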
No, the real reason we need real numbers is they let us do calculus.
In calculus, we learn about the derivative, which tells us how quickly something is changing. (So if \(f(x)\) represents the position of an object, then the derivative \(f'(x)\) represents the speed.) A freshman calculus course will then spend a lot of time learning formulas to compute the derivative, which is very important for using calculus, but not important to the story we’re interested in here. So we won’t worry about that.
Instead I want to talk about a few important theorems about the derivative. A college calculus course will generally mention these theorems, but not really focus on them, because they aren’t necessary for any particular calculations. But they are critical to explain why we do calculus the way we do, and if I were writing a Fictional History of Calculus they would take center stage.
I like to call these key theorems the value pack, because they’re thematically related, and also all have the word “value” in their names:
The Intermediate Value Theorem says that if a continuous function can output two distinct numbers, it can also output anything in between them. This is the “no teleporting” theorem: if an object falls from ten feet above ground to five feet above ground, at some point in the middle it was nine and eight and seven and six feet off the ground.
It’s also the rule that says continuous functions have continuous graphs. If you’ve heard that a continuous function is one you can draw without lifting your pencil off of the paper, you’ve heard a version of the Intermediate Value Theorem.
The Mean Value Theorem says that if you have a differentiable function on a closed interval, the average speed is equal to the derivative at some point. This is the “speed limit” theorem: it says that if your speed is never higher than sixty miles per hour, you can’t possibly travel more than sixty miles in one hour.
It also tells us that if a function’s derivative is zero, the function has to be constant. If your speed is always zero, then you should never move at all.
The Extreme Value Theorem says that a continuous function on a closed interval has a maximum and minimum value. This is the “what goes up must come down” theorem: if you toss a ball in the air, some point in that toss will be the highest point.
And if all of those things seem obviously true, well, that’s the point. These are the theorems that tell us functions, and derivatives, and calculus all behave the way they’re supposed to. If these theorems weren’t true, then calculus wouldn’t describe the way things actually move in the actual world we observe, and so it wouldn’t be useful.
But if we don’t use the real numbers, all three of those theorems break.
Let’s imagine we’re doing calculus over just the rational numbers. That means we’re using functions that take in rational numbers as inputs, and give other rational numbers back as outputs. There are plenty of reasonable functions like this: \(f(x) = 3x+5\) or \( f(x) = \frac{x^2+1}{x-2} \) are rational functions. But there are also unreasonable functions, like this one:
\[ f(x) = \left\{ \begin{array}{ccc} 1 & \text{if} & x^2 > 2 \\ 0 & \text{if} & x^2 < 2 \end{array} \right. . \]
This looks weird and a little ugly, but it’s straightforward to compute once we understand what it means. When we plug in a number \(x\), we square it and look at the result. If we get a number bigger than \(2\), the function outputs \(1;\) if we get a number less than \(2\) the function outputs \(0.\) This rule works for any rational number, and will always give us one or zero as an output. (This rule would not work if \(x^2=2\), but since \(x\) is a rational number, that can’t happen!) In fact, this function is continuous, and differentiable, at every rational number. And at every rational number, the derivative is zero.
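The function is easy to implement with exact rational arithmetic, which makes its behavior around \(\sqrt{2}\) concrete. A sketch (Python, using the standard library’s `Fraction`):

```python
from fractions import Fraction

def f(x: Fraction) -> int:
    """The function from the text: 1 when x^2 > 2, 0 when x^2 < 2.
    Well-defined on the rationals, since x^2 == 2 has no rational solution."""
    return 1 if x * x > 2 else 0

# rational inputs just below and just above sqrt(2):
assert f(Fraction(7, 5)) == 0       # (7/5)^2   = 49/25   < 2
assert f(Fraction(17, 12)) == 1     # (17/12)^2 = 289/144 > 2
assert f(Fraction(1)) == 0 and f(Fraction(3)) == 1
```

The inputs \(7/5\) and \(17/12\) are close together, yet the outputs jump from \(0\) to \(1\): the jump happens “at” a point the rationals simply don’t contain.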
We can even graph this function pretty easily:
Behold, a continuous function.
And now things seem off. The graph doesn’t look continuous. But that’s because it jumps “at” \(\sqrt{2}\)—and in the rationals, that number doesn’t exist. So even though the function is continuous, it gives the outputs \(0\) and \(1\), but never anything in between; the Intermediate Value Theorem fails.
Even worse, the function has a derivative of zero everywhere. And looking at the graph, this makes sense, right? We can’t find any specific point where the function is increasing or decreasing; the tangent line at any point is horizontal. (Again, there should be a bad point at \(\sqrt{2},\) but since that number isn’t rational we don’t care.) And thus the value of the function changes, even though the rate of change is always zero. The Mean Value Theorem fails.
This function does satisfy the Extreme Value Theorem, but it’s not too hard to find other functions that don’t.
The graph of \(6x-x^3\) between \(-2\) and \(2.\) In the real numbers, it has a maximum at \(\sqrt{2};\) when restricted to the rational numbers, it has no minimum or maximum.
Now, this is all kind of dumb. The function \(f\) shouldn’t be continuous; there are obvious jumps in it! And the function \(6x-x^3\) has an obvious maximum in its graph. But that’s exactly the point. The rational numbers just aren’t good enough to do calculus, because obviously dumb and false things wind up being, technically, true. The function \(f\) is “continuous” because it jumps at an irrational number, and the function \(6x-x^3\) has “no maximum” because the maximum value happens at an irrational number. The rational numbers have gaps, and if we don’t fill in the gaps with real numbers, calculus just doesn’t work.
And while these examples talk about the rational numbers, we can also find examples that break in the algebraic numbers. They’re just way more annoying to describe, so I don’t want to write out the details.
This is a continuous function that sends algebraic numbers to algebraic numbers but has no maximum in the algebraic numbers. I’m not going to try to define it here. I had to get help from three people to define it properly, and writing the code for the graph took half an hour. Please don’t ask me to explain it.
There’s one more shot we have at avoiding all this nonsense. The rational numbers don’t work, and the algebraic numbers don’t work. But in this essay we described a new set of numbers, which contains all the algebraics and more, but is still countable: the computable numbers. Can we use these to do calculus, and avoid thinking about the uncountably many uncomputable reals?
Surprisingly, the answer is yes—sort of.
If we can give a recipe for \(x,\) and a recipe for \(y,\) then we can give a recipe for \(x+y\)—by saying “\(x+y\)”! We can do the same thing for subtraction, multiplication, and division (as long as we don’t divide by zero), so when we do algebra with computable numbers, the result is also computable. That means the computable numbers are a field.
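One common way to make “recipe” precise (an assumption on my part—there are several equivalent formalizations, and the post doesn’t commit to one) is: a computable number is a program that, given \(n\), returns a rational number within \(1/2^n\) of the true value. Under that reading, the “recipe for \(x+y\)” really is just one line of code:

```python
from fractions import Fraction

# A sketch of one standard formalization: a computable number is a
# function that, given n, returns a rational within 1/2**n of the value.

def sqrt2(n):
    """Approximate sqrt(2) to within 1/2**n by bisection over rationals."""
    lo, hi = Fraction(1), Fraction(2)
    while hi - lo > Fraction(1, 2**n):
        mid = (lo + hi) / 2
        if mid * mid < 2:
            lo = mid
        else:
            hi = mid
    return lo

def add(x, y):
    """The recipe for x + y: ask each input for one extra bit of
    precision, so the two errors together stay under 1/2**n."""
    return lambda n: x(n + 1) + y(n + 1)

two_sqrt2 = add(sqrt2, sqrt2)
print(float(two_sqrt2(20)))  # close to 2 * sqrt(2) = 2.82842...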
The computable numbers are ordered, because they’re all real numbers, so we can compare two and see which one is bigger. They are not complete—if they were, they’d just be the real numbers. But they’re something almost as good: they’re “computably complete”.
Remember, completeness means that any sequence that looks like it should converge does in fact converge; the real numbers are never just missing the number the sequence wants to converge to. But this isn’t true for the computable numbers: there are sequences of computable numbers that converge to uncomputable numbers.
However, I can’t give you any examples of those sequences. And in this case I’m not just being lazy; I really can’t give you an example. Because if I could describe a sequence of computable numbers that converges to \(x,\) then that sequence gives a recipe for computing \(x,\) and so \(x\) must actually be computable. And that means that every computable sequence of computable numbers converges to a computable number. Not every sequence will behave well, but every sequence we can actually describe will.
In the same way, none of the value theorems are technically true in the computable numbers. But they’re almost true. They’re computably true.
When we found rational functions that failed the value theorems, we did something extra: we described them using only rational numbers. (That’s why the condition was \(x^2>2\) and not \(x > \sqrt{2},\) for instance.) But for the computable numbers we can’t do that. There are functions from the computable numbers to the computable numbers where the value theorems fail, but those functions are themselves uncomputable. Any function that we can actually compute the results of will satisfy the value theorems. If we do anything even vaguely reasonable, everything will work the way we expect.
So we could avoid the weirdness of the reals and stick to the computables. If we work in the reals, every number we actually talk about will be computable, so we don’t gain anything by allowing all the real numbers. And if we work in the computables, every function we want to think about will satisfy the value theorems.
But there’s no reason to stay in the computables—it fundamentally doesn’t matter. We work in the reals because they give us precisely the tools we want.
Sure, we don’t want to deal with arbitrary infinite decimals, and we certainly don’t want to deal with Dedekind cuts. But we don’t want to think about explicit computer programs for every number we ever use, either. What we want is the value theorems; what we want is a complete ordered field. And when we ask for that, without adding any restrictions, the real numbers are what pop out.
We shouldn’t think about the reals using their formal definition. We should think about what they do for us, the tools they allow us to use and the moves they allow us to make. The real numbers are a complete ordered field, and they give us the value theorems, so calculus works. And that’s all we want. We don’t need to make it more complicated than that.
But next time we’ll make it complex, instead.
Do you have comments, or questions? Are there other types of numbers you want to learn the story behind? Tweet me @ProfJayDaigle or leave a comment below.
We’re going to ignore the negative numbers here because they make everything more complicated in a boring and annoying way. I promise I could include them if I wanted to make this section even longer. ↵Return to Post
Fancy number theorists call this sort of thing a height. It’s a convenient way of putting a “size” on rational numbers so that there are only finitely many small ones, which is useful when we want to put things in order, or compute probabilities. ↵Return to Post
In part 1 we said rational coefficients, but this is the same thing. If you have an equation with rational coefficients, you can multiply through by the least common denominator and get an equivalent equation with integer coefficients. And assuming all the coefficients are integers is way more convenient for what we’re going to do here. ↵Return to Post
We could also pick 1, or 2, or 3, or 4, or 5, or 7, or 8, or 9; this isn’t a deterministic process. We just can’t pick 6. ↵Return to Post
There’s some slight weirdness here around the fact that \(0.99\dots~ = 1.\) But it’s not a real problem; don’t worry about it. ↵Return to Post
We could also stick it in the middle, or replace some element of the list we started with. But we can’t tack it on to the end, because this is an infinite list—it doesn’t have an end! ↵Return to Post
You might feel like this probability is obviously too big, and of course you’re right. But none of the argument depends on the specific number. I just want to keep the number small so it’s easier to think about what’s going on. ↵Return to Post
If this seems like a lot of symbols, you’re right! English has, like, thirty, and thirty is way smaller than infinity. But the lambda calculus, which is the mathematical formalism I linked, does in fact want an infinitely long list of symbols. Since this argument works either way, I’m letting the list be infinite. ↵Return to Post
If we give up on having an order, we can get "looping around" behavior, where repeatedly adding \(1\) gets us back to where we started. This shows up in modular arithmetic and finite fields, which I hope to discuss later in this series. But in an ordered field, this sort of looping isn’t possible, because adding \(1\) will always make a number larger. ↵Return to Post
And this is first an apology for not writing much lately; we just got through finals, and that involves a lot of grading. So I’ve been a little occupied actually earning my paycheck. But it’s also a response to this excellent post by Adam Mastroianni of Columbia Business School on his substack Experimental History. I liked the post, and it reflects a lot of things I think about in my teaching. But I have a couple of big disagreements—in part, I suspect, because we’re teaching in pretty different contexts—and I wanted to write up a quick discussion^{1} of where I think we differ, and why I think giving grades is important and valuable.
Let’s start with Adam’s framing:
My teaching job, it turns out, is actually two jobs.
One job is instruction. Students and I enter the same room at scheduled times, I perform a series of actions, they perform a series of responses, and then the students leave the room more educated than they were before. This job rules. I like it when my students go “ohh!” and “I never thought about it that way” and “I get it now!” I like when they email me, years later, to tell me how they used something they learned in class. This all makes sense. In fact, I thought this would be my only job.
But I realize now that I have a second job, which is evaluation, or gatekeeping, or, most specifically, point-guarding. I’m supposed to award “points” based on what students do in my class. Students try to acquire as many points as they can, and I try to stop them from obtaining points too easily….
This part of my job makes no sense. For one thing, point-guarding makes students miserable…. For another thing, point-guarding makes me miserable…. Worst of all, the things that make me a better instructor often make me a worse evaluator, and vice versa….
He then discusses three specific reasons one might want to give and/or guard points, and largely dismisses them.
The first reason is feedback. Adam says we don’t need points for that, and I agree completely. Grades are often the vehicle we use to give feedback, largely because we have to give grades anyway. But you can give feedback without attaching a grade to it. I have taken many dance and music lessons, and they never had grades attached; I still promise you I got, and incorporated, a ton of feedback from these lessons, because that’s why I was there.
Conversely, while every grade comes with some feedback, just hearing “3/10” doesn’t actually tell our students anything useful that can help them improve. The need to give grades often channels our feedback into not terribly useful forms.
The second reason is motivation. Adam rejects this one, because most people are naturally curious, and if they’re not motivated to learn in our courses, the takeaway should be that our courses suck. But I think he’s a bit too quick to dismiss the importance of motivation.
First of all, people do like getting points. This is what drives the success of apps like Duolingo: people start using Duolingo because they want to learn Spanish, but they keep doing it in part to keep their streak alive and keep earning the fundamentally meaningless Duolingo XP. People find it surprisingly motivating to get a gold star and a verbal pat on the back, and “10/10” is one way of doing that.^{2}
But I think I have a more substantive difference from Adam, which is shaped by the specific courses I mostly teach: introductory “gateway” math courses like calculus and linear algebra. I think these subjects are fascinating! (That’s why I became a professional mathematician, after all: I like math!) And in upper-division courses, “isn’t this cool” is actually pretty adequate to keep students engaged. (Most of my grad school classes had essentially no grading, and that was fine.) So it resonates when Adam says:
[I]f people need some extrinsic motivation to engage in my class, one of two things might be happening. Maybe they’re just not interested in what I have to offer. That’s fine! They should take a different class.
But most of my students aren’t taking calculus because they think it’s cool. They’re taking calculus because they need to know calculus to do other things they want to do. Their motivation is already extrinsic! And that creates a big problem of akrasia, because in the long term they want to have learned calculus, but in the short term they don’t “want” to sit down and do a bunch of exercises.^{3} And if you’re not doing exercises, you’re not learning math.^{4} A little bit of week-to-week prodding is valuable.
Moreover, I teach a lot of freshmen. They generally haven’t figured out how to manage themselves in college yet, and having some gentle guide rails (and metaphorical gold stars!) is really helpful.
The third reason is sorting students for the rest of the world. Adam just says he’s not interested in doing this: “What am I going to do, send the good students to heaven and send the bad students to hell?” And emotionally, I sympathize a lot. All my students are my students, I want the best for all of them, and I have no desire to draw judgments on their characters, or worth as people, or anything like that. Hell, I don’t want to evaluate them at all! Grading sucks!
But this is where I come back to the joke I started with, a bit more seriously. I don’t like grading, but it is a large part of what I’m getting paid for. Adam denies this:
Ranking my students doesn’t help me teach them, so I have no interest in doing it. But I understand why other people want me to do it.
In fact, they’re counting on it. Businesses need to decide who to hire, graduate schools need to decide who to admit, and scholarships need to decide who to fund, so they’d all appreciate it if I identified the best students for them. I can’t help but notice, however, that none of those organizations pay me. They pay headhunters, hiring managers, and program officers, after all, so it’s a little weird for me to do these people’s work for them. It’s especially egregious for these businesses and schools to force students to pay huge sums to get themselves evaluated by me, a guy who just wants to teach them psychology but ends up playing point guard instead.
But this is an aggressively shallow reading of the economics of academia. Sure, the businesses who are doing the hiring don’t pay me. But George Washington University does pay me, and they can afford to do that in large part because my students pay them.^{5} And my students pay for the degree because it gives them a credential they can bring to businesses and get hired.
So those businesses (and graduate schools and scholarships etc.) aren’t paying me directly, but they are responsible for me getting paid. Sure, I’m a guy who just wants to teach them math but ends up playing point guard instead; but that’s why I joke that my salary pays for the point guarding and I do the actual teaching for free.
Beyond the purse strings argument, sometimes we do need to evaluate people because we, as a society, need to know whom to trust. We don’t need to send some students to heaven and others to hell, but we do need to send some students to medical school and others to places where they won’t accidentally kill a bunch of people. And Adam fully concedes this at the end of his essay:
But look, we need some evaluation. People have different talents, and they should get opportunities that tap those talents, not just because it benefits them, but because it benefits everybody. If I’m drowning (God forbid), I want to be saved by a lifeguard who’s good at swimming. If I get hit by a bus (God forbid), I want to be operated on by someone who’s good at surgery. If I take a math class (God forbid), I want to learn from someone who’s good at math. For that world to exist, someone, at some point, has to evaluate people on their swimming, surgery, and math.
But he doesn’t want to do the evaluation. And he speculates about the benefits of completely separating teaching from evaluation.
To some extent this sounds appealing. First, because if I could keep my job except without the grading, that would be fantastic.^{6} Second, because having more uniformity in evaluation would be good: if my “A” isn’t the same as your “A”, then grades don’t do a good job of figuring out who knows math and who doesn’t. This is why a lot of departments use common final exams—and while those sound logistically annoying, I’m basically in favor of them; they’re a move in the direction Adam is suggesting.
And finally, Adam’s plan is appealing because writing good evaluations is itself a major skill, and a lot of professors write pretty shit evaluations. I did a short fellowship with the College Board this semester helping them evaluate the Calc AB AP test, and I was blown away by the quality of the questions and the meticulousness with which they were put together.^{7} They have a large team of skilled professionals who put in a ton of effort to write an exceptionally good test, which is exactly what Adam asks for, and that’s extremely valuable work.
But while the College Board test writers are great at their jobs, there’s also a reason they wanted to workshop the test with professors: we are, in actual fact, the experts in what skill at calculus looks like! If you want to know if people understand calculus, you need experts in calculus. If you want to know if people have learned psychology, you need experts in psychology. For better or for worse, we need to be in the loop somehow.
But another thing I agree with Adam on is that we need to take evaluation seriously, as its own task. Most professors don’t really think about evaluation much—but then, most professors don’t think about pedagogy all that deeply either.
Education theorists talk about “formative” and “summative” assessments. Formative assessments are mostly about teaching you something. I assign weekly problem sets because I want students to do those problems—because if they don’t do the problems, they won’t learn much. It’s not really an attempt to evaluate them. (And to be fair, I don’t think Adam is complaining about formative assignments.)
Summative assessments are the ones where you’re trying to really evaluate your students. And there are a few problems with the way we do those right now, but one is that we haven’t really committed to what these evaluations are supposed to say. Are we judging students on work ethic? On punctuality? On generally being good people?
And honestly, a lot of the time the answer to those questions is just “yes”. But if evaluation sucks, then evaluating “is this student a good person, overall” sucks twice. I don’t want to send my students to heaven or to hell, or try to tell whether they’re “good people” or not. And even if it’s possible, I’m not equipped to do it well.
What I am equipped to do is evaluate whether they know calculus. And that’s my job, right? My students need calculus so they can go take other classes that assume they know calculus. And I need to tell them, and everyone else, whether they do in fact know enough calculus to succeed in their next class.
And once I really embraced this idea, evaluation became a lot simpler.^{8} It lets me be kinder about some things, and stricter about others, but fundamentally it means I’m evaluating something I’m equipped to evaluate: do my students know calculus?
And that’s what I care about after all, isn’t it?
What do you think? Is grading worth it? Is there a better way? You can tweet me @ProfJayDaigle, make a note on Substack, or leave a comment below.
I also want to experiment with writing shorter posts to intersperse among the multi-thousand-word behemoths that take a month and a half to write. ↵Return to Post
You can give me a gold star and motivate me to keep writing by leaving a comment here, or on Twitter, or Substack, or by donating to my Ko-Fi. ↵Return to Post
And when I was taking piano lessons I never played enough scales. This is pretty much a human universal. Good practice is rarely fun. ↵Return to Post
I have a lot more to say about this, but if I tried to say it here then this post would become a multi-thousand-word behemoth that takes a month and a half to write. ↵Return to Post
Yes, universities have a lot of revenue sources other than student tuition. But most of those sources rely on them continuing to be prestigious universities whose students go on to get good jobs and have successful careers, so I don’t think that changes this argument substantially. ↵Return to Post
Actually, the grading isn’t the worst part. Answering emails about the grading is the worst part. One place I agree with Adam completely is that I don’t want to get detailed rundowns of my students’ personal issues, and I don’t want them to have to share them. ↵Return to Post
Disclosure: I did a short fellowship with the College Board this semester helping them evaluate the Calc AB AP test. Wait, I already said that. ↵Return to Post
I have a lot more to say about this, but if I tried to say it here then this post would become a multi-thousand-word behemoth that takes a month and a half to write.
No, I didn’t duplicate a footnote by accident. ↵Return to Post
In part 1, we started with the natural numbers, the most basic numbers we use to count things, and invented the integers (negative numbers) and the rational numbers (fractions). Then we took the same ideas a little further, and got the algebraic numbers, which are solutions to polynomial equations with rational coefficients—basically all the equations we can get by starting with the natural numbers and using just addition and multiplication.
But there are other questions we can ask, which don’t always give algebraic answers. So today we’ll look at a different question that we might want our numbers to answer: how do we measure things?
But before we start, if you like my writing and want to see more of this project, I have a Ko-Fi account. Any tips would be appreciated and would help me write more essays like this. Let me know what you’d like to hear about!
Last time we left off with a question: what is the area of a circle of radius 1? You probably know the answer: the area of a circle is given by the formula \( \pi r^2\), so if the radius is \(1\) the area must be \(\pi\). But where did that formula come from? And what about the number \(\pi\)—what exactly is it?
If we draw a quick picture, we can make a rough estimate of the area. The circle is contained inside a \(2 \times 2\) square, so it must have area less than \(4\); and it contains a \(1 \times 1\) square, so it must have area bigger than \(1\). But we want to be a bit more precise.
One option is just to draw more, smaller squares.
Each of these squares is \(\frac{1}{4} \times \frac{1}{4}\) and so has area \(\frac{1}{16}\). We can count that the circle contains \(32\) of them, and so has area at least \(\frac{32}{16} = 2\).
Conversely, we can contain the circle with \(60\) squares, so the circle has area less than \(\frac{60}{16} = \frac{15}{4} = 3.75.\)
So we have an estimate for the area \(A\) of our circle: \(2 < A < \frac{15}{4}.\) But this estimate still isn’t very good. We can improve it by drawing a finer grid, with more, smaller squares; but this gets tedious really quickly.
I’m not going to count the squares in this picture, but I could.
This gives us a way to think about the area of this circle. By drawing finer and finer grids, we can get better and better estimates of the area of the circle.
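If we don’t want to count squares by hand, a short script can do the counting for us. This is my own sketch (the post doesn’t include code), but for the \(\frac{1}{4}\)-grid it reproduces exactly the counts above: \(32\) squares inside the circle and \(60\) squares covering it.

```python
from fractions import Fraction

# Count grid squares of side 1/n against the unit circle, using exact
# rational arithmetic. A square is fully inside the (closed) disk when
# its farthest corner is; it helps cover the disk when its nearest
# corner is strictly inside. (The grid is aligned with the center, so
# checking corners suffices.)
def area_bounds(n):
    inside = covering = 0
    for i in range(-n, n):
        for j in range(-n, n):
            # squared distances of the corners of the cell
            # [i/n, (i+1)/n] x [j/n, (j+1)/n] from the center
            dists = [Fraction(i + di, n)**2 + Fraction(j + dj, n)**2
                     for di in (0, 1) for dj in (0, 1)]
            if max(dists) <= 1:
                inside += 1
            if min(dists) < 1:
                covering += 1
    cell = Fraction(1, n * n)       # area of one grid square
    return inside * cell, covering * cell

# The 1/4-grid from the text: 32 inner squares, 60 covering squares.
print(area_bounds(4))  # (Fraction(2, 1), Fraction(15, 4))
```

Raising `n` tightens both bounds, squeezing in on the true area from both sides.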
As a mathematician, I’m a very specific kind of lazy. I’m much too lazy to count up dozens of tiny squares, but I am willing to make very complicated, abstract, and possibly confusing arguments to avoid counting the squares. So I want to estimate the area of this circle in a more formulaic way, so I don’t have to count anything.
Let’s pretend the circle is a pizza. We can cut it into eight slices, like this:
If we connect the points where the slices meet the crust, we get an octagon around the outside. That lets us replace our difficult question with a simpler one: what is the area of this octagon? The octagon is made up of eight triangles, and we know the area of a triangle is \( \frac{1}{2} b h \), where \(b\) is the length of the triangle’s base, and \(h\) is the triangle’s height. So the area of the octagon is \(8 \cdot \frac{1}{2} bh = 4 bh\).
The triangle’s height is roughly the radius of the circle, which is \(1\); and the length of the base is roughly one eighth of the circumference of the circle. And since we’re just estimating, rough numbers are fine; we can say that
\[ \text{Area of circle} \approx \text{Area of Octagon} \approx 4 \cdot \frac{\text{circumference}}{8} \cdot 1 = \frac{\text{circumference}}{2}, \] so the area of the circle of radius 1 is about half its circumference. In fact, we can make this same argument for a circle of any radius: if the radius is \(r\) and the circumference is \(C\), then the area will be approximately \(\frac{1}{2} C r.\)
But these are all just rough estimates. The area of the octagon isn’t exactly \(\frac{1}{2} Cr\), and the area of the circle isn’t exactly the same as the area of the octagon. But here’s where we have a key insight, which the Greeks called the method of exhaustion^{1}: both of those approximations get better if we draw a shape with more sides. Here’s the same basic picture, but with a sixteen-sided hexadecagon instead of an octagon:
We have sixteen triangles, which still have a height of about \(r\), but have bases of length about \(\frac{C}{16}\). This gives a total area of roughly
\[ \text{Area of Circle} \approx \text{Area of Hexadecagon} \approx 16 \cdot \frac{1}{2} \cdot \frac{C}{16} \cdot r = \frac{1}{2} C r. \]
And next we have a \(32\)-sided icosidodecagon.^{2}
The grey area isn’t the whole circle, but I can’t actually see the difference.
The icosidodecagon is still not quite the same size as the circle, but it’s pretty close. So we get an even better approximation:
\[ \text{Area of Circle} \approx \text{Area of Icosidodecagon} \approx 32 \cdot \frac{1}{2} \cdot \frac{C}{32} \cdot r = \frac{1}{2} C r. \]
More importantly, we can see that as the number of sides goes up, all of our approximations get better: the polygon is closer to being a circle, the height of each triangle is closer to the radius, and the base of each triangle is closer to \( \frac{C}{n} \), where \(n\) is the number of sides of the polygon. So we can tell this approximation will get better and better as the number of sides of our polygon gets bigger; we conclude that the area of a circle is exactly \[ A = \frac{1}{2} C r. \]
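We can check this limit numerically. Here’s a sketch of mine that uses only square roots (no \(\pi\) anywhere, so the check isn’t circular): start with the inscribed hexagon, repeatedly double the number of sides using the half-angle identity, and compare the polygon’s area to half its perimeter.

```python
from math import sqrt

# Numerical check that inscribed-polygon areas approach (1/2) * C * r
# for the unit circle. If a side has length s, the half-angle identity
# gives sqrt(2 - sqrt(4 - s**2)) for the side after doubling.
def polygon_stats(doublings):
    sides, s = 6, 1.0                  # inscribed hexagon: 6 sides of length 1
    for _ in range(doublings):
        s = sqrt(2 - sqrt(4 - s * s))  # side length after doubling
        sides *= 2
    perimeter = sides * s
    apothem = sqrt(1 - s * s / 4)      # triangle height, "roughly r = 1"
    area = sides * (s * apothem / 2)   # n triangles of area (1/2) * base * height
    return perimeter, area

p, a = polygon_stats(8)
print(p / 2, a)  # the two numbers agree to many decimal places
```

Half the perimeter and the polygon area converge to the same value, which is exactly the claim \(A = \frac{1}{2} C r\) for \(r = 1\).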
But that still leaves us with a problem. This isn’t the formula for the area of a circle that you know and (maybe) love. And in fact this formula is not nearly as useful as \(\pi r^2\), because it requires both the radius and the circumference. We know the radius is \(r\); but what’s the circumference?
I know I said I’d invent some numbers, and I promise I’m getting there soon. But we should finish answering this question first.^{3}
We can find the circumference of a circle with the same basic method-of-exhaustion logic we used to find the area formula. If we inscribe a polygon inside the circle, the perimeter of the polygon will be roughly the circumference of the circle; and the more sides that polygon has, the better this approximation will be.
The trick is finding a polygon that we can actually estimate the circumference of. And what Archimedes noticed is that if the number of sides of the polygon is \(3 \cdot 2^n\), we can use some basic trigonometry to work this out.
A circle has \(360^\circ\) total in it. If we inscribe a hexagon, we can chop the circle into six equilateral triangles, which will each have an inner angle of \(60^\circ\). We can cut these in half to get an angle of \(30^{\circ}\)—and this is convenient, because some basic trigonometry^{4} can convince us that \(\sin(30^\circ) = 1/2\). This means that each side of the hexagon has length \(r\), and the perimeter of the hexagon is \(6r\).
If we take a circle with radius \(1\), then each side of the hexagon has length 1, and the perimeter is just \(6\). This tells us that the circumference of the circle has to be bigger than six—but not too much bigger.
But more importantly, we can extend this argument. There’s a standard trigonometric formula^{5} for finding the sine of half of an angle. That means that when we look at a twelve-sided dodecagon and get an angle of \(15^\circ\), we can compute that \(\sin(15^\circ) = \frac{\sqrt{2 - \sqrt{3}}}{2}\). This tells us that each side has length \(\sqrt{2 - \sqrt{3}}\), and thus the total perimeter of the dodecagon is \(12 \sqrt{2 - \sqrt{3}}\approx 6.212.\)
Doubling the sides again gives a \(24\)-sided icositetragon; we use the trigonometric identity again, which gives a more complicated formula. But we can work out that each side has length \( \sqrt{2 - \sqrt{2 + \sqrt{3}}}, \) and the whole polygon has a perimeter of \(24 \sqrt{2 - \sqrt{2 + \sqrt{3}}} \approx 6.27.\)
Another doubling gives us a \(48\)-sided shape with perimeter \(48 \sqrt{2 - \sqrt{2 + \sqrt{2 + \sqrt{3}}}} \approx 6.28,\) and one more gives us a \(96\)-sided shape with perimeter \( 96 \sqrt{2 - \sqrt{2 + \sqrt{2 + \sqrt{2 + \sqrt{3}}}}} \approx 6.28\) again. So by the Method of Exhaustion, it’s reasonable to claim the circumference is about \(6.28\).
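The whole doubling process fits in a few lines of code. This is my sketch (the post does the computation by hand): the half-angle identity says that if a side of the inscribed polygon has length \(s\), the side after doubling has length \(\sqrt{2 - \sqrt{4 - s^2}}\), which is exactly where the nested radicals above come from.

```python
from math import sqrt

# Archimedes-style doubling for a unit circle: start from the inscribed
# hexagon (6 sides of length 1) and repeatedly double the number of
# sides, tracking the perimeter at each stage.
def perimeters(doublings):
    sides, s = 6, 1.0
    results = [sides * s]
    for _ in range(doublings):
        s = sqrt(2 - sqrt(4 - s * s))  # new side length via half-angle identity
        sides *= 2
        results.append(sides * s)
    return results

# Reproduces the numbers in the text:
# 6.0, then 6.2117 (12-gon), 6.2653 (24-gon), 6.2787 (48-gon), 6.2821 (96-gon)
print(perimeters(4))
```

The perimeters increase toward the circumference \(2\pi \approx 6.2832\), and each doubling roughly quadruples the accuracy.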
In fact, this entire argument scales up with the radius. So if a circle has radius \(r\), then the circumference is \(C \approx 6.28 r\); and from our earlier argument, the area is \( A = \frac{1}{2} Cr \approx 3.14 r^2\). This number \(3.14\)^{6} eventually got the name \(\pi\), the first letter of the Greek word περίμετρος (perimetros), which means “perimeter” or “circumference”. And thus we finally have the formulas you know from school:
\[
\begin{aligned}
C & = 2 \pi r \\
A & = \pi r^2.
\end{aligned}
\]
This argument produced a number, which we said is about \(3.14\). But what exactly do we mean when we write down the number \(\pi\)?
We described \(\pi\) by approximating it. It’s the number that’s close to \(3\), and closer to \(6 \sqrt{2-\sqrt{3}}\), and even closer to \(12 \sqrt{2 - \sqrt{2+\sqrt{3}}}\), and even closer to…
The Greeks called this the Method of Exhaustion, but in modern language we call it a limit. In calculus, we give a definition of limits something like this:^{7}
Definition: If we have an infinite list of numbers \(a_1, a_2, \dots, a_n, \dots\), and another number \(L\), we say that \(L\) is the limit of the sequence \( (a_n) \) if we can approximate \(L\) as precisely as we want by choosing a large enough \(n\). We notate this by writing \(\lim_{n \to \infty} a_n = L.\)
Less formally, the number \(L\) is the limit of a sequence of numbers if the numbers eventually get really close to \(L\). The idea is that the numbers \(a_1, a_2, a_3, \dots \) are each approximations of \(L\), and as we go further into the list, they approximate it better and better—which is exactly what we did when we estimated \(\pi\) earlier.
Except there’s a problem here. If we know \(L\) is a number, this is all fine. It’s not too hard to convince yourself, say, that the sequence \(( 1, 1/2, 1/4, 1/8, 1/16, \dots )\) is getting close to zero, or that \(1/n\) is a good approximation of zero for large values of \(n\).
We can see that the points with heights \(1/n\) are getting closer to the red line at height \(0\). The further we get into the sequence, the better an approximation this is.
But on the other hand, if we have a list like \((1, 2, 3, 4, \dots)\), or \((-1, 1, -1, 1, \dots)\), it doesn’t look like those are approximating any number.
On the left, the sequence \((1, 2, 3, 4, \dots)\) is getting bigger and bigger without approaching any particular number. On the right, the sequence \((-1, 1, -1, 1, \dots )\) is bouncing back and forth between two values, rather than approximating one specific number.
Not every sequence has a limit, because not every sequence is approximating one particular number. So how do we know our sequence \[ \left( 3, \quad 6 \sqrt{2-\sqrt{3}}, \quad 12 \sqrt{2-\sqrt{2+\sqrt{3}}}, \quad 24 \sqrt{2-\sqrt{2+\sqrt{2+\sqrt{3}}}}, \quad \dots \right) \] does approach a number?
Unfortunately, it kind of doesn’t.
For a long time, people looked for a way to represent \(\pi\) as a rational number—as a ratio of two integers. We found that \(22/7\) is a pretty good approximation, and \(355/113\) is a shockingly good approximation (correct to six decimal places). But in 1761, Johann Heinrich Lambert proved that \(\pi\) isn’t a rational number.
Now, we do have other, “irrational” numbers. In part 1 we talked about algebraic numbers, which are solutions to polynomial equations \(a_0 + a_1 x + \dots + a_n x^n =0\). We used this technique to construct lots of irrational numbers, like square roots, cube roots, and the indescribable solutions to \(x^5+x+3=0\).
But \(\pi\) isn’t one of those, either. In 1882, the German mathematician Ferdinand von Lindemann showed that \(\pi\) is a transcendental number, which means it isn’t the solution to any polynomial equation with rational coefficients. We just can’t describe it with any of the tools we saw in Part 1.
It’s quite difficult to show that \(\pi\) is transcendental, and I’m not going to try to prove it here. The most common proof relies on the fact that the number \(e\) is transcendental, and even that isn’t easy to prove. But we do know \(\pi\) isn’t an algebraic number—so what is it?
The details are different, but we’re really in the same boat we found ourselves in last time. In part 1, we wanted a solution to the equation \(x^2-2=0\), but we couldn’t find a number that worked, so we just made one up. We can do the same thing here. When a sequence looks like it should have a limit, we’ll make one up for it.
We need to be careful, though, because lots of sequences don’t look like they’re converging anywhere, and those shouldn’t have limits.
Some sequences, like the first one, go off to infinity, and others, like the second, bounce around between multiple different targets. But in some sequences, like the third, all the numbers eventually get very close together. We call those “Cauchy” sequences,^{8} and we want to have limits for all of them.
The definition of a Cauchy sequence may seem very similar to the definition of a limit, but it’s not quite the same. A sequence has a limit if the terms all get close to some fixed number; it’s Cauchy if the terms all get close to each other. In a Cauchy sequence, it seems like there should be some number the terms are getting close to, but in sets like the rational numbers, that may not be true. The rationals have “holes” that the terms of the sequence can gather around, but that don’t correspond to any rational number.
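We can illustrate the distinction numerically. Here's a minimal Python sketch (the helper `tail_spread` is a name I'm inventing for this illustration): far enough into a Cauchy sequence, all the remaining terms huddle together.

```python
def tail_spread(seq, start):
    """Largest gap between any two terms from index `start` onward."""
    tail = seq[start:]
    return max(tail) - min(tail)

# Three sample sequences, 200 terms each.
diverging = list(range(1, 201))                    # 1, 2, 3, 4, ...
oscillating = [(-1) ** n for n in range(1, 201)]   # -1, 1, -1, 1, ...
shrinking = [1 / n for n in range(1, 201)]         # 1, 1/2, 1/3, ...

print(tail_spread(diverging, 100))    # stays large: 99
print(tail_spread(oscillating, 100))  # stays 2: the terms never settle
print(tail_spread(shrinking, 100))    # tiny, and shrinks as `start` grows
```

Only the third sequence has tails that can be made as tight as we like, which is exactly the Cauchy condition.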
The most famous example is probably \(\sqrt{2}\). We saw last time that \(\sqrt{2}\) is irrational: there are no integers \(p\) and \(q\) such that \( \left( \frac{p}{q} \right)^2 = 2\). But we can find a rational number \(a_1\) so that \(1.9 < (a_1)^2 < 2\), and then a second, \(a_2\), with \(1.99 < (a_2)^2 <2\), and a third, \(a_3\), with \(1.999 < (a_3)^2 <2\); and if we keep doing this, we get a sequence of numbers that clearly “wants to” converge to \(\sqrt{2}\).^{9} And that shouldn’t cause us too much distress. Even though \(\sqrt{2}\) is irrational, it’s an algebraic number, so we already created it; we don’t need to make up anything new.
But another hole in the rationals is \(\pi\). We built a Cauchy sequence of algebraic numbers that wants to converge to \(\pi\): \[ \left( 3, \quad 6 \sqrt{2-\sqrt{3}}, \quad 12 \sqrt{2-\sqrt{2+\sqrt{3}}}, \quad 24 \sqrt{2-\sqrt{2+\sqrt{2+\sqrt{3}}}}, \quad \dots \right) \]
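We can watch this sequence close in on \(\pi\) numerically. A quick sketch, using the recurrence \(s_{k+1} = \sqrt{2 + s_k}\) that's hidden inside the nested radicals:

```python
import math

# term_k = 3 * 2^k * sqrt(2 - s_k), where s_1 = sqrt(3)
# and s_{k+1} = sqrt(2 + s_k) builds up the nested radical.
terms = [3.0]                # the inscribed hexagon gives 3
s = math.sqrt(3)
for k in range(1, 15):
    terms.append(3 * 2**k * math.sqrt(2 - s))
    s = math.sqrt(2 + s)

print(terms[:4])             # 3.0, 3.105..., 3.132..., 3.139...
print(terms[-1] - math.pi)   # a very small number
```

(Pushing `k` much past 15 starts losing accuracy to floating-point cancellation in the \(2 - s_k\) step, but that's a limitation of the arithmetic, not of the sequence.)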
With a little more effort we could build a sequence of rational numbers that does the same thing. (For instance, as we’ll see later, \((3, 3.1, 3.14, 3.141, \dots )\) would work.) But \(\pi\) isn’t an algebraic number like \(\sqrt{2}\). From the algebraic perspective of part 1, it doesn’t exist.
But it should exist. So we’ll invent it.
If we invent all the numbers like this that should be the limits of sequences, we get the real numbers, which we represent with the symbol \(\mathbb{R}\). And the real numbers are complete, which means that every Cauchy sequence—every sequence which ought to converge—does in fact converge.
From this perspective, we can say that a real number is just a Cauchy sequence. But that’s not a great way to talk about them, for two reasons. The first is just that it’s awkward. I don’t want to go around talking about the number \[ “\lim_{n \to \infty} \left( 3, \quad 6 \sqrt{2-\sqrt{3}}, \quad 12 \sqrt{2-\sqrt{2+\sqrt{3}}}, \quad 24 \sqrt{2-\sqrt{2+\sqrt{2+\sqrt{3}}}}, \quad \dots \right)”, \] and neither do you.
The second problem is that more than one sequence can converge to the same limit. \( (1, 1/2, 1/3, 1/4, \dots ) \) has the same limit as \((0,0,0,0, \dots ) \) or \( (1, 1/2, 1/4, 1/8, \dots )\); we really don’t want to treat them as different real numbers. We can fix this problem by defining real numbers to be “equivalence classes of Cauchy sequences of rational numbers”, but that gets extremely cumbersome.
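The equivalence itself is easy to state, though: two Cauchy sequences name the same real number exactly when their term-by-term differences shrink to zero. A small sketch:

```python
a = [1 / n for n in range(1, 1001)]   # 1, 1/2, 1/3, ...
b = [0.0] * 1000                      # 0, 0, 0, ...
c = [1 / 2**n for n in range(1000)]   # 1, 1/2, 1/4, ...

# All three name the same real number (zero): the differences die out.
print(a[-1] - b[-1])   # 0.001, and heading to 0
print(c[-1] - b[-1])   # astronomically small already
```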
The official method for constructing the reals uses something called Dedekind cuts, where a real number is a way of cutting the rational numbers in half. So for example, we identify \(\sqrt[3]{2}\) with the set of all the rational numbers \(x\) with \(x^3 < 2\). This has the advantage that it’s really easy to use in proofs; it has the disadvantage that it’s even more cumbersome to work with than the Cauchy sequences description.
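To make the cut idea concrete, here's a sketch: we represent \(\sqrt[3]{2}\) purely by the membership test \(x^3 < 2\), and then recover the number's decimal expansion by bisecting with rationals.

```python
from fractions import Fraction

def in_cut(x):
    """Membership test for the Dedekind cut of the cube root of 2."""
    return x ** 3 < 2

# Recover the number the cut describes, by bisection over rationals.
lo, hi = Fraction(1), Fraction(2)
for _ in range(60):
    mid = (lo + hi) / 2
    if in_cut(mid):
        lo = mid
    else:
        hi = mid

print(float(lo))   # 1.2599..., the cube root of 2
```

Everything here is rational arithmetic; the irrational number only appears as the "hole" the bisection closes in on.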
But there’s a much easier approach. And it’s something we all learn in high school.
In high school algebra, I learned that a real number is an infinite decimal.^{10} Where does this idea come from?
We said that the real numbers are complete, which means every Cauchy sequence converges. But they’re also ordered: if we have two distinct real numbers, one will always be greater than the other. And that gives us another way to characterize completeness:
Monotone Convergence Theorem: if a sequence of real numbers is increasing and bounded above, then it converges.
The idea here is that if a sequence is always increasing, it can’t really bounce around. So there are only two options: either it goes to infinity, or it converges to some real number. And this is basically how we actually got \(\pi\), right? Each polygon had a bigger perimeter than the last one, but the perimeter would never get bigger than, say, \(8\). We had an increasing sequence with an upper bound, so it had a limit.
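The theorem is easy to see in action. For instance, the partial sums of \(1/k^2\) are increasing and never exceed \(2\), so the theorem promises a limit; famously the limit happens to be \(\pi^2/6\), though nothing in the theorem tells us that.

```python
import math

partials = []
total = 0.0
for k in range(1, 100_001):
    total += 1 / k**2
    partials.append(total)

# Increasing, and bounded above by 2 (since 1/k^2 <= 1/(k(k-1)), which telescopes)...
assert all(x < y for x, y in zip(partials, partials[1:]))
assert partials[-1] < 2
# ...so the Monotone Convergence Theorem says it converges.
print(partials[-1], math.pi**2 / 6)   # the two agree to several decimals
```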
Now a finite decimal is just a rational number. We can interpret a finite decimal \(3.14\) as something like \( \frac{314}{100},\) and similarly \(1.414 = \frac{1414}{1000}.\) But we can’t do the same thing with an infinite decimal; we’d have to have an infinitely large numerator and an infinitely large denominator.
Instead, we interpret an infinite decimal as a sequence. When we write that \(\pi = 3.14159 \dots,\) we mean that \(3\) is a rough approximation, and \(3.1\) is a better approximation, and \(3.14\) is even better; thus \(\pi\) is the limit of the sequence \((3,3.1, 3.14, 3.141, 3.1415, \dots). \)
Every infinite decimal is an increasing sequence, and every infinite decimal is bounded above: whatever we can say about a number like \(1.41421\dots\), we know it can’t be bigger than \(2\). So every infinite decimal corresponds to a real number.
And just as importantly, every real number corresponds to an infinite decimal! If we have a real number \(x\), we can find the biggest number with one decimal place that’s smaller than \(x\). Then we can find the biggest number with two decimal places, and the biggest with three, and the biggest with four… and this gives an infinite decimal that converges to \(x\).
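In code, that construction is just repeated truncation. A sketch (using floor, so it picks the biggest \(n\)-place decimal that doesn't exceed \(x\); `truncations` is a name I'm making up):

```python
import math

def truncations(x, places):
    """The biggest n-decimal-place numbers not exceeding x, for n = 0..places."""
    return [math.floor(x * 10**n) / 10**n for n in range(places + 1)]

print(truncations(math.pi, 5))
# [3.0, 3.1, 3.14, 3.141, 3.1415, 3.14159]
```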
This construction generally does what we expect it to, but there’s one very special case where it doesn’t. We know \(1\) is a natural number, and thus a rational number, and thus a real number. So how do we write it as an infinite decimal?
The largest number with one decimal place that’s less than \(1\) is \(0.9\). With two decimal places, we get \(0.99\). With three we get \(0.999\). So by this construction, the infinite decimal representation of \(1\) is in fact \(0.999 \dots .\)
You may have run across this claim, that \(0.999 \dots~= 1\), before; and it almost always triggers a great deal of resistance. Surely it must be smaller than one; the leading digit is a zero!
You’ll sometimes see simple algebraic proofs like this:
\[
\begin{aligned}
10 \cdot 0.999 \dots & = 9.999 \dots \\
9 \cdot 0.999 \dots & = (9.999 \dots) - (0.999 \dots) \\
9 \cdot 0.999 \dots & = 9 \\
0.999 \dots & = 9/9 = 1.
\end{aligned}
\]
But a lot of people find that unsatisfying and unconvincing.
In fact that argument is a little glib, and glosses over some fairly sophisticated ideas—which we just worked through.^{11} An infinite decimal is asking for a limit, which isn’t how people generally think of numbers. But it’s certainly true that \(1\) is approximated by \(0.9\), and approximated even better by \(0.99\), and even better by \(0.999\); and that we can make that approximation as good as we want by adding more \(9\)s to the decimal.
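We can even make "as good as we want" exact, by using rational arithmetic rather than floating point: after \(n\) nines, the gap below \(1\) is exactly \(10^{-n}\).

```python
from fractions import Fraction

# Partial sums 0.9, 0.99, 0.999, ..., computed exactly.
gaps = []
total = Fraction(0)
for n in range(1, 11):
    total += Fraction(9, 10**n)
    gaps.append(Fraction(1) - total)

print(gaps[:3])   # [Fraction(1, 10), Fraction(1, 100), Fraction(1, 1000)]
```

The gap never reaches zero at any finite stage, but it shrinks past every positive number, and that is precisely what the limit statement asserts.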
And that’s all the equation \(0.999\dots~ = 1\) actually means. The sentence seems weird, because real numbers are weird. They seem innocuous, but a single real number is secretly an infinite collection of infinite sequences. And if we look too closely, the weirdness starts leaking out.
We started off with a fairly innocuous question: what is the area of a circle? And the answer turned out to be…quite a bit more complicated than we might have expected. And it gets worse! For instance, while there are infinitely many rational numbers, we can show that \(100\%\) of real numbers are irrational—and in fact \(100\%\) of them are, in a very precise sense, impossible to describe.
The real numbers are so weird and complicated that you might be wondering if we really need to do all of this. Sure, \(\pi\) is important, but can’t we just treat that as a one-off idiosyncrasy, and avoid all this nonsense about Cauchy sequences and Dedekind cuts? Unfortunately, we can’t. Sure, real numbers are extremely weird eldritch horrors; but they’re also exactly the tool we need to do calculus.
There’s more to say about both of these ideas: why are the real numbers weird, and why are they so useful? So next time we’ll learn more about just how strange the real numbers are, and see why they are, nonetheless, perfectly suited to solve a whole host of very important problems.
Have questions? Can’t wait for part 3? Want to share your favorite weird numbers with me? Tweet me @ProfJayDaigle or leave a comment below.
No, not because everyone was exhausted by this point in the lesson. ↵Return to Post
A word I’m pretty sure I’d never heard before I just looked it up. ↵Return to Post
The paper How Archimedes showed that \(\pi\) is approximately equal to 22/7 by Damini D. B. and Abhishek Dhar was extremely helpful to me in putting this section together. ↵Return to Post
Just last week I told a student I had no memory of how to prove this. But the simple argument is precisely that we’re cutting an equilateral triangle in half—the half-triangle has an angle of thirty degrees and a side that has half the length of the hypotenuse. ↵Return to Post
Which I have to look up every time I want to use it. ↵Return to Post
Why 3.14 and not 6.28? The Greeks were more interested in the diameter of the circle than the radius, and so they thought the interesting formula was \(C = \pi d\), rather than \(C = 2 \pi r\).
Modern mathematicians generally see the radius as more fundamental, so we phrase all our formulas in terms of the radius; this means that a lot of our formulas contain the term \(2 \pi\). There’s a movement to stop using \(\pi\) and instead use the Greek letter \(\tau\) (tau) as the fundamental constant \(\tau = C/r = 2 \pi\). But it’s hard to change notation, so we slog on using \(\pi.\) ↵Return to Post
We can give a more precise definition using the Greek letter \(\varepsilon\), which is infamously confusing to calculus students. It’s really just a more precise way of saying the same thing.
We say that \(L\) is the limit of \( (a_n) \) if, for every \(\varepsilon >0\), there is a natural number \(N \) such that if \(n > N \) then \( \mid a_n -L \mid < \varepsilon\).
See if you can see why this means the same thing as the less formal version I wrote in the main text. ↵Return to Post
Pronounced “coh-shee”. They’re named after the 19th-century French mathematician Augustin-Louis Cauchy, who helped formalize this approach to limits and the real numbers. ↵Return to Post
But you can’t make the same argument for \(i\), the square root of \(-1\); this will be important next time. ↵Return to Post
Yes, even whole numbers are infinite decimals. We’ll get there. ↵Return to Post
A similar approach can also be used to “prove” that \(1+2+4+8+ \dots~ = -1\), which is obviously not what we mean. ↵Return to Post
This diagram summarizes my Ph.D. thesis. Every node represents a specific type of weird number. My mother said it looked like I was trying to summon Cthulhu.
And these things are all strange, and hard to define. But they’re not crazy, and they’re not random. Each of these weird number systems was invented to solve specific problems or answer specific questions. In this series of posts, I want to give you a sense of where these constructions come from, and how you can start out asking reasonable-sounding questions and wind up in the diagram above.
This isn’t a history lesson; I’m not going to tell you who first described these things, or how we actually started talking about them. But in the spirit of Timothy Chow’s You Could Have Invented Spectral Sequences, I want to convince you that you could have discovered all sorts of exotic number systems by asking reasonable questions. So we’ll walk through what the philosopher of mathematics Imre Lakatos would have called a rational reconstruction of the history of numbers—not what actually happened, but a cleaned-up fictional version that could have happened.
Today I’ll discuss the most sensible collections of numbers. We’ll start with the basic ability to count, and we’ll build up through the algebraic numbers, which let us solve all sorts of reasonable equations. In future parts we’ll tackle the tricky problem of completeness (part 2), which gives rise to the real and complex numbers and the \(p\)-adics, move into higher dimensions with the quaternions and octonions, and maybe look at some genuinely exotic ideas.
And as always, if you like my writing and want to see more of this project, I have a Ko-Fi account. Any tips would be appreciated and would help me write more essays like this. Let me know what you’d like to hear about!
The first numbers we can talk about are the numbers we use to count things: \(\{ 1, 2, 3, \dots \}\). We call these the natural numbers,^{1} abbreviate them with the symbol \(\mathbb{N}\), and mostly we all understand them pretty well. I have five apples right now; I know because I counted them. If I eat one I’ll have four apples. We can add natural numbers: if I start with five apples and add three apples then I will have eight. And we can multiply them: if I get four groups of five apples I will have twenty apples.
We can give a “formal definition” in set theory, where \(1\) is identified with the one-element set \(\{\varnothing\} \), and \(2\) is identified with the two-element set \( \{ \varnothing, \{\varnothing\} \} \), and so on. But this is really a way of understanding set theory, not understanding the natural numbers. For almost any reasonable use, the best definition of natural numbers is “oh, natural numbers are those things you count with”. Natural numbers come before math.
But even the natural numbers are deeper and conceptually richer than they seem. I have five apples, but that doesn’t mean they’re all the same!^{2} If we say we have “five apples”, rather than “a big apple, three medium apples, and a small apple” or “two Fujis, two Granny Smiths, and a Red Delicious” or even “an apple here and two apples there and two apples in between” we’re abstracting, waving away the differences so that we can describe them all with the number \(5\).^{3}
Even ordinary counting can raise philosophical questions. How many objects are in this picture?
Dale Cruse from San Francisco, CA, USA, CC BY 2.0, via Wikimedia Commons
You could say there’s one pizza. Or you could say there are ten slices of pizza. You could say there are eight slices, since one of the cuts isn’t very clean and a couple of the pieces seem stuck together. Or you could even start counting individual slices of onion. You have to decide what counts as “one object” before you can count things.
And there are plenty of other ways this abstraction can break down. If we combine two apples and one pear, we get three fruits, and the numbers work out but the noun changes. If we combine one heap of sand with another heap of sand, we get one heap of sand, so is \(1+1 = 1\)? No, the natural numbers just don’t model “adding heaps of sand” very well. And if we combine four atoms of iron with three molecules of oxygen, we somehow wind up with two molecules of rust: we add four things to three other things and end with two.
But while the natural numbers are in fact a simplified model, and an abstraction, they’re a pretty robust and, well, natural one. We use them a lot, we understand them from a young age, and they work well to describe a lot of phenomena.
There is infamous controversy about whether zero counts as a natural number. This is mostly a terminological dispute, and in practice I duck the issue entirely. When it matters, I either include zero by writing \(\mathbb{Z}_{\geq 0} \), or exclude it by writing \(\mathbb{Z}_{> 0}\) , rather than writing \(\mathbb{N}\) and confusing half my readers.
But for the purposes of this fake history, we should treat zero as the conceptual innovation it is. While the idea of having none of something is very simple, treating zero as a number is a much more sophisticated and abstract idea. Sure, there are zero oranges in my apartment right now, along with zero kumquats, zero elephants, and zero large expensive gemstones^{4}; if we’re listing off things I have zero of, we’ll be here a long time.
Treating zero as a number requires either that I be willing to “count” all those things that don’t exist, or that I give real conceptual heft to an abstract calculational tool. And while “give conceptual heft to an abstract calculational tool” might as well be the official motto of modern mathematical thought, it took a long time for people to accept it.
You might be surprised that I talked about addition and multiplication, and skipped subtraction. But subtraction can actually be pretty subtle! It introduces a new idea: we want to undo addition.
This undoing can be simple. If I want to have five apples and I already have three, I need to add two more apples; and we write that \(5-3 = 2\). If I want to have ten apples and I have four, I need \(10-4 = 6\) more apples. Algebraically we can look at this as solving equations; if I want to solve \(3+x = 5\) I get \(x=2\), and if I want to solve \(4+x = 10\) I get \(x=6\).
But if I want to have four apples and I already have six apples, I get stuck. There’s no number of apples I can add to my six apples to get four. Algebraically I’m trying to solve \(6 + x = 4\), and there’s no natural number that solves that equation.
I don’t get stuck in the real world, obviously. If I have six apples and want four, I can throw two apples away. (Or eat them.) But I’m stuck mathematically, because I can’t really describe that situation with just the natural numbers. To talk about this, we need something new. We need negative numbers.
There are a few ways we can physically interpret a negative number of apples. Maybe it’s a debt: not only do I have no apples, but I also have to give you two apples tomorrow. Maybe it’s a loss, or a change: I can have two fewer apples today than I did yesterday, even if I can’t have \(-2\) apples right now. And maybe it’s a direction: I can’t move \(-2\) feet any more than I can have \(-2\) apples, but I can certainly move \(2\) feet backwards.
But what is \(-2\)? It’s the solution to \(6+x = 4\). Or more directly, it’s the solution to \(2+x = 0 \). That equation doesn’t have a solution in the natural numbers; but it would be really nice if it did have a solution, so we made one up. And we called it \(-2\).
And if we insist that \(a+x=b\) should have a solution for any natural numbers \(a\) and \(b\), we get the integers^{5} \( \{\dots, -2, -1, 0, 1, 2, \dots\} \), written \(\mathbb{Z}\)^{6}. These demand another level of abstraction, and as a result they tend to feel a little less “real” to people. But they’re useful—they let us model loss and debt and motion backwards and a whole bunch of other things, all with the same algebraic tool—so we put up with them.
But we didn’t just invent some new numbers; we also got a tool for inventing more numbers. With natural numbers we can add, and by undoing addition we got subtraction: we were trying to solve equations like \(a + x =b\). But we can also multiply the natural numbers, so we can try to undo that as well.
In some cases, this just works. With just the natural numbers, I can compute \(6 / 3\)—what should I multiply by \(3\) to get \(6\)—and get \(2\). I can interpret this physically, as we often do in grade school: if I have six dolls and want to divide them among three people, everyone gets two. But I can also interpret it algebraically, as trying to solve the equation \[ 3 \cdot x = 6. \] And maybe I muck around, or operate by trial and error, or honestly just have this one memorized, and I see that the number \(2\) solves that equation.
But I can also try to compute \(5 / 3\). There’s certainly nothing stopping me from writing those symbols down! But then I get stuck, just like I did when I wanted \(4-6\). If I try to divide five dolls among three people, either some people get more than others, or I have some left over. If I think algebraically, I look at the equation \(3 \cdot x = 5\), and I can’t find a natural number that solves it. \(1\) is too small, and \(2\) is too big; nothing will work.
But we can make the same move here we did for the integers. We can’t find a number that solves the equation \(3 \cdot x = 5\), so we’ll make one up. And what should we call it? We started out trying to compute \(5/3\), so maybe we should call this number \(“5/3”\). And we can generalize this: for any equation \(a \cdot x = b \), there should be a number \(b/a\) that gives a solution.
We run into two problems here: a minor one and a major one. The minor one is that we have some redundancy. If a number solves \(3 \cdot x =5\) then it also solves \(6 \cdot x = 10\). So should we call it \(5/3\) or \(10/6\)? The answer, of course, is that we treat them as two different ways of “spelling” the same number.
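Python's `fractions` module makes exactly this choice: two spellings, one number.

```python
from fractions import Fraction

# 5/3 and 10/6 are the same number, spelled two ways. Equality amounts to
# cross-multiplication (p*s == r*q), and printing normalizes to lowest terms.
assert Fraction(5, 3) == Fraction(10, 6)
assert 5 * 6 == 10 * 3
print(Fraction(10, 6))   # 5/3
```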
But the major problem comes from the weirdest integer, \(0\). We wrote down the symbol \(5/3\) and just declared it’s a number; but we can’t do the same thing for \(5/0\). There are a couple of ways of seeing the problem. One has to do with sizes: when we wanted to compute \(5 / 3\), we saw that \(1\) was too small, and \(2\) was too big. It feels reasonable to insist there must be something in between. But if we want to compute \(5/0\) we’re trying to solve \(0 \cdot x = 5\). For this, every number will be too small; we can never overshoot, so we can’t look for an in-between number.
But a more robust algebraic argument is that we already know \(0 \cdot x = 0\). The equation \(0 \cdot x =5 \) can’t have a solution, unless we want to say that \(5=0\)^{7}. And \(0 \cdot x\) has to equal \(0\), unless we want to make all our operations stop making sense. As long as we accept that \(0 = 0 + 0\), that multiplication distributes over addition, and that we can subtract the same thing from both sides of an equation,
then we can compute that
\[
\begin{aligned}
0 \cdot x & = (0+0) \cdot x \\
0 \cdot x & = 0 \cdot x + 0 \cdot x \\
0 \cdot x - 0 \cdot x & = 0 \cdot x \\
0 & = 0 \cdot x.
\end{aligned}
\]
So we can’t divide by zero. It doesn’t work; it causes way more problems than it solves. But as long as we avoid zero we’re safe, and we can define the rational numbers \( \mathbb{Q} \)^{8} to be the set of all numbers \( p/q\) where \(p\) and \(q\) are both integers, with \(q \neq 0\).
To an algebraist like myself, rational numbers are the next step after integers. But to be clear, this is wildly ahistorical. (I did call this a fictional history of numbers, after all.) Zero and negative numbers were both relatively late inventions. But the (positive) rational numbers came much earlier, because they’re much less philosophically confusing. I may not know what \(-2\) apples looks like, but \(1/2\) of an apple is what I get when I cut it into two pieces.
Now we have a strategy: we write down an equation we want to solve, and then if it doesn’t already have a solution we invent one and make up a name for it. And as good mathematicians, we want to see how far this strategy can take us. What other equations do we want to solve?
The simplest equations are linear equations: \(ax +b =0\). Those are the same as \(ax = -b\), which we looked at last section; we can solve them all with rational numbers (as long as \(a \neq 0\)).
The next simplest equations are quadratic equations, the bane of every high schooler’s existence. These equations look like
\[ ax^2 + bx +c =0. \]
And if you’ve brushed up on your high school algebra lately, you may remember they generally have two solutions.
Sometimes these solutions are simple. If we take the equation \(x^2 -4 = 0\), then that’s the same as \(x^2=4\), and thus either \(x=2\) or \(x=-2\). And sometimes we can find these solutions by factoring. For instance, we have
\[x^2 -3x+2 = (x-1)(x-2), \]
so if \(x^2-3x+2 = 0\) then either \(x=1\) or \(x=2\).
But let’s look at the equation \(x^2-2=0\). It’s a pretty innocent-looking equation, as quadratics go. And if we graph \(x^2-2\), it sure looks like that equation should have two solutions:
The graph crosses the \(x\)-axis in two places, which should represent spots where \(x^2-2=0\). But it’s a famous theorem dating back to the Pythagoreans in classical Greece that there is no rational number that satisfies this equation.
Proof:
Suppose \( (p/q)^2=2\), where \(p/q\) is a fraction in lowest terms. Then \(p^2 = 2 q^2\). Since \(2\) divides \(p^2\), we know that \(p\) must be an even number, so we can set \(r = p/2\) where \(r\) is an integer.
Then we have \(4 r^2 = 2 q^2\) and thus \(2 r^2 = q^2\). Since \(2\) divides \(q^2\), we know that \(q\) must be an even number. But then \(p\) and \(q\) are both even, so \(p/q\) isn’t a fraction in lowest terms.
That means we can’t ever write the square root of \(2\) as a ratio of integers in lowest terms—and that means we can’t write it as a ratio of integers at all.
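The proof stands on its own, but a brute-force search makes the statement vivid (a sanity check, not a proof): \(2q^2\) is never a perfect square.

```python
from math import isqrt

# Look for integers p, q with p^2 = 2 q^2, i.e. with 2 q^2 a perfect square.
hits = [q for q in range(1, 100_001) if isqrt(2 * q * q) ** 2 == 2 * q * q]
print(hits)   # [] -- no solutions, just as the proof guarantees
```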
An equation with no solution? This is a very promising nail to hit with our hammer. Let’s invent a solution to this equation, which we’ll call the square root of \(2\), and write as \(\sqrt{2}\). In fact, let’s invent two of them. We know that quadratic equations are supposed to have two solutions; and looking at the graph, it sure seems like there’s a positive solution and a negative one. So we’ll call one of them \(\sqrt{2}\), and the other one \(-\sqrt{2}\). And now we can factor our polynomial:
\[ x^2-2 = (x - \sqrt{2}) (x+\sqrt{2}). \]
In fact, we can do this for any rational number! If we have the equation \(x^2 - p/q = 0\), there might already be two rational numbers that solve it. But if not, we can define two new numbers, which we’ll call \(\pm \sqrt{p/q}\), to solve the equation.
What about other quadratic equations? We don’t actually need anything more. If you’re really caught up on your high school algebra, you may remember the quadratic formula. (If you’re not really caught up on your high school algebra, you may have vague traumatic flashbacks when you hear the phrase “quadratic formula”.) This formula says that the equation \(ax^2 +bx +c = 0\) has two solutions, given by \[ x = \frac{-b \pm \sqrt{b^2-4ac}}{2a}. \] The formula uses those new square roots we just defined, but it doesn’t need anything else. If you can solve \(x^2 - p/q=0\) for any rational \(p/q\), then you can solve \(ax^2+bx+c=0\) for any rational \(a,b,c\).
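A sketch of the formula in code, using Python's `cmath` so the square root works whether or not \(b^2-4ac\) is negative (which foreshadows the next few paragraphs):

```python
import cmath

def quadratic_roots(a, b, c):
    """Both solutions of a*x^2 + b*x + c = 0, by the quadratic formula."""
    d = cmath.sqrt(b * b - 4 * a * c)   # cmath handles negative discriminants
    return (-b + d) / (2 * a), (-b - d) / (2 * a)

r1, r2 = quadratic_roots(1, 0, -2)   # x^2 - 2 = 0: roots are +/- sqrt(2)
print(r1, r2)

i1, i2 = quadratic_roots(1, 0, 2)    # x^2 + 2 = 0: the roots are imaginary
print(i1, i2)
```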
I said that every rational number has a square root. And that might have been surprising to some of you: what about negative numbers? Those don’t have square roots, surely?
If I try to draw out some graphs, it doesn’t look like those equations should have solutions. The graph of \(x^2-2\) hits the \(x\)-axis twice, so it seems like it should equal zero for two values of \(x\). But if we graph the function \(x^2+2\) we get this curve:
And that doesn’t hit the \(x\)-axis at all! So how am I saying it has solutions?
The answer: I made them up! When I defined square roots I didn’t appeal to graphs, or the sizes of numbers, or anything like that. I just said “if we have a rational number, let’s make up two new numbers, called the square roots”. It doesn’t matter whether the number is positive or negative; I’m just as capable of writing the symbol \(\sqrt{-2}\) as I am of writing the symbol \(\sqrt{2}\).
Now of course, there is an important difference here. One of these is a real number, and the other is an imaginary number. But our current strategy doesn’t mention real numbers, or imaginary numbers, at all. We can’t even explain the difference clearly until we talk about a completely different strategy for inventing numbers, called completion, and that will have to wait.
Let’s see how far we can push this strategy. Simple equations like \(3 + x = 0\) gave us the integers; linear equations like \(3 \cdot x + 7 = 0\) gave us the rationals; and quadratic equations like \(x^2 + 5x +3 =0\) gave us square roots. What other equations can we look at?
We have two operations: addition and multiplication. The most complicated equations we can build with those operations are the polynomial equations^{9} \[ a_0 + a_1 x + a_2 x^2 + \dots + a_nx^n =0, \] where the numbers \(a_0, a_1, \dots, a_n\) are rational numbers.
Some of these equations have simple, straightforward solutions. If we want to solve
\[ x^3-6x^2+11x-6=0, \]
we can factor the polynomial to get
\[ (x-3)(x-2)(x-1) = 0, \]
which has solutions \(1\), \(2\), and \(3\).
Other polynomial equations aren’t that straightforward, but have graphs that make it clear there should be solutions somewhere:
The graph of \(x^3-3x+1\) intersects the \(x\)-axis in three places, at roughly \(-1.879, 0.347\) , and \(1.532\).
And some polynomial equations look like they shouldn’t have any solutions at all.
The graph of \(x^4-3x^2+x+4\) doesn’t intersect the \(x\)-axis at all.
But just like with square roots, we don’t need to worry about the graphs. We can just define solutions to all those equations, inventing new numbers if there aren’t numbers that already work. And if we’ve invented enough numbers that every polynomial equation with rational coefficients has a solution, we get the algebraic numbers.
The algebraic numbers include all the rational numbers, and all the square roots. They also include higher-order roots: cube roots, fourth roots, and so on. For instance, \(x^4-2=0\) has four solutions: \(\sqrt[4]{2}, i \sqrt[4]{2}, - \sqrt[4]{2}, -i \sqrt[4]{2}\). The equation \(x^8-2=0\) has eight solutions, which include \(\sqrt[8]{2}\) but also \( \frac{1}{\sqrt[8]{8}} - \frac{i}{\sqrt[8]{8}}\).
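That last expression looks unlikely, but it checks out numerically; a one-line verification:

```python
# Verify that 1/8^(1/8) - i/8^(1/8) really is an eighth root of 2.
z = 1 / 8 ** (1 / 8) - 1j / 8 ** (1 / 8)
print(z ** 8)   # 2, up to floating-point error
```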
But we also get other things. And it’s hard to describe them, because not all algebraic numbers have convenient names. We can graph the polynomial \(x^5+x+3\), and we see that the equation \(x^5 + x + 3 = 0\) should have one solution, which is approximately \(-1.133\).
The graph of \(x^5+x+3\) intersects the \(x\)-axis exactly once.
But if we ask a program like Mathematica for the exact solution, we just get told that the solution is a solution to the equation \(x^5+x+3=0\).
This is an answer to the question I asked. It is not a helpful answer.
Now, this number isn’t impossible to describe. After all, I just described it! But it doesn’t have a name in our usual notation. It’s just this awkward thing that has to exist, because we created a solution for \(x^5+x+3=0\); but all we can say about it is that it is that solution we created.
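Even without a name, we can pin the number down as precisely as we like; a bisection sketch:

```python
def f(x):
    return x ** 5 + x + 3

# f(-2) < 0 < f(-1), so the root lies between; halve the interval repeatedly.
lo, hi = -2.0, -1.0
for _ in range(60):
    mid = (lo + hi) / 2
    if f(mid) < 0:
        lo = mid
    else:
        hi = mid

print(lo)   # about -1.133
```

So we can approximate the number to any accuracy we want; we just can't write it down exactly in terms of roots and fractions.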
And with this, we seem to have exhausted our first strategy. Every equation we can build out of algebra has a solution now. We’ve completely answered that question.
But that doesn’t mean we’re done, because there are other questions—and some of those questions are very important. We can draw a picture of a circle with radius 1:
And we can try to measure the area inside this circle. Each square in the picture has area \(1/4\), so the circle must have an area bigger than \(1\), and smaller than \(4\). But what is it?
We can always just give that area a name; we chose \(\pi\), which is the Greek version of the letter \(p\) and stands for “perimeter”. But while we can invent a name for the area of this circle, we can’t relate it to the tools we’ve been discussing. There’s no polynomial equation defined with rational numbers^{10} that has \(\pi\) as a solution.
These non-algebraic numbers are called transcendental. They’re extremely important. And they’re a story for another day.
Do you want to hear about transcendental numbers? What other numbers do you want to learn about? Tweet me @ProfJayDaigle or leave a comment below.
If you’re about to bring up zero, please bear with me for a few paragraphs. ↵Return to Post
Four of them are sugar bee apples I bought over the weekend, and one is a gala apple that I bought months ago and forgot about in the back of the fridge until I went to go count my apples while writing this post. ↵Return to Post
We do the same thing when we say “apple” rather than writing out the full genome, or describing the position of every molecule, or whatever. We’re constantly abstracting without really thinking about it; doing math just makes this process of abstraction more explicit. ↵Return to Post
If you’d like to help fix that last one for me, you can donate to my Ko-Fi. ↵Return to Post
Kronecker famously said that “God created the integers; all else is the work of man”. But I think he was giving God a little bit more credit there than he deserved. The integers are complicated! ↵Return to Post
The Z stands for “Zahlen”, which is German for “numbers”. ↵Return to Post
We can do that, actually—at least sort of. Stay tuned for the rest of this series. ↵Return to Post
The Q stands for “quotient”. ↵Return to Post
Polynomials aren’t the most complicated algebraic functions; we can look at rational functions, which are one polynomial divided by another. But these don’t really give us new equations because you can just multiply both sides by the denominator to get a regular polynomial. ↵Return to Post
This caveat is important, but only for stupid reasons. \(x - \pi = 0\) is technically a polynomial equation, and \(\pi\) is definitely a solution to that equation. But that just isn’t very enlightening. ↵Return to Post
But first, I want to mention that if you want to support my writing, I now have a Ko-Fi account. Any tips would be appreciated and would help me write more essays like this.
When I teach calculus I emphasize two big ideas: differential equations, and numerical analysis.
Differential equations generalize the concept of “rate of change”, and they’re the core of why calculus is useful: you can describe the rules a system follows, encode them in math, and draw conclusions. Calculus 1 students don’t have the tools to solve differential equations, but they can—and should—understand how a sentence like “the acceleration is proportional to the displacement” relates to the equation \(y’’ = -ky\).
Numerical approximation is often the way we use calculus, and increasingly so as computers are more powerful and available. I motivate the derivative with the idea of linear approximation: if I want to pretend my function is a line, and write \(f(x) = f(a) + m (x-a)\), what number \(m\) will do the best job? This develops into other methods for approximating the answers to questions that are too hard to answer directly: it leads into ideas like quadratic approximation and Newton’s method, and provides a foundation for numerical integration and Taylor series in Calculus 2.
If we combine these two ideas, we can try to numerically approximate the solution to a differential equation. Suppose we have a differential equation \(f’(t) = f(t) - f(t)^2/2\), and we know the initial condition that \(f(0)=1\). If we want to know \(f(3)\) we can get a rough guess with a linear approximation: we know \(f(0) = 1 \) and thus that \(f’(0) = 1 - \frac{1^2}{2} = \frac{1}{2} \), so we get
\[ f(3) \approx f(0) + f’(0) (3-0) = 1 + \frac{1}{2} \cdot 3 = \frac{5}{2}. \]
That’s only a rough estimate; linear approximation generally isn’t very accurate when the starting point and ending point aren’t close together. In fact the true value is \( \frac{2e^3}{e^3+1} \approx 1.905\), which isn’t terribly far off from \(2.5 \) but isn’t especially close either. But this is the best estimate we can really get using only \(f(0)\) and \(f’(0)\).
However, we know a lot more than that, because we have a formula for \(f’(x)\). It’s a bit hard to use, because we need to know \(f(x)\) to compute \(f’(x)\); but we know we can approximate \(f(x_2)\) if we already know \(f(x_1)\) and \(f’(x_1)\). That allows us to do a recursive calculation:
\[
\begin{array}{rl}
f(1) & \approx f(0) + f’(0) (1-0)
= 1 + \left(1 - \frac{1^2}{2} \right) \cdot (1) = 3/2. \\\
f(2) & \approx f(1) + f’(1) (2-1)
\approx \frac{3}{2} + \left( \frac{3}{2} -
\frac{ \left(\frac{3}{2} \right)^2}{2} \right)\cdot (1)
= \frac{15}{8}. \\\
f(3) & \approx f(2) + f’(2) (3-2)
\approx \frac{15}{8} + \left(\frac{15}{8} -
\frac{\left(\frac{15}{8} \right)^2}{2} \right) \cdot (1) \\\
& = \frac{255}{128}.
\end{array}
\]
Thus we estimate \(f(3) \approx \frac{255}{128} \approx 1.99\).
This still isn’t an exact value for \(f(3)\); but this approximation is much better than our first try. And if this isn’t close enough, we can do even better by breaking our approximation into more steps: with six steps we get \(f(3) \approx 1.95\) and with sixty we get \(f(3) \approx 1.909\). Taking more steps takes more work, but it also gives us a more precise answer.
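This recursion is easy to mechanize. Here's a minimal Python sketch (the function name is mine) that reproduces the three-step estimate and the finer ones:

```python
def euler(f_prime, t0, y0, t_end, n):
    """Approximate y(t_end) for y' = f_prime(y) with y(t0) = y0, using n Euler steps."""
    h = (t_end - t0) / n
    y = y0
    for _ in range(n):
        y = y + f_prime(y) * h  # one linear-approximation step
    return y

fp = lambda y: y - y**2 / 2     # our differential equation f' = f - f^2/2

print(euler(fp, 0, 1, 3, 3))    # 1.9921875, i.e. 255/128
print(euler(fp, 0, 1, 3, 60))   # about 1.909, approaching the true value 1.905
```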
This approach is known as Euler’s method, and it allows us to numerically approximate the result of any first-order ordinary differential equation given an initial condition. With a little bit of work, we can generalize this to any ordinary differential equation; it’s quite straightforward and flexible.
It’s also basically just integration.
In a typical calculus course, we motivate the integral with the area problem: we have the graph of some function, and we want to find the area under that curve. We can approximate that area by chopping it up into rectangles, which gives us the Riemann sum. And then as the number of rectangles approaches infinity our approximation gets really good, which allows us to define the integral.
\[ \int_a^b f(t) \,dt = \lim_{n \to \infty} \sum_{k=1}^n f(x_k) \Delta x \]
This definition has a lot of symbols in it, and is generally intimidating to freshman calculus students. But it does accurately describe what we’re doing and why: the key idea of the integral is to break a calculation into pieces, do an approximation on each piece, and then add the results together. This will give us an approximate answer to our original question; as we use more and smaller pieces, the approximation gets better, and so in the limit we get an exact answer.
So this formula directly answers the question that we’re asking. And when we want to think about applications of the integral, the Riemann sum definition is useful: it helps us figure out what the integral is actually computing, and so what problems it can help solve. But Riemann sums are a huge pain to actually do computations with, so we generally don’t.
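For a concrete sense of what that limit is doing, here is a left-endpoint Riemann sum in Python (a sketch of the definition, with names of my choosing):

```python
def riemann_sum(f, a, b, n):
    """Left-endpoint Riemann sum approximating the integral of f over [a, b] with n rectangles."""
    dx = (b - a) / n
    return sum(f(a + k * dx) for k in range(n)) * dx

# Approximate the integral of 3t^2 over [1, 5]; the exact value is 5^3 - 1^3 = 124.
print(riemann_sum(lambda t: 3 * t**2, 1, 5, 10))    # a rough estimate
print(riemann_sum(lambda t: 3 * t**2, 1, 5, 1000))  # much closer to 124
```

Even here, the computer is doing the tedious part for us; by hand, a thousand rectangles is exactly the huge pain described above.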
Instead, we rely on the Fundamental Theorem of Calculus, which comes in two parts.
Fundamental Theorem of Calculus, Part 1:
Given a function \(f(x)\) and a number \(a\), we can define a new function \(F(x) = \int_a^x f(t) \,dt\). Then \(F’(x) = f(x)\).
Part 1 tells us that the derivative undoes the integral; the derivative of the integral of \(f\) is just \(f\). This is conceptually cool, and it does allow us to compute something. But it doesn’t directly help us compute the integral. Instead, we use it to prove^{1} a second statement.
Fundamental Theorem of Calculus, Part 2:
If \(F’(x) = f(x)\), then \(\int_a^b f(t) \,dt = F(b) - F(a)\).
This is also known as the Evaluation Theorem, or sometimes the Net Change Theorem. And it’s the tool we actually use in practice to compute integrals—to the extent that people mainly associate “integration” with finding the antiderivative \(F(x)\), not with finding the number corresponding to the area under the curve.
And this all works, but we’ve moved pretty far away from the original question, and the connections pass through some relatively abstract territory. It’s hard to really intuitively see how this calculation relates to the original question.
Maybe there’s a better way.
Let’s start by asking this question backwards. Suppose there’s some function you’re interested in, but you don’t have a formula for it. Instead you just have a formula for the derivative. In practical terms, this happens in dead reckoning: if you can’t measure where you are, but you know where you started and how fast you’re moving, you can estimate where you end up.
So suppose we know our speed \(F’(x)\), and our starting position \(F(a)\), and we want a way to figure out our current position \(F(x)\). We want to compute an antiderivative! The FTC part 2 tells us that \(F(x) = F(a) + \int_a^x F’(t) \,dt \), so we could figure this out by doing an integral. But I want to follow a different thought process.
We can start by saying, we know what \(F(a)\) is, and since we have a formula for \(F’(x)\), we can compute \(F’(a)\). Then we can use the linear approximation formula to estimate \[ F(x) \approx F(a) + F’(a) (x-a). \] So if we know, say, that \(F(1)=3\) and \(F’(x) = 3x^2\), we can estimate that \(F(5) \approx 3 + 3(5-1) = 15\).
Linear approximation gives a pretty decent estimate if \(x\) and \(a\) are close, but if they’re far apart it’s not very good. Consequently it doesn’t really work here: in reality \(F(5) = 127\).
But we can improve this exactly the same way we did before, by using Euler’s method! The problem is that the two points on my linear approximation are too far apart. But we can try to approximate the value of \(F\) somewhere closer to \(1\), like at \(3\).
\[
F(3) \approx F(1) + F’(1)(3-1) = 3 + 3(2) = 9.
\]
And then, since we also know \(F’(3) = 27\), we can estimate
\[
F(5) \approx 9 + 27(5-3) = 63.
\]
Still not right, but much better! And we can improve even further by doing more steps:
\[
\begin{array}{rl}
F(2) & \approx F(1) + F’(1)(2-1)
= 3 + 3 = 6 \\\
F(3) & \approx F(2) + F’(2)(3-2)
= 6 + 12 = 18 \\\
F(4) & \approx F(3) + F’(3)(4-3)
= 18 + 27 = 45 \\\
F(5) & \approx F(4) + F’(4)(5-4)
= 45 + 48 = 93.
\end{array}
\]
This still isn’t quite right, but it’s even closer; and as we take more and more smaller and smaller steps, we’ll get a better and better approximation.
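In code, the same dead-reckoning calculation looks like this (a Python sketch under the running example \(F(1)=3\), \(F’(x)=3x^2\); the function name is made up):

```python
def dead_reckon(f_prime, a, F_a, x, n):
    """Estimate F(x) from the starting value F(a) and the derivative formula f_prime."""
    h = (x - a) / n
    F, t = F_a, a
    for _ in range(n):
        F = F + f_prime(t) * h  # one linear-approximation step from t to t + h
        t = t + h
    return F

fp = lambda t: 3 * t**2
print(dead_reckon(fp, 1, 3, 5, 1))     # 15.0: the one-step linear approximation
print(dead_reckon(fp, 1, 3, 5, 4))     # 93.0: the four-step calculation above
print(dead_reckon(fp, 1, 3, 5, 4000))  # approaches the true value 127
```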
This is basically Euler’s method. But why is it an integral? Let’s reorganize the calculation to make it clearer what’s happening.
\[
\begin{array}{rl}
F(5)
& \approx F(4) + F’(4)(5-4) \\\
& \approx F(3) + F’(3) (4-3) + F’(4) (5-4) \\\
& \approx F(2) + F’(2) (3-2) +F’(3) (4-3) + F’(4) (5-4) \\\
& \approx F(1) + F’(1) (2-1) + F’(2) (3-2) +F’(3) (4-3) + F’(4) (5-4) \\\
& = 3 + 3 \cdot 1 + 12 \cdot 1 + 27 \cdot 1 + 48 \cdot 1 = 93.
\end{array}
\]
At this point this should be starting to look familiar. We’re taking a bunch of steps of size \(1 = \Delta x\), and for each step we’re multiplying it by the derivative at some \(x\) value. So we just computed
\[
F(5) \approx F(1) + \sum_{k=1}^4 F’ \big( 1 + (k-1) \cdot 1 \big) \cdot 1.
\]
More generally, if we take \(n\) steps of size \(\Delta x = \frac{4}{n}\), we get
\[
F(5) \approx F(1) + \sum_{k=1}^n F’\big( 1 + (k-1) \Delta x \big) \Delta x.
\]
And that’s almost exactly a Riemann sum with left-hand endpoints. In fact, it’s a Riemann sum, plus the extra term \(F(1)\). If we rearrange it we get
\[
F(5) - F(1) \approx \sum_{k=1}^n F’\big( 1 + (k-1) \Delta x \big) \Delta x.
\]
I see two ways to think about this formula. One is that the indefinite integral contains a \(+C\) term, because antiderivatives aren’t unique. So while \(\int F’(t) \,dt\) is an antiderivative of \(F’(x)\), we don’t necessarily get the same function as our original \(F(x)\). Instead, the FTC just guarantees we have \(F(x) +C\), and \(F(1)\) is just the \(+C\) term.
But the way I find clearer is that we’re really computing the change in the value of \(F\). This should make physical sense: the calculations with the speed tell us how far we’ve moved, not where we are. Thus the Euler’s method calculation tells us our displacement; but if we add that on to our starting position, we find our ending position.
Mathematically, this all works out. It’s a cute argument and I’m glad I’ve found it. But there are plenty of fun math ideas that don’t belong in a freshman calculus course.
This approach has one obvious, major disadvantage: no one else teaches it like this, so it would probably leave students confused if they go on to take another course with someone else. And that’s probably enough to make it not worth doing^{2}, on its own.
But while that’s a real obstacle to adopting this approach in one class, it’s also kind of dodging the interesting questions about whether this would be a better approach. What if we could get everyone to switch? Should we?
One problem is that this argument isn’t at all rigorous. As long as we believe that Euler’s method will converge to the right answer, then the integral will as well; but I don’t know how you’d prove that Euler’s method converges without referencing the integral, so that seems fairly circular.
That objection seems fatal to me—in an upper-division Real Analysis course. In a freshman calculus course, nothing is ever going to be fully rigorous, and the proofs involving Riemann sums especially won’t be, because getting the technical details of Riemann sums correct is hard. So I don’t mind a little non-rigor, especially if it helps students develop a clear intuitive understanding of what we’re trying to do.
In fact, having to avoid some of the abstraction involved in proving the Fundamental Theorem of Calculus might be a win, overall. That’s one of those lectures where I’m always confident my students aren’t really following the details, and are just hanging on trying to survive until we get back to computing things. On the other hand, it’s good for them to see some abstract formalism, even if they’re not ready to fully understand it yet. You have to see your first scary proof sometime!
Another problem is that this derivation captures the relationship between the Riemann sum and the antiderivative, but presents it exactly backwards. In most applications, the Riemann sum is the question we want to answer; the antiderivative is the tool we use to answer it. But the Euler’s method approach treats the antiderivative as the question, and the Riemann sum as the way we compute the answer—which is completely wrong since the Riemann sum is nearly impossible to compute outside of the simplest cases. I think this is a really deep problem with this approach. One of the big ideas I want my students to engage with is figuring out the difference between identifying a question, and computing the answer; giving it to them backwards seems like an obstacle to developing that understanding.
But I do really like the way this approach connects the integral back to the other big ideas in the class. Not just to the derivative; any presentation of the FTC will draw a link between integration and differentiation. But this makes the integral seem connected to the themes of numeric approximation and differential equations, which ties the entire course together neatly.
And really, that sums it up, I think. It’s always nice to tell a neat story that ties the whole class together. But it probably isn’t as important as making sure our students understand each piece well on its own. I have to resist the temptation to do something pretty, and elegant, and unnecessarily confusing. So this is a fun idea, but for now I’m going to teach this normally.
Do you have a clever way to motivate the integral? Do you think I should actually be using this approach in my course? Any other thoughts on teaching integration? Tweet me @ProfJayDaigle or leave a comment below.
This proof relies heavily on specific special properties of the real numbers, and in particular the property that if \(f’(x)=0\) then \(f(x)\) is constant. This isn’t true if we allow functions to be defined solely for rational numbers; the real numbers are exactly the set that makes it work. ↵Return to Post
Or at least not worth doing as the motivation to the integral. I think it’s fine to do this as a followup, or an application of the integral. If you have an extra day to spend on integration, this isn’t the worst thing you could do. But if you have extra days in your calculus syllabus please tell me how you got them. ↵Return to Post
But rather than philosophizing, I decided to get concrete about this. Can I use ChatGPT to make my job easier? It’s going to be pretty useless for the most important parts of my job. In particular, it has no way to figure out why a student is confused and address their confusions. And it’s not going to come up with insightful new ways to describe course topics. It won’t even be able to meaningfully connect distinct ideas in the course, because it has no sense of what’s already been covered.
Instead I need to find the aspects of my job that are routine, and involve following relatively standard templates and filling them out in predictable ways. I need to find tasks that it’s easy for me to check if they’re done right, since ChatGPT is not correct with any consistency. Ideally, I’d also find ways to have it replace the parts of my job that are the most annoying: I don’t want a way to avoid spending time in office hours with students, because office hours are fun!
But one thing I spend a lot of time doing, and don’t enjoy at all, is writing homework and test questions. I need to create original problems (or at least ones that aren’t in the textbook so students can’t look them up), but not too original (so they fit the patterns that my calculus students are supposed to be learning). And unlike all of the rest of my course planning, I need to do new ones every year—I can reuse my old lecture notes, but it’s not safe to reuse my old tests.
So I decided to spend some time experimenting with GPT as a test writer. Can it write good questions? Can it write usable solutions for those questions? And can it do this easily, or is shepherding it through the process more trouble than it’s worth?
But before I tell you what I found, I want to mention that if you want to support my writing, I now have a Ko-Fi account. Any tips would be appreciated and would help me write more essays like this.
Overall, the current tech seems somewhat useful, but not actually good—at least, not yet. But it’s close enough that I suspect it will get pretty good for this purpose before long.
With a couple exceptions, ChatGPT could figure out what type of question I was asking for. If I asked for a related rates problem, or an integration problem that involved integration by parts, I would get one. Sometimes they weren’t quite right, but I could get the general type of problem I asked for, with basically no prompt engineering.
On the other hand, it was hard to get specifics. I can get a big pile of integration by parts problems, but a lot of them will be either very easy or very hard. And ChatGPT gets stuck in ruts; I saw identical problems show up in response to multiple different prompts, and there were running themes in everything it output. That means the system can’t give me fine-tuned answers, and also won’t give me even coverage of the relevant types of problems.
But if I have something specific I want, I can probably just write it myself; and even if it won’t give me every type of problem, it can help remind me of my options. I found it genuinely useful for brainstorming problems, even if I didn’t use any of them exactly. (And I am at this moment proctoring a test that includes some problems I wrote with GPT assistance.)
On the other hand, the solutions it produced were usually wrong, sometimes spectacularly so. A few times I got a completely correct solution. Most of the time, I would get an answer that had the right approach but did completely nonsensical calculations in the middle; the solutions would look superficially correct, but checking them carefully turned up multiple errors. And occasionally I would get arguments almost completely unrelated to the questions I asked.
But, if anyone does figure out a way to usefully and consistently hook this up to a computer algebra system, it will probably do pretty well at solving problems too. It tended to set up the right computation and then generate a nonsense answer; if it could tell when it needs to just factor a polynomial or compute an integral, and pass that to a computer algebra system, that would fix a lot of the weaknesses.
I know multiple teams are trying to find a way to hook systems like GPT up to computational engines and computer algebra systems. If they could do that effectively, it would probably be able to write good solutions immediately, but that really sounds non-trivial to me. You could maybe teach it to pass integrals or other specific calculations to a computer algebra system, read the result, and print it out. But translating that into a well-written solution would require some sort of deep integration of the two capabilities, not just an ability to print the final answer.
But one thing did impress me about the solutions: ChatGPT could clearly consistently remember what question it was trying to answer. Every single solution ended with a clear restatement of the question and an answer to it. The answer was usually wrong but it never lost track of what it was supposed to be answering. (And that’s more than I can say for some of my students.) This mostly shows that the question is still in ChatGPT’s context window when it finishes the solution, but also that it’s still using the details of the question. And it’s doing that consistently; I don’t think it failed at this once.
ChatGPT can write perfectly valid and idiomatic LaTeX code, which I was mildly surprised by. I’m used to the fact that it can put out grammatical and idiomatic English text, but I was still impressed by the LaTeX.
Similarly, ChatGPT can draw diagrams using ASCII art. I’m really curious how much of that was in the training set; I wouldn’t have thought it was that common, but it’s clearly common enough. (It did not always accurately describe the diagram it had drawn.)
One major weakness is that ChatGPT doesn’t have a consistent style. Each solution it wrote used completely different notation and phrasing—even when I asked it to respond to the same prompt multiple times.
The rest of this post is a record of things I tried and experiments I conducted in getting ChatGPT to write calculus questions for me. I spent a while playing around, and tried to include everything here as a sort of lab notebook and complete record. If you don’t need to write calculus questions you probably don’t want to read the whole thing.
But I have hidden most of the long GPT exchanges under pulldown spoiler boxes. If you just want to skim through and see my thoughts on each experiment, you can do that; and you can expand out any box you decide you actually want to read.
Or you can jump to the end and leave me a comment.
When I teach calculus 1, my absolute least favorite thing to do is writing test questions for max/min problems. It’s simultaneously extremely routine and extremely tedious. So I decided to start there.
Prompt: Write a problem where you have to classify the critical points of a function, with a solution.
Problem:
Consider the function $f(x) = x^3 - 3x^2 + x - 1$. Determine the critical points of the function and classify them as local maxima, local minima, or inflection points.
Solution:
A critical point of a function is a point where the derivative of the function is equal to zero or is undefined. To find the critical points of the function, we first find its first derivative:
$f’(x) = 3x^2 - 6x + 1$
Setting $f’(x) = 0$, we find the critical points:
$3x^2 - 6x + 1 = 0$
$x = \frac{6 \pm \sqrt{36 - 4 \cdot 3 \cdot 1}}{2 \cdot 3} = \frac{3 \pm \sqrt{3}}{3}$
Let $c_1 = \frac{3 + \sqrt{3}}{3}$ and $c_2 = \frac{3 - \sqrt{3}}{3}$. These are the critical points of the function. To classify them, we need to find the second derivative and check the sign of the second derivative at each critical point. The second derivative of the function is:
$f’‘(x) = 6x - 6$
At $c_1$, the second derivative is $6c_1 - 6 = 6(\frac{3 + \sqrt{3}}{3}) - 6 = 6$, which is positive. This means that $f(c_1)$ is a local minimum.
At $c_2$, the second derivative is $6c_2 - 6 = 6(\frac{3 - \sqrt{3}}{3}) - 6 = -6$, which is negative. This means that $f(c_2)$ is a local maximum.
Therefore, the critical points of the function $f(x) = x^3 - 3x^2 + x - 1$ are:
$c_1 = \frac{3 + \sqrt{3}}{3}$, which is a local minimum.
$c_2 = \frac{3 - \sqrt{3}}{3}$, which is a local maximum.
Problem: Consider the function $f(x) = x^3 - 6x^2 + 9x - 2$. Classify the critical points of the function as relative maxima, relative minima, or inflection points.
Solution: To classify the critical points, we need to find the critical points and then determine their behavior using the second derivative test.
Finding critical points: To find the critical points, we need to find the points where the first derivative of the function is equal to zero or the first derivative does not exist. The first derivative of the function is given by: $f’(x) = 3x^2 - 12x + 9$
Setting $f’(x) = 0$, we find that:
$3x^2 - 12x + 9 = 0$
Solving for $x$, we get the critical points:
$x = \frac{12 \pm \sqrt{12^2 - 4 \cdot 3 \cdot 9}}{2 \cdot 3}$
$x = \frac{12 \pm \sqrt{144 - 108}}{6}$
$x = \frac{12 \pm \sqrt{36}}{6}$
$x = \frac{12 \pm 6}{6}$
$x = 3, 1$
So, the critical points are $x = 1, 3$.
Second derivative test: To determine the behavior of the critical points, we use the second derivative test. We find the second derivative of the function and evaluate it at the critical points: $f’‘(x) = 6x - 12$
$f’‘(1) = 6 \cdot 1 - 12 = -6$
$f’‘(3) = 6 \cdot 3 - 12 = 0$
Since $f’‘(1) < 0$, it means that $x = 1$ is a relative maximum. Since $f’‘(3) = 0$, it means that $x = 3$ is an inflection point.
So, the critical points are:
$x = 1$ is a relative maximum
$x = 3$ is an inflection point.
There are a couple things to notice here. First, I did no post-processing on the output at all. ChatGPT wrote perfectly formatted LaTeX code for these answers without me editing it at all.^{1}
Second is that these are both fairly simple problems. My next step is going to have to be seeing if I can get more interesting examples out of it; it went for a simple cubic both times.
Third is that the solution style is inconsistent. ChatGPT doesn’t write up solutions the way I would, which is to be expected. But it also doesn’t write them up the same way as itself. The two solutions use different notations and different phrasings. This doesn’t make it wrong, but it isn’t great as a teaching tool; it’s good for students to see a certain amount of consistency, especially as beginners.
And fourth is that despite the nice organization and formatting, both of these solutions are almost, but not quite, right:
Response 1 takes the quadratic equation \(3x^2-6x+1 =0\), correctly uses the quadratic formula, and then simplifies the solutions to \( \dfrac{3 \pm \sqrt{3}}{3}\). But the actual solutions are \(\dfrac{3 \pm \sqrt{6}}{3}\).
Then when it plugs these points into the second derivative, it incorrectly computes \(\pm 6\) when it should get \(\pm 2 \sqrt{3} \).
These are the sort of basic arithmetic errors that make me think a student understands the material, but is sloppy with arithmetic; I’d probably grade this as an 8/10.
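Checking this arithmetic is itself mechanical. A quick numeric check in Python (my sketch, using only the formulas quoted above):

```python
import math

# f(x) = x^3 - 3x^2 + x - 1, so f'(x) = 3x^2 - 6x + 1 and f''(x) = 6x - 6.
f_prime = lambda x: 3 * x**2 - 6 * x + 1

c1 = (3 + math.sqrt(6)) / 3   # the correct critical points...
c2 = (3 - math.sqrt(6)) / 3
print(f_prime(c1), f_prime(c2))       # both essentially 0

g1 = (3 + math.sqrt(3)) / 3   # ...versus GPT's claimed critical point
print(f_prime(g1))                    # -1, not 0: GPT's roots are wrong
print(6 * g1 - 6, 2 * math.sqrt(3))   # f'' at GPT's point is 2*sqrt(3), not the 6 it reported
```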
Response 2 computes the critical points correctly.
When it plugs them into the second derivative, it correctly computes \(f^{\prime \prime} (1) = -6\) but says that \(f^{\prime \prime}(3) = 6\cdot 3 - 12 = 0\), when of course \(6 \cdot 3 - 12 = 6 \).
Then it interprets this calculation wrong: it makes a key conceptual error by saying that since \(f^{\prime \prime}(3) = 0\), we know that \(3\) is an inflection point. This is kind of half-right, and I could see some mathematicians accepting that conclusion. But not all points where \(f^{\prime \prime} = 0\) are inflection points, and I wouldn’t accept that conclusion.
So we get well-formatted solutions, but they’re not consistent or reliable; I’d have to hand-check them, and probably rewrite to match my style.
But I want to see if we can get better questions out of this, so let’s go back to the well.
I wanted to get slightly more complicated functions involved.
Prompt: Write a problem where you have to classify the critical points of a rational function, with a solution.
Problem: Consider the function $f(x) = \frac{x^3-3x}{x^2-2x+1}$. Classify the critical points of the function as relative maxima, relative minima, or inflection points.
Solution: To classify the critical points, we need to find the critical points and then determine their behavior using the second derivative test.
Finding critical points: To find the critical points, we need to find the points where the first derivative of the function is equal to zero or the first derivative does not exist. The first derivative of the function is given by: $f’(x) = \frac{(x^2-1)(x^2-3)}{(x-1)^2}$
Setting $f’(x) = 0$, we find that:
$(x^2-1)(x^2-3) = 0$
Solving for $x$, we get the critical points:
$x = \pm 1, \pm\sqrt{3}$
However, $x=1$ is not a critical point since the denominator of the function is zero at $x=1$. Therefore, the critical points are $x = -\sqrt{3}, 1, \sqrt{3}$.
Second derivative test: To determine the behavior of the critical points, we use the second derivative test. We find the second derivative of the function and evaluate it at the critical points: $f’‘(x) = \frac{6x(x-1)(x+1)}{(x-1)^3}$
$f’’(-\sqrt{3}) = \frac{-6\sqrt{3}}{(2\sqrt{3})^3} = -\frac{3}{4\sqrt{3}}$
$f’‘(1)$ does not exist since the denominator of the second derivative is zero.
$f’’(\sqrt{3}) = \frac{6\sqrt{3}}{(2\sqrt{3})^3} = \frac{3}{4\sqrt{3}}$
Since $f’’(-\sqrt{3}) < 0$, it means that $x = -\sqrt{3}$ is a relative maximum. Since $f’’(\sqrt{3}) > 0$, it means that $x = \sqrt{3}$ is a relative minimum. The point $x=1$ is not a critical point since the function is undefined at that point.
So, the critical points are:
$x = -\sqrt{3}$ is a relative maximum
$x = \sqrt{3}$ is a relative minimum.
Problem: Consider the function $f(x) = \frac{x^2 - 6x + 9}{x^2 - 4}$. Classify the critical points of the function as relative maxima, relative minima, or inflection points.
Solution: To classify the critical points, we need to find the critical points and then determine their behavior using the second derivative test.
Finding critical points: To find the critical points, we need to find the points where the first derivative of the function is equal to zero or the first derivative does not exist. The first derivative of the function is given by: $f’(x) = \frac{(2x - 6)(x^2 - 4) - (x^2 - 6x + 9)(2x)}{(x^2 - 4)^2}$
Simplifying the expression, we get:
$f’(x) = \frac{-x^4 + 8x^3 - 5x^2 - 72x + 72}{(x^2 - 4)^2}$
Setting $f’(x) = 0$, we find that:
$-x^4 + 8x^3 - 5x^2 - 72x + 72 = 0$
We can factor this expression as:
$(x - 2)(x - 3)(x^2 - 3x - 12) = 0$
Solving for $x$, we get the critical points:
$x = 2, 3, \frac{3 \pm \sqrt{105}}{2}$
So, the critical points are $x = 2, 3, \frac{3 - \sqrt{105}}{2}, \frac{3 + \sqrt{105}}{2}$.
Second derivative test: To determine the behavior of the critical points, we use the second derivative test. We find the second derivative of the function and evaluate it at the critical points: $f’‘(x) = \frac{2x^5 - 30x^4 + 144x^3 - 320x^2 + 288x - 72}{(x^2 - 4)^3}$
$f’‘(2) = \frac{360}{36} = 10$
$f’‘(3) = \frac{-540}{49} < 0$
$f’‘\left(\frac{3 - \sqrt{105}}{2}\right) > 0$
$f’‘\left(\frac{3 + \sqrt{105}}{2}\right) < 0$
Since $f’‘(2) > 0$, it means that $x = 2$ is a relative minimum. Since $f’‘(3) < 0$, it means that $x = 3$ is a relative maximum. Since $f’‘\left(\frac{3 - \sqrt{105}}{2}\right) > 0$ and $f’‘\left(\frac{3 + \sqrt{105}}{2}\right) < 0$, it means that $\frac{3 - \sqrt{105}}{2}$ is an inflection point and $\frac{3 + \sqrt{105}}{2}$ is a local maximum.
So, the critical points are:
$x = 2$ is a relative minimum $x = 3$ is a relative maximum $\frac{3 - \sqrt{105}}{2}$ is an inflection point $\frac{3 + \sqrt{105}}{2}$ is a local maximum.
Both of these questions are at least moderately reasonable, but the answers are wildly off-base.
Response 1 computes the derivative incorrectly, sets it equal to zero correctly, computes the second derivative incorrectly, and plugs the critical points into it incorrectly. It does draw the correct conclusions from the results of those computations, though.
It also makes a random sign error on one of the critical points, and makes a technically incorrect claim about what qualifies as a critical point. (Points where the derivative is undefined still count as critical points, but they can’t be maxima or minima.)
Response 2 computes the derivative correctly, but simplifies it incorrectly, then sets it equal to zero incorrectly. It computes the second derivative incorrectly, and then plugs in the critical points it computed incorrectly.
It almost interprets the computations correctly, but then fails: it claims that because \(f''\left(\frac{3-\sqrt{105}}{2}\right) > 0\), that point is an inflection point, which is definitely wrong.
(Also it repeats itself at the end, but that’s not wrong, just badly written.)
Overall this doesn’t seem like an awful way to brainstorm problems, but I would have to write solutions myself and confirm the problems are reasonably solvable.
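Out of curiosity, I also checked what the derivative actually is. Here’s a minimal sympy sketch, assuming (from the quotient-rule numerator above) that the intended function was $f(x) = \frac{x^2-6x+9}{x^2-4}$:

```python
import sympy as sp

x = sp.symbols('x')
# Function inferred from the quotient-rule numerator in the response
f = (x**2 - 6*x + 9) / (x**2 - 4)

fp = sp.simplify(sp.diff(f, x))
crit = sp.solve(sp.Eq(fp, 0), x)
print(sp.factor(fp))     # numerator factors as 2*(x - 3)*(3*x - 4)
print(sorted(crit))      # [4/3, 3]
```

So the real critical points are $x = 4/3$ and $x = 3$; of the four points claimed above, only $x = 3$ survives.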
Prompt: Write a related rates problem with a solution
Problem: A spherical balloon is being inflated with air at a rate of 4 cubic feet per minute. At the instant when the radius of the balloon is 2 feet, how fast is the radius of the balloon increasing?
Solution: We can use the formula for the volume of a sphere to relate the rate of change of the volume of the balloon with the rate of change of the radius:
$V = \frac{4}{3}\pi r^3$
Taking the derivative of both sides with respect to time, we get:
$\frac{dV}{dt} = 4\pi r^2 \frac{dr}{dt}$
We are given that $\frac{dV}{dt} = 4$ cubic feet per minute, and we know that the radius of the balloon is 2 feet. So, substituting these values, we get:
$4 = 4\pi (2)^2 \frac{dr}{dt}$
Simplifying, we get:
$\frac{dr}{dt} = \frac{1}{4\pi (2)^2} = \frac{1}{16\pi}$ feet per minute
So, the radius of the balloon is increasing at a rate of $\frac{1}{16\pi}$ feet per minute when the radius of the balloon is 2 feet.
Problem: A water tank is in the shape of a cylinder with a radius of 5 feet and a height of 10 feet. Water is flowing into the tank at a rate of 3 cubic feet per minute. How fast is the water level rising when the water is 6 feet deep?
Solution:
Let’s begin by labeling the variables we need:
$r$ is the radius of the tank, which is 5 feet. $h$ is the height of the water level in the tank. $V$ is the volume of water in the tank. $t$ is time. $dV/dt$ is the rate at which water is flowing into the tank, which is 3 cubic feet per minute. We want to find the rate of change of $h$ with respect to time when $h = 6$ feet.
To solve the problem, we need to relate $h$ and $V$. The volume of a cylinder is given by the formula:
$V = \pi r^2 h$
Differentiating with respect to time, we get:
$dV/dt = \pi r^2 dh/dt$
Solving for $dh/dt$, we get:
$dh/dt = (dV/dt)/( \pi r^2)$
Now, we need to find $dV/dt$ when $h = 6$ feet. The volume of water in the tank when the water level is 6 feet deep is:
$V = \pi r^2 h = \pi (5^2)(6) = 150\pi$
Differentiating with respect to time, we get:
$dV/dt = 150\pi (dh/dt)$
Substituting $dV/dt = 3$ and $r = 5$, we get:
$3 = 150\pi (dh/dt)$
Solving for $dh/dt$, we get:
$dh/dt = 3/(150\pi) = 0.02$ feet per minute
So, the water level is rising at a rate of 0.02 feet per minute when the water is 6 feet deep.
Problem: A spherical balloon is being inflated at a rate of 10 cubic feet per minute. Find the rate at which the radius of the balloon is increasing when the diameter is 2 feet.
Solution:
Let’s start by writing down what we know:
The volume of a sphere is given by V = (4/3)πr^3, where r is the radius. We know that dV/dt = 10, since the balloon is being inflated at a rate of 10 cubic feet per minute. We want to find dr/dt when the diameter is 2 feet, which means r = 1 foot. To find the rate at which the radius of the balloon is increasing, we need to relate dV/dt and dr/dt using the chain rule. We have:
dV/dt = dV/dr * dr/dt
We can find dV/dr by taking the derivative of the volume formula with respect to r:
dV/dr = 4πr^2
Substituting the values we know, we get:
10 = (4/3)π(1^3) * dr/dt * 4π(1^2)
Simplifying the expression, we get:
dr/dt = 15/(2π)
So, the rate at which the radius of the balloon is increasing when the diameter is 2 feet is 15/(2π) feet per minute.
Response 1 is a simple and straightforward, but solid, problem. And the solution is so close to being right: it does everything correctly, and then makes a single arithmetic error at the very end.
Response 2 is also pretty simple, maybe too simple for me to really like it. The solution starts off strong, but halfway through it gets confused, starts the problem over, and does it wrong.
Response 3 is also a simple problem, basically the same as Response 1. I was impressed it kept diameter versus radius straight. The solution starts out strong, but then again goes off the rails halfway through and gives a slightly wrong answer at the end.
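For the record (my own checks, not part of the transcripts), the correct rates fall straight out of each setup:

```python
import sympy as sp

pi = sp.pi

# Response 1 (balloon): V = (4/3)*pi*r**3, dV/dt = 4, r = 2
# so dr/dt = (dV/dt) / (4*pi*r**2)
dr_dt = 4 / (4 * pi * 2**2)        # 1/(4*pi), not 1/(16*pi)

# Response 2 (cylinder): V = pi*r**2*h with r = 5 fixed, dV/dt = 3
# so dh/dt = (dV/dt) / (pi*r**2), independent of the 6-foot depth
dh_dt = 3 / (pi * 5**2)            # 3/(25*pi), about 0.038, not 0.02

# Response 3 (balloon, diameter 2 so r = 1): dr/dt = 10 / (4*pi*1**2)
dr_dt_3 = 10 / (4 * pi * 1**2)     # 5/(2*pi), not 15/(2*pi)

print(dr_dt, dh_dt, dr_dt_3)
```

So Response 1 should end with $\frac{1}{4\pi}$, Response 2 with $\frac{3}{25\pi} \approx 0.038$, and Response 3 with $\frac{5}{2\pi}$.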
At this point I wanted to see if I could prompt it into giving me more interesting problems. I also decided to play around with some prompt engineering ideas: what if I tell it what sort of thing it wants to achieve?
Prompt: Pretend you’re a brilliant professor who wants to write an interesting and challenging calculus test. Write a related rates problem involving similar triangles with a solution.
Problem: A hot air balloon is rising vertically at a rate of 5 meters per second. At a certain instant, the angle of elevation between the observer on the ground and the balloon is 60 degrees. The observer is 150 meters away from the launch point of the balloon. How fast is the distance between the observer and the balloon increasing at this instant?
Solution:
Let’s start by drawing a diagram to visualize the problem:
```
        B (balloon)
       /|
      / |
     /  |h
    /   |
   /θ   |
  /     |
 /      |
A (observer)
```
We have a right triangle ABC, where A is the observer, B is the balloon, and C is the point on the ground directly below the balloon. We know that AB = 150 meters, and the angle θ is 60 degrees. We want to find how fast the distance BC is changing when the balloon is rising at a rate of 5 meters per second.
We can see that triangles ABC and ABD are similar, so we can use the property of similar triangles to relate the sides:
AB/AC = BD/BC
Simplifying the expression, we get:
BC = (AB*BD)/AC
To find BD, we need to use trigonometry. We have:
tan θ = BD/AC
Substituting the known values, we get:
tan 60 = BD/AC
sqrt(3) = BD/AC
BD = sqrt(3) * AC
Now, we can substitute this value into the equation we derived earlier:
BC = (AB*sqrt(3)*AC)/AC
Simplifying the expression, we get:
BC = 150*sqrt(3)
Now, we need to find how fast BC is changing with respect to time. We have:
dBC/dt = d/dt (150*sqrt(3))
dBC/dt = 0 (since 150*sqrt(3) is a constant)
Therefore, the distance between the observer and the balloon is not changing at this instant, and its rate of change is zero.
Problem: A flagpole standing vertically on a hill has a height of 20 feet and is located on a slope with an inclination of 30 degrees. A person standing on the same slope 100 feet away from the base of the flagpole observes that the angle of elevation to the top of the flagpole is increasing at a rate of 0.1 radians per minute. At what rate is the person moving closer to the flagpole at the moment when the angle of elevation is 60 degrees?
Solution:
Let’s start by drawing a diagram of the situation:
```
      /|
     / |
    /  | 20 ft
   /   |
  /    |
 /     |
/-----------------\
      100 ft
```
In the diagram, the flagpole has height 20 feet and is located on a slope with inclination 30 degrees. The person is standing on the same slope 100 feet away from the base of the flagpole and observes that the angle of elevation to the top of the flagpole is increasing at a rate of 0.1 radians per minute.
Let’s define some variables:
Let x be the distance between the person and the base of the flagpole. Let y be the height of the person’s eye above the ground. Let θ be the angle of elevation to the top of the flagpole. From the diagram, we can see that the triangles formed by the person, the flagpole, and the ground are similar. Therefore, we have:
y / x = (20 + y) / (x / cos 30)
Simplifying this equation, we get:
y = 20x / (2x + sqrt(3)x)
We want to find the rate at which the person is moving closer to the flagpole at the moment when the angle of elevation is 60 degrees, which means θ = π/3 radians. We also know that dθ/dt = 0.1 radians per minute.
Using trigonometry, we can relate θ and y:
tan θ = (20 + y) / x
Taking the derivative with respect to time, we get:
sec^2 θ * dθ/dt = (dy/dt * x - y * dx/dt) / x^2
Substituting the values we know, we get:
1/3 * dθ/dt = (dy/dt * 100 - y * dx/dt) / 10000
We want to find dx/dt when θ = π/3 and dθ/dt = 0.1. We can use the equation y = 20x / (2x + sqrt(3)x) to eliminate y:
1/3 * 0.1 = (dy/dt * 100 - (20x / (2x + sqrt(3)x)) * dx/dt) / 10000
Simplifying the expression and solving for dx/dt, we get:
dx/dt = 0.1429 feet per minute
So, the person is moving closer to the flagpole at a rate of 0.1429 feet per minute when the angle of elevation is 60 degrees.
Response 1 certainly gives a much more challenging problem than the previous prompt ever did! And the fact that it can draw a diagram of the problem in ASCII art is really cool, although not terribly helpful.
But ChatGPT gets confused as soon as it starts writing the solution: it can’t keep track of which points A, B, and C we want to talk about. Then it compares the triangle ABC it sketched out to a similar triangle ABD, which doesn’t actually exist in the problem or the diagram.
From there it continues to be kind of incoherent. It has \(\tan(\theta) = BD/AC\), which should be pieces of two different triangles, so that doesn’t make much sense. And it concludes by making one of the key basic errors that calc 1 students make: plugging in values before taking the derivative, so that all the derivatives wind up being zero.
(This is also the first time I had to do editing to make the formatting work, but I’m not holding that against ChatGPT; it rendered fine in the GPT window, but my site messed around with it a bit.)
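And for reference, here’s the answer Response 1’s problem was actually looking for. With the observer fixed 150 meters from the launch point and the balloon at height $h$, the distance is $D = \sqrt{150^2 + h^2}$; a sympy sketch of my own, under that reading of the problem:

```python
import sympy as sp

t = sp.symbols('t')
h = sp.Function('h')(t)           # balloon height at time t

D = sp.sqrt(150**2 + h**2)        # observer-to-balloon distance
dD = sp.diff(D, t)                # h * h' / sqrt(150^2 + h^2)

# At the instant in question, the elevation is 60 degrees,
# so h = 150*tan(60 degrees) = 150*sqrt(3); the balloon rises at 5 m/s
rate = dD.subs(sp.Derivative(h, t), 5).subs(h, 150 * sp.sqrt(3))
print(sp.simplify(rate))          # 5*sqrt(3)/2, about 4.33 m/s
```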
Response 2 is a problem that is hard enough that I’d struggle with it. It’s interesting, but I can’t assign it!
The solution makes things worse by introducing additional considerations (like the height of the person’s eye above the ground!). It declares that three triangles are all similar, without being clear on what those triangles are, and I’m pretty sure this problem has no similar triangles at all.
After that detour it does some correct-ish work, but makes the usual algebra errors, and then falls back on the weird not-similar-triangles argument from earlier.
Okay, let’s give up on the prompt engineering, and see what happens if we just ask for a normal similar triangles problem.
Prompt: Write a related rates problem involving similar triangles with a solution.
Problem: A person standing 10 meters away from a street light observes that the angle of elevation from their eye to the top of the light is 60 degrees. The person begins to walk away from the light at a rate of 2 meters per second. How fast is the angle of elevation changing when the person is 20 meters away from the light?
Solution:
Let’s start by drawing a diagram of the situation:
```
|\
| \
|  \
|   \
|    \
|     \
|      \
|       \
|θ       \ 10 m
|         \
|          \
|           \
|____________\
```
In the diagram, the person is standing 10 meters away from the street light and observes that the angle of elevation from their eye to the top of the light is 60 degrees. The person begins to walk away from the light at a rate of 2 meters per second.
Let’s define some variables:
Let d be the distance between the person and the light. Let θ be the angle of elevation to the top of the light. From the diagram, we can see that the triangles formed by the person, the street light, and the ground are similar. Therefore, we have:
d / 10 = tan θ
Taking the derivative with respect to time, we get:
1/10 * dd/dt = sec^2 θ * dθ/dt
We want to find dθ/dt when d = 20 meters. We also know that dd/dt = 2 meters per second.
Using trigonometry, we can relate θ and d:
tan θ = d / 10
Substituting the values we know, we get:
1/10 * 2 = sec^2 θ * dθ/dt
Simplifying the expression, we get:
dθ/dt = 1/(5sec^2 θ) radians per second
We can use the equation tan θ = d / 10 to eliminate θ:
tan θ = d / 10
sec^2 θ = 1 + tan^2 θ
Substituting these values into the expression for dθ/dt, we get:
dθ/dt = 1 / (5(1 + (d/10)^2)) radians per second
Substituting d = 20, we get:
dθ/dt = 1 / (5(1 + 4)) radians per second
Simplifying the expression, we get:
dθ/dt = 1 / 25 radians per second
So, the angle of elevation is changing at a rate of 1/25 radians per second when the person is 20 meters away from the light.
And I think we’ve conclusively shown that ChatGPT doesn’t understand what similar triangles are.
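(For completeness, here’s the rate the streetlight problem actually wants. The light’s height is pinned down by the initial sight line: $H = 10\tan 60° = 10\sqrt{3}$ meters, and then $\theta = \arctan(H/d)$. A sympy sketch of my own:)

```python
import sympy as sp

t = sp.symbols('t')
d = sp.Function('d')(t)          # distance from person to light

H = 10 * sp.sqrt(3)              # light height: 10*tan(60 degrees)
theta = sp.atan(H / d)           # angle of elevation

# Person walks away at 2 m/s; evaluate at d = 20
rate = sp.diff(theta, t).subs(sp.Derivative(d, t), 2).subs(d, 20)
print(sp.simplify(rate))         # -sqrt(3)/35, about -0.05 rad/s (decreasing)
```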
At this point I’m pretty convinced that ChatGPT won’t write up good solutions. What if I just ask for problems?^{2}
Prompt: Write me some related rates problems involving distance.
Sure! Here are a few related rates problems involving distance:
Two cars are driving towards each other on a straight road. Car A is traveling at 60 mph, and car B is traveling at 40 mph. If the distance between them is 1 mile, how fast are they getting closer to each other?
A ladder is leaning against a wall. The base of the ladder is 6 feet from the wall, and the top of the ladder is 10 feet up the wall. If the bottom of the ladder is being pulled away from the wall at a rate of 2 feet per second, how fast is the top of the ladder sliding down the wall when the base is 8 feet from the wall?
A person is standing at the edge of a cliff, looking out at the ocean. The cliff is 100 feet high. If the person’s line of sight makes an angle of 30 degrees with the horizontal, how far out from the cliff is the person looking?
A person is walking away from a lamppost at a rate of 3 feet per second. The person is initially 10 feet from the lamppost. How fast is the person’s distance from the lamppost increasing after 5 seconds?
A ball is thrown straight up into the air from a height of 5 feet, with an initial velocity of 20 feet per second. How long does it take for the ball to reach a height of 25 feet, and what is the ball’s velocity at that point?
The second suggestion is a standard but perfectly good related rates problem, but the others don’t fare nearly so well. They did all involve distance somehow, though!
The last one was the most interesting to me, in terms of how I related to it. You can plausibly do the problem if you assume acceleration under earth gravity. And I spent a bit of time trying to figure out if ChatGPT intended me to do that, and thinking about how I could get it to tell me—before realizing that asking about ChatGPT’s “intentions” is misunderstanding it. I could ask more questions and it could make a decision, but it doesn’t have a stable “what I meant” that I could try to probe.
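(Amusingly, if you do assume earth gravity of about 32 ft/s², the ball tops out well short of 25 feet, so the problem as stated has no answer at all. A quick check, my addition:)

```python
import sympy as sp

t = sp.symbols('t', positive=True)

# Assuming g = 32 ft/s^2: initial height 5 ft, initial velocity 20 ft/s up
h = 5 + 20*t - 16*t**2

t_peak = sp.solve(sp.diff(h, t), t)[0]   # 5/8 of a second
print(h.subs(t, t_peak))                 # 45/4 = 11.25 ft, far below 25 ft
```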
Let’s try again.
Sure, here are some related rates problems involving distance:
Two cars are traveling towards each other on a straight road. Car A is traveling at 60 mph and car B is traveling at 40 mph. If the distance between the two cars is initially 500 miles, how fast is the distance between them decreasing?
A ladder 10 feet long rests against a vertical wall. If the bottom of the ladder slides away from the wall at a rate of 2 feet per second, how fast is the top of the ladder sliding down the wall when the bottom of the ladder is 6 feet from the wall?
A person is walking away from a lamppost at a rate of 4 feet per second. The height of the lamppost is 20 feet. How fast is the person’s shadow lengthening when they are 25 feet from the lamppost?
A ball is thrown straight up into the air with an initial velocity of 40 feet per second. The height of the ball, h, in feet after t seconds is given by h = 40t - 16t^2. How fast is the ball moving when it reaches a height of 30 feet?
Two boats are approaching each other in a straight line. One boat is traveling at a speed of 10 mph and the other is traveling at a speed of 15 mph. The boats are initially 1000 feet apart. How fast is the distance between the boats decreasing when they are 500 feet apart?
Surprisingly, the first two questions in response 2 are almost identical to the first two questions in response 1. This go-around, the third question is also a good, standard related rates problem; the fourth is a derivative word problem that doesn’t involve related rates; and the fifth is essentially the same as the first.
My conclusion: this isn’t useless, actually! With like two minutes of effort I got three usable problems. And it’s not that I couldn’t have written these problems myself, but it’s easy for me to get into decision paralysis while question-writing. This certainly fixes the brainstorming problem, and gives me raw material to shape into what I want.
But I’m going to have to write up my own solutions.
Or do I? In the previous prompts, it failed to write good solutions while writing problems. But I was curious what would happen if I just gave it a question on its own to solve. Would that work any better?
Let’s start by drawing a diagram:
```
|\
| \
|  \
|   \
|    \
|     \
|      \
|       \
|h       \
|         \
|          \
|           \
|            \
|             \
--------------
       d
```
Here, the ladder is represented by the diagonal line, with the base of the ladder represented by “d” and the height of the ladder represented by “h”. We want to find how fast the top of the ladder is sliding down the wall, which is represented by the rate of change of “h”, as the base of the ladder is being pulled away from the wall.
We can see that the distance between the top of the ladder and the wall, which we’ll call “x”, is proportional to the height of the ladder and inversely proportional to the distance between the base of the ladder and the wall:
x/d = h/10
Taking the derivative of both sides with respect to time t, we get:
(dx/dt)/d - x/(d^2)(dd/dt) = (dh/dt)/10
We want to find (dh/dt) when d = 8 feet. We are given that the base of the ladder is being pulled away from the wall at a rate of 2 feet per second, so dd/dt = 2.
When d = 8, we can use the proportion above to find x:
x/8 = 10/h
x = 80/h
We also know that d^2 + h^2 = 10^2, since the ladder is 10 feet long. When d = 6, we can use this equation to find h:
6^2 + h^2 = 10^2
h^2 = 64
h = 8
So when the base of the ladder is 6 feet from the wall, the ladder reaches a height of 8 feet on the wall.
Now we can plug in the values we have into the related rates equation:
(dx/dt)/d - x/(d^2)(dd/dt) = (dh/dt)/10
(dx/dt)/8 - (80/h)/(8^2)(2) = (dh/dt)/10
We need to find (dh/dt) when d = 8 and h = 6. We can use the equation d^2 + h^2 = 10^2 to find h when d = 8:
8^2 + h^2 = 10^2
h^2 = 36
h = 6
Now we can plug in the values we have:
(dx/dt)/8 - (80/6)/(8^2)(2) = (dh/dt)/10
(dx/dt)/8 - 10/192 = (dh/dt)/10
We want to find (dh/dt), so we can isolate it on one side:
(dh/dt)/10 = (dx/dt)/8 - 10/192
(dh/dt) = (dx/dt)/0.8 - 1.25
We know that dx/dt = 2 when the base of the ladder is being pulled away from the wall at a rate of 2 feet per second. So when d = 8 and h = 6, we have:
(dh/dt) = (2)/0.8 - 1.25
(dh/dt) = 0.25
Therefore, when the base of the ladder is 8 feet from the wall, the top of the ladder is sliding down the wall
Let’s draw a diagram to represent the situation:
```
|\
| \
|  \
|   \
|    \
|     \
|      \
|       \
|        \
|         \
|h         \
|           \
------------
      d
```
Here, the lamppost is represented by the vertical line on the left, and the person is walking away from it to the right. The distance between the person and the lamppost is represented by “d”, and the height of the lamppost is represented by “h”. We want to find how fast the person’s shadow is lengthening, which is represented by the length of the line segment from the person’s feet to the base of the lamppost, as the person walks away.
We can see that the height of the person’s shadow, which we’ll call “s”, is proportional to the height of the lamppost and inversely proportional to the distance between the person and the lamppost:
s/d = h/20
Taking the derivative of both sides with respect to time t, we get:
(ds/dt)/d - s/(d^2)(dd/dt) = (dh/dt)/20
We want to find (ds/dt) when d = 25 feet. We are given that the person is walking away from the lamppost at a rate of 4 feet per second, so dd/dt = 4. We are also given that the height of the lamppost is 20 feet, so dh/dt = 0.
When d = 25, we can use the proportion above to find s:
s/25 = 20/20
s = 25
So when the person is 25 feet from the lamppost, their shadow is 25 feet long.
Now we can plug in the values we have into the related rates equation:
(ds/dt)/d - s/(d^2)(dd/dt) = (dh/dt)/20
(ds/dt)/25 - 25/(25^2)(4) = 0
(ds/dt)/25 = 0.005
ds/dt = 0.005 * 25
ds/dt = 0.125
Therefore, when the person is 25 feet from the lamppost, their shadow is lengthening at a rate of 0.125 feet per second.
We can use the Pythagorean theorem to relate the distance of the ladder from the wall to the height it reaches on the wall. Let’s call the distance the bottom of the ladder is from the wall “x” and the height the ladder reaches on the wall “y”. Then we have:
x^2 + y^2 = 10^2
We can take the derivative of both sides of this equation with respect to time t to get the related rates:
2x(dx/dt) + 2y(dy/dt) = 0
We are given that the bottom of the ladder is sliding away from the wall at a rate of 2 feet per second, so dx/dt = 2. We want to find how fast the top of the ladder is sliding down the wall, so we need to solve for dy/dt.
When the bottom of the ladder is 6 feet from the wall, we can use the Pythagorean theorem to find the corresponding height on the wall:
6^2 + y^2 = 10^2
y^2 = 64
y = 8
So when the bottom of the ladder is 6 feet from the wall, the ladder is reaching a height of 8 feet on the wall.
Now we can plug in the values we have into the related rates equation:
2x(dx/dt) + 2y(dy/dt) = 0
2(6)(2) + 2(8)(dy/dt) = 0
12 + 16(dy/dt) = 0
16(dy/dt) = -12
dy/dt = -0.75
Therefore, when the bottom of the ladder is 6 feet from the wall, the top of the ladder is sliding down the wall at a rate of 0.75 feet per second.
The first two solutions are complete nonsense. The third one comes very close: the Pythagorean setup and the implicit differentiation are both right, but it evaluates 2(6)(2) as 12 instead of 24, so the final answer should be 1.5 feet per second rather than 0.75. That’s still by far the best performance I’ve seen in these experiments, so it’s worth noting, but it’s nowhere near reliable enough to be useful.
At some point I remembered that I actually had to write a test to give to my calc 2 class. So maybe I should see if I can write some questions for that.
Evaluate $\displaystyle\int \frac{\sqrt{9-x^2}}{x^2} \, dx$ using the substitution $x=3\sin\theta$.
Evaluate $\displaystyle\int \frac{1}{x^2\sqrt{x^2+4}} \, dx$ using the substitution $x=2\tan\theta$.
Evaluate $\displaystyle\int \frac{1}{x\sqrt{x^2-1}} \, dx$ using the substitution $x=\sec\theta$.
Evaluate $\displaystyle\int \frac{x^3}{\sqrt{1-x^2}} \, dx$ using the substitution $x=\sin\theta$.
Evaluate $\displaystyle\int \frac{dx}{\sqrt{x^2+9}\left(x^2+4\right)}$ using the substitution $x=3\tan\theta$.
Evaluate $\displaystyle\int \frac{x^3}{\sqrt{x^2+1}} \, dx$ using the substitution $x=\tan\theta$.
Evaluate $\displaystyle\int \frac{1}{x^4\sqrt{4x^2-1}} \, dx$ using the substitution $x=\frac{1}{2}\sec\theta$.
Evaluate $\displaystyle\int \frac{dx}{\sqrt{25-x^2}\left(x^2+16\right)}$ using the substitution $x=5\sin\theta$.
Evaluate $\displaystyle\int \frac{x^3}{\sqrt{4x^2-1}} \, dx$ using the substitution $x=\frac{1}{2}\sec\theta$.
Evaluate $\displaystyle\int \frac{\sqrt{16-x^2}}{x^2} \, dx$ using the substitution $x=4\sin\theta$.
Let $x=3\sin\theta$. Then $dx=3\cos\theta\, d\theta$, and $\sqrt{9-x^2}=\sqrt{9-9\sin^2\theta}=3\cos\theta$. Substituting these into the integral, we get
\[
\begin{array}{rl}
\int \frac{\sqrt{9-x^2}}{x^2} \, dx &= \int \frac{3\cos\theta}{9\sin^2\theta} \cdot 3\cos\theta\, d\theta \\
&= \frac{1}{3} \int \csc^2\theta \, d\theta \\
&= -\frac{1}{3}\cot\theta + C \\
&= -\frac{1}{3}\cot(\sin^{-1}(x/3)) + C.
\end{array}
\]
Let $x=2\tan\theta$. Then $dx=2\sec^2\theta\, d\theta$, and $\sqrt{x^2+4}=\sqrt{4\tan^2\theta+4}=2\sec\theta$. Substituting these into the integral, we get
\[
\begin{array}{rl}
\int \frac{1}{x^2\sqrt{x^2+4}} \, dx &= \int \frac{1}{4\tan^2\theta \cdot 2\sec\theta} \cdot 2\sec^2\theta\, d\theta \\
&= \frac{1}{2} \int \cos\theta \cdot \sec^3\theta \, d\theta \\
&= \frac{1}{2} \int \frac{\cos\theta}{\cos^4\theta} \, d\theta \\
&= \frac{1}{2} \int \frac{\cos\theta}{(1-\sin^2\theta)^2} \, d\theta \\
&= -\frac{1}{2} \frac{\cos\theta}{1-\sin^2\theta} + \frac{1}{2} \int \frac{1}{1-\sin^2\theta} \, d\theta \\
&= -\frac{1}{2}\frac{\cos\theta}{1-\sin^2\theta} + \frac{1}{4}\ln\left|\frac{1+\sin\theta}{1-\sin\theta}\right| + C \\
&= -\frac{1}{2}\frac{x}{\sqrt{x^2+4}} + \frac{1}{4}\ln\left|\frac{x+\sqrt{x^2+4}}{2}\right| + C.
\end{array}
\]
Let $x=\sec\theta$. Then $dx=\sec\theta\tan\theta\, d\theta$, and $\sqrt{x^2-1}=\sqrt{\sec^2\theta-1}=\tan\theta$. Substituting these into the integral, we get
\[
\begin{array}{rl}
\int \frac{1}{x\sqrt{x^2-1}} \, dx &= \int \frac{\sec\theta\tan\theta}{\tan\theta} \, d\theta \\
&= \int \sec\theta \, d\theta \\
&= \ln|\sec\theta + \tan\theta| + C \\
&= \ln|x+\sqrt{x^2-1}| + C.
\end{array}
\]
First off, this obviously only answers three questions. I think it was hitting a character limit in the answer. It actually cut itself off right before finishing the third one: it stopped with the “\end” at the end of the array environment. But I decided to give it credit for that and finished off the environment syntax for it.
This was also the first time I had to do any other meaningful LaTeX editing. There are places where it should have produced a “\” or a “\,” and instead dropped the backslash entirely; I assume that’s some issue with the way it processes text. I went ahead and fixed that but wanted to make a note of it. (I also had to do some extra editing to make it render correctly on this site, but that says more about Markdown and MathJax than about ChatGPT.)
The substance of the answers is also interesting. Each of them does the substitution right, then screws up the algebra in the middle and gives a completely wrong answer at the end.
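For reference, the first integral should come out to $-\frac{\sqrt{9-x^2}}{x} - \sin^{-1}\left(\frac{x}{3}\right) + C$. Differentiating confirms it, which is also a quick way to grade answers like these with sympy:

```python
import sympy as sp

x = sp.symbols('x', positive=True)

integrand = sp.sqrt(9 - x**2) / x**2

# The textbook answer via the substitution x = 3*sin(theta)
answer = -sp.sqrt(9 - x**2)/x - sp.asin(x/3)

# Differentiating the answer recovers the integrand, so it checks out
assert sp.simplify(sp.diff(answer, x) - integrand) == 0
print("checks out")
```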
But the problems are pretty good!
Let’s try some other integral approaches I need.
Prompt: Write an integration by parts problem and solve it.
Sure, here’s an example:
Find $\displaystyle\int x^2\ln x \, dx$.
To solve this integral, we will use integration by parts with $u=\ln x$ and $dv=x^2\,dx$. Then $du=\frac{1}{x}\,dx$ and $v=\frac{1}{3}x^3$, so we have
\[
\begin{array}{rl}
\int x^2\ln x \, dx &= \frac{1}{3}x^3\ln x - \int \frac{1}{3}x^2 \, dx \\
&= \frac{1}{3}x^3\ln x - \frac{1}{9}x^3 + C,
\end{array}
\]
where $C$ is the constant of integration. Therefore, we have
\(\int x^2\ln x \, dx = \frac{1}{3}x^3\ln x - \frac{1}{9}x^3 + C.\)
Sure! Here’s a problem and its solution using integration by parts:
Problem: Evaluate $\int x^2\cos x\, dx$.
Solution: To solve this integral, we use integration by parts with $u=x^2$ and $dv=\cos x\, dx$. Then $du=2x\, dx$ and $v=\sin x$, so we have
\[
\begin{array}{rl}
\int x^2\cos x\, dx &= x^2\sin x - \int 2x\sin x\, dx \\
&= x^2\sin x + 2\int x\sin x\, dx \\
&= x^2\sin x + 2\left(x\cos x - \int \cos x\, dx\right) \\
&= x^2\sin x + 2x\cos x - 2\sin x + C \\
&= \boxed{x^2\sin x + 2x\cos x - 2\sin x + C}.
\end{array}
\]
And it can basically do this! The problems are standard. The first solution looks completely right; the second makes two sign errors, which cancel each other out. (It even boxed the answer in the second example for me, which I find fascinating.)
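Both final answers are easy to verify with sympy, which makes a nice quick grading workflow:

```python
import sympy as sp

x = sp.symbols('x', positive=True)

# First problem: the answer matches exactly
r1 = sp.integrate(x**2 * sp.log(x), x)
assert sp.simplify(r1 - (x**3*sp.log(x)/3 - x**3/9)) == 0

# Second problem: also matches, despite the cancelling sign errors en route
r2 = sp.integrate(x**2 * sp.cos(x), x)
assert sp.simplify(r2 - (x**2*sp.sin(x) + 2*x*sp.cos(x) - 2*sp.sin(x))) == 0
print("both answers verified")
```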
I wanted to see if I could get something a little more challenging.
Sure, here is a hard integration by parts problem:
Problem: Evaluate $\int x^2 e^{-x} \cos x\, dx$.
Solution: To solve this integral, we use integration by parts twice. First, we choose $u=x^2$ and $dv=e^{-x} \cos x\, dx$, so $du=2x\, dx$ and $v=e^{-x} \sin x$. Then we have
\[
\begin{array}{rl}
\int x^2 e^{-x} \cos x\, dx &= x^2 e^{-x} \sin x - \int 2xe^{-x}\sin x\, dx \\
&= x^2 e^{-x} \sin x - 2\left(-xe^{-x}\cos x - \int -e^{-x}\cos x\, dx\right) \\
&= x^2 e^{-x} \sin x + 2xe^{-x}\cos x - 2\left(-e^{-x}\sin x - \int e^{-x}\sin x\, dx\right) \\
&= x^2 e^{-x} \sin x + 2xe^{-x}\cos x + 2e^{-x}\sin x - 2\left(-e^{-x}\cos x - \int e^{-x}\cos x\, dx\right) \\
&= x^2 e^{-x} \sin x + 2xe^{-x}\cos x + 2e^{-x}\sin x + 2e^{-x}\cos x - 2\int e^{-x}\cos x\, dx \\
&= x^2 e^{-x} \sin x + 2xe^{-x}\cos x + 2e^{-x}\sin x + 2e^{-x}\cos x - 2\left(e^{-x}\sin x + \int e^{-x}\sin x\, dx\right) \\
&= x^2 e^{-x} \sin x + 2xe^{-x}\cos x + 2e^{-x}\sin x + 2e^{-x}\cos x - 2\left(-e^{-x}\cos x + \int -e^{-x}\cos x\, dx\right) \\
&= x^2 e^{-x} \sin x + 2xe^{-x}\cos x + 2e^{-x}\sin x + 2e^{-x}\cos x + 2e^{-x}\cos x + C \\
&= \boxed{x^2 e^{-x} \sin x + 4xe^{-x}\cos x + 4e^{-x}\sin x + C}.
\end{array}
\]
That’s an interesting problem, although too hard for my test. This time it doesn’t get the answer right, sadly. (It screws up on the first step!)
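The slip is easy to pin down: the very first step takes $v = e^{-x}\sin x$ as an antiderivative of $e^{-x}\cos x$, which differentiation immediately refutes. A quick sympy check (my addition, not part of the transcript):

```python
import sympy as sp

x = sp.symbols('x')
target = sp.exp(-x) * sp.cos(x)

# The solution's first step claims v = e^(-x)*sin(x) is an antiderivative
# of e^(-x)*cos(x); differentiating shows it isn't:
v_claimed = sp.exp(-x) * sp.sin(x)
assert sp.simplify(sp.diff(v_claimed, x) - target) != 0

# A correct choice (from the standard formula) is v = e^(-x)(sin x - cos x)/2:
v_correct = sp.exp(-x) * (sp.sin(x) - sp.cos(x)) / 2
assert sp.simplify(sp.diff(v_correct, x) - target) == 0
print("first step refuted; corrected v verified")
```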
I want to see if I can get it to generate problems that have coefficients on the variables.
Here’s an integration by parts problem that has coefficients on the x:
Problem: Evaluate $\int x^2\ln x\, dx$.
Solution: We can use integration by parts with $u = \ln x$ and $dv = x^2\, dx$, so $du = \frac{1}{x}\, dx$ and $v = \frac{x^3}{3}$. Then we have
\[
\begin{array}{rl}
\int x^2\ln x\, dx &= \frac{x^3\ln x}{3} - \int \frac{x^3}{3} \cdot \frac{1}{x}\, dx \\
&= \frac{x^3\ln x}{3} - \frac{1}{3} \int x^2\, dx \\
&= \frac{x^3\ln x}{3} - \frac{x^3}{9} + C \\
&= \boxed{\frac{x^3}{9}(3\ln x - 1) + C}.
\end{array}
\]
Wait a minute—it wrote that problem for me already! It’s still a solid problem, and it still got the solution right, but it’s not what I wanted.
Here’s an integration by parts problem where the integrand has scalar coefficients on the x terms:
Problem: Evaluate $\int x e^{-x} \sin x\, dx$.
Solution: We can use integration by parts with $u = \sin x$ and $dv = x e^{-x}\, dx$, so $du = \cos x\, dx$ and $v = -xe^{-x} - e^{-x}$. Then we have
\[
\begin{array}{rl}
\int x e^{-x} \sin x\, dx &= -x e^{-x} \sin x + \int (\cos x - e^{-x})\, dx \\
&= -x e^{-x} \sin x + \sin x + e^{-x} + C \\
&= \boxed{(1-x)e^{-x}\sin x + e^{-x}\cos x + C}.
\end{array}
\]
Still doesn’t have the coefficients I want, and it does the same thing it did for the “hard” problem I asked for: it asks a really challenging problem, then completely fails to compute it properly.
If I ask it to generate a bunch, will it give me what I want?
Sure, here are ten integration by parts problems:
Note: Some of these integrals might be challenging to solve, so you may want to use different techniques (such as substitution) in addition to integration by parts.
This is a basically reasonable collection of problems, but literally zero of them have scalars.
I wonder if it’s stuck? I’ll start a clean chat rather than continuing the previous conversation:
Basically the same, although 7 isn’t an integration by parts problem, and I don’t think 9 is really doable in a standard calculus course.
But I know what I want; if I write the problem can it solve it for me?
We can use integration by parts with $u=x$ and $dv=\sin^2(3x) \, dx$. Then $du=dx$ and we can use the double angle formula $\sin^2(3x) = \frac{1}{2}(1 - \cos(6x))$ to get
\[
\begin{array}{rl}
\int x \sin^2(3x) \, dx &= \frac{1}{2} \int x(1-\cos(6x)) \, dx \\
&= \frac{1}{2}\left(\int x \, dx - \int x \cos(6x) \, dx\right) \\
&= \frac{1}{2}\left(\frac{1}{2}x^2 - \frac{1}{6}x\sin(6x) - \frac{1}{36}\cos(6x) \right) + C,
\end{array}
\]
where $C$ is the constant of integration. Therefore,
\[\int x \sin^2(3x) \, dx = \frac{1}{4}x^2 - \frac{1}{12}x\sin(6x) - \frac{1}{72}\cos(6x) + C.\]

We can use integration by parts with $u = \ln(5x)$ and $dv = x \, dx$. Then, $du = \frac{1}{x} \, dx$ and $v = \frac{1}{2}x^2$. Therefore, we have
\[
\begin{array}{rl}
\int x \ln(5x) \, dx &= \frac{1}{2}x^2\ln(5x) - \int \frac{1}{2}x^2 \cdot \frac{1}{x} \, dx \\
&= \frac{1}{2}x^2\ln(5x) - \frac{1}{4}x^2 + C,
\end{array}
\]
where $C$ is the constant of integration. Therefore,
\[\int x \ln(5x) \, dx = \frac{1}{2}x^2\ln(5x) - \frac{1}{4}x^2 + C.\]

We can use integration by parts with $u=x$ and $dv=e^{-4x} \, dx$. Then, $du=dx$ and $v=-\frac{1}{4}e^{-4x}$. Therefore, we have
\[
\begin{array}{rl}
\int xe^{-4x} \, dx &= -\frac{1}{4}xe^{-4x} + \frac{1}{4}\int e^{-4x} \, dx \\\
&= -\frac{1}{4}xe^{-4x} - \frac{1}{16}e^{-4x} + C,
\end{array}
\]
where $C$ is the constant of integration. Therefore,
\[\int xe^{-4x} \, dx = -\frac{1}{4}xe^{-4x} - \frac{1}{16}e^{-4x} + C.\]Does pretty well with that!
But let’s do a test with my favorite kind of integration by parts problem.
We can use integration by parts with $u = \sin(3x)$ and $dv = e^{-4x} \, dx$. Then, $du = 3\cos(3x) \, dx$ and $v = -\frac{1}{4}e^{-4x}$. Therefore, we have
\[
\begin{array}{rl}
\int \sin(3x) e^{-4x} \, dx &= -\frac{1}{4}\sin(3x) e^{-4x} + \frac{3}{4}\int \cos(3x) e^{-4x} \, dx \\\
&= -\frac{1}{4}\sin(3x) e^{-4x} + \frac{3}{16}\cos(3x) e^{-4x} - \frac{9}{16}\int \sin(3x) e^{-4x} \, dx.
\end{array}
\]
Rearranging terms, we get
\[
\begin{array}{rl}
\int \sin(3x) e^{-4x} \, dx &= \frac{3}{25}\left(-\frac{1}{4}\sin(3x) e^{-4x} + \frac{3}{16}\cos(3x) e^{-4x}\right) + C \\\
&= -\frac{3}{100}\sin(3x) e^{-4x} + \frac{9}{400}\cos(3x) e^{-4x} + C,
\end{array}
\]
where $C$ is the constant of integration. Therefore,
\[\int \sin(3x) e^{-4x} \, dx = -\frac{3}{100}\sin(3x) e^{-4x} + \frac{9}{400}\cos(3x) e^{-4x} + C.\]It got the right basic idea, but made a couple of minor mistakes. (It also didn’t show work for what’s probably the trickiest step, so it’s not a great template to follow.)
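For the record, carrying the two integrations by parts through carefully gives \(\int \sin(3x)e^{-4x}\,dx = -\frac{1}{25}e^{-4x}\left(4\sin(3x)+3\cos(3x)\right) + C\). Here's a quick numerical sanity check in Python, differentiating the claimed antiderivative and comparing:

```python
import math

# Claimed antiderivative: F(x) = -(1/25) e^{-4x} (4 sin 3x + 3 cos 3x).
# If it's right, F'(x) should equal sin(3x) e^{-4x} everywhere; check
# with central differences at a few points.
def F(x):
    return -math.exp(-4 * x) * (4 * math.sin(3 * x) + 3 * math.cos(3 * x)) / 25

def f(x):
    return math.sin(3 * x) * math.exp(-4 * x)

h = 1e-6
for x in (0.0, 0.5, 1.2):
    assert abs((F(x + h) - F(x - h)) / (2 * h) - f(x)) < 1e-6
print("antiderivative checks out")
```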
Using the identity $\cos(-2x) = \cos(2x)$, we can write
\[
\begin{array}{rl}
\int \cos(-2x) e^{3x} \, dx &= \int \cos(2x) e^{3x} \, dx \\\
&= \frac{1}{5}\cos(2x) e^{3x} + \frac{2}{5}\int \sin(2x) e^{3x} \, dx.
\end{array}
\]
We can then use integration by parts with $u = \sin(2x)$ and $dv = e^{3x} \, dx$. Then, $du = 2\cos(2x) \, dx$ and $v = \frac{1}{3}e^{3x}$. Therefore, we have
\[ \begin{array}{rl} \int \cos(-2x) e^{3x} \, dx &= \frac{1}{5}\cos(2x) e^{3x} + \frac{4}{15}\sin(2x) e^{3x} - \frac{8}{15}\int \cos(2x) e^{3x} \, dx. \end{array} \]
Rearranging terms, we get
\[\begin{array}{rl} \int \cos(-2x) e^{3x} \, dx &= \frac{1}{17}\cos(2x) e^{3x} + \frac{4}{51}\sin(2x) e^{3x} + C, \end{array} \]
where $C$ is the constant of integration. Therefore,
\[\int \cos(-2x) e^{3x} \, dx = \frac{1}{17}\cos(2x) e^{3x} + \frac{4}{51}\sin(2x) e^{3x} + C.\]A very good catch that \(\cos(x)\) is an even function, but it doesn’t handle the actual integration very well. It doesn’t show work for the first integration by parts, and in fact gets it wrong. And then it keeps making more mistakes at every step.
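For the record, the correct answer is \(\int \cos(2x)e^{3x}\,dx = \frac{1}{13}e^{3x}\left(3\cos(2x) + 2\sin(2x)\right) + C\). One slick way to see this is via complex exponentials, which is also easy to check in Python:

```python
import cmath
import math

# cos(2x) e^{3x} is the real part of e^{(3+2i)x}, so an antiderivative is
# Re( e^{(3+2i)x} / (3+2i) ) = e^{3x} (3 cos 2x + 2 sin 2x) / 13.
z = 3 + 2j

def F(x):
    return (cmath.exp(z * x) / z).real

# Compare against the closed form at a few points.
for x in (0.0, 0.4, 1.1):
    closed_form = math.exp(3 * x) * (3 * math.cos(2 * x) + 2 * math.sin(2 * x)) / 13
    assert abs(F(x) - closed_form) < 1e-9
print("closed form confirmed")
```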
So like with the related rates problems, this is pretty good at brainstorming questions, but not so great at solving them.
Let’s try some partial fractions integrals.
Sure! Here’s an example:
Evaluate the integral \(\int \frac{4x^3 - 7x^2 + 2x - 1}{x^4 - 5x^3 + 4x^2 + 4x - 4} \, dx.\)
To solve this integral, we first need to factor the denominator: \(x^4 - 5x^3 + 4x^2 + 4x - 4 = (x-1)(x-2)(x^2+1).\)
Since $x^2+1$ is an irreducible quadratic over the real numbers, we cannot factor it any further. However, we can use partial fractions to write the integrand as a sum of simpler fractions:
\[\frac{4x^3 - 7x^2 + 2x - 1}{x^4 - 5x^3 + 4x^2 + 4x - 4} = \frac{A}{x-1} + \frac{B}{x-2} + \frac{Cx+D}{x^2+1},\]where $A$, $B$, $C$, and $D$ are constants to be determined. To find these constants, we can multiply both sides by the denominator and then equate coefficients of like terms:
\[
\begin{array}{rl}
4x^3 - 7x^2 + 2x - 1 &= A(x-2)(x^2+1) + B(x-1)(x^2+1) + (Cx+D)(x-1)(x-2) \\\
&= (A+B+C)x^3 + (-2A-2B+C+D)x^2 + (A+B-C)x + (-2A+2B+D).
\end{array}
\]
Equating coefficients of like terms, we get the system of equations
\[
\begin{array}{rl}
A + B + C &= 4 \\\
-2A - 2B + C + D &= -7 \\\
A + B - C &= 2 \\\
-2A + 2B + D &= -1.
\end{array}
\]
Solving this system of equations, we find $A = 1$, $B = 2$, $C = -2$, and $D = 1$. Therefore, we have
\[
\begin{array}{rl}
\int \frac{4x^3 - 7x^2 + 2x - 1}{x^4 - 5x^3 + 4x^2 + 4x - 4} \, dx &= \int \frac{1}{x-1} \, dx + \int \frac{2}{x-2} \, dx + \int \frac{-2x+1}{x^2+1} \, dx \\\
&= \ln|x-1| + 2\ln|x-2| - \ln(x^2+1) + C,
\end{array}
\]
where $C$ is the constant of integration.
That is, indeed, a partial fractions problem. A very, very difficult partial fractions problem. Which it proceeds to get badly wrong, starting by factoring the denominator incorrectly (though it does find a real factor, which I’m impressed by if it wasn’t an accident).
It does integrate the function it finds correctly, though!
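The factorization is easy to check directly, by plugging the claimed roots into the actual denominator:

```python
# The claimed factorization was (x-1)(x-2)(x^2+1); evaluate the real
# denominator at the claimed roots to see which linear factors are genuine.
def p(x):
    return x**4 - 5 * x**3 + 4 * x**2 + 4 * x - 4

print(p(1), p(2))  # 0 -4: (x-1) really is a factor, but (x-2) is not
```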
Let’s look at another one.
Sure, here’s an integral that involves partial fractions:
\[\int \frac{3x+1}{x^2+4x+3} \, dx\]To solve this integral, we first factor the denominator as $(x+1)(x+3)$. Then, we write the fraction as
\[\frac{3x+1}{(x+1)(x+3)} = \frac{A}{x+1} + \frac{B}{x+3}\]where $A$ and $B$ are constants to be determined. Multiplying both sides by $(x+1)(x+3)$ and equating coefficients, we get
\[3x+1 = A(x+3) + B(x+1)\]Expanding and collecting like terms, we get
\[(A+B)x + 3A + B = 3x + 1\]Equating coefficients, we have
\[
\begin{array}{rl}
A + B &= 3 \\\
3A &= 1 \\\
B &= -2
\end{array}
\]
Solving for $A$ and $B$, we get $A = \frac{1}{3}$ and $B = -2$. Therefore,
\[
\begin{array}{rl}
\int \frac{3x+1}{x^2+4x+3} \, dx &= \int \frac{1}{x+1} \, dx - 2\int \frac{1}{x+3} \, dx \\\
&= \ln|x+1| - 2\ln|x+3| + C,
\end{array}
\]
where $C$ is the constant of integration. Therefore,
\[\int \frac{3x+1}{x^2+4x+3} \, dx = \ln|x+1| - 2\ln|x+3| + C.\]This time we get a reasonable problem. I could put this on a test! But we don’t get a correct solution. ChatGPT factors the denominator correctly, and sets up the equations to solve correctly, but then solves them incorrectly. The decomposition in its final step matches neither the correct one nor the coefficients it just solved for, though it does integrate that final formula correctly.
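For what it’s worth, the correct coefficients come straight out of the cover-up method, and they match neither the pair ChatGPT solved for nor the pair it integrated. A quick check with exact rational arithmetic:

```python
from fractions import Fraction

# Decompose (3x+1)/((x+1)(x+3)) = A/(x+1) + B/(x+3) by the cover-up
# method: evaluate the rest of the fraction at each root.
def numerator(x):
    return 3 * x + 1

A = Fraction(numerator(-1), -1 + 3)   # cover (x+1), plug in x = -1
B = Fraction(numerator(-3), -3 + 1)   # cover (x+3), plug in x = -3

print(A, B)  # -1 4
```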
Let’s give it one more try.
Here’s an integral that involves partial fractions:
\[\int \frac{x^3}{x^2-4x+3} \, dx\]To solve this integral, we first need to factor the denominator. We can write
\[x^2-4x+3 = (x-1)(x-3)\]Therefore, we can rewrite the integral as
\[\begin{array}{rl} \int \frac{x^3}{x^2-4x+3} \, dx &= \int \frac{x^3}{(x-1)(x-3)} \, dx \\ &= \int \frac{Ax+B}{x-1} \, dx + \int \frac{Cx+D}{x-3} \, dx, \end{array}\]
where $A$, $B$, $C$, and $D$ are constants to be determined. Multiplying both sides by the denominator, we get
\[x^3 = (Ax+B)(x-3) + (Cx+D)(x-1).\]Expanding and equating coefficients, we get
\[\begin{array}{rl}
A+C &= 0 \\\
-3A+B+D &= 0 \\\
-3B+D &= 0 \\\
-B &= 1
\end{array}\]
Solving for $A$, $B$, $C$, and $D$, we get
\[A = \frac{1}{2}, \quad B = -1, \quad C = -\frac{1}{2}, \quad D = \frac{3}{2}\]Therefore, we can write
\[\begin{array}{rl}
\int \frac{x^3}{x^2-4x+3} \, dx &= \int \frac{\frac{1}{2}x-1}{x-1} \, dx + \int \frac{-\frac{1}{2}x+\frac{3}{2}}{x-3} \, dx \\\
&= \frac{1}{2}\ln|x-1| - \frac{1}{2}\ln|x-3| - \frac{5}{4}\int \frac{1}{x-1} \, dx + \frac{3}{4}\int \frac{1}{x-3} \, dx \\\
&= \frac{1}{2}\ln\left|\frac{x-1}{x-3}\right| - \frac{5}{4}\ln|x-1| + \frac{3}{4}\ln|x-3| + C \\\
&= \frac{1}{4}\ln\left|\frac{x-1}{x-3}\right| - \frac{5}{4}\ln|x-1| + \frac{3}{4}\ln|x-3| + C,
\end{array}\]
where $C$ is the constant of integration. Therefore,
\[\int \frac{x^3}{x^2-4x+3} \, dx = \frac{1}{4}\ln\left|\frac{x-1}{x-3}\right| - \frac{5}{4}\ln|x-1| + \frac{3}{4}\ln|x-3| + C.\]A reasonable problem, again! The standard solution starts with polynomial long division; instead ChatGPT uses a janky setup that isn’t quite right. At any rate, it solves the equations it sets up incorrectly, does the integrals wrong, and winds up with the wrong answer.
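For comparison, the standard approach starts with polynomial long division, \(\frac{x^3}{x^2-4x+3} = x + 4 + \frac{13x-12}{(x-1)(x-3)}\), and then applies cover-up to the remainder. Exact rational arithmetic makes this easy to verify:

```python
from fractions import Fraction as F

# After long division, x^3/(x^2-4x+3) = x + 4 + (13x-12)/((x-1)(x-3)).
# Cover-up method for the remaining proper fraction:
A = F(13 * 1 - 12, 1 - 3)   # evaluate (13x-12)/(x-3) at x = 1: -1/2
B = F(13 * 3 - 12, 3 - 1)   # evaluate (13x-12)/(x-1) at x = 3: 27/2

def lhs(x):
    return F(x**3, x**2 - 4 * x + 3)

def rhs(x):
    return x + 4 + A / (x - 1) + B / (x - 3)

assert all(lhs(x) == rhs(x) for x in (2, 5, -7, 10))
print(A, B)  # -1/2 27/2
```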
Here’s a numeric integration problem:
Approximate the value of the integral
\[\int_0^1 \frac{x}{1+x^2} \, dx\]using the trapezoidal rule with $n=4$ subintervals.
Solution:
Using the trapezoidal rule, we have
\[\begin{array}{rl}
\int_0^1 \frac{x}{1+x^2} \, dx &\approx \frac{1}{2}\left[f(0)+f\left(\frac{1}{4}\right)\right]+\frac{1}{2}\left[f\left(\frac{1}{4}\right)+f\left(\frac{1}{2}\right)\right]\\\
&\qquad+\frac{1}{2}\left[f\left(\frac{1}{2}\right)+f\left(\frac{3}{4}\right)\right]+\frac{1}{2}\left[f\left(\frac{3}{4}\right)+f(1)\right]\\\
&= \frac{1}{2}\left[\frac{0}{1+0^2}+\frac{\frac{1}{4}}{1+\left(\frac{1}{4}\right)^2}\right]+\frac{1}{2}\left[\frac{\frac{1}{4}}{1+\left(\frac{1}{4}\right)^2}+\frac{\frac{1}{2}}{1+\left(\frac{1}{2}\right)^2}\right]\\\
& \qquad +\frac{1}{2}\left[\frac{\frac{1}{2}}{1+\left(\frac{1}{2}\right)^2}+\frac{\frac{3}{4}}{1+\left(\frac{3}{4}\right)^2}\right]+\frac{1}{2}\left[\frac{\frac{3}{4}}{1+\left(\frac{3}{4}\right)^2}+\frac{1}{1+1^2}\right]\\\
&\approx 0.3735
\end{array}\]
Therefore, the approximate value of the integral using the trapezoidal rule with $n=4$ subintervals is $0.3735$.
A reasonable problem. It almost sets up the calculation correctly, but forgets to include the \(\Delta x\) factor, and then does the numeric calculation at the end wrong anyway.
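Including the \(\Delta x = \frac{b-a}{n} = \frac{1}{4}\) factor, the trapezoidal rule actually gives about \(0.3413\); here’s a short Python check:

```python
# Trapezoidal rule for f(x) = x/(1+x^2) on [0,1] with n = 4,
# including the Delta-x factor the quoted solution dropped.
def f(x):
    return x / (1 + x * x)

a, b, n = 0.0, 1.0, 4
dx = (b - a) / n
xs = [a + i * dx for i in range(n + 1)]
T = dx * (f(xs[0]) / 2 + sum(f(x) for x in xs[1:-1]) + f(xs[-1]) / 2)

print(round(T, 4))  # 0.3413 (the exact value, ln(2)/2, is about 0.3466)
```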
Yes, we can estimate the error in the previous calculation using the error bound formula for the trapezoidal rule:
\[E_T \leq \frac{K(b-a)^3}{12n^2}\]where $K$ is an upper bound on the second derivative of the function over the interval of integration.
For our integral $\int_0^1 \frac{x}{1+x^2} \, dx$, we have
\[\begin{array}{rl}
f(x) &= \frac{x}{1+x^2} \\\
f’‘(x) &= \frac{2x^3-2x}{(1+x^2)^3} \\\
\end{array}\]
Taking the absolute value and finding the maximum value of $f’‘(x)$ on the interval $[0,1]$, we have:
\[\begin{array}{rl}
|f’‘(x)| &= \frac{2|x|(1-|x|^2)}{(1+x^2)^3} \\\
|f’‘(x)| &\leq \frac{2(1-0)}{(1+0^2)^3} = 2 \\\
\end{array}\]
Therefore, $K=2$, and we can estimate the error as:
\[\begin{array}{rl}
E_T &\leq \frac{K(b-a)^3}{12n^2} \\\
&= \frac{2(1-0)^3}{12(4)^2} \\\
&= \frac{1}{96} \\\
&\approx 0.0104 \\\
\end{array}\]
So the error in our previous calculation is estimated to be less than $0.0104$.
It figures out what I’m asking, and gives the right formula and right basic procedure. But it computes the second derivative wrong, and I have no idea where it’s getting the upper bound on the derivative from.
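The bound formula itself is fine, and it’s easy to confirm numerically that the true error of a correctly computed \(n=4\) trapezoidal estimate fits under the \(1/96\) bound. (Taking \(K=2\) as in the quoted solution is legitimate here, just looser than necessary.)

```python
import math

# Check the trapezoidal error bound E_T <= K (b-a)^3 / (12 n^2) for
# f(x) = x/(1+x^2) on [0,1] with n = 4, taking K = 2 as in the text.
def f(x):
    return x / (1 + x * x)

n, a, b = 4, 0.0, 1.0
dx = (b - a) / n
xs = [a + i * dx for i in range(n + 1)]
T = dx * (f(xs[0]) / 2 + sum(f(x) for x in xs[1:-1]) + f(xs[-1]) / 2)

exact = math.log(2) / 2                  # the antiderivative is ln(1+x^2)/2
bound = 2 * (b - a) ** 3 / (12 * n**2)   # = 1/96, about 0.0104

assert abs(exact - T) <= bound
print(abs(exact - T), bound)
```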
Here’s a problem that involves the inverse function theorem:
Suppose $f(x) = x^3 - 4x + 2$ and $g(x) = \sqrt[3]{x+4}$ for $x \geq -4$. Let $h(x) = f(g(x))$. Find $h’(x)$.
Solution:
Using the chain rule, we have:
\[\begin{array}{rl}
h(x) &= f(g(x)) \\\
h’(x) &= f’(g(x)) \cdot g’(x) \\\
&= (3g(x)^2 - 4) \cdot \frac{1}{3\sqrt[3]{(x+4)^2}} \\\
&= \frac{3\sqrt[3]{(x+4)^2}(3\sqrt[3]{(x+4)})^2 - 4}{3\sqrt[3]{(x+4)^2}} \\\
&= \frac{9(x+4) - 4}{3\sqrt[3]{(x+4)^2}} \\\
&= \frac{9x + 32}{3\sqrt[3]{(x+4)^2}} \\\
&= \frac{3(3x+32)}{\sqrt[3]{(x+4)^2}} \\\
\end{array}\]
Therefore, $h’(x) = \frac{3(3x+32)}{\sqrt[3]{(x+4)^2}}$.
That’s not the inverse function theorem. It’s almost a correct solution, though! Let’s give it another chance.
Here’s a problem that involves the inverse function theorem:
Let $f(x) = \sqrt{x^2 + 1}$ for $x \in \mathbb{R}$. Show that $f$ is invertible on $[0,\infty)$, find its inverse function $f^{-1}$, and calculate $(f^{-1})’(1)$.
Solution:
To show that $f$ is invertible on $[0,\infty)$, we need to show that $f$ is one-to-one and onto on this interval.
First, note that $f(x) > 0$ for all $x \in \mathbb{R}$, so $f$ is onto on $[0,\infty)$.
To show that $f$ is one-to-one on $[0,\infty)$, suppose $f(x_1) = f(x_2)$ for some $x_1,x_2 \in [0,\infty)$. Then $\sqrt{x_1^2 + 1} = \sqrt{x_2^2 + 1}$, so $x_1^2 + 1 = x_2^2 + 1$, which implies $x_1 = x_2$. Therefore, $f$ is one-to-one on $[0,\infty)$, and hence it is invertible on this interval.
To find the inverse function of $f$, we solve for $x$ in terms of $y$ in the equation $y = \sqrt{x^2 + 1}$:
\[\begin{array}{rl}
y &= \sqrt{x^2 + 1} \\\
y^2 &= x^2 + 1 \\\
x^2 &= y^2 - 1 \\\
x &= \pm\sqrt{y^2 - 1}
\end{array}\]
Since $f(x) > 0$ for all $x \in \mathbb{R}$, we take the positive square root to define the inverse function $f^{-1}$:
\[f^{-1}(y) = \sqrt{y^2 - 1} \quad \text{for } y \in [1,\infty)\]To calculate $(f^{-1})’(1)$, we use the inverse function theorem, which states that if $f$ is differentiable and invertible with inverse function $f^{-1}$, then $(f^{-1})’(y) = \frac{1}{f’(f^{-1}(y))}$.
In this case, we have $f(x) = \sqrt{x^2 + 1}$, so $f’(x) = \frac{x}{\sqrt{x^2 + 1}}$. Therefore, $f^{-1}(y) = \sqrt{y^2 - 1}$ and $f’(f^{-1}(y)) = \frac{f^{-1}(y)}{\sqrt{(f^{-1}(y))^2 + 1}} = \frac{\sqrt{y^2 - 1}}{\sqrt{(y^2 - 1) + 1}} = \frac{\sqrt{y^2 - 1}}{y}$.
Substituting $y=1$, we get $(f^{-1})’(1) = \frac{\sqrt{(1)^2 - 1}}{1} = \sqrt{0} = 0$. Therefore, $(f^{-1})’(1) = 0$.
This isn’t really what I was looking for, but it is much more responsive, at least. (The solution still has a bunch of minor errors.)
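One of those errors deserves a closer look: since \(f^{-1}(y) = \sqrt{y^2-1}\) has derivative \(y/\sqrt{y^2-1}\) for \(y > 1\), the quantity \((f^{-1})'(1)\) isn’t \(0\); it doesn’t exist at all. The difference quotient makes this vivid:

```python
import math

# f^{-1}(y) = sqrt(y^2 - 1); its one-sided difference quotient at y = 1
# grows like sqrt(2/h) rather than approaching 0.
def finv(y):
    return math.sqrt(y * y - 1)

for h in (1e-2, 1e-4, 1e-6):
    print(h, (finv(1 + h) - finv(1)) / h)
```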
Prompt: Write an integral that involves logarithms or exponentials
I asked it to respond to this prompt five different times, and five times it came up with \( \int \ln(x) \,dx\). It was very convinced that was the right response. I got better results with the following prompt:
Then I picked one of these and asked ChatGPT to work out a solution; but it went off the rails immediately. And then I realized that I’d forgotten to type the integral sign in the question. I guess putting in a badly-written question pushed it to give a badly written response. When I typed the question correctly, I got a clean and correct solution immediately! But for some reason, it rendered the LaTeX instead of displaying the code, so I couldn’t copy and paste it. That was new behavior and I don’t understand it.
This was really interesting, and occasionally useful. So I’m probably going to keep playing around with it. I may write a followup if I find anything especially interesting. But for now I’ll leave it here.
Have you tried using chatbots to write assignments? Have you gotten them to do useful things for you? Do you have ideas for how I could make this work better? Tweet me @ProfJayDaigle or leave a comment below.
I know you’re seeing some weird formatting on the second derivative, but that’s not actually ChatGPT’s fault; that has to do with a bug in the way LaTeX compiles through MathJax, which is what allows me to display it on the blog. I could fix the display issue but I wanted to keep the output genuinely unedited. ↵Return to Post
Note: at this point I also started a new conversation, to see if that would change things. I haven’t done a lot of testing on how much stuff changes if you continue a previous conversational thread versus starting a new one. ↵Return to Post
James Vincent of The Verge is one of the many people who had truly wild conversations with Microsoft’s chatbot.
The core technology underlying both of these chatbots has been around for a while^{1}, but the new products are more polished, accessible, and compelling, which means a lot of people are experiencing them for the first time. These products have also generated a certain amount of both triumphalism (GPT will revolutionize everything!) and fear (GPT will take over everything!) among people who take the possibilities of AI seriously.
I’m not an expert in these systems, just an interested amateur who’s been following them for a while. But the hype about GPT seems wildly overblown. The current approach to programming chatbots has real limits that I don’t think we can surpass without some genuinely new breakthroughs. And understanding some surprising facts about human psychology can help us develop intuition for what these systems will and won’t be able to do.
But first I want to mention that if you want to support my writing, I now have a Ko-Fi account. Any tips would be appreciated and would help me write more essays like this.
GPT is a text generation algorithm based on something called a large language model. The basic idea is that GPT has analyzed a huge corpus of written text and produced a model that looks at a bit of writing and predicts what words are likely to come next.
Humans do that all the time. If I hear the phrase “My friend Jim threw a ball and I caught—”, I will expect the next word to be “it”. But other continuations are possible: if I hear “the ball” or “that ball”, I won’t be that surprised. If I hear “the flu”, I’ll be kind of surprised, but “I caught the flu” is a reasonable thing to hear; it’s just a bit of a non sequitur after “My friend Jim threw a ball”. But if the next word were “green” or “solitude”, I’d be really confused. I suspect this is the only time anyone has ever written the sentence “My friend Jim threw a ball and I caught solitude”.
I started out describing a way to predict text, but it’s easy to turn that into a way to produce text. For instance, we could start with a prompt, and have our model keep supplying the most-likely next word until we’ve written enough. This is a fancier version of the memes that ask you to type “I hate it when” into your phone and see what autocomplete suggests. I tried that prompt on my phone, and got this:
I usually don’t hate it when I get home, actually.
And this illustrates the problem with that first suggestion: if you always take the most likely next word, you can get stuck. Even if you don’t wind up in a loop like that one, you’ll still say pretty boring things, since your writing is always as unsurprising as possible. Actual text-generation systems introduce some random noise parameters so that you always have a fairly likely word, but not the most likely word.
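Here’s a toy sketch of that idea in Python, with a completely made-up next-word distribution. The temperature parameter controls how much randomness gets injected; low temperatures approach the always-pick-the-likeliest-word strategy, while real systems do something similar over tens of thousands of tokens:

```python
import math
import random

# Made-up next-word probabilities after "...and I caught".
probs = {"it": 0.70, "the": 0.15, "that": 0.10, "flu": 0.04, "solitude": 0.01}

def sample(probs, temperature=1.0):
    # Rescale log-probabilities by the temperature and renormalize.
    # Low temperature approaches greedy decoding; temperature 1.0
    # samples from the distribution as-is.
    weights = {w: math.exp(math.log(p) / temperature) for w, p in probs.items()}
    r = random.uniform(0, sum(weights.values()))
    for word, weight in weights.items():
        r -= weight
        if r <= 0:
            return word
    return word

random.seed(0)
print([sample(probs, temperature=0.7) for _ in range(5)])
```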
This basic idea has been around for decades, but in 2017 a team at Google developed a new algorithm called the transformer that worked much better than any previous strategies; since then, the technology has developed rapidly.
Already in 2019 we could produce substantial quantities of fluent, grammatical, and sometimes even stylish English text. The newest products are even more impressive. They can give helpful answers to questions in a number of fields, including finance, medicine, law, and psychology. They can summarize the contents of research papers. They can make you fall in love.
They can also play the world’s most chaotic game of chess. Here ChatGPT is playing black.
And this success has led people to wonder what comes next. How good will AI chatbots get? Will they make it impossible to prevent cheating on schoolwork? Will they replace your doctor, your lawyer, or your therapist? Will they make desk jobs obsolete?
Are they self-aware? Are they intelligent beings?
The most obvious take on GPT is that it can’t think; it’s just expressing statistical relationships among words. In the narrowest sense, this is certainly true; it’s just a very sophisticated technology for predicting what words should come next in a string of text.
And since it’s just doing prediction, it should be very limited in what it can do. GPT won’t produce original thoughts; it can only express relationships that are already in the text it has used as input. Thus we see Ted Chiang’s summary that ChatGPT provides a blurry jpeg of the web:
Large language models identify statistical regularities in text. Any analysis of the text of the Web will reveal that phrases like “supply is low” often appear in close proximity to phrases like “prices rise.” A chatbot that incorporates this correlation might, when asked a question about the effect of supply shortages, respond with an answer about prices increasing. If a large language model has compiled a vast number of correlations between economic terms—so many that it can offer plausible responses to a wide variety of questions—should we say that it actually understands economic theory?
GPT has simply taken a bunch of words, summarized the relationships expressed by those words, and done some sort of fuzzy pattern-matching and extrapolation from those relationships. There’s no creative thought. And most of the scary samples you’ve seen are this sort of pattern-matching. Microsoft’s chatbot says it wants to be human and threatens to kill people because we have tons of fiction about AIs that want to be human and threaten to kill people, and it’s just imitating that.
But, the rejoinder comes: are people any different? Humans are just doing fuzzy pattern-matching and imitating behavior we’ve seen…somewhere. So sure, GPT is just saying things that sound good based on what it’s read, but that’s also what people do most of the time. ChatGPT can do a good job of producing mediocre high school essays because it really is doing the same thing a mediocre high school essayist is doing!
And I think this is basically true—sometimes. A lot of human communication is basically just unreflective pattern-matching, saying things that sound good without really thinking about what they mean. When I make small talk with the cashier at Target, I’m not engaging in a deep intellectual analysis of how to best describe my day. I’m just making small talk!
I also see this thoughtless extrapolation all the time while teaching college students. When students ask for help and I look at their work, it’s common for there to be steps that just don’t make any sense. And when I ask them why they did that, they don’t know. They’ll say something like “I don’t know, it just seemed like a thing to do?”
And that’s not even always a bad thing. If I type “3+5”, most of you will probably say “8” to yourselves before consciously deciding to do the addition; if I say “the capital of France”, you probably find “Paris” popping into your mind without any active deliberation. It’s hard to explain how you answered those questions, because you just know. And that’s great, because it means you don’t have to stop and think and work to get the answer.
Of course, this quick-and-easy thinking doesn’t always give the right answer. If I hear “the capital of Illinois”, my immediate reaction is “Chicago”. (It’s Springfield. I was pretty sure Chicago was the wrong answer, but it’s still the first one my brain supplied.) And if I hear “537 times 842”, my immediate reaction is—well, my immediate reaction is “ugh, do I have to?” I know I could work that out if I need to. But I’d rather not. It’s certainly not automatic.
So yes, humans in fact do a lot of pattern-matching and extrapolation. But we also do more than that. We can look at the results of our mental autocomplete and ask, “does this really make sense?”. We can do precise calculations that take effort and focus. We can hold complex ideas in our heads with far-removed long-term goals, and we can subordinate our free association to those complex ideas. We can, really and truly, think.
We can think carefully, but that doesn’t mean we always do. Right after the original release of GPT-2, in February 2019, Sarah Constantin wrote a piece arguing that Humans Who Are Not Concentrating Are Not General Intelligences. She observed that GPT text looks a lot like things people would write—if you don’t read them carefully. But the more attention you pay, the more they fall apart.
If I just skim, without focusing, [the GPT passages] all look totally normal. I would not have noticed they were machine-generated. I would not have noticed anything amiss about them at all.
But if I read with focus, I notice that they don’t make a lot of logical sense.
…
So, ok, this isn’t actually human-equivalent writing ability…. The point is, if you skim text, you miss obvious absurdities. The point is OpenAI HAS achieved the ability to pass the Turing test against humans on autopilot.
So the synthesis is: large language models like GPT can talk, and perhaps “think”, as well as a person who isn’t paying attention to what they’re saying. And it makes lots of errors for the same reason you can find multiple reddit threads about thoughtlessly saying “you too” in inappropriate situations. We say it because it feels right—and only afterwards do we realize it definitely isn’t.
In Thinking, Fast and Slow, Daniel Kahneman writes about how human reasoning splits into two basic systems. System 1 reasons quickly and efficiently, but operates essentially on reflex. System 2 is slower and takes more energy, but can engage in careful, reflective thought. When asked for the capital of Illinois, my System 1 says “Chicago”, and then my System 2 says “no, wait, isn’t it actually Springfield?”.
Psychologists studying these two systems have found some classic puzzles that illustrate the difference really well. While reading the next question, pay attention to all the thoughts you have, and not just the final answer you come up with.
Question: Suppose a bat and a ball cost $1.10 together, and the bat costs a dollar more than the ball. How much does the ball cost?
If your brain works like mine—or most people’s—the first thing you hear yourself thinking is “ten cents”. But if you think more carefully, or pull out a pen and some paper to do work, you’ll realize that’s wrong; the ball should in fact cost five cents.^{2} But even if you got it right, the wrong answer probably occurred to you first.^{3} That’s your System 1 contributing a guess, before your system 2 kicks in and corrects it.
And it seems to me that systems like GPT are implementing something like our System 1, but not System 2. And most of the time, when we’re not paying attention, we’re only using System 1—we’re just being human chatbots. But we are capable of using our System 2 to reflect on what we really mean, and transcend pure pattern-matching. It really seems like GPT can’t—and that it will never be able to without some genuinely new idea that we don’t yet have.
Now, when I gave ChatGPT the bat-and-ball problem, it got it right, in a really careful and thorough way.
Let’s use algebra to solve this problem.
Let x be the cost of the ball in dollars.
According to the problem statement, the bat costs one dollar more than the ball, so its cost is x + 1.
Together, the bat and ball cost $1.10, so we can write the equation:
x + (x + 1) = 1.10
Simplifying the equation:
2x + 1 = 1.10
Subtracting 1 from both sides:
2x = 0.10
Dividing both sides by 2:
x = 0.05
Therefore, the ball costs $0.05.
But that doesn’t provide any evidence that ChatGPT is thinking carefully, because this specific problem has been discussed at great length in the source text. It has a similar reaction to mine: “oh crap, it’s the bat-and-ball problem, better be careful so I don’t screw up.”^{4}
But right after introducing the bat-and-ball problem, Thinking, Fast and Slow gives another example question that System 1 tends to fail on. This one gets discussed much less often. And this time, ChatGPT fails too.
This answer is wrong: the syllogism isn’t valid at all. All roses are flowers, but not all flowers are roses; it’s logically possible that some flowers fade quickly, and other flowers are roses, but none are both roses and quickly fade. ChatGPT doesn’t have a ton of examples of this puzzle in its training data, so it doesn’t know the answer, and it has no ability to figure it out on its own. It gets tricked, just like we do.
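The invalidity is easy to exhibit with a concrete countermodel; here’s one written out as a few lines of Python, with small sets standing in for the predicates:

```python
# Countermodel for: "All roses are flowers; some flowers fade quickly;
# therefore some roses fade quickly."
roses = {"rose"}
flowers = {"rose", "tulip"}
fades_quickly = {"tulip"}   # only the tulip fades quickly

all_roses_are_flowers = roses <= flowers            # premise 1: holds
some_flowers_fade = bool(flowers & fades_quickly)   # premise 2: holds
some_roses_fade = bool(roses & fades_quickly)       # conclusion: fails

print(all_roses_are_flowers, some_flowers_fade, some_roses_fade)
# True True False: both premises hold and the conclusion is false
```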
Over the past couple days I’ve seen a particularly amusing genre of GPT-trolling questions: ask it something that sounds like a trick question, but isn’t.
When I was a child, I thought it was fun to ask people: “which is heavier, a pound of feathers or a pound of bricks?” Like with our other examples, most people instinctively want to say that the pound of bricks is heavier, but with a bit of thought that’s obviously wrong.
Again, this is a famous trick question, and again, ChatGPT generally gets it right. But some clever person on Twitter got the bright idea to ask it to compare one pound of feathers to two pounds of bricks.
Presumably GPT basically said “oh, crap, this is the feathers-and-bricks thing again. I know it’s a trick question, because every time people have asked this it’s been a trick question, and they actually weigh the same”. And it totally ignores the actual numbers in the question.
And this generalizes: there are a few other examples of posing variations on famous puzzles that have the trick removed. GPT gets them wrong, because it knows there’s a trick: there’s always a trick when people bring up the Monty Hall problem.
This one is my favorite; I laugh every time I read it.
Now, you shouldn’t take the specifics too seriously here. GPT is not human, and even truly intelligent AI might be intelligent in very not-human-like ways. We shouldn’t expect GPT’s capabilities to correspond exactly to the human System 1. If nothing else, System 1 controls basic physical activities like walking, which is a notoriously hard robotics problem that GPT isn’t even interacting with at all. And ChatGPT gets the capital of Illinois right, which my System 1, at least, does not.
But using the split between System 1 and System 2 as a metaphor has really helped me structure how I think about GPT, and to understand how it can be so good at some things while completely incapable of others. “GPT can do the sort of things that we can do on autopilot, if we’ve read a lot and have a good memory” does seem to sum up most of its capabilities!
This all makes the new chatbots seem way less frightening to me. No, they’re not “really thinking”; they can do some of what people can do, but there are core capabilities they lack. They aren’t sapient: analytic self-reflection is exactly the thing they aren’t capable of. And it does seem like this is a fundamental limitation of the approach that we’re using.
Each new generation of chatbots is more fluent and more impressive, but the basic technology we’re using appears to have serious limits. I strongly suspect you just can’t get System 2-style analytic capabilities by scaling up the current approach. (And that’s before we ask whether it’s even possible to keep scaling them up without using dramatically more text than actually exists in the world.)
But that doesn’t just suggest a ceiling on how impressive GPT chatbots can get, or what capabilities they can develop. It also tells us how to use them!
Most of us spend some of our time doing real work that requires thought and creativity. And we spend other time dealing with what feels like trivial bullshit that has to get done but is boring and formulaic. The first type of task is the sort of thing GPT can’t do for us—not now, and I suspect not ever. But the boring, formulaic tasks are ripe for automation. And fortunately, they’re the ones I didn’t want to do anyway.
I’m sure this isn’t a complete list of what GPT-like technologies can do. And even if it takes a while for people to figure out what the technology is good for, I’m sure eventually we’ll find some real uses.
But I don’t believe the dramatic hype I’ve been hearing for the past month. GPT is cool, and fun, and maybe even useful. But it won’t take over the world.
What do you think about the new chatbots? Do you have a use for them I didn’t mention? Or do you think I’m wrong about everything? Tweet me @ProfJayDaigle or leave a comment below.
GPT-2 was released in February 2019, and GPT-3, which ChatGPT is based on, was released in June 2020. I’ve been at least peripherally following this technology since even before the release of GPT-2, so ChatGPT and Sydney are a lot less surprising to me than they are to a lot of people—they’re improved versions of something I was already familiar with. ↵Return to Post
If the ball costs \$0.10 then the bat would have to cost \$1.00, and would only cost ninety cents more; the correct answer is that the bat costs \$1.05 and the ball costs \$0.05. ↵Return to Post
Actually, at this point what my System 1 says is “oh crap, it’s the bat and ball problem again. Think carefully before you answer!” But that’s only from having seen this specific problem too many times; if you changed the setup basically at all, I’d think the wrong answer first, and then correct myself. ↵Return to Post
At least one person has fooled ChatGPT and gotten the wrong answer by changing the bat and ball to a bow and arrow. But every time I’ve tried I’ve gotten the right answer, with either version. ↵Return to Post
If you like superhero fiction, Interviewing Leather and Justice Wing: Plan, Prototype, Produce, Perfect are both really good. ↵Return to Post
In this essay we’ll look at a number of methods that can help us draw better conclusions, and avoid the pitfalls of crappy hypothesis testing. We’ll start with some smaller and more conservative ideas, which basically involve doing hypothesis testing better. Then we’ll look at more radical changes, taking the focus away from hypothesis tests and seeing the other ways we can organize and contribute to scientific knowledge.
But first, let’s remember what we’re talking about. The first two parts of this series answered two basic questions: how does hypothesis testing work, and how does it break?
In part 1, we learned about two major historical approaches to the idea of hypothesis testing: one by Fisher, and the other by Neyman and Pearson. Both start with a “null hypothesis”, which is usually an idea we’re trying to disprove. Then we collect some data, and analyze it under the assumption that the null hypothesis is true.
Fisher’s significance testing computes a \(p\)-value, which is the probability of seeing the experimental result you got if the null hypothesis is true. It is not the probability that the null hypothesis is false, but it does measure how much evidence your experiment provides against the null hypothesis. We say the result is significant if the \(p\)-value is below some pre-defined threshold, generally \(5\)%. If the null is actually false, we should be able to reliably produce these low \(p\)-values; Fisher wrote that a “scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance”.
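As a concrete sketch (my numbers, not Fisher’s), here’s a quick Python simulation of a simple one-sample z-test with known variance: when the null hypothesis really is true, the \(p\)-value falls below \(0.05\) about one time in twenty.

```python
import math
import random

def p_value(sample, sigma=1.0):
    """Two-sided p-value for H0: mean = 0, using a z-test with known sigma."""
    n = len(sample)
    z = (sum(sample) / n) * math.sqrt(n) / sigma
    # P(|Z| >= |z|) for a standard normal Z
    return math.erfc(abs(z) / math.sqrt(2))

random.seed(0)
trials = 10_000
# Simulate experiments where the null hypothesis is TRUE (the mean really is 0)
p_values = [p_value([random.gauss(0, 1) for _ in range(30)]) for _ in range(trials)]
false_positives = sum(p < 0.05 for p in p_values) / trials
print(f"fraction significant under a true null: {false_positives:.3f}")
```

Under a true null the \(p\)-value is uniformly distributed, which is exactly what makes the \(5\)% threshold deliver a \(5\)% false positive rate.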
Neyman and Pearson didn’t worry about establishing facts; instead, they focused on making actionable, yes-or-no decisions. A Neyman-Pearson null hypothesis is generally that we should refuse to take some specific action, which may or may not be useful. We figure out how bad it would be to take the action if it is useless, and how much we’d miss out on if it’s useful, and use that to set a threshold; then we collect data and use our threshold to decide whether to act. This approach doesn’t tell us what to believe, just what to do. Sometimes we think that acting is probably useful, but that acting wrongly would be catastrophic so it would be wiser to do nothing. The Neyman-Pearson method takes that logic into account, and biases us towards inaction, making type I errors less common at the expense of making type II errors more common.^{1}
Modern researchers use an awkward combination of these methods. Like Fisher, we want to discover true facts; but we use Neyman and Pearson’s technical approach of setting specific thresholds. We set a false positive threshold (usually \(5\)%) and ideally a false negative threshold (we want it to be less than \(20\)%), and run our experiment. If we get a \(p\)-value less than the threshold—data that would be pretty weird if the null hypothesis is true, so weird it would only happen once in every twenty experiments we run—then we “reject the null” and believe some alternative hypothesis. If our \(p\)-value is bigger, meaning our data wouldn’t look too weird if the null hypothesis is true, then we “fail to reject” the null and err on the side of believing the null hypothesis.
There are a few major problems with this setup.
The Neyman-Pearson method makes a definitive choice between two distinct courses of action. This reinforces a general tendency to force questions into yes-or-no binaries, even when that sort of clean dichotomy isn’t realistic or appropriate to the question. Hypothesis testing tells us whether something exists, but not really how common or how big it is.^{2}
Unfortunately, Yoda is wrong. Sometimes we do care about size.
And more importantly, scientific knowledge is always provisional, so we need to continually revise our beliefs based on new information. But Neyman-Pearson is designed to make a final decision and close the book on the question, which just isn’t how science needs to work.
Neyman-Pearson creates a bias towards the null hypothesis, so rejecting the null feels like learning something new, while failing to reject is a default outcome. On one hand, this means it’s not a good tool if we want to show the null is true^{3}. On the other hand, a study that fails to reject the null feels like a failed study, and that’s a huge problem if the null really is true! This can bias the studies we actually see since many non-rejections aren’t published. It does us little good for most research to be accurate if most published papers are not.
Since researchers don’t want to fail, and do want to discover new things and get published, they have an incentive to find a way to reject the null.^{4} When done deliberately, we call this \(p\)-hacking, and there are a variety of questionable research practices that can help us wrongly and artificially reject a null hypothesis. Worse, the garden of forking paths means you can effectively \(p\)-hack without even knowing that you’re doing it, fudging both your theory and your data until they match.
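Here’s a toy illustration of how much this “helps”: if we measure five independent outcomes (a simplifying assumption; real outcomes are usually correlated) and report whichever gives the smallest \(p\)-value, the real false positive rate balloons well past the advertised \(5\)%.

```python
import math
import random

def p_value(sample):
    """Two-sided p-value for H0: mean = 0 (z-test, known sigma = 1)."""
    n = len(sample)
    z = (sum(sample) / n) * math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))

random.seed(1)
trials, outcomes, n = 10_000, 5, 30
hacked_hits = 0
for _ in range(trials):
    # The null is true for every outcome, but we report only the best p-value
    best_p = min(p_value([random.gauss(0, 1) for _ in range(n)])
                 for _ in range(outcomes))
    hacked_hits += best_p < 0.05
hacked_rate = hacked_hits / trials
print(f"false positive rate reporting the best of 5 outcomes: {hacked_rate:.3f}")
```

With five independent shots at significance, the expected rate is \(1 - 0.95^5 \approx 23\)%, not the \(5\)% we advertised.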
At the same time, many studies have low power, meaning they probably won’t reject the null even if it is actually false. Combined with publication bias, this can make the published literature unreliable: in some subfields, a majority of published results are untrue. What’s more, when underpowered studies do find something, they tend to overestimate the effect, leading us to think everything works better than it actually does.
Despite all these problems, hypothesis testing is extremely useful—when we have a question it’s good for, and use it properly. So we’ll start by seeing how to make hypothesis testing work correctly, and some of the ways science has been shifting over the past couple decades to do a better job at significance testing.
To create reliable knowledge we need to replicate our results; there will always randomly be some bad studies and replication is the only way to weed them out. (There’s a reason it’s the “replication crisis” and not the “some bad studies” crisis.) Any one study may produce weird data through bad luck; but if we can get a specific result consistently, then we’ve found something real.^{5}
In some fields it’s common for important results to get replicated early and often. I’ve written before about how mathematicians are continuously replicating major papers by using their ideas in future work, and even just by reading them. Any field where research is iterative will generally have this same advantage.
In other fields replication is less automatic. Checking important results would take active effort, and often doesn’t happen at all. Complex experiments may be too expensive and specialized to replicate: the average phase 3 drug trial costs about \$20 million, and even an exploratory phase 1 trial costs about \$4 million. At those prices we’re almost forced to rely on one or two studies, and if we get unlucky with our first study it will be hard to correct our beliefs.^{6}
But sometimes we just don’t treat replication work like it’s important. If we run a new version of an old study and get the same result, it can feel like a waste of time: we “knew that already”. Since our results are old news, it can be hard to get the work published or otherwise acknowledged. But if we run a new version of an old study and don’t get the same result, many researchers will assume our study must be flawed because they already “know” the first study was right. Replication can be a thankless task.
The replication crisis led many researchers to reconsider these priorities. Groups like the Many Labs Project and the Reproducibility Project: Psychology have engaged in large scale attempts to replicate famous results in psychology, which helped to clarify which “established” results we can actually trust. Devoting more attention to replication may mean we study fewer ideas and “discover” fewer things, but our knowledge will be much more reliable.^{7}
Unfortunately, replication work often gets a response somewhere between apathy and active hostility. Lots of researchers see “failed” replications as actual failures—the original study managed to reject the null, so why can’t you?
Alt text: “Hell, my eighth grade science class managed to conclusively reject it just based on a classroom experiment. It’s pretty sad to hear about million-dollar research teams who can’t even manage that.”
Worse, replications that don’t find the original result are often treated like attacks on both the original research and the original researchers. They “followed the rules” and got a publishable result, and now the “data police” are trying to take it away from them. At its worst, this leads to accusations of methodological terrorism. But even in less hostile discussions, people want to “save” the original result and explain away the failed replication—either by finding some specific subgroup in the replication where the original result seems to hold, or by finding some way the replication differs from the original study and so “doesn’t count”.^{8}
This desire might seem weird, but it does follow pretty naturally from the Neyman-Pearson framework. The original goal of hypothesis testing is to make a decision and move on—even though that’s not how science should work. Replication re-opens questions that “were already answered”, which is good for science as a whole but frustrating to people who want to close the question and treat the result as proven.
To make replication fit into a hypothesis testing framework, we often use meta-analysis, which synthesizes the data and results from multiple previous studies. Meta-analysis can be a powerful tool: why wouldn’t we want to use all the data out there, rather than picking just one study to believe? But it also allows us to move fully back into the Neyman-Pearson world. We can treat the whole collection of studies as one giant study, run one hypothesis test on it, and reach one conclusion.
Of course this leaves us with all the fundamental weaknesses of hypothesis testing: it tries to render a definitive yes-or-no answer, and it’s biased towards sticking with the null hypothesis.
Moreover, a meta-analysis can only be as good as the studies that go into it. If those original studies are both representative and well-conducted, meta-analysis can produce a reliable conclusion. But if the component studies are sloppy and collect garbage data, as disturbingly many studies are, the meta-analysis will necessarily produce a garbage result. Good researchers try to screen out unusually bad studies, but if all the studies on some topic are bad then that won’t help.
And if not all studies get published, then any meta-analysis will be drawing on unrepresentative data. Imagine trying to estimate average human height, but the only data you have access to comes from studies of professional basketball players. No matter how careful we are, our estimates will be far too high, because our data all comes from unusually tall people. In the same way, if only unusually significant data gets published, even a perfect meta-analysis will be biased, because it can only use biased data.
Even if all studies get published, the garden of forking paths can bias the meta-analysis in exactly the same way, since each study may report an unusually favorable measurement. This is like if some studies report the height of their participants, and others the weight, and others the shoe size—but they all pick the measure that makes their subjects look biggest. Each study might report its data accurately, but we’d still end up with a misleading impression of how large people actually are.
Good meta-analyses will look for signs of selective publication, and there are statistical tools, like funnel plots and \(p\)-curves, that can sometimes detect these biases in the literature. But these tools aren’t perfect, and of course they don’t tell us what we would have seen in the absence of publication bias. We can try to weed out bad studies after publication, but it’s better not to produce them in the first place.
The \(p\)-curve: when there’s \(p\)-hacking or selection bias, we expect most significant studies to be just barely significant. When the effect is real, we expect small \(p\)-values to be much more common than large ones. Figure from Simonsohn, Nelson, and Simmons.
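This asymmetry is easy to reproduce with a toy one-sample z-test: with a real effect, significant \(p\)-values cluster near zero; under a true null, they’re spread evenly across the significant range. (A minimal sketch with hypothetical effect sizes, not the actual \(p\)-curve methodology.)

```python
import math
import random

def p_value(sample):
    """Two-sided p-value for H0: mean = 0 (z-test, known sigma = 1)."""
    n = len(sample)
    z = (sum(sample) / n) * math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))

def small_p_fraction(true_mean, trials=20_000, n=30):
    """Among significant results, what fraction of p-values are below 0.025?"""
    sig = [p for _ in range(trials)
           if (p := p_value([random.gauss(true_mean, 1) for _ in range(n)])) < 0.05]
    return sum(p < 0.025 for p in sig) / len(sig)

random.seed(2)
effect_frac = small_p_fraction(0.5)  # a real effect of half a standard deviation
null_frac = small_p_fraction(0.0)    # the null hypothesis is true
print(f"true effect: {effect_frac:.2f} of significant p-values are below 0.025")
print(f"true null:   {null_frac:.2f} of significant p-values are below 0.025")
```

Under a true null, a significant \(p\)-value is as likely to land just under \(0.05\) as well under it; a real effect pushes the whole distribution towards zero.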
But of course, not all meta-analyses are good. Just like researchers have lots of ways to tweak their experiments to get statistical significance, doing a meta-analysis involves making a lot of choices about how to analyze the data, and so there are a lot of opportunities to \(p\)-hack or to get tricked by the garden of forking paths. Meta-analysis is like one giant hypothesis test, which means it can go wrong in exactly the same ways other hypothesis tests do.
Hypothesis testing does have real weaknesses, but many of the real-world problems we deal with only happen when we do it wrong. The point of the Neyman-Pearson method is to set out a threshold that determines whether we should act or not, collect data, and then see whether the data crosses the threshold. If we ignore the result when it doesn’t give the answer we want, then we’re not really using the Neyman-Pearson method at all.
But that’s exactly what happens in many common errors. When we ignore negative studies, we change the question from “yes or no” to “yes or try again later”. The garden of forking paths and \(p\)-hacking involve changing the threshold after you see your data. This makes it very easy for your data to clear the threshold, but not very informative.
It’s easy to hit your target, if you pick the target after you shoot. But you don’t learn anything that way. Illustration by Dirk-Jan Hoek, CC-BY
For hypothesis testing to work, we have to decide what would count as evidence for our theory before we collect the data. And then we have to actually follow through on that, even if the data tells us something we don’t want to hear.
Following through with this is simple for private decisions, if not always easy. When I want to buy a new kitchen gadget, sometimes I’ll decide how much I’m willing to pay before I check the price. If it turns out to be cheaper than my threshold, I’ll buy it; if it’s more expensive, I won’t. This helps me avoid making dumb decisions like “oh, that fancy pasta roller set is on sale, so it must be a good deal”. I don’t need any fancy way to hold myself accountable, since there’s no one else involved for me to be accountable to. And of course, if the pasta roller is super expensive and I buy it anyway, I’m only hurting myself.
But science is a public, communal activity, and our decisions and behavior need to be transparent so that other researchers can trust and build on our results. Even if no one ever lied, it’s so easy for us to fool ourselves that we need some way to guarantee that we did it right—both to other scientists, and to ourselves. Everyone saying, “I swear I didn’t change my mind after the fact, honest!” just isn’t reliable enough.
To create trust and transparency, we can publicly preregister our research procedures. If we publish our plans before conducting the study, everyone else can know we made our decisions before we ran the study, and they can check to see if the analysis we did matches the analysis we said we would do. When done well, this prevents p-hacking and protects us from the garden of forking paths, because we aren’t making any choices after we see the data.
Public preregistration also limits publication bias. Even if the study produces boring negative results, the preregistration plan is already published, so we know the study happened—it can’t get lost in a file drawer where no one knows about it. This preserves the powerful statistical protection of the Neyman-Pearson method: our false positive rate will be five percent, and no more.
Many journals have implemented registered reports, which allow researchers to submit their study designs for peer review, before they actually conduct the study. This means their work is evaluated based on the quality of the design and on whether the question is interesting; the publication won’t depend on what answer they find, which removes the selection bias towards only seeing positive results. Registered reports also restrict researchers to the analyses they had originally planned, rather than letting them fish around for an interesting result—or at least force them to explain why they changed their minds, so we can adjust for how much fishing they actually did.
The biggest concern about publication bias probably surrounds medical trials, where pharmaceutical companies have an incentive not to publish any work that would show their drugs don’t work. Many regulatory bodies including the FDA require clinical trials to be registered; the NIH also maintains a public database of trial registries and results. And this change had a dramatic impact on the results we saw from clinical trials.
Before widespread preregistration, most trials showed large benefits. When we got more careful, these benefits evaporated.
Preregistration is also a great opportunity to plan out our study more carefully, and in particular to think about statistical power in advance. Remember, the power of a study is the probability that it will reject the null hypothesis if the null is in fact false. We get more power when the study is better and more precise, but also when the effect we’re trying to measure is bigger and more visible: it’s pretty easy to show that cigarette smoking is linked to cancer, because the effect is so dramatic.^{9} But it’s much harder to detect the long-term effects of something like power posing, because the effects will be so small relative to other impacts on our personality.
On the other hand, if the effects are that small, maybe they don’t matter. If some economic policy reduces inflation by \(0.01\)%, then even if we could measure such a small reduction we wouldn’t really care—all we need to know is that the effect is “too small to matter”. With enough precision we could get statistical significance,^{10} but that doesn’t mean the result is practically or clinically significant. During the preregistration process we can decide what kind of effects would be practically important, and calibrate our studies to find those effects.
Planning for power also makes it easier to treat negative results as serious scientific contributions. The aphorism says that absence of evidence is not evidence of absence, but the aphorism is wrong. When a study has high power, we are very likely to see evidence if it exists; so absence of evidence becomes pretty good evidence of absence. If we know our studies have enough power, then our negative results become important and meaningful, and we won’t need to hide them in a file drawer.
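For a toy one-sample z-test with known variance, power has a closed form, which makes it easy to see how sample size changes things. (The effect size of \(0.2\) standard deviations here is a hypothetical “small effect”, not a number from any particular study.)

```python
import math

def phi(x):
    """Standard normal CDF."""
    return 0.5 * math.erfc(-x / math.sqrt(2))

def power(effect_size, n, critical_z=1.96):
    """Power of a two-sided one-sample z-test with known sigma = 1:
    the probability of rejecting the null when the true mean is effect_size."""
    delta = effect_size * math.sqrt(n)
    return phi(delta - critical_z) + phi(-delta - critical_z)

for n in (30, 100, 500):
    print(f"n = {n:3d}: power = {power(0.2, n):.2f}")
```

With \(30\) subjects we’ll miss this small effect about four times out of five; with \(500\) we’ll almost always find it. That’s the kind of calculation worth doing before collecting any data.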
And all of this is fantastic—but it doesn’t address many of the problems science actually presents us with. Modern hypothesis testing is optimized for taking a clear, well-designed question and giving a simple yes-or-no answer. That’s a good match for clinical trials, where the question is pretty much “should we use this drug or not?” By the time we’re in Phase 3 trials, we know what we think the drug will accomplish, and we can describe in advance a clean test of whether it will or not. Preregistration solves the implementation problems pretty thoroughly.
But preregistration does limit our ability to explore our data. This is necessary to make hypothesis testing work properly, but it’s still a cost. We really do want to learn new things from our data, not just confirm conjectures we’ve already made. Preregistration can’t help us if we don’t already have a hypothesis we want to test. And often, when we’re doing research, we don’t.
Here are some scientific questions we might want to answer:
None of these are yes-or-no questions. All of them are important parts of the scientific program, but none of them suggest specific hypotheses to run tests on. What do we do instead?
Maybe the most obvious idea is just to test, well, everything.
With apologies to Allie Brosh.
Now, we can’t test literally everything; collecting data takes time and money, and we can only conduct so many experiments. But we can take all the data we already have on crime rates, or on learning; and we can list every hypothesis we can think of and test them all for statistical significance. This data dredging is a very common, very bad idea, especially in the modern era of machine learning and big data. Mass testing like this takes all the problems of hypothesis testing—false positives, publication bias, low power, and biased estimates—and makes them much worse.
If we test every idea we can think of, most of them will be wrong. As we saw in part 2, that means a huge fraction of our positive results will be false positives. Sure, if we run all our tests perfectly, then only \(5\)% of our wrong ideas will give false positives. But since we have so many more bad ideas than good ones, we’ll still get way more false positives than true positives. (This is easiest to see in the case where all of our ideas are wrong—then all our positive results will be false positives!)
If we test just twenty different wrong ideas, there’s a roughly two-in-three chance that one of them will fall under the \(5\)% significance threshold, just by luck.^{11} That’s a lot higher than the false positive rate of \(5\)% that we asked for, and means we are very likely to “discover” something false. And then we’ll waste even more time and resources following up on our surprising new “discovery”.
If you test everything, you’ll find a ton of spurious correlations like this one.
This multiple comparisons problem has a mathematical solution: we can adjust our significance threshold to bring our false positive rate back down. A rough rule of thumb is the Bonferroni correction, where we divide our significance threshold by the number of different ideas we’re testing. If we test twenty ideas but divide our \(5\)% significance threshold by twenty to get a corrected threshold of \(0.25\)%, then each individual test has only a one-in-four-hundred chance of giving a false positive, and across all twenty tests we have roughly a five percent chance of getting at least one false positive.
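The arithmetic behind both numbers is simple enough to check directly (assuming, for simplicity, that the twenty tests are independent):

```python
tests = 20
alpha = 0.05

# Chance of at least one false positive across 20 independent tests of true nulls
uncorrected = 1 - (1 - alpha) ** tests
# Bonferroni: divide the significance threshold by the number of tests
bonferroni = 1 - (1 - alpha / tests) ** tests

print(f"uncorrected family-wise error rate: {uncorrected:.3f}")  # ≈ 0.642
print(f"Bonferroni-corrected:               {bonferroni:.3f}")   # ≈ 0.049
```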
The problem is sociological, not mathematical: people don’t like correcting for multiple comparisons, because it makes it harder to reach statistical significance and “win” the science game. Less cynically, correcting for multiple comparisons reduces the power of our studies dramatically, making it harder to discover real and important results. Ken Rothman’s 1990 paper No Adjustments Are Needed for Multiple Comparisons articulates both of these arguments admirably clearly: “scientists should not be so reluctant to explore leads that may turn out to be wrong that they penalize themselves by missing possibly important findings.”
Rothman is right in two important ways. First, researchers should not be penalized for conducting studies that don’t reach statistical significance. Studies that fail to reject the null, or measure a tiny effect, are valuable contributions to our store of knowledge. We tend to overlook and devalue these null results, but that’s a mistake, and one of the major benefits of preregistration is protecting and rewarding them.
Second, it’s important to investigate potential leads that might not pan out. As Isaac Asimov may or may not have said, “The most exciting phrase in science is not ‘Eureka!’ but ‘That’s funny…’”; and it’s important to follow up on those unexpected, funny-looking results. After all, we have to find hypotheses somewhere.
But undirected exploration is, very specifically, not hypothesis testing. Rothman suggests that we often want to “earmark for further investigation” these unexpected findings. But hypothesis testing isn’t designed to flag ideas for future study; instead a hypothesis test concludes the study, with (in theory) a definitive answer. Rothman’s goals are correct and important, but hypothesis testing and statistical significance aren’t the right tools for those goals.^{12}
At some point, though, we do generate some hypotheses.^{13} If we’re studying how memory interacts with speech, we might hypothesize that describing a face verbally will make you worse at recognizing it later, which gives us something concrete to test. Or, more tenuously, if we’re studying the ways that sexism affects decision-making, we might hypothesize that hurricanes with feminine names are more deadly because people don’t take them as seriously.
And then we can test these hypotheses, and reject the null or not, and then—what? What does that tell us?
We have a problem, because these hypotheses aren’t the questions we really want to answer. If installing air filters in classrooms increases measured learning outcomes, that’s a fairly direct answer to the question of whether installing air filters in classrooms can help children learn, so a hypothesis test really can answer our question. But we shouldn’t decide that sexism is fake just because feminine names probably don’t make hurricanes deadlier!^{14} We should only care about the hurricane-names thing if we think it tells us something about our actual, real-world concerns.
And that means we can’t just test one random hypothesis relating to our big theoretical question and call it a day. We need to develop hypotheses that are reasonably connected to the questions we care about, and we need to approach those questions from many different perspectives to make sure we’re not missing anything. That means there’s a ton of work other than hypothesis testing that we need to do if we want our hypothesis tests to tell us anything useful:^{15}
Without all this work, we can come up with hypotheses, but they won’t make sense. We can run experiments, but we can’t interpret them. And we can do hypothesis tests, but we can’t use them to answer big questions.
And sometimes we have a direct question that presents a clear experiment to run, but not a clear hypothesis. Questions like “How effective is this cancer drug?” or “how malleable is this steel alloy?” aren’t big theoretical questions, but also aren’t specific hypotheses that can be right or wrong. We want numbers.
In practice we often use hypothesis testing to answer these questions anyway—but with an awkward kludge. We can test a null hypothesis like “this public health program doesn’t save lives”. If we fail to reject the null, we conclude that it doesn’t help at all; if we do reject the null, we see how many lives the program saved in our experiment, and use that as an estimate of its effectiveness.
This works well enough that we kinda get away with it, but it introduces consequential biases into our measurements. If the measured effect is small, we round it down to zero, concluding there is no benefit when there may well be a small but real benefit (or a small but real harm). And if significant studies are more likely to be seen than non-significant studies, we will see more unusually good results than unusually bad ones, which means we will believe basically everything is more effective than it actually is.^{17}
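A quick simulation shows this inflation, sometimes called the “winner’s curse”. The setup is a hypothetical z-test with a true effect of \(0.2\) standard deviations and a fairly small sample, so most individual studies don’t reach significance:

```python
import math
import random

random.seed(3)
true_effect, n, trials = 0.2, 30, 20_000
# |sample mean| needed for significance in a z-test with known sigma = 1
threshold = 1.96 / math.sqrt(n)

estimates = [sum(random.gauss(true_effect, 1) for _ in range(n)) / n
             for _ in range(trials)]
significant = [m for m in estimates if abs(m) > threshold]

avg_all = sum(estimates) / len(estimates)
avg_sig = sum(significant) / len(significant)
print(f"true effect:                   {true_effect}")
print(f"average estimate, all studies: {avg_all:.2f}")
print(f"average of significant only:   {avg_sig:.2f}")
```

Averaged over every study, the estimates are fine; averaged over only the significant ones, the effect looks roughly twice its true size, because only the studies that got lucky cleared the bar.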
We shouldn’t be surprised that hypothesis testing does a bad job of measuring things, because hypothesis testing isn’t designed to measure things. It’s specifically designed to not report a measurement, and just tell us whether we should act or not. It’s the wrong tool for this job.
We can and should do better. A study in which mortality decreases by \(0.1\)% is evidence that the program works—possibly weak evidence, but still evidence! And if we skip the hypothesis testing and put measurement first, we can represent that fact accurately.
The simplest thing to do would be to just average all our measurements and report that number. This is a type of point estimate, the single number that most accurately reflects our best guess at the true value of whatever we’re measuring.
But a point estimate by itself doesn’t give as much information as we need. We need to measure our uncertainty around that estimate, and describe how confident we are in it. A drug that definitely makes you a bit healthier is very different from one that could save your life and could kill you, and it’s important to be clear which one we’re talking about.
We can supplement our point estimate with a confidence interval, also called a compatibility interval, which is sort of like a backwards hypothesis test. We give all the values that are compatible with our measurement—values that would make our estimate relatively unsurprising. Rather than starting with a single null hypothesis and checking whether our measurement is compatible with it, we start with the measurement, and describe all the hypotheses that would be compatible.
The definition is a bit more technical, and easy to get slightly wrong: If we run \(100\) experiments, and generate a \(95\)% confidence interval for each experiment, then the true value will lie in about \(95\) of those intervals. A common mistake is to say that if we generate one confidence interval, the true value has a \(95\)% chance of landing in it, but that’s backwards, and not quite right.^{18} But before we run the experiment, we expect a \(95\%\) chance that the true value will be in the confidence interval we compute.
Each vertical bar is a compatibility interval from one experiment, with a circle at the point estimate. Three of the intervals don’t include the true value, which is roughly \(5\)% of the \(50\) intervals. Image by Randy.l.goodrich, CC BY-SA 4.0
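The repeated-procedure definition is easy to act out in code. This sketch uses a z-interval with known variance and an arbitrary made-up true mean, which the simulation knows but the “experimenter” doesn’t:

```python
import math
import random

random.seed(4)
true_mean, n, trials = 3.7, 25, 5_000
halfwidth = 1.96 / math.sqrt(n)  # 95% CI half-width for known sigma = 1

covered = 0
for _ in range(trials):
    sample_mean = sum(random.gauss(true_mean, 1) for _ in range(n)) / n
    # Each interval either contains the fixed true value or it doesn't
    covered += (sample_mean - halfwidth) <= true_mean <= (sample_mean + halfwidth)

coverage = covered / trials
print(f"fraction of 95% intervals containing the true value: {coverage:.3f}")
```

The probability statement is about the procedure, which catches the true value \(95\)% of the times we run it, not about any one interval it produces.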
Mathematically, these intervals are closely related to hypothesis tests. A result is statistically significant if the null hypothesis (often \(0\)) lies outside the compatibility interval. So in a sense compatibility intervals are just giving the same information as a hypothesis test, just in a different format. But changing the format shifts the emphasis of our work, and the way we think about it. Rather than starting by picking a specific claim and then saying yes or no, we give a number, and talk about what theories and models are compatible with it. This avoids needing to pick a specific hypothesis. It also gives our readers more information, rather than compressing our answer into a simple binary.
Focusing on compatibility intervals can also help avoid publication bias, and make it easier to use all the data that’s been collected. When we report measurements and compatibility intervals, we can’t “fail to reject” a null hypothesis. Every study will succeed at producing an estimate, and a compatibility interval, so every study produces knowledge we can use, and no study will “fail” and be hidden in a file drawer. Some studies might be designed and run better than others, and so give more precise estimates and narrower compatibility intervals. We can give more weight to these studies when forming an opinion. But we won’t discard a study just for yielding an answer we didn’t expect.
Throughout this series, we’ve used the language and perspective of frequentist statistics. This is the older and more classical approach to statistics, which defines probability in terms of repeated procedures. “If we test a true null hypothesis a hundred times, we’ll only reject it about five times”. “If we run this sampling procedure a hundred times, the compatibility interval will include the true value about \(95\) times.” This approach to probability is philosophically straightforward, and leads to relatively simple calculations.
But there are questions it absolutely can’t answer—like “what is the probability my null hypothesis is true?”—since we can’t frame them in terms of repeated trials. Remember, the \(p\)-value is not the probability the null is false. Its definition is a complicated conditional hypothetical that’s hard to state clearly in English: it’s the probability that we would observe what we actually did observe under the assumption that the null hypothesis is true. This is easy to compute, but it’s difficult to understand what it means (which is why I wrote like six thousand words trying to explain it).
But there’s another school of statistics that can produce answers to those questions. Bayesian inference, which I’ve written about before, lets us assign probabilities to pretty much any statement we can come up with. This is great, because it can directly answer almost any question we actually have. But it’s also much, much harder to use, because it requires much more data and more computation. And the bigger and more abstract the question we ask, the worse this gets.
Bayesian inference needs three distinct pieces of information:

- The *prior probability* that our hypothesis is true, before we collect any data;
- The probability of observing our data if the hypothesis is true;
- The probability of observing our data if the hypothesis is false.
Then we run an experiment, collect data, and use a formula called Bayes’s theorem to produce a posterior probability, our final estimate of the likelihood our hypothesis is true.^{19}
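As a minimal sketch of that update (the function name and the numbers here are my own illustration, not from the post):

```python
def bayes_posterior(prior, p_data_if_true, p_data_if_false):
    """Bayes's theorem: P(H | data) = P(data | H) P(H) / P(data),
    where P(data) averages over the hypothesis being true or false."""
    p_data = p_data_if_true * prior + p_data_if_false * (1 - prior)
    return p_data_if_true * prior / p_data

# A 10% prior, with data 16 times likelier if the hypothesis is true:
posterior = bayes_posterior(0.10, 0.80, 0.05)
print(f"{posterior:.2f}")  # → 0.64
```

The two likelihood arguments are the "two \(p\)-values" the calculation needs, and the denominator is where every possible way the hypothesis could be false has to be accounted for.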
That’s a lot more complicated! First of all, we have to compute two \(p\)-values, not just one. But second, we calculate the extra \(p\)-value under the assumption that “our hypothesis is false”, and that covers a lot of ground. If our hypothesis is that some drug prevents cancer deaths, then the alternative includes “the drug does nothing”, “the drug increases cancer deaths”, “the drug prevents some deaths and causes others”, and even silly stuff like “aliens are secretly interfering with our experiments”. To do the Bayesian calculation we need to list every possible way our hypothesis could be false, and compute how likely each of those ways is and how plausible each one makes our data. That gets very complicated very quickly.
(In contrast, Fisher’s approach starts by assuming the null hypothesis is true, and ignores every other possibility. This makes the calculation much easier to actually do, but it also limits how much we can actually conclude. High \(p\)-value? Nothing weird. Low \(p\)-value? Something is weird. But that’s all we learn.)
And third, even if we can do all those calculations somehow, we need that prior probability. We want to figure out how likely it is that a drug prevents cancer. And as the first step, we have to plug in…the probability that the drug prevents cancer. We don’t know that! That’s what we’re trying to compute!
Bayesian machinery is great for refining and updating numbers we already have. And the more data we collect, the less the prior probability matters; we’ll eventually wind up in the correct place. So in practice, we just pick a prior that’s easy to compute with, plug it into Bayes’s theorem, and try to collect enough data that we expect our answer to be basically right.
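A toy illustration of the prior washing out (a coin-bias example of my own, using conjugate Beta priors; not from the post):

```python
# A Beta(a, b) prior on a coin's heads-probability, updated on
# h heads in n flips, becomes Beta(a + h, b + n - h).
def posterior_mean(a, b, heads, flips):
    return (a + heads) / (a + b + flips)

heads, flips = 700, 1000

flat = posterior_mean(1, 1, heads, flips)       # no-opinion prior
stubborn = posterior_mean(50, 5, heads, flips)  # prior insisting the bias is ~0.91

print(f"{flat:.3f} vs {stubborn:.3f}")  # → 0.700 vs 0.711
```

After a handful of flips these two priors would disagree badly; after a thousand flips the posterior means differ by about a percentage point, and the gap keeps shrinking as data accumulates.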
And that brings us back to where we began, with replication. The more experiments we run, the more we can learn.
I closed out part 2 with an xkcd statistics tip: “always try to get data that’s good enough that you don’t need to do statistics on it.” Here at the end of part 3, we find ourselves in exactly the same place. But this time, I hope you see that tip, not as a punchline, but as actionable advice.
Modern hypothesis testing “works”, statistically, as long as you ask exactly the questions it answers, and are extremely careful in how you use it. But we often misuse it by collecting flawed or inadequate data and then drawing strong, sweeping conclusions. We run small studies and then \(p\)-hack our results into significance, rather than running the careful, expensive studies that would genuinely justify our theoretical claims. We report the results as over-simplified yes-or-no answers rather than trying to communicate the complicated, messy things we observed. And if we manage to reject the null on one study, we issue press releases claiming it confirms all our grand theories about society.
Too often, we use statistics to help us pretend bad data is actually good.
In this essay we’ve seen a number of possible solutions, but they’re basically all versions of “collect more and better data”:

- Replicate studies, and trust results more once they’ve been replicated;
- Run larger, more careful studies instead of \(p\)-hacking small ones;
- Report estimates and compatibility intervals instead of binary reject-or-don’t verdicts.
Even the Bayesian approach comes back to this. Bayesianism relies on the prior probability; but that really just means that, if we already have some knowledge before we run the experiment, we should use it!
Statistics is powerful and useful. We couldn’t do good science without it. But data—empirical observation—is the core of science. Statistics helps us understand the data we have, and it helps us figure out what data we need. But if our data sucks, statistics alone cannot save us.
Have questions about hypothesis testing? Is there something I didn’t cover, or even got completely wrong? Do you have a great idea for doing science better? Tweet me @ProfJayDaigle or leave a comment below.
We could reverse this, and err on the side of acting, if we think wrongly doing nothing has worse downsides than wrongly acting. But it’s pretty uncommon to do it that way in practice. ↵Return to Post
We’ve seen the effects of this unnecessary dichotomization over and over again during the pandemic. We argued about whether masks “work” or “don’t work”, rather than discussing how well different types of masks work and how we could make them better. I know people who are still extremely careful to wear masks everywhere, but who wear cloth masks rather than medical—a combination that makes very little sense outside of this false binary. ↵Return to Post
There are variants of hypothesis testing that help you show some null hypothesis is (probably) basically right. But they’re not nearly as common as the more standard setup. ↵Return to Post
Nosek, Spies, and Motyl write about the experience of carefully replicating some interesting work before publication, and seeing the effect vanish: "Incentives for surprising, innovative results are strong in science. Science thrives by challenging prevailing assumptions and generating novel ideas and evidence that push the field in new directions. We cannot expect to eliminate the disappointment that we felt by “losing” an exciting result. That is not the problem, or at least not one for which the fix would improve scientific progress. The real problem is that the incentives for publishable results can be at odds with the incentives for accurate results. This produces a conflict of interest….The solution requires making incentives for getting it right competitive with the incentives for getting it published." ↵Return to Post
The result we’ve found doesn’t necessarily mean what we think it means, and that is its own tricky problem. But if you get a consistent effect then you’ve found something even if you don’t understand it yet. ↵Return to Post
If a drug is wrongly approved, we continue learning about it through observation of the patients taking it. This is, for instance, how we can be quite certain that the covid vaccines are effective and extremely safe. But if we don’t approve a drug, there’s no followup data to analyze, and the drug stays unapproved. ↵Return to Post
My favorite suggestion comes from Daniel Quintana, who wants undergraduate psychology majors to contribute to replication efforts for their senior thesis research. Undergraduate research is often more about developing methodological skill than about producing genuinely innovative work, so it’s a good fit for careful replication of already-designed studies. ↵Return to Post
You might wonder if a result that depends heavily on minor differences in study technique can actually be telling us anything important. That’s a very good question. It’s very easy to run a hypothesis test that basically can’t tell us anything interesting; we’ll come back to this later in the piece. ↵Return to Post
Somewhat infamously, Fisher stubbornly resisted the claim that smoking caused cancer until his death. But he never denied the correlation, which was too dramatic to hide. ↵Return to Post
As long as two factors have any relationship at all, the effect won’t be exactly zero, and with enough data we’ll be able to reject the null hypothesis that there’s no effect. But that just means “is the effect exactly zero” is often the wrong question; instead we want to know if the effect is big enough to matter. ↵Return to Post
The odds of getting no false positives after \(n\) trials are \(0.95^n\), so the odds of getting a false positive are \(1 - 0.95^n\). And \(0.95^{20} \approx 0.358\), so \(1 - 0.95^{20} \approx 0.642\).
It’s a little surprising this is so close to \(2/3\), but there’s a reason for it—sort of. If you compute \( (1- 1/n)^n\) for large \(n\), you will get approximately \(1/e\), so the odds of getting a false positive at a \(1/20\) false positive threshold after \(20\) trials are roughly \(1-1/e \approx .63\). ↵Return to Post
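The arithmetic in this footnote is easy to check directly (a quick sanity-check sketch, not part of the original post):

```python
import math

n, alpha = 20, 0.05
p_all_clean = (1 - alpha) ** n  # chance of no false positives in 20 trials
p_false_pos = 1 - p_all_clean   # chance of at least one false positive

print(f"{p_all_clean:.3f} {p_false_pos:.3f}")  # → 0.358 0.642
print(f"{1 - 1 / math.e:.3f}")                 # → 0.632, the large-n limit
```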
From what I can tell, Rothman may well agree with me. His twitter feed features arguments against using statistical significance and dichotomized hypotheses in place of estimation, which is roughly the position I’m advocating. But if you’re doing hypothesis testing, you should try to do it correctly. ↵Return to Post
You might notice that I’m not really saying anything about where we find these hypotheses. There’s a good reason for that. Finding hypotheses is hard! It’s also the most creative and unstructured part of the scientific process. The question is important, but I don’t have a good answer. ↵Return to Post
For that matter, if feminine hurricane names were less dangerous we could easily tell a story about how that was evidence for sexism. That’s the garden of forking paths popping up again, where many different results could be evidence for our theory. ↵Return to Post
In their wonderfully named (and very readable) paper Why hypothesis testers should spend less time testing hypotheses, Anne Scheel, Leonid Tiokhin, Peder Isager, and Daniël Lakens call this the derivation chain: the empirical and conceptual linkages that allow you to derive broad theoretical claims from the specific hypotheses you test. ↵Return to Post
This is one of the major skills you develop in math courses, because a lot of the work of math is figuring out what question you’re trying to answer. I’ve written about this before, but I also recommend Keith Devlin’s excellent post on what “mathematical thinking” is, especially the story he tells after the long blockquote. ↵Return to Post
We also sometimes find that our conclusions depend on exactly which questions we ask. Imagine a study where we need a \(5\)% difference to be significant, and Drug A produces a \(3\)% improvement over placebo and Drug B produces a \(7\)% improvement. Then the effect of Drug A isn’t significant, and the effect of Drug B is, so we say that Drug A doesn’t work and Drug B does.
But the difference between Drug A and Drug B is not significant—so if we ask that question, we conclude that the two drugs are equally good! The difference between "significant" and "not significant" is not itself statistically significant, so it matters exactly which hypothesis we choose to test. ↵Return to Post
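Using this footnote’s toy numbers (and its simplification that the same \(5\)-point threshold applies to every comparison, which glosses over the standard errors a real analysis would need):

```python
THRESHOLD = 5.0  # percentage points of improvement needed for "significance" (toy setup)

drug_a = 3.0  # improvement over placebo
drug_b = 7.0

def significant(diff):
    return abs(diff) >= THRESHOLD

print(significant(drug_a))           # → False: "Drug A doesn't work"
print(significant(drug_b))           # → True:  "Drug B works"
print(significant(drug_b - drug_a))  # → False: "the two drugs are equally good"?!
```

Three perfectly valid comparisons of the same data, and the verdict depends entirely on which one we choose to run.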
Sometimes we can look at our interval after the fact and make an informed guess whether it’s one of the good intervals or the bad intervals. If I run a small study to measure average adult heights, there’s some risk I get a \(95\)% confidence interval that contains, say, everything between five feet and six feet. Based on outside knowledge, I’m pretty much \(100\)% confident in that interval, not just \(95\)%. ↵Return to Post
We saw examples of this calculation in part 2, when we calculated what fraction of positive results were true positives. Note that we had to make assumptions about what fraction of null hypotheses are true; that’s the Bayesian prior probability. Tables like the ones we used there show up a lot in simple Bayesian calculations. ↵Return to Post