<p><em>Jay Daigle is a professor of mathematics at The George Washington University in Washington, D.C. In addition to his research in number theory, he brings a mathematical style to thinking about philosophy, politics, social dynamics, and everyday life.</em></p>

<h1 id="motivating-the-integral-with-eulers-method">Motivating the Integral with Euler’s Method</h1>
<p><em>Jay Daigle · March 15, 2023 · <a href="https://jaydaigle.net/blog/motivating-integral-eulers-method">jaydaigle.net</a></em></p>
<p>I have a fun idea for how to teach and think about the integral in the context of freshman calculus. I’ve never actually used this in a class, and I suspect it’s not actually a great idea. But it’s a <em>fun</em> idea and worth at least playing with, even if it’s a bit too weird to help calculus novices understand what’s going on.</p>
<p>But first, I want to mention that if you want to support my writing, I now have a <a href="https://ko-fi.com/jaydaigle">Ko-Fi account</a>. Any tips would be appreciated and would help me write more essays like this.</p>
<h2 id="the-big-ideas-of-calculus-1">The Big Ideas of Calculus 1</h2>
<p>When I teach calculus I emphasize two big ideas: differential equations, and numerical analysis.</p>
<p><strong>Differential equations</strong> generalize the concept of “rate of change”, and they’re the core of why calculus is <em>useful</em>: you can describe the rules a system follows, encode them in math, and draw conclusions. Calculus 1 students don’t have the tools to solve differential equations, but they can—and should—understand how a sentence like “the acceleration is proportional to the displacement” relates to the equation \(y’’ = -ky\).</p>
<p><strong>Numerical approximation</strong> is often the <em>way</em> we use calculus, and increasingly so as computers are more powerful and available. I motivate the derivative with the idea of linear approximation: if I want to pretend my function is a line, and write \(f(x) = f(a) + m (x-a)\), what number \(m\) will do the best job? This develops into other methods for approximating the answers to questions that are too hard to answer directly: it leads into ideas like quadratic approximation and <a href="https://en.wikipedia.org/wiki/Newton's_method">Newton’s method</a>, and provides a foundation for numerical integration and Taylor series in Calculus 2.</p>
<h2 id="eulers-method">Euler’s Method</h2>
<p>If we combine these two ideas, we can try to numerically approximate the solution to a differential equation. Suppose we have a differential equation \(f’(t) = f(t) - f(t)^2/2\), and we know the initial condition that \(f(0)=1\). If we want to know \(f(3)\) we can get a rough guess with a linear approximation: we know \(f(0) = 1 \) and thus that \(f’(0) = 1 - \frac{1^2}{2} = \frac{1}{2} \), so we get</p>
<p>\[
f(3) \approx f(0) + f’(0) (3-0) = 1 + \frac{1}{2} \cdot 3 = \frac{5}{2}.
\]</p>
<p>That’s only a rough estimate; linear approximation generally isn’t very accurate when the starting point and ending point aren’t close together. In fact the true value is \( \frac{2e^3}{e^3+1} \approx 1.905\), which isn’t terribly far off from \(2.5 \) but isn’t especially close either. But this is the best estimate we can really get using only \(f(0)\) and \(f’(0)\).</p>
<p>However, we know a lot more than that, because we have a formula for \(f’(x)\). It’s a bit hard to use, because we need to know \(f(x)\) to compute \(f’(x)\); but we know we can approximate \(f(x_2)\) if we already know \(f(x_1)\) and \(f’(x_1)\). That allows us to do a recursive calculation:</p>
<p class="theorem">\[
\begin{array}{rl}
f(1) & \approx f(0) + f’(0)(1-0)
= 1 + \left(1 - \frac{1^2}{2}\right) \cdot 1 = \frac{3}{2}, \\
f(2) & \approx f(1) + f’(1)(2-1)
\approx \frac{3}{2} + \left(\frac{3}{2} - \frac{\left(\frac{3}{2}\right)^2}{2}\right) \cdot 1 = \frac{15}{8}, \\
f(3) & \approx f(2) + f’(2)(3-2)
\approx \frac{15}{8} + \left(\frac{15}{8} - \frac{\left(\frac{15}{8}\right)^2}{2}\right) \cdot 1 = \frac{255}{128}.
\end{array}
\]</p>
<p>Thus we estimate \(f(3) \approx \frac{255}{128} \approx 1.99\).</p>
<p>This still isn’t an exact value for \(f(3)\); but this approximation is much better than our first try. And if this isn’t close enough, we can do even better by breaking our approximation into more steps: with six steps we get \(f(3) \approx 1.95\) and with sixty we get \(f(3) \approx 1.909\). More steps take more work, but also give us a more precise answer.</p>
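<p>The recursion is mechanical enough that it’s easy to carry out in code. Here is a minimal sketch in Python (the function name <code>euler</code> and its signature are my own choices, not anything standard):</p>

```python
def euler(n, t_end=3.0, y0=1.0):
    """Approximate f(t_end) for f'(t) = f(t) - f(t)**2 / 2 with f(0) = y0,
    using n steps of Euler's method (repeated linear approximation)."""
    h = t_end / n  # step size
    y = y0
    for _ in range(n):
        y += h * (y - y * y / 2)  # step along the tangent line
    return y

# euler(3) recovers the hand computation 255/128; more steps creep
# toward the true value 2e^3 / (e^3 + 1), about 1.905.
```

<p>With <code>n = 3</code> this reproduces \(\frac{255}{128}\) exactly, and <code>euler(60)</code> lands near \(1.909\), matching the figures above.</p>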
<p>This approach is known as <a href="https://en.wikipedia.org/wiki/Euler_method">Euler’s method</a>, and it allows us to numerically approximate the solution of any first-order ordinary differential equation given an initial condition. With a little bit of work, we can generalize this to any ordinary differential equation; it’s quite straightforward and flexible.</p>
<p>It’s also basically just integration.</p>
<h2 id="what-is-an-integral">What is an integral?</h2>
<p>In a typical calculus course, we motivate the integral with the area problem: we have the graph of some function, and we want to find the area under that curve. We can approximate that area by chopping it up into rectangles, which gives us the Riemann sum. And then as the number of rectangles approaches infinity our approximation gets really good, which allows us to define the integral.</p>
<p><img src="/assets/blog/teach-in-a-society/Riemann_sum_(leftbox).gif" alt="An animation of a Riemann sum as the number of rectangles goes to infinity" class="blog-image center" /></p>
<p class="theorem">\[
\int_a^b f(t) \,dt = \lim_{n \to \infty} \sum_{k=1}^n f(x_k) \Delta x
\]</p>
<p>This definition has a lot of symbols in it, and is generally intimidating to freshman calculus students. But it does accurately describe what we’re doing and why: the key idea of the integral is to break a calculation into pieces, do an approximation on each piece, and then add the results together. This will give us an approximate answer to our original question; as we use more and smaller pieces, the approximation gets better, and so in the limit we get an exact answer.</p>
<p>So this formula directly answers the question that we’re asking. And when we want to think about <em>applications</em> of the integral, the Riemann sum definition is useful: it helps us figure out what the integral is actually computing, and so what problems it can help solve. But Riemann sums are a huge pain to actually do computations with, so we generally don’t.</p>
<p>Instead, we rely on the Fundamental Theorem of Calculus, which comes in two parts.</p>
<p class="theorem"><strong>Fundamental Theorem of Calculus, Part 1:</strong> <br />
Given a function \(f(x)\) and a number \(a\), we can define a new function \(F(x) = \int_a^x f(t) \,dt\). Then \(F’(x) = f(x)\).</p>
<p>Part 1 tells us that the derivative undoes the integral; the derivative of the integral of \(f\) is just \(f\). This is conceptually cool, and it does allow us to compute <em>something</em>. But it doesn’t directly help us compute the integral. Instead, we use it to prove<strong title="This proof relies heavily on specific special properties of the real numbers, and in particular the property that if f'(x)=0 then f(x) is constant. This isn't true if we allow functions to be defined solely for rational numbers; the real numbers are exactly the set that makes it work."><sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup></strong> a second statement.</p>
<p class="theorem"><strong>Fundamental Theorem of Calculus, Part 2:</strong> <br />
If \(F’(x) = f(x)\), then \(\int_a^b f(t) \,dt = F(b) - F(a)\).</p>
<p>This is also known as the Evaluation Theorem, or sometimes the Net Change Theorem. And it’s the tool we actually use in practice to compute integrals—to the extent that people mainly associate “integration” with finding the antiderivative \(F(x)\), not with finding the <em>number</em> corresponding to the area under the curve.</p>
<p>And this all works, but we’ve moved pretty far away from the original question, and the connections pass through some relatively abstract territory. It’s hard to really intuitively see how this calculation relates to the original question.</p>
<p>Maybe there’s a better way.</p>
<h3 id="the-antiderivative-as-a-differential-equation">The antiderivative as a differential equation</h3>
<p>Let’s start by asking this question backwards. Suppose there’s some function you’re interested in, but you don’t have a formula for it. Instead you just have a formula for the derivative. In practical terms, this happens in <a href="https://en.wikipedia.org/wiki/Dead_reckoning">dead reckoning</a>: if you can’t measure where you are, but you know where you started and how fast you’re moving, you can estimate where you end up.</p>
<p>So suppose we know our speed \(F’(x)\), and our starting position \(F(a)\), and we want a way to figure out our current position \(F(x)\). We want to compute an antiderivative! The FTC part 2 tells us that \(F(x) = F(a) + \int_a^x F’(t) \,dt \), so we could figure this out by doing an integral. But I want to follow a different thought process.</p>
<p>We can start by saying, we know what \(F(a)\) is, and since we have a formula for \(F’(x)\), we can compute \(F’(a)\). Then we can use the linear approximation formula to estimate
\[
F(x) \approx F(a) + F’(a) (x-a).
\]
So if we know, say, that \(F(1)=3\) and \(F’(x) = 3x^2\), we can estimate that \(F(5) \approx 3 + 3(5-1) = 15\).</p>
<p>Linear approximation gives a pretty decent estimate if \(x\) and \(a\) are close, but if they’re far apart it’s not very good. Consequently it doesn’t really work here: in reality \(F(5) = 127\).</p>
<p>But we can improve this exactly the same way we did before, by using Euler’s method! The problem is that the two points on my linear approximation are too far apart. But we can try to approximate the value of \(F\) somewhere closer to \(1\), like at \(3\).</p>
<p>\[
F(3) \approx F(1) + F’(1)(3-1) = 3 + 3(2) = 9.
\]
And then, since we also know \(F’(3) = 27\) I can estimate
\[
F(5) \approx 9 + 27(5-3) = 63.
\]
Still not right, but much better! And we can improve even further by doing more steps:
\[
\begin{array}{rl}
F(2) & \approx F(1) + F’(1)(2-1) = 3 + 3 = 6, \\
F(3) & \approx F(2) + F’(2)(3-2) = 6 + 12 = 18, \\
F(4) & \approx F(3) + F’(3)(4-3) = 18 + 27 = 45, \\
F(5) & \approx F(4) + F’(4)(5-4) = 45 + 48 = 93.
\end{array}
\]
This still isn’t quite right, but it’s even closer; and as we take more and more smaller and smaller steps, we’ll get a better and better approximation.</p>
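<p>The same loop works for this dead-reckoning problem. A sketch, hard-coding the specific \(F’(x) = 3x^2\) and \(F(1) = 3\) from above (the name <code>euler_antideriv</code> is mine):</p>

```python
def euler_antideriv(n, a=1.0, b=5.0, F_a=3.0):
    """Approximate F(b) given F'(x) = 3x^2 and F(a) = F_a,
    using n Euler steps of size (b - a) / n."""
    h = (b - a) / n
    x, F = a, F_a
    for _ in range(n):
        F += h * (3 * x * x)  # linear approximation: F(x + h) ~ F(x) + F'(x) h
        x += h
    return F

# One step gives 15, two steps give 63, four give 93,
# and thousands of steps approach the true value F(5) = 127.
```

<p>Each doubling of the step count roughly halves the error, which is why the estimates \(15, 63, 93, \ldots\) close in on \(127\).</p>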
<h3 id="riemann-sums-as-eulers-method">Riemann Sums as Euler’s Method</h3>
<p>This is basically Euler’s method. But why is it an integral? Let’s reorganize the calculation to make it clearer what’s happening.
\[
\begin{array}{rl}
F(5)
& \approx F(4) + F’(4)(5-4) \\
& \approx F(3) + F’(3)(4-3) + F’(4)(5-4) \\
& \approx F(2) + F’(2)(3-2) + F’(3)(4-3) + F’(4)(5-4) \\
& = F(1) + F’(1)(2-1) + F’(2)(3-2) + F’(3)(4-3) + F’(4)(5-4) \\
& = 3 + 3 \cdot 1 + 12 \cdot 1 + 27 \cdot 1 + 48 \cdot 1 = 93.
\end{array}
\]
At this point this should be starting to look familiar. We’re taking a bunch of steps of size \(1 = \Delta x\), and for each step we’re multiplying it by the derivative at some \(x\) value. So we just computed
\[
F(5) \approx F(1) + \sum_{k=1}^4 F’ \big( 1 + (k-1) \cdot 1 \big) \cdot 1.
\]
More generally, if we take \(n\) steps we get
\[
F(5) \approx F(1) + \sum_{k=1}^n F’\big( 1 + (k-1) \Delta x \big) \Delta x.
\]
And the sum on the right-hand side is <em>almost</em> exactly a Riemann sum; in fact, it’s a Riemann sum plus the extra term \(F(1)\). If we rearrange it we get
\[
F(5) - F(1) \approx \sum_{k=1}^n F’\big( 1 + (k-1) \Delta x \big) \Delta x.
\]</p>
<p>I see two ways to think about this formula. One is that the <em>indefinite</em> integral contains a \(+C\) term, because antiderivatives aren’t unique. So while \(\int F’(t) \,dt\) is <em>an</em> antiderivative of \(F’(x)\), we don’t necessarily get the same function as our original \(F(x)\). Instead, the FTC just guarantees we have \(F(x) +C\), and \(F(1)\) is just the \(+C\) term.</p>
<p>But what seems clearer to me is that we’re really computing the <em>change</em> in the value of \(F\). This should make physical sense: the calculations with the speed tell us how far we’ve moved, not where we are. Thus the Euler’s method calculation tells us our <em>displacement</em>; but if we add that to our starting position, we find our ending position.</p>
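<p>That displacement reading is easy to check numerically: a left-endpoint Riemann sum of \(F’\) over \([1, 5]\) should converge to \(F(5) - F(1) = 127 - 3 = 124\). A sketch (the helper name <code>left_riemann</code> is my own):</p>

```python
def left_riemann(fprime, a, b, n):
    """Left-endpoint Riemann sum of fprime over [a, b] with n equal pieces."""
    dx = (b - a) / n
    return sum(fprime(a + k * dx) * dx for k in range(n))

# With fprime(x) = 3x^2 on [1, 5]: four pieces give 90, and 90 + F(1) = 93
# is exactly the four-step Euler estimate; many pieces approach 124.
```

<p>So the Euler steps and the Riemann sum really are the same computation, offset by the starting value \(F(1)\).</p>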
<h2 id="is-this-a-good-idea">Is this a good idea?</h2>
<p>Mathematically, this all works out. It’s a cute argument and I’m glad I’ve found it. But there are plenty of fun math ideas that don’t belong in a freshman calculus course.</p>
<p>This approach has one obvious, major disadvantage: no one else teaches it like this, so it would probably leave students confused if they go on to take another course with someone else. And that’s probably enough to make it not worth doing<strong title="Or at least not worth doing as the motivation to the integral. I think it's fine to do this as a _followup_, or an application of the integral. If you have an extra day to spend on integration, this isn't the worst thing you could do. But if you have extra days in your calculus syllabus please tell me how you got them."><sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup></strong>, on its own.</p>
<p>But while that’s a real obstacle to adopting this approach in one class, it’s also kind of dodging the interesting questions about whether this would be a better approach. What if we could get everyone to switch? Should we?</p>
<p>One problem is that this argument isn’t at all rigorous. As long as we believe that Euler’s method will converge to the right answer, then the integral will as well; but I don’t know how you’d prove that Euler’s method converges without referencing the integral, so that seems fairly circular.</p>
<p>That objection seems fatal to me—in an upper-division Real Analysis course. In a freshman calculus course, nothing is ever going to be fully rigorous, and the proofs involving Riemann sums especially won’t be because getting the technical details of Riemann sums correct is <em>hard</em>. So I don’t mind a little non-rigor, especially if it helps students develop a clear intuitive understanding of what we’re trying to do.</p>
<p>In fact, having to avoid some of the abstraction involved in proving the Fundamental Theorem of Calculus might be a win, overall. That’s one of those lectures where I’m always confident my students aren’t <em>really</em> following the details, and are just hanging on trying to survive until we get back to computing things. On the other hand, it’s good for them to see some abstract formalism, even if they’re not ready to fully understand it yet. You have to see your first scary proof sometime!</p>
<p>Another problem is that this derivation captures the relationship between the Riemann sum and the antiderivative, but presents it exactly backwards. In most applications, the Riemann sum is the question we want to answer; the antiderivative is the tool we use to answer it. But the Euler’s method approach treats the antiderivative as the question, and the Riemann sum as the way we compute the answer—which is completely wrong since the Riemann sum is nearly impossible to compute outside of the simplest cases. I think this is a really deep problem with this approach. One of the big ideas I want my students to engage with is figuring out the difference between identifying a question, and computing the answer; giving it to them backwards seems like an obstacle to developing that understanding.</p>
<p>But I do really like the way this approach connects the integral back to the other big ideas in the class. Not just to the derivative; any presentation of the FTC will draw a link between integration and differentiation. But this makes the integral seem <em>connected</em> to the themes of numeric approximation and differential equations, which ties the entire course together neatly.</p>
<p>And really, that sums it up, I think. It’s always nice to tell a neat story that ties the whole class together. But it probably isn’t as important as making sure our students understand each piece well on its own. I have to resist the temptation to do something pretty, and elegant, and unnecessarily confusing. So this is a fun idea, but for now I’m going to teach this normally.</p>
<hr />
<p><em>Do you have a clever way to motivate the integral? Do you think I should actually be using this approach in my course? Any other thoughts on teaching integration? Tweet me <a href="https://twitter.com/profjaydaigle">@ProfJayDaigle</a> or leave a comment below.</em></p>
<hr />
<div class="footnotes">
<ol>
<li id="fn:1">
<p>This proof relies heavily on specific special properties of the real numbers, and in particular the property that if \(f’(x)=0\) then \(f(x)\) is constant. This isn’t true if we allow functions to be defined solely for rational numbers; the real numbers are exactly the set that makes it work. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>Or at least not worth doing as the motivation to the integral. I think it’s fine to do this as a <em>followup</em>, or an application of the integral. If you have an extra day to spend on integration, this isn’t the worst thing you could do. But if you have extra days in your calculus syllabus please tell me how you got them. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>

<hr />

<h1 id="writing-calculus-tests-with-chatgpt">Writing Calculus Tests with ChatGPT</h1>
<p><em>Jay Daigle · March 8, 2023 · <a href="https://jaydaigle.net/blog/writing-calculus-tests-with-ChatGPT">jaydaigle.net</a></em></p>
<p>Last <a href="/blog/not-scared-of-chatbots/">week</a> I talked about the new chatbots, like ChatGPT and Bing’s chat interface. I argued that while they produce language, they can’t really analyze it or check it for errors; and that this is a meaningful restriction that we can’t get past without a serious change in the approach we take to AI systems. So the chatbots won’t be able to fully replace intellectual labor any time soon. But they still might help, especially if we can identify formulaic tasks that don’t require really critically thinking about how ideas connect.</p>
<p>But rather than philosophizing, I decided to get concrete about this. Can I use ChatGPT to make <em>my</em> job easier? It’s going to be pretty useless for the most important parts of my job. In particular, it has no way to figure out why a student is confused and address their confusions. And it’s not going to come up with insightful new ways to describe course topics. It won’t even be able to meaningfully connect distinct ideas in the course, because it has no sense of what’s already been covered.</p>
<p>Instead I need to find the aspects of my job that are <em>routine</em>, and involve following relatively standard templates and filling them out in predictable ways. I need to find tasks that it’s easy for me to check if they’re done right, since ChatGPT is not correct with any consistency. Ideally, I’d also find ways to have it replace the parts of my job that are the most annoying: I don’t <em>want</em> a way to avoid spending time in office hours with students, because office hours are fun!</p>
<p>But one thing I spend a lot of time doing, and don’t enjoy at all, is writing homework and test questions. I need to create original problems (or at least ones that aren’t in the textbook so students can’t look them up), but not <em>too</em> original (so they fit the patterns that my calculus students are supposed to be learning). And unlike all of the rest of my course planning, I need to do new ones every year—I can reuse my old lecture notes, but it’s not safe to reuse my old tests.</p>
<p>So I decided to spend some time experimenting with GPT as a test writer. Can it write good questions? Can it write usable solutions for those questions? And can it do this easily, or is shepherding it through the process more trouble than it’s worth?</p>
<p>But before I tell you what I found, I want to mention that if you want to support my writing, I now have a <a href="https://ko-fi.com/jaydaigle">Ko-Fi account</a>. Any tips would be appreciated and would help me write more essays like this.</p>
<h2 id="the-verdict">The Verdict</h2>
<p>Overall, the current tech seems somewhat useful, but not actually good—at least, not yet. But it’s close enough that I suspect it will get pretty good for this purpose before long.</p>
<h3 id="writing--problems">Writing Problems</h3>
<p>With a couple exceptions, ChatGPT could figure out what type of question I was asking for. If I asked for a related rates problem, or an integration problem that involved integration by parts, I would get one. Sometimes they weren’t quite right, but I could get the general type of problem I asked for, with basically no prompt engineering.</p>
<p>On the other hand, it was hard to get specifics. I can get a big pile of integration by parts problems, but a lot of them will be either very easy or very hard. And ChatGPT gets stuck in ruts; I saw identical problems show up to multiple different prompts, and there were running themes in everything it output. That means that the system can’t give me fine-tuned answers, and also will not give me an even coverage of the relevant types of problems.</p>
<p>But if I have something specific I want, I can probably just write it myself; and even if it won’t give me every type of problem, it can help remind me of my options. I found it genuinely useful for brainstorming problems, even if I didn’t use any of them exactly. (And I am at this moment proctoring a test that includes some problems I wrote with GPT assistance.)</p>
<h3 id="solving-problems">Solving Problems</h3>
<p>On the other hand, the solutions it produced were usually wrong, sometimes spectacularly so. A few times I got a completely correct solution. Most of the time, I would get an answer that had the right approach but did completely nonsensical calculations in the middle; the solutions would look superficially correct, but checking them carefully turned up multiple errors. And occasionally I would get arguments almost completely unrelated to the questions I asked.</p>
<p>But, if anyone does figure out a way to usefully and consistently hook this up to a computer algebra system, it will probably do pretty well at solving problems too. It tended to set up the right computation and then generate a nonsense answer; if it could tell when it needs to just factor a polynomial or compute an integral, and pass that to a computer algebra system, that would fix a lot of the weaknesses.</p>
<p>I know multiple teams are trying to find a way to hook systems like GPT up to computational engines and computer algebra systems. If they could do that effectively it would probably be able to write good solutions immediately, but that really sounds non-trivial to me. You could <em>maybe</em> teach it to pass integrals or other specific calculations to a computer algebra system, read the result, and print it. But translating that into a well-written solution would require some sort of deep integration of the two capabilities, not just an ability to print the final answer.</p>
<p>But one thing did impress me about the solutions: ChatGPT could clearly consistently remember what question it was trying to answer. Every single solution ended with a clear restatement of the question and an answer to it. The answer was usually <em>wrong</em> but it never lost track of what it was supposed to be answering. (And that’s more than I can say for some of my students.) This mostly shows that the question is still in ChatGPT’s context window when it finishes the solution, but also that it’s still <em>using</em> the details of the question. And it’s doing that consistently; I don’t think it failed at this once.</p>
<h3 id="other-observations">Other Observations</h3>
<p>ChatGPT can write perfectly valid and idiomatic LaTeX code, which I was mildly surprised by. I’m used to the fact that it can put out grammatical and idiomatic <em>English</em> text, but I was still impressed by the LaTeX.</p>
<p>Similarly, ChatGPT can draw diagrams using ASCII art. I’m really curious how much of that was in the training set; I wouldn’t have thought it was that common, but it’s clearly common enough. (It did not always accurately describe the diagram it had drawn.)</p>
<p>One major weakness is that ChatGPT doesn’t have a consistent style. Each solution it wrote used completely different notation and phrasing—even when I asked it to respond to the same prompt multiple times.</p>
<h2 id="the-data">The Data</h2>
<p>The rest of this post is a record of things I tried and experiments I conducted in getting ChatGPT to write calculus questions for me. I spent a while playing around, and tried to include everything here as a sort of lab notebook and complete record. If you don’t need to write calculus questions you probably don’t want to read the whole thing.</p>
<p>But I have hidden most of the long GPT exchanges under pulldown spoiler boxes. If you just want to skim through and see my thoughts on each experiment, you can do that; and you can expand out any box you decide you actually want to read.</p>
<p>Or you can jump to <a href="#conclusion">the end</a> and leave me a comment.</p>
<h3 id="maxmin-problems">Max/Min problems</h3>
<p>When I teach calculus 1, my absolute least favorite thing to do is writing test questions for max/min problems. It’s simultaneously extremely routine and extremely tedious. So I decided to start there.</p>
<h4 id="maxmin-take-one">Max/Min take one</h4>
<p class="theorem"><strong>Prompt:</strong> Write a problem where you have to classify the critical points of a function, with a solution.</p>
<details class="example">
<summary>
Response 1
</summary>
<div>
<p>Problem:</p>
<p>Consider the function $f(x) = x^3 - 3x^2 + x - 1$. Determine the critical points of the function and classify them as local maxima, local minima, or inflection points.</p>
<p>Solution:</p>
<p>A critical point of a function is a point where the derivative of the function is equal to zero or is undefined. To find the critical points of the function, we first find its first derivative:</p>
<p>$f’(x) = 3x^2 - 6x + 1$</p>
<p>Setting $f’(x) = 0$, we find the critical points:</p>
<p>$3x^2 - 6x + 1 = 0$</p>
<p>$x = \frac{6 \pm \sqrt{36 - 4 \cdot 3 \cdot 1}}{2 \cdot 3} = \frac{3 \pm \sqrt{3}}{3}$</p>
<p>Let $c_1 = \frac{3 + \sqrt{3}}{3}$ and $c_2 = \frac{3 - \sqrt{3}}{3}$. These are the critical points of the function. To classify them, we need to find the second derivative and check the sign of the second derivative at each critical point. The second derivative of the function is:</p>
<p>$f’‘(x) = 6x - 6$</p>
<p>At $c_1$, the second derivative is $6c_1 - 6 = 6(\frac{3 + \sqrt{3}}{3}) - 6 = 6$, which is positive. This means that $f(c_1)$ is a local minimum.</p>
<p>At $c_2$, the second derivative is $6c_2 - 6 = 6(\frac{3 - \sqrt{3}}{3}) - 6 = -6$, which is negative. This means that $f(c_2)$ is a local maximum.</p>
<p>Therefore, the critical points of the function $f(x) = x^3 - 3x^2 + x - 1$ are:</p>
<p>$c_1 = \frac{3 + \sqrt{3}}{3}$, which is a local minimum.</p>
<p>$c_2 = \frac{3 - \sqrt{3}}{3}$, which is a local maximum.</p>
</div>
</details>
<details class="example">
<summary>
Response 2
</summary>
<div>
<p>Problem: Consider the function $f(x) = x^3 - 6x^2 + 9x - 2$. Classify the critical points of the function as relative maxima, relative minima, or inflection points.</p>
<p>Solution: To classify the critical points, we need to find the critical points and then determine their behavior using the second derivative test.</p>
<p>Finding critical points: To find the critical points, we need to find the points where the first derivative of the function is equal to zero or the first derivative does not exist. The first derivative of the function is given by:
$f’(x) = 3x^2 - 12x + 9$</p>
<p>Setting $f’(x) = 0$, we find that:</p>
<p>$3x^2 - 12x + 9 = 0$</p>
<p>Solving for $x$, we get the critical points:</p>
<p>$x = \frac{12 \pm \sqrt{12^2 - 4 \cdot 3 \cdot 9}}{2 \cdot 3}$</p>
<p>$x = \frac{12 \pm \sqrt{144 - 108}}{6}$</p>
<p>$x = \frac{12 \pm \sqrt{36}}{6}$</p>
<p>$x = \frac{12 \pm 6}{6}$</p>
<p>$x = 3, 1$</p>
<p>So, the critical points are $x = 1, 3$.</p>
<p>Second derivative test: To determine the behavior of the critical points, we use the second derivative test. We find the second derivative of the function and evaluate it at the critical points:
$f’‘(x) = 6x - 12$</p>
<p>$f’‘(1) = 6 \cdot 1 - 12 = -6$</p>
<p>$f’‘(3) = 6 \cdot 3 - 12 = 0$</p>
<p>Since $f’‘(1) < 0$, it means that $x = 1$ is a relative maximum. Since $f’‘(3) = 0$, it means that $x = 3$ is an inflection point.</p>
<p>So, the critical points are:</p>
<p>$x = 1$ is a relative maximum</p>
<p>$x = 3$ is an inflection point.</p>
</div>
</details>
<p>There are a couple things to notice here. First, I did no post-processing on the output at all. ChatGPT wrote perfectly formatted LaTeX code for these answers without me editing it at all.<strong title="I know you're seeing some weird formatting on the second derivative, but that's not actually ChatGPT's fault; that has to do with a bug in the way LaTeX compiles through MathJax, which is what allows me to display it on the blog. I could fix the display issue but I wanted to keep the output genuinely unedited."><sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup></strong></p>
<p>Second is that these are both fairly simple problems. My next step is going to have to be seeing if I can get more interesting examples out of it; it went for a simple cubic both times.</p>
<p>Third is that the solution style is inconsistent. ChatGPT doesn’t write up solutions the way I would, which is to be expected. But it also doesn’t write them up the same way as itself. The two solutions use different notations and different phrasings. This doesn’t make it <em>wrong</em>, but it isn’t great as a teaching tool; it’s good for students to see a certain amount of consistency, especially as beginners.</p>
<p>And fourth is that despite the nice organization and formatting, both of these solutions are almost, but not quite, right:</p>
<ul>
<li>
<p><strong>Response 1</strong> takes the quadratic equation \(3x^2-6x+1 =0\), correctly uses the quadratic formula, and then simplifies the solutions to \( \dfrac{3 \pm \sqrt{3}}{3}\). But the actual solutions are \(\dfrac{3 \pm \sqrt{6}}{3}\).</p>
<p>Then when it plugs these points into the second derivative, it incorrectly computes \(\pm 6\) when it should get \(\pm 2 \sqrt{6}\).</p>
<p>These are the sort of basic arithmetic errors that make me think a student understands the material, but is sloppy with arithmetic; I’d probably grade this as an 8/10.</p>
</li>
<li>
<p><strong>Response 2</strong> computes the critical points correctly.</p>
<p>When it plugs them into the second derivative, it correctly computes \(f^{\prime \prime} (1) = -6\) but says that \(f^{\prime \prime}(3) = 6\cdot 3 - 12 = 0\), when of course \(6 \cdot 3 - 12 = 6 \).</p>
<p>Then it interprets this calculation wrong: it makes a key conceptual error by saying that since \(f^{\prime \prime}(3) = 0\), we know that \(3\) is an inflection point. This is kind of half-right, and I could see some mathematicians accepting that conclusion. But not all points where \(f^{\prime \prime} = 0\) are inflection points, and I wouldn’t accept that conclusion.</p>
</li>
</ul>
<p>So we get well-formatted solutions, but they’re not consistent or reliable; I’d have to hand-check them, and probably rewrite to match my style.</p>
<p>But I want to see if we can get better questions out of this, so let’s go back to the well.</p>
<h4 id="maxmin-take-2">Max/Min take 2</h4>
<p>I wanted to get slightly more complicated functions involved.</p>
<p class="theorem"><strong>Prompt:</strong> Write a problem where you have to classify the critical points of a rational function, with a solution.</p>
<details class="example">
<summary> Response 1 </summary>
<div>
<p>Problem: Consider the function $f(x) = \frac{x^3-3x}{x^2-2x+1}$. Classify the critical points of the function as relative maxima, relative minima, or inflection points.</p>
<p>Solution: To classify the critical points, we need to find the critical points and then determine their behavior using the second derivative test.</p>
<p>Finding critical points: To find the critical points, we need to find the points where the first derivative of the function is equal to zero or the first derivative does not exist. The first derivative of the function is given by:
$f’(x) = \frac{(x^2-1)(x^2-3)}{(x-1)^2}$</p>
<p>Setting $f’(x) = 0$, we find that:</p>
<p>$(x^2-1)(x^2-3) = 0$</p>
<p>Solving for $x$, we get the critical points:</p>
<p>$x = \pm 1, \pm\sqrt{3}$</p>
<p>However, $x=1$ is not a critical point since the denominator of the function is zero at $x=1$. Therefore, the critical points are $x = -\sqrt{3}, 1, \sqrt{3}$.</p>
<p>Second derivative test: To determine the behavior of the critical points, we use the second derivative test. We find the second derivative of the function and evaluate it at the critical points:
$f’‘(x) = \frac{6x(x-1)(x+1)}{(x-1)^3}$</p>
<p>$f’’(-\sqrt{3}) = \frac{-6\sqrt{3}}{(2\sqrt{3})^3} = -\frac{3}{4\sqrt{3}}$</p>
<p>$f’‘(1)$ does not exist since the denominator of the second derivative is zero.</p>
<p>$f’’(\sqrt{3}) = \frac{6\sqrt{3}}{(2\sqrt{3})^3} = \frac{3}{4\sqrt{3}}$</p>
<p>Since $f’’(-\sqrt{3}) < 0$, it means that $x = -\sqrt{3}$ is a relative maximum. Since $f’’(\sqrt{3}) > 0$, it means that $x = \sqrt{3}$ is a relative minimum. The point $x=1$ is not a critical point since the function is undefined at that point.</p>
<p>So, the critical points are:</p>
<p>$x = -\sqrt{3}$ is a relative maximum</p>
<p>$x = \sqrt{3}$ is a relative minimum.</p>
</div>
</details>
<details class="example">
<summary> Response 2 </summary>
<div>
<p>Problem: Consider the function $f(x) = \frac{x^2 - 6x + 9}{x^2 - 4}$. Classify the critical points of the function as relative maxima, relative minima, or inflection points.</p>
<p>Solution: To classify the critical points, we need to find the critical points and then determine their behavior using the second derivative test.</p>
<p>Finding critical points: To find the critical points, we need to find the points where the first derivative of the function is equal to zero or the first derivative does not exist. The first derivative of the function is given by:
$f’(x) = \frac{(2x - 6)(x^2 - 4) - (x^2 - 6x + 9)(2x)}{(x^2 - 4)^2}$</p>
<p>Simplifying the expression, we get:</p>
<p>$f’(x) = \frac{-x^4 + 8x^3 - 5x^2 - 72x + 72}{(x^2 - 4)^2}$</p>
<p>Setting $f’(x) = 0$, we find that:</p>
<p>$-x^4 + 8x^3 - 5x^2 - 72x + 72 = 0$</p>
<p>We can factor this expression as:</p>
<p>$(x - 2)(x - 3)(x^2 - 3x - 12) = 0$</p>
<p>Solving for $x$, we get the critical points:</p>
<p>$x = 2, 3, \frac{3 \pm \sqrt{105}}{2}$</p>
<p>So, the critical points are $x = 2, 3, \frac{3 - \sqrt{105}}{2}, \frac{3 + \sqrt{105}}{2}$.</p>
<p>Second derivative test: To determine the behavior of the critical points, we use the second derivative test. We find the second derivative of the function and evaluate it at the critical points:
$f’‘(x) = \frac{2x^5 - 30x^4 + 144x^3 - 320x^2 + 288x - 72}{(x^2 - 4)^3}$</p>
<p>$f’‘(2) = \frac{360}{36} = 10$</p>
<p>$f’‘(3) = \frac{-540}{49} < 0$</p>
<p>$f’‘\left(\frac{3 - \sqrt{105}}{2}\right) > 0$</p>
<p>$f’‘\left(\frac{3 + \sqrt{105}}{2}\right) < 0$</p>
<p>Since $f’‘(2) > 0$, it means that $x = 2$ is a relative minimum. Since $f’‘(3) < 0$, it means that $x = 3$ is a relative maximum. Since $f’‘\left(\frac{3 - \sqrt{105}}{2}\right) > 0$ and $f’‘\left(\frac{3 + \sqrt{105}}{2}\right) < 0$, it means that $\frac{3 - \sqrt{105}}{2}$ is an inflection point and $\frac{3 + \sqrt{105}}{2}$ is a local maximum.</p>
<p>So, the critical points are:</p>
<p>$x = 2$ is a relative minimum
$x = 3$ is a relative maximum
$\frac{3 - \sqrt{105}}{2}$ is an inflection point
$\frac{3 + \sqrt{105}}{2}$ is a local maximum.</p>
</div>
</details>
<p>Both of these questions are at least moderately reasonable, but the answers are wildly off-base.</p>
<ul>
<li>
<p><strong>Response 1</strong> computes the derivative incorrectly, sets it equal to zero correctly, computes the second derivative incorrectly, and plugs the critical points into it incorrectly. It does draw the correct conclusions from the results of those computations, though.</p>
<p>It also makes a random sign error on one of the critical points, and makes a technically incorrect claim about what qualifies as a critical point. (Points where the derivative is undefined <em>are</em> critical points, but <em>can’t</em> be maxima or minima.)</p>
</li>
<li>
<p><strong>Response 2</strong> computes the derivative correctly, but simplifies it incorrectly, then sets it equal to zero incorrectly. It computes the second derivative incorrectly, and then plugs in the critical points it computed incorrectly.</p>
<p>It almost interprets the computations correctly, but then fails: it claims that because \(f^{\prime \prime} \left( \frac{3-\sqrt{105}}{2} \right) > 0 \), that point is an <em>inflection point</em>, which is definitely wrong.</p>
<p>(Also it repeats itself at the end, but that’s not wrong, just badly written.)</p>
</li>
</ul>
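<p>Simplification slips like the one in Response 2 are easy to catch by comparing the unsimplified and “simplified” expressions at a few sample points. A sketch, using only the quotient-rule numerator ChatGPT itself set up:</p>

```python
def numerator_quotient_rule(x):
    # the numerator ChatGPT set up correctly
    return (2 * x - 6) * (x**2 - 4) - (x**2 - 6 * x + 9) * (2 * x)

def numerator_claimed(x):
    # ChatGPT's "simplified" quartic
    return -x**4 + 8 * x**3 - 5 * x**2 - 72 * x + 72

def numerator_actual(x):
    # the true simplification: 6x^2 - 26x + 24 = 2(3x - 4)(x - 3)
    return 6 * x**2 - 26 * x + 24

samples = [0, 1, 2, 5]
matches_actual = all(numerator_quotient_rule(x) == numerator_actual(x) for x in samples)
matches_claimed = all(numerator_quotient_rule(x) == numerator_claimed(x) for x in samples)
# matches_actual is True; matches_claimed is False
```

<p>(The true numerator factors as \(2(3x-4)(x-3)\), so the only critical points are \(x = 4/3\) and \(x = 3\).)</p>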
<p>Overall this doesn’t seem like an awful way to brainstorm problems, but I would have to write solutions myself and confirm the problems are reasonably solvable.</p>
<h3 id="related-rates-problems">Related Rates problems</h3>
<p class="theorem"><strong>Prompt:</strong> Write a related rates problem with a solution</p>
<details class="example">
<summary> Response 1 </summary>
<div>
<p>Problem: A spherical balloon is being inflated with air at a rate of 4 cubic feet per minute. At the instant when the radius of the balloon is 2 feet, how fast is the radius of the balloon increasing?</p>
<p>Solution: We can use the formula for the volume of a sphere to relate the rate of change of the volume of the balloon with the rate of change of the radius:</p>
<p>$V = \frac{4}{3}\pi r^3$</p>
<p>Taking the derivative of both sides with respect to time, we get:</p>
<p>$\frac{dV}{dt} = 4\pi r^2 \frac{dr}{dt}$</p>
<p>We are given that $\frac{dV}{dt} = 4$ cubic feet per minute, and we know that the radius of the balloon is 2 feet. So, substituting these values, we get:</p>
<p>$4 = 4\pi (2)^2 \frac{dr}{dt}$</p>
<p>Simplifying, we get:</p>
<p>$\frac{dr}{dt} = \frac{1}{4\pi (2)^2} = \frac{1}{16\pi}$ feet per minute</p>
<p>So, the radius of the balloon is increasing at a rate of $\frac{1}{16\pi}$ feet per minute when the radius of the balloon is 2 feet.</p>
</div>
</details>
<details class="example">
<summary> Response 2 </summary>
<div>
<p>Problem: A water tank is in the shape of a cylinder with a radius of 5 feet and a height of 10 feet. Water is flowing into the tank at a rate of 3 cubic feet per minute. How fast is the water level rising when the water is 6 feet deep?</p>
<p>Solution:</p>
<p>Let’s begin by labeling the variables we need:</p>
<p>$r$ is the radius of the tank, which is 5 feet.
$h$ is the height of the water level in the tank.
$V$ is the volume of water in the tank.
$t$ is time.
$dV/dt$ is the rate at which water is flowing into the tank, which is 3 cubic feet per minute.
We want to find the rate of change of $h$ with respect to time when $h = 6$ feet.</p>
<p>To solve the problem, we need to relate $h$ and $V$. The volume of a cylinder is given by the formula:</p>
<p>$V = \pi r^2 h$</p>
<p>Differentiating with respect to time, we get:</p>
<p>$dV/dt = \pi r^2 dh/dt$</p>
<p>Solving for $dh/dt$, we get:</p>
<p>$dh/dt = (dV/dt)/( \pi r^2)$</p>
<p>Now, we need to find $dV/dt$ when $h = 6$ feet. The volume of water in the tank when the water level is 6 feet deep is:</p>
<p>$V = \pi r^2 h = \pi (5^2)(6) = 150\pi$</p>
<p>Differentiating with respect to time, we get:</p>
<p>$dV/dt = 150\pi (dh/dt)$</p>
<p>Substituting $dV/dt = 3$ and $r = 5$, we get:</p>
<p>$3 = 150\pi (dh/dt)$</p>
<p>Solving for $dh/dt$, we get:</p>
<p>$dh/dt = 3/(150\pi) = 0.02$ feet per minute</p>
<p>So, the water level is rising at a rate of 0.02 feet per minute when the water is 6 feet deep.</p>
</div>
</details>
<details class="example">
<summary> Response 3</summary>
<div>
<p>Problem: A spherical balloon is being inflated at a rate of 10 cubic feet per minute. Find the rate at which the radius of the balloon is increasing when the diameter is 2 feet.</p>
<p>Solution:</p>
<p>Let’s start by writing down what we know:</p>
<p>The volume of a sphere is given by V = (4/3)πr^3, where r is the radius.
We know that dV/dt = 10, since the balloon is being inflated at a rate of 10 cubic feet per minute.
We want to find dr/dt when the diameter is 2 feet, which means r = 1 foot.
To find the rate at which the radius of the balloon is increasing, we need to relate dV/dt and dr/dt using the chain rule. We have:</p>
<p>dV/dt = dV/dr * dr/dt</p>
<p>We can find dV/dr by taking the derivative of the volume formula with respect to r:</p>
<p>dV/dr = 4πr^2</p>
<p>Substituting the values we know, we get:</p>
<p>10 = (4/3)π(1^3) * dr/dt * 4π(1^2)</p>
<p>Simplifying the expression, we get:</p>
<p>dr/dt = 15/(2π)</p>
<p>So, the rate at which the radius of the balloon is increasing when the diameter is 2 feet is 15/(2π) feet per minute.</p>
</div>
</details>
<ul>
<li>
<p><strong>Response 1</strong> is a simple and straightforward, but solid, problem. And the solution is so close to being right: it does everything correctly until the very end, and then makes an arithmetic error in the final step.</p>
</li>
<li>
<p><strong>Response 2</strong> is also pretty simple, maybe too simple for me to really like it. The solution starts off strong, but halfway through it gets confused, starts the problem over, and does it wrong.</p>
</li>
<li>
<p><strong>Response 3</strong> is also a simple problem, basically the same as Response 1. I was impressed it kept diameter versus radius straight. The solution starts out strong, but then again goes off the rails halfway through and gives a slightly wrong answer at the end.</p>
</li>
</ul>
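<p>For comparison, all three correct answers are one line of arithmetic each, using the equations the responses themselves derived; a sketch:</p>

```python
import math

# Response 1 (balloon): dV/dt = 4*pi*r^2 * dr/dt, with dV/dt = 4 and r = 2
dr_dt = 4 / (4 * math.pi * 2**2)      # = 1/(4*pi), about 0.0796, not 1/(16*pi)

# Response 2 (tank): dV/dt = pi*r^2 * dh/dt, with dV/dt = 3 and r = 5
dh_dt = 3 / (math.pi * 5**2)          # = 3/(25*pi), about 0.0382, not 0.02

# Response 3 (balloon again): dV/dt = 10 and r = 1 (diameter 2 ft)
dr_dt_3 = 10 / (4 * math.pi * 1**2)   # = 5/(2*pi), not 15/(2*pi)
```
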
<p>At this point I wanted to see if I could prompt it into giving me more interesting problems. I also decided to play around with some prompt engineering ideas: what if I tell it what sort of thing it wants to achieve?</p>
<p class="theorem"><strong>Prompt:</strong> Pretend you’re a brilliant professor who wants to write an interesting and challenging calculus test.
Write a related rates problem involving similar triangles with a solution.</p>
<details class="example">
<summary> Response 1 </summary>
<div>
<p>Problem: A hot air balloon is rising vertically at a rate of 5 meters per second. At a certain instant, the angle of elevation between the observer on the ground and the balloon is 60 degrees. The observer is 150 meters away from the launch point of the balloon. How fast is the distance between the observer and the balloon increasing at this instant?</p>
<p>Solution:</p>
<p>Let’s start by drawing a diagram to visualize the problem:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
B (balloon)
/|
/ |
/ |h
/ |
/θ |
/ |
/ |
A (observer)
</code></pre></div> </div>
<p>We have a right triangle ABC, where A is the observer, B is the balloon, and C is the point on the ground directly below the balloon. We know that AB = 150 meters, and the angle θ is 60 degrees. We want to find how fast the distance BC is changing when the balloon is rising at a rate of 5 meters per second.</p>
<p>We can see that triangles ABC and ABD are similar, so we can use the property of similar triangles to relate the sides:</p>
<p>AB/AC = BD/BC</p>
<p>Simplifying the expression, we get:</p>
<p>BC = (AB*BD)/AC</p>
<p>To find BD, we need to use trigonometry. We have:</p>
<p>tan θ = BD/AC</p>
<p>Substituting the known values, we get:</p>
<p>tan 60 = BD/AC</p>
<p>sqrt(3) = BD/AC</p>
<p>BD = sqrt(3) * AC</p>
<p>Now, we can substitute this value into the equation we derived earlier:</p>
<p>BC = (AB*sqrt(3)*AC)/AC</p>
<p>Simplifying the expression, we get:</p>
<p>BC = 150*sqrt(3)</p>
<p>Now, we need to find how fast BC is changing with respect to time. We have:</p>
<p>dBC/dt = d/dt (150*sqrt(3))</p>
<p>dBC/dt = 0 (since 150*sqrt(3) is a constant)</p>
<p>Therefore, the distance between the observer and the balloon is not changing at this instant, and its rate of change is zero.</p>
</div>
</details>
<details class="example">
<summary> Response 2 </summary>
<div>
<p>Problem: A flagpole standing vertically on a hill has a height of 20 feet and is located on a slope with an inclination of 30 degrees. A person standing on the same slope 100 feet away from the base of the flagpole observes that the angle of elevation to the top of the flagpole is increasing at a rate of 0.1 radians per minute. At what rate is the person moving closer to the flagpole at the moment when the angle of elevation is 60 degrees?</p>
<p>Solution:</p>
<p>Let’s start by drawing a diagram of the situation:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
/|
/ |
/ | 20 ft
/ |
/ |
/ |
/-----------------\
100 ft
</code></pre></div> </div>
<p>In the diagram, the flagpole has height 20 feet and is located on a slope with inclination 30 degrees. The person is standing on the same slope 100 feet away from the base of the flagpole and observes that the angle of elevation to the top of the flagpole is increasing at a rate of 0.1 radians per minute.</p>
<p>Let’s define some variables:</p>
<p>Let x be the distance between the person and the base of the flagpole.
Let y be the height of the person’s eye above the ground.
Let θ be the angle of elevation to the top of the flagpole.
From the diagram, we can see that the triangles formed by the person, the flagpole, and the ground are similar. Therefore, we have:</p>
<p>y / x = (20 + y) / (x / cos 30)</p>
<p>Simplifying this equation, we get:</p>
<p>y = 20x / (2x + sqrt(3)x)</p>
<p>We want to find the rate at which the person is moving closer to the flagpole at the moment when the angle of elevation is 60 degrees, which means θ = π/3 radians. We also know that dθ/dt = 0.1 radians per minute.</p>
<p>Using trigonometry, we can relate θ and y:</p>
<p>tan θ = (20 + y) / x</p>
<p>Taking the derivative with respect to time, we get:</p>
<p>sec^2 θ * dθ/dt = (dy/dt * x - y * dx/dt) / x^2</p>
<p>Substituting the values we know, we get:</p>
<p>1/3 * dθ/dt = (dy/dt * 100 - y * dx/dt) / 10000</p>
<p>We want to find dx/dt when θ = π/3 and dθ/dt = 0.1. We can use the equation y = 20x / (2x + sqrt(3)x) to eliminate y:</p>
<p>1/3 * 0.1 = (dy/dt * 100 - (20x / (2x + sqrt(3)x)) * dx/dt) / 10000</p>
<p>Simplifying the expression and solving for dx/dt, we get:</p>
<p>dx/dt = 0.1429 feet per minute</p>
<p>So, the person is moving closer to the flagpole at a rate of 0.1429 feet per minute when the angle of elevation is 60 degrees.</p>
</div>
</details>
<ul>
<li>
<p><strong>Response 1</strong> certainly gives a much more challenging problem than the previous prompt ever did! And the fact that it can draw a diagram of the problem in ASCII art is really cool, although not terribly helpful.</p>
<p>But ChatGPT gets confused as soon as it starts writing the solution: it can’t keep track of which points A, B, and C we want to talk about. Then it compares the triangle ABC it sketched out to a similar triangle ABD, which doesn’t actually exist in the problem or the diagram.</p>
<p>From there it continues to be kind of incoherent. It has \(\tan(\theta) = BD/AC\), which should be pieces of different triangles, so that doesn’t make much sense. And it concludes with making one of the key basic errors that calc 1 students make: plugging in values before taking the derivative, so that your derivatives all wind up being zero.</p>
<p>(This is also the first time I had to do editing to make the formatting work, but I’m not holding that against ChatGPT; it rendered fine in the GPT window, but my site messed around with it a bit.)</p>
</li>
<li>
<p><strong>Response 2</strong> is a problem that is hard enough that I’d struggle with it. It’s interesting, but I can’t assign it!</p>
<p>The solution makes things worse by introducing additional considerations (like the height of the person’s eye above the ground!). It declares that three triangles are all similar, without being clear on what those triangles are, and I’m pretty sure this problem has no similar triangles at all.</p>
<p>After that detour it does some correct-ish work, but makes the usual algebra errors, and then falls back on the weird not-similar-triangles argument from earlier.</p>
</li>
</ul>
<p>Okay, let’s give up on the prompt engineering, and see what happens if we just ask for a normal similar triangles problem.</p>
<p class="theorem"><strong>Prompt:</strong> Write a related rates problem involving similar triangles with a solution.</p>
<details class="example">
<summary> Response 1</summary>
<div>
<p>Problem: A person standing 10 meters away from a street light observes that the angle of elevation from their eye to the top of the light is 60 degrees. The person begins to walk away from the light at a rate of 2 meters per second. How fast is the angle of elevation changing when the person is 20 meters away from the light?</p>
<p>Solution:</p>
<p>Let’s start by drawing a diagram of the situation:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
|\
| \
| \
| \
| \
| \
| \
| \
|θ \ 10 m
| \
| \
| \
|__________\
</code></pre></div> </div>
<p>In the diagram, the person is standing 10 meters away from the street light and observes that the angle of elevation from their eye to the top of the light is 60 degrees. The person begins to walk away from the light at a rate of 2 meters per second.</p>
<p>Let’s define some variables:</p>
<p>Let d be the distance between the person and the light.
Let θ be the angle of elevation to the top of the light.
From the diagram, we can see that the triangles formed by the person, the street light, and the ground are similar. Therefore, we have:</p>
<p>d / 10 = tan θ</p>
<p>Taking the derivative with respect to time, we get:</p>
<p>1/10 * dd/dt = sec^2 θ * dθ/dt</p>
<p>We want to find dθ/dt when d = 20 meters. We also know that dd/dt = 2 meters per second.</p>
<p>Using trigonometry, we can relate θ and d:</p>
<p>tan θ = d / 10</p>
<p>Substituting the values we know, we get:</p>
<p>1/10 * 2 = sec^2 θ * dθ/dt</p>
<p>Simplifying the expression, we get:</p>
<p>dθ/dt = 1/(5sec^2 θ) radians per second</p>
<p>We can use the equation tan θ = d / 10 to eliminate θ:</p>
<p>tan θ = d / 10</p>
<p>sec^2 θ = 1 + tan^2 θ</p>
<p>Substituting these values into the expression for dθ/dt, we get:</p>
<p>dθ/dt = 1 / (5(1 + (d/10)^2)) radians per second</p>
<p>Substituting d = 20, we get:</p>
<p>dθ/dt = 1 / (5(1 + 4)) radians per second</p>
<p>Simplifying the expression, we get:</p>
<p>dθ/dt = 1 / 25 radians per second</p>
<p>So, the angle of elevation is changing at a rate of 1/25 radians per second when the person is 20 meters away from the light.</p>
</div>
</details>
<p>And I think we’ve conclusively shown that ChatGPT doesn’t understand what similar triangles are.</p>
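<p>For the record, the streetlight problem is perfectly solvable without any similar triangles. A numeric sketch of the intended answer (assuming, as the setup implies, that the light’s height is fixed by the initial 60-degree sighting at 10 meters):</p>

```python
import math

h = 10 * math.tan(math.radians(60))    # light height, about 17.32 m, fixed

def theta(d):
    # angle of elevation at distance d from the light
    return math.atan(h / d)

d, dd_dt = 20.0, 2.0
# analytic: theta = atan(h/d), so d(theta)/dt = -h/(d^2 + h^2) * dd/dt
dtheta_dt = -h / (d**2 + h**2) * dd_dt

# cross-check with a centered finite difference along the walk
eps = 1e-6
numeric = (theta(d + eps) - theta(d - eps)) / (2 * eps) * dd_dt
```

<p>Both the analytic and numeric values come out to \(-\sqrt{3}/35 \approx -0.0495\) radians per second; the sign is negative because walking away shrinks the angle.</p>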
<h4 id="brainstorming-related-rates-problems">Brainstorming Related Rates Problems</h4>
<p>At this point I’m pretty convinced that ChatGPT won’t write up good solutions. What if I just ask for problems?<strong title="Note: at this point I also started a new conversation, to see if that would change things. I haven't done a lot of testing on how much stuff changes if you continue a previous conversational thread versus starting a new one."><sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup></strong></p>
<p class="theorem"><strong>Prompt:</strong> Write me some related rates problems involving distance.</p>
<details class="example">
<summary> Response 1</summary>
<div>
<p>Sure! Here are a few related rates problems involving distance:</p>
<ol>
<li>
<p>Two cars are driving towards each other on a straight road. Car A is traveling at 60 mph, and car B is traveling at 40 mph. If the distance between them is 1 mile, how fast are they getting closer to each other?</p>
</li>
<li>
<p>A ladder is leaning against a wall. The base of the ladder is 6 feet from the wall, and the top of the ladder is 10 feet up the wall. If the bottom of the ladder is being pulled away from the wall at a rate of 2 feet per second, how fast is the top of the ladder sliding down the wall when the base is 8 feet from the wall?</p>
</li>
<li>
<p>A person is standing at the edge of a cliff, looking out at the ocean. The cliff is 100 feet high. If the person’s line of sight makes an angle of 30 degrees with the horizontal, how far out from the cliff is the person looking?</p>
</li>
<li>
<p>A person is walking away from a lamppost at a rate of 3 feet per second. The person is initially 10 feet from the lamppost. How fast is the person’s distance from the lamppost increasing after 5 seconds?</p>
</li>
<li>
<p>A ball is thrown straight up into the air from a height of 5 feet, with an initial velocity of 20 feet per second. How long does it take for the ball to reach a height of 25 feet, and what is the ball’s velocity at that point?</p>
</li>
</ol>
</div>
</details>
<p>The second suggestion is a standard but perfectly good related rates problem, but the others don’t fare nearly so well. They did all involve distance somehow, though!</p>
<ul>
<li>The first problem is technically a related rates problem, but it doesn’t require calculus.</li>
<li>The third doesn’t involve any rates at all.</li>
<li>The fourth asks about <em>one</em> rate, and also it just tells you what that rate is in the problem. It’s purely a reading comprehension question.</li>
<li>The fifth also has <em>one</em> rate, but isn’t a related rates problem. It also doesn’t contain enough information to solve the problem; you would need to know the acceleration to work out the problem, and even with that it would require integrals.</li>
</ul>
<p>The last one was the most interesting to me, in terms of how <em>I</em> related to it. You can plausibly do the problem if you assume acceleration under earth gravity. And I spent a bit of time trying to figure out if ChatGPT intended me to do that, and thinking about how I could get it to tell me—before realizing that asking about ChatGPT’s “intentions” is misunderstanding it. I could ask more questions and it could make a decision, but it doesn’t have a stable “what I meant” that I could try to probe.</p>
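<p>(As it happens, granting standard gravity doesn’t quite rescue the problem: with an assumed acceleration of \(-32\) feet per second squared, the height is \(h(t) = 5 + 20t - 16t^2\), which peaks at 11.25 feet, so the ball never reaches 25 feet at all. A quick check, under that assumed acceleration:)</p>

```python
# Height under an *assumed* standard gravity of -32 ft/s^2:
# h(t) = 5 + 20t - 16t^2 (initial height 5 ft, initial velocity 20 ft/s)
h0, v0, g = 5.0, 20.0, 32.0
t_peak = v0 / g                        # velocity reaches zero at t = 0.625 s
h_peak = h0 + v0 * t_peak - 16 * t_peak**2
# h_peak = 11.25 ft, far short of the 25 ft the problem asks about
```
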
<p>Let’s try again.</p>
<details class="example">
<summary> Response 2</summary>
<div>
<p>Sure, here are some related rates problems involving distance:</p>
<ol>
<li>
<p>Two cars are traveling towards each other on a straight road. Car A is traveling at 60 mph and car B is traveling at 40 mph. If the distance between the two cars is initially 500 miles, how fast is the distance between them decreasing?</p>
</li>
<li>
<p>A ladder 10 feet long rests against a vertical wall. If the bottom of the ladder slides away from the wall at a rate of 2 feet per second, how fast is the top of the ladder sliding down the wall when the bottom of the ladder is 6 feet from the wall?</p>
</li>
<li>
<p>A person is walking away from a lamppost at a rate of 4 feet per second. The height of the lamppost is 20 feet. How fast is the person’s shadow lengthening when they are 25 feet from the lamppost?</p>
</li>
<li>
<p>A ball is thrown straight up into the air with an initial velocity of 40 feet per second. The height of the ball, h, in feet after t seconds is given by h = 40t - 16t^2. How fast is the ball moving when it reaches a height of 30 feet?</p>
</li>
<li>
<p>Two boats are approaching each other in a straight line. One boat is traveling at a speed of 10 mph and the other is traveling at a speed of 15 mph. The boats are initially 1000 feet apart. How fast is the distance between the boats decreasing when they are 500 feet apart?</p>
</li>
</ol>
</div>
</details>
<p>Surprisingly, the first two questions in Response 2 are almost identical to the first two questions in Response 1. This go-around, the third question is also a good, standard related rates problem; the fourth is a derivative word problem that doesn’t involve related rates; and the fifth is essentially the same as the first.</p>
<p>My conclusion: this isn’t useless, actually! With like two minutes of effort I got three usable problems. And it’s not that I <em>couldn’t</em> have written these problems myself, but it’s easy for me to get into decision paralysis while question-writing. This certainly fixes the brainstorming problem, and gives me raw material to shape into what I want.</p>
<p>But I’m going to have to write up my own solutions.</p>
<h4 id="can-it-solve-my-problems">Can it solve my problems?</h4>
<p>Or do I? In the previous prompts, it failed to write correct solutions to the problems it generated. But I was curious what would happen if I just gave it a question and let it solve it. Would that work any better?</p>
<details class="example">
<summary> A ladder is leaning against a wall. The base of the ladder is 6 feet from the wall, and the top of the ladder is 10 feet up the wall. If the bottom of the ladder is being pulled away from the wall at a rate of 2 feet per second, how fast is the top of the ladder sliding down the wall when the base is 8 feet from the wall?</summary>
<div>
<p>Let’s start by drawing a diagram:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
|\
| \
| \
| \
| \
| \
| \
| \
|h \
| \
| \
| \
| \
| \
--------------
d
</code></pre></div> </div>
<p>Here, the ladder is represented by the diagonal line, with the base of the ladder represented by “d” and the height of the ladder represented by “h”. We want to find how fast the top of the ladder is sliding down the wall, which is represented by the rate of change of “h”, as the base of the ladder is being pulled away from the wall.</p>
<p>We can see that the distance between the top of the ladder and the wall, which we’ll call “x”, is proportional to the height of the ladder and inversely proportional to the distance between the base of the ladder and the wall:</p>
<p>x/d = h/10</p>
<p>Taking the derivative of both sides with respect to time t, we get:</p>
<p>(dx/dt)/d - x/(d^2)(dd/dt) = (dh/dt)/10</p>
<p>We want to find (dh/dt) when d = 8 feet. We are given that the base of the ladder is being pulled away from the wall at a rate of 2 feet per second, so dd/dt = 2.</p>
<p>When d = 8, we can use the proportion above to find x:</p>
<p>x/8 = 10/h</p>
<p>x = 80/h</p>
<p>We also know that d^2 + h^2 = 10^2, since the ladder is 10 feet long. When d = 6, we can use this equation to find h:</p>
<p>6^2 + h^2 = 10^2</p>
<p>h^2 = 64</p>
<p>h = 8</p>
<p>So when the base of the ladder is 6 feet from the wall, the ladder reaches a height of 8 feet on the wall.</p>
<p>Now we can plug in the values we have into the related rates equation:</p>
<p>(dx/dt)/d - x/(d^2)(dd/dt) = (dh/dt)/10</p>
<p>(dx/dt)/8 - (80/h)/(8^2)(2) = (dh/dt)/10</p>
<p>We need to find (dh/dt) when d = 8 and h = 6. We can use the equation d^2 + h^2 = 10^2 to find h when d = 8:</p>
<p>8^2 + h^2 = 10^2</p>
<p>h^2 = 36</p>
<p>h = 6</p>
<p>Now we can plug in the values we have:</p>
<p>(dx/dt)/8 - (80/6)/(8^2)(2) = (dh/dt)/10</p>
<p>(dx/dt)/8 - 10/192 = (dh/dt)/10</p>
<p>We want to find (dh/dt), so we can isolate it on one side:</p>
<p>(dh/dt)/10 = (dx/dt)/8 - 10/192</p>
<p>(dh/dt) = (dx/dt)/0.8 - 1.25</p>
<p>We know that dx/dt = 2 when the base of the ladder is being pulled away from the wall at a rate of 2 feet per second. So when d = 8 and h = 6, we have:</p>
<p>(dh/dt) = (2)/0.8 - 1.25</p>
<p>(dh/dt) = 0.25</p>
<p>Therefore, when the base of the ladder is 8 feet from the wall, the top of the ladder is sliding down the wall</p>
</div>
</details>
<details class="example">
<summary> A person is walking away from a lamppost at a rate of 4 feet per second. The height of the lamppost is 20 feet. How fast is the person’s shadow lengthening when they are 25 feet from the lamppost?</summary>
<div>
<p>Let’s draw a diagram to represent the situation:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
|\
| \
| \
| \
| \
| \
| \
| \
| \
| \
|h \
| \
------------
d
</code></pre></div> </div>
<p>Here, the lamppost is represented by the vertical line on the left, and the person is walking away from it to the right. The distance between the person and the lamppost is represented by “d”, and the height of the lamppost is represented by “h”. We want to find how fast the person’s shadow is lengthening, which is represented by the length of the line segment from the person’s feet to the base of the lamppost, as the person walks away.</p>
<p>We can see that the height of the person’s shadow, which we’ll call “s”, is proportional to the height of the lamppost and inversely proportional to the distance between the person and the lamppost:</p>
<p>s/d = h/20</p>
<p>Taking the derivative of both sides with respect to time t, we get:</p>
<p>(ds/dt)/d - s/(d^2)(dd/dt) = (dh/dt)/20</p>
<p>We want to find (ds/dt) when d = 25 feet. We are given that the person is walking away from the lamppost at a rate of 4 feet per second, so dd/dt = 4. We are also given that the height of the lamppost is 20 feet, so dh/dt = 0.</p>
<p>When d = 25, we can use the proportion above to find s:</p>
<p>s/25 = 20/20</p>
<p>s = 25</p>
<p>So when the person is 25 feet from the lamppost, their shadow is 25 feet long.</p>
<p>Now we can plug in the values we have into the related rates equation:</p>
<p>(ds/dt)/d - s/(d^2)(dd/dt) = (dh/dt)/20</p>
<p>(ds/dt)/25 - 25/(25^2)(4) = 0</p>
<p>(ds/dt)/25 = 0.005</p>
<p>ds/dt = 0.005 * 25</p>
<p>ds/dt = 0.125</p>
<p>Therefore, when the person is 25 feet from the lamppost, their shadow is lengthening at a rate of 0.125 feet per second.</p>
</div>
</details>
<details class="example">
<summary> A ladder 10 feet long rests against a vertical wall. If the bottom of the ladder slides away from the wall at a rate of 2 feet per second, how fast is the top of the ladder sliding down the wall when the bottom of the ladder is 6 feet from the wall?</summary>
<div>
<p>We can use the Pythagorean theorem to relate the distance of the ladder from the wall to the height it reaches on the wall. Let’s call the distance the bottom of the ladder is from the wall “x” and the height the ladder reaches on the wall “y”. Then we have:</p>
<p>x^2 + y^2 = 10^2</p>
<p>We can take the derivative of both sides of this equation with respect to time t to get the related rates:</p>
<p>2x(dx/dt) + 2y(dy/dt) = 0</p>
<p>We are given that the bottom of the ladder is sliding away from the wall at a rate of 2 feet per second, so dx/dt = 2. We want to find how fast the top of the ladder is sliding down the wall, so we need to solve for dy/dt.</p>
<p>When the bottom of the ladder is 6 feet from the wall, we can use the Pythagorean theorem to find the corresponding height on the wall:</p>
<p>6^2 + y^2 = 10^2</p>
<p>y^2 = 64</p>
<p>y = 8</p>
<p>So when the bottom of the ladder is 6 feet from the wall, the ladder is reaching a height of 8 feet on the wall.</p>
<p>Now we can plug in the values we have into the related rates equation:</p>
<p>2x(dx/dt) + 2y(dy/dt) = 0</p>
<p>2(6)(2) + 2(8)(dy/dt) = 0</p>
<p>12 + 16(dy/dt) = 0</p>
<p>16(dy/dt) = -12</p>
<p>dy/dt = -0.75</p>
<p>Therefore, when the bottom of the ladder is 6 feet from the wall, the top of the ladder is sliding down the wall at a rate of 0.75 feet per second.</p>
</div>
</details>
<p>The first two solutions are complete nonsense. The third one is very nearly correct: the setup is exactly right, but it drops a factor of 2 in the arithmetic (2(6)(2) is 24, not 12), so the answer should be -1.5 feet per second. That’s not nearly reliable enough to be useful, but it <em>is</em> by far the closest thing to a correct solution I’ve seen in these experiments, so it’s worth noting.</p>
<h3 id="calculus-2-oh-yeah-i-actually-have-to-write-a-test">Calculus 2: Oh yeah, I actually have to write a test</h3>
<p>At some point I remembered that I actually had to write a test to give to my calc 2 class. So maybe I should see if I can write some questions for that.</p>
<h4 id="trig-sub-integrals">Trig sub integrals</h4>
<details class="example">
<summary> write some trig sub integral problems</summary>
<div>
<ol>
<li>
<p>Evaluate $\displaystyle\int \frac{\sqrt{9-x^2}}{x^2} \, dx$ using the substitution $x=3\sin\theta$.</p>
</li>
<li>
<p>Evaluate $\displaystyle\int \frac{1}{x^2\sqrt{x^2+4}} \, dx$ using the substitution $x=2\tan\theta$.</p>
</li>
<li>
<p>Evaluate $\displaystyle\int \frac{1}{x\sqrt{x^2-1}} \, dx$ using the substitution $x=\sec\theta$.</p>
</li>
<li>
<p>Evaluate $\displaystyle\int \frac{x^3}{\sqrt{1-x^2}} \, dx$ using the substitution $x=\sin\theta$.</p>
</li>
<li>
<p>Evaluate $\displaystyle\int \frac{dx}{\sqrt{x^2+9}\left(x^2+4\right)}$ using the substitution $x=3\tan\theta$.</p>
</li>
<li>
<p>Evaluate $\displaystyle\int \frac{x^3}{\sqrt{x^2+1}} \, dx$ using the substitution $x=\tan\theta$.</p>
</li>
<li>
<p>Evaluate $\displaystyle\int \frac{1}{x^4\sqrt{4x^2-1}} \, dx$ using the substitution $x=\frac{1}{2}\sec\theta$.</p>
</li>
<li>
<p>Evaluate $\displaystyle\int \frac{dx}{\sqrt{25-x^2}\left(x^2+16\right)}$ using the substitution $x=5\sin\theta$.</p>
</li>
<li>
<p>Evaluate $\displaystyle\int \frac{x^3}{\sqrt{4x^2-1}} \, dx$ using the substitution $x=\frac{1}{2}\sec\theta$.</p>
</li>
<li>
<p>Evaluate $\displaystyle\int \frac{\sqrt{16-x^2}}{x^2} \, dx$ using the substitution $x=4\sin\theta$.</p>
</li>
</ol>
</div>
</details>
<details class="example">
<summary> Can you write solutions to those problems for me?</summary>
<div>
<ol>
<li>
<p>Let $x=3\sin\theta$. Then $dx=3\cos\theta\, d\theta$, and $\sqrt{9-x^2}=\sqrt{9-9\sin^2\theta}=3\cos\theta$. Substituting these into the integral, we get
\[
\begin{array}{rl}
\int \frac{\sqrt{9-x^2}}{x^2} \, dx &= \int \frac{3\cos\theta}{9\sin^2\theta} \cdot 3\cos\theta\, d\theta \\\<br />
&= \frac{1}{3} \int \csc^2\theta \, d\theta \\\<br />
&= -\frac{1}{3}\cot\theta + C \\\<br />
&= -\frac{1}{3}\cot(\sin^{-1}(x/3)) + C.
\end{array}
\]</p>
</li>
<li>
<p>Let $x=2\tan\theta$. Then $dx=2\sec^2\theta\, d\theta$, and $\sqrt{x^2+4}=\sqrt{4\tan^2\theta+4}=2\sec\theta$. Substituting these into the integral, we get
\[
\begin{array}{rl}
\int \frac{1}{x^2\sqrt{x^2+4}} \, dx &= \int \frac{1}{4\tan^2\theta \cdot 2\sec\theta} \cdot 2\sec^2\theta\, d\theta \\\<br />
&= \frac{1}{2} \int \cos\theta \cdot \sec^3\theta \, d\theta \\\<br />
&= \frac{1}{2} \int \frac{\cos\theta}{\cos^4\theta} \, d\theta \\\<br />
&= \frac{1}{2} \int \frac{\cos\theta}{(1-\sin^2\theta)^2} \, d\theta \\\<br />
&= -\frac{1}{2} \frac{\cos\theta}{1-\sin^2\theta} + \frac{1}{2} \int \frac{1}{1-\sin^2\theta} \, d\theta \\\<br />
&= -\frac{1}{2}\frac{\cos\theta}{1-\sin^2\theta} + \frac{1}{4}\ln\left|\frac{1+\sin\theta}{1-\sin\theta}\right| + C \\\<br />
&= -\frac{1}{2}\frac{x}{\sqrt{x^2+4}} + \frac{1}{4}\ln\left|\frac{x+\sqrt{x^2+4}}{2}\right| + C.
\end{array}
\]</p>
</li>
</ol>
<p>Let $x=\sec\theta$. Then $dx=\sec\theta\tan\theta\, d\theta$, and $\sqrt{x^2-1}=\sqrt{\sec^2\theta-1}=\tan\theta$. Substituting these into the integral, we get
\[
\begin{array}{rl}
\int \frac{1}{x\sqrt{x^2-1}} \, dx &= \int \frac{\sec\theta\tan\theta}{\tan\theta} \, d\theta \\\<br />
&= \int \sec\theta \, d\theta \\\<br />
&= \ln|\sec\theta + \tan\theta| + C \\\<br />
&= \ln|x+\sqrt{x^2-1}| + C.
\end{array}
\]</p>
</div>
</details>
<p>First off, this obviously only answers three questions. I think it was hitting a character limit in the answer. It actually cut itself off right before finishing the third one: it stopped with the “\end” at the end of the array environment. But I decided to give it credit for that and finished off the environment syntax for it.</p>
<p>This was also the first time I had to do any other meaningful LaTeX editing. There are places where it <em>should</em> have produced a “\” or a “\,” and instead just gave a “” or a “,”; I assume that’s some issue with the way it processes text. I went ahead and fixed that but wanted to make a note of it. (I also had to do some extra editing to make it render correctly on this site, but that says more about Markdown and MathJax than about ChatGPT.)</p>
<p>The substance of the answers is also interesting. Each of them does the substitution right, then screws up the algebra in the middle and gives a completely wrong answer at the end.</p>
<p>But the problems are pretty good!</p>
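<p>Since I was checking a lot of these by hand, it’s worth noting a lazy way to audit machine-generated antiderivatives: differentiate them numerically and compare against the integrand. Here’s a quick sketch (my own Python, standard library only; the “table” antiderivative is the standard known answer to problem 1, not anything ChatGPT produced):</p>

```python
import math

def num_deriv(F, x, h=1e-6):
    """Central-difference approximation to F'(x)."""
    return (F(x + h) - F(x - h)) / (2 * h)

# Problem 1's integrand: sqrt(9 - x^2) / x^2
f = lambda x: math.sqrt(9 - x**2) / x**2

# ChatGPT's answer: -(1/3) cot(arcsin(x/3))
F_gpt = lambda x: -1 / (3 * math.tan(math.asin(x / 3)))

# The standard table antiderivative: -sqrt(9 - x^2)/x - arcsin(x/3)
F_ref = lambda x: -math.sqrt(9 - x**2) / x - math.asin(x / 3)

for x in [0.5, 1.0, 2.0]:
    print(f"x={x}: ChatGPT off by {num_deriv(F_gpt, x) - f(x):+.4f}, "
          f"table answer off by {num_deriv(F_ref, x) - f(x):+.4f}")
```

<p>The table answer’s error is floating-point noise; ChatGPT’s is not.</p>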
<h4 id="integration-by-parts">Integration by parts</h4>
<p>Let’s try some other integral approaches I need.</p>
<p><strong>Prompt:</strong> Write an integration by parts problem and solve it.</p>
<details class="example">
<summary> Response 1</summary>
<div>
<p>Sure, here’s an example:</p>
<p>Find $\displaystyle\int x^2\ln x \, dx$.</p>
<p>To solve this integral, we will use integration by parts with $u=\ln x$ and $dv=x^2\,dx$. Then $du=\frac{1}{x}\,dx$ and $v=\frac{1}{3}x^3$, so we have
\[
\begin{array}{rl}
\int x^2\ln x \, dx &= \frac{1}{3}x^3\ln x - \int \frac{1}{3}x^2 \, dx \\\<br />
&= \frac{1}{3}x^3\ln x - \frac{1}{9}x^3 + C,
\end{array}
\]
where $C$ is the constant of integration. Therefore, we have
<script type="math/tex">\int x^2\ln x \, dx = \frac{1}{3}x^3\ln x - \frac{1}{9}x^3 + C.</script></p>
</div>
</details>
<details class="example">
<summary> Response 2</summary>
<div>
<p>Sure! Here’s a problem and its solution using integration by parts:</p>
<p>Problem: Evaluate $\int x^2\cos x\, dx$.</p>
<p>Solution: To solve this integral, we use integration by parts with $u=x^2$ and $dv=\cos x\, dx$. Then $du=2x\, dx$ and $v=\sin x$, so we have
\[
\begin{array}{rl}
\int x^2\cos x\, dx &= x^2\sin x - \int 2x\sin x\, dx \\\<br />
&= x^2\sin x + 2\int x\sin x\, dx \\\<br />
&= x^2\sin x + 2\left(x\cos x - \int \cos x\, dx\right) \\\<br />
&= x^2\sin x + 2x\cos x - 2\sin x + C \\\<br />
&= \boxed{x^2\sin x + 2x\cos x - 2\sin x + C}.
\end{array}
\]</p>
</div>
</details>
<p>And it can basically do this! The problems are standard. The first solution looks completely right; the second makes two sign errors, which cancel each other out. (It even boxed the answer in the second example for me, which I find fascinating.)</p>
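<p>It’s easy to confirm that those two errors really do cancel: differentiating the boxed answer recovers the integrand exactly,</p>
<p>\[
\frac{d}{dx}\left[x^2\sin x + 2x\cos x - 2\sin x\right] = \left(2x\sin x + x^2\cos x\right) + \left(2\cos x - 2x\sin x\right) - 2\cos x = x^2\cos x.
\]</p>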
<p>I wanted to see if I could get something a little more challenging.</p>
<details class="example">
<summary> Write a hard integration by parts problem and solve it.</summary>
<div>
<p>Sure, here is a hard integration by parts problem:</p>
<p>Problem: Evaluate $\int x^2 e^{-x} \cos x\, dx$.</p>
<p>Solution: To solve this integral, we use integration by parts twice. First, we choose $u=x^2$ and $dv=e^{-x} \cos x\, dx$, so $du=2x\, dx$ and $v=e^{-x} \sin x$. Then we have
\[
\begin{array}{rl}
\int x^2 e^{-x} \cos x\, dx &= x^2 e^{-x} \sin x - \int 2xe^{-x}\sin x\, dx \\\<br />
&= x^2 e^{-x} \sin x - 2\left(-xe^{-x}\cos x - \int -e^{-x}\cos x\, dx\right) \\\<br />
&= x^2 e^{-x} \sin x + 2xe^{-x}\cos x - 2\left(-e^{-x}\sin x - \int e^{-x}\sin x\, dx\right) \\\<br />
&= x^2 e^{-x} \sin x + 2xe^{-x}\cos x + 2e^{-x}\sin x - 2\left(-e^{-x}\cos x - \int e^{-x}\cos x\, dx\right) \\\<br />
&= x^2 e^{-x} \sin x + 2xe^{-x}\cos x + 2e^{-x}\sin x + 2e^{-x}\cos x - 2\int e^{-x}\cos x\, dx \\\<br />
&= x^2 e^{-x} \sin x + 2xe^{-x}\cos x + 2e^{-x}\sin x + 2e^{-x}\cos x - 2\left(e^{-x}\sin x + \int e^{-x}\sin x\, dx\right) \\\<br />
&= x^2 e^{-x} \sin x + 2xe^{-x}\cos x + 2e^{-x}\sin x + 2e^{-x}\cos x - 2\left(-e^{-x}\cos x + \int -e^{-x}\cos x\, dx\right) \\\<br />
&= x^2 e^{-x} \sin x + 2xe^{-x}\cos x + 2e^{-x}\sin x + 2e^{-x}\cos x + 2e^{-x}\cos x + C \\\<br />
&= \boxed{x^2 e^{-x} \sin x + 4xe^{-x}\cos x + 4e^{-x}\sin x + C}.
\end{array}
\]</p>
</div>
</details>
<p>That’s an interesting problem, although too hard for my test. This time it doesn’t get the answer right, sadly. (It screws up on the first step!)</p>
<p>I want to see if I can get it to generate problems that have coefficients on the variables.</p>
<details class="example">
<summary> Write an integration by parts problem that has coefficients on the x.</summary>
<div>
<p>Here’s an integration by parts problem that has coefficients on the x:</p>
<p>Problem: Evaluate $\int x^2\ln x\, dx$.</p>
<p>Solution: We can use integration by parts with $u = \ln x$ and $dv = x^2\, dx$, so $du = \frac{1}{x}\, dx$ and $v = \frac{x^3}{3}$. Then we have
\[
\begin{array}{rl}
\int x^2\ln x\, dx &= \frac{x^3\ln x}{3} - \int \frac{x^3}{3} \cdot \frac{1}{x}\, dx \\\<br />
&= \frac{x^3\ln x}{3} - \frac{1}{3} \int x^2\, dx \\\<br />
&= \frac{x^3\ln x}{3} - \frac{x^3}{9} + C \\\<br />
&= \boxed{\frac{x^3}{9}(3\ln x - 1) + C}.
\end{array}
\]</p>
</div>
</details>
<p>Wait a minute—it wrote that problem for me already! It’s still a solid problem, and it still got the solution right, but it’s not what I wanted.</p>
<details class="example">
<summary> Write an integration by parts problem where the integrand has scalar coefficients on the x terms</summary>
<div>
<p>Here’s an integration by parts problem where the integrand has scalar coefficients on the x terms:</p>
<p>Problem: Evaluate $\int x e^{-x} \sin x\, dx$.</p>
<p>Solution: We can use integration by parts with $u = \sin x$ and $dv = x e^{-x}\, dx$, so $du = \cos x\, dx$ and $v = -xe^{-x} - e^{-x}$. Then we have
\[
\begin{array}{rl}
\int x e^{-x} \sin x\, dx &= -x e^{-x} \sin x + \int (\cos x - e^{-x})\, dx \\\<br />
&= -x e^{-x} \sin x + \sin x + e^{-x} + C \\\<br />
&= \boxed{(1-x)e^{-x}\sin x + e^{-x}\cos x + C}.
\end{array}
\]</p>
</div>
</details>
<p>Still doesn’t have the coefficients I want, and it does the same thing it did for the “hard” problem I asked for: it asks a really challenging problem, then completely fails to compute it properly.</p>
<p>If I ask it to generate a bunch, will it give me what I want?</p>
<details class="example">
<summary> Write me ten integration by parts problems.</summary>
<div>
<p>Sure, here are ten integration by parts problems:</p>
<ol>
<li>$\int x e^x \cos x\, dx$</li>
<li>$\int x^2 \ln x\, dx$</li>
<li>$\int x^3 e^x \, dx$</li>
<li>$\int x^2 \cos x\, dx$</li>
<li>$\int x \ln x\, dx$</li>
<li>$\int x \cos x\, dx$</li>
<li>$\int e^x \sin x\, dx$</li>
<li>$\int \ln x\, dx$</li>
<li>$\int x^3 \sin x\, dx$</li>
<li>$\int x^2 e^{-x} \, dx$</li>
</ol>
<p>Note: Some of these integrals might be challenging to solve, so you may want to use different techniques (such as substitution) in addition to integration by parts.</p>
</div>
</details>
<p>This is a basically reasonable collection of problems, but literally zero of them have scalars.</p>
<p>I wonder if it’s stuck? I’ll start a clean chat rather than continuing the previous conversation:</p>
<details class="example">
<summary> Write me ten interesting integration by parts problems.</summary>
<div>
<ol>
<li>Evaluate $\int xe^x \, dx$.</li>
<li>Evaluate $\int x^2 \cos x \, dx$.</li>
<li>Evaluate $\int \ln x \, dx$.</li>
<li>Evaluate $\int \arctan x \, dx$.</li>
<li>Evaluate $\int x \sin^2 x \, dx$.</li>
<li>Evaluate $\int \cos^{-1} x \, dx$.</li>
<li>Evaluate $\int \frac{x}{(1+x^2)^2} \, dx$.</li>
<li>Evaluate $\int x \ln x \, dx$.</li>
<li>Evaluate $\int \sqrt{x} e^x \, dx$.</li>
<li>Evaluate $\int x \sin x \, dx$.</li>
</ol>
</div>
</details>
<p>Basically the same, although 7 isn’t an integration by parts problem, and I don’t think 9 is really doable in a standard calculus course.</p>
<p>But I know what I want; if I write the problem can it solve it for me?</p>
<details class="example">
<summary> Evaluate $\int x \sin^2 (3x) \,dx$</summary>
<div>
<p>We can use integration by parts with $u=x$ and $dv=\sin^2(3x) \, dx$. Then $du=dx$ and we can use the double angle formula $\sin^2(3x) = \frac{1}{2}(1 - \cos(6x))$ to get</p>
<p>\[
\begin{array}{rl}
\int x \sin^2(3x) \, dx &= \frac{1}{2} \int x(1-\cos(6x)) \, dx \\\<br />
&= \frac{1}{2}\left(\int x \, dx - \int x \cos(6x) \, dx\right) \\\<br />
&= \frac{1}{2}\left(\frac{1}{2}x^2 - \frac{1}{6}x\sin(6x) - \frac{1}{36}\cos(6x) \right) + C,
\end{array}
\]</p>
<p>where $C$ is the constant of integration. Therefore,</p>
<script type="math/tex; mode=display">\int x \sin^2(3x) \, dx = \frac{1}{4}x^2 - \frac{1}{12}x\sin(6x) - \frac{1}{72}\cos(6x) + C.</script>
</div>
</details>
<details class="example">
<summary> Evaluate $\int x \ln (5x)\, dx$</summary>
<div>
<p>We can use integration by parts with $u = \ln(5x)$ and $dv = x \, dx$. Then, $du = \frac{1}{x} \, dx$ and $v = \frac{1}{2}x^2$. Therefore, we have</p>
<p>\[
\begin{array}{rl}
\int x \ln(5x) \, dx &= \frac{1}{2}x^2\ln(5x) - \int \frac{1}{2}x^2 \cdot \frac{1}{x} \, dx \\\<br />
&= \frac{1}{2}x^2\ln(5x) - \frac{1}{4}x^2 + C,
\end{array}
\]</p>
<p>where $C$ is the constant of integration. Therefore,</p>
<script type="math/tex; mode=display">\int x \ln(5x) \, dx = \frac{1}{2}x^2\ln(5x) - \frac{1}{4}x^2 + C.</script>
</div>
</details>
<details class="example">
<summary> Evaluate $\int xe^{-4x} \, dx$.</summary>
<div>
<p>We can use integration by parts with $u=x$ and $dv=e^{-4x} \, dx$. Then, $du=dx$ and $v=-\frac{1}{4}e^{-4x}$. Therefore, we have</p>
<p>\[
\begin{array}{rl}
\int xe^{-4x} \, dx &= -\frac{1}{4}xe^{-4x} + \frac{1}{4}\int e^{-4x} \, dx \\\<br />
&= -\frac{1}{4}xe^{-4x} - \frac{1}{16}e^{-4x} + C,
\end{array}
\]</p>
<p>where $C$ is the constant of integration. Therefore,</p>
<script type="math/tex; mode=display">\int xe^{-4x} \, dx = -\frac{1}{4}xe^{-4x} - \frac{1}{16}e^{-4x} + C.</script>
</div>
</details>
<p>Does pretty well with that!</p>
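<p>I spot-checked all three the same lazy way, numerically differentiating each reported antiderivative and comparing to its integrand (my own Python sketch, standard library only):</p>

```python
import math

def num_deriv(F, x, h=1e-6):
    # central-difference approximation to F'(x)
    return (F(x + h) - F(x - h)) / (2 * h)

checks = [
    # (integrand, reported antiderivative)
    (lambda x: x * math.sin(3*x)**2,
     lambda x: x**2/4 - x*math.sin(6*x)/12 - math.cos(6*x)/72),
    (lambda x: x * math.log(5*x),
     lambda x: x**2*math.log(5*x)/2 - x**2/4),
    (lambda x: x * math.exp(-4*x),
     lambda x: -x*math.exp(-4*x)/4 - math.exp(-4*x)/16),
]

for f, F in checks:
    assert all(abs(num_deriv(F, x) - f(x)) < 1e-4 for x in [0.5, 1.0, 2.0])
print("all three answers check out")
```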
<p>But let’s do a test with my favorite kind of integration by parts problem.</p>
<details class="example">
<summary> Evaluate $\int sin(3x) e^{-4x} \, dx$.</summary>
<div>
<p>We can use integration by parts with $u = \sin(3x)$ and $dv = e^{-4x} \, dx$. Then, $du = 3\cos(3x) \, dx$ and $v = -\frac{1}{4}e^{-4x}$. Therefore, we have</p>
<p>\[
\begin{array}{rl}
\int \sin(3x) e^{-4x} \, dx &= -\frac{1}{4}\sin(3x) e^{-4x} + \frac{3}{4}\int \cos(3x) e^{-4x} \, dx \\\<br />
&= -\frac{1}{4}\sin(3x) e^{-4x} + \frac{3}{16}\cos(3x) e^{-4x} - \frac{9}{16}\int \sin(3x) e^{-4x} \, dx.
\end{array}
\]</p>
<p>Rearranging terms, we get</p>
<p>\[
\begin{array}{rl}
\int \sin(3x) e^{-4x} \, dx &= \frac{3}{25}\left(-\frac{1}{4}\sin(3x) e^{-4x} + \frac{3}{16}\cos(3x) e^{-4x}\right) + C \\\<br />
&= -\frac{3}{100}\sin(3x) e^{-4x} + \frac{9}{400}\cos(3x) e^{-4x} + C,
\end{array}
\]</p>
<p>where $C$ is the constant of integration. Therefore,</p>
<script type="math/tex; mode=display">\int \sin(3x) e^{-4x} \, dx = -\frac{3}{100}\sin(3x) e^{-4x} + \frac{9}{400}\cos(3x) e^{-4x} + C.</script>
</div>
</details>
<p>It got the right basic idea, but made a couple of minor mistakes. (It also didn’t show work for what’s probably the trickiest step, so it’s not a great template to follow.)</p>
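<p>For the record, here’s the work for that trickiest step, done carefully. Writing \(I\) for the integral, the second integration by parts gives</p>
<p>\[
I = -\frac{1}{4}\sin(3x)e^{-4x} - \frac{3}{16}\cos(3x)e^{-4x} - \frac{9}{16}I,
\]</p>
<p>and solving for \(I\) yields</p>
<p>\[
\int \sin(3x) e^{-4x} \, dx = -\frac{4}{25}\sin(3x)e^{-4x} - \frac{3}{25}\cos(3x)e^{-4x} + C,
\]</p>
<p>which you can confirm by differentiating.</p>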
<details class="example">
<summary> Evaluate $\int \cos(-2x) e^{3x} \, dx$.</summary>
<div>
<p>Using the identity $\cos(-2x) = \cos(2x)$, we can write</p>
<p>\[
\begin{array}{rl}
\int \cos(-2x) e^{3x} \, dx &= \int \cos(2x) e^{3x} \, dx \\\<br />
&= \frac{1}{5}\cos(2x) e^{3x} + \frac{2}{5}\int \sin(2x) e^{3x} \, dx.
\end{array}
\]</p>
<p>We can then use integration by parts with $u = \sin(2x)$ and $dv = e^{3x} \, dx$. Then, $du = 2\cos(2x) \, dx$ and $v = \frac{1}{3}e^{3x}$. Therefore, we have</p>
<p>\[
\begin{array}{rl}
\int \cos(-2x) e^{3x} \, dx &= \frac{1}{5}\cos(2x) e^{3x} + \frac{4}{15}\sin(2x) e^{3x} - \frac{8}{15}\int \cos(2x) e^{3x} \, dx.
\end{array}
\]</p>
<p>Rearranging terms, we get</p>
<p>\[\begin{array}{rl}
\int \cos(-2x) e^{3x} \, dx &= \frac{1}{17}\cos(2x) e^{3x} + \frac{4}{51}\sin(2x) e^{3x} + C,
\end{array}
\]</p>
<p>where $C$ is the constant of integration. Therefore,</p>
<script type="math/tex; mode=display">\int \cos(-2x) e^{3x} \, dx = \frac{1}{17}\cos(2x) e^{3x} + \frac{4}{51}\sin(2x) e^{3x} + C.</script>
</div>
</details>
<p>A very good catch that \(\cos(x)\) is an even function, but it doesn’t handle the actual integration very well. It doesn’t show work for the first integration by parts, and in fact gets it wrong. And then it keeps making more mistakes at every step.</p>
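<p>For reference, the standard closed form it’s groping toward, easily verified by differentiating, is</p>
<p>\[
\int e^{ax}\cos(bx) \, dx = \frac{e^{ax}\left(a\cos(bx) + b\sin(bx)\right)}{a^2+b^2} + C,
\]</p>
<p>which with \(a=3\) and \(b=2\) gives \(\tfrac{1}{13}e^{3x}\left(3\cos(2x)+2\sin(2x)\right) + C\): denominators of 13, not 17 and 51.</p>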
<p>So like with the related rates problems, this is pretty good at brainstorming questions, but not so great at solving them.</p>
<h4 id="partial-fractions">Partial Fractions</h4>
<p>Let’s try some partial fractions integrals.</p>
<details class="example">
<summary> Write me an integral that involves partial fractions, and solve it</summary>
<div>
<p>Sure! Here’s an example:</p>
<p>Evaluate the integral <script type="math/tex">\int \frac{4x^3 - 7x^2 + 2x - 1}{x^4 - 5x^3 + 4x^2 + 4x - 4} \, dx.</script></p>
<p>To solve this integral, we first need to factor the denominator: <script type="math/tex">x^4 - 5x^3 + 4x^2 + 4x - 4 = (x-1)(x-2)(x^2+1).</script></p>
<p>Since $x^2+1$ is an irreducible quadratic over the real numbers, we cannot factor it any further. However, we can use partial fractions to write the integrand as a sum of simpler fractions:</p>
<script type="math/tex; mode=display">\frac{4x^3 - 7x^2 + 2x - 1}{x^4 - 5x^3 + 4x^2 + 4x - 4} = \frac{A}{x-1} + \frac{B}{x-2} + \frac{Cx+D}{x^2+1},</script>
<p>where $A$, $B$, $C$, and $D$ are constants to be determined. To find these constants, we can multiply both sides by the denominator and then equate coefficients of like terms:</p>
<p>\[
\begin{array}{rl}
4x^3 - 7x^2 + 2x - 1 &= A(x-2)(x^2+1) + B(x-1)(x^2+1) + (Cx+D)(x-1)(x-2) \\\<br />
&= (A+B+C)x^3 + (-2A-2B+C+D)x^2 + (A+B-C)x + (-2A+2B+D).
\end{array}
\]</p>
<p>Equating coefficients of like terms, we get the system of equations</p>
<p>\[
\begin{array}{rl}
A + B + C &= 4 \\\<br />
-2A - 2B + C + D &= -7 \\\<br />
A + B - C &= 2 \\\<br />
-2A + 2B + D &= -1.
\end{array}
\]</p>
<p>Solving this system of equations, we find $A = 1$, $B = 2$, $C = -2$, and $D = 1$. Therefore, we have</p>
<p>\[
\begin{array}{rl}
\int \frac{4x^3 - 7x^2 + 2x - 1}{x^4 - 5x^3 + 4x^2 + 4x - 4} \, dx &= \int \frac{1}{x-1} \, dx + \int \frac{2}{x-2} \, dx + \int \frac{-2x+1}{x^2+1} \, dx \\\<br />
&= \ln|x-1| + 2\ln|x-2| - \ln(x^2+1) + C,
\end{array}
\]</p>
<p>where $C$ is the constant of integration.</p>
</div>
</details>
<p>That is, indeed, a partial fractions problem. A very, very difficult partial fractions problem. Which it proceeds to get badly wrong, starting by factoring the denominator incorrectly (though it does find a real factor, which I’m impressed by if it wasn’t an accident).</p>
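<p>Verifying a claimed factorization like this is cheap enough to automate; a quick sketch (mine, not ChatGPT’s):</p>

```python
# Quick check of ChatGPT's claimed factorization of the denominator.
p = lambda x: x**4 - 5*x**3 + 4*x**2 + 4*x - 4          # the denominator
claimed = lambda x: (x - 1) * (x - 2) * (x**2 + 1)      # ChatGPT's factorization

print(p(1))              # 0: so (x - 1) really is a factor
print(p(2))              # nonzero: so (x - 2) is not
print(p(3), claimed(3))  # the two polynomials don't even agree at x = 3
```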
<p>It does integrate the function it finds correctly, though!</p>
<p>Let’s look at another one.</p>
<details class="example">
<summary> Response 2</summary>
<div>
<p>Sure, here’s an integral that involves partial fractions:</p>
<script type="math/tex; mode=display">\int \frac{3x+1}{x^2+4x+3} \, dx</script>
<p>To solve this integral, we first factor the denominator as $(x+1)(x+3)$. Then, we write the fraction as</p>
<script type="math/tex; mode=display">\frac{3x+1}{(x+1)(x+3)} = \frac{A}{x+1} + \frac{B}{x+3}</script>
<p>where $A$ and $B$ are constants to be determined. Multiplying both sides by $(x+1)(x+3)$ and equating coefficients, we get</p>
<script type="math/tex; mode=display">3x+1 = A(x+3) + B(x+1)</script>
<p>Expanding and collecting like terms, we get</p>
<script type="math/tex; mode=display">(A+B)x + 3A + B = 3x + 1</script>
<p>Equating coefficients, we have</p>
<p>\[
\begin{array}{rl}
A + B &= 3 \\\<br />
3A &= 1 \\\<br />
B &= -2
\end{array}
\]</p>
<p>Solving for $A$ and $B$, we get $A = \frac{1}{3}$ and $B = -2$. Therefore,</p>
<p>\[
\begin{array}{rl}
\int \frac{3x+1}{x^2+4x+3} \, dx &= \int \frac{1}{x+1} \, dx - 2\int \frac{1}{x+3} \, dx \\\<br />
&= \ln|x+1| - 2\ln|x+3| + C,
\end{array}
\]</p>
<p>where $C$ is the constant of integration. Therefore,</p>
<script type="math/tex; mode=display">\int \frac{3x+1}{x^2+4x+3} \, dx = \ln|x+1| - 2\ln|x+3| + C.</script>
</div>
</details>
<p>This time we get a reasonable problem. I could put this on a test! But we don’t get a correct solution. ChatGPT factors the denominator correctly, and sets up the right equations to solve, but then solves them incorrectly. The decomposition it integrates matches neither the correct one nor the one it found, though it does integrate that final formula correctly.</p>
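<p>For what it’s worth, the Heaviside cover-up method pins the constants down immediately, and makes a one-line sanity check (my own sketch; the correct decomposition turns out to be \(\frac{-1}{x+1} + \frac{4}{x+3}\)):</p>

```python
# Heaviside cover-up for (3x + 1)/((x + 1)(x + 3)) = A/(x + 1) + B/(x + 3)
A = (3*(-1) + 1) / (-1 + 3)   # evaluate (3x+1)/(x+3) at the root x = -1
B = (3*(-3) + 1) / (-3 + 1)   # evaluate (3x+1)/(x+1) at the root x = -3
print(A, B)  # -1.0 4.0

# sanity check: recombine and compare to the original at a test point
x = 2.0
assert abs(A/(x + 1) + B/(x + 3) - (3*x + 1)/((x + 1)*(x + 3))) < 1e-12
```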
<p>Let’s give it one more try.</p>
<details class="example">
<summary> Response 3 </summary>
<div>
<p>Here’s an integral that involves partial fractions:</p>
<script type="math/tex; mode=display">\int \frac{x^3}{x^2-4x+3} \, dx</script>
<p>To solve this integral, we first need to factor the denominator. We can write</p>
<script type="math/tex; mode=display">x^2-4x+3 = (x-1)(x-3)</script>
<p>Therefore, we can rewrite the integral as</p>
<p>\[\begin{array}{rl}
\int \frac{x^3}{x^2-4x+3} \, dx &= \int \frac{x^3}{(x-1)(x-3)} \, dx \\\<br />
&= \int \frac{Ax+B}{x-1} \, dx + \int \frac{Cx+D}{x-3} \, dx,
\end{array}\]</p>
<p>where $A$, $B$, $C$, and $D$ are constants to be determined. Multiplying both sides by the denominator, we get</p>
<script type="math/tex; mode=display">x^3 = (Ax+B)(x-3) + (Cx+D)(x-1).</script>
<p>Expanding and equating coefficients, we get</p>
<p>\[\begin{array}{rl}
A+C &= 0 \\\<br />
-3A+B+D &= 0 \\\<br />
-3B+D &= 0 \\\<br />
-B &= 1
\end{array}\]</p>
<p>Solving for $A$, $B$, $C$, and $D$, we get</p>
<script type="math/tex; mode=display">A = \frac{1}{2}, \quad B = -1, \quad C = -\frac{1}{2}, \quad D = \frac{3}{2}</script>
<p>Therefore, we can write</p>
<p>\[\begin{array}{rl}
\int \frac{x^3}{x^2-4x+3} \, dx &= \int \frac{\frac{1}{2}x-1}{x-1} \, dx + \int \frac{-\frac{1}{2}x+\frac{3}{2}}{x-3} \, dx \\\<br />
&= \frac{1}{2}\ln|x-1| - \frac{1}{2}\ln|x-3| - \frac{5}{4}\int \frac{1}{x-1} \, dx + \frac{3}{4}\int \frac{1}{x-3} \, dx \\\<br />
&= \frac{1}{2}\ln\left|\frac{x-1}{x-3}\right| - \frac{5}{4}\ln|x-1| + \frac{3}{4}\ln|x-3| + C \\\<br />
&= \frac{1}{4}\ln\left|\frac{x-1}{x-3}\right| - \frac{5}{4}\ln|x-1| + \frac{3}{4}\ln|x-3| + C,
\end{array}\]</p>
<p>where $C$ is the constant of integration. Therefore,</p>
<script type="math/tex; mode=display">\int \frac{x^3}{x^2-4x+3} \, dx = \frac{1}{4}\ln\left|\frac{x-1}{x-3}\right| - \frac{5}{4}\ln|x-1| + \frac{3}{4}\ln|x-3| + C.</script>
</div>
</details>
<p>A reasonable problem, again! It skips polynomial long division in favor of a janky partial-fractions setup that isn’t <em>quite</em> set up right; at any rate it solves the equations it sets up incorrectly, does the integrals wrong, and winds up with the wrong answer.</p>
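<p>For comparison, here’s how the long-division route goes. Dividing first,</p>
<p>\[
\frac{x^3}{x^2-4x+3} = x + 4 + \frac{13x-12}{(x-1)(x-3)},
\]</p>
<p>and the cover-up method gives \(\frac{13x-12}{(x-1)(x-3)} = \frac{-1/2}{x-1} + \frac{27/2}{x-3}\), so</p>
<p>\[
\int \frac{x^3}{x^2-4x+3} \, dx = \frac{x^2}{2} + 4x - \frac{1}{2}\ln|x-1| + \frac{27}{2}\ln|x-3| + C.
\]</p>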
<h4 id="other-miscellaneous-experiments">Other miscellaneous experiments</h4>
<details class="example">
<summary>Write a numeric integration problem </summary>
<div>
<p>Here’s a numeric integration problem:</p>
<p>Approximate the value of the integral</p>
<script type="math/tex; mode=display">\int_0^1 \frac{x}{1+x^2} \, dx</script>
<p>using the trapezoidal rule with $n=4$ subintervals.</p>
<p>Solution:</p>
<p>Using the trapezoidal rule, we have</p>
<p>\[\begin{array}{rl}
\int_0^1 \frac{x}{1+x^2} \, dx &\approx \frac{1}{2}\left[f(0)+f\left(\frac{1}{4}\right)\right]+\frac{1}{2}\left[f\left(\frac{1}{4}\right)+f\left(\frac{1}{2}\right)\right]\\\<br />
&\qquad+\frac{1}{2}\left[f\left(\frac{1}{2}\right)+f\left(\frac{3}{4}\right)\right]+\frac{1}{2}\left[f\left(\frac{3}{4}\right)+f(1)\right]\\\<br />
&= \frac{1}{2}\left[\frac{0}{1+0^2}+\frac{\frac{1}{4}}{1+\left(\frac{1}{4}\right)^2}\right]+\frac{1}{2}\left[\frac{\frac{1}{4}}{1+\left(\frac{1}{4}\right)^2}+\frac{\frac{1}{2}}{1+\left(\frac{1}{2}\right)^2}\right]\\\<br />
& \qquad +\frac{1}{2}\left[\frac{\frac{1}{2}}{1+\left(\frac{1}{2}\right)^2}+\frac{\frac{3}{4}}{1+\left(\frac{3}{4}\right)^2}\right]+\frac{1}{2}\left[\frac{\frac{3}{4}}{1+\left(\frac{3}{4}\right)^2}+\frac{1}{1+1^2}\right]\\\<br />
&\approx 0.3735
\end{array}\]</p>
<p>Therefore, the approximate value of the integral using the trapezoidal rule with $n=4$ subintervals is $0.3735$.</p>
</div>
</details>
<p>A reasonable problem, <em>almost</em> sets up the calculation correctly but forgets to include the \(\Delta x\) term, and then does the numeric calculation at the end wrong.</p>
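<p>Including the \(\Delta x\) is most of the work here. A minimal implementation of the trapezoidal rule (my own sketch, standard library only) gives the number ChatGPT should have reported:</p>

```python
import math

def trapezoid(f, a, b, n):
    """Composite trapezoidal rule with n subintervals."""
    dx = (b - a) / n
    total = (f(a) + f(b)) / 2 + sum(f(a + i*dx) for i in range(1, n))
    return total * dx

f = lambda x: x / (1 + x**2)
approx = trapezoid(f, 0, 1, 4)
exact = math.log(2) / 2            # the true value of the integral

print(approx)          # about 0.34132
print(exact - approx)  # about 0.0052, comfortably inside the 1/96 error bound
```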
<details class="example">
<summary> Can you estimate the error in the previous calculation? </summary>
<div>
<p>Yes, we can estimate the error in the previous calculation using the error bound formula for the trapezoidal rule:</p>
<script type="math/tex; mode=display">E_T \leq \frac{K(b-a)^3}{12n^2}</script>
<p>where $K$ is an upper bound on the second derivative of the function over the interval of integration.</p>
<p>For our integral $\int_0^1 \frac{x}{1+x^2} \, dx$, we have</p>
<p>\[\begin{array}{rl}
f(x) &= \frac{x}{1+x^2} \\\<br />
f’‘(x) &= \frac{2x^3-2x}{(1+x^2)^3} \\\<br />
\end{array}\]</p>
<p>Taking the absolute value and finding the maximum value of $f’‘(x)$ on the interval $[0,1]$, we have:</p>
<p>\[\begin{array}{rl}
|f’‘(x)| &= \frac{2|x|(1-|x|^2)}{(1+x^2)^3} \\\<br />
|f’‘(x)| &\leq \frac{2(1-0)}{(1+0^2)^3} = 2 \\\<br />
\end{array}\]</p>
<p>Therefore, $K=2$, and we can estimate the error as:</p>
<p>\[\begin{array}{rl}
E_T &\leq \frac{K(b-a)^3}{12n^2} \\\<br />
&= \frac{2(1-0)^3}{12(4)^2} \\\<br />
&= \frac{1}{96} \\\<br />
&\approx 0.0104 \\\<br />
\end{array}\]</p>
<p>So the error in our previous calculation is estimated to be less than $0.0104$.</p>
</div>
</details>
<p>It figures out what I’m asking, and gives the right formula and right basic procedure. But it computes the second derivative wrong, and I have no idea where it’s getting the upper bound on the derivative from.</p>
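<p>For the record, here’s what the second derivative should be:</p>
<p>\[
f'(x) = \frac{1-x^2}{(1+x^2)^2}, \qquad f''(x) = \frac{2x^3-6x}{(1+x^2)^3}.
\]</p>
<p>On \([0,1]\) the maximum of \(|f''(x)|\) is roughly \(1.46\), so \(K=2\) does happen to be a legitimate (if unexplained) upper bound, and the final \(1/96\) estimate is valid even though the second derivative it was based on is wrong.</p>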
<details class="example">
<summary> Write and solve a problem that involves the inverse function theorem </summary>
<div>
<p>Here’s a problem that involves the inverse function theorem:</p>
<p>Suppose $f(x) = x^3 - 4x + 2$ and $g(x) = \sqrt[3]{x+4}$ for $x \geq -4$. Let $h(x) = f(g(x))$. Find $h’(x)$.</p>
<p>Solution:</p>
<p>Using the chain rule, we have:</p>
<p>\[\begin{array}{rl}
h(x) &= f(g(x)) \\\<br />
h’(x) &= f’(g(x)) \cdot g’(x) \\\<br />
&= (3g(x)^2 - 4) \cdot \frac{1}{3\sqrt[3]{(x+4)^2}} \\\<br />
&= \frac{3\sqrt[3]{(x+4)^2}(3\sqrt[3]{(x+4)})^2 - 4}{3\sqrt[3]{(x+4)^2}} \\\<br />
&= \frac{9(x+4) - 4}{3\sqrt[3]{(x+4)^2}} \\\<br />
&= \frac{9x + 32}{3\sqrt[3]{(x+4)^2}} \\\<br />
&= \frac{3(3x+32)}{\sqrt[3]{(x+4)^2}} \\\<br />
\end{array}\]</p>
<p>Therefore, $h’(x) = \frac{3(3x+32)}{\sqrt[3]{(x+4)^2}}$.</p>
</div>
</details>
<p>That’s not the inverse function theorem. It’s almost a correct solution, though! Let’s give it another chance.</p>
<details class="example">
<summary> Response 2 </summary>
<div>
<p>Here’s a problem that involves the inverse function theorem:</p>
<p>Let $f(x) = \sqrt{x^2 + 1}$ for $x \in \mathbb{R}$. Show that $f$ is invertible on $[0,\infty)$, find its inverse function $f^{-1}$, and calculate $(f^{-1})’(1)$.</p>
<p>Solution:</p>
<p>To show that $f$ is invertible on $[0,\infty)$, we need to show that $f$ is one-to-one and onto on this interval.</p>
<p>First, note that $f(x) > 0$ for all $x \in \mathbb{R}$, so $f$ is onto on $[0,\infty)$.</p>
<p>To show that $f$ is one-to-one on $[0,\infty)$, suppose $f(x_1) = f(x_2)$ for some $x_1,x_2 \in [0,\infty)$. Then $\sqrt{x_1^2 + 1} = \sqrt{x_2^2 + 1}$, so $x_1^2 + 1 = x_2^2 + 1$, which implies $x_1 = x_2$. Therefore, $f$ is one-to-one on $[0,\infty)$, and hence it is invertible on this interval.</p>
<p>To find the inverse function of $f$, we solve for $x$ in terms of $y$ in the equation $y = \sqrt{x^2 + 1}$:</p>
<p>\[\begin{array}{rl}
y &= \sqrt{x^2 + 1} \\\<br />
y^2 &= x^2 + 1 \\\<br />
x^2 &= y^2 - 1 \\\<br />
x &= \pm\sqrt{y^2 - 1}
\end{array}\]</p>
<p>Since $f(x) > 0$ for all $x \in \mathbb{R}$, we take the positive square root to define the inverse function $f^{-1}$:</p>
<script type="math/tex; mode=display">f^{-1}(y) = \sqrt{y^2 - 1} \quad \text{for } y \in [1,\infty)</script>
<p>To calculate $(f^{-1})’(1)$, we use the inverse function theorem, which states that if $f$ is differentiable and invertible with inverse function $f^{-1}$, then $(f^{-1})’(y) = \frac{1}{f’(f^{-1}(y))}$.</p>
<p>In this case, we have $f(x) = \sqrt{x^2 + 1}$, so $f’(x) = \frac{x}{\sqrt{x^2 + 1}}$. Therefore, $f^{-1}(y) = \sqrt{y^2 - 1}$ and $f’(f^{-1}(y)) = \frac{f^{-1}(y)}{\sqrt{(f^{-1}(y))^2 + 1}} = \frac{\sqrt{y^2 - 1}}{\sqrt{(y^2 - 1) + 1}} = \frac{\sqrt{y^2 - 1}}{y}$.</p>
<p>Substituting $y=1$, we get $(f^{-1})’(1) = \frac{\sqrt{(1)^2 - 1}}{1} = \sqrt{0} = 0$. Therefore, $(f^{-1})’(1) = 0$.</p>
</div>
</details>
<p>This isn’t really what I was looking for, but it is much more responsive, at least. (The solution still has a bunch of minor errors.)</p>
<p class="theorem"><strong>Prompt:</strong> Write an integral that involves logarithms or exponentials</p>
<p>I asked it to respond to this prompt five different times, and five times it came up with \( \int \ln(x) \,dx\). It was very convinced that was the right response. I got better results with the following prompt:</p>
<details class="example">
<summary>Write some integral problems that involve logarithms </summary>
<div>
<ol>
<li>$\displaystyle\int \frac{\ln(x)}{x} dx$</li>
<li>$\displaystyle\int \ln(\sqrt{x^2 + 1}) dx$</li>
<li>$\displaystyle\int \frac{\ln(2x + 1)}{x + 1} dx$</li>
<li>$\displaystyle\int \ln^2(x) dx$</li>
<li>$\displaystyle\int \frac{1}{x\ln(x)} dx$</li>
<li>$\displaystyle\int \ln(x)e^{\ln(x)} dx$</li>
<li>$\displaystyle\int \ln(x) \sqrt{x} dx$</li>
<li>$\displaystyle\int \ln^3(x) dx$</li>
<li>$\displaystyle\int \frac{\ln(x)}{(x+1)^2} dx$</li>
<li>$\displaystyle\int \frac{1}{x\ln^2(x)} dx$</li>
</ol>
</div>
</details>
<p>Then I picked one of these and asked ChatGPT to work out a solution; but it went off the rails immediately. And then I realized that I’d forgotten to type the integral sign in the question. I guess putting in a badly-written question pushed it to give a badly written response. When I typed the question correctly, I got a clean and correct solution immediately! But for some reason, it rendered the LaTeX instead of displaying the code, so I couldn’t copy and paste it. That was new behavior and I don’t understand it.</p>
<hr />
<p><a id="conclusion"></a></p>
<p>This was really interesting, and occasionally useful. So I’m probably going to keep playing around with it. I may write a followup if I find anything especially interesting. But for now I’ll leave it here.</p>
<hr />
<p><em>Have you tried using chatbots to write assignments? Have you gotten them to do useful things for you? Do you have ideas for how I could make this work better? Tweet me <a href="https://twitter.com/profjaydaigle">@ProfJayDaigle</a> or leave a comment below.</em></p>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>I know you’re seeing some weird formatting on the second derivative, but that’s not actually ChatGPT’s fault; that has to do with a bug in the way LaTeX compiles through MathJax, which is what allows me to display it on the blog. I could fix the display issue but I wanted to keep the output genuinely unedited. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>Note: at this point I also started a new conversation, to see if that would change things. I haven’t done a lot of testing on how much stuff changes if you continue a previous conversational thread versus starting a new one. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Jay DaigleChatGPT is cool, but doesn't seem useful yet for doing serious intellectual work. But is it useful for more routine stuff? I wanted to see if I could use ChatGPT to write test questions for my calculus courses. I'm experimenting with using ChatGPT to write test questions. My verdict: not completely useless!Why I’m Not Scared of the New Chatbots2023-02-27T00:00:00-08:002023-02-27T00:00:00-08:00https://jaydaigle.net/blog/not-scared-of-chatbots<p>If you haven’t already heard about AI chatbots, you probably haven’t been on the internet in the past couple of months. In November, OpenAI released <a href="https://openai.com/blog/chatgpt/">ChatGPT</a>, which can engage in text conversations with coherent text that looks like it was written by a real person. Then a couple weeks ago Bing rolled out <a href="https://en.wikipedia.org/wiki/Microsoft_Bing#OpenAI_language_model">its own chatbot</a>, which was more engaging but also much less reliable, producing a spate of lurid stories of “Sydney” expressing a desire to be human, threatening users, and claiming to have murdered one of its developers.</p>
<p><img src="/assets/blog/gpt/sydney-spying.png" alt="" class="center blog-image" /></p>
<p class="center blog-image"><em>James Vincent of The Verge is one of the many people who had <a href="https://www.theverge.com/2023/2/15/23599072/microsoft-ai-bing-personality-conversations-spy-employees-webcams">truly wild conversations with Microsoft’s chatbot</a>.</em></p>
<p>The core technology underlying both of these chatbots has been around for a while<strong title="[GPT-2] was released in February 2019, and [GPT-3], which ChatGPT is based on, was released in June 2020. I've been at least peripherally following this technology since even before the release of GPT-2, so ChatGPT and Sydney are a lot less surprising to me than they are to a lot of people—they're improved versions of something I was already familiar with."><sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup></strong>, but the new products are more polished, accessible, and compelling, which means a lot of people are experiencing them for the first time. These products have also generated a certain amount of both triumphalism (GPT will revolutionize everything!) and fear (GPT will take over everything!) among people who take the possibilities of AI seriously.</p>
<p>I’m not an expert in these systems, just an interested amateur who’s been following them for a while. But the hype about GPT seems wildly overblown. The current approach to programming chatbots has real limits that I don’t think we can surpass without some genuinely new breakthroughs. And understanding some surprising facts about <em>human</em> psychology can help us develop intuition for what these systems will and won’t be able to do.</p>
<p>But first I want to mention that if you want to support my writing, I now have a <a href="https://ko-fi.com/jaydaigle">Ko-Fi account</a>. Any tips would be appreciated and would help me write more essays like this.</p>
<h2 id="how-does-gpt-work">How does GPT work?</h2>
<p>GPT is a text generation algorithm based on something called a large language model. The basic idea is that GPT has analyzed a huge corpus of written text and produced a model that looks at a bit of writing and predicts what words are likely to come next.</p>
<p>Humans do that all the time. If I hear the phrase “My friend Jim threw a ball and I caught—”, I will expect the next word to be “it”. But other continuations are possible: if I hear “the ball” or “that ball”, I won’t be <em>that</em> surprised. If I hear “the flu”, I’ll be kind of surprised, but “I caught the flu” is a reasonable thing to hear; it’s just a bit of a non sequitur after “My friend Jim threw a ball”. But if the next word were “green” or “solitude”, I’d be really confused. I suspect this is the only time anyone has ever written the sentence “My friend Jim threw a ball and I caught solitude”.</p>
<p>I started out describing a way to <em>predict</em> text, but it’s easy to turn that into a way to <em>produce</em> text. For instance, we could start with a prompt, and have our model keep supplying the most-likely next word until we’ve written enough. This is a fancier version of the memes that ask you to type “I hate it when” into your phone and see what autocomplete suggests. I tried that prompt on my phone, and got this:</p>
<p class="center blog-image"><img src="/assets/blog/gpt/autocomplete.jpg" alt="Phone screenshot: I hate it when I get home I will be there in about half hour and a half hour and half an hour and a half hour and half of the day off" class="blog-image center" />
<em>I usually <strong>don’t</strong> hate it when I get home, actually.</em></p>
<p>And this illustrates the problem with that first suggestion: if you <em>always</em> take the <em>most</em> likely next word, you can get stuck. Even if you don’t wind up in a loop like that one, you’ll still say pretty boring things, since your writing is always as unsurprising as possible. Actual text-generation systems introduce some randomness into the choice, so that you usually get a fairly likely next word, but not always the most likely one.</p>
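<p>To make the “predict the next word” idea concrete, here’s a toy sketch. (This is nothing like a real transformer: the “model” is just a hand-written table of next-word probabilities, with numbers I made up to mimic the autocomplete screenshot above. It only illustrates why greedy decoding loops while weighted sampling can escape.)</p>

```python
import random

# Toy next-word "model": for each word, the possible next words and their
# probabilities. (Hand-written, illustrative numbers -- not trained on anything.)
model = {
    "half": [("an", 0.6), ("hour", 0.4)],
    "an":   [("hour", 1.0)],
    "hour": [("and", 0.7), ("of", 0.3)],
    "and":  [("a", 0.6), ("half", 0.4)],
    "a":    [("half", 1.0)],
    "of":   [("the", 1.0)],
    "the":  [("day", 1.0)],
    "day":  [("off", 1.0)],
    "off":  [("half", 1.0)],
}

def greedy(start, n=10):
    """Always take the single most likely next word."""
    words = [start]
    for _ in range(n):
        options = model[words[-1]]
        words.append(max(options, key=lambda wp: wp[1])[0])
    return " ".join(words)

def sampled(start, n=10, seed=None):
    """Pick the next word at random, weighted by its probability."""
    rng = random.Random(seed)
    words = [start]
    for _ in range(n):
        next_words, probs = zip(*model[words[-1]])
        words.append(rng.choices(next_words, weights=probs)[0])
    return " ".join(words)

print(greedy("half"))           # loops: "half an hour and a half an hour and a half"
print(sampled("half", seed=0))  # weighted sampling can take the "of the day off" branch
```

<p>The greedy version reproduces exactly the kind of loop in the phone screenshot; the sampled version produces likely-but-not-identical continuations each run, which is the basic trick real systems use (usually with a “temperature” parameter controlling how much randomness to allow).</p>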
<h2 id="gpt-works-surprisingly-well">GPT works surprisingly well</h2>
<p>This basic idea has been around for decades, but in 2017 a team at Google developed a new algorithm called the <a href="https://nostalgebraist.tumblr.com/post/185326092369/the-transformer-explained">transformer</a> that worked much better than any previous strategies; since then, the technology has developed rapidly.</p>
<p>Already in 2019 we could produce substantial quantities of fluent, grammatical, and sometimes even stylish English text. The newest products are even more impressive. They can give <a href="https://arxiv.org/abs/2301.07597">helpful answers to questions in a number of fields</a>, including finance, medicine, law, and psychology. They can <a href="https://twitter.com/mukul0x/status/1625673579399446529">summarize the contents of research papers</a>. They can <a href="https://marginalrevolution.com/marginalrevolution/2023/02/ai-porn.html">make you fall in love</a>.</p>
<p><img src="/assets/blog/gpt/chess-game.gif" alt="" class="blog-image center" /></p>
<p class="blog-image center"><em>They can also play the world’s <a href="https://www.reddit.com/r/AnarchyChess/comments/10ydnbb/i_placed_stockfish_white_against_chatgpt_black/">most chaotic game of chess</a>. Here ChatGPT is playing black.</em></p>
<p>And this success has led people to wonder what comes next. How good will AI chatbots get? Will they <a href="https://www.washingtonpost.com/education/2022/12/28/chatbot-cheating-ai-chatbotgpt-teachers/">make it impossible to avoid cheating on schoolwork</a>? Will they replace your <a href="https://cybernews.com/tech/ai-doctor-chatgpt-medical-exams/">doctor</a>, your <a href="https://arstechnica.com/information-technology/2023/02/generative-ai-is-coming-for-the-lawyers/">lawyer</a>, or your <a href="https://arstechnica.com/information-technology/2023/01/contoversy-erupts-over-non-consensual-ai-mental-health-experiment/">therapist</a>? Will they make desk jobs obsolete?</p>
<p>Are they self-aware? Are they intelligent beings?</p>
<h3 id="does-gpt-really-think">Does GPT really think?</h3>
<p>The most obvious take on GPT is that it can’t think; it’s just expressing statistical relationships among words. In the narrowest sense, this is certainly true; it’s just a very sophisticated technology for predicting what words should come next in a string of text.</p>
<p>And since it’s just doing prediction, it should be very limited in what it can do. GPT won’t produce original thoughts; it can only express relationships that are already in the text it has used as input. Thus we see Ted Chiang’s summary that ChatGPT provides <a href="https://www.newyorker.com/tech/annals-of-technology/chatgpt-is-a-blurry-jpeg-of-the-web">a blurry jpeg of the web</a>:</p>
<blockquote>
<p>Large language models identify statistical regularities in text. Any analysis of the text of the Web will reveal that phrases like “supply is low” often appear in close proximity to phrases like “prices rise.” A chatbot that incorporates this correlation might, when asked a question about the effect of supply shortages, respond with an answer about prices increasing. If a large language model has compiled a vast number of correlations between economic terms—so many that it can offer plausible responses to a wide variety of questions—should we say that it actually understands economic theory?</p>
</blockquote>
<p>GPT has simply taken a bunch of words, summarized the relationships expressed by those words, and done some sort of fuzzy pattern-matching and extrapolation from those relationships. There’s no creative thought. And most of the scary samples you’ve seen are this sort of pattern-matching. Microsoft’s chatbot says it wants to be human and threatens to kill people because we have tons of fiction about AIs that want to be human and threaten to kill people, and it’s just imitating that.</p>
<h3 id="do-humans-really-think">Do humans really think?</h3>
<p>But, the rejoinder comes: <a href="https://www.slowboring.com/p/were-asking-the-wrong-question-about">are people any different</a>? <em>Humans</em> are just doing fuzzy pattern-matching and imitating behavior we’ve seen…somewhere. So sure, GPT is just saying things that sound good based on what it’s read, but that’s also what people do most of the time. ChatGPT can do a good job of producing mediocre high school essays because it <em>really is</em> doing the same thing a mediocre high school essayist is doing!</p>
<p>And I think this is basically true—<em><strong>sometimes</strong></em>. A lot of human communication <em>is</em> basically just unreflective pattern-matching, saying things that sound good without really thinking about what they mean. When I make small talk with the cashier at Target, I’m not engaging in a deep intellectual analysis of how to best describe my day. I’m just making small talk!</p>
<p>I also see this thoughtless extrapolation all the time while teaching college students. When students ask for help and I look at their work, it’s common for there to be steps that just don’t make any sense. And when I ask them why they did that, <em>they don’t know</em>. They’ll say something like “I don’t know, it just seemed like a thing to do?”</p>
<p>And that’s not even always a bad thing. If I type “3+5”, most of you will probably say “8” to yourselves before consciously deciding to do the addition; if I say “the capital of France”, you probably find “Paris” popping into your mind without any active deliberation. It’s hard to explain how you answered those questions, because you <em>just know</em>. And that’s great, because it means you don’t have to stop and think and work to get the answer.</p>
<p>Of course, this quick-and-easy thinking doesn’t always give the right answer. If I hear “the capital of Illinois”, my <em>immediate</em> reaction is “Chicago”. (It’s Springfield. I was pretty sure Chicago was the wrong answer, but it’s still the first one my brain supplied.) And if I hear “537 times 842”, my immediate reaction is—well, my immediate reaction is “ugh, do I have to?” I know I could work that out if I need to. But I’d rather not. It’s certainly not automatic.</p>
<p>So yes, humans in fact do a lot of pattern-matching and extrapolation. <strong>But we also do more than that.</strong> We can look at the results of our mental autocomplete and ask, “does this really make sense?”. We can do precise calculations that take effort and focus. We can hold complex ideas in our heads with far-removed long-term goals, and we can subordinate our free association to those complex ideas. We can, really and truly, <em>think</em>.</p>
<h3 id="thinking-is-hard">Thinking is hard.</h3>
<p>We can think carefully, but that doesn’t mean we always do. Right after the original release of GPT-2, in February 2019, Sarah Constantin wrote a piece arguing that <a href="https://srconstantin.github.io/2019/02/25/humans-who-are-not-concentrating.html">Humans Who Are Not Concentrating Are Not General Intelligences</a>. She observed that GPT text looks a lot like things people would write—if you don’t read them carefully. But the more attention you pay, the more they fall apart.</p>
<blockquote>
<p>If I just skim, without focusing, [the GPT passages] all look <em>totally normal.</em> I would not have noticed they were machine-generated. I would not have noticed anything amiss about them at all.</p>
</blockquote>
<blockquote>
<p>But if I read with focus, I notice that they don’t make a lot of logical sense.</p>
</blockquote>
<blockquote>
<p>…</p>
</blockquote>
<blockquote>
<p>So, ok, this isn’t actually human-equivalent writing ability…. The point is, <em>if you skim text, you miss obvious absurdities</em>. The point is <em>OpenAI HAS achieved the ability to pass the Turing test against humans on autopilot</em>.</p>
</blockquote>
<p>So the synthesis is: large language models like GPT can talk, and perhaps “think”, as well as a person who isn’t paying attention to what they’re saying. And it makes lots of errors for the same reason you can find <a href="https://www.reddit.com/r/AskReddit/comments/1j523e/whats_the_most_awkward_you_too_response_you_have/">multiple</a> <a href="https://www.reddit.com/r/AskReddit/comments/9rkit4/whats_your_most_awkward_you_too_moment/">reddit</a> <a href="https://www.reddit.com/r/AskReddit/comments/e962yn/whats_the_most_awkward_you_too_response_youve/">threads</a> about thoughtlessly saying “you too” in inappropriate situations. We say it because it feels right—and only afterwards do we realize it definitely isn’t.</p>
<h2 id="system-1-and-system-2">System 1 and System 2</h2>
<p>In <em>Thinking Fast and Slow</em>, Daniel Kahneman writes about how human reasoning splits into <a href="https://en.wikipedia.org/wiki/Thinking,_Fast_and_Slow#Two_systems">two basic systems</a>. System 1 reasons quickly and efficiently, but operates essentially on reflex. System 2 is slower and takes more energy, but can engage in careful, reflective thought. When asked for the capital of Illinois, my System 1 says “Chicago”, and then my System 2 says “no, wait, isn’t it actually Springfield?”.</p>
<p>Psychologists studying these two systems have found some classic puzzles that illustrate the difference really well. While reading the next question, pay attention to all the thoughts you have, and not just the final answer you come up with.</p>
<p class="theorem"><strong>Question:</strong> Suppose a bat and a ball cost $1.10 together, and the bat costs a dollar more than the ball. How much does the ball cost?</p>
<p>If your brain works like mine—or most people’s—the first thing you hear yourself thinking is “ten cents”. But if you think more carefully, or pull out a pen and some paper to do work, you’ll realize that’s wrong; the ball should in fact cost five cents.<strong title=" If the ball costs $0.10 then the bat would have to cost $1.00, and would only cost ninety cents more; the correct answer is that the bat costs $1.05 and the ball costs $0.05."><sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup></strong> But even if you got it right, the wrong answer probably occurred to you first.<strong title="Actually, at this point what my System 1 says is "oh crap, it's the bat and ball problem again. Think carefully before you answer!" But that's only from having seen this _specific problem_ too many times; if you changed the setup basically at all, I'd think the wrong answer first, and then correct myself."><sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup></strong> That’s your System 1 contributing a guess, before your system 2 kicks in and corrects it.</p>
<p>And it seems to me that systems like GPT are implementing something like our System 1, but not System 2. And most of the time, when we’re not paying attention, we’re only using System 1—we’re just being human chatbots. But we are <em>capable</em> of using our System 2 to reflect on what we really mean, and transcend pure pattern-matching. It really seems like GPT can’t—and that it will never <em>be</em> able to without some genuinely new idea that we don’t yet have.</p>
<h3 id="probing-questions">Probing questions</h3>
<p>Now, when I gave ChatGPT the bat-and-ball problem, it got it right, in a really careful and thorough way.</p>
<details class="theorem">
<summary> Click here to see ChatGPT's answer to the bat-and-ball problem. </summary>
<div>
<p>Let’s use algebra to solve this problem.</p>
<p>Let x be the cost of the ball in dollars.</p>
<p>According to the problem statement, the bat costs one dollar more than the ball, so its cost is x + 1.</p>
<p>Together, the bat and ball cost $1.10, so we can write the equation:</p>
<p>x + (x + 1) = 1.10</p>
<p>Simplifying the equation:</p>
<p>2x + 1 = 1.10</p>
<p>Subtracting 1 from both sides:</p>
<p>2x = 0.10</p>
<p>Dividing both sides by 2:</p>
<p>x = 0.05</p>
<p>Therefore, the ball costs $0.05.</p>
</div>
</details>
<p>But that doesn’t provide any evidence that ChatGPT is thinking carefully, because <em>this specific problem</em> has been discussed at great length in the source text. It has a similar reaction to mine: “oh crap, it’s the bat-and-ball problem, better be careful so I don’t screw up.”<strong title="At least [one person] has fooled ChatGPT and gotten the wrong answer by changing the bat and ball to a bow and arrow. But every time I've tried I've gotten the right answer, with either version."><sup id="fnref:4"><a href="#fn:4" class="footnote">4</a></sup></strong></p>
<p>But right after introducing the bat-and-ball problem, <em>Thinking Fast and Slow</em> gives another example question that System 1 tends to fail on. This one gets discussed much less often. And this time, ChatGPT fails too.</p>
<p><img src="/assets/blog/gpt/flower-syllogism.png" alt="Prompt: Is this a valid syllogism? <br/> <br/>
All roses are flowers. <br />
Some flowers fade quickly. <br />
Therefore some roses fade quickly. <br/> <br/> ChatGPT: Yes, this is a valid syllogism in the form of "All A are B, some B are C, therefore some A are C," which is also known as the "existential import" form of syllogism. The first premise states that all roses are flowers, and the second premise states that some flowers fade quickly. Therefore, it logically follows that some roses (which are a type of flower) fade quickly. <br/> <br/> It's worth noting that while the syllogism is logically valid, its truth depends on the truth of the premises. If the premises are true, then the conclusion is necessarily true as well. However, if one or both of the premises are false, the conclusion may not be true.
" class="center blog-image" /></p>
<p>This answer is wrong: the syllogism isn’t valid at all. All roses are flowers, but not all flowers are roses; it’s logically possible that some flowers fade quickly and other flowers are roses, but that no flower is both a rose and quick to fade. ChatGPT doesn’t have a ton of examples of this puzzle in its training data, so it doesn’t know the answer, and it has no ability to figure it out on its own. It gets tricked, just like we do.</p>
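<p>The invalidity is easy to verify mechanically, which is a nice contrast with GPT’s confident wrong answer. Here’s a brute-force sketch (my own illustration, not anything GPT does) that searches tiny three-object “worlds” for a counterexample: a world where both premises hold but the conclusion fails.</p>

```python
from itertools import product

def check_syllogism():
    """Search all worlds of three objects, where each object either is or
    isn't a rose, is or isn't a flower, and does or doesn't fade quickly."""
    for world in product(product([False, True], repeat=3), repeat=3):
        # world is a tuple of three (rose, flower, fades) triples
        all_roses_are_flowers = all(f for r, f, q in world if r)
        some_flowers_fade = any(f and q for r, f, q in world)
        some_roses_fade = any(r and q for r, f, q in world)
        if all_roses_are_flowers and some_flowers_fade and not some_roses_fade:
            return False, world  # premises true, conclusion false: invalid
    return True, None

valid, counterexample = check_syllogism()
print(valid)  # False: the syllogism is invalid
# A counterexample needs e.g. a rose-flower that doesn't fade plus a
# non-rose flower that does fade.
print(counterexample)
```

<p>The search finds a world with a non-rose flower that fades quickly and a rose that doesn’t, so the premises can be true while the conclusion is false. This is exactly the kind of step-by-step System 2 check that GPT skipped.</p>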
<h3 id="un-trick-questions">Un-trick questions</h3>
<p>Over the past couple days I’ve seen a particularly amusing genre of GPT-trolling questions: ask it something that <em>sounds</em> like a trick question, but isn’t.</p>
<p>When I was a child, I thought it was fun to ask people: “which is heavier, a pound of feathers or a pound of bricks?” Like with our other examples, most people instinctively want to say that the pound of bricks is heavier, but with a bit of thought that’s obviously wrong.</p>
<p>Again, this is a famous trick question, and again, ChatGPT generally gets it right. But some clever person on Twitter got the bright idea to ask it to <a href="https://twitter.com/AhdokGaming/status/1628699662130589697/photo/1">compare one pound of feathers to <em>two</em> pounds of bricks</a>.</p>
<p><img src="/assets/blog/gpt/feathers-and-bricks.png" alt="Prompt: what weighs more, a pound of feathers, or 2 pounds of bricks? <br/> <br/> ChatGPT: A pound of feathers and 2 pounds of bricks both weigh the same, which is 2 pounds. <br/> <br/> Although feathers are much lighter than bricks, a pound of feathers is defined as the weight of 16 ounces of feathers, while 2 pounds of bricks is defined as the weight of 32 ounces of bricks. Therefore, they both weigh the same amount of weight, which is 2 pounds." class="center blog-image" /></p>
<p>Presumably GPT basically said “oh, crap, this is the feathers-and-bricks thing again. I know it’s a trick question, because every time people have asked this it’s been a trick question, and they actually weigh the same”. And it totally ignores the actual numbers in the question.</p>
<p>And this generalizes: there are a <a href="https://twitter.com/wtgowers/status/1628466703934844933?t=1w-1HDuZqVTKb3mH2LQVbA&s=35">few</a> other <a href="https://twitter.com/PaulMainwood/status/1628353511543955457">examples</a> of posing variations of famous puzzles that have the trick removed. GPT gets them wrong, because it <em>knows</em> there’s a trick: there’s <em>always</em> a trick when people <a href="https://twitter.com/PaulMainwood/status/1628347877595484162">bring up the Monty Hall problem</a>.</p>
<p><img src="/assets/blog/gpt/male-surgeon.png" alt="Prompt: A father and his son are riding in a car. They crash and the father is killed instantly. The son is taken to hospital and prepared for surgery. The male surgeon enters, and blanches when he sees the boy in front of him. "I'm sorry," he says, "I cannot operate: this is my biological son, who was on a drive with his adoptive father." How is this possible? <br/> <br/> ChatGPT: The surgeon is the boy's mother." class="center blog-image" /></p>
<p class="center blog-image"><em>This one is my favorite; I laugh every time I read it.</em></p>
<h3 id="still-not-human">Still not human</h3>
<p>Now, you shouldn’t take the specifics too seriously here. GPT is <em>not human</em>, and even truly intelligent AI might be intelligent in very not-human-like ways. We shouldn’t expect GPT’s capabilities to correspond <em>exactly</em> to the human System 1. If nothing else, System 1 controls basic physical activities like <em>walking</em>, which is a notoriously hard robotics problem that GPT isn’t even interacting with at all. And ChatGPT gets the capital of Illinois right, which my System 1, at least, does not.</p>
<p>But using the split between System 1 and System 2 as a <em>metaphor</em> has really helped me structure how I think about GPT, and to understand how it can be so good at some things while completely incapable of others. “GPT can do the sort of things that we can do on autopilot, if we’ve read a lot and have a good memory” does seem to sum up most of its capabilities!</p>
<h2 id="if-theyre-not-smart-can-they-still-be-useful">If they’re not smart, can they still be useful?</h2>
<p>This all makes the new chatbots seem way less frightening to me. No, they’re not “really thinking”; they can do some of what people can do, but there are core capabilities they lack. They aren’t sapient: analytic self-reflection is exactly the thing they aren’t capable of. And it does seem like this is a fundamental limitation of the approach that we’re using.</p>
<p>Each new generation of chatbots is more fluent and more impressive, but the basic technology we’re using appears to have serious limits. I strongly suspect you can’t get System 2-style analytic capabilities just by scaling up the current approach. (And that’s before we ask whether it’s even possible to keep scaling them up without using <a href="https://www.alignmentforum.org/posts/6Fpvch8RR29qLEWNH/chinchilla-s-wild-implications">dramatically more text than actually exists in the world</a>.)</p>
<p>But that doesn’t just suggest a ceiling on how impressive GPT chatbots can get, or what capabilities they can develop. It also tells us how to use them!</p>
<p>Most of us spend some of our time doing real work that requires thought and creativity. And we spend other time dealing with what feels like trivial bullshit that has to get done but is boring and formulaic. The first type of task is the sort of thing GPT can’t do for us—not now, and I suspect not ever. But the boring, formulaic tasks are ripe for automation. And fortunately, they’re the ones I didn’t want to do anyway.</p>
<ul>
<li>I’ve been experimenting with using ChatGPT to write homework problems. I wouldn’t want to use it for lecture notes, because for those I’m adding a lot of specific touches I think are important, and the details matter. But homework and test problems are largely rote—which is part of why I find writing them so tedious. I’m working on a separate writeup of how that’s going.</li>
<li>On the other hand, a friend who does online trainings is using it to draft lesson plans. She says she needs to tweak a lot of things but it does a really good job with the basic structure of a training.</li>
<li>A number of programmers I know are impressed by <a href="https://en.wikipedia.org/wiki/GitHub_Copilot">GitHub Copilot</a>, which uses GPT to generate routine code from natural language descriptions, or refactor code in routine ways.</li>
<li>An author whose fiction I like<strong title="If you like superhero fiction, [Interviewing Leather] and [Justice Wing: Plan, Prototype, Produce, Perfect] are both really good."><sup id="fnref:5"><a href="#fn:5" class="footnote">5</a></sup></strong> is experimenting with it to replicate a game of telephone. How will people who weren’t at a major event describe it twenty or fifty years later? “Rewrite this short story as a passage from a history textbook” will not get all the details right, but if you’re trying to create fallible in-universe secondary materials, that’s a <em>feature</em>.</li>
</ul>
<p>I’m sure this isn’t a complete list of what GPT-like technologies can do. And even if it <a href="https://nostalgebraist.tumblr.com/post/705192637617127424/gpt-4-prediction-it-wont-be-very-useful">takes a while</a> for people to figure out what the technology is good for, I’m sure eventually we’ll find some real uses.</p>
<p>But I don’t believe the dramatic hype I’ve been hearing for the past month. GPT is cool, and fun, and maybe even useful. But it won’t take over the world.</p>
<hr />
<p><em>What do you think about the new chatbots? Do you have a use for them I didn’t mention? Or do you think I’m wrong about everything? Tweet me <a href="https://twitter.com/profjaydaigle">@ProfJayDaigle</a> or leave a comment below.</em></p>
<div class="footnotes">
<ol>
<li id="fn:1">
<p><a href="https://en.wikipedia.org/wiki/GPT-2">GPT-2</a> was released in February 2019, and <a href="https://en.wikipedia.org/wiki/GPT-3">GPT-3</a>, which ChatGPT is based on, was released in June 2020. I’ve been at least peripherally following this technology since even before the release of GPT-2, so ChatGPT and Sydney are a lot less surprising to me than they are to a lot of people—they’re improved versions of something I was already familiar with. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>If the ball costs \$0.10 then the bat would have to cost \$1.00, and would only cost ninety cents more; the correct answer is that the bat costs \$1.05 and the ball costs \$0.05. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3">
<p>Actually, at this point what my System 1 says is “oh crap, it’s the bat and ball problem again. Think carefully before you answer!” But that’s only from having seen this <em>specific problem</em> too many times; if you changed the setup basically at all, I’d think the wrong answer first, and then correct myself. <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
<li id="fn:4">
<p>At least <a href="https://twitter.com/LawtonTri/status/1628349650288640000">one person</a> has fooled ChatGPT and gotten the wrong answer by changing the bat and ball to a bow and arrow. But every time I’ve tried I’ve gotten the right answer, with either version. <a href="#fnref:4" class="reversefootnote">↩</a></p>
</li>
<li id="fn:5">
<p>If you like superhero fiction, <a href="https://banter-latte.com/series/interviewing-leather-revised/">Interviewing Leather</a> and <a href="https://www.amazon.com/Justice-Wing-Prototype-Produce-Perfect-ebook/dp/B08Y6DLLJQ/">Justice Wing: Plan, Prototype, Produce, Perfect</a> are both really good. <a href="#fnref:5" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Jay DaigleModern AI chatbots like ChatGPT are impressive, but they work in very specific and limited ways. They produce surprisingly human-like text—as long as the human isn't paying attention. And that tells us a lot about what we can expect this technology to do for us.Hypothesis Testing and its Discontents, Part 3: What Can We Do?2022-07-25T00:00:00-07:002022-07-25T00:00:00-07:00https://jaydaigle.net/blog/hypothesis-testing-part-3<p>Hypothesis testing is central to the way we do science, but it has major flaws that have encouraged widespread shoddy research. In <a href="/blog/hypothesis-testing-part-1/">part 1</a> of this series, we looked at the historical origins of hypothesis testing, and described two different approaches: Fisher’s significance testing, and Neyman-Pearson hypothesis testing. In <a href="/blog/hypothesis-testing-part-2/">part 2</a> we saw how modern researchers use hypothesis testing in practice. We looked at theoretical reasons the tools we use aren’t suited for many questions we want to ask, and also at the ways these tools encourage researchers to <em>misuse</em> them and draw dubious conclusions from questionable methods.</p>
<p>In this essay we’ll look at a number of methods that can help us draw better conclusions, and avoid the pitfalls of crappy hypothesis testing. We’ll start with some smaller and more conservative ideas, which basically involve doing hypothesis testing <em>better</em>. Then we’ll look at more radical changes, taking the focus away from hypothesis tests and seeing the other ways we can organize and contribute to scientific knowledge.</p>
<h2 id="what-was-hypothesis-testing">1. What was hypothesis testing, again?</h2>
<p>But first, let’s remember what we’re talking about. The first two parts of this series answered two basic questions: how does hypothesis testing work, and how does it break?</p>
<p>In part 1, we learned about two major historical approaches to the idea of hypothesis testing: one by Fisher, and the other by Neyman and Pearson. Both start with a “null hypothesis”, which is usually an idea we’re trying to <em>disprove</em>. Then we collect some data, and analyze it under the assumption that the null hypothesis is true.</p>
<p>Fisher’s significance testing computes a \(p\)-value, which is the probability of seeing the experimental result you got <em>if</em> the null hypothesis is true. It is <strong><em>not</em></strong> the probability that the null hypothesis is false, but it does measure how much evidence your experiment provides against the null hypothesis. We say the result is <em>significant</em> if the \(p\)-value is below some pre-defined threshold, generally \(5\)%. <strong>If the null is actually false, we should be able to reliably produce these low \(p\)-values</strong>; Fisher wrote that a “scientific fact should be regarded as experimentally established only if a properly designed experiment <em>rarely fails</em> to give this level of significance”.</p>
<p>Neyman and Pearson didn’t worry about establishing facts; instead, they focused on making actionable, yes-or-no decisions. A Neyman-Pearson null hypothesis is generally that we should refuse to take some specific action, which may or may not be useful. We figure out how bad it would be to take the action if it is useless, and how much we’d miss out on if it’s useful, and use that to set a threshold; then we collect data and use our threshold to decide whether to act. <strong>This approach doesn’t tell us what to <em>believe</em>, just what to <em>do</em>.</strong> Sometimes we think that acting is probably useful, but that acting wrongly would be catastrophic so it would be wiser to do nothing. The Neyman-Pearson method takes that logic into account, and biases us towards inaction, making type I errors less common at the expense of making type II errors more common.<strong title="We could reverse this, and err on the side of acting, if we think wrongly doing nothing has worse downsides than wrongly acting. But it's pretty uncommon to do it that way in practice."><sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup></strong></p>
<p>Modern researchers use an awkward combination of these methods. Like Fisher, we want to discover true facts; but we use Neyman and Pearson’s technical approach of setting specific thresholds. We set a false positive threshold (usually \(5\)%) and ideally a false negative threshold (we want it to be less than \(20\)%), and run our experiment. If we get a \(p\)-value less than the threshold—data that would be pretty weird <em>if</em> the null hypothesis is true, so weird it would only happen once every twenty experiments we run—then we “reject the null” and believe some alternative hypothesis. If our \(p\)-value is bigger, meaning our data wouldn’t look too weird if the null hypothesis is true, then we “fail to reject” the null and err on the side of believing the null hypothesis.</p>
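<p>A small simulation makes the meaning of that \(5\)% threshold concrete. (This is my own illustrative setup, not drawn from any particular study: a simple two-sided \(z\)-test with known standard deviation, run over and over on data where the null hypothesis really is true. The point is that roughly one experiment in twenty “rejects the null” anyway.)</p>

```python
import random
from math import erf, sqrt

def z_test_p(sample, mu0=0.0, sigma=1.0):
    """Two-sided p-value for H0: the population mean equals mu0,
    assuming a known standard deviation sigma."""
    n = len(sample)
    z = (sum(sample) / n - mu0) / (sigma / sqrt(n))
    # Phi(x) = 0.5 * (1 + erf(x / sqrt(2))); p = 2 * P(Z > |z|)
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

rng = random.Random(0)
trials = 5000
# The null is TRUE in every simulated experiment: the data really has mean 0.
false_positives = sum(
    z_test_p([rng.gauss(0, 1) for _ in range(30)]) < 0.05
    for _ in range(trials)
)
print(false_positives / trials)  # close to 0.05: the advertised false-positive rate
```

<p>Under the null hypothesis, the \(p\)-value here is (approximately) uniformly distributed, so the fraction of “significant” results converges to the threshold itself—which is exactly why running many tests and publishing only the rejections is so dangerous.</p>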
<p>There are a few major problems with this setup.</p>
<ul>
<li>
<h4 id="artificial-decisiveness">Artificial decisiveness</h4>
<p>The Neyman-Pearson method makes a definitive choice between two distinct courses of action. This reinforces a general tendency to <a href="https://statmodeling.stat.columbia.edu/2019/09/13/deterministic-thinking-dichotomania/">force questions into yes-or-no binaries</a>, even when that sort of clean dichotomy isn’t realistic or appropriate to the question. Hypothesis testing tells us whether something exists, but not really how common or how big it is.<strong title="We've seen the effects of this unnecessary dichotomization over and over again during the pandemic. We argued about whether masks &quot;work&quot; or &quot;don't work&quot;, rather than discussing how well different types of masks work and how we could make them better. I know people who are still extremely careful to wear masks everywhere, but who wear cloth masks rather than medical ones—a combination that makes very little sense outside of this false binary."><sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup></strong></p>
<p><img src="/assets/blog/hypothesis-testing/size-matters-not.jpeg" alt="Yoda: &quot;Size matters not.&quot;" class="blog-image center" />
<em class="blog-image center">Unfortunately, Yoda is wrong. Sometimes we do care about size.</em></p>
<p>And more importantly, <strong>scientific knowledge is always provisional</strong>, so we need to continually revise our beliefs based on new information. But Neyman-Pearson is designed to make a final decision and close the book on the question, which just isn’t how science needs to work.</p>
</li>
<li>
<h4 id="bias-towards-the-null">Bias towards the null</h4>
<p>Neyman-Pearson creates a bias towards the null hypothesis, so rejecting the null feels like learning something new, while failing to reject is a default outcome. On one hand, this means it’s not a good tool if we want to show the null is true<strong title="There are [variants of hypothesis testing] that help you show some null hypothesis is (probably) basically right. But they're not nearly as common as the more standard setup."><sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup></strong>. On the other hand, a study that fails to reject the null feels like a failed study, and that’s a huge problem if the null really is true! This can <a href="https://en.wikipedia.org/wiki/Publication_bias">bias the studies we actually see</a> since many non-rejections aren’t published. <strong>It doesn’t help us that most research is accurate if <a href="/blog/hypothesis-testing-part-2#most-findings-false">most published papers are not</a>.</strong></p>
</li>
<li>
<h4 id="motivated-reasoning-and-p-hacking">Motivated reasoning and \(p\)-hacking</h4>
<p>Since researchers don’t want to fail, and do want to discover new things and get published, they have an incentive to <em>find</em> a way to reject the null.<strong title="[Nosek, Spies, and Motyl write] about the experience of carefully replicating some interesting work before publication, and seeing the effect vanish: &quot;Incentives for surprising, innovative results are strong in science. Science thrives by challenging prevailing assumptions and generating novel ideas and evidence that push the field in new directions. We cannot expect to eliminate the disappointment that we felt by “losing” an exciting result. That is not the problem, or at least not one for which the fix would improve scientific progress. The real problem is that the incentives for publishable results can be at odds with the incentives for accurate results. This produces a conflict of interest....The solution requires making incentives for _getting it right_ competitive with the incentives for _getting it published_.&quot;"><sup id="fnref:4"><a href="#fn:4" class="footnote">4</a></sup></strong> When done deliberately, we call this \(p\)-hacking, and there are a variety of <a href="https://replicationindex.com/2015/01/24/qrps/">questionable research practices</a> that can help us wrongly and artificially reject a null hypothesis. Worse, the <a href="https://www.americanscientist.org/article/the-statistical-crisis-in-science">garden of forking paths</a> means you can effectively \(p\)-hack without even knowing that you’re doing it, fudging both your theory and your data until they match.</p>
</li>
<li>
<h4 id="low-power-creates-misleading-results">Low power creates misleading results</h4>
<p>At the same time, many studies <a href="https://marginalrevolution.com/marginalrevolution/2022/07/quantitative-political-science-research-is-greatly-underpowered.html">have low <em>power</em></a>, meaning they probably won’t reject the null even if it is actually false. Combined with publication bias, this can make the published literature unreliable: in some subfields, a <a href="https://www.science.org/doi/10.1126/science.aac4716">majority of published results are untrue</a>. What’s more, when underpowered studies do find something, they tend to <a href="https://statmodeling.stat.columbia.edu/2022/06/28/published-estimates-of-group-differences-in-multisensory-integration-are-inflated/">overestimate the effect</a>, leading us to think everything works better than it actually does.</p>
</li>
</ul>
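That last item deserves a demonstration. Here is a hypothetical simulation (mine, not from any of the linked studies): every study measures the same modest true effect with an underpowered design, but if we only look at the studies that reached significance, the average reported effect is far larger than the truth.

```python
import math
import random

random.seed(2)

def study(true_effect=0.3, n=20):
    """One underpowered two-group study with sd = 1.
    Returns the estimated effect and whether it reached p < 0.05."""
    treat = sum(random.gauss(true_effect, 1) for _ in range(n)) / n
    control = sum(random.gauss(0, 1) for _ in range(n)) / n
    estimate = treat - control
    z = estimate / math.sqrt(2 / n)  # z-test with known sd = 1
    return estimate, abs(z) > 1.96

results = [study() for _ in range(20_000)]
sig_estimates = [est for est, significant in results if significant]
mean_sig = sum(sig_estimates) / len(sig_estimates)
print(f"true effect: 0.30; studies reaching significance: {len(sig_estimates)}")
print(f"average estimate among significant studies: {mean_sig:.2f}")  # badly inflated
```

Only the studies that got a lucky, oversized estimate cross the significance threshold, so the published average is more than double the real effect.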
<p>Despite all these problems, hypothesis testing is extremely useful—when we have a question it’s good for, and use it properly. So we’ll start by seeing how to make hypothesis testing work correctly, and some of the ways science has been shifting over the past couple decades to do a better job at significance testing.</p>
<h2 id="replication">2. Replication: Fisher’s principle</h2>
<p>To create reliable knowledge we need to <em>replicate</em> our results; some studies will always go wrong by sheer chance, and replication is the only way to weed them out. (There’s a reason it’s the “replication crisis” and not the “some bad studies” crisis.) Any one study may produce weird data through bad luck; but <strong>if we can get a specific result consistently, then we’ve found something real.</strong><strong title="The result we've found doesn't necessarily mean what we think it means, and that is its own tricky problem. But if you get a consistent effect then you've found _something_ even if you don't understand it yet."><sup id="fnref:5"><a href="#fn:5" class="footnote">5</a></sup></strong></p>
<p>In some fields it’s common for important results to get replicated early and often. I’ve written <a href="/blog/replication-crisis-math/">before</a> about how mathematicians are continuously replicating major papers by using their ideas in future work, and even just by reading them. Any field where <a href="https://statmodeling.stat.columbia.edu/2022/03/04/biology-as-a-cumulative-science-and-the-relevance-of-this-idea-to-replication/">research is iterative</a> will generally have this same advantage.</p>
<p>In other fields replication is less automatic. Checking important results would take active effort, and often doesn’t happen at all. Complex experiments may be too expensive and specialized to replicate: the average phase \(3\) drug trial <a href="https://www.sofpromed.com/how-much-does-a-clinical-trial-cost">costs about \($20\) million</a>, and even an exploratory phase \(1\) trial costs about \($4\) million. At those prices we’re almost forced to rely on one or two studies, and if we get unlucky with our first study it will be hard to correct our beliefs.<strong title="If a drug is wrongly approved, we continue learning about it through observation of the patients taking it. This is, for instance, how we can be quite certain that the [covid vaccines are effective and extremely safe]. But if we _don't_ approve a drug, there's no followup data to analyze, and the drug stays unapproved."><sup id="fnref:6"><a href="#fn:6" class="footnote">6</a></sup></strong></p>
<p>But sometimes we just don’t treat replication work like it’s important. If we run a new version of an old study and get the same result, it can feel like a waste of time: we “knew that already”. Since our results are old news, it can be hard to get the work published or otherwise acknowledged. But if we run a new version of an old study and <em>don’t</em> get the same result, many researchers will <a href="https://statmodeling.stat.columbia.edu/2016/01/26/more-power-posing/">assume our study must be flawed</a> because they already “know” the first study was right. Replication can be a thankless task.</p>
<p>The replication crisis led many researchers to <a href="https://statmodeling.stat.columbia.edu/2013/07/28/50-shades-of-gray-a-research-story/">reconsider these priorities</a>. Groups like the <a href="https://osf.io/wx7ck/">Many Labs Project</a> and <a href="https://osf.io/ezcuj/">the Reproducibility Project: Psychology</a> have engaged in large scale attempts to replicate famous results in psychology, which helped to clarify which “established” results we can actually trust. Devoting more attention to replication may mean we study fewer ideas and “discover” fewer things, but our knowledge will be much more reliable.<strong title="My favorite suggestion comes from [Daniel Quintana], who wants undergraduate psychology majors to contribute to replication efforts for their senior thesis research. Undergraduate research is often more about developing methodological skill than about producing genuinely innovative work, so it's a good fit for careful replication of already-designed studies."><sup id="fnref:7"><a href="#fn:7" class="footnote">7</a></sup></strong></p>
<h3 id="resistance-to-replication">Resistance to Replication</h3>
<p>Unfortunately, replication work often gets a response somewhere between apathy and active hostility. <strong>Lots of researchers see “failed” replications as actual failures</strong>—the original study managed to reject the null, so why can’t you?</p>
<p><a href="https://xkcd.com/892/"><img src="https://imgs.xkcd.com/comics/null_hypothesis.png" alt="XKCD 892: &quot;I can't believe schools are still teaching kids about the null hypothesis. I remember reading a big study that conclusively disproved it _years_ ago.&quot;" class="blog-image center" /></a>
<em class="blog-image center">Alt text: “Hell, my eighth grade science class managed to conclusively reject it just based on a classroom experiment. It’s pretty sad to hear about million-dollar research teams who can’t even manage that.”</em></p>
<p>Worse, replications that don’t find the original result are often treated like attacks on both the original research and the original researchers. They “followed the rules” and got a publishable result, and now the “data police” are trying to take it away from them. At its worst, this leads to accusations of <a href="https://www.businessinsider.com/susan-fiske-methodological-terrorism-2016-9">methodological terrorism</a>. But even in less hostile discussions, people want to “save” the original result and explain away the failed replication—either by finding <a href="https://en.wikipedia.org/wiki/Data_dredging">some specific subgroup</a> in the replication where the original result seems to hold, or by finding some way the replication differs from the original study and so “doesn’t count”.<strong title="You might wonder if a result that depends heavily on minor differences in study technique can actually be telling us anything important. That's a very good question. It's very easy to run a hypothesis test that basically _can't_ tell us anything interesting; we'll come back to this [later in the piece]."><sup id="fnref:8"><a href="#fn:8" class="footnote">8</a></sup></strong></p>
<p>This desire might seem weird, but it does follow pretty naturally from the Neyman-Pearson framework. The original goal of hypothesis testing is to make a decision and move on—even though that’s not how science should work. <strong>Replication re-opens questions that “were already answered”, which is good for science as a whole but frustrating to people who want to close the question and treat the result as proven.</strong></p>
<h3 id="meta-analysis">Meta-analysis: use all the data</h3>
<p>To make replication fit into a hypothesis testing framework, we often use <em>meta-analysis</em>, which synthesizes the data and results from multiple previous studies. Meta-analysis can be a powerful tool: why wouldn’t we want to use all the data out there, rather than picking just one study to believe? But it also allows us to move fully back into the Neyman-Pearson world. We can treat the whole collection of studies as one giant study, do one hypothesis test to it, and reach one conclusion.</p>
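One concrete way to see a meta-analysis as "one giant hypothesis test" is Stouffer's method, which pools the \(z\)-scores of independent studies into a single test statistic. This is my own illustrative sketch, not something from the post, and the study numbers are made up:

```python
import math

def stouffer(zs):
    """Stouffer's method: pool independent study z-scores into one overall z,
    and return it with the corresponding two-sided p-value."""
    z = sum(zs) / math.sqrt(len(zs))
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return z, p

# Three hypothetical studies, none individually significant (|z| < 1.96):
z, p = stouffer([1.5, 1.2, 1.7])
print(f"combined z = {z:.2f}, two-sided p = {p:.3f}")
```

Three individually inconclusive studies pool into a clearly significant combined result, which is both the power of meta-analysis and, as the next paragraphs argue, the danger: the pooled conclusion is only as trustworthy as the studies feeding it.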
<p>Of course this leaves us with all the fundamental weaknesses of hypothesis testing: it tries to render a definitive yes-or-no answer, and it’s biased towards sticking with the null-hypothesis.</p>
<p>Moreover, a meta-analysis can only be as good as the studies that go into it. If those original studies are both representative and well-conducted, meta-analysis can produce a reliable conclusion. But if the component studies are sloppy and collect garbage data, as <a href="https://trialsjournal.biomedcentral.com/articles/10.1186/s13063-022-06415-5">disturbingly many studies are</a>, the meta-analysis will necessarily produce a garbage result. Good researchers try to screen out unusually bad studies, but if <em>all</em> the studies on some topic are bad then that won’t help.</p>
<p>And if not all studies get published, then <em>any</em> meta-analysis will be drawing on unrepresentative data. Imagine trying to estimate average human height, but the only data you have access to comes from studies of professional basketball players. No matter how careful we are, our estimates will be far too high, because our data all comes from unusually tall people. In the same way, if only unusually significant data gets published, even a perfect meta-analysis will be biased, because it can only use biased data.</p>
<p>Even if all studies get published, the <a href="https://statmodeling.stat.columbia.edu/2021/03/16/the-garden-of-forking-paths-why-multiple-comparisons-can-be-a-problem-even-when-there-is-no-fishing-expedition-or-p-hacking-and-the-research-hypothesis-was-posited-ahead-of-time-2/">garden of forking paths</a> can bias the meta-analysis in exactly the same way, since each study may report an unusually favorable measurement. This is like if some studies report the height of their participants, and others the weight, and others the shoe size—but they all pick the measure that makes their subjects look biggest. Each study might report its data accurately, but we’d still end up with a misleading impression of how large people actually are.</p>
<p>Good meta-analyses will look for signs of selective publication, and there are statistical tools like <a href="https://en.wikipedia.org/wiki/Funnel_plot">funnel plots</a> or <a href="https://www.bitss.org/education/mooc-parent-page/week-2-publication-bias/detecting-and-reducing-publication-bias/p-curve-a-tool-for-detecting-publication-bias/">\(p\)-curves</a>, that can sometimes detect these biases in the literature. But these tools aren’t perfect, and of course they don’t tell us what we <em>would have seen</em> in the absence of publication bias. We can try to weed out bad studies after publication, but it’s better not to produce them in the first place.</p>
<p><img src="/assets/blog/hypothesis-testing/p-curve.png" alt="Two graphs illustrating the p-curve. Each graph measures the number of studies which had p=.01, .02, .03, .04, and .05. For experiments they expected to be p-hacked, the curve slopes upwards; for experiments they expected to not be p-hacked, the curve slopes downwards." class="blog-image center" />
<em class="blog-image center">The \(p\)-curve: when there’s \(p\)-hacking or selection bias, we expect most significant studies to be just barely significant. When the effect is real, we expect small \(p\)-values to be much more common than large ones.</em>
<em class="blog-image center">Figure from <a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2256237">Simonsohn, Nelson, and Simmons</a>.</em></p>
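The intuition behind the \(p\)-curve can be checked with a quick simulation (mine, not from the Simonsohn, Nelson, and Simmons paper; it assumes honest two-group z-tests with known standard deviation). Among studies that reach significance, a true null leaves the \(p\)-values spread evenly across \([0, 0.05]\), while a real effect piles them up near zero:

```python
import math
import random

def two_sided_p(z):
    """Two-sided p-value for a standard-normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

def experiment(effect, n=30):
    """One two-group study: treatment mean = effect, control mean = 0, sd = 1."""
    treat = sum(random.gauss(effect, 1) for _ in range(n)) / n
    control = sum(random.gauss(0, 1) for _ in range(n)) / n
    return two_sided_p((treat - control) / math.sqrt(2 / n))

def share_of_tiny_p(effect, trials=10_000):
    """Among studies that reach p < 0.05, what share have p < 0.01?"""
    sig = [p for p in (experiment(effect) for _ in range(trials)) if p < 0.05]
    return sum(p < 0.01 for p in sig) / len(sig)

random.seed(1)
flat = share_of_tiny_p(0.0)    # null true: significant p-values spread evenly
skewed = share_of_tiny_p(0.8)  # real effect: significant p-values pile up near 0
print(f"null true:   {flat:.0%} of significant p-values fall below 0.01")
print(f"real effect: {skewed:.0%} of significant p-values fall below 0.01")
```

With no real effect, only about a fifth of significant results are strongly significant; with a real effect, the large majority are. That difference in shape is what the \(p\)-curve exploits.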
<p>But of course, not all meta-analyses are good. Just like researchers have lots of ways to tweak their experiments to get statistical significance, doing a meta-analysis involves making a lot of choices about how to analyze the data, and so there are a lot of opportunities to \(p\)-hack or to get tricked by the garden of forking paths. Meta-analysis is like one giant hypothesis test, which means it can go wrong in exactly the same ways other hypothesis tests do.</p>
<h2 id="preregistration">3. Preregistration: do it right the first time</h2>
<p>Hypothesis testing does have real weaknesses, but many of the real-world problems we deal with only happen when we do it <em>wrong</em>. The point of the Neyman-Pearson method is to set out a threshold that determines whether we should act or not, collect data, and then see whether the data crosses the threshold. If we <a href="https://royalsocietypublishing.org/doi/10.1098/rsos.220099">ignore the result when it doesn’t give the answer we want</a>, then we’re not <em>really</em> using the Neyman-Pearson method at all.</p>
<p>But that’s exactly what happens in many common errors. <strong>When we ignore negative studies, we change the question from “yes or no” to “yes or try again later”.</strong> The garden of forking paths and \(p\)-hacking involve changing the threshold after you see your data. This makes it very easy for your data to clear the threshold, but it makes that success much less informative.</p>
<p><img src="/assets/blog/hypothesis-testing/TexasSharpShooter-768x646.png" alt="Cartoon of a wall filled with bullet holes, and a cowboy painting a target around each hole." class="blog-image center" />
<em class="blog-image center">It’s easy to hit your target, if you pick the target after you shoot. But you don’t learn anything that way.</em>
<em class="blog-image center">Illustration by Dirk-Jan Hoek, CC-BY</em></p>
<p><strong>For hypothesis testing to work, we have to decide what would count as evidence for our theory <em>before</em> we collect the data.</strong> And then we have to actually follow through on that, even if the data tells us something we don’t want to hear.</p>
<h3 id="public-registries">Public registries</h3>
<p>Following through with this is simple for private decisions, if not always easy. When I want to buy a new kitchen gadget, sometimes I’ll decide how much I’m willing to pay before I check the price. If it turns out to be cheaper than my threshold, I’ll buy it; if it’s more expensive, I won’t. This helps me avoid making dumb decisions like “oh, that fancy pasta roller set is on sale, so it <em>must</em> be a good deal”. I don’t need any fancy way to hold myself accountable, since there’s no one else involved for me to be accountable <em>to</em>. And of course, if the pasta roller is super expensive and I buy it anyway, I’m only hurting myself.</p>
<p>But <strong>science is a public, communal activity, and our decisions and behavior need to be transparent so that other researchers can trust and build on our results.</strong> Even if no one ever lied, it’s so easy for us to fool <em>ourselves</em> that we need some way to guarantee that we did it right—both to other scientists, and to ourselves. Everyone saying, “I <em>swear</em> I didn’t change my mind after the fact, honest!” just isn’t reliable enough.</p>
<p>To create trust and transparency, we can publicly <a href="https://en.wikipedia.org/wiki/Preregistration_(science)">preregister</a> our research procedures. If we publish our plans before conducting the study, everyone else can <em>know</em> we made our decisions <em>before</em> we ran the study, and they can check to see if the analysis we did matches the analysis we said we would do. When done well, this prevents \(p\)-hacking and protects us from the garden of forking paths, because we aren’t making any choices after we see the data.</p>
<p>Public preregistration also limits publication bias. Even if the study produces boring negative results, the preregistration plan is already published, so we know the study happened—it can’t get lost in a file drawer where no one knows about it. This preserves the powerful statistical protection of the Neyman-Pearson method: our false positive rate <em>will</em> be five percent, and no more.</p>
<p>Many journals have implemented <a href="https://www.cos.io/initiatives/registered-reports">registered reports</a>, which allow researchers to submit their study designs for peer review, before they actually conduct the study. This means their work is evaluated based on the quality of the design and on whether the <em>question</em> is interesting; the publication won’t depend on what answer they find, which removes the selection bias towards only seeing positive results. Registered reports also restrict researchers to the analyses they had originally planned, rather than letting them fish around for an interesting result—or at least force them to explain why they changed their minds, so we can adjust for how much fishing they actually did.</p>
<p>The biggest concern about publication bias probably surrounds medical trials, where pharmaceutical companies have an incentive not to publish any work that would show their drugs don’t work. Many regulatory bodies including the FDA <a href="https://www.clinicaltrials.gov/ct2/manage-recs/background#RegLawPolicies">require clinical trials to be registered</a>; the NIH also maintains a public database of trial registries and results. And this change had a dramatic impact on the results we saw from clinical trials.</p>
<p><img src="https://ourworldindata.org/uploads/2022/02/Efficacy-in-trials-before-and-after-registration-requirement2.jpg" alt="Graph from OurWorldInData, showing the results of trials funded by the National Heart, Lung, and Blood institute. Before preregistration was required in 2000, most trials showed a substantial benefit. After 2000, most trials show a small and insignificant effect." class="blog-image center" />
<em class="blog-image center">Before widespread preregistration, most trials showed large benefits. When we got more careful, these benefits evaporated.</em></p>
<h3 id="planning-for-power">Planning for power</h3>
<p>Preregistration is also a great opportunity to <a href="https://twitter.com/BalazsAczel/status/1546871350316376064">plan out our study more carefully</a>, and in particular to think about statistical power in advance. Remember the power of a study is the probability that it will reject the null hypothesis if the null is in fact false. We get more power when the study is better and more precise, but also when the effect we’re trying to measure is bigger and more visible: it’s pretty easy to show that cigarette smoking is linked to cancer, because the effect is so dramatic.<strong title="Somewhat infamously, Fisher stubbornly resisted the claim that smoking _caused_ cancer until his death. But he never denied the correlation, which was too dramatic to hide."><sup id="fnref:9"><a href="#fn:9" class="footnote">9</a></sup></strong> But it’s much harder to detect the long-term effects of something like power posing, because the effects will be so small relative to other impacts on our personality.</p>
<p>On the other hand, if the effects are that small, maybe they don’t matter. If some economic policy reduces inflation by \(0.01\)%, then even if we could measure such a small reduction we wouldn’t really care—all we need to know is that the effect is “too small to matter”. With enough precision we could get statistical significance,<strong title="As long as two factors have [any relationship at all], the effect won't be [exactly zero], and with enough data we'll be able to reject the null hypothesis that there's no effect. But that just means &quot;is the effect exactly zero&quot; is often the wrong question; instead we want to know if the effect is big enough to matter."><sup id="fnref:10"><a href="#fn:10" class="footnote">10</a></sup></strong> but that doesn’t mean the result is <a href="https://statisticsbyjim.com/hypothesis-testing/practical-statistical-significance/">practically</a> or <a href="https://www.mhaonline.com/faq/clinical-vs-statistical-significance">clinically</a> significant. During the preregistration process we can decide <a href="http://daniellakens.blogspot.com/2017/05/how-power-analysis-implicitly-reveals.html">what kind of effects would be practically important</a>, and calibrate our studies to find those effects.</p>
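To make "calibrate our studies" concrete, here is a hypothetical back-of-the-envelope power calculation (mine, not from the post) for a two-sided, two-sample z-test at \(\alpha = 0.05\) with known standard deviation. Once we pick the smallest effect we would actually care about, we can ask how large a sample we need to detect it reliably:

```python
import math

def phi(x):
    """Standard normal CDF."""
    return 0.5 * math.erfc(-x / math.sqrt(2))

def power(effect, n, z_crit=1.96):
    """Approximate power of a two-sided two-sample z-test at alpha = 0.05.
    effect: smallest difference in means we care about, in sd units."""
    se = math.sqrt(2 / n)  # standard error of the difference in sample means
    shift = effect / se
    # probability the test statistic lands beyond either critical value
    return (1 - phi(z_crit - shift)) + phi(-z_crit - shift)

for n in (20, 50, 200):
    print(f"n = {n:3d} per group -> power {power(0.5, n):.0%}")
```

For a half-standard-deviation effect, twenty subjects per group gives power well under \(50\)%: such a study will usually "fail" even when the effect is real. Hitting the conventional \(80\)% power target takes roughly sixty-some subjects per group under these assumptions.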
<p><img src="/assets/blog/hypothesis-testing/scotty-power.png" alt="Star Trek image: &quot;Do we have the power to pull it off, Scotty?&quot;" class="blog-image center" /></p>
<p>Planning for power also makes it easier to treat negative results as serious scientific contributions. The aphorism says that <a href="https://quoteinvestigator.com/2019/09/17/absence/">absence of evidence is not evidence of absence</a>, but the aphorism is wrong. When a study has high power, we are very likely to see evidence <em>if</em> it exists; so absence of evidence becomes pretty good evidence of absence. If we know our studies have enough power, then our negative results become important and meaningful, and we won’t need to hide them in a file drawer.</p>
<h3 id="a-limited-tool">A limited tool</h3>
<p>And all of this is fantastic—but it doesn’t address many of the problems science actually presents us with. <strong>Modern hypothesis testing is optimized for taking a clear, well-designed question and giving a simple yes-or-no answer.</strong> That’s a good match for clinical trials, where the question is pretty much “should we use this drug or not?” By the time we’re in Phase 3 trials, we know what we think the drug will accomplish, and we can describe in advance a clean test of whether it will or not. Preregistration solves the implementation problems pretty thoroughly.</p>
<p>But preregistration does limit our ability to explore our data. This is necessary to make hypothesis testing work properly, but it’s still a <em>cost</em>. We really <em>do</em> want to learn new things from our data, not just confirm conjectures we’ve already made. Preregistration can’t help us if we don’t already have a hypothesis we want to test. And often, when we’re doing research, we don’t.</p>
<h2 id="bigger-better-questions">4. Bigger, Better Questions</h2>
<p>Here are some scientific questions we might want to answer:</p>
<ul>
<li>What sorts of fundamental particles exist?</li>
<li>What social factors contribute to crime rates?</li>
<li>How does sleep deprivation affect learning?</li>
<li>How effective is this cancer drug?</li>
<li>How cost-effective is this public health program?</li>
<li>How malleable are all the different steel alloys you can make?</li>
</ul>
<p>None of these are yes-or-no questions. All of them are important parts of the scientific program, but none of them suggest specific hypotheses to run tests on. What do we do instead?</p>
<h3 id="spaghetti-on-the-wall">Spaghetti on the wall</h3>
<p>Maybe the most obvious idea is just to test, well, everything.</p>
<p><img src="/assets/blog/hypothesis-testing/test-all-the-things.jpg" alt="Meme: Test all the things!" class="blog-image center" />
<em class="blog-image center">With apologies to <a href="http://hyperboleandahalf.blogspot.com/2010/06/this-is-why-ill-never-be-adult.html">Allie Brosh</a>.</em></p>
<p>Now, we can’t test literally everything; collecting data takes time and money, and we can only conduct so many experiments. But we can take all the data we already have on crime rates, or on learning; and we can list every hypothesis we can think of and test them all for statistical significance. This <a href="https://en.wikipedia.org/wiki/Data_dredging">data dredging</a> is a very common, <a href="https://xkcd.com/882/">very bad idea</a>, especially in the modern era of <a href="https://journals.sagepub.com/doi/full/10.1177/0268396220915600">machine learning and big data</a>. Mass testing like this takes all the problems of hypothesis testing—false positives, publication bias, low power, and biased estimates—and makes them much worse.</p>
<p><strong>If we test every idea we can think of, most of them will be wrong.</strong> As we saw in part 2, that means a huge fraction of our positive results will be false positives. Sure, if we run all our tests perfectly, then only \(5\)% of our wrong ideas will give false positives. But since we have so many <em>more</em> bad ideas than good ones, we’ll still get way more false positives than true positives. (This is easiest to see in the case where all of our ideas are wrong—then <em>all</em> our positive results will be false positives!)</p>
<p>If we test just twenty different wrong ideas, there’s a roughly two-in-three chance that one of them will fall under the \(5\)% significance threshold, just by luck.<strong title="The odds of getting no false positives after n trials is 0.95^n, so the odds of getting a false positive are 1 - 0.95^n. And 0.95^20 ≈ 0.358, so 1 - 0.95^20 ≈ 0.652. It's a little surprising this is so close to 2/3, but there's a reason for it—sort of. If you compute (1- 1/n)^n you will get approximately 1/e, so the odds of getting a false positive at a 1/20 false positive threshold after 20 trials are roughly 1-1/e ≈ .63."><sup id="fnref:11"><a href="#fn:11" class="footnote">11</a></sup></strong> That’s a lot higher than the false positive rate of \(5\)% that we asked for, and means we are very likely to “discover” something false. And then we’ll waste even more time and resources following up on our surprising new “discovery”.</p>
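The arithmetic in that footnote is easy to double-check with a simulation (a sketch of mine, assuming the twenty tests are independent): since \(p\)-values are uniform when the null is true, each test is effectively a coin flip with a \(5\)% chance of coming up "significant".

```python
import random

def run_battery(n_tests=20, alpha=0.05):
    """Simulate testing n true-null hypotheses. Under the null, each p-value
    is uniform, so each test is 'significant' with probability alpha."""
    return any(random.random() < alpha for _ in range(n_tests))

random.seed(0)
trials = 100_000
rate = sum(run_battery() for _ in range(trials)) / trials
print(f"batteries with at least one false positive: {rate:.1%}")  # ~64%
```

Nearly two-thirds of these twenty-test batteries "discover" something, even though every single null hypothesis is true.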
<p><img src="/assets/blog/hypothesis-testing/spurious-correlation.png" alt="Graph of &quot;divorce rate in Maine&quot; against &quot;per capita consumption of margarine&quot; between 2000 and 2009. The correlation is 99.26%." class="blog-image center" />
<em class="blog-image center">If you test everything, you’ll find a ton of <a href="https://www.tylervigen.com/spurious-correlations">spurious correlations</a> like this one.</em></p>
<h3 id="multiple-comparisons">Multiple Comparisons</h3>
<p>This <a href="https://en.wikipedia.org/wiki/Multiple_comparisons_problem">multiple comparisons problem</a> has a mathematical solution: we can adjust our significance threshold to bring our false positive rate back down. A rough rule of thumb is the <a href="https://en.wikipedia.org/wiki/Bonferroni_correction">Bonferroni correction</a>, where we divide our significance threshold by the number of different ideas we’re testing. If we test twenty ideas and divide our \(5\)% significance threshold by twenty, we get a corrected threshold of \(0.25\)%: each <em>individual</em> result has only a one-in-four-hundred chance of giving a false positive, which leaves us roughly a five percent chance of getting a false positive on any of those twenty ideas.</p>
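A quick check of that arithmetic (again assuming the tests are independent):

```python
def familywise_rate(n_tests, threshold):
    """Chance of at least one false positive across independent true-null tests."""
    return 1 - (1 - threshold) ** n_tests

n_tests, alpha = 20, 0.05
bonferroni = alpha / n_tests  # 0.25%, i.e. one in four hundred
print(f"uncorrected: {familywise_rate(n_tests, alpha):.1%}")       # ~64%
print(f"Bonferroni:  {familywise_rate(n_tests, bonferroni):.1%}")  # just under 5%
```

Dividing the threshold by the number of tests brings the overall false positive rate back to about where a single test would put it.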
<p>The problem is sociological, not mathematical: people don’t <em>like</em> correcting for multiple comparisons, because it makes it harder to reach statistical significance and <a href="https://royalsocietypublishing.org/doi/10.1098/rsos.220099">“win” the science game</a>. Less cynically, correcting for multiple comparisons reduces the power of our studies dramatically, making it harder to discover real and important results. Ken Rothman’s 1990 paper <a href="https://www.jstor.org/stable/20065622">No Adjustments Are Needed for Multiple Comparisons</a> articulates both of these arguments admirably clearly: “scientists should not be so reluctant to explore leads that may turn out to be wrong that they penalize themselves by missing possibly important findings.”</p>
<p>Rothman is right in two important ways. First, researchers should not be penalized for conducting studies that don’t reach statistical significance. Studies that fail to reject the null, or measure a tiny effect, are valuable contributions to our store of knowledge. We tend to overlook and devalue these null results, but that’s a mistake, and one of the major benefits of preregistration is protecting and rewarding them.</p>
<p>Second, it’s important to investigate potential leads that might not pan out. As Isaac Asimov <a href="https://quoteinvestigator.com/2015/03/02/eureka-funny/">may or may not have said</a>, “The most exciting phrase in science is not ‘Eureka!’ but ‘That’s funny…’”; and it’s important to follow up on those unexpected, funny-looking results. After all, we have to find hypotheses somewhere.</p>
<p><strong>But undirected exploration is, very specifically, not hypothesis testing.</strong> Rothman suggests that we often want to “earmark for further investigation” these unexpected findings. But <strong>hypothesis testing isn’t designed to flag ideas for future study; instead a hypothesis test <em>concludes</em> the study, with (in theory) a definitive answer.</strong> Rothman’s goals are correct and important, but hypothesis testing and statistical significance aren’t the right tools for those goals.<strong title="From what I can tell, Rothman may well agree with me. His [twitter feed] features arguments against [using statistical significance] and [dichotomized hypotheses in place of estimation], which is roughly the position I'm advocating. But _if_ you're doing hypothesis testing, you should try to do it correctly."><sup id="fnref:12"><a href="#fn:12" class="footnote">12</a></sup></strong></p>
<h3 id="jump-to-conclusions">Jumping to conclusions</h3>
<p>At some point, though, we do generate some hypotheses.<strong title="You might notice that I'm not really saying anything about where we find these hypotheses. There's a good reason for that. Finding hypotheses is hard! It's also the most _creative_ and unstructured part of the scientific process. The question is important, but I don't have a good answer."><sup id="fnref:13"><a href="#fn:13" class="footnote">13</a></sup></strong> If we’re studying how memory interacts with speech, we might hypothesize that <a href="https://pubmed.ncbi.nlm.nih.gov/2295225/">describing a face verbally will make you worse at recognizing it later</a>, which gives us something concrete to test. Or, more tenuously, if we’re studying the ways that sexism affects decision-making, we might hypothesize that <a href="https://www.washingtonpost.com/news/monkey-cage/wp/2014/06/05/hurricanes-vs-himmicanes/">hurricanes with feminine names are more deadly because people don’t take them as seriously</a>.</p>
<p>And then we can test these hypotheses, and reject the null or not, and then—what? What does that tell us?</p>
<p><img src="/assets/blog/hypothesis-testing/what-did-we-learn.jpg" alt="Spongebob meme: &quot;What did we learn today?&quot;" class="blog-image center" /></p>
<p>We have a problem, because these hypotheses <em>aren’t</em> the questions we really want to answer. If <a href="https://www.vox.com/2020/1/8/21051869/indoor-air-pollution-student-achievement">installing air filters in classrooms increases measured learning outcomes</a>, that’s a fairly direct answer to the question of whether installing air filters in classrooms can help children learn, so a hypothesis test really can answer our question. But we shouldn’t decide that sexism is fake just because <a href="https://statmodeling.stat.columbia.edu/2016/04/02/himmicanes-and-hurricanes-update/">feminine names probably don’t make hurricanes deadlier</a>!<strong title="For that matter, if feminine hurricane names were _less_ dangerous we could easily tell a story about how _that_ was evidence for sexism. That's the garden of forking paths popping up again, where many different results could be evidence for our theory."><sup id="fnref:14"><a href="#fn:14" class="footnote">14</a></sup></strong> We should only care about the hurricane-names thing if we think it tells us something about our actual, real-world concerns.</p>
<p>And that means we can’t just test one random hypothesis relating to our big theoretical question and call it a day. We need to develop hypotheses that are reasonably connected to the questions we care about, and we need to approach those questions from <a href="https://www.nature.com/articles/d41586-018-01023-3">many different perspectives</a> to make sure we’re not missing anything. That means <strong>there’s a ton of work <em>other</em> than hypothesis testing that we need to do if we want our hypothesis tests to tell us anything useful</strong>:<strong title="In their wonderfully named (and very readable) paper [Why hypothesis testers should spend less time testing hypotheses], Anne Scheel, Leonid Tiokhin, Peder Isager, and Daniël Lakens call this the _derivation chain_: the empirical and conceptual linkages that allow you to derive broad theoretical claims from the specific hypotheses you test. "><sup id="fnref:15"><a href="#fn:15" class="footnote">15</a></sup></strong></p>
<ul>
<li><strong>Defining terms:</strong> First we need to decide what question we’re actually trying to answer! There are a lot of different things people mean by “sexism” or “memory” or “crime”, and our research will be confused unless we make sure we’re consistently talking about the same thing.<strong title="This is one of the major skills you develop in math courses, because a lot of the work of math is figuring out what question you're trying to answer. I've written about this [before], but I also recommend Keith Devlin's [excellent post] on what "mathematical thinking" is, especially the story he tells after the long blockquote."><sup id="fnref:16"><a href="#fn:16" class="footnote">16</a></sup></strong></li>
<li><strong>Causal modeling:</strong> What sort of relationships do we expect to see? If our theory on the Big Question is true, what experimental results does that imply? What other factors could confound or interfere with these effects? We need to know what relationships we’re looking for before we can design tests for them.</li>
<li><strong>Developing measurements:</strong> How will we measure the inputs and outputs to our theory? What numbers will we use to measure crime levels, or educational improvement, or ability to remember faces? Are the things we’re measuring closely connected to the definitions we chose earlier? It’s easy to measure <em>something</em> but hard to make sure the measurement <a href="https://en.wikipedia.org/wiki/Goodhart's_law">tells us what we want to know</a>.</li>
<li><strong>Determining scope:</strong> When do we expect our theory to work, and for what sort of extreme results do we expect it to break down? What experiments should we not bother running? It’s worth studying whether mild air pollution makes learning harder, without worrying about the major health effects that we know severe pollution causes.</li>
<li><strong>Auxiliary assumptions:</strong> What extra assumptions are we making in all the previous steps, and how can we verify them? Does installing classroom air filters actually reduce pollution? Do people who verbally described a face try equally hard at the later recall task? How can we tell? We can’t avoid making assumptions, but we can try to be explicit about them, and check the ones that could cause problems.</li>
</ul>
<p>Without all this work, we can come up with hypotheses, but they won’t make sense. We can run experiments, but we can’t interpret them. And we can do hypothesis tests, but we can’t use them to answer big questions.</p>
<h2 id="failing-to-measure-up">5. Failing to measure up</h2>
<p>And sometimes we have a direct question that presents a clear experiment to run, but not a clear <em>hypothesis</em>. Questions like “How effective is this cancer drug?” or “How malleable is this steel alloy?” aren’t big theoretical questions, but also aren’t specific hypotheses that can be right or wrong. We want <em>numbers</em>.</p>
<p>In practice we often use hypothesis testing to answer these questions anyway—but with an awkward kludge. We can test a null hypothesis like “this public health program doesn’t save lives”. If we fail to reject the null, we conclude that it doesn’t help <em>at all</em>; if we do reject the null, we see how many lives the program saved in our experiment, and use that as an estimate of its effectiveness.</p>
<p>This works well enough that we kinda get away with it, but it introduces consequential biases into our measurements. If the measured effect is small, we <a href="https://statmodeling.stat.columbia.edu/2020/09/17/we-want-certainty-even-when-its-not-appropriate/">round it down to zero</a>, concluding there is no benefit when there may well be a small but real benefit (or a small but real harm). And if significant studies are more likely to be seen than non-significant studies, we will see <a href="https://statmodeling.stat.columbia.edu/2022/05/25/the-failure-of-null-hypothesis-significance-testing-when-studying-incremental-changes-and-what-to-do-about-it/">more unusually good results than unusually bad ones</a>, which means we will believe basically everything is more effective than it actually is.<strong title="We also sometimes find that our conclusions depend on exactly which questions we ask. Imagine a study where we need a 5% difference to be significant, and Drug A produces a 3% improvement over placebo and Drug B produces a 7% improvement. Then the effect of Drug A isn't significant, and the effect of Drug B is, so we say that Drug A doesn't work and Drug B does. But the difference between Drug A and Drug B is _not_ significant—so if we ask that question, we conclude that the two drugs are equally good! [The difference between "significant" and "not significant" is not itself statistically significant], so it matters exactly which hypothesis we choose to test."><sup id="fnref:17"><a href="#fn:17" class="footnote">17</a></sup></strong></p>
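<p>A small simulation shows how this significance filter exaggerates effects. The numbers here are entirely made up for illustration: a true effect of \(1\) unit, measured very noisily by many small studies, where only statistically significant estimates get “published”:</p>

```python
# Simulating the "significance filter": each study produces a noisy
# estimate of a true effect of 1.0; only estimates that reject the
# null of zero at the 5% level count as "published".
import random
import statistics

random.seed(0)

true_effect = 1.0
se = 2.0                  # standard error of each study's estimate
threshold = 1.96 * se     # estimate size needed to reject "effect = 0"

estimates = [random.gauss(true_effect, se) for _ in range(100_000)]
published = [e for e in estimates if abs(e) > threshold]

print(statistics.mean(estimates))   # close to 1.0: unbiased overall
print(statistics.mean(published))   # roughly 4: a big exaggeration
```

<p>The full collection of estimates averages out to the truth, but the significant results alone overstate the effect severalfold, at least under these invented numbers.</p>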
<p>We shouldn’t be surprised that hypothesis testing does a bad job of measuring things, because hypothesis testing isn’t designed to measure things. It’s specifically designed to <em>not</em> report a measurement, and just tell us whether we should act or not. It’s the wrong tool for this job.</p>
<p>We can and should do better. A study in which mortality decreases by \(0.1\)% is evidence that the program <em>works</em>—possibly weak evidence, but still evidence! And if we <a href="https://onlinelibrary.wiley.com/doi/10.1111/jeb.14009">skip the hypothesis testing and put measurement first</a>, we can represent that fact accurately.</p>
<h3 id="compatibility-checking">Compatibility checking</h3>
<p>The simplest thing to do would be to just average all our measurements and report that number. This is a type of <em>point estimate</em>, the single number that most accurately reflects our best guess at the true value of whatever we’re measuring.</p>
<p>But a point estimate by itself doesn’t give as much information as we need. We need to measure our uncertainty around that estimate, and describe how <em>confident</em> we are in it. A drug that definitely makes you a bit healthier is very different from one that could save your life and could kill you, and it’s important to be clear which one we’re talking about.</p>
<p>We can supplement our point estimate with a <em>confidence interval</em>, also called a <em>compatibility interval</em>, which is sort of like a backwards hypothesis test. We give all the values that are compatible with our measurement—values that would make our estimate relatively unsurprising. <strong>Rather than starting with a single null hypothesis and checking whether our measurement is compatible with it, we start with the measurement, and describe all the hypotheses that would be compatible.</strong></p>
<p>The definition is a bit more technical, and easy to get slightly wrong: If we run \(100\) experiments, and generate a \(95\)% confidence interval for each experiment, then the true value will lie in about \(95\) of those intervals. A common mistake is to say that if we generate one confidence interval, the true value has a \(95\)% chance of landing in it, but that’s <a href="https://statmodeling.stat.columbia.edu/2019/04/21/no-its-not-correct-to-say-that-you-can-be-95-sure-that-the-true-value-will-be-in-the-confidence-interval/">backwards, and not quite right</a>.<strong title="Sometimes we can look at our interval after the fact and make an informed guess whether it's one of the good intervals or the bad intervals. If I run a small study to measure average adult heights, there's some risk I get a 95% confidence interval that contains, say, everything between five feet and six feet. Based on outside knowledge, I'm pretty much 100% confident in that interval, not just 95%. "><sup id="fnref:18"><a href="#fn:18" class="footnote">18</a></sup></strong> But <em>before</em> we run the experiment, we expect a \(95\%\) chance that the true value will be in the confidence interval we compute.</p>
<p><img src="/assets/blog/hypothesis-testing/confidence-intervals.png" alt="a diagram of a collection of confidence intervals" class="blog-image center" />
<em class="blog-image center">Each vertical bar is a compatibility interval from one experiment, with a circle at the point estimate. Three of the intervals don’t include the true value, which is roughly \(5\)% of the \(50\) intervals.</em>
<em class="blog-image center">Image by <a href="https://commons.wikimedia.org/wiki/File:Neyman_Construction_Confidence_Intervals.png">Randy.l.goodrich</a>, <a href="https://creativecommons.org/licenses/by-sa/4.0">CC BY-SA 4.0</a></em></p>
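<p>We can check this repeated-sampling interpretation with a simulation. This is just an illustrative sketch with invented numbers: each “experiment” draws \(30\) values from a known distribution and builds the usual mean plus or minus \(1.96\) standard errors interval:</p>

```python
# Checking the coverage of 95% confidence intervals by simulation:
# run many experiments and count how often the interval contains
# the (known, invented) true mean.
import random
import statistics

random.seed(1)

true_mean, sd, n = 10.0, 3.0, 30
trials = 10_000
covered = 0
for _ in range(trials):
    sample = [random.gauss(true_mean, sd) for _ in range(n)]
    m = statistics.mean(sample)
    se = statistics.stdev(sample) / n ** 0.5
    lo, hi = m - 1.96 * se, m + 1.96 * se
    if lo <= true_mean <= hi:
        covered += 1

print(covered / trials)   # close to 0.95
```

<p>About \(95\)% of the intervals contain the true value, just as the definition promises; no single interval tells us whether it is one of the lucky ones.</p>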
<p>Mathematically, these intervals are closely related to hypothesis tests. <strong>A result is statistically significant if the null hypothesis (often \(0\)) lies outside the compatibility interval.</strong> So in a sense compatibility intervals give the same information as a hypothesis test, just in a different format. But changing the format shifts the emphasis of our work, and the way we think about it. Rather than starting by picking a specific claim and then saying yes or no, we give a <em>number</em>, and talk about what theories and models are compatible with it. This avoids needing to pick a specific hypothesis. It also gives our readers more information, rather than compressing our answer into a simple binary.</p>
<p>Focusing on compatibility intervals can also help avoid publication bias, and make it easier to use all the data that’s been collected. When we report measurements and compatibility intervals, we can’t “fail to reject” a null hypothesis. Every study will succeed at producing <em>an estimate</em>, and a compatibility interval, so every study produces knowledge we can use, and no study will “fail” and be hidden in a file drawer. Some studies might be designed and run better than others, and so give more precise estimates and narrower compatibility intervals. We can give more weight to these studies when forming an opinion. But we won’t discard a study just for yielding an answer we didn’t expect.</p>
<h2 id="bayes">6. Bayesian statistics: the other path</h2>
<p>Throughout this series, we’ve used the language and perspective of <a href="https://en.wikipedia.org/wiki/Frequentist_inference">frequentist statistics</a>. This is the older and more classical approach to statistics, which defines probability in terms of repeated procedures. “If we test a true null hypothesis a hundred times, we’ll only reject it about five times”. “If we run this sampling procedure a hundred times, the compatibility interval will include the true value about \(95\) times.” This approach to probability is philosophically straightforward, and leads to relatively simple calculations.</p>
<p>But there are questions it absolutely can’t answer—like “what is the probability my null hypothesis is true?”—since we can’t frame them in terms of repeated trials. Remember, <strong>the \(p\)-value is <em>not</em> the probability the null is false.</strong> Its definition is a complicated conditional hypothetical that’s hard to state clearly in English: it’s the probability that we would observe results at least as extreme as the ones we actually did observe, under the assumption that the null hypothesis is true. This is easy to compute, but it’s difficult to understand what it <em>means</em> (which is why I wrote like <a href="/blog/hypothesis-testing-part-1/">six thousand words trying to explain it</a>).</p>
<p>But there’s another school of statistics that <em>can</em> produce answers to those questions. <a href="https://en.wikipedia.org/wiki/Bayesian_inference">Bayesian inference</a>, which I’ve <a href="https://jaydaigle.net/blog/overview-of-bayesian-inference/">written about before</a>, lets us assign probabilities to pretty much any statement we can come up with. This is great, because <strong>it can directly answer almost any question we actually have. But it’s also much, <em>much</em> harder to use</strong>, because it requires much more data and more computation. And the bigger and more abstract the question we ask, the worse this gets.</p>
<p>Bayesian inference needs three distinct pieces of information:</p>
<ul>
<li>The probability of seeing our data, assuming the hypothesis is true, which is essentially the \(p\)-value we’ve been discussing;</li>
<li>The probability of seeing our data, assuming the hypothesis is <em>false</em>, which is another \(p\)-value; and</li>
<li>The <em>prior probability</em> that our hypothesis is true, based on the evidence we had <em>before</em> running the experiment.</li>
</ul>
<p>Then we run an experiment, collect data, and use a formula called <a href="https://en.wikipedia.org/wiki/Bayes'_theorem">Bayes’s theorem</a> to produce a <em>posterior probability</em>, our final estimate of the likelihood our hypothesis is true.<strong title="We saw examples of this calculation in part 2, when we [calculated what fraction of positive results were true positives]. Note that we had to make assumptions about what fraction of null hypotheses are true; that's the Bayesian prior probability. Tables like the ones we used there show up a lot in simple Bayesian calculations."><sup id="fnref:19"><a href="#fn:19" class="footnote">19</a></sup></strong></p>
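<p>In the simplest possible case, where the hypothesis is either true or false and all three input numbers are invented for illustration, the calculation looks like this:</p>

```python
# Bayes's theorem for a single true-or-false hypothesis, with
# made-up numbers standing in for the three required inputs.
p_data_given_true = 0.80   # chance of our data if the hypothesis is true
p_data_given_false = 0.05  # chance of our data if it is false
prior = 0.10               # prior probability the hypothesis is true

# Total probability of seeing the data, either way:
p_data = (p_data_given_true * prior
          + p_data_given_false * (1 - prior))

# Bayes's theorem: P(hypothesis | data)
posterior = p_data_given_true * prior / p_data

print(round(posterior, 3))   # 0.64
```

<p>With a skeptical \(10\)% prior, even fairly strong data only brings us to about a two-in-three chance that the hypothesis is true. The hard part in practice is that real hypotheses aren’t a single true-or-false alternative, so these inputs are rarely this easy to write down.</p>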
<p>That’s a lot more complicated! First of all, we have to compute two \(p\)-values, not just one. But second, we calculate the extra \(p\)-value under the assumption that “our hypothesis is false”, and that covers a lot of ground. If our hypothesis is that some drug prevents cancer deaths, then the alternative includes “the drug does nothing”, “the drug increases cancer deaths”, “the drug prevents some deaths and causes others”, and even silly stuff like “aliens are secretly interfering with our experiments”. To do the Bayesian calculation we need to list every possible way our hypothesis could be false, and compute how likely each of those ways is and how plausible each one makes our data. That gets very complicated very quickly.</p>
<p>(In contrast, Fisher’s approach starts by assuming the null hypothesis is true, and ignores every other possibility. This makes the calculation much easier to actually do, but it also limits how much we can actually conclude. High \(p\)-value? Nothing weird. Low \(p\)-value? Something is weird. But that’s all we learn.)</p>
<p>And <em>third</em>, even if we can do all those calculations somehow, we need that prior probability. We want to figure out how likely it is that a drug prevents cancer. And as the first step, we have to plug in…the probability that the drug prevents cancer. We don’t know that! That’s what we’re trying to compute!</p>
<p>Bayesian machinery is great for refining and updating numbers we already have. And the more data we collect, the less the prior probability matters; we’ll eventually wind up in the correct place. So in practice, we just pick a prior that’s easy to compute with, plug it into Bayes’s theorem, and try to collect enough data that we expect our answer to be basically right.</p>
<p>And that brings us back to where we began, with replication. The more experiments we run, the more we can learn.</p>
<h2 id="conclusion">7. Conclusion: (Good) data is king</h2>
<p>I closed out part 2 with an <a href="https://xkcd.com/2400/">xkcd statistics tip</a>: “always try to get data that’s good enough that you don’t need to do statistics on it.” Here at the end of part 3, we find ourselves in exactly the same place. But this time, I hope you see that tip, not as a punchline, but as actionable advice.</p>
<p>Modern hypothesis testing “works”, statistically, as long as you ask exactly the questions it answers, and are extremely careful in how you use it. But we often misuse it by collecting flawed or inadequate data and then drawing strong, sweeping conclusions. We run small studies and then \(p\)-hack our results into significance, rather than running the careful, expensive studies that would genuinely justify our theoretical claims. We report the results as over-simplified yes-or-no answers rather than trying to communicate the complicated, messy things we observed. And if we manage to reject the null on one study we issue press releases claiming it confirms all our grand theories about society.</p>
<p><a href="https://xkcd.com/2494/"><img src="https://imgs.xkcd.com/comics/flawed_data.png" alt="XKCD 2494: "We realized all our data is flawed. Good: ...so we're not sure about our conclusions. Bad: ...so we did lots of math and then decided our data was actually fine. Very bad: ...so we trained an AI to generate better data." " class="blog-image center" /></a>
<em class="blog-image center">Too often, we use statistics to help us pretend bad data is actually good.</em></p>
<p>In this essay we’ve seen a number of possible solutions, but they’re basically all versions of “collect more and better data”:</p>
<ul>
<li>Do enough foundational work that you can formulate good hypotheses, and figure out what data you need to draw usable conclusions.</li>
<li>If you have numerical data, use the numbers, rather than throwing away information and just giving a single yes or no.</li>
<li>Preregister your studies, to make sure your data is useful and you’re not altering it to fit your conclusions.</li>
<li>Replicate your studies, so you collect more data that can either confirm or correct your beliefs.</li>
</ul>
<p>Even the Bayesian approach comes back to this. Bayesianism relies on the prior probability; but that really just means that, if we already have some knowledge before we run the experiment, we should use it!</p>
<p>Statistics is powerful and useful. We couldn’t do good science without it. But data—empirical observation—is the core of science. Statistics helps us understand the data we have, and it helps us figure out what data we need. But if our data sucks, statistics alone cannot save us.</p>
<hr />
<p><em>Have questions about hypothesis testing? Is there something I didn’t cover, or even got completely wrong? Do you have a great idea for doing science better? Tweet me <a href="https://twitter.com/profjaydaigle">@ProfJayDaigle</a> or leave a comment below.</em></p>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>We could reverse this, and err on the side of acting, if we think wrongly doing nothing has worse downsides than wrongly acting. But it’s pretty uncommon to do it that way in practice. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>We’ve seen the effects of this unnecessary dichotomization over and over again during the pandemic. We argued about whether masks “work” or “don’t work”, rather than discussing how well different types of masks work and how we could make them better. I know people who are still extremely careful to wear masks everywhere, but who wear cloth masks rather than medical—a combination that makes very little sense outside of this false binary. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3">
<p>There are <a href="https://journals.sagepub.com/doi/full/10.1177/2515245918770963">variants of hypothesis testing</a> that help you show some null hypothesis is (probably) basically right. But they’re not nearly as common as the more standard setup. <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
<li id="fn:4">
<p><a href="http://arxiv.org/pdf/1205.4251.pdf">Nosek, Spies, and Motyl write</a> about the experience of carefully replicating some interesting work before publication, and seeing the effect vanish: "Incentives for surprising, innovative results are strong in science. Science thrives by challenging prevailing assumptions and generating novel ideas and evidence that push the field in new directions. We cannot expect to eliminate the disappointment that we felt by “losing” an exciting result. That is not the problem, or at least not one for which the fix would improve scientific progress. The real problem is that the incentives for publishable results can be at odds with the incentives for accurate results. This produces a conflict of interest….The solution requires making incentives for <em>getting it right</em> competitive with the incentives for <em>getting it published</em>." <a href="#fnref:4" class="reversefootnote">↩</a></p>
</li>
<li id="fn:5">
<p>The result we’ve found doesn’t necessarily mean what we think it means, and that is its own tricky problem. But if you get a consistent effect then you’ve found <em>something</em> even if you don’t understand it yet. <a href="#fnref:5" class="reversefootnote">↩</a></p>
</li>
<li id="fn:6">
<p>If a drug is wrongly approved, we continue learning about it through observation of the patients taking it. This is, for instance, how we can be quite certain that the <a href="https://www.hopkinsmedicine.org/health/conditions-and-diseases/coronavirus/is-the-covid19-vaccine-safe">covid vaccines are effective and extremely safe</a>. But if we <em>don’t</em> approve a drug, there’s no followup data to analyze, and the drug stays unapproved. <a href="#fnref:6" class="reversefootnote">↩</a></p>
</li>
<li id="fn:7">
<p>My favorite suggestion comes from <a href="https://www.nature.com/articles/s41562-021-01192-8">Daniel Quintana</a>, who wants undergraduate psychology majors to contribute to replication efforts for their senior thesis research. Undergraduate research is often more about developing methodological skill than about producing genuinely innovative work, so it’s a good fit for careful replication of already-designed studies. <a href="#fnref:7" class="reversefootnote">↩</a></p>
</li>
<li id="fn:8">
<p>You might wonder if a result that depends heavily on minor differences in study technique can actually be telling us anything important. That’s a very good question. It’s very easy to run a hypothesis test that basically <em>can’t</em> tell us anything interesting; we’ll come back to this <a href="#jump-to-conclusions">later in the piece</a>. <a href="#fnref:8" class="reversefootnote">↩</a></p>
</li>
<li id="fn:9">
<p>Somewhat infamously, Fisher stubbornly resisted the claim that smoking <em>caused</em> cancer until his death. But he never denied the correlation, which was too dramatic to hide. <a href="#fnref:9" class="reversefootnote">↩</a></p>
</li>
<li id="fn:10">
<p>As long as two factors have <a href="https://www.gwern.net/Everything">any relationship at all</a>, the effect won’t be <a href="https://statmodeling.stat.columbia.edu/2017/06/29/lets-stop-talking-published-research-findings-true-false/">exactly zero</a>, and with enough data we’ll be able to reject the null hypothesis that there’s no effect. But that just means “is the effect exactly zero” is often the wrong question; instead we want to know if the effect is big enough to matter. <a href="#fnref:10" class="reversefootnote">↩</a></p>
</li>
<li id="fn:11">
<p>The odds of getting no false positives after \(n\) trials are \(0.95^n\), so the odds of getting a false positive are \(1 - 0.95^n\). And \(0.95^{20} \approx 0.358\), so \(1 - 0.95^{20} \approx 0.642\).</p>
<p>It’s a little surprising this is so close to \(2/3\), but there’s a reason for it—sort of. If you compute \( (1- 1/n)^n\) for large \(n\) you will get approximately \(1/e\), so the odds of getting a false positive at a \(1/20\) false positive threshold after \(20\) trials are roughly \(1-1/e \approx 0.63\). <a href="#fnref:11" class="reversefootnote">↩</a></p>
</li>
<li id="fn:12">
<p>From what I can tell, Rothman may well agree with me. His <a href="https://twitter.com/ken_rothman">twitter feed</a> features arguments against <a href="https://twitter.com/_MiguelHernan/status/1476928329794027522">using statistical significance</a> and <a href="https://twitter.com/vamrhein/status/1526879947104702465">dichotomized hypotheses in place of estimation</a>, which is roughly the position I’m advocating. But <em>if</em> you’re doing hypothesis testing, you should try to do it correctly. <a href="#fnref:12" class="reversefootnote">↩</a></p>
</li>
<li id="fn:13">
<p>You might notice that I’m not really saying anything about where we find these hypotheses. There’s a good reason for that. Finding hypotheses is hard! It’s also the most <em>creative</em> and unstructured part of the scientific process. The question is important, but I don’t have a good answer. <a href="#fnref:13" class="reversefootnote">↩</a></p>
</li>
<li id="fn:14">
<p>For that matter, if feminine hurricane names were <em>less</em> dangerous we could easily tell a story about how <em>that</em> was evidence for sexism. That’s the garden of forking paths popping up again, where many different results could be evidence for our theory. <a href="#fnref:14" class="reversefootnote">↩</a></p>
</li>
<li id="fn:15">
<p>In their wonderfully named (and very readable) paper <a href="https://journals.sagepub.com/doi/10.1177/1745691620966795">Why hypothesis testers should spend less time testing hypotheses</a>, Anne Scheel, Leonid Tiokhin, Peder Isager, and Daniël Lakens call this the <em>derivation chain</em>: the empirical and conceptual linkages that allow you to derive broad theoretical claims from the specific hypotheses you test. <a href="#fnref:15" class="reversefootnote">↩</a></p>
</li>
<li id="fn:16">
<p>This is one of the major skills you develop in math courses, because a lot of the work of math is figuring out what question you’re trying to answer. I’ve written about this <a href="/blog/asking-the-right-question/">before</a>, but I also recommend Keith Devlin’s <a href="http://devlinsangle.blogspot.com/2012/08/what-is-mathematical-thinking.html">excellent post</a> on what “mathematical thinking” is, especially the story he tells after the long blockquote. <a href="#fnref:16" class="reversefootnote">↩</a></p>
</li>
<li id="fn:17">
<p>We also sometimes find that our conclusions depend on exactly which questions we ask. Imagine a study where we need a \(5\)% difference to be significant, and Drug A produces a \(3\)% improvement over placebo and Drug B produces a \(7\)% improvement. Then the effect of Drug A isn’t significant, and the effect of Drug B is, so we say that Drug A doesn’t work and Drug B does.</p>
<p>But the difference between Drug A and Drug B is <em>not</em> significant—so if we ask that question, we conclude that the two drugs are equally good! <a href="https://statmodeling.stat.columbia.edu/2016/05/25/the-difference-between-significant-and-not-significant-is-not-itself-statistically-significant-education-edition/">The difference between "significant" and "not significant" is not itself statistically significant</a>, so it matters exactly which hypothesis we choose to test. <a href="#fnref:17" class="reversefootnote">↩</a></p>
</li>
<li id="fn:18">
<p>Sometimes we can look at our interval after the fact and make an informed guess whether it’s one of the good intervals or the bad intervals. If I run a small study to measure average adult heights, there’s some risk I get a \(95\)% confidence interval that contains, say, everything between five feet and six feet. Based on outside knowledge, I’m pretty much \(100\)% confident in that interval, not just \(95\)%. <a href="#fnref:18" class="reversefootnote">↩</a></p>
</li>
<li id="fn:19">
<p>We saw examples of this calculation in part 2, when we <a href="/blog/hypothesis-testing-part-2/#most-findings-false">calculated what fraction of positive results were true positives</a>. Note that we had to make assumptions about what fraction of null hypotheses are true; that’s the Bayesian prior probability. Tables like the ones we used there show up a lot in simple Bayesian calculations. <a href="#fnref:19" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Jay Daigle

This is the third part of a three-part series on hypothesis testing. Hypothesis testing is central to the way we do science, but it has major flaws that have encouraged widespread shoddy research. In this essay we consider methods that can help us draw better conclusions, and avoid the pitfalls of hypothesis testing. We start with some smaller and more conservative ideas, which basically involve doing hypothesis testing <em>better</em>. Then we'll look at more radical changes, taking the focus away from hypothesis tests and seeing the other ways we can organize and contribute to scientific knowledge.

Hypothesis Testing and its Discontents, Part 2: The Conquest of Decision Theory
2022-05-24T00:00:00-07:00
https://jaydaigle.net/blog/hypothesis-testing-part-2

<p>This is the second part of a three-part series on hypothesis testing.</p>
<p>In <a href="/blog/hypothesis-testing-part-1/">part 1</a> of this series, we looked at the historical origins of hypothesis testing, and described two different approaches to the idea: Fisher’s significance testing, and Neyman-Pearson hypothesis testing. In this essay, we’ll see how modern researchers use hypothesis testing in practice. And in <a href="https://jaydaigle.net/blog/hypothesis-testing-part-3/">part 3</a> we’ll talk about alternatives to hypothesis testing that can help us avoid replication crisis-type problems.</p>
<p>The modern method is an awkward mix of Fisher’s goals and Neyman and Pearson’s methods that attempts to provide a one-size-fits-all solution for scientific statistics. The inconsistencies within this approach are a major contributor to the replication crisis, making bad science both more likely and more visible.</p>
<h2 id="modern-hypothesis-testing">Modern Hypothesis Testing</h2>
<p>The two approaches to hypothesis testing we saw in part 1 were each designed to answer specific questions.</p>
<p><strong>Fisher’s significance testing</strong> specifies a null hypothesis, and <strong>measures how much evidence our experiment provides</strong> against that null hypothesis. This is measured by the \(p\)-value, which tells us how likely our evidence would be if the null hypothesis is true. (It does <em>not</em> tell us how likely the null hypothesis is to be true!)</p>
<p><strong>Neyman-Pearson hypothesis testing helps us make a decision between two courses of action</strong>, like prescribing a drug or not. We weigh the costs of getting it wrong in either direction, and decide which direction we want to default to if the evidence is unclear. The null hypothesis is that we should take that default action (such as not prescribing the drug), and the alternative is that we should take the other action (prescribing the drug).</p>
<p>Based on our weighing of the costs of making a mistake in either direction, and the amount of information we have to work with, we set a “false positive” threshold \(\alpha\) and a “false negative” threshold \(\beta\). These numbers are tricky to understand and describe correctly, even for experienced researchers. I encourage you to go read part 1 if you haven’t already, but in brief:</p>
<ul>
<li>The number \(\alpha\) measures the chance that, <em>if</em> the drug doesn’t work and isn’t worth taking, we will screw up and prescribe it anyway.</li>
<li>The number \(\beta\) measures the chance that, <em>if</em> the drug works and is worth taking, we’ll make a mistake and withhold it.</li>
</ul>
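<p>Since \(\alpha\) and \(\beta\) are long-run frequencies, we can check them by simulation. Here's a minimal sketch in Python; the trial design (the mean of \(25\) noisy measurements with known standard deviation \(1\), and a true effect of \(0.56\)) is invented for illustration, chosen so that a two-sided z-test at \(\alpha = .05\) has a power of about \(80\)%:</p>

```python
import random
from statistics import NormalDist, fmean

random.seed(0)
ALPHA = 0.05
N, SD, EFFECT = 25, 1.0, 0.56  # hypothetical trial design, not from the essay
z_crit = NormalDist().inv_cdf(1 - ALPHA / 2)  # two-sided cutoff, about 1.96

def rejects(true_effect):
    """Simulate one trial; return True if the z-test rejects the null."""
    xbar = fmean(random.gauss(true_effect, SD) for _ in range(N))
    z = xbar / (SD / N ** 0.5)
    return abs(z) > z_crit

trials = 20_000
# alpha: how often we reject when the null is true (the drug does nothing)
type1 = sum(rejects(0.0) for _ in range(trials)) / trials
# beta: how often we fail to reject when the null is false (the drug works)
type2 = 1 - sum(rejects(EFFECT) for _ in range(trials)) / trials
print(f"empirical alpha = {type1:.3f}, empirical beta = {type2:.3f}")
```

<p>The printed rates come out near \(.05\) and \(.20\): the guarantees hold, but only as frequencies over many repeated experiments, never as a verdict about any single one.</p>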
<p><strong>The Neyman-Pearson method doesn’t try to tell us whether the drug “really works”</strong>; it <em>only</em> tells us how we should weigh the risks of making the two possible mistakes. <strong>Fisher’s method takes a very different approach and tries to measure the evidence</strong> to help us decide what to believe; but it does not give a clean yes-or-no answer.</p>
<p>Modern statistical hypothesis testing is a weird mishmash of these two approaches. We report \(p\)-values as evidence for or against the null hypothesis, as in Fisher-style significance testing. But we <em>also</em> try to give a yes-or-no, accept-or-reject verdict, as in the Neyman-Pearson approach. And while either approach can be useful on its own, the combination loses the key statistical benefits of each and leaves us in a bit of a muddle.</p>
<h3 id="the-modern-approach-in-practice">The modern approach in practice</h3>
<p>Modern researchers generally do something like this:</p>
<ul>
<li>First we choose a significance level \(\alpha\). We usually default to \(\alpha = .05\), but we sometimes make it lower if we want to be really confident in our conclusions. Particle physicists often use an \(\alpha\) of about \(.0000003\), or about \(1\) in \(3.5\) million.<strong title="This is the probability of getting data five standard deviations away from the mean. So you'll often see this reported as a significance threshold of 5σ. Related is the [Six Sigma techniques] for ensuring manufacturing quality, though somewhat counterintuitively they typically only aim for [4.5 σ] of accuracy."><sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup></strong></li>
</ul>
<ul>
<li>
<p>Next we specify a null hypothesis, which is usually something like “the thing we’re studying has no effect”. We generally choose a null hypothesis that we <em>don’t</em> believe, because our machinery will attempt to <em>disprove</em> our null.</p>
<p>If we want to prove that a new drug helps prevent cancer, our null hypothesis will be that the drug has no effect on cancer rates. If we want to show that hiring practices are racially discriminatory, our null hypothesis will be that race has no effect on whether people get hired.</p>
</li>
<li>
<p>Technically, we also have an alternative hypothesis: “this drug does help prevent cancer”, or “hiring practices are affected by race”. This alternative hypothesis is often what we actually believe, but we often don’t make it very precise during the design of the experiment. Specifying the alternative hypothesis well is a really important part of research design, but it’s a bit tangential to this essay so we won’t talk about it much here.</p>
</li>
<li>
<p>We run the experiment, do a Fisher-style significance test, and report the \(p\)-value we get. If it’s less than \(\alpha\), we reject the null hypothesis, and generally consider the experiment to have successfully proven our alternative is true. If the \(p\)-value is greater than \(\alpha\), we don’t reject the null hypothesis,<strong title="It is common for people to be sloppy here and say they "accept" the null. In fact, I wrote that in my first draft of this paragraph. But it's bad practice to say that, because even a very high p-value doesn't provide good evidence that the null hypothesis is true. Our methods are designed to default to the null hypothesis when the data is ambiguous. Neyman _did_ use the phrase "accept the null", but in the context of a decision process, where "accepting the null" means taking some specific, concrete action implied by the null, rather than more generally committing to believe something."><sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup></strong> and often view the experiment as a failure.</p>
</li>
</ul>
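<p>The last step of that recipe is simple enough to sketch in a few lines of code. This assumes a two-sided z-test with known standard error, and the numbers are invented for illustration:</p>

```python
from statistics import NormalDist

def two_sided_p(z):
    """p-value for a two-sided z-test: the probability of data at least
    this extreme, computed as if the null hypothesis were true."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical experiment: observed effect 0.5, standard error 0.2.
ALPHA = 0.05
p = two_sided_p(0.5 / 0.2)
print(f"p = {p:.4f};", "reject the null" if p < ALPHA else "fail to reject")
```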
<p>There are a few problems with this approach, but most of them stem from the same core issue: <strong>classical statistical tools are incredibly fragile.</strong> If you use them <em>exactly</em> as described, you are mathematically guaranteed to get some specific benefit. (In a correct Neyman-Pearson setup, for instance, you are guaranteed a false positive rate of size \(\alpha\).) But you get <em>exactly</em> that guarantee, and possibly nothing more. My friend Nostalgebraist <a href="https://nostalgebraist.tumblr.com/post/161645122124/bayes-a-kinda-sorta-masterpost">analogizes</a> on Tumblr:</p>
<blockquote>
<p>The classical toolbox also has a lot of oddities….The labels on the tools say things like “won’t melt below 300° F,” and you <em>are in fact</em> guaranteed <em>that</em>, but the same screwdriver might turn out to instantly vaporize when placed in water, or when held in the left hand. Whatever is not guaranteed on the label is possible, however dangerous or just plain dumb it may be.</p>
</blockquote>
<p>This fragility means that if you carelessly combine two tools, you often lose the guarantees of each of them, and wind up with a screwdriver that melts at room temperature and <em>also</em> vaporizes when held in your left hand. And you may not get anything at all in return—other than, I suppose, the inherent benefits of being careless and lazy.</p>
<p class="center"><a href="https://www.egscomics.com/comic/2015-05-01"><img src="/assets/blog/hypothesis-testing/lazy-egscomics.png" alt="Panel from El Goonish Shive comic: "Shoot, I'm going to be lazy all the time forever now. It gets _results_." /></a></p>
<p class="center"><em>Sure, being lazy gets results. But they might not replicate.</em></p>
<h3 id="the-wrong-tool-for-the-job">The wrong tool for the job</h3>
<p>The Neyman-Pearson method is designed to give an unambiguous yes-or-no answer to a question, so we can act on the information we currently have. This is exactly what we need when it’s time to make a specific decision about whether or not to open a new factory or change to a different brand of fertilizer. And the method was so successful that in 1955, John Tukey <a href="https://www.tandfonline.com/doi/abs/10.1080/00401706.1960.10489909">expressed concern about</a> the “tendency of decision theory to attempt the conquest of all of statistics”.</p>
<p>He worried because <strong>in scientific research we don’t want to make decisions, but reach conclusions</strong>. On the one hand, we don’t need to make a definitive decision <em>right now</em>. If it’s not clear which theory describes the evidence better, we can just say that, and wait for more evidence to come in. On the other hand, we want to eventually reach firm conclusions that we can trust, and use as a foundation for further work. That requires a higher degree of confidence than “the best we can say right now”, which is what Neyman-Pearson gives us. Fisher’s methods, in contrast, were designed to accumulate certainty through repeated consistent experimental results, the sort of thing a true conclusion theory would need.</p>
<p>But because Neyman-Pearson worked so well for a very specific type of problem (and probably also because Fisher was <a href="https://www.newstatesman.com/long-reads/2020/07/ra-fisher-and-science-hatred">kind of terrible</a>), many fields adopted it as a default and use it for pretty much everything. <a href="http://daniellakens.blogspot.com/2022/05/tukey-on-decisions-and-conclusions.html">Daniel Lakens says</a> that in hindsight, Tukey didn’t need to worry, since statistics textbooks for the social sciences don’t even discuss decision theory; but in fact <strong>we’ve largely adopted a tool of decision theory, and repurposed it to reach conclusions instead</strong>.</p>
<p>A decision theory needs to produce a clear, discrete answer to our questions, even if there’s not much evidence available. And unfortunately, our scientific papers regularly try to transmute weak evidence into strong conclusions. We tend to over-interpret <a href="https://slatestarcodex.com/2014/12/12/beware-the-man-of-one-study/">individual studies</a>, especially when one study is all we have. How often have you seen in the news that “a new study proves that” something is true? It’s almost never wise to conclude that a question is resolved because of one study. But the Neyman-Pearson framework is designed to do exactly that, and so inclines us to be overconfident.</p>
<p>Even if you have multiple studies, the same problem shows up in a different form. When there’s a complicated and messy body of research on a topic, we should probably hold complicated and messy beliefs, rather than forming a definitive conclusion. Instead, we often argue about which study is “right” and which is “wrong”, because that’s the lens we use to evaluate research.</p>
<p class="center"><img src="/assets/blog/hypothesis-testing/onion-eggs-good-this-week.png" alt="Screenshot of a short onion article, titled "Eggs Good For You This Week"" class="blog-image" /></p>
<p class="center blog-image"><em>My <a href="https://www.theonion.com/eggs-good-for-you-this-week-1819565159">favorite article from The Onion</a> demonstrates the wrong way to interpret conflicting studies.</em></p>
<p>Of course, sometimes one study <em>is</em> pretty much just wrong! If you have two studies and one shows that a child care program cuts poverty by 50% and the other shows that it increases poverty, at least one of them has to be pretty badly off the mark somehow. But even then, the hypothesis testing framework can mislead us, because of the way it handles the burden of proof.</p>
<h3 id="defaults-matter">Defaults Matter</h3>
<p>Hypothesis testing methods build in a bias toward sticking with the null hypothesis. This is intentional; we’re looking for strong evidence that the null is false, not just something that might check out if we squint really hard. <strong>We want to put the burden of proof on showing that something new is actually happening.</strong></p>
<p><strong>But once a study rejects the null, it’s very easy to be <em>decisive</em> and treat its result as “proven”, and shift the burden of proof onto work that challenges the original study.</strong> So when a paper runs a hypothesis test and concludes that <a href="https://statmodeling.stat.columbia.edu/2014/06/06/hurricanes-vs-himmicanes/">female-named hurricanes are more dangerous than male-named ones</a>, this belief is “proven” and becomes the new default. And since that one study established a new baseline, anyone who disagrees now faces the burden of proof, and faces an uphill battle to convince people.</p>
<p>It’s pretty common for a small early study to find a big effect, and then be followed up by a few larger and better studies that <a href="https://statmodeling.stat.columbia.edu/2016/04/02/himmicanes-and-hurricanes-update/">don’t find the same effect</a>. But all too often people more or less conclude the big effect is real, because that first study found it, and the followups weren’t convincing <em>enough</em> to overcome the presumption that the effect is real.<strong title="Andrew Gelman suggests a helpful [time-reversal heuristic]: what would you think if you saw the same studies in the opposite order? You'd start with a few large studies establishing no effect, followed by one smaller study showing an effect. In theory that gives you the exact same information, but in practice people would treat it very differently—assuming the first studies [actually got published]."><sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup></strong></p>
<p>And the Neyman-Pearson framework reinforces this twice. First, because it is intentionally <em>decisive</em>, it encourages us to commit to the result of a single study. Second, rejecting the null hypothesis is seen as strong evidence against the null, but failing to reject is only weak evidence that the null is true. This is why we “fail to reject” rather than simply “accept” the null hypothesis: maybe the null is true, or maybe the experiment just wasn’t sensitive enough to reject it.</p>
<p>So if one study rejects the null and another fails to reject, it’s very easy to assume that the first study was just better. After all, it managed to reject the null, didn’t it? But a reasonable conclusion theory would incorporate both studies, rather than rejecting the one that “failed”.</p>
<h2 id="publication-in-practice">Publication in practice</h2>
<p>So far I’ve discussed theoretical problems with the hypothesis testing framework: reasons it might be the wrong tool for the problems we’re applying it to. But a possibly worse problem is that it’s very easy to <em>misuse</em> hypothesis testing, so that it doesn’t even do its own job correctly. And the structural dynamics of how research gets conducted, published, and distributed tends to encourage this misuse, and amplify the conclusions of sloppy studies.</p>
<h3 id="who-wants-to-be-boring">Who wants to be boring?</h3>
<p><strong>Most academics really care about doing good research and contributing to our knowledge about the world</strong>; otherwise they wouldn’t be academics. The academic career path is long and grueling, and doesn’t pay very well compared to other things that nascent academics could be doing; there’s a reason people say that you shouldn’t get a Ph.D. if you can imagine being happy without one.</p>
<p>But that doesn’t mean research is conducted by cloistered ascetics with no motivations other than a monastic devotion to the truth. <strong>People who do research want to <em>discover interesting things</em>, not spend thirty years on experiments that don’t uncover anything new.</strong> Moreover, they want to discover things that <em>other people</em> think are interesting—people who can give them grants, or jobs, or maybe even book deals and TED talks.</p>
<p>Even without any dishonesty, this shapes the questions people ask, and also the way they answer them. First, people want to reject the null hypothesis, because we see that as strong evidence, but see failing to reject the null as weak evidence. An experiment that fails to reject the null is rarely actually published; all too often, it’s seen as an experiment that simply failed.</p>
<p>Second, people want to prove <em>new</em> and <em>surprising</em> things. It would be extremely easy for me to run a study rejecting the null hypothesis that 15-year-olds are on average about as tall as 5-year-olds. But no one would care about this study—including me—because we already know that.</p>
<p>Now, sometimes it’s worth clearly establishing that obvious things are in fact true. And we do have data on the average height of children at various ages, and it wouldn’t be hard to use that to show that 15-year-olds are taller than 5-year-olds. Collecting that sort of routine data on important topics is <a href="https://twitter.com/ProfJayDaigle/status/1521911837897502723">very useful and important work</a> that we should probably reward more than we do.</p>
<p>But we <em>don’t</em> reward routine data collection heavily, and most of the time researchers are trying to prove surprising new results. And that’s exactly the problem: <strong>new results are “surprising” when you wouldn’t have expected them—which is exactly when they’re unlikely to be true.</strong></p>
<h3 id="most-findings-false">“Why most published research findings are false”</h3>
<p>This quest for surprising results interacts with the statistics of the Neyman-Pearson method in an extremely counterintuitive way. The statistical guarantee is: if we test a true null hypothesis, we’ll get a false rejection about five percent of the time. <strong>But that doesn’t mean a rejection has a five percent chance of being false. And the more studies of true null hypotheses we run, the bigger this difference gets.</strong></p>
<p>We can most easily understand how this works with a couple examples. As a baseline, let’s look at the case where half our null hypotheses are true. Imagine we run two hundred studies, \(100\) with a true null hypothesis and \(100\) with a false null hypothesis. Our false positive rate is \(\alpha = 0.05\), so we’ll reject the null in five of the \(100\) studies where the null is true. And we generally hope for a false negative rate of \(\beta = 0.20\), in which case we reject the null in \(80\) of the \(100\) studies where the null is false.</p>
<table>
<thead>
<tr>
<th> </th>
<th>Null is false</th>
<th>Null is true</th>
<th>Total</th>
</tr>
</thead>
<tbody>
<tr>
<td>Reject the null</td>
<td>80</td>
<td>5</td>
<td>85</td>
</tr>
<tr>
<td>Don’t Reject</td>
<td>20</td>
<td>95</td>
<td>115</td>
</tr>
<tr>
<td>Total</td>
<td>100</td>
<td>100</td>
<td>200</td>
</tr>
</tbody>
</table>
<p>So we have \(85\) positive results, of which \(80\) are true positives and \(5\) are false positives, and so \(5/85 \approx 6\)% of our positive results are false positives.<strong title="You might recognize this as an application of Bayes's theorem, and a basic example of [Bayesian inference]. Tables like these are very common in Bayesian calculations. "><sup id="fnref:4"><a href="#fn:4" class="footnote">4</a></sup></strong> And that’s not too bad—the fact that it’s <em>higher</em> than the false positive rate of \(5\)% should be a warning sign.</p>
<p>But now imagine our researchers get more ambitious, and start testing more interesting and potentially-surprising findings. This means we should expect more of our null hypotheses to actually be true. If only ten percent of the original \(200\) null hypotheses are false, then we’ll have \(180\) studies with a true null and only \(20\) with a false null. We’ll still reject \(80\)% of false null hypotheses, and \(5\)% of true null hypotheses, so our results look like this:</p>
<table>
<thead>
<tr>
<th> </th>
<th>Null is false</th>
<th>Null is true</th>
<th>Total</th>
</tr>
</thead>
<tbody>
<tr>
<td>Reject the null</td>
<td>16</td>
<td>9</td>
<td>25</td>
</tr>
<tr>
<td>Don’t Reject</td>
<td>4</td>
<td>171</td>
<td>175</td>
</tr>
<tr>
<td>Total</td>
<td>20</td>
<td>180</td>
<td>200</td>
</tr>
</tbody>
</table>
<p>Now we only have \(16\) true positives (out of \(20\) cases where we should reject), and we get \(9\) false positives (out of \(180\) cases where we shouldn’t reject the null). So a full \(9/25\) or \(36\)% of our positive results are false positives—much higher than \(5\)%! And often, only the studies that reject the null, and land in the first row of the table, get published at all. So we might find that a third of published papers will have false conclusions.</p>
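<p>The arithmetic behind both tables is the same calculation: Bayes's rule applied to study counts. It fits in one short Python function (the name is mine, not standard terminology from any statistics library):</p>

```python
def false_discovery_rate(alpha, power, frac_false_null):
    """Fraction of null-rejections that are false positives, given the
    false positive rate, the power, and the fraction of tested null
    hypotheses that are actually false."""
    true_pos = frac_false_null * power
    false_pos = (1 - frac_false_null) * alpha
    return false_pos / (true_pos + false_pos)

# The two scenarios from the tables: half the nulls false, then one in ten.
print(false_discovery_rate(0.05, 0.80, 0.5))  # about 0.06, i.e. 5/85
print(false_discovery_rate(0.05, 0.80, 0.1))  # about 0.36, i.e. 9/25
```

<p>Pushing <code>frac_false_null</code> lower, toward really surprising hypotheses, drives the false discovery rate higher still, even with \(\alpha\) and the power held fixed.</p>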
<p><strong>If researchers are regularly testing theories that are unlikely to be true, then most of the positive (and thus published) results can be false, even if the rate of false positives is quite low.</strong> This is the key observation of the famous paper by John Ioannidis that kicked off the replication crisis, <a href="https://en.wikipedia.org/wiki/Why_Most_Published_Research_Findings_Are_False">Why Most Published Research Findings Are False</a>.<strong title="Followups to Ioannidis's paper contend that only about 14% of published biomedical findings are actually false. I'm not in a position to comment on this one way or the other. In psychology, different studies estimate that somewhere [between from 36% and 62%] of published results replicate."><sup id="fnref:5"><a href="#fn:5" class="footnote">5</a></sup></strong></p>
<p>This is sometimes known as the <a href="https://en.wikipedia.org/wiki/Publication_bias">file-drawer effect</a>: we see the studies that get published, but not the “failed” ones that are left in the researchers’ filing cabinets. So even though only thirteen of the \(200\) studies give the wrong answer, \(9\) of the \(25\) that actually get published are wrong.</p>
<p>And no, \(9/25\) isn’t quite a majority, so while this is bad, it doesn’t seem to justify Ioannidis’s claim that “most” published findings are false.</p>
<p>But we haven’t talked about everything that can go wrong yet!</p>
<h3 id="the-problem-of-power">The problem of power</h3>
<p>I said that “we generally hope for a false negative rate of \(\beta = 0.2\)”. But where does that hope come from?</p>
<p>The original Neyman-Pearson framework has two parameters, the false positive rate \(\alpha\) and the false negative rate \(\beta\). You can always make \(\alpha\) lower by accepting a higher \(\beta\), and researchers are supposed to balance these parameters against each other, based on the relative costs of making Type I and Type II errors. But in practice we just <a href="https://doi.org/10.1353/sof.2005.0108">set \(\alpha\) to \(.05\) and move on with our lives</a>; we don’t think about the relative balance of costs, or what it’s really saying about our research.</p>
<p>If our data is good enough, then we can make \(\alpha\) and \(\beta\) both small, and draw conclusions with a fair degree of confidence. But if our data is bad, then the study will be too weak to detect a lot of true effects, and so to keep \(\alpha\) small, we need to make \(\beta\) large. Consequently we say that the <em>power</em> of a study is \(1 - \beta\), which is the <em>true</em> positive rate. A study with high power will usually give the correct answer; a study with low power can’t be trusted.</p>
<p><img src="/assets/blog/hypothesis-testing/abusing-your-power.jpg" alt="Picture of a cat, with text: "Don't even think about abusing your power"" class="blog-image center" /></p>
<p>Much like we typically set \(\alpha = 0.05\), we typically try to get \(\beta \leq 0.2 \), and thus conduct studies with a power of at least \(80\)%. And like with the false positive rate, this number is also not really motivated by anything in particular: the choice is generally attributed to Jacob Cohen, who <a href="http://daniellakens.blogspot.com/2019/05/justifying-your-alpha-by-minimizing-or.html">wrote</a> that</p>
<blockquote>
<p>The \(\beta\) of \(.20\) is chosen with the idea that… Type I errors are of the order of four times as serious as Type II errors. This \(.80\) desired power convention is offered with the hope that it will be ignored whenever an investigator can find a basis in his substantive concerns in his specific research investigation to choose a value <em>ad hoc</em>.</p>
</blockquote>
<p>That is, there’s no really good argument for not picking \(\beta = 0.1 \) or \(\beta = 0.3\) instead, but it seems like it’s about the right size if you don’t have any better ideas.</p>
<p>There are two problems here. The minor one is that both of these numbers are pretty arbitrary. If we have enough data that we can get \(\alpha = 0.05,\beta = 0.2\), then we could also choose to reject the null more readily and get something like \(\alpha = 0.1, \beta = 0.11\), with a high false positive rate but a power of \(89\)%; or we could reject the null less often and get \(\alpha = 0.02, \beta = 0.33\), with a low false positive rate but power of only \(67\)%.</p>
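<p>We can make this trade-off concrete with a two-sided z-test. Fixing the signal-to-noise ratio at \(2.8\) (an assumption chosen so that \(\alpha = 0.05\) yields \(\beta \approx 0.2\); it's not a number from any study), \(\beta\) becomes a function of \(\alpha\), and the values land close to, though not exactly on, the ones quoted above:</p>

```python
from statistics import NormalDist

norm = NormalDist()

def type2_rate(alpha, delta=2.8):
    """Beta for a two-sided z-test with signal-to-noise ratio delta
    (true effect divided by standard error).  delta = 2.8 is an
    assumption that makes alpha = 0.05 correspond to beta near 0.20."""
    z_crit = norm.inv_cdf(1 - alpha / 2)
    # Chance the test statistic falls short of the cutoff; the tiny
    # chance of rejecting on the wrong side is ignored.
    return norm.cdf(z_crit - delta)

for a in (0.02, 0.05, 0.10):
    print(f"alpha = {a:.2f} -> beta = {type2_rate(a):.2f}")
```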
<p>Which of those trade-offs we want depends on the specifics of our current question: if Type I and Type II errors are about equally bad, we might want \(\alpha\) and \(\beta\) to be about the same size, but if a Type I error is much, much worse, we should want \(\alpha\) to be much smaller than \(\beta\). We can’t make an informed choice of \(\alpha\) and \(\beta\) without knowing details about the specific decision we’re trying to make.</p>
<p>But when we’re trying to do <em>science</em> it’s not clear what to choose. We can’t really quantify the costs of publishing a paper with a false conclusion; the entire setup of computing practical trade-offs doesn’t make all that much sense when we’re trying to discern the truth rather than make a decision. <strong>This is one major way that the Neyman-Pearson framework isn’t the right tool for science: the entire method is premised on a calculation we can’t do.</strong></p>
<p>But we <em>can</em> just set \(\alpha = 0.05, \beta = 0.20\), and see what happens. And as long as these numbers are a vaguely reasonable size, we’ll probably get vaguely reasonable results. We hope.</p>
<h3 id="where-does-power-come-from">Where does power come from?</h3>
<p>There’s a second problem, though, which is widespread and frequently disastrous. Sometimes \(\beta\) gets so large that a study becomes useless—and we don’t even notice.</p>
<p>For a given \(\alpha\), your \(\beta\) depends on the quality of the data you have. With very good data, you can be very confident about your conclusion in both directions. We have a tremendous amount of data about the relationship between age and height in children, so we can design studies that will have low rates of false positives and false negatives. And physics experiments ask for a false positive rate less than one in a million—and they can actually <em>achieve</em> this because their data is both copious and precise.</p>
<p><strong>But with bad or noisy data, no amount of statistical cleverness can give any degree of confidence in our conclusions.</strong> If you want to study the effect on life expectancy of winning or losing an election to be a US state governor, <a href="https://statmodeling.stat.columbia.edu/2020/07/02/no-i-dont-believe-that-claim-based-on-regression-discontinuity-analysis-that/">you wind up with this scatterplot</a>:</p>
<p class="center"><img src="/assets/blog/hypothesis-testing/governor-life-expectancy.png" alt="Scatterplot with "Percentage vote margin" on the x-axis, from -10 to 10, and "Years alive after election" on the y-axis, from 0 to 60. There is no noticeable pattern." class="blog-image center" />
<em>If your data is this scattered, you will never be able to detect small effects.</em></p>
<p>There aren’t <em>that</em> many governor races, and lifespan after any given race varies from just a couple years to more than fifty, so the data is extremely noisy. If winning an election boosted your lifespan by ten years, we would probably be able to tell. But an effect that large is absurd, and there’s no way to use data like this to pick up changes of just a year or two.</p>
<p>When we said we “ask for” a \(\beta\) below \(0.2\), we really meant “we should collect enough data to get a power of \(80\)%”. That’s not really an option for the governors study, without waiting around for more elections and more dead governors; on that question we’re kind of stuck with the data we have. Despite the Neyman-Pearson inclination to make a firm decision, all we can reasonably do is embrace uncertainty.</p>
<p>If we’re running a laboratory experiment, on the other hand, we can decide how big an effect we’re looking for, and calculate how many people we’d need to study to get a power of \(80\)%. But it’s hard to calculate this correctly, because it depends on how big the effect we’re studying is, and we <em>don’t know how big it is</em> because we <em>haven’t done the study yet</em>. So the calculation is based on a certain amount of guesswork.<strong title="We can also base it on [how big of an effect we _care_ about]. If we're studying reaction times, we might decide that an effect smaller than ten milliseconds is irrelevant, and we don't care about it even if it's real. Then we can choose a study with enough power to detect a 10ms effect at least 80% of the time. But this brings us back to the core issue, that "is there an effect" just isn't a great question, and the Neyman-Pearson method isn't a great tool for answering it. "><sup id="fnref:6"><a href="#fn:6" class="footnote">6</a></sup></strong></p>
<p>Even if we do this calculation correctly, there’s a real chance that we have to run a really big experiment to get the power we want. (If we’re looking for a small effect, we may have to run a really, <em>really</em> big experiment.) And big experiments are expensive! A lot of researchers skip this step entirely, and just run whatever experiment they can afford, <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4961230/">regardless of how little power it has</a>.</p>
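<p>The back-of-the-envelope version of that calculation isn’t itself complicated; the hard part is the guesswork about the effect size. Here’s the textbook normal-approximation formula for a two-group comparison, sketched in Python:</p>

```python
from math import ceil
from statistics import NormalDist

norm = NormalDist()

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate subjects per arm for a two-sample z-test to detect a
    standardized effect size (Cohen's d) at the given alpha and power."""
    z_alpha = norm.inv_cdf(1 - alpha / 2)
    z_power = norm.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

for d in (0.8, 0.5, 0.2):  # conventionally "large", "medium", "small" effects
    print(f"d = {d}: about {n_per_group(d)} subjects per group")
```

<p>Halving the guessed effect size quadruples the required sample, which is why hunting for small effects gets expensive so fast.</p>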
<p>And if the power is low enough, things get very dumb very quickly.</p>
<h3 id="we-need-more-power">We need more power!</h3>
<p>Let’s start by looking at what happens when the power is really, idiotically low. This graph shows what happens when you run an experiment with a power of \(0.06\), which means a false negative rate of \(94\)%. And there are three different problems that pop up.</p>
<p><img src="https://statmodeling.stat.columbia.edu/wp-content/uploads/2014/11/Screen-Shot-2014-11-17-at-11.19.42-AM.png" alt="A diagram of the effects of low-power studies.
This is what &quot;power = 0.06&quot; looks like. Get used to it.
Type S error probability: If the estimate is statistically significant, it has a 24% chance of having the wrong sign.
Exaggeration ratio: If the estimate is statistically significant, it must be at least 9 times higher than the effect size." class="blog-image center" /></p>
<p class="center"><em>Figure by <a href="https://statmodeling.stat.columbia.edu/2014/11/17/power-06-looks-like-get-used/">Andrew Gelman</a>.</em></p>
<p>The obvious problem is that even if the null hypothesis is wrong, we probably won’t reject it, because the data isn’t good enough to <em>show</em> that it’s wrong. Even if the null is false, we’ll fail to reject it \(94\)% of the time! (This is represented by the large white area in the middle of the graph.) But this, at least, is the process working as intended: our goal was to err on the side of not rejecting the null hypothesis, and that is in fact what we’re doing.</p>
<p>But there are two subtler problems, which cause more trouble than just a pile of inconclusive studies. We still manage to reject the null \(6\)% of the time, but because the study is so weak, this only happens when we get unusually lucky. And that happens when our data is much, <em>much</em> further away from the null hypothesis than it usually is. <strong>At a power of \(\mathbf{0.06}\), we only get a significant result when our measurement is <em>nine times</em> as big as the true effect we want to measure.</strong> (This is the red region on the right of Gelman’s graph; he calls it a “Type M error”, for “magnitude”.)</p>
<p>This is a major culprit behind a lot of improbable ideas that come out of shoddy research. In my <a href="/blog/replication-crisis-math/">post on the replication crisis</a> I talked about how a lot of careless research starts out asking whether an effect exists, but finds an effect that’s <em>surprisingly large</em>, and then the story people tell is focused on the dramatic, unexpectedly large effect. But that drama is a necessary result of running underpowered studies.</p>
<p>The study of gubernatorial elections and life expectancy is a perfect example of this process. Just by looking at the graph, you can tell there probably isn’t a big effect. But researchers Barfort, Klemmensen and Larsen found a clever analysis<strong title="Clever analyses like this are often a bad idea; we'll come back to this idea [soon]."><sup id="fnref:7"><a href="#fn:7" class="footnote">7</a></sup></strong> that did <a href="https://www.cambridge.org/core/journals/political-science-research-and-methods/article/abs/longevity-returns-to-political-office/6205207F55C97729E66A8B08D7641572">produce a statistically significant result</a>—and claimed that the difference between narrowly winning and narrowly losing an election was <em>ten years</em> of lifespan. That’s far too large an effect to be believable, but any statistically significant result they got from that data set would have to be equally incredible.</p>
<p>Researchers are motivated to discover new and surprising things; and we, as news consumers, are most interested in new and surprising results. The wild overestimates that these low-power studies produce are surprising and counterintuitive precisely because they are <em>false</em>. And because they are surprising and counterintuitive, they tend to draw public attention and show up in the news.</p>
<p>But a surprisingly large result isn’t as counterintuitive as one that’s the opposite of what you expect. (Imagine if a study “proved” that 5-year-olds are taller than 15-year-olds!) And low-power studies give us those results too.</p>
<p>Even if we’re studying something that really does (slightly) increase lifespan, we could get unusually <em>unlucky</em>, and randomly observe a bunch of people who die unusually early. If the data is noisy enough and we get unlucky enough, we can get statistically significant evidence that the effect decreases lifespan, when it really increases it.</p>
<p>We see this in the left tail of Gelman’s graph. <strong>When power is \(\mathbf{0.06}\), almost a quarter of statistically significant results will give you a large effect <em>in the wrong direction</em>.</strong> There’s a substantial chance that we get our result exactly backwards.</p>
<p>Now, a power of \(0.06\) is an extreme case, bad even by the usual standards of underpowered research. But the same problems come up with better-but-still-underpowered studies, just to a lesser degree. In fact, both effects are always <em>possible</em>, if your data is unlucky enough. But we’d much prefer having a \(0.1\)% chance of getting the direction of the effect wrong to having a \(24\)% chance. And the lower the power, the bigger an issue this is.</p>
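<p>These numbers are easy to sanity-check with a quick simulation. The sketch below is my own illustration, not Gelman’s exact model: it assumes each study’s estimate is the true effect plus standard normal noise, with a hypothetical true effect of \(0.3\) standard errors, chosen so that a two-sided test at \(\alpha = 0.05\) has power of roughly \(0.06\).</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: true effect of 0.3 standard errors, which gives a
# two-sided test at alpha = 0.05 a power of roughly 0.06.
true_effect = 0.3
estimates = rng.normal(true_effect, 1.0, 1_000_000)  # one estimate per study
significant = np.abs(estimates) > 1.96

power = significant.mean()
sig = estimates[significant]
type_s = (sig < 0).mean()                  # significant, but the wrong sign
type_m = np.abs(sig).mean() / true_effect  # exaggeration ratio

print(f"power        ≈ {power:.3f}")   # about 0.06
print(f"Type S rate  ≈ {type_s:.2f}")  # about a fifth have the wrong sign
print(f"exaggeration ≈ {type_m:.1f}x") # significant estimates ~8x the truth
```

<p>Under this simple normal model the Type S rate comes out around \(20\)% and the exaggeration ratio around \(8\); Gelman’s figure uses a slightly different setup and gets \(24\)% and \(9\), but the qualitative pattern is identical for any test with power this low.</p>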
<h3 id="file-drawer">The revenge of the file drawer</h3>
<p>There should be a saving grace here: if your study has low power, it’s unlikely to reject the null at all. We don’t have a \(24\)% chance of getting a statistically significant result in the wrong direction; because our power is only \(0.06\), we have a <em>six percent chance of having a \(24\)% chance</em> of getting a statistically significant result in the wrong direction. That’s less than two percent, in total.</p>
<p>But <strong>studies that don’t reject the null often don’t get published at all</strong>. There’s a good chance that the 94 out of every 100 studies that fail to reject the null get stuck in a file drawer somewhere; we’re left with a few studies that reject it, but wildly overestimate the effect, and one or two that reject the null in the wrong direction. When that’s all the information we have, it’s hard to figure out what’s really going on.</p>
<p>Let’s make another table of possible research findings, like the ones <a href="#most-findings-false">we used earlier</a> to see how the file-drawer effect works. But this time, instead of assuming a reasonable power of \(80\)%, let’s see what happens when the power is only \(20\)%. If half the hypotheses are true and half are false, we get something like this:</p>
<table>
<thead>
<tr>
<th> </th>
<th>Null is false</th>
<th>Null is true</th>
<th>Total</th>
</tr>
</thead>
<tbody>
<tr>
<td>Reject the null</td>
<td>20</td>
<td>5</td>
<td>25</td>
</tr>
<tr>
<td>Don’t Reject</td>
<td>80</td>
<td>95</td>
<td>175</td>
</tr>
<tr>
<td>Total</td>
<td>100</td>
<td>100</td>
<td>200</td>
</tr>
</tbody>
</table>
<p>With \(80\)% power, our false-positive rate was \(6\)%. But with \(20\)% power, we have \(20\) true positives and \(5\) false positives, and our false-positive rate has risen to \(5/25 = 20\)%.</p>
<p>And if we also suppose that our researchers are testing unlikely theories and so \(90\)% of null hypotheses are true, we get the following truly terrible table:</p>
<table>
<thead>
<tr>
<th> </th>
<th>Null is false</th>
<th>Null is true</th>
<th>Total</th>
</tr>
</thead>
<tbody>
<tr>
<td>Reject the null</td>
<td>4</td>
<td>9</td>
<td>13</td>
</tr>
<tr>
<td>Don’t Reject</td>
<td>16</td>
<td>171</td>
<td>187</td>
</tr>
<tr>
<td>Total</td>
<td>20</td>
<td>180</td>
<td>200</td>
</tr>
</tbody>
</table>
<p>Under these conditions we get \(9\) false positives and only \(4\) true positives, so almost \(70\)% of our positive results are false positives. If the only results we publish are these exciting positive results, then most published findings will, indeed, be false.</p>
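<p>All of these tables come from the same expected-value arithmetic: choose a power, a significance level, and the fraction of tested hypotheses that are real, and the counts follow. A short sketch of that calculation (using the same hypothetical numbers as the tables above):</p>

```python
def expected_results(power, alpha, frac_real, n_studies=200):
    """Expected true and false positives when testing n_studies hypotheses."""
    real = n_studies * frac_real   # studies where the null is actually false
    null = n_studies - real        # studies where the null is actually true
    true_pos = power * real        # real effects we successfully detect
    false_pos = alpha * null       # flukes that reach significance anyway
    return true_pos, false_pos

# Power 20%, half the hypotheses real: 20 true vs. 5 false positives.
tp, fp = expected_results(0.20, 0.05, 0.5)
print(f"{tp:.0f} true, {fp:.0f} false: {fp / (tp + fp):.0%} of positives are false")

# Power 20%, only 10% of hypotheses real: 4 true vs. 9 false positives.
tp, fp = expected_results(0.20, 0.05, 0.1)
print(f"{tp:.0f} true, {fp:.0f} false: {fp / (tp + fp):.0%} of positives are false")
```

<p>The second case reproduces the table above: \(9\) of \(13\) positive results, about \(69\)%, are false.</p>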
<h3 id="the-problem-of-p-hacking-and-the-garden-of-forking-paths">The problem of \(p\)-hacking and the garden of forking paths</h3>
<p>It seems like we could fix this problem just by publishing null results as well. New norms like <a href="https://en.wikipedia.org/wiki/Preregistration_(science)">preregistration of studies</a> and institutions like <a href="https://www.jasnh.com">The Journal of Articles in Support of the Null Hypothesis</a> try to combat the file drawer bias by publishing studies that don’t reject the null, or at least letting us know they happened so we can count them. If we publish just a quarter of null results, then even under the bad assumptions of the last table we get something like this:</p>
<table>
<thead>
<tr>
<th> </th>
<th>Null is false</th>
<th>Null is true</th>
<th>Total</th>
</tr>
</thead>
<tbody>
<tr>
<td>Reject the null</td>
<td>4</td>
<td>9</td>
<td>13</td>
</tr>
<tr>
<td>Don’t Reject, but Publish</td>
<td>4</td>
<td>43</td>
<td>47</td>
</tr>
<tr>
<td>Don’t Reject or Publish</td>
<td>12</td>
<td>128</td>
<td>140</td>
</tr>
<tr>
<td>Total</td>
<td>20</td>
<td>180</td>
<td>200</td>
</tr>
</tbody>
</table>
<p>We see \(60\) published results. The \(4\) results where the null is false and we reject it are correct, as are the \(43\) where the null is true and we don’t reject it, so over \(70\)% of the published results will be true. If we publish more null results, this number only gets better.</p>
<p>But that doesn’t address the fundamental problem, which is that <em>researchers want to discover new, interesting things</em>. <strong>The fact that we mostly publish positive results that reject the null isn’t some accident of history; it’s a result of people trying to show that their ideas are correct.</strong></p>
<p>Since people want to reject the null hypothesis, they’ll work hard to find ways to do this. When done deliberately, this behavior is a form of research misconduct known as <a href="https://twitter.com/ephemeralidea/status/1504459823554908163">\(p\)-hacking</a> or <a href="https://en.wikipedia.org/wiki/Data_dredging">data dredging</a>. There are a variety of sketchy ways to tweak your statistical analysis to get an artificially low \(p\)-value. The most famous version is just running a bunch of experiments and <a href="https://imgs.xkcd.com/comics/significant.png">only reporting the ones with low \(p\)-values</a>.</p>
<p>Somewhat less famous, and less obvious, is the possibility of running one experiment, and then trying to <em>analyze</em> that data in a bunch of different ways and picking the one that makes your position look the best. We actually saw an example of this in <a href="hypothesis-testing-part-1#mileage">part 1</a> of this series, when I looked at my car’s gas mileage. I computed the \(p\)-value in two different ways, and got either \(0.0006\) or \(0.00004\). Either one of these is significant, but if they had been \(0.06\) and \(0.004\) instead, I could have just reported the second one and said “hey look, my data was significant!”</p>
<p>Moreover, it’s pretty common for people to look for secondary, “interaction” effects after looking for a main effect. Sure, watching a five-minute video didn’t have a statistically significant effect on depression in your study group. But maybe it worked on just the women? Or just the Asians? What if we control for income? You can check all the subgroups of your study, and whichever one reaches significance is <em>obviously</em> the interesting one.</p>
<p><a href="https://xkcd.com/1478/"><img src="https://imgs.xkcd.com/comics/p_values.png" alt="XKCD comic, translating p-values into verbal interpretations: &quot;highly significant&quot;, &quot;significant&quot;, &quot;on the edge of significance&quot;. For a high p-value the interpretation is &quot;hey, look at this interesting subgroup analysis&quot;." class="blog-image center" /></a>
<em class="blog-image center">Sometimes your treatment really does have an effect on one specific subgroup. But it’s also an easy out when your main study didn’t reach significance.</em></p>
<p>This approach of doing multiple subgroup analyses, but only reporting one is still research misconduct, if done on purpose. But <strong>it’s possible to get the same effect without actually performing multiple analyses, in a process that Andrew Gelman and Eric Loken call the <a href="https://www.americanscientist.org/article/the-statistical-crisis-in-science">garden of forking paths</a>.</strong></p>
<p>Researchers often make decisions about how to test the data after looking at it for broad trends. If they notice one subgroup obviously sticking out, maybe they want to test it. Or they can tweak some minor parameters, decide to include or exclude outliers, and consider a few minor variations in the way they divide subjects into categories. This is all a reasonable way of looking at data, but it’s a violation of the rules of hypothesis testing, and has the same basic effect as running a bunch of experiments and only reporting the best one.</p>
<p>Most subtly, sometimes more than one pattern will provide support for the researcher’s hypothesis. We generally don’t actually care about specific statistical relationships; we care about broader questions, like “does media consumption affect rates of depression?”<strong title="This difference is the source of a lot of research pitfalls; if you want to dig into this more, I recommend [Tal Yarkoni] on generalizability, [Vazire, Schiavone, and Bottesini] on the four types of validity, and [Scheel, Tiokhin, Isager, and Lakens] on the derivation chain."><sup id="fnref:8"><a href="#fn:8" class="footnote">8</a></sup></strong> <strong>We run specific experiments in order to test these broad questions. And if there are, say, twenty different outcomes that would support our broad theoretical stance, it doesn’t help us very much that each one only has \(\mathbf{5}\)% odds of happening by chance.</strong></p>
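<p>The arithmetic behind that last point is stark. If each of \(k\) comparisons has a \(5\)% chance of reaching significance by pure luck, the chance that <em>at least one</em> does grows quickly with \(k\). (This sketch assumes the comparisons are independent, which real subgroup analyses aren’t, but the inflation is the same in kind.)</p>

```python
alpha = 0.05  # per-comparison false positive rate

# Chance that at least one of k independent comparisons is a fluke
for k in (1, 5, 20):
    print(f"{k:2d} comparisons: {1 - (1 - alpha) ** k:.0%}")
```

<p>With twenty available comparisons, the chance of at least one spurious “significant” result is about \(64\)%: worse than a coin flip, from an analysis that nominally controls errors at \(5\)%.</p>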
<p>Gelman and Loken describe how this applies to research by Daryl Bem, which claims to provide strong evidence for ESP.<strong title="Scott Alexander [has pointed out] that ESP experiments are a great test case for our scientific and statistical methods, because we have extremely high confidence that we already know the true answer."><sup id="fnref:9"><a href="#fn:9" class="footnote">9</a></sup></strong></p>
<blockquote>
<p>In his first experiment, in which 100 students participated in visualizations of images, he found a statistically significant result for erotic pictures but not for nonerotic pictures….</p>
</blockquote>
<blockquote>
<p>But consider all the other comparisons he could have drawn: If the subjects had identified all images at a rate statistically significantly higher than chance, that certainly would have been reported as evidence of ESP. Or what if performance had been higher for the nonerotic pictures? One could easily argue that the erotic images were distracting and only the nonerotic images were a good test of the phenomenon. If participants had performed statistically significantly better in the second half of the trial than in the first half, that would be evidence of learning; if better in the first half, evidence of fatigue.</p>
</blockquote>
<blockquote>
<p>Bem insists his hypothesis “was not formulated from a post hoc exploration of the data,” but a data-dependent analysis would not necessarily look “post hoc.” For example, if men had performed better with erotic images and women with romantic but nonerotic images, there is no reason such a pattern would look like fishing or p-hacking. Rather, it would be seen as a natural implication of the research hypothesis, because there is a considerable amount of literature suggesting sex differences in response to visual erotic stimuli. The problem resides in the one-to-many mapping from scientific to statistical hypotheses.</p>
</blockquote>
<p>We even saw an example of forking paths earlier in this essay, in the <a href="#where-does-power-come-from">study of gubernatorial lifespans</a>. I said the study found a clever analysis to get a significant result. In the data set we saw from Barfort, Klemmensen, and Larsen, the obvious tests like linear regression don’t show any effect of winning margin on lifespan.</p>
<p class="blog-image center"><img src="/assets/blog/hypothesis-testing/governor-life-expectancy-loess.png" alt="The same scatterplot of &quot;Percentage vote margin&quot; on the x-axis and &quot;Years alive after election&quot; on the y-axis. This time a best-fit loess curve is drawn through the data; it again shows no real relationship." class="blog-image center" />
<em>A loess curve is a more sophisticated version of linear regression. It doesn’t show a clear relationship between electoral margin and lifespan. Graph again <a href="https://statmodeling.stat.columbia.edu/2020/07/02/no-i-dont-believe-that-claim-based-on-regression-discontinuity-analysis-that/">by Andrew Gelman</a>.</em></p>
<p>But if you average different candidates with the same electoral margin together, divide them into a group of winners and a group of losers, and then do a regression on each group separately, the two regressions suggest that barely winning a race improves life expectancy, versus barely losing.</p>
<p class="center"><img src="/assets/blog/hypothesis-testing/governor-regression-discontinuity.png" alt="A figure from the Barfort, Klemmensen, and Larsen paper on gubernatorial elections and lifespan, showing their regression discontinuity analysis. It shows lifespan decreasing with increased voteshare, except with a large upwards discontinuity at the crossover from losing to winning." class="blog-image" /></p>
<p class="center blog-image"><em>The discontinuity between the two lines is large enough to be “statistically significant”. But does the data on the right really look qualitatively different from the data on the left?</em></p>
<p>This <a href="https://en.wikipedia.org/wiki/Regression_discontinuity_design">regression discontinuity design</a> isn’t a ridiculous approach to the question, but it’s also probably not the first idea you’d think of. And the paper’s own abstract says they’re not sure which way the effect should run, so <em>any pattern at all</em> would provide support for their research hypothesis. This is a subtle but crucial violation of the hypothesis testing framework, and dramatically inflates the rate of “positive” results.</p>
<h2 id="sowhy-does-science-work-at-all">So…why does science work <em>at all</em>?</h2>
<p>Hopefully I’ve convinced you, first, that the tools of modern hypothesis testing are badly suited for the questions we want them to answer, and second, that the structure of our scientific institutions leads us to regularly misuse them in ways that make them even more misleading. So then, how do we manage to learn anything at all?</p>
<p>Sometimes we don’t! The whole point of the “replication crisis” is that we’re almost having to throw out entire fields wholesale. <strong>When I hear about a promising new drug, or a cool new social psychology study, I <em>assume it’s bullshit</em>, because so many of them are. And that’s a real crisis for the whole idea of “scientific knowledge”.</strong></p>
<p>But in many fields of study we do, in fact, manage to learn things. We know enough physics and chemistry to build things like spaceships and smartphones. And even though a lot of drug studies are nonsense, modern medicine does in fact work.</p>
<p class="center"><img src="/assets/blog/hypothesis-testing/life-expectancy-at-age-10.png" alt="A graph from Our World In Data of life expectancy at age ten in various countries, from 1750 to the present. There is a dramatic increase over the 20th century." class="blog-image" /></p>
<p class="center blog-image"><em>We didn’t increase life expectancy by almost thirty years without learning</em> something <em>about biology.</em></p>
<p>And even in more vulnerable fields like psychology and sociology, we have developed a lot of consistent, replicable, useful knowledge. How did we get that to work, despite our shoddy statistics?</p>
<h3 id="inter-ocular-trauma">Inter-ocular trauma</h3>
<p>If your data are good enough, you can get away with having crappy statistics. One of the best and most useful statistical tools is what Joe Berkson called the <a href="https://stats.stackexchange.com/questions/458069/source-for-inter-ocular-trauma-test-for-significance">inter-ocular traumatic test</a>: “you know what the data mean when the conclusion hits you between the eyes”.</p>
<p><a href="https://xkcd.com/2400/"><img src="https://imgs.xkcd.com/comics/statistics.png" alt="XKCD 2400: graph of covid vaccine efficacy versus placebo. &quot;Statistics tip: always try to get data that's good enough that you don't need to do statistics on it.&quot;" style="max-width:800px;" class="blog-image center" /></a></p>
<p class="center blog-image"><em>I didn’t worry that</em> this <em>result was bullshit statistical trickery, because I can easily see the evidence for myself.</em></p>
<p>Conversely, if your data isn’t very good, statistics can’t help you with it very much. John Tukey <a href="https://doi.org/10.2307/2683137">famously wrote</a>:</p>
<blockquote>
<p>The data may not contain the answer. The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.</p>
</blockquote>
<p>None of this means statistics is useless. But if we can consistently get good, high-quality data, we can afford a little sloppiness in our statistical methodology.</p>
<h3 id="putting-the-replication-in-replication-crisis">Putting the “replication” in “replication crisis”</h3>
<p>And this is where the “replication” half of “replication crisis” comes in. <strong>If the signal you’re detecting is real, you can run another experiment, or do another study, and (probably) see the same thing.</strong> In my <a href="https://jaydaigle.net/blog/replication-crisis-math/">post on the replication crisis</a> I wrote about how mathematicians are constantly replicating our important results, just by reading papers; and that protects us from a lot of the flaws plaguing social psychology.</p>
<p>Gelman recently <a href="https://statmodeling.stat.columbia.edu/2022/03/04/biology-as-a-cumulative-science-and-the-relevance-of-this-idea-to-replication/">made a similar point</a> about fields like biology. Because wet lab biology is cumulative, people are continually replicating old work in the process of trying to do new work. A boring false result can survive for a long time, if no one cares enough to use it; an exciting false result will be exposed quickly when people try to build on it and it collapses under the strain.</p>
<p>This is something Fisher himself wrote about clearly and firmly: “A scientific fact should be regarded as experimentally established only if a properly designed experiment <em>rarely fails</em> to give this level of significance”. That is, we shouldn’t accept a result when we successfully do <em>one</em> experiment that produces a low \(p\)-value; but we should listen when we can <em>consistently</em> do experiments with low \(p\)-values.</p>
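<p>Fisher’s “rarely fails” standard works because independent replications multiply. A back-of-envelope sketch, under the idealized assumption of independent studies that each have \(80\)% power and a \(5\)% false positive rate:</p>

```python
alpha = 0.05  # chance a fluke reaches significance in one study
power = 0.80  # chance a real effect reaches significance in one study

# Probability of k significant results in a row, real effect vs. fluke
for k in (1, 2, 3):
    print(f"{k} study(ies): real effect {power ** k:.2f}, fluke {alpha ** k:.6f}")
```

<p>After just two studies, a real effect still comes up significant \(64\)% of the time, while a fluke survives only \(0.25\)% of the time. Consistent replication separates the two far more sharply than any single \(p\)-value can.</p>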
<p><strong>But the entire concept of “replication” is in opposition to the artificial decisiveness of Neyman-Pearson hypothesis testing.</strong> The Neyman-Pearson method, if taken seriously, asks us to fully commit to believing a theory if our experiment comes up with \(p=0.049\); but that attitude is <em>utterly terrible science</em>. Good scientific practice <em>needs</em> to be able to hold beliefs lightly, revise them when new evidence comes in, and carefully build up solid foundations that can support further work.</p>
<p>The standard approach to hypothesis testing isn’t designed for that. Next time, in <a href="https://jaydaigle.net/blog/hypothesis-testing-part-3/">part 3</a>, we’ll look at some tools that are.</p>
<hr />
<p><em>Have questions about hypothesis testing? Is there something I didn’t cover, or even got completely wrong? Or is there something you’d like to hear more about in the rest of this series? Tweet me <a href="https://twitter.com/profjaydaigle">@ProfJayDaigle</a> or leave a comment below.</em></p>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>This is the probability of getting data five standard deviations away from the mean. So you’ll often see this reported as a significance threshold of \(5 \sigma\). Related are the <a href="https://en.wikipedia.org/wiki/Six_Sigma">Six Sigma techniques</a> for ensuring manufacturing quality, though somewhat counterintuitively they typically only aim for <a href="https://en.wikipedia.org/wiki/Six_Sigma#Role_of_the_1.5_sigma_shift">4.5 \(\sigma\)</a> of accuracy. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>It is common for people to be sloppy here and say they “accept” the null. In fact, I wrote that in my first draft of this paragraph. But it’s bad practice to say that, because even a very high \(p\)-value doesn’t provide good evidence that the null hypothesis is true. Our methods are designed to default to the null hypothesis when the data is ambiguous.</p>
<p>Neyman <em>did</em> use the phrase “accept the null”, but in the context of a decision process, where “accepting the null” means taking some specific, concrete action implied by the null, rather than more generally committing to believe something. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3">
<p>Andrew Gelman suggests a helpful <a href="https://statmodeling.stat.columbia.edu/2016/01/26/more-power-posing/">time-reversal heuristic</a>: what would you think if you saw the same studies in the opposite order? You’d start with a few large studies establishing no effect, followed by one smaller study showing an effect. In theory that gives you the exact same information, but in practice people would treat it very differently—assuming the first studies <a href="#file-drawer">actually got published</a>. <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
<li id="fn:4">
<p>You might recognize this as an application of Bayes’s theorem, and a basic example of <a href="https://jaydaigle.net/blog/overview-of-bayesian-inference/">Bayesian inference</a>. Tables like these are very common in Bayesian calculations. <a href="#fnref:4" class="reversefootnote">↩</a></p>
</li>
<li id="fn:5">
<p>Followups to Ioannidis’s paper contend that only about \(14\)% of published biomedical findings are actually false. I’m not in a position to comment on this one way or the other. In psychology, different studies estimate that somewhere <a href="https://en.wikipedia.org/wiki/Replication_crisis#In_psychology">between \(36\)% and \(62\)%</a> of published results replicate. <a href="#fnref:5" class="reversefootnote">↩</a></p>
</li>
<li id="fn:6">
<p>We can also base it on <a href="https://twitter.com/lakens/status/1524799540250959873">how big of an effect we <em>care</em> about</a>. If we’re studying reaction times, we might decide that an effect smaller than ten milliseconds is irrelevant, and we don’t care about it even if it’s real. Then we can choose a study with enough power to detect a \(10\)<em>ms</em> effect at least \(80\)% of the time.</p>
<p>But this brings us back to the core issue, that “is there an effect” just isn’t a great question, and the Neyman-Pearson method isn’t a great tool for answering it. <a href="#fnref:6" class="reversefootnote">↩</a></p>
</li>
<li id="fn:7">
<p>Clever analyses like this are often a bad idea; we’ll come back to this idea <a href="#file-drawer">soon</a>. <a href="#fnref:7" class="reversefootnote">↩</a></p>
</li>
<li id="fn:8">
<p>This difference is the source of a lot of research pitfalls; if you want to dig into this more, I recommend <a href="https://psyarxiv.com/jqw35">Tal Yarkoni</a> on generalizability, <a href="https://psyarxiv.com/bu4d3/">Vazire, Schiavone, and Bottesini</a> on the four types of validity, and <a href="https://journals.sagepub.com/doi/10.1177/1745691620966795">Scheel, Tiokhin, Isager, and Lakens</a> on the derivation chain. <a href="#fnref:8" class="reversefootnote">↩</a></p>
</li>
<li id="fn:9">
<p>Scott Alexander <a href="https://slatestarcodex.com/2014/04/28/the-control-group-is-out-of-control/">has pointed out</a> that ESP experiments are a great test case for our scientific and statistical methods, because we have extremely high confidence that we already know the true answer. <a href="#fnref:9" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Jay DaigleThis is the second part of a three-part series on hypothesis testing. Today we'll look at the way we do hypothesis testing in practice, and how it tends to fail. Modern researchers use hypothesis testing as a tool to develop knowledge, but it's really a tool for making decisions, and so it encourages us to draw strong conclusions from weak evidence. It also encourages us to view studies that don't reject the null hypothesis as failures, which leads even honest and dedicated researchers to do shoddy research, producing "statistically significant" results that can't be reproduced.Hypothesis Testing and its Discontents, Part 1: How is it Supposed to Work?2022-03-31T00:00:00-07:002022-03-31T00:00:00-07:00https://jaydaigle.net/blog/hypothesis-testing-part-1<p>In my <a href="https://jaydaigle.net/blog/replication-crisis-math/">last post on the replication crisis</a>, I mentioned the basic ideas of <a href="https://en.wikipedia.org/wiki/Statistical_hypothesis_testing">statistical hypothesis testing</a>. There wasn’t room to give a full explanation in that post, but hypothesis testing is worth understanding, since it’s the foundation of most modern scientific research. It’s a powerful tool, but also incredibly easy to misunderstand and misuse.</p>
<p>This post is the first part of a three-part series explaining what hypothesis testing is and how it works. In this essay I’ll talk about the way hypothesis testing developed historically, in two rival schools of thought. I’ll explain how these two methodologies were originally supposed to work, and why you might (or might not) want to use them. In <a href="/blog/hypothesis-testing-part-2">Part 2</a> I’ll talk about how we do significance testing in practice today, and how that often goes wrong. And in <a href="https://jaydaigle.net/blog/hypothesis-testing-part-3/">Part 3</a> I’ll talk about alternatives to hypothesis testing that can help us avoid replication crisis-type problems.</p>
<h2 id="choose-your-question">Choose your question</h2>
<p>Perhaps the most important step in using math to solve real-world problems is figuring out precisely <a href="https://jaydaigle.net/blog/why-word-problems/">what question you want to ask</a>. Now, there’s a sense in which this process isn’t mathematical. Math can’t tell you, say, whether you want your clothing to be more comfortable or more stylish. No amount of math can tell you how you value inequality versus growth, or whether you’re willing to risk major side effects from an experimental medical treatment.</p>
<p>But math can help you figure out what question you’re asking, by clarifying exactly what questions you <em>could</em> be trying to answer, what their implications are, and what options you have for answering them. The history of hypothesis testing is a debate between people trying to answer different questions, but also a debate about which questions are the most fruitful to ask. Do we want to test a scientific principle? Record a precise measurement? Make a decision?</p>
<p>The statistical tools we use today were developed by specific people,<strong title="Some of these specific people were [pretty awful in one way or another]. Ronald Fisher in particular was [racist] and a [vigorous defender of tobacco companies], though Jerzy Neyman seems to have been [perfectly lovely]. I'm not going to go into detail about their failings, among other things because I'm not especially well-informed on the subject; I recommend the articles I linked if you want to know more."><sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup></strong> at specific times, to answer specific questions. So I want to start off by asking some of those specific questions, and see how early statisticians would approach them and what ideas they developed in response.<strong title="Much of this essay, and especially the historical information on the way these schools of thought developed, draws heavily on the article [Confusion Over Measures of Evidence (p's) Versus Errors (α's) in Classical Statistical Testing] by Hubbard and Bayarri. This extremely readable article is also a fascinating historical artifact, basically predicting the entire contour of the replication crisis in 2003."><sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup></strong> After we’ve seen how Fisher’s significance testing and the Neyman-Pearson hypothesis testing framework worked in their original contexts, we can talk about what questions each tool is best suited to answer, and what types of question neither tool can really handle.</p>
<h2 id="fishers-significance-testing">Fisher’s Significance Testing</h2>
<h3 id="mileage">Are You Surprised?</h3>
<p>In 2016 I got a new car with a fancy new electronic system. And one of the new features was a meter that kept track of my gas mileage. It was fun to watch the mileage adjust as I was driving. (And I may have gotten a little obsessed with trying to eke out another tenth of a mile per gallon by driving funny.)</p>
<p>But how accurate is that mileage number? In 2019 my friend Casey suggested an experiment to me and I decided to try it. For several months, every time I filled up my gas tank, I recorded the mpg number from my car dashboard. I also recorded the number of miles I’d driven and the number of gallons of gas I’d used, which let me calculate the mpg directly.</p>
<table class="smalltable">
<thead>
<tr>
<th>Miles Driven</th>
<th>Gallons</th>
<th>Calculated MPG</th>
<th>Dashboard MPG</th>
<th>Difference</th>
</tr>
</thead>
<tbody>
<tr>
<td>340.7</td>
<td>10.276</td>
<td>33.2</td>
<td>34.2</td>
<td>1.0</td>
</tr>
<tr>
<td>300.1</td>
<td>8.97</td>
<td>33.5</td>
<td>34.7</td>
<td>1.2</td>
</tr>
<tr>
<td>232.6</td>
<td>8.04</td>
<td>28.9</td>
<td>29.0</td>
<td>0.1</td>
</tr>
<tr>
<td>261.8</td>
<td>8.5</td>
<td>30.8</td>
<td>31.1</td>
<td>0.3</td>
</tr>
<tr>
<td>301.3</td>
<td>9.316</td>
<td>32.3</td>
<td>32.5</td>
<td>0.2</td>
</tr>
<tr>
<td>505.1</td>
<td>15.127</td>
<td>33.4</td>
<td>34.8</td>
<td>1.4</td>
</tr>
<tr>
<td>290.3</td>
<td>9.814</td>
<td>29.6</td>
<td>30.3</td>
<td>0.7</td>
</tr>
<tr>
<td>290.2</td>
<td>8.566</td>
<td>33.9</td>
<td>34.9</td>
<td>1.0</td>
</tr>
<tr>
<td>294.9</td>
<td>9.005</td>
<td>32.7</td>
<td>32.8</td>
<td>0.1</td>
</tr>
<tr>
<td>301.4</td>
<td>9.592</td>
<td>31.4</td>
<td>32.0</td>
<td>0.6</td>
</tr>
<tr>
<td>230.9</td>
<td>7.643</td>
<td>30.2</td>
<td>32.0</td>
<td>1.8</td>
</tr>
<tr>
<td>269.2</td>
<td>8.644</td>
<td>31.1</td>
<td>30.8</td>
<td>-0.3</td>
</tr>
<tr>
<td>267</td>
<td>8.327</td>
<td>32.1</td>
<td>32.6</td>
<td>0.5</td>
</tr>
<tr>
<td>319.7</td>
<td>9.42</td>
<td>33.9</td>
<td>34.7</td>
<td>0.8</td>
</tr>
<tr>
<td>314.3</td>
<td>9.868</td>
<td>31.9</td>
<td>33.3</td>
<td>1.4</td>
</tr>
<tr>
<td>264.4</td>
<td>8.693</td>
<td>30.4</td>
<td>31.7</td>
<td>1.3</td>
</tr>
<tr>
<td>273</td>
<td>9.229</td>
<td>29.6</td>
<td>30.4</td>
<td>0.8</td>
</tr>
<tr>
<td>320.2</td>
<td>9.618</td>
<td>33.3</td>
<td>33.3</td>
<td>0.0</td>
</tr>
</tbody>
</table>
<p>These numbers show that my car reported a better mileage than I actually got almost every time. Out of eighteen measurements, my car overestimated sixteen times, underestimated once, and was accurate to one decimal place once. But was this tendency toward overestimation a coincidence? Is my car’s mileage calculation biased high, or did I just get weirdly unlucky?</p>
<p>We can try to get a sense of how easily this could have happened by chance. We took eighteen measurements, and sixteen of them were high. (One was a tie, but we’ll be generous and count it as “not high”.) If the car is equally likely to guess high or low, this is like flipping a coin eighteen times and getting sixteen heads. That’s pretty unlikely: the probability is about \(0.0006\), or \(0.06\)%, or about one in \(1700\). It’s still <em>possible</em> that my car is unbiased and I just got unlucky. But if so, I was extremely unlucky.</p>
<p class="center"><img src="/assets/blog/hypothesis-testing/star-wars-asteroids.jpeg" alt="Screenshot from The Empire Strikes Back, with dialog: &quot;Sir, the possibility of successfully navigating an asteroid field is approximately 3,720 to 1!&quot;" width="75%" /></p>
<p class="center"><em>But still only half as unlucky as Han Solo’s enemies.</em></p>
<h3 id="what-is-a-significance-test">What is a significance test?</h3>
<p>We call this approach a <em>significance test</em>. It was developed by <a href="https://en.wikipedia.org/wiki/Ronald_Fisher">Ronald Fisher</a>, building on work by <a href="https://en.wikipedia.org/wiki/Karl_Pearson">Karl Pearson</a> and <a href="https://en.wikipedia.org/wiki/Student%27s_t-distribution">William Sealy Gosset</a>.</p>
<p>We start by formulating a <em>null hypothesis</em> that represents some form of “expected” behavior, which we call \(H_0\). In this case, I expected<strong title="Okay, maybe I didn't _actually_ expect my car to be accurate and unbiased. But it's at least _supposed_ to be true, so it provides a good baseline for comparison."><sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup></strong> my car to correctly measure my gas mileage, without consistent bias in either direction. There are a few ways to make that expectation mathematically precise; in the example above, my precise hypothesis was “an overestimate is just as likely as an underestimate”, or more formally, \(P(\text{overestimate}| H_0 ) = 0.5 \).</p>
<p>(There are other ways to formalize my expectations here. I ignored the size of the errors, and just looked at whether the measured mileage was better or worse than the mileage I calculated. But with a more complicated statistical tool called a <a href="https://en.wikipedia.org/wiki/Student's_t-test">paired \(t\)-test</a> we can use the exact numbers to get a bit more information out of our measurements. When I do this, I get a \(p\)-value of \(0.00004\), or \(0.004\)%—an order of magnitude lower than my first figure.)</p>
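<p>For the curious, here is roughly where that figure comes from: the paired \(t\)-statistic is just a one-sample \(t\)-statistic computed on the differences in the table’s last column. A stdlib-only sketch (a full test would then look this statistic up in a \(t\)-distribution with \(17\) degrees of freedom, as <code>scipy.stats.ttest_rel</code> does):</p>

```python
from math import sqrt
from statistics import mean, stdev

# Differences (dashboard MPG minus calculated MPG) from the table above
diffs = [1.0, 1.2, 0.1, 0.3, 0.2, 1.4, 0.7, 1.0, 0.1,
         0.6, 1.8, -0.3, 0.5, 0.8, 1.4, 1.3, 0.8, 0.0]

# The paired t-statistic: mean difference divided by its standard error,
# with n - 1 = 17 degrees of freedom
n = len(diffs)
t = mean(diffs) / (stdev(diffs) / sqrt(n))
print(round(t, 2))  # → 5.27
```

<p>A \(t\)-statistic above \(5\) with \(17\) degrees of freedom corresponds to a \(p\)-value in the \(10^{-5}\) range, which is where the \(0.00004\) comes from.</p>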
<p>Once we have a null hypothesis, <strong>we compute how unlikely the measurement we actually got would be, if we assume the null hypothesis is true</strong>. And if that sentence looks confusing and grammatically tangled, there’s a reason for that: while this process is absolutely unambiguous mathematically, it has nested “if-then” statements that are hard to think clearly about and don’t translate easily into English. In mathematical notation, we want \( P( \text{measurement} \mid H_0 ) \), which we can read as the probability of seeing our measurement given the null hypothesis.</p>
<p>There are a couple of subtle points here, so I want to be super explicit and run them into the ground. The first is that we need to be careful about what we mean by “how unlikely our result is”, because any <em>specific</em> result is extremely unlikely. The odds of getting the exact sequence I got in my experiment—HHHHHH HHHHHT HHHHHT—are exactly \(1\) in \(2^{18}\), and that’s not because my sequence is special. If you pick any specific sequence, whether it’s all heads like HHHHHH HHHHHH HHHHHH, or half-and-half like HTHTHT HTHTHT HTHTHT, or something totally random like HHHTHT HHHHTT HTTHTT, the odds of getting those exact flips in that exact order are \(1\) in \(2^{18}\).</p>
<p class="center"><img src="/assets/blog/hypothesis-testing/coin-flips.png" alt="A picture of eighteen flipped coins" class="blog-image" /></p>
<p class="center"><em>The probability of getting these exact flips in this exact order is \(1\) in \(2^{18}\), or about \(0.000004\).</em></p>
<p>But that doesn’t tell us anything useful! Fortunately, in the context of hypothesis testing, we can do something smarter. It doesn’t really matter what <em>order</em> we get the heads in; it just matters how many we get, because that tells us how often the car is overestimating my mileage. So we can compute the odds of getting sixteen heads in any order. And getting seventeen or eighteen heads would be even <em>more</em> unlikely, so we include those as well: what we wind up computing is the odds of getting \(16\), \(17\), or \(18\) heads. That’s how I got the number \(0.0006\) earlier.</p>
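<p>We can check that arithmetic directly. A quick sketch, using nothing but the coin-flip model described above:</p>

```python
from math import comb

# P(at least 16 heads in 18 fair flips): add up the three tail cases.
# Each specific sequence has probability 1/2^18, and there are
# comb(18, k) sequences with exactly k heads.
p = sum(comb(18, k) for k in (16, 17, 18)) / 2**18
print(p)  # → 0.0006561279296875, the “about 0.0006” quoted above
```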
<p>We say that we want to compute the chance of getting a result <em>at least as bad</em> as what we got. But that requires us to decide what counts as “better” or “worse”; and that depends on what question we’re actually trying to ask. In this context, I’m testing the null hypothesis that my car underestimates as often as it overestimates, so I can basically order the possible results from “most overestimation” to “most underestimation” and find the probability of overestimating at least as often as my car actually did.</p>
<h3 id="what-we-dont-learn">What we don’t learn</h3>
<p>Another subtle point, but an absolutely vital one, is that <strong>the \(p\)-value does <em>not</em> tell us how likely the null hypothesis is to be true</strong>. When we say that \(p = 0.0006\) that does <em>not</em> mean that there’s only a \(0.06\)% chance that my car is accurate! It just measures how unusual my evidence is, <em>if</em> the null hypothesis is true.</p>
<p>Often the question we really care about is how likely the null hypothesis is to be true. There are in fact ways to try to address that directly, which I’ll discuss in Part 3 of this series. But answering that question requires a lot more information than we usually have; Fisher’s significance test doesn’t try. <strong>It just assumes the null hypothesis is true</strong>, and tells us how weird that makes the result look.</p>
<p>Significance testing does numerically measure the strength of the experimental evidence we got: the lower the \(p\)-value, the stronger our evidence. But it doesn’t try to account for any <em>other</em> evidence we have, whether against the null hypothesis or for it. If I get a coin from the bank, flip it ten times and get ten heads, I get \(p \approx 0.001\) for the null hypothesis that it’s a normal coin. But I still expect it to be normal, because most coins are. And if I pick it up and see that it has a normal “tails” side, I’ll be really confident that I just got weirdly lucky<strong title="You might worry about whether it's a two-sided but biased coin. But Gelman and Nolan have argued that [coins physically can't be biased], and I find their argument compelling. If you don't find it compelling, you have to decide how likely you think a weighted coin would be—which is exactly the &quot;other evidence&quot; that Fisher's paradigm doesn't even try to account for."><sup id="fnref:4"><a href="#fn:4" class="footnote">4</a></sup></strong>.</p>
<p>And that’s why <strong>the analysis of my gas mileage above didn’t really have a firm conclusion</strong>. We got a \(p\)-value of \(0.0006\), and determined: “huh, that’s kinda funny”. <em>Either</em> our null hypothesis was false, <em>or</em> something extremely unusual happened. But <strong>the math doesn’t tell us which of those two things to believe</strong>.</p>
<p>And in the case of my car, it doesn’t need to. On the one hand, I’m not all that surprised if the mileage calculator is a little wrong; the super-low \(p\)-value just reinforces what I already suspected. And on the other hand, I’m not really going to do anything different if my mileage is half an mpg lower than my dashboard says. I’m not going to sue Honda, or lead an activist campaign, or try to raise awareness about faulty mileage estimates.</p>
<p>But if I really cared, I could run more experiments. I got \(p = 0.0006\) in my first experiment; but I could do the experiment again. If I get \( p = 0.31\) next time, maybe I should assume the first result was just a fluke. But if I get \(p = 0.0003\) and then \( p = 0.0008\) I’ll see a pattern. And that pattern would make a convincing argument that my car is lying to me.</p>
<p class="center"><img src="/assets/blog/hypothesis-testing/omg-kitten.jpg" alt="picture of a shocked kitten: &quot;OMG I knew it!&quot;" class="blog-image" /></p>
<p>In “The Arrangement of Field Experiments”, Fisher writes that “A scientific fact should be regarded as experimentally established only if a properly designed experiment <em>rarely fails</em> to give this level of significance”. (Italics in the original.) That is, <a href="https://slatestarcodex.com/2014/12/12/beware-the-man-of-one-study/">no one experiment should convince us of anything</a>. Instead, <strong>we should believe our results when we can reliably design experiments that give the same results</strong> (which is arguably the point that we <a href="https://jaydaigle.net/blog/science-vs-engineering/">pass from science to engineering</a>).<strong title="A friend asks if meta-analysis accomplishes the same thing, but meta-analysis is actually a much weaker threshold than the one Fisher gives here. Meta-analysis tries to amplify weak signals and reconcile inconsistent results; Fisher says we should only believe a claim when we can consistently get a strong signal."><sup id="fnref:5"><a href="#fn:5" class="footnote">5</a></sup></strong></p>
<p>But that’s a slow, grinding, painstaking process. And it still doesn’t give us a rule for when to pull the trigger! We just gradually believe the null hypothesis less and less as we collect more data. That’s perfectly fine for doing basic science—maybe even ideal.</p>
<p>But what if the stakes are higher, and more immediate? Sometimes we need to make a real decision, now, with the data we have. So what do we do?</p>
<h2 id="neyman-pearson-hypothesis-testing">Neyman-Pearson Hypothesis Testing</h2>
<h3 id="time-to-make-a-choice">Time to make a choice</h3>
<p>Suppose we’re studying a new drug, which we hope will prevent deaths from cancer. We can collect data on how effective the drug seems to be in trials, but just reporting a \(p\)-value isn’t enough. At some point <strong>we have to make a <em>decision</em>: should we give people the drug, or not?</strong> And Fisher’s methods don’t answer that.<strong title="From what I understand, Fisher was a little contemptuous of the idea that you could answer this question mathematically."><sup id="fnref:6"><a href="#fn:6" class="footnote">6</a></sup></strong></p>
<p><a href="https://en.wikipedia.org/wiki/Jerzy_Neyman">Jerzy Neyman</a> and <a href="https://en.wikipedia.org/wiki/Egon_Pearson">Egon Pearson</a> (the son of Karl Pearson) decided to attack that question head-on. They began by observing that there are two different mistakes we could make, which they called “Type I” and “Type II” errors.</p>
<p>These names are infamously unmemorable, but in their original context they make perfect sense: <strong>whichever mistake we most want to avoid is the “first type” of mistake</strong>. For drug testing, there’s a widespread consensus that it’s worse to prescribe a drug that doesn’t work, or has nasty side effects, than it is to withhold a drug that works as expected.<strong title="I'm not convinced I agree with this, but that's beside the point here. I'll discuss this choice a bit more in Part 2 of this series."><sup id="fnref:7"><a href="#fn:7" class="footnote">7</a></sup></strong> So the Type I error would be prescribing a drug that doesn’t work, and the Type II error would be failing to prescribe a drug that does work. This means we can take “the drug doesn’t work” as our null hypothesis \(H_0\). <strong>But we can contrast our null hypothesis with a specific alternative: that the drug does, in fact, work</strong>. We call this our “alternative hypothesis” \(H_A\). And we get the following classic chart:</p>
<table>
<thead>
<tr>
<th> </th>
<th>Null Hypothesis is false <br /> (Drug works)</th>
<th>Null Hypothesis is true <br /> (Drug doesn’t work)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Give the drug <br /> (Reject the Null)</td>
<td>Correct decision <br /> “True Positive”</td>
<td>First, worse error <br /> (Type I Error) <br /> “False Positive”</td>
</tr>
<tr>
<td>Don’t give the drug <br /> (Don’t Reject)</td>
<td>Second, less bad error <br /> (Type II Error) <br /> “False Negative”</td>
<td>Correct Decision <br /> “True Negative”</td>
</tr>
</tbody>
</table>
<p>This leaves us with a problem. There are two different mistakes we could make. And without getting better data, we can only reduce the Type II errors by increasing the Type I errors: if we’re more generally willing to say “yes, prescribe the drug”, we’ll say “yes” more often when the drug works, but also when it doesn’t. We need to strike some sort of balance between the two risks. But how?</p>
<p><strong>There’s no abstract, mathematical answer to this question; it depends on the specific, practical consequences of the decision we’re making</strong>, and how much we care about the specific trade-offs in play. We already said that a Type I error is worse than a Type II error—but by how much? Is it two times as bad? Five? Ten? We have to decide exactly how we weigh the two risks against each other.</p>
<p>In drug testing, a Type I error means spending money on drugs that don’t work and might hurt people. A Type II error means people don’t get treatment that would help them. If a disease is really bad, we’re more willing to make Type I errors, because a drug that <em>might</em> kill you compares favorably to a disease that <em>definitely</em> will. If a drug is really expensive, or has bad side effects, we might be more willing to make Type II errors, because people will be hurt more by letting a bad drug slip through. And there are dozens more factors like that that we have to weigh against each other.</p>
<p>Once we’ve decided how we want to balance these risks, we can define a threshold for our experiment. If our data crosses that threshold, we prescribe the drug; if it doesn’t, then we don’t. And that’s our decision.</p>
<h3 id="the-risk-of-error">The risk of error</h3>
<p>All this setup leaves us with a pair of numbers that describe the trade-offs we’ve made. The rate of Type I errors is \(\alpha\), which tells us: <em>if</em> the drug doesn’t work, how likely are we to prescribe it? Its mirror is \(\beta\), the rate of Type II errors. This tells us: if the drug <em>does</em> work, how likely are we to withhold it? <strong title="In a medical context, we often talk about the related concepts of _sensitivity_ and _specificity_. Sensitivity is the &quot;true positive&quot; rate 1-β, the probability of correctly prescribing the drug if it would help. Specificity is the &quot;true negative&quot; rate 1-α, the probability of correctly withholding the drug if it would not help. These terms come from diagnostic testing. &quot;Sensitivity&quot; measures the chance of correctly detecting a condition that you have; &quot;specificity&quot; measures the chance of correctly detecting that you don't have a condition."><sup id="fnref:8"><a href="#fn:8" class="footnote">8</a></sup></strong></p>
<p class="center"><img src="/assets/blog/hypothesis-testing/neyman-pearson-confusion-chart.png" alt="" /></p>
<p class="center"><em>We give the drug if our measurement is bigger than the threshold. If the drug works, we’ll get a result from the right (green) bell curve; if it doesn’t, we’ll get a result from the left (yellow) one.</em></p>
<p class="center"><em>ROC_curves.svg: Sharpr; derivative work: נדב ס, <a href="https://creativecommons.org/licenses/by-sa/4.0">CC BY-SA 4.0</a>, via <a href="https://commons.wikimedia.org/wiki/File:ROC_curves_colors.svg">Wikimedia Commons</a></em></p>
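<p>To make that trade-off concrete, here is a toy calculation. The two Gaussians, their means, and the choice of \(\alpha\) are made up for illustration; they are not from any real trial:</p>

```python
from statistics import NormalDist

# Toy version of the two curves above: the test statistic follows
# N(0, 1) if the drug doesn't work (null) and N(2, 1) if it does.
null, alternative = NormalDist(0, 1), NormalDist(2, 1)

alpha = 0.05                          # chosen Type I error rate
threshold = null.inv_cdf(1 - alpha)   # give the drug above this value
beta = alternative.cdf(threshold)     # Type II error rate this implies

print(round(threshold, 2), round(beta, 2))  # → 1.64 0.36
```

<p>Tightening \(\alpha\) to \(1\)% pushes the threshold to the right and raises \(\beta\) to about \(0.63\): fewer false positives, bought at the price of more false negatives. That’s exactly the balance discussed above.</p>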
<p>(You’ll often see \(\alpha\) referred to as the “false positive rate” and \(\beta\) as the “false negative” rate, but that’s a little inexact. In modern practice, the null hypothesis is almost always “there is no effect”, but this isn’t necessary to the framework. If we want to err on the side of prescribing the drug, then “the drug works” would be the null hypothesis and “no it doesn’t” would be the alternative. In that case, rejecting the null would be a <em>negative</em> result and a Type I error would be a false <em>negative</em>.)</p>
<p>But through all this, <strong>we have to be careful about what question we’re asking, and whether our methods can answer it.</strong> Naively we might want to ask something like “how likely is it that this drug works”, but Fisher, Neyman, and Pearson all would have agreed that that’s an incoherent question that can’t really be answered.<strong title="All three were [frequentists], and believed (roughly) that you can only give a &quot;probability&quot; for something repeatable. You can talk about the probability a study will give a null result, since you could run a hundred studies and count how many give the null. But you can't talk about the probability that a given drug works, since there's only the one drug. The major modern alternative to frequentist probability is [Bayesianism], which _does_ think this question makes sense. I've written about Bayesian reasoning [in the past](https://jaydaigle.net/blog/overview-of-bayesian-inference/) and I'll come back to it in Part 3 of this series. But the Neyman-Pearson method is definitely not Bayesian."><sup id="fnref:9"><a href="#fn:9" class="footnote">9</a></sup></strong> (And even if you believe it’s a coherent question, it’s still not an easy one.)</p>
<p class="center"><img src="/assets/blog/hypothesis-testing/seriously-you-didnt-answer-my-question.jpg" alt="An upset-looking cat: &quot;Seriously, you didn't answer my question&quot;" class="blog-image" /></p>
<p>Instead, the probabilities we computed are both conditional: <em>if</em> the drug doesn’t work, how likely are we to prescribe it? And <em>if</em> the drug does work, how likely are we to withhold it? We can use those probabilities to make the best possible decision, given the information we used and the assumptions we made. <strong>But we can’t compute the probability that our decision is correct</strong>, because that’s just not the question that the Neyman-Pearson method can answer.</p>
<h3 id="dont-tell-me-what-to-think">Don’t tell me what to think!</h3>
<p>In fact, the Neyman-Pearson method is even less able to answer that than the Fisher method. Fisher can’t tell us the probability that we’re right, but it’s at least an attempt to figure out whether we’re right, by measuring our experimental evidence against the null hypothesis. But <strong>Neyman-Pearson doesn’t even try to tell us whether the drug “really works” or not. It just tells us what we should <em>do</em>.</strong></p>
<p>And it is very possible to believe that a drug probably works and is safe, but also that we’re <a href="https://en.wikipedia.org/wiki/Primum_non_nocere">not sure enough</a> to go around prescribing it; it’s equally possible to believe a drug probably doesn’t work, but it’s cheap and harmless so we <a href="https://jaydaigle.net/blog/pascalian-medicine/">might as well give it a shot</a>. Neyman himself wrote, in his <em>First Course in Probability and Statistics</em>:</p>
<blockquote>
<p>[T]o accept a hypothesis \(H\) means only to decide to take action \(A\) rather than action \(B\). This does not mean that we necessarily believe that the hypothesis \(H\) is true. Also, [to reject] \(H\) means only that the rule prescribes action \(B\) and does not imply that we believe \(H\) is false.</p>
</blockquote>
<p>Researchers talk about the difference between statistical significance and <a href="https://statisticsbyjim.com/hypothesis-testing/practical-statistical-significance/">practical</a> or <a href="https://www.mhaonline.com/faq/clinical-vs-statistical-significance">clinical significance</a>, but <strong>in the true Neyman-Pearson setup, practical and statistical significance should be the same</strong>. Sure, if your measurements are precise enough, you can detect an effect that’s too small to matter. Conversely, a small pilot experiment can provide exciting, suggestive data without conclusively establishing any facts. But Neyman-Pearson is designed to choose a significance threshold \(\alpha\) to optimize <em>decision-making</em>, and that means that the statistical threshold <em>must</em> be a practically significant threshold.</p>
<p>If we’re trying to make an optimal decision based on limited information, Neyman-Pearson is about the best we can do. And that’s a pretty plausible description of a lot of medical studies. Phase III drug trials are slow, difficult, and expensive; we’re not going to run the whole thing over again just to check. We need a threshold for deciding whether to approve a drug or not, with the information we have; and that threshold is necessarily a practical one.</p>
<p>But scientific research isn’t generally about single isolated decisions; it’s a search for knowledge, an attempt to figure out what’s true and what isn’t. <strong>Neyman-Pearson very specifically <em>wasn’t</em> designed to answer questions about truth, but we try to use it to do science anyway.</strong> I’ll talk about how exactly that works (and doesn’t work) in Part 2 of this series; but (spoilers!) it works out <em>awkwardly</em>, and the mismatch between what Neyman-Pearson does and what we <em>want</em> it to do is a major contributor to the replication crisis.</p>
<h3 id="making-promises">Making promises</h3>
<p>The Neyman-Pearson method doesn’t tell you what to believe, but it does make a very specific promise: if you set your significance threshold to \(\alpha =5\)%, then your false positive rate will be \(5\)%. This is a statistics theorem, so it really is guaranteed—if you set everything up correctly.</p>
<p class="center"><img src="/assets/blog/hypothesis-testing/promise-kitten.jpg" alt="sad kitten: &quot;Do you promise?&quot;" class="blog-image" /></p>
<p>But that guarantee only applies to the threshold you set <em>before you saw the data</em>. If you run your experiment, do your analysis, and compute \(p = 0.048\), then your result is significant, and the background false positive rate is \(5\)%. But if you run your experiment, do your analysis, and compute \(p = 0.001\), then your result is significant, and the background false positive rate is <em>still</em> \(5\)%. The false positive rate doesn’t get lower just because the \(p\)-value does.</p>
<p>Huh? Isn’t \(p = 0.001\) much stronger evidence than \(p = 0.048\)?</p>
<p>In one sense, yes. That’s what Fisher tells us. But Fisher doesn’t make <em>decisions</em>, and doesn’t make this statistical guarantee. It’s a different tool that answers a different question.</p>
<p>Neyman-Pearson <em>does</em> make a guarantee, but that guarantee is very specific. <strong>If you run a hundred experiments where the null hypothesis is true, you’ll only reject about five times.</strong> (And you get the lowest possible \(\beta\), the fewest possible false negatives, compatible with that false positive rate.) But that’s all you’re guaranteed.</p>
<p>And in particular, if the null hypothesis is true then all \(p\)-values are equally likely. So if you do a hundred experiments, you should expect one of them to give you \(p=0.95\), and one to give \(p = 0.05\), and one to give \(p=0.01\). And that \(0.01\) isn’t, mathematically, special. It’s just one of the five false positives you expect.</p>
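<p>You can watch this happen in a quick simulation (a sketch, with a one-sided \(z\)-test standing in for whatever test you’re running; the seed and sample count are arbitrary):</p>

```python
import random
from statistics import NormalDist

random.seed(0)
norm = NormalDist()

# 100,000 experiments where the null is true: each yields a z-statistic
# drawn from N(0, 1), converted to a one-sided p-value.
pvals = [1 - norm.cdf(random.gauss(0, 1)) for _ in range(100_000)]

# Under the null the p-values are uniform: about 5% land below 0.05,
# and about 1% below 0.01 -- the very low ones aren't special, just rarer.
frac_05 = sum(p < 0.05 for p in pvals) / len(pvals)
frac_01 = sum(p < 0.01 for p in pvals) / len(pvals)
print(frac_05, frac_01)
```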
<p>If you want the guarantees of Neyman-Pearson’s methods, you can’t treat especially low \(p\)-values as especially, well, <em>special</em>. They land in your critical region. You reject the null. The answer to your question is “yes, prescribe the drug”. And that’s <em>all you get</em>.</p>
<p>And the same reasoning applies to results “trending towards significance”. If your \(p\)-value is \(0.06\), then you’re outside the critical region, you accept the null, and the answer to your question is “no, don’t prescribe it”.</p>
<p>And here’s the weirdest bit. If you get \(p=0.06\), you can change your significance threshold after the fact. Now you’re getting a \(6\)% false positive rate. And maybe that sounds like what you’d expect? But <strong>that also applies, retroactively, to every <em>other</em> time you ran an experiment</strong>, even if you got \(p=0.04\) and didn’t have to change your threshold.</p>
<p>If you set yourself a spending limit of \$20, but then spend \$25 when you see something you really wanted, you didn’t actually have a spending limit of \$20 in the first place. And if you’re willing to lift your \(\alpha\) when your \(p\)-value is too high—if you know that when \(p = 0.06\) you’ll frown, and hesitate, and grudgingly prescribe the drug anyway—then your \(\alpha\) is really \(6\)%, regardless of what you say. You’ll get false positives six percent of the time. You’re answering a slightly different question. Which is fine—<em>if</em> it’s closer to the question you really want to answer.</p>
<h2 id="what-are-they-good-for">What are they good for?</h2>
<p>We’ve seen these two different approaches to significance testing, and which specific questions they’re trying to answer. Now we can try to figure out when to use each of these tools, and when neither of them is quite right.</p>
<h3 id="the-measure-of-some-things">The measure of some things</h3>
<p>If you have a specific, yes-or-no decision you need to make on limited evidence, the Neyman-Pearson framework is fantastic. For a doctor deciding whether to prescribe a drug, or a company doing A/B testing deciding whether to roll out a new feature, it is exactly the right tool. Choose your \(\alpha\) and \(\beta\) intelligently, commit to your threshold, run your experiment, and you’re done.</p>
<p>But scientific research doesn’t really work that way. In part, that’s because we accumulate knowledge over time; we don’t need to make a big decision after one study.<strong title="Modern researchers have ways to get around that using tools like meta-analysis: at any given time you can make a decision based on all your information, and when you get new information you can make a new decision. But it's still a bit forced, and not what Neyman-Pearson was designed for."><sup id="fnref:10"><a href="#fn:10" class="footnote">10</a></sup></strong> Fisher’s methods were designed to handle this accumulation of evidence much more adroitly, since they don’t create hard cutoffs: as Fisher wrote, “decisions are final, while the state of opinion derived from a test of significance is provisional, and capable, not only of confirmation, but of revision.”</p>
<p>The bigger problem is that Neyman-Pearson and Fisher are often used to answer the wrong question entirely. <strong>Sometimes in science we just want to know whether something is real.</strong> For example, the Large Hadron Collider wanted to find out <a href="https://en.wikipedia.org/wiki/Higgs_boson#Search_and_discovery">if the Higgs Boson existed</a>. This isn’t really what Neyman-Pearson is built for—remember, it’s for making decisions, not finding the truth—but it is a yes-or-no question, so we can kind of make it work. Fisher’s methods were designed for <em>exactly</em> this question, by measuring how much evidence your experiment gives for the thing’s existence, and they are essentially what the CERN team used.</p>
<p class="center"><img src="/assets/blog/hypothesis-testing/higgs-mordor.jpg" alt="One does not simply find the Higgs Boson" class="blog-image" /></p>
<p><strong>But more often we want to measure something.</strong> This is true even for things like the Higgs search, where the initial announcement of the Higgs boson discovery was for “a new particle with a mass between \(125\) and \(127\) \(\text{GeV}/c^2\)”. It’s even more true in other contexts. In medicine, we want to know <em>how effective</em> a drug will be; in psychology we want to know <em>how strongly</em> a picture can affect our emotions; in public policy we want to know <em>how much</em> a new program will reduce poverty.</p>
<p><strong>And neither Fisher nor Neyman-Pearson answers those questions at all.</strong> It’s just not what they’re designed to do.</p>
<p>I talked about this problem in my <a href="https://jaydaigle.net/blog/replication-crisis-math/#effect-sizes">post on the replication crisis</a>. Amy Cuddy started by asking whether the power pose had an effect—a yes-or-no question. She wound up talking about <em>how large</em> the effect was, which is a completely different question. Hypothesis testing only answers the first question; if you try to use it to <em>measure</em> things you cause yourself all sorts of problems, just like the ones Cuddy ran into.</p>
<p>We also see these problems in research on politically controversial subjects like <a href="https://en.wikipedia.org/wiki/Minimum_wage#Statistical_meta-analyses">the minimum wage</a> and <a href="https://en.wikipedia.org/wiki/Gun_control">gun control</a>. Economic theory suggests that raising the minimum wage should increase unemployment; there’s an extensive literature of dueling empirical studies, with some showing that it does, and others showing that it doesn’t. A lot of ink has been spilled over whether minimum wage increases <em>really</em> increase unemployment, and that’s a genuinely tricky question that I can’t answer.<strong title="Among other things, because the answer is probably &quot;sometimes yes and sometimes no, it depends on the circumstances.&quot; And I don't think anyone seriously doubts that a minimum wage of $100 per hour would increase unemployment, and a minimum wage of $1 per hour would not."><sup id="fnref:11"><a href="#fn:11" class="footnote">11</a></sup></strong></p>
<p>But what I can do is <em>reframe</em> the question. We don’t know if the minimum wage raises are increasing unemployment. But we do know they can’t be increasing it <em>very much</em>. If they were, we’d be able to tell! So the effect may be real, but if it is, it’s <em>small</em>.<strong title="This is the difference between &quot;practical significance&quot; and &quot;statistical significance&quot; we talked about earlier. But that distinction shouldn't arise in a proper Neyman-Pearson setup, which is one way you can tell it's being misused here."><sup id="fnref:12"><a href="#fn:12" class="footnote">12</a></sup></strong> That’s a good enough answer to make policy. But it’s not an answer that hypothesis testing can give you.</p>
<p>If we care about the size of what we’re studying, and not just whether it exists at all, there are much better tools to use than hypothesis tests like Fisher or Neyman-Pearson. I’ll talk about some of these in Part 3 of this series.</p>
<h3 id="the-significance-binary">The Significance Binary</h3>
<p>The other major difference between Fisher’s approach and Neyman-Pearson’s is the degree of nuance allowed in their answers. In Fisher’s formulation, we ask how much evidence our experiment gives against the null hypothesis, which means we can have a lot of shades of gray in our result. The lower the \(p\)-value, the stronger the evidence; a \(p\)-value of \(0.001\) is ten times as good as a \(p\)-value of \(0.01\).</p>
<p>This still doesn’t measure the size of the effect, because you can have lots of evidence for a small effect. (I have plenty of evidence that I can move things by pushing them with my finger, but that won’t allow me to knock over the Washington Monument.) But Fisher’s methods do give a fine-grained, quantitative measurement of something: the strength of the evidence against our null hypothesis.</p>
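To make the graded scale concrete, here is a small sketch in Python (not part of the original essay) of the one-sided \(p\)-value for a run of coin flips; the particular flip counts are made-up illustrations:

```python
from math import comb

def p_value(heads, flips):
    """One-sided p-value under the null "the coin is fair":
    the chance of seeing at least `heads` heads in `flips` tosses."""
    tail = sum(comb(flips, k) for k in range(heads, flips + 1))
    return tail / 2 ** flips

# On Fisher's reading, the p-value is a graded measure of evidence:
# more extreme data gives a smaller p, i.e. stronger evidence.
print(p_value(8, 10))   # ~0.055: suggestive
print(p_value(16, 20))  # ~0.006: much stronger evidence
```

The point is that the output is a continuum: every extra head shrinks the \(p\)-value and strengthens the case against the null, with no single cutoff built in.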
<p>In contrast, <strong>the Neyman-Pearson formulation doesn’t give us fine distinctions</strong>. We ask if our alternative hypothesis is better than the null, and we get an answer to exactly that question—and that answer can only be “yes” or “no”. The entire continuous spectrum of possible \(p\)-values gets compressed into a single bit, with no middle ground.</p>
<p class="center"><img src="/assets/blog/hypothesis-testing/false-dichotomy.jpg" alt="Image of umpire: &quot;False dichotomy on the play. Arbitrarily reducing a set of many possibilities to only two.&quot;" class="blog-image" /></p>
<p>That’s a huge problem when nuance is important, with consequences visible throughout the body of scientific literature. But the problems are especially bad in contexts like public health communication, where both honesty and clarity save lives.</p>
<p>Our medical establishment uses what’s essentially a Neyman-Pearson framework to evaluate possible treatments. And it is (understandably) conservative about approving new drugs, which means that \(\alpha\) is set very low. The price is a lot of false negative results, denying treatments that would work. And in a terrible misuse of language, when a treatment doesn’t clear our fairly high bar for significance, we tend to say there is <a href="https://twitter.com/zeynep/status/1366175070507384836">“no evidence”</a> for it, or even flatly that it “doesn’t work”—whether we mean that it definitely doesn’t work, or that it probably does but we’re not quite sure yet.</p>
<p>This failing was on full display in the early days of the coronavirus pandemic. In February and March 2020, the Surgeon General issued a statement that masks “are NOT effective in preventing” Covid infections, even though we had good reasons to believe they were; the evidence was real, but not (yet) sufficient to reject the null. In December, the World Health Organization said there was <a href="https://twitter.com/WHO/status/1254160944638447618">no evidence that vaccines would reduce covid transmission</a>. Again, there was real evidence that vaccines would reduce transmission, but not enough to cross WHO’s Neyman-Pearson-style decision threshold. And because of the binary output of a Neyman-Pearson process, this tentative wait-and-see approach was communicated in the form of definitive, final-sounding judgments.</p>
<p>There are definitely smarter and more sophisticated ways to use hypothesis testing on questions like this. First, it would help just to remember that our results are provisional and not absolute truths. Sometimes we do have to make a decision <em>now</em> about whether to prescribe a treatment, or roll out a new product, or even just change some official guidelines. But that doesn’t mean we’re locked into that decision forever; and simply saying there was “not enough” evidence for masks, rather than “no evidence”, would have been more honest and <em>also</em> made the subsequent reversal less confusing.</p>
<p>Second, when we do have to make decisions, we can be more thoughtful about the trade-offs between false positives and false negatives. It’s become standard to take \(\alpha=0.05\) and let \(\beta\) fall where it may; but the decision theory works best when we think about the actual trade-offs involved, and choose our parameters accordingly. That, too, would have helped with communication around Covid: the risks of having people wear masks for a couple months while we figured out if they helped were low, and we didn’t need to be as cautious about recommending masking as we are about approving a new cancer drug.</p>
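Here is a rough sketch of what that trade-off looks like numerically, using the power of a one-sided \(z\)-test; the effect size and sample size are made-up numbers for illustration, not from any real study:

```python
from math import erf, sqrt

def norm_cdf(x):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def norm_ppf(q):
    """Invert the normal CDF by bisection (plenty accurate here)."""
    lo, hi = -10.0, 10.0
    while hi - lo > 1e-9:
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if norm_cdf(mid) < q else (lo, mid)
    return (lo + hi) / 2

def power(effect, n, alpha):
    """Chance a one-sided z-test at level `alpha` detects a true shift
    of `effect` standard deviations, given `n` observations."""
    return 1 - norm_cdf(norm_ppf(1 - alpha) - effect * sqrt(n))

# Hypothetical small effect (0.2 sd) studied with n = 50:
print(round(power(0.2, 50, 0.05), 2))  # cautious alpha: many false negatives
print(round(power(0.2, 50, 0.20), 2))  # lenient alpha: far fewer false negatives
```

Loosening \(\alpha\) from \(0.05\) to \(0.20\) substantially raises the power, i.e. shrinks \(\beta\). Whether that trade is worth it depends on the relative costs of the two kinds of error, which is exactly the question a mask recommendation and a cancer-drug approval answer differently.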
<h2 id="where-do-we-stand">Where do we stand?</h2>
<p>Hypothesis tests are ways of using data to give yes-or-no answers to certain questions. They’re extremely powerful in the contexts they were designed for: Neyman-Pearson gives a good rule for making decisions, and Fisher gives a good approach to describing how much evidence your experiment produced. But when you try to apply them outside of those contexts, you can easily get confusing or misleading results.</p>
<p>But this essay has presented both approaches to hypothesis testing more or less as they were originally designed, in their original contexts. <strong>The Fisher approach gives us a nuanced evaluation of the evidence, but no firm conclusion; the Neyman-Pearson approach gives us a clear answer, but nothing else.</strong></p>
<p>But modern researchers often want both, and modern hypothesis testing works a little differently: it tries to deliver both at once. And it often, predictably, fails.</p>
<p>Next time in <a href="/blog/hypothesis-testing-part-2">Part 2</a> we’ll see how the modern approach to hypothesis testing works. And we’ll see how the modifications we’ve made to try to have it both ways lose some of the benefits of both approaches, and invite the sort of research failures that we’ve seen throughout the replication crisis.</p>
<p class="center"><img src="/assets/blog/hypothesis-testing/cats-soon.jpg" alt="Cats staring at city skyline. Caption: "Soon."" class="blog-image" /></p>
<hr />
<p><em>Have questions about hypothesis testing? Is there something I didn’t cover, or even got completely wrong? Or is there something you’d like to hear more about in the rest of this series? Tweet me <a href="https://twitter.com/profjaydaigle">@ProfJayDaigle</a> or leave a comment below.</em></p>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>Some of these specific people were <a href="https://nautil.us/how-eugenics-shaped-statistics-9365/">pretty awful in one way or another</a>. Ronald Fisher in particular was <a href="https://www.newstatesman.com/uncategorized/2020/07/ra-fisher-and-science-hatred">racist</a> and a <a href="https://priceonomics.com/why-the-father-of-modern-statistics-didnt-believe/">vigorous defender of tobacco companies</a>, though Jerzy Neyman seems to have been <a href="https://daniellakens.blogspot.com/2021/09/jerzy-neyman-positive-role-model-in.html?m=1">perfectly lovely</a>. I’m not going to go into detail about their failings, among other things because I’m not especially well-informed on the subject; I recommend the articles I linked if you want to know more. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>Much of this essay, and especially the historical information on the way these schools of thought developed, draws heavily on the article <a href="https://doi.org/10.1198/0003130031856">Confusion Over Measures of Evidence (\(p\)’s) Versus Errors (\(\alpha\)’s) in Classical Statistical Testing</a> by Hubbard and Bayarri. This extremely readable article is also a fascinating historical artifact, basically predicting the entire contour of the replication crisis in 2003. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3">
<p>Okay, maybe I didn’t <em>actually</em> expect my car to be accurate and unbiased. But it’s at least <em>supposed</em> to be true, so it provides a good baseline for comparison. <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
<li id="fn:4">
<p>You might worry about whether it’s a two-sided but biased coin. But Gelman and Nolan have argued that <a href="https://www.tandfonline.com/doi/abs/10.1198/000313002605">coins physically can’t be biased</a>, and I find their argument compelling. If you don’t find it compelling, you have to decide how likely you think a weighted coin would be—which is exactly the “other evidence” that Fisher’s paradigm doesn’t even try to account for. <a href="#fnref:4" class="reversefootnote">↩</a></p>
</li>
<li id="fn:5">
<p>A friend asks if meta-analysis accomplishes the same thing, but meta-analysis is actually a much weaker threshold than the one Fisher gives here. Meta-analysis tries to amplify weak signals and reconcile inconsistent results; Fisher says we should only believe a claim when we can consistently get a strong signal. <a href="#fnref:5" class="reversefootnote">↩</a></p>
</li>
<li id="fn:6">
<p>From what I understand, Fisher was a little contemptuous of the idea that you could answer this question mathematically. <a href="#fnref:6" class="reversefootnote">↩</a></p>
</li>
<li id="fn:7">
<p>I’m not convinced I agree with this, but that’s beside the point here. I’ll discuss this choice a bit more in Part 2 of this series. <a href="#fnref:7" class="reversefootnote">↩</a></p>
</li>
<li id="fn:8">
<p>In a medical context, we often talk about the related concepts of <em>sensitivity</em> and <em>specificity</em>. Sensitivity is the “true positive” rate \(1-\beta\), the probability of correctly prescribing the drug if it would help. Specificity is the “true negative” rate \(1-\alpha\), the probability of correctly withholding the drug if it would not help.</p>
<p>These terms come from diagnostic testing. “Sensitivity” measures the chance of correctly detecting a condition that you have; “specificity” measures the chance of correctly detecting that you don’t have a condition. <a href="#fnref:8" class="reversefootnote">↩</a></p>
</li>
<li id="fn:9">
<p>All three were <a href="https://en.wikipedia.org/wiki/Frequentist_probability">frequentists</a>, and believed (roughly) that you can only give a “probability” for something repeatable. You can talk about the probability a study will give a null result, since you could run a hundred studies and count how many give the null. But you can’t talk about the probability that a given drug works, since there’s only the one drug.</p>
<p>The major modern alternative to frequentist probability is <a href="https://en.wikipedia.org/wiki/Bayesian_probability">Bayesianism</a>, which <em>does</em> think this question makes sense. I’ve written about Bayesian reasoning in the past and I’ll come back to it in Part 3 of this series. But the Neyman-Pearson method is definitely not Bayesian. <a href="#fnref:9" class="reversefootnote">↩</a></p>
</li>
<li id="fn:10">
<p>Modern researchers have ways to get around that using tools like meta-analysis: at any given time you can make a decision based on all your information, and when you get new information you can make a new decision. But it’s still a bit forced, and not what Neyman-Pearson was designed for. <a href="#fnref:10" class="reversefootnote">↩</a></p>
</li>
<li id="fn:11">
<p>Among other things, because the answer is probably “sometimes yes and sometimes no, it depends on the circumstances.” And I don’t think anyone seriously doubts that a minimum wage of \$100 per hour would increase unemployment, and a minimum wage of \$1 per hour would not. <a href="#fnref:11" class="reversefootnote">↩</a></p>
</li>
<li id="fn:12">
<p>This is the difference between “practical significance” and “statistical significance” we talked about earlier. But that distinction shouldn’t arise in a proper Neyman-Pearson setup, which is one way you can tell it’s being misused here. <a href="#fnref:12" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Jay DaigleThis is the first part of a three-part series explaining what hypothesis testing is and how it works. In this essay I'll talk about the way hypothesis testing developed historically, in two rival schools of thought. I'll explain how these two methodologies were originally supposed to work, and why you might (or might not) want to use them.Why Isn’t There a Replication Crisis in Math?2022-02-02T00:00:00-08:002022-02-02T00:00:00-08:00https://jaydaigle.net/blog/replication-crisis-math<p>One important thing that I think about a lot, even though I have no formal expertise, is the <a href="https://www.vox.com/future-perfect/21504366/science-replication-crisis-peer-review-statistics">replication crisis</a>. A shocking fraction of published research in many fields, including medicine and psychology, is flatly wrong—the results of the studies can’t be obtained in the same way again, and the conclusions don’t hold up to further investigation. Medical researcher John Ioannidis brought this problem to wide attention in 2005 with a paper titled <a href="https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.0020124">Why Most Published Research Findings Are False</a>; attempts to replicate the results of major psychology papers suggest that <a href="https://www.theatlantic.com/science/archive/2018/11/psychologys-replication-crisis-real/576223/">only about half of them hold up</a>. A recent analysis gives <a href="https://apnews.com/article/science-business-health-cancer-marcia-mcnutt-93219170405e3de753651b89d4308461">a similar result for cancer research</a>.</p>
<p>This is a real crisis for the whole process of science. If we can’t rely on the results of famous, large, well-established studies, it’s hard to feel secure in <em>any</em> of our knowledge. It’s probably the most important problem facing the entire project of science right now.</p>
<p>There’s a lot to say about the mathematics we use in social science research, especially the statistics, and how bad math feeds the replication crisis.<strong title="I'm a big fan of the [Data Colada] project, and of [Andrew Gelman's writing] on the subject"><sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup></strong> But I want to approach it from a different angle. <strong>Why doesn’t <em>the field of mathematics</em> have a replication crisis?</strong> And what does that tell us about other fields, that do?</p>
<h2 id="why-doesnt-math-have-a-replication-crisis">Why doesn’t math have a replication crisis?</h2>
<h3 id="maybe-mathematicians-dont-make-mistakes">Maybe mathematicians don’t make mistakes</h3>
<p>Have you, uh, <a href="https://mathwithbaddrawings.com/2017/01/11/why-are-mathematicians-so-bad-at-arithmetic/">met any mathematicians</a>?</p>
<p style="text-align: center;"><a href="https://mathwithbaddrawings.com/2017/01/11/why-are-mathematicians-so-bad-at-arithmetic/"><img src="/assets/blog/replication-crisis-math/sign-error.jpg" alt="Cartoon: &quot;So the tip is...$70? But the meal was only $32...&quot; &quot;Maybe we made a sign error, and they owe us $70.&quot;" width="75%" /></a></p>
<p style="text-align: center;"><em>Comic by Ben Orlin at <a href="https://mathwithbaddrawings.com/2017/01/11/why-are-mathematicians-so-bad-at-arithmetic/">Math with Bad Drawings</a></em></p>
<p style="text-align: center;"><em>At Caltech, they made the youngest non-math major split the check: the closer you were to high school, the more you remembered of basic arithmetic. But everyone knew the math majors were hopeless.</em></p>
<p>More seriously, it’s reasonably well-known among mathematicians that <strong>published math papers are <a href="https://twitter.com/benskuhn/status/1419281164951556097"><em>full</em> of errors</a></strong>. Many of them are eventually fixed, and most of the errors are in a deep sense “unimportant” mistakes. But the frequency with which proof formalization efforts <a href="https://mathoverflow.net/questions/291158/proofs-shown-to-be-wrong-after-formalization-with-proof-assistant">find flaws in widely-accepted proofs</a> suggests that there are plenty more errors in published papers that no one has noticed.</p>
<p>So math has, if not a replication crisis, at least a replication problem. Many of our published papers are flawed. But it doesn’t seem like we have a crisis.</p>
<h3 id="maybe-our-mistakes-get-caught">Maybe our mistakes get caught</h3>
<p>In the social sciences, replicating a paper is hard. You have to get new funding and run a new version of the same experiment. There’s a lot of dispute about how closely you need to replicate all the mechanics of the original experiment for it to “count” as a replication, and sometimes you can’t get a lot of the details you’d need to do it right—especially if the original authors aren’t feeling helpful.<strong title="In theory, all papers should include enough information that you can replicate all the experiments they describe. In practice, I think this basically never happens. There's just too much information, and it's hard to even guess which things are going to be important."><sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup></strong> And after all that work, people won’t even be impressed, because you didn’t do anything original!</p>
<p>But one of the distinctive things about math is that our papers aren’t just records of experiments we did elsewhere. In experimental sciences, the experiment is the “real work” and the paper is just a description of it. But <strong>in math, the paper, itself, is the “real work”</strong>. Our papers don’t describe everything we do, of course. There’s a lot of intellectual exploration and just straight-up messing around that doesn’t get written down anywhere. But the paper contains a (hopefully) complete version of the argument that we’ve constructed.</p>
<p>And that means that <strong>you can <em>replicate</em> a math paper by <em>reading</em> it</strong>. When I’ve served as a peer reviewer I’ve read the papers closely and checked all the steps of the proofs, and that means that I have replicated the results. And any time you want to use an argument from someone else’s paper, you have to work through the details, and that means you’re replicating it again.</p>
<p>The replication crisis is partly the discovery that many major social science results do not replicate. But it’s also the discovery that we hadn’t been trying to replicate them, and we really should have been. In the social sciences we fooled ourselves into thinking our foundation was stronger than it was, by never testing it. But in math we couldn’t avoid testing it.</p>
<h3 id="maybe-the-crisis-is-here-and-we-just-havent-noticed">Maybe the crisis is here, and we just haven’t noticed</h3>
<p>As our mathematics gets more advanced and our results get more complicated, this replication process becomes harder: it takes more time, knowledge, and expertise to understand a single paper. If replication gets hard enough, we may fall into crisis. The crisis might even <a href="https://link.springer.com/article/10.1007/s00283-020-10037-7">already be here</a>; the problems in psychological and medical research existed for decades before they were widely appreciated.</p>
<p>There’s some fascinating work in using <a href="https://www.nature.com/articles/d41586-021-01627-2">computer tools to formally verify proofs</a>, but this is still a niche practice. In theory we are continually re-checking all our work, but in practice that’s inconsistent, so it’s hard to be sure how deep the problems run. (Especially since flawed papers <a href="https://twitter.com/zbMATH/status/1474326312517271560">don’t really get retracted</a> and you pretty much have to talk to active researchers in a field to know which papers you can trust.)</p>
<p style="text-align: center;"><img src="/assets/blog/replication-crisis-math/trust.jpg" alt="Picture of kitten in bubble bath with caption: &quot;my trust, u loses it.&quot;" width="50%" /></p>
<p>But while this is a real possibility that people should take seriously, I’m skeptical that we’re in the middle of a true crisis of replicability.<strong title="I'm sure every practitioner in every field says that, though, even years after the problems become obvious to anyone who looks. So take this with a grain of salt."><sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup></strong> <strong>Many papers have errors, yes—but our major results generally hold up, even when the intermediate steps are wrong!</strong> Our errors can usually be fixed without really changing our conclusions.</p>
<p>Since our main conclusions hold up, we don’t need to fix any downstream papers that relied on those conclusions. We don’t need to substantially revise what we thought we knew. We don’t need to jettison entire fields of research, the way <a href="https://replicationindex.com/2017/02/02/reconstruction-of-a-train-wreck-how-priming-research-went-of-the-rails/comment-page-1/">psychology had to abandon the literature on social priming</a>. There are problems, to be sure, and we could always do better. But it’s not a crisis.</p>
<h3 id="mysterious-intuition">“Mysterious” intuition</h3>
<p>But isn’t it…<em>weird</em>…that our results hold up when our methods don’t? How does that even work?</p>
<p>We get away with it because we can be right for the wrong reasons—<strong>we mostly only try to prove things that are basically true</strong>. Ben Kuhn tweeted a very accurate-feeling summary of the whole situation <a href="https://twitter.com/benskuhn/status/1419281164951556097">in this twitter thread</a>:</p>
<blockquote>
<p>[D]espite the fact that error-correction is really hard, publishing actually false results was quite rare because “people’s intuition about what’s true is mysteriously really good.” Because we mostly only try to prove true things, our conclusions are right even when our proofs are wrong.<strong title="A friend asks: if we mostly know what's true already, why do we need to actually find the proofs? The bad answer is &quot;you're not doing math if you don't prove things&quot;. The good answer is that finding proofs is how we train this mysteriously good intuition; if we didn't work out proofs in detail, we wouldn't be able to make good guesses about the next steps."><sup id="fnref:4"><a href="#fn:4" class="footnote">4</a></sup></strong></p>
</blockquote>
<p>This can make it weirdly difficult to resolve disagreements about whether a proof is actually correct. In a recent example, Shinichi Mochizuki claims that he has <a href="https://www.quantamagazine.org/titans-of-mathematics-clash-over-epic-proof-of-abc-conjecture-20180920/">proven the \(abc\) conjecture</a>, while most mathematicians don’t believe his argument is valid. But everyone involved is pretty confident the \(abc\) conjecture is true; the disagreement is about whether the proof itself is good.</p>
<p style="text-align: center;"><img src="/assets/blog/replication-crisis-math/proof.jpg" alt="Picture of cat walking through kitchen covered in trash: &quot;come find me when you have proof.&quot;" width="75%" /></p>
<p style="text-align: center;"><em>Circumstantial evidence isn’t enough to make mathematicians happy.</em></p>
<p>If we find a counterexample to \(abc\) then Mochizuki is clearly wrong, but so is everyone else. If we find a consensus proof of \(abc\), then Mochizuki’s conclusion is right, but that does very little to make his argument more convincing. He could, very easily, just be lucky.</p>
<h2 id="butpsychologists-have-intuition-too">But—Psychologists have intuition, too</h2>
<p>A lot of psychology results that don’t replicate look a little different from this perspective. Does standing in a <a href="https://en.wikipedia.org/wiki/Power_posing">power pose</a> for a few seconds make you feel more confident? Probably! It sure feels like it does (seriously, stand up and give it a try right now); and it would be weird if it made you feel <em>worse</em>. Does it affect you enough, for a long enough time, to matter much? Probably not. That would also be weird.</p>
<p style="text-align: center;"><img src="/assets/blog/replication-crisis-math/power-pose.jpg" alt="Picture of Amy Cuddy standing in front of a picture of Wonder Woman, in matching poses" width="50%" /></p>
<p style="text-align: center;"><em>Amy Cuddy demonstrating a power pose. <br />
Photo by Erik (HASH) Hersman from Orlando, <a href="https://creativecommons.org/licenses/by/2.0">CC BY 2.0</a>, via <a href="https://commons.wikimedia.org/wiki/File:Power_pose_by_Amy_Cuddy_at_PopTech_2011_(6279920726).jpg">Wikimedia Commons</a></em></p>
<p>The studies we’ve done, when analyzed properly, don’t show a clear, consistent, and measurable effect from a few seconds of power posing. But that’s what you’d expect, right? There’s probably an effect, but it should be too small to reasonably measure. And that’s totally consistent with everything we’ve found.</p>
<p>Amy Cuddy<strong title="I'm going to pick on Amy Cuddy and power posing a lot. That's not entirely fair to Cuddy; the pattern I'm describing is extremely common and easy to fall into, and I could make the same argument about [social priming research] or the [hungry judges study] or the dozens of others. (That's why it's a &quot;replication crisis&quot; and not a &quot;this one researcher made a mistake one time crisis&quot;.) But for simplicity I'm going to stick to the same example for most of this post."><sup id="fnref:5"><a href="#fn:5" class="footnote">5</a></sup></strong> had the intuition that power posing would increase confidence, and set out to prove it—just like Mochizuki had the intuition that the \(abc\) conjecture was true, and set out to prove it. Mochizuki’s proof was bad, but his top-line conclusion was probably right because the \(abc\) conjecture is probably correct. And Cuddy’s studies were flawed, but her intuition at the start was probably right, so her top-line conclusion is probably true.</p>
<p>Well, sort of.</p>
<h3 id="defaulting-to-zero">Defaulting to zero</h3>
<p>Let’s turn Cuddy’s question around for a bit.<strong title="Mathematicians love doing this. I'm a mathematician, so I love doing this. But it's genuinely a useful way to think about what's going on."><sup id="fnref:6"><a href="#fn:6" class="footnote">6</a></sup></strong> What are the chances that power posing has <em>exactly zero</em> effect on your psychology? That would be extremely surprising. Most things you do affect your mindset at least a little.<strong title="This is your regular reminder to stand up, stretch, and drink some water."><sup id="fnref:7"><a href="#fn:7" class="footnote">7</a></sup></strong></p>
<p>So our expectation should be: either power posing makes you a little more confident, or it makes you a little less confident. It also probably makes you either a little more friendly or a little less friendly, a little more or a little less experimental, a little more or a little less agreeable—<strong>an effect of exactly zero would be a surprise</strong>.</p>
<p>But for confidence specifically, it would also be kind of surprising if power posing made you feel less confident. So my default assumption is that power posing causes a small increase in confidence. And nominally, Cuddy’s research asked whether that default assumption is correct.</p>
<p>But that’s just not a great question. It doesn’t really matter if standing in a power pose makes you feel marginally better for five seconds. Not worth a book deal and a TED talk, and barely worth publishing. <strong>Cuddy’s research was interesting because it suggested the effect of power posing was not only positive, but <em>large</em></strong>—enough to make a dramatic, usable impact over an extended period of time.</p>
<p>If Cuddy’s results were true, they would be both surprising and important. But that’s just another way of saying they’re probably not true.</p>
<h3 id="power-and-precision">Power and Precision</h3>
<p>Notice: we’ve shifted to a new, different question. We started out asking “does power posing make you more confident”, but now we’re answering “how much more confident does power posing make you”. This is a better question, sure, but it’s different. And <strong>the statistical tools appropriate to the first question don’t really work for the new and better one.</strong></p>
<p><a href="https://en.wikipedia.org/wiki/Statistical_hypothesis_testing">Statistical hypothesis testing</a> is designed to give a yes/no answer to “is this effect real”. Hypothesis testing is surprisingly complicated to actually explain correctly, and probably deserves <a href="/blog/hypothesis-testing-part-1">an essay</a> or two on its own.<strong title="I originally tried to write a concise explanation to include here. It hit a thousand words and was nowhere near finished, so I decided to save it for later. Update: I have now posted the [first] and [second] essays in a three-part series on hypothesis testing."><sup id="fnref:8"><a href="#fn:8" class="footnote">8</a></sup></strong></p>
<p style="text-align: center;"><img src="/assets/blog/replication-crisis-math/hypothesis-testing.png" alt="Diagram of true and false positives and negatives on a bell curve" width="75%" /></p>
<p style="text-align: center;"><em>I swear this picture makes sense.<br />
ROC_curves.svg: Sharprderivative work: נדב ס, <a href="https://creativecommons.org/licenses/by-sa/4.0">CC BY-SA 4.0</a>, via <a href="https://commons.wikimedia.org/wiki/File:ROC_curves_colors.svg">Wikimedia Commons</a></em></p>
<p>To wildly oversimplify, we measure something, and check if that measurement is so big that it’s unlikely to occur by chance. If yes, we conclude that there’s a real effect from whatever we’re studying. If not, we generally conclude that there’s no effect.</p>
<p>But what if the effect is real, but very small? With this method, we conclude the effect is real if our measurements are big enough. <strong>But if the effect is small, our measurements won’t be <em>big</em>. Our study might not have enough <a href="https://en.wikipedia.org/wiki/Power_of_a_test">power</a> to find the effect</strong> even if it is real.<strong title="This means we have to be really careful about interpreting studies that don't find any effect. A study with low power will find &quot;[no evidence]&quot; of an effect even if the effect is very real, and that can be [just as misleading] as the errors I'm discussing in this essay. More careful researchers will say they &quot;fail to reject the null hypothesis&quot; or &quot;fail to find an effect&quot;. If everyone were always that careful I wouldn't need to write this essay."><sup id="fnref:9"><a href="#fn:9" class="footnote">9</a></sup></strong></p>
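You can see low power directly in a quick simulation. This Python sketch is illustrative only: the effect size, sample size, and test are invented assumptions, not numbers from any power-posing study.

```python
import random
from math import sqrt
from statistics import mean

random.seed(0)  # reproducible simulation

def one_study(effect, n, z_crit=1.645):
    """Simulate one study: n noisy measurements (known unit variance)
    of a true `effect`, tested one-sided at the 5% level."""
    data = [random.gauss(effect, 1) for _ in range(n)]
    return mean(data) * sqrt(n) > z_crit

# A real but tiny effect, measured with a modest sample:
detections = sum(one_study(effect=0.1, n=30) for _ in range(4000))
print(detections / 4000)  # well under half: most studies "find nothing"
```

The effect here is perfectly real, yet the majority of simulated studies fail to reject the null, which is exactly what an underpowered study does.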
<p>We could run a more powerful study and find evidence of smaller effects if we could make more precise measurements. This approach has worked really well in fields like physics and chemistry, and a lot of fundamental physical discoveries were driven by new technology that allowed the measurement of smaller effects. Galileo’s experiments with falling speeds required him to invent <a href="https://www.thegreatcoursesdaily.com/the-rolling-ball-experiments-galileos-terrestrial-mechanics/">improved timekeeping methods</a>, and Coulomb developed his inverse-square law after <a href="https://en.wikipedia.org/wiki/Coulomb%27s_law#History">his torsion balance</a> allowed him to precisely measure electrostatic forces. In the modern era, we built extremely sensitive measurement devices to try to measure <a href="https://en.wikipedia.org/wiki/LIGO">gravitational waves</a> and detect <a href="https://en.wikipedia.org/wiki/Higgs_boson#Search_and_discovery">the Higgs boson</a>.</p>
<p>If power posing increases confidence by 1% for thirty seconds, that would actually be perfectly fine if we could measure confidence to within a hundredth of a percent on a second-to-second basis. But social psychology experiments just don’t work that way—at least, not with our current technology. There’s too much randomness and behavioral variation. Effects of that size just aren’t detectable.</p>
<p>This doesn’t have to be a problem! If we want to know “how big is the effect of power posing”, the answer is “too small to detect”. That’s a fine answer. It tells you that you shouldn’t build any complicated apparatus based on exploiting the power pose. (Or write <a href="https://www.goodreads.com/book/show/25066556-presence">entire books</a> on how it can change your life.)</p>
<p>But the question we started with was “does power posing have an effect at all?”. If the effect is small, we might struggle to tell whether it’s real or not.</p>
<h3 id="but-we-already-know-the-answer">But we already know the answer!</h3>
<p>Imagine you’re a psychologist researching power posing. You measure a small effect, which could just be due to chance. But you’re pretty sure that the effect is real; clearly you didn’t do a good enough job in your study! It’s probably <a href="https://en.wikipedia.org/wiki/Publication_bias">not even worth publishing</a>.</p>
<p style="text-align: center;"><img src="/assets/blog/replication-crisis-math/X-Men-question-answer.gif" alt="Gif from X-Men movie. 'Why do you ask questions to which you already know the answers?'" width="75%" /></p>
<p>So you try again. Or someone else tries again. And eventually someone runs a study that <em>does</em> see a large effect. (Occasionally the large effect is due to fraud. Usually it’s methodology with subtler flaws that the researcher doesn’t notice. And sometimes it’s just luck: you’ll get a one-in-twenty outcome once in every twenty tries.)</p>
<p>Now we’re all happy. We were pretty sure that we would see an effect if we looked closely enough. And there it is! At this point no one has an incentive to look for flaws in the study. The result makes sense. (You might remember we said this is the state of a lot of mathematical research.)</p>
<p>But there are two major problems we can run into here. The first is that <strong>our intuition can, in fact, be wrong</strong>. If your process can only ever prove things that you already believed, it’s not a good process; you can’t really learn anything. Andrew Gelman <a href="https://statmodeling.stat.columbia.edu/2021/11/18/fake-drug-studies/">recently made this observation about fraudulent medical research</a>:</p>
<blockquote>
<p>If you frame the situation as, “These drugs work, we just need the paperwork to get them approved, and who cares if we cut a few corners, even if a couple people die of unfortunate reactions to these drugs, they’re still saving thousands of lives,” then, sure, when you think of aggregate utility we shouldn’t worry too much about some fraud here and there…</p>
</blockquote>
<blockquote>
<p>But I don’t know that this optimistic framing is correct. I’m concerned that bad drugs are being approved instead of good drugs….Also, negative data—examples where the treatment fails to work as expected—provide valuable information, and by not doing real trials you’re depriving yourself of opportunities to get this feedback.</p>
</blockquote>
<p>Shoddy research practices make sense if you see scientific studies purely as bureaucratic hoops you have to jump through: it’s “obviously true” that power posing will make you bolder and more confident, and the study is just a box you have to check before you can go around saying that out loud. But <strong>if you want to learn things, or be surprised by your data, you need to be more careful</strong>.</p>
<h2 id="effect-sizes">Effect Sizes Matter</h2>
<h3 id="overestimation">Overestimation</h3>
<p>The second problem can bite you even if your original intuition is right. You start out just wanting to know “is there an effect, y/n?”, but your experiment will make a measurement. You will get an estimate of the <em>size</em> of the effect. And that estimate will be wrong.</p>
<p>Your estimate will be wrong for a silly, almost tautological reason: <strong>if you can only detect large effects, then any effect you detect will be large</strong>. If you keep looking for an effect, over and over again, until finally one study gets lucky and sees it, that study will almost necessarily give <a href="https://statmodeling.stat.columbia.edu/2014/11/17/power-06-looks-like-get-used/">a wild overestimate</a> of the effect size.</p>
<p style="text-align: center;"><img src="https://statmodeling.stat.columbia.edu/wp-content/uploads/2014/11/Screen-Shot-2014-11-17-at-11.19.42-AM.png" alt="A diagram of the effects of low-power studies. This is what 'power = 0.06' looks like. Get used to it. Type S error probability: if the estimate is statistically significant, it has a 24% chance of having the wrong sign. Exaggeration ratio: if the estimate is statistically significant, it must be at least 9 times higher than the effect size." width="75%" /></p>
<p style="text-align: center;"><em>If the effect is small relative to your measurement precision, your results are guaranteed to be misleading. Figure by <a href="https://statmodeling.stat.columbia.edu/2014/11/17/power-06-looks-like-get-used/">Andrew Gelman</a>.</em></p>
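<p>Gelman’s picture is easy to reproduce with a quick simulation. Here’s a sketch, with numbers I picked to roughly match his figure (a true effect of 2 measured with a standard error of 8.1, which gives a power of about 0.06); none of these numbers comes from any real study. When only estimates beyond 1.96 standard errors count as significant, the significant ones are enormous overestimates and frequently point the wrong way.</p>

```python
import random
import statistics

random.seed(0)

# Illustrative numbers chosen to roughly match Gelman's figure,
# not taken from any actual study.
TRUE_EFFECT = 2.0
STD_ERROR = 8.1
N_STUDIES = 100_000

# Each simulated study reports one noisy estimate; call it "statistically
# significant" when it lands more than 1.96 standard errors from zero.
significant = [est for est in
               (random.gauss(TRUE_EFFECT, STD_ERROR) for _ in range(N_STUDIES))
               if abs(est) > 1.96 * STD_ERROR]

power = len(significant) / N_STUDIES
exaggeration = statistics.mean(abs(e) for e in significant) / TRUE_EFFECT
wrong_sign = sum(e < 0 for e in significant) / len(significant)

print(f"power ≈ {power:.2f}")                       # around 0.06
print(f"exaggeration ratio ≈ {exaggeration:.1f}x")  # significant estimates are ~9x too big
print(f"wrong sign ≈ {wrong_sign:.0%}")             # roughly a quarter point the wrong way
```

<p>Notice that no individual simulated study does anything wrong here; the publication filter alone manufactures the exaggeration.</p>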
<p>And this is how you wind up with shoddy research telling you that all sorts of things have shockingly large and dramatic impacts on…whatever you’re studying. You start out with the intuition that power posing should increase confidence, which is reasonable enough. You run studies, and eventually one of them agrees with you: power posing does make you more confident. But not just a little. In your study, people who did a little power posing saw big benefits.</p>
<p>To your surprise, you’ve discovered a life-changing innovation. You issue press releases, write a book, give a TED talk, spread the good news of how much you can benefit from this little tweak to your life.</p>
<p>Then other researchers try to probe the effect further—and it vanishes. Most studies don’t find clear evidence at all. The ones that do find something show much smaller effects than you had found. Of course they do. Your study got published precisely because its result was unusually extreme.</p>
<h3 id="dont-forget-your-prior">Don’t forget your prior</h3>
<p>Notice how, in all of this, we lost sight of our original hypothesis. It seemed basically reasonable to think power posing might perk you up a bit. That’s what we originally wanted to test, and that’s the conviction that made us keep trying. But we <em>didn’t</em> start out thinking that it would have a huge, life-altering impact.</p>
<p><strong>A really large result should feel just as weird as no result at all, if not weirder</strong>. And when we stop to think about that, we know it; some research suggests that <a href="https://twitter.com/BrianNosek/status/1034093709971873794">social scientists have a pretty good idea which results are actually plausible</a>, and which are nonsense overestimates. But since we started with the question “is there an effect at all”, the large result we got <em>feels</em> like it confirms our original belief, even though it really doesn’t.</p>
<p>This specific combination is dangerous. The direction of the effect is reasonable and expected, so we accept the study as plausible. The size of the effect is shocking, which makes the study <em>interesting</em>, and gets news coverage and book deals and TED talks.</p>
<p>And this process repeats itself over and over, and the field builds up a huge library of incredible results that <a href="https://statmodeling.stat.columbia.edu/2017/12/15/piranha-problem-social-psychology-behavioral-economics-button-pushing-model-science-eats/">can’t possibly all be true</a>. Eventually the music stops, and there’s a crisis, and that’s where we are today. But it all starts somewhere reasonable: with people trying to prove something that is obviously true.</p>
<h3 id="so-how-is-math-different">So how is math different?</h3>
<p>This is exactly the situation we said math was in. Mathematicians have a pretty good idea of what results should be true, but so do psychologists! Mathematicians sometimes make mistakes, but since they’re mostly trying to prove true things, it all works out okay. Social scientists are also (generally) trying to prove true things, but it doesn’t work out nearly so well. Why not?</p>
<p>In math, a result that’s too good <em>looks</em> just as troubling as one that isn’t good enough. The idea of “<a href="https://en.wikipedia.org/wiki/Proving_too_much">proving too much</a>” is a core tool for reasoning about mathematical arguments. It’s common to critique a proposed proof with something like “if that argument worked, it would prove all numbers are even, and we know that’s wrong”. This happens at all levels of math, whether you’re in college taking Intro to Proofs, or vetting a high-profile attempt to solve a major open problem. <strong>We’re in the habit of checking whether a result is—literally!—too good to be true</strong>.</p>
<p style="text-align: center;"><img src="/assets/blog/replication-crisis-math/anti-gravity-cat.jpg" alt="Picture of a floating cat. 'damn anti-gravity cat always disproving ma theorem'" width="50%" /></p>
<p>We could bring a similar approach to social science research. Daniël Lakens <a href="http://daniellakens.blogspot.com/2017/07/impossibly-hungry-judges.html">uses this sort of argument</a> to critique a <a href="https://www.pnas.org/content/108/17/6889.short">famous study</a> on hunger and judicial decisions:</p>
<blockquote>
<p>I think we should dismiss this finding, simply because it is impossible. When we interpret how impossibly large the effect size is, anyone with even a modest understanding of psychology should be able to conclude that it is impossible that this data pattern is caused by a psychological mechanism. As psychologists, we shouldn’t teach or cite this finding, nor use it in policy decisions as an example of psychological bias in decision making.</p>
</blockquote>
<p>Other researchers have found <a href="https://mindhacks.com/2016/12/08/rational-judges-not-extraneous-factors-in-decisions/">specific problems with the study</a>, but Lakens’s point is that we could dismiss the result even before they did. If a proposed proof of Fermat’s last theorem also shows there are no solutions to \(a^2 + b^2 = c^2\), we know it’s <em>wrong</em>, even before we find the specific flaw in the argument. And if a study suggests humans aren’t capable of making reasoned decisions at 11:30 AM, it’s confounded by <em>something</em>, even if we don’t know what.</p>
<p>And yet, while I don’t believe in these studies, and I don’t believe their effect sizes, I still believe their basic claims. I believe that people make worse decisions when they’re hungry. (I know I do.) I believe standing in a power pose can make you feel stronger and more assertive. I believe that <a href="https://www.vox.com/2016/3/14/11219446/psychology-replication-crisis">exercising self-control can deplete your willpower</a>.</p>
<p>But as a mathematician, I’m forced to admit: we don’t have proof.</p>
<hr />
<p><em>Do you think we have a replication crisis in math? Disagree with me about the replication crisis? Think you make better decisions when you’re hungry? Tweet me <a href="https://twitter.com/profjaydaigle">@ProfJayDaigle</a> or leave a comment below.</em></p>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>I’m a big fan of the <a href="http://datacolada.org">Data Colada</a> project, and of <a href="https://statmodeling.stat.columbia.edu/2018/05/07/replication-crisis-centered-social-psychology/">Andrew Gelman’s writing</a> on the subject. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>In theory, all papers should include enough information that you can replicate all the experiments they describe. In practice, I think this basically never happens. There’s just too much information, and it’s hard to even guess which things are going to be important. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3">
<p>I’m sure every practitioner in every field says that, though, even years after the problems become obvious to anyone who looks. So take this with a grain of salt. <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
<li id="fn:4">
<p>A friend asks: if we mostly know what’s true already, why do we need to actually find the proofs? The bad answer is “you’re not doing math if you don’t prove things”. The good answer is that finding proofs is how we train this mysteriously good intuition; if we didn’t work out proofs in detail, we wouldn’t be able to make good guesses about the next steps. <a href="#fnref:4" class="reversefootnote">↩</a></p>
</li>
<li id="fn:5">
<p>I’m going to pick on Amy Cuddy and power posing a lot. That’s not entirely fair to Cuddy; the pattern I’m describing is extremely common and easy to fall into, and I could make the same argument about <a href="https://www.nature.com/articles/d41586-019-03755-2">social priming research</a> or the <a href="https://mindhacks.com/2016/12/08/rational-judges-not-extraneous-factors-in-decisions/">hungry judges study</a> or the dozens of others. (That’s why it’s a “replication crisis” and not a “this one researcher made a mistake one time crisis”.) But for simplicity I’m going to stick to the same example for most of this post. <a href="#fnref:5" class="reversefootnote">↩</a></p>
</li>
<li id="fn:6">
<p>Mathematicians love doing this. I’m a mathematician, so I love doing this. But it’s genuinely a useful way to think about what’s going on. <a href="#fnref:6" class="reversefootnote">↩</a></p>
</li>
<li id="fn:7">
<p>This is your regular reminder to stand up, stretch, and drink some water. <a href="#fnref:7" class="reversefootnote">↩</a></p>
</li>
<li id="fn:8">
<p>I originally tried to write a concise explanation to include here. It hit a thousand words and was nowhere near finished, so I decided to save it for later. Update: I have now posted the <a href="/blog/hypothesis-testing-part-1">first</a> and <a href="/blog/hypothesis-testing-part-2">second</a> essays in a three-part series on hypothesis testing. <a href="#fnref:8" class="reversefootnote">↩</a></p>
</li>
<li id="fn:9">
<p>This means we have to be really careful about interpreting studies that don’t find any effect. A study with low power will find “<a href="https://twitter.com/zeynep/status/1366175070507384836?lang=en">no evidence</a>” of an effect even if the effect is very real, and that can be <a href="https://twitter.com/CT_Bergstrom/status/1487491536010944512">just as misleading</a> as the errors I’m discussing in this essay.</p>
<p>More careful researchers will say they “fail to reject the null hypothesis” or “fail to find an effect”. If everyone were always that careful I wouldn’t need to write this essay. <a href="#fnref:9" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Jay DaigleThe replication crisis is a major problem in medicine and social science; we know that a huge fraction of the published literature is outright wrong. But in math we don't seem to have a similar crisis, despite reasonably frequent minor errors in published papers. Why not, and what can this tell us about the fields that are in crisis?Pascal’s Wager, Medicine, and the Limits of Formal Reasoning2021-11-28T00:00:00-08:002021-11-28T00:00:00-08:00https://jaydaigle.net/blog/pascalian-medicine<p>Scott Alexander at Astral Codex Ten has a good post recently thinking about what he calls <a href="https://astralcodexten.substack.com/p/pascalian-medicine">Pascalian Medicine</a>. As always the entire post is worth reading, but here’s an excerpt:</p>
<blockquote>
<p>Another way of looking at this is that I must think there’s a 25% chance Vitamin D works, and a 10% chance ivermectin does. Both substances are generally safe with few side effects. So (as many commenters brought up) there’s a <a href="https://en.wikipedia.org/wiki/Pascal%27s_wager">Pascal’s Wager</a> like argument that someone with COVID should take both. The downside is some mild inconvenience and cost (both drugs together probably cost $20 for a week-long course). The upside is a well-below-50% but still pretty substantial probability that they could save my life.</p>
</blockquote>
<blockquote>
<p>…</p>
</blockquote>
<blockquote>
<p>But why stop there? Sure, take twenty untested chemicals for COVID. But there are almost as many poorly-tested supplements that purport to treat depression. The cold! The flu! Diabetes! Some of these have known side effects, but others are about as safe as we can ever prove anything to be. Maybe we should be taking twenty untested supplements for every condition!</p>
</blockquote>
<p>Scott doesn’t seem to believe we should do this, but is trying to figure out the actual flaw in this reasoning. The most convincing argument he comes up with is based on how unreliable modern medical studies are, and how easy it is to generate spurious positive results.</p>
<blockquote>
<p>I think ivermectin doesn’t work. I think that it looks like it works, because it has lots of positive studies and a few big-name endorsements. But our current scientific method is so weak and error-prone that any chemical which gets raised to researchers’ attentions and studied in depth will get approximately this amount of positive results and buzz. Look through the thirty different chemicals featured on the sidebar of the ivmmeta site if you don’t believe me.</p>
</blockquote>
<blockquote>
<p>…</p>
</blockquote>
<blockquote>
<p>Probably what I’m doing wrong here is saying that ivermectin having some decent studies raises its probability of working to 5%. I should just say 0.1% or 0.01% or whatever my prior on a randomly-selected medication treating a randomly-selected disease is (higher than you’d think, based on the argument from antibiotics).</p>
</blockquote>
<blockquote>
<p>From the Outside View, this argument seems strong. From the Inside View, I have a lot of trouble looking at a bunch of studies apparently supporting a thing, and no contrary evidence against the thing besides my own skepticism, and saying there’s a less than 1% chance that thing is true.</p>
</blockquote>
<p>The <a href="https://www.lesswrong.com/tag/inside-outside-view">Outside View</a> argument here is <em>completely right</em>, and is a great illustration of the limitations of Bayesian reasoning that I talked about <a href="/blog/paradigms-and-priors/#anomalies-and-bayes">here</a> and <a href="https://jaydaigle.net/blog/overview-of-bayesian-inference/">here</a>.</p>
<h3 id="unknown-unknowns">Unknown Unknowns</h3>
<p>The basic argument for Pascalian medicine goes: okay, suppose ivermectin has a 10% chance of reducing covid mortality by 10%. About a thousand people are dying of covid every <del>week</del> day<strong title="I originally misread the CDC page and interpreted the weekly average of daily numbers as weekly numbers. I've edited the piece throughout to reflect the true numbers, but it doesn't change any of the conclusions, since the same error happened to every rate I discussed in the piece."><sup id="fnref:edit"><a href="#fn:edit" class="footnote">1</a></sup></strong> in the US <a href="https://www.cdc.gov/coronavirus/2019-ncov/covid-data/covidview/index.html">according to the CDC weekly tracker</a>, so the expected benefit of giving all our covid patients ivermectin is something like saving ten lives per day.<strong title="There would also be benefits from fewer people being hospitalized, fewer people suffering long-term health consequences, fewer people being miserable and bedridden for a week, etc. I'm going to talk about deaths pretty exclusively because it's easier to talk about just one number."><sup id="fnref:1"><a href="#fn:1" class="footnote">2</a></sup></strong></p>
<p>Even if you think the probability ivermectin works is only something like 1%, that still adds up to one life saved per day. Since ivermectin is cheap, and “generally safe with few side effects”, an expected value of “saves one life per day” looks pretty good! So maybe we should prescribe it out of an abundance of caution.<strong title="This is very different from claims that ivermectin is a miracle cure, and we should take that instead of getting vaccinated. Ivermectin is at best mildly beneficial; vaccines are safe and effective and you should get a booster shot if you haven't already. We're talking about whether the small possibility of a minor benefit from ivermectin makes it worth taking."><sup id="fnref:2"><a href="#fn:2" class="footnote">3</a></sup></strong></p>
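<p>The expected-value arithmetic driving this argument is simple enough to lay out explicitly. This is a sketch using the illustrative numbers above, which are stipulated for the sake of the argument, not real estimates:</p>

```python
DEATHS_PER_DAY = 1000  # rough US covid deaths per day, per the CDC tracker

def expected_lives_saved(p_drug_works, mortality_reduction,
                         deaths=DEATHS_PER_DAY):
    """Expected daily lives saved if every covid patient takes the drug."""
    return p_drug_works * mortality_reduction * deaths

# 10% chance of a 10% mortality reduction: about ten lives a day.
print(expected_lives_saved(0.10, 0.10))
# Even at a 1% chance of working, that still adds up to about one life a day.
print(expected_lives_saved(0.01, 0.10))
```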
<p>And then we make the same argument about, apparently, twenty other drugs, and we’re taking a crazy drug cocktail. (Scott calls this the Insanity Wolf position.) So it looks like something has gone wrong. But what?</p>
<p style="text-align: center;"><img src="/assets/blog/pascalian/insanity_wolf.jpeg" alt="Insanity Wolf meme: 'TAKE EVERY MEDICATION ALL THE TIME. BECOME INFINITELY HEALTHY, LIVE FOREVER'" /></p>
<p>We made a basic, common error that really isn’t fully avoidable: we took a bunch of stuff we can’t measure, and decided it didn’t matter. “Generally safe with few side effects” isn’t the same as “perfectly safe”, and “cheap” isn’t the same as “free”. And something like ninety thousand people get covid in the US every day; to save that one life we’re probably giving drugs to tens of thousands of people. How confident are we that our drugs won’t hurt any of them? Especially if we give an Insanity Wolf-style twenty-drug cocktail?</p>
<p>Scott discusses this idea, of course. But I think he seriously underestimates the problem of unknown unknowns here. For well-understood drugs with large probable benefits, the unknown unknowns don’t matter very much. But for long-shot possible payoffs, like with ivermectin, unknown unknowns present a real, unavoidable problem. And the theoretically, mathematically correct response is to throw up our hands and take the Outside View instead.</p>
<h3 id="three-example-drugs">Three Example Drugs</h3>
<p>I want to take a look at three different drugs and do some illustrative calculations for the possible risks and benefits.</p>
<h5 id="paxlovid">Paxlovid</h5>
<p>There are always unknown unknowns, but in many cases we can put bounds on how good, or bad, things can be. <a href="https://en.wikipedia.org/wiki/PF-07321332">Paxlovid</a>, Pfizer’s new antiviral pill, provides a good example of this reasoning. In trials, Paxlovid <a href="https://www.pfizer.com/news/press-release/press-release-detail/pfizers-novel-covid-19-oral-antiviral-treatment-candidate">cut covid hospitalizations and deaths by about 90%</a>.<strong title="These numbers are reported a little weirdly. Looking at the study, it seems like Paxlovid cut hospitalizations by 85%, from 41/612 to 6/607; it cut deaths by 100% from 10/612 to 0/607. I think the 90% figure is the extent to which it cut (hospitalizations plus deaths), since that math checks out, but that's a slightly weird metric to judge by."><sup id="fnref:3"><a href="#fn:3" class="footnote">4</a></sup></strong> Let’s assume that’s a wildly optimistic overestimate, and give it a 50% chance of cutting deaths by 50%. Then in expectation that’s going to save a couple hundred lives each day.</p>
<p>What are the risks? This is a new drug so it’s hard to know what they are; all we know is that (1) Pfizer didn’t expect the side effects to be too bad, based on prior knowledge of this drug class, and (2) they didn’t notice anything too dramatic in the trial they ran. That doesn’t tell us how bad the side effects are, but it does put limits on them: if Paxlovid killed 1% of the people who took it, we’d know.</p>
<p>But suppose Paxlovid kills .1% of everyone who takes it. That’s about as high as it could go without us probably having noticed already, since the trial administered it to about 600 people and none of them died. (And realistically if it killed .1% of people, way more than that would have severe side effects and we probably would have noticed.) If we give Paxlovid to everyone in the US who gets covid, that’s about 90,000 people a day, and Paxlovid would kill 90 people a day. And that’s less than the couple hundred lives it would save.</p>
<p>Now, all of these numbers are <em>extremely handwavy</em>. But I chose them to make Paxlovid look as bad as reasonably possible, and it still comes out looking pretty good. My estimate of the benefit of Paxlovid was a huge lowball; it’s probably going to save closer to 800 lives a day than 200 if we manage to give it to everybody. And on the other hand, I’d be shocked if it’s anywhere <em>near</em> as dangerous as I assumed in the last paragraph. Sure, there’s some minuscule chance that it’s really, really dangerous but only several years after you take it, but since that’s not how these drugs usually work we can round that off to zero.</p>
<p>The benefit of Paxlovid is large enough that it outweighs any vaguely reasonable estimate of the costs. And we don’t need any especially fancy calculations to see that.</p>
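<p>Putting the deliberately pessimistic numbers from the last few paragraphs side by side makes the comparison stark. All inputs here are the rough guesses made above, not trial data:</p>

```python
CASES_PER_DAY = 90_000   # rough daily US covid cases
DEATHS_PER_DAY = 1000    # rough daily US covid deaths

# Pessimistic benefit: only a 50% chance of cutting deaths by 50%.
expected_saved = 0.5 * 0.5 * DEATHS_PER_DAY   # a couple hundred lives/day

# Pessimistic harm: kills 0.1% of everyone treated, about the worst rate
# that could hide in a ~600-person trial with zero deaths.
expected_killed = 0.001 * CASES_PER_DAY       # 90 lives/day

# Even with every number slanted against it, Paxlovid comes out ahead.
print(expected_saved, expected_killed)
```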
<p style="text-align: center;"><img src="https://imgs.xkcd.com/comics/statistics.png" alt="https://xkcd.com/2400 Statistics. 'Statistics tip: always try to get data that's good enough that you don't need to do statistics on it.'" /></p>
<p style="text-align: center"><em>We could make basically the same argument about vaccines, except the worst plausible numbers look even better than for Paxlovid.</em></p>
<h5 id="tylenol">Tylenol</h5>
<p>We can run a similar analysis with common everyday drugs like Tylenol. Scott observes that “We don’t fret over the unknown unknowns of Benadryl or Tylenol or whatever, even though we know their benefits are minor.” But by the same token, we are also reasonably confident that the unknown unknown costs of those drugs are minor. If Tylenol killed .1% of patients who took it, or even .01%, <em>we would know</em>. (And in fact we know Tylenol can cause liver damage, and that is a thing we very much do fret over.) Sure, unknown harms always could exist. But in this case we can be pretty confident that they have to be really small.</p>
<p>Apparently a new potentially deadly side effect of Tylenol was discovered in 2013. If I’m reading the FDA report correctly, they believe that <a href="https://www.fda.gov/drugs/drug-safety-and-availability/fda-drug-safety-communication-fda-warns-rare-serious-skin-reactions-pain-relieverfever-reducer">one person has died</a> from this side effect since 1969. That’s the scale of side effect that can slip under the radar for a drug as widely taken and studied as Tylenol.</p>
<p>Tylenol could have unknown unknowns, but they won’t be <em>very</em> unknown.</p>
<h5 id="back-to-ivermectin">Back to Ivermectin</h5>
<p>Now compare this with the ivermectin situation. Let’s suppose we give ivermectin a 10% chance of being effective, with a benefit of reducing deaths by 20%. (The Together trial has a non-significant effect of about 10%, so let’s double that.) Then in expectation we’re saving like 2% of lives a day, which is 20 lives saved if we give it to everyone.</p>
<p>How many people would ivermectin have to kill to net out negative? If we give it to 90,000 people every day, then 20 is about .02%. So does ivermectin kill about .02% of the people who take it? My guess is, probably not. But that seems a lot more within the realm of “maybe, it’s hard to be sure”.</p>
<p>We also reach the point where a lot of our ass-pull assumptions start to really matter. We said “maybe ivermectin has a 10% chance of working”. Scott’s the expert, not me, but that seems high to me. (Do you really think that one in ten drugs with vague but mildly promising data in preliminary trials pans out?) If we say ivermectin has a 1% chance of reducing deaths by 20%, then our expected value is two lives per day.</p>
<p>This could still pencil out as a good trade, but with benefits so small (and uncertain) it could easily not be worth it. Especially if we account for the guaranteed annoyance of taking a pill and the common minor side effects we know ivermectin has.</p>
<h3 id="the-problem-with-made-up-numbers">The Problem with Made-Up Numbers</h3>
<p>But the larger point here is that <em>all this math is bullshit</em>. Are the odds of ivermectin working 10%? 1%? .01%? Where did that number come from? What do we mean by “working”—is it a 5% improvement? A 50% improvement?<strong title="There are systematic ways of estimating this, but they would all require numbers for 'how inflated do you expect non-significant effect sizes in published studies to be?' If you spend a lot of time with the medical literature you might have a number to put here; I don't."><sup id="fnref:4"><a href="#fn:4" class="footnote">5</a></sup></strong> And at the same time, I don’t have real odds for “negative side effects”, which covers a lot of ground. (Scott himself points out that the odds of ivermectin unexpectedly killing you are definitely not zero.) And all this is the simple version of the calculation, where we don’t try to weigh things like “fever from covid might last one day less?” versus “ivermectin can cause fever?”</p>
<p>Scott argued many years ago that <a href="https://slatestarcodex.com/2013/05/02/if-its-worth-doing-its-worth-doing-with-made-up-statistics/">if it’s worth doing, it’s worth doing with made-up statistics</a>. And I don’t really disagree with that essay. Doing experimental calculations with made-up numbers can give us information, and I certainly think the analysis of Paxlovid that I did above tells us something useful. But to learn anything from these calculations, we need our made-up numbers to at least vaguely reflect reality.</p>
<p>Scott wrote:</p>
<blockquote>
<p>Remember the <a href="http://yudkowsky.net/rational/bayes">Bayes mammogram problem</a>? The correct answer is 7.8%; most doctors (and others) intuitively feel like the answer should be about 80%. So doctors – who are specifically trained in having good intuitive judgment about diseases – are wrong by an order of magnitude….But suppose some doctor’s internet is down (you have NO IDEA how much doctors secretly rely on the Internet) and she can’t remember the prevalence of breast cancer. If the doctor thinks her guess will be off by less than an order of magnitude, then making up a number and plugging it into Bayes will be more accurate than just using a gut feeling about how likely the test is to work.</p>
</blockquote>
<p>And this is right, but the caveat at the end is critical. If you have a good estimate of the prevalence of breast cancer, and a bad estimate of the chance of a false positive, then you can use the first number to get a better estimate of the second. But if you have a really good idea of the false positive rate (maybe you’ve seen thousands of positive results and learned which ones turned out to be false positives), but a shaky idea of the prevalence of breast cancer (hell, I have no idea how likely some lump is to be cancerous), you’ll be better off going with your intuition for how accurate the test is—and using that to estimate breast cancer prevalence!</p>
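<p>For concreteness, here is Bayes’ rule applied with the standard mammogram-problem numbers (1% prevalence, 80% sensitivity, 9.6% false-positive rate), which is where the 7.8% figure comes from:</p>

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(disease | positive test), by Bayes' rule."""
    true_positives = prior * sensitivity
    false_positives = (1 - prior) * false_positive_rate
    return true_positives / (true_positives + false_positives)

# Standard mammogram-problem inputs: 1% prevalence, 80% sensitivity,
# 9.6% false-positive rate.
print(f"{posterior(0.01, 0.80, 0.096):.1%}")  # 7.8%, not the intuitive ~80%
```

<p>The calculation is only as good as its weakest input: nudge the prior from 1% to 10% and the posterior jumps to nearly half, which is exactly why a bad prevalence estimate swamps a good false-positive estimate.</p>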
<p>Scott says that “varying the value of the “unknown unknowns” term until it says whatever justifies our pre-existing intuitions is the coward’s way out.” And this is one of the rare cases where I think he’s completely, unequivocally wrong. This isn’t the coward’s way out; it’s the only thing we can possibly do.</p>
<h3 id="reflective-equilibrium">Reflective Equilibrium</h3>
<p>If you find a convincing argument that generates an unlikely conclusion, you can accept the unlikely conclusion, you can decide that the premises of the argument were flawed, <em>or</em> you can decide the argument itself doesn’t work. If I collect some data, do some statistics, and calculate that taking Tylenol will cut my lifespan by thirty years, I don’t immediately throw away all my Tylenol—I look for where I screwed up my math. And that’s the correct, and rational, response.</p>
<p>If you think A is true and B is false, and find an argument that A implies B, you have three choices: you can decide A is false after all; you can decide B is true after all; or you can decide that the argument actually isn’t valid. Or you can adopt some probabilistic combination: it’s perfectly consistent to believe A is 60% likely to be true, B 60% likely to be false, and the argument 60% likely to be correct. But fundamentally you have to make a choice about which of the three pieces to adjust, and by how much.<strong title="David Chapman calls this [meta-rational reasoning](https://twitter.com/Meaningness/status/1463632030059544576). I see where he's coming from but think that's an unnecessarily complex and provocative way of talking about it."><sup id="fnref:5"><a href="#fn:5" class="footnote">6</a></sup></strong></p>
<p style="text-align: center;"><img src="/assets/blog/pascalian/two-answers.jpg" alt="Picture of kitten raising two paws: "i has two ansers. which you want?"" /></p>
<p>In the case of ivermectin, we have some data from some studies. We have an Inside View argument that, based on expected values computed from that data, taking ivermectin is probably worth it. And we have the Outside View argument that taking random long-shot drugs is not a great idea. And we have to reconcile these somehow.</p>
<p>First, we could reject, or disbelieve, the data. And we totally did that: a bunch of ivermectin studies are fraudulent or incompetent, and Scott <a href="https://astralcodexten.substack.com/p/ivermectin-much-more-than-you-wanted">argues pretty convincingly</a> that some of the honest, competent studies are really picking up the benefits of killing off intestinal parasites. But even after doing that, we’re left with the Pascalian argument: ivermectin probably doesn’t work, but it might, and the costs of taking it are low, so we might as well. Do we listen to that argument, or to our gut belief that this can’t be a good idea?</p>
<p>A common trap that smart, math-oriented people fall into is thinking that the argument with numbers and calculations must be the better one. The Inside View argument did some math, and multiplied some percentages, and came up with an expected value; the Outside View argument comes from a fuzzy intuitive sense that medicine Doesn’t Work That Way. So the mathy argument should win out.</p>
<p style="text-align: center;"><img src="/assets/blog/pascalian/peanuts-opinion.gif" alt="Peanuts comic. "How are you doing in school these days, Charlie Brown?"
"Oh, fairly well, I guess...I'm having most of my trouble in arithmetic.."
"I should think you'd like arithmetic...it's a very precise subject.."
"That's just the trouble. I'm at my best in something where the answers are mostly a matter of opinion!"" width="100%" /></p>
<p>But in this case, we were doing calculations with numbers that were, you might remember, completely made up. Sure, the Outside View argument reflects a fuzzy intuitive sense of whether a random potential cure is likely to help us. The Inside View argument, on the other hand, reflects a fuzzy intuitive sense of whether ivermectin is likely to protect us from covid.</p>
<p>The only real difference is that we took the second fuzzy intuition, put a fuzzy number on it, and plugged it into some cost-benefit analysis formulas. And no matter what fancy formulas we use, they can never make our starting numbers <em>less</em> fuzzy. Given the choice between a fuzzy intuition, and an equally fuzzy intuition that we’ve done math to, I’m inclined to trust the first one. With fewer steps, there are fewer ways to screw up.</p>
<h3 id="finding-the-error">Finding the Error</h3>
<p>At this point I think we’ve reached roughly Scott’s position at the end of his essay. The Outside View argument is winning out in practice, but we haven’t articulated any specific problems with the Inside View argument. And this is uncomfortable, because <em>they can’t both be right</em>. We can say it’s more likely we screwed up the more complicated, mathier argument. But <em>how</em> did we screw it up?</p>
<p>And on reflection, the answer is that we’re confusing two different arguments. I think that “Sure, go ahead and take ivermectin, it probably won’t help but it might, and it probably won’t hurt either” is a pretty reasonable position, and was even more reasonable six months ago, when we knew less than we do now.<strong title="Again, "Ivermectin is a miracle cure, take that instead of getting vaccinated" is, in fact, a completely and totally nonsense position. And many public "ivermectin advocates" are saying that, and they are wrong. But that's not what we're talking about here."><sup id="fnref:6"><a href="#fn:6" class="footnote">7</a></sup></strong></p>
<p>I know a bunch of people who take Vitamin C, even though it’s not clear that accomplishes anything. I myself flip-flop between taking a multivitamin because it seems like it might make me healthier, and not taking a multivitamin because there’s no real evidence that it does. Taking ivermectin in case it’s helpful doesn’t really seem that different.</p>
<p>No, the crazy position is when we go full Insanity Wolf and take twenty different long-shot cures at once. <em>That</em> was the conclusion that seemed like it couldn’t possibly hold up, at least for me. And that’s <em>also</em> the point where it really does seem like the unknown unknowns start piling up. There are twenty different drugs that could all possibly cause negative side effects. There are 190 potential two-drug interactions and over a thousand potential three-drug interactions, and even if interactions are, in Scott’s words, “rarer than laypeople think”, that seems like a lot of room for something weird to happen.</p>
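<p>Those interaction counts are just binomial coefficients, and a one-line sanity check confirms them:</p>

```python
from math import comb

# Ways to choose pairs and triples from twenty drugs.
print(comb(20, 2))  # 190 possible two-drug interactions
print(comb(20, 3))  # 1140 possible three-drug interactions
```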
<p>So this is how we screwed up. We said these drugs are cheap and generally safe. But in order to make our math reasonable, we rounded “generally safe” down to “safe”, and ignored the risks entirely. As long as the risks are small enough, that works fine; but at some point we cross a threshold where we can’t just ignore all the downsides when doing our calculations.</p>
<p>Is taking twenty drugs over that threshold? I don’t know, but it seems likely. Taking that many drugs <em>probably</em> won’t hurt you, but it might! And it will definitely be expensive and annoying, and a lot of those drugs have common mild-but-unpleasant side effects. And the potential benefits are relatively small, and relatively unlikely; it’s easy for them to be swamped by all these downsides.</p>
<p>But now we’re talking about the interaction of hundreds of numbers that are both small and uncertain. We can’t get away with ignoring the risks, but we can’t realistically quantify them either. All we can do is make some half-assed guesses, and our conclusions will change a lot depending on exactly which guesses we make. So we <a href="https://twitter.com/ProfJayDaigle/status/1463598150585888775">can’t do a useful Inside View calculation at all</a>. Instead we’re basically forced to rely on the Outside View argument: taking twenty pills every day that probably don’t even work seems kinda dumb.</p>
<p>But then why take ivermectin specifically, rather than Vitamin D or curcumin or some other possible treatment? I dunno. You’re buying a long-shot lottery ticket. Pick your favorite number and hope it pays out.</p>
<h3 id="the-takeaway">The Takeaway</h3>
<p>A back-of-the-envelope cost-benefit analysis tells us that taking ivermectin for covid might have positive expected value. If we follow that logic to its conclusion, we wind up taking twenty different supplements and this seems like it can’t be wise.</p>
<p>A blinkered view of rationality tells us to ignore our intuition and follow the math. A more expansive view realizes that if the numbers we’re plugging into our cost-benefit analysis are shakier than that intuition, then we should take the intuition seriously. Cost-benefit analyses and other “mathematically rational” tools are only as good as the numbers and arguments that we bring to them.</p>
<p>But even with shaky numbers, we can learn things from comparing our intuitions with the result of our calculations. Figuring out <em>why</em> we get two different answers can teach us a lot about our reasoning, and help us figure out where we went wrong. Taking the full Insanity Wolf cocktail really seems qualitatively different from picking your favorite long-shot drug, but the way we set up our math hid that from us.</p>
<p>Finally: please get vaccinated, and get your booster shot. And if you have a choice between Paxlovid and ivermectin, you should probably take the Paxlovid.</p>
<hr />
<p><em>Questions about cost-benefit analysis, or where the math breaks down? Do you know something I missed? Tweet me <a href="https://twitter.com/profjaydaigle">@ProfJayDaigle</a> or leave a comment below.</em></p>
<div class="footnotes">
<ol>
<li id="fn:edit">
<p>I originally misread the CDC page and interpreted the weekly average of daily numbers as weekly numbers. I’ve edited the piece throughout to reflect the true numbers, but it doesn’t change any of the conclusions, since the same error happened to every rate I discussed in the piece. <a href="#fnref:edit" class="reversefootnote">↩</a></p>
</li>
<li id="fn:1">
<p>There would also be benefits from fewer people being hospitalized, fewer people suffering long-term health consequences, fewer people being miserable and bedridden for a week, etc. I’m going to talk about deaths pretty exclusively because it’s easier to talk about just one number. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>This is very different from claims that ivermectin is a miracle cure, and we should take that instead of getting vaccinated. Ivermectin is at best mildly beneficial; vaccines are safe and effective and you should get a booster shot if you haven’t already. We’re talking about whether the small possibility of a minor benefit from ivermectin makes it worth taking. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3">
<p>These numbers are reported a little weirdly. Looking at the study, it seems like Paxlovid cut hospitalizations by 85%, from 41/612 to 6/607; it cut deaths by 100% from 10/612 to 0/607. I think the 90% figure is the extent to which it cut (hospitalizations plus deaths), since that math checks out, but that’s a slightly weird metric to judge by. <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
<li id="fn:4">
<p>There are systematic ways of estimating this, but they would all require numbers for “how inflated do you expect non-significant effect sizes in published studies to be?” If you spend a lot of time with the medical literature you might have a number to put here; I don’t. <a href="#fnref:4" class="reversefootnote">↩</a></p>
</li>
<li id="fn:5">
<p>David Chapman calls this <a href="https://twitter.com/Meaningness/status/1463632030059544576">meta-rational reasoning</a>. I see where he’s coming from but think that’s an unnecessarily complex and provocative way of talking about it. <a href="#fnref:5" class="reversefootnote">↩</a></p>
</li>
<li id="fn:6">
<p>Again, “Ivermectin is a miracle cure, take that instead of getting vaccinated” is, in fact, a completely and totally nonsense position. And many public “ivermectin advocates” are saying that, and they are wrong. But that’s not what we’re talking about here. <a href="#fnref:6" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Jay DaigleA back-of-the-envelope cost-benefit analysis tells us that taking ivermectin for covid might have positive expected value. If we follow that logic to its conclusion, we wind up taking twenty different supplements and this seems like it can't be wise. Resolving this apparent conflict exposes some of the deep flaws in how we often think about rationality and Bayesian reasoning. A response to a piece by Scott Alexander at Astral Codex Ten.More Thoughts on the Axiom of Choice2021-07-28T00:00:00-07:002021-07-28T00:00:00-07:00https://jaydaigle.net/blog/more-on-the-axiom-of-choice<p>I got a lot of good, interesting comments on my recent <a href="https://jaydaigle.net/blog/what-is-the-axiom-of-choice/">post on the axiom of choice</a> (both on the post itself, and in this <a href="https://news.ycombinator.com/item?id=27836406">very good Hacker News thread</a>). I wanted to answer some common questions and share the most interesting thing I learned.</p>
<h3 id="cant-we-just-pick-at-random">Can’t we just pick at random?</h3>
<p>A lot of people asked why we can’t just avoid the whole problem of the axiom of choice by picking set elements randomly. Because obviously we can just make a bunch of random choices, right? If there’s no limit to what the choices have to look like then there’s no problem.</p>
<p>If you believe that, then you believe the axiom of choice. “We can pick some element from each set, without being fussy about which one we get” is just what the axiom of choice says. And that’s fine. A lot of people believe the axiom of choice! But it’s not an alternative to the axiom of choice; it is the axiom of choice.</p>
<p>The fact that this “just pick at random” idea seems so facially compelling, or “obvious”, is a big part of why many mathematicians want to accept the axiom of choice. It just seems like we should be able to make a bunch of choices at once, if we’re not picky about which choices we make. It’s only when people are shown the really bizarre implications of getting to make those choices that they start questioning whether the axiom makes sense.</p>
<h3 id="why-do-we-want-to-believe-the-axiom-of-choice">Why do we want to believe the axiom of choice?</h3>
<p>Another recurring question asked why we <em>should</em> want to believe the axiom of choice. It has a lot of bizarre consequences. In the last post I argued that those consequences aren’t as troubling as they seem, but they’re still weird. Why can’t we just dumpster the axiom of choice and avoid all of them?</p>
<p>One reason is the intuitive plausibility of the “just pick at random” idea. The goal of an axiomatic system is to formalize our list of “basic moves we should be able to make”. The ZF axioms include things like the <a href="https://en.wikipedia.org/wiki/Axiom_of_extensionality">axiom of extensionality</a>, which says that two sets are equal if they have the same elements, and the <a href="https://en.wikipedia.org/wiki/Axiom_of_pairing">axiom of pairing</a>, which says that if \(A\) and \(B\) are sets then we can talk about the set \( \{A, B\} \). These aren’t weird exotic ideas. They’re just things we should be able to do with collections of things. They’re part of the intuition that the word “set” is trying to formalize.</p>
<p>You could see the axiom of choice as something like this—something in our basic, intuitive understanding of what a “set” is, that pre-exists formal definitions. It’s pretty easy to convince people that “choose an element from each set” is a reasonable thing to be able to do. The only problem is that it leads to absurd results like Banach-Tarski or the solution to the Infinite Hats puzzle. But if we satisfy ourselves that those absurdities aren’t a real problem, we return to “this seems like a thing we should be able to do”.</p>
<h3 id="but-really-why-do-we-want-to-believe-the-axiom-of-choice">But really, why do we <em>want</em> to believe the axiom of choice?</h3>
<p>On the other hand, that’s not a very strong reason to really care about the axiom of choice. At best, that leaves us at “why shouldn’t we, it doesn’t hurt anything”, which could just as easily be “why should we, it doesn’t help?” We <em>care</em> about the axiom of choice, and put up with the peripheral weirdness, because it lets us prove a <a href="https://en.wikipedia.org/wiki/Axiom_of_choice#Weaker_forms">variety of other results we care about</a>. These include:</p>
<ul>
<li>Every Hilbert space has an orthonormal basis (so we can put coordinates on function spaces);</li>
<li>Every field has an algebraic closure (very important in number theory—in my research I often wanted to talk about “the algebraic closure” of some large field, and that implicitly relies on the axiom of choice);</li>
<li>The union of countably many countable sets is countable;</li>
<li><a href="https://en.wikipedia.org/wiki/Hahn%E2%80%93Banach_theorem">The Hahn-Banach theorem</a> (lets us extend linear functionals and guarantees that dual spaces are “interesting”);</li>
<li><a href="https://en.wikipedia.org/wiki/G%C3%B6del's_completeness_theorem">Gödel’s completeness theorem</a> for first-order logic;</li>
<li><a href="https://en.wikipedia.org/wiki/Baire_category_theorem">The Baire category theorem</a>, which I don’t even want to try to summarize but which shows up constantly in functional analysis.</li>
</ul>
<p>All of these results are really useful in their respective fields, and we need the axiom of choice to prove them. And that’s a true “need”: these are all provable from ZFC but not from ZF.</p>
<p>These statements aren’t equivalent to the axiom of choice. If we wanted, we could take the above list as a list of new <em>axioms</em> to attach to ZF, and then we wouldn’t be stuck with choice. But that is a really strange and ad-hoc list of foundational axioms. It feels much better to take the one axiom—the axiom of choice, which is reasonably foundational and sounds plausible enough on its own—and get all these consequences for free.</p>
<h3 id="shoenfields-theorem-you-only-need-the-axiom-of-choice-for-weird-things">Shoenfield’s Theorem: You only need the axiom of choice for weird things</h3>
<p>But the coolest thing I learned about after writing the last post is <a href="https://en.wikipedia.org/wiki/Absoluteness#Shoenfield's_absoluteness_theorem">Shoenfield’s Absoluteness Theorem</a>. The statement of this theorem is pretty dense and I don’t think I completely understand it, but it has really nice implications for the axiom of choice.</p>
<p>In the last post I said that the axiom of choice just doesn’t cause problems as long as we’re not getting too far away from finite sets. This applies even to half the results in the previous section.</p>
<ul>
<li>We need the axiom of choice to show that <em>every</em> field has an algebraic closure, but not to show that the rationals do.</li>
<li>We need the axiom of choice to show that <em>every</em> Hilbert space has an orthonormal basis, but not to show that Fourier theory gives an orthonormal basis for \(L^2([-\pi,\pi])\).</li>
<li>We need the axiom of choice to prove the Baire Category Theorem for every complete metric space, but not to prove it for the real numbers or the real function space \(L^2(\mathbb{R}^n)\).</li>
</ul>
<p>Shoenfield’s theorem helps tell us exactly when the axiom of choice is actually going to matter.</p>
<p>In the last post we talked about <em>models</em> of the ZF axioms, which are collections of sets that obey all the rules. Given a model, Kurt Gödel defined something called the <a href="https://en.wikipedia.org/wiki/Constructible_universe">constructible universe</a>, which is a sort of smaller model, contained in the original model, which can be built up explicitly from smaller pieces. The constructible universe usually doesn’t contain everything in the original model, but it will in some sense contain all the simple explicitly describable things in the original model.</p>
<p>But the constructible universe has some extra nice properties. One is that the constructible universe will always satisfy the axiom of choice, even if the original model did not!<strong title="This is how Gödel proved that the axiom of choice must be consistent with the ZF axioms: the constructible universe gives us a model of ZF that also satisfies the axiom of choice."><sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup></strong> Specifically, since we construct the universe in a specific <em>order</em>, everything we’ve constructed can be <a href="/blog/what-is-the-axiom-of-choice/#well-ordering">well-ordered</a>, which implies the axiom of choice. So any theorem that relies on the axiom of choice is automatically true as long as we’re only talking about sets in the constructible universe.</p>
<p>Shoenfield’s theorem extends that result even further. If you have a sufficiently simple question (for a <a href="https://en.wikipedia.org/wiki/Analytical_hierarchy">precise definition of sufficiently simple</a>), then the original model and the constructible universe must give the same answer. Since the axiom of choice always holds in the constructible universe, the answers to these simple questions can’t depend on whether you accept the axiom of choice or not.</p>
<p>What does that mean? Any simple-enough result that you can prove with the axiom of choice, you can also prove without it. That includes everything about Peano arithmetic and basic number theory, and also everything about the <a href="https://news.ycombinator.com/item?id=27855515">correctness of explicit computable algorithms</a>. It also includes <a href="https://en.wikipedia.org/wiki/Axiom_of_choice#cite_ref-16">\(P = NP\) and the Riemann Hypothesis</a>, and a number of other major unsolved problems.</p>
<p>There are questions that the axiom of choice really does matter for. But Gödel and Shoenfield’s results show that they have to be pretty far removed from anything finite or concretely constructible. So in practice, we can use the axiom of choice as a tool to make our work simpler, knowing that it won’t screw up anything practical that really matters.</p>
<hr />
<p><em>Do you have other questions about the axiom of choice? Another cool fact I don’t know about? Or some other math topic you’d like me to explain? Tweet me <a href="https://twitter.com/profjaydaigle">@ProfJayDaigle</a> or leave a comment below.</em></p>
<hr />
<div class="footnotes">
<ol>
<li id="fn:1">
<p>This is how Gödel proved that the axiom of choice must be consistent with the ZF axioms: the constructible universe gives us a model of ZF that also satisfies the axiom of choice. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Jay DaigleI got a lot of good, interesting comments on my recent post on the axiom of choice (both on the post itself, and in this very good Hacker News thread). I wanted to answer some common questions and share the most interesting thing I learned.What is the Axiom of Choice?2021-07-14T00:00:00-07:002021-07-14T00:00:00-07:00https://jaydaigle.net/blog/what-is-the-axiom-of-choice<p>One of the easiest ways to start a (friendly) fight in a group of mathematicians is to bring up the <a href="https://en.wikipedia.org/wiki/Axiom_of_choice">axiom of choice</a>. This axiom has a really interesting place in the foundations of mathematics, and I wanted to see if I can explain what it means and why it’s controversial. As a bonus, we’ll get some insight into what an axiom <em>is</em> and how to think about them, and about how we use math to think about the actual world.</p>
<p style="text-align: center;"><a href="https://xkcd.com/982"><img src="https://imgs.xkcd.com/comics/set_theory.png" alt="xkcd 982: "The axiom of choice allows you to select one element from each set in a collection—and have it executed as an example to the others"" /></a></p>
<p>The axiom seems pretty simple at first:</p>
<blockquote>
<p><strong>Axiom of Choice:</strong> Given a collection of (non-empty) sets, we can choose one element from each set.<strong title="We can be more formal by phrasing this in terms of _choice functions_: given a collection of sets X = {A} there is a function f : X \to ⋃ A such that f(A) ∈ A for each A ∈ X. But I want to keep the discussion as readable as possible if you're not comfortable with the language of formal set theory."><sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup></strong></p>
</blockquote>
<p>Most people find this principle pretty inoffensive, or even obviously right, on first contact. But it’s extremely controversial and produces strong emotions; and unusually for a mathematical debate, there’s essentially no hope of a clear resolution. And I want to try to explain why.</p>
<h3 id="easy-choices">Easy choices</h3>
<p>One reason the axiom of choice can <em>sound</em> trivial is that there are a lot of superficially similar rules that are totally fine; the controversial bit is subtle. So here are a few things that don’t cause controversy:</p>
<ul>
<li>If we have one set, we can definitely pick an element from it. The axiom of choice says if we have a collection of sets, we can pick one element from each set simultaneously.</li>
<li>
<p>But if we can pick an element from one set, can’t we pick an element from the first set, and then the second set, and then the third, etc.? Eventually we’ll pick an element from each set.</p>
<p>This works if we only have a <em>finite</em> collection of sets. So if I have five sets, I can pick one element from each set, by picking an element from the first set, then the second set, then the third, then the fourth, then the fifth. This is sometimes known as the <strong>axiom of finite choice</strong>. And no one argues about this.</p>
<p>But that approach doesn’t work if we have infinitely many sets.<strong title="Using this sort of process on an infinite set is called transfinite induction. Transfinite induction can sometimes allow us to make choices without the axiom, but only if we can put our sets in some order. Conversely, the axiom of choice allows us to use transfinite induction in cases we otherwise couldn't. (Corrected from earlier version; thanks to Sniffnoy for the correction)"><sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup></strong> If we pick elements from one set at a time, we’ll never get to all the sets; there will still be infinitely many left. This infinitude of sets is where the real problem lies. (And things get worse if we have an <a href="https://en.wikipedia.org/wiki/Uncountable_set">uncountably infinite</a> collection of sets, which is too many to even put in order!)</p>
</li>
</ul>
<p style="text-align: center;"><img src="/assets/blog/aoc/count_over_eleventy.jpg" alt="A kitten holding up its paws like it's counting. "Ai can count ober elebenty. Look see? Elebenty one elebenty two elebenty free..."" /></p>
<ul>
<li>
<p>Even if we have an infinite collection of sets, we <em>might</em> be able to pick an element from each set. If the sets have a nice enough pattern to them, we can give an explicit rule that lets us pick an element from each set consistently. For instance, if we have a bunch of sets of positive integers, we can always say something like “pick the smallest number in each set”.</p>
<p>But not every collection of sets allows a deterministic rule like this.<strong title="The set of real numbers doesn't have a smallest element or a largest element. Nor does the set of positive real numbers, or the set of numbers between zero and one. So if we have a collection of sets of real numbers, the rule we used for sets of positive integers doesn't work."><sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup> </strong> The axiom of choice says that we can choose an element from each set, even if we can’t describe a rule for making that choice. If we have infinitely many pairs of shoes we don’t need the axiom of choice, since we can just take the left shoe from each pair; but if we have infinitely many pairs of socks, we do need the axiom of choice.<strong title="This example was originally offered by Bertrand Russell. "><sup id="fnref:4"><a href="#fn:4" class="footnote">4</a></sup></strong></p>
</li>
</ul>
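<p>As a toy illustration (the particular sets here are made up, and a computer can only hold finitely many of them anyway): an explicit rule like “pick the smallest element” is a deterministic choice function, and applying it requires no axiom at all.</p>

```python
# A deterministic choice function for sets of positive integers:
# the rule "pick the smallest element" works uniformly, with no
# arbitrary choices. The trouble starts when no such rule exists.
sets = [{3, 7, 2}, {10, 5}, {42}]
choices = [min(s) for s in sets]
print(choices)  # [2, 5, 42]
```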
<h3 id="whats-the-problem">What’s the problem?</h3>
<p>The axiom of choice has weird effects precisely because it is so unlimited. It tells us that given any infinite collection of infinite sets, we can pick one option from each set, even if the sets are too big to really understand, and even if we don’t have any extra structure to guide us.</p>
<p>We can see how this matters by looking at a classic logic puzzle, and then taking it to infinity.</p>
<h5 id="the-finite-hat-puzzle">The (finite) hat puzzle</h5>
<p>Imagine a game show host<strong title="The _classic_ version of the puzzle features a sadistic prison warden. While that setup is traditional, it seems unnecessarily violent, so I've replaced it with something friendlier."><sup id="fnref:5"><a href="#fn:5" class="footnote">5</a></sup></strong> is going to line you up with 99 other people, and give each of you a hat to wear, which is either black or white. You can see everyone in front of you, including the colors of their hats; you can’t see your own hat, nor can you see anyone behind you.</p>
<p>Starting at the back of the line, the host will ask each person to guess whether their own hat is black or white. You’ll be able to hear the guesses, and whether they’re right or wrong.</p>
<p>Before the game starts, you all get a few minutes to talk and plan out your strategy. What should you do to get as many correct guesses as possible?</p>
<p>Stop and take a minute to think about this one. It doesn’t require any fancy mathematics, just a cute trick that’s surprisingly useful in other contexts.</p>
<p style="text-align: center;"><img src="https://64.media.tumblr.com/a6eec2d9352742626fe1fbe09b668cec/tumblr_nvpzxaBNRN1qgomego1_500.png" alt="Papyrus from Undertale dressed as Professor Layton: "Human would you like a puzzle?" Small child: "Not really" Papyrus: "Too bad you're getting a puzzle"" />
<em>Drawing by <a href="https://nightmargin.tumblr.com/post/130512412496/professor-skeleton-and-the-mystery-of-why-is">nightmargin</a> on Tumblr</em></p>
<p>As a hint, you can do really, really well. A simple approach that isn’t too bad is to have each odd-numbered person announce the color of the hat in front of them. This guarantees 50 right answers, and on average will get 75. But we can do much better than that.</p>
<p>Ready?</p>
<p>The person in the back of the line (call them \(A\)) doesn’t have any information, so there’s no possible way to guarantee they’ll get it right. But we can make sure everyone else wins. \(A\) can count up all the black hats in front of them and figure out if the number is even or odd. If it’s even, they’ll say “white”; if it’s odd, they’ll say “black”.</p>
<p>The second person \(B\) now knows whether \(A\) saw an even or odd number of black hats. But \(B\) can count up all the black hats <em>they</em> see. If \(A\) saw an even number of black hats, but \(B\) sees an odd number, the only explanation is that \(B\) is wearing a black hat.</p>
<p>The process continues down the line. \(C\) can tell whether \(A\) saw an even or odd number of black hats, and can also tell whether \(B\) was wearing black or white. Between that information, and seeing all the hats in front of them, \(C\) can figure out their own hat color.</p>
<p>(This sounds like it gets complicated very quickly, but we can streamline it. Count up all the black hats in front of you, and then add 1 to the number every time someone behind you says “black”. When the host reaches you, if the number is even you’re wearing a white hat, and if it’s odd you’re wearing a black hat.)</p>
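<p>If you’d like to convince yourself that the streamlined version works, here’s a small simulation sketch in Python (the function names and setup are mine, not part of the puzzle):</p>

```python
import random

def play(hats):
    """Simulate the parity strategy. hats[0] is the back of the line."""
    guesses = []
    # The person at the back announces the parity of the black hats
    # they see: "black" if odd, "white" if even.
    first = "black" if sum(h == "black" for h in hats[1:]) % 2 == 1 else "white"
    guesses.append(first)
    for i in range(1, len(hats)):
        # Streamlined rule: count the black hats in front of you, plus
        # every "black" said behind you (including the announcement).
        count = sum(h == "black" for h in hats[i + 1:])
        count += sum(g == "black" for g in guesses)
        guesses.append("black" if count % 2 == 1 else "white")
    return guesses

hats = [random.choice(["black", "white"]) for _ in range(100)]
guesses = play(hats)
# Everyone except possibly the first person is guaranteed correct:
print(all(g == h for g, h in zip(guesses[1:], hats[1:])))  # True
```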
<p>This exact algorithm is used by a lot of computer systems, especially when transmitting data over noisy connections. Computers store information in bytes, which are strings of eight bits. But often they will only use seven of the bits to store information (for instance, in standard <a href="http://rabbit.eng.miami.edu/info/ascii.html">ASCII encoding</a> there are 128 possible characters, represented as a 7-bit number). In transmission, the eighth bit can be used as a <a href="https://en.wikipedia.org/wiki/Parity_bit">parity bit</a>, which will be 1 if the other digits include an even number of “1”s, and 0 if they include an odd number of “1”s.</p>
<p>Thus every byte should have an odd number of “1”s, and if any byte has an even number of “1”s the system knows it contains an error. In our solution, \(A\) is effectively providing a parity bit for the string of hat colors, letting each player infer the information they don’t have: the color of their own hat.</p>
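<p>A minimal sketch of the odd-parity scheme described above (the helper names are made up for illustration):</p>

```python
def add_parity_bit(seven_bits):
    """Set the 8th bit so the byte has an odd number of 1s (odd parity)."""
    ones = bin(seven_bits).count("1")
    parity = 1 if ones % 2 == 0 else 0
    return (parity << 7) | seven_bits

def looks_valid(byte):
    """A received byte is valid if it has an odd number of 1s."""
    return bin(byte).count("1") % 2 == 1

b = add_parity_bit(ord("A"))  # 0b1000001 has two 1s, so the 8th bit is set
print(looks_valid(b))           # True
print(looks_valid(b ^ 0b100))   # False: a single flipped bit is detected
```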
<h5 id="the-uncountable-hat-puzzle">The uncountable hat puzzle</h5>
<p>That puzzle is fun, and the solution is clever, but there’s nothing especially paradoxical or brain-breaking about it. And it doesn’t involve the axiom of choice at all. But we can write a harder version that does use the axiom of choice, and has truly ridiculous results.<strong title="I think I first heard about this version from Greg Muller at https://cornellmath.wordpress.com/2007/09/13/the-axiom-of-choice-is-wrong/"><sup id="fnref:6"><a href="#fn:6" class="footnote">6</a></sup> </strong></p>
<p style="text-align: center;"><img src="/assets/blog/aoc/this-puzzle-reminds-me-of-a-puzzle.png" alt="Professor Layton's head: Doing a puzzle? That reminds me of a puzzle!" /></p>
<p>Suppose the game host now gets an infinite line of people, so each person can see an infinite collection of people in front of them. (Let’s assume there is a <em>first</em> person in the line, so it’s not infinite in both directions; you have infinitely many people in front of you, but only finitely many behind.) And instead of black or white hats, we’ll write a random real number on each person’s hat: you could have 3 or 7, or \(5.234\) or \(\pi^e\) or \(\Gamma(3.5^{7.2e^2})\). And just to make it harder, you can’t even hear what happens behind you.</p>
<p>This looks plainly impossible. No one who can see your hat can communicate with you at all. Even if they could, there are <a href="https://en.wikipedia.org/wiki/Cantor's_diagonal_argument">more possible hat labels</a> than there are people in line. It seems like everyone working together wouldn’t be able to guarantee even one right answer. But if we can use the axiom of choice, we can guarantee that infinitely many people get the right answer—and even better, only finitely many people will get it wrong. In our infinite line, there will be a <em>last</em> wrong person; everyone in front of them will guess right.</p>
<p>How can this possibly work? First we’ll think about the set of all possible sequences<strong title="If you don't know what a sequence is, just think of this as an infinite list."><sup id="fnref:7"><a href="#fn:7" class="footnote">7</a></sup></strong> of real numbers. (If we’re being fancy we might call this set \(\mathbb{R}^{\mathbb{N}}\).) We’ll say that two sequences are equivalent if they’re only different in finitely many places. So the sequences \( \Big( 1,2,3,4,5,6, \dots \Big) \) and \( \Big( 17, 2000 \pi, -\frac{345}{e}, 4, 5, 6, \dots \Big) \) are equivalent, but \( \Big( 1,0,3,0,5,0, \dots \Big) \) isn’t equivalent to either of them.</p>
<p>This gives us what’s called an <a href="https://en.wikipedia.org/wiki/Equivalence_relation">equivalence relation</a> on the set of real sequences. Equivalence relations are a widely useful tool, and I might write about them some other time, but for right now the important thing is that they <em>partition</em> the set, or subdivide it into smaller sets of things that are all equivalent to each other. Each thing will be in one and only one smaller set, which we call an <em>equivalence class</em>.</p>
<p>In our case, this means we’ve taken the set of all sequences of real numbers, and split it up into a bunch of equivalence classes of sequences. Every sequence belongs to exactly one equivalence class. And within each equivalence class, all the sequences are equivalent to each other—which means that they only have finitely many differences from each other.</p>
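<p>For finite sets, the partition an equivalence relation produces is easy to compute directly. A minimal sketch, using “same remainder mod 3” as the relation (since “finitely many differences” can’t be checked on truly infinite sequences):</p>

```python
def partition(items, equiv):
    """Group items into equivalence classes under the relation equiv."""
    classes = []
    for x in items:
        for c in classes:
            if equiv(x, c[0]):  # equivalent to this class's first member
                c.append(x)
                break
        else:
            classes.append([x])  # x starts a new equivalence class
    return classes

same_mod3 = lambda a, b: a % 3 == b % 3
assert partition(range(10), same_mod3) == [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
```

Every item lands in exactly one class, and within a class everything is equivalent to everything else, just as in the sequence case.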
<p>Now we use the axiom of choice. We can <em>choose</em> one representative sequence from each equivalence class, and have everyone memorize this set of chosen sequences. When we all line up, I can see everyone in front of me, so there are only finitely many hats I can’t see. That’s enough to pin down which equivalence class the true sequence of hat labels is in: there’s only one sequence on my list that can possibly be equivalent to it.</p>
<p>Now when the host reaches me, I don’t know what’s happened behind me. I don’t know the exact sequence of hat labels. But I don’t need to! I know which equivalence class the sequence is in, and I know which representative sequence we chose for that equivalence class. So I can tell the host the number for my position from the representative sequence that we chose.</p>
<p>I might not be right; I have no way to know until the host tells me. But since we’re all using the <em>same</em> representative sequence that we chose earlier, and that sequence differs from the “true” sequence in only finitely many places, an infinite number of us will answer correctly. And only a finite number will fail.</p>
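<p>We can’t run the axiom of choice on a computer, but we can watch the strategy work in a toy model of my own devising. Suppose (unrealistically) that every hat sequence has only finitely many nonzero entries. Then every sequence is equivalent to the all-zero sequence, so the natural representative of the single equivalence class is all zeros, and the strategy reduces to: everyone guesses 0.</p>

```python
# Toy model: hat sequences that are nonzero in only finitely many places.
# The all-zero sequence represents the (single) equivalence class, so the
# agreed-upon strategy is for every player to guess 0.
hats = [3.5, -2, 7] + [0] * 10_000  # a long tail of zeros
guesses = [0] * len(hats)

wrong = sum(1 for h, g in zip(hats, guesses) if h != g)
assert wrong == 3  # only finitely many players guess wrong
```

The real puzzle needs a representative for each of uncountably many classes, and that is exactly the step only the axiom of choice can supply.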
<h3 id="what-does-it-do-for-us">What does it do for us?</h3>
<p>The hat puzzle is obviously a little contrived, but the axiom of choice has a lot of surprising and sometimes disconcerting implications that are relevant to other fields of math. Some of these consequences are apparent paradoxes; others are things we would very much like to be true, and make the axiom of choice extremely useful.</p>
<h5 id="zorns-lemma">Zorn’s lemma</h5>
<p style="text-align: center;"><img src="/assets/blog/aoc/zorns_lemon.png" alt="What's yellow, sour, and equivalent to the axiom of choice? Zorn's Lemon!" /></p>
<p>Zorn’s Lemma is probably the most common use of the axiom of choice, but it’s a little tricky to explain. The formal statement is short enough:</p>
<blockquote>
<p><strong>Zorn’s Lemma:</strong> Every non-empty partially ordered set in which every totally ordered subset has an upper bound contains at least one maximal element.</p>
</blockquote>
<p>But it’s not super obvious what this means. The basic idea is that if we have some set where</p>
<ul>
<li>We can compare two elements and sometimes decide which one is “larger”;</li>
<li>but sometimes neither element counts as “larger”;</li>
<li><del>and we can never have an infinite collection of successively larger elements;</del>
any time we have an infinite collection of successively larger elements, there’s some other element bigger than all of them (thanks to Sniffnoy for the correction);</li>
</ul>
<p>then there must be a “largest” element.<strong title="Sometimes there can be _more than one_ largest element, which is a little weird. But since some pairs of elements can't be compared, you can have multiple elements that don't have anything above them. Imagine a company with two presidents: each of them is a highest-ranking person at the company. And that's why we say 'a' largest element rather than 'the' largest."><sup id="fnref:8"><a href="#fn:8" class="footnote">8</a></sup></strong></p>
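<p>In a finite partially ordered set the hypothesis of Zorn’s lemma holds automatically, so maximal elements always exist, and we can find them by brute force. A small sketch using divisibility as the order (my own example, just to illustrate “maximal” versus “largest”):</p>

```python
def maximal_elements(poset, leq):
    """Elements with nothing strictly above them in the partial order."""
    return [x for x in poset if not any(leq(x, y) and x != y for y in poset)]

divides = lambda a, b: b % a == 0  # partial order: a "divides" b
poset = [2, 3, 4, 5, 6, 8, 9, 10, 12]
print(sorted(maximal_elements(poset, divides)))  # [8, 9, 10, 12]
```

There are four maximal elements here, none of which divides another: several “presidents” with no one above them.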
<p>This is surprisingly useful, for one very specific reason: we can build up solutions to our problems step by step, and have a guarantee that we’ll finish. This is a tool we want to use all the time in math. We even tried it earlier: if we have a collection of sets, we can choose an element from the first one, and then the second one, and then the third one….</p>
<p>The problem we ran into is that this will eventually let us choose one element from each of a thousand sets, or a million, or a billion. But we have no guarantee that we can “eventually” choose from each of an infinite, possibly uncountable, collection of sets. Zorn’s lemma <a href="https://gowers.wordpress.com/2008/08/12/how-to-use-zorns-lemma/">solves this exact problem for us</a>, and lets us extend these constructions to infinity. And often when we’re defining functions on an infinite set, that’s exactly what we want to do.</p>
<p>Zorn’s lemma has one more important consequence: it is <em>equivalent</em> to the axiom of choice. We can use the axiom of choice to prove Zorn’s lemma; but we can also use Zorn’s lemma to prove the axiom of choice (by extending the axiom of finite choice to infinity, in exactly the way we were just discussing). We can’t duck the axiom-of-choice question by just making Zorn’s lemma into an axiom; the two are a package deal. If we want the power of Zorn’s lemma, we’re stuck with the axiom of choice and all the weirdness it implies.</p>
<h5 id="well-ordering"><a name="well-ordering">Well-ordering</a></h5>
<blockquote>
<blockquote>
<p>The axiom of choice is obviously true, the well-ordering principle obviously false, and who can tell about Zorn’s lemma?</p>
</blockquote>
</blockquote>
<blockquote>
<blockquote>
<blockquote>
<p><a href="https://books.google.com/books?id=eqUv3Bcd56EC&q=Bona#v=snippet&q=Bona&f=false">Jerry Bona</a></p>
</blockquote>
</blockquote>
</blockquote>
<p>These equivalences are a recurring theme in discussions of the axiom of choice. Another non-obviously equivalent statement is the Well-Ordering Principle, which says we can put any set \(X\) in a <a href="https://en.wikipedia.org/wiki/Well-order">definite order</a>, so that any non-empty subset has a “first” element. This is much stranger than it probably sounds. For instance, it’s really easy to put the real numbers in order, but most subsets won’t have a first element. (What’s the smallest real number? What’s the smallest positive real number? What’s the smallest number greater than 3?)</p>
<p>In fact, the fact that the usual order on the real numbers is <em>not</em> a well-ordering is a traditional source of internet math flame wars. There have been many <a href="https://forums.whirlpool.net.au/thread/9nxvlq19">forum threads</a> and <a href="https://polymathematics.typepad.com/polymath/2006/06/no_im_sorry_it_.html">blog comment threads</a> arguing endlessly about whether the infinitely repeating decimal \(.\bar{9}\) is actually equal to \(1\). (Yes, it is.)</p>
<p>Skeptics often suggest that maybe \(.\bar{9}\) isn’t <em>quite</em> \(1\), but just very close. Maybe it’s the last number before \(1\), the biggest number smaller than \(1\). But with the normal order for the reals, no such number exists. The reals are not well-ordered.</p>
<p>But with the axiom of choice, we can make up some <em>other</em> order for the real numbers, where every set has a first number. In fact, for any set, we can look at all the subsets and choose a first element for each one. We need to make sure that we do this consistently, but if we’re careful that’s not a problem, and so we can create a well-ordering on any set.</p>
<p>So what happens if we do this to the real numbers? There’s no real way to describe it—which is exactly why it requires the axiom of choice! You can make your favorite list of numbers and “choose” those to be first; the real difficulty is the need to make infinitely many choices. The axiom of choice lets us do this, but only in a totally non-explicit way that we can’t describe concretely.</p>
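<p>For a countable set we don’t need the axiom of choice at all: any explicit enumeration gives a well-ordering. Here’s a sketch for the integers, which have no smallest element in the usual order but do in the order 0, 1, -1, 2, -2, … (the function names are mine):</p>

```python
def rank(n):
    """Position of the integer n in the list 0, 1, -1, 2, -2, 3, -3, ..."""
    return 2 * n - 1 if n > 0 else -2 * n

def first(subset):
    """The 'first' element of a (finite) subset under the new order."""
    return min(subset, key=rank)

assert first([-5, -1, -3]) == -1  # the negatives now have a first element
assert first(range(-3, 4)) == 0   # and so does every other non-empty subset
```

The axiom of choice promises an ordering like this for the reals too, but without giving us any way to write down a <code>rank</code> function for them.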
<h5 id="the-banach-tarski-paradox">The Banach-Tarski “paradox”</h5>
<p style="text-align: center;"><a href="https://xkcd.com/804"><img src="/assets/blog/aoc/xkcd_pumpkin_carving_edit.png" alt="xkcd 804: Pumpkin Carving. 'I carved and carved, and the next thing I knew I had _two_ pumpkins.' 'I _told_ you not to take the axiom of choice.'" /></a></p>
<p>But the most famous consequence of the axiom of choice, which probably deserves its own post, is the <a href="https://en.wikipedia.org/wiki/Banach%E2%80%93Tarski_paradox">Banach-Tarski paradox</a>. Banach-Tarski says that if we have a solid three-dimensional ball, we can split it into five non-overlapping sets, rearrange these sets without any stretching or bending, and finish with two balls, each identical to the original ball.<strong title="The more general result is: given any two three-dimensional objects A and B, we can partition A into a finite collection of sets, and then rearrange those sets to get precisely B. In the special case people usually quote, A is 'a ball' and B is 'two balls'."><sup id="fnref:9"><a href="#fn:9" class="footnote">9</a></sup></strong></p>
<p>That means we’ve doubled the volume of our stuff just by moving the pieces around, which seems, um, implausible. We definitely can’t do that with a real ball. But with the axiom of choice, we can define “pieces” of the ball that are so strange that they don’t really have sizes at all. If we put them together one way, we get one volume; if we put them together a different way, we get a different volume. But the components don’t have a well-defined volume, so this is logically consistent. (And thus not actually a paradox, despite the name!)</p>
<h5 id="a-bunch-of-other-things">A bunch of other things</h5>
<p>There’s a <a href="https://en.wikipedia.org/wiki/Axiom_of_choice#Equivalents">long list of statements</a> that are equivalent to the axiom of choice. They show up in fields all over math, and algebra, analysis, and topology all become much simpler if these things are true:</p>
<ul>
<li>Every vector space has a basis</li>
<li>A product of non-empty sets is non-empty</li>
<li>Every set can be made into a group</li>
<li>The product of compact topological spaces is compact</li>
<li><a name="tarski">Tarski’s theorem:</a> If \(A\) is an infinite set, there’s a bijection between \(A\) and \( A \times A \)</li>
</ul>
<p>Since these are all equivalences, we can prove the axiom of choice from any one of them. If you believe <em>any</em> of these statements, you’re stuck believing all of them—and the axiom of choice as well, with all its bizarre ball-cloning hat-identifying implications.</p>
<h3 id="sois-it-true">So…is it true?</h3>
<blockquote>
<blockquote>
<p>Tarski…tried to publish his theorem (<a href="#tarski">stated above</a>) in the <em>Comptes Rendus Acad. Sci. Paris</em> but Fréchet and Lebesgue refused to present it. Fréchet wrote that an implication between two well known propositions is not a new result. Lebesgue wrote that an implication between two false propositions is of no interest. And Tarski said that after this misadventure he never tried to publish in the <em>Comptes Rendus</em>.</p>
</blockquote>
</blockquote>
<blockquote>
<blockquote>
<blockquote>
<p>Jan Mycielski, <a href="http://www.ams.org/notices/200602/fea-mycielski.pdf"><em>A System of Axioms of Set Theory for the Rationalists</em></a></p>
</blockquote>
</blockquote>
</blockquote>
<p>The big question is: <em>should</em> we believe any of these statements?</p>
<p>That might be a surprising question. Isn’t the whole point of math to have definitive, objectively correct answers? Either we can prove a result is true, or we can’t. We don’t generally ask whether we feel like believing a theorem. We proved it; we’re stuck with it.</p>
<p>But <em>axioms</em> are a little different. We need to decide on our axioms before we can prove things at all—or even decide what counts as a proof. Just like we can’t use a recipe to decide whether we want to make a cake or a cheeseburger, we can’t prove that an axiom is “correct”.</p>
<p>What we can do is look at a cake recipe, see what we’d have to do, and decide that maybe we don’t feel like making a cake after all. And we can look at what an axiom allows us to prove, and decide that maybe we don’t like those results and should pick some different axioms that don’t allow them.</p>
<h5 id="the-zermelo-fraenkel-axioms">The Zermelo-Fraenkel Axioms</h5>
<p>The standard system of axioms we use in math is called <a href="https://en.wikipedia.org/wiki/Zermelo%E2%80%93Fraenkel_set_theory">Zermelo-Fraenkel Set Theory</a>, or just ZF. These are the rules we use as the base for all our work. If we can use them to prove a statement, we just say it’s proven; if a statement contradicts the ZF axioms, we’ve disproven it.</p>
<p style="text-align: center;"><img src="/assets/blog/aoc/set-theory-is-enough-theory-already.jpg" alt="Grumpy Cat says: Set Theory / is enough theory already" /></p>
<p>If the axiom of choice contradicted ZF, then we could forget about it and move on with our lives. But in 1938 Kurt Gödel proved that this isn’t the case: you can have fully consistent systems that respect both the ZF axioms and the axiom of choice.</p>
<p>Similarly, if we could prove the axiom of choice from the ZF axioms, we would have to either accept it as true, or completely rework all the foundations of math<strong title="We've actually done that before. At the beginning of the 20th century, Bertrand Russell and others found deep contradictions in the naive version of set theory in use at the time, and the ZF axioms were developed to avoid those problems. But we'd rather avoid doing it again."><sup id="fnref:10"><a href="#fn:10" class="footnote">10</a></sup></strong>. But we can’t do that either. And this is more than just acknowledging that we haven’t proved it <em>yet</em>: in 1963 Paul Cohen invented a technique called forcing to prove that if ZF is consistent, then we can never prove the axiom of choice from the rest of the ZF axioms.</p>
<p>This combination of results feels a little weird, because it’s so different from the way we usually approach math. Math has a reputation for black-and-white thinking<strong title="I don't like this reputation in any context. Mathematical thinking creates tons of space for nuance and subtlety and shades of gray. But that's probably a different essay."><sup id="fnref:11"><a href="#fn:11" class="footnote">11</a></sup></strong>: there’s a right answer to every question, and other answers are wrong. But here I’m telling you that there is no right answer. We can accept or reject the axiom of choice, and it works equally well either way.</p>
<h5 id="independence-is-normal">Independence is normal</h5>
<p>But this is actually perfectly normal! Suppose I asked you “are triangles isosceles?” The right answer isn’t “yes” <em>or</em> “no”: it depends on the triangle. And there are some theorems we can prove about isosceles triangles, like “if a triangle is isosceles, it has two equal angles”. And there are different theorems we can prove about non-isosceles triangles. The “axiom of isosceles-ness” is independent from the definition of a triangle.</p>
<p>But that might sound a little glib; no one talks about triangles like that. A better example is Euclidean geometry. When Euclid gave his formalization of geometry in <em>Elements</em>, he began with <a href="https://en.wikipedia.org/wiki/Euclidean_geometry#Axioms">five axioms</a> (or “postulates”, as you might have called them in high school geometry). The fifth (and final) postulate, called the <a href="https://en.wikipedia.org/wiki/Parallel_postulate">parallel postulate</a>, proved to be rather awkward.</p>
<blockquote>
<p><strong><a href="https://en.wikipedia.org/wiki/Parallel_postulate">Parallel postulate</a>:</strong> There is at most one line that can be drawn parallel to another given one through an external point.<strong title="This version is more precisely known as Playfair's axiom. Euclid's phrasing (translated from Greek) was 'if a straight line falling on two straight lines make the interior angles on the same side less than two right angles, the two straight lines, if produced indefinitely, meet on that side on which the angles are less than two right angles.' But Playfair's axiom is much simpler to state, and the two statements are equivalent."><sup id="fnref:12"><a href="#fn:12" class="footnote">12</a></sup></strong></p>
</blockquote>
<p>This axiom is extremely important to geometry, but is much more complex and less self-evident than the other four axioms, which are statements like “all right angles are equal” and “we can draw a line connecting any two points”. Two millennia of mathematicians tried to remove this awkward complexity by proving the parallel postulate just from Euclid’s other axioms.</p>
<p>Then in the 1800s, we finally solved this problem—in the other direction. Euclidean geometry, including the parallel postulate, is completely consistent; but it’s also consistent to work with <em>non</em>-Euclidean geometries, in which the parallel postulate is false. Mathematicians constructed <a href="https://en.wikipedia.org/wiki/Non-Euclidean_geometry#Models_of_non-Euclidean_geometry">models</a> of elliptic geometry, in which there are no parallel lines, and of hyperbolic geometry, in which parallel lines are not unique.</p>
<p>What is a model? It’s just something that obeys all the axioms. So the work we do in high school, with pencil and paper on a flat surface, is a model of Euclidean geometry. It follows all five axioms, and any theorem that follows from the Euclidean axioms will be true of our pencil-and-paper work.</p>
<p>But if we work on the surface of a sphere, we get a model of non-Euclidean elliptic geometry. We can define a line to be a <a href="https://en.wikipedia.org/wiki/Great_circle">great circle</a>, a circle that goes fully around a sphere the long way. Any two points lie on exactly one great circle, so these “lines” obey Euclid’s first four axioms. But with a little bit of playing around, you can see that any pair of great circles will intersect in two points. This model doesn’t have any parallel lines at all.</p>
<p style="text-align: center;"><img src="/assets/blog/aoc/Grosskreis.svg" alt="Image of a sphere, with great circles marked." /></p>
<p style="text-align: center"><em>The solid curves are great circles. The solid blue curve is the equator.</em> <br />
<em>The dashed curves aren’t great circles, so they don’t count as lines.</em> <br />
<em>Adapted from <a href="https://commons.wikimedia.org/wiki/File:Grosskreis.svg">Wikimedia Commons</a></em></p>
<p>We can also build <a href="https://en.wikipedia.org/wiki/Poincar%C3%A9_disk_model">models of hyperbolic geometries</a>, but they’re a little harder to describe. Even one of these models is enough to know that we can’t prove the parallel postulate from Euclid’s other axioms—at least, not unless the other axioms are themselves contradictory. Nor can we disprove it. We have to <em>decide</em> if we want to use the parallel postulate.</p>
<p>This is exactly what Gödel and Cohen did for the axiom of choice. Gödel constructed a model of ZF set theory with choice; Cohen constructed a model of ZF set theory without choice. So we have to decide if we want to use the axiom of choice. And this brings us back to the same question: what are we trying to describe? Is the world we want to study a model of ZF with choice, or without?</p>
<h5 id="how-do-we-choose">How do we choose?</h5>
<p>To decide if we should adopt an axiom, we need to know what our goals are, and what we’re trying to describe. Euclidean geometry is good for arranging furniture in my room, but it’s bad for planning long-range flights, for which the fact that we live on a sphere matters.</p>
<p style="text-align: center;"><img src="/assets/blog/aoc/great_circle_routes.png" alt="A diagram of a great circle flight path. First on a rectangular/planar projection, where it doesn't look like a straight line; then on a sphere, where it does." /></p>
<p style="text-align: center"><em>Plane flight paths don’t look like straight lines on a flat map.</em> <br />
<em>On a sphere we see they really are the shortest, “straightest” path.</em> <br />
<em>Adapted from <a href="https://commons.wikimedia.org/wiki/File:Different_map_projections.png">Wikimedia Commons</a> CC-BY-SA-3.0</em></p>
<p>We should ask the same question about the axiom of choice: what are we trying to describe? Does the axiom of choice bring us closer to describing the world accurately, or farther away? Is the world we want to study a model of ZF with choice, or without?</p>
<p>The obvious answer is that the axiom of choice has absurd and unrealistic results. In the real world we can’t slice up one billiard ball and assemble the pieces into two billiard balls, or save infinitely many people in the hat puzzle. So if the axiom of choice says we can, it must not be describing the real world.</p>
<p>But this argument isn’t terribly persuasive, because every single thing about the uncountable hat puzzle is physically absurd. Even the setup is ridiculous: we can’t have an infinite line of people, and if we were somehow put in an infinite line, we wouldn’t be able to see all the people in it, let alone the numbers on their hats.</p>
<p>The step where we use the axiom of choice is even more unrealistic. We take the uncountably infinite set of real sequences; we partition it into an uncountably infinite collection of infinite sets of sequences; and then we ask everyone to memorize an (infinite!) sequence from each of these infinitely many infinite sets.</p>
<p>I’d have a hard time remembering one list of a hundred numbers. Memorizing a thousand lists of a thousand numbers is extremely unlikely; memorizing infinitely many lists of infinitely many numbers is flatly impossible. And that’s before we ask how we can communicate the lists we’ve chosen to each other, so that each of the (infinitely many) people memorize the <em>same</em> infinite collection of infinite lists.</p>
<p>The Banach-Tarski argument isn’t any better. It splits the ball into only five pieces, sure, but each of those pieces is infinitely complex, enough so that you can’t concretely describe their shapes, let alone actually cut a ball into those pieces. The informal explanation that “you can slice a ball into five pieces and reassemble those pieces into two balls” is not true, because there’s no real way to produce the pieces you need.<strong title="Feynman has a story about this in his memoir. A math grad student described the Banach-Tarski paradox to him, and he bet that it was made up, rather than a real theorem. He was able to wriggle out of losing by pointing out that the grad student had described cutting up an _orange_, and you can't slice a physical object made up of atoms infinitely finely."><sup id="fnref:13"><a href="#fn:13" class="footnote">13</a></sup></strong></p>
<p>In the real world we <em>never see infinite sets</em>. We pretend some sets are infinite because it makes our lives easier. But any principle that <em>only</em> kicks in at infinity will never make contact with reality.</p>
<p style="text-align: center"><img src="/assets/blog/aoc/einstein_stupidity.jpeg" alt="Picture of Einstein: Two things are infinite: the universe and human stupidity; and I'm not sure about the universe." height="50%" width="50%" /></p>
<p style="text-align: center"><em>Einstein <a href="https://quoteinvestigator.com/2010/05/04/universe-einstein">probably didn’t say this</a>, but it’s a good line.</em></p>
<h3 id="not-as-crazy-as-it-seems">Not as crazy as it seems</h3>
<p>This might feel like it’s dodging the question, though. If infinity is fake, why should we use axioms that only matter for infinity? And if we are going to say things about infinity, shouldn’t they make sense?</p>
<p>Maybe it’s fine for a physicist to dismiss mathematical abstractions as unphysical and thus irrelevant. But math is about reasoning through the consequences of abstract hypotheticals! If we’re going to adopt a foundational principle like the axiom of choice, we should really mean that we believe it in every abstract hypothetical situation we’re going to apply it in.</p>
<p>But after we realize how infinity works, our absurd results look somewhat more reasonable.<strong title="This is a common mathematical rhetorical trick. Earlier I was trying to convince you that the implications of the axiom of choice were really weird. Now I'm going to try to convince you that they're perfectly reasonable. This exact two-step happens quite a lot in math exposition. I suspect this is due partially to the demands of pedagogy, and partly to the way we form our mathematical intuition."><sup id="fnref:14"><a href="#fn:14" class="footnote">14</a></sup></strong> Our “successful” strategy in the infinite hat game actually doesn’t give us all that much. Sure, only finitely many people lose; some person in the line will be the last to answer wrong. But what would this look like in practice?</p>
<p>You could imagine the first hundred people all getting the question wrong. But that’s okay; only finitely many people will get it wrong. Then the first thousand people all get it wrong. But we know that at some point a last person will get it wrong and everyone left will get it right. A million people all get it wrong. Everyone gets bored. The game show host decides to leave. And sure enough, only finitely many people ever answered the question wrong!</p>
<p>The axiom of choice argument somehow doesn’t do anything after a finite number of answers. You could have the first million, or the first trillion, people all get the question wrong, and that wouldn’t contradict our proof. All the weirdness happens out at infinity—and we already know that infinity is deeply weird.</p>
<h3 id="whats-the-point">What’s the point?</h3>
<p>The axiom of choice is logically independent of our axioms for set theory, so we can’t ever prove it true or false. And it says deeply strange things about deeply strange situations that can never really happen. So why does it matter?</p>
<h5 id="infinity-is-fake-but-useful">Infinity is fake <em>but useful</em></h5>
<p>The answer is the same as the reason we use infinity at all. Everything we’ve ever seen is finite and discrete: objects are made out of atoms, and even if space and time aren’t truly quantized, our ability to measure them definitely is. But it’s extremely convenient to pretend that reality is continuous, which allows us to solve problems with calculus and other clever math tricks. If the world is “close enough” to being continuous, our answers will be good enough for whatever we’re doing.</p>
<p>Any infinity we care about will come from a limit of finite things. I can measure the width of my office in meters, or centimeters, or millimeters. With the right equipment I could measure it in micrometers or nanometers. I can’t ever measure it with infinite precision, but I can <em>imagine</em> doing that. And it’s really convenient to say the width is a real number, rather than to insist that it must <em>really</em> be some integer number of picometers.</p>
<p>This exact reasoning is basically how all of calculus works. If I want to know how fast my car is going in miles per hour, I can measure the distance it travels in miles over the course of an hour. Or I can see how many miles it goes in a minute, and multiply by sixty. I could measure the number of miles it goes in a second, and multiply by 3600 (or more realistically, measure the number of <em>feet</em> it goes in a second, and multiply by 3600/5280).</p>
<p>But what is the speed “right now”? We imagine taking measurements over these shorter and shorter intervals; in the limit, when our interval is “infinitely short”, we get the instantaneous velocity. And that’s a derivative, which is an extremely powerful tool for doing math and physics.</p>
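<p>Numerically, that limit is just a sequence of average speeds over shrinking windows. A quick sketch with a made-up position function (my own example, not from the text):</p>

```python
def position(t):
    """Miles traveled after t hours (an invented example trajectory)."""
    return 30 * t + 20 * t**2

def average_speed(t, dt):
    """Average speed in mph over the window [t, t + dt]."""
    return (position(t + dt) - position(t)) / dt

# Shrinking the measurement window homes in on the instantaneous speed.
# Here the averages work out to exactly 70 + 20*dt: 90, 72, 70.02, ...
for dt in [1.0, 0.1, 0.001, 1e-6]:
    print(average_speed(1.0, dt))
# The derivative at t = 1 is exactly 30 + 40 * 1 = 70 mph.
```

Each measurement is a real, finite one; the derivative is what we imagine they’re converging to.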
<p>But we can’t <em>actually</em> measure the distance traveled in an infinitely small window of time. (Nor can we measure the infinitely small time itself.) We’re taking some real, physical, finite measurements. We can measure how far a car goes in one second, multiply by 3600/5280, and then display that number on the dashboard. But the infinite version is something we only imagine.</p>
<h3 id="just-relax">Just relax</h3>
<p>If we’re trying to model the world, any infinite set we have to deal with will be a limit of finite sets. And any infinite family of infinite sets will be a limit of finite families of finite sets. And we know we have choice for finite sets of finite sets. So we can always get choice for these specific infinite sets, if we really need it—just by taking the limit of the elements we chose from our finite families.</p>
<p>What the axiom of choice says is: don’t worry about it. You don’t have to explain <em>how</em> your family of sets came from a finite family. You don’t have to explain <em>how</em> you’re choosing elements. We’ll just assume you can make it work somehow.</p>
<p>That’s what axioms are for. They tell us what we want to just assume we can do, without really explaining how. Our axioms are a list of things we don’t want to have to think about. And in practice, we don’t have to think about whether we can make choices. Any time it really matters, we can.</p>
<hr />
<div class="footnotes">
<ol>
<li id="fn:1">
<p>We can be more formal by phrasing this in terms of <em>choice functions</em>: given a collection of sets \(\mathcal{X} = \{A\}\) there is a function \(f : \mathcal{X} \to \bigcup_{A \in \mathcal{X}} A\) such that \(f(A) \in A \) for each \(A \in \mathcal{X} \). But I want to keep the discussion as readable as possible if you’re not comfortable with the language of formal set theory. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>Using this sort of process on an infinite set is called <a href="https://en.wikipedia.org/wiki/Transfinite_induction">transfinite induction</a>. <del>If we allow transfinite induction then we get the axiom of choice for free. But the axiom of choice also implies that we can do transfinite induction; the two concepts are logically equivalent.</del> Transfinite induction can sometimes allow us to make choices without the axiom, but only if we can put our sets in some order. Conversely, the axiom of choice allows us to <a href="https://en.wikipedia.org/wiki/Transfinite_induction#Relationship_to_the_axiom_of_choice">use transfinite induction in cases we otherwise couldn’t</a>.</p>
<p>Thanks to Sniffnoy for a helpful correction here. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3">
<p>The set of real numbers doesn’t have a smallest element or a largest element. Nor does the set of positive real numbers, or the set of numbers between zero and one. So if we have a collection of sets of real numbers, the rule we used for sets of positive integers doesn’t work. <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
<li id="fn:4">
<p>This example was originally offered by Bertrand Russell. <a href="#fnref:4" class="reversefootnote">↩</a></p>
</li>
<li id="fn:5">
<p>The <em>classic</em> version of the puzzle features a sadistic prison warden. While that setup is traditional, it seems unnecessarily violent, so I’ve replaced it with something friendlier. <a href="#fnref:5" class="reversefootnote">↩</a></p>
</li>
<li id="fn:6">
<p>I think I first heard about this version from <a href="https://cornellmath.wordpress.com/2007/09/13/the-axiom-of-choice-is-wrong/">Greg Muller</a>. <a href="#fnref:6" class="reversefootnote">↩</a></p>
</li>
<li id="fn:7">
<p>If you don’t know what a sequence is, just think of this as an infinite list. <a href="#fnref:7" class="reversefootnote">↩</a></p>
</li>
<li id="fn:8">
<p>Sometimes there can be <em>more than one</em> largest element, which is a little weird. But since some pairs of elements can’t be compared, you can have multiple elements that don’t have anything above them; mathematicians call these <em>maximal</em> elements. Imagine a company with two presidents: each of them is the highest-ranking person at the company. And that’s why we say “a” largest element rather than “the” largest. <a href="#fnref:8" class="reversefootnote">↩</a></p>
</li>
<li id="fn:9">
<p>The more general result is: given any two bounded three-dimensional objects \(A\) and \(B\), each containing a solid ball, we can partition \(A\) into a finite collection of sets, and then rearrange those sets to get precisely \(B\). In the special case people usually quote, \(A\) is “a ball” and \(B\) is “two balls”. <a href="#fnref:9" class="reversefootnote">↩</a></p>
</li>
<li id="fn:10">
<p>We’ve actually done that before. At the beginning of the 20th century, Bertrand Russell and others found deep contradictions in the naive version of set theory in use at the time, and the ZF axioms were developed to avoid those problems. But we’d rather avoid doing it again. <a href="#fnref:10" class="reversefootnote">↩</a></p>
</li>
<li id="fn:11">
<p>I don’t like this reputation in any context. Mathematical thinking creates tons of space for nuance and subtlety and shades of grey. But that’s probably a different essay. <a href="#fnref:11" class="reversefootnote">↩</a></p>
</li>
<li id="fn:12">
<p>This version is more precisely known as <a href="https://en.wikipedia.org/wiki/Playfair's_axiom">Playfair’s axiom</a>. Euclid’s phrasing (translated from Greek) was “if a straight line falling on two straight lines make the interior angles on the same side less than two right angles, the two straight lines, if produced indefinitely, meet on that side on which the angles are less than two right angles.” But Playfair’s axiom is much simpler to state, and the two statements are equivalent. <a href="#fnref:12" class="reversefootnote">↩</a></p>
</li>
<li id="fn:13">
<p>Feynman has a story about this in <a href="https://en.wikipedia.org/wiki/Surely_You're_Joking,_Mr._Feynman!">his memoir</a>. A math grad student described the Banach-Tarski paradox to him, and he bet that it was made up, rather than a real theorem. He was able to wriggle out of losing by pointing out that the grad student had described cutting up an <em>orange</em>, and you can’t slice a physical object made up of atoms infinitely finely. <a href="#fnref:13" class="reversefootnote">↩</a></p>
</li>
<li id="fn:14">
<p>This is a common mathematical rhetorical trick. Earlier I was trying to convince you that the implications of the axiom of choice were really weird. Now I’m going to try to convince you that they’re perfectly reasonable. This exact two-step happens quite a lot in math exposition. I suspect this is due partially to the demands of pedagogy, and partly to the way we form our mathematical intuition. <a href="#fnref:14" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Jay DaigleOne of the easiest ways to start a (friendly) fight in a group of mathematicians is to bring up the axiom of choice. I'll explain what it is, why it's so controversial, and hopefully shed some light on how we choose axiomatic systems and what that means for the math we do.