<p><em>Jay Daigle is a professor of mathematics at Occidental College in Los Angeles. In addition to his research in number theory, he brings a mathematical style to thinking about philosophy, politics, social dynamics, and everyday life. (Feed for <a href="https://jaydaigle.net/">https://jaydaigle.net/</a>, generated by Jekyll, updated 2020-04-09.)</em></p>
<p><strong>The SIR Model of Epidemics</strong> (Jay Daigle, 2020-03-27, <a href="https://jaydaigle.net/blog/the-sir-model-of-epidemics">https://jaydaigle.net/blog/the-sir-model-of-epidemics</a>)</p>
<script src="https://sagecell.sagemath.org/static/embedded_sagecell.js"></script>
<script>sagecell.makeSagecell({"inputLocation": ".sage"});</script>
<p>For <em>some</em> reason, a lot of people have gotten really interested in epidemiology lately. Myself included.</p>
<p><img src="/assets/blog/sir/coronavirus.jpg" alt="Picture of a coronavirus, by Alissa Eckert, MS and Dan Higgins, MAMS, courtesy of the CDC" class="center" style="width:350px" /></p>
<p style="text-align: center"><em>I have no idea why.</em></p>
<p>Now, I’m not an epidemiologist. I don’t study infectious diseases. But I do know a little about how mathematical models work, so I wanted to explain how one of the common, simple epidemiological models works. This model isn’t anywhere near good enough to make concrete predictions about what’s going to happen. But it <em>can</em> give some basic intuition about how epidemics progress, and provide some context for what the experts are saying.</p>
<hr />
<p><strong>Disclaimer:</strong> I don’t study epidemics, and I don’t even study differential equation models like this one. I’m basically an interested amateur. I’m going to try my best not to make any predictions, or say anything specific about COVID-19. I don’t know what’s going to happen, and you shouldn’t listen to my guesses, or the guesses of anyone else who isn’t an actual epidemiologist.</p>
<hr />
<h2 id="the-sir-model">The SIR Model</h2>
<h3 id="parameters">Parameters</h3>
<p>The SIR model divides the population into three groups, which give the model its name:</p>
<ul>
<li>$S$ is the number of <strong>S</strong>usceptible people in the population. These are people who aren’t sick yet, but could get sick in the future.</li>
<li>$I$ is the number of <strong>I</strong>nfected people. These are the people who are sick<strong title="Or people who are asymptomatic carriers. This model doesn't worry about who actually gets a fever and starts coughing, just who carries the virus and can maybe infect others."><sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup></strong> right now.</li>
<li>$R$ is the number of people who have <strong>R</strong>ecovered from the virus. They are immune and can’t get sick again.</li>
<li>We also will use $N$ for the total number of people. So $N = S+ I + R$.</li>
</ul>
<p><img src="/assets/blog/sir/knight.jpg" alt="Picture of a Knight, by Paul Mercuri (1860)" class="center" style="width:400px" /></p>
<p style="text-align:center"><em>Not that kind of “sir”.</em></p>
<p>For the purposes of this model, we assume that the total number of people, $N$, doesn’t change. But the number of people in each $S,I,R$ group is changing all the time: susceptible people get infected, and infected people recover. So we write $S(t)$ for the number of susceptible people “at time $t$”—which is just a fancy way of saying that $S(3)$ means the number of susceptible people on the third day.</p>
<h3 id="change-over-time">Change Over Time</h3>
<p>In order to model how these groups evolve over time, we need to know how often those two changes happen. How quickly do sick people recover? And how quickly do susceptible people get sick?</p>
<p>The first question, in this model, is simple. Each infected person has a chance of recovering each day, which we call $\gamma$. So if the average person is sick for two weeks, we have $\gamma = \frac{1}{14}$. And on each day, $\gamma I$ sick people recover from the virus.</p>
<p>The second question is a little trickier. There are basically three things that determine how likely a susceptible person is to get sick: how many people they encounter in a day, what fraction of those people are sick, and how likely a sick person is to transmit the disease. The middle factor, the fraction of people who are sick, is $\frac{I}{N}$. We could think about the other two separately, but for mathematical convenience we group them together and call them $\beta$.</p>
<p>So the chance that a given susceptible person gets sick on each day is $\beta \frac{I}{N}$.<strong title="If we're being fancy, we say that the chance of getting sick is proportional to I/N and that β is the constant of proportionality. But if you're not used to differential equations already I'm not sure that tells you very much."><sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup></strong> And thus the total number of people who get sick each day is $\beta \frac{I}{N} S$.</p>
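<p>To make the bookkeeping concrete, here is a minimal Python sketch of a single day in the model. Every number in it (the population, the current counts, and the values of $\beta$ and $\gamma$) is made up for illustration and is not an estimate for any real disease.</p>

```python
# One day of SIR bookkeeping. All values are illustrative assumptions.
N = 1_000_000     # total population (assumed)
S = 990_000       # susceptible people today (assumed)
I = 10_000        # infected people today (assumed)
beta = 0.2        # combined contact-and-transmission parameter (assumed)
gamma = 1 / 14    # daily chance of recovery: two weeks of illness on average

# Chance that one particular susceptible person gets sick today: beta * I/N
chance_of_infection = beta * I / N

# Total number of people who get sick today: beta * (I/N) * S
new_infections = chance_of_infection * S

# Total number of people who recover today: gamma * I
new_recoveries = gamma * I

print(f"chance a susceptible person gets sick today: {chance_of_infection:.4f}")
print(f"new infections today: {new_infections:.0f}")
print(f"recoveries today: {new_recoveries:.0f}")
```

<p>With these made-up numbers more people get sick than recover, so $I$ is still growing.</p>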
<p>If these letters look scary, it might help to realize that you’ve probably spent a lot of time lately thinking about $\beta$—although you probably didn’t call it that. The parameter $\beta$ measures how likely you are to get sick. You can decrease it by reducing the number of people you encounter in a day, through “social distancing” (or <a href="https://www.washingtonpost.com/lifestyle/wellness/social-distancing-coronavirus-physical-distancing/2020/03/25/a4d4b8bc-6ecf-11ea-aa80-c2470c6b2034_story.html">physical distancing</a>). And you can decrease it by improved hygiene—better handwashing, not touching your face, and sterilizing common surfaces.</p>
<p>There’s one more number we can extract from this model, which you might have heard of. In a population with no resistance to the disease (so $I$ and $R$ are both small, and we can pretend that $S=N$), a sick person will infect $\beta$ people each day, and will be sick for $\frac{1}{\gamma}$ days, and so will infect a total of $\frac{\beta}{\gamma}$ people. We call this ratio $R_0$; you may have seen in the news that the $R_0$ for COVID-19 is probably about $2.5$.</p>
<p><img src="/assets/blog/sir/file-20200128-120039-bogv2t.png" alt="A graph demonstrating exponential growth when R0 = 2" class="center" style="width:377px;" /></p>
<p style="text-align: center;"><em>When $\beta$ is twice as big as $\gamma$, things can get bad very quickly. From <a href="https://theconversation.com/r0-how-scientists-quantify-the-intensity-of-an-outbreak-like-coronavirus-and-predict-the-pandemics-spread-130777">The Conversation</a>, licensed under <a href="http://creativecommons.org/licenses/by-nd/4.0/">CC BY-ND</a></em></p>
<h3 id="assumptions-and-limitations">Assumptions and Limitations</h3>
<p>Like all models, this is a dramatic oversimplification of the real world. Simplification is good, because it means we can actually understand what the model says, and use that to improve our intuitions. But we do need to stay aware of some of the things we’re leaving out, and think about whether they matter.</p>
<p><strong>First</strong>: the model assumes a static population: no one is born and no one dies. This is obviously <em>wrong</em> but it shouldn’t matter too much over the months-long timescale that we’re thinking about here. On the other hand, if you want to model years of disease progression, then you might need to include terms for new susceptible people being born, and for people from all three groups dying.</p>
<p><strong>Second</strong>: the model assumes that recovery gives permanent immunity. Everyone who’s infected will eventually transition to recovered, and recovered people never lose their immunity and become susceptible again. I don’t think we know yet how many people develop immunity after getting COVID-19, or how long that immunity lasts.</p>
<p>But it seems basically reasonable to assume that most people will get immunity for at least several months; in this model we’re simplifying that to assume “all” of them do. And since we’re only trying to model the next several months, it doesn’t matter for our purposes whether immunity will last for one year or ten.</p>
<p><strong>Third</strong>: we assumed that $\beta$ and $\gamma$ are constants, and not changing over time. But a lot of the response to the coronavirus has been designed to decrease $\beta$—and the extent of those changes may vary over time. People will be more or less careful as they get more or less worried, as the disease gets worse or better. And people might just get restless from staying home all the time and start being sloppier. An improved testing regime might also decrease $\beta$, and better treatments could improve $\gamma$.</p>
<p>But the model leaves $\beta$ and $\gamma$ the same at all times. So we can imagine it as describing what would happen if we didn’t change our lifestyle or do anything in response to the virus.</p>
<p><strong>Finally</strong>: the first two factors, combined, mean that the susceptible population can only decrease, and the recovered population can only increase. Since we also hold $\beta$ and $\gamma$ constant, this model of the pandemic will only have one peak. It will never predict periodic or seasonal resurgences of infection, like we see with the flu.</p>
<p><img src="/assets/blog/sir/CDC-influenza-pneumonia-deaths-2015-01-10.gif" alt="graph of flu deaths, 2010 - 2014" class="center" /></p>
<p style="text-align: center;"><em>A graph of flu deaths per week, peaking each winter, from the CDC. The vanilla SIR model will never produce this sort of periodic seasonal pattern.</em></p>
<p><img src="https://miro.medium.com/max/2000/1*ok3NLISRGvK-4SQyDA5KTg.png" alt="stylized graph of possible COVID-19 trajectories" class="center" style="width:500px;" /></p>
<p style="text-align: center;"><em>This green curve imagines a “dance” where we suppress coronavirus infections through an aggressive quarantine, and then spend months alternately relaxing the quarantine until infections get too high, and then tightening it again until infections fall back down. The SIR model doesn’t allow this sort of dynamic variation of $\beta$ and can never produce the green curve.</em></p>
<h3 id="the-whole-system">The Whole System</h3>
<p>If we put all this together we get a <em>system of ordinary nonlinear differential equations</em>. A differential equation is an equation that talks about how quickly something changes; in these equations, we have the rates at which the number of susceptible, infected, and recovered people change. “Ordinary” means that there’s only one input variable; the quantities $S$, $I$, and $R$ all change with time, but we’re not taking location as an input or anything. “Nonlinear” means that our equations aren’t in a specific “linear” form that’s really easy to work with.</p>
<p><img src="/assets/blog/sir/13974391215433.jpg" alt="Photo of a Kitten" class="center" style="width:479px" /></p>
<p style="text-align: center"><em>Calling these equations a “nonlinear system” is a lot like calling this kitten a “nondog animal”. It’s not wrong, but it’s kind of weirdly specific if you’re not at a dog show.</em></p>
<p>If you took calculus, you might remember that we often write $\frac{dS}{dt}$ to mean the rate at which $S$ is changing over time. Roughly speaking, it’s the change in the total number of susceptible people over the course of a day. We know that $S$ is decreasing, since susceptible people get sick but we’re assuming that people don’t <em>become</em> susceptible, so $\frac{dS}{dt}$ is negative. And specifically, we worked out that $\frac{dS}{dt}$ is $-\beta \frac{IS}{N}$, since that’s the number of people who get sick each day.</p>
<p>Similarly, we saw that $\frac{dR}{dt}$ is $\gamma I$, the number of people who recover each day. And $\frac{dI}{dt}$ is the number of people who get sick minus the number who recover. All together this gives us:</p>
<p>\begin{align}
\frac{dS}{dt} &= - \beta \frac{IS}{N} \\
\frac{dI}{dt} &= \beta \frac{IS}{N} - \gamma I \\
\frac{dR}{dt} &= \gamma I
\end{align}</p>
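<p>As a rough sketch, the three equations translate almost line for line into a small Python function. The function name <code>sir_rates</code> and the sample numbers are my own illustrative choices. The last line checks that the three rates cancel out, which is the model’s assumption that the total population $N$ never changes.</p>

```python
def sir_rates(S, I, R, beta, gamma):
    """Return (dS/dt, dI/dt, dR/dt) for the SIR equations.

    The name and signature are illustrative, not taken from any library.
    """
    N = S + I + R
    new_infections = beta * I * S / N   # people moving from S to I per day
    new_recoveries = gamma * I          # people moving from I to R per day
    dS = -new_infections
    dI = new_infections - new_recoveries
    dR = new_recoveries
    return dS, dI, dR


# Made-up sample state: 1% of a million people currently infected.
dS, dI, dR = sir_rates(S=990_000, I=10_000, R=0, beta=0.2, gamma=1 / 14)
print("dS/dt =", dS)
print("dI/dt =", dI)
print("dR/dt =", dR)
print("net change in N (zero up to rounding):", dS + dI + dR)
```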
<hr />
<h2 id="what-did-we-learn">What Did We Learn?</h2>
<p>Now that we have this model, what’s the point? We can actually do a few different things with a model like this. If we want, we can write down an <a href="https://arxiv.org/abs/1403.2160">exact formula</a> that tells us how many people will be sick on each day. Unfortunately, the exact formula isn’t actually all that helpful. The paper I linked includes lovely equations like</p>
<script type="math/tex; mode=display">z(\psi )= e^{-\mu\int_1^{\psi } \frac{ e^{\Psi (\xi )}}{\xi } \, d\xi } \left[\int_1^{\psi } e^{\Psi (\chi )+\mu\int_1^{\chi } \frac{ e^{\Psi (\xi )}}{\xi } \, d\xi } \, d\chi
-\int_1^{\gamma N_2} e^{\Psi (\chi )+\mu\int_1^{\chi } \frac{ e^{\Psi (\xi )}}{\xi } \, d\xi } \, d\chi +N_3 e^{\mu\int_1^{\gamma N_2} \frac{
e^{\Psi (\xi )}}{\xi } \, d\xi }\right].</script>
<p>And I don’t want to touch a formula that looks like that any more than you do.</p>
<p>Even if the formula were nicer, it wouldn’t be all that useful. Getting an exact solution to the equations doesn’t mean we know exactly how many people are going to get sick. Like all models, this one is a gross oversimplification of the real world. It’s not useful for making exact predictions; and if you want predictions that are <em>kinda</em> accurate, you should talk to the epidemiological experts, who have much more complicated models and much better data.</p>
<h3 id="qualitative-judgments">Qualitative Judgments</h3>
<p>But this model does give us a qualitative sense of how epidemics progress. For instance, in the very early stages of the epidemic, almost everyone will be susceptible. So we can make a further simplifying assumption that $S \approx N$ (with $I$ and $R$ both tiny by comparison), and get the equation
<script type="math/tex">\frac{dI}{dt} = (\beta - \gamma) I.</script>
This is <a href="https://jaydaigle.net/blog/a-neat-argument-for-the-uniqueness-of-e-x/">famously</a> the equation for <a href="https://en.wikipedia.org/wiki/Exponential_growth">exponential growth</a>, as long as $\beta$ is bigger than $\gamma$. And indeed, graphs of new coronavirus infections seem to start nearly perfectly exponential.</p>
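<p>Here is a small Python sketch of what that early exponential phase looks like. It keeps the recovery term from the full system, so the early growth rate is $\beta - \gamma$ per day. The values of $\beta$ and $\gamma$ and the starting count <code>I0</code> are illustrative assumptions only.</p>

```python
import math

beta = 0.2     # assumed illustrative transmission parameter
gamma = 0.07   # assumed illustrative recovery rate

# Early on S is about N, so dI/dt is about (beta - gamma) * I.
growth_rate = beta - gamma                   # per day
doubling_time = math.log(2) / growth_rate    # days for I to double

I0 = 100  # hypothetical number of infections on day 0
for day in (0, 7, 14, 21, 28):
    infected = I0 * math.exp(growth_rate * day)
    print(f"day {day:2d}: about {infected:.0f} infected")

print(f"doubling time: about {doubling_time:.1f} days")
```

<p>This approximation only holds while almost everyone is still susceptible; once $\frac{S}{N}$ drops noticeably, growth slows down.</p>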
<p><img src="https://cdn.i24news.tv/uploads/49/ba/a9/51/db/2f/9b/b6/08/0e/96/64/95/71/70/7f/49baa951db2f9bb6080e96649571707f.png" alt="Comparison of reported Chinese cases with exponential curve" class="center" style="width:320px;" /></p>
<p style="text-align: center;"><em>This graph <a href="https://www.i24news.tv/en/news/international/asia-pacific/1580327226-analysis-at-current-rate-china-virus-could-infect-over-25-000-by-february">from I24 news</a> of reported infections in China almost perfectly matches the exponential curve.</em></p>
<p><img src="https://static01.nyt.com/images/2020/03/20/science/virus-log-chart-1584728689795/virus-log-chart-1584728689795-facebookJumbo.jpg" alt="Linear and logarithmic scale plots of US and Italian coronavirus cases" style="width:600px;" class="center" /></p>
<p style="text-align: center;"><em>This <a href="https://www.nytimes.com/2020/03/20/health/coronavirus-data-logarithm-chart.html">New York Times graph</a> shows the exponential curves in both the US and Italy on the left. The right-hand logarithmic plots look nearly like straight lines, which also reflects the exponential growth pattern.</em></p>
<p>As the epidemic progresses, the numbers of infected and recovered people climb. Each sick person will infect fewer additional people, since more of the people they meet are immune. We can see this in the model: the number of people who get infected each day is $\beta \frac{S}{N} I$. After many people have gotten sick, $\frac{S}{N}$ goes down and so fewer people get infected for a given value of $I$.</p>
<p>The epidemic will peak when people are recovering at least as fast as they get sick. This happens when $\beta \frac{IS}{N} \leq \gamma I$, and thus when $S = \frac{\gamma}{\beta} N$. Remember that $\frac{\beta}{\gamma}$ was our magic number $R_0$, so by the peak of the epidemic, only one person out of every $R_0$ people will have avoided getting sick.</p>
<p>If the estimates of $R_0 \approx 2.5$ are correct, this would mean that the epidemic would peak when something like 60% of the population had gotten sick. And remember, that’s not the end of the epidemic; that’s just the worst part. It would slowly get weaker from that time on, until it eventually fizzles.</p>
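<p>The arithmetic behind that figure is short enough to sketch in Python. The peak comes when $S = \frac{\gamma}{\beta} N = \frac{N}{R_0}$, so the fraction of people who are no longer susceptible at the peak is $1 - \frac{1}{R_0}$. The function names and the sample values of $R_0$ are illustrative choices, not estimates.</p>

```python
def susceptible_fraction_at_peak(r0):
    """Fraction of the population still susceptible when infections peak: 1/R0."""
    return 1 / r0


def fraction_infected_by_peak(r0):
    """Fraction that has already left the susceptible group at the peak: 1 - 1/R0."""
    return 1 - susceptible_fraction_at_peak(r0)


# Illustrative values of R0 only.
for r0 in (1.5, 2.5, 4.0):
    print(f"R0 = {r0}: {fraction_infected_by_peak(r0):.0%} infected or recovered at the peak")
```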
<p>(These are <em>not predictions</em>, for many reasons. I’m not an epidemiologist. Any real epidemiologist would be using a much more sophisticated model than this one to try to make real predictions. Don’t pay attention to the specific numbers I use here. But you can get a qualitative sense of what changing these numbers would do—and have more context for understanding what the real experts tell you.)</p>
<p><img src="/assets/blog/sir/imperial_projections_chart.png" alt="Chart" class="center" style="width:448px;" /></p>
<p style="text-align: center"><em>Predictions from actual experts use a ton of data and consider a huge range of possibilities, and generally look like <a href="https://spiral.imperial.ac.uk:8443/handle/10044/1/77482">this table</a> from a team at Imperial College London.</em></p>
<h3 id="numeric-simulations">Numeric Simulations</h3>
<p>There’s one more thing that toy models like this can do. We can use them to run numeric simulations (using <a href="https://en.wikipedia.org/wiki/Euler_method">Euler’s method</a> or something similar). We can see what would happen under our assumptions, and how the results change if we vary those assumptions.</p>
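<p>For readers who want to see Euler’s method itself, here is a minimal plain-Python sketch. It is not the solver used by the SageMath code in this post (that code uses a Runge-Kutta routine); it just steps the three equations forward in small time increments. The function name, the step size <code>dt</code>, and the parameter values are assumptions chosen for illustration.</p>

```python
def simulate_sir_euler(beta, gamma, N, I0, days, dt=0.1):
    """Step the SIR equations forward with Euler's method.

    Returns a list of (time, S, I, R) tuples. Illustrative sketch only.
    """
    S, I, R = float(N - I0), float(I0), 0.0
    history = [(0.0, S, I, R)]
    steps = int(round(days / dt))
    for step in range(1, steps + 1):
        # Rates from the differential equations, scaled by the step length.
        new_infections = beta * S * I / N * dt
        new_recoveries = gamma * I * dt
        S -= new_infections
        I += new_infections - new_recoveries
        R += new_recoveries
        history.append((step * dt, S, I, R))
    return history


# Illustrative parameter values, not predictions.
N = 300_000_000
history = simulate_sir_euler(beta=0.2, gamma=0.07, N=N, I0=100_000, days=400)

# Find the day with the most simultaneous infections, and the final tally.
peak_day, _, peak_I, _ = max(history, key=lambda row: row[2])
final_R = history[-1][3]

print(f"infections peak around day {peak_day:.0f}")
print(f"share of the population sick at the peak: {peak_I / N:.1%}")
print(f"share ever infected after 400 days: {final_R / N:.1%}")
```

<p>Shrinking <code>dt</code> makes the approximation more accurate at the cost of more steps; a better solver gets the same accuracy with far fewer steps.</p>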
<p>Below is some code for the SIR model written in SageMath. (I borrowed the code from <a href="https://sage.math.clemson.edu:34567/home/pub/161/">this page</a> at Clemson; I believe the code was written by <a href="http://people.oregonstate.edu/~medlockj/">Jan Medlock</a>.) I’ve primed it with $\gamma = .07$, which means that people are sick for two weeks on average, and $\beta = .2$, which gives us an $R_0$ of about $2.8$.</p>
<p>If you just click “Evaluate”, you’ll see what happens if we run this model using those values of $\beta$ and $\gamma$ over the next 400 days. It’s pretty grim; the epidemic peaks two months out with a sixth of the country sick at once (the red curve), and in six months well over 80% of the country has fallen ill at some point (the blue curve).<strong title=" Reminder: I don't believe that this will happen, for many reasons. And you shouldn't listen to me if I did. Numbers are for illustrative purposes only and should not be construed as epidemiological advice."><sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup></strong></p>
<p>But with this widget you can play with those assumptions. What happens if we find a way to cure people faster, so $\gamma$ goes up? What if we lower $\beta$, by physical distancing or improved hygiene? The graph improves dramatically. And you can change up all the numbers if you want to. Play around, and see what you learn.</p>
<p>And stay safe out there.</p>
<div class="sage">
<script type="text/x-sage">
# Transmission rate
beta = 0.20
# Recovery rate
gamma = 0.07
# Population size
N = 300000000
# Initial infections
IInit = 100000
SInit = N - IInit
RInit = 0
R0 = beta / gamma
show(r'R_0 = %g' % R0)
# End time
tMax = 400
# Standard SIR model
def ODE_RHS(t, Y):
    (S, I, R) = Y
    dS = - beta * S * I / N
    dI = beta * S * I / N - gamma * I
    dR = gamma * I
    return (dS, dI, dR)
# Set up numerical solution of ODE
solver = ode_solver(function = ODE_RHS,
                    y_0 = (SInit, IInit, RInit),
                    t_span = (0, tMax),
                    algorithm = 'rk8pd')
# Numerically solve
solver.ode_solve(num_points = 1000)
# Plot solution
show(
    plot(solver.interpolate_solution(i = 0), 0, tMax, legend_label = 'S(t)', color = 'green')
    + plot(solver.interpolate_solution(i = 1), 0, tMax, legend_label = 'I(t)', color = 'red')
    + plot(solver.interpolate_solution(i = 2), 0, tMax, legend_label = 'R(t)', color = 'blue')
)
# code from https://sage.math.clemson.edu:34567/home/pub/161/
# Thanks to Jan Medlock
</script>
</div>
<p><em>Have a question about the SIR model? Have other good resources on this to point people at? Or did you catch a mistake? Tweet me <a href="https://twitter.com/profjaydaigle">@ProfJayDaigle</a> or leave a comment below.</em></p>
<p><em>And take care of yourself.</em></p>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>Or people who are asymptomatic carriers. This model doesn’t worry about who actually gets a fever and starts coughing, just who carries the virus and can maybe infect others. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>If we’re being fancy, we say that the chance of getting sick is proportional to $\frac{I}{N}$ and that $\beta$ is the constant of proportionality. But if you’re not used to differential equations already I’m not sure that tells you very much. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3">
<p>Reminder: I don’t believe that this will happen, for many reasons. And you shouldn’t listen to me if I did. Numbers are for illustrative purposes only and should not be construed as epidemiological advice. <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
<hr />
<p><strong>Online Teaching in the Time of Coronavirus</strong> (Jay Daigle, 2020-03-14, <a href="https://jaydaigle.net/blog/online-teaching-in-the-time-of-coronavirus">https://jaydaigle.net/blog/online-teaching-in-the-time-of-coronavirus</a>)</p>
<p>I’ve been spending a lot of the past week looking at different options for transitioning my teaching online for the rest of the term. There are certainly people far more expert at online instruction than I am, but I wanted to share some of my thoughts and what I’ve found.</p>
<h2 id="handling-assignments">Handling Assignments</h2>
<h3 id="online-assignment-options">Online Assignment Options</h3>
<p>There are a lot of options for doing homework online. Many of these products (like WebAssign) have temporarily made everything freely available. I’m sure some of them are good, but I don’t know much about them.</p>
<p>This term I’ve been experimenting with using <a href="https://webwork.maa.org/">the MAA’s WeBWork system</a>, which has been going quite well. If you can administer your own server it’s completely free; if you can’t, the MAA will give you one trial class and then charge $200 per course you want to host. I don’t know how willing they are to start these up mid-semester, though. WeBWork is hardly a solution to everything, but it works very well for questions with numerical or algebraic answers.</p>
<p>(With WeBWork you can even give assignments that have to be completed inside a narrow window–say, an assignment that is only answerable between 2 and 3:30 on Thursday. So we could maybe use this to somewhat replace tests. Though again, not perfectly.)</p>
<h3 id="written-homework">Written Homework</h3>
<p>Of course, some assignments really need to include a written component. Written homework probably can just be photographed (or scanned) with a mobile phone; I expect most of our students have access to some sort of digital camera. I don’t know anything about the scanning apps but I know they exist. I have in fact graded photographed homework before, and my student graders have expressed a willingness to do this for the rest of the term.</p>
<p>We can also consider encouraging our students, especially in upper-division classes, to start using LaTeX for more assignments. That’s an unreasonable imposition on Calc 1 students but most of the people in the upper-level classes have probably been exposed to it, and it would make a lot of this much simpler. No scanning, no photographing, just emailing in PDFs.</p>
<h2 id="lectures-and-office-hours">Lectures and Office Hours</h2>
<p>I purchased a writing tablet for my computer. This is a peripheral that plugs into your computer and allows you to write/draw with a pen. I specifically ordered a Huion 1060 Plus, which gives a 10x6 writing area and <a href="https://amazon.com/gp/product/B01FTE9HS2/">goes for $70 on Amazon</a>. I haven’t gotten to test it yet, so don’t consider that quite a recommendation. The other thing that gets highly recommended is the <a href="https://amazon.com/Wacom-Drawing-Software-Included-CTL4100/dp/B079HL9YSF">Wacom Intuos</a>, which is supposed to be somewhat nicer but also gives a much smaller writing surface (something like 6x4), so if you write big this might not be comfortable.</p>
<p>I’ve been looking into options to stream lectures and other content. There are really two things I want to do here: the first is to have video conferences where I can stream lectures and share my screen to show written notes, LaTeX’d notes, Mathematica notebooks, etc. The second is to create a persistent space for student interactions. I’d like to create a space where even when I’m not “holding a lecture” or “having office hours”, my students can still ask questions—of each other and of me.</p>
<h3 id="discord">Discord</h3>
<p>I’ve been doing the second thing with Discord for my research group for the past year or so. It works pretty well. You create a room with a bunch of channels and all messages in a channel stay permanently (unless deleted by a moderator). You can scroll up to see what people have talked about in the past. Makes it great for students to have conversations with you and each other, and other students can see what happened in them. (There’s also a private messaging feature, of course.)</p>
<p>Discord is also good for voice calls, and has a screen sharing feature. Both of them worked very smoothly when I tried them, except the screensharing has some limitations that I believe are Linux-specific (in particular, in my multi-monitor setup I can share one window, or my entire desktop, but I can’t share exactly one monitor, which is something I would like to do). I’ve been in touch with <a href="http://www-personal.umich.edu/~speyer/">David Speyer</a>, who’s written up a bunch of thoughts about Discord <a href="https://academia.stackexchange.com/questions/145389/using-discord-to-support-online-teaching/145390#145390">here, with a basic tutorial for setting it up</a>.</p>
<p>One thing about Discord that is both good and bad is that many of our students use it already. (It was designed for online videogame playing, and is now a widely used chat and voice program.) This is good because our students are already familiar with the program and how to use it. It may be bad because that means our students often already have screen names and identities on Discord that they may want to keep separated from their academic/professional personas. If we use some software they have not used before, they can create fresh accounts and keep their online personas appropriately segmented.</p>
<h3 id="oxys-suggestions-bluejeans-and-moodle">Oxy’s Suggestions: BlueJeans and Moodle</h3>
<p>My institution made some software recommendations. BlueJeans is the recommended videoconferencing software. I’ve played around with it a bit and it seems serviceable but not great. (Again, it has some specific issues with Linux that are more or less dealbreakers for me, as well.) One thing I miss from it is that it’s designed for video calls/conferences, but it doesn’t have the capacity to create a persistent chat room. So if I want that persistent interaction space, I’d need to use a second tool; I’d prefer to run everything on one platform if I can.</p>
<p>Moodle has a tool for creating chat rooms, but it’s <em>awful</em>. Do not want. It’s still a good place to post assignments and such if you don’t already have a place to post them and your institution uses Moodle. (If your institution uses some other learning management software, I can’t say much; Moodle is the only one I’ve ever used.)</p>
<h3 id="zoom-videoconferencing">Zoom Videoconferencing</h3>
<p>I’ve been leaning towards a videoconferencing solution called Zoom. The screensharing works great, and the recording feature works great. There’s an ability to create a shared whiteboard space, that I and students can both write on, which seems helpful for virtual office hours.</p>
<p>Zoom has the ability to create a persistent chatroom, and it worked very smoothly in some testing I did today with a couple of my undergraduates. (One of them reported that it “felt really slick”, which is a good sign; most of the experience was pretty seamless.) The videoconferencing can work without anyone making an account, I think, but the persistent chat room would require all our students to make (free) accounts. Anyone with a Gmail account can just log in with that, so that might not be a large barrier.</p>
<p>One major downside is that videoconferences are limited to 40 minutes. They’ve been relaxing this for schools and in affected areas, so I don’t know how much of a problem this would be in practice. But I also think we could just start again at the end of the 40 minute period if we needed to. (Or maybe just keep formal lectures below forty minutes; it’s hard to ask students to pay attention that long anyway. If you’re posting recorded videos, the usual suggestion seems to be to keep them under ten minutes.)</p>
<h2 id="closing-thoughts">Closing thoughts</h2>
<p>There are a bunch of other resources floating around to help you; I’ve looked at several but unfortunately haven’t been keeping a list. But if you poke around on Twitter or elsewhere there are many people more informed than I am who will offer help!</p>
<p>I know the MAA has a <a href="https://twitter.com/mathcirque/status/1238119797747068929?s=09">recorded online chat on online teaching</a>, though I haven’t looked at it yet.</p>
<p>But the most important thing is not to get hung up on perfection. I didn’t plan to teach my courses remotely this term, and I’m sure they will suffer for lack of direct instructional contact. But that’s okay! And I’m going to be honest with my students about this.</p>
<p>This is a really unfortunate way to finish out the semester. It sucks. But I’m going to do what I can to make it only suck a medium amount. And I hope my students will bear with me and help to make this only medium suck.</p>
<p>We’ll get through this.</p>
<hr />
<p><em>I’d love to hear any ideas or feedback you have about moving to online instruction. And I’m happy to answer any questions I can; we’re in this together.</em> <em>Tweet me <a href="https://twitter.com/profjaydaigle">@ProfJayDaigle</a>, or leave a comment below.</em></p>
<hr />
<p><strong>2019 Spring Class Reflections: Calculus</strong> (Jay Daigle, 2019-07-10, <a href="https://jaydaigle.net/blog/spring-2019-class-reflections-calculus">https://jaydaigle.net/blog/spring-2019-class-reflections-calculus</a>)</p>
<p>Now that the term is over, I want to reflect a bit on the courses I taught, what worked well, and what I might want to do differently next time. (Honestly, it probably would have been more useful to write this sooner after finishing the courses, when they were fresher in my mind. But I don’t have a time machine, so I can’t do much about that now.) In this post I’ll talk about my calculus class; I’ll try to write about the others soon.</p>
<h3 id="my-previous-course-design-had-limited-success">My previous course design had limited success</h3>
<p>Math 114 at Occidental is intended for students, usually freshmen, who have seen calculus before but haven’t mastered the material sufficiently to be ready for calculus 2. This has the advantage that everyone in the course is familiar with the basic ideas, and that I can sometimes reference ideas we haven’t talked about yet to help justify what we do in the early parts of the course. It also has the disadvantage that my students arrive with a lot of preconceptions and confusions about the subject.<strong title="And a lot of anxiety. After all, the typical student in this course took calculus in high school and then failed the AP exam; they've all had at least one not-great experience with the material."><sup id="fnref:anxiety"><a href="#fn:anxiety" class="footnote">1</a></sup></strong></p>
<p>It also means that we have extra time available to learn about extra topics that are interesting or useful or just help explain the ideas of calculus better, even if those topics aren’t really necessary to prepare for calculus 2.</p>
<p>In past years I had used this extra time to do the epsilon-delta definition of limits. I’m still proud of having successfully taught many freshmen to write clean epsilon-delta proofs. But over time I came to the conclusion that this wasn’t the best use of class time.</p>
<p>I had wanted the epsilon-delta proofs section to accomplish two things: help my students learn to write and reason more clearly, and give them a taste of what higher math was like. Neither of these goals was a complete failure, but neither was really a success either.</p>
<ul>
<li>
<p>My students got better at writing proofs, but I don’t think they learned this in a way that transferred skills to their other writing and communication. Beginner proofs tend to be written in a very restrictive, formal organization, effectively following a template. This template looks like it does for a reason, and is useful as a baseline for people to grow from. But in practice my students were just repeating the template to me instead of growing beyond it, so I don’t think they were gaining much.</p>
</li>
<li>
<p>And my students got a taste of higher math, but I’m pretty sure it was an unfortunately bitter taste. Epsilon-delta proofs are actually pretty complicated things and especially hard for novice proof-writers to execute successfully, so they don’t make a great first experience in proofs.</p>
</li>
<li>
<p>Making things worse, it tends to be really unclear why we need to prove any of these things. Most of the limit facts that come up in a first calculus course are “obviously true,” and so the effort we’re putting in often doesn’t feel like it’s actually accomplishing anything.<strong title="This same problem arises even in upper-division analysis courses. My undergraduate analysis professor Sandy Grabiner used to say that the point of a first analysis course is to prove that most of the time, what happens is exactly what you would expect to happen, and the second analysis course starts talking about the exceptions. But we tend to hope that our upperclassmen math majors at least are willing to bear with us through the proofs by that point."><sup id="fnref:analysis"><a href="#fn:analysis" class="footnote">2</a></sup></strong> Proofs often come across as a particularly obnoxious hoop that I’m making my students jump through to satisfy some perverse math-professor urge. <a href="https://mathwithbaddrawings.com/2019/01/09/a-brief-case-against-limits/">Ben Orlin</a> makes this case pretty clearly: calculus 1 students haven’t run into any of the problems that epsilon-delta proofs were invented to solve, and so they seem like an unnecessary runaround.</p>
</li>
<li>
<p>Most of all, it actually took quite a lot of time to do this well! Getting freshmen with no proof experience to the point where they could mostly write epsilon-delta proofs took a good three weeks out of a thirteen-week course. That’s a huge chunk of the course, and needs to be accomplishing a lot to justify itself. An epsilon-delta approach to limits just wasn’t worth the time and effort we were putting into it.</p>
</li>
</ul>
<h3 id="an-approximate-approach">An approximate approach</h3>
<p>Over time I realized that my course had gotten less focused on using the formal limits ideas anyway. I had drifted more and more to talking about two big ideas once we got out of the limits section: models and approximation.</p>
<p><em>Models</em> are the big idea I’ve been thinking about lately.<strong title="You can read a lot more of my thoughts about this in my post on word problems at https://jaydaigle.net/blog/why-word-problems/, for instance."><sup id="fnref:word-problems"><a href="#fn:word-problems" class="footnote">3</a></sup></strong> On its own terms, math is a purely abstract enterprise; to use math to understand the world we need to have some model of how the world can be described mathematically. This modeling is a really important skill for any field where you’re expected to apply math to solve problems, and the same skills can help you reason about situations with no explicit mathematical model.</p>
<p><em>Approximation</em> is the big idea of calculus. This is true on a surface level, where we can think of limits as taking an “infinitely good” approximation of the value of a function at a point, and derivatives are an approximation of the rate of change. But it’s also the case that many of the applications of calculus and especially of derivatives have to do with notions of approximation.</p>
<p>After some wrestling with both ideas, I decided to take the latter approach in this term’s course. It meshed well with the way I tend to think about the ideas in calculus 1, and the way I had been explaining them to students. So I reorganized my course into five sections.</p>
<ol>
<li><strong>Zero-order approximations:</strong> Continuity and limits. We can think of a continuous function as one where $f(a)$ is a good approximation of $f(x)$ when $x$ is close to $a$. A lot of the facts about limits we need to learn are answers to questions that arise naturally when we want to approximate various functions. And “discontinuities” make sense as “points where approximation is hard for some reason”.</li>
<li><strong>First-order approximations:</strong> Derivatives. We started with the linear approximation formula $f(x) \approx f(a) + m(x-a)$ and asked what value of $m$ would make this the best possible approximation. A little rearrangement gives the definition of derivative, but now that definition is the answer to a question, not a definition just dropped on our heads from the sky. We want to be able to compute derivatives <em>so that</em> we can approximate functions easily, and as a bonus we can reinterpret all of this geometrically, in terms of the tangent line.</li>
<li><strong>Modeling:</strong> Word problems and differential equations. We reinterpret the derivative a third time as an answer to the problem of average versus “instantaneous” speed, and then as the answer to all sorts of concrete “rate of change” problems. We can talk about the idea of differential equations, and practice turning descriptions of situations into toy mathematical models with derivatives. We can’t solve these equations explicitly without integrals, but we can <em>approximate</em> solutions using Euler’s method, and get a good definition of the function $e^x$ in the bargain. Implicit derivatives and related rates also show up here, using derivatives in a different type of model.</li>
<li><strong>Inverse Problems:</strong> Inverse functions and antiderivatives. We take all the questions we’ve asked and turn them around. We define inverse functions, especially the logarithm and inverse trig functions, and use the inverse function theorem to find their derivatives. We can use the intermediate value theorem and Newton’s method to approximate the solutions to equations. We finish by defining the antiderivative as the (not-quite) inverse of the derivative.</li>
<li><strong>Second-order approximations:</strong> The second derivative allows us to find the best <em>quadratic</em> approximation to a given function. This is a natural setting for thinking about extreme value problems, so we cover all the optimization topics, along with Rolle’s theorem and the mean value theorem, and then put all this information together to sketch graphs of functions. We finish up with brief explanations of Taylor series and of imaginary numbers.</li>
</ol>
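<p>To make the zero-, first-, and second-order language concrete, here is a minimal Python sketch that approximates $\sqrt{5}$ from the nearby easy point $a = 4$. The function name <code>sqrt_approximations</code> and the choice of $a = 4$ are illustrative assumptions, not part of the course materials.</p>

```python
import math

def sqrt_approximations(x, a):
    """Approximate sqrt(x) near a point a where sqrt(a) is easy to compute.

    Illustrative helper (hypothetical name): returns the zero-, first-,
    and second-order approximations built from f(a), f'(a), and f''(a).
    """
    f_a = math.sqrt(a)            # f(a)
    d1 = 1 / (2 * f_a)            # f'(a)  = 1 / (2 sqrt(a))
    d2 = -1 / (4 * a * f_a)       # f''(a) = -1 / (4 a^(3/2))
    h = x - a

    zero_order = f_a                              # f(x) ~ f(a)
    first_order = zero_order + d1 * h             # add the tangent-line term
    second_order = first_order + d2 * h**2 / 2    # add the quadratic term
    return zero_order, first_order, second_order

zero, first, second = sqrt_approximations(5, 4)
print(zero, first, second, math.sqrt(5))
```

<p>Each extra order uses one more derivative at $a$ and lands closer to the true value, which is the pattern the course outline is organized around.</p>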
<h3 id="most-of-it-worked-pretty-well">Most of it worked pretty well</h3>
<p>This course was basically successful, but there are lots of ways to improve it. I think my students both had a more comfortable experience and gained a much better understanding of some of the core ideas of calculus, especially the basic idea of linear approximation.</p>
<p>The first section, on limits, was okay. It’s still a little awkward, and I’m tempted to try Ben Orlin’s approach of starting with derivatives entirely. But I really liked the way it started, by making the point that $\sqrt{5}$ is “about 2”. This simplest-possible approximation made a good anchor for the course, and helped reinforce the sort of basic numeracy that helps us understand basically any numerical information we learn. I still need to do a bit more work on the logical flow and transitions, and the idea of limits at infinity is important but doesn’t sit entirely comfortably here.</p>
<p>The section on derivatives and first-order approximations worked wonderfully. This is the section that contains many of the ideas driving this course approach, and I’ve used many of them before, so it makes sense that this worked well.</p>
<p>The section on inverse functions again worked pretty well. It’s pretty easy to justify “solving equations” to students in a math class, and “this equation is too hard so let’s find a way to avoid solving it” is pretty compelling.</p>
<p>And finally the section on the second-order stuff felt pretty strong as well, but could still be improved. While in my head I have a clear picture relating “approximation with a parabola” to the maxima and minima of a function, I don’t know that it came across clearly in the class. And I was feeling a little time pressure by this point; I really wish I had had an extra couple of days of class time.</p>
<h3 id="modeling-is-hard">Modeling is hard</h3>
<p>But the section on modeling needs a lot of work. A lot of the ideas that I wanted to include in here aren’t things I’ve ever taught before, so the material is still a little rough. I also got really sick right when this section was starting, so my preparation probably wasn’t as good as it could have been.</p>
<p>In particular, I wasn’t very satisfied with the section on describing real-world situations in terms of models, and coming up with differential equations. I showed a bunch of examples but don’t know that we really got a clear grasp on the underlying principles as a class. And my homework questions on this modeling process probably contained a bit too much “right answer and wrong answer” for a topic that’s as inherently fuzzy as modeling.</p>
<p>I’m toying with the idea of assigning some problems where I ask students to <em>argue</em> for some modeling choices they make: handling it less like there’s one correct model, and more like there are a bunch of defensible choices. But I don’t know how well I can get that to fit into the calculus class and the framework of first- or second-order ODEs.<strong title="It probably doesn't help that I never actually studied ODEs in any way, so I don't have many of my own examples to draw on."><sup id="fnref:4"><a href="#fn:4" class="footnote">4</a></sup></strong> (Maybe I should do some modeling that doesn’t involve derivatives, since understanding modeling is a goal on its own.)</p>
<p>I also wish I could fit the mean value theorem into the discussion of speed, but proving it really requires a lot of ideas I wanted to hold off on until later. Maybe I should state and explain it here, but then prove it later when the proof comes up for other reasons.</p>
<p>One thing I <em>did</em> really like in this section is the way I introduced the exponential $e^x$ as the solution to the initial value problem $y' = y$, $y(0)=1$. This makes $e$ seem less like a number we made up to torture math students and, again, more like the answer to a question people would reasonably ask.</p>
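<p>As a rough illustration of how Euler’s method approximates the solution of this initial value problem, here is a short Python sketch. The function name <code>euler_exp</code> and the step counts are assumptions chosen for illustration; this is not the code used in the course.</p>

```python
import math

def euler_exp(t_end=1.0, steps=1000):
    """Approximate y(t_end) for y' = y, y(0) = 1 using Euler's method."""
    h = t_end / steps      # step size
    y = 1.0                # initial condition y(0) = 1
    for _ in range(steps):
        y += h * y         # the slope at the current point is y itself
    return y

# More steps give a better approximation of e = y(1).
for steps in (1, 10, 100, 1000):
    print(steps, euler_exp(steps=steps), math.e)
```

<p>With $n$ steps on $[0, 1]$ this computes exactly $(1 + 1/n)^n$, so the sketch recovers the familiar limit definition of $e$ as a side effect.</p>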
<h3 id="final-thoughts">Final thoughts</h3>
<p>Overall, I feel pretty good about this redesign. I’m definitely not going back to the epsilon-delta definitions for this course any time soon, and I think this course will be really strong with a bit of work.</p>
<p>But there are a lot of ideas in the modeling topic that are important but that I don’t quite feel like I’m doing justice to yet. I need to go over that section carefully and figure out how to improve it.</p>
<p>I’m also thinking about moving <em>some</em> of my homework to an online portal. If we take all the “compute these eight derivatives” questions and have them automatically graded, I can use scarce human-grading time to give thorough comments on some more interesting conceptual questions.</p>
<p>To anyone who’s read this entire post, I’d love your feedback—on the course design as a whole, and on how to fix some of the problems I ran into. And if anyone is curious how I handled things, I’d be happy to share my course materials. You can find most of them <a href="https://jaydaigle.net/teaching/courses/2019-spring-114/">on the course page</a> but I’m happy to talk or share more if you’re interested!</p>
<hr />
<p><em>Have ideas about this course plan? Have questions about why I did things?</em> <em>Tweet me <a href="https://twitter.com/profjaydaigle">@ProfJayDaigle</a> or leave a comment below, and let me know!</em></p>
<div class="footnotes">
<ol>
<li id="fn:anxiety">
<p>And a lot of anxiety. After all, the typical student in this course took calculus in high school and then failed the AP exam; they’ve all had at least one not-great experience with the material. <a href="#fnref:anxiety" class="reversefootnote">↩</a></p>
</li>
<li id="fn:analysis">
<p>This same problem arises even in upper-division analysis courses. My undergraduate analysis professor Sandy Grabiner used to say that the point of a first analysis course is to prove that most of the time, what happens is exactly what you would expect to happen, and the second analysis course starts talking about the exceptions. But we tend to hope that our upperclassmen math majors at least are willing to bear with us through the proofs by that point. <a href="#fnref:analysis" class="reversefootnote">↩</a></p>
</li>
<li id="fn:word-problems">
<p>You can read a lot more of my thoughts about this in my <a href="https://jaydaigle.net/blog/why-word-problems/">post on word problems</a>, for instance. <a href="#fnref:word-problems" class="reversefootnote">↩</a></p>
</li>
<li id="fn:4">
<p>It probably doesn’t help that I never actually studied ODEs in any way, so I don’t have many of my own examples to draw on. <a href="#fnref:4" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Jay DaigleNow that the term is over, I want to reflect a bit on the courses I taught, what worked well, and what I might want to do differently next time. (Honestly, it probably would have been more useful to write this sooner after finishing the courses, when they were fresher in my mind. But I don’t have a time machine, so I can’t do much about that now.) In this post I’ll talk about my calculus class; I’ll try to write about the others soon.An Overview of Bayesian Inference2019-02-20T00:00:00-08:002019-02-20T00:00:00-08:00https://jaydaigle.net/blog/overview-of-bayesian-inference<p>A few weeks ago I <a href="https://jaydaigle.net/blog/paradigms-and-priors/">wrote about Kuhn’s theory of paradigm shifts</a> and how it relates to Bayesian inference. In this post I want to back up a little bit and explain what Bayesian inference is, and eventually rediscover the idea of a paradigm shift just from understanding how Bayesian inference works.</p>
<p>Bayesian inference is important in its own right for many reasons beyond just improving our understanding of philosophy of science. Bayesianism is at its heart an extremely powerful mathematical method of using evidence to make predictions. Almost any time you see anyone making predictions that involve probabilities—whether that’s a projection of election results like the ones from <a href="https://fivethirtyeight.com/">FiveThirtyEight</a>, a prediction for the results of a big sports game, or just a weather forecast telling you the chances of rain tomorrow—you’re seeing the results of a Bayesian inference.</p>
<p>Bayesian inference is also the foundation of many machine learning and artificial intelligence tools. Amazon wants to predict how likely you are to buy things. Netflix wants to predict how likely you are to like a show. Image recognition programs want to predict whether that picture contains a bird. And self-driving cars want to predict whether they’re going to crash into that wall.</p>
<p>You’re using tools based on Bayesian inference every day, and probably at this very moment.<strong title="I'm old enough to remember the late nineties, when spam was such a big problem that email became almost unusable. These days when I complain about email spam it's usually my employer sending too many messages out through internal mailing lists; but there was a period in the nineties when for every legitimate email you'd get four or five filled with links to pr0n sites or trying to sell you v1@gr@ and c1@lis CHEAP!!! It was a major problem. Entire conferences were held on developing methods to defeat the spam problem. These days I see about one true spam message like that per _year_. And one major reason for that is the invention of effective spam filters using Bayesian inference to predict whether a given email is spam or legitimate. So you're using Bayesian tools right now purely by _not_ receiving dozens of unwanted pornographic pictures in your email inbox every day."><sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup></strong> So it’s worth understanding how they work.</p>
<hr />
<p>The basic idea of Bayesian inference is that we start with some <em>prior probability</em> that describes what we originally believe the world is like in terms of probability, by specifying the probabilities of various things happening. Then we make observations of the world, and update our beliefs, giving our conclusion as a <em>posterior probability</em>.</p>
<p>As a really simple example: suppose I tell you I’ve flipped a coin, but I don’t tell you how it landed. Your prior is probably a 50% chance that it shows heads, and a 50% chance that it shows tails. After you get to look at the coin, you update your prior beliefs to reflect your new knowledge. Your posterior probability says there is a 100% chance that it shows heads and a 0% chance that it shows tails.<strong title="This particular example is far too simple to really be worth setting up the Bayesian framework, but it gives a pretty direct and explicit demonstration of what all the pieces mean."><sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup></strong></p>
<p>The rule we use to update our beliefs is called <a href="https://en.wikipedia.org/wiki/Bayes_theorem">Bayes’s Theorem</a> (hence the name “Bayesian inference”). Specifically, we use the mathematical formula
\[
P(H |E) = \frac{ P(E|H) P(H)}{P(E)},
\]
where</p>
<ul>
<li>$H$ is some hypothesis we had—some thing we thought might maybe happen—and $P(H)$ is how likely we originally thought that hypothesis was.</li>
<li>$E$ is the <em>evidence</em> we just observed, and $P(E)$ is how likely we originally thought we were to see that evidence.</li>
<li>$P(E|H)$ is the most complicated bit to explain. It tells us, if we assume that our hypothesis $H$ is true, how likely we originally thought seeing the evidence $E$ would be. So it tells us what we would have thought <em>before</em> seeing the new evidence, if we had assumed the hypothesis $H$ was true.</li>
<li>$P(H|E)$ is the new, updated, posterior probability we give to the hypothesis $H$, <em>after</em> seeing the evidence $E$.</li>
</ul>
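<p>One standard fact is useful when applying this formula: the denominator $P(E)$ can be expanded using the law of total probability. Assuming the hypotheses $H_1, \dots, H_n$ (notation introduced here only for illustration) are mutually exclusive and cover every possibility, we have
\[
P(E) = \sum_{i=1}^{n} P(E|H_i) P(H_i), \qquad \text{so} \qquad P(H_j | E) = \frac{P(E|H_j) P(H_j)}{\sum_{i=1}^{n} P(E|H_i) P(H_i)}.
\]
With only two hypotheses this is just $P(E) = P(E|H_1) P(H_1) + P(E|H_2) P(H_2)$.</p>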
<p>Let’s work through a quick example. Suppose I have a coin, and you think that there’s a 50% chance it’s a fair coin, and a 50% chance that it actually has two heads. So we have $P(H_{fair}) = .5$ and $P(H_{unfair}) = .5$.</p>
<p>Now you flip the coin ten times, and it comes up heads all ten times. If the coin is fair, this is pretty unlikely! The probability of that happening is $\left(\frac{1}{2}\right)^{10} = \frac{1}{1024}$, so we have $P(E|H_{fair}) = \frac{1}{1024}$. But if the coin is two-headed, this will definitely happen; the probability of getting ten heads is 100%, or $1$. So when you see this, you probably conclude that the coin is unfair.</p>
<p>Now let’s work through that same chain of reasoning algebraically. If the coin is fair, the probability of seeing ten heads in a row is $\frac{1}{2^{10}} = \frac{1}{1024}$. And if the coin is unfair, the probability is 1. So if we think there’s a 50% chance the coin is fair, and a 50% chance it’s unfair, then the overall probability of seeing ten heads in a row is
\begin{align}
P(E) &= P(H_{fair}) \cdot P(E | H_{fair}) + P(H_{unfair}) \cdot P(E | H_{unfair}) \\
&= .5 \cdot \frac{1}{1024} + .5 \cdot 1 = \frac{1025}{2048} \approx .5005.
\end{align}</p>
<p>By Bayes’s Theorem, we have
\begin{align}
P(H_{fair} | E) &= \frac{ P(E | H_{fair}) P(H_{fair})}{P(E)} \\<br />
& = \frac{ \frac{1}{1024} \cdot .5}{\frac{1025}{2048}} = \frac{1}{1025} \\<br />
P(H_{unfair} | E) & = \frac{ P(E | H_{unfair}) P(H_{unfair})}{P(E)} \\<br />
&= \frac{1 \cdot \frac{1}{2}}{\frac{1025}{2048}} = \frac{1024}{1025}.
\end{align}
Thus we conclude that the probability the coin is fair is $\frac{1}{1025} \approx .001$, and the probability it is two-headed is $\frac{1024}{1025} \approx .999$. This matches what our intuition tells us: if it comes up ten heads in a row, it probably isn’t fair.</p>
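<p>The same computation can be checked mechanically. Here is a minimal Python sketch using exact fractions; the function name <code>bayes_update</code> and the dictionary keys are illustrative assumptions rather than anything from a particular library.</p>

```python
from fractions import Fraction

def bayes_update(prior, likelihood):
    """Return the posterior P(H|E) for every hypothesis H.

    prior:      dict mapping hypothesis -> P(H)
    likelihood: dict mapping hypothesis -> P(E|H)
    (Hypothetical helper; assumes the hypotheses are exhaustive and P(E) > 0.)
    """
    evidence = sum(prior[h] * likelihood[h] for h in prior)   # P(E)
    return {h: prior[h] * likelihood[h] / evidence for h in prior}

# 50/50 prior between a fair coin and a two-headed coin.
prior = {"fair": Fraction(1, 2), "two_heads": Fraction(1, 2)}
# Probability of ten heads in a row under each hypothesis.
ten_heads = {"fair": Fraction(1, 2) ** 10, "two_heads": Fraction(1)}

posterior = bayes_update(prior, ten_heads)
print(posterior)
```

<p>Using <code>Fraction</code> instead of floating-point numbers keeps the answers exact, so they can be compared directly with the hand computation.</p>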
<hr />
<p>But let’s tweak things a bit. Suppose I have a table with a thousand coins, and I tell you that all of them are fair except one two-headed one. You pick one at random, flip it ten times, and see ten heads. Now what do you think?</p>
<p>You have exactly the same <em>evidence</em>, but now your prior is different. Your prior tells you that $P(H_{fair}) = \frac{999}{1000}$ and $P(H_{unfair}) = \frac{1}{1000}$. We can do the same calculations as before. We have
\begin{align}
P(H_{fair}) \cdot P(E | H_{fair}) + P(H_{unfair}) \cdot P(E | H_{unfair}) \\<br />
= \frac{999}{1000} \cdot \frac{1}{1024} + \frac{1}{1000} \cdot 1
\approx .00198
\end{align}</p>
<p>\begin{align}
P(H_{fair} | E) &= \frac{ P(E | H_{fair}) P(H_{fair})}{P(E)} \\<br />
& = \frac{ \frac{1}{1024} \cdot \frac{999}{1000}}{.00198} \approx .494 \\<br />
P(H_{unfair} | E) & = \frac{ P(E | H_{unfair}) P(H_{unfair})}{P(E)} \\<br />
&= \frac{1 \cdot \frac{1}{1000}}{.00198} \approx .506.
\end{align}
So now you should think it’s about equally likely that your coin is fair or unfair. <strong title="The exact probabilities are 999/2023 and 1024/2023. As a bonus, try to see why having some of those exact numbers makes sense, and reassures us that we did this right."><sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup></strong></p>
<p>Why does this happen? If you have a fair coin, then seeing ten heads in a row is pretty unlikely. But having an unfair coin is <em>also</em> unlikely, because of the thousand coins you could have picked, only one was unfair. In this example those two unlikelinesses cancel out almost exactly, leaving us uncertain whether you got a (normal) fair coin and then a surprisingly unlikely result, or if you got a surprisingly unfair coin and then the normal, expected result.</p>
<p>In other words, you should definitely be somewhat surprised to see ten heads in a row. Remember, we worked out that your prior probability of seeing <em>that</em> is just $P(E) \approx .00198$—less than two tenths of a percent! But there are two different ways to get that unusual result, and you don’t know which of those unusual things happened.</p>
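<p>One way to watch the prior and the evidence pull against each other is to update one flip at a time instead of all at once. The Python sketch below does this for the thousand-coin prior; the names <code>update_on_heads</code> and <code>history</code> are illustrative assumptions. Because the flips are independent once we fix which coin we hold, ten single-flip updates give the same posterior as one ten-flip update.</p>

```python
from fractions import Fraction

def update_on_heads(p_fair):
    """One Bayesian update of P(fair) after seeing a single head,
    assuming the only alternative is a two-headed coin."""
    p_two_heads = 1 - p_fair
    evidence = p_fair * Fraction(1, 2) + p_two_heads * 1   # P(heads)
    return p_fair * Fraction(1, 2) / evidence

p_fair = Fraction(999, 1000)    # prior: 999 fair coins out of 1000
history = [p_fair]
for _ in range(10):             # ten heads in a row
    p_fair = update_on_heads(p_fair)
    history.append(p_fair)

for flips, p in enumerate(history):
    print(flips, p, float(p))
```

<p>Each head roughly halves the odds in favor of the fair coin, and it takes about ten of them to cancel out the 999-to-1 head start the prior gave it.</p>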
<hr />
<p>Bayesian inference also does a good job of handling evidence that disproves one of your hypotheses. Suppose you have the same prior we were just discussing: $999$ fair coins, and one two-headed coin. What happens if you flip the coin once and it comes up <em>tails</em>?</p>
<p>Informally, we immediately realize that we can’t be flipping a two-headed coin. It came up tails, after all. So how does this work out in the math?</p>
<p>If the coin is fair, we have a $50\%$ chance of getting tails, and a $50\%$ chance of getting heads. If the coin is unfair, we have a $0\%$ chance of tails and a $100\%$ chance of heads. So we compute:
\begin{align}
P(H_{fair}) \cdot P(E | H_{fair}) + P(H_{unfair}) \cdot P(E | H_{unfair}) \\<br />
= \frac{999}{1000} \cdot \frac{1}{2} + \frac{1}{1000} \cdot 0
= \frac{999}{2000}
\end{align}</p>
<p>\begin{align}
P(H_{fair} | E) &= \frac{ P(E | H_{fair}) P(H_{fair})}{P(E)} \\<br />
& = \frac{ \frac{1}{2} \cdot \frac{999}{1000}}{\frac{999}{2000}} = 1 \\<br />
P(H_{unfair} | E) & = \frac{ P(E | H_{unfair}) P(H_{unfair})}{P(E)} \\<br />
&= \frac{0 \cdot \frac{1}{1000}}{\frac{999}{2000}} = 0.
\end{align}</p>
<p>Thus the math agrees with us: once we see a tails, the probability that we’re flipping a two-headed coin is zero.</p>
<hr />
<p>As long as everything behaves well, we can use these techniques to update our beliefs. In fact, this method is pretty powerful. We can prove that it is the best possible decision rule according to a few different sets of criteria<strong title="There are two really important results that occur to me. Cox's Theorem gives a collection of reasonable-sounding conditions, and proves that Bayesian inference is the only possible rule that satisfies them all. Dutch Book Arguments show that this inference rule protects you from making a collection of bets which are guaranteed to lose you money."><sup id="fnref:4"><a href="#fn:4" class="footnote">4</a></sup></strong>; and there are pretty good guarantees about eventually converging to the right answer after collecting enough evidence.</p>
<p>But there are still a few ways Bayesian inference can go wrong.</p>
<p>What if you get tails and keep flipping the coin—and get ten tails in a row? We’ll still draw the same conclusion: the coin can’t be double-headed, so it’s definitely fair. (You can work through the equations on this if you like; they’ll look just like the last computation I did, but longer). And if we keep flipping and get a thousand tails in a row, or a million, our computation will still tell us yes, the coin is definitely fair.</p>
<p>But before we get to a million flips, we might start suspecting, pretty strongly, that the coin is <em>not</em> fair. When it comes up tails a thousand times in a row, we probably suspect that in fact the coin has two tails. <strong title="No, you can't just check this by looking at the coin. Because I said so. More seriously, it's pretty common to have experiments where you can see the results, but can't inspect the mechanism by which those results are reached. In a particle collider you can see the tracks of exiting particles, but you can't actually observe the collision. In an educational study, you can look at students' test results, but you can't look inside their brains and observe exactly when the learning happens. So it's useful for this thought experiment to assume we can see how the coin lands, but can never look at both sides at the same time."><sup id="fnref:5"><a href="#fn:5" class="footnote">5</a></sup></strong> So why doesn’t the math reflect this at all?</p>
<p>In this case, we made a mistake at the very beginning. Our prior told us that there was a $99.9\%$ chance we had a fair coin, and a $.1\%$ chance that we had a coin with two heads. And that means that our prior left no room for the possibility that our coin did anything else. We said our prior was
\[
P(H_{fair}) = \frac{999}{1000} \qquad P(H_{unfair}) = \frac{1}{1000};
\]
but we really should have said
\[
P(H_{fair}) = \frac{999}{1000} \qquad P(H_{two\ heads}) = \frac{1}{1000} \qquad P(H_{two\ tails}) = 0.
\]
And since we started with the belief that a two-tailed coin was <em>impossible</em>, no amount of evidence will cause us to change our beliefs. Thus Bayesian inference follows the old rule of Sherlock Holmes: “when you have excluded the impossible, whatever remains, however improbable, must be the truth.”</p>
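<p>Here is a small Python sketch of this trap. It compares the prior above, which gives a two-tailed coin probability zero, with a hypothetical alternative prior that reserves a small probability for it. The alternative prior values and all function names are assumptions made up for illustration.</p>

```python
from fractions import Fraction

def bayes_update(prior, likelihood):
    """Posterior P(H|E) for each hypothesis (assumes P(E) > 0)."""
    evidence = sum(prior[h] * likelihood[h] for h in prior)   # P(E)
    return {h: prior[h] * likelihood[h] / evidence for h in prior}

# P(tails | H) for each kind of coin.
p_tails = {"fair": Fraction(1, 2), "two_heads": Fraction(0), "two_tails": Fraction(1)}

def after_tails(prior, flips):
    """Update the prior once for each of `flips` consecutive tails."""
    beliefs = dict(prior)
    for _ in range(flips):
        beliefs = bayes_update(beliefs, p_tails)
    return beliefs

# The prior from the text: a two-tailed coin is considered impossible.
stuck = after_tails(
    {"fair": Fraction(999, 1000), "two_heads": Fraction(1, 1000), "two_tails": Fraction(0)},
    flips=1000,
)

# A hypothetical prior that leaves a little room for a two-tailed coin.
open_minded = after_tails(
    {"fair": Fraction(998, 1000), "two_heads": Fraction(1, 1000), "two_tails": Fraction(1, 1000)},
    flips=10,
)

print(stuck["fair"], stuck["two_tails"])
print(open_minded["fair"], open_minded["two_tails"])
```

<p>With the first prior, even a thousand tails leaves the two-tailed hypothesis at exactly zero. With the second, ten tails are already enough to make it slightly more likely than a fair coin.</p>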
<hr />
<p>This example demonstrates both the power and the problems of doing Bayesian inference. The power is that it reflects what we already know. If something is known to be quite rare, then we probably didn’t just encounter it. (It’s more likely that I saw a random bear than a sasquatch—and that’s true even if sasquatch exist, since bear sightings are clearly more common). And if something is outright impossible, we don’t need to spend a lot of time thinking about the implications of it happening.</p>
<p>The problem is that in pure Bayesian inference, you’re trapped by your prior. If your prior thinks the “true” hypothesis is possible, then eventually, with enough evidence, you will conclude that the true hypothesis is extremely likely. But if your prior gives no probability to the true hypothesis, then no amount of evidence can ever change your mind. If you start out with $P(H) = 0$, then it is mathematically impossible to update your prior to believe that $H$ is possible.</p>
<p>But Douglas Adams neatly explained the flaw in the Sherlock Holmes principle in the voice of his character Dirk Gently:</p>
<blockquote>
<p>The impossible often has a kind of integrity to it which the merely improbable lacks. How often have you been presented with an apparently rational explanation of something that works in all respects other than one, which is that it is hopelessly improbable?…The first idea merely supposes that there is something we don’t know about, and God knows there are enough of those. The second, however, runs contrary to something fundamental and human which we do know about. We should therefore be very suspicious of it and all its specious rationality.</p>
</blockquote>
<p>In real life, when we see something we had thought was extremely improbable, we often reconsider our beliefs about what is possible. Maybe there’s some possibility we had originally dismissed, or not even considered, that makes our evidence look reasonable or even likely; and if we change our prior to include that possibility, suddenly our evidence makes sense. This is the “paradigm shift” I talked about in my <a href="https://jaydaigle.net/blog/paradigms-and-priors/">recent post on Thomas Kuhn</a>, and extremely unlikely evidence, like our extended series of tails, is a Kuhnian anomaly.</p>
<p>But rethinking your prior isn’t really allowed by the mathematics and machinery of Bayesian inference—it’s something <em>else</em>, something <em>outside</em> of the procedure, that we do to cover for the shortcomings of unaugmented Bayesianism.</p>
<hr />
<p>Let’s return to the coin-flipping thought experiment; there’s one other way it can go wrong that I want to tell you about. Suppose you fix your prior to acknowledge the possibility that the coin is two-headed <em>or</em> two-tailed. (We could even set up our prior to include the possibility that the coin is two-sided but biased, so that the coin comes up heads 70% of the time, say. I’m going to ignore this case completely because it makes the calculations a lot more complicated and doesn’t actually clarify anything. But it’s important that we <em>can</em> do that if we want to.)<strong title="Gelman and Nolan have argued that it's not physically possible to bias a coin flip in this way. This is arguably another reason to ignore the possibility that a coin is biased. And if you believe Gelman and Nolan's argument, then you _should_ have a low or zero prior probability that the coin is biased. But the actual reason I'm ignoring it is to avoid computing integrals in public."><sup id="fnref:6"><a href="#fn:6" class="footnote">6</a></sup></strong></p>
<p>You assign the prior probabilities
\[
P(H_{fair}) = \frac{98}{100} \qquad P(H_{two\ heads}) = \frac{1}{100} \qquad P(H_{two\ tails}) = \frac{1}{100},
\]
giving a 1% chance of each possible double-sided coin. (This is a higher chance than you gave it before, but clearly when I give you these coins I’ve been messing with you, so you should probably be less certain of everything). You flip the coin.</p>
<p><a href="https://youtu.be/M0I-xm7iCBU?t=15">And it lands on its edge.</a></p>
<p>What does our rule of inference tell us now? We can try to do the same calculations we did before. The first thing we need to calculate is $P(E)$, which is easy. We started out by assuming this couldn’t happen, so the prior probability of seeing the coin landing on its side is zero!</p>
<p>(Algebraically, a fair coin has a 50% chance of heads and a 50% chance of tails. So if the coin is fair, then $P(E|H_{fair}) = 0$. But if the coin has a 100% chance of heads, then $P(E| H_{two\ heads}) = 0$. And if the coin has a 100% chance of tails, then $P(E| H_{two\ tails}) = 0$. Thus
\begin{align}
P(E) &= P(E|H_{fair}) \cdot P(H_{fair}) + P(E|H_{two\ heads}) \cdot P(H_{two\ heads}) + P(E|H_{two\ tails}) \cdot P(H_{two\ tails}) \\<br />
& = 0 \cdot \frac{98}{100} + 0 \cdot \frac{1}{100} + 0 \cdot \frac{1}{100} = 0.
\end{align}
So we conclude that $P(E) = 0$).</p>
<p>Now we can actually calculate our new, updated, posterior probabilities—or can we? We have the formula that
\[
P(H_{fair} | E) = \frac{ P(E | H_{fair}) P(H_{fair})}{P(E)}.
\]
But with the probabilities we just calculated, this works out to
\[
P(H_{fair} | E) = \frac{ 0 \cdot \frac{98}{100}}{0} = \frac{0}{0}.
\]
And our calculation has broken down completely; $\frac{0}{0}$ isn’t a <em>number</em>, let alone a useful probability.</p>
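<p>If you’d rather watch a computer hit the same wall, here is a minimal sketch of that update in Python. The priors are the ones from the text; the function name <code>posterior</code> and the dictionary layout are just my own choices for illustration, and I use exact fractions so nothing gets hidden by rounding.</p>

```python
from fractions import Fraction

# Priors from the text: fair, two-headed, two-tailed.
priors = {
    "fair": Fraction(98, 100),
    "two heads": Fraction(1, 100),
    "two tails": Fraction(1, 100),
}

# Likelihood of "lands on its edge" under each hypothesis: zero everywhere.
likelihood_edge = {name: Fraction(0) for name in priors}


def posterior(priors, likelihoods):
    """Apply Bayes's rule; raises ZeroDivisionError when P(E) is zero."""
    p_evidence = sum(likelihoods[h] * priors[h] for h in priors)
    return {h: likelihoods[h] * priors[h] / p_evidence for h in priors}


try:
    result = posterior(priors, likelihood_edge)
except ZeroDivisionError:
    result = None  # the update is undefined: 0/0

print(result)
```

<p>The same function works fine for any observation the prior allows, such as a single flip of heads. It only fails when every hypothesis says the evidence is impossible.</p>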
<p>Even more so than the last example, this is a serious Kuhnian anomaly. If we ever try to update and get $\frac{0}{0}$ as a response, something has gone wrong. We had said that something was totally impossible, and then it happened. All we can do is back up and choose a new prior.</p>
<p>And Bayesian inference can’t tell us how to do that.</p>
<p>There are a few different ways people try to get around this problem. But that’s another post.</p>
<hr />
<p><em>Questions about this post? Was something confusing or unclear? Or are there other things you want to know about Bayesian reasoning?</em> <em>Tweet me <a href="https://twitter.com/profjaydaigle">@ProfJayDaigle</a> or leave a comment below, and let me know!</em></p>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>I’m old enough to remember the late nineties, when spam was such a big problem that email became almost unusable. These days when I complain about email spam it’s usually my employer sending too many messages out through internal mailing lists; but there was a period in the nineties when for every legitimate email you’d get four or five filled with links to pr0n sites or trying to sell you v1@gr@ and c1@lis CHEAP!!! It was a major problem. Entire conferences were held on developing methods to defeat the spam problem.</p>
<p>These days I see about one true spam message like that per <em>year</em>. And one major reason for that is the invention of effective spam filters using Bayesian inference to predict whether a given email is spam or legitimate. So you’re using Bayesian tools right now purely by <em>not</em> receiving dozens of unwanted pornographic pictures in your email inbox every day. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>This particular example is far too simple to really be worth setting up the Bayesian framework, but it gives a pretty direct and explicit demonstration of what all the pieces mean. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3">
<p>The exact probabilities are 999/2023 and 1024/2023. As a bonus, try to see why having some of those exact numbers makes sense, and reassures us that we did this right. <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
<li id="fn:4">
<p>I’m primarily thinking of two really important results here. <a href="https://en.wikipedia.org/wiki/Cox's_theorem">Cox’s Theorem</a> gives a collection of reasonable-sounding conditions, and proves that Bayesian inference is the only possible rule that satisfies them all. <a href="https://plato.stanford.edu/entries/dutch-book/">Dutch Book Arguments</a> show that this inference rule protects you from making a collection of bets which are guaranteed to lose you money. <a href="#fnref:4" class="reversefootnote">↩</a></p>
</li>
<li id="fn:5">
<p>No, you can’t just check this by looking at the coin. Because I said so.</p>
<p>More seriously, it’s pretty common to have experiments where you can see the results, but can’t inspect the mechanism by which those results are reached. In a particle collider you can see the tracks of exiting particles, but you can’t actually observe the collision. In an educational study, you can look at students’ test results, but you can’t look inside their brains and observe exactly when the learning happens. So it’s useful for this thought experiment to assume we can see how the coin lands, but can never look at both sides at the same time. <a href="#fnref:5" class="reversefootnote">↩</a></p>
</li>
<li id="fn:6">
<p>Gelman and Nolan have argued that it’s <a href="https://www.tandfonline.com/doi/abs/10.1198/000313002605">not physically possible to bias a coin flip in this way</a>. This is arguably another reason to ignore the possibility that a coin is biased. And if you believe Gelman and Nolan’s argument, then you <em>should</em> have a low or zero prior probability that the coin is biased. But the actual reason I’m ignoring it is to avoid computing integrals in public. <a href="#fnref:6" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Jay DaigleA few weeks ago I wrote about Kuhn’s theory of paradigm shifts and how it relates to Bayesian inference. In this post I want to back up a little bit and explain what Bayesian inference is, and eventually rediscover the idea of a paradigm shift just from understanding how Bayesian inference works.Paradigms and Priors2019-01-15T00:00:00-08:002019-01-15T00:00:00-08:00https://jaydaigle.net/blog/paradigms-and-priors<p>Scott Alexander at <a href="https://slatestarcodex.com/">Slate Star Codex</a> has been blogging lately about Thomas Kuhn and the idea of paradigm shifts in science. This is a topic near and dear to my heart, so I wanted to take the opportunity to share some of my thoughts and answer some questions that Scott asked in his posts.</p>
<h3 id="the-big-idea">The Big Idea</h3>
<p>I’m going to start with my own rough summary of what I take from Kuhn’s work. But since this is all in response to Scott’s <a href="https://slatestarcodex.com/2019/01/08/book-review-the-structure-of-scientific-revolutions/">book review of <em>The Structure of Scientific Revolutions</em></a>, you may want to read his post first.</p>
<p>The main idea I draw from Kuhn’s work is that science and knowledge aren’t only, or even primarily, a collection of facts. Observing the world and incorporating evidence is <em>important</em> to learning about the world, but evidence can’t really be interpreted or used without a prior framework or model through which to interpret it. For example, check out <a href="https://twitter.com/OrbenAmy/status/1084856550383149057">this Twitter thread</a>: researchers were able to draw thousands of different and often mutually contradictory conclusions from a single data set by varying the theoretical assumptions they used to analyze it.</p>
<p>Kuhn also provided a response to <a href="https://en.wikipedia.org/wiki/Falsifiability#Falsificationism">Popperian falsificationism</a>. No theory can ever truly be falsified by observation, because you can force almost any observation to match most theories with enough special cases and extra rules added in. And it’s often quite difficult to tell whether a given extra rule is an important development in scientific knowledge, or merely motivated reasoning to protect a familiar theory. After all, if you claim that objects with different weights fall at the same speed, you then have to explain why that doesn’t apply to bowling balls and feathers.</p>
<p>This is often described as the <em>theory-ladenness of observation</em>. Even when we think we’re directly perceiving things, those perceptions are always mediated by our theories of how the world works and can’t be fully separated from them. This is most obvious when engaging in a complicated indirect experiment: there’s a lot of work going on between “I’m hearing a <a href="https://en.wikipedia.org/wiki/Geiger_counter">clicking sound</a> from this thing I’m holding in my hand” and “a bunch of atoms just ejected alpha particles from their nuclei”.</p>
<p>But even in more straightforward scenarios, any inference comes with a lot of theory behind it. I drop two things that weigh different amounts, and see that the heavier one falls faster—proof that Galileo was wrong!</p>
<p>Or even more mundanely: I look through my window when I wake up, see a puddle, and conclude that it rained overnight. Of course I’m relying on the assumption that when I look through my window I actually see what’s on the other side of it, and not, say, a clever science-fiction style holoscreen. But more importantly, my conclusion that it rained depends on a lot of assumptions I normally wouldn’t explicitly mention—that rain would leave a puddle, and that my patio would be dry if it hadn’t rained.</p>
<p>(In fact, I discovered several months after moving in that my air conditioner condensation tray overflows on hot days. So the presence of puddles doesn’t actually tell me that it rained overnight).</p>
<p>Even direct perception, what we can see right in front of us, is mediated by internal modeling our brains do to put our observations into some comprehensible context. This is why optical illusions work so well; they hijack the modeling assumptions of your perceptual system to make you “see” things that aren’t there.</p>
<p style="text-align:center"><img src="/assets/blog/scintillating-grid-illusion.png" alt="An example of the Scintillating Grid illusion." /></p>
<p style="text-align:center"><em>There are <a href="https://en.wikipedia.org/wiki/Grid_illusion#Scintillating_grid_illusion">no black dots in this picture</a>.</em> <br />
<em>Who are you going to believe: me, or your own eyes?</em></p>
<hr />
<h3 id="what-does-this-tell-us-about-science">What does this tell us about science?</h3>
<p>Kuhn divides scientific practice into three categories. The first he calls pre-science, where there is no generally accepted model to interpret observations. Most of life falls into this category—which makes sense, because most of life isn’t “science”. Subjects like history and psychology with multiple competing “schools” of thought are pre-scientific, because while there are a number of useful and informative models that we can use to understand parts of the subject, no single model provides a coherent shared context for all of our evidence. There is no unifying consensus perspective that basically explains everything we know.</p>
<p>A model that does achieve such a coherent consensus is called a <em>paradigm</em>. A paradigm is a theory that explains all the known evidence in a reasonable and satisfactory way. When there is a consensus paradigm, Kuhn says that we have “normal science”. And in normal science, the idea that scientists are just collecting more facts actually makes sense. Everyone is using the same underlying theory, so no one needs to spend time arguing about it; the work of science is just to collect more data to interpret within that theory.</p>
<p>But sometimes during the course of normal science you find <em>anomalies</em>, evidence that your paradigm can’t readily explain. If you have one or two anomalies, the best response is to assume that they really are anomalies—there’s something weird going on there, but it isn’t a problem for the paradigm.</p>
<p>A great example of an unimportant anomaly is the <a href="https://en.wikipedia.org/wiki/OPERA_experiment">OPERA experiment</a> from a few years ago that measured neutrinos traveling faster than the speed of light. This meant one of two things: either special relativity, a key component of the modern physics paradigm, was wrong; or there was an error somewhere in a delicate measurement process. Pretty much everyone assumed that the measurement was flawed, and pretty much everyone was right.</p>
<p>In contrast, sometimes the anomalies aren’t so easy to resolve. Scientists find more and more anomalies, more results that the dominant paradigm can’t explain. It becomes clear the paradigm is flawed, and can’t provide a satisfying explanation for the evidence. At this point people start experimenting with other models, and with luck, eventually find something new and different that explains all the evidence, old and new, normal and anomalous. A new paradigm takes over, and normal science returns.</p>
<p>(Notice that the old paradigm was never <em>falsified</em>, since you can always add epicycles to make the new data fit. In fact, the proverbial “epicycles” were added to the Ptolemaic model of the solar system to make it fit astronomical observations. In the early days of the Copernican model, it actually fit the evidence worse than the Ptolemaic model did—but it didn’t require the convoluted epicycles that made the Ptolemaic model work. Sabine Hossenfelder describes this process as, not falsification, but “implausification”: “a continuously adapted theory becomes increasingly difficult and arcane—not to say ugly—and eventually practitioners lose interest.”)</p>
<p>Importantly, Kuhn argued that two different paradigms would be <em>incommensurable</em>, so different from each other that communication between them is effectively impossible. I think this is sometimes overblown, but also often underestimated. Imagine trying to explain a modern medical diagnosis to someone who believes in four humors theory. Or remember how difficult it is to have conversations with someone whose politics are very different from your own; the background assumptions about how the world works are sometimes so different that it’s hard to agree even on basic facts.<strong title="If you're interested in the political angle on this more than the scientific, check out the talk I gave at TedxOccidentalCollege last year at https://www.youtube.com/watch?v=aTSrHfv9C94."><sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup></strong></p>
<h3 id="scotts-example-questions">Scott’s example questions</h3>
<p>Now I can turn to the very good questions Scott asks in section II of his book review.</p>
<blockquote>
<p>For example, consider three scientific papers I’ve looked at on this blog recently….What paradigm is each of these working from?</p>
</blockquote>
<p>As a preliminary note, if we’re maintaining the Kuhnian distinction between a paradigm on the one hand and a model or a school of thought on the other, it is plausible that none of these are working in true paradigms. One major difficulty in many fields, especially the social sciences, is that there isn’t a paradigm that unifies all our disparate strands of knowledge. But asking what possibly-incommensurable <em>model</em> or <em>theory</em> these papers are working from is still a useful and informative exercise.</p>
<p>I’m going to discuss the first study Scott mentions in a fair amount of depth, because it turned out I had a lot to say about it. I’ll follow that up by making briefer comments on his other two examples.</p>
<h4 id="cipriani-ioannidis-et-al">Cipriani, Ioannidis, et al.</h4>
<blockquote>
<p>– Cipriani, Ioannidis, et al perform a meta-analysis of antidepressant effect sizes and find that although almost all of them seem to work, amitriptyline works best.</p>
</blockquote>
<p>This is actually a great example of some of the ways paradigms and models shape science. The study is a meta-analysis of various antidepressants to assess their effectiveness. So what’s the underlying model here?</p>
<p>Probably the best answer is: “depression is a real thing that can be caused or alleviated by chemicals”. Think about how completely incoherent this entire study would seem to a Szaszian who thinks that mental illnesses are just choices made by people with weird preferences, to a medieval farmer who thinks mental illnesses are caused by demonic possession, or to a natural-health advocate who thinks that “chemicals” are bad for you. The medical model of mental illness is powerful and influential enough that we often don’t even notice we’re relying on it, or that there are alternatives. But it’s not the only model that we could use.<strong title="In fact, this was my third or fourth answer in the first draft of this section. Then I looked at it again and realized it was by far the _best_ answer. That's how paradigms work: as long as everything is functioning normally, you don't even have to think about the fact that they're there."><sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup></strong></p>
<hr />
<p>While this is the best answer to Scott’s question, it’s not the only one. When Scott <a href="https://slatestarcodex.com/2018/02/26/ssc-journal-club-cipriani-on-antidepressants/">originally wrote about this study</a> he compared it to one he had done himself, which got very different results. Since they’re (mostly) studying the same drugs, in the same world, they “should” get similar results. But they don’t. Why not?</p>
<p>I’m not in any position to actually answer that question, since I don’t know much about psychiatric medications. But I <em>can</em> point out one very plausible reason: the studies made different modeling assumptions. And Scott highlights some of these assumptions himself in his analysis. For instance, he looks at the way Cipriani et al. control for possible bias in studies:</p>
<blockquote>
<p>I’m actually a little concerned about the exact way he did this. If a pharma company sponsored a trial, he called the pharma company’s drug’s results biased, and the comparison drugs unbiased….</p>
</blockquote>
<blockquote>
<p>But surely if Lundbeck wants to make Celexa look good [relative to clomipramine], they can either finagle the Celexa numbers upward, finagle the clomipramine numbers downward, or both. If you flag Celexa as high risk of being finagled upwards, but don’t flag clomipramine as at risk of being finagled downwards, I worry you’re likely to understate clomipramine’s case.</p>
</blockquote>
<blockquote>
<p>I make a big deal of this because about a dozen of the twenty clomipramine studies included in the analysis were very obviously pharma companies using clomipramine as the comparison for their own drug that they wanted to make look good; I suspect some of the non-obvious ones were too. If all of these are marked as “no risk of bias against clomipramine”, we’re going to have clomipramine come out looking pretty bad.</p>
</blockquote>
<p>Cipriani et al. had a model for which studies were producing reliable data, and fed it into their meta-analysis. Notice they aren’t denying or ignoring the numbers that were reported, but they <em>are</em> interpreting them differently based on background assumptions they have about the way studies work. And Scott is disagreeing with those assumptions and suggesting a different set of assumptions instead.</p>
<p>(For bonus points, look at <em>why</em> Scott flags this specific case. Cipriani et al. rated clomipramine badly, but Scott’s experience is that clomipramine is quite good. This is one of Kuhn’s paradigm-violating anomalies: the model says you should expect one result, but you observe another. Sometimes this causes you to question the observation; sometimes a drug that “everyone knows” is great actually doesn’t do very much. But sometimes it causes you to question the model instead.)</p>
<p>Scott’s model here isn’t really incommensurable with Cipriani et al.’s in a deep sense. But the difference in models does make <em>numbers</em> incommensurable. An odds ratio of 1.5 means something very different if your model expects it to be biased downwards than it does if you expect it to be neutral—or biased upwards. You can’t escape this sort of assumption just by “looking at the numbers”.</p>
<p>And this is true even though Scott and Cipriani et al. are largely working with the same sorts of models. They both believe in the medical model of mental illness. Their paradigm does include the idea that randomized controlled trials work, as Scott suggests in his piece. A bit more subtly, their shared paradigm also includes whatever instruments they use to measure antidepressant effectiveness. Since Cipriani et al. is actually a meta-analysis, they don’t address this directly. But each study they include is probably using some sort of questionnaire to assess how depressed people are. The numbers they get are only coherent or meaningful at all if you think that questionnaire is measuring something you care about.</p>
<hr />
<p>There’s one more paradigm choice here that I want to draw attention to, because it’s important, and because I know Scott is interested in it, and because we may be in the middle of a moderate paradigm shift right now.</p>
<p>Studies like this one tend to assume that a given drug will work about the same for everyone. And then people find that no antidepressant works consistently for everyone, and they all have small effect sizes, and conclude that maybe antidepressants aren’t very useful. But that’s hard to square with the fact that people regularly report massive benefits from going on antidepressants. We found an anomaly!</p>
<p>A number of researchers, including Scott himself, have suggested that any given person will respond well to some antidepressants and poorly to others. So when a study says that bupropion (or whatever) has a small effect on average, maybe that doesn’t mean bupropion isn’t helping anyone. Maybe instead it’s helping some people quite a lot, and it’s completely useless for other people, and so on average its effect is small but positive.</p>
<p>But this is a completely different way of thinking clinically and scientifically about these drugs. And it potentially undermines the entire idea behind meta-analyses like Cipriani et al. If our data is useless because we’re doing too much averaging, then averaging all our averages together isn’t really going to help. Maybe we should be doing something entirely different. We just need to figure out what.</p>
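<p>To make the averaging problem concrete, here is a toy calculation. All of the numbers are invented for illustration and don’t come from any study: ten hypothetical patients, and an “improvement score” on some unspecified scale. One group gets a modest benefit across the board; in the other, two people improve a lot and eight don’t improve at all.</p>

```python
from statistics import mean

# Hypothetical improvement scores, made up for illustration only.
uniform = [2.0] * 10              # everyone improves a little
mixed = [10.0] * 2 + [0.0] * 8    # two people improve a lot, eight not at all

print(mean(uniform), mean(mixed))
print(max(uniform), max(mixed))
```

<p>Both groups have the same average, so a study that only reports the mean can’t tell them apart, even though they describe very different drugs.</p>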
<h4 id="ceballos-ehrlich-et-al">Ceballos, Ehrlich et al.</h4>
<blockquote>
<p>– Ceballos, Ehrlich, et al calculate whether more species have become extinct recently than would be expected based on historical background rates; after finding almost 500 extinctions since 1900, they conclude they definitely have.</p>
</blockquote>
<p>I actually think Scott mostly answers his own questions here.</p>
<blockquote>
<p>As for the extinction paper, surely it can be attributed to some chain of thought starting with Cuvier’s catastrophism, passing through Lyell, and continuing on to the current day, based on the idea that the world has changed dramatically over its history and new species can arise and old ones disappear. But is that “the” paradigm of biology, or ecology, or whatever field Ceballos and Lyell are working in? Doesn’t it also depend on the idea of species, a different paradigm starting with Linnaeus and developed by zoologists over the ensuing centuries? It look like it dips into a bunch of different paradigms, but is not wholly within any.</p>
</blockquote>
<p>The paper is using a model where</p>
<ul>
<li>Species is a real and important distinction;</li>
<li>Species extinction is a thing that happens and matters;</li>
<li>Their calculated background rate for extinction is the relevant comparison.</li>
</ul>
<p>(You can in fact see a lot of their model/paradigm come through pretty clearly in the “Discussion” section of the paper— which is good writing practice.)</p>
<p>Scott seems concerned that it might dip into a whole bunch of paradigms, but I don’t think that’s really a problem. Any true unifying paradigm will include more than one big idea; on the other hand, if there isn’t a true paradigm, you’d expect research to sometimes dip into multiple models or schools of thought. My impression is that biology is closer to having a real paradigm than not, but I can’t say for sure.</p>
<h4 id="terrell-et-al">Terrell et al.</h4>
<blockquote>
<p>– Terrell et al examine contributions to open source projects and find that men are more likely to be accepted than women when adjusted for some measure of competence they believe is appropriate, suggesting a gender bias.</p>
</blockquote>
<p>Social science tends to be less paradigm-y than the physical sciences, and this sort of politically-charged sociological question is probably the least paradigm-y of all, in that there’s no well-developed overarching framework that can be used to explain and understand data. If you can look at a study and know that people will immediately start arguing about what it “really means”, there’s probably no paradigm.</p>
<p>There is, however, a model underlying any study like this, as there is for any sort of research. Here I’d summarize it something like:</p>
<ul>
<li>Gender is an interesting and important construct;</li>
<li>Acceptance rates for pull requests are a measure of (perceived) code quality;</li>
<li>Their program that evaluated “obvious gender cues” does a good job of evaluating gender as perceived by other GitHub users;</li>
<li>The “insider versus outsider” measure they report is important;</li>
<li>The confounders they check are important, and the confounders they don’t check aren’t.</li>
</ul>
<p>Basically, any time you get to do some comparisons and not others, or report some numbers and not others, you have to fall back on a model or paradigm to tell you which comparisons are actually important. Without some guiding model, you’d just have to report every number you measured in a giant table.</p>
<p>Now, sometimes people actually do this. They measure a whole bunch of data, and then they try to correlate everything with everything else, and see what pops up. This is <a href="https://statmodeling.stat.columbia.edu/2017/01/30/no-guru-no-method-no-teacher-just-nature-garden-forking-paths/">not usually good research practice</a>.</p>
<p>If you had exactly this same paper except, instead of “men and women” it covered “blondes and brunettes”, you’d probably be <em>able</em> to communicate the content of the paper to other people; but they’d probably look at you kind of funny, because why would that possibly matter?</p>
<h3 id="anomalies-and-bayes">Anomalies and Bayes</h3>
<p>Possibly the most interesting thing Scott has posted is his <a href="https://slatestarcodex.com/2019/01/10/paradigms-all-the-way-down/">Grand Unified Chart</a> relating Kuhnian theories to related ideas in other disciplines. The chart takes the Kuhnian ideas of “paradigm”, “data”, and “anomaly” and identifies equivalents from other fields. (I’ve flipped the order of the second and third columns here). In political discourse Scott relates them to “ideology”, “facts”, and “cognitive dissonance”; in psychology he relates them to “prediction”, “sense data”, and “surprisal”.</p>
<p>In the original version of the chart, several entries in the “anomalies” column were left blank. He has since filled some of them in, and removed a couple of other rows. I think his answer for the “Bayesian probability” row is wrong; but I think it’s interestingly wrong, in a way that effectively illuminates some of the philosophical and practical issues with Bayesian reasoning.</p>
<p>A quick informal refresher: in Bayesian inference, we start with some <em>prior probability</em> that describes what we originally believe the world is like, by specifying the probabilities of various things happening. Then we make observations of the world, and update our beliefs, giving our conclusion as a <em>posterior probability</em>.</p>
<p>The rule we use to update our beliefs is called <a href="https://en.wikipedia.org/wiki/Bayes_theorem">Bayes’s Theorem</a> (hence the name “Bayesian inference”). Specifically, we use the mathematical formula
\[
P(H |E) = \frac{ P(E|H) P(H)}{P(E)},
\]
where $P$ is the probability function, $H$ is some hypothesis, and $E$ is our new evidence.</p>
<p>I have often drawn the same comparison Scott draws between a Kuhnian paradigm and a Bayesian prior. (They’re not exactly the same, and I’ll come back to this in a bit). And certainly Kuhnian “data” and Bayesian “evidence” correspond pretty well. But the Bayesian equivalent of the Kuhnian anomaly isn’t really the KL-divergence that Scott suggests.</p>
<p>KL-divergence is a mathematical way to measure how far apart two probability distributions are. So it’s an appropriate way to look at two priors and tell how different they are. But you never directly observe a probability distribution—just a collection of data points—so KL-divergence doesn’t tell you how surprising your data is. (Your prior does that on its own).</p>
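<p>For the curious, here is a small sketch of what computing a KL-divergence between two priors looks like. The two priors are hypothetical, chosen only for illustration, and the helper <code>kl_divergence</code> is my own name for the standard discrete formula $\sum_x p(x) \log_2 \frac{p(x)}{q(x)}$. It assumes the second distribution gives positive probability to everything the first one does.</p>

```python
import math


def kl_divergence(p, q):
    """D_KL(p || q) in bits for two distributions over the same outcomes."""
    return sum(p[x] * math.log2(p[x] / q[x]) for x in p if p[x] != 0)


# Two hypothetical priors over the same three hypotheses about a coin.
prior_a = {"fair": 0.98, "two heads": 0.01, "two tails": 0.01}
prior_b = {"fair": 0.50, "two heads": 0.25, "two tails": 0.25}

print(kl_divergence(prior_a, prior_a))  # identical priors
print(kl_divergence(prior_a, prior_b))
print(kl_divergence(prior_b, prior_a))  # not symmetric
```

<p>Notice that both inputs are whole distributions. There is nowhere in this function to plug in a single observation.</p>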
<p>But “surprising evidence” isn’t the same thing as an anomaly. If you make a new observation that was likely under your prior, you get an updated posterior probability and everything is fine. And if you make a new observation that was unlikely under your prior, you get an updated posterior probability and everything is fine. As long as the true<strong title=""True" isn't really the most accurate word to use here, but it works well enough and I want to avoid another thousand-word digression on the subject of metaphysics."><sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup></strong> hypothesis is in your prior at all, you’ll converge to it with enough evidence; that’s one of the great strengths of Bayesian inference. So even a very surprising observation doesn’t force you to rethink your model.</p>
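<p>Here is a rough sketch of that convergence, under assumptions I’m making up for the example: a prior that gives a two-headed coin only a one-in-a-million chance, and a run of thirty heads in a row. The function <code>update</code> is just Bayes’s rule applied to a dictionary of hypotheses.</p>

```python
from fractions import Fraction


def update(prior, likelihood):
    """One Bayesian update: multiply by the likelihood, then renormalize."""
    p_evidence = sum(likelihood[h] * prior[h] for h in prior)
    return {h: likelihood[h] * prior[h] / p_evidence for h in prior}


# Hypothetical prior: the coin is almost certainly fair.
belief = {
    "fair": Fraction(999999, 1000000),
    "two heads": Fraction(1, 1000000),
}

# Probability of seeing heads under each hypothesis.
heads = {"fair": Fraction(1, 2), "two heads": Fraction(1)}

for flip in range(30):
    belief = update(belief, heads)

print(float(belief["two heads"]))
```

<p>Every one of those thirty updates is a perfectly ordinary calculation, even though the evidence as a whole was wildly unlikely under the starting prior.</p>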
<p>In contrast, if you make a new observation that was <em>impossible</em> under your prior, you hit a literal divide-by-zero error. If your prior says that $E$ can’t happen, then you can’t actually carry out the Bayesian update calculation, because Bayes’s rule tells you to divide by $P(E)$—which is zero. And this is the Bayesian equivalent of a Kuhnian anomaly.</p>
<p>We can imagine a <a href="https://en.wikipedia.org/wiki/Liar!_(short_story)">robot in an Asimov short story</a> encountering this situation, trying to divide by zero, and crashing fatally. But people aren’t quite so easy to crash, and an intelligently designed AI wouldn’t be either. We can do something that a simple Bayesian inference algorithm doesn’t allow: we can invent a new prior and start over from the beginning. We can shift paradigms.</p>
<hr />
<p>A theoretically perfect Bayesian inference algorithm would start with a <em>universal prior</em>—a prior that gives positive probability to every conceivable hypothesis and every describable piece of evidence. No observation would ever be impossible under the universal prior, so no update would require division by zero.</p>
<p>But it’s easier to talk about such a prior than it is to actually come up with one. The usual example I hear is the <a href="https://en.wikipedia.org/wiki/Solomonoff_induction">Solomonoff prior</a>, but it is known to be uncomputable. I would guess that any useful universal prior would be similarly uncomputable. But even if I’m wrong and a theoretically computable universal prior exists, there’s definitely no way we could actually carry out the infinitely many computations it would require.</p>
<p>Any practical use of Bayesian inference, or really any sort of analysis, has to restrict itself to considering only a few classes of hypotheses. And that means that sometimes, the “true” hypothesis <em>won’t be in your prior</em>. Your prior gives it a zero probability. And that means that as you run more experiments and collect more evidence, your results will look weirder and weirder. Eventually you might get one of those zero-probability results, those anomalies. And then you have to start over.</p>
<p>A lot of the work of science—the “normal” work—is accumulating more evidence and feeding it to the (metaphorical) Bayesian machine. But the most difficult and creative part is coming up with <em>better hypotheses</em> to include in the prior. Once the “true” hypothesis is in your prior, collecting more evidence will drive its probability up. But you need to add the hypothesis to your prior first. And that’s what a paradigm shift looks like.</p>
<hr />
<p>It’s important to remember that this is an analogy; a paradigm isn’t exactly the same thing as a prior. Just as “surprising evidence” isn’t an anomaly, two priors with slightly different probabilities put on some hypotheses aren’t operating in different paradigms.</p>
<p>Instead, a paradigm comes <em>before</em> your prior. Your paradigm tells you what counts as a hypothesis, what you should include in your prior and what you should leave out. You can have two different priors in the same paradigm; you can’t have the same prior in two different paradigms. Which is kind of what it means to say that different paradigms are incommensurable.</p>
<p>This is probably the biggest weakness of Bayesian inference, in practice. Bayes gives you a systematic way of evaluating the hypotheses you have based on the evidence you see. But it doesn’t help you figure out what sort of hypotheses you should be considering in the first place; you need some theoretical foundation to do that.</p>
<p>You need a paradigm.</p>
<hr />
<p><em>Have questions about philosophy of science? Questions about Bayesian inference? Want to tell me I got Kuhn completely wrong? Tweet me <a href="https://twitter.com/profjaydaigle">@ProfJayDaigle</a> or leave a comment below, and let me know!</em></p>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>If you’re interested in the political angle on this more than the scientific, check out the <a href="https://www.youtube.com/watch?v=aTSrHfv9C94">talk I gave at TedxOccidentalCollege last year</a>. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>In fact, this was my third or fourth answer in the first draft of this section. Then I looked at it again and realized it was by far the <em>best</em> answer. That’s how paradigms work: as long as everything is working normally, you don’t even have to think about the fact that they’re there. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3">
<p>"True" isn’t really the most accurate word to use here, but it works well enough and I want to avoid another thousand-word digression on the subject of metaphysics. <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Jay DaigleScott Alexander at Slate Star Codex has been blogging lately about Thomas Kuhn and the idea of paradigm shifts in science. This is a topic near and dear to my heart, so I wanted to take the opportunity to share some of my thoughts and answer some questions that Scott asked in his posts.Numerical Semigroups and Delta Sets2019-01-05T00:00:00-08:002019-01-05T00:00:00-08:00https://jaydaigle.net/blog/numerical-semigroups<p>In this post I want to outline my main research project, which involves non-unique factorization in numerical semigroups. I’m going to define semigroups and numerical semigroups; explain what non-unique factorization means; define the invariant I study, called the delta set; and talk about some of the specific questions I’m interested in.</p>
<h3 id="semigroups">Semigroups</h3>
<p>A <em>semigroup</em> is a set $S$ with one associative operation. This really just means we have a set of things, and some way of combining any two of them to get another. Semigroups generalize the more common idea of a <em>group</em>, which has an identity and inverses in addition to the associative operation. Every group is also a semigroup, but not every semigroup is a group.<strong title="There is also something called a &quot;monoid&quot;, which has an identity element but no inverses; thus every group is a monoid and every monoid is a semigroup. The presence of an identity element doesn't actually matter for any of the questions we're asking, so researchers use the terms &quot;semigroup&quot; and &quot;monoid&quot; more or less interchangeably."><sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup></strong></p>
<p>The simplest example of a semigroup is the natural numbers $\mathbb{N}$, with the operation of addition: we can add any two natural numbers together, but without negative numbers we don’t have any way to subtract, which would be an inverse. This is the free semigroup on one generator, which means we can get every element by starting with $1$ and adding it to itself some number of times.</p>
<p>Other examples of semigroups are:</p>
<ul>
<li>$\mathbb{N}^n, +$: ordered $n$-tuples of natural numbers, added componentwise.</li>
<li>$\mathbb{N}, \times$: the natural numbers using multiplication as the operation. This has infinitely many generators, since we need to start with every prime number to get every possible natural number.</li>
<li>String Concatenation: we can take our set to be the set of all strings of English letters, and we combine two strings by just sticking the second one after the first.</li>
<li><a href="http://mathworld.wolfram.com/BlockMonoid.html">Block Monoids</a> are semigroups whose elements are lists of group elements that multiply out to zero (the group’s identity), with concatenation of lists as the operation.</li>
</ul>
<p>Numerical semigroups, which are the main object I study, are formally defined as sub-semigroups of the natural numbers, but that phrase doesn’t explain much if you’re not already familiar with the field. Fortunately, I can describe what they actually are much less technically and more simply.</p>
<h3 id="numerical-semigroups">Numerical Semigroups</h3>
<p>We can define the numerical semigroup generated by $a_1, \dots, a_k$ to be the set of integers
\[
\langle a_1, \dots, a_k \rangle = \{n_1 a_1 + \dots + n_k a_k : n_i \in \mathbb{Z}_{\geq 0} \}.
\]
In other words, our semigroup is the set of all the numbers you can get by adding up the generators some number of times, but without allowing subtraction.</p>
<p>I like to think about the <a href="https://arxiv.org/abs/1709.01606">Chicken McNugget semigroup</a> to explain this. When I was a kid, at McDonald’s you could get a 4-piece, 6-piece, or 9-piece order of Chicken McNuggets.<strong title="For some reason, they switched over to 4-, 6-, and 10-piece orders when I was a teenager. That semigroup is much less interesting, so I'm going to pretend that never happened."><sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup></strong> And then we can ask: which numbers of nuggets is it possible to order?</p>
<p>You certainly can’t order one, two, or three nuggets. You can order four, but not five. You can order six, but not seven. You can get eight by ordering two 4-pieces, nine by ordering one 9-piece, and ten by ordering a 4-piece and a 6-piece. There’s no way to order exactly eleven nuggets, and it turns out we can get any number of nuggets past that exactly. (This makes eleven the <a href="https://en.wikipedia.org/wiki/Coin_problem">Frobenius number</a> for this semigroup). We can summarize all this in the table below:</p>
<p>\[
\begin{array}{cc}
1 & \text{not possible} \\
2 & \text{not possible} \\
3 & \text{not possible} \\
4 & = 1 \cdot 4 + 0 \cdot 6 + 0 \cdot 9 \\
5 & \text{not possible} \\
6 & = 0 \cdot 4 + 1 \cdot 6 + 0 \cdot 9 \\
7 & \text{not possible} \\
8 & = 2 \cdot 4 + 0 \cdot 6 + 0 \cdot 9 \\
9 & = 0 \cdot 4 + 0 \cdot 6 + 1 \cdot 9 \\
10 & = 1 \cdot 4 + 1 \cdot 6 + 0 \cdot 9 \\
11 & \text{not possible} \\
12 & = 3 \cdot 4 + 0 \cdot 6 + 0 \cdot 9 \\
 & = 0 \cdot 4 + 2 \cdot 6 + 0 \cdot 9 \\
13 & = 1 \cdot 4 + 0 \cdot 6 + 1 \cdot 9
\end{array}
\]</p>
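<p>The "possible / not possible" column of the table above can be reproduced with a few lines of code. This is only a sketch: the function name <code>representable</code> and the search bound of 50 are arbitrary choices for illustration.</p>

```python
def representable(n, generators):
    """Return True if n is a non-negative integer combination of the generators."""
    # reachable[m] is True when exactly m nuggets can be ordered.
    reachable = [False] * (n + 1)
    reachable[0] = True  # ordering nothing is always possible
    for m in range(1, n + 1):
        reachable[m] = any(m >= g and reachable[m - g] for g in generators)
    return reachable[n]

generators = (4, 6, 9)

# Every number of nuggets below 50 that cannot be ordered exactly.
gaps = [n for n in range(1, 50) if not representable(n, generators)]
print(gaps)
print(max(gaps))  # the largest gap found in this range
```

<p>A bounded search cannot prove on its own that nothing larger is missing, but here it is enough to notice that 12, 13, 14, and 15 are all possible: adding 4-piece orders to those reaches every larger number.</p>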
<p>Looking at this table you might notice something else: there are two rows for the number 12, because we can order 12 nuggets in two different ways: we can order three 4-piece orders, or two 6-piece orders. We call each of these ways of ordering twelve nuggets a <em>factorization</em> of 12 with respect to the generators $4,6,9$. And not only do we have two different factorizations of 12; they actually have different numbers of factors!</p>
<p>If we look at larger numbers, the variety in factorizations becomes far greater. Consider this table of ways to factor 36:
\[
\begin{array}{cc}
\text{factorization} & \text{length} \\
9 \cdot 4 + 0 \cdot 6 + 0 \cdot 9 & 9 \\
6 \cdot 4 + 2 \cdot 6 + 0 \cdot 9 & 8 \\
3 \cdot 4 + 4 \cdot 6 + 0 \cdot 9 & 7 \\
3 \cdot 4 + 1 \cdot 6 + 2 \cdot 9 & 6 \\
0 \cdot 4 + 6 \cdot 6 + 0 \cdot 9 & 6 \\
0 \cdot 4 + 3 \cdot 6 + 2 \cdot 9 & 5 \\
0 \cdot 4 + 0 \cdot 6 + 4 \cdot 9 & 4
\end{array}
\]
We have seven distinct ways we can factor 36. The shortest has four factors and the longest has nine; every length in between is represented.</p>
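<p>Factorization tables like this one are easy to generate by brute force. The sketch below uses a hypothetical helper, <code>factorizations</code>, that simply tries every possible count of each generator; it is fine for small numbers but is not meant to be efficient.</p>

```python
def factorizations(n, generators):
    """Yield every tuple of non-negative coefficients (one per generator) summing to n."""
    if not generators:
        if n == 0:
            yield ()
        return
    first, rest = generators[0], generators[1:]
    # Try each possible number of copies of the first generator,
    # then factor whatever is left using the remaining generators.
    for count in range(n // first + 1):
        for tail in factorizations(n - count * first, rest):
            yield (count,) + tail

# List the factorizations of 36, longest first.
for coeffs in sorted(factorizations(36, (4, 6, 9)), key=sum, reverse=True):
    print(coeffs, "length", sum(coeffs))
```

<p>Each tuple lists the number of 4-piece, 6-piece, and 9-piece orders, so <code>(3, 1, 2)</code> is the row $3 \cdot 4 + 1 \cdot 6 + 2 \cdot 9$ of length 6.</p>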
<p>From here we can ask a number of questions. How many ways can we order a given number of chicken nuggets? How many different lengths can these factorizations have? What patterns can we find?</p>
<p>All this is very different from what we’re used to. When we factor integers into prime numbers, the <a href="https://en.wikipedia.org/wiki/Fundamental_theorem_of_arithmetic">Fundamental Theorem of Arithmetic</a> tells us that there is a unique way to do this. We generally learn this in grade school, and so from a very young age we’re used to having only one way to factor things. But this unique factorization property isn’t universal, and it doesn’t apply here.</p>
<p>Numerical semigroups essentially never have unique factorization. But we want to find ways to measure how not-unique their factorization is.</p>
<h3 id="the-delta-set">The Delta Set</h3>
<p>In my research I study something called the <em>delta set</em> of a semigroup. The delta set is a way of measuring how complicated the relationships among different factorizations can get.</p>
<p>For an element $x$ in a semigroup, we can look at all the factorizations of $x$, and then we can look at all the possible lengths of these factorizations. (In our example above, we had $\mathbf{L}(36) = \{4,5,6,7,8,9\}$; we don’t repeat the $6$ because we only care about which lengths are possible, and not how many times they occur). Then we can ask a bunch of questions about these sets of lengths.</p>
<p>A simple thing to compute is the <em>elasticity</em> of an element, which is just the ratio of the longest factorization to the shortest, and tells you how much the lengths can vary. (The elasticity of $36$ is $9/4$). A good exercise is to convince yourself that the largest elasticity of any element in a semigroup is the ratio of the largest generator to the smallest generator. (And thus that $36$ has the maximum possible elasticity for $\langle 4, 6, 9 \rangle$).</p>
<p>The delta set is a bit more complicated. The delta set of $x$ is the set of successive differences in lengths. So instead of looking at the shortest and longest factorizations, we look at all of them, and see what sort of gaps show up. (For our example, the delta set is just $\Delta(36) = \{1\}$, since there’s a factorization of each length between $4$ and $9$. If the set of lengths were $\{3,5,8,15\}$ then the delta set would be $\{2,3,7\}$).</p>
<p>We want to understand the whole semigroup, not just individual elements. So we often want to talk about the delta set of an entire semigroup, which is just the union of the delta sets of all the elements. So $\Delta(S)$ tells us what kind of gaps can appear in <em>any</em> set of lengths for any element of the semigroup. It turns out that for the Chicken McNugget semigroup $S = \langle 4,6,9 \rangle$, the delta set is just $\Delta(S) = \{1\}$. This means that the delta set of any element is just $\{1\}$, and thus that every set of lengths is a set of consecutive integers $\{n,n+1, \dots, n+k \}$.</p>
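<p>Sets of lengths, elasticities, and delta sets of individual elements can all be computed directly from the definitions. The following is a brute-force sketch; every function name is made up for this example, and the bound of 60 in the last step is arbitrary.</p>

```python
from fractions import Fraction

def factorizations(n, generators):
    """Yield coefficient tuples (one entry per generator) whose weighted sum is n."""
    if not generators:
        if n == 0:
            yield ()
        return
    first, rest = generators[0], generators[1:]
    for count in range(n // first + 1):
        for tail in factorizations(n - count * first, rest):
            yield (count,) + tail

def lengths(n, generators):
    """Sorted distinct factorization lengths of n (empty if n is not in the semigroup)."""
    return sorted({sum(f) for f in factorizations(n, generators)})

def delta_from_lengths(length_set):
    """Set of gaps between consecutive values in a set of lengths."""
    ordered = sorted(set(length_set))
    return {b - a for a, b in zip(ordered, ordered[1:])}

def delta_set(n, generators):
    """Delta set of the single element n."""
    return delta_from_lengths(lengths(n, generators))

def elasticity(n, generators):
    """Longest length over shortest; n must be a nonzero element of the semigroup."""
    ls = lengths(n, generators)
    return Fraction(ls[-1], ls[0])

nuggets = (4, 6, 9)
print(lengths(36, nuggets))
print(elasticity(36, nuggets))
print(delta_set(36, nuggets))
print(delta_from_lengths({3, 5, 8, 15}))

# Union of the delta sets of every element up to an arbitrary bound.
# A bounded search like this is evidence about the semigroup, not a proof.
bounded_delta = set().union(*(delta_set(n, nuggets) for n in range(1, 61)))
print(bounded_delta)
```

<p>Numbers that are not in the semigroup have no factorizations, so they contribute an empty delta set and drop out of the union automatically.</p>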
<h3 id="what-do-we-know">What Do We Know?</h3>
<p>Delta sets can be a little tricky to compute. It’s fairly easy to show a number <em>is</em> in the delta set of a semigroup: find an element, calculate all the factorization lengths, and see that you have a gap of the desired size. But to show that a number is not in the delta set of the semigroup, you have to show that it isn’t in the delta set of any element, which is much trickier.</p>
<p>However, there are a few things we do know.</p>
<ul>
<li>
<p>The smallest element of the delta set is the greatest common divisor of all the elements of the delta set. This means that $\{2,3\}$ can’t be the delta set of any semigroup, since $2$ isn’t the GCD of $2$ and $3$.</p>
</li>
<li>
<p>If $S = \langle a, b \rangle$ is generated by exactly two elements, then $\Delta(S) = \{b - a\}$. More generally, if $S = \langle a, a+d, a+2d, \dots, a+kd \rangle$ then $\Delta(S) = \{d\}$. (We call such semigroups “arithmetic semigroups” since their generating set is an <a href="https://en.wikipedia.org/wiki/Arithmetic_progression">arithmetic progression</a>).</p>
</li>
<li>
<p>For any numerical semigroup $S$, there is a finite collection of (computable) elements called the <em>Betti elements</em>, and the maximum element of the delta set of $S$ is in the delta set of at least one of the Betti elements.</p>
</li>
<li>
<p>Finally and most importantly, the delta set is eventually periodic. This means that if you check the delta sets for a (possibly large but known) number of elements of the semigroup, you will see everything you can possibly see. This makes it possible to compute the delta set of any given semigroup and know you haven’t left anything out. <strong title="This result was originally proven by Scott Chapman, Rolf Hoyer, and Nathan Kaplan in 2008, during an undergraduate REU research program I was also participating in. But the original result had an unfortunately large bound, so using this to compute delta sets wasn't really practically feasible. In 2014, a paper by J. I. García-García, M. A. Moreno-Frías, and A. Vigneron-Tenorio improved the bound dramatically and made computation of delta sets feasible on personal computers."><sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup></strong></p>
</li>
</ul>
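<p>A couple of the facts in the list above can be sanity-checked numerically. The sketch below tests the GCD condition on candidate sets and computes delta sets over a bounded range for a few small semigroups. The helper names and the bound of 80 are arbitrary, and a bounded search is only a check, not a proof or the periodicity-based algorithm mentioned above.</p>

```python
from functools import reduce
from math import gcd

def factorizations(n, generators):
    """Yield coefficient tuples (one entry per generator) whose weighted sum is n."""
    if not generators:
        if n == 0:
            yield ()
        return
    first, rest = generators[0], generators[1:]
    for count in range(n // first + 1):
        for tail in factorizations(n - count * first, rest):
            yield (count,) + tail

def delta_set(n, generators):
    """Gaps between consecutive factorization lengths of n."""
    ls = sorted({sum(f) for f in factorizations(n, generators)})
    return {b - a for a, b in zip(ls, ls[1:])}

def bounded_delta(generators, bound):
    """Union of the delta sets of all elements from 1 to bound (a finite check only)."""
    result = set()
    for n in range(1, bound + 1):
        result |= delta_set(n, generators)
    return result

def passes_gcd_test(candidate):
    """Necessary condition from the list: the minimum must equal the GCD of the set."""
    return min(candidate) == reduce(gcd, candidate)

print(passes_gcd_test({2, 3}))
print(passes_gcd_test({1, 2, 3}))

print(bounded_delta((3, 5), 80))     # two generators: a = 3, b = 5
print(bounded_delta((4, 9), 80))     # two generators: a = 4, b = 9
print(bounded_delta((5, 7, 9), 80))  # arithmetic progression: a = 5, d = 2
```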
<p>But this is nearly everything that we really know about delta sets. There are a lot of open questions left, which primarily fall into two categories:</p>
<ol>
<li>
<p>For some nice category of semigroup, compute the delta set. We’ve already seen this question answered for semigroups generated by arithmetic sequences; we also have complete or partial answers for semigroups generated by <a href="https://pdfs.semanticscholar.org/a100/c2ba10554d593c6f7c59245f330117d5d2c6.pdf">generalized arithmetic sequences</a>, geometric sequences, and <a href="https://arxiv.org/abs/1503.05993">compound sequences</a>.</p>
</li>
<li>
<p>The <em>realization problem</em>: given a set of natural numbers, is it the delta set of some numerical semigroup? We don’t actually know a lot about this. About the only thing that we know <em>can’t</em> happen is a minimum element that isn’t the GCD of the set. But to show that something <em>can</em> happen, about all we can do is find a specific semigroup that has that delta set. There’s a lot of room to explore here.</p>
</li>
</ol>
<h3 id="non-minimal-generating-sets">Non-Minimal Generating Sets</h3>
<p>In my research I introduce one more complication. Earlier we talked about the Chicken McNugget semigroup, of all the ways we can build orders out of 4, 6, or 9 chicken nuggets. But McDonald’s also offers a 20 piece order of chicken nuggets. <strong title="My parents would never let me order this when I was a child, and I'm still bitter."><sup id="fnref:4"><a href="#fn:4" class="footnote">4</a></sup> </strong></p>
<p>From a purely algebraic perspective, this doesn’t change anything. Anything we can get with 20 piece orders, we can get with a combination of 4 and 6 pieces, so we have the same set and the same operation, and thus the same semigroup. (We say that 20 isn’t “irreducible” because we can factor it into other simpler elements). So in this sense, nothing should change.</p>
<p>But the set of factorizations does change. If we replicate our earlier table of factorizations of 36 but now allow $20$ as a factor, we get
\[
\begin{array}{cc}
\text{factorization} & \text{length} \\
9 \cdot 4 + 0 \cdot 6 + 0 \cdot 9 + 0 \cdot 20 & 9 \\
6 \cdot 4 + 2 \cdot 6 + 0 \cdot 9 + 0 \cdot 20 & 8 \\
3 \cdot 4 + 4 \cdot 6 + 0 \cdot 9 + 0 \cdot 20 & 7 \\
3 \cdot 4 + 1 \cdot 6 + 2 \cdot 9 + 0 \cdot 20 & 6 \\
0 \cdot 4 + 6 \cdot 6 + 0 \cdot 9 + 0 \cdot 20 & 6 \\
0 \cdot 4 + 3 \cdot 6 + 2 \cdot 9 + 0 \cdot 20 & 5 \\
\color{blue}{4 \cdot 4 + 0 \cdot 6 + 0 \cdot 9 + 1 \cdot 20} & \color{blue}{5} \\
0 \cdot 4 + 0 \cdot 6 + 4 \cdot 9 + 0 \cdot 20 & 4 \\
\color{blue}{1 \cdot 4 + 2 \cdot 6 + 0 \cdot 9 + 1 \cdot 20} & \color{blue}{4}
\end{array}
\]
The extra generator gives us the two additional factorizations in blue.</p>
<p>Now every question we asked about factorizations in numerical semigroups, we can ask again for factorizations with respect to our non-minimal generating set. For instance, we can ask for the delta set with respect to our generating set. For 36 above, we see that the delta set is still $\{1\}$, just as it was before; nothing has changed.</p>
<p>But let’s look instead at the element 20. With our old generating set of $4,6,9$, we can only get 20 nuggets in two ways. But with our non-minimal generating set, we have three different ways to order 20 nuggets: $20 = 5 \cdot 4 = 2 \cdot 4 + 2 \cdot 6 = 1 \cdot 20$. These three “factorizations” have lengths 5, 4, and 1, and a little experimentation will convince you that they’re the only possible factorizations. Therefore our set of lengths is $\mathbf{L}(20) = \{1,4,5\}$ and the delta set is $\Delta(20) = \{1,3\}$.</p>
<p>This is a big change! With the original, minimal generating set, the delta set of the <em>entire semigroup</em> was $\{1\}$. There was no element with a length gap larger than 1. But by adding a new generator in, we can get an element whose delta set is $\{1,3\}$. And a little experimentation shows us that
\[
26 = 5 \cdot 4 + 1 \cdot 6 = 2 \cdot 4 + 3 \cdot 6
= 2 \cdot 4 + 2 \cdot 9 = 1 \cdot 6 + 1 \cdot 20
\]
and thus $\mathbf{L}(26) = \{2,4,5,6\}$ and $\Delta(26) = \{1,2\}$. So the delta set for the entire semigroup is $\{1,2,3\}$.<strong title="I haven't actually shown that you can't get a gap bigger than 3. But it's true."><sup id="fnref:5"><a href="#fn:5" class="footnote">5</a></sup></strong> We’ve gotten a different delta set for the exact same semigroup, but using a different set of generators.</p>
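<p>The "little experimentation" for 20 and 26 can be automated with the same brute-force idea as before. This is a sketch with made-up helper names; it compares the minimal generating set with the one that includes the redundant generator 20.</p>

```python
def factorizations(n, generators):
    """Yield coefficient tuples (one entry per generator) whose weighted sum is n."""
    if not generators:
        if n == 0:
            yield ()
        return
    first, rest = generators[0], generators[1:]
    for count in range(n // first + 1):
        for tail in factorizations(n - count * first, rest):
            yield (count,) + tail

def lengths(n, generators):
    """Sorted distinct factorization lengths of n with respect to the given generators."""
    return sorted({sum(f) for f in factorizations(n, generators)})

def delta_set(n, generators):
    """Gaps between consecutive factorization lengths of n."""
    ls = lengths(n, generators)
    return {b - a for a, b in zip(ls, ls[1:])}

minimal = (4, 6, 9)
with_twenty = (4, 6, 9, 20)  # same semigroup, one redundant generator

for n in (20, 26, 36):
    print(n, "minimal:", lengths(n, minimal), delta_set(n, minimal))
    print(n, "with 20:", lengths(n, with_twenty), delta_set(n, with_twenty))
```

<p>Only the list of generators changes between the two calls; the set of numbers that can be ordered is identical in both cases.</p>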
<p>This raises a number of questions for us to study. We can start with our previous two questions: given a semigroup (and a non-minimal set of generators), what is the delta set? And given a set, is it the delta set of some semigroup and non-minimal generating set? But we also have a new question: what happens to the delta set of a semigroup as we continually add things to the generating set? Can we make the delta set bigger? Can we make it smaller? What ways of adding generators produce interesting patterns?</p>
<p>There’s a lot of fertile ground here. A few questions have been answered already, in a <a href="https://www.tandfonline.com/doi/abs/10.1080/00927870903045165">paper I cowrote with Scott Chapman, Rolf Hoyer, and Nathan Kaplan in 2010</a>. For instance, it is always possible to force the delta set to be $\{1\}$ by adding more elements to the generating set. A couple other groups have done some work since then, but as far as I know, nothing else has been published.</p>
<p>But hopefully I’ve convinced you that there are quite a few interesting and unanswered questions in this field. Many of the answers should be accessible with a bit of work, and I hope to be able to provide some of them soon.</p>
<hr />
<p><em>Have a question about numerical semigroups? Factorization theory? My research? Tweet me <a href="https://twitter.com/profjaydaigle">@ProfJayDaigle</a> or leave a comment below.</em></p>
<hr />
<div class="footnotes">
<ol>
<li id="fn:1">
<p>There is also something called a “monoid”, which has an identity element but no inverses; thus every group is a monoid and every monoid is a semigroup. The presence of an identity element doesn’t actually matter for any of the questions we’re asking, so researchers use the terms “semigroup” and “monoid” more or less interchangeably. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>For some reason, they switched over to 4-, 6-, and 10-piece orders when I was a teenager. That semigroup is much less interesting, so I’m going to pretend that never happened. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3">
<p>This result was originally <a href="https://link.springer.com/article/10.1007%2Fs00010-008-2948-4">proven by Scott Chapman, Rolf Hoyer, and Nathan Kaplan in 2008</a>, during an undergraduate REU research program I was also participating in. But the original result had an unfortunately large bound, so using this to compute delta sets wasn’t really practically feasible. In 2014, a <a href="https://arxiv.org/abs/1406.0280">paper by J. I. García-García, M. A. Moreno-Frías, and A. Vigneron-Tenorio</a> improved the bound dramatically and made computation of delta sets feasible on personal computers. <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
<li id="fn:4">
<p>My parents would never let me order this when I was a child, and I’m still bitter. <a href="#fnref:4" class="reversefootnote">↩</a></p>
</li>
<li id="fn:5">
<p>I haven’t actually shown that you can’t get a gap bigger than $3$. But it’s true. <a href="#fnref:5" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Jay DaigleIn this post I want to outline my main research project, which involves non-unique factorization in numerical semigroups. I’m going to define semigroups and numerical semigroups; explain what non-unique factorization means; define the invariant I study, called the delta set; and talk about some of the specific questions I’m interested in.The difference between science and engineering2018-08-29T00:00:00-07:002018-08-29T00:00:00-07:00https://jaydaigle.net/blog/science-vs-engineering<p><em>I wrote this essay a few years back elsewhere on the internet. It still seems relevant, so I’m posting this updated and lightly edited version.</em></p>
<p>I’ve noticed that people regularly get confused, on a number of
subjects, by the difference between science and engineering.<br />
In summary: science is sensitive and finds facts; engineering is robust and
gives praxis. Many problems happen when we confuse science for
engineering and completely modify our praxis based on the results of a
couple of studies in an unsettled area.</p>
<p><img src="http://cowbirdsinlove.com/comics/46/engineer.png" alt="Sad truth: Most &quot;mad scientists&quot; are actually just mad engineers" /></p>
<p>(Thanks to <a href="http://cowbirdsinlove.com/46">Cowbirds in Love</a> for the <em>perfect</em> comic strip)</p>
<h3 id="the-difference-between-science-and-engineering">The difference between science and engineering</h3>
<p>As a rough definition, science is a system of techniques for finding out facts about the
world. Engineering, in contrast, is the technique of using science to
produce tools we can consistently use in the world. Engineering produces
things that have useful effects. (And I’ll also point to a third
category, of “folk traditions,” which
are tools we use in the world that are not particularly founded in science.)</p>
<p>These things are importantly different. Science depends on a large
number of people putting together a
lot of little pieces, and building up an edifice of facts that
together give us a good picture of how things work. It’s fine if any
one experiment or study is flawed, because in the limit of infinite
experiments we figure out what’s going on. (See
for example Scott Alexander’s essay <a href="http://slatestarcodex.com/2014/12/12/beware-the-man-of-one-study/">Beware the Man of One
Study</a> for excellent commentary on this problem).</p>
<p>Similarly, it’s fine if any one experiment holds in only very restricted cases, or detects a subtle effect that can only be seen with delicate machinery. The
point is to build up a large number of data points and use them to
generate a model of the world.</p>
<p>Engineering, in contrast, has to be <em>robust</em>. If I want to detect the
Higgs Boson once, to find out if it exists, I can do that in a giant
machine that costs billions of dollars and requires hundreds of hours
of analysis. If I want to build a Higgs Boson detector into a cell
phone, that doesn’t work.</p>
<p>This means two things. First is that we need to understand things
much better for engineering than for science. In science it’s fine to
say “The true effect is between +3 and -7 with 95% probability”. If
that’s what we know, then that’s what we know. And an experiment that
shrinks the bell curve by half a unit is useful. For
engineering, we generally need to have a much better idea of what the
true effect is. (Imagine trying to build an airplane based on the
knowledge that acceleration due to gravity is
<a href="https://xkcd.com/683/">probably</a> between 9 and 13 m/s^2).</p>
<p>Second is that science in general cares about much smaller effects
than engineering does. It was a very long time before engineering
needed relativistic corrections due to gravity, say. A fact can be
true but not (yet) useful or relevant, and then it’s in the domain of
science but not engineering.</p>
<h3 id="why-does-this-matter">Why does this matter?</h3>
<p>The distinction is, I think, fairly clear when we talk about physics.
In particular, we understand the science of physics quite well, at
least on everyday scales. And our practice of the engineering of
physics is also quite well-developed, enough so that people rarely use
folk traditions in place of engineering any more. (“I don’t know why
this bridge stays up, but this is how daddy built them.”)</p>
<p>But people get much more confused when we move over to, say,
psychology, or sociology, or nutrition. Researchers are doing a lot
of science on these subjects, and doing good work. So there’s a ton
of papers out there saying that eggs are good, or eggs are bad, or
<a href="http://www.theonion.com/article/eggs-good-for-you-this-week-4144">eggs are good for you but only until next
Monday</a>, or whatever.</p>
<p>And people often have one of two reactions to this situation. The
first is to read one study and say “See, here’s the scientific study.
It says eggs are bad for you. Why are you still eating eggs? Are you
denying the science?” And
the second reaction is to say that obviously the scientists can’t
agree, and so we don’t know anything and maybe the whole scientific
approach is flawed.</p>
<p>But the real situation is that we’re struggling to develop a science
of nutrition. And that’s <em>hard</em>. We’ve put in a lot of work, and we know
some things. But we don’t really have enough information to do
<em>engineering</em>—to say “Okay, to optimize cardiovascular health you
need to cut your simple carbs by 7%, eat an extra 10g of
monounsaturated fats every day, and eat 200g of protein every
Wednesday”, or whatever. We just don’t know enough.</p>
<p>And this is where folk traditions come in. Folk traditions are
attempts to answer questions that we need decent answers to, that have
been developed over time, and that are presumably non-horrible because
they haven’t failed obviously and spectacularly yet. A person who eats
“like grandma did” is probably on average at least as healthy as
a person who tried to follow every trendy bit of scientistic nutrition
advice from the past thirty years.</p>
<h3 id="trendy-teaching-as-confusing-science-for-engineering">Trendy teaching as confusing science for engineering</h3>
<p>So where do I see this coming up other than nutrition? Well, the
subject that really got me thinking about it was “scientific” teaching
practices. I’ve attended a few workshops on “modern” teaching techniques like the use of clickers, and when I tell people about them I often get comments disparaging <a href="http://calteches.library.caltech.edu/51/2/CargoCult.htm">cargo cult</a> teaching methods.</p>
<p>In general there’s a big split among university professors between
people who want to teach in a more “traditional” way and people who
want to teach in a more “scientific” way. With bad blood on both
sides.</p>
<p>And my biggest problem with the “scientific” side is that some of their studies are <em>so bad</em>. I’d like good studies on teaching methods. I’d like a good engineering of teaching. But we don’t have one yet, and acting like “we have three studies, now we know the best thing to do” is just silly.</p>
<p>(Which shouldn’t be read as full-throated support for the
“traditionalists”! The science is good enough to tell us some things
about some things, and I do try to engage in judicious supplementation
of folk teaching traditions with information from recent research.
But the research is not in a good enough state to be dispositive, or
produce an engineering discipline, or completely displace the folk
tradition).</p>
<h3 id="other-examples">Other examples</h3>
<p>A few of my friends have complained about the sad state of exercise science, but I think they’re
really complaining about the lack of exercise engineering. We are
doing basic research that tells us about how the body responds to
exercise. We don’t know enough to give advice that improves much on
“do the things people have been doing for a while that seem to work”.</p>
<p>A lot of “lifehacks” boil down to “We read a study, and based on this
study, here are three simple things you can do to accomplish X.” But a study is
science, not engineering. Sometimes helpful, but easy to
overinterpret. Don’t take any one study too seriously, and if what
you’re doing works, don’t totally overhaul it because you read a study.</p>
<p>Similarly, any comment about how you can be more effective socially
by doing this one trick is usually science, not engineering.</p>
<p>Lots of economics and public policy debates sound like this.
“This study shows that raising the minimum wage
(increases/decreases/has no effect on) unemployment.” All three of
those statements can be true! There are a lot of studies with a lot of different results. We’re <em>starting</em> to develop an
engineering practice of economic policy, but it’s in its infancy at
best.</p>
<p>Or see <a href="http://ottawacitizen.com/news/national/the-hard-truth-about-hard-power">this essay’s
account</a>
of scientifically studying the most effective way for police to
respond to domestic violence charges, for a good example of confusing
science and engineering. Bonus points for the following quote:</p>
<blockquote>
<p>Reflection upon these results led Sherman to formulate his “defiance” theory of the criminal sanction, beginning with the inauspicious generalization that, according to the evidence, “legal punishment either reduces, increases, or has no effect on future crimes, depending on the type of offenders, offenses, social settings and levels of analysis.” This is a fancy way of saying “we don’t know what works.”</p>
</blockquote>
<h3 id="marketing-engineering-versus-folk-traditions">Marketing: engineering versus folk traditions</h3>
<p>The field of marketing presents a good contrast between engineering
and folk traditions. We have a mental image of a sleazy salesman, who
has a whole host of interpersonal tactics that have been honed through
centuries and millennia of sleazy sales tactics. And this works.</p>
<p>And there’s an entirely different field of marketing research and
focus groups. And this shows what’s necessary to turn science into
engineering. There’s a whole bunch of basic research about psychology
that goes into designing marketing campaigns. But people also do
focus groups, to gather a ton of data on how people respond to minute
differences.</p>
<p>And, more importantly, they do <a href="https://en.wikipedia.org/wiki/A/B_testing">A/B testing</a>, which gives pretty good
data on how actual people respond to actual differences. And by
iterating a ton of A/B testing, you have a pretty good idea that
people will buy 5% more if you use the green packaging, or whatever.</p>Jay DaigleI've noticed that people regularly get confused, on a number of subjects, by the difference between science and engineering. In summary: science is sensitive and finds facts; engineering is robust and gives praxis. Many problems happen when we confuse science for engineering and completely modify our praxis based on the results of a couple of studies in an unsettled area.An easier approach to partial fractions decomposition2018-08-16T00:00:00-07:002018-08-16T00:00:00-07:00https://jaydaigle.net/blog/easier-partial-fractions<p>I always found partial fraction decomposition incredibly annoying and tedious. But it turns out there’s a much easier way to compute it. (I learned this a couple years ago from Chris Towse).</p>
<p>Suppose we want to find a partial fraction decomposition for $\frac{7x+2}{(x+2)^2 (x-1)}$. The normal method is to take your fraction and write it as a sum of real numbers over your polynomial denominators:</p>
<script type="math/tex; mode=display">\frac{7x+2}{(x+2)^2 (x-1)} = \frac{A}{x+2} + \frac{B}{(x+2)^2} + \frac{C}{(x-1)}.</script>
<p>(For this reason, my high school calculus teacher called this the “ABC method”). Then we clear denominators:
<script type="math/tex">% <![CDATA[
\begin{align}
7x+2
&= A(x+2)(x-1) + B(x-1) + C(x+2)^2 \\\\
&= A(x^2+x - 2) + B(x-1) + C(x^2 + 4x +4) \\\\
&=(A+C) x^2 + (A + B + 4C) x + (-2A-B +4C)
\end{align} %]]></script></p>
<p>and we get a system of linear equations:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
A+C & = 0 \\\\
A+B+4C & = 7 \\\\
4C - 2A -B & = 2.
\end{align} %]]></script>
<p>We can solve this by any of the usual methods, and we get $A = -1, B = 4, C = 1$, so</p>
<script type="math/tex; mode=display">\frac{7x+2}{(x+2)^2(x-1)} = \frac{-1}{x+2} + \frac{4}{(x+2)^2} + \frac{1}{(x-1)}.</script>
<p>And now we can integrate or do whatever else we needed to do with our fraction.</p>
<hr />
<p>This process can get super tedious. In particular, solving the linear system at the end isn’t <em>difficult</em> but it is really <em>annoying</em> and easy to screw up if you do it by hand. (I used NumPy instead. Computer algebra systems are your friend).</p>
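<p>As an illustration of solving that system by computer, here is a minimal sketch. It is not the NumPy code mentioned above; it swaps in a small Gauss-Jordan elimination over Python’s standard-library <code>Fraction</code> type, so the answers come out as exact rationals. The function name <code>solve_linear_system</code> is made up for this example.</p>

```python
from fractions import Fraction

def solve_linear_system(matrix, rhs):
    """Solve a square linear system exactly by Gauss-Jordan elimination.

    Assumes the system has a unique solution; a singular matrix will
    raise StopIteration when no nonzero pivot can be found.
    """
    n = len(matrix)
    # Build the augmented matrix with exact rational entries.
    aug = [[Fraction(v) for v in row] + [Fraction(b)]
           for row, b in zip(matrix, rhs)]
    for col in range(n):
        # Find a row at or below the diagonal with a nonzero entry here.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Scale the pivot row so the pivot entry is 1.
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]

# Rows are the equations A + C = 0, A + B + 4C = 7, -2A - B + 4C = 2.
coefficients = [
    [1, 0, 1],
    [1, 1, 4],
    [-2, -1, 4],
]
constants = [0, 7, 2]

A, B, C = solve_linear_system(coefficients, constants)
print(A, B, C)  # prints: -1 4 1
```

<p>Exact fractions are a design choice for this sketch: they avoid floating-point round-off, which matters when you want to read the coefficients straight off the output.</p>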
<p>It turns out there’s a much easier way to do this. It’s motivated by complex analysis residue integrals, but you can do it without actually knowing any complex analysis.</p>
<p>Let’s go back to our equation from earlier:</p>
<script type="math/tex; mode=display">\frac{7x+2}{(x+2)^2 (x-1)} = \frac{A}{x+2} + \frac{B}{(x+2)^2} + \frac{C}{(x-1)}.</script>
<p>Instead of clearing all the denominators, let’s just clear one. If we multiply by $(x-1)$ on both sides, we get</p>
<script type="math/tex; mode=display">\frac{7x+2}{(x+2)^2} = \frac{A(x-1)}{x+2} + \frac{B(x-1)}{(x+2)^2} + C.</script>
<p>This doesn’t look much nicer at first, but look at what happens if we evaluate at $x=1$. The left-hand side becomes $\frac{7+2}{3^2} = 1$. On the right-hand side, the $A$ and $B$ terms go away completely, and we’re just left with $C$. So we immediately see that $C = 1$.</p>
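<p>Here is a sketch of why this trick works for any non-repeated factor. The names $p$, $g$, and $r$ are notation introduced just for this remark, not part of the example. If the fraction has the form $\frac{p(x)}{(x-r)\,g(x)}$ with $g(r) \neq 0$, then multiplying by $(x-r)$ and evaluating at $x = r$ kills every other term of the decomposition, so the coefficient over $(x-r)$ is</p>
<script type="math/tex; mode=display">\frac{p(r)}{g(r)}, \qquad \text{and in our example} \qquad C = \frac{7(1)+2}{(1+2)^2} = \frac{9}{9} = 1.</script>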
<hr />
<p>We can find $A$ and $B$ the same way, with a bit more care. Multiplying our equation by $x+2$ doesn’t help, because we’ll still have a factor of $x+2$ in the denominator. But if we multiply by $(x+2)^2$ we get</p>
<script type="math/tex; mode=display">\frac{7x+2}{x-1} = \frac{A (x+2)}{x-1} + B + \frac{C (x+2)^2}{x-1}</script>
<p>and evaluating at $x = -2$ gives $4 = B$. To get $A$, we need to do a little bit of work and subtract off the $B$ term:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\frac{7x+2}{(x+2)^2 (x-1)} - \frac{4}{(x+2)^2}
& = \frac{7x+2 - 4x + 4}{(x+2)^2 (x-1)} \\
&= \frac{3x+6}{(x+2)^2 (x-1)} \\
&= \frac{3}{(x+2)(x-1)}
\end{align} %]]></script>
<p>so</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\frac{3}{(x+2)(x-1)} & = \frac{A}{x+2} + \frac{C}{x-1} \\
\frac{3}{x-1} & = A + \frac{C(x+2)}{x-1} \\
-1 & = A.
\end{align} %]]></script>
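<p>The three steps above can be sketched in a few lines of Python. This is only an illustration for this one fraction, not a general algorithm; the function names <code>original</code> and <code>decomposition</code> are made up here, and exact <code>Fraction</code> arithmetic is an assumption chosen so that the final comparison is an exact equality check.</p>

```python
from fractions import Fraction

def original(x):
    """The fraction being decomposed: (7x + 2) / ((x + 2)^2 (x - 1))."""
    x = Fraction(x)
    return (7 * x + 2) / ((x + 2) ** 2 * (x - 1))

# Step 1: multiply by (x - 1), then evaluate at x = 1 to isolate C.
C = (7 * Fraction(1) + 2) / (Fraction(1) + 2) ** 2

# Step 2: multiply by (x + 2)^2, then evaluate at x = -2 to isolate B.
B = (7 * Fraction(-2) + 2) / (Fraction(-2) - 1)

# Step 3: after subtracting B / (x + 2)^2, what remains is
# 3 / ((x + 2)(x - 1)). Multiply by (x + 2) and evaluate at x = -2.
A = Fraction(3) / (Fraction(-2) - 1)

def decomposition(x):
    """The proposed decomposition A/(x+2) + B/(x+2)^2 + C/(x-1)."""
    x = Fraction(x)
    return A / (x + 2) + B / (x + 2) ** 2 + C / (x - 1)

# Spot check at points away from the poles x = -2 and x = 1.
sample_points = [0, 2, 3, Fraction(1, 2), -5]
assert all(original(x) == decomposition(x) for x in sample_points)
print(A, B, C)  # prints: -1 4 1
```

<p>The spot check is stronger than it looks: after clearing denominators, both sides are polynomials of degree at most two, so agreeing at three or more distinct points forces them to be identical.</p>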
<hr />
<p>That might feel like it took longer, but that’s mostly because I actually worked through all the algebra with the new version. No NumPy here! I actually suspect the first way is more efficient if you’re doing a really <em>big</em> decomposition, because it parallelizes a bunch of stuff, and linear equation solvers are pretty efficient.</p>
<p>But for reasonable-sized problems I’d much rather do the second method, no question. And this makes me almost want to actually teach partial fraction decomposition next time I teach calc 2.</p>Jay DaigleI always found partial fraction decomposition incredibly annoying and tedious. But it turns out there's a much easier way to compute it. (I learned this a couple years ago from Chris Towse).A Neat Argument For the Uniqueness of $e^x$2018-08-09T00:00:00-07:002018-08-09T00:00:00-07:00https://jaydaigle.net/blog/a-neat-argument-for-the-uniqueness-of-e%5Ex<p>In my advanced Calculus 1 class I teach a quick unit on differential equations. We don’t have the tools to solve them since we haven’t done integrals, but I talk about what differential equations are and how you can check whether you have a solution.</p>
<p>And then I spend a day in lab discussing exponential growth, and how the differential equation $y' = ry$ implies that $y = Ce^{rt}$ for some constant $C$. I’ve been telling my students that while it’s easy to check that this is a solution, we don’t have the tools to prove it’s the only family of solutions.</p>
<p>But today, thanks to reddit, I discovered that that isn’t quite true. You can prove that $Ce^{rx}$ is the only solution to this differential equation with a simple argument.</p>
<p>Suppose $f(x)$ is a function that satisfies $y' = r y$, that is, suppose $f'(x) = r f(x)$. Then consider the derivative of $f(x) e^{-rx}$. By the product rule, we have</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\frac{d}{dx} f(x) e^{-rx} &= f'(x) e^{-rx} + f(x) (-r)e^{-rx} \\\\
&= r f(x) e^{-rx} - rf(x) e^{-rx} = 0.
\end{align} %]]></script>
<p>Thus we see that $f(x) e^{-rx} = f(x)/e^{rx}$ has zero derivative everywhere, so it must be a constant $C$; and thus $f(x) = C e^{rx}$. So this is the only family of solutions.</p>Jay DaigleIn my advanced Calculus 1 class I teach a quick unit on differential equations. We don’t have the tools to solve them since we haven’t done integrals, but I talk about what differential equations are and how you can check whether you have a solution.Working Backwards2018-08-06T00:00:00-07:002018-08-06T00:00:00-07:00https://jaydaigle.net/blog/working-backwards<p>I teach a lot of students who are still learning the basics of proof-writing. My calculus students are seeing their first college math, and often my number theory class is the first really proof-heavy class that a lot of the students take. So I spend a lot of time helping students figure out how to write good proofs. The single best piece of advice I’ve come up with is to get comfortable working backwards.</p>
<hr />
<p>Math students are generally comfortable working forwards. At the beginning of our education this is the only way of working we have: we’re given a problem, like “add these two numbers together”, and we use some algorithm to work out the answer. I like the way Jordan Ellenberg describes this:</p>
<blockquote>
<p>[U]ntil algebra shows up, you’re doing numerical computations in a straightforwardly algorithmic way. You dump some numbers into the addition box, or the multiplication box, or even, in traditionally minded schools, the long-division box, you turn the crank, and you report what comes out the other side. Algebra is different. It’s computation backward.</p>
</blockquote>
<p>But even once we get to high-school algebra, we don’t really tend to <em>feel</em> like we’re working backwards. In practice we develop an algorithm for solving equations and turn the crank; instead of the addition box or the long-division box we now have the quadratic-equation box. This effect is strong enough that when I give my calculus students problems where I explicitly tell them to guess and check<strong title="This mostly comes up in the context of inverse functions. One of my favorite questions in Stewart is: Let f(x) = x^5+x^3+1. What is f^{-1}(3)? You can't find an explicit formula for f^{-1}, but you don't need to: just plugging in numbers makes it clear the answer is 1."><sup id="fnref:guess-and-check"><a href="#fn:guess-and-check" class="footnote">1</a></sup></strong>, they feel very confused and want me to tell them what steps to follow to finish the problem.</p>
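<p>The footnoted problem, finding $f^{-1}(3)$ for $f(x) = x^5+x^3+1$, is a case where guessing really is the whole method. As a small sketch of what “guess and check” means there (the helper name <code>inverse_by_guessing</code> and the choice of candidates are just assumptions for illustration):</p>

```python
def f(x):
    return x**5 + x**3 + 1

def inverse_by_guessing(target, candidates):
    """Return the first candidate x with f(x) == target, or None if none works."""
    for x in candidates:
        if f(x) == target:
            return x
    return None

# Try a handful of small integers, as a student might by hand.
print(inverse_by_guessing(3, range(-5, 6)))  # prints: 1
```

<p>There is no formula being inverted here, only a list of guesses being checked. Since $f'(x) = 5x^4 + 3x^2 \geq 0$, the function is increasing, so once one guess works it is the only answer.</p>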
<p>But this wanting to know “the steps” is a trap—it leads to treating problems like a black-box request for an algorithm, rather than thinking about what’s actually going on. And as soon as problems require any sort of creative engagement, the search for steps fails.</p>
<p>This first hits students hard when they learn to do integrals. When I taught Calculus 2, I spent a couple lab periods having my students do integral worksheets while I answered questions and gave advice. Fairly frequently, they would spend five or six minutes staring at an integral, before giving up and asking me how to start; I sometimes pointed out that we only had four or five things to try, and if they’d just tried all of them they’d have found one that worked already. But they were uncomfortable with the idea that there wasn’t one correct thing to try.</p>
<hr />
<p>But all of this becomes especially important when you start doing proofs, because in proofs there is no straightforward algorithm—and often you can’t even really work forward.</p>
<p>Working forward in proofs can be quite useful, to be fair. Given a set of hypotheses, you can start listing off things that the hypotheses obviously imply. Especially in the context of a class, you can often see “okay, if I know <em>these</em> three things, it looks like I should try applying <em>that</em> theorem and see what I get.” But this has really serious limits. You can’t plausibly write down <em>all</em> the implications of your hypotheses.</p>
<p>Instead, you need some idea of where to go. So you should start by looking at the <em>goal</em>, the thing you want to prove. Figure out what you could know that would be enough to finish the proof.</p>
<p>If you can see how to do that, then great, you win. If not, <em>now</em> go look at the hypotheses, and figure out what you can reasonably and easily conclude from them. Can you see how to get from there to the things you wanted?</p>
<p>You develop a sort of push-pull dynamic: work backwards from the goal, then forwards from the premises, then backwards from the goal again. And hopefully, eventually things will meet in the middle.</p>
<p>This seems pretty mundane. But a lot of students are comfortable working forward, and deeply uncomfortable working backwards. They can draw conclusions, but are bad at staying focused on the ultimate purpose of whatever they’re doing. So just prompting them to think about their goals in the middle of proofs can be really helpful.</p>
<p>I do this a lot while lecturing. In the middle of a proof, I’ll stop and ask the class to remind me what I’m actually trying to do. They find this surprisingly hard. Sometimes they even struggle to remember what the actual theorem we’re trying to prove is, despite the fact that it’s still written on the board; it’s easy to get lost in the weeds of whatever you’re doing <em>right now</em> and forget the broader context. But without that context, everything you’re doing is kind of pointless, and it’s really difficult to decide what to do next.</p>
<hr />
<p>And that idea of the importance of context, of focusing on your actual goals, is just as true outside of math class as it is inside. I see this a lot when I give feedback on people’s writing and speaking: they will keep <em>saying things</em>, and the things will be correct, but they won’t do a good job of saying things that are relevant to their goals and message, and of telling us why the things are relevant.</p>
<p>But you can also see this in a lot of bad planning and management. If you lose track of why you’re doing what you’re doing, you’re much less likely to actually achieve your goals. You’ve forgotten what they are, so if you meet them it’s essentially by luck!</p>
<p>This is one of the dangers, for instance, of getting too reliant on management-by-metric. Originally you create a metric to measure how well you’re achieving some goal. But over time, people forget about the goal and remember the metric—and do things that improve the metric, but in ways that don’t advance the original goal.</p>
<hr />
<p>Staying focused on your actual goal, and working backwards, is really helpful for learning to write proofs. So if you’re teaching people how to write proofs, this is worth explaining explicitly, and then actively training. Keep asking people why they’re doing what they’re doing, and how it gets them closer to the conclusion they want to prove!</p>
<p>But it’s also a really transferable skill, one that can help you in almost all aspects of life. It’s another example of a general thinking skill that studying math helps develop, but that is helpful everywhere.</p>
<hr />
<p><em>Do you have any suggestions for how to help students get comfortable working backwards? Any other tips for teaching students to write proofs more fluently? Please share them in the comments!</em></p>
<div class="footnotes">
<ol>
<li id="fn:guess-and-check">
<p>This mostly comes up in the context of inverse functions. One of my favorite questions in Stewart is: Let $f(x) = x^5+x^3+1$. What is $f^{-1}(3)$? You can’t find an explicit formula for $f^{-1}$, but you don’t need to: just plugging in numbers makes it clear the answer is 1. <a href="#fnref:guess-and-check" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Jay DaigleI teach a lot of students who are still learning the basics of proof-writing. My calculus students are seeing their first college math, and often my number theory class is the first really proof-heavy class that a lot of the students take. So I spend a lot of time helping students figure out how to write good proofs. The single best piece of advice I’ve come up with is to get comfortable working backwards.