Why I'm Not Scared of the New Chatbots

If you haven’t already heard about AI chatbots, you probably haven’t been on the internet in the past couple of months. In November, OpenAI released ChatGPT, which can engage in text conversations with coherent text that looks like it was written by a real person. Then a couple weeks ago Bing rolled out its own chatbot, which was more engaging but also much less reliable, producing a spate of lurid stories of “Sydney” expressing a desire to be human, threatening users, and claiming to have murdered one of its developers.

James Vincent of The Verge is one of the many people who had truly wild conversations with Microsoft’s chatbot.

The core technology underlying both of these chatbots has been around for a while¹, but the new products are more polished, accessible, and compelling, which means a lot of people are experiencing them for the first time. These products have also generated a certain amount of both triumphalism (GPT will revolutionize everything!) and fear (GPT will take over everything!) among people who take the possibilities of AI seriously.

I’m not an expert in these systems, just an interested amateur who’s been following them for a while. But the hype about GPT seems wildly overblown. The current approach to programming chatbots has real limits that I don’t think we can surpass without some genuinely new breakthroughs. And understanding some surprising facts about human psychology can help us develop intuition for what these systems will and won’t be able to do.

But first I want to mention that if you want to support my writing, I now have a Ko-Fi account. Any tips would be appreciated and would help me write more essays like this.

How does GPT work?

GPT is a text generation algorithm based on something called a large language model. The basic idea is that GPT has analyzed a huge corpus of written text and produced a model that looks at a bit of writing and predicts what words are likely to come next.

Humans do that all the time. If I hear the phrase “My friend Jim threw a ball and I caught—”, I will expect the next word to be “it”. But other continuations are possible: if I hear “the ball” or “that ball”, I won’t be that surprised. If I hear “the flu”, I’ll be kind of surprised, but “I caught the flu” is a reasonable thing to hear; it’s just a bit of a non sequitur after “My friend Jim threw a ball”. But if the next word were “green” or “solitude”, I’d be really confused. I suspect this is the only time anyone has ever written the sentence “My friend Jim threw a ball and I caught solitude”.

I started out describing a way to predict text, but it’s easy to turn that into a way to produce text. For instance, we could start with a prompt, and have our model keep supplying the most-likely next word until we’ve written enough. This is a fancier version of the memes that ask you to type “I hate it when” into your phone and see what autocomplete suggests. I tried that prompt on my phone, and got this:

Phone screenshot: I hate it when I get home I will be there in about half hour and a half hour and half an hour and a half hour and half of the day off I usually don’t hate it when I get home, actually.

And this illustrates the problem with that first suggestion: if you always take the most likely next word, you can get stuck. Even if you don’t wind up in a loop like that one, you’ll still say pretty boring things, since your writing is always as unsurprising as possible. Actual text-generation systems introduce some random noise parameters so that you always have a fairly likely word, but not the most likely word.

GPT works surprisingly well

This basic idea has been around for decades, but in 2017 a team at Google developed a new algorithm called the transformer that worked much better than any previous strategies; since then, the technology has developed rapidly.

Already in 2019 we could produce substantial quantities of fluent, grammatical, and sometimes even stylish English text. The newest products are even more impressive. They can give helpful answers to questions in a number of fields, including finance, medicine, law, and psychology. They can summarize the contents of research papers. They can make you fall in love.

They can also play the world’s most chaotic game of chess. Here ChatGPT is playing black.

And this success has led people to wonder what comes next. How good will AI chatbots get? Will they make make it impossible to avoid cheating on schoolwork? Will they replace your doctor, your lawyer, or your therapist? Will they make desk jobs obsolete?

Are they self-aware? Are they intelligent beings?

Does GPT really think?

The most obvious take on GPT is that it can’t think; it’s just expressing statistical relationships among words. In the narrowest sense, this is certainly true; it’s just a very sophisticated technology for predicting what words should come next in a string of text.

And since it’s just doing prediction, it should be very limited in what it can do. GPT won’t produce original thoughts; it can only express relationships that are already in the text it has used as input. Thus we see Ted Chiang’s summary that ChatGPT provides a blurry jpeg of the web:

Large language models identify statistical regularities in text. Any analysis of the text of the Web will reveal that phrases like “supply is low” often appear in close proximity to phrases like “prices rise.” A chatbot that incorporates this correlation might, when asked a question about the effect of supply shortages, respond with an answer about prices increasing. If a large language model has compiled a vast number of correlations between economic terms—so many that it can offer plausible responses to a wide variety of questions—should we say that it actually understands economic theory?

GPT has simply taken a bunch of words, summarized the relationships expressed by those words, and doing some sort of fuzzy pattern-matching and extrapolation from those relationships. There’s no creative thought. And most of the scary samples you’ve seen are this sort of pattern-matching. Microsoft’s chatbot says it wants to be human and threatens to kill people because we have tons of fiction about AIs that want to be human and threaten to kill people, and it’s just imitating that.

Do humans really think?

But, the rejoinder comes: are people any different? Humans are just doing fuzzy pattern-matching and imitating behavior we’ve seen…somewhere. So sure, GPT is just saying things that sound good based on what it’s read, but that’s also what people do most of the time. ChatGPT can do a good job of producing mediocre high school essays because it really is doing the same thing a mediocre high school essayist is doing!

And I think this is basically true—sometimes. A lot of human communication is basically just unreflective pattern-matching, saying things that sound good without really thinking about what they mean. When I make small talk with the cashier at target, I’m not engaging in a deep intellectual analysis of how to best describe my day. I’m just making small talk!

I also see this thoughtless extrapolation all the time while teaching college students. When students ask for help and I look at their work, it’s common for there to be steps that just don’t make any sense. And when I ask them why they did that, they don’t know. They’ll say something like “I don’t know, it just seemed like a thing to do?”

And that’s not even always a bad thing. If I type “3+5”, most of you will probably say “8” to yourselves before consciously deciding to do the addition; if I say “the capital of France”, you probably find “Paris” popping into your mind without any active deliberation. It’s hard to explain how you answered those questions, because you just know. And that’s great, because it means you don’t have to stop and think and work to get the answer.

Of course, this quick-and-easy thinking doesn’t always give the right answer. If I hear “the capital of Illinois”, my immediate reaction is “Chicago”. (It’s Springfield. I was pretty sure Chicago was the wrong answer, but it’s still the first one my brain supplied.) And if I hear “537 times 842”, my immediate reaction is—well, my immediate reaction is “ugh, do I have to?” I know I could work that out if I need to. But I’d rather not. It’s certainly not automatic.

So yes, humans in fact do a lot of pattern-matching and extrapolation. But we also do more than that. We can look at the results of our mental autocomplete and ask, “does this really make sense?”. We can do precise calculations that take effort and focus. We can hold complex ideas in our heads with far-removed long-term goals, and we can subordinate our free association to those complex ideas. We can, really and truly, think.

Thinking is hard.

We can think carefully, but that doesn’t mean we always do. Right after the original release of GPT-2, in February 2019, Sarah Constantin wrote a piece arguing that Humans Who Are Not Concentrating Are Not General Intelligences. She observed that GPT text looks a lot like things people would write—if you don’t read them carefully. But the more attention you pay, the more they fall apart.

If I just skim, without focusing, [the GPT passages] all look totally normal. I would not have noticed they were machine-generated. I would not have noticed anything amiss about them at all.

But if I read with focus, I notice that they don’t make a lot of logical sense.

…

So, ok, this isn’t actually human-equivalent writing ability…. The point is, if you skim text, you miss obvious absurdities. The point is OpenAI HAS achieved the ability to pass the Turing test against humans on autopilot.

So the synthesis is: large language models like GPT can talk, and perhaps “think”, as well as a person who isn’t paying attention to what they’re saying. And it makes lots of errors for the same reason you can find multiple reddit threads about thoughtlessly saying “you too” in inappropriate situations. We say it because it feels right—and only afterwards do we realize it definitely isn’t.

System 1 and System 2

In Thinking Fast and Slow, Daniel Kahneman writes about how human reasoning splits into two basic systems. System 1 reasons quickly and efficiently, but operates essentially on reflex. System 2 slower and takes more energy, but can engage in careful, reflective thought. When asked for the capital of Illinois, my System 1 says “Chicago”, and then my System 2 says “no, wait, isn’t it actually Springfield?”.

Psychologists studying these two systems have found some classic puzzles that illustrate the difference really well. While reading the next question, pay attention to all the thoughts you have, and not just the final answer you come up with.

Question: Suppose a bat and a ball cost $1.10 together, and the bat costs a dollar more than the ball. How much does the ball cost?

If your brain works like mine—or most people’s—the first thing you hear yourself thinking is “ten cents”. But if you think more carefully, or pull out a pen and some paper to do work, you’ll realize that’s wrong; the ball should in fact cost five cents.² But even if you got it right, the wrong answer probably occurred to you first.³ That’s your System 1 contributing a guess, before your system 2 kicks in and corrects it.

And it seems to me that systems like GPT are implementing something like our System 1, but not System 2. And most of the time, when we’re not paying attention, we’re only using System 1—we’re just being human chatbots. But we are capable of using our System 2 to reflect on what we really mean, and transcend pure pattern-matching. It really seems like GPT can’t—and that it will never be able to without some genuinely new idea that we don’t yet have.

Probing questions

Now, when I gave ChatGPT the bat-and-ball problem, it got it right, in a really careful and thorough way.

Click here to see ChatGPT's answer to the bat-and-ball problem.

Let’s use algebra to solve this problem.

Let x be the cost of the ball in dollars.

According to the problem statement, the bat costs one dollar more than the ball, so its cost is x + 1.

Together, the bat and ball cost $1.10, so we can write the equation:

x + (x + 1) = 1.10

Simplifying the equation:

2x + 1 = 1.10

Subtracting 1 from both sides:

2x = 0.10

Dividing both sides by 2:

x = 0.05

Therefore, the ball costs $0.05.

But that doesn’t provide any evidence that ChatGPT is thinking carefully, because this specific problem has been discussed at great length in the source text. It has a similar reaction to mine: “oh crap, it’s the bat-and-ball problem, better be careful so I don’t screw up.”⁴

But right after introducing the bat-and-ball problem, Thinking Fast and Slow gives another example question that System 1 tends to fail on. This one gets discussed much less often. And this time, ChatGPT fails too.

This answer is wrong: the syllogism isn’t valid at all. All roses are flowers, but not all flowers are roses; it’s logically possible that some flowers fade quickly, and other flowers are roses, but none are both roses and quickly fade. ChatGPT doesn’t have a ton of examples of this puzzle in its training data, so it doesn’t know the answer, and it has no ability to figure it out on its own. It gets tricked, just like we do.

Un-trick questions

Over the past couple days I’ve seen a particularly amusing genre of GPT-trolling questions: ask it something that sounds like a trick question, but isn’t.

When I was a child, I thought it was fun to ask people: “which is heavier, a pound of feathers or a pound of bricks?” Like with our other examples, most people instinctively want to say that the pound of bricks is heavier, but with a bit of thought that’s obviously wrong.

Again, this is a famous trick question, and again, ChatGPT generally gets it right. But some clever person on Twitter got the bright idea to ask it to compare one pound of feathers to two pounds of bricks.

Presumably GPT basically said “oh, crap, this is the feathers-and-bricks thing again. I know it’s a trick question, because every time people have asked this it’s been a trick question, and they actually weigh the same”. And it totally ignores the actual numbers in the question.

And this generalizes: there are a few other examples of posing variations famous puzzles that have the trick removed. GPT gets them wrong, because it knows there’s a trick because there’s always a trick when people bring up the Monty Hall problem.

This one is my favorite; I laugh every time I read it.

Still not human

Now, you shouldn’t take the specifics too seriously here. GPT is not human, and even truly intelligent AI might be intelligent in very not-human-like ways. We shouldn’t expect GPT’s capabilities to correspond exactly to the human System 1. If nothing else, System 1 controls basic physical activities like walking, which is a notoriously hard robotics problem that GPT isn’t even interacting with at all. And ChatGPT gets the capital of Illinois right, which my System 1, at least, does not.

But using the split between System 1 and System 2 as a metaphor has really helped me structure how I think about GPT, and to understand how it can be so good at some things while completely incapable of others. “GPT can do the sort of things that we can do on autopilot, if we’ve read a lot and have a good memory” does seem to sum up most of its capabilities!

If they’re not smart, can they still be useful?

This all makes the new chatbots seem way less frightening to me. No, they’re not “really thinking”; they can do some of what people can do, but there are core capabilities they lack. They aren’t sapient: analytic self-reflection is exactly the thing they aren’t capable of. And it does seem like this is a fundamental limitation of the approach that we’re using.

Each new generation of chatbots is more fluent and more impressive, but the basic technology we’re using appears to have serious limits. I strongly suspect you just can’t get System 2-style analytic capabilities just by scaling up the current approach. (And that’s before we ask whether it’s even possible to keep scaling them up without using dramatically more text than actually exists in the world.

But that doesn’t just suggest a ceiling on how impressive GPT chatbots can get, or what capabilities they can develop. It also tells us how to use them!

Most of us spend some of our time doing real work, that requires thought and creativity. And we spend other time dealing with what feels like trivial bullshit, that has to get done but is boring and formulaic. The first type of task is the sort of thing GPT can’t do for us—not now, and I suspect not ever. But the boring, formulaic tasks are ripe for automation. And fortunately, they’re the ones I didn’t want to do anyway.

I’ve been experimenting with using ChatGPT to write homework problems. I wouldn’t want to use it for lecture notes, because for those I’m adding a lot of specific touches I think are important, and the details matter. But homework and test problems are largely rote—which is part of why I find writing them so tedious. I’m working on a separate writeup of how that’s going.
On the other hand, a friend who does online trainings is using it to draft lesson plans. She says she needs to tweak a lot of things but it does a really good job with the basic structure of a training.
A number of programmers I know are impressed by GitHub Copilot, which uses GPT to generate routine code from natural language descriptions, or refactor code in routine ways.
An author whose fiction I like⁵ is experimenting with it to replicate a game of telephone. How will people who weren’t at a major event describe it twenty or fifty years later? “Rewrite this short story as a passage from a history textbook” will not get all the details right but if you’re trying to create fallible in-universe secondary materials that’s a feature.

I’m sure this isn’t a complete list of what GPT-like technologies can do. And even if it takes a while for people to figure out what the technology is good for, I’m sure eventually we’ll find some real uses.

But I don’t believe the dramatic hype I’ve been hearing for the past month. GPT is cool, and fun, and maybe even useful. But it won’t take over the world.

What do you think about the new chatbots? Do you have a use for them I didn’t mention? Or do you think I’m wrong about everything? Tweet me @ProfJayDaigle or leave a comment below.

GPT-2 was released in February 2019, and GPT-3, which ChatGPT is based on, was released in June 2020. I’ve been at least peripherally following this technology since even before the release of GPT-2, so ChatGPT and Sydney are a lot less surprising to me than they are to a lot of people—they’re improved versions of something I was already familiar with. ↵Return to Post
If the ball costs \$0.10 then the bat would have to cost \$1.00, and would only cost ninety cents more; the correct answer is that the bat costs \$1.05 and the ball costs \$0.05. ↵Return to Post
Actually, at this point what my System 1 says is “oh crap, it’s the bat and ball problem again. Think carefully before you answer!” But that’s only from having seen this specific problem too many times; if you changed the setup basically at all, I’d think the wrong answer first, and then correct myself. ↵Return to Post
At least one person has fooled ChatGPT and gotten the wrong answer by changing the bat and ball to a bow and arrow. But every time I’ve tried I’ve gotten the right answer, with either version. ↵Return to Post
If you like superhero fiction, Interviewing Leather and Justice Wing: Plan, Prototype, Produce, Perfect are both really good. ↵Return to Post

Tags: math machine learning artificial intelligence gpt large language models

Maybe-Mathematical Musings

Math, Teaching, Literature, and Life

February 27, 2023

Recent Posts: