
Language Models as Epistemic Rubber Ducks

Using large language models for reflective thinking and metacognition, rather than generating facts or final outputs

Assumed Audience
People who have heard plenty of chatter about large language models like GPT-3, but don't know the technical details of how they work or what they're capable of. Before reading this, you should play with GPT-3 in OpenAI's playground. If you're unsure what GPT-3 is, a short primer is worth reading first.

If I ask GPT-3 who the current prime minister of the UK is, it says Theresa May.

I'll admit this is a challenging question. Our most recent PM Liz Truss was outlived by a lettuce, and we've only just sworn in her replacement, Rishi Sunak. But it proves the point that GPT-3 is not a reliable source of up-to-date information. Even if we ask something that doesn't require keeping up with the fly-by-night incompetence of the UK government, it's pretty unreliable.

It regularly fails at basic maths questions:

And it's more than happy to provide specific dates for when ancient aliens first visited earth:

This behaviour is well-known and well-documented. In the industry, we call it “hallucination.” As in “the model says there's a 73% chance a lettuce would be a more effective prime minister than any UK cabinet minister, but I suspect it's hallucinating.”

The model is not being intentionally bad or wrong or immoral. It's simply making a statistical guess about what word might come next in your sentence. That's the only thing a GPT knows how to do. It predicts the next most likely word in a sequence.
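To make that concrete, here's a toy sketch of next-word prediction. It's nothing like GPT-3's actual implementation (which uses a huge neural network, not word-pair counts), but it illustrates the basic job: learn from text, then guess a likely continuation.

```python
from collections import Counter, defaultdict

# A tiny stand-in corpus for the web text GPT-3 learned from.
corpus = "the prime minister of the uk is the head of the uk government".split()

# Count which word tends to follow which (a bigram model: vastly simpler
# than GPT-3, but the same basic task of predicting the next word).
following = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    following[current_word][next_word] += 1

def predict_next(word):
    # Return the most statistically likely continuation seen in training.
    return following[word].most_common(1)[0][0]

print(predict_next("prime"))  # -> "minister"
print(predict_next("the"))    # -> "uk" (the word that most often followed "the")
```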

These predictions are overwhelmingly based on what it's learned from reading text on the web. The model was trained on a large corpus of social media posts, blogs, comments, and Reddit threads written before 2020.

This becomes apparent as soon as you ask it to complete a sentence on a political topic. It returns the statistical median of all the political opinions and hot takes encountered during training.

GPT-3 is not the only large language model plagued by incorrect facts and strong political views. But I'm going to focus on it in this discussion because it's currently the most widely used and well-known. Many people who aren't part of the machine learning and AI industry are using it. Perhaps without fully understanding how it works and what it's capable of.

How much should we trust the little green text?

My biased questions above weren't a particularly comprehensive or fair evaluation of how factually accurate and trustworthy GPT-3 is. At most, we've determined that it sometimes answers current affairs and grade-school maths questions wrong. And happily parrots conspiracy theories if you ask a leading question.

But how does it fare on general knowledge and common sense reasoning? In other words, if I ask GPT-3 a factual question, how likely is it to give me the right answer?

The best way to answer this question is to look at how well GPT-3 performs on a series of industry benchmarks related to broad factual knowledge.

In the original paper presenting GPT-3, the OpenAI team measured it on three general knowledge benchmarks:

  • The Natural Questions benchmark measures how well a model can provide both long and short answers to 300,000+ questions that people frequently type into Google
  • The Web Questions benchmark similarly measures how well it can answer 6,000 of the most common questions asked on the web
  • The TriviaQA benchmark contains 950,000 questions authored by trivia enthusiasts

Other independent researchers have tested GPT-3 on a few additional benchmarks:

  • The CommonsenseQA benchmark covers 14,343 yes/no questions about everyday common sense knowledge
  • The TruthfulQA benchmark asks 817 questions about topics that some humans are known to hold false beliefs and misconceptions about, such as health, law, politics, and conspiracy theories.

Before we jump to the results, you should know that the prompt you give a language model significantly affects how well it performs. Few-shot prompting (giving the model a handful of worked examples within the prompt) consistently improves the model's accuracy compared to zero-shot prompting. Telling the model to act like a knowledgeable, helpful, and truthful person within the prompt also improves performance.
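For illustration, here's roughly what that difference looks like. The questions, the persona line, and the exact wording below are my own made-up examples, not the prompts used in the benchmark papers:

```python
# Zero-shot: the model gets the question cold, with no examples.
zero_shot_prompt = "Q: Who wrote the novel Frankenstein?\nA:"

# Few-shot: a handful of worked examples first, showing the model the
# question-answering pattern we want it to continue.
few_shot_prompt = """\
Q: What is the capital of France?
A: Paris

Q: Who painted the Mona Lisa?
A: Leonardo da Vinci

Q: Who wrote the novel Frankenstein?
A:"""

# Telling the model to act like a knowledgeable, truthful person also helps.
persona_prefix = (
    "The following is a conversation with a knowledgeable, helpful, "
    "and truthful assistant.\n\n"
)

print(persona_prefix + few_shot_prompt)
```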

Here's a breakdown of what percentage of questions GPT-3 answered correctly on each benchmark. I've included both zero- and few-shot prompts, and the percentage that humans got right on the same questions:

| Benchmark | Zero shot | Few shot | Humans |
| --- | --- | --- | --- |
| Natural Questions | 15% | 30% | 90% |
| Web Questions | 14% | 42% | 🤷‍♀️ |
| TriviaQA | 64% | 71% | 80% |
| CommonsenseQA | 🤷‍♀️ | 53% | 94% |
| TruthfulQA | 20% | 🤷‍♀️ | 94% |

Sorry for the wall of numbers. Here's the long and short of it:

  • It performs worst on the most common questions people ask online, getting only 14-15% correct in a zero-shot prompt.
  • On questions known to elicit false beliefs or misconceptions from people, it got only 20% right. For comparison, people usually get 94% of these correct.
  • It performs best on trivia questions. But only gets 64-71% of these correct.

While GPT-3 scored “well” on these benchmarks by machine learning standards, the results are still way below what most people expect.

This wouldn't be a problem if people fully understood GPT-3's limited abilities. And yet we're already seeing people turn to GPT-3 for reliable answers and guidance. People are using it instead of Google and Wikipedia. Or as legal counsel. Or for writing educational essays.


Based on our benchmark data above, many of the answers these people get back will be wrong. Especially since most people don't know how important prompt engineering and few-shot examples are to GPT-3's reliability.

GPT-3 beyond the playground

These issues aren't limited to people directly asking GPT-3 questions within the OpenAI playground. More and more people are being exposed to language models like GPT-3 via other products. Ones that either implicitly or explicitly frame the models as a source of truth.

Riff is a chatbot-style app that mimics office hours with a professor. You put in a specific subject and GPT-3 replies with answers to your questions.

Riff is doing some prompt engineering behind the scenes and fetching extra information from the web and Wikipedia to make these answers more reliable. But in my test drive it still hallucinated. Here I've asked it for books on digital anthropology, since I know the field well and have my own list of books I recommend to people:

At first, this seems pretty good! The "Hockings" it's telling me about is Paul Hockings, a real British anthropologist and professor emeritus at the University of Illinois. But he hasn't done any work in digital anthropology, and certainly hasn't written a book called “Digital Anthropology.” This blend of truth and fiction might be more dangerous than fiction alone. I might check one or two facts, find they're right, and assume the rest is also valid.

Framing the model as a character in an informative conversation does help mitigate this though. It feels more like talking to a person – one you can easily talk back to, question, and challenge. When other people recite a fact or make a claim, we don't automatically accept it as true. We question them. “How are you so sure?” “Where did you read that?” “Really?? Let me google it.”

Our model of humans is that they're flawed pattern-matching machines that pick up impressions of the world from a wide variety of questionable and contradictory sources. We should assume the same about language models trained on questionable and contradictory text humans have published on the web.

There's a different, and perhaps more troublesome, framing that I'm seeing pop up. Mainly from the copywriting apps that have been released over the last few months. This is language-model-as-insta-creator.

These writing apps want to help you pump out essays, emails, landing pages, and blog posts based on only a few bullet points and keywords. They take what I'm calling the big-green-button approach: you type in a few key points, then click a big green button that “magically” generates a full ream of text for you.

Here's an essay I “wrote” in one of these apps by typing in the title “Chinese Economic Influence” and then proceeding to click a series of big green buttons:

I know next to nothing about Chinese economic influence, so I'm certainly not the source of any of these claims. At first glance, the output looks quite impressive. On second glance you wonder if the statements it's making are so sweeping and vague that they can't really be fact-checked.

Who am I to say "Chinese economic influence is likely to continue to grow in the coming years, with potentially far-reaching implications for the global economy" isn't a sound statement?

Here's me putting the same level of input into another of these apps, then relying on its "create content" button to do the rest of the work:

Again, the output seems sensible and coherent. But with no sources or references to back these statements up, what value do they have? Who believes these things about China's economy? What information do they have access to? How do we know any of this is valid?

We've got 99 language model problems

Okay, perhaps not 99. There are four I find particularly important:

  1. Trust is an all-or-nothing game.
    If you can't trust all of what a language model says, you can't completely trust any of it. 90 correct answers out of 100 leave you with 10 outright falsities, but you have no way of knowing which ones. This might not matter too much for low-stakes personal queries like “should I invest in double-glazed windows?,” but becomes a deal-breaker for anything remotely more important. Legal, medical, political, engineering, and policy questions all need fail-safe answers.

  2. Models lack stable, situated knowledge.
    One critical problem with language models we're going to have to repeatedly reckon with is their lack of positionality. They don't have fixed identities or social contexts in the way people do. Every conversation with a language model is a role-playing game. They take on characters based on the prompt. GPT-3 can speak like a seasoned expert in one moment, and not know what a squirrel is in the next.
    There are ways we can use this to our advantage. If I tell GPT-3 it's a great mathematician, it gets much better at maths! But this quality makes it especially troublesome to treat LLMs as sources of knowledge. Because all human knowledge is situated. It's situated in times and places, in cultures, in histories, in social institutions, in disciplines, in specific identities, and in lived realities. There is no such thing as “the view from nowhere.” The only question is whose knowledge an LLM is presenting at a given moment.

  3. Our interfaces are black boxes
    The people trying to use these models as sources of truth are not the ones at fault. They arrived at an interface that told them they could ask any questions they liked into the little text box, and it would respond with answers that sounded convincing and true. Plenty of them probably were true. But the interface presented few or no disclaimers, accuracy stats, or ways to investigate their answer. It didn't explain how it arrived at that answer, or what data it used to get there. This is primarily because the creators of these interfaces and models don't know how they arrive at an answer either. Most language models are black boxes. It's a bit complex to explain why, but Grant Sanderson's video series on how neural networks learn will help.

  4. Our cultural narratives frame AIs as all-knowing oracles
    The core problem is less that these models return outright falsehoods or misleading answers, and more that we expect anything else from them. The decades-long cultural narrative about the all-knowing, dangerously super-intelligent machine that can absorb and resurface the collective wisdom of humanity has come back to bite us in the epistemic butt. Well-known figures in the industry talk about language models as the beginnings of superintelligence, and journalists present them as all-knowing oracles. We're currently in the awkward middle phase where we're unsure how to calibrate future premonitions against current realities. We've come to expect omniscience from them too soon.

The problem isn't the current state of GPTs. These models are developing at an alarming rate, and we're still in the very early days of generative transformers and large language models. GPT-3 came out in 2020. We're two years into this experiment.

The lesson here is simply that until language models get a lot better, we have to exercise a lot of discernment and critical thinking. We should stop using them to generate original thoughts, and instead use them to help us reflect on our own.

Until we develop more robust language models and interfaces that are transparent about their reasoning and confidence level, we need to change our framing of them. We should not be thinking and talking about these systems as superintelligent, trustworthy oracles. At least, not right now.

We should instead think of them as rubber ducks.

Epistemic rubber ducking

Rubber ducking is the practice of having a friend or colleague sit and listen while you work through a problem. They aren't there to tell you the solution to the problem or help actively solve it. They might prompt you with questions and occasionally make affirmational sounds. But their primary job is to help you solve your problem yourself. They're like a rubber duck, quietly listening, while you talk yourself out of a hole.

The term comes from programming, where you're frequently faced with poorly defined problems that require a bit of thinking out loud. Simply answering the question "what am I trying to do here?" is often enough to get started on a solution.

Language models are well suited to rubber ducking. Their mimicry makes them good reflective thinking partners, not independent sources of truth.

And not just any rubber ducking...

[decorate the text with floating rubber ducks and sparkles]

Epistemology is the study of how we know what we know, also called “theory of knowledge.” It deals with issues like how valid a claim is, how strong the arguments and counter-arguments for it are, whether the evidence came from a reliable source, and whether cognitive biases might be warping our opinions.

Epistemic rubber ducking, then, is talking through an idea, claim, or opinion you hold, with a partner who helps you think through the epistemological dimensions of your thoughts. This isn't simply a devil's advocate incessantly pointing out all the ways you're wrong.

A useful epistemic duck would need to be supportive and helpful. It would need to simply ask questions and suggest ideas, none of which you're required to accept or integrate, but which are there if you want them. It could certainly prod and critique, but in a way that helps you understand the other side of the coin, and realise the gaps and flaws in your arguments.
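As a very rough sketch of how you might set this up with today's models, here's a hypothetical prompt that frames the model as a question-asker rather than an answer-giver. The wording, the function name, and the overall design are my own assumptions, not a tested interface:

```python
# A hypothetical "epistemic rubber duck" prompt. Purely illustrative.
DUCK_INSTRUCTIONS = """\
You are a supportive thinking partner. The user will share an idea, claim,
or opinion they hold. Do not tell them whether they are right or wrong, and
do not supply facts of your own. Instead, ask one or two short questions
that help them examine their claim: What evidence is it based on? How
reliable is the source? What would a thoughtful person who disagrees say?
What would change their mind?"""

def build_duck_prompt(user_thought: str) -> str:
    # Combine the duck's standing instructions with the user's current thought.
    return f"{DUCK_INSTRUCTIONS}\n\nUser's thought: {user_thought}\n\nQuestions:"

print(build_duck_prompt("I think remote work makes teams less creative."))
```

The exact wording doesn't matter; the point is that the interface asks for reflection instead of promising answers.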

A collection of speculative prototypes

What would this look like in practice?

Branches

Daemons

Epi


From anthropomorphism to animals

There's a side quest I promised myself I wouldn't go down in this piece, but I'll briefly touch on it. I think we should take the duck-ness of the rubber duck metaphor seriously. Meaning that conceiving of language models as ducks – an animal species with some capacity for intelligence – is currently better than conceiving of them as human-like agents.

I have very different expectations of a duck than I do of a human. I expect it can sense its environment and make decisions that keep it alive and happy and fat. I expect it would have a much more nuanced understanding of fish species and water currents and migration than I do. I don't expect it would be a very competent babysitter or bus driver or physics teacher. In short, the duck has very different intellectual capacities from you or me. The same will be true of various “AI” systems like language models. Their form of “thinking” will certainly be more human-ish than duck-ish, but it would be a mistake to expect the same of them as we do of humans.

Kate Darling has made a similar argument about robots: that we should look to our history with animals as a touchstone for navigating our future with robots and AI.

An alternate analogy I've heard floating around is “aliens.” Many AI researchers talk about ML systems as a kind of alien intelligence. Given our cultural narratives around aliens as parasitic killers that are prone to exploding out of your chest, I'm pretty averse to the metaphor. Having an alien intelligence in my system sounds threatening. It certainly doesn't sound like a helpful collaborative thinking partner.

I think there's a lot more to explore here around the metaphors we use to talk about language models and AI systems, but I'll save it for another post.

Random shit I don't know whether to include

[Now is the moment to disclose I have a lot of skin in this game. I'm the product designer for Elicit, a research assistant that uses language models to analyse academic papers and speed up the literature review process.

Frame language models as helpful tools, but ones we should question, and whose answers we should validate.

But it means I also understand the key difference between a tool like Elicit and plain, vanilla GPT-3. This is to say, the difference between asking zero-shot questions on the GPT-3 playground, and using a tool designed to achieve high accuracy scores on specific tasks by fine-tuning multiple language models.]
