Bradley Robb | WTF is ChatGPT - a No BS Breakdown

Here is a rather lengthy overview of ChatGPT from me - a product guy who's been working with ML both personally and professionally for the past half dozen years or so.

What is ChatGPT?

ChatGPT is a text interface [1] that sits on top of a combination of 3* different models (LLM, Q&A, Reward). None of them are new, and combining them is obvious to those in the art, but the result is presented in a way that is approachable and magical to the average person.

ChatGPT isn’t magic. It isn’t sentient. It’s not generalized AI. But it will earn at least a mention in the history of Artificial Intelligence.

Right now, ChatGPT is a tech demo that is approaching being useful as a tool. As with any tool, it’s important to understand the shortcomings of the tool before using it.

Where does ChatGPT fall short?

1 The model gets things wrong when it matters

First, because it is built on a Large Language Model (LLM), ChatGPT is constantly predicting the next most statistically probable word. Statistically probable doesn’t always mean the right word, just the word that most frequently comes next in the text the model was trained on. This can lead the model to output things that are just wrong.

Noodling over when this can happen, I came up with the 3 scenarios that seem most likely to trip up the bot:

When general consensus differs from the truth. These are the “fun facts” that have popular misconceptions. This also extends to expert details. The general consensus might get the gist of a topic right, but that gist doesn’t always extend to finer points.
When the truth changes quickly. The LLM takes a very large computer a very long time to train, and ChatGPT has a blindspot to anything after some time in 2021. So for anything that’s moving quickly, whether in an established field or, say, in the news, the bot is left out in the cold.
When a novel idea is needed. Not novel as in books, but ‘novel’ as in ‘discovering something new’. The model is unable to generate these. Meta found this out when they trained a language model exclusively on scientific papers and the output was not scientific discovery, but rather scientific-sounding nonsense.

Takeaway - ChatGPT doesn’t actually know what it’s talking about; it just knows how to output popular opinions

2 The longer the output, the worse ChatGPT performs

ChatGPT, and language models in general, work by outputting the next probable word based on the words that have previously been seen.
Here’s an example:

We’re so close we finish each other’s…?

If you just mentally finished that sentence with “sentences” or “sandwiches” [Editorial aside - IYKYK], then congrats, that’s the basis of LLMs. It’s a brutal oversimplification, true, but the LLM here is generally asking “given the preceding words, what word has the highest probability of being next?”
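To make that concrete, here is a minimal sketch of next-word prediction using nothing but word-pair counts. It is nowhere near how GPT actually works internally (that is a neural network looking at far more than one previous word), and the three-line corpus and the function names are invented for illustration. The point is only to show "most frequent next word wins" in a dozen lines.

```python
from collections import Counter, defaultdict

# Toy corpus, invented for illustration only.
CORPUS = [
    "we finish each other's sentences",
    "we finish each other's sentences",
    "we finish each other's sandwiches",
]

def train_bigrams(lines):
    """Count, for every word, which words follow it and how often."""
    following = defaultdict(Counter)
    for line in lines:
        words = line.split()
        for current, nxt in zip(words, words[1:]):
            following[current][nxt] += 1
    return following

def predict_next(following, word):
    """Return the most frequent follower of `word` and its share of the counts."""
    counts = following[word]
    if not counts:
        return None, 0.0  # never saw this word followed by anything
    best, freq = counts.most_common(1)[0]
    return best, freq / sum(counts.values())

model = train_bigrams(CORPUS)
word, share = predict_next(model, "other's")
print(word, round(share, 2))  # sentences 0.67
```

Note that "sandwiches" loses purely because it shows up less often in the corpus, not because it is wrong.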

ChatGPT does that for every word it outputs, using the previous words to predict the next word. For 10 words or 20 words, this works pretty well. But as the predictions stack up, the likelihood that the model has a good, diverse set of words to draw from decreases. This leads to two scenarios:

On longer, more specific output, it becomes more likely that ChatGPT will just quote from a single source verbatim. I personally encountered this when asking about the origin of paprika and the bot started handing me content directly from Wikipedia.
On a long enough output, the model can resort to outputting what seems like gibberish, where the next most likely word or words are “stopwords” [2] or even punctuation. These are extremely common and thus, extremely probable outputs when all other uncommon words are equally improbable.

The less generous word for prediction is “assumption.” And as the model outputs text, each word is an assumption based on another assumption based on a different assumption. It’s easy to paint yourself into a corner when you’re on a stack of 100 assumptions and one of them was bad.
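A bit of back-of-the-napkin arithmetic shows how fast a stack of assumptions wobbles. This sketch assumes, purely for illustration, that each word is a "good" prediction 99% of the time and that the predictions are independent of each other. Neither number nor assumption comes from OpenAI, and real models are messier, but the shape of the curve is the point.

```python
def chance_all_good(per_word_accuracy, length):
    """Probability that every one of `length` independent predictions is good."""
    return per_word_accuracy ** length

# Illustrative accuracy of 99% per word; an assumption, not a measured figure.
for n in (10, 20, 100, 500):
    print(n, round(chance_all_good(0.99, n), 3))
```

Under those assumptions, the odds of a clean run are still around 90% at 10 words, but drop to roughly 37% by the time you are 100 words deep.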

Takeaway - for the best results, keep your output cycles short

3 The output is, by definition, average at best

As noted a bunch of times above, the way the LLM works is by predicting the most probable word to output based on the words that came before it. Another word for probable is popular or common.

This leads to output of, well, middling quality at worst and serviceable at best. This isn’t to say that the content is bad. But because LLMs can’t generate novel ideas and have only a probabilistic understanding of their output, the results are going to be average.

If you’re using an LLM to generate code, you’re not going to get the best way to code a function, you’re going to get the most common way to code a function.

If you’re using ChatGPT to write a song, you’re not going to get the deepest, most nuanced lyrics, you’re going to get the most predictable ones. Just ask Nick [3].

When the reader encounters a string of predictable words, they start to skim.

Takeaway - ChatGPT isn’t ready for outputting content intended to generate high levels of engagement

ChatGPT is an iPhone Moment for Natural Language Applications

After three pages and 700 words of criticism, you might think I’m not a fan of ChatGPT - but you’d be wrong. I think ChatGPT is a watershed moment akin to the first iPhone.

If you recall the first iPhone - it too was a combination of technologies that all existed already and it had some famous flaws - no 3G, no ability to add software, locked to a single carrier - that drew a lot of rightful criticism.

But products evolve, and it takes time for the optimal use cases to work themselves out (remember when the “drink a fake beer” app was a fundamental demonstration of how cool the iPhone was?).

I see ChatGPT being akin to that. For most people, ChatGPT is the closest to the magical promise of AI they have ever seen. It can be tricky and surprisingly lifelike - even if it’s just math underneath the hood.

For those of us involved or interested in the field, it’s a tempered excitement. The criticisms are intended to be areas for improvement, milestones on the product roadmap we’d implement. I know my mind is already swimming with potential uses, and they’re very different from generating song lyrics in the style of Nick Cave.

Oh, and one more thing…

The part that gets me most excited about ChatGPT is that it’s apparently very quick to train, requiring less data to achieve more new things. It’s not generalized, but it’s a step in the right direction.

Generalization, not sentience, is probably the most exciting thing in ML. We’ve gotten good at training models that do narrow things - like diagnosing lung cancer in X-rays. What we’re still learning to do is take that lung cancer model and make it also good at finding brain cancer. Or breast cancer.

Bonus - a No BS look at the 3ish models that make up ChatGPT

Tech #1 - The Large Language Model (LLM)

GPT-3.5, the LLM that forms the basis for ChatGPT’s output, is an immense model built on a huge portion of writing. Like almost all of the writing. Like, ever. So, GPT takes much of the human library and asks: given the previous words, what word comes next? And then next. And then the one after that.

The GPT-3.5 model has roughly 175B parameters, and GPT-4 is rumored to be 500 times larger.

Tech #2 - The Q&A Model

Imagine being locked in a room with just a dictionary that translates Cat Words into Dog Language. All day you’re given slips of paper written in Cat and you consult the book, find the matching Dog, write that down and hand it back.

You have no idea what the original Cat message says or if the Dog translation is correct. But you’re told when you’re right and when you’re wrong. After a while of doing that, you can successfully translate Cat into Dog. Fluently. Quickly.

And now you have a job skill where you can translate one language into another language and speak neither of them.

That’s how ChatGPT answers questions.

A model was given a ton of Questions and Answers, access to GPT-3.5, and told to figure out how to answer the questions. After a ton of computer time, it got good, despite not speaking any language or understanding anything it’s outputting. The locked-room setup above is frequently referred to as The Chinese Room Argument [4].

Tech #3 - The Reward Model

During training, before ChatGPT was released to the world, it didn’t just give out one probabilistically derived answer per training question, but several. A human graded these answers and sent them back. Getting things right was considered a reward - a little computer treat.

Just like with the Q&A model, ChatGPT sought to find the optimal path to those treats. Just like training a dog, ChatGPT wanted the positive reinforcement of providing the answers that would get rewards, and it employed a model that predicted which of the several potential outputs would likely lead to rewards.
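The "predict which output earns the treat" step can be sketched as a scoring function plus a pick-the-winner step. Everything here is hypothetical: `toy_reward` is a hand-written rule standing in for a reward model that, in the real system, is learned from those human grades, and the candidate answers are made up. This is a sketch of the idea, not OpenAI's implementation.

```python
def toy_reward(answer):
    """Hypothetical stand-in for a trained reward model: scores one answer.

    Invented rule: reward answers that mention the key fact and
    penalize rambling. A real reward model is learned from human grades.
    """
    score = 0.0
    if "paris" in answer.lower():
        score += 1.0
    score -= 0.01 * len(answer.split())  # small penalty per word
    return score

def pick_best(candidates, reward_fn):
    """Return the candidate the reward function scores highest."""
    return max(candidates, key=reward_fn)

# Several candidate answers to "What is the capital of France?"
candidates = [
    "The capital of France is Paris.",
    "France is a country in Europe with many cities.",
    "I believe it might be Lyon.",
]
best = pick_best(candidates, toy_reward)
print(best)  # The capital of France is Paris.
```

Swap in a different scoring function and a different answer wins, which is the whole trick: the bot learns to chase whatever the reward model likes.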