What a language model is, and how it generates text

It's tempting to picture an AI model as a very well-read librarian: a thing that understands your question, looks up the answer, and reports back. That picture will mislead you.

A next-token predictor

Here's a more accurate one. A large language model is a system trained to do a single, narrow thing extremely well: given some text, predict what comes next.

A language model takes text and returns a ranked list of likely next tokens

That's the core of it. You give it a stretch of text, "The cat sat on the", and it produces a ranked list of likely continuations: "mat," "floor," "sofa," each with a probability. It picks one, and that's the output.
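To make the ranked list concrete, here is a minimal sketch. The probabilities are invented for illustration rather than taken from a real model, and the helper names `rank` and `pick` are hypothetical. It also assumes the model picks by sampling in proportion to probability, which is one common approach, not the only one.

```python
import random

# Hypothetical probabilities for the token after "The cat sat on the".
# These numbers are made up for illustration, not real model output.
next_token_probs = {"mat": 0.55, "floor": 0.25, "sofa": 0.15, "moon": 0.05}


def rank(probs):
    """Return (token, probability) pairs, most likely first."""
    return sorted(probs.items(), key=lambda pair: pair[1], reverse=True)


def pick(probs, rng):
    """Sample one token in proportion to its probability (an assumed strategy)."""
    tokens = list(probs)
    weights = [probs[token] for token in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


ranked = rank(next_token_probs)
chosen = pick(next_token_probs, random.Random(0))

for token, probability in ranked:
    print(f"{token:<6} {probability:.2f}")
print("The cat sat on the", chosen)
```

Because the pick is a weighted draw, the top-ranked token wins most often but not always, which is one reason the same prompt can produce different replies.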

The word "large" is doing real work. These models train on an enormous amount of text and carry billions of internal parameters, tunable numbers adjusted during training until the predictions get good. Train a big enough model on enough text, and "predict what comes next" turns out to produce writing that's fluent, on topic, and often useful.

So where does answering questions come from? The same mechanism. When you ask "What's a good subject line for a launch email?", the model isn't retrieving a stored answer. It's predicting the text that would plausibly follow your question. And text that plausibly follows a good question tends to be a good answer. The capability is a side effect of getting very good at prediction.

This reframe matters for everything ahead. The model has no database of facts to look up and no internal sense of being right. It produces the most likely continuation of the text it's been given. Most of the time that's exactly what you want. Sometimes the most plausible-sounding continuation is wrong, and the model has no way to tell, a problem you'll meet head-on by the end of this lesson.

One token at a time

A reply isn't composed all at once. It's built the same way it's predicted: one piece at a time, left to right. The streaming you've seen in chat tools isn't an animation bolted on for effect. It mirrors how the model actually produces text.

First, the term. A token is a chunk of text the model reads and writes in. Sometimes a token is a whole word; often it's part of one. "Writing" might be a single token, while "tokenization" might be three. As a rough rule, a token is about four characters of English. The model's whole world is a stream of these tokens.
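The four-characters rule is enough for a back-of-the-envelope estimate. The sketch below is only that estimate: `estimate_tokens` is a hypothetical helper, not a real tokenizer, and actual counts vary by model and by text.

```python
import math

CHARS_PER_TOKEN = 4  # rough rule for English text; real tokenizers vary


def estimate_tokens(text: str) -> int:
    """Estimate a token count from character count alone."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


sentence = "The cat sat on the mat."
print(len(sentence), "characters, roughly", estimate_tokens(sentence), "tokens")
```

Treat the result as a budgeting aid. For an exact count you would need the tokenizer that belongs to the specific model.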

Here's the loop:

The model predicts a token, appends it, and feeds the longer text back in to predict the next

The model takes everything so far (at first, just your prompt) and predicts the next token. It appends that token to the text, then runs again on the new, slightly longer text to predict the token after that. Predict, append, repeat. Each pass feeds the previous output back in as input.
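The predict, append, repeat cycle can be sketched in a few lines. This is a toy under stated assumptions: the lookup table `NEXT` stands in for the model and is invented for illustration, and it looks only at the last token, whereas a real model conditions on the whole text so far.

```python
# Toy stand-in for a model: maps the last token to its "most likely" successor.
# The table is invented for illustration.
NEXT = {
    "The": "cat",
    "cat": "sat",
    "sat": "on",
    "on": "the",
    "the": "mat",
    "mat": ".",
}


def predict_next(tokens):
    """Predict the next token from the text so far (toy: uses only the last token)."""
    return NEXT[tokens[-1]]


def generate(prompt_tokens, steps):
    """Run the predict-append loop for a fixed number of steps."""
    tokens = list(prompt_tokens)
    for _ in range(steps):
        token = predict_next(tokens)  # predict from everything so far
        tokens.append(token)          # append, so the next pass sees longer text
    return tokens


print(" ".join(generate(["The", "cat"], steps=4)))
```

Each call to `predict_next` receives the output of the previous pass as part of its input, which is the feedback described above.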

This is why responses stream so naturally. The token-by-token delivery you'll wire up later isn't a trick to keep users entertained. You're sending each token to the browser the moment the model commits to it, instead of making the user wait for the whole reply.
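As a rough illustration of the difference, the sketch below contrasts waiting for the whole reply with passing tokens along as they arrive. `fake_model_tokens` is a hypothetical stand-in for a model, and printing stands in for sending to a browser. It is not the wiring a real API client would use.

```python
from typing import Iterable, Iterator


def fake_model_tokens() -> Iterator[str]:
    """Hypothetical stand-in for a model that commits to one token at a time."""
    for token in ["Pre", "dict", ",", " app", "end", ",", " repeat", "."]:
        yield token


def wait_for_whole_reply(tokens: Iterable[str]) -> str:
    """Non-streaming: nothing is available until every token has been produced."""
    return "".join(tokens)


def stream_reply(tokens: Iterable[str]) -> Iterator[str]:
    """Streaming: hand each token on the moment it exists."""
    for token in tokens:
        yield token


received = []
for chunk in stream_reply(fake_model_tokens()):
    received.append(chunk)
    print(chunk, end="", flush=True)  # stand-in for sending to the browser
print()
```

Both paths end with the same text. The only difference is when the reader starts seeing it.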

The loop has to stop somewhere. It ends when the model predicts a special "stop" token, or when it hits a limit you set, the max_tokens value you'll pass with every request. Set it too low and a long answer gets cut off mid-sentence. You'll see exactly where that lever lives when you make your first call.
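Both stopping conditions fit in one small loop. In this sketch the scripted token list, the `<stop>` marker, and the returned reason strings are all hypothetical. Only the `max_tokens` name comes from the text above.

```python
STOP = "<stop>"  # hypothetical marker for the model's special stop token

# Scripted "predictions" standing in for a real model.
SCRIPT = ["Launch", " day", " is", " here", ".", STOP]


def predict_next(generated):
    """Toy predictor: returns the scripted token for the current position."""
    return SCRIPT[len(generated)]


def generate(max_tokens):
    """Generate until the stop token appears or the max_tokens limit is hit."""
    generated = []
    while len(generated) < max_tokens:
        token = predict_next(generated)
        if token == STOP:
            return "".join(generated), "stop"      # model chose to finish
        generated.append(token)
    return "".join(generated), "max_tokens"        # cut off by the limit


print(generate(max_tokens=20))
print(generate(max_tokens=3))
```

With a generous limit the reply ends where the model decides it should. With a limit of 3 the same reply is cut off partway through the sentence.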

Notice what the model is not doing. It never plans the full answer in advance. It commits to each token based on what's likely to come next, then moves on. That one fact, fluent and plausible but unplanned, explains both why these models are so useful and why they sometimes state wrong things with total confidence.

What to trust, and what to verify

That last point has a practical edge. The model optimizes for plausible, and most of the time plausible and true line up. Sometimes they don't, and nothing in the model flags the difference.

The line to hold is this. The model is most reliable when the answer is already in what you give it: tightening a paragraph, summarizing a doc you pasted, restructuring notes. It gets shakier when the answer has to come from its memory: specific facts, figures, dates, quotes, arithmetic past the trivial. When it states something made up out of nothing, that's a hallucination, and from the inside a hallucinated fact looks identical to a real one.

This holds whatever you point the model at, whether prose, code, or numbers: lean on it to work with what you give it, and verify anything it supplies as fact. You'll come back to why hallucinations happen in Section 6.
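One small way to act on that rule is to flag figures in a reply that never appeared in the material you supplied. The sketch below is a deliberately crude check under stated assumptions: `unsupported_numbers` is a hypothetical helper, it compares numbers only as literal strings, and the sample texts are invented.

```python
import re


def unsupported_numbers(source: str, output: str) -> list[str]:
    """Return numbers in the output that never appear in the source text."""
    number = re.compile(r"\d+(?:\.\d+)?")
    in_source = set(number.findall(source))
    return [n for n in number.findall(output) if n not in in_source]


# Invented example texts.
source = "Signups rose from 1200 in March to 1500 in April."
output = "Signups grew 25% from 1200 to 1500, reaching 2000 by May."

print(unsupported_numbers(source, output))
```

A flagged number is a prompt to check, not proof of an error. Here the 25% figure can be derived from the source, while the 2000 figure has no support in it at all.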

That's the mental model: what the model is, how it generates, and where to trust it. Next, let's put real numbers on it.