What a token is, and why it isn't a word

The last lesson gave you a rough rule: a token is about four characters of English. That rule is fine for quick estimates. It also hides the thing that trips developers up when the first bill or the first length error shows up.

The model doesn't see words. It doesn't see characters, either. It sees tokens, and tokens line up neatly with neither.

How text becomes tokens

Before the model reads anything, your text is split into tokens using a fixed vocabulary built during training. This step is called tokenization, and that vocabulary is decided once and then frozen.

[Figure: a short sentence split into tokens, with some words broken into smaller pieces]

Common words usually map to a single token. Rarer or longer ones get broken into pieces. “Writing” is likely one token. “Tokenization” might come out as “token” plus “ization.” The split follows what was frequent in the training text, not any rule about syllables or meaning.

Whitespace and punctuation count too. A leading space is often bundled into the token, so “ the” with a space and “the” without one can be different tokens. That sounds like a trivial detail. It's exactly why your hand counts never quite match the real number.
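
To make the splitting concrete, here is a toy sketch in Python. It is not how a production tokenizer works: the vocabulary below is invented for this example, and the greedy longest-match loop is a stand-in for the learned merge rules real tokenizers typically use. The names TOY_VOCAB and toy_tokenize are hypothetical. What it does show is the behavior described above: a frequent word stays whole, a rarer one breaks into pieces, and a leading space produces a different token.

```python
from __future__ import annotations

# Toy vocabulary, invented for illustration only. A real vocabulary is
# learned from training text and is far larger.
TOY_VOCAB = {"Writing", "token", "ization", "the", " the"}


def toy_tokenize(text: str, vocab: set[str] = TOY_VOCAB) -> list[str]:
    """Greedy longest-match split; unknown characters become one-character tokens."""
    tokens: list[str] = []
    max_len = max(len(piece) for piece in vocab)
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab or size == 1:
                tokens.append(piece)
                i += size
                break
    return tokens


print(toy_tokenize("Writing"))       # ['Writing']
print(toy_tokenize("tokenization"))  # ['token', 'ization']
print(toy_tokenize("the the"))       # ['the', ' the']
```

The last line is the one to notice: the same three letters come out as two different tokens depending on whether a space comes first.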

Why the word count lies

Here's the naive move: count the words in your prompt and treat that as your token count. For plain English prose, you'll land in the ballpark. Almost everywhere else, you won't.

Code, with its brackets, indentation, and symbols, runs far more tokens per line than prose. Names, URLs, and IDs split into many pieces. Numbers fragment in odd ways, so a long figure can become several tokens. Languages other than English, and emoji, cost more tokens per character than English does.

NOTE

Don't reason about length in words or characters. The only count the API cares about is tokens, and the only way to know it for sure is to let the API report it back.

Counting before you send

If you need a token count before sending, you have two ways to estimate it. Tokenizer libraries like tiktoken (built for OpenAI's GPT models) approximate well enough for budgeting, but they aren't an exact match for Anthropic's vocabulary. The rough characters-divided-by-four heuristic is close enough for most English prose; move away from English or include code, and it can under-count by a third or more. The authoritative count is the usage.input_tokens field in the response, which you'll see in the next section, but that only helps after the call.
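
The characters-divided-by-four heuristic is simple enough to write down. The sketch below is one way to do it, not an official formula. The function name estimate_tokens, the kind labels, and the multipliers are all assumptions made for this example. The 1.5 multiplier comes from reading "under-counts by a third" literally: if the estimate is two thirds of the real number, multiplying by 1.5 brings it back.

```python
import math

# Assumed padding factors, for illustration only.
# "other" covers code, non-English text, and anything else that isn't plain prose.
SAFETY_MULTIPLIER = {"english_prose": 1.0, "other": 1.5}


def estimate_tokens(text: str, kind: str = "english_prose") -> int:
    """Rough pre-flight estimate: characters / 4, padded for non-prose input."""
    base = len(text) / 4
    return math.ceil(base * SAFETY_MULTIPLIER[kind])


prompt = "Summarize the attached report in three bullet points."
print(estimate_tokens(prompt))           # prose estimate
print(estimate_tokens(prompt, "other"))  # padded estimate
```

Treat the result as a budget figure, not a measurement. The number the API reports after the call is still the one that counts.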

To give you a sense of scale: a 200-word English paragraph is roughly 270 tokens. A 1,000-line code file is closer to 6,000 to 8,000 tokens, because brackets, indentation, and identifiers fragment into more tokens per character than prose does. A 10-page PDF of dense text runs 6,000 to 10,000 tokens once you extract the text. These numbers are worth keeping in mind when you're about to paste something into a prompt.

The reason to budget at all is the context window. If you can estimate your input within 20 percent, you can reason about what fits and what won't before the API tells you no.
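
Here is a minimal sketch of that reasoning, assuming your estimate could be as much as 20 percent too low. The function fits_in_context and its parameters are hypothetical, and the 100,000-token window in the usage lines is a placeholder, not any particular model's limit. Check your model's documentation for the real figure.

```python
def fits_in_context(
    estimated_tokens: int,
    context_window: int,
    reserved_for_output: int = 0,
    error_margin: float = 0.20,
) -> bool:
    """True if the input still fits when the estimate is low by error_margin."""
    worst_case_input = estimated_tokens * (1 + error_margin)
    return worst_case_input + reserved_for_output <= context_window


HYPOTHETICAL_WINDOW = 100_000  # placeholder value, not a real model limit

print(fits_in_context(8_000, HYPOTHETICAL_WINDOW))   # True
print(fits_in_context(90_000, HYPOTHETICAL_WINDOW))  # False
```

The second call fails even though 90,000 is under the window, because a 20 percent miss would push the real count past it. That is the kind of input worth trimming before you send it.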

So why care about an invisible unit you never type? Because tokens are what everything downstream is counted in. What you pay, and how much the model can hold at once, are both measured in tokens, not words. The cost comes first. That's next.
