
Tokenization: how an AI chops your words into pieces it can read

A language model cannot read. Not in the way you do. Before a single word of your prompt reaches the model, it is shredded into pieces called tokens and each piece is swapped for a number. Everything the model does, all its apparent understanding, happens on those numbers. Tokenization is that shredding step, and although it sounds like dull plumbing, it quietly shapes how much you pay, how much text fits, and why models do weirdly badly at things like spelling and counting.

Why not just use words, or letters?

Two obvious approaches both fail. If you give the model whole words, the vocabulary explodes: every name, typo, and rare term needs its own entry, and the model is helpless the moment it meets a word it never saw. If instead you give it individual characters, the vocabulary is tiny but sequences become enormously long, and the model wastes effort reassembling meaning from letters one at a time. Tokenization is the compromise in the middle: split text into subword chunks, so common words stay whole and rare words break into familiar pieces.

The word tokenization itself might become token plus ization. The model has never needed to memorize that exact word; it recognizes the parts. This is what lets a model handle a word it has never seen, by spelling it out of pieces it knows.
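To make the idea concrete, here is a minimal sketch of splitting a word into known pieces. The vocabulary TOY_VOCAB and the function split_into_pieces are invented for illustration, and the greedy longest-match rule is a simplification; real tokenizers use larger vocabularies and their own matching rules.

```python
# Toy vocabulary of subword pieces (hypothetical); a real one has tens of thousands.
# Single letters are included so any lowercase word can still be spelled out.
TOY_VOCAB = {"token", "ization", "ize", "un", "able"} | set("abcdefghijklmnopqrstuvwxyz")


def split_into_pieces(word, vocab=TOY_VOCAB):
    """Greedily take the longest vocabulary piece that matches at each position."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest candidate first, shrinking until one is in the vocabulary.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            raise ValueError(f"cannot represent {word[start]!r} with this vocabulary")
    return pieces


print(split_into_pieces("tokenization"))   # ['token', 'ization']
print(split_into_pieces("untokenizable"))  # ['un', 'token', 'i', 'z', 'able']
```

The second word is not in the vocabulary as a whole, yet it still gets a representation: known chunks where they fit, single letters where they do not.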

How the pieces are chosen

The dominant method is byte-pair encoding (BPE), originally a data-compression technique that Sennrich et al. (2015) adapted to split words for neural machine translation. It starts from individual characters and repeatedly merges the most frequent neighboring pair into a new unit. The pair t and h becomes th, then th and e becomes the, and so on. After thousands of merges you get a vocabulary where the most common letter sequences are single tokens and rare ones remain split. Frequent words like the are one token; a rare technical term may be several.
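The merge loop is short enough to sketch. This is a toy version under stated assumptions: the function name learn_bpe_merges and the four-word corpus are made up for illustration, and production tokenizers add details such as word-boundary markers and byte-level starting symbols.

```python
from collections import Counter


def learn_bpe_merges(word_counts, num_merges):
    """Learn merge rules from a {word: frequency} dict, starting from characters."""
    # Represent each word as a tuple of symbols, initially single characters.
    corpus = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by how often the word occurs.
        pair_counts = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the winning pair fused into one symbol.
        new_corpus = {}
        for symbols, count in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_corpus[key] = new_corpus.get(key, 0) + count
        corpus = new_corpus
    return merges, corpus


toy_counts = {"the": 10, "then": 4, "this": 3, "cat": 2}  # hypothetical corpus
merges, corpus = learn_bpe_merges(toy_counts, num_merges=2)
print(merges)  # [('t', 'h'), ('th', 'e')]
print(corpus)
```

With this corpus the first two merges are exactly the ones described above: t and h fuse first, then th and e, at which point the becomes a single symbol while cat is still three separate letters.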

Alternatives refine this. The unigram language-model approach (Kudo, 2018) starts from a large candidate vocabulary and prunes it by statistical likelihood rather than building it up by greedy merging. SentencePiece (Kudo & Richardson, 2018) made tokenization language-agnostic by treating the raw text, spaces and all, as one stream of characters, which is why it works on languages that do not put spaces between words. Many modern models go a step further and tokenize at the byte level, so that any input, in any language, emoji, or code, can always be represented.
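Why does working at the byte level guarantee coverage? Every string can be encoded as UTF-8, and every UTF-8 byte is one of only 256 values. The sketch below shows just that starting point; the helper name to_byte_tokens is hypothetical, and a real byte-level tokenizer would go on to merge these bytes into larger units.

```python
def to_byte_tokens(text):
    """Return the UTF-8 byte values (0-255) that a byte-level tokenizer starts from."""
    return list(text.encode("utf-8"))


for sample in ["hi", "é", "日本", "🙂"]:
    byte_values = to_byte_tokens(sample)
    # Every value fits in a 256-entry base vocabulary, whatever the script.
    print(repr(sample), byte_values, all(0 <= b <= 255 for b in byte_values))
```

Note that the two plain ASCII letters take one byte each while the single emoji takes four, a small preview of how unevenly text can translate into units.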

Once text is tokens, each token is mapped to a vector of numbers, its embedding, and that is what flows into the model, which we cover in transformers.
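The lookup itself is nothing more than indexing into a table. In this sketch the vocabulary, the dimension of 4, and the random values are all placeholders chosen for illustration; in a real model the table is far larger and its numbers are learned during training.

```python
import random

VOCAB = {"token": 0, "ization": 1, "the": 2}  # hypothetical token -> ID table
EMBEDDING_DIM = 4                             # real models use far more dimensions

# One row of numbers per token ID. Random placeholders with a fixed seed stand in
# for values that a real model would learn.
rng = random.Random(0)
EMBEDDINGS = [[rng.uniform(-1, 1) for _ in range(EMBEDDING_DIM)] for _ in VOCAB]


def embed(tokens):
    """Map token strings to IDs, then IDs to their embedding vectors."""
    ids = [VOCAB[t] for t in tokens]
    return ids, [EMBEDDINGS[i] for i in ids]


ids, vectors = embed(["token", "ization"])
print(ids)                            # [0, 1]
print(len(vectors), len(vectors[0]))  # 2 4
```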

Why this matters more than it looks

Tokens are the unit of money and memory. Commercial model APIs typically bill per token, and a model's context window, the amount it can attend to at once, is measured in tokens, not words. As a rule of thumb, English runs about three-quarters of a word per token, but that ratio is not universal, and the gap creates a real fairness problem: languages underrepresented in the tokenizer's training data get chopped into more tokens per sentence. The same meaning can cost several times more tokens in some languages than in English, so users of those languages pay more and hit context limits sooner for identical content. This token tax is a quiet form of inequity baked into the plumbing.
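A little arithmetic shows how quickly the gap adds up. Only the three-quarters ratio comes from the text; the price, the context window size, and the 3x multiplier below are hypothetical numbers picked to make the sums easy, not figures for any real model or language.

```python
WORDS_PER_TOKEN_ENGLISH = 0.75  # rule of thumb from the text; varies by tokenizer


def estimate_tokens(word_count, token_multiplier=1.0):
    """Rough token count; a multiplier above 1 models a language that splits into more tokens."""
    return round(word_count / WORDS_PER_TOKEN_ENGLISH * token_multiplier)


def estimate_cost(tokens, price_per_million_tokens):
    """Cost in whatever currency the price is quoted in."""
    return tokens / 1_000_000 * price_per_million_tokens


def fits_in_context(tokens, context_window):
    return tokens <= context_window


PRICE = 2.00           # hypothetical price per million tokens
CONTEXT_WINDOW = 4000  # hypothetical context window, in tokens

english_tokens = estimate_tokens(1500)                     # 2000
other_tokens = estimate_tokens(1500, token_multiplier=3.0)  # 6000

for label, tokens in [("English", english_tokens), ("3x language", other_tokens)]:
    print(label, tokens, estimate_cost(tokens, PRICE), fits_in_context(tokens, CONTEXT_WINDOW))
```

Under these assumptions the same 1,500-word document fits comfortably in the window in English and does not fit at all in the language that tokenizes three times as densely.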

Tokenization also explains some of the field's most famous embarrassments. Ask a model how many times the letter r appears in a word and it often miscounts, because it never saw the letters; it saw a token or two that stand for the whole chunk. Spelling, rhyming, and character-level games are hard for the same reason: the model is reasoning about opaque numeric chunks, not letters. Arithmetic suffers too, since numbers can be split into inconsistent pieces, so two nearly identical numbers may tokenize in ways that share little, making digit-by-digit reasoning awkward. Some hallucination-adjacent quirks trace back here.
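It helps to look at the input from the model's side. The chunk boundaries and ID numbers below are invented for illustration; a real tokenizer would split the word its own way and assign different IDs.

```python
# Hypothetical vocabulary: chunk -> ID. Real splits and IDs differ by tokenizer.
TOY_IDS = {"str": 4011, "aw": 672, "berry": 15717}
ID_TO_CHUNK = {token_id: chunk for chunk, token_id in TOY_IDS.items()}


def what_the_model_sees(chunks):
    """Replace each chunk with its integer ID, as happens before the model runs."""
    return [TOY_IDS[chunk] for chunk in chunks]


ids = what_the_model_sees(["str", "aw", "berry"])
print(ids)  # [4011, 672, 15717]: nothing in these integers spells out an "r"

# Counting letters requires the characters back, which the IDs alone do not carry.
text = "".join(ID_TO_CHUNK[token_id] for token_id in ids)
print(text, text.count("r"))  # strawberry 3
```

A program with the lookup table can decode and count. The model only ever receives the integers, so it has to have learned the spelling of each chunk indirectly from training data.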

The takeaway

Tokenization is the invisible layer between your text and the model's mind. It is a clever solution to a real problem, fitting an open-ended language into a fixed vocabulary, but the seams show: in your bill, in your context budget, in cross-language fairness, and in the model's odd blind spots about its own letters. The next time a model insists strawberry has two r's, you are not seeing a reasoning failure so much as a tokenization one. It is answering about chunks it can see, not letters it cannot. For what happens to these tokens next, read transformers.

Key papers
Neural Machine Translation of Rare Words with Subword Units / BPE (Sennrich et al., 2015)
Subword Regularization (Kudo, 2018)
SentencePiece (Kudo & Richardson, 2018)