02 - Tokens
What Is a Token?
Before going further, we need to define the fundamental unit LLMs operate on: the token.
A token is not a word. It's not a character. It's a chunk of text determined by a tokenizer - an algorithm that splits input text into pieces the model can process. Each token maps to an integer ID in the model's vocabulary (typically 30,000–100,000 entries).
How Tokenization Works
Modern LLMs use Byte Pair Encoding (BPE) or variants of it (SentencePiece, tiktoken). The algorithm is straightforward:
- Start with a base vocabulary of individual bytes (256 entries).
- Scan a large training corpus and find the most frequently occurring pair of adjacent tokens.
- Merge that pair into a new single token. Add it to the vocabulary.
- Repeat the scan-and-merge steps thousands of times until you reach the desired vocabulary size.
The result is a vocabulary where common words are single tokens, less common words are split into pieces, and rare words or code get broken into many small fragments.
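The merge loop described above can be sketched in a few lines of plain Python. This is a toy illustration of the BPE idea, not any production tokenizer: it starts from characters rather than bytes, and the tiny corpus and merge count are made up for the example.

```python
from collections import Counter

def bpe_train(corpus: str, num_merges: int) -> list[tuple[str, str]]:
    # Start from individual characters (real BPE starts from the 256 bytes).
    tokens = list(corpus)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of tokens in the corpus.
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        # Replace every occurrence of that pair with one merged token.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges

merges = bpe_train("low lower lowest low low", num_merges=3)
print(merges)  # first learned merge is ('l', 'o')
```

Each learned merge becomes a new vocabulary entry; running this for tens of thousands of iterations on a real corpus is what produces the 30K–100K-entry vocabularies described above.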
Examples
| Input Text | Tokens (approximate) |
|---|---|
| Hello world | [Hello][ world] - 2 tokens |
| tokenization | [token][ization] - 2 tokens |
| Tallinn | [T][all][inn] - 3 tokens |
| indistinguishable | [ind][ist][ingu][ishable] - 4 tokens |
| int main() { | [int][ main][()][ {] - 4 tokens |
| 🎉 | [🎉] - 1 token (common emoji) |
| Привет | [При][вет] - 2 tokens |
| (3 spaces) | [   ] - 1 token (whitespace gets merged) |
Note: leading spaces are typically attached to the following word as part of the token (the token is " world", with the space, not "world" alone). This is why you'll see tokens that start with a space - it's not a bug, it's how BPE handles word boundaries.
Why This Matters for Engineering
Cost. API pricing is per token, both input and output. A 200-word English paragraph is roughly 250–300 tokens. A 5,000-line codebase might be 40,000+ tokens. Understanding token counts is understanding your bill.
Context window limits. When a model has a "200K token context window," that's the hard maximum for input + output combined. A token is roughly ¾ of a word in English, so 200K tokens ≈ 150K words. But code, non-English text, and structured data (JSON, XML) tokenize far less efficiently - you get fewer "words" per token.
Non-English tax. BPE vocabularies are trained predominantly on English text. Common English words become single tokens. Other languages get split into more pieces. The same sentence in Estonian might use 1.5–2× more tokens than its English equivalent, meaning you pay more and fill context windows faster.
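A toy way to see this effect: apply merges learned from English text to a non-English string. None of the merges fire, so every character stays its own token. The merge list below is invented for illustration, not taken from any real tokenizer.

```python
def apply_merges(word: str, merges: list[tuple[str, str]]) -> list[str]:
    # Apply each learned merge in order, as BPE does at encode time.
    tokens = list(word)
    for a, b in merges:
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens

# Hypothetical merges, as if learned from an English-heavy corpus.
english_merges = [("t", "h"), ("th", "e"), ("h", "e"), ("l", "l")]

print(apply_merges("the", english_merges))     # ['the'] - 1 token
print(apply_merges("привет", english_merges))  # 6 single-character tokens
```

Real vocabularies do contain non-English merges, so the gap is smaller than in this toy, but the direction of the effect is the same: fewer applicable merges means more tokens per word.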
The model doesn't see text. This is critical to internalize. The model receives a sequence of integer IDs - [9906, 1917] under cl100k_base, not "Hello world". It predicts the next integer ID from a probability distribution over the entire vocabulary. The tokenizer then decodes that ID back to text. When people say the model "generates text," it's actually generating a sequence of vocabulary indices, one at a time.
Tokenizer Is Separate from the Model
The tokenizer is a fixed, deterministic preprocessing step - not a neural network. It's decided before training and never changes. Every model family has its own tokenizer:
- OpenAI GPT-4: cl100k_base (tiktoken, ~100K vocab)
- Anthropic Claude: proprietary BPE tokenizer
- Meta LLaMA: SentencePiece BPE (~32K vocab)
The same text produces different token sequences on different models. This is why token counts vary across providers and why you can't directly compare "128K context" between models without knowing their tokenizer efficiency.
You Can Inspect This Yourself
OpenAI publishes their tokenizer as tiktoken. Anthropic provides a token counting API. For any serious development, you should be counting tokens programmatically, not guessing.
```python
# OpenAI's tiktoken (open source)
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("Hello world")
# tokens == [9906, 1917] - two token IDs
# enc.decode(tokens) -> "Hello world"
```
Every token ID maps to a fixed byte sequence. The mapping is deterministic and reversible. There's no interpretation, no semantics - just a lookup table built by BPE statistics.
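Conceptually, the decode step is nothing more than this. A two-entry toy vocabulary stands in here for the real ~100K-entry table; the IDs are the cl100k_base IDs for "Hello" and " world" shown earlier.

```python
# A tokenizer's decode step is a table lookup: id -> fixed byte sequence.
vocab = {9906: b"Hello", 1917: b" world"}

def decode(ids: list[int]) -> str:
    # Concatenate the byte sequences, then interpret them as UTF-8 text.
    return b"".join(vocab[i] for i in ids).decode("utf-8")

print(decode([9906, 1917]))  # Hello world
```

Operating on bytes rather than strings is why BPE can represent any input, including emoji and arbitrary binary-ish text: every byte sequence has some tokenization, even if a wasteful one.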
How LLMs Work
What Is a Token?
Before going further, you need to understand tokens, because everything in LLM-land is measured in them - context windows, pricing, speed, and model capabilities.
A token is a chunk of text that the model treats as a single unit. Not a character. Not a word. Something in between. A tokenizer algorithm splits all input text into these chunks before the model ever sees it.
How Tokenization Works
LLMs use subword tokenization - most commonly Byte Pair Encoding (BPE) or a variant of it. The process:
- Start with a base vocabulary of individual bytes (or characters).
- Scan a massive training corpus and find the most frequently occurring pair of adjacent tokens.
- Merge that pair into a new single token. Add it to the vocabulary.
- Repeat thousands of times until you reach the desired vocabulary size (typically 30K–100K+ tokens).
The result is a vocabulary where common words are single tokens, less common words are split into pieces, and rare or novel words get broken down further.
Concrete Examples
Using Claude's tokenizer (roughly):
| Text | Tokens | Count |
|---|---|---|
| Hello | [Hello] | 1 |
| hello | [hello] | 1 |
| Hello world | [Hello, ▁world] | 2 |
| tokenization | [token, ization] | 2 |
| unhappiness | [un, happiness] | 2 |
| Tallinn | [T, allinn] or [Tall, inn] | 2 |
| getElementById | [get, Element, By, Id] | 4 |
| こんにちは | [こん, にち, は] | 3 |
| (3 spaces) | [▁▁▁] | 1 |
| \n\n | [\n\n] | 1 |
Note: the ▁ represents a space character that gets merged with the following word. Exact splits vary by tokenizer - the above are illustrative.
Key Properties
It's not word-based. Common words like "the", "hello", "function" are single tokens. Uncommon words get split: "counterintuitive" might become [counter, intu, itive]. This is why models occasionally stumble on unusual proper nouns or technical jargon - they're processing them in fragments.
Whitespace and punctuation are tokens too. Spaces, newlines, tabs, brackets, semicolons - all consume tokens. Code is typically more token-dense than prose because of all the syntactic characters.
Tokenization is deterministic. Given the same tokenizer, the same input always produces the same token sequence. There's no randomness here - the randomness comes later, during generation.
Different models use different tokenizers. OpenAI's cl100k_base (GPT-4), Anthropic's tokenizer, and Google's SentencePiece all produce different splits for the same text. This is why token counts aren't directly comparable across providers.
The vocabulary is fixed at training time. The tokenizer is built before model training begins and never changes. New words invented after training (brand names, slang) get decomposed into existing subword tokens. The model handles them, just less efficiently.
Why This Matters for Engineering
Context window limits are in tokens, not characters. Claude's 200K context window is ~200,000 tokens, which is roughly 150,000 words or ~600 pages of text. But this is approximate - code, non-English text, and structured data tokenize less efficiently.
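For quick budgeting, a common rule of thumb for English prose is about 4 characters per token. Real counts vary, especially for code and non-English text, so treat this as a back-of-envelope sketch, not a substitute for counting with a real tokenizer. The reserve size below is an arbitrary example value.

```python
def rough_token_estimate(text: str, chars_per_token: float = 4.0) -> int:
    # Rule-of-thumb estimate for English prose; code and non-English
    # text usually need more tokens than this predicts.
    return max(1, round(len(text) / chars_per_token))

def fits_in_context(text: str, context_window: int = 200_000,
                    reserved_for_output: int = 4_000) -> bool:
    # Input and output share the window, so reserve room for the reply.
    return rough_token_estimate(text) <= context_window - reserved_for_output

print(rough_token_estimate("Hello world"))  # 3
```

For anything that actually hits limits or bills, replace the heuristic with a programmatic count from the provider's tokenizer or token-counting API.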
You pay per token. API pricing is per input token and per output token. A verbose system prompt costs real money when multiplied by thousands of requests. Every message in your conversation history is re-tokenized and re-billed on every API call.
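The cost arithmetic is worth making concrete. The per-million-token prices below are placeholders, not any provider's actual rates; plug in real numbers from your provider's pricing page.

```python
def call_cost(input_tokens: int, output_tokens: int,
              usd_per_m_input: float = 3.0,
              usd_per_m_output: float = 15.0) -> float:
    # Placeholder prices in USD per million tokens.
    return (input_tokens * usd_per_m_input
            + output_tokens * usd_per_m_output) / 1_000_000

# A 2,000-token prompt answered with 500 tokens, 10,000 times a day:
daily = 10_000 * call_cost(2_000, 500)
print(f"${daily:.2f}/day")  # $135.00/day at these placeholder rates
```

Note how the fixed prompt dominates: trimming a verbose system prompt by a few hundred tokens compounds across every single request.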
Generation speed is tokens per second. A long response takes longer simply because more tokens must be generated, one at a time. Streaming gives you tokens as they're produced, typically 30-100+ tokens/second depending on the model.
The model operates on token IDs, not text. Internally, each token maps to an integer ID. The text "Hello world" becomes something like [9906, 1917]. The model's neural network processes these integer sequences through embedding layers - it never sees raw text. This is why the model can't reliably count characters in a word or reverse a string character by character - it doesn't see characters, it sees token chunks.
Practical Verification
You can inspect tokenization yourself:
- Anthropic: use the anthropic Python SDK's token counting, or the API's usage field in responses
- OpenAI: platform.openai.com/tokenizer - interactive tool using tiktoken
- General: the tiktoken Python library (OpenAI's tokenizer) or Hugging Face's tokenizers library
```python
# OpenAI's tiktoken (for GPT models)
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
tokens = enc.encode("Hello world")
# tokens == [9906, 1917] - 2 tokens

# Decode each token back to text
for t in tokens:
    print(f"{t} -> '{enc.decode([t])}'")
# 9906 -> 'Hello'
# 1917 -> ' world'
```
Understanding tokenization is not optional. Every design decision in agentic systems - prompt length, conversation management, cost estimation, context window budgeting - requires you to think in tokens.
How Text Generation Actually Works
LLMs generate text one token at a time, and the process is more expensive than most people realize.
Autoregressive Generation
When you send a prompt to an LLM and it produces a response, here's what happens mechanically:
- Prefill phase: The model processes your entire input (system prompt + conversation history + your latest message) in parallel. This produces internal representations for all input tokens at once. This is fast because it can be parallelized on the GPU.
- Decode phase (token 1): The model takes the full processed input and generates a probability distribution over every token in its vocabulary (~100K+ options). One token is selected. Let's say it generates "The".
- Decode phase (token 2): The model now takes all original input tokens plus the token "The" it just generated, and produces the next probability distribution. It selects "▁capital".
- Decode phase (token 3): Input is now all original tokens + "The" + "▁capital". Generate the next distribution. Select "▁of".
- ...repeat. Every single generated token requires a forward pass through the entire model, with the full input plus all previously generated tokens as context.
- Stop condition: Generation continues until the model produces a special end-of-turn token (written <|end|>, </s>, or similar), or hits the max_tokens limit you specified in the API call.
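The decode loop can be sketched as plain Python. Everything here is a stand-in: model_forward is a toy that walks through a canned reply instead of running a neural network, and all token IDs (prompt, reply, end-of-turn) are hypothetical, not any real tokenizer's values.

```python
EOT = 100257      # hypothetical end-of-turn token ID
PROMPT_LEN = 7    # length of the tokenized prompt below

def model_forward(token_ids: list[int]) -> dict[int, float]:
    # Stand-in for the real network: given all tokens so far, return a
    # probability distribution over the vocabulary. This toy version puts
    # all its mass on the next token of a canned four-token reply.
    canned = [464, 3139, 286, 50]  # hypothetical IDs for the reply
    generated = len(token_ids) - PROMPT_LEN
    next_id = canned[generated] if generated < len(canned) else EOT
    return {next_id: 1.0}

def generate(prompt_ids: list[int], max_tokens: int = 16) -> list[int]:
    ids = list(prompt_ids)
    for _ in range(max_tokens):
        dist = model_forward(ids)          # full forward pass every step
        next_id = max(dist, key=dist.get)  # greedy selection
        if next_id == EOT:                 # stop condition
            break
        ids.append(next_id)                # feed the token back in
    return ids[len(prompt_ids):]

out = generate([3923, 374, 279, 6864, 315, 43623, 30])  # 7-token prompt
print(out)  # [464, 3139, 286, 50]
```

The structure is the important part: the input to each step is the prompt plus everything generated so far, selection happens over the full distribution, and generation ends on the end-of-turn token or the max_tokens cap.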
Visualized
Input: [What] [▁is] [▁the] [▁capital] [▁of] [▁Estonia] [?]
Step 1: [What][▁is][▁the][▁capital][▁of][▁Estonia][?] → "The"
Step 2: [What][▁is][▁the][▁capital][▁of][▁Estonia][?][The] → "