02 - Tokens

What Is a Token?

Before going further, you need to understand tokens, because everything in LLM-land is measured in them - context windows, pricing, speed, and model capabilities.

A token is a chunk of text that the model treats as a single unit. Not a character. Not a word. Something in between. A tokenizer algorithm splits all input text into these chunks before the model ever sees it.

How Tokenization Works

LLMs use subword tokenization - most commonly Byte Pair Encoding (BPE) or a variant of it. The process:

Start with a base vocabulary of individual bytes (or characters).
Scan a massive training corpus and find the most frequently occurring pair of adjacent tokens.
Merge that pair into a new single token. Add it to the vocabulary.
Repeat thousands of times until you reach the desired vocabulary size (typically 30K–100K+ tokens).

The result is a vocabulary where common words are single tokens, less common words are split into pieces, and rare or novel words get broken down further.
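
The merge loop above can be sketched in a few lines. This is a toy illustration, not any production tokenizer: it starts from characters rather than bytes, trains on a five-word made-up corpus, and the function name `train_bpe` is hypothetical.

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    # Start from individual characters; real tokenizers usually start from bytes.
    words = [list(word) for word in corpus]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of tokens across the corpus.
        pairs = Counter()
        for tokens in words:
            pairs.update(zip(tokens, tokens[1:]))
        if not pairs:
            break
        # Pick the most frequent pair and record it as a merge rule.
        best = pairs.most_common(1)[0][0]
        merges.append(best)
        merged = best[0] + best[1]
        # Replace each occurrence of the pair with the single merged token.
        new_words = []
        for tokens in words:
            out, i = [], 0
            while i < len(tokens):
                if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(tokens[i])
                    i += 1
            new_words.append(out)
        words = new_words
    return merges, words

merges, segmented = train_bpe(["low", "low", "lower", "lowest", "newest"], num_merges=2)
print(merges)
print(segmented)  # "low" is now one token; rarer words are still split into pieces
```

After only two merges the frequent word "low" has become a single token while "newest" is still spelled out character by character, which is the same pattern a real vocabulary shows at much larger scale.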

Concrete Examples

Using Claude's tokenizer (roughly):

| Text | Tokens | Count |
| --- | --- | --- |
| `Hello` | `[Hello]` | 1 |
| `hello` | `[hello]` | 1 |
| `Hello world` | `[Hello, ▁world]` | 2 |
| `tokenization` | `[token, ization]` | 2 |
| `unhappiness` | `[un, happiness]` | 2 |
| `Tallinn` | `[T, allinn]` or `[Tall, inn]` | 2 |
| `getElementById` | `[get, Element, By, Id]` | 4 |
| `こんにちは` | `[こん, にち, は]` | 3 |
| (3 spaces) | `[▁▁▁]` | 1 |
| `\n\n` | `[\n\n]` | 1 |

Note: the ▁ represents a space character that gets merged with the following word. Exact splits vary by tokenizer - the above are illustrative.

Key Properties

It's not word-based. Common words like "the", "hello", "function" are single tokens. Uncommon words get split: "counterintuitive" might become [counter, intu, itive]. This is why models occasionally stumble on unusual proper nouns or technical jargon - they're processing them in fragments.

Whitespace and punctuation are tokens too. Spaces, newlines, tabs, brackets, semicolons - all consume tokens. Code is typically more token-dense than prose because of all the syntactic characters.

Tokenization is deterministic. Given the same tokenizer, the same input always produces the same token sequence. There's no randomness here - the randomness comes later, during generation.

Different models use different tokenizers. OpenAI's cl100k_base (GPT-4), Anthropic's tokenizer, and the SentencePiece-based tokenizers used by Google's models all produce different splits for the same text. This is why token counts aren't directly comparable across providers.

The vocabulary is fixed at training time. The tokenizer is built before model training begins and never changes. New words invented after training (brand names, slang) get decomposed into existing subword tokens. The model handles them, just less efficiently.

Why This Matters for Engineering

Context window limits are in tokens, not characters. Claude's 200K context window is ~200,000 tokens, which is roughly 150,000 words or ~600 pages of text. But this is approximate - code, non-English text, and structured data tokenize less efficiently. When a model has a "200K token context window," that's the hard maximum for input + output combined.

You pay per token. API pricing is per input token and per output token. A verbose system prompt costs real money when multiplied by thousands of requests. Every message in your conversation history is re-tokenized and re-billed on every API call.
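
To see how re-billing the history adds up, here is a minimal sketch. The prices and token counts are made-up placeholders, not any provider's actual rates, and it ignores discounts such as prompt caching; `billed_tokens` and `cost_usd` are hypothetical helper names.

```python
def billed_tokens(system_tokens, turns):
    """Return (input_total, output_total) for a conversation.

    `turns` is a list of (user_tokens, reply_tokens) pairs. Every call
    resends the system prompt plus the full history so far.
    """
    input_total = 0
    output_total = 0
    history = 0
    for user_tokens, reply_tokens in turns:
        input_total += system_tokens + history + user_tokens
        output_total += reply_tokens
        history += user_tokens + reply_tokens
    return input_total, output_total

def cost_usd(input_tokens, output_tokens, input_price_per_mtok, output_price_per_mtok):
    """Prices are per million tokens (hypothetical values supplied by the caller)."""
    return (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000

# Three turns: 1,000-token system prompt, 50-token questions, 200-token replies.
inp, out = billed_tokens(1000, [(50, 200)] * 3)
print(inp, out)                     # 3900 600
print(cost_usd(inp, out, 3, 15))    # placeholder prices of $3 and $15 per million tokens
```

The user only typed 150 tokens across the three turns, but 3,900 input tokens were billed, because the system prompt and history ride along on every call.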

Non-English tax. BPE vocabularies are trained predominantly on English text. Common English words become single tokens. Other languages get split into more pieces. The same sentence in Estonian might use 1.5–2× more tokens than its English equivalent, meaning you pay more and fill context windows faster.
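
One contributing factor is easy to check yourself: byte-level tokenizers start from UTF-8 bytes, and non-ASCII characters take more than one byte each. The sketch below only counts characters and bytes; it does not run a real tokenizer, so treat it as a hint at the starting point before merges, not a token count. The sample phrases are illustrative.

```python
samples = {
    "English": "Good evening",
    "Estonian": "Tere \u00f5htust",   # "Tere õhtust", with õ written as an escape
    "Japanese": "こんにちは",
}

counts = {}
for language, text in samples.items():
    n_chars = len(text)
    n_bytes = len(text.encode("utf-8"))
    counts[language] = (n_chars, n_bytes)
    print(f"{language}: {n_chars} characters, {n_bytes} UTF-8 bytes")
```

The Japanese greeting has fewer than half as many characters as the English one but more bytes, so a byte-level tokenizer has more raw material to merge before it reaches a compact encoding.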

Generation speed is measured in tokens per second. When a response feels slow, it's usually because the model is generating many tokens; short responses finish quickly. Streaming gives you tokens as they're produced, typically 30-100+ tokens/second depending on the model.

The model operates on token IDs, not text. Internally, each token maps to an integer ID. The text "Hello world" becomes something like [9906, 1917] (its IDs under OpenAI's cl100k_base). The model's neural network processes these integer sequences through embedding layers - it never sees raw text. This is why the model can't reliably count characters in a word or reverse a string character by character - it doesn't see characters, it sees token chunks.
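
A toy encoder makes the "IDs, not characters" point concrete. The vocabulary and IDs below are invented, and the matching rule is greedy longest-match, a simplification rather than real BPE.

```python
# Hypothetical toy vocabulary; real vocabularies hold tens of thousands of entries.
VOCAB = {"straw": 0, "berry": 1, "s": 2, "t": 3, "r": 4,
         "a": 5, "w": 6, "b": 7, "e": 8, "y": 9}
ID_TO_TOKEN = {token_id: token for token, token_id in VOCAB.items()}

def encode(text):
    """Greedy longest-match encoding (a simplification, not real BPE)."""
    ids, i = [], 0
    while i < len(text):
        # Try the longest remaining substring first, then shrink.
        for end in range(len(text), i, -1):
            piece = text[i:end]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                i = end
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return ids

def decode(ids):
    """Map IDs back to their text pieces and join them."""
    return "".join(ID_TO_TOKEN[token_id] for token_id in ids)

ids = encode("strawberry")
print(ids)          # [0, 1]
print(decode(ids))  # strawberry
```

The model would receive `[0, 1]`: two integers with no visible letters. Counting the r's in "strawberry" requires knowledge about what those two IDs spell, which is not present in the input itself.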

Tokenizer Is Separate from the Model

The tokenizer is a fixed, deterministic preprocessing step - not a neural network. It's decided before training and never changes. Every model family has its own tokenizer:

OpenAI GPT-4: cl100k_base (tiktoken, ~100K vocab)
Anthropic Claude: proprietary BPE tokenizer
Meta LLaMA (1 and 2): SentencePiece BPE (~32K vocab)

The same text produces different token sequences on different models. This is why token counts vary across providers and why you can't directly compare "128K context" between models without knowing their tokenizer efficiency.

Practical Verification

You can inspect tokenization yourself:

Anthropic: Use the anthropic Python SDK's token counting, or the API's usage field in responses
OpenAI: platform.openai.com/tokenizer - interactive tool using tiktoken
General: The tiktoken Python library (OpenAI's tokenizer) or Hugging Face's tokenizers library

# OpenAI's tiktoken (for GPT models)
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4")
tokens = enc.encode("Hello world")
# [9906, 1917] - 2 tokens

# Decode back
for t in tokens:
    print(f"{t} -> '{enc.decode([t])}'")
# 9906 -> 'Hello'
# 1917 -> ' world'

Understanding tokenization is not optional. Every design decision in agentic systems - prompt length, conversation management, cost estimation, context window budgeting - requires you to think in tokens.

How Text Generation Actually Works

LLMs generate text one token at a time, and the process is more expensive than most people realize.

Autoregressive Generation

When you send a prompt to an LLM and it produces a response, here's what happens mechanically:

Prefill phase: The model processes your entire input (system prompt + conversation history + your latest message) in parallel. This produces internal representations for all input tokens at once. This is fast because it can be parallelized on the GPU.
Decode phase — token 1: The model takes the full processed input and generates a probability distribution over every token in its vocabulary (~100K+ options). One token is selected. Let's say it generates "The".
Decode phase — token 2: The model now takes all original input tokens plus the token "The" it just generated, and produces the next probability distribution. It selects "▁capital".
Decode phase — token 3: Input is now: all original tokens + "The" + "▁capital". Generate next distribution. Select "▁of".
...repeat. Every single generated token requires a forward pass through the entire model, with the full input + all previously generated tokens as context.
Stop condition: Generation continues until the model produces a special end-of-turn token (sometimes called <|end|>, </s>, or similar), or hits the max_tokens limit you specified in the API call.
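
The loop above, reduced to its skeleton. `TOY_MODEL` is a hypothetical stand-in that looks only at the last token; a real LLM conditions on the whole sequence and scores its full vocabulary. The stop-reason strings are illustrative labels.

```python
END = "<END>"

# Hypothetical lookup table standing in for a neural network.
TOY_MODEL = {
    "?": {"The": 0.9, "Tallinn": 0.1},
    "The": {"▁capital": 1.0},
    "▁capital": {"▁of": 1.0},
    "▁of": {"▁Estonia": 1.0},
    "▁Estonia": {"▁is": 0.8, "?": 0.2},
    "▁is": {"▁Tallinn": 0.82, "▁Helsinki": 0.18},
    "▁Tallinn": {".": 1.0},
    ".": {END: 1.0},
}

def generate(prompt_tokens, max_tokens):
    """Append one token per step until END or the max_tokens cutoff."""
    tokens = list(prompt_tokens)      # the context grows by one token per step
    generated = []
    stop_reason = "max_tokens"
    for _ in range(max_tokens):
        probs = TOY_MODEL[tokens[-1]]            # one "forward pass"
        next_token = max(probs, key=probs.get)   # greedy selection
        if next_token == END:
            stop_reason = "end_turn"
            break
        tokens.append(next_token)
        generated.append(next_token)
    return generated, stop_reason

PROMPT = ["What", "▁is", "▁the", "▁capital", "▁of", "▁Estonia", "?"]
print(generate(PROMPT, max_tokens=20))
print(generate(PROMPT, max_tokens=3))   # cut off early by the limit
```

The second call shows the hard cutoff: the loop stops after three tokens whether or not the sentence is finished.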

Visualized

Input:  [What] [▁is] [▁the] [▁capital] [▁of] [▁Estonia] [?]

Step 1: [What][▁is][▁the][▁capital][▁of][▁Estonia][?]                          → "The"
Step 2: [What][▁is][▁the][▁capital][▁of][▁Estonia][?][The]                      → "▁capital"
Step 3: [What][▁is][▁the][▁capital][▁of][▁Estonia][?][The][▁capital]             → "▁of"
Step 4: [What][▁is][▁the][▁capital][▁of][▁Estonia][?][The][▁capital][▁of]        → "▁Estonia"
Step 5: [What][▁is][▁the][▁capital][▁of][▁Estonia][?][The][▁capital][▁of][▁Estonia] → "▁is"
Step 6: ...                                                                       → "▁Tallinn"
Step 7: ...                                                                       → "."
Step 8: ...                                                                       → <END>

The input sequence grows by one token with every step. The model sees everything — the original prompt and every token it has generated so far — every single time.

How the Next Token Is Selected

The model outputs a probability distribution over its entire vocabulary at each step. For example:

"Tallinn"  → 0.82
"Helsinki"  → 0.06
"Tartu"    → 0.03
"Riga"     → 0.02
...98,000+ other tokens with tiny probabilities

The temperature parameter controls how this distribution is sampled:

Temperature 0: Always pick the highest-probability token (greedy decoding). In principle the same input gives the same output every time, though hosted APIs can still show small run-to-run differences.
Temperature 0.5–0.7: Mild randomness. Mostly picks top candidates but occasionally surprises.
Temperature 1.0: Sample directly from the distribution as-is. More creative, more unpredictable.
Temperature >1.0: Flatten the distribution — even low-probability tokens get a real chance. Gets wild fast.
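
Mechanically, temperature divides the model's raw scores (logits) before they are turned into probabilities. A minimal sketch, using invented logits for four candidate tokens; the helper names are hypothetical.

```python
import math
import random

def apply_temperature(logits, temperature):
    """Softmax over logits / temperature. Requires temperature > 0."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                          # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample(tokens, logits, temperature, rng=random):
    """Pick one token; temperature 0 is treated as greedy selection."""
    if temperature == 0:
        return tokens[logits.index(max(logits))]
    probs = apply_temperature(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]

TOKENS = ["Tallinn", "Helsinki", "Tartu", "Riga"]
LOGITS = [4.0, 1.4, 0.7, 0.3]   # made-up scores for illustration

for t in (0.5, 1.0, 2.0):
    probs = apply_temperature(LOGITS, t)
    print(t, [round(p, 3) for p in probs])

print(sample(TOKENS, LOGITS, 0))                       # always the top token
print(sample(TOKENS, LOGITS, 1.0, random.Random(0)))   # seeded for repeatability
```

Lower temperatures pile probability onto the top token; higher ones spread it across the alternatives.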

Other sampling parameters like top-p (nucleus sampling) and top-k truncate the distribution before sampling. For example, top-p = 0.9 keeps only the smallest set of highest-probability tokens whose cumulative probability reaches 90% and ignores the rest; top-k keeps only the k most likely tokens.
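
Both truncation rules fit in a few lines. The distribution below is invented for illustration, and the function names are hypothetical.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability
    reaches p, then renormalize so the kept probabilities sum to 1."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}

def top_k_filter(probs, k):
    """Keep the k most likely tokens, then renormalize."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(prob for _, prob in ranked)
    return {token: prob / total for token, prob in ranked}

DIST = {"Tallinn": 0.6, "Helsinki": 0.2, "Tartu": 0.15, "Riga": 0.05}

print(top_p_filter(DIST, 0.9))   # drops "Riga"
print(top_k_filter(DIST, 2))     # keeps "Tallinn" and "Helsinki"
```

Sampling then proceeds as usual, but only over the surviving tokens.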

This is why you can ask the same question twice and get different answers — the model literally rolls dice at each token position (unless temperature = 0).

Why This Matters: The KV Cache

Processing the full sequence from scratch at every step would be absurdly slow. In practice, transformers use a KV (Key-Value) cache: the internal representations computed for previous tokens are cached and reused. Each new token only requires computing attention against the cached values plus the one new token.

This is why:

The first token takes longest (the "time to first token" or TTFT) — it must process your entire input.
Subsequent tokens stream faster — each one only adds incremental computation.
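Long outputs stay cheap per token - thanks to the KV cache, each step computes only the newest token's representations and attends over the cached ones, so per-step cost grows slowly with length instead of repeating all earlier work.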
Long inputs are expensive — the prefill phase scales quadratically with input length due to self-attention (though various optimizations reduce this in practice).
GPU memory limits context length — the KV cache grows with sequence length and must fit in GPU memory. This is a key constraint on context window size.
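
Two back-of-the-envelope calculations make these points tangible. The model dimensions below are hypothetical, chosen only to give round numbers; the memory formula counts one key vector and one value vector per token, per layer, per KV head.

```python
def tokens_processed(prompt_len, n_generated, use_cache):
    """Count token positions pushed through the model across all steps."""
    if use_cache:
        # Prefill handles the prompt once; each later step feeds in one new token.
        return prompt_len + max(n_generated - 1, 0)
    # Without a cache, step i reprocesses the prompt plus i earlier outputs.
    return sum(prompt_len + i for i in range(n_generated))

def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Keys and values (hence the factor 2) for every layer, head, and token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

print(tokens_processed(1000, 100, use_cache=False))  # 104950
print(tokens_processed(1000, 100, use_cache=True))   # 1099

# Hypothetical model: 32 layers, 8 KV heads of dimension 128, 16-bit values.
per_token = kv_cache_bytes(1, n_layers=32, n_kv_heads=8, head_dim=128)
print(per_token)                                      # 131072 bytes per token
print(kv_cache_bytes(200_000, 32, 8, 128) / 2**30)    # GiB for a 200K-token sequence
```

For this made-up configuration the cache costs 128 KiB per token, so a single 200K-token sequence needs roughly 24 GiB of cache on top of the model weights.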

Engineering Implications

Understanding autoregressive generation explains several things you'll encounter:

Why streaming exists: Tokens arrive one at a time anyway — streaming just sends each to the client as it's produced instead of waiting for the complete response. This is not a special feature, it's the natural output pattern.
Why max_tokens exists: Without it, the model might generate indefinitely. It's a hard cutoff on the decode loop.
Why long outputs cost more than short ones: More tokens = more forward passes = more compute = higher price.
Why models can't "go back and fix" earlier text: Each token is committed once generated. The model can only append. If it realizes mid-sentence that it started wrong, it can only course-correct going forward (or, with models that support it, use extended thinking to plan before generating the visible response).
Why stop tokens matter in agents: When building agent loops, you need to detect whether the model stopped because it hit a stop token (it's done talking) or because it wants to call a tool. The stop_reason field in the API response tells you this.
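
A minimal agent loop that branches on the stop reason might look like the sketch below. The response dictionaries are a simplified, hypothetical shape rather than a real API payload, `fake_model` replays scripted responses instead of calling a provider, and the stop-reason strings are used here as illustrative values.

```python
def run_agent(call_model, tools, messages, max_turns=5):
    """Call the model repeatedly, executing tools until it finishes talking."""
    for _ in range(max_turns):
        response = call_model(messages)
        messages.append({"role": "assistant", "content": response["content"]})
        if response["stop_reason"] == "tool_use":
            # The model paused to request a tool: run it and feed the result back.
            call = response["tool_call"]
            result = tools[call["name"]](**call["input"])
            messages.append({"role": "user", "content": f"tool_result: {result}"})
            continue
        # Any other reason ends the loop; the caller can inspect which one it was.
        return response["stop_reason"], response["content"]
    return "max_turns", None

# Scripted stand-in for a model: first asks for a tool, then answers.
scripted = iter([
    {"stop_reason": "tool_use", "content": "Looking it up.",
     "tool_call": {"name": "get_capital", "input": {"country": "Estonia"}}},
    {"stop_reason": "end_turn", "content": "The capital of Estonia is Tallinn."},
])

def fake_model(messages):
    return next(scripted)

tools = {"get_capital": lambda country: {"Estonia": "Tallinn"}[country]}
messages = [{"role": "user", "content": "What is the capital of Estonia?"}]

reason, text = run_agent(fake_model, tools, messages)
print(reason, "|", text)
```

Returning the stop reason to the caller matters: a response cut off by the token limit needs different handling from one the model ended on its own.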

What "Thinking" Actually Is

Models like Claude offer an "extended thinking" mode. OpenAI has o1/o3 "reasoning" models. The marketing suggests the model is deliberating, planning, reflecting. The mechanical reality is simpler and more interesting.

Chain-of-Thought: Thinking as Token Generation

An LLM's only computation happens during the forward pass that generates each token. The model has no internal scratch pad, no working memory, no ability to "pause and reflect" between tokens. Every bit of "reasoning" must be externalized as generated tokens.

Chain-of-thought (CoT) is the discovery that if you make the model generate intermediate reasoning steps as text before producing the final answer, accuracy improves dramatically — especially on math, logic, and multi-step problems.

Why? Because each generated token becomes part of the input for the next token. When the model writes out "Let me work through this step by step: first, 17 × 24 = 408...", the token 408 is now in the context. The model can "see" its own intermediate result and use it for the next step. Without CoT, the model must compute the entire answer in a single forward pass — the few milliseconds it takes to produce one token. That's all the compute it gets.

Thinking is literally just generating more tokens. The model isn't doing anything different internally. It's the same autoregressive loop. The same forward pass. The same next-token prediction. The only difference is that instead of immediately outputting the answer, the model first outputs reasoning tokens that create useful context for subsequent tokens.

Extended Thinking / Reasoning Models

Extended thinking (Claude) and reasoning models (o1/o3) formalize this by:

Allocating a dedicated "thinking" section before the visible response.
Training the model (via RL and fine-tuning) to use this section for planning, decomposition, self-correction, and exploration.
Hiding the thinking tokens from the user (in some implementations) while still counting them as generated tokens.

The thinking block is not a separate cognitive process. It's the same autoregressive generation, written into a <thinking> section instead of the response. The model generates tokens like:

<thinking>
The user wants me to find the bug in this code.
Let me trace through the logic:
- Line 12: loop starts at i=0, correct
- Line 15: array access uses i+1, but array length is n...
  wait, when i = n-1, i+1 = n, that's out of bounds.
That's the bug. Off-by-one error on line 15.
</thinking>

The bug is an off-by-one error on line 15...

Every token in that thinking block costs compute and money. The model is literally "buying itself more time to think" by generating more tokens, which means more forward passes, which means more matrix multiplications, which means more actual computation applied to the problem.

Why More Tokens = Better Reasoning

This isn't mystical. There's a straightforward computational explanation:

A single forward pass through the model is a fixed-depth computation. It can only do so much work — like asking someone to solve a complex equation in their head in one second.
Each additional generated token adds another forward pass, which means another full trip through all the model's layers. It's additional compute applied to the problem.
Intermediate tokens create context. When the model writes "first, the function takes a list and..." those tokens become part of the input. The model can now condition on its own partial analysis, building up complex reasoning that couldn't fit in a single pass.

It's analogous to humans using pen and paper for long division. You could probably divide 7,843 by 17 in your head, but it's easier and more reliable to write down intermediate results. The LLM's generated tokens are its pen and paper. Without them, all computation must happen in the ~100 layers of the neural network during a single forward pass.
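
To put rough numbers on "more tokens = more compute": a common rule of thumb is that one forward pass costs about 2 floating-point operations per model parameter per token. The parameter count below is a hypothetical example, and the estimate ignores attention overhead and prompt processing.

```python
def forward_flops(n_params, n_tokens):
    """Rough estimate: ~2 FLOPs per parameter per generated token."""
    return 2 * n_params * n_tokens

N_PARAMS = 70_000_000_000   # hypothetical 70B-parameter model

direct = forward_flops(N_PARAMS, 1)           # answer in a single token
with_cot = forward_flops(N_PARAMS, 200 + 1)   # 200 reasoning tokens, then the answer

print(f"{direct:.2e} FLOPs for a one-token answer")
print(f"{with_cot:.2e} FLOPs with 200 reasoning tokens first")
print(with_cot // direct, "x more compute applied to the problem")
```

Under this estimate, writing out 200 reasoning tokens applies about 200 times more computation to the question than answering in one token.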

What This Means for Agents

For agentic systems, the implications are:

Thinking budgets matter. More thinking tokens = better decisions but higher cost and latency. You need to tune this based on task complexity. Don't burn 10K thinking tokens on "what's 2+2."
You can induce reasoning via prompts. Even without dedicated thinking modes, prompting with "think step by step" or "analyze before answering" triggers CoT behavior because the model has been trained on such patterns.
Thinking is not free. Those hidden thinking tokens count against context windows and billing. A model that "thinks" for 5,000 tokens before a 200-token response costs 5,200 tokens of output.
Models can be confidently wrong in their thinking. The thinking block is still just next-token prediction. The model can reason flawlessly for 20 steps and then make a subtle error on step 21. It has no formal verification - it's pattern matching all the way through.
Planning via thinking helps agents. Before calling tools, a thinking block that plans which tools to use and in what order produces better tool-calling sequences. The model is giving itself a roadmap it can then follow.
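
A small sketch of the budgeting arithmetic, using the 5,000 + 200 token example from the list above. The price and the budget tiers are placeholder values for illustration, not recommendations or real rates.

```python
def billed_output_tokens(thinking_tokens, response_tokens):
    """Thinking tokens are generated tokens, so they bill as output."""
    return thinking_tokens + response_tokens

def output_cost_usd(thinking_tokens, response_tokens, price_per_mtok):
    """Output cost given a (hypothetical) price per million output tokens."""
    total = billed_output_tokens(thinking_tokens, response_tokens)
    return total * price_per_mtok / 1_000_000

# Hypothetical budget tiers keyed by how hard the task looks.
THINKING_BUDGETS = {"trivial": 0, "moderate": 2_000, "hard": 10_000}

total = billed_output_tokens(5_000, 200)
print(total)                                  # 5200
print(f"{5_000 / total:.0%} of the output bill is thinking")
print(output_cost_usd(5_000, 200, 15.0))      # placeholder price of $15 per million tokens
print(THINKING_BUDGETS["trivial"])            # no thinking budget for "what's 2+2"
```

In this example the visible answer is under 4% of what you pay for; the rest is reasoning the user may never see.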

The Core Insight

The key architectural insight is the transformer (Vaswani et al., 2017): a model that processes input tokens in parallel using self-attention mechanisms, allowing it to weigh relationships between all tokens in a sequence simultaneously. The result is a function that takes a sequence of tokens and produces a probability distribution over what comes next.

Everything else - chat interfaces, system prompts, tool calling, agents - is engineering built on top of this core capability.
