54 - Scaling Agents: Cost and Multi-Agent Coordination
L02 said tokens are the unit of measurement. By this point in the course, tokens are the unit of budget. An agent that runs for hours at Opus rates is not a research curiosity — it is a line item your finance team will ask about. Multiply by users and tasks per day and you have a real number with a currency symbol next to it.
This lecture is in two parts. Part A covers what to optimize within a single agent: routing, caching, batching, compression, and the rare case where fine-tuning earns its keep. Part B covers whether to split into many agents at all — because most "we need multi-agent" problems are really one-agent problems with bad cost discipline. We do Part A first on purpose. The cost-analysis vocabulary you build there is the lens that tells you when peer agents are worth their overhead and when they are architecture theatre. By the end you should be able to look at any agent system and answer two questions in order: is the cost discipline right inside each agent, and is the boundary between agents pulling its weight.
Part A — Cost optimization
A1 — The cost wall in production agents
A coding session that hums along at 50K tokens per turn, runs forty turns, and uses Opus for the thinking turns has already burned dollars before the user got an answer. Now picture a thousand of those running in a SaaS product. The single-developer "this is fine" arithmetic stops being fine the moment someone else is paying for it.
There is a second, less obvious cost wall: long-running autonomous agents whose costs grow quadratically. Each turn appends to history; each subsequent turn re-tokenizes that history; the bill for turn N includes everything from turns 1 through N. Without compression, a 100-turn session pays for token 1 a hundred times. This is not a bug, it is autoregressive generation working as designed (recall the prefill phase in L02). It is also why the difference between a 20-turn session and a 100-turn session is not 5× the cost — it is closer to 25×.
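The quadratic growth is easy to check numerically. The sketch below is a minimal model, assuming every turn appends the same number of tokens and that nothing is cached or compressed; the function name and the 2,000-token figure are illustrative, not taken from any real session.

```python
def cumulative_input_tokens(turns: int, tokens_per_turn: int, static_prefix: int = 0) -> int:
    """Total input tokens billed over a session with no caching or compression.

    Turn k re-sends the static prefix plus everything appended in turns 1..k.
    """
    return sum(static_prefix + k * tokens_per_turn for k in range(1, turns + 1))


short_session = cumulative_input_tokens(turns=20, tokens_per_turn=2_000)
long_session = cumulative_input_tokens(turns=100, tokens_per_turn=2_000)
ratio = long_session / short_session  # about 24x the input tokens for 5x the turns
```

A non-zero `static_prefix` pulls the ratio back toward linear, which is one way to see why caching the prefix and compressing the tail attack different parts of the bill.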
Three workload shapes recur in production. Each calls for a different optimization mix.
Bursty interactive. A developer's coding session in Claude Code, Cursor, or your own custom harness. Latency matters. Cost per turn matters. Cost per idle turn matters less because there are not many of them. The user is on the keyboard, prompts are unique, the session ends in tens of minutes.
High-volume batchable. Annotating a corpus of 200K support tickets. Tagging 50K resumes. Running an evaluation suite (forward-link L52) over thousands of cases. No human waits for any single response. Throughput per dollar is the only metric.
Long-horizon autonomous. Overnight refactor across a monorepo. A research agent that crawls and summarizes a domain over hours. A migration that touches hundreds of files. The agent is alone with the problem. Quality dominates cost only because failures are invisible until morning, when you discover the bill.
| Workload shape | Knobs that matter most | Knobs that don't |
|---|---|---|
| Bursty interactive | Cache, route to cheaper for non-critical turns, compress mid-loop | Batch API (latency-incompatible), fine-tune (engineering ROI too slow) |
| High-volume batchable | Batch API, route to cheap tier, light caching | Real-time streaming, expensive interactive features |
| Long-horizon autonomous | Cache (long sessions), compress mid-loop, route subagents to cheap tier, checkpointed state | Synchronous user-facing constraints |
A small worked example to make the wall concrete. Suppose your bursty interactive coding session averages 50K input tokens and 4K output tokens per turn, runs 40 turns, and the user runs three sessions a day. That is 2M input tokens and 160K output tokens per session. Assuming Sonnet-class list prices of roughly $3 per million input tokens and $15 per million output tokens (verify current pricing), a session costs about $8.40, or about $25 per developer per day. A team of 100 developers runs that for 200 working days: roughly $500K a year of API spend before you have shipped anything, with no caching, no routing, and no compression. Caching alone, applied properly to the static prefix, typically removes a large fraction of the input cost on long sessions. Routing the trivial turns to Haiku removes another large fraction of the cost on the cheap-turn tail. Neither requires changing what the agent does. They both go straight to the bill.
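It is worth redoing this arithmetic with your own numbers, because the result is very sensitive to the per-token price. The sketch below is a back-of-the-envelope estimator; the two price constants are placeholder assumptions, not authoritative rates, and the function names are made up for this example.

```python
def session_cost_usd(
    input_tokens_per_turn: int,
    output_tokens_per_turn: int,
    turns: int,
    usd_per_mtok_in: float,
    usd_per_mtok_out: float,
) -> float:
    """Uncached, unrouted cost of one session at a flat per-token price."""
    input_cost = input_tokens_per_turn * turns * usd_per_mtok_in / 1_000_000
    output_cost = output_tokens_per_turn * turns * usd_per_mtok_out / 1_000_000
    return input_cost + output_cost


def annual_spend_usd(session_cost: float, sessions_per_day: int, developers: int, working_days: int) -> float:
    """Scale one session's cost up to a team and a working year."""
    return session_cost * sessions_per_day * developers * working_days


# Placeholder prices (assumption): substitute your provider's current rates.
ASSUMED_USD_PER_MTOK_IN = 3.0
ASSUMED_USD_PER_MTOK_OUT = 15.0

per_session = session_cost_usd(50_000, 4_000, 40, ASSUMED_USD_PER_MTOK_IN, ASSUMED_USD_PER_MTOK_OUT)
per_year = annual_spend_usd(per_session, sessions_per_day=3, developers=100, working_days=200)
```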
The rest of Part A walks the knobs in roughly descending ROI order. Reading them out of sequence is fine; following them in order is what real teams do because the early ones are cheap to deploy and the late ones are expensive to deploy. The single most common mistake is reaching for fine-tuning when prompt caching would have saved more money in one afternoon. Hold that thought.
A2 — Model routing: cheap-first cascade
Most "agent on Opus" workloads have a heavy tail of trivial turns. Acknowledgements. Reading a single file. Confirming a tool result. There is no reason these turns should pay Opus rates.
The pattern is a cheap-first cascade: try Haiku, escalate to Sonnet on signal, escalate to Opus only when the previous tier failed twice or returned low confidence. The tail of cheap turns disappears. The few hard turns still get the model they deserve.
The routing signal can be: a JSON-schema validator on the output, an explicit self-assessment field the model fills, the presence or absence of expected tool calls, or a token-level logprob check on critical fields. Anything cheap enough that running it twice still beats one Opus call.
Pick the signal carefully. Self-reported confidence from the cheaper model is the easiest to wire up and the easiest to be lied to by — small models calibrate poorly and tend to be overconfident on out-of-distribution inputs. A schema validator is harder to fool because the schema is mechanical, not negotiated; if the cheap model emits malformed JSON, it failed regardless of how confident it claimed to be. Combine the two: schema validation as a hard gate, self-reported confidence as a soft signal for borderline cases. Logprob-based gates (looking at the probability of the chosen token versus its top alternatives) are the most rigorous and the most provider-specific — they require API access to logprobs and they will not survive a model upgrade unchanged.
A second-order trap with cascades: the cheap-tier model's failures may correlate with the kind of input (long, ambiguous, off-distribution) that also makes the expensive tier struggle. The cascade does not magically convert hard problems into easy ones. It converts the easy share of the workload into a cheap share. Measure the rate at which inputs reach the top tier, and measure the failure rate at the top tier separately. If the top tier is failing at 20%+, you have a model-quality problem the cascade cannot solve, and you are paying the cascade's overhead on the way to that failure.
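```python
import json

import anthropic
from jsonschema import ValidationError, validate

client = anthropic.Anthropic()

# (model, ceiling) pairs, cheapest first. The second field is carried along
# for reference only; the routing logic below does not read it.
CASCADE = [
    ("claude-haiku-4-5", 0.6),
    ("claude-sonnet-4-7", 0.8),
    ("claude-opus-4-7", 1.0),
]


def call_with_self_assessment(model: str, system: str, user: str, schema: dict) -> dict:
    """Call one tier and hard-gate its output on JSON parsing and the schema."""
    rsp = client.messages.create(
        model=model,
        max_tokens=1024,
        system=system + "\nReturn JSON with fields: answer, confidence (0..1).",
        messages=[{"role": "user", "content": user}],
    )
    text = rsp.content[0].text
    parsed = json.loads(text)  # raises JSONDecodeError on malformed output
    validate(parsed, schema)   # raises ValidationError on schema mismatch
    return parsed


def cascade(system: str, user: str, schema: dict, threshold: float = 0.7) -> dict:
    """Try each tier once, cheapest first; escalate on a failed gate or low confidence."""
    last_error = None
    for model, _ceiling in CASCADE:
        try:
            out = call_with_self_assessment(model, system, user, schema)
        except (json.JSONDecodeError, ValidationError) as e:
            last_error = e
            continue
        # The schema is expected to require a numeric `confidence` field.
        confidence = out.get("confidence", 0.0)
        if confidence >= threshold:
            return out | {"model_used": model}
        # Record why this tier was rejected so the final error is informative.
        last_error = f"{model} reported confidence {confidence} < {threshold}"
    raise RuntimeError(f"All tiers failed: {last_error}")
```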
The same pattern applies on OpenAI — start at gpt-5.3-chat, escalate to gpt-5.4, fall back to a reasoning model only when both fail. The TalTech proxy already has the routing infrastructure in place; see L99 for the per-model entry points and the cost matrix.
L30 gives you a shipping example of routing inside one harness: Claude Code's Explore subagent runs on Haiku by default, even when the parent is on Opus or Sonnet. Explore burns tokens reading code; the parent burns tokens reasoning. Mixing the two on the same model wastes Opus rates on grep-equivalent work.
Gotcha — measure P95, not mean. A cascade is cheap on average and expensive on the worst case, because the worst case pays for every tier. If 5% of turns escalate all the way, those turns cost the sum of all three. A naive mean-cost report makes the cascade look better than it is. Track P95 cost per turn, P99 latency to first token, and the escalation rate per tier. If the escalation rate to Opus is over 30%, the routing signal is broken — you are paying Haiku and Sonnet to be wrong before paying Opus to be right.
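A small report function makes the gap between mean and tail visible. This is a sketch only: the per-turn record shape (`cost_usd`, `final_model`) is an assumed logging format, and the percentile uses the simple nearest-rank method.

```python
import math
from collections import Counter


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty list."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def cascade_report(turns: list[dict], top_tier: str) -> dict:
    """Summarize per-turn cost (summed over all tiers tried) and where turns ended up."""
    costs = [t["cost_usd"] for t in turns]
    final = Counter(t["final_model"] for t in turns)
    return {
        "mean_cost": sum(costs) / len(costs),
        "p95_cost": percentile(costs, 95),
        "share_by_final_tier": {m: n / len(turns) for m, n in final.items()},
        "top_tier_rate": final[top_tier] / len(turns),
    }
```

In a log where most turns stop at the cheapest tier, `mean_cost` stays low while `p95_cost` jumps to the full three-tier price, which is exactly the gap a mean-only dashboard hides.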
A3 — Prompt caching
Every API call re-processes, and bills for, the entire input: the system prompt, the tool definitions, and every prior message in the conversation. On a 40-turn session, the static prefix is paid for forty times. Prompt caching makes the provider remember the prefix and bill subsequent reads at a steep discount.
Anthropic explicit cache. You opt in by placing cache_control: {"type": "ephemeral"} markers at breakpoints in the request, up to four per request. Everything from the start of the request up to a marked breakpoint is cached; subsequent calls that share exactly the same prefix hit the cache.
Pricing as of the time this lecture was written (verify against current Anthropic pricing — these numbers move):
| Item | Value |
|---|---|
| Cache write (first request) | 1.25× base input price |
| Cache read (subsequent requests) | 0.1× base input price |
| Default TTL | 5 minutes |
| Extended TTL (beta) | 1 hour, higher write multiplier |
The math: if a 30K-token system prompt is reused twenty times in a five-minute window, you pay 1.25× once and 0.1× nineteen times (3.15× in total) instead of 1.0× twenty times. That is roughly an 84% reduction on the cached portion, approaching 90% as the number of reuses grows.
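The same arithmetic as a function, so you can plug in other reuse counts or multipliers. This is a sketch under two assumptions: the multipliers are the ones in the table above (verify them), and every call after the first is a cache hit within the TTL.

```python
def relative_prefix_cost(calls: int, write_mult: float = 1.25, read_mult: float = 0.10) -> float:
    """Cost of a cached prefix over `calls` requests, as a fraction of the uncached cost."""
    if calls < 1:
        raise ValueError("calls must be at least 1")
    cached_total = write_mult + read_mult * (calls - 1)  # one write, then reads
    uncached_total = 1.0 * calls
    return cached_total / uncached_total


twenty_reuses = relative_prefix_cost(20)  # 3.15 / 20 of the uncached prefix cost
single_call = relative_prefix_cost(1)     # a lone call pays the write premium
```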
OpenAI automatic cache. No opt-in. The platform recognizes long shared prefixes and discounts reads automatically. Less control, no breakpoints, but no engineering work either. Discounted-read math is comparable in spirit; specific pricing differs and changes — check current OpenAI pricing at the time you build.
What to cache. The static prefix: long system prompt, large tool definitions, a stable knowledge base passage, lengthy few-shot examples. Not the variable suffix — the user message and the recent conversation tail change every turn and would invalidate the cache.
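```python
import json

import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = "You are a senior code reviewer. ..."  # ~5K tokens
LONG_TOOL_DEFS = [
    # ~25K tokens of tool definitions for a domain harness
]


def turn(user_text: str, history: list[dict]) -> str:
    """Run one turn with the static prefix marked for caching."""
    response = client.messages.create(
        model="claude-sonnet-4-7",
        max_tokens=2048,
        system=[
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
            },
            {
                "type": "text",
                # Tool definitions are serialized into the system prompt to keep
                # the sketch short; callable tools would normally go in `tools`.
                "text": json.dumps(LONG_TOOL_DEFS),
                # Breakpoint: everything up to and including this block is cached.
                "cache_control": {"type": "ephemeral"},
            },
        ],
        # The variable suffix comes after the breakpoint and is not cached.
        messages=history + [{"role": "user", "content": user_text}],
    )
    return response.content[0].text
```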
The breakpoint is at the end of the tool definitions. Everything before it is cached. The history and current user turn are not — they change every call.
Subagent gotcha (cite L30). Subagents do not share the parent's cache. Each subagent boots with its own system prompt, its own tools, and pays its own cache-write on the first call. If you fan out to ten Explore subagents, you pay ten cache writes, not one. This still beats running ten copies on Opus, but it is not free. Design for it: keep subagent system prompts short and stable so the write amortizes across the subagent's own turns; do not spawn ten ephemeral subagents that each only run one turn — the cache write never pays off.
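The break-even rule of thumb: on pure token arithmetic with the multipliers above, the 0.25× write premium is recovered by the first cache read, which saves 0.9×, so a subagent that makes two calls on the same prefix within the TTL already comes out ahead. In practice TTL expiries and prefix changes can eat into that, so treat roughly three to five turns on a stable prefix as the comfortable target. A subagent that runs a single turn paid the 1.25× write multiplier for nothing. Above break-even, every additional turn deepens the savings.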
This is why long-running subagents (research, exploration) win on caching and short tactical subagents (one-shot summarization) do not. If your subagent design tends to spawn many short-lived helpers, the right move is to consolidate them into fewer, longer-lived ones, or to skip caching for them and keep their prompts short enough that paying full input price is negligible.
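To see where the token arithmetic alone puts the break-even point, here is a minimal sketch. It assumes every call after the first hits the cache, so it gives a lower bound; TTL expiries and prefix changes push the practical number higher. The default multipliers are the ones quoted earlier in this section and should be verified.

```python
def calls_to_break_even(write_mult: float = 1.25, read_mult: float = 0.10) -> int:
    """Smallest number of calls on one prefix for which caching is cheaper than not caching."""
    if read_mult >= 1.0:
        raise ValueError("a cache read must be cheaper than an uncached read")
    calls = 1
    # Cached cost: one write plus (calls - 1) reads. Uncached cost: `calls` full reads.
    while write_mult + read_mult * (calls - 1) >= calls:
        calls += 1
    return calls


default_break_even = calls_to_break_even()
pricier_write = calls_to_break_even(write_mult=2.0)  # hypothetical higher write premium
```

The one-turn subagent is the only case that always loses: it pays the write premium and never reads.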
| Provider | What is cached | Control | Discount on read | Notes |
|---|---|---|---|---|
| Anthropic explicit | Up to four breakpoint segments | Manual cache_control markers | ~0.1× | 5 min TTL default; 1 h extended (beta) |
| OpenAI automatic | Long shared prefixes | None — automatic | Discounted (provider-defined) | No opt-in; less predictable |
| Subagent caches | Per subagent, independent | Per subagent | Same as above | No sharing across siblings |
A4 — Batch API
Both Anthropic (Message Batches) and OpenAI (Batch API) offer roughly 50% off the per-token rate in exchange for an asynchronous SLA — typically up to 24 hours, often much faster in practice. The catch is in the name: it is asynchronous. You hand off a batch, you come back later for the results.
Right tool for:
- Offline corpus processing — annotation, tagging, classification at scale
- Evaluation suites — running thousands of test cases overnight against a new model or prompt (forward-link L52)
- Bulk summarization — every ticket from yesterday, every PR from the last sprint
- Nightly data pipelines
Wrong tool for: anything a user is waiting on. A batch is not "really fast" — it is "eventually." Conflate the two and you ship a product where the spinner spins for hours.
The 50% discount is not just a billing detail; it is what enables certain workloads to exist at all. An eval suite that costs $4000 to run live is unaffordable for most teams; the same suite at $2000 in a nightly batch is a regular line item. The economic argument for proper evaluation in L52 leans on this discount. If your eval pipeline is not on the Batch API, you are paying twice as much per run, which in practice means running it less often and learning later whether the system works.
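```python
import time

import anthropic

client = anthropic.Anthropic()

# `tickets` (objects with .id and .body) and `save_tag` are assumed to be
# provided by the surrounding application.
requests = [
    {
        "custom_id": f"ticket-{ticket.id}",
        "params": {
            "model": "claude-haiku-4-5",
            "max_tokens": 256,
            "messages": [{"role": "user", "content": f"Tag this ticket: {ticket.body}"}],
        },
    }
    for ticket in tickets
]

batch = client.messages.batches.create(requests=requests)

# Poll until the whole batch has finished processing.
while True:
    status = client.messages.batches.retrieve(batch.id)
    if status.processing_status == "ended":
        break
    time.sleep(30)

for entry in client.messages.batches.results(batch.id):
    # Only succeeded entries carry a message; errored or expired ones do not.
    if entry.result.type == "succeeded":
        save_tag(entry.custom_id, entry.result.message.content[0].text)
    else:
        print(f"{entry.custom_id}: {entry.result.type}")
```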
Hybrid pattern. The strongest production design uses the same business logic through two API surfaces. Live API powers interactive paths where the user is waiting. Nightly batch backfills the offline portion at half price — re-tagging older tickets, evaluating yesterday's traffic against today's prompt, generating the reports nobody reads in real time. One codebase, two SLA tiers, half the bill on the offline tier.
A practical implementation note. Keep the prompt and system instructions identical between the live and batch paths so a regression in one is immediately visible in the other. The temptation is to "optimize" the batch prompt because nobody is watching it; the cost is that drift between the two paths becomes its own debugging surface. Treat the batch as a different SLA on the same logic, not as a different product.
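One way to make drift structurally hard is to build the request parameters in exactly one place and let both paths wrap the result. The sketch below shows the idea with plain dictionaries; `SYSTEM`, `build_params`, and the two wrapper names are hypothetical, and the actual API calls are left to the caller.

```python
SYSTEM = "You tag support tickets with exactly one label."  # hypothetical shared prompt


def build_params(ticket_body: str, model: str = "claude-haiku-4-5") -> dict:
    """Single source of truth for the prompt, used by both the live and batch paths."""
    return {
        "model": model,
        "max_tokens": 256,
        "system": SYSTEM,
        "messages": [{"role": "user", "content": f"Tag this ticket: {ticket_body}"}],
    }


def live_request(ticket_body: str) -> dict:
    """Keyword arguments for a synchronous call, e.g. client.messages.create(**params)."""
    return build_params(ticket_body)


def batch_request(ticket_id: str, ticket_body: str) -> dict:
    """One entry for a batch submission; only the envelope differs from the live path."""
    return {"custom_id": f"ticket-{ticket_id}", "params": build_params(ticket_body)}
```

With this shape, a prompt change cannot land in one path without landing in the other.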
The other practical note: idempotency. Batch jobs get retried, partially fail, and occasionally double-deliver. Every batch request needs a deterministic custom_id you can use as an idempotency key on the consuming side. Re-running the same batch should not double-write your downstream. This is mundane infrastructure thinking and it is also what separates production batches from notebook scripts.
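A minimal sketch of both halves follows: a deterministic `custom_id` derived from the work item, and a consumer that refuses to write the same id twice. The in-memory dictionary stands in for whatever store you actually use (in production this would more likely be a unique-key constraint), and all names here are illustrative.

```python
import hashlib


def custom_id_for(ticket_id: str, prompt_version: str) -> str:
    """Same ticket and prompt version always yield the same id, across retries and reruns."""
    digest = hashlib.sha256(f"{prompt_version}:{ticket_id}".encode()).hexdigest()
    return f"tag-{digest[:32]}"


class IdempotentSink:
    """Toy downstream writer that ignores repeated deliveries of the same custom_id."""

    def __init__(self) -> None:
        self._written: dict[str, str] = {}

    def write(self, custom_id: str, tag: str) -> bool:
        """Return True if the row was written, False if it was a duplicate."""
        if custom_id in self._written:
            return False
        self._written[custom_id] = tag
        return True
```

Including the prompt version in the id is a design choice: re-running the same prompt is a no-op, while a new prompt version produces new rows instead of silently overwriting old ones.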
A third practical note: visibility. Batch failures happen silently if you do not look. The job-status field reports completion; it does not always surface per-request failures clearly. Always reconcile requests submitted against results returned before treating a batch as done. Discrepancies are common and quietly corrupt your downstream data when ignored.
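The reconciliation step is a few lines of set arithmetic. The sketch below assumes you have already flattened the batch output into `(custom_id, result_type)` pairs, with `"succeeded"` marking a usable answer; the function and key names are illustrative.

```python
from collections import Counter


def reconcile(submitted_ids: set[str], results: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Compare what was submitted with what came back."""
    counts = Counter(custom_id for custom_id, _ in results)
    returned = set(counts)
    return {
        "missing": sorted(submitted_ids - returned),      # submitted, never returned
        "unexpected": sorted(returned - submitted_ids),   # returned, never submitted
        "duplicated": sorted(cid for cid, n in counts.items() if n > 1),
        "failed": sorted({cid for cid, kind in results if kind != "succeeded"}),
    }


def is_clean(report: dict[str, list[str]]) -> bool:
    """A batch is done only when every bucket in the report is empty."""
    return not any(report.values())
```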
A5 — Context compression mid-loop
Long agent loops accumulate tool results faster than caching can save. The cache helps the static prefix; it does not help the conversation tail, which grows linearly. By turn 30, that tail is the dominant cost on every subsequent turn — and worse, it is almost entirely junk the model has already used and moved past.
Compression techniques, roughly cheap-to-expensive by engineering effort:
1. Summarization checkpoints. Every N turns, replace older message blocks with a one-paragraph summary the agent itself produces. Define the policy explicitly: which fields must survive (open tasks, key file paths, error messages, decisions taken), which may be discarded (resolved tool outputs, navigation chatter, intermediate file contents).
2. Selective retention. Keep tool names and args in the history, drop verbose results once the model has consumed them. The agent remembers it called read_file("/etc/passwd") and got something back; it does not need the full file content thirty turns later.
3. Handoff to a subagent. L30's "delegate the noise" — for any long verbose side-quest, push it into a subagent with its own context, have the subagent return a summary, and let the subagent's bloated context get garbage-collected when it returns. The parent never sees the noise.
4. External memory. Write intermediate state to a file or database, reload on demand. The trade-off: you pay reload cost when you need the state again, you save tail cost when you do not. Security note (cite L53): anything you write to external memory becomes a poisonable input on reload. If memory can be tampered with — by another agent, by user-supplied content, by a compromised file system — it becomes a fresh injection vector every time it is loaded.
```python
def compact_history(history: list[dict], keep_last: int = 6) -> list[dict]:
    """
    Replace tool_result blocks older than the last `keep_last` messages
    with a one-line summary, preserving the original message boundaries.
    Anthropic places tool_result blocks inside role="user" messages.
    """
    if len(history) <= keep_last:
        return history
    head, tail = history[:-keep_last], history[-keep_last:]
    compacted = []
    for msg in head:
        content = msg.get("content")
        # Only user messages with block-list content can carry tool results.
        if msg.get("role") != "user" or not isinstance(content, list):
            compacted.append(msg)
            continue
        new_blocks = []
        for b in content:
            if not (isinstance(b, dict) and b.get("type") == "tool_result"):
                new_blocks.append(b)
                continue
            raw = b.get("content", "")
            if isinstance(raw, str):
                # Truncation stands in for a real summary in this sketch.
                summary = raw[:120] + "..." if len(raw) > 120 else raw
            else:
                # Non-string results (block lists, images) are dropped entirely.
                summary = "[non-text result omitted]"
            new_blocks.append({
                "type": "tool_result",
                # Keep the id so each result still pairs with its tool_use block.
                "tool_use_id": b["tool_use_id"],
                "content": f"[summary] {summary}",
            })
        compacted.append({"role": "user", "content": new_blocks})
    return compacted + tail
```
Anthropic also exposes context-editing and persistent-memory features in beta; treat them as accelerators on top of the same pattern, not replacements for thinking about what to keep. Persistent memory is particularly tempting because it sounds like the answer to long-running agents — a place to put state across sessions. It works, but every entry in persistent memory is loaded back into context on the next session, which means it counts against the same context budget you were trying to save. Persistent memory is useful for summaries and decisions; it is harmful for raw artefacts that the next session would never need.
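The earlier warning that external memory is a poisonable input applies to persistent memory as well. One cheap mitigation, sketched below with the standard library, is to seal each record with an HMAC when the harness writes it and verify it on reload. This is an assumption-laden sketch: it only detects tampering by parties who lack the key, and it does nothing about poisoned content the agent itself wrote after reading untrusted input.

```python
import hashlib
import hmac
import json


def seal(state: dict, key: bytes) -> str:
    """Serialize agent state together with an HMAC-SHA256 tag over the payload."""
    payload = json.dumps(state, sort_keys=True)
    tag = hmac.new(key, payload.encode(), hashlib.sha256).hexdigest()
    return json.dumps({"payload": payload, "tag": tag})


def unseal(blob: str, key: bytes) -> dict:
    """Verify the tag before trusting the record; raise if it was altered."""
    record = json.loads(blob)
    expected = hmac.new(key, record["payload"].encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, record["tag"]):
        raise ValueError("memory record failed integrity check")
    return json.loads(record["payload"])
```

The key has to live somewhere the agent's tools cannot read, otherwise an injected instruction can simply re-seal whatever it writes.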
Gotcha — lossy compaction degrades recall. A compactor that drops the wrong field will silently make the agent worse at one type of task. Test compactions like you test refactors: have a regression suite of tasks the agent should still complete after compaction is enabled, run it before and after every compaction-policy change (forward-link L52). "It feels fine" is not a test.
A particularly nasty failure mode: the agent loses access to a piece of information it referenced earlier and starts confabulating. Without compression the model would have re-read the original tool result; with compression the original is gone and the model fills the gap with plausible nonsense. The way you catch this is the same as how you catch any quiet regression — automated evals that probe specifically for facts that should still be recallable after compaction. The prompt "what was the third tool call you made?" becomes a useful canary.
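Before running the full behavioral evals, a string-level check can catch the crudest losses: facts that were literally present in the history before compaction and are literally gone afterwards. The sketch below assumes Anthropic-style message dictionaries; the function names and the idea of a hand-picked canary list are illustrative, and passing this check does not show the agent can still use the facts.

```python
def history_text(history: list[dict]) -> str:
    """Flatten string content and text/tool_result blocks into one searchable string."""
    parts: list[str] = []
    for msg in history:
        content = msg.get("content")
        blocks = content if isinstance(content, list) else [content]
        for block in blocks:
            if isinstance(block, str):
                parts.append(block)
            elif isinstance(block, dict):
                for key in ("text", "content"):
                    if isinstance(block.get(key), str):
                        parts.append(block[key])
    return "\n".join(parts)


def facts_lost(before: list[dict], after: list[dict], canaries: list[str]) -> list[str]:
    """Canary strings that were present before compaction and are missing after it."""
    before_text, after_text = history_text(before), history_text(after)
    return [c for c in canaries if c in before_text and c not in after_text]
```

A non-empty result is a cheap early warning that the compaction policy dropped something the agent may later need, such as an error code buried at the end of a long tool result.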
A6 — Fine-tune vs prompt
Default position: don't fine-tune. The frontier models are good enough that prompt engineering, retrieval, and few-shot examples cover the overwhelming majority of tasks at lower engineering cost and zero lock-in. Fine-tuning is also where engineering ego is at its loudest — it sounds more impressive than "we cached the system prompt" — and the field has not collectively reckoned with how often it is the wrong answer.
Fine-tuning beats prompting only when all of the following hold:
- You have thousands of high-quality, task-specific examples — and the cost of curating them is paid for by what you save.
- The task is narrow and stable. Today's fine-tune is tomorrow's stale checkpoint if the underlying behavior keeps drifting.
- Prompt engineering plus few-shot plus retrieval is genuinely exhausted — you have measured (not assumed) that they top out below your bar.
- Latency or per-call cost dominates total cost of ownership — for instance, a real-time classifier called at 1000 RPS where a smaller fine-tuned model wins on raw throughput.
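The first and last conditions reduce to a break-even calculation you can run before committing to anything. The sketch below is deliberately crude: all four inputs are estimates you supply, and the example figures in the usage lines are hypothetical, not benchmarks.

```python
from typing import Optional


def months_to_break_even(
    one_off_cost_usd: float,
    calls_per_month: float,
    saving_per_call_usd: float,
    monthly_upkeep_usd: float = 0.0,
) -> Optional[float]:
    """Months until curation and training costs are repaid by per-call savings.

    Returns None when upkeep (re-evaluation, retraining) eats the whole saving.
    """
    net_monthly_saving = calls_per_month * saving_per_call_usd - monthly_upkeep_usd
    if net_monthly_saving <= 0:
        return None
    return one_off_cost_usd / net_monthly_saving


# Hypothetical figures for illustration only.
high_volume = months_to_break_even(60_000, 10_000_000, 0.001, monthly_upkeep_usd=4_000)
low_volume = months_to_break_even(60_000, 100_000, 0.001, monthly_upkeep_usd=4_000)
```

If the answer is longer than the expected lifetime of the base model you are tuning, the fine-tune does not pay for itself.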
A close cousin is distillation: use the big model to generate training data for a smaller model that you then deploy. Same cost-benefit math; same warning about narrow stable tasks. Distillation has one specific sweet spot — running an expensive model in development to produce a corpus of input/output pairs, then training a much cheaper model on that corpus for production. The development-time labels are real frontier model outputs (cheap to generate as a one-off cost); the production-time inference is on a small fast model. When the task is narrow and the production volume is high, this beats every other knob.
Provider availability shifts. As of writing, Anthropic does not offer customer-facing fine-tuning of Claude models — verify at the time you build. OpenAI offers fine-tuning for some GPT models. Open-weight families (Llama, Mistral, Qwen) have full fine-tuning available outside the major-provider ecosystems.
| Signal | Reach for |
|---|---|
| "The prompt is too long, every call is slow." | Cache, then compress |
| "We are hitting context limits." | Compression, subagents, retrieval |
| "It is wrong about our domain vocabulary." | Few-shot examples in prompt, then RAG |
| "It cannot use our internal API." | Tools, MCP, skills (not fine-tune) |
| "We have 50K labeled examples and a static narrow task." | Fine-tune (or distillation) |
| "Latency dominates everything." | Smaller model with fine-tune |
| "We want better reasoning on hard problems." | Bigger model, more thinking budget — not fine-tune |
The hidden cost of fine-tuning is rarely the training run itself. It is data curation (the human-hours to label and clean), evaluation infrastructure (how do you know the new checkpoint is better?), and re-doing all of it when the upstream base model improves. By the time your fine-tune is in production, the next-generation base model often beats your tuned predecessor with no special prompting at all. Plan the lifecycle, not just the training step.
The ROI ladder for cost optimization, ranked by what we have seen pay back fastest in real production: cache > batch > route > compress > fine-tune. Most teams reach for the bottom of the ladder first. Don't be most teams.
Part B — Multi-agent coordination
Now that you know how to make one agent cheaper, the question is whether to spawn many. The instinct is usually wrong, and the cost analysis from Part A is the lens that tells you why. The same dollars that buy you cache hits and routing inside one agent are the dollars you spend twice (or N times) when you split into peer agents without thinking. Read this part with the cost ladder still in your head — the green-light criteria below are what justify spending those extra dollars; without them, you are buying overhead.
A useful note on terminology before we start. The phrase "multi-agent" is overloaded in the wild. People use it to mean (a) a single agent with subagents in one harness, (b) several peer agents in separate processes with a coordinator, (c) decentralized swarms with no coordinator, and (d) chains of LLM calls glued together with conditional logic.
These are very different architectures with very different cost and reliability profiles. Insist on which one you mean every time you read or write that phrase. The cost analysis below assumes (b) — separate-process peers — because that is where the marketing pressure is and where the most expensive mistakes are made. Cases (a) and (d) are addressed in L30 and in §B4 respectively. Case (c) — fully decentralized swarms — remains a research curiosity in 2026; if you find yourself reaching for it, you are almost certainly building something simpler in disguise.
B1 — The honest pitch: most multi-agent is bad architecture
The marketing pitch for multi-agent systems is "specialization improves quality and parallelism reduces wall-clock time." Both are sometimes true. The cost story is what marketing leaves out, and it has three compounding components. We are going to walk each one with numbers, because the multi-agent debate is one of the few places in this field where a hand-wave "but specialization!" still wins arguments that the math would lose.
Token overhead. Every agent pays for its own system prompt, its own tool definitions, and its own conversation tail on every turn. N peer agents with even modest configurations baseline at roughly N times the per-turn input cost before any useful work happens. Two agents with 30K-token system prompts each cost 60K tokens of static overhead per turn where one agent would cost 30K. The cache helps each agent's own prefix; it does not collapse the duplication across agents.
If the agents share substantial prompt material — coding conventions, project context, security policies — that shared material is paid for in every agent. There is no provider-level "shared cache across agents" in 2026. You can engineer your way out of this with retrieval (each agent fetches the shared context only when needed) but the engineering cost is real and the savings are bounded by how rarely the context is actually needed.
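Error compounding. A pipeline of N stages, each at 95% correctness, has end-to-end correctness 0.95^N, assuming stage failures are independent. Specialization that pushes per-stage quality from 90% to 95% is largely undone by adding two more stages: two stages at 90% give 81%, and four stages at 95% give about 81% as well.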
| Stage correctness | 2 stages | 3 stages | 5 stages | 8 stages |
|---|---|---|---|---|
| 99% | 98% | 97% | 95% | 92% |
| 95% | 90% | 86% | 77% | 66% |
| 90% | 81% | 73% | 59% | 43% |
| 80% | 64% | 51% | 33% | 17% |
Read the 95% row carefully. A five-stage multi-agent pipeline of "pretty good" agents is a 77% pipeline overall — and each agent needs to be near-perfect for the system to be merely good. Pipelines do not average; they multiply.
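The table can be reproduced in a few lines, which is useful when you want the number for your own stage count. The sketch assumes stage failures are independent and that any single stage failure fails the whole pipeline; correlated failures or downstream repair would change the picture.

```python
def pipeline_correctness(stage_correctness: float, stages: int) -> float:
    """End-to-end correctness of a serial pipeline with independent stages."""
    return stage_correctness ** stages


STAGE_COUNTS = (2, 3, 5, 8)

for p in (0.99, 0.95, 0.90, 0.80):
    row = [round(100 * pipeline_correctness(p, n)) for n in STAGE_COUNTS]
    print(f"{p:.0%} per stage:", ", ".join(f"{n} stages -> {v}%" for n, v in zip(STAGE_COUNTS, row)))
```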
Coordination overhead. Handoff messages cost tokens. Summary fidelity loss costs quality — the receiver only sees what the sender thought to forward. The orchestrator's merge logic pays its own cost in tokens and code complexity. None of these line items appear in the demo.
Worse, summarization across handoffs is lossy in ways that pipelines magnify. If the researcher tells the writer "the user wants a brief on retrieval", the writer never sees the original three-paragraph user request. If the original request had a constraint the researcher dropped, the writer cannot rediscover it. The fix is to forward more — but forwarding more means paying token overhead for the rest of the pipeline, recreating the problem you tried to solve by splitting in the first place.
A worked-arithmetic example. Imagine a four-stage research pipeline: planner → researcher → drafter → reviewer. Each stage at 92% individual accuracy. End-to-end accuracy is 0.92^4 ≈ 71.6%. Each stage uses 25K tokens of system prompt and tools, runs four turns on average, and pays its own cache writes because it is its own peer.
Now compare the same problem solved by one agent with all four roles compressed into a single system prompt. Even with the prompt at 35K tokens because of the role-mixing, it pays one set of cache writes, one merge, and lives or dies on one model's accuracy. If the single agent achieves 90% accuracy on the same task, it beats the 71.6% pipeline at a fraction of the cost. The pipeline only catches up when each stage clears roughly 97.4% accuracy (the fourth root of 0.90) and the orchestration cost is negligible. That is not most pipelines.
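The per-stage bar the pipeline has to clear follows directly from the multiplication rule. A minimal sketch, assuming independent stages of equal accuracy (the function names are illustrative):

```python
def pipeline_accuracy(stage_accuracy: float, stages: int) -> float:
    """End-to-end accuracy of a serial pipeline of equally accurate, independent stages."""
    return stage_accuracy ** stages


def min_stage_accuracy(single_agent_accuracy: float, stages: int) -> float:
    """Per-stage accuracy a pipeline needs just to match one agent end to end."""
    return single_agent_accuracy ** (1 / stages)


four_stage = pipeline_accuracy(0.92, 4)         # the planner-to-reviewer pipeline above
bar_to_match = min_stage_accuracy(0.90, 4)      # what each stage needs to tie a 90% agent
```

Every extra stage raises the bar: matching the same 90% agent with eight stages needs each stage to be right almost 99% of the time.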
Default position: solve the problem with one good agent plus subagents (cite L30) before reaching for peer-agent architectures. Subagents share the parent's deployment, the parent's trust boundary, and a single owner. Most "multi-agent" systems people build are actually subagents in disguise wearing a more impressive name. Be honest about which you have.
B2 — Subagents vs peer agents — the architectural distinction
The single most useful question to ask before any multi-agent design: one harness or many processes? This is the architectural distinction that decides almost everything else.
Subagents (L30) are in-process child loops. The parent owns their lifecycle. They share the parent's deployment artifact, the parent's trust boundary, the parent's runtime environment. Communication is parent ↔ child only — siblings do not talk. The parent merges results.
Peer agents are separate processes, possibly separate services, possibly with independent owners. Each has its own trust boundary, its own scaling story, its own failure mode. Communication is over the network or via a shared substrate (queue, blackboard, file store). Lifecycles are independent — one peer can be deployed, scaled, or restarted without the others.
| Force pushing toward peer architecture | Stay with subagents if absent |
|---|---|
| Independent ownership (different teams own the agents) | One team owns all of it |
| Independent scaling (one agent gets 100× the traffic) | Volumes are similar |
| Hard security boundary (untrusted code on one side, secrets on the other — see L53) | Single trust domain |
| Third-party integration (someone else's agent, vendor-supplied) | All in-house |
| Independent release cadence | One release train |
Rule of thumb: stay with subagents until at least one of those forces is real. The instinct "we'll start with peer agents because that's the future" is how teams pay coordination cost for a year before discovering they only ever needed orchestrator-worker inside one harness. Most "multi-agent" architectures in the wild are subagents in disguise — the test is the question above.
B3 — When multi-agent (peer) actually helps
Three precise cases where peer-agent architecture is not over-engineering. Outside these, prefer subagents.
(a) Embarrassingly parallel independent subtasks. Evaluate 1000 essays. Tag 50K tickets. Score 200K resumes. Each unit of work is independent of the others, the cross-talk is zero or near-zero. Peer agents can fan out across machines and finish faster.
But: this is also the canonical case for the Batch API from §A4. Compare the two — usually batch wins on cost, peer agents win only when you genuinely need real-time results across the fan-out. Pick the cheaper option. The honest framing: "embarrassingly parallel" is more often an argument for the Batch API than for peer agents, because the Batch API takes the parallelism problem off your hands and gives you 50% off as a bonus.
(b) Explicit specialization with stable interface contracts. Long-running domain agents whose system prompts and tool definitions differ enough that combining them would explode any single agent's context. Think: a finance reasoning agent, a legal drafting agent, a code-generation agent — each with thousands of tokens of domain instructions and entirely different tool surfaces. Combining all three into one agent makes every turn pay for instructions it does not need.
Even here: prefer specialized subagents within one harness if you can. Push to peer architecture only when independent ownership or scaling forces it. The "stable interface contract" part of this case is load-bearing — if the contract between the specialists changes weekly, peer architecture is buying you nothing because the interface is the most expensive part to maintain. Specialization without stability is just fragmentation.
(c) Hard security boundaries. One agent reads sensitive data, another writes externally. Physical separation removes a leg of the lethal trifecta (see L53): the data-reading agent has no network egress, and the externally-writing agent has no access to private data. The peer architecture is the security control. Co-locating them in one process undoes the boundary, no matter how careful the in-process sandboxing claims to be, because a single prompt injection that crosses both contexts inside one address space defeats the design.
This case is the strongest justification for peer agents because the architecture itself is enforcing a property that no amount of prompting can guarantee. If you find yourself writing system prompts like "do not exfiltrate the data you just read" — that is the moment to ask whether the architecture should make exfiltration impossible rather than asking the model not to do it.
L25 is the precursor done right. BMAD has named role agents — Analyst, PM, Architect, Dev, QA — but they run within one harness, communicating through documents on disk, with each role activated in a fresh chat to prevent context bleed. That is role specialization with document handoffs, not a peer-agent system. It is what you should reach for before peer agents in almost every case.
The BMAD example is worth dwelling on because it shows the green-light test in action. BMAD has specialization (different roles), it has document-driven handoffs (artifact contracts), and it has fresh-context isolation per role. What it does not have is independent ownership, independent scaling, or hard security boundaries — so it stays inside one harness. The architecture matches the actual forces. That is what good design looks like.
| Case | Use peer agents? | Better default |
|---|---|---|
| Tag 100K records, must finish in 3 minutes | Maybe | Concurrent calls from one harness first; the Batch API is asynchronous and cannot promise this deadline |
| Tag 100K records overnight | No | Batch API |
| Three domain experts, same team, same product | Probably no | Subagents (BMAD-style) |
| Three domain experts, three teams, different cadences | Yes | Peer agents |
| One agent reads PII, another posts to Slack | Yes | Peer agents enforce the boundary |
| "We want microservices for our agent" | No | This is resume-driven development |
The inversion test is useful: if you removed the peer architecture and ran everything in one harness, what exactly would break? If the answer is "nothing, just slower" — you do not need peer agents, you need parallelism. If the answer is "two teams cannot ship independently" — you need peer agents. If the answer is "the security review would not pass" — you need peer agents. If the answer is "I just want it to feel like a real distributed system" — that is not an answer.
B4 — Communication patterns
Once you have committed to peer agents, four patterns cover almost every real system. Pick one — mixing them in the same system is the fastest way to a debugging nightmare.
Orchestrator–worker. A central agent spawns and coordinates workers. Workers do not talk to each other. The orchestrator owns the merge. This is the default for almost every real multi-agent system, and it maps cleanly onto subagents in the single-process case. The orchestrator is a single point of failure and a single point of cost — when it goes down or hangs, every worker is stranded — but those properties are also features when debugging, because there is exactly one place to look.
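A minimal sketch of the shape, assuming the workers are plain functions run on a thread pool. The `worker` function is a stand-in with no model call, and both names are hypothetical. In a true peer deployment each `worker` call would be a network request to another service, but the control flow stays the same.

```python
from concurrent.futures import ThreadPoolExecutor


def worker(task: str) -> dict:
    """Stand-in for one bounded worker job; a real worker would call a model."""
    return {"task": task, "result": task.upper()}


def orchestrate(tasks: list[str], max_workers: int = 4) -> dict[str, str]:
    """Fan out, collect, and merge. Workers never see each other's output."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(worker, tasks))  # map preserves input order
    # The orchestrator owns the merge: a single place to inspect when debugging.
    return {r["task"]: r["result"] for r in results}


print(orchestrate(["summarize logs", "tag ticket", "draft reply"]))
```

An exception in any worker surfaces in the orchestrator when the results are collected, which is the "exactly one place to look" property in practice.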
Blackboard. A shared writable state (a key-value store, a document, a database) that all agents read from and write to asynchronously. Useful when you have opportunistic specialists who jump in when they see relevant data. Risk: race conditions, stale reads, and cross-agent prompt injection (see L53). Anything one peer writes becomes input to every other peer that reads. A compromised or hallucinating peer poisons the board for everyone.
The blackboard pattern was a darling of 1980s expert systems and has been rediscovered by every generation since. The reason it never quite became dominant is the same reason it struggles in the LLM context: the asynchrony that makes it flexible also makes it hard to reason about, and the shared writability that makes it powerful also makes it a single attack surface. Use sparingly, with strict schemas on what each peer is allowed to write, and treat reads from the board as untrusted input (L53) regardless of who wrote them.
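One way to picture "strict schemas on what each peer is allowed to write" is a board that checks a per-peer write policy and records provenance on every entry. This is an in-memory illustration only: the `Blackboard` class, its policy format, and the 2000-character limit are all assumptions, and a real board would sit on a database or KV store.

```python
from dataclasses import dataclass, field


@dataclass
class Blackboard:
    # Hypothetical policy: peer name -> set of keys that peer may write.
    write_policy: dict[str, set[str]]
    _entries: dict[str, tuple[str, str]] = field(default_factory=dict, init=False)

    def write(self, peer: str, key: str, value: str) -> None:
        """Reject writes outside the peer's allowed keys or over the size limit."""
        if key not in self.write_policy.get(peer, set()):
            raise PermissionError(f"{peer} may not write {key!r}")
        if not isinstance(value, str) or len(value) > 2000:
            raise ValueError("value must be a string of at most 2000 characters")
        self._entries[key] = (peer, value)

    def read_untrusted(self, key: str) -> str:
        """Return the value labeled with its author, for use as data only."""
        peer, value = self._entries[key]
        return f"<untrusted source={peer!r}>\n{value}\n</untrusted>"


board = Blackboard(write_policy={"triage": {"severity"}, "router": {"owner"}})
board.write("triage", "severity", "high")
print(board.read_untrusted("severity"))
```

The wrapper in `read_untrusted` only labels the content and is not a defense by itself. The reading agent still has to treat the payload as data rather than instructions, as in L53.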
Message bus / pub-sub. Decoupled topics; agents subscribe to topics they care about. Right for event-driven workflows where producers do not know consumers — a classic enterprise integration pattern carried into the agent world. Real operational overhead though: queue infrastructure, schema management, retry logic, dead-letter queues. Do not adopt this unless you are already running the rest of the operational machinery. The agent layer is a consumer of operational maturity here, not a creator of it. If you are bootstrapping a queue for the first time because of an agent project, the queue is going to break before the agents do.
Pipeline / chain of responsibility. Sequential stages, each transforms its input and passes to the next. This is what most people mean when they say "multi-agent." It is really just function composition with prompts — a useful design pattern but rarely the most cost-effective: see the error-compounding table in §B1. The redeeming feature of pipelines is that they are easy to debug: each stage's input and output are small, well-defined, and inspectable. The damning feature is that adding stages multiplies error and cost while specialization gains diminish quickly past the third or fourth stage.
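The arithmetic behind "adding stages multiplies error" is short enough to run. This sketch assumes every stage succeeds with the same probability and that failures are independent, which is a simplification; the 0.95 figure is an illustrative assumption, not a measurement.

```python
def pipeline_success(per_stage_success: float, stages: int) -> float:
    """End-to-end success rate when every stage must succeed independently."""
    return per_stage_success ** stages


for n in (1, 3, 5, 8):
    print(f"{n} stages at 95% each -> {pipeline_success(0.95, n):.3f} end-to-end")
```

Under these assumptions, even stages that look reliable individually leave an eight-stage pipeline failing roughly one run in three.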
| Pattern | Use when | Avoid when |
|---|---|---|
| Orchestrator-worker | One coordinator can plan, workers do bounded jobs | Workers need to react to each other's results live |
| Blackboard | Opportunistic specialists, soft real-time | Strong consistency needed; security-sensitive content |
| Message bus | Decoupled producers/consumers, persistent events | Tight latency loops, small team without ops capacity |
| Pipeline | Stages are deterministic and well-bounded | High per-stage variance; failures cascade |
B5 — Shared state mechanics
Peer agents need a shared substrate. Pick the smallest one that works.
| Substrate | Pick this if |
|---|---|
| Parent-side memory only (subagents return summaries) | You are still in L30 territory; default |
| Shared filesystem (workspace dir, file handoffs) | Document-driven, BMAD-style (L25); peers on the same machine |
| Database / KV store (Redis, Postgres) | State must outlive any single run; peers across machines |
| Message queue | Async is intrinsic to the workflow; you already have queue ops |
Two failure modes everyone hits. Stale reads — agent A acts on a snapshot of B's state taken before B's most recent update; A then writes back, clobbering B's update. Write-write races — two agents modify the same record simultaneously and the last-writer-wins overwrites real work. These are not novel agent problems; they are distributed-systems problems carried into the agent world. Solutions are the boring ones: optimistic concurrency tokens, write-once append logs, transactional outbox patterns. There are no agent-specific shortcuts.
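As an illustration of the first of those boring solutions, here is a sketch of an optimistic concurrency token. The in-memory `VersionedStore` stands in for a real substrate (in Postgres this would typically be a conditional `UPDATE ... WHERE version = ...`); the class and exception names are hypothetical.

```python
import threading


class VersionConflict(Exception):
    """Raised when a writer's snapshot is older than the stored record."""


class VersionedStore:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._data: dict[str, tuple[int, object]] = {}  # key -> (version, value)

    def read(self, key: str) -> tuple[int, object]:
        """Return (version, value); a missing key reads as version 0."""
        with self._lock:
            return self._data.get(key, (0, None))

    def write(self, key: str, value: object, expected_version: int) -> int:
        """Write only if nobody else has written since the caller's read."""
        with self._lock:
            current_version, _ = self._data.get(key, (0, None))
            if current_version != expected_version:
                raise VersionConflict(
                    f"{key}: expected v{expected_version}, found v{current_version}"
                )
            self._data[key] = (current_version + 1, value)
            return current_version + 1


store = VersionedStore()
version_a, _ = store.read("ticket-1")  # agent A takes a snapshot
version_b, _ = store.read("ticket-1")  # agent B takes the same snapshot
store.write("ticket-1", "tagged by B", version_b)  # B writes first
try:
    store.write("ticket-1", "tagged by A", version_a)  # A's snapshot is now stale
except VersionConflict as err:
    print("A must re-read and retry:", err)
```

The stale writer gets an explicit error instead of silently clobbering the newer update, which turns a data-loss bug into a retry.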
A specific anti-pattern to avoid: shared in-memory state across peer agents (e.g. a global Python dict accessed by multiple agent loops in the same process pretending to be peers). Either you have peers and they should communicate over a real substrate, or you have subagents and the parent owns the state. The hybrid — "peers" sharing in-memory state — gives you the worst of both: race conditions without the architectural benefit of separation. If your peer agents share memory, they are not peers. They are subagents with extra steps.
A second anti-pattern: handoffs that pass the entire conversation history of the sender as context to the receiver. This is the lazy answer to "how do I make sure the next agent has enough context?" — and it nullifies the whole point of separating into peer agents in the first place. Each peer's context budget gets eaten by the sender's history, the system pays N× for the same tokens, and the cache is useless because every handoff produces a slightly different prefix. The discipline is: forward the artifact (the file, the decision, the structured summary), not the conversation. If the receiver needs the conversation, you do not have peer agents — you have one agent that has been needlessly split in half.
```python
from pathlib import Path

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Model ID as written in the lecture; substitute one your account can access.
MODEL = "claude-sonnet-4-7"

WORKSPACE = Path("/tmp/agents-workspace")
WORKSPACE.mkdir(parents=True, exist_ok=True)


def researcher(topic: str) -> Path:
    """Research a topic and write the result to the shared workspace."""
    rsp = client.messages.create(
        model=MODEL,
        max_tokens=2048,
        messages=[{"role": "user", "content": f"Research: {topic}. Return JSON with summary and citations."}],
    )
    out = WORKSPACE / "research.json"
    out.write_text(rsp.content[0].text)
    return out


def writer(research_path: Path) -> Path:
    """Read the research artifact (not the researcher's conversation) and write a brief."""
    research = research_path.read_text()
    rsp = client.messages.create(
        model=MODEL,
        max_tokens=2048,
        messages=[{"role": "user", "content": f"Write a brief based on this research:\n{research}"}],
    )
    out = WORKSPACE / "brief.md"
    out.write_text(rsp.content[0].text)
    return out


# Subagents A and B handing off through a shared workspace.
# Both lifecycles are owned by the parent; the workspace is just the API.
research_path = researcher("retrieval-augmented evaluation")
brief_path = writer(research_path)
```
This is the BMAD pattern from L25 reduced to its mechanical core: agents communicate through artifacts on disk, not through shared conversation context. The contract between agents is the file format. If you swap the inner agents for different models or different prompts, the contract still holds. That is the property to design for.
B6 — A2A — Google's Agent2Agent protocol
A2A is Google's emerging open protocol for agent-to-agent communication across organizational and vendor boundaries. The idea: a capability card (or "agent card") describing what an agent can do, a task lifecycle for delegating work, and a structured message format that any compliant agent can speak. Reference site: a2a-protocol.org (verify state at the time you build — this is a moving target).
The framing matters. A2A is emerging; it is not stable; ecosystem maturity is far behind MCP at the same age. This is not a knock — it is just where it is. Watching adoption is the right activity; building production critical paths on it for the next twelve months is not.
Why MCP matured faster: MCP solves a more concrete pain — an agent needs tools, the same agent needs different tools across clients, and each client used to require a custom integration. Every developer has felt that pain. A2A solves a more speculative pain — cross-vendor, cross-organization agent collaboration that very few people actually need yet. Scarce real demand means slower iteration, fewer reference implementations, and a smaller pool of debugging eyes.
| Property | MCP (L27) | A2A |
|---|---|---|
| What it connects | Agent ↔ tools/resources | Agent ↔ agent |
| Maturity | Mature, ratified, broad client adoption | Emerging, mostly Google-ecosystem early adopters |
| Pain it solves | "Every client needs custom tool integration" | "Agents from different vendors should collaborate" |
| Demand intensity | Universal among agent builders | Niche among agent builders |
| Production readiness | Yes, with caveats from L27 | Watch, don't build (12 months) |
| Auth / security | OAuth maturing, identity story improving | Still being defined |
A reasonable Master's-level stance: read the spec, build prototypes if it interests you, do not stake production critical paths on it for at least twelve months, and watch adoption. If A2A becomes the dominant cross-agent protocol, you will know, because there will be reference implementations from multiple vendors and concrete migration tooling. Until then, every problem A2A claims to solve has a more boring HTTP-API answer that already works.
Concrete signals to track for "A2A is ready":
- Production deployments by at least three vendors not part of the founding consortium.
- A stable, versioned auth/identity story — at minimum OAuth flows that match the maturity bar of L27's current OAuth support.
- Reference SDKs in two or more languages outside Python (typically Go, TypeScript, JVM).
- A debugging story — protocol-level inspection tools comparable to MCP's existing tooling.
Until those land together, A2A is a research bet, not an engineering one. The course will revisit it when the picture changes; the syllabus will be updated rather than the lecture rewritten in panic.
B7 — Consensus is a research topic, not an engineering pattern
Multi-agent systems sometimes claim to use consensus — multiple agents debating, voting, or judging each other to reach a more reliable answer than any single agent. The research literature here is real and worth reading. The production engineering claim — that this composes into reliable systems today — is not.
Three common research patterns:
- Debate (Du et al., 2023): several model instances each propose an answer, read one another's responses, and revise over multiple rounds until they settle on a final answer; other debate variants add a separate judge. Improves some reasoning benchmarks; expensive; gains shrink on real workloads.
- Self-consistency / majority voting (Wang et al., 2022): sample N reasoning traces, pick the most common final answer. Cheap to implement, well-studied, helps on bounded benchmarks.
- Judge models: one model evaluates another's output. Foundational to a lot of evaluation work (see L52). Notorious failure mode: judges share training-data biases with the judged.
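Of the three, self-consistency is the simplest to show in code. The sketch below takes any zero-argument sampler and returns the most common answer plus its vote share. The sampler here replays canned answers so the example runs without a model, and `majority_vote` is a hypothetical helper name.

```python
from collections import Counter
from typing import Callable


def majority_vote(sample_answer: Callable[[], str], n: int) -> tuple[str, float]:
    """Sample n final answers and return (most common answer, its vote share)."""
    answers = [sample_answer() for _ in range(n)]
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / n


# Stand-in sampler: replays canned final answers instead of calling a model.
canned = iter(["42", "41", "42", "42", "17"])
answer, share = majority_vote(lambda: next(canned), n=5)
print(answer, share)
```

The vote share is worth logging alongside the answer, although a high share shows only that the samples agree, not that they are right.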
Why none of this composes cleanly into production:
- Latency multiplies with the number of sequential calls. A 5-voter consensus run one voter at a time is roughly 5× the user-visible latency of a single call, and each debate round adds another full round trip. Parallelizing the voters helps wall-clock time but not the wallet.
- Cost multiplies the same way. A debate with three models for three rounds is a 9× per-call increase before merge cost.
- No provable convergence guarantees. The literature reports averages; production needs guarantees on the worst case.
- Judges share the biases of the judged. If the same family of models is doing both the answering and the judging, errors correlate. The vote is not independent. Cross-family judging (Claude judging GPT, or vice versa) helps somewhat but introduces its own correlations from the shared training corpus.
- The strongest opinion wins, not the correct one. Voting between language models tends to converge on confident-sounding outputs, not accurate ones. A small, well-calibrated model can be outvoted by larger overconfident ones even when it is right more often.
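The first two bullets reduce to a few lines of arithmetic. This sketch assumes every call costs and takes about as much as a single baseline call, which understates debate in practice because later rounds carry longer context. `consensus_overhead` is a hypothetical helper name.

```python
def consensus_overhead(voters: int, rounds: int, parallel_voters: bool) -> dict[str, int]:
    """Rough multipliers relative to one model call, before any merge step."""
    calls = voters * rounds
    # Rounds are inherently sequential; voters within a round may run in parallel.
    sequential_steps = rounds if parallel_voters else calls
    return {"cost_multiplier": calls, "latency_multiplier": sequential_steps}


print(consensus_overhead(voters=5, rounds=1, parallel_voters=False))
print(consensus_overhead(voters=5, rounds=1, parallel_voters=True))
print(consensus_overhead(voters=3, rounds=3, parallel_voters=True))
```

Parallelism moves the latency multiplier back toward the number of rounds, while the cost multiplier does not move at all.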
Plain statement: there is no production-ready multi-agent consensus algorithm in 2026. If you need agreement on an answer in production, use deterministic checkers — tests, schemas, validators, type-checkers, executable verifiers. These compose. They have known costs. They fail in known ways. Voting between language models does not have those properties.
A useful contrast: a code-generation agent paired with a compiler is a consensus-by-construction pattern. The compiler is a deterministic check; the agent regenerates until it passes. There are guarantees, at least for the property the checker tests: code that passes has compiled. A code-generation agent paired with another LLM that says "looks good to me" is theatre. Which one is doing the reliability work in any system you encounter is the question that separates production architecture from demo architecture.
The same logic applies in less obvious places. An agent generating SQL paired with a dry-run EXPLAIN step is consensus-by-construction. An agent generating SQL paired with another LLM saying "this query looks correct" is theatre. An agent generating data-extraction output paired with a JSON-schema validator is consensus-by-construction. An agent generating data-extraction output paired with another LLM saying "the extraction looks complete" is theatre. The pattern generalizes: pair every generation step with a deterministic verifier, not with another generator.
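The generate-then-verify loop is small enough to sketch. Here the deterministic checker is Python's own parser via `ast.parse`, and the generator is a stub that replays canned candidates instead of calling a model. The names `generate_until_valid` and `is_valid_python` are hypothetical.

```python
import ast
from typing import Callable, Optional


def is_valid_python(source: str) -> bool:
    """Deterministic check: does the candidate parse as Python?"""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


def generate_until_valid(
    generate: Callable[[int], str],
    check: Callable[[str], bool],
    max_attempts: int = 3,
) -> Optional[str]:
    """Regenerate until the checker passes, or give up after max_attempts."""
    for attempt in range(max_attempts):
        candidate = generate(attempt)
        if check(candidate):
            return candidate
    return None  # the caller must treat this as a failure, not as an answer


# Stub generator: the first candidate is broken, the second one parses.
candidates = ["def f(:\n    pass\n", "def f():\n    return 1\n"]
result = generate_until_valid(
    lambda attempt: candidates[min(attempt, len(candidates) - 1)],
    is_valid_python,
)
print(result)
```

The guarantee is exactly as strong as the checker: this one proves the output parses and nothing more. A test suite, schema validator, or dry-run `EXPLAIN` fits the same `check` slot.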
Read the literature (Du et al., Wang et al., the long tail of multi-agent-debate papers) — but mark them as research, not deployable patterns. If a vendor pitches you a consensus product, ask exactly which deterministic checker is doing the actual reliability work underneath. There usually is one; the LLM ensemble around it is theatre.
B8 — The decision rules to walk out with
The summary table to screenshot. Everything in this lecture compresses to these rules. They are deliberately phrased as priors — defaults you only abandon with evidence — not as commandments. Engineering rarely benefits from rigid rules. It always benefits from explicit ones.
| Topic | Rule |
|---|---|
| Cost knobs (ROI order) | cache > batch > route > compress > fine-tune |
| Workload shape | Bursty → cache+route. Batchable → Batch API. Long-horizon → cache+compress+route subagents to cheap tier |
| Routing | Cheap-first cascade with a real signal; track P95 cost, not mean |
| Caching | Cache the static prefix only; subagents do not share the parent's cache |
| Batching | Half-price for offline; never user-facing |
| Compression | Test compactions like refactors; lossy compaction silently regresses agents |
| Fine-tune | Last resort; only with stable narrow task and thousands of examples |
| Subagents vs peer | Stay with subagents until ownership / scaling / security / third-party forces peer |
| Multi-agent green light | Peer architecture justified only if ≥2 of {parallelizable, independent, specialization-justified, security-boundary} |
| Communication pattern | Default to orchestrator-worker; everything else needs a hard reason |
| Shared state | Use the smallest substrate that works; expect stale reads and write-write races |
| A2A stance | Watch, don't build (12 months) |
| Consensus stance | Don't. Use deterministic checkers instead |
Two final framings to leave with.
First, the cost ladder is counter-intuitive on its first reading. Students who have read papers about fine-tuning reach for fine-tuning first. Students who have heard the multi-agent-debate hype reach for consensus first. Both skip past the boring engineering — caching, batching, routing, compression — that delivers most of the savings most of the time. The ladder is empirical. Treat it as the prior you only abandon with evidence.
Second, the multi-agent-skepticism stance is not anti-multi-agent. It is pro-discriminating-taste. Multi-agent peer architectures earn their cost when at least two of the green-light criteria from §B3 hold; without them, you are paying overhead for the privilege of having "multi-agent" on a slide. The right move when in doubt is to build with one agent plus subagents (L30) on a L25-style document-driven handoff, watch the cost, and only split into peer agents when a specific force from the §B2 table makes you. That sequence is how shipped systems actually look. The slide-ware sequence — peer agents from day one because the future demands it — is how teams burn six months on coordination plumbing for a problem that one Opus session would have solved.
The closing operational discipline. Track three numbers per workload, every week:
- Cost per task at P50 and P95 (mean is misleading — see §A2).
- End-to-end success rate, with a deterministic checker as the judge whenever possible.
- Tokens spent per "useful" output unit (per ticket tagged, per essay scored, per PR reviewed).
These three numbers tell you whether your optimizations are actually working. The number of agents in your architecture diagram does not. Track them with the same discipline you would track latency and error rate on a regular service — same dashboards, same alerts, same review cadence. Cost regressions are functional regressions in this world; "the prompt got 30% more expensive" is the same severity of bug as "the API got 30% slower" used to be.
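A short sketch of the first number shows why the mean hides the tail. The per-task costs below are made-up sample data (90 cheap tasks and 10 expensive ones), and `cost_summary` is a hypothetical helper built on the standard library's `statistics.quantiles`.

```python
import statistics


def cost_summary(costs_usd: list[float]) -> dict[str, float]:
    """Mean, P50 and P95 of per-task cost; needs at least two data points."""
    cuts = statistics.quantiles(costs_usd, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(costs_usd),
        "p50": cuts[49],  # 50th percentile
        "p95": cuts[94],  # 95th percentile
    }


# Made-up week: 90 tasks at $0.10 and 10 runaway tasks at $4.00.
weekly_costs = [0.10] * 90 + [4.00] * 10
print(cost_summary(weekly_costs))
```

In this sample the mean sits between the two clusters and describes neither, whereas P50 shows the typical task and P95 shows the runaway sessions that drive the bill.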
A final habit to develop: review costs per change, not just per quarter. When a prompt is edited, the cost of the new prompt is part of the diff. When a tool definition is added, the cost of carrying that definition on every turn is part of the diff. PR reviewers should ask "what is this doing to cost per turn?" the same way they currently ask about test coverage. The teams that catch their cost drift early do so because cost is a first-class review concern, not a quarterly post-mortem topic.
Walk out with the cost ladder, the green-light criteria, and the discipline to ask the architectural question first. The rest is engineering.