
51 - Errors, Determinism, Observability

An agent loop is a distributed system where the model is one unreliable node and the tools are the rest. Every turn, you make a network call to a stochastic process, parse its output, dispatch to an external service that can time out, fail, or rate-limit you, and feed the result back into a context that grows with every iteration. Anyone who has shipped a microservice will recognize the failure surface — except this microservice rolls dice for a living.

Reliability is the discipline of being able to answer three questions after a run, ideally without re-running it. What happened — observability. Was it the same as last time — determinism, or the closest you can get. When it broke, where exactly — error handling. They are not independent: you cannot diagnose what broke if you did not log it, you cannot replay a session you did not pin a version on, and you cannot retry a step you did not isolate. Treat them as one problem with three views.

1. Why reliability is the hardest part of agents

L30 defined the agent loop: observe, think, act, observe the result, repeat. In a clean run, the loop terminates when the model ends its turn without requesting another tool call. In production, the loop terminates because something failed and your harness gave up, or worse, it didn't, and ran forever.

The math is unforgiving. Suppose every step (model call, tool dispatch, tool execution, parse) succeeds 98% of the time, independently of the others. A 10-step trajectory then has 0.98^10 ≈ 0.82 end-to-end success. Push to 30 steps and you are at 0.55. L26 makes the same point about tool-call chains: per-step accuracy of 80% collapses to 33% on a 5-step chain. End-to-end success decays exponentially with trajectory length, and most non-trivial agent runs are 20+ steps.
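The compounding arithmetic above is easy to check directly. The following is a minimal sketch that assumes steps fail independently, which real agent runs only approximate; the function names are illustrative and not part of any library.

```python
def end_to_end_success(per_step: float, steps: int) -> float:
    """Probability that all `steps` independent steps succeed."""
    return per_step ** steps


def required_per_step(target: float, steps: int) -> float:
    """Per-step success rate needed to hit `target` end-to-end over `steps` steps."""
    return target ** (1.0 / steps)


for per_step, steps in [(0.98, 10), (0.98, 30), (0.80, 5)]:
    print(per_step, steps, round(end_to_end_success(per_step, steps), 2))
# 0.98 10 0.82
# 0.98 30 0.55
# 0.8 5 0.33
```

Running the inverse, `required_per_step(0.9, 20)`, shows how close to perfect each step has to be before a 20-step run is dependable.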

The three lenses for the rest of the lecture: error handling (sections 2–4), determinism (sections 5–7), observability (sections 8–11). Each section's payoff motivates the next.

2. The five failure modes of an agent loop

Every agent failure mode reduces to one of five categories. If you cannot place a failure on this taxonomy, you have not finished diagnosing it.

(a) Model errors. The provider returns an HTTP error, refuses the request, truncates the output mid-token, or emits malformed JSON inside a tool call. Refusals (stop_reason: "refusal" on Anthropic, finish_reason: "content_filter" on OpenAI) are policy decisions and not retriable in the usual sense. Truncation (stop_reason: "max_tokens") means your max_tokens was too low for the answer the model wanted to produce. Malformed JSON appears even from frontier models when strict mode is off.

(b) Tool errors. The tool you dispatched returned an error. Network 5xx, timeout, rate limit (429), authentication failure (401), validation error (400), or your own code threw an unhandled exception. These are normal distributed-systems failures and have well-known mitigations: retry with backoff, circuit-break, fall back.

(c) Hallucinated tool calls. The model invented a tool name that does not exist (get_weather_v2 instead of get_weather), passed parameters that do not match the schema (city: 123 when city is a string), or invoked a real tool with semantically wrong arguments. Strict mode (section 6) prevents the schema mismatches. The other two cases survive strict mode.

(d) Runaway loops. The model keeps calling tools without making progress. Two flavors: ping-pong (call A, call B, call A, call B forever) and stuck-on-error (the same tool returns the same error and the model keeps retrying it). Both burn tokens and money. The fix is a turn budget — max_turns in your harness — plus the circuit breaker from section 3.
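One way to catch both flavors is to fingerprint each tool call and stop when the same fingerprint keeps recurring inside a short window. This is a sketch, not a standard component: `LoopGuard`, `max_calls`, `window`, and `max_repeats` are hypothetical names, and the thresholds are arbitrary starting points.

```python
import json
from collections import Counter, deque
from typing import Optional


class LoopGuard:
    """Stops a run on a call budget or on repeated identical tool calls."""

    def __init__(self, max_calls: int = 20, window: int = 6, max_repeats: int = 3):
        self.max_calls = max_calls
        self.max_repeats = max_repeats
        self.recent = deque(maxlen=window)  # fingerprints of the last `window` calls
        self.calls = 0

    def record(self, tool_name: str, args: dict) -> None:
        self.calls += 1
        # sort_keys makes {"a": 1, "b": 2} and {"b": 2, "a": 1} fingerprint identically
        self.recent.append(json.dumps([tool_name, args], sort_keys=True))

    def should_stop(self) -> Optional[str]:
        if self.calls >= self.max_calls:
            return "call_budget_exhausted"
        if self.recent and max(Counter(self.recent).values()) >= self.max_repeats:
            return "repeated_identical_call"
        return None


guard = LoopGuard()
for name in ["A", "B", "A", "B", "A"]:  # ping-pong between two tools
    guard.record(name, {"q": "same"})
print(guard.should_stop())  # repeated_identical_call
```

Stuck-on-error is the degenerate case of the same check: the identical call repeats back to back, so the repeat counter trips even sooner.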

(e) Silent semantic failure. The worst class. Every tool call succeeded. Every parse worked. The model produced fluent output. The output is wrong. The user reads it, acts on it, and you only find out later. There is no exception to catch, no 4xx to log. This is the failure mode that pushes you from logging into evaluation — see L52.

The key thing to internalize: (a)–(d) raise something. (e) does not. A reliability story that only handles raised errors leaves the worst class entirely uncovered.

L26 covered the convention for feeding tool errors back to the model. The Anthropic SDK pattern, recapped here because it is the foundation for the next two sections:

from anthropic import Anthropic

client = Anthropic()

def run_tool(name: str, args: dict) -> tuple[str, bool]:
    """Returns (content, is_error). Never raises."""
    try:
        if name not in TOOL_REGISTRY:
            return f"Unknown tool: {name}. Available: {list(TOOL_REGISTRY)}", True
        result = TOOL_REGISTRY[name](**args)
        return result, False
    except TimeoutError:
        # The built-in TimeoutError has no .timeout attribute, so report generically.
        return f"Tool {name} timed out. Retry once or skip.", True
    except Exception as e:
        return f"Tool {name} failed: {type(e).__name__}: {e}", True

# In the agent loop, after the model returns a tool_use block:
content, is_error = run_tool(block.name, block.input)
messages.append({
    "role": "user",
    "content": [{
        "type": "tool_result",
        "tool_use_id": block.id,
        "content": content,
        "is_error": is_error,
    }],
})

The is_error: true field is how the model knows the call failed. Models usually treat a tool result as ground truth unless it is flagged as an error, so feeding errors back as if they were data is how you get the model confidently quoting a stack trace as the user's answer.

3. Retry, backoff, circuit breakers

Tool errors (failure mode b) and transient model errors (subset of a — 503s, connection resets) are the right population for retries. Hallucinated calls (c) and silent failures (e) are not — retrying does nothing if the bug is in the model's reasoning.

Exponential backoff with jitter. Naive retry hammers a struggling service. Exponential backoff (1s, 2s, 4s, 8s) is better but synchronizes retries from many clients. Add jitter, a random multiplier, to spread the load. The standard Python library for this is tenacity:

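from tenacity import (
    retry, stop_after_attempt, wait_exponential_jitter,
    retry_if_exception,
)
import httpx

def is_transient(exc: BaseException) -> bool:
    """Retry timeouts, 429s, and 5xx. Other 4xx errors will not fix themselves."""
    if isinstance(exc, httpx.TimeoutException):
        return True
    if isinstance(exc, httpx.HTTPStatusError):
        code = exc.response.status_code
        return code == 429 or code >= 500
    return False

@retry(
    stop=stop_after_attempt(4),
    wait=wait_exponential_jitter(initial=1, max=30),
    retry=retry_if_exception(is_transient),
    reraise=True,
)
def call_external_api(url: str, payload: dict) -> dict:
    response = httpx.post(url, json=payload, timeout=10)
    response.raise_for_status()
    return response.json()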

Hand-rolled, when you want to avoid the dependency or you need async control:

import asyncio
import random

async def retry_with_backoff(coro_factory, *, max_attempts=4, base=1.0, cap=30.0):
    for attempt in range(max_attempts):
        try:
            return await coro_factory()
        except (asyncio.TimeoutError, ConnectionError):
            if attempt == max_attempts - 1:
                raise
            delay = min(cap, base * (2 ** attempt))
            delay *= 0.5 + random.random()  # jitter multiplier in [0.5, 1.5)
            await asyncio.sleep(delay)

Idempotency keys. Retries are only safe when the operation is idempotent — repeating it produces the same effect as running it once. Reads are naturally idempotent; writes are not. The fix is a client-supplied idempotency key: send Idempotency-Key: <uuid> (Stripe convention) and the server deduplicates retries. Anthropic's Messages API supports Anthropic-Idempotency-Key for exactly this — if your harness retries a messages.create after a network blip, you do not want to be charged for two completions.
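The important detail is that the key is generated once per logical operation and reused on every retry. The sketch below simulates this with an in-memory stand-in for the server; `FakeTransferServer`, `transfer_with_retries`, and `lost_responses` are hypothetical names for illustration, not any real payment API.

```python
import uuid


class FakeTransferServer:
    """Stand-in for a server that deduplicates on a client-supplied key."""

    def __init__(self):
        self._results = {}           # idempotency key -> stored result
        self.transfers_executed = 0  # how many transfers actually happened

    def transfer(self, idempotency_key: str, amount_cents: int) -> dict:
        if idempotency_key in self._results:
            return self._results[idempotency_key]  # replay, no second transfer
        self.transfers_executed += 1
        result = {"transfer_id": self.transfers_executed, "amount_cents": amount_cents}
        self._results[idempotency_key] = result
        return result


def transfer_with_retries(server, amount_cents, lost_responses=0, max_attempts=4):
    key = str(uuid.uuid4())  # generated ONCE, outside the retry loop
    for attempt in range(max_attempts):
        result = server.transfer(key, amount_cents)
        if attempt < lost_responses:
            continue  # pretend the response never arrived, so the client retries
        return result
    raise TimeoutError("no response received within the retry budget")


server = FakeTransferServer()
transfer_with_retries(server, 500, lost_responses=2)
print(server.transfers_executed)  # 1
```

Generating the key inside the loop would defeat the purpose: every retry would look like a new operation to the server.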

Circuit breakers. If a service is consistently down, retrying every request adds load and delay without benefit. A circuit breaker tracks failure rates and short-circuits requests when the service is unhealthy. Three states:

In Closed, traffic flows and you count failures. After N failures in a window, trip to Open: every request fails fast with a CircuitOpenError and never reaches the service. After a cooldown (30–60s), move to Half-Open and let one probe through. If it succeeds, close. If it fails, reopen. Libraries like pybreaker and purgatory implement this.
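The state machine is small enough to sketch by hand. This version is a simplification under stated assumptions: it counts consecutive failures rather than failures in a sliding window, it is single-threaded, and the `clock` parameter exists only so the cooldown can be tested without sleeping. The class and parameter names are illustrative and do not mirror pybreaker or purgatory.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without dispatch."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, cooldown_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means Closed

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.cooldown_s:
            return "half_open"  # cooldown elapsed, allow a probe
        return "open"

    def call(self, fn, *args, **kwargs):
        state = self.state
        if state == "open":
            raise CircuitOpenError("failing fast; service marked unhealthy")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if state == "half_open" or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # trip, or re-open after a failed probe
            raise
        self.failures = 0      # any success closes the circuit
        self.opened_at = None
        return result
```

In an agent harness, one breaker per flaky tool is the natural granularity, with `run_tool` translating `CircuitOpenError` into an `is_error` tool result so the model can choose another path.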

Decision table for what to retry — retry, do not retry, or hand off to a human (HITL — human-in-the-loop):

| Error | Retry model call | Retry tool call | HITL |
|---|---|---|---|
| 429 rate limit (model) | Yes, with backoff | n/a | No |
| 429 rate limit (tool) | n/a | Yes, with backoff | No |
| 5xx (model API) | Yes, up to 3x | n/a | No |
| 5xx (tool API) | n/a | Yes, up to 3x | No |
| Timeout (model) | Yes, with longer timeout | n/a | No |
| Timeout (tool) | n/a | Yes, but check idempotency | No |
| 4xx malformed request (model) | No, fix the request | n/a | No |
| 4xx validation (tool) | n/a | No, surface to model | No |
| Refusal (stop_reason: refusal) | No | n/a | Yes |
| Hallucinated tool name | n/a | No, surface as error | No |
| Destructive tool (bash rm, fund transfer) | n/a | Never auto-retry | Yes |
| Authentication (401, 403) | No, fix credentials | No, fix credentials | Yes |

The destructive-tool row is the one students get wrong. If transfer_funds returns a timeout, you do not know whether the transfer happened — retrying might double-charge. Surface to a human, or rely on a server-side idempotency key the client cannot forge. Same principle for bash rm -rf /tmp/build: retry only if you can prove safety, otherwise stop and ask.
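The HTTP and timeout rows of the decision table can be encoded as a small pure function, which keeps the policy testable and out of the retry plumbing. This is a sketch: `Action`, `decide`, and the `destructive` and `idempotent` flags are hypothetical names, refusals and hallucinated tool names are assumed to be handled elsewhere, and sending a non-idempotent tool timeout back to the model instead of retrying is one conservative reading of "check idempotency".

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    RETRY = "retry_with_backoff"
    FIX_REQUEST = "fix_request"
    SURFACE_TO_MODEL = "surface_to_model"
    HUMAN = "hand_off_to_human"


def decide(source: str, status: Optional[int] = None, *, timed_out: bool = False,
           destructive: bool = False, idempotent: bool = False) -> Action:
    """source is 'model' or 'tool'; status is the HTTP status, if there was one."""
    if source not in ("model", "tool"):
        raise ValueError(f"unknown source: {source}")
    if destructive:
        return Action.HUMAN  # never auto-retry a destructive tool
    if status in (401, 403):
        return Action.HUMAN  # credentials need a person
    if timed_out:
        if source == "model" or idempotent:
            return Action.RETRY
        return Action.SURFACE_TO_MODEL  # outcome unknown, do not blindly repeat
    if status == 429 or (status is not None and status >= 500):
        return Action.RETRY
    # Remaining 4xx: the request itself is wrong.
    return Action.FIX_REQUEST if source == "model" else Action.SURFACE_TO_MODEL


print(decide("tool", timed_out=True, destructive=True))  # Action.HUMAN
```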

4. Graceful degradation in the loop

Retry handles transient failures. Degradation handles permanent or sustained ones. The agent should still do something useful, even if it cannot do the ideal thing.

Model fallback chain. If Opus is down or rate-limited, fall back to Sonnet. If Sonnet is down, fall back to Haiku. Same family, same SDK, smaller capability. The fallback is one config change in the model ID:

import anthropic

MODEL_CHAIN = [
    "claude-opus-4-7",
    "claude-sonnet-4-7",
    "claude-haiku-4-7",
]

def call_with_fallback(messages, tools):
    last_err = None
    for model in MODEL_CHAIN:
        try:
            return client.messages.create(
                model=model, messages=messages, tools=tools, max_tokens=4096,
            )
        except (anthropic.APIStatusError, anthropic.APIConnectionError) as e:
            last_err = e
            continue
    raise RuntimeError(f"All models in chain failed; last error: {last_err}")

The cost of falling through the chain is real — Haiku will not solve the problem Opus would. But it will probably solve a degraded version, and a degraded answer is usually better than a 503 page.

Skill-to-tool fallback. If a skill that wraps a script (see L40) crashes, the agent can fall back to the underlying tool calls. The skill is the optimized path; the tool fallback is the slow path.

Terminal degradation. Sometimes the right answer is "we cannot do this." A user-facing message ("the deployment service is unavailable; I logged the request and will retry when it recovers") is a successful run, not a failure. Do not silently swallow the error and pretend everything worked.

Tie this back to budgets. L40 introduced per-session caps — wall-clock seconds, dollars, turn count. Degradation should kick in before the cap is hit, not at the cap. If the user has $1.00 left and Opus would cost $1.20 in expectation, route to Sonnet without asking. The budget is part of the model-selection input, not just a kill switch.
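Budget-aware routing can be sketched as a lookup over the fallback chain. The Opus figure comes from the example above; the Sonnet and Haiku costs are made-up placeholders, and `pick_model` and `COST_CHAIN` are hypothetical names. A real harness would estimate expected cost from token counts and current pricing.

```python
from typing import List, Optional, Tuple

# (model_id, expected cost of this request in USD), in order of preference.
# Only the 1.20 figure is from the text; the other two are illustrative.
COST_CHAIN: List[Tuple[str, float]] = [
    ("claude-opus-4-7", 1.20),
    ("claude-sonnet-4-7", 0.25),
    ("claude-haiku-4-7", 0.05),
]


def pick_model(remaining_usd: float, chain: List[Tuple[str, float]] = COST_CHAIN) -> Optional[str]:
    """Return the most capable model whose expected cost fits the remaining budget."""
    for model_id, expected_cost in chain:
        if expected_cost <= remaining_usd:
            return model_id
    return None  # nothing fits: time for terminal degradation


print(pick_model(1.00))  # claude-sonnet-4-7
```

A `None` result is the signal to take the terminal-degradation path rather than silently overspending.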

5. Determinism is a lie — but useful

L02 introduced temperature: at temperature=0, the model picks the highest-probability token at every step. Greedy. Deterministic. Same input, same output.

That is true in theory. In practice, at temperature zero you still drift. Run the same prompt three times against the same model ID, on the same day, and you can get three different completions. Several real causes:

KV-cache and batched inference nondeterminism. The provider's inference server batches your request together with other users' requests. The exact composition of your batch (whose other prompts are alongside yours) influences kernel selection, padding, and reduction order. Tiny floating-point differences in how attention is computed under different batch shapes lead to different argmaxes at low-margin token positions. None of this is exposed to you.

Floating-point non-associativity. (a + b) + c ≠ a + (b + c) in floating point when the magnitudes differ. Different reduction orders in tensor cores produce sums that differ by around 1e-6, and when two candidate tokens are nearly tied, that rounding difference is enough to flip the argmax and select a different token.
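The effect is reproducible on any machine with IEEE 754 doubles. The sketch below uses deliberately exaggerated magnitudes so the flip is visible; on real hardware the differences are far smaller and only matter at near-ties. The `argmax` helper and the logit names are illustrative.

```python
# The same three numbers, added in two different orders.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False

# With very different magnitudes, the small term can vanish entirely.
order_1 = (1e16 + 1.0) - 1e16  # the 1.0 is absorbed by rounding
order_2 = (1e16 - 1e16) + 1.0  # the 1.0 survives
print(order_1, order_2)  # 0.0 1.0


def argmax(values):
    """Index of the largest value (first one wins on exact ties)."""
    return max(range(len(values)), key=values.__getitem__)


# Pretend token 0's logit is that reduction and token 1's logit is 0.5:
# greedy decoding picks a different token depending on summation order.
print(argmax([order_1, 0.5]), argmax([order_2, 0.5]))  # 1 0
```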

Undocumented model updates. When you call claude-sonnet-4-7 without a date suffix, the provider may quietly route you to a newer snapshot. Anthropic exposes pinned IDs like claude-sonnet-4-7-20260301 for exactly this reason; claude-sonnet-4-7 is an alias that can move.

The honest framing: do not aim for determinism. Aim for reproducibility-with-pinned-model-version-and-seed. Pin the model ID with the date suffix. Set temperature to 0 (or your chosen value). Use the seed parameter when the provider exposes one (OpenAI does; Anthropic does not as of early 2026). Log every input. Then you can re-run a session and get an outcome that is close to the original — close enough to debug, close enough to evaluate.

A live counter-example to keep in mind. The following is roughly what you will observe in practice:

# Run the same prompt 3 times at temperature=0
# Pinned model: claude-sonnet-4-7-20260301
prompt = "List five cities in Estonia, comma-separated, no other text."
outputs = []
for i in range(3):
    resp = client.messages.create(
        model="claude-sonnet-4-7-20260301",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        max_tokens=64,
    )
    outputs.append(resp.content[0].text.strip())

# Possible observed outputs:
# [0] "Tallinn, Tartu, Narva, Pärnu, Kohtla-Järve"
# [1] "Tallinn, Tartu, Narva, Pärnu, Kohtla-Järve"
# [2] "Tallinn, Tartu, Narva, Pärnu, Viljandi"

Two are identical; one differs in the last item. This is the everyday reality of LLM determinism. temp=0 ≠ deterministic. Anyone who tells you otherwise has not run the experiment, or only ran it twice and got lucky.

What this means for your harness: never assume two runs of the same input produce identical outputs. If your test asserts an exact string match on a model output, it will flake. If you must compare, compare the parsed structure (which is much more stable than the raw text) or use a tolerance — semantic similarity, regex match, or a graded metric. This is the bridge into L52.
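For the Estonia example, comparing parsed structure with a tolerance might look like the following sketch. Jaccard overlap and the 0.6 threshold are illustrative choices, and `parse_cities`, `jaccard`, and `outputs_match` are hypothetical helper names; a real test would pick a metric suited to the task.

```python
def parse_cities(text: str) -> list:
    """Split a comma-separated answer into normalized items."""
    return [item.strip().casefold() for item in text.split(",") if item.strip()]


def jaccard(a, b) -> float:
    """Set overlap: |A ∩ B| / |A ∪ B|, defined as 1.0 for two empty sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


def outputs_match(x: str, y: str, threshold: float = 0.6) -> bool:
    return jaccard(parse_cities(x), parse_cities(y)) >= threshold


run_0 = "Tallinn, Tartu, Narva, Pärnu, Kohtla-Järve"
run_2 = "Tallinn, Tartu, Narva, Pärnu, Viljandi"
print(run_0 == run_2)               # False: exact match flakes
print(outputs_match(run_0, run_2))  # True: 4 shared of 6 distinct cities
```

The parse step alone already absorbs whitespace and casing drift; the threshold handles genuine one-item differences.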

6. Structured outputs and forced schemas

The single most effective reliability lever is forcing the model into a structured output. Free-text answers are an unbounded failure surface. JSON conforming to a known schema is a bounded one.

L26 covered tool definitions and tool_choice. The reliability angle: tool_choice: {"type": "tool", "name": "X"} forces the model to call tool X on this turn. Combined with a strict input schema, the model's output on that turn is reduced to "fill in the blanks of this JSON Schema." That is far easier than free-form generation and far easier to validate.

JSON Schema with strict mode. Anthropic's strict: true and OpenAI's strict: true work the same way: the provider compiles your JSON Schema into a grammar and constrains decoding so that only schema-valid tokens can be emitted. Malformed JSON becomes structurally impossible. Missing required fields become structurally impossible.

Pydantic as the source of truth. Define your schema in Pydantic, generate JSON Schema from it, hand it to the SDK. The model's output is then guaranteed to deserialize into the Pydantic model.

from pydantic import BaseModel, ConfigDict, Field
from anthropic import Anthropic

class PRReview(BaseModel):
    # extra="forbid" emits additionalProperties: false, which strict schema modes typically require.
    model_config = ConfigDict(extra="forbid")
    clarity: int = Field(ge=1, le=5, description="1=unreadable, 5=exemplary")
    correctness: int = Field(ge=1, le=5, description="1=clearly broken, 5=clearly correct")
    risks: list[str] = Field(description="Specific risks; empty list if none")

REVIEW_TOOL = {
    "name": "submit_review",
    "description": "Submit the PR rubric scores. You must call this tool exactly once.",
    "input_schema": PRReview.model_json_schema(),
    "strict": True,
}

client = Anthropic()
# diff_text is the PR diff as a string, loaded elsewhere.
resp = client.messages.create(
    model="claude-sonnet-4-7-20260301",
    max_tokens=1024,
    tools=[REVIEW_TOOL],
    tool_choice={"type": "tool", "name": "submit_review"},
    messages=[
        {"role": "user", "content": f"Review this PR diff:\n\n{diff_text}"},
    ],
)

tool_block = next(b for b in resp.content if b.type == "tool_use")
review = PRReview.model_validate(tool_block.input)
# review is a typed Pydantic object. clarity/correctness are guaranteed ints in [1,5].

OpenAI parallel using response_format:

# OpenAI equivalent using strict response_format
from openai import OpenAI
client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-5.4-2026-03-01",
    messages=[
        {"role": "user", "content": f"Review this PR diff:\n\n{diff_text}"},
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "pr_review",
            "schema": PRReview.model_json_schema(),
            "strict": True,
        },
    },
)
review = PRReview.model_validate_json(resp.choices[0].message.content)

Both routes guarantee that you get a PRReview back. The model is free to produce semantically wrong scores — strict mode does not check whether clarity=5 is correct, only that clarity is an integer in [1,5]. But the entire class of "model returned a string instead of an int" is gone. So is "model invented a sixth field." So is "model wrapped the response in markdown code fences." Strict mode kills format failures; semantic failures still need evaluation.

Worked example: turning the open-ended prompt "Rate this PR for clarity, correctness, and risks. Respond in JSON." into a forced rubric. Without strict mode, fields sporadically come back as strings ("five" instead of 5), arrays come back as comma-separated strings, and the model wraps the JSON in markdown fences. Plumbed through the submit_review tool above, none of those failure modes is structurally possible. The reliability gap closes.

7. Idempotency and replay

If you log every (prompt, response, tool_input, tool_result) tuple as it happens, you can re-run any session from any turn. This is two things at once: a debugging tool (re-run with a patched prompt to find a fix), and the input substrate for evaluation in L52.

The data shape is plain JSONL — one record per turn. Each record contains everything needed to deterministically reconstruct the state at that turn:

import json, time
from pathlib import Path

class TurnLog:
    def __init__(self, session_id: str, log_dir: Path = Path("./traces")):
        log_dir.mkdir(exist_ok=True, parents=True)
        self.path = log_dir / f"{session_id}.jsonl"
        self.session_id = session_id

    def write(self, record: dict) -> None:
        record["session_id"] = self.session_id
        record["wall_clock_ms"] = int(time.time() * 1000)
        with self.path.open("a") as f:
            f.write(json.dumps(record, default=str) + "\n")

# Usage inside the agent loop
log = TurnLog(session_id="2026-05-06-abc123")
log.write({
    "turn": 7, "kind": "model_call",
    "model": "claude-sonnet-4-7-20260301",
    "params": {"temperature": 0, "max_tokens": 4096, "top_p": 1.0},
    "system": SYSTEM_PROMPT,          # full text, not a hash
    "tools": list(TOOL_DEFS),         # full schemas, not just names
    "messages_in": prior_messages,    # complete prompt, not summary
    "response_blocks": [b.model_dump() for b in resp.content],
    "stop_reason": resp.stop_reason,
    "usage": resp.usage.model_dump(),
    "latency_ms": elapsed_ms,
})

To replay, read the JSONL up to turn N, reconstruct the messages and tool definitions, then resubmit to the same pinned model. The output may differ by a token or two (see section 5) but the trajectory will be close.

This is also the bridge into observability. A trace and a replay log are the same data. The next sections cover how to structure that data for human consumption (a UI), not just machine consumption (a re-run).

8. Observability: what to log

A trace you cannot replay is not a trace. The minimum viable set of fields per model call:

  • input messages: the full array, not a hash. Truncated or hashed messages are unreplayable.
  • system prompt: full text. A different system prompt is a different agent.
  • tool definitions: full schemas, not just names.
  • model ID: pinned with date suffix. claude-sonnet-4-7 is not enough; claude-sonnet-4-7-20260301 is.
  • sampling parameters: temperature, top_p, max_tokens, seed if available, stop_sequences.
  • output blocks: every block in response.content (text, tool_use, thinking). Not just the final text.
  • stop_reason: end_turn, max_tokens, tool_use, refusal, pause_turn.
  • usage: input/output tokens, cache read/creation tokens. Anthropic exposes these on response.usage.
  • latency: wall-clock from request start to response complete.
  • tool calls and results: name, input, dispatch result, is_error, tool latency.
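A cheap way to enforce this list is to validate every record before it is written. The sketch below uses the field names from the section 7 log record; `replay_problems` is a hypothetical helper, and the date-suffix regex assumes pinned IDs end in eight digits, as in claude-sonnet-4-7-20260301.

```python
import re

# Field names follow the model_call record from section 7.
REPLAY_FIELDS = (
    "model", "params", "system", "tools", "messages_in",
    "response_blocks", "stop_reason", "usage", "latency_ms",
)
PINNED_MODEL = re.compile(r"-\d{8}$")  # assumes a trailing YYYYMMDD suffix


def replay_problems(record: dict) -> list:
    """Return human-readable reasons this record could not be replayed."""
    problems = [f"missing field: {name}" for name in REPLAY_FIELDS if name not in record]
    model = record.get("model", "")
    if model and not PINNED_MODEL.search(model):
        problems.append(f"model ID not pinned to a dated snapshot: {model}")
    return problems


half_trace = {"model": "claude-sonnet-4-7", "messages_in": [], "response_blocks": []}
for problem in replay_problems(half_trace):
    print(problem)
```

Failing loudly in development when this list is non-empty is cheaper than discovering the gap six weeks later.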

The Anthropic usage object, which most students underuse:

resp = client.messages.create(...)
print(resp.usage)
# Usage(input_tokens=1842, output_tokens=194,
# cache_creation_input_tokens=0, cache_read_input_tokens=12044,
# server_tool_use=None)

Cache fields are how you confirm prompt caching is working. If cache_read_input_tokens is zero across a long session, your cache breakpoints are misplaced or the prefix is changing. OpenAI's usage exposes prompt_tokens, completion_tokens, total_tokens, plus prompt_tokens_details.cached_tokens for cache reads. Same idea, different names.
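Turning those fields into a single number makes the check easy to alert on. The sketch below works on the logged usage dicts from section 7 and assumes that input_tokens counts only the uncached portion of the prompt, so total input is the sum of the three fields; verify that against your provider's documentation. `cache_hit_ratio` is a hypothetical helper name.

```python
def cache_hit_ratio(usages: list) -> float:
    """Fraction of all input tokens in a session that were served from cache."""
    read = sum(u.get("cache_read_input_tokens") or 0 for u in usages)
    created = sum(u.get("cache_creation_input_tokens") or 0 for u in usages)
    uncached = sum(u.get("input_tokens") or 0 for u in usages)
    total = read + created + uncached
    return read / total if total else 0.0


session_usage = [
    # The example usage object shown above, as a plain dict.
    {"input_tokens": 1842, "output_tokens": 194,
     "cache_creation_input_tokens": 0, "cache_read_input_tokens": 12044},
]
print(round(cache_hit_ratio(session_usage), 2))  # 0.87
```

A ratio near zero over a long session is the symptom described above: misplaced breakpoints or a prefix that keeps changing.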

The gotcha most students hit: they log the user message and the model response and call it a trace. Six weeks later they want to reproduce a bug. The system prompt has changed twice since then. The tool schemas have grown. The model ID resolved to a newer snapshot. They have nothing. Logging the prompt without logging the full system prompt + tool definitions + pinned model version yields traces you can never replay. Log too much, never too little. Disk is cheap. Re-running a six-week-old bug from a half-trace is not.

9. OpenTelemetry spans for agent loops

OpenTelemetry (OTel) is the standard observability protocol. Spans, traces, metrics, attributes, events. Most modern tracing platforms speak OTel natively. If you instrument your agent with OTel, you can swap backends without rewriting your code.

The right span hierarchy for an agent loop:

agent_run                 (one per user request)
├── turn_1
│   ├── model_call
│   ├── tool_call: search
│   └── tool_call: read_file
├── turn_2
│   ├── model_call
│   └── tool_call: write_file
└── turn_3
    └── model_call        (final answer, no tools)

Every span has a start time, an end time, attributes (key-value pairs), and events (timestamped log entries). The agent_run span wraps the whole interaction. Each turn_N is a child. Each model call and tool call inside a turn is a grandchild.

Subagents, from L30, nest as further children. A Task tool dispatch becomes a child span with its own subtree of turn_N → model_call/tool_call. When you look at the trace UI, the parent's row shows the elapsed wall-clock, and the subagent's subtree explains where the time went.

The instrumentation, in Python, using opentelemetry-sdk:

import uuid

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import (
    BatchSpanProcessor, ConsoleSpanExporter,
)

trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(ConsoleSpanExporter())
)
tracer = trace.get_tracer("agent")

# MAX_TURNS, TOOLS, text_of, call_with_fallback, and run_tool are defined elsewhere.
def run_agent(user_input: str) -> str:
    session_id = str(uuid.uuid4())  # one ID per run, used to join spans with the JSONL log
    with tracer.start_as_current_span("agent_run") as run_span:
        run_span.set_attribute("user_input.preview", user_input[:200])
        run_span.set_attribute("session_id", session_id)

        messages = [{"role": "user", "content": user_input}]
        for turn in range(MAX_TURNS):
            with tracer.start_as_current_span(f"turn_{turn}") as turn_span:
                with tracer.start_as_current_span("model_call") as model_span:
                    resp = call_with_fallback(messages, TOOLS)
                    model_span.set_attribute("model", resp.model)
                    model_span.set_attribute("usage.input_tokens", resp.usage.input_tokens)
                    model_span.set_attribute("usage.output_tokens", resp.usage.output_tokens)
                    model_span.set_attribute("stop_reason", resp.stop_reason)

                if resp.stop_reason == "end_turn":
                    return text_of(resp)

                tool_results = []
                for block in resp.content:
                    if block.type != "tool_use":
                        continue
                    with tracer.start_as_current_span("tool_call") as tool_span:
                        tool_span.set_attribute("tool.name", block.name)
                        content, is_error = run_tool(block.name, block.input)
                        tool_span.set_attribute("tool.is_error", is_error)
                        tool_results.append({
                            "type": "tool_result",
                            "tool_use_id": block.id,
                            "content": content,
                            "is_error": is_error,
                        })
                messages.append({"role": "assistant", "content": resp.content})
                messages.append({"role": "user", "content": tool_results})

        run_span.add_event("turn_budget_exhausted")
        return "(turn budget exhausted; aborting)"

The set_attribute calls become searchable, filterable columns in the trace UI. The add_event calls become timestamped log lines pinned to the span. Use attributes for things you want to filter on (model name, tool name, error flag); use events for things you want to read (a stack trace, a warning).

10. Tracing platforms — Langfuse, Helicone, Braintrust

You can read OTel traces in raw JSON. You should not. Pick a platform that gives you a UI tuned for agent traces — span trees, latency flame graphs, token usage panels, side-by-side trace comparison.

| Platform | Self-host | SaaS | OTel native | Free tier | Agent-trace UI | Instrumentation effort |
|---|---|---|---|---|---|---|
| Langfuse | Yes (Apache 2.0) | Yes | Yes (OTel + native SDK) | Generous on cloud, free self-host | Strong: span tree, prompt diff, eval integration | Low: Python SDK or OTel exporter |
| Helicone | Yes | Yes | Partial (proxy-based) | Generous | Good: built around proxy logs | Zero code change, proxy your API base URL |
| Braintrust | No | Yes | Yes | Limited | Best in class: tight eval + trace integration | Low: Python SDK, OpenAI/Anthropic wrappers |
| LangSmith | No (LangChain Cloud) | Yes | Partial | Limited | Strong, biased toward LangChain | Low if using LangChain |
| Phoenix (Arize) | Yes (Apache 2.0) | Yes | Yes (OTel) | Generous | Good, ML-leaning | Medium: heavier instrumentation |

Course default: Langfuse self-hosted. It runs as a Docker compose stack on the same VPS L40 introduced for personal agents. No data leaves your box. The OTel exporter speaks the same protocol as the snippet in section 9, so the same instrumentation feeds the same UI whether you point at the SaaS endpoint or a self-hosted instance. Langfuse also has a native Python SDK that wraps the Anthropic and OpenAI SDKs with one decorator — if you want zero-OTel-overhead instrumentation, that path works too.

Helicone is the right choice when you want to add observability to an agent you cannot easily modify. You change the SDK base URL from api.anthropic.com to a Helicone proxy URL and the proxy logs every request and response. Zero code change. The downside: you lose span hierarchy because the proxy only sees individual API calls, not the agent loop wrapping them.

Braintrust has the cleanest UI and the tightest integration between traces and evaluations (the L52 topic). Paid SaaS only; consider it once you are running real eval suites. LangSmith is the LangChain-native option — path of least resistance if your agent is built on LangGraph; otherwise the data model is forced.

The platform decision is rarely permanent, since most of these platforms consume OTel at least partially, but the shape of your spans is. Get the span hierarchy right, and you can re-emit to a different platform later without changing agent code.

11. Replay from logs

A complete trace makes any past turn re-runnable. Two practical uses.

Debug replay. A user reports a bug from session abc123, turn 7. You open the JSONL log, find turn 7, reconstruct the exact messages, tools, system, and model from that record, and re-run it. If the bug reproduces, you have a deterministic-enough fixture to iterate against. Patch the system prompt, re-run from turn 7, see if the answer changes. This is the same workflow as git bisect, except the bisect target is a prompt, not a commit.

Dataset generation. Once you have hundreds of trajectories, the union of all turn-7s is a dataset. Filter for "turn where the model called search_database" or "turn where the user prompt mentioned 'refund'" and you have a focused eval set. This is the substrate for L52, which builds the formal evaluation harness on top of these logged trajectories.
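Filtering logged trajectories into a focused eval set can be sketched in a few lines. The record shape follows the section 7 JSONL log (a "kind" field and a "response_blocks" list of dumped content blocks); `iter_model_calls`, `called_tool`, and `build_eval_set` are hypothetical helper names.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_model_calls(trace_dir: Path) -> Iterator[dict]:
    """Yield every model_call record from every session log in trace_dir."""
    for path in sorted(trace_dir.glob("*.jsonl")):
        with path.open() as f:
            for line in f:
                if not line.strip():
                    continue  # tolerate blank lines
                record = json.loads(line)
                if record.get("kind") == "model_call":
                    yield record


def called_tool(record: dict, tool_name: str) -> bool:
    """True if the model's response in this turn included a call to tool_name."""
    return any(
        block.get("type") == "tool_use" and block.get("name") == tool_name
        for block in record.get("response_blocks", [])
    )


def build_eval_set(trace_dir: Path, tool_name: str) -> list:
    return [r for r in iter_model_calls(trace_dir) if called_tool(r, tool_name)]
```

A call such as `build_eval_set(Path("traces"), "search_database")` then returns every logged turn in which the model chose that tool, each with the full prompt needed to re-run it.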

The minimum replay function:

import json
from pathlib import Path

def replay_turn(session_log: Path, turn_idx: int) -> dict:
    with session_log.open() as f:
        records = [json.loads(line) for line in f if line.strip()]
    turn = next(r for r in records if r["turn"] == turn_idx and r["kind"] == "model_call")
    resp = client.messages.create(
        model=turn["model"],
        system=turn["system"],
        tools=turn["tools"],
        messages=turn["messages_in"],
        **turn["params"],
    )
    return {
        "original_response": turn["response_blocks"],
        "replayed_response": [b.model_dump() for b in resp.content],
        # Difference in output token counts: a cheap first signal of drift.
        "diff_token_count": resp.usage.output_tokens - turn["usage"]["output_tokens"],
    }

When the original and the replayed responses differ, you are looking at section 5's drift directly. When they agree, you have a stable fixture. Either way, the replay tells you something — and that is the point of logging the full prompt in section 8.
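Raw equality between the two block lists is too strict, because each response assigns fresh tool_use IDs. A sketch of a more useful comparison follows; it operates on dumped content blocks (dicts with "type", "name", "input", "id"), and `classify_replay` and its three labels are hypothetical.

```python
def _without_ids(blocks: list) -> list:
    """Drop per-response IDs, which differ on every call even when content matches."""
    return [{k: v for k, v in block.items() if k != "id"} for block in blocks]


def _tool_calls(blocks: list) -> list:
    return [(b.get("name"), b.get("input")) for b in blocks if b.get("type") == "tool_use"]


def classify_replay(original: list, replayed: list) -> str:
    if _without_ids(original) == _without_ids(replayed):
        return "identical"
    calls = _tool_calls(original)
    if calls and calls == _tool_calls(replayed):
        return "same_tool_calls"  # wording drifted, actions did not
    return "diverged"


original = [
    {"type": "text", "text": "Let me search."},
    {"type": "tool_use", "id": "toolu_1", "name": "search", "input": {"q": "refund"}},
]
replayed = [
    {"type": "text", "text": "I'll look that up."},
    {"type": "tool_use", "id": "toolu_2", "name": "search", "input": {"q": "refund"}},
]
print(classify_replay(original, replayed))  # same_tool_calls
```

For debugging, "same_tool_calls" is usually a good enough fixture: the trajectory goes the same way even though the prose around it moved.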

See L52 for what to do once you have a thousand of these and want to score them automatically.

12. Putting it together: a reliable agent skeleton

Here is the smallest agent loop that pulls retry, structured output, OTel spans, and the replay log into one file. About 100 lines. Real enough to start from, small enough to read in a sitting.

import json, time, uuid
from pathlib import Path
from pydantic import BaseModel, ConfigDict, Field
from tenacity import (
    retry, stop_after_attempt, wait_exponential_jitter,
    retry_if_exception,
)
from opentelemetry import trace
from anthropic import Anthropic, APIStatusError, APIConnectionError

tracer = trace.get_tracer("agent")
client = Anthropic()

MODEL = "claude-sonnet-4-7-20260301"
MAX_TURNS = 20
# TOOL_DEFS and TOOL_REGISTRY are assumed to be defined elsewhere.

class Answer(BaseModel):
    model_config = ConfigDict(extra="forbid")  # emits additionalProperties: false
    text: str = Field(description="Final answer to the user")
    confidence: float = Field(ge=0.0, le=1.0)

ANSWER_TOOL = {
    "name": "submit_answer",
    "description": "Submit the final answer. Call exactly once when done.",
    "input_schema": Answer.model_json_schema(),
    "strict": True,
}

def is_transient(exc: BaseException) -> bool:
    """Retry connection errors, 429s, and 5xx. Other 4xx errors need a fixed request."""
    if isinstance(exc, APIConnectionError):
        return True
    if isinstance(exc, APIStatusError):
        return exc.status_code == 429 or exc.status_code >= 500
    return False

@retry(
    stop=stop_after_attempt(4),
    wait=wait_exponential_jitter(initial=1, max=30),
    retry=retry_if_exception(is_transient),
    reraise=True,
)
def call_model(messages, tools):
    return client.messages.create(
        model=MODEL, max_tokens=4096, temperature=0,
        tools=tools, messages=messages,
    )

def run_tool(name, args):
    try:
        return TOOL_REGISTRY[name](**args), False
    except Exception as e:
        return f"{type(e).__name__}: {e}", True

def run_agent(user_input: str) -> Answer:
    session_id = str(uuid.uuid4())
    log_path = Path("traces") / f"{session_id}.jsonl"
    log_path.parent.mkdir(exist_ok=True)

    def log(record):
        record.update(session_id=session_id, ts_ms=int(time.time() * 1000))
        with log_path.open("a") as f:
            f.write(json.dumps(record, default=str) + "\n")

    tools = list(TOOL_DEFS) + [ANSWER_TOOL]
    messages = [{"role": "user", "content": user_input}]

    with tracer.start_as_current_span("agent_run") as run_span:
        run_span.set_attribute("session_id", session_id)
        for turn in range(MAX_TURNS):
            with tracer.start_as_current_span(f"turn_{turn}"):
                t0 = time.time()
                resp = call_model(messages, tools)
                latency_ms = int((time.time() - t0) * 1000)
                log({"turn": turn, "kind": "model_call", "model": resp.model,
                     "params": {"temperature": 0, "max_tokens": 4096},
                     "messages_in": messages, "tools": tools,
                     "response_blocks": [b.model_dump() for b in resp.content],
                     "stop_reason": resp.stop_reason,
                     "usage": resp.usage.model_dump(), "latency_ms": latency_ms})

                tool_results = []
                final = None
                for block in resp.content:
                    if block.type != "tool_use":
                        continue
                    if block.name == "submit_answer":
                        final = Answer.model_validate(block.input)
                        tool_results.append({"type": "tool_result",
                                             "tool_use_id": block.id, "content": "ok"})
                        continue
                    content, is_error = run_tool(block.name, block.input)
                    log({"turn": turn, "kind": "tool_call", "name": block.name,
                         "input": block.input, "is_error": is_error})
                    tool_results.append({"type": "tool_result",
                                         "tool_use_id": block.id,
                                         "content": str(content), "is_error": is_error})
                if final is not None:
                    return final

                messages.append({"role": "assistant", "content": resp.content})

                # No tool calls this turn (end_turn, max_tokens, ...): force a structured
                # final answer instead of sending an empty tool_result message.
                if not tool_results:
                    forced = client.messages.create(
                        model=MODEL, max_tokens=1024, temperature=0,
                        tools=tools, messages=messages,
                        tool_choice={"type": "tool", "name": "submit_answer"},
                    )
                    for block in forced.content:
                        if block.type == "tool_use" and block.name == "submit_answer":
                            return Answer.model_validate(block.input)
                    raise RuntimeError("forced final pass returned no submit_answer")

                messages.append({"role": "user", "content": tool_results})

    raise RuntimeError("turn budget exhausted")

Read top to bottom. The retry decorator handles transient model errors. Answer plus the submit_answer strict tool forces the final output into a typed structure. Every model call and tool call writes a JSONL record sufficient to replay. Every span is named so the trace UI groups them into the hierarchy from section 9. The turn budget is the runaway-loop guard from section 2.

This is the minimal viable shape. Production versions add model fallback (section 4), idempotency keys (section 3), and circuit breakers in front of flaky tools. The core — log everything, force structure, fail loudly, retry intelligently — is in the ~100 lines above. Reliability is not a feature you bolt on. It is the cumulative weight of decisions made at every level of the loop. The next lecture, L52, turns the trajectories you have logged here into a measurement system: not just "did it crash" but "did it succeed at the actual task." The substrate is the same. The question changes.