26 - Tools

The intro lecture covered the mechanics: tool calling is structured output → your code executes → result goes back. This lecture goes deeper: what makes a good tool, how providers differ, why format matters, and why your 7B model can't reliably call functions.

Recap: Tools Are Structured Text Generation

The model never executes anything. It generates tokens that happen to match a tool-calling format. Your application parses these tokens, runs the function, and sends the result back as another message.

Everything that follows builds on this foundation. If you don't have this internalized, go back to lecture 01.

Anatomy of a Tool Definition

Every tool definition has three parts, regardless of provider:

  1. Name — identifier the model uses to request the tool (get_weather, search_database)
  2. Description — natural language explanation of what the tool does and when to use it
  3. Input schema — structured definition of parameters (JSON Schema)

The description is the most important part. The model uses it to decide whether to call the tool. The schema constrains how it calls it. A perfect schema with a bad description means the tool never fires.

{
  "name": "search_student_records",
  "description": "Search the university student database by name, student code, or course enrollment. Use this when the user asks about a specific student's grades, enrollment status, or academic history.",
  "input_schema": {
    "type": "object",
    "properties": {
      "query": {
        "type": "string",
        "description": "Student name, student code (e.g., 123456IABB), or course code"
      },
      "search_type": {
        "type": "string",
        "enum": ["name", "student_code", "course"],
        "description": "What field to search by"
      }
    },
    "required": ["query", "search_type"]
  }
}

Description Quality Matters More Than You Think

Bad description:

"description": "Queries the database"

Better:

"description": "Search the university student database by name, student code, or course enrollment. Use when the user asks about a specific student's grades, enrollment status, or academic history. Do NOT use for course-level statistics — use get_course_stats instead."

The description should tell the model:

  • What the tool does
  • When to use it
  • When NOT to use it (especially if similar tools exist)
  • What kind of results it returns

Anthropic's own guidance on writing agent skills notes that agents tend to undertrigger: they don't use tools when they should. Making descriptions "pushy" helps.

OpenAI vs Anthropic: Format Differences

Both providers implement the same concept (model requests tool calls, you execute, send results back) but the wire format is different. This matters because the homework requires dual-provider support.

OpenAI Format (Chat Completions API)

Tool definition:

{
  "type": "function",
  "function": {
    "name": "get_weather",
    "description": "Get current weather for a city",
    "parameters": {
      "type": "object",
      "properties": {
        "city": {"type": "string"}
      },
      "required": ["city"]
    }
  }
}

Model response requesting tool call:

{
  "role": "assistant",
  "tool_calls": [
    {
      "id": "call_abc123",
      "type": "function",
      "function": {
        "name": "get_weather",
        "arguments": "{\"city\": \"Tallinn\"}"
      }
    }
  ]
}

Your result sent back:

{
  "role": "tool",
  "tool_call_id": "call_abc123",
  "content": "Tallinn: 2°C, cloudy"
}

Key characteristics:

  • Tool definitions wrapped in {"type": "function", "function": {...}}
  • Schema key is parameters (not input_schema)
  • Arguments returned as JSON string (you must JSON.parse() it)
  • Results sent with role "tool" and tool_call_id for correlation
  • Supports parallel_tool_calls: true (default on) — model can request multiple tools in one response
  • tool_choice controls forcing: "auto", "none", "required", or specific function name
  • Strict mode via "strict": true — guarantees schema compliance using constrained decoding
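The string-vs-object distinction in the arguments field is a common stumbling block. Here is a minimal Python sketch that normalizes an OpenAI assistant message into plain tool-call records; `parse_openai_tool_calls` is our own helper name, not an SDK function:

```python
import json

def parse_openai_tool_calls(message: dict) -> list[dict]:
    """Normalize an OpenAI assistant message into (id, name, input) records."""
    calls = []
    for tc in message.get("tool_calls", []):
        calls.append({
            "id": tc["id"],
            "name": tc["function"]["name"],
            # OpenAI returns arguments as a serialized JSON *string*,
            # so it must be parsed before use.
            "input": json.loads(tc["function"]["arguments"]),
        })
    return calls

# Example assistant message in the format shown above
message = {
    "role": "assistant",
    "tool_calls": [{
        "id": "call_abc123",
        "type": "function",
        "function": {"name": "get_weather", "arguments": '{"city": "Tallinn"}'},
    }],
}
print(parse_openai_tool_calls(message))
```

With Anthropic the same helper would skip the `json.loads` step, since `input` arrives already parsed.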

OpenAI Responses API (newer)

OpenAI also has a newer Responses API where tool calls and results are separate items rather than messages:

Tool call is an output item:

{
  "type": "function_call",
  "call_id": "call_abc123",
  "name": "get_weather",
  "arguments": "{\"city\": \"Tallinn\"}"
}

Result is an input item in the next request:

{
  "type": "function_call_output",
  "call_id": "call_abc123",
  "output": "Tallinn: 2°C, cloudy"
}

Anthropic Format (Messages API)

Tool definition:

{
  "name": "get_weather",
  "description": "Get current weather for a city",
  "input_schema": {
    "type": "object",
    "properties": {
      "city": {"type": "string"}
    },
    "required": ["city"]
  }
}

Model response requesting tool call:

{
  "role": "assistant",
  "content": [
    {
      "type": "tool_use",
      "id": "toolu_01ABC",
      "name": "get_weather",
      "input": {"city": "Tallinn"}
    }
  ],
  "stop_reason": "tool_use"
}

Your result sent back (inside a user message):

{
  "role": "user",
  "content": [
    {
      "type": "tool_result",
      "tool_use_id": "toolu_01ABC",
      "content": "Tallinn: 2°C, cloudy"
    }
  ]
}

Key characteristics:

  • Tool definitions are flat (no function wrapper)
  • Schema key is input_schema (not parameters)
  • Arguments returned as parsed JSON object (no extra JSON.parse() needed)
  • Results sent as tool_result blocks inside a user message (not a separate tool role)
  • stop_reason: "tool_use" signals the model wants you to execute something
  • tool_choice supports {"type": "auto"}, {"type": "any"} (must use some tool), {"type": "tool", "name": "..."} (force specific)
  • Strict mode via "strict": true in tool definition — same idea as OpenAI, constrained decoding guarantees schema match

Side-by-Side Comparison

| Aspect | OpenAI | Anthropic |
|---|---|---|
| Definition wrapper | {"type": "function", "function": {...}} | Flat object |
| Schema key | parameters | input_schema |
| Arguments format | JSON string (must parse) | Parsed JSON object |
| Result role | "tool" (dedicated role) | "user" with tool_result block |
| Call ID prefix | call_ | toolu_ |
| Stop signal | finish_reason: "tool_calls" | stop_reason: "tool_use" |
| Parallel calls | Explicit parallel_tool_calls flag | Model can return multiple tool_use blocks |
| Strict mode | strict: true in function def | strict: true in tool def |

What About Google / Gemini?

Google's Gemini API uses a format similar to OpenAI's: tool definitions wrapped in {"function_declarations": [...]}, arguments returned as parsed objects (like Anthropic, not as JSON strings). The differences from OpenAI are smaller than the OpenAI-Anthropic gap. If you need to support Gemini, the translation layer is straightforward once you have an OpenAI adapter. We focus on OpenAI vs Anthropic because those two show the widest format divergence.

Practical Consequence: Dual-Provider Abstraction

When building a dual-provider system (the homework requirement), the translation layer is small but non-trivial:

// Pseudocode: normalize tool definitions
function toOpenAI(tool) {
  return {
    type: "function",
    function: {
      name: tool.name,
      description: tool.description,
      parameters: tool.input_schema // rename key
    }
  };
}

function toAnthropic(tool) {
  return {
    name: tool.name,
    description: tool.description,
    input_schema: tool.input_schema // already correct
  };
}

// Pseudocode: normalize tool call response
function parseToolCalls(response, provider) {
  if (provider === "openai") {
    return response.tool_calls.map(tc => ({
      id: tc.id,
      name: tc.function.name,
      input: JSON.parse(tc.function.arguments) // must parse!
    }));
  } else {
    return response.content
      .filter(b => b.type === "tool_use")
      .map(b => ({
        id: b.id,
        name: b.name,
        input: b.input // already parsed
      }));
  }
}

XML vs JSON: Internal Representation

When the model "sees" tool definitions, the API server serializes them into the text stream. OpenAI and Anthropic use different internal formats.

Anthropic: XML-style (historically)

Anthropic has historically used XML-like delimiters in the internal prompt format. The tool definitions get injected roughly like this:

In this environment you have access to a set of tools you can use
to answer the user's question.

String and scalar parameters should be specified as is, while lists
and objects should use JSON format.

Here are the functions available in JSONSchema format:

<function>
{"name": "get_weather", "description": "...", "parameters": {...}}
</function>

<function>
{"name": "search_db", "description": "...", "parameters": {...}}
</function>

The model's tool call output is parsed with regular expressions, not a strict XML parser. This means the model doesn't need to produce perfectly valid XML — it just needs to match the expected pattern.

OpenAI: JSON-native

OpenAI's internal format is more JSON-oriented. The model is trained to emit structured JSON for tool calls.

Why Does This Matter?

  1. XML is more forgiving to generate. XML-style markers (<tool_use>, </tool_use>) are easy tokens for the model to produce. They work like clear start/stop signals. JSON requires matching braces, proper quoting, correct comma placement — more ways to fail.

  2. JSON is more machine-parseable. Standard JSON parsers exist everywhere. XML fragments require custom extraction logic.

  3. Strict mode eliminates the difference. Both providers now offer strict/constrained decoding that forces the model's output to match the schema exactly. This is done at the token sampling level — the model literally cannot produce invalid JSON because tokens that would break the schema are masked out before sampling.

  4. For your code, you only see JSON. The internal XML/JSON representation is hidden behind the API. You send JSON, you get JSON back. The internal format matters only when you're trying to understand failure modes (lecture section: "Why Small Models Fail").

Tool Types: Client vs Server

Not all tools work the same way.

Client Tools (You Execute)

The standard pattern: model requests → you run code → you send result back. You own the execution environment.

Model: "Call get_weather with city=Tallinn"
Your code: fetches weather API → gets result
Your code: sends result back to model

Both OpenAI and Anthropic support this. This is what you implement in your homework.
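The request/execute/result cycle above can be sketched end to end. Everything here is a simplified stand-in, not a real provider SDK: `fake_model` mimics an API that requests one tool and then answers, and `TOOLS` maps tool names to local functions.

```python
# Registry of client tools: name -> callable you own and execute locally.
TOOLS = {
    "get_weather": lambda city: f"{city}: 2°C, cloudy",
}

def fake_model(messages):
    # Stand-in for an API call: request a tool once, then give a final answer.
    if not any(m["role"] == "tool" for m in messages):
        return {"tool_call": {"id": "call_1", "name": "get_weather",
                              "input": {"city": "Tallinn"}}}
    return {"text": "It is 2°C and cloudy in Tallinn."}

def agent_loop(user_prompt):
    messages = [{"role": "user", "content": user_prompt}]
    while True:
        reply = fake_model(messages)
        if "tool_call" not in reply:
            return reply["text"]                      # no tool requested: done
        call = reply["tool_call"]
        result = TOOLS[call["name"]](**call["input"])  # you execute the tool
        messages.append({"role": "tool", "tool_call_id": call["id"],
                         "content": result})           # result goes back

print(agent_loop("What's the weather in Tallinn?"))
# prints: It is 2°C and cloudy in Tallinn.
```

Swapping `fake_model` for a real OpenAI or Anthropic call (plus the format translation shown earlier) turns this skeleton into the homework implementation.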

Server Tools (Provider Executes)

Some tools run on the provider's infrastructure. You just enable them; the provider handles execution.

Anthropic server tools:

  • web_search — searches the web, returns results
  • code_execution — runs Python in a sandboxed VM
  • web_fetch — retrieves a URL's content
  • tool_search — dynamically discovers relevant tools from a large set

OpenAI server tools:

  • web_search — similar web search
  • code_interpreter — runs Python in sandbox
  • file_search — searches uploaded files via vector embeddings

Server tools are transparent to the model — they look like regular tools. The difference is operational: you don't need to handle execution, but you also can't customize it.

Anthropic-Defined Client Tools (Hybrid)

Anthropic also provides tool schemas that your code executes:

  • bash — run shell commands
  • text_editor — view/edit files with str_replace

These are used by Claude Code and similar agentic tools. The schema is standardized by Anthropic, but you run the actual bash command or file edit.

Pros and Cons of Tool Calling

Pros

  • Grounds the model in reality. Instead of hallucinating an answer about today's weather, the model calls a real API. Tool results replace fabrication with facts.
  • Extends capabilities infinitely. The model can do anything you can write a function for: query databases, send emails, run code, call APIs.
  • Structured and deterministic. Tool calls produce structured output (JSON with known schema). This is much more reliable than asking the model to produce structured data in freeform text.
  • Composable. Multiple tools can be chained: search → get details → compute → format. The agent loop handles orchestration.
  • Auditable. Every tool call is visible in the message history. You can log exactly what the model requested and what it got back. This is critical for debugging.

Cons

  • Latency. Every tool call adds a round-trip: model generates call → your code executes → model processes result. Multi-tool chains multiply this. A 5-tool chain means 6 model calls minimum.
  • Cost. Tool definitions consume tokens in every request (they're injected into the prompt). 20 tools with detailed schemas can eat 5,000-10,000 tokens before the conversation even starts. With Opus at $5/MTok input, that's $0.025-$0.05 per request just for tool definitions.
  • Reliability. The model can: call the wrong tool, hallucinate parameters that don't match the schema, fail to call a tool when it should, call a tool when it shouldn't, or misinterpret the result. All of these happen in practice.
  • Complexity. The agent loop is simple in theory, but production systems need: error handling, retry logic, timeout management, cost budgets, permission gates, context window management when tool results are large.
  • Tool definition overhead. As you add tools, each one competes for the model's attention. With 50+ tools, the model struggles to pick the right one. This is why Anthropic introduced tool_search — to dynamically load only relevant tool definitions.

Handling Tool Call Failures

When a tool call fails, feed the error back to the model as a tool result. The model uses error messages to self-correct on the next attempt:

// Inside your agent loop
const toolResult = await executeTool(toolCall.name, toolCall.input);

if (toolResult.error) {
  // Send error back as the tool result — the model will retry or adjust
  messages.push({
    role: "user",
    content: [{
      type: "tool_result",
      tool_use_id: toolCall.id,
      is_error: true,
      content: toolResult.error // "Student code 'ABC' not found. Expected: 6 digits + 4 letters"
    }]
  });
  // Continue the agent loop — model sees the error and decides what to do
} else {
  messages.push({
    role: "user",
    content: [{
      type: "tool_result",
      tool_use_id: toolCall.id,
      content: JSON.stringify(toolResult.data)
    }]
  });
}

The key: is_error: true (Anthropic) tells the model the call failed. The error message shapes the retry. Set a maximum retry count (3 is reasonable) to avoid infinite loops.

The Token Cost Problem (Concrete Numbers)

Consider a system with 30 tools. Each tool definition averages ~200 tokens (name + description + schema). That's 6,000 tokens of tool definitions in every single API call.

| Model | Input price/MTok | Cost per call (tools only) | 100 calls/day | Monthly |
|---|---|---|---|---|
| Claude Opus | $5.00 | $0.030 | $3.00 | $90 |
| Claude Sonnet | $1.15 | $0.007 | $0.70 | $21 |
| GPT-5 | $2.00 | $0.012 | $1.20 | $36 |
| DeepSeek 3.2 | $0.25 | $0.0015 | $0.15 | $4.50 |

This is before any conversation content. Now add multi-turn conversations where you re-send the full history + all tool definitions on every call.
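The arithmetic behind the table is straightforward: tokens times price per million. A quick check, using the lecture's own figures (30 tools, ~200 tokens each):

```python
# 30 tool definitions at ~200 tokens each = 6,000 tokens sent with EVERY request
TOOL_TOKENS = 30 * 200

def cost_per_call(price_per_mtok: float, tokens: int = TOOL_TOKENS) -> float:
    """Dollar cost of the tool-definition prefix for one API call."""
    return price_per_mtok * tokens / 1_000_000

for model, price in [("Claude Opus", 5.00), ("GPT-5", 2.00), ("DeepSeek 3.2", 0.25)]:
    per_call = cost_per_call(price)
    monthly = per_call * 100 * 30  # 100 calls/day over a 30-day month
    print(f"{model}: ${per_call:.4f}/call, ${monthly:.2f}/month")
# prints:
# Claude Opus: $0.0300/call, $90.00/month
# GPT-5: $0.0120/call, $36.00/month
# DeepSeek 3.2: $0.0015/call, $4.50/month
```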

Solutions:

  • Tool search / dynamic loading — only include relevant tools per request
  • Prompt caching — Anthropic caches the tool definitions across requests (automatic), reducing cost of static prefix
  • Tiered tool sets — use different tool sets for different task types
  • Programmatic tool calling — let the model call tools via code execution, avoiding repeated round-trips

Why Smaller Models Fail at Tool Calling

This is not theoretical — it's a hard engineering constraint that determines your architecture decisions.

The Evidence

From the Berkeley Function Calling Leaderboard (BFCL) and academic studies:

  • Models under ~3B parameters score near zero on tool calling in zero-shot settings. In one study, only DeepSeek-Coder-1.3B achieved any JSON-parseable output at all — with a success rate of 7.34%.
  • Even among 7B-8B models, zero-shot tool calling accuracy is dramatically lower than frontier models.
  • A fine-tuned 350M model can beat general-purpose ChatGPT on tool calling — but only after specific training on tool-calling datasets. The base model would fail completely.
  • Requiring JSON output reduces response accuracy by 27.3 percentage points on reasoning benchmarks compared to natural language output (Tam et al., 2024). Format compliance competes with reasoning for model capacity.

Why It Happens: Five Failure Modes

1. Format compliance is expensive.

Generating valid JSON requires tracking bracket nesting, comma placement, string quoting, and schema conformance — all while also reasoning about which tool to use and what arguments to provide. For a small model with limited capacity, these demands compete directly with each other.

The model has a fixed "budget" of capability. Spending it on format compliance means less is available for reasoning about which tool is correct.

2. Instruction following requires training investment.

Tool calling is a learned behavior from supervised fine-tuning (SFT) and RLHF. Smaller models either:

  • Weren't trained on enough tool-calling examples
  • Have insufficient parameters to internalize the format conventions alongside everything else they know
  • Lose the pattern when the context gets complex

A 7B model trained specifically for tool calling (like Salesforce xLAM) can perform well. A general-purpose 7B model given tool definitions for the first time will struggle.

3. Tool selection degrades with catalog size.

As the number of available tools increases, the model must discriminate between similar descriptions. This requires nuanced semantic understanding that scales with model size. Frontier models handle 20-30 tools reasonably; smaller models start failing at 5-10.

Research shows position bias in tool selection: tools defined later in the context, closer to the model's generation point, are disproportionately favored, while the model tends to "forget" tools defined further back (recency bias).

4. Multi-step reasoning collapses.

A single tool call is manageable. Chains of dependent tool calls (call A → use result to call B → use result to call C) require the model to:

  • Plan the chain
  • Execute each step correctly
  • Carry forward intermediate results
  • Recover from errors

Each step compounds error probability. A model with 80% per-step accuracy has 51% accuracy on a 3-step chain and 33% on a 5-step chain. Smaller models with lower per-step accuracy collapse much faster.
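The compounding is just repeated multiplication of the per-step success probability:

```python
# Chain accuracy for n dependent steps with per-step accuracy p is p ** n.
def chain_accuracy(per_step: float, steps: int) -> float:
    return per_step ** steps

print(round(chain_accuracy(0.80, 3), 2))  # → 0.51
print(round(chain_accuracy(0.80, 5), 2))  # → 0.33
print(round(chain_accuracy(0.60, 5), 2))  # a weaker model collapses faster
```

At 60% per-step accuracy, a 5-step chain succeeds under 8% of the time, which is why multi-step tool use is effectively frontier-model territory.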

5. Context window pressure.

Tool definitions + conversation history + tool results all compete for context window space. Smaller models typically have shorter context windows (2K-8K tokens). After injecting 10 tool definitions (~2,000 tokens), there's barely room for the conversation.

Even models with large context windows degrade: accuracy drops up to 40% as context length increases, even for models with multi-million-token windows.

Practical Implication: Model Selection for Tools

| Use case | Minimum viable model | Notes |
|---|---|---|
| Single tool, simple schema | 7B fine-tuned (xLAM, Hermes) | Must be specifically trained for tools |
| 3-5 tools, single calls | 14B-32B | Qwen2.5-32B, DeepSeek 3.2 work |
| 10+ tools, multi-step chains | Frontier (Opus, Sonnet, GPT-5) | Cost vs. reliability tradeoff |
| Dynamic tool discovery | Frontier only | Requires both reasoning and format |

The hybrid router pattern: Route simple tasks to cheap/small models, escalate complex tool-use tasks to frontier models. This is what 40% of production deployments do according to industry surveys.

Strict Mode / Constrained Decoding

Both providers now offer a way to guarantee schema compliance at the token sampling level.

How It Works

Without strict mode, the model generates tokens freely and you hope the output matches your schema. With strict mode:

  1. The provider converts your JSON Schema into a grammar/automaton
  2. During token generation, tokens that would violate the schema are masked out (probability set to zero)
  3. The model can only produce tokens that lead to valid output

This means: no missing required fields, no wrong types, no malformed JSON. Ever.

Tradeoffs

  • Pro: 100% schema compliance. No parsing errors.
  • Con: First request with a new schema is slower (grammar compilation). Some schema features aren't supported (recursive schemas, certain anyOf patterns).
  • Con: The model might produce valid but wrong output. Strict mode guarantees format, not correctness. It can still call the wrong tool or provide semantically wrong arguments.

When to Use It

Always, unless you have a specific reason not to. The latency cost of grammar compilation is amortized after the first call (cached), and guaranteed schema compliance eliminates an entire class of bugs.

Anthropic:

{
  "name": "get_weather",
  "description": "...",
  "input_schema": { "..." : "..." },
  "strict": true
}

OpenAI:

{
  "type": "function",
  "function": {
    "name": "get_weather",
    "description": "...",
    "parameters": { "..." : "..." },
    "strict": true
  }
}

Streaming and Tool Calls

When using streaming responses, tool calls don't arrive as a single complete JSON object. They arrive as deltas — partial chunks that your code must accumulate and assemble before execution.

Both providers send tool call arguments in incremental pieces during streaming. You cannot execute a tool call until the stream signals completion of that call (OpenAI: a chunk with finish_reason: "tool_calls"; Anthropic: a content_block_stop event). Attempting to parse partial arguments mid-stream will fail.

This is the #1 source of streaming bugs in student code: parsing the tool call before it's fully received. Non-streaming tool calling is simpler — get non-streaming working first, then add streaming support.
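The accumulate-then-parse pattern looks like this. The event shapes below are simplified stand-ins for either provider's chunk format, not the real wire protocol:

```python
import json

def assemble_tool_call(events):
    """Buffer argument fragments per call id; parse only on the done signal."""
    buffers = {}
    for ev in events:
        if ev["type"] == "delta":
            # Partial JSON fragment: append, do NOT parse yet.
            buffers.setdefault(ev["call_id"], []).append(ev["fragment"])
        elif ev["type"] == "done":
            # Only now is the JSON complete and safe to parse.
            return json.loads("".join(buffers[ev["call_id"]]))

events = [
    {"type": "delta", "call_id": "call_1", "fragment": '{"ci'},
    {"type": "delta", "call_id": "call_1", "fragment": 'ty": "Tall'},
    {"type": "delta", "call_id": "call_1", "fragment": 'inn"}'},
    {"type": "done",  "call_id": "call_1"},
]
print(assemble_tool_call(events))  # → {'city': 'Tallinn'}
```

Note that each fragment on its own (e.g. '{"ci') is invalid JSON, which is exactly why parsing mid-stream fails.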

Programmatic Tool Calling

A newer pattern (Anthropic, early 2026): instead of the model requesting tools one at a time through the conversation, it writes code that calls multiple tools in sequence.

The Problem

Traditional tool calling requires a full model inference pass for each tool call. A 5-tool workflow means:

  1. Model decides to call tool A → inference pass
  2. You execute A, send result → inference pass
  3. Model decides to call tool B → inference pass
  4. ... repeat

That's 6 inference passes minimum. Each one processes the full conversation context including all previous tool results, which keep accumulating.

The Solution

The model writes a Python script that orchestrates the tools:

# Model generates this code, which runs in a sandbox
results_a = tools.search_database(query="students enrolled in ICD0024")
student_ids = [r["id"] for r in results_a if r["grade"] < 3]
details = [tools.get_student_details(id=sid) for sid in student_ids[:10]]
summary = {"failing_count": len(student_ids), "details": details}
summary  # final value handed back to the model (a bare `return` would need a function wrapper)

One inference pass generates the entire orchestration plan. The code runs, calls the tools, processes intermediate results, and returns the final answer. Only the final summary enters the model's context.

When This Matters

  • Large-scale data processing (iterating over hundreds of records)
  • Complex filtering/aggregation that's trivial in code but painful in conversation
  • Workflows where intermediate results are large but the summary is small

MCP: Model Context Protocol

MCP standardizes tool exposure as a protocol, not just an API format. A server exposes tools via JSON-RPC over stdio (local) or Streamable HTTP (remote); any MCP-compatible client can discover and use them.

Why MCP Exists

Without MCP: every coding agent (Claude Code, Cursor, Copilot) implements its own tool integration. A GitHub tool written for Claude Code doesn't work in Cursor.

With MCP: write the tool server once, any MCP client can use it. Same idea as USB-C for peripherals.

Key Concepts

  • Server exposes tools, resources, and prompts
  • Client discovers available tools and calls them
  • Transport is stdio (local) or Streamable HTTP (remote); Streamable HTTP replaced the earlier SSE transport
  • Tools are described with JSON Schema, same as regular tool definitions

MCP is covered in more detail in a separate lecture. The point here: MCP is a transport and discovery layer on top of tool calling. The model still generates structured tool-call output. MCP just standardizes how tools are found and executed.

Tool Design Principles

1. One Tool, One Job

Bad:

{
  "name": "database",
  "description": "Do anything with the database"
}

Good:

{"name": "search_students", "description": "Search student records by name or code"},
{"name": "get_grades", "description": "Get grades for a student in a specific course"},
{"name": "update_grade", "description": "Update a student's grade (requires instructor role)"}

2. Make Errors Informative

Don't return:

{"error": "Failed"}

Return:

{"error": "Student code 'ABC123' not found. Expected format: 6 digits followed by 4 letters (e.g., 123456IABB)"}

The model uses error messages to self-correct. Vague errors lead to blind retries.

3. Return Only What's Needed

If the model asked for a student's name and you return the entire student record with 50 fields, you're wasting context tokens. Return the minimal useful response.

4. Use Enums Aggressively

"search_type": {
  "type": "string",
  "enum": ["name", "student_code", "course"]
}

Enums constrain the model's output space. Fewer choices = higher accuracy. Without the enum, the model might generate "type": "studentName" or "type": "by_name" or "type": "NAME".

5. Namespace When You Have Many Tools

github_list_prs, github_create_issue, github_get_file
slack_send_message, slack_list_channels
jira_create_ticket, jira_search

Clear prefixes help the model disambiguate when tool count grows.

References

Research

  • Busari et al., "Small Models, Big Tasks: An Exploratory Empirical Study on Small Language Models for Function Calling" (EASE 2025) — zero-shot SLMs score near zero on tool calling
  • Tam et al. (2024) — JSON output requirement reduces reasoning accuracy by 27.3pp on GSM8K
  • "Natural Language Tools" (arXiv 2510.14453) — structured format requirements cause 20%+ accuracy drops due to task interference
  • "ToolScan: A Benchmark for Characterizing Errors in Tool-Use LLMs" (arXiv 2411.13547) — error taxonomy: hallucinated API names, wrong arguments, redundant calls, format failures
