$ cat ai-engineering/tool-use.md

Function Calling and Tool Use

Tool use is typed RPC for LLMs: tool schemas, the call-result loop, parallel calls, tool_choice, OpenAI vs Anthropic differences, and failure modes.

Jatin Bansal@blog:~/ai-engineering$ open tool-use

A travel-booking assistant ships on a Tuesday. Forty-eight hours in, a customer asks it to “rebook me on the same flight tomorrow at the same fare.” The model’s text reply describes, in confident prose, that it has changed the booking. It has not. There is no booking system attached. The model can describe actions but it cannot take them — every side effect lives outside the LLM, and bridging that gap is the entire job of function calling. Tool use is how the model stops narrating the world and starts touching it.

Opening bridge

Yesterday’s piece on structured output ended on a deliberate hand-off: tool-use coercion — defining a tool whose input_schema is your output schema and forcing the model to call it — is one of the four paths to a typed payload. That trick works because tool use was already the model’s most reliable structured-output channel. Today we use the same primitive for its primary purpose. Yesterday: how to get a typed object out of the model. Today: how to let the model ask the runtime to do something and feed the answer back in.

What “tool use” actually is

Function calling, tool use, function invocation — different vendors, same idea. You declare a set of typed functions the model is allowed to ask for. The model decides, at decode time, whether to emit a normal text response or a structured request to call one of those functions. Your runtime executes the request, packages the result, and feeds it back into the next turn. The model can then call more tools, or stop and answer.

A tool is three things:

A name the model uses to address it.
A description — natural-language prose explaining when to call it and when not to. The description is the API doc for an LLM consumer; treat it that way.
An input schema — JSON Schema describing the arguments, with required fields, types, enums, and format constraints. The model’s emitted arguments are validated against this schema.

Two things tool use is not. It is not the model executing your code — the model only emits a structured request; your runtime runs the function. And it is not magic — there is no separate “tool” model. The same decoder that produces text produces tool-call tokens, biased by post-training on tool-use traces.

Intuition: typed RPC with the model as caller

Forget agents for a second. Tool use is a remote procedure call protocol where the LLM is the caller and your runtime is the server. The model is a stateless dispatcher: it picks a function, fills in typed arguments, and yields. Your runtime is a stub/skeleton pair: it deserializes the call, executes the side effect, serializes the result, and hands control back. Round-trip until the model decides it has what it needs.

Once you frame it this way, the whole surface area follows. Tool schemas are IDL. Tool descriptions are docstrings on the IDL. The agent loop is an RPC client driver. Timeouts, retries, idempotency, error propagation, and observability — every problem that the RPC literature solved in the 1980s reappears here, with the twist that the caller is non-deterministic.
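The server half of that RPC framing can be sketched in a few lines. This is a minimal illustration, not a prescribed design: the `REGISTRY` dict, the `rpc_tool` decorator, and the `{"ok": ...}` result envelope are all hypothetical names chosen for the example, and `get_weather` returns canned data.

```python
import json
from typing import Any, Callable

# Hypothetical registry mapping tool names to Python callables.
REGISTRY: dict[str, Callable[..., Any]] = {}

def rpc_tool(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Register a function under its own name so the dispatcher can find it."""
    REGISTRY[fn.__name__] = fn
    return fn

@rpc_tool
def get_weather(location: str, unit: str = "celsius") -> dict:
    # Canned result; a real handler would call a weather service.
    return {"location": location, "temp": 18, "unit": unit}

def dispatch(name: str, raw_args: str) -> str:
    """Deserialize the call, execute it, and serialize the result or the error."""
    fn = REGISTRY.get(name)
    if fn is None:
        return json.dumps({"ok": False, "error": f"unknown tool: {name}"})
    try:
        args = json.loads(raw_args)
        result = fn(**args)
    except (TypeError, json.JSONDecodeError) as exc:
        # Covers malformed JSON, non-object payloads, and unexpected keywords.
        return json.dumps({"ok": False, "error": f"bad arguments: {exc}"})
    return json.dumps({"ok": True, "result": result})
```

The point of the envelope is that the caller always gets a serialized answer back, including when it addressed a function that does not exist or passed arguments the function does not accept.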

The distributed systems parallel

The strongest parallel is gRPC with a flaky client. The schema defines the contract; the server validates and executes; the client (the LLM) might issue redundant calls, call the wrong endpoint, or hallucinate a parameter. The defensive patterns translate one-for-one:

Idempotency keys — if the model calls charge_card twice because the first response arrived garbled in the conversation, your server needs to deduplicate. The same idempotency-key discipline you’d put on a payments API belongs on any tool that mutates state.
Timeouts and cancellation — a tool that blocks the loop is a tool that blows your latency budget. Wall-clock deadlines on each tool execution and a step cap on the whole loop are non-negotiable.
Circuit breakers — when a tool starts failing, the model can’t tell from a single error and will happily retry until your token budget is exhausted. The loop driver, not the model, owns the circuit breaker.
Schema evolution — adding a required field to a tool’s input schema is a breaking change for the model the same way it’s a breaking change for a typed client. The deployed prompt-cache hit rate and the eval suite are your two canaries.
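As one illustration of the patterns above, here is a rough sketch of a loop-owned circuit breaker. The class name, thresholds, and the wording of the error message are assumptions made for the example; the only idea taken from the text is that the driver, not the model, decides when a failing tool stops being called.

```python
import time

class CircuitBreaker:
    """Open after max_failures consecutive failures; allow a trial call after cooldown_s."""

    def __init__(self, max_failures=3, cooldown_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.clock = clock  # injectable so the breaker can be tested without sleeping
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown_s:
            # Half-open: permit one trial call; a single failure re-opens the breaker.
            self.opened_at = None
            self.failures = self.max_failures - 1
            return True
        return False

    def record(self, ok: bool) -> None:
        if ok:
            self.failures = 0
            return
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()

def guarded_call(breaker: CircuitBreaker, fn, *args, **kwargs) -> dict:
    """Run fn behind the breaker and return a dict shaped like a tool_result payload."""
    if not breaker.allow():
        return {"is_error": True, "content": "tool temporarily disabled; do not retry"}
    try:
        output = fn(*args, **kwargs)
    except Exception as exc:
        breaker.record(False)
        return {"is_error": True, "content": f"error: {exc}"}
    breaker.record(True)
    return {"is_error": False, "content": output}
```

When the breaker is open, the model still receives a result, just one that tells it to stop retrying, which is cheaper than letting it discover the outage one failed call at a time.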

The deeper parallel is continuation-passing style: the model emits a request and yields, the runtime executes, then continues the model from where it left off with the result spliced in. Each turn is a CPS frame; the conversation history is a serialized call stack. This is also why a sloppy compaction strategy can break tool use — if you drop a tool_use block but keep its tool_result, you have a return value with no call site, and most providers will hard-error.
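A compaction step can guard against this with a pairing check before the history goes back to the API. The sketch below assumes Anthropic-style messages stored as plain dicts (not SDK objects); `find_orphans` is a hypothetical helper name.

```python
def find_orphans(messages: list[dict]) -> dict[str, set[str]]:
    """Return tool_use ids with no result and tool_result ids with no call."""
    called: set[str] = set()
    answered: set[str] = set()
    for msg in messages:
        content = msg.get("content")
        if not isinstance(content, list):
            continue  # plain-string content carries no tool blocks
        for block in content:
            if block.get("type") == "tool_use":
                called.add(block["id"])
            elif block.get("type") == "tool_result":
                answered.add(block["tool_use_id"])
    return {"missing_result": called - answered, "missing_call": answered - called}
```

Run it on the compacted history just before the next request: a non-empty `missing_call` set is the "return value with no call site" case, and a non-empty `missing_result` set means a call was kept while its answer was dropped.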

Mechanics: the call/result loop

A single tool-using exchange looks like this on Anthropic’s API:

1. You send messages plus a tools array. The tools array is a list of objects with name, description, and input_schema.
2. The model responds with stop_reason: "tool_use" and an assistant message whose content is a list of blocks, typically a text block followed by one or more tool_use blocks, each with an id, a name, and a JSON input.
3. You execute each requested tool and append a user message whose content is a list of tool_result blocks. Each tool_result carries the matching tool_use_id, the result payload, and an optional is_error: true flag.
4. You re-call the API with the appended history. The model either calls more tools (back to step 2) or stops with stop_reason: "end_turn" and a final text response.

OpenAI’s surface is conceptually identical with different field names: tools[].type: "function", tools[].function.{name, description, parameters}, the model emits tool_calls on the assistant message, and you reply with messages of role: "tool" carrying a tool_call_id. The shapes are nearly isomorphic (Anthropic’s tool-use guide, OpenAI’s function-calling guide). The semantic differences worth memorizing:

Forcing a call. Anthropic’s tool_choice is {type: "auto" | "any" | "tool" | "none"}. any forces some tool; tool forces a specific one. OpenAI’s tool_choice accepts "auto" | "required" | "none" or {type: "function", function: {name}}; required is the equivalent of any.
Parallel tool calls. Both providers will emit multiple tool_use/tool_calls blocks in a single assistant message when the calls are independent. To opt out — useful for tools that mutate state and must be serialized — Anthropic exposes disable_parallel_tool_use: true (on the tool_choice object), OpenAI exposes top-level parallel_tool_calls: false.
Strict schemas. OpenAI’s strict mode (strict: true on a tool) and Anthropic’s strict tool use both schema-validate tool inputs at decode time, the same FSM-over-vocabulary trick used by schema-constrained structured output. Use it on tools whose inputs you’d rather not re-validate in application code.
Tool descriptions are not free. Every tool’s schema and description is serialized into a hidden system prompt on every call. A library of 50 tools can easily cost 5–10k tokens per turn. This is the wedge that motivates MCP and dynamic tool routing — a topic the Agents subtree will return to.
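Because the shapes are nearly isomorphic, the field-name mapping described above fits in two small functions. This is a sketch targeting the Chat Completions-style tool shape named in the text; the function names are hypothetical, and it deliberately ignores the parallel-call flags, which live in different places on each provider.

```python
def to_openai_tool(tool: dict) -> dict:
    """Convert an Anthropic-style tool definition to the OpenAI function-tool shape."""
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool.get("description", ""),
            "parameters": tool["input_schema"],  # same JSON Schema, different key
        },
    }

def to_openai_tool_choice(choice: dict):
    """Map an Anthropic-style tool_choice object to its OpenAI equivalent."""
    kind = choice["type"]
    if kind in ("auto", "none"):
        return kind
    if kind == "any":
        return "required"  # force some tool, model picks which
    if kind == "tool":
        return {"type": "function", "function": {"name": choice["name"]}}
    raise ValueError(f"unknown tool_choice type: {kind}")
```

Keeping one canonical tool definition and deriving the other provider's shape from it avoids two hand-maintained copies of the same schema drifting apart.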

Code: Python with the Anthropic SDK

A two-tool example — one for weather, one for time — using the official Anthropic SDK. Install: pip install anthropic.

python

import json
from anthropic import Anthropic

client = Anthropic()

TOOLS = [
    {
        "name": "get_weather",
        "description": (
            "Get the current weather for a given location. Returns temperature "
            "and conditions. Use only when the user asks about weather; do not "
            "use for forecasts more than 24 hours out."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "City and state, e.g. 'San Francisco, CA'",
                },
                "unit": {
                    "type": "string",
                    "enum": ["celsius", "fahrenheit"],
                    "description": "Temperature unit (default: celsius)",
                },
            },
            "required": ["location"],
        },
    },
    {
        "name": "get_current_time",
        "description": "Get the current local time for an IANA timezone like 'America/Los_Angeles'.",
        "input_schema": {
            "type": "object",
            "properties": {"tz": {"type": "string"}},
            "required": ["tz"],
        },
    },
]

def execute_tool(name: str, args: dict) -> str:
    if name == "get_weather":
        return json.dumps({"temp": 18, "unit": args.get("unit", "celsius"), "conditions": "fog"})
    if name == "get_current_time":
        return json.dumps({"tz": args["tz"], "time": "2026-05-18T11:14:00-07:00"})
    return json.dumps({"error": f"unknown tool: {name}"})

def run_loop(user_msg: str, max_steps: int = 8) -> str:
    messages = [{"role": "user", "content": user_msg}]
    for _ in range(max_steps):
        resp = client.messages.create(
            model="claude-opus-4-7",
            max_tokens=1024,
            tools=TOOLS,
            messages=messages,
        )
        messages.append({"role": "assistant", "content": resp.content})

        if resp.stop_reason != "tool_use":
            return "".join(b.text for b in resp.content if b.type == "text")

        results = []
        for block in resp.content:
            if block.type != "tool_use":
                continue
            try:
                output = execute_tool(block.name, block.input)
                results.append({"type": "tool_result", "tool_use_id": block.id, "content": output})
            except Exception as e:
                results.append({
                    "type": "tool_result", "tool_use_id": block.id,
                    "content": f"error: {e}", "is_error": True,
                })
        messages.append({"role": "user", "content": results})

    raise RuntimeError("max_steps exceeded")

Two implementation details worth flagging. First, the entire resp.content array is appended back as the assistant turn: the SDK accepts its own response block objects as input, and the API expects the tool_use blocks it emitted to come back unmodified so that each tool_result has a matching call. Second, tool errors are first-class: an is_error: true tool_result lets the model recover rather than crashing the loop. Most real failure modes (network timeouts, 4xx from upstream APIs, invalid arguments the model passed) belong here, not in raised exceptions.

The step cap is the load-bearing safety net. Without it, a confused model can loop indefinitely, especially on ambiguous tasks where each tool result triggers another retrieval. The JIT context-engineering article called out tool-loop drift as JIT’s silent failure mode; the step cap is the bluntest version of the no-progress detector recommended there.
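A slightly less blunt guard than the step cap is to notice when the model keeps issuing the same call. The sketch below is one possible shape for that, with a hypothetical `NoProgressDetector` class and an arbitrary repeat threshold; it treats two calls as identical when the tool name and the canonicalized arguments match.

```python
import json
from collections import Counter

class NoProgressDetector:
    """Flag a tool call once it has been issued more than max_repeats times."""

    def __init__(self, max_repeats: int = 2):
        self.max_repeats = max_repeats
        self.seen: Counter = Counter()

    def observe(self, name: str, args: dict) -> bool:
        """Record one call; return True when the loop looks stuck on it."""
        # sort_keys makes the key independent of argument ordering.
        key = (name, json.dumps(args, sort_keys=True))
        self.seen[key] += 1
        return self.seen[key] > self.max_repeats
```

Inside a loop like run_loop, the driver could call `observe(block.name, block.input)` for each tool_use block and, on True, either return an is_error result telling the model the call is repeating or abort the loop outright.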

Code: TypeScript with the Vercel AI SDK

The Vercel AI SDK’s generateText wraps the whole loop. Install: npm install ai @ai-sdk/anthropic zod.

typescript

import { anthropic } from "@ai-sdk/anthropic";
import { generateText, tool, stepCountIs } from "ai";
import { z } from "zod";

const getWeather = tool({
  description:
    "Get current weather for a location. Use only when the user asks about weather; do not use for multi-day forecasts.",
  inputSchema: z.object({
    location: z.string().describe("City and state, e.g. 'San Francisco, CA'"),
    unit: z.enum(["celsius", "fahrenheit"]).default("celsius"),
  }),
  execute: async ({ location, unit }) => {
    return { location, temp: 18, unit, conditions: "fog" };
  },
});

const getCurrentTime = tool({
  description: "Get current local time for an IANA timezone.",
  inputSchema: z.object({ tz: z.string() }),
  execute: async ({ tz }) => ({ tz, time: "2026-05-18T11:14:00-07:00" }),
});

export async function ask(prompt: string) {
  const { text, steps, usage } = await generateText({
    model: anthropic("claude-opus-4-7"),
    tools: { getWeather, getCurrentTime },
    stopWhen: stepCountIs(8),
    prompt,
  });
  console.log(`steps=${steps.length} tokens=${usage.totalTokens}`);
  return text;
}

The SDK’s tool() helper bundles description, Zod input schema, and an execute function. stopWhen: stepCountIs(8) is the explicit step cap, and it is also what turns the loop on: in AI SDK 5, generateText defaults to stepCountIs(1), so without stopWhen the call returns after the first step and the tool results are never fed back to the model. Defaults have shifted between SDK versions, so set the cap explicitly rather than relying on one. With the cap in place, the runtime auto-executes each tool call, splices the result back into the conversation, and re-invokes the model until either no tool is called or the stop condition fires.

The single-line stopWhen is the place to put any non-trivial loop policy: stop on a specific tool (hasToolCall("submit_final_answer")), stop after wall-clock T, or compose conditions. Treat it the same way you’d treat the deadline propagation in a gRPC call chain.

Trade-offs, failure modes, gotchas

The token tax on tools is invisible until it isn’t. Every tool in the tools array is serialized into a system prompt on every call, even when the user asks a question that has nothing to do with any tool. With 30+ tools and verbose descriptions, you can burn 8–12k tokens before the user message starts. The mitigations are real but bounded: trim descriptions, consolidate fine-grained tools into action-parameterized ones (per Anthropic’s tool-writing guidance), and turn to MCP-style dynamic tool routing once N > ~30.
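To make the tax visible before it shows up on the bill, a crude estimate is often enough. The sketch below uses a characters-divided-by-four heuristic, which is an assumption and not how any provider actually tokenizes or serializes tools; treat the number as an order-of-magnitude signal and use the provider's own token counting for real budgets.

```python
import json

def estimate_tool_tokens(tools: list[dict], chars_per_token: float = 4.0) -> int:
    """Rough per-request overhead of a tools array (heuristic, not a real tokenizer)."""
    serialized = json.dumps(tools, separators=(",", ":"))
    return int(len(serialized) / chars_per_token)

def heaviest_tools(tools: list[dict], top: int = 5) -> list[tuple[str, int]]:
    """Rank tools by estimated size so the most expensive descriptions get trimmed first."""
    sized = [(t["name"], estimate_tool_tokens([t])) for t in tools]
    return sorted(sized, key=lambda pair: pair[1], reverse=True)[:top]
```

Running `heaviest_tools` over the full tool list is a quick way to find the two or three verbose descriptions that account for most of the overhead.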

Parallel tool calls assume independence. If the model emits transfer_funds(A→B) and transfer_funds(B→C) in the same turn, your runtime executes them concurrently — and the model has no notion of ordering between them. For state-touching tools, force serial execution: disable_parallel_tool_use: true on Anthropic, parallel_tool_calls: false on OpenAI. The default is “concurrent unless you opt out,” which is the wrong default for most production CRUD tools.
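Where the provider flag is not an option, the runtime can enforce ordering itself. The following is a conservative sketch: if any call in the turn touches state, run the whole batch serially in the order the model emitted it; otherwise fan out on a thread pool. The `run_calls` name, the `(id, name, args)` tuple shape, and the `mutating` set are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor

def run_calls(calls, execute, mutating):
    """Execute one turn's tool calls.

    calls:    list of (call_id, tool_name, args) in the order the model emitted them
    execute:  callable(tool_name, args) -> output
    mutating: set of tool names that change state
    """
    if any(name in mutating for _, name, _ in calls):
        # At least one write: keep the emitted order and run everything serially.
        return {cid: execute(name, args) for cid, name, args in calls}
    # Read-only batch: safe to run concurrently.
    with ThreadPoolExecutor(max_workers=max(1, len(calls))) as pool:
        futures = {cid: pool.submit(execute, name, args) for cid, name, args in calls}
        return {cid: fut.result() for cid, fut in futures.items()}
```

Serializing the entire batch, rather than only the writes, is a deliberate choice here: it avoids reordering a read relative to a write the model emitted before it.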

Tool selection collapses past ~30 tools. Once the tool count grows large, accuracy on tool selection drops faster than you’d expect — the model has trouble disambiguating between similar tools, and descriptions start interfering with each other. The standard fix is two-stage retrieval: embed the tool descriptions, retrieve a top-k subset relevant to the user’s turn, and only pass those into the API. Anthropic ships a built-in tool search tool for this; the Agents subtree will cover the pattern in depth.
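The two-stage pattern can be illustrated without an embedding service. The sketch below substitutes bag-of-words cosine similarity for real embeddings, which is an assumption made to keep the example self-contained; in practice the `_vec` function would be replaced by a call to an embedding model and the scores by vector similarity.

```python
import math
import re
from collections import Counter

def _vec(text: str) -> Counter:
    """Stand-in for an embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def select_tools(query: str, tools: list[dict], k: int = 5) -> list[dict]:
    """Return the k tools whose name and description best match the user's turn."""
    q = _vec(query)

    def score(tool: dict) -> float:
        text = tool["name"].replace("_", " ") + " " + tool.get("description", "")
        return _cosine(q, _vec(text))

    return sorted(tools, key=score, reverse=True)[:k]
```

Only the returned subset goes into the tools array for that request, so both the token overhead and the disambiguation problem scale with k rather than with the size of the full library.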

Hallucinated arguments. The model occasionally invents tool inputs that aren’t in the schema (get_weather(zipcode: "94103") when the schema only accepts location). Strict mode catches this at decode time. Without strict, validate every argument before execution and return a clear error in tool_result — the model is good at correcting on a single round-trip given a precise error message; don’t crash the loop with a Python KeyError.
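Without strict mode, the validation step can look something like the sketch below. It checks only a small subset of JSON Schema (required fields, unknown properties, primitive types, enums) and the helper names are hypothetical; a production system would more likely use a full JSON Schema validator.

```python
_JSON_TYPES = {
    "string": str, "integer": int, "number": (int, float),
    "boolean": bool, "object": dict, "array": list,
}

def _type_ok(value, type_name) -> bool:
    expected = _JSON_TYPES.get(type_name)
    if expected is None:
        return True  # missing or unrecognized type: not checked in this sketch
    if isinstance(value, bool) and type_name != "boolean":
        return False  # bool is an int subclass in Python, but not a JSON number
    return isinstance(value, expected)

def validate_args(schema: dict, args: dict) -> list[str]:
    """Return human-readable problems; an empty list means the call may proceed."""
    errors = []
    props = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in args:
            errors.append(f"missing required argument '{name}'")
    for name, value in args.items():
        if name not in props:
            errors.append(f"unknown argument '{name}'; allowed: {sorted(props)}")
            continue
        spec = props[name]
        if not _type_ok(value, spec.get("type")):
            errors.append(f"argument '{name}' must be of type {spec.get('type')}")
        elif "enum" in spec and value not in spec["enum"]:
            errors.append(f"argument '{name}' must be one of {spec['enum']}")
    return errors

def error_result(tool_use_id: str, errors: list[str]) -> dict:
    """Package validation problems as an Anthropic-style is_error tool_result block."""
    return {
        "type": "tool_result",
        "tool_use_id": tool_use_id,
        "content": "invalid arguments: " + "; ".join(errors),
        "is_error": True,
    }
```

Listing the allowed argument names in the error is the part that does the work: it gives the model exactly what it needs to correct the call on the next turn.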

tool_choice: "any" prefills the assistant. When you force a tool call, the API prefills the assistant turn with the start of a tool_use block. The model cannot precede it with a natural-language explanation or reasoning, and forcing any together with extended thinking is currently rejected on Claude. If you want the model to “think first then call,” use auto with explicit instructions, or call without forcing and check stop_reason.

Idempotency at the runtime, not at the model. If a tool_result doesn’t arrive cleanly, the easy thing is to retry — and the model will happily call the same tool again on the next turn if the conversation is replayed. Any tool that mutates state needs an idempotency key at the runtime layer. Treat the LLM as an at-least-once caller, not an exactly-once one.
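One way to get there is to derive the key from the call itself. The sketch below is an in-memory illustration only: `IdempotentExecutor` and the `scope` string (for example a conversation id plus the user-turn index) are hypothetical, and a real deployment would keep the key-to-result mapping in a durable store shared by all workers.

```python
import hashlib
import json

class IdempotentExecutor:
    """Execute each (scope, tool, args) combination at most once and replay the result."""

    def __init__(self, execute):
        self._execute = execute  # callable(tool_name, args) -> output
        self._results: dict[str, object] = {}

    def run(self, scope: str, name: str, args: dict):
        # Canonical JSON so argument ordering does not change the key.
        payload = json.dumps([scope, name, args], sort_keys=True)
        key = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        if key not in self._results:
            self._results[key] = self._execute(name, args)
        return self._results[key]
```

The scope matters: without it, a user who legitimately asks for the same charge twice in separate turns would have the second request silently deduplicated.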

Schema strictness fights schema expressiveness. OpenAI’s strict mode requires every property to be listed in required (use a union with null for optional fields) and requires additionalProperties: false on every object. Existing schemas with oneOf, recursive references, or unbounded objects may not compile. The same subset that bit you in the structured-output article bites again here. The production answer is usually to define the tool input as the smallest legal envelope and validate the richer constraints in your execute function.
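For flat schemas, the rewrite into the strict subset is mechanical. The sketch below handles only top-level properties (it does not recurse into nested objects or arrays), and adding null to an optional field's enum is an assumption based on plain JSON Schema semantics; check the provider's documentation for the exact subset it accepts.

```python
import copy

def to_strict(schema: dict) -> dict:
    """Return a copy of a flat object schema rewritten for strict mode:
    every property required, optional ones made nullable, extra properties forbidden."""
    out = copy.deepcopy(schema)
    props = out.get("properties", {})
    originally_required = set(out.get("required", []))
    for name, spec in props.items():
        if name in originally_required:
            continue
        declared = spec.get("type")
        if isinstance(declared, str):
            spec["type"] = [declared, "null"]
        elif isinstance(declared, list) and "null" not in declared:
            spec["type"] = declared + ["null"]
        if "enum" in spec and None not in spec["enum"]:
            spec["enum"] = spec["enum"] + [None]  # null must be an allowed enum value too
    out["required"] = list(props)
    out["additionalProperties"] = False
    return out
```

With this shape the model must always emit every field, and a null stands in for "not provided", so the execute function is where defaults get applied.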

Tool descriptions are prompt code, version them. Edits to a tool’s description change the model’s call rate the same way edits to a prompt change behavior. They belong in version control and behind eval gates — treat description churn as production-affecting change, the same way you’d treat a prompt edit.
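A cheap way to make description churn visible is to fingerprint the tool set and log the fingerprint next to every eval run and production request. The sketch below is one possible approach; `tools_fingerprint` and the 12-character truncation are arbitrary choices for the example.

```python
import hashlib
import json

def tools_fingerprint(tools: list[dict]) -> str:
    """Short, order-independent hash of a tool set's names, descriptions, and schemas."""
    canonical = json.dumps(
        sorted(tools, key=lambda t: t["name"]),
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
```

If the fingerprint in production differs from the one the eval suite last passed against, some description or schema changed without going through the gate.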