Tool Use

Also known as: tool calling, tool invocation, actions

TL;DR

Tool use is the pattern where an LLM emits a structured request to call an external function — a search API, a code runner, a database query — and the runtime executes it and returns the result.

Tool use is the mechanism by which a language model reaches outside its own weights. The model emits a structured request — “call function search_docs with argument query='context window benchmarks'” — the runtime executes that function, and the result is fed back into the model’s context. With this primitive, the LLM stops being a closed text generator and starts being a system that can act.

Figure: TOOL USE · MODEL EMITS, RUNTIME EXECUTES, MODEL READS. One call out, one result back, one grounded answer.

  • Turn 1 (user, prompt): What’s the refund policy on order #A2914?
  • Turn 2 (assistant, reasoning): I need to look this order up. Calling lookup_order with id="A2914".
  • Turn 3 (assistant, tool_call, name + args): { "name": "lookup_order", "arguments": { "order_id": "A2914", "fields": ["status", "refundable_until"] } }
  • Turn 4 (tool, result, runtime return): { "status": "delivered", "refundable_until": "2026-06-12", "amount": "$84.20" }
  • Turn 5 (assistant, reasoning): Delivered, refundable until 2026-06-12. Compose the answer for the user.
  • Turn 6 (assistant, final answer): Refunds on order #A2914 are open until June 12, 2026.

The shape of a tool call

A tool exposed to the model has three things:

  • A name the model uses to invoke it
  • A schema for arguments (typically JSON Schema)
  • A description in natural language explaining when the tool should be used
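As a minimal sketch of those three parts, a tool definition can be written as a plain dictionary. The field names ("name", "description", "parameters") and the top_k argument are illustrative assumptions, not any particular provider's required format; only search_docs and its query argument come from the example above.

```python
# Hypothetical tool definition: a name, an argument schema, and a description.
search_docs_tool = {
    # The name the model uses to invoke the tool.
    "name": "search_docs",
    # Natural-language guidance on when the tool should be used.
    "description": (
        "Search the internal documentation. Use when the user asks "
        "a question the docs are likely to answer."
    ),
    # JSON Schema describing the arguments (top_k is an invented extra field).
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Search query text"},
            "top_k": {"type": "integer", "minimum": 1, "maximum": 20},
        },
        "required": ["query"],
        "additionalProperties": False,
    },
}
```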

At inference, the model produces either a normal text response or a tool-call request — name plus arguments, in a structured format the runtime parses. The runtime executes the underlying function, captures the result, and either returns it to the user or feeds it back to the model as the next turn’s input.
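The branch described above can be sketched in a few lines. This assumes a simplified, hypothetical output shape (a dict with "text" and an optional "tool_call") rather than a real provider response, and lookup_order is a stand-in that returns canned data.

```python
import json

def lookup_order(order_id: str) -> dict:
    # Stand-in for a real backend call; returns canned data.
    return {"order_id": order_id, "status": "delivered"}

# Hypothetical registry mapping tool names to Python callables.
TOOLS = {"lookup_order": lookup_order}

def handle_model_output(output: dict) -> dict:
    """Route one model output: plain text goes to the user, a tool call gets executed."""
    call = output.get("tool_call")
    if call is None:
        # Normal text response: hand it straight to the user.
        return {"to": "user", "content": output["text"]}
    # Tool-call request: run the named function with the parsed arguments.
    result = TOOLS[call["name"]](**call["arguments"])
    # Feed the serialized result back to the model as the next turn's input.
    return {"to": "model", "content": json.dumps(result)}
```

Whether a result goes back to the model or directly to the user is a design choice; this sketch always returns tool results to the model.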

Why it’s the cornerstone of agents

Almost every agent capability resolves to tool use:

Capabilities reduced to tool calls
  • Retrieval is “call the search tool with a generated query”
  • Code execution is “call the Python runner with this snippet”
  • Memory writes are “call the memory-store tool with this fact”
  • Multi-agent delegation is “call sub-agent X with this sub-task”
  • Browser automation is “call the click/type/scroll tool on a target”
  • Database mutation is “call the SQL-execute tool against this connection”

The agent loop is mostly just a while-loop around this primitive: model emits a tool call, runtime executes, model sees the result, repeat.
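A minimal sketch of that loop, assuming a hypothetical model callable that takes the message list and returns either a final "content" string or a "tool_call" dict. The scripted_model below is a fake used only to make the sketch runnable, and the step cap is an added safeguard rather than part of the pattern itself.

```python
from typing import Callable

def run_agent(model: Callable[[list], dict], tools: dict,
              user_prompt: str, max_steps: int = 5) -> str:
    """Loop: model emits a tool call, runtime executes, model sees the result."""
    messages = [{"role": "user", "content": user_prompt}]
    for _ in range(max_steps):  # bounded so a confused model cannot loop forever
        reply = model(messages)
        if "tool_call" not in reply:
            return reply["content"]  # final answer, exit the loop
        call = reply["tool_call"]
        result = tools[call["name"]](**call["arguments"])
        messages.append({"role": "assistant", "tool_call": call})
        messages.append({"role": "tool", "content": str(result)})
    raise RuntimeError("agent did not finish within max_steps")

def scripted_model(messages: list) -> dict:
    # Fake model: calls the "add" tool once, then answers from the tool result.
    if messages[-1]["role"] == "tool":
        return {"content": f"The sum is {messages[-1]['content']}."}
    return {"tool_call": {"name": "add", "arguments": {"a": 2, "b": 3}}}

answer = run_agent(scripted_model, {"add": lambda a, b: a + b}, "What is 2 + 3?")
```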

At the API level, a tool call is a structured field in the assistant’s response, not free-text. OpenAI’s shape is roughly:

{
  "role": "assistant",
  "content": null,
  "tool_calls": [{
    "id": "call_abc",
    "type": "function",
    "function": { "name": "search_docs", "arguments": "{\"query\":\"...\"}" }
  }]
}

The runtime parses arguments as JSON, dispatches to the named function, and returns a tool role message with the result keyed to tool_call_id: "call_abc". The model’s next turn sees that tool result inline.
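The parse, dispatch, and reply steps can be sketched as follows, using the message shape shown above. The REGISTRY dict and the stub search_docs implementation are assumptions made for illustration.

```python
import json

def search_docs(query: str) -> list:
    # Stub implementation; a real tool would query a search backend.
    return [f"stub result for: {query}"]

# Hypothetical registry mapping tool names to Python callables.
REGISTRY = {"search_docs": search_docs}

def execute_tool_calls(assistant_message: dict) -> list:
    """Turn each tool call in an assistant message into a tool-role message."""
    tool_messages = []
    for call in assistant_message.get("tool_calls") or []:
        fn = call["function"]
        args = json.loads(fn["arguments"])  # arguments arrive as a JSON string
        result = REGISTRY[fn["name"]](**args)
        tool_messages.append({
            "role": "tool",
            "tool_call_id": call["id"],  # ties the result to its call
            "content": json.dumps(result),
        })
    return tool_messages
```

The returned messages are appended to the conversation before the next model call, which is how the model sees the result inline.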

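Constrained decoding under the hood is what makes the JSON reliably parseable. Providers either run a grammar over the function-arguments slot during generation, or post-process the output, reject invalid arguments, and retry the model. Either way, the aim is that the arguments conform to the schema by the time the runtime sees the call. How strictly this is enforced depends on the provider and its settings, so a runtime should still validate arguments before dispatching.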
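The post-process-and-retry variant can be sketched on the runtime side. This is a hand-rolled check covering only required fields, basic types, and enums, not a full JSON Schema validator; the generate callable (which receives the previous error as feedback) is a hypothetical stand-in for a model call.

```python
import json

TYPE_MAP = {"string": str, "integer": int, "number": (int, float), "boolean": bool}

def validate_args(raw: str, schema: dict) -> tuple:
    """Return (args, None) if the arguments pass, else (None, error message)."""
    try:
        args = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg}"
    if not isinstance(args, dict):
        return None, "arguments must be a JSON object"
    for name in schema.get("required", []):
        if name not in args:
            return None, f"missing required field: {name}"
    for name, value in args.items():
        spec = schema.get("properties", {}).get(name)
        if spec is None:
            return None, f"unknown field: {name}"
        expected = TYPE_MAP.get(spec.get("type"))
        # bool is a subclass of int in Python, so reject it for non-boolean types.
        bool_mismatch = isinstance(value, bool) and spec.get("type") != "boolean"
        if expected and (bool_mismatch or not isinstance(value, expected)):
            return None, f"field {name} should be of type {spec['type']}"
        if "enum" in spec and value not in spec["enum"]:
            return None, f"field {name} must be one of {spec['enum']}"
    return args, None

def call_with_retry(generate, schema: dict, max_attempts: int = 3) -> dict:
    """Ask for arguments until they validate, passing the last error back."""
    feedback = None
    for _ in range(max_attempts):
        args, error = validate_args(generate(feedback), schema)
        if error is None:
            return args
        feedback = error
    raise ValueError(f"no valid arguments after {max_attempts} attempts: {feedback}")
```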

Where it breaks

  • Tool selection at scale. With three tools, the model nails it. With thirty, it starts confusing similar tools, pairing the right tool with the wrong arguments, or declining to call any tool when one is clearly needed. Production agents either constrain the toolset per turn or route through a specialized tool-selection model.
  • Argument hallucination. The model invents a tool argument that the underlying API rejects: wrong type, missing field, made-up enum value. Strict schema enforcement helps, but doesn’t eliminate it.
  • Result handling. Tool outputs can be huge. Pasting raw API responses into context blows the window and degrades attention; production systems compress or summarize between turns.
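For the result-handling problem, one simple mitigation is to clip oversized tool output before it enters the context. The sketch below keeps the head and tail and drops the middle; clip_tool_result and its character budget are illustrative assumptions, and a production system might summarize instead of truncating.

```python
def clip_tool_result(text: str, max_chars: int = 2000) -> str:
    """Keep the start and end of an oversized tool result, dropping the middle."""
    if len(text) <= max_chars:
        return text
    half = max(1, max_chars // 2)  # at least one char per side, avoids text[-0:]
    head, tail = text[:half], text[-half:]
    omitted = len(text) - len(head) - len(tail)
    # Leave a marker so the model knows content was removed.
    return f"{head}\n[... {omitted} characters omitted ...]\n{tail}"
```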

Specialized models in this loop

Tool selection is a narrow classification task: given the query and the catalog of tools, pick the right one. That’s the shape a small specialized model excels at, at a fraction of the LLM’s per-call cost. The same applies to reranking the results that come back from a search tool. The agent stays generalist; the high-frequency narrow calls inside the loop go to specialists.
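To show where such a selector sits, here is a sketch that shortlists tools before the LLM sees the catalog. The word-overlap score is a deliberately crude placeholder for a trained selection model, and the catalog entries are invented for illustration.

```python
def shortlist_tools(query: str, catalog: dict, k: int = 3) -> list:
    """Return the k tool names whose descriptions best match the query."""
    query_words = set(query.lower().split())

    def score(name: str) -> int:
        # Placeholder scoring: count words shared by query and description.
        return len(query_words & set(catalog[name].lower().split()))

    # sorted() is stable, so ties keep their catalog order.
    return sorted(catalog, key=score, reverse=True)[:k]

# Hypothetical catalog mapping tool names to descriptions.
catalog = {
    "lookup_order": "look up the status of an order by id",
    "search_docs": "search the product documentation",
    "send_email": "send an email to a customer",
}
shortlisted = shortlist_tools("what is the status of my order", catalog, k=2)
```

Only the shortlisted tools would then be exposed to the LLM for that turn, which is one way to constrain the toolset.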

Go further

Why not just let the LLM 'do' the task itself?

On their own, LLMs can't access fresh data, can't run code reliably, can't reach your private APIs, and can't cause side effects. Tool use is how the model reaches outside its weights: for retrieval, computation, communication, or any action whose result isn't already in its context.

What's the dominant failure mode of tool use?

Tool selection. With more than ~10 available tools, frontier LLMs start picking the wrong one — sometimes the second-best, sometimes one that's plainly irrelevant. Tool selection is itself a narrow classification task and a strong candidate for a specialized small model in front of the LLM.

How is this different from RAG?

RAG is one specific tool-use pattern: 'call a retrieval function, splice results into the prompt.' Tool use is the more general primitive — the LLM can call any function exposed to it, not just search. Modern agents combine retrieval, code execution, web fetching, and domain APIs through the same tool-use surface.
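The RAG pattern described here can be sketched as a fixed two-step pipeline: call a retrieval function, then splice its output into the prompt. The retrieve function uses naive word overlap purely as a stand-in for a real retriever, and the prompt wording is an assumption.

```python
def retrieve(query: str, corpus: list, k: int = 2) -> list:
    """Toy retriever: rank documents by words shared with the query."""
    words = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda doc: len(words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def build_rag_prompt(question: str, corpus: list) -> str:
    """Call the retrieval function, then splice the results into the prompt."""
    passages = retrieve(question, corpus)
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

In a general tool-use setup, retrieve would be just one entry among the tools the model can choose to call, rather than a step that always runs.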
