Function Calling

Also known as: function call, tool API, structured tool calling

TL;DR

Function calling is the structured-API mechanism that providers (OpenAI, Anthropic, Google) expose for tool use: you give the model a JSON schema describing each function, and the model responds with a typed call request the runtime can execute.

Function calling is the structured API surface that lets an LLM invoke external functions. You describe each function with a JSON schema — name, parameters, types, descriptions — pass it alongside the user message, and the model can respond either with normal text or with a typed call request: function name plus a JSON object of arguments. The runtime parses the call, executes the underlying code or API, and returns the result on the next turn.
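The parse, execute, return cycle described above can be sketched in a few lines of Python. This is a minimal illustration, not any provider's SDK: the call request is assumed to arrive as a plain dict with a name and a JSON-encoded arguments string, and search_docs, TOOLS, and handle_call are hypothetical names chosen for the example.

```python
import json

# Hypothetical tool implementation; stands in for real search code.
def search_docs(query: str, top_k: int = 10) -> list[str]:
    return [f"result {i} for {query!r}" for i in range(1, top_k + 1)]

# Registry mapping the names the model may call to Python callables.
TOOLS = {"search_docs": search_docs}

def handle_call(call: dict) -> dict:
    """Execute one call request and package the outcome for the next turn."""
    name = call["name"]
    if name not in TOOLS:
        return {"name": name, "error": f"unknown function: {name}"}
    try:
        args = json.loads(call["arguments"])  # arguments assumed to be JSON text
        result = TOOLS[name](**args)
    except (json.JSONDecodeError, TypeError) as exc:
        # Malformed JSON or arguments that do not fit the function signature.
        return {"name": name, "error": str(exc)}
    return {"name": name, "result": result}

reply = handle_call(
    {"name": "search_docs", "arguments": '{"query": "refund policy", "top_k": 2}'}
)
```

Returning errors as data rather than raising lets the runtime hand the failure back to the model on the next turn, where it can correct the call.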

The schema

A function definition looks roughly like this on every major provider:

{
  "name": "search_docs",
  "description": "Search the customer's internal documentation for a query.",
  "parameters": {
    "type": "object",
    "properties": {
      "query": { "type": "string", "description": "Search query" },
      "top_k": { "type": "integer", "default": 10 }
    },
    "required": ["query"]
  }
}

The model is fine-tuned during post-training to honor the schema — to emit calls only with valid argument shapes, only to functions that exist, only when a tool call is actually appropriate. Names, descriptions, and required-field lists are all load-bearing for the model’s decisions.

Provider differences

OpenAI, Anthropic, and Google all expose function-calling APIs, with provider-specific message-formatting differences:

OpenAI uses a tools parameter and tool_calls in the assistant message
Anthropic uses tools and emits tool_use content blocks
Google uses function_declarations and function_call parts

The schema vocabulary is roughly compatible, so cross-provider portability is achievable but rarely free. Most production agent harnesses (LangChain, LlamaIndex, the MCP ecosystem) abstract over the differences.
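A rough sketch of what such an abstraction layer does: map each provider's reply shape onto one internal form. The dict layouts below are simplified assumptions based on the field names listed above (real SDKs return typed objects with more fields), and normalize_tool_calls is a hypothetical helper.

```python
import json

def normalize_tool_calls(provider: str, message: dict) -> list[dict]:
    """Return [{'name': ..., 'arguments': {...}}] from a provider-shaped reply."""
    if provider == "openai":
        # Assistant message with tool_calls; arguments is a JSON string.
        return [
            {"name": c["function"]["name"],
             "arguments": json.loads(c["function"]["arguments"])}
            for c in message.get("tool_calls") or []
        ]
    if provider == "anthropic":
        # Content blocks; tool_use blocks carry an already-parsed input object.
        return [
            {"name": b["name"], "arguments": b["input"]}
            for b in message.get("content", [])
            if b.get("type") == "tool_use"
        ]
    if provider == "google":
        # Parts; a function_call part carries a name and an args object.
        return [
            {"name": p["function_call"]["name"],
             "arguments": dict(p["function_call"]["args"])}
            for p in message.get("parts", [])
            if "function_call" in p
        ]
    raise ValueError(f"unsupported provider: {provider}")
```

Once every reply is reduced to name plus arguments, the rest of the harness (validation, dispatch, retries) can stay provider-agnostic.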

What “good” function descriptions look like

The model picks tools mostly off the description and parameter docstrings. Good descriptions:

Say when to use the tool, not just what it does (“Use this when the user asks about company-internal information that wouldn’t be in the model’s training data”)
Disambiguate from neighbors (“Unlike search_web, this only searches internal docs”)
Note constraints the model otherwise wouldn’t know (“Returns at most 50 results; for broader queries, use list_documents”)

The classic anti-pattern is one-line descriptions copy-pasted from internal API docs. The model doesn’t know what your acronyms mean.
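To make the contrast concrete, here are two definitions of the same tool written as Python dicts. The weak description and its abbreviations are invented for illustration; the strong one simply assembles the three example phrases from the list above.

```python
# Weak: a one-liner lifted from internal API docs (abbreviations are made up).
weak_tool = {
    "name": "search_docs",
    "description": "Queries the KB idx.",
    "parameters": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}

# Stronger: says when to use it, how it differs from neighbors, and its limits.
strong_tool = {
    "name": "search_docs",
    "description": (
        "Search the customer's internal documentation. "
        "Use this when the user asks about company-internal information "
        "that wouldn't be in the model's training data. "
        "Unlike search_web, this only searches internal docs. "
        "Returns at most 50 results; for broader queries, use list_documents."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Search query"},
        },
        "required": ["query"],
    },
}
```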

The model’s tool selection is more of a retrieval task than a reasoning task. Description quality and disambiguation between neighbors dominate; the parameter schema is the contract, not the persuasion.

Strict mode (OpenAI’s structured outputs, and comparable schema enforcement for tool inputs on other providers) compiles your JSON schema into a grammar over the tokenizer’s vocabulary, then masks the logits at every decoding step so that the only sampleable tokens are ones consistent with the grammar. A required string field with an enum of three values cannot decode to a fourth value, because the tokens that would spell it have probability zero. This eliminates the entire class of “model invented an enum value” bugs. The costs are slightly slower decoding (the mask is computed per token) and occasional quality regressions when the grammar forces the model into a continuation it considers unlikely. For high-stakes function calls (writes, financial transactions), treat strict mode as mandatory; for read-only queries, lenient mode plus a JSON validator on the runtime side is usually fine.
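The masking step can be illustrated with a toy decoder. Everything here is an assumption made for the sketch: a seven-token vocabulary, a two-value enum, greedy decoding, and a simple prefix check standing in for a compiled grammar. Production systems work over the real tokenizer and a real grammar engine.

```python
import math

# Toy vocabulary and the enum values (as JSON strings) the schema permits.
VOCAB = ['"', "open", "closed", "pend", "ing", "op", "en"]
ALLOWED = ['"open"', '"closed"']

def allowed_mask(prefix: str) -> list[bool]:
    """True for tokens that keep prefix + token a prefix of some allowed value."""
    return [any(v.startswith(prefix + tok) for v in ALLOWED) for tok in VOCAB]

def constrained_decode(logits_fn) -> str:
    """Greedy decoding where disallowed tokens get a logit of -inf."""
    out = ""
    while out not in ALLOWED:
        mask = allowed_mask(out)
        if not any(mask):
            raise RuntimeError("vocabulary cannot complete any allowed value")
        logits = logits_fn(out)
        masked = [l if ok else -math.inf for l, ok in zip(logits, mask)]
        best = max(range(len(VOCAB)), key=masked.__getitem__)
        out += VOCAB[best]
    return out

# A stand-in "model" that strongly prefers to spell the invalid value "pending".
def biased_logits(prefix: str) -> list[float]:
    return [0.0, 1.0, 0.5, 5.0, 4.0, 0.2, 0.1]

value = constrained_decode(biased_logits)
```

Even though the stand-in model assigns its highest logit to "pend", that token is masked at every step, so the decoder can only produce one of the two enum values.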

Where this gets fragile

Argument-shape mismatches are common: the model invents an enum value, drops a required field, or sends a string where a number was expected. Strict-mode schemas (JSON-schema validation enforced at decode time) and retry-with-error-message patterns are the standard mitigations.
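A minimal sketch of the retry-with-error-message pattern follows. The validator is hand-rolled and covers only required fields, basic types, and enums (a full JSON Schema validator would do more), and ask_model is a hypothetical callable that takes the previous error message (or None) and returns the model's next argument dict.

```python
SCHEMA = {
    "type": "object",
    "properties": {
        "query": {"type": "string"},
        "status": {"type": "string", "enum": ["open", "closed"]},
        "top_k": {"type": "integer"},
    },
    "required": ["query"],
}

PY_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}

def validate_args(args: dict, schema: dict) -> list[str]:
    """Return human-readable problems; an empty list means the call is valid."""
    errors = [f"missing required field '{f}'"
              for f in schema.get("required", []) if f not in args]
    for key, value in args.items():
        spec = schema["properties"].get(key)
        if spec is None:
            errors.append(f"unexpected field '{key}'")
            continue
        # bool is a subclass of int in Python, so reject it for non-boolean fields.
        wrong_bool = isinstance(value, bool) and spec["type"] != "boolean"
        if wrong_bool or not isinstance(value, PY_TYPES[spec["type"]]):
            errors.append(f"field '{key}' must be of type {spec['type']}")
        elif "enum" in spec and value not in spec["enum"]:
            errors.append(f"field '{key}' must be one of {spec['enum']}")
    return errors

def call_with_retry(ask_model, schema: dict, max_attempts: int = 3) -> dict:
    """Ask for arguments, feeding validation errors back until they pass."""
    feedback = None
    for _ in range(max_attempts):
        args = ask_model(feedback)
        errors = validate_args(args, schema)
        if not errors:
            return args
        feedback = "Invalid arguments: " + "; ".join(errors)
    raise ValueError(feedback)
```

Capping the attempts matters: without a limit, a model that keeps repeating the same mistake turns a validation failure into an unbounded loop.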

Token cost is the other failure axis. Every tool definition lives in the prompt on every turn. With 50 tools and verbose descriptions, you pay 5K+ tokens of overhead per turn before the user has said anything. Production agents prune the catalog to per-turn-relevant tools, often via a specialized tool-selection model.
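As a rough illustration of catalog pruning, the sketch below ranks tools by word overlap with the user message and keeps the top few. Word overlap is only a stand-in for the tool-selection model mentioned above, and prune_catalog, the tool entries, and the keep parameter are hypothetical.

```python
import re

def words(text: str) -> set[str]:
    """Lowercased alphanumeric tokens; underscores split names into words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def prune_catalog(tools: list[dict], user_message: str, keep: int = 5) -> list[dict]:
    """Keep the `keep` tools whose name and description best overlap the message."""
    query = words(user_message)
    ranked = sorted(
        tools,
        key=lambda t: len(query & words(t["name"] + " " + t["description"])),
        reverse=True,
    )
    return ranked[:keep]

catalog = [
    {"name": "create_ticket", "description": "Open a support ticket."},
    {"name": "search_web", "description": "Search the public web."},
    {"name": "search_docs",
     "description": "Search the customer's internal documentation for a query."},
]

selected = prune_catalog(
    catalog, "search our internal docs for the refund policy", keep=1
)
```

The arithmetic behind the overhead is simple: at roughly 100 tokens per definition (an assumed average), 50 tools cost about 5,000 tokens per turn, while a pruned set of 5 costs about 500.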

Common function-calling failure modes

Hallucinated tool name: model invents search_documents_internal when only search_docs exists
Argument-shape drift: required query field arrives as a list of strings instead of a single string
Misclassified intent: model calls search_web when the user asked about internal data
Tool overload: with 50+ tools in the catalog, the model picks the wrong one for queries near the boundary
Hallucinated enum values: schema says status: "open" | "closed", model emits status: "pending"
Recursive call loops: tool returns ambiguous data and the model calls the same tool again with similar args

Go further

Why JSON schema specifically?

JSON schema is the lingua franca: it's machine-checkable, human-readable, and directly usable by validators in every language. The model is fine-tuned during post-training to honor it, so the structured output is far more reliable than asking for JSON in plain English.


What's parallel function calling?

When a turn needs multiple independent tool calls, the model emits all of them in one response (three searches and two API hits, say) and the runtime executes them concurrently. This cuts wall-clock latency substantially for fan-out tasks, since the turn takes roughly as long as the slowest call rather than the sum of all calls. Most major providers support it as of 2024-2025.
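On the runtime side, concurrent execution can be as simple as a thread pool. This is a sketch under assumptions: the calls have already been parsed into name plus argument dicts, the tools are I/O-bound (simulated here with a short sleep), and run_parallel and the registry entries are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical I/O-bound tools; the sleep stands in for network latency.
def search_docs(query: str) -> str:
    time.sleep(0.05)
    return f"docs:{query}"

def get_weather(city: str) -> str:
    time.sleep(0.05)
    return f"weather:{city}"

REGISTRY = {"search_docs": search_docs, "get_weather": get_weather}

def run_parallel(calls: list[dict], registry: dict) -> list:
    """Run independent calls concurrently; results come back in call order."""
    with ThreadPoolExecutor(max_workers=max(len(calls), 1)) as pool:
        futures = [pool.submit(registry[c["name"]], **c["arguments"]) for c in calls]
        return [f.result() for f in futures]

results = run_parallel(
    [
        {"name": "search_docs", "arguments": {"query": "refunds"}},
        {"name": "search_docs", "arguments": {"query": "shipping"}},
        {"name": "get_weather", "arguments": {"city": "Oslo"}},
    ],
    REGISTRY,
)
```

Keeping results in the same order as the calls makes it straightforward to pair each result with the call that produced it when building the next turn.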


How does this relate to MCP?

Function calling is the model-to-runtime contract for one provider. MCP is a cross-provider, cross-process standard for exposing tool catalogs and data sources to any LLM. MCP servers ultimately surface their tools through a function-calling-style schema; MCP standardizes the layer above.
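To show how thin the translation layer can be, the sketch below converts one MCP-style tool descriptor into two provider-shaped definitions. It assumes the descriptor is a dict with name, description, and inputSchema keys and uses simplified target shapes; the helper names are hypothetical, and real clients handle more fields.

```python
def mcp_to_openai(tool: dict) -> dict:
    """Wrap an MCP-style descriptor in an OpenAI-style function tool entry."""
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool.get("description", ""),
            "parameters": tool["inputSchema"],
        },
    }

def mcp_to_anthropic(tool: dict) -> dict:
    """Rename inputSchema to input_schema for an Anthropic-style tool entry."""
    return {
        "name": tool["name"],
        "description": tool.get("description", ""),
        "input_schema": tool["inputSchema"],
    }

mcp_tool = {
    "name": "search_docs",
    "description": "Search the customer's internal documentation for a query.",
    "inputSchema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}

openai_tool = mcp_to_openai(mcp_tool)
anthropic_tool = mcp_to_anthropic(mcp_tool)
```

The JSON schema itself passes through untouched in both cases; only the envelope around it changes.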
