The Tool-Use Gap — Hearthstone Ventures

The ability of language models to call external tools — to invoke APIs, query databases, run code, and take actions in external systems — is the feature that transforms a language model from a text generator into an agent. OpenAI introduced function calling in 2023; every major model provider followed quickly. The capability is now table stakes.

But tool calling as a capability and reliable tool use in production are very different things. The gap between them is where the real engineering work happens — and where most teams building agents underestimate the complexity they are taking on.

Failure Modes

Tool use fails in ways that are distinct from other LLM failure modes. The model selects the wrong tool for the task. The model calls the right tool but passes incorrect or malformed parameters. The tool call succeeds but the returned data is unexpected and the model misinterprets it. The tool call fails midway through a multi-step sequence, leaving state in an indeterminate condition. The model retries a non-idempotent operation, causing a side effect the user did not intend.
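
To make the first two failure modes concrete, here is a minimal sketch of a pre-dispatch check that rejects unknown tools and malformed parameters before anything touches an external system. The `ToolSpec` type, the `validate_call` function, and the `send_email` tool are hypothetical names invented for illustration, not part of any provider's API. A structural check like this cannot tell whether the model picked the semantically right tool for the task; it only catches calls that are wrong on their face.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class ToolSpec:
    """Hypothetical tool description: a name plus required parameter types."""
    name: str
    params: dict[str, type]


def validate_call(registry: dict[str, ToolSpec], name: str, args: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the call is well-formed."""
    spec = registry.get(name)
    if spec is None:
        return [f"unknown tool: {name}"]
    problems = []
    for param, expected in spec.params.items():
        if param not in args:
            problems.append(f"missing parameter: {param}")
        elif not isinstance(args[param], expected):
            actual = type(args[param]).__name__
            problems.append(f"{param}: expected {expected.__name__}, got {actual}")
    # Parameters the tool does not declare are reported rather than silently dropped.
    for extra in sorted(set(args) - set(spec.params)):
        problems.append(f"unexpected parameter: {extra}")
    return problems


# Assumed example registry with a single illustrative tool.
REGISTRY = {
    "send_email": ToolSpec("send_email", {"to": str, "subject": str, "body": str}),
}

# A call with a wrong type and a missing field is rejected before dispatch.
print(validate_call(REGISTRY, "send_email", {"to": "a@example.com", "subject": 3}))
```

Returning the problems as plain strings is a deliberate choice in this sketch: the list can be fed back to the model as a repair prompt instead of surfacing an exception to the user.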

Each of these failure modes requires different detection and recovery logic. And because tool calls often interact with external systems — databases, APIs, email — the consequences of failures can be real-world and sometimes irreversible. This is categorically different from a text generation failure, where the worst case is usually a bad response that the user ignores.
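
One way to express "different recovery logic per failure mode" is to classify errors and retry only when two conditions both hold: the failure is transient, and the operation is safe to repeat. The sketch below assumes a hypothetical `ToolError` exception carrying a `Failure` kind, and assumes the caller knows whether each tool is idempotent. None of these names come from a real library.

```python
from enum import Enum, auto
from typing import Callable


class Failure(Enum):
    BAD_PARAMETERS = auto()  # the call itself is wrong; retrying unchanged cannot help
    TRANSIENT = auto()       # timeout or rate limit; the call may or may not have run
    PERMANENT = auto()       # the external system rejected the request outright


class ToolError(Exception):
    """Hypothetical error type that tags a tool failure with its kind."""

    def __init__(self, kind: Failure, message: str):
        super().__init__(message)
        self.kind = kind


def run_with_recovery(call: Callable[[], str], idempotent: bool, max_attempts: int = 3) -> str:
    """Retry only transient failures, and only when repeating the call is safe."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    attempt = 0
    while True:
        attempt += 1
        try:
            return call()
        except ToolError as err:
            retryable = err.kind is Failure.TRANSIENT and idempotent
            if not retryable or attempt >= max_attempts:
                raise  # escalate to the caller (or a human) instead of guessing


# Illustrative read-only tool that times out twice, then succeeds.
attempts = {"n": 0}


def flaky_lookup() -> str:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ToolError(Failure.TRANSIENT, "timeout")
    return "ok"


print(run_with_recovery(flaky_lookup, idempotent=True))
```

With `idempotent=False`, the same transient timeout is raised after a single attempt. That is the point: a payment or an email that timed out may already have gone through, so the safe default is to stop and report rather than risk the duplicate side effect.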

The Action Execution Layer

What teams building production agent systems need — and what very few infrastructure providers offer well today — is an action execution layer: a runtime that handles tool call dispatch, manages the protocol between the model and external systems, provides structured error handling, maintains execution traces for debugging, and enforces permission policies that prevent agents from taking actions they should not take.
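
As a rough illustration of what such a layer does, the sketch below combines dispatch, a permission check, structured errors, and an execution trace in a few dozen lines. It is a toy under stated assumptions, not a description of any vendor's product: `ActionRuntime`, `ToolResult`, the agent id `support-bot`, and the two order tools are all hypothetical.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class ToolResult:
    """Structured outcome handed back to the model instead of a raw exception."""
    ok: bool
    value: Any = None
    error: Optional[str] = None


@dataclass
class ActionRuntime:
    tools: dict[str, Callable[..., Any]] = field(default_factory=dict)
    allowed: dict[str, set[str]] = field(default_factory=dict)  # agent id -> tool names
    trace: list[dict[str, Any]] = field(default_factory=list)

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        self.tools[name] = fn

    def grant(self, agent: str, tool: str) -> None:
        self.allowed.setdefault(agent, set()).add(tool)

    def dispatch(self, agent: str, tool: str, args: dict[str, Any]) -> ToolResult:
        started = time.time()
        if tool not in self.tools:
            result = ToolResult(ok=False, error=f"unknown tool: {tool}")
        elif tool not in self.allowed.get(agent, set()):
            # Deny by default: an agent may only call tools it was explicitly granted.
            result = ToolResult(ok=False, error=f"permission denied: {agent} may not call {tool}")
        else:
            try:
                result = ToolResult(ok=True, value=self.tools[tool](**args))
            except Exception as exc:  # convert any tool failure into a structured error
                result = ToolResult(ok=False, error=f"{type(exc).__name__}: {exc}")
        # Every attempt is recorded, including denied and failed ones.
        self.trace.append({
            "agent": agent, "tool": tool, "args": args,
            "ok": result.ok, "error": result.error,
            "seconds": time.time() - started,
        })
        return result


runtime = ActionRuntime()
runtime.register("lookup_order", lambda order_id: {"order_id": order_id, "status": "shipped"})
runtime.register("refund_order", lambda order_id: {"order_id": order_id, "refunded": True})
runtime.grant("support-bot", "lookup_order")  # read access only; no refund grant

found = runtime.dispatch("support-bot", "lookup_order", {"order_id": "A1"})
denied = runtime.dispatch("support-bot", "refund_order", {"order_id": "A1"})
print(found)
print(denied)
```

The design choice worth noting is that the permission check and the trace live in the runtime, not in the prompt. The model can ask for a refund as often as it likes; the call is refused and logged in the same place every time.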

This layer exists in every well-engineered production agent deployment. It is usually built as custom code, though, rather than as a reusable infrastructure primitive. The investment thesis behind Xpander.ai, which we backed in early 2024, is exactly this: that the action execution layer is a real category, that it will be extracted from application code into dedicated infrastructure, and that the right team can build the definitive product in this space.