Published 2026-05-24

Why Your Agent's Scariest Moment Is a Tool Call That Fails Silently

A crash is a gift. When software crashes, you know. There's a stack trace, an alert fires, someone gets paged, you fix it. The truly dangerous failures are the silent ones — the system that keeps running while quietly doing the wrong thing. With AI agents, the king of silent failures is the tool call that fails without anyone noticing.

Let me show you exactly how it goes wrong, because it's subtler than people expect.

The anatomy of a silent tool failure

Your agent needs to look up a customer's order status. It calls your get_order_status tool. The tool hits an internal API. The API is having a moment and returns a 503. Now watch what happens at each layer:

The tool function gets a 503. If it's written casually, it returns something like "Error: could not fetch order" as a string, or worse, returns an empty result and a 200-ish shape.
That string goes back to the model as the tool result. To the model, it's just text. It doesn't carry the weight of "THIS IS AN ERROR, STOP."
The model, which is built to be helpful and produce a fluent answer, reads "could not fetch order" or an empty payload, and reasons: the user wants an order status, I should provide one. So it produces a plausible-sounding status. Maybe it hallucinates "Your order shipped yesterday." Maybe it pattern-matches to a typical answer.
The user gets a confident, fluent, completely fabricated answer. No error. No alert. HTTP 200 all the way down. Everyone's dashboard is green.

That is the scariest moment in agent engineering. Not the crash — the smile while lying. And it's worse than a hallucination from thin air, because the user asked a real question about real data and got fiction dressed as fact.

Why agents are uniquely prone to this

Regular code propagates errors because it's dumb and rigid: an exception bubbles up, the call stack unwinds, something breaks loudly. An agent is the opposite. It's designed to recover gracefully from messy input. That same graceful flexibility means it will gracefully recover from an error message by... ignoring it and making something up. The model's helpfulness is the vulnerability.

So you cannot rely on the model to treat a tool failure as a hard stop. You have to engineer the failure to be loud at the boundary, before the model gets a chance to smooth it over.

The pattern: make failure structured, visible, and bounded

Here's the discipline I apply to every tool an agent can call. Three properties.

1. Structured, unambiguous error results

A tool must never return a failure as a casual string the model can mistake for content. Return a structured result that distinguishes success from failure explicitly:

# Bad — the model can't tell this is a failure
def get_order_status(order_id):
    resp = api.get(f"/orders/{order_id}")
    return resp.text  # on 503, this is an error string the model treats as data

# Good — failure is unmistakable and instructive
def get_order_status(order_id):
    try:
        resp = api.get(f"/orders/{order_id}", timeout=5)
        resp.raise_for_status()
        return {"ok": True, "status": resp.json()["status"]}
    except Timeout:
        return {"ok": False, "error": "TIMEOUT",
                "guidance": "The order service did not respond. Do NOT guess "
                            "the status. Tell the user the lookup failed and to retry."}
    except HTTPError as e:
        return {"ok": False, "error": f"HTTP_{e.response.status_code}",
                "guidance": "Order lookup failed. Do NOT fabricate a status."}

The guidance field is doing real work: it explicitly instructs the model what not to do. You're closing the "be helpful, make something up" path with a direct instruction inside the result the model is reading. It's a small thing that prevents the worst-case behavior.

2. Retries with backoff — but only where it's safe

A 503 or a timeout is often transient. So wrap the call in a retry with exponential backoff before you give up:

Retry on transient errors (timeouts, 429, 503). Don't retry on 400/404 — those won't get better.
Cap retries (2–3) and cap total time, so a flaky dependency doesn't turn one user request into a 90-second hang.
Only auto-retry idempotent calls. Reading an order status is safe to retry. Placing an order is not — a blind retry might create two orders. For non-idempotent actions, either make them idempotent with a request key, or surface the failure to the user instead of retrying.

That idempotency line is where I see the most damage. "Add retries everywhere" is well-meaning advice that, applied to a charge_card tool, turns one outage into a pile of double charges.

3. User-visible failure

When a tool genuinely fails after retries, the user must find out. The agent's response should say the lookup failed — not paper over it. "I wasn't able to retrieve your order status right now — the order system isn't responding. Please try again in a few minutes" is a vastly better outcome than a confident lie. A visible failure is recoverable. An invisible one compounds.

And the failure should be visible to you too: emit a metric and a log on every tool failure, tagged by tool name and error type. Track the tool-failure rate as a first-class signal (it's one of the LLM error signals I argue for in my golden-signals piece). You want to know your order API is flaky from a dashboard, not from a confused customer.

The mental model I want you to leave with

An agent should fail like good infrastructure fails: loudly, safely, and recoverably — never silently and confidently.

The instinct with AI is to lean on the model's intelligence to handle everything. But the model's job is to reason and communicate, not to be your error-handling layer. The error handling lives in the boundary — in how your tools return failures and how you constrain retries. Get that boundary right and a flaky dependency becomes a polite "try again later." Get it wrong and the same flaky dependency becomes a stream of confident fiction you won't discover until a customer complains.

Silent tool failure is the first thing I hunt for when I do an Agent Production-Readiness Audit, because it's so common and the blast radius is so quiet. If your agent calls tools that touch real data or take real actions, it's worth pressure-testing exactly this path before your users do it for you.

Five emails a week on AI reliability. Free, no spam, unsubscribe anytime.

Subscribe →