What 'Production-Ready' Actually Means for an AI Agent
"It works." I hear this about AI agents constantly, and it almost always means the same thing: it worked once, in a demo, with a friendly input, while someone watched. That is not production-ready. That is a magic trick.
I've spent my career on the line between "works in the demo" and "survives Tuesday at 3am with real traffic." The discipline that gets you across that line has a name — site reliability engineering — and it transfers almost perfectly to agents, with a few new twists. Here's the bar I actually hold an agent to before I'll let it run unattended. There are five parts.
1. Reliability: it behaves when the world doesn't
A production agent is defined less by what it does on the happy path and more by what it does when something breaks. The model API rate-limits you. A tool times out. The input is garbage. The user asks for something out of scope. The downstream service returns a 500.
The question isn't "does it work" — it's "what happens when it doesn't." Production-ready means:
- Timeouts and retries on every external call, with backoff, so a slow dependency degrades instead of hanging forever.
- Idempotency where actions have side effects. If the agent retries a step that sends an email or charges a card, it must not send twice. This is the single most common gap I find.
- Bounded loops. Agents that can call themselves need a hard ceiling on iterations and spend, or they'll cheerfully burn $400 in a runaway loop while you're at lunch.
- A defined behavior for "I don't know" or "I can't." A confident wrong answer is worse than a graceful refusal.
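To make the first two bullets concrete, here is a minimal sketch of retry-with-backoff and an idempotency guard. Every name in it (`TransientError`, `call_with_retries`, `IdempotentExecutor`) is hypothetical, and it assumes the wrapped call enforces its own timeout and raises `TransientError` for failures worth retrying. The in-memory dictionary stands in for the durable store a real system would need.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, 429s, 5xx)."""


def call_with_retries(fn, *, attempts=4, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call fn(); on TransientError, retry with exponential backoff and full jitter."""
    for attempt in range(attempts):
        try:
            return fn()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure instead of hanging
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))


class IdempotentExecutor:
    """Runs a side-effecting action at most once per idempotency key."""

    def __init__(self):
        # In-memory for illustration only; this does not survive a restart.
        self._results = {}

    def run(self, key, action):
        if key in self._results:
            return self._results[key]  # already done: return the recorded result
        result = action()
        self._results[key] = result
        return result


if __name__ == "__main__":
    sent = []
    executor = IdempotentExecutor()

    def send_email():
        sent.append("welcome email")
        return "sent"

    # The agent retries the same step; the key makes the second call a no-op.
    executor.run("run-42:send-welcome-email", send_email)
    executor.run("run-42:send-welcome-email", send_email)
    print(len(sent))  # prints 1
```

One caveat on the design: recording the result after the action leaves a window where the action succeeds and the process dies before the key is stored. Where the downstream service accepts an idempotency key of its own, passing the key through closes that gap better than local bookkeeping can.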
2. Observability: you can see what it did and why
If you can't answer "what did the agent do, with what inputs, calling which tools, costing how much, and why did it decide that?" — after the fact, from logs — then you don't have an agent in production. You have one in the dark.
Concretely, I want every run to emit:
- A trace of the full reasoning/tool-call chain, not just the final output.
- Inputs and outputs at each tool boundary, so a bad result can be localized to a step.
- Token counts and cost per run, tagged by feature or customer.
- Structured logs keyed on a run ID, so I can reconstruct one user's bad afternoon from a billion log lines.
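A minimal sketch of what that list can look like in code, using only the standard library. The field names (`run_id`, `event`, `latency_ms`, `cost_usd`, `feature`) and both helper functions are assumptions made for illustration, not a standard schema; a real deployment would send these lines to its logging or tracing backend rather than `print`.

```python
import json
import time
import uuid


def new_run_id():
    """One ID per agent run; every log line for that run carries it."""
    return uuid.uuid4().hex


def log_event(run_id, event, **fields):
    """Build one JSON log line keyed on run_id, print it, and return it."""
    record = {"ts": time.time(), "run_id": run_id, "event": event, **fields}
    line = json.dumps(record, sort_keys=True, default=str)
    print(line)
    return line


def _ms_since(start):
    return round((time.perf_counter() - start) * 1000, 2)


def traced_tool_call(run_id, tool_name, tool_fn, **kwargs):
    """Call a tool and log its inputs, its output or error, and its latency."""
    start = time.perf_counter()
    try:
        output = tool_fn(**kwargs)
    except Exception as exc:
        log_event(run_id, "tool_error", tool=tool_name, inputs=kwargs,
                  error=repr(exc), latency_ms=_ms_since(start))
        raise  # logging must not swallow the failure
    log_event(run_id, "tool_call", tool=tool_name, inputs=kwargs,
              output=output, latency_ms=_ms_since(start))
    return output


if __name__ == "__main__":
    run_id = new_run_id()
    # Token counts and cost are logged per model call, tagged by feature.
    log_event(run_id, "model_call", input_tokens=812, output_tokens=96,
              cost_usd=0.0041, feature="support-triage")
    traced_tool_call(run_id, "add", lambda a, b: a + b, a=2, b=3)
```

With every line carrying the same `run_id`, reconstructing one run is a filter on a single key instead of a search through free text.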
I've written a whole separate piece on mapping the four golden signals onto LLM apps, because this is where SRE practice translates most directly. The short version: latency, traffic, errors, and saturation all still apply — they just need new definitions.
3. Failure modes: you've thought about how it breaks
A real engineering team can tell you the ways their system fails before it fails. For an agent, the catalog is specific and worth writing down:
- Silent tool failure — the tool returns an error, the agent ignores it, and confidently proceeds on bad data. (This one scares me enough that it gets its own article.)
- Hallucinated tool use — the agent invents a parameter or a result.
- Prompt injection — the input data contains instructions that hijack the agent.
- Context overflow — the conversation grows past the window and the agent silently forgets the constraints you set at the top.
- Cascading retries — a failing dependency triggers retries that amplify load and make the outage worse.
Production-ready means each of these has a known mitigation, not a surprised face.
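As one example of a known mitigation, here is a hedged sketch for the first item, silent tool failure: validate each tool result before it is handed back to the model, and stop the run if it looks wrong. The `ToolFailure` exception, the `checked` helper, and the convention that tools return a dict with an optional `error` field are all assumptions for illustration.

```python
class ToolFailure(Exception):
    """Raised when a tool's output must not be fed back to the model as data."""


def checked(tool_name, raw, required_fields):
    """Return raw if it looks like a valid result; otherwise raise ToolFailure."""
    if not isinstance(raw, dict):
        raise ToolFailure(f"{tool_name}: expected a dict, got {type(raw).__name__}")
    if raw.get("error"):
        raise ToolFailure(f"{tool_name}: tool reported an error: {raw['error']}")
    missing = [field for field in required_fields if field not in raw]
    if missing:
        raise ToolFailure(f"{tool_name}: missing fields {missing}")
    return raw


if __name__ == "__main__":
    good = checked("get_price", {"sku": "A1", "price": 19.99}, ["sku", "price"])
    print(good["price"])  # prints 19.99

    try:
        checked("get_price", {"error": "upstream timeout"}, ["sku", "price"])
    except ToolFailure as exc:
        print(exc)  # prints get_price: tool reported an error: upstream timeout
```

The point of raising rather than returning a flag is that the agent loop cannot proceed on bad data by accident; it has to handle the failure explicitly or stop.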
4. Evals: you can prove it's good, repeatably
This is the new muscle that traditional software didn't need as badly, and it's where most agent projects are weakest. With deterministic code, a test either passes or fails. With an agent, "correct" is fuzzy and the same input can produce different outputs.
So you need evals: a held-out set of representative inputs with graded expected behavior, run on every change, that tells you whether the agent got better or worse. Without evals, you are tuning a prompt by vibes, and every "improvement" is a coin flip that might silently regress three other cases. An agent without an eval suite is not production-ready, full stop — because you have no way to ship a change safely.
The bar isn't "100% on a benchmark." It's "I have a repeatable measurement, I know my current score, and I'll see it move when I break something."
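A repeatable measurement does not need much machinery to start. The sketch below assumes the agent is a plain function from prompt to text and that each case carries its own grader; `EvalCase`, `run_evals`, and `regressions` are hypothetical names. Each case runs several times because the same input can produce different outputs, so the score is a pass rate rather than a single pass or fail.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    name: str
    prompt: str
    grade: Callable[[str], bool]  # True when the output is acceptable


def run_evals(agent, cases, trials=3):
    """Return {case name: pass rate}, the fraction of trials the grader accepted."""
    scores = {}
    for case in cases:
        passes = sum(1 for _ in range(trials) if case.grade(agent(case.prompt)))
        scores[case.name] = passes / trials
    return scores


def regressions(baseline, current, tolerance=0.0):
    """Names of cases whose pass rate dropped by more than tolerance."""
    return sorted(
        name for name, old in baseline.items()
        if current.get(name, 0.0) < old - tolerance
    )


if __name__ == "__main__":
    # A stand-in agent so the example runs without a model behind it.
    def fake_agent(prompt):
        return "REFUND_APPROVED" if "refund" in prompt else "I can't help with that."

    cases = [
        EvalCase("refund", "Customer asks for a refund", lambda out: "REFUND" in out),
        EvalCase("out_of_scope", "Write me a poem", lambda out: "can't" in out),
    ]
    baseline = {"refund": 1.0, "out_of_scope": 1.0}
    current = run_evals(fake_agent, cases)
    print(current)                          # prints {'refund': 1.0, 'out_of_scope': 1.0}
    print(regressions(baseline, current))   # prints []
```

Storing the baseline scores alongside the code and running `regressions` on every change is what turns a prompt tweak from a coin flip into a measured change.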
5. Cost: it's economically sane and bounded
Agents have a failure mode that classic services mostly don't: they can be correct and ruinously expensive at the same time. A reasoning loop that works perfectly but costs $2.50 per request will quietly bankrupt a feature that you priced at $0.10: that is 25 times the price, a $2.40 loss on every request.
Production-ready means cost is measured per run, attributed, alerted on, and capped. You should have a number for "what does the median request cost" and "what's the most a single request can cost before we kill it." If you don't, your unit economics are a rumor.
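A minimal sketch of the "capped" part, which also covers the hard ceiling on iterations and spend from the reliability section. `RunBudget`, `BudgetExceeded`, and `token_cost_usd` are hypothetical names, and the per-1,000-token prices in the usage example are placeholders rather than any provider's real rates.

```python
class BudgetExceeded(Exception):
    """Raised when a run crosses its step or spend ceiling."""


class RunBudget:
    """Tracks steps and spend for one agent run and enforces hard ceilings."""

    def __init__(self, max_steps, max_cost_usd):
        self.max_steps = max_steps
        self.max_cost_usd = max_cost_usd
        self.steps = 0
        self.cost_usd = 0.0

    def charge(self, cost_usd):
        """Record one step and its cost; raise once either ceiling is crossed."""
        self.steps += 1
        self.cost_usd += cost_usd
        if self.steps > self.max_steps:
            raise BudgetExceeded(f"step limit {self.max_steps} exceeded")
        if self.cost_usd > self.max_cost_usd:
            raise BudgetExceeded(
                f"spend ${self.cost_usd:.2f} exceeded cap ${self.max_cost_usd:.2f}"
            )


def token_cost_usd(input_tokens, output_tokens, usd_per_1k_input, usd_per_1k_output):
    """Cost of one model call given token counts and per-1,000-token prices."""
    return (input_tokens / 1000 * usd_per_1k_input
            + output_tokens / 1000 * usd_per_1k_output)


if __name__ == "__main__":
    budget = RunBudget(max_steps=10, max_cost_usd=0.10)
    try:
        while True:  # stand-in for an agent loop that never decides it is done
            budget.charge(token_cost_usd(2000, 500, 0.01, 0.03))
    except BudgetExceeded as exc:
        print(f"killed after {budget.steps} steps: {exc}")
```

Because the check runs after the step has been paid for, a run can overshoot the cap by at most one step's cost. If that matters, estimate the next call's cost and check before making it.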
The honest summary
Here's the uncomfortable version: most agents I see in the wild are at maybe one-and-a-half out of five. They work on the happy path (sort of), they have some logs (unstructured), and the other three categories are blank. That's fine for a prototype. It's a liability the moment a customer or your own automation depends on it.
The reason I drew the line at exactly these five is that they're the same five I'd hold any production service to — reliability, observability, failure analysis, testing, cost — just translated into the agent world. There's no special pleading for AI here. If anything, agents need more rigor, because they're non-deterministic and they can take actions.
This five-part bar is exactly what I check in my Agent Production-Readiness Audit. It's a fixed-scope, $1,500 review where I go through your agent against each of these categories, find the gaps that'll bite you, and hand you a prioritized list of what to fix before it carries real load. If you've got something that "works" and you're about to point real users or real money at it, that's the moment to get a second set of eyes on it. Get in touch about the audit.