Published 2026-05-24

The Four Golden Signals, for LLM Apps

If you've done any SRE work you know the four golden signals from Google's SRE book: latency, traffic, errors, saturation. Watch those four for any service and you'll catch most of what matters. They've held up for a decade because they're not about a specific technology — they're about the shape of "is this thing healthy?"

LLM apps need the same four signals. But if you copy the textbook definitions verbatim, you'll build a dashboard that's green while your users are miserable. The signals are right; the definitions need translating. Here's how I redefine each one for agents and LLM-backed features.

Latency: split "fast" from "slow," and start measuring time-to-first-token

The classic warning still applies: never track average latency, because the average hides the tail. A p50 of 800ms with a p99 of 40 seconds is a service where one request in a hundred takes 40 seconds or longer, and the user behind it wants to throw their laptop.
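To make the "average hides the tail" point concrete, here is a minimal sketch using Python's standard `statistics` module. The sample data is synthetic and chosen only for illustration (98 fast requests, 2 very slow ones); `latency_summary` is a hypothetical helper name, not part of any monitoring library.

```python
import statistics


def latency_summary(samples_ms):
    """Return mean, p50, and p99 for a list of latency samples in milliseconds."""
    # quantiles(n=100) returns 99 cut points: index 49 is p50, index 98 is p99.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(samples_ms),
        "p50": cuts[49],
        "p99": cuts[98],
    }


# Synthetic data (an assumption for illustration): 98 requests at 800 ms,
# 2 requests at 40 s.
samples = [800.0] * 98 + [40_000.0] * 2
summary = latency_summary(samples)
print(summary)
```

On this data the mean lands around 1.6 seconds, which describes neither the typical request (800 ms) nor the slow ones (40 s). The two percentiles tell you both stories separately.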

But LLM apps add a wrinkle the textbook didn't have: streaming. A response that takes 12 seconds total but starts streaming tokens at 400ms feels fast. A response that takes 4 seconds total but shows nothing until the very end feels broken. So I track two latencies:

  • Time to first token (TTFT) — the perceived responsiveness. This is what the user emotionally experiences as "speed."
  • Total generation time — the real resource cost and the thing that drives timeouts.

And a third thing that's easy to forget: separate model latency from tool latency. If your agent is slow, you need to know instantly whether the model is thinking or a tool call is hanging. Lumping them into one number throws away the diagnosis.
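One way to capture these numbers is sketched below, under a few assumptions: the model client hands you a lazy iterable of tokens (so the request effectively starts when iteration starts), and you are free to wrap model calls and tool calls in your own code. `StreamTimer` and `SpanClock` are hypothetical names invented for this example. The `clock` parameter exists only so the timing can be tested with a fake clock.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from typing import Callable, Iterable, Iterator, Optional


class StreamTimer:
    """Wrap a token stream and record time to first token and total time."""

    def __init__(self, tokens: Iterable[str],
                 clock: Callable[[], float] = time.monotonic):
        self._tokens = tokens
        self._clock = clock
        self.ttft_s: Optional[float] = None   # seconds until the first token
        self.total_s: Optional[float] = None  # seconds until the stream ended

    def __iter__(self) -> Iterator[str]:
        # Assumption: the stream is lazy, so timing starts with iteration.
        start = self._clock()
        for token in self._tokens:
            if self.ttft_s is None:
                self.ttft_s = self._clock() - start
            yield token
        self.total_s = self._clock() - start


class SpanClock:
    """Accumulate wall-clock seconds per category, e.g. "model" vs "tool"."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self.seconds = defaultdict(float)

    @contextmanager
    def span(self, kind: str):
        start = self._clock()
        try:
            yield
        finally:
            # Recorded even if the wrapped call raises, so hangs that end
            # in a timeout still show up under the right category.
            self.seconds[kind] += self._clock() - start
```

In use, you would iterate over `StreamTimer(stream)` instead of the raw stream and read `ttft_s` and `total_s` afterwards, and wrap each model call in `with spans.span("model"):` and each tool call in `with spans.span("tool"):`. Keeping the two categories as separate totals is what preserves the diagnosis.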

Traffic: count requests, but also count tokens

Traffic is "how much demand is hitting the system." For a web service that's requests per second. For an LLM app, requests-per-second still matters, but it badly under-describes load, because not all requests are equal. One request might be a 50-token question; another drags 80,000 tokens of context through the model.

So I measure traffic in two units:

  • Requests (per second/minute) — the classic signal, good for spotting demand spikes.
  • Tokens (in and out, per minute) — the real load on the model and the real driver of cost.

If you only watch requests, you can have flat request volume and a quietly exploding bill because your average context size crept up. Token traffic is the leading indicator that requests-per-second can't see.
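A minimal sketch of counting both units side by side, bucketed per minute. `TrafficMeter` and `MinuteBucket` are hypothetical names, and the token counts are assumed to come from whatever usage figures your provider returns with each response. The numbers in the example are made up to mirror the 50-token versus 80,000-token contrast above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class MinuteBucket:
    requests: int = 0
    tokens_in: int = 0
    tokens_out: int = 0


class TrafficMeter:
    """Count requests and tokens per one-minute bucket."""

    def __init__(self):
        self.buckets = defaultdict(MinuteBucket)

    def record(self, timestamp_s: float, tokens_in: int, tokens_out: int) -> None:
        # Bucket key is the minute index since the timestamp epoch.
        bucket = self.buckets[int(timestamp_s // 60)]
        bucket.requests += 1
        bucket.tokens_in += tokens_in
        bucket.tokens_out += tokens_out


meter = TrafficMeter()
# Minute 0: two short questions.
meter.record(5, tokens_in=50, tokens_out=120)
meter.record(40, tokens_in=50, tokens_out=120)
# Minute 1: same request count, but each call carries a large context.
meter.record(65, tokens_in=80_000, tokens_out=120)
meter.record(110, tokens_in=80_000, tokens_out=120)

for minute, bucket in sorted(meter.buckets.items()):
    print(minute, bucket)
```

Both minutes show two requests, so a requests-only chart is flat. The input-token count is what reveals that the second minute carried far more load.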

Errors: redefine "error," because HTTP 200 is lying to you

This is the signal that needs the most rethinking, and it's where naive LLM dashboards fail hardest.

In a normal service, errors are HTTP 5xx, exceptions, failed responses. In an LLM app, the scariest errors come back as HTTP 200. The model returned something. The status code is green. But:

  • The agent called a tool, the tool failed, and the agent made up an answer anyway.
  • The output was valid JSON but semantically wrong.
  • The model refused the task ("I can't help with that") — a "success" by HTTP standards, a failure by every standard that matters.
  • The agent hit a safety filter or returned an empty completion.

So my error signal for LLM apps tracks several distinct rates:

  • Hard errors — API failures, timeouts, exceptions. The classic stuff.
  • Tool-call failures — a tool the agent invoked returned an error. Critically: track these separately from whether the agent then recovered, because a high tool-failure rate is a fire even if the agent papers over it.
  • Schema/validation failures — the output didn't conform to the structure you required.
  • Refusals and empty responses — request succeeded technically, delivered nothing useful.
  • Eval-based quality failures — if you run online evals, the rate at which outputs fall below your quality bar.

You cannot get this from your load balancer. It has to come from inside the app, because only the app knows that a 200 was actually a failure.
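The categories above can be sketched as a classifier that runs inside the app on every response. Everything here is an assumption for illustration: the `classify` function, its parameters, and especially the refusal detection, which is a crude prefix match that a real system would likely replace with something more robust. Eval-based quality failures are left out because they need an evaluator of their own.

```python
import json

# Assumption: a few literal prefixes stand in for real refusal detection.
REFUSAL_PREFIXES = (
    "i can't help with that",
    "i cannot help with that",
)


def classify(text, *, hard_error=False, tool_errors=0, required_keys=()):
    """Return the set of failure labels for one agent response.

    An empty set means no failure was detected by these checks.
    """
    labels = set()
    if tool_errors > 0:
        # Recorded regardless of whether the agent still produced an answer.
        labels.add("tool_failure")
    if hard_error:
        labels.add("hard_error")
        return labels  # no output to inspect

    stripped = text.strip()
    if not stripped:
        labels.add("empty")
    elif stripped.lower().startswith(REFUSAL_PREFIXES):
        labels.add("refusal")
    elif required_keys:
        try:
            payload = json.loads(stripped)
        except json.JSONDecodeError:
            labels.add("schema_failure")
        else:
            if not isinstance(payload, dict) or any(
                key not in payload for key in required_keys
            ):
                labels.add("schema_failure")
    return labels


print(classify('{"answer": "42"}', tool_errors=1, required_keys=("answer",)))
```

Returning a set rather than a single label is deliberate: a response can be a tool failure and still look fine otherwise, and you want that tool failure counted even when the final output passes every other check.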

Saturation: it's not CPU, it's quota and context window

Saturation is "how full is the system" — the resource that runs out first under load. For a normal service that's CPU, memory, disk, connection pools. For an LLM app, the constraints are different and easy to ignore until they bite:

  • Rate limits / quota — your tokens-per-minute and requests-per-minute ceilings with the provider. This is your real saturation point, and you should track headroom against it. Hit it and every request starts getting 429'd.
  • Context window utilization — how full the context is relative to the model's max. As you approach the window, you either truncate (silently dropping instructions) or fail. Tracking "percent of context window used" is the agent equivalent of disk-full warnings.
  • Concurrency — how many in-flight requests against your provider concurrency limit.

The trap here is that none of these show up as high CPU on your own boxes. Your server can be 5% utilized while you are completely saturated on provider quota. Saturation for LLM apps lives almost entirely outside your infrastructure.
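Headroom against these limits can be expressed as a handful of ratios. The sketch below assumes you know your provider limits and can count your own recent usage. `ProviderLimits`, `saturation`, and `tightest` are hypothetical names, and every number in the example is invented rather than taken from any real provider.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProviderLimits:
    tokens_per_minute: int
    requests_per_minute: int
    max_concurrency: int
    context_window: int


def saturation(limits, *, tokens_last_minute, requests_last_minute,
               in_flight, prompt_tokens):
    """Return each resource's utilization as a fraction of its limit."""
    return {
        "tokens_per_minute": tokens_last_minute / limits.tokens_per_minute,
        "requests_per_minute": requests_last_minute / limits.requests_per_minute,
        "concurrency": in_flight / limits.max_concurrency,
        # Per-request figure: how full the context is for one prompt.
        "context_window": prompt_tokens / limits.context_window,
    }


def tightest(fractions):
    """Return the (name, fraction) pair closest to its limit."""
    return max(fractions.items(), key=lambda item: item[1])


# Made-up limits and usage, for illustration only.
limits = ProviderLimits(
    tokens_per_minute=500_000,
    requests_per_minute=1_000,
    max_concurrency=50,
    context_window=128_000,
)
usage = saturation(
    limits,
    tokens_last_minute=450_000,
    requests_last_minute=100,
    in_flight=5,
    prompt_tokens=64_000,
)
print(tightest(usage))
```

In this example the request rate and concurrency sit at 10% of their limits while token throughput is at 90%, which is the kind of imbalance a host-level CPU graph would never show.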

Putting it on one dashboard

Here's the at-a-glance panel I build for any LLM service:

| Signal | Classic definition | LLM definition |
|---|---|---|
| Latency | p50/p99 response time | TTFT + total gen time, model vs. tool split |
| Traffic | requests/sec | requests/sec and tokens/min |
| Errors | 5xx rate | hard errors + tool failures + schema failures + refusals + quality-eval failures |
| Saturation | CPU / mem / connections | provider quota headroom + context-window % + concurrency |

Build those four and you can answer "is this thing healthy?" honestly — instead of staring at a wall of green while users churn.

The biggest mindset shift: for LLM apps, your most important signals live inside the application, not in the infrastructure. The load balancer and the host metrics will keep telling you everything's fine while your users are already hitting failures. You have to instrument the agent itself.

If you want help wiring this up properly — or you want someone to look at whether your agent's observability would actually catch a silent failure — that's a core part of my Agent Production-Readiness Audit. Observability is usually the first thing I check, because if you can't see it, you can't fix it.
