Blog — reliability.dev

2026-05-24

Blameless Postmortems: The Template I Actually Use

Most postmortems are theater. A document gets written after an outage, it lists "human error" as the root cause, someone gets a talking-to, an action item like "be more careful"...
2026-05-24

DORA Metrics in a Weekend With Apache DevLake

Every engineering leader I talk to wants DORA metrics, and almost none of them have them. The four — deployment frequency, lead time for changes, change failure rate, and time t...
2026-05-24

I Built My Own AI Command Center Instead of Buying One — Here's the Architecture

There are a dozen products that promise to be your "AI command center" — a slick dashboard to run, watch, and steer agents. I looked at them, and then I built my own over a coup...
2026-05-24

Why Your Agent's Scariest Moment Is a Tool Call That Fails Silently

A crash is a gift. When software crashes, you know. There's a stack trace, an alert fires, someone gets paged, you fix it. The truly dangerous failures are the silent ones — the...
2026-05-24

The Four Golden Signals, for LLM Apps

If you've done any SRE work you know the four golden signals from Google's SRE book: latency, traffic, errors, saturation. Watch those four for any service and you'll catch most...
2026-05-24

What 'Production-Ready' Actually Means for an AI Agent

"It works." I hear this about AI agents constantly, and it almost always means the same thing: it worked once, in a demo, with a friendly input, while someone watched. That is n...
2026-05-24

Skills for Agents: I Packaged 15 Years of SRE Judgment Into Markdown an AI Can Load

There's a trend right now where everyone is converting their know-how into "skills" for AI agents — little packets of instructions an assistant can load when a task matches. I t...
2026-05-24

I Gave My AI Agent a Cloudflare Zero Trust Tunnel So I Can Run My Command Center From Anywhere

I run a little AI command center on my laptop. It's the control surface for the agents I'm building at Grand Canyon Computers — kick off a task, watch it work, approve a step, r...

Field notes on AI reliability.
