Production-ready agents, observability for LLM apps, the SRE practices that actually transfer. Updated as the work ships.
Most postmortems are theater. A document gets written after an outage, it lists "human error" as the root cause, someone gets a talking-to, an action item like "be more careful"...
Every engineering leader I talk to wants DORA metrics, and almost none of them have them. The four — deployment frequency, lead time for changes, change failure rate, and time t...
There are a dozen products that promise to be your "AI command center" — a slick dashboard to run, watch, and steer agents. I looked at them, and then I built my own over a coup...
A crash is a gift. When software crashes, you know. There's a stack trace, an alert fires, someone gets paged, you fix it. The truly dangerous failures are the silent ones — the...
If you've done any SRE work you know the four golden signals from Google's SRE book: latency, traffic, errors, saturation. Watch those four for any service and you'll catch most...
"It works." I hear this about AI agents constantly, and it almost always means the same thing: it worked once, in a demo, with a friendly input, while someone watched. That is n...
There's a trend right now where everyone is converting their know-how into "skills" for AI agents — little packets of instructions an assistant can load when a task matches. I t...
I run a little AI command center on my laptop. It's the control surface for the agents I'm building at Grand Canyon Computers — kick off a task, watch it work, approve a step, r...