Published 2026-05-24

Skills for Agents: I Packaged 15 Years of SRE Judgment Into Markdown an AI Can Load

There's a trend right now where everyone is converting their know-how into "skills" for AI agents — little packets of instructions an assistant can load when a task matches. I think the trend is correct, and I think most people are building skills wrong. So here's what I learned packaging fifteen years of SRE judgment into Markdown, and the test I use to tell a real skill from a glorified bookmark.

What a skill actually is

Strip away the branding and a skill is three things in a file:

A trigger — a tight description of when this knowledge applies, so the agent knows to reach for it.
Procedure — the steps, in order, including the boring parts a human would skip.
Judgment — the stuff that isn't in the docs. What to do when the steps don't apply. What "good" looks like. Where people screw up.

That third one is the whole game. Anyone can paste a runbook. The thing you're actually selling an AI — or your future self — is judgment.

The bad skill

Here's a skill the way most people write it:

# Check the database
Run `SELECT * FROM pg_stat_activity;` to see active queries.

That's not a skill. That's a fact the model already knows, wrapped in a heading. An agent doesn't need me to tell it pg_stat_activity exists; it was trained on the Postgres docs. If a competent model could produce the entire content of your "skill" cold, you haven't packaged anything. You've added tokens.

The good skill

Here's the same topic, but packaged as judgment:

# Diagnosing a slow Postgres prod database

## When to use
The app is timing out, latency spiked, or someone said "the DB is slow."
Use BEFORE touching config or restarting anything.

## Procedure
1. Get the picture before you act. Pull active queries AND wait events:
   SELECT pid, state, wait_event_type, wait_event, now()-query_start AS age, query
   FROM pg_stat_activity WHERE state != 'idle' ORDER BY age DESC;
2. The longest-running query is rarely the cause. Look for the query that's
   BLOCKING others. Sessions with wait_event_type = 'Lock' are the victims;
   pg_blocking_pids(pid) names the session holding each one up. One stuck
   transaction holding a lock will make twenty healthy queries look slow.
3. Check for idle-in-transaction (state = 'idle in transaction'). An app that
   opened a transaction and walked away holds its locks until that transaction
   ends, which can be never. This is the single most common "the DB is slow"
   cause I've seen, and it's an app bug, not a DB problem.

## Judgment
- Do NOT kill the longest query reflexively. Find the head of the lock chain
  first; killing a victim does nothing.
- A connection count near max_connections is a symptom, not the disease —
  something downstream stopped draining. Restarting the DB "to clear it" hides
  the real bug and it comes back at 3am.
- If pg_stat_activity itself is slow to return, suspect the host, not the queries.

See the difference? The second one encodes the mistakes — kill the wrong query, restart to clear connections, blame the DB for an app bug. That's the part you can't get from training data, because it lives in the scar tissue of having done it wrong. (For the record, this isn't hypothetical — diagnosing exactly this on real production fleets is most of what I do.)
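To make "find the head of the lock chain" concrete, here is a minimal Python sketch. It assumes you have already pulled, for each session, the list of pids blocking it (for example with `SELECT pid, pg_blocking_pids(pid) FROM pg_stat_activity`). The function name `root_blockers` and the sample pids are hypothetical, made up for illustration.

```python
from typing import Dict, List, Set


def root_blockers(blocked_by: Dict[int, List[int]]) -> Set[int]:
    """Return pids that block someone but are not waiting on anyone.

    blocked_by maps a session pid to the pids holding it up. An empty list
    means the session is not waiting on a lock.
    """
    # Sessions that are stuck behind at least one other session (the victims).
    waiting = {pid for pid, blockers in blocked_by.items() if blockers}
    # Every pid that appears as a blocker of some other session.
    blocking = {b for blockers in blocked_by.values() for b in blockers}
    # The head of a chain blocks others while waiting on nobody.
    return blocking - waiting


# Hypothetical snapshot: 101 holds a lock, 202 waits on 101,
# and 303 and 404 both wait on 202.
snapshot = {202: [101], 303: [202], 404: [202], 101: []}
print(sorted(root_blockers(snapshot)))  # prints [101]
```

Here 202 looks guilty because two sessions are stuck behind it, but it is a victim too. Only 101 is worth investigating. If sessions are waiting and the result is empty, the blockers form a cycle, which is a different problem from one stuck transaction.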

My four tests for whether something is a real skill

When I'm deciding if a piece of my knowledge deserves to be a skill, I run it through four questions:

Could the model do this well without me? If yes, skip it. Don't package what the model already knows.
Does it contain a fork the model would get wrong? The best skills exist precisely at decision points where the "obvious" move is the wrong one.
Is the trigger crisp? A skill the agent never loads at the right moment is dead weight. The "when to use" line is as important as the body.
Would I hand this to a sharp junior in their first week? If it'd save them a bad night, it's a skill. If it's trivia, it's not.

The shape that works

A few mechanical things I've settled on after building a stack of these:

One skill, one job. A skill that does "everything about databases" never loads at the right time because its trigger is mush. Split it.
Put the trigger first and make it specific. "Use when checking Postgres health on prod or staging" beats "database stuff."
After the trigger, lead with the procedure and end with judgment. The agent executes top-down; give it the happy path, then the landmines.
Be concrete about names. Real connection strings (redacted), real log group names, real command flags. Vague skills produce vague agents.
Keep secrets out. A skill is checked-in, shared, sometimes open-sourced. It references where a credential lives (a vault path), never the credential.

That last point matters more as you scale. I treat skills like code: reviewed, version-controlled, no secrets, small and single-purpose. Because that's what they are now — code that happens to be prose.
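If skills are code, they can be linted like code. Below is a rough sketch of a pre-merge check, assuming skills follow the section headings from the Postgres example above ("## When to use", "## Procedure", "## Judgment"). The function `lint_skill` and both secret patterns are my own illustrative assumptions; they are deliberately crude and no substitute for a real secret scanner.

```python
import re
from typing import List

# Section headings taken from the example skill, in the order they should appear.
REQUIRED_SECTIONS = ["## When to use", "## Procedure", "## Judgment"]

# Illustrative patterns only (assumption): key=value style secrets and
# URLs with inline credentials such as scheme://user:pass@host.
SECRET_PATTERNS = [
    re.compile(r"\b(password|passwd|secret|api[_-]?key|token)\s*[:=]\s*\S+", re.IGNORECASE),
    re.compile(r"\b[a-z][a-z0-9+.-]*://[^\s/:@]+:[^\s/@]+@"),
]


def lint_skill(text: str) -> List[str]:
    """Return a list of human-readable problems; an empty list means it passed."""
    problems: List[str] = []
    lines = text.splitlines()

    # One skill, one job: expect a single-line title up top.
    if not lines or not lines[0].startswith("# "):
        problems.append("first line should be a '# ' title naming the one job")

    # Every required section must exist, trigger first and judgment last.
    positions = []
    for heading in REQUIRED_SECTIONS:
        idx = next((i for i, line in enumerate(lines) if line.strip() == heading), -1)
        if idx < 0:
            problems.append(f"missing section: {heading}")
        else:
            positions.append(idx)
    if positions != sorted(positions):
        problems.append("sections out of order: want trigger, procedure, judgment")

    # Keep secrets out: flag anything that looks like an inline credential.
    for number, line in enumerate(lines, start=1):
        if any(pattern.search(line) for pattern in SECRET_PATTERNS):
            problems.append(f"line {number}: possible secret; reference a vault path instead")
    return problems


good = """# Diagnosing a slow Postgres prod database

## When to use
The app is timing out, latency spiked, or someone said "the DB is slow."

## Procedure
1. Pull active queries and wait events before touching anything.

## Judgment
- Credentials live at the vault path named in the team runbook.
"""

bad = "# Database stuff\n\n## Procedure\nConnect with postgres://app:hunter2@db.internal/prod\n"

print(lint_skill(good))  # prints []
for problem in lint_skill(bad):
    print(problem)
```

The bad example produces three complaints: no trigger section, no judgment section, and a credential on line 4. A check like this won't tell you whether the judgment is any good, but it does catch the mechanical failures before a reviewer has to.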

Why I'm bothering

Two reasons. First, selfishly: an agent loaded with my actual judgment is a dramatically better teammate than a raw model. It stops making the rookie moves because I wrote the rookie moves down as "don't." Second, less selfishly: this is the most leveraged thing a senior person can do right now. Your expertise was trapped in your head and your habits. Markdown lets you hand it to a tireless operator that works while you sleep.

I'm pulling the most reusable, non-proprietary pieces of my SRE skill set — Postgres triage, the golden-signals mapping, postmortem structure, backup-restore checks — into an open-source skills pack so other people building reliability-minded agents can start from judgment instead of from scratch. It's the same material I lean on when I run an Agent Production-Readiness Audit, just generalized.

If you've got a domain you've spent a decade in, try this exercise this week: take one thing you're sick of explaining to juniors, and write it up using the four tests above. Worst case, you've documented something valuable. Best case, your agent stops making the mistake you've been correcting for years.
