Published 2026-05-24

Blameless Postmortems: The Template I Actually Use

Most postmortems are theater. A document gets written after an outage, it lists "human error" as the root cause, someone gets a talking-to, an action item like "be more careful" gets logged, and the whole thing gets filed and forgotten — right up until the same outage happens again six months later. The point of a blameless postmortem is to break that cycle by changing what you're actually looking for: not who, but why the system let it happen. Here's the template I actually use, and more importantly, the rules that make it blameless rather than blameless-in-name-only.

The template

Copy this. Adapt it. The headings matter more than the prose.

# Postmortem: [Short, specific title]

**Status:** Draft | In Review | Final
**Severity:** SEV1 / SEV2 / SEV3
**Date of incident:** YYYY-MM-DD
**Authors:** [names]
**Reviewers:** [names]

## Summary
2-4 sentences a busy exec can read. What broke, who was affected,
how long, and the headline cause. No jargon.

## Impact
- Who/what was affected (users, systems, revenue)
- Duration: impact began at HH:MM, detected at HH:MM, resolved at HH:MM, total X minutes
- Quantified where possible (N requests failed, X% error rate, $ if known)

## Timeline (all times in one timezone, stated)
| Time | Event |
|------|-------|
| HH:MM | First trigger / change went out |
| HH:MM | First symptom |
| HH:MM | Alert fired / human noticed |
| HH:MM | Investigation milestones |
| HH:MM | Mitigation applied |
| HH:MM | Service restored |

## Root cause(s)
The chain of conditions that made the incident possible. Plural on
purpose — real incidents have multiple contributing factors, not one.
Use "5 whys" but stop at the system, never at a person.

## Detection
How did we find out? How long did it take? Did the right alert fire,
or did a customer tell us? Be honest here.

## Resolution
What actually fixed it (mitigation) vs. what will prevent recurrence.
These are different things.

## What went well
Genuinely — fast rollback, good runbook, clear comms. Reinforce it.

## What went poorly / where we got lucky
The near-misses. "We got lucky that X" is one of the most valuable
lines in any postmortem.

## Action items
| Action | Owner | Due | Priority | Tracking link |
|--------|-------|-----|----------|---------------|
| Specific, assigned, dated, and trackable. | | | | |

That's the artifact. But the artifact is the easy part. The hard part is the culture, and a template can't give you that. The facilitation can.
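One mechanical note on the template before the rules: the Duration and Detection numbers can be derived straight from the timeline table instead of being estimated from memory. Here is a minimal sketch of that arithmetic. The timestamps, the event labels, and the `minutes_between` helper are all hypothetical, made up for illustration.

```python
from datetime import datetime

# Hypothetical timeline rows: (timestamp, event label), all in one timezone.
TIMELINE = [
    ("2026-01-10 14:02", "change deployed"),
    ("2026-01-10 14:05", "first symptom"),
    ("2026-01-10 14:19", "alert fired"),
    ("2026-01-10 14:41", "mitigation applied"),
    ("2026-01-10 14:47", "service restored"),
]

def minutes_between(timeline, start_label, end_label):
    """Whole minutes from the first start_label event to the first end_label event."""
    times = {}
    for stamp, label in timeline:
        # setdefault keeps the earliest occurrence if a label repeats.
        times.setdefault(label, datetime.strptime(stamp, "%Y-%m-%d %H:%M"))
    delta = times[end_label] - times[start_label]
    return int(delta.total_seconds() // 60)

time_to_detect = minutes_between(TIMELINE, "first symptom", "alert fired")
total_impact = minutes_between(TIMELINE, "first symptom", "service restored")
print(f"time to detect: {time_to_detect} min, total impact: {total_impact} min")
```

Measuring from the first symptom rather than from the alert is a deliberate choice in this sketch: the gap between the two is exactly what the Detection section is supposed to expose.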

The five rules that make it actually blameless

1. Replace names with roles in the timeline

Not "Sarah pushed a bad config." Instead: "a config change was deployed that…" The person's identity is almost never the useful variable. The useful variable is why the system accepted a change that could take it down. The second you write a name next to the failure, every reader's brain shifts from "how do we fix the system" to "should we be mad at Sarah," and you've lost the room. (This isn't about protecting feelings — it's about keeping the analysis pointed at the fixable thing.)
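If the draft timeline comes out of chat logs, names will already be in it. A rough sketch of swapping them for roles before the document circulates might look like this. The `ROLES` mapping and the `use_roles` helper are hypothetical, and a real pass would still need a human read-through for nicknames and handles.

```python
import re

# Hypothetical mapping from people to the role they played in the incident.
ROLES = {
    "Sarah": "the deploying engineer",
    "Dev": "the on-call engineer",
}

def use_roles(text, roles):
    """Replace each whole-word name in text with its role."""
    for name, role in roles.items():
        pattern = rf"\b{re.escape(name)}\b"
        # A function replacement avoids backslash interpretation in the role text.
        text = re.sub(pattern, lambda _match, r=role: r, text)
    return text

draft = "14:02 Sarah pushed a config change. 14:19 Dev was paged."
print(use_roles(draft, ROLES))
```

The word-boundary match means a name like "Dev" is replaced on its own but left alone inside a longer word such as "Developer".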

2. Ask "why was this possible?" not "who did it?"

If the answer to "root cause" is "human error," you haven't found the root cause — you've found where you stopped looking. Of course a human made the proximate mistake; humans always do. The real question is: why did the system permit that mistake to cause an outage? No staging gate? No automated check? A confusing UI that made the wrong action look right? Those are the fixable things. "Be more careful" is not a fix; it's a wish.
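A cheap way to catch this during review is to scan the draft for the phrases that signal you stopped at a person. The sketch below is one way to do that. The phrase list is my own assumption, not a standard, and a hit is a prompt to keep asking "why", not a verdict.

```python
import re

# Hypothetical phrase list; tune it to your own team's habits.
BLAME_PATTERNS = [
    r"human error",
    r"be more careful",
    r"should have known",
    r"failed to follow",
]

def find_blame_language(text):
    """Return (line_number, matched_phrase) pairs for each flagged phrase."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for pattern in BLAME_PATTERNS:
            match = re.search(pattern, line, flags=re.IGNORECASE)
            if match:
                hits.append((number, match.group(0)))
    return hits

draft = """Root cause: Human error during deploy.
Action item: be more careful with config changes.
Contributing factor: no staging gate for config changes."""

flags = find_blame_language(draft)
for number, phrase in flags:
    print(f"line {number}: '{phrase}' stops at a person; why was it possible?")
```

The third line of the sample draft passes untouched, because it already names a system gap rather than a person.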

3. Run it on the assumption that everyone acted reasonably given what they knew

The single most important framing, sometimes called the prime directive of retrospectives: assume everyone involved did the best they could with the information, tools, and time available in the moment. People don't break production on purpose. If someone made a call that looks dumb in hindsight, the interesting question is "what made that look like the right call at the time?" That's where the systemic gap hides.

4. Distinguish mitigation from prevention — and actually fund prevention

The fastest postmortem-to-repeat-incident pipeline is logging "we restarted the service, all good" as the resolution. Restarting is mitigation; it stopped the bleeding. Prevention is the action item that stops the next one. A postmortem with no prevention action items, or with prevention items that never get prioritized against feature work, is just paperwork. The action items have to be real tickets, owned, dated, and tracked to completion — otherwise the whole exercise is a journaling habit.
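The "owned, dated, and tracked" requirement is easy to check mechanically before a postmortem is marked Final. This is a minimal sketch, assuming action items are tagged as either "mitigation" or "prevention". The `ActionItem` fields, the tags, and the ticket ID are hypothetical and would map onto whatever your tracker actually stores.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class ActionItem:
    action: str
    kind: str  # "mitigation" or "prevention" (labels are an assumption)
    owner: Optional[str] = None
    due: Optional[date] = None
    tracking_link: Optional[str] = None

def review_action_items(items):
    """Return a list of human-readable problems; an empty list means the items pass."""
    problems = []
    for item in items:
        missing = [
            name for name in ("owner", "due", "tracking_link")
            if getattr(item, name) is None
        ]
        if missing:
            problems.append(f"{item.action!r} is missing: {', '.join(missing)}")
    if not any(item.kind == "prevention" for item in items):
        problems.append("no prevention item: only mitigation is recorded")
    return problems

items = [
    ActionItem("Restart the service", "mitigation", "on-call", date(2026, 1, 10), "TICKET-1"),
    ActionItem("Add a staging gate for config changes", "prevention", owner="platform team"),
]
problems = review_action_items(items)
for problem in problems:
    print(problem)
```

A check like this only proves the fields are filled in. Whether the prevention work gets prioritized against feature work is still a leadership decision.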

5. Make them genuinely safe, then make them required

Blameless only works if people believe it. The first time someone gets punished for an honest mistake surfaced in a postmortem, your postmortems become fiction — people will write the safe version, omit the embarrassing detail, and you'll lose your most valuable source of truth. Leadership has to demonstrate, repeatedly, that surfacing a mistake is rewarded, not punished. Once that trust exists, make postmortems mandatory for anything above a threshold severity, because the ones people don't want to write are usually the ones worth the most.
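The threshold itself is worth writing down unambiguously, because severity scales run backwards: SEV1 is more severe than SEV3. A small sketch of the rule follows, assuming the SEV1 to SEV3 scale from the template and a hypothetical default threshold of SEV2, treated here as "at or above".

```python
# Lower number means more severe, matching the SEV1 / SEV2 / SEV3 scale in the template.
SEVERITY_RANK = {"SEV1": 1, "SEV2": 2, "SEV3": 3}

def postmortem_required(severity, threshold="SEV2"):
    """True when the incident is at least as severe as the threshold."""
    return SEVERITY_RANK[severity] <= SEVERITY_RANK[threshold]

for level in ("SEV1", "SEV2", "SEV3"):
    print(level, "requires a postmortem:", postmortem_required(level))
```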

One more thing: write it while it's warm

Memory decays fast and reconstructs itself conveniently. Draft the timeline within a day or two of the incident, while the Slack history and the dashboards still tell the real story. A postmortem written two weeks later is a postmortem written from vibes.

Why I care about this so much

I've run a lot of these across a large, multi-account environment, and the pattern is consistent: the organizations that get reliably better are the ones that treat every incident as free tuition about their system, and the ones that stay stuck are the ones still hunting for someone to blame. The template above is just scaffolding. The thing that actually moves the needle is the discipline of looking at the system instead of the person — every single time, even when it's tempting not to.

This same blameless, system-first mindset is the spine of how I think about reliability generally, including for AI agents. When an agent does something wrong, the answer is never "the model is dumb"; the question is "what in our system let a bad output reach a user?" If you're building toward that kind of reliability culture and want help operationalizing it, that's the work I do over at reliabilityops.dev.
