Writing Postmortems That Help

01 What postmortems are actually for

Two purposes, in this order:

Prevent recurrence. Identify systemic changes that would have prevented this incident, and make them.
Spread knowledge. Teach the rest of the engineering org what happened so they can apply the lessons.

Postmortems are not for: assigning blame, demonstrating diligence to management, or documenting what people knew at the time. Postmortems written for these reasons systematically fail to prevent recurrence — they obscure the real causes behind reassuring narratives.

02 What "blameless" actually means

Blameless does not mean "we don't say who did what." That's the cargo-cult version. Blameless means: we assume everyone acted reasonably given what they knew, and we focus on how the system allowed a reasonable action to produce a bad outcome.

If an engineer deployed bad code on Friday at 5pm, the blameless framing is not "Sarah deployed bad code." It's "the system allowed bad code to deploy at 5pm on a Friday without sufficient guardrails." The engineer's action is a fact; the system's failure to catch it is the lesson.

Two practical tests for whether your postmortem is actually blameless:

Can you read it aloud to the person who triggered the incident, in front of their manager, without it feeling like an attack? If not, rewrite.
Does the list of action items contain "do better next time" or "be more careful"? If so, rewrite: those aren't actions, they're hopes.
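The second test can be partly automated. The sketch below is one possible check, not a standard tool: the phrase list in `VAGUE_PATTERNS` and the function name `find_vague_items` are assumptions made for illustration, and a real team would tune the patterns to its own wording.

```python
import re

# Hypothetical phrases that signal a hope rather than an action.
VAGUE_PATTERNS = [
    r"\bdo better\b",
    r"\bbe more careful\b",
    r"\bpay (more|closer) attention\b",
]

def find_vague_items(action_items):
    """Return the action items whose text matches any vague pattern."""
    flagged = []
    for item in action_items:
        if any(re.search(p, item, re.IGNORECASE) for p in VAGUE_PATTERNS):
            flagged.append(item)
    return flagged

items = [
    "Be more careful when deploying on Fridays",
    "Add a deploy gate that blocks releases with pending migrations",
]
flagged = find_vague_items(items)
print(flagged)  # ['Be more careful when deploying on Fridays']
```

A check like this only catches known phrasings. It does not replace reading the list and asking whether each item is something a person could actually finish.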

03 The template

Every section serves a specific purpose. Skip none.

✓ postmortem template

# Incident: [Brief description]

## Summary
- Date: 2026-05-24
- Duration: 40 minutes (14:32 to 15:12 UTC)
- Severity: SEV-2
- Impact: 12% of users saw 500 errors on checkout
- Cost: ~$80K in lost revenue

## Timeline
[All times in UTC]
- 14:32 - Deploy 14a3 ships to production
- 14:35 - Error rate begins climbing in dashboard
- 14:38 - First customer support ticket
- 14:42 - On-call engineer paged by alert
- 14:51 - Root cause identified (missing migration)
- 14:58 - Rollback initiated
- 15:12 - Service fully restored
- 15:19 - Customer support sends update

## What happened
[Plain-English narrative of the incident]

## Root causes (5 Whys analysis)
[Multiple causes, not one]

## What went well
[Things that worked, to preserve]

## What went poorly
[System failures, not people failures]

## Action items
[Each with owner, due date, and category]

## Lessons learned
[Generalizable knowledge for the org]

04 The timeline: be exact

Pull from monitoring, chat logs, ticket timestamps. Don't reconstruct from memory. The timeline is the foundation of every other section — if it's wrong, the analysis is wrong.

Each entry: timestamp (UTC), what happened, who did what. No emotions, no judgment, just facts. Anyone with access to the same data should be able to reconstruct the timeline independently.

Pay attention to the gap between "incident started" and "incident detected." If it's large, that's a monitoring problem, separate from the actual bug.
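To make that gap concrete, here is a minimal sketch that computes it from the example timeline in the template. It assumes the deploy at 14:32 counts as the start of the incident and the page at 14:42 counts as detection. The helper names `parse_utc` and `minutes_between` are invented for this example.

```python
from datetime import datetime, timezone

def parse_utc(day, hhmm):
    """Parse a 'YYYY-MM-DD' date and 'HH:MM' time as a UTC datetime."""
    parsed = datetime.strptime(f"{day} {hhmm}", "%Y-%m-%d %H:%M")
    return parsed.replace(tzinfo=timezone.utc)

def minutes_between(start, end):
    """Whole minutes elapsed between two datetimes."""
    return int((end - start).total_seconds() // 60)

day = "2026-05-24"
started = parse_utc(day, "14:32")   # assumption: the deploy is the start
detected = parse_utc(day, "14:42")  # on-call engineer paged by alert
restored = parse_utc(day, "15:12")  # service fully restored

time_to_detect = minutes_between(started, detected)
time_to_restore = minutes_between(detected, restored)
print(time_to_detect, time_to_restore)  # 10 30
```

In this example the dashboard showed errors at 14:35 and a customer ticket arrived at 14:38, both before the page at 14:42. That ordering is exactly the kind of detection finding the timeline should surface.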

05 The 5 Whys, done properly

The 5 Whys technique is widely misused. Done well, it surfaces systemic issues, not human ones.

✗ shallow 5 Whys (stops at blame)

Why did checkout fail?
- Bug in the new code.

Why was there a bug?
- Engineer didn't test it.

Why didn't they test it?
- They were rushing.

Why were they rushing?
- Tight deadline.

Why was there a tight deadline?
- PM scheduling.

[Action item: PM should give more time]

✓ deep 5 Whys (surfaces system)

Why did checkout fail?
- The Payment service threw uncaught exceptions on null user metadata.

Why did the Payment service get null metadata?
- A migration to drop the deprecated 'preferences' column hadn't run in production.

Why hadn't the migration run?
- It was marked as "manual review required" but no one was assigned.

Why was it marked as manual review?
- The migration tool flags any migration that drops a column as risky.

Why wasn't anyone assigned?
- We have no process for migration assignment; engineers manually
  pick up migrations from a Slack channel.

[Action items:
- Add migration-required check to CI deploy gate
- Build automatic migration assignment with on-call rotation
- Add null-safety guards to Payment service]

The deep version identifies multiple actionable system changes. The shallow version identifies a human to blame. Same incident, two completely different outcomes.
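As an illustration of the first action item, a deploy gate might look something like the sketch below. This is a hypothetical shape, not the API of any real migration tool: the `Migration` fields, the migration names, and `deploy_blockers` are all assumptions, and a real gate would read migration state from whatever tool the team uses.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Migration:
    # Hypothetical record of one migration's state in production.
    name: str
    applied: bool
    needs_manual_review: bool = False
    reviewer: Optional[str] = None

def deploy_blockers(migrations):
    """Return human-readable reasons the deploy should not proceed."""
    blockers = []
    for m in migrations:
        if m.applied:
            continue
        if m.needs_manual_review and m.reviewer is None:
            blockers.append(f"{m.name}: manual review required but no reviewer assigned")
        else:
            blockers.append(f"{m.name}: not yet applied in production")
    return blockers

migrations = [
    Migration("0041_add_index", applied=True),
    Migration("0042_drop_preferences", applied=False, needs_manual_review=True),
]
blockers = deploy_blockers(migrations)
for reason in blockers:
    print("BLOCKED:", reason)
```

A CI step could fail the pipeline whenever `deploy_blockers` returns a non-empty list. That turns "no one was assigned" from a silent gap into a visible, blocking message.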

06 Action items: the rules

Action items are the only part of the postmortem that prevents recurrence. They must:

Have a single owner. "The team" owns nothing. Pick one person.
Have a due date. "Eventually" never happens.
Be specific enough to verify. "Improve monitoring" is not actionable. "Add alert for error rate > 1% on checkout endpoint" is.
Be tracked in your normal work system. Jira ticket, GitHub issue, whatever your team uses. NOT in the postmortem doc.

Categorize action items by what they prevent:

Detect — would have alerted us faster. Monitoring, alerting, observability.
Mitigate — would have reduced impact. Rollback automation, feature flags, circuit breakers.
Prevent — would have blocked the incident from starting. Test coverage, code review, deployment gates.

Prevention is best, but mitigation and detection are often easier to implement and have broader impact (they help with future unrelated incidents too).
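The rules and categories above can be encoded as a small record with a validation step. The following is a sketch under stated assumptions: the field names, the ticket ID `OPS-123`, the owner name, and the set of rejected owner values are all made up for illustration.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional

class Category(Enum):
    DETECT = "detect"
    MITIGATE = "mitigate"
    PREVENT = "prevent"

@dataclass
class ActionItem:
    title: str
    owner: str
    due: Optional[date]
    ticket: str  # ID in the team's normal tracker (hypothetical format)
    category: Category

# Assumed examples of "owners" that are not a single person.
GROUP_OWNERS = {"the team", "team", "everyone"}

def validation_errors(item):
    """Return the rules from this section that the item violates."""
    errors = []
    owner = item.owner.strip().lower()
    if not owner or owner in GROUP_OWNERS:
        errors.append("needs a single named owner")
    if item.due is None:
        errors.append("needs a due date")
    if not item.ticket.strip():
        errors.append("needs a ticket in the normal work tracker")
    return errors

good = ActionItem(
    "Add alert for error rate > 1% on checkout endpoint",
    owner="priya", due=date(2026, 6, 12), ticket="OPS-123",
    category=Category.DETECT,
)
bad = ActionItem("Improve monitoring", owner="the team", due=None,
                 ticket="", category=Category.DETECT)
print(validation_errors(good))  # []
print(validation_errors(bad))
```

Whether an item is specific enough to verify is still a human judgment. This check only enforces the mechanical rules.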

07 The followup discipline

Most postmortems fail at the followup. The doc gets written, the action items get filed, then they sit unfinished for months. Six months later, a similar incident happens.

Habits that work:

Action items appear in normal sprint planning. They're work like any other work. Don't treat them as side projects.
Quarterly review of all postmortem action items. Engineering leadership reviews completion status. Incomplete items get escalated or explicitly deferred (with reason documented).
SEV-1 and SEV-2 incidents block new feature work for the team that owns the broken system, until critical action items are complete. This sounds harsh but is the only mechanism that actually works.
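For the quarterly review, a short report of open, overdue items is often enough to start the conversation. The sketch below assumes action items have been exported from the tracker as dictionaries with `ticket`, `owner`, `due`, and `done` keys. That shape, the ticket IDs, and the owner names are assumptions for illustration, not a real tracker's export format.

```python
from datetime import date

def overdue_items(items, today):
    """Return unfinished items past their due date, oldest first."""
    late = [i for i in items if not i["done"] and i["due"] < today]
    return sorted(late, key=lambda i: i["due"])

# Hypothetical export from the team's tracker.
items = [
    {"ticket": "OPS-101", "owner": "sam", "due": date(2026, 6, 1), "done": True},
    {"ticket": "OPS-102", "owner": "lee", "due": date(2026, 7, 15), "done": False},
    {"ticket": "OPS-103", "owner": "ana", "due": date(2026, 6, 20), "done": False},
    {"ticket": "OPS-104", "owner": "kim", "due": date(2026, 12, 1), "done": False},
]

late = overdue_items(items, today=date(2026, 9, 1))
for item in late:
    print(item["ticket"], item["owner"], item["due"].isoformat())
```

Each item in the output then gets one of the outcomes described above: completed, escalated, or explicitly deferred with a documented reason.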

08 Severity classifications

A simple system that scales:

SEV-1: total or near-total outage affecting all users. Multiple engineers paged. Public-facing communication required. Postmortem mandatory within 5 business days.
SEV-2: partial outage, major feature broken, significant subset of users affected. One engineer paged. Customer support informed. Postmortem mandatory within 10 business days.
SEV-3: minor issue, small subset of users, workarounds exist. Logged in normal channels. Postmortem optional.
SEV-4: internal-only, no user impact. Bug ticket, no postmortem.
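The postmortem deadlines above are easy to compute mechanically. Here is a minimal sketch, assuming business days means Monday through Friday and ignoring public holidays. The function and constant names are invented for this example.

```python
from datetime import date, timedelta

# Deadlines from the classification above; SEV-3 and SEV-4 have none.
POSTMORTEM_DEADLINE_DAYS = {"SEV-1": 5, "SEV-2": 10}

def add_business_days(start, days):
    """Advance `start` by `days` weekdays (Mon-Fri), ignoring holidays."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday through Friday
            remaining -= 1
    return current

def postmortem_due(severity, incident_date):
    """Return the postmortem due date, or None if none is mandatory."""
    days = POSTMORTEM_DEADLINE_DAYS.get(severity)
    if days is None:
        return None
    return add_business_days(incident_date, days)

incident = date(2026, 5, 24)  # a Sunday
print(postmortem_due("SEV-1", incident))  # 2026-05-29
print(postmortem_due("SEV-2", incident))  # 2026-06-05
print(postmortem_due("SEV-3", incident))  # None
```

Putting the computed date on the incident ticket when it is opened makes the deadline visible from the start.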

Don't downgrade incidents to avoid writing postmortems. The temptation is real — postmortems are work. But the value of the postmortem is the prevention of future incidents, which compounds across years.

09 Publishing postmortems

Internally, all postmortems should be readable by all engineers. Default to open access; restrict only if there's a legal reason. The knowledge transfer is half the point.

Externally — should you publish postmortems to customers? Companies that do (Cloudflare, GitHub, Stripe, AWS) build trust through transparency. The internal pushback is usually "what if this looks bad?" The reality is: customers know you have incidents. What they don't know is how you handle them. A well-written public postmortem demonstrates competence and rebuilds trust faster than any apology email.

10 The postmortem anti-patterns

"Human error" as a root cause. Humans make mistakes. The question is why the system allowed the mistake to cause damage. "Human error" is never the root cause; it's a symptom of insufficient guardrails.

Vague action items. "Improve testing." "Better monitoring." "More careful deploys." None of these are work. Replace with specific, ownable, verifiable items.

Postmortems that read like legal documents. If you're afraid to write what actually happened, the doc isn't useful. Write the truth as if no one's reputation depended on the framing — because the system improvements depend on the accuracy.

One person writing the postmortem alone. The owner drafts it, but the team reviews. Multiple perspectives surface causes a single perspective misses.

∞ The compound

Postmortems are an investment in the team's collective intelligence. Each one teaches the org a lesson about how the system actually behaves under stress. Over years, those lessons compound — the team becomes harder to break in ways that have been seen before.

Teams that do postmortems well don't have fewer incidents (incidents are inevitable). They have shorter, less severe ones, and they don't have the same incident twice. That's the whole point.