Incident Postmortem Template (With Example)
A good postmortem turns a stressful outage into concrete follow-ups. This template is designed to be short, blame-free, and actionable.
Use UTC in the timeline (or clearly label your timezone). Focus on what happened, what you learned, and what you'll change.
Template
Copy it as Markdown, paste it into your docs or wiki, and fill it in.
# Incident Postmortem: [INCIDENT NAME]

**Date:** [YYYY-MM-DD]
**Severity:** [SEV-1/SEV-2/SEV-3]
**Status page incident:** [LINK OR N/A]
**Owner:** [NAME]

## 1) Summary

- What happened (1–3 sentences)?
- What did users experience?
- How was it resolved?

## 2) Impact

- **Customer impact:** [WHO/WHAT WAS IMPACTED]
- **User-visible symptoms:** [ERRORS / LATENCY / TIMEOUTS]
- **Duration:** [START → END] (include timezone; UTC preferred)
- **Scope:** [REGIONS / TIERS / PERCENT AFFECTED]
- **Business impact:** [OPTIONAL: revenue, support load, SLA credits]

## 3) Timeline (UTC recommended)

> Tip: keep this factual. You can add decision notes in “Detection & response”.

- [HH:MM] [EVENT]
- [HH:MM] [EVENT]
- [HH:MM] [EVENT]

## 4) Root cause

- Primary cause: [WHAT FAILED]
- Contributing factors: [CONFIG, DEPENDENCY, CAPACITY, DEPLOY, HUMAN FACTORS]
- Why it wasn’t caught earlier: [GAP IN MONITORING/TESTING/PROCESS]

## 5) Detection & response

- How did we detect it? [ALERT / CUSTOMER REPORT / INTERNAL DASHBOARD]
- Time to detect (TTD): [X]
- Time to mitigate (TTM): [X]
- Time to resolve (TTR): [X]
- What went well?
- What was confusing or slowed us down?

## 6) Resolution

- What fixed it? [CHANGE / ROLLBACK / MITIGATION]
- Verification: [HOW WE CONFIRMED HEALTH]

## 7) Corrective actions (owners + due dates)

> Keep this list small and high-signal. Prefer preventative changes over “be more careful”.

| Action | Owner | Due date | Status |
| --- | --- | --- | --- |
| [ACTION] | [OWNER] | [YYYY-MM-DD] | [Not started/In progress/Done] |

## 8) Prevention / follow-ups

- Monitoring improvements: [WHAT ALERTS / CHECKS WILL CHANGE]
- Runbook updates: [LINKS]
- Product/infra changes: [WHAT WILL CHANGE]

---

Generated from: https://updog.watch/postmortem-template
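One quick way to fill in the TTD/TTM/TTR fields in section 5 is to derive them from the same timestamps you record in the timeline. The Python sketch below shows one common convention, measuring each interval from the start of user impact; the timestamps are hypothetical placeholders, and some teams measure from detection instead, so adjust to whatever your team has agreed on and note the convention in the postmortem.

```python
from datetime import datetime

# Hypothetical UTC timestamps; replace with the values from your own timeline.
impact_start = datetime.fromisoformat("2025-01-01T14:00:00+00:00")
detected     = datetime.fromisoformat("2025-01-01T14:02:00+00:00")  # first alert or customer report
mitigated    = datetime.fromisoformat("2025-01-01T14:20:00+00:00")  # user impact stopped
resolved     = datetime.fromisoformat("2025-01-01T14:35:00+00:00")  # declared resolved after a stability window

ttd = detected - impact_start   # time to detect
ttm = mitigated - impact_start  # time to mitigate
ttr = resolved - impact_start   # time to resolve

print(f"TTD: {ttd}  TTM: {ttm}  TTR: {ttr}")
# TTD: 0:02:00  TTM: 0:20:00  TTR: 0:35:00
```

Whichever convention you pick, state it once in the postmortem so the numbers stay comparable across incidents.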
Filled example (realistic, generic)
This is an example of a short, “just enough detail” postmortem. It avoids blame and focuses on actions.
# Incident Postmortem: Checkout API 502s due to upstream connection pool exhaustion

**Date:** 2025-11-12
**Severity:** SEV-1
**Status page incident:** https://status.example.com/incidents/abc123
**Owner:** On-call (Platform)

## 1) Summary

Between 14:07–14:41 UTC, the checkout API returned elevated 502/504 errors. The errors were caused by the payments upstream connection pool exhausting after a deploy increased concurrency without a matching increase in pool size. We mitigated by rolling back the deploy and increasing pool limits.

## 2) Impact

- **Customer impact:** Some users could not complete purchases.
- **User-visible symptoms:** 502/504 errors from /checkout; elevated latency on retries.
- **Duration:** 14:07 → 14:41 UTC
- **Scope:** ~18% of checkout attempts failed during peak.
- **Business impact:** Increased support tickets; lost conversions during the incident window.

## 3) Timeline (UTC)

- 14:07 Alerts triggered for elevated 5xx on checkout API.
- 14:10 On-call acknowledged, began triage (recent deploy review).
- 14:16 Error rates correlated with payments upstream timeouts.
- 14:22 Mitigation attempt: reduced concurrency flag (partial improvement).
- 14:28 Rolled back deploy.
- 14:33 5xx rate returned to baseline.
- 14:41 Incident marked resolved after stability window.

## 4) Root cause

- Primary cause: Upstream payments connection pool exhausted under increased concurrency.
- Contributing factors: Missing load test coverage for new concurrency settings.
- Why it wasn’t caught earlier: No alert on pool saturation; limited pre-prod traffic simulation.

## 5) Detection & response

- How did we detect it? Automated uptime/5xx alerts.
- Time to detect (TTD): ~0–2 minutes.
- Time to mitigate (TTM): ~21 minutes.
- Time to resolve (TTR): ~34 minutes.
- What went well? Fast detection; clear rollback process.
- What slowed us down? No dashboard/alert for pool utilization; initial mitigation was incomplete.

## 6) Resolution

- What fixed it? Rollback + increased pool size; reduced concurrency until verification.
- Verification: 5xx rate back to baseline; synthetic checkout succeeded; latency normalized.

## 7) Corrective actions (owners + due dates)

| Action | Owner | Due date | Status |
| --- | --- | --- | --- |
| Add alert on upstream pool saturation | Platform | 2025-11-19 | In progress |
| Add load test scenario for concurrency flag | API team | 2025-11-26 | Not started |
| Update deploy checklist to include pool capacity review | Platform | 2025-11-20 | Not started |

## 8) Prevention / follow-ups

- Monitoring improvements: add saturation and timeout alerts; lower-noise thresholds.
- Runbook updates: add “pool exhaustion” checklist and links to dashboards.
- Product/infra changes: enforce safe concurrency defaults; gradual rollout.
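In this example, detection came from automated 5xx alerts and verification included a successful synthetic checkout. The sketch below is a minimal, standalone version of that kind of probe; the URL, timeout, and "page the on-call" reaction are placeholder assumptions, and in practice you would run this from your uptime or monitoring tooling on a schedule rather than as an ad-hoc script.

```python
import urllib.error
import urllib.request

# Hypothetical probe target; point this at your own checkout health endpoint.
CHECK_URL = "https://shop.example.com/checkout/health"
TIMEOUT_SECONDS = 5

def synthetic_check(url: str) -> bool:
    """Return True if the endpoint answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=TIMEOUT_SECONDS) as resp:
            return 200 <= resp.status < 300
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses raise HTTPError; a 502/504 here matches the incident symptoms.
        print(f"Check failed: HTTP {err.code}")
        return False
    except (urllib.error.URLError, TimeoutError) as err:
        # Covers DNS failures, connection resets, and timeouts.
        print(f"Check failed: {err}")
        return False

if __name__ == "__main__":
    if not synthetic_check(CHECK_URL):
        print("Unhealthy: page the on-call and update the status page.")
```

Using only the standard library keeps the sketch copy-pasteable, but a real check should also record latency and alert on sustained failures rather than a single bad response.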
FAQ
What should a postmortem include?
A summary, impact, timeline, root cause, what worked/didn’t, and a short list of corrective actions with owners and due dates.

How long should it be?
Long enough to support prevention. Many effective postmortems are 1–3 pages. Clarity beats length.

How do we keep it blameless?
Focus on systems and conditions. Write “the deploy increased concurrency without a matching pool limit” rather than “Alice messed up.”

What makes a good corrective action?
Something specific, owned, and testable: add an alert, add a guardrail, add a test, improve a runbook. Avoid vague actions like “be more careful.”

Should postmortems be public?
Sometimes. For customer-facing incidents, a short public summary can build trust. Keep sensitive details in an internal version.
Make incident detection and response easier
The best postmortems start with clean timelines: fast detection, clear alerts, and a status page to communicate with users.