What is MTTR?
MTTR (Mean Time to Recovery) measures how quickly your team resolves incidents. It's one of the most important metrics for understanding your incident response effectiveness.
MTTR definition
Mean Time to Recovery (MTTR) is the average time it takes to restore a service to full functionality after an incident occurs. It's calculated from when an incident starts to when it's fully resolved.
MTTR formula
MTTR = Total Downtime / Number of Incidents
Example: 3 incidents with 20, 30, and 40 minutes of downtime give an MTTR of (20 + 30 + 40) / 3 = 30 minutes.
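The same arithmetic can be sketched in a few lines of Python. The function name `mean_time_to_recovery` is illustrative, and the input is assumed to be a list of per-incident downtimes in minutes.

```python
def mean_time_to_recovery(downtimes_minutes):
    """Return MTTR in minutes: total downtime divided by the number of incidents."""
    if not downtimes_minutes:
        # With zero incidents the average is undefined, so fail loudly.
        raise ValueError("MTTR is undefined when there are no incidents")
    return sum(downtimes_minutes) / len(downtimes_minutes)


# The worked example from the text: three incidents of 20, 30, and 40 minutes.
print(mean_time_to_recovery([20, 30, 40]))  # 30.0
```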
The four MTTx metrics
MTTR is often confused with related metrics. Here's how they differ:
| Metric | Stands for | Starts when... | Ends when... |
|---|---|---|---|
| MTTD | Mean Time to Detect | Incident occurs | Team becomes aware |
| MTTA | Mean Time to Acknowledge | Alert fires | Someone responds |
| MTTR | Mean Time to Recovery | Incident occurs | Service restored |
| MTBF | Mean Time Between Failures | Service restored | Next incident |
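To make the start and end points in the table concrete, here is a minimal sketch that derives all four metrics from incident timestamps. The `Incident` record and its field names are hypothetical, and it makes one simplifying assumption: the moment the alert fires is treated as the moment the team becomes aware.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class Incident:
    occurred: datetime      # incident starts
    detected: datetime      # alert fires (assumed to equal "team becomes aware")
    acknowledged: datetime  # someone responds
    restored: datetime      # service restored


def _mean_minutes(deltas):
    """Average a list of timedeltas and return the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


def mttx(incidents):
    """Compute MTTD, MTTA, MTTR, and MTBF (all in minutes) for at least one incident."""
    ordered = sorted(incidents, key=lambda i: i.occurred)
    gaps = [nxt.occurred - prev.restored for prev, nxt in zip(ordered, ordered[1:])]
    return {
        "MTTD": _mean_minutes([i.detected - i.occurred for i in ordered]),
        "MTTA": _mean_minutes([i.acknowledged - i.detected for i in ordered]),
        "MTTR": _mean_minutes([i.restored - i.occurred for i in ordered]),
        # MTBF needs at least two incidents to have a gap to measure.
        "MTBF": _mean_minutes(gaps) if gaps else None,
    }


incidents = [
    Incident(datetime(2024, 1, 1, 12, 0), datetime(2024, 1, 1, 12, 2),
             datetime(2024, 1, 1, 12, 5), datetime(2024, 1, 1, 12, 20)),
    Incident(datetime(2024, 1, 1, 14, 20), datetime(2024, 1, 1, 14, 24),
             datetime(2024, 1, 1, 14, 25), datetime(2024, 1, 1, 14, 50)),
]
print(mttx(incidents))
# {'MTTD': 3.0, 'MTTA': 2.0, 'MTTR': 25.0, 'MTBF': 120.0}
```

Note that MTTR here is measured from when the incident occurred, so it includes the detection and acknowledgment time.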
Why MTTR matters
Customer impact
Every minute of downtime affects customers. A lower MTTR means less time your customers spend frustrated, unable to use your service.
SLA compliance
Your SLA commits to a certain uptime percentage. MTTR directly affects whether you can meet that commitment. Faster recovery = less downtime = better SLA performance.
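As a rough illustration of that relationship, the sketch below converts an uptime percentage into a downtime budget and estimates how many incidents of a given MTTR fit inside it. The 99.9% target and 30-day period are example values, not figures from any particular SLA, and the function names are illustrative.

```python
def downtime_budget_minutes(uptime_percent, period_days=30):
    """Minutes of downtime allowed in the period by an uptime commitment."""
    period_minutes = period_days * 24 * 60
    return period_minutes * (100 - uptime_percent) / 100


def incidents_within_budget(uptime_percent, mttr_minutes, period_days=30):
    """How many incidents of average length mttr_minutes fit in the budget."""
    return downtime_budget_minutes(uptime_percent, period_days) / mttr_minutes


budget = downtime_budget_minutes(99.9)
print(round(budget, 1))                             # 43.2
print(round(incidents_within_budget(99.9, 30), 2))  # 1.44
print(round(incidents_within_budget(99.9, 10), 2))  # 4.32
```

Under these example numbers, cutting MTTR from 30 minutes to 10 roughly triples the number of incidents the budget can absorb.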
Team effectiveness
MTTR reveals how well your incident response process works. A high MTTR might indicate problems with alerting, runbooks, communication, or technical debt.
How to improve MTTR
1. Reduce detection time (MTTD)
You can't fix what you don't know about. Fast, reliable monitoring with appropriate alert thresholds ensures you detect incidents quickly.
- Use 1-minute check intervals for critical services
- Monitor from multiple regions to reduce false positives
- Set up keyword monitoring to catch partial failures
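Two of these ideas can be sketched as plain functions. This is a simplified illustration rather than how any particular monitoring product works: `is_down`, `check_passes`, the region names, and the two-region threshold are all assumptions.

```python
def check_passes(status_code, body, required_keyword):
    """Keyword check: a 200 response still fails if expected content is missing."""
    return status_code == 200 and required_keyword in body


def is_down(region_results, min_failing_regions=2):
    """Multi-region check: only declare an outage when enough regions agree.

    region_results maps a region name to True (check passed) or False (failed).
    """
    failing = [region for region, ok in region_results.items() if not ok]
    return len(failing) >= min_failing_regions


# One region failing alone is treated as a probable false positive.
print(is_down({"us-east": False, "eu-west": True, "ap-south": True}))   # False
print(is_down({"us-east": False, "eu-west": False, "ap-south": True}))  # True

# The page returns 200 but renders an error instead of the checkout button.
print(check_passes(200, "<h1>Error loading cart</h1>", "Checkout"))     # False
```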
2. Reduce acknowledgment time (MTTA)
Alerts need to reach the right person immediately. Optimize your on-call rotation and alert routing.
- Route alerts to the team that owns the service
- Use escalation policies so alerts don't get ignored
- Send alerts to multiple channels (Slack AND SMS for critical issues)
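An escalation policy can be thought of as a list of steps, each unlocked after the alert has gone unacknowledged for a given time. The sketch below is hypothetical: the channel names, the 5 and 15 minute delays, and the `targets_to_notify` helper are placeholders, not a real alerting API.

```python
# Each step: (minutes without acknowledgment, targets to notify at that point).
ESCALATION_POLICY = [
    (0, ["slack:#payments-oncall", "sms:primary-oncall"]),  # Slack AND SMS at once
    (5, ["sms:secondary-oncall"]),
    (15, ["sms:engineering-manager"]),
]


def targets_to_notify(minutes_unacknowledged, policy=ESCALATION_POLICY):
    """Return every target that should have been notified by now."""
    targets = []
    for delay_minutes, step_targets in policy:
        if minutes_unacknowledged >= delay_minutes:
            targets.extend(step_targets)
    return targets


print(targets_to_notify(0))
# ['slack:#payments-oncall', 'sms:primary-oncall']
print(targets_to_notify(6))
# ['slack:#payments-oncall', 'sms:primary-oncall', 'sms:secondary-oncall']
```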
3. Reduce diagnosis time
Help responders understand what's wrong quickly.
- Include context in alerts (what check failed, response time, error message)
- Maintain runbooks for common incident types
- Ensure easy access to logs and dashboards
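One way to apply these points is to put the context and the links directly in the alert body, so the responder does not have to hunt for them. The sketch below is illustrative only: `format_alert`, the check name, and the example.com URLs are made up.

```python
def format_alert(check_name, error_message, response_time_ms, runbook_url, dashboard_url):
    """Build an alert body that says what failed and where to look next."""
    return "\n".join([
        f"Check failed: {check_name}",
        f"Error: {error_message}",
        f"Response time: {response_time_ms} ms",
        f"Runbook: {runbook_url}",
        f"Dashboard: {dashboard_url}",
    ])


message = format_alert(
    check_name="checkout-api health check",
    error_message="HTTP 503 Service Unavailable",
    response_time_ms=8420,
    runbook_url="https://wiki.example.com/runbooks/checkout-503",    # placeholder URL
    dashboard_url="https://dashboards.example.com/checkout",         # placeholder URL
)
print(message)
```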
4. Reduce repair time
Make fixes fast and safe to deploy.
- Enable quick rollbacks for bad deployments
- Maintain clear documentation for common fixes
- Automate repetitive remediation tasks
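As one example of automating a common fix, the decision to roll back a bad deployment can be reduced to a simple rule. This is a sketch under assumed thresholds: the 15-minute watch window, the 2x tolerance, and the 1% floor are illustrative values, and `should_roll_back` is a hypothetical helper that leaves the actual rollback to your deployment tooling.

```python
def should_roll_back(baseline_error_rate, current_error_rate, minutes_since_deploy,
                     watch_window_minutes=15, tolerance=2.0, min_error_rate=0.01):
    """Suggest a rollback when errors spike shortly after a deployment.

    Error rates are fractions of requests (0.05 means 5%).
    """
    recently_deployed = minutes_since_deploy <= watch_window_minutes
    # The floor avoids rolling back over noise when the baseline is near zero.
    threshold = max(baseline_error_rate * tolerance, min_error_rate)
    return recently_deployed and current_error_rate > threshold


print(should_roll_back(0.01, 0.05, minutes_since_deploy=5))   # True
print(should_roll_back(0.01, 0.05, minutes_since_deploy=60))  # False
print(should_roll_back(0.01, 0.015, minutes_since_deploy=5))  # False
```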
MTTR benchmarks
"Good" MTTR varies by industry and service criticality:
| Service Type | Target MTTR |
|---|---|
| Critical (payments, auth) | < 15 minutes |
| High priority (core features) | < 1 hour |
| Medium priority | < 4 hours |
| Low priority | < 24 hours |
Focus on consistent improvement rather than hitting arbitrary targets. If your MTTR is 2 hours, aim for 1.5 hours, then 1 hour, and so on.
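A small sketch of how the targets above could be checked in code, together with the step-by-step goal setting just described. The priority keys, the 30-minute step, and the function names are assumptions for illustration.

```python
# Target MTTR in minutes, taken from the benchmark table above.
TARGET_MTTR_MINUTES = {"critical": 15, "high": 60, "medium": 240, "low": 24 * 60}


def meets_target(priority, mttr_minutes):
    """True if the measured MTTR is under the target for this priority."""
    return mttr_minutes < TARGET_MTTR_MINUTES[priority]


def next_goal(current_mttr_minutes, step_minutes=30):
    """Pick the next incremental goal rather than jumping straight to the target."""
    return max(current_mttr_minutes - step_minutes, 0)


print(meets_target("critical", 12))  # True
print(meets_target("high", 120))     # False
print(next_goal(120))                # 90, i.e. from 2 hours to 1.5 hours
```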
Reduce your MTTD
Fast detection is the first step to fast recovery. Monitor your services with 1-minute checks.
- 1-minute check intervals
- Instant Slack/SMS alerts
- Multi-region monitoring
- PagerDuty integration