What is MTTR?

MTTR (Mean Time to Recovery) measures how quickly your team resolves incidents. It's one of the most important metrics for understanding your incident response effectiveness.

MTTR definition

Mean Time to Recovery (MTTR) is the average time it takes to restore a service to full functionality after an incident occurs. It's calculated from when an incident starts to when it's fully resolved.

MTTR formula

MTTR = Total Downtime / Number of Incidents

Example: three incidents with 20, 30, and 40 minutes of downtime give an MTTR of (20 + 30 + 40) / 3 = 30 minutes.
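As a minimal sketch of the same calculation in code, assuming you can export each incident's start and resolution timestamps, the formula might look like this. The incident list and the function name `mttr_minutes` are hypothetical and only illustrate the arithmetic.

```python
from datetime import datetime

# Hypothetical incident log: (started, resolved) timestamps for each incident.
incidents = [
    (datetime(2024, 1, 5, 9, 0), datetime(2024, 1, 5, 9, 20)),      # 20 minutes
    (datetime(2024, 1, 12, 14, 0), datetime(2024, 1, 12, 14, 30)),  # 30 minutes
    (datetime(2024, 1, 20, 22, 0), datetime(2024, 1, 20, 22, 40)),  # 40 minutes
]


def mttr_minutes(incidents):
    """Return total downtime divided by the number of incidents, in minutes."""
    if not incidents:
        raise ValueError("MTTR is undefined when there are no incidents")
    total_seconds = sum((end - start).total_seconds() for start, end in incidents)
    return total_seconds / 60 / len(incidents)


print(mttr_minutes(incidents))  # 30.0
```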

The four MTTx metrics

MTTR is often confused with related metrics. Here's how they differ:

Metric | Measures                   | Starts when...   | Ends when...
MTTD   | Mean Time to Detect        | Incident occurs  | Team becomes aware
MTTA   | Mean Time to Acknowledge   | Alert fires      | Someone responds
MTTR   | Mean Time to Recovery      | Incident occurs  | Service restored
MTBF   | Mean Time Between Failures | Service restored | Next incident
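To make the start and end points concrete, here is a rough sketch that derives all four metrics from per-incident timestamps. The `Incident` fields and the sample data are assumptions for illustration; your incident tool may record these moments under different names.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Incident:
    occurred: datetime      # incident occurs
    detected: datetime      # team becomes aware
    alerted: datetime       # alert fires
    acknowledged: datetime  # someone responds
    restored: datetime      # service restored


def mean_minutes(deltas):
    """Average a non-empty list of timedeltas and return the result in minutes."""
    return sum(d.total_seconds() for d in deltas) / 60 / len(deltas)


def mttx(incidents):
    """Compute MTTD, MTTA, MTTR and (with 2+ incidents) MTBF in minutes."""
    ordered = sorted(incidents, key=lambda i: i.occurred)
    metrics = {
        "MTTD": mean_minutes([i.detected - i.occurred for i in ordered]),
        "MTTA": mean_minutes([i.acknowledged - i.alerted for i in ordered]),
        "MTTR": mean_minutes([i.restored - i.occurred for i in ordered]),
    }
    # MTBF runs from one recovery to the start of the next incident.
    gaps = [nxt.occurred - prev.restored for prev, nxt in zip(ordered, ordered[1:])]
    if gaps:
        metrics["MTBF"] = mean_minutes(gaps)
    return metrics


# Hypothetical sample data.
incidents = [
    Incident(
        occurred=datetime(2024, 3, 1, 10, 0),
        detected=datetime(2024, 3, 1, 10, 3),
        alerted=datetime(2024, 3, 1, 10, 2),
        acknowledged=datetime(2024, 3, 1, 10, 5),
        restored=datetime(2024, 3, 1, 10, 30),
    ),
    Incident(
        occurred=datetime(2024, 3, 2, 10, 30),
        detected=datetime(2024, 3, 2, 10, 35),
        alerted=datetime(2024, 3, 2, 10, 34),
        acknowledged=datetime(2024, 3, 2, 10, 39),
        restored=datetime(2024, 3, 2, 11, 20),
    ),
]

print(mttx(incidents))
# {'MTTD': 4.0, 'MTTA': 4.0, 'MTTR': 40.0, 'MTBF': 1440.0}
```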

Why MTTR matters

Customer impact

Every minute of downtime affects customers. A lower MTTR means less time your customers spend frustrated, unable to use your service.

SLA compliance

Your SLA commits to a certain uptime percentage, and MTTR directly affects whether you can meet that commitment. Total downtime is roughly the number of incidents multiplied by MTTR, so faster recovery means less downtime and better SLA performance.
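A quick way to see the relationship is to turn an uptime percentage into a downtime budget and ask how many average-length incidents fit inside it. This is a simplified sketch: the 30-day period and the function names are assumptions, and real SLAs often define measurement windows and exclusions differently.

```python
def downtime_budget_minutes(sla_percent, period_days=30):
    """Minutes of downtime allowed in the period at a given uptime percentage."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


def incidents_within_budget(sla_percent, mttr_minutes, period_days=30):
    """How many incidents of average length mttr_minutes fit in the budget."""
    return int(downtime_budget_minutes(sla_percent, period_days) // mttr_minutes)


print(round(downtime_budget_minutes(99.9), 1))  # 43.2
print(incidents_within_budget(99.9, 30))        # 1
print(incidents_within_budget(99.9, 10))        # 4
```

Under these assumptions, a 99.9% monthly target leaves about 43 minutes of downtime: one incident at a 30-minute MTTR, or four at a 10-minute MTTR.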

Team effectiveness

MTTR reveals how well your incident response process works. A high MTTR might indicate problems with alerting, runbooks, communication, or technical debt.

How to improve MTTR

1. Reduce detection time (MTTD)

You can't fix what you don't know about. Fast, reliable monitoring with appropriate alert thresholds ensures you detect incidents quickly.

  • Use 1-minute check intervals for critical services
  • Monitor from multiple regions to reduce false positives
  • Set up keyword monitoring to catch partial failures
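The last two points can be sketched as plain decision logic. This is an illustrative example, not any monitoring product's implementation: `check_passed`, `should_alert`, the region names, and the two-region threshold are all assumptions.

```python
def check_passed(status_code, body, keyword):
    """A check passes only if the status is 200 and the expected keyword appears.

    The keyword test catches partial failures, such as an error page
    that is still served with a 200 status.
    """
    return status_code == 200 and keyword in body


def should_alert(region_results, min_failing_regions=2):
    """Alert only when at least min_failing_regions regions report a failure.

    Requiring agreement between regions filters out failures caused by
    a single probe's network path rather than by the service itself.
    """
    failing = [region for region, passed in region_results.items() if not passed]
    return len(failing) >= min_failing_regions


# Hypothetical results from three probe regions.
results = {
    "us-east": check_passed(200, "<h1>Welcome back</h1>", "Welcome"),
    "eu-west": check_passed(200, "<h1>Service unavailable</h1>", "Welcome"),
    "ap-south": check_passed(503, "", "Welcome"),
}

print(should_alert(results))  # True: two of three regions failed
```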

2. Reduce acknowledgment time (MTTA)

Alerts need to reach the right person immediately. Optimize your on-call rotation and alert routing.

  • Route alerts to the team that owns the service
  • Use escalation policies so alerts don't get ignored
  • Send alerts to multiple channels (Slack AND SMS for critical issues)
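An escalation policy can be thought of as an ordered list of steps, each firing after the alert has gone unacknowledged for a set time. The sketch below is hypothetical: the roles, delays, and channel names are placeholders, and real on-call tools have their own configuration formats.

```python
from dataclasses import dataclass


@dataclass
class EscalationStep:
    after_minutes: int  # minutes without acknowledgment before this step fires
    notify: str         # who gets notified at this step
    channels: tuple     # where the notification is sent


# Hypothetical policy for a critical service.
POLICY = [
    EscalationStep(0, "primary on-call", ("slack", "sms")),
    EscalationStep(5, "secondary on-call", ("slack", "sms")),
    EscalationStep(15, "engineering manager", ("sms",)),
]


def steps_due(policy, minutes_unacknowledged):
    """Return every step that should have fired by now, in order."""
    return [s for s in policy if s.after_minutes <= minutes_unacknowledged]


for step in steps_due(POLICY, 7):
    print(step.notify, step.channels)
# primary on-call ('slack', 'sms')
# secondary on-call ('slack', 'sms')
```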

3. Reduce diagnosis time

Help responders understand what's wrong quickly.

  • Include context in alerts (what check failed, response time, error message)
  • Maintain runbooks for common incident types
  • Ensure easy access to logs and dashboards
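As one possible shape for a context-rich alert, the sketch below puts the failed check, the error, the response time, and links to the runbook and dashboard into a single message. The function name, field layout, and example.com URLs are placeholders.

```python
def format_alert(check_name, url, error, response_time_ms, runbook_url, dashboard_url):
    """Build an alert message that carries the context a responder needs."""
    return "\n".join([
        f"[ALERT] {check_name} failed",
        f"URL: {url}",
        f"Error: {error}",
        f"Response time: {response_time_ms} ms",
        f"Runbook: {runbook_url}",
        f"Dashboard: {dashboard_url}",
    ])


# Hypothetical values for illustration.
alert = format_alert(
    check_name="checkout-api",
    url="https://example.com/api/checkout",
    error="HTTP 502 Bad Gateway",
    response_time_ms=5200,
    runbook_url="https://example.com/runbooks/checkout-502",
    dashboard_url="https://example.com/dashboards/checkout",
)
print(alert)
```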

4. Reduce repair time

Make fixes fast and safe to deploy.

  • Enable quick rollbacks for bad deployments
  • Maintain clear documentation for common fixes
  • Automate repetitive remediation tasks
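Automated remediation often starts with a simple rule. As a hedged sketch, the function below decides whether to roll back a deployment by comparing error rates before and after it. The name `should_roll_back` and the 2-percentage-point threshold are assumptions; pick values that fit your own traffic.

```python
def should_roll_back(error_rate_before, error_rate_after, max_increase=0.02):
    """Return True if the error rate rose by more than max_increase after a deploy.

    Rates are fractions of failed requests, e.g. 0.01 means 1%.
    """
    return (error_rate_after - error_rate_before) > max_increase


print(should_roll_back(0.01, 0.05))   # True: errors rose by 4 points
print(should_roll_back(0.01, 0.015))  # False: within tolerance
```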

MTTR benchmarks

"Good" MTTR varies by industry and service criticality:

Service Type                 | Target MTTR
Critical (payments, auth)    | < 15 minutes
High priority (core features) | < 1 hour
Medium priority              | < 4 hours
Low priority                 | < 24 hours

Focus on consistent improvement rather than hitting arbitrary targets. If your MTTR is 2 hours, aim for 1.5 hours, then 1 hour, and so on.
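If you do track targets, the benchmark tiers can be encoded as a simple lookup. The priority keys and the function name below are hypothetical, and the thresholds are only the example targets from this section converted to minutes.

```python
# Example targets from this section, in minutes.
TARGET_MINUTES = {
    "critical": 15,   # payments, auth
    "high": 60,       # core features
    "medium": 240,    # 4 hours
    "low": 1440,      # 24 hours
}


def meets_target(priority, mttr_minutes):
    """Return True if the measured MTTR is under the target for its tier."""
    return mttr_minutes < TARGET_MINUTES[priority]


print(meets_target("critical", 12))  # True
print(meets_target("high", 120))     # False: a 2-hour MTTR misses the 1-hour target
```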

FAQ

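What is a good MTTR?

It depends on your service and SLA. For critical services, aim for under 1 hour, and under 15 minutes for the most critical paths such as payments and authentication. For less critical services, 4-8 hours may be acceptable. What matters most is consistent improvement over time.

How do you calculate MTTR?

MTTR = Total downtime / Number of incidents. For example, if you had 3 incidents totaling 90 minutes of downtime, MTTR = 90 / 3 = 30 minutes.

What is the difference between MTTD and MTTR?

MTTD (Mean Time to Detect) measures how long it takes to discover an incident. MTTR measures how long it takes to restore service. Because MTTR is measured from the moment the incident occurs, it includes detection time.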
