What is MTTR?

MTTR (Mean Time to Recovery) measures how quickly your team resolves incidents. It's one of the most important metrics for understanding your incident response effectiveness.

MTTR definition

Mean Time to Recovery (MTTR) is the average time it takes to restore a service to full functionality after an incident occurs. It's calculated from when an incident starts to when it's fully resolved.

MTTR Formula

MTTR = Total Downtime / Number of Incidents

Example: three incidents with 20, 30, and 40 minutes of downtime give an MTTR of (20 + 30 + 40) / 3 = 30 minutes.
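In practice the downtime figures usually come from incident timestamps rather than hand-entered minutes. The following is a minimal sketch of the formula above, assuming each incident is recorded as a (started, resolved) pair; the function name and the sample dates are hypothetical.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Return MTTR as a timedelta for a list of (started, resolved) datetime pairs."""
    if not incidents:
        raise ValueError("MTTR is undefined when there are no incidents")
    total_downtime = sum(
        (resolved - started for started, resolved in incidents), timedelta()
    )
    return total_downtime / len(incidents)


# Hypothetical incident log matching the example: 20, 30, and 40 minutes of downtime.
incidents = [
    (datetime(2024, 1, 3, 9, 0), datetime(2024, 1, 3, 9, 20)),
    (datetime(2024, 1, 12, 14, 0), datetime(2024, 1, 12, 14, 30)),
    (datetime(2024, 1, 25, 22, 0), datetime(2024, 1, 25, 22, 40)),
]

mttr = mean_time_to_recovery(incidents)
print(mttr)  # 0:30:00
```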

The four MTTx metrics

MTTR is often confused with related metrics. Here's how they differ:

Metric	Stands for	Starts when...	Ends when...
MTTD	Mean Time to Detect	Incident occurs	Team becomes aware
MTTA	Mean Time to Acknowledge	Alert fires	Someone responds
MTTR	Mean Time to Recovery	Incident occurs	Service restored
MTBF	Mean Time Between Failures	Service restored	Next incident
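The start and end points in the table map directly onto timestamp differences. Here is a rough sketch that computes all four metrics from one incident log. The Incident fields are hypothetical, and it assumes the alert fires at the moment of detection, so MTTA is measured from the detected timestamp.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class Incident:
    occurred: datetime
    detected: datetime      # assumption: the alert fires at this moment
    acknowledged: datetime
    restored: datetime


def _mean_minutes(deltas):
    return mean(d.total_seconds() for d in deltas) / 60


def mttx_summary(incidents):
    """Return MTTD, MTTA, MTTR, and MTBF in minutes (MTBF is None with fewer than 2 incidents)."""
    ordered = sorted(incidents, key=lambda i: i.occurred)
    summary = {
        "MTTD": _mean_minutes([i.detected - i.occurred for i in ordered]),
        "MTTA": _mean_minutes([i.acknowledged - i.detected for i in ordered]),
        "MTTR": _mean_minutes([i.restored - i.occurred for i in ordered]),
    }
    # MTBF runs from one restoration to the start of the next incident.
    gaps = [nxt.occurred - prev.restored for prev, nxt in zip(ordered, ordered[1:])]
    summary["MTBF"] = _mean_minutes(gaps) if gaps else None
    return summary


# Hypothetical log with two incidents, 24 hours apart.
log = [
    Incident(datetime(2024, 3, 1, 10, 0), datetime(2024, 3, 1, 10, 4),
             datetime(2024, 3, 1, 10, 6), datetime(2024, 3, 1, 10, 30)),
    Incident(datetime(2024, 3, 2, 10, 30), datetime(2024, 3, 2, 10, 32),
             datetime(2024, 3, 2, 10, 36), datetime(2024, 3, 2, 11, 20)),
]

print(mttx_summary(log))
# {'MTTD': 3.0, 'MTTA': 3.0, 'MTTR': 40.0, 'MTBF': 1440.0}
```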

Why MTTR matters

Customer impact

Every minute of downtime affects customers. A lower MTTR means less time your customers spend frustrated, unable to use your service.

SLA compliance

Your SLA commits to a certain uptime percentage, and MTTR directly affects whether you can meet it. Total downtime is the number of incidents multiplied by MTTR, so faster recovery means less downtime and more headroom against the SLA.
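To make the link between MTTR and SLA concrete, it helps to turn the uptime percentage into a downtime budget. The sketch below assumes a 30-day measurement period and uses hypothetical function names. Real SLAs may define the period and what counts as downtime differently.

```python
from decimal import Decimal


def downtime_budget_minutes(uptime_target_pct, period_days=30):
    """Minutes of downtime allowed in the period for a given uptime target."""
    period_minutes = Decimal(period_days * 24 * 60)
    allowed_fraction = (Decimal(100) - Decimal(str(uptime_target_pct))) / Decimal(100)
    return period_minutes * allowed_fraction


def incidents_within_budget(uptime_target_pct, mttr_minutes, period_days=30):
    """How many incidents of average length (MTTR) fit inside the budget."""
    budget = downtime_budget_minutes(uptime_target_pct, period_days)
    return int(budget // Decimal(str(mttr_minutes)))


print(downtime_budget_minutes(99.9))      # 43.2 minutes per 30 days
print(incidents_within_budget(99.9, 30))  # 1
print(incidents_within_budget(99.9, 15))  # 2
```

Under these assumptions, a 99.9% target leaves room for one 30-minute incident per month. Halving MTTR to 15 minutes leaves room for two.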

Team effectiveness

MTTR reveals how well your incident response process works. A high MTTR might indicate problems with alerting, runbooks, communication, or technical debt.

How to improve MTTR

1. Reduce detection time (MTTD)

You can't fix what you don't know about. Fast, reliable monitoring with appropriate alert thresholds ensures you detect incidents quickly.

Use 1-minute check intervals for critical services
Monitor from multiple regions to reduce false positives
Set up keyword monitoring to catch partial failures
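The multi-region and keyword ideas above can be sketched as two small functions. This is an illustration only: the function names, region names, and the two-region threshold are assumptions, and a real monitoring tool would supply the status codes and response bodies.

```python
def check_passed(status_code, body, keyword):
    """A check passes only if the status is 200 and the expected keyword is present.

    The keyword test catches partial failures where the server still answers 200
    but serves an error page.
    """
    return status_code == 200 and keyword in body


def should_alert(results_by_region, min_failing_regions=2):
    """Alert only when enough regions fail, to filter out single-region network blips."""
    failing = sorted(region for region, passed in results_by_region.items() if not passed)
    return len(failing) >= min_failing_regions, failing


# Hypothetical results from three probe locations.
results = {
    "us-east": check_passed(200, "<h1>Checkout</h1>", "Checkout"),
    "eu-west": check_passed(200, "Service temporarily unavailable", "Checkout"),
    "ap-south": check_passed(503, "", "Checkout"),
}

alert, failing = should_alert(results)
print(alert, failing)  # True ['ap-south', 'eu-west']
```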

2. Reduce acknowledgment time (MTTA)

Alerts need to reach the right person immediately. Optimize your on-call rotation and alert routing.

Route alerts to the team that owns the service
Use escalation policies so alerts don't get ignored
Send alerts to multiple channels (Slack AND SMS for critical issues)
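An escalation policy is essentially a list of "if nobody has acknowledged after N minutes, notify X" rules. The sketch below models one as plain data. The roles, delays, and channels are hypothetical placeholders, not recommendations from any particular paging tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int   # minutes the alert has gone unacknowledged
    notify: str          # who gets notified at this step
    channels: tuple      # where the notification is sent


# Hypothetical policy for a critical service.
POLICY = [
    EscalationStep(0, "primary on-call", ("slack", "sms")),
    EscalationStep(5, "secondary on-call", ("slack", "sms")),
    EscalationStep(15, "engineering manager", ("sms", "phone")),
]


def steps_due(policy, minutes_unacknowledged):
    """Return every step whose delay has elapsed for an unacknowledged alert."""
    return [step for step in policy if step.after_minutes <= minutes_unacknowledged]


for step in steps_due(POLICY, minutes_unacknowledged=7):
    print(step.notify, step.channels)
# primary on-call ('slack', 'sms')
# secondary on-call ('slack', 'sms')
```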

3. Reduce diagnosis time

Help responders understand what's wrong quickly.

Include context in alerts (what check failed, response time, error message)
Maintain runbooks for common incident types
Ensure easy access to logs and dashboards
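As an illustration of the first two points, an alert message can carry the failing check, the response details, and a runbook link in one place. This is a minimal sketch: the field names, the message layout, and the example.com URLs are all made up.

```python
def format_alert(check_name, target, status_code, response_ms, error, runbook_url):
    """Build a plain-text alert that tells the responder what failed and where to look."""
    return "\n".join([
        f"[DOWN] {check_name} ({target})",
        f"Status: {status_code}, response time: {response_ms} ms",
        f"Error: {error}",
        f"Runbook: {runbook_url}",
    ])


message = format_alert(
    check_name="Checkout API",
    target="https://example.com/api/checkout",
    status_code=503,
    response_ms=8200,
    error="upstream connect timeout",
    runbook_url="https://example.com/runbooks/checkout-503",
)
print(message)
```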

4. Reduce repair time

Make fixes fast and safe to deploy.

Enable quick rollbacks for bad deployments
Maintain clear documentation for common fixes
Automate repetitive remediation tasks
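One common way to combine quick rollbacks with automation is to gate each deployment on a health check and roll back automatically if it fails. The sketch below shows that control flow with caller-supplied functions. The function name, parameters, and defaults are hypothetical, and real deploy and rollback steps depend on your tooling.

```python
import time


def deploy_with_rollback(deploy, is_healthy, rollback, checks=3, wait_seconds=10):
    """Deploy, then roll back at the first failed health check.

    deploy, is_healthy, and rollback are callables supplied by the caller.
    """
    deploy()
    for _ in range(checks):
        time.sleep(wait_seconds)
        if not is_healthy():
            rollback()
            return "rolled_back"
    return "deployed"


# Simulated run: the new version never becomes healthy, so it is rolled back.
log = []
result = deploy_with_rollback(
    deploy=lambda: log.append("deploy v2"),
    is_healthy=lambda: False,
    rollback=lambda: log.append("rollback to v1"),
    wait_seconds=0,
)
print(result, log)  # rolled_back ['deploy v2', 'rollback to v1']
```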

MTTR benchmarks

"Good" MTTR varies by industry and service criticality:

Service Type	Target MTTR
Critical (payments, auth)	< 15 minutes
High priority (core features)	< 1 hour
Medium priority	< 4 hours
Low priority	< 24 hours

Focus on consistent improvement rather than hitting arbitrary targets. If your MTTR is 2 hours, aim for 1.5 hours, then 1 hour, and so on.
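Tracking improvement is easier when the targets and the trend are computed the same way each period. The sketch below encodes the benchmark table above as a lookup and reports the period-over-period change. The tier keys and function names are hypothetical, and the monthly figures are sample data.

```python
# Targets from the benchmark table, in minutes.
TARGET_MTTR_MINUTES = {"critical": 15, "high": 60, "medium": 240, "low": 1440}


def meets_target(tier, mttr_minutes):
    """True if the measured MTTR is strictly under the target for the tier."""
    return mttr_minutes < TARGET_MTTR_MINUTES[tier]


def mttr_trend(mttr_by_period):
    """Change between consecutive periods; negative values mean MTTR improved."""
    return [later - earlier for earlier, later in zip(mttr_by_period, mttr_by_period[1:])]


# Sample data: MTTR of 2 hours, then 1.5 hours, then 1 hour.
monthly_mttr = [120, 90, 60]

print(mttr_trend(monthly_mttr))             # [-30, -30]
print(meets_target("high", monthly_mttr[-1]))  # False, since 60 is not under 60
print(meets_target("medium", monthly_mttr[-1]))  # True
```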

FAQ

Q: What is a good MTTR?

It depends on your service and SLA. For critical services such as payments or authentication, aim for under 15 minutes; for high-priority core features, under 1 hour. For less critical services, 4-8 hours may be acceptable. What matters most is consistent improvement over time.

Q: How do I calculate MTTR?

MTTR = Total downtime / Number of incidents. For example, if you had 3 incidents totaling 90 minutes of downtime, MTTR = 90/3 = 30 minutes.

Q: What's the difference between MTTR and MTTD?

MTTD (Mean Time to Detect) measures how long it takes to discover an incident. MTTR measures how long it takes to resolve it. Because MTTR is measured from when the incident occurs, it includes detection time.
