MTTR: Why Speed of Recovery Matters More Than Preventing All Failures
· 11 min read
Google's Site Reliability Engineering book (2016) popularized a counterintuitive principle: accept failure as inevitable and invest in recovery speed. The DORA research confirmed it with data — the difference between elite and low-performing teams isn't that elite teams have fewer incidents. It's that they recover in under an hour instead of under a week. Every engineering organization invests in preventing failures. Fewer invest in recovering from them quickly. The data says this is backwards.
