Survey: The True Cost of Production Operations Issues

Our 2022 Benchmarking Production Operations Report reveals the leading cause of major incidents, and the impact of escalation, toil, and more.

In our experience, if you ask any manager or executive responsible for incident response and production operations how things are going, their answer is “fine.” But dig a little deeper and you’ll quickly realize there are huge opportunities to improve performance, productivity, and reliability.

In reality, everything is definitely not “fine.”

Recognizing our blind spots

So why is there a disconnect? Many people simply don’t know what they don’t know. On the surface, they may see a functioning team that navigates its way through incidents as they come. Maybe there’s some downtime here and there, but it’s never anything the team can’t handle.

But many don’t stop to ask themselves the tough questions, like:

  • How many incidents have we had in the last month that led to downtime?
  • What are the leading causes of incidents within my environment?
  • How many issues get escalated beyond my front-line on-call responders? And how long do those issues take to resolve?
  • How much churn do I have in my on-call staff? And what does that churn cost me?

For our new 2022 Benchmarking Production Operations Report, Dimensional Research polled over 300 on-call practitioners, managers, and executives responsible for incident response in production cloud environments to answer these very questions. Our goal was to quantify the true impact of production operations issues.

Key takeaways and challenges

Despite spending an average of $2.5M per year on on-call operations, companies suffer from a number of costly challenges related to cloud operations.

  • Major incidents are far too common: On average, companies deal with almost nine major incidents annually. Our survey digs into that number to identify the root causes, cost, and potential solutions to these issues.
  • Toil leads to on-call burnout: Toil is a massive problem. Our survey found that almost half of all incidents are straightforward and repetitive. Handling them by hand is a major waste of money and a source of frustration for on-call workers, who already suffer from severe burnout and low job satisfaction.
  • Manual error is a major problem: Manual error is a leading cause of major incidents. Our survey found that it is significantly more likely to cause an issue than automation gone wrong. In fact, as our survey reveals, automation might be the only solution to this problem.
  • Escalations are costly: 55% of incidents are escalated, and escalated incidents take a whopping 3X longer to resolve. Put these two stats together, and escalated incidents represent 78% of the effort spent resolving incidents every month. This is a huge opportunity for productivity improvements.
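
For readers who want to check how the two escalation stats combine, here is a minimal sketch of the arithmetic. It assumes that "3X longer" means an escalated incident takes three times the effort of a non-escalated one, and that effort is proportional to time to resolve. The function name and parameters are ours for illustration and do not come from the report.

```python
def escalated_effort_share(escalation_rate: float, time_multiplier: float) -> float:
    """Fraction of total resolution effort spent on escalated incidents.

    Assumption: a non-escalated incident costs 1 unit of effort and an
    escalated incident costs `time_multiplier` units.
    """
    escalated_effort = escalation_rate * time_multiplier
    routine_effort = (1.0 - escalation_rate) * 1.0
    return escalated_effort / (escalated_effort + routine_effort)


# Survey figures: 55% of incidents escalated, each taking 3X as long.
share = escalated_effort_share(escalation_rate=0.55, time_multiplier=3.0)
print(f"{share:.1%}")  # prints 78.6%
```

Under these assumptions the share works out to 1.65 / 2.10, or about 78.6%, in line with the 78% figure above. Swapping in your own escalation rate and multiplier gives a rough sense of how much effort escalations consume in your environment.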

There’s a better way

The experts at Shoreline are in the business of optimizing production operations. We’ve spent years working in on-call environments and are all too familiar with the issues this report identifies and quantifies. That’s why we built an incident automation solution to help on-call teams diagnose, repair, and automate away production issues.

With our breadth of experience, we’re able to offer expert tips for optimizing operations, which we share alongside our findings in the 2022 Benchmarking Production Operations Report.

To see the findings and our expert tips for improving reliability, download the full 2022 Benchmarking Production Operations Report.