Incident Escalation Comes at a Cost

How often do you ask yourself, “How many issues get escalated beyond my front-line on-call responders? And how long do those issues take to resolve?” Chances are, not enough.

You may think escalation is no big deal and that it’s just a part of running on-call operations. But if you’re not careful, escalation could be hurting your development velocity.

To keep your on-call operations productive and successful, you’ll need to understand escalations inside and out. In this post, we will explore:

What causes escalation
How it affects on-call operations
How to reduce escalations

What causes escalation

Shoreline recently surveyed over 300 hands-on practitioners, engineering managers, and engineering executives at companies with over 100 employees. Through this survey, we gathered insights to build our 2022 Market Research: Benchmarking Production Operations. We found that the top two causes of escalation are:

Complex environments
Human error

Complex environments

In our survey, 62% of respondents said that their biggest on-call operations challenge is that their infrastructure is increasing in complexity with no signs of slowing down. When an environment gets too complex, the onus often falls on the engineering team to keep things running smoothly because they are the only ones who know the product well enough.

This is no easy feat. Managers have to keep track of an ever-growing tech stack, and when service degrades, simply locating the issue is difficult, let alone determining the cause. So when an issue does happen in a complex environment, more often than not an on-call team member has to escalate it to multiple engineers, each with expertise in a portion of the product or service.

Human error

As environments grow in both size and complexity, managers must hire more team members to keep up. And as we all know, sooner or later everyone makes mistakes. And those mistakes could cost you.

When it comes to human error, I want you to think back to your last major outage. What was the root cause? There is a good chance it was a person: 34% of our survey respondents stated that manual or human error was the root cause of their last major outage, while only 7% put the blame on automation.

In addition, as your fleet scales, your team must grow with it. Properly training new employees to ensure they feel confident enough to do a job on their own takes time, resources, and patience.

Unfortunately, many companies tend to fall short when it comes to training new hires. Of our survey respondents, 30% stated that improperly trained team members are a challenge when managing on-call operations. When new hires are thrown into the high-stress world of on-call, you want them to escalate questions to a higher-up because the cost of making a mistake can be so high.

With human error five times more likely than automation to cause a major outage, and nearly a third of organizations struggling to properly train team members to mitigate it, the effects of continuous escalations can be drastic.

How escalations affect your engineering team

Our survey respondents told us that while 48% of the incidents on-call teams resolve are simple and repetitive, 52% are not. And most of them are getting escalated. In fact, 55% of all incidents are getting escalated, even some of the simple and repetitive ones.

Did you hear me? More than HALF of incidents are escalated. That takes up precious time that could be spent on innovative, creative projects that will sustain the growth of your business (and team member job satisfaction!).

As we know in the world of on-call, there is no time to waste. The average time to resolve a non-escalated incident is 3.6 hours. Many are resolved quickly, but some take days. The 55% of incidents that are escalated, however, take an average of 10.7 hours to resolve, and many of those take days. Escalating an incident, a seemingly harmless action, roughly triples the time to resolve.

Combining these statistics, we found that escalated incidents represent 78% of all effort spent resolving incidents each month. This means your engineering team is bearing 78% of the on-call workload! There has to be a better, more cost-effective way of running on-call operations and reducing escalations.
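If you want to check the math, here is a minimal Python sketch of how the 78% figure can be reproduced from the numbers above. It assumes that effort is simply the share of incidents multiplied by the average time to resolve; that is our reading of the figures in this post, and the survey itself may have weighted things differently. The function and constant names are made up for illustration.

```python
# Figures quoted in this post (survey averages).
ESCALATED_SHARE = 0.55       # fraction of incidents that get escalated
HOURS_ESCALATED = 10.7       # average hours to resolve an escalated incident
HOURS_NON_ESCALATED = 3.6    # average hours to resolve a non-escalated incident


def escalated_effort_share(share: float, hours_esc: float, hours_non: float) -> float:
    """Fraction of total resolution hours spent on escalated incidents.

    Assumption: effort = (share of incidents) x (average hours per incident).
    """
    escalated_hours = share * hours_esc
    non_escalated_hours = (1 - share) * hours_non
    return escalated_hours / (escalated_hours + non_escalated_hours)


effort = escalated_effort_share(ESCALATED_SHARE, HOURS_ESCALATED, HOURS_NON_ESCALATED)
slowdown = HOURS_ESCALATED / HOURS_NON_ESCALATED

print(f"Escalated share of effort: {effort:.0%}")   # 78%
print(f"Slowdown from escalating: {slowdown:.1f}x")  # 3.0x
```

Plug in your own escalation rate and resolution times to see how much of your team's incident workload escalations account for.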

How to reduce escalations

To effectively reduce escalations, you need to empower your on-call team, and the people who are taking L1 calls, by providing them with a self-service tool to fix issues themselves (all without the risk of giving the team SSH access). The best approach for self-service is runbook automation.

Runbook automation

People often think of runbooks as tedious “how-to” guides to complete simple and repetitive tasks. Well, runbook automation is different. Runbook automation elevates the way we resolve issues by taking “how-to” documentation and translating it into actionable, executable code.
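To make that concrete, here is a rough Python sketch of what turning a "how-to" into code can look like. It takes a hypothetical wiki instruction ("if old log files are piling up, delete any older than a week") and expresses it as a diagnose step and a repair step. This is an illustration only, not Shoreline's product or API; the function names, the `*.log` pattern, and the seven-day threshold are all assumptions.

```python
import time
from pathlib import Path
from typing import List, Optional

SECONDS_PER_DAY = 86400


def find_stale_logs(log_dir: Path, max_age_days: float,
                    now: Optional[float] = None) -> List[Path]:
    """Diagnose step: list *.log files last modified more than max_age_days ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    return sorted(p for p in log_dir.glob("*.log") if p.stat().st_mtime < cutoff)


def run_runbook(log_dir: Path, max_age_days: float = 7,
                dry_run: bool = True) -> List[Path]:
    """Repair step: delete stale logs, or only report them when dry_run is True."""
    stale = find_stale_logs(log_dir, max_age_days)
    for path in stale:
        if not dry_run:
            path.unlink()
    return stale
```

An on-call responder could call `run_runbook(Path("/var/log/myapp"))` (a made-up path) to see what would be removed, then rerun with `dry_run=False` to apply the fix. Defaulting to a dry run is a deliberate choice in this sketch: it lets a junior team member inspect the repair before anything is deleted.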

Not only does runbook automation enable on-call teams to automate debugging and repair with fewer late-night wake-up calls, but it also:

Saves costs by freeing up developers and SREs from automatable issues so they can spend time on higher-value projects.
Teaches junior SREs valuable techniques that can be applied in other situations.
Boosts innovation by reducing toil and tasks that require repetitive solutions, enabling teams to focus on innovative efforts to keep ahead of the competition.
Increases customer satisfaction by eliminating customer-facing issues to save them from thousands of hours of degraded service.

Save your team escalation headaches

Shoreline’s runbook automation capabilities enable on-call teams to build a path for end-to-end employee communication, create automations rapidly, and set alarms to alert team members when an issue is occurring.

To learn more about how Shoreline's full suite of cloud reliability solutions (including runbook automation) can help you reduce escalations to save money, elevate productivity, and improve customer service, request a demo today.
