The Cost, The Challenges, and The Conquering of On-Call Operations

The world of on-call operations is often high-stress and complex. But if we really drill down into the challenges that make on-call operations so difficult, can we find a way to overcome them?

At Shoreline, we sought to do just that with our survey report, 2022 Market Research: Benchmarking Production Operations. To develop this report, we surveyed over 300 highly qualified hands-on practitioners, engineering managers, and engineering executives who work at companies with over 100 employees.

In this blog, we explore the top findings of our survey, including the stats that surprised us the most, the top challenges plaguing on-call teams, and actionable tips to improve on-call operations.

Cost of on-call survey findings

From our survey results, we found that the total cost of on-call for the average company is $2.5M per year. But even with such a significant investment, the average on-call team deals with 8.7 major incidents per year.

Of all the incidents on-call teams resolve, 48% are simple and repetitive. Of the incidents that are not simple and repetitive, 55% are escalated. When an incident is escalated, it takes three times as long to resolve.

There has to be a reason (or, more accurately, reasons) why on-call costs organizations so much, and why they still suffer from repeated and escalated incidents.

Top challenges of on-call operations

After reviewing what our respondents said were the biggest challenges they face when managing or working on an on-call team, we discovered the top three issues were:

1. Increasing infrastructure complexity

62% of our survey respondents stated that their primary challenge with on-call operations is that their infrastructure is continuously growing in complexity. A highly complex infrastructure creates a heavy burden for those who run production operations: they have too many tools to keep track of, and locating a specific issue becomes harder.

2. Large environments, little time

Because these infrastructures are getting so complex, 47% of respondents said that they don’t have enough time to implement automation or prevent incidents because they’re too busy remediating issues as they pop up. And while these on-call professionals are only getting busier, 43% of respondents agreed that the growth in their environment is only making this harder.

3. Untrained team members

As an infrastructure grows, you need to hire more personnel. But it takes time and resources to train new personnel, and too many companies are falling short: 30% of respondents said that improperly trained team members are a challenge when managing on-call operations.

Based on these top challenges, we put together some actionable tips that will enable your engineers to spend more time on fulfilling, intellectually engaging work that impacts your bottom line, and less on tedious, time-consuming tasks.

High-impact ways to improve on-call

After reviewing our survey results, we came up with three ways you can dramatically and consistently elevate on-call operations management: focus on continuous improvement, work to reduce escalations, and automate repeated tasks.

1. Build a culture around continuous improvement

When it comes to reporting incidents, we suggest you work to create and nurture a culture around resiliency and continuous improvement. To start, you should track the number of:

Incidents per week
High severity incidents
Customers affected per incident
Incidents escalated
People who touched each incident
Services impacted and services causing impact

These metrics will give you the insights that you need to make effective improvements, like identifying which incidents take the most time to resolve, which incidents are constantly escalated, which team members spend too much time resolving incidents, and more. When you actually know what kinds of incidents are heavily impacting your team, you can begin to build a personalized plan to resolve them. Without that visibility, it is hard to scale. These metrics are what Shoreline founder Anurag Gupta kept track of when he was managing millions of nodes at AWS.
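As a rough illustration, the metrics listed above could be computed from a simple export of incident records. The sketch below is not taken from the survey or from any Shoreline product: the `Incident` fields, the `summarize` function, and the `high_severity_max` threshold are all hypothetical names chosen for the example, and your ticketing system will likely store this data differently.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date
from typing import Dict, List


@dataclass
class Incident:
    # Hypothetical record shape; every field name here is an assumption.
    opened: date
    severity: int                 # 1 is the most severe
    customers_affected: int
    escalated: bool
    responders: List[str]         # people who touched the incident
    services_impacted: List[str]
    causing_service: str


def summarize(incidents: List[Incident], high_severity_max: int = 2) -> Dict:
    """Compute the tracking metrics from the list above for a batch of incidents."""
    total = len(incidents)
    if total == 0:
        return {}

    # Group by ISO (year, week) so weeks are counted consistently across years.
    per_week = Counter(tuple(i.opened.isocalendar()[:2]) for i in incidents)
    escalated = sum(1 for i in incidents if i.escalated)

    return {
        "incidents_per_week": dict(per_week),
        "high_severity": sum(1 for i in incidents if i.severity <= high_severity_max),
        "avg_customers_affected": sum(i.customers_affected for i in incidents) / total,
        "escalated": escalated,
        "escalation_rate": escalated / total,
        # Deduplicate responders so one person is counted once per incident.
        "avg_people_per_incident": sum(len(set(i.responders)) for i in incidents) / total,
        "services_impacted": Counter(s for i in incidents for s in i.services_impacted),
        "services_causing_impact": Counter(i.causing_service for i in incidents),
    }
```

Reviewing a summary like this each week makes trends visible, such as a service that keeps causing impact or an escalation rate that is creeping up.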

2. Reduce escalations

To effectively reduce escalations, you need to empower your support team and your L1 engineers by providing them with self-service incident repair tools. It’s okay to start with traditional runbooks, but most companies struggle to create more than one or two, and on-call teams often find them hard to use when they are needed most. A better approach is runbook automation. We recommend that you look for a runbook automation tool that gives users real-time diagnostics data, step-by-step guidance on how to repair issues, and live runnable code that is continuously used and maintained.

3. Automate repetitive incidents

As mentioned above, almost half of all incidents are straightforward and repetitive. These are tasks that a machine should handle automatically so that humans can take on more intellectually complex work. Automating repetitive incidents will help eliminate toil on a meaningful scale so that your teams can spend more time on higher-value preventive work or on new features and innovation.
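To make the idea concrete, here is a minimal sketch of automating repetitive incidents: known alert types are mapped to a repair action, and anything unrecognized is handed to a human. The alert names, the repair functions, and the `handle_alert` helper are all hypothetical, and the repair functions only return a description rather than touching a real host. This is an illustration of the pattern, not how Shoreline or any particular tool implements it.

```python
from typing import Callable, Dict


# Hypothetical repair actions. Real ones would run commands on the host;
# these stand-ins only describe what they would do.
def clear_tmp_files(host: str) -> str:
    return f"cleared temp files on {host}"


def restart_service(host: str) -> str:
    return f"restarted service on {host}"


# Assumed mapping from a repetitive alert type to its automated repair.
REMEDIATIONS: Dict[str, Callable[[str], str]] = {
    "disk_full": clear_tmp_files,
    "service_unresponsive": restart_service,
}


def handle_alert(alert_type: str, host: str) -> str:
    """Run the automated repair for a known alert type, otherwise escalate."""
    action = REMEDIATIONS.get(alert_type)
    if action is None:
        # No known repair: this is the case that still needs a human.
        return f"escalate: no automated repair for {alert_type!r} on {host}"
    return action(host)


print(handle_alert("disk_full", "web-1"))
print(handle_alert("certificate_expired", "web-1"))
```

Each repair added to the mapping is one less incident type that pages a person, but every entry is also a script that someone has to test and maintain.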

Every company can improve how it tracks incidents to get better insight into the true cost of on-call and the biggest opportunities for improvement. Reducing escalations and automating the repair of incidents can be done on a case-by-case basis with scripts, but it's almost impossible to do this for more than a couple of incidents. This is where tools like Shoreline come in to help with debugging, repair, runbook automation, and full incident repair automation.

The tool redefining incident repair

There are lots of tools that will help you find issues and figure out who to assign these problems to. But when it comes to debugging issues, repairing issues, and making sure they don’t come up again, there’s almost no tooling available.

This is where Shoreline comes in.

Shoreline is the only platform that lets DevOps engineers build automations in an afternoon, and fix issues forever. Shoreline runs across your clouds, accounts, VMs, and Kubernetes clusters to debug, repair, and automate away production issues like never before. With dynamic notebooks that allow engineers to proactively capture and share best tactics, you can keep your team on a safe path to quickly resolve tickets without calling in senior experts.

To take a deeper dive into our survey findings and learn how Shoreline can help you debug 1,000 boxes as fast as one, empower anyone on call, and build long-lasting automations in an afternoon, watch our webinar, Market Research: The Surprising Cost of On-Call Operations.