Production Operations Benchmark Survey Reveals Half of Incident Response Time Remains Toil Despite Millions Being Spent
On Average, Companies Spend 12 Person-Years on Incident Response Annually
REDWOOD CITY, Calif. - Aug 9, 2022 - Shoreline.io, the Incident Automation company, today announced the findings of its 2022 Benchmarking Production Operations Report, which offers insight into the experience of incident response teams to help cloud operations leaders and technology executives make informed business decisions.
The survey behind the report, conducted by Dimensional Research, polled more than 300 on-call practitioners, managers and executives responsible for incident response in production cloud environments. Participants run businesses that manage anywhere from fewer than 20 to more than 10,000 nodes. The average survey respondent was responsible for 1,772 nodes, had 13.1 SREs and spent 2,084 hours monthly on incident response.
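The headline figure of 12 person-years follows from the 2,084 monthly hours cited above. The short sketch below reproduces that arithmetic. It assumes a person-year of 2,080 working hours (40 hours a week for 52 weeks), a common convention that the release itself does not state; the function and constant names are illustrative only.

```python
# Figure quoted in the release: average hours spent on incident response per month.
HOURS_PER_MONTH = 2_084

# Assumption (not from the report): one person-year = 40 hours/week * 52 weeks.
HOURS_PER_PERSON_YEAR = 40 * 52  # 2,080 hours


def person_years(hours_per_month: float,
                 hours_per_person_year: float = HOURS_PER_PERSON_YEAR) -> float:
    """Convert a monthly hour total into person-years spent per calendar year."""
    annual_hours = hours_per_month * 12
    return annual_hours / hours_per_person_year


if __name__ == "__main__":
    print(f"{person_years(HOURS_PER_MONTH):.1f} person-years per year")
```

Under that assumption the result rounds to 12 person-years, in line with the subhead. A different definition of a working year would shift the number slightly.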
Report findings uncover a number of challenges and opportunities for the cloud operations industry, revealing that organizations spend millions of dollars per year on on-call operations, yet continue to suffer major outages that impact customer and employee productivity. Key findings include:
- 48% of incidents are straightforward and repetitive
- Human error caused major incidents 5x more frequently than bugs in automation
- On average, 8.7 major incidents occur each year, of which 62% are escalated to the C-suite
- 55% of incidents are escalated to second-line responders or experts outside of the on-call team
- The average time to resolve escalated incidents is 10.7 hours
- The average cost of on-call operations is $2,500,000 per year
Additional findings point to the significance of reliability, as more organizations prioritize reducing the total number of incidents, decreasing costs and shortening time to recover. Findings include:
- 97% of organizations have OKRs on reliability, but are unable to meet them
- Cloud footprints grew 46% faster than SRE teams in the last 12 months
- Modern technologies make production operations harder: multi-cloud, Kubernetes and microservices each make on-call harder, according to 73%, 57% and 52% of respondents, respectively
“The growth of cloud footprints is outpacing the growth of on-call and DevOps teams,” said Diane Hagglund, principal at Dimensional Research. “As cloud environments become more complex, the more difficult it is for organizations to find experienced staff that are equipped to meet on-call needs, leaving the burden of incident response on smaller teams.”
Key recommendations for improving on-call include:
- Institutionalize continuous improvement processes - Data is a critical tool for continuous improvement, but too often incident management does not provide the data needed to identify and prioritize opportunities for improvement. The report recommends small changes to incident tracking that will yield significant insights into which incidents require the most time and affect the most customers.
- Prevent escalations - The biggest opportunity to improve on-call productivity is to reduce incident escalations, which account for 78% of on-call time. Investing in self-service tools that empower support teams and on-call primaries will reduce resolution times and the load on engineering.
- Work to eliminate toil - 48% of incidents are repetitive, which presents an opportunity to use automation tools to free teams from repetitive tasks so they can dedicate more time to improving resiliency, securing environments and lowering costs, further improving productivity.
“The current approach to on-call is unsustainable, with the rapid growth of cloud infrastructure leaving SRE teams faced with thousands of hours of work per month,” said Anurag Gupta, founder and CEO at Shoreline.io. “Utilizing automation to address escalations and eliminate low value, repetitive work will dramatically improve team productivity and overall customer experience.”