Managing production and operations is complicated. As the size of a fleet increases, so does the production environment's complexity. And with increasing complexity comes an explosion in on-call incidents. In fact, our recent study of on-call stakeholders found that on-call teams manage an average of 278 incidents per month — and spend more than 2,000 hours resolving those issues.
Incident management is a time-consuming and expensive process. But the stakes are high. Incidents cause downtime, lost revenue, reputational damage, and a host of other problems for today’s cloud-first organizations.
Universal truths in DevOps today
We’ve talked to hundreds of directors and managers in the operations and engineering spaces. They all agree on some basic reasons DevOps is so complicated and tough to manage:
- Human error: People aren’t perfect. When humans are manually writing code or running commands, things tend to break. In fact, our research has found that 34% of incidents stem from manual or human error.
- Toil: Engineers often waste time doing the same thing over and over again. We’ve found that 48% of the incidents on-call engineers handle are straightforward and repetitive. It’s no wonder that extensive time on call leads to severe burnout and low job satisfaction, which in turn shrinks an already limited hiring pool.
- Reactive incident management: Firefighting is a main part of the job. Teams are bogged down with incidents. Leaders across industries agree that SRE and DevOps teams spend too much time reacting to issues instead of innovating or strategizing. (And that isn’t good for anyone—or the bottom line.)
For years, we’ve all accepted that these challenges are the price of doing business. But what if there’s a better way?
Incident automation: Let’s challenge the status quo
Incident automation is about optimizing responses to issues you thought you just had to live with. How? Using a layer of human-in-the-loop intelligence that scans your entire infrastructure to identify, diagnose, and repair incidents.
Incident automation enables teams to fix incidents quickly, with varying levels of automated intervention:
- Runbooks: Provide pre-canned sequences and step-by-step instructions to check for and remediate issues. On-call engineers can manually work through the steps at their own pace, with full oversight of automated elements (human-in-the-loop automation). In this case, automation is used to quickly detect and surface a solution for the human worker — who will manually accept or deny that solution.
- Full automation: Remediation actions are programmed to fire automatically when distinct issues are identified, without the need for manual oversight. Full automations are particularly useful for issues and subsequent repair actions that are re-used frequently within an environment.
Both options reduce manual and repetitive work while empowering on-call teams to safely complete repairs without escalation.
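The difference between the two levels can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Shoreline's implementation: the `Mode`, `Remediation`, and `handle_incident` names, the issue names, and the `approve` callback (standing in for the on-call engineer's accept-or-deny decision) are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict


class Mode(Enum):
    RUNBOOK = "runbook"        # a human accepts or denies the proposed fix
    FULL_AUTOMATION = "full"   # the fix fires without manual sign-off


@dataclass
class Remediation:
    issue: str                   # hypothetical issue name this fix handles
    action: Callable[[], str]    # the repair step; returns a status message
    mode: Mode


def handle_incident(
    issue: str,
    remediations: Dict[str, Remediation],
    approve: Callable[[Remediation], bool],
) -> str:
    """Look up the fix for `issue` and run it according to its mode."""
    fix = remediations.get(issue)
    if fix is None:
        # No known fix: a person has to take over.
        return "escalate: no remediation defined"
    if fix.mode is Mode.RUNBOOK and not approve(fix):
        # Human-in-the-loop: the engineer denied the suggested solution.
        return "skipped: engineer declined the proposed fix"
    return fix.action()


# Illustrative fixes only; real actions would call out to infrastructure tooling.
remediations = {
    "disk_full": Remediation("disk_full", lambda: "disk resized", Mode.FULL_AUTOMATION),
    "stuck_pod": Remediation("stuck_pod", lambda: "pod restarted", Mode.RUNBOOK),
}

print(handle_incident("disk_full", remediations, approve=lambda fix: False))
print(handle_incident("stuck_pod", remediations, approve=lambda fix: False))
```

In this sketch the fully automated fix runs even though the approver says no, because it never asks. The runbook fix waits on the engineer's decision, and an issue with no registered fix falls back to escalation.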
How is incident automation different from observability and incident management?
Many solutions can identify operations problems and assign them to someone to fix. Incident automation solutions handle the next steps: diagnosing and actually fixing the problem. The value of incident automation lies in its ability to take action.
- Observability: Monitor a company’s infrastructure and trigger an alarm when something goes wrong.
- Incident management: Alert the right people on the team to an incident (indicating something is wrong), and prioritize the task of fixing the issue.
- Incident automation: Detect, diagnose, and repair the issue. Plus, plan for the future by creating sequences (runbooks or full automations) that repair issues when they recur.
Incident automation in practice
Incident automation is a game-changing approach to common infrastructure issues.
Take disk resizing, for example. When managing a large fleet, it’s hard to keep track of every node (especially when nodes are frequently spun up and down) and identify those that are reaching capacity. But it’s important work: when a disk fills up, it can cause catastrophic app failures, widespread outages, and data loss. Recovering from a single disk-full issue can take hours.
With incident automation, engineers have the capability to automatically (and regularly) check disk capacity across every node in a fleet. They can also create automations that trigger a remediation sequence when the problem occurs — like automatically resizing the disk — without human intervention.
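As a rough sketch of what such a check might look like on a single node, the Python below uses the standard library's `shutil.disk_usage` to measure how full a mount point is and calls a remediation hook once a threshold is crossed. The `check_disks` function, the 90% threshold, and the `remediate` callback are assumptions made for illustration. An actual resize depends on the cloud provider or volume manager, and a fleet-wide version would run this on every node.

```python
import shutil
from typing import Callable, Iterable, List


def disk_usage_fraction(path: str) -> float:
    """Return the used fraction (0.0 to 1.0) of the filesystem holding `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    if usage.total == 0:
        return 0.0  # guard against pseudo-filesystems that report zero size
    return usage.used / usage.total


def check_disks(
    paths: Iterable[str],
    threshold: float,
    remediate: Callable[[str], None],
    usage_fn: Callable[[str], float] = disk_usage_fraction,
) -> List[str]:
    """Run `remediate` on every path at or above `threshold`; return those paths."""
    flagged = []
    for path in paths:
        if usage_fn(path) >= threshold:
            remediate(path)  # hypothetical hook, e.g. request a larger volume
            flagged.append(path)
    return flagged


if __name__ == "__main__":
    # Dry run: only report which mount points would be resized.
    check_disks(["/"], threshold=0.9, remediate=lambda p: print(f"would resize {p}"))
```

Passing the usage function in as a parameter keeps the threshold logic testable without touching a real disk, and scheduling the check at a regular interval gives the "automatically (and regularly)" behavior described above.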
But aren’t automations hard to build?
Yes. It’s the biggest reason many organizations believe they just have to live with those day-to-day annoyances in DevOps. For incident automation to change the world of DevOps as we know it, incident automation solutions need to help teams build automations faster and more reliably. It’s something we’ve accomplished with Shoreline, and we believe it's the puzzle piece that was previously missing. How? We integrate with your CI/CD processes to ensure all scripts are pushed to all appropriate VMs and nodes, make it easy to tie alarms to repair scripts, and comprehensively track and audit all actions performed so they are documented for future use.
Now that we’ve unlocked a way for on-call engineers to build automations in an hour — instead of a month — the promise of incident automation is a reality.
The impact of incident automation
You know those issues you thought you just had to live with? You probably never really thought about how much they could impact your bottom line. The hidden costs can be jarring. But you don’t just have to accept them.
Incident automation provides many benefits for DevOps teams, such as:
- Fewer incidents: To state the obvious, incident automation leads to fewer major incidents because more issues are being recognized faster and remediations are at the ready. The impact? Less downtime; retained customers; fewer late-night emergencies.
- Better customer experience: With faster detection and repair of incidents, customers experience far less degraded service. Better service tends to mean better outcomes for your bottom line.
- Fewer escalations: Our research shows almost 55% of issues get escalated beyond the first-line on-call responder. What’s more, escalated issues take close to 11 hours to solve on average (vs. three to four hours for issues that don’t need to be passed up the chain). When you quantify that hourly cost, it really adds up. Incident automation reduces the need for escalations, saving days of work and protecting profits.
- Better employee retention: It’s hard to hire SREs, not least because many burn out from unfulfilling work and sleepless nights on call. The costs of re-hiring and re-training staff add up. With incident automation, you can improve the processes that lead to burnout and turnover and give your team more time to focus on innovation (and improve their coding skills!).
- Do more with less: Incident automation decouples engineering team size from infrastructure scale: smaller teams can handle more incidents with a powerful incident automation platform in their corner.
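To make the escalation figures above concrete, here is a back-of-the-envelope calculation in Python. It is a sketch that combines numbers quoted in this article (278 incidents per month, roughly 55% escalated, about 11 hours per escalated issue) with two assumptions: 3.5 hours as the midpoint of the three-to-four-hour range, and a purely hypothetical 30% escalation rate after automation. Neither assumed number is a measured result.

```python
def monthly_resolution_hours(
    incidents: float,
    escalation_rate: float,
    escalated_hours: float,
    direct_hours: float,
) -> float:
    """Estimate total hours per month spent resolving incidents."""
    escalated = incidents * escalation_rate   # incidents passed up the chain
    direct = incidents - escalated            # incidents solved by the first responder
    return escalated * escalated_hours + direct * direct_hours


# Figures from the article; 3.5 h is an assumed midpoint of "three to four hours".
baseline = monthly_resolution_hours(278, 0.55, 11, 3.5)

# Hypothetical scenario: automation lowers the escalation rate to 30%.
improved = monthly_resolution_hours(278, 0.30, 11, 3.5)

print(f"baseline hours per month: {baseline:.1f}")
print(f"hours saved per month:    {baseline - improved:.1f}")
```

Under these assumptions the baseline comes to roughly 2,100 hours a month, which is consistent with the "more than 2,000 hours" figure cited at the start of this article.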
Shoreline’s incident automation solution
Shoreline has created the first incident automation platform that works across cloud environments, VMs, and Kubernetes clusters. Shoreline helps DevOps and SRE teams debug, repair, and automate away production issues.
Founded by Anurag Gupta, former vice president of engineering at AWS, Shoreline enables engineers to create automations to repair common issues within their environments. It’s no secret that automation gone wrong can be scary — we’re sure you’ve seen the headlines. Our approach to automation is about using machines to help humans do their jobs more efficiently. All automations are human-led, meaning that suggested remediations and command sequences are never triggered without the proper sign-off.
In addition to build-your-own automations, Shoreline also provides customers with a library of pre-built solutions to common issues — equipping nearly anyone with the information they need to repair issues fast.