Are you tired of constantly fixing a recurring problem in your organization—often in the middle of the night? Do you find yourself frequently becoming a bottleneck for your SRE or DevOps team?
If the answer is 'yes' to either question, it's time to consider implementing runbook automation.
You've probably heard of runbooks before: those meticulous "how-to" guides for completing common, repeatable tasks. Runbook automation takes them to the next level by turning that documentation into executable code. It starts as "human-in-the-loop" automation and can eventually become fully automated.
So, if you're an SRE, developer, or manager who's ready to say goodbye to the interruptions and sleepless nights, read on to discover all you need to know about runbook automation.
First—a quick look at some basics.
What Is a Runbook?
A runbook is a step-by-step guide that shows developers, SREs, and other support staff how to handle frequent, repeatable, and typically less creative tasks (also referred to as "toil"). Examples of toil include remediating incidents, reviewing non-critical monitoring alerts, applying changes to a database schema, answering service requests, and other operational work that supports business continuity. To reduce this manual work, DevOps engineers usually write runbooks that share their knowledge with new team members and support the on-call group.
While runbooks seem like a good idea, in reality, they simply aren’t efficient. All too often, they're incredibly complex, lengthy, difficult to read, and equally challenging to write and keep up-to-date.
As a result of their intricacy and size, DevOps teams often ignore runbooks and fall back on a simple fix they can perform from memory, even when it is overkill for the problem at hand.
Luckily, there's a better way of remediating common operations problems, and that's through runbook automation.
What Is Runbook Automation?
Runbook automation (RBA) is an operations process that enables DevOps and site reliability engineering (SRE) teams to turn manual solutions into automated processes.
When implementing RBA in your organization, there are two main types of processes you can use:
- Human-in-the-loop automation: this automation is usually a series of scripts that a user runs, and it requires their judgment in some way.
- End-to-end automation: with this RBA type, a computer completes an automation independently without requiring any judgment or input from a user.
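The difference between the two types can be small in code. The following is a minimal Python sketch, not a reference implementation: `run_remediation`, `restart_service`, and `ask_operator` are hypothetical names invented for illustration. The same fix runs in either mode, and the only change is whether a human approval step gates it.

```python
from typing import Callable, Optional


def run_remediation(
    description: str,
    fix: Callable[[], str],
    approve: Optional[Callable[[str], bool]] = None,
) -> str:
    """Run `fix`, asking a human first only when `approve` is supplied."""
    # Human-in-the-loop: a person's judgment gates the fix.
    if approve is not None and not approve(description):
        return "skipped: operator declined"
    # End-to-end: no approver was supplied, or the operator said yes.
    return fix()


def restart_service() -> str:
    # Hypothetical stand-in for a real remediation script.
    return "restarted"


def ask_operator(description: str) -> bool:
    # Prompts on the terminal; used only in human-in-the-loop mode.
    return input(f"{description} [y/N] ").strip().lower() == "y"
```

Calling `run_remediation("Restart the stuck service?", restart_service, approve=ask_operator)` waits for an operator, while leaving out `approve` runs the same fix unattended.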
Whether you write the scripts yourself or use a ready-made solution, runbook automation lets your company resolve far more issues in less time. You'll be able to resize, archive, or delete files on full disks, restart Java virtual machines (JVMs) whose CPUs are maxed out by runaway garbage collection, terminate stuck Kubernetes pods, and even remove unauthorized Bitcoin mining apps that drain system resources.
To identify the runbook automation opportunities in your company, it helps to understand how RBA works.
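To make the full-disk case concrete, here is a small Python sketch that uses only the standard library. The 90% threshold, the 7-day retention window, and all function names are assumptions chosen for illustration, not recommendations. It defaults to a dry run so that an operator can review what would be deleted before anything is removed.

```python
import shutil
import time
from pathlib import Path
from typing import List


def disk_usage_percent(path: str) -> float:
    """Return the percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def delete_old_files(directory: str, max_age_days: float, dry_run: bool = True) -> List[Path]:
    """Find regular files in `directory` older than `max_age_days`.

    With dry_run=True (the default) nothing is deleted; the function only
    reports what it would remove.
    """
    cutoff = time.time() - max_age_days * 86400  # 86400 seconds per day
    stale = []
    for entry in Path(directory).iterdir():
        if entry.is_file() and entry.stat().st_mtime < cutoff:
            if not dry_run:
                entry.unlink()
            stale.append(entry)
    return stale


def remediate_full_disk(directory: str, threshold_percent: float = 90.0) -> List[Path]:
    """Clean up only when disk usage is at or above the assumed threshold."""
    if disk_usage_percent(directory) < threshold_percent:
        return []
    return delete_old_files(directory, max_age_days=7, dry_run=False)
```

A cautious rollout might call `delete_old_files` in dry-run mode for a while, compare its output with what an engineer would have deleted by hand, and only then wire up `remediate_full_disk`.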
How Runbook Automation Works
There are a few basic steps to automating a runbook execution, and the first step is arguably the most critical for supporting your team.
- Identify a repeating problem. Think about the issues that recur frequently. Have you built monitors and alarms for them? These problems are great opportunities for runbook automation and can range from those listed in the section above to debugging, rotating security certificates, and draining a bad node. (If you've already written a manual runbook, you can skip to the fourth step, writing the script.)
- Determine a repeatable solution. Once you've identified a recurring problem, you and your DevOps or SRE team can determine a series of steps to fix it. Make sure you break down the steps into bite-size pieces, as this may help you identify a more straightforward way to resolve the issue than you initially thought possible. Run the solution by one or two subject matter experts to confirm that it would've worked the last several times the problem arose.
- Publish a manual runbook. This step helps you define your repeatable solution for others to follow by documenting it in a step-by-step format—aka a runbook. It's essential to make your runbook very targeted, lightweight, and specific so that it's easier to write and use. It’s also important to document all of the human steps you might forget, like logging into a different VPN.
- Write a script or a series of scripts to automate your runbook. During this step, a developer or SRE will write a script for someone to execute when the recurring problem arises. Or, if you're tight on resources, you may work with an RBA platform to implement an off-the-shelf script that follows operational best practices and is ready to execute.
- Fully automate your runbook. Once you're confident your runbook is consistently fixing the problem, you're ready for full automation. However, if you didn't incorporate a precise alarm, you may discover you can only automate certain variations of a problem. By ensuring you implement more exact alarms, you'll gradually be able to resolve different versions of the same problem without having to use various automated fixes for each.
With these five steps, you'll be able to automate debugging and node retirement and reduce overall toil, so you can avoid those late-night wake-up calls.
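As a rough illustration of turning a manual runbook into a script, the Python sketch below models a runbook as an ordered list of named steps. The `Step` class and `run_runbook` function are hypothetical, and real steps would call your own tooling. Each step mirrors one line of the written runbook, and the run stops at the first failure so that a human can take over.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[], bool]  # returns True when the step succeeds


def run_runbook(steps: List[Step]) -> List[str]:
    """Execute steps in order and return a log; stop at the first failure."""
    log = []
    for step in steps:
        ok = step.action()
        log.append(f"{step.name}: {'ok' if ok else 'FAILED'}")
        if not ok:
            # Do not continue blindly; hand the problem back to a person.
            log.append("stopping: escalate to on-call")
            break
    return log
```

Keeping each step small and named after its manual counterpart makes the log readable to anyone who knows the original runbook, which helps build the confidence needed before moving to full automation.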
Why Is Runbook Automation Important?
A huge part of implementing runbook automation relies on communicating its benefits clearly to your C-suite. After all, it won't be a priority without your executives on board.
With that in mind, here are three significant ways RBA can improve your organization:
1. Save Costs
This point is crucial and may speak the most to your CTO.
Knowledge transfer is difficult, time-consuming, and expensive—particularly when DevOps constantly needs to update manual runbooks.
On top of that, incidents cause lost revenue and reputational damage. Remember the 2017 Amazon S3 outage, when a typo made by an employee following an established playbook cost companies in the S&P 500 index an estimated $150 million?
By incorporating runbook automation, you free your developers and SREs from issues that don't require human judgment so they can spend more time on higher-value projects. You can also run a leaner team, shorten incident response time, and reduce potential damages by having bots remediate issues in seconds rather than days or weeks.
Ultimately, people's time is a company's most expensive asset. Avoiding highly disruptive escalation chains can help save costs—especially for those putting in on-call hours.
2. Boost Innovation
Another benefit of runbook automation is that, by cutting down on interruptions, DevOps has more time to work on projects that move the needle for your business—like accelerating the adoption and deployment of new and innovative services.
By reducing toil and tasks that require the same solution over and over, your team can focus on innovative efforts that propel you ahead of your competition.
Additionally, by automating and removing repeatable tasks, you can expand business operations and manage a more extensive fleet with the same team.
3. Increase Customer Satisfaction
Often, the recurring issues you eliminate through automation are the ones that affect just a few customers at a time. They also don't always take customers offline but rather degrade service.
But they happen a lot.
By eliminating these issues through RBA, you save your customers thousands of hours of degraded service—not just for buyers experiencing it now, but also for those who will have it in the future. The result is happier customers who know they can depend on your services to keep their customers satisfied, too.
What About RBA Platforms?
Suppose you've assessed your team's current capacity and have determined that implementing runbook automation on your own may not be feasible due to limited resources. In that case, an RBA platform can simplify the effort by providing pre-built scripts that integrate with many of the technologies SREs use.
When evaluating an RBA platform, look for a few essential capabilities:
- It provides a path to end-to-end automation. While human-in-the-loop automation is a great way to start RBA, a tool that offers only that, without the promise of eventual end-to-end automation, can be more costly in the long run. With human-in-the-loop only, you still have the manual work of triggering a script, and if you later move to a new tool that does provide end-to-end automation, it can take weeks to transfer data and retrain staff.
- It allows you to create automations rapidly. Look for an enterprise solution that simplifies debugging and repairing new issues. If your RBA solution does this, it will make automating those tasks significantly more manageable, too. Some RBA solutions only promote "if-then" style alarm wiring, making it easy to publish scripts to the entire on-call team, but they don’t make creating new automations more accessible.
- It empowers you to set up granular alarms. Before you can automate solutions to recurring issues, you need to know exactly what the problems are. With a platform that allows you to construct granular alarms, you'll get to the root cause of issues and be able to select the proven, ready-made solution from the platform's op packs.
Bottom line—you want a solution that encourages agility and doesn't pigeonhole you into remediating only specific problems and only so far.
It's also important to note that, while most RBA platforms claim to support both human-in-the-loop and end-to-end automation, many are only practical for human-in-the-loop. Surprisingly, the big difference between these platforms and those that genuinely enable end-to-end isn’t in the automation—it’s in the alarm precision. Selecting an RBA platform that helps you create more precise alarms will allow you to have the confidence to transition to end-to-end automation down the road. Platforms that don't give you the option to construct exact alarms may not provide the certainty you need to move away from human-in-the-loop.
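To show why alarm precision matters, here is a hedged Python sketch based on the runaway garbage collection example from earlier. The metric names and thresholds are illustrative assumptions, not values from any particular platform. The coarse alarm fires on any busy CPU, while the precise alarm fires only when the evidence points to the specific problem that a JVM restart would fix.

```python
from dataclasses import dataclass


@dataclass
class NodeMetrics:
    cpu_percent: float       # overall CPU utilization
    gc_time_percent: float   # share of JVM time spent in garbage collection
    minutes_sustained: int   # how long the condition has lasted


def coarse_alarm(m: NodeMetrics) -> bool:
    # Fires for any busy CPU, including healthy load spikes.
    return m.cpu_percent > 90


def precise_alarm(m: NodeMetrics) -> bool:
    # Fires only when high CPU coincides with sustained heavy GC activity,
    # which is the case an automated JVM restart is meant to address.
    return m.cpu_percent > 90 and m.gc_time_percent > 50 and m.minutes_sustained >= 5
```

A node under legitimate heavy traffic would trip `coarse_alarm` but not `precise_alarm`, so an automated restart wired to the precise alarm is far less likely to act on the wrong problem.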
Say Goodbye to 3 AM Calls with Runbook Automation
Armed with what you've learned in this guide, you can confidently make the case to your CTO that it's time to transition away from laborious, manual runbooks and towards automation. Not only will your company be able to save costs, boost innovation, and improve customer satisfaction, but you'll also give your DevOps and SRE teams the support they need. No more late-night calls to fix the same old issues!
Ready to get started? Shoreline's RBA Platform supports human-in-the-loop and end-to-end automation with precise alarms, so you can ease your way into automating remediations in seconds, not weeks. Get started today with a free trial.