Runbooks reduce toil and standardize processes across a team. But creating your runbooks is only the first step. Automating runbook execution based on an alarm, without human intervention, is the real goal.
Runbooks that kick off automated remediations let operators get ahead of ticket counts and incidents. Eliminating the need for a human to wake up and press a button or copy and paste a script increases availability, even as your fleets expand and applications grow in complexity.
So let's look at the how and why of automating runbooks.
Runbooks are a critical part of the third stage of operational maturity - they are a “defined” and proactive documentation of your tasks, problems, and procedures. Runbooks generally happen after stage two - creating a general repeatability of tasks - and just before stage four - implementation of automation. Here are some advantages to using runbooks:
- Runbooks can be a driving force used to identify concise problem statements.
- Runbooks can be an excuse to define processes.
- Runbooks can make thought patterns clear, repeatable, and consistent between humans. As a result, they can improve tooling implementation and fleet-wide management on orchestration platforms like Kubernetes, or on procedurally driven environments using GitOps.
- As infrastructure becomes bigger and more complex, and as processes drift to accommodate the latest incidents, runbooks can shift the burden of SLAs and uptime to the frontline operators that are accountable and responsible for its availability.
- Runbooks can serve as springboards for removing or even preventing the toil of technical debt caused by the evolution of systems or integration of new technologies.
- Runbooks can even democratize the processes created by your subject-matter experts, the superheroes of the team, and quickly automate them to provide self-serviceable procedures and scripts. Doing this shifts power out of our silos and into the hands of the larger business. Without this democratization, people may have to wait hours or days for an expert to type in a few commands while that expert spins plates to keep the business running.
Runbooks requiring manual intervention are a great step away from the pain of memorizing the functions and "isms" of an organization’s technologies and technical debts. Runbooks reliant on manual intervention usually come in the form of a wiki with a search bar and a hope (and/or dream) that a team member can find the documentation when it’s needed.
But as your team matures, you'll start looking more closely at operational maturity. ITIL defines the final level of maturity as an optimized process that is repeatable, easily defined, and widely automated.
Without manager-speak, this means working toward automating yourself out of a job.
Let's look at an example of this journey to maturity with runbooks:
- Stage 1: You identify a repeating problem of servers going offline in a datacenter.
- Stage 2: You identify a repeatable solution, then spend every Friday checking for bad servers and creating tickets to have them replaced.
- Stage 3: You create a manual runbook to define your repeatable solution - how to write a ticket to have the server replaced so your team can execute as the servers go down.
- Stage 4: You start to automate your runbook by writing a script to execute the runbook when servers go down.
- Stage 5: You fully automate your runbook by configuring a trigger to execute the runbook automatically when the server goes down, without human intervention.
This is a world where an operator identifies an opportunity to improve the infrastructure one server at a time. This person articulates how to do this process, democratizes the process to help offload the toil to make better use of time, and then entirely removes the team from the toil altogether by moving the execution to automation.
The operator has now sped up a process, introduced a system where automation can manage these units of focus, and as a result has gained back working hours.
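The last two stages of the example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `Server`, `TicketQueue`, `replace_server_runbook`, and `on_health_check` are hypothetical names, and the in-memory ticket queue stands in for whatever ticketing system your team actually uses.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Set


@dataclass
class Server:
    name: str
    online: bool


@dataclass
class TicketQueue:
    """Hypothetical stand-in for a real ticketing system."""
    tickets: List[str] = field(default_factory=list)

    def open_ticket(self, summary: str) -> None:
        self.tickets.append(summary)


def replace_server_runbook(server: Server, queue: TicketQueue) -> None:
    """Stage 4: the manual runbook's steps, captured as a script."""
    queue.open_ticket(f"Replace offline server {server.name}")


def on_health_check(
    servers: Iterable[Server], queue: TicketQueue, already_ticketed: Set[str]
) -> None:
    """Stage 5: a trigger that runs the runbook with no human in the loop."""
    for server in servers:
        # Skip servers that already have a ticket so repeated checks
        # do not flood the queue with duplicates.
        if not server.online and server.name not in already_ticketed:
            replace_server_runbook(server, queue)
            already_ticketed.add(server.name)
```

In a real environment, `on_health_check` would be called by a scheduler or a monitoring alert rather than by a person, which is the difference between stage 4 and stage 5.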
How Runbook Automation Helps You Scale Your Org
One of the promises of Kubernetes is that it automates the deployment of our infrastructure. Once we have Kubernetes properly configured, we’ve removed a certain class of problems. But there’s a whole class of problems that restarting a container or rolling back a configuration won’t solve. Examples include rotating security certificates, cordoning and draining a bad node, and other issues that may involve state.
Expanding the range of maintenance issues and incidents we can resolve automatically beyond restarts expands our operational capability and gives operators superpowers. Automating the creation of runbooks is a good first step, but the end goal should be to offload more and more of the maintenance and incident remediation to our machines.
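To make one of those examples concrete, here is a minimal sketch of a cordon-and-drain remediation written in Python around `kubectl`. The helper names, the default timeout, and the choice of drain flags are assumptions for illustration; the right flags depend on your workloads. The function defaults to a dry run, so it only returns the commands it would execute.

```python
import subprocess
from typing import List


def cordon_and_drain_commands(node: str, timeout_seconds: int = 120) -> List[List[str]]:
    """Build the kubectl commands that take a bad node out of rotation."""
    return [
        # Mark the node unschedulable so no new pods land on it.
        ["kubectl", "cordon", node],
        # Evict the pods already running there. The flag choices are assumptions.
        [
            "kubectl", "drain", node,
            "--ignore-daemonsets",
            "--delete-emptydir-data",
            f"--timeout={timeout_seconds}s",
        ],
    ]


def remediate_bad_node(node: str, dry_run: bool = True) -> List[List[str]]:
    """Run the commands in order, or only return them when dry_run is True."""
    commands = cordon_and_drain_commands(node)
    if not dry_run:
        for command in commands:
            # check=True stops the remediation if any step fails.
            subprocess.run(command, check=True)
    return commands
```

Keeping command construction separate from execution makes the remediation easy to review and test before it is ever allowed to run unattended.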
So let’s take a look at various tools that get us here.
Comparing Automated Runbook Tools
Several products exist on the market to help you automate your runbooks. They can simplify the identification of problems and trigger prescribed steps for the solutions needed.
These products are the robots that we want taking over our work. Specifically, we want them taking over the work of humans repeatedly hammering on uninteresting problems that have already been solved.
These products break down into a few categories.
Confluence and Wiki.js serve as great tools for documenting processes as wikis, but don’t offer automation capabilities.
Rundeck and Transposit take metric signals from monitoring and observability platforms and then provide predefined runbooks for a human to follow when remediating an incident.
Shoreline collects real-time resource metrics across your fleet and can trigger configured remediation actions based on these metrics to resolve incidents without human intervention. Whereas an automatically created runbook still requires a human to process, execute, and review, an automatically executed runbook takes an input of data to analyze and make a decision, just like a human would but faster and more consistently.
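The difference between a runbook that is created automatically and one that is executed automatically can be sketched generically. The Python below is not Shoreline's API or any other product's. `Alarm`, `evaluate`, and `clear_tmp_files` are hypothetical names, and the remediation only returns a description of what it would do.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Alarm:
    name: str
    metric: str                         # key to look up in the metrics sample
    threshold: float                    # fire when the value exceeds this
    remediation: Callable[[str], str]   # action to run against the host


def evaluate(alarms: List[Alarm], host: str, metrics: Dict[str, float]) -> List[str]:
    """Check each alarm against the latest metrics and run its remediation."""
    results = []
    for alarm in alarms:
        value = metrics.get(alarm.metric)
        # A missing metric is treated as "no data", not as a breach.
        if value is not None and value > alarm.threshold:
            results.append(alarm.remediation(host))
    return results


def clear_tmp_files(host: str) -> str:
    """Hypothetical remediation; a real one would act on the host."""
    return f"cleared temp files on {host}"
```

A call such as `evaluate([Alarm("disk-full", "disk_used_percent", 90.0, clear_tmp_files)], "web-1", {"disk_used_percent": 95.0})` runs the remediation without anyone being paged, while a reading below the threshold does nothing.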
The Benefits of Automating Runbook Execution
Different teams might take different approaches or gain different benefits from such systems.
- Operations subject-matter experts (SMEs) can see this as an opportunity to stop the middle-of-the-night calls needed to mitigate certain critical problems. The automation resolves the problem itself, so no one has to create a dev ticket and keep manually resolving the incident until that ticket gets prioritized.
- DevOps can see this as a way to effectively test new ideas and functions with consistent methodologies. Automated runbooks can ensure successful integration post-testing phases.
- Leadership teams can take the opportunity to integrate reporting and analytics as part of the actions. This will help them to better understand the health of their systems and prioritize project improvements that regularly require runbook executions. They’ll also see these automations keeping ticket counts constant, even while fleets continue to expand with the business.
When you’re able to automate task execution, it’s easier to justify taking those extra steps to ensure stability or to provide analytics that might otherwise seem like a waste of time. These extra steps often reduce execution time, shorten the impact of incidents, reduce service requests (by trusting teams to self-service and execute complex tasks), and ultimately protect your workforce from the trials and tribulations of technology.
Runbooks do reduce effort, but creating them is only a stop on the road. Automating runbook execution gives operators leverage to eliminate redundant tickets and maintenance work, even as environments scale in size and complexity.
Shoreline is building out-of-the-box remediations based on our beta customer feedback, and you can join the beta here. We’d love to learn about the challenges you’re facing in your infrastructure as we continue to build.