Routine problems arise daily in IT operations—and solutions aren't always easy to implement.
While experienced subject matter experts (SMEs) may know how to diagnose and solve these issues quickly, they're not always available to help as they're working on new product features and higher-value projects. Without their documented fixes, other developers and staff can waste countless hours searching logs and/or Google for answers.
That's where runbooks come into play.
With runbooks, anyone on your IT operations team can promptly remediate recurring problems using a comprehensive, step-by-step guide. A well-written runbook can even assist with automating remediations to issues, saving you and your SMEs valuable time.
In this article, you'll discover the basics of runbooks and how to build one that sets you up to resolve persistent problems through automation.
A runbook is a collection of processes and remediation steps that IT operations staff and other employees use to solve frequent technical problems. The goal of a runbook is to share SMEs' knowledge so that other members can more quickly and consistently resolve issues on their own without escalating to SMEs.
A well-written runbook empowers DevOps, site reliability engineers (SREs), and support staff to execute routine fixes consistently and more efficiently. Not only can runbooks improve solution quality and productivity, but they also provide a roadmap for building automated repairs in the future so that the runbook itself is no longer necessary.
There are three main types of runbooks a company may implement, depending on its level of technical expertise:
- A manual runbook is a set of steps that a human performs to solve an issue using little to no software automation. (Today, this is the most common approach.)
- A semi-automatic runbook is a step-by-step process that a person uses to run one or more scripts to diagnose and then repair an issue. Teams often want human oversight before these scripts run, because someone still needs to review the diagnostic data and choose among the repair options.
- A fully automatic runbook is a set of steps that software automation performs to solve a problem without human prompting. It's key here to have a very specific monitor or alarm so that you can have confidence that the runbook resolves the right incident with the right repair action.
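The difference between semi-automatic and fully automatic runbooks can be sketched in a few lines of Python. This is only an illustration, not any particular product's API: `RunbookStep`, `run_steps`, and the `approve` callback are hypothetical names. With no callback the steps run unattended. With a callback, a human (or a stand-in function) must approve each step before it runs.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class RunbookStep:
    """One step of a runbook (hypothetical structure for illustration)."""
    description: str
    action: Callable[[], str]  # performs the step and returns a short result


def run_steps(steps: List[RunbookStep],
              approve: Optional[Callable[[str], bool]] = None) -> List[str]:
    """Run steps in order.

    approve=None     -> fully automatic: every step runs without prompting.
    approve=callable -> semi-automatic: each step runs only if approved;
                        the run stops at the first rejected step.
    """
    results = []
    for step in steps:
        if approve is not None and not approve(step.description):
            results.append(f"SKIPPED: {step.description}")
            break
        results.append(f"{step.description}: {step.action()}")
    return results
```

In practice, `approve` might prompt the on-call engineer after showing the diagnostic output, which is where the human review described above would take place.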
Regardless of which type you use, you'll be far better off than companies not using runbooks. With runbooks, you'll be able to operate more efficiently and spend more time creating innovative products that outperform others in the market. Runbooks will also help reduce the time to repair issues, creating a better customer experience.
Naturally, SMEs will be the ones to support, diagnose, and resolve new or unique technical issues. When this happens, they'll want to document the steps they take to fix a problem into a repeatable guide—i.e., the runbook—so that other team members can resolve the same issue again. This act of recording ensures that incident response doesn't create bottlenecks or pull SMEs away from higher-value projects.
While specific runbook usage may vary from company to company, there are a few typical situations for which IT operations should consider creating and using a runbook:
- Frequently executed procedures. Anytime there's a process that a developer or SRE repeatedly does, they should thoroughly document the steps to ensure that different team members can resolve the task without intensive training or escalation.
- Procedures with high error rates. Some processes are prone to mistakes due to their complexity or length. Runbooks are great for these situations because they help minimize the potential for errors. Additionally, the more automation you involve, the more you can eliminate mistakes entirely.
- Procedures with significant risk. If a process could easily cause considerable damage—think manual schema updates or DDL commands—each step must be explicitly spelled out with no loose ends or room for interpretation to avoid unnecessary, large-scale issues.
An SME doesn't necessarily need to write every runbook for these situations, but their knowledge and expertise will be invaluable resources to fill in the blanks and ensure accuracy.
People often use the terms runbook and playbook interchangeably, but they can be quite different. A playbook is a broader concept that focuses on overall strategy rather than specific tactical methods, and it often contains multiple runbooks.
For example, an IT operations team may have a playbook for deploying a security patch to a fleet of servers. Within this playbook, there may be individual runbooks for how to test the patch, deploy it, update the server configurations, and safely restart the applications.
You can think of a playbook as a novel and the runbooks as the chapters. The book and chapters have a narrative and flow, but the former is broader and tells an interconnected story.
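One way to picture the relationship is as a simple data structure in which a playbook holds an ordered list of runbooks. The sketch below uses hypothetical class names (`Runbook`, `Playbook`) and reuses the security patch example from above. It is not a standard schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Runbook:
    """A tactical, step-by-step procedure (illustrative structure)."""
    title: str
    steps: List[str] = field(default_factory=list)


@dataclass
class Playbook:
    """A broader plan made up of several runbooks (illustrative structure)."""
    goal: str
    runbooks: List[Runbook] = field(default_factory=list)

    def outline(self) -> List[str]:
        """Return the runbook titles as a numbered outline."""
        return [f"{i}. {rb.title}" for i, rb in enumerate(self.runbooks, start=1)]


patch_playbook = Playbook(
    goal="Deploy a security patch to a fleet of servers",
    runbooks=[
        Runbook("Test the patch"),
        Runbook("Deploy the patch"),
        Runbook("Update the server configurations"),
        Runbook("Safely restart the applications"),
    ],
)
```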
Constructing a runbook is the first step toward solving routine issues through automation—helping you operate more efficiently and accurately.
Here's a look at four concrete steps you can take to write a runbook:
As you might recall, ideal candidates for runbooks are processes that are executed frequently, have high error rates, or carry significant risk. Review your team's processes and consider whether any of them fall into one of these categories.
Typical examples of processes that benefit from runbooks include resizing, archiving, or deleting files from full disks, restarting Java virtual machines (JVMs) that have CPUs maxed out due to runaway garbage collection, and terminating stuck Kubernetes pods. Having a clear runbook—and ideally automation—can minimize the risk of developers, SREs, and other support staff missing a step.
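As an illustration of the full-disk example, here is a minimal Python sketch of the diagnostic half of such a runbook: measure how full a disk is and list files old enough to be candidates for archiving or deletion. The function names and the age threshold are assumptions made for this example. The sketch only reports candidates and deliberately deletes nothing.

```python
import os
import shutil
import time


def disk_usage_percent(path):
    """Return the percentage of disk space used on the filesystem holding path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:  # guard against pseudo-filesystems reporting zero size
        return 0.0
    return 100.0 * usage.used / usage.total


def find_old_files(directory, max_age_days, now=None):
    """List files under directory last modified more than max_age_days ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # 86400 seconds per day
    old = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            full_path = os.path.join(root, name)
            if os.path.getmtime(full_path) < cutoff:
                old.append(full_path)
    return sorted(old)
```

A semi-automatic runbook could show this list to the on-call engineer for review before any cleanup step runs.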
Incident reports and postmortems can also help you identify good candidate processes for runbooks. These documents include detailed analyses of what happened during an incident and any recommended follow-ups. IT operations teams can use this data to determine root causes and to prevent the issue in the future with better documentation via a runbook.
Once you've identified an ideal task, you'll need to determine each step required to fix a problem manually. Answer questions like, “Will a ticket need to be created and/or closed?” and “Will the user have all the right security credentials and permissions to repair this issue?”
After you've determined the fix, you'll want to document it in a runbook and share it with the relevant engineering team (along with any past debug data). They'll be able to see whether they need to implement any changes to solve the root cause of the issue. While an SRE or support staff member may fix the problem at the moment of the incident, engineering may need additional data to fix it permanently.
In addition to providing a solution to the problem, you'll want to include any relevant diagnostic steps to help readers identify the issue quickly in the first place. Proper diagnosis can help prevent support delays and enable readers to find the correct runbook more quickly.
With the ideal task, solution, and diagnostic steps in hand, it's time to write the runbook! While this may seem like a simple step with all the research done, the document must be concise, easy to understand, and accurate. It should contain exactly the information needed to diagnose and resolve the problem, and nothing more.
A typical runbook will contain the following key sections:
- Overview. This section summarizes the problem and how to solve it—helping readers quickly navigate through multiple related runbooks to identify the right one for their situation.
- Authorization. This part of the runbook includes any relevant information about how the reader can access the affected system where they’ll be running the resolution process.
- Diagnostics for identifying the problem. The diagnostics include symptoms and other characteristics or information that can help the reader determine if this runbook is relevant to their situation.
- Steps for resolving the issue. This section is the core of the runbook and will provide the step-by-step process for remediating the problem.
- Monitoring system information. After the reader resolves the issue, this part of the runbook explains how they can monitor the system to confirm the issue has been resolved.
- Capturing data to help engineering fix the underlying problem. This section explains how the reader can document the issue and resolution and who to provide it to so that they can prevent it from happening again through a development fix or additional automation.
- Service level agreement. Lastly, this part of the runbook lists expectations from clients or internal stakeholders regarding how quickly the on-call team is expected to resolve the issue.
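The sections above can be treated as a template that every runbook should fill in. The following Python sketch shows one possible way to check a draft for empty sections. The class name `RunbookTemplate` and its field names are assumptions chosen to mirror the list, not an established format.

```python
from dataclasses import dataclass, fields


@dataclass
class RunbookTemplate:
    """One text field per key section of a runbook (illustrative names)."""
    overview: str = ""
    authorization: str = ""
    diagnostics: str = ""
    resolution_steps: str = ""
    monitoring: str = ""
    data_capture: str = ""
    service_level_agreement: str = ""


def missing_sections(runbook):
    """Return the names of sections that are empty or whitespace-only."""
    return [f.name for f in fields(runbook)
            if not getattr(runbook, f.name).strip()]
```

A check like this could run whenever a runbook is saved, so that drafts with an empty diagnostics or monitoring section are flagged before the on-call team has to rely on them.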
Some runbooks can be quite text-heavy, which can be challenging to go through. Consider including screenshots, diagrams, and flow charts to make your runbook easier to understand.
After you've written your runbook, think through how on-call teams will find the right runbook for the right incident. It's not uncommon to have multiple, similar runbooks, so decide up front how you will distinguish them, for example with consistent titles and tags or with links from the alerts that trigger them.
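One simple strategy, sketched here as an assumption rather than a recommendation of any specific tool, is to tag each runbook and rank runbooks by how many tags they share with the incoming alert. The function name `find_runbooks` and the tag vocabulary are hypothetical.

```python
def find_runbooks(index, alert_tags):
    """Rank runbook titles by the number of tags shared with the alert.

    index maps a runbook title to its list of tags. Runbooks with no
    shared tags are left out; ties are broken alphabetically by title.
    """
    wanted = set(alert_tags)
    scored = []
    for title, tags in index.items():
        overlap = len(wanted & set(tags))
        if overlap:
            scored.append((overlap, title))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [title for _, title in scored]


runbook_index = {
    "Full disk cleanup": ["disk", "storage"],
    "JVM restart": ["jvm", "cpu"],
    "Stuck pod termination": ["kubernetes", "pod"],
}
```

For example, an alert tagged `["cpu", "jvm", "prod"]` would surface only the JVM restart runbook from this index.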
Lastly, you need a plan for keeping runbooks up to date. Products and services are constantly evolving, which can change how on-call teams need to maintain them in ways nobody anticipated. Frequent maintenance is therefore key, and it gets harder as the number of runbooks increases.
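A lightweight way to support such a plan is to record when each runbook was last reviewed and flag the ones that are overdue. The sketch below assumes a 90-day review interval and a plain dictionary of review dates. Both are illustrative choices, not requirements.

```python
from datetime import date, timedelta


def stale_runbooks(last_reviewed, today, max_age_days=90):
    """Return the names of runbooks last reviewed more than max_age_days ago.

    last_reviewed maps a runbook name to the date of its last review.
    """
    cutoff = today - timedelta(days=max_age_days)
    return sorted(name for name, reviewed in last_reviewed.items()
                  if reviewed < cutoff)
```

Running a check like this on a schedule turns runbook maintenance into a short, regular task instead of a large cleanup after an incident exposes an outdated procedure.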
A runbook can be an essential tool for sharing knowledge about incident remediation, helping you avoid costly escalation chains. However, runbooks are just the first step toward efficient workflows. Runbook automation can prevent the need for human intervention altogether, saving you valuable time and resources.
Shoreline's RBA Platform supports semi-automatic and fully automatic runbooks with precise alarms, so you can easily automate remediations in seconds, not weeks.