If you work in site reliability engineering (SRE) or are responsible for managing production operations, toil is likely the bane of your existence. Repetitive, tedious work that you perform manually slows you down and makes your job less satisfying.
Fortunately, as this article explains, it’s possible to conquer toil. By identifying and assessing the sources of toil, then deploying automation tools to reduce it, you can minimize the time you waste on manual work. By extension, you can maximize your efficiency while improving availability.
So, let’s take a look at practical steps for reducing toil, starting with a definition of what toil means, just in case you’re not fully acquainted with the term.
What Is Toil?
Toil is any manual, repetitive, tedious task that engineers perform within production environments. Because toil wastes time, slows down operations, and hurts the customer experience, most SRE and engineering teams strive to minimize it in their workflows.
Although toil can occur in any IT operations work, reducing it tends to be a particular focus for SREs. A core part of their job is to build automations that streamline tasks teams would otherwise perform manually, so toil reduction is a natural fit for their skillset and priorities.
Why Reduce Toil?
Toil reduction is crucial because it gives your developers, DevOps engineers, and SREs more time to focus on projects that create net-new value for the business—like implementing new application features or redesigning infrastructure to make it more efficient.
Additionally, toil that is left unchecked can grow to the point that it becomes toxic to long-term success. Engineers may drown in tedious, manual tasks and never find time to dig out. Toil can also damage job satisfaction: no one wants to be woken up in the middle of the night to restart a server or resize a disk manually. When companies automate these tasks, SRE and DevOps teams find their jobs less stressful and more rewarding.
With these benefits in mind, you may be wondering how much toil is too much. Google states that it has an “advertised goal” of keeping toil “below 50% of each SRE’s time.” In other words, no engineer should spend more than half of their time on manual, repetitive work.
While keeping toil levels below 50% might be a good objective for businesses that are new to SRE or toil reduction, the ultimate goal should be to get toil as close to zero as possible. Spending half of your time on repetitive work that you could automate is too much to accept as a regular part of operations, given how many tools organizations now have available for streamlining tedious tasks.
Strategies for Reducing Toil
Now that we’ve discussed what toil is and why it’s harmful to both the business and SREs, let’s talk about minimizing it. Although the sources of toil vary from company to company, there are several toil reduction tactics that work regardless of your organization’s size, industry, or technology stack.
Know How to Identify Toil
Conquering toil starts with learning to recognize it in the first place. You can’t fix what you can’t see.
The best way to detect toil is to adopt a data-driven approach based on the ticketing system that your SREs use. Ticketing systems help quantify where SREs spend the bulk of their efforts and assist management in making objective remedial decisions about how to operate more efficiently.
As you assess SRE activities based on ticketing data, strive to sort the activities into two categories:
- Activities that create value
- Activities that amount to toil
For example, if you notice that SREs are repeatedly handling tickets associated with running a data backup script, it’s an indication that performing the backups requires too much human intervention. The team could likely save effort—and thereby reduce toil—by finding ways to automate or harden the backup process so that they don't need to create tickets and wait for engineers to respond to them manually.
Similarly, tickets that need several people to close them out may point to tasks that are sources of toil. The more engineers who work on a ticket, the less efficient the underlying process is likely to be, and the larger the opportunity to streamline it using automation.
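To make the two signals above concrete, here is a minimal sketch that flags task types that recur often and tickets that pull in several engineers. The `Ticket` fields, the task labels, and the thresholds are all assumptions for illustration; map them to whatever your own ticketing export provides.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Ticket:
    # Hypothetical fields, not tied to any particular ticketing system.
    task: str             # short label for the kind of work
    assignees: list[str]  # everyone who worked on the ticket


def toil_candidates(tickets, min_repeats=3, min_assignees=2):
    """Return task labels that recur often or that involve several engineers."""
    repeats = Counter(t.task for t in tickets)
    recurring = {task for task, n in repeats.items() if n >= min_repeats}
    crowded = {t.task for t in tickets if len(set(t.assignees)) >= min_assignees}
    return {"recurring": recurring, "multi_engineer": crowded}


# Made-up sample data for illustration only.
tickets = [
    Ticket("run-backup-script", ["ana"]),
    Ticket("run-backup-script", ["ben"]),
    Ticket("run-backup-script", ["ana"]),
    Ticket("resize-disk", ["ana", "ben", "chen"]),
    Ticket("design-review", ["chen"]),
]

print(toil_candidates(tickets))
```

The thresholds are arbitrary starting points. A flagged task is only a candidate: someone still has to decide whether it creates value or amounts to toil.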
Another strategy to help eliminate toil is to ensure you measure it correctly. Measuring each source of toil using an objective metric, such as the minutes or hours spent on each task, allows you to identify where the most significant culprits of toil lie.
Just as they help you identify toil, ticketing systems make it easy to quantify how much time engineers spend on each task. You can also correlate variables such as the ticket assignee, the time of day when the ticket was processed, and the system or process it addressed. The more detail your tickets include, the more visibility you'll gain into the extent of the toil.
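As a rough illustration of measuring toil in minutes, the sketch below totals time per task and computes the share of logged time that went to toil, which you could compare against a target such as the 50% goal mentioned earlier. The tuple layout and the `is_toil` flag are assumptions; in practice that classification would come from however your team labels its tickets.

```python
from collections import defaultdict

# Hypothetical ticket export: (task label, minutes spent, is_toil flag).
entries = [
    ("run-backup-script", 45, True),
    ("restart-server", 30, True),
    ("run-backup-script", 60, True),
    ("capacity-planning", 120, False),
    ("resize-disk", 25, True),
]


def minutes_by_task(entries):
    """Total toil minutes per task, largest first."""
    totals = defaultdict(int)
    for task, minutes, is_toil in entries:
        if is_toil:
            totals[task] += minutes
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def toil_share(entries):
    """Fraction of all logged minutes that went to toil (0.0 if nothing logged)."""
    total = sum(minutes for _, minutes, _ in entries)
    toil = sum(minutes for _, minutes, is_toil in entries if is_toil)
    return toil / total if total else 0.0


for task, minutes in minutes_by_task(entries):
    print(f"{task}: {minutes} min")
print(f"toil share: {toil_share(entries):.0%}")
```

Sorting by total minutes puts the most expensive sources of toil at the top, which is usually where automation pays off first.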
Beyond making it easy to identify and measure toil, an inherent advantage of using ticketing systems in toil reduction is that you don’t have to work manually to collect data about toil in the first place. With a sound ticketing system in place, your toil reduction process doesn’t become a source of toil itself. A ticketing system also bubbles up the data executives need to understand where their team is spending time so they have the right perspective for investing in automation.
While identifying and measuring toil are great foundational strategies for reducing toil, automation is where you'll see the most significant impact.
You may not be able to automate every manual and repetitive task, so to get started, you’ll need to assess whether you can remediate each source of toil via automation. Tasks requiring highly contextualized decision-making by engineers, such as responding to infrastructure failures, are typically difficult to automate. High-risk tasks such as bringing down a mission-critical app are also not good candidates for automation. And some issues are simply bugs that can and should be repaired within the product.
After deciding which tasks to automate, determine how to translate your current manual workflow into an automated one. Start by picking one very specific issue to fix, then define an alarm that gives you high confidence that this issue is the cause. Next, map out the steps a human takes to resolve it by writing a targeted runbook, walking through every step, system, and login required to fix the problem.
From there, consider how you can automate each step in your manual runbook. Automation can perform many of the same tasks as humans, but more efficiently, and it often simplifies the workflow. For instance, whereas a manual workflow may require engineers to provision multiple servers one at a time, an automated version could use Infrastructure-as-Code tooling to provision them simultaneously. Using pre-approved automations also reduces the risk of human errors that can easily happen when operating quickly under the pressure of an outage.
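One way to picture the move from runbook to automation is to express each runbook step as a small function and run the steps in order. The sketch below does this for a hypothetical "disk almost full" alarm. The step names, the 90% threshold, and the placeholder bodies are assumptions; real steps would call your own tooling.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    name: str
    action: Callable[[dict], None]  # each step reads and updates a shared context


def run_runbook(steps, context):
    """Run steps in order, stop at the first failure, and return a log."""
    log = []
    for step in steps:
        try:
            step.action(context)
            log.append((step.name, "ok"))
        except Exception as exc:
            log.append((step.name, f"failed: {exc}"))
            break
    return log


# Placeholder step bodies: stand-ins for real measurements and actions.
def check_usage(ctx):
    ctx["usage_pct"] = 93  # hypothetical reading that triggered the alarm


def clear_temp_files(ctx):
    ctx["usage_pct"] -= 20  # pretend the cleanup freed 20 points of usage


def verify(ctx):
    if ctx["usage_pct"] >= 90:
        raise RuntimeError("disk still above threshold")


runbook = [
    Step("check disk usage", check_usage),
    Step("clear temp files", clear_temp_files),
    Step("verify usage", verify),
]

context = {}
log = run_runbook(runbook, context)
for name, status in log:
    print(f"{name}: {status}")
```

Keeping a log of each step mirrors what an engineer would note down by hand, so the automated run stays auditable.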
Remember, too, that even if you can’t fully automate a workflow, you may be able to automate it partially. For example, you might be able to reduce the toil associated with monitoring or observability by automating the diagnostic collection, even if you still require manual analysis of the data. This step reduces the number of engineers who need to play a role in monitoring while also reducing the time it takes to collect data. Both advantages lead, in turn, to lower Mean Time to Detect (MTTD) and Mean Time to Respond (MTTR).
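As an example of partial automation, the following sketch collects a small diagnostic snapshot using only the Python standard library and leaves the analysis to a human. Which fields are worth collecting is an assumption here; a real collector would gather whatever your engineers normally look up by hand.

```python
import json
import os
import platform
import shutil
import time


def collect_diagnostics(path="."):
    """Gather a small, read-only snapshot of host state for a human to review."""
    usage = shutil.disk_usage(path)
    snapshot = {
        "collected_at": time.strftime("%Y-%m-%dT%H:%M:%S%z"),
        "host": platform.node(),
        "os": platform.platform(),
        "disk_total_bytes": usage.total,
        "disk_used_pct": round(100 * usage.used / usage.total, 1),
    }
    # Load averages exist only on Unix-like systems and can be unavailable.
    if hasattr(os, "getloadavg"):
        try:
            snapshot["load_avg_1m"] = os.getloadavg()[0]
        except OSError:
            pass
    return snapshot


print(json.dumps(collect_diagnostics(), indent=2))
```

Because the function only reads state, it is a low-risk first automation: it can run on every alarm and attach its output to the ticket before anyone is paged.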
Deploying basic automations doesn’t mean that your battle against toil is complete. In many cases, the automations that SREs first implement are “human-in-the-loop,” which means they require humans, typically skilled engineers, to take action before an automated process can finish.
To take automation to the next level, strive to enable self-service automations that allow anyone on-call to trigger and execute an automated workflow without requiring specialized engineers to participate. Consider limiting the scope of resources affected by a single command for users in different roles. With this setup, you get decentralized automations that your entire team can use to minimize toil, rather than just key employees.
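The idea of limiting scope by role can be sketched as a simple check in front of the automation. The role names, the limits, and the `restart` action below are hypothetical; the point is only that the wrapper refuses a command that touches more resources than the caller's role allows.

```python
# Hypothetical roles and per-command limits on how many resources may be touched.
MAX_TARGETS_BY_ROLE = {"on_call": 1, "senior_sre": 10}


def run_self_service(role, targets, action):
    """Run `action` on each target if the role's scope limit allows it."""
    limit = MAX_TARGETS_BY_ROLE.get(role, 0)  # unknown roles get no access
    if len(targets) > limit:
        raise PermissionError(
            f"role {role!r} may affect at most {limit} resource(s), "
            f"got {len(targets)}"
        )
    return [action(target) for target in targets]


def restart(target):
    # Placeholder for a real, pre-approved remediation.
    return f"restarted {target}"


print(run_self_service("on_call", ["web-1"], restart))
print(run_self_service("senior_sre", ["web-1", "web-2"], restart))
```

Defaulting unknown roles to a limit of zero keeps the check fail-closed, so a typo in a role name cannot widen access.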
Going even further, look for opportunities to take humans out of automation loops to achieve end-to-end automation. End-to-end automation means that processes are entirely automated, with no human oversight necessary. Although full automation is not feasible for every workflow—high-risk procedures, for example, may require a human to sign off before the workflow can proceed—there are many tedious tasks you can automate completely using tools like Shoreline’s End-to-End Automation Platform.
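The distinction between fully automated and sign-off workflows can be sketched as a small dispatcher: low-risk alarms are remediated with no human involved, while high-risk ones wait for approval. This is a generic illustration, not a depiction of any vendor's platform; the alarm names and remediation functions are assumptions.

```python
def handle_alarm(alarm, remediations, high_risk, approved=False):
    """Run the remediation for an alarm unless it is high-risk and unapproved."""
    remediate = remediations.get(alarm)
    if remediate is None:
        return "no automation: page a human"
    if alarm in high_risk and not approved:
        return "waiting for sign-off"
    return remediate()


# Hypothetical alarms mapped to placeholder remediations.
remediations = {
    "disk_full": lambda: "cleared temp files",
    "failover_db": lambda: "promoted replica",
}
high_risk = {"failover_db"}

print(handle_alarm("disk_full", remediations, high_risk))
print(handle_alarm("failover_db", remediations, high_risk))
print(handle_alarm("failover_db", remediations, high_risk, approved=True))
print(handle_alarm("unknown_alarm", remediations, high_risk))
```

Over time, alarms can move out of the high-risk set as confidence in their remediations grows, which is how a human-in-the-loop workflow gradually becomes end-to-end.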
Promote Toil Reduction
Toil reduction doesn’t depend just on tools and processes. It’s also partly about culture. You need to ensure that everyone in your organization understands why reducing toil is a worthwhile goal, benefitting your employees, your business, and your customers.
When toil reduction is a cultural priority for your organization, you can stay ahead of your competitors and position yourself better to create more compelling products. At the same time, your employees will be happier because they’ll spend their time on productive work instead of dull, repetitive tasks.
Manager support is crucial for achieving an anti-toil culture. Managers should take the lead in communicating to employees why toil reduction is beneficial and which tools are available for reducing toil. They should also build a culture that recognizes those who identify new ways to eliminate toil.
Use the Right Tools
Minimizing toil may seem like a daunting task, especially if you don’t already have tools in place that can automate tedious workflows.
Fortunately, you don’t usually have to build toil reduction tools from scratch. Chances are high that other businesses experience types of toil similar to your team's and that someone has already created tools to help address them. Look to vendors like Shoreline, which specialize in developing automation solutions for DevOps and SRE teams, to find ready-to-use tools and pre-built automations that reduce development costs and minimize toil.
In other words, don’t feel like reducing toil is a job you need to handle on your own. Great tools and resources for minimizing many sources of toil already exist.
Take the Next Steps toward Reducing Toil
If you're tired of toil, the good news is that there are actionable steps you can take to get it as close to zero as possible. Start by determining which processes in your operations create toil and measuring how much time your engineers waste due to that toil. Then, deploy automation tools that can streamline toil-heavy processes, at least partly.
From there, look for ways to reduce toil even further by placing self-service automation tooling in the hands of as many employees as possible. You should also strive to deploy end-to-end automations to take humans out of the loop entirely.
Shoreline can help. As a company specializing in automating incident remediation so that SREs can spend less time on toil and less time on-call, Shoreline knows what it takes to root out toil wherever it lurks.
We’ve developed an End-to-End Automation Platform that makes it possible for any business to reduce toil and automate remediations in minutes, not weeks. Learn more by requesting a demo.