3 Ways to Reduce Downtime and Improve Site Reliability

Downtime and site reliability issues happen far too often. These three productivity hacks help operations teams tackle more with less.

If you work in production operations, downtime and customer experience both top your list of significant concerns. And for good reason. Downtime and customer experience issues occur far too frequently — our recent survey found that teams deal with 278 incidents per month and spend 2,084 hours per month trying to fix them. Not only do these issues happen often — they also have significant consequences, like lost revenue, damage to customer satisfaction, and more.

As environments scale, so does the number of issues

As environments become more complex, the number of issues only rises. On-call costs climb quickly as hosts multiply and infrastructure grows more complex, and eventually throwing more money at the problem becomes unsustainable. That’s why it's so important to find ways to do more with less instead of simply adding more people or budget to reliability problems.

Investing in on-call productivity is an important way to tackle this problem. There are two main benefits from these investments:

  • Reducing the time spent resolving incidents even while your infrastructure is becoming larger and more complex
  • Increasing time spent on proactive work that can help you avoid a major outage in the future

In this blog, we’ll explore three ways to improve the productivity of on-call teams so we can facilitate a shift to a virtuous cycle.

1. Improve your incident analytics so you know where to invest

Most companies don’t have great insight into their incidents. Data is spread across email, Slack, and incident management tools. As a result, companies don't know how much time their team is spending on resolving incidents or how to prioritize investments. Companies often miss golden opportunities to improve because they underestimate the impact of different sources of incidents.

Whenever downtime or site reliability issues generate a ticket, it’s possible to collect a lot of valuable information including where the issue happened, when, who worked on it, the impact on customers, and how long the ticket was open. Ensuring that this additional data is captured and then analyzed will help shape your roadmap for investing in reliability.
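
To make that concrete, here is a minimal sketch of what an exported incident record could look like. The field names are hypothetical; swap in whatever your ticketing and incident management tools actually capture.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class IncidentRecord:
    incident_type: str    # hypothetical label, e.g. "disk-full" or "stuck-pod"
    service: str          # where the issue happened
    opened_at: datetime   # when the ticket was opened
    closed_at: datetime   # when the ticket was closed
    assignee: str         # who worked on it
    customer_impact: str  # e.g. "none", "degraded", "outage"

    @property
    def hours_open(self) -> float:
        """How long the ticket stayed open, in hours."""
        return (self.closed_at - self.opened_at).total_seconds() / 3600
```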

One simple way to start is by measuring the total time spent on different incidents. To do this, you need to know how many times a type of incident occurred in the last month and the total time spent on all of these occurrences. Once you identify the issues, categorize them into two buckets:

  • Issues that have a root cause that should be fixed in your code
  • Issues that will continue to occur and that are outside of your control, such as an issue with Kubernetes or a hardware issue (the root cause cannot be easily fixed)
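
As a rough illustration, the sketch below tallies the hours spent per incident type and flags which bucket each one falls into. It assumes incidents have already been exported into records like the one sketched earlier, and the bucket sets are placeholders your team would curate by hand.

```python
from collections import defaultdict

# Hypothetical, hand-curated mappings of incident types to the two buckets
FIXABLE_IN_CODE = {"memory-leak", "bad-retry-logic"}
OUTSIDE_OUR_CONTROL = {"stuck-pod", "hardware-failure"}

def summarize(incidents):
    """Print total hours spent per incident type over the reporting window."""
    hours_by_type = defaultdict(float)
    count_by_type = defaultdict(int)
    for inc in incidents:
        hours_by_type[inc.incident_type] += inc.hours_open
        count_by_type[inc.incident_type] += 1

    # Surface the most expensive incident types first
    for itype, hours in sorted(hours_by_type.items(), key=lambda kv: -kv[1]):
        if itype in FIXABLE_IN_CODE:
            bucket = "fix the root cause in code"
        elif itype in OUTSIDE_OUR_CONTROL:
            bucket = "build a repair action"
        else:
            bucket = "needs triage"
        print(f"{itype}: {count_by_type[itype]} occurrences, "
              f"{hours:.1f} hours total -> {bucket}")
```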

It's also important to be realistic: some issues may be easier to mitigate with a repair script than by fixing the underlying cause. Either way, it’s important to establish a culture of continuous improvement where every week you are addressing at least one issue that is hurting uptime. If the root cause is something that engineering can tackle, then designate time and resources to fix it. For issues where the root cause is harder or impossible to fix, it's time to build repair actions that can be run consistently each time the issue occurs, possibly even automatically.

2. Create self-service tools to empower your team

When issues like downtime occur, engineers scramble to find a solution. They often ask, “What commands should I run to debug or repair this particular type of issue?” which leads to a painstaking search of sites like Stack Overflow. It's not uncommon for engineers to spend hours looking for the correct commands to run and then just a few minutes actually running them. You’ve probably never analyzed the impact of all that time spent searching for a solution (it isn’t often tracked), but you can bet it adds up (and it impacts your bottom line).

It’s important to reduce the time on-call engineers spend searching for solutions. Self-service tools like runbooks can put expert-level knowledge in the hands of less experienced engineers and even support teams. With runbooks, anyone on call can quickly diagnose a problem and implement a fix. The best self-service tools and runbooks also educate while they enable: they empower engineers to fix an issue in the moment and give them knowledge they can apply to prevent future issues.
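
As one illustration of the idea, the sketch below encodes a small diagnostic runbook step for a Kubernetes pod issue. It assumes a Kubernetes environment and kubectl access; the commands and steps are examples, not a prescription from any particular tool. The point is to capture the "what commands do I run?" knowledge once, so anyone on call can execute it instead of rediscovering it mid-incident.

```python
import subprocess
import sys

def run(cmd):
    """Run one runbook command and echo its output."""
    print(f"$ {' '.join(cmd)}")
    result = subprocess.run(cmd, capture_output=True, text=True)
    print(result.stdout or result.stderr)

def diagnose_pod(namespace: str, pod: str) -> None:
    # Step 1: current status, restart counts, and recent events for the pod
    run(["kubectl", "describe", "pod", pod, "-n", namespace])
    # Step 2: the last 100 log lines from the pod
    run(["kubectl", "logs", pod, "-n", namespace, "--tail=100"])
    # Step 3: CPU/memory usage (requires metrics-server to be installed)
    run(["kubectl", "top", "pod", pod, "-n", namespace])

if __name__ == "__main__":
    # Usage: python diagnose_pod.py <namespace> <pod-name>
    diagnose_pod(sys.argv[1], sys.argv[2])
```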

Self-service tools can have a huge impact on MTTR (mean time to repair) because you no longer have to document the issue, find the right person to handle the issue, wait until they are available, and then wait for them to get up to speed on the issue. Our recent survey found that escalated issues take 3X as long to resolve — it’s not surprising given all these additional steps.

3. Create self-healing repair actions to eliminate reliability issues

For issues that you know will continue to happen, consider implementing self-healing repair scripts. This will enable you to reduce manual work that would otherwise be required to both identify issues and implement a remediation. These types of repair actions are well suited to straightforward, recurring issues like disk resizing, rotating certificates, and terminating stuck pods.

Many people write off this approach because they think their issues cannot be fixed through code, or that writing that code would be too hard. But more things are automatable than you might realize. The first step on this journey is creating a precise alarm. A generic alarm that could map to several different issues can be addressed through a runbook, but it cannot be fixed automatically without further analysis. If, instead, you can define a very specific alarm where you have full confidence in the root cause, then you can start writing code to fix the issue.
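
To give a sense of what that looks like in practice, here is a minimal sketch of a repair action for one of the recurring issues mentioned above: pods stuck in a Terminating state. It assumes Kubernetes and kubectl access, and the threshold is a placeholder; in a real setup this would run only when a precise alert for this exact condition fires.

```python
import json
import subprocess
from datetime import datetime, timezone

STUCK_AFTER_SECONDS = 15 * 60  # placeholder: treat a pod as stuck after 15 minutes

def find_stuck_pods(namespace: str) -> list[str]:
    """Return pods whose deletion has been pending longer than the threshold."""
    out = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
        capture_output=True, text=True, check=True,
    ).stdout
    now = datetime.now(timezone.utc)
    stuck = []
    for pod in json.loads(out)["items"]:
        deletion_ts = pod["metadata"].get("deletionTimestamp")
        if not deletion_ts:
            continue  # not being deleted, so not the issue we target
        requested = datetime.fromisoformat(deletion_ts.replace("Z", "+00:00"))
        if (now - requested).total_seconds() > STUCK_AFTER_SECONDS:
            stuck.append(pod["metadata"]["name"])
    return stuck

def repair(namespace: str) -> None:
    """Force-delete only the pods that match the narrow condition above."""
    for name in find_stuck_pods(namespace):
        subprocess.run(
            ["kubectl", "delete", "pod", name, "-n", namespace,
             "--grace-period=0", "--force"],
            check=True,
        )
        print(f"force-deleted stuck pod {name}")
```

Force-deleting is a blunt instrument; the value here is that a narrowly scoped check plus a scripted fix removes a page entirely for an issue you have already decided is outside your control to fix at the source.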

Investing in this type of work can make a huge difference over time. Each time you fix an issue forever, you free up additional time to invest in future self-healing actions, which in turn frees up even more time.

Shoreline’s cloud reliability platform

Shoreline’s cloud reliability platform makes it easy to create self-healing actions to address recurring issues in your environment. It also gives you real-time, granular visibility into your infrastructure which makes it possible to find new issues in seconds instead of hours. This is particularly critical during major outages where every second counts. For issues that require human intervention, Shoreline provides Notebooks that give users step-by-step runnable recipes for fixing issues. This speeds up a repair, increases consistency, and reduces the risk of an even bigger outage. With a cloud reliability tool like Shoreline, your team can work smarter, not harder, allowing you to tackle more issues in a day, do more with a smaller team, and ultimately spend more time innovating.