Restarts and rollbacks don't fix everything

GitOps, ChatOps, MLOps, BugattiOps. With the exception of the last one I just made up, every iteration of automating and streamlining operational procedures has been advertised as the cure-all solution to every ailment, including resolving incidents.

While declarative infrastructure, programmatic deployments, and repeatable automations are desirable, they can’t resolve or prevent every incident in a large infrastructure. Even using Kubernetes to restart containers can only solve a certain class of issues. There are still problems left for the operator teams tasked with holding things together until the next version reaches production.

So let's look at several Ops, what they do well, and why they can't solve all our problems.

What is XYZOps and Why Can’t it Fix Everything?

A good DevOps toolset can make operationalization more successful and more enjoyable.

GitOps exists to reduce human error and remove operator ownership by making everything declarative: an immutable definition of what infrastructure, code, or policies mean and how to implement them through integrated automation. GitOps is good for maintaining the state of your environments and cataloging their changes, leaving a trail that you can use to revert changes should problems arise.
ChatOps bridges the technical implementations of various technologies with a human interface. It's a single place to drive and report data and decisions. ChatOps offers a good method of quickly interacting with your systems without jumping through bastions or gathering lots of datasets to take a prescribed action. ChatOps brings operators into one shared space where they can communicate quickly, but it doesn’t do anything about human error.
MLOps goes all the way with Machine Learning. It implements the heuristics of big data analysis on one’s infrastructure paired with various models of understanding--all meant to help one set of machines manage and deploy other sets of machines. Robots managing robots is probably someone’s definition of a utopia. A lot of distributed systems problems are big data problems better solved by machines, but actions based on statistics create their own type of error.

So how can these DevOps methodologies help with incidents?

GitOps helps prevent incidents by adding mechanisms for peer review, testing, and configuration validation. However, it may leave behind unidentified time bombs and edge cases. While the iteration and logging can aid in reverting a system, sometimes these problems aren’t found until a much later release cycle, when rolling back may be more impactful than the problems that were introduced.

ChatOps aids in managing incidents with its human interface. It can automate the response or even initiate rollbacks and gather data for analysis. However, it remains reactive in nature and is only as good as the prescribed solutions it was designed for.

MLOps is exciting and new, allowing for data models to drive decisions and analytics to manage infrastructure. However, its capacity to manage and mitigate incidents must be matured and vetted before it can be trusted. And like ChatOps, it requires prescribed functionality for mitigation or action.

These are all great tools for the beginning of the pipeline and for prescription-based operability, but MLOps and GitOps in particular lack the human flexibility required for catching and fixing new outages or subtle heuristics that have yet to be worked into our systems. The lack of flexibility is by design to remove human error.

Human error causes many of our largest outages. But the idea that we’d ever arrive at a pristine, GitOps-configured environment is a fantasy because things are always changing underneath us. GitOps can solve a lot of problems, but it doesn’t cover everything. Just ask your operator. Ops is still death by a thousand cuts.

A config we can roll back to is wonderful, but we often go through many iterations of mistakes before we finally reach an end product that removes the weak human link from the chain of custody that makes up our development pipelines.

GitOps policies also don’t help us identify and fix dynamically occurring problems that the config hasn’t accounted for. We’re dealing with distributed systems, after all. Whether due to changing infrastructure paradigms or a lack of data ingestion from that new business flow, any number of things could prevent operators from seeing the approaching doom. They’ll certainly need to be able to react in the moment to fix problems that aren’t solved by a restart or a rollback.

Operators hold their companies on their shoulders, balancing Service Level Agreements, Operational Level Agreements, flexibility in getting a product to market while maintaining standardization, and ensuring that scale can match the demand. The humans juggling these problems are already staring down the barrel of new technology and reporting to leadership that wants the latest and greatest. At the same time, they are maintaining legacy products, or simply bailing water out of the sinking ship after a bad code deployment that can be “just restarted until we get a code fix.”

Human error exists, but when we are managing incidents at scale, human judgement matters, and operators need to be empowered to keep our critical systems running. We’ve seen this first hand responding to company-wide outages at Amazon.

Let's look at some examples of when these methods have fallen short, requiring manual intervention or temporary remediation until a more permanent fix can happen. These are the "hidden Band-Aids" caused by the growing pains of agile applications and evolving systems.

The Hidden Band-Aids of Current Ops

Disk Space Utilization

It's a common situation: log rotation doesn’t catch the files created by a custom application, or tmp files grow out of control in an application, or some overlooked process writes to disk in an endless loop and fills your favorite partitions. Our industry is littered with examples of manual remediation documentation for when an app update or configuration change starts dropping huge log files.

The real fix for this, of course, is to identify and fix the problem in the application. But in the meantime, it may be reasonable to simply set up a cron job that checks the state of each server and flags the ones that need attention.
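As a rough illustration of that kind of stopgap, here is a minimal Python sketch of a disk check that a cron job could run. The watched paths and the 85% threshold are assumptions made up for this example, not recommendations, and a real version would alert through whatever channel your team already uses rather than printing.

```python
import shutil

# Hypothetical values: adjust the paths and threshold to your environment.
WATCHED_PATHS = ["/", "/var/log", "/tmp"]
THRESHOLD_PERCENT = 85.0


def percent_used(total: int, used: int) -> float:
    """Return used space as a percentage of total (0.0 if total is 0)."""
    return 0.0 if total == 0 else 100.0 * used / total


def over_threshold(usage_by_path: dict, threshold: float) -> list:
    """Return the paths whose percent-used value is at or above threshold."""
    return sorted(p for p, pct in usage_by_path.items() if pct >= threshold)


def collect_usage(paths) -> dict:
    """Measure each path with shutil.disk_usage, skipping paths that can't be read."""
    usage = {}
    for path in paths:
        try:
            du = shutil.disk_usage(path)
        except OSError:
            continue  # path is missing or unreadable on this host
        usage[path] = percent_used(du.total, du.used)
    return usage


if __name__ == "__main__":
    for path in over_threshold(collect_usage(WATCHED_PATHS), THRESHOLD_PERCENT):
        print(f"WARNING: {path} is at or above {THRESHOLD_PERCENT}% used")
```

Keeping the threshold comparison separate from the measurement makes the check easy to test without touching a real disk, which matters for a script that tends to outlive the incident it was written for.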

Pre-Emptive Application Restarts

There are several situations where restarting an application pre-emptively makes sense, and not all of them justify reverting your applications or setting up complex machine learning models when a simple script will do. Take annual developer vacations, for instance: the deployment sprints, and the natural restarts that used to clear out the garbage between release cycles, aren't happening. Or take a more aggressive memory leak that the servers can withstand for days at a time, which can be handled with manual intervention while the codebase is updated.

To get through these tough times, you might use more cron jobs or even just a simple Ansible loop to iterate through the servers to safely restart them while admins watch them behave.
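To make the idea concrete, here is a minimal Python sketch of such a loop. It restarts hosts one at a time and stops at the first host that doesn't come back healthy, so an admin can step in. The `ssh_restart` helper, the `myapp` service name, and the retry settings are all hypothetical; the health check is passed in because what "healthy" means depends entirely on your application.

```python
import subprocess
import time
from typing import Callable, Iterable, List


def ssh_restart(host: str, service: str = "myapp") -> None:
    """Hypothetical restart action: restart a systemd unit over SSH."""
    subprocess.run(["ssh", host, "sudo", "systemctl", "restart", service], check=True)


def rolling_restart(
    hosts: Iterable[str],
    restart: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    retries: int = 5,
    wait_seconds: float = 10.0,
) -> List[str]:
    """Restart hosts one at a time and return the hosts that recovered.

    Stops at the first host that is still unhealthy after `retries` checks,
    leaving the remaining hosts untouched.
    """
    recovered = []
    for host in hosts:
        restart(host)
        for _ in range(retries):
            if is_healthy(host):
                break
            time.sleep(wait_seconds)
        else:
            # The host never passed its health check: halt the rollout here.
            return recovered
        recovered.append(host)
    return recovered
```

Stopping at the first failure is a deliberate choice in this sketch: a restart that fails to clear the problem on one server is a signal that something else is wrong, and continuing through the fleet would only spread the damage.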

Diagnostic Verification

Some of the most gruesome incidents we’ve seen include situations where the mitigation was a simple server restart or code re-roll, but the cause remained unknown. The situation was too complex, or the symptoms too subtle, to track down the root cause. One such example, from the AWS S3 service, happened when a known playbook was run with a typo. This impacted far more servers than intended and caused a massive storage outage.

While established manual runbooks can help ferret out debug information for making assessments, or even triggers for these temporary workarounds, simple operator-written scripts to replace human involvement could save a lot of toil that otherwise leads to these situations.
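As one possible shape for such a script, the Python sketch below captures a snapshot of diagnostic output before anyone restarts the server, so the evidence survives the mitigation. The command list is a hypothetical starting point, not a prescribed set; swap in whatever your own runbooks already tell operators to run by hand.

```python
import subprocess
from typing import Callable, Dict, List

# Hypothetical command set: replace with the commands from your own runbooks.
DIAGNOSTIC_COMMANDS: Dict[str, List[str]] = {
    "uptime": ["uptime"],
    "disk": ["df", "-h"],
    "processes": ["ps", "aux"],
}


def run_command(argv: List[str], timeout: float = 10.0) -> str:
    """Run one command and return its output, or a note describing why it failed."""
    try:
        result = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"<failed: {exc}>"
    return result.stdout + result.stderr


def collect_diagnostics(
    commands: Dict[str, List[str]],
    runner: Callable[[List[str]], str] = run_command,
) -> Dict[str, str]:
    """Run every named command and return its output keyed by name."""
    return {name: runner(argv) for name, argv in commands.items()}


if __name__ == "__main__":
    for name, output in collect_diagnostics(DIAGNOSTIC_COMMANDS).items():
        print(f"=== {name} ===\n{output}")
```

Because each command is a fixed argument list rather than something typed at a prompt, there is less room for the kind of typo that turns a routine procedure into an outage.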

Core dependency integration for shared code bases

One of the more challenging situations is an architecture where many microservices rely on a single code base for some function, such as configuration management or other fabric-related functions. When that single code base releases with a bug, it's destined to cause those dependent services to exhaust their worker processes until they stop responding. We’ve seen a nightmare scenario that spawned from such a situation but didn’t trigger standard metrics like CPU or memory. Until those services became entirely unresponsive from hung workers, the application looked fully healthy.

During this period, we had to manually update those configurations as part of a health check to prevent the worker process exhaustion. We had to reapply the updates every time those services restarted because they only lived in memory!
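A health check for this failure mode has to look at the workers themselves rather than at CPU or memory. The Python sketch below is one hedged way to express that: it flags a service when the share of busy workers stays high across several consecutive samples. The `WorkerSample` type, the 90% threshold, and the three-sample window are assumptions for illustration, and how you obtain the busy and total counts depends on your application server.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class WorkerSample:
    """One observation of a service's worker pool (hypothetical shape)."""
    busy: int
    total: int


def saturation(sample: WorkerSample) -> float:
    """Return the fraction of workers that are busy.

    A pool with no workers has no capacity, so it is treated as fully saturated.
    """
    if sample.total == 0:
        return 1.0
    return sample.busy / sample.total


def is_exhausting(
    samples: Sequence[WorkerSample],
    threshold: float = 0.9,
    consecutive: int = 3,
) -> bool:
    """Return True if the last `consecutive` samples all meet or exceed threshold."""
    if len(samples) < consecutive:
        return False
    return all(saturation(s) >= threshold for s in samples[-consecutive:])
```

Requiring several consecutive samples keeps a brief traffic spike from triggering a remediation, while still catching the slow climb toward hung workers well before the service stops responding.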

A more public example, with further-reaching impact, happened when Heroku’s use of public apt packages for their systems led to read-only file systems for their customers. They opted to roll back, but had to verify and fix any system coming into service that potentially had this dependency until the upstream package was fixed.

So what do all these failures and Band-Aids have in common? They are all edge cases too subtle for pre-emptive and automatic resolution. And they happen infrequently enough that manual labor is expected to bridge the gap between the infrastructure automation and development cycles. Without this manual intervention, they can all cause catastrophic failure, no matter how intricate the automation you currently rely on.

You could probably create cron jobs and Ansible scripts to deal with all of this, but that creates a problem where you don’t really have clean infrastructure. The environment stops looking like cattle. Boxes begin to drift because some have patches and some don’t. Installs don’t include these scripts, and this complexity compounds.

We need a solution that can fix this once and for all.

Conclusion

At Shoreline, we believe there’s an opportunity to provide tooling for operators that leverages and scales their prior work resolving outages. We believe you should be able to easily debug problems across your fleet and create custom, automated remediations that alleviate the need for these Band-Aids, prevent recurring outages, reduce ticket count, and increase availability by an order of magnitude.

Shoreline is accepting beta customers right now. If you’re an operator and believe, like we do, that there should be a way to reduce redundant operational work while increasing overall availability, fill out this form to request a demo. Looking forward to speaking with you soon.