Infrastructure as Code for Production Ops

DevOps thrives in the face of the unexpected. People working in the industry must be able to respond to unplanned incidents with expertise, quality, and efficiency. This post details how DevOps leaders can apply infrastructure as code lessons and tooling to production ops, use solutions like Terraform + Shoreline to automate repeatable tasks, and make hero-level institutional knowledge accessible to anyone.

I’ve been a hero in ops. You know the type: jumps on tickets in minutes and spends hours on a system-critical fix for a specific customer. During my time in the role, I handled hundreds of tickets and fixed some of the biggest IT challenges ever faced in the cloud.

I’ll be honest, it was an adrenaline rush, especially when time was limited and success depended on one small, precise action to quite literally save the day (think: selecting the right wire to cut in an action movie).

But there was a point when the rush became a slog. All of that repeated work was really toil. It didn't accumulate or scale out. Critical issues turned into constant problems because manual ops isn’t like software development. If you fix something once, you’ll definitely need to do it again (and again)—likely the next day.

Where we are today

These days, a fair number of teams have been able to limit the need for heroism and replace infrastructure tickets with infrastructure as code.

The next phase is infrastructure as code for production ops, or moving the practices of operations—the runbooks, playbooks, scripts, and other knowhow—out of wikis or operators’ heads and into code (just like we did with infrastructure and software before it). This is the key building block for moving toward an automated future where people with any level of expertise will be able to orchestrate operations flawlessly.

In simple terms, infrastructure as code for production ops offers a way to make actions repeatable and automate work. In practice, it involves managing ops with the same kind of tooling (alarms, remediations, checks, tests, verifications) that we already use to manage the code underlying software. The key benefit here is that, when applied correctly, an ops team member just needs to implement a code change and the infrastructure as code tool or architecture does the rest.
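To make that concrete, here is a minimal sketch of what a remediation can look like once it lives in code instead of a wiki page. It is plain Python, not any particular product's API: the Remediation class, the run function, and the disk-cleanup example (including the 90% threshold and the dictionary standing in for a host) are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical stand-in for the state of a real host.
Host = Dict[str, float]


@dataclass(frozen=True)
class Remediation:
    """A runbook step expressed as code: detect, fix, then confirm."""
    name: str
    alarm: Callable[[Host], bool]    # returns True when the host needs attention
    action: Callable[[Host], None]   # the fix itself
    verify: Callable[[Host], bool]   # returns True when the fix worked


def run(remediation: Remediation, host: Host) -> str:
    """Apply a remediation to one host and report what happened."""
    if not remediation.alarm(host):
        return "ok"
    remediation.action(host)
    # If the fix did not clear the condition, hand off to a human.
    return "remediated" if remediation.verify(host) else "escalate"


# Hypothetical example: reclaim temp space when a disk is over 90% full.
def disk_full(host: Host) -> bool:
    return host["disk_used_pct"] > 90.0


def clear_tmp(host: Host) -> None:
    host["disk_used_pct"] -= host.get("tmp_pct", 0.0)


def disk_healthy(host: Host) -> bool:
    return host["disk_used_pct"] <= 90.0


DISK_CLEANUP = Remediation("disk_cleanup", disk_full, clear_tmp, disk_healthy)
```

Because the alarm, the fix, and the verification are ordinary functions in a file, they can be reviewed, tested, and versioned like any other change, which is the whole point of treating ops as code.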

Isn’t the answer Kubernetes?

If you’re an SRE, you might be thinking: wait a minute, the answer here is Kubernetes.

It's true that Kubernetes handles the scheduling of pods on our nodes (i.e., it solves the problem of maximizing resource utilization). But that's not where the work of ops teams ends. For example, we still set up monitoring tools (e.g., Prometheus) and still keep runbooks / playbooks on Kubernetes. Sometimes we manually restart pods, adjust disk size, and handle poor application performance.

It’s important to remember that ops is fundamentally about the unexpected: we know that we need people to handle incidents that arise but aren't planned. The truth is that no deployment tool solves this problem, because deployment tools are built for the regular (planned) rollout of software.

A closer look at infrastructure as code for production ops

We have the automation, architecture, and expertise to apply it to DevOps/SRE work at scale. But the lessons from infrastructure as code are getting dropped as soon as projects enter operations and production phases.

In practice, we need to extend infrastructure as code to ops by employing specific tools, like a Terraform provider, to deliver version control, code review, testing, and deployment all in one place throughout remediations—the runbooks of ops or the steps needed to get things fixed.

We need to consider doing this across the board, but especially in modern environments where folks are using things like containers (constantly coming and going) and autoscaling virtual machines (constantly scaling up or down dynamically). It also matters significantly in data pipeline, storage, and processing environments, all of which tend to be stateful, which is where operational pain is most acute.

A 'collective memory' for DevOps & SRE

Another pain point in the path of efficiency and customer satisfaction is that individual ops team members can’t handle incidents consistently without collective memory. Put another way, different folks come up with their own ways of solving things, and do so with varying degrees of efficacy and safety. This can create issues for managers and team leads, including:

Stress about who might be oncall because there is such a difference in quality between manual work conducted by senior and junior folks.
A higher risk of people doing the wrong thing because they are looking at outdated runbooks / playbooks.
Failure to get as many people involved in operations as possible despite a company effort to decentralize DevOps (where engineers own their code and operations)—the all-hands-on-deck approach simply can’t work if you can't propagate information.

But it doesn’t have to be like that.

How to achieve ops team success in fast-paced environments

Ops team success requires achieving availability of applications in the face of rapidly scaling environments and infrastructure. Every machine, piece of software, and infrastructure asset has a chance of failure. When you have a small number of these things, that chance is minuscule most of the time. But, as you scale out, that small chance starts multiplying. Eventually, at scale, you are having unplanned incidents continuously. This is where automation is so key.
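A quick back-of-the-envelope calculation shows why. The sketch below assumes every component fails independently with the same small probability in a given period, which is a simplification (real failures are often correlated), and the 0.1% figure is an illustrative number rather than a measurement.

```python
def p_any_failure(p_single: float, n: int) -> float:
    """Probability that at least one of n independent components fails."""
    return 1.0 - (1.0 - p_single) ** n


def expected_failures(p_single: float, n: int) -> float:
    """Expected number of failing components among n."""
    return p_single * n


if __name__ == "__main__":
    # Illustrative assumption: each component has a 0.1% chance of failing per day.
    for n in (10, 1_000, 100_000):
        print(n, p_any_failure(0.001, n), expected_failures(0.001, n))
```

With ten components, a failure on any given day is rare. With a thousand, it is more likely than not, and with a hundred thousand you should expect many failures every day.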

Not to mention that DevOps/SRE mistakes directly impact specific SLAs and overall customer experiences, especially in cases where fleets grow much faster than ops team headcount. Applying infrastructure as code for production ops can help convert repetitive, time-intensive tasks into work that’s completed once and applied from the initial fix onward through automation. This, in turn, can relieve headcount pressure and reduce mistakes or disruptions that reflect poorly on ops and, frankly, on an entire organization.

Potential hurdles of infrastructure as code for production ops

One of the primary challenges of applying infrastructure as code for production ops to modern IT environments is that automation requires significant time investment along with the initial financial investment. There’s a high resource barrier to entry for a lot of companies that would otherwise subscribe to infrastructure as code for production ops. Shoreline’s product addresses this head-on by reducing the time it takes to automate tasks which, in turn, allows customers to produce more while paying less.

Another issue at the core of ops: incidents are unplanned and the chance they occur multiplies along with the number of machines, software solutions, etc. added to the mix as an IT operation scales. For instance, most mid-sized companies have not one Kubernetes cluster, but dozens. Most have multiple servers. So, the question becomes: how do you scale ops to manage all of that?

Making sense of ‘collective memory’ for production ops

DevOps/SRE staff build up a lot of specific knowledge of their craft. For example, they memorize the specific sequence to use when restarting a system. It’s common nowadays for this knowledge to live in runbooks stored in wikis. However, these are not living documents; they often become stale and inaccurate. Additionally, knowledge trapped in the heads of DevOps/SRE staff is lost to the rest of the organization.

With Shoreline, operators can store the resources, metrics, and actions they know of as symbols in a "collective memory." These symbols can then be shared between operators and updated as needed. Since the definitions are shared, rather than trapped in folks’ heads, everyone is on the same page and nothing gets lost. For example, a junior engineer can describe a symbol to gain the insight a senior engineer has added to Shoreline. Support personnel can run actions to help customers, with the confidence that they are doing them correctly and in line with current best practices at the company.

One step further: how composability enables efficiency

Furthermore, these symbols are composable. Shoreline's Op language and system allow for the combination of resources, metrics, and actions to ensure that every symbol strengthens the others. In other words, the act of someone adding a new symbol increases the collective group of operators’ ability to combine workflows and solve problems.
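To illustrate the idea of composable symbols, here is a small sketch in plain Python. It is not Shoreline's Op language or API; the Symbol and CollectiveMemory classes, the apply_where helper, and the sample fleet are all hypothetical, invented only to show how named resources, metrics, and actions can be combined once they are defined in one shared place.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass(frozen=True)
class Symbol:
    """A named, documented piece of operational knowledge."""
    name: str
    kind: str                 # "resource", "metric", or "action"
    description: str
    fn: Callable[..., Any]


class CollectiveMemory:
    """Hypothetical shared registry of symbols for a whole team."""

    def __init__(self) -> None:
        self._symbols: Dict[str, Symbol] = {}

    def define(self, symbol: Symbol) -> None:
        self._symbols[symbol.name] = symbol

    def describe(self, name: str) -> str:
        s = self._symbols[name]
        return f"{s.kind} {s.name}: {s.description}"

    def call(self, name: str, *args: Any) -> Any:
        return self._symbols[name].fn(*args)


def apply_where(memory: CollectiveMemory, resource: str, metric: str,
                threshold: float, action: str) -> List[Any]:
    """Run an action on every resource whose metric exceeds a threshold."""
    return [
        memory.call(action, item)
        for item in memory.call(resource)
        if memory.call(metric, item) > threshold
    ]


# Illustrative data and symbols; in practice different people would add these.
FLEET = [{"name": "web-1", "cpu": 97.0}, {"name": "web-2", "cpu": 35.0}]

memory = CollectiveMemory()
memory.define(Symbol("hosts", "resource", "all hosts in the fleet", lambda: FLEET))
memory.define(Symbol("cpu", "metric", "CPU usage in percent", lambda h: h["cpu"]))
memory.define(Symbol("restart", "action", "restart the app on a host",
                     lambda h: f"restarted {h['name']}"))
```

With those three definitions in place, `apply_where(memory, "hosts", "cpu", 90.0, "restart")` combines them into a workflow that none of the individual authors had to write, and `memory.describe("cpu")` lets a newcomer look up what a symbol means.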

DevOps/SRE staff are incentivized to use the system because the symbols are built into the tool they work in, rather than sitting on a wiki that no one knows how to access. They use the Shoreline CLI to actually solve the issues.

Looking back to go forward

DevOps is a critical, customer-facing function. By applying infrastructure as code lessons to production ops, managers and team leads can employ specific tools, like Shoreline's Terraform provider, to automate tasks or remediations, and tap into the knowledge bank provided by Shoreline CLI to ensure quality incident response every time.

I remember writing code before it was consistently applied. In those days, if someone threw out the code or messed with it in some way, we were like: whelp, it’s time to start all over.

That may sound ridiculous today, but it was the state of the software world at one point. It was the state of infrastructure until infrastructure as code came along.

Now, we’re kind of at a similar point in ops. An ops person who completes a remediation one day won’t solve that issue from now on—on the contrary, the work required to address it manually is bound to be repeated over and over for as long as that person is in the job.

The technology and underlying coding principles exist to eliminate repetitive work and automate context-specific remediations. At Shoreline, we want to go even further to make ops, including oncall, accessible to the entire team. We believe embracing infrastructure as code for production ops will get us there.

To see how Shoreline partners with Terraform to extend infrastructure as code to production ops, request a demo.