Anurag Gupta on Day 2 operations, DevOps, and automated remediation

Overview

Anurag’s teams at AWS would ticket per database instance, not fleet-wide, and automatically remediated tickets to keep up with scale.

Instance-by-instance ticketing is pretty challenging if you're growing 50% or 100% year over year. Now, the solution for us each week was to do a Pareto analysis of the prior week's tickets and find one, at least one, that we'd extinguish forever.
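As a rough illustration of the weekly Pareto analysis described here, the sketch below counts tickets by root cause and ranks them with a running cumulative share. The cause names and the ticket data are invented for the example; the podcast does not describe how the analysis was actually implemented.

```python
from collections import Counter


def pareto(ticket_causes):
    """Return (cause, count, cumulative_share) rows, most frequent cause first."""
    counts = Counter(ticket_causes)
    total = sum(counts.values())
    rows = []
    running = 0
    for cause, count in counts.most_common():
        running += count
        rows.append((cause, count, running / total))
    return rows


# Hypothetical root causes for one week of tickets (illustrative data only).
last_week = ["disk_full"] * 5 + ["replica_lag"] * 3 + ["cert_expiry"] * 2

for cause, count, cumulative in pareto(last_week):
    print(f"{cause:12} {count:3d} {cumulative:6.0%}")
```

The top row is the natural candidate for the one ticket class to eliminate that week.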

The opportunity to build out GitOps patterns to automate Day 2 operations.

I'm a big fan of GitOps. It’s the right mindset to automate everything and remove manual labor, both from a scale perspective and from an error-creation perspective. And for Day 2, it just makes sense to me that the artifacts one uses to monitor, alarm, and repair issues should go through the same review process, the same pipeline process, the same version control and deployment as everything else.

Now, there just aren't good patterns for GitOps for Day 2 ops right now. It's still kind of ad hoc. There are a bunch of tools, but they're very isolated from the software development environment.
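One way to picture Day 2 artifacts going through a pipeline is a check that runs on every proposed change to an alarm definition, the same way unit tests run on application code. The sketch below is hypothetical: the definition format, the key names, and validate_alarm are assumptions made for illustration, not an existing tool or anything described in the podcast.

```python
# Hypothetical schema: every alarm definition kept in version control
# is assumed to carry these keys.
REQUIRED_KEYS = {"name", "metric", "threshold", "action"}


def validate_alarm(definition):
    """Return a list of problems; an empty list means the definition passes."""
    missing = sorted(REQUIRED_KEYS - set(definition))
    problems = [f"missing key: {key}" for key in missing]
    threshold = definition.get("threshold")
    if threshold is not None and not isinstance(threshold, (int, float)):
        problems.append("threshold must be a number")
    return problems


# A definition as it might look after being parsed from a reviewed file.
alarm = {
    "name": "disk_full",
    "metric": "disk_used_pct",
    "threshold": 90,
    "action": "rotate_logs",
}
print(validate_alarm(alarm))
```

A pipeline step could fail the change whenever the returned list is non-empty, so a broken alarm never reaches production.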

The relationship between manual vs automated remediation and automated deployment tools.

Most of the time, I think remediations can be automated because everybody's got wikis and runbooks that say what to do. And for at least the issues that are commonplace, you're much better off automatically doing something, just like you're much better off using a deployment tool rather than having humans sit there and FTP data onto the boxes.
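The idea of turning a wiki runbook into automation can be sketched as a lookup from alarm name to remediation function, with a fallback to a human ticket for anything not yet covered. All names here (RUNBOOK, remediate, the alarm names, and the placeholder actions) are hypothetical, and the actions only return a description rather than touching a real host.

```python
def restart_service(host):
    # Placeholder action: a real one would call the service manager.
    return f"restarted service on {host}"


def rotate_logs(host):
    # Placeholder action: a real one would rotate and compress log files.
    return f"rotated logs on {host}"


# The runbook expressed as data: commonplace alarms map to automated actions.
RUNBOOK = {
    "service_down": restart_service,
    "disk_full": rotate_logs,
}


def remediate(alarm, host):
    """Run the automated action for a known alarm, else hand off to a human."""
    action = RUNBOOK.get(alarm)
    if action is None:
        return f"ticket opened for {alarm} on {host}"
    return action(host)


print(remediate("disk_full", "db-1"))
print(remediate("replica_lag", "db-1"))
```

Each alarm class eliminated by the weekly review becomes one more entry in the table, and the fallback path shrinks over time.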

And finally, the foundation for automated remediation.

You want to make it easy to define metrics, alarms, and actions using the Linux tools, CLIs, scripting that you already know. But what you want from ops orchestration, just as we get with Kubernetes, is the ability to deal with distributing that content in a consistent way, running it everywhere. And you would need clean models that deal with failure, limit blast radius of changes, and run locally, because even your operations endpoints can fail. And so if you think about Kubernetes, it does this for restarts, but restarts don't fix everything.
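Limiting the blast radius of a change can be illustrated with a staged rollout that applies an action to a few hosts at a time and halts when too many fail. This is a minimal sketch under assumed names (staged_rollout, batch_size, max_failures, and the flaky_apply stand-in); it is not how any particular orchestration product works.

```python
def staged_rollout(hosts, apply, batch_size=2, max_failures=1):
    """Apply a change batch by batch.

    The failure count is checked at the end of each batch; once it
    exceeds max_failures, the remaining hosts are left untouched.
    """
    done, failed = [], []
    for start in range(0, len(hosts), batch_size):
        for host in hosts[start:start + batch_size]:
            try:
                apply(host)
                done.append(host)
            except Exception:
                failed.append(host)
        if len(failed) > max_failures:
            break
    return done, failed


def flaky_apply(host):
    # Stand-in for a real action; fails on two hosts to trigger the halt.
    if host in ("h2", "h3"):
        raise RuntimeError(f"apply failed on {host}")


fleet = ["h1", "h2", "h3", "h4", "h5", "h6"]
done, failed = staged_rollout(fleet, flaky_apply)
print("done:", done, "failed:", failed)
```

In this run the rollout stops after the second batch, so the last two hosts never receive the faulty change.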

You can listen to the podcast and read the transcript at InfoQ.com.