I talk a lot about bringing automation to production ops because it's the next problem to solve.
Let me explain.
- Almost everyone is moving to the cloud because it brings agility, allowing faster development and innovation.
- We have tools to automate other parts of the software development lifecycle, whether it’s running tests, building artifacts, configuration management, or deployments.
But sadly, managing the environment once it's in production is still an almost entirely manual job.
A ton of tools help you observe your environment, and maybe half a ton help you route alerts and deduplicate them.
But there's hardly anything out there that actually fixes your environment.
That's unfortunate, and I pity the poor SREs who have to keep up with this vastly faster pace of innovation and a much faster software development lifecycle.
They're getting code faster. It's more complex and even multi-cloud. It's got both Kubernetes and VMs alongside a bunch of microservices.
And they're responsible for plugging the holes in the dike wherever they happen, and the dikes are only getting bigger.
The only way to keep up with this challenging job is by ensuring that:
- the things that you do repetitively get automated away.
- the things that get escalated are turned into documented processes so that your first-line responders can handle them without escalating.
- things that happen for the first time can be debugged fleet-wide in parallel.
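To make the last point concrete, here is a minimal sketch of running one diagnostic across a whole fleet in parallel. Everything in it is an assumption for illustration: the host names, the `FAKE_DISK_USAGE` table, and the `disk_usage_percent` check are hypothetical stand-ins for whatever real command or API call you would run against each machine.

```python
from concurrent.futures import ThreadPoolExecutor


def run_fleet_wide(hosts, check, max_workers=8):
    """Run check(host) on every host in parallel and collect per-host outcomes."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {host: pool.submit(check, host) for host in hosts}
        for host, future in futures.items():
            try:
                results[host] = ("ok", future.result())
            except Exception as exc:
                # One broken host should not hide the answers from the others.
                results[host] = ("error", str(exc))
    return results


# Hypothetical data standing in for a real diagnostic (SSH, agent, cloud API).
FAKE_DISK_USAGE = {"web-1": 42, "web-2": 97, "db-1": 63}


def disk_usage_percent(host):
    """Hypothetical check: look up disk usage for a host."""
    return FAKE_DISK_USAGE[host]


def hosts_over_threshold(results, threshold):
    """Return the hosts whose check succeeded and exceeded the threshold."""
    return sorted(
        host
        for host, (status, value) in results.items()
        if status == "ok" and value > threshold
    )


if __name__ == "__main__":
    fleet = ["web-1", "web-2", "db-1"]
    outcome = run_fleet_wide(fleet, disk_usage_percent)
    print(hosts_over_threshold(outcome, 90))
```

The same shape works for the first two points as well: once a check like this reliably finds the problem, the fix can be attached to it and run without a human in the loop.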
Personally, I never got excited when there was one more dashboard to look at or slightly better routing of an incident to somebody.
But I do get excited when some incident gets automated away forever because that reduces my labor, helping me keep up with an ever-growing, complex environment.
That's the reason we need automation in production ops today.