“Automation is risky. Errors in the remediation code could worsen an outage.”
While that’s true, we also know that human error causes almost 5x more incidents than automation.
That’s because you can fix code, but you can't fix people.
They come and go. Some have experience, some don't. And whoever happens to be on call is whoever happens to be on call.
People make mistakes. That's why, when you're writing code, you don't just ship it.
It goes through testing, scripts, deployment, and all the other processes first.
But you don't get that opportunity when you're fixing something on call, because you're dealing with it in the moment, under pressure.
That’s why the best way to reduce risk in production ops is to automate more and leave less for people to do by hand.
Further, you can make automation less risky by using tools with circuit breakers that limit the number of times the automation runs, and that can deal with partial failures.
Basically, the tool must be able to handle the complexities of distributed systems,
so that you can focus on automating just the individual issue that happens on an individual box.
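
To make the circuit-breaker idea concrete, here is a minimal sketch in Python. It is not taken from any particular tool: the names `RemediationBreaker`, `remediate`, and `restart_service`, the host names, and the default limits are all hypothetical and chosen only for illustration. The sketch assumes that every attempt on a host counts as one run, and that a failure on one host should be recorded rather than allowed to stop the whole remediation.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List


@dataclass
class RemediationBreaker:
    """Hypothetical breaker: allows at most max_runs runs per time window."""
    max_runs: int = 3
    window_seconds: float = 600.0
    clock: Callable[[], float] = time.monotonic
    _runs: List[float] = field(default_factory=list, init=False)

    def allow(self) -> bool:
        now = self.clock()
        # Forget runs that have aged out of the window.
        self._runs = [t for t in self._runs if now - t < self.window_seconds]
        if len(self._runs) >= self.max_runs:
            return False  # breaker is open: stop and escalate to a human
        self._runs.append(now)
        return True


def remediate(
    hosts: Iterable[str],
    fix: Callable[[str], None],
    breaker: RemediationBreaker,
) -> Dict[str, str]:
    """Apply fix to each host and record a per-host outcome."""
    results: Dict[str, str] = {}
    for host in hosts:
        if not breaker.allow():
            results[host] = "skipped"
            continue
        try:
            fix(host)
            results[host] = "fixed"
        except Exception:
            # Partial failure: record it and carry on with the other hosts.
            results[host] = "failed"
    return results


def restart_service(host: str) -> None:
    """Stand-in for a real fix; fails on one host to show partial failure."""
    if host == "web-2":
        raise RuntimeError("restart timed out")


if __name__ == "__main__":
    breaker = RemediationBreaker(max_runs=3, window_seconds=600)
    hosts = ["web-1", "web-2", "web-3", "web-4"]
    print(remediate(hosts, restart_service, breaker))
```

With these assumed settings, `web-1` and `web-3` are reported as fixed, `web-2` as failed, and `web-4` as skipped, because the breaker has already used its three runs for the window. The fix itself only needs to know about one box; the limits and the failure bookkeeping live in the wrapper around it.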