Let’s talk about the value of automation for production operations.
Most on-call issues are commonplace, which means they happen again and again.
So if you’re trying to fix them manually, you run into the following problems:
- People are less efficient.
It can take them an hour to register that something happened, find the right runbook, and make the fix, which wastes their time and causes unavailability.
- People make mistakes.
They make even more mistakes when the issue is commonplace, because they don't have their heads in the game.
- People leave.
You might put in a lot of resources to train your people, but when they leave, they take that expertise with them.
That’s why it’s important to automate the fixes for these issues in software: it’s a one-time investment, it doesn’t make mistakes (unless there’s a bug), and it stays with you forever.
Here are the two main reasons why people don’t automate their commonplace incidents:
1. It takes a long time.
When I was at AWS, each automation would take about a month to build.
So we’d go through a cost-benefit analysis to decide whether to focus on that or on other dev tasks.
But if it takes just a couple of hours to build (like how we do it at Shoreline), the cost is low enough that the analysis stops mattering.
You just build the automation, because it takes about the same amount of time as fixing the issue once by hand.
2. It may run amok.
People know how to build the solution for an individual box, but they often make mistakes when scaling it across the fleet.
At Shoreline, we're distributed systems people. We use mechanisms like circuit breakers and leases to keep automations safe and fast.
That’s how we help you build automations that enable you to:
- be less dependent on expensive, high-churn labor
- improve your availability to the customers
- sleep stress-free at night
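
To make the "run amok" point concrete, here is a minimal sketch of a circuit breaker around a fleet-wide fix. It is not how Shoreline implements it; the names `run_with_circuit_breaker`, `FleetResult`, and `max_failures` are made up for illustration, and `remediate` stands in for whatever fix you would run on one box. The idea is simply that the automation stops touching new hosts once too many fixes have failed.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class FleetResult:
    """Outcome of one fleet-wide run (hypothetical structure)."""
    succeeded: List[str] = field(default_factory=list)
    failed: List[str] = field(default_factory=list)
    skipped: List[str] = field(default_factory=list)
    tripped: bool = False


def run_with_circuit_breaker(
    hosts: Iterable[str],
    remediate: Callable[[str], bool],
    max_failures: int = 2,
) -> FleetResult:
    """Apply `remediate` host by host; stop once `max_failures` hosts have failed."""
    result = FleetResult()
    for host in hosts:
        if result.tripped:
            # Breaker is open: leave the remaining hosts untouched.
            result.skipped.append(host)
            continue
        try:
            ok = remediate(host)
        except Exception:
            # An exception from the fix counts as a failure on that host.
            ok = False
        if ok:
            result.succeeded.append(host)
        else:
            result.failed.append(host)
            if len(result.failed) >= max_failures:
                result.tripped = True
    return result


def fake_remediate(host: str) -> bool:
    """Stand-in fix for the demo: fails on hosts whose name starts with 'bad'."""
    return not host.startswith("bad")


if __name__ == "__main__":
    fleet = ["web-1", "bad-2", "bad-3", "web-4", "web-5"]
    print(run_with_circuit_breaker(fleet, fake_remediate, max_failures=2))
```

In this demo the breaker trips after the second failure, so `web-4` and `web-5` are skipped rather than risked. A real system would also need something like a lease so that two copies of the automation don't act on the same host at once, which this sketch leaves out.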