How can companies increase reliability without hiring an army of engineers? You'll never be able to hire people at the same rate that your fleets grow. Here's why:
- It's hard to hire SREs right now. There are about as many job openings on LinkedIn for SREs as for developers, even though there are 40 times more developers.
- SRE churn is high. They leave quickly, within 18 months on average, so keeping them and replacing them is a big challenge.
The only real solution to this is to be able to bend the curve: to manage large-scale infrastructure with fewer people. I started Shoreline to solve this exact problem. Here are 3 things we do:
1. We make it easy to automate issues away. By doing so, we reduce the toil due to mundane commonplace issues. We believe that people shouldn't wake up in the middle of the night to do things the machine can do.
2. We make it possible to safely expand the group of people who can fix things without escalation. So you can bring in your support and dev teams to take care of many things previously only handled by SRE experts. You'll still need experts, but not as many, because they won't be on every single issue. We do that by delivering Jupyter-like notebooks that populate with diagnostic information as soon as an incident occurs and provide the recipe to fix things. Unlike static wikis that become stale, these notebooks are executable, so people are motivated to keep them up to date.
3. We make debugging across the fleet take about as long as debugging an individual node. We do this by enabling parallel distributed execution. So even if there are 100 or 1,000 nodes, you can ask questions like:
- Are any unexpected processes running on my nodes?
- Are the configurations what I expect, or have some of them drifted away?
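
To make the fleet-wide questions above concrete, here is a minimal Python sketch of the general fan-out idea: ask every node the same question concurrently and keep only the outliers. This is not Shoreline's implementation. The `FLEET` dictionary, the `config_hash` helper, and the node names are hypothetical stand-ins for whatever remote call (SSH, an agent, an API) you would really use.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

# Hypothetical fleet: node name -> config file contents.
# In practice this lookup would be a remote call to each node.
FLEET = {f"node-{i:03d}": "max_conns=100\n" for i in range(100)}
FLEET["node-042"] = "max_conns=500\n"  # simulate one drifted node

EXPECTED_HASH = hashlib.sha256(b"max_conns=100\n").hexdigest()


def config_hash(node: str) -> str:
    """Stand-in for hashing a config file on one node (assumed helper)."""
    return hashlib.sha256(FLEET[node].encode()).hexdigest()


def find_drifted(nodes, expected_hash, max_workers=32):
    """Query all nodes concurrently; return those whose hash differs."""
    nodes = list(nodes)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input nodes.
        hashes = pool.map(config_hash, nodes)
        return [n for n, h in zip(nodes, hashes) if h != expected_hash]


print(find_drifted(FLEET, EXPECTED_HASH))  # prints ['node-042']
```

Because the per-node checks run concurrently, the wall-clock time depends mostly on the slowest node and the size of the worker pool, not on the sum of all per-node times.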
At the core, this is all about:
- making people more productive by automating a big part of the work,
- spreading the load across more people, and
- debugging in constant time, not in time proportional to your fleet size.