The best way to avoid downtime from an incident is to fix it before it happens.
That requires predictive maintenance. Here are 2 approaches that work for me:
1. Monitoring brownout
Things usually brown out before they black out, so that's when you want to see what's going on.
At AWS, my engineers used to count errors on all our devices and then:
- correlate the number in a given period with a subsequent device failure, or
- see when device latency started to exceed some normal range.
Then we’d proactively shift away from those bad resources during a maintenance window rather than having one fail while it's under load.
We now use this process at Shoreline.
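To make the two checks above concrete, here is a minimal sketch in Python. It is not the tooling we used at AWS or Shoreline. The function names, the error threshold of 5, and the "mean plus 3 standard deviations" definition of the normal latency range are all assumptions picked for illustration.

```python
from statistics import mean, stdev


def is_browning_out(baseline_latencies_ms, recent_latencies_ms,
                    recent_error_count, max_errors=5, sigmas=3.0):
    """Return True if a device looks degraded (hypothetical thresholds).

    baseline_latencies_ms: latencies from a period when the device was healthy
    recent_latencies_ms:   latencies from the most recent period
    recent_error_count:    device errors counted in the most recent period
    """
    # Assumption: the "normal range" tops out at mean + 3 standard deviations.
    upper_bound = mean(baseline_latencies_ms) + sigmas * stdev(baseline_latencies_ms)
    latency_too_high = mean(recent_latencies_ms) > upper_bound
    too_many_errors = recent_error_count > max_errors
    return latency_too_high or too_many_errors


def devices_to_drain(fleet):
    """Pick the devices to shift away from in the next maintenance window.

    fleet maps a device id to a (baseline, recent, error_count) tuple.
    """
    return sorted(
        device_id
        for device_id, (baseline, recent, errors) in fleet.items()
        if is_browning_out(baseline, recent, errors)
    )


healthy_baseline = [10, 11, 9, 10, 12, 10, 11, 9]
fleet = {
    "disk-a": (healthy_baseline, [10, 11, 10], 0),  # looks normal
    "disk-b": (healthy_baseline, [18, 20, 19], 1),  # latency has crept up
    "disk-c": (healthy_baseline, [10, 10, 11], 9),  # too many errors
}
print(devices_to_drain(fleet))
```

In practice you would tune the thresholds by looking at how error counts and latency behaved before past failures, which is the correlation step described above.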
2. Performing control actions
Here, we adapt the feedback control theory from industrial systems.
The basic concept is that you have the desired state, the observed state, and the error between the two.
Your goal is to create a control action loop that reduces the gap between the observed and desired states.
The more frequently you sample, the smaller the control action needs to be.
Let’s understand this with an example: When I started driving, I’d swerve with big turns of the steering wheel.
But now, I don’t need to do that as I keep making little adjustments without even paying much attention.
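The same idea can be sketched in a few lines of Python. This is a toy simulation, not Shoreline's implementation: the function name `largest_correction`, the constant drift, and the gain of 1.0 are assumptions chosen to keep the arithmetic easy to follow.

```python
def largest_correction(desired, drift_per_tick, sample_every, ticks=60, gain=1.0):
    """Simulate a simple proportional control loop.

    Returns the biggest single control action the loop had to take.
    All parameters are hypothetical and exist only for this illustration.
    """
    observed = desired
    largest = 0.0
    for tick in range(1, ticks + 1):
        # The system wanders away from the desired state on its own.
        observed += drift_per_tick
        if tick % sample_every == 0:
            error = desired - observed   # gap between desired and observed
            action = gain * error        # control action proportional to the error
            observed += action
            largest = max(largest, abs(action))
    return largest


# Sampling every tick: lots of tiny nudges, like an experienced driver.
print(largest_correction(desired=100.0, drift_per_tick=0.5, sample_every=1))

# Sampling every 10 ticks: fewer but much bigger corrections, like a new driver.
print(largest_correction(desired=100.0, drift_per_tick=0.5, sample_every=10))
```

With the same drift, the loop that samples ten times less often has to make corrections ten times larger, because the error has had longer to build up between samples.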
At Shoreline, we keep running these control loops.
Every second, we scrape thousands of metrics, compare them against thousands of alarm conditions, and take little control actions to make things a little bit better.
This helps keep the systems our customers manage using Shoreline running smoothly.
That’s how we reduce downtime by fixing incidents before they happen.