
How to Fix an Incident Before It Happens

Fixing incidents early requires predictive maintenance, including monitoring brownouts and performing control actions
Summary

The best way to avoid downtime from an incident is to fix it before it happens.

That requires predictive maintenance. Here are 2 approaches that work for me:

1. Monitoring brownout

Things usually brown out before they black out, so that's when you want to see what's going on.

At AWS, my engineers used to count device errors on all our devices and then:
- correlate the number in a given period with a subsequent device failure, or
- see when device latency started to exceed some normal range.

Then we'd proactively shift away from those bad resources during a maintenance window, rather than having one fail while under load.
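As a rough illustration of the two signals described above, here is a minimal sketch in Python. Everything in it is an assumption made for the example: the function name `find_brownouts`, the data shapes, the fixed error limit, and the choice of "mean plus three standard deviations" as the edge of the normal latency range. It is not the process used at AWS or Shoreline.

```python
from statistics import mean, stdev

def find_brownouts(samples, baseline, error_limit=5, sigma=3.0):
    """Return device ids whose latest period suggests a brownout.

    samples:  {device_id: {"errors": int, "latency_ms": float}} for the latest period
    baseline: {device_id: [latency_ms, ...]} of historical healthy latencies
    (Both shapes are hypothetical, chosen only for this sketch.)
    """
    flagged = []
    for device_id, sample in samples.items():
        history = baseline[device_id]
        # Assumed definition of the "normal range": mean plus `sigma` standard deviations.
        upper = mean(history) + sigma * stdev(history)
        if sample["errors"] > error_limit or sample["latency_ms"] > upper:
            flagged.append(device_id)
    return flagged

healthy = [10.0, 11.0, 9.0, 10.0, 10.0]
baseline = {"disk-a": healthy, "disk-b": healthy, "disk-c": healthy}
samples = {
    "disk-a": {"errors": 0, "latency_ms": 10.5},  # looks normal
    "disk-b": {"errors": 9, "latency_ms": 10.2},  # too many errors
    "disk-c": {"errors": 1, "latency_ms": 25.0},  # latency outside normal range
}
print(find_brownouts(samples, baseline))  # ['disk-b', 'disk-c']
```

The flagged devices are the candidates to shift away from during the next maintenance window.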

We now use this process at Shoreline.

2. Performing control actions

Here, we adapt feedback control theory from industrial systems.

The basic concept is that you have the desired state, the observed state, and the error between the two.

Your goal is to create a control action loop that reduces the gap between the observed and desired states.

The more frequently you sample, the smaller the control action needs to be.

Here's an example: when I started driving, I'd swerve the steering wheel.

Now I don't need to, because I keep making little adjustments without even paying much attention.
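The same idea can be sketched in a few lines of Python. This is a toy simulation under stated assumptions, not a real controller: the state drifts at a constant rate, each sample applies a proportional correction (`gain` times the error), and the name `simulate` and all of its parameters are made up for the example.

```python
def simulate(desired, observed, drift_per_second, sample_interval_s, duration_s, gain=1.0):
    """Return the size of each control action taken during the run."""
    actions = []
    for _ in range(duration_s // sample_interval_s):
        # The system drifts away from the desired state between samples.
        observed += drift_per_second * sample_interval_s
        error = desired - observed   # gap between desired and observed state
        action = gain * error        # proportional control action
        observed += action           # applying the action shrinks the gap
        actions.append(abs(action))
    return actions

frequent = simulate(100.0, 100.0, drift_per_second=2.0, sample_interval_s=1, duration_s=10)
infrequent = simulate(100.0, 100.0, drift_per_second=2.0, sample_interval_s=5, duration_s=10)
print(max(frequent))    # 2.0
print(max(infrequent))  # 10.0
```

Sampling every second needs corrections of size 2, while sampling every five seconds needs corrections of size 10. That is the swerve versus the little adjustment.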

At Shoreline, we keep running these control loops.

Every second, we scrape thousands of metrics, compare them against thousands of alarm conditions, and take little control actions to make things a little bit better.
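One pass of such a loop might look like the sketch below. It is purely illustrative and is not Shoreline's API: the `Alarm` class, the `evaluate` function, the metric names, and the actions (which just return strings here) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Alarm:
    name: str
    condition: Callable[[Dict[str, float]], bool]  # True when the alarm fires
    action: Callable[[], str]                      # small remediation to run

def evaluate(metrics: Dict[str, float], alarms: List[Alarm]) -> List[str]:
    """Run the action of every alarm whose condition holds; return the results."""
    return [alarm.action() for alarm in alarms if alarm.condition(metrics)]

# Hypothetical alarms and a hypothetical scraped sample.
alarms = [
    Alarm("disk_nearly_full", lambda m: m["disk_used_pct"] > 90, lambda: "rotated logs"),
    Alarm("high_latency", lambda m: m["latency_ms"] > 200, lambda: "drained node"),
]
metrics = {"disk_used_pct": 93.0, "latency_ms": 40.0}
print(evaluate(metrics, alarms))  # ['rotated logs']
```

In a running system, a pass like this would repeat on every scrape, so each action stays small.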

This helps keep the systems that our customers manage with Shoreline running smoothly.

That's how we reduce downtime by fixing incidents before they happen.
