How to Manage Failure without Wasting Resources

How can you better manage failure? Here's how we put resources reserved for failover to work: serving workloads that can be paused for a while when a failure happens, and running useful background activity that can be deferred when things hit the fan.
Summary

How can you better utilize the resources you keep aside for failover purposes?

Here’s how I've approached this in the past:

When I was designing Amazon Aurora, I made the storage regional.

So we had 2 copies of data in each of the 3 availability zones.

That meant that as long as I could get another database instance up, I didn't have to constantly replicate data because it was happening behind the scenes in the storage layer.

But during a failure, I might not be able to get a second database instance, because everyone else is asking for one at the same time.

So we kept another instance available, but it acted as a read replica: we could divert read traffic to it, while read-write traffic stayed on the primary.

This way, it wasn't just sitting idle. It served live customer requests, and offloading reads to it might even let you step down the size of your primary instance.

That’s how we utilized resources that were kept just for failover purposes to do things that could be stopped for some time when a failure happens.
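
The read-replica idea above can be sketched in a few lines. This is a minimal illustration, not Aurora's actual implementation: the `Instance` and `Router` names, the `healthy` flag, and the promotion logic are all hypothetical. It shows a standby that serves reads while the writer is healthy and takes over all traffic once the writer fails.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """Hypothetical database instance; not a real Aurora API object."""
    name: str
    role: str  # "writer" or "reader"
    healthy: bool = True


class Router:
    """Sends writes to the writer and reads to the standby replica.

    If the writer is unhealthy, the standby is promoted and absorbs
    all traffic until a new standby is provisioned.
    """

    def __init__(self, writer: Instance, standby: Instance):
        self.writer = writer
        self.standby = standby

    def failover(self) -> None:
        # Promote the standby; it stops being a dedicated read replica.
        if self.standby is None or not self.standby.healthy:
            raise RuntimeError("no healthy standby to promote")
        self.standby.role = "writer"
        self.writer = self.standby
        self.standby = None

    def route(self, is_write: bool) -> str:
        if not self.writer.healthy:
            self.failover()
        if is_write:
            return self.writer.name
        # Reads use the standby while it exists, so it is never idle.
        if self.standby is not None and self.standby.healthy:
            return self.standby.name
        return self.writer.name


router = Router(Instance("primary", "writer"), Instance("replica", "reader"))
print(router.route(is_write=True))   # handled by the primary
print(router.route(is_write=False))  # handled by the replica
```

The point of the design is that the failover capacity earns its keep during normal operation; the cost is that read capacity shrinks for a while after a promotion.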

Another example comes from a friend who used to run an entire second data center just in case the first one failed.

That’s super expensive, but now they do something brilliant with it.

They use those resources to run AI model training jobs on the systems at the second data center.

If a region goes down, they can pause those training jobs for a period and run user traffic on that capacity instead.

That's another way you can have resources doing useful background activity that can be deferred when things hit the fan.
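
As a rough sketch of that pattern, the standby site can be modeled as a pool that runs deferrable jobs until the primary fails. Everything here is an assumption made for illustration: the `StandbyPool` class, its method names, and the slot-based capacity model are hypothetical and do not describe the friend's actual system.

```python
class StandbyPool:
    """Hypothetical standby capacity that runs deferrable background jobs.

    On primary failure, running jobs are paused and the capacity is
    handed to user traffic; on recovery, deferred jobs resume in order.
    """

    def __init__(self, capacity: int):
        self.capacity = capacity      # number of job slots (assumed model)
        self.running: list[str] = []  # background jobs currently running
        self.deferred: list[str] = [] # jobs waiting for free capacity
        self.serving_users = False

    def submit(self, job: str) -> bool:
        """Start the job if capacity allows; otherwise defer it."""
        if self.serving_users or len(self.running) >= self.capacity:
            self.deferred.append(job)
            return False
        self.running.append(job)
        return True

    def on_primary_failure(self) -> None:
        # Pause all background work; paused jobs go to the front of the queue.
        self.deferred = self.running + self.deferred
        self.running = []
        self.serving_users = True

    def on_primary_recovered(self) -> None:
        # Hand capacity back to background work, oldest jobs first.
        self.serving_users = False
        while self.deferred and len(self.running) < self.capacity:
            self.running.append(self.deferred.pop(0))


pool = StandbyPool(capacity=2)
pool.submit("train-model-a")
pool.submit("train-model-b")
pool.on_primary_failure()    # training pauses, user traffic takes over
pool.on_primary_recovered()  # training resumes
```

This only works for jobs that tolerate interruption, for example training runs that checkpoint their progress, which is why the workload has to be deferrable in the first place.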