How to Manage Failure without Wasting Resources

How can you better utilize the resources you keep aside for failover purposes?

Here’s how I've approached this in the past:

When I was designing Amazon Aurora, I made the storage regional.

So we had 2 copies of data in each of the 3 availability zones.

That meant that as long as I could get another database instance up, I didn't have to constantly replicate data myself, because replication was already happening behind the scenes in the storage layer.

But during a failure, I might not be able to get a second database instance, because everyone else is asking for one at the same time.

So we kept another instance available and had it act as a read replica: we could divert read traffic to it, but not read-write traffic.

This way, it wasn't just sitting idle but getting used for live customer requests and maybe letting you tick down the size of your instance.

That’s how we utilized resources that were kept just for failover purposes to do things that could be stopped for some time when a failure happens.
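To make the idea concrete, here is a minimal sketch of a standby that doubles as a read replica. It is not how Aurora is implemented; the `Instance` and `Router` names and the promotion logic are hypothetical, and the sketch assumes both instances see the same data through a shared storage layer, so promotion needs no data copy.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Instance:
    name: str
    healthy: bool = True


@dataclass
class Router:
    """Hypothetical router: writes go to the primary, reads go to the standby."""
    primary: Instance
    standby: Optional[Instance]  # failover target that also serves reads

    def promote_standby(self) -> None:
        # Assumption: storage is shared, so promotion is only a role change.
        if self.standby is None or not self.standby.healthy:
            raise RuntimeError("no healthy standby to promote")
        self.primary, self.standby = self.standby, None

    def route(self, is_write: bool) -> Instance:
        # On primary failure, the standby stops being "just a replica".
        if not self.primary.healthy:
            self.promote_standby()
        # In normal operation, reads are diverted to the standby.
        if not is_write and self.standby is not None and self.standby.healthy:
            return self.standby
        return self.primary


router = Router(primary=Instance("db-a"), standby=Instance("db-b"))
print(router.route(is_write=False).name)  # read served by the standby
router.primary.healthy = False
print(router.route(is_write=True).name)   # standby promoted, now takes writes
```

After promotion the former standby carries both reads and writes, which is the work that "could be stopped for some time": the read offloading goes away until a new replica is available.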

Another example of this is from a friend who used to run an entire 2nd data center just in case the 1st one failed.

That’s super expensive, but now they do something brilliant with it.

They run AI modeling jobs on the systems at the 2nd data center.

If a region goes down, they can pause those training jobs for a while and run user traffic on those systems instead.

That's another way you can have resources doing useful background activity that can be deferred when things hit the fan.
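The same pattern can be sketched as a standby pool that runs deferrable jobs until it is needed. This is only an illustration under simple assumptions: `StandbyPool`, its methods, and the job names are hypothetical, each job occupies one machine, and pausing a job is treated as instantaneous.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StandbyPool:
    """Hypothetical second data center that runs background jobs while idle."""
    capacity: int  # number of machines, one job per machine (assumption)
    running_jobs: List[str] = field(default_factory=list)
    paused_jobs: List[str] = field(default_factory=list)
    serving_users: bool = False

    def submit(self, job: str) -> bool:
        # Background work is only accepted while the pool is not serving users.
        if self.serving_users or len(self.running_jobs) >= self.capacity:
            return False
        self.running_jobs.append(job)
        return True

    def fail_over(self) -> None:
        # Defer all background work and hand the machines to user traffic.
        self.paused_jobs.extend(self.running_jobs)
        self.running_jobs.clear()
        self.serving_users = True

    def fail_back(self) -> None:
        # Primary site is healthy again: resume deferred jobs in order.
        self.serving_users = False
        while self.paused_jobs and len(self.running_jobs) < self.capacity:
            self.running_jobs.append(self.paused_jobs.pop(0))


pool = StandbyPool(capacity=2)
pool.submit("train-a")
pool.submit("train-b")
pool.fail_over()   # training paused, pool now serves user traffic
pool.fail_back()   # training resumes where the queue left off
```

The key design choice is that the background work must tolerate being paused at any moment; jobs that cannot be deferred are a poor fit for failover capacity.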
