A Guide to Building Reliable Systems

Let’s talk about Amazon S3’s 11 nines claim.

Amazon is certainly one of the best, most impressive organizations that I know of when it comes to this stuff.

But 11 nines of annual durability implies that if you save a photo, you should expect it to stay there for ~100 billion years.

But the sun is going to envelop the earth in about 5-7 billion years, and I don’t think US-East-2 will survive that event!
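As a back-of-the-envelope check on the 100-billion-year figure, here is a minimal Python sketch. It assumes the 11 nines are an annual, per-object durability and that the chance of loss is the same every year. The function name is made up for illustration.

```python
def mean_years_to_loss(nines: int) -> float:
    """Expected years until an object is lost, given N nines of annual durability.

    Assumption: the annual loss probability is constant, so the expected
    time to loss is simply its reciprocal.
    """
    annual_loss_probability = 10.0 ** (-nines)  # 11 nines -> 1e-11 per year
    return 1.0 / annual_loss_probability


if __name__ == "__main__":
    years = mean_years_to_loss(11)
    print(f"Expected years to loss at 11 nines: {years:.3g}")
```

The number only holds if failures are independent, and that is the assumption the rest of this piece questions.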

S3 claims to achieve 11 nines by making 33 copies of data and storing 11 copies in each of three data centers.

This allows for the tolerance of 11 individual disk drive failures.

However, this approach doesn't account for correlated failures, such as a data center going down due to a fire or natural disaster, which would result in the loss of multiple copies of data.

When designing Aurora, we stored two copies of data in each of the three data centers (even though most other systems kept one copy in each center). That's because I was looking at my largest likely correlated failure, which would be a data center going down.

So when that happens, every single one of my databases is going to get two failures for that duration of time.

Now, some subset of them is going to have another failure somewhere else while I'm still repairing the first two failures.

So if I'm doing two out of three and two copies go down, I'm left with only one copy. And I can't tell whether that copy is up to date or not, which means that my database is corrupt.

But if I'm doing four out of six and I get down to three out of six, I can still read it and do the repair.
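The comparison between the two layouts can be sketched in a few lines of Python. This is an illustrative model, not Aurora's implementation. The class and field names are hypothetical. The read quorum of three for the six-copy layout is inferred from the "three out of six, I can still read it" remark above, and the read quorum of two for the three-copy layout is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuorumLayout:
    """Hypothetical model of copies spread evenly across data centers."""
    copies_per_dc: int
    dc_count: int
    write_quorum: int
    read_quorum: int

    @property
    def total_copies(self) -> int:
        return self.copies_per_dc * self.dc_count

    def surviving(self, dc_failures: int, extra_failures: int) -> int:
        # A data center outage takes out every copy it holds (correlated),
        # on top of any independent failures elsewhere.
        lost = dc_failures * self.copies_per_dc + extra_failures
        return max(self.total_copies - lost, 0)

    def can_read(self, dc_failures: int, extra_failures: int) -> bool:
        return self.surviving(dc_failures, extra_failures) >= self.read_quorum

    def can_write(self, dc_failures: int, extra_failures: int) -> bool:
        return self.surviving(dc_failures, extra_failures) >= self.write_quorum


# One copy per data center, two-out-of-three quorum (assumed read quorum of 2).
three_copy = QuorumLayout(copies_per_dc=1, dc_count=3, write_quorum=2, read_quorum=2)
# Two copies per data center, four-out-of-six write quorum, three-out-of-six reads.
six_copy = QuorumLayout(copies_per_dc=2, dc_count=3, write_quorum=4, read_quorum=3)

if __name__ == "__main__":
    for name, layout in (("2 of 3", three_copy), ("4 of 6", six_copy)):
        # Scenario from the text: one data center down plus one more failure.
        print(name,
              "surviving:", layout.surviving(1, 1),
              "can read:", layout.can_read(1, 1),
              "can write:", layout.can_write(1, 1))
```

In the scenario of one data center down plus one additional failure, the three-copy layout falls below its read quorum. The six-copy layout loses the ability to write but can still read, which is what makes the repair possible.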

So when designing your systems, you need to:

take the largest probable correlated event,
combine it with the independent failures that could already be going on,
multiply that by the number of such things going on in your environment, and then
divide it by the duration over which that's going to happen.

For example, if it takes me 10 seconds to repair a segment in Aurora, I'm basically looking for a 10-second period for the independent failures against the correlated failure.

You want to bring that number down as far as you can in an economically reasonable way.
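One way to put a rough number on this is sketched below in Python. It is an estimate under stated assumptions, not the calculation the Aurora team used: copies fail independently at a constant rate (an exponential model), and the failure rate, fleet size, and function names are all hypothetical placeholders.

```python
import math

SECONDS_PER_YEAR = 365 * 24 * 3600


def p_copy_fails(annual_failure_rate: float, window_seconds: float) -> float:
    """Probability that one copy fails inside the repair window.

    Assumes a constant failure rate (exponential model).
    """
    exposure = annual_failure_rate * window_seconds / SECONDS_PER_YEAR
    return -math.expm1(-exposure)  # 1 - exp(-exposure), accurate for tiny values


def p_any_fails(copies_left: int, p_single: float) -> float:
    """Probability that at least one of the remaining copies fails."""
    return -math.expm1(copies_left * math.log1p(-p_single))


def expected_groups_hit(groups: int, copies_left: int,
                        annual_failure_rate: float,
                        window_seconds: float) -> float:
    """Expected number of groups that see an extra failure during the outage."""
    p_single = p_copy_fails(annual_failure_rate, window_seconds)
    return groups * p_any_fails(copies_left, p_single)


if __name__ == "__main__":
    # Hypothetical inputs: 1,000,000 groups, 4 copies left after a data
    # center outage, 2% annual failure rate per copy.
    for window in (10, 3600):
        hit = expected_groups_hit(1_000_000, 4, 0.02, window)
        print(f"repair window {window:>5}s -> expected groups hit: {hit:.3g}")
```

Under this model the exposure scales almost linearly with the repair window, so cutting repair time is one of the cheaper levers for bringing the number down.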

For us, that ended up being four out of six.

For you, it might be a different number.

To find that, the factors you need to look at are:

your correlated events
their downstream impacts
the time it takes to repair them
the breadth of the system they apply to