Actively Managing Systems to Improve Utilization

We're all being asked to do more with less nowadays. For those of us in production operations, one of the best ways to do that is to eliminate waste, using automation to drive higher utilization.

Let’s talk about how we can use automation to drive higher utilization.

I'll use the example of disk utilization as it’s easy to understand.

When I look at customer instances, most of the time the disks are radically over-provisioned: they’re often only 10 or 20% full.

Because disks are pretty inexpensive, it can make sense to over-provision dramatically, since that saves you from the midnight page for running out of space.

But at scale, it adds up and costs a lot of money.

Here’s a better way to do it:

Actively monitor your disks to see when they might run out of space.

This way, you only grow them when you hit a certain threshold, often when 70% of the disk space is used, or when the disk is projected to run out of space within 2 days.

This leads to much less waste than statically assigning a large amount of space up front.
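
As a rough illustration of the two triggers just described, here is a minimal Python sketch. It is not Shoreline's implementation; the names `DiskSample`, `days_until_full`, and `should_grow` are hypothetical, and the projection assumes usage grows linearly between the first and last sample, which is a simplification.

```python
from dataclasses import dataclass
from typing import List, Optional

SECONDS_PER_DAY = 86400


@dataclass
class DiskSample:
    """One observation of disk usage (hypothetical structure for this sketch)."""
    timestamp_s: float
    used_bytes: int


def days_until_full(samples: List[DiskSample], total_bytes: int) -> Optional[float]:
    """Project days until the disk fills, assuming linear growth.

    Returns None when there is not enough data or usage is not growing.
    """
    if len(samples) < 2:
        return None
    first, last = samples[0], samples[-1]
    elapsed_days = (last.timestamp_s - first.timestamp_s) / SECONDS_PER_DAY
    if elapsed_days <= 0:
        return None
    growth_per_day = (last.used_bytes - first.used_bytes) / elapsed_days
    if growth_per_day <= 0:
        return None
    return (total_bytes - last.used_bytes) / growth_per_day


def should_grow(samples: List[DiskSample], total_bytes: int,
                max_used_fraction: float = 0.70,
                min_days_left: float = 2.0) -> bool:
    """Grow when usage crosses the threshold OR the disk is projected to fill soon."""
    if not samples:
        return False
    used_fraction = samples[-1].used_bytes / total_bytes
    if used_fraction >= max_used_fraction:
        return True
    days_left = days_until_full(samples, total_bytes)
    return days_left is not None and days_left <= min_days_left
```

In practice the samples could come from something like Python's `shutil.disk_usage(path)`, which reports total, used, and free bytes for a mounted filesystem. The actual resize step depends on your cloud provider or storage layer and is left out here.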

Shoreline’s Disk Op Pack is one of our most popular automated remediations for exactly this reason.

It's very safe, removes stupid tickets, and saves people money.

I did something similar when I was at Amazon while working on Amazon Aurora.

We started all our databases at 10 gigabytes and automatically grew them behind the scenes as needed.

Now let's make this a little bit more sophisticated.

I said above that the utilization threshold for growing a disk is often set to 70%, and that’s because we're told that SSDs should only fill up to 70%.

This is because SSDs write out of place.

So you need extra space for the second write to happen.

Then there's a garbage collector in the background that cleans up old versions of pages.

The more free space you have:
- the less often that garbage collector needs to run.
- the more it can run in the idle time between foreground requests.
- the less it'll impact the latency of the writes you're making to the system.

But 70% is an arbitrary number:
- If you're running a mostly read-only system, it can be a lot higher.
- If you're running a heavily write-oriented system, it should be a lot lower.

So it becomes important to manage this actively, so that you can tune the number and decide what the threshold should be for your workload.
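
One simple way to pick a starting point is to scale the threshold with how write-heavy the workload is. The sketch below is only an illustration: the function name `threshold_for_workload` is hypothetical, and the 90% and 50% bounds are assumed values chosen so that an even read/write mix lands on the familiar 70%. They are not measured recommendations.

```python
def threshold_for_workload(write_fraction: float,
                           read_heavy_max: float = 0.90,
                           write_heavy_min: float = 0.50) -> float:
    """Linearly interpolate a fill threshold from the share of I/O that is writes.

    write_fraction = 0.0 means read-only, 1.0 means write-only.
    The two bounds are illustrative assumptions, not vendor guidance.
    """
    if not 0.0 <= write_fraction <= 1.0:
        raise ValueError("write_fraction must be between 0.0 and 1.0")
    return read_heavy_max - (read_heavy_max - write_heavy_min) * write_fraction
```

A mostly read-only system ends up near the upper bound, and a heavily write-oriented one near the lower bound, which matches the direction described above.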

You can build this as a feedback control loop using something like Shoreline.

If you're trying to monitor a large fleet of hundreds or thousands of disks, you will need software, as you can't manage them all one by one.

But it will drive your utilization higher and lower your bill.
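
To make the feedback loop idea concrete, here is a minimal Python sketch of one control step. It is an assumption-laden illustration, not how Shoreline does it: the function name `adjust_threshold`, the use of p99 write latency as the signal, and the step, floor, ceiling, and slack values are all hypothetical.

```python
def adjust_threshold(current: float,
                     p99_write_latency_ms: float,
                     target_latency_ms: float,
                     step: float = 0.02,
                     floor: float = 0.50,
                     ceiling: float = 0.90,
                     slack: float = 0.8) -> float:
    """One iteration of a simple step controller for the fill threshold.

    If write latency is over target, leave more free space (lower threshold).
    If latency is comfortably under target, allow fuller disks (raise threshold).
    Otherwise hold steady. The result is clamped to [floor, ceiling].
    """
    if p99_write_latency_ms > target_latency_ms:
        proposed = current - step
    elif p99_write_latency_ms < target_latency_ms * slack:
        proposed = current + step
    else:
        proposed = current
    return min(ceiling, max(floor, proposed))
```

Run periodically per disk or per workload class, a loop like this nudges the threshold toward the highest utilization that still keeps write latency acceptable. The dead band between `slack * target` and `target` is there to keep the threshold from oscillating on every measurement.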

