How to Do Continuous Improvement in Operations

Things that enabled me to do more with lower cloud computing costs
2:40 min

“How do I do continuous improvement in operations?” You do it by creating a culture around it. Let’s understand this with an analogy to Agile and quality improvement.

A lot of Agile is about creating continuous improvement and automation. To do that, you need to figure out three things:

- an output metric

- an input metric that drives the output metric

- the work item that drives the input metric

In quality:

The output metric = The number of defects that escaped your QA and testing process and made it into the wild

The input metric (my preference) = The percentage of automated testing or your code coverage

The work item = Building test cases

Similarly, in operations:

The output metric = The number of tickets

or

The output metric = The number of tickets × the duration of the event × the number of people impacted.

I prefer the latter because something that affects a lot of people is more important than something that affects just one.
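To make the difference between the two output metrics concrete, here is a minimal Python sketch. It assumes each ticket records a duration and a count of people impacted, and it reads the weighted metric as the sum of duration × people impacted across all tickets. The `Ticket` fields, function names, and sample values are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Ticket:
    # Hypothetical fields; a real ticketing system will name these differently.
    category: str
    duration_minutes: float
    people_impacted: int


def ticket_count(tickets):
    """First output metric: the raw number of tickets."""
    return len(tickets)


def weighted_impact(tickets):
    """Second output metric: duration x people impacted, summed over tickets."""
    return sum(t.duration_minutes * t.people_impacted for t in tickets)


# Illustrative data only, not real measurements.
tickets = [
    Ticket("disk_full", 60, 1),       # long event, one person affected
    Ticket("cert_expired", 30, 500),  # shorter event, many people affected
]

print(ticket_count(tickets))     # both tickets count equally here
print(weighted_impact(tickets))  # the wide-reaching event dominates here
```

With a raw count, both tickets weigh the same. With the weighted metric, the event that touched 500 people accounts for almost all of the total, which is the behavior the preference above is after.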

The input metric = The number of automations you've built. It's hard to go back and fix all your code, so you must remediate issues as they occur. You need to employ the machine to fix issues in a few seconds rather than having a human do it in an hour or more, especially when many people are impacted.

The work item = Building the automations. How do you do that? The good news is that you get ~100 new tickets every week. Just automate one per week. If you run that loop every week, things will get better and better over time.
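One way to run that weekly loop is sketched below in Python. The source only says to automate one ticket per week; ranking ticket categories by duration × people impacted to decide which one to pick is an assumption added here, as are the function name, the tuple layout, and the sample data.

```python
from collections import defaultdict


def pick_automation_candidate(tickets, already_automated):
    """Return the not-yet-automated ticket category with the highest impact.

    `tickets` is a list of (category, duration_minutes, people_impacted)
    tuples. This layout is hypothetical, not a real ticketing API.
    Returns None when every category seen has already been automated.
    """
    impact = defaultdict(float)
    for category, duration_minutes, people_impacted in tickets:
        if category not in already_automated:
            impact[category] += duration_minutes * people_impacted
    if not impact:
        return None
    return max(impact, key=impact.get)


# Illustrative data only.
this_weeks_tickets = [
    ("disk_full", 60, 1),
    ("disk_full", 45, 2),
    ("cert_expired", 30, 500),
    ("oom", 10, 20),
]

automated = set()

# Week 1: pick one category and build an automation for it.
first = pick_automation_candidate(this_weeks_tickets, automated)
automated.add(first)

# Week 2: the same ticket mix now yields the next-worst category.
second = pick_automation_candidate(this_weeks_tickets, automated)

print(first, second)
```

Each pass through the loop grows the set of automations (the input metric), which should in turn pull down the output metric over time.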

That's how you do continuous improvement in operations.

