
How to Do Continuous Improvement in Operations

Things that enabled me to do more with lower cloud computing costs
2:40 min
Summary

“How do I do continuous improvement in operations?” You do it by creating a culture around it. Let’s understand this with an analogy to Agile and quality improvement.

Much of Agile is about continuous improvement and automation. To get there, you need to figure out three things:

- an output metric

- an input metric that drives the output metric

- the work item that drives the input metric

In quality:

The output metric = The number of defects that escaped your QA and testing process and made it into the wild

The input metric (my preference) = The percentage of automated testing or your code coverage

The work item = Building test cases

Similarly, in operations:

The output metric = The number of tickets

OR

The output metric = The number of tickets × the duration of the event × the number of people impacted.

I prefer the latter because something that affects a lot of people is more important than something that affects just one.
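To make the difference between the two output metrics concrete, here is a minimal Python sketch. The `Ticket` fields, the sample data, and the choice to sum duration × people impacted per ticket are illustrative assumptions, not a formula given in the video.

```python
from dataclasses import dataclass


@dataclass
class Ticket:
    # All field names are hypothetical; adapt them to your ticketing system.
    kind: str
    duration_minutes: float
    people_impacted: int


def ticket_count(tickets):
    """Simple output metric: the number of tickets."""
    return len(tickets)


def weighted_impact(tickets):
    """Weighted output metric: for each ticket, duration x people impacted,
    summed over all tickets (one assumed reading of the formula above)."""
    return sum(t.duration_minutes * t.people_impacted for t in tickets)


# Made-up sample week.
tickets = [
    Ticket("disk_full", 60, 1),
    Ticket("disk_full", 30, 200),
    Ticket("cert_expired", 10, 5),
]

print(ticket_count(tickets))     # 3
print(weighted_impact(tickets))  # 6110
```

In this sample, the plain count treats all three tickets equally, while the weighted metric is dominated by the one event that hit 200 people.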

The input metric = The number of automations you've built. It's hard to go back and fix all your code, so you must remediate issues as they occur in production. Have the machine fix an issue in a few seconds rather than having a human do it in an hour or more, especially when many people are impacted.

The work item = Building the automations. How do you do that? The good news is that you get ~100 new tickets every week. Just automate one per week. If you run that loop every week, things will get better and better over time.
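The weekly loop can be sketched in a few lines of Python. The video only says to automate one ticket per week; picking the ticket type with the highest duration × people impact is an assumption made here for illustration, as are the function name, the tuple layout, and the sample data.

```python
from collections import defaultdict


def next_automation_candidate(tickets, automated):
    """Return the not-yet-automated ticket kind with the largest total impact.

    tickets: iterable of (kind, duration_minutes, people_impacted) tuples.
    automated: set of ticket kinds that already have an automation.
    Returns None when every kind seen this week is already automated.
    """
    impact_by_kind = defaultdict(float)
    for kind, duration_minutes, people_impacted in tickets:
        if kind not in automated:
            impact_by_kind[kind] += duration_minutes * people_impacted
    if not impact_by_kind:
        return None
    return max(impact_by_kind, key=impact_by_kind.get)


# Made-up tickets for one week.
week = [
    ("disk_full", 60, 1),
    ("disk_full", 30, 200),
    ("cert_expired", 10, 5),
    ("pod_crashloop", 20, 50),
]

automated = set()
first = next_automation_candidate(week, automated)
automated.add(first)  # build that automation this week
second = next_automation_candidate(week, automated)

print(first)   # disk_full
print(second)  # pod_crashloop
```

Each pass through the loop raises the input metric (the number of automations built) by one, which over time should pull the output metric down.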

That's how you do continuous improvement in operations.

