Let's talk about building a culture around reliability.
I'm famous in my teams for telling them, “The currency of management is attention” – what you pay attention to is what your reports are going to pay attention to.
I firmly believe that improving operational excellence starts with your culture.
At AWS, we had weekly operations meetings a couple of hours long, led by heads of various services and experts who knew reliability best.
We'd review the prior week's outages, ongoing campaigns, etc.
The time people spent in that meeting showed that reliability was important to the company.
But many companies try to solve operational challenges by assigning them to a specific team or a reliability tzar.
That removes ownership and accountability from everybody else.
One of the few advantages of being an old guy is that I've seen this story play out before.
I remember failed quality tzar and security tzar initiatives.
They failed because they were fundamentally saying that this is not important enough to be part of everybody's job.
We got those things to succeed by making them part of everyone’s responsibility, for example by having everyone put in a unit test as part of the code review process.
We need to do the same thing for reliability.
No one wants to do on-call.
So you can:
- toss the problem to some “second-tier team” (even if you don't call them that)
- or make it part of everybody's job.
We know which one is going to improve reliability.
That doesn't mean you don't have specialists on the team.
But it isn't some other team's job to keep your service up.
Just like it's not some other team's job to fix your bugs or make sure that your system doesn't have vulnerabilities.
We all have to own it. That is what a culture of reliability requires.