The concept of centralized versus decentralized operations predates both site reliability engineering (SRE) and DevOps. Organizing operations around on-call predates both as well. And there have always been tradeoffs associated with dedicating staff to on-call: either small, specialized groups, which are core to the SRE model, or distributed teams with some on-call responsibility, which are core to the DevOps experience.
In our experience working under centralized and decentralized regimes at companies of varying sizes and maturity levels, it’s common for simple issues, like licenses expiring, disks filling up or credits depleting, to cause outages. It’s also very common for operator error to lead to urgent remediation scenarios. And operators are embarrassed to admit fault when a license expires during a beta launch or a disk fills up before the end of a holiday weekend.
To learn more about managing these issues and tradeoffs, read the full article at The New Stack.