What Is SRE (Site Reliability Engineering)?

Whether you're a VP of Engineering looking to build an SRE team or a developer wanting to expand your knowledge and skill set, understanding the building blocks of SRE is essential to a successful digital transformation.

In this article, you'll learn what SRE is, what people in the field do, how they measure success, and the challenges and benefits of the practice so you can spend more time creating innovative products that out-perform your competition.

What Is SRE?

Site reliability engineering (SRE) is the practice of using automation and software engineering principles to resolve issues related to IT operations and infrastructure. These issues range from deploying and monitoring applications to debugging databases and responding to incidents.

Before SRE, systems administrators (sysadmins) solved most operational problems by hand. With SRE practices, teams can manage thousands of machines by writing software that automates those fixes, creating scalable and highly reliable software systems.

Google's VP of engineering Ben Treynor Sloss is widely credited with establishing site reliability engineering around 2003. What began as a small practice for his seven-person team has now spread into the broader software development industry. In 2021, 22% of organizations reported adopting an SRE practice, up from 15% the previous year.

With SRE growing as a practice and leading the way in streamlining ops problems, it's crucial to understand what site reliability engineers do—especially if you're looking to build a team.

What Do Site Reliability Engineers Do?

A site reliability engineer is a software developer or sysadmin with IT operations experience. They bring a unique set of skills to the table in that they can both code and manage IT infrastructure. Their ultimate goal is to keep the applications and services they oversee running and available to customers as close to 100% of the time as possible.

Most SRE teams aim to emulate Google by spending no more than half their time on manual IT operations tasks (also referred to as "toil"). These tasks may include resizing disks, archiving or deleting files from full disks, terminating stuck Kubernetes pods, or restarting Java virtual machines (JVMs) whose CPU usage is maxed out by runaway garbage collection. Capping manual work lets SREs focus on writing code that automates those tasks so that IT operational issues can eventually remediate themselves.
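
To make the 50% cap concrete, here is a minimal Python sketch of how a team might check it. The WorkLog record and its field names are hypothetical, and the sketch assumes engineers log their hours as either toil or engineering work. It is one possible way to track the cap, not a standard tool.

```python
from dataclasses import dataclass

TOIL_CAP = 0.5  # the "no more than half" cap described above


@dataclass
class WorkLog:
    """Hours one engineer logged in a period (hypothetical record format)."""
    engineer: str
    toil_hours: float         # manual, repetitive operations work
    engineering_hours: float  # automation and project work


def toil_fraction(log: WorkLog) -> float:
    """Return the share of logged hours spent on toil (0.0 if nothing was logged)."""
    total = log.toil_hours + log.engineering_hours
    return log.toil_hours / total if total else 0.0


def over_cap(logs: list[WorkLog], cap: float = TOIL_CAP) -> list[str]:
    """Return the names of engineers whose toil share exceeds the cap."""
    return [log.engineer for log in logs if toil_fraction(log) > cap]


if __name__ == "__main__":
    logs = [
        WorkLog("alice", toil_hours=12, engineering_hours=28),
        WorkLog("bob", toil_hours=26, engineering_hours=14),
    ]
    for log in logs:
        print(f"{log.engineer}: {toil_fraction(log):.0%} toil")
    print("Over cap:", over_cap(logs))
```

In this made-up example, alice spends 30% of her time on toil and bob spends 65%, so only bob is flagged as a candidate for more automation help.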

If you're building an SRE team, you'll want to find top-notch SRE candidates who can reduce toil and are comfortable operating complex environments that span servers, databases, applications, microservices, multiple clouds, search engines, and monitoring tools. You'll also want to ensure they're proficient in a variety of technologies and can learn new ones quickly to keep up with the constant change in IT operations.

How Do You Measure SRE Success?

SREs track their performance using SLAs, SLIs, and SLOs. These measures keep them on track and help break down silos between development and operations.

Here's a closer look at the three metrics companies use to measure the success of site reliability engineering teams:

SLA (service-level agreement): This is the agreement between the team providing a service and its users on the reliability the service must deliver. For example, if the SRE team agrees on a 99.9% SLA, it has at most 0.1% of room for errors, outages, and downtime. That 0.1% is the error budget.
SLI (service-level indicator): SLIs are the metrics that measure specific aspects of the service level an SLA promises. Some common examples of SLIs that your SRE team may monitor include availability, error rate, request latency, and system throughput.
SLO (service-level objective): SLOs are the goals that support achieving a specific SLA. For example, if your SLA specifies that your systems will be available 99.95% of the time, you'll want an SLO of at least 99.95% uptime, and many teams set it slightly higher to leave a safety margin.
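
The arithmetic behind these three terms is simple enough to sketch in a few lines of Python. The function names below are illustrative, and the sketch assumes a 30-day window and an availability SLI computed from request counts. Your own SLIs and windows may differ.

```python
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(target: float, window_days: int = 30) -> float:
    """Downtime an availability target permits over the window, in minutes."""
    return (1.0 - target) * window_days * MINUTES_PER_DAY


def availability_sli(good_requests: int, total_requests: int) -> float:
    """Availability SLI: share of requests that succeeded (1.0 if there was no traffic)."""
    return good_requests / total_requests if total_requests else 1.0


def error_budget_remaining(target: float, good_requests: int, total_requests: int) -> float:
    """Fraction of the error budget still unspent; a negative value means it is overspent."""
    allowed_failures = (1.0 - target) * total_requests
    actual_failures = total_requests - good_requests
    if allowed_failures == 0:
        # A 100% target (or no traffic) leaves no budget to spend.
        return 1.0 if actual_failures == 0 else float("-inf")
    return 1.0 - actual_failures / allowed_failures


if __name__ == "__main__":
    print(f"99.9% over 30 days: {allowed_downtime_minutes(0.999):.1f} min of downtime")
    print(f"99.95% over 30 days: {allowed_downtime_minutes(0.9995):.1f} min of downtime")
    sli = availability_sli(999_400, 1_000_000)
    print(f"SLI: {sli:.4%}, meets 99.9% target: {sli >= 0.999}")
    print(f"Budget remaining: {error_budget_remaining(0.999, 999_400, 1_000_000):.0%}")
```

Under these assumptions, a 99.9% target allows about 43.2 minutes of downtime in a 30-day window, and a 99.95% target allows about 21.6 minutes.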

Using these three metrics as their guiding lights, SRE teams can accurately measure their DevOps automation, incident response, and coding efforts.

DevOps vs. SRE

Speaking of coding, you may be wondering: if SREs write code, how are they different from DevOps engineers?

The difference is that SREs write code to automate remediations that keep a company's products and services up and running (with as little downtime as possible). DevOps engineers focus on the software delivery life cycle, including integrating, testing, and deploying code. Neither team writes code in a vacuum: SREs work closely with DevOps to share feedback, and DevOps collaborates with engineering, product management, security, and quality teams to deliver new versions of software to market faster.

In a nutshell, DevOps and SRE are two sides of the same coin: DevOps teams focus on Day 1, getting software up and running, while SRE focuses on Day 2, keeping applications and services running and scaling over time.

Outside of this distinction, the teams have more in common than not, and in many organizations, engineers have a blended role that includes both DevOps and SRE. Both functions aim to speed up the application development life cycle and reduce the time engineers spend on it. Both also write automation to get rid of toil, with DevOps focusing on tasks related to features and SRE addressing repetitive maintenance tasks.

DevOps and SRE teams have the same goals at the end of the day. Companies should always ensure these departments work closely with each other—whether they're on the same team, with SRE being a component of DevOps, or are separate teams that constantly communicate with one another.

Who's On-Call?

One of the debates in the software development industry is over who should be on-call, both during the workday and outside of regular working hours. Should the developers who wrote the code be available? Should SREs spearhead these efforts since they're skilled at debugging, especially when an issue is fleetwide? Or should the support team field these issues instead? Perhaps the answer is a combination of the three groups.

When developers are on-call, their first inclination is often to write code to handle tickets automatically, so they essentially don't have to be on-call and can work on more innovative efforts. Great SREs also look for programmatic ways to eliminate incidents, but unfortunately, they're usually too busy resolving current incidents to invest in automated repairs. Meanwhile, support or customer service may not have the technical know-how, but they're the interface with customers and can run basic scripts to address common customer issues when armed with the right tools.

Ultimately, having a combination of developers, SREs, and support staff on-call can be the best approach. All three teams bring different strengths to the table, and splitting up on-call shifts helps employees get more sleep and resolve problems faster.
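
If you go the combined route, a rotation can be as simple as taking turns. The Python sketch below is one hypothetical way to split shifts across the three groups. The team names and people are made up, and a real schedule would also need to account for time zones, vacations, and escalation paths.

```python
from itertools import cycle


def build_rotation(teams: dict[str, list[str]], shifts: int) -> list[tuple[int, str, str]]:
    """Assign each shift to the next team in turn, then to the next person on that team."""
    for name, members in teams.items():
        if not members:
            raise ValueError(f"team {name!r} has no members to put on-call")
    team_order = cycle(teams)  # dicts iterate in insertion order
    people = {name: cycle(members) for name, members in teams.items()}
    schedule = []
    for shift in range(shifts):
        team = next(team_order)
        schedule.append((shift, team, next(people[team])))
    return schedule


if __name__ == "__main__":
    teams = {
        "developers": ["dana", "eli"],
        "sre": ["sam"],
        "support": ["uma"],
    }
    for shift, team, person in build_rotation(teams, shifts=6):
        print(f"shift {shift}: {person} ({team})")
```

A plain round-robin like this keeps any one group from carrying every shift, which is the point of sharing the load.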

Challenges Site Reliability Engineers Face

In addition to being on-call at least part of the time, SREs face various other challenges that can make their job stressful.

SREs work in complex environments with a lot of on-the-job learning. Even with their self-taught expertise, they often receive criticism when the company's service or product goes down while not getting much credit when it stays up. These reactions can make them risk-averse and likely to miss out on discovering creative solutions.

To keep your team from feeling overworked and overlooked, let them know early and often that you and the company appreciate their steadfast efforts. One way to highlight the importance of their work is to make reliability a shared goal across the entire development organization. It's also important to create a culture that resists the temptation to blame individuals. Humans aren't perfect, and if a product's architecture and processes require perfection, they are doomed to fail. When a major outage occurs, look for systemic fixes that keep the issue from happening again.

Benefits of SRE

Keeping your site reliability engineers happy is all the more essential because a good SRE practice benefits your company in several significant ways:

Establishes Metrics

Meaningful metrics are often hard to come by. With a consistent implementation of SRE, companies can finally quantify downtime, helping development and operations teams understand the cost of SLA violations on revenue. SRE teams can also use metrics to calculate impact in other areas, such as reducing security vulnerabilities.

Without these metrics and data, it's tough for a company to know where an investment in automation will deliver the most business value.
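
As a rough illustration of quantifying downtime, the Python sketch below turns a list of outage durations into an availability figure and an estimated revenue impact. The revenue-per-minute figure and the 30-day window are assumptions made for the example, and a flat per-minute cost is a simplification of how outages actually affect revenue.

```python
def downtime_report(
    outage_minutes: list[float],
    revenue_per_minute: float,
    sla_target: float,
    window_days: int = 30,
) -> dict:
    """Summarize downtime in a window and estimate what it cost."""
    window_minutes = window_days * 24 * 60
    downtime = sum(outage_minutes)
    availability = 1.0 - downtime / window_minutes
    return {
        "downtime_minutes": downtime,
        "availability": availability,
        "sla_met": availability >= sla_target,
        # Assumes every minute of downtime loses the same amount of revenue.
        "estimated_revenue_lost": downtime * revenue_per_minute,
    }


if __name__ == "__main__":
    report = downtime_report(
        outage_minutes=[20, 35],   # two hypothetical outages
        revenue_per_minute=500.0,  # hypothetical figure
        sla_target=0.999,
    )
    for key, value in report.items():
        print(f"{key}: {value}")
```

Even a back-of-the-envelope number like this gives development and operations teams a shared way to compare the cost of an outage against the cost of automating its fix.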

Increases Efficiency

An SRE practice helps increase efficiency by automating manual tasks like resizing disks, restarting stuck pods, and rotating certificates. Through automation, SRE also reduces the potential for human error, which can hurt uptime.
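
Here is a small Python sketch of what automating one of these tasks might look like: clearing old files once a disk fills up. The 90% threshold, the seven-day age limit, and the function names are illustrative assumptions. The sketch defaults to a dry run so that nothing is deleted until someone has reviewed the list.

```python
import shutil
import time
from pathlib import Path


def disk_used_fraction(path: str) -> float:
    """Return the used share of the filesystem that holds path."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def stale_files(directory: str, max_age_days: float) -> list[Path]:
    """List regular files directly inside directory last modified more than max_age_days ago."""
    cutoff = time.time() - max_age_days * 86400
    return [
        p for p in Path(directory).iterdir()
        if p.is_file() and p.stat().st_mtime < cutoff
    ]


def remediate(directory: str, threshold: float = 0.9,
              max_age_days: float = 7, dry_run: bool = True) -> list[Path]:
    """If the disk is at or above threshold, return (and optionally delete) stale files."""
    if disk_used_fraction(directory) < threshold:
        return []  # disk is healthy, nothing to do
    targets = stale_files(directory, max_age_days)
    if not dry_run:
        for path in targets:
            path.unlink()
    return targets


if __name__ == "__main__":
    for path in remediate(".", dry_run=True):
        print("would delete:", path)
```

A script like this replaces a late-night manual cleanup with a repeatable check, and the dry-run default is a cheap guard against the automation itself causing an incident.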

Just look at the outage Facebook experienced in 2021. The social media giant's services were down for more than six hours due to a routine maintenance mistake. Better SRE practices might have prevented the unfortunate event, for example by giving the operator a pre-approved list of commands instead of expecting them to write the command by hand.
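
The idea of a pre-approved command list can be sketched in a few lines of Python. The RUNBOOK entries, the parameter rules, and the build_command helper below are all hypothetical. This is a generic illustration of the pattern, not a description of any company's tooling, and the sketch only builds the command rather than running it.

```python
import re

# Hypothetical runbook: the only command shapes an operator may run.
RUNBOOK = {
    "restart-service": ["systemctl", "restart", "{service}"],
    "delete-pod": ["kubectl", "delete", "pod", "{pod}", "--namespace", "{namespace}"],
}

# Parameters may contain only letters, digits, dots, underscores, and hyphens.
SAFE_VALUE = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]*")


def build_command(action: str, **params: str) -> list[str]:
    """Fill in a pre-approved command template, rejecting anything off the list."""
    if action not in RUNBOOK:
        raise PermissionError(f"{action!r} is not a pre-approved action")
    for name, value in params.items():
        if not SAFE_VALUE.fullmatch(value):
            raise ValueError(f"unsafe value for {name!r}: {value!r}")
    try:
        return [part.format(**params) for part in RUNBOOK[action]]
    except KeyError as missing:
        raise ValueError(f"missing parameter {missing} for {action!r}") from None


if __name__ == "__main__":
    print(build_command("restart-service", service="nginx"))
    print(build_command("delete-pod", pod="api-7f9c", namespace="prod"))
```

Because the operator chooses an action by name and supplies only validated parameters, a typo or an unexpected flag is rejected before anything reaches production.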

Writing code for automation isn't easy, but it's worth it for the increased productivity—and saved revenue.

Companies that want to generate even more efficiency can implement a tool like Shoreline's End-to-End Automation, which empowers SREs to automate dozens of incident types in the same amount of time it takes to automate one.

Bolsters Innovation

SRE also helps foster creativity and new ideas by cutting down on bugs that bog down operations. By working through incidents and eliminating repeated issues, SREs free up developers to focus on higher-value projects, like producing enhanced and more exciting product features.

Propel Your Company Forward with SRE

The increasingly complex world of IT operations and digital services demands collaboration between developers and dedicated site reliability engineers. By reading this article, you're well on your way to defining SRE practices for your company or taking your developer skills to a new level.

Ready to get started? Shoreline's End-to-End Automation Platform will help your SRE team automate remediations in seconds, not weeks. Get started today with a free trial.