Don’t Be Like Twitter

Last week, everyone at Shoreline was following the news around the latest Twitter outage. Users couldn’t tweet or retweet because an employee mistakenly deleted important data from an internal service.

To make matters worse, the team that handled that particular service (and therefore could have fixed the issue) no longer works at the company. They all left in November after Elon Musk’s harsh ultimatum to staff, in which he demanded that employees commit to an “extremely hardcore” work schedule or leave the company altogether.

This outage struck a chord with our team for two reasons:

Human error
Staff shortages

Staff Shortages Lead to More Mistakes

Ultimatums aside, staff shortages are a reality for many tech companies right now. The looming recession has led to a wave of layoffs. Unfortunately, more may come. And even if your workforce isn’t shrinking, the current economic climate has likely led to tighter budgets demanding you do more with less.

With a shrinking or taxed workforce, companies are more vulnerable to human error. With fewer checks and balances (and increased pressure on everyone), people are more likely to make mistakes.

In the world of on-call engineering and reliability, this is a huge problem. Imagine that the staff responsible for spotting, fixing, and preventing reliability issues aren’t available to do their jobs. Other engineers must then be pulled away from their own priorities to resolve incidents and outages, probably without proper preparation. What impact could that have on your service, your customers, and your business?

Learn From Twitter’s Mistakes

Now is the time to safeguard your business against the impact of staff shortages. Automated runbooks are a great first step: they give your teams step-by-step, runnable recipes for diagnosing and repairing incidents. Like built-in guardrails, runbooks keep your team on a safe path to resolution without escalating issues to other busy developers. Every command in each runbook has been curated and tested for a specific incident type.
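
To make the idea of a "runnable recipe" concrete, here is a minimal sketch in Python. It is not Shoreline's API: the Step class, the run_runbook function, and the disk-full runbook are hypothetical names made up for illustration, and the steps are stubbed with functions that simply report success. The point is the shape: an ordered list of vetted steps that halts at the first failure instead of letting the operator improvise.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One runbook step: a description plus a callable that returns True on success."""
    description: str
    run: Callable[[], bool]


def run_runbook(steps: List[Step]) -> List[str]:
    """Run steps in order and stop at the first failure."""
    log = []
    for number, step in enumerate(steps, start=1):
        ok = step.run()
        log.append(f"{number}. {step.description}: {'ok' if ok else 'FAILED'}")
        if not ok:
            break  # guardrail: never continue past a failed step
    return log


# Hypothetical runbook for a "disk almost full" incident.
# Each lambda stands in for a real, pre-tested command.
disk_full_runbook = [
    Step("Confirm the disk is above the alert threshold", lambda: True),
    Step("Delete rotated logs older than 7 days", lambda: True),
    Step("Verify free space is back above the threshold", lambda: True),
]

for line in run_runbook(disk_full_runbook):
    print(line)
```

In a real runbook, each stubbed lambda would be replaced by a command that has already been tested for that incident type.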

You can also take automation a step further by creating simple, straightforward actions that trigger automatically to self-heal the well-understood issues your team handles every day, such as certificate rotation and disk resizing.
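
An auto-triggered action boils down to three parts: a metric, a threshold, and a remediation. The Python sketch below illustrates that pattern under stated assumptions. The names self_heal and disk_used_fraction, and the 90% threshold, are hypothetical choices for this example and do not describe how Shoreline implements it. Only shutil.disk_usage is a real standard-library call.

```python
import shutil
from typing import Callable


def disk_used_fraction(path: str = "/") -> float:
    """Return the fraction of the disk at `path` that is in use (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def self_heal(
    read_metric: Callable[[], float],
    threshold: float,
    remediate: Callable[[], None],
) -> bool:
    """Run the remediation if the metric exceeds the threshold.

    Returns True if the remediation fired, False otherwise.
    """
    if read_metric() > threshold:
        remediate()
        return True
    return False


# Example with stubbed values: a disk at 95% usage against an assumed 90% threshold.
actions_taken = []
fired = self_heal(
    read_metric=lambda: 0.95,                            # stand-in for disk_used_fraction
    threshold=0.90,                                      # assumed alert threshold
    remediate=lambda: actions_taken.append("resize disk"),  # stand-in for a real resize
)
print(fired, actions_taken)
```

In practice, a check like this would run on a schedule or in response to a monitor alert, with disk_used_fraction passed in as the metric and a tested resize command as the remediation.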

With automation, you can ease the burden on your team so team members can focus on higher-value tasks. This ultimately:

Reduces the risk of human errors
Reduces escalations (which are time-consuming and costly!)
Frees up valuable engineering time by eliminating unplanned work
Allows you to do more with a smaller team

Shoreline can help you deploy runbook automation and create self-healing infrastructure quickly and efficiently. Shoreline’s cloud reliability platform automatically scans your environment (or uses monitors from observability tools you’re already using) to find AND fix common issues that typically take your on-call and engineering teams hours and hours to manage. We also help you troubleshoot, then repair, and ultimately build automations for new issues your team is facing. The result? Shoreline drastically reduces your on-call workload so you can manage your environment with fewer on-call resources.

If layoffs, employee attrition, or hiring limitations have impacted your engineering and on-call teams, Shoreline can help. Use our free Incident Insights tool to learn which types of issues are impacting your engineers the most and which individuals or teams are dealing with the biggest load. Or, if you’re ready to dive in headfirst, start a free trial of the Shoreline cloud reliability platform today.