In this conversation, Evan and Anurag talk about Shoreline's AI Ops platform for incident response.
Anurag covers what he learned while leading services at AWS, where rapid growth made reliability more challenging. Because of the scale, the team had to get innovative with automation to ensure high availability for customers with high expectations.
Today, driving reliability for cloud services has become a widespread challenge. Anurag recommends that the first step is simply to understand what is causing incidents, which gives a good blueprint for where to make repairs or shorten diagnostics. (Shoreline has a free tool to help with this.)
Anurag also tells the story of a large Shoreline customer that runs a fleet of 30,000 nodes across all three major clouds and 'umpteen' regions. "It's really cool for them to be able to run a command across 10,000 instances at once, to detect if something might be wrong or check for new security vulnerabilities. You can manage your fleet as though it were a single box. And that's cool."