Observability in software ops is key for proactive issue resolution, going beyond data collection to include decisive actions based on logs, metrics, and traces. Not acting on insights leads to reduced productivity and poor user experiences.
Resources and insights
Read about what we've been working on lately and what we're most excited about.
Shoreline recently helped Razorpay, a FinTech leader in India, elevate their system reliability and improve developer productivity by 25% as part of their strategic initiative for incident automation.
Explore our blog post on the realities of on-call operations, drawing on insights from Shoreline's 2022 survey. More than 300 experts discuss the high costs and challenges they face, with tips for reducing escalations and automating tasks. Discover how Shoreline's tools can revolutionize your on-call strategy.
Shoreline.io's guide details manual and automated processes for retiring Kubernetes worker nodes, emphasizing efficient node management and minimal disruption using kubectl commands and Shoreline's Op Pack.
In a compelling discussion, Evan and Anurag delve into the intricacies of Shoreline's AI Ops platform for incident response. Anurag, drawing from his experience leading reliable services at AWS, highlights the challenges of maintaining high availability in the face of rapid growth. He emphasizes the role of innovative automations in ensuring consistent service for demanding customers. Anurag suggests the first step in driving reliability for cloud services is understanding the root causes of incidents. He points to Shoreline's free tool designed to aid in this process. The conversation also features a case study of a major Shoreline client managing a 30,000-node fleet across multiple clouds and regions. Anurag shares how the client efficiently handles security checks and issue detections over thousands of instances simultaneously, treating the entire fleet as a single entity. For a deeper dive into this insightful discussion, the full video podcast is available on YouTube and LinkedIn.
The main challenge in preventing outages is that components such as disks, nodes, and networks will inevitably fail. To mitigate this, companies need to accept human error as unavoidable, especially when operators enter numerous commands by hand every day. Investigating how minor errors can cause significant damage, and implementing safeguards and redundancies, are essential steps toward reducing the risk and impact of outages.
CoreDNS is vital for Kubernetes, replacing SkyDNS and KubeDNS. DNS issues are common in Kubernetes, with CoreDNS often struggling under load. It integrates with Prometheus for metrics monitoring. Shoreline's CoreDNS Op Pack offers solutions like auto-restarting pods and alerts to PagerDuty or Slack for more efficient DNS management.
In this RedMonk Conversation, Stephen O'Grady and Anurag Gupta discuss how generative AI can help address reliability challenges and incident response.