Observability without Action is just Storage
Observability is NOT an end in itself. See why observability without action is just storage and learn how engineers can avoid falling into this trap.
Since we’re all sitting on similar infrastructure, if someone solves an issue, everyone should be able to benefit from it. That’s one of the ways we help our customers save time, reduce errors, and reach a four-nines SLA.
Because it's less expensive and quicker for passengers, Southwest operates on a point-to-point model. Any disruption on one route affects the entire chain. But to engineer a reliable architecture, you need to balance cost against reliability within economic constraints.
The bigger the data set, the slower it is to analyze. For MELT (metrics, events, logs, and traces), you need to be able to execute a query at scale across your fleet and see what's going on in the live environment. That’s why, at Shoreline, we favor modeling the distributed system as a distributed system.
Learn step by step how to set up Shoreline's Incident Insights so that you can pinpoint the top causes of incidents, measure team health, and use trending data to drive continuous improvement. Get up and running in 2 minutes.
I deeply believe in making things 1% better each and every week by improving the performance of the software I've been responsible for and keeping my services up. Let’s talk about bringing continuous improvement to operations.
It's not some other team's job to keep your service up. Just like it's not some other team's job to fix your bugs or make sure that your system doesn't have vulnerabilities. We all have to own it. That is what a culture of reliability requires.
Four nines of availability (99.99%) allows only about 4.4 minutes of downtime in a month. Here are 3 challenges that keep people from meeting customer expectations for service availability.
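The downtime figure is easy to check by hand. Here is a minimal sketch of the arithmetic, assuming an average calendar month of 365.25 / 12 days (about 30.44); the function name and constants are ours for illustration, not part of any Shoreline tooling. With a flat 30-day month the budget comes out closer to 4.3 minutes.

```python
def allowed_downtime_minutes(availability: float, period_days: float) -> float:
    """Return the minutes of downtime an availability target permits over a period."""
    period_minutes = period_days * 24 * 60
    return period_minutes * (1.0 - availability)


# Assumption: an "average" month, i.e. one twelfth of a 365.25-day year.
AVERAGE_MONTH_DAYS = 365.25 / 12

if __name__ == "__main__":
    targets = [("three nines", 0.999), ("four nines", 0.9999), ("five nines", 0.99999)]
    for label, target in targets:
        budget = allowed_downtime_minutes(target, AVERAGE_MONTH_DAYS)
        print(f"{label} ({target:.3%}): {budget:.2f} minutes per month")
```

Each extra nine cuts the monthly budget by a factor of ten, which is why the step from three nines to four leaves so little room for manual repair.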
When designing reliable systems, you need to look at correlated events and their downstream impacts, the time it takes to repair them, and the breadth of the system they affect.
What can we learn from the Ticketmaster (Taylor Swift) debacle? Ticketmaster experienced unprecedented demand that crashed their site for many hours. If they had designed a reliable service with an escalator-like system instead of an elevator, this could have been avoided.