Reliability Engineering: The Southwest Debacle

“Why is COVID better than Southwest Airlines? Because COVID is airborne.”

I read this on a handwritten sign while flying during the holidays.

The joke points at the meltdown Southwest faced last December, when almost 60% of its flights were grounded.

Many reasons have been cited for the meltdown: weather, high demand, insufficient crew and planes, the outdated SkySolver crew-scheduling software, and so on.

While there is some truth to all these explanations, my question is: Why did this happen to Southwest and not to other airlines?

The real difference between Southwest and the airlines that didn’t fall over is the architecture of how they operate their networks.

Southwest operates on a point-to-point model, which means that each flight is routed directly from one location to another without connecting through a central hub.

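So a disruption on one route affects the entire chain: the same plane and crew are scheduled to fly the next legs, and a late or cancelled flight leaves every later leg short of an aircraft or a crew.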
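To make the chain effect concrete, here is a minimal sketch of how a single delay carries through one plane's day. The function name, the idea of "turnaround slack", and the numbers are my own assumptions for illustration, not anything taken from Southwest's actual scheduling system.

```python
def propagate_delay(initial_delay_min, turnaround_slack_min):
    """Return the departure delay (in minutes) of each leg in one plane's chain.

    Hypothetical model: the first leg leaves `initial_delay_min` late, and
    each turnaround on the ground can absorb a fixed amount of slack before
    the next leg departs. Whatever is not absorbed carries over.
    """
    delays = [initial_delay_min]
    delay = initial_delay_min
    for slack in turnaround_slack_min:
        # A turnaround can only win back as much time as its buffer allows.
        delay = max(0, delay - slack)
        delays.append(delay)
    return delays


# A 90-minute delay on the first leg, with 10 minutes of slack per turnaround.
print(propagate_delay(90, [10, 10, 10]))  # [90, 80, 70, 60]
```

With thin buffers, one bad leg in the morning is still hurting the fourth flight of the day. A spare plane or crew at the next stop would reset the delay to zero, which is exactly what the chain lacks.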

On the other hand, most other airlines use a hub-and-spoke model, which is more resilient to failures.

This model allows the airlines to adopt an n+k approach: n units need to work, and the system carries k extra units so that it can tolerate k failures.

So they can keep k reserve planes and crews at the hub as a contingency in case of disruptions.

To do the same in the point-to-point model, you’d need to have k reserves at all locations, which isn’t economically feasible.
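A rough sketch of the arithmetic behind this, under loud assumptions: units fail independently with the same probability, and every location that holds reserves holds exactly k of them. The hub and station counts and the failure probability are made-up numbers for illustration, not real airline data.

```python
from math import comb


def availability(n, k, p_fail):
    """Probability that at least n of n + k units are working.

    Assumes each unit fails independently with probability p_fail, so the
    number of failures follows a binomial distribution. The system survives
    as long as no more than k units fail.
    """
    total = n + k
    return sum(
        comb(total, failures) * p_fail**failures * (1 - p_fail) ** (total - failures)
        for failures in range(k + 1)
    )


def total_reserves(reserve_locations, k):
    """Total reserve units if every reserve-holding location keeps k spares."""
    return reserve_locations * k


# Hypothetical network: 4 hubs versus 100 point-to-point stations, k = 2.
HUBS, STATIONS, K = 4, 100, 2

print(total_reserves(HUBS, K))      # 8
print(total_reserves(STATIONS, K))  # 200

# Effect of k spares on a pool that needs 10 working units (5% failure rate).
print(availability(10, 0, 0.05))
print(availability(10, K, 0.05))
```

The reserve bill grows with the number of places that have to hold spares, which is why pooling them at a few hubs is so much cheaper than spreading them across every station.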

There are more nuances to this, such as the point-to-point model being less expensive for the airline and quicker for the passengers.

But to engineer a reliable architecture, you need to balance cost against reliability within the economic constraints you have.