
Self-Healing: The Key to Fixing the Most Common Kubernetes Issues

Here are three tips for automatically fixing the most common Kubernetes issues: master Kubernetes on the job, enable self-healing, and stay proactive.

The rise of Kubernetes (aka K8s) has enabled organizations across the globe to manage containerized apps across a variety of hosts. Kubernetes also provides high-level mechanisms for deployment, maintenance, and application scaling.

Utilizing Kubernetes allows IT teams to optimize costs, shorten development and release timelines, increase software scalability, gain flexibility in multi-cloud environments, and improve cloud portability.

But Kubernetes is not perfect. Issues can still occur within Kubernetes, and they’re often difficult to fix. Kubernetes covers a broad surface area in your IT environment. That means many things could require attention when an issue occurs (for example, having to resize a disk in AWS).

While the potential challenges of Kubernetes may seem daunting, the benefits are too valuable to ignore. So how can you quickly diagnose and solve an issue when it does arise? Here are three tips for continuously fixing the most common Kubernetes issues (with little effort required).

1) Trying to master Kubernetes? Help is on the way.

From pod debugging to node debugging to service debugging, gaining expert-level Kubernetes knowledge takes time. Worse, having to deal with repeated issues only delays that education further.

Some of the most common Kubernetes issues we’ve seen plague productivity are:

  • Stale Argo pods that leave hardware over-provisioned, forcing you to either absorb higher costs or implement custom logic for cluster auto-scaling and workflow clean-up.
  • CoreDNS performance degradation that causes significant latency, impacting customers and their SLAs.
  • Out-of-memory errors that hurt the customer experience and make it difficult to capture diagnostic data.
  • Pods stuck in the terminating state that cause app scheduling to fail, resulting in a financial drain from unnecessary scaling.
  • Delayed Kubernetes node retirement, which can lead AWS to take the server offline and abruptly kill any services running on it, resulting in data loss and customer downtime.

We’ll take a deep dive into these issues later, but for now, let’s focus on enabling engineers to master Kubernetes on the job and avoid costly delays.

By utilizing a debugging notebook, such as Shoreline’s Kubernetes Debugging Notebooks, engineers can quickly and automatically scan across pods, nodes, and services to diagnose and remediate issues.
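To make that concrete, here is a minimal sketch of the kind of cluster-wide pod scan a single notebook step might automate, written against the official Kubernetes Python client (illustrative only, not Shoreline’s implementation):

```python
# Minimal sketch of a cluster-wide pod health scan, similar in spirit to a
# single debugging-notebook step (illustrative only; not Shoreline code).
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running in a pod
v1 = client.CoreV1Api()

# Flag any pod that is not Running or Succeeded, across all namespaces.
for pod in v1.list_pod_for_all_namespaces(watch=False).items:
    phase = pod.status.phase
    if phase not in ("Running", "Succeeded"):
        print(f"{pod.metadata.namespace}/{pod.metadata.name}: {phase}")
```

A notebook step like this beats re-typing ad hoc kubectl commands because the same scan runs identically every time, for every engineer on call.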

Say, for instance, a problem arises within one of your Kubernetes clusters. Engineers may spend hours searching Stack Overflow for answers or guessing and checking which commands to run. As if that weren’t enough, most teams don’t account for time spent searching for a solution, so there’s no telling how much time went into solving an issue, let alone repeated issues.

Yes, manually running a command to fix an issue may not take long. But remembering which commands to use, and in what sequence, to manually diagnose an issue steadily erodes your SREs’ productivity.

Kubernetes Debugging Notebooks record every step taken to assess and remediate a situation, removing the need for engineers to draft documentation and conduct lengthy handover meetings upon escalation. This makes it simple to share Kubernetes knowledge among engineers and streamline future incident responses.

2) Identify frequent and repeated issues

As mentioned previously, Kubernetes incidents are often repeated. Aren’t you sick of solving the same issue repeatedly (especially in the middle of the night)?

Shoreline’s Op Pack library (where you can find the Notebooks mentioned above) offers open-source blueprints for self-healing solutions that continuously remediate your most common Kubernetes incidents, including the issues discussed above:

Pods stuck in terminating

When Kubernetes pods won’t leave the terminating state, the underlying node is likely broken. A broken node can cause apps to fail to schedule, and the resulting unavailability drains money through unnecessary scaling.

If a pod has been terminating for too long, Shoreline’s Pods Stuck in Terminating Op Pack cordons, drains, and then terminates the node in a safe, clean way to avoid impacting other software.
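As a rough illustration of the detection half of that workflow, the sketch below (not Shoreline’s code) uses the Kubernetes Python client to find pods whose deletion was requested more than ten minutes ago and report the nodes they run on; the threshold is an arbitrary choice for the example:

```python
# Sketch: find pods stuck in Terminating longer than a threshold and report the
# nodes they run on (the candidates for cordon/drain/terminate).
from datetime import datetime, timedelta, timezone
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

STUCK_AFTER = timedelta(minutes=10)   # arbitrary threshold for illustration
now = datetime.now(timezone.utc)

for pod in v1.list_pod_for_all_namespaces(watch=False).items:
    ts = pod.metadata.deletion_timestamp  # set once deletion has been requested
    if ts and now - ts > STUCK_AFTER:
        print(f"stuck: {pod.metadata.namespace}/{pod.metadata.name} "
              f"on node {pod.spec.node_name} (terminating since {ts})")
```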

Pod out of memory

A variety of application errors can lead to out-of-memory errors (OOMs) in Kubernetes. Shoreline’s Pod Out of Memory Op Pack monitors for memory usage that hits a certain threshold and then captures diagnostic data. The data is then pushed to a cloud storage service and attached to a ticket or posted as a Slack message in a pre-selected channel.
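For a sense of what the monitoring half looks like, here is a hedged sketch that reads per-container memory usage from the metrics.k8s.io API (it requires metrics-server, the 900 MiB threshold is made up for the example, and it only parses the common “Ki” quantity format):

```python
# Sketch: check container memory usage against a threshold via metrics.k8s.io
# (requires metrics-server). Threshold and namespace are illustrative.
from kubernetes import client, config

config.load_kube_config()
metrics = client.CustomObjectsApi()

THRESHOLD_MIB = 900  # e.g. alert at ~90% of a 1 GiB limit

pod_metrics = metrics.list_namespaced_custom_object(
    group="metrics.k8s.io", version="v1beta1",
    namespace="default", plural="pods")

for pod in pod_metrics["items"]:
    for c in pod["containers"]:
        usage = c["usage"]["memory"]  # typically a quantity like "123456Ki"
        if usage.endswith("Ki"):
            mib = int(usage[:-2]) / 1024
            if mib > THRESHOLD_MIB:
                print(f'{pod["metadata"]["name"]}/{c["name"]} using {mib:.0f} MiB')
```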

Restart CoreDNS service

CoreDNS, the default Kubernetes DNS service, can degrade in performance and cause latency when there are too many calls to it. Once latency between the pod and CoreDNS reaches one second or more, it impacts both the customer and their SLA.

The Shoreline CoreDNS Op Pack monitors these metrics and, once latency exceeds a configurable threshold, triggers rolling restarts of the CoreDNS pods to prevent service outages.
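A rolling restart itself is straightforward to script. The sketch below mimics what `kubectl rollout restart` does, bumping a pod-template annotation on the CoreDNS Deployment so Kubernetes replaces the pods one at a time; the latency check that would gate this in practice is omitted, and “coredns” in “kube-system” is the default location in most clusters:

```python
# Sketch: trigger a rolling restart of the CoreDNS Deployment the same way
# `kubectl rollout restart` does -- by bumping a pod-template annotation.
from datetime import datetime, timezone
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

patch = {"spec": {"template": {"metadata": {"annotations": {
    "kubectl.kubernetes.io/restartedAt": datetime.now(timezone.utc).isoformat()
}}}}}

apps.patch_namespaced_deployment(
    name="coredns", namespace="kube-system", body=patch)
```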

Deleting old Argo pods

While Argo makes managing workflows easy, it can leave behind stale pods after workflow execution.

Shoreline’s Argo Op Pack heavily reduces the operational burden of administering Argo by decreasing overcapacity and lowering operating costs. It constantly monitors the local node, comparing the number of allocated IPs against a configurable maximum. Shoreline automatically cleans up old Argo pods whenever the total number of assigned IPs exceeds that threshold.
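As a simplified stand-in for that logic, the sketch below cleans up completed workflow pods by phase rather than by IP count; the “argo” namespace is an assumption for the example:

```python
# Sketch: clean up completed workflow pods by phase. A simplified stand-in for
# the Op Pack's IP-count check; the "argo" namespace is assumed.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

for phase in ("Succeeded", "Failed"):
    done = v1.list_namespaced_pod("argo", field_selector=f"status.phase={phase}")
    for pod in done.items:
        print(f"deleting completed pod {pod.metadata.name}")
        v1.delete_namespaced_pod(pod.metadata.name, "argo")
```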

Kubernetes node retirement

Shoreline’s Kubernetes Node Retirement Op Pack simplifies the process of handling nodes marked for retirement. A pre-built alarm triggers when a node is marked for retirement, and Shoreline then runs a hands-off, self-healing process of cordoning, draining, and terminating the node so that Kubernetes can bring up a replacement.
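The cordon-and-drain portion of that process looks roughly like the sketch below, which marks a (hypothetical) node unschedulable and evicts its pods via the Eviction API. A production drain also skips DaemonSet pods and handles PodDisruptionBudget conflicts, and the V1Eviction type assumes a recent version of the Python client:

```python
# Sketch: cordon a node and evict its pods, roughly what a drain does before
# the node is terminated. The node name is a hypothetical placeholder.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

NODE = "ip-10-0-1-23.ec2.internal"  # hypothetical retiring node

# Cordon: mark the node unschedulable so no new pods land on it.
v1.patch_node(NODE, {"spec": {"unschedulable": True}})

# Evict every pod currently on the node (the Eviction API honors PDBs).
pods = v1.list_pod_for_all_namespaces(field_selector=f"spec.nodeName={NODE}")
for pod in pods.items:
    eviction = client.V1Eviction(metadata=client.V1ObjectMeta(
        name=pod.metadata.name, namespace=pod.metadata.namespace))
    v1.create_namespaced_pod_eviction(
        name=pod.metadata.name, namespace=pod.metadata.namespace, body=eviction)
```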

3) Integrate automation across the entire IT environment

Don’t wait until a problem happens to fix it. At that point, you’re already on your way to heavy delays and financial loss.

To stay proactive, modernize your production ops by integrating Shoreline incident automation across your entire IT environment. Now, I know better than anyone that automation can be scary. We’ve all seen automation make a bad situation worse. But our incident automation solutions aren’t just automation; they’re self-healing.

The team at Shoreline has collectively spent A LOT of time on-call to resolve countless tickets at AWS. Shoreline is the tool we wish we had to eliminate tickets and improve availability. Our fault-resistant self-healing solutions can eliminate thousands of hours of degraded service by improving on-call team productivity and automating away production incidents.

Ready for your Kubernetes issues to heal themselves?

Shoreline’s incident automation platform enables the continuous, automatic remediation of common Kubernetes issues so you can focus on mastering Kubernetes while on the job.

Learn how Shoreline makes it easy for DevOps engineers to diagnose and repair issues at scale, and to quickly create automated remediations that fix them forever, by scheduling a quick demo with one of our experts.