Kubernetes networking is complex, and DNS issues are among the most common causes of problems and outages in a Kubernetes cluster. CoreDNS, the default Kubernetes DNS service, can degrade under a high volume of queries, causing severe latency. Once latency between a pod and CoreDNS reaches one second or more, it affects customers and, ultimately, their SLAs. However, most organizations only monitor CoreDNS and still address the issue manually, causing unacceptable delays and potentially system outages. These issues can be hard to diagnose because DNS problems have broad impact and the underlying cause is often unclear: services may be running fine yet be unable to communicate with each other.
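To make the one-second threshold concrete, here is a minimal sketch of timing a single DNS lookup from inside a pod and flagging it as degraded. It is not part of the Shoreline runbook: the function names, the injectable `resolver` and `clock` parameters, and the choice to treat a failed lookup as degraded are all assumptions made for illustration.

```python
import socket
import time

# Threshold taken from the text above: one second or more is treated as impacting.
LATENCY_THRESHOLD_SECONDS = 1.0


def measure_dns_latency(hostname, resolver=socket.getaddrinfo, clock=time.perf_counter):
    """Time one lookup of `hostname`; return (elapsed_seconds, resolved_ok).

    `resolver` and `clock` are injectable (hypothetical parameters) so the
    function can be exercised without a real network.
    """
    start = clock()
    try:
        resolver(hostname, None)
        resolved_ok = True
    except socket.gaierror:
        # The name could not be resolved at all.
        resolved_ok = False
    return clock() - start, resolved_ok


def is_dns_degraded(latency_seconds, resolved_ok, threshold=LATENCY_THRESHOLD_SECONDS):
    """Assumption: a failed lookup counts as degraded, as does a slow one."""
    return (not resolved_ok) or latency_seconds >= threshold
```

Run from a pod, a call such as `measure_dns_latency("kubernetes.default.svc.cluster.local")` would time a lookup of the API server's in-cluster name, assuming the cluster uses the default `cluster.local` domain. A single sample is noisy, so a real check would likely repeat the measurement before alerting.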
This Shoreline runbook picks up where the DNS Lag runbook leaves off. Click from the PagerDuty alert into this runbook to begin interactive diagnostics. If automatically restarting the Kubernetes CoreDNS pods doesn’t bring DNS back online, this runbook helps you debug the issue and identify the root cause. Is it network saturation? Too many pods on the host? IP exhaustion? Or something else? Click through each cell in the runbook to find out. Once the root cause is identified, take action to rebalance the cluster by adding taints or tolerations or by adjusting the host’s capacity.
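The triage questions above can be pictured as a simple check over per-host metrics. The sketch below is purely illustrative and does not reflect how the runbook's cells are implemented: the `NodeSnapshot` fields, the function name, and every threshold are hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class NodeSnapshot:
    """Hypothetical per-host metrics gathered during diagnostics."""
    pod_count: int
    pod_capacity: int
    free_ips: int
    network_utilization: float  # fraction of link capacity in use, 0.0 to 1.0


def candidate_causes(node, pod_ratio_limit=0.9, min_free_ips=5, network_limit=0.8):
    """Return the causes from the text that the snapshot is consistent with.

    All three limits are illustrative assumptions, not recommended values.
    """
    causes = []
    if node.network_utilization >= network_limit:
        causes.append("network saturation")
    if node.pod_capacity > 0 and node.pod_count / node.pod_capacity >= pod_ratio_limit:
        causes.append("too many pods on the host")
    if node.free_ips <= min_free_ips:
        causes.append("IP exhaustion")
    # Nothing matched: the cause lies outside these three checks.
    return causes or ["something else"]
```

More than one cause can be returned for the same host, which matches the idea of stepping through each check in turn instead of stopping at the first match.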