This incident type refers to situations where the percentage of memory still available for container limits in a Kubernetes cluster is low, meaning the memory limits already committed across pods are approaching the cluster's allocatable memory. This can degrade the performance and stability of the cluster and can lead to pod evictions, OOM kills, or downtime. The incident is typically triggered by an automated query or monitoring tool that checks this metric against a threshold. It is important to address and resolve the issue as soon as possible to ensure the continued health and reliability of the Kubernetes cluster.
Parameters
Debug
Check the percentage of memory available for limits on each node in the cluster
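One way to see this, assuming you have kubectl access to the cluster: the Allocated resources section of the node description reports committed memory limits as a percentage of each node's allocatable memory.

kubectl describe nodes | grep -A 8 "Allocated resources"

A memory limits figure approaching or exceeding 100% means the node is heavily overcommitted.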
Check how close the pods in the cluster are to their configured memory limits
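A rough way to compare current consumption against the configured limits, assuming the metrics-server add-on is installed (kubectl top depends on it):

kubectl top pods --all-namespaces --containers

Cross-reference the reported usage with each pod's configured memory limit (next step) to spot pods running close to their limits.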
Check the resource requests and limits for all pods in the cluster
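A minimal sketch that lists the configured memory requests and limits for every pod, assuming kubectl access:

kubectl get pods --all-namespaces -o custom-columns='NAMESPACE:.metadata.namespace,POD:.metadata.name,MEM_REQUEST:.spec.containers[*].resources.requests.memory,MEM_LIMIT:.spec.containers[*].resources.limits.memory'

Pods showing <none> in either column have no request or limit set and fall back to any namespace defaults.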
Check the memory usage of a specific container in a pod
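For a single pod, assuming metrics-server is installed and <pod-name> and <namespace> are placeholders for your own resources:

kubectl top pod <pod-name> -n <namespace> --containers

The --containers flag breaks the usage down per container; compare the MEMORY column against that container's limit.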
Check the logs for a specific pod to see if any errors or issues are occurring
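For example, with <pod-name> and <namespace> again as placeholders:

kubectl logs <pod-name> -n <namespace> --tail=100
kubectl logs <pod-name> -n <namespace> --previous

The --previous flag shows logs from the prior container instance, which is useful when a container has been OOM-killed and restarted.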
Check the event log for the cluster to see if any events are being generated related to memory usage
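A quick filter over recent cluster events, assuming kubectl access:

kubectl get events --all-namespaces --sort-by='.lastTimestamp' | grep -iE 'oom|memory|evict'

OOMKilling, Evicted, and FailedScheduling (insufficient memory) events all point at memory pressure.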
Check the cluster autoscaler to see if it is scaling up or down properly based on resource usage
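If the cluster autoscaler runs as the standard Deployment in kube-system (the app=cluster-autoscaler label and the cluster-autoscaler-status ConfigMap are common defaults but may differ in your installation):

kubectl -n kube-system logs -l app=cluster-autoscaler --tail=100
kubectl -n kube-system describe configmap cluster-autoscaler-status

Look for scale-up attempts that are blocked or node groups that cannot be provisioned.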
Check the status of the cluster's nodes and pods
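A quick overview, assuming kubectl access:

kubectl get nodes -o wide
kubectl get pods --all-namespaces -o wide | grep -vE 'Running|Completed'

Nodes reporting MemoryPressure or NotReady, and pods stuck in Pending, Evicted, OOMKilled, or CrashLoopBackOff, are the usual symptoms of memory exhaustion.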
Repair
Check the resource requests and limits for the pods running on the Kubernetes cluster to ensure they are properly configured. Adjust any values as necessary, keeping in mind the available resources on the cluster.
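One way to apply such an adjustment without editing manifests by hand, assuming the workload is a Deployment and the names and values below are placeholders for your own:

kubectl -n <namespace> set resources deployment <deployment-name> --requests=memory=256Mi --limits=memory=512Mi

Note that changing resources triggers a rolling restart of the Deployment's pods, and that lowering limits only helps if the containers actually fit within the new values.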
Identify and terminate any pods or containers that are using excessive amounts of memory, either due to a memory leak or other issue. This can free up resources for other parts of the cluster.
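A sketch of how to find and remove the heaviest consumers, assuming metrics-server is installed and <pod-name> and <namespace> are placeholders:

kubectl top pods --all-namespaces --sort-by=memory | head -n 15
kubectl delete pod <pod-name> -n <namespace>

If the pod is managed by a Deployment or StatefulSet it will be recreated automatically, so deleting it only buys time; the underlying leak or undersized limit still needs to be fixed.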
Learn more
Related Runbooks
Check out these related runbooks to help you debug and resolve similar issues.