This incident type relates to an alert that fires when memory usage on a Kubernetes node rises above a set threshold (in this case, 90%), leaving little memory available. The alert monitors the node's memory usage percentage and notifies the relevant teams when the threshold is breached. This incident type is critical because it helps keep Kubernetes clusters within acceptable memory usage levels and ensures that potential issues are identified and resolved promptly.
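The threshold check itself is simple arithmetic. The sketch below assumes the 90% threshold applies to the memory usage percentage computed from a node's total and available memory. Linux, for example, reports these values as `MemTotal` and `MemAvailable` in `/proc/meminfo`. It also assumes the alert fires when usage is strictly above the threshold. The function names are hypothetical, and the runbook does not say which metric source or comparison the real alert uses.

```python
GIB = 1024**3  # bytes in one gibibyte


def memory_usage_percent(total_bytes: int, available_bytes: int) -> float:
    """Percentage of node memory in use, derived from total and available memory."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return (total_bytes - available_bytes) / total_bytes * 100.0


def should_alert(total_bytes: int, available_bytes: int,
                 threshold_percent: float = 90.0) -> bool:
    """Hypothetical alert rule: fire when usage is strictly above the threshold."""
    return memory_usage_percent(total_bytes, available_bytes) > threshold_percent


if __name__ == "__main__":
    # A 16 GiB node with 1 GiB available is 93.75% used, so the alert fires.
    print(memory_usage_percent(16 * GIB, 1 * GIB))  # 93.75
    print(should_alert(16 * GIB, 1 * GIB))          # True
    print(should_alert(16 * GIB, 4 * GIB))          # False (75% used)
```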
Parameters
Debug
Get the list of Kubernetes nodes
Describe a specific node to check its resource usage
Get the list of Kubernetes pods in a specific node
Describe a specific pod to check its resource usage
Get the logs of a specific container in a specific pod
Check the Kubernetes events for any memory-related issues
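The debug steps above map onto standard `kubectl` commands. The sketch below only builds the argument lists and does not run them. The node, namespace, pod, and container names are hypothetical placeholders. Listing a node's pods with the `spec.nodeName` field selector is a standard pattern. Scoping events to the node object with an `involvedObject.name` field selector is one assumed way to surface memory-pressure events, not a query the runbook prescribes.

```python
import shlex


def debug_commands(node: str, namespace: str, pod: str,
                   container: str) -> list[list[str]]:
    """Return kubectl argv lists for the debug steps, in runbook order."""
    return [
        # List all nodes in the cluster.
        ["kubectl", "get", "nodes"],
        # Show the node's conditions and allocated resources.
        ["kubectl", "describe", "node", node],
        # List pods scheduled on this node, across all namespaces.
        ["kubectl", "get", "pods", "--all-namespaces",
         "--field-selector", f"spec.nodeName={node}"],
        # Show a pod's requests, limits, and recent state (e.g. OOMKilled).
        ["kubectl", "describe", "pod", pod, "-n", namespace],
        # Fetch logs from one container in the pod.
        ["kubectl", "logs", pod, "-c", container, "-n", namespace],
        # Events whose involved object is the node (assumed filter).
        ["kubectl", "get", "events", "--all-namespaces",
         "--field-selector", f"involvedObject.name={node}"],
    ]


if __name__ == "__main__":
    # Hypothetical names; replace them with values from your cluster.
    for argv in debug_commands("worker-1", "default", "my-app-abc123", "app"):
        print(shlex.join(argv))
```

To execute one of these commands, pass its list to `subprocess.run(argv, check=True)`. If metrics-server is installed, `kubectl top node <node>` also reports live memory usage for the node.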
Repair
Remove the affected node from the cluster.
Identify and terminate any resource-intensive pods on the impacted node(s) to free up memory.
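The repair steps can be sketched as follows. The sketch assumes the input is output from `kubectl top pods --all-namespaces`, which has the columns NAMESPACE, NAME, CPU(cores), and MEMORY(bytes) and requires metrics-server. The parser handles only binary suffixes (Ki, Mi, Gi, Ti) and bare byte counts. The sample output and helper names are hypothetical. Cordoning and draining a node with `--ignore-daemonsets --delete-emptydir-data` before `kubectl delete node` is a common pattern, but the runbook does not spell it out.

```python
import shlex

# Binary suffixes used by Kubernetes memory quantities (subset only).
_UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}

# Hypothetical `kubectl top pods --all-namespaces` output.
SAMPLE = """\
NAMESPACE     NAME        CPU(cores)   MEMORY(bytes)
default       web-1       12m          512Mi
default       cache-0     40m          3Gi
kube-system   coredns-x   3m           18Mi
"""


def parse_memory(quantity: str) -> int:
    """Convert a quantity such as '512Mi' to bytes; bare numbers are bytes."""
    for suffix, factor in _UNITS.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)


def top_memory_pods(top_output: str, limit: int = 3) -> list[tuple[str, str, int]]:
    """Rank (namespace, pod, bytes) rows by memory use, highest first."""
    rows = []
    for line in top_output.strip().splitlines()[1:]:  # skip the header row
        namespace, name, _cpu, memory = line.split()
        rows.append((namespace, name, parse_memory(memory)))
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows[:limit]


def delete_pod_command(namespace: str, name: str) -> list[str]:
    """kubectl argv to terminate a single pod."""
    return ["kubectl", "delete", "pod", name, "-n", namespace]


def remove_node_commands(node: str) -> list[list[str]]:
    """Stop scheduling on the node, evict its pods, then remove it."""
    return [
        ["kubectl", "cordon", node],
        ["kubectl", "drain", node, "--ignore-daemonsets", "--delete-emptydir-data"],
        ["kubectl", "delete", "node", node],
    ]


if __name__ == "__main__":
    for namespace, name, used in top_memory_pods(SAMPLE, limit=2):
        print(f"{used / 1024**2:.0f} MiB  {shlex.join(delete_pod_command(namespace, name))}")
    for argv in remove_node_commands("worker-1"):
        print(shlex.join(argv))
```

Before deleting anything, filter the ranked list to pods on the affected node, for example using the node's pod list from the debug steps. A pod owned by a Deployment or another controller is recreated after deletion and may be scheduled back onto the same node unless that node is cordoned.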