---
id: 5eb2c9be-fa63-4bfb-872b-e5793a98f630
---
# Kubernetes Memory Usage Alert
---

This incident is triggered when memory usage on a Kubernetes node exceeds a threshold (in this case, 90%). The alert monitors the memory usage percentage and notifies the relevant teams when the threshold is breached, so that clusters stay within acceptable memory usage levels and potential issues are identified and resolved promptly.
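
As a rough illustration of the threshold logic, the sketch below computes a usage percentage from used and total memory and compares it to 90. The function name `memory_usage_status` and its inputs in Mi are assumptions made for illustration. The real alert is evaluated by the monitoring system, not by this script.

```shell
# Hypothetical helper: report whether memory usage exceeds a threshold.
# $1 = used memory (Mi), $2 = total memory (Mi), $3 = threshold percent (defaults to 90)
memory_usage_status() {
    # Integer percentage of memory in use, rounded down
    pct=$(( $1 * 100 / $2 ))
    if [ "$pct" -gt "${3:-90}" ]; then
        echo "ALERT ${pct}%"
    else
        echo "OK ${pct}%"
    fi
}
```

For example, `memory_usage_status 9500 10000` reports an alert, while `memory_usage_status 4000 8000` does not.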

## Parameters
```shell
# Environment Variables

export NODE_NAME="PLACEHOLDER"

export NAMESPACE="PLACEHOLDER"

export POD_NAME="PLACEHOLDER"

export CONTAINER_NAME="PLACEHOLDER"

```

## Debug

### Get the list of Kubernetes nodes
```shell
kubectl get nodes
```

### Describe a specific node to check its resource usage
```shell
kubectl describe node ${NODE_NAME}
```
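
The `describe` output is long, so it may help to pull out only the memory-related fields. The sketch below reads the node's `MemoryPressure` condition and its allocatable memory with `jsonpath`. The helper name `node_memory_summary` is an assumption for illustration.

```shell
# Hypothetical helper: print the MemoryPressure condition and allocatable memory of a node.
# Usage: node_memory_summary "${NODE_NAME}"
node_memory_summary() {
    # Status of the MemoryPressure condition ("True", "False" or "Unknown")
    pressure=$(kubectl get node "$1" \
        -o jsonpath='{.status.conditions[?(@.type=="MemoryPressure")].status}')
    # Memory the scheduler may hand out to pods on this node
    allocatable=$(kubectl get node "$1" -o jsonpath='{.status.allocatable.memory}')
    echo "node=$1 MemoryPressure=${pressure} allocatable=${allocatable}"
}
```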

### Get the list of Kubernetes pods in a specific node
```shell
kubectl get pods -A --field-selector spec.nodeName=${NODE_NAME}
```

### Describe a specific pod to check its resource usage
```shell
kubectl describe pod ${POD_NAME} -n ${NAMESPACE}
```
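
When a pod has been restarting, one thing worth checking is whether any of its containers was last terminated for exceeding its memory limit. The sketch below lists such containers. The helper name `oomkilled_containers` is an assumption for illustration, and it only inspects the last recorded termination of each container.

```shell
# Hypothetical helper: list containers in a pod whose last termination reason was OOMKilled.
# Usage: oomkilled_containers "${POD_NAME}" "${NAMESPACE}"
oomkilled_containers() {
    # Emit one "<container> <last termination reason>" line per container,
    # then keep only the containers whose reason is OOMKilled
    kubectl get pod "$1" -n "$2" \
        -o jsonpath='{range .status.containerStatuses[*]}{.name}{" "}{.lastState.terminated.reason}{"\n"}{end}' |
        awk '$2 == "OOMKilled" {print $1}'
}
```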

### Get the logs of a specific container in a specific pod
```shell
kubectl logs ${POD_NAME} -c ${CONTAINER_NAME} -n ${NAMESPACE}
```

### Check the Kubernetes events for any memory-related issues
```shell
kubectl get events -A --sort-by=.metadata.creationTimestamp
```
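
The event list can be noisy, so a filter on memory-related reasons may help. The sketch below keeps only lines that mention reasons commonly associated with memory problems. The helper name `memory_events` is an assumption, and the list of reasons is illustrative rather than exhaustive.

```shell
# Hypothetical helper: keep only event lines that mention a memory-related reason.
# Reads event lines on stdin, for example: kubectl get events -A | memory_events
memory_events() {
    grep -E 'OOMKilling|SystemOOM|Evicted|NodeHasInsufficientMemory|EvictionThresholdMet'
}
```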

## Repair

### Remove node from cluster.
```shell
#!/bin/bash

# Set variables

node_name=${NODE_NAME}

# Cordon the node so that no new pods are scheduled on it

kubectl cordon "$node_name"

# List the pods running on the node (all namespaces) before draining

kubectl get pods -A --field-selector spec.nodeName="$node_name" -o wide

# Drain the pods from the node

kubectl drain "$node_name" --ignore-daemonsets --delete-emptydir-data --force --grace-period=30 --timeout=300s

# Delete the node

kubectl delete node "$node_name"
```
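
Because draining and deleting a node is disruptive, it can be worth previewing the drain first. The sketch below wraps `kubectl drain` with `--dry-run=client`, which reports what would be evicted without evicting anything. The helper name `preview_drain` is an assumption for illustration.

```shell
# Hypothetical helper: show what a drain would evict without changing the cluster.
# Usage: preview_drain "${NODE_NAME}"
preview_drain() {
    kubectl drain "$1" --ignore-daemonsets --delete-emptydir-data --dry-run=client
}
```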

### Identify and terminate any resource-intensive pods on the impacted node(s) to free up memory.
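```shell
#!/bin/bash

# Define the Kubernetes node name as a variable

K8S_NODE=${NODE_NAME}

# Memory usage (in Mi) above which a pod is treated as resource-intensive

MEMORY_LIMIT_MI=90

# Get namespace/name for every pod scheduled on the node

PODS=$(kubectl get pods -A --field-selector spec.nodeName="$K8S_NODE" \
    -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{"\n"}{end}')

# Loop through the pods and terminate those above the memory limit

for ENTRY in $PODS
do
    NS=${ENTRY%%/*}
    POD=${ENTRY##*/}

    # kubectl top reports memory such as "120Mi"; "+ 0" keeps the numeric part
    MEM=$(kubectl top pod "$POD" -n "$NS" --no-headers | awk '{print $3 + 0}')

    if [ "${MEM:-0}" -gt "$MEMORY_LIMIT_MI" ]
    then
        # Force deletion skips graceful shutdown
        kubectl delete pod "$POD" -n "$NS" --force --grace-period=0
    fi
done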
```
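
Before deleting anything, it may be safer to list the candidate pods first. The sketch below filters `kubectl top pods -A --no-headers` output by memory usage and prints the matches without touching them. The helper name `pods_over_memory` is an assumption, and it assumes memory is reported in Mi.

```shell
# Hypothetical helper: print namespace/name of pods using more than a given amount of memory.
# Reads `kubectl top pods -A --no-headers` lines on stdin (NAMESPACE NAME CPU MEMORY).
# Usage: kubectl top pods -A --no-headers | pods_over_memory 90
pods_over_memory() {
    # "$4 + 0" turns a value such as "120Mi" into the number 120
    awk -v limit="$1" '$4 + 0 > limit {print $1 "/" $2}'
}
```

The output covers the whole cluster, so it still needs to be matched against the pods scheduled on the affected node.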






