---
id: d97c8fa8-fd6a-432c-ba32-0686d306a944
---

# Elasticsearch Healthy Nodes Incident on Kubernetes
---

This incident indicates that one or more nodes in the Elasticsearch cluster are unhealthy, which can cause performance issues or data loss. It may be triggered automatically by monitoring software or raised manually by a team member. It typically requires immediate attention to identify the underlying cause and restore the affected nodes to a healthy state.

### Parameters
```shell
export POD_NAME="PLACEHOLDER"

export ELASTICSEARCH_NAMESPACE="PLACEHOLDER"

export THRESHOLD="PLACEHOLDER"

export ELASTICSEARCH_CONTAINER_NAME="PLACEHOLDER"

export ELASTICSEARCH_POD_LABEL="PLACEHOLDER"

export REPLICA_COUNT="PLACEHOLDER"

export DEPLOYMENT_NAME="PLACEHOLDER"
```

## Debug

### 1. Get the list of Elasticsearch cluster pods
```shell
kubectl get pods -n ${ELASTICSEARCH_NAMESPACE} -l ${ELASTICSEARCH_POD_LABEL}
```

### 2. Check the status of the Elasticsearch cluster pods
```shell
kubectl describe pods -n ${ELASTICSEARCH_NAMESPACE} ${POD_NAME}
```

### 3. Check the Elasticsearch cluster health status
```shell
kubectl exec -n ${ELASTICSEARCH_NAMESPACE} ${POD_NAME} -- curl -X GET "http://localhost:9200/_cluster/health?pretty"
```
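
The `status` field of the health response is the quickest signal: `green` means all shards are allocated, `yellow` means all primary shards are allocated but some replicas are not, and `red` means at least one primary shard is unallocated. Below is a minimal sketch for extracting that field without `jq`. The function name `es_health_status` is hypothetical, and the `sed` pattern assumes the field appears on a single line as `"status" : "<color>"`.

```shell
# Hypothetical helper: read a _cluster/health JSON document on stdin
# and print the value of its "status" field (green, yellow, or red).
es_health_status() {
    sed -n 's/.*"status" *: *"\([a-z]*\)".*/\1/p' | head -n 1
}

# Example usage (requires access to the cluster):
# kubectl exec -n "${ELASTICSEARCH_NAMESPACE}" "${POD_NAME}" -- \
#     curl -s "http://localhost:9200/_cluster/health" | es_health_status
```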

### 4. Check the Elasticsearch cluster node status
```shell
kubectl exec -n ${ELASTICSEARCH_NAMESPACE} ${POD_NAME} -- curl -X GET "http://localhost:9200/_cat/nodes?v"
```
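
Comparing the number of rows in that output with the number of nodes you expect is a quick way to spot a node that has dropped out of the cluster. The sketch below assumes that `REPLICA_COUNT` from the parameters equals the expected node count, which may not hold if the cluster is spread across several workloads. `count_es_nodes` is a hypothetical helper that counts the non-empty rows after the header line printed by `?v`.

```shell
# Hypothetical helper: count data rows in `_cat/nodes?v` output read from stdin.
# The first line is the column header, so it is skipped.
count_es_nodes() {
    awk 'NR > 1 && NF > 0 { n++ } END { print n + 0 }'
}

# Example usage (requires access to the cluster):
# ACTUAL=$(kubectl exec -n "${ELASTICSEARCH_NAMESPACE}" "${POD_NAME}" -- \
#     curl -s "http://localhost:9200/_cat/nodes?v" | count_es_nodes)
# if [ "$ACTUAL" -lt "${REPLICA_COUNT}" ]; then
#     echo "Expected ${REPLICA_COUNT} nodes, found $ACTUAL."
# fi
```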

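### Elasticsearch cluster is experiencing high CPU or memory usage.
```shell
#!/bin/bash

# Set variables
NAMESPACE=${ELASTICSEARCH_NAMESPACE}
POD_NAME=${POD_NAME}
CONTAINER_NAME=${ELASTICSEARCH_CONTAINER_NAME}
THRESHOLD=${THRESHOLD}

# Get CPU and memory usage (percent) for the Elasticsearch process in the container.
# Assumptions: `ps` is available in the container image and the process command line
# contains "elasticsearch". The [e] bracket keeps grep from matching its own command
# line, and only the first matching process is used.
USAGE=$(kubectl exec -n "$NAMESPACE" "$POD_NAME" -c "$CONTAINER_NAME" -- sh -c "ps -eo pcpu,pmem,args | grep '[e]lasticsearch' | head -n 1" | awk '{print $1, $2}')
CPU=$(echo "$USAGE" | awk '{print $1}')
MEMORY=$(echo "$USAGE" | awk '{print $2}')

# Stop early if no Elasticsearch process was found
if [ -z "$CPU" ] || [ -z "$MEMORY" ]; then
    echo "Could not read usage for an Elasticsearch process in $POD_NAME." >&2
    exit 1
fi

# Check if CPU or memory usage is above threshold
if (( $(echo "$CPU > $THRESHOLD" | bc -l) )); then
    echo "CPU usage is above threshold."
    echo "Usage: $CPU%"
fi

if (( $(echo "$MEMORY > $THRESHOLD" | bc -l) )); then
    echo "Memory usage is above threshold."
    echo "Usage: $MEMORY%"
fi
```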
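
The threshold check above relies on `bc`, which is not installed on every workstation. As a possible alternative, the sketch below does the same floating-point comparison with `awk`. The function name `exceeds_threshold` is hypothetical and is not part of the original script.

```shell
# Hypothetical helper: succeed (exit 0) when the first argument is
# strictly greater than the second, comparing both as numbers.
exceeds_threshold() {
    awk -v value="$1" -v limit="$2" 'BEGIN { exit !(value + 0 > limit + 0) }'
}

# Example usage:
# if exceeds_threshold "$CPU" "$THRESHOLD"; then
#     echo "CPU usage is above threshold: $CPU%"
# fi
```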

### One or more Elasticsearch nodes are down or unresponsive.
```shell
#!/bin/bash

# Set variables
NAMESPACE=${ELASTICSEARCH_NAMESPACE}
ELASTICSEARCH_POD_LABEL=${ELASTICSEARCH_POD_LABEL}
ELASTICSEARCH_CONTAINER_NAME=${ELASTICSEARCH_CONTAINER_NAME}

# Check if Elasticsearch pods are running (column 3 of `kubectl get pods` is STATUS)
NOT_RUNNING=$(kubectl get pods -n "$NAMESPACE" -l "$ELASTICSEARCH_POD_LABEL" --no-headers | awk '$3 != "Running"')
if [ -z "$NOT_RUNNING" ]; then
    echo "All Elasticsearch pods are running."
else
    echo "One or more Elasticsearch pods are not running:"
    echo "$NOT_RUNNING"
fi

# Check if Elasticsearch nodes are responsive
for POD in $(kubectl get pods -n "$NAMESPACE" -l "$ELASTICSEARCH_POD_LABEL" --no-headers | awk '$3 == "Running" {print $1}'); do
    # A successful HTTP response counts as responsive; the cluster status is reported separately
    if HEALTH=$(kubectl exec -n "$NAMESPACE" "$POD" -c "$ELASTICSEARCH_CONTAINER_NAME" -- curl -sf http://localhost:9200/_cluster/health); then
        STATUS=$(echo "$HEALTH" | sed -n 's/.*"status" *: *"\([a-z]*\)".*/\1/p')
        echo "$POD is responding (cluster status: ${STATUS:-unknown})."
    else
        echo "$POD is not responding:"
        kubectl logs -n "$NAMESPACE" "$POD" -c "$ELASTICSEARCH_CONTAINER_NAME" --tail=100
    fi
done
```

## Repair

### If there is a missing data node in the Elasticsearch cluster, add a new node or replace the missing one.
```shell
#!/bin/bash

# Define the name of the Elasticsearch deployment, its namespace, and the number of replicas.
# If Elasticsearch runs as a StatefulSet, use `statefulset` instead of `deployment` below.
deployment_name=${DEPLOYMENT_NAME}
replica_count=${REPLICA_COUNT}
namespace=${ELASTICSEARCH_NAMESPACE}

# Get the current number of replicas
current_replicas=$(kubectl get deployment "$deployment_name" -n "$namespace" -o=jsonpath="{.spec.replicas}")

# If the current number of replicas is less than the desired replica count, scale up the deployment
if [ "$current_replicas" -lt "$replica_count" ]; then
    kubectl scale deployment "$deployment_name" -n "$namespace" --replicas="$replica_count"
fi
```
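
Scaling only changes the desired replica count; the new pods still have to start and become ready. A minimal sketch for waiting on that, assuming the workload is a Deployment as in the script above. The function name `wait_for_rollout` and the default timeout of `300s` are illustrative choices, not part of the original runbook.

```shell
# Hypothetical helper: block until the deployment's rollout completes or the timeout expires.
# Arguments: deployment name, namespace, optional timeout (defaults to 300s).
wait_for_rollout() {
    kubectl rollout status "deployment/$1" -n "$2" --timeout="${3:-300s}"
}

# Example usage (requires access to the cluster):
# wait_for_rollout "${DEPLOYMENT_NAME}" "${ELASTICSEARCH_NAMESPACE}"
```

Once the rollout finishes, re-running the cluster health check from the Debug section shows whether the new node has joined.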



