---
id: b99bb480-da90-48bc-8b93-718ac16b6857
---

# Overheating reported by Prometheus exporter at {{ $labels.instance }} for Cassandra service.
---

This incident fires when the Prometheus exporter reports overheating on a specific instance running the Cassandra service. Likely causes are a high volume of traffic or a problem with the server itself. It requires immediate attention to avoid downtime or data loss, and is assigned to an engineer to investigate and resolve.

### Parameters
```shell
export PROMETHEUS_EXPORTER_POD_NAME="PLACEHOLDER"
export PROMETHEUS_NODE_EXPORTER_POD_NAME="PLACEHOLDER"
export NAMESPACE="PLACEHOLDER"
export DEPLOYMENT_NAME="PLACEHOLDER"
export DESIRED_REPLICAS="PLACEHOLDER"
```

## Debug

### Check the status of the Cassandra pods
```shell
kubectl get pods -l app=cassandra
```

### Check the CPU and memory usage of the Cassandra pods
```shell
kubectl top pods -l app=cassandra
```

### Check the CPU and memory usage of the Prometheus exporter pod
```shell
kubectl top pods -l app=prometheus-exporter
```

### View the logs of the Prometheus exporter pod
```shell
kubectl logs ${PROMETHEUS_EXPORTER_POD_NAME}
```

### Check the status of the Prometheus service
```shell
kubectl get svc prometheus
```

### Check the status of the Prometheus Node Exporter pods
```shell
kubectl get pods -l app=prometheus-node-exporter
```

### View the logs of the Prometheus Node Exporter pod
```shell
kubectl logs ${PROMETHEUS_NODE_EXPORTER_POD_NAME}
```
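
Exporter logs can be long, so it may help to narrow them to lines that look like problems. The following is a minimal sketch, not part of the original runbook: `filter_problem_lines` is a hypothetical helper, and the keyword list (`error`, `warn`, `fail`, `timeout`) is an assumption that should be adjusted to the log format your exporters actually emit.

```shell
#!/bin/bash

# Hypothetical helper: read log lines on stdin and keep only those that
# mention an error, warning, failure, or timeout (case-insensitive).
# "|| true" keeps the exit status at 0 when nothing matches.
filter_problem_lines() {
  grep -iE 'error|warn|fail|timeout' || true
}

# Example usage (requires cluster access):
#   kubectl logs "${PROMETHEUS_EXPORTER_POD_NAME}" --tail=500 | filter_problem_lines
#   kubectl logs "${PROMETHEUS_NODE_EXPORTER_POD_NAME}" --since=1h | filter_problem_lines
```

The `--tail` and `--since` flags of `kubectl logs` limit how much history is fetched, which keeps the filter fast on busy pods.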

### The server hosting the Prometheus exporter may be overloaded with traffic, causing it to overheat and trigger the incident.
```shell
#!/bin/bash

# Set the namespace and deployment name for the Prometheus exporter
NAMESPACE="${NAMESPACE}"
DEPLOYMENT="${DEPLOYMENT_NAME}"

# Thresholds in the units `kubectl top` reports: millicores (m) and mebibytes (Mi).
# Adjust both to the pod's resource requests and limits.
CPU_THRESHOLD=80
MEMORY_THRESHOLD=80

# Check the current CPU and memory usage of every pod whose name matches the deployment
kubectl top pods -n "$NAMESPACE" --no-headers | grep "$DEPLOYMENT" |
while read -r POD CPU MEMORY; do
  # Strip the unit suffixes so the values compare as integers
  CPU_USAGE="${CPU%m}"
  MEMORY_USAGE="${MEMORY%Mi}"

  if [ "$CPU_USAGE" -gt "$CPU_THRESHOLD" ] || [ "$MEMORY_USAGE" -gt "$MEMORY_THRESHOLD" ]; then
    echo "The server hosting the Prometheus exporter may be overloaded with traffic."
    echo "Pod: $POD"
    echo "CPU usage: $CPU"
    echo "Memory usage: $MEMORY"
  else
    echo "No issues found with Prometheus pod $POD."
  fi
done
```
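
`kubectl top pods` prints CPU in millicores (for example `250m`) and memory in mebibytes (for example `512Mi`), so the raw values cannot be compared with a percentage directly. The sketch below shows one way to normalize them and compare against a percentage of a limit. The function names are hypothetical, the limit passed to `exceeds_percent` is an assumption you must supply from the pod's resource spec, and the branches for other suffixes are defensive rather than something `kubectl top` is known to emit.

```shell
#!/bin/bash

# Convert a CPU value to whole millicores.
# "250m" -> 250; a bare whole number of cores such as "2" -> 2000.
to_millicores() {
  local value="$1"
  case "$value" in
    *m) echo "${value%m}" ;;
    *)  echo $(( value * 1000 )) ;;
  esac
}

# Convert a memory value to whole mebibytes.
# "512Mi" -> 512, "2Gi" -> 2048, "1024Ki" -> 1, plain bytes are divided down.
to_mebibytes() {
  local value="$1"
  case "$value" in
    *Gi) echo $(( ${value%Gi} * 1024 )) ;;
    *Mi) echo "${value%Mi}" ;;
    *Ki) echo $(( ${value%Ki} / 1024 )) ;;
    *)   echo $(( value / 1024 / 1024 )) ;;
  esac
}

# Succeed (exit status 0) when usage is above the given percentage of limit.
# Integer arithmetic only: usage * 100 > limit * percent.
exceeds_percent() {
  local usage="$1" limit="$2" percent="$3"
  [ $(( usage * 100 )) -gt $(( limit * percent )) ]
}

# Example usage, assuming a pod with a 1000m CPU limit:
#   cpu="$(to_millicores 900m)"
#   exceeds_percent "$cpu" 1000 80 && echo "CPU above 80% of its limit"
```

Keeping the conversion separate from the comparison makes the threshold explicit: 80 means 80 percent of a stated limit rather than 80 of whatever unit happens to be printed.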

## Repair

### If the overheating is caused by high traffic, consider scaling up the server or optimizing the Cassandra service to handle the load.
```shell
#!/bin/bash

# Set the namespace and deployment name
NAMESPACE="${NAMESPACE}"
DEPLOYMENT="${DEPLOYMENT_NAME}"

# Get the current number of replicas
REPLICAS=$(kubectl get deployment "$DEPLOYMENT" -n "$NAMESPACE" -o jsonpath='{.spec.replicas}')

# Check if the current number of replicas is less than the desired number
if [ "$REPLICAS" -lt "${DESIRED_REPLICAS}" ]; then
  # Scale up the deployment to the desired number of replicas
  kubectl scale deployment "$DEPLOYMENT" --replicas="${DESIRED_REPLICAS}" -n "$NAMESPACE"
  echo "Scaled up $DEPLOYMENT deployment to ${DESIRED_REPLICAS} replicas."
else
  echo "$DEPLOYMENT deployment already has $REPLICAS replicas (desired: ${DESIRED_REPLICAS})."
fi
```
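
Before scaling, it can be worth checking that `DESIRED_REPLICAS` is a real number and not an unfilled placeholder. After scaling, it helps to confirm the rollout actually completes. The following is a minimal sketch of both steps. The function names are hypothetical, and `KUBECTL` is an assumed override hook (not a kubectl feature) that defaults to the real `kubectl` binary. The 120s default timeout is an arbitrary choice.

```shell
#!/bin/bash

# Succeed only if the argument is a positive whole number such as "3".
# Rejects empty strings, text like "PLACEHOLDER", and zero.
is_positive_integer() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;
    *) [ "$1" -gt 0 ] ;;
  esac
}

# Block until the deployment's rollout completes or the timeout expires.
# Arguments: deployment name, namespace, optional timeout (default 120s).
wait_for_rollout() {
  local deployment="$1" namespace="$2" timeout="${3:-120s}"
  "${KUBECTL:-kubectl}" rollout status "deployment/${deployment}" \
    -n "${namespace}" --timeout="${timeout}"
}

# Example usage (requires cluster access):
#   is_positive_integer "${DESIRED_REPLICAS}" || {
#     echo "DESIRED_REPLICAS must be a positive integer" >&2; exit 1; }
#   wait_for_rollout "${DEPLOYMENT_NAME}" "${NAMESPACE}"
```

`kubectl rollout status` exits non-zero if the rollout does not finish within the timeout, so the caller can treat a failure as a signal to inspect the new pods.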



## Related Runbooks

### Host systemd service crashed (instance) incident.

This incident type refers to the failure of a systemd service on a particular host instance. The incident could be triggered by various causes such as a software bug, hardware failure, or system overload. This type of incident can cause downtime or service disruption to the affected host instance, which may require immediate resolution to restore normal operations.

### High urgency incident related to host context switching.

This incident type relates to a high urgency issue regarding host context switching. The incident is triggered when the context switching grows on the node beyond a certain level, typically over 10000 per second. This issue can cause performance degradation and impact the stability of the system. The incident requires immediate attention from a software engineer to identify the root cause and take the necessary steps to resolve the issue.

### Istio Latency 99 Percentile Incident

This incident type refers to an issue where Istio latency has exceeded the 99th percentile, indicating that the slowest 1% of requests are taking longer than 1 second to complete. This can cause performance issues and impact the user experience. It requires immediate attention and investigation to resolve the issue and prevent any further impact.

### Container high throttle rate incident.

This incident type typically occurs when a container is being throttled, meaning it is being limited in the amount of resources it can use. This can happen due to various reasons such as exceeding resource limits, network connectivity issues, or other performance problems. This can cause an interruption in the normal functioning of the application or service running in the container. It requires immediate attention to identify and resolve the underlying cause of the throttling to ensure normal operation of the application or service.

### Slow Query Performance on Cassandra Cluster.

This incident type refers to a situation where there is a significant delay in the execution of queries on a Cassandra cluster. This delay can cause the system to become unresponsive and result in slower performance. It may be caused by a variety of factors such as an increase in traffic, inefficient queries, or hardware issues. The issue can impact the functionality of the system and requires immediate attention to prevent further disruption.


