---
id: c122dae5-cd90-4817-94c6-ee89453c2b28
---

# Cassandra cluster node unresponsive and resulting data unavailability
---

This incident type covers an unresponsive node in a Cassandra cluster, which can make data unavailable and disrupt the services that depend on it. Causes vary, but common ones include hardware failure and network problems. Address these incidents quickly to limit the impact on affected services and restore data availability.

### Parameters
```shell
export NODE_IP="PLACEHOLDER"
export KEYSPACE="PLACEHOLDER"
export OTHER_NODE_IP="PLACEHOLDER"
export NODEA="PLACEHOLDER"
export CPU_OR_MEMORY="PLACEHOLDER"
```

## Debug

### Check the connection to Node A by pinging its IP address
```shell
ping -c 4 ${NODE_IP}
```

### Check if the Cassandra service is running on Node A
```shell
systemctl status cassandra
```

### Check the Cassandra logs for any errors related to Node A
```shell
tail -n 100 /var/log/cassandra/system.log | grep -F "${NODE_IP}"
```
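The log check above returns every recent line that mentions the node, including routine INFO messages. A minimal sketch of a narrower filter follows; the function name `filter_node_errors` is hypothetical, and it assumes the log level appears as the literal word `ERROR` or `WARN` on each line, which depends on the logging configuration in use.

```shell
# Print only ERROR/WARN lines from stdin that mention the given IP address.
# Assumption: the level is written as the literal word ERROR or WARN.
# Note: the IP is matched as a fixed string, so 10.0.0.2 also matches 10.0.0.21.
filter_node_errors() {
  local ip="$1"
  grep -E 'ERROR|WARN' | grep -F -- "$ip"
}
```

For example, `tail -n 1000 /var/log/cassandra/system.log | filter_node_errors "${NODE_IP}"` would show only the warnings and errors that reference the node.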

### Check which nodes the cluster reports as Up/Normal (UN); Node A is missing from this list if the other nodes cannot reach it
```shell
nodetool status | grep '^UN'
```
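Filtering for `UN` shows the healthy nodes, so the unresponsive node has to be spotted by its absence. As a rough sketch, the inverse filter below lists every node whose state code is anything other than `UN`. The function name `list_unhealthy_nodes` is hypothetical, and the parsing assumes the usual `nodetool status` layout in which each node line starts with a two-letter state code followed by the address.

```shell
# Read `nodetool status` output on stdin and print "STATE ADDRESS" for every
# node whose two-letter state code is not UN (Up/Normal).
# Header and separator lines do not match the state-code pattern and are skipped.
list_unhealthy_nodes() {
  awk '$1 ~ /^[UD][NLJM]$/ && $1 != "UN" { print $1, $2 }'
}
```

Run it as `nodetool status | list_unhealthy_nodes`; empty output means every node is reported as Up/Normal.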

### Check if there are any hardware issues on Node A
```shell
ipmitool sel list
```

### Check if there are any pending repairs or compactions for the affected keyspace
```shell
nodetool tpstats | grep -i repair

nodetool compactionstats | grep "${KEYSPACE}"
```
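When the keyspace filter returns nothing, it can help to look at the overall compaction backlog instead. The sketch below pulls the pending-task count out of `nodetool compactionstats` output. It assumes the output contains a line of the form `pending tasks: N`; the function name `pending_compactions` is hypothetical.

```shell
# Read `nodetool compactionstats` output on stdin and print the number that
# follows "pending tasks:". Prints nothing if no such line is present.
pending_compactions() {
  awk -F': *' '/^pending tasks:/ { print $2; exit }'
}
```

For example, `nodetool compactionstats | pending_compactions` prints a single number that can be compared against whatever backlog is normal for the cluster.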

### Check if there are any disk space issues on Node A
```shell
df -h
```

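Scanning `df -h` by eye works, but a threshold check is easier to reuse. The following sketch reads POSIX-format `df -P` output and prints only the filesystems at or above a usage threshold. The function name `disks_over_threshold` and the default of 80% are illustrative assumptions, and mount points that contain spaces are not handled.

```shell
# Read `df -P` output on stdin and print "MOUNTPOINT USE%" for each filesystem
# whose usage is at or above the threshold (first argument, default 80).
disks_over_threshold() {
  local threshold="${1:-80}"
  awk -v t="$threshold" 'NR > 1 {
    use = $5
    sub(/%/, "", use)          # strip the percent sign before comparing
    if (use + 0 >= t) print $6, $5
  }'
}
```

For example, `df -P | disks_over_threshold 85` lists any filesystem that is at least 85% full.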
### Check if there are any network issues between Node A and other nodes in the cluster
```shell
traceroute ${OTHER_NODE_IP}
```

### Check if there are any firewall rules blocking traffic to Node A
```shell
systemctl status firewalld
sudo firewall-cmd --list-all
```

### Check for resource exhaustion on Node A (e.g. CPU, memory)
```shell
#!/bin/bash

# Target node and resource type, taken from the parameters above.
# The resource is matched case-insensitively, so "cpu" or "mem" work well with top.
NODE="${NODEA}"
RESOURCE="${CPU_OR_MEMORY}"

# Check the current resource usage on the node
echo "Checking $RESOURCE usage on $NODE..."
ssh "$NODE" "top -b -n 1 | grep -i '$RESOURCE'"

# Check the system logs for any related errors or warnings
echo "Checking system logs on $NODE..."
ssh "$NODE" "grep -i '$RESOURCE' /var/log/syslog"

# Check the Cassandra logs for any related errors or warnings
echo "Checking Cassandra logs on $NODE..."
ssh "$NODE" "grep -i '$RESOURCE' /var/log/cassandra/system.log"

# Check the network connectivity between the node and other cluster nodes
echo "Checking network connectivity on $NODE..."
ssh "$NODE" "ping -c 4 ${OTHER_NODE_IP}"

# Check the Cassandra cluster status and node health
echo "Checking Cassandra cluster status on $NODE..."
ssh "$NODE" "nodetool status"
ssh "$NODE" "nodetool describecluster"

# Perform any necessary diagnostics or remediation steps based on the findings
```
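The `top` output above still has to be interpreted by hand. As a small sketch of turning the memory check into a yes/no answer, the helpers below parse `free -m` output and compare usage against a threshold. The function names and the 90% default are hypothetical, and the parsing assumes the English `Mem:` row label with total and used in the second and third columns.

```shell
# Read `free -m` output on stdin and print used memory as a whole-number
# percentage of total (fractions are truncated).
mem_used_percent() {
  awk '/^Mem:/ { printf "%d\n", $3 * 100 / $2 }'
}

# Return 0 and print a warning when usage is at or above the threshold
# (first argument, default 90); return 1 otherwise.
mem_is_exhausted() {
  local threshold="${1:-90}" pct
  pct="$(mem_used_percent)"
  if [ "$pct" -ge "$threshold" ]; then
    echo "Memory usage ${pct}% is at or above ${threshold}%"
    return 0
  fi
  return 1
}
```

For example, `ssh "$NODEA" free -m | mem_is_exhausted 90` prints a warning only when the node is using 90% or more of its memory.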

## Repair

### Restart the unresponsive node and check whether it rejoins the cluster. If it does, monitor it closely for further issues.
```shell
#!/bin/bash

# Node to restart, taken from the parameters above
NODE="${NODEA}"

# Restart the Cassandra service on the node
ssh "$NODE" "sudo service cassandra restart"

# Confirm that the node has rejoined the cluster (its state should be UN)
ssh "$NODE" "nodetool status"

# Monitor the node for any future issues
# You can use a tool like Nagios, Zabbix, or Datadog to monitor the node
# Alternatively, you can rerun "nodetool status" to check the node's status
```
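A restarted node usually needs some time before it is reported as Up/Normal, so a single status check right after the restart can be misleading. The sketch below polls until the node reaches `UN` or a retry limit is hit. The function names, the default of 30 attempts at 10-second intervals, and the `STATUS_CMD` override (included only so the loop can be exercised without a live cluster) are all assumptions for illustration.

```shell
# Read `nodetool status` output on stdin and print the two-letter state code
# of the node with the given address.
node_state() {
  local ip="$1"
  awk -v ip="$ip" '$2 == ip { print $1; exit }'
}

# Poll until the node reports UN or the attempts run out.
# Arguments: node IP, number of attempts (default 30), delay in seconds (default 10).
# STATUS_CMD is a hypothetical hook; it defaults to `nodetool status`.
wait_for_node_up() {
  local ip="$1" attempts="${2:-30}" delay="${3:-10}" i=1 state=""
  while [ "$i" -le "$attempts" ]; do
    state="$(${STATUS_CMD:-nodetool status} | node_state "$ip")"
    if [ "$state" = "UN" ]; then
      echo "Node $ip is UN after $i check(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "Node $ip did not reach UN after $attempts checks (last state: ${state:-unknown})" >&2
  return 1
}
```

For example, `wait_for_node_up "${NODE_IP}" 30 10` waits up to roughly five minutes and returns a non-zero exit code if the node never rejoins.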
