Runbook

Cassandra cluster node unresponsive, resulting in data unavailability

Overview

This incident type covers an unresponsive node in a Cassandra cluster, which can make data unavailable and disrupt dependent services. Causes vary, but possible ones include hardware failure and network problems. Address such incidents quickly to minimize the impact on affected services and restore data availability.

Debug

Check the connection to Node A by pinging its IP address
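A minimal sketch of this check, assuming a Linux host with the iputils `ping` binary on the PATH. `NODE_IP` is a hypothetical placeholder for Node A's address, and the function names are illustrative only.

```python
import subprocess

NODE_IP = "10.0.0.1"  # hypothetical placeholder for Node A's address


def build_ping_command(host, count=3, timeout_s=2):
    """Build a Linux ping command: `count` probes, `timeout_s` seconds per reply."""
    return ["ping", "-c", str(count), "-W", str(timeout_s), host]


def is_reachable(host, count=3, timeout_s=2):
    """Return True if ping exits with status 0, i.e. at least one reply arrived."""
    result = subprocess.run(
        build_ping_command(host, count, timeout_s),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0
```

Calling `is_reachable(NODE_IP)` from another cluster member gives a quick yes/no answer. A failed ping does not prove the node is down, since ICMP may be filtered along the path.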

Check if the Cassandra service is running on Node A
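One way to script this check, assuming Cassandra was installed as a systemd unit named `cassandra` (an assumption; the unit name depends on how the node was installed). The sketch runs on Node A itself.

```python
import subprocess

SERVICE_NAME = "cassandra"  # assumed systemd unit name; adjust for your install


def is_active(systemctl_output):
    """Interpret the one-word state printed by `systemctl is-active`."""
    return systemctl_output.strip() == "active"


def service_is_running(unit=SERVICE_NAME):
    """Run `systemctl is-active <unit>` and report whether the unit is active."""
    result = subprocess.run(
        ["systemctl", "is-active", unit],
        capture_output=True,
        text=True,
    )
    return is_active(result.stdout)
```

An active unit only shows that the process is up. The node can still be unresponsive to the rest of the cluster, so the later checks remain relevant.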

Check network connectivity between Node A and other nodes in the cluster
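The sketch below tests TCP reachability from another cluster member to Node A. It assumes Cassandra's default ports (7000 for inter-node traffic, 9042 for CQL clients); a cluster may override these in `cassandra.yaml`, and the helper names are illustrative.

```python
import socket

# Cassandra's default ports: 7000 for inter-node traffic, 9042 for CQL clients.
# Clusters can override these in cassandra.yaml, so treat them as assumptions.
DEFAULT_PORTS = {"inter-node": 7000, "cql": 9042}


def port_open(host, port, timeout_s=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_ports(host, ports=DEFAULT_PORTS):
    """Map each named port to True (reachable) or False (refused or timed out)."""
    return {name: port_open(host, port) for name, port in ports.items()}
```

Running `check_ports` against Node A's address from each of the other nodes helps tell a single broken path apart from a node that is unreachable from everywhere.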

Check if there are any hardware issues on Node A

Check if there are any pending repairs or compactions for the affected keyspace
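As a rough illustration of the compaction half of this check, the sketch below parses the `pending tasks:` line printed by `nodetool compactionstats`. It assumes `nodetool` is on the PATH of the node being inspected, reports the node-wide total rather than a per-keyspace figure, and does not cover repairs.

```python
import re
import subprocess


def pending_compactions(compactionstats_output):
    """Extract the count from the `pending tasks: N` line; None if it is absent."""
    match = re.search(
        r"^pending tasks:\s*(\d+)", compactionstats_output, re.MULTILINE
    )
    return int(match.group(1)) if match else None


def read_pending_compactions():
    """Run `nodetool compactionstats` on the local node and parse its output."""
    result = subprocess.run(
        ["nodetool", "compactionstats"],
        capture_output=True,
        text=True,
        check=True,
    )
    return pending_compactions(result.stdout)
```

A count that stays high or keeps growing suggests the node is falling behind on compaction, which can add to the load on an already struggling node.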

Check if there are any disk space issues on Node A
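A small sketch of a disk usage check using Python's standard library. The data directory path is an assumption (a common default for package installs) and the 80% threshold is arbitrary; confirm the actual directories in `cassandra.yaml`.

```python
import shutil

DATA_DIR = "/var/lib/cassandra"  # assumed data directory; check cassandra.yaml


def disk_used_percent(path):
    """Return the percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def is_disk_low(used_percent, threshold=80.0):
    """Flag the disk once usage reaches the (arbitrary) threshold percentage."""
    return used_percent >= threshold
```

For example, `is_disk_low(disk_used_percent(DATA_DIR))` run on Node A flags a nearly full data volume. Compaction needs free headroom to write new files, so a full disk can stall the node well before usage reaches 100%.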

Check for network issues, such as packet loss or high latency, between Node A and the other nodes in the cluster
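Beyond simple reachability, the summary lines printed by `ping` give a first indication of loss and latency. The sketch below parses captured `ping` output; it assumes the summary format used by common Linux and macOS `ping` builds, and the function names are illustrative.

```python
import re


def packet_loss_percent(ping_output):
    """Extract the loss percentage from ping's summary line; None if not found."""
    match = re.search(r"(\d+(?:\.\d+)?)% packet loss", ping_output)
    return float(match.group(1)) if match else None


def average_rtt_ms(ping_output):
    """Extract the average round-trip time in ms from the min/avg/max line."""
    match = re.search(r"min/avg/max/\w+ = [\d.]+/([\d.]+)/", ping_output)
    return float(match.group(1)) if match else None
```

Feed these functions the text captured from a command such as `ping -c 20` run from a peer node against Node A. Any non-zero loss, or a round-trip time far above what other node pairs show, points at the network path rather than at Cassandra.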

Check if there are any firewall rules blocking traffic to Node A
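The following sketch scans the output of `iptables -S` for rules that drop or reject traffic to Cassandra's ports. It assumes the host uses iptables (not nftables, firewalld, or a cloud security group), that the rule dump was captured with root privileges, and that the default ports 7000 and 9042 are in use. Port ranges and `multiport` matches are not handled.

```python
import re

BLOCKING_TARGETS = {"DROP", "REJECT"}


def blocking_rules(iptables_rules, ports):
    """Return rules from `iptables -S` output that drop or reject the given destination ports."""
    hits = []
    for line in iptables_rules.splitlines():
        target = re.search(r"-j (\S+)", line)
        dport = re.search(r"--dport (\d+)\b", line)
        if not target or not dport:
            continue  # policy lines and rules without a single destination port
        if target.group(1) in BLOCKING_TARGETS and int(dport.group(1)) in ports:
            hits.append(line.strip())
    return hits
```

Passing the captured rule dump and `{7000, 9042}` returns the matching rule lines, which can then be reviewed by hand. An empty result does not rule out a broader rule, such as a default DROP policy, blocking the traffic.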

Check for resource exhaustion on Node A (e.g. CPU, memory)
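A minimal sketch of an OS-level resource check, assuming Node A runs Linux so that `/proc/meminfo` and load averages are available. It does not inspect the JVM heap that Cassandra runs in, which may be exhausted even when the host has free memory.

```python
import os


def mem_available_percent(meminfo_text):
    """Return MemAvailable as a percentage of MemTotal from /proc/meminfo text."""
    fields = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[name] = int(parts[0])  # memory sizes are given in kB
    return 100.0 * fields["MemAvailable"] / fields["MemTotal"]


def load_per_cpu():
    """Return the 1-minute load average divided by the number of CPUs."""
    return os.getloadavg()[0] / (os.cpu_count() or 1)
```

On the node, `mem_available_percent(open("/proc/meminfo").read())` gives the share of memory still available, and a `load_per_cpu()` value well above 1.0 suggests the CPUs are saturated.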

Repair

Restart the unresponsive node and check whether it rejoins the cluster. If it does, monitor it closely for recurring issues.
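The restart and the follow-up check could be sketched as below. This assumes a systemd unit named `cassandra`, `sudo` rights on the node, and `nodetool` available on a healthy peer; these are assumptions, not details from this runbook. The parser reads the two-letter codes in `nodetool status` output, where `UN` means Up/Normal and `DN` means Down/Normal.

```python
import re
import subprocess

SERVICE_NAME = "cassandra"  # assumed systemd unit name


def node_state(status_output, address):
    """Return the two-letter state (e.g. UN, DN) for `address`, or None if absent."""
    for line in status_output.splitlines():
        # First letter: Up/Down. Second: Normal/Leaving/Joining/Moving.
        match = re.match(r"^([UD][NLJM])\s+(\S+)", line)
        if match and match.group(2) == address:
            return match.group(1)
    return None


def restart_service(unit=SERVICE_NAME):
    """Restart the Cassandra unit via systemd; raises CalledProcessError on failure."""
    subprocess.run(["sudo", "systemctl", "restart", unit], check=True)
```

After calling `restart_service()` on Node A, capture `nodetool status` on a peer and pass it to `node_state` with Node A's address. The node has rejoined once the result is `UN`; it may take some time after the restart before that state appears.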
