A Dead Kafka Node Detection incident occurs when a node (broker) in a Kafka cluster goes down or fails. Kafka is an open-source distributed event streaming platform used to build real-time streaming data pipelines and applications. When a node goes down, the partitions it hosts can become under-replicated or unavailable, and depending on replication and client settings this can lead to data loss or message duplication. Detecting and resolving dead Kafka nodes quickly therefore minimizes the impact on the system's performance and data integrity.
Parameters
Debug
Check if the Kafka service is running
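A minimal sketch of this check, assuming the broker is managed by systemd under a unit named `kafka` (the unit name is an assumption; adjust it to your installation):

```python
import subprocess


def service_is_active(unit: str = "kafka") -> bool:
    """Return True if systemd reports the given unit as active.

    The default unit name "kafka" is an assumption, not a Kafka convention.
    """
    try:
        result = subprocess.run(
            ["systemctl", "is-active", "--quiet", unit],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
    except OSError:
        # systemctl is missing or not executable (not a systemd host).
        return False
    # "systemctl is-active --quiet" exits with 0 only when the unit is active.
    return result.returncode == 0
```

Calling `service_is_active()` on the affected host returns `False` when the unit is stopped, failed, or unknown.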
Check if the Kafka process is running
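One way to sketch this check on Linux is to scan `/proc` for a command line that contains the broker's main class, `kafka.Kafka`. The function name `find_pids` and the `proc_root` parameter are illustrative, not part of any Kafka tooling:

```python
import os
from typing import List


def find_pids(needle: str = "kafka.Kafka", proc_root: str = "/proc") -> List[int]:
    """Return the PIDs whose command line contains `needle` (Linux only)."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as fh:
                # Arguments in /proc/<pid>/cmdline are separated by NUL bytes.
                cmdline = fh.read().replace(b"\0", b" ").decode(errors="replace")
        except OSError:
            continue  # the process exited or is not readable by this user
        if needle in cmdline:
            pids.append(int(entry))
    return sorted(pids)
```

An empty list from `find_pids()` suggests that no broker JVM is running on the host.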
Check if the Kafka node is reachable from other nodes in the cluster
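Reachability can be approximated with a plain TCP connection attempt from another node to the broker's listener. The sketch below assumes the default listener port 9092; your `listeners` setting may use a different port:

```python
import socket


def port_is_reachable(host: str, port: int = 9092, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False
```

Run it from a healthy node against the suspect node's hostname. A `False` result points to a stopped broker, a firewall rule, or a network partition, and does not by itself distinguish between them.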
Check if the Kafka node is able to communicate with ZooKeeper
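A rough sketch of this check, run from the Kafka node, sends ZooKeeper's `ruok` four-letter command and expects `imok` in reply. It assumes the default client port 2181 and that `ruok` is allowed by the server's `4lw.commands.whitelist` setting; newer ZooKeeper versions do not enable it by default, in which case this function returns `False` even for a healthy server:

```python
import socket


def zookeeper_ruok(host: str, port: int = 2181, timeout: float = 3.0) -> bool:
    """Return True if the ZooKeeper server answers "ruok" with "imok"."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            sock.sendall(b"ruok")
            reply = b""
            # Read until we have the 4-byte answer or the server closes.
            while len(reply) < 4:
                chunk = sock.recv(16)
                if not chunk:
                    break
                reply += chunk
    except OSError:
        return False
    return reply == b"imok"
```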
Check if the Kafka node is able to access the Kafka data directory
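The data directories are listed under `log.dirs` in the broker's `server.properties`. The following sketch reads that setting and checks each directory for read, write, and traverse permission. The helper names are illustrative, the path to `server.properties` depends on your installation, and the fallback `log.dir` setting is not handled:

```python
import os
from typing import Dict, Iterable, List


def read_log_dirs(properties_path: str) -> List[str]:
    """Return the directories listed under `log.dirs`, or [] if it is unset."""
    with open(properties_path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            if key.strip() == "log.dirs":
                return [p.strip() for p in value.split(",") if p.strip()]
    return []


def check_dirs(paths: Iterable[str]) -> Dict[str, bool]:
    """Map each path to True if it is a directory with r/w/x access."""
    mode = os.R_OK | os.W_OK | os.X_OK
    return {p: os.path.isdir(p) and os.access(p, mode) for p in paths}
```

Because `os.access` tests the permissions of the calling user, run the check as the account that owns the Kafka process.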
Check if there are any Kafka logs indicating a potential issue with the node
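As a starting point, the broker log can be filtered for `ERROR` and `FATAL` entries. This sketch assumes the stock log4j layout, in which each entry begins with a bracketed timestamp followed by the level; a customized layout needs a different pattern:

```python
import re
from typing import Iterable, List

# Matches entries such as "[<timestamp>] ERROR ..." at the start of a line.
LEVEL_RE = re.compile(r"^\[[^\]]+\]\s+(ERROR|FATAL)\b")


def find_problem_lines(lines: Iterable[str]) -> List[str]:
    """Return the log lines whose level is ERROR or FATAL."""
    return [line.rstrip("\n") for line in lines if LEVEL_RE.match(line)]
```

A typical call opens the broker's `server.log` (its location depends on the installation) and passes the file object to `find_problem_lines`. Continuation lines of stack traces are not captured, so inspect the surrounding lines in the file as well.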
Repair
Replace the failed node: If the checks above show that the node is defective, replace it with a new one.
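If the replacement node joins with a different broker ID than the failed one, the partitions that listed the dead broker as a replica have to be moved to it. The sketch below only builds the JSON document that `kafka-reassign-partitions.sh` accepts. It assumes you have already collected the current assignment (for example from `kafka-topics.sh --describe`) and that `new_id` is a fresh broker ID not yet present in any replica list; the function and parameter names are illustrative:

```python
from typing import Dict, List, Tuple


def build_reassignment(
    assignment: Dict[Tuple[str, int], List[int]],
    dead_id: int,
    new_id: int,
) -> dict:
    """Swap `dead_id` for `new_id` in every replica list that contains it.

    `assignment` maps (topic, partition) to the ordered list of replica
    broker IDs. Partitions that do not involve the dead broker are omitted.
    """
    partitions = []
    for (topic, partition), replicas in sorted(assignment.items()):
        if dead_id not in replicas:
            continue
        partitions.append({
            "topic": topic,
            "partition": partition,
            "replicas": [new_id if r == dead_id else r for r in replicas],
        })
    return {"version": 1, "partitions": partitions}
```

Serialize the result with `json.dumps`, save it to a file, and pass that file to the tool's `--reassignment-json-file` option. If the new node reuses the failed node's broker ID, no reassignment plan should be needed, because the broker re-replicates its partitions when it rejoins.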