Runbook

Kafka Disk IO Latency Spike Incident


Overview

This incident type covers a sudden increase in disk input/output (I/O) latency on a Kafka broker. Because Kafka persists messages to disk, slow disk I/O can degrade broker performance, slow message processing, and in severe cases reduce system availability. Possible causes include hardware failure, network issues, and software bugs. Identify and resolve the cause quickly to prevent disruption and keep the broker operating smoothly.

Debug

Check the current CPU usage
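
One way to carry out this check is sketched below. It assumes a Linux host that exposes /proc/stat; the function names are illustrative and not part of any Kafka tooling. CPU usage is estimated from the difference between two samples of the aggregate "cpu" line.

```python
import time


def parse_cpu_times(stat_text):
    """Return (busy, total) jiffies from the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            values = [int(v) for v in fields[1:]]
            # Columns: user nice system idle iowait irq softirq steal ...
            # iowait is counted as idle time here; guest columns are skipped
            # because the kernel already includes them in user/nice.
            idle = values[3] + (values[4] if len(values) > 4 else 0)
            total = sum(values[:8])
            return total - idle, total
    raise ValueError("no aggregate 'cpu' line found")


def cpu_busy_percent(before_text, after_text):
    """Percentage of CPU time spent busy between two /proc/stat samples."""
    busy0, total0 = parse_cpu_times(before_text)
    busy1, total1 = parse_cpu_times(after_text)
    elapsed = total1 - total0
    return 100.0 * (busy1 - busy0) / elapsed if elapsed > 0 else 0.0


def sample_cpu_busy_percent(interval_seconds=1.0, path="/proc/stat"):
    """Read /proc/stat twice, interval_seconds apart (Linux only)."""
    with open(path) as handle:
        before = handle.read()
    time.sleep(interval_seconds)
    with open(path) as handle:
        after = handle.read()
    return cpu_busy_percent(before, after)
```

Calling `sample_cpu_busy_percent(1.0)` on the broker host returns the busy percentage over one second. A consistently high value suggests the broker is CPU-bound rather than disk-bound.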

Check if Kafka is running
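
A minimal sketch of this check, assuming a Linux host and a broker started through the standard scripts, so that its Java command line contains the main class `kafka.Kafka`. The function name is illustrative. Deployments that run Kafka under systemd or in a container may prefer their own status commands.

```python
import os


def find_kafka_pids(proc_root="/proc", marker="kafka.Kafka"):
    """Return sorted PIDs whose command line contains the Kafka main class."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as handle:
                # Arguments in /proc/<pid>/cmdline are separated by NUL bytes.
                args = handle.read().split(b"\0")
        except OSError:
            continue  # process exited or is not readable
        if marker.encode() in args:
            pids.append(int(entry))
    return sorted(pids)
```

An empty list from `find_kafka_pids()` means no matching broker process was found on the host.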

Check the current disk usage
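
The disk usage check can be sketched with the Python standard library. The path passed in should be the broker's data directory (the `log.dirs` setting); the path `/var/lib/kafka/data` mentioned below is only a placeholder assumption.

```python
import shutil


def used_percent(total_bytes, used_bytes):
    """Percentage of a filesystem's capacity that is in use."""
    return 100.0 * used_bytes / total_bytes if total_bytes > 0 else 0.0


def disk_used_percent(path):
    """Used percentage of the filesystem that contains path."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return used_percent(usage.total, usage.used)
```

For example, `disk_used_percent("/var/lib/kafka/data")` reports how full the data volume is. A nearly full volume is a common reason to apply the retention changes described under Repair.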

Check the disk I/O utilization
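
As a rough sketch of this check, device utilization can be estimated from two samples of /proc/diskstats on Linux, using the "milliseconds spent doing I/O" counter (the 13th column of each line). The function names are illustrative. This is the same quantity that tools such as iostat report as %util.

```python
def parse_io_ticks(diskstats_text):
    """Map device name -> cumulative milliseconds spent doing I/O."""
    ticks = {}
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) >= 13:
            # fields[2] is the device name, fields[12] the I/O time counter.
            ticks[fields[2]] = int(fields[12])
    return ticks


def utilization_percent(before_text, after_text, interval_seconds):
    """Per-device busy percentage between two samples taken interval_seconds apart."""
    before = parse_io_ticks(before_text)
    after = parse_io_ticks(after_text)
    interval_ms = interval_seconds * 1000.0
    return {
        device: min(100.0, 100.0 * (after[device] - before[device]) / interval_ms)
        for device in after
        if device in before
    }
```

Read /proc/diskstats twice, a known number of seconds apart, and pass both texts to `utilization_percent`. Values close to 100 for the device backing the Kafka data directory indicate a saturated disk.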

Check the disk I/O wait time
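
One possible reading of this check is the average time each I/O request takes to complete, which iostat calls await. The sketch below assumes Linux and derives it from two samples of /proc/diskstats; the function names are illustrative.

```python
def parse_io_counters(diskstats_text):
    """Map device -> (completed operations, milliseconds spent on them)."""
    counters = {}
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) >= 11:
            operations = int(fields[3]) + int(fields[7])   # reads + writes completed
            spent_ms = int(fields[6]) + int(fields[10])    # ms reading + ms writing
            counters[fields[2]] = (operations, spent_ms)
    return counters


def average_wait_ms(before_text, after_text):
    """Average milliseconds per completed I/O between two samples, per device."""
    before = parse_io_counters(before_text)
    after = parse_io_counters(after_text)
    result = {}
    for device, (ops_after, ms_after) in after.items():
        if device not in before:
            continue
        ops_before, ms_before = before[device]
        completed = ops_after - ops_before
        # A device with no completed I/O in the interval reports 0.0.
        result[device] = (ms_after - ms_before) / completed if completed > 0 else 0.0
    return result
```

A sharp rise in this value for the Kafka data disk, compared with its normal baseline, is the latency spike this runbook is about.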

Check the current network usage
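
A minimal sketch of the network check, assuming a Linux host with /proc/net/dev. It computes receive and transmit throughput per interface from two samples; the function names are illustrative.

```python
def parse_net_dev(text):
    """Map interface name -> (received bytes, transmitted bytes)."""
    counters = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # header lines carry no interface name
        name, _, rest = line.partition(":")
        fields = rest.split()
        if len(fields) >= 9:
            # Column 1 is received bytes, column 9 is transmitted bytes.
            counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters


def throughput_bytes_per_second(before_text, after_text, interval_seconds):
    """Per-interface (rx, tx) bytes per second between two samples."""
    before = parse_net_dev(before_text)
    after = parse_net_dev(after_text)
    return {
        name: (
            (after[name][0] - before[name][0]) / interval_seconds,
            (after[name][1] - before[name][1]) / interval_seconds,
        )
        for name in after
        if name in before
    }
```

Comparing the result with the interface's link speed shows whether replication or client traffic is close to saturating the network.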

Check the Kafka logs for any errors or warnings
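
The log check can be sketched as a simple filter. It assumes the broker writes lines in the form `[timestamp] LEVEL message`, which matches Kafka's default log4j pattern but may differ if logging has been customized. The function names and any log path are illustrative.

```python
import re

# Matches lines such as "[<timestamp>] ERROR ..." or "[<timestamp>] WARN ...".
LEVEL_PATTERN = re.compile(r"^\[[^\]]+\]\s+(WARN|ERROR|FATAL)\b")


def find_problem_lines(lines):
    """Return the lines logged at WARN, ERROR, or FATAL level."""
    return [line.rstrip("\n") for line in lines if LEVEL_PATTERN.match(line)]


def scan_log_file(path):
    """Apply find_problem_lines to a log file, tolerating bad encoding."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return find_problem_lines(handle)
```

Point `scan_log_file` at the broker's server.log, whose location depends on the installation. Stack trace continuation lines are not matched, so open the file around any hit to read the full error.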

Repair

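Implement data retention policies to limit how much data each Kafka broker stores, and periodically archive or delete older data. This frees disk space and can improve disk I/O performance. In Kafka, retention is typically controlled per topic with the retention.ms and retention.bytes settings, or broker-wide with the log.retention.* defaults.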
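
As a hedged illustration of applying a retention policy, the sketch below builds (but does not run) a `kafka-configs.sh` command that sets `retention.ms` and, optionally, `retention.bytes` on one topic. The topic name, bootstrap address, and helper names are assumptions. Note that `retention.bytes` applies per partition, not per topic.

```python
def retention_ms(days):
    """Convert a retention period in days to milliseconds."""
    return int(days * 24 * 60 * 60 * 1000)


def build_retention_command(topic, days, bootstrap_server="localhost:9092",
                            max_bytes=None):
    """Build the argument list for kafka-configs.sh; the caller decides whether to run it."""
    configs = [f"retention.ms={retention_ms(days)}"]
    if max_bytes is not None:
        # retention.bytes limits the size of each partition, not the whole topic.
        configs.append(f"retention.bytes={max_bytes}")
    return [
        "kafka-configs.sh",
        "--bootstrap-server", bootstrap_server,
        "--alter",
        "--entity-type", "topics",
        "--entity-name", topic,
        "--add-config", ",".join(configs),
    ]
```

The returned list can be reviewed and then passed to `subprocess.run`. Shortening retention deletes data once the broker's cleaner runs, so confirm that no consumer still needs the older messages first.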
