Runbook
Kafka In-Sync Replica Count Drops Incident
Overview
This incident covers the case where the in-sync replica (ISR) count for one or more Kafka partitions drops below the expected value, which is normally the replication factor. An in-sync replica is a replica that has caught up with the partition leader. When the ISR count drops, one or more followers have fallen behind the leader or have stopped fetching from it. This reduces redundancy and can lead to data loss or unavailability if the leader then fails. Investigate promptly to identify the root cause and limit further impact.
Parameters
Debug
Check the current ISR count for the given topic
Check the current leader for the given topic
Check the current ISR count for all topics in the cluster
Check the current leader for all topics in the cluster
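The four checks above can all be served by `kafka-topics.sh --describe`, with or without `--topic`. The following is a minimal sketch, assuming the tool is on the PATH under that name and prints partition lines in the usual `Topic: ... Partition: ... Leader: ... Replicas: ... Isr: ...` layout. The helper names (`parse_describe`, `describe`, `shrunk_isr`) are hypothetical, and the exact output layout can differ between Kafka versions.

```python
import re
import subprocess
from typing import List, NamedTuple, Optional

# Matches one partition line of `kafka-topics.sh --describe` output.
# Topic summary lines (PartitionCount, ReplicationFactor, ...) do not match.
PARTITION_LINE = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*(?P<leader>\S+)\s+Replicas:\s*(?P<replicas>[\d,]+)\s+"
    r"Isr:\s*(?P<isr>[\d,]*)"
)


class PartitionState(NamedTuple):
    topic: str
    partition: int
    leader: str          # broker id as printed; may be "none" or "-1"
    replicas: List[int]  # assigned replica broker ids
    isr: List[int]       # broker ids currently in sync


def _ids(field: str) -> List[int]:
    """Turn a comma-separated id list such as '1,2,3' into [1, 2, 3]."""
    return [int(x) for x in field.split(",") if x]


def parse_describe(output: str) -> List[PartitionState]:
    """Extract leader and ISR per partition from describe output."""
    states = []
    for line in output.splitlines():
        m = PARTITION_LINE.search(line)
        if m is None:
            continue
        states.append(PartitionState(
            topic=m["topic"],
            partition=int(m["partition"]),
            leader=m["leader"],
            replicas=_ids(m["replicas"]),
            isr=_ids(m["isr"]),
        ))
    return states


def describe(bootstrap_server: str, topic: Optional[str] = None,
             kafka_topics: str = "kafka-topics.sh") -> List[PartitionState]:
    """Run the CLI for one topic, or for all topics when topic is None."""
    cmd = [kafka_topics, "--bootstrap-server", bootstrap_server, "--describe"]
    if topic is not None:
        cmd += ["--topic", topic]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_describe(result.stdout)


def shrunk_isr(states: List[PartitionState]) -> List[PartitionState]:
    """Partitions whose ISR is smaller than their assigned replica set."""
    return [s for s in states if len(s.isr) < len(s.replicas)]
```

For example, `shrunk_isr(describe("localhost:9092"))` would list every partition in the cluster with a reduced ISR, where `localhost:9092` is a placeholder bootstrap address. The CLI's own `--under-replicated-partitions` flag gives a similar view without any parsing.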
Check the logs for any errors or warnings related to ISR
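A small filter can make the log check repeatable. The sketch below assumes the broker log is plain text with a `WARN` or `ERROR` level token on each line, and that ISR changes are logged with wording such as "Shrinking ISR" or "Expanding ISR". That wording is an assumption that varies between Kafka versions, and the function name `find_isr_events` is hypothetical.

```python
import re
from typing import Iterable, List

ISR_MENTION = re.compile(r"\bISR\b", re.IGNORECASE)
SEVERITY = re.compile(r"\b(WARN|ERROR)\b")
# Assumed wording for ISR membership changes; adjust to your broker version.
ISR_CHANGE = re.compile(r"(Shrinking|Expanding) ISR", re.IGNORECASE)


def find_isr_events(lines: Iterable[str]) -> List[str]:
    """Return log lines that mention the ISR and are either WARN/ERROR
    level or describe an ISR membership change."""
    hits = []
    for line in lines:
        if not ISR_MENTION.search(line):
            continue
        if SEVERITY.search(line) or ISR_CHANGE.search(line):
            hits.append(line.rstrip("\n"))
    return hits
```

Because the function accepts any iterable of lines, an open file handle for the broker's server log can be passed to it directly. The location of that log depends on the installation.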
Repair
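Increase the replica fetch maximum wait time: If the ISR count drops due to high replica lag, consider tuning the replica fetch maximum wait time (replica.fetch.wait.max.ms). This broker setting is the maximum time a follower's fetch request waits on the leader for new data before the leader responds. It does not affect consumers. Keep the value below replica.lag.time.max.ms, which is the time a follower may lag before it is removed from the ISR. Otherwise the ISR can shrink frequently on low-throughput topics.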
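Before changing the setting, it can help to confirm how the two timeouts relate in the broker configuration. The following is a minimal sketch, assuming the broker is configured through a Java-style properties file. The helper names are hypothetical, and the fallback values (500 ms and 30000 ms) are assumed defaults for recent Kafka releases that should be verified against the documentation for the version in use.

```python
from typing import Dict, Tuple


def parse_properties(text: str) -> Dict[str, str]:
    """Parse simple key=value lines, skipping blanks and comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def check_fetch_wait(props: Dict[str, str],
                     default_wait_ms: int = 500,      # assumed default
                     default_lag_ms: int = 30000      # assumed default
                     ) -> Tuple[int, int, bool]:
    """Return (fetch wait, lag limit, ok); ok is True when the fetch
    wait is strictly below the lag limit."""
    wait = int(props.get("replica.fetch.wait.max.ms", default_wait_ms))
    lag = int(props.get("replica.lag.time.max.ms", default_lag_ms))
    return wait, lag, wait < lag
```

If the third value is False, the fetch wait has been raised to or above the lag limit and should be reduced, or the lag limit raised, before the change is rolled out.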