This incident type covers a Kafka cluster in which one or more partitions show high consumer lag. Lag is the difference between the offset of the latest message produced to a partition (the log end offset) and the offset the consumer group has most recently committed for that partition. A lagging partition means messages are being consumed more slowly than they are produced. The resulting backlog delays downstream processing and can lead to data loss if messages pass the topic's retention limit before they are consumed. The incident description suggests checking for hot partitions, which are partitions that receive a disproportionate share of traffic compared to the others. Identifying and resolving high partition lag is critical to the stability and reliability of a Kafka cluster.
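To make the idea of a hot partition concrete, here is a minimal sketch that flags partitions receiving far more than an even share of traffic. The function name `find_hot_partitions`, the `threshold` of 2.0, and the sample counts are all assumptions for illustration; a suitable cutoff depends on the workload.

```python
from typing import Dict, List


def find_hot_partitions(
    messages_per_partition: Dict[int, int],
    threshold: float = 2.0,  # hypothetical cutoff: twice the even share
) -> List[int]:
    """Return partitions whose message count exceeds threshold x the even share."""
    total = sum(messages_per_partition.values())
    if total == 0:
        # No traffic (or no partitions given): nothing can be hot.
        return []
    even_share = total / len(messages_per_partition)
    return sorted(
        partition
        for partition, count in messages_per_partition.items()
        if count > threshold * even_share
    )


# Illustrative numbers only: partition 2 receives most of the traffic.
counts = {0: 1_000, 1: 1_200, 2: 9_000, 3: 800}
print(find_hot_partitions(counts))  # prints [2]
```

The per-partition counts could come from any source that reports messages produced over a fixed window; the comparison against an even share is one simple heuristic among several.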
Parameters
Debug
List all topics and their partition count
List the partition lag for a specific consumer group
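Once the end offsets and the group's committed offsets are known, per-partition lag is a simple subtraction. The sketch below assumes both are available as dictionaries keyed by partition number; the function names and sample offsets are hypothetical. Treating a partition with no committed offset as fully unconsumed is a simplification of this sketch and overstates lag if older messages have already been deleted.

```python
from typing import Dict, List, Optional, Tuple


def partition_lag(
    end_offsets: Dict[int, int],
    committed_offsets: Dict[int, Optional[int]],
) -> Dict[int, int]:
    """Lag per partition = log end offset - committed offset."""
    lag = {}
    for partition, end in end_offsets.items():
        committed = committed_offsets.get(partition)
        if committed is None:
            # Assumption: no commit yet means nothing has been consumed.
            committed = 0
        # Clamp at zero in case the two offsets were read at different moments.
        lag[partition] = max(end - committed, 0)
    return lag


def worst_first(lag: Dict[int, int]) -> List[Tuple[int, int]]:
    """Sort (partition, lag) pairs so the most lagged partition comes first."""
    return sorted(lag.items(), key=lambda item: item[1], reverse=True)


# Illustrative offsets only.
end = {0: 1500, 1: 1480, 2: 9200}
committed = {0: 1490, 1: 1480, 2: 4100}
print(worst_first(partition_lag(end, committed)))
# prints [(2, 5100), (0, 10), (1, 0)]
```

Sorting worst-first puts the partitions that most need attention at the top, which is usually where a hot partition shows up.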
List the current (last committed) offset for a specific partition
List the end offset for a specific partition
List the number of messages in a specific partition
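The number of messages currently held in a partition can be approximated from its beginning (earliest) and end offsets. This is a minimal sketch with a hypothetical function name; on compacted topics or topics written with transactions, offsets contain gaps, so the result is an upper bound rather than an exact count.

```python
def retained_messages(beginning_offset: int, end_offset: int) -> int:
    """Approximate message count in a partition as end offset - beginning offset."""
    if end_offset < beginning_offset:
        raise ValueError("end offset must not be smaller than beginning offset")
    return end_offset - beginning_offset


# Illustrative offsets only.
print(retained_messages(100, 350))  # prints 250
```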
List the number of messages consumed by a specific consumer group for a specific topic
Repair
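Increase the number of consumer instances in the lagging consumer group so that the high-traffic partitions are spread over more consumers. Within a group, each partition is read by only one consumer at a time, so adding consumers helps only until the group has as many consumers as the topic has partitions; any instances beyond that sit idle. Creating a new consumer group does not reduce the lag of the existing group, because each group tracks its own offsets. If a single hot partition is the bottleneck, adding consumers will not relieve it, and the distribution of message keys across partitions should be reviewed instead.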
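As a rough check before scaling out, the sketch below computes how many additional consumers a group can actually put to work, given that Kafka assigns each partition to at most one consumer within a group. The function name and the sample numbers are hypothetical.

```python
def usable_extra_consumers(partition_count: int, consumer_count: int) -> int:
    """Number of consumers that can be added before extra instances sit idle."""
    if partition_count < 0 or consumer_count < 0:
        raise ValueError("counts must be non-negative")
    # Each partition has at most one consumer per group, so the
    # partition count caps the useful size of the group.
    return max(partition_count - consumer_count, 0)


print(usable_extra_consumers(12, 4))  # prints 8
print(usable_extra_consumers(6, 9))   # prints 0
```

A result of 0 means the group is already at or past the partition count, so adding instances will not reduce lag.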