Major outage

Kafka Lag

Restart slow or broken consumers when systems are falling behind in processing messages through a queue.
Customer experience impact
Reports and transactions fail
Occurrence frequency
Monthly for larger fleets
Time to repair manually
SRE time spent on diagnosis and repair
Shoreline time to repair
5 minutes
Time to diagnose manually
Cost impact

The problem

When a Kafka topic’s backlog grows too long, your consumers are not keeping up with the rate at which messages are produced. When messages aren’t consumed, applications may begin to break, with reports and transactions being the first to fail.
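
To make "lag" concrete: for a consumer group, lag on a partition is commonly measured as the partition's latest offset minus the group's committed offset, summed across partitions. The sketch below is a minimal Python illustration under that assumption. The function name and the sample offsets are hypothetical, and treating a partition with no committed offset as fully unconsumed is a simplification.

```python
from typing import Dict


def total_lag(end_offsets: Dict[int, int], committed_offsets: Dict[int, int]) -> int:
    """Sum per-partition lag: latest offset minus the group's committed offset."""
    lag = 0
    for partition, end in end_offsets.items():
        # Assumption: a partition with no committed offset counts as fully unconsumed.
        committed = committed_offsets.get(partition, 0)
        # Clamp at zero so a stale end-offset reading never yields negative lag.
        lag += max(end - committed, 0)
    return lag


# Hypothetical snapshot of a three-partition topic.
end_offsets = {0: 1500, 1: 1200, 2: 900}
committed = {0: 1400, 1: 1200}
print(total_lag(end_offsets, committed))  # 100 + 0 + 900 = 1000
```

A value that keeps rising between readings means consumers are falling further behind, which is the condition described above.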

On the surface, this is not a difficult problem to diagnose. Close monitoring of consumer lag metrics will tell you whether messages are being consumed. If the issue is caught early, the consumer pods simply need to be restarted. The real trouble starts when monitoring falls behind: the longer lag goes unnoticed, the further systems drift out of sync and the harder the fix becomes. This will most likely lead to availability issues for customers.

The solution

Shoreline’s Kafka OpPack detects Kafka lag and restarts consumer pods to clear it. You designate the group of pods that consume the topic, and Shoreline either captures lag metrics from a Kafka exporter or calls Kafka directly to get the topic length.
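
The detect-then-restart flow described above can be sketched in a few lines of Python. This is an illustration only, not Shoreline's implementation: the function names, the threshold, the three-sample window, and the pod names are all hypothetical, and the actual restart is left as a callback you would supply (for example, a wrapper around your cluster tooling).

```python
from typing import Callable, List, Sequence


def lag_is_stuck(samples: Sequence[int], threshold: int, window: int = 3) -> bool:
    """True when the last `window` samples all exceed `threshold` and lag is not shrinking."""
    if len(samples) < window:
        return False  # Not enough data to judge yet.
    recent = samples[-window:]
    return all(s > threshold for s in recent) and recent[-1] >= recent[0]


def remediate(
    samples: Sequence[int],
    threshold: int,
    consumer_pods: Sequence[str],
    restart_pod: Callable[[str], None],
) -> List[str]:
    """Restart every designated consumer pod if lag looks stuck; return the pods restarted."""
    if not lag_is_stuck(samples, threshold):
        return []
    restarted = []
    for pod in consumer_pods:
        restart_pod(pod)  # Hypothetical callback that performs the restart.
        restarted.append(pod)
    return restarted


# Hypothetical lag readings (from an exporter or a direct query) and pod names.
readings = [1200, 1800, 2500]
pods = ["orders-consumer-0", "orders-consumer-1"]
print(remediate(readings, threshold=1000, consumer_pods=pods, restart_pod=lambda p: None))
```

Requiring several consecutive readings above the threshold avoids restarting consumers on a brief spike that would have drained on its own.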