Kafka consumer lag: monitoring and alerting strategies
Felix Sato
We had a situation last month where Kafka consumer lag for a critical service grew to millions of messages and went completely unnoticed for several hours. It was a huge outage. Our current alerting is based on static thresholds, which are either too noisy during batch processing windows or too slow to react to sudden spikes.

I'm looking for better strategies to monitor and alert on Kafka consumer lag. I'm thinking about monitoring the rate of change in lag rather than just the absolute lag, or perhaps a combination of the two. What metrics and alerting patterns have people found most effective for catching lag issues early without generating tons of false positives? I'm especially interested in solutions that differentiate between expected and unexpected lag increases.
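To make the rate-of-change idea concrete, here is a rough Python sketch of the kind of combined rule I have in mind. It is not tied to any particular monitoring stack. The names (`LagAlertConfig`, `lag_growth_rate`, `should_alert`), the threshold values, and the `in_batch_window` flag are all made up for illustration, and it assumes lag samples arrive as `(unix_seconds, lag)` pairs from whatever is already scraping the consumer group.

```python
from dataclasses import dataclass
from typing import List, Tuple

Sample = Tuple[float, int]  # (timestamp in seconds, total consumer lag in messages)


@dataclass
class LagAlertConfig:
    # All values below are placeholder assumptions, not recommendations.
    max_lag: int = 1_000_000          # hard ceiling: always alert at or above this
    min_lag: int = 10_000             # ignore growth while lag is below this floor
    max_growth_per_sec: float = 50.0  # sustained growth rate treated as abnormal
    window: int = 5                   # number of most recent samples to evaluate


def lag_growth_rate(samples: List[Sample]) -> float:
    """Average change in lag per second between the first and last sample."""
    (t0, lag0), (t1, lag1) = samples[0], samples[-1]
    if t1 <= t0:
        return 0.0
    return (lag1 - lag0) / (t1 - t0)


def should_alert(samples: List[Sample], cfg: LagAlertConfig,
                 in_batch_window: bool = False) -> bool:
    """Combine an absolute ceiling with a rate-of-change check."""
    if not samples:
        return False
    recent = samples[-cfg.window:]
    current_lag = recent[-1][1]

    # The absolute ceiling fires regardless of any expected batch window.
    if current_lag >= cfg.max_lag:
        return True

    # During a known batch window, growth below the ceiling is expected.
    if in_batch_window:
        return False

    # Require a full window so a single noisy sample cannot trigger the alert.
    if len(recent) < cfg.window:
        return False

    return (current_lag >= cfg.min_lag
            and lag_growth_rate(recent) >= cfg.max_growth_per_sec)
```

The thinking is that the rate check would catch a sudden spike long before lag reaches the static ceiling, the floor would keep small fluctuations quiet, and the batch-window flag would suppress growth we already expect. Whether a simple first-to-last slope is robust enough, or whether something like a regression over the window is needed, is part of what I'm asking about.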