Kafka output might lose connection and not recover #37276
Comments
Also worth noting that this issue is not reproducible with Filebeat v7.17.1. One of the changes between these versions is the version of the Kafka client that Beats is using: v7.17.1: https://github.com/elastic/beats/blob/v7.17.1/go.mod#L290. It looks like this line was changed in v8.2.0 (PR).
@jlind23 should we add this to our test cases with our partner?
The differences between the two are here: https://github.com/elastic/sarama/compare/11c3ef800752..ebc2b0d8eef3 It is possible that this change to which errors are considered unrecoverable results in the connection not being re-established once it hits one of these errors.
Interestingly, I am unable to reproduce this with the provided docker-compose and filebeat yml, either on 8.8.1 or on main. When the broker comes back online, publishing quickly resumes. I did have to bump Kafka to 7.x to run on my arm64 laptop though, so maybe that has something to do with it. Do we know what version of Kafka the customer is running?
Unfortunately I do not know which version they were using :/
@strawgate @belimawr
That change only affects which errors are reported to the caller (Beats), not how retries are handled. The baseline behavior was the same, except that it reported all variations as a single generic error. After these errors are reported, Beats will still retry.
I've been trying to reproduce this problem, but I cannot: Filebeat has been able to recover from every error I managed to produce. The "worst" one I managed to trigger puts my Kafka cluster in an unstable state, and the CLI tools that came with the Kafka installation produce the following error:
And Filebeat produces:
However, once the connection to the problematic brokers is restored, everything starts working again. I'll test again with one of the versions used in the original reports to understand whether it has been fixed or I'm just not managing to reproduce it any more.
Looking at the changes in our sarama fork, I can see that the base version of sarama was updated from 1.29.1 to 1.43.3 on 17/11/2024. Beats was then updated to use the new version on 29/11/2024. That seems to be what solved this issue.
I was unable to reproduce this with Filebeat 8.8.1, were you? How do we know if it's fixed in the newer version if we can't reproduce it in the older one?
I cannot reproduce it either. I tried with Filebeat v8.8.1, v8.10.3 and the
We don't... We can close the issue because we cannot reproduce it. If someone reports this issue again, we can re-open it and ask for more details about the setup that reproduces it.
One thing I noticed is that if Filebeat cannot connect to Kafka and I try to stop Filebeat, it just hangs there and the Kafka client does not stop. However, if I restore the connection to the Kafka brokers, Filebeat eventually stops.
main
Steps to reproduce
flog -d1 -s1 -l > /tmp/flog.log
Log entries like this one will stop
Bear in mind that you will have to add your local IP address to the files below.
filebeat.yml
docker-compose.yml
Tutorial on running a Kafka cluster with Docker: https://betterprogramming.pub/a-simple-apache-kafka-cluster-with-docker-kafdrop-and-python-cf45ab99e2b9
I have also seen reports of users reproducing a similar behaviour with Filebeat v8.8.1 and using HAProxy between the Kafka nodes and Filebeat. Steps: