Connection Timeout Issue #6637

ankit21491 · 2022-12-22T14:04:18Z

ankit21491
Dec 22, 2022

Hi All,

I am running Fluent Bit 2.0.5 as a DaemonSet in Kubernetes. My organisation has around 100 K8s clusters at the edge. Fluent Bit forwards the logs to New Relic, and on a few of the clusters the Fluent Bit pod gets stuck on connection timeouts and no logs are delivered after that.
When we checked, we were still getting the Fluent Bit uptime metrics, which are scraped by Prometheus. Since the Fluent Bit image is distroless, we can't even set a liveness probe to restart the pod when the metrics stop arriving.

I understand that a connection timeout could be a network issue, but the pod getting stuck in the timed-out state, without restarting or dropping the connection, is a problem. I have also checked PR #3192, which says the connection timeout issue is fixed in v1.7.0 and onwards, but the issue still persists.

Please suggest any resolution for the same.

ankit21491 · 2023-01-03T11:51:52Z

ankit21491
Jan 3, 2023
Author

Hi All,

Any suggestions, please?

1 reply

patrick-stephens Nov 21, 2024
Maintainer

That version is no longer supported so first step is to upgrade to latest - it may already be resolved. If you're still seeing the issue then please raise an Issue using the template to capture the full details of config, versions, infra, etc.

Sunnatillo · 2023-01-13T09:49:42Z

Sunnatillo
Jan 13, 2023

We are facing the same issue. fluent-bit version is 2.0.5

[2023/01/12 06:08:36] [error] [upstream] connection #174 to tcp://x.x.x.x:24224 timed out after 10 seconds (connection timeout)

1 reply

ankit21491 Jan 19, 2023
Author

Hi Sunnatillo, I have used FluentBit image 2.0.6 and it resolved the issue.

Sunnatillo · 2023-02-01T07:08:17Z

Sunnatillo
Feb 1, 2023

We are still facing the same issue after upgrading to v2.0.8.
We would appreciate any suggestions

3 replies

badu013 Feb 3, 2023

please share the complete fluentbit configuration here.

Sunnatillo Feb 3, 2023

Hi, below is the fluent-bit output configuration. I still see the connection reset between fluent-bit and fluentd.
There is no LB in between.

[OUTPUT]
    Name          forward
    Match         *
    Host          fluentd
    Port          24224
    tls           On
    net.keepalive on
    tls.verify    On
    tls.ca_file   /fluentbit-tls/ssl/ca.crt.pem
    tls.crt_file  /fluentbit-tls/ssl/client.crt.pem
    tls.key_file  /fluentbit-tls/ssl/client.key.pem

In the fluent-bit logs I see the errors below continuously:
[2023/01/25 05:02:26] [ info] [input] tail.0 resume (mem buf overlimit)
[2023/01/25 05:02:26] [ warn] [input] tail.0 paused (mem buf overlimit)
[2023/01/25 05:02:26] [ info] [input] pausing tail.0
eccd@director-0-n92-ci-ibd-32-jenkins:~>

[2023/01/24 14:49:29] [error] [upstream] connection #144 to tcp://10.102.224.136:24224 timed out after 10 seconds (connection timeout)
[2023/01/24 14:49:29] [error] [upstream] connection #146 to tcp://10.102.224.136:24224 timed out after 10 seconds (connection timeout)
[2023/01/24 14:49:29] [error] [output:forward:forward.0] no upstream connections available
[2023/01/24 14:49:29] [error] [output:forward:forward.0] no upstream connections available
[2023/01/24 14:49:32] [error] [upstream] connection #142 to tcp://10.102.224.136:24224 timed out after 10 seconds (connection timeout)
[2023/01/24 14:49:32] [error] [output:forward:forward.0] no upstream connections available

nulldim Jun 7, 2023

The issue seems to be fixed after upgrading to fluentd v1.16.1 and fluent-bit v2.1.3

ts3ng · 2023-08-23T14:41:40Z

ts3ng
Aug 23, 2023

We are seeing a large number of connection timeouts, with no recovery, on the fluent-bit 1.9.10 build.

>k --context <cluster>  -n kube-system logs pods/fluent-bit-splunk-es-74jvv --tail 15
[2023/08/20 22:23:17] [error] [upstream] connection #167 to <OMIT HEC ENDPOINT>:443 timed out after 10 seconds
[2023/08/20 22:23:17] [error] [upstream] connection #129 to <OMIT HEC ENDPOINT>443 timed out after 10 seconds
[2023/08/20 22:23:20] [error] [upstream] connection #169 to <OMIT HEC ENDPOINT>443 timed out after 10 seconds
[2023/08/20 22:23:26] [error] [upstream] connection #65 to <OMIT HEC ENDPOINT>443 timed out after 10 seconds
[2023/08/20 22:23:32] [error] [upstream] connection #167 to <OMIT HEC ENDPOINT>:443 timed out after 10 seconds
[2023/08/20 22:23:35] [error] [upstream] connection #186 to <OMIT HEC ENDPOINT>:443 timed out after 10 seconds
[2023/08/20 22:23:37] [error] [upstream] connection #188 to <OMIT HEC ENDPOINT>:443 timed out after 10 seconds
[2023/08/20 22:23:37] [error] [upstream] connection #189 to <OMIT HEC ENDPOINT>:443 timed out after 10 seconds

If we run a check manually, it works.

>k --context <cluster>  -n kube-system exec -ti pods/fluent-bit-splunk-es-74jvv -- curl -k https://<OMIT HEC ENDPOINT>/services/collector -H 'Authorization: Splunk <omitted>' -d '{"index":"<index>","sourcetype":"curl","event":{"message":"this is a log","field2":"this is field2"}}'
{"text":"Success","code":0}

It looks like it's stuck: records are coming in, but nothing is being delivered by the splunk output...

>k --context <cluster>  -n kube-system exec -ti pods/fluent-bit-splunk-es-74jvv -- curl http://127.0.0.1:2020/api/v1/metrics
{"input":{"systemd.0":{"records":388548,"bytes":805500533},"storage_backlog.1":{"records":0,"bytes":0}},"filter":{"modify.0":{"drop_records":0,"add_records":0},"modify.1":{"drop_records":0,"add_records":0},"modify.2":{"drop_records":0,"add_records":0},"nest.3":{"drop_records":0,"add_records":0},"modify.4":{"drop_records":0,"add_records":0},"record_modifier.5":{"drop_records":0,"add_records":0}},"output":{"splunk.0":{"proc_records":0,"proc_bytes":0,"errors":0,"retries":0,"retries_failed":0,"dropped_records":0,"retried_records":0}}}

After deleting the pod that has these issues, it seems to work again (same fluent-bit config), so there must be an issue with fluent-bit getting stuck.

> k --context <clusteR> -n kube-system exec -ti pods/fluent-bit-journald-npdfd -- curl http://127.0.0.1:2020/api/v1/metrics
{"input":{"systemd.0":{"records":4568128,"bytes":6950273809},"storage_backlog.1":{"records":0,"bytes":0}},"filter":{"modify.0":{"drop_records":0,"add_records":0},"modify.1":{"drop_records":0,"add_records":0},"nest.2":{"drop_records":0,"add_records":0},"modify.3":{"drop_records":0,"add_records":0},"record_modifier.4":{"drop_records":0,"add_records":0}},"output":{"splunk.0":{"proc_records":2506075,"proc_bytes":3921936642,"errors":200312,"retries":29,"retries_failed":1,"dropped_records":2062052,"retried_records":239}}}
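The difference between the two metrics responses above can be checked mechanically. The following is only a minimal sketch, assuming the JSON shape shown in the `/api/v1/metrics` output quoted in this comment; the function name `output_looks_stalled` is hypothetical and the two sample payloads are trimmed copies of the responses above.

```python
import json


def output_looks_stalled(metrics: dict, output_name: str) -> bool:
    """Return True when inputs have ingested records but the named output
    has processed none (hypothetical helper, not part of Fluent Bit)."""
    ingested = sum(
        entry.get("records", 0) for entry in metrics.get("input", {}).values()
    )
    output = metrics.get("output", {}).get(output_name, {})
    # A missing output entry is treated the same as zero processed records.
    return ingested > 0 and output.get("proc_records", 0) == 0


# Trimmed copies of the two responses quoted above.
stuck = json.loads(
    '{"input":{"systemd.0":{"records":388548,"bytes":805500533}},'
    '"output":{"splunk.0":{"proc_records":0,"proc_bytes":0,"errors":0}}}'
)
healthy = json.loads(
    '{"input":{"systemd.0":{"records":4568128,"bytes":6950273809}},'
    '"output":{"splunk.0":{"proc_records":2506075,"proc_bytes":3921936642,"errors":200312}}}'
)

print(output_looks_stalled(stuck, "splunk.0"))    # True
print(output_looks_stalled(healthy, "splunk.0"))  # False
```

A single snapshot only catches an output that has never delivered anything since the pod started; a pod that stalls later would need two snapshots compared over time.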

We thought these were resolved via issue #4505, but it seems that there are still problems overall. Can we confirm that the fix is in v2.1.3? Looking at the commits in
https://github.com/fluent/fluent-bit/releases/tag/v2.1.3 I'm not sure what the problem was.

1 reply

ts3ng Aug 24, 2023

I found issue #6822, which might be the change that is needed. In the meantime we are looking into workarounds: an updated liveness probe that checks the splunk.0 output metrics for activity, in conjunction with increasing the net.connect_timeout setting for this service. The problem is exacerbated by some QoS issues with the Splunk HEC endpoint we're using. If that doesn't work, I will consider pushing this commit to our fluent-bit fork, since 2.1.x might be too much of a leap for our current scope.
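The liveness-probe idea described above could look roughly like the sketch below. It is not the probe ts3ng actually deployed: the helper names (`fetch_metrics`, `made_progress`, `probe_exit_code`) are hypothetical, the URL is the metrics endpoint used earlier in this thread, and the script assumes it runs somewhere Python is available (for example a sidecar), which a distroless Fluent Bit image itself would not provide.

```python
import json
import urllib.request

# Endpoint used earlier in this thread; adjust if the HTTP server listens elsewhere.
METRICS_URL = "http://127.0.0.1:2020/api/v1/metrics"


def fetch_metrics(url: str = METRICS_URL, timeout: float = 5.0) -> dict:
    """Fetch one metrics snapshot from the Fluent Bit HTTP server."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def total_input_records(metrics: dict) -> int:
    """Sum the 'records' counter over all inputs in a snapshot."""
    return sum(e.get("records", 0) for e in metrics.get("input", {}).values())


def made_progress(prev: dict, curr: dict, output_name: str = "splunk.0") -> bool:
    """Healthy if the output processed new records between the two snapshots,
    or if no new records were ingested (idle node or counters reset)."""
    def processed(snapshot: dict) -> int:
        return snapshot.get("output", {}).get(output_name, {}).get("proc_records", 0)

    new_in = total_input_records(curr) - total_input_records(prev)
    new_out = processed(curr) - processed(prev)
    return new_out > 0 or new_in <= 0


def probe_exit_code(prev: dict, curr: dict, output_name: str = "splunk.0") -> int:
    """Map the health decision to an exec-probe style exit code (0 = healthy)."""
    return 0 if made_progress(prev, curr, output_name) else 1
```

A probe wrapper would call `fetch_metrics` twice, some seconds apart, and exit with `probe_exit_code(prev, curr)`. Comparing deltas rather than absolute values avoids restarting a pod that is merely idle, at the cost of needing a wait between the two polls.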

jasiu001 · 2024-11-21T09:00:40Z

jasiu001
Nov 21, 2024

I encountered this problem again, this time in version 3.0.7, where the logs are sent to Splunk:

[2024/11/20 13:48:23] [error] [upstream] connection #629 to tcp://<ip>:8088 timed out after 10 seconds (connection timeout)
[2024/11/20 13:48:23] [error] [upstream] connection #603 to tcp://<ip>:8088 timed out after 10 seconds (connection timeout)
[2024/11/20 13:48:23] [error] [upstream] connection #589 to tcp://<ip>:8088 timed out after 10 seconds (connection timeout)

Does anyone have any idea what could be the cause? It looks like the connection is broken for only one Splunk instance; other Splunk instances with the same version and configuration work properly. A reset doesn't help.

2 replies

patrick-stephens Nov 21, 2024
Maintainer

I would check the splunk logs too, it sounds like a networking issue so verify the IP is valid, etc. too

jasiu001 Nov 22, 2024

OK, never mind. It looks like the problem is on our side: some misconfiguration of the proxy parameters.
