Tracing through the MQTT code, the problem appears to be internal to the paho.mqtt.golang library and may be related to eclipse-paho/paho.mqtt.golang#410. Dapr uses the default AutoReconnect value (true) for the MQTT client configuration, and the traces suggest the client is constantly reconnecting in the background.
In the failure cases, it looks like there is a race condition between the disconnect and when publish is called to queue a new outbound message. The most minimal example I've seen is:
...
[DEBUG] [client] enter Publish
[DEBUG] [client] sending publish message, topic: testBindingTopic
[ERROR] [client] Connect comms goroutine - error triggered EOF
[DEBUG] [net] obound msg to write 1
[DEBUG] [client] internalConnLost called
[DEBUG] [client] stopCommsWorkers called
[ERROR] [net] outgoing obound reporting error write tcp 127.0.0.1:42870->127.0.0.1:1884: use of closed network connection
...
Depending on thread scheduling and the particular run, the trace can also look like this:
...
[DEBUG] [client] enter Publish
[DEBUG] [client] sending publish message, topic: testBindingTopic
[DEBUG] [net] obound msg to write 1
[DEBUG] [net] incoming complete
[DEBUG] [net] startIncomingComms: got msg on ibound
[DEBUG] [net] logic waiting for msg on ibound
[DEBUG] [net] startIncomingComms: ibound complete
[DEBUG] [net] startIncomingComms goroutine complete
[ERROR] [client] Connect comms goroutine - error triggered EOF
[DEBUG] [client] internalConnLost called
[DEBUG] [client] stopCommsWorkers called
[DEBUG] [router] matchAndDispatch exiting
[DEBUG] [client] startCommsWorkers output redirector finished
[DEBUG] [pinger] keepalive stopped
[DEBUG] [client] stopCommsWorkers waiting for workers
[DEBUG] [client] stopCommsWorkers waiting for comms
[DEBUG] [client] internalConnLost waiting on workers
[DEBUG] [net] received connack
[DEBUG] [client] startCommsWorkers called
[DEBUG] [client] client is connected/reconnected
[DEBUG] [net] incoming started
[DEBUG] [net] startIncomingComms started
[DEBUG] [net] outgoing started
[DEBUG] [net] startComms started
[DEBUG] [client] startCommsWorkers done
[DEBUG] [store] enter Resume
[DEBUG] [store] exit resume
[DEBUG] [net] logic waiting for msg on ibound
[DEBUG] [net] startIncomingComms: inboundFromStore complete
[DEBUG] [net] logic waiting for msg on ibound
[DEBUG] [net] outgoing waiting for an outbound message
[DEBUG] [pinger] keepalive starting
[ERROR] [net] outgoing obound reporting error write tcp 127.0.0.1:60916->127.0.0.1:1884: use of closed network connection
In success cases, the trace is usually:
...
[DEBUG] [client] enter Publish
[DEBUG] [client] sending publish message, topic: testBindingTopic
[DEBUG] [net] obound msg to write 1
[DEBUG] [net] obound wrote msg, id: 1
[DEBUG] [net] outgoing waiting for an outbound message
...
Additional observations:
The failure reproduces when only the output binding is used, so it does not appear to be an interaction between the input and output binding clients in the same component.
The MQTT bindings and pubsub components appear to use exactly the same code to invoke publish, yet the failure signature has never appeared in the pubsub conformance test. This suggests some additional factor contributes to the failure.
Steps to Reproduce the Problem
It is useful to enable paho.mqtt.golang lib's tracing for this investigation, which can be added to the Init() method in mqtt.go:
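One way to do this, as a minimal sketch: paho.mqtt.golang exposes package-level loggers (`mqtt.ERROR`, `mqtt.CRITICAL`, `mqtt.WARN`, `mqtt.DEBUG`) that accept any value with `Println` and `Printf` methods, so standard `*log.Logger` values can be used. The `traceLoggers` type and `newTraceLoggers` helper are hypothetical names used only for illustration, and the commented assignments show where they would be wired in inside `Init()`.

```go
package main

import (
	"io"
	"log"
	"os"
)

// traceLoggers groups one standard logger per paho log level.
// (Hypothetical helper type, not part of Dapr or paho.)
type traceLoggers struct {
	Error, Critical, Warn, Debug *log.Logger
}

// newTraceLoggers builds the loggers. The ERROR and DEBUG prefixes match the
// traces in this report; the other two prefixes are arbitrary choices.
func newTraceLoggers(w io.Writer) traceLoggers {
	return traceLoggers{
		Error:    log.New(w, "[ERROR] ", 0),
		Critical: log.New(w, "[CRIT] ", 0),
		Warn:     log.New(w, "[WARN] ", 0),
		Debug:    log.New(w, "[DEBUG] ", 0),
	}
}

func main() {
	l := newTraceLoggers(os.Stdout)

	// Inside Init() in mqtt.go, with
	//   mqtt "github.com/eclipse/paho.mqtt.golang"
	// imported, the loggers would be assigned to paho's package-level variables:
	//   mqtt.ERROR = l.Error
	//   mqtt.CRITICAL = l.Critical
	//   mqtt.WARN = l.Warn
	//   mqtt.DEBUG = l.Debug

	// The traces show a component tag such as "[client]" after the level prefix.
	l.Debug.Println("[client]", "enter Publish")
}
```

Because these loggers are package-level, they apply to every paho client in the process, which is consistent with the output not distinguishing between the input and output binding clients.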
The tracing produces a lot of spew and does not distinguish between the input and output binding's MQTT clients, so it is also helpful to mock out the Read() method so it's not actually using MQTT (e.g. with a simple timer loop that calls the app's read handler).
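A timer-driven mock of this kind could look roughly like the sketch below. The `readResponse` and `readHandler` types, the `mockRead` function, and the stop channel are all hypothetical stand-ins, not the binding's actual signature; the real `Read()` would need to be adapted to whatever handler type the component uses.

```go
package main

import (
	"fmt"
	"time"
)

// readResponse and readHandler are hypothetical stand-ins for the binding's
// real response and handler types.
type readResponse struct {
	Data []byte
}

type readHandler func(*readResponse) ([]byte, error)

// mockRead invokes handler once per interval until stop is closed or the
// handler returns an error. It returns the number of handler invocations.
func mockRead(stop <-chan struct{}, interval time.Duration, handler readHandler) (int, error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	calls := 0
	for {
		select {
		case <-stop:
			return calls, nil
		case <-ticker.C:
			calls++
			msg := &readResponse{Data: []byte(fmt.Sprintf("mock-%d", calls))}
			if _, err := handler(msg); err != nil {
				return calls, err
			}
		}
	}
}

func main() {
	stop := make(chan struct{})
	// Stop the loop after a short while so the example terminates.
	time.AfterFunc(50*time.Millisecond, func() { close(stop) })

	n, err := mockRead(stop, 10*time.Millisecond, func(r *readResponse) ([]byte, error) {
		fmt.Printf("handler received %s\n", r.Data)
		return nil, nil
	})
	fmt.Println("calls:", n, "err:", err)
}
```

With the input side driven by a timer, only the output binding's client talks to the broker, so every line in the paho trace belongs to the publish path.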
Run the mqtt conformance test in a tight loop until failure to capture the trace:
while go test -v -tags=conftests -count=1 -timeout=1m ./tests/conformance -run=TestBindingsConformance/mqtt.emqx > mqtt-emqx.log; do :; done
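Once the loop exits, `mqtt-emqx.log` holds the output of the failing run. As a rough sketch, the relevant error lines can be pulled out of that log with a small helper. It assumes the `use of closed network connection` error seen in the failing traces is the marker of interest; `failureMarker` and `failureLines` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// failureMarker is the error text seen in the failing traces in this report.
// Treating it as the marker of a failed run is an assumption.
const failureMarker = "use of closed network connection"

// failureLines returns every [ERROR] line in a captured trace that contains
// failureMarker.
func failureLines(trace string) []string {
	var hits []string
	for _, line := range strings.Split(trace, "\n") {
		line = strings.TrimSpace(line)
		if strings.Contains(line, "[ERROR]") && strings.Contains(line, failureMarker) {
			hits = append(hits, line)
		}
	}
	return hits
}

func main() {
	// Sample taken from the minimal failing trace in this report.
	trace := `[DEBUG] [client] enter Publish
[DEBUG] [net] obound msg to write 1
[ERROR] [net] outgoing obound reporting error write tcp 127.0.0.1:42870->127.0.0.1:1884: use of closed network connection`

	for _, l := range failureLines(trace) {
		fmt.Println(l)
	}
}
```

To run it against a real capture, the contents of `mqtt-emqx.log` can be read with `os.ReadFile` and passed to `failureLines` as a string.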
Expected Behavior
MQTT bindings conformance tests should pass reliably.
Actual Behavior
MQTT bindings conformance tests have a ~1% failure rate on the Create test with the following failure signature:
Release Note
RELEASE NOTE: FIX MQTT output bindings sporadic failure on Create operation.