Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Re-open connection on transmission failures #23

Merged
merged 2 commits into from
Sep 14, 2023
Merged

Re-open connection on transmission failures #23

merged 2 commits into from
Sep 14, 2023

Conversation

huntc
Copy link
Collaborator

@huntc huntc commented Sep 13, 2023

The file descriptor exhaustion issue appeared to be caused by the producer failing when transmitting, and then not attempting to reconnect. Consequently, another subscription to the commit log was made and failed rapidly, and didn't provide enough time for the last one to clean up its resources. This rapid failure loop was not recovering either and so eventually all file descriptors could be used up (particularly if there's already a lack of them due to sbt etc).

Another, albeit separate error, was that a pending handler failure wouldn't cause a new source to be made. This was an oversight.

I am no longer able to see the file descriptor exhaustion after many connects/re-starts of the JVM service having posted UDP observations.

Finally, a separate commit makes some tidy-ups.

Fixes #20

The file descriptor exhaustion issue appeared to be caused by the producer failing when transmitting, and then not attempting to reconnect. Consequently, another subscription to the commit log was made and failed rapidly, and didn't provide enough time for the last one to clean up its resources. This rapid failure loop was not recovering either.

Another, albeit separate error, was that a pending handler failure wouldn't cause a new source to be made. This was an oversight.

I am no longer able to see the file descriptor exhaustion after many connects/re-starts of the JVM service having posted UDP observations.
@huntc huntc requested a review from patriknw September 13, 2023 21:24
@huntc huntc self-assigned this Sep 13, 2023
@huntc huntc added the bug Something isn't working label Sep 13, 2023
Copy link
Member

@patriknw patriknw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I have tried running it without problems

None
};
}
let mut d = Delayer::default();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that we have 8 retry attempts with backoff (seems to reach 6 s delay), and then it starts over again from 100 ms. What is the reason for starting over, instead of staying (longer) at the max 10 s delay?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No real reason. I agree it'd be better to stay on the max 10s so I'll do that.

@huntc huntc merged commit cff71ef into main Sep 14, 2023
1 check passed
@huntc huntc deleted the debug branch September 14, 2023 10:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Too many open files
2 participants