Service(s)
infra.ci.jenkins.io
Summary
I am running Jenkins 2.401 on Kubernetes. The controller runs on JDK 11. The setup uses two inbound agents to run different pipeline jobs:
inbound agent 1 uses the 3198.v03a_401881f3e-1-jdk17 Docker image (JDK 17)
inbound agent 2 uses the 3107.v665000b_51092-15 Docker image (JDK 11)
We are intermittently seeing nodes go offline as a job begins, and both agents are affected. There is no discernible pattern: the error may appear after 15 pipeline submissions, or not appear for 30.
Below is the message written to the controller log file. The agent log file shows no errors.
```
2024-10-01 04:51:46.333+0000 [id=7688] WARNING j.m.api.Metrics$HealthChecker#execute: Some health checks are reporting as unhealthy: [thread-deadlock : [Computer.threadPoolForRemoting [#1] locked on java.util.concurrent.locks.ReentrantLock$NonfairSync@4acb9e4c (owned by jenkins.util.Timer [#1]): at java.base@11.0.19/jdk.internal.misc.Unsafe.park(Native Method)
```
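For reference, the thread-deadlock entry is produced by the Metrics plugin's health check, which relies on the JVM's built-in deadlock detection. Below is a minimal, self-contained sketch of the same probe (the class name `DeadlockProbe` is hypothetical; the APIs are standard `java.lang.management`):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class DeadlockProbe {
    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // findDeadlockedThreads() covers ownable synchronizers such as
        // ReentrantLock (the lock named in the log), not just monitors.
        long[] ids = mx.findDeadlockedThreads();
        if (ids == null) {
            System.out.println("No deadlocked threads.");
            return;
        }
        // Dump the stack trace and held locks for each deadlocked thread.
        for (ThreadInfo info : mx.getThreadInfo(ids, true, true)) {
            System.out.println(info);
        }
    }
}
```

Running `jstack <pid>` against the controller JVM while a node is stuck should capture the same information, including the full stacks of `Computer.threadPoolForRemoting` and `jenkins.util.Timer`.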
I came across a recommendation in the Jenkins JIRA to upgrade the Pipeline plugin; however, we are already running the latest version.
We plan to upgrade Jenkins and the agents to JDK 17 starting next year, but we have been unable to determine the cause of the problem.
When this happens, restarting Jenkins from the console does not work. Aborting a job succeeds, but it does not terminate the node/pod. Going into Nodes and Clouds to delete the node does not remove it either.
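One possible manual cleanup is to force-disconnect and remove the stuck node from the script console, sketched below. The console runs Groovy, which accepts this Java-style code; `agent-pod-xyz` is a placeholder node name. If the controller itself is holding the deadlocked locks, this may hang just as the UI does:

```java
import jenkins.model.Jenkins;
import hudson.model.Computer;
import hudson.model.Node;

Jenkins j = Jenkins.get();
Computer c = j.getComputer("agent-pod-xyz");
if (c != null) {
    c.disconnect(null);   // drop the remoting channel first
}
Node n = j.getNode("agent-pod-xyz");
if (n != null) {
    j.removeNode(n);      // then remove the node entry itself
}
```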
The agent pod does not report any errors.
We have temporarily disabled the node version monitor plugin. The workaround currently in place is to scale the Jenkins controller down to 0 and then back up to 1.
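A sketch of automating that workaround, assuming kubectl is on the PATH and the controller runs as a StatefulSet named `jenkins-controller` in a `jenkins` namespace (both names hypothetical here):

```java
import java.io.IOException;

public class RestartController {
    static void scale(int replicas) throws IOException, InterruptedException {
        // Shells out to kubectl rather than using a Kubernetes client library.
        Process p = new ProcessBuilder(
                "kubectl", "scale", "statefulset/jenkins-controller",
                "--namespace", "jenkins",
                "--replicas=" + replicas)
            .inheritIO()
            .start();
        if (p.waitFor() != 0) {
            throw new IOException("kubectl scale failed with exit code " + p.exitValue());
        }
    }

    public static void main(String[] args) throws Exception {
        scale(0);
        // Give the controller pod time to terminate before scaling back up.
        Thread.sleep(30_000);
        scale(1);
    }
}
```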
The Kubernetes version is 1.30.
Reproduction steps
There are no reproduction steps; the issue does not follow any pattern.