More than the expected number of Resque workers are running ("too many resque workers" / "worker count high")
We no longer use Resque to run background jobs. We use Sidekiq. Caveat lector.
aka the `feature-worker-count: FAILED TOO MANY WORKERS` okcomputer check and Nagios alert.
When restarting (as happens when deploying Preservation Catalog), resque-pool hotswap attempts to wind down all jobs in progress gracefully by issuing a "gentle" kill to the current resque-pool master process (`kill -s QUIT`). This causes that resque-pool master process to:
- shut down all idle workers in the current pool
- signal all of its busy workers to exit once they have finished their WIP
- exit once all of the worker processes managed by that resque-pool instance have exited
Once the shutdown command has been issued to the current resque-pool master process, a new resque-pool instance is started. The expectation is that most workers are idle at any given moment, so there should not be many old workers lingering for long, and the VM should not become under-resourced while both pools briefly coexist. (A rough sketch of the restart sequence is below.)
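For illustration only, the hotswap restart amounts to something like the sketch below. In practice the deploy tooling drives this, so the pidfile path and exact resque-pool invocation here are assumptions rather than the commands our deploys actually run:

```sh
# Gently stop the old pool master: idle workers exit immediately, busy workers
# exit once their work in progress is finished, then the master itself exits.
kill -s QUIT "$(cat /var/run/resque-pool.pid)"   # hypothetical pidfile location

# Start a new resque-pool master, which spawns a fresh, full complement of
# workers alongside whatever old workers are still finishing up.
bundle exec resque-pool --daemon --environment production
```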
As such, it's frequently the case that a long-running job or two from the prior resque-pool instance will still be going for a while after all the workers of the new pool are up (e.g. a few minutes for a large replication upload, a few hours for a large checksum validation job). So it is common for Nagios to alert about "TOO MANY WORKERS" for a few minutes or hours after a deployment.
If the alert lingers for more than a couple of hours, you can investigate by running `ps -ef | grep resque` on each of the worker VMs. You should see two resque-pool master processes, with only one or a few workers belonging to the old master and the full complement belonging to the new one. If you see the full complement of workers for the old resque-pool master process, that may indicate that the pool wasn't stopped correctly, and you may have to manually kill the stale workers and pool master from the VM command line (as of 2021, this only tends to happen once or twice a year on prod). If `top` and/or other resource-usage diagnostic tools indicate that workers in an old pool are up but not doing anything (possibly in some zombie state?), you may have to kill those stale workers manually; a sketch of that process follows.
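A minimal sketch of that manual cleanup, assuming the stale processes show up in `ps` output with "resque" in their command lines (the PID below is a placeholder):

```sh
# See which resque-pool masters and workers are running on this VM.
# (The [r] trick keeps the grep process itself out of the results.)
ps -ef | grep [r]esque

# For a worker confirmed to be stale (its old master is gone and it has no
# work in progress), ask it to exit first, and escalate only if it ignores that.
kill -s TERM 12345    # placeholder PID of the stale worker
sleep 30
kill -s KILL 12345    # last resort; its Redis entry may then need pruning (see below)
```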
Old pools will usually shut down gracefully, but as of 2021, we'd expect a hiccup in restart every... few weeks maybe?
Finally, there is a very unusual situation where the resque-pool instances are all running correctly, and `ps` across the worker VMs shows the expected counts of worker and resque-pool master processes with no stale worker processes... but the worker count is still too high according to the Resque web console and okcomputer/Nagios. In this case, it's possible that resque-pool has stale worker information cached in Redis. You can remedy this by pulling up a Rails console on any VM for the pres cat instance in question and running:
```ruby
Resque.workers.map(&:prune_dead_workers)
```
This situation can arise when a pool is killed very abruptly, as when a VM is rebooted while work is in progress without first stopping the pool.
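If you want to cross-check what Resque has registered in Redis directly, something like the following works. This is a sketch that assumes the default "resque" key namespace and a Redis instance reachable from the VM; adjust the connection details and namespace to match the environment's configuration:

```sh
# How many workers Resque thinks are registered, and which ones.
# Each set member looks like hostname:pid:queue1,queue2 -- stale entries
# name PIDs that are no longer running on the named host.
redis-cli scard resque:workers
redis-cli smembers resque:workers
```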
For more context, see:
- Replication errors
- Validate moab step fails during preservationIngestWF
- ZipmakerJob failures
- Moab Audit Failures
- Ceph Errors
- Job queues
- Deposit bag was missing
- ActiveRecord and Replication intro
- 2018 Work Cycle Documentation
- Fixing a stuck Moab
- Adding a new cloud provider
- Audits (how to run as needed)
- Extracting segmented zipfiles
- AWS credentials, S3 configuration
- Zip Creation
- Storage Migration Additional Information
- Useful ActiveRecord queries
- IO against Ceph backed preservation storage is hanging indefinitely (steps to address IO problems, and follow on cleanup)