revert migration of htcondor release role #855
Conversation
condor_rm on the maintenance node does not work. Needs further investigation.
@mira-miracoli, @sj213 Do you already have any experience with this?
From the SchedLog I can see that your Condor tries to access sn05; I think the old IP for condor-cm.galaxyproject.eu is still cached. Maybe you can try to flush the DNS cache.
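A minimal sketch of how one might check and flush the stale entry, assuming the node uses systemd-resolved or nscd; the actual mechanism depends on the host's resolver setup:

```sh
# Check what the node currently resolves for the CM:
getent hosts condor-cm.galaxyproject.eu

# Flush the local DNS cache, depending on which resolver/cache is in use:
sudo resolvectl flush-caches          # systemd-resolved (newer systemd)
# sudo systemd-resolve --flush-caches # older systemd versions
# sudo systemctl restart nscd         # if nscd is doing the caching
```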
Thank you! I once tried to run it with
@sj213 should know, I think. He has one other submit node configured already.
The permissions on the headnode are fine, I think. But as long as the Sched is not actually connecting to it but to sn05, we cannot know for sure.
Let's keep this on sn06 then for the time being?
It's currently on sn06 (deployed manually).
Using I don't know if giving the host write privilege is sufficient or if admin access is needed; in the spirit of the principle of least privilege I'd try it with write access only first. Also note that according to
so make sure that the command is issued by user #999.
Actually, if admin access were generally required for removal of jobs from the queue, users would not be able to remove their own jobs, so admin access should only be required in the case of an owner ID mismatch. So maybe just becoming user
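A hedged sketch of what "becoming the submitting user" could look like on the maintenance node; the job ID is a placeholder, and the assumption (from the comment above) is that the galaxy account maps to UID 999:

```sh
# Confirm the local galaxy account really is UID 999 (assumption from above):
id -u galaxy

# Remove a job from the remote schedd as that user; 123456.0 is a placeholder:
sudo -u galaxy condor_rm -name sn06.galaxyproject.eu 123456.0
```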
It would appear that Condor does not trust remote user IDs w/o strong authentication. Which is the right thing to do, of course, but it means we have to dig deeper into the issue of authentication. (Which we'll have to do anyway when transitioning to Condor >= v9.x.)
@sj213 since we have migrated to the latest version, do you think this might be possible now?
Token authentication for the admin level access should work now in principle, but I'm not sure about the gory details. AFAIK we're ATM using two levels of privilege, one associated with the user
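A sketch of what admin-level token authentication could look like with HTCondor >= 9; the identity, lifetime, and file names below are assumptions for illustration, not our actual configuration:

```sh
# On the central manager (which holds the pool signing key), mint a token that
# is limited to ADMINISTRATOR authorization for a hypothetical identity:
condor_token_create -identity maintenance@galaxyproject.eu \
                    -authz ADMINISTRATOR \
                    -lifetime 2592000 \
                    -token maintenance_admin

# Copy the resulting file to the maintenance node, e.g. into
# ~/.condor/tokens.d/ (per user) or /etc/condor/tokens.d/ (system-wide),
# then verify what the client would present:
condor_token_list
```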
Can
Testing exactly those things. I'm trying to see what all would work. Currently, I am trying to filter by the scheduler name |
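For reference, one way such filtering by scheduler could look from the maintenance node (the host name is taken from this thread, the rest is illustrative):

```sh
# List held jobs (JobStatus == 5) known to the sn06 schedd:
condor_q -name sn06.galaxyproject.eu -allusers \
         -constraint 'JobStatus == 5' \
         -af ClusterId ProcId Owner HoldReason
```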
As I understand it, we will then also have two Condor job ID counters for the separate schedulers sn06 and sn07.
I hope this is not true. It's the same cluster, so why would the same cluster have multiple identical IDs? Would the JWDs, logs, etc. potentially clash?
This is by design in Condor: each submit host has its very own queue along with its own set of job IDs; the CM tells them apart by internally qualifying job IDs with the name of the submitter host. (Actually, job IDs are "cluster IDs" in Condor parlance, as each job is potentially a whole "cluster" of subjobs. "Cluster" here does not refer to the compute hardware but to a software abstraction that groups a collection of (sub-)jobs into one scheduling unit, go figure... To minimize confusion, I'll continue to refer to Condor "cluster IDs" as "job IDs".) You can check this for yourself with

Whether or not this is a problem for Galaxy in a setup with multiple frontend nodes, each using a Condor backend, depends on how Galaxy uses the job scheduler's IDs internally. Does it somehow embed the job IDs in workflow IDs or other internal monikers? That would be potentially problematic, but I don't think Galaxy does this. In any case, this should be looked at in detail before putting a second headnode into production.
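One way to observe the per-schedd qualification (a sketch; host names follow the examples in this thread) is via the GlobalJobId attribute, which embeds the schedd name:

```sh
# GlobalJobId has the form <schedd-name>#<cluster>.<proc>#<timestamp>, so the
# same cluster ID on sn06 and sn07 remains distinguishable pool-wide:
condor_q -global -allusers -af GlobalJobId Owner
# e.g. sn06.galaxyproject.eu#123456.0#1700000000
#      sn07.galaxyproject.eu#123456.0#1700000001
```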
This question actually goes beyond the Condor job ID issue. It is of course of vital importance that the dataset and workflow IDs generated by both headnodes do not clash, regardless of which compute backend is used; and similar concerns arise wrt the data store and the DB. The real question is in fact whether two instances of Galaxy can actually use the same PG database R/W (and also JWD / data store) without wreaking havoc on data integrity, i.e. whether Galaxy is designed for concurrent usage of resources by multiple instances. For if it isn't, separate instances of the PG database and JWDs would be required for each frontend node, making the setup more involved.
condor_rm on the maintenance node does not work as the jobs are submitted by a different scheduler (sn06.galaxyproject.eu). At the moment there were more than 150 jobs in the held state, due to memory issues, that were supposed to be removed. To fix this for now I have manually added the script and enabled the cronjob (galaxy user) on sn06 and disabled it on the maintenance node.

I tried:
- the condor_rm command, but encountered authentication errors and issues
- adding more verbosity

Will investigate this further next week, see how to achieve the expected outcome, and migrate this role with a proper fix.
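For the record, a hedged sketch of the cleanup the cronjob is meant to perform, combining the pieces discussed above; whether this works from the maintenance node still hinges on the authentication issue in this thread:

```sh
# Remove all held jobs (JobStatus == 5) from the sn06 schedd as the galaxy user.
# Narrowing to memory-related holds would additionally require checking which
# HoldReasonCode/HoldReason values this pool sets for memory limit violations.
sudo -u galaxy condor_rm -name sn06.galaxyproject.eu -constraint 'JobStatus == 5'
```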